A Quick tour of MGCLex

MGCLex is a "Lexer Generator" in the spirit of Flex or ScanGen. Input is in the form of a specification file containing a list of Token-rule pairs, 1 per line. Each pair consists of a regular expression pattern and an identifier for the pattern. MGCLex uses the patterns to compile and a deterministic finite automaton used to recognize those patterns which is provided as output in the form of a C header file.

MGCLex uses the Aho/Sethi/Ullman direct DFA construction algorithm as outlined in the "Dragon Book". The algorithm, notable in that it avoids creating an intermediary NFA, instead producing a DFA from a regular expression's parse tree.

Downloading and Installing MGCLex

The source code for MGCLex is available on Github and can be cloned directly to your system to build and install.

git clone https://github.com/maxgoren/MGCLex

Alternatively you can download the codebase from here: MGCLex.tar.gz

wget http://maxgcoding.com/mgclex.tar.gz
tar -xvzf mgclex.tar.gz

Once you have the code for MGCLex, it can be built using your local 'make' system and GCC:

cd MGCLex
make && sudo make install && make clean

MGCLex is now installed on your system and ready to use.

A Desk Calculator Example

If we want to tokenize basic algebraic expressions we need to be able to recognize numbers, the simple operators ( +, -, *, /), as well as parentheses for parenthesized subexpressions. 

{"[0-9]+(\.)?[0-9]*", TK_NUMBER}
{"\+", TK_PLUS}
{"-", TK_MINUS}
{"\*", TK_MULT}
{"/", TK_DIV}
{"\(", TK_LPAREN}
{"\)", TK_RPAREN}
An MGCLex specification for a desk calculator

In the above example we specified the number token using character classes to match all digits from 0 to 9 at least once, followed by an optional literal period followed by more digits. Alternatively, we could use perl style shortcuts to specify we want to match all of the digits from 0 to 9:

{"\d+(\.\d+)*", TK_NUMBER}

And of course If you only wanted to support integers, we could keep it real simple using either character classes or escaped shortcuts and the "at least once" closure operator:

{"\d+", TK_NUMBER}
{"[0-9]+", TK_NUMBER}

Generating the DFA

Once our specification file is completed, we are ready to generate the DFA. This as simple as running mgclex in the terminal, using our specification file as the first parameter, and (optionally) our desired filename for the out put header file. 

mgclex calc.mlex lexer_dfa.h

If an output filename is not supplied, it defaults to mgclex_matrix.h.

mgoren@~/mgclex/example$ mgclex calc.mlex
[*] Reading MGCLex Spec file...
Pattern: [0-9]+(\.)?[0-9]*
Symbol:  TK_NUMBER
Registered  TK_NUMBER at 1
------------------------
Pattern: \+
Symbol:  TK_PLUS
Registered  TK_PLUS at 2
------------------------
Pattern: -
Symbol:  TK_MINUS
Registered  TK_MINUS at 3
------------------------
Pattern: \*
Symbol:  TK_MULT
Registered  TK_MULT at 4
------------------------
Pattern: /
Symbol:  TK_DIV
Registered  TK_DIV at 5
------------------------
Pattern: \(
Symbol:  TK_LPAREN
Registered  TK_LPAREN at 6
------------------------
Pattern: \)
Symbol:  TK_RPAREN
Registered  TK_RPAREN at 7
------------------------
Read: 7 Rules for 7 Symbols
[*] Initializing...
Done building Combined RE.
[*] Compiling DFA...
[*] Writing matrix and accept states to mgclex_matrix.h
[*] Cleaning up...
[*] Complete!
mgoren@~/mgclex/example$

Testing it out

Matching text  with a DFA is about as simple/fast as you can get when it comes to tokenization. you begin in the start state, and so long as there are transitions coming out of the current state, you take it, or there is no match. Below is the main loop of the nextToken() procedure from example/matrix_lex_ex.c

  int state = START;
  for (char* p = input; *p; *p++) {
        state = matrix[state][*p];
        printf("%d -> ", state);
        if (state > 0 && accept[state] > -1) {
            last_match = state;
            match_len = (p-input)+1;
        }
        if (state < 1) {
            break;
        }
    }

The following output shows the transitions taken to recognize each symbol(each symbol on its own line):

mgoren@~/mgclex/example$ ./ex "(231+57)/19"
7 -> 0 ->
2 -> 2 -> 2 -> 0 ->
3 -> 0 ->
2 -> 2 -> 0 ->
8 -> 0 ->
6 -> 0 ->
2 -> 2 ->

Notice how the lest state of each line is '0' ? That means there are no transitions out of the previous state, so matching can stop. The next bit out out bit, is obviously of more importance, as it is the reason for this entire exercise: a list of the recognized tokens:

<TK_LPAREN, (>
<TK_NUMBER, 231>
<TK_PLUS, +>
<TK_NUMBER, 57>
<TK_RPAREN, )>
<TK_DIV, />
<TK_NUMBER, 19>
mgoren@~/mgclex/example$

As you can see, each token was recognized and identified, and the list of tokens is suitable for use as input to a parser or whatever use you had in mind.

Anyway, that's what I got for you. Until next time, Happy Hacking!


Leave A Comment