A Quick tour of MGCLex
MGCLex is a "Lexer Generator" in the spirit of Flex or ScanGen. Input is in the form of a specification file containing a list of Token-rule pairs, 1 per line. Each pair consists of a regular expression pattern and an identifier for the pattern. MGCLex uses the patterns to compile and a deterministic finite automaton used to recognize those patterns which is provided as output in the form of a C header file.
MGCLex uses the Aho/Sethi/Ullman direct DFA construction algorithm as outlined in the "Dragon Book". The algorithm, notable in that it avoids creating an intermediary NFA, instead producing a DFA from a regular expression's parse tree.
Downloading and Installing MGCLex
The source code for MGCLex is available on Github and can be cloned directly to your system to build and install.
git clone https://github.com/maxgoren/MGCLex
Alternatively you can download the codebase from here: MGCLex.tar.gz
wget http://maxgcoding.com/mgclex.tar.gz
tar -xvzf mgclex.tar.gz
Once you have the code for MGCLex, it can be built using your local 'make' system and GCC:
cd MGCLex
make && sudo make install && make clean
MGCLex is now installed on your system and ready to use.
A Desk Calculator Example
If we want to tokenize basic algebraic expressions we need to be able to recognize numbers, the simple operators ( +, -, *, /), as well as parentheses for parenthesized subexpressions.
{"[0-9]+(\.)?[0-9]*", TK_NUMBER}
{"\+", TK_PLUS}
{"-", TK_MINUS}
{"\*", TK_MULT}
{"/", TK_DIV}
{"\(", TK_LPAREN}
{"\)", TK_RPAREN}
An MGCLex specification for a desk calculator
In the above example we specified the number token using character classes to match all digits from 0 to 9 at least once, followed by an optional literal period followed by more digits. Alternatively, we could use perl style shortcuts to specify we want to match all of the digits from 0 to 9:
{"\d+(\.\d+)*", TK_NUMBER}
And of course If you only wanted to support integers, we could keep it real simple using either character classes or escaped shortcuts and the "at least once" closure operator:
{"\d+", TK_NUMBER}
{"[0-9]+", TK_NUMBER}
Generating the DFA
Once our specification file is completed, we are ready to generate the DFA. This as simple as running mgclex in the terminal, using our specification file as the first parameter, and (optionally) our desired filename for the out put header file.
mgclex calc.mlex lexer_dfa.h
If an output filename is not supplied, it defaults to mgclex_matrix.h.
mgoren@~/mgclex/example$ mgclex calc.mlex
[*] Reading MGCLex Spec file...
Pattern: [0-9]+(\.)?[0-9]*
Symbol: TK_NUMBER
Registered TK_NUMBER at 1
------------------------
Pattern: \+
Symbol: TK_PLUS
Registered TK_PLUS at 2
------------------------
Pattern: -
Symbol: TK_MINUS
Registered TK_MINUS at 3
------------------------
Pattern: \*
Symbol: TK_MULT
Registered TK_MULT at 4
------------------------
Pattern: /
Symbol: TK_DIV
Registered TK_DIV at 5
------------------------
Pattern: \(
Symbol: TK_LPAREN
Registered TK_LPAREN at 6
------------------------
Pattern: \)
Symbol: TK_RPAREN
Registered TK_RPAREN at 7
------------------------
Read: 7 Rules for 7 Symbols
[*] Initializing...
Done building Combined RE.
[*] Compiling DFA...
[*] Writing matrix and accept states to mgclex_matrix.h
[*] Cleaning up...
[*] Complete!
mgoren@~/mgclex/example$
Testing it out
Matching text with a DFA is about as simple/fast as you can get when it comes to tokenization. you begin in the start state, and so long as there are transitions coming out of the current state, you take it, or there is no match. Below is the main loop of the nextToken() procedure from example/matrix_lex_ex.c
int state = START;
for (char* p = input; *p; *p++) {
state = matrix[state][*p];
printf("%d -> ", state);
if (state > 0 && accept[state] > -1) {
last_match = state;
match_len = (p-input)+1;
}
if (state < 1) {
break;
}
}
The following output shows the transitions taken to recognize each symbol(each symbol on its own line):
mgoren@~/mgclex/example$ ./ex "(231+57)/19"
7 -> 0 ->
2 -> 2 -> 2 -> 0 ->
3 -> 0 ->
2 -> 2 -> 0 ->
8 -> 0 ->
6 -> 0 ->
2 -> 2 ->
Notice how the lest state of each line is '0' ? That means there are no transitions out of the previous state, so matching can stop. The next bit out out bit, is obviously of more importance, as it is the reason for this entire exercise: a list of the recognized tokens:
<TK_LPAREN, (>
<TK_NUMBER, 231>
<TK_PLUS, +>
<TK_NUMBER, 57>
<TK_RPAREN, )>
<TK_DIV, />
<TK_NUMBER, 19>
mgoren@~/mgclex/example$
As you can see, each token was recognized and identified, and the list of tokens is suitable for use as input to a parser or whatever use you had in mind.
Anyway, that's what I got for you. Until next time, Happy Hacking!
-
A Quick tour of MGCLex
-
Compiling Regular Expressions for "The VM Approach"
-
Composable Linked Digraphs: An efficient NFA Data Structure for Thompsons Construction
-
Improving the Space Efficiency of Suffix Arrays
-
Augmenting B+ Trees For Order Statistics
-
Top-Down AST Construction of Regular Expressions with Recursive Descent
-
Balanced Deletion for in-memory B+ Trees
-
Building an AST from a Regular Expression Bottom-up
-
The Aho, Sethi, Ullman Direct DFA Construction Part 2: Building the DFA from the Followpos Table
-
The Aho, Sethi, Ullman Direct DFA Construction, Part 1: Constructing the Followpos Table
Leave A Comment