Squeezing DFA row's with Pair Compression
It's no secret that the price we pay for using a DFA in the process of lexical analysis is the (potentially) enourmous transition tables which must be managed. There are many ways of representing transition tables. Anyone who has peaked at an uncompressed transition matrix of even a small grammar can tell you that storing them as-is will waste alot of space, For ASCII, each state would need 256 entries, many of them blank. For unicode, well, theres 1.1 million possible unicode characters, so you do the math on that one.
Pair Compressed Rows
A table with pair compressed rows doesnt store empty entries, instead each row becomes, as the name would suggest, a list of pairs with an exception for the first entry, which is a singleton containing the number of pairs in each row.
In this way, a row that in it's uncompressed form is represented like this:
int example_row[] = { 0, 0, 0, 4, 0, 0, 2, 0, 7, 0 };
Is transformed into the following representation:
int compressed_row[] = {3, 3,4, 6,2, 8,7};
The new array has the number of pairs as the first entry, followed by the actual list of pairs of (col, row). Over many rows, each being many columns wide this adds up to substantial savings. The original square 2d matrix instead becomes an array of pointers to different size arrays, which is why we reserve the first entry for storing the length. In languages where arrays store their length implicitly, this can be skipped.
Compressing an Existing Matrix
With the first entry of every row being the entry count, the first step is obtaining said entry count. In my implementation, the error state is 0, so we scan each row for non-zero entries, summing their total.
int entryCount(int matrix[][256], int i) {
int count = 0;
for (int j = 0; j < 255; j++) {
if (matrix[i][j] != 0) {
count++;
}
}
return count;
}
Next, we allocate an array of the correct size for the compressed row, which will be twice the entry count plus one for the entry count itsself.
int** compressMatrix(int matrix[][256], int num_states) {
int** cm = malloc(sizeof(int*)*num_states+1);
for (int i = 1; i <= num_states; i++) {
int ec = entryCount(matrix, i);
if (ec == 0) {
cm[i] = NULL;
} else {
int *row = malloc(sizeof(int)*(2*ec+1));
row[0] = ec;
int epos = 1;
for (int j = 0; j < 256; j++) {
if (matrix[i][j] != 0) {
row[epos++] = i;
row[epos++] = j;
}
}
cm[i] = row;
}
}
return cm;
}
Obtaining the next state in compressed tables
The eagle eyed among you may have noticed that we have essentially turned the matrix representation of a graph into its adjacency list representation, and this is totally accurate. However, one side effect of using this compressed format is that we can no longer obtain the next state with a simple array lookup anymore. Instead, we have to scan our list of pairs. As previously mentioned, the first item in each row is the number of pairs the list contains, the remaining spaces are the pairs themselves.
To find the next state, we start at position one, and go two-by-two down the list of pairs until we find a match or run out of pairs:
int Lexer::get_next(int state, char p) {
if (mgc_lexer_matrix[state]) {
for (int i = 1; i < 2*mgc_lexer_matrix[state][0]; i += 2) {
if (mgc_lexer_matrix[state][i] == p)
return mgc_lexer_matrix[state][i+1];
}
}
return 0;
}
And that's all there is to it! Happy hacking.
-
Squeezing DFA row's with Pair Compression
-
Designing Abstract Syntax Trees: Homogenous vs. Heterogenous Node Structures
-
From LR Items to LR States
-
Calculating Follow Sets of a Context Free Grammar
-
Streaming Images from ESP32-CAM for viewing on a CYD-esp32
-
Constructing the Firsts Set of a Context Free Grammar
-
Inchworm Evaluation, Or Evaluating Prefix-Expressions with a Queue
-
Data Structures For Representing Context Free Grammar
-
A B Tree of Binary Search Trees
-
Implementing enhanced for loops in Bytecode
Leave A Comment