The Aho, Sethi, Ullman Direct DFA Construction Part 2: Building the DFA from Followpos

In part one of this post I covered building and annotating an abstract syntax tree from a postfix regular expression, as well as populating the firstpos, lastpos, and followpos sets for each node in the AST is it relates to it's position in the regular expression. Having laid that ground work we are now ready for the "main event": Constructing a DFA to recognize our regular expression using the followpos table we constructed earlier.

Building the DFA

initially the only unmarked state in Dstates is firstpos(root),
where root is the root of the syntax tree for the provided RE.
while there is an unmarked state T in Dstates do begin
     mark T;
     for each input symbol a do begin
         let U be the set of positions that are in followpos(p)
                  for some position p in T such that symbol p is a.
         if U is not empty and is not already in Dstates then
                  add U as an unmarked state to Dstates
         Dtran[T, a] := U
     end
end

The actual structure of our DFA will be comprised of the following structures.  Dstates is an array of DFAState structures, with their label field being used to index into the array. The transition table associated to each state is stored as an AVL tree. The choice of using an AVL tree is a design decision explained below, but frequently is implemented as either a 2D array, or some form of compressed matrix (like an array of AVL trees).

typedef struct {
    int label;
    int token_id;
    bool is_accepting;
    Set* positions;
} DFAState;

typedef struct Transition_ {
    char ch;
    int to;
    int height;
    struct Transition_* left;
    struct Transition_* right;
} Transition;

typedef struct {
    DFAState** states;
    Transition** dtrans;
    int numstates;
} DFA;

In addition to the followpos table we created in the previous post, there are two additional inputs to buildDFA: our constructed AST, and a postfix regular expression without explicit concat operators. From this input string, we construct the "alphabet" table: The list of characters belonging to the language the DFA can recognize.

int symbolIsInAlphabet(char* str, int n, char c) {
    for (int i = 0; i < n; i++) {
        if (str[i] == c)
            return i;
    }
    return -1;
}

void initAlphabet(char* alphabet, char* re)
     for (int i = 0; re[i]; i++)
        if (re[i] != '#' && (is_char(re[i]) || is_digit(re[i]) || is_special(re[i])) && symbolIsInAlphabet(alphabet, k, re[i]) == -1)
            alphabet[k++] = re[i];
}

Once the alphabet has been established, we begin by adding the start state to the DFA. The start state is the collection of positions at firstpos(root). After adding the start state to the DFA, we place it on a queue, as during construction of the DFA in a states are procesessed in First In First Out ordering.

void buildDFA(re_ast* ast, char* re)
    DFA dfa;
    StateQueue fq;
    initAlphabet(alphabet, re);
    initDFA(&dfa,numleaves+1);
    addState(&dfa, createState(nextStateNum(&dfa), copySet(ast->firstpos)));
    initQueue(&fq);
    enQueue(&fq,  dfa.states[1]);
    while (!emptyQueue(&fq)) {
        DFAState* curr_state = deQueue(&fq);
        for (char *input_symbol = alphabet; *input_symbol; input_symbol++) {
            Set* next_states = calculateNextStatesPositions(curr_state, *input_symbol);

With the start state initialized and added to the queue we start the main loop of the algorithm. We pop the next state from the queue and while iterating over the alphabet we check to see if it is possible to make a transition from the current state to the next state using calculateNextStatesPositions().

 All that work in constructing the followpos table now bears fruit as we have all the information we need to construct the next state. Take note how we take care to adjust the logic based on if we are in a character class or not, as well as checking if the current position is a wildcard symbol.

Set* calculateNextStatesPositions(DFAState* curr_state, char input_symbol) {
    Set* next_states = createSet(nonleaves+1);
    for (int i = 0; i < curr_state->positions->n; i++) {
        int t = curr_state->positions->members[i];
        if ((ast_node_table[t]->token.symbol == RE_CCL && findInCharClass(ast_node_table[t]->token.ccl, input_symbol)) || 
            (ast_node_table[t]->token.symbol != RE_CCL && (ast_node_table[t]->token.ch == input_symbol || ast_node_table[t]->token.symbol == RE_PERIOD))) {
            next_states = setUnion(next_states, ast_node_table[t]->followpos);
        }
    }
    return next_states;
}

States are representitive of a Set of positions. Inorder to compare states, we compare sets of positions. This is one area where changing the set representation from a sorted int set to say, a bitset, would yield _huge_ speed ups.

int findStateByPositions(DFA* dfa, Set* next_states) {
    for (int i = 1; i <= dfa->numstates; i++) {
        if (setsEqual(next_states, dfa->states[i]->positions)) {
            return i;
        }
    }
    return -1;
}

If next_states is returned empty, we simply free the set and try the next input character. If it does return a set with positions we have to check the current existing states in the DFA to see if it is a new state or not. If it is a new state, we add it to the DFA, and place it on the queue to be processed, adding a transition for the current alphabet character in the process. If the state already exists, we simply add the transition and move on.

            if (!isSetEmpty(next_states)) {
                if ((found = findStateByPositions(&dfa, next_states)) > -1) {
                    dfa.dtrans[curr_state->label] = addTransition(dfa.dtrans[curr_state->label], dfa.states[found]->label, *input_symbol);
                    freeSet(next_states);
                } else {
                    DFAState* new_state = createState(nextStateNum(&dfa), copySet(next_states)); 
                    addState(&dfa, new_state);
                    dfa.dtrans[curr_state->label] = addTransition(dfa.dtrans[curr_state->label], new_state->label, *input_symbol);
                    enQueue(&fq, new_state);
                }
            } else {
                free(next_states);
            }
        }
    }
    for (int i = 1; i <= dfa.numstates; i++) {
        dfa.states[i] = markAcceptState(dfa.states[i]);
    }
    return dfa;
}

Once all of the states and transitions have been added to the DFA we iterate through the states, marking any accepting states as such along the way. Because I plan to use my DFA as a lexer, It also assigns a token id to an accepting state, preferring lower token_ids if two tokens share an accept state, this allows us to choose tokens by priority - essential for proper lexing. 

DFAState* markAcceptState(DFAState* next_state) {
    for (int j = 0; j < next_state->positions->n; j++) {
        int pos = next_state->positions->members[j];
        if (ast_node_table[pos]->token.ch == '#')  {      
            next_state->is_accepting = true; 
            int leaf_id = ast_node_table[pos]->tk_token_id;
            if (next_state->token_id == -1 || leaf_id < next_state->token_id) {
                next_state->token_id = leaf_id;
            }
        }
    }
    return next_state;
}

And with that, our DFA has been constructed. Now lets put it to use!

Pattern Matching With DFA

Now for the fun part, using the constructed to DFA to recognize patterns in text described as regular expressions. Using a DFA for this purpose is delightfully simple - especially when you compare the same operation performed on an NFA. It is truly a night and day difference.

bool simulateDFA(DFA dfa, char* text) {
    DFAState* state = dfa.states[1];
    for (char *sp = text; *sp != '\0'; sp++) {
        printf("Current State: %d, Input Symbol: %c\n", state->label, *sp);
        DFAState* next = NULL;
        Transition* it = findTransition(dfa.dtrans[state->label], *sp);
        if (it != NULL) {
            if (*sp == it->ch || ast_node_table[state->label]->token.symbol == RE_PERIOD) {
                next = dfa.states[it->to];
            }
        } else if (ast_node_table[state->label]->token.symbol == RE_PERIOD) {
            it = findTransition(dfa.dtrans[state->label], '.');
            if (it != NULL)
                next = dfa.states[it->to];
        }  
        if (!next) {
            printf("Out of transitions, No match.\n");
            return false;
        }
        state = next;
    }
    printf("Final State: %d\n", state->label);
    return state->is_accepting;
}

We start the current state set as our DFA's start state (1) and iterate over the input string, checking if a transition exists from the current state on the current input character. If a transition exists, we take it. If there is no transition to take, then the text doesn't match the pattern. Finally, should we consume all of our input, and upon termination the last state we transitioned to is also an accepting state, then we have found a match. Simple, efficient, no backtracking.

bool matchDFA(char* re, char *text) {
    re = augmentRE(re);
    re_ast* ast = re2ast(re);
    DFA dfa = re2dfa(re, ast);
    bool ans = simulateDFA(dfa, text);
    cleanup(&dfa, ast);
    return ans;
}

Well, that wraps up our adventure with crafting a DFA from a regular expression. Until next time, Happy Hacking!


Leave A Comment