Implementing Thompsons Construction Bottom-Up: Building NFA from Regular Expressions

mgcadmin 15-10-2024

Regular expressions and NFA have been an active area of study in theoretical computer science since the late 1950's. As such, there is a large body of work exploring a multitude of ways to evaluate regular expressions for the purpose of pattern matching. Thompson's Construction is a method for the building of Non-Deterministic Finite Automata(NFA) from Regular Expreessions(RE). It is famously the algorithm outlined in the "dragon book", amongst many other notable texts for building pattern matching NFA[1].

In previous posts I have discussed one such implementation presented by Robert Sedgewick in his book "Algorithms" (4th ed). As I've said before, when it comes to data structures and algorithms, you can never have too many choices. With this in mind I've decided to explore some other implementations for evaluating regular expressions.

Thhompsons construction is less an algorithm and more a high-level idea of how to construct an NFA for a regular expression by creating individual NFA's for each of its sub-expressions and then "wiring" them together into larger NFA's until we have constructed an NFA for the entire expression. The method I describe below proceeds bottom-up by building an abstract syntax tree of the regular expression, which is then traversed post-order to build the NFA's and connect the sub expressions.

Non-Deterministic Finite Automata

Non-Deterministic Finite Automata are Mathematical models of computation. To put it simply though, an NFA is a set of states, with one state being designated as the starting state, and another state being designated as the accepting state. Each state has an associated (possibly empty) set of transitions from it's self to other states.

I'm sure you have quickly realized that what I just described is a directed graph. All NFA's (and DFA's for that matter) can be modeled as directed graphs. In addition to the directed graph, every NFA also has an alphabet of symbols that represent the language recognized by that particular NFA.

typedef int State;

class Edge {
    private:
        State from;
        State to;
    public:
        Edge(State s, State t) : from(s), to(t) { }
        State getFrom() const { return from; }
        State getTo() const { return to; }
        virtual RegExToken getLabel() = 0;
        virtual bool matches(char c) = 0;
        virtual bool positionIs(int index) = 0;
        virtual bool isEpsilon() = 0;
        virtual ~Edge() { }
};

class NFA {
    private:
        State start;
        State accept;
        unordered_map<State, set<Edge*>> states;
    public:
        NFA() {
            start = 0;
            accept = 0;
        }
        NFA(const NFA& nfa) {
            start = nfa.start;
            accept = nfa.accept;
            for (auto & m : nfa.states) {
                makeState(m.first);
                for (auto & t : m.second) {
                    addTransition(t);
                }
            }
        }
        void makeState(State name) {
            if (states.find(name) == states.end()) {
                states.insert(make_pair(name, set<Edge*>()));
            }
        }
        void setStart(State ss) {
            start = ss;
        }
        void setAccept(State as) {
            accept = as;
        }
        State getStart() {
            return start;
        }
        State getAccept() {
            return accept;
        }
        void addTransition(Edge* t) {
            if (states.at(t->getFrom()).find(t) == states.at(t->getFrom()).end())
                states[t->getFrom()].insert(t);
        }
        int size() {
            return states.size();
        }
        auto getStates() const {
            return states;
        }
        set<Edge*>& getTransitions(State state) {
            return states[state];
        }
        NFA& operator=(const NFA& nfa) {
            if (this != &nfa) {
                start = nfa.start;
                accept = nfa.accept;
                for (auto m : nfa.states) {
                    makeState(m.first);
                    for (auto t : m.second) {
                        addTransition(t);
                    }
                }
            }
            return *this;
        }
};

Sticking with directed graph model, a transition is an edge connecting 2 states and an edge {f, t} where f is the from state, t is the to state. Not all transitions in an NFA behave the same way. The edge can be either a character labeld transition, which consumes input when traversed or an Epsilon transition, which while still causing a change of state when traversed, does not consume any input. This last fact is key as it is what allows us to "wire up" the NFA's as we go using epsilon transitions.

class CharEdge : public Edge {
    private:
        RegExToken label;
        bool checkInRange(char c) {
            char lo, hi;
            bool is_good = false;
            for (int i = 1; i < label.charachters.size()-1; i++) {
                if (i+1 < label.charachters.size() && label.charachters[i] == '-') {
                    lo = label.charachters[i-1];
                    hi = label.charachters[i+1];
                    if (hi < lo) {
                        char tmp = hi;
                        hi = lo;
                        lo = tmp;
                    }
                    if (c >= lo && c <= hi) {
                        is_good = true;
                        break;
                    }
                }
            }
            return is_good;
        }
    public:
        CharEdge(State From, State To, RegExToken c) : Edge(From, To) {
            label = c;
        }
        ~CharEdge() {

        }
        bool isEpsilon() {
            return false;
        }
        bool positionIs(int i) {
            return true;
        }
        bool matches(char c) {
            if (label.symbol == RE_SPECIFIEDSET) {
                for (char m : label.charachters) {
                    if (c == m)
                        return true;
                }
                return false;
            } else if (label.symbol == RE_SPECIFIEDRANGE) {
                return checkInRange(c);
            }
            return label.charachters[0] == c;
        }
        RegExToken getLabel() {
            return label;
        }
};

class EpsilonEdge : public Edge {
    public:
        EpsilonEdge(State From, State To) : Edge(From, To) { }
        ~EpsilonEdge() { }
        bool matches(char c) {
            return true;
        }
        bool positionIs(int i) {
            return true;
        }
        bool isEpsilon() {
            return true;
        }
        RegExToken getLabel() {
            return RegExToken(RE_NONE, "&");
        }
};

bool operator<(const Edge& s, const Edge& t) {
    return s.getFrom() < t.getFrom() && s.getTo() < t.getTo();
}

bool operator==(const Edge& s, const Edge& t) {
    return s.getFrom() == t.getFrom() && s.getTo() == t.getTo();
}

bool operator!=(const Edge& s, const Edge& t) {
    return !(s == t);
}

To build our NFA, we populate the NFA with states, and add the correct type of transitions to each source-states edge list. Each NFA has a start and accept state which must set before being used. Once our NFA is built, the match() method is used to see if the NFA built from the supplied regular expression recognizes a provided input string. As can be seen from the example below however, building even the most trivial of NFA's by hand can be tedious, and thus is prone to error. As such, being able to automate this process is important.

void testNFA() {
    NFA nfa;
    nfa.makeState(0); nfa.makeState(1); nfa.makeState(2); nfa.makeState(3);
    nfa.setStart(0);
    nfa.setAccept(3);
    nfa.addTransition(new CharEdge(0, 1, 'a')); //start -> a
    nfa.addTransition(new CharEdge(1, 2,'b')); // a -> b
    nfa.addTransition(new CharEdge(2, 2,'b')); // b -> b
    nfa.addTransition(new EpsilonEdge(2, nfa.getAccept())); // b -> accept
    cout<<nfa.match("abb")<<endl;
    cout<<nfa.match("aab")<<endl;
    cout<<nfa.match("ab")<<endl;
    cout<<nfa.match("a")<<endl;
}

The solution to automating the constuction is to convert a Regular expression string to its corresponding abstract syntax tree, and then build our NFA automatically from the tree. We'll return to implementing match() once we have our NFA constructed from an AST, so lets get to it.

Building an AST from a Regular Expression

Parsing the Expression

Parsing of regular expressions begins by converting the parenthesized in-fix expression into it's non-parenthesized postfix representation. To accomplish this I've used the shunting yard algorithm introduced by Edsgar Dijkstra, and which i've covered in a previous post. This algorithm constructs the AST from the bottom up. An alternative method is to construct the AST using recursive descent, which i will not be discussing.

class Parser {
       private:
           bool isOp(char c);
           int precedence(char c);
           int leftAssociative(char c);
           string addConcatOp(string& str);
           string makePostfix(string& str);
           RegularExpression* makeTree(string& postfix);
      public:
          Parser();
          RegularExpression* parse();
};

To implement the shunting yard algorithm for regular expressions, we need to know both the precedence, and associativity of the various operators. A quick glance to chapter 3 of the dragon book by Aho, Sethi, and Ullman[1] tells us that all of the operators are left associative. Well, at least thats simple.

bool leftAssociative(char c) {
    switch (c) {
        case '*':
        case '+':
        case '?':
        case '|':
        case '@':
            return true;
        default:
            break;
    }
    return false;
}

When it comes to precedence, Closure's have the highest precedence, followed by concatenation and then alternation.

int precedence(char c) {
    switch (c) {
        case '*': return 50;
        case '+': return 50;
        case '?': return 50;
        case '@': return 30;
        case '|': return 20;
        default:
            break;
    }
    return 10;
}

Making Implicit Operators Explicit

In regular expressions the concat operator is only implicitly present. That is, we know that two symbols next to each other are concatenated, but there is no visible operator symbol like there is for closures and alternation. This presents us with a slight problem when it comes to implementing the shunting yard algorithm: It's awfully hard to reposition an operator from infix to postfix when there is nothing to denote it's presence.

string addConcatOp(string str) {
    string fixed;
    for (int i = 0; i < str.length(); i++) {
        fixed.push_back(str[i]);
        if (str[i] == '(' || str[i] == '|')
            continue;
        if (i+1 < str.length()) {
            char p = str[i+1];
            if (p == '|' || p == '*' || p == '+' || p == ')')
                continue;
            fixed.push_back('@');
        }
    }
    return fixed;
}

To remedy this problem, we will pre-process the infix expression string. Before we can transform the expression to postfix form we first explicitly insert an '@' symbol to represent the conc@enation operator into the expression string. After processing, a string which used to be represented as "word" will now appear as "w@o@r@d".

Converting The Expression To Postfix

With our explicit concatenation operators in place the rest of the shunting yard algoritm proceeds the same as it would for an algebraic expression.

string makePostfix(string& str) {
    str = addConcatOp(str);
    cout<<"Inserting explicit concat operators: "<<str<<endl;
    string postfix;                                                                                                   
    for (int i = 0; i < str.length(); i++) {
        if (str[i] == '(') {
            ops.push(str[i]);
        } else if (str[i] == '|' || str[i] == '*' || str[i] == '+' || str[i] == '@' || str[i] == '?') {
                if (precedence(str[i]) < precedence(ops.top()) || (precedence(str[i]) == precedence(ops.top()) && leftAssociative(str[i]))) {
                    char c = ops.pop();
                    postfix.push_back(c);
                    ops.push(str[i]);
                } else {
                    ops.push(str[i]);
                }
        } else if (str[i] == ')') {
            while (!ops.empty()) {
                char c = ops.pop();
                if (c == '(')
                    break;
                else postfix.push_back(c);
            }
        } else {
            postfix.push_back(str[i]);
        }
    }
    while (!ops.empty()) {
        char c = ops.pop();
        if (c != '(')
            postfix.push_back(c);
    }
    return postfix;
}

Building The AST

Armed with our post-fix expression, building an AST is fairly straight forward. To represent the AST i've implemented a class heirarchy which derives from the base class RegularExpression. The ExpressionLiteral class will be used to represent the literal values as the leaf nodes and the ExpressionOperator class for representing operators as the internal nodes of the AST.

class RegularExpression {
    public:
        virtual char getSymbol() = 0;
        virtual RegularExpression* getLeft() = 0;
        virtual RegularExpression* getRight() = 0;
};

class ExpressionLiteral : public RegularExpression {
    private:
        char symbol;
    public:
        ExpressionLiteral(char sym) { symbol = sym; }
        RegularExpression* getLeft() {
            return nullptr;
        }
        RegularExpression* getRight() {
            return nullptr;
        }
        char getSymbol() {
            return symbol;
        }
};

class ExpressionOperator : public RegularExpression {
    private:
        RegularExpression* left;
        RegularExpression* right;
        char symbol;
    public:
        ExpressionOperator(char c, RegularExpression* ll, RegularExpression* rr) {
            symbol = c;
            left = ll;
            right = rr;
        }
        char getSymbol() {
            return symbol;
        }
        RegularExpression* getLeft() {
            return left;
        }
        RegularExpression* getRight() {
            return right;
        }
};

The actual construction of the AST is performed by iterating over the expression string and when a non-operator symbol is encountered an ExpressionLiteral node is created which is then pushed onto a stack of temporary trees.

Operators are handled similarly except when an operator node is created, previously created trees are popped from the stack and set as the operator nodes left and right children respectively, with the new tree with our operator node as the root being pushed onto the temporary tree stack.

RegularExpression* makeTree(string postfix) {
    Stack<RegularExpression*> sf;
    for (char c : postfix) {
        cout<<"Processing: "<<c<<" ";
        if (!isOp(c)) {
            cout<<"Alphabet."<<endl;
            sf.push(new ExpressionLiteral(c));
        } else {
            cout<<"Operator."<<endl;
            auto right = sf.empty() ? nullptr:sf.pop();
            auto left = sf.empty() ? nullptr:sf.pop();
            sf.push(new ExpressionOperator(c, left, right));
        }
    }
    return sf.pop();
}

When all of the input has been consumed, the stack should contain one item: a pointer to the root of our expression tree. With our AST constructed, we can now swing back around to our NFA, and start thinking about how we can automate it's construction from our AST.

Constructing The NFA

As mentioned above, thompsons construction builds an NFA for a regular expression by building automata for its sub expressions and combining them to create the final automaton. Each operator has a distinct automata, as do character transitions, and even empty expressions. Lets take a look at the individual automata we're going to be building.

There are a few operations performed during the construction of each NFA that will take place regardless of the automata we construct and as such make sense to be abstracted out in to helper functions. One of these tasks, is the creation of a new NFA with a start and end state, as well as the generation of state names.

        int label = 0;
        int makeStateLabel() {
            return label++;
        }
        void initNextNFA(NFA& nnfa) {
            int nstart = makeStateLabel();
            nnfa.makeState(nstart);
            nnfa.setStart(nstart);
            int nend = makeStateLabel();
            nnfa.makeState(nend);
            nnfa.setAccept(nend);
        }

It should also be mentioned that when I say we are "wiring NFA's together" I'm not implying that we are some how linking NFA objects together - we're not. Instead, we merge NFA's by copying their transitions into a new NFA and throwing the old one away:

       void copyTransitions(NFA& nnfa, NFA onfa) {
            for (auto state : onfa.getStates()) {
                nnfa.makeState(state.first);
                for (auto trans : onfa.getTransitions(state.first)) {
                    nnfa.addTransition(trans);
                }
            }
        }

At each step of the construction, new NFA are created from previously constructed NFA until we arrive at the completed, desired automaton.

Atomic NFA For Symbols and E-transitions

Ok, let's start putting some NFA's together. The simplest of them are the so-called "atomic" NFA's: the empty expression, and the single character expression.

NFA singleTransitionNFA(RegExToken c) {
       NFA nfa;
       initNextNFA(nfa);
       if (c.charachters.empty())
           nfa.addTransition(new EpsilonEdge(nfa.getStart(), nfa.getAccept()));
       else
           nfa.addTransition(new CharEdge(nfa.getStart(), nfa.getAccept(), c)); 
       return nfa;
}

Infact, an empty expression and a character expression are the same automata, the difference being the type of edge the transition has. An empty Expression is simply a start state, an ending state, and an Epsilon transition connecting them.

NFA emptyExpr() {
       RegExToken c; 
       return singleTransitionNFA(c);
}

With that definition in mind, then the single character automata is a starting state connecting to an accepting state, with a character labeled edge connecting them.

NFA atomicNFA(RegExToken c) {
       return singleTransitionNFA(c);
}

The Concatenation Operator

Next up is concatenation. To implement the concatenation automaton, we take the two automata to be joined, and create an epsilon transition from the first automata's accepting state to the second automata's starting state. Finally, we set the first automata's accepting state to be the second automata's accepting state. See what I mean by "wiring together" the automata?

NFA concatNFA(NFA first, NFA second) {
       NFA nnfa;
       copyTransitions(nnfa, first);
       copyTransitions(nnfa, second);
       nnfa.setStart(first.getStart());
       nnfa.setAccept(second.getAccept());
       nnfa.addTransition(new EpsilonEdge(first.getAccept(), second.getStart()));
       return nnfa;
}

The Alternation Operator

And now we get to the Or operator. Take a moment to think about how we've implemented the previous types of automata and see if you can come up with how to wire this one.

We begin by creating a new starting state. From the new starting state we create two epsilon transitions which connect the new start state, to the start states of our two choices. We then create a new accept state and connect the accept state of our respective automata with the new one with epsilon changes.

NFA alternateNFA(NFA first, NFA second) {
       //create new NFA with start and end state.
       NFA nnfa;
       initNextNFA(nnfa);
       //copy states and transitions from both NFA's to new NFA
       copyTransitions(nnfa, first);
       copyTransitions(nnfa, second);
       //Add new E-transitions from new start state to first and second NFAs
       nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), first.getStart()));
       nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), second.getStart()));
       //Add new E-transitions from first and second accept state to new accept state.
       nnfa.addTransition(new EpsilonEdge(first.getAccept(), nnfa.getAccept()));
       nnfa.addTransition(new EpsilonEdge(second.getAccept(), nnfa.getAccept()));
       return nnfa;
}

The Closure Operators

Finally, we arrive at our closure operators: *, ?, and +. The Kleene star (*) is for matching a character 0 or more times, while the + closure matches those that occur at least once but possibly more times, and as such the two NFA's are constructed in a very similar fashion.

The ? operator is for matching a character either 0 or 1 times and is more closely related to the or operator. Actually, it IS an or operator, with the choices being a single character transition, or an epsilon transition.

NFA kleeneNFA(NFA torepeat, bool mustMatch) {
       NFA nnfa;
       initNextNFA(nnfa);
       copyTransitions(nnfa, torepeat);
       nnfa.addTransition(new EpsilonEdge(torepeat.getAccept(), nnfa.getStart()));
       nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), torepeat.getStart()));
       nnfa.addTransition(new EpsilonEdge(torepeat.getAccept(), nnfa.getAccept()));
       if (!mustMatch)
           nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), nnfa.getAccept()));
       return nnfa;
}

NFA zeroOrOnce(NFA onfa) {
       NFA nnfa;
       initNextNFA(nnfa);
       copyTransitions(nnfa, onfa);
       //wire in match choice
       nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), onfa.getStart()));
       nnfa.addTransition(new EpsilonEdge(onfa.getAccept(), nnfa.getAccept()));
       //wire in epsilon choice.
       nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), nnfa.getAccept()));
       return nnfa;
}

NFA repeatNTimes(NFA a, int N) {
      Stack<NFA> sf;
      for (int i = 0; i < N; i++) {
           NFA tnfa;
           initNextNFA(tnfa);
           copyTransitions(tnfa, a);
           tnfa.addTransition(new EpsilonEdge(tnfa.getStart(), a.getStart()));
           tnfa.addTransition(new EpsilonEdge(a.getAccept(), tnfa.getAccept()));
           sf.push(tnfa);
       }
       NFA fnfa;
       initNextNFA(fnfa);
       State prev = fnfa.getStart();
       while (!sf.empty()) {
            NFA tmp = sf.pop();
            copyTransitions(fnfa, tmp);
            fnfa.addTransition(new EpsilonEdge(prev, tmp.getStart()));
            prev = tmp.getAccept();
       }
       fnfa.addTransition(new EpsilonEdge(prev, fnfa.getAccept()));
       return fnfa;
}

All that remains to do is take our AST and output the fully constructed NFA! Using the above procedures, this is a simple task. All it takes is a post-order traversal of our tree with a stack to hold our intermediate subexpressions.

 class NFACompiler {
     private: 
        Stack<NFA> nfaStack;
        NFA buildOperatorNFA(RegularExpression* ast) {
            switch (ast->getSymbol()) {
                case '@': {
                    NFA b = nfaStack.pop();
                    NFA a = nfaStack.pop();
                    return concatNFA(a, b);
                }
                break;
                case '|': {
                    NFA b = nfaStack.pop();
                    NFA a = nfaStack.pop();
                    return alternateNFA(a, b);
                }
                case '*': {
                    NFA a = nfaStack.pop();
                    return kleeneNFA(a, false);
                }
                case '+': {
                    NFA a = nfaStack.pop();
                    return kleeneNFA(a, true);
                }
                case '?': {
                    NFA a = nfaStack.pop();
                     return zeroOrOnce(a);
                }
                default:
                    break;
            }
            return NFA();
        }
        void gen_nfa(RegularExpression* ast) {
            if (ast != nullptr) {
                gen_nfa(ast->getLeft());
                gen_nfa(ast->getRight());
                if (!isOp(ast->getSymbol())) {
                    nfaStack.push(atomicNFA(ast->getSymbol()));
                } else {
                    nfaStack.push(buildOperatorNFA(ast));
                }
            }
        }
    public:
         NFACompiler() { }
         NFA compile(string pattern) {
            Parser parser;
            string re = "(" + pattern + ")";
            RegularExpression* ast = parser.parse(re);
            gen_nfa(ast);
            return nfaStack.pop();
        }
};

Finally, we're ready to simulate the NFA, so lets get back to talking about impelementing match().

Simulating the NFA for pattern matching

One of the ways we can implement match() is via an iterative Depth First Search, unfortunately the straight depth first search approach is susceptible to the same catastrophic backtracking that plagues Perl, PCRE, Java, and many other popular regex engines. It will also need some modification before we can use it on NFA's with closure's as It has the potential of getting stuck in loops of epsilon transitions. The fix for that problem is easy at least.

     struct Node {
          int strPos;
          State state;
          unordered_set<Transition> epsHistory;
          Node(int sp, State s, unordered_set<Transition> t) : strPos(sp), state(s), epsHistory(t) { }
     };
     bool matchBT(string text) {
            unordered_set<Transition> epsHistory;
            Queue<Node> sf;
            sf.push(Node(0, start, epsHistory));
            while (!sf.empty()) {
                int strPos = sf.top().strPos;
                State currState = sf.top().state;
                epsHistory = sf.top().epsHistory;
                sf.pop();
                char input = text[strPos];
                if (accept == currState) {
                    return true;
                }
                for (Transition t : states[currState]) {
                    if ((t.edge->matches(input) || t.edge->matches('.')) || t.edge->isEpsilon()) {
                        if (t.edge->isEpsilon()) { 
                            if (epsHistory.find(t) != epsHistory.end()) {
                                continue;
                            }
                            epsHistory.insert(t);
                            sf.push(Node(strPos, t.to, epsHistory));
                        } else {
                            epsHistory.clear();
                            sf.push(Node(strPos + 1, t.to, epsHistory));
                        }
                    }
                    cout<<endl;
                }
            }
            return false;
        }

To avoid getting lost in loops of Epsilon transitions, we simply cache the epsilon transitions we've already crossed, and if we encounter them again we skip them. We must be careful to clear this cache however whenever we cross a non-Epsilon transition.

Allright, let's see it in action:

#include <iostream>
#include "compiler.hpp"
using std::cout;
using std::endl;

int main(int argc, char* argv[]) {
    NFACompiler compiler;
    NFA nfa = compiler.compile("(a*b|ac)d");
    cout<<"---------------------"<<endl;
    cout<<nfa.matchBT("aaaaaabd")<<endl;
    return 0;
}

Parsing: ((a*b|ac)d)
Inserting explicit concat operators: ((a*@b|a@c)@d)
Postfix: a*b@ac@|d@
---------------------
Attempting to match: aaaaabd, Start state: 10, Accept state: 13
State: 10, Input: a
10-(&)->6

10-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: a
1-(&)->3

1-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: a
1-(&)->3

1-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: a
1-(&)->3

1-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: a
1-(&)->3

1-(&)->2

State: 2, Input: a
2-(&)->3

2-(&)->0

State: 0, Input: a
0-(a)->1

State: 1, Input: b
1-(&)->3

1-(&)->2

State: 2, Input: b
2-(&)->3

2-(&)->0

State: 0, Input: b
Dead end.

State: 3, Input: b
3-(&)->4

State: 4, Input: b
4-(b)->5

State: 5, Input: d
5-(&)->11

State: 11, Input: d
11-(&)->12

State: 12, Input: d
12-(d)->13

State: 13, Input:
Found Accept State.
1

As you can see, it works as advertised, and if we were so inclined, we could call it quits here. But if you're anything like me then that whole "catastrophic backtracking" thing is probably sitting in the corner of your mind waiting to rain on your parade. In that case, allow me to tell you about a slightly more complicated way of simulating the NFA - one that doesnt require backtracking.

A Better Way

As I previously mentioned, a straight depth first search isnt the only way to simulate the NFA - it's just the conceptually simplest. A faster method is the one originally described by Thompson and used in awk and grep. That algorithm, called "Thompsons Algorithm" (noticing a pattern here?) uses power set construction to simulate the NFA, And if you squint just right, you might just notice what looks like a depth first search hidden in there....

class RegExPatternMatcher {
    private:
        NFA nfa;
        // Gathers a list of states reachable from those in 
        // currStates which have transition that consume ch
        set<State> move(set<State> currStates, char ch) {
            set<State> nextStates;
            if (loud) cout<<ch<<": "<<endl;
            for (State s : currStates) {
                for (Edge* t : nfa.getTransitions(s)) {
                    if (t->matches(ch) || t->matches('.')) {
                        if (t->isEpsilon() == false && nextStates.find(t->getTo()) == nextStates.end()) {
                            if (loud) printEdge(*t);
                            nextStates.insert(t->getTo());
                        } 
                    }
                }
            }
            return nextStates;
        }
        //An interesting adaptation of Depth First Search.
        //Gathers a list of states reachable from those in 
        //currStates by using _only_ epsilon transitions.
        set<State> e_closure(set<State> currStates) {
            set<State> nextStates = currStates;
            Stack<State> sf;
            for (State s : currStates)
                sf.push(s);
            while (!sf.empty()) {
                State s = sf.pop();
                for (Edge* t : nfa.getTransitions(s)) {
                    if (t->isEpsilon()) {
                        if (nextStates.find(t->getTo()) == nextStates.end()) {
                            if (loud) printEdge(*t);
                            nextStates.insert(t->getTo());
                            sf.push(t->getTo());
                        }
                    }
                }
            }
            return nextStates;
        }
        int spos;
        string inputText;
        char nextChar() {
            return inputText[spos++];
        }
        char init(const string& text) {
            inputText = text;
            spos = 0;
            return nextChar();
        }
        bool loud;
        const static char eOf = '\0';
    public:
        RegExPatternMatcher(const NFA& fa, bool trace = false) : nfa(fa), loud(trace) {

        }
        RegExPatternMatcher() {
            loud = false;
        }
        void setNFA(NFA& fa) {
            nfa = fa;
        }
        bool match(StringBuffer& sb) {
            set<State> curr = e_closure({nfa.getStart()});
            while (!sb.done()) {
                curr = e_closure(move(curr, sb.get()));
                sb.advance();
            }
            return curr.find(nfa.getAccept()) != curr.end();
        }
};

struct MatchRE {
    NFACompiler compiler;
    bool operator()(StringBuffer& text, string pattern, bool trace) {
        compiler.setTrace(trace);
        return RegExPatternMatcher(compiler.compile(pattern), trace).printNFA().match(text);
    }
    bool operator()(string text, string pattern, bool trace) {
        compiler.setTrace(trace);
        StringBuffer buff(text);
        return RegExPatternMatcher(compiler.compile(pattern), trace).printNFA().match(buff);
    }
};

So how does thompsons algorithm work? It begins by initializing the state list with all of the states reachable from the starting state by traversing only epsilon transitions - thats what the e_closure() method does. Next, for each character in the string we are searching, we find the transitions on that character which are present in our current state list - this is the move() procedure. Next, we gather all of states reachable via epsilon transitions from those states. This repeats for every character in the string until we have consumed all of the input. If, when there is no more input to process, the accept state is in our state list, then the pattern matches.

Instead of pursuing one path to its terminus and then backtracking on failure, this algorithm has the effect of running all of the paths in parallel, doing the search step wise from each state so it never needs to back track, as at each step it can determine the correct transition to take (or fail early if there is none). As far as algorithms go, I find this one to be particularly beautiful.

#include <iostream>
#include "compiler.hpp"
using std::cout;
using std::endl;

int main(int argc, char* argv[]) {
    NFACompiler compiler;
    NFA nfa = compiler.compile("(a*b|ac)d");
    cout<<"---------------------"<<endl;
    if (nfa.match("aaaaaabd")) cout<<"Matched."<<endl;
    return 0;
}

Parsing: ((a*b|ac)d)
Inserting explicit concat operators: ((a*@b|a@c)@d)
Postfix: a*b@ac@|d@
Processing: a Alphabet.
Processing: * Operator.
Processing: b Alphabet.
Processing: @ Operator.
Processing: a Alphabet.
Processing: c Alphabet.
Processing: @ Operator.
Processing: | Operator.
Processing: d Alphabet.
Processing: @ Operator.
        *
          a
      @
        b
    |
        a
      @
        c
  @
    d
---------------------
Init:
	10 - (&) ->2
	10 - (&) ->6
	2 - (&) ->0
	2 - (&) ->3
	3 - (&) ->4
a: 
	0 - (a) ->1
	6 - (a) ->7
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
	7 - (&) ->8
a: 
	0 - (a) ->1
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
a: 
	0 - (a) ->1
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
a: 
	0 - (a) ->1
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
a: 
	0 - (a) ->1
	1 - (&) ->2
	1 - (&) ->3
	3 - (&) ->4
	2 - (&) ->0
b: 
	4 - (b) ->5
	5 - (&) ->11
	11 - (&) ->12
d: 
	12 - (d) ->13
Matched.

Oh, and the set of edges thompsons algorithm collects while simulating the NFA? It just so happens to be the same set of edges you'd need to convert your NDFA to a DFA, but that is a subject for another day.

Well, thats what I got. Until next time, Happy Hacking!

MaxGCoding.com

For the love of programming

Implementing Thompsons Construction Bottom-Up: Building NFA from Regular Expressions

Non-Deterministic Finite Automata

Building an AST from a Regular Expression

Parsing the Expression

Making Implicit Operators Explicit

Converting The Expression To Postfix

Building The AST

Constructing The NFA

Atomic NFA For Symbols and E-transitions

The Concatenation Operator

The Alternation Operator

The Closure Operators

Simulating the NFA for pattern matching

A Better Way

Further Reading:

Leave A Comment

The Aho, Sethi, Ullman Direct DFA Construction, Part 1: Constructing the Followpos Table

Procedural Map Generation with Binary Space Partitioning

Exact Match String Searching: The Knuth-Morris-Pratt Algorithm

The visitor pattern: OOP takes on the Expression Problem

Improving mgcLisp's define syntax

Separating the Men from the Boys, Knuth Style

Reducing rotations during deletion from AVL trees

Parsing Lisp: From Data To Code

Swiz Heaps: A deterministic modification of Randomized Meldable Heaps

Let's talk Eval/Apply