Implementing Thompsons Construction Bottom-Up: Building NFA from Regular Expressions
Regular expressions and NFA have been an active area of study in theoretical computer science since the late 1950's. As such, there is a large body of work exploring a multitude of ways to evaluate regular expressions for the purpose of pattern matching. Thompson's Construction is a method for the building of Non-Deterministic Finite Automata(NFA) from Regular Expreessions(RE). It is famously the algorithm outlined in the "dragon book", amongst many other notable texts for building pattern matching NFA[1].
In previous posts I have discussed one such implementation presented by Robert Sedgewick in his book "Algorithms" (4th ed). As I've said before, when it comes to data structures and algorithms, you can never have too many choices. With this in mind I've decided to explore some other implementations for evaluating regular expressions.
Non-Deterministic Finite Automata
typedef int State;
class Edge {
private:
State from;
State to;
public:
Edge(State s, State t) : from(s), to(t) { }
State getFrom() const { return from; }
State getTo() const { return to; }
virtual RegExToken getLabel() = 0;
virtual bool matches(char c) = 0;
virtual bool positionIs(int index) = 0;
virtual bool isEpsilon() = 0;
virtual ~Edge() { }
};
class NFA {
private:
State start;
State accept;
unordered_map<State, set<Edge*>> states;
public:
NFA() {
start = 0;
accept = 0;
}
NFA(const NFA& nfa) {
start = nfa.start;
accept = nfa.accept;
for (auto & m : nfa.states) {
makeState(m.first);
for (auto & t : m.second) {
addTransition(t);
}
}
}
void makeState(State name) {
if (states.find(name) == states.end()) {
states.insert(make_pair(name, set<Edge*>()));
}
}
void setStart(State ss) {
start = ss;
}
void setAccept(State as) {
accept = as;
}
State getStart() {
return start;
}
State getAccept() {
return accept;
}
void addTransition(Edge* t) {
if (states.at(t->getFrom()).find(t) == states.at(t->getFrom()).end())
states[t->getFrom()].insert(t);
}
int size() {
return states.size();
}
auto getStates() const {
return states;
}
set<Edge*>& getTransitions(State state) {
return states[state];
}
NFA& operator=(const NFA& nfa) {
if (this != &nfa) {
start = nfa.start;
accept = nfa.accept;
for (auto m : nfa.states) {
makeState(m.first);
for (auto t : m.second) {
addTransition(t);
}
}
}
return *this;
}
};
class CharEdge : public Edge {
private:
RegExToken label;
bool checkInRange(char c) {
char lo, hi;
bool is_good = false;
for (int i = 1; i < label.charachters.size()-1; i++) {
if (i+1 < label.charachters.size() && label.charachters[i] == '-') {
lo = label.charachters[i-1];
hi = label.charachters[i+1];
if (hi < lo) {
char tmp = hi;
hi = lo;
lo = tmp;
}
if (c >= lo && c <= hi) {
is_good = true;
break;
}
}
}
return is_good;
}
public:
CharEdge(State From, State To, RegExToken c) : Edge(From, To) {
label = c;
}
~CharEdge() {
}
bool isEpsilon() {
return false;
}
bool positionIs(int i) {
return true;
}
bool matches(char c) {
if (label.symbol == RE_SPECIFIEDSET) {
for (char m : label.charachters) {
if (c == m)
return true;
}
return false;
} else if (label.symbol == RE_SPECIFIEDRANGE) {
return checkInRange(c);
}
return label.charachters[0] == c;
}
RegExToken getLabel() {
return label;
}
};
class EpsilonEdge : public Edge {
public:
EpsilonEdge(State From, State To) : Edge(From, To) { }
~EpsilonEdge() { }
bool matches(char c) {
return true;
}
bool positionIs(int i) {
return true;
}
bool isEpsilon() {
return true;
}
RegExToken getLabel() {
return RegExToken(RE_NONE, "&");
}
};
bool operator<(const Edge& s, const Edge& t) {
return s.getFrom() < t.getFrom() && s.getTo() < t.getTo();
}
bool operator==(const Edge& s, const Edge& t) {
return s.getFrom() == t.getFrom() && s.getTo() == t.getTo();
}
bool operator!=(const Edge& s, const Edge& t) {
return !(s == t);
}
void testNFA() {
NFA nfa;
nfa.makeState(0); nfa.makeState(1); nfa.makeState(2); nfa.makeState(3);
nfa.setStart(0);
nfa.setAccept(3);
nfa.addTransition(new CharEdge(0, 1, 'a')); //start -> a
nfa.addTransition(new CharEdge(1, 2,'b')); // a -> b
nfa.addTransition(new CharEdge(2, 2,'b')); // b -> b
nfa.addTransition(new EpsilonEdge(2, nfa.getAccept())); // b -> accept
cout<<nfa.match("abb")<<endl;
cout<<nfa.match("aab")<<endl;
cout<<nfa.match("ab")<<endl;
cout<<nfa.match("a")<<endl;
}
The solution to automating the constuction is to convert a Regular expression string to its corresponding abstract syntax tree, and then build our NFA automatically from the tree. We'll return to implementing match() once we have our NFA constructed from an AST, so lets get to it.
Building an AST from a Regular Expression
Parsing the Expression
class Parser {
private:
bool isOp(char c);
int precedence(char c);
int leftAssociative(char c);
string addConcatOp(string& str);
string makePostfix(string& str);
RegularExpression* makeTree(string& postfix);
public:
Parser();
RegularExpression* parse();
};
bool leftAssociative(char c) {
switch (c) {
case '*':
case '+':
case '?':
case '|':
case '@':
return true;
default:
break;
}
return false;
}
When it comes to precedence, Closure's have the highest precedence, followed by concatenation and then alternation.
int precedence(char c) {
switch (c) {
case '*': return 50;
case '+': return 50;
case '?': return 50;
case '@': return 30;
case '|': return 20;
default:
break;
}
return 10;
}
Making Implicit Operators Explicit
string addConcatOp(string str) {
string fixed;
for (int i = 0; i < str.length(); i++) {
fixed.push_back(str[i]);
if (str[i] == '(' || str[i] == '|')
continue;
if (i+1 < str.length()) {
char p = str[i+1];
if (p == '|' || p == '*' || p == '+' || p == ')')
continue;
fixed.push_back('@');
}
}
return fixed;
}
Converting The Expression To Postfix
string makePostfix(string& str) {
str = addConcatOp(str);
cout<<"Inserting explicit concat operators: "<<str<<endl;
string postfix;
for (int i = 0; i < str.length(); i++) {
if (str[i] == '(') {
ops.push(str[i]);
} else if (str[i] == '|' || str[i] == '*' || str[i] == '+' || str[i] == '@' || str[i] == '?') {
if (precedence(str[i]) < precedence(ops.top()) || (precedence(str[i]) == precedence(ops.top()) && leftAssociative(str[i]))) {
char c = ops.pop();
postfix.push_back(c);
ops.push(str[i]);
} else {
ops.push(str[i]);
}
} else if (str[i] == ')') {
while (!ops.empty()) {
char c = ops.pop();
if (c == '(')
break;
else postfix.push_back(c);
}
} else {
postfix.push_back(str[i]);
}
}
while (!ops.empty()) {
char c = ops.pop();
if (c != '(')
postfix.push_back(c);
}
return postfix;
}
Building The AST
class RegularExpression {
public:
virtual char getSymbol() = 0;
virtual RegularExpression* getLeft() = 0;
virtual RegularExpression* getRight() = 0;
};
class ExpressionLiteral : public RegularExpression {
private:
char symbol;
public:
ExpressionLiteral(char sym) { symbol = sym; }
RegularExpression* getLeft() {
return nullptr;
}
RegularExpression* getRight() {
return nullptr;
}
char getSymbol() {
return symbol;
}
};
class ExpressionOperator : public RegularExpression {
private:
RegularExpression* left;
RegularExpression* right;
char symbol;
public:
ExpressionOperator(char c, RegularExpression* ll, RegularExpression* rr) {
symbol = c;
left = ll;
right = rr;
}
char getSymbol() {
return symbol;
}
RegularExpression* getLeft() {
return left;
}
RegularExpression* getRight() {
return right;
}
};
RegularExpression* makeTree(string postfix) {
Stack<RegularExpression*> sf;
for (char c : postfix) {
cout<<"Processing: "<<c<<" ";
if (!isOp(c)) {
cout<<"Alphabet."<<endl;
sf.push(new ExpressionLiteral(c));
} else {
cout<<"Operator."<<endl;
auto right = sf.empty() ? nullptr:sf.pop();
auto left = sf.empty() ? nullptr:sf.pop();
sf.push(new ExpressionOperator(c, left, right));
}
}
return sf.pop();
}
Constructing The NFA
int label = 0;
int makeStateLabel() {
return label++;
}
void initNextNFA(NFA& nnfa) {
int nstart = makeStateLabel();
nnfa.makeState(nstart);
nnfa.setStart(nstart);
int nend = makeStateLabel();
nnfa.makeState(nend);
nnfa.setAccept(nend);
}
It should also be mentioned that when I say we are "wiring NFA's together" I'm not implying that we are some how linking NFA objects together - we're not. Instead, we merge NFA's by copying their transitions into a new NFA and throwing the old one away:
void copyTransitions(NFA& nnfa, NFA onfa) {
for (auto state : onfa.getStates()) {
nnfa.makeState(state.first);
for (auto trans : onfa.getTransitions(state.first)) {
nnfa.addTransition(trans);
}
}
}
At each step of the construction, new NFA are created from previously constructed NFA until we arrive at the completed, desired automaton.
Atomic NFA For Symbols and E-transitions
NFA singleTransitionNFA(RegExToken c) {
NFA nfa;
initNextNFA(nfa);
if (c.charachters.empty())
nfa.addTransition(new EpsilonEdge(nfa.getStart(), nfa.getAccept()));
else
nfa.addTransition(new CharEdge(nfa.getStart(), nfa.getAccept(), c));
return nfa;
}
NFA emptyExpr() {
RegExToken c;
return singleTransitionNFA(c);
}
NFA atomicNFA(RegExToken c) {
return singleTransitionNFA(c);
}
The Concatenation Operator
NFA concatNFA(NFA first, NFA second) {
NFA nnfa;
copyTransitions(nnfa, first);
copyTransitions(nnfa, second);
nnfa.setStart(first.getStart());
nnfa.setAccept(second.getAccept());
nnfa.addTransition(new EpsilonEdge(first.getAccept(), second.getStart()));
return nnfa;
}
The Alternation Operator
We begin by creating a new starting state. From the new starting state we create two epsilon transitions which connect the new start state, to the start states of our two choices. We then create a new accept state and connect the accept state of our respective automata with the new one with epsilon changes.
NFA alternateNFA(NFA first, NFA second) {
//create new NFA with start and end state.
NFA nnfa;
initNextNFA(nnfa);
//copy states and transitions from both NFA's to new NFA
copyTransitions(nnfa, first);
copyTransitions(nnfa, second);
//Add new E-transitions from new start state to first and second NFAs
nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), first.getStart()));
nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), second.getStart()));
//Add new E-transitions from first and second accept state to new accept state.
nnfa.addTransition(new EpsilonEdge(first.getAccept(), nnfa.getAccept()));
nnfa.addTransition(new EpsilonEdge(second.getAccept(), nnfa.getAccept()));
return nnfa;
}
The Closure Operators
NFA kleeneNFA(NFA torepeat, bool mustMatch) {
NFA nnfa;
initNextNFA(nnfa);
copyTransitions(nnfa, torepeat);
nnfa.addTransition(new EpsilonEdge(torepeat.getAccept(), nnfa.getStart()));
nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), torepeat.getStart()));
nnfa.addTransition(new EpsilonEdge(torepeat.getAccept(), nnfa.getAccept()));
if (!mustMatch)
nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), nnfa.getAccept()));
return nnfa;
}
NFA zeroOrOnce(NFA onfa) {
NFA nnfa;
initNextNFA(nnfa);
copyTransitions(nnfa, onfa);
//wire in match choice
nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), onfa.getStart()));
nnfa.addTransition(new EpsilonEdge(onfa.getAccept(), nnfa.getAccept()));
//wire in epsilon choice.
nnfa.addTransition(new EpsilonEdge(nnfa.getStart(), nnfa.getAccept()));
return nnfa;
}
NFA repeatNTimes(NFA a, int N) {
Stack<NFA> sf;
for (int i = 0; i < N; i++) {
NFA tnfa;
initNextNFA(tnfa);
copyTransitions(tnfa, a);
tnfa.addTransition(new EpsilonEdge(tnfa.getStart(), a.getStart()));
tnfa.addTransition(new EpsilonEdge(a.getAccept(), tnfa.getAccept()));
sf.push(tnfa);
}
NFA fnfa;
initNextNFA(fnfa);
State prev = fnfa.getStart();
while (!sf.empty()) {
NFA tmp = sf.pop();
copyTransitions(fnfa, tmp);
fnfa.addTransition(new EpsilonEdge(prev, tmp.getStart()));
prev = tmp.getAccept();
}
fnfa.addTransition(new EpsilonEdge(prev, fnfa.getAccept()));
return fnfa;
}
All that remains to do is take our AST and output the fully constructed NFA! Using the above procedures, this is a simple task. All it takes is a post-order traversal of our tree with a stack to hold our intermediate subexpressions.
class NFACompiler {
private:
Stack<NFA> nfaStack;
NFA buildOperatorNFA(RegularExpression* ast) {
switch (ast->getSymbol()) {
case '@': {
NFA b = nfaStack.pop();
NFA a = nfaStack.pop();
return concatNFA(a, b);
}
break;
case '|': {
NFA b = nfaStack.pop();
NFA a = nfaStack.pop();
return alternateNFA(a, b);
}
case '*': {
NFA a = nfaStack.pop();
return kleeneNFA(a, false);
}
case '+': {
NFA a = nfaStack.pop();
return kleeneNFA(a, true);
}
case '?': {
NFA a = nfaStack.pop();
return zeroOrOnce(a);
}
default:
break;
}
return NFA();
}
void gen_nfa(RegularExpression* ast) {
if (ast != nullptr) {
gen_nfa(ast->getLeft());
gen_nfa(ast->getRight());
if (!isOp(ast->getSymbol())) {
nfaStack.push(atomicNFA(ast->getSymbol()));
} else {
nfaStack.push(buildOperatorNFA(ast));
}
}
}
public:
NFACompiler() { }
NFA compile(string pattern) {
Parser parser;
string re = "(" + pattern + ")";
RegularExpression* ast = parser.parse(re);
gen_nfa(ast);
return nfaStack.pop();
}
};
Finally, we're ready to simulate the NFA, so lets get back to talking about impelementing match().
Simulating the NFA for pattern matching
struct Node {
int strPos;
State state;
unordered_set<Transition> epsHistory;
Node(int sp, State s, unordered_set<Transition> t) : strPos(sp), state(s), epsHistory(t) { }
};
bool matchBT(string text) {
unordered_set<Transition> epsHistory;
Queue<Node> sf;
sf.push(Node(0, start, epsHistory));
while (!sf.empty()) {
int strPos = sf.top().strPos;
State currState = sf.top().state;
epsHistory = sf.top().epsHistory;
sf.pop();
char input = text[strPos];
if (accept == currState) {
return true;
}
for (Transition t : states[currState]) {
if ((t.edge->matches(input) || t.edge->matches('.')) || t.edge->isEpsilon()) {
if (t.edge->isEpsilon()) {
if (epsHistory.find(t) != epsHistory.end()) {
continue;
}
epsHistory.insert(t);
sf.push(Node(strPos, t.to, epsHistory));
} else {
epsHistory.clear();
sf.push(Node(strPos + 1, t.to, epsHistory));
}
}
cout<<endl;
}
}
return false;
}
To avoid getting lost in loops of Epsilon transitions, we simply cache the epsilon transitions we've already crossed, and if we encounter them again we skip them. We must be careful to clear this cache however whenever we cross a non-Epsilon transition.
Allright, let's see it in action:
#include <iostream>
#include "compiler.hpp"
using std::cout;
using std::endl;
int main(int argc, char* argv[]) {
NFACompiler compiler;
NFA nfa = compiler.compile("(a*b|ac)d");
cout<<"---------------------"<<endl;
cout<<nfa.matchBT("aaaaaabd")<<endl;
return 0;
}
Parsing: ((a*b|ac)d)
Inserting explicit concat operators: ((a*@b|a@c)@d)
Postfix: a*b@ac@|d@
---------------------
Attempting to match: aaaaabd, Start state: 10, Accept state: 13
State: 10, Input: a
10-(&)->6
10-(&)->2
State: 2, Input: a
2-(&)->3
2-(&)->0
State: 0, Input: a
0-(a)->1
State: 1, Input: a
1-(&)->3
1-(&)->2
State: 2, Input: a
2-(&)->3
2-(&)->0
State: 0, Input: a
0-(a)->1
State: 1, Input: a
1-(&)->3
1-(&)->2
State: 2, Input: a
2-(&)->3
2-(&)->0
State: 0, Input: a
0-(a)->1
State: 1, Input: a
1-(&)->3
1-(&)->2
State: 2, Input: a
2-(&)->3
2-(&)->0
State: 0, Input: a
0-(a)->1
State: 1, Input: a
1-(&)->3
1-(&)->2
State: 2, Input: a
2-(&)->3
2-(&)->0
State: 0, Input: a
0-(a)->1
State: 1, Input: b
1-(&)->3
1-(&)->2
State: 2, Input: b
2-(&)->3
2-(&)->0
State: 0, Input: b
Dead end.
State: 3, Input: b
3-(&)->4
State: 4, Input: b
4-(b)->5
State: 5, Input: d
5-(&)->11
State: 11, Input: d
11-(&)->12
State: 12, Input: d
12-(d)->13
State: 13, Input:
Found Accept State.
1
As you can see, it works as advertised, and if we were so inclined, we could call it quits here. But if you're anything like me then that whole "catastrophic backtracking" thing is probably sitting in the corner of your mind waiting to rain on your parade. In that case, allow me to tell you about a slightly more complicated way of simulating the NFA - one that doesnt require backtracking.
A Better Way
As I previously mentioned, a straight depth first search isnt the only way to simulate the NFA - it's just the conceptually simplest. A faster method is the one originally described by Thompson and used in awk and grep. That algorithm, called "Thompsons Algorithm" (noticing a pattern here?) uses power set construction to simulate the NFA, And if you squint just right, you might just notice what looks like a depth first search hidden in there....
class RegExPatternMatcher {
private:
NFA nfa;
// Gathers a list of states reachable from those in
// currStates which have transition that consume ch
set<State> move(set<State> currStates, char ch) {
set<State> nextStates;
if (loud) cout<<ch<<": "<<endl;
for (State s : currStates) {
for (Edge* t : nfa.getTransitions(s)) {
if (t->matches(ch) || t->matches('.')) {
if (t->isEpsilon() == false && nextStates.find(t->getTo()) == nextStates.end()) {
if (loud) printEdge(*t);
nextStates.insert(t->getTo());
}
}
}
}
return nextStates;
}
//An interesting adaptation of Depth First Search.
//Gathers a list of states reachable from those in
//currStates by using _only_ epsilon transitions.
set<State> e_closure(set<State> currStates) {
set<State> nextStates = currStates;
Stack<State> sf;
for (State s : currStates)
sf.push(s);
while (!sf.empty()) {
State s = sf.pop();
for (Edge* t : nfa.getTransitions(s)) {
if (t->isEpsilon()) {
if (nextStates.find(t->getTo()) == nextStates.end()) {
if (loud) printEdge(*t);
nextStates.insert(t->getTo());
sf.push(t->getTo());
}
}
}
}
return nextStates;
}
int spos;
string inputText;
char nextChar() {
return inputText[spos++];
}
char init(const string& text) {
inputText = text;
spos = 0;
return nextChar();
}
bool loud;
const static char eOf = '\0';
public:
RegExPatternMatcher(const NFA& fa, bool trace = false) : nfa(fa), loud(trace) {
}
RegExPatternMatcher() {
loud = false;
}
void setNFA(NFA& fa) {
nfa = fa;
}
bool match(StringBuffer& sb) {
set<State> curr = e_closure({nfa.getStart()});
while (!sb.done()) {
curr = e_closure(move(curr, sb.get()));
sb.advance();
}
return curr.find(nfa.getAccept()) != curr.end();
}
};
struct MatchRE {
NFACompiler compiler;
bool operator()(StringBuffer& text, string pattern, bool trace) {
compiler.setTrace(trace);
return RegExPatternMatcher(compiler.compile(pattern), trace).printNFA().match(text);
}
bool operator()(string text, string pattern, bool trace) {
compiler.setTrace(trace);
StringBuffer buff(text);
return RegExPatternMatcher(compiler.compile(pattern), trace).printNFA().match(buff);
}
};
So how does thompsons algorithm work? It begins by initializing the state list with all of the states reachable from the starting state by traversing only epsilon transitions - thats what the e_closure() method does. Next, for each character in the string we are searching, we find the transitions on that character which are present in our current state list - this is the move() procedure. Next, we gather all of states reachable via epsilon transitions from those states. This repeats for every character in the string until we have consumed all of the input. If, when there is no more input to process, the accept state is in our state list, then the pattern matches.
Instead of pursuing one path to its terminus and then backtracking on failure, this algorithm has the effect of running all of the paths in parallel, doing the search step wise from each state so it never needs to back track, as at each step it can determine the correct transition to take (or fail early if there is none). As far as algorithms go, I find this one to be particularly beautiful.
#include <iostream>
#include "compiler.hpp"
using std::cout;
using std::endl;
int main(int argc, char* argv[]) {
NFACompiler compiler;
NFA nfa = compiler.compile("(a*b|ac)d");
cout<<"---------------------"<<endl;
if (nfa.match("aaaaaabd")) cout<<"Matched."<<endl;
return 0;
}
Parsing: ((a*b|ac)d)
Inserting explicit concat operators: ((a*@b|a@c)@d)
Postfix: a*b@ac@|d@
Processing: a Alphabet.
Processing: * Operator.
Processing: b Alphabet.
Processing: @ Operator.
Processing: a Alphabet.
Processing: c Alphabet.
Processing: @ Operator.
Processing: | Operator.
Processing: d Alphabet.
Processing: @ Operator.
*
a
@
b
|
a
@
c
@
d
---------------------
Init:
10 - (&) ->2
10 - (&) ->6
2 - (&) ->0
2 - (&) ->3
3 - (&) ->4
a:
0 - (a) ->1
6 - (a) ->7
1 - (&) ->2
1 - (&) ->3
3 - (&) ->4
2 - (&) ->0
7 - (&) ->8
a:
0 - (a) ->1
1 - (&) ->2
1 - (&) ->3
3 - (&) ->4
2 - (&) ->0
a:
0 - (a) ->1
1 - (&) ->2
1 - (&) ->3
3 - (&) ->4
2 - (&) ->0
a:
0 - (a) ->1
1 - (&) ->2
1 - (&) ->3
3 - (&) ->4
2 - (&) ->0
a:
0 - (a) ->1
1 - (&) ->2
1 - (&) ->3
3 - (&) ->4
2 - (&) ->0
b:
4 - (b) ->5
5 - (&) ->11
11 - (&) ->12
d:
12 - (d) ->13
Matched.
Oh, and the set of edges thompsons algorithm collects while simulating the NFA? It just so happens to be the same set of edges you'd need to convert your NDFA to a DFA, but that is a subject for another day.
Well, thats what I got. Until next time, Happy Hacking!
Further Reading:
1) Aho, Sethi, Ullman "Compilers: Principles, Techniques, and Tools" Addison-Wesley, 1986
2) Drozdek, Adam "Data Structures And Algorithms In C++", (3rd ed.) Thomson, 2005
3) Sedgewick, R. Wayne, K "Algorithms" (4th ed.), Addison-Wesley, 2011
4) Russ Cox, "Regular Expression Matching Can be Simple and Fast"-
The Aho, Sethi, Ullman Direct DFA Construction, Part 1: Constructing the Followpos Table
-
Procedural Map Generation with Binary Space Partitioning
-
Exact Match String Searching: The Knuth-Morris-Pratt Algorithm
-
The visitor pattern: OOP takes on the Expression Problem
-
Improving mgcLisp's define syntax
-
Separating the Men from the Boys, Knuth Style
-
Reducing rotations during deletion from AVL trees
-
Parsing Lisp: From Data To Code
-
Swiz Heaps: A deterministic modification of Randomized Meldable Heaps
-
Let's talk Eval/Apply
Leave A Comment