Composable Linked Digraphs: Yet another NFA Data Structure
When it comes to implementing finite state automata, picking a data structure with which to model the machine is an interesting problem. On the one hand, the choice is obvious: finite state machines, be they deterministic or nondeterministic, can be viewed as directed graphs (digraphs). Each state in the FA is a vertex in the graph, and each transition from one state to the next is modeled as a directed edge. On the other hand, certain requirements of NFAs - such as supporting multiple edge types - make the "normal" (i.e. vertex-centric) digraph data structures found in most DSA books somewhat awkward to use.
In my previous post on Thompson's construction I used what I will refer to from here on out as the "adjacency list approach", because it employs just such a simple adjacency-list graph representation as the underlying data structure for the automaton. While it has the benefit of being simple to implement, the adjacency list approach also has a pretty significant downside: each construction step requires merging the adjacency lists of the NFAs being combined. For large patterns this can become a significant bottleneck, as each merge is an O(N) operation, where N is the number of vertices to be merged from the smaller graph into the larger.
typedef int State;

class Edge {
    private:
        State from;
        State to;
    public:
        Edge(State s, State t) : from(s), to(t) { }
        State getFrom() const { return from; }
        State getTo() const { return to; }
};

class NFA {
    private:
        State start;
        State accept;
        unordered_map<State, set<Edge*>> states;
    public:
        NFA() {
            start = 0;
            accept = 0;
        }
        /* addEdge(), hasEdge(), etc. */
};
It is possible to work around this by "flipping" the abstraction and placing the indices of the start/accept vertices on a stack instead of the actual NFAs, allowing the use of a single graph instance with no copying. This is the approach taken in Robert Sedgewick's book "Algorithms". While this offers better performance than explicitly merging adjacency lists, it makes the code more complicated and thus harder to understand. In today's post I'm going to introduce an alternative data structure which brings the O(N) cost of merging NFAs down to O(1), without having to flip the abstraction inside out during construction.
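For comparison, here is a minimal sketch of that "flipped" approach, assuming a single shared digraph and a stack of (start, accept) index pairs. The FlatNFA name and the exact layout are my own shorthand, not Sedgewick's:

```cpp
#include <stack>
#include <utility>
#include <vector>

// One shared digraph for the whole pattern; an NFA "fragment" is just a
// (start, accept) pair of indices kept on a stack, so no adjacency lists
// are ever copied or merged.
struct FlatNFA {
    std::vector<std::vector<int>> eps;      // epsilon edges per state
    std::stack<std::pair<int,int>> frags;   // (start, accept) of fragments

    int addState() {
        eps.emplace_back();
        return (int)eps.size() - 1;
    }

    // Concatenate the two fragments on top of the stack: O(1), one new
    // edge, zero new states.
    void concat() {
        std::pair<int,int> rhs = frags.top(); frags.pop();
        std::pair<int,int> lhs = frags.top(); frags.pop();
        eps[lhs.second].push_back(rhs.first);   // lhs accept -> rhs start
        frags.push({lhs.first, rhs.second});
    }
};
```

The drawback alluded to above is visible even in this sketch: every operation reaches into a shared table through index pairs rather than working with self-describing NFA values.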
Linked Digraphs
While both the adjacency list approach and the Linked Digraph approach model nondeterministic finite automata as directed graphs, they operate at different levels of abstraction. In the adjacency list representation, states/vertices were a simple typedef for an integer, used as an index into a table. Linked Digraphs use explicit NFAState structures, each of which maintains its own adjacency list in the form of a collection of transitions to other NFAState objects.
struct NFAState;

struct Transition {
    char ch;
    bool is_epsilon;
    NFAState* dest;
    Transition(NFAState* d) : ch('&'), is_epsilon(true), dest(d) { }
    Transition(char c, NFAState* d) : ch(c), is_epsilon(false), dest(d) { }
    Transition() : ch('\0'), is_epsilon(false), dest(nullptr) { }
};

struct NFAState {
    State label;
    vector<Transition> transitions;
    NFAState(State st) : label(st) { }
    void addTransition(Transition t) {
        transitions.push_back(t);
    }
};

struct NFA {
    NFAState* start;
    NFAState* accept;
    NFA(NFAState* s = nullptr, NFAState* a = nullptr) : start(s), accept(a) { }
};
The NFAState objects form a linked-list-like structure embodying the NFA's directed graph. The top level Linked Digraph object only needs to maintain two pointers: one to the start state and another to the accepting state.
In these examples I'm using raw pointers to keep the code uncluttered, because this isn't a post on memory management. However, if you are going to use this implementation, I would highly recommend using some type of smart pointer such as std::shared_ptr, or some other garbage collection scheme, as destroying such a deeply cross-linked structure can become very complicated.
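As a rough illustration, a shared_ptr variant of the structures above might look like the following. The Sp* names are hypothetical, and note one caveat: the cycle created by the Kleene closure's back edge would still leak under pure reference counting, so loop edges would need to be std::weak_ptr or broken by hand before teardown.

```cpp
#include <memory>
#include <vector>

struct SpState;  // forward declaration; shared_ptr accepts incomplete types

// Same shape as Transition above, but holding a shared reference to its
// destination, so states stay alive as long as anything can reach them.
struct SpTransition {
    char ch;
    bool is_epsilon;
    std::shared_ptr<SpState> dest;
    explicit SpTransition(std::shared_ptr<SpState> d)
        : ch('&'), is_epsilon(true), dest(std::move(d)) { }
    SpTransition(char c, std::shared_ptr<SpState> d)
        : ch(c), is_epsilon(false), dest(std::move(d)) { }
};

struct SpState {
    int label;
    std::vector<SpTransition> transitions;
    explicit SpState(int l) : label(l) { }
};

// Top-level handle: when the last SpNFA referring to a subgraph goes out of
// scope, the acyclic parts of that subgraph are reclaimed automatically.
struct SpNFA {
    std::shared_ptr<SpState> start;
    std::shared_ptr<SpState> accept;
};
```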
Implementing Thompson's Construction with Linked Digraphs
Since NFAs are recursively defined, we will begin by laying out the procedure for constructing our simplest NFAs: the single character and empty string automata, represented by figure (a) in the picture below. Using these "base case" automata, we will then construct the operator automata with the assistance of epsilon transitions. These machines are pictured below as concatenation (b), alternation (c), and closure (d).
Finally, we can combine these larger NFAs to create an NFA capable of recognizing any regular expression built from the provided operators, by performing a depth first traversal over the abstract syntax tree of a provided regular expression.
NFA Building Blocks
Starting with our "base case" NFAs: the single character and empty string automata each consist of two states and a transition connecting them. Unlike a DFA, an NFA uses two types of transitions: character transitions and epsilon transitions. Character transitions are the same as DFA transitions in that they "consume" a character of input while changing the internal state of the machine. Epsilon transitions are unique to NFAs, as they change the internal state of the NFA without consuming input. It is thanks to epsilon transitions that Thompson's construction is possible.
int nextLabel() {
    static int label = 0;
    return label++;
}

NFA makeAtomic(char ch) {
    NFA nfa(new NFAState(nextLabel()), new NFAState(nextLabel()));
    nfa.start->addTransition(Transition(ch, nfa.accept));
    return nfa;
}

// "The empty string"
NFA makeEpsilonAtomic() {
    NFA nfa(new NFAState(nextLabel()), new NFAState(nextLabel()));
    nfa.start->addTransition(Transition(nfa.accept));
    return nfa;
}
Next up is concatenation, (r)(s) -> (rs), which is implemented by creating an epsilon transition from the first automaton's accept state to the second automaton's start state. We then assign the second automaton's accept state to the first automaton, allowing us to discard the second automaton, having effectively "absorbed" all of its states and transitions into the first.
NFA makeConcat(NFA a, NFA b) {
    a.accept->addTransition(Transition(b.start));
    a.accept = b.accept;
    return a;
}
This may seem convoluted at first glance, but it saves us from creating the additional two states and two epsilon transitions that would be required if we implemented concatenation the "easy" way:
NFA badConcatImpl(NFA a, NFA b) {
    NFAState* new_start = new NFAState(nextLabel());
    NFAState* new_end = new NFAState(nextLabel());
    new_start->addTransition(Transition(a.start));
    a.accept->addTransition(Transition(b.start));
    b.accept->addTransition(Transition(new_end));
    return NFA(new_start, new_end);
}
While this second version is more explicit in its construction, making it easier to understand what's happening, careful scrutiny of both should convince you that they are equivalent in function, while the first is more space efficient.
Alternation (r|s) allows us to choose between two possible paths. It takes two NFAs and creates a new start state with two epsilon transitions going from it to each of the NFAs' start states. We then do the same from their accept states to a new accept state, and return these new start and accept states as a new NFA.
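A quick way to convince yourself of the space-efficiency claim is to count allocations. The following self-contained condensation of the structures above (with fresh() as a stand-in for nextLabel()) is just enough to check that makeConcat allocates no new states at all:

```cpp
#include <vector>

// Condensed copies of the post's structures, trimmed to test makeConcat.
struct NFAState;
struct Transition {
    char ch;
    bool is_epsilon;
    NFAState* dest;
    explicit Transition(NFAState* d) : ch('&'), is_epsilon(true), dest(d) { }
    Transition(char c, NFAState* d) : ch(c), is_epsilon(false), dest(d) { }
};
struct NFAState {
    int label;
    std::vector<Transition> transitions;
    explicit NFAState(int l) : label(l) { }
};
struct NFA {
    NFAState* start;
    NFAState* accept;
};

static int labels = 0;                       // stand-in for nextLabel()
NFAState* fresh() { return new NFAState(labels++); }

NFA makeAtomic(char c) {
    NFA n{fresh(), fresh()};
    n.start->transitions.push_back(Transition(c, n.accept));
    return n;
}

// Same as the first makeConcat above: one epsilon edge, zero new states.
NFA makeConcat(NFA a, NFA b) {
    a.accept->transitions.push_back(Transition(b.start));
    a.accept = b.accept;
    return a;
}
```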
NFA makeAlternate(NFA a, NFA b) {
    NFAState* new_start = new NFAState(nextLabel());
    NFAState* new_end = new NFAState(nextLabel());
    new_start->addTransition(Transition(a.start));
    new_start->addTransition(Transition(b.start));
    a.accept->addTransition(Transition(new_end));
    b.accept->addTransition(Transition(new_end));
    return NFA(new_start, new_end);
}
And last but not least, the closure operators (*, +, ?) for repetition. The difference between * and + is the presence of a single epsilon transition: the one from the new start state to the new accept state, which allows the machine to accept zero occurrences when using the '*' operator.
NFA makeKleene(NFA a, bool must_accept) {
    NFAState* new_start = new NFAState(nextLabel());
    NFAState* new_end = new NFAState(nextLabel());
    new_start->addTransition(Transition(a.start));
    a.accept->addTransition(Transition(new_end));
    a.accept->addTransition(Transition(a.start));
    if (!must_accept)  // Kleene * doesn't need to match, Kleene + does.
        new_start->addTransition(Transition(new_end));
    return NFA(new_start, new_end);
}
// ? isn't _really_ a closure, I suppose.
NFA makeZeroOrOne(NFA a) {
    return makeAlternate(a, makeEpsilonAtomic());
}
Stack 'em up and Sew 'em together
Creating the full NFA is still done via a post-order traversal of the regular expression's abstract syntax tree. As a matter of fact, this part of Thompson's construction does not change from my previous implementation at all: the only thing we changed is the data structure representing the graph. If we've designed and abstracted everything properly, that's how it should be. This procedure only interacts with the top level NFA objects. All the actual NFAState manipulation takes place in the procedures we covered above.
class Compiler {
    private:
        Stack<NFA> st;
        NFA nfa;
        void trav(astnode* node) {
            if (node != nullptr) {
                if (node->type == LITERAL) {
                    cout<<"Character State: "<<node->c<<endl;
                    st.push(makeAtomic(node->c));
                } else if (node->type == CHCLASS) {
                    st.push(makeCharClass(node->ccl));
                } else {
                    cout<<"Building: "<<node->c<<endl;
                    switch (node->c) {
                        case '|': {
                            trav(node->left);
                            trav(node->right);
                            NFA rhs = st.pop();
                            NFA lhs = st.pop();
                            st.push(makeAlternate(lhs, rhs));
                        } break;
                        case '@': {
                            trav(node->left);
                            trav(node->right);
                            NFA rhs = st.pop();
                            NFA lhs = st.pop();
                            st.push(makeConcat(lhs, rhs));
                        } break;
                        case '*': {
                            trav(node->left);
                            NFA lhs = st.pop();
                            st.push(makeKleene(lhs, false));
                        } break;
                        case '+': {
                            trav(node->left);
                            NFA lhs = st.pop();
                            st.push(makeKleene(lhs, true));
                        } break;
                        case '?': {
                            trav(node->left);
                            NFA lhs = st.pop();
                            st.push(makeZeroOrOne(lhs));
                        } break;
                        default:
                            break;
                    }
                }
            }
        }
    public:
        Compiler() { }
        NFA compile(astnode* node) {
            trav(node);
            return st.pop();
        }
};
NFA compile(string pattern) {
    Parser parser;
    astnode* ast = parser.parse(pattern);
    print(ast, 1);
    Compiler compiler;
    return compiler.compile(ast);
}
Traversing the AST, we determine which of the above machines should be constructed based on whether the current node is an internal node or a leaf node. Leaf nodes represent character literals, and possibly the empty string. Internal nodes of the AST are for the operators. Post-order traversal is used so that the "work" is put off until the stack unwinds, guaranteeing that operators have readily available operands to stitch together. At each step we create the appropriate sub-machine and place it on a stack for use in the next machine to be constructed. When all is said and done there will be a single NFA sitting on the stack, which is our fully constructed automaton.