Data Structures For Representing Context Free Grammar
A proper grammar is defined as a tuple of sets, mainly the set of Terminal symbols, the set of Non-Terminal symbols, and the set of productions for deriving the language specified by the grammar. In addition, there are functions/sets derived from the grammar which is needed for generating recognizers/parsers for said grammar. Depending on what type/what it is you want to do with your grammar, you may also want to derive the "First" Set, the "Follow" set as these are needed for creating both LL and LR based parsers.
struct Grammar {
set<Symbol> terminals;
set<Symbol> nonterminals;
map<Symbol, ProductionSet> productions;
map<Symbol, set<Symbol>> firsts;
map<Symbol, set<Symbol>> follow;
map<ParseState, SymbolString> predict;
Grammar() {
}
bool isNonTerminal(Symbol s) {
return nonterminals.find(s) != nonterminals.end();
}
bool isTerminal(Symbol s) {
return terminals.find(s) != terminals.end();
}
bool isNullable(Symbol nt) {
for (auto prod : productions[nt]) {
if (prod.rhs[0] == EPS)
return true;
}
return false;
}
};
Todays post is inspired by the development of mgcpgen, a parser generator designed to work with mgclex, my lexer generator. In it I'm going to discuss some the data structures used by mgcpgen in order to go from an LL(1) grammar specification, to a working parser outputing abstract syntax trees. Lets do it.
Symbolic Computation Requires Symbols
Not to state the obvious, but to perform symbolic computation, we will in fact need a way to represent those symbols. Common choices include enumerated types, integers, characters, and of course - strings. Using a typedef allows us to easily change our representation should we decide to switch.
using Symbol = string;
Representing Strings of Symbols
Just as character strings ase comprised of lists of characters, so SymbolStrings are made up of lists of symbols. SymbolStrings contain the terminal and non terminal symbols which make up a multi-symbol production.
struct SymbolString : vector<Symbol> {
SymbolString(vector<Symbol> ss) {
for (auto m : ss) {
this->push_back(m);
}
}
SymbolString() {
}
SymbolString(const SymbolString& ss) {
for (auto m : ss) {
this->push_back(m);
}
}
SymbolString subString(int startIndex) {
SymbolString result;
for (int i = startIndex; i < this->size(); i++)
result.push_back((*this)[i]);
return result;
}
SymbolString& operator=(const SymbolString& ss) {
if (this != &ss) {
for (auto m : ss) {
this->push_back(m);
}
}
return *this;
}
string toString() {
string str = "";
for (Symbol s : *this) {
str += s + " ";
}
return str;
}
bool operator==(const SymbolString& ss) {
auto oit = ss.begin();
auto it = this->begin();
while (oit != ss.end() && it != this->end()) {
if (*oit != *it)
return false;
}
return (oit == ss.end() && it == this->end());
}
bool operator!=(const SymbolString& ss) {
return !(*this == ss);
}
};
Productions and Sets of Productions
Productions are the "rules" which get applied to the associated non-terminal.
struct Production {
int pid; //for executing Actions
Symbol lhs;
SymbolString rhs;
Production(int id, Symbol l, SymbolString r) : pid(id), lhs(l), rhs(r) { }
Production() { }
string toString() {
return lhs + " ::= " + rhs.toString();
}
bool operator==(const Production& op) {
return lhs == op.lhs && rhs == op.rhs;
}
bool operator!=(const Production& op) {
return !(*this==op);
}
bool operator<(const Production& op) const {
return this->lhs < op.lhs;
}
};
struct ProductionSet : vector<Production> {
ProductionSet(vector<Production> rhs) {
for (Production ss : rhs) {
push_back(ss);
}
}
ProductionSet() {
}
};
A Grammar With Expressions & Statements
As an example of pulling everything together into a full grammar, the follow demonstrates how the above data structures are combined to form our grammar. As one can imagine, automating the construction of the Grammar data structure from a specification file is highly encouraged.
Grammar G;
G.nonterminals = { "stmt-seq", "stmt-seqT", "stmt", "print-stmt", "print-stmtT", "expr-stmt", "expr", "exprT", "term", "termT", "factor", "primary"};
G.terminals = { "TK_ID", "TK_NUM", "TK_LPAREN", "TK_RPAREN", "TK_PRINT", "TK_PLUS", "TK_MUL","TK_MINUS", "TK_DIV", "TK_SEMI", "$$", EPS };
G.productions["stmt-seq"] = ProductionSet({Production(1,"stmt-seq",SymbolString({"stmt", "stmt-seqT"}))});
G.productions["stmt-seqT"] = ProductionSet({Production(2,"stmt-seqT",SymbolString({ "TK_SEMI", "stmt-seq"})),
Production(3,"stmt-seqT",SymbolString({EPS}))});
G.productions["stmt"] = ProductionSet({Production(4,"stmt",SymbolString({"expr-stmt"})),
Production(5,"stmt",SymbolString({"print-stmt"})),
Production(22, "stmt", SymbolString({EPS}))});
G.productions["print-stmt"] = ProductionSet({Production(6,"print-stmt",SymbolString({"TK_PRINT", "print-stmtT"}))});
G.productions["print-stmtT"] = ProductionSet({Production(7,"print-stmtT",SymbolString({"expr"}))});
G.productions["expr-stmt"] = ProductionSet({Production(8,"expr-stmt",SymbolString({"expr"}))});
G.productions["expr"] = ProductionSet({Production(9,"expr",SymbolString({"term", "exprT"}))});
G.productions["exprT"] = ProductionSet({Production(10,"exprT",SymbolString({"TK_PLUS", "term", "exprT"})),
Production(11,"exprT",SymbolString({"TK_MINUS", "term", "exprT"})),
Production(12,"exprT",SymbolString({EPS}))});
G.productions["term"] = ProductionSet({Production(13,"term",SymbolString({"factor","termT"}))});
G.productions["termT"] = ProductionSet({Production(14,"termT",SymbolString({"TK_MUL", "factor", "termT"})),
Production(15,"termT",SymbolString({"TK_DIV", "factor", "termT"})),
Production(16,"termT",SymbolString({EPS}))});
G.productions["factor"] = ProductionSet({Production(17, "factor", SymbolString({"TK_MINUS", "factor"})),
Production(18, "factor", SymbolString({"primary"}))});
G.productions["primary"] = ProductionSet({Production(19,"primary",SymbolString({"TK_NUM"})),
Production(20,"primary",SymbolString({"TK_ID"})),
Production(21,"primary",SymbolString({"TK_LPAREN", "expr", "TK_RPAREN"}))});
-
Data Structures For Representing Context Free Grammar
-
A B Tree of Binary Search Trees
-
Implementing enhanced for loops in Bytecode
-
Top-Down Deletion for Red/Black Trees
-
Function Closures For Bytecode VMs: Heap Allocated Activation Records & Access Links
-
Pascal & Bernoulli & Floyd: Triangles
-
A Quick tour of MGCLex
-
Compiling Regular Expressions for "The VM Approach"
-
Composable Linked Digraphs: An efficient NFA Data Structure for Thompsons Construction
-
Improving the Space Efficiency of Suffix Arrays
Leave A Comment