Compiler Construction: Lexical Analysis.

If you’ve been following along with my latest series of posts, then welcome to part four of my journey to compile a high-level language to my virtual stack machine. If you haven’t done so yet, you should read the previous three posts, as this post builds on material presented there.

When we last left off, our virtual stack machine was fully functional and Turing complete, and I had introduced the concepts of compilers, grammars, and abstract syntax trees. In this post I will cover an important component of compilers, lexical analysis, also known as lexing, and walk through an implementation.

I had started off with the intention of writing a compiler for the BASIC language, but after careful deliberation I have decided that the language I am going to use is PL/0.

PL/0

PL/0 is a subset of Pascal that was designed by Niklaus Wirth for use in his influential books Algorithms + Data Structures = Programs and Compilerbau, with the purpose of teaching students how to write a compiler. Despite being just a subset of Pascal, it can easily be expanded to encompass the entirety of Pascal, as well as additional features, and it is decidedly more useful than BASIC anyhow. The EBNF grammar of PL/0 is:

program = block .

block = [ const ident = number {, ident = number} ;]
        [ var ident {, ident} ;]
        { procedure ident ; block ; } statement .

statement = [ ident := expression | call ident
        | ? ident | ! expression
        | begin statement {; statement} end
        | if condition then statement
        | while condition do statement ].

condition = odd expression
        | expression (=|#|<|<=|>|>=) expression .

expression = [+|-] term { (+|-) term } .

term = factor { (*|/) factor } .

factor = ident | number | ( expression ) .

When reading the above grammar, anything enclosed in [] is optional in that production, and anything enclosed in {} occurs zero or more times. This corresponds directly to if statements and while loops when we write our recursive procedures for the parser, as the sketch below shows.
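
For example, here is how the optional variable declaration inside block might translate (a sketch only: accept() and expect() are hypothetical helpers that test and consume the next token, and the real parser will come in a later post):

// Sketch: the bracket notation maps directly onto control flow.
// accept() and expect() are hypothetical token-matching helpers.
void varDeclaration() {
    if (accept(varsym)) {           // [ var ... ] is optional, so: if
        expect(idsym);
        while (accept(comma)) {     // {, ident} repeats zero or more times, so: while
            expect(idsym);
        }
        expect(semicolon);
    }
}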

Lexical Analysis

Lexical analysis is the process of making sense of the combinations of letters, numbers, and symbols that we present to the compiler as source code. Before we check whether the syntax is correct, we have to make sure that the building blocks of the syntax are correct as well. In a compiler this is the job of what’s called a “lexer”.

A lexer consumes the input source character by character, and from this stream of characters it outputs a stream of tokens. To refresh your memory from my previous post, a language’s grammar is composed of “a set of symbols representing terminal tokens and a set of simple symbols representing non terminal tokens.” It is this stream of tokens that is consumed by the parser to create an abstract syntax tree; thus the lexer and the parser are in what’s called a producer/consumer relationship. When fully implemented, our lexer should take the following valid PL/0 code as input:

 
var x, y, z;
procedure p;
    var m;
    begin
        x := 10;
        m := x;
    end;
begin
    call p;
    y := x;
    z := y + 15;
end.

And output a sequential list of tokens directly usable as input for the parser:

<[ var ], [ varsym ]>
<[ x ], [ idsym ]>
<[ , ], [ comma ]>
<[ y ], [ idsym ]>
<[ , ], [ comma ]>
<[ z ], [ idsym ]>
<[ ; ], [ semicolon ]>
<[ procedure ], [ procsym ]>
<[ p ], [ idsym ]>
<[ ; ], [ semicolon ]>
<[ var ], [ varsym ]>
<[ m ], [ idsym ]>
<[ ; ], [ semicolon ]>
<[ begin ], [ beginsym ]>
<[ x ], [ idsym ]>
<[ := ], [ assignsym ]>
<[ 10 ], [ number ]>
<[ ; ], [ semicolon ]>
<[ m ], [ idsym ]>
<[ := ], [ assignsym ]>
<[ x ], [ idsym ]>
<[ ; ], [ semicolon ]>
<[ end ], [ endsym ]>
<[ ; ], [ semicolon ]>
<[ begin ], [ beginsym ]>
<[ call ], [ callsym ]>
<[ p ], [ idsym ]>
<[ ; ], [ semicolon ]>
<[ y ], [ idsym ]>
<[ := ], [ assignsym ]>
<[ x ], [ idsym ]>
<[ ; ], [ semicolon ]>
<[ z ], [ idsym ]>
<[ := ], [ assignsym ]>
<[ y ], [ idsym ]>
<[ + ], [ add ]>
<[ 15 ], [ number ]>
<[ ; ], [ semicolon ]>
<[ end ], [ endsym ]>
<[ . ], [ period ]>
<[ ], [ eofsym ]>

There are a lot of different ways to go about accomplishing this, but at the heart of it is string matching. The textbook way is to examine the input character by character, recognizing individual characters as valid symbols themselves, or forming longer symbols by combining the current character with the next N characters of the input. Those upcoming characters are called the lookahead.
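
As a tiny illustration (a standalone sketch, not part of the lexer class below, using a hypothetical string-and-cursor pair), a single character of lookahead is enough to tell a bare ':' apart from the two-character token ':=':

#include <cstddef>
#include <string>

// Sketch: peek one character ahead to decide between ':' and ":=".
// 'src' and 'i' are a hypothetical input string and cursor position.
bool startsAssign(const std::string& src, std::size_t i) {
    return src[i] == ':' && i + 1 < src.size() && src[i + 1] == '=';
}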

To make this happen, we need a list of valid tokens and a list of valid keywords. For the list of valid tokens I use an enum of tokens typedef’d to Symbol, which indexes a vector of strings containing their names. For the keywords, I use a binary search tree based symbol table. The output of the lexer is a linked list containing the token pairs shown above.

#ifndef SM_GRAMMAR_DEFS_HPP
#define SM_GRAMMAR_DEFS_HPP
#include <string>
#include <vector>
using std::vector;
using std::string;

// Printable token names; the order must match the Symbol enum below.
vector<string> tokennames = { "lparen", "rparen", "add", "sub", "multiply", "divide", "semicolon", "comma", "period",
                             "equals", "assignsym", "beginsym", "endsym", "procsym", "callsym",  "whilesym", "idsym", "ifsym", "dosym",
                             "thensym", "number", "errsym", "whitespace", "varsym", "eofsym"};

typedef enum syms {
    lparen,rparen,add,sub,multiply,divide,semicolon,comma,period,equals,assignsym,beginsym,endsym,
    procsym,callsym,whilesym,idsym,ifsym, dosym, thensym, number,
    errsym,whitespace,varsym, eofsym
} Symbol;

// A node in the singly linked list of <text, token> pairs the lexer emits.
struct TokenList {
    string text;        // the lexeme as it appeared in the source
    Symbol token;       // its classification
    TokenList* next;
    TokenList(string t, Symbol s) {
        text = t;
        token = s;
        next = nullptr;
    }
};

#endif

Next we need a way to represent the input code in memory, a way to track our current position in the input text and what the current character is, and a way to obtain the next character. It also helps, for debugging purposes, to keep track of which line of the input you are reading from.

  
#include <iostream>
#include <cctype>
using std::cout;
using std::to_string;

// SymbolTable is the binary search tree based symbol table from an
// earlier post in this series; here it maps keyword strings to Symbols.
class smLexer {
    private:
        vector<string> source;      // the input, one string per line
        int pos;                    // position within the current line
        char lookahead;             // the character currently under examination
        int lineno;                 // current line number, for error messages
        string buffer;              // text of the token being built up
        SymbolTable<string, Symbol> keywords;
        Symbol getToken();
        char nextchar();
        void error(string msg);
    public:
        smLexer(vector<string> src);
        TokenList* lex();
};

The source is represented as a vector of strings, which is passed to the constructor. The constructor also populates the keywords symbol table.

smLexer::smLexer(vector<string> src) {
    source = src;
    lineno = 0;
    pos = 0;
    keywords.insert("begin", beginsym);
    keywords.insert("end", endsym);
    keywords.insert("procedure", procsym);
    keywords.insert("call", callsym);
    keywords.insert("if", ifsym);
    keywords.insert("do", dosym);
    keywords.insert("then", thensym);
    keywords.insert("while", whilesym);
    keywords.insert("var", varsym);
}

TokenList* smLexer::lex() {
    TokenList dummy("dummy", whitespace);   // dummy head node simplifies appending
    TokenList* list = &dummy;
    lookahead = 'a';                        // any value other than '~' to enter the loop
    while (lookahead != '~') {              // '~' is our end-of-input sentinel
        lookahead = nextchar();
        Symbol curr = getToken();
        if (curr != whitespace) {           // whitespace is dropped, everything else is kept
            list->next = new TokenList(buffer, curr);
            list = list->next;
        }
    }
    return dummy.next;
}

The algorithm is governed by lex(), which repeatedly calls getToken(), building up the token list until the entire source file has been consumed. getToken() relies on nextchar() to actually consume the source character by character:

char smLexer::nextchar() {
    if (lineno >= (int)source.size()) {
        return '~';                         // already past the last line
    }
    if (pos < (int)source[lineno].size()) {
        return source[lineno][pos++];
    }
    // We've hit the end of the current line, so advance to the next one.
    lineno += 1;
    pos = 0;
    if (lineno == (int)source.size()) {
        cout<<"Source file consumed.\n";
        return '~';                         // end-of-input sentinel
    }
    cout<<"Analyzing line: "<<lineno<<"\n";
    // Hand back a space so tokens cannot run together across line boundaries.
    return ' ';
}

Symbol smLexer::getToken() {
    buffer.clear();
    if (lookahead == ' ' || lookahead == '\t') {
        return whitespace;
    } else if (lookahead == '~') {
        return eofsym;
    } else if (lookahead == '(') {
        buffer = "(";
        return lparen;
    } else if (lookahead == ')') {
        buffer = ")";
        return rparen;
    } else if (lookahead == '+') {
        buffer = "+";
        return add;
    } else if (lookahead == '-') {
        buffer = "-";
        return sub;
    } else if (lookahead == '*') {
        buffer = "*";
        return multiply;
    } else if (lookahead == '/') {  // division, from the term production
        buffer = "/";
        return divide;
    } else if (lookahead == ';') {
        buffer = ";";
        return semicolon;
    } else if (lookahead == ',') {
        buffer = ",";
        return comma;
    } else if (lookahead == '.') {
        buffer = ".";
        return period;
    } else if (lookahead == '=') {
        // Note: this lexer expects '==' for equality, though the EBNF shows a single '='.
        lookahead = nextchar();
        if (lookahead == '=') {
            buffer = "==";
            return equals;
        } else {
            string msg = "unknown symbol on line " + to_string(lineno) + " near '" + lookahead + "'";
            error(msg);
            return errsym;
        }
    } else if (lookahead == ':') {
        lookahead = nextchar();
        if (lookahead == '=') {
            buffer = ":=";
            return assignsym;
        } else {
            string msg = "unknown symbol on line " + to_string(lineno) + " near '" + lookahead + "'";
            error(msg);
            return errsym;
        }
    } else if (isalpha(lookahead)) {
        cout<<"Identifier or keyword?\n";
        while (1) {
            if (isalnum(lookahead)) {
                buffer.push_back(lookahead);
            } else {
                if (pos > 0) pos--;     // push the terminating character back
                break;
            }
            lookahead = nextchar();
            cout<<buffer<<"\n";
        }
        cout<<"Checking symbol table...\n";
        Symbol ret = keywords.lookup(buffer);
        if (ret == errsym) {
            cout<<"Identifier.\n";
            return idsym;               // not a reserved word
        } else {
            cout<<"Keyword: "<<tokennames[ret]<<"\n";
            return ret;
        }
    } else if (isdigit(lookahead)) {
        buffer.push_back(lookahead);
        while (1) {
            lookahead = nextchar();
            if (isdigit(lookahead)) {
                buffer.push_back(lookahead);
            } else {
                if (pos > 0) pos--;     // push the terminating character back
                break;
            }
        }
        return number;
    } else {
        string msg = "Could not determine token on line " + to_string(lineno) + " position " + to_string(pos) + " near " + lookahead;
        error(msg);
        return errsym;
    }
}

As you can tell, getToken() is really the beating heart of the algorithm. It begins by trying to match every single-character token that it can, then it tries to match the two-character tokens. With those out of the way, it moves on to handling any identifiers and keywords present, carefully building them up inside the buffer variable while tracking both lookahead and pos. Once a string is extracted, it is checked against the list of valid keywords; if it is not a reserved keyword, it is labeled an identifier. The same is done to make sure any numbers encountered are valid numbers and not strings mixing digits and letters. Finally, if it cannot recognize the input at all, the input must be erroneous, so it reports the line number and character the error occurred on to the user, and quits.
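
To see all of the pieces working together, here is a minimal driver (a sketch: it assumes the definitions above live in the same translation unit, along with the SymbolTable from the earlier post) that lexes the sample program and prints each token pair in the format shown earlier:

int main() {
    // the sample PL/0 program from earlier, one string per line
    vector<string> src = {
        "var x, y, z;",
        "procedure p;",
        "    var m;",
        "    begin",
        "        x := 10;",
        "        m := x;",
        "    end;",
        "begin",
        "    call p;",
        "    y := x;",
        "    z := y + 15;",
        "end."
    };
    smLexer lexer(src);
    // walk the linked list the lexer produces and print each pair
    for (TokenList* t = lexer.lex(); t != nullptr; t = t->next) {
        cout << "<[ " << t->text << " ], [ " << tokennames[t->token] << " ]>\n";
    }
    return 0;
}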

In this way we build our token list, which can be fed to our parser to create an abstract syntax tree. I will cover the parser in my next post. Until then, happy hacking!

Further Reading

Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman

http://pascal.hansotten.com/niklaus-wirth/pl0/ – A fantastic write-up on PL/0 and other teaching languages from Niklaus Wirth

Chapter Five of Algorithms + Data Structures = Programs – The full chapter of Niklaus Wirth’s monumental book, where he introduces the PL/0 language and compiler.