Exploring Stack Machines: Compiling high level languages to Stack Machine code

At the end of my previous article we had a fully featured stack machine that is Turing complete – meaning that it is capable of “universal computation”. We already have an assembly language for our stack machine, but programming in assembly gets tedious fast. That means if we want to program in anything besides assembly for our stack machine, we are going to have to write a compiler for it. That is a whole lot easier said than done..

To keep things simple at first, we are going to write a compiler that takes in a mathematical expression, and outputs a program that can be run on our stack machine to get the answer. There is a lot of theory that needs to be covered in order for this to make sense, and there is really no way around it, so grab a snack and something to drink and get comfortable because were about to dive down one heck of a rabbit hole.

A Little Blurb About Compilers

A compiler, at its heart, is a translator. It takes as input data in format A and outputs data in format B. This can mean translating english to spanish, or it could mean translating Pascal to assembly – it works the same way. That is, both natural languages and programming languages have syntaxes, and syntaxes are made up of grammars. Grammars are a set of rules that quite the structure of the language. Through careful analysis of the input and the correct application of these rules, in the correct order, we can translate between language A and B.

Compilation happens in phases, with different parts of the compiler handling different tasks. The different tasks of a compiler are:

  • lexical analysis
  • parsing
  • semantic analysis
  • error checking
  • code generation

Lexical analysis is handled by the lexer the lexer takes a string of characters and turns them into a stream of annotated tokens. This stream of tokens is used as input to the parser. The parser performs the semantic analysis, converting our string of tokens into an abstract syntax tree. The abstract syntax tress is an intermediate representation of our program. It can serve as the input for an optimizer, a code generator, or an interpreter. This is the “front end” of the compiler, and the focus of this paper.

There are some other parts too, such as symbol tables, and debugging aids, etc. Not to mention the back end, which in our case is the stack machine we developed in my previous two posts. A compiler can be as simple or complex as you want to make it. Some compilers make multiple reads through the input, making changes to the structure of the code, replacing with symbols with values, perform type checking with each pass. These are called multi pass compilers. Others make just one read of the input and these are called single pass compilers. Some compilers focus on the efficiency of the code they generate and are called optimizing compilers.

Regardless of the number of passes required, all compilers broadly fall in to one of two categories: top down compilers, and bottom up compilers. Bottom up compilers are generally made of such complex algorithms that they are created by compiler generator tools such as lex/yacc, bison, and ANTLR, specialized software that does nothing to remove the mystery of what a compiler does, though for more complex higher level languages, they are the preferred method.

I will be focusing on making a top down compiler, as they are straight forward enough to be implemented by hand and offer a great understanding of how the whole process works.

Syntax and Grammar

By knowing the grammars of the input and target languages, we can apply the grammar rules to translate the input from syntax A and output it in syntax B. Again, easier said than done – but not impossible. Do to this, we are going to use a technique called recursive descent parsing.

This is going to start getting a little jargon heavy, so I am going to introduce some terms that are going to come up a lot along the way. I previously mentioned that grammars are rules that guide syntax. We will be concerning ourselves with what is called “context free grammars” which are used to describe grammar in a way usable by a recursive descent parser.

A context free grammar breaks down into four parts:

  1. A set of symbols, or tokens called terminal symbols
  2. A set of tokens called nonterminal symbols
  3. productions, or the results of applying grammar rules to tokens
  4. a non terminal symbol that is also the start symbol

As mentioned, we will begin by compiling mathematical expressions from algebraic notation down to stack machine assembly that runs on our stack machine. Because algebraic equations are most commonly written with in-fix notation, and stack machines evaluate expressions in post-fix order, our ultimate task is to manage that conversion. Thankfully, Edsgar Dijkstra came up with a simple solution for this very problem: the shunting yard algorithm.

We are going to start with as simple an expression s possible: “9-2+6”. The syntax of this expression can be described as “a list of numbers separated by a plus or minus sign.” From the description of the syntax we can derive the rules for its grammar by listing it’s productions:

  • An expression is a list of digits separated by plus or minus signs: list = list + digit | list – digit
  • A list can be just one number: list = digit
  • digit = 0 | 1 | 2 | 3 | N+1

The tokens of our expressions are +, 0, 9, 2, and 6. So the next step is to determine which of those are terminal symbols and which are non terminal symbols and which is the start symbol, from which we can derive the productions. We do this by creating a parse tree. But right now we have a problem. We have an ambiguous grammar which can be interpreted to result in two different but valid answers. And we need to do something about that or this isn’t going to work.

              list
/ | \
list (+) list <-left associative precedence
/ | \ \
digit(9)(-) digit(2) digit(6)

list
/ | \
list (-) list <- right associative precedence
/ / | \
digit(9) digit(2) (+) digit(6)

As it stands, our grammar is what’s called left-recursive, and that is not valid input for a recursive descent parser, because it lacks precedence rules. We can reshape our grammar to remove left recursion through the computer scientist’s best friend: indirection. Right now, our left recursive grammar looks like this:

expr = expr + expr | expr * expr | num

To rewrite our grammar in and remove the ambiguity, we introduce a term this gives our grammar operator precidence:

expr = expr + term | term
term = term * num | num

And now we re-write the grammar to remove the left recursion:

expr = term exprOpt*  
exprOpt = + term
term = num termOpt*
termOpt = * num

The syntax of a mathematical expression is made up of three types of non terminals: expr, term, and factors. expr and term are used to preserve operator precedence and factors are the basic unit of the expression, which in this case are numbers. So the productions for a mathematical grammar is:

expr = expr + term | expr - term | term
term = term * factor | term / factor | factor
factor = number | ( expr )

Armed with our revised grammar, we’re ready to start putting our parser together. Our parser is going to be made up of functions called expr(), term(), factor(), and rest() that call each other recursively to create a parse tree from which our abstract syntax tree is derived.

A Quick Word on Abstract Syntax Trees

In the above section we examined a parse tree and its relation to grammar. And just as a grammar can be represented by a tree, so syntax can be take tree form as well, called an Abstract Syntax Tree, and abstract syntax tree’s are very, very powerful.

            +
/ \
- 6
/ \
9 2

Abstract syntax trees, as show above, differ from parse trees (a concrete syntax tree) because it does away with any information that is not directly useful to its evaluation – symbols have been replaced by values, and it is structured according to the correct operator precedence. Most importantly, abstract syntax trees are what connects our stack machine, grammars, syntax trees, and high level languages.

If you look at the above syntax tree, with our stack machine in mind something should become immediately apparent: Its inner nodes are operations, and its leaf nodes are made of operands.

A stack machine is inherently suited to post-fix expressions. But our expression is written as in-fix. There is one more thing about abstract syntax tree’s that I should mention: They are always traversed, in a depth first, post order, left to right fashion. Let’s perform that traversal on the tree above:

 
#include <iostream>
using namespace std;

struct node {
    char info;
    node* left;
    node* right;
    node(char c, node* l, node* r) {
        info = c;
        left = l;
        right = r;
    }
};

void traverse(node* h) {
    if (h != nullptr) {
        traverse(h->left);
        traverse(h->right);
        cout<<h->info<<" ";
    }
}

int main() {
    node* root = new node('+',
                            new node('-',
                                    new node('9', nullptr, nullptr), new node('2', nullptr, nullptr)),
                            new node('6', nullptr, nullptr));
    traverse(root);
    cout<<endl;
    return 0;
}


max@MaxGorenLaptop:/mnt/c/Users/mgoren$ ./syntree
9 2 - 6 +
max@MaxGorenLaptop:/mnt/c/Users/mgoren$

Oh. Oh my. It would appear that we have converted our infix expression to a postfix expression. This is the guiding principle behind our code generation scheme. By a slight tweak to the data processing part of our ast traversal function, we can easily generate valid stack machine assembly code:

 
void traverse(node* h) {
    if (h != nullptr) {
        traverse(h->left);
        traverse(h->right);
        if (isdigit(h->info))
            cout<<"push "<<h->info<<"\n";
        else if (h->info == '+')
            cout<<"add\n";
        else if (h->info == '-')
            cout<<"sub\n";
    }
}

Recompile and run again and……

 push 9
push 2
sub
push 6
add

And we take that code and run it on our stack machine with stack tracing turned on:

 
#include <iostream>
#include <vector>
#include "stackmachine.hpp"
using namespace std;

int main() {
      vector<string> program = {
            "push 9",
            "push 2",
            "sub",
            "push 5",
            "add",
            "show"
      };
       StackMachine sm(true, false);
       sm.run(program);
       return 0;
}

max@MaxGorenLaptop:/mnt/c/Users/mgoren$ g++ smex2.cpp -o smex && ./smex
[0] push 9
Data Stack: []
[1] push 2
Data Stack: [9 ]
[2] sub
[2] pop
Data Stack: [2 9 ]
[2] pop
Data Stack: [9 ]
[2] push
Data Stack: []
[3] push 6
Data Stack: [7 ]
[4] add
[4] pop
Data Stack: [6 7 ]
[4] pop
Data Stack: [7 ]
[4] push
Data Stack: []
[-]13

Wow!! Now we’re cooking with gas…. or at least we would be, had our code been generated by traversing a dynamic implicitly generated m-ary abstract syntax tree, instead of the explicit binary syntax tree traversed above. Did you really think I explained all that stuff about grammars and syntax just to do a post order traversal of a binary tree?

Recursive Descent Parsing

One of the problems with creating explicit abstract syntax trees is that there is no gurantee, and in fact it is pretty unlikely, that your abstract syntax tree is also a binary tree. While constructing an m-ary tree data structure is not impossible – left child/right sibling trees are an option – the established technique is to create an implicit AST through mutually recursive calls to methods which represent the nodes on the tree as it works its way down through the input, hence the name recursive descent. Parsing works by consuming the input and matching the it to tokens. The token currently being evaluated is called the lookahead token which is continuously updated throughout the process.

Non terminal tokens of our grammar get methods names after them, and they are used to find the terminal tokens. Since we are using the following grammar:

expr = expr + term | expr - term | term
term = term * factor | term / factor | factor
factor = number | ( expr )

We will have methods for expr(), term(), and factor(). These methods make a call to match(), which either validates that the current token is what we are expecting and advances lookahead to the next token, or throws an error.

 class smToyCompiler {
        void term();
        void factor();
        void expr();
        void match(char);
        void error();
        char lookahead;
        string input;
        vector<string> output;
        int pos;
        char getchar() {
            return input[pos++];
        }
    public:
        smToyCompiler() {
            pos = 0;
        }
        vector<string> compile(string in) {
            input = in;
            lookahead = getchar();
            expr();
            output.push_back("show");
            return output;
        }
};

We will start with factor(), since it is the base unit of our expression. Looking at the graammar, we see the productions num | (expr), which means factor is expecting the current token to be a number, or another expression inside parentheses – as that can be reduced to just a number. Anything else is invalid input, and so an error:

 void smToyCompiler::factor() {
    if (isdigit(lookahead)) {
        string cmd = "push ";
        while (isdigit(lookahead)) {
            cmd.push_back(lookahead);
            lookahead = getchar();
        }
        pos--;
        output.push_back(cmd);
        match(lookahead);
    } else if (lookahead == '(') {
        match('('); expr(); match(')');
    } else error();
}

Next we will look at expr(), which is the starting token of our expression. And term(), which is closely related to expr(). This is where we start to see the mutual recursion come in to play.

 void smToyCompiler::expr() {
    term();
    while (1) {
        if (lookahead == '+') {
            match('+'); term(); output.push_back("add");
        } else if (lookahead == '-') {
            match('-'); term(); output.push_back("sub");
        } else break;
    }
}

void smToyCompiler::term() {
    factor();
    while (1) {
        if (lookahead == '*') {
            match('*'); factor(); output.push_back("mult");
        } else if (lookahead == '/') {
            match('/'); factor(); output.push_back("div\n");
        } else break;
    }
}

If we trace the execution of evaluating (19*3), we start with the call to
expr() which immediately calls term(), which immediately calls factor(). For the expression (19*3), factor() finds the first terminal token: ‘(‘, which it matches. Factor() then calls expr(), which again calls term(), which again calls calls factor() and matches on the 19. This recursive call unwinds, and we find ourselves back in term() which matches to the ‘*’ token, and so calls factor(), which immediately matches the number 13. The recursion stack unwinds and we find our selves back in factor() right after the call to expr(), and thus match the closing parentheses. The match function is very straight forward:

 void smToyCompiler::match(char t) {
    if (lookahead == t)
        lookahead = getchar();
    else error();
}

Let’s try evaluating something a little more complex, and see if our compiler will generate the proper instructions for something like (19*3) – 17 + (6*2)

 
int main() {
    string expression = "(19*3)-17+(6*2)";
    smToyCompiler compiler;
    StackMachine sm(false, false);
    vector<string> program = compiler.compile(expression);
    cout<<"Evaluting: "<<expression<<"\nAnswer: ";
    sm.run(program);
    return 0;
}

max@MaxGorenLaptop:/mnt/c/Users/mgoren/smProject$ ./smInfix
Evaluting: (19*3)-17+(62)
Answer: [TOS] 52
max@MaxGorenLaptop:/mnt/c/Users/mgoren/smProject$

Allright! Now we truly are getting some place. Now we need to expand our grammar to include high level constructs like conditional statements, loops, and variable assignments.

Stay tuned for the next installment where we will go on to create a compiler for a subset of the Pascal programming language to run on our stack machine. As always, until then happy hacking!

Further Reading

Compilers, Principles, Techniques, And Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.

Algorithms in C++ (2nd. ed), pp. 217 – 317 by Robert Sedgewick

https://mathcenter.oxford.emory.edu/site/cs171/shuntingYardAlgorithm/

All code examples above are available on my github: http://www.github.com/maxgoren/StackMachine