Abstract Syntax Trees: Homogenous vs. Heterogenous Node Structures

Abstract Syntax Trees are the data structure which bridges the cap between the front and back ends of many (most?) compilers. Because of this, its design influcences the implementation of many other areas of the compiler. Of the many design decisions which need to be made, none is as impactful is the choice of node style for the vertices of the AST.

In general there are two choices: Homogenous node structures, as one would see in any general tree structure, and Heterogenous node structures which leverage Object Oriented design patterns to create a class heirarchy of node types. To some extent this choice is influenced by the choice of implementation language. Heterogenous AST's dont make sens in a non-OOP language, while Homogenous AST's are better suited to a procedural approach. Like most things though, the practical applications are far from black and white.

Homogenous AST Design

To start, lets take a look at the leaner and arguably simpler of the two. One major advantage of this design pattern is that is keeps our AST nodes as PODs - Plain old data structures - meaning all operations take place in external procedures. It also makes heavy use of tagged unions as you can see below.

Because they don't rely on a particular type system, they can be both quickly and easily serialized, should you want to either persist or transmist the AST once constructed. On the other hand, because of the use of tagging, operations on nodes can become long switch statements.

enum NodeType {
     EXPR_NODE, STMT_NODE
};

enum ExprType {
     CONST_EXPR, BINARY_EXPR
};

enum StmtType {
    PRINT_STMT, IF_STMT
};

struct AST {
      NodeType kind;
      union {
          ExprType expr;
          StmtType stmt;
      };
      Data attributes;
      AST* child[NUM_CHILDREN];
};

Traversal in homogenous ASTs is as simple in general tree structures, with standard traversal procedures used. 

The Heterogenous AST Pattern

On the complete other end of the spectrum is the heterogenous AST. Heterogenous AST's lend themselves very well to Object orient Programming and OOP design patterns. Both the interpreter pattern and the visitor pattern leverage heterogenous AST's as answers to the larger "expression problem".

class BaseAST {
     virtual ~BaseAST() { }
     virtual Data& getAttributes() = 0;
     virtual void accept(Visitor& visitor) = 0;
};

class ExprNode : public BaseAST {
     virtual ~BaseAST() { }
     virtual Data& getAttributes() = 0;
     virtual void accept(Visitor& visitor) = 0;
};

class StmtNode : public BaseAST {
     virtual ~BaseAST() { }
     virtual Data& getAttributes() = 0;
     virtual void accept(Visitor& visitor) = 0;
};

class ConstExpr : public ExprNode {
      private:
           Data data;
       public:
           ConstExpr(Data& d) : data(d) { }
           void accept(Visitor* visitor) {
                 visitor->visit(this);
            }
            Data& getAttributes() {
                   return data;
             }
};

class BinExpr : public ExprNode {
     private:
        ExprNode* leftChild;
        ExprNode* rightChild;
        Data op;
     public:
          BinExpr(Data ot, ExprNode* lhs, ExprNode* rhs) 
           : op(ot), leftChild(lhs), rightChild(rhs) { }
           
           Data& getAttributes() {
                   return data;
           }
          void accept(Visitor* visitor) {
                  visitor->visit(this);
           }
};

class PrintStmt : public StmtNode {
     private:
          ExprNode* expr;
    public:
         PrintStmt(ExprNode* e) : expr(e) { }
         void accept(Visitor* visitor) {
                  visitor->visit(this);
         }
};

One immediatly obvious difference between the two is the length of the definitions. Where in the homogenous design adding a new node type was as simple as adding an entry to an enum, here we need to define a whole seperate class. In exchange for that verbocity however, were given stronger type safety.

Which Is Better?

Neither style of AST is inherently better than the other, and either can be an acceptable choice given the proper tools ans task. Like parsing algorithms and baseball teams, it tends to be whichever one you use first that you tend to reach for the most.


Leave A Comment