Implementing Dynamic Typing with Tagged Unions

mgcadmin 24-03-2024

Many scripting languages, if not most of them, have dynamic typing. where you don't need to specify the type of a variable, and the type being stored can change over time. Languages like Lisp, Perl, PHP, JavaScript, and Python are all examples of dynamically typed languages. Interestingly, 3 of the 5 languages I just listed have their primary implementation written in C, and many JavaScript engines are written in C++. Why does this matter? Well, C and C++ are statically typed languages. So what kind of dark magic is taking place here? In short: how the heck does that work?

Turns out its actually quite simple. No wizardry needed, sorry Gandalf. In fact, there are a couple of popular ways to go about doing this - it's not exactly a new concept. That sound you here is the OOP fanatics drawing up their class hierarchies, and while that certainly is an option, its kind of a heavy weight option for something we want to do rather quickly: access a value in memory. The fewer the layers of abstraction needed to get at that value, the better.

Another popular option, though something you see more from the functional boys is NaN-boxing. NaN boxing is the concept of taking a standard width float, and shoving what ever you want in the bottom 52 bytes of it, with the first 12 bytes used as a tag. While NaN-boxing isn't black magic, it requires a bit more wizardry then the option I'm going to outline today, while bringing with it a whole host boundary conditions you need to manage - not ideal.

Tagged Unions

Tagged unions are common in C, less so in C++, but you still find them. A union is a data structure in C/C++ that can store more than one data type, but only one at a time. I sat here for about 15 minutes trying to think of a better description than that, but maybe I should just show you.

union {
   int a;
   float b;
} data;

data.a = 6; //data stores an int, the value of b is undefined
data.b = 2.7f; //data now stores a float, the value of a is undefined

The important thing to remember is one at a time, and while that one is in use, the rest are undefined. You're probably wondering, "Well how do you know which field to access?" and that's where the concept of a tagged union comes in. A tag is (usually) an enum with entries corresponding to the union, that you can then use to drive a switch statement to access the proper field.

Tagged Unions as a Generic Object Type

To implementing dynamic typing, we can take a hint from Java, which is ironically enough another statically typed language. Java has primitive types, or Boxed types. Javas boxed types are the object equivalent of primitive types. We will only use boxed types, but this will be invisible to the user. Like in Java, all of our boxed Types will inherit from a base type Object, unlike Java we'll use tagged unions instead of a classes.

The first thing we need is our enum of the types an Object can represent:

enum StoreAs {
    AS_INT, AS_REAL, AS_BOOL, AS_STRING
};

We can add more types like arrays and hashes as well, but this is enough to get us started. Tagged unions are held either in a class or struct to keep them together, in C++ a struct is a class, but i still prefer to use struct out of habit.

struct Object {
      StoreAs storedType;
      union {
          int   intVal;
          float realVal;
          bool  boolVal;
          string* strVal;
      };
};

Pay close attention to the fact that strVal is a pointer to a string, not a string. If you're going to implement array or hash types, you'll want to store pointers for those as well. That's because an Object isnt the size of the type it stores. an Object is the size of the largest width type that it CAN store.

Unfortunately unions and C++'s type constructors don't jive too well, so instead we will make use of some helper methods: a base makeObject() method, as well as makeIntObject(), makeBoolObject(), makeRealObject(), and makeStringObject() methods.

Object* makeObject(StoreAs type) {
    Object* obj = new Object;
    obj->type = type;
    return obj;
}

Object* makeIntObject(int value) {
    Object* obj = makeObject(AS_INT);
    obj->intVal = value;
    return obj;
}

Object* makeRealObject(float value) {
    Object* obj = makeObject(AS_REAL);
    obj->realVal = value;
    return obj;
}

Object* makeBoolObject(bool value) {
    Object* obj = makeObject(AS_BOOL);
    obj->boolVal = value;
    return obj;
}

Object* makeStringObject(string* value) {
    Object* obj = makeObject(AS_STRING);
    obj->stringVal = value;
    return obj;
}

Okay, so now that we can create our objects, what about reading from them? This is where the tag portion comes in. We access the value of the object we by using a switch statement or if statement to access the correct field based on the tag.

void Interpreter::printStmt(ASTNode* node) {
    Object* obj = expression(node->left);
    switch (obj->type) {
        case AS_REAL: 
            cout<<obj->realVal<<endl;
            break;
        case AS_INT:
            cout<<obj->intVal<<endl;
            break;
        case AS_BOOL:
            if (obj->boolVal) cout<<"true"<<endl;
            else              cout<<"false"<<endl;
            break;
        case AS_STRING:
            cout<<*obj->stringVal<<endl;
            break;
        default:
            break;
    }
}

In addition, this also facilitates easy run time type checking of operands to ensure they are compatible with the operator, as shown in the following eval() function:

Object* Interpreter::eval(ASTNode* node) {
    Object* lhs = expression(node->left);
    Object* rhs = expression(node->right);
    if (lhs->type == rhs->type && (lhs->type == AS_REAL || lhs->type == AS_INT)) {
        double left = lhs->realVal;
        double right = rhs->realVal;
        switch (node->data.tokenVal) {
            case PLUS:     return makeRealObject(left+right);
            case MINUS:    return makeRealObject(left-right);
            case DIVIDE:
                if (right == 0) {
                    cout<<"Error: attempted divide by zero"<<endl;
                    return makeRealObject(0.0f);
                }
                return makeRealObject(left/right);
            case MULTIPLY: return makeRealObject(left*right);
            case EQUAL:    return makeRealObject(left == right);
            case LESS:     return makeRealObject(left < right);
            default:
                cout<<"Unknown Operator: "<<node->data.stringVal<<endl;
        }
    } else if (lhs->type == AS_STRING && rhs->type == AS_STRING && node->data.tokenVal == PLUS) {
        return makeStringObject(new string(*(lhs->stringVal) + *(rhs->stringVal)));
    } else {
        cout<<"Error: Unsupported operation for those types."<<endl;
    }
    return makeRealObject(-1.0f);
}

So as you can see, implementing dynamic typing for a language isn't some type of dark magic, but more like a stage magician's slight of hand. We're giving the user the illusion of dynamic typing, by managing all of the type dependent processes out of sight and out of mind.

That's all I've got for today, so until next time Happy Hacking!

MaxGCoding.com

For the love of programming

Implementing Dynamic Typing with Tagged Unions

Tagged Unions

Tagged Unions as a Generic Object Type

Leave A Comment

The Aho, Sethi, Ullman Direct DFA Construction, Part 1: Constructing the Followpos Table

Procedural Map Generation with Binary Space Partitioning

Exact Match String Searching: The Knuth-Morris-Pratt Algorithm

The visitor pattern: OOP takes on the Expression Problem

Improving mgcLisp's define syntax

Separating the Men from the Boys, Knuth Style

Reducing rotations during deletion from AVL trees

Parsing Lisp: From Data To Code

Swiz Heaps: A deterministic modification of Randomized Meldable Heaps

Let's talk Eval/Apply