Compiling Regular Expressions for "The VM Approach"

Any software developer with an interest in regular expression engines has almost invariably come across the phenomenal series of articles written by Russ Cox on the subject. His article "Regular Expression Matching can be Simple and Fast" is probably one of the most referenced papers on NFA implementation to be published since 2000, and certainly one of the best. For many, it was the first time hearing of the catastrophic backtracking problem.

The second article in his series, "Regular Expression Matching: The VM approach" accomplishes something few authors ever manage: A banger of a paper that is just as interesting - if not more so - than the first! Alas, this second paper focused on the virtual machine side of things, leaving the compilation of the programs for that machine fairly unexplained except for the most cursory of descriptions, including the following chart describing instruction sequencing.

a     char a e1e2     codes for e1
    codes for e2
e1|e2     split L1, L2
L1: codes for e1
    jmp L3
L2: codes for e2
L3:
e?     split L1, L2
L1: codes for e
L2:
e* L1: split L2, L3
L2: codes for e
    jmp L1
L3:
e+ L1: codes for e
    split L1, L3
L3:

Along with an example for what a (small) example (a+b+) should compile to:

0   char a
1   split 0, 2
2   char b
3   split 2, 4
4   match

While Cox does graciously provide the code for the example compiler used in his article, the code makes heavy use of C-language idioms & pointer manipulation which renders its implementation somewhat difficult to understand for those who arent as familiar with C and _still_ not the easiest to translate into language which don't treat pointers the way C does.

Being a fan of both regular expressions & compilers I knew what I had to do: Explain the process of converting a regular expression string into an instruction format suitable for execution by the virtual machines described in Russ Cox's article and do it in such a way that doesn't rely on language-specific implementation details. Well, lets get to it.

The Instruction Set

 Part of what makes this method both fascinating and fast is the small, yet thoughtful instruction set used to control the virtual machine. The basic implementation has 4 instructions: CHAR, SPLIT, JUMP, and MATCH. The CHAR instruction is used to place a literal for comparison with the current position in the string. SPLIT is used to branch a second execution path through the instructions, while the JUMP instruction is used to skip to another instruction provided as an operand to the instruction. MATCH indicates that we should accept the string being compared as matching the pattern the program encodes. Additionally, I added a HALT instruction, which stops all execution and is just a convenience, but not necessary.

package com.maxgcoding.vm;

public enum InstType {
    CHAR, MATCH, JMP, SPLIT, HALT
}

Each instruction is reprsented by an Instruction object containing its type indicated by the Enum shown above, a String to hold character operands, as well as pointers to the next Instruction. I'm using array indexes as pointers instead of pointing to the actual object like Cox did, as a) pointers behave the same way in Java as they do in C, and b) it will help with debugging as we can easily trace the instruction path.

package com.maxgcoding.vm;

import lombok.Data;
import lombok.experimental.Accessors;

@Data
@Accessors(chain = true)
public class Instruction {
    private InstType inst;
    private String operand;
    private Integer next;
    private Integer alternate;
}

None of the Instructions use EVERY field, with most only using one aside from the type field. The JUMP instruction uses the next field as its operand, while the CHAR instruction uses the String field as it's operand. SPLIT uses next and alt, and the MATCH instruction takes no operands!

The Compiler

Finally, were ready for the main event, and it should come as a surprise to nobody that we will once again find ourselves performing a depth first traversal over an abstract syntax tree. When working with compilers/interpreters it kind of comes with the territory.

package com.maxgcoding.compile;

import com.maxgcoding.parse.Node;
import com.maxgcoding.vm.InstType;
import com.maxgcoding.vm.Instruction;

public class ByteCodeCompiler {
    private Instruction[] code;
    private int ip;

    private void init(int MAX_INST) {
        code = new Instruction[MAX_INST];
        ip = 0;
    }

Each instruction is stored in an array as it's being emitted with the instruction pointer (ip) being used to track the current position in the array as we emit instructions. Our instruction set contains alot of forward references, as such compiling our AST into the instruction set makes prodigious use of back patching. 

    private void emit(Instruction inst) {
        code[ip++] = inst;
    }
    private int skipEmit(int numplace) {
        ip += numplace;
        return ip;
    }
    private void skipTo(int oldpos) {
        ip = oldpos;
    }

Back patching is a technique I've previously covered in my posts on compiling pascal statements to p-code. It's used during code generation for resolving forward references whos target may not be available at the time you need them. skipEmit(int offset) moves  the instruction pointer forward by a provided offset number of spaces, returning the new instruction pointer as its result. skipTo(int target) on the other hand changes the instruction pointer to whatever address is provided to it. 

    public Instruction[] compile(Node node) {
        init();
        build(node);
        emit(new Instruction().setInst(InstType.MATCH));
        return code;
    }

    private void handleOperators(Node node) {
        switch (node.getData().charAt(0)) {
            case '|': handleOrOperator(node); break;
            case '*': handleKleeneOp(node); break;
            case '+': handleAtLeastOnce(node); break;
            case '?': handleZeroOrOnce(node); break;
            case '@':  handleConcat(node); break;
            default:
                break;
        }
    }
    private void build(Node node) {
        switch (node.getType()) {
            case OPERATOR: {
                handleOperators(node);
            } break;
            case LITERAL: {
                handleLiteral(node);
            } break;         
        }
    }

At each node of the tree we dispatch a procedure based on the node type, to emit the apropriate instructions for the operator or literal the node represents. Once we have finished traversing the ast, we emit a final MATCH instruction and return the array containing the compiled instructions. 

Using our handy chart from above detailing the instruction sequences for our regular expressions, we can start implementing the procedures to emit them. The first is obviously the simplest, the CHAR instruction for handling literals. 

a       char a
    private void handleLiteral(Node node) {
        emit(new Instruction().setInst(InstType.CHAR).setOperand(node.getData()));
    }

When I said the previous was the easiest, I was actually lying, concatenation is actually the easiest, as we simply call the build() procedure recursively on each of it's children.

e1e2       codes for e1
    codes for e2
     private void handleConcat(Node node) {
           build(node.getLeft());
           build(node.getRight());
     }

Our first introduction to back patching forward references will be gentle, as we will use it to handle the '?' operator for matching a character that appears exactly zero or one times.

e?       split L1, L2
L1: codes for e
L2:
    private void handleZeroOrOnce(Node node) {
        int spos = skipEmit(0);
        skipEmit(1);
        build(node.getLeft());
        int L1 = skipEmit(0);
        skipTo(spos);
        emit(new Instruction().setInst(InstType.SPLIT).setNext(spos).setAlternate(L1+1));
        skipTo(L1);
    }

The first thing you probably noticed is the calls to skipEmit() with '0' are the parameter: this is used to get the current instruction pointer address. We do this to retrieve an address before calling skipEmit() with an offset so we can return to the skipped position to fill in the missing instructions. This way we can generate the code for the "forward" reference, and then go back and actually create it!

e+   L1: codes for e
    split L1, L3
L3:
    private void handleAtLeastOnce(Node node) {
        int spos = skipEmit(0);
        build(node.getLeft());
        int L1 = skipEmit(0);
        emit(new Instruction().setInst(InstType.SPLIT).setNext(spos).setAlternate(L1+1));
    }
e*   L1: split L2, L3
L2: codes for e
    jmp L1
L3:
    private void handleKleeneOp(Node node) {
        int spos = skipEmit(0);
        skipEmit(1);
        int L1 = skipEmit(0);
        build(node.getLeft());
        int L2 = skipEmit(0);
        skipTo(spos);
        emit(new Instruction().setInst(InstType.SPLIT).setNext(L1).setAlternate(L2+1));
        skipTo(L2);
        emit(new Instruction().setInst(InstType.JMP).setNext(spos));
    }
e1|e2       split L1, L2
L1: codes for e1
    jmp L3
L2: codes for e2
L3:
    private void handleOrOperator(Node node) {
        int spos = skipEmit(0);
        skipEmit(1);
        int L1 = skipEmit(0);
        build(node.getLeft());
        int L2 = skipEmit(0);
        skipEmit(1);
        build(node.getRight());
        int npos = skipEmit(0);
        skipTo(L2);
        emit(new Instruction().setInst(InstType.JMP).setNext(npos));
        skipTo(npos);
        int L3 = skipEmit(0);
        skipTo(spos);
        Instruction ni = new Instruction().setInst(InstType.SPLIT).setNext(L1).setAlternate(L2+1);
        emit(ni);
        skipTo(L3); 
    }

Testing It Out

public final class App {

    public static void main(String[] args) {
        App app = new App();
        app.vmexample("(a*b|a+c)d", List.of("aaaabd", "abd", "aaaacd", "acd", "bd", "cd"));
    }

    private void vmexample(String pattern, List<String> testStrings) {
        Parser p = new Parser();
        ByteCodeCompiler c = new ByteCodeCompiler();
        Node ast = p.parse(pattern);
        Traversal.traverse(ast);
        VirtualMachine vm = new VirtualMachine(c.compile(ast), 0);
        for (String str : testStrings) {
            if (vm.match(str)) {
                System.out.println("Match found!");
            } else {
                System.out.println("No match :(");
            }
        }
    }
   
    private App() {
    
    }
}

And the output:

Checking aaaabd with '(a*b|a+c)d':
Executing 0: [SPLIT    1 6]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 4: [CHAR    b ]
Executing 5: [JMP    9 null]
Executing 9: [CHAR    d ]
Executing 10: [MATCH    null null]
Match found!
Checking abd with '(a*b|a+c)d':
Executing 0: [SPLIT    1 6]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 4: [CHAR    b ]
Executing 5: [JMP    9 null]
Executing 9: [CHAR    d ]
Executing 10: [MATCH    null null]
Match found!
Checking aaaacd with '(a*b|a+c)d':
Executing 0: [SPLIT    1 6]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 4: [CHAR    b ]
Executing 4: [CHAR    b ]
Executing 4: [CHAR    b ]
Executing 4: [CHAR    b ]
Executing 4: [CHAR    b ]
Executing 6: [CHAR    a ]
Executing 7: [SPLIT    6 8]
Executing 6: [CHAR    a ]
Executing 7: [SPLIT    6 8]
Executing 6: [CHAR    a ]
Executing 7: [SPLIT    6 8]
Executing 6: [CHAR    a ]
Executing 7: [SPLIT    6 8]
Executing 6: [CHAR    a ]
Executing 8: [CHAR    c ]
Executing 9: [CHAR    d ]
Executing 10: [MATCH    null null]
Match found!
Checking acd with '(a*b|a+c)d':
Executing 0: [SPLIT    1 6]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 3: [JMP    1 null]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 4: [CHAR    b ]
Executing 4: [CHAR    b ]
Executing 6: [CHAR    a ]
Executing 7: [SPLIT    6 8]
Executing 6: [CHAR    a ]
Executing 8: [CHAR    c ]
Executing 9: [CHAR    d ]
Executing 10: [MATCH    null null]
Match found!
Checking bd with '(a*b|a+c)d':
Executing 0: [SPLIT    1 6]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 4: [CHAR    b ]
Executing 5: [JMP    9 null]
Executing 9: [CHAR    d ]
Executing 10: [MATCH    null null]
Match found!
Checking cd with '(a*b|a+c)d':
Executing 0: [SPLIT    1 6]
Executing 1: [SPLIT    2 4]
Executing 2: [CHAR    a ]
Executing 4: [CHAR    b ]
Executing 6: [CHAR    a ]
No match :(

So there you have it, compiling Regular Expressions for matching with the "VM approach". As usual, all code is available on my Github linked below. Until next time, Happy Hacking!

Further Reading

1) https://swtch.com/~rsc/regexp/regexp2.html

2) https://github.com/maxgoren/vmnfa


Leave A Comment