Post

Building a Register VM Interpreter - Part 5: Disassembler

This guide is about building a register-based VM interpreter from scratch. This part will inplement a Disassembler for easy debugging.

Building a Register VM Interpreter - Part 5: Disassembler

In the previous installment, we laid the fundational work for the interpreter’s memory by building the Chunk. We create a dynamic data structure capable of storing raw bytecode, a constant pool, and an array to track line numbers for debugging.

Now, the VM has a place to store code, but it does not actually have a language it can understand. That is why the Instruction Set Architecture (ISA) comes in.

But we have a glaring problem. If we were to compile a program and print out the contents of the Chunk, we would just see a raw arrays of integers. Staring at theses to figure out what the code is doing is absolutly a nightmare. To keep the sanity intact as we build the rest of the interpreter, we need a window into the virtual machine. We need a tool that can unpack a 32-bit integer and translate them back into a hunman-readable, assembly-like instructions. We need a Disassembler.

In this article, we are going to write a debugging module (debug.c and debug.h). It will iterate through the Chunk, unpack the instructions, and print them out alongside their source code line numbers and constant values.

Disassembler and Line Numbers

The core feature of the debug module is to take a Chunk and iterate through it, decoding and printing instructions one by one.

Let’s create two new files vm/debug.h and vm/debug.c, and take a look at the public interface defined:

1
2
3
4
5
6
7
8
9
10
11
12
#ifndef PICO_DEBUG_H
#define PICO_DEBUG_H

#include "chunk.h"

void dasmChunk(Chunk* chunk, const char* name);
void dasmInstruction(Chunk* chunk, int offset);
int getLine(const Chunk* chunk, int offset);



#endif  // PICO_DEBUG_H

The entry point to our disassembler is dasmChunk. This function is used to dissect an entire block of code, which we can lable with a name to give us some context.

1
2
3
4
5
6
void dasmChunk(Chunk* chunk, const char* name){
    printf("== %s ==\n", name);
    for(size_t offset = 0; offset < chunk->count; offset++){
        dasmInstruction(chunk, offset);
    }
}

Moving Offset

As we designed the instructions with a strictly fixed-width 32-bit set, every single operation takes exactly one slot in the array. So moving offset is just a standard offset++ iteration.

For many other stack-based VMs of languages implementations using raw byte arrays, they usually use variable-length instructions. For example, an OP_RETURN might take up exactly 1 byte, but an OP_LOADK might need 3 bytes (one for OpCode, and two for constant index). The loop cannot just blindly increment by 1. It has to ask the decoding function (dasmInstruction for example).

For the register-based VMs, moving to next instruction is always a guaranteed offset++. It perfectly mirrors how modern physical processors like RISC iterate through memory.

DAsm Output

Now let’s peek into the dasmInstruction to see how it handles printing debug informations before it unpacks the OpCode:

1
2
3
4
5
6
7
8
9
10
11
12
13
void dasmInstruction(Chunk* chunk, int offset){
    printf("offset: %04d ", offset);
    int line = chunk->lines[offset];

    if(offset > 0 && line == chunk->lines[offset - 1]){
        printf(CLR_GRAY "   ~ | " CLR_RESET);
    }else{
        printf(CLR_GREEN "%4d | " CLR_RESET, line);
    }

    // ... OpCode decoding ...

}

The ANSI colors are defined as following macros:

1
2
3
4
5
6
7
8
#define CLR_RESET   "\033[0m"
#define CLR_BOLD    "\033[1m"
#define CLR_GRAY    "\033[90m"
#define CLR_RED     "\033[31m"
#define CLR_GREEN   "\033[32m"
#define CLR_YELLOW  "\033[33m"
#define CLR_CYAN    "\033[36m"
#define CLR_MAGENTA "\033[35m"

We tackle the line number by int line = chunk->lines[offset]; from a parallel lines array. We check if the current offset is greater than 0 and if the current line number is identical to the line number of the previous instruction (chunk->lines[offset - 1]). If they match, instead of repeating the number, we print a simple skip character (~).

With this logic, the output transforms from a cluttered mess into instructions like this:

1
2
3
4
5
6
7
8
9
10
offset: 0000    1 | OP_IMPORT        ...
offset: 0001    ~ | OP_SET_GLOBAL    ...
offset: 0002    4 | OP_LOADK         ...
offset: 0003    ~ | OP_LOADK         ...
offset: 0004    ~ | OP_BUILD_LIST    ...
offset: 0005    ~ | OP_INIT_LIST     ...
offset: 0006    ~ | OP_MOVE          ...
offset: 0007    ~ | OP_SET_GLOBAL    ...
offset: 0008    5 | OP_GET_GLOBAL    ...
offset: 0009    ~ | OP_LOADK         ...

Now that we have our loop iterating perfectly and our line metadata neatly formatted, we are ready to unpack the 32-bit instruction itself and figure out what operation the VM is trying to perform.

Mapping OpCodes to Strings

It is the time to look at the actual 32-bit Instruction.

The very first thing we need to extract from this integer is the command itself. In previous part, we have defined the GET_OPCODE macro, which strips away the upper 24 bits and leaves us with the lowest 8 bits, the OpCode.

1
2
Instruction instruction = chunk->code[offset];
OpCode op = GET_OPCODE(instruction);

While extracting the opcode is easy, it leaves us a problem. The OpCode is simply an enum, meaning it actually evaluates to an integer under the hood. To fix this, we need a lookup table.

The Lookup Table

In debug.c, we define a static array of strings called opNames. The beautiful thing about C enums starting at 0 and incrementing by 1 is that they perfectly mirror array indices.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
static const char* opNames[] = {
    "OP_MOVE",        // R[A] <= R[B]
    "OP_LOADK",       // R[A] <= K[Bx]
    "OP_LOADBOOL",    // R[A] <= (B != 0)
    "OP_LOADNULL",    // R[A], R[A+1], ..., R[A+B-1] <= null

    "OP_GET_GLOBAL", // R[A] <= Gbl[K[Bx]]
    "OP_SET_GLOBAL", // Gbl[K[Bx]] <= R[A]

    "OP_GET_UPVAL",  // R[A] <= Upv[B]
    "OP_SET_UPVAL",  // Upv[B] <= R[A]
    "OP_CLOSE_UPVAL",

    "OP_GET_INDEX", // R[A] <= R[B][R[C]]
    "OP_SET_INDEX", // R[A][R[B]] <= R[C]

    "OP_GET_PROPERTY", // R[A] <= R[B].K[C]
    "OP_SET_PROPERTY", // R[A].K[B] <= R[C]

    "OP_FIELD",

    "OP_ADD", 
    "OP_SUB", 
    "OP_MUL", 
    "OP_DIV", 
    "OP_MOD",

    "OP_NOT", 
    "OP_NEG",

    "OP_EQ", 
    "OP_LT", 
    "OP_LE",

    "OP_JMP",
    "OP_JMP_IF_FALSE",  // R[A] is condition
    "OP_JMP_IF_TRUE",   // R[A] is condition
    "OP_CALL",
    "OP_TAILCALL",
    "OP_DEFER",
    "OP_SYSTEM",
    "OP_RETURN",

    "OP_CLOSURE",

    "OP_CLASS",
    "OP_METHOD",

    "OP_BUILD_LIST",
    "OP_BUILD_MAP",
    "OP_INIT_LIST",
    "OP_FILL_LIST",
    "OP_SLICE",
    "OP_TO_STRING",

    "OP_IMPORT",

    "OP_FOREACH",

    "OP_PRINT",
};

If op ends up being an integer larger than the size of our array, reading opNames[op] will result in a segmentation fault and crashing the entire disassembler. To prevent this, we just add a quick check in dasmInstruction:

1
2
int opCnt = sizeof(opNames) / sizeof(opNames[0]);
const char* opName = (op < opCnt) ? opNames[op] : "UNKNOWN";

Now that we have successfully translate the raw integer into a descriptive string, the next step is to decode the rest 32-bit instruction to find out what we are loading.

DAsm the Instruction Format

In Part 4, we designed the Instruction Set Architecture (ISA) specifically to avoid mess. Every single instruction falls into one of three patterns: iABC, iABx and iAsBx. Because all instructions fit these three patterns, the disassembler only needs a handful of decoding functions. Inside the dasmInstruction, we can use a simple switch statement to group the opcodes and route to the correct function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
switch(op){
        // iABC
        case OP_MOVE:
        case OP_LOADBOOL:
        case OP_LOADNULL:
        case OP_GET_UPVAL:
        case OP_SET_UPVAL:
        case OP_CLOSE_UPVAL:

        case OP_GET_INDEX:
        case OP_SET_INDEX:

        case OP_ADD:
        case OP_SUB:
        case OP_MUL:
        case OP_DIV:
        case OP_MOD: 

        case OP_EQ: 
        case OP_LT: 
        case OP_LE:

        case OP_CALL: 
        case OP_TAILCALL: 
        case OP_RETURN:

        case OP_METHOD:

        case OP_BUILD_LIST: 
        case OP_BUILD_MAP: 
        case OP_INIT_LIST:
        case OP_FILL_LIST:
        case OP_SLICE:
        case OP_TO_STRING:
        case OP_DEFER:
        case OP_SYSTEM:
        case OP_PRINT:
            dasmABC(opName, instruction);
            break;

        // iAB(C==0)
        case OP_NEG:
        case OP_NOT:
            dasmABC(opName, instruction);
            break;

        // iABx
        case OP_CLOSURE:
        case OP_IMPORT:
        case OP_CLASS:
            dasmABx(opName, instruction);
            break;
        
        // iABx (with constant & global)
        case OP_LOADK:
            dasmLoadK(opName, chunk, instruction);
            break;

        case OP_GET_GLOBAL:
        case OP_SET_GLOBAL:
            dasmGlobal(opName, chunk, instruction);
            break;

        case OP_FIELD:
        case OP_GET_PROPERTY:
        case OP_SET_PROPERTY:
            dasmField(opName, chunk, instruction);
            break;

        // iAsBx
        case OP_JMP:
        case OP_JMP_IF_FALSE:
        case OP_JMP_IF_TRUE:
        case OP_FOREACH:
            dasmAsBx(opName, instruction);
            break;

        default:
            printf(
                CLR_RED "%-16s" CLR_GRAY " | " \
                CLR_RED "%-5s" CLR_GRAY " | " \
                CLR_RED "%4s" CLR_GRAY " | " \
                CLR_RED "%5s" CLR_GRAY " | " \
                CLR_RED "%5s" CLR_GRAY " | " \
                CLR_RESET, "UNKNOWN", "?", "?", "?", "?"
            );
            break;
    }

    printf("\n");

We don’t care what OP_ADD or OP_MUL do mathematically. From the disassembler’s perspective, they are exactly the same: they are iABC instructions. They both expect an 8-bit A register, an 8-bit B register/constant, and an 8-bit C register/constant. Therefore, we just pass the raw 32-bit integer and the string name to dasmABC and let it do the unpacking job.

Formatting the Operands

Let’s look at dasmABC, which handles the standard 3-register operations.

1
2
3
4
5
6
7
8
9
10
11
12
13
static void dasmABC(const char* name, Instruction instruction){
    int a = GET_ARG_A(instruction);
    int b = GET_ARG_B(instruction);
    int c = GET_ARG_C(instruction);
    printf(
        CLR_CYAN "%-16s" CLR_GRAY " | " \
        CLR_MAGENTA "%-5s" CLR_GRAY " | " \
        CLR_YELLOW "%4d" CLR_GRAY " | " \
        CLR_YELLOW "%5d" CLR_GRAY " | " \
        CLR_YELLOW "%5d" CLR_GRAY " | " \
        CLR_RESET, name, "iABC", a, b, c
    );
}

We use printf to set paddings, creating tabular outputs.

We can write more in this format:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
static void dasmABx(const char* name, Instruction instruction){
    int a = GET_ARG_A(instruction);
    int bx = GET_ARG_Bx(instruction);
    printf(
        CLR_CYAN "%-16s" CLR_GRAY " | " \
        CLR_MAGENTA "%-5s" CLR_GRAY " | " \
        CLR_YELLOW "%4d" CLR_GRAY " | " \
        CLR_YELLOW "%5d" CLR_GRAY " | " \
        CLR_GRAY "%5s" CLR_GRAY " | " \
        CLR_RESET, name, "iABx", a, bx, "-"
    );
}

static void dasmAsBx(const char* name, Instruction instruction){
    int a = GET_ARG_A(instruction);
    int sbx = GET_ARG_sBx(instruction);
    printf(
        CLR_CYAN "%-16s" CLR_GRAY " | " \
        CLR_MAGENTA "%-5s" CLR_GRAY " | " \
        CLR_YELLOW "%4d" CLR_GRAY " | " \
        CLR_YELLOW "%5d" CLR_GRAY " | " \
        CLR_GRAY "%5s" CLR_GRAY " | " \
        CLR_RESET, name, "iAsBx", a, sbx, "-"
    );
}

However, if we disassemble a simple program like: print "Hello";, we might get an instruction like this:

1
offset: 0000    ~ | OP_LOADK         | iABx  |   1 |    5 |

An iABx instruction tells the VM to “Load the Constant at index 5 into Register 1”. But as a human, seeing index 5 is useless, virtually.We have no idea what index 5 is, unless we manually dump the constant pool and count to the fifth element.

So, dasmLoadK should be implemented with a pool peeking:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
static void dasmLoadK(const char* name, const Chunk* chunk, Instruction instruction){
    int a = GET_ARG_A(instruction);
    int bx = GET_ARG_Bx(instruction);
    Value constant = chunk->constants.values[bx];
    printf(
        CLR_CYAN "%-16s" CLR_GRAY " | " \
        CLR_MAGENTA "%-5s" CLR_GRAY " | " \
        CLR_YELLOW "%4d" CLR_GRAY " | " \
        CLR_YELLOW "%5d" CLR_GRAY " | " \
        CLR_GRAY "%5s" CLR_GRAY " | " \
        CLR_GRAY "'", name, "iABx", a, bx, "-"
    );
    printValue(constant);
    printf(CLR_RESET);
}

We apply this exact same logic to other DAsm functions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
/*********************************************
 *
 * more dasm functions
 * 
 *********************************************/
 
static void dasmGlobal(const char* name, Chunk* chunk, Instruction instruction){
    int a = GET_ARG_A(instruction);
    int bx = GET_ARG_Bx(instruction);
    Value constant = chunk->constants.values[bx];
    printf(
        CLR_CYAN "%-16s" CLR_GRAY " | " \
        CLR_MAGENTA "%-5s" CLR_GRAY " | " \
        CLR_YELLOW "%4d" CLR_GRAY " | " \
        CLR_YELLOW "%5d" CLR_GRAY " | " \
        CLR_GRAY "%5s" CLR_GRAY " | " \
        CLR_GRAY "'", name, "iABx", a, bx, "-"
    );
    printValue(constant);
    printf(CLR_RESET);
}

static void dasmField(const char* name, const Chunk* chunk, Instruction instruction){
    int a = GET_ARG_A(instruction);
    int b = GET_ARG_B(instruction);
    int c = GET_ARG_C(instruction);
    Value constant = chunk->constants.values[c];
    printf(
        CLR_CYAN "%-16s" CLR_GRAY " | " \
        CLR_MAGENTA "%-5s" CLR_GRAY " | " \
        CLR_YELLOW "%4d" CLR_GRAY " | " \
        CLR_YELLOW "%5d" CLR_GRAY " | " \
        CLR_YELLOW "%5d" CLR_GRAY " | " \
        CLR_GRAY "'", name, "iABC", a, b, c
    );
    printValue(constant);
    printf(CLR_RESET);
}

With these functions implemented, the disassembler is fully functional for basic instructions. The output would look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
== Compiled code: <script> ==
== <script> ==
offset: 0039    ~ | OP_LOADK         | iABx  |   26 |    19 |     - | 'List pop returns last element
offset: 0040    ~ | OP_MOVE          | iABC  |   22 |    26 |     0 | 
offset: 0041    ~ | OP_CALL          | iABC  |   19 |     4 |     2 | 
offset: 0042    9 | OP_GET_GLOBAL    | iABx  |   25 |    20 |     - | 'assert
offset: 0043    ~ | OP_LOADK         | iABx  |   26 |    21 |     - | 'eq
offset: 0044    ~ | OP_GET_PROPERTY  | iABC  |   27 |    25 |    26 | 'filled
offset: 0045    ~ | OP_MOVE          | iABC  |   28 |    27 |     0 | 
offset: 0046    ~ | OP_GET_GLOBAL    | iABx  |   29 |    22 |     - | 'list
offset: 0047    ~ | OP_LOADK         | iABx  |   30 |    23 |     - | 'size
offset: 0048    ~ | OP_GET_PROPERTY  | iABC  |   31 |    29 |    30 | 'eq
offset: 0049    ~ | OP_MOVE          | iABC  |   32 |    31 |     0 | 
offset: 0050    ~ | OP_CALL          | iABC  |   32 |     1 |     2 | 
offset: 0051    ~ | OP_MOVE          | iABC  |   29 |    32 |     0 | 
offset: 0052    ~ | OP_LOADK         | iABx  |   30 |    24 |     - | '2
offset: 0053    ~ | OP_LOADK         | iABx  |   35 |    25 |     - | 'List size after pop
offset: 0054    ~ | OP_MOVE          | iABC  |   31 |    35 |     0 | 
offset: 0055    ~ | OP_CALL          | iABC  |   28 |     4 |     2 | 
offset: 0056   12 | OP_LOADK         | iABx  |   35 |    27 |     - | '0
offset: 0057    ~ | OP_LOADK         | iABx  |   36 |    28 |     - | '5
offset: 0058    ~ | UNKNOWN          | ?     |    ? |     ? |     ? |

Wrapping up

By building this disassembler, we have turned a black box into a glass enigne. We can now inspect every single operation, verify our register allocations, and peek into the constant pool to ensure strings and global variables are being referenced correctly. Till now, the interpreter has already got the memory structures (Chunk), the vocabulary (ISA), and the debugging tools ready to go.

Then we need to actually get the code compiled and executed inside the VM.

This post is licensed under CC BY 4.0 by the author.