Building a Register VM Interpreter - Part 4: Bytecode and Chunk
This guide is about building a register-based VM interpreter from scratch. This part defines the instruction set and builds the Chunk, the VM's container for compiled bytecode.
In the previous part, we built a zero-allocation Scanner capable of stripping away noise and translating raw code into an on-demand stream of lexical tokens.
However, a VM cannot execute tokens. Tokens are just grammatical hints for the compiler. To bridge this gap, the compiler must translate these tokens into a linear sequence of operations known as Bytecode.
The bytecode is the machine code for an imaginary CPU. Designing an instruction set is one of the most critical architectural decisions in the entire interpreter. This part defines the set of instructions and builds the Chunk, the data structure responsible for holding the compiled bytecode in the VM.
The Register-Based Instruction
As discussed in Part 1, the landscape of VMs is generally divided into two camps: stack-based and register-based. We chose the register-based path.
In a stack-based VM, adding two numbers requires multiple distinct instructions to push operands onto a stack and pop them off. In a register-based one, instructions operate directly on a window of virtual registers.
To make this work, our instructions need to be large enough to hold not just the operation itself, but also the indices of the registers it needs to access.
In vm/instruction.h, we define the fundamental instruction type:
typedef uint32_t Instruction;
Every single instruction in the VM is exactly 32 bits wide.
The Fixed-Width Design
Some VMs use variable-length instructions (e.g., a 1-byte opcode followed by an optional 1-byte or 2-byte operand). While that saves memory, it significantly slows down the execution loop: the VM has to check the opcode to know how many bytes it needs to complete an instruction.
This creates a severe “data dependency” at the hardware level. In a variable-length architecture (like the JVM or CPython), the host CPU cannot know the starting memory address of the next instruction until it has fully decoded the current one. This dependency breaks CPU pipelines, stalling hardware instruction prefetching and branch prediction.
By utilizing a fixed-width 32-bit instruction set, fetching the next instruction is as simple as bumping a C pointer by 1. It is a completely predictable memory access, aligns perfectly with caches, and keeps the VM's decoding loop incredibly fast. Because this access pattern is completely linear and predictable, the CPU's hardware prefetcher can aggressively load upcoming instructions into the ultra-fast L1 cache before the VM's dispatch loop actually asks for them.
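To make the pointer-bump fetch concrete, here is a minimal standalone sketch (not the VM's real dispatch loop) showing that advancing to the next instruction is a single pointer increment, with no length decoding in between:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

typedef uint32_t Instruction;

/* Walk a bytecode array. Fetching is just *ip++; the "decode" here
 * only peeks at the low 8 opcode bits to show where they live. */
static uint32_t run(const Instruction* code, size_t count) {
    const Instruction* ip = code;   /* instruction pointer */
    uint32_t checksum = 0;
    for (size_t i = 0; i < count; i++) {
        Instruction inst = *ip++;   /* fetch: bump the pointer by 1 */
        checksum += inst & 0xFF;    /* the opcode sits in the lowest byte */
    }
    return checksum;
}
```

Because every element is exactly 4 bytes, the compiler turns `*ip++` into one load plus one constant-offset address bump.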
The Bit Layout
Packing an operation into a single 32-bit integer requires specific bit boundaries. This VM utilizes a Little-Endian layout, which means the actual operation code is placed in the lowest 8 bits, followed by the operands.
Field Positions:

       31     24 23     16 15      8 7       0
       +--------+--------+--------+--------+
       |   C    |   B    |   A    | OpCode |
       +--------+--------+--------+--------+
Bits:   8 bits   8 bits   8 bits   8 bits
The lowest 8 bits (Bits 0-7) hold the instruction ID itself (e.g., OP_ADD). Giving this field 8 bits means the VM can support up to 256 unique instructions, which is enough for a fully-featured scripting language.
The remaining 24 bits are for registers:
- Register A (Bits 8-15): the target register index.
- Register B (Bits 16-23): the first source operand (usually a register index).
- Register C (Bits 24-31): the second source operand.
The 8-bit width lets a single function access up to 256 local registers at any given time. This layout ensures that the VM can decode and execute instructions instantly.
Instruction Patterns
While every instruction is strictly 32 bits with 8 bits of OpCode at the bottom, the remaining 24 bits can be split in different ways depending on what the operation requires. Not every operation needs three distinct registers: some need large constants, while others need to jump across the bytecode.
We can define several distinct parsing patterns in vm/instruction.h to handle these varying requirements.
iABC
The iABC pattern is the VM's standard one. It divides the 24 bits evenly into three 8-bit fields: A, B, and C.
| C (8 bits) | B (8 bits) | A (8 bits) | OpCode (8 bits) |
Take the addition instruction as an example (OP_ADD). If the compiler encounters the code x = a + b, it will generate an iABC instruction that essentially expresses: OP_ADD R(A), R(B), R(C). The VM reads this, fetches the values from register B and register C, adds them, and stores the result into register A.
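As a sketch of the runtime side (the inlined decode macros and the register-window handling here are simplified assumptions, not the project's final dispatch code), executing an OP_ADD-shaped instruction boils down to three decoded indices and one array operation:

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t Instruction;
typedef double Value;               /* stand-in for the NaN-boxed Value */

#define GET_ARG_A(i) ((int)(((i) >> 8)  & 0xFF))
#define GET_ARG_B(i) ((int)(((i) >> 16) & 0xFF))
#define GET_ARG_C(i) ((int)(((i) >> 24) & 0xFF))

/* R(A) = R(B) + R(C), exactly as described in the text. */
static void exec_add(Instruction inst, Value* regs) {
    regs[GET_ARG_A(inst)] = regs[GET_ARG_B(inst)] + regs[GET_ARG_C(inst)];
}
```

Note that nothing here is conditional: the VM decodes three fixed fields and acts, which is exactly the predictability the fixed layout buys us.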
Since B and C each have a width of 8 bits, they can hold values from 0 to 255. This is enough for the local variables in a call frame. But what if a is not a local variable, but a constant?
In some architectures (like standard Lua), 9-bit fields are used where the highest bit acts as a flag (e.g., 1 means query the constant pool, 0 means locate a register). However, this project's VM uses strictly 8-bit fields, leaving zero room for flag bits.
We can solve this problem by shifting the semantic responsibility entirely to the OpCode design. For example, an instruction like OP_ADD strictly and blindly assumes that its operands are registers. The VM's execution loop does not waste cycles doing bit-testing or conditional branching to figure out what an operand is. It just fetches registers directly.
But what if a constant is located at an index over 255 in the chunk's constant array? The constant pool for a large script could easily exceed 256 items, and an 8-bit field physically cannot hold such an index.
iABx
To overcome the limitation of 8-bit fields, some register-based VMs introduced the iABx format. When an instruction requires a much larger range, we can merge the B and C fields together.
| Bx (16 bits) | A (8 bits) | OpCode (8 bits) |
Now we have a 16-bit unsigned integer field known as Bx. Instead of a maximum value of 255, the Bx field can address up to 65535 slots (MAX_ARG_BX).
#define MAX_ARG_BX 65535
An example of this is the OP_LOADK (Load Konstant) instruction. Its structure is OP_LOADK R(A), K(Bx). The VM looks at the 16-bit Bx value, goes to the constant array at that exact index, and copies the 64-bit Value into the local register designated by the 8-bit A field.
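A simplified sketch of that runtime behavior (assuming a plain double as a stand-in for the NaN-boxed Value, with the decode macros inlined):

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t Instruction;
typedef double Value;                /* stand-in for the 64-bit boxed Value */

#define GET_ARG_A(i)  ((int)(((i) >> 8)  & 0xFF))
#define GET_ARG_Bx(i) ((int)(((i) >> 16) & 0xFFFF))

/* OP_LOADK semantics: R(A) = K(Bx).
 * The 16-bit Bx can reach constant slots far beyond index 255. */
static void exec_loadk(Instruction inst, Value* regs, const Value* constants) {
    regs[GET_ARG_A(inst)] = constants[GET_ARG_Bx(inst)];
}
```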
iAsBx
The iABC and iABx patterns deal only with unsigned integers (memory addresses, register slots, and array indices are never negative). But a VM also needs control flow: if, for, and while.
Control flow is usually implemented via “jumps”, which will be discussed next time. A jump instruction tells the VM’s instruction pointer whether to skip ahead (for an if statement, for example) or move backward (to repeat a loop). To jump backward, the offset must be negative.
Of course we could use standard C signed integers (Two’s Complement), but extracting a Two’s Complement value from a sub-field of the instruction requires cumbersome bit-shifting and sign-extension logic that wastes CPU cycles in the execution loop.
Instead, here is a technique for the iAsBx pattern: Biasing.
#define OFFSET_sBx (MAX_ARG_BX >> 1) // 32767
#define GET_ARG_sBx(i) (GET_ARG_Bx(i) - OFFSET_sBx)
The sBx (Signed Bx) field occupies the exact same 16 bits as a Bx field. However, the VM interprets it with a bias of half the maximum value (MAX_ARG_BX >> 1), which gives us 32767.
This 32767 becomes the “zero”. If the raw 16-bit integer is exactly 32767, the actual jump offset is 0. The compiler simply adds OFFSET_sBx to the jump offset before packing the instruction. The VM then reads it as a standard, fast unsigned integer, subtracts the constant bias (GET_ARG_Bx(i) - OFFSET_sBx), and instantly has a perfectly valid signed offset.
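Here is a tiny standalone sketch of the biasing round trip, with the constants duplicated locally so it can be checked in isolation:

```c
#include <stdint.h>
#include <assert.h>

#define MASK_BX    0xFFFF
#define OFFSET_sBx (MASK_BX >> 1)   /* 32767: the biased "zero" */

/* Compiler side: bias a signed offset into an unsigned 16-bit field. */
static uint32_t encode_sBx(int offset)   { return (uint32_t)(offset + OFFSET_sBx); }

/* VM side: a single subtraction recovers the signed offset. */
static int      decode_sBx(uint32_t raw) { return (int)raw - OFFSET_sBx; }
```

The usable jump range is therefore -32767 to +32768, plenty for any realistic loop body.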
Encoding and Decoding
Now we have figured out a logical blueprint for instructions (iABC, iABx, iAsBx), but how does the interpreter pack these distinct variables into a single 32-bit integer in C?
A tempting, naive approach in C is to construct bitfields inside a struct:
typedef struct {
    uint8_t op: 8;
    uint8_t a: 8;
    uint8_t b: 8;
    uint8_t c: 8;
} Instruction;
Well, this looks clean and readable. However, it is a well-known trap in systems programming and compiler design. The C standard does not guarantee how bitfields are packed in memory. Depending on the compiler (though the project only targets GCC, it is still worth mentioning) and the endianness, op might end up in the highest 8 bits instead of the lowest, or the compiler might insert invisible padding bytes.
When we mentioned a “Little-Endian” layout earlier (referring to the OpCode sitting in the lowest 8 bits), there is a crucial distinction to make between physical memory layout and logical value representation.
The C struct bitfield approach relies on how the compiler maps fields to physical memory bytes. On a physical little-endian machine (like x86), the compiler might pack the first defined field into the lowest byte. On a big-endian machine, it might do the exact opposite. If we serialize the chunk on a Linux x64 machine and load it on an ARM Mac, the bytecode could be completely corrupted.
To guarantee absolute determinism and cross-platform compatibility, the VM handles instruction packing entirely through raw bitwise arithmetic.
Cascading Configuration
In vm/instruction.h, we do not just hardcode numbers; we define a cascading configuration:
#define SIZE_OP 8
#define SIZE_A 8
#define SIZE_B 8
#define SIZE_C 8
#define SIZE_BX (SIZE_C + SIZE_B)

#define POS_OP 0
#define POS_A (POS_OP + SIZE_OP)
#define POS_B (POS_A + SIZE_A)
#define POS_C (POS_B + SIZE_B)
#define POS_BX POS_B
From these sizes, we can calculate exact bitmasks:
#define MASK_OP ((1 << SIZE_OP) - 1)
#define MASK_A ((1 << SIZE_A) - 1)
#define MASK_B ((1 << SIZE_B) - 1)
#define MASK_C ((1 << SIZE_C) - 1)
#define MASK_BX ((1 << SIZE_BX) - 1)
Shifting 1 to the left by 8 bits gives 100000000 in binary (256). Subtracting 1 yields 011111111 (255), which offers a perfect 8-bit mask.
Constructing Instructions
When the compiler generates bytecode, it uses the CREATE_ABC macro to pack the operands:
// Constructors
#define CREATE_ABC(op, a, b, c) \
    ((Instruction)((op & MASK_OP) << POS_OP) | \
     ((Instruction)(a & MASK_A) << POS_A) | \
     ((Instruction)(b & MASK_B) << POS_B) | \
     ((Instruction)(c & MASK_C) << POS_C))
Let’s break down the logic. What if a bug tries to pass 9 bits into the a parameter? If we just shifted it, that 9th bit would bleed into the B operand’s territory, corrupting the instruction. Bitwise ANDing (&) the input with the mask acts as a quarantine, chopping off any excess bits. Once the bits are safely truncated, the value is shifted to the left by POS_A (8 bits). Finally, bitwise OR merges the isolated, shifted blocks together into a single, compact Instruction.
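The header would plausibly pair CREATE_ABC with constructors for the other patterns. The CREATE_ABX and CREATE_ASBX names below are assumptions modeled on the macro above, not confirmed project code; they follow the exact same mask-then-shift scheme:

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t Instruction;

#define POS_OP  0
#define POS_A   8
#define POS_BX  16
#define MASK_OP 0xFF
#define MASK_A  0xFF
#define MASK_BX 0xFFFF
#define OFFSET_sBx (MASK_BX >> 1)

/* Pack an A register and a 16-bit Bx operand (e.g., for OP_LOADK). */
#define CREATE_ABX(op, a, bx) \
    (((Instruction)((op) & MASK_OP) << POS_OP) | \
     ((Instruction)((a)  & MASK_A)  << POS_A)  | \
     ((Instruction)((bx) & MASK_BX) << POS_BX))

/* Pack a signed sBx operand by applying the bias before encoding. */
#define CREATE_ASBX(op, a, sbx) CREATE_ABX(op, a, (sbx) + OFFSET_sBx)
```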
Decoding
During runtime, the VM’s execution loop needs to rip this 32-bit instruction back apart. We can define unpack macros like these:
// Getters
#define GET_OPCODE(i) ((OpCode)((i >> POS_OP) & MASK_OP))
#define GET_ARG_A(i) ((int)((i >> POS_A) & MASK_A))
#define GET_ARG_B(i) ((int)((i >> POS_B) & MASK_B))
#define GET_ARG_C(i) ((int)((i >> POS_C) & MASK_C))
#define GET_ARG_Bx(i) ((int)((i >> POS_BX) & MASK_BX))

// Signed Bx
#define GET_ARG_sBx(i) (GET_ARG_Bx(i) - OFFSET_sBx)
To get argument B, for example, we shift the entire instruction to the right by POS_B (16 bits) (i >> POS_B). This pushes the OpCode and the A operand completely off the cliff into oblivion. Applying & MASK_B then wipes out the C field still sitting above, leaving only the pure original B integer.
Because these are basic bitwise operations, modern C compilers translate them into single assembly instructions. The VM can decode operands practically as fast as the CPU can fetch them.
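A standalone round-trip check (with the positions and masks inlined) confirms that the constructor and the getters agree, and that the mask quarantine really truncates oversized operands:

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t Instruction;
typedef int OpCode;                 /* stand-in for the real enum */

#define POS_OP 0
#define POS_A  8
#define POS_B  16
#define POS_C  24
#define MASK_8 0xFF

#define CREATE_ABC(op, a, b, c) \
    (((Instruction)((op) & MASK_8) << POS_OP) | \
     ((Instruction)((a)  & MASK_8) << POS_A)  | \
     ((Instruction)((b)  & MASK_8) << POS_B)  | \
     ((Instruction)((c)  & MASK_8) << POS_C))

#define GET_OPCODE(i) ((OpCode)(((i) >> POS_OP) & MASK_8))
#define GET_ARG_A(i)  ((int)(((i) >> POS_A) & MASK_8))
#define GET_ARG_B(i)  ((int)(((i) >> POS_B) & MASK_8))
#define GET_ARG_C(i)  ((int)(((i) >> POS_C) & MASK_8))
```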
The Chunk
With the 32-bit fixed-width instruction fully defined, the VM has a standard vocabulary. But a language is more than isolated words; it is a sequence of statements. The compiler needs a place to store the instructions. This storage container is called a Chunk.
The chunk is a localized block of bytecode. Every function, every method, and the top-level script itself will have its own chunk to hold execution instructions.
In core/chunk.h, the Chunk struct is defined as follows, which might look like a simple collection of different arrays:
typedef struct Chunk {
    Instruction* code;
    size_t count;
    size_t capacity;
    ValueArray constants;
    int* lines;
    int lineCount;
    int lineCapacity;
} Chunk;
Code Array
The code pointer is a dynamically allocated C array that holds the actual 32-bit Instruction integers. A size_t count tracks how many instructions are currently in the array, and capacity records how much space has been allocated.
Instead of using a linked list of instructions, using a dynamic contiguous array is better for CPU cache locality.
When the VM starts executing a chunk, the host CPU automatically loads stretches of the array from RAM into its L1 cache. Because a C array guarantees that every instruction is physically right next to the previous one in memory, the CPU’s prefetcher can predict exactly what memory we will need next. If we used a linked list, pointers would force the CPU to jump randomly around the heap, causing cache misses that stall the CPU for massive numbers of cycles per instruction.
Constant Pool
The next field is ValueArray constants. As we discussed in Part 2, the VM represents all values using a 64-bit Value via NaN-Boxing.
But wait… the instructions are strictly 32 bits wide. It is impossible to fit a 64-bit double into a 32-bit instruction (for example, if the user writes var x = 123.45). Instead, we store the 64-bit Value inside the chunk’s constants array. The compiler then takes the integer index of where the value was placed and packs that index into the 16-bit Bx field of an OP_LOADK instruction.
In core/chunk.c, adding a constant is straightforward:
int addConstant(VM* vm, Chunk* chunk, Value value){
    writeValueArray(vm, &chunk->constants, value);
    return (int)(chunk->constants.count - 1);
}
It delegates the memory management to the writeValueArray we built previously, and simply returns the index of the newly added value. The compiler will use this returned index to construct the instruction. This architecture ensures the instructions remain small and fast, while allowing the use of large data types.
While addConstant successfully gets the data into the chunk, it currently has a naive behavior: it blindly appends every value it is given. If a user writes a script that uses the number 3.14 or the string "hello" many times, the current compiler will create that many duplicated entries in the chunk’s constant array.
Some industrial VMs introduce constant deduplication. Instead of blindly appending, the compiler first checks whether an identical value (especially a heap-allocated string or large number) already exists in the constant pool. If it does, addConstant simply returns the existing index. This ensures that a frequently used value only ever occupies a single slot per chunk, keeping the bytecode compact.
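A minimal sketch of what deduplication could look like. This is not the project's code: it uses a plain double pool and == comparison where the real VM would use its ValueArray and a proper valuesEqual helper, and a production VM would replace the linear scan with a hash table:

```c
#include <assert.h>

typedef double Value;   /* simplified stand-in for the NaN-boxed Value */

typedef struct {
    Value values[256];
    int count;
} ConstantPool;

/* Deduplicating addConstant: reuse an existing slot when the value is
 * already in the pool, otherwise append. Linear scan is O(n) per call. */
static int addConstantDedup(ConstantPool* pool, Value value) {
    for (int i = 0; i < pool->count; i++) {
        if (pool->values[i] == value) return i;   /* already interned */
    }
    pool->values[pool->count] = value;
    return pool->count++;
}
```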
Emitting Bytecode
Now we have the Chunk struct designed, but a struct full of pointers is just a blueprint. We need safe allocation, growth, and eventually a way to release the memory that holds the bytecode.
Before the compiler can emit any bytecode, it must prepare a fresh chunk. In core/chunk.c, we define initChunk to establish a clean, zeroed-out chunk:
void initChunk(Chunk* chunk){
    chunk->code = NULL;
    chunk->count = 0;
    chunk->capacity = 0;
    chunk->lines = NULL;
    chunk->lineCount = 0;
    chunk->lineCapacity = 0;
    initValueArray(&chunk->constants);
}
By explicitly assigning the pointers to NULL and numbers to 0, we ensure that the first time the compiler tries to write to this chunk, it will recognize it is empty and allocate the initial block of memory. We also initialize the nested constants array, ensuring the constant pool is ready to receive values.
When the compiler translates a piece of source code, it generates a 32-bit Instruction and needs to append it to the chunk.
Because we don’t know how many lines of code a user’s function will contain until we actually parse and compile it, the code array must be able to grow dynamically.
In core/chunk.c, we append a new instruction to the chunk with function writeChunk:
void writeChunk(VM* vm, Chunk* chunk, Instruction instruction, int line){
    if(chunk->count + 1 > chunk->capacity){
        size_t oldCapacity = chunk->capacity;
        chunk->capacity = GROW_CAPACITY(oldCapacity);
        chunk->code = GROW_ARRAY(
            vm,
            Instruction,
            chunk->code,
            oldCapacity,
            chunk->capacity
        );
        // Grow the lines array in lockstep so that index i in code
        // always corresponds to index i in lines.
        chunk->lines = GROW_ARRAY(
            vm,
            int,
            chunk->lines,
            oldCapacity,
            chunk->capacity
        );
    }
    chunk->code[chunk->count] = instruction;
    chunk->lines[chunk->count] = line;
    chunk->count++;
}
When the capacity check finds that one more instruction (count + 1) would exceed the currently allocated space (capacity), we calculate a new capacity using the GROW_CAPACITY(oldCapacity) macro. This typically doubles the array size. Doubling amortizes the cost of allocation: as the array gets larger, allocations become exponentially rarer.
#define GROW_CAPACITY(capacity) \
    ((capacity) < 8 ? 8 : (capacity) * 2)
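Because of doubling, the number of reallocations grows logarithmically with the instruction count. A small standalone counter demonstrates the amortization:

```c
#include <assert.h>

/* Same doubling macro as above: start at 8 slots, then double. */
#define GROW_CAPACITY(capacity) \
    ((capacity) < 8 ? 8 : (capacity) * 2)

/* Count how many reallocations it takes to append n instructions.
 * Doubling means the answer is logarithmic in n, not linear. */
static int growsNeeded(int n) {
    int capacity = 0, grows = 0;
    for (int count = 0; count < n; count++) {
        if (count + 1 > capacity) {
            capacity = GROW_CAPACITY(capacity);
            grows++;
        }
    }
    return grows;
}
```

Appending 1000 instructions triggers only 8 reallocations (capacities 8, 16, 32, 64, 128, 256, 512, 1024).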
Here writeChunk uses a GROW_ARRAY macro to reallocate the chunk->code pointer to the newly sized block of memory.
#define GROW_ARRAY(vm, type, pointer, oldCount, newCount) \
    (type*)reallocate(vm, pointer, sizeof(type) * (oldCount), \
        sizeof(type) * (newCount))
Finally, with enough space, we insert the instruction at the current count index and bump the count by one.
By handling memory growth abstractly behind the GROW_ARRAY macro, the compiler never has to worry about malloc or realloc failing or tracking pointers. It just calls writeChunk and trusts that the VM will make room.
We also grow a parallel lines array at the exact same time. This guarantees that index 5 in the code array perfectly corresponds to index 5 in the lines array. We will dive into why this is crucial for error reporting in a later part.
Every instruction is a 32-bit integer, and every line number is also a standard C int (typically 32 bits). This 1:1 mapping literally doubles the memory footprint of our compiled bytecode. In a production-grade virtual machine (like CPython or the JVM), this would be an unacceptable waste, because dozens of consecutive bytecodes often originate from the exact same line of source code.
A common industrial optimization we could implement in the future is Run-Length Encoding (RLE). Instead of storing raw duplicate line numbers like [10, 10, 10, 10, 11, 11], the chunk would only record compressed metadata like “Line 10 spans 4 instructions; Line 11 spans 2 instructions”. This would drastically compress the debug information, often reducing the metadata memory footprint by over 90%.
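A sketch of what that RLE scheme could look like, using a hypothetical LineRun struct (none of this exists in the current chunk; the names are illustrative only):

```c
#include <assert.h>

/* Hypothetical run-length encoding for line info: one entry per run of
 * consecutive instructions that share a source line, instead of one
 * int per instruction. */
typedef struct {
    int line;    /* source line number */
    int count;   /* how many consecutive instructions it covers */
} LineRun;

/* Record one instruction's line: extend the current run or start a new one. */
static void pushLine(LineRun* runs, int* runCount, int line) {
    if (*runCount > 0 && runs[*runCount - 1].line == line) {
        runs[*runCount - 1].count++;
    } else {
        runs[*runCount].line = line;
        runs[*runCount].count = 1;
        (*runCount)++;
    }
}

/* Recover the line for an instruction index by walking the runs. */
static int getLine(const LineRun* runs, int runCount, int instrIndex) {
    for (int i = 0; i < runCount; i++) {
        if (instrIndex < runs[i].count) return runs[i].line;
        instrIndex -= runs[i].count;
    }
    return -1;   /* index out of range */
}
```

Looking up a line becomes O(runs) instead of O(1), but line lookups only happen on the error path, where speed does not matter.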
Freeing Chunks
Because C lacks automatic garbage collection for raw pointers, any memory we ask the system for must be explicitly freed to avoid memory leaks. When a function finishes executing and is no longer needed, we must tear down its chunk:
void freeChunk(VM* vm, Chunk* chunk){
    FREE_ARRAY(vm, Instruction, chunk->code, chunk->capacity);
    FREE_ARRAY(vm, int, chunk->lines, chunk->capacity);
    freeValueArray(vm, &chunk->constants);
    initChunk(chunk);
}
The freeChunk function acts as the destructor. It uses the FREE_ARRAY macro to release the bytecode array and the line tracking array, and then cascades the destruction down into freeValueArray to wipe out the constant pool. Finally, it calls initChunk to reset all the pointers back to NULL, leaving the chunk in a safe, inert state.
#define FREE_ARRAY(vm, type, pointer, oldCount) \
    reallocate(vm, pointer, sizeof(type) * (oldCount), 0)

void freeValueArray(VM* vm, ValueArray* array){
    reallocate(vm, array->values, sizeof(Value) * array->capacity, 0);
    initValueArray(array);
}
Wrapping Up
At this point, we have established the two foundational ends of the interpreter pipeline. From Part 3, we have a front-end Scanner capable of breaking down raw source code into a clean stream of tokens. From this part, we have a robust back-end structure ready to store and manage the dense bytecode our VM will eventually execute.
Right now the bytecode is completely invisible. It is just a massive array of integers sitting in RAM. The next challenge is bridging the gap between these two ends.
