Building a Register VM Interpreter - Part 1: Introduction and Architecture
This guide is about building a register-based VM interpreter from scratch. This first part introduces the project and the overall architecture of the interpreter.
Building a language interpreter from scratch is often considered a milestone in software engineering. Demystifying how code is transformed into executable actions reveals the elegance behind it. While many educational projects focus on AST-walking interpreters or stack-based virtual machines, this project takes a slightly different path: building a register-based virtual machine interpreter from scratch, entirely in C.
Virtual Machine
Why a VM at all? When translating human-readable source code into executable actions, there are generally three distinct paths:
Tree-Walking Interpreters
The simplest approach is to parse the code into an Abstract Syntax Tree (AST) and recursively evaluate its nodes. While easy to implement, this method is notoriously slow: traversing a tree involves constant pointer dereferencing and memory jumps, resulting in poor CPU cache locality and heavy runtime overhead.
Native Compilation
Some languages, like C, C++, and Rust, rely on compilers such as GCC or LLVM to translate source code directly into raw machine code. While this truly yields maximum performance, the engineering burden is monumental: the compiler must understand hardware quirks, manage physical registers, and generate complex executable formats like ELF or Mach-O.
The Bytecode VM
This is the middle ground chosen by Java, Python, C#, and many others. Instead of targeting physical hardware, the compiler translates source code into bytecode: a dense, linear sequence of instructions designed for an idealized, imaginary CPU. We then write a program (the Virtual Machine) in C that simulates this imaginary CPU, fetching and executing these instructions one by one. This lets us offload hardware adaptation to the C compiler itself.
What We Are Building
Before diving into the architecture, let's set a clear goal for what we are actually going to implement. Building a simple calculator or a toy expression evaluator would be boring; instead, we are going to build a fully functional, dynamically typed scripting language including:
- Rich Data Types and Collections: Beyond basic integers and booleans, we will implement strings and built-in container types like lists and hash tables.
- Functions and Control Flow: First-class functions, standard branching, and loops.
- Simple Object-Oriented Programming: A complete class system allowing for the creation of classes, instances, and methods.
- File-Based Module System: The ability to import and organize code across multiple files.
- Automatic Memory Management: A custom mark-and-sweep Garbage Collector (GC) built from scratch to automatically track and free unused memory.
- Standard Library: C functions bound into the VM to form a standard library, providing essential capabilities like file system I/O, OS interaction, and time manipulation.
The project relies on CMake for build management and strictly requires the GCC toolchain.
Stack vs. Register
Before writing a single line of code, a fundamental decision must be made: how will the virtual machine evaluate expressions and dispatch instructions? The landscape of language VMs is divided into two camps: stack-based and register-based architectures.
The Stack-Based
In a stack-based VM (like the JVM, Python’s CPython, or standard WebAssembly), operations draw their operands from a central data stack and push their results back onto it.
Consider a simple addition: a = b + c.
In a stack VM, the bytecode might look like this:

```
LOAD b
LOAD c
ADD
STORE a
```
As shown, this is relatively straightforward to generate. Since operands live implicitly on the stack, the compiler does not need to allocate specific storage locations. However, evaluating even simple expressions requires many distinct instructions, and moving data on and off the stack introduces significant overhead.
The Register-Based
A register-based VM (famously utilized by Lua’s interpreter and Android’s Dalvik/ART) mimics the architecture of physical CPUs. Instead of a push/pop stack model, instructions specify explicit source and destination “registers” (which, in a software VM, are typically just indexed slots in an array representing the current function’s call frame).
The same addition a = b + c in a register VM can be represented as a single instruction:

```
ADD a b c
```
A register VM therefore executes fewer instructions and reduces instruction dispatch overhead. By eliminating the constant push and pop operations, the VM spends more time performing actual computation and less time moving data around. The trade-off is compiler complexity: the compiler must perform register allocation, deciding which variables map to which slots and ensuring registers are reused efficiently without clashing. Moreover, a single bytecode instruction is larger, because it must encode the addresses of its operands (e.g., the indices of a, b, and c).
Architectural Overview
To manage the complexity of compiling and executing a custom language, the project is divided into three distinct, decoupled subsystems:

- The Core (core/): The foundational data structures and memory management subsystem. It defines the representation of memory objects, values, hash tables, and chunks. Crucially, this is where the Garbage Collector lives, managing the lifecycle of every dynamically allocated entity.
- The Compiler (compiler/): The bridge between human-readable source code and VM-readable bytecode. It includes the Scanner (the lexical analyzer) and the Parser (the bytecode generator). Its sole responsibility is to understand the raw source and emit a sequence of instructions into a "Chunk", a dynamic array of bytecode consumed by the virtual machine.
- The Virtual Machine (vm/): The execution engine. It initializes the execution environment, manages the call stack and register frames, and decodes and interprets bytecode.
The Entry Point
Every system needs a starting point. The entry point of this interpreter is quite straightforward: it initializes the environment and routes the input to the appropriate subsystem.
Let's take a look at the foundational code in cmd/main.c:
```c
#include <stdio.h>   /* fprintf */
#include <string.h>  /* strcmp */

#include "common.h"
#include "chunk.h"
#include "debug.h"
#include "vm.h"
#include "repl.h"
#include "file.h"

int main(int argc, const char* argv[]) {
    VM vm;
    if (argc == 1) {
        /* No arguments: start an interactive session. */
        initVM(&vm, 0, NULL);
        repl(&vm);
    } else {
        /* Skip an optional leading "run" subcommand. */
        int scriptArgsSt = 1;
        if (strcmp(argv[1], "run") == 0) {
            scriptArgsSt = 2;
        }
        if (scriptArgsSt >= argc) {
            fprintf(stderr, "Usage: %s [run] [script] [args...]\n", argv[0]);
            return 64;
        }
        /* Hand the remaining arguments to the script being run. */
        initVM(&vm, argc - scriptArgsSt, argv + scriptArgsSt);
        runScript(&vm, argv[scriptArgsSt]);
    }
    freeVM(&vm);
    return 0;
}
```
This setup serves as the shell of the interpreter. The VM structure is allocated on the stack and initialized via initVM, which bootstraps the core runtime: setting up the garbage collector, allocating the global environment, and prepping the root call frame.
The if branch shows that the interpreter supports two standard modes: REPL and script. If no arguments are provided, it drops into a Read-Eval-Print Loop, allowing for real-time code evaluation and testing. If a file path is provided, it delegates to runScript, which reads the file into memory, passes it to the compiler, and has the VM execute the resulting bytecode.
The next step is to breathe life into the system by defining how data is represented in memory, transitioning from theoretical architecture to concrete data structures.
