
Building a Register VM Interpreter - Part 10: Heap, Strings, and Hash Tables

Over the previous parts, we have built up a working engine. With the addition of control-flow jumps, this VM has crossed the threshold into Turing completeness. It can make decisions, repeat tasks, and route execution reliably.

However, if we look closely at the data this VM is currently pushing through its virtual registers, we will find this language only understands primitive values: numbers, booleans, and null. This is because of the NaN-boxing we built early on. Every piece of data in this VM must fit perfectly inside a single, 64-bit Value integer.

But what if we need to write a script that processes a 100-character string? Or what about a list containing 50 distinct items? We cannot cram a dynamically sized string or a complex data structure into a single 64-bit slot.

To escape this 64-bit limit, we have to look beyond the virtual stack and into the C heap.

The solution lies in a trick we preemptively set up back in Part 2: the SIGN_BIT. Instead of storing the complex data structure inside the register, we will dynamically allocate the required memory on the system heap. Once the data is safely sitting on the heap, we take the 64-bit C memory pointer, combine it with our QNAN and SIGN_BIT masks, and box this raw pointer directly into the Value struct.

To the VM’s register allocator, it is still just another 64-bit integer being moved from slot to slot. But to the execution engine, it’s a map to a much larger universe of memory.

The Base Object Model

In an object-oriented language like C++ or Java, creating a base for dynamically allocated data is built into the language itself. C, however, has no concept of classes or base types.

Yet, this VM needs a way to pass pointers around abstractly. To solve this, we implement a classic C trick: Struct Inheritance.

In core/object.h, we define a universal header that every heap-allocated entity will share:

typedef enum{
    OBJECT_STRING,
    OBJECT_LIST,
    OBJECT_MAP,
    OBJECT_FUNC,
    OBJECT_CFUNC,
    OBJECT_MODULE,
    OBJECT_CLOSURE,
    OBJECT_UPVALUE,
    OBJECT_CLASS,
    OBJECT_INSTANCE,
    OBJECT_BOUND_METHOD,
    OBJECT_FILE,
    OBJECT_ITERATOR,
}ObjectType;

typedef struct Object{
    ObjectType type;
    bool isMarked;          // set by the Garbage Collector when it marks live objects
    struct Object* next;
}Object;

In a managed language, the VM must keep track of every single byte of heap memory it allocates. If it loses track of a pointer, that memory leaks forever. Instead of maintaining a massive array of pointers to track allocations, we make the objects track themselves.

The struct Object* next pointer turns every heap allocation into a node in an Intrusive Linked List.

Whenever we create a new object, regardless of what it is, we will prepend it to a global linked list owned by the VM state. You can see this pattern whenever we allocate something new, for example, in core/object.c when creating a new Map:

ObjectMap * newMap(VM* vm){
    ObjectMap* map = (ObjectMap*)reallocate(vm, NULL, 0, sizeof(ObjectMap));
    map->obj.type = OBJECT_MAP;
    map->obj.isMarked = false;
    initHashTable(&map->table);

    // Link into the tracking list
    map->obj.next = vm->objects;
    vm->objects = (Object*)map;
    return map;
}

This means vm->objects acts as a master registry. When it’s time to shut down the VM, or when the Garbage Collector needs to sweep away unused data, it can simply walk this single next chain from start to finish, examining or destroying every piece of heap memory the language ever created.

Strings

With the base Object structure defined, we can now build concrete data types on top of it. The most fundamental dynamic type in any scripting language is the string.

In core/object.h, we can define an ObjectString struct:

typedef struct ObjectString{
    Object obj;
    size_t length;
    uint64_t hash;
    char chars[];   // Flexible array member
}ObjectString;

ObjectString* copyString(VM* vm, const char* chars, int len);
ObjectString* takeString(VM* vm, char* chars, int length);

The very first field is Object obj, which guarantees that a pointer to an ObjectString can be safely cast to a base Object pointer. We also store the length of the string and a pre-computed hash, which will be discussed when we build the Hash Table. But the most crucial part of this struct is the very last field: char chars[];.

The Flexible Array

If you were writing a naive string struct in C, you might define the characters as a pointer char* chars;. However, this requires two separate memory allocations: one for the struct itself, and a second for the character array. This forces the CPU to jump to two completely different areas of memory to read a single string, destroying cache locality and adding overhead to the memory manager.

Instead, we use a C99 feature called a Flexible Array Member. By declaring an array of unspecified size (chars[]) at the very end of the struct, we tell the compiler that an array of characters will follow the header, with its size determined at runtime.

Let’s look at the actual allocation in core/object.c using allocString:

static ObjectString* allocString(VM* vm, int len, uint64_t hash){
    ObjectString* str = (ObjectString*)reallocate(vm, NULL, 0, sizeof(ObjectString) + len + 1);
    
    str->obj.type = OBJECT_STRING;
    str->obj.isMarked = false;
    str->length = len;
    str->hash = hash;
    str->obj.next = vm->objects;
    vm->objects = (Object*)str;

    return str;
}

When we call our reallocate function, we don’t just ask for sizeof(ObjectString). We ask for sizeof(ObjectString) + len + 1.

This allocates a single, contiguous block of memory. The first chunk of bytes holds the Object header, the length, and the hash. The remaining bytes smoothly transition right into the raw character data. The + 1 ensures we have exactly enough room to append a standard C null-terminator (\0).

Before this function returns, it performs the linking step we discussed before:

str->obj.next = vm->objects;
vm->objects = (Object*)str;

It takes the newly created string, sets its next pointer to the top of the object list, and then updates the VM to point to this new string. From this moment on, the string is finally tracked by the VM.

String Equality

If we blindly allocate strings on the heap every time the user types a quote in the code or concatenates variables, the VM will run into a massive performance wall.

In a dynamically typed scripting language, strings are everywhere. They are not just used for printing on the screen; they are heavily used under the hood, for example, for resolving module imports or accessing properties (e.g., user.age looks up the string age). If the VM has to use the C standard function strcmp to compare these strings char-by-char every single time an object property is accessed, the overhead of those $O(N)$ comparisons will slow down execution.

If we want string comparisons to cost $O(1)$ instead of $O(N)$, the solution is to use a technique called String Interning.

String Interning is a process of deduplication. The VM enforces a simple rule: there will never be two identical strings occupying different places in memory. If the script generates the string “hello”, the VM allocates and tracks it. If the script generates the exact same string “hello” again later, the VM does not allocate a new string. Instead, it just returns the pointer to the original one.

Because the VM guarantees that every unique string exists exactly once in the heap, comparing two strings becomes trivial. We don’t need to look at the characters at all. We just compare the memory addresses of the two structs.

Let’s look at how we implement this in core/object.c:copyString:

ObjectString* copyString(VM* vm, const char* chars, int len){
    uint64_t hash = hashString(chars, len, vm->hash_seed);

    ObjectString* interned = tableGetInternedString(vm, &vm->strings, chars, len, hash);
    if(interned != NULL){
        return interned;
    }

    ObjectString* str = allocString(vm, len, hash);
    memcpy(str->chars, chars, len);
    str->chars[len] = '\0';

    /*
    push(vm, OBJECT_VAL(str));
    tableSet(vm, &vm->strings, OBJECT_VAL(str), NULL_VAL);
    pop(vm);
    */

    return str;
}

When copyString is called, the very first thing it does is calculate a hash for the character array using a highly optimized hashing algorithm. Once the hash is calculated, it calls tableGetInternedString, which checks whether a string with the exact same characters has already been allocated. If the answer is yes, it simply returns the existing interned pointer, completely bypassing the memory allocation. If the string is truly new, it calls allocString.

This architectural choice makes our VM incredibly fast at runtime, but it introduces a new problem for us to solve in C. To know if a string has already been allocated, the VM needs a dictionary to remember every string it has ever seen. It needs a data structure that can look up a value instantly based on a key.

It needs a Hash Table.

Hash Table

To make String Interning work, and to eventually support global variables, classes, and object properties, this VM needs a robust dictionary. It needs a data structure that can map a key to a value with $O(1)$ time complexity.

In core/hashtable.h, we define the foundational structures for this:

typedef struct{
    Value key;
    Value value;
}Entry;

typedef struct{
    int count;
    int capacity;
    Entry* entries;
}HashTable;

A Hash Table is essentially just a dynamically resizing array of Entry buckets. When we want to insert a key, we calculate its hash, mask it down to an index (hash & (capacity - 1), equivalent to a modulo when the capacity is a power of two) to find its “ideal” slot in the array, and drop the Entry there.

But the efficiency of this entire data structure hinges entirely on two things: the quality of that hash function, and how we handle situations where two keys want the exact same bucket.

The Hash Algorithm

To minimize collisions, you need a highly uniform hash algorithm. Many educational VMs use simple algorithms like FNV-1a (Fowler-Noll-Vo).

static uint64_t hashString(const char* key, int len){
    uint64_t hash = 14695981039346656037ULL; // FNV-64 offset basis
    for(int i = 0; i < len; i++){
        hash ^= (uint8_t)key[i];
        hash *= 1099511628211ULL;            // FNV-64 prime
    }
    return hash;
}

FNV-1a is beautifully simple. It starts with a magic starting number (the offset basis). For every character in the string, it mixes the character’s bits into the hash using a bitwise XOR (^=), and then multiplies the result by a massive magic prime number (1099511628211 in the 64-bit variant). This prime multiplication causes the bits to scatter wildly. Even if you change just one letter in a long string, the resulting 64-bit integer will look completely different, preventing similar words from clumping together in the hash table.

While FNV-1a is simple, it has drawbacks. First, processing one byte at a time in a tight loop does not exploit modern 64-bit CPU architectures effectively. Second, a simple, predictable hash function leaves a language vulnerable to Hash Collision DoS attacks.

Instead of writing our own complex hash function, we can rely on vendor libraries. In modern C development, one of the cleanest ways to integrate third-party code is through header-only libraries or single-file C drops.

In this codebase, we include xxhash.h. xxHash is a state-of-the-art, non-cryptographic hash algorithm that operates at close to RAM bandwidth. We can drop the vendor file into the project tree and write a CMakeLists.txt:

cmake_minimum_required(VERSION 3.10)

project(PiCo)

set(CMAKE_C_COMPILER "gcc")

if(NOT CMAKE_C_COMPILER_ID MATCHES "GNU")
    message(FATAL_ERROR "FATAL: This project requires the GCC compiler due to specific GNU C extensions. "
                        "The attempt to set GCC as the compiler failed. Please ensure 'gcc' is installed and in your PATH.")
endif()

set(CMAKE_C_STANDARD 11)
set(CMAKE_C_STANDARD_REQUIRED ON)
set(CMAKE_C_EXTENSIONS ON)

file(GLOB_RECURSE CORE_SRC "core/*.c")
file(GLOB_RECURSE VM_SRC "vm/*.c")
file(GLOB_RECURSE COMPILER_SRC "compiler/*.c")
file(GLOB_RECURSE STD_SRC "std/*.c")
file(GLOB_RECURSE CMD_SRC "cmd/*.c")

include_directories(
    cmd
    core
    vm
    compiler
    std
    vendor/xxhash
)

add_library(xxhash_lib STATIC vendor/xxhash/xxhash.c)
target_include_directories(xxhash_lib PUBLIC vendor/xxhash)

add_compile_options(-g -O3)

add_executable(pico 
    ${CORE_SRC} 
    ${VM_SRC} 
    ${COMPILER_SRC} 
    ${STD_SRC} 
    ${CMD_SRC}
)

target_link_libraries(pico PRIVATE xxhash_lib m)

target_compile_options(pico PRIVATE
    $<$<CONFIG:RELEASE>:-O3>
)

message(STATUS "Project 'PiCo' configured to use C Compiler: ${CMAKE_C_COMPILER}")

By including the header #include "xxhash.h", we can upgrade the hashing engine in core/object.c instantly:

static uint64_t hashString(const char* key, int len, uint64_t seed){
    return XXH3_64bits_withSeed(key, (size_t)len, seed);
}

Notice the seed parameter. When the VM boots up, it generates a random seed integer (vm->hash_seed). Every script run has a completely different hash layout, neutralizing DoS collision attacks while maintaining blazing-fast throughput.

We can build a hashValue function to route different data types to the appropriate hashing strategy:

static uint64_t hashValue(Value value, uint64_t seed){
    if(IS_STRING(value)){
        return AS_STRING(value)->hash;
    }
    if(IS_NUM(value)){
        double num = AS_NUM(value);
        if(num == 0.0)    num = 0.0;
        //  0.0: 0x0000000000000000
        // -0.0: 0x8000000000000000
        return XXH3_64bits_withSeed(&num, sizeof(double), seed);
    }
    if(IS_OBJECT(value))    return (uint64_t)AS_OBJECT(value);
    if(IS_BOOL(value))  return AS_BOOL(value) ? 3 : 5;
    if(IS_NULL(value))  return 7;
    return (uint64_t)value;
}

When hashing a string Value, the hash has already been computed at allocation time, so we just return the pre-computed value. Numbers, however, hide a -0.0 trap. According to the IEEE 754 floating-point standard, 0.0 and -0.0 are mathematically equal, but their binary bit representations are completely different.

//  0.0: 0x0000000000000000
// -0.0: 0x8000000000000000

If we blindly hashed their bits, tableGet(0.0) and tableGet(-0.0) would look in different buckets. The if(num == 0.0) num = 0.0; forces -0.0 to become standard 0.0 before passing it to XXH3_64bits_withSeed.

We return hardcoded prime numbers (3, 5, 7) because there are only three possible states for these types. In hash table design, the capacity of the table is almost always a power of 2 (e.g., 8, 16, 32, 64). You can see this in core/hashtable.c where the capacity starts at 8 and doubles, and where bucket indices are calculated using the bitwise AND mask: hash & (capacity - 1). Prime numbers are highly favored in hashing mathematics because they share no common divisors with powers of 2. Technically, any three distinct prime numbers would work perfectly. We could have used 11, 13, and 17, or 101, 103, and 107.

Open Addressing

Even with xxHash scattering our data perfectly, collisions are mathematically inevitable. Two keys will eventually try to claim the exact same index in the entries array.

To resolve this, we use Open Addressing. When a collision occurs, we do not attach a linked list to the bucket (Separate Chaining). Instead, we simply step forward until we find an empty slot. We use Open Addressing because a flat, contiguous array is incredibly friendly to CPU caches.

Robin Hood Hashing

In standard Linear Probing, when a collision occurs, the algorithm simply steps forward until it finds an empty slot. It operates on a strict “first-come, first-served” basis. If an entry claims a bucket, it owns that bucket forever, regardless of how much it inconveniences entries inserted later.

This leads to a phenomenon called Primary Clustering. If several keys hash to the same general area, they form a solid block of occupied buckets. When a new key hashes into the middle of this block, it is forced to step over every single occupied bucket until it reaches the end of the cluster. As the table fills up, these clusters merge into massive super-clusters. Your $O(1)$ lookup degrades into an $O(N)$ list crawl.

Probe Sequence Length

Robin Hood Hashing fixes this by abandoning the “first-come, first-served” rule. Instead, it introduces a metric of wealth known as Probe Sequence Length (PSL), or, as it is named in this codebase, probe_dist.

The PSL is simply the distance between the bucket an entry wants to be in (its ideal index) and the bucket it actually occupies (its current index). That distance measures the entry’s “wealth”: an entry at distance 0 sits exactly where its hash placed it and is incredibly “rich”, while an entry bumped 5 slots away by collisions is “poor”.

Robbing the Rich

When inserting a new entry, we track its probe_dist. As we step forward through the array looking for an empty bucket, we do not just check whether a bucket is occupied; we also calculate the occupant’s PSL (bucket_dist):

uint64_t bucket_hash = hashValue(bucket->key, vm->hash_seed);
uint32_t bucket_ideal_index = bucket_hash & (table->capacity - 1);
uint32_t bucket_dist = get_distance(table->capacity, bucket_ideal_index, cur_index);

We calculate the distance with get_distance:

static uint32_t get_distance(int capacity, uint32_t ideal_index, uint32_t cur_index){
    if(cur_index >= ideal_index){
        return cur_index - ideal_index;
    }else{
        // wrap-around
        return capacity - ideal_index + cur_index;
    }
}

If our incoming entry has a higher PSL than the current occupant (if(probe_dist > bucket_dist)), the incoming entry is poorer. The algorithm physically evicts the rich occupant, drops the poor incoming entry into the bucket, and forces the evicted occupant to become the new searcher.

if(probe_dist > bucket_dist){
    Entry tmp = *bucket;
    *bucket = entryToInsert;
    entryToInsert = tmp;
    
    key_hash = hashValue(entryToInsert.key, vm->hash_seed);
    ideal_index = key_hash & (table->capacity - 1);
    probe_dist = get_distance(table->capacity, ideal_index, cur_index);
}

In standard Linear Probing, the variance in probe lengths is massive. You might have one entry at distance 0 and another trapped far from its ideal slot. Robin Hood hashing actively flattens this distribution. It forces the wealthy entries to compromise, pushing them a little further away so the poor entries don’t have to travel as far. The worst-case lookup time is drastically reduced.

Search the Table

Because Robin Hood hashing guarantees that probe lengths are strictly ordered, the search function findEntry can be implemented with Early Exit.

In standard Linear Probing, if you search for a key that doesn’t exist (a cache miss), the VM is forced to step forward through every single occupied bucket in a cluster until it finally hits an empty slot.

Let’s look at how we search in core/hashtable.c:

static Entry* findEntry(Entry* entries, int capacity, Value key, uint64_t seed){
    if(capacity == 0){
        return NULL;    
    }

    uint64_t hash = hashValue(key, seed);
    uint32_t ideal_index = hash & (capacity - 1);
    uint32_t probe_dist = 0;

    while(true){
        uint32_t cur_index = (ideal_index + probe_dist) & (capacity - 1);
        Entry* bucket = &entries[cur_index];

        if(IS_EMPTY(bucket->key)){
            return NULL;    
        }

        uint64_t bucket_hash = hashValue(bucket->key, seed);
        uint32_t bucket_ideal_index = bucket_hash & (capacity - 1);
        uint32_t bucket_dist = get_distance(capacity, bucket_ideal_index, cur_index);
        if(probe_dist > bucket_dist){
            return NULL;
        }

        if(isEqual(bucket->key, key)){
            return bucket;
        }
        probe_dist++;
    }
}

If we are searching for a key, and our current search distance (probe_dist) becomes greater than the distance of the entry currently occupying the bucket (bucket_dist), we know with absolute mathematical certainty that our key is not in this table. Because if the key were in this table, and it had been pushed this far down, it would have robbed this occupant during the insertion phase. So we abort the search immediately.

Tombstone-Free Deletions

Deleting an entry from a standard Open Addressing hash table is annoying. If you just wipe a bucket back to empty, you break the probe chain. Any keys that collided and were stored after the deleted bucket will suddenly become unreachable. Usually, we solve this by replacing deleted entries with a Tombstone marker, which pollutes the table.

Because of the strict ordering Robin Hood maintains, we do not need Tombstones. We use a technique called Backward Shifting in tableRemove.

bool tableRemove(VM* vm, HashTable* table, Value key){
    if(table->count == 0){
        return false;
    }

    Entry* entry = findEntry(table->entries, table->capacity, key, vm->hash_seed);
    if(entry == NULL || IS_EMPTY(entry->key)){
        return false;
    }

    table->count--;
    uint32_t bucket_index = (uint32_t)(entry - table->entries);
    uint32_t capacity = table->capacity;

    // move entries forward
    while(true){
        uint32_t next_bucket_index = (bucket_index + 1) & (capacity - 1);
        Entry* next_bucket = &table->entries[next_bucket_index];

        if(IS_EMPTY(next_bucket->key)){
            break;
        }

        uint64_t next_hash = hashValue(next_bucket->key, vm->hash_seed);
        uint32_t next_ideal = next_hash & (capacity - 1);
        if(get_distance(capacity, next_ideal, next_bucket_index) == 0){
            break;  // next entry already in its ideal bucket
        }

        table->entries[bucket_index] = *next_bucket;
        bucket_index = next_bucket_index;
    }

    table->entries[bucket_index].key = EMPTY_VAL;
    table->entries[bucket_index].value = NULL_VAL;

    return true;
}

When we delete an entry, we look at the very next bucket. If that next entry is sitting exactly where it wants to be (distance 0), we stop. But if that next entry was pushed forward due to collisions, we literally slide it backward into the slot we just deleted. We repeat this slide-backward process until we hit an empty slot or a perfectly placed entry. Every deletion actively heals the table by shifting surviving entries closer to their ideal index.

Tying It Back

Now that we have a Robin Hood Hash Table, we can return to the problem of String Interning.

Earlier, we looked at copyString, which takes a raw C string, hashes it, and checks if it already exists. We can finally unveil the lookup function tableGetInternedString:

ObjectString* tableGetInternedString(VM* vm, HashTable* table, const char* chars, int len, uint64_t hash){
    if(table->count == 0){
        return NULL;
    }

    uint32_t capacity = table->capacity;
    uint32_t ideal_index = hash & (capacity - 1);
    uint32_t probe_dist = 0;

    while(true){
        uint32_t cur_index = (ideal_index + probe_dist) & (capacity - 1);
        Entry* entry = &table->entries[cur_index];

        if(IS_EMPTY(entry->key)){
            if(IS_NULL(entry->value)){
                return NULL;
            }
            probe_dist++;
            continue;
        }

        uint64_t entry_hash = hashValue(entry->key, vm->hash_seed);
        uint32_t entry_ideal_index = entry_hash & (capacity - 1);
        uint32_t entry_dist = get_distance(capacity, entry_ideal_index, cur_index);
        
        if(probe_dist > entry_dist){
            return NULL;
        }

        if(IS_STRING(entry->key)){
            ObjectString* keyStr = AS_STRING(entry->key);
            if(keyStr->length == len && 
                keyStr->hash == hash &&
                memcmp(keyStr->chars, chars, len) == 0){
                return keyStr;
            }
        }

        probe_dist++;
    }
}

This function traverses the vm->strings hash table just like findEntry. The only difference is that instead of comparing two pre-existing Value objects, it uses memcmp to check if the raw C character array (chars) exactly matches the characters inside the ObjectString currently in the bucket.

Copy and Take

There are two ways strings enter this VM: copyString and takeString. Understanding the memory ownership difference between them is crucial.

When the compiler parses a string literal in the source code, the characters live inside the source file’s buffer. The VM cannot modify or free that memory. So it uses copyString, which allocates fresh memory on the VM’s heap and uses memcpy to duplicate the characters.

However, during runtime, the VM will often dynamically create strings. For example, if a script concatenates "hello " and "world", the VM’s OP_ADD instruction will create a temporary buffer, fill it with "hello world", and then intern it.

If we passed this dynamically allocated buffer to copyString, copyString would duplicate it again, and the original buffer would be lost forever, causing a memory leak. Instead, we use takeString:

ObjectString* takeString(VM* vm, char* chars, int length){
    uint64_t hash = hashString(chars, length, vm->hash_seed);
    ObjectString* interned = tableGetInternedString(vm, &vm->strings, chars, length, hash);
    if (interned != NULL) {
        reallocate(vm, chars, length + 1, 0); 
        return interned;
    }

    ObjectString* string = allocString(vm, length, hash);
    memcpy(string->chars, chars, length);
    string->chars[length] = '\0';

    push(vm, OBJECT_VAL(string));
    tableSet(vm, &vm->strings, OBJECT_VAL(string), NULL_VAL);
    pop(vm);

    reallocate(vm, chars, length + 1, 0);

    return string;
}

If the string is already interned, the VM simply returns the existing interned object and immediately frees the duplicate buffer we just passed it. If it is new, it allocates the ObjectString, copies the characters into the flexible array member, and then frees the passed buffer.

Wrapping Up

We have now escaped the 64-bit limit. This VM can now dynamically allocate complex data to the heap using the Object model, guarantee that every string is perfectly unique using Interning, and store information in a collision-resistant, tombstone-free dictionary.

This post is licensed under CC BY 4.0 by the author.