Building a Register VM Interpreter - Part 3: Lexical Analysis and Keywords
This guide is about building a register-based VM interpreter from scratch.
In the previous parts, we established the foundational architecture of the VM and a memory-efficient way to represent dynamic values using NaN-Boxing. It's time for us to turn to the front-end of the interpreter: transforming raw source code into executable instructions.
But a compiler cannot understand a raw string of code like var x = 0; directly. It needs structured data. The first step in this part is Lexical Analysis, in short, building a Scanner. A scanner groups raw characters into meaningful, atomic units called Tokens.
The Zero-Allocation Token
When designing a scanner in C, the most critical decision is how to represent a token in memory.
Consider a naive approach. If the scanner meets an identifier like x, the instinctive response might be to dynamically allocate memory to store it:
typedef struct{
TokenType type;
char* lexeme;
int line;
}Token;
While this does work, it is an architectural disaster for a compiler. The primary problem is performance overhead. malloc is an expensive operation. A typical script might contain thousands of tokens; allocating heap memory for every single keyword, identifier, and number will drastically slow down compilation. Tiny allocations also destroy cache locality and fragment the heap. Worse, if tokens are dynamically allocated, the compiler now has the burden of tracking and freeing them, complicating the codebase and risking memory leaks before the Garbage Collector is fully ready.
String-View
Realizing the entire source code is already sitting in memory as one massive string, we don't need to copy it; we just need to point into it. Instead of holding the string data, our token will simply act as a view into the source string. In compiler/scanner.h, the Token is defined as:
typedef struct{
TokenType type;
const char* head;
int len;
int line;
}Token;
This design is incredibly lightweight. A token is just an enum representing the token type, a head pointer that points directly at the first character of the lexeme inside the raw source string, and two integers, len and line.
Because of this, generating a token requires zero memory allocation. We can pass Token structs around by value, since they are small enough to fit in a couple of registers.
TokenType is defined as an enum:
typedef enum{
TOKEN_EOF, TOKEN_ERROR,
TOKEN_IDENTIFIER,
TOKEN_IF, TOKEN_ELSE,
TOKEN_WHILE, TOKEN_FOR,
TOKEN_BREAK, TOKEN_CONTINUE,
TOKEN_SWITCH, TOKEN_DEFAULT, TOKEN_FAT_ARROW,
TOKEN_FUNC, TOKEN_RETURN, TOKEN_CLASS, TOKEN_THIS, TOKEN_METHOD,
TOKEN_TRUE, TOKEN_FALSE,
TOKEN_VAR,
TOKEN_NULL,
TOKEN_NUMBER,
TOKEN_STRING_START, TOKEN_STRING_END,
TOKEN_INTERPOLATION_START, TOKEN_INTERPOLATION_END, TOKEN_INTERPOLATION_CONTENT,
TOKEN_PLUS, TOKEN_MINUS, TOKEN_STAR, TOKEN_SLASH, TOKEN_PERCENT,
TOKEN_PLUS_EQUAL, TOKEN_MINUS_EQUAL, TOKEN_PLUS_PLUS, TOKEN_MINUS_MINUS,
TOKEN_EQUAL, TOKEN_NOT_EQUAL, TOKEN_NOT, TOKEN_QUESTION,
TOKEN_AND, TOKEN_OR,
TOKEN_LESS, TOKEN_GREATER, TOKEN_LESS_EQUAL, TOKEN_GREATER_EQUAL,
TOKEN_ASSIGN,
TOKEN_LEFT_PAREN, TOKEN_RIGHT_PAREN,
TOKEN_LEFT_BRACE, TOKEN_RIGHT_BRACE,
TOKEN_LEFT_BRACKET, TOKEN_RIGHT_BRACKET,
TOKEN_COMMA, TOKEN_SEMICOLON, TOKEN_DOT, TOKEN_COLON,
TOKEN_PRINT,
TOKEN_IMPORT,
TOKEN_SYSTEM, TOKEN_PIPE,
TOKEN_DEFER,
}TokenType;
Scanner State
To produce these tokens, we need a machine that reads the source code linearly. We can encapsulate the state of this process in a Scanner struct:
typedef struct{
const char* head;
const char* cur;
int line;
}Scanner;
The core of the scanner relies on a sliding window built from two pointers: head and cur. The head drops an anchor at the beginning of the lexeme we are currently analyzing, while cur acts as a scout: it steps forward through the string, examining one character at a time to figure out what kind of token it is looking at.
By separating these two pointers, capturing a token becomes trivial. Once cur has consumed enough characters to form a token, the length of the token is simply the distance between the two pointers.
Scanning Loop
With the zero-allocation tokens and scanner state defined, the next question is how the compiler actually consumes tokens.
There are two primary ways to design a scanner:
- Push-based (AOT Lexing): Also known as Ahead-of-Time Lexing. The scanner runs through the entire source file at once, generates a massive array of every token, and passes the array to the parser.
- Pull-based (Lazy Lexing): The scanner does nothing until the parser explicitly requests the next token.
This VM uses the pull-based approach. First, we don't need to allocate a massive token array; we only ever hold the current token in memory, which saves space. Second, it allows the scanner to be context-aware. As we will see later with string interpolation, the scanner may need to switch its mode based on what the parser is currently doing. A lazy, on-demand scanner makes this switching trivial.
Skipping Noise
When the main scanner function scan() is called (and we are in standard code mode by default), it delegates to scanDefault(). The very first thing this function does is walk through characters to skip noise.
Scanner sc;
static Token scanDefault(){
while(true){
skipWhitespace();
if(handleComment()){
continue; // Skip comments
}
break;
}
sc.head = sc.cur; // Anchor the head
// ... begin scanning ...
}
Whitespace
Whitespace, including spaces, tabs, and newlines, as well as comments, is entirely meaningless (at least for now) to the Virtual Machine. No one would define and generate tokens like TOKEN_WHITESPACE or TOKEN_COMMENT; the parser would have to waste CPU cycles checking and discarding them. By stripping them out at the lexical level, we guarantee that the parser only receives significant code.
Scanner sc;
static inline void skipWhitespace(){
while(*sc.cur == ' ' || *sc.cur == '\t' || *sc.cur == '\n' || *sc.cur == '\r'){
if(*sc.cur == '\n') sc.line++;
next();
}
return;
}
The next() function consumes characters. A scanner will examine every single character in a source file, sometimes multiple times. In a tight loop like this, function call overhead and array indexing become massive bottlenecks.
Scanner sc;
static inline const char* next(){
if(*sc.cur == '\0') return NULL;
return sc.cur++;
}
Instead of using array indices, we use raw pointer arithmetic (sc.cur++). Bumping a memory address is the fastest way to traverse memory on modern CPUs.
Furthermore, functions like next() are explicitly declared as static inline. With gcc's -O3 optimization, the bodies of these functions will be pasted directly into the scanning loop, keeping the CPU's instruction pipeline saturated and eliminating call overhead.
Comment
Comments seem trivial, but they present a fascinating language design choice. Let's support single-line comments (#) and block comments (#{ ... }#).
In compiler/scanner.c:
Scanner sc;
static bool handleComment(){
if(*sc.cur == '#'){
if(sc.cur[1] == '{'){
handleBlockComment();
}else{
handleLineComment();
}
return true;
}
return false;
}
static inline void handleLineComment(){
while(*sc.cur != '\n' && *sc.cur != '\0'){
next();
}
}
Standard line comments simply advance sc.cur until they hit a \n or \0. But let's take a look at handleBlockComment():
Scanner sc;
static inline void handleBlockComment(){
next(); next(); // Skip #{
int depth = 1;
while(depth > 0 && *sc.cur != '\0'){
if(*sc.cur == '#' && sc.cur[1] == '{'){
next(); next(); // Skip #{
depth++;
continue;
}
if(*sc.cur == '}' && sc.cur[1] == '#'){
next(); next(); // Skip }#
depth--;
continue;
}
// ...
}
}
Here we maintain an integer depth counter to support nested block comments.
In C and C++, block comments (/* ... */) do not nest. If a programmer tries to comment out a large block of code that already contains a block comment, the C compiler will see the first */, prematurely end the outer comment, and then throw a syntax error on the remaining text.
By tracking the depth, our scanner knows how many block comments are currently open. It only stops consuming characters when the depth reaches exactly 0.
static inline void handleBlockComment(){
// ....
if(*sc.cur == '\n') sc.line++;
next();
if(*sc.cur == '\0' && depth > 0){
fprintf(stderr, "Error: Unclosed block comment at line %d\n", sc.line);
return;
}
// ...
}
Lookahead Machine
Once the noise is cleared and head is anchored, we grab the first character with char c = *next(); and figure out what token to build.
For single characters, a simple switch statement is suitable. But for operators that might be two or more characters (like +=, ++ or --), we can write a lookahead function to assist the scanner.
Scanner sc;
static inline bool is_next(char c){
if(*sc.cur == '\0') return false;
if(*sc.cur == c){
sc.cur++;
return true;
}
return false;
}
Now we can extend scanDefault() in compiler/scanner.c in this pattern:
static Token scanDefault(){
while(true){
skipWhitespace();
if(handleComment()){
continue; // Skip comments
}
break;
}
sc.head = sc.cur;
if(*sc.cur == '\0') return pack(TOKEN_EOF, sc.head, 0, sc.line); // next() returns NULL at EOF, so bail out first
char c = *next(); // Grab the first character of the lexeme
switch(c){
case '+':
if(is_next('=')) return pack(TOKEN_PLUS_EQUAL, sc.head, 2, sc.line);
if(is_next('+')) return pack(TOKEN_PLUS_PLUS, sc.head, 2, sc.line);
return pack(TOKEN_PLUS, sc.head, 1, sc.line);
case '-':
if(is_next('=')) return pack(TOKEN_MINUS_EQUAL, sc.head, 2, sc.line);
if(is_next('-')) return pack(TOKEN_MINUS_MINUS, sc.head, 2, sc.line);
return pack(TOKEN_MINUS, sc.head, 1, sc.line);
// more cases...
}
return error("Unrecognized character", sc.line);
}
is_next() peeks at the next character. If it matches, it consumes the character and returns true, allowing us to pack a 2-character token. If it does not match, it leaves sc.cur exactly where it is, and we pack a standard 1-character token.
With the core loop spinning, whitespace stripped, and basic symbols packed, the scanner is fully operational. But we have one more trick up our sleeve: seamlessly injecting variables into the middle of string literals.
String Interpolation and Mode Stack
In languages like C, a string literal is just an opaque blob of text. But modern scripting languages always demand more. Developers expect string interpolation, which is the ability to embed variables and complex expressions directly inside a string, like this:
print "The result of a + b is ${a + b}.";
Inside a ${ ... }, we are no longer scanning a string but scanning code. That code might contain operators, function calls, variables, or even other nested strings.
If the scanner treats the string as a single token, the parser later has to manually rip the string apart, look for ${, and somehow run the scanner again on the internal pieces.
Context-Aware Scanner
Instead of forcing the parser to dissect strings, we can upgrade our scanner into a state machine with a stack. In compiler/scanner.h, we introduce a ScannerMode enumeration and attach a mode stack directly to the global Scanner.
typedef enum{
MODE_DEFAULT,
MODE_IN_STRING,
}ScannerMode;
typedef struct{
const char* head;
const char* cur;
int line;
ScannerMode modeStack[MAX_MODE_STACK];
int modeStackTop;
}Scanner;
The stack is initialized with MODE_DEFAULT at the bottom when compilation begins. The main scan() function acts as a router, checking the top of the stack to decide which mode to apply.
Token scan(){
switch(currentMode()){
case MODE_DEFAULT: return scanDefault();
case MODE_IN_STRING: return scanString();
// ...
}
}
Now the scanner can completely alter its behavior simply by changing its current mode.
Interpolation Lifecycle
Let's trace how the scanner handles the example: "The result of a + b is ${a + b}."
In default mode, the scanner consumes an opening ". Instead of gobbling up the whole string, it immediately suspends the current state, pushes MODE_IN_STRING onto the stack via pushMode(), and yields a TOKEN_STRING_START.
static inline void pushMode(ScannerMode mode){
if(sc.modeStackTop < MAX_MODE_STACK - 1){
sc.modeStack[++sc.modeStackTop] = mode;
}else{
fprintf(stderr, "Error: Scanner Mode stack overflow at line %d\n", sc.line);
}
}
The next time the parser asks for a token, scan() routes execution to scanString(). This function consumes characters until it hits a closing quote or the interpolation trigger ${. It captures The result of a + b is and yields it as a TOKEN_INTERPOLATION_CONTENT.
Then the scanner resumes and soon hits ${. It pushes MODE_DEFAULT onto the top of the stack and yields a TOKEN_INTERPOLATION_START.
if(*sc.cur == '$' && sc.cur[1] == '{'){
next(); next(); // Skip ${
pushMode(MODE_DEFAULT);
return pack(TOKEN_INTERPOLATION_START, sc.head, 2, sc.line);
}
Now the top of the mode stack is MODE_DEFAULT again, and the scanner completely forgets it is inside a string. When the parser asks for the next token, the expression a + b is scanned using exactly the same standard rules as any other code. This means the compiler's expression parser requires absolutely no extra work to support string interpolation.
Eventually, the scanner hits a closing }. In scanDefault(), we need a specific rule for handling the right brace.
case '}':
if(sc.modeStackTop > 0){
popMode();
return pack(TOKEN_INTERPOLATION_END, sc.head, 1, sc.line);
}
return pack(TOKEN_RIGHT_BRACE, sc.head, 1, sc.line);
Because sc.modeStackTop > 0 (we are nested inside an interpolation), it intercepts the brace, pops MODE_DEFAULT off the stack, and yields TOKEN_INTERPOLATION_END.
With MODE_DEFAULT popped, MODE_IN_STRING is back on top, and the scanner resumes looking for string characters. It then meets the closing ", pops the string mode off the stack, and returns a TOKEN_STRING_END.
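Putting the whole lifecycle together, the example string produces a token stream along these lines (the exact boundaries of the content tokens are an assumption; the token types follow the trace above):

```
TOKEN_STRING_START            "
TOKEN_INTERPOLATION_CONTENT   The result of a + b is
TOKEN_INTERPOLATION_START     ${
TOKEN_IDENTIFIER              a
TOKEN_PLUS                    +
TOKEN_IDENTIFIER              b
TOKEN_INTERPOLATION_END       }
TOKEN_INTERPOLATION_CONTENT   .
TOKEN_STRING_END              "
```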
Keywords and Perfect Hashing
As the scanner traverses the source code, it inevitably encounters all kinds of alphabetical character sequences. When the cur pointer reads a word like return, the scanner runs into an identity crisis: is the word the name of a user-defined variable (an identifier), or a built-in language keyword?
In compiler/scanner.c, we first consume the entire word assuming it is a standard identifier:
static inline bool isDigit(char c){
return c >= '0' && c <= '9';
}
static inline bool isAlpha(char c){
return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}
static inline Token handleIdentifier(){
while(isAlpha(*sc.cur) || isDigit(*sc.cur)){
next();
}
return pack(TOKEN_IDENTIFIER, sc.head, (int)(sc.cur - sc.head), sc.line);
}
But before returning this token to the parser, we must check whether it is actually a keyword:
static TokenType identifierType(){
const struct Keyword* keyword = findKeyword(sc.head, (int)(sc.cur - sc.head));
if(keyword != NULL){
return keyword->type;
}
return TOKEN_IDENTIFIER;
}
The magic lies entirely within findKeyword(). There are several ways to implement it.
The most straightforward is a massive if-else chain of strcmp calls. However, each strcmp is an $O(N)$ operation, and running dozens of them for every identifier would be a disaster.
A more advanced approach is a Trie (prefix tree): navigate a tree node-by-node for each character. While much faster than chained strcmp, a Trie requires chasing pointers through memory for every single letter, destroying CPU cache locality.
The standard computer science answer is a Hash Table: hash the string and look it up in a dictionary. But because the keywords of a programming language are completely static and known ahead of time, we don't need a general-purpose hash table. We can use a Perfect Hash Function, which guarantees mathematically that there will be zero collisions for a specific, predetermined set of keys.
Instead of writing this math by hand, we use a tool called GNU gperf. We define our language's keywords in a simple configuration file compiler/keywords.gperf:
// gperf -L C -E -t -C -G -N findKeyword -H keywordHash keywords.gperf
%includes
%{
#include <string.h>
#include "scanner.h"
%}
%language=C
%struct-type
struct Keyword {
const char* name;
TokenType type;
};
%%
class, TOKEN_CLASS
else, TOKEN_ELSE
false, TOKEN_FALSE
for, TOKEN_FOR
func, TOKEN_FUNC
method, TOKEN_METHOD
var, TOKEN_VAR
if, TOKEN_IF
null, TOKEN_NULL
print, TOKEN_PRINT
return, TOKEN_RETURN
this, TOKEN_THIS
true, TOKEN_TRUE
while, TOKEN_WHILE
and, TOKEN_AND
or, TOKEN_OR
break, TOKEN_BREAK
continue, TOKEN_CONTINUE
switch, TOKEN_SWITCH
default, TOKEN_DEFAULT
import, TOKEN_IMPORT
defer, TOKEN_DEFER
When we run gperf -L C -E -t -C -G -N findKeyword -H keywordHash keywords.gperf on this file, it analyzes the exact character structures of all 22 of our keywords. It then generates a highly optimized header (compiler/keywords.h) containing a custom-built, collision-free hash algorithm for the keyword list.
static unsigned int
keywordHash (str, len)
register const char *str;
register size_t len;
{
static const unsigned char asso_values[] =
{
31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
// ...
return len + asso_values[(unsigned char)str[1]] + asso_values[(unsigned char)str[0]];
}
// ...
const struct Keyword *
findKeyword (str, len)
register const char *str;
register size_t len;
{
if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
{
register unsigned int key = keywordHash (str, len);
if (key <= MAX_HASH_VALUE)
{
register const char *s = wordlist[key].name;
if (*str == *s && !strncmp(str, wordlist[key].name, len))
return &wordlist[key];
}
}
return (struct Keyword *) 0;
}
The logic is bulletproof: it first runs a bounds check, and if the word is too short or too long, it fails instantly. Then it computes the perfect hash index and grabs the keyword candidate directly from the static wordlist array. Even if a user-defined variable happens to hash to the same slot as a keyword, a final strncmp verification guarantees an exact string match. The entire lookup is a constant-time hash plus a single length-bounded string comparison.
With the perfect hash computed, the generated findKeyword() function is little more than an array lookup.
Wrapping Up
By combining string-view tokens, a lazy scanning loop, a mode-stack automaton for context-aware interpolation, and perfect hashing for keywords, we have built a pragmatic and performant scanner.
At this stage, the front-end of the interpreter is capable of stripping noise and converting a raw file into an on-demand stream of lexical tokens.
It is important to remember that the scanner is entirely ignorant of the grammar. It does not know that TOKEN_PLUS is supposed to sit between two "things", nor does it know whether a return statement is valid outside a function. Its only job is to "check the vocabulary".
With lexical analysis solved, we next feed this stream of tokens into the Parser, where we will enforce the actual syntax rules, evaluate complex expressions, and finally begin generating the bytecode instructions that the Virtual Machine will execute.
