An LC-3 toolchain in Zig 0.16

Zig 0.16 is freshly out, with a major rewrite of std.Io. The release notes describe the new shape; exercising it takes an actual project, one small enough to finish and old enough that thirty years of textbook polish will catch any mistake. An LC-3 virtual machine and assembler fits the brief.

LC-3 is the teaching ISA from Patt & Patel's Introduction to Computing Systems. Sixteen-bit, fifteen opcodes, eight registers, condition flags, memory-mapped keyboard. The VM fits in 600 lines; the assembler in 640. Both speak the same flat .obj format the textbook ships, which makes thirty years of existing assembly programs valid test inputs.

Two binaries, one flat format. Assembling 2048 from its real source took one bug fix.

What 0.16 actually feels like

The biggest break is what gets passed to your main. Pre-0.16 you reached for std.io.getStdOut() and std.process.argsAlloc() like ambient globals. In 0.16 they live in a single struct the runtime hands you:

const std = @import("std");
const Io = std.Io;

pub fn main(init: std.process.Init) !void {
    const io = init.io;
    const arena = init.arena.allocator();

    var stdout_buf: [4096]u8 = undefined;
    var stdout_writer = Io.File.stdout().writer(io, &stdout_buf);
    const stdout = &stdout_writer.interface;

    const args = try init.minimal.args.toSlice(arena);
    // ...
}

The arena, the io, the args: they show up at the boundary, get plumbed wherever they're needed, and that's it. The implication is that you can no longer grab I/O from anywhere; every function that needs to write or read takes either an Io instance, a *Io.Writer, or an Io.File directly. It looks like ceremony at first. It pays off the first time you write a unit test that captures stdout, because the test just hands a *Io.Writer backed by a buffer to the function under test and there's nothing to mock.
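The testing payoff can be sketched as follows. This is a minimal illustration, not code from the project: dumpRegister is a hypothetical helper, and the sketch assumes std.Io.Writer.fixed and Writer.buffered behave in 0.16 as they did in 0.15.

```zig
const std = @import("std");

// Hypothetical helper (not from the project): formats one register line.
// It only knows about the *std.Io.Writer it is handed, never about stdout.
fn dumpRegister(w: *std.Io.Writer, index: u3, value: u16) std.Io.Writer.Error!void {
    try w.print("R{d} = x{X:0>4}\n", .{ index, value });
}

test "dumpRegister writes into a caller-supplied buffer" {
    var buf: [32]u8 = undefined;
    // A writer backed by a fixed buffer stands in for stdout; nothing is mocked.
    var w: std.Io.Writer = .fixed(&buf);
    try dumpRegister(&w, 3, 0x2A);
    try std.testing.expectEqualStrings("R3 = x002A\n", w.buffered());
}
```

In the real binary the same function would receive the stdout writer's interface pointer from main instead.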

The build system shape stayed largely as in 0.15. Two binaries (zvm and zasm) reduce to two addExecutable calls plus their run/test steps:

const exe = b.addExecutable(.{
    .name = "zvm",
    .root_module = b.createModule(.{
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    }),
});

The migration from 0.15 was localized to the surface. The rest of the program structure didn't move at all.

The dispatch loop

The interesting code in any VM is the inner loop:

fetch  → instr = mem[pc]
decode → 4-bit opcode = (instr >> 12)
execute → 16-way switch on opcode (15 defined opcodes plus the reserved encoding)

Zig's plain switch over a u4 lowers (under LLVM, in any release mode) to a single jump table: one indirect branch per simulated instruction. Sixteen four-byte targets fit in a single cache line. Written the obvious way, the VM chews through a 16-million-instruction benchmark in about 28 ms in ReleaseFast: roughly 560 million simulated instructions a second, on one commodity core. (Precise numbers below.)
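A stripped-down version of that loop might look like the sketch below. It is not the project's VM: the Vm struct, step and sext are illustrative names, only ADD and NOT are implemented, condition-flag updates are omitted, and every other encoding simply halts.

```zig
const std = @import("std");

/// Sign-extend the low `bits` bits of `x` to a full 16-bit word.
fn sext(x: u16, comptime bits: u4) u16 {
    const mask: u16 = (@as(u16, 1) << bits) - 1;
    const sign: u16 = @as(u16, 1) << (bits - 1);
    const low = x & mask;
    return if ((low & sign) != 0) low | ~mask else low;
}

const Vm = struct {
    mem: []const u16, // illustrative: a real LC-3 has 65536 writable words
    reg: [8]u16 = [_]u16{0} ** 8,
    pc: u16 = 0,
    halted: bool = false,

    fn step(vm: *Vm) void {
        const instr = vm.mem[vm.pc]; // fetch
        vm.pc +%= 1;
        const op: u4 = @truncate(instr >> 12); // decode: top four bits
        switch (op) { // execute
            0b0001 => { // ADD (register or 5-bit immediate form)
                const dr: u3 = @truncate(instr >> 9);
                const sr1: u3 = @truncate(instr >> 6);
                const sr2: u3 = @truncate(instr);
                const rhs = if ((instr & 0x20) != 0) sext(instr, 5) else vm.reg[sr2];
                vm.reg[dr] = vm.reg[sr1] +% rhs;
            },
            0b1001 => { // NOT
                const dr: u3 = @truncate(instr >> 9);
                const sr: u3 = @truncate(instr >> 6);
                vm.reg[dr] = ~vm.reg[sr];
            },
            // Sketch only: all other encodings halt. A full VM has one arm per opcode.
            else => vm.halted = true,
        }
    }
};
```

Running it on a four-word program (ADD R0,R0,#5; NOT R1,R0; ADD R2,R0,#-1; then the reserved 1101 encoding) would look like `var vm = Vm{ .mem = &program }; while (!vm.halted) vm.step();`.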

The next question is whether it can go faster.

The trick that didn't pay

The folklore is that you can beat a switch with threaded dispatch: instead of one shared dispatch site at the top of the loop, every opcode handler ends with its own jump to the next handler, so each indirect-jump site gets specialised branch prediction for the patterns following that specific opcode. In C this is the computed-goto idiom (goto *table[op]); GCC and Clang have supported it since forever. In Zig you mimic it with @call(.always_tail, ...) into a table of handler functions:

fn h_add(vm: *Vm, ctx: *const Ctx, instr: u16) anyerror!void {
    opADD(vm, instr);
    return nextOp(vm, ctx);   // inlined; tail-calls into next handler
}

Fifteen opcodes, sixteen handler slots (the reserved 1101 encoding traps), one function-pointer table, all wired up behind a --threaded flag for A/B comparison against the switch. Best-of-ten on a 16M-instruction loop:

Build         switch     threaded   delta
ReleaseSafe   467 Mips   439 Mips   -6%
ReleaseFast   564 Mips   536 Mips   -5%

Threaded dispatch came in slower. Five percent slower in ReleaseFast, the opposite of what the folklore promises.

This isn't what the literature predicts, and the result takes a moment to stop reading like a measurement bug. The honest reading takes three pieces:

LLVM already lowers a 16-arm switch into a tight jump table. The "improvement" threaded dispatch promises, one indirect jump per executed instruction, is what the switch already produces. There is no second indirect jump to remove.
The branch-predictor argument requires an unpredictable pattern. The benchmark loop is a five-op cycle (ADD/NOT/ADD/ADD/BR) repeating thirty thousand times. A modern predictor learns that pattern in microseconds with one shared site, so spreading the dispatch across sixteen sites buys nothing.
Tail calls aren't free. Even with .always_tail, the generated code shuffles registers at each call boundary, a small cost the switch doesn't pay.

The computed-goto trick was designed against compilers that didn't optimise switches well. Modern LLVM closed that gap, and the trick quietly lost its premise.

For a complex ISA with hundreds of opcodes and unpredictable instruction streams, the calculus might still flip. But for LC-3, and for most teaching ISAs, the plain switch is already the answer.

The threaded code is rolled back. It's a useful negative result; it isn't useful sitting in the codebase as a slower alternative path. The whole experiment is now a paragraph in this post and gone from the source tree.

End-to-end on a real program

The toolchain is two binaries: zasm turns text into .obj, zvm loads .obj and executes. Both speak the same format: origin word followed by 16-bit big-endian instruction/data words, no header, no symbol table.
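A loader for that format can be sketched in a few lines. This is an illustration under stated assumptions, not the project's loader: loadObj and LoadError are hypothetical names, and the image is assumed to already be in memory as bytes.

```zig
const std = @import("std");

const LoadError = error{ TooShort, OddLength, ImageTooLarge };

/// Copies a flat .obj image (origin word, then big-endian 16-bit words)
/// into `mem` and returns the origin address.
fn loadObj(image: []const u8, mem: *[65536]u16) LoadError!u16 {
    if (image.len < 2) return error.TooShort;
    if (image.len % 2 != 0) return error.OddLength;

    // The first word says where the payload goes.
    const origin = std.mem.readInt(u16, image[0..2], .big);
    const words = image.len / 2 - 1;
    if (@as(usize, origin) + words > mem.len) return error.ImageTooLarge;

    var i: usize = 0;
    while (i < words) : (i += 1) {
        const off = 2 + 2 * i;
        mem[origin + i] = std.mem.readInt(u16, image[off..][0..2], .big);
    }
    return origin;
}
```

With no header and no symbol table, the only validation available is on length and bounds, which is why the sketch checks nothing else.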

A real test for the assembler is a program that wasn't hand-crafted in-tree. rpendleton/lc3-2048 fits: one 30 KB .asm file, all 15 LC-3 opcodes, JSR subroutines, ANSI-coloured terminal output, a board-rendering loop. Pass 1 fails on the very first try with UnterminatedString.

The bug is obvious in the 2048 source:

ANSI_BOARD_LABELS_2 .STRINGZ "\e[1;37m 4  \e[0m"

The naive stripComment looks for ; and chops the line there. The semicolon inside the ANSI escape was getting treated as a comment start. The codebase already had a comment apologizing for this exact assumption: "no one writes ; inside an LC-3 .STRINGZ in practice." Famous last words. The fix: comment stripper goes string-aware, \e joins the accepted escape list. 1137 words assemble clean. The VM boots, draws the board, accepts WASD input, merges tiles, spawns new ones. Whole pipeline working end-to-end on a program written by someone else.

+--------------------------+
|                          |
|         2                |
|                          |
|   2     4                |
|                          |
|                          |
|                          |
|                          |
|                          |
+--------------------------+
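The string-aware comment stripping described above can be sketched like this. It is one plausible shape for the fix, not the project's actual code: only the name stripComment comes from the post, and the escape handling here simply skips whatever character follows a backslash.

```zig
const std = @import("std");

/// Returns `line` with any trailing `;` comment removed. Semicolons inside
/// a double-quoted string are left alone.
fn stripComment(line: []const u8) []const u8 {
    var in_string = false;
    var i: usize = 0;
    while (i < line.len) : (i += 1) {
        const c = line[i];
        if (in_string) {
            if (c == '\\' and i + 1 < line.len) {
                i += 1; // skip the escaped character, e.g. \" or \e
            } else if (c == '"') {
                in_string = false;
            }
        } else if (c == '"') {
            in_string = true;
        } else if (c == ';') {
            return line[0..i]; // first semicolon outside a string starts the comment
        }
    }
    return line;
}
```

An unterminated string leaves in_string set until the end of the line, so the whole line is returned and the later string parser can still report UnterminatedString.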

What 0.16 still leaves on the table

A few rough edges remain:

The self-hosted x86_64 backend (stage2_x86_64) doesn't support @call(.always_tail) yet. While the threaded experiment was live, Debug builds required forcing use_llvm = true in build.zig. That slowed iteration, and the requirement went away with the threaded code. When stage2 catches up, the restriction lifts.
std.time.Timer is gone. Wall-clock timing is now Io.Clock.awake.now(io), which returns a timestamp, with from.durationTo(to).nanoseconds for the elapsed time. The new shape is more uniform, but finding it takes some grepping through std.
std.fs is gone in favour of Io.Dir and Io.File. Largely a rename, but readFile/stat paths look different enough that the cookbook needs re-learning.

None of this is painful. Porting a 0.15 codebase blind costs a couple of hours per file as the new locations get re-internalised.

What it cost

The whole project is around 1300 lines: 600 for the VM, 640 for the assembler, the rest in build files and example programs. About 560 Mips peak in ReleaseFast on commodity hardware.

The tool to actually use day to day is, of course, lcc from the textbook authors. A hand-built copy exists for a different reason: prolonged contact with std.Io is the fastest way to internalise the new shape. The negative threaded-dispatch result isn't really a Zig story either; it would have happened with C and GCC too. Modern compilers have closed the gap that made computed-goto a winning trick in 2005. The lesson is the boring one: trust the compiler's switch lowering, profile before optimising, and let go of folklore that has lost its premise.
