Zig 0.16 is freshly out, with a major rewrite of std.Io. The release notes describe the new shape, but it takes an actual project to exercise it: one small enough to finish, and old enough that thirty years of textbook polish will catch any mistake. An LC-3 virtual machine and assembler fits the brief.
LC-3 is the teaching ISA from Patt & Patel's Introduction to Computing Systems. Sixteen-bit, fifteen opcodes, eight registers, condition flags, memory-mapped keyboard. The VM fits in 600 lines; the assembler in 640. Both speak the same flat .obj format the textbook ships, which makes thirty years of existing assembly programs valid test inputs.
What 0.16 actually feels like
The biggest break is what gets passed to your main. Pre-0.16 you reached for std.io.getStdOut() and std.process.argsAlloc() like ambient globals. In 0.16 they live in a single struct the runtime hands you:
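// Assumes `const std = @import("std");` and `const Io = std.Io;` at file scope.
pub fn main(init: std.process.Init) !void {
    const io = init.io;
    const arena = init.arena.allocator();

    // Buffered stdout: the writer owns no memory, the caller supplies the buffer.
    var stdout_buf: [4096]u8 = undefined;
    var stdout_writer = Io.File.stdout().writer(io, &stdout_buf);
    const stdout = &stdout_writer.interface;

    const args = try init.minimal.args.toSlice(arena);
    // ...
}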
The arena, the io, the args: they show up at the boundary, get plumbed wherever they're needed, and that's it. The implication is that you can no longer grab I/O from anywhere; every function that needs to write or read takes either an Io instance, a *Io.Writer, or an Io.File directly. It looks like ceremony at first. It pays off the first time you write a unit test that captures stdout, because the test just hands a *Io.Writer backed by a buffer to the function under test and there's nothing to mock.
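The testing payoff can be sketched as follows. This is a minimal sketch rather than code from the project: printRegister is a hypothetical helper, and the test assumes std.Io.Writer.fixed and Writer.buffered keep the shape they had in 0.15.

```zig
const std = @import("std");

// Hypothetical function under test. It never touches stdout directly;
// it only sees whatever *std.Io.Writer the caller hands it.
fn printRegister(out: *std.Io.Writer, index: u3, value: u16) !void {
    try out.print("R{d} = x{X:0>4}\n", .{ index, value });
}

test "output is captured in a plain buffer" {
    var buf: [64]u8 = undefined;
    // Assumption: a fixed-buffer writer is built with `.fixed`, and
    // `buffered()` returns the bytes written so far.
    var w: std.Io.Writer = .fixed(&buf);
    try printRegister(&w, 0, 0x3000);
    try std.testing.expectEqualStrings("R0 = x3000\n", w.buffered());
}
```

In main the same function would receive the stdout pointer from the snippet above, so the production path and the test path differ only in which writer gets passed.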
The build system shape stayed largely as in 0.15. Two binaries (zvm and zasm) reduce to two addExecutable calls plus their run/test steps:
const exe = b.addExecutable(.{
    .name = "zvm",
    .root_module = b.createModule(.{
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    }),
});
The migration from 0.15 was localized to the surface. The rest of the program structure didn't move at all.
The dispatch loop
The interesting code in any VM is the inner loop:
fetch → instr = mem[pc]
decode → 4-bit opcode = (instr >> 12)
execute → 15-arm switch on opcode
Zig's plain switch over a u4 lowers (under LLVM, in any release mode) to a single jump table: one indirect branch per simulated instruction. Sixteen targets fit in a single cache line. Written the obvious way, the VM runs at 500 Mips on a 16-million-instruction benchmark in ReleaseFast, about 50× faster than the textbook's 10 MHz reference clock.
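A minimal sketch of that fetch/decode/execute shape, under stated assumptions: the Vm struct, sext and setCond are illustrative names, not the project's. Only ADD, NOT and BR are implemented, and every TRAP is treated as HALT. It uses core language features only.

```zig
const std = @import("std");

const Vm = struct {
    mem: [65536]u16 = [_]u16{0} ** 65536,
    reg: [8]u16 = [_]u16{0} ** 8,
    pc: u16 = 0x3000,
    cond: u3 = 0b010, // N Z P bits; start with Z set
    running: bool = true,
};

// Sign-extend the low `bits` bits of `value` to 16 bits.
fn sext(value: u16, comptime bits: comptime_int) u16 {
    const shift = 16 - bits;
    const widened: i16 = @bitCast(value << shift);
    return @bitCast(widened >> shift); // arithmetic shift copies the sign bit
}

fn setCond(vm: *Vm, value: u16) void {
    vm.cond = if (value == 0) 0b010 else if ((value & 0x8000) != 0) 0b100 else 0b001;
}

fn step(vm: *Vm) void {
    const instr = vm.mem[vm.pc]; // fetch
    vm.pc +%= 1;
    const op: u4 = @truncate(instr >> 12); // decode
    switch (op) { // execute
        0b0001 => { // ADD, register or imm5 form
            const dr: u3 = @truncate(instr >> 9);
            const sr1: u3 = @truncate(instr >> 6);
            const rhs = if ((instr & 0x20) != 0)
                sext(instr & 0x1F, 5)
            else
                vm.reg[@as(u3, @truncate(instr))];
            vm.reg[dr] = vm.reg[sr1] +% rhs;
            setCond(vm, vm.reg[dr]);
        },
        0b1001 => { // NOT
            const dr: u3 = @truncate(instr >> 9);
            const sr: u3 = @truncate(instr >> 6);
            vm.reg[dr] = ~vm.reg[sr];
            setCond(vm, vm.reg[dr]);
        },
        0b0000 => { // BR: taken when any requested flag is set
            const nzp: u3 = @truncate(instr >> 9);
            if ((nzp & vm.cond) != 0) vm.pc +%= sext(instr & 0x1FF, 9);
        },
        0b1111 => { // TRAP: this sketch treats every trap as HALT
            vm.running = false;
        },
        else => { // remaining opcodes omitted from the sketch
            vm.running = false;
        },
    }
}

fn run(vm: *Vm) void {
    while (vm.running) step(vm);
}
```

The real VM has an arm for each opcode; the sketch only shows where the single dispatch site sits.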
The next question is whether it can go faster.
The trick that didn't pay
The folklore is that you can beat a switch with threaded dispatch: instead of one shared dispatch site at the top of the loop, every opcode handler ends with its own jump to the next handler, so each indirect-jump site gets specialized branch prediction for the patterns following that specific opcode. In C this is the computed-goto idiom (goto *table[op]); GCC and Clang have supported it since forever. In Zig you mimic it with @call(.always_tail, ...) into a table of handler functions:
fn h_add(vm: *Vm, ctx: *const Ctx, instr: u16) anyerror!void {
    opADD(vm, instr);
    return nextOp(vm, ctx); // inlined; tail-calls into next handler
}
Sixteen opcode slots (the fifteen instructions plus the reserved encoding), sixteen handlers, one function-pointer table, all wired up behind a --threaded flag for A/B comparison against the switch. Best-of-ten on a 16M-instruction loop:
Mode          switch     threaded   delta
ReleaseSafe   467 Mips   439 Mips   -6%
ReleaseFast   564 Mips   536 Mips   -5%
Threaded was slower. Five percent slower in ReleaseFast.
This isn't what the literature predicts, and it takes a moment for the result to stop reading like a measurement bug. The honest reading has three parts:
- LLVM already lowers a 16-arm switch into a tight jump table. The "improvement" threaded dispatch promises, one indirect jump per opcode, is what the switch already produces. There is no second indirect to remove.
- The branch-predictor argument requires an unpredictable pattern. The benchmark loop is a five-op cycle (ADD/NOT/ADD/ADD/BR) repeating thirty thousand times. A modern predictor learns that pattern in microseconds with one shared site, so spreading the dispatch across sixteen sites buys nothing.
- Tail calls aren't free. Even with .always_tail, the codegen emits a small amount of register shuffling at each call boundary that the switch doesn't pay.
The textbook computed-goto trick was designed against compilers that didn't optimize switches well. Modern LLVM closes that gap. For a complex ISA with hundreds of opcodes and unpredictable instruction streams, the calculus might still flip. But for LC-3, and for most teaching ISAs, the plain switch is already the answer.
The threaded code is rolled back. It's a useful negative result; it isn't useful sitting in the codebase as a slower alternative path. The whole experiment is now a paragraph in this post and gone from the source tree.
End-to-end on a real program
The toolchain is two binaries: zasm turns text into .obj, zvm loads .obj and executes. Both speak the same format: origin word followed by 16-bit big-endian instruction/data words, no header, no symbol table.
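A loader for that format is short. The following is a sketch under assumptions, not the project's loader: loadObj and LoadError are illustrative names, and the image is taken as an in-memory byte slice so that no file API is involved.

```zig
const std = @import("std");

const LoadError = error{ Truncated, TooLarge };

/// Copies a flat .obj image into `mem` and returns the origin address.
/// Layout: one big-endian origin word, then big-endian 16-bit words.
fn loadObj(mem: *[65536]u16, image: []const u8) LoadError!u16 {
    // Need at least the origin word, and a whole number of 16-bit words.
    if (image.len < 2 or image.len % 2 != 0) return error.Truncated;

    const origin: u16 = (@as(u16, image[0]) << 8) | image[1];
    const count = image.len / 2 - 1; // words after the origin
    if (@as(usize, origin) + count > mem.len) return error.TooLarge;

    var i: usize = 0;
    while (i < count) : (i += 1) {
        const hi: u16 = image[2 + 2 * i];
        const lo: u16 = image[3 + 2 * i];
        mem[@as(usize, origin) + i] = (hi << 8) | lo;
    }
    return origin;
}
```

Because there is no header and no symbol table, the only validation available is on length: an odd byte count, or an image that would run past the end of the 64K-word address space.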
A real test for the assembler is a program that wasn't hand-crafted in-tree. rpendleton/lc3-2048 fits: one 30 KB .asm file, all 15 LC-3 opcodes, JSR subroutines, ANSI-coloured terminal output, a board-rendering loop. Pass 1 fails on the very first try with UnterminatedString.
The bug is obvious in the 2048 source:
ANSI_BOARD_LABELS_2 .STRINGZ "\e[1;37m 4 \e[0m"
The naive stripComment looks for ; and chops the line there. The semicolon inside the ANSI escape was getting treated as a comment start. The codebase already had a comment apologizing for this exact assumption: "no one writes ; inside an LC-3 .STRINGZ in practice." Famous last words. The fix: comment stripper goes string-aware, \e joins the accepted escape list. 1137 words assemble clean. The VM boots, draws the board, accepts WASD input, merges tiles, spawns new ones. Whole pipeline working end-to-end on a program written by someone else.
+--------------------------+
| |
| 2 |
| |
| 2 4 |
| |
| |
| |
| |
| |
+--------------------------+
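The string-aware comment stripper behind that fix can be sketched in a few lines. This is an illustration of the technique, not the project's actual stripComment: it assumes double-quoted strings with backslash escapes, and it returns the line without trimming trailing whitespace.

```zig
const std = @import("std");

/// Returns the part of `line` before the first ';' that is not inside a
/// double-quoted string. A backslash inside a string skips the next byte.
fn stripComment(line: []const u8) []const u8 {
    var in_string = false;
    var i: usize = 0;
    while (i < line.len) : (i += 1) {
        const c = line[i];
        if (in_string) {
            if (c == '\\') {
                i += 1; // skip the escaped byte, e.g. the quote in \"
            } else if (c == '"') {
                in_string = false;
            }
        } else if (c == '"') {
            in_string = true;
        } else if (c == ';') {
            return line[0..i]; // first ';' outside a string starts the comment
        }
    }
    return line; // no comment on this line
}
```

With this version the 2048 line above passes through untouched, because both semicolons sit inside the string literal.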
What 0.16 still leaves on the table
A few rough edges left:
- The self-hosted x86_64 backend (stage2_x86_64) doesn't support @call(.always_tail) yet. While the threaded experiment was live, Debug builds required forcing use_llvm = true in build.zig. Slows iteration; went away with the threaded code. When stage2 catches up the restriction lifts.
- std.time.Timer is gone. Wall-clock timing is now Io.Clock.awake.now(io) returning a timestamp, with from.durationTo(to).nanoseconds for elapsed. The new shape is more uniform but the new path takes some grepping through the std to find.
- std.fs is gone in favour of Io.Dir and Io.File. Largely a rename, but readFile/stat paths look different enough that the cookbook needs re-learning.
None of this is painful. Porting a 0.15 codebase blind costs a couple of hours per file as the new locations get re-internalized.
What it cost
The whole project is around 1300 lines: 600 for the VM, 640 for the assembler, the rest in build files and example programs. Five hundred Mips peak in ReleaseFast on commodity hardware.
The tool to actually use day to day is, of course, lcc from the textbook authors. A hand-built copy exists for a different reason: prolonged contact with std.Io is the fastest way to internalize the new shape. The negative threaded-dispatch result isn't really a Zig story either; it would have happened with C and GCC too. Modern compilers have closed the gap that made computed-goto a winning trick in 2005. The lesson is the boring one: trust the compiler's switch lowering, profile before optimizing, and let go of folklore that has lost its premise.