Running real PICO-8 Lua on a Cortex-M7

The whole project hinged on one question I was not sure had a yes: can a real, unmodified Lua 5.4 interpreter run on this microcontroller, fast enough to play actual PICO-8 carts? If the answer was no, Rusty Nail would have been stuck running a toy bytecode VM forever. So I went and found out.

Vendoring Lua

The first step was the least glamorous. I dropped the Lua 5.4.7 source into a crate called fclua, unmodified, and got it building on the host first. The reference Lua sources are conveniently all ASCII and MIT licensed, so vendoring them is clean. The host build's only job is to prove the C-API bindings work. The test is the most satisfying one-liner in the repo:

// fclua: returns 6*7 == 42

Then the cross-compile. cargo build --target thumbv7em-none-eabihf runs the Lua C sources through the Arm GNU toolchain (gcc plus newlib) and produces a liblua.a. No source changes to Lua at all. The build script hunts for the toolchain in a few likely places and forces optimisation to -O2, which matters later. Linked into the firmware, the whole interpreter is about 170 KB of flash and 840 bytes of static RAM. On a chip with 2 MB of flash, 170 KB for a complete scripting language is nothing.

6 times 7 is not always 42

I flashed it, ran return 6*7, and over the defmt log came:

lua on the nucleo: 6*7 = 42

Except the first time it was not 42. It was 42 minus 2 to the 32, or -4294967254. The low 32 bits were right and the high 32 bits were not.

The cause was a single config flag. luaconf.h was picking up LUA_USE_C89, which makes LUA_INTEGER a 32-bit long. My FFI on the Rust side was handing back i64. The two disagreed about how wide an integer was, and the top half came back as garbage. Dropping LUA_USE_C89 gives 64-bit integers that match the host exactly. After that, 42 was 42.
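The failure mode is easy to reproduce on the host. The sketch below is not the fclua binding. It only models a 64-bit read of a slot where the callee wrote just the low 32 bits. The function name and the stale_high parameter are hypothetical, and the all-ones upper half is an assumption chosen because it reproduces the 42 minus 2 to the 32 value.

```rust
/// Models reading an i64 out of a slot where only the low 32 bits were
/// written. `stale_high` stands in for whatever was left in the upper
/// half (an assumption for illustration, not a measured value).
pub fn read_as_i64(low: i32, stale_high: u32) -> i64 {
    // Place the stale bits in the upper half and the real result in the lower.
    let bits = ((stale_high as u64) << 32) | (low as u32 as u64);
    bits as i64
}
```

With an upper half of 0xFFFF_FFFF, read_as_i64(42, 0xFFFF_FFFF) is -4294967254, which is 42 minus 2 to the 32. With a zeroed upper half it is plain 42, which is why a width mismatch like this can hide until the stale bits happen to be non-zero.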

The allocator was the other early piece. Lua wants a malloc, and there is no libc heap here. You give Lua a custom lua_Alloc callback backed by a static arena, and it never touches the C heap. I open a curated set of standard libraries with open_safe_libs: base, table, string, math, coroutine, utf8. No io, no os, no loadlib. A sandboxed cart has no business opening files. The math library did drag in one stray newlib syscall stub (_times, via randomseed), which is a failing stub like the rest and never actually called.
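A minimal sketch of what an arena-backed allocator callback can look like. The four-argument shape (ud, ptr, osize, nsize) is Lua's documented lua_Alloc contract. Everything else is an assumption for illustration: the Arena type, the bump strategy that never reclaims freed blocks, and the 8-byte alignment. The firmware's real allocator is not shown in this post and is likely smarter about reuse.

```rust
use core::ffi::c_void;
use core::ptr;

/// Hypothetical bump arena over a caller-provided buffer (for example a
/// static array on the device). It hands out memory and never reclaims it.
pub struct Arena {
    base: *mut u8,
    cap: usize,
    used: usize,
}

impl Arena {
    /// Safety: `base` must be 8-byte aligned and valid for `cap` bytes
    /// for as long as the arena is in use.
    pub unsafe fn new(base: *mut u8, cap: usize) -> Self {
        Arena { base, cap, used: 0 }
    }

    /// Returns an 8-byte-aligned block of `size` bytes, or null when full.
    fn bump(&mut self, size: usize) -> *mut u8 {
        let start = (self.used + 7) & !7usize;
        match start.checked_add(size) {
            Some(end) if end <= self.cap => {
                self.used = end;
                unsafe { self.base.add(start) }
            }
            _ => ptr::null_mut(),
        }
    }
}

/// Callback with the same shape as lua_Alloc. `ud` must point at an Arena.
pub unsafe extern "C" fn arena_alloc(
    ud: *mut c_void,
    p: *mut c_void,
    osize: usize,
    nsize: usize,
) -> *mut c_void {
    let arena = unsafe { &mut *(ud as *mut Arena) };
    if nsize == 0 {
        // Free request: a bump arena simply forgets the block.
        return ptr::null_mut();
    }
    if !p.is_null() && nsize <= osize {
        // Shrinking in place; Lua assumes a shrink never fails.
        return p;
    }
    let fresh = arena.bump(nsize);
    if !fresh.is_null() && !p.is_null() {
        // Growing: copy the old contents. `osize` is only a real size
        // when `p` is non-null.
        unsafe { ptr::copy_nonoverlapping(p as *const u8, fresh, osize) };
    }
    fresh as *mut c_void
}
```

Returning null on exhaustion is what lets Lua raise an out-of-memory error instead of corrupting the arena. A pure bump allocator wastes space on every table resize, so treat this as the shape of the interface rather than a production design.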

Getting carts onto the device

A PICO-8 cart is not just Lua. It is Lua plus a chunk of memory holding the sprite sheet, the map, sound effects, and music. So a cart on the wire is a small bundle I call P8B1:

"P8B1" | lua_len (u32) | data_len (u32) | lua bytes | data bytes

The data region is the 0x4300-byte PICO-8 memory image: graphics at 0x0000, map at 0x2000, flags at 0x3000, music at 0x3100, sound effects at 0x3200. The host side decodes a .p8 or .p8.png file, preprocesses the Lua, packs the memory image, and emits the blob. It travels over the USB upload path (begin/chunk/end with a CRC-32) and lands in a 48 KB static buffer on the device, where it is parsed and the Lua state is rebuilt around it.
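A sketch of parsing the P8B1 header layout shown above. The post does not state the byte order of the two length fields, so little-endian is an assumption here, as is the rule that the body must be exactly lua_len plus data_len bytes. The Cart and ParseError types and the parse_p8b1 name are hypothetical.

```rust
/// Borrowed view of a parsed bundle (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Cart<'a> {
    pub lua: &'a [u8],
    pub data: &'a [u8],
}

#[derive(Debug, PartialEq)]
pub enum ParseError {
    TooShort,
    BadMagic,
    LengthMismatch,
}

/// Splits a P8B1 blob into its Lua source and memory-image regions.
pub fn parse_p8b1<'a>(blob: &'a [u8]) -> Result<Cart<'a>, ParseError> {
    const HEADER: usize = 12; // 4-byte magic + two u32 lengths
    if blob.len() < HEADER {
        return Err(ParseError::TooShort);
    }
    if !blob.starts_with(b"P8B1") {
        return Err(ParseError::BadMagic);
    }
    // Assumption: both lengths are little-endian.
    let lua_len = u32::from_le_bytes([blob[4], blob[5], blob[6], blob[7]]) as usize;
    let data_len = u32::from_le_bytes([blob[8], blob[9], blob[10], blob[11]]) as usize;
    let body = &blob[HEADER..];
    // checked_add guards against overflow on a 32-bit target.
    let total = lua_len
        .checked_add(data_len)
        .ok_or(ParseError::LengthMismatch)?;
    if body.len() != total {
        return Err(ParseError::LengthMismatch);
    }
    let (lua, data) = body.split_at(lua_len);
    Ok(Cart { lua, data })
}
```

Borrowing slices out of the upload buffer keeps the parse copy-free, which matters when the whole bundle has to fit in a fixed static buffer. A stricter parser could also check that data_len matches the expected memory image size.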

The Lua heap had to move. It started in the AXI SRAM, but that bank is full of framebuffers. Table-heavy carts need room, so the heap moved to the D2 SRAM at 0x30000000, which is three contiguous banks giving 256 KB. That single change is what made real carts fit.

Making it fast

The first real cart I ran was zep's "3d dot party", which builds something like 343 point tables and spins them. It ran at 16 fps. Playable-ish, but not great. Two changes took it to 45 fps, nearly a 3x jump:

  1. Turn on the caches. The Cortex-M7 has instruction and data caches that are off at reset, and embassy does not turn them on for you. Enabling them is safe here only because nothing else reads cached SRAM behind the CPU's back. The USB peripheral uses its own internal FIFO, not a DMA buffer in cached RAM, so there is no coherency trap yet. (That stops being true the moment the LCD controller shows up, which is a problem for a later post.)
  2. Compile Lua with -O2, not -Os. The interpreter's dispatch loop is the hot path, and it has to run well out of a 16 KB instruction cache. Optimising for size makes that loop smaller but slower, and you pay for it on every bytecode. Optimising for speed is what the inner loop needs.

Profiling the dots cart afterward, the cart's own _draw was about 90% of each frame. The desktop compositing barely registered. That was the answer to the original question: the ceiling is the interpreter on a heavy cart, not OS overhead. Light carts just hit the 60 fps cap and sit there.

Stopping a cart that will not stop

A sandbox has to survive a cart with while true do end in its update loop. I tried the textbook approach first: lua_sethook with a count mask, which fires a callback every N bytecode instructions and lets you abort. It works. A runaway loop turns into a caught error and the OS recovers without a reflash.

The problem is the cost. The hook traces every VM instruction, and that tax is paid by every cart, all the time, not just the misbehaving ones. On the dots cart it knocked 45 fps down to 35, about a 22% hit. That is a lot to pay continuously to guard against a case that almost never happens.
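To make the shape of that cost concrete, here is a toy model of a count hook. It is not Lua's VM and does not call lua_sethook. It is a stand-in dispatch loop, with hypothetical names throughout, that shows the two properties that matter: a runaway program can be aborted from the callback, and the countdown is touched on every instruction whether or not the program misbehaves.

```rust
#[derive(Debug, PartialEq)]
pub enum Stop {
    /// The program ran to completion after this many instructions.
    Finished(u64),
    /// The hook asked to abort after this many instructions.
    Aborted(u64),
}

/// Toy dispatch loop. `program_len` of None models `while true do end`.
/// `hook` is called every `every` instructions and returns false to abort.
pub fn run_with_count_hook(
    program_len: Option<u64>,
    every: u64,
    mut hook: impl FnMut(u64) -> bool,
) -> Stop {
    assert!(every > 0);
    let mut executed: u64 = 0;
    let mut until_hook = every;
    loop {
        if let Some(n) = program_len {
            if executed == n {
                return Stop::Finished(executed);
            }
        }
        executed += 1; // "execute" one instruction
        until_hook -= 1; // the per-instruction tax every program pays
        if until_hook == 0 {
            until_hook = every;
            if !hook(executed) {
                return Stop::Aborted(executed);
            }
        }
    }
}
```

With an endless program, a hook every 1000 instructions, and a callback that gives up once 5000 instructions have run, the loop returns Aborted(5000). A well-behaved 2500-instruction program returns Finished(2500), but it still paid for the decrement and compare 2500 times.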

So I threw it out and used the hardware instead. The present loop feeds a 2-second IWDG watchdog every frame. A cart that hangs stops the frames, the watchdog is not fed, and after two seconds the board resets and reboots straight back to the desktop. There is zero per-instruction overhead: the countdown runs in hardware, and the only software cost is feeding it once per frame. The trade is that a hang resets the whole board instead of failing gracefully, and you lose your debug session mid-hang. For a console that is the right trade. Dots went back to 45 fps. The lua_sethook code is still in the tree as a reference, just not in the hot path.
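The feed-once-per-frame pattern can be modelled on the host without any hardware. The sketch below is a simulation, not the firmware's watchdog code: the Watchdog type, its millisecond bookkeeping, and frames_until_reset are all hypothetical, and the real peripheral counts down on its own clock rather than being ticked by software.

```rust
/// Host-side model of a watchdog: a countdown that only `feed` can restart.
pub struct Watchdog {
    timeout_ms: u32,
    remaining_ms: u32,
}

impl Watchdog {
    pub fn new(timeout_ms: u32) -> Self {
        Watchdog { timeout_ms, remaining_ms: timeout_ms }
    }

    /// Restarts the countdown, as the present loop does each frame.
    pub fn feed(&mut self) {
        self.remaining_ms = self.timeout_ms;
    }

    /// Advances simulated time. Returns true if the board would have reset.
    pub fn tick(&mut self, elapsed_ms: u32) -> bool {
        self.remaining_ms = self.remaining_ms.saturating_sub(elapsed_ms);
        self.remaining_ms == 0
    }
}

/// Plays back a list of frame durations, feeding once after each frame.
/// Returns the index of the frame during which a reset would fire, if any.
pub fn frames_until_reset(frame_times_ms: &[u32], timeout_ms: u32) -> Option<usize> {
    let mut wdg = Watchdog::new(timeout_ms);
    for (i, &t) in frame_times_ms.iter().enumerate() {
        if wdg.tick(t) {
            return Some(i);
        }
        wdg.feed();
    }
    None
}
```

With a 2000 ms timeout, a run of 22 ms frames never trips the watchdog, while a sequence such as [22, 5000, 22] reports a reset during frame 1. The cart code never appears in the model at all, which is the point: nothing is added to the interpreter's inner loop.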

How far it goes

There is a host-side proof, fcpico, that links the system Lua 5.4 and runs the exact same PICO-8 API. Only the Lua build differs between host and device, which means I can run a compatibility sweep across thousands of carts on a laptop in seconds. A lot of that work is lowering PICO-8's dialect quirks (its compound assignments, the ? print shorthand, button glyphs, binary literals) into plain Lua. The numbers from that grind:

  • A curated set of 884 carts: from 26% loading to 95.6%.
  • The full Internet Archive BBS dump, around 19,155 carts: 92.5%.
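As an illustration of the dialect lowering mentioned above, here is a deliberately naive sketch that handles two of the quirks: the ? print shorthand and arithmetic compound assignments. It is not the fcpico preprocessor. The lower_line name is hypothetical, and it works on one line at a time with plain string search, so it would mangle operators that appear inside string literals, comments, or single-line if statements. A real preprocessor needs a tokenizer.

```rust
/// Lowers one line of PICO-8 source to plain Lua (naive, line-based sketch).
pub fn lower_line(line: &str) -> String {
    let trimmed = line.trim_start();
    // `?expr` is PICO-8 shorthand for print(expr).
    if let Some(rest) = trimmed.strip_prefix('?') {
        return format!("print({})", rest.trim());
    }
    // `target op= value` becomes `target = target op (value)`.
    for op in ["+", "-", "*", "/", "%"] {
        let token = format!("{}=", op);
        if let Some(idx) = line.find(&token) {
            let target = line[..idx].trim();
            let value = line[idx + token.len()..].trim();
            // Parenthesise the right-hand side so precedence is preserved.
            return format!("{} = {} {} ({})", target, target, op, value);
        }
    }
    line.to_string()
}
```

For example, lower_line("x += 1") gives x = x + (1), and a line holding ?"hi" becomes print("hi"). Lines with neither quirk, including comparisons such as a <= b, pass through unchanged.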

The residue is heavy procedural-generation intros that blow the budget, a few runtime gaps, and some carts leaning on fractional fixed-point bit tricks. But the headline holds: real Lua, real carts, on a microcontroller, at a playable frame rate. The next thing it needed was sound.