A fantasy console that boots on a dev board
I have been building a thing called Rusty Nail. It is a fantasy console, but instead of running as an app on a desktop like PICO-8 or Picotron, it runs bare-metal on a microcontroller. The board is an ST Nucleo-H753ZI: a Cortex-M7 at 400 MHz with 1 MB of RAM spread across a few banks. No operating system underneath, just Rust and the embassy async runtime.
The one-line version of the design is this:
The Nucleo is the computer. The Mac is the screen and input, for now.
The board renders a windowed pixel desktop into a framebuffer in its own RAM, runs little sideloaded programs in windows, and streams the finished frame to my Mac over USB. The Mac sends keyboard and mouse back. None of the actual computing happens on the Mac. It is a dumb terminal that draws what it is told and forwards input. The "for now" matters, and I will come back to it.
The framebuffer
The desktop is 320x240 at 8 bits per pixel, palettized through a 256-entry
colour lookup table. Double-buffered, that is 150 KB. It lives in the H7's AXI
SRAM (the D1 domain, at 0x24000000), because that is the bank the DMA engines
and the LCD controller can actually reach. The fast tightly-coupled memories
(DTCM, ITCM) are great for hot CPU code but out of reach of the general-purpose
DMA controllers and the LTDC, so the framebuffer cannot live there.
Getting that placement right needs a linker section. The framebuffers go in a
.framebuffers (NOLOAD) section placed right after .bss, set up in
memory.x with a build.rs that forces a relink when it changes. 150 KB of
framebuffer plus a few hundred bytes of statics plus the stack fits comfortably
in the 512 KB bank.
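As a rough sketch of the sizing and placement (the names `FrameStore` and `FRAMES`, the 32-byte alignment, and the exact attribute spelling are assumptions for illustration, not taken from the Rusty Nail source):

```rust
use core::cell::UnsafeCell;

pub const WIDTH: usize = 320;
pub const HEIGHT: usize = 240;
/// One byte per pixel, because every pixel is an index into the 256-entry palette.
pub const FRAME_BYTES: usize = WIDTH * HEIGHT; // 76_800 bytes = 75 KB
pub const BUFFER_COUNT: usize = 2; // front + back
pub const TOTAL_BYTES: usize = FRAME_BYTES * BUFFER_COUNT; // 153_600 bytes = 150 KB
pub const AXI_SRAM_BYTES: usize = 512 * 1024;

// Fail the build, not the boot, if the buffers ever outgrow the bank.
const _: () = assert!(TOTAL_BYTES <= AXI_SRAM_BYTES);

/// Both framebuffers in one block. The 32-byte alignment is an assumption,
/// chosen to match the Cortex-M7 data cache line size.
#[repr(C, align(32))]
pub struct FrameStore(UnsafeCell<[[u8; FRAME_BYTES]; BUFFER_COUNT]>);

// Safety (assumed): exactly one task writes a given buffer at a time.
unsafe impl Sync for FrameStore {}

// On the bare-metal target this lands in the NOLOAD section from memory.x.
// (On edition 2024 the attribute is spelled `unsafe(link_section = "...")`.)
// NOLOAD means startup code does not initialise it, so the firmware has to
// clear the buffers itself before the first frame.
#[cfg_attr(target_os = "none", link_section = ".framebuffers")]
pub static FRAMES: FrameStore =
    FrameStore(UnsafeCell::new([[0; FRAME_BYTES]; BUFFER_COUNT]));

/// Raw pointer to the start of buffer `index` (0 or 1).
pub fn frame_ptr(index: usize) -> *mut u8 {
    assert!(index < BUFFER_COUNT);
    // Cast the whole-array pointer to bytes, then step by whole frames.
    unsafe { (FRAMES.0.get() as *mut u8).add(index * FRAME_BYTES) }
}
```

The compile-time assertion is the useful part: the arithmetic in the text becomes something the compiler checks on every build.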
Present is a trait
The single most important decision was making the display output an abstraction before there was any real display to drive. Drawing and presenting are separate concerns. Everything draws into the RAM framebuffer through one path. Getting that framebuffer onto something a human can see is a trait:
    pub trait Present {
        type Error;
        fn format(&self) -> PixelFormat;    // Pal8 | Rgb565 | Rgb888
        fn supports_partial(&self) -> bool; // can it take a dirty rect?
        async fn present(
            &mut self,
            fb: &Framebuffer<'_>,
            info: &PresentInfo<'_>,
        ) -> Result<(), Self::Error>;
    }
The first backend streams over USB. The next ones drive a real SPI TFT, then an LTDC parallel-RGB panel, then HDMI through an external transmitter. They have genuinely different shapes, and the trait was designed so those differences do not leak into the rest of the system:
- The USB and SPI backends support partial updates. They take a dirty rectangle and only push the pixels that changed. That matters a lot for USB, where full-speed bandwidth (12 Mbit/s on the wire) is the bottleneck.
- The LTDC and HDMI backends do not support partial updates at all. They are scanout engines. You point them at the back buffer, they read it 60 times a second over DMA, and you swap buffers on vblank. There is no per-pixel CPU work to be partial about.
Designing partial support into the trait from the first commit, when the only backend was USB, is the kind of thing that feels like over-engineering until the second backend shows up and slots in without touching the compositor.
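To make the partial-versus-scanout split concrete, here is a minimal sketch of how a compositor might decide what region to hand to a backend. The `Rect` type and the `region_to_present` helper are hypothetical names for illustration; only the `supports_partial` flag comes from the trait above.

```rust
/// Axis-aligned rectangle in framebuffer pixels (hypothetical helper type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Rect {
    pub x: u16,
    pub y: u16,
    pub w: u16,
    pub h: u16,
}

impl Rect {
    /// The whole 320x240 desktop.
    pub const FULL: Rect = Rect { x: 0, y: 0, w: 320, h: 240 };

    /// Smallest rectangle covering both `self` and `other`. Used to merge
    /// several damaged areas into one dirty rect per frame.
    pub fn union(self, other: Rect) -> Rect {
        let x0 = self.x.min(other.x);
        let y0 = self.y.min(other.y);
        let x1 = (self.x + self.w).max(other.x + other.w);
        let y1 = (self.y + self.h).max(other.y + other.h);
        Rect { x: x0, y: y0, w: x1 - x0, h: y1 - y0 }
    }

    /// Pixel count, which at 8 bpp is also the byte count to push.
    pub fn area(self) -> u32 {
        self.w as u32 * self.h as u32
    }
}

/// Decide what to present this frame.
/// - Nothing dirty: nothing to do for any backend.
/// - Partial-capable backend (USB, SPI): send just the dirty rect.
/// - Scanout backend (LTDC, HDMI): the unit of work is the whole buffer.
pub fn region_to_present(supports_partial: bool, dirty: Option<Rect>) -> Option<Rect> {
    match (supports_partial, dirty) {
        (_, None) => None,
        (true, Some(rect)) => Some(rect),
        (false, Some(_)) => Some(Rect::FULL),
    }
}
```

A 20x20 cursor move costs 400 bytes on a partial backend against 76,800 for the full frame, which is the gap the `supports_partial` flag exists to exploit.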
The wire protocol
The USB link is a CDC serial port, and on top of it sits a small protocol I call fcproto. Frames are COBS-encoded with a single zero-byte delimiter. COBS (consistent overhead byte stuffing) removes every zero from the payload so the delimiter is unambiguous, which gives you a property that is worth more than it sounds: one-byte resync. If the viewer attaches mid-stream, or a byte gets mangled, the decoder just scans forward to the next zero and it is aligned again.
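For readers who have not met COBS before, a minimal host-side sketch follows. It uses `Vec` for brevity, where firmware would more likely use fixed buffers, and the function names are illustrative rather than fcproto's actual API.

```rust
/// COBS-encode `input`. The output contains no zero bytes; the caller
/// appends the single 0x00 delimiter after it.
pub fn cobs_encode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() + input.len() / 254 + 2);
    let mut code_idx = 0; // where the current block's length code goes
    let mut code: u8 = 1; // length code = bytes in block + 1
    out.push(0); // placeholder for the first code
    for &byte in input {
        if byte == 0 {
            // A zero ends the block: write its code and open a new block.
            out[code_idx] = code;
            code_idx = out.len();
            out.push(0);
            code = 1;
        } else {
            out.push(byte);
            code += 1;
            if code == 0xFF {
                // Block is full (254 data bytes); no zero is implied after it.
                out[code_idx] = code;
                code_idx = out.len();
                out.push(0);
                code = 1;
            }
        }
    }
    out[code_idx] = code;
    out
}

/// Decode one frame (delimiter already stripped). Returns `None` on a
/// malformed frame: a stray zero, or a code pointing past the end.
pub fn cobs_decode(input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        let code = input[i] as usize;
        if code == 0 {
            return None;
        }
        i += 1;
        let end = i + code - 1;
        if end > input.len() || input[i..end].contains(&0) {
            return None;
        }
        out.extend_from_slice(&input[i..end]);
        i = end;
        // Every block except a full one (0xFF) or the last stands for a zero.
        if code != 0xFF && i < input.len() {
            out.push(0);
        }
    }
    Some(out)
}

/// Resync in practice: split on the delimiter, drop whatever fails to decode.
/// The trailing piece is skipped because it has no delimiter yet.
pub fn decode_stream(stream: &[u8]) -> Vec<Vec<u8>> {
    let mut pieces: Vec<&[u8]> = stream.split(|&b| b == 0).collect();
    pieces.pop(); // unterminated tail (empty if the stream ended on a zero)
    pieces
        .into_iter()
        .filter(|piece| !piece.is_empty())
        .filter_map(|piece| cobs_decode(piece))
        .collect()
}
```

A truncated frame can occasionally decode into plausible bytes by accident, which is one reason a CRC inside each packet still matters.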
Each packet, before COBS, is a small header (version, opcode, flags, length), a payload, and a CRC-16. There are two classes of traffic:
- Video is best-effort. Frames are split into chunks, sent, and if one drops the world does not end because a fresh keyframe is coming. Dropped frames are fine. Latency is not.
- Control and sideload are transactional. Uploading a program is begin / chunk / end with acknowledgements and a CRC over the whole thing. Here a dropped byte is not fine, so this path is careful where the video path is loose.
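A sketch of what one pre-COBS packet could look like. The field widths, the little-endian byte order, and the choice of CRC-16/CCITT-FALSE are all assumptions made for this example; the text only says the header carries version, opcode, flags and length, and that a CRC-16 follows the payload.

```rust
pub const HEADER_LEN: usize = 5; // version, opcode, flags, length (u16 LE)
pub const CRC_LEN: usize = 2;

/// CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection.
/// (Assumed variant; the source does not say which CRC-16 fcproto uses.)
pub fn crc16(data: &[u8]) -> u16 {
    let mut crc: u16 = 0xFFFF;
    for &byte in data {
        crc ^= (byte as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 { (crc << 1) ^ 0x1021 } else { crc << 1 };
        }
    }
    crc
}

#[derive(Debug, PartialEq, Eq)]
pub struct Packet<'a> {
    pub version: u8,
    pub opcode: u8,
    pub flags: u8,
    pub payload: &'a [u8],
}

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    TooShort,
    LengthMismatch,
    BadCrc,
}

/// Serialise header + payload + CRC. COBS encoding happens after this step.
pub fn build(version: u8, opcode: u8, flags: u8, payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() <= u16::MAX as usize, "payload too large");
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len() + CRC_LEN);
    out.extend_from_slice(&[version, opcode, flags]);
    out.extend_from_slice(&(payload.len() as u16).to_le_bytes());
    out.extend_from_slice(payload);
    let crc = crc16(&out); // covers header and payload
    out.extend_from_slice(&crc.to_le_bytes());
    out
}

/// Validate and borrow a packet out of a COBS-decoded frame.
pub fn parse(raw: &[u8]) -> Result<Packet<'_>, ParseError> {
    if raw.len() < HEADER_LEN + CRC_LEN {
        return Err(ParseError::TooShort);
    }
    let (body, crc_bytes) = raw.split_at(raw.len() - CRC_LEN);
    let declared = u16::from_le_bytes([body[3], body[4]]) as usize;
    if body.len() != HEADER_LEN + declared {
        return Err(ParseError::LengthMismatch);
    }
    if crc16(body) != u16::from_le_bytes([crc_bytes[0], crc_bytes[1]]) {
        return Err(ParseError::BadCrc);
    }
    Ok(Packet {
        version: body[0],
        opcode: body[1],
        flags: body[2],
        payload: &body[HEADER_LEN..],
    })
}
```

On the video path a `BadCrc` just means dropping the chunk and waiting for the next keyframe. On the sideload path the same error would make the sender retry, and the whole-upload CRC is a separate check layered on top of this per-packet one.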
The nice part is that all of this is a no_std library that is unit-tested on
the host. COBS round-trips, corruption forces a resync, chunks reassemble out of
order, bad CRCs get rejected. None of it needs the hardware to test.
Tasks
On the device, embassy gives me async tasks on a single core. The important ones
are a compositor task (the only writer to the back buffer), a present task (owns
whichever impl Present is active), a USB task, an input task, and the app
tasks. Frame ownership flips between the compositor and the present task through
a couple of signals rather than a long-held lock, because on a single core a
held mutex is just a worse signal.
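The handoff can be pictured as a single-slot mailbox. The sketch below is a stand-in built on core atomics so it runs anywhere; it is not embassy's signal type, and the name `FrameHandoff` is made up for this example. One instance would carry "buffer N is ready" from compositor to present task, and a second would carry "buffer N is free again" the other way.

```rust
use core::sync::atomic::{AtomicU8, Ordering};

/// Sentinel meaning "nothing posted".
const NONE: u8 = u8::MAX;

/// Single-slot, latest-value-wins handoff of a buffer index.
/// Nobody holds anything across an await point: a post is one store,
/// a take is one swap.
pub struct FrameHandoff {
    slot: AtomicU8,
}

impl FrameHandoff {
    pub const fn new() -> Self {
        Self { slot: AtomicU8::new(NONE) }
    }

    /// Poster side: publish a buffer index. A newer post replaces an
    /// older one that was never taken.
    pub fn publish(&self, index: u8) {
        debug_assert!(index != NONE);
        // Release: pixel writes made before this are visible to the taker.
        self.slot.store(index, Ordering::Release);
    }

    /// Taker side: claim the posted index, if any, and clear the slot.
    pub fn take(&self) -> Option<u8> {
        match self.slot.swap(NONE, Ordering::Acquire) {
            NONE => None,
            index => Some(index),
        }
    }
}
```

An async signal adds waking the waiting task on top of this, but the ownership rule is the same: whoever took the index owns that buffer until they post it back.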
The watchdog is wired to the compositor. The hardware IWDG only gets fed from inside the compositor loop. If the compositor wedges, the board resets itself. That turns out to be the cleanest answer to a whole category of problems, which is a story for the post about running untrusted code.
Why bother
Most of the interesting logic (the framebuffer math, the compositor, the
protocol, the VM) is pure no_std Rust that runs and is tested on a laptop with
no hardware attached. The firmware is a thin layer of peripheral glue around
those libraries. That split is what makes the whole thing tractable. I can work
out the hard parts in a normal test loop and only reach for the debugger when
something is genuinely about the silicon.
The display abstraction is the spine. Everything downstream of "draw into RAM" is a backend, and the backends can show up in whatever order I get the hardware working. Next post: putting real PICO-8 Lua carts into one of those windows, on the actual chip.