A fantasy console that boots on a dev board

I have been building a thing called Rusty Nail. It is a fantasy console, but instead of running as an app on a desktop like PICO-8 or Picotron, it runs bare-metal on a microcontroller. The board is an ST Nucleo-H753ZI: a Cortex-M7 at 400 MHz with 1 MB of RAM spread across a few banks. No operating system underneath, just Rust and the embassy async runtime.

The one-line version of the design is this:

The Nucleo is the computer. The Mac is the screen and input, for now.

The board renders a windowed pixel desktop into a framebuffer in its own RAM, runs little sideloaded programs in windows, and streams the finished frame to my Mac over USB. The Mac sends keyboard and mouse back. None of the actual computing happens on the Mac. It is a dumb terminal that draws what it is told and forwards input. The "for now" matters, and I will come back to it.

The framebuffer

The desktop is 320x240 at 8 bits per pixel, palettized through a 256-entry colour lookup table. One frame is 320 x 240 = 76,800 bytes (75 KB), so double-buffered that is 150 KB. It lives in the H7's AXI SRAM (the D1 domain, at 0x24000000), because that is the bank the DMA engines and the LCD controller can actually reach. The fast tightly-coupled memories (DTCM, ITCM) are great for hot CPU code, but the general-purpose DMA controllers and the LTDC cannot reach them, so the framebuffer cannot live there.
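The budget is easy to pin down as constants. This is only a sketch of the arithmetic, not the project's actual code; the constant names and the row-major, no-padding layout are assumptions.

```rust
// Framebuffer budget for a 320x240 Pal8 desktop, double-buffered.
// All names here are illustrative, not taken from the real firmware.
pub const WIDTH: usize = 320;
pub const HEIGHT: usize = 240;
pub const BYTES_PER_PIXEL: usize = 1; // Pal8: one palette index per pixel
pub const BUFFER_COUNT: usize = 2; // front + back

pub const FRAME_BYTES: usize = WIDTH * HEIGHT * BYTES_PER_PIXEL;
pub const TOTAL_BYTES: usize = FRAME_BYTES * BUFFER_COUNT;

// The AXI SRAM bank is 512 KB.
pub const AXI_SRAM_BYTES: usize = 512 * 1024;

// Compile-time check that both buffers fit in the bank.
const _: () = assert!(TOTAL_BYTES <= AXI_SRAM_BYTES);

/// Byte offset of pixel (x, y) inside one frame.
/// Assumes a row-major layout with no padding between rows.
pub const fn pixel_offset(x: usize, y: usize) -> usize {
    y * WIDTH + x
}
```

The compile-time assert means a change to the resolution or pixel format that no longer fits the bank fails the build instead of failing on the board.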

Getting that placement right needs a linker section. The framebuffers go in a .framebuffers (NOLOAD) section placed right after .bss, set up in memory.x with a build.rs that forces a relink when it changes. 150 KB of framebuffer plus a few hundred bytes of statics plus the stack fits comfortably in the 512 KB bank.

Present is a trait

The single most important decision was making the display output an abstraction before there was any real display to drive. Drawing and presenting are separate concerns. Everything draws into the RAM framebuffer through one path. Getting that framebuffer onto something a human can see is a trait:

pub trait Present {
    type Error;
    fn format(&self) -> PixelFormat;        // Pal8 | Rgb565 | Rgb888
    fn supports_partial(&self) -> bool;     // can it take a dirty rect?
    async fn present(
        &mut self,
        fb: &Framebuffer<'_>,
        info: &PresentInfo<'_>,
    ) -> Result<(), Self::Error>;
}

The first backend streams over USB. The next ones drive a real SPI TFT, then an LTDC parallel-RGB panel, then HDMI through an external transmitter. They have genuinely different shapes, and the trait was designed so those differences do not leak into the rest of the system:

  • The USB and SPI backends support partial updates. They take a dirty rectangle and only push the pixels that changed. That matters a lot for USB, where full-speed bandwidth (12 Mbit/s) is the bottleneck.
  • The LTDC and HDMI backends do not support partial updates at all. They are scanout engines. You point them at a finished buffer, they read it 60 times a second over DMA, and you swap buffers on vblank. There is no per-pixel CPU work to be partial about.
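To make the partial-update idea concrete, here is a host-side sketch of pulling a dirty rectangle out of a row-major Pal8 buffer. The Rect type and the function names are hypothetical, and it uses Vec for brevity where a no_std backend would write into a fixed buffer.

```rust
/// A rectangle in framebuffer pixel coordinates (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rect {
    pub x: usize,
    pub y: usize,
    pub w: usize,
    pub h: usize,
}

/// Clamp a rectangle so it lies entirely inside the framebuffer.
pub fn clamp(r: Rect, fb_w: usize, fb_h: usize) -> Rect {
    let x = r.x.min(fb_w);
    let y = r.y.min(fb_h);
    Rect { x, y, w: r.w.min(fb_w - x), h: r.h.min(fb_h - y) }
}

/// Copy only the dirty pixels out of a row-major Pal8 framebuffer.
/// `fb` must hold at least `fb_w * fb_h` bytes.
pub fn extract_dirty(fb: &[u8], fb_w: usize, fb_h: usize, dirty: Rect) -> Vec<u8> {
    let r = clamp(dirty, fb_w, fb_h);
    let mut out = Vec::with_capacity(r.w * r.h);
    for row in r.y..r.y + r.h {
        let start = row * fb_w + r.x;
        out.extend_from_slice(&fb[start..start + r.w]);
    }
    out
}

/// Region a caller would hand to a backend: the dirty rectangle if the
/// backend reports partial support, otherwise the whole frame.
pub fn region_to_send(supports_partial: bool, dirty: Rect, fb_w: usize, fb_h: usize) -> Rect {
    if supports_partial {
        clamp(dirty, fb_w, fb_h)
    } else {
        Rect { x: 0, y: 0, w: fb_w, h: fb_h }
    }
}
```

A scanout backend would never call extract_dirty at all; it only needs the buffer address, which is why the capability is a query on the trait rather than a separate code path in the compositor.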

Designing partial support into the trait from the first commit, when the only backend was USB, is the kind of thing that feels like over-engineering until the second backend shows up and slots in without touching the compositor.

The wire protocol

The USB link is a CDC serial port, and on top of it sits a small protocol I call fcproto. Frames are COBS-encoded with a single zero-byte delimiter. COBS (Consistent Overhead Byte Stuffing) re-encodes the payload so that it contains no zero bytes, which makes the delimiter unambiguous and gives you a property that is worth more than it sounds: one-byte resync. If the viewer attaches mid-stream, or a byte gets mangled, the decoder just scans forward to the next zero and it is aligned again.
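For readers who have not met COBS, here is a compact host-side sketch of the standard algorithm plus a stream splitter that shows the resync behaviour. It is not fcproto's actual code; the function names are made up, and it allocates with Vec where a no_std version would use fixed buffers.

```rust
/// COBS-encode `input`. The result contains no zero bytes; the caller
/// appends the 0x00 delimiter when writing to the wire.
pub fn cobs_encode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() + input.len() / 254 + 2);
    let mut code_idx = 0; // position of the code byte for the current block
    let mut code: u8 = 1; // distance from the code byte to the next zero
    out.push(0); // placeholder, patched when the block closes
    for &b in input {
        if b != 0 {
            out.push(b);
            code += 1;
        }
        // Close the block on a zero byte or when it reaches 254 data bytes.
        if b == 0 || code == 0xFF {
            out[code_idx] = code;
            code_idx = out.len();
            out.push(0);
            code = 1;
        }
    }
    out[code_idx] = code;
    out
}

/// Decode one COBS frame (without its delimiter). None means malformed.
pub fn cobs_decode(input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        let code = input[i] as usize;
        let end = i + code;
        if code == 0 || end > input.len() {
            return None;
        }
        let block = &input[i + 1..end];
        if block.contains(&0) {
            return None;
        }
        out.extend_from_slice(block);
        i = end;
        // A code below 0xFF stands for a zero, unless the frame ends here.
        if code != 0xFF && i < input.len() {
            out.push(0);
        }
    }
    Some(out)
}

/// Split a raw byte stream on 0x00 and decode every terminated frame.
/// Pieces that fail to decode are skipped; unterminated trailing bytes are ignored.
pub fn decode_stream(stream: &[u8]) -> Vec<Vec<u8>> {
    let mut frames = Vec::new();
    let mut start = 0;
    for (i, &b) in stream.iter().enumerate() {
        if b == 0 {
            if i > start {
                if let Some(frame) = cobs_decode(&stream[start..i]) {
                    frames.push(frame);
                }
            }
            start = i + 1;
        }
    }
    frames
}
```

One caveat: a truncated frame can occasionally decode into something that looks valid, so COBS only restores alignment. Rejecting garbage is the CRC's job.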

Each packet, before COBS, is a small header (version, opcode, flags, length), a payload, and a CRC-16. There are two classes of traffic:
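A sketch of what building and checking such a packet could look like. The field widths (one byte each for version, opcode and flags, a little-endian u16 length), the CRC variant (CRC-16/CCITT-FALSE) and every name below are assumptions made for illustration, not fcproto's real layout.

```rust
/// CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF,
/// no reflection, no final XOR. The variant is an assumption.
pub fn crc16(data: &[u8]) -> u16 {
    let mut crc: u16 = 0xFFFF;
    for &byte in data {
        crc ^= (byte as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 { (crc << 1) ^ 0x1021 } else { crc << 1 };
        }
    }
    crc
}

/// Assumed header: version, opcode, flags (1 byte each) + length (u16 LE).
pub const HEADER_LEN: usize = 5;

/// Build header + payload + CRC. COBS encoding would happen after this.
pub fn build_packet(version: u8, opcode: u8, flags: u8, payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > u16::MAX as usize {
        return None; // does not fit the assumed 16-bit length field
    }
    let mut pkt = Vec::with_capacity(HEADER_LEN + payload.len() + 2);
    pkt.extend_from_slice(&[version, opcode, flags]);
    pkt.extend_from_slice(&(payload.len() as u16).to_le_bytes());
    pkt.extend_from_slice(payload);
    let crc = crc16(&pkt); // covers header and payload
    pkt.extend_from_slice(&crc.to_le_bytes());
    Some(pkt)
}

/// A parsed view that borrows the payload from the raw bytes.
pub struct Packet<'a> {
    pub version: u8,
    pub opcode: u8,
    pub flags: u8,
    pub payload: &'a [u8],
}

/// Validate the CRC and the length field, then return a borrowed view.
pub fn parse_packet(raw: &[u8]) -> Option<Packet<'_>> {
    if raw.len() < HEADER_LEN + 2 {
        return None;
    }
    let (body, crc_bytes) = raw.split_at(raw.len() - 2);
    if crc16(body) != u16::from_le_bytes([crc_bytes[0], crc_bytes[1]]) {
        return None;
    }
    let len = u16::from_le_bytes([body[3], body[4]]) as usize;
    let payload = &body[HEADER_LEN..];
    if payload.len() != len {
        return None;
    }
    Some(Packet { version: body[0], opcode: body[1], flags: body[2], payload })
}
```

Checking the CRC before trusting the length field means a corrupted header cannot send the parser off reading the wrong number of bytes.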

  • Video is best-effort. Frames are split into chunks, sent, and if one drops the world does not end because a fresh keyframe is coming. Dropped frames are fine. Latency is not.
  • Control and sideload are transactional. Uploading a program is begin / chunk / end with acknowledgements and a CRC over the whole thing. Here a dropped byte is not fine, so this path is careful where the video path is loose.
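The transactional path can be pictured as a small receiver state machine. This is a hedged sketch only: the message shapes, the sequence numbering, the Ack/Nak replies and the CRC variant are all assumptions, and the CRC helper is repeated so the block stands alone.

```rust
/// CRC-16/CCITT-FALSE over the whole uploaded image (assumed variant).
pub fn image_crc16(data: &[u8]) -> u16 {
    let mut crc: u16 = 0xFFFF;
    for &byte in data {
        crc ^= (byte as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 { (crc << 1) ^ 0x1021 } else { crc << 1 };
        }
    }
    crc
}

/// Hypothetical sideload messages, already parsed out of packets.
pub enum Msg<'a> {
    Begin { total_len: usize },
    Chunk { seq: u32, data: &'a [u8] },
    End { crc: u16 },
}

#[derive(Debug, PartialEq)]
pub enum Reply {
    Ack,
    Nak,
}

/// Receiver side of a begin / chunk / end upload.
pub struct Sideload {
    buf: Vec<u8>,
    expected_len: usize,
    next_seq: u32,
    active: bool,
}

impl Sideload {
    pub fn new() -> Self {
        Sideload { buf: Vec::new(), expected_len: 0, next_seq: 0, active: false }
    }

    /// Handle one message and return the acknowledgement to send back.
    pub fn handle(&mut self, msg: Msg<'_>) -> Reply {
        match msg {
            Msg::Begin { total_len } => {
                self.buf.clear();
                self.expected_len = total_len;
                self.next_seq = 0;
                self.active = true;
                Reply::Ack
            }
            Msg::Chunk { seq, data } => {
                // Strictly in order: a gap or an overrun is refused, and the
                // state is left untouched so the sender can retry.
                let fits = self.buf.len() + data.len() <= self.expected_len;
                if !self.active || seq != self.next_seq || !fits {
                    return Reply::Nak;
                }
                self.buf.extend_from_slice(data);
                self.next_seq += 1;
                Reply::Ack
            }
            Msg::End { crc } => {
                let ok = self.active
                    && self.buf.len() == self.expected_len
                    && image_crc16(&self.buf) == crc;
                self.active = false;
                if ok { Reply::Ack } else { Reply::Nak }
            }
        }
    }

    /// The received bytes; only meaningful after End was acknowledged.
    pub fn program(&self) -> &[u8] {
        &self.buf
    }
}
```

The contrast with video is that nothing here is best-effort: every step either advances the transaction or is refused, and the program is only usable once the final CRC has matched.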

The nice part is that all of this is a no_std library that is unit-tested on the host. COBS round-trips, corruption forces a resync, chunks reassemble out of order, bad CRCs get rejected. None of it needs the hardware to test.
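As an illustration of the kind of logic that is easy to test on the host, here is a sketch of a reassembler that accepts video chunks in any order. The fixed chunk size, the frame id and the Reassembler type are assumptions, and "newer frame" is simplified to "different frame id".

```rust
/// Reassembles one frame at a time from fixed-size chunks (hypothetical type).
pub struct Reassembler {
    frame_len: usize,
    chunk_size: usize,
    frame_id: Option<u16>,
    buf: Vec<u8>,
    got: Vec<bool>, // one flag per chunk index
    remaining: usize,
}

impl Reassembler {
    pub fn new(frame_len: usize, chunk_size: usize) -> Self {
        assert!(chunk_size > 0);
        let chunks = (frame_len + chunk_size - 1) / chunk_size;
        Reassembler {
            frame_len,
            chunk_size,
            frame_id: None,
            buf: vec![0; frame_len],
            got: vec![false; chunks],
            remaining: chunks,
        }
    }

    /// Feed one chunk. Returns the full frame when the last missing chunk
    /// arrives, otherwise None. Invalid or duplicate chunks are ignored.
    pub fn push(&mut self, frame_id: u16, index: usize, data: &[u8]) -> Option<&[u8]> {
        if self.frame_id != Some(frame_id) {
            // Chunk from another frame: abandon whatever was in flight.
            self.frame_id = Some(frame_id);
            self.got.iter_mut().for_each(|g| *g = false);
            self.remaining = self.got.len();
        }
        if index >= self.got.len() || self.got[index] {
            return None;
        }
        let start = index * self.chunk_size;
        // Every chunk is full-size except possibly the last one.
        let expected = (self.frame_len - start).min(self.chunk_size);
        if data.len() != expected {
            return None;
        }
        self.buf[start..start + expected].copy_from_slice(data);
        self.got[index] = true;
        self.remaining -= 1;
        if self.remaining == 0 {
            Some(self.buf.as_slice())
        } else {
            None
        }
    }
}
```

A unit test for this is a handful of push calls in a shuffled order followed by one comparison, with no board on the desk.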

Tasks

On the device, embassy gives me async tasks on a single core. The important ones are a compositor task (the only writer to the back buffer), a present task (owns whichever impl Present is active), a USB task, an input task, and the app tasks. Frame ownership flips between the compositor and the present task through a couple of signals rather than a long-held lock, because on a single core a held mutex is just a worse signal.

The watchdog is wired to the compositor. The hardware IWDG only gets fed from inside the compositor loop. If the compositor wedges, the board resets itself. That turns out to be the cleanest answer to a whole category of problems, which is a story for the post about running untrusted code.

Why bother

Most of the interesting logic (the framebuffer math, the compositor, the protocol, the VM) is pure no_std Rust that runs and is tested on a laptop with no hardware attached. The firmware is a thin layer of peripheral glue around those libraries. That split is what makes the whole thing tractable. I can work out the hard parts in a normal test loop and only reach for the debugger when something is genuinely about the silicon.

The display abstraction is the spine. Everything downstream of "draw into RAM" is a backend, and the backends can show up in whatever order I get the hardware working. Next post: putting real PICO-8 Lua carts into one of those windows, on the actual chip.