December 5, 2012 at 10:00 am

How we built a Super Nintendo out of a wireless keyboard @Sifteo #Sifteo

In today’s world, video game consoles have become increasingly complex virtual worlds unto themselves. Shiny, high polygon count, immersive, but ultimately indirect. A video game controller is your gateway to the game’s world—but the gateway itself can be a constant reminder that you’re outside that world, looking in.

Likewise, the technology in these game consoles has become increasingly opaque. Decades ago, platforms like the Commodore 64 encouraged tinkerers and do-it-yourselfers of all kinds. You could buy commercial games, sure, but the manual that shipped with the C-64 also told you what you needed to know to make your own games, tools, or even robots. The manual included a full schematic, the components were in large through-hole packages, and most of them were commonly-available chips with published data sheets.

Fast forward three decades. Today’s video game consoles are as powerful and as complex as a personal computer, with elaborate security systems designed specifically to keep do-it-yourselfers out. They contain dozens of customized or special-purpose parts, and it takes some serious wizardry to do anything with them other than exactly what the manufacturer intended. These systems are discouragingly complicated. It’s hard to see any common link between the circuits you can build at home and the complex electrical engineering that goes into an Xbox 360 or PlayStation 3.

We wanted to build something different. Our platform has no controller, no television. The system itself is the game world. To make this happen, we had to take our engineering back to basics too. This is a game platform built using parts that aren’t fundamentally different from the Arduino or Maple boards that tens of thousands of makers are using right now.

This is the story of how we built the hardware behind the new Sifteo Cubes, our second generation of a gaming platform that’s all about tactile sensation and real, physical objects.

Beginnings

I’m Micah Elizabeth Scott, one of the engineers from the very small team that built this platform. When I started working for Sifteo in July 2011, the first generation Sifteo Cubes were about to go on sale. It was the product that the Sifteo founders created in order to bring their idea out to the world. When people outside our office started kicking the tires, they discovered what we already knew—there was something magical there, but it needed work.

It was the usual: better graphics, less money. Oh, and making it portable. One shortcut that allowed the Sifteo team to take their idea to market so quickly was that all games actually ran on your desktop or laptop computer. Games were written in a high-level language like Python or C#. They communicated with each cube via a 2.4 GHz wireless adapter, and the cubes acted like computer peripherals.

At first we tried taking some baby steps toward un-tethering our beloved cubes. We knew this was a problem we needed to solve fast, and anything we could reuse would help us toward that end. We tried to create a portable device that could take the place of the PC and communicate with the same cubes. It ran embedded Linux, and it looked like the charging dock. We called it SuperDock, and we invested quite a lot of time and money into it. Circuit boards, industrial design. I put together a distro with OpenEmbedded and wrote a kernel module for our radio hardware.

But it was clear that incremental change wouldn’t be enough. Games were written to expect a large CPU. The wireless protocol and graphics stack weren’t designed efficiently. The cubes used too much power, requiring a bigger and more expensive battery. Everything was interconnected, and everything needed to scale down. This was an uncertain time for the company, since any course forward would mean taking risks and going back to the drawing board.

We held a brainstorming meeting. We wanted to reduce our cubes down to their fundamental parts and eliminate unnecessary complexity, but we didn’t know what was really required to create the kind of tactile play we were interested in. We were open to trying just about anything, as long as it was fun and it seemed possible to create within our budget. A lot of ideas came out of this session, and most of them were extremely different from the first generation cubes.

Building a Lighter Bridge

Why were these designs so different? Well, we did look at doing a “cost down” on the first-generation design. Every part of it was too expensive. In so many kinds of engineering, scale ripples out. If you build a bridge with a really heavy surface, the supports for that surface have to be heavier too, and so on.

The original set included a 72 MHz ARM processor in each cube. It is roughly equivalent to the chip that powers the LeafLabs Maple board. This sounds paltry compared to the gigahertz processor in your phone, right? We’re so used to being surrounded by devices with chips powerful enough to run Linux, Android, or iOS. These chips aren’t even that expensive on their own. A top-of-the-line mobile phone CPU costs maybe $20. A more modest 375 MHz ARM might be only $7 in quantity. Surely in a consumer product that costs upwards of $100, we could afford to ship three or four of these chips, right?

Not even close, unfortunately. In the bridge analogy, that’s only the cost for the pavement. You need support structures: power conversion, batteries, battery chargers, memory, programming infrastructure. These are significant costs, especially batteries and RAM. Now you multiply everything by a markup factor to account for the cost of assembly, running the factory, supporting the retailers. Every dollar you spend on the CPU turns into at least three dollars of cost to the end user.

Even that modest 72 MHz ARM was too heavy. We couldn’t afford a one-size-fits-all design. We needed to build a lightweight bridge from the ground up, using parts that would get the job done without overburdening the rest of the structure.

Just Add Magic

After that brainstorming session, I compiled a spreadsheet with all of the possible CPUs we could use if we wanted to try and build an optimized version of our cubes. To hit our cost and power budget, we would have to aim very low. And yet, we were still trying to create a next generation product with better graphics and gameplay. This would require deep magic.

There was one sweet-spot on that spreadsheet that intrigued me. We had some experience at using Nordic Semiconductor’s nRF24L01 radio chips. These are really convenient little radios, and they find a home in everything from DIY robotics applications to remote sensing. If you’ve used the TI/ChipCon CC2500 chip, they’re very similar.

These radios are also really common in wireless mice, keyboards, and video game controllers. For highly integrated battery powered applications like these, Nordic also makes a related chip: the nRF24LE1. This chip includes an 8-bit 16 MHz microcontroller with 16 kB of program memory and 1¼ kB of RAM. It’s very similar to the ATmega168 chip used by most Arduino boards.

Could we use this part? It would have immense advantages in terms of making the whole bridge lighter. This system-on-a-chip combination is very power efficient. Furthermore, it does save money to have fewer chips in the system. When you buy one chip that includes a radio and a CPU, you’re only paying for packaging once. These chips also have some amazing economy of scale. Many different wireless keyboards and video game controllers use this chip, so we could get a better deal on it. It was an enticing possibility, but I had no idea whether it could be done.

Compelling Vaporware

Anyone, myself included, would have a hard time believing it was possible to redesign our game console around a chip no more powerful than the original Arduino. We needed a realistic demo. I had a graphics scheme in mind, but it would have required ordering parts and designing a PCB before we could build a hardware prototype. And even once we built the hardware, the software had better be correct the first time because there were very few debugging options available.

This wasn’t a problem unique to our project. In fact, chip designers have it pretty bad. Their solution: extensive use of software simulation. Simulators for integrated circuits are typically orders of magnitude slower than real-time, but they let you instantly try out changes that would have cost weeks and hundreds of thousands of dollars to try in silicon. Perhaps even more compellingly, they allow you to debug your design at a level of detail that wouldn’t be practical or even possible in the real world. So, I took a page from their book. I found an open source simulator for the 8051 instruction set, and used that as a starting point for building a special-purpose simulation to test this hypothetical hardware design.

After maybe a week of intense optimization and head-scratching, I had a prototype. The first new Sifteo Cube was a software simulation, and its firmware was a looping graphics demo. The simulation usually ran slower than real-time, but it would use accurate CPU cycle counts to calculate how many frames per second the real hardware would have been running at. This told me the design might actually work.
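The frame-rate estimate described above comes down to one division. The sketch below is only an illustration of that arithmetic: the function name and the 400,000-cycle figure are made up for the example, and only the 16 MHz clock comes from the text.

```python
CLOCK_HZ = 16_000_000  # 16 MHz, the clock speed quoted for the nRF24LE1

def frames_per_second(cycles_per_frame, clock_hz=CLOCK_HZ):
    """Frame rate the real chip would reach if one frame costs this many cycles."""
    return clock_hz / cycles_per_frame

# Hypothetical measurement: suppose the simulator counted 400,000 cycles for one frame.
print(frames_per_second(400_000))  # 40.0
```

Because the result depends only on the cycle count, the simulation can run much slower than real time and still predict the speed of the real hardware.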

Taking Inspiration

This was quite a challenge: creating fluid 2D graphics on a chip designed to run wireless mice and keyboards, a peer of the original Arduino. No separate graphics chip, no custom-designed silicon. Just software and some very simple hardware. To make this possible, I took inspiration from the 1990s.

Why is it difficult to do graphics on a slow microprocessor, anyway? Pixels are just data. Unfortunately, there are quite a lot of them. Even our modest 128×128 pixel displays contain 16,384 pixels. That’s 16 pixels for every byte of RAM available to us. If we want to draw 30 frames per second, that means drawing almost half a million pixels every second! At a clock speed of 16 MHz, that gives us only 32 clock cycles per pixel, or roughly 10 instructions. That needs to be enough time to handle our sensor input, decompress data that comes in over the radio, and compose our final video signal for the LCD. We don’t have nearly enough memory to store all the pixels, and have almost no time to compute them.
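For readers who want to check the numbers, here is the same budget worked out in a few lines. All inputs are the figures quoted above; the variable names are just for illustration.

```python
# Back-of-the-envelope pixel budget, using the figures quoted in the text.
WIDTH, HEIGHT = 128, 128   # display size in pixels
FPS = 30                   # target frame rate
CLOCK_HZ = 16_000_000      # 16 MHz CPU clock

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FPS
cycles_per_pixel = CLOCK_HZ / pixels_per_second

print(pixels_per_frame)             # 16384
print(pixels_per_second)            # 491520
print(round(cycles_per_pixel, 1))   # 32.6
```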

Most classic 2D video game consoles faced the same challenge. In a previous life, I spent a lot of time reverse engineering and programming classic game consoles. I spent the most time with Nintendo’s portable Game Boy systems, but I also admired the architecture of the NES, Super Nintendo, and Atari 2600. All of these consoles managed to create graphics and sound that seemed far beyond the capabilities of their modest CPU power. They had some help from custom-designed chips, but the real magic came from a completely different approach to drawing.

One might call it cheating, but really the only winning move was not to play. Systems like the NES, Game Boy, and Super Nintendo solved this problem by opting out of pixels altogether. By operating on a larger unit, an 8×8 pixel “tile”, they gained an amazing amount of leverage. For example, one screenful of graphics on the NES would be 61,440 pixels. This would have been far too big to store on the game cartridges of the day, much less in RAM. But that same screen is only 960 tiles, which easily fit in the 2 kB of video memory.
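The NES figures above can be reproduced directly, assuming the usual 256×240 pixel screen and one byte per tile index.

```python
SCREEN_W, SCREEN_H = 256, 240   # NES screen size in pixels
TILE = 8                        # tiles are 8x8 pixels

pixels = SCREEN_W * SCREEN_H
tiles = (SCREEN_W // TILE) * (SCREEN_H // TILE)

print(pixels)            # 61440
print(tiles)             # 960
print(pixels // tiles)   # 64 pixels described by each tile index
```

At one byte per tile index, those 960 entries take 960 bytes, which is why they fit comfortably in 2 kB (2,048 bytes) of video memory.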

This kind of leverage was common in the early days of PC graphics too. In 8-bit color modes, each pixel was a single byte stand-in, replacing a larger 18-bit or 24-bit color that lived in a colormap. Text modes used bytes which stood for characters, but the image of each possible character lived in a separate font table. These 2D video game consoles had similar tables, mapping an 8-bit tile index to an image of that tile, then using a colormap to expand that image into the final pixels you see on your TV. It was really a form of data compression. Unlike more modern algorithms, however, this kind of compression was designed so that the game engine and even the artists work directly with compressed graphics. They needed all the leverage they could get.
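The two lookups described above (tile index to tile image, then colormap to final color) can be sketched in a few lines. This is a toy model rather than any console's actual data layout: the 2×2 tile size, the table contents, and the names `tile_table`, `palette`, and `render` are all invented to keep the example small.

```python
# Hypothetical 2x2-pixel tiles keep the example tiny (real consoles used 8x8).
TILE_SIZE = 2
tile_table = {
    0: [[0, 0], [0, 0]],   # blank tile
    1: [[1, 2], [2, 1]],   # checker tile
}
palette = {0: "black", 1: "red", 2: "white"}   # colormap: small index -> final color

# The "screen" is stored as tile indices only: 4 entries describe 16 pixels.
tile_map = [[1, 0],
            [0, 1]]

def render(tile_map, tile_table, palette, tile_size):
    """Expand tile indices into final pixel colors, one pixel at a time."""
    height = len(tile_map) * tile_size
    width = len(tile_map[0]) * tile_size
    frame = []
    for y in range(height):
        row = []
        for x in range(width):
            tile_index = tile_map[y // tile_size][x // tile_size]
            color_index = tile_table[tile_index][y % tile_size][x % tile_size]
            row.append(palette[color_index])
        frame.append(row)
    return frame

frame = render(tile_map, tile_table, palette, TILE_SIZE)
print(frame[0])   # ['red', 'white', 'black', 'black']
```

Changing one entry in `tile_map` redraws a whole tile, and changing one entry in `palette` recolors every pixel that uses it, which is the leverage the text describes.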

Pixel Pipeline

This tile-based graphics strategy would end up serving our needs really well. I wrote a graphics engine guide for our SDK documentation which describes our particular approach in a lot more detail. But there was still one big missing piece before we could implement tile-based graphics on this tiny chip.

The leverage gained by using tiles instead of pixels would make our graphics data small enough to fit in memory, but we still needed some way of sending pixel data to our LCD hardware. All of the classic game consoles had custom-designed chips to do this job. These chips interfaced directly with multiple kinds of memory. The NES used ROM for tile lookup tables and RAM for tile indices. Its graphics chip would use the settings in RAM to route pixel data from the correct part of ROM, colorize it according to the active palettes, then send it in real-time to the television:

This custom “Picture Processor” (PPU) chip had a lot of nice features: hardware support for sprites, scrolling, palette swaps. The critical functionality, though, was that it kept the high-bandwidth pixel data (the thick line above) out of the CPU. Software could make changes to the tile numbers in RAM at a relatively slow pace while the PPU streamed video out to the television without skipping a beat. We needed to figure out how to do this without any custom silicon. Luckily, the past few decades of advances in electronics have changed the game enough to make this possible.

Firstly, we aren’t using a television any more. Analog TV represents video as a waveform with very rigid timing. The NES PPU chip had to produce each pixel at exactly the right time, and it had no ability to pause the output to wait for other parts of the system to catch up. This alone makes it very challenging to generate analog television signals without some amount of hardware support. Projects like the Uzebox use very carefully timed software loops to generate analog video without any hardware support at all, but that comes at a cost. The Uzebox spends quite a lot of its CPU time just copying pixels.

Instead of a television, we use a small LCD module with a built-in controller chip, similar to many of the modules you see in the Adafruit shop. These chips are very common and inexpensive, and they contain enough RAM to buffer a single full frame of video. This RAM isn’t plentiful enough or fast enough to use the same way a general-purpose computer uses its framebuffer, but it does let us output pixel data at whatever rate we like.

Secondly, we could make a classic memory vs. time trade-off. The NES has additional processing in the PPU which lets it use less memory to store graphics in its cartridge ROMs. But if we’re okay with storing those graphics in a much larger form, we can avoid performing some of those processing steps. I chose to use a parallel Flash memory chip to store pixel data in exactly the same format that the LCD module expects it in. By wiring up the flash memory, LCD controller, and CPU in a clever way, I gave my firmware the ability to send bursts of contiguous pixels directly from Flash to LCD by simply counting upwards on one of the CPU’s 8-bit I/O ports.
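To make the idea concrete, here is a small software model of that kind of burst transfer. It is a guess at the general shape, not the real wiring or firmware: the `Lcd` class, the `burst` function, and the split into a high address and an 8-bit low address are assumptions made for illustration.

```python
class Lcd:
    """Stand-in for the LCD controller: it latches whatever byte is on the bus."""
    def __init__(self):
        self.received = bytearray()

    def strobe(self, byte):
        self.received.append(byte)

def burst(flash, lcd, high_address, start_low, count):
    """Send `count` contiguous bytes by counting the low address byte upward.

    In this model the CPU never touches the pixel data itself; it only
    changes the address presented to the flash chip.
    """
    low = start_low
    for _ in range(count):
        address = (high_address << 8) | low
        lcd.strobe(flash[address])   # flash output drives the LCD data bus
        low = (low + 1) & 0xFF       # 8-bit port: wraps around at 256

flash = bytes(range(256)) * 4        # 1 kB of dummy "pixel" data
lcd = Lcd()
# One row of an 8x8 tile at 16 bits per pixel is 16 bytes.
burst(flash, lcd, high_address=2, start_low=16, count=16)
print(len(lcd.received))   # 16
```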

It’s worth noting that this was a fairly extreme memory/time exchange. Classic game consoles stored the pixels for each tile at a very low color depth, typically 1 to 4 bits per pixel. We would need to use our LCD’s native 16-bit format. This makes our graphics a full 8x the size of equivalent graphics in a NES cartridge. We can compress this data when it traverses the radio link, but it must remain uncompressed in flash memory. Thankfully, in modern terms the amount of memory we need is small. We use a 4 MB flash device which can store 32,768 tiles.
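The sizes quoted above follow from simple arithmetic. The only figure below that is not in the text is the 2 bits per pixel assumed for NES tile data, which is what the 8x ratio implies.

```python
TILE_PIXELS = 8 * 8
LCD_BITS_PER_PIXEL = 16
NES_BITS_PER_PIXEL = 2            # assumption: NES tile data at 2 bits per pixel
FLASH_BYTES = 4 * 1024 * 1024     # 4 MB flash device

bytes_per_tile = TILE_PIXELS * LCD_BITS_PER_PIXEL // 8
tiles_in_flash = FLASH_BYTES // bytes_per_tile
size_ratio = LCD_BITS_PER_PIXEL // NES_BITS_PER_PIXEL

print(bytes_per_tile)   # 128
print(tiles_in_flash)   # 32768
print(size_ratio)       # 8
```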

At this point, we had a promising combination. The hardware was all commonly available and relatively DIY-friendly, the CPU would have enough time to draw multiple layers of tile graphics with several sprites, the total cost was within budget, and I had a software simulation with the debugging and unit-testing features we would need in order to take this platform from a proof-of-concept to a product. Now it just needed to run games.

Burning the Candle at Both Ends

To make this project work we were asking a chip no more powerful than an Arduino to compose smooth 2D video, a task well beyond its pay grade. On the other side of the radio link, another processor would be running our games.

The processor we chose is a 72 MHz STM32 with an ARM Cortex-M3 core. It’s almost the same chip used in the Maple. We were familiar with these processors from using them in the first generation of Sifteo Cubes. But now, instead of running the cubes themselves, this processor was running our games. Yet again we found an economic sweet spot where our job seemed possible, if inconvenient.

This chip has 64 kB of RAM, 128 kB of built-in Flash, and no hardware memory protection or paging. We added an external 16 MB serial flash chip for storing games and saved data. It would have been wonderful to have an MMU and a microSD card slot, but our design choices were ruled by power consumption. The Base needed to run all day off of two AAA batteries, with the vast majority of that power going to the radio and speaker.

This chip needs to simultaneously run downloaded games, stream data from our low-power serial flash, synthesize music and sound effects, and communicate with over a dozen cubes. It would need to give us everything it’s got.

Safety Dance

When it came to running downloaded games on this CPU, none of our options looked good. For many months, we had been sidestepping this problem by compiling our games directly into the Base’s firmware. But re-flashing the CPU to run each game was really no good. It would be slow, the size of a game would be seriously constrained, and it would wear out the built-in flash memory relatively fast. Even during development, this kludge became a major thorn in our side. At one point Liam spent a hellish week porting our codebase to a larger microcontroller when we ran out of flash, just so we could continue our scheduled tests and demos.

We could copy games from external flash to RAM and execute them there. That limits a game’s size even further, since RAM is our most constrained resource. Additionally, all of these solutions allow games to have full, unrestricted access to the Base’s hardware. This may sound like a good thing at first, but such low-level access has consequences. Any game you download from the store could, either accidentally or maliciously, configure the system’s hardware in a way that would permanently damage it. This was something we wanted to avoid if at all possible!

This low-end CPU didn’t give us any way to protect our hardware from buggy or malicious games, nor did it give us a way to stream code and data in from external flash. So, I did what seemed natural when faced with a machine that’s missing some key capabilities. I built a virtual machine inside it.

Direct Execution

Virtual machines are a complex topic, and there are a few different axes along which you can categorize VMs. First, what does it run? Some VMs are designed to exactly emulate an existing computer architecture. Some VMs are designed to run a particular programming language, and the specifics of the compiled code are more of an implementation detail. And further still, some VMs are designed to mostly emulate an existing computer architecture, but with allowances that make it easier to virtualize. This last category is often termed paravirtualization, and it’s the chief strategy used by the Xen hypervisor.

Second, how does the VM run code? The simplest VM would use emulation. In software, it would fetch and decode the next instruction, then perform a sequence of operations that models the machine in question. This strategy works with any combination of virtual and physical machine architectures, but it’s very slow.
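A fetch, decode, and execute loop of this kind fits in a few lines. The instruction set below (`LOAD`, `ADD`, `JNZ`, `HALT`) is a made-up toy, not any real architecture; it only shows why every virtual instruction costs many real ones.

```python
def emulate(program, max_steps=1000):
    """Interpret a hypothetical toy instruction set one instruction at a time."""
    acc = 0   # single accumulator register
    pc = 0    # program counter
    for _ in range(max_steps):
        opcode, operand = program[pc]   # fetch
        pc += 1
        if opcode == "LOAD":            # decode and execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "JNZ":           # jump if the accumulator is nonzero
            if acc != 0:
                pc = operand
        elif opcode == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    raise RuntimeError("step limit reached")

# Count down from 3 to 0, then stop.
program = [("LOAD", 3), ("ADD", -1), ("JNZ", 1), ("HALT", 0)]
result = emulate(program)
print(result)   # 0
```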

The ideal way to implement a VM would be with special-purpose hardware. This is how the first VMs ran on mainframes, and it’s the method used by most modern virtualization packages on desktops, laptops, and servers.

If you want the best performance possible but you don’t have special-purpose hardware available, you can compromise. This was the original approach used by software virtualization packages like VMware before the CPU vendors caught up. If you can determine ahead of time that a chunk of code is safe to run, you can directly execute it. In other words, the virtual machine’s instructions will run directly on the physical machine. For this to be safe, those instructions must have the same effect when run on the physical machine as they would have had running inside an emulation of the virtual machine.

PC virtualization packages like VMware have to make this determination on-the-fly. But our system could act more like the kind of special-purpose VM one might build for a new programming language. We could engineer the virtual machine such that the extra work necessary to make code safe can be done ahead of time, by the compiler. When a game runs on our VM, we can perform a quick check to make sure the code is in fact safe, then we can use direct execution to run that chunk of code at full-speed. This works well if we can design the VM such that it’s very fast to validate the safety of some code, even if it’s slow to actually make that code safe.

In order to efficiently run large games using only a small amount of RAM, we divide the game’s code and data into small pages, which are fetched on-demand from external flash into a small RAM-based cache. Some of these pages contain data, and some of them contain instructions for the Virtual Machine. If a game tries to execute code from a new page, we can run a speedy validation algorithm on that page to ensure that all of its code is “safe”.
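On-demand paging of this kind can be sketched as a small cache in front of a large, slow store. Everything specific here is an assumption for illustration: the 256-byte page size, the two-page capacity, the least-recently-used eviction policy, and the `PageCache` name are not taken from the real firmware.

```python
from collections import OrderedDict

PAGE_SIZE = 256   # hypothetical page size in bytes

class PageCache:
    """Small RAM cache in front of a large external flash (modelled as bytes)."""
    def __init__(self, flash, capacity):
        self.flash = flash
        self.capacity = capacity      # number of pages that fit in RAM
        self.pages = OrderedDict()    # page number -> page contents
        self.misses = 0

    def page(self, number):
        if number in self.pages:
            self.pages.move_to_end(number)        # mark as recently used
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)    # evict least recently used
            start = number * PAGE_SIZE
            self.pages[number] = self.flash[start:start + PAGE_SIZE]
        return self.pages[number]

    def read_byte(self, address):
        return self.page(address // PAGE_SIZE)[address % PAGE_SIZE]

flash = bytes(i % 251 for i in range(PAGE_SIZE * 8))   # 2 kB of dummy data
cache = PageCache(flash, capacity=2)
addresses = (0, 1, 300, 2, 1000, 301)
values = [cache.read_byte(a) for a in addresses]
print(cache.misses)   # 4
```

In a scheme like the one described, the miss path is also a natural place to validate a code page before anything executes from it.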

In our case, “safe” code only contains a small subset of the available CPU instructions. Safe code can only transfer control to other safe code within the same page. Many common operations can’t be performed using this small subset of instructions, so we rely on system calls into the firmware, asking it to do these operations on behalf of the virtualized code. System calls allow us to do additional safety checks at runtime.
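The two rules above lend themselves to a single linear pass over a page. The sketch below uses a made-up instruction representation and a hypothetical allowed set (`ALLOWED`, `is_page_safe`); the real validator works on machine code and its exact rules are not given in the text.

```python
ALLOWED = {"LOAD", "ADD", "BRANCH", "SYSCALL"}   # hypothetical safe subset

def is_page_safe(page):
    """Return True if every instruction is allowed and every branch stays in the page."""
    for opcode, operand in page:
        if opcode not in ALLOWED:
            return False
        if opcode == "BRANCH" and not (0 <= operand < len(page)):
            return False
    return True

good = [("LOAD", 1), ("BRANCH", 0), ("SYSCALL", 7)]
bad_opcode = [("LOAD", 1), ("WRITE_HW_REGISTER", 0)]   # not in the safe subset
bad_branch = [("BRANCH", 99)]                          # target outside the page

print(is_page_safe(good))         # True
print(is_page_safe(bad_opcode))   # False
print(is_page_safe(bad_branch))   # False
```

The check is cheap because it never has to understand what the code does, only that each instruction is on the list and each jump lands inside the page.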

This type of virtualization is most similar to Google’s Native Client project, a way of running untrusted code safely inside a web browser. Like Native Client, we rely on a specialized compiler to produce code which is easy to validate ahead-of-time. Unlike Native Client, our VM must operate without any hardware memory protection. This means that the subset of CPU instructions we allow is even more restrictive, and we do memory virtualization entirely in software. Our VM is slower than native code, but still much faster than the emulation approach used by other common microcontroller platforms like the Netduino and BASIC Stamp.

Greetz to the Demoscene

At this point, we had a graphics engine and a way to run downloadable games. But our gaming experience was really only half complete. We had no sound.

We considered various strategies for music and sound effects. For a while we were using the open source Speex codec to store compressed music. But remember, we’re on a really tight space budget due to our power constraints. Even with very high levels of compression, there’s only so much music you can store in a few megabytes. We needed to think differently about the problem. Instead of storing compressed music, we needed to synthesize music in real-time.

This is another problem that was solved a while ago in a different context. Many older video games would benefit from having a way to store music more efficiently, but the technique we used was most well-known in the demoscene, a community of creative coders who are masters at squeezing every drop of potential from the available hardware.

This is a Tracker, a type of music sequencer that originated on the Amiga in 1987. Trackers are based on sampling. Similar to how many electronic musicians create a track by reassembling samples of other recordings, a Tracker does this in real-time to assemble a full song from many tiny recordings of individual instruments. The memory used by these tiny recordings plus the pattern of notes would be far less than the memory used by a recording of the complete song, just as a MIDI file can be much smaller than an MP3. Unlike MIDI, Trackers give the artist exact control over how their instruments will sound.
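The core of sample-based playback can be sketched briefly: read each instrument recording at a speed that sets its pitch, then sum the channels. This is a heavily simplified illustration, not the Sifteo engine or any tracker file format. The instrument data, the `play_note` and `render` names, and the rule that every note is cut off at the end of its row are assumptions made to keep it short.

```python
def play_note(sample, step, length):
    """Resample an instrument by advancing `step` source samples per output sample.

    A larger step plays the recording faster, which raises its pitch.
    Output is silent (0) once the recording runs out.
    """
    out = []
    position = 0.0
    for _ in range(length):
        index = int(position)
        out.append(sample[index] if index < len(sample) else 0)
        position += step
    return out

def render(pattern, instruments, row_length):
    """Mix every channel of every row into one stream of output samples."""
    output = []
    for row in pattern:
        mixed = [0] * row_length
        for note in row:
            if note is None:          # this channel is silent on this row
                continue
            name, step = note
            for i, value in enumerate(play_note(instruments[name], step, row_length)):
                mixed[i] += value
        output.extend(mixed)
    return output

# Two tiny "recordings" and a two-row, two-channel pattern.
instruments = {"kick": [8, 4, 2, 1], "lead": [1, 2, 3, 4]}
pattern = [
    [("kick", 1.0), ("lead", 1.0)],
    [None,          ("lead", 2.0)],
]
song = render(pattern, instruments, row_length=4)
print(song)   # [9, 6, 5, 5, 1, 3, 0, 0]
```

The stored data is just the two short recordings plus the pattern, while the rendered output can be as long as the pattern makes it.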

There were existing tracker playback engines that almost fit the bill. ChibiXM, a tracker designed for iOS games, was close. But it turned out that our tracker needed to be closely integrated with our audio mixer and virtual memory subsystems to achieve the performance we needed. My colleague Scott Perry wrote a new tracker engine for Sifteo that was perfectly tailored to the project’s needs.

Fill the Toolbox

Along the way, we ended up writing a lot of custom tools. In fact, for every line of code in the system’s firmware, there was another line of code purely for development and test tools.

To generate code for our special-purpose virtual machine, I used the LLVM libraries to build a special-purpose code generator and linker. I wrote another custom tool to compress and optimize our graphics, using some of the same techniques used to compress video for the web.

The emulator I wrote originally as a proof-of-concept grew up into a full-featured platform simulator that forms the cornerstone of our freely available software development kit. I used an embedded Lua interpreter to add unit testing capabilities to this simulator. This is how we ensured the quality of our firmware builds before they ever met real hardware.

I faced some challenging speed vs. accuracy trade-offs when developing our simulator. It needs to be fast enough to use as a game development tool, but accurate enough to catch firmware bugs early. I designed the cube simulation with two execution modes: a 100% cycle-accurate mode which emulates each instruction, and a static binary translation mode in which entire basic blocks of microcontroller code are pre-translated to native x86 code. This latter mode is the default, and it is cycle accurate in most cases. In both modes, all hardware peripherals are cycle-accurate.

Simulating the Base is another story. We didn’t have as much incentive to write a cycle-accurate hardware model for it. The Base’s firmware and hardware design were much less risky than the Cube, and we expected to need less low-level debugging assistance. So on the Base side, we recompile the firmware with an alternate set of hardware abstraction libraries. These simulation-only libraries implement hardware features using the APIs available on the host operating system. Radio transactions are sent to another thread which runs the simulated cubes. Both simulated systems use a common virtual clock, and they synchronize with each other when necessary. The main disadvantage is a lack of precision in modeling the speed of the Base’s firmware. We find that this approach is still substantially bug-for-bug compatible with the real hardware, and it’s much more efficient than emulation. It happens to be the same approach used by Apple’s iOS simulator.
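The general pattern of swapping the hardware layer underneath unchanged firmware logic can be shown in miniature. The names here (`SimulatedRadio`, `firmware_tick`) and the two-byte packet are invented for the example; the real firmware is not written in Python, and its interfaces are not described in the text.

```python
import queue

class SimulatedRadio:
    """Simulation-only backend: 'transmits' by handing packets to a queue
    that a simulated-cube thread could read from."""
    def __init__(self):
        self.outbox = queue.Queue()

    def send(self, packet):
        self.outbox.put(packet)

def firmware_tick(radio, cube_id):
    """Stand-in for firmware logic. It only knows the `send` interface,
    so the same code could run against a real radio driver."""
    radio.send(bytes([cube_id, 0x01]))

radio = SimulatedRadio()
firmware_tick(radio, 3)
packet = radio.outbox.get_nowait()
print(packet)   # b'\x03\x01'
```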

I’ve always believed in investing in good tools, whether they’re physical tools or software tools. Putting in the time early-on to build a powerful simulator and compiler would give us immense freedom to add powerful debugging and optimization tools. And I can’t overstate the importance of thorough unit testing. When you have confidence in your code’s correctness, you have so much more freedom to optimize it or refactor it.

Give Yourself a Lever Long Enough

Thank you for reading about our journey. It’s been quite a ride, and I hope you enjoyed this glimpse into it. It wouldn’t have been possible to create this new game platform without the immense effort of our entire team. I wanted to specifically thank my co-conspirator in all things firmware, Liam Staskawicz, and our electrical engineering team: Hakim Raja and Jared Wolff. Bob Lang was responsible for all things mechanical and manufacturing, and Jared Hanson built our end-user desktop software. And perhaps most amazingly, the whole games team did a fantastic job creating top quality experiences even though we pulled the rug out from under them on a weekly basis.

This project involved an immense amount of work and a lot of creativity, but otherwise there wasn’t anything particularly special about what we did. We didn’t have million-dollar runs of custom silicon, we didn’t have any secret data sheets or proprietary code libraries. Just like all of the other hardware tinkerers and do-it-yourselfers out there, I built on common parts and open source tools. In that way, there isn’t a fundamental difference between our project and a DIY game console kit like the Meggy Jr.

I encourage everyone to take a critical look at the objects in their lives, and imagine how you could make them more personal, more magical, or just generally more awesome. Then look around you for ways to make that happen. At Sifteo, our “secret sauce” was really just creatively repurposing hardware to do things the designers never imagined. This is something anyone can do. I think the future of hardware is in creativity, not raw money or power. Creativity democratizes innovation.

I enjoyed reading about your design process and how your team took inspiration from the seemingly lost art of working with very limited processing power. I got all nostalgic thinking about the hours I spent poring over hex printouts from my C64 when I was a kid. (However, I do NOT miss dot matrix printers.)
As an electrical engineer I applaud your ingenuity and ability to adapt to very limited hardware resources. Any solution that is more akin to a scalpel than a sledgehammer is worthy of respect.

It’s hard to express how excited I am about this. This takes me back to 1980, when I was doing low-level code for a Z-80-based system with tiled graphics! Then, as now, the technology itself wasn’t nearly as important as how you dealt with its limitations. In an age of gigahertz and gigabytes, this article is inspiring^3.

@makomk: diyembedded.com has some nice breakout boards for these chips. Not the *most* hobbyist friendly by any means, but the technology is certainly accessible, and you can do some very similar prototyping work using a separate nRF24L01 radio chip plus whatever other MCU you like.

@Nick: Great question. The transparency has to be handled in software in real-time by the 16 MHz CPU. This multi-layer compositing is by far the biggest thing that consumes its cycles. The inner loops of each scanline renderer need to check the first byte of each pixel for a special value to indicate transparency.

This represents a block of 256 colors which all mean the pixel is transparent, since there’s no time to read the second byte of the pixel. If the special value shows up on that first byte, the CPU has to abort sending that pixel to the LCD, and instead re-address the flash memory to access whatever layer is behind.

As you can imagine, this is an awful lot of complexity to pack into a very few clock cycles. I spent many weeks optimizing those inner loops!
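The transparency test described in this reply can be modelled as a front-to-back search that looks only at the first byte of each 16-bit pixel. The marker value `0x4F`, the fallback background color, and the `composite_pixel` name are placeholders, since the reply does not give the actual value; the real work happens in tightly optimized inner loops, not in a function call per pixel.

```python
TRANSPARENT_FIRST_BYTE = 0x4F   # hypothetical marker value, not from the source
BACKGROUND = (0x00, 0x00)       # assumed fallback if every layer is transparent

def composite_pixel(layers_front_to_back):
    """Pick the frontmost opaque pixel, checking only each pixel's first byte.

    Each pixel is a (first_byte, second_byte) pair in the LCD's 16-bit format.
    Any pixel whose first byte equals the marker counts as transparent, which
    is why a whole block of 256 colors is given up.
    """
    for first_byte, second_byte in layers_front_to_back:
        if first_byte != TRANSPARENT_FIRST_BYTE:
            return (first_byte, second_byte)
    return BACKGROUND

sprite = (0x4F, 0x12)            # transparent: the second byte is never examined
background_tile = (0xF8, 0x00)   # opaque
print(composite_pixel([sprite, background_tile]))   # (248, 0)
```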