Learning FPGA Design with nMigen

Like many of us, I’ve been stuck indoors without much to do for the past month or so. Unfortunately, I’m also in the process of moving, so I don’t know anyone in the local area and most of my ‘maker’ equipment is in storage. But there’s not much point in sulking for N months straight, so I’ve been looking at this as an opportunity to learn about designing and implementing FPGA circuits.

I tried getting into Verilog a little while ago, but that didn’t go too well. I did manage to write a simple WS2812B “NeoPixel” driver, but it was clunky and I got bored soon after. In my defense, Verilog and VHDL are not exactly user-friendly or easy to learn. They can do amazing things in the hands of people who know how to use them, but they also have a steep learning curve.

Luckily for us novices, open-source FPGA development tools have advanced in leaps and bounds over the past few years. The yosys and nextpnr projects have provided free and (mostly) vendor-agnostic tools to build designs for real hardware. And a handful of high-level code generators have also emerged to do the heavy lifting of generating Verilog or VHDL code from more user-friendly languages. Examples of those include the SpinalHDL Scala libraries, and the nMigen Python libraries which I’ll be talking about in this post.

I’ve been using nMigen to write a simple RISC-V microcontroller over the past couple of months, mostly as a learning exercise. But I also like the idea of using an open-source MCU for smaller projects where I would currently use something like an STM32 or MSP430. And most importantly, I really want some dedicated peripherals for driving cheap addressable “NeoPixel” LEDs; I’m tired of needing to mis-use a SPI peripheral or write carefully-timed assembly code which cannot run while interrupts are active.

But that will have to wait for a follow-up post; for now, I’m going to talk about some simpler tasks to introduce nMigen. In this post, we will learn how to read “program data” from the SPI Flash chip on an iCE40 FPGA board, and how to use that data to light up the on-board LEDs in programmable patterns.

The LEDs on these boards are very bright, because you’re supposed to use PWM to drive them.

The target hardware will be an iCE40UP5K-SG48 chip, but nMigen is cross-platform so it should be easy to adapt this code for other FPGAs. If you want to follow along, you can find a 48-pin iCE40UP5K on an $8-20 “Upduino” board or a $50 Lattice evaluation board. If you get an “Upduino”, be careful not to mis-configure the SPI Flash pins; theoretically, you could effectively brick the board if you made it impossible to communicate with the Flash chip. The Lattice evaluation board has jumpers which you could unplug to recover if that happens, but I don’t think that the code presented here should cause those sorts of problems. I haven’t managed to brick anything yet, knock on wood…

Be aware that the Upduino v1 board is cheaper because it does not include the FT2232 USB/SPI chip which the toolchain expects to communicate with, so if you decide to use that option, you’ll need to know how to manually write a binary file to SPI Flash in lieu of the iceprog commands listed later in this post.

Toolchain Setup

nMigen appears to be the continuation of a project called Migen, which is already used in a number of cool projects such as the “LiteX” System-on-Chip generator. It is still under very active development, so the syntax presented in this post may go out of date eventually, but it seems stable enough to use with quite complex designs. It also includes a built-in simulator which is very easy to use, so you can verify your designs before trying them in hardware without much extra effort.

A wide variety of FPGA toolchains are supported, including closed-source ones. nMigen outputs designs in an intermediate representation which yosys can convert to Verilog, so if you want to use nMigen with a big expensive Xilinx chip, you can! I opted to use a Lattice iCE40UP5K chip for learning, though, because it is cheap and available in a hobbyist-friendly QFN package. It also has more resources than the iCE40HX1K found on the popular “Icestick” board, with 5,280 LUTs and 128KB of on-chip single-port RAM, plus about 15KB of block RAM.

So if you want to follow along, you’ll need to install the yosys, nextpnr-ice40, and icestorm toolchains:

Each of those repositories contains its own list of prerequisites and build instructions, which seem pretty straightforward for basic installations. The nMigen libraries are distributed as Python 3.x packages, so you can install them through the pip3 package manager or by copying each package into your library installation path:

I just copied the packages into ~/.local/lib/python3.6/site-packages/, but if you use Python a lot, you might already have a preferred way to install new libraries. Either way, be sure to copy the package directories (nmigen-boards/nmigen_boards/, etc), not the entire repository (nmigen-boards/, etc).

The nmigen-boards repository contains descriptions of the hardware resources on different FPGA development boards, and the nmigen-soc repository contains extra building blocks for creating System-on-Chip designs, including a Wishbone bus implementation. The nmigen repository contains the core project, as you might expect.

Getting Started with nMigen

A brief word about digital circuits before we start: in general, FPGA designs use two basic types of logic: “synchronous” logic, which reads and writes values when a clock signal rises or falls, and “combinatorial” logic, which reads and writes as quickly as the signals can move through the tiny gates and wires inside the chip. Synchronous logic uses flip-flops to “latch” an input value at each clock edge, which guarantees that the output value will only change at the start or end of each clock cycle. Combinatorial logic allows the output value to change at any time, which lets signals travel as quickly as the charge carriers can propagate through the circuit. But different paths through a circuit usually have different delays, so the output of a combinatorial circuit will usually experience ‘glitches’ before it eventually settles on the correct value. It follows that your synchronous logic’s top speed will generally be limited by the longest route (“critical path”) through your combinatorial logic, because that route has to settle before the next clock edge arrives.
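
To put rough numbers on the “critical path” idea, here is a small back-of-the-envelope sketch in plain Python. The path delays are made up for illustration, and the calculation ignores flip-flop setup time and clock skew, so treat it as an upper bound and not as something a real timing analyzer would report.

```python
def max_clock_mhz(path_delays_ns):
    """Highest clock rate (in MHz) at which the slowest path still settles within one period."""
    critical_path_ns = max(path_delays_ns)
    # period in ns -> frequency in MHz
    return 1000.0 / critical_path_ns


# Made-up combinatorial path delays between flip-flops, in nanoseconds.
delays = [12.0, 35.0, 20.0]
print(f"critical path: {max(delays)} ns")
print(f"approximate max clock: {max_clock_mhz(delays):.2f} MHz")
```

Shortening the 12ns or 20ns paths would not help here; only the slowest path sets the limit.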

I hope that explains enough for a newcomer to understand the simple designs that we’re about to go over, but if you are interested in the subject and not entirely satisfied by that very brief explanation, check out the free “computation structures” online course, which is based off of MIT’s sadly discontinued 6.004.x edX classes.

Writing a Simple Circuit

Before we write logic to communicate with external devices like SPI Flash, let’s get familiar with nMigen by writing and simulating a simple “Hello, World” design which increments a counter using synchronous logic.

nMigen designs are structured as a tree of nested “modules”: you create a “parent” module, and add “child” modules (which can each have their own set of “child” modules) to it. When you want to test or run your design, you simulate or build the top-level “parent” module. Each module is structured as a Python class, with an __init__ method to initialize its resources and an elaborate method to describe its runtime logic.

You can create different types of synchronous logic with nMigen, but by default, there is one sync “clock domain” which acts on the rising edge of the default clock, and one comb domain for combinatorial logic. With that in mind, take a look at this minimal nMigen module, which you can put in an ordinary .py file:

This design will increment a 16-bit counter at each rising clock edge, and its combinatorial logic will try to ensure that another 16-bit value is always the bitwise opposite of the counter. Notice that the module inherits from Elaboratable and implements the abstract elaborate( self, platform ) method which returns a Module object describing its runtime logic. The platform variable contains information about the hardware which the design is being built for, and it is set to None when the design is built to run in a simulator.
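
If it helps to see that behavior spelled out, here is a plain-Python model of what the counter should do. This is not nMigen code and it is not the module itself; the function names are made up, and it only mimics the two rules (one synchronous, one combinatorial) in software.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def comb_ncount(count):
    """Combinatorial rule: the second value is always the bitwise inverse of count."""
    return ~count & MASK


def sync_count(count):
    """Synchronous rule: the value that count takes at the next rising clock edge."""
    return (count + 1) & MASK


def run_counter(cycles, count=0):
    """Clock the model `cycles` times and return (count, inverted count)."""
    for _ in range(cycles):
        count = sync_count(count)
    return count, comb_ncount(count)


print(run_counter(3))        # count after three clock edges
print(run_counter(1 << 16))  # a 16-bit counter wraps around to 0
```

The masking with `MASK` stands in for the fixed 16-bit width of the signals: the counter silently wraps instead of growing like a Python integer.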

The sync and comb logic domains are accessed through the Module’s d (“domain”) attribute, and you can add “rules” to them with the += operator. x.eq( y ) is a simple rule which translates to x = y. If a module’s state ends up with more than one rule targeting the same signal, the rule which comes “later” in the design will take precedence. So if we wanted to reset the count signal to zero when it reaches a value of 42, we could override the count = count + 1 logic like this:
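
As a rough illustration of the “later rule wins” behavior, here is a software model in plain Python (again, not nMigen code, and the helper names are invented). An ordinary Python `if` is fine here only because this is a model of the behavior, not a description of hardware.

```python
def next_count(count, width=16):
    """Return the value that `count` takes at the next clock edge."""
    mask = (1 << width) - 1
    nxt = (count + 1) & mask  # earlier rule: count = count + 1
    if count == 42:           # later rule, active only when the condition holds...
        nxt = 0               # ...so it overrides the increment
    return nxt


def trace_counter(cycles):
    """List the value of count at the start of each clock cycle."""
    count, seen = 0, []
    for _ in range(cycles):
        seen.append(count)
        count = next_count(count)
    return seen


print(trace_counter(45)[40:])  # the last few values around the reset
```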

Notice that the conditional statement uses m.If(...): instead of if ...:. We’re describing behavior which can change as the FPGA design runs, so we defer to the Module object to evaluate the condition. You should only use Python if/elif/else statements for conditions which will never change, such as configuration flags which determine whether features will be included in the design.

You can also only write to each signal from one domain, so this code would not work:

Since it tries to write to count from both the sync and comb domains at different points in the design, you’ll see an error like this if you try to simulate or build it:

nmigen.hdl.dsl.SyntaxError: Driver-driver conflict: trying to drive (sig count) from d.comb, but it is already driven from d.sync

Simulating and Testing a Design

Now, if you were to run the previous error-free design, it seems pretty clear that the count value should equal 3 after 3 clock cycles. But it’s always a good idea to double-check your assumptions by simulating a design before you build and run it on an FPGA. Fortunately, nMigen makes it very easy to simulate a design; you can add something like this to the end of your module’s .py file:

The if __name__ == "__main__": line is Python-ese for “only run this logic if the file is being executed”, and dut is a common variable name which means “Device Under Test”.

The above logic sets up a simulation to output its results to a file called test.vcd. The add_clock method adds a clock signal with a period of 0.000001 seconds (1 microsecond, or 1MHz), which drives the default sync clock domain. Then the proc test process is added to the simulation and run; in this case, it simply waits 50 clock ticks before finishing.

If you run the resulting file with python3 test.py, the simulation will run. It won’t print anything out, but it will create a test.vcd file which you can open and view with a program like gtkwave:

gtkwave rendering of the minimal test design

You can also write unit tests by reading values out of the simulation using yield. And if you write any helper functions to run within the simulation, they should be called with yield from func(...). For example, this test process would check that the count value always equals the number of ticks which have elapsed since the simulation started:
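
The `yield` / `yield from` pattern can look odd if you haven’t used Python generators much, so here is a toy stand-in written in plain Python. It is not the nMigen simulator: `ToySim`, the `"tick"` and `"count"` requests, and `wait_ticks` are all invented for this sketch. It only shows how a test process can ask the simulator for a value with `yield`, and how a helper is delegated to with `yield from`.

```python
class ToySim:
    """Tiny stand-in for a simulator: one 16-bit counter and a clock."""

    def __init__(self):
        self.count = 0

    def run(self, process):
        """Drive a generator-based test process until it finishes."""
        gen = process()
        reply = None
        try:
            while True:
                request = gen.send(reply)
                reply = None
                if request == "tick":      # advance the clock by one cycle
                    self.count = (self.count + 1) & 0xFFFF
                elif request == "count":   # read a value out of the design
                    reply = self.count
        except StopIteration:
            pass


def wait_ticks(n):
    """Helper generator; callers must use `yield from wait_ticks(n)`."""
    for _ in range(n):
        yield "tick"


def check_count():
    """Test process: count should equal the number of elapsed ticks."""
    for elapsed in range(50):
        count = yield "count"
        assert count == elapsed, f"expected {elapsed}, got {count}"
        yield from wait_ticks(1)


sim = ToySim()
sim.run(check_count)
print("final count:", sim.count)
```

Calling `wait_ticks(1)` without `yield from` would just create a generator object and throw it away, which is why helpers need that keyword.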

Blinking an LED

Now that we know how to increment a value, let’s toggle an LED on an interval. To do that, we’ll need to access our FPGA board’s I/O pins. These are defined in “board files”, which you can find in the nmigen-boards repository. The “Upduino” and Lattice evaluation boards both include a common-cathode RGB LED on the same 3 pins, and nMigen lets you access hardware resources by name and index, so we can use the same design to target both boards. In fact, since we’re only going to use the LED and SPI Flash resources, you can build the design for an Upduino board and run it on a Lattice evaluation board, and vice-versa.

The LED resources are aliased to led_r, led_g, and led_b. Once you request a pin resource from the platform object, you can access its input and output values with pin.i and pin.o. LEDs are output-only, so the following code will set the green LED to the 20th bit of the count value, and the blue LED to the opposite of that:
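
To get a feel for how fast a given counter bit blinks, here is a quick calculation in plain Python. The 12MHz clock is only an assumption for the example (check what your board’s default clock actually is), and I’m treating “the 20th bit” as index 20 of the counter.

```python
def toggle_period_s(bit, clock_hz):
    """Seconds between toggles of count[bit] on a free-running counter."""
    return (1 << bit) / clock_hz


def blink_frequency_hz(bit, clock_hz):
    """Complete on/off cycles per second for count[bit]."""
    return clock_hz / (1 << (bit + 1))


CLOCK_HZ = 12_000_000  # assumed clock rate, for illustration only
print(f"bit 20 toggles every {toggle_period_s(20, CLOCK_HZ) * 1000:.1f} ms")
print(f"bit 20 blinks at about {blink_frequency_hz(20, CLOCK_HZ):.2f} Hz")
```

Each step up in bit index halves the blink rate, so picking a different bit is the easiest way to slow the LED down or speed it up.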

It’s a good idea to put platform resource logic under an if platform is not None: check, so that it does not cause errors when you simulate the design. Remember, the platform value is only provided when you are building a design for actual hardware. To build a design, you can simply use the build method provided by the board file. For an “Upduino v2” board, you can append this to the end of the file:

It’s usually a good idea to include a way to choose between simulating and building a design, but we’ll do that in the next example. For the sake of brevity, you can just run your led.py file to build this example. The results will be placed in a build/ directory. The top.rpt file is a report which tells you how much of the FPGA’s resources are used in the design, and the top.tim file is a timing analysis which tells you how quickly its synchronous logic will probably be able to run.

The top.bin file contains the actual FPGA configuration bitstream, and you can upload that to your board with the iceprog utility:

Reading from Memory

Almost every iCE40 development board includes an external SPI Flash chip to store its configuration. This type of FPGA does have non-volatile memory inside of the chip, but you can only ever write to it once and it’s supposed to have a separate 2.5V power supply to run properly. So for multi-use development, iCE40s also include hardware to read their configuration out of a commodity SPI Flash chip, such as a Winbond W25Q series or similar.

These chips are usually sized to at least 2x the expected maximum size of the FPGA’s configuration bitstream, so you can use the leftover space to store “program data”, similar to the Flash memory on a microcontroller. Memory access will be comparatively slow, since you’ll need to wait several dozen clock cycles every time that you want to read a word of data, but this is just an example.

To test reading data from the Flash chip, we can write a new module which contains a simple state machine to read data, and a parent module which acts like a very simple CPU. The parent module will increment the address that it reads from, and depending on the returned value, it will either delay for a given number of cycles, toggle one of the LEDs, or jump back to address zero. This will let us write simple “programs” which set the LEDs to different colors at different times.

Simulating Memory on a Wishbone Bus

Before writing the actual SPI Flash module, let’s write and test the parent “CPU” module with simulated memory values. The SPI interface has a few “gotchas” and places where incorrect timing can cause problems, so it’s a good idea to make sure that the basic program logic works before diving into that.

The parent “CPU” module needs to be able to request new reads and wait for the child memory module to finish, without knowing how long a memory access will take ahead of time. This is a common problem which people have already developed standards for, so I’m going to use the free Wishbone Bus standard to mediate data transfers. nMigen includes an implementation of this sort of bus in the nmigen-soc repository, so it is nice and easy to use in a design. We can make our memory module inherit from nmigen_soc.wishbone.Interface, call the parent __init__ method with the desired address and data widths during initialization, and then access the bus signals with self.<signal>.

I’m still learning about the Wishbone bus, so these descriptions may not be completely accurate, but as I understand it, these are the relevant signals for a simple read-only memory:

ack: “acknowledge” signal which is asserted by the child when it is finished with a transaction.

cyc: “cycle” signal which is asserted by the parent when a bus transaction is ongoing. The child should ignore any inputs and avoid asserting any outputs when “cycle” is not asserted.

stb: “strobe” signal which is asserted by the parent when a bus transfer cycle is ongoing. The “strobe” signal can be toggled multiple times while the “cycle” signal is asserted to perform multiple transfers in a single transaction.

dat_r: “read data” buffer which the child fills with data to be read by the parent once ack is asserted.

The Wishbone specification uses “master” and “slave” if you’re following along in the documentation; I prefer terminology like “parent” / “child” or “host” / “device”, but I realize that can cause some confusion and I’m sorry about that. Anyways, you can set up a simple word-addressed read-only Wishbone memory like this:

The MemoryMap object lets you describe the structure of different registers and memory spaces within an Interface. That isn’t necessary in a simple implementation like this, but nMigen will produce an error if you don’t assign one, so I initialized an empty one with the same address/data width as the bus and no alignment restrictions.

The Memory class is an efficient way to store data in the FPGA; I think it asks the synthesizer to try to use storage resources like RAM instead of the scarcer and more precious LUTs / flip-flops / etc. When you are simulating a design, you can access Memory data using array syntax (e.g. data[ 0 ]), but you need to use read and write ports when you build a design for real hardware. Read ports are pretty easy to use: set port.addr, wait a cycle, read port.data. Write ports are similar, but they also have an en “enable” attribute which prevents writes when it is de-asserted.
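
Here is a cycle-level model of that handshake in plain Python, in case it helps to see the signals move. It is a sketch under my own assumptions and not the nMigen module: `ToyRom` and `read_word` are invented names, and the model simply acknowledges a request on the clock edge after it sees `cyc` and `stb` asserted, with the data registered on that same edge.

```python
class ToyRom:
    """Software model of a word-addressed, read-only Wishbone child."""

    def __init__(self, data):
        self.data = list(data)
        self.ack = 0    # "acknowledge" output
        self.dat_r = 0  # "read data" output

    def tick(self, cyc, stb, adr):
        """Advance one clock edge, given the parent's current outputs."""
        # Acknowledge a request for exactly one cycle.
        self.ack = 1 if (cyc and stb and not self.ack) else 0
        # Like a read port, the data output is registered, so it only
        # reflects `adr` after a clock edge has passed.
        self.dat_r = self.data[adr]


def read_word(rom, adr, max_cycles=8):
    """Parent side: hold cyc/stb until ack, then return (data, cycles waited)."""
    for waited in range(1, max_cycles + 1):
        rom.tick(cyc=1, stb=1, adr=adr)
        if rom.ack:
            data = rom.dat_r
            rom.tick(cyc=0, stb=0, adr=adr)  # end the transaction
            return data, waited
    raise TimeoutError("no ack from the child module")


rom = ToyRom([0x11, 0x22, 0x33])
print(read_word(rom, 1))
```

The useful property is that `read_word` never assumes how long the child takes; a slower memory would just make `waited` larger.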

And since you should always simulate your modules before running them (ignore the above LED example), here’s an example testbench:

Try running the full rom.py file (which you can also find on GitHub), and inspect the signals in the rom.vcd file. Notice how the read data follows the address by one clock cycle; it would be nice to be able to perform single-cycle access, but I am learning that FPGA design is full of tradeoffs between speed and size.

Memory access through a ‘read port’ is delayed by one clock cycle.

Writing a Simple State Machine

Now that we have a simulated memory module, we can write the parent “CPU” state machine in a separate top.py file. (It looks like “top” is often used as a default name for top-level parent modules.) First, let’s define a simple set of usable commands to toggle each LED, delay for a variable number of cycles, and return to address 0:

It’s also a good idea to define a ‘dummy LED’ class to use when simulating the design; it will let you avoid unnecessary if statements around the LED logic, and it will let you see the LED signals in the simulation results. You can use the ‘name’ keyword argument to set a signal’s name in the simulation output; otherwise, it will use the name of the variable, followed by $<number> if the name is re-used:

Finally, we can define the state machine itself with the elaborate method. nMigen has some special syntax for describing a simple Finite State Machine. You start with a with m.FSM(): statement, then you define each state using a with m.State( "this_state" ): block. And to move to another state at the next clock tick, you can set m.next = "next_state". So you can implement a simple “read, execute, increment address” state machine like this:

I made the ‘return to address 0’ instruction accept either 0x00000000 or 0xFFFFFFFF, because every bit of erased SPI Flash memory reads as a 1, not a 0. There are also usually pull-up resistors on the data lines, so if you end up reading from an area of memory that has never been written to, or if you fail to read data entirely, you’ll probably receive a value of 0xFFFFFFFF. And it would be nice to avoid side effects when…uh, I mean “if”… that happens.
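
To make the “read, execute, increment address” loop a bit more concrete, here is a small interpreter for it in plain Python. The instruction encodings are hypothetical ones that I picked for this sketch (a 4-bit opcode in the low bits, with the delay count in the upper 28 bits); they are not necessarily the values used in the design. Reading past the end of the program returns 0xFFFFFFFF, to mimic unwritten Flash.

```python
# Hypothetical encodings, chosen only for this sketch.
OP_DELAY, OP_RED, OP_GRN, OP_BLU = 1, 2, 3, 4
RET_WORDS = (0x00000000, 0xFFFFFFFF)
LED_NAMES = {OP_RED: "r", OP_GRN: "g", OP_BLU: "b"}


def DELAY(cycles):
    """Encode a delay instruction: opcode in bits 3:0, count in bits 31:4."""
    return (cycles << 4) | OP_DELAY


def run_program(program, steps):
    """Execute `steps` instructions; log (time, LED states) after each toggle."""
    pc, t = 0, 0
    leds = {"r": 0, "g": 0, "b": 0}
    log = []
    for _ in range(steps):
        word = program[pc] if pc < len(program) else 0xFFFFFFFF
        if word in RET_WORDS:      # return to address 0
            pc = 0
            continue
        op, arg = word & 0xF, word >> 4
        if op == OP_DELAY:         # wait for `arg` cycles
            t += arg
        elif op in LED_NAMES:      # toggle one LED
            leds[LED_NAMES[op]] ^= 1
            log.append((t, dict(leds)))
        pc += 1                    # word-addressed memory: next instruction
    return log


prog = [OP_RED, DELAY(10), OP_GRN, 0xFFFFFFFF]
for entry in run_program(prog, 8):
    print(entry)
```

The time value only counts delay cycles and ignores how long each memory read takes, so it is a simplification of what the waveform would show.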

Next, you can define a ROM simulated memory module with an array containing a series of instructions, and simulate it for a few hundred cycles:

If you open the resulting test.vcd file, you should see the LEDs turning on and off in time with the program image:

Viewing the results of running the Finite State Machine test design. Notice that the LED signals toggle in time with the instructions listed above.

With the simulation looking good, you can add an option to build the design instead of simulating it. There are probably better ways to process command-line arguments in Python, but one easy option is to use sys.argv:
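
A minimal sketch of that sys.argv approach might look like the following. The `parse_mode` helper is a name I made up, and the actual build and simulation calls are left as placeholder prints since they depend on the rest of the file.

```python
import sys


def parse_mode(argv):
    """Return 'build' if -b was passed on the command line, else 'simulate'."""
    args = argv[1:]  # argv[0] is the script name
    if "-b" in args:
        return "build"
    return "simulate"


if __name__ == "__main__":
    mode = parse_mode(sys.argv)
    if mode == "build":
        print("would build the design for hardware here")
    else:
        print("would run the simulation here")
```

For anything more involved than a flag or two, the standard library’s argparse module is probably the better tool, but this is enough for a two-way switch.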

With that, you should be able to build the design with python3 top.py -b, and program it with iceprog build/top.bin. The LEDs on the board should flash with the timing of your program, but notice how large the ‘delay’ values are in the test program above. It’s easier to look at transitions when you use small delay values, but if you do that with a fast clock rate, the LED might look white as the colors blur together.

Reading from SPI Flash

Now that you have a basic state machine which can run a few different instructions, it’s time to read a program out of the board’s SPI Flash chip. Since we implemented the simulated memory module as a standard Wishbone bus, we can write a SPI Flash module which inherits from the same Interface superclass and drop it into the Memory_Test module with hardly any changes. The read-only SPI Flash memory module will have a slightly more complicated state machine, but it’s not too bad:

1. Wake up the Flash chip from sleep mode.

2. Wait for a read request from the parent module.

3. Send the 32-bit “read data from address X” command.

4. Receive 32 bits of data from the chip in little-endian order.

5. Assert a ‘done’ signal for the parent module to see, and go back to step 2.
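
The command words in steps 1 and 3 can be sketched in plain Python like this. The opcodes are the ones that Winbond W25Q-style chips use for “release from power-down” (0xAB) and “read data” (0x03); other Flash chips may differ, so check your datasheet. The helper names are invented for this example.

```python
WAKE_CMD = 0xAB  # "release from power-down" on W25Q-style chips
READ_CMD = 0x03  # "read data" opcode


def read_command(byte_address):
    """32-bit word to shift out: 8-bit opcode followed by a 24-bit address."""
    if not 0 <= byte_address < (1 << 24):
        raise ValueError("address must fit in 24 bits")
    return (READ_CMD << 24) | byte_address


def msb_first_bits(value, width):
    """Bits in the order they are sent on the wire, most-significant first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def spi_clocks_per_read(command_bits=32, data_bits=32):
    """SPI clock cycles spent on one word read, ignoring chip-select overhead."""
    return command_bits + data_bits


print(hex(read_command(0x200000)))
print(msb_first_bits(WAKE_CMD, 8))
print(spi_clocks_per_read())
```

That last number is where the “several dozen clock cycles” per word comes from: every read has to shift out a full command before any data comes back.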

It’s easy to accidentally use the wrong clock edge, but you can catch those sorts of mistakes by looking carefully at simulation results and the datasheet of whatever device you’re trying to talk to. The “wake up from sleep mode” command is also easy to miss, because it usually isn’t required when communicating with SPI Flash. But to save power and prevent unwanted writes, iCE40 chips issue a “go to sleep” command after they finish reading their configuration data.

So this isn’t the most efficient way to do it, but here’s a basic implementation of the state machine described above. Remember, you can find these files on GitHub if you don’t want to copy/paste everything:

The SPI_RX state’s countdown value starts at 7 instead of 31, and after each byte it jumps up by 15 instead of decrementing by 1. This is because each byte arrives from SPI Flash most-significant bit first, but the bytes themselves are stored in “little-endian” order, with the lowest-address byte occupying the least-significant bits of the word. This is how many processors organize data internally, but we human beings usually read and write hexadecimal values in “big-endian” order. It is very easy to get confused about “endianness”, so keep it on your shortlist of things to check when you need to debug a faulty design.
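
Here is the same countdown written out as a plain-Python model, which might make the bit ordering easier to follow. It is only a sketch of the index arithmetic described above (start at 7, count down, jump up by 15 at each byte boundary), with function names that I made up.

```python
def rx_bit_positions():
    """Destination bit index for each of the 32 received bits, in arrival order."""
    positions, dc = [], 7
    for _ in range(32):
        positions.append(dc)
        # After bit 0 of a byte, jump up to bit 7 of the next byte (+15);
        # otherwise keep counting down.
        dc = dc + 15 if dc % 8 == 0 else dc - 1
    return positions


def assemble_word(flash_bytes):
    """Rebuild a 32-bit word from 4 bytes that arrive MSB-first, lowest address first."""
    bits = [(b >> i) & 1 for b in flash_bytes for i in range(7, -1, -1)]
    word = 0
    for bit, pos in zip(bits, rx_bit_positions()):
        word |= bit << pos
    return word


stored = bytes([0x78, 0x56, 0x34, 0x12])  # bytes as they sit in Flash
print(hex(assemble_word(stored)))
print(hex(int.from_bytes(stored, "little")))  # Python's built-in equivalent
```

Both lines should print the same value, which is a handy cross-check when you are staring at a hex dump and a waveform at the same time.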

There are plenty of improvements that you could make to this module, but it’s just a minimal example. If you want to make it a bit smaller, you could start by finding a way to replace logic which shifts values by variable amounts, such as self.spi.miso.i << self.dc. It sounds like that sort of logic synthesizes into a barrel shifter, which is expensive compared to other options. I’m not very experienced in digital logic design, so I don’t want to give you the impression that these are optimal designs 🙂

Anyways, once you start writing logic which interacts with external hardware, simulating and testing your design becomes more like a requirement than a recommendation. That’s why there are a few if platform is None: blocks in the above module, to simulate the expected responses from a working SPI Flash chip. Here is some basic test logic to go along with the above module:

You can see that I didn’t write a verification of the initial “wake up from sleep” command; you won’t always have time to write completely comprehensive tests, but it’s a good idea to cover the core behavior so that you can quickly check for obvious regressions when you make changes.

You can run those tests (the full spi_rom.py file is also available on GitHub) to verify that read commands are sent in the expected format at the right times, and check the waveform results to see that bits are shifted in and out according to the SPI protocol:

SPI Flash simulation results

Notice that the cs signal is inverted; you write a value of 1 to assert it, which pulls the corresponding pin low.

Finally, you can update the Memory_Test module and change how it is built and tested. The parent module only requires one small change, because the simulated memory and SPI Flash memory modules implement the same Wishbone bus specification. But the simulated memory module addresses 32-bit words of data, and SPI Flash addresses 8-bit bytes of data, so you should modify the line which increments the pc variable to add 4 instead of 1:

I also added a -w option to generate a program image to write to SPI Flash, since the new design will not have a built-in program to run. And you can still simulate this design, since the SPI_ROM module accepts an optional set of test data which it will simulate receiving:

Checking the expected results of running a simple “program” from SPI Flash.

Once you’re happy with the simulation results, you can generate a binary image of the test program and write it to a 2 Megabyte offset in the board’s SPI Flash with:

python3 top.py -w
iceprog -o 2M prog.bin

Then, with a valid program image stored on the chip, you can write the design to your board:

python3 top.py -b
iceprog build/top.bin

The LEDs should cycle through various colors in time with the test program. If you change the test_prog array and re-build / re-write it, the LED color timings should change without making you re-flash the entire Memory_Test design.
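
For reference, generating a program image like that only takes a few lines of plain Python. This is a sketch and not the exact code behind the -w option: `build_image` and `write_image` are names I made up, and it assumes that each 32-bit instruction is stored lowest-order byte first, to match the little-endian read order described earlier.

```python
import struct


def build_image(words):
    """Pack 32-bit instruction words into bytes, lowest-order byte first."""
    return b"".join(struct.pack("<I", w & 0xFFFFFFFF) for w in words)


def write_image(words, path="prog.bin"):
    """Write the packed program to a binary file and return its size in bytes."""
    image = build_image(words)
    with open(path, "wb") as f:
        f.write(image)
    return len(image)


# Placeholder words; a real program would use your instruction encodings.
example_words = [0x12345678, 0xFFFFFFFF]
print(build_image(example_words).hex())
```

Ending the image with 0xFFFFFFFF is optional, since erased Flash already reads back that way, but it makes the end of the program explicit.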

Note that if you are using an Upduino board, it might take a few tries to re-write SPI memory once you upload a design which accesses the resource. If you see an error like this:

…or if the iceprog program freezes before it reads the flash ID, don’t panic! Your board isn’t bricked, you just tried to write to the board while the FPGA was busy accessing the SPI Flash. This doesn’t seem to happen on the Lattice evaluation board, which has a suspiciously similar schematic, so it could also be caused by EMI issues on the cheaper 2-layer PCB.

Whatever the reason, if you keep trying and occasionally un-plug the board and plug it back in again, you should eventually succeed in re-writing it. If you use longer DELAY(...) instructions, the FPGA will spend less time communicating with the Flash chip and it should be easier to re-program. You can also run iceprog -b to bulk-erase the entire Flash chip, but that command might also take a few tries to succeed.

Conclusions

From my inexperienced perspective, nMigen is a lot easier to use than Verilog or VHDL. A lot easier. It has its quirks, but you can learn a lot from the source code, which is pretty well-documented. The lack of actual documentation and tutorials is a little bit tricky, but I’ve found a few resources that are good starting points:

There’s also a very helpful #nmigen channel on Freenode IRC, which is where I learned a lot of the information presented here. So a huge thanks to those folks for helping me learn and getting me un-stuck while I whittle away these long hours.

All in all, if you’ve been wanting to try learning about hardware design and you’re not a fan of traditional hardware description languages, give nMigen a try!
