A Raspberry Pi (mid-2012 vintage) which parses pattern commands coming from the cloud into LED sequences which are downloaded to the KL26 over SPI. It also hosts a webcam and periodically throws the image back up to the cloud so that the masses can see the results of their labors.

A cloud server which hosts a Redis instance and python application to facilitate the user interface and communication down to the Raspberry Pi.

I’m going to go into brief detail about each of these pieces and some of the challenges I had with getting everything to work.

In my most recent project I selected an ARM Cortex-M0 microcontroller (the STM32F042). I soon realized that there is a key architectural piece missing from the Cortex-M0 which the M0+ does not have: The vector table offset register (VTOR).

I want to talk about how I overcame the lack of a VTOR to write a USB bootloader which supports a semi-safe fallback mode.

The source for this post can be found here (look in the “bootloader” folder):

As my final installment for the posts about my LED Wristwatch project I wanted to write about the self-programming bootloader I made for an STM32L052 and describe how it works. So far it has shown itself to be fairly robust and I haven’t had to get out my STLink to reprogram the watch for quite some time.

The main object of this bootloader is to facilitate reprogramming of the device without requiring a external programmer. There are two ways that a microcontroller can accomplish this generally:

Include a binary image in every compiled program that is copied into RAM and runs a bootloader program that allows for self-reprogramming.

Reserve a section of flash for a bootloader that can reprogram the rest of flash.

Each of these ways has their pros and cons. Option 1 allows for the user program to use all available flash (aside from the blob size and bootstrapping code). It also might not require a relocatable interrupt vector table (something that some ARM Cortex microcontrollers lack). However, it also means that there is no recovery without using JTAG or SWD to reflash the microcontroller if you somehow mess up the switchover into the bootloader. Option 2 allows for a fairly fail-safe bootloader. The bootloader is always there, even if the user program is not working right. So long as the device provides a hardware method for entering bootloader mode, the device can always be recovered. However, Option 2 is difficult to update (you have to flash it with a special program that overwrites the bootloader), wastes unused space in the bootloader-reserved section, and also requires some features that not all microcontrollers have.

Because the STM32L052 has a large amount of flash (64K) and implements the vector-table-offset register (allowing the interrupt vector table to be relocated), I decided to go with Option 2.

During my LED Wristwatch project, I decided early on that I wanted to do something different with the way my USB stuff was implemented. In the past, I have almost exclusively used libusb to talk to my devices in terms of raw bulk packets or raw setup requests. While this is ok, it isn’t quite as easy to do once you cross out of the fruited plains of Linux-land into the barren desert of Windows. This project instead made the watch identify itself (enumerate) as a USB Human Interface Device (HID).

What I would like to do in this post is a step-by-step tutorial for modifying a USB device to enumerate as a human interface device. I’ll start with an overview of HID, then move on to modifying the USB descriptors and setting up your device endpoints so that it sends reports, followed by a few notes on writing host software for Windows and Linux that communicates to devices using raw reports. With a little bit of work, you should be able to replace many things done exclusively with libusb with a cross-platform system that requires no drivers.

One thing to note is that since I’m using my LED Watch as an example, I’m going to be extending using my API, which I describe a little bit here. The main source code files for this can be found in common/src/usb.c and common/src/usb_hid.c.

A couple years ago I wrote a post about writing a bare metal USB driver for the Teensy 3.1, which uses Freescale Kinetis K20 microcontroller. Over the past couple years I’ve switched over to instead using the STM32 series of microcontrollers since they are cheaper to program the “right” way (the dirt-cheap STLink v2 enables that). I almost always prefer to use the microcontroller IC by itself, rather than building around a development kit since I find that to be much more interesting.

LED Wristwatch with USB

One of my recent (or not so recent) projects was an LED Wristwatch which utilized an STM32L052. This microcontroller is optimized for low power, but contains a USB peripheral which I used for talking to the wristwatch from my PC, both for setting the time and for reflashing the firmware. This was one of my first hobby projects where I designed something without any prior breadboarding (beyond the battery charger circuit). The USB and such was all rather “cross your fingers and hope it works” and it just so happened to work without a problem.

In this post I’m going to only cover a small portion of what I learned from the USB portion of the watch. There will be a further followup on making the watch show up as a HID Device and writing a USB bootloader.

My objective here is to walk quickly through the operation of the USB Peripheral, specifically the Packet Memory Area, then talk a bit about how the USB Peripheral does transfers, and move on to how I structured my code to abstract the USB packetizing logic away from the application.

I’ve been using kicad for just about all of my designs for a little over 5 years now. It took a little bit of a learning curve, but I’ve really come to love it, especially with the improvements by CERN that came out in version 4. One of the greatest features, in my opinion, is the Python Scripting Console in the PCB editor (pcbnew). It gives (more or less) complete access to the design hierarchy so that things like footprints can be manipulated in a scripted fashion.

In my most recent design, the LED Watch, I used this to script myself a tool for arranging footprints in a circle. What I want to show today was how I did it and how to use it so that you can make your own scripting tools (or just arrange stuff in a circle).

The python console can be found in pcbnew under Tools->Scripting Console.

Step 1: Write the script

When writing a script for pcbnew, it is usually helpful to have some documentation. Some can be found here, though I mostly used “dir” a whole bunch and had it print me the structure of the various things once I found the points to hook in. The documentation is fairly spartan at this point, so that made things easier.

Here’s my script:

Python

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

#!/usr/bin/env python2

# Random placement helpers because I'm tired of using spreadsheets for doing this

component_offset: Offset in degrees for each component to add to angle

hide_ref: Hides the reference if true, leaves it be if None

lock: Locks the footprint if true

"""

pcb=GetBoard()

deg_per_idx=360/len(refdes)

foridx,rd inenumerate(refdes):

part=pcb.FindModuleByReference(rd)

angle=(deg_per_idx*idx+start_angle)%360;

print"{0}: {1}".format(rd,angle)

xmils=center[0]+math.cos(math.radians(angle))*radius

ymils=center[1]+math.sin(math.radians(angle))*radius

part.SetPosition(wxPoint(FromMils(xmils),FromMils(ymils)))

part.SetOrientation(angle*-10)

ifhide_ref isnotNone:

part.Reference().SetVisible(nothide_ref)

print"Placement finished. Press F11 to refresh."

There are several arguments to this function: a list of reference designators ([“D1”, “D2”, “D3”] etc), the angle at which the first component should be placed, the position in mils for the center of the circle, and the radius of the circle in mils. Once the function is invoked, it will find all of the components indicated in the reference designator list and arrange them into the desired circle.

Step 2: Save the script

In order to make life easier, it is best if the script is saved somewhere that the pcbnew python interpreter knows where to look. I found a good location at “/usr/share/kicad/scripting/plugins”, but the list of all paths that will be searched can be easily found by opening the python console and executing “import sys” followed by “print(sys.path)”. Pick a path that makes sense and save your script there. I saved mine as “placement_helpers.py” since I intend to add more functions to it as situations require.

Step 3: Open your PCB and run the script

Before you can use the scripts on your footprints, they need to be imported. Make sure you execute the “Read Netlist” command before continuing.

The scripting console can be found under Tools->Scripting Console. Once it is opened you will see a standard python (2) command prompt. If you placed your script in a location where the Scripting Console will search, you should be able to do something like the following:

Python

1

2

3

4

5

6

7

8

9

10

PyCrust0.9.8-KiCAD Python Shell

Python2.7.13(default,Feb112017,12:22:40)

[GCC6.3.120170109]on linux2

Type"help","copyright","credits"or"license"formore information.

>>>importplacement_helpers

>>>placement_helpers.place_circle(["D1","D2"],0,(500,500),1000)

D1:0

D2:180

Placement finished.Press F11 to refresh.

>>>

Now, pcbnew may not recognize that your PCB has changed and enable the save button. You should do something like lay a trace or some other board modification so that you can save any changes the script made. I’m sure there’s a way to trigger this in Python, but I haven’t got around to trying it yet.

Conclusion

Hopefully this brief tutorial will either help you to place components in circles in Kicad/pcbnew or will help you to write your own scripts for easing PCB layout. Kicad can be a very capable tool and with its new expanded scripting functionality, the sky seems to be the limit.

About 2009 I saw an article written by Dr. Paul Pounds in which he detailed a pocketwatch he had designed that fit inside a standard pocketwatch case and used LEDs as the dial. While the article has since disappeared, the youtube video remains. The wayback machine has a cached version of the page. Anyway, the idea has kind of stuck with me for a while and so a year or so ago I decided that I wanted to build a wristwatch inspired by that idea.

Although the project started out as an AVR project, I decided after my escapades with the STM32 in August that I really wanted to make it an STM32 project, so around November I started making a new design that used the STM32L052C8 ARM Cortex-M0+ ultra-low power USB microcontroller. The basic concept of the design is to mock up an analog watch face using a ring of LEDs for the hours, minutes, and seconds. I found three full rings to be expecting a bit much if I wanted to keep this small, so I ended up using two rings: One for the hours and another for combined minutes and seconds (the second hand is recognized by the fact that it is “moving” perceptibly).

In this post I’m going to go over my general design, some things I was happy with, and some things that I wasn’t happy with. I’ll make some follow-up posts for the following topics:

The Problem

I am working on a project that involves a Li-Ion battery charger. I’ve never built one of these circuits before and I wanted to test the battery over its entire charge-discharge cycle to make sure it wasn’t going to burst into flame because I set some resistor wrong. The battery itself is very tiny (100mAH, 2.5mm thick) and is going to be powering an extremely low-power circuit, hopefully over the course of many weeks between charges.

Battery charger breadboard

After about 2 days of taking meter measurements every 6 hours or so to see what the voltage level had dropped to, I decided to try to automate this process. I had my trusty Teensy 3.1 lying around, so I thought that it should be pretty simple to turn it into a simple data logger, measuring the voltage at a very slow rate (maybe 1 measurement per 5 seconds). Thus was born the EZDAQ.

Setting up the Teensy 3.1 ADC

I’ve never used the ADC before on the Teensy 3.1. I don’t use the Teensy Cores HAL/Arduino Library because I find it more fun to twiddle the bits and write the makefiles myself. Of course, this also means that I don’t always get a project working within 30 minutes.

The ADC on the Teensy 3.1 (or the Kinetis MK20DX256) is capable of doing 16-bit conversions at 400-ish ksps. It is also quite complex and can do conversions in many different ways. It is one of the larger and more configurable peripherals on the device, probably rivaled only by the USB module. The module does not come pre-calibrated and requires a calibration cycle to be performed before its accuracy will match that specified in the datasheet. My initialization code is as follows:

Following the recommendations in the datasheet, I selected a clock that would bring the ADC clock speed down to <4MHz and turned on hardware averaging before starting the calibration. The calibration is initiated by setting a flag in ADC0_SC3 and when completed, the calibration results will be found in the several ADC0_CL* registers. I’m not 100% certain how this calibration works, but I believe what it is doing is computing some values which will trim some value in the SAR logic (probably something in the internal DAC) in order to shift the converted values into spec.

One thing to note is that I did not end up using the 16-bit conversion capability. I was a little rushed and was put off by the fact that I could not get it to use the full 0-65535 dynamic range of a 16-bit result variable. It was more like 0-10000. This made figuring out my “volts-per-value” value a little difficult. However, the 12-bit mode gave me 0-4095 with no problems whatsoever. Perhaps I’ll read a little further and figure out what is wrong with the way I was doing the 16-bit conversions, but for now 12 bits is more than sufficient. I’m just measuring some voltages.

Since I planned to measure the voltages coming off a Li-Ion battery, I needed to make sure I could handle the range of 3.0V-4.2V. Most of this is outside the Teensy’s ADC range (max is 3.3V), so I had to make myself a resistor divider attenuator (with a parallel capacitor for added stability). It might have been better to use some sort of active circuit, but this is supposed to be a quick and dirty DAQ. I’ll talk a little more about handling issues spawning from the use of this resistor divider in the section about the host software.

Quick and dirty USB device-side driver

For this project I used my device-side USB driver software that I wrote in this project. Since we are gathering data quite slowly, I figured that a simple control transfer should be enough to handle the requisite bandwidth.

C

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

staticuint8_t tx_buffer[256];

/**

* Endpoint 0 setup handler

*/

staticvoidusb_endp0_handle_setup(setup_t*packet)

{

constdescriptor_entry_t*entry;

constuint8_t*data=NULL;

uint8_t data_length=0;

uint32_t size=0;

uint16_t*arryBuf=(uint16_t*)tx_buffer;

uint8_ti=0;

switch(packet->wRequestAndType)

{

...USB Protocol Stuff...

case0x01c0://get adc channel value (wIndex)

*((uint16_t*)tx_buffer)=adc_get_value(packet->wIndex);

data=tx_buffer;

data_length=2;

break;

default:

gotostall;

}

//if we are sent here, we need to send some data

send:

...Send Logic...

//if we make it here, we are not able to send data and have stalled

stall:

...Stall logic...

}

I added a control request (0x01) which uses the wIndex (not to be confused with the cleaning product) value to select a channel to read. The host software can now issue a vendor control request 0x01, setting the wIndex value accordingly, and get the raw value last read from a particular analog channel. In order to keep things easy, I labeled the analog channels using the same format as the standard Teensy 3.1 layout. Thus, wIndex 0 corresponds to A0, wIndex 1 corresponds to A1, and so forth. The adc_get_value function reads the last read ADC value for a particular channel. Sampling is done by the ADC continuously, so the USB read doesn’t initiate a conversion or anything like that. It just reads what happened on the channel during the most recent conversion.

Host software

Since libusb is easy to use with Python, via PyUSB, I decided to write out the whole thing in Python. Originally I planned on some sort of fancy gui until I realized that it would far simpler just to output a CSV and use MATLAB or Excel to process the data. The software is simple enough that I can just put the entire thing here:

Basically, I just use the argparse module to take some command line inputs, find the device using PyUSB, and spit out the requested channel values in a CSV format to stdout every so often.

In addition to simply displaying the data, the program also processes the raw ADC values into some useful voltage values. I contemplated doing this on the device, but it was simpler to configure if I didn’t have to reflash it every time I wanted to make an adjustment. One thing this lets me do is a sort of calibration using the “attenuation” values that I put into the host. The idea with these values is to compensate for a voltage divider in front of the analog input in order so that I can measure higher voltages, even though the Teensy 3.1 only supports voltages up to 3.3V.

For example, if I plugged my 50%-ish resistor divider on channel A0 into 3.3V, I would run the following command:

1

2

3

4

5

$./ezdaq0

Time,Channel0

1467771464.9665403,1.7990478515625

...

We now have 1.799 for the “voltage” seen at the pin with an attenuation factor of 1. If we divide 1.799 by 3.3 we get 0.545 for our attenuation value. Now we run the following to get our newly calibrated value:

1

2

3

4

5

$./ezdaq-a00.5450

Time,Channel0

1467771571.2447994,3.301005232

...

This process highlights an issue with using standard resistors. Unless the resistors are precision resistors, the values will not ever really match up very well. I used 4 1meg resistors to make two voltage dividers. One of them had about a 46% division and the other was close to 48%. Sure, those seem close, but in this circuit I needed to be accurate to at least 50mV. The difference between 46% and 48% is enough to throw this off. So, when doing something like this with trying to derive an input voltage after using an imprecise voltage divider, some form of calibration is definitely needed.

Conclusion

Battery Charger with EZDAQ Attached (don’t mind the O-Scope probes…those are for another test)

After hooking everything up and getting everything to run, it was fairly simple for me to take some two-channel measurements:

This will dump the output of my program into the charge.csv file (which is measuring the charge cycle on the battery). I will get samples every 5 seconds. Later, I can use this data to make sure my circuit is working properly and observe its behavior over long periods of time. While crude, this quick and dirty DAQ solution works quite well for my purposes.

A complete tutorial for using an STM32 without a dev board

Introduction

STM32F103C8 Tray from eBay

About two years ago I started working with the Teensy 3.1 (which uses a Freescale Kinetis ARM-Cortex microcontroller) and I was super impressed with the ARM processor, both for its power and relative simplicity (it is not simple…its just relatively simple for the amount of power you get for the cost IMO). Almost all my projects before that point had consisted of AVRs and PICs (I’m in the AVR camp now), but now ARM-based microcontrollers had become serious contenders for something that I could go to instead. I soon began working on a small development board project also involving some Freescale Kinetis microcontrollers since those are what I have become the most familiar with. Sadly, I have had little success since I have been trying to make a programmer myself (the official one is a minimum of $200). During the course of this project I came across a LOT of STM32 stuff and it seemed that it was actually quite easy to set up. Lots of the projects used the STM32 Discovery and similar dev boards, which are a great tools and provide an easy introduction to ARM microcontrollers. However, my interest is more towards doing very bare metal development. Like soldering the chip to a board and hooking it up to a programmer. Who needs any of that dev board stuff? For some reason I just find doing embedded development without a development board absolutely fascinating. Some people might interpret doing things this way as a form of masochism. Perhaps I should start seeing a doctor…

Having seen how common the STM32 family was (both in dev boards and in commercial products) and noting that they were similarly priced to the Freescale Kinetis series, I looked in to exactly what I would need to program these, saw that the stuff was cheap, and bought it. After receiving my parts and soldering everything up, I plugged everything into my computer and had a program running on the STM32 in a matter of hours. Contrast that to a year spent trying to program a Kinetis KL26 with only partial success.

This post is a complete step-by-step tutorial on getting an STM32 microcontroller up and running without using a single dev board (breakout boards don’t count as dev boards for the purposes of this tutorial). I’m writing this because I could not find a standalone tutorial for doing this with an ARM microcontroller and I ended up having to piece it together bit by bit with a lot of googling. My objective is to walk you through the process of purchasing the appropriate parts, writing a small program, and uploading it to the microcontroller.

I make the following assumptions:

The reader is familiar with electronics and soldering.

The reader is familiar with flash-based microcontrollers in general (PIC, AVR, ARM, etc) and has programmed a few using a separate standalone programmer before.

The reader knows how to read a datasheet.

The reader knows C and is at least passingly familiar with the overall embedded build process of compilation-linking-flashing.

The reader knows about makefiles.

The reader is ridiculously excited about ARM microcontrollers and is strongly motivated to overlook any mistakes here and try this out for themselves (srsly tho…if you see a problem or have a suggestion, leave it in the comments. I really do appreciate feedback.)

All code, makefiles, and configuration stuff can be found in the project repository on github.

Materials

You will require the following materials:

A computer running Linux. If you run Windows only, please don’t be dissuaded. I’m just lazy and don’t want to test this for Windows. It may require some finagling. Manufacturer support is actually better for Windows since they provide some interesting configuration and programming software that is Windows only…but who needs that stuff anyway?

A STLinkv2 Clone from eBay. Here’s one very similar to the one I bought. ~$3

Some STM32F103C8’s from eBay. Try going with the TQFP-48 package. Why this microcontroller? Because for some reason it is all over the place on eBay. I suspect that the lot I bought (and all of the ones on eBay) is probably not authentic from ST. I hear that Chinese STM32 clones abound nowadays. I got 10 for $12.80.

A breakout board for a TQFP-48 with 0.5mm pitch. Yes, you will need to solder surface mount. I found mine for $1. I’m sure you can find one for a similar price.

4x 0.1uF capacitors for decoupling. Mine are surface mount in the 0603 package. These will be soldered creatively to the breakout board between the power pins to provide some decoupling since we will probably have wires flying all over. I had mine lying around in a parts bin, left over from my development board project. Digikey is great for getting these, but I’m sure you could find them on eBay or Amazon as well.

Some dupont wires for connecting the programmer to the STM32. You will need at least 4. These are the ones that are sold for Arduinos. These came with my programmer, but you may have some in your parts box. They are dang cheap on Amazon.

Regular wires.

An LED and a resistor.

I was able to acquire all of these parts for less than $20. Now, I did have stuff like the capacitors, led, resistor, and wires lying around in parts boxes, but those are quite cheap anyway.

Step 1: Download the datasheets

Above we decided to use the STM32F103C8 ARM Cortex-M3 microcontroller in a TQFP-48 package. This microcontroller has so many peripherals its no wonder its the one all over eBay. I could see this microcontroller easily satisfying the requirements for all of my projects. Among other things it has:

64K flash, 20K RAM

72MHz capability with an internal PLL

USB

CAN

I2C & SPI

Lots of timers

Lots of PWM

Lots of GPIO

All this for ~$1.20/part no less! Of course, its like $6 on digikey, but for my purposes having an eBay-sourced part is just fine.

Ok, so when messing with any microcontroller we need to look at its datasheet to know where to plug stuff in. For almost all ARM Microcontrollers there will be no less than 2 datasheet-like documents you will need: The part datasheet and the family reference manual. The datasheet contains information such as the specific pinouts and electrical characteristics and the family reference manual contains the detailed information on how the microcontroller works (core and peripherals). These are both extremely important and will be indispensable for doing anything at all with one of these microcontrollers bare metal.

Step 2: Figure out where to solder and do it

STM32F103C8 Pins of interest

After getting the datasheet we need to solder the microcontroller down to the breakout board so that we can start working with it on a standard breadboard. If you prefer to go build your own PCB and all that (I usually do actually) then do that instead of this. However, you will still need to know which pins to hook up.

On the pin diagram posted here you will find the highlighted pins of interest for hooking this thing up. We need the following pins at a minimum:

Shown in Red/Blue: All power pins, VDD, VSS, AVDD, and AVSS. There are four pairs: 3 for the VDD/VSS and one AVDD/AVSS. The AVDD/AVSS pair is specifically used to power the analog/mixed signal circuitry and is separate to give us the opportunity to perform some additional filtering on those lines and remove supply noise induced by all the switching going on inside the microcontroller; an opportunity I won’t take for now.

Shown in Yellow/Green: The SWD (Serial Wire Debug) pins. These are used to connect to the STLinkV2 programmer that you purchased earlier. These can be used for so much more than just programming (debugging complete with breakpoints, for a start), but for now we will just use it to talk to the flash on the microcontroller.

Shown in Cyan: Two fun GPIOs to blink our LEDs with. I chose PB0 and PB1. You could choose others if you would like, but just make sure that they are actually GPIOs and not something unexpected.

Below you will find a picture of my breakout board. I soldered a couple extra pins since I want to experiment with USB.

STM32F103C8 Breakout

Very important: You may notice that I have some little tiny capacitors (0.1uF) soldered between the power pins (the one on the top is the most visible in the picture). You need to mount your capacitors between each pair of VDD/VSS pins (including AVDD/AVSS). How you do this is completely up to you, but it must be done and they should be rather close to the microcontroller itself. If you don’t it is entirely possible that when the microcontroller first turns on and powers up (specifically at the first falling edge of the internal clock cycle), the inductance created by the flying power wires we have will create a voltage spike that will either cause a malfunction or damage. I’ve broken microcontrollers by forgetting the decoupling caps and I’m not eager to do it again.

Step 3: Connect the breadboard and programmer

Cheap STLinkV2 Clone

Don’t do this with the programmer plugged in.

On the right you will see my STLinkV2 clone which I will use for this project. Barely visible is the pinout. We will need the following pins connected from the programmer onto our breadboard. These come off the header on the non-USB end of the programmer. Pinouts may vary. Double check your programmer!

3.3V: We will be using the programmer to actually power the microcontroller since that is the simplest option. I believe this pin is Pin 7 on my header.

GND: Obviously we need the ground. On mine this was Pin 4.

SWDIO: This is the data for the SWD bus. Mine has this at Pin 2.

SWCLK: This is the clock for the SWD bus. Mine has this at Pin 6.

You may notice in the above picture that I have an IDC cable coming off my programmer rather than the dupont wires. I borrowed the cable from my AVR USBASP programmer since it was more available at the time rather than finding the dupont cables that came with the STLinkV2.

Next, we need to connect the following pins on the breadboard:

STM32 [A]VSS pins 8, 23, 35, and 47 connected to ground.

STM32 [A]VDD pins 9, 24, 36, and 48 connected to 3.3V.

STM32 pin 34 to SWDIO.

STM32 pin 37 to SWCLK.

STM32 PB0 pin 18 to a resistor connected to the anode of an LED. The cathode of the LED goes to ground. Pin 19 (PB1) can also be connected in a similar fashion if you should so choose.

Here is my breadboard setup:

STM32F103C8 Breadboard Setup

Step 4: Download the STM32F1xx C headers

Since we are going to write a program, we need the headers. These are part of the STM32CubeF1 library found here.

Visit the page and download the STM32CubeF1 zip file. It will ask for an email address. If you really don’t want to give them your email address, the necessary headers can be found in the project github repository.

Alternately, just clone the repository. You’ll miss all the fun of poking around the zip file, but sometimes doing less work is better.

The STM32CubeF1 zip file contains several components which are designed to help people get started quickly when programming STM32s. This is one thing that ST definitely does better than Freescale. It was so difficult to find the headers for the Kinetis microcontrollers that almost gave up at that point. Anyway, inside the zip file we are only interested in the following:

The contents of Drivers/CMSIS/Device/ST/STM32F1xx/Include. These headers contain the register definitions among other things which we will use in our program to reference the peripherals on the device.

Drivers/CMSIS/Device/ST/STM32F1xx/Source/Templates/gcc/startup_stm32f103xb.s. This contains the assembly code used to initialize the microcontroller immediately after reset. We could easily write this ourselves, but why reinvent the wheel?

Drivers/CMSIS/Device/ST/STM32F1xx/Source/Templates/system_stm32f1xx.c. This contains the common system startup routines referenced by the assembly file above.

Drivers/CMSIS/Device/ST/STM32F1xx/Source/Templates/gcc/linker/STM32F103XB_FLASH.ld. This is the linker script for the next model up of the microcontroller we have (we just have to change the “128K” to a “64K” near the beginning of the file in the MEMORY section (line 43 in my file) and we are good to go). This is used to tell the linker where to put all the parts of the program inside the microcontroller’s flash and RAM. Mine had a “0” on every blank line. If you see this in yours, delete those “0”s. They will cause errors.

The contents of Drivers/CMSIS/Include. These are the core header files for the ARM Cortex-M3 and the definitions contained therein are used in all the other header files we reference.

I copied all the files referenced above to various places in my project structure so they could be compiled into the final program. Please visit the repository for the exact locations and such. My objective with this tutorial isn’t really to talk too much about project structure, and so I think that’s best left as an exercise for the reader.

Step 5: Install the required software

We need to be able to compile the program and flash the resulting binary file to the microcontroller. In order to do this, we will require the following programs to be installed:

The arm-none-eabi toolchain. I use arch linux and had to install “arm-none-eabi-gcc”. On Ubuntu this is called “gcc-arm-none-eabi”. This is the cross-compiler for the ARM Cortex cores. The naming “none-eabi” comes from the fact that it is designed to compile for an environment where the program is the only thing running on the target processor. There is no underlying operating system talking to the application binary file (ABI = application binary interface, none-eabi = No ABI) in order to load it into memory and execute it. This means that it is ok with outputting raw binary executable programs. Contrast this with Linux which likes to use the ELF format (which is a part of an ABI specification) and the OS will interpret that file format and load the program from it.

arm-none-eabi binutils. In Arch the package is “arm-none-eabi-binutils”. In Ubuntu this is “binutils-arm-none-eabi”. This contains some utilities such as “objdump” and “objcopy” which we use to convert the output ELF format into the raw binary format we will use for flashing the microcontroller.

Make. We will be using a makefile, so obviously you will need make installed.

OpenOCD. I’m using 0.9.0, which I believe is available for both Arch and Ubuntu. This is the program that we will use to talk to the STLinkV2 which in turn talks to the microcontroller. While we are just going to use it to flash the microcontroller, it can be also used for debugging a program on the processor using gdb.

Once you have installed all of the above programs, you should be good to go for ARM development. As for an editor or IDE, I use vim. You can use whatever. It doesn’t matter really.

Step 6: Write and compile the program

Ok, so we need to write a program for this microcontroller. We are going to simply toggle on and off a GPIO pin (PB0). After reset, the processor uses the internal RC oscillator as its system clock and so it runs at a reasonable 8MHz or so I believe. There are a few steps that we need to go through in order to actually write to the GPIO, however:

Enable the clock to PORTB. Most ARM microcontrollers, the STM32 included, have a clock gating system that actually turns off the clock to pretty much all peripherals after system reset. This is a power saving measure as it allows parts of the microcontroller to remain dormant and not consume power until needed. So, we need to turn on the GPIO port before we can use it.

Set PB0 to a push-pull output. This microcontroller has many different options for the pins including analog input, an open-drain output, a push-pull output, and an alternate function (usually the output of a peripheral such as a timer PWM). We don’t want to run our LED open drain for now (though we certainly could), so we choose the push-pull output. Most microcontrollers have push-pull as the default method for driving their outputs.

Toggle the output state on. Once we get to this point, it’s success! We can control the GPIO by just flipping a bit in a register.

If we turn to our trusty family reference manual, we will see that the clock gating functionality is located in the Reset and Clock Control (RCC) module (section 7 of the manual). The gates to the various peripherals are sorted by the exact data bus they are connected to and have appropriately named registers. The PORTB module is located on the APB2 bus, and so we use the RCC->APB2ENR to turn on the clock for port B (section 7.3.7 of the manual).

The GPIO block is documented in section 9. We first talk to the low control register (CRL) which controls pins 0-7 of the 16-pin port. There are 4 bits per pin which describe the configuration grouped in to two 2-bit (see how many “2” sounding words I had there?) sections: The Mode and Configuration. The Mode sets the analog/input/output state and the Configuration handles the specifics of the particular mode. We have chosen output (Mode is 0b11) and the 50MHZ-capable output mode (Cfg is 0b00). I’m not fully sure what the 50MHz refers to yet, so I just kept it at 50MHz because that was the default value.

After talking to the CRL, we get to talk to the BSRR register. This register allows us to write a “1” to a bit in the register in order to either set or reset the pin’s output value. We start by writing to the BS0 bit to set PB0 high and then writing to the BR0 bit to reset PB0 low. Pretty straightfoward.

It’s not a complicated program. Half the battle is knowing where all the pieces fit. The STM32F1Cube zip file contains some examples which could prove quite revealing into the specifics on using the various peripherals on the device. In fact, it includes an entire hardware abstraction layer (HAL) which you could compile into your program if you wanted to. However, I have heard some bad things about it from a software engineering perspective (apparently it’s badly written and quite ugly). I’m sure it works, though.

So, the next step is to compile the program. See the makefile in the repository. Basically what we are going to do is first compile the main source file, the assembly file we pulled in from the STM32Cube library, and the C file we pulled in from the STM32Cube library. We will then link them using the linker script from the STM32Cube and then dump the output into a binary file.

Step 7: Flashing the program to the microcontroller

This is the very last step. We get to do some openocd configuration. Firstly, we need to write a small configuration script that will tell openocd how to flash our program. Here it is:

1

2

3

4

5

6

# Configuration for flashing the blink program

init

reset halt

flash write_image erase bin/blink.bin0x08000000

reset run

shutdown

Firstly, we init and halt the processor (reset halt). When the processor is first powered up, it is going to be running whatever program was previously flashed onto the microcontroller. We want to stop this execution before we overwrite the flash. Next we execute “flash write_image erase” which will first erase the flash memory (if needed) and then write our program to it. After writing the program, we then tell the processor to execute the program we just flashed (reset run) and we shutdown openocd.

Now, openocd requires knowledge of a few things. It first needs to know what programmer to use. Next, it needs to know what device is attached to the programmer. Both of these requirements must be satisfied before we can run our script above. We know that we have an stlinkv2 for a programmer and an stm32f1xx attached on the other end. It turns out that openocd actually comes with configuration files for these. On my installation these are located at “/usr/share/openocd/scripts/interface/stlink-v2.cfg” and “/usr/share/openocd/scripts/target/stm32f1x.cfg”, respectively. We can execute all three files (stlink, stm32f1, and our flashing routine (which I have named “openocd.cfg”)) with openocd as follows:

1

2

3

openocd-f/usr/share/openocd/scripts/interface/stlink-v2.cfg\

-f/usr/share/openocd/scripts/target/stm32f1x.cfg\

-fopenocd.cfg

So, small sidenote: If we left off the “shutdown” command, openocd would actually continue running in “daemon” mode, listening for connections to it. If you wanted to use gdb to interact with the program running on the microcontroller, that is what you would use to do it. You would tell gdb that there is a “remote target” at port 3333 (or something like that). Openocd will be listening at that port and so when gdb starts talking to it and trying to issue debug commands, openocd will translate those through the STLinkV2 and send back the translated responses from the microcontroller. Isn’t that sick?

In the makefile earlier, I actually made this the “install” target, so running “sudo make install” will actually flash the microcontroller. Here is my output from that command for your reference:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

kcuzner@kcuzner-laptop:~/Projects/ARM/stm32f103-blink$sudo make install

Wooo!!! The LED blinks! At this point, you have successfully flashed an ARM Cortex-M3 microcontroller with little more than a cheap programmer from eBay, a breakout board, and a few stray wires. Feel happy about yourself.

Conclusion

For me, this marks the end of one journey and the beginning of another. I can now feel free to experiment with ARM microcontrollers without having to worry about ruining a nice shiny development board. I can buy a obscenely powerful $1 STM32 microcontroller from eBay and put it into any project I want. If I were to try to do that with AVRs, I would be stuck with the ultra-low-end 8-pin ATTiny13A since that’s about it for ~$1 AVR eBay offerings (don’t worry…I’ve got plenty of ATMega328PB’s…though they weren’t $1). I sincerely hope that you found this tutorial useful and that it might serve as a springboard for doing your own dev board-free ARM development.

If you have any questions or comments (or want to let me know about any errors I may have made), let me know in the comments section here. I will try my best to help you out, although I can’t always find the time to address every issue.

Wow it has been a while. Between school, work, and another project that I’ve been working on since last October (which, if ultimately successful, I will post here) I haven’t had a lot of time to write about anything cool.

I wanted to share today something cool I wrote for my AVRs. Many of my recent AVR projects have become rather complex in that they usually are split into multiple parts in the software which interact with each other. One project in particular had the following components:

A state machine managing several PWM channels. It implemented behavior like pulsing, flashing, etc. It also provided methods for the other components to interact with it.

A state machine managing an NRF24L01+ radio module, again providing methods for components to interact with it.

A state machine managing several inputs, interpreting them and sending commands to the other two components

So, why did I use state machines rather than just implementing this whole thing in a giant loop with some interrupts mixed in? The answer is twofold:

Spaghetti. Doing things in state machines avoided spaghetti code that would have otherwise made the system very difficult to modify. Each machine was isolated from the others both structurally and in software (static variables/methods, etc). The only way to interact with the state machines was to use my provided methods which lends itself to a clean interface between the different components of the program. Switching out the logic in one part of the program did not have any effect on the other components (unless method signatures were changed, of course).

Speed. All of these state machines had a requirement of being able to respond quickly to events, whether they were input events from the user or an interrupt from a timer and whatnot. Interrupts make responding to things in a timely fashion easy, but they stop all other interrupts while running (unless doing nested interrupts, which is …unwieldy… in an AVR) and so doing a lot of computation during an interrupt can slow down the other parts of the system. By organizing this into state machines, I could split up parts of computation into pieces that could execute fast and allow other parts to run in response to their interrupts quickly.

None of this, however, is particularly new. Everyone does state machines and they are comparatively easy to implement. In almost every project I have done there has been some form of state machine, whether it was just busy waiting for some flag somewhere and doing something after that flag was set, or doing something much more complex. What I wanted to show today was a different way of dealing with the issues of spaghetti and speed: Building a preempting task scheduler. These are considered a key and central component of Real-Time Operating Systems, so what you see in this article is the beginnings of a real-time kernel for an AVR microcontroller.

I should mention here that there is a certain level of ability assumed with AVRs in this post. I assume that the reader has a good knowledge of how programs work with the stack, the purpose and functioning of the general purpose registers, how function calls actually happen on the microcontroller, how interrupts work, the ability to read AVR assembly (don’t worry most of this is in C…but there is some critical code written in assembly), and a general knowledge of the avr-gcc toolchain.

Table of Contents

Preemption?

The definition of preemption in computing is being able to interrupt a task, without its cooperation, with the intention of executing it later, in order to run some other task.

Firstly we need to define a task: A task is a unit of execution; something that the program structure defines as a self-contained part of the program which accomplishes some purpose. In our case, a task is basically going to just be a method which never returns. The next thing to define is a scheduler. A scheduler is a software component which can decide from a list of tasks which task needs to run on the processor. The scheduler uses a dispatcher to actually change the code that is executing from the current task to another one. This is called saving and restoring the context of the task.

The cool thing about preemptive scheduling is that any task can be interrupted at any time and another task can start executing. Now, it could be asked, “Well, isn’t that what interrupts do? What’s the difference?”. The difference is that an interrupt in a system using a preemptive scheduler can actually resume in a different place from when it started. Without a scheduler, when an interrupt ends the processor will send the code right back to where it was executing when the interrupt occurred. In contrast, the scheduler actually allows the program to have a little more control and move from one task to another in response to an interrupt or some other stimuli (yes, it doesn’t even need to be an interrupt…it could be another task!).

In the state machine example above, I used state machines as a way to break up computations so that the other machines could run at predetermined points. I noticed that in most of these cases, this point was when waiting for some user input or an interrupt. Although I used interrupts extensively in the application, there were a lot of flags to be polled and this happened inside state machine tick functions. When an interrupt occurred and set a flag, it would need to wait for the “main” code to get around to executing the tick function for the particular state machine that listens to the flag before anything could happen. This introduces a lot of latency and jitter (differences in the amount of time it takes the system to respond to an interrupt from one moment to the next). Not good.

Using tasks removes a lot of these latency problems since the interrupt can halt the current task (block) and begin executing another which was waiting on the interrupt to occur (unblock or resume). Once the higher priority task is blocked again (through a call to some function asking it to wait for some event), the scheduler will change back to the original task and things go on as usual. Using tasks also has the effect of making the code easier to read. While state machines are easy to write, they are not always easy to follow. A function is invoked over and over again and that requires more thought than simply reading a linear function. Tasks can be very linear since the state machine is embodied in calls which could possibly block the task.

A positive side effect of doing things this way (with a scheduler) is that we can now implement the familiar things such as semaphores and queues to communicate between our tasks in a fine grained manner. At their core these are simply methods that can manipulate the list of tasks and call the scheduler to decide which task to execute next.

Summary: Using a preemptive scheduler can allow for lower latency and jitter between an interrupt occurring and some non-ISR code responding to it when compared to using several state machines. This means it will respond more consistently to interrupts occurring, though not necessarily in a more timely fashion. It also allows for fine-grained priority control with these responses.

Pros and Cons

Before continuing, I would like to point out some pros and cons that I see of writing a task scheduler lest we fall into the “golden hammer” antipattern. There are certainly more, but here is my list (feel free to comment with comments on this).

Some Pros

Can reduce the jitter (and possibly the latency) in responding to interrupts. This is of paramount importance in some embedded systems which will have problems if the system cannot respond in a predictable manner to external stimuli.

Can greatly simplify application code by using familiar constructs such as semaphores and queues. Compared to state machines, this code can be easier to read as it can be written very linearly (no switches, if’s etc). This can reduce the initial bugs found in programs.

Can entirely remove the need for busy waits (loops polling a flag). A properly designed state machine shouldn’t have these either, but it can take a large amount of effort to design these kinds of machines. They also can take up a lot of program space when space is at a premium (not always true).

Can reduce application code size. This is weak, but since the code can be made more linear with calls to the scheduler rather than returning all the time, there is no need for switch statements and ifs which can compile to some beastly assembly code.

Some Cons

Can add unnecessary complexity to the program in general. A task scheduler is no small thing and brings with it all of the issues seen in concurrent programming in general. However, these issues usually already exist when using interrupts and such.

Can be very hard to debug. I needed an emulator to get this code working correctly. Anything where we mess with the stack pointer or program counter is going to be a very precise exercise.

Can make the application itself hard to debug. Is it a problem with the scheduler? Or is it a problem with the program itself? It is an additional component to consider when debugging.

Adds additional program weight. My base implementation uses ~450 bytes of program memory. While quite tiny compared to many programs, this would be unacceptably high on a smaller AVR such as the ATTiny13A which only has 1K of program memory.

So…lots of those are contradictory. What is a pro can also be a con. Anyway, I’m just presenting this as something cool to do, not as the end all be all of ways to structure an embedded program. If you have a microcontroller that is performing a lot of tasks that need to be able to react reliably to an interrupt, this might be the way to go for you. However, if your microcontroller is just toggling some gpios and reacting to some timers, this might be overkill. It all depends on the application.

Implementation

Mmmkay here’s the fun part. At this point you may be asking, “How in the world can we make something that can interrupt during one function and resume into another?” I recently completed a course on Real-Time Operating Systems (RTOS) at my university which opened my eyes into how this can be done (we wrote one for the 8086…so awesome!), so I promptly wrote one for the AVR. For those who come by here who have taken the same course at BYU, they will notice some distinct similarities since I went with what I knew. I’ve named it KOS, for “Kevin’s Operating System”, but this was just so I had an easy prefix for my types and function names. If you’re going to implement your own based on this article, don’t worry about naming it like mine (though a mention of this article somewhere would be cool).

Disclaimer: I have only started to scratch the surface of this stuff myself and I may have made some errors. I appreciate any insight anyone can give me into either suggestions for this or problems with my implementation. Just leave it in the comments 🙂

The focus of a scheduler/dispatcher system for tasks is manipulating the stack pointer and the stack itself. “Traditionally,” programs written for microcontrollers have a single stack which grows from the bottom of memory up and all code is executed on that stack. The concept here is that we still start out with that stack, but we actually execute the tasks on their own separate stacks. When we want to switch to a task, we point the AVR’s stack pointer to the desired task’s stack and start executing (its the “start executing” part where things get fun).

First, let’s take a look at the structure which represents a task:

C

1

2

3

4

5

6

7

8

typedefenum{TASK_READY,TASK_SEMAPHORE,TASK_QUEUE}KOS_TaskStatus;

typedefstructKOS_Task{

void*sp;

KOS_TaskStatus status;

structKOS_Task*next;

void*status_pointer;

}KOS_Task;

The very first item in this struct is the pointer to the stack pointer (*sp). It is a void* because we don’t normally access anything on it…we just make the SP register point to it when we want to execute the task.

The next item in the struct is a status enum. This is used by my primitive scheduler to determine if a task is “READY” to execute. If a task is ready to execute, then it is not waiting on anything (i.e. blocked) and it can be resumed at any time. In the case where the task is waiting on something like a semaphore, this status would be changed to SEMAPHORE. The semaphore posting code would then change the status back to READY once somebody posted to the semaphore. This is called “unblocking”.

After the status comes the *next pointer. The tasks are arranged in a linked list because they have a priority attached to them. This priority determines which tasks get executed first. At the top of the linked list is the highest priority task and at the end of the list is the lowest priority task.

Finally, we have the *status_pointer. This is used by our functions which can unblock tasks to determine why tasks are blocked in the first place. We will see more about this when we make a primitive semaphore.

Ok, so for the basic task scheduling and dispatching functionality we are going to implement some functions (these are declared in a header):

C

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

typedefvoid(*KOS_TaskFn)(void);

externKOS_Task*kos_current_task;

/**

* Initializes the KOS kernel

*/

voidkos_init(void);

/**

* Creates a new task

* Note: Not safe

*/

voidkos_new_task(KOS_TaskFn task,void*sp);

/**

* Puts KOS in ISR mode

* Note: Not safe, assumes non-nested isrs

*/

voidkos_isr_enter(void);

/**

* Leaves ISR mode, possibly executing the dispatcher

* Note: Not safe, assumes non-nested isrs

*/

voidkos_isr_exit(void);

/**

* Runs the kernel

*/

voidkos_run(void);

/**

* Runs the scheduler

*/

voidkos_schedule(void);

/**

* Dispatches the passed task, saving the context of the current task

*/

voidkos_dispatch(KOS_Task*next);

As for source files, we will only have a single C file for the implementation, but there will be some inline assembly because we are going to have to fiddle with registers. Yay! I’ll just go through the functions one by one and afterwards I’ll go through my design decisions and how they affect things. This is not the only, nor the best, way to do this.

Implementation: kos_init and kos_new_task

Firstly, we have the kos_init and kos_new_task functions, which come with some baggage:

Here we have two concepts that are embodied. The first is the context. The context the data pushed onto the stack that the dispatcher is going to use in order to restore the task before executing it. This is similar (identical even) to the procedure used with interrupt service routines, except that we store every single one of the 32 registers instead of just the ones that we use. The next concept is that of the idle task. As an optimization, there is a task which has the lowest priority and is never blocked. It is always ready to execute, so when all other tasks are blocked, it will run. This means that we don’t have to deal with the case in the scheduler when there is no tasks to execute since there will always be a task.

The kos_init function performs only one operation: Add the idle task to the list of tasks to execute. Notice that there was some space allocated for the stack of the idle task. This stack must be at least as large as the entire context (35 bytes here) plus enough for any interrupts which may occur during program execute. I chose 48 bytes, but it could be as large as you want. Also take note of the pointer that we pass for the stack into kos_new_task: It is a pointer to the end of our array. This is because stacks grow “up” in memory, meaning a push decrements the address and a pop increments it. If we passed the beginning of the array, the first push would make us point before the memory allocated to the stack since arrays are allocated “downwards” in memory.

The kos_new_task function is a little more complex. It performs two operations: setting up the initial context for the function and adding the Task structure to the linked list of tasks. The context needs to be set up initially because from the scheduler’s perspective, the new task is simply an unblocked task that was blocked before. Therefore, it expects that some context is stored on that task’s stack. Our context is ordered such that the PC (program counter) is first, the 32 registers are next, and the status register is last. Since the stack is last-in first-out, the SREG is popped first, then the 32 registers, and then the PC. We can see at the beginning of the function that we take the function pointer (they are usually 16 bits on most AVRs…the ones with lots of flash do it differently, so consult your datasheets) and set it up to be the program counter. It is arranged LSB-first, so the LSByte is “pushed” before the MSByte. The order here is very important and the reason why will become very apparent when we see the code for the dispatcher. After that, we put 32 0’s onto the stack. These are the initial values for the registers and 0 seemed like a sensible value. The very last byte “pushed” is the status register. We set it to 0x80 so that the interrupt flag is set. This is a design decision to prevent problems with forgetting to enable interrupts for every task and having one task where we forgot to enable it prevent all interrupts from executing. Finally, the top of the stack (note the subtraction of 35 bytes from the stack pointer) is stored on the Task struct along with the initial task state. We add it to the task list as the head of the list, so the last task added is the task with the highest priority.

Implementation: kos_run and kos_schedule

Next we have the kos_run function:

C

1

2

3

4

voidkos_run(void)

{

kos_schedule();

}

Well that’s simple: it just calls the scheduler. So, let’s look at kos_schedule:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

voidkos_schedule(void)

{

if(kos_isr_level)

return;

KOS_Task *task=task_head;

while(task->status!=TASK_READY)

task=task->next;

if(task!=kos_current_task)

{

ATOMIC_BLOCK(ATOMIC_RESTORESTATE)

{

kos_dispatch(task);

}

}

}

The very first thing to notice is the kos_isr_level reference. This solves a very specific problem that occurs with ISRs which I talk about in the next section. Other than that bit, however, this is also simple. Because our tasks in the linked list are ordered by priority, we can simply start at the top and move along the linked list until we locate the first task that is ready (unblocked). Once that task is found, we will call the dispatcher if the task we found is not the currently executing task.

The purpose of the ATOMIC_BLOCK is to ensure that interrupts are disabled when the dispatcher runs. Since the stack is going to be manipulated, the entire dispatcher is considered to be a critical section of code and must be run atomically. The ATOMIC_BLOCK will restore the interrupt status after kos_dispatch returns (which is after the task has been resumed).

Implementation: kos_enter_isr and kos_exit_isr

We are faced with a very particular problem when we want to call our scheduler inside of an interrupt. Let’s imagine a scenario where we have two tasks, Task A and Task B (Task A has higher priority than Task B), in addition to the idle task. Task A uses waits on two semaphores (semaphores 1 and 2) that is signaled by an ISR. When task A is running, it signals another semaphore that Task B waits on (semaphore 3). Here is what happens:

The idle task is running because both Task A and Task B are waiting on semaphores.

An interrupt occurs (note that it happens during the idle task) and the ISR begins executing immediately. An ISR can be thought of as a super high priority task since it will interrupt anything.

The ISR posts to semaphore 1 which Task A is waiting on. The very next statement is going to be to signal semaphore 2 as well. However, this happens next:

After signaling semaphore 1, the dispatcher runs and Task A begins to execute. Task A signals semaphore 3 which will cause Task B to run. Since Task A has a higher priority than B, however, Task B isn’t executed yet. Task A goes on to wait on semaphore 2. This then causes Task B to be dispatched.

Task B takes a really long time to run, but it finally ends. There are no more tasks on the ready list, so the idle task begins to run.

The idle task resumes inside the ISR and posts to semaphore 2.

Task A begins running again.

As straightforward as that may seem, that isn’t the intended behavior. Imagine if a task with an even higher priority than A had the ISR occur while it was executing. The sequence above would be totally different because Task A wouldn’t be dispatched after the 1st semaphore being posted (item #4). Let’s see what happens:

The idle task is running because both Task A and Task B are waiting on semaphores.

An interrupt occurs (note that it happens during the idle task) and the ISR begins executing immediately. An ISR can be thought of as a super high priority task since it will interrupt anything.

The ISR posts to semaphore 1 which task A is waiting on.

After signaling semaphore 1, the scheduler notices that the current task has a higher priority than Task A, so it does not dispatch.

The ISR posts to semaphore 2.

Same as #4. The ISR ends. Let’s say that the high priority task blocks soon afterwards.

Once the high priority task has blocked, Task A is executed. It posts to semaphore 3 and then waits on semaphore 2. Since semaphore 2 has already been posted, it continues right on through without a task switch to Task B. This is a major difference in the order of operations.

After Task A finally blocks, Task B executes.

Because of the inconsistency and the fact that the ISR “priority” when viewed by the scheduler is determined by possibly random ISRs (making it non-deterministic), we need fix this. The solution I went with was to make two methods: kos_enter_isr and kos_exit_isr. These should be called when an ISR begins and when an ISR ends to temporarily hold off calling the scheduler until the very end of the ISR. This has the effect of giving an ISR an apparently high priority since it will not switch to another task until it has completely finished. So, although the idle task may be running when the ISR occurs, while the ISR is running no context switches will occur until the very end. Here is some code:

C

1

2

3

4

5

6

7

8

9

10

11

staticuint8_t kos_isr_level=0;

voidkos_isr_enter(void)

{

kos_isr_level++;

}

voidkos_isr_exit(void)

{

kos_isr_level--;

kos_schedule();

}

As seen in kos_schedule, we use the kos_isr_level variable to indicate to the scheduler whether we are in an ISR or not. When kos_isr_level finally returns to 0, the scheduler will actually perform scheduling when it is called at the end of kos_isr_exit. The second set of events described earlier will now happen every time, even if the idle task is interrupted.

These functions must be run with interrupts disabled since they don’t use any sort of locking, but they should support nested interrupts so long as they are called at the point in the interrupt when interrupts have been disabled.

Implementation: kos_dispatch

The dispatcher is written basically entirely in inline assembly because it does the actual stack manipulation:

C

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

101

102

103

104

105

106

107

108

109

voidkos_dispatch(KOS_Task*task)

{

// the call to this function should push the return address into the stack.

// we will now construct saving context. The entire context needs to be

// saved because it is very possible that this could be called from within

// an isr that doesn't use the call-used registers and therefore doesn't

"bld r31, %[_T_] \n\t"//T flag is on the call clobber list and tasks are only blocked as a result of a function call

"andi r31, %[_nI_MASK_] \n\t"//I is now stored in T, so clear I

"out %[_SREG_], r31 \n\t"

"pop r0 \n\t"

"pop r1 \n\t"

"pop r2 \n\t"

"pop r3 \n\t"

"pop r4 \n\t"

"pop r5 \n\t"

"pop r6 \n\t"

"pop r7 \n\t"

"pop r8 \n\t"

"pop r9 \n\t"

"pop r10 \n\t"

"pop r11 \n\t"

"pop r12 \n\t"

"pop r13 \n\t"

"pop r14 \n\t"

"pop r15 \n\t"

"pop r16 \n\t"

"pop r17 \n\t"

"pop r18 \n\t"

"pop r19 \n\t"

"pop r20 \n\t"

"pop r21 \n\t"

"pop r22 \n\t"

"pop r23 \n\t"

"pop r24 \n\t"

"pop r25 \n\t"

"pop r26 \n\t"

"pop r27 \n\t"

"pop r28 \n\t"

"pop r29 \n\t"

"pop r30 \n\t"

"pop r31 \n\t"

"brtc 2f \n\t"//if the T flag is clear, do the non-interrupt enable return

"reti \n\t"

"2: \n\t"

"ret \n\t"

""::

[_SREG_]"i"_SFR_IO_ADDR(SREG),

[_I_]"i"SREG_I,

[_T_]"i"SREG_T,

[_nI_MASK_]"i"(~(1<<SREG_I)),

[_SPL_]"i"_SFR_IO_ADDR(SPL),

[_SPH_]"i"_SFR_IO_ADDR(SPH),

[_next_task_]"r"(task));

}

So, a lot is happening here. There are 4 basic steps: Save the current context, update the current task’s stack pointer, change the stack pointer to the next task, and restore the next task’s context.

Inline assembly has an interesting syntax in GCC. I don’t believe it is fully portable into non-GCC compilers, so this makes the code depend more or less on GCC. Inline assembly works by way of placeholders (called Operands in the manual). At the very end of the assembly statement, we see a series of comma-separated statements which define these placeholders/operands and how the assembly is going to use registers and such. First off, we pass in the SREG, SPL, and SPH registers as type “i”, which is a constant number known at compile-time. These are simply the IO addresses for these registers (found in avr/io.h if you follow the #include chain deep enough). The next couple parameters are also “i” and are simply bit numbers and masks. The last parameter is the next task pointer passed in as an argument. This is the part where we see the reason why it is more convenient to do this in inline assembly rather than writing it up in an assembly file. While it is possible to look up how avr-gcc passes arguments to functions and discover that the arguments are stored in a certain order in certain registers, it is far simpler and less breakable to allow gcc to fill in the blanks for us. By stating that the _next_task_ placeholder is of type “r” (register), we force GCC to place that variable into some registers of its choosing. Now, if we were using some global variable or a static local, gcc would generate some code before our asm block placing those values into some registers. For this application, that could be quite bad since we depend on no (possibly stack-manipulating) code appearing between the function label and our asm block (more on this in the next paragraph). However, since arguments are passed by way of register, gcc will simply give us the registers by which they are passed in to the function. Since pointers are usually 16 bits on an 8-bit AVR (larger ones will have 3 bytes maybe…but I’m really not sure about this), it fits into two registers. We reference these in the inline assembly by way of “%A[_next_task_]” and “%B[_next_task_]” (note the A and B…these denote the LSB and MSB registers).

Storing the context is pretty straightforward: push all of the registers and push the status register. At this point you may ask, “What about the program counter? Didn’t we have to push that earlier during kos_new_task?” When the function was called (using the CALL instruction), the return address was pushed onto the stack as a side-effect of that instruction. So, we don’t need to push the program counter because it is already on there. This is also why it would be very bad if some code appeared before our asm block. It is likely that gcc will clear out some space on the stack and so we would end up with some junk between the return address on the stack and our first “push” instruction. This would mess up the task context frame and we will see later in the code that this will prevent this function from dispatching the task correctly when it became time for the task to be resumed.

Updating the stack pointer is slightly more tricky. Interrupts are disabled first because it would really suck if we got interrupt during this part (anytime the stack pointer is manipulated is a critical section). We then get to dereference the kos_current_task variable which contains our current task. If we remember from above, the very first thing in the KOS_Task structure is the stack pointer, so if we dereference kos_current_task, we are left with the address at which to store the stack pointer. From there, its as simple as loading the stack pointer into some registers and saving it into Indirect Register X (set by registers 26 and 27).

I should note here something about clearing the interrupt flag. Normally, we would want to check to see if interrupts were enabled beforehand so that we can know if we need to restore them. This code lacks an explicit check because of the fact that the status register (with interrupts possibly enabled) has already been stored. Later, when the current task is restored, the SREG will be restored and thus interrupts will be turned back on if they need to be. Similarly, if the next task has interrupts enabled, they will turned on in the same fashion.

After updating kos_current_task’s stack pointer, we get to move the stack to the next task and set kos_current_task to point to the next task. This is essentially the reverse of the previous operation. Instead of writing to Indirect Register X (which points to the stack pointer of the task), we get to read from it. We also slip in a couple instructions to update the kos_current_task pointer so that it points to the next task. After we have changed the SPL and SPH registers to point to our new stack, the task passed into kos_dispatch is ready to be resumed.

Resuming the next task’s context is a little less straightforward than saving it. We need to prevent interrupts from occurring while we restore the context. The reason for this is to ensure that we don’t end up storing more than one context on that task’s stack (and thereby increase the minimum required stack size to prevent a stack overflow). The problem here is that when we restore the status register, interrupts could be enabled at that point, rather that at the end when the context is done being restored. So, we need to restore in three steps: Restore the status register without the interrupt flag, restore all other registers, and then restore the interrupt flag. This is done by transferring the interrupt flag in the status register into the T (transfer) bit in the status register (that’s the “bst” and “bld” instructions), clearing the interrupt flag, and then later executing either the ret or reti instruction based on this flag. The side effect is that we trash the T bit. I am not sure I can actually do this. This is one part that is tricky: The avr-gcc manual states that the T flag is a scratchpad, just like r0, and doesn’t need to be restored by called functions. My logic here is that since the only way for a task to become blocked is either it being executed initially or from a call to kos_dispatch, gcc sees the dispatch call as a normal function call and will not assume that the T flag will remain unchanged.

After dancing around with bits and restoring the modified SREG, we proceed to pop off the rest of the registers in the reverse order that they were stored at the beginning of the function. At the very end, we use a T flag branch instruction to determine which return instruction to use. “ret” will return normally without setting the interrupt flag and “reti” will set the interrupt flag.

Implementation: Results by code size

So, at this point we have implemented a task scheduler and dispatcher. Here is how it weighs in with avr-size when compiled for an ATMega48A running just the idle task:

1

2

3

4

5

6

7

8

9

10

avr-size -C --mcu=atmega48a bin/kos.elf

AVR Memory Usage

----------------

Device: atmega48a

Program: 474 bytes (11.6% Full)

(.text + .data + .bootloader)

Data: 105 bytes (20.5% Full)

(.data + .bss + .noinit)

Not the best, but its reasonable. The data usage could be taken down by reducing the number of maximum tasks. There are other RTOS available for AVR which can compile smaller. We could do several optimizations which I will discuss in the conclusion

Example: A semaphore

So, we now have a task scheduler. The thing is, although capable of running multiple tasks, it is not possible for multiple tasks to actually run. Why? Because kos_dispatch is never called! We need something that causes the task to become blocked.

As a demonstration, I’m going to implement a simple semaphore. I won’t go into huge detail since that isn’t the point of this article (and it has been long enough), but here is the code:

Header contents:

C

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

typedefstruct{

int8_t value;

}KOS_Semaphore;

/**

* Initializes a new semaphore

*/

KOS_Semaphore*kos_semaphore_init(int8_t value);

/**

* Posts to a semaphore

*/

voidkos_semaphore_post(KOS_Semaphore*sem);

/**

* Pends from a semaphore

*/

voidkos_semaphore_pend(KOS_Semaphore*sem);

Source contents:

C

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

staticKOS_Semaphore semaphores[KOS_MAX_SEMAPHORES+1];

staticuint8_t next_semaphore=0;

KOS_Semaphore*kos_semaphore_init(int8_t value)

{

KOS_Semaphore*s=&semaphores[next_semaphore++];

s->value=value;

returns;

}

voidkos_semaphore_post(KOS_Semaphore*semaphore)

{

ATOMIC_BLOCK(ATOMIC_RESTORESTATE)

{

KOS_Task*task;

semaphore->value++;

//allow one task to be resumed which is waiting on this semaphore

task=task_head;

while(task)

{

if(task->status==TASK_SEMAPHORE&&task->status_pointer==semaphore)

break;//this is the task to be restored

task=task->next;

}

task->status=TASK_READY;

kos_schedule();

}

}

voidkos_semaphore_pend(KOS_Semaphore*semaphore)

{

ATOMIC_BLOCK(ATOMIC_RESTORESTATE)

{

int8_t val=semaphore->value--;//val is value before decrement

if(val<=0)

{

//we need to wait on the semaphore

kos_current_task->status_pointer=semaphore;

kos_current_task->status=TASK_SEMAPHORE;

kos_schedule();

}

}

}

So, our semaphore will cause a task to become blocked when kos_semaphore_pend is called (and the semaphore value was <= 0) and when kos_semaphore_post is called, the highest priority task that is blocked on the particular semaphore will be made ready.

Just so this makes sense, let’s go through an example sequence of events:

Task A is created. There are now two tasks on the task list: Task A and the idle task.

Semaphore is initialized to 1 with kos_semaphore_init(1);

Task A calls kos_semaphore_pend on the semaphore. The value is decremented, but it was >0 before the decrement, so the pend immediately returns.

Task A calls kos_semaphore_pend again. This time, the kos_current_task (which points to Task A) state is set to blocked and the blocking data points to the semaphore. The scheduler is called and since Task A is now blocked, the idle task will be dispatched by kos_dispatch.

The idle task runs and runs

Eventually, some interrupt could occur (like a timer or something). During the course of the ISR, kos_semaphore_post is called on the semaphore. Every call to kos_semaphore_post allows exactly one task to be resumed, so it goes through the list looking for the highest priority task which is blocked on the semaphore. Task A is resumed at the point immediately after the call to kos_dispatch in kos_schedule. kos_schedule returns after a couple instructions restoring the interrupt flag state and now Task A will run until it is blocked.

Here’s a program that does just this:

C

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

/**

* Main file for OS demo

*/

#include "kos.h"

#include <avr/io.h>

#include <avr/interrupt.h>

#include "avr_mcu_section.h" //these two lines are for simavr

AVR_MCU(F_CPU,"atmega48");

staticKOS_Semaphore*sem;

staticuint8_t val;

staticuint8_t st[128];

voidthe_task(void)

{

TCCR0B|=(1<<CS00);

TIMSK0|=(1<<TOIE0);

while(1)

{

kos_semaphore_pend(sem);

TCCR0B=0;

val++;

}

}

intmain(void)

{

kos_init();

sem=kos_semaphore_init(0);

kos_new_task(&the_task,&st[127]);

kos_run();

return0;

}

ISR(TIMER0_OVF_vect)

{

kos_isr_enter();

kos_semaphore_post(sem);

kos_isr_exit();

}

Running this with avr-gdb and simavr we can see this in action. I placed breakpoints at the val++ line and the kos_semaphore_post line. Here’s the output with me pressing Ctrl-C at the end once it got into and stayed in the infinite loop in the idle task:

You may have noticed that the interrupt was called three times before we even got to val++. The reason for this is that timer0 is an 8-bit timer and I used no prescaler for its clock, so the interrupt will happen every 255 cycles. Given that the dispatcher is nearly 100 instructions and the scheduler isn’t exactly short either, the interrupt could easily be called three times before it manages to resume the task after it blocks (including the time it takes to block it).

A word on debugging

Before I finish up I want to mention a few things about debugging with avr-gdb. This project was the first time I had ever needed to use an simulator and debugger to even get the program to run. It would have been impossible to write this using an actual device since very little is revealed when operating the device. Here are a few things I learned:

avr-gdb is not perfect. For example, it is confused by the huge number of push statements at the beginning of kos_dispatch and will crash if stepped into that function (if it receives a break inside kos_dispatch that seems to work sometimes). This is due to avr-gdb attempting to decode the stack and finding that the frame size of the function is too big. It’s weird and I didn’t quite understand why that limitation was there, so I didn’t really muck around with it. This made debugging the dispatcher super difficult.

Stack bugs are hard to find. I would recommend placing a watch on the top of your stack (the place where the variable actually points) and then setting that value to something unlikely like 0xAA. If you see this value modified, you know that there is a problem since you are about to exceed your stack size. I spent hours staring at a problem with that semaphore example above before I realized that the idle task stack had encroached on the semaphore variables. Even then, I was looking at something totally different and just noticed that the stack pointer was too small. As it turns out, my original stack size of 48 was too small. The dispatcher will always require at least 35 free bytes on the stack and any ISR that calls a function will require at least 17 bytes due to the way that functions are called in avr-gcc. 35+17 = 52 which is greater than 48…so yeah. Not good.

Simavr is pretty good. It supports compiling a program that embeds simavr which can be used to emulate the hardware around the microcontroller rather than just the microcontroller itself. I didn’t use this functionality for this project, but that is a seriously cool thing.

Conclusion

This has been a long post, but it is a complicated topic. Writing something like this is actually considered writing an operating system (albeit just the kernel portion and a small one at that) and the debug along for just this post took me a while. One must have a good knowledge of how exactly the processor works. I found my knowledge lacking, actually, and I learned a lot about how the AVR works. The other thing is that things like concurrency and interrupts must be considered from the very beginning. They can’t be an afterthought.

The scheduler and dispatcher I have described here are not perfect nor are they the most optimal efficient design. For one thing, my design uses a huge amount of RAM compared to other RTOS options. My scheduler and dispatcher are also inefficient, with the scheduler having an O(N) complexity depending on the number of tasks. My structure does, however, allow for O(1) time when suspending a task (although I question the utility of this…it worked better with the 8086 scheduler I made for class than with the AVR). Another problem is that kos_dispatch will not work with avr-gdb if the program is stopped during this function (it has a hard time decoding the function prologue because of the large number of push instructions). I haven’t found a solution to this problem and it certainly made debugging a little more difficult.

So, now that I’ve told you some of what’s wrong with the above, here are two RTOS which can be used with the AVR and are well tested:

FemtoOS. This is an extremely tiny and highly configurable RTOS. The bare implementation needs only 270 bytes of flash and 10 bytes of RAM. Ridiculous! My only serious issue with it is that it is GPLv3 licensed and due to how the application is compiled, licensing can be troublesome unless you want to also be GPLv3.

FreeRTOS. Very popular RTOS that has all sorts of support for many processors (ARM, PPC, AVR…you name it). I’ve never used it myself, but it also seems to have networking support and stuff like that. The site says that it’s “market leading.”

Anyway, I hope that this article is useful and as usual, any suggestions and such can be left in the comments. As mentioned before, the code for this article can be found on github here: https://github.com/kcuzner/kos-avr