Board bring-up

I started playing with the FRDM-K64F board recently. I want to use it as a
base for a bunch of hobby projects. The start-up code is not that different from
the one for Tiva, which I describe here - it's the same Cortex-M4
architecture after all. Two additional things need to be taken care of, though:
flash security and the COP watchdog.

The K64F MCU restricts external access to a bunch of resources by default.
It's a great feature if you want to ship a product, but it makes debugging
impossible. The Flash Configuration Field (see section 29.3.1 of the datasheet)
defines the default security and boot settings.

Intro

The game code up until this point abuses timers a lot. It has a timer to handle
rendering and to refresh the display, and a timer to change notes of a tune.
These tasks are not very time sensitive. A couple of milliseconds delay here or
there is not going to be noticeable to users. The timer interrupts are more
appropriate for things like maintaining a sound wave of the proper frequency. A
slight delay here lowers the quality of the user experience significantly.

We could, of course, do even more complex time management to handle both the
graphics and the sound in one loop, but that would be painful. It's much nicer
to have a scheduling system that can alternate between multiple threads of
execution. It is what I will describe in this post.

Thread Control Block and the Stack

Since there's usually only one CPU, the threads need to share it. The easiest
way to achieve time sharing is to have a fixed time slice at the end of which
the system will switch to another thread. The systick interrupt is perfect for
this purpose. Not only is it invoked periodically, but it can also be requested
manually by manipulating a register. This property will be useful in the
implementation of sleeping and blocking.

But first things first: we need to have a structure that will describe a thread,
a.k.a. a Thread Control Block:

flags - properties describing the thread; we will need just one to indicate
whether the thread used the floating-point coprocessor

func - thread's entry point

next - pointer to the next thread in the queue (used for scheduling)

sleep - number of milliseconds the thread still needs to sleep

blocker - a pointer to a semaphore blocking the thread (if any)

priority - thread's priority
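Put together, the TCB might look roughly like this in C. This is a sketch only: the field names other than those listed above are assumptions, and I've added the saved stack pointer, which the context switcher described below stores in the TCB:

```c
#include <stdint.h>

/* Sketch of a semaphore type; the real definition lives elsewhere. */
typedef int32_t IO_sys_semaphore;

/* A Thread Control Block with the fields described above. The layout
   and names are assumptions, not the library's actual definition. */
typedef struct IO_sys_thread {
  uint32_t              *stack_ptr;      /* saved stack pointer          */
  uint32_t               flags;          /* e.g., "thread used the FPU"  */
  void                 (*func)(void *);  /* thread's entry point         */
  struct IO_sys_thread  *next;           /* next thread in the queue     */
  uint32_t               sleep;          /* milliseconds left to sleep   */
  IO_sys_semaphore      *blocker;        /* blocking semaphore, if any   */
  uint8_t                priority;       /* thread's priority            */
} IO_sys_thread;
```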

When invoking an interrupt handler, the CPU saves most of the running state
of the current thread to the stack. Therefore, the task of the interrupt handler
boils down to switching the stack pointers. The CPU will then pop all the
registers back from the new stack. This behavior means that we need to do some
initialization first:

The ARM ABI requires that the top of the stack is 8-byte aligned, and we will
typically push and pop 4-byte words. The first part of the setup function makes
sure that the stack boundaries are right. The second part sets the initial
values of the registers. Have a look here for details.

the PSR register needs to have the Thumb bit switched on

we put the startup function address to the program counter

we put 0xffffffff to the link register to avoid confusing stack traces in
GDB

r0 gets the argument to the startup function

an interrupt pushes 16 words worth of registers to the stack, so the initial
value of the stack pointer needs to reflect that
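The set-up above can be sketched in host-testable C. The frame layout follows the Cortex-M exception model (8 hardware-stacked words on top of 8 software-stacked ones); the function name and exact offsets here are my own reconstruction, not the library's code:

```c
#include <stdint.h>

#define PSR_THUMB (1u << 24)   /* Thumb bit in the PSR */

static void demo_thread(void *arg) { (void)arg; }

/* Build the initial frame: r4-r11 (software-stacked) at the bottom,
   then r0-r3, r12, lr, pc, psr (hardware-stacked) on top. */
static uint32_t *stack_init(uint32_t *stack, uint32_t words,
                            void (*func)(void *), void *arg)
{
  /* Align the top of the stack down to an 8-byte boundary. */
  uintptr_t top = ((uintptr_t)(stack + words)) & ~(uintptr_t)7;

  /* An interrupt pushes 16 words, so start 16 words below the top. */
  uint32_t *sp = (uint32_t *)top - 16;

  sp[15] = PSR_THUMB;                   /* PSR: Thumb bit switched on   */
  sp[14] = (uint32_t)(uintptr_t)func;   /* PC: the startup function     */
  sp[13] = 0xffffffff;                  /* LR: sane GDB stack traces    */
  sp[8]  = (uint32_t)(uintptr_t)arg;    /* r0: the function's argument  */
  return sp;
}
```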

This function is typically called as:

IO_sys_stack_init(thread, thread_wrapper, thread, stack, stack_size);

Note that we do not call the user thread function directly. Instead, we have a
wrapper function that gets the TCB as its argument. This is because we need to
remove the thread from the scheduling queue if the user-specified function
returns.
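The wrapper's job can be illustrated like this (a sketch with assumed names; the real wrapper would also force a context switch instead of returning):

```c
#include <stddef.h>

/* A minimal TCB stand-in for this illustration. */
typedef struct {
  void (*user_func)(void *);  /* the user-specified function    */
  void  *user_arg;
  int    in_queue;            /* 1 while the scheduler sees it  */
} tcb_t;

static void remove_from_queue(tcb_t *t) { t->in_queue = 0; }

/* The wrapper gets the TCB, runs the user function, and removes the
   thread from the scheduling queue when the function returns. */
static void thread_wrapper(tcb_t *t)
{
  t->user_func(t->user_arg);
  remove_from_queue(t);
  /* on real hardware: trigger a context switch and never run again */
}

static int demo_ran;
static void demo_func(void *arg) { (void)arg; demo_ran = 1; }
```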

The context switcher

Let's now have a look at the code that does the actual context switching. Since
it needs to operate directly on the stack, it needs to be written in assembly.
It is not very complicated, though. What it does is:

pushing some registers to the stack

storing the current stack pointer in the stack_ptr variable of the current TCB

The only complication here is that we sometimes need to store the floating point
registers in addition to the regular ones. It is, however, only necessary if the
thread used the FPU. The fourth bit of EXC_RETURN, the value in the LR
register, indicates the status of the FPU. Go here and here for more
details. If the value of the bit is 0, we need to save the high floating-point
registers to the stack and set the FPU flag in the TCB.

Also, after selecting the new thread, we check if its stack contains the FPU
registers by checking the FPU flag in its TCB. If it does, we pop these
registers and change EXC_RETURN accordingly.

Lazy stacking is taken care of by simply pushing and popping the high
registers - it counts as an FPU operation.

Semaphores, sleeping and idling

We can now run threads and switch between them, but it would be useful to be
able to put threads to sleep and make them wait for events.

Sleeping is easy. We just need to set the sleep field in the TCB of the
current thread and make the scheduler ignore threads whenever their sleep
field is not zero:
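The mechanics can be sketched as follows (names assumed; the real code also forces a context switch after setting the field):

```c
#include <stdint.h>

typedef struct { uint32_t sleep; } thread_t;

/* Record how long the thread wants to sleep; the thread then yields. */
static void sys_sleep(thread_t *self, uint32_t ms) { self->sleep = ms; }

/* The millisecond tick decrements every sleeper's counter... */
static void tick_1ms(thread_t *threads, int count)
{
  for (int i = 0; i < count; ++i)
    if (threads[i].sleep)
      --threads[i].sleep;
}

/* ...and the scheduler only considers threads whose counter is zero. */
static int is_runnable(const thread_t *t) { return t->sleep == 0; }
```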

If the value of the semaphore was negative, we find a thread that it was
blocking and unblock it. It will make the scheduler consider this thread for
running in the future.
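In code, the signaling path might look like this (a sketch with an assumed thread-list layout; the real implementation also pends a context switch):

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t semaphore_t;

typedef struct thread_s {
  struct thread_s *next;
  semaphore_t     *blocker;  /* semaphore this thread waits on, if any */
} thread_s;

/* Signal: bump the counter; if it was negative, at least one thread is
   blocked on this semaphore - find one and unblock it so the scheduler
   considers it again. */
static void sem_signal(semaphore_t *s, thread_s *threads)
{
  int32_t old = (*s)++;
  if (old < 0)
    for (thread_s *t = threads; t; t = t->next)
      if (t->blocker == s) {
        t->blocker = NULL;
        break;
      }
}
```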

It may happen that none of the user-defined threads is runnable at the time the
scheduler makes its decision - all of them may be sleeping or waiting for a
semaphore. In that case, we need to keep the CPU occupied with something, i.e.,
we need a fake thread:

Scheduler

The system maintains a circular linked list of TCBs called threads. The job of
the scheduler is to loop over this list and select the next thread to run. It
places its selection in a global variable called IO_sys_current so that other
functions may access it.

otherwise select the next highest priority thread that is neither sleeping
nor blocked on a semaphore
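That selection rule over the circular list can be sketched like this (the structure and names are assumptions; here a larger number means higher priority):

```c
#include <stddef.h>

typedef struct thr {
  struct thr *next;      /* circular scheduling queue         */
  unsigned    sleep;     /* non-zero: still sleeping          */
  void       *blocker;   /* non-NULL: blocked on a semaphore  */
  unsigned    priority;  /* larger value = higher priority    */
} thr;

/* Walk the whole circle once, picking the highest-priority runnable
   thread; fall back to the idle thread if nothing is runnable. */
static thr *schedule(thr *current, thr *idle)
{
  thr *best = NULL;
  thr *t = current->next;
  do {
    if (t->sleep == 0 && t->blocker == NULL &&
        (best == NULL || t->priority > best->priority))
      best = t;
    t = t->next;
  } while (t != current->next);
  return best ? best : idle;
}
```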

Starting up the beast

So how do we get this whole business running? We need to invoke the scheduler
that will preempt the current thread and select the next one to run. The problem
is that we're running using the stack provided by the bootstrap code and don't
have a TCB. Nothing prevents us from creating a dummy one, though. We can create
it on the current stack (it's useful only once) and point it to the beginning of
our real queue of TCBs:

Tests

Tests 11 and 12 run a dummy calculation for some time and then return.
After this happens, the system can only run the idle thread. If we plug in the
profiler code, we can observe the timings on a logic analyzer:

Test #11

Test 13 is more complicated than the two previous ones. Three threads are
running in a loop, sleeping, and signaling semaphores. Two more threads are
waiting for these semaphores, changing some local variables and signaling other
semaphores. Finally, there is the writer thread that blocks on the last set of
semaphores and displays the current state of the environment. The output from
the logic analyzer shows that the writer thread needs around 3.3 time-slices to
refresh the screen:

Test #13

Silly Invaders

How does all this make Silly Invaders better? The main advantage is that we don't
need to calculate complex timings for multiple functions of the program. We
create two threads, one for rendering of the scene and another one for playing
the music tune. Each thread cares about its own timing. Everything else takes
care of itself with good enough time guarantees.

References

Intro

Life would be so much easier if all the software was open source and came
packaged with Debian. Much of the stuff I use is available this way, but
there are still some programs that come as binary blobs with custom installers.
I don't like that. I don't trust that. Every now and then, you hear the stories
of software coming from reputable companies and misbehaving in dramatic
ways. It would be great to contain the potential damage, but running virtual
machines on a laptop is a pain. As it turns out, things may work pretty well
with Docker, but as usual, the configuration is not so trivial.

Terminal

The X Server

The solutions I found on the Internet either share the main X server with
the container or use VNC. The first approach is problematic because
apparently the X architecture has been designed by happy hippies and has no
notion of security. If two applications share a screen, for instance, one can
sniff the keystrokes typed into the other, and all this is by design. The VNC
solution, on the other hand, is terribly slow: windows smudge when moved, and
the Netflix playback is lagging.

Starting a Xephyr instance on the host and sharing its socket with the container
seems to solve the sniffing problem. The programs running inside the container
can't listen to the keystrokes typed outside of it anymore. Xephyr is also fast
enough to handle high-resolution movie playback smoothly.

You can start Xephyr like this:

Xephyr :1 -ac -br -screen 1680x1050 -resizeable

The server will run as display :1 in a resizable window of initial size
defined by the screen parameter. Adding the following to the Docker command
line makes the server visible inside of the container:

-e DISPLAY=:1 -v /tmp/.X11-unix/X1:/tmp/.X11-unix/X1

The only remaining pain point is the fact that you cannot share the clipboard by
default. Things copied outside of the container do not paste inside and vice
versa. The xsel utility and a couple of lines of bash code can solve this
problem easily:

Audio

Making the audio work both ways (the sound and the microphone) is
surprisingly easy with PulseAudio. The host just needs to configure the
native protocol plug-in and ensure that port 4713 is not blocked by the
firewall:

pactl load-module module-native-protocol-tcp auth-ip-acl=172.17.0.2

All you need to do in the container is to make sure that the PULSE_SERVER
envvar points to the host. It is less straightforward than you might expect when
you run a desktop environment and don't want to start all your programs in a
terminal window. For XFCE, I do the following in the script driving the
container:

Intro

I have finally received my Kickstarter-backed UP boards. So far they seem
great! There are three minor drawbacks, though:

They don't have the exact same shape as the Raspberry Pi's, so they don't fit
the raspberry cases. It's nothing that could not be rectified with small
pliers, though.

The audio chip on Cherry Trail (Intel Atom x5 z8350) SoCs is not yet
supported by Linux out of the box, so some fiddling with the kernel is
necessary.

Debian's UEFI boot configuration does not seem to work from the get-go
either.

Boot

You can install Debian Testing using a USB stick. Don't try Jessie, though -
the kernel will not detect the MMC card. Things should work fine, except that
grub will install itself in /EFI/debian/grubx64.efi on the EFI partition.
You will need to move it to /EFI/boot/bootx64.efi manually. It's possible to
do it from the UEFI shell using familiar commands.

Media

Kodi installs and works out of the box from Debian Multimedia.
Unfortunately, to get the sound working, you will need to recompile the kernel
:)

I wanted to see how efficient it is, so I ran the compilation on the board
itself. It took roughly 2.5 hours, and the board got very hot. The board has no
trouble with the FullHD video files over Samba that the Raspberry Pi 2 couldn't
handle. The audio quality is much better, too. It seems that surround 5.1
actually works. :)

Random Number Generator

To make the game more engaging, we introduce some randomness into it. We don't
need anything cryptographically secure, so a Linear Congruential Generator
will do just fine. We count the time from the start-up in millisecond-long
jiffies and wait for a first button press to select the seed.
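A minimal LCG sketch; the multiplier and increment below are the classic Numerical Recipes constants, used here for illustration - the game's actual constants may differ:

```c
#include <stdint.h>

static uint32_t lcg_state;

/* Seed the generator - in the game, from the millisecond jiffies
   counter at the first button press. */
static void lcg_seed(uint32_t seed) { lcg_state = seed; }

/* One LCG step: state = state * a + c (mod 2^32). */
static uint32_t lcg_next(void)
{
  lcg_state = lcg_state * 1664525u + 1013904223u;
  return lcg_state;
}
```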

Rendering Engine

The rendering engine takes a scene descriptor, a display device, and a timer.
Based on this information it computes new positions of objects, draws them on
the screen if necessary and checks for collisions.

Each SI_scene holds a list of "polymorphic" objects that should be rendered, a
pointer to a pre_render function that calculates a new position of each
object, and a pointer to a collision callback that is invoked when the scene
renderer detects an overlap between two objects. The SI_scene_render function
runs after every interrupt:

Whether it gets executed or not depends on the flag parameter of the scene.
If it's set to SI_SCENE_IGNORE, the renderer returns immediately. On the other
hand, if it's set to SI_SCENE_RENDER, the renderer calls the pre_render
callback, draws the objects on the screen, and computes the object overlaps
notifying the collision callback if necessary. After each frame, the scene is
disabled (SI_SCENE_IGNORE). It is re-enabled by the timer interrupt in a time
quantum that depends on the fps parameter.

DAC

Tiva does not have a DAC, but we'd like to have some sound effects while playing
the game. Fortunately, it's easy to make a simple binary-weighted DAC using
resistors and GPIO signals. It's not very accurate, but it will do.

A binary-weighted DAC

As far as the software is concerned, we will simply take 4 GPIO pins and set
them up as output. We will then get an appropriate bit-banded alias such that
writing an integer to it is reflected only in the state of these four pins.
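On Tiva, this works because the GPIODATA register is aliased over a range of addresses: address bits [9:2] act as a mask selecting which pins a write may touch. Computing the alias might look like this (the port base and pin choice are illustrative):

```c
#include <stdint.h>

/* Address of the GPIODATA alias that exposes only the pins set in
   pin_mask: the mask is shifted into address bits [9:2]. */
static uintptr_t gpio_data_alias(uintptr_t port_base, uint8_t pin_mask)
{
  return port_base + ((uintptr_t)pin_mask << 2);
}
```

Writing an integer through a pointer to that address then changes only the four DAC pins, leaving the rest of the port alone.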

Sound

We will create a virtual device consisting of a DAC and a timer. Using the
timer, we will change the output of the DAC frequently enough to produce sound.
Since the timer interrupt needs to be executed often and any delay makes the
sound break, we need to assign the highest possible priority to this interrupt
so that it does not get preempted.

In reality, the timer fires 32 times more often than the frequency of the tone
requires. It is because we use a table with 32 entries to simulate the actual
sound wave. In principle, we could just use a sinusoid, but it turns out that
the quality of the sound is not so great if we do so. I have found another
waveform in the lab materials of EdX's Embedded Systems course that works much
better.
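The timing arithmetic can be sketched like this (the helper names are assumptions, and the table contents below are a placeholder ramp, not the EdX waveform):

```c
#include <stdint.h>

#define WAVE_STEPS 32   /* entries in the waveform table */

/* Timer reload value: the interrupt must fire WAVE_STEPS times per
   period of the tone. */
static uint32_t timer_period(uint32_t clock_hz, uint32_t tone_hz)
{
  return clock_hz / (tone_hz * WAVE_STEPS);
}

static uint8_t wave_step;   /* current position in the table */

/* Called from the timer interrupt: output the next sample and wrap. */
static void tone_tick(const uint8_t *wave, void (*dac_write)(uint8_t))
{
  dac_write(wave[wave_step]);
  wave_step = (wave_step + 1) % WAVE_STEPS;
}

static uint8_t last_sample;
static void capture(uint8_t v) { last_sample = v; }
static const uint8_t demo_wave[WAVE_STEPS] = { 6, 9, 12, 15 /* ... */ };
```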

Timers

Tiva has 12 timer modules that can be configured in various relatively complex
ways. However, for the purpose of this game, we don't need anything fancy. We
will, therefore, represent a timer as an IO_io device with the IO_set
function (generalized from IO_gpio_set) setting and arming it. When it counts
to 0, the IO_TICK event will be reported to the event handler.

ADC

Similarly to the timers, the ADC sequencers on Tiva may be set up in fairly
sophisticated ways. There are 12 analog pins, two modules with four sequencers
each. Again, we don't need anything sophisticated here, so we will just use the
first eight pins and assign each of them to a separate sequencer. In blocking
mode, IO_get initiates the readout and returns the value. In the non-blocking
and asynchronous modes, IO_set requests sampling, and IO_get returns the result
when ready. An IO_DONE event is reported to the event handler if enabled.

Intro

The paper shows that, despite the often-repeated mantra, OS task scheduling is
far from easy. The authors developed two new tools to investigate CPU usage and
the state of the associated run queues. These allowed them to uncover four
interesting performance bugs on a 64-core NUMA system. They discovered that
some cores often stay idle for a long time while tasks are waiting. This is a
violation of one of the design principles of the Completely Fair Scheduler, the
Linux default, which is supposed to be work-conserving. Fixing these bugs
resulted in a 138-fold speedup in an extreme test case (multithreaded, using
spinlocks) and 13-23% speedups in other test cases. This type of bug is hard to
uncover because it typically wastes cycles hundreds of milliseconds at a time,
which is beyond the resolution of standard monitoring tools.

Completely Fair Scheduler

CFS defines an interval in which each task must run at least once. This interval
is then divided between all the tasks in proportion to their weight (niceness).
A running thread accumulates vruntime, which is the amount of time it was
running divided by its weight. The scheduler keeps these tasks in a run queue
which is implemented as a red-black tree. When the CPU becomes idle, the
leftmost node is picked because it has accumulated the least weighted runtime.
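The rule can be illustrated with a toy model (a linear scan stands in for the red-black tree; weights are simplified to plain divisors):

```c
typedef struct {
  double vruntime;  /* accumulated weighted runtime */
  double weight;    /* derived from niceness        */
} task;

/* Account a slice of runtime: heavier tasks accumulate vruntime
   more slowly, so they get a proportionally larger share. */
static void account(task *t, double ran_ms) { t->vruntime += ran_ms / t->weight; }

/* Pick the task with the smallest vruntime - the "leftmost node". */
static task *pick_next(task *tasks, int n)
{
  task *best = &tasks[0];
  for (int i = 1; i < n; ++i)
    if (tasks[i].vruntime < best->vruntime)
      best = &tasks[i];
  return best;
}
```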

In a multi-core system, each core has its own run queue. To fairly distribute
the load among the cores, the run queues must be periodically re-balanced. In
today's systems, with dozens of run queues, the balancing procedure is expensive
and not run very often. This is due to the need to take other factors into
account, such as power saving and cache and memory locality. The load balancing
algorithm takes the threads from the most loaded cores and distributes them
between the least loaded cores taking into account the topology of the system.
The more complex the system gets, the more rules need to be applied and the
harder it gets to reason about performance.

The bugs and the tools

The bugs uncovered by the authors are all related to migrating tasks between
NUMA nodes. They were detected using new tools:

Sanity Checker checks every second whether there are idle cores in the
presence of waiting threads in other run queues. If there are, it monitors
the system for another 100ms. If the situation is not remediated, it begins
to record the profiling information for further off-line analysis.

The scheduler visualization tool taps into various kernel functions to
monitor and visualize scheduling activity over time.

Conclusion

The authors note that the problems were caused by people wanting to optimize CFS
to compensate for the complexity of the modern hardware. They suggest rewriting
the scheduler as a core plus a bunch of optimization modules.

Debugging

To test and debug the SSI code, I connected two boards and made them talk to
each other. It mostly worked. However, it turned out that, by default, you can
run the OpenOCD-GDB duo only for one board at a time. It's the one that
libusb enumerates first. There is a patch that lets OpenOCD choose the
device to attach to by its serial number. The patch has not made it to any
release yet, but applying it and recompiling the whole thing is relatively
straightforward: clone the source, apply the patch, and run the usual
autotools combo. You will then need to create a config file for each board that
specifies unique port numbers and defines the serial number of the device to
attach to:

SSI and GPIO

We need both SSI and GPIO to control the Nokia display that we want to use for
the game. Since, in the end, both of these systems need to push and receive
data, they fit the generic interface used for UART well. The SSI's
initialization function needs many more parameters than the one for UART, so we
pack them all
in a struct. As far as GPIO is concerned, there are two helpers:
IO_gpio_get_state and IO_gpio_set_state that just write the appropriate
byte to the IO device. GPIO also comes with a new event type: IO_EVENT_CHANGE.

Platforms, the display interface, and fonts

All the devices that are not directly on the board may be connected in many
different ways. To handle all these configurations with the same board, we split
the driver into libtm4c.a (for the board specific stuff) and
libtm4c_platform_01.a (for the particular configuration). For now, the only
thing that the platform implements is the display interface. It passes the
appropriate SSI module and GPIOs to the actual display driver. The user sees the
usual IO_io structure that is initialized with IO_display_init and can be
written to and synced. write renders the text to the back-buffer, while sync
sends the back-buffer to the device for rendering. There's also a couple of
specialized functions that have to do only with display devices:

Platform 01 provides one display device, a PCD8544, the one used in Nokia 5110.
It translates and passes the interface calls to the lower-level driver. See
pcd8544.c.

As you may have noticed in the list of functions above, the display interface
supports multiple fonts. In fact, I wrote a script that rasterizes TrueType
fonts and creates internal IO_font structures. These can then be used to
render text on a display device. All you need to do is provide a TTF file,
declare the font name and size in CMake, and then reference it in the
font manager. The code comes with DejaVuSans10 and DejaVuSerif10 by
default.

The heap

Malloc comes in handy from time to time, so I decided to implement one. It is
extremely prone to fragmentation and never merges chunks, so using free is not
advisable. Still, sometimes you just wish you had one. For instance, when you
need to define a buffer for pixels and don't have a good way to ask for
display parameters at compile time. For alignment reasons, the heap management
code reserves a bit more than 4K for the stack. It then creates a 32-byte-long
guard region protected by the MPU. Everything between the end of the .bss
section and the guard page is handled by IO_malloc and IO_free.

Hardware Abstraction Layer

I'd like the game to be as portable as possible. As far as the game logic is
concerned, the actual interaction with the hardware is immaterial. Ideally, we
just need means to write a pixel to a screen, blink an LED or check the state of
a push-button. It means that hiding the hardware details behind a generic
interface is desirable. This interface can then be re-implemented for a
different kind of board, and the whole thing can act as a cool tool for getting
to know new hardware.

In this project, we will use one static library (libio.a) to provide the
interface. This library will implement all the hardware independent functions as
well as the stubs for the driver (as weak symbols). Another library (libtm4c.a)
will provide the real driver logic for Tiva and the strong symbols. This kind of
approach enables us to use the linker to easily produce the final binary for
other platforms in the future.

Initialization: PLL and FPU

To initialize the hardware platform, the user calls IO_init(). The stub for
this function is provided by libio.a as follows:

The actual implementation for Tiva in libtm4c.a initializes the PLL to provide
an 80MHz clock and turns on microDMA. It also sets the access permissions to the
FPU by setting the appropriate bits in the CPACR register and resetting the
pipeline in assembly. We will likely need the floating point in the game, and
it comes in handy when calculating UART transmission speed parameters.

IO_read and IO_write push to and fetch bytes from the device. IO_print
writes a formatted string to the device using the standard printf semantics.
IO_scan reads a word (a stream of characters surrounded by whitespaces) and
tries to convert it to the requested type.

Each subsystem needs to provide its initialization function to fill the IO_io
struct with the information required to perform the IO operations. For instance,
the following function initializes UART:

It needs to know which UART module to use, what the desired mode of operation is
(non-blocking, asynchronous, DMA...) and what should be the speed of the link.
This approach hides the hardware details from the user well and is very generic,
see test-01-uart.c. For instance, you can write something like this:
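The sketch below reconstructs what such code could look like; since the original listing is not reproduced here, the exact signatures are assumptions, and host-side stubs stand in for the real driver so the example runs anywhere:

```c
#include <stdarg.h>
#include <stdio.h>

/* Host-side stand-ins; the real IO_io, IO_uart_init, and IO_print come
   from libio/libtm4c and their signatures may differ. */
typedef struct { int module; } IO_io;

static int IO_uart_init(IO_io *io, int module, int flags, unsigned baud)
{
  (void)flags; (void)baud;
  io->module = module;   /* the real version programs the hardware */
  return 0;
}

static int IO_print(IO_io *io, const char *fmt, ...)
{
  (void)io;              /* the real version writes to the UART */
  va_list ap;
  va_start(ap, fmt);
  int n = vprintf(fmt, ap);
  va_end(ap);
  return n;
}

static IO_io uart0;

static int demo(void)
{
  /* module 0, blocking mode (flags == 0), 115200 bauds */
  if (IO_uart_init(&uart0, 0, 0, 115200))
    return -1;
  return IO_print(&uart0, "Hello, %s! %d\n", "World", 42);
}
```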

Passing 0 as flags to the UART initialization routine creates a blocking
device that is required for IO_print and IO_scan to work.

Non-blocking and asynchronous IO

A blocking IO device will cause the IO functions to return only after they have
pushed or pulled all the data to or from the hardware. If, however, you
configure a non-blocking (IO_NONBLOCKING) device, the functions will process
as many bytes as they can and return. They return -IO_WOULDBLOCK if it is
not possible to handle any data.

The IO_ASYNC flag makes the system notify the user about the device readiness
for reading or writing. These events are received and processed by a
user-defined call-back function:

DMA

The DMA mode allows for transferring data between the peripheral and the main
memory in the background. It uses the memory bus when the CPU does not need it
for anything else. When in this mode, IO_read and IO_write only initiate a
background transfer. The next invocation will either block or return
-IO_WOULDBLOCK, depending on other configuration flags, as long as the current
DMA operation is in progress. The memory buffer cannot be changed until the DMA
transfer is done. Passing IO_ASYNC will generate completion events for DMA
operations. It enables us to implement a pretty neat UART echo app:

The driver

There was nothing ultimately hard about writing the driver part. It all boils
down to reading the data sheet and following the instructions contained therein.
It took quite some time to put everything together into a coherent whole,
though. See: TM4C_uart.c.

Intro

I have recently been playing with microcontrollers a lot. Among other things, I
have worked through some of the labs from this course on EdX. The material
does not use much high-level code, so it gives a good overview of how the
software interacts with the hardware. There are some "black box" components in
there, though. For me, the best way to learn something well has always been
building things from "first principles." I find black boxes frustrating. This
post describes the first step on my way to make an Alien Invaders game from
"scratch."

Compiling for Tiva

First, we need to be able to compile C code for Tiva. To this end, we will use
GCC as a cross-compiler, so make sure you have the arm-none-eabi-gcc command
available on your system. We will use the following flags to build Tiva-compatible
binaries:

-mcpu=cortex-m4 - produce the code for ARM Cortex-M4 CPU

-mfpu=fpv4-sp-d16 - FPv4 single-precision floating point with the register
bank seen by the software as 16 double-words

-Wall and -pedantic - warn about all the potential issues with the code

-ffunction-sections and -fdata-sections - place every function and data
item in a separate section in the resulting object file; this allows all
unused code and data to be removed at link time

Object files

To generate a proper binary image, we need to have some basic understanding of
object files produced by the compiler. In short, they consist of sections
containing various pieces of compiled code and the corresponding data. These
sections may be loadable, meaning that the contents of the section should be
read from the object file and stored in memory. They may also be just
allocatable, meaning that there is nothing to be loaded, but a chunk of memory
needs to be put aside for them nonetheless. There are multiple sections in a
typical ELF object file, but we need to know only four of them:

VMA (virtual memory address) - This is the location of the section the code
expects when it runs.

LMA (load memory address) - This is the location where the section is stored
by the loader.

These two addresses are in most cases the same, except in the situation that we
care about here: an embedded system. In our binary image, we need to put the
.data section in ROM because it contains initialized variables whose values
would otherwise be lost on reset. The section's LMA, therefore, must point to
a location in ROM. However, this data is not constant, so its final position
at the program's runtime needs to be in RAM. Therefore, the VMA must point to a
location in RAM. We will see an example later.
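This is also why start-up code copies .data from its LMA to its VMA on reset. A host-testable sketch; on the target, the three arguments would come from linker-provided symbols:

```c
#include <stdint.h>

/* Copy the .data image from flash (LMA) to its run-time home in RAM
   (VMA). On the target, the arguments would be linker symbols such as
   __data_lma, __data_vma_begin, and __data_vma_end (names assumed). */
static void copy_data(const uint32_t *flash, uint32_t *ram, uint32_t *ram_end)
{
  while (ram != ram_end)
    *ram++ = *flash++;
}
```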

Tiva's memory layout

Tiva has 256K of ROM (range: 0x00000000-0x0003ffff) and 32K of RAM (range:
0x20000000-0x20007fff). See Table 2-4 on page 90 of the data sheet for
details. The NVIC (Interrupt) table needs to be located at address 0x00000000
(section 2.5 of the data sheet). We will create this table in C, put it in a
separate object file section, and fill it with weak aliases of the default handler
function. This approach will enable the user to redefine the interrupt handlers
without having to edit the start-up code. The linker will resolve the handler
addresses to strong symbols if any are present.
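The weak-alias trick can be sketched like this (GCC attribute syntax; the handler names are illustrative, and the real table starts with the initial stack pointer and reset handler and has dozens of entries):

```c
/* Anything not overridden by user code falls through to this handler. */
void default_handler(void) { while (1) ; }

/* Weak aliases: if user code defines a strong uart0_handler, the linker
   resolves the table entry to it instead. */
void uart0_handler(void) __attribute__((weak, alias("default_handler")));
void gpioa_handler(void) __attribute__((weak, alias("default_handler")));

/* Place the table in its own section so the linker script can pin it
   at address 0x00000000. */
__attribute__((section(".nvic")))
void (*const nvic_table[])(void) = {
  uart0_handler,
  gpioa_handler,
};
```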

We start with the .text section and begin it with 0x20007fff. It is the
initial value of the stack pointer (see the data sheet). Since the stack
grows towards lower addresses, we initialize the top of the stack to the
last byte of available RAM.

We then put the .nvic section. The KEEP function forces the linker to
keep this section even when the link-time optimizations are enabled, and the
section seems to be unused. The asterisk in *(.nvic) is a wildcard for an
input object file name. Whatever is in the brackets is a wildcard for a
section name.

We put all the code and read-only data from all of the input files in this
section as well.

We define a new symbol: __text_end_vma and assign its address to the
current VMA (the dot means the current VMA).

We put this section in FLASH: > FLASH at line 10.

We combine the .data* sections from all input files into one section and
put it behind the .text section in FLASH. We set the VMAs to be in RAM:
> RAM AT > FLASH.

Apparently TivaWare changes the value of the VTABLE register and needs
to have the NVIC table in RAM, so we oblige: *(vtable).

We put .bss in RAM after .data.

We use asterisks in section names (e.g., .bss*) because the
-ffunction-sections and -fdata-sections parameters cause the compiler to
generate a separate section for each function and data item.

Edit 02.04.2016: The initial stack pointer needs to be aligned to 8 bytes
for passing of 64-bit long variadic parameters to work. Therefore, the value of
the first four bytes in the text section should be: LONG(0x20007ff8). See
this post for details.

The .text section starts at 0x00000000 both VMA and LMA. The .data section
starts at 0x00000484 LMA (in FLASH) but the code expects it to start at
0x20000000 VMA (in RAM). The symbol addresses seem to match the expectations as
well:

Conclusion

I have implemented all of the interesting functions listed here and, thus,
reached my goal. There were quite a few surprises. I had expected some things to
be more complicated than they are. Conversely, some things that had seemed
simple turned out to be quite complex.

I had initially hoped that I would be able to re-use much of glibc and
concentrate only on the thread-specific functionality. I was surprised to
discover how much of glibc code refers to thread-local storage.

I had expected the interaction between join and detach to be much
simpler to handle. Having to implement descriptor caching was an unexpected
event.

I had never heard of pthread_once before.

I had not used much of the real-time functionality before, so figuring out
the scheduling part was very entertaining. I especially enjoyed implementing
the PRIO_INHERIT mutex.

I may revisit this project in the future because there are still some things
that I would like to learn more about.

If I have the time to learn DWARF, I would like to provide a proper
.eh_frame for the signal trampoline. It would allow me to implement
cancellation using stack unwinding the way glibc does it.

I may look into the inter-process synchronization to learn about the
robust futexes.

The Intel article on lock elision seemed interesting, and I'd like to
play with this stuff as well.

Condition Variables

Condition variables are a mechanism used for signaling that a certain predicate
has become true. The POSIX mechanism for handling them boils down to three
functions: cond_wait, cond_signal and cond_broadcast. The first one causes
the thread that calls it to wait. The second wakes a thread up so that it can
verify whether the condition is true. The third wakes all the waiters up.
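The canonical usage pattern these three functions imply looks like this - the waiter re-checks the predicate in a loop, because a wake-up only means "check again":

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready;

/* The waiter: sleep until the predicate becomes true. */
static void wait_until_ready(void)
{
  pthread_mutex_lock(&lock);
  while (!ready)                 /* a loop, not an "if" */
    pthread_cond_wait(&cond, &lock);
  pthread_mutex_unlock(&lock);
}

/* The signaler: make the predicate true and wake a waiter. */
static void make_ready(void)
{
  pthread_mutex_lock(&lock);
  ready = 1;
  pthread_cond_signal(&cond);    /* or _broadcast to wake all waiters */
  pthread_mutex_unlock(&lock);
}
```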

RW Locks

A read-write lock protects a critical section by allowing multiple readers when
there are no writers. We won't bother implementing lock attributes handling
because we don't support process-shared locks (irrelevant in our case) and we
don't let the user prefer readers (a non-POSIX extension). The implementation
remembers the ID of the current writer if any. It also counts readers as well as
the queued writers. We use two futexes, one to block the readers and one to
block the writers.

A writer first bumps the number of queued writers. If there is no other writer
and no readers, it marks itself as the owner of the lock and decrements the
number of queued writers. It goes to sleep on the writer futex otherwise.

When unlocking, we use the writer field to determine whether we were a reader
or a writer. If we were a writer, we had exclusive ownership of the lock.
Therefore, we need to either wake another writer or all of the readers,
depending on the state of the counters. If we were a reader, we held a
non-exclusive lock. Therefore, we only need to wake a writer when we're the last
reader and there is a writer queued. We bump the value of the futex because we
want to handle the cases when FUTEX_WAKE was called before the other thread
managed to call FUTEX_WAIT.


Thread Scheduling

The scheduler makes a decision of which thread to run next based on two
parameters: scheduling policy and priority.

Conceptually, the scheduler maintains a list of runnable threads for each
possible sched_priority value. In order to determine which thread runs next,
the scheduler looks for the nonempty list with the highest static priority
and selects the thread at the head of this list.

A thread's scheduling policy determines where it will be inserted into the
list of threads with equal static priority and how it will move inside this
list.
the sched(7) man page

The supported policies are:

SCHED_NORMAL - It's the default Linux policy for threads not requiring any
real-time machinery. All the threads have a priority of 0, and the decision of
which thread gets run next is based on the
nice mechanism.

SCHED_FIFO - The threads have priority from 1 (low) to 99 (high). When a
SCHED_FIFO thread becomes runnable, it will preempt any thread of lower
priority. There is no time slicing, so the high-priority thread runs as long
as it must.

SCHED_RR - RR stands for round-robin. It's the same as SCHED_FIFO
except each thread is allowed to run for a limited time quantum.

There is a bit more fun to it, though. As you can see, the kernel can only
schedule a task that already exists. Therefore, we need to have a way to set the
thread's priority before this thread invokes the user function. The reason for
this is that we may need to abort this thread immediately should the scheduler
setting fail. We do it by having another futex that we wake when we know whether
the thread can run or not:
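That start gate can be sketched with a status word and a futex. The protocol below (0 means undecided, 1 means go, -1 means abort) and the function name are my own invention for illustration, not Thread Bites' actual code:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <unistd.h>

/* The new thread parks here before calling the user function. The creating
 * thread stores 1 (go) or (uint32_t)-1 (abort) and wakes the futex once it
 * knows whether setting the scheduler parameters succeeded. */
static int wait_for_start(uint32_t *status)
{
  uint32_t st;
  while ((st = __atomic_load_n(status, __ATOMIC_SEQ_CST)) == 0)
    syscall(SYS_futex, status, FUTEX_WAIT, 0, 0, 0, 0);
  return (int32_t)st > 0 ? 0 : -1;
}
```

If FUTEX_WAIT returns because the word no longer holds 0, the loop simply re-reads the verdict.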

Priority mutexes

TBTHREAD_PRIO_NONE - Acquiring this type of mutex does not change the
scheduling characteristics of the thread.

TBTHREAD_PRIO_INHERIT - When the mutex owner blocks another thread with a
higher priority, the owner inherits the priority of the thread it blocks if
it's higher than its own.

TBTHREAD_PRIO_PROTECT - Acquiring this kind of mutex raises the priority
of the owner to the prioceiling value of the mutex.

Thread Bites implements this functionality by keeping lists of PRIO_INHERIT
and PRIO_PROTECT mutexes. It then calculates the highest possible priority
taking into account the priority of the mutexes and the priority set by the
user.

The implementation of PRIO_PROTECT is relatively straightforward. Whenever a
thread acquires this kind of mutex, it is added to the list, and the priority
of the thread is recalculated:

Implementing PRIO_INHERIT is a lot trickier. We add the mutex to the
appropriate list when a thread acquires it. Whenever a higher priority thread
tries to lock the mutex, it bumps the priority of the blocker. But the priority
recalculation is done only at this point. Implementing it like this covers all
the main cases and is not horrendously hard. It allows for simple recursion: if
the owner of a mutex gets blocked, the blocker inherits the priority that comes
with the first mutex. It also has a couple of drawbacks:

It assumes that the kernel will always wake the highest priority thread. It
makes sense and is most likely the case. However, I have not tested it.

If the owner of a PRIO_INHERIT mutex is already blocked on another mutex of
the same kind and its priority gets bumped later, the last thread in the
line won't be affected.

It was by far the most challenging part so far. See the patch at GitHub.

Remaining functions

pthread_setconcurrency - It defines how many kernel tasks should be created
to handle the user-level threads. It does not make sense in our case because
we create a kernel task for every thread.

pthread_attr_setscope - It defines the set of threads against which the
thread will compete for resources. There are two settings:
PTHREAD_SCOPE_SYSTEM meaning all the threads in the entire system and
PTHREAD_SCOPE_PROCESS meaning only the threads within the process. The man
page says that Linux only supports PTHREAD_SCOPE_SYSTEM, but I am not sure
whether it's still the case with all the cgroups stuff.


Cancellation

Cancellation boils down to making one thread exit following a request from
another thread. It seems that calling tbthread_exit at an appropriate point
is enough to implement all of the behavior described in the man pages. We will
go this way despite the fact that it is not the approach taken by glibc.
Glibc unwinds the stack back to the point invoking the user-supplied thread
function. This behavior allows it to simulate an exception if C++ code is using
the library. We don't bother with C++ support for the moment and don't always
care to supply valid DWARF information. Therefore, we will take the easier
approach.

tbthread_setcancelstate and tbthread_setcanceltype are the two functions
controlling the response of a thread to a cancellation request. The former
enables or disables cancellation altogether, queuing the requests for later
handling if necessary. The latter decides whether the thread should abort
immediately or at a cancellation point. POSIX has a list of cancellation points,
but we will not bother with them. Instead, we'll just use tbthread_testcancel
and the two functions mentioned before for this purpose.

The thread must not get interrupted after it disables or defers cancellation. It
would likely lead to deadlocks due to unreleased mutexes, memory leaks and such.
The trick here is to update all the cancellation related flags atomically. So, we
use one variable to handle the following flags:

TB_CANCEL_ENABLED: The cancellation is enabled; if a cancellation request
has been queued, reaching a cancellation point will cause the thread to exit.

TB_CANCEL_DEFERRED: The cancellation is deferred (not asynchronous);
SIGCANCEL will not be sent; see the paragraph on signal handling.

TB_CANCELING: A cancellation request has been queued; depending on other
flags, SIGCANCEL may be sent.

TB_CANCELED: A cancellation request has been taken into account and the
thread is in the process of exiting; this flag is used to handle the cases
when a cancellation point has been reached before SIGCANCEL has been
delivered by the kernel.
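Updating such a combined state word atomically boils down to a compare-and-swap loop. The flag values and the helper below are hypothetical, for illustration only:

```c
#include <stdint.h>

/* Hypothetical flag values; the real ones live in the Thread Bites headers. */
#define TB_CANCEL_ENABLED  0x01
#define TB_CANCEL_DEFERRED 0x02
#define TB_CANCELING       0x04
#define TB_CANCELED        0x08

/* Flip one flag with a CAS loop, so that no concurrent cancellation request
 * can slip in between reading and writing the state. */
static void set_flag(uint32_t *flags, uint32_t flag, int on)
{
  uint32_t old, new;
  do {
    old = __atomic_load_n(flags, __ATOMIC_SEQ_CST);
    new = on ? (old | flag) : (old & ~flag);
  } while (!__sync_bool_compare_and_swap(flags, old, new));
}
```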

Clean-up handlers

The user may register a bunch of functions cleaning up the mess caused by an
unexpected interruption. They are installed with tbthread_cleanup_push and
called when the thread exits abnormally. The purpose of these functions is to
unlock mutexes, free the heap memory and such. tbthread_cleanup_pop removes
the most recently pushed handler and optionally executes it in the process.

Signals and asynchronous cancellation

The asynchronous cancellation uses the first real-time signal, SIGRTMIN, that
we call SIGCANCEL here for clarity.

Registering a signal handler is somewhat trickier than just calling the
appropriate syscall. It is so because, on x86_64, we need to provide a function
that restores the stack after the signal handler returns. The function is called
a signal trampoline and its purpose is to invoke sys_rt_sigreturn. The
trampoline is registered with the kernel using a special sigaction flag:

Looking at the corresponding glibc code, you can see that they add the
eh_frame info here. The comments say that it is to aid gdb and handle the
stack unwinding. I don't know enough DWARF to write one on my own, gdb does
not seem to be utterly confused without it, and we won't do stack unwinding, so
we just won't bother with it for the moment.

In the cancellation handler, we first check whether it's the right signal and
that it has been sent by a thread in the same thread group. We then need to
check whether the thread is still in the asynchronous cancellation mode. It
might have changed between the time the signal was sent and the time it is
delivered. Finally, we call tbthread_testcancel to see if the thread should
exit.

Cancellation of a "once" function

The implementation of tbthread_once gets quite a bit more interesting as well.
If the thread invoking the initialization function gets canceled, another thread
needs to pick it up. We need to install a cleanup handler that will change the
state of the once control back to TB_ONCE_NEW and wake all the threads so that
they could restart from the beginning:


Recycling the thread descriptors

How do we know when the thread task has died? This is what the CHILD_SETTID
and CHILD_CLEARTID flags to sys_clone are for. If they are set, the kernel
will store the new thread's TID at the location pointed to by the ctid
argument (see tb #1). When the thread terminates, the kernel will set the
TID to 0 and wake the futex at this location. It is a convenient way to wait
for a thread to finish. Unfortunately, as far as I can tell, there is no way to
unset these flags, and it makes implementing tbthread_detach a pain. We cannot
delete the thread descriptor in the thread it refers to anymore. Doing so would
cause the kernel to write to a memory location that might have been either
unmapped or reused. Therefore, we need to have some sort of a cache holding
thread descriptors and make sure that we re-use them only after the thread they
were referring to before has exited. Thread bites uses two linked lists to
maintain this cache, and the descriptor allocation function calls the following
procedure to wait until the corresponding task is gone:
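A minimal version of that wait, assuming the descriptor's ctid word is the one registered with CHILD_CLEARTID (the function name is mine):

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <unistd.h>

/* Sleep until the kernel zeroes the ctid word of an exited task. If the
 * value changes between the load and the FUTEX_WAIT, the syscall returns
 * EWOULDBLOCK and we simply re-check the word. */
static void wait_for_task(uint32_t *ctid)
{
  uint32_t tid;
  while ((tid = __atomic_load_n(ctid, __ATOMIC_SEQ_CST)) != 0)
    syscall(SYS_futex, ctid, FUTEX_WAIT, tid, 0, 0, 0);
}
```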

Joining threads

Joining complicates things a bit further because it does quite a bit of error
checking to prevent deadlocks and such. To perform these checks, the thread
calling tbthread_join needs to have a valid thread descriptor obtainable by
tbthread_self. The problem is that we have never set this thread descriptor
up for the main thread, and we need to do it by hand at the beginning of the
program. The original state needs to be restored at the end because glibc
uses it internally and not cleaning things up causes segfaults.

Dynamic initialization

pthread_once is an interesting beast. Its purpose is to initialize dynamically
some resources by calling a designated function exactly once. The fun part is
that the actual initialization call may be made from multiple threads at the
same time. pthread_once_t, therefore, is kind of like a mutex, but has three
states instead of two:

new: the initialization function has not been called yet; one of the
threads needs to call it.

in progress: the initialization function is running; the threads are
waiting for it to finish.

done: the initialization function is done; all the threads may be woken up.

The thread that manages to change the state from new to in progress gets to
call the function. All the other threads wait until the done state is reached.
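The three-state machine fits in a short CAS-and-futex loop. This is a sketch under my own names (the real Thread Bites constants and signature may differ), with a trivial example payload attached for demonstration:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <limits.h>
#include <stdint.h>
#include <unistd.h>

enum { TB_ONCE_NEW, TB_ONCE_IN_PROGRESS, TB_ONCE_DONE }; /* hypothetical */

static int tb_once(uint32_t *once, void (*init)(void))
{
  while (1) {
    uint32_t st = __atomic_load_n(once, __ATOMIC_SEQ_CST);
    if (st == TB_ONCE_DONE)
      return 0;
    if (st == TB_ONCE_NEW &&
        __sync_bool_compare_and_swap(once, TB_ONCE_NEW, TB_ONCE_IN_PROGRESS)) {
      init();                         /* we won the race, run the function */
      __atomic_store_n(once, TB_ONCE_DONE, __ATOMIC_SEQ_CST);
      syscall(SYS_futex, once, FUTEX_WAKE, INT_MAX, 0, 0, 0);
      return 0;
    }
    /* somebody else is initializing; sleep until the state changes */
    syscall(SYS_futex, once, FUTEX_WAIT, TB_ONCE_IN_PROGRESS, 0, 0, 0);
  }
}

/* example payload, just to show the call-exactly-once behavior */
static int init_calls;
static void init_once(void) { ++init_calls; }
```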

Side effects

The original glibc thread descriptor stores the localization information for
the thread. Changing it to ours causes seemingly simple functions, like
strerror, to segfault. Therefore, we need to implement strerror
ourselves.


Intro

This part discusses an implementation of a mutex. It will not be a particularly
efficient mutex, but it will be an understandable and straightforward one. We
will not bother minimizing the number of system calls or implementing lock
elision. We will also not handle the case of inter-process communication.
Therefore, the process-shared and robust mutexes will not be discussed. If you
are interested in these ideas, I recommend the kernel's documentation file
on robust mutexes and Intel's blog post on lock elision. The scheduling
related parameters will likely be dealt with on another occasion.

Futexes

POSIX defines a bunch of different mutexes. See the manpage for
pthread_mutexattr_settype to learn more. On Linux, all of them are implemented
using the same locking primitive - a Futex (Fast User-Space Mutex). It is a
4-byte long chunk of aligned memory. Its contents can be updated atomically, and
its address can be used to refer to a kernel-level process queue. The kernel
interface is defined as follows:

We will only use the first four out of the six parameters here. The first one is
the address of the futex, and the second one is the type of the operation to be
performed. The meaning of the remaining parameters depends on the context. We
will need only two of the available operations to implement a mutex:

FUTEX_WAIT puts a thread to sleep if the value passed in val is the same
as the value stored in the memory pointed to by *uaddr. Optionally, the
sleep time may be limited by passing a pointer to a timespec object. The
return values are:

0, if the thread was woken up by FUTEX_WAKE.

EWOULDBLOCK, if the value of the futex was different than val.

EINTR, if the sleep was interrupted by a signal.

FUTEX_WAKE wakes the number of threads specified in val. In practice, it
only makes sense to either wake one or all sleeping threads, so we pass
either 1 or INT_MAX respectively.
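A thin wrapper over the raw interface makes the two operations easy to experiment with; the wrapper names are mine, not part of any API:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <unistd.h>

/* Sleep if *uaddr still holds val; fails with EWOULDBLOCK otherwise. */
static long futex_wait(uint32_t *uaddr, uint32_t val)
{
  return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, 0, 0, 0);
}

/* Wake up to nwake sleepers; returns the number actually woken. */
static long futex_wake(uint32_t *uaddr, int nwake)
{
  return syscall(SYS_futex, uaddr, FUTEX_WAKE, nwake, 0, 0, 0);
}
```

Waiting on a word whose value already differs fails immediately with EWOULDBLOCK (the same value as EAGAIN on Linux), and waking a futex nobody sleeps on simply returns 0.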

Normal mutex

We start with a normal mutex because it is possible to implement all the other
kinds using the procedures defined for it. If the value of the associated futex
is 0, then the mutex is unlocked. Locking it means changing the value to 1.
To avoid race conditions, both the checking and the changing need to be done
atomically. GCC has a built-in function to do this that results
with lock cmpxchgq or similar being emitted in assembly. If the locking fails,
we need to wait until another thread releases the mutex and re-check the lock.
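The lock/unlock logic described above can be sketched as follows; the type and function names are made up for illustration and do not match Thread Bites' code exactly:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked */
typedef struct { uint32_t futex; } tb_mutex;

static void tb_mutex_lock(tb_mutex *m)
{
  /* atomically check-and-set; sleep while somebody else holds the lock */
  while (__sync_val_compare_and_swap(&m->futex, 0, 1) != 0)
    syscall(SYS_futex, &m->futex, FUTEX_WAIT, 1, 0, 0, 0);
}

static void tb_mutex_unlock(tb_mutex *m)
{
  __atomic_store_n(&m->futex, 0, __ATOMIC_SEQ_CST);
  /* wake one sleeper unconditionally: simple, but a syscall every time */
  syscall(SYS_futex, &m->futex, FUTEX_WAKE, 1, 0, 0, 0);
}
```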

Note that the values stored in the futex are application-specific and arbitrary.
The kernel does not care and does not change this variable except in one case,
which we will discuss in a later chapter.

This mutex is not particularly efficient because we make a system call while
unlocking regardless of whether there is a waiting thread or not. To see how the
situation could be improved, please refer to Ulrich Drepper's
Futexes Are Tricky.

Error-check mutex

These guys are the same as the ones discussed earlier, except that they do
additional bookkeeping. An error-check mutex remembers who owns it, to report
the following types of errors:

Recursive mutex

Recursive mutexes may be locked multiple times by the same thread and require
the same number of unlock operations to be released. To provide this kind of
functionality, we just need to add a counter.
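The bookkeeping can be sketched like this, reusing the normal-mutex futex logic; the struct layout and names are illustrative, not the library's actual ones:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <unistd.h>

typedef struct { uint32_t futex; int owner; uint64_t counter; } tb_rmutex;

static int self_tid(void) { return (int)syscall(SYS_gettid); }

static void tb_rmutex_lock(tb_rmutex *m)
{
  if (m->owner == self_tid()) { ++m->counter; return; } /* re-entry: count */
  while (__sync_val_compare_and_swap(&m->futex, 0, 1) != 0)
    syscall(SYS_futex, &m->futex, FUTEX_WAIT, 1, 0, 0, 0);
  m->owner = self_tid();
  m->counter = 1;
}

static void tb_rmutex_unlock(tb_rmutex *m)
{
  if (m->owner != self_tid() || --m->counter != 0)
    return;                     /* still held, or not ours to release */
  m->owner = 0;
  __atomic_store_n(&m->futex, 0, __ATOMIC_SEQ_CST);
  syscall(SYS_futex, &m->futex, FUTEX_WAKE, 1, 0, 0, 0);
}
```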

Other code

This is pretty much it. The remaining part of the code is not interesting. Both
mutex and mutexattr objects need to be initialized to their default values,
but the futexes don't need any initialization or cleanup. As always, the full
patch is available on GitHub.

Edit 28.03.2016: There are more details about the startup code in
this post.

CMake

I have recently started playing with the Tiva launchpad. It's a pity, though,
that most of the tutorials and course material out there show you how to
program it only using something or other on Windows. I have even gone as far
as installing it on my old laptop to follow some of these tutorials. But,
I have quickly re-discovered the reasons for my dislike of Windows.

There are some great resources available explaining how to use the Stellaris
board on Linux. Stellaris is a predecessor of Tiva, and much of this advice
applies to Tiva as well. Everyone seems to use Make, though. I don't like it
because generating source file dependencies and discovering libraries with it
involves black magic and blood of goats. I decided, then, to add my two cents
and create a template for CMake (GitHub). It works fine both with or
without TivaWare and uses my BSD-licensed start-up files. To use it for your
project, all you need to do is:


Intro

The second part of our threading journey covers the thread-local storage. I do
not mean here the compiler generated TLS that I mentioned in the first part. I
mean the stuff that involves pthread_setspecific and friends. I don't think
it's the most critical part of all this threading business. I barely ever use it
in practice. However, we will need to refer to the current thread all over the
place, and this requires the TLS. It's better to deal with it once and for all,
especially since it's not particularly complicated.

Self

How does one store something in a fixed location and make it distinct for each
execution thread? The only two answers that come to my mind are either using
syscalls to associate something with kernel's task_struct or using CPU
registers. The first approach requires context switches to retrieve the value,
so it's rather inefficient. The second option, though, should be pretty fast.
Conveniently, on x86_64, some registers are left unused (see
StackOverflow). In fact, the SETTLS option to clone takes a pointer
and puts it in the fs segment register for us, so we don't even need to make
an extra syscall just for that.

Since fs is a segment register, we cannot retrieve its absolute value without
engaging the operating system. Linux on x86_64 uses the arch_prctl syscall for
this purpose:

This syscall seems expensive (link) and making it defeats the purpose of
using a register in the first place. We can read the memory at an address
relative to the value of the register, though. Using this fact, we can make our
thread struct point to itself and then retrieve this pointer using inline
assembly. Here's how:

For the main thread, the linker initializes the TLS segment for pthreads
automatically using arch_prctl. We cannot use it for our purposes, but we can
count on tbthread_self returning a unique, meaningful value.

TLS

The actual TLS is handled by the following four functions: pthread_key_create,
pthread_key_delete, pthread_getspecific, pthread_setspecific. I will not
explain what they do because it should be pretty self-evident. If it's not, see the
man pages.

In principle, we could just have a hash table in the tbthread struct. We could
then use setspecific and getspecific to set and retrieve the value
associated with each given key. Calling setspecific with a NULL pointer would
delete the key. Unfortunately, the designers of pthreads made it a bit more
complicated by having separate key_create and key_delete functions, with the
delete function invalidating the key in all the threads. Glibc uses a global
array of keys and sequence numbers in a clever way to solve this problem. We
will take almost the same approach in a less efficient but a bit clearer way.

We will have a global array representing keys. Each element of this array will
host a pointer to a destructor and a sequence number. The index in the array
will be the actual key passed around between the TLS functions. Both
key_create and key_delete will bump the sequence number associated with a
key in an atomic way. An even number means the key is not allocated; an odd
one means it is.

Each tbthread struct will hold an array of the same size as the global array
of keys. Each element in this array will hold a data pointer and a sequence
number. Storing data will set the sequence number as well. Retrieving the data
will check whether the local and the global sequence number match before
proceeding.


Intro

This is a first of hopefully many posts documenting my attempts to understand
how to implement a pthread-style threading system on Linux. To this end, I
started implementing a small, non-portable, and rather useless library that I
called Thread Bites. Thread Bites is useless mainly because it lacks support
for compiler-generated thread-local storage. It may not sound like a grave
issue, but it makes Thread Bites incompatible with most of glibc. Therefore,
I had to provide my own functionality for invoking syscalls, managing the heap
and even printing stuff to stdout. It's all pretty simple and understandable so
far, so I hope I will be able to implement most of the pthreads' functionality
in a couple of small bites. You can get the source from GitHub.

Syscalls

For a program to be even remotely useful, it needs to communicate with the user
in one way or another. A standard library for the programming language, like
glibc, typically provides all the necessary components for such communication.
For reasons mentioned in the introduction, using glibc in this case is not
advisable. Hence, I need to find a way to call the operating system directly
without using the syscall function, because it is also a part of glibc and sets
errno, which is supposed to reside in the TLS that I did not set up.

All that is needed to implement an equivalent function is shuffling around the
values of the registers to translate between the calling conventions of C and
the Linux kernel, as described here and here on page 20. As it turns
out, it's not that hard to do it using inline assembly in C, and there's an
excellent tutorial here.

All this is, of course, horribly inefficient because it messes with all the
registers even when it does not have to. The intermediate variables
for the parameters (__a1 and friends) are used to prevent embedded function
calls from messing with the registers that have already been set; think of
strlen in SYSCALL3(__NR_write, 1, blah, strlen(blah)).

Printing to stdout

It seems that glibc is using errno and other thread local stuff to calculate
buffer sizes in one of the subroutines called by printf. It causes printf to
segfault when called concurrently from different threads because of the same TLS
story. Thread Bites provides a convenience function similar to printf and
supporting %s%x%u%o%d in l and ll flavors:

void tbprint(const char *format, ...);

Malloc

Glibc's default malloc implementation, a ptmalloc2 derivative, uses
thread-specific arenas to limit lock congestion caused by calling malloc
concurrently from multiple threads. It looks like it depends on TLS, so using it
is not the best idea. Thread Bites comes with its own evil version of
malloc. It's pathological because it's extremely prone to fragmentation, it
never shrinks the heap, and it's essentially one big critical section. It has
some undeniable advantages too: it works, it's incredibly simple, and it fits in
around 50 lines of code. Look here to find it.

The only thing worth noting in this section is that the sys_brk syscall does
not behave like glibc's brk or sbrk functions. On failure, it returns the
current location of the heap boundary instead of an error code, so the code
calls it with an obviously invalid parameter (0) to figure out what the initial
heap boundary is.
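That discovery trick fits in a one-liner (a sketch using glibc's generic syscall wrapper rather than Thread Bites' own):

```c
#include <sys/syscall.h>
#include <unistd.h>

/* sys_brk cannot honor a request for address 0, so it answers with the
 * current break, which is exactly the initial heap boundary we need. */
static void *heap_start(void)
{
  return (void *)syscall(SYS_brk, 0);
}
```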

Clone

Clone is an interesting beast. From the standpoint of this section, the
relevant thing about it is that it behaves mostly like fork, except that the
child is launched on a new stack. For this reason and to generate proper Call
Frame Information (see here and here), it needs to be implemented in
assembly. The implementation puts the user function pointer and its argument on
the child's stack so that they can be later popped and called by the child.
Then, it puts the syscall parameters in the appropriate registers and makes the
syscall. Again, all the code is on GitHub.

Creating a thread

After dealing with all this boilerplate, the actual thread creation is
relatively straightforward. First, the new thread needs a stack it can run on.
The stack could be allocated on the heap, but it's probably safer to get a
separate memory region for it. It can be done using the mmap syscall. A proper
threading library should check the system limits to figure out the size of the
stack, but Thread Bites is not proper, so it will use 8MiB as a default. All
the function parameters in the snippet below match glibc's mmap call.

It's a good idea to mark a page at the beginning of the stack as non-readable
and non-writable to protect any valid adjacent memory if there is any. Stepping
over the guard page will get us a familiar segmentation fault. Why at the
beginning and not at the end? Because the stack grows downwards, i.e. from a
high address towards a lower one. Go here for more details.

We finally have all that is needed to spawn a new kernel task sharing with the
main task everything that you would expect threads running within the same
process to share. The man page of clone discusses all of the parameters
in detail.

When the function returns, the stack memory needs to be freed (munmaped) and
the thread terminated. The clone function took some provisions for calling
sys_exit with the status returned by the wrapper. However, here we remove the
stack from underneath our feet, and we cannot risk any C function writing over it.
Therefore, we need to call sys_munmap and sys_exit in one piece of assembly.

Lessons learned

The logic involved in running threads does not seem to be horrendously
complicated, at least not yet.

Glibc is quite convoluted and figuring out what is going on and where is a
challenge. It is probably justified by the range of platforms it supports,
but the documentation could be better.

Threading is tightly coupled with glibc in the areas that I have not
initially suspected: the dynamic linker to support TLS, globals in TLS
(errno), locales, finalizers, and so on.

POSIX threads have an interesting relationship with the Linux task model.
This document describes the initial incompatibilities that have
ultimately been ironed out. It is a very interesting read if you want to
understand how the userspace and the kernelspace interact to implement a
threading system.

Ultimately, all this is not necessarily black magic. At least not so far.

Intro

Following a piece of advice from a friend, I decided to buy this new domain name
and start writing down all the cool things I do. I have written a bit in a bunch
of other places before and have other quasi-failed blogs, so I actually already
have a bit of content to bootstrap this one.

There's plenty of blogware options out there, but, as a programmer, I like the
ones that keep the content in a version control system intended for software.
From the alternatives available in this department, I decided to go for
c()λeslaw. It's kind of similar to Jekyll, which I have used before, and
it's written in Common Lisp, which is generally a good sign.

There is very little documentation on the Internet on how to use it, but it's
not hard to figure out after reading the code. This post is a brief summary of
what I have done to create this website and convert the content from Jekyll.

Site Structure

The first thing that you need to do is to create a .coleslawrc file describing
the layout of the site, the theme to be used to render the final HTML, and other
such things. There's a good example here and you can get the full picture
by reading the source. :) I like to change the separator (:separator "---"),
so that --- is used to distinguish the metadata from the content section in
source files; this makes things look the Jekyll way. The static-pages
plugin makes it possible to create content other than blog posts and indices.

Coleslaw will search the repo for files ending with .post (and .page if the
static-pages plugin is enabled) and run them through the renderer selected
in the page's metadata section. It will generate the indices automatically and
copy verbatim everything it finds in the static directory.

You can create your own theme following the rules described here or choose
something from the built-in options. I built the theme you see here more or less
from scratch using Bootstrap and the live customizer to tweak the
colors. It was a fairly easy and pleasant exercise.

In the end, the resulting directory structure looks roughly like this:

The first few lines of the post you are reading right now look like this:

---
title: Blogging with Coleslaw
date: 2015-12-07
tags: blogging, lisp, programming, linux, sbcl
format: md
---
Intro
-----
Following a piece of advice from a friend, I decided to buy this new domain name

Patches

Coleslaw and the packages it depends on work pretty well to begin with, but I
made a couple of improvements to make them fit my particular tastes better:

Some themes and plugins are site specific and cannot be generalized. There
is very little point in keeping them in the coleslaw source tree when they
really belong with the site content. I submitted patches to make it possible
to define themes and plugins in the content repo. See PR-98 and
PR-101.

I like to have the HTML files named in a certain way in the resulting web
site, so it's convenient for me to be able to specify lambdas in
.coleslawrc mapping the content metadata to file names. I made a pull
request to allow that (PR-100), but Brit, the maintainer of coleslaw,
has different ideas on how to approach this problem.

I think pygments has no real competition when it comes to coloring source
code, so I made changes to 3bmd - the markdown rendering library used by
coleslaw - allowing it to use pygments. See PR-24.

It's nice to be able to control how the rendered HTML tables look. In order
to do that, you need to be able to specify the css class for the table.
See PR-25.

Customization

3bmd makes it fairly easy to customize how the final HTML is rendered. For
instance, you can change the resulting markup for images by defining
an :around method on print-tagged-element. I want the images on this web site
to have frames and captions, so I did this:

Being able to use $config.domain and other variables in the markdown makes
it possible to define relative paths to images and other resources. This comes
handy if you want to test the web site using different locations. In order to
achieve this, you can define an :around method on render-text in the following
way:

After some investigation, it turned out that DreamHost uses grsecurity
kernel patches and, it looks like, their implementation of ASLR (Address Space
Layout Randomization) does not respect the ADDR_NO_RANDOMIZE personality that
is indeed set by sbcl at startup. They still allow the memory to be mapped at a
specific location, which is a requirement for sbcl, if the MAP_FIXED flag is
passed to mmap. The patch fixing this problem was a fairly simple one
once I figured out what's going on. It looks like it will be included in sbcl
1.3.2. Until then, you will have to recompile the sources yourself.

Let's see if we get a speedup if we compile the code. The snippets below list
the contents of col1.lisp and col2.lisp respectively:

Conclusion

Building this web site was quite an instructive experience, especially since
it was my first non-toy project done in Common Lisp. It showed me how easy it is
to use and hack on CL projects and how handy QuickLisp is. There's plenty of
good libraries around and, if they have areas in which they are lacking, it's
quite a bit of fun to fill the gaps. The library environment definitely is not
as mature as the one of Python or Ruby, so new users may find it difficult, but,
overall, I think it's worth it to spend the time getting comfortable with Common
Lisp. I finally feel emotionally prepared to go through Peter Norvig's
Paradigms of Artificial Intelligence Programming. :)

Generally, LWN runs top-quality articles. I always read them with pleasure, and
they are good enough to make me a paid subscriber. Every now and then, though,
they publish something great even by their own standards. I read
this and was amazed. I had not realized that it is this easy to create
and run a simple virtual machine. I typed the code in and played with it for a
couple of hours. You can get the file that actually compiles (C++14) and runs
here.

When writing software, I have always assumed that I could have trust in the
underlying platform. At least to some basic extent. For instance, when writing
a multi-threaded program running on Linux, it is not unreasonable to think that
the POSIX thread synchronization mechanisms are actually, you know, thread-safe.
As it turns out, that's not quite true. We learned about this fact in a
rather painful way: a heavily loaded production system crashed every now
and then. I ended up having to implement my own semaphores.

Intro

My media center box recently died a tragic death by overheating, and I decided
to replace it with a brand new Cubox-4iPro. While the hardware seems pretty
great, the software support is less than perfect, to say the least, especially
if you decide to put, say, Debian Testing on it instead of one of the images
prepared by the vendor. Based on these notes, I was able to install
and boot the system, and had quite some fun doing so. Not everything worked as
described in the notes and some tweaking was needed, so I present here
what worked for me.

Preparing the Micro SD card and bootstrapping the system

You need to create at least two partitions: a swap and a root partition.
Remember to leave some space at the beginning for the boot loader; 4MB, or 8192
sectors, should be more than enough. The following layout works fine for my
64GB card.
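The original listing is not reproduced here, but as a sketch, a two-partition layout in sfdisk's input format might look like the following. The device name and the 2 GiB swap size are assumptions to adjust for your setup; the first partition starts at sector 8192 to leave room for the boot loader:

```shell
# Hypothetical layout: 2 GiB swap, rest of the card as root.
# Replace /dev/sdX with your card's actual device.
cat > layout.sfdisk <<'EOF'
label: dos
unit: sectors

start=8192,    size=4194304, type=82
start=4202496,               type=83
EOF
sfdisk /dev/sdX < layout.sfdisk

# Create the swap area and the root filesystem.
mkswap /dev/sdX1
mkfs.ext4 /dev/sdX2
```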

Installing and setting up the boot loader

The Cubox has the Initial Program Loader (IPL) in its NVRAM, and you need to
put the Secondary Program Loader (SPL) and the primary boot loader (Das U-Boot
in our case) at the beginning of the Micro SD card.
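The exact commands depend on how your U-Boot was built; for mainline U-Boot with SPL support on i.MX6 boards like the Cubox-i, the conventional offsets are 1 KiB for the SPL and 69 KiB for u-boot.img. The file names and the device below are assumptions:

```shell
# Write the SPL and U-Boot to the raw card, in the gap before the
# first partition. Replace /dev/sdX with your card's device.
dd if=SPL        of=/dev/sdX bs=1K seek=1  conv=fsync
dd if=u-boot.img of=/dev/sdX bs=1K seek=69 conv=fsync
```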

You will also need an appropriate DTB file and a U-Boot environment. The DTB
(Device Tree Blob) is a database that describes the hardware components in
the system; it is provided by the kernel package. You can get the U-Boot
environment by slightly massaging the one provided by the flash-kernel
package. Finally, you will need to make an environment image using the mkimage
command (from the u-boot-tools package).
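As an illustration, wrapping a plain-text boot script (here boot.cmd, adapted from the flash-kernel one) into the image format U-Boot expects might look like this; the file names and the image description are assumptions:

```shell
# boot.cmd is the massaged flash-kernel environment; boot.scr is the
# wrapped image that U-Boot will actually load.
mkimage -A arm -O linux -T script -C none \
        -n "Cubox-i boot script" -d boot.cmd boot.scr
```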

The system does not have a real-time clock, but you can cheat and preserve the
time, at least between reboots, with good enough accuracy. To do this you can,
for instance, set the current time to the last modification time of
/var/log/syslog as early in the boot sequence as possible. Download
this script, put it in /mnt/tmp/etc/init.d, and then run:
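The script name below is a placeholder for whatever you called the downloaded file. Assuming the target root is still mounted at /mnt/tmp, enabling it might look like this:

```shell
# Make the init script executable and register it in the boot sequence.
chmod +x /mnt/tmp/etc/init.d/restore-time
chroot /mnt/tmp update-rc.d restore-time defaults
```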

Serial console

As of this writing, the HDMI output does not really work out of the box, so
you will need to access the serial console to boot into the system and make
it accessible over the network. You can use a Micro USB cable and screen
for this purpose, like this:

]==> screen /dev/ttyUSB0 115200

Booting

You are now ready to boot the box. Insert the Micro SD card, attach the power
cable, and watch the system boot in your screen session. When it starts the
boot countdown, stop it by pressing Enter and type the following:

setenv mmcpart 2
saveenv
boot

Doing this changes the boot partition to 2 and saves the environment on the
card, so you won't have to redo this every time you reboot the system.

Post-boot settings

Now you need to configure the network, set up time zones, locale settings,
and the keyboard layout, and install some useful packages, like
network-manager and openssh-server.
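For instance, assuming the network is already up, something along these lines:

```shell
# Pull in networking and remote access, then walk through the usual
# Debian configuration dialogs.
apt-get update
apt-get install network-manager openssh-server
dpkg-reconfigure tzdata locales keyboard-configuration
```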

The ASUS Transformer Pad Infinity (TF700T) is a really lovely piece of
equipment, but, from the perspective of a Linux geek, the lack of access to
certain command-line utilities through a nice-looking and functional terminal
emulator seriously limits its usefulness. I want zsh, for heaven's sake! :) And
ssh, and git, python, midnight commander, ImageMagick, and others. To install
and use these comfortably, I need root access to the device that I own, after
all! And I am denied it. O tempora, o mores!

There's a certain "workaround" to this problem over at
xda-developers.com leveraging the fact that the block
device holding the system partition is mounted read-only and, despite
appearing to be access-protected, is actually writable. This could potentially
enable a dissatisfied owner to use one of the e2fsprogs
to plant su, with its setuid bit set right, and finally enjoy his
property somewhat more. :) It looks like no Windows installation is actually
needed, just a functional adb and the binaries.
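The rough idea, with all device paths and file names being assumptions, is to push a statically linked debugfs over adb, point it at the raw system partition, and plant a setuid su binary on the filesystem image directly:

```shell
# Push the binaries and open the raw device read-write with debugfs.
# /dev/block/mmcblk0p1 is an illustrative device path.
adb push su /data/local/tmp/su
adb push debugfs /data/local/tmp/debugfs
adb shell /data/local/tmp/debugfs -w /dev/block/mmcblk0p1
# At the debugfs prompt: copy su in and mark it setuid root.
#   write /data/local/tmp/su xbin/su
#   set_inode_field xbin/su mode 0104755
#   set_inode_field xbin/su uid 0
#   set_inode_field xbin/su gid 0
#   quit
```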