ARM processors support various performance monitoring registers, the most basic being a cycle count register. Here is how to make use of it on the Raspberry Pi 3 with its ARM Cortex-A53 processor. The A53 implements the ARMv8 architecture, which can operate in both 64- and 32-bit modes; the Pi 3 uses the 32-bit AArch32 mode, which is more or less backwards compatible with the ARMv7-A architecture, as implemented for example by the Cortex-A7 (used in the early Pi 2s) and the Cortex-A8. I hope I’ve got that right – all these names are confusing.

The performance counters are made available through coprocessor registers and the mrc and mcr instructions, with the precise registers used depending on the particular architecture.

By default, use of these instructions is only possible in “privileged” mode, i.e. from the kernel, so the first thing we need to do is enable register access from userspace. This can be done through a simple kernel module that can also set up the cycle counter parameters we need (we could do that from userspace once the module has enabled access, but it’s simpler to do everything at once).

To compile a kernel module, you need a set of header files compatible with the kernel you are running. Fortunately, if you have installed a kernel with the raspberrypi-kernel package, the corresponding headers should be in raspberrypi-kernel-headers – if you have used rpi-update, you may need to do something else to get the right headers, and of course if you have built your own kernel, you should use the headers from there. So:
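On Raspbian, that means sudo apt-get install raspberrypi-kernel-headers, and then a module along these lines – this is only a sketch, using the AArch32 PMU register encodings (PMUSERENR for user access, PMCR and PMCNTENSET to turn the cycle counter on), so check the ARM reference manual for your core before trusting the details:

```c
#include <linux/module.h>
#include <linux/init.h>

static void enable_ccnt(void *info)
{
    /* PMUSERENR: allow user-mode access to the performance counters */
    asm volatile("mcr p15, 0, %0, c9, c14, 0" :: "r"(1));
    /* PMCR: E (enable counters), P and C (reset event/cycle counters) */
    asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(7));
    /* PMCNTENSET: bit 31 enables the cycle counter */
    asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31));
}

static int __init ccnt_init(void)
{
    /* the counters are per-core, so poke every CPU */
    on_each_cpu(enable_ccnt, NULL, 1);
    return 0;
}

static void __exit ccnt_exit(void) { }

module_init(ccnt_init);
module_exit(ccnt_exit);
MODULE_LICENSE("GPL");
```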

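With the module loaded, the measurement was something like this (ARM-only, obviously – the mrc reads PMCCNTR, the cycle count register):

```c
#include <stdio.h>

static inline unsigned ccnt_read(void)
{
    unsigned cc;
    /* PMCCNTR: the cycle count register */
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cc));
    return cc;
}

int main(void)
{
    unsigned before = ccnt_read();
    for (volatile int i = 0; i < 100000000; i++) /* spin */;
    unsigned after = ccnt_read();
    printf("%u cycles\n", after - before);
    return 0;
}
```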
Looks like we can count a single cycle, and since the Pi 3 has a 1.2GHz clock, the loop time looks about right (the clock seems to be scaled down when the processor is idle, so we don’t necessarily get a full 1.2 billion cycles per second – for example, if we replace the loop above with a sleep).

After the topical excitement of the last couple of posts, let’s look at an all-time great – Leslie Lamport’s Bakery Algorithm (and of course this is still topical; Lamport is the most recent winner of the Turing Award).

The problem is mutual exclusion without mutual exclusion primitives. Usually it’s described in the context of a shared memory system (and that is what we will implement here), but it will work equally well in a message-passing system with only local state (each thread or process only needs to write to its own part of the store).

For further details, and Lamport’s later thoughts, see http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#bakery: “For a couple of years after my discovery of the bakery algorithm, everything I learned about concurrency came from studying it.” – and since Lamport understands more about concurrency than just about anyone on the planet, it’s maybe worth spending some time looking at it ourselves.

I’m not going to attempt to prove the algorithm correct – I’ll leave that to Lamport – but the crucial idea seems to me to be that a thread reading a particular value from another thread is a synchronization signal from that thread: here, reading a false value for the entering variable is a signal that the other thread isn’t in the process of deciding on its own number, so it is safe for the reading thread to proceed.

Implementing this on a real multiprocessor system, we find that memory barriers or synchronization primitives are essential – the algorithm requires that reads and writes are serialized, in the sense that once a value is written, other processes won’t see an earlier value (or earlier values of other variables). This doesn’t conflict with what Lamport says about not requiring low-level atomicity – we can allow reads and writes to happen simultaneously, with the possibility of a read returning a bogus value, and in fact we can simulate this in the program by writing a random value just before a process selects its real ticket number – but once a write has completed, all processes should see the new value.

Another essential feature is volatile – as many have pointed out, it isn’t enough by itself for correct thread synchronization, but for shared memory systems it prevents the compiler from making invalid assumptions about the consistency of reads from shared variables.

A final point – correctness requires that ticket numbers can increase without bound. This is hard to arrange in practice, so we just assert if they grow too large (which rarely happens in reality, unless we get carried away with our randomization).
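Here is a minimal pthreads sketch of the algorithm as described – the thread count, iteration count, and the use of __sync_synchronize() as a combined compiler/hardware barrier are my choices for illustration, not part of the algorithm:

```c
#include <pthread.h>

#define NTHREADS 4
#define ITERS 5000

static volatile int entering[NTHREADS];
static volatile int number[NTHREADS];
static int counter;

static void bakery_lock(int i)
{
    entering[i] = 1;
    __sync_synchronize();
    int max = 0;
    for (int j = 0; j < NTHREADS; j++)
        if (number[j] > max) max = number[j];
    number[i] = max + 1;        /* non-atomic ticket choice is allowed */
    __sync_synchronize();
    entering[i] = 0;
    __sync_synchronize();
    for (int j = 0; j < NTHREADS; j++) {
        if (j == i) continue;
        while (entering[j])     /* wait while j is choosing a number */
            __sync_synchronize();
        while (number[j] != 0 &&
               (number[j] < number[i] ||
                (number[j] == number[i] && j < i)))
            __sync_synchronize();
    }
    __sync_synchronize();
}

static void bakery_unlock(int i)
{
    __sync_synchronize();
    number[i] = 0;
}

static void *worker(void *arg)
{
    int i = (int)(long)arg;
    for (int k = 0; k < ITERS; k++) {
        bakery_lock(i);
        counter++;              /* the critical section */
        bakery_unlock(i);
    }
    return NULL;
}

int bakery_run(void)
{
    pthread_t t[NTHREADS];
    counter = 0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

bakery_run() should always return NTHREADS * ITERS; remove the barriers and, on a multicore machine, it usually won’t.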

Apart from a brief visit many years ago to Macclesfield church with my friend Jo, I have never taken part in the ancient English art of bell-ringing, but permutation generation has always seemed a fascinating topic.

In TAOCP 7.2.1.2, Knuth has Algorithm P (“plain changes”) for generating all permutations using adjacent interchanges:
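Roughly (from memory, so check TAOCP for the authoritative version – a is 1-indexed, c[j] is the inversion count of element j and o[j] its direction):

```
P1. [Initialize.] Set c[j] <- 0 and o[j] <- 1 for 1 <= j <= n.
P2. [Visit.] Visit the permutation a[1]...a[n].
P3. [Prepare.] Set j <- n, s <- 0.
P4. [Ready to change?] Set q <- c[j] + o[j]. If q < 0 go to P7;
    if q = j go to P6.
P5. [Change.] Interchange a[j-c[j]+s] and a[j-q+s], set c[j] <- q,
    and go to P2.
P6. [Increase s?] If j = 1, terminate; otherwise set s <- s + 1.
P7. [Switch direction.] Set o[j] <- -o[j], j <- j - 1, and go to P4.
```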

but somehow that seems to miss the point (and I don’t think that the worst thing about that code is that it still uses modifiable state).

There are actually two component parts to the algorithm: generating a “mixed radix reflected Gray code” which defines the progressive inversions of the array elements, and going from the inversion changes to the actual indexes of the swapped objects.

For generating the Gray codes, with a bit of rewriting, 0-basing our arrays etc. we get:
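A sketch in C – c[j] runs from 0 to j, so the radices are 1, 2, …, n and we get n! codes in all (the array bounds assume n <= 16):

```c
/* Generate the mixed-radix reflected Gray code of inversion counts:
 * each step changes exactly one c[j], and only by +1 or -1.
 * Returns the number of codes visited (n! of them). */
long gray_inversions(int n, void (*visit)(const int *, int))
{
    int c[16], d[16];
    for (int j = 0; j < n; j++) { c[j] = 0; d[j] = 1; }
    long count = 0;
    for (;;) {
        if (visit) visit(c, n);
        count++;
        /* Find the highest element whose count can move in its current
         * direction; reverse the direction of any that can't. */
        int j = n - 1;
        for (;;) {
            int q = c[j] + d[j];
            if (q >= 0 && q <= j) { c[j] = q; break; }
            d[j] = -d[j];
            if (j-- == 0) return count;   /* nothing can move: done */
        }
    }
}
```

For n = 3 this visits the six inversion tables 000, 001, 002, 012, 011, 010.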

Each time around, find the highest element whose inversion count can be changed in the desired direction. For elements whose inversion count can’t be changed, change direction. If no element can be changed we are done.

The next step is to calculate the position of element j: this is just j, less the number of elements less than j that appear after j (i.e. the number of inversions, c[j]), plus the number of elements greater than j that appear before j – and that is just the number of elements we have passed over in the “if (q == j+1)” step above. So we can now add in the rest of algorithm P:
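Putting the two parts together (a sketch: the swap positions come straight from the formula above, with s accumulated during the scan; arrays again assume n <= 16):

```c
/* Plain changes: visit every permutation of 0..n-1, one adjacent swap
 * at a time. c[j] is the inversion count of element j, d[j] its
 * direction, and s counts the elements passed over at the top of
 * their range. Returns the number of permutations visited. */
long plain_changes(int n, void (*visit)(const int *, int))
{
    int a[16], c[16], d[16];
    long count = 0;
    for (int i = 0; i < n; i++) { a[i] = i; c[i] = 0; d[i] = 1; }
    for (;;) {
        if (visit) visit(a, n);
        count++;
        int j = n - 1, s = 0, q;
        for (;;) {
            q = c[j] + d[j];
            if (q >= 0 && q <= j) break;
            if (q == j + 1) {              /* passed over at the top */
                if (j == 0) return count;  /* everything stuck: done */
                s++;
            }
            d[j] = -d[j];
            j--;
        }
        /* element j moves between positions j-c[j]+s and j-q+s */
        int x = j - c[j] + s, y = j - q + s;
        int t = a[x]; a[x] = a[y]; a[y] = t;
        c[j] = q;
    }
}
```

For n = 3 this visits 012, 021, 201, 210, 120, 102 – the plain changes order.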

The key fact is that there is a 1-1 correspondence between inversion tables and permutations – so a Gray enumeration of inversion tables gives us a sequence of permutations where only one element at a time changes its inversion count, and only by +1 or -1, which can only happen if we exchange adjacent elements.

Knuth also gives a “loopless” algorithm for generating reflected Gray sequences, and we could use a table of element positions to construct the permutations from this:
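Something like this, say – the focus pointers f are the standard loopless trick, and the position table p turns each digit change into an adjacent swap (a change of +1 in c[j] moves element j one place left, -1 one place right); digit i drives element n-1-i so that the largest element still moves fastest:

```c
/* Loopless plain changes: focus pointers pick the next digit to change
 * in O(1); the position table applies the change as an adjacent swap.
 * Arrays assume n <= 16. Returns the number of permutations visited. */
long plain_changes_loopless(int n, void (*visit)(const int *, int))
{
    int a[16], p[16], c[16], d[16], f[17];
    for (int k = 0; k < n; k++) { a[k] = k; p[k] = k; c[k] = 0; d[k] = 1; }
    for (int i = 0; i < n; i++) f[i] = i;
    long count = 0;
    for (;;) {
        if (visit) visit(a, n);
        count++;
        int i = f[0]; f[0] = 0;
        if (i == n - 1) return count;   /* sentinel reached: done */
        int j = n - 1 - i;              /* element whose count changes */
        c[j] += d[j];
        /* d[j] == 1: element j moves left; d[j] == -1: moves right */
        int x = p[j], y = x - d[j];
        int other = a[y];
        a[x] = other; p[other] = x;
        a[y] = j;     p[j] = y;
        if (c[j] == 0 || c[j] == j) {   /* hit the end of its range */
            d[j] = -d[j];
            f[i] = f[i + 1];
            f[i + 1] = i + 1;
        }
    }
}
```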

The “classic” Johnson-Trotter algorithm for plain changes involves consideration of “mobile” elements: each element has a current direction and it is “mobile” if it is greater than the next adjacent element (if there is one) in that direction. The algorithm proceeds by finding the highest mobile element and moving it accordingly:
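A straightforward implementation, with the direction stored alongside each element and all elements initially moving left (arrays assume n <= 16):

```c
/* Johnson-Trotter: repeatedly move the largest mobile element in its
 * direction, then reverse the direction of everything larger than it.
 * Returns the number of permutations visited. */
long johnson_trotter(int n, void (*visit)(const int *, int))
{
    int a[16], d[16];
    for (int i = 0; i < n; i++) { a[i] = i; d[i] = -1; }
    long count = 0;
    for (;;) {
        if (visit) visit(a, n);
        count++;
        /* scan values high to low and stop at the first mobile one */
        int pos = -1;
        for (int v = n - 1; v >= 1 && pos < 0; v--) {
            for (int i = 0; i < n; i++) {
                if (a[i] == v) {
                    int t = i + d[i];
                    if (t >= 0 && t < n && a[t] < v) pos = i;
                    break;
                }
            }
        }
        if (pos < 0) return count;      /* nothing mobile: done */
        int v = a[pos], t = pos + d[pos];
        int tv = a[t], td = d[t];       /* swap element and direction */
        a[t] = v;    d[t] = d[pos];
        a[pos] = tv; d[pos] = td;
        for (int i = 0; i < n; i++)     /* larger elements turn around */
            if (a[i] > v) d[i] = -d[i];
    }
}
```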

Note that we can check elements from high to low directly & stop as soon as a mobile element is found – there is no need to check every element.

Johnson-Trotter and Algorithm P are essentially doing the same thing: when moving the mobile element, we are changing its inversion count in the same way as we do directly in P.

Performance-wise, the two algorithms are very similar though (somewhat surprisingly to me) the Johnson-Trotter version seems a little faster.

Either way, my laptop generates the 3628800 permutations of 10 elements in 0.03 seconds, outputting them (even to /dev/null) takes 4.5 seconds. It takes about 3 seconds to run through the 479001600 permutations of 12 elements (I haven’t tried outputting them). We can streamline the computation further by taking advantage of the fact that most of the time we are just moving the highest element in the same direction, but that can wait for another day.

Finally, here is a function that gives the next permutation in plain changes without maintaining any state:
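A sketch of that function – everything is recomputed from the permutation itself: the inversion table (the O(n²) part), then each count’s direction from the parity of the slower-moving counts, then the move that the stateful version would make:

```c
#include <stdbool.h>

/* Advance a[0..n-1] (n <= 16) to its successor in plain changes,
 * using no state other than the permutation itself.
 * Returns false when a is the last permutation in the sequence. */
bool next_plain_change(int *a, int n)
{
    int c[16], pos[16], d[16];
    for (int i = 0; i < n; i++) pos[a[i]] = i;
    for (int v = 0; v < n; v++) {       /* rebuild the inversion table */
        int cnt = 0;
        for (int i = pos[v] + 1; i < n; i++)
            if (a[i] < v) cnt++;
        c[v] = cnt;
    }
    /* reflected Gray: direction of c[j] depends on the parity of the
     * slower-moving counts c[1..j-1] */
    int sum = 0;
    for (int j = 1; j < n; j++) { d[j] = (sum & 1) ? -1 : 1; sum += c[j]; }
    int s = 0;
    for (int j = n - 1; j >= 1; j--) {
        int q = c[j] + d[j];
        if (q >= 0 && q <= j) {         /* the move Algorithm P would make */
            int x = j - c[j] + s, y = j - q + s;
            int t = a[x]; a[x] = a[y]; a[y] = t;
            return true;
        }
        if (q == j + 1) s++;            /* passed over at the top */
    }
    return false;                       /* end of the sequence */
}
```

Starting from the identity and calling this repeatedly steps through the same sequence as the stateful versions, returning false after the last permutation.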

We use the Knuth/Dijkstra trick to avoid building a position table. I suspect that the n-squared inversion counting can be improved, but we still come in at under a second for all permutations of 10, though that’s about 30 times slower than the optimized version.

Well, OK, it has unrolled the loop and vectorized it with SSE instructions, though a more precise analysis will have to wait for a future post. Impressive optimization from GCC though, and pretty groovy considering the biggest factorial we can fit in a 32-bit int is fact(12)!

Like, I suspect, many programmers, there are many software tools I have used on a regular basis for many years but remain woefully ignorant of their inner workings and true potential. One of these is the Unix command line.

It’s common, for example, to want the error output of a program to appear as the normal output, and trial and error, Googling, or a look at Stack Overflow leads to:

strace ls 2>&1 >/dev/null

which works fine but seems puzzling – we’ve told the program to send error output to normal output, then normal output to /dev/null, so why doesn’t that discard everything, as in:

strace ls >/dev/null 2>&1

This is because we don’t understand what is going on.

A redirection is actually a call to the dup2 system call. From the man page:

dup2() makes newfd be the copy of oldfd, closing newfd first if necessary.

So: n>&m does a dup2(m,n): close fd n if necessary, then make n be a copy of fd m; and n>file means: open file on some free fd m, do dup2(m,n), then close m.

Now it all makes sense:

strace ls 2>&1 1>/dev/null

first of all makes 2 be a copy of 1, then changes 1 to point to /dev/null – the copying is done ‘by value’ as it were (despite the confusing, for C++ programmers anyway, use of ‘&’).
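We can see the ‘by value’ behaviour directly; sh -c here just gives us a command that writes something to both streams:

```shell
# fd 2 becomes a copy of what fd 1 currently is; pointing fd 1 at
# /dev/null afterwards doesn't affect the copy, so only the error
# output survives.
sh -c 'echo out; echo err >&2' 2>&1 >/dev/null
```

This prints just “err”; with the redirections the other way round, nothing at all.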

Using strace here is not an accident, but used like this it doesn’t tell us much: redirection is handled by the shell, not by the program it runs, so we need to do something like this for further insight:

While we are in this part of town, let’s talk briefly about named pipes, another feature that has been around for ever but doesn’t seem to get used as much as it deserves. We can run into problems though:

Suppose I want to capture an HTTP response from a web server. I can do this:

This doesn’t seem right – netcat should be listening on port 8888, I told it so myself! Checking with netstat shows no listener on 8888, and finally ps aux shows no sign of any netcat process – what is going on?

Once again, strace helps us see the true reality of things:

$ kill %1
$ strace netcat -l 8888 < f1 > f2

But strace tells us nothing – there is no output! Like the dog that didn’t bark in the night though, this is an important clue, and widening our area of investigation, we find:

The shell is stalled trying to open the “f1” fifo, before it even gets around to starting the netcat program, which is why the first strace didn’t show anything. What we have forgotten is that opening a pipe blocks until some process has the other end open (it doesn’t have to be actively reading or writing, it just has to be there). The shell handles redirections in the order they appear, so since our 3 processes all open their read fifo first, none of them has got around to opening its write fifo – we have deadlock: in fact, the classic dining philosophers problem, and a simple solution is for one philosopher to pick up the forks in a different order:
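A sketch of the idea with cat standing in for the real processes (the fifo names are just for illustration):

```shell
mkfifo f1 f2
# 'cat <f1 >f2 &' together with 'cat <f2 >f1' would deadlock: both
# shells open their read fifo first and block in open() before either
# cat even starts. Arranging for the opens to pair up fixes it:
cat <f1 >f2 &        # opens f1 for read (blocks), then f2 for write
echo hello >f1 &     # opens f1 for write, releasing the first cat
cat <f2              # opens f2 for read, completing the chain
rm f1 f2
```

The last cat prints “hello”: each open now has a partner, so everybody gets started.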

TUN devices are much used for virtualization, VPNs, network testing programs, etc. A TUN device is essentially a network interface that also exists as a userspace file descriptor: data sent to the interface can be read from the file descriptor, and data written to the file descriptor emerges from the network interface.

Here’s a simple example of their use. We create a TUN device that simulates an entire network, with traffic to each network address just routed back to the original host.

We want a TUN device (rather than TAP, essentially the same thing but at the ethernet level) and we don’t want packet information at the moment. We copy the name of the allocated device to the char array given as a parameter.

Now all our program needs to do is create the TUN device and sit in a loop copying packets:

A recent discussion at work about the best way to write a simple string checking function set me thinking…

Essentially, the problem is to recognize strings conforming to some regular expression, but with some extra conditions that either can’t be expressed as a regexp or are inconvenient to express that way, so we can’t just plug it into PCRE or whatever RE library is flavour of the month.

In this case, we wanted strings conforming to:

[0-9a-zA-Z*][-0-9a-zA-Z*]*(\.[0-9a-zA-Z*][-0-9a-zA-Z*]*)*

i.e. a path containing “.”-separated components, possibly with “*” wildcards, but also with the conditions that each component has at most one wildcard and that there is a limit on the length of each component.

Ignoring the extra conditions, we can define a suitable regular grammar:
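Something like this, written as a right-linear grammar (my reconstruction – alnum stands for [0-9a-zA-Z]):

```
S -> alnum R | '*' R
R -> alnum R | '-' R | '*' R | '.' S | empty
```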

and this naturally suggests a recursive descent recognizer, though in this case, there is no actual descent as the grammar is regular and all the recursions are tail calls.

Happily, these days compilers like GCC will compile tail calls to jumps (at least on higher optimization settings), just like ML compilers, for example, have always had to, and we can hope for reasonably good code from such an approach. Of course, once the tail calls have been optimized, we have a finite state recognizer with the state transitions implemented as gotos!

The extra conditions can now easily be added to the state transition functions, with any variables needed just appearing as extra parameters (and because we have to specify the parameter values for every recursive call/state transition, we are less likely to forget to set something), so we end up with something like:
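For instance (a sketch – MAXCOMP and the function names are mine; each function is a state of the recognizer, and the length and wildcard counts just ride along as parameters on the tail calls):

```c
#include <ctype.h>
#include <stdbool.h>

/* arbitrary per-component length limit, for illustration */
enum { MAXCOMP = 63 };

static bool comp_rest(const char *s, int len, int stars);

/* start of a component: exactly one alnum or '*' */
static bool comp_start(const char *s)
{
    if (isalnum((unsigned char)*s)) return comp_rest(s + 1, 1, 0);
    if (*s == '*') return comp_rest(s + 1, 1, 1);
    return false;
}

/* inside a component: the extra conditions are just guards on
 * the tail calls */
static bool comp_rest(const char *s, int len, int stars)
{
    if (*s == '\0') return true;
    if (*s == '.') return comp_start(s + 1);
    if (len >= MAXCOMP) return false;
    if (*s == '-' || isalnum((unsigned char)*s))
        return comp_rest(s + 1, len + 1, stars);
    if (*s == '*')
        return stars == 0 && comp_rest(s + 1, len + 1, 1);
    return false;
}

bool check_path(const char *s) { return comp_start(s); }
```

With -O2, the mutual tail calls should become jumps, as described above.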

It looks like the tail calls have indeed been compiled as jumps, so the stack won’t blow up. There is what looks like some unnecessary movement of function arguments between the stack and the function argument registers, but we’re not doing too badly. Probably not quite as fast as a hand-coded goto-based version, but only a little slower, almost certainly a price worth paying for the gain in clarity (in fact, most of the time seems to be spent in the calls to isalnum).