On programming and programming languages: because there has got to be a better way.

01 March 2007

Haskell for the Short Attention Span: Run Length Encoding, Part 1

I'm a fortunate man. Some 50% of people who suffer herpes zoster infections involving the eye wind up with permanent eye damage. I seem to have escaped with little or no permanent damage. The nerves are not completely healed, and there is still visible damage under the skin, but it is no longer painful to work at the computer or read!

It's time to jump back on the horse and risk looking foolish in public by writing some code. I still consider myself to be close to a rank beginner in Haskell, so you're reading my "finger exercises." To experienced functional programmers they will probably sound like a new violin student sawing away. However, since I haven't been a student for almost twenty years and there is no one to assign me homework but myself, this seems to be the next best way to force myself to learn something!

With three children at home, two of them babies, I don't get a lot of unbroken free time; I've got a guaranteed short attention span. Consequently, I'm going to look at some very small programs. However, rather than make up toy programs from scratch, I am now going to turn to a few real-world problems that I've stumbled across while doing embedded programming.

OK. Consider a spherical cow of uniform density... wait, I mean, consider a black-box system with an outward-facing interface in which a number of functions are made available. Functions are accessed via a 12-bit unique ID. A client of that system doesn't know which functions are available. The client needs a method to retrieve a list of function IDs. Let's assume we don't have bandwidth to burn, so we don't want to send a complete list of all the function IDs. Let's further assume that the functions are organized into a small number of groups with consecutive IDs. In other words, this is data that is very amenable to a simple compression scheme.

We're going to do it using Run-Length Encoding (RLE). Here is a brief informal explanation of RLE; if you already know all about RLE, skip the next paragraph (but read the ones after that, please!)

RLE is a commonly used technique for encoding bitmaps; a variant of RLE is even used in fax machines. Rather than store every bit of the bitmap individually, RLE records _runs_ of repeating data. If you have a raster image that you're scanning left to right, top to bottom, and that image has a lot of horizontal lines in it, or black and white rectangles, this might be a big win: the RLE encoding will record long alternating runs of ones and zeroes. If the image has a lot of vertical lines, this might not be much of a win. The worst-case scenario is a 50% checkerboard: in that case, we'll have to indicate runs of length 1, and the overhead will be far worse than just encoding the bits. For that kind of image you'd be better off using a means of compression that actually adapts to the frequency of occurrence of particular patterns in the data, such as Huffman coding or LZW.

Now, there are several possible ways to represent the runs of RLE data. One scheme might use tuples consisting of the bit value and the number of occurrences of that value (the "run length"). For example, the tuples [(0, 14), (1, 28), (0, 3)] could represent a run of fifteen zeroes, followed by a run of twenty-nine ones, followed by four zeroes. Why are the values off by one? Because we would never need to encode a run of length zero, so we allow zero to represent one, and so on. A tuple like this could be encoded in a single byte, where the high-order bit represents the value and the remaining seven bits represent the run length - 1. This gives us the ability to represent runs of up to 128 bits. Of course, if we have a run of length 129 or greater, we'll need to represent it with multiple bytes. For degenerate cases where we encode very long runs, we'll have unnecessary redundancy.
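As a sketch, the byte packing described above might look like this in Haskell (encodeRun and decodeRun are hypothetical helper names, not part of any protocol here):

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))

-- Pack a (bit value, run length) pair into one byte: the high bit
-- holds the value, the low seven bits hold run length - 1, so runs
-- of 1..128 bits fit in a single byte.
encodeRun :: (Int, Int) -> Int
encodeRun (v, len) = (v `shiftL` 7) .|. (len - 1)

-- Unpack a byte back into a (bit value, run length) pair.
decodeRun :: Int -> (Int, Int)
decodeRun b = (b `shiftR` 7, (b .&. 0x7F) + 1)
```

A run of twenty-nine ones, for instance, round-trips as decodeRun (encodeRun (1, 29)) == (1, 29).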

For that low-entropy bitmap, a better scheme might be to just record the points at which the data _changes_. We have to choose an initial state for decoding: should the decoder, prior to the first change point, produce ones, or zeroes? Once we know that, we don't need to encode the bit values at all. If we assume that our encoded data begins with a run of zeroes, we could represent the example above (fifteen zeroes, followed by twenty-nine ones, followed by four zeroes) by recording only [15, 44]. Let's say that we want our decoder to spit out the indices of the one bits. We'll get [15..43]: the twenty-nine indices of the ones.

Using this scheme, a bitmap of all zeroes would be represented by an empty list, while a bitmap of all ones would be represented by a single [0]; it turns on the "ones" state and never turns it off, so we just generate ones until our range of values is exhausted. (This implies that, in addition to knowing what starting state it should assume, the decoder has to know the size of the bitmap we are representing). A bitmap consisting of all zeroes followed by a single one would be represented by [N - 1] where N is the number of bits in the bitmap.

If, instead of assuming that our data starts with an implicit run of zeroes, we assume that it starts with an implicit run of ones, it should be obvious that we could encode the same example using [0, 15, 44]; in this case, a bitmap of all zeroes would be represented by [0], which would yield an empty list of one-bit indices, while a bitmap of all ones would be represented by an empty list, yielding a list [0..n - 1] of bit indices.

We'll assume that our function IDs are represented this way: they are the one bits in the bitmap, and we want a list of their indices. When we query our interface about its function IDs, we get back a response that consists of a short list of change points. Here are the data bytes:

The first time I encountered data like this, I wrote myself a little bit of Ruby code to decode it. Here's the first part of the Ruby code, which turns these bytes into a series of 12-bit unsigned numbers by assembling three nybbles (half-bytes) at a time:
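A sketch reconstructed from the walkthrough that follows; the hex string here is a stand-in sample built from the first few change points shown below, since the real input data is much longer:

```ruby
# The original read its input with `str = gets.chomp.delete(' ')`;
# a stand-in sample string takes the place of standard input here.
str = "003 090 093 101 103".chomp.delete(' ')

# Assemble each group of three hex digits (three nybbles) into a
# 12-bit integer, appending it to the list.
list = []
0.step(str.length - 2, 3) do |nybble_idx|
  list.push(str[nybble_idx, 3].hex)
end
```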

Here's how it works. The first line reads data from standard input and, reading left to right, removes any end-of-line character and deletes spaces. We next create an empty list. "0.step" is a cute little bit of idiomatic Ruby shorthand: everything is an object, including integers, and they have a "step" method, so by calling 0.step we loop from zero to the string length - 2, incrementing by threes. We also provide a Ruby block, which is basically a closure that receives one parameter, the loop index, and which is bound to the name nybble_idx. Working inside-out, the body of this closure takes three-character substrings of the raw data, passes them to the hex function to yield an integer, and then uses push, which treats the list like a stack and appends the new integer to the end. The result looks like this:

[0x003, 0x090, 0x093, 0x101, 0x103 ...]

This is the list of change points. Once I have that, I can generate the runs of data. I'll assume that our data starts with an initial run of ones:
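The run-generating step might have looked roughly like this toggling-state loop (one_indices is an assumed name; note that it makes no attempt to close a final open-ended run, which is the kind of termination gap mentioned below):

```ruby
# Walk the change points, toggling between "in a run of ones" and
# "in a run of zeroes", collecting the indices of the one bits.
# We start in the "ones" state, per the initial-run-of-ones assumption.
def one_indices(change_points)
  ones = []
  in_ones = true
  prev = 0
  change_points.each do |cp|
    ones.concat((prev...cp).to_a) if in_ones
    in_ones = !in_ones
    prev = cp
  end
  ones # a final still-open run of ones is simply dropped here
end
```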

Now, I should mention that I like Ruby, and especially like its support for nice idioms like blocks and closures and iterators and chaining functions. But while this Ruby code worked well enough for my little task, you might notice that it doesn't actually handle all the termination cases properly. As I work with Haskell, I'm finding that I like it quite a bit more than Ruby. One of the reasons is that Haskell code, being less imperative, seems closer to a kind of abstract, platonic representation of the problem space. Problems such as the failure to properly handle various termination cases tend to look much more obvious, particularly when functions are broken down into pieces using pattern matching.

(My first opportunity to look silly: I'm sure there's a way to break a string across multiple lines with some sort of continuation character, but I could not get "\" to work together with Literate Haskell mode, so I'll just concatenate them and hope no one notices...)

> rle_bytes_str = filter isHexDigit rle_raw_str

That was easy!

Now, my second opportunity to look silly: I'm sure there is a very clever one-line fold that applies "take 3" in a mind-expanding manner, but I was unable to find it. This was about as clever as I could manage, but it does have the advantage of checking for unexpected data length:
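A sketch consistent with what the commenters below are reacting to (hexToInt is an assumed helper name for turning each chunk into an Int):

```haskell
import Data.Char (digitToInt)

-- Break the hex-digit string into three-character chunks, complaining
-- if the length isn't a multiple of three.  (As a commenter notes
-- below, the repeated length calls make this quadratic.)
triplify :: String -> [String]
triplify s
  | length s == 0 = []
  | length s >= 3 = take 3 s : triplify (drop 3 s)
  | otherwise     = error "triplify: unexpected data length"

-- Assemble a chunk of hex digits into an integer.
hexToInt :: String -> Int
hexToInt = foldl (\acc d -> acc * 16 + digitToInt d) 0
```

With these, something like rle_change_points = map hexToInt (triplify rle_bytes_str) yields the list of change points.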

Our playing field is the range of integers 0x000..0xFFF, or 0..4095 in decimal. We want to lazily generate a list of valid values using our RLE decoding scheme. Rather than walk the list of change points spitting out values using some kind of toggling state variable to keep track of whether the decoder is processing a run of ones or a run of zeroes, or walking all the possible values and filtering them by the list of change points, we'll use two functions: one for extracting ranges, and one for generating values from ranges. Both functions receive a list representing "the rest of the work to do" in the form of the remainder of the list of change points, and so this is a _little_ bit like using continuation-passing style, or using a Monad to maintain the progress of the calculation. Maybe in Part 2 I'll have learned enough to explain what I mean by that!

Our first function, getRangeFromChangePoints, has to handle some interesting cases. Assuming we are decoding RLE in the range of possible values [X..Y], what happens when our list contains a final value, N, that opens a final range but doesn't explicitly close it? Well, there are two approaches we could take. The more conservative approach is to assume that this was a harmless mistake and close the range [N..N + 1) so that it contains only the value N. (Note that I'm using interval notation in which a square bracket indicates that the range _includes_ the bound specified, and a parenthesis _excludes_ the bound specified). However, there is a problem with this approach: it means we can't generate a range that includes the maximum value, Y! That's no good, so we'll assume that a final range opening with N means that we really want [N..Y] (inclusive). In fact, since we treat ranges as exclusive at the end, this is actually the only way we can generate a range containing Y.

For now, we also assume that running out of values is an error. We're assuming that our caller will handle the termination conditions.
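Concretely, a sketch of getRangeFromChangePoints along those lines might read (the maxID constant and the error message are assumed names):

```haskell
-- Top of our 12-bit playing field.
maxID :: Int
maxID = 0xFFF

-- Pull the next range [start .. end) off the change-point list,
-- returning it along with the leftover change points.  A lone final
-- change point opens a range that runs to the end of the playing
-- field; running out of change points entirely is an error.
getRangeFromChangePoints :: [Int] -> (Int, Int, [Int])
getRangeFromChangePoints []          = error "ran out of change points"
getRangeFromChangePoints [start]     = (start, maxID + 1, [])
getRangeFromChangePoints (start : end : rest) = (start, end, rest)
```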

Wow, that's pretty wordy as Haskell code goes, but I kind of like the way the pattern-matching breaks down. Note that in particular it makes the termination conditions obvious. This helps me think about the problem domain, rather than the implementation, in a way that the imperative Ruby code did not.

Next, here is the recursive function to generate values from ranges. We represent a range using the first two values of our tuple, while the third value is a list of the remaining ranges (sort of like our continuation). We handle the termination case, in which we have no more work to do, by just returning a list from our range, while the non-terminating case gets the next range (including the list representing our remaining work) and calls the function again to continue generating the list.
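Under the same assumptions, that generator might be sketched like this (valuesFromRanges is an assumed name; the range extractor described above is repeated here so the snippet stands alone):

```haskell
maxID :: Int
maxID = 0xFFF

-- The range extractor described earlier, repeated for completeness.
getRangeFromChangePoints :: [Int] -> (Int, Int, [Int])
getRangeFromChangePoints []          = error "ran out of change points"
getRangeFromChangePoints [start]     = (start, maxID + 1, [])
getRangeFromChangePoints (start : end : rest) = (start, end, rest)

-- Generate values from a range triple (start, end, rest): the range
-- is [start .. end), and rest is the remaining work to do.  With no
-- work left we just return the range's values; otherwise we extract
-- the next range and keep going.
valuesFromRanges :: (Int, Int, [Int]) -> [Int]
valuesFromRanges (start, end, [])   = [start .. end - 1]
valuesFromRanges (start, end, rest) =
  [start .. end - 1] ++ valuesFromRanges (getRangeFromChangePoints rest)
```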

To generate the initial case we use a helper function called valuesFromRLE. In addition to the change points list, it takes a Boolean value which indicates whether we are to assume that our RLE decoding is initially inclusive or initially exclusive: that is, whether the first value in our first range indicates that the range is starting or ending. If we want it to behave inclusively, we prepend a zero value to create a range starting with zero. Note that this will be harmless if our first range starts with zero, because we'll generate a range like [0..(-1)], which will contain no members.
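A sketch of valuesFromRLE under these assumptions (with the earlier helpers repeated so the snippet stands alone):

```haskell
maxID :: Int
maxID = 0xFFF

-- The helpers sketched earlier, repeated for completeness.
getRangeFromChangePoints :: [Int] -> (Int, Int, [Int])
getRangeFromChangePoints []          = error "ran out of change points"
getRangeFromChangePoints [start]     = (start, maxID + 1, [])
getRangeFromChangePoints (start : end : rest) = (start, end, rest)

valuesFromRanges :: (Int, Int, [Int]) -> [Int]
valuesFromRanges (start, end, [])   = [start .. end - 1]
valuesFromRanges (start, end, rest) =
  [start .. end - 1] ++ valuesFromRanges (getRangeFromChangePoints rest)

-- True means "initially inclusive": prepend a zero so decoding starts
-- inside a run of ones at index zero.  False means the first change
-- point opens the first range itself.
valuesFromRLE :: [Int] -> Bool -> [Int]
valuesFromRLE changePoints True  =
  valuesFromRanges (getRangeFromChangePoints (0 : changePoints))
valuesFromRLE changePoints False =
  valuesFromRanges (getRangeFromChangePoints changePoints)
```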

Note that we can evaluate the whole thing, but more interestingly, we can use "take" to produce only the first few elements:

take 16 $ valuesFromRLE rle_change_points True

[0,1,2,144,145,146,257,258,544,560,577,578,1088,3087,3088,3328]

And there you have it. If you like, play with the code. Next time we'll see if we can encode the function IDs back into change points, and confirm the round trip. As always, I appreciate your comments.

9 comments:

One problem with that implementation of triplify is that it computes the length of the string at least once every time that it recurses (and possibly twice, assuming the compiler isn't doing enough common subexpression elimination). This makes its complexity O(n^2) overall, and O(n) to get the first element, which means it doesn't work at all with infinite lists. (n being the length of the input string.)

If I didn't care much about the error condition, or wanted to handle it elsewhere, I'd write triplify as:

triplify = map (take 3) . takeWhile (not . null) . iterate (drop 3)

This will result in the final string being shorter in the event that the string length isn't a multiple of 3. Because all the components are lazy enough, it'll give O(1) complexity to get the first element and the expected O(n) overall.

Alternately, to give a solution closer to yours, with the error condition built in, we can use pattern matching to avoid computing the length of the list:
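Such a pattern-matching version might look like this sketch:

```haskell
-- Peel off three characters at a time by pattern matching, without
-- ever computing the length of the list; leftover characters (a list
-- of one or two elements) fall through to the error case.
triplify :: String -> [String]
triplify []               = []
triplify (a : b : c : cs) = [a, b, c] : triplify cs
triplify _                = error "triplify: unexpected data length"
```

Because each chunk is produced before the rest of the list is examined, this version also works lazily on arbitrarily long (even infinite) input.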

One related thing I should also point out is that length x == 0 is not a good way to check if a list is empty, since if it's not, the length is still going to be computed, taking lots of time, when all we really want to know is if it's 0. In the infinite list case, it actually fails completely. Using pattern matching or the null function will do the test in constant time.

Basically, try to avoid the length function wherever you can, because it forces the evaluation of the whole spine of the list (all the cons cells, though not necessarily the elements in the list).

Thanks to everyone for the suggestions so far. I have cleaned up the exposition of RLE a little bit, since my original version was not only kind of incoherent, but contained some mistakes. I have also discovered that although I claimed that feeding the decoder an empty change point list, assuming the data begins with a run of zeroes, should yield a bitmap of all zeroes (represented by an empty list of indices, meaning that there are no ones), it actually triggers an error case. I'll address that in a followup.

There have also been some useful comments posted on Reddit instead of here. See the Reddit comments here.