Cessu's blog

Friday, November 28, 2008

SR3C

Symbol Ranking is a method for data compression where one maintains an LRU list of symbols recently seen in the given context. Those symbols are expected to be more probable the next time and are therefore encoded with fewer bits than other symbols. The LRU list is usually only up to three symbols long, which makes it possible to store a context in just one 32-bit word.
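To make the idea concrete, here is a minimal sketch of a three-symbol LRU list packed into one 32-bit word. The byte layout and the names sr_rank and sr_update are assumptions made for illustration; they are not taken from SR3C itself.

```c
#include <stdint.h>

/* Assumed layout: bits 0-7 hold the most recently seen symbol,
   bits 8-15 the second, bits 16-23 the third; bits 24-31 are unused. */

/* Return the rank (0, 1 or 2) of sym in the context, or 3 on a miss. */
static int sr_rank(uint32_t ctx, uint8_t sym)
{
    if ((ctx & 0xff) == sym) return 0;
    if (((ctx >> 8) & 0xff) == sym) return 1;
    if (((ctx >> 16) & 0xff) == sym) return 2;
    return 3;
}

/* Move sym to the front of the list and return the new context. */
static uint32_t sr_update(uint32_t ctx, uint8_t sym)
{
    switch (sr_rank(ctx, sym)) {
    case 0:  /* already in front, nothing to do */
        return ctx;
    case 1:  /* swap the first two symbols, keep the third */
        return (ctx & 0xff0000) | ((ctx & 0xff) << 8) | sym;
    default: /* rank 2 or miss: shift the first two down, drop the last */
        return ((ctx & 0xffff) << 8) | sym;
    }
}
```

A compressor built on this would emit a short code for ranks 0 to 2 and fall back to a longer literal encoding on a miss. Note that in this sketch an all-zero context simply behaves as if the zero byte had been seen three times.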

Symbol Ranking has been common computing folklore for ages. Peter Fenwick gave the first modern treatment and implementation, which Matt Mahoney subsequently improved, for example by adding a secondary arithmetic compression stage. I added a few ideas of my own and rewrote it in C so that it is easy to embed in other programs, e.g. to compress network connections and database logs.

This compression library, SR3C, sits on the frontier where one has to spend either more memory or more cycles to compress better. It is slightly faster than gzip -7, but produces ~11% smaller output. bzip2 -2 compresses slightly better than SR3C, but takes almost three times as long and is not on-line.

Matt Mahoney made a version whose command line interface compiles on Windows and benchmarked it as well. Both versions of SR3C are published under the MIT license.

Random Access Pseudo-Random Numbers

Since my previous post I was approached by a fellow-in-keyboards in need of a "random-access" pseudo-random number generator, one which could generate the i'th random number r directly, without computing the i-1 intermediate random numbers since seeding. Would a hash function for integers be a good RAPRNG?

No! Good hash functions avoid funnels and are thus bijective, which for 32-bit hash functions would imply a period of only 2^32. Consequently, if we draw N < 2^32 consecutive random numbers, hashing means that a given value r can occur once or never, but not twice or more. A correct random number generator would instead give the frequency of r an approximately Poisson distribution, in which repeats do occur.
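The counting argument can be written out as follows. The notation (K for the number of times a fixed, uniformly chosen value r appears among the N outputs, and lambda for N/2^32) is introduced here for illustration. With independent uniform draws K is binomial, which is well approximated by a Poisson distribution, whereas a bijective hash of N consecutive integers can only give K = 0 or K = 1.

```latex
\[
\Pr[K = k] = \binom{N}{k}\, 2^{-32k} \left(1 - 2^{-32}\right)^{N-k}
  \approx e^{-\lambda}\, \frac{\lambda^{k}}{k!},
  \qquad \lambda = \frac{N}{2^{32}}
\]
\[
\text{bijective hash:}\quad
\Pr[K = 1] = \lambda, \qquad
\Pr[K = 0] = 1 - \lambda, \qquad
\Pr[K \ge 2] = 0
\]
```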

We can fix this by increasing the period, which is equivalent to increasing the number of output bits of the hash, and then using only 32 bits of the output. Alternatively we could hash i with two independent integer hash functions and xor the results. Either way, we deliberately create just the right amount of final funnelling.
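The xor variant can be sketched as below. This is only an illustration of the principle, not raprng itself: the two mixers and their odd multipliers are assumed stand-ins, and no claim is made that this particular pair passes any statistical test. The helper count_repeats is likewise hypothetical and exists only to show that each mixer alone never repeats a value, while their xor can.

```c
#include <stdint.h>
#include <stdlib.h>

/* Two bijective 32-bit mixers: xor-shifts and multiplications by odd
   constants are each invertible modulo 2^32. The constants are
   illustrative picks, not the ones used in raprng. */
static uint32_t mix_a(uint32_t x)
{
    x ^= x >> 16; x *= 0x85ebca6bu;
    x ^= x >> 13; x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

static uint32_t mix_b(uint32_t x)
{
    x ^= x >> 16; x *= 0x7feb352du;
    x ^= x >> 15; x *= 0x846ca68bu;
    x ^= x >> 16;
    return x;
}

/* Hypothetical random-access generator: the i'th output is the xor of
   two different hashes of i, so distinct i may yield equal outputs. */
static uint32_t xor_pair(uint32_t i)
{
    return mix_a(i) ^ mix_b(i);
}

/* qsort comparator for unsigned 32-bit values. */
static int cmp_u32(const void *p, const void *q)
{
    uint32_t a = *(const uint32_t *)p, b = *(const uint32_t *)q;
    return (a > b) - (a < b);
}

/* Count how many of gen(0), ..., gen(n-1) repeat an earlier value.
   Returns -1 if memory allocation fails. */
static long count_repeats(uint32_t (*gen)(uint32_t), uint32_t n)
{
    uint32_t *v;
    long repeats = 0;
    if (n == 0) return 0;
    v = malloc((size_t)n * sizeof *v);
    if (v == NULL) return -1;
    for (uint32_t i = 0; i < n; i++) v[i] = gen(i);
    qsort(v, n, sizeof *v, cmp_u32);
    for (uint32_t i = 1; i < n; i++) repeats += (v[i] == v[i - 1]);
    free(v);
    return repeats;
}
```

Calling count_repeats(mix_a, n) or count_repeats(mix_b, n) always gives zero, because each mixer is a bijection. For xor_pair, independent uniform 32-bit draws would produce roughly n^2/2^33 repeats, so with n around a million one would hope to see on the order of a hundred.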

Practical tests, such as Dieharder, are surprisingly difficult to pass. I tried xoring the results of several pairs of known integer hash functions, and eventually found a reasonable solution based on my own hash_32_to_64 mentioned in my previous post.

The hypothetical period of raprng is 2^96 if i were of sufficient width. This period is barely enough to be respectable, at least compared to some serious PRNGs like the Mersenne Twister. raprng is also slower. But it is random-accessible.

Having gone through all this I need to point out that there is a lesser-known PRNG called the explicit inversive congruential generator (EICG), where the i'th output is defined as the modular inverse of a*i+b for some a and b, where a should obviously be non-zero. The problem herein is that modular inversion is by no means cheap, and it isn't clear to me that some other very non-linear bijection - such as a very good hash function - wouldn't do.
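For reference, a minimal sketch of an EICG over a prime modulus might look like the following. The modulus 2^31 - 1, the function names, and the use of Fermat's little theorem for the inversion are all assumptions chosen for brevity; a serious implementation would likely use a faster inversion method.

```c
#include <stdint.h>

#define EICG_P 2147483647u  /* the prime 2^31 - 1, an assumed modulus */

/* Modular inverse of x modulo the prime EICG_P, computed as
   x^(p-2) mod p by square-and-multiply. The inverse of 0 is taken
   to be 0, as is conventional for inversive generators. */
static uint32_t inv_mod_p(uint32_t x)
{
    uint64_t result = 1, base = x % EICG_P;
    uint32_t e = EICG_P - 2;
    while (e > 0) {
        if (e & 1) result = result * base % EICG_P;
        base = base * base % EICG_P;
        e >>= 1;
    }
    return (uint32_t)result;
}

/* The i'th output: the inverse of a*i + b modulo EICG_P. The
   parameters a (non-zero modulo p) and b are chosen by the caller.
   Outputs lie in the range [0, EICG_P). */
static uint32_t eicg(uint32_t a, uint32_t b, uint64_t i)
{
    uint64_t t = ((uint64_t)(a % EICG_P) * (i % EICG_P) + b % EICG_P)
                 % EICG_P;
    return inv_mod_p((uint32_t)t);
}
```

Each call costs about thirty modular squarings plus a comparable number of modular multiplications, which illustrates why the inversion is not cheap compared to a handful of shifts and multiplies in a hash function.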

Thursday, November 13, 2008

Hashing with SSE2 Revisited, or My Hash Toolkit

Now that Bob "The Hashman" Jenkins refers to my earlier post on a hash function implemented with the help of SSE2 instructions, I feel compelled to post an update on it.

After my original post Bob and I concluded the mix step was not reversible, so I went back to the drawing board and came up with the function below. As earlier, padding the data to a multiple of 16 bytes and other trivialities are left to the reader.

Some people have asked me what my preferred set of hash functions is. As mentioned earlier, the one-at-a-time hash fills most of my daily needs. But when I need more speed or more bits to address a larger hash table, I use either Jenkins hash functions for string-like data, or when I have fast multiplication and the hashed material is more reminiscent of picking fields in a record, I use

About Me

A gradually middle-aging hacker, a software engineer currently working on my PhD in computer science at the Helsinki University of Technology, married, a happy father of two small daughters who are probably bound to outsmart me in everything except some arcane programming languages, a modest/low-carber, member of IKI and Skepsis, and someone who likes good discussions, elegant algorithms, terse software designs, puzzles with an element of surprise, cycling, cooking, and - given a relaxed schedule - building or renovating woodwork.