Sunday, 29 April 2012

An experiment in Van Emde Boas layouts

Given an arbitrarily large sorted, balanced binary tree of values, the obvious way to search for elements is binary search. In undergraduate classes I learned that a binary search requires O(log n) operations and is thus as efficient as you can get without throwing additional storage space at the problem. One thing my lecturer didn't dive into, though, was the effect of the CPU cache on this otherwise simple little O(log n) algorithm.
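
For concreteness, here's the baseline being talked about, as a minimal sketch (the name and types are illustrative, not from any benchmark code):

    /* Classic binary search over a sorted array: O(log n) comparisons,
     * no extra storage. Returns the index of key, or -1 if absent. */
    #include <stddef.h>

    long binary_search(const int *a, size_t n, int key)
    {
        size_t lo = 0, hi = n;              /* half-open range [lo, hi) */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key)
                lo = mid + 1;
            else if (a[mid] > key)
                hi = mid;
            else
                return (long)mid;
        }
        return -1;
    }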

While binary search sounds great on paper, it turns out that repeatedly jumping back and forth across memory can clock up a tonne of wasted cycles if you're doing it a lot: each jump to a far-away element is a likely cache miss. What makes matters worse is that different architectures, and even different generations of CPUs, have different-sized caches with different performance characteristics.

Enter the van Emde Boas layout: a data layout designed to minimise cache misses regardless of the size of the cache, making it the textbook example of a cache-oblivious data structure. (I've coded up a visual comparison in JavaScript here.)

The layout breaks a tree of n nodes into roughly-sqrt(n)-sized pieces: a top subtree of about sqrt(n) nodes and about sqrt(n) bottom subtrees, each also of about sqrt(n) nodes. Each piece is stored contiguously and laid out the same way recursively, until the pieces are single nodes. The idea is to locate related data close together in memory to minimise cache misses, and the recursive nature of the layout means it works relatively well regardless of cache size: whatever the line size of a given cache, some level of the recursion produces blocks that fit within a line.
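
In code, building the layout might look something like this minimal sketch (the names and the BFS-indexed input representation are illustrative, not my actual implementation); it emits a permutation mapping vEB positions to BFS positions, top subtree first, then each bottom subtree:

    /* The input tree is implicit in BFS order: root at index 0,
     * children of node i at 2i+1 and 2i+2.
     * out[] ends up mapping vEB position -> BFS index. */
    #include <stddef.h>

    /* Emit the height-h subtree rooted at BFS index `root`,
     * starting at out[*pos]. */
    static void veb_layout(size_t root, unsigned h, size_t *out, size_t *pos)
    {
        if (h == 1) {                 /* single node: just emit it */
            out[(*pos)++] = root;
            return;
        }
        unsigned top = h / 2;         /* levels in the top subtree     */
        unsigned bot = h - top;       /* levels in each bottom subtree */

        /* The top subtree goes first, itself recursively in vEB order. */
        veb_layout(root, top, out, pos);

        /* Then each of the 2^top bottom subtrees, each one contiguous.
         * Their roots sit `top` levels below `root`; in BFS numbering
         * those are ((root + 1) << top) - 1 + k for k = 0 .. 2^top - 1. */
        size_t first = ((root + 1) << top) - 1;
        for (size_t k = 0; k < ((size_t)1 << top); k++)
            veb_layout(first + k, bot, out, pos);
    }

    /* For a complete tree of depth d (2^d - 1 nodes):
     *     size_t pos = 0;
     *     veb_layout(0, d, out, &pos);
     */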

All this cleverness isn't free, though: computing the index of a node in a vEB layout takes noticeably more arithmetic than in the simpler orderings I'm comparing against, namely an in-order layout (a plain sorted array) and a tree-order layout (the same values stored in BFS, i.e. level, order). The question I wanted answered was whether or not the benefits outweigh the costs. Enter experiment time!

As expected, the L1 data cache (D1) miss rates come out ordered vEB < tree-order < in-order.
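
Searching the tree-order layout is worth seeing to appreciate how cheap its index arithmetic is (again a sketch with illustrative names):

    /* Searching a tree-order (BFS) layout: the node at index i has its
     * children at 2i+1 and 2i+2, so descending a level costs a compare,
     * a shift and an add, barely more than the sorted array.
     * Returns the layout index of key, or -1 if absent. */
    #include <stddef.h>

    long bfs_search(const int *a, size_t n, int key)
    {
        size_t i = 0;
        while (i < n) {
            if (a[i] == key)
                return (long)i;
            i = (key < a[i]) ? 2 * i + 1 : 2 * i + 2;
        }
        return -1;
    }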

Real-world performance

It's almost impossible to make generic claims about real-world performance here. The actual cost-benefit argument will depend entirely on your combination of CPU, RAM, data set size and data access patterns. Nevertheless, in my case I'm testing with 64k packed entries, each a 4-byte key plus 1 byte of data. (Note that, due to a limitation of my implementation of the vEB layout, I require tree depths d = 2^(2^n), i.e. 2, 4, 16, 256, and so on. Depth 16 gives the 2^16 - 1 ≈ 64k entries used here; 4G of entries would need depth 32, which the constraint doesn't allow, so it would involve significantly more effort to run these benchmarks at that size.)
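
Concretely, the records look something like this (a sketch; the struct name and the GCC-style packing attribute are illustrative):

    /* A 4-byte key plus 1 byte of data, packed to exactly 5 bytes per
     * entry. 64Ki entries at 5 bytes each is about 320 KiB, far bigger
     * than a typical 32 KiB L1 data cache, so the layout genuinely
     * matters here. */
    #include <stdint.h>

    struct __attribute__((packed)) entry {
        uint32_t key;
        uint8_t  data;
    };

    _Static_assert(sizeof(struct entry) == 5, "entry must stay packed"); /* C11 */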

So it looks as though, despite vEB being the more cache-friendly layout, the cost of computing the index of a node in a vEB layout tends to outweigh the cache benefits.

Conclusion

I've been investigating vEB layouts for potentially turbo-charging a bunch of static lookup service code, so this is a pretty disappointing result in that regard. Still, all is not lost. BFS ordering is clearly a very big performance win for minimal computational cost. Also, my focus so far has been exclusively on the CPU-cache benefits vEB might provide, but even if those are nullified by the extra computational overhead, slower storage technologies such as CD, flash and disk should still benefit significantly from the vEB layout; this is doubly true for media with expensive seeks. I might have to run another set of similar benchmarks on various storage media. I imagine the results would be much more in vEB's favour when applied to spinning media.

FYI: if you're interested in the cache configuration of your machine and you run Linux, here's a great little trick I discovered in my Googling travels:
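
One minimal version of that kind of trick, assuming a kernel with sysfs cacheinfo support: the kernel describes every cache under /sys, so a single grep dumps the lot.

    # One directory per cache under sysfs (standard on modern kernels);
    # grep prints each attribute as a path:value pair.
    grep . /sys/devices/system/cpu/cpu0/cache/index*/{level,type,size,coherency_line_size,ways_of_associativity}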