War of allocators: hoard or hoards?

I came across hoard, Emery Berger's memory allocator. Its homepage promises some attractive general qualities (fast, scalable, and memory-efficient). Do we need more than this? I tried out how it performs in WebKit.
I ran the benchmarks on my benchmarking machine (called Suttyo :-]), an x86 Debian Lenny box (SMP kernel, dual-core 2.33GHz CPU). I used the Linux Qt port of WebKit at the official r55720 revision. The memory results are provided by our modified Linux kernel and represent the maximum resident set size (RSS).
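
If you don't have a patched kernel at hand, a rough approximation of the same number can be had from user space. The sketch below is not how the results in this post were measured. It is a minimal Python example, assuming a Linux system where ru_maxrss is reported in kilobytes; the helper name peak_rss_kb and its extra_env parameter are made up for illustration.

```python
import os
import resource
import subprocess
import sys


def peak_rss_kb(cmd, extra_env=None):
    """Run cmd to completion and return the peak RSS of waited-for children.

    Assumption: Linux, where ru_maxrss is expressed in kilobytes.
    """
    env = dict(os.environ)
    if extra_env:
        # Hypothetical hook, e.g. for an environment variable that
        # selects a different allocator at load time.
        env.update(extra_env)
    subprocess.run(cmd, env=env, check=True)
    # RUSAGE_CHILDREN reports the largest peak RSS among all children
    # this process has waited for so far, so use a fresh Python process
    # for each measurement to keep the runs independent.
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    kb = peak_rss_kb([sys.executable, "-c", "data = list(range(1000000))"])
    print(f"peak RSS: {kb / 1024:.1f} MB")
```

Because the value is a running maximum over every child waited for, measuring two allocators from the same parent process would only ever show the larger of the two, hence one parent process per run.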

Methanol

Methanol is still our live browsing simulation benchmark: it loads and renders popular web pages one by one from local copies (currently 9 pages, 5 times). The time was measured with Methanol's JavaScript.
Unfortunately, hoard is almost as slow as JEmalloc in this test: it is 5.3% slower than TCmalloc. On the memory side, it consumes 4.6% more memory than TCmalloc (which, up until now, had consumed the most memory).
After benchmarking with Methanol I'd say we shouldn't spend more time on hoard, but let's still see the results of our other benchmarks.

SunSpider in QtLauncher

This test runs SunSpider inside QtLauncher. It does only minimal rendering and puts the emphasis on JavaScript execution. In terms of performance, hoard is 11.4% slower than TCmalloc, and it consumes 3.7% more memory. It is slower and consumes more memory... Sounds bad...

V8 in QtLauncher

The V8 benchmark shows the same pattern we saw with SunSpider: hoard is 4.9% slower than TCmalloc and consumes 7.9% more memory. By the way, the V8 benchmark consumes ~151 megabytes; this number will be interesting later.

WindScorpion in QtLauncher

WindScorpion is our collection of real-life JavaScripts, and it works like the SunSpider and V8 benchmarks.
As the chart shows, hoard is 11% slower than TCmalloc and consumes 20.7% more memory, which means almost +10MB of memory usage in the case of WindScorpion.
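
The two WindScorpion memory figures above (20.7% more, almost +10MB) are enough to estimate the absolute numbers behind them. The snippet below is only back-of-the-envelope arithmetic: it treats the extra memory as exactly 10 MB, so the results are upper estimates, and the function name is made up for illustration.

```python
def baseline_from_overhead(extra_mb, overhead_pct):
    """Return the baseline size for which extra_mb is overhead_pct percent."""
    return extra_mb / (overhead_pct / 100.0)


# Assumption: the "+10MB" is taken at face value, although the post
# says "almost", so the real values are slightly lower.
tcmalloc_mb = baseline_from_overhead(10.0, 20.7)
hoard_mb = tcmalloc_mb + 10.0

print(f"TCmalloc: at most about {tcmalloc_mb:.1f} MB")  # 48.3
print(f"hoard:    at most about {hoard_mb:.1f} MB")     # 58.3
```

So WindScorpion is a much smaller workload than the V8 benchmark, which is why a modest absolute difference shows up as a large percentage.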

These were single-threaded benchmarks... But as hoard's web site says, "it can dramatically improve application performance, especially for multi-threaded programs running on multiprocessors"...

Workers - the multi-threaded benchmarks

With the help of JavaScript workers, we can run JavaScript applications simultaneously. Let's see how our new multi-threaded allocator performs with workers.

Two SunSpider workers in QtLauncher

The columns represent the slower worker's result. In terms of performance, TCmalloc is 39% faster than hoard. On the memory side, hoard consumes 2.9% more memory.

Two V8 workers in QtLauncher

As the chart shows, hoard is 44% slower than TCmalloc and consumes 28% more memory. In the single V8 run, TCmalloc consumes ~150 megabytes; with 2 V8 workers it consumes 334 megabytes.
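
To put those V8 numbers side by side, here is a small arithmetic sketch. It assumes that the 334 megabytes is TCmalloc's figure for the two-worker run and that hoard's 28% is relative to it; the hoard value is therefore derived, not measured, and the variable names are only illustrative.

```python
single_mb = 150.0        # TCmalloc, one V8 run (approximate, from the post)
two_workers_mb = 334.0   # assumed to be TCmalloc with two V8 workers
hoard_overhead_pct = 28.0

# How much the footprint grows when going from one run to two workers.
scale = two_workers_mb / single_mb

# Derived estimate of hoard's footprint with two workers.
hoard_two_workers_mb = two_workers_mb * (1 + hoard_overhead_pct / 100.0)

print(f"two workers vs. one run: {scale:.2f}x")                    # 2.23x
print(f"hoard with two workers: ~{hoard_two_workers_mb:.0f} MB")   # ~428 MB
```

Under these assumptions the footprint more than doubles with two workers, and hoard's overhead adds roughly another 90 MB on top.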

Summary

I expected hoard to perform better, but as you can see, TCmalloc still shows the best values.

These results don't surprise me at all. The hoard allocator has a lot of noise around it, but my experience on embedded and multi-core systems has demonstrated that you need a really big, churning code base for the allocator to start to hit its prime. On the whole, WebKit is a pretty frugal and well-behaved application and doesn't do wild amounts of cross-thread allocation and deallocation.

One allocator that you might be interested in trying out, now that you've been disappointed with Hoard, is TLSF (http://rtportal.upv.es/rtmalloc/ or http://tlsf.baisoku.org/).

It generally performs about the same as dlmalloc in my experience, but its big win is its deterministic behaviour, which may be a win for WebKit.