Transcendental Technical Travails

New technical endeavors often push the limits of the state of the art. Discovering working solutions is important, but just as important are the transcendental travails that start with non-working attempted solutions.

18 April 2010

Red-black trees revisited

A couple of months ago I was looking at CPU usage profiles for one of the server applications at work, trying to figure out how to reduce the 28% of run time attributed to jemalloc. For the application under scrutiny, red-black tree insertion/deletion accounted for 6% of total run time, and I knew from previous benchmarking that jemalloc's left-leaning red-black trees are approximately half as fast as the traditional implementations that utilize parent pointers. Therefore I set out to create a balanced tree implementation that is both fast and compact.

In that earlier post I wrote: "Insertion/deletion is fastest for the red-black tree implementations that do lazy fixup (rb_old and RB). rb_new uses a single-pass algorithm, which requires more work."

Well, in fact it isn't quite that simple. While it is true that the single-pass algorithm requires some extra node rotations, the overhead of traversing down/up the tree is also significant. A recursive implementation I experimented with performed approximately the same number of node rotations as rb_old and RB, but the traversal overhead of full down/up passes dominated.

A major advantage of the rb_old and RB insertion/deletion implementations is that they terminate as soon as balance is restored. A recursive implementation makes early termination infeasible, so I developed a non-recursive implementation, rb_newer, that uses a stack-allocated array to store the path from the root to the current position. The essence of the algorithm is the same as for the recursive implementation, with the advantage that early termination requires no call stack unwinding.
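The path-array technique can be sketched as follows. This is a simplified illustration using a plain binary search tree, not jemalloc's actual rb_newer code; the node layout and function names are hypothetical, and the red-black fixup logic is elided:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_DEPTH 64 /* ample for any balanced tree of practical size */

typedef struct node {
    struct node *left, *right;
    int key;
} node_t;

/* Insert key, recording the root-to-leaf path in a stack-allocated
 * array instead of relying on recursion or parent pointers. */
static node_t *
tree_insert(node_t *root, int key)
{
    node_t *path[MAX_DEPTH];
    int depth = 0;
    node_t **linkp = &root;

    /* Down pass: record every node visited. */
    while (*linkp != NULL) {
        path[depth++] = *linkp;
        linkp = (key < (*linkp)->key) ? &(*linkp)->left : &(*linkp)->right;
    }
    node_t *n = malloc(sizeof(*n));
    n->left = n->right = NULL;
    n->key = key;
    *linkp = n;

    /* Up pass: walk the recorded path back toward the root.  A real
     * red-black fixup would rebalance here and break out of the loop
     * as soon as balance is restored -- early termination with no
     * call stack to unwind. */
    while (depth > 0) {
        node_t *parent = path[--depth];
        (void)parent; /* rebalancing logic would go here */
        /* if (balance restored) break; */
    }
    return root;
}
```

The array replaces both the call stack of a recursive implementation and the per-node parent pointers of rb_old/RB, which is where the space savings comes from.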

The following table shows benchmark results that directly correspond to those reported in a previous blog post (Ubuntu 9.10 instead of 8.10 though):

The main result to note is that rb_newer is substantially faster than rb_new at insertion/deletion, though it only closes half the performance gap between rb_new and rb_old. Despite my best efforts, rb_old (a direct translation of the pseudo-code in Introduction to Algorithms, by Cormen et al.) remains markedly faster, thanks to the simplicity afforded by parent pointers.

jemalloc uses rb_newer now, though the total memory savings as compared to rb_old is only about 0.4%. Nonetheless this seems like a worthwhile space savings, given how typical applications use malloc.

11 April 2010

Stand-alone jemalloc 1.0.0

Stand-alone jemalloc 1.0.0 is finally released. There are many interesting features to talk about at some point (thread-local caching, heap profiling, introspection, yet another red-black tree implementation, etc.), but in the meanwhile, enjoy!

17 May 2009

Mr. Malloc gets schooled

I've been terribly busy for the past 8 months, frantically developing Crux and using it to conduct experiments so that I can finish my thesis for a PhD in computational biology. An ironic thing happened this weekend, and this is the perfect forum for sharing it. I spent two days trying to figure out why Crux's memory usage was growing without bound when analyzing large datasets. I looked for memory leaks, inefficient caching, garbage collection issues, any explanation for the memory usage. After much pain and agony (including that inflicted on the system administrator who kindly patched up the bleeding Beowulf cluster I left in my wake), I finally came to the conclusion that the problem wasn't in Crux. That led me to... glibc's ptmalloc. Launching Crux with jemalloc LD_PRELOAD'ed made the problem go away!

It turns out that memory was fragmenting without bound. Crux incrementally reallocates (actually free(...); posix_memalign(...)) vectors of double-precision floating point numbers. For the dataset I'm currently analyzing, these vectors are multiples of ~337KiB, where the multiplier is anything from 1 to ~65 (20+MiB). I wouldn't have expected this to cause malloc any fragmentation problems, since the last I knew, ptmalloc simply used mmap() for all allocations above 128KiB. However, here's what it does now (taken directly from the glibc source code):
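The allocation pattern looks roughly like this; a minimal reconstruction of what Crux does, not its actual code, and the function name is made up:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Grow an aligned vector of doubles the way Crux does: free the old
 * vector, then allocate a strictly larger one, over and over.  Under
 * ptmalloc's sliding mmap threshold, each freed multi-hundred-KiB
 * block raises the threshold, so successive allocations come from the
 * data segment -- and the freed holes are never large enough to hold
 * the next, bigger request, so fragmentation grows without bound. */
static double *
grow_vector(double *vec, size_t n_doubles)
{
    void *p;

    free(vec); /* free(NULL) is harmless on the first call */
    if (posix_memalign(&p, 16, n_doubles * sizeof(double)) != 0)
        return NULL;
    return p;
}
```

For the dataset in question, n_doubles corresponds to vector sizes that are multiples of ~337 KiB, with multipliers from 1 to ~65.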

Update in 2006: The above was written in 2001. Since then the world has changed a lot. Memory got bigger. Applications got bigger. The virtual address space layout in 32 bit linux changed.

In the new situation, brk() and mmap space is shared and there are no artificial limits on brk size imposed by the kernel. What is more, applications have started using transient allocations larger than the 128Kb as was imagined in 2001.

The price for mmap is also high now; each time glibc mmaps from the kernel, the kernel is forced to zero out the memory it gives to the application. Zeroing memory is expensive and eats a lot of cache and memory bandwidth. This has nothing to do with the efficiency of the virtual memory system, by doing mmap the kernel just has no choice but to zero.

In 2001, the kernel had a maximum size for brk() which was about 800 megabytes on 32 bit x86, at that point brk() would hit the first mmaped shared libraries and couldn't expand anymore. With current 2.6 kernels, the VA space layout is different and brk() and mmap both can span the entire heap at will.

Rather than using a static threshold for the brk/mmap tradeoff, we are now using a simple dynamic one. The goal is still to avoid fragmentation. The old goals we kept are:
1) try to get the long lived large allocations to use mmap()
2) really large allocations should always use mmap()
and we're adding now:
3) transient allocations should use brk() to avoid forcing the kernel having to zero memory over and over again

The implementation works with a sliding threshold, which is by default limited to go between 128Kb and 32Mb (64Mb for 64 bit machines) [actually 512KiB/32MiB for 32/64-bit machines as configured in glibc] and starts out at 128Kb as per the 2001 default.

This allows us to satisfy requirement 1) under the assumption that long lived allocations are made early in the process' lifespan, before it has started doing dynamic allocations of the same size (which will increase the threshold).

The upper bound on the threshold satisfies requirement 2).

The threshold goes up in value when the application frees memory that was allocated with the mmap allocator. The idea is that once the application starts freeing memory of a certain size, it's highly probable that this is a size the application uses for transient allocations. This estimator is there to satisfy the new third requirement.
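The heuristic can be modeled in a few lines. This is a simplified sketch of the policy described above, not glibc's actual code; the constants and names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define DEFAULT_MMAP_THRESHOLD ((size_t)128 * 1024)
#define MMAP_THRESHOLD_MAX     ((size_t)32 * 1024 * 1024) /* 32-bit cap */

static size_t mmap_threshold = DEFAULT_MMAP_THRESHOLD;

/* Allocation policy: requests at or above the threshold go to mmap(). */
static int
uses_mmap(size_t size)
{
    return size >= mmap_threshold;
}

/* On free of an mmap()ed chunk, assume that size is transient and
 * slide the threshold up so future requests of that size use brk().
 * The small addend mimics glibc, which records the chunk size
 * (request size plus header overhead) as the new threshold. */
static void
on_free(size_t size, int was_mmapped)
{
    if (was_mmapped && size > mmap_threshold && size <= MMAP_THRESHOLD_MAX)
        mmap_threshold = size + 2 * sizeof(size_t);
}
```

Because the threshold is capped at MMAP_THRESHOLD_MAX, really large allocations always stay on mmap() (requirement 2), while repeatedly freed mid-size allocations migrate to brk() (requirement 3) -- which is exactly what bites Crux's free-then-reallocate pattern.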

So, ptmalloc got smart, and appears to have opened itself up to a serious fragmentation problem due to some inadequacy in its data segment layout policies. Here's a simple fix that disables the sliding threshold:

#include <malloc.h>  /* mallopt() and M_MMAP_THRESHOLD; glibc-specific */
mallopt(M_MMAP_THRESHOLD, 128*1024);

How did I fail to consider this possibility for two whole days? There are a few contributing factors:

I'd never actually seen ptmalloc fail spectacularly before. I've received several emails over the past year from people using jemalloc to avoid ptmalloc fragmentation problems, but I didn't know what conditions actually triggered the problem(s).

As the author of jemalloc, I'm keenly aware of how often people are wrong when they blame the memory allocator for their problems.

My reasoning about how memory allocation works is tainted by intimate knowledge of how jemalloc works, and I failed to consider that Crux's memory allocation patterns could cause problems for other allocators.

27 August 2008

Stand-alone jemalloc for Linux

I have received numerous requests for a version of jemalloc that is ported to various operating systems. My plan has long been to create a jemalloc distribution that supports *BSD, Linux, Solaris, OS X, and Windows, but there are a lot of portability headaches to deal with for OS X and Windows, so the project keeps being neglected.

Porting to Linux isn't much trouble though, so I whipped up a minimal distribution this morning, and put it in this directory. Maybe someday I'll get around to a proper widely portable distribution, but until then, I hope people find the Linux port useful.

Most of this article discusses performance, but let me first mention implementation difficulty. It took me about 90 hours to design/implement/test/benchmark left-leaning red-black trees, and less than 10 hours for treaps. Worst-case search/insert/delete for red-black trees is O(lg n), versus O(n) for treaps. However, the average case for treaps is O(lg n), and the chances of worst-case behavior are vanishingly small, thanks to (pseudo-)randomness. Thus, real-world performance differences are only incremental. To be fair, I made red-black trees harder by avoiding recursion. Regardless, treaps are way easier to implement than red-black trees.

As for benchmarking, I wrote functionally identical benchmark programs for three red-black tree implementations (rb_new, rb_old, and RB) and two treap implementations (trp_hash and trp_prng).

The benchmark programs iteratively generate permutations of NNODES nodes, for NSETS node sets. For each node set, the programs iteratively build and tear down a tree using the first [1..NNODES] nodes in the set. Each insert/remove operation is accompanied by NSEARCH rounds of searching for every object in the tree, and NITER rounds of iterating over every object in the tree. Don't worry too much about the details; in short the benchmark programs can be configured to predominantly benchmark insertion/deletion, searching, and/or iteration.

The following table summarizes benchmark results as measured on a 2.2 GHz amd64 Ubuntu 8.10 system. The benchmarks were all compiled with "gcc -O3", and the times are user+system time (fastest of three runs):

NNODES,NSETS,NSEARCH,NITER (focus)    rb_new   rb_old      RB   trp_hash   trp_prng
1500,25,0,0    (ins/del)                7.60     3.99    4.25      17.57       7.58
125,25,10,0    (search)                17.74    18.61   16.60      17.84      17.77
250,25,0,100   (iteration)             18.45    21.06   19.19      20.45      20.40

Insertion/deletion is fastest for the red-black tree implementations that do lazy fixup (rb_old and RB). rb_new uses a single-pass algorithm, which requires more work. trp_prng is about the same speed as rb_new, but trp_hash is way slower, due to the repeated hash computations that are required to avoid explicitly storing node priorities.
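For illustration, a trp_hash-style priority function might look like this. The particular integer hash is an assumption for the sketch, not jemalloc's actual code; the point is that deriving the priority from the node's address costs a hash computation at every priority comparison during rebalancing, which is what makes trp_hash slow:

```c
#include <assert.h>
#include <stdint.h>

/* Compute a treap priority by hashing the node's address rather than
 * storing a priority field in the node.  Saves one word per node, at
 * the cost of recomputing the hash each time two priorities are
 * compared. */
static uint32_t
trp_priority(const void *node)
{
    uint32_t h = (uint32_t)(uintptr_t)node;

    /* Integer mixing steps (a Wang-style 32-bit hash variant). */
    h = ~h + (h << 15);
    h ^= h >> 12;
    h += h << 2;
    h ^= h >> 4;
    h *= 2057;
    h ^= h >> 16;
    return h;
}
```

trp_prng instead stores a pseudo-randomly generated priority per node, trading one word of space for much cheaper comparisons.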

Search performance is similar for all implementations, which indicates that there are no major disparities in tree balance.

Iteration performance is similar for all implementations, even though they use substantially different algorithms. If tree size were much larger, rb_old and RB would suffer, since they use an O(n lg n) algorithm, whereas rb_new and trp_* use O(n) algorithms. rb_new uses a rather complicated iterative algorithm, but trp_* use recursion and callback functions due to the weak upper bound on tree depth.

Sadly, there is no decisive winner, though any of the five tree implementations is perfectly adequate for the vast majority of applications. The winners according to various criteria are:

24 July 2008

Overzealous use of my red-black tree hammer

When Firefox 3 was released, jemalloc was left disabled for the OS X version, essentially because OS X's malloc implementation did as good a job as jemalloc (in terms of both speed and memory usage), and we didn't think it worth introducing potential regressions due to changed memory layout. Recently I have been working on a memory reserve system that allows Firefox to simplify its error handling with regard to out-of-memory errors. Since the memory reserve is necessarily deeply integrated with the allocator, we need to use jemalloc on all platforms in order to take advantage of this new facility. This prompted me to take a closer look at jemalloc performance on OS X. In summary:

On ELF-based systems (pretty much all modern Unix and Unix-like systems except OS X), it is possible to cleanly replace the system malloc, either by directly implementing the appropriate functions (malloc, realloc, free, etc.), or by using the LD_PRELOAD environment variable to preload a dynamic library that contains a malloc implementation. For Windows, replacing malloc is much harder; it is necessary to create a custom CRT. On the bright side, at least it is possible to create a custom CRT, since source code is included with MS Visual Studio.

OS X uses the Mach-O format, and in order to completely replace the system malloc, it would be necessary to compile a custom libSystem. As far as I know, that has not been possible outside the confines of Apple since version 10.3 (2+ years ago). Even if it were possible, there would be all sorts of undesirable aspects to shipping a custom libSystem with Firefox; libSystem is a huge library, and binary compatibility issues would be a constant problem. So, the only remaining viable option is to subvert the malloc zone machinery. There is no supported method for changing the default zone, and furthermore, CoreFoundation directly accesses the default zone. Enough about that though; suffice it to say that I did find ways to subvert the malloc zone machinery.

Once Firefox was successfully using jemalloc for all memory allocation, I started doing performance tests. Memory usage differences were minor, but jemalloc was consistently slower than OS X's allocator. It took a lot of profiling for me to finally accept the hard truth: jemalloc was spending way too much time manipulating red-black trees. My first experimental solution was to replace red-black trees with treaps. However, this made little overall difference. So, the problem was too many tree operations, not slow tree operations.

After a bit of code review, it became clear that when I fixed a page allocation bottleneck earlier this year, I was overzealous with the application of red-black trees. It is possible to use constant-time algorithms based on linear page map data structures for splitting/coalescing sequential runs of pages, but I had re-coded these operations entirely using red-black trees. So, I enhanced the page map data structures to support splitting/coalescing, and jemalloc became markedly faster. For example, Firefox sped up by as much as ~10% on JavaScript-heavy benchmarks. (As a side benefit, memory usage went down by 1-2%).
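Constant-time coalescing with a linear page map can be illustrated as follows. This is a toy model of the idea, not jemalloc's actual data structure; in a real allocator, allocation would also clear or update the endpoint entries it consumes:

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 64

/* free_len[i] is nonzero only at the first and last page of a free
 * run, where it records the run's length in pages; allocated pages
 * hold 0.  Storing the length at both ends is what makes coalescing
 * O(1): freeing a run inspects only its two boundary neighbors,
 * never walking a tree or list.  Index 0 and NPAGES+1 are sentinels
 * that stay 0.  Pages are numbered 1..NPAGES. */
static size_t free_len[NPAGES + 2];

static void
mark_free(size_t first, size_t len)
{
    free_len[first] = len;
    free_len[first + len - 1] = len;
}

/* Free the run [first, first+len), coalescing with adjacent free
 * runs in constant time. */
static void
run_free(size_t first, size_t len)
{
    size_t l = free_len[first - 1];   /* left neighbor's run length */
    size_t r = free_len[first + len]; /* right neighbor's run length */

    if (l != 0) { first -= l; len += l; }
    if (r != 0) len += r;
    mark_free(first, len);
}
```

Splitting is symmetric: carving k pages off a free run only rewrites the two affected endpoint entries, so it is O(1) as well -- versus O(lg n) for every red-black tree operation the earlier code performed.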

In essence, my initial failure was to disregard the difference between an O(1) algorithm and an O(lg n) algorithm. Intuitively, I think of logarithmic-time algorithms as fast, but constant factors and large n can conspire to make logarithmic time not nearly good enough.

21 April 2008

Left-leaning red-black trees are hard to implement

Back in 2002, I needed balanced trees for a project I was working on, so I used the description and pseudo-code in Introduction to Algorithms to implement red-black trees. I vaguely recall spending perhaps two days on implementation and testing. That implementation uses C preprocessor macros in order to make it possible to link data structures into one or more red-black trees without requiring container objects.

About the same time, Niels Provos added a similar implementation to OpenBSD, which was imported into FreeBSD, so when I imported jemalloc into FreeBSD, I switched from my own red-black tree implementation to the standard one. Unfortunately, both implementations use nodes that include four pieces of information: parent, left child, right child, and color (red or black). That typically adds up to 16 or 32 bytes on 32- and 64-bit systems, respectively. A few months ago I fixed some scalability issues Stuart Parmenter found in jemalloc by replacing linear searches with tree searches, but that meant adding more tree links. These trees now take up ~2% of all mapped memory, so I have been contemplating ways to reduce the overhead.

A couple of weeks ago, I came across some slides for a talk that Robert Sedgewick recently gave on left-leaning red-black trees. His slides pointedly disparage the use of parent pointers, and they also make left-leaning red-black trees look simple to implement. Left-leaning red-black trees maintain a logical 1:1 correspondence with 2-3-4 B-trees, which is a huge help in understanding seemingly complex tree transformations.

Last Monday, I started implementing left-leaning red-black trees, expecting to spend perhaps 15 hours on the project. I'm here more than 60 hours of work later to tell you that left-leaning red-black trees are hard to implement, and contrary to Sedgewick's claims, their implementation appears to require approximately the same amount of code and complexity as standard red-black trees. Part of the catch is that although standard red-black trees have additional cases to deal with due to 3-nodes that can lean left or right, left-leaning red-black trees have a universal asymmetry between the left and right versions of the algorithms.

If memory overhead weren't my primary concern for this project, I would have dropped red-black trees in favor of treaps. Unfortunately, treaps require either recursive implementation or parent pointers, and they also require an extra "priority" field, whereas red-black trees can be implemented without recursion or parent pointers, and it is possible to stuff the red-black bit in the least significant bit of one of the left/right pointers.
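Stuffing the color bit into a pointer works because nodes are at least pointer-aligned, so the least significant bit of a child pointer is always zero. A minimal sketch of the accessors follows; the field and function names are illustrative, not jemalloc's actual rb.h macros:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct rb_node {
    struct rb_node *left;
    /* right_red packs the right-child pointer and the node's color
     * into one word: bit 0 holds red (1) / black (0), and the
     * remaining bits hold the (aligned) pointer. */
    uintptr_t right_red;
    int key;
} rb_node_t;

static rb_node_t *
rb_right_get(const rb_node_t *n)
{
    return (rb_node_t *)(n->right_red & ~(uintptr_t)1);
}

static int
rb_red_get(const rb_node_t *n)
{
    return (int)(n->right_red & 1);
}

static void
rb_right_set(rb_node_t *n, rb_node_t *right)
{
    /* Preserve the color bit while replacing the pointer. */
    n->right_red = (uintptr_t)right | (n->right_red & 1);
}

static void
rb_red_set(rb_node_t *n, int red)
{
    /* Preserve the pointer while replacing the color bit. */
    n->right_red = (n->right_red & ~(uintptr_t)1) | (uintptr_t)(red != 0);
}
```

The price is that every pointer read and write goes through a mask, which is the "bit twiddling overhead" measured below.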

For the curious or those in need of such a beast, here is my left-leaning red-black tree implementation. One point of interest is that my benchmarks show it to be ~25% slower than my standard red-black tree implementation. The red-black bit twiddling overhead only accounts for about 1/5 of the slowdown. I attribute the other 4/5 to the overhead of transforming the tree on the down pass, rather than lazily fixing up tree structure violations afterward.

[26 April 2008] I did some further experimentation to understand the performance disparity between implementations. The benchmarks mentioned above were flawed, in that they always searched for the most recently inserted item. Since top-down insertion/deletion is more disruptive than lazy fixup, the searches significantly favored the old implementation. I fixed the benchmarks to compute the times for random searches, random insertions/deletions, and in-order tree traversal.

The old rb.h and sys/tree.h perform essentially the same for all operations. The new rb.h takes almost twice as long for insertion/deletion, is the same speed for searches, and is slightly faster for iteration. Red/black bit twiddling overhead accounts for ~6% of insertion/deletion time, and <3% of search time.

I am actually quite pleased with these benchmark results, because they show that for random inputs, left-leaning red-black trees do not noticeably suffer from the fact that their maximum height is 3h rather than 2h, where h is the height of an equivalent fully balanced tree.

03 April 2008

Using Mercurial patch queues for daily development

I recently watched a video (slides) of Bryan O'Sullivan speaking about Mercurial. The presentation was mainly a (great) introduction to Mercurial, but I was surprised to learn that Mercurial patch queues could be useful even when using a repository that I have full commit access to. In a nutshell, Bryan described how he uses patch queues to checkpoint his work without cluttering the permanent revision history. Checkpointing is mainly useful to me when I am about to try a risky programming solution on top of reasonable code that only partially implements a feature. Historically, I have archived my entire sandbox at such critical points, but patch queues are a much cleaner solution; they make it possible to separate work into distinct patches and checkpoint regularly without performing heavyweight archiving operations. Note that reverting to an earlier state is much easier with patch queues, which makes failed experiments much less costly. This all sounds great, but it took me several hours and a lot of mistakes to actually figure out how to use patch queues in this fashion, so I'm recording the solution here with the hope that it will be useful to others.

The first step is to enable the mq extension (see Configuration directions), though it is enabled by default on my Ubuntu 7.10 systems, and in fact following the standard configuration directions blindly causes some strange warnings.

Following is a terse example of how to perform every operation that I find useful when using patch queues for daily development:
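A sketch of the cycle I settled on follows (a command transcript, not a script; these are mq commands as of the Mercurial versions of the day, and exact flags may differ in yours):

```
hg qinit                               # one-time: create the patch queue
hg qnew -m "WIP: feature X" feature-x.patch
# ...edit code...
hg qrefresh                            # checkpoint: fold current changes into the patch
# ...riskier edits...
hg qrefresh                            # checkpoint again (repeat as often as you like)
hg qpop                                # set the patch aside...
hg qpush                               # ...and reapply it later
hg qrefresh -m "Implement feature X"   # edit the final commit message
hg qdelete -r feature-x.patch          # convert the applied patch into a permanent changeset
```

To abandon a failed experiment, revert the working copy and qrefresh back to the last checkpoint instead of restoring an archived sandbox.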

The trickiest parts of the above are committing/deleting with the qdelete command, and editing the commit message with qrefresh. I omitted the many ways of messing up the order of operations, so tread lightly and experiment with a toy repository before you use this mode of operation for real.

11 March 2008

Migrating from Subversion to Mercurial

jemalloc has settled into Firefox pretty nicely at this point, so after having mostly worked on Lyken for a few weeks while waiting for the dust to settle, I'm planning to start working on adding the necessary functionality to allow the Tamarin JavaScript engine to integrate without requiring a separately managed heap for garbage collection. One of the first things I ran into was that the Tamarin source code is available as a Mercurial repository, so it seemed like a good time to become familiar with yet another version control system (VCS).

Over the past ten years, there has been a proliferation of VCS's, especially those supporting distributed development models (Arch, darcs, BitKeeper, git, svk, Mercurial, Bazaar, etc.), but for some reason I've found it difficult to get excited about them. The biggest barrier for me has been perceived complexity, but that is perhaps attributable in part to lack of exposure. Well, I've been exposed to Mercurial now, and I really like it so far.

I've primarily been using Subversion for the past several years, and much to my surprise, Mercurial felt completely natural almost right away. In fact, it was immediately easier to deal with branching and merging than it has ever been for me with any other VCS. I have historically avoided branched development when at all possible, because it has been hard to make sure that the VCS was doing what I intended.

While Stuart and I were getting jemalloc working in Firefox, we were tossing patches back and forth constantly. I spent a total of ~2 days just dealing with patch merges, and changes were dropped on the floor on multiple occasions. It occurs to me now that I could have avoided the majority of this work if we had been using something like Mercurial. We wouldn't have lost changes, we wouldn't have had mystery failures due to subtle patch conflicts, and so on.

Mercurial is so cool that I spent almost two full days trying to migrate my Subversion repositories. In particular, I was initially trying to convert the Lyken repository, which consisted of 1023 revisions and perhaps 1000 files, with a couple of vendor code imports and one temporary branch (all pretty straightforward as repositories go). I tried all of the following:

hgsvn silently failed to commit 233 files, which made the resulting repository almost completely useless. I poked around in the code a bit and determined that fixing the problem myself would be a major undertaking.

yahg2svn could only handle 'trunk', 'branches', and 'tags' at the top level, and I had 'vendor' as well. I hacked on the code a bit and probably could have gotten it to work eventually, but I moved on in pursuit of easier solutions.

hg convert, which is an extension that comes with Mercurial, failed to do more than throw exceptions due to pickling failures.

Tailor mostly worked, though it was completely broken as installed on my Ubuntu/amd64 7.10 system, so I had to install it manually. It got confused by a handful of revisions, but it merely left them as unmerged branches, and the fallout was minimal.

I never did find a complete example of how to use Tailor to convert a Subversion repository to Mercurial format, so here's a bit more detail, in the hope that it will be of use to someone.

The command line I used was:

tailor -D -v -F "" --configfile lyken.tailor

The hard part though was coming up with the configuration file. Of course, the manual might have helped, had I found it before writing this blog post.

You can peruse the resulting repository to see what sorts of warts I had to clean up after the conversion. I have successfully converted several other repositories using the same method. The Onyx repository is giving Tailor a real workout though, since it consists of 3475 revisions and (this is the killer) due to how cvs2svn did things back when I switched to Subversion from CVS, there are 180 extant branches, 47 extant tags, and [gasp] 89087 extant files in the latest revision. It will probably take most of a day for Tailor to complete the conversion, and I can see in the log output that there are going to be a lot of problems in all the spontaneous branches cvs2svn generated.