Write Optimization: Myths, Comparison, Clarifications, Part 2

In my last post, we talked about the read/write tradeoff of indexing data structures, and some ways that people augment B-trees in order to get better write performance. We also talked about the significant drawbacks of each method, and I promised to show some more fundamental approaches.

We had two “workload-based” techniques: inserting in sequential order, and using fewer indexes, and two “data structure-based” techniques: a write buffer, and OLAP. Remember, the most common thing people do when faced with an insertion bottleneck is to use fewer indexes, and this kills query performance. So keep in mind that all our work on write-optimization is really work for read-optimization, in that write-optimized indexes are cheap enough that you can keep all the ones you need to get good read performance.

Today, I’ll draw some parallels with the write buffer and OLAP. Recall that the write buffer gets you a small insertion speedup but doesn’t really hurt query time, and OLAP gets you a big insertion speedup but doesn’t let you query your data for a long time. We’ll also use the fact that sequential insertions are orders of magnitude faster than random insertions.

With that in mind, let’s move on to the new generation of write optimization.

Two Great Tastes: Log-Structured Merge trees (LSMs)

We couldn’t manage to get both insertion and query performance out of either a write buffer or OLAP. But they’re similar techniques, just at two extremes.

LSMs are two great tastes that taste great together. To start, make the buffer big, but make it a B-tree, so you can query data as it arrives. In fact, suppose you have log(N) B-trees, B_1, B_2, …, B_log(N), each twice the size of the one before. If B-trees are slow, using more of them sounds crazy, but I promise we’re getting somewhere.

An LSM Tree with 4 levels

Each B-tree has twice the capacity of the one before it, so B_k can hold 2^k elements. When a new row arrives, put it in B-tree B_1. When B_1 reaches capacity (which in this case is 2 rows), dump those rows into B_2. B_2’s capacity is 4 rows. When B_2 overflows, you dump the items into B_3, which has a capacity of 8 rows, and so on. The trick here is that each time we dump things down to the next B-tree, they’re already sorted, so we get the insertion boost out of doing sequential insertions.
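The cascade of doubling levels can be sketched in a few lines. This is a toy illustration of the scheme just described (Python lists standing in for B-trees, with none of the real engineering), not any particular implementation:

```python
import bisect
import heapq

class SimpleLSM:
    """Toy LSM: level k holds at most 2**(k+1) sorted rows; an
    overflowing level is merged, in one sequential pass, into the
    next (larger) level."""

    def __init__(self):
        self.levels = []  # levels[k] is a sorted list, capacity 2**(k+1)

    def insert(self, key):
        carry = [key]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append([])
            # both runs are already sorted, so this is a sequential merge
            merged = list(heapq.merge(self.levels[k], carry))
            if len(merged) <= 2 ** (k + 1):
                self.levels[k] = merged
                return
            self.levels[k] = []   # overflow: dump everything one level down
            carry = merged
            k += 1

    def contains(self, key):
        # a query has to look in every tree -- the log-factor slowdown
        for lvl in self.levels:
            i = bisect.bisect_left(lvl, key)
            if i < len(lvl) and lvl[i] == key:
                return True
        return False
```

Note that `contains` has to probe every level; that is exactly the query penalty discussed below.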

The first log(M) B-trees are in memory (where M is the size of memory). A simple optimization is to have just one B-tree for all these levels, because in-memory B-trees are fast. Once you start flowing out of memory, you are always merging one tree with another that has at most twice as many rows. This way, the smaller B-tree can be treated like the large, OLAP-style buffer, and you get a similarly large speedup; in fact, this merge happens at disk bandwidth speeds.

Not so fast, you say: You don’t get to use all the bandwidth, because each row gets moved from B-tree to B-tree, and it uses up bandwidth each time. This is true, but it turns out that you’re operating at a 1/log(N/M) fraction of bandwidth, which is a lot better than a B-tree, by orders of magnitude.

Alas, the queries are not so great. Even though we made the buffer into B-trees, which are good for queries, you now need to do a query in each one. There are log(N/M) of them on disk, so this ends up being slower than a single B-tree by a log(N/M) factor. There’s that pesky tradeoff again: much better than the B-tree’s tradeoff, but still not the mathematically optimal one.

One last point: if instead of growing the B-trees by a factor of 2, you grow them by a larger factor, you slow down your insertions but speed up your queries. Once again, the tradeoff emerges.
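For concreteness, here is the standard back-of-the-envelope version of that tradeoff (a sketch, with N rows, memory size M, block size B, and growth factor k):

```latex
% With growth factor k there are about \log_k(N/M) on-disk levels.
% Each row is merged into every level on its way down, and a level can
% be rewritten up to ~k times before it overflows, so:
\text{amortized insert cost} \approx O\!\left(\frac{k}{B}\,\log_k\frac{N}{M}\right)\ \text{I/Os},
\qquad
\text{query cost} \approx O\!\left(\log_k\frac{N}{M}\right)\ \text{tree searches}.
```

Raising k shrinks the number of levels (faster queries) but inflates the per-level rewrite work (slower inserts): the same knob, seen in the math.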

Have Your Cake and Eat It Too: COLAs

A COLA (that’s Cache-Oblivious Lookahead Array) is a lot like an LSM, with the queries done in a better way. To begin with, you use a technique called fractional cascading to maintain some information about the relationship from one level to the next. This information can take many forms, but what’s important is that you don’t restart your query at each level and end up doing a full B-tree query log(N) times. Instead, you get to do a small local search. If you do things just right, you can match the query time of a single B-tree. This is true even if you are doubling your structures at each level, so in addition, COLAs are as fast at insertions as LSMs.
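The local-search trick can be made concrete with a small sketch. This is a toy version of fractional cascading (every other element of each level copied up into the level above as a “ghost” with a precomputed down-pointer), meant only to show why one binary search at the top suffices; it is not the g-COLA layout from the paper:

```python
import bisect

def build_cascade(levels):
    """levels: sorted lists, smallest level first. Returns augmented
    levels whose entries are (key, is_real, down): every other entry of
    level i+1 is copied up into level i as a ghost, and `down` is the
    insertion point of the key in augmented level i+1."""
    aug = [[(key, True, None) for key in levels[-1]]]
    for lvl in reversed(levels[:-1]):
        below = aug[0]
        below_keys = [e[0] for e in below]
        merged = sorted([(k, True) for k in lvl] +
                        [(e[0], False) for e in below[::2]])
        aug.insert(0, [(k, real, bisect.bisect_left(below_keys, k))
                       for k, real in merged])
    return aug

def member(aug, x):
    """One full binary search at the top; every later level is only a
    constant-size local scan around the followed pointer."""
    p = bisect.bisect_left([e[0] for e in aug[0]], x)
    for i, level in enumerate(aug):
        while p > 0 and level[p - 1][0] >= x:    # tiny local adjustment:
            p -= 1                               # ghosts bound the gap
        while p < len(level) and level[p][0] < x:
            p += 1
        q = p
        while q < len(level) and level[q][0] == x:
            if level[q][1]:                      # a real (non-ghost) hit
                return True
            q += 1
        if i + 1 < len(aug):
            p = level[p][2] if p < len(level) else len(aug[i + 1])
    return False
```

Because each level carries every other element of the level beneath it, the down-pointer lands within a constant distance of the right position, so only the topmost search is a full binary search.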

Let me repeat that: they match B-trees for queries while simultaneously matching LSMs for insertions. It’s nice to note that COLAs are on the mathematically optimal write/read tradeoff curve, and they’re a proof, by example, that B-trees are not optimal.

COLAs are on the optimal read/write tradeoff curve

This flavor of data structure, which combines the insertion speed of the size-doubling hierarchy of sorted structures (the LSM) with the query speed boost of fractional cascading, goes by many names and can be found dressed up in a bunch of surprising ways, but the underlying math, as well as the performance, is exactly the same.

For bonus points, if you read my colleagues’ paper on COLAs, you’ll see that they are described as being log(B) slower than B-trees on queries. This log(B) is easily recouped in practice—giving you the same query speed as B-trees—if you give up so-called cache obliviousness (a property which is nice mathematically, but not as nice as having faster queries).

Write Optimization is the Best Read Optimization

I’ve been focusing on write optimization, and Fractal Trees do go a couple of orders of magnitude faster than B-trees for indexing that pesky non-sequential data. What that means for the user is typically read optimization: you start adding all the indexes you needed all along, since indexes are so wonderfully cheap to update. My motto is: write optimization is the best read optimization!

You can get COLA-style read-optimal, write-optimized goodness here at Tokutek, where it is marketed as Fractal Trees and available in TokuDB for MySQL and MariaDB.

Background threads can help you reduce latency, increase concurrency, and put RAID systems to better use. Industrial-strength LSM tree implementations certainly use them, and TokuDB’s Fractal Tree indexes use them too, for the same reasons. But they don’t change the overall number of I/Os you need to do to accomplish operations like insert and search in a given data structure. That’s why the academic literature on these types of data structures completely ignores background threads. They really have no effect on asymptotic behavior. Let’s take the LSM tree as an example of why.

If I were going to implement an LSM tree, I’d probably start with one in-memory tree (maybe even a red-black tree or something, just for simplicity), and give it a size limit (maybe 1/4 of RAM). I’d call this the “level 0” tree. As inserts come in, I would just stick them right in there (and log them for recovery, of course). When the in-memory tree reaches its limit, I would iterate over it and insert all of its elements into level 1, which is a significantly larger tree on disk. Then I’d just clear out the level 0 tree and keep inserting there. When it fills up again, I would again iterate over it and “flush” it down to level 1, this time inserting into a tree that already had some things in it.

At some point, the level 1 tree would fill up. Then I’d iterate over it and insert all of its elements into a level 2 tree that’s even bigger, again on disk. Then I’d clear out the level 1 tree and continue. This sort of operation is called a “cascading flush,” and sometimes these cascading flushes can get really big and cause one operation to suddenly have to wait a really long time. That’s why you want background threads doing the flushing.

And they do help, certainly. When level 0 gets full, you can freeze it, hand it off to a background thread to be flushed into level 1, and then let your client start inserting into a fresh level 0 tree. When the background thread finishes, it just gets rid of the old tree because everything in it has gotten down to level 1. Sure, there’s a lot of trickiness you need to do in order to also search the data structure while a background thread is flushing, but it’s doable.

But it breaks down if you try to insert too fast. Suppose you fill up a whole new level 0 tree before the background thread even finishes flushing the first one. You can’t just keep making fresh level 0 trees; you’ll run out of memory! So you have to stall the client thread until the background thread has caught up with it, at least somewhat.
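The freeze/flush/stall dance above can be sketched with a bounded hand-off queue. This is a toy (a sorted list standing in for the level 0 tree, and searching-during-flush deliberately omitted), not how any production engine is written:

```python
import bisect
import heapq
import queue
import threading

class BackgroundFlushLSM:
    """A full level-0 "tree" is frozen and handed to one background
    flusher thread. The maxsize=1 queue is the back-pressure: if a
    second tree fills before the first flush finishes, put() blocks
    and the client thread stalls, exactly as described above."""

    LIMIT = 4  # level-0 capacity; stands in for "1/4 of RAM"

    def __init__(self):
        self.level0 = []                       # in-memory tree
        self.level1 = []                       # larger "on-disk" level
        self.handoff = queue.Queue(maxsize=1)  # frozen trees awaiting flush
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def insert(self, key):
        bisect.insort(self.level0, key)
        if len(self.level0) >= self.LIMIT:
            self.handoff.put(self.level0)  # stalls if the flusher is behind
            self.level0 = []               # fresh level-0 tree for new inserts

    def _flush_loop(self):
        while True:
            frozen = self.handoff.get()
            # sequential merge of the frozen tree into level 1
            self.level1 = list(heapq.merge(self.level1, frozen))
            self.handoff.task_done()

    def drain(self):
        """Wait until every frozen tree has been flushed."""
        self.handoff.join()
```

The “trickiness” of also answering queries while a flush is in flight is real and is left out here on purpose.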

There are many tricks you can try to get around this, but eventually, they all break down if you have enough data coming in fast enough. Since the academic papers are concerned with asymptotic behavior (big-O), they are free to make their data sets as daunting as needed for a counterexample to a technique like background threading. That’s why—even though you’re right, background threads are great and everyone uses them—you won’t find them in a data structures paper.

So w.r.t. LSM trees, I’m seeing that at least one new implementation (LevelDB) uses a fanout of 10 instead of the 2 I am used to seeing in a COLA, for example. I.e., each successive level is 10x the size of the previous one. Assume that L0 can hold at most 1 element, L1 at most 10, L2 at most 100, etc…

This doesn’t affect the asymptotics since 10 is still a constant and inserts are still O((1/B) * log N) amortized, but there’s a huge honking constant (~55) hidden in the big-oh. What do you make of this, and why do you think this tradeoff was made?

p.s. In spam protection, you should start asking big-Oh related questions rather than algebra
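One plausible source of that ~55 constant (my own guess at the commenter’s arithmetic; nothing here is taken from the LevelDB source):

```latex
% With fanout 10, level i flushes into level i+1 ten times before
% level i+1 itself overflows. If the j-th flush rewrites everything
% already sitting in level i+1, it touches about j level-i-sized units
% of data, so the total merge work per level is
\sum_{j=1}^{10} j \;=\; \frac{10 \cdot 11}{2} \;=\; 55
% units, versus the 10 units of new data the level absorbed.
```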

Fanout represents a tradeoff (in the constant) between insert and query speed, which you seem to have discovered. If you increase the fanout, you get faster queries, because the tree is shorter and so you just have fewer levels you need to look at. But this (as you point out) makes your inserts slower, basically because with a shorter tree, there’s less “wiggle room” for you to amortize things before you hit the leaves. In a Fractal Tree, when you increase the fanout, you decrease the size of a flush, which is the number that represents the amount of amortizing going on. It’s a tuning parameter that doesn’t appear in the academic papers because it doesn’t affect the asymptotics, but in the analysis, it would be the denominator “B” in O((1/B) * log N).

Anyway, to answer your question, I don’t know why they picked 10. Our fanout right now is between 4 and 16, and in back-of-the-envelope calculations we usually throw around 10 as the “average” fanout we expect, so they’re in line with us. I wasn’t here when we picked the numbers 4 and 16, so I don’t know exactly, but here’s something that may explain it:

Given that a node is 1MB on disk (this is to balance seek times and bandwidth)—and this would have been before compression—you can fit 10,000 rows in a node if each row is 100 bytes (which is another guess we often use). If your fanout’s about 10, this means each buffer flush carries 1000 rows, and if you say your height is bounded by 10 (which is pretty extreme), this leads to a very hand-wavy estimate of a (drumroll…) 100x improvement in insertion performance, which is a nice thing to say on marketing materials.
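The back-of-the-envelope above, spelled out (every number here is one of the guesses quoted in that paragraph, not a measurement):

```python
node_bytes = 1_000_000   # ~1MB node, chosen to balance seeks and bandwidth
row_bytes = 100          # guessed average row size
fanout = 10              # the "average" fanout we throw around
height = 10              # pretty extreme bound on tree height

rows_per_node = node_bytes // row_bytes   # 10,000 rows fit in a node
rows_per_flush = rows_per_node // fanout  # 1,000 rows per buffer flush
# each row is flushed `height` times on its way to a leaf, so one
# node-sized write effectively moves rows_per_flush / height rows all
# the way down -- versus roughly 1 row per I/O for a B-tree
speedup = rows_per_flush // height
print(speedup)  # 100
```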

Additionally, with the data sizes and machines we’ve been thinking about, this usually makes it so that all the internal nodes fit in memory, which is a useful property.

Changing the fanout is something that we talk about from time to time, it’s somewhat open for discussion.

> In a Fractal Tree, when you increase the fanout, you
> decrease the size of a flush, which is the number that
> represents the amount of amortizing going on.

I am assuming 2 things:
1. I am assuming you are speaking of a tree that roughly looks like a buffer-tree (with sqrt(B) keys, and hence fanout, with the rest of the space being used for the insert buffer).
2. You keep the size of the node the same, which means that there’s less buffer space.

> … which is a nice thing to say on marketing materials.

That’s actually pretty neat as calculations go!

> Changing the fanout is something that we talk about from
> time to time, it’s somewhat open for discussion.

Is there any method to this “madness” of selecting the fanout/block-size? I was thinking that running a simple program that does a bunch of seeks and reads off 1-byte from each seeked location and then does reads of 1k, 2k, 4k, etc… and then compares where the seek time equals the read bandwidth would dictate the block size. i.e. If it takes 2ms to seek and 2ms on the average to read off 143kiB, then the block size should be 143kiB, since that’s where the 2 meet. Is this intuitively okay or do you see some holes in it?
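The commenter’s rule can be written down directly. The numbers below are the hypothetical ones from the comment (2ms seek, ~143KiB transferred in the same time), not measurements of any drive:

```python
def balanced_block_size(seek_seconds, bandwidth_bytes_per_sec):
    """Block size at which transfer time equals seek time: reading
    less wastes the seek; reading much more is bandwidth-dominated."""
    return int(seek_seconds * bandwidth_bytes_per_sec)

# e.g. a 2ms average seek on a drive streaming ~70 MiB/s:
block = balanced_block_size(0.002, 70 * 1024 * 1024)
print(block // 1024)  # ~143 (KiB)
```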

Yep, it looks like a buffer-tree, but with 4 <= F <= 16 instead of F = sqrt(B). And we keep the max node size the same (but it's allowed to be smaller, and it can be bigger for a moment before it flushes or splits).

Yes, your block size calculation method is a good one, but if I were doing it I'd just guess what the average drive does (or look at some manufacturer specs) and use that. I doubt you'll get a lot more out of actually running the experiment. But I'm interested in your results if you do!

Makes sense. Do you have any numbers on how insert/query performance is affected by changing these?

My guess would be that if F exceeds the amount that causes one page to exceed the “optimal block size”, then both suffer since you need to read more data than you would in the same time as it takes for a seek.

> And we keep the max node size the same (but it’s allowed to be
> smaller, and it can be bigger for a moment before it flushes or splits).

Am guessing this is an implementation issue to account for rows that cause the buffer to spill out of the allocated page size. And smaller might be to reduce the amount of data read in case most of the buffer space is unused. Please correct me if I am wrong, since these seem like useful optimizations.

Will let you know when I run tests. I am guessing the results will vary across HDD and SSD.

I don’t have those numbers but I think if we do run any experiments we’ll probably blog about it.

I don’t quite understand your second paragraph, and it may be due to a misunderstanding. Each node (I assume you meant that by “page”) is allowed to grow to a maximum size that we set (to 4MB). This is independent of the fanout. If just one buffer grows to 4MB and the rest are empty, that’s fine, we just have one very full buffer and many empty buffers, but the node doesn’t exceed the size we had set. Alternatively, we could have 10 buffers each at 400KB and that would also be a valid node. So I don’t see how a page can grow beyond the optimal block size because of any tuning of F.

Yeah, they’re mostly implementation details. When you split a full node, you end up with two roughly half-full nodes (so a node can be smaller). And when you flush into a node, it may temporarily become too full just before you flush it to one of its children. If you’re very unlucky, you could flush an entire 4MB buffer to a node that was almost full (making it 8MB). But since we only flush one buffer at a time (to prevent cascades from inducing lots of latency), if the messages are spread uniformly within that node, we might only be able to flush 512KB of it, and we’d end up with a 7.5MB node that we’d flush down slowly over the next few accesses.

This talks about disk efficiency quite a bit, but what about CPU efficiency? How do these data structures compare for CPU usage for writes, scans, and queries? It has been my experience that many data structures are built to optimize disk but fail to optimize well for CPU, leading to bottlenecks if the I/O subsystem is fast enough.

Lower CPU use can translate into less I/O bandwidth consumed: just as write optimization allows for more indexes and thus faster reads, spare CPU can be spent on delta compression of keys or compression of disk blocks.

Does this data structure support any delta-compression of keys (or similar)? I’m working with large, long keys where shared prefixes are common and long (20 to 150 bytes).

Briefly, pure insertion workloads are often CPU-bound, but keep in mind that this is handling a far more intensive workload than other engines would support.

In general, it depends what you measure. If you take an insertion workload that InnoDB can handle, run it on TokuDB, and rate-limit it to the speed that InnoDB showed, you won’t notice the CPU. If you have an in-memory workload, it’ll be CPU-bound either way. For out-of-memory point query workloads, we use a little more CPU for decompression, but it’s pretty negligible; you’ll notice it if you’re doing large scans, but those will also go a lot faster than on InnoDB, because InnoDB fragments badly.

As you said, CPU can be used for compression, and in fact we have compression on by default, and it’s a lot more effective because our block size is bigger. Right now you can choose between zlib, lzma, and quicklz (see our blog post introducing this last year). This is always the biggest user of CPU on workloads that show a lot of CPU usage (you can check me with perf). We have played with delta coding in the past and gotten some gains, and we might add it in the future, but I expect that with a data set like that, you’ll see great results just with the standard compression we offer.
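For keys with long shared prefixes like the ones described in the question, the simplest form of delta coding is front-coding. This is a sketch of the general technique only (it is not TokuDB’s on-disk format, which the post above says does not currently do delta coding):

```python
def front_encode(sorted_keys):
    """Store each key as (length of prefix shared with the previous
    key, remaining suffix). A long common prefix shrinks to one int."""
    encoded, prev = [], b""
    for key in sorted_keys:
        n = 0
        while n < min(len(prev), len(key)) and prev[n] == key[n]:
            n += 1
        encoded.append((n, key[n:]))
        prev = key
    return encoded

def front_decode(encoded):
    """Invert front_encode, rebuilding each key from its predecessor."""
    keys, prev = [], b""
    for n, suffix in encoded:
        prev = prev[:n] + suffix
        keys.append(prev)
    return keys
```

On sorted keys with 20- to 150-byte shared prefixes, the stored suffixes are a small fraction of the raw key bytes, and block compression then works on top of that.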

Information about cache-friendly trees is hard to come by, and mostly concerned with optimizing the on-disk portion, where I am concerned mostly with in-memory data-structures.

I read the COLA paper a few years ago and again this last week. You mention briefly that you can in practice speed up the queries. Can you point me in the direction how to do this?

In the paper, they implemented cache-aware COLAs, called g-COLAs. It was against these that they ran the performance tests, so I’m guessing it is something above and beyond this that you are talking about.

Sorry, when we talk about taking write-optimized data structures and getting back query speed, we’re still talking about disk accesses. What I mean by speeding up the queries is essentially just adding fractional cascading, which is already in the g-COLA (the pointer density parameter controls this). In fractal trees, we achieve the fractional cascading effect by actually having a real tree structure rather than having each level be a single array (this also helps latency a lot), but it’s isomorphic to fractional cascading as far as I’m concerned.

I unfortunately don’t have that much experience building cache-friendly data structures that close to the CPU, I don’t think I’ll be much help. :-/ You may want to talk to Dhruv Matani (http://dhruvbird.com/) if you haven’t already, I know he was working on implementing some things like this (i.e. https://github.com/duckduckgo/cpp-libface). Hope that helps!