This is Part 7 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts. In this series of articles, I describe the research and process through which I am implementing a key-value store database, which I have named “KingDB”.

In the previous articles, I have spent a fair amount of time reviewing existing key-value stores, interfaces, architectures, and I focused greatly on hash tables. In this article, I will talk about hardware considerations for storing data structures on solid-state drives (SSDs), and I will share a few notes regarding file I/O optimizations in Unix-based operating systems.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling out the form at the top right corner of the blog. As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

1. Fast data structures on SSDs

As I was researching resources on efficient ways to implement KingDB, I fell down a gigantic rabbit hole: solid-state drives. I have compiled the results of that research into another series of articles, “Coding for SSDs”[1], and I have summarized the most important points in the last article of that series: “What every programmer should know about solid-state drives”[3]. I encourage you to read the whole series if you have time, and if not, then at least have a sneak peek at the summary. All that research and this series of articles on SSDs were necessary, as they gave me the confidence that my design decisions for KingDB were taking me in the right direction.

In-place updates are useless

SSDs differ dramatically from regular HDDs. The most important thing about SSDs is that, unlike HDDs, nothing is ever overwritten. Updates to an existing page are applied in an internal register, a RAM module in the SSD, and the updated version of the page is stored at a different location on the SSD. The original page is considered “stale” until the garbage collection process erases it and puts it back in the pool of free pages. As a direct consequence, any algorithm trying to be clever by applying updates in-place will bring no performance gain, and will needlessly increase code complexity.

Metadata must be buffered or avoided

The smallest unit of data that can be written to an SSD is a NAND-flash page — 4 KB, 8 KB, or 16 KB depending on the SSD model — which can have serious consequences for any metadata stored on the drive. Imagine that a hash table needs to persist on the drive the number of items it is currently holding, as a 64-bit integer. Even though that number fits in 8 bytes, every time it is written to the drive an entire NAND-flash page will be written. An 8-byte write thus turns into a 16 KB write, every single time, leading to dramatic write amplification. As a result, when stored on SSDs, data structures need to cache their metadata and write it to the drive as infrequently as possible. Ideally, the file format will be designed so that this metadata never needs to be updated in place, or even better, never needs to be stored at all [2].
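To make this concrete, here is a minimal sketch in Python of metadata buffering (the class and its names are hypothetical, not KingDB code): an 8-byte item counter is kept in memory and persisted only every few thousand updates, instead of triggering a full NAND-flash page write on every single change.

```python
import struct

class CachedCounter:
    # Illustrative sketch (hypothetical helper, not part of KingDB): keep a
    # 64-bit item counter in memory and persist it only once every
    # `flush_interval` updates, instead of rewriting a full NAND-flash page
    # for every 8-byte change.
    def __init__(self, path, flush_interval=1024):
        self.path = path
        self.flush_interval = flush_interval
        self.count = 0
        self.unflushed = 0

    def increment(self):
        self.count += 1
        self.unflushed += 1
        if self.unflushed >= self.flush_interval:
            self.flush()

    def flush(self):
        # One 8-byte write per `flush_interval` updates; the SSD still turns
        # it into a page write, but it happens 1,024 times less often.
        with open(self.path, "wb") as f:
            f.write(struct.pack("<Q", self.count))
        self.unflushed = 0
```

The trade-off is durability: a crash can lose the updates accumulated since the last flush, which is why an even better design avoids persisting such metadata at all.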

Data is never really sequential

In SSDs, addresses that are sequential in the logical space do not correspond to addresses that are sequential in the physical space. This means that if you open a file and write to it sequentially, the data is not stored sequentially on the drive. Of course, if data is written sequentially within the alignment of a NAND-flash page, it will be sequential on the SSD, but two pages that are sequential in the logical space have very little chance of being stored sequentially on the drive. Thus for chunks larger than the size of a NAND-flash page, rearranging data sequentially in the hope of optimizing future reads will bring no improvement. Not only that, but the data being moved around eats up CPU time, and causes the NAND-flash memory modules in the SSD to wear out prematurely. Below the size of a NAND-flash page, storing data sequentially in the logical space still helps.

Random writes are fast if they are large enough

SSDs have several levels of internal parallelism. A “clustered block” is a set of NAND-flash memory modules that can be accessed concurrently, and has a size of 16 MB or 32 MB for most SSDs. If writes are performed in the size of at least one clustered block, they will use all the levels of internal parallelism, and reach the same performance as sequential writes [2].

Random reads are fast

Data written together in a buffer the size of a clustered block will use all the internal parallelism of an SSD. If the same data is later retrieved, the throughput will be very high, because the SSD will be able to fire up multiple memory modules at the same time, using all of its internal parallelism. For most workloads it is impossible to predict what data will be read, and therefore the database cannot take advantage of this aspect of the SSD architecture. However, read performance on SSDs is very good on average, whether reads are sequential or random. The focus for data structures persisted on SSDs should therefore be on optimizing for writes rather than for reads.

Log-structured storage

The goal I set myself for KingDB is to minimize the number of writes to the SSD incurred by incoming updates. The best way to cope with the characteristics of SSDs described above is to use the “log-structured” paradigm. Originally developed for file systems [4], the log-structured approach has since been adopted by many data structures. One of the most famous is the Log-Structured Merge-Tree (LSM tree), the data structure at the heart of the LevelDB key-value store [5].

The idea behind log-structured storage is simple: treat the incoming updates as a log stream, and write them to the drive sequentially, exactly as they come. Let’s take an example to make things clearer. Assume that a database has an entry with key “foo”, and that a new update, a Put(), comes in for the same key: the old value for that key needs to be replaced with the new value. A log-structured store keeps the old value where it is, and stores the new value in the free space just after the last incoming update. The index is then updated to make the key “foo” point to the new location. The space used by the old value will be recovered later during a compaction process. Again, note that the old value is not changed in place, which is good because SSDs cannot do in-place updates anyway.
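As an illustration, here is a toy append-only store in Python following that idea (a sketch only, not KingDB’s actual file format): put() appends the new value at the end of the log and repoints an in-memory index, leaving the old value where it is until a compaction pass would reclaim it.

```python
import os

class TinyLogStore:
    # Toy sketch of log-structured storage (not KingDB's actual format):
    # values are only ever appended, never overwritten in place.
    def __init__(self, path):
        self.file = open(path, "ab+")
        self.index = {}  # key -> (offset, size) of the latest value

    def put(self, key, value):
        self.file.seek(0, os.SEEK_END)
        offset = self.file.tell()
        self.file.write(value)       # sequential append, never in-place
        self.file.flush()
        self.index[key] = (offset, len(value))  # repoint the key

    def get(self, key):
        offset, size = self.index[key]
        self.file.seek(offset)
        return self.file.read(size)
```

After put("foo", old) followed by put("foo", new), both values sit in the file, but the index only knows about the newest one; the stale bytes are exactly what compaction later reclaims.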

Another interesting side effect of log-structured storage is that, because writes can be as large as one wants them to be, they can be on the order of magnitude of a clustered block (32 MB). Incoming random updates can be batched into a single large write operation, by buffering them in RAM for example, effectively turning a workload of small random writes into a workload of large sequential writes.
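That batching step can be sketched as follows (in Python, with a hypothetical flush threshold: 32 MB would match the clustered block size, a tiny value is used here for illustration):

```python
import os

class WriteBatcher:
    # Sketch: accumulate small random updates in RAM and flush them to the
    # log as a single large sequential write once `threshold` bytes have
    # been buffered. (Hypothetical helper, not KingDB code.)
    def __init__(self, fd, threshold):
        self.fd = fd
        self.threshold = threshold
        self.chunks = []
        self.size = 0

    def add(self, data):
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        if self.chunks:
            os.write(self.fd, b"".join(self.chunks))  # one large write syscall
            self.chunks, self.size = [], 0
```

With a threshold the size of a clustered block, every flush engages all the internal parallelism of the SSD, regardless of how small the individual updates were.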

2. File I/O optimizations

Every syscall has a cost

The days when every system call caused a full context switch are long gone. Nowadays, going from user land into kernel space only requires a “mode switch”, which is much faster as it does not require the CPU caches and TLB to be invalidated. These mode switches still have a cost, roughly 50 to 100 ns depending on the CPU [6].

In Chapter 13 of his excellent book “The Linux Programming Interface”, Michael Kerrisk benchmarked the time it takes to write 100 MB of data with different buffer sizes. In one of his tests, it took 72 seconds to write 100 MB with a buffer size of 1 byte, and only 0.11 seconds to write the same 100 MB with a buffer size of 4 KB. No actual file I/O was performed, as the changes were not synced to secondary storage; the test was effectively measuring the time spent in the system calls. This is why minimizing the number of system calls and choosing the correct buffer sizes are crucial to performance.
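The effect is easy to reproduce with a small micro-benchmark (a rough Python sketch, not Kerrisk’s exact setup): writing the same number of bytes with tiny buffers takes dramatically longer than with 4 KB buffers, purely because of the syscall count.

```python
import os
import tempfile
import time

def time_writes(total_bytes, buffer_size):
    # Write `total_bytes` using write() calls of `buffer_size` bytes each,
    # and return the elapsed wall-clock time. The smaller the buffer, the
    # more syscalls are needed to move the same amount of data.
    fd, path = tempfile.mkstemp()
    buf = b"x" * buffer_size
    start = time.perf_counter()
    for _ in range(total_bytes // buffer_size):
        os.write(fd, buf)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.remove(path)
    return elapsed
```

Even at a modest scale such as 64 KB, the 1-byte-buffer version issues 65,536 syscalls against 16 for the 4 KB version, and the timing difference is plainly visible.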

Memory maps

Memory maps are the best thing known to mankind after hash tables. A memory map is basically a mapping of an area of a file into an address range in memory. That address range can then be accessed, for both reads and writes, through a pointer like any regular buffer of allocated memory, and the kernel takes care of actually reading and writing the data to the file on the drive. There are countless success stories of improvements achieved by replacing calls to read() and write() with memory maps, an interesting one being covered in the SQLite documentation [7]. Optimizing file I/O is hard, and most of the time it is better left to the kernel. As a consequence, I will try to use memory maps as much as possible for KingDB. I will allow myself the option of using calls to read() and write() if I judge that it can speed up development. Vectored I/O, also known as scatter-gather I/O, is another tool worth mentioning, as it can offer substantial performance improvements and deserves a try whenever applicable [8].
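As a quick illustration of the API (a generic Python sketch, not KingDB code), here is how a region of a file can be updated through a memory map, with a slice assignment standing in for the pointer dereference you would use in C:

```python
import mmap

def patch_file_via_mmap(path, offset, data):
    # Update len(data) bytes at `offset` through a memory map: the slice
    # assignment writes through the mapping, and the kernel takes care of
    # paging the affected file pages in and out.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[offset:offset + len(data)] = data
            mm.flush()  # hint the kernel to write dirty pages back now
```

No explicit read() or write() syscall appears per update: once the mapping exists, accesses are plain memory accesses, and the kernel schedules the actual file I/O.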

Zero-copy

Most system calls copy data from a buffer in user land to another buffer in kernel space, or conversely, causing a lot of unnecessary data duplication. Zero-copy techniques exist precisely to minimize the amount of data that needs to be copied around, with the goal of copying zero bytes between user space and kernel space. The sendfile() syscall, which sends a file to a socket, is a typical example of a zero-copy transfer. A buffer is still required for sendfile(), as pages from the file need to be loaded into RAM before they are sent to the socket, but this buffer remains in kernel space, and no data is copied between user and kernel space [9, 10, 11]. Memory maps also perform zero-copy operations, so it is reasonable to stick to memory maps and not rush into using specialized system calls such as sendfile() and splice().
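For illustration, here is what a sendfile()-based transfer looks like from Python on Linux (a sketch: the socket is a stand-in for a real client connection): the file’s pages travel from the page cache straight to the socket without ever entering a user-space buffer.

```python
import os

def send_file_zero_copy(path, sock):
    # Ship a whole file into a socket with sendfile(): the kernel moves the
    # pages directly from the page cache to the socket buffer, so no byte
    # is copied through user space.
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
```

The equivalent read()/write() loop would copy every byte into a user-land buffer and back out again, which is exactly the duplication zero-copy avoids.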

Data copying is a very serious source of wasted time in a database system, as the total time spent needlessly duplicating data can add up to large latencies. It can happen anywhere, not only when using syscalls, and must be monitored closely.

3. Done is better than perfect

In the two sections above, I have covered the most important points regarding storing data structures on SSDs and file I/O optimizations. I consider those sections only a general direction, from which I can steer away at any moment if I gather data proving that other options are preferable. Also, it is fine to cut corners, because my goal is to get a working key-value store as soon as possible: done is better than perfect.

I have started implementing KingDB, the key-value store related to this series of articles. The source code, still under development, is available at http://kingdb.org. In the next article, I will review the architecture of KingDB and its components.


On mmap(): Earlier this year I wrote a key value store called SharedHashFile [1] which only uses mmap(), and it is very fast: it can achieve over 10 million reads per second on a single box with 100 million short keys. I don’t think it would be possible to achieve this kind of performance using system calls to perform I/O. Why? Because of the non-zero-copy nature of using the system calls. However, I agree with you that write speeds can become a problem, because the OS decides when to write the memory-mapped page to backing store, and the user program has no control over this AFAIK. This seems to be okay for read-heavy key value store functionality.

Another problem with memory maps is corruption. If the interface to the hash table is code and the memory maps exist inside the process of the hash table caller, then the potentially buggy code of the caller can simply corrupt the hash table memory maps causing problems. An example of a key value store which has tried to work around this — but with enormous performance trade-offs — is lightningdb [2]. Here the key value store is using read-only memory maps at run-time for fast access where the caller process cannot corrupt the memory maps. However, writing to the key value store is very slow in comparison and also cannot be done asynchronously.

On cache lines: Even using memory-mapped files — and assuming everything is in memory — I have found that the performance is largely dictated by the number of cache line fetches. Therefore it is important to have an algorithm which reduces these to a minimum. SharedHashFile manages to reduce most key accesses to three cache line fetches.

On locking: Probably the other biggest thing influencing performance is lock contention. So SharedHashFile uses a fair read write spin lock. This relies on atomic instructions which are slower than regular instructions. I believe that to achieve higher performance it will be necessary to reduce the number of atomic instructions and this can probably only be done at the algorithm level?

On zero copy: I think this is very difficult or impossible to achieve. Why? When the hash table grows then it’s going to be necessary to internally relocate keys and values otherwise one will end up with memory holes due to other keys and values being deleted. For this reason alone then it’s going to be necessary for a key look-up to return a copy of a value instead of a pointer to the bytes in memory. A secondary reason is that an asynchronous write to the same key may want to update the value just after it has been read. Things get very complicated for the caller if it is suddenly responsible for locking access to the value itself.

What else to think about? How the key value store grows? For the next generation of SharedHashFile I am thinking about how to reduce the initial, smallest size of the key value store. Why? I would like to implement a hierarchical key value store. There are two obvious ways to do this: (1) By building the hierarchy on top of the existing key value store using the existing keys. This has the disadvantage that if you have a folder with e.g. 3 keys in it, then those keys would be just regular keys and therefore could be distributed anywhere within the n keys of the complete set of all keys. A secondary disadvantage is that to delete a complete folder, one would have to iterate over all keys and delete them individually. (2) By creating each folder in the hierarchy from a new instance of the key value store. This means that in the example with a folder with 3 keys, the keys would be stored close to each other in memory for better access. And to delete the folder, it would be possible not to iterate over all keys, especially if the entire folder key value store is stored in the values of keys in the parent folder… just delete those keys. This scheme would not work very well today with SharedHashFile, because the smallest overhead for a key value store instance is substantial, meaning that memory would quickly be exhausted when creating a hierarchical key value store with many folders.

I know you’ve done a lot of research around key value store technology and I’d love to hear your thoughts on these topics. And I will follow the progress of KingDB with interest. Good luck!

Wow, that’s an awesome comment! Thank you for sharing your results on SharedHashFile, I took a look at the source code, that’s a cool project 🙂

In the case of SharedHashFile, you’re definitely right when you say that minimizing cache line access is critical. Also, as the size of the table increases, the average number of cache lines accessed on each read operation will increase too, and my expectation is that performance will take a blow every time the average distance to initial bucket gets larger than the size of one of the CPU caches.

And I agree with you when you say that writing to memory maps is dangerous. The problem is the same with regular writes, and will always happen whenever data is overwritten. Most database systems cope with that by never overwriting anything and by first persisting the incoming changes to a new location on the drive, then updating the actual data structure. Write-ahead logging and ARIES are examples of such techniques. KingDB uses a log-structured approach to cope with the problem.

The code of KingDB is still in progress, but so far I have decided not to implement the hash table myself, and simply to use an STL multimap. That multimap associates one or more locations on secondary storage with any hash of a key. Thus if the key value store holds entries of the form <key, value>, the multimap will hold <hash(key), location>. The bottleneck for me is going to be in the file I/O, thus not optimizing for cache line accesses is fine. Also, incoming updates are persisted on the drive as soon as they arrive, in a log-structured way. It’s not as efficient as memory mapping everything, but it keeps the code simple and limits the amount of data that needs to be moved around. More on that in the next blog articles! 🙂

Finally, your current results of 10M reads per second with SharedHashFile are very impressive! However, you mentioned that you were using short keys, so except for the initial load of each file system page from secondary storage, chances are that the entire working set is read from memory. I’d be curious to know more about this test: what size are your entries, and what is the performance with larger entries? Also, what is the load factor of your hash table at the moment the reads are performed?

Sorry I did not reply earlier… is it intended that the blog would notify me of your reply? Because it did not 🙁

Regarding the comment “as the size of the table increases, the average number of cache lines accessed on each read operation will increase too”: This is exactly the thing to avoid, and it’s also what SharedHashFile is designed to avoid; the average distance to any key and value should remain constant no matter how big the key value store becomes. Does KingDB also guarantee this? And if so, how?
I disagree with you that cache line optimization should be discarded if there is a bigger bottleneck, e.g. file I/O. Why? Let’s say the hashtable holds 100 million unique keys. Let’s say those keys belong to 100,000 objects and each object has on average 1,000 keys. If we can arrange the key value store in some sort of hierarchy (which I am dreaming about for the next version of SHF) then it may be possible to group that set of 1,000 keys closer together within the entire data set. This potentially means 1,000 times fewer disk I/O accesses than if the keys are scattered randomly across the 100 million keys. You can potentially have access to the 1,000 keys in memory after just a couple of disk accesses. Therefore, the access speed of the remaining 999 keys — after the first key is read from disk — might remain critical, and therefore cache line usage might remain critical. It’s a bit unfair talking about this if KingDB is intended to be a flat key value store. But in a world where OO is very fashionable — and to some unthinkable not to have — wouldn’t it be great if we could design a key value store whereby certain keys can be grouped together as if in an object? Here the key access time would profit from being ‘closer’ to other keys on disk.

Regarding the performance figures for SharedHashFile: There are lots of tests shipped with SHF. So it’s easy to build and run the tests on your own box. I haven’t tried a test with larger entries but that sounds like an interesting thing to do 🙂 An interesting thing about the SharedHashFile algorithm (described in the header file) is that the “load factor” is not critical to performance. You can start off with 1 key, and add n keys incrementally and the read and write performance should always be the same as long as you have enough main memory 🙂

The blog didn’t have a comment notification option. Thank you for pointing that out, I just installed a plugin for that and now all readers can subscribe to comments! 🙂

The way that KingDB deals with collisions is that it groups all entries with the same hashed keys and stores them contiguously on disk. Because KingDB uses a log-structured approach, the initial writes will obviously not do that, and the grouping will occur during the compaction process. Once the compaction process has happened, all entries hashing to the same bucket are stored together, which allows the correct one to be retrieved with a single sequential read operation. Of course, the compaction process incurs more writes than required as it needs to rewrite entries in order to group them, but that’s always a choice that the design of a database has to make: optimize for reads, or optimize for writes. In the case of KingDB, bursty writes are optimized by the log-structured storage, and in the long run, reads will be optimized thanks to the grouping done during compaction.

A hierarchical structure is a good idea to minimize random reads, like what B-trees are doing. But this comes at the cost of more complexity in the code and reduced performance with large batches of writes.

In the end there is no best solution, it all depends on the workload that the database will have to deal with.

I’m trying to understand the following sentence: ‘As a direct consequence, any algorithm trying to be clever by applying updates in-place will bring no performance gain, and will needlessly increase code complexity.’ Isn’t it supposed to be avoiding updates in-place instead? If not, could you please explain it with other words?

With HDDs, you can get chunks of contiguous logical memory mapped to contiguous physical memory on the disk. In that case, to speed up reads, you would want to store your data in the same sequential order that it’s going to be read in, and thus it makes sense to sometimes re-order data on the disk to guarantee this sequentiality. I could be wrong, but I believe that this is what is happening with primary keys in InnoDB (MySQL), i.e. data is being re-ordered by the InnoDB engine to be sorted by primary key, in order to speed up primary key lookups.

A lot of the “sequential ordering” philosophy is still around, but it no longer applies to systems that use SSDs for their storage [1]. The same goes for reserving sequential space in anticipation of the future growth of a database, and then doing in-place updates in those locations. Therefore yes, with SSDs it’s better to avoid in-place updates, and one of the most common approaches nowadays is to use log-structured storage.