Friday, February 18, 2011

In the past few days I've been working on a feature in the Neo4j Graph Database to store short strings with less overhead. I'm pleased to announce that this feature is now in trunk and will be part of the next milestone release. In this blog post I will describe what a short string is, and how Neo4j now stores them more efficiently.

At Neo Technology we spend one day each week on what we call "lab projects": a chance to explore new, experimental features outside the regular roadmap that might prove useful. Two weeks ago, as a lab day project, I spiked a solution for storing short strings in a compressed form. To understand why that matters, we first need a bit of background on how strings are usually stored in Neo4j. Since strings can be of variable length, Neo4j stores them in something called the DynamicStringStore, which consists of a number of 120-byte blocks, each with a 13-byte header. A string is divided into chunks of 60 characters (a character is two bytes), and each chunk is stored in its own block. For a short string such as "hello", which encoded in UTF-8 would occupy only 5 bytes, the overhead of storing it in the DynamicStringStore (including the 22-byte property record needed to reference the block in the DynamicStringStore) is almost 97 percent!
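As a sanity check, that figure works out like this (a sketch using only the sizes quoted above; the class and method names are mine, not Neo4j's):

```java
// Back-of-envelope check of the ~97% overhead figure, using the sizes
// from the post: a 120-byte block with a 13-byte header, plus a
// 22-byte property record referencing the first block.
public class ShortStringOverhead {
    static final int BLOCK = 120 + 13;      // one DynamicStringStore block incl. header
    static final int PROPERTY_RECORD = 22;  // record holding the reference to the block

    public static double overheadPercent(int payloadBytes) {
        int total = BLOCK + PROPERTY_RECORD;
        return 100.0 * (total - payloadBytes) / total;
    }

    public static void main(String[] args) {
        // "hello" is 5 bytes in UTF-8: 150 of 155 bytes are overhead
        System.out.printf("overhead: %.1f%%%n", overheadPercent(5));
    }
}
```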

My initial spike analyzed all strings of 9 characters or fewer, and if every character was found to be 7-bit ASCII, stored the string directly in the property record, without involving the DynamicStringStore at all. The 7-bit part is important. The property record contains a 64-bit payload field, which, when the DynamicStringStore is involved, holds the id of the first block. Nine 7-bit characters sum to 63 bits, which fits in the 64-bit payload field, with the high-order bit set to denote that the content is a full 9-character string. For shorter strings the high-order bit is not set; instead the first byte holds the length of the string, and the remaining 56 bits (7 × 8) hold the actual characters.
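The scheme above can be sketched roughly as follows. This is an illustrative reimplementation, not Neo4j's actual code; the class and method names are hypothetical:

```java
// Sketch of the initial spike: pack up to 9 seven-bit ASCII characters
// into a single 64-bit payload value. Not Neo4j's actual implementation.
public class SevenBitPacker {
    /** Returns the packed payload, or null if the string doesn't fit. */
    public static Long pack(String s) {
        if (s.length() > 9) return null;
        long payload = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) return null;           // not 7-bit ASCII
            payload = (payload << 7) | c;        // 7 bits per character
        }
        if (s.length() == 9) {
            payload |= 1L << 63;                 // high bit: a full 9-char string
        } else {
            // shorter strings: first byte holds the length,
            // the low 56 bits hold the characters
            payload |= (long) s.length() << 56;
        }
        return payload;
    }

    /** Inverse of pack, for round-trip checking. */
    public static String unpack(long payload) {
        int len = (payload & (1L << 63)) != 0 ? 9 : (int) (payload >>> 56);
        char[] chars = new char[len];
        for (int i = len - 1; i >= 0; i--) {
            chars[i] = (char) (payload & 0x7F);  // last character is in the low 7 bits
            payload >>>= 7;
        }
        return new String(chars);
    }
}
```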

While this started out as something I thought would be a fun project to hack on for a day, we quickly found a use for it. When importing the OpenStreetMap data for Germany with this feature in place, we found that the DynamicStringStore was 80% smaller than before! Not only that, but the time for reading and writing strings had improved by at least 25%! (The benchmark I got this from creates nodes and relationships as well, so pure string operations are probably even faster.) Such figures are great for getting a feature into the backlog.

I am not a big fan of ASCII, though. It was designed for communicating with line printers, not for storing text. Also, with short strings the number of exotic characters that people use drops significantly; a short string is more likely to be some simple alphanumerical name or identifier, such as "hello world", "UPPER_CASE", "192.168.0.1", or "+1.555.634.5773". So the next thing I did was write a tool that could analyze the data stored in actual Neo4j instances and generate a report with statistics on the strings actually stored. I then sent it to our public users mailing list. The feedback confirmed my suspicions about the kind of text people store, and also suggested that we would be able to store up to 65% of our users' strings as short strings.

Armed with statistics about actual strings, I set out (along with my most recent colleague, Chris Gioran) to write an even better short string encoding and incorporate it into Neo4j. Last night we pushed it to git. The format we ended up with can select between 6 different encodings, all identified by the high-order nibble of the payload field of the property record:

Numerical: up to 15 characters, binary-coded decimal, with the six remaining codepoints used to encode punctuation characters commonly used in phone numbers or as thousand separators. This can encode any integer from -10^15 to 10^16 (both exclusive), most international phone numbers, IPv4 addresses, etc.

Alphanumerical: strings up to 10 characters, including space and underscore. Supports mixed case.

European: words up to 9 characters; this includes alphanumerical characters, space, underscore, dash, dot, and the accented characters in the Latin-1 table. Useful for building translation graphs.

Latin-1: up to 7 characters. Will give you parentheses if you have those in a short string.

UTF-8: if the string can be encoded in 7 bytes (or fewer). Useful for short CJK strings, for example.
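To make the tier selection concrete, here is a rough classifier in the spirit of the list above. The character classes are approximations of mine; the actual tables in Neo4j differ in detail:

```java
import java.nio.charset.StandardCharsets;

// Illustrative classifier for the encoding tiers described above.
// The character classes are approximations; Neo4j's real tables differ.
public class ShortStringEncodingPicker {
    public enum Encoding { NUMERICAL, ALPHANUMERICAL, EUROPEAN, LATIN1, UTF8, NONE }

    static boolean numerical(char c) { return "0123456789+-., ".indexOf(c) >= 0; }
    static boolean alphanum(char c)  { return (Character.isLetterOrDigit(c) && c < 128) || c == ' ' || c == '_'; }
    static boolean european(char c)  { return alphanum(c) || c == '-' || c == '.' || (c >= 0xC0 && c <= 0xFF); }

    public static Encoding classify(String s) {
        if (s.length() <= 15 && s.chars().allMatch(c -> numerical((char) c))) return Encoding.NUMERICAL;
        if (s.length() <= 10 && s.chars().allMatch(c -> alphanum((char) c)))  return Encoding.ALPHANUMERICAL;
        if (s.length() <= 9  && s.chars().allMatch(c -> european((char) c)))  return Encoding.EUROPEAN;
        if (s.length() <= 7  && s.chars().allMatch(c -> c <= 0xFF))           return Encoding.LATIN1;
        if (s.getBytes(StandardCharsets.UTF_8).length <= 7)                   return Encoding.UTF8;
        return Encoding.NONE;  // falls back to the DynamicStringStore
    }
}
```

Note how the tiers are tried densest-first: each later tier spends more bits per character, so it fits fewer of them in the 64-bit payload.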

The code is still in internal review, and shouldn't be considered stable until its inclusion in the next milestone release a week from now. But I am very excited about the benefits this will give to Neo4j users, both in terms of lower storage sizes and in terms of performance improvements. Reading (and writing) a string encoded as a short string is much faster than reading (or writing) a string in the DynamicStringStore, since it takes only one disk read instead of two.

A big thank you goes out to the people in the Neo4j community who provided me with the string statistics that made this possible.

3 comments:

Would gzipping the short strings make sense? I don't know much (read: anything) about gzip's constant overhead, but I know that text compresses unreasonably well and that zlib is really fast. You're probably not going to be able to squeeze many more characters out of 64 bits, but every bit is precious, right? ;)

@Daniel: At these small sizes a conventional compression algorithm, such as gzip, is not going to give much, if any, improvement, and the overhead is going to eat up too much space. What could be interesting is to look at general character frequencies and use variable-length encodings (such as a Huffman code) to encode common characters in fewer bits. But when I tried that out, the added complexity in the code was not worth it, especially since the gain wasn't that big, and it made it much harder to look at a string and judge whether it would fit as a short string or not.

For strings that are slightly longer, and stored in the DynamicStringStore, it would be more interesting to look at using a conventional compression algorithm. As you point out, text compresses really well. I think the first step, however, is going to be to use some character encoding (probably UTF-8) instead of just storing the raw 16-bit Java characters in the DynamicStringStore.

Static Huffman compression at the character level can be made to work like a charm on small strings. One needs to enable periodic updating of the coding tables to keep up with the data distribution, and to reserve one symbol for escaping, in order to cover symbols not found in the "preloaded static code tree". Making the Huffman code canonical is really easy, and makes the CPU hit from compression unnoticeable.