Tuesday, February 21, 2012

How Mailinator compresses email by 90%

Given the title of this article, the first thing that should pop into your mind is probably - "well, use a compression algorithm - right?".

Right! Well, yes, well, not exactly. Read on.

Your second thought might also have been - "Why bother? Just buy more disks." Which in the big picture is also not a bad answer. But for Mailinator that doesn't work - if you have read previous Mailinator tech articles you might know that Mailinator stores all its email in RAM.

There were good reasons for that when Mailinator started. One was the use case - which was always disposable email that lasts a few hours (rather longer nowadays). Secondly, when Mailinator started, disks and datastores weren't as sophisticated/fast as they are now.

Also, Mailinator is/was always a free service so keeping costs down was always important. To this day, Mailinator runs on a single server. It averages about 4-5 terabytes of bandwidth a month and the peak incoming email rate I've seen is about 3500 emails/sec (this is just a production observation - the server's limit is bandwidth, not CPU).

And finally - last but not least - to me, much of web and application development today is utterly devoid of any fun algorithms. I spend a non-trivial amount of time in interpreted/dynamic scripting languages that do a fantastic job of hiding (or at least lure me away from thinking about) algorithmic complexity. I've probably inadvertently written more n^3 algorithms than, um, (n^3)-for-some-large-value-of-n.

Mailinator has always been my test bed for trying fun ideas, algorithms, and datastructures. In other words - I probably didn't need to do all the work I'm writing about here - but I definitely did have fun doing it (probably should have been out talking to girls, but alas).

Compression

Ok - so back to 90% compression.

So to start testing, I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors - mostly just stuff I had on hand: 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z), which got it down by 85%!

Well. OK! Article is over! Everyone out! 85% is good enough.

Actually - there were two problems with that result. One was that LZMA, like many compression algorithms, builds its dictionary from the data it compresses. As it compresses, it accumulates a dictionary of common sequences and uses that dictionary to compress everything thereafter.

That works great on static files - but Mailinator is not a static file. It's a big, honking, several-gigabyte cache of ever-changing email. If I compressed a million emails, and then some user wanted to read email #502,922 - I'd have to "seek" through the preceding half-million or so to rebuild the dictionary in order to decompress it. That's probably not feasible. And, as I said, the Mailinator cache is constantly throwing out old emails and putting in new ones.

In other words, an algorithm that relies on previous entries to build a dictionary can't work given that we keep purging the front of the stream never to be seen again.

Hence, we cannot compress emails "together". But we can compress them individually. Sadly, this hurts our compression ratio - and by a lot. The algorithm must now start building a new dictionary with each email, and since emails are small, in many cases the dictionary isn't very mature by the time we're done compressing.

We can help this situation by giving the compression algorithm a pre-built dictionary. That is, scan a typical piece of data to be compressed, find common sequences and create a list of them. Then we give that dictionary to the compressor/decompressor as it takes off.
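Java's built-in zlib bindings support exactly this idea via Deflater.setDictionary / Inflater.setDictionary. A minimal sketch - the dictionary contents below are invented for illustration, and a real one would come from scanning the mail stream:

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictDemo {

    // Compress input, optionally seeding the compressor with a preset dictionary.
    static byte[] compress(byte[] input, byte[] dict) {
        Deflater def = new Deflater();
        if (dict.length > 0) def.setDictionary(dict); // must be set before any input
        def.setInput(input);
        def.finish();
        byte[] buf = new byte[input.length * 2 + 64];
        int n = def.deflate(buf);
        def.end();
        return Arrays.copyOf(buf, n);
    }

    // Decompress; the inflater signals when it needs the same preset dictionary.
    static byte[] decompress(byte[] packed, byte[] dict, int originalLength) {
        try {
            Inflater inf = new Inflater();
            inf.setInput(packed);
            byte[] out = new byte[originalLength];
            int n = inf.inflate(out);
            if (n == 0 && inf.needsDictionary()) {
                inf.setDictionary(dict);
                inf.inflate(out);
            }
            inf.end();
            return out;
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical dictionary of phrases common in the mail stream.
        byte[] dict = "To unsubscribe from this newsletter click here".getBytes();
        byte[] msg = "Hi! To unsubscribe from this newsletter click here today".getBytes();
        byte[] withDict = compress(msg, dict);
        byte[] without = compress(msg, new byte[0]);
        System.out.println("with dict: " + withDict.length
                + " bytes, without: " + without.length + " bytes");
        System.out.println(new String(decompress(withDict, dict, msg.length)));
    }
}
```

One wrinkle worth knowing: zlib matches against the end of the preset dictionary first, so the most common sequences should be placed last in it.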

Woopsie. Again, the Mailinator stream is a living and breathing entity that's always changing. One minute might be a few million viagra spams, the next minute might be all about fake rolex watches. In other words, there is no "typical piece of data" - a static dictionary built off a sample of emails will be obsolete in relatively short order.

So, the first idea was to build a sliding dictionary builder. Each email is scanned for string occurrences and we keep a count of them. Then every so often (minutes or hours), the compressor switches to using the most recently constructed dictionary. Every compressed email is given a reference to its dictionary so when/if it needs to be decompressed, it knows what dictionary to give the decompressor. Many thousands of emails share the same dictionary so RAM to store dictionaries isn't particularly significant.
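A minimal sketch of that sliding dictionary builder, assuming line-granularity counting and an invented size budget (the real system's scanning granularity isn't specified here):

```java
import java.util.*;

public class SlidingDictBuilder {
    private final Map<String, Integer> counts = new HashMap<>();
    private final int maxDictBytes;

    SlidingDictBuilder(int maxDictBytes) { this.maxDictBytes = maxDictBytes; }

    // Called for each arriving email: tally its lines as candidate sequences.
    void observe(String emailBody) {
        for (String line : emailBody.split("\n"))
            counts.merge(line, 1, Integer::sum);
    }

    // Called every so often (minutes or hours): emit a dictionary of the most
    // frequent sequences seen in this window, then reset for the next window.
    byte[] buildAndReset() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue()); // most frequent first
        List<String> chosen = new ArrayList<>();
        int size = 0;
        for (Map.Entry<String, Integer> e : entries) {
            if (size + e.getKey().length() > maxDictBytes) break;
            chosen.add(e.getKey());
            size += e.getKey().length();
        }
        // zlib matches against the tail of a preset dictionary first,
        // so put the most frequent sequences at the end.
        Collections.reverse(chosen);
        counts.clear(); // slide the window forward
        return String.join("", chosen).getBytes();
    }

    public static void main(String[] args) {
        SlidingDictBuilder builder = new SlidingDictBuilder(4096);
        builder.observe("Buy vi4gra now!\nHappy man are you will be!");
        builder.observe("Buy vi4gra now!\nClick here");
        System.out.println(new String(builder.buildAndReset()));
    }
}
```

Each compressed email then just needs to record which dictionary generation it was compressed against, as described above.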

Well, that's great and does restore LZMA back to about 60-70% but remember I mentioned I had another problem with LZMA? Speed.

The C++ version of LZMA by Igor Pavlov compresses at about 1.7MB/s per CPU core on my test machine. Um. no. Firstly, Mailinator can pull down tens of MB per sec at times. Secondly, no component of our processing pipeline can be allowed to take up this much CPU (my rule, not yours). We need our CPU for other things when large volumes of mail arrive. (The java version by the way was about the same speed).

Simply - LZMA is pretty awesome - but it's too slow for this purpose.

So for the moment, I fell back to using a fast but simpler compression (zlib's deflate) on individual emails - and we sink down to about 40-50% savings from compression.

A Bigger Idea of a "Dictionary"

The next step for me was to think about email composition. We get lots of different types of email - but we get lots of the same types too. For example, we get lots of newsletters (people send them to Mailinator then read them via POP or RSS).

The nice thing for us is that a newsletter email blast could be 10,000 emails that are all the same. Well, ok, not exactly - no two emails are ever the "same" because headers have times, dates, message-ids, etc. within them. But if we remove the headers, you can get 10,000 emails going into 10,000 different inboxes that all have the same message "body". Are you thinking what I'm thinking?

Right - store each email with its own headers plus a pointer to ONE system-wide byte array containing the newsletter body. What's the "compression" ratio of that? Well over 90%. And just to be a snot we can then apply compression to that byte array to eke out another few percent. We're reusing memory here so it's not exactly "compression" - but we are reducing the size of the data stored by some fantastic amount for this happy use case.
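A sketch of that single-shared-body scheme - the hashing choice here is mine (anything collision-resistant works), and the header/body split is simplified:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class SharedBodyStore {
    // One system-wide copy of each distinct body, keyed by its content hash.
    private final Map<String, byte[]> bodies = new HashMap<>();

    static final class StoredEmail {
        final byte[] headers;    // unique per email
        final byte[] sharedBody; // a reference into the shared store, not a copy
        StoredEmail(byte[] headers, byte[] sharedBody) {
            this.headers = headers;
            this.sharedBody = sharedBody;
        }
    }

    StoredEmail store(String headers, String body) {
        byte[] bodyBytes = body.getBytes(StandardCharsets.UTF_8);
        // First email with this body pays for storage; the rest just point at it.
        byte[] shared = bodies.computeIfAbsent(sha256(bodyBytes), k -> bodyBytes);
        return new StoredEmail(headers.getBytes(StandardCharsets.UTF_8), shared);
    }

    private static String sha256(byte[] data) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(data))
                sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        SharedBodyStore store = new SharedBodyStore();
        StoredEmail a = store.store("To: alice@mailinator.com", "This week's newsletter...");
        StoredEmail b = store.store("To: bob@mailinator.com", "This week's newsletter...");
        // 10,000 such emails would share one body array.
        System.out.println("shared: " + (a.sharedBody == b.sharedBody));
    }
}
```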

This isn't a revolutionary idea (online music libraries do the same thing) but it does fit pretty nicely in the Mailinator paradigm. Sadly, apart from newsletters, not many other email sets, spam or otherwise, have email bodies that are identical. In fact, spammers specifically change the subject line and destination URL of every email they send for tracking and spam-detection-thwarting purposes. So what you get is something like this (headers omitted):

Email 1:
Buy vi4gra now!
http://rrr4.somerandomthing.com/?3jwow33oo
Happy man are you will be!

Email 2:
Buy vi4gra now!
http://1rr220.somerandomthing.com/?ajo200kko
Happy man are you will be!

So much for simply detecting identical email bodies. And this goes for less nefarious things too. Sign-up emails from websites will contain the same surrounding text with different names and validation urls inside.

What we could use here is a Longest Common Substring (LCS) algorithm. Basically, it would compare the two email bodies and be able to break them up as:

Shared: "Buy vi4gra now!http://"
Unique: "rrr4.somerandomthing.com/?3jwow33oo" (vs. "1rr220.somerandomthing.com/?ajo200kko")
Shared: "Happy man are you will be!"

Nice .. each email is stored as 3 (compressed) byte arrays where 2 of those can be shared.

Unfortunately, classic LCS algorithms are expensive. Comparing two sequences is an O(nm) operation. And we're not interested in comparing two sequences - we're interested in comparing each new sequence (er.. each new email) with the few million that preceded it. The LCS algorithm is also very memory-expensive in its creation of trie data structures - again, scaling to millions of emails just doesn't fit in our parameters.
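For reference, the classic longest-common-substring dynamic program looks like this - the full O(nm) table is exactly what makes it too expensive to run against millions of prior emails:

```java
public class LongestCommonSubstring {
    // Classic dynamic program: O(n*m) time and space for inputs of length n and m.
    static String lcs(String a, String b) {
        // len[i][j] = length of the common substring ending at a[i-1] and b[j-1]
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0, endA = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) { best = len[i][j]; endA = i; }
                }
            }
        }
        return a.substring(endA - best, endA);
    }

    public static void main(String[] args) {
        String e1 = "Buy vi4gra now!http://rrr4.somerandomthing.com/?3jwow33oo"
                + "Happy man are you will be!";
        String e2 = "Buy vi4gra now!http://1rr220.somerandomthing.com/?ajo200kko"
                + "Happy man are you will be!";
        System.out.println(lcs(e1, e2));
    }
}
```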

Generally speaking, there are a lot of tricks I've noticed in analyzing algorithms. A few off the top of my head: if you see an easy O(n^2) algorithm, it's rather likely there's an O(nlogn) one hiding in there somewhere. In contrast, if your dataset is small, you might be better off sticking to algorithms that make your CPU's cache and instruction pipeline happy instead of worrying about algorithmic running time (i.e. bubblesort > quicksort for small data). Lastly - if you can make assumptions about your data, you can often short-cut the classic algorithm with a good approximation.

Caching Lines

Cool, so let's assume something about the data. For emails, as it turns out, disparate parts of emails often occur on line boundaries (as you see in lines 1 & 3 above). A few same lines, a different one, a few more same. Instead of looking for common sequences based on individual characters, we can treat individual lines as units. Then we can attempt to find multiple occurrences of those lines. It cannot be as precise as LCS proper as in our above example (we would not find the identical portion "http://" in line 2) but we're basically settling for a greedy approximation, and one that works pretty well.

How do we store it though? LCS's tries would kill us. I know - let's use an LRU cache. Those darn things work for everything! We can use an LRU cache that caches full email lines. It will inherently flush out old email lines as the spam stream evolves (nice!) and provides quick look-ups to compare against thousands of lines at once (happy!). Specifically in Java, an LRU cache is a synchronized LinkedHashMap with true as the last constructor parameter and an overridden removeEldestEntry.
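That LinkedHashMap construction, sketched (the capacity here is a placeholder - the real system holds tens of thousands of lines):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class LruLineCache {
    private final Map<String, byte[]> cache;

    LruLineCache(final int maxLines) {
        // accessOrder=true (the last constructor arg) makes iteration order
        // least-recently-accessed first; removeEldestEntry turns it into an LRU.
        cache = Collections.synchronizedMap(
                new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                        return size() > maxLines;
                    }
                });
    }

    // Return the shared copy of this line, inserting it on a miss.
    byte[] intern(String line) {
        return cache.computeIfAbsent(line, k -> k.getBytes());
    }

    int size() { return cache.size(); }

    public static void main(String[] args) {
        LruLineCache cache = new LruLineCache(2);
        byte[] first = cache.intern("Click here to unsubscribe");
        byte[] again = cache.intern("Click here to unsubscribe");
        System.out.println("shared storage: " + (first == again)); // same array reused
        cache.intern("line two");
        cache.intern("line three"); // over capacity: least-recently-used line evicted
        System.out.println("size: " + cache.size());
    }
}
```

Note that eviction only drops the cache's lookup entry - any email already holding a reference to the byte array keeps it alive, so old emails stay readable.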

So we store a few tens of thousands of email lines in an LRU cache, and as each new email comes in, we check each of its lines against the cache. If a line is there, we reuse the copy in the cache instead of creating new storage for this email. By assuming all common sequences are bounded at newlines, we remove the boundary-discovery work LCS must do. Strictly speaking, we're cheating and losing some opportunity, but it's a good enough guess for this type of data.

This had a dramatic effect on our "compression" (again, it's slightly dubious to call it compression, but in the big picture our entire machinery of the LRU cache and bastardized LCS-in-spirit algorithm is building a reuse dictionary - it may not literally be compression, but it goes through several of the motions).

Caching Multi-lines

Caching lines is great - but what about caching multi-lines? Say we have a few emails - for brevity, assume each character in the following examples is an email "line":

Email 1: ABC1

Email 2: ABC2

Email 3: ABC3

Email 4: ABC4

So the first 3 lines are all the same in each email (ABC), the 4th lines are numbers which are not the same. Our algorithm:

1) Load a LINE and see if it's in the cache (if no more lines, quit)
2) .. if it's not there, put LINE in the cache, and store LINE in the email - GOTO 1
3) .. If it IS there:
4) .... see if LINE + NEXT_LINE is in the cache
5) .... if it's not there, put LINE + NEXT_LINE into the cache and store LINE (which is a cache hit) in our email - GOTO 1
6) .... if it IS there, LINE = LINE + NEXT_LINE - GOTO 4
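The steps above, sketched in Java - a plain set stands in for the LRU cache here so the greedy run-extension logic stays visible:

```java
import java.util.*;

public class MultiLineCacher {
    // Stand-in for the LRU line cache; a real version would evict old entries.
    private final Set<String> cache = new LinkedHashSet<>();

    // Store an email as a list of chunks, greedily growing each chunk
    // while the concatenation of lines is still a cache hit.
    List<String> store(List<String> lines) {
        List<String> stored = new ArrayList<>();
        int i = 0;
        while (i < lines.size()) {
            String run = lines.get(i++);
            if (!cache.contains(run)) { // step 2: miss - cache and store the line
                cache.add(run);
                stored.add(run);
                continue;
            }
            // steps 4-6: while run + next line is also a hit, extend the run
            while (i < lines.size() && cache.contains(run + "\n" + lines.get(i))) {
                run = run + "\n" + lines.get(i++);
            }
            // step 5: remember the longer sequence for future emails
            if (i < lines.size()) {
                cache.add(run + "\n" + lines.get(i));
            }
            stored.add(run); // run itself is a cache hit - shared storage
        }
        return stored;
    }

    public static void main(String[] args) {
        MultiLineCacher cacher = new MultiLineCacher();
        for (String last : new String[] {"1", "2", "3", "4"}) {
            List<String> stored = cacher.store(Arrays.asList("A", "B", "C", last));
            System.out.println(stored.size() + " chunks: " + stored);
        }
    }
}
```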

So if we run our 4 emails above through this algorithm, we get the following:

Email 1: stores A, B, C, 1 (all cache misses)
Email 2: stores A, B, C, 2 (A, B, and C are hits; AB and BC get cached)
Email 3: stores AB, C, 3 (AB is a hit; ABC gets cached)
Email 4: stores ABC, 4 (ABC is a hit)

So what happened? The system has realized that ABC is cacheable and is now pointing to that. All subsequent emails with the set-of-lines ABC will reuse the same memory. Note that the disparate lines 1,2,3, and 4 will always be stored separately, but the algorithm will then pick-up any common line-sets later in the email too (if there were any).

This elaborate system to find equal email lines and reuse them brings compression of the entire flowing email stream to about 80%. What about 90%? Well.. one more trick.

Back to LZMA

Remember LZMA from above that we abandoned because it was too slow to happen inline? As you'd guess, the biggest impact it had was on bigger emails. And although it's a CPU hog, we do actually have a few cores laying around. So let's give it one (but seriously, just one).

We set up one core (i.e. thread) to trail behind, scanning incoming email for messages over some size (say 20k) and re-compressing those using the sliding-dictionary LZMA we mentioned earlier. While 3 of our cores average 5-10% utilization receiving, analyzing, and storing incoming email, the 4th core sits at 100% re-compressing emails where it will find benefit. If it gets too far behind, it simply leaps ahead and leaves some compression on the table.
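One way to wire up that trailing core - the queue bound and size threshold are invented for illustration, and the LZMA pass itself is elided:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TrailingRecompressor {
    private static final int MIN_SIZE = 20 * 1024; // only bother with big emails
    private final BlockingQueue<byte[]> pending = new LinkedBlockingQueue<>(10_000);

    // Called on the receive path; never blocks. When the worker falls too far
    // behind, the bounded queue rejects the offer and we simply leave that
    // email's extra compression on the table.
    boolean submit(byte[] email) {
        return email.length >= MIN_SIZE && pending.offer(email);
    }

    // The one dedicated core: pull big emails and re-compress them with the
    // sliding-dictionary LZMA pass described above (hypothetical call, elided).
    void startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    byte[] email = pending.take();
                    // recompressWithLzma(email); // hypothetical
                }
            } catch (InterruptedException ignored) {
                // shutdown
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    public static void main(String[] args) {
        TrailingRecompressor rc = new TrailingRecompressor();
        rc.startWorker();
        System.out.println(rc.submit(new byte[1024]));      // too small: skipped
        System.out.println(rc.submit(new byte[64 * 1024])); // queued for re-compression
    }
}
```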

(Note that empirically, LZMA is an order of magnitude faster decompressing than compressing, otherwise that would have been a new problem as it could take too long when someone wanted to read an email)

Voila. 90%. (Two notes: 1: that's a reasonable average at least... sometimes better, sometimes worse and 2: I realize I'm not exactly sure what "Voila" means, looking that up now).

There are also some other important notes. Storing a byte array in Java costs something. The pointer alone (64-bit) is 8 bytes. Then there is the array length field, object padding, etc. In other words, I limited the system to never store email lines under 64 bytes. Small lines get concatenated together straight away.

Second, there are more email-idiomatic tweaks we can do to improve the situation. Base64-encoded attachments are effectively un-cacheable, so we pass over those.

Third, although from our cheeky example it may seem like we're finding optimal line sets (i.e. ABC), we're not. We could end up caching ABC and destroying an opportunity for a more optimal BCDXYZ or something. I'm guessing this doesn't happen often, but it would be an interesting future consideration.

Edit: Wow, sincere thanks to an Anonymous commenter for making me reconsider the above algorithm. I had originally stated it was O(n^2). My first version was indeed O(n^2) (which wasn't written about), and after a few changes it became O(n) - and I failed to see that. I find it's very easy to find tech reviewers once an article hits Hacker News; before then, though - not so much. :) My apologies for the error.

So for the end-user, this whole diatribe means little except that their emails are sticking around longer. They have no idea that when they click to read an email we may be zlib- or LZMA-decompressing tens of byte arrays shared by thousands of emails, with a custom sliding dictionary built by scanning emails that arrived hours ago, and then concatenating them together so they can be shown on their webpage - all in a few milliseconds. And they likely don't care; they're probably too busy signing up for Minecraft or something.

But that's ok. I know.

And if you got this far, you know too.

Ok.. now back to real work. What was I doing again? Oh yeah, writing some slick one-liners in Ruby. No clue on the running times - probably like O(n^4) or something, but if I fiddle with it a bit more - I bet I can cut the character count of the code by half!

What happens when a line used in an email gets evicted from the cache due to the LRU policy? Will you mark the email as evicted, or will the user encounter that when trying to open the mail and the decompressor finds the line evicted?

I believe the correct buzzword for this is data de-duplication. It's huge in enterprise storage for obvious reasons.

This is a nice domain-specific dedup process for your email use case. I'd be interested to see if using the hash value of each line of text (and some sort of dictionary data structure) would make classification faster, and what impact it might have on memory.

Very interesting article. Did you have a look at rolling checksums, like the rep compressor, or what rzip is using? It uses a rolling checksum over a sliding window, for which a hash is calculated to find common substrings within a large hash map.

You complain throughout it about how what you are doing is not compression ... well, technically it's deduplication, but it's close enough to compression that the different name is a technicality.

I did some work on deduplication years ago, back when I worked hosting BSD-based VPS (Virtual Private Servers). I learned the same thing you did: When you know your dataset, a well tuned chain of algorithms can achieve better results than regular compression at a fraction of the processing cost.

You could possibly patent this then sell it to redmond or something so they can stick it in their smtp-incompatible "collaboration server". Again. I heard they had something fancy that did the one-mail-store thing, but finally broke down and took it out because it lost too much mail.

If you're feeling like reinventing this you could perhaps keep a running dictionary but introduce "seekmarks" in the datastream. Or something. But this is quite a nice bit of neat already.

your article mentions that attachments are skipped. while the mime boundaries will/should be unique to each message, the base64 encoded lines would seem a candidate for the line cache, or perhaps a separately maintained line cache? if the same image were sent to 100 users, the same principles should apply per your lru cache of common lines.

I have long been in awe of what you've accomplished! Scaling software systems is tough, and I can imagine that my attempt at a mailinator would probably involve expensive clustering and SSDs to deal with IO bottlenecks.

That you provide such a towering icon as an example of just how much you can get out of a single, 4-core server keeps me motivated to do better.

This is a great article and an eye-opener, as I didn't even know this 7z compression format existed!

I have been trying to look for a simple java example for compressing/decompressing to/from this format but I am unable to find any APIs with examples. j7zip.jar seems like the most appropriate API but doesn't include any examples for how to use it! Any ideas anyone?

You wrote of your O(n^2) algorithm that "there might just be an O(nlogn) algorithm hiding in there" and indeed I think there is! If I read it right, steps 4-6 are searching sequentially for the longest cache hit starting with LINE, you could turn that into a binary-looking search, winding up with something like this:

Define Longest_Cache_Hit(Possible_Hit, Remainder):
1) See if Possible_Hit is in the cache
2) .. if Possible_Hit is NOT there:
3) .... if Possible_Hit is a single line, return NO_HIT
4) .... otherwise return Longest_Cache_Hit(First half of Possible_Hit, Second half of Possible_Hit)
5) .. if Possible_Hit IS there:
6) .... If Remainder is empty, return Possible_Hit
7) .... Call Longest_Cache_Hit(Possible_Hit + First half of Remainder, Second half of Remainder)
8) ...... If (7) returns a hit, return that hit
9) ...... If (7) returns NO_HIT, return Possible_Hit

Outer loop:
1) Load the whole email into LINES
2) CACHE_HIT = Longest_Cache_Hit(LINES, EMPTY_STRING)
3) .. If CACHE_HIT == NO_HIT, store the first line of LINES in our email, and also put that first line in the cache
4) .. If CACHE_HIT != NO_HIT, store CACHE_HIT in our email. If there is a NEXT_LINE after CACHE_HIT (we didn't hit to the end of the email), then put CACHE_HIT + NEXT_LINE in the cache.
5) Remove the first line from LINES and GOTO 2

Notes:
1) The caveat about quicksort vs bubblesort for small data could apply here... depending on the size of your emails, your algorithm might make the CPU cache happier than mine.
2) I'm sure my algorithm could be done without the recursion.

Again thanks for the article, it clearly got me thinking about algorithms for fun :)

As you mentioned, there is a low-constant O(n log n) algorithm that uses suffix arrays and a high-constant O(n) suffix tree algorithm that can solve your longest-substring problem. The latter is very complicated, but can be made online (check Ukkonen's algorithm). However, I like the first one for its simplicity.

You can always use a doubling approach to amortize costs and get an overall O(n log n) (for building) and O(k log n) to match a string of length 'k' in the built dictionary.

Awesome article. Just a thought... normally, running compression alg's in parallel gets you nowhere. Since you are (if I read correctly) compressing individual strings, could you get some bonus parallelism from running compression on the GPU (and also free up some of that CPU)? If your "server" even has a GPU...

This is an awesome way to learn algorithm design: a story of how you thought through the design and incorporated real-world constraints rather than slavish devotion to the BIG-O. (I can't tell you how many times I have seen an algorithm that was O(n) but whose n was 100,000 times larger than it needed to be. It looks great on a chalkboard but it doesn't translate.) You should be the one writing the textbooks. Find yourself an agent.

Your article was a real treat, thanks man. Same goes for all the comments made. They're equally entertaining and insightful, and it just proves that doing the things you're passionate about (without begging for acknowledgment) will eventually attract the kind of people that take time to read and understand. Good job you all. Thanks :-)