A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.

Sunday, 12 December 2010

lrzip-0.551, overlapping streams for faster compression

A few bug reports have been trickling in since I last released a version of lrzip, which as I said last time is good and bad. It's helped direct my workflow on this, even though I keep trying to let it settle down. I'd love to leave the code since it mostly works quite well, and there is a common mantra that "release early, release often" is considered good, but in practice that doesn't seem to work. If you release stuff while it's still broken, it will get a bad reputation and people will dismiss it that time around and often not give it a chance again. So if there are show stopper bugs, they need attending to ASAP.

So anyway, the fact is that the fun part is implementing stuff rather than just fixing bugs, so for the next release I worked on doing both since I had to fix the bugs I knew about. I've mentioned before that lrzip breaks data into rzip streams and compresses each stream a block at a time. I had implemented multithreading by splitting that block up according to number of CPUs, but at the end of the stream, it waited to get all the blocks back before moving onto the next stream. For this release I moved the threading up earlier in the overall process, thus allowing the rzip stage to move onto the next stream while the back end threads are still busy compressing data from the previous stream. This means that if a file is large enough (greater than approximately 2/3 ram), once all CPUs are in use with back end threads, they should remain in use until the end of the file is reached. This speeds up compression on large files on SMP machines.

While looking for bugs to do with lzma compression failing due to running out of ram, I discovered why changing from level 7 compression to 9 compression did nothing... It's been a long time since the lzma library was imported into lrzip, and I had completely forgotten that it only goes up to level 7! The default lzma compression level is meant to be 5, so I've scaled it down now, meaning lrzip will now use the best all round compromise level for lzma. This also was affecting the window sizes internally used by the lzma library. At level 5, the windows are 16MB in size, and go up to 64MB for level 7. As one kind person noted in the comments on this blog, lzma uses up to 12x the memory of the compression window, so each thread would have been needing up to 800MB each. This explained why people were getting lzma failing on window sizes and complaining despite apparently enough ram. While the new level does lose slightly on compression, it does also improve compression speed significantly, and should pretty much eliminate lzma compression failing due to "too big window sizes".

With the two speed improvements, here are the sort of speed ups one can see with version 0.550 of lrzip:
10GB virtual image on quad core with 8GB ram:
lrzip 15m45s -> 11m03s
lrzip -M 12m03s -> 9m30s

And just as a reminder, xz completed this in 50m58s and the file size was almost halved by lrzip compared to xz.

On version 0.544 of lrzip, I had changed multithreading to use two separate groups of threads for stream 0 and stream 1 to make decompression more robust for out-of-order storage in the archive. Well unfortunately this doubled the possible number of threads, and this brought a now all-too-familiar problem: running out of ram on big files. In .550 I changed it to only use one thread for stream 0 rather than multiple ones since it's usually only one block of stream 0 before multiple blocks of stream 1. This can slow down decompression a little more, but reliability is more important.

I put some effort into tackling the dreaded running out of ram problem with multiple threads all competing for ram. What lrzip does now is detect when the compression back end has failed (which is usually for memory reasons) and instead of failing completely, it will then wait for the previous thread to complete its compression before trying again. This serialises work that would have otherwise been in parallel. Thus less ram is required since only one thread is likely to be using ram. I did this on both compression and decompression. Also, I made the lzma compression back end first try a lower compression level (since this uses a smaller window) before failing, and then finally resorting to bzip if it never succeeds. This should make running out of memory a non-fatal problem since it will just try again using less and being slower.

On the subject of constraining memory usage, I finally bit the bullet and dropped the maximum mmap down to only 1/3 of ram since I can use my sliding mmap to do the remaining chunk size. It is a tiny bit slower, but far more reliable, and leaves a lot more ram for the compression back ends. This makes my sliding mmap implementation, which was just an experiment originally, used routinely for any file larger than 1/3 of ram size.

My "spreading out of thread spawning evenly" idea in the last release was a failure. I reverted it since it provided no speed ups and occasionally slowed things down.

I fixed a few other minor errors to do with not callocing enough ram in some structs which looked really scary but I have no idea if they were causing problems. Hopefully they fixed some random problem somewhere :P

So I was very proud of this release and made it go gold as version 0.550...

About 5 minutes after releasing it, I found a couple of new bugs that were pretty nasty. When lzma worked on an incompressible block, it now failed instead of just leaving that block uncompressed. That was an easy fix. I also found that I had broken compression from STDIN a while back and hadn't noticed, and the compression summary at the end was reading 0. So I quickly corrected those and bumped the version up to 0.551.

No doubt more bugs will show up and I'll have to tend to those as well, but this should be a more robust version since a lot of failure cases in 0.544 have been addressed.

Other ideas I'm considering for lrzip at this stage based on requests include a good error testing and recovery mechanism with either reed-solomon or low density parity checking, and encrypted password protection. I have a few libraries I've looked at for both of these but am open to suggestions if people have recommendations.

EDIT: People didn't understand how the chunks and streams work, so I've copied this comment I made below into the blogpost.

---

Lrzip splits files up multiple times. Imagine a file is twice the size of ram. It will first be split into chunks:
A B C
where each chunk is 1/3 the size of the file.

Then the rzip stage is performed on chunk A, splitting it into streams:
A0 A1
where A0 is the "dictionary" and A2 the "matches"

Then each stream is split up into further chunks for the compression backend, chunks
A0A A0B
and then
A1A A1B A1C A1D A1E
and so on, and each chunk is passed onto the compressor (such as lzma) and the results written to disk.

All this used to be done in series.

Version 0.530 made the last chopping up part be done in parallel since the slow compression back end (lzma or zpaq) was the time consuming part of compression. However when once streams A0 and A1 would have finished their rzip stage, lrzip would wait for A0A A0B A1A A1B etc. to finish compressing completely before lrzip would move onto chunk B.

Version 0.551 now allows lrzip to move onto chunk B while A0A A0B A1A A1B etc are all still compressing.

---

This means that on faster compression (lzo, gzip and bzip2), the rate limiting step is the stream production of A0 and A1. The pipe dream is to parallelise those, but they rely on using as much ram as possible to make the rzip stage effective so it would compromise compression greatly (and be messy as hell to implement).

9 comments:

I'm trying to understand what "stream 0" and "streams 1" do mean in your above explanation.If your streams are already cut into blocks, it seems your mechanism is generic enough to keep busy all your cores ? So what could be the use of several streams, or family streams ?

Lrzip splits files up multiple times. Imagine a file is twice the size of ram. It will first be split into chunks:A B Cwhere each chunk is 1/3 the size of the file.

Then the rzip stage is performed on chunk A, splitting it into streams:A0 A1where A0 is the "dictionary" and A2 the "matches"

Then each stream is split up into further chunks for the compression backend, chunksA0A A0Band thenA1A A1B A1C A1D A1Eand so on, and each chunk is passed onto the compressor (such as lzma) and the results written to disk.

All this used to be done in series.

Version 0.530 made the last chopping up part be done in parallel since the slow compression back end (lzma or zpaq) was the time consuming part of compression. However when once streams A0 and A1 would have finished their rzip stage, lrzip would wait for A0A A0B A1A A1B etc. to finish compressing completely before lrzip would move onto chunk B.

Version 0.551 now allows lrzip to move onto chunk B while A0A A0B A1A A1B etc are all still compressing.

Obviously, it makes me think about new questions, so i hope i does not bother you too much.

> A0 is the "dictionary"Do you mean "literals" ? It would make sense, considering the rest of your explanation.In LZ semantic, "literal" is the "rest", what could not be represented using matches.While "dictionary" tend to be the sliding window into input stream (or a list of pointers, but that's for more advanced formats).

2) Separating literals for matches is an excellent choice, since they have pretty different patterns. It's much clearer now how your back end succeed at being efficient.

Maybe you have already thought about it : did you envisage to separate "matches" field even further, such as having a "offset" stream, a "match length" stream and another "literal length" stream ?In theory, it should help compression, since data pattern would be more homogeneous.

3) Understanding chunks : From reading your example, it seems chunk need to fill entirely into memory. Is that correct ?I was under the feeling that the rzip stage was working on the entire file, even if bigger than memory.Maybe there's something i do not properly understand here...

I had not considered separating them further because I am no compression expert. The internal workings of the rzip preprocessing and the backend compressors are entirely the work of other people. I work on the glue that puts it all together as lrzip. Splitting it out as you say would be a non-trivial exercise but it's definitely something I will consider for the distant future and newer versions. Thanks for the suggestion.

The chunks to begin with have to be seen at the same time for the search to work. The original implementation of rzip had the requirement that the chunk could be mmapped. I modified it using the "sliding mmap" I invented to allow it to extend beyond the limits of how much you can mmap, but it uses progressively more ram and (potentially) swap and takes much longer the more you extend beyond the ram range. That's what the "unlimited" -U option is there for: To make it all be searched as one large chunk, so there'd only be an A, and no B and C. It is much kinder on the system to default to making the chunks approximately 2/3 of the ram size, although it makes the final compressed file larger. I'm very conscious of the effects running lrzip have on the machine for the user. Trying to parallelise the first stage into multiple chunks, worked on at the same time to speed up compression would significantly sacrifice the utility of the rzip preprocessing, and will make it quite hard to keep memory in check (which is already a nightmare).

In my opinion, LRZip is a brilliant piece of software, and by choosing to combine complementary algorithms, in a very effective implementation (comparable attempts are pretty much batch based) you addressed compression in an innovative manner.

You might have noticed that i'm interested and have a little bit of experience in the backend stages. So if there is a need for some help on this side, even just for a question, i'll be glad to support.