A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, BFS and -ck.

Wednesday, 23 March 2011

lrzip-0.600, STDIN/STDOUT and Encryption

TL;DR I haven't got a BFS for 2.6.38 yet.

The short short changelog: Lrzip now supports stdin/stdout fully on both compression and decompression without using temporary files. It also now supports very high grade password protected encryption, faster decompression and testing, big endian machines, and has had numerous minor fixes. There is still no progress on BFS for 2.6.38, but the fact that I've released this version of lrzip means I will likely be looking at it soon.

Lrzip was originally based on rzip code by Andrew Tridgell and for a very very long time full stdin and stdout support was considered so orthogonal to the design that it would be virtually impossible to implement. The reason for this is it compressed data in 2 separate passes, and worked over chunks of data that would totally fill ram and then store it, move on to the next chunk and so on, then it would go back and fill in all the header details in the various compression chunks and finally go back to the start of the file and write down the size of the file. This meant that unless the whole file and two separate compression phases all requiring ram, could all fit into ram, it was impossible to write that chunk to disk. If that part wasn't bad enough, on decompression, it would decompress each chunk, then write down parts to disk, decompress the next chunk, then refer -back- to parts already written to copy them to the new parts, then move onto the next chunk. It's not hard to see why this would make it impossible to read from stdin and write to stdout with all that fast forwarding and rewinding on the file as it's being read from or written to. Basically unless you had about 6 times as much RAM as the size of the file you're compressing it is impossible to fit all that into memory and work on it at once.

One thing I've always kept in mind when developing lrzip is to design a compression format that scales. And by scales I mean as CPUs get faster, computers get more ram, hard drives get faster, and files get bigger, the compression should actually get better. One of the most limiting things about other compression formats is they pretty much generate a static size regardless of how good the machine is they're working on. They also have not scaled very well with increasing CPU cores. For some time it has occurred to me that one day PCs would have enough RAM that the limiting aspect of fitting all the data of a file into RAM to work on it as I described above would no longer be an issue. Of course files will always get larger as well and there will always be files larger than the amount of RAM you have. To make lrzip work on large files, it was going to be necessary to rewrite the file format such that it could do "chunks" of independent data, write them to stdout, and then move on. So basically the major rework that went into this version of lrzip does precisely that. It carefully tries to figure out how much ram it can allocate to a temporary input buffer, temporary output buffer, rzip compression first stage and multiple compression back end stages, and then starts constructing the archive to/from the temporary buffers before dumping a chunk to disk.

Implementing the temporary buffers was no minor task either. Each phase had to take into account different aspects of whether there would be multiple reads/writes, modify in ram or on disk and safely keep track of it. Basically each different combination of compress/decompress stdin/stdout needed a different implementation to work so I incrementally implemented one after the other. The one caveat remains, though, that you might end up decompressing a file to STDOUT on a machine with MUCH less RAM than the machine it was compressed on. So I needed to also implement an opt-out where if it all failed to fit in RAM, it was still able to safely dump it out to temporary files on disk. This is required on decompression from STDIN to create a temporary file there, and decompression to STDOUT (especially if the file wasn't generated via a STDOUT archive in the first place). As you can well imagine all this took a lot of time and testing but all appears to be working nicely now. Furthermore, if you are unlucky enough to need temporary files for the above said reason, lrzip can now "chunk" them as well so it will use less temporary space than before. It's important to note when using STDIN+STDOUT that the compression on files greater than about 25% RAM size will not be as good as when you compress the file directly on disk. The reason has to do with just how much you can fit in ram at any one time and what size compression window can be allocated. Nonetheless 25% of ram on a modern machine is a fair chunk (and gets better every year :)

One of the bonuses of implementing decompression to a temporary buffer was that it was possible to modify even the normal decompression to use this as much as possible. As decompression writes to disk and then reads back from the file it writes to, this can be VERY slow if it involves lots of small reads and writes (and this sometimes happens with large files with lots of small similar portions). Doing this in RAM is much faster, and if you're just testing the file, there is no need to even write it to disk in the end. Thus decompression and especially testing received a healthy speed-up as a result.

After finally getting all that out of the way and having to create a new file format to accommodate it, it seemed a shame to have to change the file format again in the near future to support the 2nd most requested feature of lrzip - encryption. So instead of releasing it as it was, I started getting advice on how I might implement encryption and was lucky enough to recruit the help of Serge Belyshev who was actually reading all about encryption techniques. Unfortunately it also meant that instead of just slapping together some basic password protection and simple encryption mechanism, we ended up trying to generate some absolutely ludicrously secure encryption for lrzip.

You're probably aware that there are 2 basic ways to circumvent encryption: brute force and backdoors. If we stick to encryption techniques for which there are no known current weaknesses, nor backdoors, then the only way to break them is brute force. This means attempting as many password combinations as you can in the shortest possible time to see if they're correct. The average modern CPU can test about 1 million hashed passwords per second using sha512. However if you multiply hash the password (i.e. do it over and over with slight changes), then you have to multiply hash it on testing the password. So if you multiply hash it 1 million times, it takes one second just to try each password. Of course this is only for today's PCs so I thought up that we could increase the amount of hashing by date. So we went back to Moore's law which predicts how much CPUs increase in power each year, and look at the system date when hashing the password, and then decide how many times to hash the password. That means that 10 years from now (if Moore's law still holds, which it has for 50 years now), it will still take 1 second for each password on a newly encrypted file.

After hashing the password with sha512 millions of times, we added random salt to each compression block after the 2nd compression stage and used AES128 block encryption on the data. Then, and this was the hardest part of all, once each rzip chunk is complete, we go back and overwrite each block header encrypted with yet more random salt, and finally encrypt even the md5 sum at the end. This means that you end up with a file that is recognisable by its header as being an lrzip file, but has basically random indecipherable data after that so there is no way of knowing how many chunks the data is, how compressed it is, nor how big the final file is. It also means that each section that is encrypted is encrypted with a different cipher and if you compress exactly the same file with exactly the same password, it will (virtually) never produce the same file twice. To be honest I think we overdid it by anyone's standards. However since I was implementing it, it seemed illogical to just do a half-arsed effort. By far the weakest link in this equation is the password you choose when encrypting, but it does accept passwords up to 500 bytes in length. If you lose your password, consider the data irrevocably lost as there is no known way of retrieving it.

Michael Blumenkrantz was also kind enough to hack on lrzip and implemented a completely new autoconf style configuration mechanism which should hold lrzip in good stead for the future, being more robust and flexible. He then went on to abstract out most functions in a way that will make it possible for him to implement a liblrzip library in the future. That's a rather exciting prospect, but I'm glad it's him and not me that wants to implement that ;)

Finally with more help we tracked down some big endian bugs that had crept in and made lrzip not work on BE machines, I did some testing on my wife's mac to make sure it built and all of the normal compression/decompression and encryption worked, and fixed up and tweaked too many other things to mention in the code.

All in all I'm very pleased with the result, and as much as I'd love for it to be super stable and bug free, no doubt something will show up.

So do your worst, have fun and make sure to tell everyone where you got it:LRZIP FRESHMEAT

And now a small rest before I look at the kernel again...

EDIT: In the install it seems we forgot the lrzuntar, lrunzip wrappers and the extra manpages. I have rushed out a v0.601 release with these changes.

You gain speed at the cost of some compression with the extra multi-threading. It's a worthwhile sacrifice given the massive speed up. Try with -p 1 and you'll see it slow down and the final size will be smaller. In fact I find lrzip with -z on a quad core not much slower than xz, but for MUCH better compression.

Thanks for testing and feedback :). Bear in mind that one of the good things about the default use of lzma instead of zpaq is that decompression is very fast whereas zpaq decompression takes just as long as compression took. That's some pretty mean processing power you got there too :)