Sunday, 18 March 2012

lrzip-0.612

This time the main update is a new zpaq library back end, replacing the ageing zpipe code. There are a number of advantages to using the libzpaq library over the old code.

First, the old code required a FILE stream since it was written with stdio in mind, so it was the only compression back end that needed some lesser known but handy, yet (virtually) Linux-only memory-stream features like fmemopen, open_memstream and friends. These were not portable to OSX and others, so on those platforms they were emulated through the incredibly clunky use of temporary files on disk. Using the new library has killed off the need for these features, making the code more portable.
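For those who haven't met them, here's a minimal sketch of what those memory-stream calls do - a FILE * backed by memory instead of disk. Illustrative only, not lrzip's actual code:

    #define _GNU_SOURCE             /* for open_memstream/fmemopen on glibc */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            char *buf = NULL;
            size_t len = 0;

            /* open_memstream returns a stdio stream that writes into a
             * growing malloc'd buffer - no temporary file needed. */
            FILE *out = open_memstream(&buf, &len);
            if (!out)
                    return 1;
            fprintf(out, "compressed data would go here\n");
            fclose(out);            /* buf and len are now valid */

            /* fmemopen does the reverse: it wraps an existing buffer in a
             * FILE * so stdio-only code (like the old zpipe back end) can
             * read straight from memory. */
            FILE *in = fmemopen(buf, len, "r");
            if (!in)
                    return 1;
            char line[64];
            if (fgets(line, sizeof(line), in))
                    printf("read back: %s", line);
            fclose(in);
            free(buf);
            return 0;
    }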

Second, the code is significantly faster since it is the latest full C++ version of the zpaq code. Unfortunately it also means this part of lrzip takes a LOT longer to compile now, but that's not a big deal since you usually only compile it once ;)

Third, it supports 3 different compression levels, one of which is higher than what lrzip previously supported. As lrzip uses 9 levels of compression, I've mapped the 3 zpaq levels to -L 1-3, 4-7 and 8-9; the default -L 7 thus gives the "mid level" compression from zpaq.
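In code terms the mapping is just simple bucketing, something like this (a sketch of the idea, not the actual lrzip source):

    /* Map lrzip's nine -L levels onto zpaq's three profiles.
     * Sketch only - lrzip's real code may differ. */
    static int zpaq_level(int lrzip_level)
    {
            if (lrzip_level <= 3)
                    return 1;       /* fast profile: -L 1-3 */
            if (lrzip_level <= 7)
                    return 2;       /* mid profile: -L 4-7, the default */
            return 3;               /* max profile: -L 8-9 */
    }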

Finally, the beauty of the zpaq compression algorithm is that the reference decoder can decompress zpaq data of any profile. This means you can use the latest version of lrzip with compression -L 9 (max profile), yet it remains backward compatible with older 0.6x versions of lrzip, with no need for an updated minor version and file format. The release archive I provide of lrzip-0.612.tar.lrz is self-compressed with the new max profile. Even though there is significantly more code than ever in the lrzip release tarball, it has actually shrunk for the first time in a while.

All that talk is boring though, so let's throw around some benchmark results, which are much more fun.

For the original README benchmarks I had compressed the Linux 2.6.37 tarball, so I used that again for comparison. Tests were performed on a quad core 3GHz Intel Core 2.

As you can see, the improvements in speed of the rzip stage have made all the compression back ends pretty snappy, and most fun of all is that lrzip -z on this workload is even faster at compression than the multithreaded 7z, with significantly smaller output. Alas, the major disadvantage of zpaq remains that it takes about as long to decompress as it does to compress. However, with the trend towards more CPU cores, one could argue that zpaq compression, as used within lrzip, is reaching a speed where it can see regular use instead of just research/experimental use, especially on small files like the lrzip tarball I distribute.

Using "U"nlimited "z"paq options, it is actually faster than xz now. Note that about 30% of this image is blank space but that's a not-uncommon type of virtual image. If it were full of data, the difference would be less. Anyway I think it's fair to say that it's worth watching zpaq in the future.
Edit: I've sent Matt Mahoney (the zpaq author) the latest lrzip benchmarks showing how it performs on his large text compression benchmark, and he's updated his site:
http://mattmahoney.net/dc/text.html
I think it's performing pretty well for a general compression utility.

12 comments:

Nice, CK. For those who are unaware, can you comment on the status and implications of incorporating lrzip into libarchive? I believe you are/were working with Michael Blumenkrantz (author of liblrzip) to do this.

The GPL license on lrzip will prevent the full library support from being merged into libarchive; only the simple compress/uncompress features can go in. Michael has forwarded me a patch for separate-binary support of lrzip in libarchive to do this. I have not yet submitted it on his behalf.

Hopefully your efforts will result in lrzip support in pacman/makepkg. I ran a simpler version of your benchmark, compressing the kernel source. The results are like yours, but I included some graphs of the key metrics.

I posted a first draft patch to libarchive for basic lrzip inclusion, and the news is even better - they can actually accept GPL licensed libraries, which means we can work on full lrzip support in libarchive.

rep is based on the rzip idea and srep is based on an idea mentioned in the rzip paper (a dictionary larger than memory), but both are much more efficient than the rzip implementation. lrzip afaik just adds lzma compression to the rzip code, while i've developed a completely improved algorithm

now rep (in-memory deduplication) processes more than 1 gbyte/s on an i7-2600, and srep (full-file deduplication) processes about 100 mbyte/s, with a final compression ratio (rep/srep+lzma) much better than with rzip/lrzip. right now i'm working on 1 gbyte/s full-file deduplication

afair, rzip finds 32+ byte matches and then omits some of them when it runs out of memory. did lrzip change the algorithm? i looked at it back in 2007-09 when i started working on rep. look at http://encode.ru/threads/1726-Deduplication-X-Files
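For readers unfamiliar with the technique being discussed, long-range match finders of this kind hash fixed-size windows of the input into a table and check whether an earlier offset held the same data. A toy sketch of the idea in C - not rzip's, lrzip's or rep's actual code:

    /* Hash every MIN_MATCH-byte window of the input and remember the
     * most recent offset seen for each hash, so a repeat of data seen
     * long ago can be found in O(1). */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MIN_MATCH 32                    /* rzip-style minimum match */
    #define TABLE_BITS 16
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint32_t hash_window(const unsigned char *p)
    {
            uint32_t h = 2166136261u;       /* FNV-1a over the window */
            for (int i = 0; i < MIN_MATCH; i++)
                    h = (h ^ p[i]) * 16777619u;
            return h >> (32 - TABLE_BITS);
    }

    static size_t find_matches(const unsigned char *data, size_t len)
    {
            static size_t table[TABLE_SIZE]; /* stores offset + 1; 0 = empty */
            size_t found = 0;

            for (size_t i = 0; i + MIN_MATCH <= len; i++) {
                    uint32_t h = hash_window(&data[i]);
                    size_t prev = table[h];

                    /* verify the candidate - hashes can collide */
                    if (prev && !memcmp(&data[prev - 1], &data[i], MIN_MATCH))
                            found++;
                    table[h] = i + 1;
            }
            return found;
    }

    int main(void)
    {
            unsigned char data[4096];

            /* a repeating pattern guarantees long-range matches */
            for (size_t i = 0; i < sizeof(data); i++)
                    data[i] = (unsigned char)((i % 512) * 31);
            printf("%zu windows repeat earlier data\n",
                   find_matches(data, sizeof(data)));
            return 0;
    }

A real implementation would use a rolling hash so each window costs O(1) to compute, and needs an eviction policy for when the table fills up - which is the "omits some of them when it runs out of memory" behaviour mentioned above.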

it would be great if you could join the forum and start a thread about lrzip. in order to post here i have to solve the google captcha every time and i hate it

Interesting bug, thanks for pointing it out. It probably makes no sense to even use a compression back end when the file is below a certain size, and the back ends seem to be unreliable below some limit I haven't worked out yet. I'll try to work on it for the next version, whenever that comes out.