A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.

Friday, 25 February 2011

lrzip-0.570 for the uncorrupt.

When I last blogged about lrzip, I mentioned the corruption on decompression issue a user was seeing in the field. This bug, not surprisingly, worried me greatly so I set out on a major hunt to eliminate it, and make lrzip more reliable on decompression. After extensive investigation, and testing on the part of the user, to cut a long story short, the corruption was NEVER THERE.

The problem he was encountering was this: after decompressing a 20GB logfile, he would compare it to the original file with the 'cmp' command, and there would be differences at random places in the file. This made me think there was a memory corruption somewhere in lrzip. However, he also noted that the problem went away on his desktop machine when he upgraded from Debian Lenny to Squeeze. So we knew something fishy was going on. Finally it occurred to me to suggest he try simply copying the 20GB logfile and then running 'cmp' on the copy. Lo and behold, just copying a file of that size would randomly produce a file that had differences in it. This is a disturbing bug, and had it been confined to one machine, it would have pointed the finger at the hardware. However, he had reproduced it on the desktop PC as well, and the problem went away after upgrading his distribution. This pointed to a corruption problem somewhere in the layers between write() and what ends up on the disk. Anyway, this particular problem is now something that needs to be tackled elsewhere (i.e. Debian).
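The copy-then-compare test is easy to reproduce without lrzip at all. The following is a minimal Python sketch of the idea, not the exact commands the user ran: `first_difference` and `copy_and_compare` are hypothetical helper names, and the comparison reports the first differing byte offset in the same spirit as 'cmp'.

```python
import shutil


def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact differing byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte: one file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended with no difference
            offset += len(a)


def copy_and_compare(src, dst):
    """Copy src to dst, then compare the two files byte for byte."""
    shutil.copyfile(src, dst)
    return first_difference(src, dst)
```

On a healthy system `copy_and_compare` should return None for a file of any size. Note that a comparison made straight after the copy may be served partly from the page cache rather than the disk.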

Nonetheless, the corruption issue got me thinking about how I could make lrzip more reliable on decompression when it is mandatory that what is on disk is the same as what was originally compressed. Till now, lrzip has silently used CRC32 internally to check the integrity of each decompressed block before writing it to disk. CRC32 still has its place and is very simple, but it has quite a few collisions once you have files in the gigabyte size (collisions being files with the same CRC value despite being different). Fortunately, even with a hash check as simple as CRC, if only one byte changes in a file, the value will never be the same. However, the CRC was only being done on each decompressed chunk and not the whole file.

So I set out to change over to MD5. After importing the MD5 code from coreutils and modifying it to suit lrzip, I added an MD5 check during the compression phase, and put the MD5 value in the archive itself. For compatibility, the CRC check is still done and stored, so that the file format is still compatible with all previous 0.5 versions of lrzip. I hate breaking compatibility when it's not needed. On decompression, lrzip will now detect the most powerful hash function in the archive and use that to check the integrity of the data. One major advantage of MD5 is that you can also use md5sum, which is standard on all modern Linux installations, to compare the value to that stored in the archive on either compression or decompression.

I took this idea one step further, and added an option to lrzip (-c) to actually do an MD5 on the file that has been written to disk on decompression. This is to ensure that what is written on disk is what was actually extracted! The Debian Lenny bug was what made me think this would be a useful feature. I've also added the ability to display the MD5 hash value with a new -H option, even if the archive was not originally stored with an MD5 value.
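To make the difference between the per-block and whole-file checks concrete, here is a rough Python sketch of the scheme described above. It is not lrzip's C code: `write_blocks`, `md5_of_file` and `verify_on_disk` are hypothetical names, and the standard `zlib.crc32` and `hashlib.md5` stand in for lrzip's internal routines.

```python
import hashlib
import zlib


def write_blocks(blocks, path):
    """Write blocks to path; return (per-block CRC32 list, whole-stream MD5 hex)."""
    md5 = hashlib.md5()
    crcs = []
    with open(path, "wb") as out:
        for block in blocks:
            crcs.append(zlib.crc32(block))  # check covering this block only
            md5.update(block)               # running check over the whole stream
            out.write(block)
    return crcs, md5.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file by reading it back from the filesystem in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_on_disk(path, expected_md5):
    """Roughly what a -c style check does: re-read the output and compare."""
    return md5_of_file(path) == expected_md5
```

The value returned by `md5_of_file` is the same hex digest that md5sum prints for the file, which is what makes the stored value easy to check by hand.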

One broken "feature" for a while now has been multi-threading on OS-X. I have blogged previously about how OS-X will happily compile software that uses unnamed semaphores, yet when you try to run the program, it will say "feature unimplemented". After looking for some time at named semaphores, which are clunky in the extreme by comparison, it dawned on me I didn't need semaphores at all and could do with pthread_mutexes which are supported pretty much everywhere. So I converted the locking primitive to use mutexes instead, and now multi-threading on OS-X works nicely. I've had one user report it scales very well on his 8-way machine.
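As an illustration of how a plain lock can stand in for a semaphore, here is a small Python sketch. It is an analogy only: lrzip is written in C with pthreads, the post does not say exactly how its locking is structured, and `run_in_order` is a hypothetical name. Each worker waits on its own lock and releases the next one, so results are written in order even though the threads start in a different order.

```python
import threading


def run_in_order(n_workers):
    """Have n_workers threads append their results strictly in index order."""
    output = []
    # One lock per worker plus a final one; all start out held ("closed").
    locks = [threading.Lock() for _ in range(n_workers + 1)]
    for lock in locks:
        lock.acquire()

    def worker(i):
        chunk = f"chunk-{i}"      # stand-in for compressing block i in parallel
        locks[i].acquire()        # wait until the previous worker has written
        output.append(chunk)      # the ordered "write" step
        locks[i + 1].release()    # signal the next worker

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in reversed(threads):   # start in reverse to show ordering is enforced
        t.start()
    locks[0].release()            # open the gate for worker 0
    for t in threads:
        t.join()
    return output
```

This relies on Python's `threading.Lock` allowing a release from a thread other than the one that acquired it, which is what makes it usable as a binary semaphore here.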

Over the last few revisions of lrzip, apart from the multi-threaded changes which have sped it up, numerous changes to improve the reliability of compression/decompression (to prevent it from running out of memory or corrupting data) unfortunately also have slowed it down somewhat. Being a CPU scheduler nut myself, I wasn't satisfied with this situation so I set out to speed it up. A few new changes have made their way into version 0.570 which do precisely that. The new hash check of both md5 and crc, which would have slowed it down now with an extra check, are done now only on already buffered parts of the main file. On a file that's larger than your available ram, this gives a major speed up. Multi-threading now spawns one extra thread as well, to take into account that the initial start up of threads is partially serialised, which means we need more threads available than CPUs. One long term battle with lrzip, which is never resolved, is how much ram to make available for each stage of the rzip pre-processing and then each thread for compression. After looking into the internals of the memory hungry lzma and zpaq, I was able to more accurately account for how much ram each thread would use, and push the amount of ram available per compression thread. The larger the blocks sent to the compression back end, the smaller the resulting file, and the greater the multi-threading speed up, provided there's enough data to keep all threads busy. Anyway the final upshot is that although more threads are in use now (which would decrease compression), compression has been kept approximately the same, but is actually faster.
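The trade-off between thread count, ram per thread and block size can be sketched roughly as follows. This is purely illustrative and not lrzip's actual accounting: `plan_compression` and its parameters are hypothetical, and the real per-thread overhead of lzma or zpaq is not modelled, only passed in as a number.

```python
import os


def plan_compression(file_size, ram_bytes, cpus=None, per_thread_overhead=0):
    """Return (threads, block_size) for a hypothetical compression run."""
    cpus = cpus or os.cpu_count() or 1
    threads = cpus + 1  # one extra thread, since start-up is partially serialised
    # Ram left for data in each thread after the back end's own usage (assumed).
    budget = ram_bytes // threads - per_thread_overhead
    if budget <= 0:
        raise ValueError("not enough ram for this many threads")
    # Bigger blocks compress better, but every thread still needs some data.
    even_share = -(-file_size // threads)  # ceiling division
    return threads, min(budget, even_share)
```

For example, `plan_compression(1000, 900, cpus=2, per_thread_overhead=100)` gives 3 threads with blocks of 200 bytes, because the ram budget is the limit; with a 300 byte input the blocks shrink to 100 bytes so that all three threads have work.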

Lots of other internal changes have gone into it that are too numerous to go into depth here (see the Changelog for the short summary), but some user visible changes have been incorporated. Gone is the annoying bug where it would sit there waiting for stdin input if it was called without any arguments. The help information and manual page have been dramatically cleaned up. The -M option has been abolished in favour of just the -U option being used. The -T option no longer takes an argument and is just on/off. A -k option has been added to "keep corrupt/broken files" while corrupt/broken files generated on compression/decompression are automatically deleted by default. The -i information option now gives more information, and has verbose(+) mode to give a breakdown of the lrzip archive, like the following -vvi example:

I didn't bother blogging about version 0.560 because all the while 0.570 was under heavy development as well and I figured I'd wrap it all up as a nice big update instead. I'm also very pleased that Peter Hyman, who helped code for lrzip some time ago, has once again started contributing code.

That's probably enough babbling. You can get it here once freshmeat updates its links: lrzip

question about it. I used "lrzip ./directory" to compress my ZEN kernel branch, 2GB so far. I know you created the lrztar command for this, but lrzip was doing something and so I gave it a try. After 1 hour I cancelled the operation. Until I interrupted it, lrzip was using the CPU heavily. But the result was a file directory.lrzip of only some KB. OK, so the question is, what had lrzip done? (Maybe you can implement some dummy check for file/directory :-) ). However, after that I checked the lrztar command. Is it correct that this will tar the ./directory, then lrzip the resulting tar file, and after that operation delete the tar file? That means I need nearly double the size as free space to lrztar a directory? Or does it make sense to use the --use-compress-program= switch of tar?

I actually have no idea what it's doing when you pass it a directory. It's definitely a bug and it needs to detect that it hasn't been passed a file. That's a good idea.

Now about the whole tar issue: rzip - which became lrzip - was never designed to work on stdin/stdout. Because it has to pretty much read from one end of the file to the other on both the compression and decompression side to derive the benefits of the rzip preprocessing, it means that the whole input file, buffers for compression/decompression, and the output file, all need to fit into ram if it's to work on stdin/stdout. This is a major limitation of lrzip, and whether you use tar --use-compress-program or not, it ends up generating faked temporary files taking up double the space during the process. A relatively recent change to lrzip is the ability to compress from stdin without a temporary file, but because it stores it all in ram, it actually is less efficient than compressing whole files. For a long time I've been considering ways to make it work properly on stdin/stdout and they all point to some problems - that I'd need to change the file format yet again, and that it may be impossible to decompress a file from stdin on a machine with less ram than the one it was compressed on. To change the file format and then not be able to actually cope in all circumstances seems a complete waste. Nonetheless I am still trying to find solutions, but all of them require major code surgery. As for your times and sizes, they look nice. Interestingly you didn't derive any benefit from -U suggesting not much redundancy in your data.
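Why seeing the whole input matters can be shown with a small experiment. The Python sketch below does not use rzip or lrzip at all: zlib (32KB window) and lzma (a much larger dictionary at its default preset) are stand-ins for a compressor that can only look a short distance back versus one that can match data far apart, and `compressed_sizes` is a hypothetical helper name.

```python
import lzma
import os
import zlib


def compressed_sizes(block_size=64 * 1024, gap_size=1024 * 1024):
    """Compress block + filler + same block; return (zlib_size, lzma_size)."""
    block = os.urandom(block_size)   # incompressible on its own
    filler = os.urandom(gap_size)    # pushes the repeat far out of zlib's window
    data = block + filler + block
    short_range = len(zlib.compress(data, 9))  # cannot see the first copy
    long_range = len(lzma.compress(data))      # can match across the gap
    return short_range, long_range
```

With the defaults, the long range result should come out smaller by roughly the size of the repeated block, since the second copy can be encoded as a reference to the first. Finding that kind of distant redundancy is only possible when the data in between is still available to look back into, which is the constraint a blind pipe makes awkward.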

sometimes my free space is limited and so I asked about a pipe with tar. It would be fine if there were only 1 step to produce the archive for a directory tree. I see the problem, it's better to know which data is to be compressed than to wait for what's coming into the pipe ;) But the problem with the compression window and unpacking without enough memory still exists with 64/32 bit versions, or am I wrong? Btw., the md5 check is a fine security/verify feature, shouldn't it be the default on decompressing?

Another suggestion: lrzip help output: Usage: lrzip [options] better: Usage: lrzip [options] (As you already have done in the man page). So a dummy could see that it's only for one file ;)

Yes I understand, but for the reasons I already said, piping the data via stdin still generates temporary files. I'm currently working hard to minimise this problem and you can check out the progress here: https://github.com/ckolivas/lrzip/tree/stdinout

It's a very big rewrite and compression is not as good when you use stdin/out, but so many people have requested not using temporary files that I've been trying to find a way to implement it. I may have to change the file format (perhaps go to version 0.600) for the most benefit though.

md5 checking of decompressed data is always performed on decompression, whether it shows it or not. What is NOT done is to automatically verify what is written to disk as well. Normally you would trust your filesystem and operating system, and no other compression program does a verify on disk by default!

I saw something with stdinout on my git pull. But it seems that something went wrong, because at the moment I don't have a ./configure file anymore, and so I can't compile it. Or must I change the branch or do other things with git (still a noob with git, so sorry for the question)?

PS: Even if the compression with stdinout isn't as good as with the file method, I assume you still beat the others ;)

how can I see if I am using the new stdinout lrzip? I changed to the stdinout branch in git, then ./autogen.sh etc. But the version still shows:
lrzip
Will not read stdin from a terminal. Use -f to override.
lrzip version 0.570
And it seems that piping or tar --use-compress-program=lrzip doesn't work. I know, it's still an early beta ;)

If it says "will not read stdin from a terminal" it's because you're not piping anything into it? Unless you have one of the earlier broken versions from git. It is still under heavy development. The current git version works fine here from pipes, with redirection, and with --use-compress-program on tar. I haven't put up the version number yet because it's not ready for release :P

It will always compress better and faster with a real temporary file for the reasons I've already outlined. The only reason I have upgraded the stdinout support is because people keep asking me over and over for it. There is a lot more you can do with solid files that you simply cannot do when reading from or writing to a blind pipe. The design of lrzip works precisely in a way that needs solid files and this latest experimental version works by creating temporary files as large as possible in ram. I'm not sure what I should do with lrztar just yet. The official release of the next version will still be a while away as I finalise what I will include in it.