A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.

Monday, 22 November 2010

lrzip-0.542, mmap windows and overcommit

I started out this blog entry with a lot more about the tty patch thingy and then decided against it. Suffice to say I don't like heuristics being special coded into the scheduler.

So back to lrzip progress. I found a lovely little bug in it which would make the sliding mmap buffer not slide from the 2nd compression window onwards. It was a lovely bug to find because after fixing it, a very large file took 1/4 the time it was taking to compress down. Overall sliding mmap has gotten a lot faster and useful.

I did a bit of further investigating to see if it was fast enough to enable unlimited compression windows by default. It hasn't quite gotten that much faster, but it's very usable. I tried it on that 9.1GB all-kernel archive on a 32 bit laptop and the default windows take 20 minutes to compress, while unlimited sized windows with sliding mmap took 80 minutes. Not bad given how much more compression it ends up giving.

Now because I run my main desktop with swap disabled (swap is a steaming pile of dodo's doodoo) I wanted to check what happened with memory allocation on linux and just how much the Virtual Memory subsystem will happily give you since lrzip tests before it allocates a window. With a file backed mmap it turns out you can allocate whatever you bloody well like. It happily let me mmap 50GB on a machine with 4GB ram. Now this is a disaster because imagine reading that file linearly into ram as lrzip is working on it. As soon as it stops being able to fit any more into real ram (which happens at about 2/3 of the ram used), it has all this stuff in ram which is now a complete waste and starts faking caching the rest. It never really has any useful amount in ram at any one time since it ends up dropping everything behind. Then if you read up to the end of the file in short bursts, and back again (as lrzip does), it gets worse and worse. Unfortunately, the -M mode in lrzip just worked that way by using whatever mmap would give you.

So I spent days trying various combinations of sizes of window that would work out optimal (with and without swap) and kept coming up with the same answer: that having a main buffer of ~2/3 of the ram gives the best speed performance. It left me without an answer for what to do with -M mode. After enough playing around I decided to make the sliding mmap mode kick in whenever the compression window is larger than 2/3 ram as it speeds up the performance noticeably compared to just mapping in the whole file and never really caching anything. But I had to make up some upper limit to how big -M would work with and 1.5 times the ram size worked out to be a reasonable compromise in how much it would slow down, and how much it would hit swap.

Lrzip can hit swap real hard was the conclusion by the way... and it's not like it speeds anything up or allows any more ram to be used compared to just disabling it. The only
time swap made a difference to the testing was how much anonymous ram could be mmapped (as in just free ram versus caching a file that's on disk). The linux vm allows you to pretty much allocate everything you have as real ram and swap. You should see what the desktop was like after it finished compressing a huge file - clicking on anything took ages since pretty much every useful application was sitting on swap instead of in ram. I had to test it with a real HD since my main desktop now uses an SSD to see how it would affect regular users. By the way, if you can afford an SSD for your desktop, get one ASAP. It's the greatest desktop speedup I've seen in years. (Sure be vigilant with backups and so on cause people don't trust them blah blah).

Despite choosing 2/3 as the amount of ram to allocate, lrzip still actually tests to ensure it won't fail and then decreases the amount till it succeeds. It's rather tricky making sure there will be enough ram for all the compression components because it needs enough ram to allocate to caching the file on disk, enough for 2 compression streams possibly as large as that first one, and then enough ram to dedicate to the back end compression phase. All in all it's a real ram hog. If you compress a big file with lrzip, expect that most things in ram will be trashed (all for a good cause of course). Making sure it doesn't fail on 32 bits is also rather annoying, but I now use the sliding mmap for anything bigger than 700MB there and that's made a big difference to how effectively a 32 bit machine can compress large files too.

Lrzip has been building successfully on and off on mac osx for a while now. Every 2nd or 3rd release I break it since I can't test it myself. It turns out that I recently broke it again. When I added the massive multithreading capabilities to the compression side, I used a fairly standard posix tool, the unnamed semaphore. It turns out that mac osx just doesn't support them. It builds fine, but then when you run it, it says function not implemented. That's pretty sad... named semaphores are much clunkier and according to the manpage "POSIX named semaphores have kernel persistence: if not removed by sem_unlink(3), a semaphore will exist until the system is shut down." I'm not sure if that's adhered to, but that's rather awkward if you don't clean up after yourself very well when you abort your program. At some stage if I can find the enthusiasm, I might try and get the multithreaded lrzip working on osx by converting the unnamed semaphores to named ones, or use them selectively on osx.

All in all I'm very pleased with how lrzip has shaped up. I went back and tested the huge kernel tarball with zpaq multithreaded compression and managed to get 9.1GB down to 116MB in just 40 minutes with zpaq on an external USB drive. Gotta be happy with that :) It's a shame people don't find this remotely as interesting as anything I have to say on the linux kernel.

EDIT: I keep getting people ask me what the big deal is with 32 bits and lrzip, since 32 bits with PAE should be able to address up to 64GB ram. That may well be the case, but the gnu libraries on the 32 bit userspace take a size_t on malloc and mmap, and they are the size of a signed long, which is 4 bytes on 32 bits, and 8 bytes on 64 bits. So the most you can address in one malloc with 32 bit userspace is up to 31 bits, or a bit over 2GB.

In all fairness, using a non-standard compression routine, no matter how good, has always been risky and only occasionally useful for most people. Personally, I am fascinated by the software and the process that you go through to develop it. I have downloaded 0.542 and will test soon.

I tested with mscore-0.9.3.tar.bz2 (75mb). When it is uncompressed it is about 150+mb. Using xv compression (tar -cJf) it is 55.2mb, with lrzip -zM it is 52.6mb. Both took about 10 minutes to compress.