Rsyncable gzip

This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/206).

GZIP="--rsyncable" tar zcvf toto.tar.gz /toto

Why do you need this special option?

Because if you compress your files before synchronising them with rsync, a very small change in one original file may force rsync to re-transmit the whole compressed tar.gz file, instead of just the changed portion.

The basic reason is that rsync works at the byte level: very roughly, it compares the old copy of the file with the latest source, and transmits only the bytes that differ, in order to update the old copy and make it identical to the new one. rsync uses a smart way of doing these comparisons, so that in most cases only a tiny portion of the file actually needs to be transmitted.
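To make the comparison idea more concrete, here is a minimal sketch in Python of a block-matching delta. It is not rsync's actual algorithm (rsync uses a cheap rolling checksum plus a stronger hash, and much larger blocks); the names make_delta, apply_delta and literal_bytes, and the block size of 16, are invented for this illustration.

```python
import hashlib

BLOCK_SIZE = 16  # tiny on purpose; an assumption for this sketch only


def block_index(old: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Map the digest of each aligned, full-size block of `old` to its offset."""
    index = {}
    for offset in range(0, len(old) - block_size + 1, block_size):
        digest = hashlib.md5(old[offset:offset + block_size]).digest()
        index.setdefault(digest, offset)  # keep the first offset for duplicates
    return index


def make_delta(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Describe `new` as ("copy", offset_in_old) and ("data", raw_bytes) steps."""
    index = block_index(old, block_size)
    delta = []
    literal = bytearray()
    pos = 0
    while pos < len(new):
        window = new[pos:pos + block_size]
        offset = None
        if len(window) == block_size:
            offset = index.get(hashlib.md5(window).digest())
        if offset is not None:
            # The receiver already has this block: send only a reference.
            if literal:
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", offset))
            pos += block_size
        else:
            # No known block starts here: this byte must be sent as-is.
            literal.append(new[pos])
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: list, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new file on the receiving side from `old` plus the delta."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += old[value:value + block_size]
        else:
            out += value
    return bytes(out)


def literal_bytes(delta: list) -> int:
    """Number of bytes that had to be transmitted verbatim."""
    return sum(len(value) for kind, value in delta if kind == "data")


if __name__ == "__main__":
    old = bytes(range(256)) * 4
    new = old[:100] + b"X" + old[100:]  # one byte inserted near the start
    delta = make_delta(old, new)
    assert apply_delta(old, delta) == new
    print(literal_bytes(delta), "of", len(new), "bytes sent literally")
```

Because the search tries every offset of the new file, the sketch finds matching blocks again even after an insertion has shifted everything that follows. What it cannot cope with is a file whose content is different everywhere, which is exactly what happens with ordinary compression.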

Unfortunately, file compression algorithms that use an adaptive compression method (as most do) defeat the rsync logic and can cause the whole file to be retransmitted, even if only one byte has changed.

Why is that so?

An adaptive compression method uses an analysis of the bytes already processed to determine how best to compress the following bytes of the file. For example, suppose the compression program starts at byte 0 with a certain compression method. After 1000 bytes have been compressed, the program recalculates a new compression method, based on what it found in bytes 0-999. It then inserts a new compression table into the file, and uses this table to compress the next 1000 bytes. Then it recalculates its compression table based on bytes 0-1999, does the same again, and so on. This means that a change of one byte in bytes 0-999 can potentially change the compression method for the rest of the file, so that all the following output bytes are totally different. And because rsync compares the files byte by byte, it will not find any similar block of bytes between the old and the new file, and will be forced to resend the whole new compressed file.
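One way to observe this is to compress two inputs that differ by a single byte and compare the results. The sketch below uses Python's zlib module, which implements the same DEFLATE method that gzip uses; the sample data and the helper names (sample_text, common_prefix_len, common_suffix_len) are made up for the experiment, and the exact numbers it prints will depend on the data and on the zlib version.

```python
import random
import zlib


def sample_text(size: int, seed: int = 0) -> bytes:
    """Deterministic, compressible filler built from a small word list."""
    rng = random.Random(seed)
    words = [b"rsync", b"gzip", b"tar", b"backup", b"delta", b"block", b"byte", b"file"]
    out = bytearray()
    while len(out) < size:
        out += rng.choice(words) + b" "
    return bytes(out[:size])


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that are identical in `a` and `b`."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def common_suffix_len(a: bytes, b: bytes) -> int:
    """Number of trailing bytes that are identical in `a` and `b`."""
    return common_prefix_len(a[::-1], b[::-1])


if __name__ == "__main__":
    original = sample_text(200_000)
    changed = bytearray(original)
    changed[23] ^= 0xFF  # alter exactly one byte near the start
    changed = bytes(changed)

    packed_old = zlib.compress(original)
    packed_new = zlib.compress(changed)

    print("compressed sizes:", len(packed_old), len(packed_new))
    print("identical leading bytes: ", common_prefix_len(packed_old, packed_new))
    print("identical trailing bytes:", common_suffix_len(packed_old, packed_new))
```

If the identical leading and trailing runs are both short compared with the compressed size, then almost nothing of the old compressed file can be reused, even though only one input byte changed.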

The --rsyncable option above fixes this problem. With this option, gzip regularly “resets” its compression algorithm to what it was at the beginning of the file. So if, for example, there was a change at byte 23, this change will only affect the output up to, say, byte 9999 at most. Then gzip restarts ‘at zero’, and the rest of the compressed output is the same as it would have been without the change at byte 23. This means that rsync can now re-synchronise between the old and the new compressed file, and avoid sending the portions of the file that were unmodified.

Now, for the example above, suppose “/toto” is a directory with plenty of small files for a total of 50 MB, thus the uncompressed tar file would be about 50 MB. By compressing it with gzip, we bring this down to 15 MB in the tar.gz file. Now we ‘rsync’ this file with a remote system.

If nothing has changed since yesterday in the /toto directory, the tar.gz file will be the same as yesterday, rsync will detect this and the file will not be transmitted.

On the other hand, if one single small file at the beginning of the ‘tar’ has changed, then without the --rsyncable option, most of the tar.gz file will be different, and rsync will have to transmit almost 15 MB to the remote rsync target system. In that case, it would have been better not to compress the tar file at all!

With the --rsyncable option, it is possible that only 1000 bytes would be different in the tar.gz file, so only 1000 bytes would be transmitted by rsync, for the same end-result.
