Re-compress your gzipped files to bzip2 using a Bash script (HOWTO)

If you were incredibly lucky (like me), perhaps you received an external USB
hard drive for Christmas. Or perhaps you have one lying around already,
with plenty of free space. And perhaps you also read the recent
Slashdot article about compression software and have lots of fairly
sizable gzipped files lying about.

After reading the comments in that article, I was dismayed to learn that my
compression tool of choice (gzip) has no error-recovery capabilities. While
I deem it to be the best all-around for quick backups with a decent
compression ratio, gzip will choke if it gets a data error on restore - and
there's something to be said for data integrity.

So, having this nice shiny new USB external drive and some time on my hands, I
wrote a Bash utility script to re-compress gzip files to bzip2, using the
external drive. It takes an order of magnitude longer to compress, but at
least I'll save some space and have a hope of recovering the compressed data
if things go wrong... Right??

My particular external drive is a 120-gig that came factory-formatted as a
single FAT32 partition. Now, any Linux guru worth their salt knows that this
thing practically begs to be customized, since FAT32 has a 2GB (Linux) or
4GB (Windows) file size limit - depending on who's writing to it.

So, I fired up my Knoppix HD install and repartitioned it.
Nothing fancy, just good old fdisk.

(I did make a note of the fact that the factory-default was one big type "c",
in case I needed to go back to that.)

Notice the 40GB FAT32 partition. In my other life (sssshhh!) I run Windows
2000 Professional - and was forcibly reminded that Windows 2000 and later
will not format a FAT32 partition larger than 32GB. Note that the limitation
is on formatting, not accessing. This is by design, and Microsoft has
publicly admitted it.

After going through several free Windows tools for formatting and
repartitioning (and running into a brick wall), I eventually gave up on
formatting the thing under Windows 2000. The vendor has a utility on their
website to restore the drive to factory-default partitioning, but that doesn't
really help my intended use of the drive. I could have formatted it in Windows
98, but that's no fun - and it would need a separate driver for the OS to
recognize the drive.

So, rather than give up a perfectly usable 8GB, good old Linux to the rescue
again:

$ mkdosfs -F 32 -v -n wdfat40 /dev/sdb6

and reboot.

Presto! Windows 2000 recognizes the drive just fine now, and it passes all the
chkdsk tests. And for all you dual-booters out there, a wonderful utility
exists called Ext2IFS ( http://www.fs-driver.org/ ). This
allows NT-based systems like Windows 2000 to access ext2/ext3 partitions just
like a regular drive - read/write, so no need for NTFS!

At first, I started out by writing a fairly basic script with a simple
function call and manually-entered filenames. Then I sat down and took
another look at it - and practically rewrote it from scratch, with some
features that occurred to me after several test runs.

rezip Current Features:

Uses a simple text file of paths and filenames for input -- so you can save
the results of "find" to a file, run rezip, and the files will be
re-compressed one at a time, with a running log and no user intervention (as
long as there's free space on the destination drive.) Example:

$ find /mnt/bkps -name \*.gz > ~/rezipp-files.txt && rezip

Automatically sorts the files to process by size, so the biggest files are
last. This allows more work to get done up front. (Believe me, this is a
consideration when your fastest computer is a 900MHz AMD Duron)

Skips files less than 50MB in size (user defined)

Recreates existing directory structure on the external drive and leaves the
original .gz file in place

By default, does not overwrite existing .bz2 files so previous work doesn't
get run over. This feature was added after I found a bug where ^C won't stop
the script right away, and several hours of .bz2 output were lost. :(

Note: if you abort the script and then re-run it, you have to manually
delete the last (partial) .bz2 file it was working on, or that will be skipped
as well. This is where the log comes in handy. :)

Heavily commented and fairly easy-to-understand (I hope!) source code

Generates a log file, including start/end times per-file

...And last but not least, rezip is released under a GPL license. :)
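
To make the feature list above a bit more concrete, here is a minimal sketch
of how such a loop could be put together. It is NOT the actual rezip source:
the names min_bytes, destroot and recompress_list are made up for
illustration, and /mnt/usbdrive is only an assumed mount point. It sorts the
listed files by size, skips the small ones, mirrors the directory structure
on the destination, and refuses to overwrite an existing .bz2.

```shell
#!/bin/bash
# Illustrative sketch only - NOT the actual rezip source.
# min_bytes, destroot and recompress_list are made-up names.

min_bytes=$((50 * 1024 * 1024))   # skip files smaller than 50MB
destroot=/mnt/usbdrive            # assumed mount point of the external drive

recompress_list() {
    local list=$1 src size dest
    # Prefix each path with its size in bytes, then sort numerically so
    # the smallest files are processed first and the biggest last.
    while IFS= read -r src; do
        [ -f "$src" ] || continue
        printf '%s\t%s\n' "$(( $(wc -c < "$src") ))" "$src"
    done < "$list" | sort -n |
    while IFS=$'\t' read -r size src; do
        if [ "$size" -lt "$min_bytes" ]; then
            echo "SKIP (too small): $src"
            continue
        fi
        # Mirror the original directory structure under destroot.
        dest="$destroot/${src#/}"
        dest="${dest%.gz}.bz2"
        if [ -e "$dest" ]; then
            echo "SKIP (exists): $dest"
            continue
        fi
        mkdir -p "$(dirname "$dest")"
        echo "START $(date '+%Y-%m-%d %H:%M:%S') $src"
        # The original .gz is only read, so it stays in place.
        gzip -dc "$src" | bzip2 -9 > "$dest"
        echo "END   $(date '+%Y-%m-%d %H:%M:%S') $dest"
    done
}
```

With those definitions loaded, something like
"recompress_list ~/rezipp-files.txt >> rezip.log" would process the list and
append to a running log. Paths containing tabs or newlines would confuse this
sketch.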

-- KNOWN BUG(s):

The PROPER way to kill "rezip" while it is running is to press Ctrl-Z, then
type

$ kill %jobnumber

-- Example:

^Z
[1]+ Stopped rezip
$ kill %1
[1]+ Terminated rezip

If you DON'T do it that way, trust me - wacky things can happen. For example,
the script will skip to the next file, and gzip/bzip2 will still be running in
the background. Don't use ^C.
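
One possible way to soften this is a signal trap that deletes the
half-written output and exits. The following is only a sketch of that idea,
and has not been checked against rezip itself: current_out, cleanup and
recompress_one are hypothetical names, and it assumes the gzip/bzip2 pipeline
runs in the foreground of the script.

```shell
#!/bin/bash
# Hypothetical sketch, not taken from rezip: remove the partial .bz2
# when the script is interrupted, so it is not skipped on the next run.

current_out=""     # path of the .bz2 being written; empty when idle

cleanup() {
    # Only delete something if a recompress was actually in progress.
    [ -n "$current_out" ] && rm -f -- "$current_out"
    current_out=""
}

# On Ctrl-C (INT) or kill (TERM): clean up, then exit with status 130.
trap 'cleanup; exit 130' INT TERM

recompress_one() {
    local src=$1 dest=$2 rc
    current_out=$dest
    gzip -dc "$src" | bzip2 -9 > "$dest"
    rc=$?
    current_out=""   # finished: nothing left to clean up
    return "$rc"
}
```

The design choice here is to remember the file currently being written in a
variable, so the trap never touches .bz2 files that were already completed.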

The logger function (logecho) has trouble echoing stars (" * "), even when
they are quoted.
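
A likely cause, though this is an assumption since it depends on how logecho
is written, is an unquoted variable expansion inside the function: the caller
quotes the star, but the function then expands it unquoted and the shell
replaces it with filenames. The sketch below shows the difference; both
functions and the logfile variable are hypothetical stand-ins, not the real
logecho.

```shell
#!/bin/bash
# Hypothetical reconstruction; the real logecho may look different.

logfile=rezip.log    # assumed log file name

# Problem case: $1 is unquoted, so the shell word-splits it and expands
# any * into the names of the files in the current directory.
logecho_unquoted() {
    echo $1 | tee -a "$logfile"
}

# Quoted version: the text is passed through exactly as given.
logecho() {
    printf '%s\n' "$1" | tee -a "$logfile"
}
```

Calling logecho " * done * " with the second version should print the stars
as typed, both on screen and in the log.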

The log file can get fairly large after several runs. If you want to reset
it, either "rm" it or

$ >rezip.log

will reset it to 0 length.

WARNING: If .gz files that were listed in rezipp-files.txt
are deleted/moved between runs, you MUST re-do the "find" before
re-running. Otherwise, unexpected results will probably occur.

I tried adding a feature to log a failed recompress, after a test run
encountered a bad .gz file. (This was a pain, and required several re-runs
with a short, known-bad gzip file, looking up things in the bash man page, and
much experimentation. It logs the error now, but fails to notify the user
that the job failed.)

To create a known-bad .gz file of your own to test:

$ dd if=any-gz-file-more-than-20MB.gz of=KNOWNBAD.gz bs=1M count=21

This creates a .gz file that is a partial copy of the complete one, and will
cause gzip to abend with "unexpected end of file." Redo your "find" to
include it, set the "skipsize" variable to 20000, and run rezip; it should
log the error. If you can fix the script so that it notifies the user as
well, let me know. ;-)
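
For anyone who wants to take a crack at it, here is one rough way the
notification could work, using bash's PIPESTATUS array to check both halves
of the pipeline. This is a sketch under assumptions, not a patch for rezip:
recompress_checked and logfile are hypothetical names, and deleting the
failed output is a choice made here so that a later run does not skip it.

```shell
#!/bin/bash
# Hypothetical sketch: report a failed recompress on stderr AND in the log.

logfile=rezip.log    # assumed log file name

recompress_checked() {
    local src=$1 dest=$2 st
    # gzip's own error messages go to the log.
    gzip -dc "$src" 2>>"$logfile" | bzip2 -9 > "$dest"
    # Copy PIPESTATUS right away; the next command overwrites it.
    st=("${PIPESTATUS[@]}")
    if [ "${st[0]}" -ne 0 ] || [ "${st[1]}" -ne 0 ]; then
        rm -f -- "$dest"     # don't leave a misleading partial .bz2 behind
        echo "FAILED: $src (gzip=${st[0]} bzip2=${st[1]})" |
            tee -a "$logfile" >&2
        return 1
    fi
    echo "OK: $src" >> "$logfile"
}
```

A caller can then test the return value, for example
"recompress_checked in.gz out.bz2 || failures=1", and print a summary at the
end of the run.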

During the course of writing the script, I had hard-coded most of the
defaults, such as the size of files to skip, the log file name, etc. These
were eventually changed to be variables before the script was published for LG
- so that you, the end-user, can have More Control (TM) over its actions. ;-)

I encourage everyone to READ THE SOURCE CODE before running rezip. You may
find it handy to view it in an editor that colorizes or highlights executable
syntax, such as ' mcedit ' or ' jstar '.

Bio: Born in 1972, Dave Bechtel grew up programming in BASIC with Apple ][e's,
TI-99/4A, IBM PC (640K!) and a Tandy 1000SX, none of which actually had hard
drives -- 360K floppy only. And we LIKED IT! ;-)

Eventually left BASIC behind, and moved on to programming in REXX and Bash.

Got interested in Linux around 1997. Started with Red Hat and went on to
SuSE, tried several other distros and a *BSD or two, and has now settled on
Knoppix/Debian/Ubuntu, in roughly that order. Currently living in Lake
Zurich, IL.