Tar and Z

Over the weekend (before I picked up my “regular” files), I started looking at Steve Mosher’s use of raster and zoo – both of which intrigue me a great deal – but got sidetracked by something else and ended up finally figuring out how to extract .Z files within an R script without having to handle them manually. (R has utilities for .zip and .gz files, but not the older .Z format.) This is nothing more than a nuisance with GHCN, which has only one .Z file to worry about, but it was a big problem with the very large ICOADS SST data, where every month of data is in its own .Z file and manual processing isn’t a realistic alternative. It’s further complicated by the fact that the 12 monthly .Z files for each year are packaged into an annual .tar file.

On many occasions, I’ve expressed regret at the disproportionate interest in land datasets relative to SST datasets, which are proportionately more important but about which there’s been negligible third-party analysis. We’ve all raised our eyebrows a little at the bucket adjustments, but I, for one, hadn’t handled the original data. For a start, the data sets were too big for the computer that I had a couple of years ago, but they are practical on my new computer (though, if I were doing much on this, I’d need to upgrade).

A couple of years ago, CA reader Nicholas experimented with extracting .Z files in R, contributing the package compress, which, unfortunately, I couldn’t get to work on the GHCN data set. I lost interest in the issue at the time, but the inability to handle .Z files automatically has been at the back of my mind for a while.

While browsing through some sites, I noticed the following comment at gzip.org:

gunzip can decompress files created by gzip, compress or pack. The detection of the input format is automatic.

.Z files are produced by compress. So maybe, I thought, a simple expedient for extracting .Z files was staring me in the face. I’d gone through the process of installing a gunzip.exe command on my old computer but hadn’t done it on my new computer and had to retrace my steps. I found a version of gzip here http://www.powerbasic.com/files/pub/tools/win32/gzip124xN.zip – there are other versions around. You have to unzip this file, which yields gzip.exe, but not gunzip.exe. I was stumped by this for a while. A webpage here explains:

Note that this archive contains only gzip.exe — to get gunzip.exe, you must copy gzip.exe to gunzip.exe (a silly Unix trick– don’t ask).

I uploaded a version of gunzip.exe to climateaudit.info to save others the labor. The following short script downloads the .Z file for GHCN and the gunzip.exe program to the working directory. It worked like a champ for me:
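The script itself isn’t reproduced above; here is a minimal sketch of the idea, in which the exact URLs and file names (the climateaudit.info path and a GHCN v2.mean.Z location) are illustrative rather than the originals:

```r
# Sketch only -- the URLs and file names below are illustrative.
# Download gunzip.exe and the GHCN .Z file into the working directory,
# then shell out to gunzip, which handles compress-format .Z files.
download.file("http://www.climateaudit.info/data/gunzip.exe",
              "gunzip.exe", mode = "wb")
download.file("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.mean.Z",
              "v2.mean.Z", mode = "wb")
system("gunzip v2.mean.Z")   # leaves the decompressed v2.mean in place
```

The key point is simply that `system()` lets R invoke the gunzip.exe binary sitting in the working directory, so no manual step is needed.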

For GHCN this is a nice-to-have rather than a need-to-have, but it’s essential for ICOADS. I checked it out on the .Z files extracted from the ICOADS .tar files and again it worked fine. The .Z files have to be in the working directory for the R system() command to work.
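The annual .tar files themselves can be unpacked from within R using `untar()`; the tar file name here is a placeholder, not an actual ICOADS file name:

```r
# Unpack one annual ICOADS tar file into the working directory,
# yielding its twelve monthly .Z members (file name is illustrative).
untar("icoads.1950.tar", exdir = ".")
list.files(pattern = "\\.Z$")   # the monthly .Z files, ready for gunzip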

Each of the monthly .Z files could then be unzipped. To do so, I found it convenient to rename each .Z file in turn to temp.Z, unzip it, read an R object, and then remove the file, saving the read R object (which takes up less space than the original .Z file anyway). The following read instruction reads each of the .Z files and then saves the result by year.
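The read instruction isn’t shown above; a sketch of the per-month loop it describes is below, where `readicoads()` stands in for the (hypothetical) function that parses the fixed-format ICOADS records using the info file:

```r
# Sketch of the rename / gunzip / read / remove cycle described above.
# readicoads() is a placeholder for the actual format-reading function.
for (file in list.files(pattern = "\\.Z$")) {
  file.rename(file, "temp.Z")
  system("gunzip -f temp.Z")          # yields "temp" in the working directory
  month <- readicoads("temp")         # parse the month's ICOADS records
  file.remove("temp")
  save(month, file = sub("\\.Z$", ".tab", file))  # smaller than the .Z file
}
```

Saving each month as an R object means the .Z and .tar files can be discarded once read.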

The above assumes that information about ICOADS formats has been extracted into an info file info. This is not a small job (see here). I downloaded and extracted data from 1900 to 1980 over the last couple of days (tar files go from a few MB in 1900 to over 1 GB in 2006). I got some odd results, which I’ll mention in another post, but will have to move on, as looking through SST data is a big analytic job.

On Linux, if it is just a gnu zip file: gunzip file_name
If it is a gzipped tar ball: tar xvfz file_name
If it is a bz2 tar ball: tar xvfj file_name

On most Unix machines you have to do it in 2 steps and pipe them together.

Steve: The issue was doing this within R, as opposed to doing it in Unix. The problem wasn’t figuring out the Unix commands – that was the easiest part. The problem was figuring out how to run Unix commands within an R environment – which is what I use and which has many advantages for analysis purposes. Also, it wasn’t something that I was working on. I don’t feel bad about not figuring this out earlier, since I’d asked a couple of very good programmers how to do this within an R environment, without success.

I think the core of the problem is that you’re running a Windows port of R. R was originally developed on Linux, where decompressing files within R is easy because the system() command can call binaries that are already present by default on any Linux system. But you have what you need installed, so problem solved.
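One way to see the difference from within R is to check whether a gunzip binary is visible before shelling out; this is a sketch, not part of the original post:

```r
# Sys.which() reports where (if anywhere) R can find an executable.
# On a stock Linux box this returns a path; on a bare Windows R
# install it returns "" until gunzip.exe is placed somewhere findable.
Sys.which("gunzip")
if (nzchar(Sys.which("gunzip"))) system("gunzip temp.Z")
```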

Code that relies on remote data really needs to calculate a hash on that data and record it along with other (more interesting) results. The remote data might change. If someone downloads the code (a year after it was published, say) and gets different results then, absent the hash, it may not be obvious that this is due to a change in the remote data. The md5deep utility provides the relevant functionality and is freely available for all operating systems.
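md5deep is one option; within R itself the same record-keeping can be done with `tools::md5sum()`. A sketch, with an illustrative file name:

```r
library(tools)

# Record a checksum of the downloaded file alongside the results, so a
# later re-run can detect whether the remote data has changed.
hash <- md5sum("v2.mean.Z")   # named character vector: hash, named by path
cat(sprintf("%s  %s\n", hash, names(hash)),
    file = "checksums.txt", append = TRUE)
```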

When ccc-gistemp had this problem (decompressing a .Z file in Windows), I read the C code for the OS X version of compress and reimplemented the same algorithm in Python. Such fun.

Steve: I asked one climate group to provide in.gz, but they didn’t do anything. The method set out here works fine – so I’ve worked around the problem.

BTW, I’d emulated the GISTEMP steps quite a long way in R in 2007 and placed the code online. Their so-called UHI adjustment seemed totally ineffective, as it effectively presumed that other inhomogeneities had already been resolved.