Compressing and Decompressing Streams

The Inflater and Deflater classes are a little raw. It would be more convenient to write uncompressed data onto an output stream and have the stream compress it, without worrying about the mechanics of deflation. Similarly, it would be useful to have an input stream class that could read from a compressed file but return the uncompressed data. Java, in fact, has several classes that do exactly this. The java.util.zip.DeflaterOutputStream class is a filter stream that compresses the data it receives into the deflate format before writing it out to the underlying stream. The java.util.zip.InflaterInputStream class inflates deflated data before passing it to the reading program. java.util.zip.GZIPInputStream and java.util.zip.GZIPOutputStream do the same thing, except that they use the gzip format.

10.2.1. The DeflaterOutputStream Class

DeflaterOutputStream is a filter stream that deflates data before writing it onto the underlying stream:

public class DeflaterOutputStream extends FilterOutputStream

Each stream uses a protected Deflater object called def to compress data stored in a protected internal buffer called buf:

protected Deflater def;
protected byte[] buf;

The same deflater must not be used in multiple streams at the same time, though Java takes no steps to guarantee that this won't happen.

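The underlying output stream that receives the deflated data, the deflater object def, and the length of the byte array buf are all set by one of the three DeflaterOutputStream constructors:

public DeflaterOutputStream(OutputStream out)
public DeflaterOutputStream(OutputStream out, Deflater def)
public DeflaterOutputStream(OutputStream out, Deflater def, int bufferLength)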

The underlying output stream must be specified. The buffer length defaults to 512 bytes, and the Deflater defaults to the default compression level, strategy, and method. Of course, the DeflaterOutputStream has all the usual output stream methods such as write( ), flush( ), and close( ). It overrides three of these methods, but as a client programmer, you don't use them any differently than you would in any other output stream.

There's also one new method, finish( ), which finishes writing the compressed data onto the underlying output stream but does not close the underlying stream:

public void finish( ) throws IOException

Example 10-3 is a simple character-mode program that deflates files. Filenames are read from the command line. A file input stream is opened to each file; a file output stream is opened to that same filename with the extension .dfl (for deflated). Finally, the file output stream is chained to a deflater output stream, and a stream copier pours the data from the input file into the output file.
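The steps just described can be sketched roughly as follows. This is a minimal illustration, not the original listing: the class name FileDeflaterSketch and the copy( ) helper are hypothetical, and copy( ) merely stands in for the stream copier mentioned above.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

// Hypothetical sketch: deflates each file named on the command line.
class FileDeflaterSketch {

    static final String DEFLATE_SUFFIX = ".dfl";

    // Deflates one file into a file with the same name plus ".dfl".
    static void deflateFile(String name) throws IOException {
        try (InputStream in = new FileInputStream(name);
             OutputStream out = new DeflaterOutputStream(
                     new FileOutputStream(name + DEFLATE_SUFFIX))) {
            copy(in, out);
        } // closing the deflater stream finishes the compressed data first
    }

    // Stand-in for the stream copier: pours everything from in into out.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    public static void main(String[] args) {
        for (String name : args) {
            try {
                deflateFile(name);
            } catch (IOException ex) {
                System.err.println(name + ": " + ex);
            }
        }
    }
}
```

Because close( ) on the deflater output stream finishes the compressed data before closing the file, the sketch never needs to call finish( ) explicitly.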

This program is a lot simpler than Example 10-1, even though the two programs do the same thing. In general, a DeflaterOutputStream is preferable to a raw Deflater object for reasons of simplicity and legibility, especially if you want the default strategy, algorithm, and compression level. However, using the Deflater class directly does give you more control over the strategy, algorithm, and compression level. You can get the best of both worlds by passing a custom-configured Deflater object as the second argument to the DeflaterOutputStream( ) constructor.
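As a rough illustration of that last point, the sketch below configures a Deflater and hands it to the stream constructor. The class and method names are hypothetical, and the level and strategy chosen in main( ) are arbitrary examples rather than recommendations.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical sketch: a custom-configured Deflater driving a DeflaterOutputStream.
class CustomDeflaterSketch {

    // Compresses data in memory using the given level and strategy.
    static byte[] compress(byte[] data, int level, int strategy) throws IOException {
        Deflater deflater = new Deflater(level);
        deflater.setStrategy(strategy);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink, deflater)) {
            out.write(data);
        } finally {
            // The stream does not release a Deflater that the caller supplied.
            deflater.end();
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "to be or not to be, that is the question"
                .getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(text, Deflater.BEST_COMPRESSION, Deflater.HUFFMAN_ONLY);
        System.out.println(text.length + " bytes in, " + packed.length + " bytes out");
    }
}
```

The caller owns the Deflater it passes in, so the sketch calls end( ) itself once the stream has been closed.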

10.2.2. The InflaterInputStream Class

The InflaterInputStream class is a filter stream that inflates data while reading it from the underlying stream.

public class InflaterInputStream extends FilterInputStream

Each inflater input stream uses a protected Inflater object called inf to decompress data that is stored in a protected internal byte array called buf. There's also a protected int field called len that (unreliably) stores the number of bytes currently in the buffer, as opposed to storing the length of the buffer itself.

protected Inflater inf;
protected byte[] buf;
protected int len;

The same Inflater object must not be used in multiple streams at the same time.

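The underlying input stream from which deflated data is read, the Inflater object inf, and the length of the byte array buf are all set by one of the three InflaterInputStream( ) constructors:

public InflaterInputStream(InputStream in)
public InflaterInputStream(InputStream in, Inflater inf)
public InflaterInputStream(InputStream in, Inflater inf, int bufferLength)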

The underlying input stream must be specified, but the buffer length defaults to 512 bytes and the Inflater defaults to an inflater for deflated streams (as opposed to zipped or gzipped streams). Of course, the InflaterInputStream has all the usual input stream methods such as read( ), available( ), and close( ). It overrides the following three methods:

public int read( ) throws IOException
public int read(byte[] data, int offset, int length) throws IOException
public long skip(long n) throws IOException

For the most part, you use these the same way you'd use any read( ) or skip( ) method. However, it's occasionally useful to know that the read method throws a subclass of IOException, java.util.zip.ZipException, if the data doesn't adhere to the expected format. You should also know that read( ), skip( ), and all other input stream methods count the uncompressed bytes, not the compressed raw bytes that were actually read.
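Both points can be checked with a small experiment. The following is only a sketch with hypothetical class and method names: one method reports whether a byte array inflates without a ZipException, and the other skips n uncompressed bytes and returns the byte that follows.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical sketch: format errors and byte counting in InflaterInputStream.
class InflateChecks {

    // Returns false if read( ) rejects the data with a ZipException.
    // Other I/O errors, such as truncated input, still propagate.
    static boolean isValidDeflated(byte[] candidate) throws IOException {
        try (InputStream in = new InflaterInputStream(
                new ByteArrayInputStream(candidate))) {
            byte[] scratch = new byte[1024];
            while (in.read(scratch) != -1) {
                // discard the inflated bytes
            }
            return true;
        } catch (ZipException ex) {
            return false;
        }
    }

    // Skips n uncompressed bytes, then returns the next uncompressed byte
    // (or -1 at end of stream).
    static int readAfterSkip(byte[] deflated, long n) throws IOException {
        try (InputStream in = new InflaterInputStream(
                new ByteArrayInputStream(deflated))) {
            long skipped = 0;
            while (skipped < n) {
                long s = in.skip(n - skipped);
                if (s <= 0) {
                    break;
                }
                skipped += s;
            }
            return in.read();
        }
    }
}
```

If the deflated form of the ten bytes "abcdefghij" is passed to readAfterSkip( ) with n equal to 3, the result is the byte 'd', because the skip is measured in uncompressed bytes.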

Example 10-4 is a simple character-mode program that inflates files. Filenames are read from the command line. A file input stream is opened from each file that ends in .dfl, and this stream is chained to an inflater input stream. A file output stream is opened to that same file minus the .dfl extension. Finally, a stream copier pours the data from the input file through the inflating stream into the output file.
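A program along these lines might look like the sketch below. It is an illustration rather than the original listing; the class name, the copy( ) helper standing in for the stream copier, and the choice to skip files without the .dfl suffix are all assumptions.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch: inflates each .dfl file named on the command line.
class FileInflaterSketch {

    static final String DEFLATE_SUFFIX = ".dfl";

    // Inflates name into the same filename minus ".dfl".
    // Returns false, doing nothing, if the name does not end in ".dfl".
    static boolean inflateFile(String name) throws IOException {
        if (!name.endsWith(DEFLATE_SUFFIX)
                || name.length() == DEFLATE_SUFFIX.length()) {
            return false;
        }
        String target = name.substring(0, name.length() - DEFLATE_SUFFIX.length());
        try (InputStream in = new InflaterInputStream(new FileInputStream(name));
             OutputStream out = new FileOutputStream(target)) {
            copy(in, out);
        }
        return true;
    }

    // Stand-in for the stream copier: pours everything from in into out.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    public static void main(String[] args) {
        for (String name : args) {
            try {
                if (!inflateFile(name)) {
                    System.err.println(name + ": skipped (no " + DEFLATE_SUFFIX + " suffix)");
                }
            } catch (IOException ex) {
                System.err.println(name + ": " + ex);
            }
        }
    }
}
```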

10.2.3. The GZIPOutputStream Class

Although zip files deflate their entries, raw deflated files are uncommon. More common are gzipped files. These are deflated files wrapped in a header and a trailer. The header can record the name of the compressed file, the time the file was last modified, and other information; the trailer carries a checksum for the contents. The java.util.zip.GZIPOutputStream class is a subclass of DeflaterOutputStream that understands when and how to write this extra information to the output stream.

public class GZIPOutputStream extends DeflaterOutputStream

GZIPOutputStream has two constructors. Since GZIPOutputStream is a filter stream, both constructors take an underlying output stream as an argument. The second constructor also allows you to specify a buffer size. (The first uses a default buffer size of 512 bytes.)

Data is written onto a gzip output stream as onto any other stream, typically with the write( ) methods. However, some of the data may be temporarily stored in the input buffer until more data is available. At that point, the data is compressed and written onto the underlying output stream. Therefore, when you are finished writing the data that you want to be compressed onto the stream, you should call finish( ):

public void finish( ) throws IOException

This writes all remaining data in the buffer onto the underlying output stream. It then writes a trailer containing a CRC value and the number of uncompressed bytes stored in the file onto the stream. This trailer is defined by the gzip format and does not appear in a raw deflated file. If you're through with the underlying stream as well as the gzip output stream, call close( ) instead of finish( ). If the stream hasn't yet been finished, close( ) finishes it, then closes the underlying output stream. From this point on, data may not be written to that stream.

public void close( ) throws IOException
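The difference between finish( ) and close( ) is easiest to see when the underlying stream is still needed afterward. The sketch below, with hypothetical names throughout, gzips a payload, calls finish( ), and then writes a few uncompressed bytes directly onto the same underlying stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: finish( ) completes the gzip data but leaves the
// underlying stream open for further writes.
class GzipFinishSketch {

    // Returns the gzipped payload followed by plainSuffix, uncompressed.
    static byte[] gzipThenAppend(byte[] payload, byte[] plainSuffix) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(sink);
        gzip.write(payload);
        gzip.finish();           // remaining compressed data and the trailer go out
        sink.write(plainSuffix); // the underlying stream is still usable
        gzip.close();            // already finished, so this only closes down
        return sink.toByteArray();
    }
}
```

In the returned array, the eight bytes immediately before the appended suffix are the gzip trailer: a four-byte CRC followed by the four-byte uncompressed length.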

Example 10-5 is a simple command-line program that reads a list of files from the command line and gzips each one. A file input stream reads each file. A file output stream chained to a gzip output stream writes each output file. The gzipped files have the same name as the input files plus the suffix .gz.
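A minimal sketch of such a program follows. It is not the original listing; the class name GZipperSketch and the copy( ) helper are hypothetical.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: gzips each file named on the command line.
class GZipperSketch {

    static final String GZIP_SUFFIX = ".gz";

    // Gzips one file into a file with the same name plus ".gz".
    static void gzipFile(String name) throws IOException {
        try (InputStream in = new FileInputStream(name);
             OutputStream out = new GZIPOutputStream(
                     new FileOutputStream(name + GZIP_SUFFIX))) {
            copy(in, out);
        } // close( ) finishes the gzip data, trailer included, before closing the file
    }

    // Pours everything from in into out.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    public static void main(String[] args) {
        for (String name : args) {
            try {
                gzipFile(name);
            } catch (IOException ex) {
                System.err.println(name + ": " + ex);
            }
        }
    }
}
```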

If this looks similar to Example 10-3, that's because it is. All that has changed is the compression format (gzip instead of deflate) and the compressed file suffix. However, since gzip and gunzip, unlike raw deflate, are available on virtually all operating systems, you can test this code by unzipping the files it produces with the Free Software Foundation's (FSF) gunzip or some other program that handles gzipped files.

10.2.4. The GZIPInputStream Class

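The java.util.zip.GZIPInputStream class is a subclass of InflaterInputStream that provides a very simple interface for decompressing gzipped data:

public GZIPInputStream(InputStream in) throws IOException
public GZIPInputStream(InputStream in, int bufferLength) throws IOException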

Since this is a filter stream, both constructors take an underlying input stream as an argument. The second constructor also accepts a length for the buffer into which the compressed data will be read. Otherwise, GZIPInputStream has the usual methods of an input stream: read( ), skip( ), close( ), mark( ), reset( ), and others. Marking and resetting are not supported. read( ) and close( ) are overridden:

public int read(byte[] data, int offset, int length) throws IOException
public void close( ) throws IOException

These methods work exactly like the superclass methods they override. The only thing you need to be aware of is that the read( ) method blocks until sufficient data is available in the buffer to allow decompression.

Example 10-6 shows how easy it is to decompress gzipped data with GZIPInputStream. The main( ) method reads a series of filenames from the command line. A FileInputStream object is created for each file and a GZIPInputStream is chained to that. The data is read from the file, and the decompressed data is written into a new file with the same name minus the .gz suffix. (A more robust implementation would handle the case where the suffix is not .gz.) You can test this program with files gzipped by Example 10-5 and with files gzipped by the FSF's gzip program.
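The following sketch shows one way such a program could be written. It is not the original listing: the class and helper names are hypothetical, and skipping files that lack the .gz suffix is just one possible way to handle that case.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch: gunzips each .gz file named on the command line.
class GUnzipperSketch {

    static final String GZIP_SUFFIX = ".gz";

    // Decompresses name into the same filename minus ".gz".
    // Returns false, doing nothing, if the name does not end in ".gz".
    static boolean gunzipFile(String name) throws IOException {
        if (!name.endsWith(GZIP_SUFFIX) || name.length() == GZIP_SUFFIX.length()) {
            return false;
        }
        String target = name.substring(0, name.length() - GZIP_SUFFIX.length());
        try (InputStream in = new GZIPInputStream(new FileInputStream(name));
             OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String name : args) {
            try {
                if (!gunzipFile(name)) {
                    System.err.println(name + ": skipped (no " + GZIP_SUFFIX + " suffix)");
                }
            } catch (IOException ex) {
                System.err.println(name + ": " + ex);
            }
        }
    }
}
```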

You may have noticed that the compression stream classes described here are not fully symmetrical. You can expand the data being read from an input stream, and you can compress data being written to an output stream, but none of these classes compresses data being read from an input stream or expands data being written to an output stream. (Java 6 added java.util.zip.DeflaterInputStream and java.util.zip.InflaterOutputStream for those cases.) Such classes aren't commonly needed. It's possible that you might want to read compressed data from a file and write uncompressed data onto the network, but as long as there are an input stream and an output stream, you can always put the compressor on the output stream or the decompressor on the input stream. In either case, the compressor and decompressor fall between the two underlying streams, so how they're chained doesn't really matter.

Alternatively, you may have some reason to work with compressed data in memory; for example, your application might find it more efficient to store large chunks of text in compressed form. In this case, a deflater output stream chained to a byte array output stream will do the trick.
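The in-memory case might be sketched as follows. The CompressedText class is hypothetical, and encoding the text as UTF-8 is an assumption made for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch: keeps a chunk of text in memory in deflated form.
class CompressedText {

    private final byte[] deflated;

    CompressedText(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // The deflater stream compresses; the byte array stream collects the result.
        try (OutputStream out = new DeflaterOutputStream(sink)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        this.deflated = sink.toByteArray();
    }

    // Number of bytes actually held in memory.
    int storedSize() {
        return deflated.length;
    }

    // Inflates the stored bytes back into the original text.
    String text() throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(
                new ByteArrayInputStream(deflated))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                result.write(buffer, 0, n);
            }
        }
        return new String(result.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Whether this saves memory depends on the data: repetitive text shrinks considerably, while very short or already compressed content can come out larger than it went in.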