Zip Files

Gzip and deflate are compression formats. Zip is both a compression and an archive format. This means that a single zip file may contain more than one uncompressed file, along with information about the names, permissions, creation and modification dates, and other information about each file in the archive. This makes reading and writing zip archives somewhat more complex and somewhat less amenable to a stream metaphor than reading and writing deflated or gzipped files.

The java.util.zip.ZipFile class represents a file in the zip format. Such a file might be created by zip, PKZip, WinZip, or any of the many other zip programs. The java.util.zip.ZipEntry class represents a single file stored in such an archive.

The java.util.zip.ZipConstants interface that both these classes implement is a rare, nonpublic interface that contains constants useful for reading and writing zip files. Most of these constants define the positions in a zip file where particular information, like the compression method used, is found. You don't need to concern yourself with it.

The ZipFile class contains three constructors. The first takes a filename as an argument. The second takes a java.io.File object as an argument. The third takes a File object and a mode indicating whether or not the file is to be deleted. This mode should be either ZipFile.OPEN_READ or the combination ZipFile.OPEN_READ | ZipFile.OPEN_DELETE. If you specify OPEN_DELETE, the file will be deleted automatically sometime after you open it and before you close it. However, you'll still be able to read its contents until you close the ZipFile or the application exits. File objects will be discussed in Chapter 17. For now, I'll just use the constructor that accepts a filename. Functionally, these constructors are similar.
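As a rough sketch of the delete-on-open mode, the following helper opens a temporary archive with the java.util.zip.ZipFile constants OPEN_READ and OPEN_DELETE and counts its entries. The class and method names (TempZipReader, countEntriesAndDelete) are invented for illustration only.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

class TempZipReader {
    // Opens a zip file that should not outlive this call and returns
    // the number of entries it contains. (Hypothetical helper.)
    static int countEntriesAndDelete(File file) throws IOException {
        ZipFile zf = new ZipFile(file, ZipFile.OPEN_READ | ZipFile.OPEN_DELETE);
        try {
            return zf.size();
        } finally {
            // The file is removed from disk no later than this call.
            zf.close();
        }
    }
}
```

This mode is mostly useful for archives your own program created as scratch files; pointing it at a file the user cares about would be an unpleasant surprise.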

ZipException is a subclass of IOException that indicates the data in the zip file doesn't fit the zip format. In this case, the zip exception's message will contain more details, like "invalid END header signature" or "cannot have more than one drive." While these may be useful to a zip expert, in general they indicate that the file is corrupted, and there's not much that can be done about it.

public class ZipException extends IOException

Each constructor attempts to open the specified file for random access. If the file is opened successfully with no exceptions, the entries( ) method will return a list of all the files in the archive:

public Enumeration entries( )

The return value is a java.util.Enumeration object containing one java.util.zip.ZipEntry object for each file in the archive. In Java 5, this method's signature has been genericized to make that a tad more obvious:

public Enumeration<? extends ZipEntry> entries( )

Example 10-7 lists the entries in a zip file specified on the command line. The toString( ) method is used implicitly to provide the name for each zip entry in the list.
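The following is a minimal sketch in the spirit of such a lister, not a reproduction of Example 10-7. The class name ZipListerSketch and the listNames( ) helper are my own; the only library behavior it relies on is that ZipEntry.toString( ) returns the entry's name.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ZipListerSketch {
    // Collects the name of every entry in the named zip file.
    static List<String> listNames(String zipName) throws IOException {
        List<String> names = new ArrayList<String>();
        ZipFile zf = new ZipFile(zipName);
        try {
            Enumeration<? extends ZipEntry> e = zf.entries();
            while (e.hasMoreElements()) {
                // ZipEntry.toString() returns the same string as getName()
                names.add(e.nextElement().toString());
            }
        } finally {
            zf.close();
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            for (String name : listNames(arg)) {
                System.out.println(name);
            }
        }
    }
}
```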

To get a single entry in the zip file rather than a list of the entire contents, pass the name of the entry to the getEntry( ) method:

public ZipEntry getEntry(String name)

Of course, this requires you to know the name of the entry in advance. The name is simply the path and filename, such as java/io/ObjectInputValidation.class. For example, to retrieve the zip entry for java/io/ObjectInputValidation.class from the ZipFile zf, you might write:

ZipEntry ze = zf.getEntry("java/io/ObjectInputValidation.class");

You can also get the name with the getName( ) method of the ZipEntry class, discussed later in this chapter. This method, however, requires you to have a ZipEntry object already, so there's a little chicken-and-egg problem here.

Most of the time, you'll want more than the names of the files in the archive. You can get the actual contents of the zip entry using getInputStream( ):

public InputStream getInputStream(ZipEntry ze) throws IOException

This returns an input stream from which you can read the uncompressed contents of the zip entry (file). Example 10-8 is a simple unzip program that uses this input stream to unpack a zip archive named on the command line.

This is not an ideal unzip program. For one thing, it blindly overwrites any files that already exist with the same name in the current directory. Before creating a new file, it should check to see if one exists and, if it does, ask whether the user wants to overwrite it. Furthermore, it can unzip files only into existing directories. If the archive contains a file in a directory that does not exist, a FileNotFoundException is thrown. Both problems are completely fixable, but fixing them requires the java.io.File class. You'll learn about this in Chapter 17.
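Setting the filesystem questions aside, the core of reading one entry is small. Here is a sketch, under my own naming (EntryReader, readEntry), that looks up a single entry and copies its uncompressed bytes into memory rather than onto disk. It assumes the entry is small enough to hold in a byte array.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class EntryReader {
    // Returns the uncompressed bytes of one entry,
    // or null if the archive has no entry with that name.
    static byte[] readEntry(String zipName, String entryName) throws IOException {
        ZipFile zf = new ZipFile(zipName);
        try {
            ZipEntry ze = zf.getEntry(entryName);
            if (ze == null) return null;          // getEntry() signals "not found" with null
            InputStream in = zf.getInputStream(ze);
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);      // bytes arrive already decompressed
                }
                return out.toByteArray();
            } finally {
                in.close();
            }
        } finally {
            zf.close();
        }
    }
}
```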

Finally, two utility methods in java.util.zip.ZipFile relate to the "File" part of ZipFile rather than the "Zip" part:

public String getName( )
public void close( ) throws IOException

The getName( ) method returns the full path to the file; for example, /usr/local/java/lib/classes.jar. The close( ) method closes the zip file. Even after a file is closed, you can still get an entry or an input stream because the entries are read and stored in memory when the ZipFile object is first constructed. However, you cannot get the actual data associated with the entry. Attempts to do so will throw a NullPointerException.

10.3.1. Zip Entries

The java.util.zip.ZipEntry class represents a file stored in a zip archive. A ZipEntry object contains information about the file but not the contents of the file. Most ZipEntry objects are created by non-Java tools and retrieved from zip files using the getEntry( ) or entries( ) methods of the ZipFile class. However, if you're writing your own program to write zip files using the ZipOutputStream class, you'll need to create new ZipEntry objects with this constructor:

public ZipEntry(String name)

Normally, the name argument is the name of the file that's being placed in the archive. It should not be null, or a NullPointerException will be thrown. It is also required to be less than 65,536 bytes long (which is plenty long for a filename).

Once you have a ZipEntry object, these methods return information about it:

public String getName( )
public long getTime( )
public long getSize( )
public long getCompressedSize( )
public long getCrc( )
public int getMethod( )
public byte[] getExtra( )
public String getComment( )
public boolean isDirectory( )

The name is simply the relative path and filename stored in the archive, such as com/sun/tools/javac/v8/CommandLine.class or java/awt/Dialog.class. The time is the last time this entry was modified. It is given as a long, counting the number of milliseconds since midnight, January 1, 1970, Greenwich Mean Time. (This is not how the time is stored in the zip file, but Java converts the time before returning it.) A value of -1 indicates that the modification time is not specified. The CRC is a 32-bit cyclic redundancy code for the data that's used to determine whether or not the file is corrupt. If no CRC is included, getCrc( ) returns -1.

The size is the original, uncompressed length of the data in bytes. The compressed size is the length of the compressed data in bytes. The getSize( ) and getCompressedSize( ) methods both return -1 if the size isn't known.

getMethod( ) tells you whether or not the data is compressed; it returns 0 if the data is uncompressed, 8 if it's compressed using the deflation format, and -1 if the compression format is unknown. 0 and 8 are the mnemonic constants ZipEntry.STORED and ZipEntry.DEFLATED.

Each entry may contain an arbitrary amount of extra data. If so, this data is returned in a byte array by the getExtra( ) method. Similarly, each entry may contain an optional string comment. If it does, the getComment( ) method returns it; if it doesn't, getComment( ) returns null. Finally, the isDirectory( ) method returns true if the entry is a directory and false if it isn't.
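To make the sentinel values concrete, here is a small sketch that turns an entry's metadata into a one-line description, treating -1 as "not specified" throughout. The class and method names (EntryDescriber, describe) are hypothetical, and the output format is simply one I picked.

```java
import java.util.Date;
import java.util.zip.ZipEntry;

class EntryDescriber {
    // Builds a human-readable summary of an entry's metadata.
    static String describe(ZipEntry ze) {
        StringBuilder sb = new StringBuilder(ze.getName());
        if (ze.isDirectory()) sb.append(" (directory)");

        int method = ze.getMethod();
        if (method == ZipEntry.STORED) sb.append(", stored");
        else if (method == ZipEntry.DEFLATED) sb.append(", deflated");
        else sb.append(", method unknown");          // getMethod() returned -1

        long size = ze.getSize();
        sb.append(size == -1 ? ", size unknown" : ", " + size + " bytes");

        long crc = ze.getCrc();
        sb.append(crc == -1 ? ", CRC unknown" : ", CRC " + crc);

        long time = ze.getTime();
        if (time != -1) sb.append(", modified " + new Date(time));
        return sb.toString();
    }
}
```

Entries that come out of a ZipFile normally have all of these fields filled in. A freshly constructed ZipEntry has none of them, which is the easiest way to see the -1 cases.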

Example 10-9 is an improved ZipLister that prints information about the files in a zip archive.

$ java FancyZipLister temp.zip
test.txt was deflated at Wed Jun 11 15:57:32 EDT 1997
from 187 bytes to 98 bytes, a savings of 52.406417112299465%
Its CRC is 1981281836
ticktock.txt was deflated at Wed Jun 11 10:42:02 EDT 1997
from 1480 bytes to 405 bytes, a savings of 27.364864864864863%
Its CRC is 4103395328

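There are also six corresponding set methods, which are used to attach information to each entry you store in a zip archive. However, most of the time it's enough to let the ZipEntry class calculate these for you:

public void setTime(long time)
public void setSize(long size)
public void setCrc(long crc)
public void setMethod(int method)
public void setExtra(byte[] extra)
public void setComment(String comment)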

Java supports two zip formats, uncompressed and compressed. These are slightly less well known as stored and deflated. They correspond to the mnemonic constants ZipOutputStream.STORED and ZipOutputStream.DEFLATED:

public static final int STORED
public static final int DEFLATED

Deflated files are compressed by a Deflater object using the deflation method. Stored files are copied byte for byte into the archive without any compression. This is the right format for files that are already compressed but still need to go into the archive, such as a GIF image or an MPEG movie.

Because zip is not just a compression format like deflation or gzip but an archival format, a single zip file often contains multiple zip entries, each of which contains a deflated or stored file. Furthermore, the zip file contains a header with metainformation about the archive itself, such as the location of the entries in the archive. Therefore, it's not possible to write raw, compressed data onto the output stream. Instead, zip entries must be created for each successive file (or other sequence of data), and data must be written into the entries. The sequence of steps you must follow to write data onto a zip output stream is:

1. Construct a ZipOutputStream object from an underlying stream, most often a file output stream.
2. Set the comment for the zip file (optional).
3. Set the default compression level and method (optional).
4. Construct a ZipEntry object.
5. Set the metainformation for the zip entry.
6. Put the zip entry in the archive.
7. Write the entry's data onto the output stream.
8. Close the zip entry (optional).
9. Repeat steps 4 through 8 for each entry you want to store in the archive.
10. Finish the zip output stream.
11. Close the zip output stream.

Steps 4 and 8, the creation and closing of zip entries in the archive, are new. You won't find anything like them in other stream classes, but they are necessary. Attempts to write data onto a zip output stream using only the regular write( ), flush( ), and close( ) methods are doomed to failure.
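The whole sequence can be sketched in a few lines. This is only an illustration of the order of calls, with the step numbers marked in comments; the class name ZipWriteSteps, the entry names, and the entry contents are placeholders I made up.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipWriteSteps {
    // Writes a two-entry archive to the named file.
    static void writeArchive(String zipName) throws IOException {
        ZipOutputStream zout =
            new ZipOutputStream(new FileOutputStream(zipName)); // step 1
        try {
            zout.setComment("Archive created by Zipper 1.0");   // step 2
            zout.setMethod(ZipOutputStream.DEFLATED);           // step 3
            zout.setLevel(Deflater.BEST_COMPRESSION);

            String[] names = {"hello.txt", "notes/todo.txt"};
            String[] bodies = {"Hello, zip!", "Write more examples."};
            for (int i = 0; i < names.length; i++) {            // step 9 (the loop)
                ZipEntry ze = new ZipEntry(names[i]);           // step 4
                ze.setComment("entry " + i);                    // step 5
                zout.putNextEntry(ze);                          // step 6
                zout.write(bodies[i].getBytes("UTF-8"));        // step 7
                zout.closeEntry();                              // step 8
            }
            zout.finish();                                      // step 10
        } finally {
            zout.close();                                       // step 11
        }
    }
}
```

Since close( ) finishes the stream if that hasn't happened yet, the explicit finish( ) call here is redundant; it's included only to keep the steps visible.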

10.3.2.1. Constructing and initializing the ZipOutputStream

There is a single ZipOutputStream( ) constructor that takes as an argument the underlying stream to which data will be written:

public ZipOutputStream(OutputStream out)

10.3.2.2. Set the comment for the zip file

After the zip output stream has been constructed (in fact, at any point before the zip output stream is finished), you can add a single comment to the zip file with the setComment( ) method:

public void setComment(String comment)

The comment is an arbitrary ASCII string of up to 65,535 bytes. For example:

zout.setComment("Archive created by Zipper 1.0");

All high-order Unicode bytes are discarded before the comment is written onto the zip output stream. Attempts to attach a comment longer than 65,535 characters throw IllegalArgumentExceptions. Each zip output stream can have only one comment (though individual entries may have their own comments too). Resetting the comment erases the previous comment.

10.3.2.3. Set the default compression level and method

Next, you may wish to set the default compression method with setMethod( ):

public void setMethod(int method)

You can change the default compression method from stored to deflated or deflated to stored. This default method is used only when the zip entry itself does not specify a compression method. The initial value is ZipOutputStream.DEFLATED (compressed); the alternative is ZipOutputStream.STORED (uncompressed). An IllegalArgumentException is thrown if an unrecognized compression method is specified. You can call this method again at any time before the zip output stream is finished. This sets the default compression method for all subsequent entries in the zip output stream. For example:

zout.setMethod(ZipOutputStream.STORED);

You can change the default compression level with setLevel( ) at any time before the zip output stream is finished:

public void setLevel(int level)

For example:

zout.setLevel(9);

As with the default method, the zip output stream's default level is only used when the zip entry itself does not specify a compression level. The initial value is Deflater.DEFAULT_COMPRESSION. Valid levels range from 0 (no compression) to 9 (high compression); an IllegalArgumentException is thrown if a compression level outside that range is requested. You can call setLevel( ) again at any time before the zip output stream is finished to set the default compression level for all subsequent entries in the zip output stream.

10.3.2.4. Construct a ZipEntry object and put it in the archive

Data is written into the zip output stream in separate zip entries represented by ZipEntry objects. A zip entry must be opened before data is written, and each zip entry must be closed before the next one is opened. The putNextEntry( ) method opens a new zip entry on the zip output stream:

public void putNextEntry(ZipEntry ze) throws IOException

If a previous zip entry is still open, it's closed automatically. The properties of the ZipEntry argument ze specify the compression level and method. If ze leaves those unspecified, the defaults set by the last calls to setLevel( ) and setMethod( ) are used. The ZipEntry object may also contain a CRC checksum, the time the file was last modified, the size of the file, a comment, and perhaps some optional data with an application-specific meaning (for instance, the resource fork of a Macintosh file). These properties are set by the setTime( ), setSize( ), setCrc( ), setComment( ), and setExtra( ) methods of the ZipEntry class. (These properties are not set by the ZipOutputStream class since they will be different for each file stored in the archive.)
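One case where you do have to set these properties yourself is a stored entry. The JDK's ZipOutputStream expects the size and CRC of a STORED entry to be set before putNextEntry( ) is called and throws a ZipException otherwise, because it has no compressor pass in which to compute them. A sketch, with the hypothetical names StoredEntryWriter and putStored:

```java
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class StoredEntryWriter {
    // Adds data to the archive uncompressed, supplying the
    // metainformation a STORED entry needs up front.
    static void putStored(ZipOutputStream zout, String name, byte[] data)
            throws IOException {
        CRC32 crc = new CRC32();
        crc.update(data);                        // checksum of the raw bytes

        ZipEntry ze = new ZipEntry(name);
        ze.setMethod(ZipEntry.STORED);           // overrides the stream's default method
        ze.setSize(data.length);
        ze.setCrc(crc.getValue());
        ze.setTime(System.currentTimeMillis());  // optional

        zout.putNextEntry(ze);
        zout.write(data);
        zout.closeEntry();
    }
}
```

For deflated entries none of this is necessary; the size and CRC are computed as the data is written.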

10.3.2.5. Write the entry's data onto the output stream

Data is written into the zip entry using the usual write( ) methods of any output stream. Only one write( ) method is overridden in ZipOutputStream:

public void write(byte[] data, int offset, int length) throws IOException

10.3.2.6. Close the zip entry

Finally, you may want to close the zip entry to prevent any further data from being written to it. For this, call the closeEntry( ) method:

public void closeEntry( ) throws IOException

If an entry is still open when putNextEntry( ) is called or when you finish the zip output stream, this method will be called automatically. Thus, an explicit invocation is usually unnecessary.

10.3.2.7. Finish the zip output stream

A zip file stores metainformation in both the header and the tail of the file. The finish( ) method writes out this tail information:

public void finish( ) throws IOException

Once a zip output stream is finished, you cannot write any more data to it. However, data may be written to the underlying stream using a separate reference to the underlying stream. In other words, finishing a stream does not close it.
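The difference between finishing and closing is easiest to see with a stream you keep a second reference to. In this sketch (FinishDemo and archiveWithTrailer are names of my own choosing), a byte array output stream receives a complete archive and then some extra bytes written directly to it. Whether trailing bytes after an archive make sense depends entirely on what will read the result; the point here is only that the underlying stream is still open.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class FinishDemo {
    // Returns a finished archive followed by the given trailer bytes.
    static byte[] archiveWithTrailer(byte[] trailer) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        ZipOutputStream zout = new ZipOutputStream(raw);
        zout.putNextEntry(new ZipEntry("data.txt"));
        zout.write("payload".getBytes("UTF-8"));
        zout.finish();        // closes the open entry and writes the tail; raw stays open
        raw.write(trailer);   // written through the separate reference
        zout.close();         // releases the compressor; nothing more is written
        return raw.toByteArray();
    }
}
```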

10.3.2.8. Close the zip output stream

Most of the time, you will want to close a zip output stream at the same time you finish it. ZipOutputStream overrides the close( ) method inherited from java.util.zip.DeflaterOutputStream:

public void close( ) throws IOException

This method finishes the zip output stream and then closes the underlying stream.

10.3.2.9. An example

Example 10-10 uses a zip output stream chained to a file output stream to create a single zip archive from a list of files named on the command line. The name of the output zip file and the files to be stored in the archive are read from the command line. An optional -d command-line flag can set the level of compression anywhere from 0 to 9.

Zip input streams read data from zip archives. As with output streams, it's generally best not to read the raw data. (If you must read the raw data, you can always use a bare file input stream.) Instead, the input is first parsed into zip entries. Once you've positioned the stream on a particular zip entry, you read decompressed data from it using the normal read( ) methods. Then the entry is closed, and you open the next zip entry in the archive. This sequence of steps reads data from a zip input stream:

1. Construct a ZipInputStream object from an underlying stream.
2. Open the next zip entry in the archive.
3. Read data from the zip entry using InputStream methods such as read( ).
4. Close the zip entry (optional).
5. Repeat steps 2 through 4 as long as there are more entries (files) remaining in the archive.
6. Close the zip input stream.

Steps 2 and 4, the opening and closing of zip entries in the archive, are specific to zip streams; you won't find anything like them in other input stream classes.
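Here is how those six steps might look in code. This is a sketch rather than a finished utility: it reads every entry into memory, which is fine for small archives and a poor idea for large ones, and the names ZipStreamReader and readAll are invented for the example. It relies on getNextEntry( ) returning null once the archive has no more entries.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ZipStreamReader {
    // Reads every non-directory entry into memory, keyed by entry name.
    static Map<String, byte[]> readAll(InputStream source) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<String, byte[]>();
        ZipInputStream zin = new ZipInputStream(source);        // step 1
        try {
            byte[] buffer = new byte[4096];
            ZipEntry ze;
            while ((ze = zin.getNextEntry()) != null) {         // step 2; null means no more entries
                if (!ze.isDirectory()) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    int n;
                    while ((n = zin.read(buffer)) != -1) {      // step 3; -1 marks the end of this entry
                        out.write(buffer, 0, n);
                    }
                    contents.put(ze.getName(), out.toByteArray());
                }
                zin.closeEntry();                               // step 4
            }                                                   // step 5 (the loop)
        } finally {
            zin.close();                                        // step 6
        }
        return contents;
    }
}
```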

You probably noticed that the ZipInputStream class provides a second way to decompress zip files. The ZipFile class approach shown in the Unzipper program of Example 10-8 is the first. ZipInputStream uses one input stream to read from successive entries. The ZipFile class uses different input stream objects for different entries. Which to use is mainly a matter of aesthetics. There's not a strong reason to prefer one approach over the other, though the ZipInputStream is somewhat more convenient in the middle of a sequence of filters.

10.3.3.1. Construct a ZipInputStream

There is a single ZipInputStream( ) constructor that takes as an argument the underlying input stream:

public ZipInputStream(InputStream in)

No further initialization or parameter setting is needed. A zip input stream can read from a file regardless of the compression method or level used.

10.3.3.2. Open the next zip entry

A zip input stream reads zip entries in the order in which they appear in the file. You do not need to read each entry in its entirety, however. Instead, you can open an entry, close it without reading it, read the next entry, and repeat until you come to the entry you want. The getNextEntry( ) method opens the next entry in the zip input stream:

public ZipEntry getNextEntry( ) throws IOException

If the underlying stream throws an IOException, it's passed along by this method. If the stream data doesn't represent a valid zip file, a ZipException is thrown.
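Skipping ahead to a particular entry might be sketched like this. The helper name seekTo and its class are hypothetical. It leans on two behaviors of the JDK class: getNextEntry( ) closes whatever entry is currently open before moving on, and it returns null when no entries remain.

```java
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ZipStreamSeeker {
    // Advances the stream until the named entry is open.
    // Returns false if the archive ends without a match.
    static boolean seekTo(ZipInputStream zin, String name) throws IOException {
        ZipEntry ze;
        while ((ze = zin.getNextEntry()) != null) {
            if (ze.getName().equals(name)) {
                return true;   // stream is now positioned at this entry's data
            }
            // Not the one we want: the next getNextEntry() call
            // closes this entry without our reading it.
        }
        return false;
    }
}
```

Because a zip input stream only moves forward, an entry that has been passed can't be revisited without reopening the underlying source. If you need to jump around by name, ZipFile.getEntry( ) is the better fit.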

10.3.3.3. Reading from a ZipInputStream

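Once the entry is open, you can read from it using the regular read( ), skip( ), and available( ) methods of any input stream. (Zip input streams do not support marking and resetting.) Only two of these are overridden:

public int read(byte[] data, int offset, int length) throws IOException
public long skip(long n) throws IOException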

The read( ) method reads and the skip( ) method skips the decompressed bytes of data.

10.3.3.4. Close the zip entry

When you reach the end of a zip entry, or when you've read as much data as you're interested in, you may call closeEntry( ) to close the zip entry and prepare to read the next one:

public void closeEntry( ) throws IOException

Explicitly closing the entry is optional. If you don't close an entry, it will be closed automatically when you open the next entry or close the stream.

These three steps (open the entry, read from the entry, close the entry) may be repeated as many times as there are entries in the zip input stream.

10.3.3.5. Close the ZipInputStream

When you are finished with the stream, you can close it using the close( ) method:

public void close( ) throws IOException

As usual for filter streams, this method also closes the underlying stream. Unlike zip output streams, zip input streams do not absolutely have to be finished or closed when you're through with them, but it's polite to do so.

10.3.3.6. An example

Example 10-11 is an alternative unzipper that uses a ZipInputStream instead of a ZipFile. There's not really a huge advantage to using one or the other. Use whichever you find more convenient or aesthetically pleasing.