Zip : Java Glossary

The word zip refers both to American postal codes and to PKWare’s public domain
file archiving and compression format. Sun has extended it in its JAR and
WAR (Web Archive) files to have a formal
table of contents.

Zip Postal Codes

ZIP (Zone Improvement Plan) is
the American postal code, made of 5+4 digits. The code is assigned so that you
can determine the state from the first three digits of the zip code. The US Post
Office has an online zip code lookup.

Zip File Format

Zip files and jars have a similar format. Each
element is preceded by a local header, then there is a summary set of headers
(the central directory) at the very
end of the file. PKWare documents the ZIP file header format.

PKZIP and WinZip use / as the directory separator
character. It is up to you to convert the \ to
/ in element names for the ZipEntry write and back again on read. If you don’t bother, the
\ will get in the zip file and you will have a
platform-dependent zip.
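
A minimal sketch of that conversion follows. The class and method names (ZipNames,
toZipName, toPlatformPath) are made up for illustration. It assumes the incoming
paths use Windows-style separators; on Unix a backslash can be a legal character
in a filename, so you might not want to replace it unconditionally there.

```java
import java.io.File;
import java.util.zip.ZipEntry;

final class ZipNames {
    /** Converts a platform path such as docs\readme.txt to a zip element name. */
    static String toZipName(String platformPath) {
        return platformPath.replace('\\', '/');
    }

    /** Converts a zip element name back to a path using this platform's separator. */
    static String toPlatformPath(String zipName) {
        return zipName.replace('/', File.separatorChar);
    }

    public static void main(String[] args) {
        // Element names inside the zip always use '/'.
        ZipEntry entry = new ZipEntry(toZipName("docs\\api\\index.html"));
        System.out.println(entry.getName());
        // On read, convert back before creating the file on disk.
        System.out.println(toPlatformPath(entry.getName()));
    }
}
```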

Apache VFS gives you a
common API (Application Programming Interface) for files that works both for regular files and zip
file members. Normally you do your work with ZipFile,
ZipEntry, ZipInputStream and
ZipOutputStream, or for simpler tasks GZIPInputStream and GZIPOutputStream.

Writing Elements of a Zip File

The classes in package java.util.zip such as
ZipFile, ZipInputStream and
ZipOutputStream will let you read and create zip or jar
files. For compressed (DEFLATED) entries, don’t worry about ZipEntry.setCrc since it
and setCompressedSize get set automatically.
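
Here is a minimal sketch of writing elements with ZipOutputStream. It builds the
zip in memory to stay self-contained; the class name ZipWriteSketch and the member
names are invented for the example. In real code you would wrap a FileOutputStream
instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipWriteSketch {
    /** Builds a zip in memory from element name -> content pairs. */
    static byte[] zip(Map<String, byte[]> members) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> member : members.entrySet()) {
                // Entries are DEFLATED by default, so no setCrc or
                // setCompressedSize call is needed here.
                zos.putNextEntry(new ZipEntry(member.getKey()));
                zos.write(member.getValue());
                zos.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> members = new LinkedHashMap<>();
        members.put("hello.txt", "hello".getBytes(StandardCharsets.UTF_8));
        members.put("docs/readme.txt", "read me".getBytes(StandardCharsets.UTF_8));
        System.out.println(zip(members).length + " bytes written");
    }
}
```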

Reading elements of a Zip File Sequentially

Sequential reading code that relies on ZipEntry.getSize()
won’t work if ZipOutputStream was used to
create the zip file. This includes *.jar files created by
jar.exe.

To read all the elements of a zip, you might think you would use
ZipFile.entries() to
enumerate all the entries. Unfortunately, this enumeration is in random order (Hashtable order,
really). So you need to use the random access method below. To move the
disk arms efficiently over the file, you really should sort the entries first in the order they
appear in the zip.
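
If you still want a single pass with ZipInputStream, one way around the missing
sizes is to ignore getSize() entirely and read each element until read() reports
end of entry. The sketch below does that; the class name ZipSequentialRead is
invented, and it collects everything in memory, which only suits small archives.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipSequentialRead {
    /** Reads every non-directory element, in file order, into name -> content. */
    static Map<String, byte[]> readAll(InputStream zipData) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(zipData)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current element,
                // so the size in the header is never consulted.
                while ((n = zis.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                result.put(entry.getName(), content.toByteArray());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Build a small zip with ZipOutputStream, then read it back.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("one.txt"));
            zos.write(new byte[] {1, 2, 3});
            zos.closeEntry();
        }
        Map<String, byte[]> all = readAll(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(all.keySet());
    }
}
```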

Reading elements of a Zip File Randomly

Reading elements randomly with ZipFile, given the element
name, works even
if ZipOutputStream was used to create the zip file, which
fails to record the lengths in the local headers.
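
A minimal sketch of random access by element name, using ZipFile.getEntry and
ZipFile.getInputStream. The class name ZipRandomRead and the command line
arguments are assumptions for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipRandomRead {
    /** Returns the content of one element, or null if the zip has no such element. */
    static byte[] readElement(File zip, String elementName) throws IOException {
        try (ZipFile zf = new ZipFile(zip)) {
            // ZipFile finds the element through the index at the end of the file.
            ZipEntry entry = zf.getEntry(elementName);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zf.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // usage: java ZipRandomRead some.zip docs/readme.txt
        byte[] content = readElement(new File(args[0]), args[1]);
        System.out.println(content == null ? "no such element" : content.length + " bytes");
    }
}
```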

Verifying

Before distributing a zip, verify that it
contains all the files in the corresponding jar.
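
One possible check, sketched below, compares element names only: it lists the
names in the jar that are absent from the zip. The class and method names
(ZipVerify, missingFromZip) are invented, and a stricter check might also compare
sizes or CRCs.

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipVerify {
    /** Collects the names of all non-directory elements in an archive. */
    static Set<String> names(File archive) throws IOException {
        Set<String> names = new TreeSet<>();
        try (ZipFile zf = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    /** Returns the jar's element names absent from the zip; empty means complete. */
    static Set<String> missingFromZip(File jar, File zip) throws IOException {
        Set<String> missing = names(jar);
        missing.removeAll(names(zip));
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // usage: java ZipVerify app.jar distribution.zip
        Set<String> missing = missingFromZip(new File(args[0]), new File(args[1]));
        System.out.println(missing.isEmpty() ? "zip is complete" : "missing: " + missing);
    }
}
```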

Directories

Normally directories are not explicitly created
or even stored as separate entries in a zip file. When the file is extracted, any
directories needed to contain the extracted files are automatically created as
needed. However, you can store empty directories in a zip file. They appear as
filenames ending in /.
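
A short sketch of storing an empty directory: put an entry whose name ends in /
and write no data for it. The class name ZipDirectoryEntry and the directory name
are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipDirectoryEntry {
    /** Builds a zip containing a single empty directory. */
    static byte[] zipWithEmptyDirectory(String dirName) throws IOException {
        // The trailing '/' is what marks the entry as a directory.
        String name = dirName.endsWith("/") ? dirName : dirName + "/";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.closeEntry(); // no data is written for a directory
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = zipWithEmptyDirectory("logs");
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry = zis.getNextEntry();
            System.out.println(entry.getName() + " directory? " + entry.isDirectory());
        }
    }
}
```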

Nesting

The member files in a zip file can be accessed
individually, just like the files in a jar file (a species of zip file). However,
when one zip is contained within another zip, you can only access the contained zip
file itself, not its individual members. You would need to expand it to disk
somewhere before accessing its members.

There are three approaches to the problem:

Put all members in the same jar/zip.

Use several individual jar files and arrange to have them on the path.

Use a JWS (Java Web Start) installer class to unpack a nested jar into
individual jars.

Why would you nest?

To get super-compression. You create a zip as a pure archive, turning
off compression. (In WinZip you select compression:none.) Then you
compress the whole thing as a single file, this time with compression on. The
compression algorithm can then exploit repeated strings across members.

Because you want the user to leave some jars packed for use. You bundle them up
for transport as a single download.
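
The super-compression idea can be sketched with java.util.zip alone. The inner
archive uses STORED (uncompressed) entries, which, unlike DEFLATED ones, need
their size and CRC set before putNextEntry. The outer archive then deflates the
inner one as a single member. The class name NestedZipSketch, the member names
and the sample text are all invented; how much you gain depends entirely on how
similar your members are.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class NestedZipSketch {
    /** Zips count small, similar members, either deflated or merely stored. */
    static byte[] zipMembers(int count, boolean compress) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (int i = 0; i < count; i++) {
                byte[] data = ("The quick brown fox jumps over the lazy dog. Member " + i + "\n")
                        .getBytes(StandardCharsets.UTF_8);
                ZipEntry entry = new ZipEntry("member" + i + ".txt");
                if (!compress) {
                    // STORED entries must carry size and CRC up front.
                    CRC32 crc = new CRC32();
                    crc.update(data);
                    entry.setMethod(ZipEntry.STORED);
                    entry.setSize(data.length);
                    entry.setCompressedSize(data.length);
                    entry.setCrc(crc.getValue());
                }
                zos.putNextEntry(entry);
                zos.write(data);
                zos.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Wraps an already-built archive as the single member of a compressed outer zip. */
    static byte[] wrap(byte[] inner) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.setLevel(Deflater.BEST_COMPRESSION);
            zos.putNextEntry(new ZipEntry("inner.zip"));
            zos.write(inner);
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] ordinary = zipMembers(100, true);
        byte[] nested = wrap(zipMembers(100, false));
        System.out.println("ordinary zip: " + ordinary.length + " bytes");
        System.out.println("nested zip:   " + nested.length + " bytes");
    }
}
```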

Gotchas

ZipOutputStream produces a slightly non-standard Zip
format. ZipOutputStream puts the compressed and
uncompressed size and CRC (Cyclic Redundancy Check) in a data descriptor after each
member’s data, instead of in the local header just in front of it. Unfortunately, when you come
to read this file with ZipInputStream and call
ZipEntry.getSize(), you will get -1 (unknown) because
ZipInputStream is a stream and can’t look ahead
to find the size. A second copy of each header is put at the end of the file,
forming an index, and ZipFile is able to use this
index to randomly access the file to read individual elements. A normal zip file
has the information recorded redundantly to make it easier to read ahead and
to recover a damaged zip file.

java.util.zip has one big limitation. It
understands only two of the possible compression methods: STORED (no compression)
and DEFLATED. It pretty well can only
deal with zips created by itself. If the *.zip came from
the outside world and uses another method, you need to exec something like WinZip wzunzip.exe or PKWare pkunzip.exe.

Zip entry timestamps are recorded only to 2-second precision. This reflects the accuracy of
DOS timestamps in use when PKZIP was created. The number recorded in the zip will be
the timestamp truncated, not rounded to the nearest 2 seconds.

When you archive and restore a file, it will no longer have a timestamp
precisely matching the original. This is above and beyond the similar problem with
Java using 1 millisecond precision and Microsoft
Windows using 100 nanosecond increments. PKZIP format
derives from MS DOS days and hence uses only 16 bits for time and 16 bits for date.
An extended timestamp is defined in the revised PKZIP format, but Java
does not use it.

Inside zip files, dates and times are stored in local time, in
16 bits each, not in UTC (Coordinated Universal Time/Temps Universel Coordonné)
as is conventional, using an ancient MS DOS
format.
Bit 0 is the least significant bit. The format is
little-endian. There was not room in 16 bits to
accurately represent time even to the second, so the seconds field contains the
seconds divided by two, giving accuracy only to the even second.

This means the apparent time of files inside a zip will suddenly differ by an
hour compared with their uncompressed counterparts every time you have a daylight
saving change. It also means that a zip utility will extract a different
UTC time from a Zip member date depending on the
time zone in which the calculation was done. This is ridiculous. PKZIP format needs a
modern UTC-based timestamp to avoid these anomalies.

To make matters worse, standard tools like WinZip or PKZIP will always round
the time up to the next even second when they restore, thereby possibly making
the file one to two seconds younger. The JDK (Java Development Kit)
(i.e. javaToDosTime in ZipEntry) rounds the time down,
thereby making the file one to two seconds older.

The format does not support dates prior to 1980-01-01 0:00
UTC. Avoid file dates 1980-01-01 or earlier (local or UTC
time).

Wait! It gets even worse. Phil Katz, when he documented the Zip format, did
not bother to specify whether the local time used in the archive should be
daylight or standard time.

And to cap it off… Info-ZIP, JSE (Java Standard Edition)
and TrueZIP apply the DST (Daylight Saving Time) schedule (days where DST
began and ended in any given year) for any date when converting times between
system time and DOS date/time. This is as it should be. Vista’s
Explorer, 7-Zip and WinZip apply only the DST
savings, but do not apply the schedule. So they use the current DST
savings for any date when converting times between system time and DOS
date/time. This is just sloppy.

If you think this is bad, have a look at the goofiness in timestamps for
FTP uploads.

Arrggh!

PKZIP time and date formats

PKZIP/MSDOS DOSTIME 16-bit Packed Time format

field     hour    minute  seconds/2
values    0…23    0…59    0…29
width     5 bits  6 bits  5 bits
position  15…11   10…5    4…0

PKZIP/MSDOS 16-bit DOSDATE Packed Date format

field     year    month   day
values    1980⇒0  1…12    1…31
width     7 bits  4 bits  5 bits
position  15…9    8…5     4…0
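
The two layouts above can be turned into a few shifts and masks. The sketch below
packs and unpacks the fields exactly as tabulated, truncating odd seconds. The
class name DosTimestamp is invented, and it deliberately ignores time zones: the
values are treated as whatever local time the archive recorded.

```java
import java.time.LocalDateTime;

final class DosTimestamp {
    /** Packs a time into DOSTIME: hour in bits 15..11, minute in 10..5, seconds/2 in 4..0. */
    static int packTime(int hour, int minute, int second) {
        return (hour << 11) | (minute << 5) | (second / 2); // odd seconds are truncated
    }

    /** Packs a date into DOSDATE: year-1980 in bits 15..9, month in 8..5, day in 4..0. */
    static int packDate(int year, int month, int day) {
        return ((year - 1980) << 9) | (month << 5) | day;
    }

    /** Unpacks a DOSDATE/DOSTIME pair into a local date-time. */
    static LocalDateTime unpack(int dosDate, int dosTime) {
        int year = ((dosDate >> 9) & 0x7f) + 1980;
        int month = (dosDate >> 5) & 0x0f;
        int day = dosDate & 0x1f;
        int hour = (dosTime >> 11) & 0x1f;
        int minute = (dosTime >> 5) & 0x3f;
        int second = (dosTime & 0x1f) * 2;
        return LocalDateTime.of(year, month, day, hour, minute, second);
    }

    public static void main(String[] args) {
        int date = packDate(2024, 2, 29);
        int time = packTime(13, 45, 59);
        // The odd second 59 comes back as 58.
        System.out.println(unpack(date, time)); // 2024-02-29T13:45:58
    }
}
```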

Oracle’s early jar files had no compression. Compression is optional in
Oracle’s zip and jar classes.

java.util.zip has no method to delete a member from a zip. You need to use TrueZIP,
or the Java 7 zip FileSystem, where Files.delete works on members.

GZIP vs Zip

GZIP is a more primitive file format than zip.
GZIPInputStream and GZIPOutputStream let you read and create compressed files, but not
using the zip directory structure. The file consists of just one compressed lump,
without any embedded member filenames, timestamps etc. For sample GZIPOutputStream code, consult the File I/O Amanuensis.

Encryption

The java.util.zip classes do not support the WinZip
encryption scheme. So you must manually encrypt and decrypt either the plaintext
or the compressed form, perhaps using JCE (Java Cryptography Extension).
Unfortunately you will need your code at both ends to encrypt/decrypt. You
won’t be able to create encrypted files that WinZip can decrypt on its own.
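
A minimal sketch of encrypting the compressed form with JCE follows. The choice of
AES in GCM mode, the 128-bit key, and the layout (a 12-byte IV followed by the
ciphertext) are assumptions of this example, not anything the zip format
prescribes, and the class name ZipCryptoSketch is invented. Key distribution is
left out entirely.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class ZipCryptoSketch {
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;

    /** Generates a fresh random AES key. */
    static SecretKey newKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }

    /** Encrypts already-zipped bytes; the result is IV followed by ciphertext. */
    static byte[] encrypt(SecretKey key, byte[] zipBytes) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv); // a new IV for every message
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] body = cipher.doFinal(zipBytes);
        byte[] out = new byte[iv.length + body.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(body, 0, out, iv.length, body.length);
        return out;
    }

    /** Reverses encrypt; throws if the data was tampered with or the key is wrong. */
    static byte[] decrypt(SecretKey key, byte[] encrypted) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, encrypted, 0, IV_BYTES));
        return cipher.doFinal(encrypted, IV_BYTES, encrypted.length - IV_BYTES);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SecretKey key = newKey();
        byte[] zipBytes = {'P', 'K', 3, 4}; // stand-in for real zip content
        byte[] restored = decrypt(key, encrypt(key, zipBytes));
        System.out.println(restored.length + " bytes restored");
    }
}
```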

Utilities

There are command line and GUI (Graphic User Interface) utilities to create,
update and extract ZIP files.

bzip2

jZip: free, based on 7-zip

PKZIP

WinRar

WinZip

Java 7

Java 7 lets you treat a Zip file as if it were a directory tree. You use the
FileSystem and Paths
classes.

Classes of interest include:

Files: various static
methods, to copy, move etc.

FileStore: drive, partition etc. where files are
kept.

FileSystem: refers to a zip file you treat like a
directory tree.

Path: similar to the old File class

Paths: static methods
for Path.
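
A minimal sketch of the directory-tree view, using the zip file system provider
through a jar: URI. The class and method names (ZipFileSystemSketch, writeMember,
readMember) are invented. The "create" property asks the provider to create the
zip if it does not exist yet; note that a zero-length file already sitting at that
path is not a valid zip and will be rejected.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

final class ZipFileSystemSketch {
    /** Opens the zip as a FileSystem, creating the zip if it does not exist. */
    static FileSystem open(Path zip) throws IOException {
        Map<String, String> env = new HashMap<>();
        env.put("create", "true");
        URI uri = URI.create("jar:" + zip.toUri());
        return FileSystems.newFileSystem(uri, env);
    }

    /** Writes text to a member, creating parent directories inside the zip. */
    static void writeMember(Path zip, String member, String text) throws IOException {
        try (FileSystem fs = open(zip)) {
            Path inside = fs.getPath("/", member);
            Files.createDirectories(inside.getParent());
            Files.write(inside, text.getBytes(StandardCharsets.UTF_8));
        } // closing the FileSystem flushes the changes to the zip
    }

    /** Reads a member back as text. */
    static String readMember(Path zip, String member) throws IOException {
        try (FileSystem fs = open(zip)) {
            return new String(Files.readAllBytes(fs.getPath("/", member)), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path zip = Files.createTempDirectory("zipfs").resolve("demo.zip");
        writeMember(zip, "docs/readme.txt", "hello from inside a zip");
        System.out.println(readMember(zip, "docs/readme.txt"));
    }
}
```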

Future Archive Format

Phil Katz designed PkZip format back in the
DOS days. It has
survived so long partly because it was extensible — it permitted new
compression algorithms. If we were to do it over again, I would make the following
changes:

Compress the index as well as the data. The index becomes a major part of the
payload when you have many small files.

Have a pointer to where the index starts at the head of the file.

Use timestamps nominally accurate to the millisecond or nanosecond, based on
UTC time.

Always sort the index before writing it out so that you can find something
rapidly in it with binary search or with an embedded hash mechanism. It should not
be necessary to read the entire index, create an internal lookup structure just to
find one file.

Arrange a mechanism so that several similar small files can be compressed as if
they were one big file, sharing a common preload. This gives much more efficient
compression. The commonality can be factored out into a preload dictionary.

Arrange a mechanism for special dictionaries, e.g.
HTML (Hypertext Markup Language), Java,
class files, AMD32 bit exe, AMD64 bit exe, PDF (Portable Document Format),
that don’t have to be embedded in the archive, just referenced. Normally
everyone has a copy of the dictionary, but if not, they can automatically get one
given a URL (Uniform Resource Locator). This defines a starting point shorthand that does
not need to be described, just referenced in the archive itself. See the
HTMLCompactor and
SuperCompressor
student projects.