Name: rlT66838 Date: 06/07/99
I am trying to create a ZIP archive whose file names are French words (i.e. with accented characters). The file names are held in Strings, so they are encoded in Unicode. If I create a File from the String, the file name is correctly converted to the platform encoding; but if I create a ZipEntry from the same String, it is NOT converted to the platform encoding, so the name stored in the ZIP archive is the raw Unicode image (unreadable by various ZIP tools!).
For instance:
String filename = "élève.txt";
// This will create a right filename on disk
File myFile = new File(filename);
...
// A file élève.txt is created on disk
// This will create a bad (unconverted) filename in ZIP archive
ZipEntry myEntry = new ZipEntry(filename);
...
// An entry ??l??ve.txt is created in ZIP archive
The result is that the generated ZIP entry is not usable for extraction...
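The garbling described above can be reproduced without any ZIP code at all. The following is a minimal sketch, assuming the entry name is stored as UTF-8 bytes and the extracting tool decodes those bytes with a single-byte encoding; ISO-8859-1 is used here only because every Java runtime ships it, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EntryNameMojibake {
    // Hypothetical helper: shows what a tool that assumes a single-byte
    // encoding (ISO-8859-1 here, an assumption) displays for a name
    // that was stored as UTF-8 bytes.
    static String asSeenByLatin1Tool(String name) {
        byte[] stored = name.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of the file encoding.
        String name = "\u00e9l\u00e8ve.txt";
        String seen = asSeenByLatin1Tool(name);
        // Each accented character was stored as two bytes, so the
        // misdecoded name is two characters longer than the original.
        System.out.println(name.length() + " -> " + seen.length());
    }
}
```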
(Review ID: 83688)
======================================================================
Name: tb29552 Date: 03/24/2000
Solaris VM (build Solaris_JDK_1.2.1_04, native threads, sunwjit)
Classic VM (build JDK-1.2.2-W, green threads, sunwjit)
java version "1.1.6"
Within a ZIP file, pathnames use the forward slash / as separator, as required
by the ZIP spec
(ftp://ftp.uu.net/pub/archiving/zip/doc/appnote-970311-iz.zip).
This requires a conversion from or to the local file.separator on systems like
Windows. The API (ZipEntry) does not take care of the transformation, and the
need for the programmer to deal with it is not documented. As a result, code
like
ZipEntry ze;
File f;
f = new File( ze.getName());
will be written and fail on the Windows platform, or the reverse
ze = new ZipEntry( f.getName());
will fail or produce invalid jars on Windows platforms.
Either the docs or the API needs to be fixed. Preferably a new method and
constructor could be added
File f = ze.toFile(); ze = new ZipEntry( f);
that would perform the translation between '/' and File.separatorChar, leaving
the existing methods/constructors (perhaps deprecated) for use by existing code.
But if the API is not fixed, then the docs must be fixed to make sure the
programmer deals with the translation explicitly.
Note new methods in java.util.zip.ZipEntry would also need to be reflected
in java.util.jar.JarEntry.
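Until the API or the docs change, the translation has to be done by hand. The following is a minimal sketch of such helpers; EntryPaths and its methods are hypothetical names, not part of java.util.zip, and the sketch assumes the File passed in is relative to the archive root.

```java
import java.io.File;

class EntryPaths {
    // Local path -> ZIP entry name: ZIP always uses '/'.
    static String toEntryName(String localPath, char localSeparator) {
        return localPath.replace(localSeparator, '/');
    }

    // ZIP entry name -> local path using the given separator.
    static String toLocalPath(String entryName, char localSeparator) {
        return entryName.replace('/', localSeparator);
    }

    // Convenience overloads bound to the running platform's separator.
    static String toEntryName(File relativeFile) {
        return toEntryName(relativeFile.getPath(), File.separatorChar);
    }

    static File toFile(File baseDir, String entryName) {
        return new File(baseDir, toLocalPath(entryName, File.separatorChar));
    }
}
```

With these, the two failing snippets become new ZipEntry(EntryPaths.toEntryName(f)) and EntryPaths.toFile(baseDir, ze.getName()).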
(Review ID: 100505)
======================================================================

Comments

SUGGESTED FIX
SAP, as a Java SE Licensee, has provided us with a 1.4.2 solution that does not
require an API change (basically, a system property). They have implemented this
in their 1.4.2 based SAP JVM implementation and are providing it to us for consideration:
-- From SAP --
There are problems with ZIP handling of files with non-UTF8 encoded
file names.
See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4244499.
In order to improve the situation without changing existing APIs,
SAP has implemented the following solution for java.util.zip.ZipInputStream
in SAPJVM 5.1 and suggests that SUN consider a similar
approach for JDK 1.4.2, because we were faced with customer problems on
this version:
A new System Property called com.sap.jvm.ZipEntry.encoding was added
with the following behavior:
  not set:    Reading ZIP files with entries with non-UTF8 chars will fail
              with IllegalArgumentException as before this change, but with
              a useful message pointing to the cause of the problem and
              the new System Property
  "default":  If decoding an entry name with UTF8 fails, try the
              platform's default encoding. Reading ZIP files will succeed,
              but filenames might be wrong
  <encoding>: If decoding an entry name with UTF8 fails, try the given
              encoding. If the right encoding is given, reading the ZIP
              file will succeed and entry names will be converted
              correctly. WinRar and WinZip seem to use "Cp437" encoding.
The piece of code looks like this:
Replace
ZipEntry e = createZipEntry(getUTF8String(b, 0, len));
by
// SAPJVM SS 2008-07-02 implemented workaround to be able to use
// non-UTF8 encoded zip entry names
String filename = null;
try {
    // First try getUTF8String for compatibility
    filename = getUTF8String(b, 0, len);
}
catch (IllegalArgumentException e) {
    // UTF8 decoding failed!
    // alternative encoding requested?
    String encoding = System.getProperty("com.sap.jvm.ZipEntry.encoding");
    if (encoding == null) {
        // no alternative encoding requested, just throw the
        // Exception (for compatibility), but add a message
        IllegalArgumentException ee = new IllegalArgumentException(
            "zip entry name contained non-utf8 chars, try system property " +
            "com.sap.jvm.ZipEntry.encoding");
        ee.setStackTrace(e.getStackTrace());
        throw ee;
    }
    // an alternative encoding is requested
    if (encoding.equalsIgnoreCase("default")) {
        // use platform's default encoding
        filename = new String(b, 0, len);
    }
    else {
        // use the specified encoding
        // (WinZip and WinRar seem to use Cp437 )
        filename = new String(b, 0, len, encoding);
    }
}
ZipEntry e = createZipEntry(filename);
--
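The SAP fragment above depends on ZipInputStream internals (getUTF8String, createZipEntry). To try the same fallback logic in isolation, here is a self-contained sketch; EntryNameDecoder is a hypothetical name, the fallback is passed as a parameter rather than read from the system property, and a strict CharsetDecoder stands in for getUTF8String, so it is an approximation of the proposal and not SAP's code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EntryNameDecoder {
    // fallback: null (fail), "default" (platform encoding) or a charset name,
    // mirroring the three settings of com.sap.jvm.ZipEntry.encoding.
    static String decode(byte[] b, int off, int len, String fallback) {
        try {
            // Strict UTF-8 first, for compatibility with existing archives.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(b, off, len))
                    .toString();
        } catch (CharacterCodingException e) {
            if (fallback == null) {
                // No alternative requested: fail, but say why.
                throw new IllegalArgumentException(
                        "zip entry name contained non-utf8 chars", e);
            }
            Charset cs = fallback.equalsIgnoreCase("default")
                    ? Charset.defaultCharset()
                    : Charset.forName(fallback);
            return new String(b, off, len, cs);
        }
    }
}
```

A caller that wants the system-property behaviour could pass System.getProperty("com.sap.jvm.ZipEntry.encoding") as the last argument.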

EVALUATION
We expect to resolve this in the Dolphin/6.0 release (though our planning for
Dolphin is not complete). We anticipate a Dolphin source repository sometime
this summer. Hopefully, we can get this fix into Dolphin very early, to
discover any unintended consequences well before Dolphin's official release.
A contributor to the JDK community has started working on this bug (thanks!)
and you can join/follow the discussion here:
https://jdk-collaboration.dev.java.net/servlets/ProjectForumMessageView?messageID=13115&forumID=1463
We're considering two possibilities for the fix: one is largely that proposed
by several people, namely to add constructors that allow clients to indicate a
zip file's encoding. The other is to work with providers of zip
implementations to provide the encoding of the entries in a file in the file
itself. Discussion on the latter has been started at the above URL (see the
entry "Unicode extension for ZIP file specification").
Note that this bug raises two, independent issues: one concerns the character
encoding for the file's entries; the other concerns the kind of path separator
that is used on particular platforms. The latter has a straightforward fix
(and, for now, a workaround, as noted).

2006-06-13

EVALUATION
There's a lot of additional information in the JDC discussions about this bug and the duplicates 4532049, 4700978, 4415733, 4820807.
The zip specification does not specify the character encoding to be used for file names (essentially, it doesn't consider file names that include non-ASCII characters). We decided that for jar files, which must be portable between different platforms and different locale environments, only UTF-8 makes sense. Therefore the code currently encodes and decodes all file names within jar/zip files using UTF-8.
However, for normal (non-jar) zip files, the convention used by other tools is to use the platform encoding for file names. Applications that use the java.util.zip package to read/write normal zip files therefore fail (or produce unreadable files) if a file name contains a non-ASCII character, unless the platform encoding happens to be UTF-8.
To solve this problem, I think we need to distinguish between jar and zip files, and enable the use of encodings other than UTF-8 for the file names within non-jar zip files.
A possible solution would be to add a ZipFile constructor:
java.util.zip.ZipFile.ZipFile(File file, int mode, String encoding)
which lets an application specify the encoding for the file names and zip comments used within the zip file. Document that the encoding used for the other constructors is UTF-8, and that callers of the new constructor can pass in the result of java.nio.charset.Charset.defaultCharset().name() to request the platform encoding.
This lets applications access zip files that use the encoding of the platform they run on, or even generate zip files using the encoding of the platform of the client machine that a zip file is intended for (some of the bug discussion mentions servlets creating zip files for download).
The jar classes would continue to use the constructors that don't take the encoding parameter, and therefore continue to use UTF-8.
The encoding of the contents of the files included in the zip files is not affected - they're just byte streams.
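To illustrate what an encoding-aware API looks like from the caller's side, here is a minimal sketch. It does not use the String-based constructor proposed above; it assumes Java 7 or later, where java.util.zip offers constructors that take a java.nio.charset.Charset (for example ZipOutputStream(OutputStream, Charset) and ZipInputStream(InputStream, Charset)). The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class CharsetZipRoundTrip {
    // Writes a one-entry archive whose entry name is encoded with 'names'.
    static byte[] write(String entryName, Charset names) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf, names)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(new byte[] {1, 2, 3}); // contents are plain bytes
            zos.closeEntry();
        }
        return buf.toByteArray();
    }

    // Reads the first entry name back, decoding it with 'names'.
    static String firstEntryName(byte[] zip, Charset names) throws IOException {
        try (ZipInputStream zis =
                 new ZipInputStream(new ByteArrayInputStream(zip), names)) {
            ZipEntry e = zis.getNextEntry();
            return e == null ? null : e.getName();
        }
    }
}
```

Passing Charset.defaultCharset() requests the platform encoding, as suggested above; a servlet producing a download for a specific client platform would pass that platform's charset instead.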
For command line use, the jar command could be enhanced with an option that specifies the file name encoding, using either an encoding name or "default" for the platform encoding. This option should be disabled when creating jar files.
###@###.### 2005-1-28 18:42:10 GMT