Oracle Blog

making everything ok

Reading Zip files with non-ASCII file names

In a zip file, each entry occupies one slot, which holds a description (the header) followed by the compressed data. One field of the description is, of course, the file name. Unfortunately, the original ZIP specification does not say how the file name is encoded. Most applications use the default encoding (that is, the native encoding of the underlying operating system). Java considered i18n from the very beginning and has a strong wish to be cross-platform and cross-locale, so it chose UTF-8, since that's the only (?) charset that's guaranteed to be supported on every platform and can represent every character in the world.

So here comes the interoperability problem. If you compress files with Chinese names using WinRAR, the resulting archive cannot be opened by the jar command. Another popular piece of software, GMail, also uses UTF-8 when you download several attachments at once; this time, the file you get cannot be opened by WinRAR (use jar instead, which only comes with the JDK).

I did not care about this until recently, when I started playing with Java ME. The first program I want to write shows what's on now on various TV stations. The website of CCTV (China Central Television) offers a schedule file for almost all the TV channels I can receive. Each channel has one file in the zip bundle, and the file name is the channel's name, in Chinese, in the GB2312 encoding.

If I were writing a Java SE program, I wouldn't hesitate for a moment to go inside the JDK and change the lines where UTF-8 is forced, and that would be that. It would be my own private build of the JDK, so I could do anything I liked to it. But for Java ME, I don't know of a way to replace the JRE in my phone with a customized one. (Maybe I can ask the ME guys downstairs.)

This is what I did:

Read the zip spec, only the first 4 pages, about the overall layout and the structure of the entry header

Write a FilterInputStream and override the read method to translate the file name from the Chinese encoding (GB2312) to UTF-8 on the fly, updating the filename length field as well, of course. This filter lets all other fields (as well as the compressed data) go through untouched. After all the zip entries, there are still quite a lot of blocks (starting with [archive decryption header], if you read the spec). I really don't understand what they are. Shouldn't any data after the last zip entry be useless, at least in streaming mode? So I regard them as one big EOF and let read() return -1 when this part is reached.

Insert this filter between the InputStream of the zip file and the ZipInputStream, and everything is OK now.
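The steps above might look roughly like the following. This is a minimal sketch and not the code I actually ran: the class name ZipNameTranscodingInputStream is made up for illustration, it uses the Java SE Charset API (which a Java ME runtime may not offer), and it assumes every entry records its compressed size in the local header (no data descriptor, no ZIP64). Anything that does not start with a local file header signature is treated as end of stream, as described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical filter: rewrites each local file header so the entry name is UTF-8. */
final class ZipNameTranscodingInputStream extends FilterInputStream {
    private static final long LOC_SIG = 0x04034b50L; // local file header signature
    private static final int LOC_LEN = 30;           // fixed part of the local header

    private final DataInputStream raw;    // same stream, used only for readFully
    private final Charset source;         // encoding the names were written in
    private byte[] pending = new byte[0]; // rewritten header bytes not yet handed out
    private int pendingPos = 0;
    private long dataLeft = 0;            // entry data bytes to pass through untouched
    private boolean eof = false;

    ZipNameTranscodingInputStream(InputStream in, Charset source) {
        super(in);
        this.raw = new DataInputStream(in);
        this.source = source;
    }

    @Override public int read() throws IOException {
        while (true) {
            if (pendingPos < pending.length) return pending[pendingPos++] & 0xff;
            if (dataLeft > 0) {
                int b = in.read();
                if (b >= 0) dataLeft--;
                return b;
            }
            if (eof || !loadNextHeader()) return -1;
        }
    }

    // FilterInputStream would forward bulk reads straight to the source, so route them here.
    @Override public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c < 0) break;
            b[off + n++] = (byte) c;
        }
        return n == 0 ? -1 : n;
    }

    @Override public long skip(long n) throws IOException {
        long k = 0;
        while (k < n && read() >= 0) k++;
        return k;
    }

    @Override public boolean markSupported() { return false; }

    /** Reads one local header, transcodes its name, and queues the rewritten bytes. */
    private boolean loadNextHeader() throws IOException {
        byte[] h = new byte[LOC_LEN];
        try {
            raw.readFully(h, 0, 4);
        } catch (EOFException e) {
            eof = true;
            return false;
        }
        if (u32(h, 0) != LOC_SIG) { // whatever follows the entries: one big EOF
            eof = true;
            return false;
        }
        raw.readFully(h, 4, LOC_LEN - 4);
        if ((u16(h, 6) & 0x08) != 0)
            throw new IOException("entries with a data descriptor are not handled by this sketch");
        byte[] name = new byte[u16(h, 26)];
        raw.readFully(name);
        byte[] extra = new byte[u16(h, 28)];
        raw.readFully(extra);
        byte[] utf8 = new String(name, source).getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 0xffff) throw new IOException("transcoded name is too long");
        h[26] = (byte) utf8.length;          // update the filename length field
        h[27] = (byte) (utf8.length >>> 8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(h);
        out.write(utf8);
        out.write(extra);
        pending = out.toByteArray();
        pendingPos = 0;
        dataLeft = u32(h, 18);               // compressed size
        return true;
    }

    private static int u16(byte[] b, int off) {
        return (b[off] & 0xff) | ((b[off + 1] & 0xff) << 8);
    }

    private static long u32(byte[] b, int off) {
        return u16(b, off) | ((long) u16(b, off + 2) << 16);
    }
}
```

Usage would be something like new ZipInputStream(new ZipNameTranscodingInputStream(new BufferedInputStream(in), Charset.forName("GB2312"))). Whether the GB2312 charset is available depends on the runtime, and since this sketch pulls bytes from the source one at a time, buffering the underlying stream helps.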

I'm not in the mood to write an output filter at the moment. If I do want to evolve this into a general-purpose tool, I may need to have a look at the remaining 95% of that spec.

The Java class in question should have an option to allow non-UTF-8 chars in filenames to be passed through. I can't see modifying the JDK as acceptable, but neither is it acceptable for the JDK to force UTF-8 on a slot where the codeset is unspecified -- I18N is not as simple as "use UTF-8 everywhere." Your approach of writing a filter seems like a very good one. Oh, and Unicode doesn't provide a codepoint for every character -- but it gets close, and it's regularly extended.
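For what it's worth, Java SE 7 and later do offer such an option: ZipInputStream has a constructor that takes the Charset to use for entry names. A small sketch follows, assuming a Java SE runtime; the ZipNames helper and its list method are invented for illustration. This does not help on Java ME, where the filter approach is still needed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: lists entry names, decoding them with the given charset. */
final class ZipNames {
    static List<String> list(InputStream in, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        // The two-argument constructor exists since Java SE 7.
        try (ZipInputStream zis = new ZipInputStream(in, nameCharset)) {
            for (ZipEntry e; (e = zis.getNextEntry()) != null; ) {
                names.add(e.getName());
            }
        }
        return names;
    }
}
```

For the CCTV bundle this would be called with Charset.forName("GB2312"), provided the runtime ships that charset.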