Appendix A Java Encoding Schemes

This appendix describes the character-encoding schemes that are supported by the Java platform.

US-ASCII

US-ASCII is a 7-bit character set and encoding that covers the unaccented
English-language alphabet, digits, and common punctuation. Its 128 codes
cannot represent the characters used in other languages, however, so it is
of limited use for internationalization.
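
As a minimal sketch of this limitation, the following program checks whether a string fits in US-ASCII and shows what happens when it does not. The class name AsciiDemo and its helper methods are illustrative, not part of any Java API; the sketch assumes Java 7 or later for java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class AsciiDemo {
    // True if every character of the text lies in the 7-bit US-ASCII range.
    static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // Encodes the text as US-ASCII. String.getBytes replaces each
    // unmappable character with the charset's replacement byte, '?' (63).
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String plain = "resume";
        // The same word with two e-acute characters (U+00E9), written as escapes.
        String accented = "r\u00e9sum\u00e9";

        System.out.println(isPureAscii(plain));     // true
        System.out.println(isPureAscii(accented));  // false

        // The accented letters are lost in the conversion.
        String roundTrip = new String(toAscii(accented), StandardCharsets.US_ASCII);
        System.out.println(roundTrip);              // r?sum?
    }
}
```

The silent substitution of a question mark is the usual symptom of text that was pushed through an encoding too small to hold it.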

ISO-8859-1

ISO-8859-1 is the character set for Western European languages.
It is an 8-bit encoding scheme in which every encoded character takes
exactly 8 bits. (The remaining encodings in this appendix, by contrast,
reserve some code values to signal that a character spans more than one
8-bit or 16-bit unit.)
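
The fixed width can be observed directly. The sketch below, which uses the illustrative class name Latin1Demo, encodes a short string containing accented letters and confirms that the byte count equals the character count. It also relies on the fact that ISO-8859-1 byte values coincide with the Unicode code points U+0000 through U+00FF.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    // Number of bytes the text occupies when encoded as ISO-8859-1.
    static int latin1Length(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Unsigned value (0-255) of the byte at the given index after encoding.
    static int latin1ByteAt(String text, int index) {
        return text.getBytes(StandardCharsets.ISO_8859_1)[index] & 0xFF;
    }

    public static void main(String[] args) {
        // "naive cafe" with i-diaeresis (U+00EF) and e-acute (U+00E9).
        String text = "na\u00efve caf\u00e9";

        // One byte per character: 10 chars -> 10 bytes.
        System.out.println(text.length() + " chars -> " + latin1Length(text) + " bytes");

        // The byte value equals the code point: U+00E9 -> 0xE9.
        System.out.printf("last byte: 0x%02X%n", latin1ByteAt(text, text.length() - 1));
    }
}
```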

UTF-8

UTF-8 is a variable-width encoding scheme built from 8-bit units. Characters
from the English-language alphabet are each encoded using a single 8-bit byte.
Other characters are encoded using 2, 3, or 4 bytes. UTF-8 therefore produces
compact documents for the English language. For languages whose characters
need 3 bytes in UTF-8, such as Chinese, Japanese, and Korean, documents tend
to be half again as large as they would be if they used UTF-16. If the majority
of a document’s text is in a Western European language, then UTF-8 is
generally a good choice because it allows for internationalization while still
minimizing the space required for encoding.
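
The variable width of UTF-8 can be illustrated with a few sample characters. This is a small sketch rather than a complete survey; the class name Utf8Lengths and the choice of sample code points are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes UTF-8 needs for a single Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0041,   // LATIN CAPITAL LETTER A       -> 1 byte
            0x00E9,   // LATIN SMALL LETTER E ACUTE   -> 2 bytes
            0x20AC,   // EURO SIGN                    -> 3 bytes
            0x4E2D,   // a common CJK ideograph       -> 3 bytes
            0x1F600   // an emoji outside the BMP     -> 4 bytes
        };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```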

UTF-16

UTF-16 is a variable-width encoding scheme built from 16-bit units. It can
encode every character in the Unicode character set. It uses 16 bits for
most characters, including the commonly used Chinese, Japanese, and Korean
ideograms, and a pair of 16-bit units (32 bits) for supplementary characters
such as rarer ideograms. A Western European-language document that uses UTF-16
will be up to twice as large as the same document encoded using UTF-8. But
documents written in Far Eastern languages will typically be about a third
smaller using UTF-16.
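
The size trade-off between the two encodings can be checked with a rough comparison such as the one below. The class name Utf16Sizes and the sample strings are illustrative. The sketch uses the UTF-16BE charset so that no byte-order mark is included in the byte counts.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Sizes {
    // Size of the text in bytes when encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size of the text in bytes when encoded as UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    static void report(String label, String text) {
        System.out.println(label + ": UTF-8 = " + utf8Bytes(text)
                + " bytes, UTF-16 = " + utf16Bytes(text) + " bytes");
    }

    public static void main(String[] args) {
        // 12 ASCII characters: 12 bytes in UTF-8, 24 bytes in UTF-16.
        report("English", "Hello, world");

        // 4 common ideograms: 12 bytes in UTF-8, 8 bytes in UTF-16.
        report("Chinese", "\u4e2d\u6587\u6587\u6863");

        // 1 supplementary ideogram (U+20000): 4 bytes in both encodings,
        // because UTF-16 needs a surrogate pair for it.
        report("Supplementary", new String(Character.toChars(0x20000)));
    }
}
```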

Note –

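The byte order of UTF-16 data depends on the byte-ordering convention that
was used when the data was written. Some systems store the high-order byte of
a 16-bit unit first (big-endian), and others store the low-order byte first
(little-endian). UTF-16 documents cannot be interchanged between such systems
unless the byte order is indicated, for example by a byte-order mark at the
start of the document or by naming UTF-16BE or UTF-16LE explicitly, or the
data is converted.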
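
The effect of byte order is easy to see by encoding a single character with each of the UTF-16 charsets that Java provides. In this sketch the class name ByteOrderDemo and the encodeHex helper are illustrative. The expected output relies on the documented behavior of the Java UTF-16 charset, which writes big-endian bytes preceded by a byte-order mark when encoding.

```java
import java.nio.charset.Charset;

public class ByteOrderDemo {
    // Encodes the text with the named charset and returns the bytes
    // as space-separated, two-digit hexadecimal values.
    static String encodeHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A"; // U+0041

        System.out.println("UTF-16BE: " + encodeHex(text, "UTF-16BE")); // 00 41
        System.out.println("UTF-16LE: " + encodeHex(text, "UTF-16LE")); // 41 00

        // Plain "UTF-16" prepends the byte-order mark FE FF when encoding.
        System.out.println("UTF-16:   " + encodeHex(text, "UTF-16"));   // FE FF 00 41
    }
}
```

A reader that assumes the wrong order would interpret 00 41 as U+4100 rather than U+0041, which is why the order has to be agreed on or marked.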

Further Information about Character Encoding

The Java programming language represents characters internally using
the Unicode character set, which provides support for most languages. For
storage and transmission over networks, however, many other character encodings
are used. The Java platform therefore also supports character conversion
to and from other character encodings. Any Java runtime must support the
US-ASCII and ISO-8859-1 character encodings as well as the Unicode
transformations UTF-8, UTF-16, UTF-16BE, and UTF-16LE, but most implementations
support many more. For a complete list
of the encodings supported by the Java platform, see http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html.
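
Which encodings a particular runtime supports can also be queried at run time through the java.nio.charset.Charset class. The following sketch checks the required encodings, counts the available ones, and converts a byte sequence from one encoding to another by way of a String. The class name SupportedCharsets and its helper methods are illustrative.

```java
import java.nio.charset.Charset;

public class SupportedCharsets {
    // True if the running Java platform supports every named charset.
    static boolean allSupported(String... names) {
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                return false;
            }
        }
        return true;
    }

    // Converts bytes from one encoding to another by decoding them
    // into a String (Unicode) and encoding that String again.
    static byte[] transcode(byte[] input, String from, String to) {
        String decoded = new String(input, Charset.forName(from));
        return decoded.getBytes(Charset.forName(to));
    }

    public static void main(String[] args) {
        System.out.println("Required charsets present: " + allSupported(
                "US-ASCII", "ISO-8859-1", "UTF-8",
                "UTF-16", "UTF-16BE", "UTF-16LE"));

        // The count varies from one Java implementation to another.
        System.out.println("Available charsets: "
                + Charset.availableCharsets().size());

        // e-acute is the single byte E9 in ISO-8859-1
        // and the two bytes C3 A9 in UTF-8.
        byte[] utf8 = transcode(new byte[] {(byte) 0xE9}, "ISO-8859-1", "UTF-8");
        System.out.printf("%02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF);
    }
}
```

Charset.forName throws UnsupportedCharsetException for a name the runtime does not support, so code that accepts encoding names from outside may want to call Charset.isSupported first.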