Tuesday, May 5, 2009

Unicode - How to get the characters right?

Computers understand only bits and bytes. You know, the binary numeral system of zeros and ones. Humans, on the other hand, understand characters only. You know, the building blocks of the natural languages. So, to handle human readable characters using a computer (read, write, store, transfer, etcetera), they have to be converted to bytes. One byte is an ordered collection of eight zeros or ones (bits). The characters are only used for pure presentation to humans. Behind any character you see, there is a certain order of bits. For a computer a character is in fact nothing less or more than a simple graphical picture (a font) which has an unique "identifier" in form of a certain order of bits.

To convert between chars and bytes a computer needs a mapping where every unique character is associated with unique bytes. This mapping is also called the character encoding. The character encoding exist of basically two parts. The one is the character set (charset), which represents all of the unique characters. The other is the numeral representation of each of the characters of the charset. The numeral representation to humans is usually in hexadecimal, which is in turn easily to be "converted" to bytes (both are just numeral systems, only with a different base).

The world would be much simpler if only one character encoding existed. That would have been clear enough for everyone. Unfortunately the truth is different. There are a lot of different character encodings, each with its own charsets and numeral mappings. So it may be obvious that a character which is converted to bytes using character encoding X may not be the same character when it is converted back from bytes using character encoding Y. That would in turn lead to confusion among humans, because they wouldn't understand the way the computer represented their natural language. Humans would see completely different characters and thus not be able to understand the "language" which is also known as the "mojibake". It can also happen that humans would not see any linguistic character at all, because the numeral representation of the character in question isn't covered by the numeral mapping of the character encoding used. It's simply unknown.

How such an unknown character is displayed differs per application which handles the character. In the webbrowser world, Firefox would display an unknown character as a black diamond with a question mark in it, while Internet Explorer would display it as an empty white square with a black border. Both represents the same Unicode character though: xFFFD, which is displayed in your webbrowser as "�". Internet Explorer simply doesn't have a font (a graphical picture) for it, hence the empty square. In Java/JSP/Servlet world, any unknown character which is passed through the write() methods of an OutputStream (e.g. the one obtained by ServletResponse#getOutputStream()) get printed as a plain question mark "?". Those question marks can in some cases also be caused by the database. Most database engines replaces uncovered numeral representations by a plain question mark during save (INSERT/UPDATE), which is in turn later displayed to the human when the data is been queried and sent to the webbrowser. The plain question marks are thus not per se caused by the webbrowser.

Here is a small test snippet which demonstrates the problem. Keep in mind that Java supports and uses Unicode all the time. So the encoding problem which you see in the output is not caused by Java itself, but by using the ISO 8859-1 character encoding to display Unicode characters. The ISO 8859-1 character encoding namely doesn't cover the numeral representations of a large part of the Unicode charset which is also known as the UTF-8 charset. By the way, the term "Unicode character" is nowhere defined, but it usually used by (unaware) programmers/users who actually meant "Any character which is not covered by the ISO 8859 character encoding".

Important note: your own operating system should of course have the proper fonts (yes, the human representations) supporting those Unicode charsets for both Czech and Japanese languages installed to see the proper characters/glyphs at this webpage :) Otherwise you will see in for example Firefox a black-bordered square with hexcode inside (0-9 and/or A-F) and in most other webbrowsers such as IE, Safari and Chrome a nothing-saying empty square with a black border. For example x1A05 (a letter in an exotic language) which isn't supported by the Verdana/Geneva/Helvetica/Arial fonts (yet?) is in your webbrowser represented as "ᨅ". If your operating system for example doesn't have the Japanese glyphs in the font as required by this page, then you should in Firefox see three squares with hexcodes 65E5, 672C and 8A9E. Those hexcodes are actually also called 'Unicode codepoints'. In Windows, you can view all available fonts and the supported characters using 'charmap.exe' (Start > Run > charmap).

Another important note: if you have problems when copypasting the test snippet in your development environment (e.g. you are not seeing the proper characters, but only empty squares or something like), then please wait with playing until you have read the entire article, including step 1 of the OK .. So, I have an "Unicode problem", what now? chapter ;)

Let's go back in the history of character encoding. Most of you may be familiar with the term "ASCII". This was less or more the first character encoding ever. At the ages when a byte was very expensive and 1MHz was extremely fast, only the characters which appeared on those ancient US typewriters (as well as at the average US International keyboard nowadays) were covered by the charset of the ASCII character encoding. This includes the complete Latin alphabet (A-Z, in both the lowercased and uppercased flavour), the numeral digits (0-9), the lexical control characters (space, dot, comma, colon, etcetera) and some special characters (the at sign, the sharp sign, the dollar sign, etcetera). All those characters fill up the space of 7 bits, half of the room a byte provides, with a total of 128 characters.

Later the remaining bit of a byte is used for Extended ASCII which provides room for a total of 255 characters. Most of the remaining room is used by special characters, such as diacritical characters and line drawing characters. Because everyone used the remaining room their own way (IBM, Commodore, Universities, etcetera), it was not interchangeable. Later ISO came up with standard character encoding definitions for 8 bit ASCII extensions, resulting in the known ISO 8859 character encoding standards such as ISO 8859-1.

8 bits may be enough for the languages using the Latin alphabet, but it is certainly not enough for the remaining non-Latin languages in the world, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etcetera. They developed their own non-ISO character encodings which was -again- not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etcetera. Finally a new 16 bits character encoding standard based on top of ISO 8859-1 was established to cover any of the characters used at the world so that it is interchangeable everywhere: Unicode. You can find all of those linguistic characters here. Unicode also covers many special characters (symbols) such as punctuation and mathematical operators, which you can find here.

To the point: just ensure that you use UTF-8 (a character encoding which conforms the Unicode standard) all the way. There are more Unicode character encodings as well, but as far they are used very, very seldom. UTF-8 is likely the Unicode standard. To solve the "Unicode problem" you need to ensure that every step which involves byte-character conversion uses the one and the same character encoding: reading data from input stream, writing data to output stream, querying data from database, storing data in database, manipulating the data, displaying the data, etcetera. For a Java EE web developer, there are a lot of things you have to take into account.

Development environment: yes, the development environment has to use UTF-8 as well. By default most text files are saved using the operating system default encoding such as ISO 8859-1 or even an proprietary encoding such as Windows ANSI (also known as CP-1252, which is in turn not interchangeable with non-Windows platforms!). The most basic text editor of Windows, Notepad, uses Windows ANSI by default, but Notepad supports UTF-8 as well. To save a text file containing Unicode characters using Notepad, you need to choose the File » Save As option and select UTF-8 from the Encoding dropdown. The same Save As story applies on many other self-respected text editors as well, like EditPlus, UltraEdit and Notepad++.

In an IDE such as Eclipse you can set the encoding at several places. You need to explore the IDE preferences thoroughly to find and change them. In case of Eclipse, just go to Window » Preferences and enter filter text encoding. In the filtered preferences (Workspace, JSP files, etcetera) you can select the desired encoding from a dropdown. Important note: the Workspace encoding also covers the output console and thus also the outcome of System.out.println(). If you sysout an Unicode character using the default encoding, it would likely be printed as a plain vanilla question mark!

In theory, in the Windows command prompt you have to use a font which supports a broad range of Unicode characters. You can set the font by opening the command console (Start > Run > cmd), then clicking the small cmd icon at the left top, then choosing Properties and finally choosing the Font tab. In a default Windows environment only the Lucida Console font has the "best" support of Unicode fonts. It unfortunately lacks a broad range of Unicode characters though.

The cmd.exe parameter \U and/or the command chcp 65001 (which changes the code page to UTF-8) doesn't help much if the font already doesn't support the desired characters. You could hack the registry to add more fonts, but you still have to find a specific command console font which supports all of the desired characters. In the end it's better to use Swing to create a command console like UI instead of using the standard command console. Especially if the application is intended to be distributed (you don't want to require the enduser to hack/change their environment to get your application to work, do you? ;) ).

Java properties files: as stated in its Javadoc the load(InputStream) method of the java.util.Properties API uses ISO 8859-1 as the default encoding. Here's an extract of the class' Javadoc:

.. the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes ; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.

If you have full control over loading of the properties files, then you should use the Java 1.6 load(Reader) method in combination with an InputStreamReader instead:

If you don't have full control over loading of the properties files (e.g. managed by some framework), then you need the in the Javadoc mentioned native2ascii tool. The native2ascii tool can be found in the /bin folder of the JDK installation directory. When you for example need to maintain properties files with Unicode characters for i18n (Internationalization; also known as resource bundles), then it's a good practice to have both an UTF-8 properties file and an ISO 8859-1 properties file and some batch program to convert from the UTF-8 properties file to an ISO 8859-1 properties file. You use the UTF-8 properties file for editing only. You use the converter to convert it to ISO 8859-1 properties file after every edit. You finally just leave the ISO 8859-1 properties file as it is. In most (smart) IDE's like Eclipse you cannot use the .properties extension for those UTF-8 properties files, it would complain about unknown characters because it is forced to save properties files in ISO 8859-1 format. Name it .properties.utf8 or something else. Here's an example of a simple Windows batch file which does the conversion task:

Save it as utf8.converter.bat (or something like) and run it once to convert all UTF-8 properties files to standard ISO 8859-1 properties files. If you're using Maven and/or Ant, this can even be automated to take place during the build of the project.

JSP/Servlet request: during request processing an average application server will by default use the ISO 8859-1 character encoding to URL-decode the request parameters. You need to force the character encoding to UTF-8 yourself. First this: "URL encoding" must not to be confused with "character encoding". URL encoding is merely a conversion of characters to their numeral representations in the %xx format, so that special characters can be passed through URL without any problems. The client will URL-encode the characters before sending them to the server. The server should URL-decode the characters using the same character encoding. Also see "percent encoding".

How to configure this depends on the server used, so the best is to refer its documentation. In case of for example Tomcat you need to set the URIEncoding attribute of the <Connector> element in Tomcat's /conf/server.xml to set the character encoding of HTTP GET requests, also see this document:

<Connector (...) URIEncoding="UTF-8"/>

In for example Glassfish you need to set the <parameter-encoding> entry in webapp's /WEB-INF/sun-web.xml (or, since Glassfish 3.1, glassfish-web.xml), see also this document:

<parameter-encoding default-charset="UTF-8"/>

URL-decoding POST request parameters is a story apart. The webbrowser is namely supposed to send the charset used in the Content-Type request header. However, most webbrowsers doesn't do it. Those webbrowsers will just use the same character encoding as the page with the form was delivered with, i.e. it's the same charset as specified in Content-Type header of the HTTP response or the <meta> tag. Only Microsoft Internet explorer will send the character encoding in the request header when you specify it in the accept-charset attribute of the HTML form. However, this implementation is broken in certain circumstances, e.g. when IE-win says "ISO-8859-1", it is actually CP-1252! You should really avoid using it. Just let it go and set the encoding yourself.

You can solve this by setting the same character encoding in the ServletRequest object yourself. An easy solution is to implement a Filter for this which is mapped on an url-pattern of /* and basically contains only the following lines in the doFilter() method:

Note: URL-decoding POST request parameters the above way is not necessary when you're using Facelets instead of JSP as it defaults to UTF-8 already. It's also not necessary when you're using Glassfish as the <parameter-encoding> also takes care about this.

Here's a test snippet which demonstrates what exactly happens behind the scenes when it all fails:

Original input string from client: 日本語
URL-encoded by client with UTF-8: %E6%97%A5%E6%9C%AC%E8%AA%9E
Then URL-decoded by server with ISO-8859-1: æ—¥æœ¬èªž
Server should URL-decode with UTF-8: 日本語

JSP/Servlet response: during response processing an average application server will by default use ISO 8859-1 to encode the response outputstream. You need to force the response encoding to UTF-8 yourself. If you use JSP as view technology, then adding the following line to the top (yes, as the first line) of your JSP ought to be sufficient:

<%@page pageEncoding="UTF-8"%>

This will set the response outputstream encoding to UTF-8 and set the HTTP response content-type header to text/html;charset=UTF-8. To apply this setting globally so that you don't need to edit every individual JSP, you can also add the following entry to your /WEB-INF/web.xml file:

Note: this is not necessary when you're using Facelets instead of JSP as it defaults to UTF-8 already.

The HTTP content-type header actually does nothing at the server side, but it should instruct the webbrowser at the client side which character encoding to use for display. The webbrowser must use it above any specified HTML meta content-type header as specified by w3 HTML spec chapter 5.2.2. In other words, the HTML meta content-type header is totally ignored when the page is served over HTTP. But when the enduser saves the page locally and views it from the local disk file system, then the meta content-type header will be used. To cover that as well, you should add the following HTML meta content-type header to your JSP anyway:

If you (ab)use a HttpServlet instead of a JSP to generate HTML content using out.write(), out.print() statements and so on, then you need to set the encoding in the ServletResponse object itself inside the servlet method block before you call getWriter() or getOutputStream() on it:

response.setCharacterEncoding("UTF-8");

You can do that in the aforementioned Filter, but this can lead to problems if you have servlets in your webapplication which uses the response for something else than generating HTML content. After all, there shouldn't be any need to do this. Use JSP to generate HTML content, that's where it is for. When generating other plain text content than HTML, such as XML, CSV, JSON, etcetera, then you need to set the response character encoding the above way.

JSF/Facelets request/response: JSF/Facelets uses by default already UTF-8 for all HTTP requests and responses. You only need to configure the server as well to use the same encoding as described in JSP/Servlet request section.

Only when you're using a custom filter or a 3rd party component library which calls request.getParameter() or any other method which implicitly needs to parse the request body in order to extract the data, then there's chance that it's too late for JSF/Facelets to set the UTF-8 character encoding before the request body is been parsed for the first time. PrimeFaces 3.2 for example is known to do that. In that case, you'd still need a custom filter as described in JSP/Servlet request section.

Databases: also the database has to take the character encoding in account. In general you need to specify it during the CREATE and if necessary also during the ALTER statements and in some cases you also need to specify it in the connection string or the connection parameters. The exact syntax depends on the database used, best is to refer its documentation using the keywords "character set". In for example MySQL you can use the CHARACTER SET clause as pointed out here:

CREATE DATABASE db_name CHARACTER SET utf8;
CREATE TABLE tbl_name (...) CHARACTER SET utf8;

Usually the database's JDBC driver is smart enough to use the database and/or table specified encoding for querying and storing the data. But in worst cases you have to specify the character encoding in the connection string as well. This is true in case of MySQL JDBC driver because it does not use the database-specified encoding, but the client-specified encoding. How to configure it should already be answered in the JDBC driver documentation. In for example MySQL you can read it here:

Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:

Fascinating article .. thanks for the same, learnt a lot about Unicode from this, which I wanted so very badly..

However, can you help me with a problem I have, that being when I read a .txt file having east European characters (like, say, German) created in ANSI file format (which is default of notepad on Windows) in java using BuuferedReader and InputStreamReader,with utf-8 encoding set for the instance of latter, then also I get "�" symbols in Java instead of the German characters. How can I go around this problem w/o changing the file format of the .txt file from ANSI to Unicode or UTF-8 in Notepad.

I have tried your suggestions - but I still have a problem. My database is NOT configured to support UTF-8 and some of our external data consumption services are unable to handle UTF-8. Therefore, I need the encoding to be set to ISO-8859-1.

I am using your encoding filter and put the form accept-charset, and page meta tags, etc. YET still, I am getting the UTF-8 character 日 (as test) on form submit.

How can I ensure that all form inputs are encoded to my chosen encoding?

About

Donate

For the ones who want to express their excessive thanks for my work, I used to have an Amazon wishlist with a list of books, but right now I don't have any interesting books on the list anymore (to anyone who've sent books before: thank you very much, I got 6 books in 6 months). You can always donate something so that I can use it for other stuff, such as Nespresso coffee.