I've been playing around with various web servers recently, paying special attention to how browsers communicate non-ASCII text via HTTP GET and POST requests. While it appears that current browsers take hints from the page or form encoding and send form data back to the server in that same encoding, web servers remain blissfully unaware. They typically assume that the request encoding is ISO-8859-1. So, if my application URL-encodes a GET parameter in UTF-8 (a Unicode encoding), the backend server (let's say Tomcat 5.5.9) still decodes it as ISO-8859-1. The result, of course, is that text data becomes mangled almost immediately as it travels through the various tiers of even a simple web-based application.
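To make the mismatch concrete, here is a minimal stand-alone sketch of it. It is not servlet code; the class and method names are made up for illustration, and it assumes Java 10 or later for the `Charset` overloads of `URLEncoder.encode` and `URLDecoder.decode`. One method plays the browser (percent-encoding the UTF-8 bytes), the other two play a server that guesses the charset.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: simulates the browser/server charset mismatch.
final class EncodingMismatchDemo {

    // Browser side: percent-encode the UTF-8 bytes of the value.
    static String browserEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Server side, wrong guess: treat each escaped byte as one ISO-8859-1 character.
    static String serverDecodeLatin1(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
    }

    // Server side, right guess: reassemble the bytes as UTF-8.
    static String serverDecodeUtf8(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u7530\u4e2d"; // two CJK characters, six bytes in UTF-8
        String wire = browserEncode(name);
        System.out.println("on the wire: " + wire);
        System.out.println("ISO-8859-1 round trip intact? "
                + serverDecodeLatin1(wire).equals(name));
        System.out.println("UTF-8 round trip intact?      "
                + serverDecodeUtf8(wire).equals(name));
    }
}
```

The wrong guess does not throw or warn. It quietly turns two characters into six, which is why the damage usually shows up several tiers away from where it happened.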

Type in "John" and press the submit button. The result is that a URL like this is created:

http://localhost/sayhello.jsp?NAME=John

No problem there. The call to request.getParameter("NAME") retrieves the simple ASCII text without a hitch. Subsequently, the name is output to the HTML stream back to the browser where the expected greeting appears:

The trouble is that none of this charset information gets sent back to the web server during a GET or POST operation. The server has no way of knowing how to interpret the url-encoded GET parameters, so it assumes ISO-8859-1.
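As an aside, a parameter that was decoded as ISO-8859-1 is not necessarily lost. ISO-8859-1 maps every byte to exactly one character, so the original bytes can be recovered from the mangled string and decoded again. The sketch below shows that commonly used workaround; it is not what I ended up doing, the class and method names are hypothetical, and it assumes the browser really did send UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: undoes an ISO-8859-1 decode of bytes that were really UTF-8.
final class Latin1Repair {

    static String redecodeAsUtf8(String misdecoded) {
        // Each char of an ISO-8859-1-decoded string is one original byte (0x00-0xFF).
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Reinterpret those bytes with the encoding the browser actually used.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The six characters a server produces when it reads the UTF-8 bytes
        // E7 94 B0 E4 B8 AD as ISO-8859-1.
        String misdecoded = "\u00e7\u0094\u00b0\u00e4\u00b8\u00ad";
        String repaired = redecodeAsUtf8(misdecoded);
        System.out.println("repaired correctly? " + repaired.equals("\u7530\u4e2d"));
    }
}
```

This only works while every character in the string is still in the range U+0000 to U+00FF. Once the value has been through a lossy conversion somewhere else, the bytes are gone.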

OK, so here's a small oversight in either the HTTP or the HTML spec... I haven't thought about it enough yet to decide which. Regardless, this really affects multilingual communication via HTTP. JSP/Servlet containers and web servers are effectively broken in this area because of it. How should it be resolved?

What if I want to POST some non-ASCII data, presumably to enter into a backend database? All is well since I set that URIEncoding flag, right? Wrong. It turns out that Tomcat (sorry to pick on this particular server) doesn't use this URIEncoding flag for POSTed form data. So, what does it use? ISO-8859-1, of course! So now I'm back where I started, and my imaginary application still greets Mr. ç”°ä¸­ instead of Mr. 田中. Not good.

Now how do I get around this? Maybe I can set a hidden FORM parameter to the correct charset, read this, reset the request's character encoding via request.setCharacterEncoding(), and be done with it. I searched the online world again... sorry, nothing in the Tomcat docs on this, although I did see several requests that a parameter similar to URIEncoding be created to handle POSTed data. That would be nice. I got around my particular problem by explicitly calling request.setCharacterEncoding("UTF-8") in a control servlet. I passed in the encoding preference via a servlet initialization parameter, POST_ENCODING. That's OK, I suppose.
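To show what that setting changes, here is a rough stand-alone simulation rather than real servlet code. The `PostBodyDecoder` class is hypothetical; its constructor argument stands in for the POST_ENCODING init parameter, and its `parse` method stands in for what the container does with a form-encoded body when parameters are first read. It assumes Java 10 or later and falls back to ISO-8859-1 when no encoding is configured, mirroring the default described above.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a container's form-body parsing.
final class PostBodyDecoder {
    private final Charset encoding;

    // encodingName plays the role of the POST_ENCODING init parameter;
    // null means "not configured", so fall back to ISO-8859-1.
    PostBodyDecoder(String encodingName) {
        this.encoding = Charset.forName(encodingName == null ? "ISO-8859-1" : encodingName);
    }

    // Parses an application/x-www-form-urlencoded body such as "NAME=John&CITY=Oslo".
    Map<String, String> parse(String body) {
        Map<String, String> params = new LinkedHashMap<>();
        if (body == null || body.isEmpty()) {
            return params;
        }
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate stray separators
            }
            int eq = pair.indexOf('=');
            String rawName = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            // The configured charset decides how escaped bytes become characters.
            params.put(URLDecoder.decode(rawName, encoding),
                       URLDecoder.decode(rawValue, encoding));
        }
        return params;
    }

    public static void main(String[] args) {
        String body = "NAME=%E7%94%B0%E4%B8%AD";
        String expected = "\u7530\u4e2d";
        System.out.println("default decoder correct? "
                + new PostBodyDecoder(null).parse(body).get("NAME").equals(expected));
        System.out.println("UTF-8 decoder correct?   "
                + new PostBodyDecoder("UTF-8").parse(body).get("NAME").equals(expected));
    }
}
```

In a real servlet the equivalent step is the request.setCharacterEncoding() call, and it has to happen before the first getParameter() call on that request; once the parameters have been parsed, changing the encoding has no effect.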

I think it would be easier, though, if there were a more visible standard on this for all JSP/Servlet containers, HTTP servers, and application servers. In the JSP/Servlet container area, Tomcat's URIEncoding goes a long way, at least for GET requests. Unfortunately, this isn't a J2EE standard setting in a web.xml file or anything, or if it is, it's not obvious to me so far. To make matters worse, each server platform (Tomcat, WebLogic, others) tries to handle this in its own way, creating proprietary solutions all around. I noticed that WebLogic uses entries in its weblogic.xml deployment descriptor to handle the same problem. A standard solution for all containers would be best, I think.

The blog server here at java.net seems to handle UTF-8 just fine. That is, the server knows to expect POST data encoded as UTF-8. Does the weblogs.java.net server simply call some method setting the request handler to use UTF-8? Does it read this preference from a properties file? A descriptor file? A command-line argument when starting the server? Hmm... anyone at java.net willing to share how you handle this problem?