The third party is not sending ISO-8859-1; they are sending UTF-8. But they don't specify that in the Content-Type, which forces the request to use the default. The default charset for our system is UTF-8, as specified in the file.encoding system property and confirmed by a quick look at:

Charset.defaultCharset().name();

It was my belief that when the charset was not specified, it would fall back to the default charset. However, some extensive testing has proven that this is not so. In fact, the charset used as a default seems to be ISO-8859-1 when none is specified.
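To make the symptom concrete, here is a minimal sketch (assuming Java 7+ for StandardCharsets; the class name DefaultCharsetCheck and the sample text are made up for illustration). It prints the JVM default and then decodes the same UTF-8 bytes both ways, which is roughly what happens when a UTF-8 body is read with an ISO-8859-1 default:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of JBoss or CXF.
final class DefaultCharsetCheck {

    /** Decodes raw request bytes with the given charset. */
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // The JVM default, normally derived from the file.encoding property.
        System.out.println("JVM default: " + Charset.defaultCharset().name());

        // Sample non-ASCII text, encoded as UTF-8 the way the third party sends it.
        byte[] body = "\u00e6\u00f8\u00e5".getBytes(StandardCharsets.UTF_8);

        // Correct decoding versus decoding with the wrong single-byte charset.
        System.out.println("as UTF-8:      " + decode(body, StandardCharsets.UTF_8));
        System.out.println("as ISO-8859-1: " + decode(body, StandardCharsets.ISO_8859_1));
    }
}
```

If the second line of output looks like the garbage you see in your service, the bytes on the wire are fine and only the decoding charset is wrong.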

So here is my new question:

Since HTTP requests use a different default charset than the one specified in file.encoding, where does this default charset come from, and how can we set a different one?

Ignore the fact that you are getting the XML via an HTTP request. XML is well defined in terms of encodings. An XML document with no encoding information is defined to be UTF-8. Any reasonable XML processor will process XML as UTF-8 if there is no encoding information, not the system default character encoding. So you have nothing to worry about.

The only time you would have a problem is if you have an XML document with no encoding declaration that is not in UTF-8. This would technically be a broken XML document (as in, it breaks the XML spec). If you were in this situation, you would probably have a difficult time, as I doubt many XML frameworks provide ways to alter the default XML encoding (since it is defined as part of the XML spec).

According to the HTTP spec:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

The default, if none is specified, is ISO-8859-1:

"When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP."

... and since the Content-Type is "text/xml", it therefore defaults to ISO-8859-1.

Re-reading the thread, you also have the problem that the SOAP message itself does not have an XML declaration. So there is no charset in the HTTP Content-Type header, and with no XML declaration there is no encoding declared in the body either.
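As a rough illustration of that rule, the fallback can be sketched like this. This is not the code JBoss or CXF actually runs; HttpCharsetResolver and resolve are hypothetical names, and the parsing is deliberately simplistic:

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical sketch of the RFC 2616 default, not a container API.
final class HttpCharsetResolver {

    /**
     * Returns the charset named in a Content-Type value. If there is none,
     * returns ISO-8859-1 for text/* media types and null for everything else.
     */
    static Charset resolve(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        String mediaType = parts[0].trim().toLowerCase(Locale.ROOT);

        // Look for an explicit charset parameter first.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return Charset.forName(value);
            }
        }

        // No charset parameter: text/* falls back to ISO-8859-1 over HTTP.
        return mediaType.startsWith("text/") ? Charset.forName("ISO-8859-1") : null;
    }
}
```

With this, resolve("text/xml") gives ISO-8859-1, which matches what you are observing, while resolve("text/xml; charset=utf-8") gives UTF-8.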

I doubt that XML processors always assume UTF-8 if the XML declaration is missing.

The following spec:

http://www.w3.org/TR/xml/#sec-guessing

... says that an XML processor can guess the encoding if the encoding in the XML declaration is missing (the XML declaration is there, just the encoding part is missing). It's a different story, though, if the XML declaration itself is missing.

In which case, it goes to:

http://www.ietf.org/rfc/rfc3023.txt

Section 8.5 says:

8.5 Text/xml with Omitted Charset

This example shows text/xml with the charset parameter omitted. In this case, MIME and XML processors MUST assume the charset is "us-ascii", the default charset value for text media types specified in [RFC2046]. The default of "us-ascii" holds even if the text/xml entity is transported using HTTP.

Omitting the charset parameter is NOT RECOMMENDED for text/xml. For example, even if the contents of the XML MIME entity are UTF-16 or UTF-8, or the XML MIME entity has an explicit encoding declaration, XML and MIME processors MUST assume the charset is "us-ascii".

I quoted text/xml since that is what the HTTP Content-Type indicated by the OP ... but maybe it would have been different if the Content-Type sent was "application/soap+xml".

We have done a lot of testing, and the conclusion is the same as you say: it defaults to ISO-8859-1. I also read that XML should default to UTF-8, but in testing we found that both text/plain and text/xml default to ISO-8859-1.

We have isolated the cause, now we need to find a way to fix it.

This web service runs on JBoss 7.1.1 using CXF 2.4.6.

1. Is it possible to set a different default than iso-8859-1 when no charset is defined?

2. Is it possible to intercept the request in a handler or filter and modify the http headers by adding charset to the content-type before it is handled?

3. Is there any other way to force the system to use utf-8 instead of iso-8859-1 by modifying xml, request, etc?

1. When the HTTP Content-Type does not have a charset defined, and perhaps only for the Content-Types you are interested in: text/xml or application/soap+xml.

2. When the body does not have an XML declaration. Be aware that XML can have a byte order mark (BOM) in front of the XML declaration.

Of course, if you can convince the client to send the correct Content-Type AND the XML declaration, you don't have to do any of the above.
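As a starting point for the first condition, here is a small sketch of the header logic on its own. ContentTypeFixer and withUtf8IfMissing are names I made up; in a servlet container you would presumably call this from a javax.servlet.Filter that wraps the request in an HttpServletRequestWrapper and overrides getContentType() and getCharacterEncoding(). I have not tested whether CXF on JBoss 7.1.1 picks that up, so treat it as an assumption. The XML declaration check from the second condition is not covered here.

```java
import java.util.Locale;

// Hypothetical helper; the servlet filter wiring is left out on purpose.
final class ContentTypeFixer {

    /**
     * Appends "; charset=UTF-8" when the media type is text/xml or
     * application/soap+xml and no charset parameter is present.
     * Any other value is returned unchanged.
     */
    static String withUtf8IfMissing(String contentType) {
        if (contentType == null) {
            return null;
        }
        String lower = contentType.toLowerCase(Locale.ROOT);
        String mediaType = lower.split(";", 2)[0].trim();

        boolean isXml = mediaType.equals("text/xml")
                || mediaType.equals("application/soap+xml");
        // Matches "; charset=" with optional whitespace around the name.
        boolean hasCharset = lower.matches(".*;\\s*charset\\s*=.*");

        return (isXml && !hasCharset)
                ? contentType.trim() + "; charset=UTF-8"
                : contentType;
    }
}
```

For example, withUtf8IfMissing("text/xml") returns "text/xml; charset=UTF-8", while a header that already carries a charset, or a type such as text/plain, is passed through untouched.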

On an unrelated note:

I just noticed that the SOAP payload is actually an MM7 request. I used to work in the mobile industry implementing the 3GPP MM7 specification, and I had to test with various carriers around the world. Some of the operators and software vendors implementing the spec treated its informative sections as normative. Those informative sections had bugs, which I had to point out to get them corrected in the spec, because vendors and operators were focusing on the informative sections rather than the normative ones. So I am not surprised that you are running into the same situation I had 6 years ago.

The only reason I have this problem is because the client is not sending according to the specs. I have emailed them about the problem, with several of the links you provided, so hopefully they can fix it in time. In the meantime, I have to make do with these workarounds.

He he, yes, it is an MM7 request :) In fact, most of the carriers that are our clients use MM7 without any problems. However, one of them has a really old version that is not completely up to spec, and that causes a lot of different challenges.

I also dealt with MM7 requests for the first time about 6 years ago, but back then I simply used a servlet and manually parsed the payload. It was much easier to control charsets and the like. This time I wanted to do it right and use web services :)

Thanks a lot for your help. I will use this workaround until the client fixes the problem.

I doubt that XML processors always assume UTF-8 if the XML declaration is missing.

The XML spec says that an XML processor can guess the encoding if the encoding in the XML declaration is missing (the XML declaration is there, just the encoding part is missing). It's a different story, though, if the XML declaration itself is missing.

Okay, I guess that is technically correct. The XML processor can infer certain encodings based on the initial bytes. I was referring to the final case:

"Other -> UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind"

My general point was just that XML has a well-defined "guess the encoding" process in the case that the declaration is missing (as opposed to some arbitrary textual document).
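To show what that process looks like, here is a simplified sketch of the initial-bytes check. XmlEncodingSniffer is a made-up name, and the UCS-4 and EBCDIC rows of the table are left out, so a UCS-4 stream would be misreported. A real parser also goes on to read the encoding declaration when there is one; this only covers the first step.

```java
// Hypothetical, simplified sketch; UCS-4 and EBCDIC cases are omitted.
final class XmlEncodingSniffer {

    /** Guesses the encoding family from the first bytes of an XML entity. */
    static String guess(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";      // UTF-8 byte order mark
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";   // big-endian byte order mark
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";   // little-endian byte order mark
        }
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) {
            return "UTF-16BE";   // "<?" in UTF-16BE, no byte order mark
        }
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) {
            return "UTF-16LE";   // "<?" in UTF-16LE, no byte order mark
        }
        // The "Other" case: no mark and no recognizable declaration.
        return "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that none of these branches ever yields ISO-8859-1: a body with no byte order mark and no declaration ends up in the last case.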

If that option does not work, or is too hard, you can always try Apache's mod_headers:

http://httpd.apache.org/docs/2.2/mod/mod_headers.html

Go to the section "RequestHeader Directive", which says:

"This directive can replace, merge, change or remove HTTP request headers. The header is modified just before the content handler is run, allowing incoming headers to be modified."

You may then need to have Apache (or Tomcat) act as a proxy in front of JBoss, so that the carrier no longer sends its HTTP POST directly to JBoss but goes through the proxy, and the request reaches JBoss with the charset already set in the Content-Type.