Wednesday, February 14, 2007

I was working on a Java application recently when I got the following exception:

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the element content of the document.

I was using Castor to unmarshal an XML string (XML -> Java objects). The XML originated from a bookmarks file that the user uploaded into the application; it was tidied up, transformed and then stored in an Oracle database. The error occurred at the point where the XML string was read back from the database and converted into Java objects.

I immediately assumed it to be some kind of character encoding conversion problem.

With some databases (e.g. MySQL) it is possible to have data passed as UTF-8 by adding certain settings to the JDBC URL. I do not use MySQL, so I do not know whether this would have fixed my problem, although I suspect it would not.

I tried various methods of converting my XML string into valid UTF-8, and I was pretty sure that the conversion was working, but I still got the error.

This is when I discovered that not all valid UTF8 characters are valid XML characters, which probably makes sense (what with control characters and such) but I have never had to think about this before.
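A small sketch can illustrate the distinction. It assumes the JDK's built-in DOM parser rather than Castor, and the class and method names (ControlCharDemo, roundTripUtf8, parses) are hypothetical. The control character 0x1A survives a UTF-8 round trip unchanged, yet the parser still refuses the document:

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class ControlCharDemo {

    /** Encodes the text to UTF-8 bytes and decodes it again. */
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Returns true if the JDK's default XML parser accepts the text as a document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0x1A is built with a cast so the source file itself stays free of control characters.
        String xml = "<bookmark>bad" + (char) 0x1A + "char</bookmark>";
        System.out.println("UTF-8 round trip unchanged: " + roundTripUtf8(xml).equals(xml));
        System.out.println("Accepted by XML parser: " + parses(xml));
    }
}
```

In other words, re-encoding cannot help here, because the character is perfectly representable in UTF-8; it is the XML specification that excludes it.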

After several hours spent on numerous UTF-8 conversion techniques, I eventually found a solution on the Xalan mailing list. I am reproducing it here because it was not mentioned in the context of the "Unicode: 0x1a" error; if it had been, I would have found it more quickly. The XML standard specifies which Unicode characters are valid in XML documents, so it is possible to take a string and filter out all the invalid characters using a method like this:

/**
 * This method ensures that the output String has only
 * valid XML unicode characters as specified by the
 * XML 1.0 standard. For reference, please see
 * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
 * standard</a>. This method will return an empty
 * String if the input is null or empty.
 *
 * @param in The String whose non-valid characters we want to remove.
 * @return The in String, stripped of non-valid characters.
 */
public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.
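The listing above breaks off after its two declarations. As a minimal sketch of the filter its Javadoc describes (not the original code from the Xalan list), the version below keeps only code points that match the Char production of XML 1.0. The class name XmlCharFilter and the helper isValidXmlChar are hypothetical, and iterating by code point rather than by char is an assumption made here so that characters above U+FFFF, which Java stores as surrogate pairs, are kept intact.

```java
final class XmlCharFilter {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input without characters that XML 1.0 forbids; "" for null or empty input. */
    static String stripNonValidXmlCharacters(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i); // reads a valid surrogate pair as one code point
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by one or two chars
        }
        return out.toString();
    }
}
```

Calling XmlCharFilter.stripNonValidXmlCharacters on the string returned from the database, before handing it to the unmarshaller, would drop characters such as 0x1A while leaving tabs, line breaks and ordinary text alone. An unpaired surrogate is treated as invalid and removed as well.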

javax.servlet.ServletException: javax.xml.transform.TransformerException: com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: An invalid XML character (Unicode: 0x0) was found in the element content of the document. Note: Comment imported. Original by Robot website: http://www.moesol.com at 2007-06-07 08:07

I think the character ranges are correct according to the latest XML specification (Fourth edition, last edited 29 September 2006).

http://www.w3.org/TR/xml/#charsets

Also, CDATA sections are about storing strings of characters that are not to be treated as markup. I am almost certain that they should not contain out of range characters (as this would invalidate the entire XML document).

I found this a very difficult question to answer. I find reading detailed specifications hard at the best of times! I have done some digging around and I am still not certain that I have the right answer. My guess is that, whatever character encoding you use, valid XML must only contain characters from the defined range. I base my guess on the following:

"Remember that character encodings, despite their name do not apply to characters - they apply to byte sequences which represent characters. If you have a char variable in Java it has no character encoding as far as you are concerned, it’s just that character."

However, if you do decide to opt for a character encoding other than UTF-8/UTF-16 you will have additional constraints in order to satisfy XML validity.

I used the method but could not get it to work. I still get $#19;, an invalid XML character. I decoded the String using UTF-8, and str.charAt(i) returned $, #, 1, 9 and ; respectively. Individually, these are valid XML characters. Could you advise what I did wrong? Thanks. Note: Comment imported. Original by Anonymous at 2008-02-25 17:57

Thank you so much. I had a huge data set that we converted into XML, and about 300 of them failed due to invalid characters. I was trying to code a method to do this, but what you have is much better than what I had in mind. THANK YOU!!! Note: Comment imported. Original by sidd at 2008-03-03 18:41

I have a web service that builds an Adobe PDF file and forwards it in a String object. A service client rebuilds this PDF file on the other side.

When the web service client unmarshals the response, it throws this exception: An invalid XML character (Unicode: 0x2) was found in the element content of the document. The PDF document stored in the String object contains characters that are valid for the PDF reader but invalid for the XML parser.

When I use your solution, the invalid characters are removed and the web service works correctly.

But when I recover the PDF after calling my service, it is corrupted. The removed characters are necessary to read the PDF properly.

I have tried multiple ways of converting the String object to UTF-8 encoding, but nothing works. Those characters are always present and obviously necessary.

Is there a way to replace (not remove) the invalid characters in the original String so that it passes the XML parser,

and then restore them after the parsing step to rebuild the original message exactly?
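A PDF is binary data, so one common way to carry it through XML without losing anything is to Base64-encode the raw bytes instead of placing them in a String as characters (XML Schema defines the xs:base64Binary type for this purpose). A minimal sketch, assuming Java 8 or later for java.util.Base64; the class and method names are hypothetical:

```java
import java.util.Base64;

final class BinaryPayload {

    /** Encodes raw bytes as Base64 text, which uses only A-Z, a-z, 0-9, '+', '/' and '='. */
    static String toXmlSafeText(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Restores the original bytes from the Base64 text. */
    static byte[] fromXmlSafeText(String text) {
        return Base64.getDecoder().decode(text);
    }
}
```

The service would put the result of toXmlSafeText into the element content, and the client would call fromXmlSafeText after unmarshalling. The encoded text is roughly a third larger than the original bytes, which is the price of keeping every byte intact.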

I am using the service in my code, but I am not able to test it :( Please post if somebody has tested the service before. The service will trim 0x13; if somebody has this character, please post. Note: Comment imported. Original by vishal paisal at 2008-07-03 12:54

Does anyone know why XML parsers or deserialization objects throw when an invalid character is found? I am just curious about the motivation for throwing an exception rather than ignoring the character. Note: Comment imported. Original by bock at 2009-03-18 17:29

Good question! I will try to answer it! In the above error the XML parser is correctly reporting that this data does not conform to the strict guidelines of the XML specification. It is the parser's job to report any non-compliance to any extent (even if it seems minor). However, in my particular case it is sufficient to discard the non-standard characters and continue working with the data BUT in another situation this course of action might not be appropriate.

I am dealing with a value that is sourced from systems outside my control/scope, and I cannot strip any character that is valid UTF-8 but not valid XML 1.0. Is there a way to transform these invalid characters into a valid XML 1.0 equivalent?

This question has already been asked a few times in the thread.

Many thanks in advance. Note: Comment imported. Original by Mani at 2010-04-28 10:13
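XML 1.0 itself offers no equivalent for these characters: a character reference such as &#x1A; is rejected just like the raw character. Any reversible scheme is therefore an application-level convention that both the producer and the consumer of the XML must share. A minimal sketch of one such convention follows. The escape format (a backslash, the letter u and four hex digits, with a literal backslash doubled) is invented here for illustration, and the class name XmlSafeEscaper and its methods are hypothetical.

```java
final class XmlSafeEscaper {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces each invalid character with an escape and doubles literal backslashes. */
    static String escape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (cp == '\\') {
                out.append("\\\\"); // keep escapes unambiguous
            } else if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Invalid code points are always below 0x10000, so four hex digits suffice.
                out.append(String.format("\\u%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Reverses escape(); assumes the input was produced by escape(). */
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder(escaped.length());
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            char next = escaped.charAt(++i);
            if (next == '\\') {
                out.append('\\');
            } else {
                // next is 'u'; the following four chars are the hex value
                out.append((char) Integer.parseInt(escaped.substring(i + 1, i + 5), 16));
                i += 4;
            }
        }
        return out.toString();
    }
}
```

The producer would call escape before the value is written into the XML, and the consumer would call unescape after parsing. The text still needs the usual XML escaping of characters such as & and <, which a serializer normally handles.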

I am a complete newbie to this kind of thing, so please try not to laugh too hard at my request.

I created a website with WordPress with about 5,000 products. Now I am trying to list my products on an auction site in our country. They gave me a plugin to install that pulls our product feed from the site. The problem is that the feed keeps failing because of a 0x3 invalid XML character error.

If I had to go through all the product descriptions to find the illegal characters, it would take me forever.

It's a WordPress site with WooCommerce installed.

I know that most of the data is stored in the database.

Is there a way to seek and destroy these illegal characters via phpMyAdmin?

The table in question is wp_posts, and the columns in question are post_excerpt, post_content and possibly post_title. The post type is 'product'.