Thursday, May 25, 2006

Here we have the character reference #150 and the actual character 0x96, which is what character reference #150 resolves to. The underscore is there to stop the browser resolving the character reference, and the string '0x96' stands in for the actual character 0x96 because Blogger complains otherwise (stick with me...). Note the encoding used in the prolog.

Here you can see the character reference has been written back out as the same character, however character 0x96 has become 0x2013 (8211 in decimal). Character 150 is the Unicode character "START OF GUARDED AREA" in the non-displayed C1 control character range, but in the Windows-1252 encoding it's mapped to the displayable character 0x2013 "en-dash" (a short dash). Microsoft squeezed more characters into the single-byte range by replacing non-displayed control characters with more useful displayable characters, but some MS Office applications mistakenly went on to label files encoded in this way as ISO-8859-1. In ISO-8859-1 the characters in the C0 and C1 ranges are the non-displayable control characters, but this mis-labelling was so widespread that parsers began detecting this situation and silently switching the read encoding to Windows-1252.
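To make the difference concrete, here is a small Python sketch (Python is my choice for illustration, it isn't what the original example used) that decodes the same byte under both encodings. The codec names "cp1252" and "iso-8859-1" are Python's names for Windows-1252 and the IANA flavour of ISO-8859-1.

```python
import unicodedata

raw = b"\x96"  # the single byte discussed above

as_cp1252 = raw.decode("cp1252")      # read as Windows-1252
as_latin1 = raw.decode("iso-8859-1")  # read as ISO-8859-1

# Windows-1252 maps the byte to a printable dash.
print(hex(ord(as_cp1252)), unicodedata.name(as_cp1252))
# ISO-8859-1 maps it straight to the control character with the same number.
print(hex(ord(as_latin1)), unicodedata.category(as_latin1))
```

This prints "0x2013 EN DASH" and then "0x96 Cc", where Cc is the Unicode category for control characters.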

This problem surfaces when serving XHTML to an XHTML browser. While browsers are reading files using their HTML parsers, any file mis-labelled as ISO-8859-1 that contains bytes in the C1 range (0x80 to 0x9F, the only range where Windows-1252 differs from ISO-8859-1) will still auto-magically display those characters, as the forgiving parsers auto-switch the read encoding. However, when an XHTML file is served using the correct mime type (e.g. "application/xhtml+xml"), XHTML browsers such as Firefox will parse the file using their stricter XML parser - and all the characters in that range will remain as non-displayed control characters. The auto-switch won't take place and characters such as 0x96 (en-dash) that were once displayed will just disappear.

This problem only occurs when an XML file is saved in Windows-1252 but is labelled as something else, usually ISO-8859-1. The most common culprit is Notepad, where a user has edited and saved an XML file without realising/caring that Notepad is unaware of the XML prolog.
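A rough way to spot such a file is to look for bytes in the 0x80 to 0x9F range in a document that declares ISO-8859-1. The Python sketch below is only a heuristic under that assumption; the function name probably_mislabelled is made up for this example, and it only looks at a prolog that sits at the very start of the data.

```python
import re

# Bytes 0x80-0x9F are C1 controls in ISO-8859-1 but mostly printable in Windows-1252.
C1_BYTES = re.compile(rb"[\x80-\x9f]")
# Pulls the declared encoding out of an XML prolog at the start of the data.
DECLARED = re.compile(rb"<\?xml[^>]*encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']")

def probably_mislabelled(data: bytes) -> bool:
    """True if data declares ISO-8859-1 yet contains bytes in 0x80-0x9F."""
    match = DECLARED.match(data)
    if not match:
        return False  # no prolog or no encoding declaration: nothing to check
    declared = match.group(1).decode("ascii").lower()
    return declared == "iso-8859-1" and C1_BYTES.search(data) is not None
```

A legitimate ISO-8859-1 file could in theory contain real C1 control characters, so a True result is a hint to go and look, not proof.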

So back to the example above:

<?xml version="1.0" encoding="Windows-1252" ?><foo>&_#150;0x96</foo>

The main point to realise here is that character references (the #150 in the example) are always resolved using Unicode codepoints, regardless of the specified encoding. Actual characters in the file will be read using the specified encoding. Therefore the #150 is resolved to 0x96 (its Unicode codepoint), while the actual character 0x96 in the source becomes 0x2013 (#8211) as specified in the Windows-1252 encoding.

The result of the transformation demonstrates this. Serialised using the US-ASCII encoding (so all characters above 127 are written out as character references), it looks like this:

<foo>&_#150;&_#8211;</foo>

A great example I think :)
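The same behaviour can be reproduced outside of a transform. The sketch below uses Python's xml.etree.ElementTree rather than the XSLT processor from the example, so treat it as an illustration of the principle and not of what any particular processor does internally. The source bytes contain a real character reference and a real 0x96 byte (no underscore or '0x96' placeholder needed here).

```python
import xml.etree.ElementTree as ET

# A real character reference followed by the raw byte 0x96.
source = b'<?xml version="1.0" encoding="Windows-1252"?><foo>&#150;\x96</foo>'

root = ET.fromstring(source)

# The reference resolves by Unicode codepoint; the raw byte is decoded
# using the encoding declared in the prolog.
codepoints = [hex(ord(ch)) for ch in root.text]
print(codepoints)

# Serialising as US-ASCII forces both characters out as character references.
serialised = ET.tostring(root, encoding="us-ascii")
print(serialised)
```

The two codepoints come out as 0x96 and 0x2013, and the serialised form contains the references #150 and #8211, matching the result shown above.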

Wikipedia has lots more information: http://en.wikipedia.org/wiki/ISO_8859-1

*There are two versions of ISO-8859-1 - the ISO and IANA versions. The ISO version doesn't contain the C0 and C1 control characters, the IANA version does contain them. The XML recommendation uses the IANA version.

Sunday, May 07, 2006

I've uploaded a new version of Kernow that contains Saxon 8.7.1 with a few minor bugs fixed. These mainly centered around paths - the systemId of the source XML and the base URI used by xsl:result-document.

The latter was pretty straightforward: the base URI used by xsl:result-document is set using Controller.setBaseOutputURI() - this value is now set to either the output file/directory (if one is supplied) or the URI of the stylesheet. This makes sense, as if you are outputting to a given file/directory, you would like the output from your xsl:result-document instructions to be relative to that (in my view, anyway). If you haven't supplied an output file/directory and you're outputting to the Kernow output window, then it should be resolved against the stylesheet.
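The rule is easier to see with a couple of concrete URIs. This is a Python sketch of the resolution logic only, not Kernow's actual Java code; the function names and the file paths are invented for the example.

```python
from typing import Optional
from urllib.parse import urljoin

def base_output_uri(stylesheet_uri: str, output_uri: Optional[str] = None) -> str:
    """Use the output location when one was supplied, else the stylesheet."""
    return output_uri if output_uri else stylesheet_uri

def resolve_result_document(href: str, stylesheet_uri: str,
                            output_uri: Optional[str] = None) -> str:
    """Resolve an xsl:result-document href against the chosen base URI."""
    return urljoin(base_output_uri(stylesheet_uri, output_uri), href)

# No output location: resolved next to the stylesheet.
print(resolve_result_document("chapters/ch1.html",
                              "file:///home/user/xsl/style.xsl"))
# Output directory supplied: resolved relative to that directory.
print(resolve_result_document("chapters/ch1.html",
                              "file:///home/user/xsl/style.xsl",
                              "file:///home/user/out/"))
```

The first call gives file:///home/user/xsl/chapters/ch1.html and the second gives file:///home/user/out/chapters/ch1.html.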

Setting the systemId of the TransformerHandler to be that of the XML file seemed wrong to me at first, as you would expect it to be used by the stylesheet. After thinking about it a little, it makes some sense: the TransformerHandler knows the systemId of the stylesheet from when it was created - it doesn't know the location of the source XML as that arrives as a series of SAX events - therefore it's possible to set it through setSystemId. Perhaps someone knows for sure...

Tuesday, May 02, 2006

I finally got around to finishing the Sudoku solver after several weeks of other things getting in the way.

I've improved the heuristics that check for the allowed values for a given cell. It now checks the row and column "friends" - the other rows and columns in a group (e.g. columns 1 and 2 when the cell is in column 3, or columns 1 and 3 when the cell is in 2). This is quite straightforward for a human and is done almost without noticing, but it's reasonably involved to code.

The other change was to recursively insert all the values into cells with only one possible value, and then re-evaluate the allowed values with those cells inserted. This takes care of the situation where there are initially two possible values, but after inserting a definite value elsewhere there is only one possible value left.

Ultimately all this reduces the number of times the transform has to guess - on certain boards it would take a long time to solve purely because it was guessing wrongly early on, and then had to go through all the permutations before working its way back to that early point and trying the next number. Now there should be the maximum number of values in place before it has to guess (if at all) so that it doesn't waste time backtracking.
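For anyone who wants the general shape of the approach without reading the XSLT, here is a compact Python sketch of the "fill in the definite values first, then guess" idea. It is not a port of the stylesheet: it leaves out the row and column "friends" check, the function names are my own, and picking the cell with the fewest options to guess in is an extra assumption of this sketch. A board is a list of nine lists of nine ints, with 0 for an empty cell.

```python
from copy import deepcopy

def allowed_values(board, r, c):
    """Digits not already used in the row, column, or 3x3 box of (r, c)."""
    used = set(board[r]) | {board[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {board[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used

def fill_singles(board):
    """Keep filling cells that have exactly one allowed value, in place.
    Returns False if some empty cell is left with no allowed value."""
    changed = True
    while changed:
        changed = False
        for r in range(9):
            for c in range(9):
                if board[r][c] == 0:
                    options = allowed_values(board, r, c)
                    if not options:
                        return False
                    if len(options) == 1:
                        board[r][c] = options.pop()
                        changed = True
    return True

def solve(board):
    """Return a solved copy of board, or None if it has no solution."""
    board = deepcopy(board)
    if not fill_singles(board):
        return None
    empties = [(r, c) for r in range(9) for c in range(9) if board[r][c] == 0]
    if not empties:
        return board
    # Only now guess, in the cell with the fewest options.
    r, c = min(empties, key=lambda rc: len(allowed_values(board, *rc)))
    for value in sorted(allowed_values(board, r, c)):
        board[r][c] = value
        result = solve(board)  # works on a copy, so a bad guess is discarded
        if result is not None:
            return result
    return None
```

Because each recursive call works on its own copy of the board, backtracking is just a matter of trying the next value when a call returns None.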

Here it is. To try it out, run it using Saxon with "-it main" set, or run it against a dummy input XML. There are a few sample boards included; to try them out, change the main template to point to the variable containing the board, or cut and paste its contents into the param at the top of the stylesheet.

If you find a board that this stylesheet can't process quickly, please let me know.