Description

The same items parse correctly if they are first written to a byte array and a ByteArrayInputStream over that byte array is passed to parse.
parser.parse(response.getResponseBodyAsStream());

Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [3,56]
at com.ctc.wstx.sr.StreamScanner.constructNullCharException(StreamScanner.java:615)
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:644)
at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4554)
at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2886)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
at org.apache.abdera.parser.stax.FOMBuilder.getNextElementToParse(FOMBuilder.java:163)
at org.apache.abdera.parser.stax.FOMBuilder.next(FOMBuilder.java:187)
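
The byte-array workaround mentioned in the description can be sketched roughly as follows. This is only a minimal sketch, not code from the report: the class name BufferedBody and the method buffer are hypothetical, and it assumes the whole response body fits in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: copies a stream into memory before it is parsed. */
public class BufferedBody {
    /**
     * Drains the given stream completely, tolerating short reads,
     * and returns an in-memory stream over the collected bytes.
     */
    public static ByteArrayInputStream buffer(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        // read() may return fewer bytes than requested; only -1 means end of stream.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new ByteArrayInputStream(out.toByteArray());
    }
}
```

Under these assumptions the parser would be handed BufferedBody.buffer(response.getResponseBodyAsStream()) instead of the response stream itself, so it never sees the short reads coming from the network.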

Jason Venner (www.prohadoop.com)
added a comment - 25/Mar/09 22:06
This appears to trigger when the socket read boundaries fall such that the first byte of a multi-byte character is the first byte returned by a read from the network socket.
In our failing case, there are 3 reads issued against the input stream returned by the HTTP method:
1 for 4 bytes
1 for 196 bytes
1 for 3800 bytes
and then for 4 KB.
In our failing case, the read for 196 bytes returns fewer than 196 bytes, and the first byte returned by the next read is the start byte of our multi-byte character.
The multi-byte character is returned in the 3rd READ_ARRAY call and written to position 200 in the input buffer.
When the multi-byte character is not the first byte sequence returned by a read, there is no exception.
"TIME" "method" "read byte count" "read byte count after mark resets" "where read data is written into the buffer passed to read" "read request size" "count read"
1238017735367 " AVAILABLE" 0 0 0 4 4
1238017735367 "READ_ARRAY" 0 0
1238017735367 " AVAILABLE" 4 4
1238017735367 "READ_ARRAY" 4 4 4 196 158
1238017735367 " AVAILABLE" 162 162
1238017735367 "READ_ARRAY" 162 162 200 3800 2890
1238017735370 " CLOSE" 3052 3052
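
A trace like the one above could be collected with a small wrapper around the response stream. The following is a hypothetical sketch, not the instrumentation actually used for this report: the class name TracingInputStream, the event format, and the choice to trace only available(), array reads, and close() are all assumptions.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tracing wrapper; records one event per traced call. */
public class TracingInputStream extends FilterInputStream {
    private final List<String> events = new ArrayList<String>();
    private long totalRead = 0; // bytes returned by array reads so far

    public TracingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int available() throws IOException {
        int n = super.available();
        events.add("AVAILABLE total=" + totalRead + " available=" + n);
        return n;
    }

    // FilterInputStream.read(byte[]) delegates here, so it is traced as well.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        long before = totalRead;
        int n = super.read(buf, off, len);
        if (n > 0) {
            totalRead += n;
        }
        // Records the buffer offset, the request size, and the count actually read.
        events.add("READ_ARRAY total=" + before + " offset=" + off
                + " requested=" + len + " read=" + n);
        return n;
    }

    @Override
    public void close() throws IOException {
        events.add("CLOSE total=" + totalRead);
        super.close();
    }

    /** Returns the recorded events in call order. */
    public List<String> events() {
        return events;
    }
}
```

Comparing the requested size with the count read in each READ_ARRAY event makes short reads, such as the 196-byte request that returned 158 bytes, easy to spot.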

Jason Venner (www.prohadoop.com)
added a comment - 25/Mar/09 23:39
HttpClient is using a ChunkedInputStream under the covers, which never lets a single read span a chunk boundary.
The Jetty server on the other side is arranging the chunks so that multi-byte characters fall at the start of chunks.
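
The read pattern described here can be reproduced without a network connection. The sketch below is an assumption-laden stand-in, not HttpClient's ChunkedInputStream: the class name ChunkBoundedInputStream is hypothetical, and it only imitates the one property that matters, namely that a read stops at the end of the current chunk even if more bytes were requested.

```java
import java.io.InputStream;

/** Hypothetical simulation: reads never cross the given chunk boundaries. */
public class ChunkBoundedInputStream extends InputStream {
    private final byte[] data;
    private final int[] chunkSizes;
    private int pos = 0;      // next byte of data to return
    private int chunk = 0;    // index of the current chunk
    private int leftInChunk;  // bytes remaining in the current chunk

    public ChunkBoundedInputStream(byte[] data, int... chunkSizes) {
        this.data = data;
        this.chunkSizes = chunkSizes;
        this.leftInChunk = chunkSizes.length > 0 ? chunkSizes[0] : 0;
    }

    /** Moves to the next non-empty chunk; returns false when no data is left. */
    private boolean advance() {
        while (leftInChunk == 0) {
            chunk++;
            if (chunk >= chunkSizes.length) {
                return false;
            }
            leftInChunk = chunkSizes[chunk];
        }
        return pos < data.length;
    }

    @Override
    public int read() {
        if (!advance()) {
            return -1;
        }
        leftInChunk--;
        return data[pos++] & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (!advance()) {
            return -1;
        }
        // Short read: never return more than what is left in the current chunk.
        int n = Math.min(len, Math.min(leftInChunk, data.length - pos));
        System.arraycopy(data, pos, buf, off, n);
        pos += n;
        leftInChunk -= n;
        return n;
    }
}
```

Choosing the chunk sizes so that a chunk begins on the lead byte of a multi-byte UTF-8 sequence gives a stream with the same shape as the failing case: a short read, followed by a read whose first byte starts a multi-byte character.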

Jason Venner (www.prohadoop.com)
added a comment - 25/Mar/09 23:45
In all failing cases, if I pass the parser an InputStreamReader(method.getResponseBodyAsStream(), "UTF-8"), the parse and element extraction succeed.
This is definitely a bug in the new i18n code.
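
The Reader-based workaround can be sketched as below. This is a minimal illustration under stated assumptions: the class name Utf8ReaderWorkaround and both method names are hypothetical, and the encoding is hard-coded to UTF-8, which ignores any charset declared in the Content-Type header or the XML declaration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: lets the JDK decode the bytes before parsing. */
public class Utf8ReaderWorkaround {
    /** Wraps the byte stream in a Reader that decodes it as UTF-8. */
    public static Reader asUtf8Reader(InputStream body) {
        return new InputStreamReader(body, StandardCharsets.UTF_8);
    }

    /** Reads all characters from the reader; included only for demonstration. */
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        // The decoder keeps incomplete byte sequences until the next read,
        // so short reads on the underlying stream do not split characters.
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

With this approach the parser receives characters that have already been decoded, so the read boundaries of the underlying stream should no longer matter to it. The cost is that the caller, not the parser, becomes responsible for picking the right encoding.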

Jason Venner (www.prohadoop.com)
added a comment - 26/Mar/09 02:24
This code, when run against Abdera 4.0 using HttpClient 3.1, demonstrates the chunked-transfer multi-byte failures.
There are two examples in the code:
one that places a multi-byte character at position 0 in a chunk (the byte array rawChunkWithMultiByteAtStart),
and one that does not place a multi-byte character at position 0 of any chunk (the byte array rawNoChunkWithMultiByteAtStart).
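
The contents of the two byte arrays are not shown in this thread, but bodies of that kind could be produced along the following lines. This is a hypothetical sketch: the class name ChunkedBodyBuilder and its encode method are assumptions, the split offsets are assumed to be ascending and within the body, and only the chunked transfer coding framing itself (hexadecimal size, CRLF, data, CRLF, terminated by a zero-size chunk) is standard HTTP/1.1.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical builder for chunked bodies with caller-chosen chunk boundaries. */
public class ChunkedBodyBuilder {
    /**
     * Encodes body with HTTP/1.1 chunked transfer coding, starting a new
     * chunk at each offset in splits (assumed ascending and within bounds).
     */
    public static byte[] encode(byte[] body, int... splits) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int start = 0;
        for (int i = 0; i <= splits.length; i++) {
            int end = (i < splits.length) ? splits[i] : body.length;
            if (end > start) {
                // Chunk header: size in hexadecimal followed by CRLF.
                writeAscii(out, Integer.toHexString(end - start) + "\r\n");
                out.write(body, start, end - start);
                writeAscii(out, "\r\n");
                start = end;
            }
        }
        // Zero-size chunk and empty trailer end the body.
        writeAscii(out, "0\r\n\r\n");
        return out.toByteArray();
    }

    private static void writeAscii(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes, 0, bytes.length);
    }
}
```

Passing the byte offset of a multi-byte character as a split point yields a body of the first kind, with that character at position 0 of a chunk. Choosing split points that fall only before single-byte characters yields a body of the second kind.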