[ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234888#comment-13234888
]
Andy Seaborne edited comment on JENA-225 at 3/21/12 7:32 PM:
-------------------------------------------------------------
This issue is not related to transactions per se. Normally, node caching hides the fact that the
DB has been corrupted by illegal UTF-8.
The transaction system just happens to expose the problem because it works without the high-level
node caches, in order to make its actions idempotent.
The attached file shows it can happen for a raw storage dataset. The code resets the system
storage cache to remove all node table caches.
Also, in the code snippet, print out the size of the byte buffer after 'encode' and it will
show that it is short.
The problem is in the encoding of chars to bytes. The java.nio.charset.CharsetEncoder needs "onMalformedInput(CodingErrorAction.REPLACE)"
to be set and it isn't. Setting it replaces the bad Unicode codepoint (a high surrogate without a
following low surrogate to make a surrogate pair) with a '?' character -- the standard Java charset
replacement behaviour.
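As a sketch of the fix (not the actual TDB patch; class and variable names here are illustrative), configuring the encoder with REPLACE makes the lone surrogate encode as '?' so the buffer comes out at its full length instead of short:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class EncodeFixSketch {
    public static void main(String[] args) throws Exception {
        String s = "Hello \uDAE0 World";   // lone high surrogate, as in the report

        // Without onMalformedInput set, encode(CharBuffer) reports the malformed
        // input as an exception; an encoder that ignores the error instead
        // produces a short buffer, which is what corrupts the node table.
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        ByteBuffer bb = enc.encode(CharBuffer.wrap(s));
        // The bad codepoint becomes '?', so all 13 characters are accounted for.
        System.out.println(bb.remaining());                       // 13
        System.out.println(Charset.forName("UTF-8").decode(bb));  // Hello ? World
    }
}
```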
A more ambitious fix is not to use the Java encoders/decoders, which are sensitive to codepoint
legality, and drop down to custom code that applies only the UTF-8 encoding rules without checking
for legal codepoints. This would make TDB robust, though something else may break when the data
leaves the JVM and is read in elsewhere, because the data is not legal Unicode.
Classes InStreamUTF8 and OutStreamUTF8 in ARQ show the encoding algorithm. They are slightly
slower (a few percent) than the standard Java encoders when used in RIOT on large files needing
multiple seconds of decoding time. It will only show in TDB on very large literals (100k+ ?).
Normally, lexical forms are at most a few hundred bytes and the difference is not measurable
(the custom codec may even be faster due to lower startup costs). It is well below
the rest of the database processing costs.
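The idea can be sketched as follows (a simplified illustration, not the actual InStreamUTF8/OutStreamUTF8 code; class and method names are made up). Each Java char is encoded by the raw UTF-8 bit patterns with no surrogate checking, so a lone surrogate round-trips as an ordinary three-byte sequence. Encoding per UTF-16 code unit like this also turns supplementary characters into CESU-8-style surrogate pairs, which is part of why the output is not legal Unicode outside the JVM:

```java
import java.io.ByteArrayOutputStream;

public class RawUtf8 {
    // Encode each char by the UTF-8 bit rules only; no legality checks.
    public static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);                         // 1 byte: 0xxxxxxx
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));           // 2 bytes: 110xxxxx 10xxxxxx
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Reverse the bit rules, again without checking for legal codepoints.
    public static String decode(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < b.length; ) {
            int x = b[i++] & 0xFF;
            if (x < 0x80)      sb.append((char) x);
            else if (x < 0xE0) sb.append((char) (((x & 0x1F) << 6) | (b[i++] & 0x3F)));
            else               sb.append((char) (((x & 0x0F) << 12)
                                       | ((b[i++] & 0x3F) << 6) | (b[i++] & 0x3F)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Hello \uDAE0 World";
        // The lone surrogate survives the round trip unchanged.
        System.out.println(s.equals(decode(encode(s))));  // true
    }
}
```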
> TDB datasets can be corrupted by performing certain operations within a transaction
> ------------------------------------------------------------------------------------
>
> Key: JENA-225
> URL: https://issues.apache.org/jira/browse/JENA-225
> Project: Apache Jena
> Issue Type: Bug
> Affects Versions: TDB 0.9.0
> Environment: jena-tdb-0.9.0-incubating
> Reporter: Sam Tunnicliffe
> Attachments: ReportBadUnicode1.java
>
>
> In a web application, we read some triples in an HTTP POST, using a LangTurtle instance
and a tokenizer obtained from TokenizerFactory.makeTokenizerUTF8.
> We then write the parsed Triples back out (to temporary storage) using OutputLangUtils.write.
At some later time, these Triples are then re-read, again using a Tokenizer from TokenizerFactory.makeTokenizerUTF8,
before being inserted into a TDB dataset.
> We have found it possible for the input data to contain character strings which pass
through the various parsers/serializers but which cause TDB's transaction layer to error in
such a way as to make recovery from journals ineffective.
> Eliminating transactions from the code path enables the database to be updated successfully.
> The stacktrace from TDB looks like this:
> org.openjena.riot.RiotParseException: [line: 1, col: 2 ] Broken token: Hello
> at org.openjena.riot.tokens.TokenizerText.exception(TokenizerText.java:1209)
> at org.openjena.riot.tokens.TokenizerText.readString(TokenizerText.java:620)
> at org.openjena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:248)
> at org.openjena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:112)
> at com.hp.hpl.jena.tdb.nodetable.NodecSSE.decode(NodecSSE.java:105)
> at com.hp.hpl.jena.tdb.lib.NodeLib.decode(NodeLib.java:93)
> at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:234)
> at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:228)
> at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
> at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.append(NodeTableTrans.java:188)
> at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.writeNodeJournal(NodeTableTrans.java:306)
> at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.commitPrepare(NodeTableTrans.java:266)
> at com.hp.hpl.jena.tdb.transaction.Transaction.prepare(Transaction.java:131)
> at com.hp.hpl.jena.tdb.transaction.Transaction.commit(Transaction.java:112)
> at com.hp.hpl.jena.tdb.transaction.DatasetGraphTxn.commit(DatasetGraphTxn.java:40)
> at com.hp.hpl.jena.tdb.transaction.DatasetGraphTransaction._commit(DatasetGraphTransaction.java:106)
> at com.hp.hpl.jena.tdb.migrate.DatasetGraphTrackActive.commit(DatasetGraphTrackActive.java:60)
> at com.hp.hpl.jena.sparql.core.DatasetImpl.commit(DatasetImpl.java:143)
> At least part of the issue seems to stem from NodecSSE (I know this isn't actual unicode
escaping, but it's derived from the user input we've received).
> String s = "Hello \uDAE0 World";
> Node literal = Node.createLiteral(s);
> ByteBuffer bb = NodeLib.encode(literal);
> NodeLib.decode(bb);
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira