Details

Description

For Java 5. Java 5 is based on Unicode 4.0, which means a variable-width encoding (UTF-16 with surrogate pairs).

Supplementary character support should be fixed for code that works with char/char[].

For example:
StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they don't actually remove supplementary characters, or modified to look for surrogates and behave correctly.
LowercaseFilter should be modified to lowercase supplementary characters correctly.
CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.

In all of these cases the code should remain optimized for the BMP case, and supplementary characters should be the exception, but they should still work.
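The idea above can be sketched concretely: iterate a char[] by code point so the BMP path stays cheap and surrogate pairs are only paid for when present. This is a hypothetical illustration (the class and method names are not Lucene APIs):

```java
public class CodePointIteration {
    // Count letters in a buffer, pairing surrogates only when encountered.
    static int countLetters(char[] buffer, int length) {
        int letters = 0;
        for (int i = 0; i < length; ) {
            // codePointAt pairs a high/low surrogate into one code point
            int cp = Character.codePointAt(buffer, i, length);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return letters;
    }

    public static void main(String[] args) {
        // 'a', GOTHIC LETTER AHSA (U+10330, a surrogate pair), 'b'
        char[] text = "a\uD800\uDF30b".toCharArray();
        System.out.println(countLetters(text, text.length)); // prints 3
    }
}
```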

Robert Muir
added a comment - 12/Jun/09 19:54 My example has some "issues" (i.e., it depends upon the knowledge that no surrogate pairs lowercase to BMP code points).
It also doesn't correctly handle invalid Unicode data (unpaired surrogates).
Such complexity can all be added to the slower-path supplementary case; the intent is just to show it wouldn't be a terribly invasive change.
Thanks!

Robert Muir
added a comment - 12/Jun/09 20:07 ICU uses the following idiom to check whether a char ch is a surrogate; it might be the fastest way to ensure the performance impact is minimal:
(ch & 0xFFFFF800) == 0xD800
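A quick sanity check of this idiom against the surrogate range U+D800..U+DFFF (hypothetical class name, for illustration only):

```java
public class SurrogateCheck {
    // The ICU-style mask test: true exactly for U+D800..U+DFFF.
    static boolean isSurrogate(char ch) {
        return (ch & 0xFFFFF800) == 0xD800;
    }

    public static void main(String[] args) {
        System.out.println(isSurrogate('a'));      // false
        System.out.println(isSurrogate('\uD7FF')); // false (just below the range)
        System.out.println(isSurrogate('\uD800')); // true  (first high surrogate)
        System.out.println(isSurrogate('\uDFFF')); // true  (last low surrogate)
        System.out.println(isSurrogate('\uE000')); // false (just above the range)
    }
}
```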

Michael McCandless
added a comment - 13/Jun/09 10:45 Robert, could you flesh this patch out to a committable point? I.e., handle unpaired surrogates, and add a test case that first shows that LowercaseFilter incorrectly breaks up surrogates? Thanks!
it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints
Is it invalid to make this assumption? I.e., does the Unicode standard not guarantee it?

Simon Willnauer
added a comment - 13/Jun/09 15:23 The scary thing is that this happens already if you run Lucene on a 1.5 VM, even without introducing 1.5 code.
I think we need to act on this issue ASAP and release it together with 3.0 -> full support for Unicode 4.0 in Lucene 3.0.
I also thought about the implementation a little bit. The need to detect chars > BMP and operate on them might be spread out across Lucene (quite a few analyzers and filters, etc.). Performance could truly suffer from this if it is done "wrong" or even more than once. It might be worth considering making the detection pluggable, with an initial filter that only checks whether surrogates are present in a token and sets an indicator on the token representation, so that subsequent TokenStreams can operate on it without rechecking. This would also preserve performance for those who do not need chars > BMP (which could be quite a large number of people). Those users could then simply not supply such an initial filter.
Just a couple of random thoughts.

Robert Muir
added a comment - 13/Jun/09 15:55 Simon, I want to address your comment on performance.
I think that surrogate detection is cheap when done right, and I don't think there are a ton of places that need changes.
But I don't think any indicator is really appropriate; for example, my TokenFilter might want to convert one Chinese character in the BMP to another one outside of the BMP. It is all Unicode.
But there is more than just analysis involved here; for example, I have not tested WildcardQuery's ? operator.
I'm not trying to go berzerko and be 'ultra-correct', but basic things like that should work.
For situations where it's not worth it, e.g. FuzzyQuery's scoring, we should just document that the calculation is based on 'code units' and leave it alone.

Robert Muir
added a comment - 13/Jun/09 16:16 I am curious how you plan on approaching backwards compatibility?
#1 The patch is needed for correct Java 1.5 behavior; it's a 1.5 migration issue.
#2 The patch won't work on Java 1.4 because the JDK does not supply the needed functionality.
#3 It's my understanding you like to deprecate things (say, CharTokenizer) and give people a transition time. How will this work?
I intend to supply a patch that fixes the problems, but I wanted to bring up those issues...

Robert Muir
added a comment - 13/Jun/09 16:40 Simon, a response to your other comment: 'I think we need to act on this issue asap and release it together with 3.0. -> full support for unicode 4.0 in lucene 3.0'
Full Unicode support would be great!
But this involves a whole lot more; I'm trying to create a contrib under LUCENE-1488 for "full unicode support".
A lot of what is necessary isn't available in the Java 1.5 JDK, such as Unicode normalization.
Maybe we can settle for 'full support for UCS (Universal Character Set)' as a start!

Robert Muir
added a comment - 13/Jun/09 18:01 I forgot to answer your question, Michael:
it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints
Is it invalid to make this assumption? Ie, does the unicode standard not guarantee it?
I do not think it guarantees this for all future Unicode versions. In my opinion, we should exploit things like this only if I can show a test case that proves it's true for every code point in the current version of Unicode.
And it should be documented that this could possibly change in some future version.
In this example, it's a nice simplification because it guarantees the length (in code units) will not change!
I think for a next step on this issue I will create and upload a test case showing the issues and detailing some possible solutions.
For some of them, maybe a javadoc update is the most appropriate; for others, maybe an API change is the right way to go.
Then we can figure out what should be done.
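For illustration, Java 5's Character.toLowerCase(int) shows the simplification holding for a sample supplementary pair: the Deseret capital at U+10400 lowercases to U+10428, another supplementary code point, so the length in code units is unchanged. This demonstrates current Unicode data only, not a guarantee for future versions:

```java
public class SupplementaryLowercase {
    public static void main(String[] args) {
        int upper = 0x10400; // DESERET CAPITAL LETTER LONG I
        int lower = Character.toLowerCase(upper);
        System.out.println(Integer.toHexString(lower)); // 10428, still supplementary
        // Both take two UTF-16 code units, so the length does not change
        System.out.println(Character.charCount(upper)); // 2
        System.out.println(Character.charCount(lower)); // 2
    }
}
```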

Yonik Seeley
added a comment - 13/Jun/09 19:11 I am curious how you plan on approaching backwards compat?
I don't see a real back-compat issue... I can't imagine anyone relying on the fact that >BMP chars wouldn't be lowercased. To rely on that would also be relying on undocumented behavior.
It seems like this (handling code points beyond the BMP) is really a collection of issues that could be committed independently when ready?
Moving to 3.1, since much of it requires Java 1.5.
If there's something desirable to slip in for 2.9 that doesn't depend on Java 5, we can still do that.

Robert Muir
added a comment - 13/Jun/09 20:06 Yonik, I think the problem is where method signatures must change, such as CharTokenizer, which is required to fix LetterTokenizer and friends.
These are probably the worst offenders, as a lot of these tokenizers just remove >BMP chars, which is really nasty behavior for CJK.
I agree it's a collection of issues, but maybe there can be an overall strategy?

Yonik Seeley
added a comment - 13/Jun/09 21:09 Yonik, I think the problem is where method signatures must change, such as CharTokenizer, required to fix LetterTokenizer and friends.
Ah, OK. Actually it just occurred to me that this would also require reindexing; otherwise queries that hit documents in the past would mysteriously start missing them (for text outside the BMP).

Robert Muir
added a comment - 15/Jun/09 02:12 This is just a patch with test cases showing the existing behavior.
Perhaps these should be fixed:
Simple/Standard/StopAnalyzer/etc.: deletes all supplementary characters completely.
LowerCaseFilter: doesn't lowercase supplementary characters correctly.
WildcardQuery: the ? operator does not work correctly.
Perhaps these just need some javadocs:
FuzzyQuery: scoring is strange because it's based upon surrogates; leave alone and javadoc it.
LengthFilter: length is calculated based on UTF-16 code units; leave alone and javadoc it.
... and there's always the option to not change any code, but just javadoc all the behavior as a "fix", providing stuff in contrib or elsewhere that works correctly.
Let me know what you think.
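The code-unit vs. code-point mismatch behind both lists can be seen directly: a single supplementary character is two UTF-16 code units, which is what a single-position ? wildcard or a code-unit-based length actually measures (hypothetical class name, for illustration):

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        // MUSICAL SYMBOL G CLEF (U+1D11E), outside the BMP
        String s = new String(Character.toChars(0x1D11E));
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```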

Simon Willnauer
added a comment - 31/Jul/09 15:27 Robert, are you still working on this issue? I have seen a couple of questions about supplementary character support on the mailing lists, though - I will see how much time I have during the next weeks to get this going.
simon

Robert Muir
added a comment - 31/Jul/09 20:43 One tricky part of this is CharTokenizer; a few things subclass this guy.
Sometimes it is OK (WhitespaceTokenizer, because all whitespace in Unicode is <= 0xFFFF).
Sometimes it is not (LetterTokenizer).
It depends how it's being used. So I think instead of just
abstract boolean isTokenChar(char)
char normalize(char)
we can add additional int-based methods, where the default implementation is to defer to the char-based ones.
This would preserve backwards compatibility.
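A rough sketch of that shape (hypothetical names, not the actual patch): the new int-based method's default implementation defers to the char-based one for BMP code points, so an old subclass's override still takes effect, while a supplementary-aware subclass overrides the int variant instead.

```java
public class CharTokenizerSketch {
    // Old char-based method, as current CharTokenizer subclasses override it.
    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    // New int-based variant: the default defers to the char-based method
    // for BMP code points, preserving old subclasses' behavior. Supplementary
    // code points are rejected here, mirroring the old behavior, until a
    // subclass overrides this method with real supplementary handling.
    protected boolean isTokenChar(int c) {
        return c <= 0xFFFF && isTokenChar((char) c);
    }

    public static void main(String[] args) {
        CharTokenizerSketch t = new CharTokenizerSketch();
        System.out.println(t.isTokenChar((int) 'x'));  // true, via the char path
        System.out.println(t.isTokenChar(0x10330));    // false under the default
    }
}
```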

Michael McCandless
added a comment - 31/Jul/09 22:58 That sounds right.
We would also deprecate the char-based methods? And note that in 3.0 we'll remove them and make the int-based methods abstract?
And then fix CharTokenizer to properly handle the extended chars?
We may need to use reflection to see if the current subclass "wants" chars or ints, and fall back to the old (current) char processing if it's a subclass that doesn't define the int-based methods.

Robert Muir
added a comment - 31/Jul/09 23:14 Michael, yes, that's what I had in mind.
Like you mentioned, the non-final CharTokenizer-based things (Whitespace, Letter) are a problem.
The easiest way (it looks) is to put the reflection inside the non-final CharTokenizers: Whitespace and Letter.
I want them to work correctly if you have not subclassed them, but if you have, it should call your char-based method (which won't work for supplementary characters, but it didn't before)!
Thoughts?

Robert Muir
added a comment - 31/Jul/09 23:24 By the way, I plan to replace any scary bit operations with UnicodeUtil functions. The JDK doesn't have methods to yank the lead or trail surrogate out of an int - except in sun.text.normalizer, of course.

Robert Muir
added a comment - 03/Aug/09 13:19 I think instead of the way I prototyped in the first patch, it might be better to have the original CharTokenizer incrementToken definition still available in the code.
This is some temporary code duplication, but it would perform better for the backwards-compat case, and the backwards compatibility would be clearer, to me at least.

Michael McCandless
added a comment - 08/Aug/09 11:01
I think instead of the way I prototyped in the first patch, it might be better to have the original chartokenizer incrementToken definition still available in the code.
this is some temporary code duplication but would perform better for the backwards compat case, and the backwards compatibility would be more clear to me at least.
I suppose we could simply make an entirely new class which properly handles surrogates, and deprecate CharTokenizer in favor of it? Likewise we'd have to make new classes for the current subclasses of CharTokenizer (WhitespaceTokenizer, LetterTokenizer). That would simplify being back-compatible.

Robert Muir
added a comment - 08/Aug/09 14:27 Michael, I do think that would be the simplest, but most invasive (we would have to make new duplicate classes for Arabic & Russian, even!)
I think I will try to create a patch for my 2nd method, and if it cannot be made simple, we can always do the new class/deprecation route?

Robert Muir
added a comment - 09/Aug/09 05:12 Patch with a different technique for CharTokenizer and friends. I like this much more.
For non-final subclasses that are moved to the new int API, I had to expose a protected boolean switch so that any old subclasses of them work as expected: see Arabic/Russian/Letter/Whitespace for examples.

Uwe Schindler
added a comment - 09/Aug/09 15:14 In your latest patch you have this method getDeclaringClass() using the while loop. To get the class declaring a method, it is simpler to call
this.getClass().getMethod(...).getDeclaringClass()
See TokenStream.java, for example, where the same is done for the new TokenStream API.
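The TokenStream-style idiom Uwe refers to, sketched with a public JDK method (String overriding hashCode) purely for illustration; the class name is hypothetical:

```java
public class DeclaringClassIdiom {
    public static void main(String[] args) throws Exception {
        // getMethod searches superclasses too, so getDeclaringClass tells us
        // which class in the hierarchy actually (re)declared the method.
        Class<?> decl = "abc".getClass().getMethod("hashCode").getDeclaringClass();
        System.out.println(decl == String.class); // true: String overrides hashCode
    }
}
```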

Robert Muir
added a comment - 09/Aug/09 15:30 Uwe, unfortunately Class.getMethod only works for public methods!
So I must use Class.getDeclaredMethod in this case, because the methods in question are protected. The problem with Class.getDeclaredMethod is that it does not check superclasses like Class.getMethod does; thus the loop.

Uwe Schindler
added a comment - 09/Aug/09 15:43 OK, I did not know this. I read the Class.getMethod() javadocs again, and it states that it only returns public methods. In the case of TokenStream backwards compatibility, only public methods are affected, so I used the "simple and faster" way.
I was already wondering why you changed this call, because the rest of the reflection code seemed to be copied from TokenStream, e.g. the names of the parameter type arrays...

Robert Muir
added a comment - 09/Aug/09 15:55 Uwe, thank you for the good example! I tried to apply your same logic but was frustrated when it wasn't working, only to discover this annoyance with Java reflection.

Uwe Schindler
added a comment - 09/Aug/09 16:09 - edited For me it is strange that getMethod() does not work for protected methods, because they can be inherited. That it does not work for private methods is clear, because they are only known to the class itself (you cannot even override a private method). But the JDK defines it that way and we must live with it.
I would stop the iteration when getSuperclass() returns null and then return null (which cannot happen, as the method is defined at least in CharTokenizer, as you wrote in the comment).
private Class getDeclaringClass(String name, Class[] params) {
  for (Class clazz = this.getClass(); clazz != null; clazz = clazz.getSuperclass()) {
    try {
      clazz.getDeclaredMethod(name, params);
      return clazz;
    } catch (NoSuchMethodException e) {}
  }
  return null; /* impossible */
}

Robert Muir
added a comment - 11/Aug/09 12:46 This is just a reminder that I think if we go this route, we will need caching similar to what Uwe added for TokenStream back compat, so that construction performance is not bad.

Robert Muir
added a comment - 16/Nov/09 15:44 A couple of people have asked me about this issue lately; I would prefer to spin off smaller issues rather than create large patches that become out of date.
Also, I think Simon is interested in working on some of this, so there will be more JIRA spam, but I think it will be easier to make progress.

Robert Muir
added a comment - 16/Nov/09 18:03 Yonik, or anyone else, please let me know your thoughts on the following:
I don't see a real back compat issue... I can't imagine anyone relying on the fact that >BMP chars wouldn't be lowercased. To rely on that would also be relying on undocumented behavior.
Ah, OK. Actually it just occurred to me that this would also require reindexing, otherwise queries that hit documents in the past would mysteriously start missing them (for text outside the BMP).
What should our approach be here wrt index back compat?
For the issues mentioned here, I can't possibly see >BMP working currently for anyone, but you are right that it will change results.
I don't want to break index back compat; I just wanted to mention that introducing Unicode 4 support, still with API back compat and with no performance degradation, is going to be somewhat challenging already.
If we want to somehow support the "broken" analysis components for index back compat, then we also have to have a broken implementation available on top of the correct impl (probably using Version to handle this).
In my opinion, this would introduce a lot of complexity. I will help do it, though, if we must.

Robert Muir
added a comment - 16/Nov/09 18:10 BTW, it's worth mentioning that this whole concept of index back compat wrt Unicode is already completely broken in Lucene.
If someone upgrades to Lucene 3.0 and uses Java 1.5, Character.* operations already return different results than on Java 1.4, so they are already broken with respect to their previous index.

Mark Miller
added a comment - 16/Nov/09 20:42 If there is nothing we can do here, then we just have to do the best we can -
such as a prominent notice alerting that if you transition JVMs between building and searching the index and you are using or doing X, things will break.
We should put this in a spot that is always pretty visible - perhaps even a new readme file titled something like IndexBackwardCompatibility, to which we can add other tips and gotchas as they come up. Or MaintainingIndicesAcrossVersions, or FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent/sticky entry at the top of CHANGES.

Robert Muir
added a comment - 16/Nov/09 20:46 If there is nothing we can do here, then we just have to do the best we can -
Oh, there is definitely a choice, and as I said, I am willing to help either way we go.
I am just warning that there will be a lot more complexity if we want to use, say, Version to support both the broken implementation and the fixed implementation:
lots of tests, and hopefully no bugs.
Uwe already commented that he would like it controlled with Version; I'm just saying supplementary characters are notoriously tricky as it is!

Mark Miller
added a comment - 16/Nov/09 20:51 I'm speaking in regards to:
btw, its worth mentioning that this whole concept of index back compat wrt Unicode is already completely broken in Lucene.
if someone upgrades to lucene 3.0 and uses java 1.5, Character.* operations already return different results than on java 1.4, so they are already broken from their previous index.
We can fix that too? If so, I think we should. The tokenstream backcompat was a bitch as well - but we had to slog through it.

Robert Muir
added a comment - 16/Nov/09 20:57 We can fix that too? If so, I think we should. The tokenstream backcompat was a bitch as well - but we had to slog through it.
Mark, honestly, I do not yet know how this one could reasonably be done.
The problem is that the behavior depends a lot on the user's JVM; how in the hell do we know which JVM version was used to create some given index?

Steve Rowe
added a comment - 16/Nov/09 20:58 I don't know if this is the right place to point this out, but: JFlex-generated scanners (e.g. StandardAnalyzer) do not properly handle supplementary characters.
Unfortunately, it looks like the as-yet-unreleased JFlex 1.5 will not support supplementary characters either, so this will be a gap in Lucene's Unicode handling for a while.

Mark Miller
added a comment - 16/Nov/09 21:01 Mark honestly, I do not yet know how this one could be reasonably done.
Then that's what I am saying we should be documenting.
how in the hell do we know which JVM version was used to create some given index????
If something like this made sense long term, we can start putting it in as index metadata - not helpful now, but it could end up being so.

Robert Muir
added a comment - 16/Nov/09 21:01 Steve, no, it's definitely the right place to point it out! I know this is true, even with 1.5.
One reason I wanted to split this issue up was to try to make 'improvements'; maybe we do not fix everything.
There are also other options for StandardTokenizer/JFlex.
For instance, we could avoid breaking between any surrogate pairs and classify them as CJK (indexing the individual character) for the time being.
While technically incorrect, it would handle the common cases, i.e. ideographs from Big5-HKSCS, etc.
But right now the topic is, I guess, Unicode and index back compat in general... trying to figure out the reasonable approach to handling this (supporting the old broken behavior and indexes created with it).

Robert Muir
added a comment - 16/Nov/09 21:03 Then that's what I am saying we should be documenting.
+1. I think it should be in the release notes that if you upgrade your Java version to either 1.5 or 1.7, you should reindex, because Unicode versions change.
Because of this, character properties change and some tokenizers, etc. will emit different results.
This is really only somewhat related to this issue, but that's what my email thread on java-dev was all about.