Description

Stack Trace:
java.lang.AssertionError
at __randomizedtesting.SeedInfo.seed([8E91A6AC395FEED9:618A6129A5BB9EC]:0)
at org.apache.lucene.analysis.MockTokenizer.readCodePoint(MockTokenizer.java:153)
at org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:123)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:558)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:488)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:430)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:424)
at org.apache.lucene.analysis.pattern.TestPatternReplaceCharFilter.testRandomStrings(TestPatternReplaceCharFilter.java:323)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1969)
at com.carrotsearch.randomizedtesting.RandomizedRunner.access$1100(RandomizedRunner.java:132)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:814)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:875)
at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:889)
at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
at org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:32)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
at org.apache.lucene.util.TestRuleReportUncaughtExceptions$1.evaluate(TestRuleReportUncaughtExceptions.java:68)
at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(Randomized

Dawid Weiss
added a comment - 26/May/12 22:32 This bug is caused by Robert's evil random Pattern instances. I think it qualifies as a bug in the JDK's implementation... but then, I'm not sure. Here it is – the Pattern:
Pattern p = Pattern.compile( "]|" );
For those of you who wondered, this also is a valid pattern:
Pattern p = Pattern.compile( "|" );
What should these match? I have no clue. Are they even valid? I have no clue. Anyway, what happens is that surrogate pairs (and everything else) is divided, so:
String s1 = "AB\uD840\uDC00C" ;
String s2 = s1.replaceAll( "]|" , "xyz" );
System .out.println(s1);
System .out.println(s2);
results in:
AB𠀀C
xyzAxyzBxyz?xyz?xyzCxyz
and an assertion is righteously thrown saying a surrogate pair is broken.
Don't know what to do with this. Robert?

When you think of it '|' is not an operator, it is an alternative that simply splits (the automaton) into two conditional paths. Since preconditions and post-conditions are empty, these paths converge into an empty state. Which matches on every character. Which seems to be a bug because it matches on every char[] instead of codepoint ('.' matches each codepoint).

Dawid Weiss
added a comment - 26/May/12 23:03 More insanity:
String s1 = "AB\uD840\uDC00C" ;
String s2 = s1.replaceAll( "", " xyz");
String s1 = "AB\uD840\uDC00C" ;
String s2 = s1.replaceAll( "|||||||||||||||||||||||||||||||" , "xyz" );
These are equivalent.
When you think of it '|' is not an operator, it is an alternative that simply splits (the automaton) into two conditional paths. Since preconditions and post-conditions are empty, these paths converge into an empty state. Which matches on every character. Which seems to be a bug because it matches on every char[] instead of codepoint ('.' matches each codepoint).
Don't know what to make of it.

What i can't explain, is why java treats "empty string" as something that matches in the middle of a code point. that certainly sounds like bug, unless there is some subtlety in Unicode TR#18 that i'm not seeing...

Hoss Man
added a comment - 26/May/12 23:19 - edited When you think of it '|' is not an operator ...
I'm not really following you there ... '|' is the OR operator, so the regex "|" is a redundant way of saying "" which is "the empty pattern" or a way of saying "match the empty string".
Consider in particular when the regex is used for a replace or split type operation: "z|" means "match z or the empty string"...
$ perl -MData::Dumper -le 'print Dumper split /z|/, "ABzCD" ;'
$VAR1 = 'A';
$VAR2 = 'B';
$VAR3 = 'C';
$VAR4 = 'D';
You should see similar results in java with if you use Matcher.find() as an iterator.
As for the original pattern: "]|" – that's just a convince form of (imposible to write this in jira markup w/o code tags)...
\]|
.. an unescaped close bracket (that has no matching open bracket) is treated as a literal...
$ perl -MData::Dumper -le 'print Dumper split /]|/, "AB]CD" ;'
$VAR1 = 'A';
$VAR2 = 'B';
$VAR3 = 'C';
$VAR4 = 'D';
What i can't explain, is why java treats "empty string" as something that matches in the middle of a code point. that certainly sounds like bug, unless there is some subtlety in Unicode TR#18 that i'm not seeing...
http://www.unicode.org/reports/tr18/

I'm not really following you there ... '|' is the OR operator, so the regex "|" is a redundant way of saying "" which is "the empty pattern" or a way of saying "match the empty string".

Yeah, I am a bit surprised at what "" matches. It doesn't match an empty string. It matches an empty string in between characters... or in other words, it matches what's not there. Makes sense when you think of it.

As for '|', I looked at it from automata theory point of view – '|' doesn't need any arguments or post-arguments (or states), unlike '', '*' or the like which need a state to reference. I'd be convinced '|' is a consistent way of saying 'match empty string or empty string' if "" pattern worked ("match empty string one or more times"), but it doesn't – this will fail with an error. So '|' is kind of special here.

I don't know much about regexp theory to argue if I'm right or wrong though. I don't even think there is one "right" way to do things if this is a true quote:

I define UNIX as “30 definitions of regular expressions living under one roof.” —Don Knuth

Dawid Weiss
added a comment - 26/May/12 23:44 I'm not really following you there ... '|' is the OR operator, so the regex "|" is a redundant way of saying "" which is "the empty pattern" or a way of saying "match the empty string".
Yeah, I am a bit surprised at what "" matches. It doesn't match an empty string. It matches an empty string in between characters... or in other words, it matches what's not there. Makes sense when you think of it.
As for '|', I looked at it from automata theory point of view – '|' doesn't need any arguments or post-arguments (or states), unlike ' ', '*' or the like which need a state to reference. I'd be convinced '|' is a consistent way of saying 'match empty string or empty string' if " " pattern worked ("match empty string one or more times"), but it doesn't – this will fail with an error. So '|' is kind of special here.
I don't know much about regexp theory to argue if I'm right or wrong though. I don't even think there is one "right" way to do things if this is a true quote:
I define UNIX as “30 definitions of regular expressions living under one roof.” —Don Knuth
Dawid

It doesn't match an empty string. It matches an empty string in between characters...

Well, it's more complicated then that. it does match the empty string (in the sense of "does this regex match this entire string which happens to be empty) but in the context of "find" or "replace" on a larger string you are correct that it matches nothing, which means it matches the emptiness between characters.

I'd be convinced '|' is a consistent way of saying 'match empty string or empty string' if "+" pattern worked ("match empty string one or more times"), but it doesn't – this will fail with an error. So '|' is kind of special here.

I think that's just a fluke of syntax/precedence ... if you use parens (capturing or otherwise) you can say "match the empty pattern 1 or more times)...

Bottom Line: these patterns are all valid and meaningful, and everything we've discussed is tangential to the problem – which seems to be that the JVM lets the empty pattern split in between chars instead of codepoints, which seems like a bug.

Hoss Man
added a comment - 27/May/12 07:53 It doesn't match an empty string. It matches an empty string in between characters...
Well, it's more complicated then that. it does match the empty string (in the sense of "does this regex match this entire string which happens to be empty) but in the context of "find" or "replace" on a larger string you are correct that it matches nothing, which means it matches the emptiness between characters.
I'd be convinced '|' is a consistent way of saying 'match empty string or empty string' if "+" pattern worked ("match empty string one or more times"), but it doesn't – this will fail with an error. So '|' is kind of special here.
I think that's just a fluke of syntax/precedence ... if you use parens (capturing or otherwise) you can say "match the empty pattern 1 or more times)...
$ perl -MData::Dumper -le 'print Dumper split /(?:)+/, "ABCD" ;'
$VAR1 = 'A';
$VAR2 = 'B';
$VAR3 = 'C';
$VAR4 = 'D';
Bottom Line: these patterns are all valid and meaningful, and everything we've discussed is tangential to the problem – which seems to be that the JVM lets the empty pattern split in between chars instead of codepoints, which seems like a bug.

Dawid Weiss
added a comment - 27/May/12 09:51 Thanks for the insight – interesting.
is tangential to the problem – which seems to be that the JVM lets the empty pattern split in between chars instead of codepoints, which seems like a bug.
Absolutely. This seems like a bug to me too.

Dawid Weiss
added a comment - 27/May/12 19:15 I checked python3 and it works according to the intuition (works on codepoints). I also asked on openjdk's i18n mailing list about where to file it.
In the mean time I'll fix it by looping the random pattern generator until it processes a non-bmp character so that the returned string is valid utf16.

Dawid Weiss
added a comment - 28/May/12 13:33 This is very likely a JDK bug. I committed a patch that works around some of the trivial random patterns but it's definitely not a full resolution of the problem.