Robert dug into the intermittent failures and found that they often happened when the test used Lucene's "spoon feeding reader" (MockReaderWrapper). This helpful test-only class wraps any incoming java.io.Reader and randomly chops up the incoming large blocks of characters into small, randomly sized chunks, much like how you would spoon-feed a baby. Its purpose is to tickle any buffering bugs, such as this classic, still-open Xerces-J Unicode bug.
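To make the idea concrete, here is a minimal sketch of a spoon-feeding reader. It is not Lucene's MockReaderWrapper: the class name SpoonFeedingReader, the maxChunk parameter, and the drain helper are all hypothetical names made up for this illustration. It only relies on the standard java.io.FilterReader contract, which allows read to return fewer characters than requested.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Random;

/**
 * Hypothetical sketch of a "spoon feeding" reader (NOT Lucene's MockReaderWrapper):
 * every bulk read is capped at a small random size, so callers that wrongly
 * assume "read fills my buffer" are exposed quickly.
 */
class SpoonFeedingReader extends FilterReader {
    private final Random random;
    private final int maxChunk; // assumed upper bound on chars handed out per read

    SpoonFeedingReader(Reader in, Random random, int maxChunk) {
        super(in);
        if (maxChunk < 1) {
            throw new IllegalArgumentException("maxChunk must be >= 1");
        }
        this.random = random;
        this.maxChunk = maxChunk;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        // Ask the wrapped reader for between 1 and maxChunk chars, never more than len.
        int chunk = Math.min(len, 1 + random.nextInt(maxChunk));
        return in.read(buf, off, chunk);
    }

    /** Reads everything from the reader, tolerating short reads, and returns it as a String. */
    static String drain(Reader reader, int bufferSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[bufferSize];
        int n;
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "The quick brown fox jumps over the lazy dog";
        Reader r = new SpoonFeedingReader(new StringReader(text), new Random(42), 5);
        // A correct consumer reassembles the original text despite the tiny chunks.
        System.out.println(text.equals(drain(r, 16)));
    }
}
```

A consumer that loops until read returns -1, as drain does, sees exactly the same characters as it would without the wrapper. A consumer that assumes each read fills its buffer, or that a surrogate pair never straddles two reads, will misbehave, and that is the kind of bug such a wrapper is meant to provoke.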

Lucene has also had various exciting buffering bugs in its tokenizers in the past, but this time MockReaderWrapper caught a bug in the JVM, specifically in System.arraycopy!

Robert eventually boiled the failing test down to a small test case which finally led to this OpenJDK issue. The issue was quickly fixed (thank you!), but have a look at how it was fixed to see just how hairy it is for the JVM to implement the seemingly innocent System.arraycopy! This is like pulling off the volume knob on your car radio only to discover it has a small nuclear reactor inside.

This collaboration between the OpenJDK team and Lucene developers is win/win: new versions of OpenJDK (and of course Oracle's JDK, which is nearly the same thing) get more extensive testing before being unleashed on the world, and Lucene users gain some confidence that no Java bugs are causing horrible things like silent index corruption, such as this nasty Java 1.6.0 bug from the past.

There is one Oracle developer who really stands out in resolving the scary JVM bugs we discover: on behalf of Lucene committers, I'd like to extend a warm thank you to Vladimir Kozlov. We are perpetually in awe of Vladimir because somehow, with even the most cryptic and difficult Lucene test failures, iterating with Dawid or Robert or Uwe or sometimes all three, Vladimir can stare at heaps and heaps of assembly code created by the HotSpot compiler and understand and fix the JVM bugs. We are not sure how he does it, but he always does!

But the silver lining in this unfortunate event was the closer collaboration and squashed bugs we see today, not just in Lucene but also many other projects that Rory notifies on new JDK snapshot builds.

IBM's J9 JDK joins the fun

IBM has its own J9 JDK, and we used to include it in Lucene's test rotation, but there were too many JVM bugs, such as this mis-compilation of the FST.pack method, causing test failures. In the past, we never succeeded in getting IBM's attention to resolve them.

Interactions with J9 developers are even more limited than with OpenJDK, since J9 is closed-source and there is not even a public issue tracking system where we could see the progress on issues, let alone open and comment on them. So instead of seeing how things are being fixed, as we could above with the tricky System.arraycopy bug, we see cryptic comments like this one. Still, this is better than nothing, and beggars can't be choosers.

I hope that some time soon we can declare that J9 won't crash or corrupt Lucene indices.

Overall, it's wonderful that Lucene's extensive randomized tests are so effective at finding not only Lucene bugs, but also bugs in the various JVM implementations. We've come a long way since the buggy 1.7.0 Oracle JDK release, and juicy bugs are being discovered and squashed. Even so, it's not clear this tenuous process is scalable going forward, given the unnecessary friction in how outside users report issues and the sizable time required to isolate new JVM issues. This is time taken away from, say, building a search engine!