Uwe Schindler
added a comment - 31/Oct/11 07:26 See LUCENE-3537 and also the Lucene/Solr web homepage.
A complete report is here:
Main article
Explanation of the string concat issues, this explains why StringConcat optimizations trigger this
Discussion about the update release

Matt Ryall
added a comment - 31/Oct/11 02:14 The Java 7u1 release notes report that this issue is fixed in that release:
JIT and Loop Bugs
Three bugs reported by various parties, including Apache Lucene developers, have been fixed in JDK 7 Update 1, in addition to a fourth related bug found by Oracle (7070134, 7068051, 7044738, 7077439).
I haven't yet been able to verify this.
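A deployment that wants to warn about the GA build could check the runtime version at startup. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes that the GA build reports `java.version` as exactly `1.7.0` while update releases report `1.7.0_NN`. It says nothing about whether 7u1 really fixes the bugs, which is still unverified above.

```java
/** Hypothetical startup check; not part of Lucene or Solr. */
final class Java7GaCheck {

    /**
     * Returns true if the version string looks like the Java 7 GA build.
     * Assumption: GA reports "1.7.0" and update releases report "1.7.0_NN".
     */
    static boolean looksLikeJava7Ga(String javaVersion) {
        return "1.7.0".equals(javaVersion);
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        if (looksLikeJava7Ga(version)) {
            System.err.println("Warning: Java " + version
                    + " predates the loop fixes reported for 7u1.");
        } else {
            System.out.println("java.version = " + version);
        }
    }
}
```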

Robert Muir
added a comment - 02/Aug/11 13:17 I don't think there is any sense in this, who cares?
We reported this crash to Oracle in plenty of time, and the worse wrong-results bug has been open since May 13 (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738), but Oracle decided not to fix that either.

Uwe Schindler
added a comment - 02/Aug/11 12:09 @Shay: Sorry, I did not want to be too Italian. I just wanted to ensure that such configurations, leading to bugs in JVMs, get reported to us. It would also help us respond more quickly to such bug reports, like the one we already got 2 months ago (which nobody was able to reproduce, as we did not know that the user used aggressive opts).

Shay Banon
added a comment - 02/Aug/11 06:55 @Uwe I actually forgot about this, and did not think it was because of the porter stemmer at the time, especially since I did try to reproduce it and never managed to (I thought it was a coincidence that it crashed there). From my experience, you get very little help from Sun/Oracle when using unorthodox flags like aggressive opts without a proper reproduction. Well, you get very little help there even when you do produce a reproduction... (see this issue that I opened, for example: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7066129 ). I am the reason behind the Lucene 1.9.1 release, with the major bug in buffering introduced in 1.9 way back in the day; do you really think I would not contact you if I thought there really was a problem associated with Lucene?

Uwe Schindler
added a comment - 01/Aug/11 23:06 The SIGSEGV bug was already reported on the Elastic Search mailing list in January: http://elasticsearch-users.115913.n3.nabble.com/Java-6u23-and-ES-0-14-2-crashing-on-signal-6-SIGABT-td2289578.html
It would have been nice if Shay Banon had contacted us!

Bernd Fehling
added a comment - 29/Jul/11 12:51
> I got new information from Vladimir about the Porter bug in Java 1.6: "The code in memnode.cpp was there
> for long time (before 6u26). But before my changes it was guarded by OptimizeStringConcat flag. So if you
> use -XX:+OptimizeStringConcat or -XX:+AggressiveOpts flags you will hit the same problem (I reproduced it
> even with 1.6.0_23)"
>
> This might be the reason behind http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm ,
> but we never got a response. If he used aggressive opts he has the same problem.
@Uwe, sorry for not answering that one or creating an issue as Robert said, but while switching from FAST Search to Lucene/Solr I had (and still have) several problems to solve. One was the UTF-8 Jetty problem, then this PorterStemFilter issue came up, and right after that Solr/Lucene crashed with OOM due to FieldCache problems. And there is still my plan to get FST for synonyms running. Dang, my day only has 24 hours.
Yes, I used -XX:+AggressiveOpts, and as we now know, that is why the JVM crashed.
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
After the crashes with PorterStemFilter I removed AggressiveOpts from my JAVA_OPTS.
Now I'm watching what Lucene's FieldCache is doing and whether it's still doubling its size until OOM.
So I'm deep inside
Well, it is interesting to think that if I had filed an issue, and it had been traced down a month ago, this might have prevented a buggy release of Java 1.7.

Uwe Schindler
added a comment - 27/Jul/11 08:02 I got new information from Vladimir about the Porter bug in Java 1.6: "The code in memnode.cpp was there for long time (before 6u26). But before my changes it was guarded by OptimizeStringConcat flag. So if you use -XX:+OptimizeStringConcat or -XX:+AggressiveOpts flags you will hit the same problem (I reproduced it even with 1.6.0_23)"
This might be the reason behind http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm , but we never got a response. If he used aggressive opts he has the same problem.
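Since bug reports often omit the JVM options in use, an application could log the two flags named above when it starts. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, and it only recognizes the flags when they are spelled exactly as in this thread.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of Lucene or Solr. */
final class RiskyJvmFlagCheck {

    // The two options named in the quote from Vladimir above.
    static final List<String> RISKY_FLAGS =
            Arrays.asList("-XX:+AggressiveOpts", "-XX:+OptimizeStringConcat");

    /** Returns the subset of the given JVM arguments that appear in RISKY_FLAGS. */
    static List<String> findRiskyFlags(List<String> jvmArgs) {
        List<String> found = new ArrayList<String>();
        for (String arg : jvmArgs) {
            if (RISKY_FLAGS.contains(arg)) {
                found.add(arg);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // getInputArguments() lists the options passed to this JVM (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        List<String> risky = findRiskyFlags(jvmArgs);
        if (risky.isEmpty()) {
            System.out.println("None of the flags from this thread are set.");
        } else {
            System.err.println("Flags worth mentioning in a bug report: " + risky);
        }
    }
}
```

Including this output in a crash report would have made the earlier PorterStemFilter report much easier to reproduce.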

Dawid Weiss
added a comment - 27/Jul/11 07:54 @Hoss Yeah, it's scary, isn't it? But then: there is no piece of software that is 100% bug free and anybody running a production server will be running migration tests first before running on a new infrastructure. Hey, that's also part of the reason we still have folks running 1.5
I think I'm for releasing 1.7 and getting the road paved for bugfix releases rather than delaying it indefinitely... I mean: it'll be motivational for Oracle if people start screaming!

Robert Muir
added a comment - 27/Jul/11 07:36
> Even if we found a work around for all the affected issues in Lucene that didn't hurt performance in older JVMs, and spun up a 3.3.1 RC in the next 5 minutes, we still don't have enough time to vote for that release and get it out to the mirrors by the time Java 7 comes out – let alone have any confidence that all our users will upgrade Lucene/Solr before they upgrade their JVM.
I agree, I'm not implying we should rush anything. But I guess I'm saying it's worth it to understand the scope of what's affected, because if it's just:
PorterStemmer JRE crash <-- workarounds already posted here
Pulsing negative readVInt <-- no workaround yet
well, that's manageable; only one of these affects any released code.
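For readers unfamiliar with the term: a VInt is a variable-length integer that stores 7 payload bits per byte and sets the high bit while more bytes follow. The sketch below illustrates that encoding and the kind of small decode loop under discussion. It is not Lucene's source; the class and method names are hypothetical. On a correct JVM a decoded non-negative value never comes back negative, which is why a negative result points at the JIT.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative 7-bit variable-length int codec; hypothetical class, not Lucene code. */
final class VIntSketch {

    /** Encodes value as 1 to 5 bytes: 7 payload bits per byte, high bit set while more follow. */
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    /** Decodes a value produced by writeVInt, starting at offset 0. */
    static int readVInt(byte[] bytes) {
        int pos = 0;
        byte b = bytes[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        // Loop well past the compile threshold so the decode loop is likely JIT-compiled.
        for (int i = 0; i < 1000000; i++) {
            int decoded = readVInt(writeVInt(i));
            if (decoded != i) {
                throw new AssertionError("round trip failed for " + i + ": got " + decoded);
            }
        }
        System.out.println("all round trips matched");
    }
}
```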

Robert Muir
added a comment - 27/Jul/11 06:34 I just wrote a test (Test10KPulsings) designed to seek out the corrupt index bug.
It didn't work, but separately it sometimes creates a corrupt index with Java 6.
Adding lucene/src/test/org/apache/lucene/index/codecs/pulsing
Adding lucene/src/test/org/apache/lucene/index/codecs/pulsing/Test10KPulsings.java
Transmitting file data .
Committed revision 1151335.

Hoss Man
added a comment - 27/Jul/11 06:33 Frankly i'm amazed that the jdk7 guys are saying "yes this is a bug that can cause a sigsegv in code that worked fine using Java 1.6, but we're going to go ahead and release 1.7 with this bug in place anyway, it should make it in by 1.7_u2"
makes me scared shitless of what other known bugs will be in Java 1.7.0.
Even if we found a work around for all the affected issues in Lucene that didn't hurt performance in older JVMs, and spun up a 3.3.1 RC in the next 5 minutes, we still don't have enough time to vote for that release and get it out to the mirrors by the time Java 7 comes out – let alone have any confidence that all our users will upgrade Lucene/Solr before they upgrade their JVM.
I think the most important thing we can do is publicize the shit out of this hotspot bug, and warn everybody on the fucking planet not to use Java1.7.0 because of it.
if we also find clean workarounds we can commit and release in our own code, so be it – but that seems like priority #2

Robert Muir
added a comment - 27/Jul/11 03:50
> Should we place a warning on the "Download" and "News" page on Solr and Lucene website? The risk is high that you corrupt your index, if you index using these JDK versions.
Not totally sure. The issue is not so different from LUCENE-2975: if we can make an easy workaround (there are 2 possible ones on this issue for the Porter bug), I think we give it our best try and get it out in a release. This way, if someone has to support JDK 7, we can at least say "upgrade to this version of Lucene" rather than "won't fix". No matter how much we scream, users will be confused, because it seems these bugs only affect loops of a very specific form.
On the other hand, if it makes our code messy or confusing or slows things down, we should not do this.
I will look into this new negative VInt bug (it might only affect pulsing) and see if I can make a test case and workaround for it.

Uwe Schindler
added a comment - 26/Jul/11 21:44 Here is the final patch for OpenJDK, including Porter.java as a test case:
http://cr.openjdk.java.net/~kvn/7070134/webrev/7070134.patch (see also http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2011-July/005972.html , http://cr.openjdk.java.net/~kvn/7070134/webrev/ )
For the full bugfix, also the following fixes are needed:
http://cr.openjdk.java.net/~kvn/7044738/webrev/7044738.patch
http://cr.openjdk.java.net/~kvn/7068051/webrev/7068051.patch
All three were applied to Jenkins' OpenJDK7 (excluding the testcases).

Uwe Schindler
added a comment - 26/Jul/11 21:02 Response from the Hotspot mailing list about their release plans:
> Thank you, Uwe
> I will send the patch for reviews shortly.
> About java 7 release. We are late to do any bugs fixes in GA which should happen soon. All loop optimization fixes will go definitely into jdk7 update 2. We will try to push them into update 1 (which is targeted only for security fixes) but we can't promise.
> There is going discussion about using current Hotspot VM in future jdk6 updates but there is no decision yet. Note: current Hotspot VM sources are targeted for JDK8 and jdk7 updates only.
> Regards,
> Vladimir
This means Java 7 will come out with heavily broken loops (for almost any for or while loop, you cannot be sure that it still works correctly once it has been executed 10 thousand times).
What do others think? Should we place a warning on the "Download" and "News" pages of the Solr and Lucene websites? The risk is high that you corrupt your index if you index using these JDK versions. Also, the default configuration of Solr will SIGSEGV.
We should also inform the user mailing lists.
I can prepare something and we can discuss? Oracle JDK 1.7.0 GA will be released on July 28th, according to Oracle's press releases. At least on that day we should have something available to present to the users.

Uwe Schindler
added a comment - 26/Jul/11 14:05 Hi,
we had some success with direct communication with the HotSpot developers.
The whole story:
Java 7 contains a fix for the readVInt issue that has existed since approximately 1.6.0_21 (LUCENE-2975); fortunately, this fix was not included in 1.6.0_26.
This fix causes the SIGSEGV in the Porter code, but also breaks other loops (e.g. a strange CheckIndex failure in org.apache.lucene.facet.search.SamplingWrapperTest).
We were in contact with the hotspot-compiler-dev list, and Vladimir sent me the patches that should fix the bug. The attached patch is a combination of all patches received, in a format suitable for the FreeBSD ports build framework. Place the file in your port's "files/" folder and rebuild the package. In Debian/Ubuntu you should be able to do the same thing by placing the file in the debian/patches folder somehow.
I have now disabled all Jenkins builds and queued the Java 7 builds for 3.x and trunk quarter-hourly. The machine is now stress testing.
We will report the results back to Oracle, but it seems that the attached patch fixes the issues.
If they had added their original broken fix to the 1.6.0_26 release, it would have been catastrophic...

Robert Muir
added a comment - 25/Jul/11 17:44 The bug is now visible at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7070134
If anyone has a few minutes, it would be cool if they voted on it (the Oracle site is horrendously slow, and I know that's discouraging).
I think there will be a lot of confusion if Java 7 is released with this bug; for instance, simple things like the Solr example will not really work at all.
You don't need some crazy random test to trigger this: once this method passes the compile threshold (e.g. 10k invocations), then boom.
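The "compile threshold" point can be illustrated with a tiny harness that calls a loop-heavy method many more than 10k times and compares every result against the first one. This is only a sketch: `countVowels` is a hypothetical stand-in for a hot method such as a stemmer step, and the actual threshold depends on the VM and its flags. It does not reproduce bug 7070134; it only shows the shape of a test that would notice results changing after JIT compilation.

```java
/** Hypothetical warm-up harness; not part of the Lucene test suite. */
final class WarmUpCheck {

    /** Stand-in for a hot method that contains a loop. */
    static int countVowels(char[] word, int len) {
        int n = 0;
        for (int i = 0; i < len; i++) {
            switch (word[i]) {
                case 'a': case 'e': case 'i': case 'o': case 'u':
                    n++;
                    break;
                default:
                    break;
            }
        }
        return n;
    }

    /** Returns the first iteration whose result differs from the first call, or -1 if none does. */
    static int firstMismatch(int iterations) {
        char[] word = "generalization".toCharArray();
        // Reference value from the very first call, before any warm-up.
        int expected = countVowels(word, word.length);
        for (int i = 0; i < iterations; i++) {
            if (countVowels(word, word.length) != expected) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int mismatch = firstMismatch(50000);
        System.out.println(mismatch < 0
                ? "results stayed consistent"
                : "result changed at iteration " + mismatch);
    }
}
```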

Uwe Schindler
added a comment - 23/Jul/11 22:36 I can confirm, Robert's fixes fix all bugs in Lucene & Modules (I used the "slow" one, which is not slow). Solr tests no longer segfault when they use PorterStemFilter, but the above test failures are real and not HotSpot related.

Uwe Schindler
added a comment - 23/Jul/11 22:35
> wait, how do you know? Do all Solr tests pass with -Xint?
Solr tests also do not pass with -Xint. It seems to be a concurrency bug in Solr's caching. With caching disabled (in SolrIndexSearcher), tests pass except those which directly check cache contents. This affects TestFiltering, RequiredFieldsTest and more tests (they fail randomly depending on load).
Another test randomly fails without reason: TestEchoParams (this test looks like Chinese to me; I don't understand a single line or what is tested at all).

Robert Muir
added a comment - 23/Jul/11 21:28 wait, how do you know? Do all Solr tests pass with -Xint?
Maybe there is some other issue affecting Solr, perhaps something XML related.
Please open up a separate JIRA issue for that: I don't want to confuse that stuff with this one.

Robert Muir
added a comment - 23/Jul/11 15:43
You can monitor this bug on the Java Bug Database at
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7070134.
It may take a day or two before your bug shows up in this external database.

Robert Muir
added a comment - 23/Jul/11 15:06 yeah, I think it's a bad bug; obviously even casually using this stemmer will cause it to crash (this is no crazy random test, just stemming a file of a few thousand English words)
how do we vote -1 to release Java7?

Robert Muir
added a comment - 23/Jul/11 14:10 i traced this down to the step4 method... maybe we can code it differently and dodge the bug.
e.g. this passes:
ant test -Dtestcase=TestPorterStemFilter -Dtests.iter=100 -Dargs="-XX:CompileCommand=exclude,org/apache/lucene/analysis/en/PorterStemmer,step4"
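The option in that command tells HotSpot not to JIT-compile the named method, so step4 keeps running in the interpreter. For anyone scripting the same workaround for other methods, the sketch below builds the option string in the class,method form used above. The helper name is hypothetical, and it covers only this one form of the option.

```java
/** Hypothetical helper for building a CompileCommand exclude option. */
final class CompileCommandSketch {

    /** Builds the option in the same class,method form as the ant command above. */
    static String excludeOption(String className, String methodName) {
        // The command above names the class with slashes, so dots are converted.
        return "-XX:CompileCommand=exclude," + className.replace('.', '/') + "," + methodName;
    }

    public static void main(String[] args) {
        System.out.println(
                excludeOption("org.apache.lucene.analysis.en.PorterStemmer", "step4"));
    }
}
```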