Activity

IIRC, I changed that on LUCENE-2245. The core motivation was to keep the addIndexes versions doing the minimal work that's required for them to function properly. So the Directory version merely copies the files over to the target index and the Reader version migrates them to the target index. Other than that, we should let the MP workout the best merges to execute etc.

This lets the caller decide how expensive should addIndexes be, on his part. You can freely call forceMerge/maybeMerge before/after addIndexes, or not execute merges at all. I found this beneficial e.g. when creating indexes via MapReduce, where I used addIndexes, but wanted to wait with merges until the end. When addIndexes always calls maybeMerge, it means it always does this work, and the only way to avoid that is use NoMergePolicy. Perhaps NoMP came after the addIndexes cleanup, not sure...

I like it that addIndexes does the minimum work that's needed. Can you elaborate why is this buggy? What prevents the app from invoking merges itself? I mean, this is not like addDocument() calls ... I believe that addIndexes are less common, and more ... expert?

Shai Erera
added a comment - 18/May/14 06:53 IIRC, I changed that on LUCENE-2245 . The core motivation was to keep the addIndexes versions doing the minimal work that's required for them to function properly. So the Directory version merely copies the files over to the target index and the Reader version migrates them to the target index. Other than that, we should let the MP workout the best merges to execute etc.
This lets the caller decide how expensive should addIndexes be, on his part. You can freely call forceMerge/maybeMerge before/after addIndexes, or not execute merges at all. I found this beneficial e.g. when creating indexes via MapReduce, where I used addIndexes, but wanted to wait with merges until the end. When addIndexes always calls maybeMerge, it means it always does this work, and the only way to avoid that is use NoMergePolicy. Perhaps NoMP came after the addIndexes cleanup, not sure...
I like it that addIndexes does the minimum work that's needed. Can you elaborate why is this buggy? What prevents the app from invoking merges itself? I mean, this is not like addDocument() calls ... I believe that addIndexes are less common, and more ... expert?

Robert Muir
added a comment - 11/Jul/14 22:16 I don't agree with this argument about "This lets the caller decide how expensive should addIndexes be, on his part."
The user can freely configure this with MergePolicy. Its no different from any other index operation. This is a bug.
There is lots of confusion, including a current discussion on the ML.

The user can freely configure this with MergePolicy. Its no different from any other index operation. This is a bug.

I was leaning towards Shai's argument at first, but after a bit of deeper thought, I agree with Robert.

I don't know that having an option to not use the merge policy will add any confusion if the default is right, but it does seem the merge policy itself is sufficient for cases I can think of. I don't know that you need this extra way to control merges.

Mark Miller
added a comment - 11/Jul/14 22:27 The user can freely configure this with MergePolicy. Its no different from any other index operation. This is a bug.
I was leaning towards Shai's argument at first, but after a bit of deeper thought, I agree with Robert.
I don't know that having an option to not use the merge policy will add any confusion if the default is right, but it does seem the merge policy itself is sufficient for cases I can think of. I don't know that you need this extra way to control merges.

FYI: this is the third time i've heard of this trap hitting people and creating hundreds or thousands of index segments: once was from coworkers at a past job, twice was lucene user list discussion "Merger performance degradation on 3.6.1", thrice was Erick's recent ML post.

For people that don't want merging they have NoMergePolicy, maybeMerge() is even documented as "expert" and "Explicit calls to maybeMerge() are usually not necessary. The most common case is when merge policy parameters have changed". So requiring the user to manually invoke this after index operations to prevent segment explosion is wrong IMO.

Robert Muir
added a comment - 11/Jul/14 22:46 FYI: this is the third time i've heard of this trap hitting people and creating hundreds or thousands of index segments: once was from coworkers at a past job, twice was lucene user list discussion "Merger performance degradation on 3.6.1", thrice was Erick's recent ML post.
For people that don't want merging they have NoMergePolicy, maybeMerge() is even documented as "expert" and "Explicit calls to maybeMerge() are usually not necessary. The most common case is when merge policy parameters have changed". So requiring the user to manually invoke this after index operations to prevent segment explosion is wrong IMO.

If this was that important, why wasn't it pushed on weeks ago? I hate to play the bad guy, but I'm a little scared to throw this into the release branch, when it has undergone minimal testing (ie it hasn't be hammered by jenkins for days or weeks).

Ryan Ernst
added a comment - 21/Aug/14 17:49 If this was that important, why wasn't it pushed on weeks ago? I hate to play the bad guy, but I'm a little scared to throw this into the release branch, when it has undergone minimal testing (ie it hasn't be hammered by jenkins for days or weeks).

Ryan Ernst
added a comment - 21/Aug/14 21:21
I think we would have way more major concerns if this was a dangerous change.
This is a really ugly bug with a simple fix.
I don't know enough about the history of this issue here, so I trust your judgement. I'm just trying to be protective from a RM standpoint.