Activity

Jan Høydahl
added a comment - 19/Feb/12 11:00
I propose OPTIMIZE should still work in 3.x but be deprecated and yell about it in the logs. The most straightforward approach is perhaps to add a new forceMerge command to replace the old one. Then from 4.0 the old optimize command would be a NOP command.
Reasoning behind this is that <optimize/> causes a lot of people trouble in Solr today because it's over-used due to its alluring name. I don't think anyone will miss it once it's gone, and those who really need it can start using <forceMerge/>, which is a better name anyhow.
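Jan's proposed transition can be sketched as a tiny shim (a hypothetical model, not actual Solr code; the class name, return values, and warning text are invented): in 3.x the old command still merges but warns loudly, and flipping one flag models the proposed 4.0 no-op behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the proposed deprecation path -- not actual Solr code.
class UpdateCommandShim {
    final List<String> warnings = new ArrayList<>();
    boolean optimizeIsNoOp = false; // would become true in 4.0

    String handle(String command) {
        if ("optimize".equals(command)) {
            // 3.x behavior: still works, but "yells about it in the logs"
            warnings.add("WARN: <optimize/> is deprecated, use <forceMerge/>");
            return optimizeIsNoOp ? "no-op" : forceMerge();
        }
        if ("forceMerge".equals(command)) {
            return forceMerge();
        }
        return "unknown";
    }

    // Stands in for the real merge-all-segments work.
    private String forceMerge() {
        return "merged";
    }
}
```

In this sketch existing clients that send optimize keep working (they just accumulate warnings), which is the back-compat property Jan is after.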

Yonik Seeley
added a comment - 19/Feb/12 14:02
-1
Long-term API stability is very important, and this simply boils down to a documentation issue.
If we changed the external API every time we thought of a slightly better name, things would be quite a mess. What might make sense for a Java library doesn't necessarily make sense for a server, and we have different back-compatibility goals. Lucene renaming something should not be a reason for Solr to do so.

Uwe Schindler
added a comment - 19/Feb/12 14:06 - edited
Yonik: I disagree here:
One problem is e.g. DIH: it optimizes by default, which is the stupidest thing it could do on every incremental update (see SOLR-3142).
If you disagree, I would simply suggest (as I proposed before) making optimize a no-op in Solr. Very easy, and it hurts nobody, but it prevents people from doing the wrong thing.

Yonik Seeley
added a comment - 19/Feb/12 14:22
The biggest mess is DIH - it optimizes by default which is the stupidest thing it could do
Are you saying that committers don't know the cost of optimize?
If all the renaming craziness in lucene-land is going to creep to solr, I should start vetoing those!

Robert Muir
added a comment - 19/Feb/12 14:32
I will open a separate issue to remove this auto-optimize in DIH. This seems less controversial than the whole issue. If someone wants to optimize, they can pass &optimize=true; removing the default will only speed up most people's applications, especially if they often do incremental updates from their database.

Uwe Schindler
added a comment - 19/Feb/12 15:42
To come back to the original issue:
I am very glad that Jan opened the issue. I would suggest (as mentioned in other issues, too) making optimize a no-op in Solr and adding a new forceMerge=segments with loud warnings. This way no existing code breaks (it just no longer optimizes).
Is this a good idea, Yonik?

Uwe Schindler
added a comment - 19/Feb/12 16:09
I just repeat here what Mike already posted on the Lucene issue:
Some quick googling uncovers depressing examples of over-optimizing:
https://jira.duraspace.org/browse/FCREPO-155
http://stackoverflow.com/questions/3912253/is-it-mandatory-to-optimize-the-lucene-index-after-write
http://issues.liferay.com/browse/LPS-2944
http://download.oracle.com/docs/cd/E19316-01/820-7054/girqf/index.html
https://issues.sonatype.org/browse/MNGECLIPSE-2359
http://blog.inflinx.com/tag/lucene
That last one has this fun comment:
// Lucene recommends calling optimize upon completion of indexing
writer.optimize();
Most of the above items also affect Solr. E.g. the first one (I know people from FIZ Karlsruhe and Fedora) is really funny: Fedora GSearch calls optimize=true on every add of a single document to Solr. I even know people using Solr who complained about GSearch because of this.
We can fix those horrible user-code bugs very fast by making optimize a no-op in Solr; they will all appreciate it. I just repeat: Nobody's installation would break, it would just get faster.
Some funny detail: With Lucene 3.x, search actually gets faster with multiple segments if you do parallel ExecutorService-based search (I still don't really recommend using ExecutorService on IndexSearcher...). On the other hand, executing a search on a non-optimized pre-2.9 index, with no per-segment search, was really slower, as MultiTermsEnum and MultiDocsEnum were used.
With Lucene 3.x there is really no slowdown at all caused by multiple segments, as each segment is searched on its own with no interaction and the results just added to the same priority queue. I agree, Solr has some problems with faceting, but people should use per-segment faceting and not optimize; this would improve their installations immensely (although the actual faceting might get slower, on the other hand FieldCaches can be reused, so it actually gets faster). The current default is global faceting and (for most installations) "optimize on every single item added" (see above links).

Yonik Seeley
added a comment - 19/Feb/12 16:21
With Lucene 3.x there is really no slowdown at all caused by multiple segments
There is less of a slowdown - but it's certainly still there. Whether it matters or not will depend on the exact use cases.
Solr has some problems with faceting, but people should use per-segment faceting and not optimize
No... people should do whatever suits their use case best.
Some very well informed users of Solr still optimize. They change their index infrequently (like once a day), and have determined that the performance increases they see by optimizing make it worth it for them.

Yonik Seeley
added a comment - 19/Feb/12 16:29
This is really a documentation issue. I took a shot at improving it here:
http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22
Are there other places in the docs we need to improve (by either adding details, or removing the example altogether)?

Uwe Schindler
added a comment - 19/Feb/12 16:32
We can even handle that:
If somebody passes optimize=true to the update request handler, we don't do anything (no optimize) and instead print a warning message to the log saying that optimize was disabled in Lucene because it has no positive effect on most installations. It should also mention that there is a new forceMerge, but people should not call it unless they know exactly what they are doing.
The above examples and a lot more "howtos" on the web make users think they have to optimize (after every single add). After that they complain how slow Solr is. Is this really what you want?
The FIZ Karlsruhe eSciDoc project develops the so-called Europeana project, which is supposed to index all cultural content from Europe. They are using Fedora as a repository, so the above issue was like a no-go for them to use GSearch (based on Solr). With so much misinformation about optimize on the net, the most reasonable approach is to simply disable the feature in question to prevent further harm.
People that rely on optimize (because they want their statistics 100% correct) will get informed by the warning messages in the logs. For them it's almost a one-line code change in their Solr client. If they don't do it, they will also not be disappointed, because:
There is less of a slowdown - but it's certainly still there
So in most cases they would not even notice, because new versions of Solr will bring other improvements.

Mark Miller
added a comment - 19/Feb/12 16:35
With Lucene 3.x there is really no slowdown at all caused by multiple segments, as each segment is searched on its own with no interaction and just the results added to same priority queue.
Do we have benchmarks for this in some issue - would love to see some numbers.
So, in the past, sorting certainly added a cost to multiple segments as you moved from segment to segment - did that go away in some issue? That must be completely different code these days if 100 segments or more performs like one.

Uwe Schindler
added a comment - 19/Feb/12 17:01
100 segments?
In comparison, the numbers for Lucene 2.9 dropped extensively; pre-2.9, optimizing was often a must, I agree! The problem was Multi*, with itself having priority-queue-like structures, slowing down term enumeration and postings retrieval. With Lucene 3.x the difference between an optimized and a "standard 8 segment index" was always below measurement uncertainty (see lots of benchmarks from Mike on Lucene 4). For standard relevance-ranked or numeric sorting there was never a real slowdown.
I am always talking about relevance-ranked results and numerics. With StringIndex sorting there is certainly an overhead, but as we now support sortMissingLast also for numerics, almost nobody has to use it.

Jan Høydahl
added a comment - 19/Feb/12 17:10
The Python Django-solr search library ALWAYS calls optimize after adding documents, see indexing.py:
http://code.google.com/p/django-solr-search/source/search?q=optimize+commit&origq=optimize+commit&btnG=Search+Trunk
I had a customer using this library to batch-load a bunch of documents, and it took AGES and almost killed the JVM.
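A toy cost model (invented unit costs, nothing library-specific) shows why a batch load that optimizes after every add "takes AGES": each optimize rewrites the entire index accumulated so far, so total write work grows quadratically with the number of adds instead of linearly.

```java
// Toy cost model: each add writes one small segment of unit cost;
// an optimize then rewrites the whole index built up so far.
class OptimizeCostModel {
    // Returns {writes without optimize, writes with optimize after every add}.
    static long[] compare(int docs) {
        long addsOnly = 0, optimizeEveryAdd = 0, indexSize = 0;
        for (int i = 1; i <= docs; i++) {
            indexSize += 1;                    // the new segment joins the index
            addsOnly += 1;                     // normal add: write just the segment
            optimizeEveryAdd += 1 + indexSize; // add, then rewrite everything
        }
        return new long[] { addsOnly, optimizeEveryAdd };
    }
}
```

For 1,000 adds this comes to 1,000 write units without optimize versus 501,500 with it, roughly a 500x difference that keeps growing with index size.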

Mark Miller
added a comment - 19/Feb/12 17:14
With StringIndex sorting there is certainly an overhead, but as we support sortMissingLast now also for numerics, almost nobody has to use it.
Ah, okay - that makes sense.

Yonik Seeley
added a comment - 19/Feb/12 17:15
I am always talking about relevance-ranked results and numerics.
And those are often not the bottleneck for Solr users.
There are a few issues here:
the queries we often see in the field can be vastly more complex than the standard ones that lucene tests with
people are often most concerned with their slowest queries, not their average query speed (as long as they can meet throughput needs)
full-text search is often not the bottleneck at all
Another issue that I've seen a couple of customers hit: big memory increases in the field cache as the number of segments grows. The string index values are not shared per-segment, so in the worst case doubling the number of segments almost doubles the memory for the per-segment FieldCache entries.
There are tradeoffs to a lot of these things, and we should be careful to not fall into a "one size fits all" mentality.
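Yonik's FieldCache point can be illustrated with a rough memory model (all numbers invented, and it assumes the worst case where every distinct value occurs in every segment): each segment's cache holds its own copy of the string-value dictionary, so the dictionary cost scales with the segment count.

```java
// Rough worst-case model of per-segment FieldCache memory for a string sort
// field: one int ord per document, plus a full copy of the value dictionary
// per segment (string values are not shared across segments).
class FieldCacheMemoryModel {
    static long bytes(long docs, long cardinality, long bytesPerValue, int segments) {
        long ords = docs * 4L;                                        // ord arrays
        long dictionaries = (long) segments * cardinality * bytesPerValue;
        return ords + dictionaries;
    }
}
```

With 1M docs and 100k distinct ~40-byte values, one segment needs about 8 MB in this model while 20 segments need about 84 MB; doubling the segment count nearly doubles the dictionary portion.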

Uwe Schindler
added a comment - 19/Feb/12 17:18
Another issue that I've seen a couple of customers hit: big memory increases in the field cache as the number of segments grows. The string index values are not shared per-segment, and hence in the worst case, 2 times the number of segments equals almost 2 times the memory for the per-segment FieldCache entries.
This goes in the same direction as my answer to Mark: With sortMissingLast support on numerics, numerics as Strings are no longer needed. So the solution here is to use real numerics.

Mark Miller
added a comment - 19/Feb/12 17:24
Lots of use of string fields that are not numerics though - the product I worked on in the past only sorted by non-numeric string fields, many times lots of them at once.
I'm coming around on this issue myself though. For the benefits it brings, optimize is not a good name. It calls out to be called. The abuse is clearly there, and we should probably try to address it with more than just documentation.
My opinion is coming around to leave it for 3.x, change it to an expert option for 4 that works the same, is understated, and is called forceMerge or whatever.
Big -1 to making it a no-op.

Yonik Seeley
added a comment - 19/Feb/12 17:43
And if we did change, naive users would be:
"oh, optimize doesn't work any more..." (looks up what it's been changed to) "ok, changed to forceMerge."
After forceMerge is out there for a while, it would have the same problem as optimize. Someone tries it, their queries run faster, and it gets passed along as something to try to speed things up (and it is in the right scenario). The correct path here is to document it correctly, and get rid of any bad examples in our documentation.
Someone can add a big fat message at the top of CHANGES explaining the cost of optimize and the fact that it's often less necessary than it was in the past if they want.

Robert Muir
added a comment - 19/Feb/12 18:56
I'm coming around on this issue myself though. For the benefits, optimize is not a good name. It calls out to be called. The abuse is clearly there, and we should probably try more to address it than just doc.
My opinion is coming around to leave it for 3.x, change it to an expert option for 4 that works the same, is understated, and is called forceMerge or whatever.
I think we can probably make improvements here; here are my ideas:
Any 'auto-optimization' in our own code is really bad. We should fix any auto/default optimizes so that if users want to optimize, they must specify it.
Any 'auto-optimization' in third-party integrations is equally bad, but we can fix this in a number of ways. Sure, making optimize a no-op is one solution; another is to actually fix the docs, ping those projects with an email, offer patches, etc.
We can improve the docs to really emphasize to users how expensive manual optimize and expungeDeletes calls are. Personally I feel the wiki text Yonik linked to is way too nice about this.
The name 'optimize' will always be a trap, I think. Can't we start by adding 'forceMerge' and issuing a deprecation warning if someone uses optimize (but still doing it anyway)? Then the next step would be (in some future release) to return a hard error if someone uses 'optimize', since eventually it gets removed.

Uwe Schindler
added a comment - 19/Feb/12 19:10
Robert: I would also agree with this. If others don't want to make optimize() a no-op, I would agree to make a serious log.warn() or even better log.fatal() out of it, saying that it's a bad idea in most cases. And that it's deprecated (deprecation by log printing, funny). People who call optimize or forceMerge after each single document will have a log filled with warning messages; this should make them look into it.
In my opinion expungeDeletes and forceMerge should always print a warning-like message to the log, saying that they're doing something heavy and resource-wasteful. Optimize would additionally say that it's deprecated.

Yonik Seeley
added a comment - 19/Feb/12 19:27 - edited
Personally I feel the wiki text Yonik linked to is way too nice about this.
Here's the current wiki text (I just modified it to suggest what "infrequently" might mean... i.e. nightly, not on the minute or something), added the term "very expensive" and bolded the "entire" to draw attention to it.
An optimize is like a hard commit except that it forces all of the index segments to be merged into a single segment first. Depending on the use cases, this operation should be performed infrequently (like nightly), if at all, since it is very expensive and involves reading and re-writing the entire index. Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately.
I would agree to make a serious log.warn()
I'd be fine with that part. I'll give it a shot.

Walter Underwood
added a comment - 19/Feb/12 19:29
A warning message seems over the top. There are perfectly valid reasons to do a full merge. It is just fine as the last step if you rebuild a medium to small index every day, like we did at Netflix.
I've worked on two other engines with automatic index merging, Ultraseek and MarkLogic. One called it "full merge", the other "force merges" (I think). Neither one logged a warning.

Robert Muir
added a comment - 19/Feb/12 19:38
A warning message seems over the top.
I don't think a warning message for a deprecated command is over the top; how else will people know to switch to 'forceMerge' (in the case they really need it)?
We already log warning messages if people use e.g. deprecated analyzers or other things; I'm just suggesting we deprecate the trappy name like anything else would be deprecated. It seems worse to me to silently deprecate something.
By the way: I think it would also be nice if forceMerge required n as a parameter, rather than defaulting to 1.
Here's the current wiki text (I just modified it to suggest what "infrequently" might mean... i.e. nightly, not on the minute or something), added the term "very expensive" and bolded the "entire" to draw attention to it.
+1, I think those are good improvements.

Yonik Seeley
added a comment - 19/Feb/12 19:48
Here's a warn patch.
The text I used is this:
log.warn("Starting optimize... reading and rewriting entire index.");
It tries to just state what is going on, and tries not to indicate it's an error or that the user should not be doing it.

Yonik Seeley
added a comment - 19/Feb/12 20:38
I just checked the Solr tutorial and saw this:
"There is also an optimize command that does the same thing as commit, in addition to merging all index segments into a single segment, making it faster to search and causing any deleted documents to be removed."
It would be no great loss to just remove that sentence since it's just an introduction and not a reference.

Uwe Schindler
added a comment - 19/Feb/12 22:26 - edited
I am fine with the log messages; I just would also like to deprecate the term "optimize" and change it to "forceMerge". That's all this issue is about. The above log messages would then apply to forceMerge. Of course old-style optimize would get a different message, that it's deprecated and the user most likely does not want to call it.

Yonik Seeley
added a comment - 19/Feb/12 22:40
I'm against deprecating optimize. We can't change the name of every operation that people might use incorrectly (and this is one of the easiest to understand), and we shouldn't here. We shouldn't penalize the majority of users who use APIs correctly due to some minority calling it when they have no idea what it does. Being a server with a whole ecosystem of other systems that talk to us (think like a database), we have a much higher bar for back compat changes in our interfaces.

Uwe Schindler
added a comment - 19/Feb/12 23:29 - edited
We shouldn't penalize the majority of users who use APIs correctly due to some minority calling it when they have no idea what it does
Minority?:
Minority?:
https://github.com/mbaechler/OBM/blob/9e1c79e01fde7f78e87b125563c7e6730068e24d/ui/obminclude/of/of_indexingService.inc
http://grokbase.com/t/lucene.apache.org/solr-user/2011/12/how-to-disable-auto-commit-and-auto-optimize-operation-after-addition-of-few-documents-through-dataimport-handler/16q7rwo6crvlzr5aoo3ic2bgd2ni
http://support.sms-fed.com/tracker/browse/TDI-134
http://web.archiveorange.com/archive/v/AAfXf4khqdVNtnjqzodS
http://vufind.org/wiki/performance#index_optimization
http://netbeans.org/bugzilla/show_bug.cgi?id=205899
https://github.com/tonytw1/wellynews/blob/759960b7e7df6b77c9fa3791efb7da67dd27783e/src/java/nz/co/searchwellington/repositories/solr/SolrQueryService.java
http://stackoverflow.com/questions/2787591/solr-autocommit-and-autooptimize
http://opensource.timetric.com/sunburnt/indexmanagement.html
http://drupal.org/node/292662
http://blog.aisleten.com/2008/01/26/optimizing-solr-and-rails-index-in-the-background/
http://www.searchworkings.org/forum/-/message_boards/view_message/412894#_19_message_412894
http://code.google.com/p/kiwi/source/browse/lmf-search/src/main/java/at/newmedialab/lmf/search/services/indexing/SolrIndexingServiceImpl.java?r=fbbeec96b5ad3d31364755a88218860405393cac

Robert Muir
added a comment - 19/Feb/12 23:47 I think the majority of users don't know what this command really does... we should rename it.
optimize just begs for people to use it. If this is really controversial, let's call
a committer vote on dev@ and see what everyone thinks.

Yonik Seeley
added a comment - 19/Feb/12 23:51 Creative use of Google, but it doesn't always add up. Just looking randomly at a couple:
The vufind reference oddly states that you should optimize after updating, but it also states:
Note: Optimizing the index can take a lot of server resources, so you should schedule your index updates and optimizations for non-peak times when possible.
So you can see they have that very infrequent update model in mind, and they seem well aware of the cost of an optimize.
The Stack Overflow link is someone asking how to automate commit and optimize, and how often he should optimize.
And the archiveorange link mentions a guy optimizing, but it's certainly not clear at all that he shouldn't be... we don't know his requirements.
Solr is at 400 downloads a day via the website (twice that many visit the download page... but the actual link is hard to see!). Yes, I'll stand by "minority".

Yonik Seeley
added a comment - 19/Feb/12 23:56 I think the majority of users don't know what this command really does... we should rename it.
I doubt it. And how did they find the command in the first place?
The answer is documentation - wherever they learn about the command, let them know what it does.
Let's not be a nanny state.

Robert Muir
added a comment - 20/Feb/12 00:02
twice that many visit the download page... but the actual link is hard to see
We need a huge download button.
Let's not be a nanny state.
I don't think of it as a nanny state, it's us fixing a mistake.
The mistake was this method has a poor name.

Uwe Schindler
added a comment - 20/Feb/12 00:09 I doubt it. And how did they find the command in the first place?
By copying one of those "shiny" code examples I posted!
Just to come back to the programmers that should have read the documentation, but in fact did not. The best example from the above list is http://drupal.org/node/292662 . Drupal is one of the most widely used CMS systems (I just mention that your company also uses it for their home page) and it's installed on thousands of servers. And this tool also contains a full-text search engine (maybe your company is not using that one), but this one called commit and optimize after every update (until they fixed it). Isn't that funny. In fact Drupal users are a huuuuuuuuuuuuuuuuuuuuuuuuuuuuge majority that don't know what their system is doing under the hood and largely depend on the fact that PHP programmers like the Drupal ones don't call optimize just because it's called optimize.

Yonik Seeley
added a comment - 20/Feb/12 00:20 A slight improvement in name does not come anywhere near compensating for the pain of having countless external systems and users having to change their code for no gain in functionality.

Uwe Schindler
added a comment - 20/Feb/12 00:29 A slight improvement in name does not come anywhere near compensating for the pain of having countless external systems and users having to change their code for no gain in functionality.
The compensation is that they are forced to again look at that code and then they think about removing the call altogether.

Robert Muir
added a comment - 20/Feb/12 01:27
A slight improvement in name does not come anywhere near compensating for the pain of having countless external systems and users having to change their code for no gain in functionality.
But I don't think things are fixed in stone: this is an open source project and it would be bad if things
never changed. We aren't putting a gun to their head forcing them to upgrade either, so I don't understand
the pain compensation... but it won't hold a candle to the pain all these unnecessary optimizes must be
causing users hard disk drives.

Jan Høydahl
added a comment - 20/Feb/12 13:01 I think 4.0 is a good train in which to do this rename, when people will anyway take a thorough new look at all the changes, and most will hopefully discover that they do not need forceMerge even if they used optimize before. And I agree, in 4.x, "optimize" should not be a silent NOOP, but instead yell loudly in the logs.
Perhaps an official migration guide on the CMS would be helpful too when 4.0 hits the road. Such a guide would be more in-depth than the upgrading notes in CHANGES. We could have a paragraph about optimize/forceMerge, and another paragraph about softCommit/commitWithin as preferred to explicit commit, which is also a huge mistake many people make: they over-commit!

Yonik Seeley
added a comment - 20/Feb/12 14:01 The compensation is that they are forced to again look at that code and then they think about removing the call altogether.
The proposal simply breaks existing systems (on purpose) on upgrade with no offsetting gain in functionality, just because we believe some people have made the wrong tradeoff in their app. This is not the right solution.
We see people making what we believe to be the wrong tradeoffs all the time in Solr. One example is optimizing for query performance by pumping up cache sizes to insane levels, pumping up the heap to compensate, and then being plagued with long GC times. The answer is not to second guess everyone and break existing configurations. People will continue to make mistakes like this, and even if optimize was changed to forceMerge, you can be assured that some people will still make the wrong trade-off in the future using the new name.
I've thought about this for a while now... please consider this my formal veto to this change.

Yonik Seeley
added a comment - 20/Feb/12 14:05 and another paragraph about softCommit/commitWithin as preferred to explicit commit, which is also a huge mistake many people make: they over-commit!
This is a much bigger real problem (because people had no soft commit and hence hard commit was the only option). We should probably open up a new issue for this one.
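For context, the two styles being contrasted look roughly like this as Solr XML update messages (a minimal sketch; the document IDs and the 10-second window are made-up values, but commitWithin is a real update attribute):

```xml
<!-- Over-committing: an explicit hard commit after every small batch -->
<add>
  <doc><field name="id">1</field></doc>
</add>
<commit/>

<!-- Preferred: ask Solr to make the update visible within 10 seconds,
     letting it batch commits itself -->
<add commitWithin="10000">
  <doc><field name="id">2</field></doc>
</add>
```

In Solr 4.x, `<commit softCommit="true"/>` additionally makes changes searchable without the cost of flushing new segments to stable storage.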

Jan Høydahl
added a comment - 20/Feb/12 14:56 @Yonik, How would you feel about this approach instead:
Add the new forceMerge feature, but instead of true/false, it takes N as the number of segments, e.g. &forceMerge=2. This adds value to Solr's API
Keep the old &optimize=true API (equivalent to forceMerge=1), but let users control in solrconfig.xml how an old optimize is interpreted. The option could look like (don't mind the naming for now):
<mainIndex>
<oldOptimizeIsInterpretedAs> noop|noopWithLogWarning|commit|softCommit|forceMerge=N </oldOptimizeIsInterpretedAs>
</mainIndex>
Default could be "noopWithLogWarning": nothing would happen on an attempted optimize except logging a warning pointing people to some documentation. This will give people three choices: A) Stop using optimize if they don't need it. Problem solved. B) If they wind up really needing it, start using forceMerge=N instead. Problem solved. Or C) Change the config param to whatever suits their situation best, e.g. "forceMerge=1" would mimic the old behaviour, "commit" would cause a commit to happen on optimize, and "noop" would do a no-op but without the log warnings. This would be for people who cannot or won't change their own code.
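A concrete sketch of the proposal (hypothetical: this option is only a suggestion in the comment above and never shipped in Solr; the element and value names are Jan's placeholders):

```xml
<!-- solrconfig.xml fragment illustrating the proposal (not a real Solr option) -->
<mainIndex>
  <!-- A legacy <optimize/> from an old client would do nothing except
       log a warning pointing at the migration documentation -->
  <oldOptimizeIsInterpretedAs>noopWithLogWarning</oldOptimizeIsInterpretedAs>
</mainIndex>
```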

Jan Høydahl
added a comment - 20/Feb/12 14:59 This is a much bigger real problem (because people had no soft commit and hence hard commit was the only option). We should probably open up a new issue for this one.
SOLR-3146

Yonik Seeley
added a comment - 20/Feb/12 15:02 Add the new forceMerge feature, but instead of true/false, it takes N as the number of segments, e.g. &forceMerge=2. This adds value to Solr's API
But we already have this functionality as a maxSegments parameter to optimize.
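For reference, the existing form Yonik is pointing at - an optimize that merges down to at most a given segment count - can be expressed in the XML update message or as a request parameter (the host/port are the stock example values):

```xml
<!-- XML update message: merge down to at most 2 segments -->
<optimize maxSegments="2"/>

<!-- Equivalent request-parameter form:
     POST http://localhost:8983/solr/update?optimize=true&maxSegments=2 -->
```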

Mark Miller
added a comment - 20/Feb/12 15:19 I think when it comes to API breaks, trying to say we can't fix this one because we can't fix every old little thing doesn't jibe. The name is clearly not a good one, and the call will not be the right move for most people that upgrade to 4. Having to rethink that will be doing 99% of users a favor. Changing the name will be doing all future users a favor.
I think 4 should be about getting things right without clinging to old baggage. We are not talking about the update or request APIs here. We are talking about a very expensive, very poorly named, very-low-return API call that is certainly overused (and much of the overuse is not going to end up on Google).
Making those that upgrade rethink optimize seems like just what the doctor ordered - we can add it to the release announcement, the release notes, etc. Even though I know exactly what this does, even though I know the price/benefits - I still want to call this thing at least once a week. It's a terrible name at this point. Why are we stuck with terrible?

Hoss Man
added a comment - 28/Feb/12 01:07 I don't have the energy to really get in depth with all of the discussion that's taken place so far, i'll try to keep my comments brief:
0) I'm a fan of the patch currently attached.
1) I largely agree with most of Yonik's points – this is a documentation problem first and foremost. Saying that all people who optimize are wrong is ridiculous, and breaking something that has use and value for a set of people just because some other set of people are using it foolishly seems really absurd.
2) Changing the "optimize" command to be a no-op with a warning logged, or a failure, where the documented "fix" to regain old behavior for people who genuinely need it is to search & replace the string "optimize" with some new string "forceMerge" seems utterly absurd to me. This is not the first time we've had a param name that people later regretted giving the name that we did – are we going to change all of them for 4.0? Unlike a method renamed in Java code where it's easy to see how the change affects you because of compilation failures, this kind of HTTP param change is a serious pain in the ass for people with client apps written using multiple languages/libraries ... naming consistency for existing users seems far more important than having perfect names.
3) Even if the goal is to force people to evaluate whether they really want to merge down to one segment, we have to consider how hard we make things for people when the answer is "yes". If someone is using a client library/app to talk to Solr it may not be easy/simple/possible for them to replace "optimize" with "forceMerge" or something like it w/o mucking in the internals of that library – there's no reason to piss off users like that.
4) any discussion about renaming/removing "optimize" in the Solr HTTP APIs should really consider how that will impact a few other user visible things...
<listener event="postOptimize" /> hooks in solrconfig and the corresponding SolrEventListener.postOptimize method
SolrDeletionPolicy has options related to how many optimized indexes to keep
spellchecker has options relating to building on optimize (although if i remember correctly there is a bug about this being broken so it can probably die no problem)
5) Assuming that too many people optimize when they shouldn't, either out of ignorance or because their tools do it out of ignorance, and we want to help minimize that moving forward; and given my opinion that renaming "optimize" will only hurt people w/o actually helping the root problem – here's my straw man proposal to try and improve the situation (similar to what Jan suggested but taking into account that we already support a "maxSegments" option when doing optimize commands) ...
commit the attached patch as is (it's just plain a good idea, regardless of anything else we might do)
change CommitUpdateCommand.maxOptimizeSegments so it defaults to "-1" and document that when the value is less than 0 it means the UpdateHandler configuration determines the value.
add a new <defaultOptimizeSegments/> config option to <updateHandler/> - make the UpdateHandler use that value anytime CommitUpdateCommand.maxOptimizeSegments is less than 0, and for backcompat have it default to "1" if not specified.
update the example configs to include <defaultOptimizeSegments>9999999</defaultOptimizeSegments> with a comment warning against the evils of over-optimization
change the code in Solr which deals with <optimize ... /> formatted instructions so that any SolrParams in the request with names the same as XML attributes override the attributes – ie: POST /update?maxSegments=4 with data: <optimize maxSegments="9" /> should result in a CommitUpdateCommand with maxOptimizeSegments=4
The end result being:
new users who start with new configs have an UpdateHandler that is going to effectively ignore "optimize" commands that don't specify a "maxSegments"
nothing breaks for existing users
existing users who only want to allow optimize commands when "maxSegments" is specified can cut/paste that one-line <defaultOptimizeSegments/> config
new and existing users who want Solr to ignore all optimize commands, even when they do have a "maxSegments", can configure an invariant maxSegments=9999999 param on the affected request handlers
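Putting the straw man together, the pieces might look like this (hypothetical: `<defaultOptimizeSegments/>` exists only in this proposal and was never added to Solr; `<updateHandler/>` and the `maxSegments` attribute are real):

```xml
<!-- solrconfig.xml fragment illustrating the proposal (not a real option) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Effectively ignore bare <optimize/> commands: "merging" down to
       9999999 segments is a no-op for any realistic index -->
  <defaultOptimizeSegments>9999999</defaultOptimizeSegments>
</updateHandler>

<!-- Proposed precedence: a request parameter beats the XML attribute.
     POST /update?maxSegments=4 with body <optimize maxSegments="9"/>
     would yield a CommitUpdateCommand with maxOptimizeSegments=4. -->
```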

Hoss Man
added a comment - 21/Mar/12 18:08 Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.
email notification suppressed to prevent mass-spam
pseudo-unique token identifying these issues: hoss20120321nofix36

Dotan Cohen
added a comment - 22/Oct/12 23:03 The problem with optimize is not the name. The problem is that the Solr admin panel suggests that we optimize often. In a Solr admin panel click the name of your index ("collection1" for instance) and what do you see? A big "Optimize Now" button alongside a graphical indicator that the index is not optimized.

Shawn Heisey
added a comment - 10/Jun/13 18:12 Before I read HossMan's proposals thoroughly, I had these thoughts:
—
I would support removing the optimize button from the GUI, or at least removing it from the Overview page. Keeping it on the CoreAdmin page would not be a bad thing, optionally with at least one confirmation dialog that reminds the user that optimization is not normally required.
Deprecating "optimize" from the GUI and the API in favor of forceMerge would not make me upset either, as long as it continued to work through all 4.x versions. Based on what happened with waitFlush and the PHP Solr API packages after the 4.0 release, this is a dangerous path, but if we kept optimize around until 6.0, perhaps it might be OK.
—
After reading the proposals, I think there might be a small amount of merit in my ideas, but his ideas are safer.