Description

With current trunk, whenever you open an IndexReader on a directory you get back a DirectoryReader, which is a composite reader. The IndexReader interface now has lots of methods that simply throw UOE (in fact, more than 50% of the commonly used methods are unusable now). This confuses users and makes the API hard to understand.

This issue should split "atomic readers" from "reader collections" with a separate API. After that, you are no longer able to get a TermsEnum from those composite readers without wrapping. We currently have helper classes for wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*); those should be retrofitted to implement the correct classes (SlowMultiReaderWrapper would be an atomic reader but take a composite reader as ctor param; maybe it could also simply take a List<AtomicReader>). In my opinion, composite readers could implement some collection APIs and also have the ReaderUtil methods directly built in (possibly as a "view" in the util.Collection sense). In general, composite readers do not really need to look like the previous IndexReaders; they could simply be a "collection" of SegmentReaders with some functionality like reopen.

On the other side, do atomic readers need reopen logic anymore? When a segment changes, you need a new atomic reader? Maybe because of deletions that's not the best idea, but we should investigate. Maybe make the whole reopen logic simpler to use (at least on the collection reader level).

We should decide about good names; I have no preference at the moment.
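The proposed split can be sketched in miniature (a hypothetical model only: the class names were still being debated at this point, and the real readers carry far more methods than this):

```java
import java.util.List;

// Shared base: only what every reader, atomic or composite, can implement.
abstract class IndexReader {
    private boolean closed = false;
    protected final void ensureOpen() {
        if (closed) throw new IllegalStateException("this reader is closed");
    }
    public void close() { closed = true; }
    public abstract int maxDoc();
}

// Atomic readers expose the per-segment APIs (terms, postings, norms, ...).
abstract class AtomicReader extends IndexReader {
    // Stand-in for the real Fields/TermsEnum machinery.
    public abstract List<String> terms(String field);
}

// Composite readers are little more than a collection of sub-readers,
// plus collection-level functionality like reopen.
abstract class CompositeReader extends IndexReader {
    public abstract List<? extends IndexReader> getSequentialSubReaders();

    @Override
    public int maxDoc() {
        int sum = 0;
        for (IndexReader r : getSequentialSubReaders()) sum += r.maxDoc();
        return sum;
    }
}
```

With this shape, code that needs a TermsEnum must statically hold an AtomicReader; a CompositeReader simply has no such method to call.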

Michael McCandless
added a comment - 11/Jan/11 18:00 +1
We very much need this now – it should be strongly (statically) typed, whether you are using an AtomicIndexReader vs CompositeIndexReader.
And I agree CompositeIndexReader should feel more like a Collection than what IR is today.

Robert Muir
added a comment - 11/Jan/11 18:02 (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*)
My problem with Multi* is that this makes things sound cool and powerful.
If they are really slow transition mechanisms, I prefer Slow* as it will actually discourage their use.
But I agree with this issue in general and it would be great for it to be cleaned up.

Earwin Burrfoot
added a comment - 11/Jan/11 22:30 On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader?
There is a freakload of places that "upgrade" SegmentReader in various ways, with deletions guilty for only part of the cases. I'll try getting back to LUCENE-2355 at the end of the week.

Uwe Schindler
added a comment - 15/Jan/11 10:26 - edited Any comments about removing write access from IndexReaders? I think setNorms() will be removed soon, but how about the others like deleteDocument()? I would propose to also make all IndexReaders simply readers not writers?

Robert Muir
added a comment - 15/Jan/11 14:30 I think setNorms() will be removed soon
Why do you think this?
On the norms cleanup issue, I only removed setNorm(float), because it's completely useless.
All it did was call Similarity.getDefault().encode(float) + setNorm(byte).

Robert Muir
added a comment - 15/Jan/11 15:18 Ah, ok, sorry, I was confused. Still, I think we would need this method (somewhere) even
with CSF, so that people can change the norms and they instantly take effect for searches.

Earwin Burrfoot
added a comment - 15/Jan/11 16:34 Any comments about removing write access from IndexReaders? I think setNorms() will be removed soon, but how about the others like deleteDocument()? I would propose to also make all IndexReaders simply readers not writers?
Voting with all my extremities - yes!!

Earwin Burrfoot
added a comment - 15/Jan/11 16:49 Still, i think we would need this method (somewhere) even with CSF, so that people can change the norms and they instantly take effect for searches.
This still puzzles me. I can strain my imagination, and get people who just need to change norms without reindexing.
But doing this and requiring instant turnaround? Kid me not

Michael McCandless
added a comment - 15/Jan/11 16:57 I don't think we should remove setNorm/deleteDocuments, even from the composite reader class.
Deleting docs from IR has advantages over deleting from IW: the change is "live" to any searches running on that IR; you get an immediate count of how many docs were deleted; you can delete by docID.
setNorm is also useful in that it can be used to boost docs (globally), live, if that reader is being used for searching. When/if we cut over norms -> doc values we'll have to decide what to do about setNorm...
At a higher level, for this "strong typing of atomic vs composite IRs", we shouldn't try to change functionality – let's just do a rote refactoring, such that methods that now throw UOE on IR are moved to the atomic reader only.
Separately we can think about whether existing functions should be dropped...

Marvin Humphrey
added a comment - 15/Jan/11 19:07 > Deleting docs from IR has advantages over deleting from IW: the change is
> "live" to any searches running on that IR; you get an immediate count of how
> many docs were deleted; you can delete by docID.
Alternate plan:
Move responsibility for deletions to a pluggable DeletionsReader
subcomponent of SegmentReader.
Have the default DeletionsReader be read-only.
People who need the esoteric functionality described above can use a
subclass of DeletionsReader.
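Marvin's alternate plan might be modeled roughly like this (a toy sketch: DeletionsReader comes from the comment above, while the concrete class names and the BitSet representation are invented for illustration):

```java
import java.util.BitSet;

// Deletions live in a pluggable subcomponent of the segment reader.
abstract class DeletionsReader {
    public abstract boolean isDeleted(int docID);
}

// Default: a read-only view over a fixed set of deleted docs.
class ReadOnlyDeletionsReader extends DeletionsReader {
    private final BitSet deleted;
    ReadOnlyDeletionsReader(BitSet deleted) {
        this.deleted = (BitSet) deleted.clone(); // defensive copy: stays immutable
    }
    @Override public boolean isDeleted(int docID) { return deleted.get(docID); }
}

// Opt-in subclass for callers that really need live deletes by docID.
class MutableDeletionsReader extends DeletionsReader {
    private final BitSet deleted = new BitSet();
    public void delete(int docID) { deleted.set(docID); }
    @Override public boolean isDeleted(int docID) { return deleted.get(docID); }
}
```

The reader API itself stays read-only; only code that explicitly constructs the mutable subcomponent can delete.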

Earwin Burrfoot
added a comment - 15/Jan/11 20:52 APIs have to be there still. All that commity, segment-deletery, mutabley stuff (that spans both atomic and composite readers).
So, while your plan is viable, it won't remove that much cruft.

Uwe Schindler
added a comment - 14/Dec/11 22:35 - edited I try to do this now, Robert will help.
Most crap is removed from IndexReader, so this should be easy now. We should use AtomicIndexReader and CompositeIndexReader.
The IndexReader base class only contains the methods that can actually be implemented in every reader.
The current BaseMultiReader (pkg-private) would be the later CompositeIndexReader.
AtomicIndexReader is also abstract; SegmentReader and SlowMultiReaderWrapper would implement it.
The static open methods should maybe go in a separate class or the abstract IndexReader base with the limited API, but they would return CompositeIndexReader.
FilterIndexReader could be split in two classes, but a FilterMultiReader makes no sense; FilterIndexReader will be FilterAtomicIndexReader (also the superclass of SlowMultiReaderWrapper).
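The FilterAtomicIndexReader idea above boils down to a plain delegation wrapper; a minimal sketch with hypothetical, simplified signatures (the real class forwards many more methods):

```java
import java.util.List;

// Simplified stand-in for the atomic reader API.
abstract class AtomicReader {
    public abstract int maxDoc();
    public abstract List<String> terms(String field);
}

// Filter reader: forwards everything to the wrapped reader by default,
// so subclasses override only the behavior they want to change.
class FilterAtomicReader extends AtomicReader {
    protected final AtomicReader in;
    FilterAtomicReader(AtomicReader in) { this.in = in; }
    @Override public int maxDoc() { return in.maxDoc(); }
    @Override public List<String> terms(String field) { return in.terms(field); }
}
```

A FilterMultiReader makes no sense under this design precisely because a composite reader has almost no per-document methods left to filter.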

Uwe Schindler
added a comment - 17/Jan/12 22:30 I will hopefully start this weekend. I just have a schedule currently that has no slots left for that.
Robert and I wanted to start a branch soon, maybe I simply do that as first step soon!

Uwe Schindler
added a comment - 21/Jan/12 16:19 Simon: Just to inform you, I am working on this. Currently I have a heavily broken checkout that no longer compiles at all. Working, working, working... It's a mess!
Once I have something initially compiling for core (not tests), I will create a branch!

Uwe Schindler
added a comment - 21/Jan/12 23:50 - edited I created the branch at https://svn.apache.org/repos/asf/lucene/dev/branches/lucene2858 and committed my first steps:
Add CompositeIndexReader and AtomicIndexReader
Moved methods around, still not yet finished (see below)
DirectoryReader is public now and is returned by IR.open() and IW.getReader()
TODO:
IR.openIfChanged makes no sense for any reader other than DirectoryReader, let's move it also there
isCurrent and getVersion() are also useless for atomic readers and for composite readers except DR
The strange generics in ReaderContext caused by the final field will go away when changing the reader field to an accessor method returning the correct type (by return type overloading).
Comments welcome and also heavy committing.

Uwe Schindler
added a comment - 22/Jan/12 12:59 More TODOs:
Rename SlowMultiReaderWrapper -> SlowCompositeReaderWrapper (as it's not only for MultiReaders but also for other CompositeIndexReaders like DirectoryReader). The name is not in line with our current naming.
All version/commit/reopen stuff should go into DirectoryReader; it's useless for the abstract IR interfaces.

Uwe Schindler
added a comment - 29/Jan/12 14:17 I now fixed the branch's test-framework and all remaining TODOs about the API.
Now the horrible, stupid slave-work of porting all the tests starts. I assume the API is now fixed, as nobody complained after one week.

Yonik Seeley
added a comment - 29/Jan/12 15:38 SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*
+1, the Slow* is misleading as it makes it seem like there's a faster way you should be doing it.
CompositeReaderWrapper should be fine. And no, it doesn't sound too "cool" for the hypothetical developers who use that as a criteria when coding
Other possibilities include AtomicReaderEmulator, AtomicEmulatorReader, CompositeAsAtomicReader, etc

Robert Muir
added a comment - 29/Jan/12 22:58 Can we please do some eclipse-renames like:
AtomicIndexReader -> AtomicReader
AtomicIndexReader.AtomicReaderContext -> AtomicReader.Context
The verbosity of the api is killing me

Uwe Schindler
added a comment - 30/Jan/12 01:27 I renamed the enclosing classes and also removed the public ctors from the ReaderContexts (to prevent stupid things already reported on mailing lists).
The renaming of the ReaderContexts all to the same name Context, but with different enclosing classes, is a refactoring Eclipse cannot do (it creates invalid code). It seems only NetBeans can do this; I will try to find a solution. The problem is that Eclipse always tries to import the inner class, which causes conflicts.
Finally, e.g. the method getDocIdSet should look like getDocIdSet(AtomicReader.Context, ...) [only importing AtomicReader], but Eclipse always tries to use Context [and import oal.AtomicReader.Context]. In the end we should have abstract IndexReader.Context, AtomicReader.Context, CompositeReader.Context.
Will go to bed now.

Uwe Schindler
added a comment - 30/Jan/12 10:08 After sleeping one more night on it, the easiest and simplest way to remove the stupid verbosity in imports:
I will move the ReaderContexts out of their enclosing classes and make top-level classes without public ctors out of them. Then the import statements look nice. The RCs are now so important for search that they should stand on their own.
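The top-level context classes might look roughly like this (a simplified sketch; the field names and shapes here are assumptions for illustration, not the exact final API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Top-level context classes: importing them needs no enclosing class.
abstract class IndexReaderContext {
    public abstract List<AtomicReaderContext> leaves();
}

// Leaf context: knows where its segment starts in the composite doc space.
final class AtomicReaderContext extends IndexReaderContext {
    public final int docBase;
    AtomicReaderContext(int docBase) { this.docBase = docBase; }
    @Override public List<AtomicReaderContext> leaves() {
        return Collections.singletonList(this);
    }
}

// Composite context: its leaves are the concatenation of its children's.
final class CompositeReaderContext extends IndexReaderContext {
    private final List<? extends IndexReaderContext> children;
    CompositeReaderContext(List<? extends IndexReaderContext> children) {
        this.children = children;
    }
    @Override public List<AtomicReaderContext> leaves() {
        List<AtomicReaderContext> out = new ArrayList<>();
        for (IndexReaderContext c : children) out.addAll(c.leaves());
        return out;
    }
}
```

Search code can then take an AtomicReaderContext parameter directly, with a one-line import.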

Uwe Schindler
added a comment - 30/Jan/12 22:21 - edited Here the final patch. Thanks for the good work and lots of help from Robert and Mike.
As this patch contains heavy changes, we will commit it as soon as possible so work can go on. It would be nice, if you would not commit anything until this is done.
There are some minor issues open:
DirectoryReader is final and has only static factory methods. It is not possible to subclass it in any way. The problem is mainly Solr, as Solr accesses directory(), IndexCommits, ... and therefore can no longer work on the abstract IndexReader. This should be changed by, e.g., handling reopening in the IRFactory, also versions, commits, ... Currently it's not possible to implement any other IRFactory that returns something else.
The PayloadProcessorProvider has a broken API, this should be fixed. The current patch mimics the old behaviour, but not 100%.
ParallelReader is now atomic. We should add a sugar wrapper method to allow synchronized composite readers (with same segment sizes) to be aligned with MultiReaders or wrapped by Slow.
The remaining issues will be fixed in later issues!

Uwe Schindler
added a comment - 31/Jan/12 20:21 Hi Robert, this patch fixes the problem:
SlowMultiReaderWrapper creates the fields and liveDocs in its ctor (as it's expensive) and reuses them. This is no problem, but the test did something very bad: it closed not the slow reader but the wrapped DirectoryReader. As fields() was not delegated to the DR or MultiFields with the DR, Slow returned its own cached instance. Slow's ensureOpen was not throwing an exception, as it was still open.
Theoretically, Slow* should incRef the underlying IndexReader and decRef it on close(). But that would require that you close the SlowReader after wrapping.
The current solution is not optimal but makes it easy to wrap without explicitly closing the slow wrapper.
The fix was to call in.ensureOpen() in Slow before the cached instance is returned.
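The bug and its fix reduce to a small pattern (a toy model, not the real SlowMultiReaderWrapper):

```java
// Toy stand-in for the wrapped DirectoryReader.
class Wrapped {
    private boolean open = true;
    void close() { open = false; }
    void ensureOpen() {
        if (!open) throw new IllegalStateException("reader already closed");
    }
    String buildFields() { return "fields"; } // stand-in for an expensive build
}

class SlowWrapper {
    private final Wrapped in;
    private final String cachedFields; // built once in the ctor, reused after

    SlowWrapper(Wrapped in) {
        this.in = in;
        this.cachedFields = in.buildFields();
    }

    String fields() {
        // The fix: check the *wrapped* reader before returning the cache,
        // so closing the underlying reader is detected even though the
        // wrapper itself was never closed.
        in.ensureOpen();
        return cachedFields;
    }
}
```

Without the in.ensureOpen() call, fields() would keep serving the cached instance after the underlying reader was closed, which is exactly what the test tripped over.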

Michael McCandless
added a comment - 01/Jun/12 17:48 Not entirely sure why but it looks like the commits here slowed down our NRT reopen latency. If you look at the nightly bench graph: http://people.apache.org/~mikemccand/lucenebench/nrt.html and click + drag from Jan 2012 to today, annotation R shows we increased from ~46 msec NRT reopen latency to ~50 msec ... could just be hotspot being upset...

Michael McCandless
added a comment - 03/Jun/12 09:59 Not entirely sure why but it looks like the commits here slowed down our NRT reopen latency.
OK, sorry, I was wrong about this!
I went back and re-ran the NRTPerfTest and isolated the slowdown to LUCENE-3728 (also committed on the same day)... it was because we lost the SegmentInfo.sizeInBytes caching... but then in LUCENE-4055 we got it back and we got the performance back ... so all's well that ends well

Uwe Schindler
added a comment - 03/Jun/12 10:16 Thanks, Mike, for taking care. To me this looked crazy, because a refactoring like this should not change perf.
This explains the change back after 4055; I thought un-compressed oops were responsible for the speedup.

Michael McCandless
added a comment - 03/Jun/12 10:19 This explains the change back after 4055; I thought un-compressed oops were responsible for the speedup.
That's what I thought too (and I was very confused that uncompressed oops would improve things)! But LUCENE-4055 landed on the same day I turned off uncompressed oops... so I think all mysteries are now explained.
That's what I thought too (and I was very confused that uncompressed oops would improve things)! But LUCENE-4055 landed on the same day I turned off uncompressed oops... so I think all mysteries are now explained.