We can make it protected; that way it's expert level, and a user
needs to inherit from IndexWriter to use it. I don't think it's
possible today to simply inherit from IW to get the merge
information, because IW.merge is final, and there needs to be a
way to know the merge was successful.

Jason Rutherglen
added a comment - 10/Jun/09 23:18

The problem is you need more information than simply "these segments got merged" to actually do something interesting with your caches.

Okay, now I've thought a bit. What we need is a notification on which segments remained, which are new and which got toasted, plus docid ranges for them. Their ancestry is irrelevant, because you're right, to exploit it we also need deleted docs, and then replicate some of the merging logic and it gets really messy from here. Dropping parts of the cache related to dead segments, rebasing survivors and doing a fair-and-square load/uninversion/whatever for new ones is enough.
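
To make the three steps in that paragraph concrete (drop dead segments, rebase survivors, load new ones), here is a minimal Java sketch. Every type and method in it (Segment, Entry, Loader, onSegmentsChanged) is a hypothetical stand-in and not a Lucene class, and it assumes a surviving segment's contents are unchanged, so newly deleted docs are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a segment-aware cache; none of these types are Lucene classes. */
final class SegmentCacheSketch {

    /** A segment name plus the docid range it covers in the top-level reader. */
    static final class Segment {
        final String name;
        final int docBase; // first top-level docid of the segment
        final int maxDoc;  // number of docids in the segment
        Segment(String name, int docBase, int maxDoc) {
            this.name = name;
            this.docBase = docBase;
            this.maxDoc = maxDoc;
        }
    }

    /** Cached values for one segment, indexed by segment-local docid. */
    static final class Entry {
        int docBase;
        final int[] values;
        Entry(int docBase, int[] values) {
            this.docBase = docBase;
            this.values = values;
        }
    }

    /** Stand-in for the expensive load/uninversion step. */
    interface Loader {
        int[] load(Segment segment);
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();
    private final Loader loader;

    SegmentCacheSketch(Loader loader) {
        this.loader = loader;
    }

    /** Applies the list of live segments; returns the names that had to be loaded from scratch. */
    List<String> onSegmentsChanged(List<Segment> current) {
        Map<String, Entry> next = new LinkedHashMap<>();
        List<String> loaded = new ArrayList<>();
        for (Segment s : current) {
            Entry e = entries.get(s.name);
            if (e != null) {
                e.docBase = s.docBase; // survivor: keep the values, rebase the docid range
            } else {
                e = new Entry(s.docBase, loader.load(s)); // new segment: full load
                loaded.add(s.name);
            }
            next.put(s.name, e);
        }
        entries.clear(); // anything not carried over belonged to a dead segment
        entries.putAll(next);
        return loaded;
    }

    /** Looks up a cached value by top-level docid. */
    int get(int docId) {
        for (Entry e : entries.values()) {
            if (docId >= e.docBase && docId < e.docBase + e.values.length) {
                return e.values[docId - e.docBase];
            }
        }
        throw new IllegalArgumentException("no segment covers docid " + docId);
    }

    int segmentCount() {
        return entries.size();
    }
}
```

In this sketch the notification is just the list of live segments with their docid ranges; the cache works out for itself which ones are new and which are gone, so no ancestry information is needed.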

Can you explain what's missing in Lucene's FieldCache?

It's not that easy to say. Our version was initially used only for sorting, but without concurrency issues and with async warmup. But then we used it to load docs (way better than storing fields and using IndexReader.document), tied up with our strongly-typed-fields code, added handling for multi-valued fields, used it for faceted searches.
So now it is essentially just something different from Lucene field cache.

Earwin Burrfoot
added a comment - 07/Apr/09 16:50

This is required in one form or another for any kinds of segment-aware caches.

The problem is you need more information than simply "these segments got merged" to actually do something interesting with your caches.
EG you'd need to know which deleted docs got zapped, right?

We're currently using our own field cache (I doubt we'll ever switch back to lucene's native, fixed one or not)

Can you explain what's missing in Lucene's FieldCache? (Since we are going to build a new one for LUCENE-831 it'd be great to address all known limitations...).

Michael McCandless
added a comment - 07/Apr/09 15:37

bq. I'd like to step back and understand the wider use case / context that's driving this need (to know precisely when segments got merged)
This is required in one form or another for any kind of segment-aware cache.
We're currently using our own field cache (I doubt we'll ever switch back to lucene's native, fixed one or not) and filter cache. Both caches are warmed up on reopen, asynchronously from search requests and both would warm up considerably faster if we have data on how segments have changed.

Earwin Burrfoot
added a comment - 07/Apr/09 13:58

I think it's good to take a step back; "if we fix Lucene's field
cache, and Lucene's near real-time search manages CSF's
efficiently in memory" does fix the use case. Relying on CSF coming
in probably won't help if it doesn't make it into
the 2.9 release. I like the callback method because it does not
rely on passing segment infos around and instead uses the
already public IndexReader classes.
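
As a rough illustration of the callback approach, the sketch below mirrors the shape given in the patch notes on this issue: a setSegmentMergerCallback setter, and a mergedSegments method that receives the readers being merged plus the newly merged reader. All types here (NamedReader, SegmentMergerCallback, Writer) are simplified stand-ins written for this example, not the patch's code or Lucene's API; getSegmentName in particular is only the accessor proposed in the patch notes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the callback plumbing; none of these types are Lucene classes. */
final class MergeCallbackSketch {

    /** Stand-in for a reader that exposes its segment name. */
    interface NamedReader {
        String getSegmentName();
    }

    /** Assumed callback shape: source readers in, newly merged reader out. */
    interface SegmentMergerCallback {
        void mergedSegments(List<NamedReader> merged, NamedReader result);
    }

    /** Stand-in writer that only models when the callback fires. */
    static final class Writer {
        private SegmentMergerCallback callback;

        void setSegmentMergerCallback(SegmentMergerCallback callback) {
            this.callback = callback;
        }

        /** Pretends to merge the named segments and notifies the callback once the merge is done. */
        NamedReader merge(String newName, String... sourceNames) {
            List<NamedReader> sources = new ArrayList<>();
            for (String name : sourceNames) {
                sources.add(reader(name));
            }
            NamedReader result = reader(newName);
            if (callback != null) {
                callback.mergedSegments(sources, result);
            }
            return result;
        }
    }

    static NamedReader reader(String name) {
        return () -> name;
    }
}
```

The listener never sees segment infos, only reader objects, which is the property the callback is being preferred for here.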

Jason Rutherglen
added a comment - 06/Apr/09 18:16

I'd like to step back and understand the wider use case / context that's driving this need (to know precisely when segments got merged). EG if we fix Lucene's field cache, and Lucene's near real-time search manages CSF's efficiently in memory, does that address the use case behind this?

It's possible that we should simply make SegmentInfo(s) public, so that MergePolicy/Scheduler can be fully created external to Lucene, and track all specifics of why/when merges are happening. But those APIs have a high surface area, and we do make changes over time.

Michael McCandless
added a comment - 04/Apr/09 14:53

I would like to move away from our current position of somewhat
closed APIs that require user classes to be part of the Lucene
packages.

It's always best to reuse existing APIs; however, we've migrated
to OSGi, which means any time we need to place new classes in
Lucene packages, we need to roll out specific JARs (I think,
perhaps it's more complex) for the few classes outside of our
main package classes. This makes deployment of search
applications a bit more difficult and time-consuming.

Jason Rutherglen
added a comment - 03/Apr/09 21:13
A related thread regarding MergePolicy is at:
http://markmail.org/thread/h5bxjflpcyejrcqg

I think this can be achieved, today, by making your own MergeScheduler wrapper, or by subclassing ConcurrentMergeScheduler and eg overriding the doMerge method? If so, I'd prefer not to add a callback to IW.

Michael McCandless
added a comment - 03/Apr/09 11:38

Patch is combined with LUCENE-1516.

IndexWriter has a setSegmentMergerCallback method. The callback is
invoked in IW.mergeMiddle, where the readers being merged and the newly
merged reader are passed to the SMC.mergedSegments method.

I think we need to expose the SegmentReader segment name somehow,
either via IndexReader.getSegmentName or an interface on top of
SegmentReader?

Jason Rutherglen
added a comment - 03/Apr/09 02:12