Details

Description

I am currently changing my code to be most compatible with 2.4. I switched on deprecation warnings and got a warning about the autoCommit parameter in IndexWriter constructors.

My code should use autoCommit=false, so I want to use the new semantics. The default of IndexWriter is still autoCommit=true. My problem now: How to disable autoCommit whithout deprecation warnings?

Maybe, the "old" constructors, that are deprecated should use autoCommit=true. But there are new constructors with this "IndexWriter.MaxFieldLength mfl" in it, that appear new in 2.4 but are deprecated:

IndexWriter(Directory d, boolean autoCommit, Analyzer a, boolean create, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl)
Deprecated. This will be removed in 3.0, when autoCommit will be hardwired to false. Use IndexWriter(Directory,Analyzer,boolean,IndexDeletionPolicy,MaxFieldLength) instead, and call commit() when needed.

What the hell is meant by this, a new constructor that is deprecated? And the hint is wrong. If I use the other constructor in the warning, I get autoCommit=true.

There is something completely wrong.

It should be clear, which constructors set autoCommit=true, which set it per default to false (perhaps new ones), and the Deprecated text is wrong, if autoCommit does not default to false.

Activity

Achieving the migration from autoCommit=true to autoCommit=false is somewhat tricky. As things stand now, all ctors that don't take autoCommit param still default autoCommit to true.

Maybe we could flip the autoCommit default to false, now, with the new ctors (the ones that take a MaxFieldLength). This may be better since you have to explicitly update your code, anyway, to switch to IndexWriter's new ctors and so if we call this change out in the javadocs, users are more likely to catch it. Whereas if we suddenly flip the default for autoCommit to false in 3.0, since there's no API signature change, users may not realize this had happened.

OK I like that approach better. Are there any objections?

But there are new constructors with this "IndexWriter.MaxFieldLength mfl" in it, that appear new in 2.4 but are deprecated

I agree: we should not introduce new deprecated ctors. I'll eliminate these. This happened because there were two separate changes (addition of MaxFieldLength, and, deprecation of autoCommit).

Michael McCandless
added a comment - 23/Sep/08 09:40
Achieving the migration from autoCommit=true to autoCommit=false is somewhat tricky. As things stand now, all ctors that don't take autoCommit param still default autoCommit to true.
Maybe we could flip the autoCommit default to false, now, with the new ctors (the ones that take a MaxFieldLength). This may be better since you have to explicitly update your code, anyway, to switch to IndexWriter's new ctors and so if we call this change out in the javadocs, users are more likely to catch it. Whereas if we suddenly flip the default for autoCommit to false in 3.0, since there's no API signature change, users may not realize this had happened.
OK I like that approach better. Are there any objections?
But there are new constructors with this "IndexWriter.MaxFieldLength mfl" in it, that appear new in 2.4 but are deprecated
I agree: we should not introduce new deprecated ctors. I'll eliminate these. This happened because there were two separate changes (addition of MaxFieldLength, and, deprecation of autoCommit).

Michael McCandless
added a comment - 23/Sep/08 22:02 Attached patch that removes the new deprecated ctors, and sets autoCommit=false for the new ctors (that take MaxFieldLength).
The bulk of the patch is fixing all places where we were calling the new deprecated ctors.

This patch seems to work, the IndexWriters created by the MaxFieldLength ctors are with autocommit=false, I have seen this, because the segment file does not change during indexing.

There is on small thing (was also there before your patch):
I use writer.setUseCompoundFile(true) to use compound files (which is also the default). It generates normally always only CFS files (on index creation, when optimizing,...). There is only one use case, when cfs and cfx files are generated:

Use IndexWriter with create=true

add documents to the index

optimize the index (without closing in between)

After that the optimized index contains of one cfs and one cfx. During indexing (before optimization), I always see only cfs files for new segments (and for short times as usual the contents tfx,...).

When optimizing the index after closing it or later after adding documents, i got only one cfs file.

Two questions:

Is this a small bug, which would be not release critical - but it is strange?

How can I enable creation of doc store (cfx) and cfs always, I found nothing in the docs. In my opinion the separate cfs/cfx files are good for search performance (right?).

Uwe Schindler
added a comment - 24/Sep/08 08:14 This patch seems to work, the IndexWriters created by the MaxFieldLength ctors are with autocommit=false, I have seen this, because the segment file does not change during indexing.
There is on small thing (was also there before your patch):
I use writer.setUseCompoundFile(true) to use compound files (which is also the default). It generates normally always only CFS files (on index creation, when optimizing,...). There is only one use case, when cfs and cfx files are generated:
Use IndexWriter with create=true
add documents to the index
optimize the index (without closing in between)
After that the optimized index contains of one cfs and one cfx. During indexing (before optimization), I always see only cfs files for new segments (and for short times as usual the contents tfx,...).
When optimizing the index after closing it or later after adding documents, i got only one cfs file.
Two questions:
Is this a small bug, which would be not release critical - but it is strange?
How can I enable creation of doc store (cfx) and cfs always, I found nothing in the docs. In my opinion the separate cfs/cfx files are good for search performance (right?).

That (cfx/cfs file creation) is actually "normal" behavior for
Lucene.

With autoCommit=false, in a single session of IndexWriter, Lucene
will share the doc store files (stored fields, term vectors) across
multiple segments. This saves alot of merge time because those files
don't need to be merged if we are merging segments that all share the
same doc store files. When building up a large index anew this saves
alot of time.

A cfx file is the compound-file format of the doc store files.

However, when segments spanning multiple doc stores are merged, then
the doc store files are in fact merged, and written privately for that
one segment, and then folded into that segment's cfs file. When all
such segments reference a given doc store segment are merged away,
then that doc store segment is deleted.

So it's currently only the "level 0" segments that may share a cfx
file. As a future optimization we could consider extending Lucene's
index format so that a single segment could reference multiple doc
stores. This would require logic in FieldsReader and
TermVectorsReader to do a binary search when locating which doc store
segment holds a given document, but, would enable merging non level 0
segments to skip having to merge the doc store. This is an invasive
optimization.

So you can't separately control when Lucene uses cfx file; it's the
merge policy that indirectly controls this.

Michael McCandless
added a comment - 24/Sep/08 10:02
That (cfx/cfs file creation) is actually "normal" behavior for
Lucene.
With autoCommit=false, in a single session of IndexWriter, Lucene
will share the doc store files (stored fields, term vectors) across
multiple segments. This saves alot of merge time because those files
don't need to be merged if we are merging segments that all share the
same doc store files. When building up a large index anew this saves
alot of time.
A cfx file is the compound-file format of the doc store files.
However, when segments spanning multiple doc stores are merged, then
the doc store files are in fact merged, and written privately for that
one segment, and then folded into that segment's cfs file. When all
such segments reference a given doc store segment are merged away,
then that doc store segment is deleted.
So it's currently only the "level 0" segments that may share a cfx
file. As a future optimization we could consider extending Lucene's
index format so that a single segment could reference multiple doc
stores. This would require logic in FieldsReader and
TermVectorsReader to do a binary search when locating which doc store
segment holds a given document, but, would enable merging non level 0
segments to skip having to merge the doc store. This is an invasive
optimization.
So you can't separately control when Lucene uses cfx file; it's the
merge policy that indirectly controls this.