Chuck Williams wrote:
> It appears that termIndexInterval is factored into the stored index and
> thus cannot be changed dynamically to work around the problem after an
> index has become polluted. Other than identifying the documents
> containing binary data, deleting them, and then optimizing the whole
> index, has anybody found a better way to recover from this problem?
Hadoop's MapFile is similar to Lucene's term index, and supports a
feature where only a subset of the index entries is loaded (controlled
by io.map.index.skip). It would not be difficult to add such a feature
to Lucene by changing TermInfosReader#ensureIndexIsRead().
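For illustration only (this is not the actual patch, and the class name
SampledTermIndex is hypothetical), here is a minimal self-contained sketch of
the MapFile-style sampling idea: keep only every (skip+1)-th entry of a sorted
term index in memory, then locate a term by binary-searching the sample and
scanning forward from the sampled position.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of MapFile-style index sampling (hypothetical class, not
 * Lucene's TermInfosReader): only every (skip+1)-th entry of a sorted
 * term index is held in memory, cutting index RAM by roughly that factor.
 */
public class SampledTermIndex {
    private final List<String> sampled = new ArrayList<>();
    private final String[] allTerms;   // stands in for the on-disk index
    private final int interval;        // skip + 1

    public SampledTermIndex(String[] sortedTerms, int skip) {
        this.allTerms = sortedTerms;
        this.interval = skip + 1;
        // Load only a subset of entries, as io.map.index.skip does.
        for (int i = 0; i < sortedTerms.length; i += interval) {
            sampled.add(sortedTerms[i]);
        }
    }

    /** Returns the position of term in the full index, or -1 if absent. */
    public int find(String term) {
        // Binary search the in-memory sample for the last entry <= term.
        int lo = 0, hi = sampled.size() - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sampled.get(mid).compareTo(term) <= 0) {
                start = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Scan forward from the sampled position until we pass the term.
        for (int i = start * interval;
             i < allTerms.length && allTerms[i].compareTo(term) <= 0;
             i++) {
            if (allTerms[i].equals(term)) return i;
        }
        return -1;
    }
}
```

With a large skip (e.g. 127), memory for the term index shrinks by about two
orders of magnitude at the cost of a slightly longer forward scan per lookup,
which is the trade-off the MapFile setting exposes.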
Here's a (totally untested) patch.
Doug