Support for TokenFilters that may modify input documents

Description

In some scenarios it is useful to create or modify fields in the input document based on the analysis of other fields in that document. This need arises, for example, when indexing multilingual documents or when doing NLP processing such as named-entity recognition (NER). Currently this is not possible.

This issue provides an implementation of this functionality that consists of the following parts:

DocumentAlteringFilterFactory - abstract superclass indicating that TokenFilters created by this factory may modify fields in a SolrInputDocument.

TypeAsFieldFilterFactory - example implementation that illustrates this concept, with a JUnit test.
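
To make the idea concrete, here is a minimal, self-contained sketch of what a document-altering filter might do. It is not the code from the patch: `Token`, `Doc`, and `typeAsField` are hypothetical stand-ins written in plain Java (no Solr or Lucene classes), and the behavior is an assumption inferred from the name TypeAsFieldFilterFactory, namely that each token with a non-default type is copied into a document field named after that type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these are NOT Solr or Lucene classes.
class TypeAsFieldSketch {

    // Mirrors Lucene's default token type; tokens of this type are not copied.
    static final String DEFAULT_TYPE = "word";

    /** A token: its text plus a type label (e.g. assigned by an NER stage). */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** Simplified stand-in for SolrInputDocument: field name -> values. */
    static final class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        void addField(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /**
     * Passes every token through unchanged. As a side effect, each token
     * whose type is not the default is added to the document under the
     * field name prefix + type (assumed behavior, see the lead-in).
     */
    static List<Token> typeAsField(List<Token> tokens, Doc doc, String prefix) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (!DEFAULT_TYPE.equals(t.type)) {
                doc.addField(prefix + t.type, t.text);
            }
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Doc doc = new Doc();
        List<Token> tokens = List.of(
                new Token("Marie", "person"),
                new Token("visited", DEFAULT_TYPE),
                new Token("Paris", "location"));
        typeAsField(tokens, doc, "type_");
        System.out.println(doc.fields);
    }
}
```

The point of the sketch is that the token stream is untouched while the document gains new fields, so this kind of behavior could be switched on through the analysis chain configuration rather than through custom indexing code.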

Activity

Otis Gospodnetic
added a comment - 05/Nov/09 20:15 Is this better than writing a custom UpdateRequestProcessor that takes the value of the incoming SolrInputDocument (SID), does something to it, removes the original field, and adds the modified version back to SID?

Andrzej Bialecki
added a comment - 06/Nov/09 00:51 My opinion may be biased, but I'll try to be as objective as I can. I think it's better, because it gives you much more flexibility in building analysis and indexing chains without coding. If we went with an UpdateRequestProcessor, you would have to implement a new one whenever your analysis chain changes. With the approach in this patch it's just a matter of configuration, not of implementing as many custom update processors as there are possible combinations.

Mike Perham
added a comment - 11/Feb/10 14:19 This would be hugely useful for us in implementing a profanity detector. We'd like to scan the 'content' field for profane tokens and mark a boolean 'safe' field with the results.
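
The profanity check described above can be sketched as follows. This is an illustration only, in plain Java with no Solr classes: the document is modeled as a `Map`, the tokenization (lowercasing and splitting on non-alphanumeric characters) is a crude stand-in for a real analysis chain, and `BLOCKED`, `isSafe`, and `markSafe` are hypothetical names with a placeholder word list.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the Map stands in for a SolrInputDocument.
class ProfanityFlagSketch {

    // Placeholder dictionary; a real deployment would load its own list.
    static final Set<String> BLOCKED = Set.of("darn", "heck");

    /** Returns false if any normalized token of the content is blocked. */
    static boolean isSafe(String content) {
        // Crude stand-in for analysis: lowercase, split on non-alphanumerics.
        String[] tokens = content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (BLOCKED.contains(token)) {
                return false;
            }
        }
        return true;
    }

    /** Sets the boolean 'safe' field from the 'content' field. */
    static void markSafe(Map<String, Object> doc) {
        Object content = doc.get("content");
        doc.put("safe", content == null || isSafe(content.toString()));
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("content", "Well, DARN it!");
        markSafe(doc);
        System.out.println(doc.get("safe"));
    }
}
```

The sketch also shows why the match has to run on analyzed tokens: the raw text contains "DARN", and it only matches the dictionary entry after normalization.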

Jan Høydahl
added a comment - 11/Feb/10 15:25 To my mind, document-level modifications belong in UpdateRequestProcessors. You always have SOLR-1725 to script those quickly, and configuring a chain is easily done in XML ( http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section ).
The trouble comes when you need to act on an analyzed version of a field, say, to match terms against a normalized dictionary. To allow this, could we allow analysis to run anywhere in the update chain? That way we could put UpdateRequestProcessors after analysis as well:
<updateRequestProcessorChain name="test">
  <processor class="org.apache.solr.update.processor.MyPreProcessorFactory" />
  <analysis />
  <processor class="org.apache.solr.update.processor.MyPostProcessorFactory" />
</updateRequestProcessorChain>
The <analysis/> element would be optional; by default, analysis would run at the end of the chain, as it does today. I have no idea how easy such a change would be with the current architecture.

Mike Perham
added a comment - 11/Feb/10 15:35 Another developer just mentioned that I might be able to use TFVs to implement the profanity detector. We've got termVectors="true" on the content field since we are also using MoreLikeThis. If I can get access to the field's TFV in the URP, I can just run through the profanities, checking for each one in the TFV... I'm not sure if this is possible - need to check the javadocs.

Andrzej Bialecki
added a comment - 19/Feb/10 09:06 Term frequency vectors are not available at this stage, unless you go to the expense of creating a MemoryIndex. I think the solution I proposed is less costly and more generic.

Kai Gülzau
added a comment - 08/Feb/13 16:17 Is there a follow-up ticket for Jan Høydahl's idea of placing the analysis phase in the middle of the updateRequestProcessorChain?
This would solve my problem: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201302.mbox/%3CB65DA877C3F93B4FB39EA49A1A03C95CC30173%40email.novomind.com%3E