Quotes dependent StopWords removal

When the user performs a search, the analyzer should remove the "stopwords" only if the stop word is not present in quotes. If the stop word is present in quotes, I don't want the stop word to be removed by the analyzer.

For example:

"no dress code" - should not remove "no" as it's present in quotes.

shirts with trousers - should remove "with" as a stop word.

I have been trying to do this with Lucene, but have not found a straightforward way of doing it. I have been digging in the Lucene mail archives, but it seems there is no easy way to do this apart from extending / modifying the QueryParser. In some sense, it is similar to the issue discussed in:

http://www.gossamer-threads.com/lists/lucene/java-user/38946

Is there any way I can avoid subclassing QueryParser?

Re: Quotes dependent StopWords removal

This appears tricky to me. I may be completely wrong, but I would start
by looking at the Standard Analyzer. I would try to create a new token
that matches an opening quotation mark. I would then change the next()
method in StandardAnalyzer.jj to note when it recognizes an opening
quotation mark. Now you are in a quote. Somehow mark each token (there
might not be an obvious way to do this) until you find the closing
quotation mark. Now note that you are no longer in a quote. When not in
a quote, do not mark the tokens coming out of next(). Then, in the Stop
Filter, check each token for your marker and do not remove it if it is
marked.
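
To make the marker idea concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: QuoteAwareStopFilter, MarkedToken, and the stop word list are hypothetical names invented for illustration, and the whitespace tokenizer is only a stand-in for what the real grammar would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: plain Java, no Lucene classes involved.
final class QuoteAwareStopFilter {

    // Assumed stop word list for the example.
    static final Set<String> STOP_WORDS = Set.of("no", "with", "the", "a");

    // A token plus the marker: was it seen between quotation marks?
    static final class MarkedToken {
        final String text;
        final boolean inQuotes;

        MarkedToken(String text, boolean inQuotes) {
            this.text = text;
            this.inQuotes = inQuotes;
        }
    }

    // Splits on whitespace and toggles the in-quotes flag at each '"'.
    static List<MarkedToken> tokenize(String query) {
        List<MarkedToken> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : query.toCharArray()) {
            if (c == '"' || Character.isWhitespace(c)) {
                // Flush the pending token with the flag that was active for it.
                if (current.length() > 0) {
                    tokens.add(new MarkedToken(current.toString().toLowerCase(Locale.ROOT), inQuotes));
                    current.setLength(0);
                }
                if (c == '"') {
                    inQuotes = !inQuotes;
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(new MarkedToken(current.toString().toLowerCase(Locale.ROOT), inQuotes));
        }
        return tokens;
    }

    // Stop filter step: drop a stop word only when the token is not marked.
    static List<String> filter(String query) {
        List<String> kept = new ArrayList<>();
        for (MarkedToken t : tokenize(query)) {
            if (t.inQuotes || !STOP_WORDS.contains(t.text)) {
                kept.add(t.text);
            }
        }
        return kept;
    }
}
```

With this sketch, QuoteAwareStopFilter.filter("\"no dress code\"") keeps all three words, while QuoteAwareStopFilter.filter("shirts with trousers") drops "with".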

Re: Quotes dependent StopWords removal

My last answer was terrible. QueryParser will not send any quotation
marks into the analyzer. How about this:

Below are roughly lines 965-992 of QueryParser.java. Change the call
getFieldQuery(field, term.image.substring(1, term.image.length()-1), s)
(line 992) so that it calls an otherwise identical function that uses
an analyzer which does not remove stop words. The QUOTED case is reached
when a QUOTED token is consumed. getFieldQuery puts that token (or
tokens, possibly at the same position) through an analyzer and returns
a Query object. You want the analyzer used there to leave stop words in
place when the token type is QUOTED. Sounds reasonable to me.

Now, replacing the entire method just to change the analyzer is very
brute force, but maybe it will spark an idea for something more elegant.
The same "bear in mind this might be BS" applies to this answer.

Re: Quotes dependent StopWords removal

Allow me to amend my last email: you should not be making those changes
in QueryParser.java but in QueryParser.jj (line 866). Also, I know you
did not want to subclass QueryParser, but you cannot know you are in
quotes in the analyzer unless you hook into QueryParser.

Re: Quotes dependent StopWords removal

- During indexing, use the StandardAnalyzer without stop words.
- During search, use two different analyzers, one with and one without stop words. First check whether the user
  has typed quotes inside her query string.
  - If so, check whether there are stop words between the quotes:
    - if there is a stop word between the quotes, use the analyzer without stop words;
    - if there is no stop word between the quotes, use the one with stop words.
  - If not, use the one with stop words anyway.

...the drawback of this approach is that when a user mixes stop words inside and outside quotes in one query, you cannot decide so easily.
Maybe a solution there is to modify the analyzer's stop word list on the fly... then the last remaining problem is when the user types
a specific stop word twice, once with and once without quotes. Maybe in that situation you can live with using the analyzer without stop words;
depending on your scenario, it could be a good compromise... or search n times, but that wouldn't be straightforward either ;)
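
The selection logic above might be sketched roughly as follows. This is plain Java rather than Lucene code: AnalyzerChooser, its Choice enum, and the stop word list are hypothetical names made up for illustration, and the returned value only indicates which of your two real analyzers you would then hand to the query parser.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: decides which of two analyzers to use.
final class AnalyzerChooser {

    // Stand-ins for "the analyzer with stop words" and "the one without".
    enum Choice { WITH_STOP_WORDS, WITHOUT_STOP_WORDS }

    // Assumed stop word list for the example.
    static final Set<String> STOP_WORDS = Set.of("no", "with", "the", "a");

    // Matches the text between a pair of double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // True if any quoted phrase in the query contains a stop word.
    static boolean hasQuotedStopWord(String query) {
        Matcher m = QUOTED.matcher(query);
        while (m.find()) {
            for (String word : m.group(1).toLowerCase(Locale.ROOT).split("\\s+")) {
                if (STOP_WORDS.contains(word)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Stop words are kept for the whole query as soon as one is quoted.
    static Choice choose(String query) {
        return hasQuotedStopWord(query) ? Choice.WITHOUT_STOP_WORDS : Choice.WITH_STOP_WORDS;
    }
}
```

Note that a mixed query such as "no dress code" with shoes selects the analyzer without stop words for the entire query, so the unquoted "with" is kept too. That is the compromise described above, not a fix for it.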

greetz

Christian

Sameer Maggon wrote:

> Currently, in my application (that uses Lucene), I am using a Porter + StandardAnalyzer (with stop words).
>
> I would like to do the following:
>
> When the user performs a search, the analyzer should remove the "stopwords" only if the stop word is not present in quotes. If the stop word is present in quotes, I don't want the stop word to be removed by the analyzer.
>
> For e.g.
>
> "no dress code" - should not remove "no" as it's present in quotes.
>
> shirts with trousers - should remove "with" as a stop word.
>
> I have been trying to do this with Lucene, but have not found a straight forward way of doing it. I have been digging in Lucene mail archives, but it seems like there is no easy way to do this apart from extending / modifying the QueryParser. In some sense, it is similar to the issue discussed in:
>
> http://www.gossamer-threads.com/lists/lucene/java-user/38946
>
> Is there any way I can avoid subclassing QueryParser ?
>
> Thanks,
>
> Sameer Maggon.

Re: Quotes dependent StopWords removal

This keeps popping back into my head. A little more info for you. Bear
in mind I have not dealt with the QueryParser before.

Use the approach I gave last time. Pull out the QueryParser and change
either QueryParser.jj or QueryParser.java...you may be able to just
change QueryParser.java and avoid having to recompile the JavaCC grammar
file. Now at line 992 of QueryParser.java (different line in
QueryParser.jj) you will see the line:

getFieldQuery(field, term.image.substring(1, term.image.length()-1), s)

This is still using the same strategy I mentioned last time. The query
parser will analyze the text passed into the getFieldQuery function.
This particular call of getFieldQuery is made when the query parser sees
a quoted set of tokens (remember...I'm half guessing on all of this...I
don't know). The analyzer used by getFieldQuery is stored in the
QueryParser member variable analyzer. So a possible solution is to save
the member variable analyzer to a local variable and replace it with an
analyzer that does not remove stop words right before the getFieldQuery
call. Restore the original analyzer to the analyzer member variable
after the call.
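
The save, swap, and restore pattern could look roughly like the sketch below. It is not the real QueryParser: SwappingParser, KEEP_ALL, DROP_STOP_WORDS, and quotedFieldQuery are hypothetical stand-ins, and the analyzers are modeled as simple functions from text to a list of terms. The try/finally is an addition on my part so the original analyzer is restored even if the call throws.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in for a parser with an "analyzer" member variable.
final class SwappingParser {

    // Assumed stop word list for the example.
    static final Set<String> STOP_WORDS = Set.of("no", "with", "the", "a");

    // Stand-in for an analyzer that keeps every term.
    static final Function<String, List<String>> KEEP_ALL =
        text -> Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));

    // Stand-in for an analyzer that removes stop words.
    static final Function<String, List<String>> DROP_STOP_WORDS = text -> {
        List<String> terms = new ArrayList<>(KEEP_ALL.apply(text));
        terms.removeAll(STOP_WORDS);
        return terms;
    };

    // The member variable that getFieldQuery reads.
    Function<String, List<String>> analyzer = DROP_STOP_WORDS;

    // Analyzes the text with whatever analyzer is currently installed.
    List<String> getFieldQuery(String text) {
        return analyzer.apply(text);
    }

    // Called for a quoted token image such as "\"no dress code\"".
    List<String> quotedFieldQuery(String image) {
        Function<String, List<String>> saved = analyzer; // save
        analyzer = KEEP_ALL;                             // swap
        try {
            // Strip the surrounding quotes, as the original call does.
            return getFieldQuery(image.substring(1, image.length() - 1));
        } finally {
            analyzer = saved;                            // restore
        }
    }
}
```

One caveat with this design: swapping a shared member variable is not safe if the same parser instance is used from several threads at once.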