On 22/7/2010 9:20 PM, Shai Erera wrote:
> How is that different than extending QP?
>
Mainly because the problem I'm having isn't there, and doing it from
there doesn't feel right, and definitely not like solving the issue. I
want to explore what other options there are before doing anything, and
I started this thread because I hit a dead end after seeing Similarity
can no more be of help.
> About the "song of songs" example -- the result you describe is already what
> will happen. A document which contains just the word 'song' will score lower
> than a document containing "song of songs".
Incorrect, and I have a sample app to show that (this is how I thought
of this example for the first place).
Since while indexing the 2 words will be saved into index as 1:(song
song$) 2:(song songs$), short documents with one word "song" will score
higher than longer documents with "song of songs". This is a product of
Lucene's default tf/idf implementation which cares about a field's
length, and at this stage I want to avoid replacing it (with BM25 for
example).
> Also, what I'd do in such a case
> is search for the phrase (in addition to the rest), 'cause documents
> containing the word "songs" 100 times will score higher than the single
> document that will contain "song of songs" once ...
>
In one of my applications I am providing an "as typed" capability, which
does exactly what you are suggesting (looking for the $-ed terms only),
but I want my original analyzer (the one that also looks for non $-ed
terms) to do better scoring. Without this the implementation is somewhat
broken...
> If you just want a query "abc def" to rank higher if a document contains the
> exact words, then I'd go w/ the QP extension approach, or do other
> sophistication like searching for 'abc' '\"abc\"' etc. or something like
> that. There are many tricks you can do on your end, w/o overriding much in
> Lucene. Still, IMO extending QP is the easiest and gives you the control you
> need.
>
I am overriding stuff in Lucene either way. I also don't want an exact
match of a phrase to rank higher; I want an original term (saved as-is
with a $ marker) to score higher than a stemmed / lemmatized one
(without the marker). Sorry if the thread's title is misleading.
I'd have used payloads if it wasn't costly. So my question is: where do
I have control over boosting (or scoring), and also have access to the
term's text?
Thanks,
Itamar.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org