lucene-java-user mailing list archives

Hi Radha,
On 4/17/2009 at 6:19 AM, Radhalakshmi Sreedharan wrote:
> What I need is the following :
> If my document field is ( ab,bc,cd,ef) and Search tokens are
> (ab,bc,cd).
>
> Given the following :
> I should get a hit even if all of the search tokens aren't present
> If the tokens are found they should be found within a distance x of
> each other ( proximity search)
>
> I need the percentage match of the search tokens with the document
> field.
>
> Currently this is my query :
> 1) I form all possible permutation of the search tokens
> 2) do a spanNearQuery of each permutation
> 3) Do a DisjunctionMaxQuery on the spannearqueries.
>
> This is how I compute % match :
> % match = ( Score by running the query on the document field ) /
> ( score by running the query on a document field created
> out of search tokens )
>
> The numerator gives me the actual score with the search tokens run on
> the field.
> Denominator gives me the best possible or maximum possible score with
> the current search tokens
>
> For this example << If my document field is ( ab,bc,cd,ef) and Search
> tokens are (ab,bc,cd).>> I expect a % match of around 90%.
I'm having trouble understanding what your "% match" function represents. What scores do
you want if a document field contains (ab,bc,cd,ef,gh)? Or (ab,bc,cd,ef,gh,ij,jk,lm,no)?
Where does your "best possible" document reside? I'm assuming that: either the query set
is fixed, and so you have a pre-indexed set of pseudo-documents corresponding to the queries;
or you're creating new pseudo-documents for the query just prior to launching it. In either
case, do your pseudo-documents reside in the same index as the one that contains the real
documents?
> I tried out your approach and the problem got solved to an extent but
> still it remains.
>
> The problem is the score reduces quite a bit even now as bc is not
> found in the combinations ( bc,cd) ( bc,ef) and ( ab,bc,cd,ef) etc.
>
> The boosting infact has a negative impact and reduces the score further
> :(
>
> The factor which is affected by boosting is the queryNorm .
>
> With a boost of 6 -
>
> 0.015559823 = (MATCH) max of:
> 0.015559823 = (MATCH) weight(spanNear([SearchField:cd,
> SearchField:ef], 10, false)^6.0 in 0), product of:
> 0.07606166 = queryWeight(spanNear([SearchField:cd, SearchField:ef],
> 10, false)^6.0), product of:
> 6.0 = boost
> 0.61370564 = idf(SearchField: cd=1 ef=1)
> 0.02065639 = queryNorm
> 0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10,
> false)^6.0 in 0), product of:
> 0.33333334 = tf(phraseFreq=0.33333334)
> 0.61370564 = idf(SearchField: cd=1 ef=1)
> 1.0 = fieldNorm(field=SearchField, doc=0)
>
> Without a boost -
>
> 0.07779912 = (MATCH) max of:
> 0.07779912 = (MATCH) weight(spanNear([SearchField:cd,
> SearchField:ef], 10, false) in 0), product of:
> 0.3803083 = queryWeight(spanNear([SearchField:cd, SearchField:ef],
> 10, false)), product of:
> 0.61370564 = idf(SearchField: cd=1 ef=1)
> 0.6196917 = queryNorm
> 0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10,
> false) in 0), product of:
> 0.33333334 = tf(phraseFreq=0.33333334)
> 0.61370564 = idf(SearchField: cd=1 ef=1)
> 1.0 = fieldNorm(field=SearchField, doc=0)
The queryNorm for the boosted variant is 1/30th of the non-boosted query, but the boost is
multiplied through, resulting in a score that is 1/5th of the non-boosted query. I would
think that the queryNorm would be unaffected by the boost. Sorry, I don't know what's happening
- maybe a bug?
Steve
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org