Does this have a theoretical basis? On what basis was the decision made to have it? Does anybody know a paper (in Information Retrieval, Information Seeking, etc.) or other more general information about this?

Best Regards, Karl

P.S.: This is my second question about Lucene scoring (current version). It relates to the question I posted on the older scoring version. I decided to repost because most people here seemed not to read it, since it relates to an old version. Well, actually it doesn't.

Karl Koch wrote:
> The coord(q,d) normalisation is "a score factor based on how many of
> the query terms are found in the specified document." and described
> here:
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_coord
>
> Does this have a theoretical base? On what basis was the decision
> made to have it? Does anybody know a paper (in Information Retrieval,
> Information Seeking, etc.) or other more general information about
> this?

Many retrieval systems represent documents and queries by the words they contain, and base the comparison on the number of words they have in common. The more words the query and document have in common, the higher the document is ranked; this is referred to as a "coordination match." Performance is improved by weighting query and document words using frequency information from the collection and individual document texts [27].

I looked up the paper and read the relevant part. The text you quoted is from the introduction. I believe that quote refers to the basic purpose of an information retrieval system in general, or at least to the purpose of a vector space model IR system.

If this is the theoretical justification of the coord_q_d normalisation, then it actually replicates the other part of the scoring formula to some degree. The entire formula is actually concerned with this: comparing the term frequencies of query and document.

Is there any other paper that actually shows the benefit of doing this particular normalisation with coord_q_d? I am not suggesting here that it is not useful, I am just looking for evidence how the idea developed.

> Following is quoted from: Krovetz, R. & Croft, W. B. (1992). Lexical Ambiguity and Information Retrieval. ACM Transactions on Information Systems, 10(2): 115-141.
>
> 27. Salton, G. & McGill, M. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

Karl Koch wrote:
> Is there any other paper that actually shows the benefit of doing
> this particular normalisation with coord_q_d? I am not suggesting
> here that it is not useful, I am just looking for evidence how the
> idea developed.

I think it's a mischaracterization to call coordination a "normalization". In my mind, "normalization" is something applied equally to all documents' scores. The coordination component of a document's score varies from document to document, and so doesn't meet this criterion.

I don't know the answer to your larger question: why use a coordination component in a similarity measure when other components (tf*idf) seem to serve the same function? What you seem to be looking for is a study that directly compares a system using a coordination component in its similarity measure with the *same* system, varying the measure only in that coordination is elided. Unfortunately, I know of no such study.

Unfortunately I don't have access to these books right now. I will try to get hold of them. Thank you for these pointers. :)

I had a quick look at "coordination level matching" on the web and found evidence that this seems to have been an early retrieval strategy. My question is mainly why one should use coordination level matching if one is already doing (proper) TFxIDF based matching. When I look at Lucene's scoring formula, it seems to me that two kinds of matching are performed and combined in a single matching formula.

In the paper, "Exploiting the Similarity of Non-matching Terms at Retrieval Time" which can be found here:

it is directly compared with TFxIDF. To me, it seems that coordination level matching could be used if I don't want to use TFxIDF but not together with it. In this context, I wonder what benefit the "coordination level matching" has in combination with TFxIDF?

It is likely that I have some kind of misunderstanding here. Perhaps with your help I can untangle that a bit further. As I said earlier, I am only looking for a reasonable explanation (perhaps backed by some evidence in the literature) that makes it clear why it is used together with TFxIDF.

> I repeat the citation of the book cited by the paper I cited :) :
>
> Salton, G. & McGill, M. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
>
> In addition to the above book, here are two other books that I've seen
> cited as describing "coordination-level matching" (a.k.a. "overlap
> ranking"):
>
> Salton, G. (1968). Automatic information organization and retrieval. New York: McGraw-Hill.
>
> Lancaster, F.W. (1979). Information retrieval systems: Characteristics, testing and evaluation (2nd ed.). New York: Wiley.
>
> Good luck,
> Steve

On 12/13/06, Karl Koch <TheRanger [at] gmx> wrote:
> To me, it seems that coordination level matching could be used if I don't want to use TFxIDF but not together with it. In this context, I wonder what benefit the "coordination level matching" has in combination with TFxIDF?

Well, if I search for blue kangaroo, the coord is nice to get documents with "blue" and "kangaroo" to score higher than documents with just one term. And among documents with just one term, the idf factor will make "kangaroo" rank above "blue", which is generally desired.

I have seen complaints about the default similarity though, where the coord factor does not give enough of a boost in relation to the idf of some of the individual terms.
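The blue kangaroo example can be sketched numerically. This is a minimal sketch and not Lucene's actual implementation: it assumes coord is the fraction of query terms found in the document, uses a smoothed idf of an assumed form, ignores tf and length normalisation, and the collection size and document frequencies are made-up values.

```python
import math

NUM_DOCS = 1000                          # hypothetical collection size
DOC_FREQ = {"blue": 300, "kangaroo": 5}  # hypothetical document frequencies

def idf(term):
    # Smoothed inverse document frequency (assumed form): rarer terms weigh more.
    return math.log(NUM_DOCS / (DOC_FREQ[term] + 1)) + 1.0

def score(query, doc_terms, use_coord=True):
    matched = [t for t in query if t in doc_terms]
    # coord: fraction of the query terms that the document contains.
    coord = len(matched) / len(query) if use_coord else 1.0
    return coord * sum(idf(t) for t in matched)

query = ["blue", "kangaroo"]
both = score(query, {"blue", "kangaroo"})
only_kangaroo = score(query, {"kangaroo"})
only_blue = score(query, {"blue"})

# The same documents scored without the coord factor, for comparison.
both_plain = score(query, {"blue", "kangaroo"}, use_coord=False)
kangaroo_plain = score(query, {"kangaroo"}, use_coord=False)

print(both > only_kangaroo > only_blue)
print(both / only_kangaroo, both_plain / kangaroo_plain)
```

With these assumed numbers the document containing both terms ranks first either way, and idf puts the kangaroo-only document above the blue-only one. The coord factor widens the lead of the two-term document over the best single-term document.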

> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server

Thank you for providing the link to that paper. I read it again, and you are right. I found the following passage:

"In normal term co-ordination matches, if a request and document have a frequent term in common, this counts for as much as a non-frequent one; so if a request and document share three common terms, the document is retrieved at the same level as another one sharing three rare terms with the request. But it seems we should treat matches on non-frequent terms as more valuable than ones on frequent terms, without disregarding the latter altogether. The natural solution is to correlate a term's matching value with its collection frequency."

If I do not misunderstand that extract, I would say it suggests combining coordination level matching with IDF. I am interested in your view and in the views of others reading this.

Are there any other papers that regard the combination of coordination level matching and TFxIDF as advantageous?
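The contrast drawn in the quoted passage can be illustrated with a small sketch. Everything here is an assumption for illustration: the term names, the document frequencies, the collection size, and the use of a plain logarithmic weight as the "matching value" tied to collection frequency.

```python
import math

NUM_DOCS = 1000  # hypothetical collection size
DOC_FREQ = {     # hypothetical document frequencies
    "system": 500, "data": 450, "method": 400,  # frequent terms
    "kangaroo": 5, "lucene": 8, "stemmer": 3,   # rare terms
}

def coordination_level(request, document):
    # Plain co-ordination match: count the shared terms, all valued equally.
    return len(set(request) & set(document))

def weighted_match(request, document):
    # Each shared term contributes more the rarer it is in the collection.
    shared = set(request) & set(document)
    return sum(math.log(NUM_DOCS / DOC_FREQ[t]) for t in shared)

request = list(DOC_FREQ)
doc_frequent = ["system", "data", "method"]
doc_rare = ["kangaroo", "lucene", "stemmer"]

print(coordination_level(request, doc_frequent), coordination_level(request, doc_rare))
print(weighted_match(request, doc_frequent), weighted_match(request, doc_rare))
```

Both documents sit at the same co-ordination level, since each shares three terms with the request. Once matches are weighted by collection frequency, the document sharing rare terms scores higher, while the frequent-term matches still count for something.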

Karl Koch wrote:
> If I do not misunderstand that extract, I would say it suggests the combination of coordination level matching with IDF. I am interested in your view and those who read this?

I understand that sentence: "The natural solution is to correlate a term's matching value with its collection frequency." exactly in that way, to combine coordination level matching with IDF.

The score for a document is the sum of the term weights w(tf, idf) for each query term it contains. So you already have the combination of coordination level matching with IDF. Now suppose your query requests three terms A, B and C. Two of them (A and B) occur very often in the collection, and one (C) is very rare. It is then possible that documents matching just C get a higher score than documents containing A and B. To avoid this, you can give the coordination more influence by multiplying the sum of term weights by the coordination as an additional factor.
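This argument can be written out as a short worked inequality. It is a sketch under assumptions not stated in the thread: coord(q,d) is taken to be the fraction of query terms that occur in d, and w_t abbreviates the term weight w(tf, idf) of term t.

```latex
\begin{align*}
\mathrm{score}(q,d) &= \mathrm{coord}(q,d) \sum_{t \in q \cap d} w_t,
  \qquad \mathrm{coord}(q,d) = \frac{|q \cap d|}{|q|} \\
\mathrm{score}\bigl(q, D(A,B)\bigr) &= \tfrac{2}{3}\,(w_A + w_B) \\
\mathrm{score}\bigl(q, D(C)\bigr) &= \tfrac{1}{3}\,w_C \\
\mathrm{score}\bigl(q, D(A,B)\bigr) > \mathrm{score}\bigl(q, D(C)\bigr)
  &\iff w_C < 2\,(w_A + w_B)
\end{align*}
```

Without the coord factor the condition would be w_C < w_A + w_B, so under this assumption coord doubles the weight a single rare term needs before it outranks two common ones. A sufficiently rare term can still win, which fits the complaint that the coord factor sometimes does not give enough of a boost.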

> Sören

Soeren Pekrul wrote:
> The score for a document is the sum of the term weights w(tf, idf) for
> each containing term. So you have already the combination of
> coordination level matching with IDF. Now it is possible that your query
> requests three terms A, B and C. Two of them (A and B) are quite often
> in the collection one (C) is very rare. It could be possible that
> documents are matching just C have a higher score than documents
> containing A and B. To avoid this you can give the coordination a higher
> influence by multiplying the sum of term weights with the coordination
> as additional factor.

FYI: The Wiki has a fair number of resources on IR: http://wiki.apache.org/jakarta-lucene/InformationRetrieval (I have added a link to this conversation, which contains a lot of useful information)

Karl, if you are so inclined, please feel free to add any references you have found helpful that aren't already on this page (anyone can edit the Wiki with a login).

-Grant

On Dec 14, 2006, at 4:59 AM, Soeren Pekrul wrote:

> Addendum:
> For the query Q(A, B, C) with
>   A: df++ (idf--)
>   B: df++ (idf--)
>   C: df-- (idf++)
> the user would probably expect the following ranking:
>   1. D(A, B, C)
>   2. D(A, C), D(B, C)
>   3. D(A, B)
>   4. D(C)
>   5. D(A), D(B)
>
> Sören
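The expected ranking in the addendum can be checked with a small sketch. It is not Lucene's implementation: coord is assumed to be the fraction of query terms present, the idf form is an assumption, tf is ignored, and the document frequencies and collection size are hypothetical values chosen so that A and B are common and C is rare.

```python
import math
from itertools import combinations

NUM_DOCS = 1000                          # hypothetical collection size
DOC_FREQ = {"A": 400, "B": 400, "C": 2}  # hypothetical: A, B common; C rare
QUERY = ("A", "B", "C")

def idf(term):
    # Smoothed inverse document frequency; the exact form is an assumption.
    return math.log(NUM_DOCS / (DOC_FREQ[term] + 1)) + 1.0

def score(doc_terms, use_coord):
    matched = [t for t in QUERY if t in doc_terms]
    coord = len(matched) / len(QUERY) if use_coord else 1.0
    return coord * sum(idf(t) for t in matched)

def ranking(use_coord):
    # Every non-empty subset of the query terms stands for one kind of document.
    docs = [c for n in range(1, len(QUERY) + 1) for c in combinations(QUERY, n)]
    tiers = {}
    for doc in docs:
        # Documents with the same (rounded) score share a rank.
        tiers.setdefault(round(score(doc, use_coord), 6), []).append("".join(doc))
    return [sorted(names) for _, names in sorted(tiers.items(), reverse=True)]

with_coord = ranking(use_coord=True)
without_coord = ranking(use_coord=False)
print(with_coord)
print(without_coord)
```

With these assumed values, the ranking with coord matches the five tiers listed in the addendum, whereas without coord D(C) moves ahead of D(A, B). Other document frequencies could give a different outcome, so this is an illustration rather than a general result.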