Add BM25 Scoring to Lucene

Description

http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework,
as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF).
I have refactored this a bit, added unit tests and improved the runtime somewhat.
I would like to contribute the code to Lucene under contrib.

Otis Gospodnetic
added a comment - 30/Nov/09 04:00 - edited Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:
any of the relevance eval methods
indexing performance
search performance
...
Aha, I found something:
http://markmail.org/message/c2r4v7zj7mduzs5d
Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use this and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach, and become the default.

Robert Muir
added a comment - 30/Nov/09 04:29 - edited Otis, attached is a graph i produced from the hamshahri corpus, comparing 4 different combinations:
Lucene SimpleAnalyzer
Lucene SimpleAnalyzer + BM25
Lucene PersianAnalyzer
Lucene PersianAnalyzer + BM25
the hamshahri corpus contains standardized encoding of persian (i.e. the normalization filter is a no-op).
so any analyzer gain is strictly due to "stopwords", although in persian i wouldn't call some of these words.
this was mostly to show that the analyzer is actually useful, i.e. the scoring system can't completely make up for lack of support like this.
btw, you can play around with openrelevance svn and duplicate my experiments on this same corpus yourself if you want. there's an indonesian corpus there too. i've also tested hindi with this impl.

Yuval Feinstein
added a comment - 30/Nov/09 07:02 Otis and Robert, Here's my (limited) experience with BM25:
On a proprietary corpus (alas) I got a nice improvement, which was more pronounced in recall
(hits that were previously not ranked as top ones, and therefore remained unseen, now appear in the top results).
I have worked on lowering the BM25 run time to a reasonable level.
I hope that once this gets into the hands of the Lucene community, BM25 performance
will approach the current Lucene scoring's performance. This is a tall order,
as the latter has been in the works for the last eight years or so.
As for use cases: in my use case BM25 helps, and I believe this may be true for other cases.

Robert Muir
added a comment - 30/Nov/09 07:22 Yuval, bm25 has been working nicely for me too.
on some collections, it really helps, but I haven't yet found a case where it hurts (compared to lucene's current scoring algorithm)
thanks in advance for working on this!

Joaquin Perez-Iglesias
added a comment - 01/Dec/09 16:00 - edited Hi Otis, Robert and Yuval.
I developed this add-on for Lucene in 2008, for some experiments that I was doing, and I would like to express my impressions about this.
In my experience, and after reading lots of papers, I have never found a case where the Lucene-VSM implementation outperforms BM25.
BM25 (with standard parameters) outperforms Lucene-VSM; moreover, there is room for improvement if the parameters are tuned specifically for the collection. I published some results with the Eurogov collection some time ago.
I can show you now some experiments with TREC Disk4&5 collection, these results have been obtained with default parameters with the Robust track topics. As you can see BM25 improves the Lucene-VSM ranking function.
        MAP     P@5
VSM     0.2079  0.4096
BM25    0.2340  0.4578
This implementation is getting more popular and I know that some people are using it in their research, so it would be really nice if at some point it were included in the core.
The only concerns I have about it are:
Only simple boolean queries based on terms are supported (with the operators OR, AND, NOT). For instance, it does not support PhraseQuery.
IDF cannot be calculated at the document level (this is important for BM25F).
Another issue is related to computing the average document length, but this could be easily solved.
These issues are described in detail in the documentation that I made public on my website.
Thanks to all for your interest and work.
Joaquin Perez-Iglesias

Robert Muir
added a comment - 01/Dec/09 16:38
In my experience and after reading lot of papers I have never found a case where the Lucene-VSM implementation improves BM25 performance.
BM25 (with standard parameters) outperforms Lucene-VSM, moreover a room for improvement exists if the parameters are fixed specifically for the collection
Just wanted to mention that in the results I have provided, I never change the default parameters (B,K1) either.

Yuval Feinstein
added a comment - 03/Dec/09 10:16 I have attached a patch for this case. This is Joaquin's BM25 and BM25F library. I refactored it a bit for speed. It should work with the trunk. I have tested it lightly, mainly through unit tests. Your feedback is more than welcome.

Yuval Feinstein
added a comment - 03/Dec/09 10:20 So what is the next step? Robert, do you have time to look at this? Or should I assign this to someone else? And how?
Sorry if these are trivial questions, but I'm a newbie...

Robert Muir
added a comment - 03/Dec/09 10:52 - edited Hi Yuval, I see your patch, I can help with some relevance testing and comments.
I don't know if it should be assigned to me; maybe we can trick one of the devs who really knows the scoring system into looking at it, especially regarding performance and things like that.
Here is the first thing I noticed, maybe I am completely stupid but I never understood this:
I don't understand why we need BM25Boolean.* and everything like that. I don't understand why these are necessary; they seem to be duplicates of BooleanQuery etc. and just sum up subscorers or whatever.
So in my usages I dropped them. I just have BM25TermQuery, BM25TermScorer, and BM25Parameters, and to use it, I override a method in QueryParser.
edit: by the way, I don't want to imply that what I am doing is "best" either, because I don't think it is, only that this would be one way to simplify the code a lot as a first step.

Uwe Schindler
added a comment - 03/Dec/09 11:03 - edited I was wondering about the separate BooleanQuery, too, as it is almost simply a copy (of an old version of it). The question is more: why do we need the BM25 classes at all? Why should it not be possible to use normal term queries and other query types together with BM25 by just changing some scoring defaults? So replace Similarity and maybe have a switch inside the Scorers. So TermQuery could be switched to BM25 mode and then use another Scorer, or something like that.
That was just my first impression; these additional classes do not look like a good public API to me. Query classes should be abstract wrappers for Weights and Scorers. The internal implementation, like BM25 or conventional scoring, should be hidden from the user (and maybe set via properties, e.g. on the IndexSearcher, to use BM25 scoring). This way, it could also be used for other query types (not only TermQuery/BooleanQuery), but e.g. for function queries (to further change the score) or FuzzyQuery and whatever else.
If what I said is complete nonsense, don't hurt me, I do not know much about BM25, but for me it is an implementation detail and not part of a public API.

Robert Muir
added a comment - 03/Dec/09 11:09 Uwe, yes I agree it would be nicer to do a tighter integration.
I am suggesting we tackle it one step at a time, first we can answer this question, next we can talk about average document length and other tricky things like that.
For this I calculate it from norms, as Doug suggested here: https://issues.apache.org/jira/browse/LUCENE-965?focusedCommentId=12515803&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12515803 , but let's get to this later.

Robert Muir
added a comment - 03/Dec/09 12:17 Hi Yuval, I think we should also consider BM25F as the really relevant version for Lucene, as it is field-oriented (and drop document-oriented BM25),
because BM25F makes more sense for Lucene, in my opinion. If you are searching BODY for some keywords, who cares how long the AUTHOR field is!
I think narrowing the focus to BM25F would eliminate confusion about things such as "average document length"; we would then work with just fields and "average field length", and more people will have ideas.

Otis Gospodnetic
added a comment - 03/Dec/09 19:53 +1 for skipping BM25 and going straight to BM25F.
I think the answer to Uwe's question about why this can't just be a different Similarity or some such is that BM25 requires some data that Lucene currently doesn't collect. That's why there were some of those static methods in the examples on the author's site. I think what I'm saying is correct.

Joaquin Perez-Iglesias
added a comment - 03/Dec/09 21:32 Hi everybody,
I'm going to try to answer some of your questions. When I started to develop this library I didn't want
to modify the Lucene code; moreover, I tried to create a jar that could be added directly to the official
Lucene distribution. That is the main reason why there are some duplicated classes.
So yes, a tighter integration would be better, and I believe we would get more support for different query types.
Regarding BM25 vs. BM25F: they are equivalent; BM25F is the version for more than one field, so yes, go for BM25F.
What is really important is the way boost factors are applied: as you can see in the equation, these must be applied to raw frequencies, not to normalized or saturated frequencies.
(Currently Lucene applies them after normalization and saturation of frequencies, which in my opinion is not the best approach.)
A more detailed explanation of BM25F and this issue can be found in this paper http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.5255
The problem, as I said, comes from IDF. In the BM25 family of equations, IDF is always computed at the document level (that is why
I recommend, as a heuristic, using the field with the most terms, or using a special field that contains all the terms). As far as I know that is a problem,
because Lucene doesn't store document frequency per document but per field.
Otis is right: as far as I know, just changing Similarity is not enough. Some data is not available to either TermScorer or Similarity, and TermScorer
applies the values obtained from Similarity in a way that makes it incompatible with BM25.
It is really important to follow the steps as it appears in my explanation:
1. Normalize frequencies with document/field length and b factor.
2. Saturate the effect of frequency with k1
3. Compute the sum of the term weights
4. Apply IDF
I really believe that this can be done (not sure how), so maybe we will need the suggestions of some 'scorer guru'.
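The four steps can be sketched in plain Java. This is a toy illustration of the formulas only, not the patch's classes; all names below, and the b/k1 values used in any call, are made up for the example:

```java
// Toy sketch of the BM25F scoring steps listed above. All names,
// parameters, and defaults here are illustrative; this is NOT the
// patch's API.
public class Bm25fSketch {

    // Step 1: normalize the raw frequency by field length and the b parameter.
    static double normalizedTf(double tf, double fieldLength, double avgFieldLength, double b) {
        return tf / ((1 - b) + b * (fieldLength / avgFieldLength));
    }

    // For BM25F, boosts are applied to the raw frequencies (before
    // saturation), and the boosted, normalized frequencies are summed
    // across fields; the combined weight is then saturated with k1 (step 2).
    static double saturatedWeight(double[] tfs, double[] fieldLengths,
                                  double[] avgFieldLengths, double[] boosts,
                                  double b, double k1) {
        double weight = 0;
        for (int f = 0; f < tfs.length; f++) {
            weight += boosts[f] * normalizedTf(tfs[f], fieldLengths[f], avgFieldLengths[f], b);
        }
        return weight / (weight + k1);
    }

    // Step 4: the document-level idf, applied after saturation.
    static double idf(int numDocs, int docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Per-term score; summing these over the query terms is step 3.
    static double score(double[] tfs, double[] fieldLengths, double[] avgFieldLengths,
                        double[] boosts, double b, double k1, int numDocs, int docFreq) {
        return saturatedWeight(tfs, fieldLengths, avgFieldLengths, boosts, b, k1)
               * idf(numDocs, docFreq);
    }
}
```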

Uwe Schindler
added a comment - 03/Dec/09 22:12 - edited Thanks for the explanation!
About the IDF: the problem with a per-document IDF in Lucene is that most users add catch-all fields (which would give the per-document IDF), but in addition they add special fields like numeric fields (which would not produce a good IDF; at the moment this IDF is ignored). Some users also add fields simply for sorting. So an IDF for whole documents is impossible with Lucene. You can only use e.g. catch-all fields (which are always a good idea for non-fielded searches, because OR-ing all fields together is slower than just indexing the same terms a second time in a catch-all field); e.g. "contents" contains all terms from "title", "subject", and "mailtext", as an example for emails. But the IDF for BM25F could be taken from the "contents" field even when searching only the title.

Basically what we are trying is to constraint the effect of the raw frequency (saturate the frequency).
In Lucene this is carried out with the root square of the frequency, another classical approach
is to use the log. With both approaches we avoid giving a linear 'importance' to the frequency.

BM25 is a bit tricky, it parametrises the 'saturation' of the frequency with a parameter k1, with the
equation weight(t)/(weight(t)+k1). Usually k1 is fixed to 2, but it can be fixed by collection.

(Uwe) Related with the IDF issue, I believe that the more correct approach (in theoretical terms), would be to use the docFreq on the fields where the user wants to search but I don't think that this can be done.
For example if we have indexed with 3 fields. F1, F2, F3, and the user want to search on F1, and F2 there is no way to compute docFreq in both fields. With a catch-all field we have docFreq for all fields.

So maybe the best available approach would be to use IDF per field. What do you think?

Joaquin Perez-Iglesias
added a comment - 04/Dec/09 09:43 - edited Yes sorry.
Basically what we are trying to do is constrain the effect of the raw frequency (saturate the frequency).
In Lucene this is carried out with the square root of the frequency; another classical approach
is to use the log. With both approaches we avoid giving a linear 'importance' to the frequency.
BM25 is a bit tricky: it parametrises the 'saturation' of the frequency with a parameter k1, via the
equation weight(t)/(weight(t)+k1). Usually k1 is fixed at 2, but it can be tuned per collection.
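For a concrete feel of these three saturation shapes, a tiny sketch (plain Java, illustrative only; the method names are made up):

```java
// The three frequency-saturation shapes mentioned above: Lucene's
// square root, the classical log, and BM25's tf/(tf + k1). All grow
// sublinearly, so doubling a term's frequency far less than doubles
// its weight; BM25's version is additionally bounded above by 1.
public class SaturationSketch {
    static double sqrtTf(double tf)            { return Math.sqrt(tf); }     // Lucene's default tf()
    static double logTf(double tf)             { return 1 + Math.log(tf); }  // classical log variant
    static double bm25Tf(double tf, double k1) { return tf / (tf + k1); }    // BM25; k1 usually around 2
}
```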
(Uwe) Regarding the IDF issue, I believe that the more correct approach (in theoretical terms) would be to use the docFreq on the fields where the user wants to search, but I don't think that this can be done.
For example, if we have indexed with 3 fields F1, F2, F3, and the user wants to search on F1 and F2, there is no way to compute docFreq across both fields. With a catch-all field we have docFreq for all fields.
So maybe the best available approach would be to use IDF per field. What do you think?

Michael McCandless
added a comment - 04/Dec/09 11:19 Could someone summarize what's missing in Lucene's index format, to allow a query's scorer to efficiently compute BM25?
It looks like these two are important for BM25F:
How many times does term T occur in all fields for doc D?
How many documents contain term T in any of their fields? (for computing document level IDF)
Is there anything else?
Only simple boolean queries based on terms are supported (with operators or, and, not). For instance it does not support PhraseQuery.
This is concerning – is there no way to score a PhraseQuery in BM25F?

Simon Willnauer
added a comment - 04/Dec/09 13:11 Joaquin,
For example if we have indexed with 3 fields. F1, F2, F3, and the user want to search on F1, and F2 there is no way to compute docFreq in both fields. With a catch-all field we have docFreq for all fields.
can you explain this a little more in detail? From my understanding, if I use a BooleanQuery a:x1 AND b:x2, TermWeight will calculate the IDF for Term(a, x1) and Term(b, x2). Am I missing something?

Joaquin Perez-Iglesias
added a comment - 04/Dec/09 19:15 - edited Yes, you are right; what I meant was related to multi-field queries. If you search for a term a over F1 and F2, the right approach would be to compute IDF with docFreq(a, F1∪F2), which to my understanding cannot be done.
If I'm right, Lucene does weight(a)*idf(a,F1) + weight(a)*idf(a,F2), and the correct approach would be weight(a)*idf(a,F1∪F2).
That's the reason why Uwe (and I) suggested using IDF per field in the previous case and, when the query is executed on each field, using a kind of catch-all field to compute docFreq across all fields.
(Michael)
In summary it will be nice to have:
1. docFreq at the document level, something like "int docFreq(term, doc_id)", returning the number of documents where the term occurs; but if that is not possible, a catch-all field will be enough.
2. The Collection Average Document Length and Collection Average Field Length (per each field).
I don't think that we need "How many times does term T occur in all fields for doc D", frequency is necessary per field and not per document.
I don't know too much about the implementation of PhraseQuery, but I think it should be possible to implement BM25F for it (and for any other query type), as long as the frequency and docFreq of the phrase/terms are available.
At this point it is not supported in the patch, but I don't see any reason why it couldn't be implemented; what I don't really know, though, is how to do it.
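The union-docFreq problem and the catch-all workaround can be illustrated with a toy sketch (plain Java, hypothetical names; nothing here is Lucene API):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy in-memory index: per-field postings modeled as term -> set of
// docIds. Summing per-field docFreqs overcounts documents that match
// in several fields; a catch-all field gives the union count directly.
public class UnionDocFreqSketch {

    // docFreq of a term in a single field
    static int docFreq(Map<String, Set<Integer>> field, String term) {
        return field.getOrDefault(term, Set.of()).size();
    }

    // docFreq of a term over the union of several fields -- the value
    // BM25F wants, which per-field stats alone cannot provide
    static int unionDocFreq(String term, List<Map<String, Set<Integer>>> fields) {
        Set<Integer> union = new HashSet<>();
        for (Map<String, Set<Integer>> field : fields) {
            union.addAll(field.getOrDefault(term, Set.of()));
        }
        return union.size();
    }
}
```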

Michael McCandless
added a comment - 04/Dec/09 20:23 I think that should be possible to implement BM25F for it
Ahh OK I just misunderstood – BM25 can score PhraseQuery; it's just that the current patch doesn't implement that.
1. docFreq at document level, something like "int docFreq(term, doc_id)" and return the number of documents where term occurs, but if it is not possible a catch-all field will be enough.
OK, catch all seems like an OK starting point. I wonder if we could enable storing terms dict but not postings... then we could store catch all just for the terms stats, so we wouldn't waste disk space. Though merging gets tricky, since we'd have to walk postings for all fields (or at least all involved in BM25F) in parallel, re-computing the catch-all stats.
2. The Collection Average Document Length and Collection Average Field Length (per each field).
Lucene doesn't store/compute this today... we can easily compute these stats for newly created segments, and record in the segments file, but then recomputing them during segment merging with deletions gets tricky. We could just take the linear approximate avg with deletions, but that may end up being too approximate, so we could instead make a dedicated posting list, which would be properly merged, but we'd then have to re-walk to compute the stats for the newly merged segment.
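The "linear approximate avg" could look something like this sketch (hypothetical; Lucene stores no such per-segment length stats today, so all of it is assumed bookkeeping):

```java
// Sketch of the "linear approximate avg" idea for merging per-segment
// length statistics when segments contain deletions. Assumes each
// segment records its total token count and doc count, and that
// deleted documents had average length -- that is the approximation.
public class AvgLengthSketch {
    static double mergedAvgLength(long[] tokenTotals, int[] docCounts, int[] delCounts) {
        double tokens = 0;
        int liveDocs = 0;
        for (int i = 0; i < tokenTotals.length; i++) {
            int live = docCounts[i] - delCounts[i];
            // scale the segment's token total by its live-doc ratio
            tokens += tokenTotals[i] * ((double) live / docCounts[i]);
            liveDocs += live;
        }
        return tokens / liveDocs;
    }
}
```

If the deleted documents were systematically longer or shorter than average, this estimate drifts, which is exactly why Michael notes it "may end up being too approximate".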

Grant Ingersoll
added a comment - 04/Dec/09 21:02 I haven't looked at the patch yet, but...
Should we take just a small step back and consider what it would take to actually make scoring more pluggable instead of just thinking about how best to integrate BM25? In other words, someone else has also donated an implementation of the Axiomatic Retr. Function. Much like BM25, I think it also requires avg. doc length, as does (I believe) language modeling and some other approaches. Of course, we need to do this in a way that doesn't hurt performance for the default case.
I'm also curious if anyone has compared BM25 w/ a Lucene similarity that uses a different length normalization factor? I've seen many people use a different len. norm with good success, but it isn't necessarily for everyone.

Robert Muir
added a comment - 04/Dec/09 22:36 In other words, someone else has also donated an implementation of the Axiomatic Retr. Function.
I've never been able to get that scoring function to do anything more than be consistently worse than the default Lucene formula. I tried at least 3 test collections with it...
I'm also curious if anyone has compared BM25 w/ a Lucene similarity that uses a different length normalization factor? I've seen many people use a different len. norm with good success, but it isn't necessarily for everyone.
Yes in the image posted here, I tried modifying length normalization with SweetSpot etc as others have done in the past. For this corpus I was unable to improve it in this way.
But maybe I made some mistakes in both these cases, so anyone feel free to try this themselves.

Grant Ingersoll
added a comment - 04/Dec/09 22:57 Yes in the image posted here, I tried modifying length normalization with SweetSpot etc as others have done in the past. For this corpus I was unable to improve it in this way.
Yeah, can't speak for SweetSpot, but there are other approaches too that don't favor shorter docs all the time.

Robert Muir
added a comment - 04/Dec/09 23:01 Yeah, can't speak for SweetSpot, but there are other approaches too that don't favor shorter docs all the time.
This is why it would be interesting if someone came up with a different modification (preferably one that isn't corpus-specific tuning) that actually works for that one corpus.
It's in the openrelevance SVN; anyone can try.

Robert Muir
added a comment - 16/Feb/10 20:01 - edited Joaquin, have you seen this paper: http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf
It's of interest how they modified BM25's idf formula slightly to improve results when no stopword list is used. I'm curious what you think about this, as it looks like a potential improvement for people not using stopwords (multilingual situations, etc.).
Edit: here is the quote, for simplicity:
Using the original idf formula idf = log[(n − df_j + 0.5) / (df_j + 0.5)], we have noticed that when the underlying term t_j occurs in more than half of the documents (df_j > n/2), the resulting idf value would be negative, and the final document score also could be negative. As a means of estimating idf, we therefore suggest a new variant defined as idf = log{1 + [(n − df_j + 0.5) / (df_j + 0.5)]}.
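The effect the paper describes is easy to check numerically. The sketch below compares the original BM25 idf with the log(1 + x) variant; the corpus numbers are made up for illustration.

```java
// Compares the original BM25 idf with the log(1 + x) variant
// quoted above, for a corpus of n documents and document frequency df.
public class IdfVariants {
    static double idfOriginal(long n, long df) {
        return Math.log((n - df + 0.5) / (df + 0.5));
    }
    static double idfOnePlus(long n, long df) {
        return Math.log(1 + (n - df + 0.5) / (df + 0.5));
    }
    public static void main(String[] args) {
        long n = 1000;
        // A term in over half the documents: the original idf goes negative...
        System.out.println(idfOriginal(n, 700)); // negative (about -0.85)
        // ...while the log(1 + x) variant stays positive for any df < n.
        System.out.println(idfOnePlus(n, 700));  // small but positive
        // For rare terms the two variants are nearly identical.
        System.out.println(idfOriginal(n, 5));
        System.out.println(idfOnePlus(n, 5));
    }
}
```

So the variant changes almost nothing for discriminative terms but removes the negative-score regime for very common ones, which is why it matters most when stopwords are kept in the index.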

Joaquin Perez-Iglesias
added a comment - 16/Feb/10 20:47 It is a consequence of the logarithm: you can get negative numbers, and a negative score doesn't make much sense. As far as I know, this version of IDF is fairly theoretical and based on the binary independence model (BIR), which transforms the product of probabilities into a sum of logarithms. Anyway, it is quite usual to add 1 to the result before applying the logarithm to avoid situations like this.
In my opinion it should be added to the patch. It doesn't hurt, and it helps.
This is clearly explained on Wikipedia: http://en.wikipedia.org/wiki/Okapi_BM25
Here is a quote from Wikipedia:
Please note that the above formula for IDF shows potentially major drawbacks when using it for terms appearing in more than half of the corpus documents. These terms' IDF is negative, so for any two almost-identical documents, one which contains the term and one which does not contain it, the latter will possibly get a larger score. This means that terms appearing in more than half of the corpus will provide negative contributions to the final document score. This is often an undesirable behavior, so many real-world applications would deal with this IDF formula in a different way:
Each summand can be given a floor of 0, to trim out common terms;
The IDF function can be given a floor of a constant ε, to avoid common terms being ignored at all;
The IDF function can be replaced with a similarly shaped one which is non-negative, or strictly positive to avoid terms being ignored at all.
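The three workarounds in the quoted passage can be written down directly. This is a sketch of the options only; the epsilon value is an arbitrary illustration, not a recommended constant.

```java
// The three IDF mitigations from the Wikipedia quote, side by side.
public class IdfFloors {
    static double rawIdf(long n, long df) {
        return Math.log((n - df + 0.5) / (df + 0.5));
    }
    // Option 1: floor each term's contribution at 0, trimming common terms.
    static double idfFloorZero(long n, long df) {
        return Math.max(0, rawIdf(n, df));
    }
    // Option 2: floor at a small constant epsilon so common terms still count.
    static double idfFloorEpsilon(long n, long df, double epsilon) {
        return Math.max(epsilon, rawIdf(n, df));
    }
    // Option 3: a similarly shaped, strictly positive replacement.
    static double idfOnePlus(long n, long df) {
        return Math.log(1 + (n - df + 0.5) / (df + 0.5));
    }
    public static void main(String[] args) {
        long n = 1000, df = 700; // a term in more than half the docs
        System.out.println(idfFloorZero(n, df));          // 0.0
        System.out.println(idfFloorEpsilon(n, df, 0.25)); // 0.25
        System.out.println(idfOnePlus(n, df));            // positive, ~0.36
    }
}
```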

Vinay Setty
added a comment - 06/Mar/10 10:00 I am trying to compile the BM25 classes for Lucene using the source code in http://nlp.uned.es/~jperezi/Lucene-BM25/jar/models.jar, but it won't compile with a later version of Lucene, i.e. lucene-3.0.1. I see that the interfaces org.apache.lucene.search.Query, org.apache.lucene.search.Scorer, and org.apache.lucene.search.Weight have all changed. Does anyone have a modified version of the BM25 classes that works with the latest version of Lucene?

Yuval Feinstein
added a comment - 06/Mar/10 22:10 @Vinay - Please take the LUCENE-2091.patch file from this very JIRA issue and try to apply it. I modified the code so that it works with Lucene 2.9.1, and it should therefore be compatible with Lucene 3.0.1 as well. The only feature missing is the explain() function. If it requires further modifications to work with Lucene 3.0.1, please let me know and I will try to handle the changes.

I was trying to apply the patch to both Lucene 2.9.2 and 3.0.1. In both cases it failed to compile. The problem is that it assumes a method decodeNormValue(byte) in Similarity, and I wasn't able to find a version where this method exists. Do you have any suggestions on how to proceed?

Yuval Feinstein
added a comment - 22/Mar/10 10:13 Hi Katja.
I am adding this to my task list.
I hope to get around to it in a few days.
In the meantime, I suggest you try replacing the instance method call
this.getSimilarity().decodeNormValue()
by the static method
Similarity.decodeNorm()
Please let me know if it works; If it does, I will change the patch accordingly.
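For background on why a decode step exists at all: Lucene stores the length norm as a single byte per document per field, and decodeNorm expands it back to a float at search time. The codec below is a simplified scheme similar in spirit to Lucene's SmallFloat, written here only to illustrate the precision trade-off; it is not the exact implementation in any Lucene release.

```java
// A simplified byte <-> float codec, similar in spirit to Lucene's
// SmallFloat norm encoding: each norm costs one byte of storage,
// trading mantissa precision for space. Illustrative only.
public class NormCodec {
    static byte encodeNorm(float f) {
        if (f <= 0) return 0;
        int bits = Float.floatToIntBits(f);
        int small = bits >>> 21;           // keep sign, exponent, top mantissa bits
        int zero = (63 - 15) << 3;         // bias so typical norm values fit a byte
        if (small <= zero) return 1;       // underflow: smallest positive code
        if (small >= zero + 0x100) return (byte) 0xFF; // overflow: clamp
        return (byte) (small - zero);
    }
    static float decodeNorm(byte b) {
        if (b == 0) return 0f;
        int bits = (b & 0xFF) << 21;
        bits += (63 - 15) << 24;           // undo the bias
        return Float.intBitsToFloat(bits);
    }
    public static void main(String[] args) {
        // Exact powers of two survive the round trip...
        System.out.println(decodeNorm(encodeNorm(1.0f)));  // 1.0
        System.out.println(decodeNorm(encodeNorm(0.25f))); // 0.25
        // ...other values are quantized to a nearby representable norm.
        System.out.println(decodeNorm(encodeNorm(0.3f)));
    }
}
```

This is why two slightly different document lengths can produce the identical stored norm, and why any patch calling the decode method has to match the exact API of the Lucene version in use.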

Robert Muir
added a comment - 27/Apr/10 15:03 Hello, I think there might be a bug in the formula here.
While working on LUCENE-2392, I have implemented several formulas, including BM25 and I(Ne)C2/I(Ne)B2 from the divergence-from-randomness models.
I first implemented DFR, and after verifying the relevance was correct against published results, I noticed BM25 (with this patch) had much lower relevance scores. This didn't make a lot of sense to me.
So I implemented BM25 from scratch, in terms of the stats framework of LUCENE-2392.
Here are some MAP scores on an English text collection (I have tested several, and this is the consistent behavior):
DefaultSimilarity      0.4333
BM25 (LUCENE-2091)     0.4131
BM25 (LUCENE-2392)     0.4395
BM25* (LUCENE-2392)    0.4401
DFR I(Ne)C2            0.4453
DFR I(Ne)B2            0.4475
Notes:
BM25* is what is actually attached, as it uses the flooring for IDF, but I also tested without this (BM25)
I used the same defaults as this patch for the k1 and b parameters.
I've attached my implementation for completeness; the APIs are pretty meaningless right now, but you can see how I implemented the equation.
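To make the BM25 vs. BM25* distinction concrete, here is a self-contained scoring sketch with a switch for the IDF flooring. The k1 = 2.0 and b = 0.75 values are commonly used defaults; I have not verified they match this patch's defaults, and the corpus numbers in main are invented for illustration.

```java
// Sketch contrasting plain BM25 with BM25* (idf floored at 0).
public class Bm25Sketch {
    final long numDocs;
    final double avgDocLen, k1, b;
    final boolean floorIdf;

    Bm25Sketch(long numDocs, double avgDocLen, double k1, double b, boolean floorIdf) {
        this.numDocs = numDocs; this.avgDocLen = avgDocLen;
        this.k1 = k1; this.b = b; this.floorIdf = floorIdf;
    }

    double termScore(double tf, long df, double docLen) {
        double idf = Math.log((numDocs - df + 0.5) / (df + 0.5));
        if (floorIdf) idf = Math.max(0, idf); // the BM25* flooring
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        return idf * tf * (k1 + 1) / (tf + norm);
    }

    // A document's score is the sum of its per-query-term scores.
    double score(double[] tfs, long[] dfs, double docLen) {
        double s = 0;
        for (int i = 0; i < tfs.length; i++) {
            s += termScore(tfs[i], dfs[i], docLen);
        }
        return s;
    }

    public static void main(String[] args) {
        // Two query terms: one very common (df > n/2), one rare.
        double[] tfs = {2, 5};
        long[] dfs = {700, 10};
        Bm25Sketch plain = new Bm25Sketch(1000, 250, 2.0, 0.75, false);
        Bm25Sketch floored = new Bm25Sketch(1000, 250, 2.0, 0.75, true);
        // The common term's negative idf drags the plain BM25 score down;
        // the floored variant simply ignores that term.
        System.out.println("BM25:  " + plain.score(tfs, dfs, 250));
        System.out.println("BM25*: " + floored.score(tfs, dfs, 250));
    }
}
```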

Vinay Setty
added a comment - 19/May/10 10:05 @Joaquin, I want to use BM25 scoring for evaluating phrase queries. I have created a positional index in Lucene, but have no clue how to use it for evaluating phrase queries with the BM25 scorer. I had a quick look at the code; by default the queries are boolean, and I could not find an easy way to make a phrase query. Any ideas?

Yuval Feinstein
added a comment - 24/May/10 14:36 @Vinay - I have this suggestion, though I am unsure whether it will work.
First, I would construct the BM25BooleanQuery and use it to create a QueryWrapperFilter qwf
(see http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/search/QueryWrapperFilter.html).
Next, I would create a PhraseQuery and call search(phraseQuery, qwf, 50).
This way, the scorer will first look for matches for the BM25 query, and then look among them for matches for the phrase query.
Hope this is understandable.
– Yuval

Ian Holsman
added a comment - 11/Mar/11 00:32 Hi Rob.
Your attachment (BM25SimilarityProvider) seems to rely on some other code (Stats.DocFieldStats and AggregatesProvider), which I guess is part of your DFR patch. Can you provide a pointer to that?
TIA. Also, I'm guessing that those rely on LUCENE-2392, and this provides an alternate implementation. Should we just close this issue as a duplicate of LUCENE-2392?

Robert Muir
added a comment - 11/Mar/11 15:53
your attachment (BM25SimilarityProvider) seems to rely on some other code (Stats.DocFieldStats) & AggregatesProvider .. which I guess is part of your DFR patch.. can you provide a pointer to that.
Yeah, this is from LUCENE-2392. Unfortunately it won't work with the most recent patch there, but both patches are really just exploration to see how we can divide this into subtasks.
For an update: the JIRA issues aren't well linked, but we have actually made pretty good progress on some major portions (IMO these are the most interesting):
Collection term stats: LUCENE-2862
per-field similarity: LUCENE-2236
termstate, to avoid redundant i/o for stats: LUCENE-2694
norms cleanup: LUCENE-2771, LUCENE-2846
The next big step is to separate scoring from matching (see the latest patch on LUCENE-2392) so that the similarity has full responsibility for all calculations, and so we get full integration with all queries, etc.
This isn't that complicated; however, in order to do it, we first need to refactor Explanations so that a Similarity has the capability (and responsibility!) to fully explain its calculations. So I think that is the next issue to resolve before going any further.

ian towey
added a comment - 16/Jun/11 16:51 I'm not sure I am using BM25BooleanQuery correctly; I'm getting variation in the number of hits when testing against QueryParser. Are there limitations to the query strings BM25BooleanQuery can deal with, e.g. "gas OR ((oil AND car) NOT ship)"? The results returned by BM25BooleanQuery seem to be all the docs that don't contain the term "ship" (comparing BM25BooleanQuery vs. QueryParser).