David Mark Nemeskey
added a comment - 19/Aug/11 06:56 - edited
> I would just shoot for 'breadth' as far as across the different sims?
What do you mean by 'breadth'? Unit and integration tests (well... the "heart" test) already cover all the sims, and this includes score vs explanation comparison. As for the correctness tests, both LM and IB sims are tested, as well as four DFR methods. I can write tests for the three missing DFR sims, but that is as much breadth as I can add. Or do you have something else in mind?

Robert Muir
added a comment - 18/Aug/11 21:29
> Do you think that I should re-write the ones where the computation of the gold value is missing? Or the other way around?
I don't think so, I think we will take whatever we can get as far as tests. I would just shoot for 'breadth' across the different sims?

David Mark Nemeskey
added a comment - 18/Aug/11 21:22
I've added the correctness tests (is there a better name for these?). Do you think that I should re-write the ones where the computation of the gold value is missing? Or the other way around?

Robert Muir
added a comment - 12/Aug/11 13:19
> freq: I didn't know about that! Still, I want to provide not "plausible", but at least "safe" statistics in this case. You didn't touch docFreq and numberOfDocuments, so I assumed at least these two are filled with the actual values, is that so?
But I don't think we should populate it with arbitrary ones. I like 1 because this is consistent with what you asked for if you omit term frequency (I think it's confusing to put something other than 1 here; it's inconsistent with how omitTF works for Lucene's default scoring).
Right, docFreq is always populated. But if you omitTF, freq will be 1 (for exact scorers) or <= 1 (for sloppy scorers), as no frequency is available.
I ran a quick test and got decreases in MAP (probably slight, maybe not even significant) with PL2 and Dirichlet with the changes. I figure we can first fix D and then move on to P and such, and save LM for last, as it's a major pain.

David Mark Nemeskey
added a comment - 12/Aug/11 13:06 - edited
D: good question. I think if F > tfn, then D > 0, but I guess I have to prove it (and fix it if it isn't).
Could you tell me which sims were affected negatively?
freq: I didn't know about that! Still, I want to provide not "plausible", but at least "safe" statistics in this case. You didn't touch docFreq and numberOfDocuments, so I assumed at least these two are filled with the actual values, is that so?

Robert Muir
added a comment - 12/Aug/11 12:48
I think the change to D is fine? What about the rest of the equation? (Especially the variable "D".)
I tested D and it's fine with this change; however, with some of the other sims the changes had some negative effect... let's figure out D for now.
Also, as far as the values if you omit stuff: I don't think we should provide fake values that seem plausible. Remember, if you omit term frequencies such that totalTermFreq is unavailable, then freq will always be 1.

David Mark Nemeskey
added a comment - 11/Aug/11 16:08 - edited
Robert: I modified the nocommits a bit to provide input to the Similarities that looks somewhat plausible. I think it's better to avoid situations where e.g. docLen < freq to minimize the chance of error.
Please let me know what you think of these modifications; if they're OK, I'll nuke the nocommits.

David Mark Nemeskey
added a comment - 11/Aug/11 14:38
Did something so that D and P (the binomial models) return only positive scores, but it is neither theoretically sound, nor do I like it much.
Robert: could you test D please, to see how the results are affected?

Robert Muir
added a comment - 11/Aug/11 12:11
Heh, I fought that guy last night for quite some time... couldn't figure out a good solution.
If you make a patch, though, I can do some sanity testing to try to help.

David Mark Nemeskey
added a comment - 11/Aug/11 12:02
Apparently the Dirichlet method returns a negative score if tf / docLen < corpusTf / corpusLen. Unfortunately, the negative number can be arbitrarily large, so it's not as easy as adding a constant to the score. This of course makes sense if all documents are scored, as the function is monotone and consequently documents whose tf is 0 will always be ranked lower than those that contain the word. But this is not how IR engines work.
Having said that, I believe that we could simulate such a system. I don't know exactly how the query architecture works, but I presume the clauses that don't match a document are assigned a zero value. Now instead of this zero, the Scorer (or whatever class does this) could ask for a default value from the Similarity. In this case LMDirichletSimilarity could return score(stats, 0, Integer.MAX_VALUE), which is somewhere around -12.
If we don't do this, we have three options:
1. add score(stats, 0, Integer.MAX_VALUE) to the score
2. if (score < 0) return 0
3. add corpusTf / corpusLen * docLen to tf
All ensure a positive score, but also each has its own disadvantage.
1. adds a pretty big constant to the score, which may not play well with the other parts of the query
2. some documents that contain the term get the same 0 score as documents that don't (though I cannot say this is not in line with the LM approach)
3. this introduces a transformation that is difficult to characterize
For the time being, I'll go with 2, but we have to discuss this.
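For reference, the negative-score behavior and the clamp in option 2 can be sketched standalone. This is an illustrative sketch only, not Lucene's actual LMDirichletSimilarity code; mu = 2000 and all names here are made up for the example.

```java
// Standalone sketch of the Dirichlet-smoothed LM score (Zhai & Lafferty),
// with option 2's clamp applied. Illustrative only.
public class DirichletSketch {
    static final double MU = 2000;  // Dirichlet smoothing parameter (assumed value)

    // collectionProb is p(w|C) = corpusTf / corpusLen
    static double score(double tf, double docLen, double collectionProb) {
        double raw = Math.log(1 + tf / (MU * collectionProb))
                   + Math.log(MU / (docLen + MU));
        return Math.max(0, raw);  // option 2: negative scores become 0
    }

    public static void main(String[] args) {
        // tf/docLen below p(w|C): the raw score is negative, clamped to 0.
        System.out.println(score(1, 10000, 0.01));   // 0.0
        // tf/docLen well above p(w|C): a positive score survives the clamp.
        System.out.println(score(50, 100, 0.0001));
    }
}
```

The second log term is always negative, which is why documents that barely contain the term can end up below zero.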

Robert Muir
added a comment - 11/Aug/11 01:36
OK, I added some things (marked nocommit for your review):
Basically, we have the case for norms/totalTermFreq/sumTotalTermFreq that they can be unavailable because freqs or norms are omitted, but currently all sims have to deal with this problem.
Ideally sims would not have to deal with this stuff, but for the time being it prevents NaN/Inf for the "heart" test if the test gets the PreFlex codec (about 1/4 of the time), and it will prevent NPE if norms are omitted.
In the case these values are unavailable, I set them to 1... if you can review that this is OK, we can nuke the nocommits.
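The "set unavailable stats to 1" fallback described above can be sketched roughly like this (not the actual patch; the class and method names are made up):

```java
// Illustrative sketch: codecs report -1 when a statistic such as
// totalTermFreq or sumTotalTermFreq is not supported; substituting 1
// keeps the sims from producing NaN/Inf on those inputs.
public class StatDefaults {
    static long orOne(long stat) {
        return stat == -1 ? 1 : stat;
    }

    public static void main(String[] args) {
        System.out.println(orOne(-1));  // unavailable -> 1
        System.out.println(orOne(42));  // available -> passed through unchanged
    }
}
```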

David Mark Nemeskey
added a comment - 10/Aug/11 21:50 - edited
Fixed NaN and infinite scores in DFR and IB; all that's left is to fix the negative scores as well... and everything else discussed earlier.

Robert Muir
added a comment - 10/Aug/11 12:23
OK, here is what I did here for BM25:
In the case norms are omitted by the user, the formula behaves as if b = 0 (no length normalization). So this is always a possibility sims should handle, though for EasySimilarity perhaps it should just supply docLen = 1 or something of that nature?
In the case norms are available but sumTotalTermFreq is not (e.g. frequencies are omitted by the user), I use a value of 1 for the average doc length... This is probably OK, since in the case of omitTF all the TF values will be 1 anyway.
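The two fallbacks can be sketched around BM25's length-normalization factor K = k1 * ((1 - b) + b * docLen / avgDocLen). This is a sketch under assumed values (k1 = 1.2, b = 0.75) with made-up names, not the actual BM25Similarity code:

```java
// Sketch of BM25 length normalization with the two fallbacks:
// norms omitted -> behave as if b = 0; avg doc length unavailable -> 1.
public class Bm25NormSketch {
    static final double K1 = 1.2;   // assumed default
    static final double B  = 0.75;  // assumed default

    static double lengthNorm(double docLen, double avgDocLen, boolean normsOmitted) {
        double b = normsOmitted ? 0 : B;    // no norms: no length normalization
        if (avgDocLen <= 0) avgDocLen = 1;  // sumTotalTermFreq unavailable (-1): use 1
        return K1 * ((1 - b) + b * docLen / avgDocLen);
    }

    public static void main(String[] args) {
        // Norms omitted: every document gets the same factor, k1.
        System.out.println(lengthNorm(500, 100, true));
        // Norms present, average-length document: also exactly k1.
        System.out.println(lengthNorm(100, 100, false));
    }
}
```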

Robert Muir
added a comment - 10/Aug/11 12:17
> Robert: I'm on the NaN/Inf problems. As for the negative score, I'll leave it there for the time being, these Similarities should always return positive scores. I don't feel very confident about this test myself, so I guess I'll remove it (or at least make it optional) once all tests are successful.
Ahh, OK. I didn't know the sims should always return positive scores! If this is really the case, then it's good to test for it.
> As for the PreFlex codec, I must admit I am not familiar with it, so I would be grateful for a few pointers.
The PreFlex codec emulates the Lucene 3.x index format, which does not support TotalTermFreq, SumTotalTermFreq, SumDocFreq, etc. It will return -1 here.
Though I just realized: in some situations any codec can return -1 for these values, for example if frequencies are omitted by the user (omitTFAP).
So currently, unfortunately, similarities have to deal with this case (and also the case where norms == null, because norms are omitted by the user (omitNorms)!).
I've been working on the BM25 sim with all of this in mind; I'll commit an update to it as an example.

David Mark Nemeskey
added a comment - 10/Aug/11 10:01
Robert: I'm on the NaN/Inf problems. As for the negative score, I'll leave it there for the time being; these Similarities should always return positive scores. I don't feel very confident about this test myself, so I guess I'll remove it (or at least make it optional) once all tests are successful.
As for the PreFlex codec, I must admit I am not familiar with it, so I would be grateful for a few pointers.

Robert Muir
added a comment - 10/Aug/11 00:06
I wouldn't worry about the scores being negative necessarily myself: there is nothing wrong with this.
But we should fix the NaN/Inf score problems.
Also: some of the stats that are newer in Lucene will get stupid results with the PreFlex codec; it doesn't support them.
In my opinion, add the following to the test's setup:
assumeFalse("test cannot run with PreFlex codec!",
    "PreFlex".equals(CodecProvider.getDefault().getDefaultFieldCodec()));
And I can help in the places where EasySim collects these stats; for example, I think we should add checks in case totalTermFreq == -1, and throw UOE here instead.
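The suggested UOE check might look roughly like this. The class and method names are hypothetical, not the actual EasySim code:

```java
// Hypothetical guard in the stats-collection step: fail loudly when the
// codec (e.g. PreFlex) cannot supply totalTermFreq, instead of letting
// a -1 flow into the scoring formulas.
public class StatsGuard {
    static long checkedTotalTermFreq(long totalTermFreq) {
        if (totalTermFreq == -1) {
            throw new UnsupportedOperationException(
                "this codec does not support totalTermFreq");
        }
        return totalTermFreq;
    }

    public static void main(String[] args) {
        System.out.println(checkedTotalTermFreq(7));  // supported codec: passes through
        try {
            checkedTotalTermFreq(-1);                 // PreFlex-style codec: throws
        } catch (UnsupportedOperationException e) {
            System.out.println("UOE: " + e.getMessage());
        }
    }
}
```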

David Mark Nemeskey
added a comment - 09/Aug/11 21:30
Fixed a bug in TestEasySimilarity that prevented Similarities that use a subclass of EasyStats from being tested. Also modified EasyStats so that totalBoost is set to the value of queryBoost in the constructor.