Details

Description

When highlighting multi-valued fields, the highlighter sometimes returns snippets which span multiple values, e.g. with values "foo" and "bar" and search term "ba" the highlighter will create the snippet "foo<em>ba</em>r". Furthermore it sometimes returns smaller snippets than it should, e.g. with value "foobar" and search term "oo" it will create the snippet "<em>oo</em>" regardless of hl.fragsize.

I have been unable to determine the real cause for this, or indeed what actually goes on at all. To reproduce the problem, I've used the following steps:

create an index with multi-valued fields, one document should have at least 3 values for these fields (in my case strings of length between 5 and 15 Japanese characters – as far as I can tell plain old ASCII should produce the same effect though)

search for part of a value in such a field with highlighting enabled, the additional parameters I use are hl.fragsize=70, hl.requireFieldMatch=true, hl.mergeContiguous=true (changing the parameters does not seem to have any effect on the result though)

Activity

Patch against SVN HEAD to treat multi valued fields like single valued fields when highlighting by looping over the field values and accumulating the highlighted snippets.

This corrects the behaviour I've described and simplifies the code. The downside is that it may impose a performance penalty for large numbers of snippets. The code breaks out of the loop when enough snippets have been found without considering the other values of the fields, which means that the returned snippets may not be the best ones.

Lars Kotthoff
added a comment - 02/May/08 09:45 Patch against SVN HEAD to treat multi valued fields like single valued fields when highlighting by looping over the field values and accumulating the highlighted snippets.
This corrects the behaviour I've described and simplifies the code. The downside is that it may impose a performance penalty for large numbers of snippets. The code breaks out of the loop when enough snippets have been found without considering the other values of the fields, which means that the returned snippets may not be the best ones.

Attaching new patch which checks the fragments returned by Lucene properly. The getBestTextFragments method sometimes returns fragments which do not contain the search term at all (with a score of 0), which I wasn't aware of. The new patch checks whether the score of a fragment is > 0 before adding it to the list of fragments. It also removes the intermediate list of TextFragments and only accumulates a list of Strings, since this is the only information that's returned anyway.

With the new patch the unit tests (which I finally convinced to run at all in Eclipse) succeed.

Lars Kotthoff
added a comment - 22/May/08 07:41 Attaching new patch which checks the fragments returned by Lucene properly. The getBestTextFragments method sometimes returns fragments which do not contain the search term at all (with a score of 0), which I wasn't aware of. The new patch checks whether the score of a fragment is > 0 before adding it to the list of fragments. It also removes the intermediate list of TextFragments and only accumulates a list of Strings, since this is the only information that's returned anyway.
With the new patch the unit tests (which I finally convinced to run at all in Eclipse) succeed.

Sorry for all the noise. Attaching yet another new version of the patch which makes sure that for multi-valued field the best fragments are returned, rather than the first fragments with score > 0 as before.

Lars Kotthoff
added a comment - 22/May/08 08:32 Sorry for all the noise. Attaching yet another new version of the patch which makes sure that for multi-valued field the best fragments are returned, rather than the first fragments with score > 0 as before.
Added a unit test to verify this behaviour.

Ah, I see what the problem is: Although it is impossible for tokens from different values to appear in the same fragment (due to the semantics of MultiValuedTokenFilter), the non-token text (typically, punctuation) from different values can bleed into the same fragment, since lucene's highlighter can only create a new fragment on token boundaries.

Unfortunately SOLR-553 was committed a day after you submitted your patch, and rearranges the code slightly so that it no longer applies. Could you sync the patch with trunk? I think the basic approach is sound.

Mike Klaas
added a comment - 05/Jun/08 05:04 Ah, I see what the problem is: Although it is impossible for tokens from different values to appear in the same fragment (due to the semantics of MultiValuedTokenFilter), the non-token text (typically, punctuation) from different values can bleed into the same fragment, since lucene's highlighter can only create a new fragment on token boundaries.
Unfortunately SOLR-553 was committed a day after you submitted your patch, and rearranges the code slightly so that it no longer applies. Could you sync the patch with trunk? I think the basic approach is sound.

Thanks for the patch, Lars. I think that the basic approach is sound, though I am a little nervous about the performance implications (especially in the case of phrase highlighting, where we spin up an entirely new spanhighlighter for each value in a multi-valued field). I wonder if I am the only one who highlights large text fields composed of dozens of individual values?

Mike Klaas
added a comment - 10/Jun/08 06:09 Thanks for the patch, Lars. I think that the basic approach is sound, though I am a little nervous about the performance implications (especially in the case of phrase highlighting, where we spin up an entirely new spanhighlighter for each value in a multi-valued field). I wonder if I am the only one who highlights large text fields composed of dozens of individual values?

In the setup I've been testing it with (one large single-valued "text" field and several multi-valued fields) it didn't seem to have any serious performance implications – i.e. the randomness of my test queries was enough to mask any loss of performance
The length of the multi-valued fields is in the order of 10-20 characters on average though and there're not many multiple different values.

I personally think that returning correct data is more important than performance in this case, but that may just be because my particular setup doesn't suffer any significant loss of performance. I didn't see any other way to correct the behaviour of the current trunk code, but if anybody else has a better idea how to do it, please let us know!

Lars Kotthoff
added a comment - 10/Jun/08 06:20 In the setup I've been testing it with (one large single-valued "text" field and several multi-valued fields) it didn't seem to have any serious performance implications – i.e. the randomness of my test queries was enough to mask any loss of performance
The length of the multi-valued fields is in the order of 10-20 characters on average though and there're not many multiple different values.
I personally think that returning correct data is more important than performance in this case, but that may just be because my particular setup doesn't suffer any significant loss of performance. I didn't see any other way to correct the behaviour of the current trunk code, but if anybody else has a better idea how to do it, please let us know!

Yeah, I'm talking about highlighting 15kB of text in 100-200 character chunks. Maybe I can whip up a perf test for this soon.

The reason we probably see this issue differently is that the incorrect behaviour is quite minor for most users (perhaps a bit of punctuation leaking from value to value at most). Once way to correct what you are seeing is to use a tokenizer that creates tokens out of the CJK characters, or things on boundaries. In your case, inserting a fake token when encountering a right bracket [)] would fix the problem, I think.

Nevertheless, I think I will probably end up committing your patch after pondering it some more.

Mike Klaas
added a comment - 10/Jun/08 06:27 Hey Lars,
Yeah, I'm talking about highlighting 15kB of text in 100-200 character chunks. Maybe I can whip up a perf test for this soon.
The reason we probably see this issue differently is that the incorrect behaviour is quite minor for most users (perhaps a bit of punctuation leaking from value to value at most). Once way to correct what you are seeing is to use a tokenizer that creates tokens out of the CJK characters, or things on boundaries. In your case, inserting a fake token when encountering a right bracket [)] would fix the problem, I think.
Nevertheless, I think I will probably end up committing your patch after pondering it some more.

In my opinion the most important reason for committing the patch is that the current implementation breaks the multi-valued field abstraction. There's no way to assert that tokenizers will always produce tokens suitable for the current implementation. It also makes for a very hard to find bug, because there're no error messages. I just found it by chance. And even if you notice that something is wrong, fixing it is non-trivial and requires quite some knowlegde how Solr does highlighting of multi-valued fields.

So the other option is to add a page to the wiki with a workaround like you've suggested, but I think that's rather going to scare people evaluating Solr for use with CJK text away

Lars Kotthoff
added a comment - 10/Jun/08 06:38 Hi Mike,
In my opinion the most important reason for committing the patch is that the current implementation breaks the multi-valued field abstraction. There's no way to assert that tokenizers will always produce tokens suitable for the current implementation. It also makes for a very hard to find bug, because there're no error messages. I just found it by chance. And even if you notice that something is wrong, fixing it is non-trivial and requires quite some knowlegde how Solr does highlighting of multi-valued fields.
So the other option is to add a page to the wiki with a workaround like you've suggested, but I think that's rather going to scare people evaluating Solr for use with CJK text away