Highlighter component should expose snippet character offsets and the score.

Details

Description

The Highlighter Component does not currently expose the snippet character offsets nor the score. There is a TODO in DefaultSolrHighlighter indicating the intention to add this eventually. This information is needed when doing highlighting on external content. The data is there so its pretty easy to output it in some way. The challenge is deciding on the output and its ramifications on backwards compatibility. The current highlighter component response structure doesn't lend itself to adding any new data, unfortunately. I wish the original implementer had some foresight. Unfortunately all the highlighting tests assume this structure. Here is a snippet of the current response structure in Solr's sample data searching for "sdram" for reference:

Perhaps as a little hack, we introduce a pseudo field called text_startCharOffset which is the concatenation of the matching field and "_startCharOffset". This would be an array of ints. Likewise, there would be another array for endCharOffset and score.

if the structure is poor and hard to add additional metadata to which would be beneficial to new users let's change it

As long as there is an option people can turn on to force the legacy behavior there's nothing wrong with that.

In it's simplest form we can just add a new Highlighting Component (with a different class name) that is registered by default as the component "highlight" and document in CHANGES.txt that if people need/want the old one they should modify their solrconfig.xml to register it explicitly .

alternately we can keep using hte existing class, and modify it so that it changes it's behavior based on some init param, ditto previous comments about default behavior and CHANGES.txt

(back compat should be easy on upgrade, but i'd rather tell existing users "add this one line to your config if you really need the exact same response structure instead of this new better structure" then tell new and existing users "this is the really klunky hoop you have to jump through to make sense of all this hot new data we are returning")

Hoss Man
added a comment - 16/Jun/10 23:31 if the structure is poor and hard to add additional metadata to which would be beneficial to new users let's change it
As long as there is an option people can turn on to force the legacy behavior there's nothing wrong with that.
In it's simplest form we can just add a new Highlighting Component (with a different class name) that is registered by default as the component "highlight" and document in CHANGES.txt that if people need/want the old one they should modify their solrconfig.xml to register it explicitly .
alternately we can keep using hte existing class, and modify it so that it changes it's behavior based on some init param, ditto previous comments about default behavior and CHANGES.txt
(back compat should be easy on upgrade, but i'd rather tell existing users "add this one line to your config if you really need the exact same response structure instead of this new better structure" then tell new and existing users "this is the really klunky hoop you have to jump through to make sense of all this hot new data we are returning")

I think a different component is overkill for such a small change. I agree it should be toggled with a parameter.

Since the majority of users don't care about this extra metadata, perhaps the existing structure should be retained for when it is not asked for. Nobody would have to change, not even the tests. No Solr clients would have to care, necessarily. And when it is asked for (a rare need), the structure would then change to accommodate it. This would brake any client logic expecting to find the existing snippet where it usually is because it wouldn't be there. If this is undesirable or unacceptable, then there's the field suffix method that I describe in this issue and which is implemented in the patch. The only danger is that a client should not assume that that all listed fields are arrays of strings since some will be arrays of ints or floats. My patch includes such a modification for SolrJ.

David Smiley
added a comment - 18/Jun/10 03:55 I think a different component is overkill for such a small change. I agree it should be toggled with a parameter.
Since the majority of users don't care about this extra metadata, perhaps the existing structure should be retained for when it is not asked for. Nobody would have to change, not even the tests. No Solr clients would have to care, necessarily. And when it is asked for (a rare need), the structure would then change to accommodate it. This would brake any client logic expecting to find the existing snippet where it usually is because it wouldn't be there. If this is undesirable or unacceptable, then there's the field suffix method that I describe in this issue and which is implemented in the patch. The only danger is that a client should not assume that that all listed fields are arrays of strings since some will be arrays of ints or floats. My patch includes such a modification for SolrJ.

The spellcheck component changes format if you ask for extendedResults. I find that painful. Let's not repeat that here.

One option is to put the additional information into a parallel data structure keyed by uniqueKey, just like the current highlighting data structure. So rather than within it, beside it. Make sense? I know, I know... we've been talking about moving away from these parallel keyed data structures, but it's really not difficult for a client to deal with.

Erik Hatcher
added a comment - 18/Jun/10 10:14 The spellcheck component changes format if you ask for extendedResults. I find that painful. Let's not repeat that here.
One option is to put the additional information into a parallel data structure keyed by uniqueKey, just like the current highlighting data structure. So rather than within it, beside it. Make sense? I know, I know... we've been talking about moving away from these parallel keyed data structures, but it's really not difficult for a client to deal with.

The leading problem is that clients can no longer expect that every child of the parent parent element here (which identifies the document matched) is necessarily an array of strings, because it could be an array of ints or floats. We're also overloading the field name namespace with suffixes thereby preventing this working if a fieldname actually ends with _startPos. This strikes me as no big deal.

David Smiley
added a comment - 18/Jun/10 14:25 Erik, I think you're suggesting what I suggested. So we'd have:
<arr name= "text" >
<str> CORSAIR ValueSelect 1GB 184-Pin DDR &lt;em&gt;SDRAM&lt;/em&gt; Unbuffered DDR 400 (PC 3200) System Memory - Retail </str>
</arr>
<arr name= "text_startPos" >
<int> 5 </int>
</arr>
The leading problem is that clients can no longer expect that every child of the parent parent element here (which identifies the document matched) is necessarily an array of strings, because it could be an array of ints or floats. We're also overloading the field name namespace with suffixes thereby preventing this working if a fieldname actually ends with _startPos. This strikes me as no big deal.

That way the highlighting section remains untouched, with extra stuff in a 'highlighting-extended-info' (let's use a shorter name though) section as a direct child of the root response, just like 'highlighting' is.

The problem with offsets is.... what are the units? utf8 bytes, utf16 units, real characters?

Walter Underwood proposed a good idea of just alternating segments of text for highlighting. That would also avoid the broken style of highlighting we have now (it's ambiguous since a real <em> could be embedded in the text.

Rather than adding more outer containers like 'highlighting-extended-info', we could change what "highlighting" contains based on a parameter (and perhaps even change the default if this is targeted toward trunk).

Yonik Seeley
added a comment - 18/Jun/10 16:29 The problem with offsets is.... what are the units? utf8 bytes, utf16 units, real characters?
Walter Underwood proposed a good idea of just alternating segments of text for highlighting. That would also avoid the broken style of highlighting we have now (it's ambiguous since a real <em> could be embedded in the text.
Rather than adding more outer containers like 'highlighting-extended-info', we could change what "highlighting" contains based on a parameter (and perhaps even change the default if this is targeted toward trunk).

That way the highlighting section remains untouched, with extra stuff in a 'highlighting-extended-info'

that really seems painful – i think it would be a lot better to just come up with what the "new" structure should look like that's more flexible, populated it with more/less data based on what param the user asks for (ie: hl.positions=true) and then make this new structure the default for all future versions of solr. Folks who don't want the new types of metadata, and don't want to change their clients to understand the new structure can add some param to their defaults to revert the format. this is how we've dealt with several other changes in the past where we want the "default" behavior to be differnet for new users, but still support the old behavior for legacy users

(spellcheck.extendedResults may seem painful because it changes results – but that's because it was never intended for you to toggle it on differnet requsts – it's expected that you'll set it once and forget it – the real problem is that it probably should have been made the default)

The problem with offsets is.... what are the units? utf8 bytes, utf16 units, real characters?

1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?
2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values – my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes.

Walter Underwood proposed a good idea of just alternating segments of text for highlighting.

I like that idea, and if structured properly it can still include the "score" for each matching chunk as metadata, but some clients are still going to prefer offset metadata – in particular the situation where i've got a 20MB text file in external storage and i want display the entire document with matches highlighted. returning alternating strings isn't going to really going to help me unless they aren't truncated - at which point you are returning the entire 20MB doc (broken up in a bunch of distinct strings) instead of just returning a bunch of numbers i can use to find the corrisponding points in my local copy of the file.

Hoss Man
added a comment - 18/Jun/10 21:48 That way the highlighting section remains untouched, with extra stuff in a 'highlighting-extended-info'
that really seems painful – i think it would be a lot better to just come up with what the "new" structure should look like that's more flexible, populated it with more/less data based on what param the user asks for (ie: hl.positions=true) and then make this new structure the default for all future versions of solr. Folks who don't want the new types of metadata, and don't want to change their clients to understand the new structure can add some param to their defaults to revert the format. this is how we've dealt with several other changes in the past where we want the "default" behavior to be differnet for new users, but still support the old behavior for legacy users
(spellcheck.extendedResults may seem painful because it changes results – but that's because it was never intended for you to toggle it on differnet requsts – it's expected that you'll set it once and forget it – the real problem is that it probably should have been made the default)
The problem with offsets is.... what are the units? utf8 bytes, utf16 units, real characters?
1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?
2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values – my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes.
Walter Underwood proposed a good idea of just alternating segments of text for highlighting.
I like that idea, and if structured properly it can still include the "score" for each matching chunk as metadata, but some clients are still going to prefer offset metadata – in particular the situation where i've got a 20MB text file in external storage and i want display the entire document with matches highlighted. returning alternating strings isn't going to really going to help me unless they aren't truncated - at which point you are returning the entire 20MB doc (broken up in a bunch of distinct strings) instead of just returning a bunch of numbers i can use to find the corrisponding points in my local copy of the file.

1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?
2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values - my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes.

Nope, a 'character' in java is utf-16, it cannot even hold a full unicode code point.
In other programming languages that might be solr clients, characters and strings might be utf-8, or utf-32.
So if offsets are to be returned, its necessary to specify what 'unit' they are measured in.
Otherwise, an offset is as useless as saying my house is '4' away from yours... 4 what?!

Robert Muir
added a comment - 18/Jun/10 21:57
1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?
2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values - my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes.
Nope, a 'character' in java is utf-16, it cannot even hold a full unicode code point.
In other programming languages that might be solr clients, characters and strings might be utf-8, or utf-32.
So if offsets are to be returned, its necessary to specify what 'unit' they are measured in.
Otherwise, an offset is as useless as saying my house is '4' away from yours... 4 what?!

1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?
2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values - my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes.

On IRC rmuir convinced me that the world is not this simple ... char offsets makes sense when dealing just with the java API, but once you send the data over the wire to a remote client, it might be using any arbitrary charset in memory, so the only possible unit you can use to convey position information that can be ignorant of the remote client is bytes.

Hoss Man
added a comment - 18/Jun/10 22:43
1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?
2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values - my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes.
On IRC rmuir convinced me that the world is not this simple ... char offsets makes sense when dealing just with the java API, but once you send the data over the wire to a remote client, it might be using any arbitrary charset in memory, so the only possible unit you can use to convey position information that can be ignorant of the remote client is bytes.

Robert Muir
added a comment - 18/Jun/10 23:10 so the only possible unit you can use to convey position information that can be ignorant of the remote client is bytes.
I think bytes are no better, for example:
𩬅 is 6-bytes in UTF-16, 4-bytes in UTF-8, 2 chars in java.
in other programming languages maybe its 1 char and 4-bytes, in java its 2 chars and 6 bytes.
So this is why any offset numbers will be problematic, as they will invite bugs in client applications.

David Smiley
added a comment - 21/Jun/10 12:22 Character offsets may not be perfect, but bytes are worse. And without either, there's no feature here to discuss! Hence lets keep it at UTF-16 character offset as a practical matter.

And without either, there's no feature here to discuss! Hence lets keep it at UTF-16 character offset as a practical matter.

I don't necessarily agree. I would like to see, for example, correct php or perl code that consumes these offsets. I argue anyone trying to use utf16 offsets from anything but java will likely have bugs in their client application.

It would be good to consider alternative ways to implement this without using integer offsets (for example, what yonik proposed)

Robert Muir
added a comment - 21/Jun/10 12:42 And without either, there's no feature here to discuss! Hence lets keep it at UTF-16 character offset as a practical matter.
I don't necessarily agree. I would like to see, for example, correct php or perl code that consumes these offsets. I argue anyone trying to use utf16 offsets from anything but java will likely have bugs in their client application.
It would be good to consider alternative ways to implement this without using integer offsets (for example, what yonik proposed)

I know this has been awhile, but I had a need for something like this and while I implemented (and added it to SOLR-1397) I figured I'd try this out instead as well. After applying the patch I got the following response

Jamie Johnson
added a comment - 15/Jul/11 21:10 - edited I know this has been awhile, but I had a need for something like this and while I implemented (and added it to SOLR-1397 ) I figured I'd try this out instead as well. After applying the patch I got the following response
<lst name="highlighting">
<lst name="1">
<arr name="subject_phonetic">
<str><em>Test</em> subject message</str>
</arr>
<arr name="subject_phonetic_startPos"><int>0</int></arr>
<arr name="subject_phonetic_endPos"><int>29</int></arr>
</lst>
</lst>
seems that the startPos is always 0 and endPos is the length of the field including the highlighting start/end tags. Is this expected?