Activity


Chris Harris added a comment - 26/Jan/09 22:26

As I mentioned on the Solr list, I've discovered similar problems when highlighting with the ShingleFilter. (ShingleFilter does n-gram processing on Tokens, whereas NGramAnalyzer does n-gram processing on characters.) Here's a variation on Koji's demo program that exhibits some problems with ShingleFilter, as well as offering a slightly more textured example of how things work with NGramAnalyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import java.io.Reader;

public class Main {
    public static void main(String[] args) throws Exception {
        testAnalyzer(new BigramShingleAnalyzer(true), "Bigram shingle analyzer (bigrams and unigrams)");
        testAnalyzer(new NGramAnalyzer(), "Bigram (non-shingle) analyzer (bigrams only)");
    }

    static void testAnalyzer(Analyzer analyzer, String analyzerDescription) throws Exception {
        System.out.println("Testing analyzer " + analyzerDescription + "...");
        System.out.println("---------------------------------");
        test(analyzer, "Lucene can index and can search", "Lucene");
        test(analyzer, "Lucene can make an index", "can");
        test(analyzer, "Lucene can index and can search", "can");
        test(analyzer, "Lucene can index can search and can highlight", "can");
        test(analyzer, "Lucene can index can search and can highlight", "+index +search");
        System.out.println();
    }

    static void test(Analyzer analyzer, String text, String queryStr) throws Exception {
        QueryParser parser = new QueryParser("f", analyzer);
        Query query = parser.parse(queryStr);
        QueryScorer scorer = new QueryScorer(query, "f");
        Highlighter h = new Highlighter(scorer);
        h.setTextFragmenter(new NullFragmenter()); // We're not testing the fragmenter here.
        System.out.println(h.getBestFragment(analyzer, "f", text) + " [query='" + queryStr + "']");
    }

    static class NGramAnalyzer extends Analyzer {
        public TokenStream tokenStream(String field, Reader input) {
            return new NGramTokenizer(input, 2, 2);
        }
    }

    static class BigramShingleAnalyzer extends Analyzer {
        boolean outputUnigrams;

        public BigramShingleAnalyzer(boolean outputUnigrams) {
            this.outputUnigrams = outputUnigrams;
        }

        public TokenStream tokenStream(String field, Reader input) {
            ShingleFilter sf = new ShingleFilter(new WhitespaceTokenizer(input));
            sf.setOutputUnigrams(outputUnigrams);
            return sf;
        }
    }
}
Here's the current output, with commentary:
Testing analyzer Bigram shingle analyzer (bigrams and unigrams)...
---------------------------------
// works ok:
<B>Lucene</B> can index and can search [query='Lucene']
// works ok:
Lucene <B>can</B> make an index [query='can']
// same as Koji's example:
Lucene <B>can index and can</B> search [query='can']
// if there are three matches, they all get bundled into a single highlight:
Lucene <B>can index can search and can</B> highlight [query='can']
// it doesn't have to be the same search term that matches:
Lucene can <B>index can search</B> and can highlight [query='+index +search']
Testing analyzer Bigram (non-shingle) analyzer (bigrams only)...
---------------------------------
// works ok:
<B>Lucene</B> can index and can search [query='Lucene']
// is 'an' being treated as a match for 'can'(?):
Lucene <B>can make an</B> index [query='can']
// same as Koji's example:
Lucene <B>can index and can</B> search [query='can']
// if there are three matches, they all get bundled into a single highlight:
Lucene <B>can index can search and can</B> highlight [query='can']
// not sure what's happening here:
Lucene can <B>index can search and</B> can highlight [query='+index +search']
I'm interested in what others think, but to me it makes sense to classify both of these as the same issue. From a high-level perspective, the problem in each case seems to be that Highlighter.getBestTextFragments(TokenStream tokenStream, String text, boolean mergeContiguousFragments, int maxNumFragments) relies on a TokenGroup abstraction that doesn't really work for the n-gram or the bigram shingle case:
A TokenGroup is supposed to represent "one, or several overlapping tokens, along with the score(s) and the scope of the original text". (I assume TokenGroup was introduced to deal with synonym filter expansions.) Tokens are judged distinct, rather than overlapping, essentially by checking whether tokenB.startOffset() >= tokenA.endOffset(). (It's slightly more complex than this, but that's approximately what the test in TokenGroup.isDistinct() amounts to.) With the two analyzers under discussion, that criterion basically means that each token "overlaps" with the next.
In Koji's bigram case, consider how "dogs" would get tokenized:
"do" (startOffset=0, endOffset=2)
"og" (startOffset=1, endOffset=3)
"gs" (startOffset=2, endOffset=4)
Or in my shingle case, consider how "I love Lucene" would get tokenized:
"I" (startOffset=0, endOffset=1)
"I love" (startOffset=0, endOffset=6)
"love" (startOffset=2, endOffset=6)
"love Lucene" (startOffset=2, endOffset=13)
"Lucene" (startOffset=7, endOffset=13)
In both cases, you never have a token whose startOffset is >= the preceding token's endOffset. So all these tokens are part of the same TokenGroup. That should mean these tokens all "overlap", but that would make for a rather mysterious notion of "overlapping".
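Both the offsets above and the isDistinct()-style comparison can be reproduced without Lucene. Here is a minimal, self-contained sketch (plain Java; the helper names and the simplified distinct test are assumptions, not Lucene's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenGroupSketch {
    // A token is just (term, startOffset, endOffset) for this sketch.
    record Tok(String term, int start, int end) {}

    // Character bigrams, as NGramTokenizer(input, 2, 2) would emit them.
    static List<Tok> charBigrams(String text) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            out.add(new Tok(text.substring(i, i + 2), i, i + 2));
        }
        return out;
    }

    // Unigrams plus word bigrams, as ShingleFilter over a WhitespaceTokenizer
    // with setOutputUnigrams(true) would emit them.
    static List<Tok> shingles(String text) {
        String[] words = text.split(" ");
        int[] starts = new int[words.length];
        int pos = 0;
        for (int i = 0; i < words.length; i++) {
            starts[i] = text.indexOf(words[i], pos);
            pos = starts[i] + words[i].length();
        }
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Tok(words[i], starts[i], starts[i] + words[i].length()));
            if (i + 1 < words.length) {
                out.add(new Tok(words[i] + " " + words[i + 1],
                        starts[i], starts[i + 1] + words[i + 1].length()));
            }
        }
        return out;
    }

    // Simplified form of the TokenGroup.isDistinct() test: the next token is
    // distinct only if it starts at or after the end of the previous one.
    static boolean isDistinct(Tok prev, Tok next) {
        return next.start() >= prev.end();
    }

    public static void main(String[] args) {
        for (List<Tok> toks : List.of(charBigrams("dogs"), shingles("I love Lucene"))) {
            for (int i = 0; i < toks.size(); i++) {
                Tok t = toks.get(i);
                boolean distinct = i > 0 && isDistinct(toks.get(i - 1), t);
                System.out.println("\"" + t.term() + "\" (" + t.start() + "," + t.end()
                        + ") distinct from previous: " + distinct);
            }
        }
        // Every "distinct" value after the first token prints false, so all
        // tokens of each stream would collapse into a single TokenGroup.
    }
}
```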


Mark Harwood added a comment - 27/Jan/09 11:56

It looks to me like this could be fixed in the "Formatter" classes when marking up the output string.
Currently, classes such as SimpleHTMLFormatter, in their "highlightTerm" method, put a tag around the whole section of text if it contains a hit, i.e.:
SimpleHTMLFormatter.java
public String highlightTerm(String originalText, TokenGroup tokenGroup)
{
    StringBuffer returnBuffer;
    if (tokenGroup.getTotalScore() > 0)
    {
        returnBuffer = new StringBuffer();
        returnBuffer.append(preTag);
        returnBuffer.append(originalText);
        returnBuffer.append(postTag);
        return returnBuffer.toString();
    }
    return originalText;
}
The TokenGroup object passed to this method contains all of the tokens and their scores so it should be possible to use this information to deconstruct the originalText parameter and inject markup according to which tokens in the group had a match rather than putting a tag around the whole block. Some complexity may lie in handling token streams that produce tokens that "rewind" to earlier offsets.
SimpleHtmlFormatter suddenly seems less simple!
TokenStreams that produce entirely overlapping streams of tokens will automatically be broken into multiple TokenGroups because TokenGroup has a maximum number of linked Tokens it will ever hold in a single group.
I haven't got the time to fix this right now but if someone has a burning need to leap in, the above seems like what may be required.
Cheers
Mark
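Mark's per-token markup idea could be sketched in isolation like this (plain Java; the parallel offset/score arrays are hypothetical stand-ins for TokenGroup's internals, not Lucene's actual API):

```java
public class PerTokenFormatter {
    // Sketch: wrap only the scoring tokens inside the group's text span.
    // starts/ends are offsets relative to originalText; scores[i] > 0 marks a hit.
    static String highlight(String originalText, int[] starts, int[] ends,
                            float[] scores, String preTag, String postTag) {
        StringBuilder sb = new StringBuilder();
        int cursor = 0;
        for (int i = 0; i < starts.length; i++) {
            if (scores[i] <= 0 || starts[i] < cursor) {
                continue; // skip non-matches and tokens that "rewind" behind the cursor
            }
            sb.append(originalText, cursor, starts[i]);
            sb.append(preTag).append(originalText, starts[i], ends[i]).append(postTag);
            cursor = ends[i];
        }
        sb.append(originalText.substring(cursor));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Group text "can index and can"; tokens "can"(0,3), "index"(4,9),
        // "and"(10,13), "can"(14,17); only the two "can" tokens scored.
        String out = highlight("can index and can",
                new int[]{0, 4, 10, 14}, new int[]{3, 9, 13, 17},
                new float[]{1f, 0f, 0f, 1f}, "<B>", "</B>");
        System.out.println(out); // prints "<B>can</B> index and <B>can</B>"
    }
}
```

The "rewind" guard (starts[i] < cursor) is where, as Mark notes, most of the real complexity would live for streams whose tokens jump back to earlier offsets.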


David Bowen added a comment - 02/Oct/09 00:53

Here's a patch to Highlighter.java that fixes the examples. The basic idea is to throw away (or ignore) overlapping tokens when they don't have a score, so that a token group doesn't get expanded beyond a sequence of tokens that should be highlighted.
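That grouping idea can be illustrated outside of Lucene; the following is a rough sketch of my reading of the patch's approach (plain Java, not the actual Highlighter code):

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingSketch {
    // tokens[i] = {startOffset, endOffset}; scores[i] > 0 marks a query hit.
    // Returns the [start, end) span of each resulting group. The key change,
    // mirroring the patch's idea: a zero-score token that overlaps the current
    // group is ignored rather than allowed to extend the group's span.
    static List<int[]> group(int[][] tokens, float[] scores) {
        List<int[]> groups = new ArrayList<>();
        int gStart = -1, gEnd = -1;
        for (int i = 0; i < tokens.length; i++) {
            boolean overlaps = gStart >= 0 && tokens[i][0] < gEnd;
            if (overlaps && scores[i] <= 0) {
                continue; // throw away unscored overlapping tokens
            }
            if (overlaps) {
                gEnd = Math.max(gEnd, tokens[i][1]);
            } else {
                if (gStart >= 0) groups.add(new int[] {gStart, gEnd});
                gStart = tokens[i][0];
                gEnd = tokens[i][1];
            }
        }
        if (gStart >= 0) groups.add(new int[] {gStart, gEnd});
        return groups;
    }

    public static void main(String[] args) {
        // Shingle tokens for "Lucene can make an index" with query 'can':
        // only the unigram "can" at (7,10) scores.
        int[][] toks = {{0,6},{0,10},{7,10},{7,15},{11,15},{11,18},{16,18},{16,24},{19,24}};
        float[] scores = {0, 0, 1, 0, 0, 0, 0, 0, 0};
        for (int[] g : group(toks, scores)) {
            System.out.println("group span [" + g[0] + ", " + g[1] + ")");
        }
        // The scored group spans exactly [7, 10) -- the word "can" -- instead
        // of being dragged wider by the overlapping shingles.
    }
}
```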


David Bowen added a comment - 02/Oct/09 01:01

Mark, I tried the approach you suggested of using the Formatter interface. I found it didn't work because the Formatter did not have a way to map the tokens in the token group into the text. This could be fixed by providing a public accessor function for TokenGroup's matchStartOffset field. However, it seems convoluted to go to the trouble of constructing a TokenGroup only to have every Formatter take it all apart again to find the places within it that need highlighting. It seems to me that the purpose of a TokenGroup is to identify (up to) one span of characters that needs to be highlighted.

Koji Sekiguchi added a comment - 24/May/12 01:12

This problem was my motivation for creating FastVectorHighlighter. Since FVH was committed, I haven't tried the combination of CJKTokenizer + Highlighter, so I don't know the latest situation.
I'd like to close this as Won't Fix.