Hi Shoba,
thank you for your response.
Imagine an "indexing server" that can receive three types of command from a client over a socket: insert, search and retrieve.
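To make the three commands concrete, here is a minimal sketch of how such a command line could be split into its parts. It is plain Java with no Lucene involved, and the class name CommandSketch, its fields and the exact command syntax (a comma between key and path for insert) are only assumptions taken from the examples in this mail, not the real server code.

```java
import java.util.Locale;

/** Hypothetical parser for the three commands; all names here are illustrative. */
final class CommandSketch {
    enum Type { INSERT, SEARCH, RETRIEVE }

    final Type type;
    final String key;      // set for INSERT and RETRIEVE, null for SEARCH
    final String argument; // file path for INSERT, query text for SEARCH, null for RETRIEVE

    private CommandSketch(Type type, String key, String argument) {
        this.type = type;
        this.key = key;
        this.argument = argument;
    }

    /** Parses lines such as "insert 123, d:/tempo/test.txt" or "retrieve 123". */
    static CommandSketch parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("missing argument: " + line);
        }
        String verb = trimmed.substring(0, space).toLowerCase(Locale.ROOT);
        String rest = trimmed.substring(space + 1).trim();
        switch (verb) {
            case "insert": {
                // assumed syntax: insert <key>, <path>
                int comma = rest.indexOf(',');
                if (comma < 0) {
                    throw new IllegalArgumentException("expected 'insert <key>, <path>'");
                }
                return new CommandSketch(Type.INSERT,
                        rest.substring(0, comma).trim(),
                        rest.substring(comma + 1).trim());
            }
            case "search":
                // the whole remainder is the query text, e.g. "tra* AND pro*"
                return new CommandSketch(Type.SEARCH, null, rest);
            case "retrieve":
                return new CommandSketch(Type.RETRIEVE, rest, null);
            default:
                throw new IllegalArgumentException("unknown command: " + verb);
        }
    }
}
```

For example, CommandSketch.parse("search tra* AND pro*") gives a SEARCH command whose argument is the query text that is later handed to the QueryParser.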
On receiving an insert command, e.g. "insert 123, d:/tempo/test.txt", the server reads the named file, extracts its content, creates a Lucene Document with three Lucene Fields KEY=123, DOCPATH=d:/tempo/test.txt and CONTENT= and adds the document to the index. KEY is stored and indexed, DOCPATH is only stored and CONTENT is only indexed. Let us suppose that this test.txt file contains the text: "i have to go to Sainte-Foy this afternoon but the urban transports are on strike just today in Lyon, they are protesting against the project of law on pensions and thus they are cassing me the nenette with their betises."
On receiving a search command, e.g. "search tra* AND pro*", the server constructs a Query object:
Analyzer azer = new StandardAnalyzer();
QueryParser qp = new QueryParser("CONTENT", azer);
Query q = qp.parse("tra* AND pro*");
m_LastQuery = q; //server memorises the last data query
and so on. The server responds to the client with an array of KEYs, in this example "123".
On receiving a retrieve command, e.g. "retrieve 123", the server constructs another Query, "KEY:123", retrieves the DOCPATH from the resulting Document, re-reads the file and re-extracts its content (perhaps this is what Tom Dunstan calls "reparsing at runtime" in his mail?) and writes it to a String. Let us suppose that a variable called sText holds the content of the found file, i.e. sText="i have to go...". At this point I would like to apply a tool that builds an "array of positions" giving the positions and the lengths of the tokens of the last memorised query. In this example the last memorised query is "tra* AND pro*", the tokens of this query are "transports" and "protesting", and the "array of positions" must be "57,10;111,10": the first 't' of "transports" is the 57th character in the string "i have to go...", the length of the word "transports" is 10, and so on.
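To show exactly what result is expected, here is a minimal sketch in plain Java that builds the "array of positions" for the example text. It does not use Lucene at all: the class name PositionsSketch and the very simple word pattern are assumptions made only for illustration, and the two terms are taken as given, as named above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: computes "position,length" pairs without Lucene. */
final class PositionsSketch {
    // A "word" is a run of letters or digits; a real analyzer is more subtle.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Returns 1-based "position,length" pairs separated by ';'. */
    static String positions(String text, Set<String> terms) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // terms are expected in lower case, as an analyzer would produce them
            String word = m.group().toLowerCase(Locale.ROOT);
            if (terms.contains(word)) {
                if (out.length() > 0) {
                    out.append(';');
                }
                // m.start() is 0-based, the positions in this mail are 1-based
                out.append(m.start() + 1).append(',').append(m.end() - m.start());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sText = "i have to go to Sainte-Foy this afternoon but the urban "
                + "transports are on strike just today in Lyon, they are protesting "
                + "against the project of law on pensions and thus they are cassing "
                + "me the nenette with their betises.";
        Set<String> terms = new HashSet<>(Arrays.asList("transports", "protesting"));
        System.out.println(positions(sText, terms)); // prints 57,10;111,10
    }
}
```

The point of the sketch is only that each matching token must carry its own start offset; that is the information token.startOffset() is supposed to deliver in the Lucene version.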
To do this, I call the function highlightTerms in de.iqcomputing.lucene.LuceneTools (Maik Schreiber's class, slightly modified):
ArrayList aop = new ArrayList();
LuceneTools.highlightTerms(sText, m_LastQuery, azer, aop);
A schematic excerpt of his function:
public String highlightTerms(
        String p_sText,    // "i have to go..."
        Query p_Query,     // "tra* AND pro*"
        Analyzer p_Azer,
        ArrayList p_aop)   // "array of positions"
{
    ...
    TokenStream stream = null;
    HashSet terms = new HashSet();
    org.apache.lucene.analysis.Token token;
    int startOffset, endOffset;
    // collect the terms of the query into "terms"
    getTerms(p_Query, terms, false);
    StringReader sr = new StringReader(p_sText);
    // tokenize the text with the analyzer passed in
    stream = p_Azer.tokenStream(sr);
    while ((token = stream.next()) != null) {
        startOffset = token.startOffset();
        endOffset = token.endOffset();
        if (terms.contains(token.termText())) {
            // add startOffset to p_aop
            // but it is always 0
        }
    }
    ...
}
The call to "getTerms" collects all the terms that appear in the Query (this is why many Lucene classes have to be modified, in order to access "private" information inside them), i.e. after the call to getTerms the HashSet "terms" holds the two terms of the last Query: "transports" and "protesting". Then a TokenStream is created from p_sText, and so on. The problem is that startOffset is always 0 (zero) and, therefore, the "array of positions" is completely wrong (it contains only zeros). You say you have never encountered a similar problem; do you see why it goes wrong here and why it works correctly in your context? I would be grateful to anybody for any help, because we are in a big mess without this highlighting feature...
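In case the offsets reported by the tokens cannot be trusted, one possible workaround would be to recover them from the token texts alone. The following is only a sketch under assumptions: it is plain Java without Lucene, the class name OffsetRecoverySketch is invented for illustration, and it supposes that the tokens arrive in text order and that each token text appears literally in the lower-cased source (which would not hold with a stemming analyzer).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative workaround: recover offsets by searching forward in the text. */
final class OffsetRecoverySketch {

    /** Returns {start, length} pairs (0-based start) for each token found, in order. */
    static List<int[]> recover(String text, List<String> tokenTexts) {
        // assumes lower-casing does not change the length of the text
        String lower = text.toLowerCase(Locale.ROOT);
        List<int[]> result = new ArrayList<>();
        int cursor = 0; // never search before the end of the previous token
        for (String token : tokenTexts) {
            if (token.isEmpty()) {
                continue;
            }
            int at = findWord(lower, token, cursor);
            if (at < 0) {
                continue; // token text not found literally: skip it
            }
            result.add(new int[] { at, token.length() });
            cursor = at + token.length();
        }
        return result;
    }

    /** Finds token as a whole word (not inside a longer word) at or after from. */
    private static int findWord(String lower, String token, int from) {
        int at = lower.indexOf(token, from);
        while (at >= 0) {
            int end = at + token.length();
            boolean startOk = at == 0 || !Character.isLetterOrDigit(lower.charAt(at - 1));
            boolean endOk = end == lower.length() || !Character.isLetterOrDigit(lower.charAt(end));
            if (startOk && endOk) {
                return at;
            }
            at = lower.indexOf(token, at + 1);
        }
        return -1;
    }
}
```

With the example text and the token texts "urban", "transports", "strike", "protesting" in that order, the recovered 0-based starts would be 50, 56, 74 and 110, so adding 1 gives the positions wanted for the "array of positions". This does not explain why startOffset is 0, it only avoids depending on it.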
Arpad KATONA
--
a.katona@ever-team.com
PS: sorry for the poor English, I hope it is comprehensible anyway...
-----Original Message-----
From: Shoba Ramachandran [mailto:shoba_duruvan@yahoo.com]
Sent: Monday, 12 May 2003 20:59
To: Lucene Users List
Subject: Re: Highlighting, startOffset, endOffset
I'm using this termhighlighter and have never had any
problem. Could you elaborate on how you are using
it?
-Shoba