7 Answers

During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.
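As a rough sketch of that calculation, assuming the term frequencies and document frequencies have already been copied out of the index into plain maps (the class name `TfIdfCosine`, the method names, and the smoothed IDF formula are illustrative choices, not part of the Lucene API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: works on plain maps, not on Lucene objects.
final class TfIdfCosine {

    // Turn raw term frequencies into TF-IDF weights.
    // docFreq maps term -> number of documents containing it; numDocs is the index size.
    public static Map<String, Double> weigh(Map<String, Integer> termFreq,
                                            Map<String, Integer> docFreq,
                                            int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            // Assumed IDF variant: 1 + ln(N / (df + 1)); swap in whatever weighting you prefer.
            double idf = 1.0 + Math.log((double) numDocs / (df + 1));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Cosine similarity between two sparse weight vectors.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> docA = new HashMap<>();
        docA.put("lucene", 3);
        docA.put("index", 1);
        Map<String, Integer> docB = new HashMap<>();
        docB.put("lucene", 1);
        docB.put("cosine", 2);
        Map<String, Integer> df = new HashMap<>();
        df.put("lucene", 2);
        df.put("index", 5);
        df.put("cosine", 1);
        System.out.println(cosine(weigh(docA, df, 10), weigh(docB, df, 10)));
    }
}
```

In a real setup you would fill the two term-frequency maps from the term vectors and the `docFreq` map from the reader before calling `weigh` and `cosine`.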

An easier way might be to submit doc A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for doc B in the result set.

Yes, OK for the first part: I already use the term frequency vectors to get what I want, but I wanted to check how much faster it would be to get the similarity from Lucene itself. As for the second part of your answer, I checked the Javadoc and there is no obvious way to get a similarity score. I can look for doc B in the result set, but the only thing I can get is its position in the TopDocs, not the exact similarity score between the two document vectors, which is what I want.
– maiky Dec 9 '09 at 15:40

@tiendv How did you get Sujit Pal's documents? His web page does not seem to link to their contents; it only lists their titles. If you used just the document titles, you will get a big difference, because those titles are very different.
– Mark Butler Jul 26 '13 at 0:11

Yes, I see. I have checked this: Sujit Pal's result is not correct.
– tiendv Jul 26 '13 at 1:48

Thanks for your feedback, but as I understand it, you do not need TF-IDF to calculate cosine similarity. You could calculate a similarity metric using TF-IDF if you want, but that was not the aim of the code above. Specifically, I am using the algorithm above to test how well some automatic extraction code works against human-generated answers on a per-document basis. TF-IDF would not help in that case, which is why I did not use it.
– Mark Butler Jun 8 '13 at 2:13

Also, I am happy to work with you on optimising your code, and I can see a few basic things you could do, but it would be better to post it as a new question, since this one did not mention TF-IDF. You could always cite this question.
– Mark Butler Jun 8 '13 at 4:49