lucene-java-user mailing list archives

On Jun 29, 2009, at 3:10 PM, Amir Hossein Jadidinejad wrote:
> Hi,
> It's my first experiment with Lucene. Please help me.
> I'm going to index a set of documents and create a feature vector
> for each of them. This vector contains all terms belonging to the
> document, weighted using TF-IDF.
> After that I want to compute the cosine similarity between all
> documents and produce a doc-doc similarity matrix. My document set
> is large and it's important to have a scalable implementation.
See Mahout (http://lucene.apache.org/mahout). In the utils module there
is a class called LuceneIterable that the o.a.mahout.utils.vectors.Driver
program can use to convert a Lucene index into a Mahout Vector
representation, which can then be used to create a doc-doc similarity
matrix. Mahout runs on Hadoop, so you can go as big as you want.
See http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
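
To make the computation itself concrete, here is a minimal in-memory
sketch of TF-IDF weighting followed by pairwise cosine similarity. It is
plain Java and does not use the Lucene or Mahout APIs. The class name
TfIdfCosineSketch, its method names, and the IDF variant ln(N/df) + 1 are
assumptions made for illustration, and documents are assumed to be
tokenized already. It is quadratic in the number of documents, so treat it
as a way to check the math on a small sample rather than as the scalable
route described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of Lucene or Mahout.
class TfIdfCosineSketch {

    // Build one sparse TF-IDF vector (term -> weight) per tokenized document.
    static List<Map<String, Double>> tfidf(List<List<String>> docs) {
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> vec = new HashMap<>();
            for (String term : doc) {
                vec.merge(term, 1.0, Double::sum); // raw term frequency
            }
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                // Assumed IDF variant: ln(N / df) + 1.
                double idf = Math.log((double) n / df.get(e.getKey())) + 1.0;
                e.setValue(e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity; returns 0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Symmetric doc-doc matrix; only the upper triangle is computed.
    static double[][] similarityMatrix(List<Map<String, Double>> vectors) {
        int n = vectors.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors.get(i), vectors.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "search"),
                Arrays.asList("lucene", "search", "query"),
                Arrays.asList("hadoop", "cluster"));
        for (double[] row : similarityMatrix(tfidf(docs))) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Sparse maps keep each vector proportional to the number of distinct terms
in the document, not to the vocabulary size. For a large collection the
pairwise loop is the bottleneck, which is the part a Hadoop-based job is
meant to distribute.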
-Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search