I want to compare two documents in the index (i.e. retrieve the cosine
similarity/score between the two documents' term vectors). Is this possible
using the standard Ferret functionality?
Thanks in advance,
Jeroen B.

On 5/27/06, Jeroen B. <removed_email_address@domain.invalid> wrote:
> I want to compare two documents in the index (i.e. retrieve the cosine
> similarity/score between two documents term-vector's). Is this possible
> using the standard Ferret functionality?
Hi Jeroen,
No problem. Make sure you store term-vectors when you add the field.
That is;
    doc.add_field(:field, "yada yada yada",
                  Field::Store::NO,          # or YES
                  Field::Index::TOKENIZED,   # or UNTOKENIZED
                  Field::TermVector::YES)    # or anything else but NO
Then you can retrieve the term vector from an index reader like so;
    term_vector = index_reader.get_term_vector(doc_num, :field)
    terms = term_vector.terms   # array of terms in :field in document
    freqs = term_vector.freqs   # array of corresponding frequencies
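The two arrays line up index by index, so pairing them gives a term => frequency lookup. A minimal sketch, using plain arrays as stand-ins for what term_vector.terms and term_vector.freqs return (the helper name term_frequencies and the sample values are made up for illustration and are not part of Ferret):

```ruby
# Pair parallel arrays of terms and frequencies into a Hash.
# Terms that are absent look up as 0, which is handy when comparing documents.
def term_frequencies(terms, freqs)
  table = Hash.new(0)
  terms.zip(freqs) { |term, freq| table[term] = freq }
  table
end

# Stand-in data; with Ferret these would come from the term vector.
terms = ["ferret", "index", "ruby"]
freqs = [3, 1, 2]

tf = term_frequencies(terms, freqs)
puts tf["ferret"]   # frequency of a term that is in the document
puts tf["lucene"]   # a term that is absent from the document
```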
Hope that helps. Is that enough to get you going?
Cheers,
Dave

On 5/27/06, Jeroen B. <removed_email_address@domain.invalid> wrote:
> I Index about 23000 weblogs with their weblog id as the document id and
> the content by termvector. Now I want to compare two weblogs. So what
> you suggest is that I retrieve the term-vectors for both documents and
> calculate the dotproduct of the two vectors myself; or is there a nice
> Ferret-way to do this?
Until now I haven't really used the TermVectors so this probably isn't
the best way to do it but here goes (this is very rough);
    def cosine_similarity(index_reader, doc1, doc2)
      # Map each term to a pair: [frequency in doc1, frequency in doc2].
      tv1 = index_reader.get_term_vector(doc1, :data)
      terms1 = tv1.terms
      freqs1 = tv1.freqs
      matrix = {}
      terms1.size.times {|i| matrix[terms1[i]] = [freqs1[i], 0]}
      tv2 = index_reader.get_term_vector(doc2, :data)
      terms2 = tv2.terms
      freqs2 = tv2.freqs
      terms2.size.times {|i| (matrix[terms2[i]] ||= [0, 0])[1] = freqs2[i]}
      # Terms missing from either document contribute 0 to the dot product.
      dot_product = matrix.values.inject(0) {|dp, (a, b)| dp + a * b}
      lengths_product = Math.sqrt(freqs1.inject(0) {|sp, f| sp + f * f} *
                                  freqs2.inject(0) {|sp, f| sp + f * f})
      return 0.0 if lengths_product.zero?  # a document with no terms
      dot_product / lengths_product
    end
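To check the arithmetic without an index, the same calculation can be run on plain hashes of term => frequency. This is only a sketch: cosine_from_counts is a hypothetical helper, not part of Ferret, and it uses raw term frequencies with no idf weighting, so its result is not the same thing as a search relevance score.

```ruby
# Cosine similarity between two term => frequency hashes.
# Returns 0.0 when either document has no terms.
def cosine_from_counts(counts1, counts2)
  # Only terms present in both documents contribute to the dot product.
  dot_product = counts1.sum { |term, freq| freq * counts2.fetch(term, 0) }
  norm1 = Math.sqrt(counts1.values.sum { |f| f * f })
  norm2 = Math.sqrt(counts2.values.sum { |f| f * f })
  return 0.0 if norm1.zero? || norm2.zero?
  dot_product / (norm1 * norm2)
end

# Made-up sample documents.
a = { "ruby" => 2, "ferret" => 1 }
b = { "ruby" => 1, "lucene" => 1 }
puts cosine_from_counts(a, a)   # identical documents, close to 1.0
puts cosine_from_counts(a, b)   # partial overlap, between 0 and 1
```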
I'd be interested to hear how you go with this. If performance is poor
I can add something like this to the C code.
Hope this helps,
Dave

David B. wrote:
> Until now I haven't really used the TermVectors so this probably isn't
> the best way to do it but here goes (this is very rough);
I'm going to try this out now. I'll also try extracting all terms from
doc1's term vector and using them as a query against doc2 (using a
BooleanQuery). They use this kind of method in "Lucene in Action"
(somewhere around page 190 if I recall correctly).
Thanks for your quick responses; I'll let you know how things work out.
Cheers,
Jeroen B.

On 5/28/06, Jeroen B. <removed_email_address@domain.invalid> wrote:
> David B. wrote:
> > Until now I haven't really used the TermVectors so this probably isn't
> > the best way to do it but here goes (this is very rough);
>
> I'm going to try this out now. I'll also try extracting all term vectors
> from doc1 and using them as a query on doc2 (using a BooleanQuery). They
> use this kind of method in "Lucene in Action" (somewhere around page 190
> if I recall correctly).
If it's a "More Like This" query that you are trying to write, I
recommend you look at the Lucene code here;
http://svn.apache.org/viewvc/lucene/java/branches/...
It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
the near future.
Cheers,
Dave

Yes, it is a "More Like This" query, but I only want the relevance score
for document B given document A as the query (so weblog:B AND
all_terms_from_A).
I'll look into it; thesis is due in 4 weeks so I've got loads of time :D
Cheers,
Jeroen B.

On Sun, May 28, 2006 at 07:36:25AM +0900, David B. wrote:
> If it's a "More Like This" query that you are trying to write, I
> recommend you look at the Lucene code here;
>
> http://svn.apache.org/viewvc/lucene/java/branches/...
Or check out the port of this that lives in acts_as_ferret :-)
http://projects.jkraemer.net/acts_as_ferret/browse...
from line 525 to around line 720.
> It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
> the near future.
Dave, that's a nice idea. Should I try to prepare a patch for this based
on what I did in acts_as_ferret? It would be Ruby-only, though. But as the
whole "more like this" thing is more or less about building a BooleanQuery,
I think speed is no issue here.
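The term-selection half of that idea can be sketched in plain Ruby: rank the source document's terms by a tf-idf style weight and keep the top few as clauses for the query. This is a rough sketch, not the acts_as_ferret or Lucene code. The helper name interesting_terms, the weighting formula, and the sample numbers are all assumptions, and turning the result into an actual BooleanQuery is left to whatever the Ferret version at hand provides.

```ruby
# Pick the highest-weighted terms of a document for a "more like this" query.
#   term_freqs - term => frequency in the source document
#   doc_freqs  - term => number of indexed documents containing the term
#   num_docs   - total number of documents in the index
def interesting_terms(term_freqs, doc_freqs, num_docs, max_terms = 25)
  weighted = term_freqs.map do |term, tf|
    df = doc_freqs.fetch(term, 0)
    idf = Math.log(num_docs / (df + 1.0)) + 1.0  # rarer terms weigh more
    [term, tf * idf]
  end
  # Highest weight first; ties broken alphabetically for stable output.
  weighted.sort_by { |term, weight| [-weight, term] }.first(max_terms).map(&:first)
end

# Made-up sample statistics for a single weblog.
term_freqs = { "the" => 6, "ferret" => 3, "weblog" => 2 }
doc_freqs  = { "the" => 23000, "ferret" => 40, "weblog" => 900 }
puts interesting_terms(term_freqs, doc_freqs, 23000, 2).inspect
```

Capping the number of terms keeps the resulting query small, and very common words drop out because their idf factor stays close to 1.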
Jens
--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer
removed_email_address@domain.invalid
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

On 5/29/06, Jens K. <removed_email_address@domain.invalid> wrote:
> <snip>
>
> On Sun, May 28, 2006 at 07:36:25AM +0900, David B. wrote:
> > It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
> > the near future.
>
> Dave, that's a nice idea. Should I try to prepare a patch for this based
> on what I did in acts_as_ferret ? Would be ruby-only, though. But as the
> whole more like this thing more or less is about building a BooleanQuery,
> I think speed is no issue here.
Hi Jens,
That'd be great but not just yet. I may be making a few adjustments to
the API in the coming week. I'll be sure to discuss possible changes
with you guys when the time comes.
Gotta run. Cheers,
Dave