How do I use this result to find the document most similar to the search query? Basically I am trying to re-create a search bar for Wikipedia: based on a search query, I want to return the most relevant articles from Wikipedia. In this scenario, there are 6 articles (rows) and the search query contains 3 words (columns).

Do I add up all the results in the columns or add up all the rows? Is the greater value the most relevant or is the lowest value the most relevant?

1 Answer

Are you familiar with cosine similarity? For each article (vector A) compute its similarity to the query (vector B). Then rank in descending order and choose the top result. If you're willing to refactor, the gensim library is excellent.
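To make the ranking step concrete, here is a minimal sketch in plain Python. It computes one score per article (per row, not per column) and sorts so that the highest score comes first. The `docs` matrix and the function names are made up for illustration; they are not the values from the question.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_a * norm_b)


def rank_articles(doc_vectors, query_vector):
    """Return (row_index, score) pairs, most similar first."""
    scores = [
        (i, cosine_similarity(doc, query_vector))
        for i, doc in enumerate(doc_vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights for 3 articles x 3 query terms (illustration only).
docs = [
    [0.0, 1.0, 0.0],
    [0.6, 0.0, 0.8],
    [1.0, 1.0, 1.0],
]
query = [1.0, 1.0, 1.0]

ranking = rank_articles(docs, query)
best_index, best_score = ranking[0]
```

With these made-up numbers the third row ranks first, because it points in the same direction as the query. A higher score means a more relevant article.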

If you're using tf-idf as your weighting scheme, you still just need to normalize your query. Your matrix contains three terms, all of which appear in the query, so the raw frequency vector of the query is (1,1,1). Its length is sqrt((1^2)+(1^2)+(1^2)) = 1.73, and 1/1.73 = 0.58, so your normalized query vector is (0.58,0.58,0.58). Now you can treat the query as another document. Provided the document vectors are also unit length (the first row nearly is: .85^2 + .52^2 = 0.99), the cosine similarity of the query vector and a document vector is just their dot product. For the first article: ((.58*.85)+(.58*0)+(.58*.52)) = 0.79. Repeat this for all articles and the highest score wins.
–
verbsintransit, Aug 8 '12 at 19:51
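The arithmetic in the comment above can be checked with a few lines of Python. This is only a sketch: `l2_normalize` is a hypothetical helper name, and the only row used is the first article (0.85, 0, 0.52) quoted in the comment. It assumes that row is already (approximately) unit length, so the dot product stands in for the cosine similarity.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


# Raw query frequencies: each of the three terms appears once.
query = l2_normalize([1, 1, 1])

# First article row as quoted in the comment above.
first_article = [0.85, 0.0, 0.52]

# Dot product of the normalized query with the article row.
score = sum(q * d for q, d in zip(query, first_article))
print(round(score, 2))  # prints 0.79
```

Each query component comes out at roughly 0.577, and the score for the first article is about 0.79.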

So I do not have to train a classifier of some sort?
–
tabchas, Aug 8 '12 at 19:53

If I understand your tutorial link correctly, not at this point. I recommend reviewing section 6.2 onwards in link to first understand tf-idf, etc., and then applying it to machine learning topics. I'm not sure if you're learning both information retrieval and machine learning at once.
–
verbsintransit, Aug 8 '12 at 20:31

No code of mine off hand. But seriously, check out that gensim library. Look at the tutorials and the source code; you'll probably find what you're looking for.
–
verbsintransit, Aug 8 '12 at 20:44