A collection of sloppy snippets for scientific computing and data visualization in Python.

Friday, May 20, 2011

Latent Semantic Analysis with Term-Document matrix

This example is inspired by the second paragraph of the paper Matrices, vector spaces, and information retrieval. It shows how the documents in a collection can be represented in a vector space, along with the query algorithm used to find relevant documents. The example implements the model and the query matching algorithm using the linear algebra module provided by numpy. The program is tested on the sample data in Figure 2 of the paper.
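To give an idea of the approach, here is a minimal sketch of a term-document matrix with cosine query matching. The terms, documents, the 0.5 cutoff and the function names are all made up for illustration; this is not the sample data from Figure 2 of the paper, and it is only one plausible way to write it.

```python
import numpy as np

# Hypothetical toy collection (NOT the data from the paper).
terms = ["bread", "yeast", "flour", "engine", "oil"]
docs = [
    "bread yeast flour",
    "flour bread bread",
    "engine oil",
    "oil engine engine",
]

def count_vector(text, terms):
    """Count how often each indexed term occurs in a piece of text."""
    v = np.zeros(len(terms))
    for word in text.split():
        if word in terms:
            v[terms.index(word)] += 1
    return v

def term_document_matrix(docs, terms):
    """Build the matrix A: one row per term, one column per document."""
    return np.column_stack([count_vector(d, terms) for d in docs])

def cosine_scores(A, q):
    """Cosine of the angle between the query q and each column of A."""
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(q)
    scores = np.zeros(A.shape[1])
    ok = denom > 0  # skip empty documents or an empty query
    scores[ok] = (A.T @ q)[ok] / denom[ok]
    return scores

A = term_document_matrix(docs, terms)
q = count_vector("bread flour", terms)
scores = cosine_scores(A, q)

# Assumed cutoff: a document is "relevant" if its cosine is at least 0.5.
relevant = [j for j, s in enumerate(scores) if s >= 0.5]
print(scores)
print(relevant)
```

With this toy data only the two baking documents share terms with the query, so they are the ones returned. The cutoff is a tuning knob: a lower value returns more documents, a higher one returns fewer.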

5 comments:

The standard library's difflib module has a SequenceMatcher object that can be used for comparisons like this. I used it to index websites at the company I work for and to look for changes greater than a certain ratio, which indicated that a website had been broken, but you could also use it on smaller chunks of text to search for relevant documents.
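A rough sketch of that idea, assuming plain strings as input. The helper names and the 0.9 threshold are hypothetical, not taken from the commenter's actual setup; only SequenceMatcher and its ratio() method come from difflib.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

def has_changed(old, new, threshold=0.9):
    """Flag a page as changed when its similarity drops below the
    (hypothetical) threshold."""
    return similarity(old, new) < threshold

def rank_by_similarity(query, docs):
    """Order documents from most to least similar to the query string."""
    return sorted(docs, key=lambda d: similarity(query, d), reverse=True)

print(has_changed("hello world", "hello world"))
print(rank_by_similarity("apple pie", ["engine oil", "apple pie recipe"]))
```

Note that this compares character sequences rather than term counts, so it measures how alike two texts look, not whether they share vocabulary the way the vector space model does.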