NoDaLiDa 2003: Leif Grönqvist (Friday 30 May 2003)

Slide 2: What is Latent Semantic Indexing?
LSI uses a kind of vector model. The classical IR vector model groups documents that have many terms in common. But:
– Documents can have very similar content while using different vocabularies
– The terms used in a document may not be the most representative ones
LSI uses the distribution of all terms across all documents when comparing two documents!

Slide 3: A traditional vector model for IR
The starting point is a term-document matrix, both for the traditional vector model and for LSI. We can calculate similarities between terms or documents using the cosine measure, and we can also (trivially) find relevant terms for a document.
Problems:
– The term "trees" seems relevant to the m-documents, but is not present in m4
– cos(c1, c5) = 0, just as cos(c1, m3) = 0
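A minimal sketch of the cosine computation; the talk's actual toy matrix is not reproduced in these slides, so the counts and term names below are invented:

```python
import numpy as np

# Tiny invented term-document matrix: rows are terms, columns are
# documents. The talk's actual toy matrix is not reproduced here.
X = np.array([
    [1.0, 0.0, 1.0],   # term "human"
    [1.0, 1.0, 0.0],   # term "interface"
    [0.0, 1.0, 1.0],   # term "computer"
])

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(X[:, 0], X[:, 1]))   # similarity of documents 1 and 2
print(cosine(X[0], X[1]))         # similarity of terms 1 and 2
```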

Slide 5: How does LSI work?
The idea is to try to use latent information such as:
– word1 and word2 are often found together, so maybe doc1 (containing word1) and doc2 (containing word2) are related?
– doc3 and doc4 have many words in common, so maybe the words they don't have in common are related?

Slide 6: How does LSI work? (cont'd)
In the classical vector model, a document vector (from our toy example) is 12-dimensional and the term vectors are 9-dimensional. What we want to do is project these vectors into a vector space of lower dimensionality. One way is to use Singular Value Decomposition (SVD): we decompose the original matrix into three new matrices.
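A sketch of the decomposition step using numpy's SVD on a stand-in matrix with the toy example's 12 x 9 shape (the real counts are not in the slides); note that numpy returns the document matrix already transposed:

```python
import numpy as np

# Stand-in term-document matrix with the toy example's shape:
# 12 terms x 9 documents (the real counts are not in the slides).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(12, 9)).astype(float)

# SVD in the slides' notation: X = T0 * S0 * D0^T. numpy returns
# the document matrix already transposed, here called D0t.
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

print(T0.shape, S0.shape, D0t.shape)            # (12, 9) (9,) (9, 9)
assert np.allclose(T0 @ np.diag(S0) @ D0t, X)   # factors reproduce X
```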

Slide 8: Using the SVD
The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra. We can select m simply by keeping as many rows/columns of T0, S0, D0 as we want. To get an idea, let's use m = 2 and recalculate a new (approximated) X: it will still be a t x d matrix.
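A sketch of the truncation step, again on a stand-in matrix; T0, S0, D0t are as in the previous sketch:

```python
import numpy as np

# Same stand-in 12 x 9 matrix as the previous sketch.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(12, 9)).astype(float)
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

m = 2                      # keep only the two largest singular values
T = T0[:, :m]              # t x m   (term vectors in the latent space)
S = np.diag(S0[:m])        # m x m   (diagonal singular-value matrix)
Dt = D0t[:m, :]            # m x d   (document vectors, transposed)

X_hat = T @ S @ Dt         # rank-2 approximation of X
print(X_hat.shape)         # still (12, 9), i.e. t x d
```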

Slide 10: What does the SVD give?
Susan Dumais 1995: "The SVD program takes the ltc transformed term-document matrix as input, and calculates the best 'reduced-dimension' approximation to this matrix."
Michael W Berry 1992: "This important result indicates that Ak is the best k-rank approximation (in a least squares sense) to the matrix A."
Leif 2003: What Berry says is that the SVD gives the best projection from n to k dimensions, that is, the projection that keeps distances in the best possible way.
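The result Berry refers to is the Eckart-Young theorem. A small numerical illustration on made-up data: the rank-k SVD truncation has a smaller Frobenius (least squares) error than any other rank-k approximation, here one built from a random column space:

```python
import numpy as np

# Illustration of the Eckart-Young result: the rank-k SVD truncation
# minimizes the least-squares (Frobenius) error among all rank-k
# matrices. The matrix values here are made up.
rng = np.random.default_rng(1)
A = rng.normal(size=(12, 9))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A competing rank-k approximation with a random column space:
# B is the least-squares fit of A within span(P).
P = rng.normal(size=(12, k))
B = P @ np.linalg.lstsq(P, A, rcond=None)[0]

print(np.linalg.norm(A - A_k))   # smaller error
print(np.linalg.norm(A - B))     # larger error (almost surely)
```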

Slide 11: Algorithms for dimensionality reduction
Singular Value Decomposition (SVD)
– A mathematically complicated way (based on eigenvalues) to find an optimal vector space with a specific number of dimensions
– Computationally heavy: maybe 20 hours for a one-million-document newspaper corpus
– Often uses the entire document as context
Random Indexing (RI)
– Selects some dimensions randomly
– Not as heavy to compute, but it is less clear (to me) why it works
– Uses a small context, typically 1+1 to 5+5 words
Neural nets, Hyperspace Analogue to Language, etc.
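A minimal Random Indexing sketch (my reading of the method, not the talk's implementation); the dimensionality, number of nonzero entries, and window size are illustrative choices:

```python
import numpy as np

# Random Indexing sketch: each word gets a sparse random "index
# vector"; a word's context vector is the sum of the index vectors
# of its neighbours within a 2+2 word window. All parameters and
# the toy sentence are illustrative.
dim, nonzeros, window = 300, 6, 2
rng = np.random.default_rng(0)

def index_vector():
    """Sparse ternary random vector with a few +1/-1 entries."""
    v = np.zeros(dim)
    pos = rng.choice(dim, size=nonzeros, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)
    return v

tokens = "the handball coach bengt johansson praised the team".split()
index = {w: index_vector() for w in set(tokens)}
context = {w: np.zeros(dim) for w in set(tokens)}

for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            context[w] += index[tokens[j]]
```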

Slide 12: Some applications
– Automatic generation of a domain-specific thesaurus
– Keyword extraction from documents
– Finding sets of similar documents in a collection
– Finding documents related to a given document or a set of terms
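The last application is usually handled by "folding" a query into the latent space with q_hat = q^T T_k S_k^(-1), the standard LSI recipe; a sketch on stand-in matrices:

```python
import numpy as np

# Folding a query (a bag of terms) into the latent space. The setup
# matrices are invented; in practice T_k, S_k, D_k come from the SVD
# of the real corpus.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(12, 9)).astype(float)
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
Tk, Sk, Dk = T0[:, :k], np.diag(S0[:k]), D0t[:k, :].T   # Dk: d x k

q = np.zeros(12)
q[[0, 3]] = 1.0                        # a query containing terms 0 and 3
q_hat = q @ Tk @ np.linalg.inv(Sk)     # the query in k-dimensional space

# Rank documents by cosine similarity to the folded-in query.
sims = (Dk @ q_hat) / (np.linalg.norm(Dk, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-sims))               # most similar documents first
```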

Slide 13: Problems and questions
– How can we interpret the similarities as different kinds of relations?
– How can we include document structure and phrases in the model?
– Terms are not really terms, but just words
– Ambiguous terms pollute the vector space
– How can we find the optimal number of dimensions for the vector space?

Slide 16: A small experiment
I want the model to know the difference between Bengt and Bengt Johansson.
1. Make a frequency list of all n-tuples up to n = 5 with frequency > 1
2. Keep all words in the bags, but add the tuples, with spaces replaced by "-", as words
3. Run the LSI again
Now bengt-johansson is a word, and bengt-johansson is NOT Bengt + Johansson. The number of terms grows a lot!
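A sketch of steps 1 and 2; the tokenization and the two-document corpus are placeholders, not the talk's data:

```python
from collections import Counter

# Step 1: count all n-grams (n = 2..5) per document.
docs = [
    "bengt johansson coached the national handball team".split(),
    "bengt johansson was a famous handball coach".split(),
]

counts = Counter()
for tokens in docs:
    for n in range(2, 6):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1

# Step 2: keep all single words, and add every n-gram with
# frequency > 1 as an extra hyphenated pseudo-word.
augmented = []
for tokens in docs:
    extra = []
    for n in range(2, 6):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if counts[gram] > 1:
                extra.append("-".join(gram))
    augmented.append(tokens + extra)

print(augmented[0])   # original words plus e.g. "bengt-johansson"
```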

Slide 18: The new vector space model
It is clear that it is now possible to find terms closely related to Bengt Johansson, the handball coach. But is the model better for single words and for document comparison as well? What do you think?
More "words" than before: hopefully this improves the result, just as more data does. At least there is no reason for a worse result... Or is there?

Slide 25: Hmm, adding n-grams was maybe too simple...
1. If the bad result is due to overtraining, it could help to remove the words I build phrases from...
2. Another approach is to use a dependency parser to find more meaningful phrases, not just n-grams
A new test following 1 above:
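The result slides for this test are not included in this transcript. As an illustration of option 1, a hypothetical helper that replaces each frequent n-gram with its hyphenated form and drops the constituent words; `counts` is the n-gram Counter from the earlier sketch, and this is my reading of the slide, not the talk's actual code:

```python
def replace_with_phrases(tokens, counts, max_n=5):
    """Greedily replace frequent n-grams (freq > 1 in `counts`) with
    hyphenated pseudo-words, removing their constituent words."""
    result, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):           # prefer the longest phrase
            gram = tuple(tokens[i:i + n])
            if len(gram) == n and counts.get(gram, 0) > 1:
                result.append("-".join(gram))
                i += n
                break
        else:
            result.append(tokens[i])            # no phrase starts here
            i += 1
    return result
```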

Slide 28: What I still have to do something about
– Find a better LSI/SVD package than the one I have (old C code from 1990), or maybe write it myself...
– Get the phrases into the model in some way
When these things are done I could:
– Try to interpret various relations from similarities in a vector space model
– Try to solve the "optimal number of dimensions" problem
– Explore what the length of the vectors means
