Saturday, September 27, 2008

Last week, I wrote about building term document matrices based on Dr. Manu Konchady's Text Mining Application Programming book. This week, I continue working on computing some similarity metrics from the same set of data. Most of the material is covered in Chapter 8 of the book.

D1 Human machine interface for computer applications
D2 A survey of user opinion of computer system response time
D3 The EPS user interface management system
D4 System and human system engineering testing of EPS
D5 The generation of random, binary and ordered trees
D6 The intersection graph of paths in trees
D7 Graph minors: A survey

The document collection is first converted to a raw term frequency matrix. This is further normalized using either TF/IDF or LSI. The matrix thus normalized is the input to our similarity computations.

A document is represented by a column of data in our (normalized) term document matrix. To compute the similarity between two documents, we perform computations between two columns of the matrix. Conceptually, each document is a point in an n-dimensional term space, where each term corresponds to a dimension. So in our example, we have 23 terms, therefore our term space is a 23-dimensional space. In case this is hard to visualize (it was for me), it helps to think of documents with 2 terms (and hence a two-dimensional term space), and extrapolate from there.

Jaccard Similarity

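Jaccard Similarity is a simple and very intuitive measure of document to document similarity. It is defined as follows:

    similarity(A, B) = |A ∩ B| / |A ∪ B|

that is, the number of terms the two documents share divided by the number of terms that occur in either document.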

The AbstractSimilarity class shown below contains the code to compare all document pairs from our collection. This has a complexity of O(n^2), which we can reduce to about half if we code in the knowledge that similarity(A,B) == similarity(B,A) and similarity(A,A) == 1.0, but which is still too high for a large number of documents. I haven't added this knowledge here, though, since this is just toy data and the performance is tolerable.
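To make the "about half" saving concrete, here is a minimal sketch of a pairwise loop that exploits both facts. It is not the AbstractSimilarity class: it works on plain arrays rather than Jama matrices, stores each document as one row for convenience, and the names PairwiseSketch and similarityMatrix are made up for illustration.

```java
import java.util.Arrays;
import java.util.function.ToDoubleBiFunction;

final class PairwiseSketch {
    /**
     * Builds an n x n similarity matrix for n documents, calling the
     * similarity function only for pairs (i, j) with i < j.
     * Assumes the measure is symmetric and that sim(A, A) == 1.0.
     */
    static double[][] similarityMatrix(double[][] docs,
            ToDoubleBiFunction<double[], double[]> sim) {
        int n = docs.length;
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            result[i][i] = 1.0;               // similarity(A, A) == 1.0
            for (int j = i + 1; j < n; j++) { // upper triangle only
                double s = sim.applyAsDouble(docs[i], docs[j]);
                result[i][j] = s;
                result[j][i] = s;             // similarity(A, B) == similarity(B, A)
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] docs = {{1, 0}, {0, 1}, {1, 1}};
        // A plain dot product stands in for the real similarity measure here.
        double[][] m = similarityMatrix(docs, (a, b) -> a[0] * b[0] + a[1] * b[1]);
        System.out.println(Arrays.deepToString(m));
    }
}
```

For n documents this calls the similarity function n(n-1)/2 times instead of n^2 times, which is still quadratic.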

For the JaccardSimilarity class, we implement the abstract method computeSimilarity() as shown below (based on our formula above). I have cheated here slightly - in Jama, norm1() is defined as the maximum column sum, but since our document matrices are column matrices (i.e. 1 column) with non-negative values, norm1() returns the same value as the sum of all elements in the document matrix. It does make the code less readable, in the sense that the reader/maintainer would have to jump through an extra mental hoop, but it makes the code concise.
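As a rough sketch of the same idea without Jama (and not the JaccardSimilarity class itself), the computation can be written over plain arrays. I am assuming here that the intersection is taken as the sum of the element-wise products and the union as the sum of both vectors minus the intersection, which reduces to the set definition when the weights are 0 or 1. The class and method names are hypothetical.

```java
final class JaccardSketch {
    /**
     * Jaccard-style similarity for two non-negative term vectors.
     * intersection = sum(a[i] * b[i])
     * union        = sum(a[i] + b[i]) - intersection
     * For 0/1 vectors this is exactly |A ∩ B| / |A ∪ B|.
     */
    static double jaccard(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double intersection = 0.0;
        double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            intersection += a[i] * b[i];
            total += a[i] + b[i];
        }
        double union = total - intersection;
        // Two empty documents: report 0.0 rather than dividing by zero.
        return union == 0.0 ? 0.0 : intersection / union;
    }

    public static void main(String[] args) {
        double[] d1 = {1, 1, 0, 1};
        double[] d2 = {1, 0, 1, 1};
        System.out.println(jaccard(d1, d2));
    }
}
```

With the two binary vectors in main(), two terms are shared and four terms occur in at least one document, so the similarity is 2/4.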

The result of our Jaccard similarity computation is shown below. The JUnit test that produces it is SimilarityTest.java, which is shown later in this post. The two matrices are derived from our raw term document matrix normalized using TF/IDF and LSI respectively. As you can see, the similarity matrix created off the LSI normalized vector shows more inter-document similarity than the one created off the TF/IDF normalized vector.

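Cosine Similarity

Cosine Similarity measures the angle between two document vectors. It is defined as follows:

    cos θ = (A · B) / (|A| |B|)

Here, θ is the angle between the vectors of the documents A and B in the n-dimensional term space. A and B are very similar as θ approaches 0° (cos(0°) = 1), and very dissimilar as θ approaches 90° (cos(90°) = 0).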

The code below computes the cosine similarity based on the formula above. Here, normF() is the Frobenius norm, the square root of the sum of the squares of all elements, and norm1() is the maximum column sum, which works for me because we are computing it off a single-column matrix with non-negative values.
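For readers without Jama at hand, a minimal array-based sketch of the same formula follows. It is an illustration only, with hypothetical names, and it computes the dot product directly instead of going through norm1(), so it also keeps the sign when vectors contain negative values.

```java
final class CosineSketch {
    /** cos(theta) = (a . b) / (|a| * |b|), with |x| the Euclidean (Frobenius) norm. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double sumSqA = 0.0;
        double sumSqB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            sumSqA += a[i] * a[i];
            sumSqB += b[i] * b[i];
        }
        double denominator = Math.sqrt(sumSqA) * Math.sqrt(sumSqB);
        // An all-zero vector has no direction; report 0.0 instead of NaN.
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1}));
        System.out.println(cosine(new double[]{1, 2}, new double[]{2, 4}));
    }
}
```

The first pair in main() is orthogonal and the second pair points in the same direction, which are the two extremes described above.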

The results of this computation for a term document vector normalized with TF/IDF and LSI respectively are shown below. As before, the inter-document similarity is higher for the LSI normalized vector.

Searching - Similarity against a Query Vector

Using the similarity vectors above, it is now possible to build a Searcher. The objective is to find the most similar documents that match our query. For this, we will need to build a query vector consisting of the terms in the query, and then compute the similarity of the documents against the query vector. The computation is similar to what we have already done above if we treat the query vector as a specialized instance of a document vector. We use Cosine Similarity for our computations here.
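The idea can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions: raw term counts instead of TF/IDF or LSI weights, a tiny hand-made vocabulary, documents stored one per row, and hypothetical names (SearchSketch, queryVector, search) that do not come from the actual Searcher.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SearchSketch {
    /** Turns a query string into a term vector over the given vocabulary. */
    static double[] queryVector(List<String> vocabulary, String query) {
        double[] q = new double[vocabulary.size()];
        for (String token : query.trim().toLowerCase().split("\\s+")) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                q[index] += 1.0; // terms outside the vocabulary are ignored
            }
        }
        return q;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, sumSqA = 0.0, sumSqB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            sumSqA += a[i] * a[i];
            sumSqB += b[i] * b[i];
        }
        double denominator = Math.sqrt(sumSqA) * Math.sqrt(sumSqB);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    /** Returns indexes of documents with a positive score, best match first. */
    static List<Integer> search(double[][] docs, double[] query) {
        double[] scores = new double[docs.length];
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            scores[i] = cosine(docs[i], query);
            if (scores[i] > 0.0) {
                hits.add(i);
            }
        }
        hits.sort((x, y) -> Double.compare(scores[y], scores[x]));
        return hits;
    }

    public static void main(String[] args) {
        List<String> vocabulary =
            Arrays.asList("human", "machine", "interface", "computer", "survey");
        double[][] docs = {
            {1, 1, 1, 1, 0}, // human machine interface computer
            {0, 0, 0, 1, 1}, // computer survey
            {0, 0, 0, 0, 1}  // survey
        };
        double[] q = queryVector(vocabulary, "human computer interface");
        System.out.println(search(docs, q));
    }
}
```

Treating the query as just another document vector means the ranking step needs nothing beyond the cosine function already described.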

The results of searching with the query "human computer interface" against a similarity matrix built off a TF/IDF and an LSI normalization are shown below. Notice that we get a few more results from an LSI normalized vector. I have manually copied the titles over to make the result more readable. I suppose I could make the code do it, and I probably would have, if the code were going to be used by anybody other than myself.

Conclusion

I hope this post was helpful. I had looked at Term Vectors in Lucene in the past, but the idea of an n-dimensional term space never quite sank in until I started working through the stuff I describe above. I know that the data I am using is toy data, and that the computations described in this post are too expensive to be practical in real life, but I believe that knowing this stuff helps in building models that would be useful and practical.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

This post is very useful for me, as I am working on sentiment analysis. But I am not able to understand how to create the indexing_sample_data.txt file from the sample data given. Please help me find this text file in the data set. Thanks.

Thanks for the reply, it really worked. But I am getting a java.lang.NoClassDefFoundError at the line "this.jdbcTemplate = new JdbcTemplate(dataSource);" in the phrase and abbreviation recognizer. I have included all the header files. Maybe this error is caused by an incorrect path for the datasource. If so, please specify how to set the datasource path. Please help me remove this error. Thanks.

Hi Amit, NoClassDef means that the class could not be found, since this occurs in this line, it could either be JdbcTemplate or DataSource. The first is possible if you don't have Spring jars in your classpath, and the second should not happen, since DataSource is part of the javax.sql package built into the JDK (unless you are using an ancient version). In any case, your error message should tell you more - I suspect that you are missing spring jars.

Hi Sujit, thanks for the reply! I found that the error was because of the datasource. I tried to make a connection to MySQL through JDBC but I got the exception "com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

Hi Amit, this is probably because either (a) you need to expose a socket for JDBC to talk to MySQL, or (b) your username/password is incorrect, or (c) some other reason I can't predict remotely. You may want to RTFM about Java/JDBC/MySQL connectivity, and if you think that is not necessary, try posting the problem on a MySQL forum - this looks like it is related to your setup and not to the code.

Hello Sujit, your suggestions are really working. Now I am not able to create the database using the textmine project. I have downloaded the textmine project. Can you please tell me the process to create the "tmdb" database using the textmine project? It will be very helpful. Thanks for the help.

Hi Amit, good to know. As you can see, tmdb is simply a MySQL database loaded with data from the textmine project. You will find the data in the utils/data subdirectory of the textmine project. I had to massage it a bit, not sure if the code for that exists, if not, it was probably just a bunch of sed commands.

Hi Tehseen, calculating similarity between sentences is the same as calculating between documents, at least using the vector approach I describe here. Basically create a term/sentence matrix instead of a term/document matrix.

Hi Sujit, I am using LsiIndexer.java for my project, but while executing it I am getting an error such as " can not access Jama.Matrix; bad class file:\java.Matrix ; please remove or make sure it appears in the correct subdirectory of the class path ". Please help me with this. Thanks in advance.

Hi Ambika, you should download the entire project (contains updated LsiIndexer.java) from the SVN repository at jtmt.sf.net. I switched from Jama to commons-math-2.0 once they put the LSI Indexing stuff in there.

I see your LSI similarity values. I saw the documents too. It says 1.0, while I can see the documents are not even the same in meaning. For example, doc 2 and doc 6 are not similar. The same goes for other records. How come your program's calculations say this? If I missed anything, please let me know.

Hi Sujit Fan, not sure why LSI says they are similar when they actually are not. I believe I investigated it once but can't be sure. If you dump out the components of the decomposition, it would probably give you a better idea - you would see the terms that are the principal components - perhaps these words co-occur in the documents, even though they don't contribute much meaning? Then they may be good candidates for stopwords.

Hi Sujit, thanks for your quick reply. You wrote that JAMA is not good for sparse matrices. But I think SVD is used to handle sparse matrices, isn't it? If I am wrong, can you please explain what problems we can face if we use a big sparse matrix in JAMA? Speed? Accuracy? Also, can you please tell me if there is any way to make JAMA work where there are fewer rows than columns? If not, can you point out any other LSI Java package that I can use? I am looking for a small package like JAMA, not a BIG ONE :)

Hi, AFAIK, Jama has no support for sparse matrices - internally matrices are treated as a double[][] - when I say sparse matrix support, I mean that the data structure will only store the non-zero elements of the matrix. Obviously for large sparse matrices such as TD matrices, this can result in significant memory savings. Also AFAIK, SVD does not care about whether the matrix is sparse or dense. Also Jama is no longer under active development, so I would recommend commons-math (which is a bit larger than Jama) - they have recently been looking at SVD improvements. Actually, I would suggest that you get on the apache commons list and ask your questions there, you will probably get better answers there than from me.

Hi Sujit, thank you so much for such a brilliant post. I am using your LSI and cosine similarity concept in my project. I downloaded your project, but while executing SimilarityTest.java, I am getting the error "Exception in thread "main" java.lang.NoSuchMethodError: main". I couldn't find a "main" method in your code. Please help me solve this. One more doubt: I don't quite understand why we use a database here. vectorGenerator.setDataSource(new DriverManagerDataSource( "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", "root", "orange")); Are there any other settings I should make for this, like a JDBC-ODBC connection? Can I use Oracle 10g for this? Is it compulsory to use a database? Sorry if my questions are too lame. Please help me with this. I will be grateful to you.

To answer your first question, I am guessing that you may be running the SimilarityTest as a Java application (from your IDE perhaps?), in which case it looks for the main() method - you will need to run this as a JUnit application - you can do this from your IDE or using Ant or Maven from the command line - see the documentation for whatever tool you are using.

For the second question, the database is used for recognizing abbreviations and multi-word phrases. The first is a list of known abbreviations, which you can also put into a text file. The second is a bit more complex, since phrases are stored as collocated words in the database. In the jtmt project there are schema and data files that contain the table def and data for the table. Using a database seemed to be the most obvious way to do this, but of course it may be possible (although probably not very easy) to store them in some other form.

Hi helpNeeded, the post is just some Java implementations of various similarity measures I found in the TMAP book and elsewhere. When you say "how it all works", are you looking for the underlying mathematical theory behind these? If so, you should probably look elsewhere (do a search). If you are asking about the code itself, it should be fairly simple if you know Java.

Hi Sujit, how can I identify whether a document is an efficient search query? During my testing I found that if the search vector is good it gives accurate results, but if the query is worse then we can't expect the right output. Thank you.

Hi Srijith, isn't this what you would expect on any search engine? When you say "bad" search vector, perhaps it contains too few terms or contains terms that are very common in the corpus, right? So I guess there would be two ways to go about "improving" your search vector - either insist on a minimum number of terms or post process the search vector to remove common terms? Or maybe I am not understanding what you are looking for?

Thank you, Sujit. I have created an LSI index of 10 near duplicate documents and am passing each document as a search vector. But even though all the documents are near duplicates, for some search queries LSI gives a very low score; some docs are identified as near duplicates and others are not.

I put 0.8 as a threshold to filter the near duplicates, but some near duplicates give a score less than 0.8 and some completely different documents give a score > 0.8.

So I can't conclude anything. This is the scenario. I hope you can now figure out my issue. Thank you.

Hi Srijith, my understanding of LSI is that you achieve "semantic" indexing by removing the noisy dimensions, leaving only the dimensions which truly represent the document. So documents that otherwise wouldn't look similar because of the presence of noisy dimensions now do. I am guessing that you did the same thing on the search vector side, i.e. removing the noisy dimensions from the search vector. Could the documents that don't seem to match have some outlier terms after cleanup that are causing the results to be skewed?

Thanks Ajaz. About the schema in the textmine project, I just downloaded the textmine-0.21 tarball (from textmine.sf.net and click on Download) and exploded it locally, I see a whole bunch of .dat files under utils/data, and a tables.sql with the schema in utils. I used that to do the initial population, then added a few tables of my own, whose schemas are in the jtmt project.

Hi Pal, I find a collection of .sql files in the jtmt project. What is the order in which they should be executed to create the database and populate it from the .dat files? Please provide the steps. Also, for Windows only WordNet 2.1 is available, will it suffice?

Hi Sujit, I am trying to get jtmt working on my PC (Windows), but I am facing some trouble with the SQL database. Which scripts need to be run, and which tables need to be created? I found a few '.sql' files at jtmt\src\main\sql. Do all of these files need to be used? I am interested in getting the similarity metrics alone working.

@Anonymous: The .sql files can be applied in any order after the textmine data is loaded (see my reply to Ajaz's comment above for details about that). If multiple tables are to be created in a certain order (because of fk references, etc), they are contained in a single .sql file so they are created in the correct order.

To access WordNet, I use jwi, which is advertised to work with WordNet versions 1.6-3.0, so I guess you should be fine with WordNet 2.1.

@lmthyaz: For .sql order, please see my response to anonymous above.

For restricting to tables for similarity, unfortunately I don't have a good answer for you. I suggest only loading up the textmine tables first, then running the JUnit test case for similarity, and see if it complains about missing tables, then add them in.

Hi Sujit, I am using the Idf Indexer and cosine similarity of jtmt. When trying to find the similarity between two documents (the corpus has only 2 files) I am getting a similarity of 'NaN'. What does it mean? Where should I make a change if I want to modify it?

Hi Imthyaz, hard to say without a bit more information. Offhand I would say it's probably an underflow or overflow (the resulting double is too small or too large to fit in the space provided). Tracing through the code and adding some bounds checking should show where it happens.
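A hypothetical sketch of such a check on a cosine computation is below (plain arrays, made-up names, not the jtmt code). One way to end up with NaN is an all-zero document vector, since 0.0 / 0.0 evaluates to NaN in Java, so the sketch fails loudly when the denominator is zero, NaN or infinite instead of silently returning NaN.

```java
final class NanGuardSketch {
    /** Cosine similarity that throws instead of returning NaN. */
    static double safeCosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, sumSqA = 0.0, sumSqB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            sumSqA += a[i] * a[i];
            sumSqB += b[i] * b[i];
        }
        double denominator = Math.sqrt(sumSqA) * Math.sqrt(sumSqB);
        // Bounds check: a zero, NaN or infinite denominator is what would
        // otherwise surface later as a NaN similarity.
        if (denominator == 0.0 || Double.isNaN(denominator)
                || Double.isInfinite(denominator)) {
            throw new ArithmeticException(
                "cannot compute cosine, denominator=" + denominator);
        }
        return dot / denominator;
    }

    public static void main(String[] args) {
        System.out.println(Double.isNaN(0.0 / 0.0)); // the failure mode being guarded
        System.out.println(safeCosine(new double[]{1, 0}, new double[]{1, 0}));
    }
}
```

Once the exception points at the offending vector, it is easier to see whether the indexer produced an empty vector or whether a value really overflowed.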

Not sure about phrases, but I recently came across a Perl module called WordNet::Similarity, which provides implementations of various functions for using WordNet to find similarities between words. You may want to check it out.

Thanks for providing the code for tf-idf. I am new to Java programming. I tried running the first code, stopwordrecognizer, but it gave me errors like "unable to find class: Irecognizer" and "unable to find class: Token". Where do I get these files from?

It probably won't be such a great idea to do this. The similarity matrix would be useful to find "related documents" from the one selected, but can't tell why they are similar. Clustering clusters docs by features, so documents in a cluster are similar along that feature.

Hi David, thought a bit more about this, and I take my comment about it being a bad idea back... I think you could just use a standard algo such as Knn and replace the distance metric with your doc-doc similarity number. The k (size of each cluster) would be determined by the number of docs in your corpus / number of clusters you want.

Hi Rachit, it's been a while since I pulled the data from Dr Konchady's textmine project, but basically I used the instructions on the project page. The TMAP book has some pointers as well, but I think the installation instructions on the project page should provide enough detail.

Hi Sujit,

Firstly, thanks for the great post that really helps us a lot. I have a few queries regarding similarity in documents.

1. In this post you have used a database to detect abbreviations and multi-word phrases. Can we exclude this part? I mean simply create an index of the words after removing stop words and function words. In all cases (LSI, TF/IDF), termDocMatrix needs vectorGenerator.getMatrix() as a parameter. What would be the appropriate parameter for termDocMatrix if we do not wish to use the database? I simply want to preprocess each file (remove stop words and stem), index it, and then create termDocMatrix.
2. If we are going to use the database, I get confused about which SQL file to run. I found four .sql files at jtmt\src\main\sql, viz. tmdb_load, tmdb_my_colloc, tmdb_postagger, tmdb_schema.
3. But for the data, I could not find utils/data in the project.
4. Is it possible to run the JUnit test as a Java application by adding a main function?

Regards,
Niraj

Hi Niraj, sorry about the delay in responding. Yes, I think you can safely remove the database based abbreviation/collocation detection stuff while computing document similarity. I believe the sequence in which you would run the SQL files is: tmdb_schema, tmdb_load, tmdb_my_colloc and tmdb_postagger. The data comes from the TextMine (textmine.sf.net) project. And yes, you can take the contents of the function annotated by @Test and put it into the main() function, then you can run the JUnit code as a Java application.

Hi Sara, if you are using cosine similarity then you cannot. In case of document vectors, they should be the same size, as each element of the vector will correspond to one word/term in the vocabulary (which is a superset of all terms found in the corpus). If they do differ in size for some reason, then you can pad the shorter one with zeros (in case of text you are saying that the extra terms occur zero times in the shorter document vector).
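If padding is needed, a small sketch (hypothetical helper name, plain arrays) would be:

```java
import java.util.Arrays;

final class PadSketch {
    /** Returns a copy of v extended with trailing zeros up to the given length. */
    static double[] padTo(double[] v, int length) {
        if (length < v.length) {
            throw new IllegalArgumentException("target length is shorter than the vector");
        }
        // Arrays.copyOf fills the extra positions with 0.0.
        return Arrays.copyOf(v, length);
    }

    public static void main(String[] args) {
        double[] shorter = {2, 0, 1};
        System.out.println(Arrays.toString(padTo(shorter, 5)));
    }
}
```

This only makes sense if position i refers to the same term in both vectors, so the extra terms have to be appended at the end of the shared vocabulary.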

I did the same thing and add made all the zeros live. But when I use Gaussian mixtures for clustering, it does not work, because I have many zeros in my matrix and all of my parameters will be NaN. Thanks.

I am guessing you are trying to find similarities between "average" documents across genres, yes? If so, very cool idea. The NaNs are most likely because of underflow problems. One way to work around that is to apply a Laplace correction (add 1 to the numerator and v to the denominator, where v is the vocabulary size) to the TD vectors, although if you do that it will change the data from sparse to dense (although you can just bake this behavior into a subclass of Sparse Matrix to keep the memory properties the same). But it may be good to find out where the NaN is actually occurring before taking such drastic action.

Actually, I am trying to cluster the documents using Gaussian mixtures. The problem is that since the sparse matrix has many zeros, the variance will be zero and the program does not work. It happens when I am using Dirichlet mixtures too. I was wondering if you could explain more about how I can change the data to a dense matrix? What is the denominator? Many thanks for your time. Sara

Well, the idea is to add 1 to the numerator of every element in the TD matrix and v to its denominator where v is the vocabulary size. So assume a term t has occurred in document d n times, and N is the total number of words in the corpus, then the frequency of t is n/N, with the Laplace correction it becomes (n+1)/(N+v). This now changes your sparse matrix into a dense one.
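As a quick worked example with made-up numbers (the method name is hypothetical too):

```java
final class LaplaceSketch {
    /**
     * Laplace-corrected relative frequency: (n + 1) / (N + v), where
     * n = occurrences of the term, N = total words, v = vocabulary size.
     */
    static double smoothed(long n, long totalWords, long vocabularySize) {
        return (n + 1.0) / (totalWords + vocabularySize);
    }

    public static void main(String[] args) {
        long totalWords = 90;     // N, assumed for illustration
        long vocabularySize = 10; // v, assumed for illustration
        // A term that never occurs no longer gets a frequency of exactly zero.
        System.out.println(smoothed(0, totalWords, vocabularySize));
        System.out.println(smoothed(4, totalWords, vocabularySize));
    }
}
```

With N = 90 and v = 10, an unseen term moves from 0/90 to 1/100, and a term seen 4 times moves from 4/90 to 5/100.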

Thank you for your post. In the cosine similarity part, you used norm1 ( double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1() ) to compute the dotProduct. From what I have read, norm1 always gives positive results, which makes the cosine similarity result always lie between 0 and 1 (while cosine similarity can also be between -1 and 0). Am I missing something, or is it a bug?

Thanks for your reply. Yes, you are right, in the example you gave it is always positive. I tried to use your cosine similarity method on vectors taken from the V-matrix generated after SVD and it had negative values. Now, I have written mine. Anyway thank you for your post and prompt reply.

Hi Sujit, I am trying to understand the similarity matrix output you have printed. You have five docs, say d1, d2, .. d5. How is the similarity represented here? Let's say we find the similarity between d1 and d2, how do we represent it in matrix form?

Hi Rob, d1 and d2 point to row and column indexes 0 and 1 respectively. So similarity(d1,d2) would be at either [0,1] or [1,0] of the matrix (both values are equal). Also notice the diagonal is all 1, indicating that any document is perfectly similar to itself.

Hi Sujit, I am a newbie to vector space modelling. I recently had to create a tool which could take in a query and give a list of matching files from a source code base using tf-idf. Can you suggest how to go about it?

Hi, if you already have an (n x m) TD matrix of your documents (call it D, with n terms as rows and m documents as columns), you could use the same vectorizer to create an (n x 1) vector for your query (call it Q), then you can generate an (m x 1) similarity vector S = D^T * Q, with one score per document. Depending on the size of your document set, you can do the computation with dense or sparse matrices in memory, or you could use Hadoop if the matrices don't fit into memory.
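A minimal in-memory sketch of that product with dense arrays follows (hypothetical names, terms as rows and documents as columns). Note that these are raw dot products; they only equal cosine similarities if the document columns and the query vector have been scaled to unit length first.

```java
import java.util.Arrays;

final class QueryScoreSketch {
    /**
     * Computes one score per document: score[doc] = sum over terms of
     * d[term][doc] * q[term]. d is terms x docs, q has one entry per term.
     */
    static double[] scores(double[][] d, double[] q) {
        int terms = d.length;
        int docs = terms == 0 ? 0 : d[0].length;
        if (q.length != terms) {
            throw new IllegalArgumentException("query must have one entry per term");
        }
        double[] s = new double[docs];
        for (int term = 0; term < terms; term++) {
            for (int doc = 0; doc < docs; doc++) {
                s[doc] += d[term][doc] * q[term];
            }
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] d = {
            {1, 0}, // term 0 occurs once in doc 0
            {1, 1}, // term 1 occurs once in both docs
            {0, 2}  // term 2 occurs twice in doc 1
        };
        double[] q = {1, 0, 1}; // query mentions terms 0 and 2
        System.out.println(Arrays.toString(scores(d, q)));
    }
}
```

Sorting the document indexes by descending score then gives the result list for the query.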
