Uncle Lance's Ultra Whiz Bang

Thursday, September 6, 2012

Two problems

I found a couple of problems with this approach: very long sentences and running time & space.

Very Long Sentences

Other test data I looked at were technical papers. These often have very long sentences. The problem is not controlling for sentence length, but that there are too many theme words for the sentence to express one theme well. One could try breaking these up into sentence shards: a 50-word sentence would become three shards of roughly 15 to 17 words each. If we use parts-of-speech trimming, we can break the shards within runs of filler words.
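Here is a minimal sketch of the sharding idea, assuming a sentence is already split into words. The function name shard and the target parameter are made up for illustration, and the sketch cuts purely by length; it does not try to place the cuts inside runs of filler words.

```python
def shard(words, target=15):
    """Split a list of words into near-equal shards of roughly `target` words."""
    if not words:
        return []
    # Pick the shard count closest to len/target, but always at least one.
    n_shards = max(1, round(len(words) / target))
    base, extra = divmod(len(words), n_shards)
    shards, start = [], 0
    for i in range(n_shards):
        # The first `extra` shards take one leftover word each.
        size = base + (1 if i < extra else 0)
        shards.append(words[start:start + size])
        start += size
    return shards


if __name__ == "__main__":
    sentence = ["w%d" % i for i in range(50)]
    print([len(s) for s in shard(sentence)])
```

A version with parts-of-speech tags could move each cut point to the nearest filler word instead of cutting at a fixed offset.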

Optimizations

It is possible to create an optimized version of LSA, using Random Indexing, that ranks only sentences or only terms. RI deliberately throws away the identity of one side of the sentence/term matrix. It can rank sentences or terms, but not both. This algorithm runs much faster than the full sentence/term version, and is the next step in creating an interactive document summarizer or a high-quality tag cloud generator.

Random Projection

Random Projection is a wonderfully non-intuitive algorithm. If you multiply a matrix with another matrix filled with random numbers, the output matrix will have approximately the same cosine distance ratios across all pairs of vectors. In a document-term matrix, all of the documents will still have roughly the same mutual 'relevance' (measured by cosine distance which magically matches tf-idf relevance ranking).

If the random matrix is the (transposed) size of the original matrix, the row and column vectors will have the above distance property. If the random matrix has more or fewer dimensions on one side, the resulting matrix will retain the identity of vectors from the constant side, but will lose the identity of the vectors on the varying side. If you have a sentence-term matrix of 50 sentences x 1000 terms, and you multiply it by a random matrix of 1000 x 100, you get a matrix of 50 sentences x 100 "sketch" vectors, where every sketch is a low-quality summarization of the term vectors. The 50 sentences will still have approximately the same cosine ratios; we have merely thrown away the identity of the term vectors. Since the running time of SVD grows much faster than linearly with matrix size, we now have a much smaller dataset that is much faster to decompose and will still give us the orthogonal decomposition of the sentences (but the terms are forgotten). We can invert this and calculate the SVD for terms without sentences.
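To make the shapes concrete, here is a small pure-Python sketch of the projection, not the toolkit's code. It builds a made-up 50 x 1000 sentence-term matrix, multiplies it by a 1000 x 100 Gaussian random matrix, and lets you compare cosines before and after. The toy data generator, the function names and the seeds are all assumptions for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_sentence_term_matrix(n_sentences=50, n_terms=1000, words=20, seed=1):
    """Made-up term counts: half the words come from a small shared vocabulary."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_sentences):
        row = [0] * n_terms
        for _ in range(words):
            term = rng.randrange(50) if rng.random() < 0.5 else rng.randrange(n_terms)
            row[term] += 1
        matrix.append(row)
    return matrix


def random_projection(matrix, k=100, seed=0):
    """Multiply an (n x d) matrix by a (d x k) matrix of Gaussian random numbers."""
    rng = random.Random(seed)
    d = len(matrix[0])
    r = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    projected = []
    for row in matrix:
        sketch = [0.0] * k
        for t, count in enumerate(row):
            if count:  # skip zero cells; the matrix is mostly empty
                for j in range(k):
                    sketch[j] += count * r[t][j]
        projected.append(sketch)
    return projected


if __name__ == "__main__":
    a = toy_sentence_term_matrix()
    p = random_projection(a)
    for i in range(1, 4):
        print(i, cosine(a[0], a[i]), cosine(p[0], p[i]))
```

The match is only approximate, and it gets looser as the number of sketch dimensions shrinks.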

Random Indexing

The above description posits creating the entire sentence/term matrix, then doing the complete multiplication. In fact, you can create the resulting sentence/sketch matrix directly, one term at a time. This will considerably cut memory usage and running time. I include this explanation because there are few online sources describing Random Indexing, and the following forget to explain RI's roots in Random Projection:

http://www.sics.se/~mange/random_indexing.html
http://www.sics.se/~mange/papers/RI_intro.pdf
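A minimal sketch of the direct construction, under my own assumptions: each term gets a sparse random "index vector" of +1/-1 entries (seeded from the term text so it is reproducible), and a sentence's sketch is simply the sum of the index vectors of its terms. Neither the sentence/term matrix nor the random matrix is ever built. The function names, the 100 dimensions and the 10 non-zero entries are illustrative choices, not values from the toolkit.

```python
import random
import zlib


def index_vector(term, k=100, nonzeros=10):
    """Sparse ternary random vector for a term, as {position: +1 or -1}."""
    # Seeding from the term text means the same term always gets the same vector.
    rng = random.Random(zlib.crc32(term.encode("utf-8")))
    return {pos: rng.choice((-1, 1)) for pos in rng.sample(range(k), nonzeros)}


def sketch_sentences(sentences, k=100, nonzeros=10):
    """Build the sentence/sketch matrix directly, one term at a time."""
    sketches = []
    for sentence in sentences:
        sketch = [0] * k
        for term in sentence.lower().split():
            for pos, sign in index_vector(term, k, nonzeros).items():
                sketch[pos] += sign
        sketches.append(sketch)
    return sketches


if __name__ == "__main__":
    rows = sketch_sentences(["dean foods expects earnings", "douglas told analysts"])
    print(len(rows), len(rows[0]))
```

Each index vector plays the role of one row of the random matrix in the Random Projection description, generated on demand instead of stored.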

The Test Harness

LSA toolkit and Solr DocumentSummarizer class

The LSA toolkit is available under https://github.com/LanceNorskog/lsa/tree/master/research. The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as LUCENE-2899.

The Summarizer optionally uses OpenNLP to do sentence parsing and parts-of-speech analysis. It uses the OpenNLP parts-of-speech tool to filter for nouns and verbs, dropping all other words. Previous experiments used both raw sentences and sentences limited to nouns & verbs, and pos-stripped sentences worked 10-15% better in every algorithm combination. This set of benchmarks did not bother to try the full sentences.

Reuters Corpus

The Reuters data and scripts for this analysis project are under https://github.com/LanceNorskog/lsa/tree/master/reuters. ...../data/raw is the Reuters article corpus preprocessed: the articles are reformatted into one sentence per line and are limited to 10+ sentences. The toolkit includes a script to run against the Solr Document Summarizer and save the XML output for each article, and a script to apply XPath expressions to create a CSV line for each article into one CSV file per algorithm. The per-algorithm keys include both the regularization algorithms and whether parts-of-speech filtering was applied.

Analysis

The analysis phase used KNime to preprocess the CSV data. KNime rules created more columns calculated from the generated columns, and then created a pivot table which summarized the data per algorithm. This data was saved into a new CSV file. KNime's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I had to make the charts in LibreOffice instead, and then copy them into a DOC file in MS Word (and not LibreOffice!) to get just plain jpegs from the charts.

Further Reading

The KNime data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.

The Measurements

In this Post we review the individual measures. These charts show each measure applied to all the algorithms, and also the basic statistics summary of the full dataset (not the per-algorithm aggregates). The measures are MRR, Rating, Sentence Length, and Non-zero. MRR is a common method for evaluating search results and other kinds of recommendations. The other three are fabricated for this analysis.

Mean Reciprocal Rank

MRR is a common measure for search results. It attempts to model the unforgiving way in which users react to mistakes in the order of search results. It measures the position of a preferred search result in a result list. If the "right" answer is third in the list, the MRR is 1/3.

This statistic is the mean of three MRRs, one for each result based on how far it is from where it should be. If the second sentence is #2, that is a 1. If it is #1 or #3, that is 1/2. If the third is #3, that is a 1. If it is #2, that is 1/2 and if it is #1, that is 1/3. The measures go down to the 5th sentence.
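Here is one way to read that description as code. It is a sketch under assumptions, not the scoring code from the toolkit: I assume the three MRRs belong to the first three sentences of the article, that each one scores 1/(1 + distance from its proper position), and that a sentence missing from the top five recommendations scores zero. The name positional_mrr is made up.

```python
def positional_mrr(ranked, depth=5, targets=(1, 2, 3)):
    """Mean of per-sentence reciprocal scores.

    `ranked` lists 1-based sentence numbers in recommended order.
    """
    top = ranked[:depth]
    scores = []
    for target in targets:
        if target in top:
            position = top.index(target) + 1
            # Right spot scores 1, one place off scores 1/2, two off 1/3, ...
            scores.append(1.0 / (1 + abs(position - target)))
        else:
            # Assumption: not recommended within `depth` counts as zero.
            scores.append(0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(positional_mrr([1, 2, 3, 4, 5]))
    print(positional_mrr([2, 1, 3, 7, 9]))
```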

Rating (0-5)

Rating is a heavily mutated form of "Precision@3". It tries to model how the user reacts in a UI that shows snippets of the top three recommended sentences. 0 means no interest. 1 means at least one sentence placed (the Non-zero measure). 2-5 measure how well the first three recommendations match the first and second spots. In detail:

5: first and second results are lede and subordinate
4: first result is lede
3: second result is lede
2: first or second are within first three sentences
1: third result is within first three sentences
0: anything else
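The table above translates almost directly into code. This is a sketch of my reading of it, assuming the recommendations arrive as a list of 1-based sentence numbers in ranked order, with the lede as sentence 1 and the subordinate as sentence 2; the function name is made up.

```python
def rating(ranked):
    """Rate the top three recommended sentences on the 0-5 scale."""
    first, second, third = ranked[:3]
    if first == 1 and second == 2:
        return 5  # lede and subordinate in the right order
    if first == 1:
        return 4  # first result is the lede
    if second == 1:
        return 3  # second result is the lede
    if first <= 3 or second <= 3:
        return 2  # first or second is one of the first three sentences
    if third <= 3:
        return 1  # only the third result is one of the first three sentences
    return 0


if __name__ == "__main__":
    print(rating([1, 2, 9]), rating([9, 8, 7]))
```

The rules are checked from the top down, so a result gets the highest rating it qualifies for.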

MRR and Rating (green and yellow) correlate very consistently in the graphic on Post #3. Rating tracks with MRR, but is more extreme. Note the wider standard deviation. This indicates that the Rating formula is a good statistic for modeling unforgiving users.

Sentence Length

This measures the mean sentence length of the top two recommended sentences. The sentence length is the number of nouns and verbs in the sentence, not the number of words. This indicates how well the algorithm compensates for the dominance of longer sentences.

Non-Zero

Every result that recommended the first, second or third sentence for one of the three top spots, by percentage.

The mean article length in the corpus is (I think) 22 sentences. A mean of 60% for 3 out of 22 is much better than random recommendations.
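To put a number on "random", here is a small worked calculation. It assumes a 22-sentence article and three distinct sentences picked uniformly at random, and asks how often at least one of them lands in the first three sentences. Under those assumptions the answer is about 37%. The function names are made up for this sketch.

```python
from math import comb


def random_hit_rate(n_sentences=22, picks=3, targets=3):
    """Chance that at least one of `picks` random sentences is in the first `targets`."""
    all_miss = comb(n_sentences - targets, picks) / comb(n_sentences, picks)
    return 1.0 - all_miss


def non_zero_percent(rankings, picks=3, targets=3):
    """Percentage of articles whose top `picks` include one of the first `targets` sentences."""
    hits = sum(1 for ranked in rankings if any(s <= targets for s in ranked[:picks]))
    return 100.0 * hits / len(rankings)


if __name__ == "__main__":
    print(round(100 * random_hit_rate(), 1))
```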

Precision vs. Recall

In Information Retrieval jargon, precision is the accuracy of a ranking algorithm, and recall is the ability to find results. The three success measurements are precision measures. The "dartboard" measure is a recall measure.

From reading the actual sentences and recommendations, binary+normal and augNorm+normal had pretty good precision. These two also achieved the best recall at around 65%. This level would not be useful in a document summarization UI. I would pair this with a looser strategy to find related sentences by searching with the theme words.

Previous Example

In the example in Post #2, the three top-rated sentences were 4, 3, and 6. Since only one of three made the cut, the rating algorithm gave this a three. Note that the example was not processed with Parts-of-Speech removal and used the binary algorithm, and still hit the dartboard. This article is the first in the dataset, and was chosen effectively at random.

Overall Analysis

Measures are tuned for Interactive UI

This implementation is targeted for interactive use in search engines. A search UI usually has the first few results shown in ranked order, with the option to go to the next few results. This UI is intended to show the first three ranked sentences at the top of an entry with the theme words highlighted. Users are not forgiving of mistakes in these situations. The first result is much more important than the second, and so forth. People rarely click through to the second page.

The measures of effectiveness are formulated with this in mind. We used three:

A variant of Mean Reciprocal Rank (MRR).

"Rating" is a measure we created to model the user's behavior in a summarization UI. Our MRR variant and Rating are defined in the next Post.

Non-zero counts whether the algorithm placed any recommendations in the top three. "Did we even hit the dartboard?"

A separate problem is that sentences with more words can dominate the results. Sentence Length measures the length of the highest rated sentences. In this chart "Inverse Length" measures how well the algorithm countered the effects of sentence length.

Overall Comparison Chart

Key to algorithm names: "binary_normal" means that "binary" was used to create each cell, while "normal" multiplied each term vector with the mean normalized term vector. If there is no second key, the global value was 1. See post #1 for the full list of algorithms.

This display is a normalized version of the mean results for all 24 algorithm pairs, with four different measures. In all four, higher is better. "Inverse Length" means "how well it suppresses the length effect", "Rating" is the rating algorithm described above, "MRR" is our variant implementation of Mean Reciprocal Rank, and >0 is the number of results where any of the first three were in the first three sentences. None of these are absolutes, and the scales do not translate between measures. They simply show a relative ranking for each algorithm pair in the four measures: compare green to green, etc. The next post gives the detailed measurements in real units.

Grand Revelation

The grand revelation is: always normalize the term vector! All 5 local algorithms worked best with normal as the global algorithm. The binary function truncates the term counts to 1. Binary plus normalizing the term vector was by far the best in all three success measurements, and was middling in counteracting sentence length. AugNorm + normal was the highest achiever among the combinations that compensate well for sentence length. TF + normal was the best overall for sentence length, but was only average for the three effectiveness measures.

The Experiment

This analysis evaluates many variants of the LSA algorithm against some measurements appropriate for an imaginary document summarization UI. This UI displays the most important two sentences with the important theme words highlighted. The measurements try to match the expectations of the user.

Supervised vs. Unsupervised Learning

Machine Learning algorithms are classified as supervised, unsupervised, and semi-supervised.
A supervised algorithm creates a model (usually statistical) from training data, then applies test data against the model. An unsupervised algorithm is applied directly against test data without a pre-created model. A semi-supervised algorithm uses tagged and untagged data; this works surprisingly well in a lot of contexts.

The Data

We are going to use unsupervised learning. The test data is a corpus of newspaper articles called Reuters-21578, which was collected and published by the Reuters news agency to assist text analysis research. This dataset is not really tagged, but is appropriate for this experiment. Newspaper articles are written in a particular style which is essentially pre-summarized. In a well-written newspaper article, the first sentence (the lede) is the most important sentence, and the second sentence is complementary and usually shares few words with the first. The rest of the article is usually structured in order from abstraction to detail, called the Inverted Pyramid form. And, newspaper articles are the right length to summarize effectively.

Example Data

We limited the tests to articles which were from 15-75 sentences long. The entire corpus is 21 thousand articles. This limits the test space to just under 2000. Here is a sample newspaper article:

26-FEB-1987 15:26:26.78

DEAN FOODS <DF> SEES STRONG 4TH QTR EARNINGS

Dean Foods Co expects earnings for the fourth quarter ending May 30 to exceed those of the same year-ago period, Chairman Kenneth Douglas told analysts. In the fiscal 1986 fourth quarter the food processor reported earnings of 40 cts a share. Douglas also said the year's sales should exceed 1.4 billion dlrs, up from 1.27 billion dlrs the prior year. He repeated an earlier projection that third-quarter earnings "will probably be off slightly" from last year's 40 cts a share, falling in the range of 34 cts to 36 cts a share. Douglas said it was too early to project whether the anticipated fourth quarter performance would be "enough for us to exceed the prior year's overall earnings" of 1.53 dlrs a share. In 1988, Douglas said Dean should experience "a 20 pct improvement in our bottom line from effects of the tax reform act alone."

President Howard Dean said in fiscal 1988 the company will derive benefits of various dairy and frozen vegetable acquisitions from Ryan Milk to the Larsen Co. Dean also said the company will benefit from its acquisition in late December of Elgin Blenders Inc, West Chicago. He said the company is a major shareholder of E.B.I. Foods Ltd, a United Kingdom blender, and has licensing arrangements in Australia, Canada, Brazil and Japan. "It provides an entry to McDonalds Corp <MCD> we've been after for years," Douglas told analysts. Reuter

As you can see, the text matches the concept of the inverted pyramid. The first two sentences are complementary, have no repeated words, and few words in common. Repeated concepts are described with different words: "Dean Foods Co" in the first sentence is echoed as "the food processor" in the second, while "expects earnings" is matched by "reported earnings". This style seems real-world enough to be good test data for this algorithm suite. We did not comb the data for poorly structured articles or garbled text.

The Code

There are two bits of code involved: Singular Value Decomposition (explained previously) and Matrix "Regularization", or "Conditioning". The latter refers to applying non-linear algorithms to the document-term matrix which make the data somewhat more amenable to analysis. Several algorithms have been investigated for this purpose.

The raw term counts data supplied by a document/term matrix may not always be the right way to approach the data. Matrix Regularization algorithms are functions which use the entire dataset to affect each cell in the matrix. The contents of a document/term matrix after regularization are referred to as "term weights".

There are two classes of algorithm for creating term weights. Local algorithms alter each cell, while global algorithms create a global vector of values per term that are applied to all documents with that term, and likewise a global vector of values per document which is applied to all terms in that document. Local algorithms include term frequency (tf) which uses the raw matrix, binary which replaces each term count greater than one with a one, and some others which find a unique value for a cell based on the document and term vectors which cross at that cell.

Global algorithms for term vectors include normalizing the term vector and finding the inverse document frequency (idf) of the term. The LSA implementation includes implementations of these local and global algorithms, and any pair can be used together. Thus, tf-idf is achieved by combining the local tf algorithm and the global idf algorithm. The literature recommends a few of these combinations (tf-idf and log-entropy) as the most effective, but this test found a few other combinations which were superior. For document vectors, cosine normalization and a new "pivoted cosine" normalization are recommended for counteracting the dominance of longer sentences. These are not yet implemented. Existing combinations of local and term vector algorithms do a good job of suppressing sentence length problems. We will see later on that:

one of the term vector algorithms is by far the best at everything,

one of the local algorithms is the overall winner, and

one of the others does a fantastic job at sentence length problems but is an otherwise average performer.
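To show how a local and a global algorithm combine into term weights, here is a small sketch. It is not the code from the LSA toolkit. The formulas for "normal" (one over the square root of the sum of squared raw counts for the term) and "idf" (log2 of documents over document frequency, plus one) are the textbook definitions as I understand them, and are assumptions about what the toolkit does. All function names are made up.

```python
import math


def local_tf(count):
    """Local: the raw term count."""
    return float(count)


def local_binary(count):
    """Local: 1 if the term occurs at all, else 0."""
    return 1.0 if count > 0 else 0.0


def global_normal(column, n_docs):
    """Global: 1 / sqrt(sum of squared raw counts) for this term."""
    total = sum(c * c for c in column)
    return 1.0 / math.sqrt(total) if total else 0.0


def global_idf(column, n_docs):
    """Global: log2(number of documents / document frequency) + 1."""
    df = sum(1 for c in column if c > 0)
    return math.log2(n_docs / df) + 1.0 if df else 0.0


def term_weights(counts, local, global_):
    """Apply a local function per cell and a global weight per term column."""
    n_docs, n_terms = len(counts), len(counts[0])
    g = [global_([row[t] for row in counts], n_docs) for t in range(n_terms)]
    return [[local(row[t]) * g[t] for t in range(n_terms)] for row in counts]


if __name__ == "__main__":
    counts = [[2, 0], [1, 3]]  # rows are documents (sentences), columns are terms
    print(term_weights(counts, local_binary, global_normal))
    print(term_weights(counts, local_tf, global_idf))
```

Passing the two functions as arguments is what lets any local algorithm pair with any global one; tf-idf is just the local_tf and global_idf pair.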

The Result

The above article gave the following result for the 'binary' algorithm:

Because the analyzer does not remove extraneous words, they form the most common theme words. But notice that "Douglas said" is a common phrase and these two words are in the top 5. Theme words are not simply common words; they are words which are used together. Theme sentences are effectively those with the most and strongest theme words.

The Summarizer tool can return the most important sentences, and can highlight the most important theme words. This can be used to show highlighted snippets in a UI.

Document Summarization with LSA

This is a several-part series on document summarization using Latent Semantic Analysis (LSA). I wrote a document summarizer and did an exhaustive measurement pass using it to summarize newspaper articles from the first Reuters corpus. The code is structured as a web service in Solr, using Lucene for text analysis and the OpenNLP package for tuning the algorithm with Parts-of-Speech analysis.

Introduction

Document summarization is about finding the "themes" in a document: the important words and sentences that contain the core concepts. There are many algorithms for document summarization. This algorithm uses Latent Semantic Analysis, which uses linear algebra to analyze how words and sentences are used in common. LSA is based on the "bag of words" concept of "term vectors", or a list of all words and how often each is used in each document. LSA uses Singular Value Decomposition (SVD) to tease out which words are used the most with other words, and which sentences use the most theme words together. Document Summarization with LSA uses SVD to give us main and secondary sentences which have the strongest collections of theme words and yet a minimal number of theme words in common.

Every document has words which express the themes of the document. These words are frequent, but they are also used together in sentences. We want to find the most important sentences and words; the main sentences should be shown as the summary, and the most important words are good search words for this document.

Orthogonal Sentences

A key idea is that the most important and second most important sentences in a document are independent: they tend to share few words. The most important sentence in a document expresses the main theme of the document, and the second most important uses other theme words to elaborate on the main sentence. When we express the sentences and words in a bag-of-words matrix, SVD can analyze the sentences and words in relation to each other, by how the terms are used together in sentences. It creates a sorted list of documents which are as orthogonal as possible, which means that their collective theme words are as different as possible.

Note that since this technique treats documents and terms symmetrically, it also creates a sorted list of terms by how important they are; this makes for a good tag cloud.
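For readers who want to see the orthogonal ranking happen, here is a tiny pure-Python sketch. It does not use a real SVD routine. Instead it finds the sentence-side singular vectors one at a time by power iteration on A times A-transpose, removing each one before looking for the next, and picks the sentence with the largest component in each vector. That selection rule is one common choice and is an assumption here, not necessarily what the toolkit does. A real implementation would call a proper SVD library.

```python
import math


def top_sentences(matrix, how_many=2, iterations=200):
    """Pick one sentence (row index) per leading singular vector."""
    n = len(matrix)
    # Gram matrix B = A * A^T: its eigenvectors are the sentence-side singular vectors.
    b = [[sum(x * y for x, y in zip(matrix[i], matrix[j])) for j in range(n)]
         for i in range(n)]
    picked = []
    for _ in range(how_many):
        v = [1.0] * n
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: the eigenvalue that goes with v.
        eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        picked.append(max(range(n), key=lambda i: abs(v[i])))
        # Deflate: remove this direction so the next pass finds an orthogonal one.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return picked


if __name__ == "__main__":
    # Rows are sentences, columns are terms. Sentences 0 and 1 share words;
    # sentence 2 uses a word of its own.
    print(top_sentences([[2, 1, 0], [1, 1, 0], [0, 0, 1]]))
```

In the toy matrix the second pick is the sentence that shares no words with the first, which is the "orthogonal sentences" effect in miniature.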

Example

Here is an example of the first two sentences from a financial newswire article.

Dean Foods Co expects earnings for the fourth quarter ending May 30 to exceed those of the same year-ago period, Chairman Kenneth Douglas told analysts. In the fiscal 1986 fourth quarter the food processor reported earnings of 40 cts a share.

The first sentence is long, and expresses the theme of the article. The second sentence elaborates on the first, and shares few real words with it: only earnings and fourth quarter. Note that in order to avoid repeating the company name, and to give more information, the second sentence refers to Dean Foods as the food processor.
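The overlap between the two sentences is easy to check by building a bag-of-words term vector for each. This is a minimal sketch: the tokenizer and the tiny stopword list are my own assumptions, not the Lucene or OpenNLP analysis used in the real summarizer.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list, just enough for this example.
STOPWORDS = {"a", "the", "of", "in", "for", "to", "those", "same"}

FIRST = ("Dean Foods Co expects earnings for the fourth quarter ending May 30 "
         "to exceed those of the same year-ago period, Chairman Kenneth Douglas "
         "told analysts.")
SECOND = ("In the fiscal 1986 fourth quarter the food processor reported "
          "earnings of 40 cts a share.")


def term_vector(sentence):
    """Bag of words: lowercase terms mapped to how often each occurs."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def shared_terms(a, b):
    """Terms that occur in both sentences."""
    return set(term_vector(a)) & set(term_vector(b))


if __name__ == "__main__":
    print(sorted(shared_terms(FIRST, SECOND)))
```

Note that "Foods" and "food" count as different terms here because the sketch does no stemming.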

Further Reading

This is an important paper on using SVD to summarize documents. It appears to be the original proposal for this technique: