A Numerically Stable Lanczos Text Summarization Algorithm

Aim

The goal of this deliverable is to create a numerically stable version of the Lanczos text summarization method, integrate it into the Yioop search engine and report its ROUGE results.

Overview

I originally proposed working with the text summarization algorithm written by Nick D’Aloisio and sold to Yahoo. Unfortunately, that algorithm is not publicly available. While looking for another non-traditional text summarization algorithm, we came across the Lanczos algorithm.

The idea of using the Lanczos algorithm came from the work of Youn Kim, one of Dr. Pollett’s previous students. The Lanczos method is “a technique that can be used to solve certain large, sparse, symmetric eigenproblems Ax = λx” [GOLUB1996]. As used for summarization, the Lanczos algorithm splits the initial frequency matrix into three matrices: “An orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V” [Kim2010]. However, the Lanczos algorithm has a downside: roundoff errors that accumulate as it runs make it numerically unstable. “The central problem is a loss of orthogonality among the Lanczos vectors that the iteration produces” [GOLUB1996]. In other words, in floating-point arithmetic the vectors that should stay mutually orthogonal gradually stop being so, and the results computed from them become unreliable.
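The loss of orthogonality can be seen on a small example. The following is a minimal sketch in plain Python, not Kim's Java code or the PHP port: it runs the basic Lanczos iteration with no reorthogonalization and reports the largest inner product between two distinct Lanczos vectors, which would be zero in exact arithmetic. The helper names (lanczos_vectors, orthogonality_loss, diag_matvec) and the test matrix, a diagonal matrix with one dominant eigenvalue, are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def lanczos_vectors(matvec, start, steps):
    """Run the plain Lanczos iteration (no reorthogonalization).

    matvec applies a symmetric matrix to a vector. Returns the list of
    Lanczos vectors, which are orthonormal only in exact arithmetic.
    """
    norm = math.sqrt(dot(start, start))
    q = [x / norm for x in start]
    q_prev = [0.0] * len(start)
    beta = 0.0
    vectors = [q]
    for _ in range(steps - 1):
        w = matvec(q)
        alpha = dot(w, q)
        # Three-term recurrence: subtract only the two most recent vectors.
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        beta = math.sqrt(dot(w, w))
        if beta == 0.0:  # exact breakdown: an invariant subspace was found
            break
        q_prev, q = q, [wi / beta for wi in w]
        vectors.append(q)
    return vectors


def orthogonality_loss(vectors):
    """Largest |q_i . q_j| over distinct pairs; 0 means perfectly orthogonal."""
    return max(
        (abs(dot(vectors[i], vectors[j]))
         for i in range(len(vectors))
         for j in range(i)),
        default=0.0,
    )


# Illustrative symmetric matrix: diagonal, with one dominant eigenvalue.
eigenvalues = [float(i) for i in range(1, 100)] + [1000.0]


def diag_matvec(v):
    return [lam * x for lam, x in zip(eigenvalues, v)]


start = [1.0] * len(eigenvalues)
for steps in (3, 30):
    loss = orthogonality_loss(lanczos_vectors(diag_matvec, start, steps))
    print(steps, "steps, largest |q_i . q_j| =", loss)
```

After a few steps the vectors are orthogonal to roughly machine precision; after more steps the measure grows by many orders of magnitude, which is the instability described above.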

Work Performed

The first thing I did was download the Java code from Youn Kim’s Master’s thesis webpage and run it. The Java code summarized his sample text perfectly, but adding just a few lines to that sample text was enough to make it numerically unstable. Once I knew what output to expect from Kim’s Java code, I began converting it to PHP for use in the Yioop search engine. After converting all of the Java code to PHP and working through the subtle differences between the two languages, I was ready to make the algorithm numerically stable and integrate it into Yioop.

While researching ways to make the Lanczos method stable, I came across a paper by Ani Nenkova and Kathleen McKeown called Automatic Summarization, which includes a section on Latent Semantic Analysis (LSA). That section describes an approach by Yihong Gong and Xin Liu, but their approach does not address the Singular Value Decomposition (SVD) problem, so it leaves the Lanczos algorithm numerically unstable.
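For context, LSA-based summarizers start from a term-by-sentence frequency matrix, which is the matrix the SVD then factors. The following is a minimal sketch of building one, assuming naive regular-expression sentence splitting and tokenization and raw term counts. The function name term_sentence_matrix is hypothetical, and the code is not taken from Kim's implementation or from Gong and Liu's paper.

```python
import re
from collections import Counter


def term_sentence_matrix(text):
    """Build a term-by-sentence count matrix (rows: terms, columns: sentences).

    Assumption: sentences end in '.', '!' or '?' followed by whitespace,
    and terms are lowercase runs of letters and apostrophes.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    terms = sorted(set().union(*counts))
    # Counter returns 0 for terms that do not occur in a sentence.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, sentences, matrix


terms, sentences, matrix = term_sentence_matrix(
    "The cat sat. The dog sat. The cat ran."
)
for term, row in zip(terms, matrix):
    print(f"{term:>4} {row}")
```

Each column describes one sentence by the terms it contains. The numerical difficulty discussed in this report arises later, when this matrix is decomposed.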

In addition to the work of Gong and Liu, the Nenkova and McKeown paper mentions work by Ben Hachey, Gabriel Murray and David Reitter. Unfortunately, the Hachey, Murray and Reitter method applies only to summarizing multiple documents into one. They use term frequencies from multiple documents to create the initial matrix, which may allow the downstream calculations in the Lanczos algorithm to remain stable. We are doing single-document summarization, so their idea does not apply to our problem. Moreover, our issue is really about stabilizing the SVD generation itself.

With two strikes against me, I gave it one last try. I turned to a book called Matrix Computations by Gene H. Golub and Charles F. Van Loan. Golub and Van Loan propose a few practical ideas to make the Lanczos procedure viable, but each of them has shortcomings. For example, they propose an idea called Block Lanczos [GOLUB1996]. It tries to create mutually orthogonal matrices, provided none of them is rank-deficient. If any of them is rank-deficient, the matrices produced lose orthogonality and become numerically unstable too.

Results

In conclusion, all three attempts to produce a numerically stable version of the Lanczos method were unsuccessful. Using matrix manipulation to generate summaries is very unconventional, which is what we were looking for, but it is not a viable approach. Thus I will not be able to deliver a summarizer to integrate into Yioop or produce its ROUGE results.