Topics

topic model

We present an algorithm for Query-Chain Summarization based on a new LDA topic model variant.

Page 1, “Abstract”

We introduce a new algorithm to address the task of Query-Chain Focused Summarization, based on a new LDA topic model variant, and present an evaluation which demonstrates it improves on these baselines.

Page 2, “Introduction”

As evidenced since (Daume and Marcu, 2006), Bayesian techniques have proven effective at this task: we construct a latent topic model on the basis of the document set and the query.

Page 3, “Previous Work”

This topic model effectively serves as a query expansion mechanism, which helps assess the relevance of individual sentences to the original query.

Page 3, “Previous Work”

In recent years, three major techniques have emerged to perform multi-document summarization: graph-based methods such as LexRank (Erkan and Radev, 2004) for multi-document summarization and Biased-LexRank (Otterbacher et al., 2008) for query-focused summarization; language-model methods such as KLSum (Haghighi and Vanderwende, 2009); and variants of KLSum based on topic models such as BayesSum (Daume and Marcu, 2006) and TopicSum (Haghighi and Vanderwende, 2009).

Page 3, “Previous Work”

TopicSum uses an LDA-like topic model (Blei et al., 2003).

Page 3, “Previous Work”

We developed a novel topic model to identify words that are associated with the current query and not shared with the previous queries.

Page 7, “Algorithms”

Figure 3: Plate model for our topic model

Page 7, “Algorithms”

We implemented inference over this topic model using Gibbs Sampling (we distribute the code of the sampler together with our dataset).

Page 7, “Algorithms”
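The paper's own sampler (for its LDA variant) is distributed with the dataset and is not reproduced here. As background, a minimal collapsed Gibbs sampler for plain LDA — not the paper's variant — can be sketched as follows; all names and hyperparameter values are illustrative:

```python
import numpy as np

def gibbs_lda(docs, V, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for plain LDA.

    docs: list of documents, each a list of word ids in [0, V).
    Returns per-token topic assignments and the count matrices."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))   # document-topic counts
    nkw = np.zeros((K, V))   # topic-word counts
    nk = np.zeros(K)         # per-topic token totals
    z = []                   # topic assignment for every token
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = z[d][i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # Full conditional p(z_i = k | rest), up to normalization.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return z, ndk, nkw
```

The paper's variant adds query-specific and shared topics on top of this basic scheme, so its conditional distributions differ, but the resample-one-token-at-a-time loop is the same.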

After the topic model is applied to the current query, we apply KLSum only on words that are assigned to the new content topic.

Page 7, “Algorithms”
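The restriction to the new content topic amounts to a filtering step before summarization. A minimal sketch, with hypothetical names (the paper does not give this code):

```python
def new_content_tokens(sentences, assignments, new_content_topic):
    """Keep only the tokens whose sampled topic is the new-content
    topic; KLSum is then run over these filtered tokens only.

    sentences:   list of token lists.
    assignments: per-sentence list of topic ids, aligned with tokens.
    """
    filtered = []
    for sent, topics in zip(sentences, assignments):
        filtered.append([w for w, t in zip(sent, topics) if t == new_content_topic])
    return filtered
```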

When running this topic model on our dataset, we observe that the mean size of Dc was 978 words, with 375 unique words.

unigram

Appears in 6 sentences as: Unigram (1) unigram (6)

In Query-Chain Focused Summarization

KLSum adopts a language model approach to compute relevance: the documents in the input set are modeled as a distribution over words (the original algorithm uses a unigram distribution over the bag of words in documents D).

Page 3, “Previous Work”

KLSum is a sentence extraction algorithm: it searches for a subset of the sentences in D with a unigram distribution as similar as possible to that of the overall collection D, but with a limited length.

Page 3, “Previous Work”
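The greedy search described above can be sketched as follows; this is an illustrative implementation of the KLSum idea, not the authors' code, and the smoothing constant is an assumption:

```python
import math
from collections import Counter

def unigram_dist(tokens):
    """Maximum-likelihood unigram distribution over a bag of words."""
    c = Counter(tokens)
    n = sum(c.values())
    return {w: c[w] / n for w in c}

def kl_divergence(p, q, vocab, eps=1e-9):
    """KL(P || Q), smoothing unseen words in Q with a small epsilon."""
    return sum(p[w] * math.log(p[w] / max(q.get(w, 0.0), eps))
               for w in vocab if p.get(w, 0.0) > 0)

def klsum(sentences, max_words):
    """Greedily add the sentence that minimizes the KL divergence
    between the document distribution and the summary distribution,
    until the word budget is reached."""
    doc_tokens = [w for s in sentences for w in s]
    p_doc = unigram_dist(doc_tokens)
    vocab = set(doc_tokens)
    summary, sum_tokens = [], []
    remaining = list(sentences)
    while remaining and len(sum_tokens) < max_words:
        best, best_kl = None, float("inf")
        for s in remaining:
            q = unigram_dist(sum_tokens + list(s))
            kl = kl_divergence(p_doc, q, vocab)
            if kl < best_kl:
                best, best_kl = s, kl
        summary.append(best)
        sum_tokens += list(best)
        remaining.remove(best)
    return summary
```

Exhaustively searching all subsets is intractable, which is why KLSum-style systems typically use this greedy approximation.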

After the words are classified, the algorithm uses a KLSum variant to find the summary that best matches the unigram distribution of topic specific words.

Page 3, “Previous Work”

When constructing a summary, we update the unigram distribution of the constructed summary so that it includes a smoothed distribution of the previous summaries in order to eliminate redundancy between the successive steps in the chain.

Page 6, “Algorithms”

For example, when we summarize the documents that were retrieved in response to the first query, we calculate the unigram distribution in the same manner as we did in Focused KLSum; but for the second query, we calculate the unigram distribution as if all the sentences we selected for the previous summary were selected for the current query too, with a damping factor.

Page 6, “Algorithms”
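The damped update can be sketched as mixing the previous summaries' counts, scaled down, into the current counts before normalizing. Function and parameter names are illustrative, not the paper's:

```python
from collections import Counter

def damped_unigram(current_tokens, prev_summary_tokens, damping=0.9):
    """Unigram estimate that treats each word of the previous
    summaries as a fractional (damped) occurrence, so content
    already covered earlier in the chain is penalized as redundant."""
    counts = Counter(current_tokens)
    for w in prev_summary_tokens:
        counts[w] += damping
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

With damping = 0 this reduces to the plain Focused-KLSum distribution; with damping = 1 the previous summaries count as fully selected.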

In this variant, the unigram distribution estimate of word X is computed as:
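The formula itself is not reproduced in this excerpt. One plausible form consistent with the damping-factor description above, with a damping factor $\lambda$ and previous summaries $S_1,\dots,S_{i-1}$ (the notation is our assumption, not the paper's):

```latex
\hat{P}_i(x) \;=\;
  \frac{\operatorname{count}_{C_i}(x) \;+\; \lambda \sum_{j<i} \operatorname{count}_{S_j}(x)}
       {|C_i| \;+\; \lambda \sum_{j<i} |S_j|}
```

where $C_i$ is the summary under construction for the current query and $\operatorname{count}(x)$ is the number of occurrences of $x$.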