Multilevel Measures of Document Similarity

Abstract

Many applications, such as document summarization, passage retrieval and question answering, require a detailed analysis of semantic relations between terms within and across documents and sentences. Often one has a number of sentences or paragraphs and has to choose the candidate with the highest relevance to the topic or question. An additional requirement may be that the information content of the next candidate differs from that of the sentences already chosen.
Many approaches to information retrieval and document classification model the semantic similarity between documents using the relations between semantic classes of words. They include representing the dimensions of the document vectors with distributional term clusters (?) and expanding the document and query vectors with synonyms and related terms, as discussed in (?). Latent Semantic Analysis (LSA) (?) is one of the best-known dimensionality reduction algorithms. It represents documents as vectors in the space of latent semantic concepts. Latent Dirichlet Allocation (LDA) (?) uses the latent semantic concepts as bottleneck variables in computing the term distributions for documents. These representations capture overall semantic similarity between documents but are less sensitive to differences on the sentence level. Moreover, the methods include all vocabulary terms in their computations, which limits their applicability.
Semantic similarity on the word level is targeted in word sense disambiguation (WSD), e.g. Schütze (?), and in verb classification, e.g. Lin (?). This research has shown that different measures of similarity may be required for different groups of terms, such as nouns and verbs. It is also reasonable to use different notions of similarity for content-bearing general vocabulary words and for named entities. WSD methods usually rely on co-occurrence statistics, whereas verb similarity measures are based on syntactic similarity.
In this project, we propose to use a combination of similarity measures between terms to model document similarity. We divide the vocabulary into general vocabulary terms and named entities and compute a separate similarity score for each group of terms. The overall similarity score is a function of these two scores. In addition, we use statistical co-occurrence as well as syntactic similarity to compute the similarity between the general vocabulary terms.
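To make the two-score idea concrete, the following is a minimal sketch and not the method itself. It assumes that each group of terms is compared with plain cosine similarity over term counts and that the two scores are combined by a weighted sum; the text above only states that the overall score is some function of the two. The names cosine, split_terms, combined_similarity and the parameter weight are hypothetical, and the set of named entities is assumed to be given in advance.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def split_terms(tokens, named_entities):
    """Split a token list into general-vocabulary and named-entity counts."""
    general = Counter(t for t in tokens if t not in named_entities)
    entities = Counter(t for t in tokens if t in named_entities)
    return general, entities


def combined_similarity(tokens_a, tokens_b, named_entities, weight=0.5):
    """Combine the two group scores.

    Assumption: a weighted sum stands in for the unspecified
    combination function; `weight` is a hypothetical parameter.
    """
    gen_a, ent_a = split_terms(tokens_a, named_entities)
    gen_b, ent_b = split_terms(tokens_b, named_entities)
    general_score = cosine(gen_a, gen_b)
    entity_score = cosine(ent_a, ent_b)
    return weight * general_score + (1.0 - weight) * entity_score


if __name__ == "__main__":
    doc_a = "ibm announced a new server".split()
    doc_b = "ibm released a new server".split()
    print(combined_similarity(doc_a, doc_b, named_entities={"ibm"}))
```

In this sketch the general-vocabulary score is a bag-of-words cosine only; the co-occurrence and syntactic similarity between terms described above would replace that component.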