NLP

A Better Indexing Method for Search Engine Using Reference Resolution

Traditional indexing is word-based: a document is treated as a collection of words and indexed by word frequency. This project aims not only to index words but also to take the relationships between them into account, which is done through reference resolution. Since the implementation of indexing is not part of this course, this project focuses on reference resolution.

The design presented here is based on the view that nominal expressions referring to the same real or imaginary entity belong to a co-referent set: a set of expressions that all refer to the same thing. The task of reference resolution is to determine which co-referent set a given anaphor belongs to.

Pronouns are a syntactically defined set, consisting of the words I, you, he, she, it, we, they plus their accusative and genitive forms, as in them and their respectively; the program is also designed to handle so-called lexical anaphora, by which we mean reflexive forms such as itself and themselves and reciprocal forms like each other.
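The word classes listed above can be sketched as simple lookup tables. This is a minimal illustration, not the program's actual code: the names PERSONAL, REFLEXIVE, RECIPROCAL and classify_anaphor are hypothetical, and "one another" is an assumed addition to the reciprocal forms.

```python
from typing import Optional

# Hypothetical lookup tables; the word lists follow the description in the text.
PERSONAL = {
    "nominative": {"i", "you", "he", "she", "it", "we", "they"},
    "accusative": {"me", "you", "him", "her", "it", "us", "them"},
    "genitive": {"my", "your", "his", "her", "its", "our", "their"},
}
REFLEXIVE = {"myself", "yourself", "himself", "herself", "itself",
             "ourselves", "yourselves", "themselves"}
# "one another" is an assumption added for illustration.
RECIPROCAL = {"each other", "one another"}


def classify_anaphor(expression: str) -> Optional[str]:
    """Return 'pronoun', 'reflexive', 'reciprocal', or None for anything else."""
    # Lowercase and collapse whitespace so "Each  other" matches "each other".
    key = " ".join(expression.lower().split())
    if key in REFLEXIVE:
        return "reflexive"
    if key in RECIPROCAL:
        return "reciprocal"
    if any(key in forms for forms in PERSONAL.values()):
        return "pronoun"
    return None
```

Keeping reflexive and reciprocal forms apart from the personal pronouns allows a later step to treat lexical anaphora differently if needed.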

Not all pronouns refer to things. We will use the term referential pronoun for those that do, and the term pleonastic pronoun for those that do not, as in the sentence “It is raining.”
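One possible way to flag pleonastic uses of "it" is pattern matching on the sentence. The sketch below is an assumption for illustration only; the text does not say how the program makes this decision. The word lists and the name is_pleonastic_it are hypothetical and far from complete.

```python
import re

# Hypothetical, deliberately small word lists for illustration only.
WEATHER_VERBS = r"(?:rain|snow|hail|thunder|drizzl)\w*"
MODAL_ADJECTIVES = r"(?:necessary|possible|likely|important|certain)"

PLEONASTIC_PATTERNS = [
    # Weather constructions: "It is raining", "It snowed".
    re.compile(rf"\bit\s+(?:(?:is|was|has\s+been)\s+)?{WEATHER_VERBS}\b",
               re.IGNORECASE),
    # Modal constructions: "It is possible that ...", "It is important to ...".
    re.compile(rf"\bit\s+(?:is|was)\s+{MODAL_ADJECTIVES}\s+(?:that|to)\b",
               re.IGNORECASE),
]


def is_pleonastic_it(sentence: str) -> bool:
    """Return True if the sentence matches one of the pleonastic patterns."""
    return any(p.search(sentence) for p in PLEONASTIC_PATTERNS)
```

A pronoun flagged this way would simply be skipped when discourse referents are constructed.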

For each nominal expression that is assumed to be referential, we will construct a discourse referent. This is a data structure that is used to maintain various pieces of information about that nominal expression.
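A discourse referent of this kind might look like the record below. The field names are hypothetical; the text only says that the structure holds "various pieces of information" about the nominal expression, so the selection here is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscourseReferent:
    """Hypothetical record for one referential nominal expression."""
    text: str                 # surface form, e.g. "the parser"
    sentence_index: int       # sentence in which the expression occurs
    token_index: int          # position of the head word in that sentence
    grammatical_role: str     # e.g. "subject" or "object"
    is_pronoun: bool = False
    coref_set_id: Optional[int] = None  # unset until the expression is resolved


# Example: a pronoun that has not yet been assigned to a co-referent set.
referent = DiscourseReferent("it", sentence_index=1, token_index=0,
                             grammatical_role="subject", is_pronoun=True)
```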

First, we need to identify all the referential nominal expressions in the input text and construct a discourse referent for each. Second, for each discourse referent which corresponds to a pronoun, we need to identify which co-referent set it belongs to.

For the second step, this program primarily uses the algorithm of Kennedy and Boguraev (1996), which is based on Lappin and Leass (1994), to resolve each pronoun to a co-referent set.
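Algorithms in this family rank candidate antecedents by a salience score and pick the highest-scoring candidate that agrees with the pronoun. The sketch below illustrates the general idea only. The weights, the Mention fields and the halving of salience per sentence of distance are simplifying assumptions, not the values or the code used by this program.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative weights (assumption): not taken from the program described here.
WEIGHTS = {"recency": 100, "subject": 80, "object": 50, "oblique": 40}


@dataclass
class Mention:
    text: str
    sentence_index: int
    role: str      # "subject", "object", or "oblique"
    number: str    # "sg" or "pl"
    gender: str    # "m", "f", or "n"


def salience(candidate: Mention, pronoun_sentence: int) -> float:
    """Sum the factor weights, then halve the total per sentence of distance."""
    score = WEIGHTS["recency"] + WEIGHTS.get(candidate.role, 0)
    distance = pronoun_sentence - candidate.sentence_index
    return score / (2 ** distance)


def resolve(pronoun: Mention, candidates: List[Mention]) -> Optional[Mention]:
    """Pick the most salient earlier candidate that agrees with the pronoun."""
    compatible = [
        c for c in candidates
        if c.sentence_index <= pronoun.sentence_index
        and c.number == pronoun.number
        and c.gender == pronoun.gender
    ]
    if not compatible:
        return None
    return max(compatible, key=lambda c: salience(c, pronoun.sentence_index))


candidates = [
    Mention("the parser", 0, "subject", "sg", "n"),
    Mention("the text", 0, "object", "sg", "n"),
    Mention("the users", 0, "oblique", "pl", "n"),
]
pronoun = Mention("it", 1, "subject", "sg", "n")
antecedent = resolve(pronoun, candidates)
```

In this example "the users" is filtered out by number agreement, and the subject "the parser" outranks the object "the text".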

3.3.1 Parse:

Any reference resolution algorithm requires parsed text. Kennedy and Boguraev (1996) use the LingSoft parser. I chose the Connexor parser because it provides Functional Dependency Grammar (FDG) analysis; this step is therefore also referred to as FDG in this program. The Connexor parser is accessed through its web interface, so Internet access is required.

This step runs only when the “FDG” button is clicked.

3.3.2 Preprocess text:

This step builds the internal tree representation from the parsed text.
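Building a tree from dependency-parsed text can be sketched as follows. The row format (token index, word, head index, relation) is a simplified, hypothetical stand-in and not the actual Connexor output; Node and build_tree are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    index: int
    word: str
    relation: str  # dependency label linking this word to its head
    children: List["Node"] = field(default_factory=list)


# Hypothetical row format: (token index, word, head index, relation).
# A head index of 0 marks the root of the sentence.
Row = Tuple[int, str, int, str]


def build_tree(rows: List[Row]) -> Optional[Node]:
    """Create one node per token, then attach each node to its head."""
    nodes: Dict[int, Node] = {i: Node(i, word, rel) for i, word, _, rel in rows}
    root = None
    for i, _, head, _ in rows:
        if head == 0:
            root = nodes[i]
        else:
            nodes[head].children.append(nodes[i])
    return root


rows = [
    (1, "The", 2, "det"),
    (2, "parser", 3, "subj"),
    (3, "reads", 0, "main"),
    (4, "it", 3, "obj"),
]
tree = build_tree(rows)
```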

This step runs only when the “Pre” button is clicked.

3.3.3 Build discourse referents

From that tree structure, nominal expressions will be identified.
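Identifying nominal expressions can be sketched as a depth-first walk that collects nodes tagged as nouns or pronouns. The node shape and the tag names "N" and "PRON" are assumptions made for this illustration, and only head words are returned rather than full phrases.

```python
from typing import Dict, Iterator, List

# Hypothetical node shape: {"word": str, "pos": str, "children": [nodes]}.
NOMINAL_TAGS = {"N", "PRON"}  # assumed tag names for nouns and pronouns


def iter_nodes(node: Dict) -> Iterator[Dict]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node["children"]:
        yield from iter_nodes(child)


def nominal_expressions(root: Dict) -> List[str]:
    """Collect the head words of all nominal nodes in the tree."""
    return [n["word"] for n in iter_nodes(root) if n["pos"] in NOMINAL_TAGS]


tree = {
    "word": "reads", "pos": "V", "children": [
        {"word": "parser", "pos": "N", "children": [
            {"word": "The", "pos": "DET", "children": []},
        ]},
        {"word": "it", "pos": "PRON", "children": []},
    ],
}
nominals = nominal_expressions(tree)
```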

This step runs only when the “DR” button is clicked.

3.3.4 Determine coreference relationships

From those nominal expressions, coreference sets will be built.
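Once each anaphor has been linked to an antecedent, the coreference sets can be obtained by merging linked expressions. The sketch below uses a union-find structure for this. It is one possible approach, not necessarily the one used by this program; build_coreference_sets is a hypothetical name, and mentions are assumed to be unique strings.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_coreference_sets(mentions: List[str],
                           links: List[Tuple[str, str]]) -> List[List[str]]:
    """Group mentions into sets given (anaphor, antecedent) links."""
    parent: Dict[str, str] = {m: m for m in mentions}

    def find(m: str) -> str:
        # Follow parent pointers to the representative, shortening the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for anaphor, antecedent in links:
        parent[find(anaphor)] = find(antecedent)

    groups = defaultdict(list)
    for m in mentions:
        groups[find(m)].append(m)
    return list(groups.values())


mentions = ["the parser", "it", "the text", "itself"]
links = [("it", "the parser"), ("itself", "it")]
coref_sets = build_coreference_sets(mentions, links)
```

Here "itself" is linked only to "it", yet it ends up in the same set as "the parser" because the links are merged transitively.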

This step runs only when the “Cof” button is clicked.

3.3.5 Output results

Both the final output and the output of each step are displayed in a window to make debugging easier.