«'Root', 'stem' and 'base' are all terms used in the literature to designate the part of the word that remains when all affixes have been removed. … A root is a form which is not further analysable, either in terms of derivational or inflectional morphology. It is that part of a word form that remains when all inflectional and derivational affixes have been removed. A root is the basic part always present in a lexeme. In the form untouchables, for example, the root is touch … A stem is of concern only when dealing with inflectional morphology. It may be —but need not be— complex, either in that it contains derivational affixes (as does govern·ment) or in that it contains more than one root (as does red·skin). Inflectional, but not derivational affixes are added to it; it is the part of the word form which remains when all inflectional affixes have been removed. In the form untouchables, the stem is untouchable, although in the form touched the stem is touch; in the form wheelchairs the stem is wheelchair, even though the stem contains two roots» (p. 20).
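As a toy illustration of the distinction (not from the source), the two notions can be mimicked in Python for this single example; the affix lists below are illustrative assumptions, not a real morphological analyser:

```python
# Toy root/stem illustration for "untouchables" only; the affix lists
# are hypothetical, not a general morphological analyser.
INFLECTIONAL_SUFFIXES = ["s"]                       # plural marker
DERIVATIONAL_AFFIXES = [("un", ""), ("", "able")]   # (prefix, suffix) pairs

def stem_of(word: str) -> str:
    """Remove inflectional suffixes only -> the stem."""
    for suf in INFLECTIONAL_SUFFIXES:
        if word.endswith(suf):
            return word[: -len(suf)]
    return word

def root_of(word: str) -> str:
    """Also strip derivational affixes from the stem -> the root."""
    form = stem_of(word)
    for prefix, suffix in DERIVATIONAL_AFFIXES:
        if prefix and form.startswith(prefix):
            form = form[len(prefix):]
        if suffix and form.endswith(suffix):
            form = form[: -len(suffix)]
    return form

print(stem_of("untouchables"))  # untouchable  (inflection removed)
print(root_of("untouchables"))  # touch        (derivation removed too)
```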

«For IR purposes, it doesn't usually matter whether the stems generated are genuine words or not (thus, "computation" might be stemmed to "comput"), provided that (a) different words with the same 'base meaning' are conflated to the same form, and (b) words with distinct meanings are kept separate. An algorithm which attempts to convert a word to its linguistically correct root ("compute" in this case) is sometimes called a lemmatiser.»
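As a sketch of how such conflation looks in practice, here is NLTK's Porter stemmer (the choice of algorithm and library is an assumption; the quotation names neither), which maps this whole family of words to the same non-word stem:

```python
# Stemming sketch with NLTK's Porter stemmer (assumes `pip install nltk`).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computation", "computations", "computing", "compute"]:
    print(word, "->", stemmer.stem(word))
# computation  -> comput
# computations -> comput
# computing    -> comput
# compute      -> comput
```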

«Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.»
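A minimal sketch of the saw example with NLTK's WordNet lemmatizer (again an assumption; the quotation does not name a tool), where the returned lemma depends on the part of speech supplied:

```python
# Lemmatization sketch with NLTK's WordNet lemmatizer (assumes the
# 'wordnet' data package: nltk.download('wordnet')).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("saw", pos="v"))  # see  (verb reading)
print(lemmatizer.lemmatize("saw", pos="n"))  # saw  (noun reading)
```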

Statistical methods

«The rule-based methods used for the POS tagging problem began to be replaced by stochastic models in the early 1990s. The major drawback of the oldest rule-based systems was the need to manually compile the rules, a process that requires linguistic background. Moreover, these systems are not robust in the sense that they must be partially or completely redesigned when a change in the domain or in the language occurs. Later on, a new paradigm, statistical natural language processing, emerged and offered solutions to these problems. As the field became more mature, researchers began to abandon the classical strategies and developed new statistical models.
Several people today argue that statistical POS tagging is superior to rule-based POS tagging. The main factor that enables us to use statistical methods is the availability of a rich repertoire of data sources: lexicons (may include frequency data and other statistical data), large corpora (preferably annotated), bilingual parallel corpora, and so on. By using such resources, we can learn the usage patterns of the tag sequences and make use of this information to tag new sentences» (p. 240).
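As a hedged sketch of this corpus-driven approach (the passage prescribes no particular model), NLTK's n-gram taggers can learn tag-sequence patterns from an annotated corpus and back off through simpler models for unseen material:

```python
# Corpus-driven POS tagging sketch with NLTK (assumes the Penn Treebank
# sample: nltk.download('treebank')). Tag-sequence patterns are learned
# from annotated sentences and applied to new text.
import nltk
from nltk.corpus import treebank

train = treebank.tagged_sents()[:3000]       # annotated training data
t0 = nltk.DefaultTagger("NN")                # last resort: guess noun
t1 = nltk.UnigramTagger(train, backoff=t0)   # per-word tag frequencies
t2 = nltk.BigramTagger(train, backoff=t1)    # tag-sequence patterns
print(t2.tag("The old man saw the dog".split()))
```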

«Tokenization: the process of segmenting running text into words and sentences.
Electronic text is a linear sequence of symbols (characters or words or phrases). Naturally, before any real text processing is to be done, text needs to be segmented into linguistic units such as words, punctuation, numbers, alpha-numerics, etc. This process is called tokenization.
In English, words are often separated from each other by blanks (white space), but not all white space is equal. Both "Los Angeles" and "rock 'n' roll" are individual thoughts despite the fact that they contain multiple words and spaces. We may also need to separate single words like "I'm" into separate words "I" and "am".
Tokenization is a kind of pre-processing in a sense; an identification of basic units to be processed.»
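A minimal tokenization sketch with NLTK (an assumed tool; note that its tokenizer splits the clitic in "I'm" as "'m" rather than expanding it to "am", and treating "Los Angeles" as one unit would need an extra multiword step):

```python
# Tokenization sketch with NLTK (assumes the 'punkt' sentence-model
# data: nltk.download('punkt')).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "I'm in Los Angeles. Rock 'n' roll never dies."
print(sent_tokenize(text))  # two sentence strings
print(word_tokenize(text))  # ['I', "'m", 'in', 'Los', 'Angeles', '.', ...]
```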

Shallow syntactic analysis

Shallow parsing. Partial parsing. Chunking.

Chunks

«I begin with an intuition: when I read a sentence, I read it a chunk at a time. For example, the previous sentence breaks up something like this:
(1) [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]
These chunks correspond in some way to prosodic patterns. It appears, for instance, that the strongest stresses in the sentence fall one to a chunk, and pauses are most likely to fall between chunks. Chunks also represent a grammatical watershed of sorts. The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template.»
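A hedged sketch of this template idea using NLTK's regular-expression chunker; the grammar (an optional determiner, any adjectives, then one or more nouns) and the pre-tagged sentence are illustrative assumptions:

```python
# Chunking sketch: each NP chunk groups a noun with the function words
# around it, roughly matching Abney's fixed template.
import nltk

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # optional det, adjectives, noun(s)
chunker = nltk.RegexpParser(grammar)
tagged = [("I", "PRP"), ("read", "VBP"), ("a", "DT"), ("sentence", "NN"),
          ("a", "DT"), ("chunk", "NN"), ("at", "IN"), ("a", "DT"),
          ("time", "NN")]
print(chunker.parse(tagged))
# (S I/PRP read/VBP (NP a/DT sentence/NN) (NP a/DT chunk/NN)
#    at/IN (NP a/DT time/NN))
```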

Statistical methods

«The application of statistical methods to parsing started in the 1980s, drawing on work in the area of corpus linguistics, inspired by the success of statistical speech recognition, and motivated by some of the perceived weaknesses of parsing systems rooted in the generative linguistics tradition and based solely on hand-built grammars and disambiguation heuristics. In statistical parsing, these grammars and heuristics are wholly or partially replaced by statistical models induced from corpus data. By capturing distributional tendencies in the data, these models can rank competing analyses for a sentence, which facilitates disambiguation, and can therefore afford to impose fewer constraints on the language accepted, which increases robustness. Moreover, since models can be induced automatically from data, it is relatively easy to port systems to new languages and domains, as long as representative data sets are available.
Against this, however, it must be said that most of the models currently used in statistical parsing require data in the form of syntactically annotated sentences —a treebank— which can turn out to be quite a severe bottleneck in itself, in some ways even more severe than the old knowledge acquisition bottleneck associated with large-scale grammar development. Since the range of languages and domains for which treebanks are available is still limited, the investigation of methods for learning from unlabeled data, particularly when adapting a system to a new domain, is therefore an important problem on the current research agenda. Nevertheless, practically all high-precision parsing systems currently available are dependent on learning from treebank data, although often in combination with hand-built grammars or other independent resources» (pp. 263-4).
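As a sketch of inducing such a model from treebank data (NLTK and the Penn Treebank sample are assumptions; the passage describes no specific toolkit), a PCFG can be estimated from the productions read off annotated trees:

```python
# PCFG induction sketch from treebank data with NLTK (assumes
# nltk.download('treebank')): rule probabilities are estimated from
# the relative frequencies of productions in the annotated trees.
import nltk
from nltk.corpus import treebank

productions = []
for tree in treebank.parsed_sents()[:200]:
    productions += tree.productions()

grammar = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)
print(grammar.productions()[:5])   # a few rules with their probabilities
# A parser such as nltk.ViterbiParser(grammar) can then rank the
# competing analyses of a sentence by probability.
```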

«The approach assumes that a particular set of lexical items is in use during the course of a given subtopic discussion and, when the subtopic changes, a significant proportion of the vocabulary changes too. The method assumes three broad categories of lexical items to be found within a text:
(1) words that occur frequently throughout the text, which are often indicative of its main topic(s);
(2) words that are less frequent but more uniform in distribution, which do not provide much information about the divisions between discussions;
(3) groups of words that are ‘clumped’ together with high density in some parts of the text and low density in other parts. These groups of words are indicative of subtopic structure.
The problem of subtopic segmentation is thus the problem of determining where these clusters of words in the third category begin and end» (p. 603).
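NLTK ships an implementation of this method (Hearst's TextTiling); a minimal sketch, assuming the stopwords data package and an input file (the file name below is hypothetical) that contains blank-line paragraph breaks:

```python
# Subtopic segmentation sketch with NLTK's TextTiling implementation
# (assumes nltk.download('stopwords'); the input text must contain
# blank-line paragraph breaks, and "document.txt" is hypothetical).
from nltk.tokenize import TextTilingTokenizer

tt = TextTilingTokenizer()
with open("document.txt") as f:
    segments = tt.tokenize(f.read())   # one string per subtopic segment
for seg in segments:
    print("--- segment boundary ---")
    print(seg[:80])
```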

Centering Theory

«The main idea of centering theory (Grosz et al., 1983; 1995) is that certain entities mentioned in an utterance are more central than others and this imposes constraints on the use of referring expressions and in particular on the use of pronouns. It is argued that the coherence of a discourse depends on the extent to which the choice of the referring expressions conforms to the centering properties» (pp. 607-8).
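A toy sketch of the centering machinery under its usual assumptions (forward-looking centers Cf ranked by grammatical role, the backward-looking center Cb defined as the highest-ranked element of the previous Cf realised in the current utterance, and the standard transition taxonomy); none of this code comes from the source:

```python
# Toy centering-transition classifier (Grosz et al., 1995 taxonomy).
def transition(prev_cf, curr_cf, prev_cb):
    """Classify the transition into Un from Cf(Un-1), Cf(Un), Cb(Un-1)."""
    # Cb(Un): the highest-ranked element of Cf(Un-1) realised in Un
    cb = next((e for e in prev_cf if e in curr_cf), None)
    cp = curr_cf[0] if curr_cf else None        # preferred center Cp(Un)
    same_cb = prev_cb is None or cb == prev_cb
    if cb is not None and cb == cp:
        return cb, "Continue" if same_cb else "Smooth-Shift"
    return cb, "Retain" if same_cb else "Rough-Shift"

# "John saw Mary. He waved to her."  (Cf ranked subject > object)
print(transition(["John", "Mary"], ["John", "Mary"], None))
# ('John', 'Continue')  -> a maximally coherent continuation
```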

Anaphora resolution

«The process of determining the antecedent of an anaphor is called anaphora resolution. For identity-of-reference nominal anaphora, any preceding NP which is coreferential with the anaphor is considered as the correct antecedent . . .
The process of automatic resolution of anaphors consists of the following main stages: (1) identification of anaphors, (2) location of the candidates for antecedents, and (3) selection of the antecedent from the set of candidates on the basis of anaphora resolution factors» (p. 614).
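A toy sketch of the three stages, with gender/number agreement as the only resolution factor and recency as the tie-breaker; both are illustrative assumptions rather than a full algorithm:

```python
# Toy anaphora resolution: (1) identify the anaphor, (2) locate
# candidate antecedent NPs, (3) select one using resolution factors
# (here only agreement, with recency as the tie-breaker).
PRONOUNS = {"he": ("masc", "sg"), "she": ("fem", "sg"), "they": (None, "pl")}

def resolve(pronoun, preceding_nps):
    gender, number = PRONOUNS[pronoun.lower()]       # stage 1: the anaphor
    candidates = [np for np in preceding_nps         # stage 2: candidates
                  if (gender is None or np["gender"] == gender)
                  and np["number"] == number]        # agreement factor
    return candidates[-1] if candidates else None    # stage 3: most recent

nps = [{"text": "Mary", "gender": "fem", "number": "sg"},
       {"text": "John", "gender": "masc", "number": "sg"}]
print(resolve("he", nps))   # selects the NP for "John"
```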