Probabilistic methods are providing new explanatory approaches to fundamental cognitive science questions of how humans structure, process and acquire language. This review examines probabilistic models defined over traditional symbolic structures. Language comprehension and production involve probabilistic inference in such models; and acquisition involves choosing the best model, given innate constraints and linguistic and other input. Probabilistic models can account for the learning and processing of language, while maintaining the sophistication of symbolic models. A recent burgeoning of theoretical developments and online corpus creation has enabled large models to be tested, revealing probabilistic constraints in processing, undermining acquisition arguments based on a perceived poverty of the stimulus, and suggesting fruitful links with probabilistic theories of categorization and ambiguity resolution in perception.

We demonstrate that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence assumptions latent in a vanilla treebank grammar. Indeed, its performance of 86.36% (LP/LR F1) is better than that of early lexicalized PCFG models, and surprisingly close to the current state of the art. This result has potential uses beyond establishing a strong lower bound on the maximum possible accuracy of unlexicalized models: an unlexicalized PCFG is much more compact, easier to replicate, and easier to interpret than more complex lexical models, and the parsing algorithms are simpler, more widely understood, of lower asymptotic complexity, and easier to optimize.

How can the development of ideas in a scientific field be studied over time? We apply unsupervised topic modeling to the ACL Anthology to analyze historical trends in the field of Computational Linguistics from 1978 to 2006. We induce topic clusters using Latent Dirichlet Allocation, and examine the strength of each topic over time. Our methods find trends in the field including the rise of probabilistic methods starting in 1988, a steady increase in applications, and a sharp decline of research in semantics and understanding between 1978 and 2001, possibly rising again after 2001. We also introduce a model of the diversity of ideas, topic entropy, using it to show that COLING is a more diverse conference than ACL, but that both conferences as well as EMNLP are becoming broader over time. Finally, we apply Jensen-Shannon divergence of topic distributions to show that all three conferences are converging in the topics they cover.
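As a concrete illustration of the two diversity measures mentioned in this abstract, the minimal sketch below computes topic entropy and Jensen-Shannon divergence, assuming per-conference topic distributions have already been estimated by LDA; the example distributions here are invented.

```python
import numpy as np

def topic_entropy(p):
    """Shannon entropy (in bits) of a topic distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # ignore zero-probability topics
    return float(-np.sum(p * np.log2(p)))

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Invented per-conference topic distributions for a single year:
acl    = [0.50, 0.30, 0.15, 0.05]
coling = [0.30, 0.30, 0.25, 0.15]
print(topic_entropy(acl), topic_entropy(coling))   # higher entropy = more diverse
print(js_divergence(acl, coling))                  # smaller = more similar topic mix
```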

A significant portion of the world’s text is tagged by readers on social bookmarking websites. Credit attribution is an inherent problem in these corpora because most pages have multiple tags, but the tags do not always apply with equal specificity across the whole document. Solving the credit attribution problem requires associating each word in a document with the most appropriate tags and vice versa. This paper introduces Labeled LDA, a topic model that constrains Latent Dirichlet Allocation by defining a one-to-one correspondence between LDA’s latent topics and user tags. This allows Labeled LDA to directly learn word-tag correspondences. We demonstrate Labeled LDA’s improved expressiveness over traditional LDA with visualizations of a corpus of tagged web pages from del.icio.us. Labeled LDA outperforms SVMs by more than 3 to 1 when extracting tag-specific document snippets. As a multi-label text classifier, our model is competitive with a discriminative baseline on a variety of datasets.
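The core constraint in Labeled LDA is that a document's words may only be assigned to topics corresponding to that document's observed tags. The toy sketch below illustrates just that constraint with a single random assignment pass; it is not the paper's collapsed Gibbs sampler, and the corpus, tags, and counts are invented for illustration.

```python
import random
from collections import defaultdict

# Toy corpus: each document is (words, tags); the tag set doubles as the
# document's allowed topic set (the one-to-one topic/tag correspondence).
docs = [
    (["python", "code", "tutorial"], ["programming"]),
    (["recipe", "pasta", "cooking"], ["food"]),
    (["code", "recipe", "hack"],     ["programming", "food"]),
]

word_topic = defaultdict(lambda: defaultdict(int))   # topic -> word -> count

# One constrained assignment pass: unlike vanilla LDA, each word may only be
# assigned a topic drawn from its own document's tags.  (Labeled LDA itself
# samples these assignments from the collapsed posterior rather than uniformly.)
for words, tags in docs:
    for w in words:
        z = random.choice(tags)
        word_topic[z][w] += 1

for z in sorted(word_topic):
    print(z, dict(word_topic[z]))
```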

This paper describes a system for extracting typed dependency parses of English sentences from phrase structure parses. In order to capture inherent relations occurring in corpus texts that can be critical in real-world applications, many NP relations are included in the set of grammatical relations used. We provide a comparison of our system with Minipar and the Link parser. The typed dependency extraction facility described here is integrated in the Stanford Parser, available for download.

“Everyone knows that language is variable.” This is the bald sentence with which Sapir (1921:147) begins his chapter on language as an historical product. He goes on to emphasize how two speakers’ usage is bound to differ “in choice of words, in sentence structure, in the relative frequency with which particular forms or combinations of words are used”. I should add that much sociolinguistic and historical linguistic research has shown that the same speaker’s usage is also variable (Labov 1966, Kroch 2001:722). However, the tradition of most syntacticians has been to ignore this thing that everyone knows.

This paper presents a novel approach to the unsupervised learning of syntactic analyses of natural language text. Most previous work has focused on maximizing likelihood according to generative PCFG models. In contrast, we employ a simpler probabilistic model over trees based directly on constituent identity and linear context, and use an EM-like iterative procedure to induce structure. This method produces much higher quality analyses, giving the best published results on the ATIS dataset.

This paper presents the first use of a computational model of natural logic—a system of logical inference which operates over natural language—for textual inference. Most current approaches to the PASCAL RTE textual inference task achieve robustness by sacrificing semantic precision; while broadly effective, they are easily confounded by ubiquitous inferences involving monotonicity. At the other extreme, systems which rely on first-order logic and theorem proving are precise, but excessively brittle. This work aims at a middle way. Our system finds a low-cost edit sequence which transforms the premise into the hypothesis; learns to classify entailment relations across atomic edits; and composes atomic entailments into a top-level entailment judgment. We provide the first reported results for any system on the FraCaS test suite. We also evaluate on RTE3 data, and show that hybridizing an existing RTE system with our natural logic system yields significant performance gains.

We present a new part-of-speech tagger that demonstrates the following ideas: (i) explicit use of both preceding and following tag contexts ... In a first-order HMM, the current tag t0 is predicted based on the previous tag t-1 (and the current word); the backward interaction between t0 and the next tag t+1 shows up implicitly later, when t+1 is generated in turn. While unidirectional models are therefore able to capture both ...

I wish to present a codification of syntactic approaches to dealing with ergative languages and argue for the correctness of one particular approach, which I will call the Inverse Grammatical Relations hypothesis. I presume familiarity with the term ‘ergativity’, but, briefly, many languages have ergative case marking, such as Burushaski in (1), in contrast to the accusative case marking of Latin in (2). More generally, if we follow Dixon (1979) and use A to mark the agent-like argument of a transitive verb, O to mark the patient-like argument of a transitive verb, and S to mark the single argument of an intransitive verb, then we can call ergative any subsystem of a language that groups S and O in contrast to A, as shown in (3).

Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
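To make the annealed Gibbs sampling idea concrete, here is a minimal, self-contained sketch: a label sequence is resampled position by position under a toy scoring function that combines a local factor with a non-local label-consistency factor, while the temperature is lowered toward zero. The factors, labels, and lexicon are invented stand-ins, not the CRF features of the actual system.

```python
import math, random

LABELS = ["O", "PER", "LOC"]

def local_score(words, labels, i):
    """Stand-in for a local (CRF-like) factor at position i: a toy lexicon match."""
    lexicon = {"Paris": "LOC", "Smith": "PER"}
    return 1.0 if lexicon.get(words[i], "O") == labels[i] else 0.0

def consistency_score(words, labels):
    """Non-local factor: reward repeated tokens receiving the same label."""
    seen, bonus = {}, 0.0
    for w, y in zip(words, labels):
        if w in seen:
            bonus += 1.0 if seen[w] == y else -1.0
        seen[w] = y
    return bonus

def total_score(words, labels):
    return sum(local_score(words, labels, i) for i in range(len(words))) \
           + consistency_score(words, labels)

def annealed_gibbs(words, sweeps=50):
    labels = [random.choice(LABELS) for _ in words]
    for s in range(sweeps):
        temp = max(0.05, 1.0 - s / sweeps)   # temperature falls toward zero
        for i in range(len(words)):
            # Sample label i from its conditional distribution, sharpened by 1/temp.
            weights = [math.exp(total_score(words, labels[:i] + [y] + labels[i+1:]) / temp)
                       for y in LABELS]
            r, acc = random.uniform(0, sum(weights)), 0.0
            for y, w in zip(LABELS, weights):
                acc += w
                if r <= acc:
                    labels[i] = y
                    break
    return labels

words = "Smith visited Paris and Smith returned".split()
print(list(zip(words, annealed_gibbs(words))))
```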

This paper proposes a new architecture for textual inference in which finding a good alignment is separated from evaluating entailment. Current approaches to semantic inference in question answering and textual entailment have approximated the entailment problem as that of computing the best alignment of the hypothesis to the text, using a locally decomposable matching score. While this formulation is adequate for representing local (word-level) phenomena such as synonymy, it is incapable of representing global interactions, such as that between verb negation and the addition/removal of qualifiers, which are often critical for determining entailment. We propose a pipelined approach where alignment is followed by a classification step, in which we extract features representing high-level characteristics of the entailment problem, and give the resulting feature vector to a statistical classifier trained on development data.

We describe an approach to textual inference that improves alignments at both the typed dependency level and at a deeper semantic level. We present a machine learning approach to alignment scoring, a stochastic search procedure, and a new tool that ﬁnds deeper semantic alignments, allowing rapid development of semantic features over the aligned graphs. Further, we describe a complementary semantic component based on natural logic, which shows an added gain of 3.13% accuracy on the RTE3 test set.

We present a generative distributional model for the unsupervised induction of natural language syntax which explicitly models constituent yields and contexts. Parameter search with EM produces higher quality analyses than previously exhibited by unsupervised systems, giving the best published unsupervised parsing results on the ATIS corpus. Experiments on Penn treebank sentences of comparable length show an even higher F1 of 71% on nontrivial brackets. We compare distributionally induced and actual part-of-speech tags as input data, and examine extensions to the basic model. We discuss errors made by the system, compare the system to previous models, and discuss upper bounds, lower bounds, and stability for this task.
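The basic statistics such a constituent-context model is defined over are, for every span of a tag sequence, its yield (the tags it covers) and its context (the tags immediately to its left and right). A small sketch of extracting these span signatures is given below; the EM procedure itself is omitted, and the example tag sequence is invented.

```python
def span_signatures(tags):
    """Enumerate (yield, context) pairs for all spans of a tag sequence."""
    padded = ["<S>"] + tags + ["</S>"]     # sentence-boundary markers
    sigs = []
    n = len(tags)
    for i in range(n):
        for j in range(i + 1, n + 1):
            span_yield = tuple(tags[i:j])
            context = (padded[i], padded[j + 1])   # (preceding tag, following tag)
            sigs.append((span_yield, context))
    return sigs

for y, c in span_signatures(["DT", "NN", "VBD"]):
    print(y, c)
```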

This paper presents empirical studies and closely corresponding theoretical models of the performance of a chart parser exhaustively parsing the Penn Treebank with the Treebank’s own CFG grammar. We show how performance is dramatically affected by rule representation and tree transformations, but little by top-down vs. bottom-up strategies. We discuss grammatical saturation, including analysis of the strongly connected components of the phrasal nonterminals in the Treebank, and model how, as sentence length increases, the effective grammar rule size increases as regions of the grammar are unlocked, yielding super-cubic observed time behavior in some configurations.

We present a system for deciding whether a given sentence can be inferred from text. Each sentence is represented as a directed graph (extracted from a dependency parser) in which the nodes represent words or phrases, and the links represent syntactic and semantic relationships. We develop a learned graph matching model to approximate entailment by the amount of the sentence’s semantic content which is contained in the text. We present results on the Recognizing Textual Entailment dataset (Dagan et al., 2005), and show that our approach outperforms Bag-of-Words and TF-IDF models.

Discriminative feature-based methods are widely used in natural language processing, but sentence parsing is still dominated by generative methods. While prior feature-based dynamic programming parsers have restricted training and evaluation to artificially short sentences, we present the first general, feature-rich discriminative parser, based on a conditional random field model, which has been successfully scaled to the full WSJ parsing data. Our efficiency is primarily due to the use of stochastic optimization techniques, as well as parallelization and chart prefiltering. On WSJ15, we attain a state-of-the-art F-score of 90.9%, a 14% relative reduction in error over previous models, while being two orders of magnitude faster. On sentences of length 40, our system achieves an F-score of 89.0%, a 36% relative reduction in error over a generative baseline.

We propose an approach to natural language inference based on a model of natural logic, which identifies valid inferences by their lexical and syntactic features, without full semantic interpretation. We greatly extend past work in natural logic, which has focused solely on semantic containment and monotonicity, to incorporate both semantic exclusion and implicativity. Our system decomposes an inference problem into a sequence of atomic edits linking premise to hypothesis; predicts a lexical entailment relation for each edit using a statistical classifier; propagates these relations upward through a syntax tree according to semantic properties of intermediate nodes; and composes the resulting entailment relations across the edit sequence. We evaluate our system on the FraCaS test suite, and achieve a 27% reduction in error from previous work. We also show that hybridizing an existing RTE system with our natural logic system yields significant gains on the RTE3 test suite.

The alignment problem—establishing links between corresponding phrases in two related sentences—is as important in natural language inference (NLI) as it is in machine translation (MT). But the tools and techniques of MT alignment do not readily transfer to NLI, where one cannot assume semantic equivalence, and for which large volumes of bitext are lacking. We present a new NLI aligner, the MANLI system, designed to address these challenges. It uses a phrase-based alignment representation, exploits external lexical resources, and capitalizes on a new set of supervised training data. We compare the performance of MANLI to existing NLI and MT aligners on an NLI alignment task over the well-known Recognizing Textual Entailment data. We show that MANLI significantly outperforms existing aligners, achieving gains of 6.2% in F1 over a representative NLI aligner and 10.5% over GIZA++.

Grammatical theory has long wrestled with the fact that causative constructions exhibit properties of both single words and complex phrases. However, as Paul Kiparsky has observed, the distribution of such properties of causatives is not arbitrary: ‘construal’ phenomena such as honorification, anaphor and pronominal binding, and quantifier ‘floating’ typically behave as they would if causatives were syntactically complex, embedding constructions; whereas case marking, agreement and word order phenomena all point to the analysis of causatives as single lexical items. Although an analysis of causatives in terms of complex syntactic structures has frequently been adopted in an attempt to simplify the mapping to semantic structure, we believe that motivating syntactic structure based on perceived semantics is questionable because in general a syntax/semantics homomorphism cannot be maintained without vitiating syntactic theory (Miller 1991). Instead, we sketch a strictly lexical theory of Japanese causatives that deals with the evidence offered for a complex phrasal analysis. Such an analysis makes the phonology, morphology and syntax parallel, while a mismatch occurs with the semantics. The conclusions we will reach are given in (1).

This paper separates conditional parameter estimation, which consistently raises test set accuracy on statistical NLP tasks, from conditional model structures, such as the conditional Markov model used for maximum-entropy tagging, which tend to lower accuracy. Error analysis on part-of-speech tagging shows that the actual tagging errors made by the conditionally structured model derive not only from label bias, but also from other ways in which the independence assumptions of the conditional model structure are unsuited to linguistic sequences. The paper presents new word-sense disambiguation and POS tagging experiments, and integrates apparently conflicting reports from other recent work.

This paper presents a new method for producing a dictionary of subcategorization frames from unlabelled text corpora. It is shown that statistical ﬁltering of the results of a ﬁnite state parser running on the output of a stochastic tagger produces high quality results, despite the error rates of the tagger and the parser. Further, it is argued that this method can be used to learn all subcategorization frames, whereas previous methods are not extensible to a general solution to the problem.

We present a Chinese word segmentation system submitted to the closed track of Sighan bakeoff 2005. Our segmenter was built using a conditional random field sequence model that provides a framework to use a large number of linguistic features such as character identity, morphological and character reduplication features. Because our morphological features were extracted from the training corpora automatically, our system was not biased toward any particular variety of Mandarin. Thus, our system does not overfit the variety of Mandarin most familiar to the system's designers. Our final system achieved an F-score of ...
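For orientation, the sketch below shows the flavor of character-level feature templates such a CRF segmenter might use (character identity in a small window plus reduplication indicators). The exact templates and the B/I labeling scheme here are illustrative assumptions, not the submitted system's feature set.

```python
def char_features(chars, i):
    """Feature sketch for labeling character i as B (begins a word) or I (inside)."""
    return {
        "c0=" + chars[i]: 1,                                        # character identity
        "c-1=" + (chars[i - 1] if i > 0 else "<S>"): 1,             # previous character
        "c+1=" + (chars[i + 1] if i < len(chars) - 1 else "</S>"): 1,
        "redup-1": int(i > 0 and chars[i] == chars[i - 1]),          # reduplication with previous
        "redup+1": int(i < len(chars) - 1 and chars[i] == chars[i + 1]),
    }

sent = list("你们看看吧")
for i in range(len(sent)):
    print(sent[i], char_features(sent, i))
```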

I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy (56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an error analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or better features in a discriminative sequence classifier. The prospects for further gains from semisupervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes from improving the taxonomic basis of the linguistic resources from which taggers are trained. That is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. The status of some words may not be able to be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.

Existing reordering models in phrase-based translation systems can handle swaps between adjacent phrases, but they typically lack the ability to perform the kind of long-distance reorderings possible with syntax-based systems. In this paper, we present a novel hierarchical phrase reordering model aimed at improving non-local reorderings, which seamlessly integrates with a standard phrase-based system with little loss of computational efficiency. We show that this model can successfully handle the key examples often used to motivate syntax-based systems, such as the rotation of a prepositional phrase around a noun phrase. We contrast our model with reordering models commonly used in phrase-based systems, and show that our approach provides statistically significant BLEU point gains for two language pairs: Chinese-English (+0.53 on MT05 and +0.71 on MT08) and Arabic-English (+0.55 on MT05).

Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.

Many named entities contain other named entities inside them. Despite this fact, the field of named entity recognition has almost entirely ignored nested named entity recognition, but due to technological, rather than ideological, reasons. In this paper, we present a new technique for recognizing nested named entities, by using a discriminative constituency parser. To train the model, we transform each sentence into a tree, with constituents for each named entity (and no other syntactic structure). We present results on both newspaper and biomedical corpora which contain nested named entities. In three out of four sets of experiments, our model outperforms a standard semi-CRF on the more traditional top-level entities. At the same time, we improve the overall F-score by up to 30% over the flat model, which is unable to recover any nested entities.
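The tree transformation described above can be illustrated with a small sketch that turns token-level entity spans (possibly nested) into a bracketing whose only constituents are named entities. The example sentence and spans are invented, and the actual model of course trains a discriminative parser on such trees rather than merely printing them.

```python
def to_entity_tree(tokens, entities):
    """Build a nested bracketing from (start, end, label) entity spans.
    Spans are token offsets, end-exclusive; nested spans must be properly contained."""
    spans = sorted(entities, key=lambda s: (s[0], -(s[1] - s[0])))  # outer spans first

    def build(start, end, spans):
        out, i = [], start
        while i < end:
            inner = next((s for s in spans if s[0] == i and s[1] <= end), None)
            if inner:
                s, e, label = inner
                rest = [sp for sp in spans if sp != inner]
                out.append("(" + label + " " + build(s, e, rest) + ")")
                i = e
            else:
                out.append(tokens[i])
                i += 1
        return " ".join(out)

    return "(ROOT " + build(0, len(tokens), spans) + ")"

tokens = ["the", "University", "of", "Texas", "campus"]
entities = [(1, 4, "ORG"), (3, 4, "LOC")]        # LOC "Texas" nested inside the ORG
print(to_entity_tree(tokens, entities))
# (ROOT the (ORG University of (LOC Texas)) campus)
```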

Unsupervised grammar induction systems commonly judge potential constituents on the basis of their effects on the likelihood of the data. Linguistic justiﬁcations of constituency, on the other hand, rely on notions such as substitutability and varying external contexts. We describe two systems for distributional grammar induction which operate on such principles, using part-of-speech tags as the contextual features. The advantages and disadvantages of these systems are examined, including precision/recall trade-offs, error analysis, and extensibility.

While O(n³) methods for parsing probabilistic context-free grammars (PCFGs) are well known, a tabular parsing framework for arbitrary PCFGs which allows for bottom-up, top-down, and other parsing strategies has not yet been provided. This paper presents such an algorithm, and shows its correctness and advantages over prior work. The paper finishes by bringing out the connections between the algorithm and work on hypergraphs, which permits us to extend the presented Viterbi (best parse) algorithm to an inside (total probability) algorithm.
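For reference, the familiar O(n³) special case that such a framework generalizes is bottom-up Viterbi parsing of a Chomsky Normal Form PCFG (CKY). Below is a minimal sketch with an invented toy grammar; the framework described in the abstract additionally handles arbitrary PCFGs and other traversal strategies.

```python
from collections import defaultdict

# Toy CNF PCFG: binary rules (parent, (left, right)) -> probability,
# plus lexical rules (preterminal, word) -> probability.  All values invented.
binary = {("S", ("NP", "VP")): 1.0,
          ("VP", ("V", "NP")): 1.0}
lexical = {("NP", "fish"): 0.5, ("NP", "people"): 0.5,
           ("V", "eat"): 1.0}

def cky_viterbi(words):
    n = len(words)
    best = defaultdict(float)          # (i, j, symbol) -> best inside probability
    for i, w in enumerate(words):
        for (sym, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, sym)] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, (l, r)), p in binary.items():
                    score = p * best[(i, k, l)] * best[(k, j, r)]
                    if score > best[(i, j, parent)]:
                        best[(i, j, parent)] = score
    return best[(0, n, "S")]

print(cky_viterbi("people eat fish".split()))   # probability of the best parse: 0.25
```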

In Pollard and Sag (1987) and Pollard and Sag (1994:Ch. 1–8), the subcategorized arguments of a head are stored on a single ordered list, the subcat list. However, Borsley (1989) argues that there are various deficiencies in this approach, and suggests that the unified list should be split into separate lists for subjects, complements, and specifiers. This proposal has been widely adopted in what is colloquially known as HPSG3 (Pollard and Sag (1994:Ch. 9) and other recent work in HPSG). Such a move provides in HPSG an analog of the external/internal argument distinction generally adopted in GB, solves certain technical problems such as allowing prepositions to take complements rather than things identical in subcat list position to subjects, and allows recognition of the special features of subjects which have been noted in the LFG literature, where keyword grammatical relations are used. In HPSG3, it is these valence features subj, comps and spr whose values are ‘cancelled off’ (in a Categorial Grammar-like manner) as a head projects a phrase. A lexical head combines with its complements and subject or specifier (if any) according to the lexically inherited specification, as in (1).

This paper presents our work on textual inference and situates it within the context of the larger goals of machine reading. The textual inference task is to determine if the meaning of one text can be inferred from the meaning of another and from background knowledge. Our system generates semantic graphs as a representation of the meaning of a text. This paper presents new results for aligning pairs of semantic graphs, and proposes the application of natural logic to derive inference decisions from those aligned pairs. We consider this work as first steps toward a system able to demonstrate broad-coverage text understanding and learning abilities.

This paper examines the Stanford typed dependencies representation, which was designed to provide a straightforward description of grammatical relations for any user who could benefit from automatic text understanding. For such purposes, we argue that dependency schemes must follow a simple design and provide semantically contentful information, as well as offer an automatic procedure to extract the relations. We consider the underlying design principles of the Stanford scheme from this perspective, and compare it to the GR and PARC representations. Finally, we address the question of the suitability of the Stanford scheme for parser evaluation.

While symbolic parsers can be viewed as deduction systems, this view is less natural for probabilistic parsers. We present a view of parsing as directed hypergraph analysis which naturally covers both symbolic and probabilistic parsing. We illustrate the approach by showing how a dynamic extension of Dijkstra’s algorithm can be used to construct a probabilistic chart parser with an O(n³) time bound for arbitrary PCFGs, while preserving as much of the flexibility of symbolic chart parsers as allowed by the inherent ordering of probabilistic dependencies.

The same categorical phenomena which are attributed to hard grammatical constraints in some languages continue to show up as statistical preferences in other languages, motivating a grammatical model that can account for soft constraints. The effects of a hierarchy of person (1st, 2nd, 3rd) on grammar are categorical in some languages, most famously in languages with inverse systems, but also in languages with person restrictions on passivization. In Lummi, for example, the person of the subject argument cannot be lower than the person of a nonsubject argument. If this would happen in the active, passivization is obligatory; if it would happen in the passive, the active is obligatory (Jelinek and Demers 1983). These facts follow from the theory of harmonic alignment in OT: constraints favoring the harmonic association of prominent person (1st, 2nd) with prominent syntactic function (subject) are hypothesized to be present as subhierarchies of the grammars of all languages, but to vary in their effects across languages depending on their interactions with other constraints (Aissen 1999). There is a statistical reflection of these hierarchies in English. The same disharmonic person/argument associations which are avoided categorically in languages like Lummi by making passives either impossible or obligatory, are avoided in the SWITCHBOARD corpus of spoken English by either depressing or elevating the frequency of passives relative to actives. The English data can be grammatically analyzed within the stochastic OT framework (Boersma 1998, Boersma and Hayes 2001) in a way which provides a principled and unifying explanation for their relation to the crosslinguistic categorical person effects studied by Aissen (1999).

We propose a model of natural language inference which identifies valid inferences by their lexical and syntactic features, without full semantic interpretation. We extend past work in natural logic, which has focused on semantic containment and monotonicity, by incorporating both semantic exclusion and implicativity. Our model decomposes an inference problem into a sequence of atomic edits linking premise to hypothesis; predicts a lexical semantic relation for each edit; propagates these relations upward through a semantic composition tree according to properties of intermediate nodes; and joins the resulting semantic relations across the edit sequence. A computational implementation of the model achieves 70% accuracy and 89% precision on the FraCaS test suite. Moreover, including this model as a component in an existing system yields significant performance gains on the Recognizing Textual Entailment challenge.
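A minimal sketch of the final "join" step described above: relations predicted for successive edits are composed into a single relation between premise and hypothesis. Only a conservative subset of the join table is encoded here, with everything else degraded to independence; the relation symbols and the example edit sequence are illustrative assumptions, not the full table from the model.

```python
# Relations: '=' equivalence, '<' forward entailment, '>' reverse entailment,
# '^' negation (exhaustive exclusion), '#' independence.
def join(r1, r2):
    """Conservatively join two relations across consecutive edits.
    Only uncontroversial entries are encoded; all other pairs degrade to
    '#' (no information), which is sound but loses precision."""
    if r1 == "=":
        return r2
    if r2 == "=":
        return r1
    table = {("<", "<"): "<", (">", ">"): ">", ("^", "^"): "="}
    return table.get((r1, r2), "#")

# Hypothetical edit sequence from premise to hypothesis, each edit already
# tagged with a relation (after monotonicity projection):
edits = ["<", "=", "<"]     # e.g. "dog" -> "animal", identity, drop a restrictive modifier
relation = "="
for r in edits:
    relation = join(relation, r)
print(relation)             # '<' : the premise forward-entails the hypothesis
```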

Most PCFG parsing work has used the bottom-up CKY algorithm (Kasami, 1965; Younger, 1967) with Chomsky Normal Form grammars (Baker, 1979; ...). The parser presented here applies the “fundamental rule” in an order-independent manner, such that the same basic algorithm supports top-down and bottom-up parsing, and it deals correctly with the difficult cases of left-recursive rules, empty elements, and unary rules, in a natural way.

This paper examines feature selection for log linear models over rich constraint-based grammar (HPSG) representations by building decision trees over features in corresponding probabilistic context free grammars (PCFGs). We show that single decision trees do not make optimal use of the available information; constructed ensembles of decision trees based on different feature subspaces show significant performance gains (14% parse selection error reduction). We compare the performance of the learned PCFG grammars and log linear models over the same features.