SYNTHER: A NEW M-GRAM POS TAGGER


David Sündermann and Hermann Ney
RWTH Aachen University of Technology, Computer Science Department
Ahornstr. 55, Aachen, Germany

ABSTRACT

In this paper, the Part-Of-Speech (POS) tagger synther, which is based on m-gram statistics, is described. After explaining its basic architecture, three smoothing approaches and the strategy for handling unknown words are presented. Subsequently, synther's performance is evaluated in comparison with four state-of-the-art POS taggers. All of them are trained and tested on three corpora of different languages and domains. In the course of this evaluation, synther achieved the lowest or at least below-average error rates. Finally, it is shown that the linear interpolation smoothing strategy with coverage-dependent weights has better properties than the two other approaches.

Keywords: synther, (m-gram) POS tagger, linear interpolation smoothing with coverage-dependent weights, POS tagger evaluation

1. INTRODUCTION

POS taggers are used in many natural language processing tasks, e.g. in speech recognition, speech synthesis, or statistical machine translation. Their most common aim is to assign a unique POS tag to each token of the input text string. To the best of our knowledge, statistical approaches [1], [3], [8] in most cases yield better results for POS tagging than finite-state, rule-based, or memory-based approaches [2], [4]. Although the maximum entropy framework [8] seems to be the most acknowledged statistical tagging technique, it has been shown that a simple trigram approach often results in better performance [1]. As taking more context into account should improve tagging results, the use of higher m-gram orders in conjunction with an effective smoothing method is desirable. Thus, in this paper a new approach to defining the weights of the linear interpolation smoothing technique is presented and compared with two conventional smoothing methods.
In POS tagging applications, one further aspect deserves particular attention: the handling of words which have not been seen during training, so-called out-of-vocabulary (OOV) words. In Section 4, the approach utilized within synther is described. Finally, synther's performance is evaluated in comparison to four state-of-the-art taggers on three corpora of different languages and domains.

2. BASIC ARCHITECTURE OF A POS TAGGER

The aim of the POS taggers discussed in this paper (v. the schematic diagram in Figure 1) is the assignment of unambiguous POS tags to the words of an input text. Given the word (or, more generally, token) sequence w_1^N := w_1 ... w_n ... w_N on the positions n = 1, ..., N, we search for the most likely tag sequence

    ĝ_1^N := argmax_{g_1^N} Pr(g_1^N | w_1^N).

Rewriting this formula by means of BAYES' law yields

    ĝ_1^N = argmax_{g_1^N} Pr(g_1^N) · Pr(w_1^N | g_1^N).
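The decision rule above can be sketched with a small bigram instantiation (m = 2). All probabilities, tag names, and the sentence-start symbol `<s>` in this example are illustrative assumptions, not synther's actual estimates:

```python
import math

def viterbi_tag(words, tag_lm, lexical, tags):
    """Find the tag sequence maximizing Pr(g_1^N) * Pr(w_1^N | g_1^N)
    under a bigram tag model tag_lm[(prev_tag, tag)] and an emission
    model lexical[(word, tag)], both given as probabilities."""
    # delta[g] = best log-probability of any tag sequence ending in tag g
    delta = {g: math.log(tag_lm.get(("<s>", g), 1e-12))
                + math.log(lexical.get((words[0], g), 1e-12)) for g in tags}
    back = []
    for w in words[1:]:
        prev_delta, delta, pointers = delta, {}, {}
        for g in tags:
            best_prev = max(prev_delta, key=lambda p: prev_delta[p]
                            + math.log(tag_lm.get((p, g), 1e-12)))
            delta[g] = (prev_delta[best_prev]
                        + math.log(tag_lm.get((best_prev, g), 1e-12))
                        + math.log(lexical.get((w, g), 1e-12)))
            pointers[g] = best_prev
        back.append(pointers)
    # backtrace from the best final tag
    g = max(delta, key=delta.get)
    seq = [g]
    for pointers in reversed(back):
        g = pointers[g]
        seq.append(g)
    return list(reversed(seq))
```

The dynamic program exploits the factorization of Pr(g_1^N) into m-gram probabilities, so the search is linear in the sentence length.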

3.3 Linear Interpolation with Weights Depending on Training Data Coverage

This technique is based on the idea that a high training data coverage for the order µ signifies that we are allowed to take more context into account and to rate that context higher. The training data coverage c_µ is the ratio between the number of different µ-grams observed in training and the number of all possible µ-grams; hence, we have 0 ≤ c_µ ≤ 1. On the other hand, a low coverage (high sparseness) is an indicator that the µ-gram weight should be curtailed. According to the above considerations, we want the interpolation weights to lie on a continuous function λ_µ(c_µ) fulfilling the following conditions (ĉ denotes the optimal coverage):

    λ_µ(0) = 0,
    λ_µ(ĉ) = max_{c_µ} λ_µ(c_µ),
    0 < λ_µ(c_µ) < λ_µ(ĉ) for 0 < c_µ, c_µ ≠ ĉ.

One simple realization of these constraints is a set of λ_µ computed by normalizing the values λ̃_µ defined as follows (the normalization has to be executed because the sum of the interpolation weights must be unity):

    λ̃_µ = c_µ / ĉ   for c_µ ≤ ĉ,
    λ̃_µ = ĉ / c_µ   otherwise.

ĉ should be estimated with the help of a development corpus and can be expected to lie in the neighborhood of one percent.

4. OOV HANDLING

In Eq. (2), the word-tag probability is defined as a product of conditional probabilities p(w|g), which can be derived from p(g|w) by means of BAYES' law:

    p(w|g) = p(w) · p(g|w) / p(g).

Furthermore, we note that p(g) is known and that p(w) constitutes a factor which is equal for each possible tag sequence g_1^N and can be ignored when searching for the most probable sequence. Therefore, in the following, we only discuss the estimation of the conditional probability p(g|w). In case of a word seen in training, we estimate p(g|w) using relative frequencies; otherwise, we have a more detailed look at the actual word consisting of the ASCII characters l_1, ..., l_I. Especially the final characters serve as a good means to estimate word-tag probabilities of OOVs in Western European languages.
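The coverage-dependent weighting of Section 3.3 can be sketched as follows; the function and argument names are illustrative, and the piecewise form (linear rise up to ĉ, decay like ĉ/c_µ beyond it) is the simple realization stated above:

```python
def interpolation_weights(coverages, c_hat=0.01):
    """Coverage-dependent interpolation weights for the orders mu = 1..m.
    Each raw weight rises linearly up to the optimal coverage c_hat and
    decays as c_hat / c_mu beyond it; the weights are then normalized so
    that they sum to unity."""
    raw = [c / c_hat if c <= c_hat else c_hat / c for c in coverages]
    total = sum(raw)
    return [r / total for r in raw]
```

A coverage of exactly ĉ receives the largest raw weight (1.0), a coverage of zero receives weight zero, and both very sparse and fully saturated orders are damped, as the constraints on λ_µ(c_µ) require.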
When we want to take into account relative frequencies of character strings seen in training, we have to deal with training data sparseness. Again, this leads us to the use of smoothing strategies, cf. Section 3. synther uses the linear interpolation technique, v. Eq. (4), wherein the weights are defined as proposed in [9]. These considerations yield the general definition of the searched probability p(g|w):

    p(g|w) = N(g, w) / N(w)   for N(w) > 0,
    p(g|w) = Σ_{i=1}^{I} λ_i · N(g, l_i, ..., l_I) / N(l_i, ..., l_I)   otherwise,

where the weights λ_i ∝ (σ / (1 + σ))^i are normalized such that they sum to unity. Here, σ is the standard deviation of the estimated tag occurrence probabilities.

5. CORPORA

In the following, the corpora used for the evaluation of synther are briefly presented. Punctuation Marks (PMs) are all tokens which do not contain letters or digits. Singletons are all tokens, respectively POS tags, which occur only once in the training data. The m-gram perplexity is a measure of the diversity of tokens expected at each position:

    PP_m = ( Π_{n=m}^{N} p(g_n | g_{n-m+1}^{n-1}) )^{-1/(N-m+1)}

The trigram perplexity PP_3 displayed in Tables 1 to 3 was computed using the linear interpolation smoothing approach explained in Section 3.3.
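The two-case estimate of p(g|w) can be sketched as below. The direction of the geometric decay over suffix positions and the normalization of the λ_i are assumptions read off the definitions above; the data structures are illustrative:

```python
from collections import Counter, defaultdict

def tag_prob(word, tag, word_tag_counts, suffix_tag_counts, sigma):
    """p(g | w): relative frequency for words seen in training; for OOVs,
    a linear interpolation of suffix statistics N(g, l_i ... l_I) with
    geometrically decaying weights lambda_i, normalized to sum to one."""
    n_word = sum(word_tag_counts[word].values())
    if n_word > 0:
        return word_tag_counts[word][tag] / n_word
    # raw weights lambda_i ~ (sigma / (1 + sigma))^i for i = 1..I
    raw = [(sigma / (1.0 + sigma)) ** i for i in range(1, len(word) + 1)]
    z = sum(raw)
    p = 0.0
    for i, r in enumerate(raw):
        suffix = word[i:]  # l_{i+1} ... l_I
        n_suffix = sum(suffix_tag_counts[suffix].values())
        if n_suffix > 0:
            p += (r / z) * suffix_tag_counts[suffix][tag] / n_suffix
    return p
```

For a known word, only the relative frequency branch is used; for an OOV, every observed suffix contributes, with the weights controlled by σ.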

When we restrict the search space by exclusively taking into account those tags which have been observed in connection with the particular token, the tagging procedure can only make errors in case of ambiguities (including OOVs) or if a token has only been seen with tags differing from those of the reference sequence. The maximum error rate ER_max is the error rate we get if we always choose a wrong tag in ambiguous cases. When we randomly determine the tags according to a uniform distribution over all tags observed together with a particular word, we expect the random error rate ER_rand. These two error rates serve as benchmarks to assess the properties of the corpus; e.g., we note that the error rates of the POS taggers presented below are about ten percent of ER_rand.

5.1 Penn Treebank: Wall Street Journal Corpus

This corpus contains about one million English words of 1989 Wall Street Journal (WSJ) material with human-annotated POS tags. It was developed at the University of Pennsylvania [10], v. Table 1.

Table 1: WSJ Corpus Statistics (Text / POS). Vocabulary PMs: 25 / 9; test OOVs: 2.6% / 0%; test PMs: 13.4% / 12.7%; ER_max = 55.7%; ER_rand = 36.7%.

5.2 Münster Tagging Project Corpus

This German POS tagging corpus was compiled at the University of Münster within the Münster Tagging Project (MTP). It contains articles of the newspapers Die Zeit and Frankfurter Allgemeine Zeitung [5], v. Table 2.

Table 2: MTP Corpus Statistics (Text / POS). Vocabulary PMs: 27 / 5; test OOVs: 9.2% / 0.0%; test PMs: 13.1% / 13.1%; ER_max = 66.7%; ER_rand = 49.8%.

5.3 GENIA Corpus

The data content of the GENIA corpus is chosen from the domain of molecular biology. It is edited in American English and has been made available by the University of Tokyo [7], v. Table 3.
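The two baseline error rates can be computed directly from the set of tags observed with each word in training. This sketch follows the definitions above; the function name and toy data are illustrative:

```python
def baseline_error_rates(test_pairs, observed_tags):
    """ER_max: a wrong tag is chosen whenever possible (ambiguous words,
    and words whose reference tag was never observed with them).
    ER_rand: expected errors when drawing uniformly from the tags
    observed together with the word."""
    n_max = 0.0
    n_rand = 0.0
    for word, ref_tag in test_pairs:
        tags = observed_tags.get(word, set())
        if ref_tag not in tags:
            # OOV or unseen word-tag pair: tagging cannot be correct
            n_max += 1.0
            n_rand += 1.0
        elif len(tags) > 1:
            # ambiguous: worst case always wrong, uniform pick mostly wrong
            n_max += 1.0
            n_rand += 1.0 - 1.0 / len(tags)
    n = len(test_pairs)
    return n_max / n, n_rand / n
```

Unambiguous known words contribute no errors to either benchmark, which is why these rates characterize the corpus rather than any particular tagger.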
6. EXPERIMENTS

6.1 Evaluating synther in Comparison to Four Other POS Taggers

To perform an evaluation under objective conditions and to obtain comparable results, synther has been trained and tested together with four freely available POS taggers:

- BRILL's tagger, based on automatically learned rules [2]
- RATNAPARKHI's maximum entropy tagger [8]
- TnT, a trigram tagger by THORSTEN BRANTS [1]
- TreeTagger, a tagger based on decision trees, provided by HELMUT SCHMID [11]

Table 3: GENIA Corpus Statistics (Text / POS). Vocabulary PMs: 25 / 7; test sentences: 689; test OOVs: 3.8% / 0%; test PMs: 11.2% / 10.9%; ER_max = 36.6%; ER_rand = 22.2%.

Table 4 shows the results of this comparison: the total error rate, the error rate for OOVs, and, furthermore, the results exclusively for known words (non-OOV). The latter serves to separate the effect of OOV handling from that of the remaining statistics. All tests presented in this paper, except for those in Section 6.2, were executed setting synther's m-gram order to m = 5. In particular, the results of Table 4 show us:

- In several cases, the m-gram statistics used by synther result in the lowest error rates in comparison to the other taggers tested in the course of this evaluation.
- Both BRILL's and RATNAPARKHI's POS taggers, which were developed at the Department of Computer and Information Science of the University of Pennsylvania, produce their best results on their in-house corpus (WSJ).
- TnT as well as synther always produce above-average results. Except for the OOVs, the latter's statistics decrease the error rates by up to 6 percent relative by virtue of the higher m-gram order (m = 5 in lieu of 3).

Table 4: POS Tagger Evaluation: Error Rates ER[%] (all / OOV / non-OOV) for the taggers BRILL, RATNAPARKHI, synther, TnT, and TreeTagger on the WSJ, MTP, and GENIA corpora.

6.2 Comparison of Smoothing Techniques

In the introduction of this paper, we conjectured that increasing the order of the m-gram statistics should improve the tagging performance. The following test will show that this assumption is only correct if it is supported by the smoothing strategy. In Figure 2, the performance of the three smoothing approaches presented in Section 3 is displayed versus the maximum m-gram order m. These experiments are based on the WSJ corpus described in Table 1.
We note that the coverage-dependent smoothing approach is the best of these three strategies, at least for orders m > 2 and for the WSJ corpus. This statement was also confirmed on the MTP and GENIA corpora.

Figure 2: Comparison of Smoothing Strategies

6.3 Influence of the Optimal Coverage Parameter ĉ on the Smoothing Accuracy

Finally, we want to demonstrate how the accuracy of the coverage-dependent smoothing approach (cf. Section 3.3) is influenced by the optimal coverage parameter ĉ. By means of the WSJ corpus, we demonstrate in Figure 3 that there is a local and also absolute minimum of the error rate curve. This minimum is located in a broad area of low gradients (ĉ = ), thus determining any value within this area suffices to obtain error rates around 3.4%.

Figure 3: Dependence of the Tagging Performance on the Optimal Coverage ĉ

7. CONCLUSION

In this paper, we have presented the m-gram POS tagger synther, explaining in detail its smoothing approaches and the strategy for handling unknown words. Subsequently, the new POS tagger has been evaluated on three corpora of different languages and domains and compared with four state-of-the-art taggers. We have shown that synther results in below-average or even the lowest error rates using a new linear interpolation smoothing technique with coverage-dependent weights.

8. REFERENCES

[1] T. Brants. TnT: A Statistical Part-of-Speech Tagger. In Proc. of ANLP '00.
[2] E. Brill. A Simple Rule-Based Part of Speech Tagger. In Proc. of ANLP '92.
[3] D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. A Practical Part-of-Speech Tagger. In Proc. of ANLP '92.
[4] W. Daelemans, J. Zavrel, P. Berck, and S. Gillis. A Memory-Based Part-of-Speech Tagger Generator. In Proc. of the Workshop on Very Large Corpora.
[5] J. Kinscher and P. Steiner. Münster Tagging Projekt (MTP). Handout for the 4th Northern German Linguistic Colloquium.
[6] H. Ney and U. Essen. Estimating Small Probabilities by Leaving-One-Out. In Proc. of EUROSPEECH '93.
[7] T. Ohta, Y. Tateisi, H. Mima, and J. Tsujii. GENIA Corpus: An Annotated Research Abstract Corpus in Molecular Biology Domain. In Proc. of HLT '02.
[8] A. Ratnaparkhi. A Maximum Entropy Model for Part-of-Speech Tagging. In Proc. of EMNLP '96.
[9] C. Samuelsson. Handling Sparse Data by Successive Abstraction. In Proc. of COLING '96.
[10] B. Santorini. Part-of-Speech Tagging Guidelines for the Penn Treebank Project. Technical Report MS-CIS-90-47, University of Pennsylvania.
[11] H. Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proc. of NeMLaP '94.
