
1 N-gram Language Modeling. L545, Dept. of Linguistics, Indiana University, Spring. 1 / 24

2 Morphosyntax. We just finished talking about morphology (cf. words), and pretty soon we're going to discuss syntax (cf. sentences). In between, we'll handle words in context. Today: n-gram language modeling. Next time: POS tagging. Both of these topics are covered in more detail in L645. 2 / 24

3 N-grams: An n-gram is a stretch of text n words long. Approximation of language: n-grams tell us something about language, but don't capture structure. Efficient: finding and using, e.g., every two-word collocation in a text is quick and easy to do. N-grams can help in a variety of NLP applications: word prediction (n-grams can be used to aid in predicting the next word of an utterance, based on the previous n-1 words), context-sensitive spelling correction, machine translation post-editing, ... 3 / 24

4 Corpus-based NLP. Corpus (pl. corpora) = a computer-readable collection of text and/or speech, often with annotations. We use corpora to gather probabilities and other information about language use. A corpus used to gather prior information = training data; testing data = the data one uses to test the accuracy of a method. A word may refer to: type = distinct word (e.g., like), or token = distinct occurrence of a word (e.g., the type like might have 20,000 tokens in a corpus). 4 / 24
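
To make the type/token distinction concrete, here is a minimal Python sketch (the toy corpus and variable names are illustrative, not from the slides):

```python
from collections import Counter

# A toy "corpus"; in practice this would be read from annotated files.
corpus = "i like cats and i like dogs".split()

token_count = len(corpus)      # every occurrence counts: 7 tokens
type_counts = Counter(corpus)  # one entry per distinct word
type_count = len(type_counts)  # 5 types: i, like, cats, and, dogs

print(token_count, type_count)
print(type_counts["like"])     # the type "like" has 2 tokens here
```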

5 Let's assume we want to predict the next word, based on the previous context of I dreamed I saw the knights in. What we want to find is the likelihood of $w_8$ being the next word, given that we've seen $w_1, \ldots, w_7$; this is $P(w_8 \mid w_1, \ldots, w_7)$. But, to start with, we examine $P(w_1, \ldots, w_8)$. In general, for $w_n$, we are concerned with:

(1) $P(w_1, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$

These probabilities are impractical to calculate: such long sequences hardly ever occur in a corpus, if at all, and there would be too much data to store even if we could calculate them. 5 / 24
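
To make (1) concrete, here is the expansion for the first three words of the example context (an illustrative expansion, not from the slides):

$P(\text{I dreamed I}) = P(\text{I}) \times P(\text{dreamed} \mid \text{I}) \times P(\text{I} \mid \text{I dreamed})$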

6 Unigrams. Approximate these probabilities with n-grams, for a given n. Unigrams (n = 1):

(2) $P(w_n \mid w_1, \ldots, w_{n-1}) \approx P(w_n)$

Easy to calculate, but lacking contextual information:

(3) The quick brown fox jumped ...

We would like to say that over has a higher probability in this context than lazy does. 6 / 24
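
A relative-frequency (MLE) sketch of unigram and bigram estimates in Python; the corpus string and function name are illustrative assumptions, not from the slides:

```python
from collections import Counter

def mle_probs(tokens):
    """MLE estimates by relative frequency:
    P(w) = C(w) / N and P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    uni_p = {w: c / len(tokens) for w, c in unigrams.items()}
    bi_p = {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
    return uni_p, bi_p

uni_p, bi_p = mle_probs("the quick brown fox jumped over the lazy dog".split())
print(uni_p["the"])            # 2/9: "the" is 2 of the 9 tokens
print(bi_p[("the", "quick")])  # 1/2: "the" occurs twice, once before "quick"
```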

8 Markov models. A bigram model is also called a first-order Markov model: first-order because it has one element of memory (one token in the past). Markov models are essentially weighted FSAs, i.e., the arcs between states have probabilities, and the states in the FSA are words. More on Markov models when we hit POS tagging... 8 / 24

9 Bigram example. What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?

(7) $P(\text{The quick brown fox jumped over the lazy dog}) = P(\text{The} \mid \text{START})\,P(\text{quick} \mid \text{The})\,P(\text{brown} \mid \text{quick}) \cdots P(\text{dog} \mid \text{lazy})$

Probabilities are generally small, so log probabilities are usually used. Q: Does this favor shorter sentences? A: Yes, but it also depends upon $P(\text{END} \mid \text{lastword})$. 9 / 24
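
A minimal sketch of scoring a sentence with log probabilities under a bigram model; the probability table and the START/END marker strings are assumptions for illustration:

```python
import math

def sentence_logprob(words, bigram_prob):
    """Sum log P(w_i | w_{i-1}) over the sentence, with START and END
    markers; summing logs avoids underflow from multiplying many small
    probabilities."""
    padded = ["<START>"] + words + ["<END>"]
    return sum(math.log(bigram_prob[(prev, cur)])
               for prev, cur in zip(padded, padded[1:]))
```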

12 Know your corpus. We mentioned splitting training & testing data earlier. It's important to remember what your training data is when applying your technology to new data: if you train your trigram model on Shakespeare, then you have learned the probabilities of Shakespeare, not the probabilities of English overall. Which corpus you use depends on your purpose. 12 / 24

13 Smoothing: Assume a bigram model has been trained on a good corpus (i.e., it has learned MLE bigram probabilities). It won't have seen every possible bigram: lickety split is a possible English bigram, but it may not be in the corpus. Problem = data sparsity: bigrams with zero probability that are actually possible bigrams in the language. Smoothing techniques account for this: adjust probabilities to account for unseen data, and make zero probabilities non-zero. 13 / 24

14 Add-One Smoothing. One way to smooth is to add a count of one to every bigram. In order to still have a probability distribution, all probabilities need to sum to one; thus, we add the number of word types to the denominator (we added one to every type of bigram, so we need to account for all our numerator additions):

(10) $P^*(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + V}$

where V = total number of word types in the lexicon. 14 / 24
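
Equation (10) translated directly into code, as a sketch; bigram_counts and unigram_counts are assumed to be Counter objects built from training data:

```python
def add_one_prob(prev, word, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed bigram probability:
    P*(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + 1) / (C(w_{n-1}) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
```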

15 Add-One example. So, if treasure trove never occurred in the data, but treasure occurred twice, we have:

(11) $P^*(\text{trove} \mid \text{treasure}) = \dfrac{0 + 1}{2 + V} = \dfrac{1}{2 + V}$

The probability won't be very high, but it will be better than 0. If the surrounding probabilities are high, treasure trove could be the best pick; if the probability were zero, it would have no chance of appearing. 15 / 24

16 Discounting. An alternate way of viewing smoothing is as discounting: lowering non-zero counts to get the probability mass we need for the zero-count items. The discounting factor can be defined as the ratio of the smoothed count to the MLE count. Jurafsky and Martin show that add-one smoothing can discount probabilities by a factor of 10! Too much of the probability mass is now in the zeros. We will examine one way of handling this; more in L645. 16 / 24

17 Witten-Bell Discounting. Idea: use the counts of words you have seen once to estimate those you have never seen. Instead of simply adding one to every n-gram, compute the probability of $w_{i-1} w_i$ by seeing how likely $w_{i-1}$ is to start any bigram: words that begin lots of bigrams lead to higher unseen-bigram probabilities. Non-zero bigrams are discounted in essentially the same manner as zero-count bigrams; Jurafsky and Martin show that they are only discounted by about a factor of one. 17 / 24

18 Witten-Bell Discounting formula. For zero-count bigrams:

(12) $p^*(w_i \mid w_{i-1}) = \dfrac{T(w_{i-1})}{Z(w_{i-1})\,\bigl(N(w_{i-1}) + T(w_{i-1})\bigr)}$

$T(w_{i-1})$ = number of bigram types starting with $w_{i-1}$; this determines how high the value will be (the numerator). $N(w_{i-1})$ = number of bigram tokens starting with $w_{i-1}$; $N(w_{i-1}) + T(w_{i-1})$ gives the total number of events to divide by. $Z(w_{i-1})$ = number of bigram types starting with $w_{i-1}$ and having zero count; this distributes the probability mass between all zero-count bigrams starting with $w_{i-1}$. 18 / 24
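
Equation (12) as a sketch; bigram_counts and vocab_size are assumed inputs, and T, N, and Z are computed from them as defined above:

```python
def witten_bell_zero(prev, bigram_counts, vocab_size):
    """Witten-Bell estimate for any one zero-count bigram starting with
    `prev`: T(prev) / (Z(prev) * (N(prev) + T(prev)))."""
    seen = [c for (w1, _), c in bigram_counts.items() if w1 == prev]
    T = len(seen)           # bigram types observed starting with prev
    N = sum(seen)           # bigram tokens starting with prev
    Z = vocab_size - T      # zero-count bigram types starting with prev
    return T / (Z * (N + T))
```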

19 Kneser-Ney (absolute discounting). Witten-Bell discounting is based on using relative discounting factors. Kneser-Ney simplifies this by using absolute discounting factors: instead of multiplying by a ratio, simply subtract some discounting factor. 19 / 24
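
A sketch of the absolute-discounting step for seen bigrams: subtract a fixed discount d from each non-zero count (d = 0.75 is a common illustrative choice, not a value from the slides; redistributing the freed mass to unseen bigrams is not shown):

```python
def absolute_discount_prob(prev, word, bigram_counts, unigram_counts, d=0.75):
    """Discounted probability for a *seen* bigram:
    P(w | prev) = (C(prev, w) - d) / C(prev).
    The mass freed by subtracting d is what gets redistributed
    to unseen bigrams."""
    c = bigram_counts[(prev, word)]
    return max(c - d, 0) / unigram_counts[prev]
```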

20 Class-based N-grams. Intuition: we may not have seen a word before, but we may have seen a word like it. We may never have observed Shanghai, but we have seen other cities. We can use a type of hard clustering, where each word is assigned to only one class (IBM clustering):

(13) $P(w_i \mid w_{i-1}) \approx P(c_i \mid c_{i-1}) \times P(w_i \mid c_i)$

The POS tagging equations will look fairly similar to this. 20 / 24
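
A minimal sketch of equation (13); word2class, class_bigram_prob, and word_given_class are assumed lookup tables built from some hard clustering:

```python
def class_based_prob(prev, word, word2class, class_bigram_prob, word_given_class):
    """P(w_i | w_{i-1}) ~ P(c_i | c_{i-1}) * P(w_i | c_i), where each
    word belongs to exactly one class (hard clustering)."""
    c_prev, c_cur = word2class[prev], word2class[word]
    return class_bigram_prob[(c_prev, c_cur)] * word_given_class[(word, c_cur)]
```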

21 Backoff models: Basic idea. Assume a trigram model for predicting language, where we haven't seen a particular trigram before. Maybe we've seen the bigram, or the unigram. Backoff models allow one to try the most informative n-gram first and then back off to lower n-grams. 21 / 24

22 Backoff equations. Roughly speaking, this is how a backoff model works. If this trigram has a non-zero count, use that:

(14) $\hat{P}(w_i \mid w_{i-2} w_{i-1}) = P(w_i \mid w_{i-2} w_{i-1})$

Else, if the bigram count is non-zero, use that:

(15) $\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \alpha_1\, P(w_i \mid w_{i-1})$

In all other cases, use the unigram information:

(16) $\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \alpha_2\, P(w_i)$

22 / 24
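
Equations (14)-(16) as a sketch; trigram_p, bigram_p, unigram_p, alpha1, and alpha2 are assumed inputs (in practice the alphas come from discounting, as the next slide notes):

```python
def backoff_prob(w2, w1, w, trigram_p, bigram_p, unigram_p, alpha1, alpha2):
    """Back off from trigram to bigram to unigram, using the most
    informative n-gram that has a non-zero estimate."""
    if trigram_p.get((w2, w1, w), 0) > 0:
        return trigram_p[(w2, w1, w)]
    if bigram_p.get((w1, w), 0) > 0:
        return alpha1 * bigram_p[(w1, w)]
    return alpha2 * unigram_p.get(w, 0)
```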

23 Backoff models: example. Assume we have never seen the trigram maples want more before. If we have seen want more, we use that bigram to calculate a probability estimate, $P(\text{more} \mid \text{want})$. But we're now assigning probability to $P(\text{more} \mid \text{maples want})$, which was zero before, so we won't have a true probability model anymore. This is why $\alpha_1$ was used in the previous equations: to down-weight the backed-off probability. In general, backoff models are combined with discounting models. 23 / 24

24 A note on information theory. Some very useful notions for n-gram work can be found in information theory. Basic ideas: entropy = a measure of how much information is needed to encode something; perplexity = a measure of the amount of surprise at an outcome; mutual information = the amount of information one item has about another item (e.g., collocations have high mutual information). Take L645 to find out more. 24 / 24
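
One concrete connection to n-gram work: perplexity is standardly computed as an exponentiated average negative log probability over test data. A minimal sketch, where logprob2 is an assumed function returning log2 P(cur | prev) from some bigram model:

```python
def perplexity(tokens, logprob2):
    """Perplexity = 2 ** (average negative log2 probability per predicted
    token); lower perplexity means the model is less surprised by the data."""
    logs = [logprob2(prev, cur) for prev, cur in zip(tokens, tokens[1:])]
    return 2 ** (-sum(logs) / len(logs))
```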
