Lecture 4 Outline

Part-of-Speech Tagging Using Hidden Markov Models

Corpus-Based Methods

Natural language is very complex
- we don't know how to model it fully, so we build simplified models which provide some approximation to natural language
How can we measure 'how good' these models are?
- we build a corpus, annotate it by hand with respect to the phenomenon we are interested in, and then compare the hand annotation with the predictions of our model
- for example, how well the model predicts part-of-speech or syntactic structure
To build a good corpus
- we must define a task people can do reliably (choose a suitable POS set, for example)
- we must provide good documentation for the task
- we must measure human performance (through dual annotation and inter-annotator agreement; see the sketch below)
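
One standard measure of inter-annotator agreement (the outline does not name one; Cohen's kappa is a common choice) corrects raw agreement for the agreement expected by chance. A minimal sketch in Python, assuming two annotators have labeled the same sequence of tokens:

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators over the same tokens."""
        n = len(labels_a)
        # Observed agreement: fraction of tokens where the two match.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement: probability the two would match if each
        # labeled independently according to their own tag distribution.
        dist_a, dist_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(dist_a[t] * dist_b[t] for t in dist_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical dual annotation of the same five tokens:
    a = ["NN", "VB", "NN", "DT", "JJ"]
    b = ["NN", "VB", "JJ", "DT", "JJ"]
    print(cohens_kappa(a, b))   # about 0.74; 1.0 would be perfect agreement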
How to train the model?
- need a goodness metric
- train by hand, by adjusting rules and analyzing errors
- train automatically
  * develop new rules
  * build a probabilistic model (generally very hard to do by hand)

Statistical Part-of-Speech Tagging (J&M sec 5.5)

Looking at words in isolation:

Given a word w, what tag t should we assign to it?
We want to assign the tag which maximizes the number of correct assignments.
The probability of getting the assignment correct is P ( t | w ).
So we want to assign the tag = argmax(t) P ( t | w ).
We can estimate P ( t | w ) as
(number of times w is tagged as t in corpus) / (number of times w appears in corpus)
(this is the 'maximum likelihood estimator', J&M p. 88)
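
As a concrete illustration (a minimal sketch, not code from J&M), this 'most frequent tag' baseline can be built directly from corpus counts; the corpus format, a flat list of (word, tag) pairs, is an assumption:

    from collections import Counter, defaultdict

    def train_unigram_tagger(tagged_corpus):
        """tagged_corpus: list of (word, tag) pairs.
        Maps each word to its most frequent tag, i.e. to
        argmax(t) P ( t | w ) under the maximum likelihood estimate."""
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        return {w: tags.most_common(1)[0][0] for w, tags in counts.items()}

    # Tiny hypothetical corpus: 'can' is tagged MD twice, NN once.
    corpus = [("the", "DT"), ("can", "MD"), ("can", "NN"),
              ("can", "MD"), ("rusts", "VBZ")]
    tagger = train_unigram_tagger(corpus)
    print(tagger["can"])   # MD, the majority tag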

Hidden Markov Model (HMM) (J&M 6.2)

Suppose we associate a part-of-speech with each state in a Markov Model.
We then associate an 'emission probability' P ( w | t ) of emitting a particular word when in a particular state.
This is a Hidden Markov Model ...
the sequence of words generated does not uniquely determine the sequence of states.

To tag a word sequence W = w1 ... wn, we want the tag sequence
T = argmax(T) P ( T | W )
which, by Bayes' rule and the Markov assumptions, is the T maximizing
the product over all i of P ( ti | ti-1 ) * P ( wi | ti ).

Training an HMM

Training an HMM is simple if we have a completely labeled corpus:
we have marked the POS of each word.
We can then directly estimate both P ( ti | ti-1 ) and P ( wi | ti ) from corpus counts using the Maximum Likelihood Estimator.
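
A minimal sketch of this training step, assuming the corpus is a list of sentences, each a list of (word, tag) pairs; the START pseudo-tag for sentence-initial transitions is an implementation choice, not necessarily J&M's exact formulation:

    from collections import Counter, defaultdict

    START = "<s>"   # pseudo-tag marking the start of a sentence (assumed)

    def train_hmm(sentences):
        """Estimate P ( ti | ti-1 ) and P ( wi | ti ) by MLE.
        sentences: list of sentences, each a list of (word, tag) pairs."""
        trans_counts = defaultdict(Counter)   # previous tag -> next tag
        emit_counts = defaultdict(Counter)    # tag -> word
        for sent in sentences:
            prev = START
            for word, tag in sent:
                trans_counts[prev][tag] += 1
                emit_counts[tag][word] += 1
                prev = tag
        # Normalize counts into conditional probabilities.
        trans = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
                 for p, cs in trans_counts.items()}
        emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
                for t, ws in emit_counts.items()}
        return trans, emit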

Using an HMM ('decoding')

The argmax(T) given above corresponds to finding the most
likely path through the model, given W.
The Viterbi algorithm performs this search in time linear in the number of tokens. (J&M sec. 5.5.3 and 6.4)
It consists of a forward pass which computes probabilities,
and a backward pass which traces the most likely path.
In the forward pass, it builds a probability matrix viterbi [ number of POS states + 2, number of tokens + 2 ] and a back pointer matrix of the same size.

viterbi [ s, t ] = max (over all paths to [ s, t ]) of the probability of reaching state s at token t
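
A minimal Viterbi sketch in the same style, reusing the trans and emit dictionaries produced by the train_hmm sketch above; it stores probabilities in dictionaries rather than the padded matrices just described, and omits smoothing and unknown-word handling:

    def viterbi(words, tags, trans, emit, start="<s>"):
        """Most likely tag sequence for words under the HMM.
        trans[prev][t] = P ( t | prev ); emit[t][w] = P ( w | t )."""
        # Forward pass: best[t] is the probability of the best path
        # ending in tag t at the current token; backptrs records, for
        # each token, the best previous tag for each current tag.
        best = {t: trans.get(start, {}).get(t, 0.0)
                   * emit.get(t, {}).get(words[0], 0.0)
                for t in tags}
        backptrs = []
        for word in words[1:]:
            new_best, ptrs = {}, {}
            for t in tags:
                prev = max(tags, key=lambda p: best[p]
                                               * trans.get(p, {}).get(t, 0.0))
                ptrs[t] = prev
                new_best[t] = (best[prev]
                               * trans.get(prev, {}).get(t, 0.0)
                               * emit.get(t, {}).get(word, 0.0))
            backptrs.append(ptrs)
            best = new_best
        # Backward pass: trace the back pointers from the best final tag.
        last = max(tags, key=lambda t: best[t])
        path = [last]
        for ptrs in reversed(backptrs):
            path.append(ptrs[path[-1]])
        return list(reversed(path))

    # Hypothetical toy corpus and query:
    sents = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
             [("the", "DT"), ("dog", "NN"), ("can", "MD"), ("bark", "VB")]]
    trans, emit = train_hmm(sents)
    print(viterbi(["the", "dog", "barks"], list(emit), trans, emit))
    # ['DT', 'NN', 'VBZ']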