First, make sure you can access the course materials. The
components are:

code1.zip : the Java source code provided for this course
data1.zip : the data sets used in this assignment

The authentication restrictions are due to licensing terms. The username and
password should have been mailed to the account you listed with the Berkeley
registrar. If for any reason you did not get it, please let me know.

Unzip the source files to your local working directory. Some
of the classes and packages won't be relevant until later assignments, but feel
free to poke around. Make sure you can compile the entirety of the course code
without errors (if you get warnings about unchecked casts, ignore them; that's
a Java 1.5 issue). If you cannot get the code to compile, email me, stop by office
hours, or post to the newsgroup. If you are at the source root (i.e. your
current directory contains only the directory 'edu'), you can compile the
provided code with

javac -d classes */*/*/*.java */*/*/*/*.java

You can then run a simple test file by typing

java -cp classes edu.berkeley.nlp.Test

You should get a confirmation message back. You may wish to use an IDE
such as Eclipse (I recommend it). If so, it is expected that you be able
to set it up yourself.

Next, unzip the data into a directory of your choice.
For this assignment, the first Java file to inspect is LanguageModelTester.java
(its main method shows how it expects to locate the data); compile and run it.

If everything's working, you'll get some output about the performance of a
baseline language model being tested. The code is reading in some newswire and
building a basic unigram language model that I've provided. This is a phenomenally
bad language model, as you can see from the strings it generates - you'll
improve on it.

Description

In this assignment, you will construct several language
models and test them with the provided harness.

Take a look at the main method of LanguageModelTester.java, and its output.

Training: Several data objects are loaded by the
harness. First, it loads about 1M words of WSJ text (from the Penn treebank,
which we'll use again later). These sentences have been "speechified", for
example translating "$" to "dollars", and tokenized for you. The WSJ data
is split into training data (80%), validation (held-out) data (10%), and test
data (10%). In addition to the WSJ text, the harness loads a set of speech
recognition problems (from the HUB data set). Each HUB problem consists of a set
of candidate transcriptions of a given spoken sentence. For this
assignment, the candidate list always includes the correct transcription and
never includes words unseen in the WSJ training data. Each candidate
transcription is accompanied by a pre-computed acoustic score, which represents
the degree to which an acoustic model matched that transcription. These
lists are stored in SpeechNBestList objects. Once all the WSJ data and HUB
lists are loaded, a language model is built from the WSJ training sentences (the
validation sentences are ignored entirely by the provided baseline language
model, but may be used by your implementations for tuning). Then, several tests
are run using the resulting language model.
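
The provided code defines the actual interface your language models must implement; the hypothetical sketch below (the names are mine, not the real ones) just shows the shape of the contract the harness relies on - train from sentences, score sentences, generate text. Check LanguageModelTester.java and the baseline model for the real method names.

import java.util.List;

/** Hypothetical shape of a language model; see the provided code for the real interface. */
interface LanguageModelSketch {
    /** Estimate parameters from the WSJ training sentences (validation data may be used for tuning). */
    void train(List<List<String>> trainingSentences);

    /** Log probability of a complete sentence, used for perplexity and HUB re-ranking. */
    double getSentenceLogProbability(List<String> sentence);

    /** Randomly sample a sentence from the model, used by the generation test. */
    List<String> generateSentence();
}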

Evaluation: Each language model is tested in two ways. First, the
harness calculates the perplexity of the WSJ test sentences. In the WSJ test
data, there will be unknown words. Your language models should treat all
entirely unseen words as if they were a single UNK token. This means that, for
example, a good unigram model will actually assign a larger probability to each
unknown word than to a known but rare word - this is because the aggregate
probability of the UNK event is large, even though each specific unknown word
itself may be rare. To emphasize, your model's WSJ perplexity score will not,
strictly speaking, be the perplexity of the exact test sentences, but of the
UNKed test sentences (a lower number).
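
To make the UNK convention concrete, here is a minimal sketch (class and method names are made up; this is not the provided baseline) of a unigram model that reserves some probability mass for a single UNK event and maps every unknown test word onto it. Perplexity is then the exponential of the average negative log probability over the test tokens.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical UNK-aware unigram model; not the provided baseline code. */
class UnkUnigramSketch {
    static final String UNK = "<UNK>";
    private final Map<String, Double> wordCounts = new HashMap<String, Double>();
    private double totalCount = 0.0;

    /** Count training tokens; give UNK a small pseudo-count so unseen words get mass. */
    void train(List<List<String>> trainingSentences) {
        for (List<String> sentence : trainingSentences) {
            for (String word : sentence) {
                addCount(word, 1.0);
            }
        }
        addCount(UNK, 1.0); // crude choice of UNK mass; real smoothing schemes differ
    }

    private void addCount(String word, double amount) {
        Double old = wordCounts.get(word);
        wordCounts.put(word, old == null ? amount : old + amount);
        totalCount += amount;
    }

    /** Unknown test words are all mapped to the single UNK event. */
    double getWordProbability(String word) {
        String key = wordCounts.containsKey(word) ? word : UNK;
        return wordCounts.get(key) / totalCount;
    }

    /** Perplexity = exp(-(1/N) * sum over the N test tokens of log p(token)). */
    double perplexity(List<List<String>> testSentences) {
        double logProbSum = 0.0;
        long numTokens = 0;
        for (List<String> sentence : testSentences) {
            for (String word : sentence) {
                logProbSum += Math.log(getWordProbability(word));
                numTokens++;
            }
        }
        return Math.exp(-logProbSum / numTokens);
    }
}

Because the UNK count aggregates all unseen word types, p(UNK) can easily exceed the probability of a word seen only once or twice, which is exactly the effect described above.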

Second, the harness will calculate the perplexity of the
correct HUB transcriptions. This number will, in general, be worse than
the WSJ perplexity, since these sentences are drawn from a different source.
Language models predict less well on distributions which do not match their
training data. The HUB sentences, however, will not contain any unseen
words.

Third, the harness will compute a word error rate (WER) on the HUB recognition
task. The code takes the candidate transcriptions, scores each one with the
language model, and combines those scores with the pre-computed acoustic scores.
The best-scoring candidates are compared against the correct answers, and WER is
computed. The testing code also provides information on the range of WER scores
which are possible: note that the candidates are only so bad to begin with (the
lists are pre-pruned n-best lists). You should inspect the errors the
system is making on the speech re-ranking task, by running the harness with the
"-verbose" flag.
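
Conceptually, the re-ranking step looks something like the sketch below. The field and method names are illustrative rather than the real SpeechNBestList API, and the tunable weight on the language model score is a common choice, not necessarily what the provided harness does.

import java.util.List;

/** Illustrative re-ranking of one n-best list; not the harness's actual code or API. */
class RerankSketch {
    /** A candidate transcription paired with its pre-computed acoustic log score. */
    static class Candidate {
        final List<String> words;
        final double acousticLogScore;
        Candidate(List<String> words, double acousticLogScore) {
            this.words = words;
            this.acousticLogScore = acousticLogScore;
        }
    }

    interface LanguageModel {
        double sentenceLogProbability(List<String> sentence);
    }

    /** Pick the candidate with the best combined (acoustic + weighted language model) score. */
    static Candidate rerank(List<Candidate> candidates, LanguageModel lm, double lmWeight) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate candidate : candidates) {
            double score = candidate.acousticLogScore
                    + lmWeight * lm.sentenceLogProbability(candidate.words);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}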

Finally, the harness will generate sentences by randomly sampling your language
models. The provided language model's outputs aren't even vaguely like
well-formed English, though yours will hopefully be a little better. Note
that improved fluency of generation does not mean improved modeling of unseen
sentences.
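
For intuition, generating text from a unigram model amounts to repeatedly drawing words in proportion to their probabilities until a stop symbol is drawn (or a length cap is hit). A toy sketch, with made-up names and assuming the stop symbol is part of the distribution:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Toy sampler for a unigram distribution; illustrative names, not the provided code. */
class UnigramSamplerSketch {
    static final String STOP = "</s>";

    /** Draw words independently from the distribution until STOP is drawn or maxLength is reached. */
    static List<String> sampleSentence(Map<String, Double> unigramProbs, Random rng, int maxLength) {
        List<String> sentence = new ArrayList<String>();
        while (sentence.size() < maxLength) {
            double target = rng.nextDouble();
            double cumulative = 0.0;
            String drawn = STOP; // fallback if rounding leaves a tiny gap at the top
            for (Map.Entry<String, Double> entry : unigramProbs.entrySet()) {
                cumulative += entry.getValue();
                if (cumulative >= target) {
                    drawn = entry.getKey();
                    break;
                }
            }
            if (drawn.equals(STOP)) break;
            sentence.add(drawn);
        }
        return sentence;
    }
}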

Experiments: You will implement several language
models, though you can choose which specific ones to try out. Along the way you
must build the following:

Note that if you build, for example, a Kneser-Ney trigram model with all
hyperparameters tuned automatically on the held-out data, you're technically
done, though it will be more instructive to build up models of increasing
complexity.
While you are building your language models, it may be that lower perplexity,
especially on the HUB sentences, will translate into a better WER, but don't be
surprised if it doesn't. The actual performance of your systems does not
directly impact your grade on this assignment, though I will announce students
who do particularly interesting or effective things.
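
As a sketch of the kind of held-out tuning mentioned above, one simple (and certainly not the only) recipe is a coarse grid search over the interpolation weights of a linearly interpolated trigram model, keeping whichever setting minimizes validation perplexity. The interface and method names here are hypothetical:

/** Illustrative grid search over interpolation weights; interface and names are hypothetical. */
class HeldOutTuningSketch {
    interface InterpolatedTrigram {
        /** Set the mixture weights for the trigram, bigram, and unigram components. */
        void setWeights(double lambdaTrigram, double lambdaBigram, double lambdaUnigram);
        /** Perplexity of the model on the held-out (validation) sentences. */
        double validationPerplexity();
    }

    /** Try a coarse grid of weights summing to one; keep the lowest validation perplexity. */
    static double[] tune(InterpolatedTrigram model) {
        double bestPerplexity = Double.POSITIVE_INFINITY;
        double[] bestWeights = {0.34, 0.33, 0.33};
        for (double l3 = 0.1; l3 <= 0.8; l3 += 0.1) {
            for (double l2 = 0.1; l2 <= 0.95 - l3; l2 += 0.1) {
                double l1 = 1.0 - l3 - l2;
                model.setWeights(l3, l2, l1);
                double perplexity = model.validationPerplexity();
                if (perplexity < bestPerplexity) {
                    bestPerplexity = perplexity;
                    bestWeights = new double[] {l3, l2, l1};
                }
            }
        }
        return bestWeights;
    }
}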

What will impact your grade is the degree to which you can present what you did
clearly and make sense of what's going on in your experiments using thoughtful
error analysis. When you do see improvements in WER, where are they coming from,
specifically?
Try to localize the improvements as much as possible. Some example questions you
might consider: Do the errors that are corrected
by a given change to the language model make any sense? Are there changes to the models which
substantially improve perplexity without improving WER? Do certain models
generate better text? Why? Similarly, you should do some data analysis on the
speech errors that you cannot correct. Are there cases where the language model
isn't selecting a candidate which seems clearly superior to a human reader? What
would you have to do to your language model to fix these cases? For these kinds
of questions, it's actually more important to sift through the data and find
some good ideas than to implement those ideas. The bottom line is that your
write-up should include concrete examples of errors or error-fixes, along with
commentary.

Write-ups: For this assignment, you should turn in a write-up of the work
you've done, but not the code (it is sometimes useful to mention code choices or
even snippets in write-ups, and this is fine). The write-up should specify what
models you implemented and what significant choices you made. It should
include tables or graphs of the perplexities, accuracies, etc., of your systems.
It should also include some error analysis - enough to convince me that you
looked at the specific behavior of your systems and thought about what they're
doing wrong and how you'd fix it. There is no set length for write-ups, but a
ballpark length might be 3-4 pages, including your evaluation results, a graph
or two, and some interesting examples. I'm more interested in knowing what
observations you made about the models or data than having a reiteration of the
formal definitions of the various models.

Random Advice: In edu.berkeley.nlp.util there are
some classes that might be of use - particularly the Counter and CounterMap
classes. These make dealing with word-to-count and history-to-(word-to-count)
maps much easier.
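
A rough usage sketch is below; the method names are written from memory, so verify them against the source in edu.berkeley.nlp.util before relying on them.

import edu.berkeley.nlp.util.Counter;
import edu.berkeley.nlp.util.CounterMap;

/** Sketch of using Counter/CounterMap for n-gram counts; check method names against the source. */
class CountingSketch {
    public static void main(String[] args) {
        // word -> count
        Counter<String> unigramCounts = new Counter<String>();
        unigramCounts.incrementCount("the", 1.0);
        unigramCounts.incrementCount("the", 1.0);
        System.out.println(unigramCounts.getCount("the")); // expect 2.0

        // history -> (word -> count), e.g. for bigram counts
        CounterMap<String, String> bigramCounts = new CounterMap<String, String>();
        bigramCounts.incrementCount("the", "cat", 1.0);
        System.out.println(bigramCounts.getCount("the", "cat")); // expect 1.0
    }
}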