To remove punctuation from the text files, we used the following Unix
sed script: sed.strip. It
specifies a number of global substitutions using very simple
regular expressions. If sed is not available, it would be
easy to write the same thing in Perl, or one could
simply do the substitutions in a text editor.
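
As a rough illustration of this kind of global substitution, here is a minimal Python sketch. It is not a transcription of sed.strip: the function name `strip_punctuation` and the exact rule (replace every ASCII punctuation character with a space, then collapse runs of spaces) are assumptions made for the example, and the real script may treat apostrophes, hyphens or case differently.

```python
import re
import string

# Hypothetical stand-in for the substitutions in sed.strip;
# the actual rules in that script may differ.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")
_SPACES = re.compile(r"[ \t]+")


def strip_punctuation(line: str) -> str:
    """Replace each ASCII punctuation character with a space,
    then collapse runs of spaces/tabs and trim the ends."""
    no_punct = _PUNCT.sub(" ", line)
    return _SPACES.sub(" ", no_punct).strip()


if __name__ == "__main__":
    sample = "It is a truth, universally acknowledged; that--well!"
    print(strip_punctuation(sample))
```

Replacing punctuation with a space rather than deleting it keeps words such as "that--well" from being fused into one token; whether that is the right choice depends on how the cleaned text is to be tokenized.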

Here are the resulting 'clean' text files that we used:
training data (a concatenation of various
novels), and test data (a cleaned-up
Persuasion).

The Good-Turing estimates for Austen in Table 6.8 were calculated
with Gale and Sampson's (1995) Simple Good-Turing technique, using
Sampson's C program SGT.c, available from his website. The
frequencies-of-frequencies data that was used as input is available in this
file. (To do exercise 6.6, you might use a language modelling
toolkit to generate raw n-grams, a Perl program to count those
n-grams, and then feed the counts into
SGT.c for Good-Turing estimation.)
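
The counting step of that pipeline can be sketched in a few lines of Python instead of Perl. This is only an illustration under stated assumptions: the helper names (`ngram_counts`, `frequencies_of_frequencies`, `format_for_sgt`) are hypothetical, the toy sentence is not from the Austen data, and the output layout of one "r N_r" pair per line in ascending order of r is an assumption about what SGT.c reads, so check the comments in the program itself before relying on it.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def frequencies_of_frequencies(counts):
    """Map each frequency r to N_r, the number of distinct n-grams seen r times."""
    return Counter(counts.values())


def format_for_sgt(freq_of_freq):
    """Render one 'r N_r' pair per line in ascending r.
    Assumption: this matches the input layout SGT.c expects."""
    return "\n".join(f"{r} {n_r}" for r, n_r in sorted(freq_of_freq.items()))


if __name__ == "__main__":
    # Toy input, not the Austen training data.
    tokens = "the cat sat on the mat the cat ran".split()
    bigrams = ngram_counts(tokens, 2)
    print(format_for_sgt(frequencies_of_frequencies(bigrams)))
```

On the toy sentence, six bigrams occur once and one ("the cat") occurs twice, so the program prints the two lines `1 6` and `2 1`.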

This file gives examples of some of the commands we used for the
calculations in the chapter, drawing on standard Unix commands and programs
from the CMU-Cambridge Statistical Language Modeling toolkit:
recipes.txt.