Automated Audio Segmentation Using Forced Alignment (Draft)

Step 1 - Create Pronunciation Dictionary

First you need to make sure that
all the words in the eText of the audio book are contained in the
VoxForge Lexicon. The Lexicon file contains the pronunciations
used for Acoustic Model creation; if you try to train an Acoustic Model with a word that
is not in the Lexicon file, the training process will end
abnormally.

This section will guide you through the process of
creating a list of all words in the eText, comparing it against
the lexicon file, and creating a log of all the missing words. The
missing words will then need to be added to the VoxForge Lexicon (with
pronunciations).
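The HDMan step below performs this comparison for you, but the idea can be sketched in a few lines of Python. This is a rough illustration only; it assumes the lexicon uses the tutorial's 'WORD [WORD] phonemes' line layout:

```python
import re

def word_list(text):
    """Return the sorted set of unique uppercase words in the eText."""
    return sorted(set(w.upper() for w in re.findall(r"[A-Za-z']+", text)))

def missing_words(words, lexicon_entries):
    """Words not found in the lexicon (first field of each entry line)."""
    known = {line.split()[0] for line in lexicon_entries if line.strip()}
    return [w for w in words if w not in known]
```

Any word reported by the second function would abort acoustic model training, which is why this check comes first.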

In this example, we used the Librivox text for the History of
England, from the Accession of James the Second, by Thomas Babington
Macaulay (eText.txt & Librivox Audio File).

Cleanup eText file

You might need to convert the text file to your OS's line-ending format (see this link for an explanation of the format differences, and click here to obtain the flip utility for Mac or Windows/MS-DOS environments).
On Linux use the dos2unix command to convert the file from MS-DOS format
(or Mac format) to Unix format:

$ dos2unix eText.txt

Then run it through a spell checker to correct any spelling mistakes.

You will also need to change your eText so that numbers and dates are
spelled out, as follows:

numbers - if you find any numbers in your eText, write them
out in the eText itself. For example, the number '10' should be
converted to 'ten' in the eText.

dates - all dates should also be written out in long form.
For example, the date '1928' should be converted to 'Nineteen Twenty
Eight' in the eText.

You can do this by hand, or wait until you run the HDMan command below: it will
log any words in the eText file that are missing from the Lexicon file,
which only includes spelled-out numbers and dates.

Note: you may need to listen to the actual audio to confirm that
the reader has read the number or date as you would think.
Sometimes they might say 'one, oh' rather than 'ten' for the number
'10', or they might say 'one, nine, two, eight' rather than 'Nineteen
Twenty Eight' for the date '1928'. Always transcribe the number
or date the same way as the reader says it.
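A quick way to find the digits that still need spelling out is to scan the eText for numeric tokens. This is only a finding aid; the actual transcription must be done by ear, as the note above explains:

```python
import re

def numeric_tokens(text):
    """Find digit sequences (numbers, years) still left in the eText,
    so each can be spelled out to match what the reader actually says."""
    return re.findall(r"\d+", text)
```

Each token found should be replaced by hand after listening to the corresponding audio passage.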

Review the HDMan log file (dlog) to determine which words (if any) in the eText file are missing from the VoxForge lexicon.

If you don't have any missing words, go to Step 3.

If
you have missing words, review the list and look for any that might
correspond to numbers or dates in numeric format. If you find any
of these, then change them in your eText file (i.e. so that they are
written using words, not numbers), and re-run the HDMan command as
shown above. If you still have missing words showing in the dlog
file, then go to Step 2.

Step 2 - Add Missing Words to Your Copy of the VoxForge Lexicon

Manually

To add a missing word (as displayed in your HDMan log - dlog)
to the VoxForge Lexicon, you need to look
at the pronunciation of similar words in the dictionary, and create a
new pronunciation entry for your word based on these similar
words.

For example, if you want to add the word "winward", you would look up words that are similar, such as:

WINWOOD [WINWOOD] w ih n w uh d

In this case, this gives us the pronunciation for the "win"
in the word "winward". Next, we look for words that
contain "ward" in the dictionary, such as:

WOODWARD [WOODWARD] w uh d w er d

WARD [WARD] w ow r d

Notice that although the words "woodward" and "ward"
contain the same sequence of letters (ward), they are pronounced
differently - they have different phoneme sequences. Next you
need to make a judgment call based on your knowledge of your
English dialect (you might also want to listen to the actual audio
passage that contains the word, but this could take too much time for
each and every word you are unsure of... ). For me, the way I
pronounce the word part "ward"
in "winward" is closer to the sounds I make in "woodward"
than in the word "ward". Therefore, the final
pronunciation dictionary entry I would use would look like this:

WINWARD [WINWARD] w ih n w er d

You then need to add this word to your version of the VoxForge
Lexicon in
*Alphabetical* sequence. You need to repeat these steps
for all the missing words in your eText. It's a little
tedious when you perform this process for the first time, but as you
get familiar with the words and phonemes, it goes much quicker.
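Keeping the entries in order by hand is error-prone; the insertion step can be sketched in Python (a sketch only - the 'WORD [WORD] phonemes' format follows this tutorial, and HDMan's exact sort rules for entries with punctuation may differ, as noted later):

```python
import bisect

def add_entry(lexicon_lines, word, phonemes):
    """Insert a 'WORD [WORD] phonemes' entry, keeping the (already
    sorted) lexicon in alphabetical order."""
    entry = "%s [%s] %s" % (word, word, " ".join(phonemes))
    bisect.insort(lexicon_lines, entry)
    return lexicon_lines
```

For the "winward" example above, adding the entry places it between WARD and WOODWARD automatically.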

Manually with help from Festival

Start Festival

$ festival

From the Festival command line, there are a series of "lex" commands
that can help you determine the phonemes contained in a word that is
not included in the VoxForge dictionary, and as an added bonus, you
can actually listen to how Festival pronounces the word to get a better
feel for the phonemes.

First, find out which lexicons (i.e. pronunciation dictionaries and
rules) are included in your distribution of Festival using the "lex.list" command as follows:

festival> (lex.list)

("english_poslex" "cmu")

Since VoxForge is based on the cmu dictionary, we can use Festival
to determine the phonemes of an unknown word, using Festival's
dictionary and pronunciation rules (see here for Festival's phone list).

Festival (rel 1.95) usually uses the "cmu" lexicon by default.
To make sure that you are using this dictionary, use the following
command:

festival> (lex.select "cmu")

Next, to determine the pronunciation of a word use the "lex.lookup" command as follows:

festival> (lex.lookup "internet")

("internet" nil (((ih n t) 1) ((er n) 0) ((eh t) 1)))

Festival will list the phonemes included in the word, but also
includes numbers (these indicate "lexical stress" for a phoneme).
Ignore the parentheses and numbers, and you have Festival's view of the
phonemes that make up the word you entered. Therefore, for the
word "Internet", Festival says its phonemes are: "ih n t er n eh t".
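Flattening the lex.lookup output by hand gets tedious for many words; a small parser can do it. This is a rough sketch that handles output shaped like the example above (innermost parenthesized groups are syllables of phones; the trailing digits are stress marks):

```python
import re

def flatten_phones(lookup_output):
    """Flatten Festival's lex.lookup syllable structure into a plain
    phoneme string, dropping the lexical-stress digits."""
    # Match each innermost phone group, e.g. "(ih n t)", and join them.
    groups = re.findall(r"\(([a-z@ ]+)\)", lookup_output)
    return " ".join(" ".join(g.split()) for g in groups)
```

Its output can then be pasted directly into a 'WORD [WORD] phonemes' lexicon entry.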

Semi-automated approach using Festival (manual corrections required)

Create a new file called MissingWords, and copy into it the missing words listed in the dlog log file from the HDMan run.

This will create a good first draft of the pronunciations for the
missing words, in a file called MissingWords_out. You still need to confirm these pronunciations to
make sure they are OK. You can do this by looking at similar
groups of letters in the missing words, and looking up the pronunciations
for these groups in other known words - if they match, then use what
Festival recommends. If they don't match, you need to make a
judgment call based on your knowledge of English.

After you've added all the missing words to your copy of the VoxForge dictionary

In the current example, once all the missing words have been added, your VoxForge Lexicon should look like this: VoxForgeDict.

Once you finish adding all your words, re-run the HDMan command:

$ HDMan -A -D -T 1 -m -w wlist -i -l dlog dict VoxForgeDict

And review the HDMan log output (i.e. dlog) again to make sure that you did not miss
any other words.

Note: One common error is to put the new entries in the Lexicon
file in the wrong sort order. You might have to experiment with
word placement (especially with words containing non-alphanumeric
characters) to get it so that HDMan will run correctly.
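A quick pre-check before re-running HDMan is to scan the lexicon for adjacent entries that are out of order. This sketch uses plain string comparison, which is an assumption - HDMan's sort semantics for entries with non-alphanumeric characters may differ, which is exactly what the note above warns about:

```python
def check_sort_order(lexicon_lines):
    """Return pairs of adjacent entries that are out of plain string
    order; such pairs are the usual cause of HDMan sort-order failures."""
    return [(a, b) for a, b in zip(lexicon_lines, lexicon_lines[1:]) if a > b]
```

An empty result does not guarantee HDMan will accept the file, but a non-empty one pinpoints where to start experimenting with placement.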

The HDMan command will create a dictionary
file called: dict. Your dict file is essentially all the words in
your wlist file with added pronunciation information.

Step 3 - Temporarily DownSample Your Wav File

Forced Alignment using HVite only seems to work with audio recorded at a 16 kHz sampling rate, at 16 bits per
sample. So we will create a temporary version of your audio at 16 kHz/16 bits, run HVite, and then use the time alignments
generated by the Forced Alignment process to segment your original
audio.

You need to use the SoX sound editing utility to downsample your audio. The SoX command syntax is as follows:

$ sox original.wav -c 1 -r 16000 -w downsampled.wav

-c 1 - converts stereo audio to mono

-r 16000 - is the target rate of 16kHz

-w is the target bits per sample (word length = 16 bits)

In this example, we will use the audio from jimmowatt's "History of England" submission:

Note: you need to make sure that the text contained in the eText is the same as what is recorded.

Step 5 - Forced Alignment

Next, run HVite (an HTK tool) to use the VoxForge Acoustic Model to line up the written words in the words.mlf file with the spoken words in the corresponding audio book file, and to get time alignments.

First you need to create an HVite configuration file called "wav_config" containing the following:
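The contents of wav_config are not reproduced in this draft. As a placeholder, a typical HTK configuration for converting 16 kHz WAV input to the MFCC parameterization used in the VoxForge tutorials looks like the following; every value here is an assumption and must match the settings your acoustic model was trained with:

```
SOURCEFORMAT = WAV
TARGETKIND = MFCC_0_D
TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
```

If HVite rejects the audio or the alignments are garbage, a mismatch between this configuration and the acoustic model is the first thing to check.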

This creates a file called aligned.out containing all the words from your words.mlf file, with time alignments. The output from the HVite command is here.

Note: different acoustic models may produce slightly different
forced alignment results (i.e. the better the Acoustic Model, the more
accurate the forced alignments).

Step 6 - Validate Audio with Audacity

1. Run the htklabels2audacity.pl script to convert the HTK time stamps into a format readable by Audacity, as follows:

$ perl ./htklabels2audacity.pl aligned.out audacityLabelTrack.txt

This
creates a label file called audacityLabelTrack.txt
that can be opened in Audacity so you can compare it with the original
audio book and confirm that the Forced Alignment times look
OK.
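The core of what htklabels2audacity.pl does can be sketched in Python (an illustration, not the script itself): HTK label times are in 100-nanosecond units, while Audacity label tracks want tab-separated start/end times in seconds:

```python
def htk_to_audacity(label_lines):
    """Convert 'start end word' HTK label lines (times in 100 ns units)
    to Audacity label-track lines: start<TAB>end<TAB>word in seconds."""
    out = []
    for line in label_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip MLF headers such as '#!MLF!#' or file names
        start, end, word = parts[0], parts[1], parts[2]
        out.append("%.6f\t%.6f\t%s" % (int(start) / 1e7, int(end) / 1e7, word))
    return out
```

Each output line can be written to a text file and opened via Audacity's label-track import.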

2. Open your original speech audio file (historyofengland01ch04_01_macaulay.wav) as an audio track in Audacity. Next open the audacityLabelTrack.txt
label file as a label track in Audacity. To perform a quick
confirmation
that the alignments look OK, you will need to zoom in to certain
sections using Audacity's 'Zoom to Selection' feature, listen to the
audio, and make sure the spoken audio matches the label.

Step 7 - Run the Segmentation Script

First, create a directory called 'wav'.

Prepare Your Original Audio File for Processing

Next, you need to rename your audio file with a 3 or 4 character
name, because
this is the name that will be used for the segmented wav files (e.g.
hoe.wav will be segmented into hoe0001.wav, hoe0002.wav, hoe0003.wav,
...). In addition, your audio file may require some further
changes, because HTK only works with audio files recorded at a
maximum
of 16 bits per sample, and the segmenting script assumes that the audio
was
recorded in mono format.

If
your audio is a mono recording at 16 bits per sample, then you only
need to rename your file to a short filename (keep the .wav suffix):

$ mv original.wav target.wav

If your audio was recorded in 32-bit float sample format,
you can use SoX to convert it to 16 bits (using the '-w' 16-bit word
parameter), or if it was recorded in stereo (i.e. using 2 channels),
you can also use SoX to convert it to mono (using the '-c 1' single
channel parameter). In our current example, you would use
SoX as follows:

$ sox historyofengland01ch04_01_macaulay.wav -w -c 1 hoe.wav

Run Audio Segmentation Script

Next you need to create an HCopy configuration file called
"copy_config" containing the following (the htksegment.pl script below
uses HCopy to segment the audio):

Next, run the htksegment.pl
script to perform the actual segmentation of the audio into many
smaller files and create a corresponding prompt file. It uses the
following parameters:

$ perl htksegment.pl [wav filename] [sample rate]

So for our current example, you would run it as follows:

$ perl htksegment.pl hoe.wav 44100
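The [sample rate] parameter exists because the alignment times were produced from the 16 kHz temporary audio but the cuts are made in the original recording. The conversion the script presumably performs internally is just a unit change (a sketch; htksegment.pl's actual rounding may differ):

```python
def htk_time_to_samples(htk_time, sample_rate):
    """Convert an HTK time (in 100 ns units) to a sample offset at the
    original recording's rate, e.g. 44100 Hz for the source audio."""
    return int(round(htk_time * sample_rate / 1e7))
```

For example, an HTK time of 10000000 (one second) lands at sample 44100 in a 44.1 kHz file.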

Step 8 - Submit your segmented files to VoxForge

Create Readme file

Create a README file
that describes your submission. Right-click this link and save the file to your upload folder. Modify
the entries where appropriate:

Each line in the readme has a question with some possible answers
within brackets. Please replace the suggestions between the brackets
with your answer.

Take your best guess as to the original author's dialect (follow this link for help on this). If you are not sure, just put in Librivox.

Create License file

Next, create a LICENSE file for your submission. Right-click this link and
save the file to your upload folder. Change the year
to the current year, and the 'name of author' to your name (or to the
'Free Software Foundation' - if you wish to assign your copyright to
the FSF).

Although the audio book you segmented is likely in the public
domain, you have copyright over the way the audio was segmented (because you have rearranged this audio in a unique way), and
therefore you can license the segmented audio under the GPL.

Tar your files

Please create a single compressed tar file containing the following files:

your segmented wav files;

the corresponding prompts file;

your updated eText file (remove any references to Project Gutenberg - see this FAQ for an explanation why);

any changes/updates you might have made to the VoxForge Lexicon; and

your README and LICENSE files.

Name your tar file as follows: "[voxforge
username]-[year][month][day].tgz". For example, if you stored all these files in the
/home/myusername/segment folder, you would execute
the following command to create your gzipped tar file:

$ cd /home/myusername
$ tar -zcvf kmaclean-20070125.tgz segment

Connect to the VoxForge FTP site

Connect to the site using your favourite FTP client (see link
below).

If you are using Firefox 1.5 or greater, you can use
FireFTP, a cross-platform FTP client. For Linux you can use Nautilus (Gnome), for Windows you
can use FileZilla or WinSCP, and Cyberduck can be used on a Mac.

(Note: You need to be registered on the VoxForge site for the link to display and to get the current password. )

Copy your TarFile

Copy your compressed tar file to the VoxForge FTP site.

Submission Notification

Please add a note stating that you have submitted some audio to the
VoxForge FTP site, and/or ask any questions about the FTP submission
process. You can do this by clicking the 'Add' link below (note: it is only visible if you are logged in).