Many people are familiar with commercially available software
which allows users to dictate directly into their computers. This type of system
relies on a particular user training their computer to the idiosyncrasies of
their voice, speaking words carefully, usually with limited background noise
and a closely held microphone.

The goal of current research is to remove such limitations
and produce a system which will recognise speech above background noise, allow
for variability in the speaker's accent and handle irregularities caused by
degraded speech heard over imperfect communication channels (such as crackly
radios and phone lines). In fact, the goal is to produce a system that is as
good as a human at decoding speech!

Researchers in speech recognition technology have an annual
opportunity to compare the performance of their systems against those from other
groups around the world by taking part in a standardised test run by the US
National Institute of Standards and Technology and the US Defense Advanced Research
Projects Agency. The current evaluation task is transcription of several hours
of audio taken from television and radio news broadcasts. In the 1997 evaluation,
the HTK system developed at CUED was ranked first overall. It had a lower word-error
rate, by a statistically significant margin, than its competitors, which included
major companies such as IBM, Dragon and Philips, as well as universities such
as Carnegie Mellon. The error rate achieved by the CUED HTK system was around
16%, which is remarkably good for this type of task.

How they work

The speech-recognition systems currently being developed
use a statistical modelling approach to estimate the most probable word sequence
from the input speech. 'There are three sources of information that are used
to model the speech signal', explains Phil Woodland. 'The first describes the
way the audio signal varies when different sounds are produced. These acoustic
models allow us to compute the probability that unknown audio corresponds to
a particular sound, or phone. Secondly, we use a pronunciation dictionary that
lists the phone sequences that make up allowable words and, finally, we use
the probability of sequences of words estimated by examining statistics gathered
on a large text corpus. Given some input speech, we search for the most likely
sequence of words based on these models. Of course, the key issue is how to
construct the models!'
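
To make the interplay of the three information sources concrete, here is a minimal toy sketch in Python. It is not the HTK system: the lexicon entries, the per-segment acoustic probabilities, the tiny text corpus and every function name are invented for illustration, the audio is assumed to be already divided into one segment per phone, and the search simply tries every hypothesis, where a real recogniser would use a far more efficient search.

```python
import math
from collections import Counter

# Pronunciation dictionary (toy, hypothetical entries): word -> phone sequence.
LEXICON = {
    "i": ["ay"],
    "ice": ["ay", "s"],
    "eyes": ["ay", "z"],
    "cream": ["k", "r", "iy", "m"],
    "scream": ["s", "k", "r", "iy", "m"],
}

# Stand-in for the acoustic models (invented numbers): for each audio
# segment, P(segment | phone). The second segment is ambiguous.
SEGMENTS = [
    {"ay": 0.9},
    {"s": 0.6, "z": 0.4},
    {"k": 0.9},
    {"r": 0.9},
    {"iy": 0.9},
    {"m": 0.9},
]
FLOOR = 1e-6  # probability assigned to phones a segment does not list

# Tiny text corpus (invented) used to estimate word-sequence statistics.
CORPUS = [
    "i like ice cream",
    "ice cream is cold",
    "we all like ice cream",
    "i scream loudly",
]


def train_bigram(corpus, extra_words):
    """Estimate add-one-smoothed bigram log-probabilities from the corpus."""
    bigrams, contexts = Counter(), Counter()
    vocab = set(extra_words)
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        vocab.update(words[1:])
        for prev, word in zip(words, words[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1

    def log_prob(prev, word):
        return math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab)))

    return log_prob


def acoustic_log_prob(phones, start):
    """Log-probability that the segments from `start` match these phones."""
    return sum(math.log(SEGMENTS[start + i].get(p, FLOOR))
               for i, p in enumerate(phones))


def hypotheses(start=0):
    """Yield every word sequence whose phones exactly cover SEGMENTS[start:]."""
    if start == len(SEGMENTS):
        yield []
        return
    for word, phones in LEXICON.items():
        if start + len(phones) <= len(SEGMENTS):
            for rest in hypotheses(start + len(phones)):
                yield [word] + rest


def score(words, lm_log_prob):
    """Combined acoustic and language-model log-score of one hypothesis."""
    total, position, prev = 0.0, 0, "<s>"
    for word in words:
        phones = LEXICON[word]
        total += acoustic_log_prob(phones, position) + lm_log_prob(prev, word)
        position += len(phones)
        prev = word
    return total


def recognise():
    """Exhaustively search for the most probable word sequence."""
    lm = train_bigram(CORPUS, LEXICON)
    return max(hypotheses(), key=lambda words: score(words, lm))


best = recognise()
print(" ".join(best))  # prints: ice cream
```

In this toy, 'ice cream' and 'i scream' share exactly the same phones, so the acoustic evidence cannot separate them; the word-sequence statistics tip the balance towards 'ice cream' because that pair occurs more often in the corpus.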

The HTK system, like most modern speech recognisers, uses
simple statistical networks, known as Hidden Markov Models (HMMs), to assign
probabilities to sequences of spectra, each corresponding to about 10 ms of speech.
The HMM parameters are trained on large quantities of transcribed audio data,
and as more data is fed into the system, the HMMs can be tuned further. To
provide high accuracy, the system uses separate HMMs for the same sound in different
contexts, that is, depending on the neighbouring sounds. An important feature of this approach is that it is capable of automatic
learning: given some transcribed audio data, a pronunciation dictionary and
a text corpus, it is possible to build a speech-recognition system based on
the same technology for many different languages.
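
How an HMM assigns a probability to a sequence of frames can be sketched with the standard forward algorithm. The Python example below is a simplified illustration, not HTK code: each frame is reduced to one of three made-up symbols ('low', 'mid', 'high') instead of a real spectrum, and the two three-state models, their parameters and all names are assumptions chosen for the example.

```python
SYMBOLS = ("low", "mid", "high")  # hypothetical quantised "spectra"


def forward_probability(observations, initial, transition, emission):
    """Return P(observations | HMM), summed over all state paths."""
    n_states = len(initial)
    # alpha[s] = P(frames seen so far, currently in state s)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[s][obs]
            * sum(alpha[p] * transition[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Left-to-right topology: each state may repeat or move on to the next one.
INITIAL = [1.0, 0.0, 0.0]
TRANSITION = [
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
]

# Model of an invented sound whose spectrum rises over time.
RISING = [
    {"low": 0.80, "mid": 0.15, "high": 0.05},
    {"low": 0.10, "mid": 0.80, "high": 0.10},
    {"low": 0.05, "mid": 0.15, "high": 0.80},
]
# Model of an invented sound whose spectrum falls over time.
FALLING = [
    {"low": 0.05, "mid": 0.15, "high": 0.80},
    {"low": 0.10, "mid": 0.80, "high": 0.10},
    {"low": 0.80, "mid": 0.15, "high": 0.05},
]

# Five frames, each standing for roughly 10 ms of speech.
frames = ["low", "low", "mid", "high", "high"]

p_rising = forward_probability(frames, INITIAL, TRANSITION, RISING)
p_falling = forward_probability(frames, INITIAL, TRANSITION, FALLING)
print("rising" if p_rising > p_falling else "falling")  # prints: rising
```

Training adjusts the transition and emission values so that each model gives high probability to the transcribed examples of its sound. Using discrete symbols keeps the sketch short; it is a simplification of how a real recogniser scores spectra.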