Antwerp, Belgium
August 27-31, 2007

Web-Based Language Modelling for Automatic Lecture Transcription

Cosmin Munteanu, Gerald Penn, Ron Baecker

University of Toronto, Canada

Universities have long relied on written text to share knowledge. As
more lectures are made available on-line, these must be accompanied
by textual transcripts in order to provide the same access to information
as textbooks. While Automatic Speech Recognition (ASR) is a cost-effective
method to deliver transcriptions, its accuracy for lectures is not
yet satisfactory. One approach for improving lecture ASR is to build
smaller, topic-dependent Language Models (LMs) and combine them (through
LM interpolation or hypothesis space combination) with general-purpose,
large-vocabulary LMs. In this paper, we propose a simple solution for
lecture ASR with similar or better Word Error Rate reductions (as well
as topic-specific keyword identification accuracies) than combination-based
approaches. Our method eliminates the need for two types of LMs by
exploiting the lecture slides to collect a web corpus appropriate for
modelling both the conversational and the topic-specific styles of
lectures.
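
For readers unfamiliar with the combination-based baseline mentioned above, LM interpolation is commonly a linear mixture of two models, P(w) = lam * P_topic(w) + (1 - lam) * P_general(w). The following is a minimal sketch of that idea using unigram models only; the function name, the weight `lam`, and the toy probabilities are assumptions made for illustration and are not taken from the paper, whose proposed method is intended to avoid this combination step altogether.

```python
def interpolate(p_topic, p_general, lam):
    """Linearly interpolate two unigram models (dicts mapping word -> probability).

    lam is the weight given to the topic-dependent model; (1 - lam) goes to
    the general-purpose model. Words missing from one model get probability 0
    from that model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocab = set(p_topic) | set(p_general)
    return {
        w: lam * p_topic.get(w, 0.0) + (1.0 - lam) * p_general.get(w, 0.0)
        for w in vocab
    }


# Toy distributions with made-up numbers, for illustration only.
p_topic = {"phoneme": 0.5, "the": 0.5}
p_general = {"the": 0.75, "cat": 0.25}

p_mix = interpolate(p_topic, p_general, lam=0.5)
```

Because both inputs sum to one and the weights sum to one, the mixture is again a probability distribution over the union of the two vocabularies. In practice the weight would be tuned on held-out data, and real systems interpolate higher-order n-gram models rather than unigrams.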