2018 SPEECH LUMINARY: DANIEL POVEY

We present the thinkers and innovators who are creating new tools and approaches for speech technology and fostering the next generation of talent. In this installment, we talk to Daniel Povey from the Center for Language and Speech Processing at Johns Hopkins University.

Povey is known for two main contributions. First, his thesis work introduced practical innovations for discriminative training of speech recognition models, and these approaches became widely adopted. Second, his name likely carries the most cachet as the main developer of Kaldi, a popular open-source speech recognition toolkit.

“Kaldi is unique because it comes with extensive sample scripts for commonly available datasets and has a license that allows for commercial as well as research use. It also contains state-of-the-art algorithms for ASR,” says Povey, who also previously helped create HTK, the most popular ASR toolkit prior to Kaldi.

While Povey conducts research on many elements of speech technology—including acoustic and language modeling, decoding, and weighted finite-state transducers—nowadays he primarily focuses on deep neural networks. “Within the last few years, the field has moved almost entirely over to neural networks for acoustic models. So I was forced to adapt, even though I had a lot invested in Gaussian mixture model-based technology,” he says.

Povey cut his speech tech teeth as a researcher at the University of Cambridge. He then spent nearly 10 years at two key industry research labs: the IBM T.J. Watson Research Center and, later, Microsoft Research. He became an associate research scientist at Johns Hopkins in 2012 and moved into a professor role three years later.

He’s a man of letters, too, having written or co-authored many papers published in scientific journals and academic publications—one of which won an award from the International Speech Communication Association for best paper. Currently, he serves as associate editor for IEEE Signal Processing Letters.

“I’m currently working on detecting lines of text in images containing text. This is strongly needed when preprocessing data for OCR or handwriting recognition,” Povey says. “I’m also very focused on practical innovations that can improve speech recognition today. I believe concentrating on things that we know work now, as opposed to things that are sexy or which might work in the future, is one of the things that has made Kaldi so popular.”

When he’s not in front of a chalkboard or keyboard, Povey is probably playing piano or guitar as an amateur musician. But make no mistake: He’s perfectly content remaining a speech tech rock star. “It’s a great time to be working in this field, because speech recognition is starting to be useful in more and more application areas that didn’t exist only five years ago. These opportunities are creating a huge demand for ASR talent,” says Povey, noting that his graduating students are getting compensation offers well over twice what he received when he graduated in 2002. “There’s also a big demand right now for speech recognition researchers, as the technology continues to make its way into mainstream products like smart speakers.”