Speech Recognition in the 1920s: Radio Rex – The first speech recognition machine?

17 Feb

AS MAN RUSHES to build his replacements, he notices an interim requirement for man-machine communication. In the meantime at least, computers must be able to, but cannot, understand the writing and talking of men. We are protected from technological unemployment so long as we are buffered by punched cards, magnetic tapes, and on-line or off-line printers. But the day will come![1]

I was surprised to see an interesting reference at the end of the Automatic Speech Recognition chapter of Jurafsky and Martin’s Speech and Language Processing book:

The first machine that recognized speech was probably a commercial toy named “Radio Rex” which was sold in the 1920’s. Rex was a celluloid dog that moved (by means of a spring) when the spring was released by 500 Hz acoustic energy. Since 500 Hz is roughly the first formant of the vowel [eh] in “Rex”, the dog seemed to come when he was called. (David, Jr. and Selfridge, 1962)

Radio Rex from the 1920s - The first speech recognition machine

As soon as I read about Radio Rex and its 500 Hz trigger, I wanted to do a quick analysis of my own voice saying “Rex” and compare it to a female voice saying the same word, because someone in a related web forum had shared the following reaction:

That’s clever, but the crudest possible voice recognition. I study sound spectrograms all the time in my phonetics course, and it is true that the first formant in the vowel [e] is at about 500 Hz, but only in the adult male voice, so Rex would not respond to women or children unless they used a different vowel, like [i] or [I], or even [u] or [U]. They would have to call him “Reeks” or “Riks” or “Rooks” or “Ruks” in order to get the first formant low enough. I bet you have to say it really loud, too.

But what was the best way to do such an analysis? Which program would give me detailed information about different speech signals? Would it run on my Ubuntu GNU/Linux system? Enter Praat:

Praat (also the Dutch word for “talk”) is a free scientific software program for the analysis of speech in phonetics. It has been designed and continuously developed by Paul Boersma and David Weenink of the University of Amsterdam. It can run on a wide range of operating systems, including various Unix versions, Mac and Microsoft Windows (95, 98, NT4, ME, 2000, XP, Vista). The program also supports speech synthesis, including articulatory synthesis.

That is pretty much everything I needed! So I quickly recorded my voice as well as a female voice:

An analysis of the male voice by Praat - The speech signal of the word "Rex"

When I selected the region that roughly corresponded to the vowel [eh] and asked Praat for F1 (the first formant), it reported a value around 530 Hz.

Then I did a similar analysis for the female speech signal:

An analysis of the female voice by Praat - The speech signal of the word "Rex"

And Praat reported that the first formant for the [eh] part of the female “Rex” speech was around 740 Hz.

Praat in action

So my curiosity was satisfied, at least for now. I was able to check directly whether the reaction I shared above applied to my voice and the female voice I recorded. Having a Radio Rex toy would be nice: I could shout ‘his name’ and then play the female voice to see how the simple mechanism of the toy dog reacted. Theoretically, I would expect it not to spring forward when I played the female voice.
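To put a rough number on that expectation, here is a minimal sketch that treats the toy's trigger as a driven, damped oscillator tuned to 500 Hz. This is only an illustrative model, not a description of the real mechanism: the quality factor `ASSUMED_Q` is a made-up value (I have no measurements of an actual Radio Rex), and the function name `relative_response` is my own. The two F1 values are the ones Praat reported above.

```python
import math

REX_RESONANCE_HZ = 500.0  # figure quoted from Jurafsky and Martin
ASSUMED_Q = 10.0          # hypothetical quality factor, NOT a measured value


def relative_response(freq_hz, resonance_hz=REX_RESONANCE_HZ, q=ASSUMED_Q):
    """Steady-state amplitude of a driven, damped oscillator at freq_hz,
    divided by its amplitude when driven exactly at resonance_hz."""
    r = freq_hz / resonance_hz
    gain = 1.0 / math.sqrt((1.0 - r * r) ** 2 + (r / q) ** 2)
    return gain / q  # the gain at r == 1 equals q, so this normalizes to 1


# F1 values reported by Praat for the two recordings of "Rex"
for label, f1_hz in [("male F1", 530.0), ("female F1", 740.0)]:
    share = relative_response(f1_hz)
    print(f"{label} at {f1_hz:.0f} Hz: {share:.2f} of the on-resonance response")
```

Under this assumed Q, the 530 Hz formant still drives the model at more than half of its on-resonance amplitude, while the 740 Hz formant gets less than a tenth. A different Q would change the numbers, but not the ordering.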

Of course, the toy dog mechanism described above is nowhere near the automatic speech recognition we are used to seeing nowadays: it neither recognizes anyone's voice in particular nor converts speech to text that can be processed. It simply reacts to a frequency interval, regardless of who produces it. Nevertheless, it is exciting for me to have learned that such an interesting toy existed in the 1920s and surprised lots of kids. I should also note that the first scientific description of the toy seems to occur in an article from 1962, and I read about it in a modern language processing book in 2011. If this does not feel like a mini time-travel, then I guess nothing does.
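For readers who want to play with the idea in software, the "reacts to a frequency interval" behaviour can be sketched as a tiny detector over sampled audio. The sketch below uses the Goertzel algorithm to measure how much of a short window's energy sits near 500 Hz. Everything in it is an assumption made for illustration: the 8 kHz sample rate, the 80-sample window, the 0.5 threshold, and the function names. Pure sine tones stand in for the F1 energy of a real vowel, which is a big simplification.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)


def tone(freq_hz, n_samples=80, sample_rate=SAMPLE_RATE):
    """A pure sine tone, used here as a crude stand-in for vowel energy at F1."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def goertzel_power(samples, target_hz, sample_rate=SAMPLE_RATE):
    """Squared magnitude of the signal's spectrum at target_hz (Goertzel)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * target_hz / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def rex_jumps(samples, target_hz=500.0, threshold=0.5):
    """True when the share of the window's energy near target_hz exceeds threshold."""
    total = sum(x * x for x in samples)
    if total == 0.0:
        return False  # silence never triggers the toy
    # Scaled so that a pure tone exactly at target_hz gives a share of about 1.
    share = 2.0 * goertzel_power(samples, target_hz) / (len(samples) * total)
    return share > threshold


for f1_hz in (530.0, 740.0):
    print(f"{f1_hz:.0f} Hz tone makes Rex jump: {rex_jumps(tone(f1_hz))}")
```

The short window is a deliberate choice: 80 samples at 8 kHz give a frequency resolution of about 100 Hz, so the detector tolerates a tone at 530 Hz but rejects one at 740 Hz. Like the toy, it knows nothing about who is speaking or what is being said.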

In the early days of computers we all learned a congeries of theorems by Turing and von Neumann which told us (or so we thought) that a computer could do anything we told it. We would merely (!) have to specify sufficiently accurately just what it was we wanted the machine to do. And it is true that some highly variable input signals can be categorized by elaborate, exhaustive programs, but it is just not feasible thus to program recognition of printing, speech, handwriting, radar and sonar signals, and objects in photographs (clouds in satellite weather pictures, for example).[1]

Radio Rex - Magic Revealed! (in a sense 😉)

I’m also excited to have used the Praat software for such a practical purpose. I cannot help but think how nice it would have been to use this beautiful and very sophisticated program back when I took my phonology course during my cognitive science education (about 8 years ago). That would have been real fun.

PPS: Keen readers will realize that the article I quoted from more than once was written by authors from Bell Labs, which was a part of AT&T, which in turn produced the Natural Voices text-to-speech system as well as lots of other speech technologies. Later, many researchers from the famous Bell Labs moved to Google, which currently provides the tiny text-to-speech service mentioned above.