Speech Recognition Jukebox Using Atmega32

Introduction
For the Final Project in ECE 476: Designing with Microcontrollers, Robbins and Saha developed a Speech Recognition Jukebox, consisting of a speech recognition system that activated a simple music player. The speech recognition system was capable of recognizing four commands and could cycle through a simple playlist of three songs. The jukebox could turn itself on, begin play, move between tracks, and stop play, all through user voice commands.
To implement this design, Robbins and Saha needed to combine several hardware and software elements. A small microphone was purchased and used to convert the human voice into a voltage signal. This alternating voltage signal was amplified by a factor of 1,000 using three LM358 operational amplifiers. Hardware filters limited the frequency range of the input, and software filters split the signal into different frequency regions.
The values of the signal in these frequency regions helped to determine each word's unique digital fingerprint. The fingerprints of important words, such as the commands for the music-playing element of the design, were stored in the program. Each time a word was spoken, the fingerprint of this sample word was compared to the stored fingerprints to determine which command, if any, was spoken.
Recognized commands for the system are:

Command    Action
ON         Turn the music player on, play current song
END        Pause the music player
SOON       Play the next song
PREV       Play the previous song

Table 1: Voice Commands Recognized by the System
Given the correct combination of commands, a simple music tune would be played on the speaker of the television. A more in-depth analysis of the workings of both the software and hardware sections of the design can be found below.

High Level Software Design
Speech recognition systems have been implemented in a variety of applications, most notably automated caller systems and security systems. These systems have progressed considerably in recent years and can perform numerous tasks from simple spoken commands. For the ECE 476: Designing with Microcontrollers Final Project, Robbins and Saha's ambition was to combine speech recognition technology with music playback. Robbins and Saha were inspired by the work of groups from previous years, cited in Appendix 5, which demonstrated that such a project was realizable within the timing and hardware constraints of the ECE 476 Final Project.

Capturing the Human Voice
The human hearing system can capture sound over a very wide frequency spectrum, from 20 Hz at the low end to upwards of 20,000 Hz at the high end. The human voice, however, does not have this kind of range. Typical frequencies for the human voice are on the order of 100 Hz to 2,000 Hz. Robbins and Saha used hardware filters to pass only the frequencies between approximately 150 Hz and 1,500 Hz, and several digital Butterworth filters to split this frequency spectrum into smaller regions. Both types of filters are discussed in more depth below.
But how often should one sample a signal that is oscillating at these frequencies? According to the Nyquist sampling theorem, the sampling rate should be at least twice the highest frequency in the signal, so that at least 2 samples are taken per period of that highest-frequency component. With voice content of up to 2,000 Hz, the sampling rate of the program would have to be no less than 4,000 samples per second.
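The arithmetic behind this sampling rate can be written out as a short C sketch. The function names are hypothetical and chosen only for illustration; the 2,000 Hz and 4,000 samples-per-second figures are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum sampling rate (Hz) for a signal whose highest frequency
   component is f_max_hz: twice that frequency. */
uint32_t nyquist_rate_hz(uint32_t f_max_hz)
{
    return 2u * f_max_hz;
}

/* Number of whole samples captured during one period of a tone at
   f_signal_hz when sampling at f_sample_hz. */
uint32_t samples_per_period(uint32_t f_sample_hz, uint32_t f_signal_hz)
{
    assert(f_signal_hz > 0);
    return f_sample_hz / f_signal_hz;
}
```

With these helpers, a 2,000 Hz upper limit gives a minimum rate of 4,000 samples per second, and at that rate a 2,000 Hz tone is sampled exactly twice per period while lower voice frequencies receive proportionally more samples.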
The human voice also produces a sound wave, which compresses and decompresses the air as it travels. As will be discussed below in the Hardware Design section, a microphone was used to convert this pressure wave into an electrical signal that could be filtered, amplified, and analyzed.

Butterworth Digital Filters
The frequency spectrum of the human voice needed to be divided into several sub-intervals to allow analysis of the specific frequency content of the word being spoken. Robbins and Saha divided the spectrum into seven intervals using six 4-pole Butterworth band-pass filters and one 2-pole Butterworth high-pass filter. The table below lists the range of each filter:

Table 2: Frequency Range of Digital Filters
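To illustrate what one of these digital filters computes for each sample, here is a minimal floating-point sketch of a 2-pole IIR section (a biquad). It is not the fixed code used in the project, which was supplied by the course. The struct and function names are hypothetical, and the example coefficients are for a 2-pole Butterworth high-pass whose cutoff is placed at one quarter of the sample rate purely for illustration; that cutoff is an assumption, not one of the project's actual band edges. A 4-pole band-pass filter can be built from two such sections in series.

```c
#include <assert.h>
#include <stddef.h>

/* One 2-pole IIR section (biquad), direct form I, normalized so a0 = 1. */
typedef struct {
    float b0, b1, b2;   /* feed-forward coefficients */
    float a1, a2;       /* feedback coefficients */
    float x1, x2;       /* previous two inputs */
    float y1, y2;       /* previous two outputs */
} Biquad;

/* ILLUSTRATIVE ONLY: 2-pole Butterworth high-pass with its cutoff at
   one quarter of the sample rate (an assumed value, not the project's). */
Biquad biquad_example_highpass(void)
{
    Biquad f = { 0.29289322f, -0.58578644f, 0.29289322f,
                 0.0f, 0.17157288f,
                 0.0f, 0.0f, 0.0f, 0.0f };
    return f;
}

/* Process one sample:
   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2] */
float biquad_step(Biquad *f, float x)
{
    float y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
            - f->a1 * f->y1 - f->a2 * f->y2;
    f->x2 = f->x1;
    f->x1 = x;
    f->y2 = f->y1;
    f->y1 = y;
    return y;
}

/* Run one sample through n sections in series
   (a 4-pole filter would use n = 2). */
float biquad_cascade_step(Biquad *sections, size_t n, float x)
{
    assert(sections != NULL);
    for (size_t i = 0; i < n; i++) {
        x = biquad_step(&sections[i], x);
    }
    return x;
}
```

Feeding this example section a constant input drives its output toward zero, while an input that alternates sign on every sample passes at close to unity gain, which is the behaviour expected of a high-pass filter.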
The Butterworth filter is designed to have as flat a response as possible in the pass band, passing the input at close to unity gain. In the program design, the Butterworth filters split the A/D converter output into its separate frequency bands. The code for both the high-pass and the band-pass Butterworth filters was written by Bruce Land and can be found on the ECE 476 course website.

Control Section
The output of the digital filters would help to formulate a digital fingerprint that was unique for each word. Five samples were taken from each digital filter, yielding 35 total samples that would comprise the digital fingerprint of each word. The fingerprints of the dictionary words, ON, END, PREV, and SOON, were stored in the software program. Whenever the user spoke a command to the system, this sample's digital fingerprint would be calculated and then compared to that of each dictionary word.
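The comparison of a sample fingerprint against the stored dictionary can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the project's code: it uses floating-point fingerprints and the Pearson correlation as the similarity measure, and every name in it (Command, correlation_squared, best_match, and the constants) is hypothetical. Because the match is ranked by the magnitude of the correlation, the sketch compares squared correlations, which gives the same ranking without needing a square root.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FILTERS        7
#define SAMPLES_PER_FILTER 5
#define FINGERPRINT_LEN    (NUM_FILTERS * SAMPLES_PER_FILTER)  /* 35 values */
#define NUM_WORDS          4

/* Dictionary order assumed for this sketch. */
typedef enum { CMD_ON, CMD_END, CMD_SOON, CMD_PREV } Command;

/* Squared Pearson correlation of two vectors of length n.
   Returns 0 if either vector is constant (correlation undefined). */
float correlation_squared(const float *a, const float *b, size_t n)
{
    assert(n > 0);
    float mean_a = 0.0f, mean_b = 0.0f;
    for (size_t i = 0; i < n; i++) {
        mean_a += a[i];
        mean_b += b[i];
    }
    mean_a /= (float)n;
    mean_b /= (float)n;

    float cov = 0.0f, var_a = 0.0f, var_b = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float da = a[i] - mean_a;
        float db = b[i] - mean_b;
        cov   += da * db;
        var_a += da * da;
        var_b += db * db;
    }
    if (var_a == 0.0f || var_b == 0.0f) {
        return 0.0f;
    }
    return (cov * cov) / (var_a * var_b);
}

/* dictionary holds NUM_WORDS fingerprints back to back (word-major).
   Returns the word whose fingerprint correlates most strongly,
   in absolute value, with the sample. */
Command best_match(const float *dictionary, const float *sample)
{
    Command best = CMD_ON;
    float best_r2 = -1.0f;
    for (int w = 0; w < NUM_WORDS; w++) {
        float r2 = correlation_squared(dictionary + (size_t)w * FINGERPRINT_LEN,
                                       sample, FINGERPRINT_LEN);
        if (r2 > best_r2) {
            best_r2 = r2;
            best = (Command)w;
        }
    }
    return best;
}
```

As written, best_match always returns one of the four words. Rejecting an utterance as "no command" would need an additional minimum-correlation threshold, whose value is not given in the text and is therefore left out of the sketch.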
To compare the dictionary words with the sample, the program calculated the correlation between the two fingerprint vectors. The dictionary word with the highest absolute correlation was chosen as the match. When an input command was recognized as a dictionary word, the control section would set a series of flags that would update the state machine. The state machine changed state when these flags were set, and each state corresponded to a separate song being played.

Audio Playback
Robbins and Saha chose three songs to be played by the jukebox: a Sonatina written by W.A. Mozart, Ode to Joy written by Ludwig van Beethoven, and the Star Spangled Banner. These songs were chosen for their simple melodies and easy recognition. Using the audio production code provided in Lab 4: Digital Oscilloscope, shown below, these songs' notes were converted into a format that could be played on the television speaker.

Table 3: Conversion Table for Musical Notes (Bold C corresponds to middle C)

Logical Structure
The logical structure of the program is quite simple. The user speaks the desired command into the microphone. The microphone converts this audio signal into an electrical signal, which is then filtered and amplified before being sent to the A to D converter. The program samples the input through the A to D converter and runs the converter output through seven digital filters. The control section uses the outputs of the seven digital filters to build a working fingerprint of the spoken command and compares this fingerprint with the stored fingerprints to determine which command, if any, has been spoken. Upon recognizing a user command, a state machine within the control section changes state. Each state of this state machine corresponds to a separate song being activated. Thus, upon changing state, a different song signal is sent to the television audio connection, enabling music playback.

Hardware/Software Tradeoffs
To execute all the commands in the program, there must be enough clock cycles between samples. The Mega32 clock runs at 16 MHz (16 million clock cycles per second). As the code requires that the A to D converter be sampled at a rate of 4 kHz, all the code that runs for each sample must execute within 4,000 clock cycles (16 MHz / 4 kHz). The hardware must also work in real time and not further limit the capabilities of the program. As the hardware consists mostly of resistors and capacitors, and the LM358 is a relatively fast op-amp, there are no concerns with regard to hardware affecting the software.
The only remaining constraint is that all the computations performed by the program fit in 4,000 clock cycles. The seven digital filters consume the majority of the clock cycles. Each 4-pole band-pass Butterworth filter takes 228 clock cycles and the 2-pole high-pass Butterworth filter takes 148 cycles. Thus, all the filters together consume 1,516 cycles (6 x 228 + 148). This leaves almost 2,500 clock cycles (2,484) for the remainder of the code, which is more than enough time.
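The cycle budget above can be checked with a few lines of C. The constants are the figures quoted in this section; the macro and function names are hypothetical and used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_CLOCK_HZ     16000000UL  /* Mega32 clock */
#define ADC_SAMPLE_HZ    4000UL      /* A to D sampling rate */
#define NUM_BANDPASS     6           /* 4-pole band-pass filters */
#define CYCLES_BANDPASS  228         /* cycles per band-pass filter */
#define NUM_HIGHPASS     1           /* 2-pole high-pass filter */
#define CYCLES_HIGHPASS  148         /* cycles per high-pass filter */

/* Clock cycles available between two consecutive samples. */
uint32_t cycles_per_sample(uint32_t cpu_hz, uint32_t sample_hz)
{
    assert(sample_hz > 0);
    return cpu_hz / sample_hz;
}

/* Clock cycles spent in the seven digital filters for each sample. */
uint32_t filter_bank_cycles(void)
{
    return NUM_BANDPASS * CYCLES_BANDPASS + NUM_HIGHPASS * CYCLES_HIGHPASS;
}

/* Clock cycles left for the rest of the per-sample code. */
uint32_t cycles_left_per_sample(void)
{
    return cycles_per_sample(CPU_CLOCK_HZ, ADC_SAMPLE_HZ) - filter_bank_cycles();
}
```

With the values given, this works out to 4,000 cycles per sample, 1,516 cycles for the filters, and 2,484 cycles remaining.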