Speech To Text (STT)

The STT part is pretty simple, as it consists of three steps: activation, acquisition, and translation. Activation can be accomplished via a “key press”, but I would much rather use voice activation. Assuming you live in a normally quiet environment, it is perfectly practical (and easy) to calculate the root mean square (RMS) of the microphone signal and activate when it exceeds a given threshold. You can set the threshold by acquiring a distribution of RMS values and looking at standard deviations, or you can just choose a number. Either way, you can look at typical RMS values for your given mic/environment using the following:
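Here is a minimal sketch of the RMS measurement, not necessarily the exact code I run. It assumes the microphone delivers signed 16-bit little-endian mono PCM, and the names rms, suggest_threshold, and print_mic_rms are made up for illustration. The arecord flags used (-q, -f, -r, -c, -t raw) are standard, but the sample rate and chunk size are arbitrary choices.

```python
import math
import statistics
import struct
import subprocess


def rms(chunk: bytes) -> float:
    """RMS of a chunk of signed 16-bit little-endian PCM samples."""
    n = len(chunk) // 2  # two bytes per sample; a trailing odd byte is ignored
    if n == 0:
        return 0.0
    samples = struct.unpack("<%dh" % n, chunk[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def suggest_threshold(rms_values, n_sigma=3.0):
    """Mean of the background RMS values plus n_sigma standard deviations."""
    return statistics.mean(rms_values) + n_sigma * statistics.pstdev(rms_values)


def print_mic_rms(n_chunks=50, chunk_bytes=4096, rate=16000):
    """Print the RMS of n_chunks chunks read from arecord (assumes ALSA is set up)."""
    cmd = ["arecord", "-q", "-f", "S16_LE", "-r", str(rate), "-c", "1", "-t", "raw"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    values = []
    try:
        for _ in range(n_chunks):
            chunk = proc.stdout.read(chunk_bytes)
            if not chunk:  # arecord exited early
                break
            values.append(rms(chunk))
            print(round(values[-1]))
    finally:
        proc.terminate()
        proc.wait()
    return values
```

Run print_mic_rms() while the room is quiet, then feed the returned list to suggest_threshold() if you prefer the standard-deviation route over picking a number by eye.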

I’ve set my threshold to 1050 (an arbitrary value; you should find your own). Now the first major subroutine of the AI can be set up: the listening function. This will essentially run forever, and it’s nice to let it run as a thread (that may be needed later). This is the basic code for the activation:
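A rough sketch of such a listening loop follows. The function name listen and its parameters (read_rms, on_activate, stop_event) are hypothetical; I pass the RMS reader and the activation handler in as callables so the loop can be tried without a microphone.

```python
import threading

THRESHOLD = 1050  # arbitrary value from the text; measure your own


def listen(read_rms, on_activate, stop_event, threshold=THRESHOLD):
    """Poll the RMS level until stop_event is set; fire on_activate above threshold."""
    while not stop_event.is_set():
        try:
            level = read_rms()
            if level > threshold:
                on_activate()
        except Exception as exc:
            # Keep the loop alive and report the problem (handy while debugging).
            print("listen error:", exc)
            stop_event.wait(0.1)  # brief pause so a persistent error does not spin


def start_listening(read_rms, on_activate):
    """Run listen() in a daemon thread; set the returned event to stop it."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=listen, args=(read_rms, on_activate, stop_event), daemon=True
    )
    thread.start()
    return thread, stop_event
```

In the real assistant, read_rms would read one chunk from the microphone and return its RMS, and on_activate would do something like record the user's voice and hand the resulting text to the event handler.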

The try/except block is there to catch errors, which is especially useful during debugging.

The acquisition and translation stages are done in another subroutine, getUsersVoice. This is pretty simple code: it will first beep to notify you that acquisition has begun. Then it will use arecord to record the audio for a given amount of time. It will beep again when finished. Then it will send the recorded audio to the Google Speech API, which returns the text. For this last step I use a separate bash file, parseVoiceText.sh, just because there are so many quotation marks. Here is the code:
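The sketch below shows one way getUsersVoice could be laid out. Several things here are assumptions of mine: the terminal-bell beep, the default recording length and file path, the helper arecord_command, the run parameter (only there so the function can be exercised without a microphone), and above all the idea that parseVoiceText.sh takes the recording's path as its argument and prints the recognized text to stdout. The Google Speech API call itself lives in that script and is not shown.

```python
import subprocess


def arecord_command(wav_path, seconds, rate=16000):
    """Build an arecord call that records `seconds` of 16-bit mono audio to wav_path."""
    return ["arecord", "-q", "-f", "S16_LE", "-r", str(rate), "-c", "1",
            "-d", str(seconds), wav_path]


def beep():
    """Ring the terminal bell (assumed stand-in for whatever beep you prefer)."""
    print("\a", end="", flush=True)


def getUsersVoice(seconds=5, wav_path="/tmp/voice.wav",
                  script="./parseVoiceText.sh", run=subprocess.run):
    """Record the user, pass the recording to the bash script, return its text."""
    beep()  # acquisition starts
    run(arecord_command(wav_path, seconds), check=True)
    beep()  # acquisition finished
    # Assumption: the script accepts the audio path and prints the transcript.
    result = run(["bash", script, wav_path],
                 check=True, capture_output=True, text=True)
    return result.stdout.strip()
```

Calling getUsersVoice(seconds=3) would then record three seconds and return whatever the script printed, with surrounding whitespace removed.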

As you probably noticed, I didn’t tell you about processInput(). That’s going to be the main function to handle events. I am currently fleshing that out and will post back when I have some more on that.