Speech recognition and speech synthesis using neural networks

Today I want to show you how Joker can be used for speech recognition and speech synthesis using neural networks and the Joker Empathy module.

Joker Empathy module

I have brewed two Docker containers for super simple usage. Just one command is required to run the neural network and obtain the results. This tutorial should work on any Linux distribution and on OS X. No GPU is required, only a CPU.

This funny video shows voice interaction with Joker:

Speech recognition (speech-to-text)

This service is based on the Kaldi ASR project and uses Kaldi’s ‘chain’ models (a type of DNN-HMM model). The trained model itself was released by the api.ai team. Its vocabulary contains 127,847 words. For comparison, the Oxford English Dictionary contains 171,476 words, and an average English-speaking adult knows between 20,000 and 30,000 words. The model shows an 11.2% word error rate (WER), which is a very good result! “Old” speech recognition methods (GMM-HMM) only reach a WER of 21% or more.
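For readers new to the metric, WER is conventionally computed from the word-level edit distance between the recognized text and a reference transcript. The counts in the worked line below are hypothetical, chosen only to reproduce an 11.2% rate; nothing here says which test set produced the figures quoted above. The last line is plain arithmetic on the two quoted rates, taking 21% as the lower bound for GMM-HMM.

```latex
% S = substituted words, D = deleted words, I = inserted words,
% N = number of words in the reference transcript
\mathrm{WER} = \frac{S + D + I}{N}

% Hypothetical counts: N = 1000, S = 70, D = 25, I = 17
\mathrm{WER} = \frac{70 + 25 + 17}{1000} = \frac{112}{1000} = 11.2\%

% Relative error reduction when going from 21% to 11.2% WER
\frac{21 - 11.2}{21} = \frac{9.8}{21} \approx 0.47
```

In other words, if the GMM-HMM baseline sits at 21%, the chain model makes a little under half as many word errors.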

To run the test, just issue the following command in a console:

docker run -it aospan/stt

The built-in file will be processed, and the output should contain the following text:
/opt/in/in.wav HELLO THIS IS SPEECH TO TEXT RECOGNITION FOR JOKER PROJECT

This is the text the system actually recognized from the built-in audio file.
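If you want to use the recognized text in a script, the result line can be separated from its file path with a small filter. This is only a sketch: the function name `extract_transcript` is made up for illustration, and it assumes every result line has the shape shown above (a `.wav` path, a space, then the transcript), with any other output lines to be ignored.

```shell
#!/bin/sh
# Sketch: extract the transcript from lines shaped like
#   /opt/in/in.wav HELLO THIS IS ...
# Assumption: a result line is "<path ending in .wav> <transcript>".
extract_transcript() {
    # Remove carriage returns (a terminal allocated with -t produces CRLF
    # line endings), print only lines that begin with a .wav path, and
    # strip that path together with the spaces that follow it.
    tr -d '\r' | sed -n 's/^[^ ]*\.wav  *//p'
}

# Possible usage, assuming the container prints its result to standard output:
#   docker run aospan/stt | extract_transcript
```

Lines that do not start with a `.wav` path are dropped, so the filter prints nothing if the output format differs from the one assumed here.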

For speech synthesis (text-to-speech), the default phrase ‘Hello, my name is Joker. Today is a great day because it’s my birthday’ was used. To supply your own phrase, run the following command:

docker run -it -v `pwd`/out:/opt/out aospan/tts "your phrase here"
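For repeated use, the command above can be wrapped in a small shell function. This is a minimal sketch, not part of the containers themselves: the function name `synthesize` and the `DOCKER` override variable are hypothetical, while the image name and the `/opt/out` mount point are taken from the command above. The name of the audio file written to the output directory is not stated here, so the sketch does not assume one.

```shell
#!/bin/sh
# Sketch: wrapper around the text-to-speech container call shown above.
# DOCKER can be overridden (e.g. DOCKER=echo) to preview the command
# without actually running Docker.
DOCKER="${DOCKER:-docker}"

synthesize() {
    phrase="$1"
    # Default to ./out as an absolute path, since Docker bind mounts
    # need an absolute host path.
    out_dir="${2:-$(pwd)/out}"

    if [ -z "$phrase" ]; then
        echo "usage: synthesize \"phrase\" [absolute-output-dir]" >&2
        return 1
    fi

    # Create the output directory up front so it belongs to the current user.
    mkdir -p "$out_dir" || return 1

    # Quoting keeps phrases and paths that contain spaces intact.
    "$DOCKER" run -it -v "$out_dir:/opt/out" aospan/tts "$phrase"
}

# Possible usage:
#   synthesize "your phrase here"
#   ls out/
```

Creating the directory before the run avoids having Docker create it on the fly, which on Linux typically leaves it owned by root.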

Conclusions

Now we can build very user-friendly systems with natural voice control, like Amazon Alexa or Google Home. But Joker doesn’t need online connectivity: all speech processing is done locally. This improves privacy and security, since no audio data is shared with third parties. Voice control also works when no internet connection is configured (for example, on fresh installations).