on computer science and media topics

Using the power of google cloud API: A dockerized node app counting words in prasentations.

For the Dev4Cloud lecture at HdM Stuttgart, we created a simple Go/NodeJS/React App, which helps people to keep track of often used words during presentations. In a presentation setting, most people tend to use too many fill words and to train against this, we want to introduce our presentation counter to you.

The presentation counter consists of 3 parts, the React frontend, the GO backend and the NodeJS speech server for the communication with Google Cloud platform. To make it short, the frontend captures the microphone audio, sends it to the speech server, and the speech server gets the audio transcript from Google. Then the transcript is send back to the Go Backend which saves the relevant words in an Alpine db and updates the frontend.

Frontend

Static compiled react frontend, contain code to communicate via Websocket with the Go and NodeJS server. As well as capturing the microphone audio. Capturing audio needs a bit of boilerplate code:

The micophoneProcess inside the handleSuccess function receives a stream of an audio buffer with a given buffersize. The micophoneProcess does two important things, first it converts the stream from 48000 Hz to 16000 Hz and then uses WebSockets to send it to the NodeJS server close to real time.

Our frontend. Mobile first witch a super simple UI. Tap the mic to start the session and add words on the bottom.

Speech Server

This server is a lightweight NodeJS app. For using the Google Cloud API you need to have your own user-key and this should not be shared with the frontend, so we created a layer in between them. This server holds in key with the dotenv-library. Google Cloud API needs the key to be in the process environment variables, and adding the key on operating system level would be a big pain, so we used this great library. After an audio stream started, the server uses the key to authorize with Google Cloud API, and creates uses speechClient.streamingRecognize(request) to open a stream to the cloud. Inside the request parameter is the configuration and encoding information. Our configuration looks like this:

Please note the last item: interimResults: true This tells the Google Cloud API to send unfinished results. Why would we want unfinished results, could you ask yourself? The answer is simple: we do not want to wait until the sentence is finished. As soon as a sentence is finished, Google can calculate the probability more accurate, since the context is closed. That means, when ever a Google detects a finished sentence, we can a more accurate prediction. But we would have to wait until the sentence is finished. Because of that we are using the less accurate results to get faster results, near real time again and might have to correct the displayed results if they change.

Google Cloud API gives a transcript of all the spoken words it can recognize in the audio stream. So the NodeJS Server sends this transcript, divided in a final and a interim part to the Go Backend where the words are going to be counted – and as soon as the interim part gets final recounted. In that way we have fast results, which is important for a word counter app – nobody would like to wait until the sentence is finished for the counts to update.

Google Cloud API

For this project we used the Google Speech API. To be able to use this, you first have to a Google API key, but this is quite straightforward and described here. Next you have to download a library from Google, which mirrors the API functions in your code. With NodeJS the installation is npm install google-cloud/speech --save. Now you can import it and initialise it.

After those steps you are ready to go! Just follow the steps described in the Speech server section.

Another thing to add, with your Google accounts comes 300$ of free credit for the Google Cloud Platform. At first we wondered if this will be enough to develop the project, until we found out that you also have 60 free minutes to analyse with the Speech API. Even after those 60 Minutes it is only 15 ct/min. What we want to say, just play with it, its easy and the possibility to create a nice prototype is really awesome. Thanks Google!

Go-Backend

At the beginning, this project was fork of a project we did in the same semester. In the base project people can count words of a person doing a presentation by clicking on a button. For this former project I decided to try language in the backend which was new to me: GO. This language is more low-level than Java or JavaScript, but more high-level than C or C++. So if you come from a Java/JavaScript like we did, your mind will struggle probably from time to time. There some kind Pointer like in C but fortunately no pointer arithmetic. The main advantage of this language shall be, that it can be almost as fast as C++ but takes way less time to compile. Additionally it has some pure functional constructs in it.

So long story short: What we wanted to do is to extend this server that it uses the google speech-to-text to get back text form of someone’s presentation in real time. During the development process we found out that the requirements didn’t really meet with the previous ones. The server had to update the frontend in real time like it did before but now we wanted the server also to correct the frontends data. Why correcting? From the google API you get back a text in real time if you do stream processing. More or less word for word the sentence you get back is extended. But this sentence isn’t fix, as well as the words in it as long as the sentence hasn’t finished. This is because google always tries to return a reasonable sentence.

These facts and we still needed to count the words on every update made us to almost rewrite the whole web server code.

Docker

Both of us Simon and me Marius we still don’t know what we think about this tool. The general idea of having container deencapsulation purpose is great but working with docker often isn’t.

There is no real debug tool, at least we haven’t found one. Then it caches that much that you have to rebuild your container almost every time from scratch disabling the cache. This can cause a very long build-step depending on what you have to install during the startup process.

But now to the advantages and what we did with it. Both of our servers, the node and the GO one are running its own alpine container. Docker-Compose is managing the they are in a docker network with a DNS, so the can communicate with their respective names.

This make it possible to deploy the application anywhere a docker daemon is available. What is quite nice. The whole software infrastructure comes with docker and the definitions in the Dockerfiles.

Summary

Working with cloud service is easier than we thought. Using the technology offered by google was very straight forward, but the docker and the GO thing wasn’t.

This was our first project using the cloud. As a result we just took a simple functionality it offers and created something it. Regarding the effort from an engineering point of view, I might have been easier to use more out of the box functionalities. For example you could have deployed the server code directly to a go or node service a google. There you can create containers by simply drag and drop modules to it.

Finally we can say, that it is worth to look deeper into the cloud topic. Not only because you needn’t to buy hardware, but also because it can save you a lot time to develop an application.