A speech-to-text prototype in Jitsi Meet!

Aug 1, 2017

A month has passed and a lot has happened. I now have a working prototype of a
transcription service in Jitsi Meet!

Quick reminder

My GSoC project has two main goals: providing a live, as close to real-time as possible, transcription
of every participant, and delivering a final, complete transcript at the end of a conference. The plan was
to modify Jigasi to accomplish this.

The prototype

A picture (or 60 pictures per second, in this case ;) ) is worth more than a thousand words, right? So I’ve made a
demo, which you can watch below.

You can see that Jigasi can be dialed into a conference with the special URI jitsi_meet_transcribe. This will
later be hidden behind a UI. After it is dialed in, Jigasi will announce that it has started transcribing, and will post results
to the chat as soon as it has finished transcribing a particular piece of audio from a participant. Because this
currently spams the chat with transcripts, it will also be moved to another UI element. Jigasi also
internally stores a final transcript, but we still need to come up with a way to deliver it to the
end user.

What does the demo conveniently not show?

There are a few hiccups which need to be ironed out. On a few occasions Jigasi will fail to start up correctly,
or it might join a room without starting to transcribe. You can also see in the demo that the accuracy is not
perfect (“dog” vs “talk”), and that I talk more slowly than you normally would in a meeting, which also improves
the accuracy. I also did not use more difficult words, like jargon, which it might fail to recognize. These
issues are not really under my control, but depend on what the speech-to-text backend can provide. There
might of course be some optimisation possible, like sending the audio in a different format, or providing the service
with a vocabulary of words which might’ve been said. There is currently no good way to stop the transcriber
without kicking it or using a command in the browser console. There is also a major limitation set by the
backend we use, which I will talk about in the next paragraph.

The speech-to-text API behind Jigasi

We decided on using the Google Cloud speech-to-text API as our backend for the transcription. Google provides
a REST and a gRPC API, as well as a client library. Although the client library is currently in beta, we
decided on using it, as it was very easy to import into Jigasi and it supports StreamingRecognize.

Google’s speech-to-text API provides 3 different services:

Recognize: This is a synchronous API in which you provide a single audio file and your application
will block until you receive the transcription. Currently the audio can not be longer than 1 minute. Supported
by REST, gRPC and the client library.

LongRunningRecognize: This is an asynchronous API in which you provide a single audio file and your
application will retrieve (interim) results via a provided interface. Currently the audio can not be longer
than 80 minutes. Supported by REST, gRPC and the client library.

StreamingRecognize: This is an asynchronous API in which you open a “session” and provide a
continuous stream of audio files, and will receive results as soon as they are ready via a provided interface.
Currently you can keep sending audio for 1 minute. Only supported by gRPC and the client library.

For this project the StreamingRecognize service seems like the best choice, as Jitsi Meet sends audio packets in
20 ms intervals, which we can then forward to the API (after locally buffering for ~100–500 ms to save bandwidth;
the best chunk duration is still to be determined :) ). This avoids having to determine when someone starts and
finishes speaking, which would have been necessary when using (LongRunning)Recognize. It is also better than the other two
APIs because StreamingRecognize can provide a result almost immediately when someone finishes speaking, whereas with
the other two you can only send the audio for processing once someone has finished their sentence, which can lead to
heavy delays the longer the audio is.
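To make the buffering idea concrete, here is a minimal sketch of collecting 20 ms frames and forwarding them in ~100 ms chunks. The AudioFrameBuffer class and its Consumer sink are my own illustration, not actual Jigasi code.

```java
import java.io.ByteArrayOutputStream;
import java.util.function.Consumer;

// Collects 20 ms audio frames and forwards them in ~100 ms chunks
// to save bandwidth (illustrative sketch, not actual Jigasi code).
class AudioFrameBuffer
{
    private static final int FRAMES_PER_CHUNK = 5; // 5 x 20 ms = 100 ms

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final Consumer<byte[]> sink; // e.g. session::giveNextAudioFrame
    private int framesBuffered = 0;

    AudioFrameBuffer(Consumer<byte[]> sink)
    {
        this.sink = sink;
    }

    // Accepts one 20 ms frame; forwards a chunk once ~100 ms has accumulated
    void addFrame(byte[] frame)
    {
        buffer.write(frame, 0, frame.length);
        framesBuffered++;
        if (framesBuffered >= FRAMES_PER_CHUNK)
        {
            flush();
        }
    }

    // Sends whatever is buffered, e.g. when a participant stops speaking
    void flush()
    {
        if (buffer.size() > 0)
        {
            sink.accept(buffer.toByteArray());
            buffer.reset();
            framesBuffered = 0;
        }
    }
}
```

At 16 kHz, 16-bit mono, a 20 ms frame is 640 bytes, so each flushed chunk is about 3200 bytes.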

Dealing with the 1 minute restriction

The only problem with StreamingRecognize is that Google decided to allow a stream of up to a single minute, after
which it will unmercifully throw an exception at you(r application). This is very unfortunate for our use-case, as users
often talk for a longer amount of time, especially in settings like business meetings or lectures. It would be a lot
easier if we could just open a session for every participant at the start of a conference and keep it open for the
entire duration, but alas, we have to work around it. My quick and dirty solution is to open an initial session
when someone starts speaking and keep it open for 55 seconds. However, after 50 seconds we open an additional session,
which will overlap with the “old” session for 5 seconds so as to catch words if the user is in the middle of a sentence. Then the
new session becomes the old one, which will stay open for 55 seconds in total, and the cycle continues.
This kind of works, but it might produce some duplicate transcripts. It is also bad when a session has to be stopped while
someone is actively speaking, which will heavily impact the accuracy. There might be a solution in listening
for the is_final flag which Google sets on its results. The problem with this approach is that if someone is actively speaking,
Google might not detect the end of a sentence, and thus not send is_final=true before your 1 minute is up. This will be a
major task to solve in the upcoming weeks.
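The rotation scheme above can be sketched as follows. SessionRotator and its time-based bookkeeping are my own illustration (in Jigasi each entry would wrap a real StreamingRecognize session), not actual project code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the overlapping-session rotation described above:
// sessions live for 55 s, a replacement opens after 50 s, so each
// rotation has a ~5 s overlap (illustrative, not actual Jigasi code).
class SessionRotator
{
    private static final long LIFETIME_MS = 55_000; // close a session after 55 s
    private static final long ROTATE_MS = 50_000;   // open a replacement after 50 s

    // Start times of the currently open sessions, oldest first
    private final List<Long> openSessions = new ArrayList<>();

    // Called for every audio chunk; returns how many sessions
    // (one or two) the chunk should currently be sent to
    int onAudio(long nowMs)
    {
        // Close any session that has reached its 55 second lifetime
        openSessions.removeIf(start -> nowMs - start >= LIFETIME_MS);

        // Open the first session, or a replacement once the newest
        // session is 50 seconds old
        if (openSessions.isEmpty()
            || nowMs - openSessions.get(openSessions.size() - 1) >= ROTATE_MS)
        {
            openSessions.add(nowMs);
        }
        return openSessions.size();
    }
}
```

During the 5-second overlap audio goes to both sessions, which is exactly where the duplicate transcripts mentioned above come from.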

Using StreamingRecognition with the Java Client library on an AudioStream

To end this post, I want to quickly explain how to use StreamingRecognition with the Java client library, as it is
currently missing from their documentation.

This is a code sample using the streaming API.

import com.google.api.gax.grpc.ApiStreamObserver;
import com.google.api.gax.grpc.StreamingCallable;
import com.google.cloud.speech.spi.v1.SpeechClient;
import com.google.cloud.speech.v1.*;
import com.google.common.util.concurrent.SettableFuture;
import com.google.protobuf.ByteString;

import java.io.IOException;
import java.util.List;

public class StreamingSession
{
    /**
     * This listens for responses from Google
     */
    private ResponseApiStreamingObserver<StreamingRecognizeResponse> responseObserver;

    /**
     * This sends requests to Google
     */
    private ApiStreamObserver<StreamingRecognizeRequest> requestObserver;

    /**
     * The client managing all connections
     */
    private SpeechClient speechClient;

    /**
     * Create a session and send config
     *
     * @throws IOException when failing to connect, can be due to missing credentials
     */
    public StreamingSession()
        throws IOException
    {
        // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
        speechClient = SpeechClient.create();

        // Configure request with raw PCM audio
        RecognitionConfig recConfig = RecognitionConfig.newBuilder()
            .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .build();
        StreamingRecognitionConfig config = StreamingRecognitionConfig.newBuilder()
            .setConfig(recConfig)
            .build();

        responseObserver = new ResponseApiStreamingObserver<StreamingRecognizeResponse>();
        StreamingCallable<StreamingRecognizeRequest, StreamingRecognizeResponse> callable =
            speechClient.streamingRecognizeCallable();
        requestObserver = callable.bidiStreamingCall(responseObserver);

        // The first request must **only** contain the audio configuration:
        requestObserver.onNext(StreamingRecognizeRequest.newBuilder()
            .setStreamingConfig(config)
            .build());
    }

    /**
     * Give a frame of continuous audio to Google
     *
     * @param audio the audio as an array of bytes
     */
    void giveNextAudioFrame(byte[] audio)
    {
        // Subsequent requests must **only** contain the audio data.
        requestObserver.onNext(StreamingRecognizeRequest.newBuilder()
            .setAudioContent(ByteString.copyFrom(audio))
            .build());
    }

    /**
     * Close the session and print results to standard output
     *
     * @throws Exception when failing to close the session
     */
    void endSession()
        throws Exception
    {
        // Mark transmission as completed after sending the data.
        requestObserver.onCompleted();

        List<StreamingRecognizeResponse> responses = responseObserver.future().get();
        for (StreamingRecognizeResponse response : responses)
        {
            for (StreamingRecognitionResult result : response.getResultsList())
            {
                for (SpeechRecognitionAlternative alternative : result.getAlternativesList())
                {
                    System.out.println(alternative.getTranscript());
                }
            }
        }
        speechClient.close();
    }

    /**
     * This class receives the text results once they come in with the #onNext message
     */
    class ResponseApiStreamingObserver<T>
        implements ApiStreamObserver<T>
    {
        private final SettableFuture<List<T>> future = SettableFuture.create();
        private final List<T> messages = new java.util.ArrayList<T>();

        @Override
        public void onNext(T message)
        {
            messages.add(message);
        }

        @Override
        public void onError(Throwable t)
        {
            future.setException(t);
        }

        @Override
        public void onCompleted()
        {
            future.set(messages);
        }

        // Returns the SettableFuture object to get received messages / exceptions.
        public SettableFuture<List<T>> future()
        {
            return future;
        }
    }
}

The important bits are the ResponseApiStreamingObserver<StreamingRecognizeResponse> responseObserver and
ApiStreamObserver<StreamingRecognizeRequest> requestObserver objects. As their names suggest, they
offer the ability to send audio (as StreamingRecognizeRequest objects) and receive transcripts (as
StreamingRecognizeResponse objects). The giveNextAudioFrame() method can be used to give a chunk
of your audio stream, and when the stream is empty you can call endSession(), which will stop the session
and print the final results.
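As a side note, audio handed to giveNextAudioFrame() has to match the RecognitionConfig above (16 kHz, 16-bit mono PCM), so a 20 ms frame is exactly 640 bytes. Here is a small hypothetical helper, not part of Jigasi, that slices a raw PCM buffer into such frames:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Slices raw PCM audio into 20 ms frames suitable for
// giveNextAudioFrame(). Assumes 16 kHz, 16-bit mono, matching the
// RecognitionConfig above (illustrative helper, not part of Jigasi).
class FrameSlicer
{
    // 16,000 samples/s * 2 bytes/sample * 0.020 s = 640 bytes per frame
    static final int FRAME_SIZE = 640;

    static List<byte[]> slice(byte[] pcm)
    {
        List<byte[]> frames = new ArrayList<>();
        // Any trailing partial frame is dropped for simplicity
        for (int offset = 0; offset + FRAME_SIZE <= pcm.length; offset += FRAME_SIZE)
        {
            frames.add(Arrays.copyOfRange(pcm, offset, offset + FRAME_SIZE));
        }
        return frames;
    }
}
```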