Hitchhiking the HoloToolkit-Unity, Leg 4 – Text To Speech

NB: The usual blog disclaimer for this site applies to posts around HoloLens. I am not on the HoloLens team. I have no details on HoloLens other than what is on the public web and so what I post here is just from my own experience experimenting with pieces that are publicly available and you should always check out the official developer site for the product documentation.

Voice is really important as both an input and output mechanism on the HoloLens and there’s a great section in the documentation over here on working with it.

And you can see the code for the TextToSpeechManager over here on GitHub.

There’s one caveat though, and that’s that the model employed by the Text to Speech manager code is one of ‘fire and forget’. That means my code, which tries to have 3 distinct pieces of speech spoken one after another, doesn’t really work, and what I usually hear when I run that code is a spoken output of;

“Three”

because the third call overruns the second call which overruns the first call.

I wanted to see if I could tweak that a little, so I wrote some exploratory code based on what I saw in a slightly earlier check-in of the Text to Speech manager. My main change was to alter the implementation so that calls to SpeakText or SpeakSsml are effectively queued, such that a second call executes only after the first one has finished playing.

I initially thought that this would be pretty easy but it turned out that the underlying Unity AudioSource object that underpins this code doesn’t really seem to have a great way of letting you know when the audio has stopped playing. It seems that the options are to either;

Poll the isPlaying flag on the AudioSource until it reports that playback has stopped.

Call the Play() method or the PlayScheduled() method with some kind of delay so as to delay a particular piece of speech until the pieces of speech that have gone before it have already finished playing.

There may be other/better mechanisms but that’s all I found by doing a search around the web.
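To make the "delay" option a little more concrete, here is a minimal sketch of the arithmetic it would involve. It assumes the length of every clip is known up front, which is not a given when the speech is generated asynchronously. The class and method names are hypothetical; in Unity, the resulting values are the kind of absolute times that AudioSource.PlayScheduled expects on the AudioSettings.dspTime clock.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the HoloToolkit or of Unity.
public static class SpeechScheduleSketch
{
    // Returns one start time per clip so that each clip begins
    // exactly when the previous one ends.
    public static List<double> StartTimes(double firstStart, IList<double> clipLengthsInSeconds)
    {
        var starts = new List<double>();
        double next = firstStart;

        foreach (double length in clipLengthsInSeconds)
        {
            starts.Add(next);   // this clip starts when the previous one finished
            next += length;     // the following clip starts once this one is done
        }
        return starts;
    }
}
```

For example, with a first start time of 10.0 and clip lengths of 1.5, 2.0 and 0.5 seconds, the clips would be scheduled at 10.0, 11.5 and 13.5.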

The existing TextToSpeechManager code already does work (from line 239 onwards, in the PlaySpeech function) to move most of the work of generating speech from text via the UWP SpeechSynthesizer APIs into a separate task. As the comments around line 297 of PlaySpeech say, though, the actual playback of the audio has to happen on the main Unity thread, although I don’t believe that the call to Play is a blocking call that would halt that thread.

I wanted to leave as much of this code in place as possible while adding my extra pieces to queue up speech, rather than always trying to play it even if the AudioSource is already busy, which is what the existing code seems to do. It felt like doing that would need some kind of queuing mechanism which took into account;

Needing to be able to deal with the idea that the production of the speech itself is done asynchronously.

Needing to be able to poll the isPlaying flag at some frequency once the speech playback had started in order to determine when it has completed.

Update? Coroutines? Tasks?

At this point, I could see a few different ways in which I might be able to implement this functionality with Unity and I figured that I could maybe;

Do some work from a call to Update() to poll to see if any current speech had finished playing.

Use Unity’s Coroutines in order to see if I could poll the speech status from there.

Try and wrap up something that used a TaskCompletionSource and which presented the polling as something that could be awaited in C#.

The last one is perhaps the most elegant but in the end I went with using the InvokeRepeating method to schedule a check to run ‘every so often’. There are probably better ways of doing this but it’s all part of learning.
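For comparison, the TaskCompletionSource idea might look something like the sketch below. This is not what I ended up using and the PolledCompletion class is hypothetical. It assumes that something on the main thread (Update(), a coroutine or an InvokeRepeating callback) calls Poll() regularly, and that the isFinished delegate wraps a check such as "the AudioSource is no longer playing".

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical wrapper that turns a polled condition into an awaitable Task.
public sealed class PolledCompletion
{
    private readonly Func<bool> isFinished;
    private readonly TaskCompletionSource<bool> completion =
        new TaskCompletionSource<bool>();

    public PolledCompletion(Func<bool> isFinished)
    {
        if (isFinished == null) throw new ArgumentNullException("isFinished");
        this.isFinished = isFinished;
    }

    // Callers can await this; it completes on the first Poll() that sees isFinished() return true.
    public Task Task
    {
        get { return completion.Task; }
    }

    // Intended to be called periodically from whichever thread is allowed
    // to evaluate the condition (the main thread, in Unity's case).
    public void Poll()
    {
        if (!completion.Task.IsCompleted && isFinished())
        {
            completion.TrySetResult(true);
        }
    }
}
```

The polling does not go away with this approach; it just gets hidden behind a Task so that the calling code can be written with await.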

In order to get something going, I took the existing code and did a few things.

1 – Refactoring into a UnityAudioHelper

I took some of the code from the existing TextToSpeechManager and refactored it into an ‘audio helper’ class, UnityAudioHelper. Largely, this is just the original code moved into its own static class.

2 – Add a ‘Queue Worker’

I added my own abstract base class which tries to encapsulate the idea of a queue of work to be processed, where items are taken off the queue and worked upon in sequence. This is quite a generic problem to solve and you can get into aspects of multi-threading and so on, which I’ve avoided here. This little class isn’t as generic as it could be because I’m bending it to my specific requirements, in that;

A queue is polled periodically rather than (e.g.) signalling some kind of synchronization object when work is available. This isn’t how I’d perhaps usually write this sort of class but I have a specific requirement to poll the AudioSource plus Unity’s model is quite amenable to polling.

An item of work is de-queued.

The item of work is executed and it is assumed that the item will take steps to avoid blocking.

The completion of the item of work is determined by polling some method to check a ‘completed’ status.
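The four points above can be sketched without any Unity dependency, as below. This is an illustration rather than my actual class: the names AddWorkItem, DoWorkItem and WorkIsInProgress follow the ones used in this post, but Tick() and the FakeSpeechQueue subclass are hypothetical stand-ins. In Unity, something like InvokeRepeating would call the equivalent of Tick() periodically, and the Playing flag would be replaced by a check on the AudioSource.

```csharp
using System;
using System.Collections.Generic;

// Engine-free sketch of a polled work queue (an assumption, not the original code).
public abstract class IntervalWorkQueueSketch
{
    private readonly Queue<object> pending = new Queue<object>();
    private bool busy;

    public void AddWorkItem(object item)
    {
        pending.Enqueue(item);
    }

    // Hypothetical stand-in for the periodic callback.
    public void Tick()
    {
        if (busy)
        {
            // Completion is detected by polling, not by an event.
            if (WorkIsInProgress) return;
            busy = false;
        }
        if (pending.Count > 0)
        {
            busy = true;
            DoWorkItem(pending.Dequeue()); // expected to return without blocking
        }
    }

    protected abstract void DoWorkItem(object item);
    protected abstract bool WorkIsInProgress { get; }
}

// Hypothetical subclass that only records what it was asked to speak.
public sealed class FakeSpeechQueue : IntervalWorkQueueSketch
{
    public readonly List<string> Started = new List<string>();
    public bool Playing; // stand-in for the AudioSource isPlaying flag

    protected override void DoWorkItem(object item)
    {
        Started.Add((string)item);
        Playing = true;
    }

    protected override bool WorkIsInProgress
    {
        get { return Playing; }
    }
}
```

With two items queued, the first Tick() starts the first item, further ticks do nothing while Playing is true, and the second item only starts on the first Tick() after Playing has been set back to false.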

3 – Derive a Text to Speech Manager

I derive a new variant of the original TextToSpeechManager from my new IntervalWorkQueue, and this class makes calls out to the refactored UnityAudioHelper. The main ‘features’ here are that the Start() method calls base.Start() in order to get the interval work queue up and running, and that DoWorkItem and WorkIsInProgress have been overridden to call into the original code, whereas the original PlaySpeech method has been reworked to simply call base.AddWorkItem to add an entry onto the queue.