An easy way to implement context dependence would be to simply deactivate commands when they are not applicable. However, by dynamically creating speech models tailored to the current situation, the recognition rate can be improved considerably.

But creating context-dependent speech models leads to a problem: building models is very time-consuming. As the context usually changes very often, switching between speech models has to be fast.

To compensate, Adam developed a simple caching solution for Simond.
While it worked okay for most use cases, it was a bit buggy and the design had some issues. Because of that, it would have been very hard to switch the model compilation backend (e.g. to swap HTK for SPHINX).

So during the recent refactoring I also rewrote the context adaption and caching system in Simond.

"So isn't this, like, really easy?"

The premise seems quite straightforward: whenever the situation changes, try to find the new situation in a cache. If found, use the cached model; if not, build a new one and add it to the cache.
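That naive premise can be sketched in a few lines. This is purely illustrative Python (the names `get_model` and `compile_model` are mine, not Simond's), but it captures the basic cache-on-miss idea:

```python
# Minimal sketch of the naive approach: a cache mapping situations to
# compiled models. A "situation" is modeled here as the set of active
# scenarios; compile_model stands in for the expensive HTK build.

def get_model(situation, cache, compile_model):
    """Return a cached model for the situation, compiling one on a miss."""
    key = frozenset(situation)          # order of active scenarios is irrelevant
    if key not in cache:
        cache[key] = compile_model(key)  # expensive: can take minutes
    return cache[key]
```

The same set of active scenarios, in any order, hits the same cache entry, so the costly compilation runs only once per distinct situation.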

However, it's not quite that simple: input files may change very often, and for many of those changes it's absolutely predictable that the resulting model won't change. Architecturally speaking, recognizing this depends on the model creation backend (in this case HTK), so an independent caching system can't really identify those situations.

The input files may even change during the model creation process.
An example: Someone with a user-generated model has two transcriptions for a single word but training samples for only one of them. Because the training data is transcribed on a word level, this can only be identified during model creation. If a (tri)phone of the alternate (unused) transcription is now undefined (untrained), it needs to be removed from the training corpus. Associated grammar structures might now be invalid, etc. Again, this would mean that the caching system has to be integrated with the model creation backend.

But moving the model caching system to the backend isn't a nice solution either, as that would mean each backend would need to implement its own cache.

"Oh..."

So to enable sensible caching with multiple backends, I ended up with a slightly more complicated, two-layered approach:

Model input files are assigned a unique fingerprint. Source files with the same fingerprint are guaranteed to produce the same speech model. The fingerprint is calculated by the model creation backend; this way, the calculation can take into account only those parts of the input files that will have an effect on the produced speech model. In practice this means, for example, that changing command triggers or adding a grammar sentence with no associated words will produce the same fingerprint and therefore not trigger the costly re-creation of all associated models.
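The idea can be sketched as follows. The field names (`words`, `grammar`, `trained_samples`, `trigger`) are hypothetical stand-ins for whatever the real backend considers model-relevant; the point is that only those fields feed the hash:

```python
import hashlib
import json

# Sketch of backend-side fingerprinting: only the parts of a scenario
# that influence the compiled model are hashed, so changing e.g. a
# command trigger leaves the fingerprint (and the cached model) intact.
# Field names are illustrative, not Simond's actual file format.

MODEL_RELEVANT = ("words", "grammar", "trained_samples")

def fingerprint(scenario: dict) -> str:
    relevant = {k: scenario.get(k) for k in MODEL_RELEVANT}
    blob = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()
```

Two scenarios that differ only in a command trigger would hash to the same fingerprint, while adding a word changes it.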

The current context is represented through "situations". The cache contains an association between situations and the fingerprint they will provoke. Multiple situations might share the same fingerprint (the same speech model). Once a cached model has no situations assigned to it, it will be removed from the cache.
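Put together, the two layers look something like this sketch (illustrative data structures, not Simond's): situations point at fingerprints, and a model whose fingerprint is no longer referenced by any situation gets evicted.

```python
# Two-layered cache sketch: layer one maps situations to fingerprints,
# layer two maps fingerprints to compiled models. Several situations
# may share one fingerprint; orphaned models are evicted.

class ModelCache:
    def __init__(self):
        self.situation_to_fp = {}   # situation -> fingerprint
        self.models = {}            # fingerprint -> compiled model

    def assign(self, situation, fp, model):
        self.situation_to_fp[situation] = fp
        self.models.setdefault(fp, model)
        self._evict_orphans()

    def drop_situation(self, situation):
        self.situation_to_fp.pop(situation, None)
        self._evict_orphans()

    def _evict_orphans(self):
        live = set(self.situation_to_fp.values())
        for fp in list(self.models):
            if fp not in live:
                del self.models[fp]
```

As long as at least one situation still maps to a fingerprint, the associated model survives; once the last one is dropped, the model is removed.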

The resulting workflow looks something like this:

To ensure maximum responsiveness, Simond will try to update cached models when the associated input files change. So if you have three situations for your model and add some training data, all three models will be re-evaluated in a separate thread.

The model creation itself uses a thread pool to take advantage of multi-core systems and actually scales very well.
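As a rough sketch of that pattern (again illustrative, not Simond's C++ implementation), rebuilding all cached models affected by a change can be fanned out over a pool of worker threads:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: when input files change, rebuild every affected cached model
# in parallel. compile_model stands in for the (minutes-long) HTK
# compilation; max_workers defaults to a size based on the CPU count.

def rebuild_all(situations, compile_model, workers=None):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {s: pool.submit(compile_model, s) for s in situations}
        return {s: f.result() for s, f in futures.items()}
```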

Still, the model creation process can take minutes if you have a lot of training data - even on a decent CPU.

"But what about entirely new situations?"

Creating and maintaining a model cache for all possible situations wouldn't be feasible as the cache would of course grow exponentially with the number of conditions to consider.

To avoid having to wait for the creation of a model for the new situation, the context system was designed to create and maintain the most permissive model available as a fallback.

Let's consider an example: Suppose you have a setup with three scenarios - Firefox, Window management, Amarok - and you configure Simon to activate the Firefox and Amarok scenarios only when the respective applications are running.
The created fallback model would have all three scenarios activated.
Suppose you open and close Firefox quite frequently so those two situations are covered with an up-to-date model. You are currently in the situation that both Firefox and Amarok are closed. Again, there's a model for that. Then you open Amarok for the first time: The correct model would have a disabled Firefox scenario and an activated Amarok scenario.
As the requested model is not available, Simond will now start to compile it. In the meantime, Simond will switch to the fallback model: the one with all scenarios (Firefox, Amarok and the Window Management scenario) activated.
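The selection logic from the example boils down to something like this sketch (`all_active` being the key of the most permissive model; names are mine):

```python
# Sketch of model selection with a permissive fallback: on a cache
# miss, schedule a background build of the exact model and serve the
# "everything enabled" model in the meantime.

def pick_model(requested, cache, all_active, schedule_build):
    if requested in cache:
        return cache[requested]
    schedule_build(requested)      # compile the targeted model in the background
    return cache[all_active]       # permissive fallback, kept (almost) always ready
```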

When picking a model to build, the fallback model is given higher priority to ensure that it's (almost) always available.

By the way: Simond sends the compiled speech model back to Simon during synchronization. This is done both to shorten the time it takes for the recognition to start in a multi-server environment (think of mobile clients) and to ensure the last successfully compiled model is available in case the current input files cannot be compiled and the client is connected to a "fresh" server. Of course, only the fallback model is synchronized to keep the network overhead low.

"But what about ambiguous commands?"

There might be setups where commands have different meanings depending on the context. For example, "Find" might issue "Ctrl+F" in LibreOffice but open Google when issued while browsing the web.

To avoid situations of undefined behavior while the targeted model is compiling, deactivated scenarios are not only removed from the speech model on the server side, but their commands are also disabled on the client side.

That means the only drawback of the more permissive model is a lower recognition rate for the time it takes Simond to create the new model - ambiguous commands will still be handled correctly.

As soon as the more targeted model is finished building, the recognition will switch automatically.

"Isn't this post getting too long?"

Yes, definitely.

So to sum up: Simon 0.4 will feature a sophisticated model caching and context adaption mechanism.

Tuesday, May 1, 2012

One of the simultaneously most important and challenging tasks for me has always been to keep Simon usable for the "average" user.

Yes, reading the manual is sometimes required, but I still feel comfortable saying that users don't need in-depth knowledge about speech recognition to build their own speech models with Simon - and that's something we've always been proud of.

However, the initial learning curve is undoubtedly a bit steep. So let's look at the interface that so often left new users baffled.

Analyzing Simon's Interface

After the initial first-run wizard (which sadly many new users seem to skip entirely), the following was the first screen shown to new users.

While very pretty (thanks to the Oxygen team), it only provided links to resources where users could find further help. The interface afforded absolutely no interaction pattern and left users stranded.

Simon 0.3.75: Main Screen

After a bit of looking around, the user would probably notice the "Wordlist", "Grammar", etc. tabs containing the components of the currently loaded scenario.
However, even if the user loaded scenarios in the first-run wizard, all those tabs would be completely empty. That's because the user is looking at the "Standard" scenario - an empty default scenario. To change this, users were supposed to use the unlabeled drop-down in the toolbar.

The reason for this weird interaction pattern was mainly that scenarios are a recent addition to Simon: they were only introduced in Simon 0.3, and while a huge amount of internal refactoring came with that, the UI always felt a bit "tacked on".

So during the last month I re-evaluated parts of Simon's interface to make it more intuitive for new users.

First of all, I identified some principles I wanted to convey to the user and then designed the new interface around them:

Scenarios are opaque. Users can of course edit them if they want but the average user will probably never touch their components. In any case there is a strict hierarchy that must be maintained at all times: Scenario A (containing Components A), Scenario B (containing Components B), etc.

Base models are the easiest way to get started. If setting up Simon to use a static base model requires users to search for an archive on a wiki, download and extract it, and point Simon to individual files with cryptic names like "hmmdefs" or "tiedlist", then the interface has clearly failed. It must be easy and intuitive for users to create, share and use base models.

Around half of all recognition problems are microphone-related. For the voice activity detection (the part of Simon that separates "speech" from "silence") to work, the volume must be set correctly. Especially with ALSA forgetting volume levels, this is often a source of problems whose only symptom is that the recognition simply doesn't work.
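To illustrate why a misadjusted volume silently breaks everything: an energy-based check in the spirit of such voice activity detection might look like the sketch below (thresholds and function name are my own assumptions, not Simon's actual calibration code). If the peak level of a "loud" calibration clip never clears the speech threshold, the VAD classifies everything as silence.

```python
# Illustrative peak-level check for microphone calibration: samples are
# assumed normalized to [-1, 1]. Too quiet and speech is never detected;
# clipping distorts the audio and hurts recognition.

def volume_ok(samples, speech_threshold=0.2, clip_threshold=0.99):
    peak = max(abs(s) for s in samples)
    if peak < speech_threshold:
        return "too quiet"
    if peak >= clip_threshold:
        return "clipping"
    return "ok"
```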

Obviously, the interface needed a major revamp. So over the last month I have been working on and off on some tweaks for what will become Simon 0.4.

The Result

The screenshot below shows the new Simon main screen.

Simon 0.3.80: Welcome Screen

But let's look at the changes individually.

Scenarios

There is now a prominent list of your currently used scenarios in the main screen.

The tabs showing the components of the scenario are gone and have been replaced with a little "Open <scenario name>" button.

Clicking it opens the scenario for editing. While in "edit mode", the overview is hidden. The "Back to overview" bar drops down with a smooth animation to draw the user's attention.

Simon 0.3.80: Wordlist

Training

Next to the scenario list, Simon's main screen now also shows a list of all available training texts of the loaded scenarios. Clicking "Start Training" will start the standard training wizard without opening the "edit mode" of the scenario.

Selecting a training text on the right also selects the scenario it belongs to on the left. This is done both as a visualization of which scenario will benefit the most from the training and as a matter of convenience: if the user wants to remove or add another related training text (which would mean "opening" the scenario), the correct scenario is already selected.

Speech models

Speech models are now packaged into .sbm files ("Simon Base Model"). The package contains all the required model files as well as some metadata (name, model type and build date).
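Conceptually, building such a container is just bundling the model files with a metadata entry. The sketch below assumes a plain gzipped tar archive with a JSON metadata file; the actual .sbm layout is not documented here and may well differ:

```python
import io
import json
import os
import tarfile
import time

# Illustrative packaging of HTK model files plus metadata into one
# container, in the spirit of the .sbm format described above.

def pack_model(path, model_files, name, model_type):
    meta = {"name": name, "type": model_type,
            "build_date": time.strftime("%Y-%m-%d")}
    with tarfile.open(path, "w:gz") as tar:
        data = json.dumps(meta).encode("utf-8")
        info = tarfile.TarInfo("metadata.json")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))     # metadata entry
        for f in model_files:
            tar.add(f, arcname=os.path.basename(f))  # e.g. hmmdefs, tiedlist
```

A single file like this is trivially shareable, which is exactly what the cryptic loose-file setup of earlier versions made so painful.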

The welcome page shows information about the active model and, if available, the used base model.

Simon 0.3.80: Base Model Settings

The base model settings page provides a way to create new .sbm files from HTK model files ("Create from model files"). The currently used active model can be exported as an .sbm container to share or archive created models.

Additionally, I've already put in a request to add a new category to kde-files.org and am planning to enable speech model sharing through GHNS.

This package abstraction was also a big step towards supporting other backends next to HTK / Julius but I'll elaborate on that in a different blog post.

Recognition

Last but not least, the Simon main screen now permanently displays the current microphone volume.

The volume calibration widget has been improved to integrate the voice activity parameters and will now no longer require the user to tell it that the volume has been adjusted.

Simon 0.3.80: No applicable command for recognition result

The last recognized command is also displayed. If the command didn't trigger any action, Simon will now display a small note next to the recognized sentence to help scenario developers to track down problems.

Final Words

I am not a usability expert by any means. Having spent so much time with the interface, I wouldn't have noticed a lot of the issues had it not been for the valuable feedback from the community. I especially want to thank Frederik Gladhorn and Bjoern Balzaks for their input.

The interface is of course still far from perfect. However, I'm quite happy about how the recent refactoring has turned out and am looking forward to more improvements in the future.