Research: Human Speech Recognition

One of the main demands in software development today is a higher level of interactivity and personalisation in applications. Products from the leading market players, such as Siri, Google Now, Samsung S Voice and others, are setting new standards for user experience and human-computer interaction. This approach has a wide range of uses in mobile and automotive development, smart home technologies, the Internet of Things (IoT) and so on.
The core technology behind such an experience is high-quality human speech recognition.

In the course of our work, we needed to integrate fully functional voice control into a mobile application, building on existing solutions in this field.
While analyzing and integrating this functionality, we identified the following main use cases, which voice control must cover fully:
Case 1. Recognition of standard words or short phrases used as a set of basic commands for an application, for example Yes, No, Cancel, Repeat, Return, Next and so on.
Case 2. Recognition of words and phrases with unconventional spelling, such as personal names, names of apps, addresses and so on.
Case 3. Recognition of live speech (phrases, sentences) to create notes, letters and messages or to run searches.

Google Recognizer

At first we tried to use the Google Voice Recognition engine (Google Recognizer) to cover all of the use cases above.
Google Recognizer can be used on any Android device without integrating additional libraries or solutions.

Google Recognizer is very well documented and is one of the best at recognizing live speech, thanks to built-in semantic selection of words for phrases and sentences according to their meaning and combinations. It currently has no real competition in the number of languages it supports.

However, while integrating and using Google Recognizer we faced several problems:

Google Recognizer has low reliability with short words such as Yes, No or Ok and often returns empty results, mainly because it was designed to recognize comparatively long, meaningful phrases and sentences (Use case 1).

Google Recognizer has an offline mode, but it uses a very limited dictionary that covers only simple, widely used words and doesn’t recognize live speech (Use case 3). The quality and reliability of the offline mode are not good enough, which makes applications built around Google Recognizer critically dependent on the user having a stable internet connection.

Google Recognizer doesn’t let you define your own set of expected words, such as basic commands, so it may return any phonetically similar word instead of the expected one (Use cases 1 and 2).

Proper names, especially rare ones or ones foreign to the current localization, are hardly recognized at all (Use case 2). For example, names of Asian or Hispanic origin may cause trouble in the US, as may French names in Canada.

In online mode with a weak internet connection, Google Recognizer can’t run in parallel with offline mode, which drastically reduces recognition quality. This restriction can be critical on mobile devices.

Even with a stable internet connection, periodic failures of the Google Recognizer service caused our application to lose nearly all of its functionality.

Google Recognizer can’t be run on any mobile platform other than Android.

We tried to adapt Google Recognizer to our needs. For example, to satisfy Use cases 1 and 2 we post-processed its results with phonetic and string-similarity methods, such as Double Metaphone combined with Levenshtein edit distance. Thanks to these methods, we achieved high reliability in situations where the recognizer returned phonetically similar words, for example “Cold, cool, all, coll” instead of the expected “Call”.
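The post-processing idea can be sketched roughly as follows. This is only an illustration of the edit-distance step: the Double Metaphone encoding is not reproduced here, and the function names, the command list and the `max_distance` threshold are our assumptions for the example, not the code used in the application.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def snap_to_command(heard: str, commands: list[str], max_distance: int = 2):
    """Return the closest expected command, or None if nothing is close enough.

    max_distance is a hypothetical threshold chosen for this sketch.
    """
    if not commands:
        return None
    heard = heard.strip().lower()
    best = min(commands, key=lambda c: levenshtein(heard, c.lower()))
    return best if levenshtein(heard, best.lower()) <= max_distance else None
```

With a command list such as `["call", "cancel", "next", "yes", "no"]`, each of the misrecognized results from the example above ("cold", "cool", "all", "coll") is within two edits of "call" and is mapped to it, while an unrelated word is rejected.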
An additional workaround for Use case 1 was to let users easily expand their vocabulary. For example, if a user wanted to use “Ok, sure, of course, positive” as synonyms of “Yes” to confirm an action, they could add these words using Google Recognizer and then use them as alternative confirmation commands in the application.
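A user vocabulary of this kind can be as simple as a lookup from spoken phrases to canonical commands. The sketch below is a hypothetical illustration; the class and method names are ours and do not come from any recognizer SDK.

```python
class CommandVocabulary:
    """Maps recognized phrases (including user-added synonyms) to commands."""

    def __init__(self, base_commands):
        # Every base command initially maps to itself.
        self._canonical = {c.lower(): c.lower() for c in base_commands}

    def add_synonym(self, phrase: str, command: str) -> None:
        """Register a user-defined phrase as an alternative for a command."""
        command = command.lower()
        if command not in self._canonical.values():
            raise ValueError(f"unknown command: {command}")
        self._canonical[phrase.strip().lower()] = command

    def resolve(self, heard: str):
        """Return the canonical command for a recognized phrase, or None."""
        return self._canonical.get(heard.strip().lower())
```

After adding "ok", "sure", "of course" and "positive" as synonyms of "yes", any of them resolves to the same confirmation command.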
However, the limitations of Google Recognizer listed above forced us to look for other solutions for our application.

We looked at the best-known alternatives, PocketSphinx and Nuance Mobile Toolkit (NMT):

Nuance Mobile Toolkit is also widely known as the Dragon Mobile SDK. Nuance offers a whole set of solutions for different circumstances and fields. We focused on several offline (Embedded, Vocon) and online (Cloud, Hybrid) solutions.

Embedded

The strengths of this recognizer are a fully functional offline mode with its own large vocabulary and quite good recognition of live speech in an office environment (relatively quiet, with only a monotone hum of voices as ambient noise).
Its weaknesses turned out to be slow request processing (initialization, processing and return of results), instability caused by problems in the native library, which we could not access, and a strong dependence of recognition quality on outside noise, so it proved ineffective amid street or car noise.
An additional drawback is that the recognizer’s library and vocabularies add 100-200 MB (depending on vocabulary size) to the application install package.

Vocon recognizer offline

The main characteristic of this recognizer is the ability to define your own grammar and vocabulary containing all fixed words, such as commands, lists of names and so on. At each step of every usage scenario in your application, Vocon can dynamically load the appropriate grammar, provided it has been prepared beforehand. This makes recognition of basic words very reliable, so we had no need to post-process the recognizer’s results.
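The idea of switching grammars per scenario step can be modelled in a few lines. This sketch does not use the Vocon API or its grammar format; the step names, word lists and the `StepGrammar` class are hypothetical and only show why a small active grammar keeps results predictable.

```python
# Hypothetical grammars, one per scenario step, prepared ahead of time.
GRAMMARS = {
    "main_menu": {"call", "message", "navigate", "cancel"},
    "confirm": {"yes", "no", "repeat", "cancel"},
    "choose_contact": {"john", "maria", "cancel"},
}


class StepGrammar:
    """Keeps exactly one grammar active and accepts only words it contains."""

    def __init__(self, grammars, start):
        self._grammars = grammars
        self.load(start)

    def load(self, step: str) -> None:
        """Activate the grammar prepared for the given scenario step."""
        if step not in self._grammars:
            raise KeyError(f"no grammar prepared for step: {step}")
        self.step = step
        self.active = self._grammars[step]

    def accept(self, heard: str):
        """Return the word if the active grammar allows it, otherwise None."""
        word = heard.strip().lower()
        return word if word in self.active else None
```

In this model "yes" is rejected on the main menu but accepted once the "confirm" grammar is loaded, which is the behaviour that makes per-step grammars reliable for fixed commands.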

The most important task for a developer implementing this solution is to carefully describe every possible scenario and user step in the Vocon grammar, since the reliability and quality of recognition depend on it.
Additional strengths of Vocon are its fully offline operation, high speed and low dependence of recognition quality on outside noise, which lets applications be used in the street and on public transport as well as in the office.

Because of this design, Vocon can’t be used to recognize live speech: it requires all expected grammar to be defined beforehand.
The inability to create user vocabularies is another weakness of Vocon. All grammars must be compiled on the developer’s side using Apache Ant before being used in an application, and they can’t be recompiled on the mobile device itself.
A relative weakness of Vocon is the lack of detailed documentation on integrating it into mobile applications. We had to solve many basic questions by trial and error.

Cloud

The Cloud recognizer from NMT is similar to the Google online recognizer in that speech is processed on the server side. Its main purpose is to recognize live speech that can’t be predetermined. With a strong internet connection, recognition quality is quite good, sometimes even better than Google Recognizer’s. Unlike Google Recognizer, Cloud is much better at detecting pauses between words and the end of speech, so words are cut off at the end of a phrase much less often (a common problem in Google Recognizer), and long texts consisting of several phrases and sentences are processed with higher quality.
When using the Cloud recognizer for Cases 1 and 2, you can easily apply the same phonetic processing methods that worked with Google Recognizer.
When comparing Cloud to Google Recognizer, note that because Google’s search engine is the most widely used in the world, Google Recognizer is much more reliable at processing names of places, establishments, points of interest and so on, while the Cloud recognizer mostly covers common vocabulary.
This difference can be decisive if voice recognition is used in a navigation application.

Vocon Hybrid

The main idea of the Hybrid recognizer is to combine the strengths of a predefined grammar, as in offline Vocon, with an online part for speech outside that grammar, as in Cloud. This would let the user mix commands, names and free text in live speech, using shortcut scenarios such as “Write an SMS to John saying ‘Hello! How are you?’” without having to go through the scenario step by step.
Sadly, we were unable to get the Hybrid recognizer working in our application and fully evaluate this combination, because the online functionality did not work. The offline part works exactly like the Vocon recognizer described above.
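To make the shortcut scenario concrete, here is a rough sketch of how an already transcribed utterance could be split into a fixed command, a name from a known list and a free-text slot. It works on plain text with a regular expression and has nothing to do with how the Hybrid recognizer does this internally; the pattern, the contact list and the function name are assumptions for illustration.

```python
import re

# Hypothetical contact list; in an app this would come from the address book.
CONTACTS = {"john", "maria"}

# Fixed command part, a single-word contact slot, then free text.
SMS_SHORTCUT = re.compile(
    r"^write an sms to (?P<contact>\w+) saying (?P<text>.+)$",
    re.IGNORECASE,
)


def parse_sms_shortcut(utterance: str, contacts=CONTACTS):
    """Split a one-shot SMS command into its parts, or return None."""
    match = SMS_SHORTCUT.match(utterance.strip())
    if match is None:
        return None
    contact = match.group("contact").lower()
    if contact not in contacts:
        return None  # the name slot must come from the predefined list
    return {
        "action": "sms",
        "contact": contact,
        "text": match.group("text").strip(),
    }
```

The command and the contact name are matched against fixed, predefined material, while the message body is passed through as free text, mirroring the split between the grammar-based and the free-speech parts.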

PocketSphinx

PocketSphinx Speech Recognition Toolkit is an open-source solution for mobile devices from the CMUSphinx project, well known among Unix developers.
For us, the main drawbacks of PocketSphinx were the quality and reliability of recognition, where it loses to both Google Recognizer and the NMT recognizers. Other weaknesses are the rather large size of the library and the fact that we could not define our own grammar.
A strength of this solution is its ability to operate in constant listening mode (wake-up mode), so you can implement a feature that launches your application anywhere and in any phone state with a voice command, as with “OK Google” and “Hey Siri”.

Summarizing our research, the most appropriate solution for us turned out to be a combination of offline Vocon and online services. Vocon is used for basic commands, in-app navigation, contact names, names of installed applications and a predetermined list of points of interest, which lets the app retain its main functionality even without an internet connection. Google and Cloud are used in scenarios that require live speech recognition, for example dictating notes, messages, mail, locations and so on.
In the early stages PocketSphinx was used to wake the application with a voice command, but it was later replaced with Vocon, which turned out to provide this functionality as well, without the need to keep a redundant library that also takes up considerable space in the installation package.
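The resulting division of labour can be summarized as a small routing function. This is a simplified sketch, not the application code: the task names are hypothetical, and the rule for choosing between Google and Cloud (Google on Android, Cloud elsewhere) is our assumption for the example, since the text only says that both are used for live speech.

```python
from enum import Enum


class Engine(Enum):
    VOCON = "vocon"
    GOOGLE = "google"
    CLOUD = "cloud"


# Hypothetical task labels for the two kinds of scenarios described above.
FIXED_GRAMMAR_TASKS = {"command", "navigation", "contact_name",
                       "app_name", "poi", "wake_word"}
FREE_SPEECH_TASKS = {"note", "message", "mail", "location"}


def choose_engine(task: str, online: bool, on_android: bool = True):
    """Pick a recognizer for a task; None means the task is unavailable."""
    if task in FIXED_GRAMMAR_TASKS:
        return Engine.VOCON  # works fully offline
    if task in FREE_SPEECH_TASKS:
        if not online:
            return None  # live speech needs one of the online services
        # Assumption: prefer Google where it exists, since it is Android-only.
        return Engine.GOOGLE if on_android else Engine.CLOUD
    raise ValueError(f"unknown task: {task}")
```

With this split, everything in the fixed-grammar set keeps working without a connection, and only dictation-style tasks depend on the network.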

Overall, the IT industry is doing serious work to improve voice recognition technologies. Current solutions can be slow, limited and difficult to integrate, and they don’t show everything such functionality can be used for, but the time is not far off when we will be able to realize even the boldest ideas, which not so long ago lived only in the imagination of sci-fi authors.

Voice recognizers comparison table:

| Capability / Recognizer | Google Recognizer | NMT Vocon Recognizer | NMT Embedded Dragon Recognizer (EDR) | NMT Cloud | NMT Hybrid | PocketSphinx |
| --- | --- | --- | --- | --- | --- | --- |
| Offline mode | + * | + *** | + ** | – | – | + ** |
| Command words | + (with handling by phonetic methods) ** | + *** | + (with handling by phonetic methods) * | + (with handling by phonetic methods) * | + *** | + (with handling by phonetic methods) * |
| Contact name | + (with handling by phonetic methods) ** | + *** | have not tried | have not tried | + *** | have not tried |
| Live free speech | + *** | – | + ** | + *** | have not tried | + * |
| Noise suppression | + *** | + *** | – | + *** | + *** | have not tested |
| Wake up mode | – | + *** | – | – | – | + ** |
| Installation package size | – | minimal | + | – | minimal | + |

Note: the number of stars indicates the quality level of the capability in the corresponding solution.