Thanks for the Mozilla DeepSpeech project! Great open source contribution.

I’m getting long strings of words with no spaces. Example:

split the cape handler out from the sir hanler and make or on new hanlerswas not moneliticamandthenduconthingswerencaerulizationotenopuing paws on that until i its a product signalmyeanwhatwassomebiokarsthatyouwerefocusedtomtheturmeting projection so at

My audio is a WAV file: 16 kHz, 16-bit, mono, with the PCM S16 LE codec. I’m using the default Python client to test things. The audio was recorded cleanly via a Mac OS X laptop’s microphone.
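
For anyone who wants to double-check the same properties on their own files, here is a minimal sketch using only Python's standard-library `wave` module. The helper names `describe_wav` and `is_deepspeech_ready` are made up for illustration and are not part of DeepSpeech; the expected values (mono, 2 bytes per sample, 16000 Hz) come from the format described above.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate_hz, compression) for a WAV file."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels(),
            wav.getsampwidth(),   # bytes per sample: 2 means 16-bit
            wav.getframerate(),   # samples per second
            wav.getcomptype(),    # "NONE" for uncompressed PCM
        )


def is_deepspeech_ready(path):
    """Hypothetical helper: True if the file is 16-bit, 16 kHz, mono PCM."""
    channels, width, rate, compression = describe_wav(path)
    return channels == 1 and width == 2 and rate == 16000 and compression == "NONE"
```

Calling `is_deepspeech_ready("my_recording.wav")` (the file name is a placeholder) before running inference rules out a format mismatch as the cause of bad output.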

I’ve seen this with other audio samples I’ve tried. I looked through the Mozilla DeepSpeech GitHub issues and didn’t see others reporting this. Is this a known issue? Are there any known workarounds (different audio setup, etc.)?

It might also just be the result of training vs. real-world usage. We know that non-native speakers of American English get worse results (myself included) because of the training dataset. Hopefully, when training includes a broader range of accents, it will get better. If you can, record clear audio clips of 5 to 10 seconds (microphones sometimes produce strange artifacts too), and make sure you try with and without the language model.

I’d like to add some custom words to the language model to see if that helps the garbled words issue, but I can’t regenerate it since the real vocab.txt is not available. Is there any way I can privately get it to aid debugging and testing of this issue?

The text used to train the language model was/is a combination of texts from the Fisher, Switchboard, and other corpora. As Fisher + Switchboard are licensed to only be used within Mozilla, unfortunately, I can’t provide the text used to train the language model to you.

Once everything is installed you can then use the deepspeech binary to do speech-to-text on short, approximately 5 second, audio files (currently only WAVE files with 16-bit, 16 kHz, mono are supported in the Python client).
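
Since the client expects short clips, one way to prepare a longer recording is to cut it into fixed-length pieces. Below is a minimal sketch using only the standard-library `wave` module. The function name `split_wav`, the `out_prefix` naming scheme, and the 5 second default are illustrative assumptions, not part of DeepSpeech, and this is not necessarily how the chunks discussed in this thread were produced.

```python
import wave


def split_wav(path, out_prefix, chunk_seconds=5.0):
    """Write consecutive chunks of about chunk_seconds each; return the chunk paths."""
    paths = []
    with wave.open(path, "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()
        rate = src.getframerate()
        frames_per_chunk = int(rate * chunk_seconds)
        index = 1
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            chunk_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(chunk_path, "wb") as dst:
                # Copy the source format so each chunk stays 16-bit/16 kHz/mono
                # if the input was.
                dst.setnchannels(channels)
                dst.setsampwidth(width)
                dst.setframerate(rate)
                dst.writeframes(frames)
            paths.append(chunk_path)
            index += 1
    return paths
```

Fixed-length cuts can land in the middle of a word, which by itself may hurt recognition at chunk boundaries; cutting at pauses would likely be gentler, but that is beyond this sketch.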

That folder also contains several text files. This one shows the output with the standard language model enabled, where the garbled words run together (chunks_with_language_model.txt):

Running inference for chunk 1
so were trying again a maybeialstart this time
Running inference for chunk 2
omiokaarforfthelastquarterwastoget
Running inference for chunk 3
to car to state deloedmarchinstrumnalha
Running inference for chunk 4
a tonproductcaseregaugesomd produce sidnelfromthat
Running inference for chunk 5
i am a to do that you know
Running inference for chunk 6
we finish the kepehandlerrwend finished backfileprocessing
Running inference for chunk 7
and is he teckdatthatwewould need to do to split the cape
Running inference for chunk 8
out from sir handler and i are on new
Running inference for chunk 9
he is not monolithic am andthanducotingswrat
Running inference for chunk 10
relizationutenpling paws on that until it its a product signal

I’ve also provided the output for the same chunks with the language model turned off (chunks_without_language_model.txt):

Running inference for chunk 1
so we're tryng again ah maybe alstart this time
Running inference for chunk 2
omiokaar forf the last quarter was to get
Running inference for chunk 3
oto car to state deloed march in strumn alha
Running inference for chunk 4
um ton product caser egauges somd produc sidnel from that
Running inference for chunk 5
am ah to do that ou nowith
Running inference for chunk 6
we finishd the kepe handlerr wend finished backfile processinga
Running inference for chunk 7
on es eteckdat that we would need to do to split the kae ha
Running inference for chunk 8
rout frome sir hanler and ik ar on newh
Running inference for chunk 9
ch las not monoliic am andthan ducotings wrat
Running inference for chunk 10
relization u en pling a pas on that until it its a product signal

I’ve included both these files in the shared Dropbox folder link above.

Here’s the correct transcript, done manually (chunks_correct_manual_transcription.txt):

So, we're trying again, maybe I'll start this time.
So my OKR for the last quarter was to get AutoOCR to a state that we could
launch an external alpha, and product could sort of gauge some product signal
from that. To do that we finished the CAPE handler, we finished backfill
processing, we have some tech debt that we would need to do to split the CAPE
handler out from the search handler and make our own new handler so it's not
monolithic, and do some things around CAPE utilization. We are kind of putting
a pause on that until we get some product signal.

This shows the language model is the source of this problem; I’ve seen anecdotal reports on this forum and in blog posts that this is a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up joining the words together rather than retaining the spaces between them.
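
To put a rough number on the run-together output, one option is to flag unusually long tokens in each transcript file and compare the two runs. This is only a sketch: the function names are made up for illustration, and the 12 character threshold is an arbitrary assumption chosen because ordinary words in the manual transcript are shorter than that.

```python
def long_tokens(line, max_len=12):
    """Return the whitespace-separated tokens in line longer than max_len characters."""
    return [token for token in line.split() if len(token) > max_len]


def run_together_ratio(lines, max_len=12):
    """Fraction of all tokens across lines that exceed max_len characters."""
    tokens = [token for line in lines for token in line.split()]
    if not tokens:
        return 0.0
    return sum(len(token) > max_len for token in tokens) / len(tokens)
```

For example, chunk 2 with the language model (`omiokaarforfthelastquarterwastoget`) is a single 34 character token and gets flagged, while the same chunk without the language model (`omiokaar forf the last quarter was to get`) has no token over the threshold.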