
As I was listening to the December 21 episode of CPPCast, which together with TWiML&AI is one of my two favourite podcasts, I couldn’t help but be a little bewildered by the number of times the guest used the word “like” during their interview.

During another CPPCast episode that I recently listened to, the hosts coincidentally discussed the idea of making transcriptions of the episodes available.

These two occurrences, namely the abundance of the “like” disfluency and the mention of transcription, connected in the back of my mind and produced an idea: use a publicly available speech API to transcribe the podcast, and count the number of utterances of the word “like”.

Due to the golden age of information we find ourselves in, this was not that hard at all.

Selecting the API

A short investigation of Microsoft’s offerings seemed to indicate that I would not be able to transcribe just under an hour of speech, so I turned to Google.

The Google Cloud Speech API has specific support for the asynchronous transcription of speech recordings of up to 3 hours.

Setting up the project and service account

Make sure that you can access the Google Cloud Dashboard with your Google account. I created a new project for this experiment called cppcast-speech-to-text.

Within that project, select APIs & Services dashboard from the menu on the left, and then enable the Speech API for that project by selecting the Enable APIs and Services link at the top.

Next, go to IAM & Admin and Service Accounts via the main menu, and create a service account for this project.

Remember to select the download JSON private key checkbox.

Transcode and upload the audio

For the Speech API, you will have to transcode the MP3 to FLAC, and you will have to upload the file to a Google Cloud Storage bucket.

I transcoded the MP3 to a 16kHz mono FLAC (preferred by the API) as follows:
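
A minimal sketch of such a transcode, assuming ffmpeg is installed and on the PATH. The choice of ffmpeg, the helper names and the file names are assumptions for illustration, not necessarily what was used for this post. The -ar flag sets the sample rate and -ac the number of channels.

```python
import subprocess


def build_transcode_command(src_mp3, dst_flac):
    """Return an ffmpeg argument list for a 16 kHz mono FLAC transcode."""
    return [
        "ffmpeg",
        "-i", src_mp3,   # input MP3
        "-ar", "16000",  # resample to 16 kHz
        "-ac", "1",      # downmix to a single (mono) channel
        dst_flac,        # output format is inferred from the .flac extension
    ]


def transcode(src_mp3, dst_flac):
    # check=True raises CalledProcessError if ffmpeg exits with an error
    subprocess.run(build_transcode_command(src_mp3, dst_flac), check=True)


# Usage (hypothetical file names):
# transcode("cppcast-131.mp3", "cppcast-131.flac")
```

The resulting FLAC file still has to be uploaded to a Cloud Storage bucket before the Speech API can read it.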

Too many likes?

I wrote the following Python to tally up the total number of words, and the total number of “like” utterances.

import json

with open('/Users/cpbotha/Downloads/cppcast-131-text.json') as f:
    # results: a list of dicts, each with 'alternatives', which is a list of transcripts
    res = json.load(f)['response']['results']

num_like = 0
num_words = 0
for r in res:
    alts = r['alternatives']
    # ensure that we only have one alternative per result
    assert len(alts) == 1
    # break into lowercase words
    t = alts[0]['transcript'].strip().lower().split()
    # tally up total number of words
    num_words += len(t)
    # count the like utterances
    num_like += sum(1 for w in t if w == 'like')

In this 56-minute episode of CPPCast, 7411 words were detected, 214 of which were the word “like”.

This is not quite as many as I imagined, but still comes down to 3.82 likes per minute, which is enough to be quite noticeable.
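
For reference, the arithmetic behind that rate can be checked in a few lines. The variable names are only illustrative; the three input numbers are the ones reported above.

```python
num_words = 7411    # words detected in the transcript
num_like = 214      # occurrences of "like"
duration_min = 56   # episode length in minutes

likes_per_minute = num_like / duration_min
like_share = num_like / num_words

print(f"{likes_per_minute:.2f} likes per minute")  # 3.82 likes per minute
print(f"{like_share:.1%} of all words")            # 2.9% of all words
```

In other words, roughly one in every 35 detected words was a “like”.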

Conclusions

We should try to use “like” and other speech disfluencies far less often. Inserting a small pause makes more sense: the speaker and the listeners get a little break to process the ongoing speech, and the speech comes across as more measured.

All in all, it took me about 2 hours from idea to transcribed text. I find it wonderful that machine learning for speech-to-text has become so democratised.

After my transcription job was complete, I saw that it was possible to supply phrase hints to the API. I could have uploaded a list of words we expect to occur during this podcast, such as “CPPCast” and “C++”, and this would have been used by the API to further improve its transcription.
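
As a rough sketch of what that could look like, the snippet below builds a request body for the REST API's speech:longrunningrecognize method, with the phrase hints passed via speechContexts. The bucket name, file name, language code and helper function are assumptions for illustration. I did not run the job this way, so treat it as a starting point rather than a tested recipe.

```python
import json


def build_recognize_request(gcs_uri, phrases):
    """Build a JSON-serialisable body for an asynchronous recognition request."""
    return {
        "config": {
            "encoding": "FLAC",
            "sampleRateHertz": 16000,   # matches the 16 kHz transcode
            "languageCode": "en-US",    # assumed language of the podcast
            # phrase hints: words we expect to hear in the recording
            "speechContexts": [{"phrases": list(phrases)}],
        },
        "audio": {"uri": gcs_uri},      # audio is read from Cloud Storage
    }


# hypothetical bucket and object name
body = build_recognize_request(
    "gs://my-bucket/cppcast-131.flac",
    ["CPPCast", "C++"],
)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a POST request, authenticated with the service account created earlier.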