Although speech recognition technology has improved considerably in recent years, it is still no match for a human transcriptionist in accuracy. Commercially available speech recognition software shows an average error rate of about 12% when transcribing phone conversations. Automated transcription is the process of converting an audio or video file into written text using voice and speech recognition technology. As in most areas of AI, transcription systems are built the same way: by training software on large, high-quality datasets of examples.
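Error rates like the 12% figure above are typically reported as word error rate (WER): the word-level edit distance between a reference transcript and the system's hypothesis, divided by the reference length. A minimal sketch (illustrative only, not any particular vendor's scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Two substitutions ("the"->"a", "mat"->"hat") out of six reference words:
print(wer("the cat sat on the mat", "the cat sat on a hat"))  # ~0.333
```

A WER of 0.12 corresponds to the "about 12%" error rate quoted for commercial systems on phone conversations.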

At Deepgram, an end-to-end deep learning speech recognition system underpins a different kind of solution, one that makes working with speech data faster, more accurate, and more reliable, and that is built to meet the needs of enterprise companies. Deepgram's innovation is to use artificial intelligence to process text and speech together, forming mixed custom models that are then fully trained on customers' own files, from telephone calls and podcasts to recorded meetings and videos. Deepgram's voice search indexes words by how they sound, so customers can find a word even when it has been misspelled. Deepgram CEO Stephenson says the company's models automatically pick up the noise profile of the microphone, as well as background noise, audio encoding, transmission protocol, accent, pitch (i.e., energy), emotion, conversation topic, speech rate, product names, and language. He also claims the system improves speech recognition accuracy by 30% over industry benchmarks, transcribes 200 times faster, and can process thousands of simultaneous audio streams.
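Deepgram's pronunciation-based search is proprietary, but the general idea of matching words by sound rather than spelling is an old one. A toy sketch using a simplified Soundex code (an assumption for illustration only; a classic phonetic hash, far cruder than a learned acoustic model): words that sound alike map to the same code, so a misspelled query still finds its target.

```python
def soundex(word: str) -> str:
    """Simplified Soundex: first letter + up to three digits for
    consonant classes that sound alike. Vowels and h/w/y reset the run,
    so repeated sounds are coded once."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    word = word.lower()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

# Misspellings that sound the same hash to the same code:
print(soundex("Stephenson"), soundex("Stevenson"))  # S315 S315
print(soundex("Deepgram"), soundex("Deapgrem"))     # D126 D126
```

A phonetic index built over such codes lets a search for "Stevenson" retrieve audio segments transcribed as "Stephenson", which is the behavior described above, achieved here by much simpler means.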

Real-time transcription provides deaf and hard-of-hearing people with visual access to spoken content, such as classroom instruction and other live events. Currently, the only reliable source of real-time transcription is expensive, highly trained experts who are able to keep up with natural speaking rates. Automatic speech recognition is cheaper but produces too many errors in realistic settings. We introduce a new approach in which partial captions from multiple non-experts are combined to produce a high-quality transcription in real time. We demonstrate the potential of this approach with data collected from 20 non-expert captionists.
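The core idea, merging overlapping partial captions so the gaps of one typist are covered by another, can be sketched with a toy majority vote. This is not the authors' actual algorithm (real systems must first align the streams in time; here the partials are assumed pre-aligned, with `None` marking words a captionist missed):

```python
from collections import Counter

def merge_captions(partials):
    """Majority-vote merge of pre-aligned partial transcripts.
    Each partial is a list of words, with None where that captionist
    missed a word; voting over each position fills the gaps."""
    merged = []
    for words_at_pos in zip(*partials):
        votes = Counter(w for w in words_at_pos if w is not None)
        merged.append(votes.most_common(1)[0][0] if votes else "?")
    return " ".join(merged)

# Three non-experts, each catching only part of the utterance:
captions = [
    ["speech", None,          "is",  None,     "hard"],
    [None,     "recognition", "is",  "really", None],
    ["speech", "recognition", None,  "really", "hard"],
]
print(merge_captions(captions))  # speech recognition is really hard
```

No single partial caption is complete, yet the vote recovers the full sentence, which is why a handful of non-experts can approach the coverage of one trained stenographer.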

Somewhere between 2009 and 2010, a new and exciting technology broke into the forefront of Artificial Intelligence research. Within a few months, a combination of advanced computing power and huge amounts of data set the stage for a new era in AI. Sophisticated algorithms invented back in the 1950s, previously considered no more than an academic thought experiment, were transmuted into the cutting edge of the industry. These algorithms -- Deep Neural Networks -- broke boundaries, smashed records, and obtained novel achievements in a field of Artificial Intelligence that had lain all but dormant for decades. One of the areas where these achievements were most prominent was Automatic Speech Recognition (ASR), i.e., the task of automatically transcribing voice recordings into written words.

In one of my previous blog posts, I touched on the topic of AI-powered transcription services on the market. There, I introduced the idea that, at this pace of multimedia production, traditional human-powered transcription services are not the solution. In the past two years, we've produced 90% of all the data our civilization has ever created. At this pace, and with a 9:1 ratio of transcription time to recording length, it is simply impossible for human-powered transcription to keep up. Just as hiring an army of workers to dig a perfectly straight ditch a thousand miles long is not the best option, we need to start thinking about how machines can help.