Text Talks

How artificial intelligence and machine learning are unlocking content

By C Bryan Jones

The pace of content creation has never been faster. Each day, the world generates 2.5 quintillion bytes of data, and more than 90 percent of all data in existence has been produced since 2016. As transmission speeds accelerate and bandwidth increases, more and more of this content is shifting from text to audio and video.

In the process, it is being hidden from search engines, locked away in multimedia formats that cannot be catalogued. A gold mine of information is waiting to be tapped if only the spoken word could be easily converted to text. Doing so by hand is time-consuming and often prohibitively expensive. But that is changing with a new generation of technology driven by artificial intelligence (AI) and machine learning.

WRITE IT DOWN

Information is what powers the world. It is also what powers Google, the go-to directory for 86 percent of internet users. The ubiquitous catalog created by Mountain View, CA-based tech giant Google LLC works by scanning websites and cataloging text. That inspiring video you created about your services? Invisible. That enlightening podcast interview with the company CEO? Silent.

“People and businesses are creating huge amounts of voice-based content in the form of interviews, conferences, webinars, tutorials, podcasts, customer calls, and meetings,” said Ashutosh Trivedi, co-founder of Spext, which has offices in California and India. “This content is living in monolithic media formats, such as MP3 audio files and MPEG video files, which are meant for long-form entertainment.”

While this content may be a powerful way to introduce your company, bring in new customers, and boost sales, none of that can happen until someone finds it.

Andre Bastie, chief executive officer and co-founder of Dublin, Ireland-based Happy Scribe, Ltd., explained: “By default, search engines do not retrieve audio content. For this reason, if you publish an audio or video file on the web, it will never appear in search results. This is a big issue for podcasters and media companies that publish a lot of audio and video content.”

However, to say that the video is invisible and the podcast is silent is only partially true. Google will catalog the page that contains the link or embedded player, but it can only see the description you placed alongside. It can’t read inside the files, so you had better beef up those keywords. What you really need is a transcript so that the wealth of information contained within is revealed to the automated sweeps that catalog our 2.5-quintillion-byte daily output.

TIME FACTOR

Creating that transcript isn’t simple, though. At least it hasn’t been. Time and money add up quickly.

The time required for a human to transcribe voice to text is generally four minutes of work for each minute of audio. Costs range from 75 cents to $1.50 per minute. Suppose you have a one-hour podcast, a format more businesses are using as part of their marketing mix. At those rates, having a transcript created would take about four hours and cost $45 to $90. Now suppose you produce one podcast per week. Or two. Or three. Costs rise quickly.

But what if the same task could be done in four minutes at a cost of just $6? That is what AI and machine learning are now making possible, and companies such as Spext and Happy Scribe are pushing the revolution forward.
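The arithmetic behind this comparison can be sketched in a few lines of Python. This is only an illustration built from the figures quoted above: the function names are hypothetical, and the assumption that machine time and cost scale linearly with audio length is ours, not a published price list from any of the services mentioned.

```python
# Figures quoted in the article.
HUMAN_WORK_MIN_PER_AUDIO_MIN = 4               # four minutes of work per minute of audio
HUMAN_RATE_LOW, HUMAN_RATE_HIGH = 0.75, 1.50   # dollars per minute of audio
MACHINE_WORK_MIN_PER_AUDIO_HOUR = 4            # "four minutes" for a one-hour podcast
MACHINE_COST_PER_AUDIO_HOUR = 6.00             # "$6" for a one-hour podcast


def human_estimate(audio_minutes):
    """Return (work_hours, low_cost, high_cost) for human transcription."""
    work_hours = audio_minutes * HUMAN_WORK_MIN_PER_AUDIO_MIN / 60
    low_cost = audio_minutes * HUMAN_RATE_LOW
    high_cost = audio_minutes * HUMAN_RATE_HIGH
    return work_hours, low_cost, high_cost


def machine_estimate(audio_minutes):
    """Return (work_minutes, cost), ASSUMING time and cost scale linearly."""
    work_minutes = audio_minutes * MACHINE_WORK_MIN_PER_AUDIO_HOUR / 60
    cost = audio_minutes * MACHINE_COST_PER_AUDIO_HOUR / 60
    return work_minutes, cost


if __name__ == "__main__":
    # One, two, or three one-hour podcasts per week.
    for episodes_per_week in (1, 2, 3):
        minutes = 60 * episodes_per_week
        hours, low, high = human_estimate(minutes)
        machine_minutes, machine_cost = machine_estimate(minutes)
        print(
            f"{episodes_per_week} per week: "
            f"human {hours:.0f} h, ${low:.2f} to ${high:.2f}; "
            f"machine {machine_minutes:.0f} min, ${machine_cost:.2f}"
        )
```

The per-minute rates are the only inputs, so the totals follow directly from them and can be rerun with whatever rates a particular vendor quotes.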

Also in the mix is Temi, a service of San Francisco, CA-based internet startup Rev. “The time and productivity benefits of machine transcription are difficult to ignore, especially as the technology continues to improve,” said Mary Kenny, product manager at Temi. “I think that, one day, we may be in a world where audio and text are thought of not as distinct and separate media formats, but as the same content, interchangeable from one to the other.”

Bastie echoed the point about time savings, touching on a way in which The ACCJ Journal has been making use of the technology for content development since last autumn: “Having a text version of your audio documents can save you hours of work. This is especially true for researchers who will have to search through dozens of interviews.”

INSTANT ACCESS

One such person is Jeremy Caplan, director of education at the Tow-Knight Center for Entrepreneurial Journalism at the City University of New York’s Newmark Graduate School of Journalism. Caplan spoke to The ACCJ Journal about his experience using the nascent technology.

“Machine transcription makes it easy and affordable to create quick, accurate transcripts of interviews and meetings. I can look back over interview notes and discussions and efficiently share key highlights without having to spend hours poring over recordings or listening to things over and over,” he said. “That means less time spent on menial tasks and more time for focusing on challenging questions and important decisions.

“Many professionals accustomed to dictating or keeping track of conversations in medicine, law, and business will find value in having almost-instant transcripts, because they can quickly share the material in scannable text form when time is of the essence. Journalists can use quick transcripts for breaking-news situations and to quickly and efficiently share raw material with readers in this era when transparency in journalism is crucial in helping reestablish trust for our profession.”

Trivedi pointed out an often-overlooked benefit: customer relations. “Transcribing customer meetings can also help you focus on building the relationship. You can record—with permission—and transcribe your customer meetings so that you can be fully present in the moment. That way, you don’t need to worry about taking notes, or wonder whether you’ll remember the details of the meeting later.”

Additional uses he cited play into the marketing mix. “Transcripts can make media content reusable and generate more forms of the same content. An audio or video interview can be broken down into short-form content such as tweets, or easily converted to a blog post.”

AUTO OR MANUAL?

Although the accuracy of machine transcription is impressive and improving all the time, it may not be the right choice for every situation. Here is advice from Temi on selecting the best course:

Choose automated transcription if you:

Need a transcript immediately

Are limited on budget

Have clear audio with one or two speakers

Only need a rough draft

Have time to edit the results yourself

Want to search for keywords

Are looking for a few specific quotes

Choose human transcription if you:

Need the most accurate results

Have a flexible budget

Don’t want to spend time on editing

Intend to publish the content

Recorded more than two speakers

Have audio with heavy accents

DEEP LEARNING

What has made this revolution in voice to text possible? Happy Scribe’s Bastie cites recent advances in machine learning. “It comes down to new algorithms using neural networks, and a lot of data—deep learning.” These advances, coupled with increasing computing power, were also cited by Temi’s Kenny as having “enabled massive improvements in speech recognition accuracy.”

At the core is AI, but its role is more complex than one might think. “Intelligence has many facets,” said Trivedi, talking about how Spext is able to achieve close to 97-percent accuracy. “Understanding human speech and language is essential for intelligent machines to interact with humans in their own language.”

AI is also at the heart of Temi’s service, explained Kenny. “The primary components underlying our speech engine are automated speech recognition [ASR] and natural language processing [NLP]. ASR is the processing of speech to text while NLP is the processing of the text to understand its meaning.”

ACCURACY MATTERS

There’s a good chance that you have tried speech-to-text transcription before, either on your smartphone or using one of the speech-recognition software packages that have been around since the 1990s. You may have been disappointed with the results. While such programs could be trained for good accuracy with your own voice, today’s AI-powered tools are different. They are able to handle new voices, accents, and dialects, even when these are mixed within a single recording, and deliver accuracy of greater than 95 percent.

In many fields, extreme accuracy is critical and can be a matter of life and death. Spext cites “close to 97 percent” accuracy for its machine-generated transcripts of English with accents. Trivedi explained how they achieve this: “Deep learning models, infrastructure to support deep-learning training, and huge amounts of data have made this possible. [Our engine] will also reach high accuracy in verticals where the language has specifics—in medical transcription, for example.” The company also recently launched a vertical product for those working in law to help manage court recordings and depositions.

This doesn’t mean that machine transcription is set to replace the human ear. “It still has a chance of error, so it cannot be used in very critical cases,” Trivedi said. “Human transcription or review is required in medical and legal domains, where accuracy must be 100 percent, so at Spext we believe in a hybrid model that uses the speed and scale of machines and an application for humans to review the result and make it useful.”

Spext can produce a transcript of a two-hour meeting in just five minutes, and other services show similar speed. That’s a far cry from the potential eight hours required for a human to do the same work.

Sharing his experience with this approach, Caplan explained: “Automated transcripts are very accurate for dictation. But when you record conversations with multiple speakers, the transcripts tend to be riddled with little errors. That’s why they’re best for ‘gist’ transcripts that give you the sense of the discussion, rather than for flawless word-for-word records. My name, Jeremy Caplan, might on occasion be rendered as ‘Germany’s captain,’ and ‘a bucket of flowers’ might turn into a phrase like ‘buck it for flour.’ Still, with a bit of quick editing, you can clean up an auto-generated transcript in much less time than it would take to transcribe a recording manually.”
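The article does not say how any of these services measure accuracy. One widely used yardstick in speech recognition is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the machine transcript into the correct one, divided by the number of words in the correct one. The sketch below is a minimal illustration of that metric under our own assumptions (lowercasing, splitting on whitespace, a hypothetical function name). It is not the method used by Spext, Temi, or Happy Scribe.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Classic dynamic-programming edit distance, computed over words
    # instead of characters. prev[j] is the distance between the reference
    # words seen so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Sentences built around the mis-hearings Caplan describes.
    print(word_error_rate("my name is jeremy caplan",
                          "my name is germany's captain"))  # 2 errors in 5 words
    print(word_error_rate("a bucket of flowers", "buck it for flour"))
```

On this measure, two wrong words in a five-word sentence give a WER of 0.4. That shows how a transcript can score well over a long recording and still garble the names and phrases a reader cares about most.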

TOMORROW’S TECH

So where do we go from here? Bastie predicts that accuracy will greatly improve in the coming years, while Trivedi believes machine transcription or speech-to-text is going to become a commodity and will be available for all at no cost.

“If you look at content forms—text, audio, and video—each has its own advantages,” Trivedi said. “Text is easy to create, edit, share, search, remove, and host, whereas audio and video are not. In the future, we will be using text as an interface to create, share, and disseminate voice-based media.”

This can already be seen in Spext’s underlying technology, which provides a new kind of media file that fuses text and speech to make content easy to edit, share, and transform. “With machine transcription, you can know the exact timestamp of each word with millisecond accuracy,” Trivedi explained. “We have built our own AI model to accurately find the timestamp of each word in an audio or video file. This allows us to build innovative applications such as our fast clip creator and voice editor.”

That voice editor is the world’s first to use a transcript as the editing interface for media production. If you delete text, the corresponding audio is also removed. “This supercharges creating, publishing, and repurposing media,” he added. “Marketing teams don’t have to work with complex audio or video editing software. You can create clips from a long video by just copying and pasting text.”
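To see why word-level timestamps make this kind of editing possible, consider the simplified sketch below. It is not Spext’s implementation, which the article does not describe. The `Word` type, the function names, and the sample timings are all hypothetical. The idea is only that once every word carries a start and end time, deleting words from the text tells you which stretches of audio to keep.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int  # where the word begins in the recording
    end_ms: int    # where the word ends in the recording


def segments_to_keep(words, deleted_indices):
    """Return merged (start_ms, end_ms) audio ranges for the words not deleted."""
    segments = []
    previous_kept = False
    for index, word in enumerate(words):
        if index in deleted_indices:
            previous_kept = False
            continue
        if previous_kept:
            # Adjacent kept words share one segment, pauses included.
            segments[-1] = (segments[-1][0], word.end_ms)
        else:
            segments.append((word.start_ms, word.end_ms))
        previous_kept = True
    return segments


def edited_text(words, deleted_indices):
    """Return the transcript with the deleted words removed."""
    return " ".join(w.text for i, w in enumerate(words) if i not in deleted_indices)


# Hypothetical word timings for a short clip.
SAMPLE = [
    Word("So", 0, 200),
    Word("um", 250, 600),
    Word("welcome", 650, 1100),
    Word("to", 1120, 1250),
    Word("the", 1260, 1400),
    Word("show", 1420, 1900),
]

if __name__ == "__main__":
    removed = {1}  # the user deletes "um" in the transcript
    print(edited_text(SAMPLE, removed))
    print(segments_to_keep(SAMPLE, removed))
```

An audio tool would then splice the kept ranges together. Creating a clip by copying text works the same way in reverse: the selected words define the range to extract.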

Lastly, as AI’s mastery of English improves, diversity in languages will follow. At present, Temi’s automated transcription service supports only English, but Rev’s human freelancers can work in more than 35 languages, including Arabic, Chinese, Dutch, French, Japanese, and Spanish, for subtitling and translation. Spext is working with more than 100 languages for automated transcription and Happy Scribe offers 119.

We may live in a multimedia world with sights and sounds swirling around us everywhere we go, but at the end of the day it is the written word that still rules. Thanks to AI, machine learning, and Big Data, tapping into all the senses—and making sure your content is found—is getting a whole lot easier.