Alexa, Please Save the World

Speech recognition is going to change the way we compute and how we think

KIDS TODAY will grow up thinking a keyboard is some antediluvian tool like an abacus or butter churn, which they might encounter only because it’s nailed to a wall of a TGI Fridays.

Voice is taking over as the way we interact with technology and input words. Actually, it was supposed to have taken over a long time ago. Back in 1998, I wrote a column for USA Today saying that “speech-recognition technology looks ready to change the world,” though I also noted that when I tried to say “two turntables and a microphone” into the latest and greatest speech-recognition software, it thought I said something like “two torn labels and an ice cream cone.” Turns out that was about 20 years too soon.

But the technology works now. Microsoft, Google, Amazon, IBM, China’s Baidu and a handful of startups have been driving hard to build artificial intelligence software that can understand nuanced speech and reply coherently. Late last year, Microsoft said its speech-recognition technology had caught up to human understanding. Its “word error rate” got down to 5.9 percent, about the same as people who had transcribed the same conversation—and much better than the word error rate in any conversation between a parent and his or her teenage son.
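For readers curious what that 5.9 percent measures: word error rate is just edit distance counted over words instead of letters—substitutions, insertions and deletions needed to turn the machine’s transcript into the reference, divided by the reference length. A minimal sketch (the function name and the dynamic-programming layout are my own illustration, not Microsoft’s code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic dynamic-programming edit distance, applied to words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# The column's 1998 mishearing scores a WER of 1.2 -- worse than 100
# percent, because the hypothesis also inserted extra words.
print(wer("two turntables and a microphone",
          "two torn labels and an ice cream cone"))
```

Note that the rate can exceed 100 percent when the transcript is longer than what was actually said, which is why a 5.9 percent rate is so striking.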

Google’s speech-recognition technology is learning human languages at a rapid clip. In August, it added 30 new ones, including Azerbaijani and Javanese, bringing the total to 119. IBM’s Watson technology has become well known for interacting with humans—you’ve probably seen the commercial showing Watson talking with Bob Dylan. OK, it’s an ad. But even implying that a machine can comprehend what Dylan is saying is groundbreaking.

Companies are lining up to get ready for a flood of speech-driven commerce. The main reason Amazon wants to get Alexa into your home is so you’ll get used to shopping by just speaking to the thing. In August, Google and Walmart announced a partnership that will allow users of the Google Home gadget to use speech to buy directly from the world’s biggest retailer. “We are trying to help customers shop in ways that they may have never imagined,” said Marc Lore, CEO of Walmart ecommerce U.S. (Lore joined Walmart when it bought the online retailer he founded, Jet.com.) All around retail, chatbot shopping through apps from the likes of WeChat, Kik and Hipmunk is the new hot thing. Most shopping bots today are text-based but are moving toward speech. According to comScore, half of all searches will be voice searches by 2020—and search is most consumers’ first step toward buying.

Ever since Apple introduced Siri in 2011, we’ve come to expect our phones and apps to comprehend spoken queries, which is an underappreciated, monumental achievement after so many decades of trying. It’s like the turning point in the 1910s, when people started to expect that airplanes would actually fly.

IBM demonstrated the first voice-recognition machine, called Shoebox, at the 1962 World’s Fair in Seattle. The device could understand all of 16 words—the numbers zero to nine and instructions like “plus” and “minus.” To let you know it understood you, Shoebox would do simple math and print the result.

In the 1970s, the U.S. military’s research arm, the Defense Advanced Research Projects Agency, or DARPA, funded a massive speech-recognition program that got the total of words understood by a machine up to about 1,000—still far from practical yet roughly equivalent to our current president’s vocabulary. In the 1980s, James Baker, a professor at Carnegie Mellon University, co-founded Dragon Systems, based on his speech-recognition research. In 1990, Dragon’s first consumer dictation-taking product cost $9,000 and mostly just frustrated users. In 1998, when I stopped in at IBM Research to check on progress in the field, speech recognition was still not good enough for everyday use.

Why has the technology suddenly gotten so good? The onslaught since 2007 of mobile devices and cloud computing has allowed massive data centers operated by giants such as Google and Amazon to learn language from hundreds of billions of conversations around the world. Every time you ask something of an Alexa or a Watson, the system learns a little more about how people say stuff. Because the software can learn, no one has to punch in data about every slang word or accent. The software will keep improving, and soon it will understand our speech better than the typical human does.

And that could radically change the world. Shopping may be an early application, but the technology can even alter the way we think. A couple of generations learned to think with a keyboard and mouse—a tactile experience. “The creative process is changed,” a Dragon executive named Joel Gould told me back in 1998, anticipating changes. “You’ll have to learn to think with your mouth.” In a way, it’s taking us back to the way our brains were meant to work—the way people thought and created for thousands of years before pens and typewriters and word processors. Homer didn’t need to type to conjure up The Iliad.

In a speech-processing world, illiteracy no longer has to be a barrier to a decent life. Google is aggressively adding languages from developing nations because it sees a path to consumers it could never before touch: the 781 million adults who can’t read or write. By just speaking into a cheap phone, this swath of the population could do basic things like sign up for social services, get a bank account or at least watch cat videos.

The technology will affect things in odd, small ways too. One example: At a conference not long ago, I listened to the head of Amazon Music, Steve Boom, talk about the impact Alexa will have on the industry. New bands are starting to realize they must have a name people can pronounce, unlike MGMT or Chvrches. When I walked over to my Alexa and asked it to play “Chu-ver-ches,” it gave up and played “Pulling Muscles From the Shell” by Squeeze.

In fact, as good as the technology is today, it still has a lot to learn about context. I asked Alexa, “What is ‘two turntables and a microphone’?” Instead of replying with anything about Beck, she just said, “Hmm, I’m not sure.” But at least she didn’t point me to the nearest ice cream cone.

NO RAGE AGAINST THE MACHINE: Tech executives at the CES trade show in Las Vegas in 2017. Many companies are building artificial intelligence software that can understand nuanced speech and reply coherently.