Can an Algorithm Write a Better News Story Than a Human Reporter?

Illustration: Mark Allen Miller

Had Narrative Science, a company that trains computers to write news stories, created this piece, it probably would not mention that the company’s Chicago headquarters lie only a long baseball toss from the Tribune newspaper building. Nor would it dwell on the fact that this potentially job-killing technology was incubated in part at Northwestern’s Medill School of Journalism, Media, Integrated Marketing Communications. Those ironies are obvious to a human. But not to a computer.

For now consider this: Every 30 seconds or so, the algorithmic bull pen of Narrative Science, a 30-person company occupying a large room on the fringes of the Chicago Loop, extrudes a story whose very byline is a question of philosophical inquiry. The computer-written product could be a pennant-waving second-half update of a Big Ten basketball contest, a sober preview of a corporate earnings statement, or a blithe summary of the presidential horse race drawn from Twitter posts. The articles run on the websites of respected publishers like Forbes, as well as other Internet media powers (many of which are keeping their identities private). Niche news services hire Narrative Science to write updates for their subscribers, be they sports fans, small-cap investors, or fast-food franchise owners.

And the articles don’t read like robots wrote them:

Friona fell 10-8 to Boys Ranch in five innings on Monday at Friona despite racking up seven hits and eight runs. Friona was led by a flawless day at the dish by Hunter Sundre, who went 2-2 against Boys Ranch pitching. Sundre singled in the third inning and tripled in the fourth inning … Friona piled up the steals, swiping eight bags in all …

OK, it’s not Roger Angell. But the grandparents of a Little Leaguer would find this game summary—available on the web even before the two teams finished shaking hands—as welcome as anything on the sports pages. Narrative Science’s algorithms built the article using pitch-by-pitch game data that parents entered into an iPhone app called GameChanger. Last year the software produced nearly 400,000 accounts of Little League games. This year that number is expected to top 1.5 million.

Narrative Science’s CTO and cofounder, Kristian Hammond, works in a small office just a few feet away from the buzz of coders and engineers. To Hammond, these stories are only the first step toward what will eventually become a news universe dominated by computer-generated stories. How dominant? Last year at a small conference of journalists and technologists, I asked Hammond to predict what percentage of news would be written by computers in 15 years. At first he tried to duck the question, but with some prodding he sighed and gave in: “More than 90 percent.”

That’s when I decided to write this article, hoping to finish it before being scooped by a MacBook Air.

Hammond assures me I have nothing to worry about. This robonews tsunami, he insists, will not wash away the remaining human reporters who still collect paychecks. Instead the universe of newswriting will expand dramatically, as computers mine vast troves of data to produce ultracheap, totally readable accounts of events, trends, and developments that no journalist is currently covering.

That’s not to say that computer-generated stories will remain in the margins, limited to producing more and more Little League write-ups and formulaic earnings previews. Hammond was recently asked for his reaction to a prediction that a computer would win a Pulitzer Prize within 20 years. He disagreed. It would happen, he said, in five.

Hammond was raised in Utah, where his archaeologist dad taught at a state university. He grew up thinking he’d become a lawyer. But in the late 1970s, as an undergraduate at Yale, he fell under the sway of Roger Schank, a renowned artificial intelligence researcher and chair of the computer science department. After earning a doctorate in computer science, Hammond was hired by the University of Chicago to lead a new AI lab. While there, in the mid-1990s, he created a system that tracked users’ reading and writing and then recommended relevant documents. Hammond built a small company around that technology, which he later sold. By that time, he had moved to Northwestern University, becoming codirector of its Intelligent Information Laboratory. In 2009, Hammond and his colleague Larry Birnbaum taught a class at Medill that included both programmers and prospective journalists. They encouraged their students to create a system that could transform data into prose stories. One of the students in the class was a stringer for the Tribune who covered high school sports; he and two other journalism students were paired with a computer science student. Their prototype software, Stats Monkey, collected box scores and play-by-play data to spit out credible accounts of college baseball games.

At the end of the semester, the class participated in a demo day, where students presented their projects to a roomful of executives from the likes of ESPN, Hearst, and the Tribune. The Stats Monkey presentation was particularly impressive. “They put a box score and play-by-play into the program, and in something close to 12 seconds it drew examples from 40 years of Major League history, wrote a game account, located the best picture, and wrote a caption,” recalls the Medill dean, John Lavine.

Stuart Frankel, a former DoubleClick executive who left the online advertising network after Google purchased it in 2008, was among the guests that day. “When these guys did the presentation, the air in the room changed,” he said. “But it was still just a piece of software that wrote stories about baseball games—very limited.” Frankel followed up with Hammond and Birnbaum. Could this system create any kind of story, using any kind of data? Could it create stories good enough that people would pay to read them? The answers were positive enough to convince him that “there was a really big, exciting potential business here,” he says. The trio founded Narrative Science with Frankel as CEO in 2010.

The startup’s first customer was a TV network for the Big Ten college sports conference. The company’s algorithm would write stories on thousands of Big Ten sporting events in near-real time; its accounts of football games updated after every quarter. Narrative Science also got assigned the women’s softball beat, where it became the country’s most prolific chronicler of that sport.

But not long after the contract began, a slight problem emerged: The stories tended to focus on the victors. When a Big Ten team got whipped by an out-of-conference rival, the resulting write-ups could be downright humiliating. Conference officials asked Narrative Science to find a way for the stories to praise the performances of the Big Ten players even when they lost. A human journalist might have blanched at the request, but Narrative Science’s engineers saw no problem in tweaking the software’s parameters—hacking it to make it write more like a hack. Likewise, when the company began covering Little League games, it quickly understood that parents didn’t want to read about their kids’ errors. So the algorithmic accounts of those matchups ignore dropped fly balls and focus on the heroics.

Narrative Science’s writing engine requires several steps. First, it must amass high-quality data. That’s why finance and sports are such natural subjects: Both involve the fluctuations of numbers, such as earnings per share, stock swings, ERAs, and RBI. And stats geeks are always creating new data that can enrich a story. Baseball fans, for instance, have created models that calculate the odds of a team’s victory in every situation as the game progresses. So if something happens during one at-bat that suddenly changes the odds of victory from, say, 40 percent to 60 percent, the algorithm can be programmed to highlight that pivotal play as the most dramatic moment of the game thus far.

Then the algorithms must fit that data into some broader understanding of the subject matter. (For instance, they must know that the team with the highest number of “runs” is declared the winner of a baseball game.) So Narrative Science’s engineers program a set of rules that govern each subject, be it corporate earnings or a sporting event. But how to turn that analysis into prose? The company has hired a team of “meta-writers,” trained journalists who have built a set of templates. They work with the engineers to coach the computers to identify various “angles” from the data. Who won the game? Was it a come-from-behind victory or a blowout? Did one player have a fantastic day at the plate? The algorithm considers context and information from other databases as well: Did a losing streak end?
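To make those first two steps concrete, here is a rough sketch in Python of how a pivotal play and a few angles might be pulled from game data. It is not Narrative Science’s code, which the company has not published. The `Play` record, the function names, and the thresholds for a blowout or a comeback are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Play:
    description: str
    win_prob_before: float  # one team's chance of winning before the play
    win_prob_after: float   # the same team's chance of winning after it


def pivotal_play(plays):
    """Return the play that moved the odds of victory the most."""
    return max(plays, key=lambda p: abs(p.win_prob_after - p.win_prob_before))


def find_angles(scores, winner_max_deficit, blowout_margin=7, comeback_deficit=3):
    """Label a finished game using a few hand-written rules.

    scores maps team name to runs and is assumed not to be tied.
    winner_max_deficit is the most runs the eventual winner trailed by.
    Both thresholds are illustrative guesses, not values from the company.
    """
    # Domain rule from the article: the team with the most runs wins.
    winner = max(scores, key=scores.get)
    loser = min(scores, key=scores.get)
    margin = scores[winner] - scores[loser]

    angles = [f"winner:{winner}"]
    if winner_max_deficit >= comeback_deficit:
        angles.append("come-from-behind")
    if margin >= blowout_margin:
        angles.append("blowout")
    return angles


plays = [
    Play("leadoff single", 0.40, 0.45),
    Play("three-run homer", 0.40, 0.60),
    Play("inning-ending strikeout", 0.60, 0.55),
]
print(pivotal_play(plays).description)
print(find_angles({"Boys Ranch": 10, "Friona": 8}, winner_max_deficit=0))
```

A real system would presumably weigh many more angles against one another, such as streaks, individual performances, and records from other databases, but the shape is the same: numbers in, a short list of story-worthy facts out.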

Then comes the structure. Most news stories, particularly about subjects like sports or finance, hew to a pretty predictable formula, and so it’s a relatively simple matter for the meta-writers to create a framework for the articles. To construct sentences, the algorithms use vocabulary compiled by the meta-writers. (For baseball, the meta-writers seem to have relied heavily on famed early-20th-century sports columnist Ring Lardner. People are always whacking home runs, swiping bags, tallying runs, and stepping up to the dish.) The company calls its finished product “the narrative.”
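A toy version of that last step might look like the following Python sketch. The templates, the phrase lists, and the `render_story` function are invented here for illustration; they assume only what the meta-writers’ approach implies, namely a fixed framework with slots that get filled from a stock vocabulary.

```python
import random

# Hypothetical phrase bank, loosely in the spirit of the vocabulary described above.
VOCABULARY = {
    "hit_home_run": ["whacked a home run", "went deep"],
    "stole_base": ["swiped a bag", "stole a base"],
}

# Hypothetical framework: a lede followed by one sentence on the star player.
TEMPLATES = {
    "lede": "{winner} beat {loser} {winner_runs}-{loser_runs} on {day}.",
    "star": "{player} {action} in the {inning} inning.",
}


def render_story(facts, rng=None):
    """Fill the templates with facts, picking one stock phrase for the star's event."""
    rng = rng or random.Random()
    action = rng.choice(VOCABULARY[facts["star_event"]])
    lede = TEMPLATES["lede"].format(**facts)
    star = TEMPLATES["star"].format(action=action, **facts)
    return f"{lede} {star}"


facts = {
    "winner": "Boys Ranch",
    "loser": "Friona",
    "winner_runs": 10,
    "loser_runs": 8,
    "day": "Monday",
    "player": "Hunter Sundre",
    "star_event": "stole_base",
    "inning": "fourth",
}
print(render_story(facts))
```

Choosing phrases at random is the simplest way to keep thousands of write-ups from reading identically; how the company’s software actually varies its wording is not described.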

Occasionally the algorithms will produce a misstep, like a story stating that a pinch hitter, who usually bats only once per game, went two for six. But such errors are rare. Numbers don’t get misquoted. Even when databases provide faulty information, Hammond says, Narrative Science’s algorithms are trained to catch the error. “If a company has a 600 percent rise in profits from quarter to quarter, it’ll say, ‘Something is wrong here,’” Hammond says. “People ask for examples of wonderful, humorous gaffes, and we don’t have any.”
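The kind of check Hammond describes could be as simple as the Python sketch below. The function names and the 500 percent cutoff are assumptions made for this example; the article does not say where the company actually draws the line or how its checks are built.

```python
def percent_change(previous, current):
    """Quarter-over-quarter change, as a percentage of the previous value."""
    if previous == 0:
        raise ValueError("percent change is undefined when the previous value is 0")
    return (current - previous) / abs(previous) * 100.0


def looks_suspicious(previous, current, max_abs_change_pct=500.0):
    """Flag a swing too large to publish without a second look.

    The 500 percent default is an illustrative guess, not a known setting.
    """
    return abs(percent_change(previous, current)) > max_abs_change_pct


# A jump from $10 million to $70 million in profit is a 600 percent rise.
print(percent_change(10.0, 70.0))
print(looks_suspicious(10.0, 70.0))
print(looks_suspicious(10.0, 12.0))
```

A flagged figure would not necessarily be wrong, only unusual enough that the story should be held or the data source double-checked before anything is published.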

Forbes Media chief products officer Lewis Dvorkin says he’s impressed but not surprised that, in almost every case, his cyber-stringers nail the essence of the company they’re reporting on. Major screwups are not unheard-of with flesh-and-blood scribes, but Dvorkin hasn’t heard any complaints about the automated reports. “Not a one,” he says. (The pieces on Forbes.com include an explanation that “Narrative Science, through its proprietary artificial intelligence platform, transforms data into stories and insights.”)

The Narrative Science team also lets clients customize the tone of the stories. “You can get anything, from something that sounds like a breathless financial reporter screaming from a trading floor to a dry sell-side researcher pedantically walking you through it,” says Jonathan Morris, COO of a financial analysis firm called Data Explorers, which set up a securities newswire using Narrative Science technology. (Morris ordered up the tone of a well-educated, straightforward financial newswire journalist.) Other clients favor bloggy snarkiness. “It’s no more difficult to write an irreverent story than it is to write a straightforward, AP-style story,” says Larry Adams, Narrative Science’s VP of product. “We could cover the stock market in the style of Mike Royko.”

Once Narrative Science had mastered the art of telling sports and finance stories, the company realized that it could produce much more than journalism. Indeed, anyone who needed to translate and explain large sets of data could benefit from its services. Requests poured in from people who were buried in spreadsheets and charts. It turned out that those people would pay to convert all that confusing information into a couple of readable paragraphs that hit the key points.

Narrative Science, it so happened, was well placed to accommodate such demands. When the company was just getting started, meta-writers had to painstakingly educate the system every time it tackled a new subject. But before long they developed a platform that made it easier for the algorithm to learn about new domains. For instance, one of the meta-writers decided to build a story-writing machine that would produce articles about the best restaurants in a given city. Using a database of restaurant reviews, she was able to quickly teach the software how to identify the relevant components (high survey grades, good service, delicious food, a quote from a happy customer) and feed in some relevant phrases. In the space of a few hours she had a bot that could churn out an endless supply of chirpy little articles like “The Best Italian Restaurants in Atlanta” or “Great Sushi in Milwaukee.”

(Narrative Science’s main rival in automated story creation, a North Carolina company founded as Stat Sheet, has broadened its mission in similar fashion. The company can’t compete with Narrative Science’s Medill pedigree and so has assumed the role of a feisty tabloid in a two-paper town. It too got its start in sports, writing accounts of Major League and big-college games as well as creating a trash-talk generator called StatSmack. After realizing that turning data into stories presented an opportunity far larger than sports, the company changed its name to Automated Insights. “I used to put limitations on what we do, assuming our stories would be specific to data-rich industries,” founder Robbie Allen says. “Now I think ultimately the sky is the limit.”)

And the subject matter keeps getting more diverse. Narrative Science was hired by a fast-food company to write a monthly report for its franchise operators that analyzes sales figures, compares them to regional peers, and suggests particular menu items to push. What’s more, the low cost of transforming data into stories makes it practical to write even for an audience of one. Narrative Science is looking into producing personalized 401(k) financial reports and synopses of World of Warcraft sessions—players could get a recap after a big raid that would read as if an embedded journalist had accompanied their guild. “The Internet generates more numbers than anything that we’ve ever seen. And this is a company that turns numbers into words,” says former DoubleClick CEO David Rosenblatt, who sits on Narrative Science’s board. “Narrative Science needs to exist. The journalism might be only the sizzle—the steak might be management reports.”

For now, though, journalism remains at the company’s core. And like any cub reporter, Narrative Science has dreams of glory—to identify and break big stories. To do that, it will have to invest in sophisticated machine-learning and data-mining technologies. It will also have to get deeper into the business of understanding natural language, which would allow it to access information and events that can’t be expressed in a spreadsheet. It already does a little of that. “In the financial world, we’re reading headlines,” Hammond says. “We can identify if some company’s stock gets upgraded or downgraded, somebody gets fired or hired, somebody’s thinking of a merger, and we know the relationship between those events and a stock price.” Hammond would like to see his company’s college sports stories include nonstatistical information like player injuries or legal problems.

But even if Narrative Science never does learn to produce Pulitzer-level scoops with the icy linguistic precision of Joan Didion, it will still capitalize on the fact that more and more of our lives and our world is being converted into data. For example, over the past few years, Major League Baseball has spent millions of dollars to install an elaborate system of hi-res cameras and powerful sensors to measure nearly every event that’s occurring on its fields: the velocities and trajectories of pitches, tracked to fractions of inches. Where the fielders stand at any given moment. How far the shortstop moves to dive for a ground ball. Sometimes the real story of the game may lie within that data. Maybe the manager failed to detect that a pitcher was showing signs of exhaustion several batters before an opponent’s game-winning hit. Maybe a shortstop’s extended reach prevented six hits. This is stuff that even an experienced beat writer might miss. But not an algorithm.

Hammond believes that as Narrative Science grows, its stories will go higher up the journalism food chain—from commodity news to explanatory journalism and, ultimately, detailed long-form articles. Maybe at some point, humans and algorithms will collaborate, with each partner playing to its strength. Computers, with their flawless memories and ability to access data, might act as legmen to human writers. Or vice versa, human reporters might interview subjects and pick up stray details—and then send them to a computer that writes it all up. As the computers get more accomplished and have access to more and more data, their limitations as storytellers will fall away. It might take a while, but eventually even a story like this one could be produced without, well, me. “Humans are unbelievably rich and complex, but they are machines,” Hammond says. “In 20 years, there will be no area in which Narrative Science doesn’t write stories.”

For now, however, Hammond tries to reassure journalists that he’s not trying to kick them when they’re down. He tells a story about a party he attended with his wife, who’s the marketing director at Chicago’s fabled Second City improv club. He found himself in conversation with a well-known local theater critic, who asked about Hammond’s business. As Hammond explained what he did, the critic became agitated. Times are tough enough in journalism, he said, and now you’re going to replace writers with robots?

“I just looked at him,” Hammond recalls, “and asked him: Have you ever seen a reporter at a Little League game? That’s the most important thing about us. Nobody has lost a single job because of us.”