Programming The News: The Future Of Reporting Is Algorithms

from the I-for-one-welcome-our-new-fedora-clad-robotic-overlords dept

This may seem like the sort of statement usually delivered by an overblown narrator as rockets and lasers go zooming* by, but here goes: In the world of journalism, the future is now! Granted, it's the kind of future that often makes waves in the present and raises at least as many questions as it answers, but if you wanted a bright, problem-free future, you'd have to travel back to the divergence point somewhere between Philip K. Dick and The Jetsons... and then eliminate the dystopians.

*Yes, I realize lasers don't make noise or "zoom" by, but that hasn't prevented George Lucas from becoming insanely rich, has it?

Journalist Ken Schwencke has occasionally awakened in the morning to find his byline atop a news story he didn’t write.

No, it’s not that his employer, The Los Angeles Times, is accidentally putting his name atop other writers’ articles. Instead, it’s a reflection that Schwencke, digital editor at the respected U.S. newspaper, wrote an algorithm — that then wrote the story for him.

Instead of personally composing the pieces, Schwencke developed a set of step-by-step instructions that can take a stream of data — this particular algorithm works with earthquake statistics, since he lives in California — compile the data into a pre-determined structure, then format it for publication.
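In spirit, that kind of algorithm boils down to a few lines of template-filling code. Here's a minimal sketch of the idea; the field names, thresholds, and wording below are invented for illustration and are not taken from Schwencke's actual system:

```python
# A toy template-based story generator: structured quake data in,
# publishable sentence out. All field names here are hypothetical.

QUAKE_TEMPLATE = (
    "A magnitude {mag:.1f} earthquake struck {dist} miles from "
    "{place} on {day} at {time}, according to preliminary data."
)

def write_quake_story(record):
    """Fill the prose template from a dict of structured quake data."""
    return QUAKE_TEMPLATE.format(
        mag=record["magnitude"],
        dist=record["distance_miles"],
        place=record["nearest_town"],
        day=record["date"],
        time=record["local_time"],
    )

story = write_quake_story({
    "magnitude": 3.2,
    "distance_miles": 4,
    "nearest_town": "Westwood, California",
    "date": "Monday",
    "local_time": "4:42 a.m.",
})
print(story)
```

Point a script like this at a live data feed on a timer, and the "reporter" really can be asleep when the story files itself.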

His fingers never have to touch a keyboard; he doesn’t have to look at a computer screen. He can be sleeping soundly when the story writes itself.

This isn't exactly new news. (Then again, neither is the morning paper, but that's a discussion for another time...) Algorithmic story generation has been around for a few years now, with Narrative Science leading the field. A couple of years ago, Narrative Science was the story, rather than just the automated recap. George Washington University's website had covered a GWU baseball game with a longish recap that only got around to mentioning the opposing pitcher's perfect game in the seventh (out of eight) paragraph. Observers wondered if a bot was behind this "ignoring the forest for the trees" recap. Narrative Science's techies were highly offended and responded by producing two algorithmically-generated recaps -- one from the home team's POV and one more neutral.

The first concern with robo-journalism is often expressed by the journalists themselves: are we getting pushed out?

This robonews tsunami, he insists, will not wash away the remaining human reporters who still collect paychecks. Instead the universe of newswriting will expand dramatically, as computers mine vast troves of data to produce ultracheap, totally readable accounts of events, trends, and developments that no journalist is currently covering.

This is somewhat echoed by L.A. Times reporter Schwencke, who sees the algorithmic output as a boon for busy journalists.

Schwencke says the use of algorithms on routine news tasks frees up professional reporters to make phone calls, do actual interviews, or dig through sophisticated reports and complex data, instead of compiling basic information such as dates, times and locations.

“It lightens the load for everybody involved,” he said.

Schwencke's "bot" is rather simple, functioning best with a limited dataset and a minimum of formatting. Narrative Science's output is a bit more complex, allowing customers to adjust the "slant" of the generated stories. Not only that, but the software can cop an attitude, if requested.

The Narrative Science team also lets clients customize the tone of the stories. “You can get anything, from something that sounds like a breathless financial reporter screaming from a trading floor to a dry sell-side researcher pedantically walking you through it,” says Jonathan Morris, COO of a financial analysis firm called Data Explorers, which set up a securities newswire using Narrative Science technology. (Morris ordered up the tone of a well-educated, straightforward financial newswire journalist.) Other clients favor bloggy snarkiness. “It’s no more difficult to write an irreverent story than it is to write a straightforward, AP-style story,” says Larry Adams, Narrative Science’s VP of product. “We could cover the stock market in the style of Mike Royko.”
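That "slant" knob is easy to imagine in miniature: the same facts rendered through different phrase libraries, selected by a tone setting. A toy sketch, with tone names and wording invented purely for illustration (this is not Narrative Science's actual approach):

```python
# One set of facts, several phrasings keyed by a "tone" parameter.
TONES = {
    "straight": "{team} defeated {opponent} {score} on {day}.",
    "homer": "{team} rolled past {opponent} {score} on {day}!",
    "snarky": "Somehow, {team} managed to beat {opponent} {score} on {day}.",
}

def recap(facts, tone="straight"):
    """Render the same game facts in the requested tone."""
    return TONES[tone].format(**facts)

game = {"team": "GWU", "opponent": "State", "score": "5-1", "day": "Saturday"}
print(recap(game, "homer"))
```

Swap the phrase library and the byline can go from wire-service dry to bloggy snark without touching the underlying data.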

This leads to the ethical quandary presented by the use of bots. Is robo-generated journalism really journalism, and is the use of algorithms a betrayal of readers' trust, especially when a familiar name is on the byline? If factual errors are discovered, does the blame lie with the software, or with the journalist who agreed to let the article "write itself?"

The answer here isn't simple (and the question likely isn't even fully formed yet), but the key is transparency.

“People are already reading automated data reports that come to them, and they don’t think anything of it,” said Ben Welsh, a colleague of Schwencke’s at the Times.

Welsh says that responsibility for accuracy falls where it always has: with publications, and with individual journalists.

“The key thing is just to be honest and transparent with your readers, like always,” he said. “I think that whether you write the code that writes the news or you write it yourself, the rules are still the same.”

“You need to respect your reader. You need to be transparent with them, you need to be as truthful as you can… all the fundamentals of journalism just remain the same.”

Questions involving intellectual property are also raised, although they aren't discussed in these articles. Who holds the copyright on the generated articles? In Schwencke's case, the rights are likely retained by the L.A. Times. In Narrative Science's case, it's probably defined by contractual terms with each customer, with the copyright in generated articles presumably reverting to the end user once the contract is up.

Schwencke's homebrewed algorithm is a different IP animal. If he switches papers, does he retain the rights to the "bot"? Or is that algorithm, developed while employed by the L.A. Times, considered a "work for hire," and thus the paper's property? Arguably, the algorithm is an extension of him, covering his area of expertise and designed to emulate his reporting. What if Schwencke builds a similar piece of software for his new employer? Would he be permitted to do so, or would additions to "non-compete" clauses prevent it? Is it patentable?

The more ubiquitous "robo-journalism" becomes, the more issues like these will arise. Hopefully, IP turf wars will remain at a minimum, allowing for the expansion of this promising addition to the journalist's toolset. With bots handling basic reporting, journalists should be freed up to pursue the sort of journalism you can't expect an algorithm to handle -- longform, investigative, etc. This is good news for readers, even if they may find themselves a little unnerved (at first) by the journalistic uncanny valley.

Reader Comments

graph theory

I think I just had a wonderful idea.

This journalist has a machine that sifts through lots of raw data and writes a story -- using a style tuned to his specifications -- which I then read. I can keep reading his stuff or look elsewhere, but if that feedback reaches him at all, it must reach through several layers.

What if I had a machine like that, which sifted through the same data and wrote stories suited to my taste? I could make minute adjustments whenever I pleased, or read articles by multiple "journalists" on the same subject, give my scores, and let them fight it out and evolve. The human journalists could still do the research, but I'd be subscribing to the pool of their findings, not to the condensed articles. And I could gin up as much of this bespoke news as I wanted, on any topic I wanted...

What if there were a complementary engine that could read a story (with a certain date), extract the constituent facts, and deduce the settings the robo-journalist would need in order to produce something similar given the facts known at the time? Then one could reverse the editorial-bias settings and read an opposing view. Or study a large body of articles to make a robo-journalist mimic of any human journalist, living or dead. (Maybe not a very convincing one, but the technology can only improve.) Ah, to read Hitchens again...

Such robo-journalists could have blogs, singly or in groups, and evolve to...

Big deal

This blog has already taken the next step. There are several bots that scan the articles here, then generate random outrage based on keywords. Mind you, the results aren't perfect yet and can be fairly hilarious at times due to a lack of reading comprehension.

It's hard to think of anything less copyrightable than news. Gossip, grapevine talk, heard-it-from-a-friend, over-the-fence backyard chat, and other cultural ways of sharing information, particularly news, along with our opinions about it. All of this is foundational culture, in the same way we like to talk about the weather.

It's nice to have automation to deal with crunching data from diverse sources and combining it into readable reports, but this also allows the user to do the same or more analysis given the same data. In this case the data would be more important than the article itself.

Many times I've wanted to plot my own stock market values against points in economic history that I thought were pivotal, complete with interest-rate co-analysis and other real-time reports/values plotted alongside, with the ability to update it all as new data arrives. How about prison population vs. GDP drain on society, or crop production vs. global average temperature? You get the idea.

Not many of our personal analyses would be perfect, but at least we'd be looking and checking on our own. It would be nice not having to believe some pencil-neck pinhead hired by a special interest group to grind an axe.

In the US, data is not patentable or copyrightable, but a published printout can be, even though the facts in it cannot. So: we have the start of the data wars.

To keep one computer from taking/using the data from another, the first algorithm would give out not the raw data but some multiplier of two data values, forming some new industry metric not rated or known to the general public. It gives new meaning to the term "data processing." That would prevent the average person from making an original analysis.

It's already a common technique once the market has learned to measure the value of a product by standard industry values: special interest industry groups, doing market research, find some new way to measure the product that makes smaller units seem larger, so buyers are willing to pay more. (Diagonal TV measurement instead of X and Y values, VA ratings for UPS computer power supplies instead of watts, etc.)

As if what is heard on popular media news sites is not suspect enough, how would knowing that it was manipulated by robots be any better? At least with news anchors we could blame a person, but how can one fault an algorithm? I've started to mistrust any article/research/experiment/causation/idea that does not provide original data sources.

The next step is consumer side customization.
I wake up to a morning newspaper customized to MY tastes, data taken from various sources on the Internet and the stories chosen for me based on my interests, edited for me based on my reading habits, written based on my bias.

Re:

I bought both my children and grandchildren dictionaries. The reason is that when I was a child, I learned quite a bit from words I found on the way to the words I was actually trying to find. If you can customize everything you see, you won't find the things you don't know about. The sad thing is you won't even know you missed them.

Most of what you see and hear on network news these days is advertising and editorial jocularity. The talking heads might as well be replaced with Max Headroom. Real journalism still exists, but one needs to look for it. I doubt a robot will be performing that task anytime soon, unless, possibly, it is Marvin.

Bad Newspaper Writing.

People have come up with programs which can take tabular information (stock market results, baseball scores, etc.) and plug the results into a prose template. It's not really intelligent, of course; it merely exposes the banality of much newspaper writing.

Certain news stories, notably those in sports, are highly formulaic. I had a look at some sports stories in the local newspaper, all written by humans. It turns out that sportswriters aren't even very good at articulating what actually happened in a baseball or football game, so they pad it out with statistics, standard cliches, and summary reports of previous games. For example: "Smith, whose batting average is such and such, struck out. He had also struck out in yesterday's game, and in Wednesday's game." The sportswriter is padding out minimal information about a game which he may not even have attended. Obviously, a computer can paste in bits and pieces from old articles, that kind of thing. As I said, this is really bad sports-writing.

It does not tell you anything interesting: for example, that Smith has superb reflexes but swings at a lot of pitches which are not over the plate, and if he only held back, the umpire would call them balls. Perhaps Smith is a simple sort of man, easily baited in general, and perhaps Jones, the pitcher, is a clever, mocking sort of fellow, like the boxer Muhammad Ali ("If he gives me any jive, I'll take him in five!"), who is very good at needling simple men into indiscretions, the way a matador plays a bull. Perhaps Jones is carefully placing the ball just far enough beyond the plate that Smith will swing at it and miss, and become still more enraged at missing...

The kind of paper a baseball umpire might maintain for his own use might look a lot like a spreadsheet, a pre-printed sheet with room to fill in information as it develops. He can tick a box to indicate a strike, a foul ball, a home run, etc., and work up a kind of shorthand similar to that used to indicate chess moves. "1 KP-K4, KP-K4; 2 QP-Q4, PxP; 3 QxP, QN-QB3; 4 Q-K3, N-B3" is quite expressive in its way. One could have an analogous system for baseball. There might be a stylized diagram of the playing field, on which movement of the ball and the runners could be indicated by drawing arrows. A tablet computer, with the right software, could save time and labor in recording this diagrammatic information, and automatically generate an Acrobat file for publication. In the case of football, there is a conventional language of diagrams, which is used to show players how they are supposed to move when a given play is called. It can equally well be used post-hoc, to describe how they actually did move.

Of course, the umpire could add idiosyncratic sidenotes, where appropriate, eg. "Pitcher threw ball, which struck batter in ear. Batter picked himself up, gathered up his bat, ran to mound, hit pitcher over head with bat. Outfielders ran towards mound, as did players from the bench, the latter carrying additional bats. Faction fight ensued. Opposing fans joined in, carrying automobile tire irons. Game canceled. Police called."

If you are looking at _good_ sports-writing, say Ernest Hemingway's _Death in the Afternoon_, James Michener's _Sports in America_, or Donald Hall's _Fathers Playing Catch with Sons_, that is not something a computer can easily replicate. But those works talk about much bigger subjects than merely who won the game.

Of course there's a simple answer

This leads to the ethical quandary presented by the use of bots. Is robo-generated journalism really journalism, and is the use of algorithms a betrayal of readers' trust, especially when a familiar name is on the byline? If factual errors are discovered, does the blame lie with the software, or with the journalist who agreed to let the article "write itself?"

The answer here isn't simple (and the question likely isn't even fully formed yet), but the key is transparency.

The answer is very simple: If the reporter put his name on it, he is accountable for it. If his algorithm screws something up and he lets it go to print without even checking it first, then of course the blame lies with him. Why wouldn't it?

Present Data As Data Without Trying to Verbalize It Unnecessarily.

Weather is not local -- it involves huge masses of air moving over hundreds or thousands of miles. The NOAA electronically collects data from all over the world, and from space, dumps it all into a computer, sets it up as partial differential equations, and generates worldwide predictions. Only the government has the money to do this kind of thing. By inputting a latitude and longitude, you can get a forecast for any particular location. A newspaper's attempts at creating its own daily weather reports can only be inferior.

The highest grade of weather reporting is aviation weather reporting. There are systems to more or less instantly download NOAA weather reports to airplanes in flight, with the information automatically posted to moving-map displays. The pilot zooms out his map so that he can see two or three hundred miles ahead, and finds that a bunch of red marks has just appeared a couple of hundred miles away. So he gets on the radio and talks to ground control about an alternate route.

Similarly, the United States Geological Survey handles earthquake reports: it collects data from vast numbers of seismographs plugged into computer networks, processes the results automatically, and posts them on a website.

"Welsh says that responsibility for accuracy falls where it always has: with publications, and with individual journalists."

True enough. They are responsible for the behavior of their bots.

But contrast this with statements by people who don't want the responsibility that comes when their torrent software delivers music to the IP police, because they say they didn't know it would do that.