Where facts become data

Last month, I talked about some problems with facts. These problems have nothing to do with whether the universe is real (it is) or whether something is true because you want it to be (it's not). Rather, the problems have to do with the way we think about facts. And that confusion is getting worse (or, arguably, better) as facts and data blur into each other.

A fact used to be a way the world is. That's why your English teacher was eager to correct you when you used the phrase "true facts." All facts are by their nature true. Or, even more pedantically, a fact is real, and only our statements about facts can be true or false.

But even when I was a lad, it wasn't quite that simple. Back then, a fact was more clearly a way the world was, but the only facts we ever came across were ones expressed by someone. We'd say that the almanac was full of facts, when more properly we should have said that it was full of true statements about facts. Inevitably, then, a fact seemed to be both a way the world was and a statement about the world. Human awareness was already woven through facts.

A twisty path

This has come to matter more as data has become more central in the ecology of knowledge.

The word "data" has as twisty an evolution as the words "fact" and "information," although "data" and "information" were both hijacked more explicitly than was the word "fact." "Fact" near its beginning in English meant "evil act," and somewhat naturally moved back toward its Latin root as that which has been done. "Data" and "information" were given new technical meanings at the dawn of the Computer Age. Data moved from being the given—the start that cannot be ungiven—to the "stuff" that computers operated on. In that way, data could not be much less like facts. Data is from the start a representation of the world, not the world itself. Data by itself makes absolutely no claim to truth, which is why "garbage in, garbage out" is a coherent idea. The phrase "true data" will not get you corrected the way "true fact" will.

In the network era

In the Age of Computers, data had a negative tinge in the culture overall. It was seen—quite accurately—as a reduction of a complex world to overly simple representations. In the Age of Networks, though, data has taken on a new positive connotation, I think because of the confluence of three factors: There's so much of it, it's openly accessible and it's getting linked up. Put them together and data becomes not so much a slimmed-down representation of a complex phenomenon (like a human resources record about a human being) but a way to discern patterns in the clouds. Everyone understands that some of the data in these clouds is going to turn out to be inaccurate, but with so much of it openly available, and with the ability to link up data sets, the inaccuracies turn into the equivalent of rounding errors.

Now facts want to join in the fun. For example, the site Factual.com aggregates facts and makes them available via an API so that they can be mashed together and mashed up with other sources. It was founded in 2007 by Gil Elbaz, who sold his first company—it developed AdSense—to Google. The company is particularly strong on restaurants and places, although it is expanding its coverage, and has ambitious plans to be a factual center for the Web.

Facts and data merge

Factual.com is pretty clearly about facts. Its databases generally contain the sort of information that you'd definitely want to correct if it's wrong. For example, its table of 6,661 beers of the world lists the brewers, the style of beer, the country and region, and the sorts of information included on a nutrition label. It's factual information, but it's not swarming with data about how often beers are mentioned in literature or the number of bottles bought per region per month. But Factual.com's facts begin to look more like data because they are accessible via APIs so that developers can hook them together with other facts and with lots of data. The information at Factual.com is designed to be used by computers, just as data is. And once a fact has been pulled from a database and clicked together with a bunch of data from a cloud—"Why, did you know, sir, that Astra Urtyp is a German-style Pilsner brewed in Hamburg, a city with more than 60 museums and where a head of lettuce costs approximately $1.15?"—it's getting hard to tell the facts from the data.
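For readers who like to see the clicking-together spelled out, here is a minimal sketch of that mashup in Python. It does not use Factual.com's actual API; the record layout, the field names and the mash_up helper are all assumptions made up for illustration, and the values are simply the ones from the example above.

```python
# Hypothetical "fact" records, standing in for rows of a curated table.
# The field names are illustrative assumptions, not any real schema.
beer_facts = [
    {"name": "Astra Urtyp", "style": "German-style Pilsner", "city": "Hamburg"},
]

# Hypothetical "data" about cities, standing in for a separate, linked data set.
city_data = {
    "Hamburg": {"museums_at_least": 60, "lettuce_price_usd": 1.15},
}


def mash_up(facts, data_by_city):
    """Join each fact record with whatever city data is available for it."""
    merged = []
    for fact in facts:
        # A city with no entry contributes nothing; the fact passes through as-is.
        extra = data_by_city.get(fact["city"], {})
        merged.append({**fact, **extra})
    return merged


combined = mash_up(beer_facts, city_data)
```

Once the records are merged, nothing in the result marks which fields began life as curated facts and which came from the cloud of data, which is the blurring in question.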

But factual purists ought not despair. This blurring between facts and data is happening for a good reason: We're building a smarter world.