I'm the data editor at Forbes, where I explore questions that interest me by writing and programming. I graduated from the University of Chicago in 2006 with degrees in mathematics and economics. Follow me on Twitter: @JonBruner, or on my personal website.

Will Data Monopolies Paralyze the Internet?

Google has built an advanced database of street geometries by sending cars like these around the world. The data they collect would be a crucial component in driverless cars, and the database itself may be forbiddingly expensive to replicate from scratch. But the streets that Google has mapped remain available for some innovative entrepreneur to map in a different way. (Image credit: Getty Images via @daylife)

Tim O’Reilly spoke with me last week about Internet companies acquiring massive proprietary data sets. “We’re kind of heading toward data as a source of monopoly power in some cases,” he told me, comparing data to “Intel Inside” as a barrier to market entry. He qualified a moment later, “There’s so much innovation still ahead that I’m pretty confident there’s room for more.” Tim’s venture capital partner, Bryce Roberts, followed up with a thoughtful post that foresees the end of Web 2.0 once all of that free-living user-generated data that we celebrated a decade ago–blogs, shared photos, message boards–moves behind password protection on social networks.

Data monopolies are a real possibility, but I think their rise will be tempered by ways of collecting data that barely existed just a couple of years ago. The world’s data isn’t something that can be held captive by a single operator, in the way that the world’s scandium mines could conceivably be bought by a nefarious baron who would laugh menacingly while twirling his moustache (“Fools!” he would say, his craggy face lit from below by a flickering fireplace).

Take Google Maps as an example: Google has built an extremely accurate roadway database by dispatching a fleet of telemetry cars around the world to collect data from high-definition cameras, laser rangefinders and GPS receivers. That data will be central to the company’s efforts to develop commercial autonomous cars. If those cars become widespread, Google will enjoy an enormous commercial advantage from the quality of its roadway data, which by then would be extremely expensive for anyone else to reproduce from scratch by the same method.

But regardless of how Google shares–or doesn’t share–its roadway data, the roads it has mapped will still exist outside of Google, ready to be mapped again by some inventive entrepreneur. In fact, Google’s method for collecting its roadway data–buying and outfitting cars and hiring humans to drive them up and down every street in the world–already looks a little old-fashioned. You might imagine that the next great roadway database could be compiled by stitching together geotagged images found on photo-sharing sites to create 3D models, or by aggregating location data from mobile phones or in-car telematics to find road lane centerlines and infer speed limits.
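To make the last idea concrete, here is a minimal Python sketch of how a speed limit might be inferred from aggregated phone or telematics data. Everything in it is an assumption for illustration: the function name `infer_speed_limits`, the `(segment_id, speed_mph)` input format, and the rule of thumb of taking the 85th percentile of observed speeds on a road segment and rounding to the nearest 5 mph. It is not a description of how any real mapping company does it.

```python
from collections import defaultdict
from statistics import quantiles


def infer_speed_limits(samples, step=5, min_samples=5):
    """Guess a posted speed limit for each road segment.

    samples: iterable of (segment_id, speed_mph) pairs, e.g. speeds
    reported by phones already matched to a road segment (hypothetical
    input format).
    """
    # Group the observed speeds by road segment.
    by_segment = defaultdict(list)
    for segment_id, speed in samples:
        by_segment[segment_id].append(speed)

    limits = {}
    for segment_id, speeds in by_segment.items():
        # Skip segments with too little data to say anything.
        if len(speeds) < min_samples:
            continue
        # Assumed heuristic: 85th percentile of observed speeds,
        # rounded to the nearest multiple of `step` mph.
        p85 = quantiles(speeds, n=100, method="inclusive")[84]
        limits[segment_id] = int(round(p85 / step) * step)
    return limits


if __name__ == "__main__":
    observed = [("main_st", s) for s in (28, 30, 31, 29, 33, 35, 32, 30, 34, 31)]
    observed += [("alley", 10), ("alley", 12)]  # too few samples, ignored
    print(infer_speed_limits(observed))
```

A real system would first have to match noisy GPS points to road segments and filter out traffic jams and parked cars; the sketch skips all of that and shows only the aggregation step.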

Social networks are somewhat different, of course: privacy-minded users have handed over vast troves of valuable data to Facebook and then locked it behind passwords, where it’s less accessible to neutral crawlers. Once you’ve handed over your biographical details and consumer preferences to Facebook, then Facebook can make money off of them, sure, but those biographical details and consumer preferences still exist outside of Facebook. They appear in all sorts of other things that you do and are waiting to be captured by another company that maybe doesn’t exist yet. Unless Facebook finds a way to copyright your birth date, the enormous value of its database will also serve as an enormous incentive for new companies to look for the same data elsewhere.

And, for what it’s worth, Facebook has found it has to be reasonably free with its users’ data in order to become the foundational platform for the entire social Internet. The site’s API allows outside applications to operate in much the same way that users operate–posting status updates, seeing “likes”–once users permit them to do so.

You don’t even need user approval to get lots of valuable information via screen scraping. I’ve written scripts that, by submitting the right cookies, impersonate a Facebook member I created named Testingoutsome Features–a twenty-year-old retired Penn Central Railroad fireman in Altoona who is a fan of the birther movement. Facebook didn’t react to the easily noticeable incongruities of a man whose career ended 15 years before he was born–or to the fact that Testingoutsome was visiting as many as ten member profiles per second over periods of several days. I was able to extract location attributes for a majority of the 175,000 or so people who have written on Sarah Palin’s Facebook wall–data that’s not available through the API without user authorization. It’s true that Facebook could abruptly shut off access to become an impenetrable fortress of mineable chirps, but it seems that, in practice, something in the market has given Facebook reason to let some of its data leak out.

Some very promising data hasn’t been collected on a large scale yet and might be less susceptible to monopolization than things like status updates. Lots of people I spoke with at the Where conference last week were excited about new ways to approach ambient data. Companies like Alohar Mobile, which Tim mentioned in my interview, promise to collect the little specks of data that we’re constantly releasing–our movements, via smartphone sensors; our thoughts, via Twitter feeds–and turn them into substantial data sets from which useful conclusions can be inferred. The result can be more valuable than what you might call deliberate data because ambient data can be collected consistently, without relying on humans to supply it on a regular basis by, say, checking in at favorite restaurants. It also offers great context–another crucial theme from my conversations in California–because constant measurements make it easier to understand changes in behavior.

Bigger companies have obvious advantages in getting into ambient data (Facebook’s installed app base would give it an immense advantage if it decided to start collecting more data from phone sensors). But that sort of data is more difficult, in some ways, to confine within a monopoly.

We should watch carefully for the emergence of data monopolies, but I’m optimistic. Lots of very innovative people are working on new ways to harvest data, and any kind of monopoly will only make their work more lucrative.
