“Big data” is when the size of the data itself becomes part of the problem

As the world generates incredible volumes of new data, the amount available for analysis (turning data into actionable knowledge) increases exponentially. As a result, more and more of the problems you might want to solve become “big data” problems.

The difficulty for many developers is that sources of clean, easy-to-use “big data” are not always apparent. Enter data markets.

Data markets and their many flavors

There’s a fundamental question we need to answer right up front: What is a data market?

In researching this article, I was fortunate enough to spend some time discussing data markets with Grant Nestor of Factual (more on his company later). When I asked Grant to define “data market”, here’s what he said:

A data market is a destination where data is exchanged for other data, money, or things of value.

I like Grant’s definition for two reasons. First, it emphasizes that one can “go” to a data market, i.e. there’s a web presence where you can research and find available sources of data. Second, it points out that we don’t just buy data from markets, but rather that we might also exchange other data or things of value for it. Both of these will be important later as we talk about what features various markets provide.

Data markets come in several flavors:

Data catalogs – markets which pull together links to various datasets; the data may be hosted in the market’s own storage or it may be linked to elsewhere; catalog markets are meant to make it easier to locate datasets of interest, but the data itself may not always be as “fresh” as the data provided by the next class of markets

Free public data sources – often mandated by governments and NGOs, these provide access to useful data but the data may be poorly structured or “dirty”, making it more difficult for a developer to use

Graphics oriented services – services meant more for analysts than developers, heavy on built-in visualization tools and spreadsheet support but often lacking in programmatic (API) access

Which of these types of data markets will meet the modern web developer’s needs? I can’t speak to every possible scenario, but I do know what I’m looking for in my own web-API-oriented development.

Features I want in a data market

For this data market series I’m interested in data markets that will let us explore their offerings cheaply and efficiently. Here then is what I am looking for from an ideal data market:

The ability to try before I buy with some sort of free developer offering

A general purpose market with a variety of data available (many vertical- and domain-specific markets are also available if you ever need them)

A variety of methods to access the data, ideally including web browser options, charting tools for displaying data trends after I find them in the browser, data dumps whereby I can download the desired data to operate on locally, and most importantly web query language and/or web API options that let me hack on the data living in the market’s servers

RESTful API which returns JSON output (XML is my second choice); a YQL binding would be very nice to have as an option, too

Bindings for a general purpose language, ideally Python or Java; the more languages are supported via client libraries, the better

Note that I am focusing on general purpose data markets which provide a free (as in beer) public API to access their data. Sometimes these markets are referred to as providing “Data as a Service” (DaaS) or “cloud data”. If a market doesn’t wrap its data via an API, in my opinion it’s making things too difficult for the developer. (Mediocre government data dump sites, I’m looking at you.)
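
To make the wish list above a little more concrete, here is a minimal Python sketch of the kind of interaction I have in mind: build a query URL for a RESTful endpoint, then pull rows out of the JSON that comes back. Everything specific in it is an assumption made up for illustration: the base URL, the `places` dataset name, the `filters` and `limit` parameters, and the `{"rows": [...]}` response shape do not belong to any real market. A canned response stands in for the network call so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; not a real data market URL.
BASE_URL = "https://api.example-datamarket.test/v1/datasets"


def build_query_url(dataset, filters, limit=10):
    """Build a GET URL for a hypothetical REST endpoint that returns JSON."""
    # The "filters" and "limit" parameter names are assumptions.
    params = {"filters": json.dumps(filters, sort_keys=True), "limit": limit}
    return f"{BASE_URL}/{dataset}/rows?{urlencode(params)}"


def parse_rows(body):
    """Extract the row list from a JSON body shaped like {"rows": [...]}."""
    payload = json.loads(body)
    return payload.get("rows", [])


# A canned response replaces the HTTP request so this runs without a network.
sample_body = '{"rows": [{"name": "Cafe Uno", "city": "Austin"}]}'

url = build_query_url("places", {"city": "Austin"}, limit=1)
rows = parse_rows(sample_body)

print(url)
print(rows[0]["name"])
```

In real use you would fetch `url` with an HTTP client and pass the response body to `parse_rows`; a market's actual parameter names and response layout will differ, so check its API documentation.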

Every market we’ll discuss below contains a variety of data sources and at least a certain level of access available for free so that developers can get started quickly and inexpensively. I’ll be primarily discussing DaaS data catalogs (many of these also contain free public datasets) for the rest of this series of data market articles.

The major data market catalog players

DataMarket.com is itself a more narrowly focused data catalog and thus not up for consideration in this series given my criteria above, but it has published an excellent overview of the data market competitive landscape on its blog, in the post “The Emerging Field of Data Markets”. All four of the data markets I’ve chosen to discuss further below are outlined in that blog post.

Factual

Factual (@factual) provides a general purpose market with public APIs that developers can start using for free. Factual’s market enables developers to share and reuse data. You can participate in enriching and expanding the available data by both contributing new data and updating existing data. In Grant Nestor’s words:

Factual is actively seeking a “virtuous circle” which benefits everyone.

Factual will in fact cut businesses a deal if they agree to upload some of their own data, or enrichments to Factual data that they use, back into the Factual system.

Factual currently exposes many different datasets via their data market search. Their primary focus to date has been on empowering developers with basic information about businesses and geographic points of interest. In fact, one of their better known customers so far has been Facebook, which has loaded Factual point of interest (POI) data for various countries, including the UK and Japan, for use by Facebook Places. One example of the datasets available in Factual is the US POI and Business Listings, which currently contains more than 13 million places.

Although Factual does provide a lot of local and business geodata, they also provide a wide variety of other data from many domains including entertainment, education, government, health, and more. For more information on what’s available you can browse the available dataset topics.

Factual offers RESTful API access via their Server API on the Developer Tools page. They also provide CSV download, iPhone SDK, and HTML+CSS+JavaScript web access options, with an Android SDK coming at some point in the future. I’ll discuss these in more detail and show specific examples of using Factual data in the next article in this series.

Freebase

Freebase (@fbase) also provides a data catalog of structured, updatable data, much like Factual. Its data spans a similarly wide range of domains, and it has many millions of records available for developer use.

One difference between Freebase and Factual lies in Freebase’s entity-based approach. Freebase imposes more structure on the underlying data by assigning unique IDs to identified entities. This video provides a good description of why this is done:

This additional structure makes certain operations simpler, while at the same time making user contributions more difficult. You may benefit as a data user, but have more work to do as a data provider. You have to judge for yourself which of these two approaches is preferable for your data market needs.
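
The trade-off can be sketched in a few lines of Python. This is only an illustration of the general entity-ID idea, not Freebase’s actual data model or API; the `Entity` and `EntityStore` classes, the `ent:0001` ID, and the property names are all made up. The point is that a contributor has to supply the right ID up front (more work when providing data), and in exchange every fact about the same thing lands on one record (less work when using data).

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One real-world thing, identified by a unique ID (IDs here are made up)."""
    entity_id: str
    names: set = field(default_factory=set)
    facts: dict = field(default_factory=dict)


class EntityStore:
    """Toy store that keys every contribution by entity ID."""

    def __init__(self):
        self._by_id = {}

    def assert_fact(self, entity_id, name, prop, value):
        # The contributor must already know which entity the fact is about.
        entity = self._by_id.setdefault(entity_id, Entity(entity_id))
        entity.names.add(name)
        entity.facts[prop] = value
        return entity

    def get(self, entity_id):
        return self._by_id.get(entity_id)

    def __len__(self):
        return len(self._by_id)


store = EntityStore()
# Two contributions spell the name differently but share one ID,
# so they merge into a single entity instead of two unrelated rows.
store.assert_fact("ent:0001", "New York City", "country", "US")
store.assert_fact("ent:0001", "NYC", "kind", "city")

city = store.get("ent:0001")
print(len(store), sorted(city.names), city.facts)
```

Without the shared ID, the two contributions would look like records about two different places, and the data user would be left to reconcile the names.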