The Evolution of Linked Data Business Models

The #linkeddata hashtag is once again active on the topic of linked data business models and the linked data value proposition. Scott Brinker [2] seems to have kicked things off this time around with his post, 7 business models for linked data, which Talis’ Leigh Dodds has responded to with Thoughts on Linked Data Business Models. Although I’m tempted to dive right in with comments on facets of these two great posts, I’d like first to focus on InfoChimps, a company with origins in the big data (esp. scientific dataset) community that is trying to make money by “incentivizing” the trafficking of datasets without overtly identifying itself as a “linked data provider.”

InfoChimps is interesting to me because its infrastructure was born from the essential question of how to persist and publish datasets that users have added value to. In his podcast interview with Paul Miller of Cloud of Data [3], InfoChimps co-founder Flip Kromer says that the original goal was to create a SourceForge-like service where users who modified datasets (corrected, extended, attributed, whatever) could easily share those datasets with others. InfoChimps soon went beyond this sharing model, enabling companies with valuable datasets, like the well-known polling and market research company Zogby International, to easily upload and license their datasets.

My understanding of InfoChimps is that their focus is on making the sharing, augmentation and monetization of datasets easy. In fact, when Paul asked Flip in the interview to address topics such as linked data-based publishing, this seemed a bit off-message; Flip focused instead on the simplicity and value they bring by giving users the ability to post and share their large “rectangular” datasets, i.e. in native Excel or CSV format. A key take-away from this exchange was that InfoChimps is not “leading” with technology, which I think is the right strategy (at least for now).

A few months ago in the previous life of this blog [1] I pondered the value of linked data and its providers in light of the economics of scale-free networks. My hypothesis was and is that, as with everything else that is networked, in the world of linked data the rich will get richer and value will be demonstrated by the extent to which a dataset (and a provider) links to datasets and is being linked to by other datasets. The more heavily-linked a dataset is, the more valuable it is, by definition. This means that a starting point for realizing the inherent value of a dataset is making it linkable, not merely shareable: applications and other datasets must be able to link to it, and it must leverage the linkability of other datasets.
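As a rough illustration of the "links in plus links out" idea above, here is a minimal sketch that counts how heavily each dataset in a small link graph is connected. The dataset names and the link list are made up for the example, and a raw link count is only a crude proxy for value; it says nothing about the quality of the links.

```python
from collections import Counter


def link_counts(links):
    """Return total links (outgoing + incoming) per dataset.

    `links` is a list of (source, target) pairs; the names are hypothetical.
    """
    outgoing = Counter(source for source, _ in links)
    incoming = Counter(target for _, target in links)
    datasets = set(outgoing) | set(incoming)
    return {name: outgoing[name] + incoming[name] for name in datasets}


# Hypothetical link graph: three datasets link to or from "hub".
links = [("a", "hub"), ("b", "hub"), ("hub", "c"), ("a", "b")]
counts = link_counts(links)

# List datasets from most to least heavily linked.
for name, total in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
    print(name, total)
```

In this toy graph the "hub" dataset comes out on top because it is both linked to and links out, which is the sense of "heavily-linked" used in the paragraph above.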

Datasets that are difficult to use have limited value. InfoChimps has addressed the question of ease of use in a very practical way by encouraging its depositors to upload their datasets in standard “rectangular” formats such as Excel or CSV. Readers versed in linked data might see these as an ancient approach, but at a time when the “Web of Data” gospel is still just starting to spread, this is actually quite smart: most data management systems (RDBMS, triple stores, graph databases) can both import and export CSV and Excel, even enormous datasets can be easily disseminated, and indeed many of the leading projects such as data.gov and data.gov.uk have applied linked data principles to expose data originally obtained in “ancient” formats including CSV. Furthermore, InfoChimps provides interfaces and mechanisms for the community to augment datasets hosted on their site, thus fostering a community-driven development of value.
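To make the step from a “rectangular” file to linkable data a little more concrete, here is a minimal sketch that turns CSV rows into subject/predicate/object triples. Everything specific in it is an assumption for illustration: the example.org base URI, the column names, the sample values, and the URI scheme. It is not how InfoChimps, data.gov or data.gov.uk actually do it, and a real mapping would also need URI escaping, datatypes and a proper vocabulary.

```python
import csv
import io


def rows_to_triples(csv_text, base_uri, key_column):
    """Map each CSV row to (subject, predicate, object) triples.

    The subject URI is built from the value in `key_column`; every other
    column becomes a predicate. The URI layout is a hypothetical convention.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row in reader:
        subject = f"{base_uri}/row/{row[key_column]}"
        for column, value in row.items():
            if column == key_column:
                continue  # the key is already encoded in the subject URI
            predicate = f"{base_uri}/column/{column}"
            triples.append((subject, predicate, value))
    return triples


# Made-up sample data standing in for an uploaded CSV file.
sample = "id,state,approval\n1,TX,41\n2,NY,52\n"
triples = rows_to_triples(sample, "http://example.org/poll", "id")

for triple in triples:
    print(triple)
```

Once each row has a stable URI, other datasets have something to link to, which is the difference between a dataset that is merely shareable and one that is linkable.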

The problem, I think, comes as we look forward to new modes of data consumption and application. The upload/license/download commercial data model, which dates back at least to the 1980s and probably much earlier, depends upon customers hosting the datasets themselves and does not seem to cater to the many agile, dynamic approaches that the linked data community has been thinking about. But I imagine this isn’t far off; it seems more a question of how to make automated RDF mapping of widely varying CSV datasets reliable, and how to provide individualized, secure interfaces for customers that properly reflect their license agreements. In fact, at the very end of his post Leigh Dodds says the following:

…From a technical perspective I’m interested to see how well protocols like OAuth and FOAF+SSL can be deployed to mediate access to licensed Linked Data…

Me too! But for now, I think I’ll address that in a follow-up post…

Notes:

[1] Thanks to Blogger having blocked me, I’m now a happy WordPress convert!

Responses

Thanks for the kind words, you’ve given me much food for thought.

The way I think of our approach to linked data is that rather than building the linked data graph up from the atomic level of individual facts, we’re building a web of knowledge down from the molecular level of full datasets.

As you point out, this is technologically simpler (like you’d expect from a bunch of chimpanzees) and presents a smaller impedance mismatch with current technology. It also avoids several subtle problems. Provenance becomes straightforward: compare the decision to trust (and cite) data from the National Climate Data Center with that drawn from numerous Wikipedia contributors via DBpedia via infochimps, then abstract the latter to a live cloud of evolving data. Versioning and forking, efficient computing, and license/TOS compliance become significantly easier as well.

I also want to say that though ‘rectangular’ data is most flexible, contributors should feel free to upload data in whatever shape and format they use, rectangular or graph or almost-structured. We have datasets containing one point and datasets with network graphs in some odd adjacency-list format. Well, of course, a .gml is odd to me, but not to the folks that made the dataset; I’d prefer a .tsv to process with hadoop, and you might like an .rdf for your graph browser. The thing we all agree on is that we’d rather have the data in an odd format than not have it at all 🙂

The more heavily-linked a dataset is, the more valuable it is, by definition.

I beg to disagree. To give a simple example, you have a bunch of links sitting on the right side of the blog. If you tripled the number of links, would the list be more valuable, or less? If you had 300 times the number of links, would that make the list more valuable? I think not.

Unfortunately it’s very cumbersome in linked data to assign a link strength, so only strong links are valuable.

Thanks for your comment, Eric! I agree with you that fitness of the linked nodes is critical; indeed, this is why I’ve preferred thinking about Barabasi’s preferential attachment network model (which considers evolution due to node fitness) over more simplistic models.

Several weeks ago I explored some of this in a rather lengthy post, but it was lost when my blog was trashed by Blogger. I will “have a think” and try to articulate a more precise statement of this idea of “value = links x fitness” soon!

Taking platform independence and access as given (i.e., inherent to HTTP-based Linked Data), it’s best to look at Linked Data value as a function of: Link Density (relatedness), Link Quality, and the Linked Data consumer’s Context Lenses.