March 29, 2011

LDOW2011 Workshop

The Linked Open Data Workshop (LDOW20XX) has become an integral part of the yearly WWW conferences, and this year was no exception, under the unsurprising name of LDOW2011. And, as always, it was an enjoyable, pleasant event. The organizers (Chris Bizer, Tom Heath, Michael Hausenblas, and Tim Berners-Lee) made the choice of accepting slightly fewer papers to leave room for more discussion. That was a good choice; the event was really a workshop rather than just a series of presentations, with lively discussions and lots of comments… and that was great.

It is very difficult to summarize a whole day, and I do not want to comment on each individual paper. The papers (and, I believe, soon the presentation slides) are on the Web, of course; it is worth glancing at each of them. For me, and this is obviously very personal, maybe the most important takeaway is actually close to the blog I wrote yesterday on the empirical study of SPARQL queries: the general fact that we have reached the point where the size and complexity of the linked open data cloud is such that we can begin to make meaningful measurements, experimental data analyses, and empirical studies to understand how the data is really used out there, what the shape and behavior of the beast are, and how these affect the tools and specifications we develop.

The workshop started with an overview by Chris (I hope his slides will be on the Web at some point) doing exactly that. He looked at the evolution of the LOD cloud and tried to analyze its content. There were some impressive figures: the growth in 2010, in terms of the number of triples, was 300%, with some spectacular application areas coming into the game, like a 955% growth of library-related data, or the appearance of governmental data, from nothing in 2009 to about 11B triples in November 2010. Although Danny Vrandecic remarked at the end of the workshop that we should stop measuring the LOD cloud in terms of the pure number of triples (and I can agree with that), those numbers are nice nevertheless. Some figures were less satisfactory: linking among datasets is relatively low (90 of the 200 datasets have only around 1000 links to the outside, and the majority interlink with only one other dataset), and only around 9% of the datasets publish machine-readable licenses (although 31% publish machine-readable provenance data, which is a bit nicer). Some common vocabularies are widely reused (31% use Dublin Core terms, for example), but far too many dataset publishers define their own vocabulary even when that is not strictly necessary, and only about 7% publish mapping relationships from their own vocabulary to others.

Beyond the numbers themselves, I believe the important point is that somebody collects and publishes these data regularly, so that we understand where we should put the emphasis in the future. For example (and this came up during the discussion), work should be done on simple (in my view rule-based, i.e., RIF or N3 based) mappings among vocabularies, and those mappings should be published for others to use; that figure of 7% is really too low. Work on helping data providers create additional links easily is another area of necessary improvement (and there were, in fact, several papers on that very topic during the day).
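To make the idea concrete, such a mapping can be as small as one N3 rule. The snippet below is just a sketch: the `ex:` namespace stands for a hypothetical publisher's home-grown vocabulary, and the rule says that anything described with the local title property should also be understood as carrying the standard Dublin Core term.

```n3
@prefix ex: <http://example.org/vocab#> .   # hypothetical publisher vocabulary
@prefix dc: <http://purl.org/dc/terms/> .   # Dublin Core terms

# Rule: whatever is stated with the local ex:title property
# also holds with the standard dc:title property.
{ ?doc ex:title ?t . } => { ?doc dc:title ?t . } .
```

Publishing a small file like this alongside the dataset would let consumers run an N3 rule engine (cwm-style) to materialize the standard terms, without forcing the publisher to abandon its own vocabulary.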

I do not know whether it was a coincidence or whether the organizers did it on purpose, but the day ended with a similar paper, this time on vocabularies. A group from DERI collected some specific datasets to see how a particular vocabulary (in this case the GoodRelations vocabulary) is being used on the Web of Data, what the usage patterns are, how it can serve specific possible use cases, etc. The issue here is not the GoodRelations ontology as such (you can see the details of the results in the paper) but rather the methodology: we are at the point where we can measure what we have, and we can therefore come up with empirical data that will help us concentrate on what is essential. I hope this approach will come to the fore more and more in the future. We need it.