Archive for June, 2011|Monthly archive page

Bing, Google, and Yahoo! announced schema.org yesterday, a collaboration between the three search providers on vocabularies for structured data. As the ‘schema guy’ at Yahoo!, I have been part of the very small core team that developed the technical content for schema.org. It’s been an interesting process: if you doubt that reaching an agreement in the search domain is hard, consider that the last time such an agreement happened was apparently sitemaps.org in 2006.

However, over the years, the lack of agreement on schemas has become such a major pain point for publishers new to the Semantic Web that cooperation eventually became the only sensible thing to do for the future of the Semantic Web project. Consider that until yesterday any publisher that wanted to provide structured data for Bing, Google, and Yahoo! needed to navigate three sets of documentation and, worse, choose between three different schemas and multiple formats (microdata, RDFa, microformats) and mark up their pages using all of them.

So how did we get here? In my personal view, one of the key problems of the W3C’s Semantic Web design has been that it considered only technical issues, and not the need for a social process that would bootstrap the system with data and schemas. We are now doing better on the data front thanks to large community efforts such as Linked Data. With regard to schemas (we used to call them ontologies, until we found it scared people away), the expectation was that they would be developed in a distributed manner, and that either machines would do the hard job of schema matching or agreements would somehow emerge. However, schema matching is a hard problem to automate, and agreements were slow to come because there was no venue for schema development and discussion.

We have tried a number of things in this respect. For example, some of you might know that I’ve been one of the instigators of the VoCamp movement, which peaked around 2009. The W3C itself accepted some RDF-based schemas as member submissions, but it didn’t see itself as the organization that should deal with schemas, and there has been no process for dealing with these submissions either. (As an example, we learned when we started with SearchMonkey that there were actually two versions of VCard in RDF, submitted by two different members of the W3C. This problem has since been resolved.) Other schemas just appeared on websites that were later abandoned by their owners. Finding stable and mature schemas with sufficient adoption eventually became a major pain point.

In the search domain, the situation improved somewhat when search providers preselected some schemas for publishers to use, and started providing specific documentation, with examples and a way to validate webpages. However, as illustrated above, these efforts remained too fragmented until yesterday.

Given the above history, I’m extremely glad that cooperation prevailed in the end and hopefully schema.org will become a central point for vocabularies for the Semantic Web for a long time to come. Note that it will almost certainly not be the only one. schema.org covers the core interests of search providers, i.e. the stuff that people search for the most (hence the somewhat awkward term ‘search vocabularies’). As the simple needs are the most common in search logs, this includes things like addresses of businesses, reviews and recipes. schema.org will hopefully evolve with extensions over time but it may never cover complex domains such as biotechnology, e-government or others where people have been using Semantic Web technology with success. Nor do I think that schema.org is ‘perfect’. Personally, I would have liked to see RDFa used as the syntax for the basic examples, because I consider it more mature, and a superior standard to microdata in many ways. You will notice that RDF(a) in particular would have offered a standard way to extend schema.org schemas and map them to other schemas on the Web. Currently, there is an example of using the schemas in RDFa, but the support for this version of the markup will depend on its adoption.
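To make the syntax question a little more concrete, here is a rough sketch of how the same schema.org item might be written out in microdata and in RDFa. The helper names (`to_microdata`, `to_rdfa`) and the sample business are made up for illustration, and the RDFa variant assumes the `vocab` attribute from the RDFa 1.1 drafts; it is not taken from the schema.org documentation, so treat it as an approximation rather than the markup the search engines actually expect.

```python
from html import escape

# Base URL of the shared vocabulary.
SCHEMA_ORG = "http://schema.org/"


def to_microdata(item_type, props):
    """Render flat string properties as an HTML microdata snippet (hypothetical helper)."""
    lines = [f'<div itemscope itemtype="{SCHEMA_ORG}{item_type}">']
    for name, value in props.items():
        lines.append(f'  <span itemprop="{escape(name)}">{escape(value)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


def to_rdfa(item_type, props):
    """Render the same properties with RDFa attributes (assumes RDFa 1.1-style `vocab`)."""
    lines = [f'<div vocab="{SCHEMA_ORG}" typeof="{item_type}">']
    for name, value in props.items():
        lines.append(f'  <span property="{escape(name)}">{escape(value)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


# Invented sample data: a local business, one of the simple cases search cares about.
business = {"name": "Corner Bakery & Cafe", "telephone": "555-0100"}

print(to_microdata("LocalBusiness", business))
print(to_rdfa("LocalBusiness", business))
```

The two outputs differ only in attribute names, which is part of why publishers found it frustrating to maintain several of them side by side. The sketch handles flat string properties only; nested items such as a postal address would need more work in either syntax.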

Please take a look at schema.org, and if you have comments please consider using the schema.org feedback mechanisms (we have a feedback form as well as a discussion group).