Schema.Org: The Fire’s Been Lit

Why has schema.org made the following strides since its debut in 2011?

In a sample of over 12 billion web pages, 21 percent, or 2.5 billion pages, use it to mark up HTML pages, to the tune of more than 15 billion entities and more than 65 billion triples;

In that same sample, this works out to six entities and 26 facts per page with schema.org;

Just about every major site in every major category, from news to e-commerce (with the exception of Amazon.com), uses it;

Its ontology counts some 800 properties and 600 classes.
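The per-page averages in the list above follow directly from the totals. A quick check, treating the rounded figures quoted here ("over 12 billion," "more than 15 billion") as exact for the sake of the arithmetic:

```python
# Rounded figures quoted in the list above; the real totals are "over" these values.
pages_sampled = 12_000_000_000
pages_with_markup = 2_500_000_000
entities = 15_000_000_000
triples = 65_000_000_000

share_with_markup = pages_with_markup / pages_sampled   # about 0.208, i.e. roughly 21 percent
entities_per_page = entities / pages_with_markup        # 6.0
triples_per_page = triples / pages_with_markup          # 26.0

print(f"{share_with_markup:.1%} of pages, "
      f"{entities_per_page:.0f} entities and {triples_per_page:.0f} facts per page")
```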

A lot of it has to do with the focus its proponents have had since the beginning on making it very easy for webmasters and developers to adopt and leverage the collection of shared vocabularies for page markup. At this August’s 10th annual Semantic Technology & Business conference in San Jose, Google Fellow Ramanathan V. Guha, one of the founders of schema.org, shared the progress of the initiative to develop one vocabulary that would be understood by all search engines and how it got to where it is today.

That includes the fact that conference attendees no longer criticize it as an effort by Google, Microsoft Bing, Yahoo! and Yandex to take over the web. Guha joked that “in past years, the analog of this room [in which he was speaking] at SemTech was extremely contentious, with people saying you are evil, trying to take over the web.” To that end, he pointed out that it is very important to understand that schema.org “is a vocabulary, it is not the vocabulary” – that it can’t itself cover every area, that webmasters can use it together with other vocabularies, that “we would like others to create [other] vocabularies,” and that it is working with others to incorporate their vocabularies and give them credit on schema.org.

Looking back on how far schema.org has come, Guha told the audience that he could not have dreamt that it would become part of the workflow for so many sites. One of schema.org’s core principles, simplicity, deserves much of the credit, he explained, noting that schema.org aligns itself with making life easier for webmasters.

For instance, it has kept them from having to deal with N namespaces by specifying vocabulary credit on the schema.org site; in the design of microdata for content markup, it conducted usability tests to see which alternate syntaxes caused the fewest problems for webmasters; and it does not expect webmasters to understand Knowledge Representation or Semantic Web query languages.

Fitting into their workflows has been key – extending even to a point of departure from Linked Data principles around the use of URIs, he said. As an example, he showed a graph of data related to Chuck Norris – type: Actor; birthday: March 10th, 1940; birthplace: Ryan, Oklahoma. A given site like IMDB or Rotten Tomatoes would use a few tens of the thousands of terms like actor and birthdate, he said, “and we can basically say, everybody who says actor needs to use this term.” At the same time, it’s working on making another few tens of thousands of terms, like USA, feasible for use by these sites.

But there are 1 billion to 100 billion terms, like Chuck Norris and Ryan, Oklahoma, “and we can’t expect Rotten Tomatoes and IMDB to figure out a way in their workflow to coordinate the ID for these things,” he said. “If we tell them thou shalt use this URI or someone else gives them a URI to reference to Chuck Norris, it’s just not going to happen.”

That’s where the reference by description notion steps in. Instead of using the ID, sites can communicate that they have an entity that is of type actor, whose first name is Chuck and last is Norris, and whose birthdate is March 10, 1940 – information they already have – and schema.org can do the work of resolving these entities. “It’s a huge amount of complexity on our part, but it makes their lives simple,” he said. A goal – and a hard technical problem to completely solve – is to stitch together a common graph of the various subgraphs around an entity for webmasters, such as those shown below, and potentially even make such common graphs public.

It’s a great example, he said, “of where we say that integrating this kind of work methodology in your workflow is close to impossible, but we have the technical resources to be able to do that.”
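To make reference by description concrete, here is a minimal sketch of two sites describing the same person without sharing an identifier, and a consumer stitching the two subgraphs together. The site data, the `description_key` and `stitch` helpers, and the exact-match rule are all hypothetical and far simpler than anything a search engine would run. The sketch uses the schema.org `Person` type with the `givenName`, `familyName`, `birthDate`, `birthPlace` and `jobTitle` properties in place of the "actor" type from Guha's example.

```python
from collections import defaultdict

# Hypothetical markup from two sites. Neither carries a shared URI for the entity.
site_a = {
    "@type": "Person",
    "givenName": "Chuck",
    "familyName": "Norris",
    "birthDate": "1940-03-10",
    "birthPlace": "Ryan, Oklahoma",
}
site_b = {
    "@type": "Person",
    "givenName": "chuck",
    "familyName": "norris",
    "birthDate": "1940-03-10",
    "jobTitle": "Actor",
}

def description_key(entity):
    """Build a matching key from facts the site already has (an assumed rule)."""
    return (
        entity.get("@type"),
        entity.get("givenName", "").lower(),
        entity.get("familyName", "").lower(),
        entity.get("birthDate"),
    )

def stitch(subgraphs):
    """Merge subgraphs whose descriptions match into one record per entity."""
    merged = defaultdict(dict)
    for entity in subgraphs:
        record = merged[description_key(entity)]
        for prop, value in entity.items():
            record.setdefault(prop, value)  # keep the first value seen for each property
    return list(merged.values())

common_graph = stitch([site_a, site_b])
print(common_graph)
```

Exact matching on name and birth date is only a stand-in: the hard part Guha describes is resolving entities whose descriptions are incomplete, inconsistent or ambiguous.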

Structured Data Apps Of the Future

Having schema.org doing well and getting structured data adopted by webmasters is just the start, though. Guha also said that the future holds more use of structured data through schema.org markup, going beyond the first generation of apps, which focused on rich presentation of search results and classical search. Google’s own work, he noted, includes the Knowledge Graph: today, for instance, relatively static structured data about a marked-up entity, like a music group, can be combined with more dynamic information streaming in via pages that use schema.org, like upcoming events related to that group.

“These kinds of things, the pipelines, process and workflows are actually quite staggering,” he said, as new information has to be grabbed as it comes online, reification and cross identification of entities has to take place, and you have “to put it all together and shove it out and that just happens at scale.”

He also referenced how user profiles and structured data feeds come together in Google Now to alert users about something that Google knows will be of interest to them, and how external sites like OpenTable send reservation confirmation emails containing schema.org markup that can be picked up by clients like Gmail, so that users can get an alert on their smartphone when it’s time to leave for dinner. “We did these things in the lab 30 years ago but doing this at scale is a completely different beast,” he said.
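As an illustration of the reservation scenario, the sketch below builds a schema.org `FoodEstablishmentReservation` description as JSON-LD, wraps it in the `<script type="application/ld+json">` element commonly used to embed such markup in HTML, and derives a "time to leave" alert from it. The restaurant, guest, reservation number, the 30-minute travel time and both helper functions are invented for illustration; this is not how OpenTable or Gmail actually implement the feature.

```python
import json
from datetime import datetime, timedelta

# Hypothetical reservation, described with schema.org terms.
reservation = {
    "@context": "http://schema.org",
    "@type": "FoodEstablishmentReservation",
    "reservationNumber": "ABC123",
    "underName": {"@type": "Person", "name": "Jane Doe"},
    "reservationFor": {"@type": "FoodEstablishment", "name": "Example Bistro"},
    "startTime": "2014-09-12T19:30:00-07:00",
    "partySize": 2,
}

def embed_in_email(markup):
    """Wrap the markup in a JSON-LD script element for an HTML email body."""
    body = json.dumps(markup, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'

def leave_alert_time(markup, travel_minutes):
    """Return when a client might alert the user, given an assumed travel time."""
    start = datetime.fromisoformat(markup["startTime"])
    return start - timedelta(minutes=travel_minutes)

email_snippet = embed_in_email(reservation)
alert_at = leave_alert_time(reservation, travel_minutes=30)
print(alert_at.isoformat())
```

The point of the markup is that the mail client never has to parse the confirmation's prose: the time, party size and venue arrive as explicit properties.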

Another focus is Google enabling developers to create their own custom vertical search engines that restrict search, in a web-scale way, based on structured data used in pages all over the web. Google does the heavy lifting, he said, in the way of crawling and indexing, while developers specify the schema.org restrictions to employ and can leverage APIs to build their own UI. In the past, he said, “it would take you an enormous amount of effort to build something from scratch, but with this capability, you can now do it in ten minutes.”
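The idea of restricting search by schema.org type can be sketched on a toy scale. In the example below, the in-memory index, the URLs and the `vertical_search` function are all hypothetical stand-ins for the crawling and indexing Google would handle; the sketch does not use or imitate any real Google API.

```python
# Toy index: each crawled page records the schema.org types found in its markup.
index = [
    {"url": "https://example.com/recipes/apple-pie", "types": {"Recipe"}, "text": "Classic apple pie"},
    {"url": "https://example.com/news/apple-event", "types": {"NewsArticle"}, "text": "Apple event recap"},
    {"url": "https://example.org/recipes/apple-crumble", "types": {"Recipe", "VideoObject"}, "text": "Apple crumble video"},
    {"url": "https://example.org/about", "types": set(), "text": "About our apple orchard"},
]

def vertical_search(pages, query, required_type):
    """Return URLs of pages that match the query and carry the required type."""
    needle = query.lower()
    return [
        page["url"]
        for page in pages
        if required_type in page["types"] and needle in page["text"].lower()
    ]

# A recipe-only search engine is then a single type restriction.
recipe_hits = vertical_search(index, "apple", "Recipe")
print(recipe_hits)
```

The developer supplies only the restriction and the presentation; everything about finding and indexing the pages is assumed to be handled elsewhere.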

It’s also his dream, he said, that in three to five years we’ll be at the stage where users looking at any piece of government-funded scientific research will be able to safely assume that the data behind it will be available in a machine-readable form that can easily be consumed. This will open the door to quickly and easily combining data from multiple studies to drive even better insights. Ideally, this would be the next stage of efforts like one begun by the National Institutes of Health, which requires that the data behind any effort using federal funds be available.

We’re not there yet, but Guha said he believes that we’re still in the early stages of structured data on the web. “The most interesting things will start happening when we find those applications” that are strong fits for leveraging structured data on the web. “Our work,” he said, “is not done.”

About the author

Jennifer Zaino is a New York-based freelance writer specializing in business and technology journalism. She has been an executive editor at leading technology publications, including InformationWeek, where she spearheaded an award-winning news section, and Network Computing, where she helped develop online content strategies including review exclusives and analyst reports. Her freelance credentials include being a regular contributor of original content to The Semantic Web Blog; acting as a contributing writer to RFID Journal; and serving as executive editor at the Smart Architect Smart Enterprise Exchange group. Her work also has appeared in publications and on web sites including EdTech (K-12 and Higher Ed), Ingram Micro Channel Advisor, The CMO Site, and Federal Computer Week.