Where Schema.org Is At: A Chat With Google’s R.V. Guha

Interested in how schema.org has trended in the last couple of years since its birth? If you were at The International Semantic Web Conference event in Sydney a couple of weeks back, you may have caught Google Fellow Ramanathan V. Guha — the mind behind schema.org — present a keynote address about the initiative.

Of course, Australia’s a long way to go for a lot of people, so The Semantic Web Blog is happy to catch everyone up on Guha’s thoughts on the topic.

We caught up with him when he was back stateside:

The Semantic Web Blog: Tell us a little bit about the main focus of your keynote.

Guha: The basic discussion was a progress report on schema.org – its history and why it came about a couple of years ago. Other than a couple of panels at SemTech we’ve maintained a rather low profile and figured it might be a good time to talk more about it, and to a crowd that is different from the SemTech crowd.

The short version is that the goal, of course, is to make it easier for mainstream webmasters to add structured data markup to web pages, so that they wouldn’t have to track down many different vocabularies, or think about what Yahoo or Microsoft or Google understands. Before, webmasters had to champion internally which vocabularies to use and how to mark up a site, but we have reduced that, and now it’s also not an issue of which search engine to cater to.

It’s now a little over two years since launch and we are seeing adoption way beyond what we expected. In aggregate, the search engines see that about 15 percent of the pages we crawl have schema.org markup. This is the first time we see markup approximately on the order of the scale of the web… Now over 5 million sites are using it. That’s helped by the mainstream platforms like Drupal and WordPress adopting it so that it becomes part of the regular workflow.

Those sites are mostly in English, but the entire schema.org has been translated into Chinese and also there is a fair amount of Russian adoption.

The Semantic Web Blog: How is that uptake changing the web?

Guha: It’s changing the web because a huge number of applications that were thought of as hard to do are now much easier. And it’s changing not just the web but also workflows.

Take making a reservation with OpenTable today. When you make one you get an email confirmation from them and that email has schema.org markup telling what the reservation is for, the address and so on, so that some app, like Google Now sitting on a smartphone, can pick that up and tell you that it’s time you left [to make your] reservation, or put it in your calendar or something. That is fascinating. A huge number of intelligent apps that were very hard to do without structured data now can become available on a routine basis.
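
As a rough illustration of the reservation scenario Guha describes, here is a minimal sketch in Python. It assumes the confirmation email embeds its schema.org markup as JSON-LD (the interview does not say which syntax is used), and the email body, restaurant name, address, and the `plan_departure` helper are all hypothetical. The type and property names (`FoodEstablishmentReservation`, `reservationFor`, `startTime`, `partySize`, `PostalAddress`) come from the schema.org vocabulary.

```python
import json
from datetime import datetime, timedelta
from html.parser import HTMLParser

# Hypothetical confirmation email body; every value here is invented.
EMAIL_HTML = """
<html><body>
<p>Your table is confirmed.</p>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "FoodEstablishmentReservation",
  "startTime": "2013-11-15T19:30:00-08:00",
  "partySize": 2,
  "reservationFor": {
    "@type": "FoodEstablishment",
    "name": "Example Bistro",
    "address": {
      "@type": "PostalAddress",
      "streetAddress": "123 Example St",
      "addressLocality": "Mountain View"
    }
  }
}
</script>
</body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every JSON-LD script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def plan_departure(reservation, travel_minutes):
    """Hypothetical helper: work out when to leave for a reservation."""
    start = datetime.fromisoformat(reservation["startTime"])
    place = reservation["reservationFor"]
    address = place["address"]
    return {
        "where": place["name"],
        "address": f'{address["streetAddress"]}, {address["addressLocality"]}',
        "leave_by": (start - timedelta(minutes=travel_minutes)).isoformat(),
    }


extractor = JsonLdExtractor()
extractor.feed(EMAIL_HTML)
reservations = [
    item for item in extractor.items
    if item.get("@type") == "FoodEstablishmentReservation"
]
reminder = plan_departure(reservations[0], travel_minutes=30)
print(reminder)
```

Because the time, place, and address arrive as named fields rather than free text, an assistant-style app only needs a lookup and a subtraction to produce a "time to leave" reminder or a calendar entry.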

Another thing that is happening is that a lot of people in the academic community are trying to figure out how to use data for academic work and want to contribute to that. At schema.org we do very little of the vocabulary by ourselves. We find people who are domain experts, the groups that care deeply about something. Many times they come to us. A lot is happening for accessibility-related work now, for example, and for product-related markup we worked with Martin Hepp and the GoodRelations people.

And there is interest in collaborating on newer things. In particular it’s very interesting that there are lots of CSV files on the web, and academic researchers are interested in working on various issues in specifying the semantics of these files. That gets into more nuanced issues in database integration and things like that. At the ISWC conference there were at least a half dozen posters about this topic, and there is interest in working with us. Schema.org can become a path for their research to be adopted and used in the industry.

The Semantic Web Blog: What impact does the take-up of schema.org have on Google’s Hummingbird search engine algorithm?

Guha: [After noting Google’s general policy about not providing specific details around its search algorithm]: We can say that as time goes on and as more and more signals become available, one thing we do is figure out ways of incorporating these signals into the ranking algorithm, and signals from things like schema.org are of course useful.

But the other thing to remember is that it’s not just search engines now that are using schema.org. Pinterest uses schema.org for rich pins, for instance. And for NRD.gov, [which powers the Veterans Job Bank], most companies that have job openings now mark them up with schema.org to specify whether they are veteran-friendly; NRD does text plus structured data search at web scale [across job boards, social media networks and corporate sites] based on schema.org to search for [those jobs to post]. This kind of stuff is super-exciting.
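
To make the "text plus structured data search" idea concrete, here is a small sketch in Python. It is not a description of how NRD actually works. The postings are invented, the `search` and `commitments` helpers are hypothetical, and treating `specialCommitments` as a comma-separated string is an assumption. The `JobPosting` type and its `specialCommitments` property (with values such as `VeteranCommit`) are part of the schema.org vocabulary.

```python
# Hypothetical job postings, as they might look after being extracted
# from schema.org JobPosting markup on employer sites.
POSTINGS = [
    {
        "@type": "JobPosting",
        "title": "Network Engineer",
        "description": "Maintain data center networks.",
        "hiringOrganization": {"@type": "Organization", "name": "Example Corp"},
        "specialCommitments": "VeteranCommit",
    },
    {
        "@type": "JobPosting",
        "title": "Software Engineer",
        "description": "Build internal tools.",
        "hiringOrganization": {"@type": "Organization", "name": "Sample LLC"},
    },
    {
        "@type": "JobPosting",
        "title": "Logistics Coordinator",
        "description": "Plan regional shipments.",
        "hiringOrganization": {"@type": "Organization", "name": "Example Corp"},
        "specialCommitments": "VeteranCommit, MilitarySpouseCommit",
    },
]


def commitments(posting):
    """Return the set of commitment labels on a posting.

    Assumes a comma-separated string; real markup may vary.
    """
    raw = posting.get("specialCommitments", "")
    return {part.strip() for part in raw.split(",") if part.strip()}


def search(postings, keyword, commitment="VeteranCommit"):
    """Combine a free-text match with a structured-data filter."""
    keyword = keyword.lower()
    hits = []
    for posting in postings:
        text = f"{posting.get('title', '')} {posting.get('description', '')}".lower()
        if keyword in text and commitment in commitments(posting):
            hits.append(posting["title"])
    return hits


results = search(POSTINGS, "engineer")
print(results)
```

The keyword match alone would return both engineering jobs; the structured field is what narrows the result to postings that declare a veteran commitment.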

Another interesting thing in Google Now is that if we detect you are in the market for buying some real estate, we take feeds from sites like Zillow and when there is a new house we think might match, or an open house, we show you a card. That’s not search at all, that’s being proactive based on your search history and understanding what to show you.

So it’s about structured data on the web, not just about search. And having that lets you learn things like a certain musician you like happens to be performing in your city. I had an experience like that. I like Joan Baez but she doesn’t give concerts anymore so it doesn’t make sense for me to search for that. But it happens she did give a charity concert at a little school a couple of miles from where I live. And based on my search history and Ticketmaster mark-up with schema.org, I found out about that. And I never would have thought to search for her performing. For something like that to happen you need high-quality structured data.

The Semantic Web Blog: Where do challenges still lie for schema.org?

Guha: We have to get to the next level, to represent time, which is always a challenge in plain old RDF. And we are working with the W3C folks on trying to come up with ways to represent time.

Then, of course, when you are dealing with millions of sites it is probable that not everyone gets their markup absolutely right. It is a lot less challenging to do markup with schema.org – we actually did a bunch of usability studies with webmasters to see what they are more or less likely to make errors with. Still, when you are dealing in the millions of sites you do see stray, random things. Some of it may be the result of markup now being part of someone’s job, so their approach or attitude about doing it isn’t the same as that of the early-adopter enthusiasts. So you need to provide the extra help for them to brave any hardships to make it work.

The Semantic Web Blog: Any thoughts on concerns webmasters may have that their markup isn’t leading to the rich snippets treatment they might have expected?

Guha: For starters schema.org is more than just Google – it’s Google, Yandex, Microsoft and others who use the markup. That said, Google’s main focus is to make sure its users are getting more value. We keep iterating through lots of different user interfaces and experiments until we can get something we are confident provides more user value. So often there is a lag in that between the time a webmaster puts up their vocabulary on their pages and it shows up in Google search. That’s what the issue is.

And there are other vocabularies where communities have come to us and asked, because it doesn’t make sense to be independent, to be in schema.org, and have the bigger organizations behind it. There’s a certain process we go through where it needs to be aligned with the rest of the vocabulary and show that major web sites in that area do indeed support it. The Veterans Job Search engine, for instance, came out of discussions with Aneesh Chopra at the Office of the CTO at The White House. They wanted it and we worked with them and a bunch of sites like LinkedIn, Indeed and Simply Hired that said yes [about using it], which led us to doing the markup. Now Google.com doesn’t do anything with it but that doesn’t mean it’s not useful.

The Semantic Web Blog: So, how would you sum up how well schema.org has met your own expectations?

Guha: Everyone was a little suspicious when this started and no one was confident that webmasters would do it. So many attempts had been made to get webmasters to mark up their data. I would have been happy with 100,000 sites doing it but with more than 5 million, I think I am more than happy.

About the author

Jennifer Zaino is a New York-based freelance writer specializing in business and technology journalism. She has been an executive editor at leading technology publications, including InformationWeek, where she spearheaded an award-winning news section, and Network Computing, where she helped develop online content strategies including review exclusives and analyst reports. Her freelance credentials include being a regular contributor of original content to The Semantic Web Blog; acting as a contributing writer to RFID Journal; and serving as executive editor at the Smart Architect Smart Enterprise Exchange group. Her work also has appeared in publications and on web sites including EdTech (K-12 and Higher Ed), Ingram Micro Channel Advisor, The CMO Site, and Federal Computer Week.