I Am an Entity: Hacking the Knowledge Graph

This post was promoted from YouMoz. The author’s views are entirely his or her own (excluding an unlikely case of hypnosis) and may not reflect the views of Moz.

For a long time Google has algorithmically led users towards web pages based on search strings, yet over the past few years, we've seen many changes which are leading to a more data-driven model of semantic search.

In 2010 Google hit a milestone with its acquisition of Metaweb and its semantic database, now known as Freebase. This database helps to make up the Knowledge Graph: an archive of over 570 million of the most searched-for people, places and things (entities), including around 18 billion cross-references. It is a truly impressive demonstration of what a semantic search engine with structured data can bring to the everyday user.

What has changed?

The surge of Knowledge Graph entries picked up by Dr Pete a few weeks ago indicates a huge change in the algorithm. For some time, Google has been attempting to establish a deep associative context around entities so that it can understand the query rather than just regurgitate what it believes is the closest result, but this has been focused on a very tight dataset reserved for high-profile people, places and things.

It seems that has changed.

Over the past few weeks, while looking into how the Knowledge Graph pulls data from certain sources, I have made a few general observations and have been tracking what impact, if any, certain practices have on the display of information panels.

If I'm being brutally honest, this experiment was to scratch a personal "itch." I was interested in the constructs of the Knowledge Graph over anything else, which is why I was so surprised that a few weeks ago I began to see this:

It seems that anyone wishing to find out "Andrew Isidoro's Age" could now be greeted with not only my age but also my date of birth in an information panel. After a few well-planned boasts to my girlfriend about my newfound fame (all of which were dismissed as "slightly sad and geeky"), I began to probe further and found that this was by no means the only piece of information that Google could supply users about me.

Many of you may now be a little scared about your own personal privacy, but I have a confession to make. Though I am by no means a celebrity, I do have a Freebase profile. The information that I have inputted into this is now available for all to see as a part of Google's search product.

I've already written about the implications of privacy so I'll gloss over the ethics for a moment and get right into the mechanics.

How are entities born?

Disclaimer: I'm a long-time user of and contributor to Freebase. I've written about its potential uses in search many times, and the below represents my opinion based on externally visible interactions with Freebase and other Google products.

After taking some time to study the subject, there seems to be a structure around how entities are initiated within the Knowledge Graph:

Affinity

As anyone who works with external data will tell you, one of the most challenging tasks is identifying the levels of trust within a data set. Google is no different here; to be able to offer a definitive answer to a query, it must be confident of the data's reliability.

After a few experiments with Freebase data, it seems clear that Google are pretty damn sure the string "Andrew Isidoro" is me. There are a few potential reasons for this:

"Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness."

In summary, provenance is the 'who'. It's about finding the original author, editor and maintainer of data; and through that information Google can begin to make judgements about their data's credibility.

Google has been very smart with its structuring of Freebase user accounts. To log in to your account you are asked to sign in via Google, which of course gives the search giant access to your personal details and may offer a source of data provenance from a user's Google+ profile.

Freebase Topic pages also allow us to link a Freebase user profile through the "Users Who Say They Are This Person" property. This begins to add provenance to the inputted data and, depending on the source, could add further trust.

External structured data

Recently an area of tremendous growth in material for SEOs has been structured data. Understanding the schema.org vocabulary has become a big part of our roles within search but there is still much that isn't being experimented with.

Once Google crawls web pages with structured markup, it can easily extract and understand structured data based on the markup tags and add it to the Knowledge Graph.

No property has been more overlooked in the last few months than the sameAs relationship. Google has long used two-way verification to authenticate web properties, and even explicitly recommends using sameAs with Freebase within its documentation; so why wouldn't I try and link my personal webpage (complete with person and location markup) to my Freebase profile? I used a simple itemprop to exhibit the relationship on my personal blog:
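As a hedged illustration of the idea (not the exact markup from my blog), the Python sketch below renders a schema.org Person in HTML microdata with a sameAs reference. The function name person_microdata and both URLs are placeholders invented for this example; swap in your own page and profile addresses.

```python
from html import escape


def person_microdata(name: str, homepage: str, same_as: str) -> str:
    """Render a schema.org Person as HTML microdata with a sameAs reference.

    Hypothetical helper for illustration; not taken from the original post.
    """
    return (
        '<div itemscope itemtype="http://schema.org/Person">\n'
        f'  <a itemprop="url" href="{escape(homepage, quote=True)}">'
        f'<span itemprop="name">{escape(name)}</span></a>\n'
        # <link> carries the property without any visible anchor text
        f'  <link itemprop="sameAs" href="{escape(same_as, quote=True)}">\n'
        '</div>'
    )


snippet = person_microdata(
    "Andrew Isidoro",
    "https://example.com/",                # placeholder homepage URL
    "https://example.com/freebase-topic",  # placeholder for a Freebase topic URL
)
print(snippet)
```

Using a link element for sameAs means the property does not have to wrap visible text in an anchor, which also avoids nesting one link inside another.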

Finally, my name is by no means common; according to howmanyofme.com there are just 2 people in the U.S. named Andrew Isidoro. What's more, I am the only person with my name in the Freebase database, which massively reduces the amount of noise when looking for an entity related to a query for my name.

Data sources

Over the past few months, I have written many times about the Knowledge Graph and have had conversations with some fantastic people around how Google decides which queries to show information panels for.

Google uses a number of data sources and it seems that each panel template requires a number of separate data sources to initiate. However, I believe that it is less an information retrieval exercise and more of a verification of data.

Take my age panel example; this information is in the Freebase database yet in order to have the necessary trust in the result, Google must verify it against a secondary source. In their patent for the Knowledge Graph, they constantly make reference to multiple sources of panel data:

"Content including at least one content item obtained from a first resource and at least one second content item obtained from a second resource different than the first resource"
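To make the verification idea concrete, here is a minimal Python sketch of multi-source corroboration. It is purely an assumption about how such a check could work, not a description of Google's system: the function corroborated_facts, the source names and the sample values are all hypothetical.

```python
from collections import defaultdict


def corroborated_facts(claims, min_sources=2):
    """Keep only (property, value) pairs asserted by enough distinct sources.

    claims: iterable of (source, property, value) tuples.
    Simplification: if two conflicting values both reach the minimum,
    the one seen last wins.
    """
    support = defaultdict(set)
    for source, prop, value in claims:
        support[(prop, value)].add(source)
    return {
        prop: value
        for (prop, value), sources in support.items()
        if len(sources) >= min_sources
    }


# Hypothetical claims about one person, gathered from three sources
claims = [
    ("freebase", "date_of_birth", "1980-01-01"),
    ("personal_site", "date_of_birth", "1980-01-01"),
    ("freebase", "profession", "SEO"),
    ("forum_post", "date_of_birth", "1979-12-31"),
]
verified = corroborated_facts(claims)
print(verified)
```

In this toy data only the date of birth is repeated by a second source, so it is the only fact that would be trusted enough to display.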

These resources could include any entity provided to Google's crawlers as structured data, including code marked up with microformats, microdata or RDFa; all of which, when used to their full potential, are particularly good at making relationships between themselves and other resources.
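For a feel of how little work it takes to read such markup, the sketch below pulls itemprop values out of a microdata fragment using only Python's standard html.parser module. It is a simplified assumption of what a crawler might do: the class name ItempropCollector is invented here, the sample person is fictional, and nested item scopes are ignored.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect (itemprop, value) pairs from microdata; nested scopes ignored."""

    def __init__(self):
        super().__init__()
        self.props = []
        self._pending = None  # itemprop still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag in ("a", "link") and "href" in attrs:
            self.props.append((prop, attrs["href"]))     # value is the URL
        elif tag == "meta" and "content" in attrs:
            self.props.append((prop, attrs["content"]))  # value is the attribute
        else:
            self._pending = prop                         # value is the text

    def handle_data(self, data):
        if self._pending and data.strip():
            self.props.append((self._pending, data.strip()))
            self._pending = None


# Fictional person and placeholder URL, for illustration only
html_doc = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Example</span>
  <span itemprop="jobTitle">Search Analyst</span>
  <link itemprop="sameAs" href="https://example.com/profiles/jane">
</div>
"""
collector = ItempropCollector()
collector.feed(html_doc)
print(collector.props)
```

The sameAs pair is the interesting one: it is an explicit pointer from the page to another resource describing the same thing.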

The Knowledge Graph panels access several databases dynamically to identify content items, and it is important to understand that I have only been looking at initiating the Knowledge Graph for a person, not for any other type of panel template. As always, correlation ≠ causation; however it does seem that Freebase is a major player in a number of trusted sources that Google uses to form Knowledge Graph panels.

Search behaviour

As for influencing what might appear in a knowledge panel, there are a lot of different potential sources that information might come from that go beyond just what we might think of when we think of knowledge bases.

Bill Slawski has written on what may affect data within panels; most notably that Google query and click logs are likely being used to see what people are interested in when they perform searches related to an entity. Google search results might also be used to unveil aspects and attributes that might be related to an entity as well.

For example, search for "David Beckham", and scan through the titles and descriptions for the top 100 search results, and you may see certain terms and phrases appearing frequently. It's probably not a coincidence that his salary is shown within the Knowledge Graph panel when "David Beckham Net Worth" is the top auto suggest result for his name.
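The "scan the titles" exercise can be roughed out in a few lines of Python. This is only a sketch of the counting idea: the function frequent_terms, the stopword list and the sample titles are all made up for illustration and are not real search results.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard one)
STOPWORDS = {"and", "the", "of", "is", "a", "in", "for"}


def frequent_terms(texts, entity, top_n=3):
    """Count how many texts mention each term, excluding the entity's name."""
    entity_words = set(entity.lower().split())
    counts = Counter()
    for text in texts:
        # a set, so each term is counted once per title
        words = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(words - entity_words - STOPWORDS)
    return counts.most_common(top_n)


# Invented titles standing in for search results
titles = [
    "David Beckham net worth and salary",
    "David Beckham net worth 2013",
    "David Beckham biography and career",
    "David Beckham net worth explained",
    "David Beckham salary details",
]
top_terms = frequent_terms(titles, "David Beckham", top_n=2)
print(top_terms)
```

With these invented titles, "net" and "worth" each appear in three results, which is the kind of signal that could hint at which attribute people care about.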

Why now?

Dr Pete wrote a fantastic post a few weeks ago on "The Day the Knowledge Graph Exploded" which highlights what I am beginning to believe was a major turning point in the way Google displays data within panels.

However, where Dr Pete's "gut feeling is that Google has bumped up the volume on the Knowledge Graph, letting KG entries appear more frequently," I believe that there was a change in the way they determine the quality of their data: a reduction in the affinity threshold needed to display information.

For example, not only did we see an increase in the number of panels displayed but we began to see a few errors in the data:

This error can be traced back to a rogue Freebase entry added in December 2012 (almost a year ago) that sat unnoticed until this "update" put it into the public domain. This suggests that some sort of editorial control was relaxed to allow this information to show, and that Freebase can be used as a single source of data.
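The threshold hypothesis can be pictured with a toy Python example. Everything here is assumed for illustration: the function panels_to_show, the confidence scores and the threshold values are hypothetical and say nothing about the numbers Google actually uses.

```python
def panels_to_show(candidates, threshold):
    """Return the entities whose confidence score meets the threshold."""
    return [name for name, score in candidates if score >= threshold]


# Hypothetical (entity, confidence) pairs
candidates = [
    ("well-known celebrity", 0.95),
    ("person with several agreeing sources", 0.70),
    ("single unverified Freebase entry", 0.40),
]

strict = panels_to_show(candidates, threshold=0.9)
relaxed = panels_to_show(candidates, threshold=0.3)
print(strict)
print(relaxed)
```

Lowering the threshold surfaces more panels, but it also lets the unverified entry through, which matches the pattern of more panels arriving together with more errors.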

For person-based panels, my inclusion seems to show the new era of the Knowledge Graph that Dr Pete reported a few weeks ago. We can see that new "things" are being discovered as strings; then, using data, free-text extraction and natural language processing tools, Google is able to aggregate, clean, normalize and structure information from Freebase and the search index, with the appropriate schema and relational graphs, to create entities.

Despite the brash headline, this post is a single experiment and should not be treated as gospel. Instead, let's use this as a chance to generate discussion around the changes to the Knowledge Graph, for us to start thinking about our own hypotheses and begin to test them. Please leave any thoughts or comments below.

Great post. So my question is how does one become an entity? What data sources does google draw from for knowledge graph? I'm aware of Wikipedia & Freebase, but are there others? Can anyone contribute to freebase?

Substantially all of the Open Data sources are taken into account by Google for the Knowledge Graph.

Then, related to Entity Recognition, you must also take into consideration all the signals you can give of your real-life existence as an Entity (and specifically as that Entity defined as a Brand). This old post by our own Rand is still valuable in that sense.

The "theory" behind that, said very briefly, is that the Knowledge Base plays an important role in how Google understands that a site is the representation of an Entity and not just a number of pages under a domain name (even though those pages, or web documents in patent jargon, are entities too: search entities... but this could lead to some confusion, the word being the same but the concepts different).

Essentially you are moving away from just being words that make up a search string like "Andrew+Isidoro" and towards an understanding that those words relate to a real "thing" that can be conceptualised.

There are a number of Open Data sources that Google (and Satori) are pulling data from, but they also seem to use proprietary data such as their search logs to help determine intent behind entity-related queries. There is a very good paper written by Google (albeit a little old now) which highlights how entities *could* be being formed.

As I said above, there are a number of data sources that go into making an Entity. Take Moz's Gillian Muessig as an example. Gillian has no Wikipedia page and only a very basic Freebase profile, but she is still understood to be a "seomoz co-founder" as the RDF data that fuels her Freebase listing has pulled in 3rd party data from these Open Data sources.

Best practice would be to list yourself in these places (where applicable) using consistent data and test with informational queries.

Andrew, would you suggest skipping authorship and other schema.org implementation in favour of Freebase? I'm putting a lot out there to connect the data sources where I contribute etc. but... in the end, what would make a difference?

Here in Europe, schema.org is being picked up slowly by search engines, but my guess would be to invest in a decent Freebase profile instead of those detailed implementations.

Any thoughts on this?
We're still struggling to explain the efforts to our clients; investing in data-structure & relationships between their authors & companies. Should they invest in freebase or other stuff to make them more 'open' on the web?

Do wiki and Freebase, as you said, allow us to create our own personal profiles? I can see on your blog that you have used a website (a domain) in your name. Is it possible to list ourselves and our products on Freebase? If this is possible, I see a lot of value in having a Freebase profile.

Sorry, I read this in the approval queue, and then completely forgot I hadn't read/shared it when it went live. Argh. Really found this fascinating - have a feeling I'll be referencing it for my SearchFest talk in February.

I think you'll like seeing what a fellow Italian SEO - Enrico Altavilla - did along the same lines, but which actually explains all the steps he followed to significantly modify the Knowledge Graph of a known person/entity.

I've just looked through Enrico's post and it's a great read (I wish I could understand the original as the translation is a bit rubbish). The Knowledge Graph is certainly an area that we as SEOs should be exploring more.

I think relevance will begin to have a part to play in the future. Those that know you are much more likely to be returned your data than that of a footballer that they have no affinity to. Will have to wait and see on that one though...

I know it's been a while, but I found this post very helpful. I'm currently searching for the correct syntax for adding the itemprop="sameAs" markup. Something doesn't seem quite right about the example Andrew posted here:

It seems strange to have nested hrefs like this, and/or it may be missing a quote or metathesized closing brackets, perhaps? Can anyone correct this for me so I can have a good example? There's virtually no documentation about what correct sameAs formatting is!

Specifically, I'm looking to do some Schema.org sameAs entity marking but not necessarily within an <a>. I'm trying to markup regular unlinked text.

Great work, Andrew! I think I speak for the majority of the people who read this when I say I really appreciate the hard work you put into this post. It will be interesting to see what other data Google starts peppering into the Knowledge Graph, i.e. social, local, video, etc. Obviously seeing these verticals incorporated into the SERPs is nothing new, but I'm curious to see how Google will tie it all together now that the Knowledge Graph is in full swing.

After the Hummingbird update, the search engine results improved a lot due to the entity approach. You also gave examples of entity-based search in your post, and I think that while reading it, everyone will try to check their own details by typing their name with these kinds of questions.

Not quite. The knowledge graph is quite a complicated idea of pulling in data from multiple sources, understanding and then displaying them within relevant SERPs. Authorship is similar in that it scrapes data from the page but it seems to be handled in a different way. See Bill Slawski's post on this:

I only wonder if that's exactly what Google is trying to do -- turn us all into searchable entities? Maybe that's why they're instigating such a big push on Google+? I can already see this becoming an endless privacy/ security issue for people who want their private information secured from the SERPs.

It isn't really a case of privacy. All of the data that it understands about the entity "Andrew Isidoro" shown above is freely available on the web. It's just been found, understood, and displayed in a new format.

I do, however, think that as more people, places and things get added to the Knowledge Graph, we'll begin to see large-scale personalisation of it, showing entities based on their affinity to your own entity's information.

Andrew, thanks for this amazing post. I didn't know about the sameAs property from Schema.org. I looked into the code on your webpage and implemented it on mine. Looking forward to being recognized as an entity :-)

This is a great piece; will be sharing. From one Worcester Uni grad to another: I should send you the dissertation I did in 2010 on The Semantic Web, I think you'd appreciate it. Really enjoyed reading this!!

Thanks for sharing this information, Andrew. Google made a lot of changes in their algorithm that left people unaware of what's happening. Guys like you help us become aware of the things happening on the web.

Great Post Andrew; You said "there are just 2 people in the U.S. named Andrew Isidoro."

Let's suppose there were more than 100 or 500 people with that name in the USA. Then would it still be possible to show a Knowledge Graph panel for queries like "Andrew Isidoro's age"? How would Google know which "Andrew Isidoro's age" we are asking for if there is no popular "Andrew Isidoro" in the USA?

This is where things are a little underdeveloped at the moment. Currently you'll be shown the data of the user who is the most "authoritative" for that term. For example, a friend of mine, John Glover, is a pro cricketer at Glamorgan Cricket Club and has a fairly comprehensive Freebase profile. Yet when I, a close friend, search for him I get a result for John Glover the actor.

I think in the future we'll see much more dynamic data based on our social graph. Essentially taking our social connections (and data within) into account when constructing knowledge panels. For more info on how this might work I recommend Justin Briggs' post on Building the Implicit Social Graph.