Google’s Hummingbird Algorithm Ten Years Ago

Added 2013-11-10 – Google was granted a continuation version of this same patent (Search queries improved based on query semantic information) on November 5th, 2013, where the claims section has been completely re-written in some interesting ways. It describes using a substitute term for one of the original terms in the query, and using an inverse document frequency count to see how many times that substitute term appears in the result set for the modified version of the query and for the original version of the query. The timing of this update of the patent is interesting. The link below points to the old version of the patent, so if you want you can compare the claims sections.

Back in September, Google announced that they had started using an algorithm that rewrites queries submitted by searchers, which they had given the code name “Hummingbird.” At the time, I was writing a blog post about a patent from Google that seemed like it might be closely related to the update, because its focus was upon re-writing long and complex queries while paying more attention to all the words within those queries. I called the post The Google Hummingbird Patent because the patent seemed to be such a good match.

Image from Dr. Bill May from the US Forest Service on a page about wildflowers

Google has been granted a number of patents about query re-writing, sometimes also referred to as query expansion or query broadening. These try to make it more likely that the search engine will return results closer to what searchers are looking for, even if they don’t use the best choice of keywords to find the information that will fill their needs. I had also recently written about some other patents describing how Google might re-write queries, and it seemed like they were putting together a framework that involved looking at search interactions to better understand probabilities for ranking pages.

Yes. And no. Some of the parts are perfectly good, so there was no reason to toss them out. Other parts are constantly being replaced. In general, Hummingbird — Google says — is a new engine built on both existing and new parts, organized in a way to especially serve the search demands of today, rather than one created for the needs of ten years ago, with the technologies back then.

Knowing that Google had been working upon patents involving re-writing queries for a number of years, I took this statement as a challenge. Could I find a patent, filed around a decade ago, that looks like it describes how Hummingbird might work? I searched around, and there was one that was co-invented by Amit Singhal, Google’s Head of Search Quality, who was involved in making the recent Hummingbird announcement. While the technology described in the patent is very similar, it is definitely simpler, and it doesn’t seem to focus as much on the need to respond to conversational spoken searches on mobile devices. Instead, it tells us:

A search query, entered by a user is typically only one query of many that express the information that the user desires. For example, someone looking to buy replacement parts for their car may pose the search query “car parts.” Alternatively, however, the search queries “car part,” “auto parts,” or “automobile spare parts” may be as effective or more effective in returning related documents. In general, a user query will have multiple possible alternative queries that could be helpful in returning documents that the user considers relevant.

Conventionally, additional search queries relating to an initial user query may be automatically formed by the search engine based on different forms of a search term (e.g. “part” or “parts”) or based on synonyms of a search term (e.g., “auto” instead of “car”). This allows the search engine to find documents that do not contain exact matches to the user’s search query but that are nonetheless relevant.

Interestingly, this older patent may have been filed back in 2003, but it wasn’t granted until 2011. The patent is Search queries improved based on query semantic information, and it tells us:

A search query for a search engine may be improved by incorporating alternate terms into the search query that are semantically similar to terms of the search query, taking into account information derived from the search query. An initial set of alternate terms that may be semantically similar to the original terms in the search query is generated.

The initial set of alternate terms may be compared to information derived from the original search query. One example of such information is a set of documents retrieved in response to a search performed using the initial search query. One or more of the alternate terms may be added to the original search query based on their relationship to the information derived from the original search query.
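To make that description a little more concrete, here is a minimal Python sketch of the idea of checking alternate terms against documents retrieved for the original query. The function name, the sample documents, and the 0.5 threshold are all my own illustrative assumptions; the patent text quoted here doesn't spell out how the comparison is scored.

```python
# Hypothetical sketch: the function name, sample documents, and the 0.5
# threshold are illustrative assumptions, not details from the patent.
def select_alternates(candidates, retrieved_docs, min_fraction=0.5):
    """Keep candidate terms that appear in enough of the documents
    retrieved for the original query."""
    if not retrieved_docs:
        return []
    # Treat each document as a set of lowercase words.
    doc_terms = [set(doc.lower().split()) for doc in retrieved_docs]
    kept = []
    for term in candidates:
        hits = sum(1 for terms in doc_terms if term in terms)
        if hits / len(doc_terms) >= min_fraction:
            kept.append(term)
    return kept


# Made-up result sets for two original queries.
ford_docs = [
    "ford automobile dealers and car reviews",
    "new ford car and automobile prices",
    "ford truck parts",
]
railroad_docs = [
    "railroad car museum",
    "freight railcar history",
    "railroad railcar and car restoration",
]

print(select_alternates(["automobile", "railcar"], ford_docs))
print(select_alternates(["automobile", "railcar"], railroad_docs))
```

With these invented documents, “automobile” survives as an alternate for the Ford query and “railcar” survives for the railroad query, because each shows up in most of the documents returned for its own original query.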

This patent tells us that there are a few different ways of rewriting queries. Two methods that might be used would involve taking some words within the query and either using “stemming” as a way to transform some of the words within the original query, or looking up those words in a thesaurus. Stemming might involve looking at words with the same root (such as congress and congressional), and re-writing the query using those variations of the same word. Using a thesaurus might involve replacing a word such as “car” with a synonym such as “automobile.”
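As a rough illustration of those two approaches, the Python sketch below swaps one query term at a time for either a crude stem or a thesaurus entry. The suffix rules and the tiny thesaurus are assumptions I made up for the example; real stemmers and synonym sources are far more sophisticated.

```python
# Toy illustration only: the suffix rules and the thesaurus below are
# made-up assumptions, not anything described in the patent.
THESAURUS = {"car": ["automobile", "auto"]}
SUFFIXES = ("ional", "s")


def naive_stem(word):
    """Strip the first matching suffix to approximate a shared root."""
    if word.endswith("ss"):
        return word  # leave words like "congress" alone
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def rewrite_variants(query):
    """Return alternative queries made by swapping one term at a time."""
    terms = query.lower().split()
    variants = []
    for i, term in enumerate(terms):
        # Thesaurus lookups first, then the stemmed form if it differs.
        alternates = list(THESAURUS.get(term, []))
        stem = naive_stem(term)
        if stem != term:
            alternates.append(stem)
        for alt in alternates:
            variants.append(" ".join(terms[:i] + [alt] + terms[i + 1:]))
    return variants


print(naive_stem("congressional"))
print(rewrite_variants("car parts"))
```

Here “congressional” is reduced to “congress”, and “car parts” produces the variants “automobile parts”, “auto parts”, and “car part”, which line up with the alternative queries the patent mentions.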

This older patent describes some potential problems with either of those approaches:

One serious problem with the stem-based and synonym-based techniques for finding additional search queries is that two words may have similar semantics in some contexts, but not in other contexts. For example, “automobile” has similar semantics to “car” in the query “Ford car”, but not in the query “railroad car.” As a result, these techniques often produce search queries that generate irrelevant results. For another example, if the query “jaguars” was stemmed to the word “jaguar,” the query semantics may have been changed from that of animal to that of a popular car.

The more recent patent I’ve called the Hummingbird Patent doesn’t really address stemming or a thesaurus, or other ways of identifying synonyms that Google has been exploring. Both patents, though, look at identifying co-occurring words within the search results for each term used as a query, or within query log files, to find candidate synonym terms that can be used to re-write a query. The patent filed in 2003 also discusses attempting to understand the “query context” of the original query to capture a meaningful re-written query.

The newer patent does a better job of describing a process that might be used to rewrite queries, taking the context of the whole original query under consideration. It’s possible that this new patent was created after a lot of consideration of how this kind of query pre-processing might be done. We’re told in Synonym identification based on co-occurring terms that query context is an essential part of this process:

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of evaluating a candidate synonym for a particular query term included in a search query using non-adjacent contexts. Specifically, a candidate synonym can be evaluated in order to determine if the candidate synonym is a synonym, or substitute term, for the particular query term, based on additional terms included in the search query that are not adjacent to the particular query term. For example, when the search query includes numerous terms, the context for a particular query term included at the beginning of the search query may be defined by a query term located at the end of the search query. The use of context for the particular query term can improve the overall confidence that a candidate synonym is a synonym for the particular query term.
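Here is a small Python sketch of how a non-adjacent context term might change the confidence in a candidate synonym. The query log, its counts, and the simple ratio used as a confidence score are all invented for illustration; the passage above doesn't tell us how Google actually computes that confidence.

```python
from collections import Counter

# Hypothetical query log: every query and count here is invented for
# illustration; a real system would draw on far larger logs.
QUERY_LOG = Counter({
    "car repair manual for ford": 40,
    "automobile repair manual for ford": 30,
    "car sleeping berths on railroad": 25,
    "automobile sleeping berths on railroad": 1,
})


def context_confidence(term, candidate, context, log=QUERY_LOG):
    """Estimate how often `candidate` stands in for `term` in queries that
    also contain `context` anywhere, not only next to the term."""
    with_term = with_candidate = 0
    for query, count in log.items():
        words = query.split()
        if context not in words:
            continue  # only queries sharing the context word are counted
        if term in words:
            with_term += count
        if candidate in words:
            with_candidate += count
    total = with_term + with_candidate
    return with_candidate / total if total else 0.0


print(context_confidence("car", "automobile", "ford"))
print(context_confidence("car", "automobile", "railroad"))
```

In this made-up log, the context word sits at the end of each query while “car” sits at the beginning, and the score for “automobile” comes out much higher when the distant context word is “ford” than when it is “railroad.”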

So the shift in technology between now and 10 years ago seems to lie in making better use of semantic analysis, where finding co-occurring terms helps build a better understanding of the context of a query.

Conclusion

There have been a lot of blog posts published on Hummingbird in the past couple of weeks that attempt to explain it better.

Some of these describe Hummingbird as something that makes better use of Google’s Knowledge Base and understanding of entities in queries and pages that might be about those named entities. Considering that the announcement of Hummingbird also told us that it impacted 90% of all searches, that’s too big of an impact to only affect queries with specific people, places, or things in them. The example query we were given didn’t include a named entity either.

I’ve had some people ask me if semantic approaches such as the markup from schema.org played a role in how Hummingbird works, and I told them that I didn’t think so. While Schema.org markup can help a search engine understand what a page it’s indexing is about, and might lead to rich snippets, it doesn’t focus upon re-writing queries and understanding their context better.

It’s important to keep in mind, when you’re writing about a topic, or doing keyword research, that the words that you’re choosing to use aren’t just strings of words, but rather embody certain concepts that may contain many different aspects.

Should you change how you’re doing link building after Hummingbird? PageRank is still part of how ranking at Google works, and attracting links to your pages is probably something that you should keep on trying to do. But, a big part of ranking under Hummingbird probably has something to do with understanding better how data collected about interactions between different search entities might influence rankings of search results. If people tend to click upon specific pages in response to certain queries, and spend a fair amount of time on those results, the pages they click upon and dwell upon might be boosted in search results under Hummingbird.


Comments

Bill, I feel that the idea of rewriting queries is as scary as the disappearing organic landscape. But I also don’t think, for once, Google is totally at fault here either; I think as people move from desktop to handhelds, the queries are becoming shorter because it is faster – easier. For years they trended up in words and characters used, but mobile queries stopped it, simply by the nature of them. So with the death of the long tail query, so goeth the long tail results right? How come no one talks about how many words are actually used in queries anymore? (not provided)?
If they can semantically rebuild and restructure (and then obfuscate the details of) a query, what chance in hell do you have of doing anything to keep up with relevant answers? Isn’t that really the point?
Thanks for all the work. 🙂

This is an enlightening disclosure about the real Hummingbird update. Certainly an article like none other on the web lately.

It is exciting to think that you do not have to research for keyword strings but once your content ranks for a semantic search query, it will rank better for similar queries in the future thus helping you come up in search post the Hummingbird update.

Just one correction I wanted to mention was with a spelling error of Amit Singhal’s name at the very first time you mention it in the blog. 🙂

Interesting. I think that the patent will further allow Google to give the searcher exactly what they want to find. It sounds like the AdWords option that allows ads to show for synonyms of the keywords searched.

Thanks. I don’t think that the patent from 2003, or the newer patent had any intent to drive more people to use paid search advertisements. If there were even a hint of that in either patent, I would have written about it. I didn’t. That’s not the stated purpose of Hummingbird anywhere.

Neither patent had any intent to harm the effectiveness of long tail queries either.

Bill, as a first time visitor to your blog, I’m impressed by the nature of your post. Obviously a lot of research went into your subject matter. As far as Google’s patent goes, how likely do you think it is that the original patent that was filed 10 years ago was a prediction of Google’s goal to manipulate search queries to drive more traffic to paid advertising?

To throw my two cents’ in about the future (or death) of long tail search queries, even when Google shared the specific search strings people used to find your website, the most common were, by far, 1-3 phrases. The idea behind any of the updates to search algorithms, besides keeping seo’s guessing to drive more companies to PPC, is to return results at the top of the page that do the best job getting people to interact with their website WITHOUT GOOGLE’S HELP.

I think an argument can be made against that in the beginning. But today, with social media, social sharing, directories, etc., those that do the best job of getting people to interact with them before they find them on Google ultimately fare the best.

I’m not sure this Hummingbird update has actually hit Europe yet, but I’m looking forward to its release. Windows 8.1 went out today and part of that overhaul sees a major change in the Bing search engine, which is very much in keeping with Facebook’s Graph Search and the Google update. They all seem to be upping the ante, as it were, and fighting for dominance, but I can’t see Google being shifted anytime soon.

Query rewriting, just as Google’s Suggestions before it, tends to result in less variation in the search term being used and/or the results being found.
Fewer popular terms = more competitiveness for AdWords for those terms/results… or do we think Hummingbird will provide more results for which there’s no/fewer AdWords bidding?

As stated above by Alex, there wasn’t such a big change in European country search engines. Either the update hasn’t hit the local SE’s or the impact is unnoticeable. I think it has a lot to do with the fact that the Google Knowledge graph wasn’t focused on local spread languages.

Some very interesting points. Dare I say this is why personalised search was such a big thing. Effectively this is how context is gathered when returning related search terms.

After all, if my search history pointed out I was a car enthusiast, then it would know to return car related pages for “jaguar”, while if Google knows that I’m a sucker for cats, it’s more likely to return pages on the big cat for similar searches.

I’ve not noticed much of a change in traffic from search of late, so this could mean that we’re doing SEO right, or that this update hasn’t had that big an impact. Ultimately we’ll need to simply keep watching this space.

This Hummingbird update surely has not hit Europe yet as I cannot see a difference in searches. Some friends of mine tested and compared google in the States with the European google here (in France and in the UK) and we find differences. Maybe the updates have hit google.com first?

Great sleuthing! I agree Bill. Hummingbird had such a mild launch, I do not see how it could be affecting long tail queries. All the regular (and well known) SEO conspiracy theorists did not mention Hummingbird and no one reported drastic changes. I had a chance to interview Martin Macdonald, here, and he said “… don’t see it changing results, in the same way that Caffeine didn’t change results. I do see information providers being hit big time though, people like weather sites, currency converters, sports statistics sites and so on.”

This is one of the best articles I have read about the Google Hummingbird algorithm. I believe that we should not change how we perform keyword research or write about content after Hummingbird if we focus more upon the concepts we are targeting when researching keywords.