How Google May Expand Searches Using Synonyms for Words in Queries

When someone searches the Web, one of the challenges that they often face is using the right words in their search to find what they are looking for.

Search engines usually rank pages based upon how prominently terms from a searcher’s query appear on those pages, and if a searcher doesn’t use the right words in their search, they may miss the pages and the information that they would like to find.

For example, someone looking for web hosting in the City of Ft. Wayne may type the query [Web hosting Fort Wayne] into a search engine, and not see many pages about hosting in that location because the City is usually referred to as “Ft. Wayne” rather than “Fort Wayne.” I find myself frequently challenged by this kind of problem when looking for information about Washington, D.C., or the District of Columbia, or DC.

A patent granted to Google this week explores how the search engine might expand the search terms that searchers use to include synonyms in searches, to make it easier for searchers to locate information on the Web. In the Ft. Wayne example, this could mean that Google would look for pages on the Web that were relevant for both [web hosting Fort Wayne] and [web hosting Ft. Wayne].

The Fort Wayne example is taken from the patent, and the authors of the patent provide another example of a search query that someone looking for music for a video they are making might use in a search – [free loops for flash movie]. Chances are that most people offering music that can be used for free for videos are going to use the word “music” rather than “loops.” They may also use the word “animation” rather than “movie.” When that searcher types [free loops for flash movie] into Google’s search box, the search engine might not return pages that provide free music for flash animations because those pages don’t use the words “loop” or “movie,” or the words “loop” and “movie” are used on some pages that aren’t very prominent and the pages don’t rank very well in Google for those terms.

We’re told by the inventors of the patent that this problem becomes more serious as the number of terms in a query increases:

Thus, documents that satisfy a user’s information need may use different words than the query terms chosen by the user to express the concept of interest. Since search engines typically rate documents based on how prominently the user’s query terms are in the documents, this means that a search engine may not return the most relevant documents in such situations (since the most relevant documents may not contain the user’s query terms prominently, or at all).

This problem becomes progressively more serious as the number of terms in a query increases. For queries longer than three or four words, there is a strong likelihood that one of the words is not the best phrase to describe the user’s information need.

Synonyms and Context

One of the simpler ways for a search engine to find synonyms for query terms, and use them to expand those queries, would be to build a thesaurus or database of synonyms and look up the words in a query to identify possible synonyms. But there are some limitations to that approach. The most significant is that the meaning of a term often relies upon the context in which it is used.

For example, “music” is not usually a good synonym for “loops,” but it is a good synonym in the context of the example query above. Further, this case is sufficiently special that “music” is not listed as a synonym for “loop” in standard thesauruses; many other examples of contextually dependent non-traditional synonyms can be easily identified.

And even when conventional synonyms can be identified for a term, it can be difficult to identify which particular synonyms to use in the particular context of the query.

The patent presents a process for finding synonyms for words that appear in search queries, evaluating the quality of those synonyms within the context of a particular query, and using those synonyms to expand queries and return relevant pages to searchers.

It starts by finding queries that are alike, and performing tests upon those query terms and phrases, while looking at information related to those queries.

For instance:

The number or percentage of times both terms appeared in search queries within a certain amount of time.

The number or percentage of times both terms appeared within a particular user search session.

How much alike the search results are that are returned for the original search query and for a search where a candidate synonym is substituted.
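The session-based test in that list can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the log format (pairs of session id and query), the function name `session_cooccurrence`, and the sample entries are all hypothetical, and the patent does not spell out an implementation like this one.

```python
from collections import defaultdict

def session_cooccurrence(log, query_a, query_b):
    """Return the fraction of sessions containing query_a that also contain query_b."""
    # Group the queries seen in each session (hypothetical log format).
    sessions = defaultdict(set)
    for session_id, query in log:
        sessions[session_id].add(query.lower())

    with_a = [queries for queries in sessions.values() if query_a in queries]
    if not with_a:
        return 0.0
    both = sum(1 for queries in with_a if query_b in queries)
    return both / len(with_a)

# Made-up log entries: (session id, query)
log = [
    ("s1", "web hosting fort wayne"),
    ("s1", "web hosting ft wayne"),
    ("s2", "web hosting fort wayne"),
    ("s3", "web hosting fort wayne"),
    ("s3", "web hosting ft wayne"),
    ("s4", "free loops for flash movie"),
]

rate = session_cooccurrence(log, "web hosting fort wayne", "web hosting ft wayne")
print(f"{rate:.2f}")
```

A high rate would only make the pair a candidate. The other tests would still need to agree before a synonym is used.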

Once synonyms are found that might be good replacements within a query, the search engine might offer a modified query using the synonym as a search suggestion, or the revised query might be used to expand the scope of the search results presented to a searcher.

So, someone searching for [Web hosting Fort Wayne] might be shown a set of search results with a query suggestion at the top of the results with a link to results for [Web hosting Ft Wayne], or they might see a set of search results that includes pages that are good matches for both [Web hosting Fort Wayne] and [Web hosting Ft Wayne].
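To make the second option more concrete, here is a rough Python sketch of blending results for the original query with results for the synonym query. The patent does not say how the combined results would be ranked, so the scoring here is entirely my assumption: the function name `merge_results`, the `synonym_weight` discount, the scores, and the example.com URLs are all hypothetical.

```python
def merge_results(original, expanded, synonym_weight=0.8):
    """Merge two {url: score} dicts into one ranked list of URLs.

    Scores from the synonym query are discounted by synonym_weight
    (an assumed value), and a page keeps its best score.
    """
    merged = dict(original)
    for url, score in expanded.items():
        merged[url] = max(merged.get(url, 0.0), score * synonym_weight)
    return sorted(merged, key=merged.get, reverse=True)

# Made-up scores for [web hosting fort wayne] and [web hosting ft wayne]
fort_results = {"example.com/fort-hosting": 0.9, "example.com/shared": 0.6}
ft_results = {"example.com/ft-hosting": 1.0, "example.com/shared": 0.7}

ranked = merge_results(fort_results, ft_results)
print(ranked)
```

Discounting the synonym scores is one way to keep exact matches ahead of equally strong synonym matches. A real system could weigh them quite differently.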

A method is applied to search terms for determining synonyms or other replacement terms used in an information retrieval system. User queries are first sorted by user identity and session. For each user query, a plurality of pseudo-queries is determined, each pseudo-query derived from a user query by replacing a phrase of the user query with a token.

For each phrase, at least one candidate synonym is determined. The candidate synonym is a term that was used within a user query in place of the phrase, and in the context of a pseudo-query. The strength or quality of candidate synonyms is evaluated. Validated synonyms may be either suggested to the user or automatically added to user search strings.
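The pseudo-query idea in that abstract is easier to see with a small example. The Python sketch below replaces each phrase in a query with a token, then treats phrases that filled the same slot in the same pseudo-query as candidate synonyms. The names `pseudo_queries` and `candidate_synonyms`, the `*` token, and the two-word phrase limit are my own assumptions, and the sketch skips the step of sorting queries by user and session.

```python
from collections import defaultdict

TOKEN = "*"  # assumed placeholder for the replaced phrase

def pseudo_queries(query, max_phrase_len=2):
    """Yield (pseudo_query, phrase) pairs by replacing each phrase with TOKEN."""
    words = query.lower().split()
    for n in range(1, max_phrase_len + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            pseudo = " ".join(words[:i] + [TOKEN] + words[i + n:])
            yield pseudo, phrase

def candidate_synonyms(queries, max_phrase_len=2):
    """Map each phrase to the other phrases seen in the same pseudo-query slot."""
    slots = defaultdict(set)
    for query in queries:
        for pseudo, phrase in pseudo_queries(query, max_phrase_len):
            slots[pseudo].add(phrase)

    candidates = defaultdict(set)
    for phrases in slots.values():
        for phrase in phrases:
            candidates[phrase] |= phrases - {phrase}
    return candidates

queries = [
    "free loops for flash movie",
    "free music for flash movie",
    "free loops for flash animation",
]
cands = candidate_synonyms(queries)
print(sorted(cands["loops"]), sorted(cands["movie"]))
```

These are only candidates. The patent describes evaluating their strength or quality before any of them are suggested or added to a search.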

How the Process Works

Someone enters a query at the search engine, and a set of pages which are relevant for the query are retrieved and ranked based upon their perceived relevance and importance.

The search engine then looks at the query terms, and might attempt to identify possible synonyms for words or phrases within that query from a list that might have been created from analyzing the search engine’s query logs.

To create that list, all queries received over a certain period of time might be reviewed and potential, or candidate synonyms may then be identified.

For example, the original query might have been [free loops for flash movie], and there might be previous queries within the log such as [free music for flash movie] that may be worth reviewing.

Or, query fragments with wildcard tokens within them might be used, such as [free * for flash movie].

Information from the query logs about the queries with the candidate synonyms in them might then be analyzed.

For instance, how frequently has someone who searched for [free loops for flash movie] then searched, within a short period of time, for [free music for flash movie] or [free loops for flash animation]?

Other tests may be performed as well, such as estimating the probability that both queries have a number of top search results in common. So, if a search for [free loops for flash movie] and a search for [free loops for flash animation] share a certain number of pages in the top 10 (or some other number), then “movie” and “animation” are likely good synonyms within the context of that query.
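A shared-results test like that one might look something like the Python below. The patent only says “a certain number” of shared pages, so the threshold of three is an assumption of mine, as are the function names and the made-up example.com result lists.

```python
def top_k_overlap(results_a, results_b, k=10):
    """Count the URLs shared by the top k results of two ranked lists."""
    return len(set(results_a[:k]) & set(results_b[:k]))

def passes_overlap_test(results_a, results_b, k=10, min_shared=3):
    """Return True when the two result lists share at least min_shared top-k URLs."""
    return top_k_overlap(results_a, results_b, k) >= min_shared

# Made-up ranked results for [free loops for flash movie] and
# [free loops for flash animation]
movie_results = ["example.com/a", "example.com/b", "example.com/c", "example.com/d"]
animation_results = ["example.com/c", "example.com/a", "example.com/x", "example.com/d"]

shared = top_k_overlap(movie_results, animation_results)
print(shared, passes_overlap_test(movie_results, animation_results))
```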

Conclusion

The patent includes a number of examples of how synonyms might be selected for words that appear in queries, and is worth spending a good amount of time upon if you’re interested in how a search engine like Google might expand search results for searchers to include those synonyms.

When I search for [district of columbia museums], the top result after local results is a page that doesn’t include the word “Columbia.” If I look at the cached copy of the page at Google, I am told that “Columbia” does appear within anchor text in links to the page, which may be why it shows up as the top result for my query. But, there are plenty of pages that are also good matches for the words I used to search with.

Is Google deciding that there are other words or phrases on that page that are synonyms for “district of columbia” such as “D.C.”, and modifying my search results to include that page?

While not conclusive evidence by any means, it is interesting that in the top search result (past the local results) for my query, the acronym “D.C.” is bolded as if it were one of my query terms. Google usually highlights query terms when they appear in search results using bold text to show searchers that the pages they are returning are relevant for the query used in a search.

There’s no mention in this patent that Google might highlight or display synonyms in bold text in search results if they are used to expand search results for a query, and the highlighting process used by Google is a separate process, but it is interesting that the search engine bolded the synonym for District of Columbia.

What does this mean for you as a searcher or as a site owner if Google is using this process?

For searchers, it might mean that Google may add pages to your search results based upon words it perceives as synonyms for words you used in your query. Search for something while including the words “District of Columbia” in your search, and you may also see pages that use “Washington, D.C.” or “D.C.” instead of “District of Columbia.”

For site owners, it could mean that if you target specific keyword phrases on your pages, other sites that use synonyms for some of the words in those phrases may also show up in the same search results as your pages.

Added – January 19, 2010 – An Official Google Blog post was just published which describes a recent change at Google on how Google handles synonyms, as well as the use of bold in search results to highlight those synonyms. The description sounds very much like the process above, with the use of synonyms determined in context.

Google also published a patent filing which looks at synonyms in context, but also uses statistical language models to translate a query into another language and then back into the first language to attempt to find more than one phrase or term that may include synonyms within the same context. That approach and the one that I described above could be seen to be related in a number of ways. I describe it in the post: How a Search Engine Might Find Synonyms to Use to Expand Search Queries.

53 thoughts on “How Google May Expand Searches Using Synonyms for Words in Queries”

Bill, even with a college education and experience on SEO and keyword research, I still find this confusing. Maybe it’s the level of difficulty or maybe my short attention span. Does anyone else feel this way?

I did try to keep my explanation as simple as I could, but I can try to take another stab at it. The patent is fairly complex.

Say someone searches for [fort wayne web hosting].

Google looks up results for the search, and ranks them. But before presenting them to a searcher, it takes a few more steps.

First, it looks in a table created from recent queries (a day, a week, or some other amount of time in the past) to see if there are any potential synonyms for terms found there.

Google sees that a large number of people who searched for [fort wayne web hosting] also searched for [ft wayne web hosting] in followup searches (or in reverse order). Other people searching for [fort wayne *wildcard*] also searched for [ft wayne *wildcard*] during the same time period.

Google also sees a lot of searches, not necessarily from the same people during the same query sessions, for both [fort wayne web hosting] and [ft wayne web hosting].

Because of the large number of people who search for both during their individual query sessions, and the large number of searches that may not have happened during individual query sessions, Google may explore the idea that “Fort Wayne” and “Ft Wayne” may be synonyms, at least in the context of a search for web hosting.

Google may then perform some other tests, and the patent lists a few different ones they could use. One test would be to see if there are a number of shared pages amongst the top search results for each term. Let’s say that three of the top ten pages that show up in a search for [fort wayne web hosting] and [ft wayne web hosting] are the same. That makes it much more likely that “fort wayne” and “ft wayne” are synonyms.

So, if Google decides that they are synonyms, it might take a couple of steps. The first would be to show a query suggestion at the top of the search results. Someone searches for [fort wayne web hosting], and Google could then show the following:

Did you mean: Ft Wayne web hosting

In that suggestion, “Ft Wayne web hosting” would be a link to a search for that phrase.

Or Google could combine the search results for both [fort wayne web hosting] and [ft wayne web hosting], and rank those results, and present them to a searcher.

This patent is something that I see for many queries. You can try [miami fl] and you will see the “FL” highlight all “Florida” words. And it happens for all US states. You can try with TX, GA, and so on. I think this is really useful for users and helps to return accurate results.

Very helpful. Thanks! Understanding how Google works with these things is generally impossible for me, I know there is logic to it, I just can’t always see it. Appreciate your enlightenment. God bless!

Great post as ever. Yes, on a personal note, I feel it is a good ploy that “G” has come up with this technique of ranking search queries and their synonyms. Having said that, even before a year as you have mentioned in the patent. I got ranked for the term GPS systems in less than a year or so in google.co.uk, even for a website that goes by the name satellitenavigation faster. Then I got placed for sat nav and then satellite navigation. Here I consider gps systems and sat nav to be potential synonyms for satellite navigation, and it just goes to show that it is easier to get placed for synonyms than the actual terms. Thoughts on this point, Bill?

You’re welcome. One of the things that we have to be careful about when saying that Google may have implemented something like the method in this patent, which tries to identify synonyms within the context of a query, is that Google may be using something simpler for some searches that looks like what this patent describes.

For example, it really wouldn’t be hard for Google to have a thesaurus of some very common synonyms that they could look up and use as replacements to expand queries. Including state names in such a simpler system wouldn’t be unreasonable at all.

I’ve been keeping an eye out for other queries that don’t involve geographic locations. For example, if I search for [automobile dealership jobs], I’m seeing some results that look like they come from a search for [auto dealership jobs] with “auto” highlighted. Are those showing up because Google is expanding my search by using the method described in this patent? Maybe. But if it were, should I also be seeing results for [car dealership jobs]? I’m not sure, but I would guess that I should be.

Thanks. Bing did purchase Powerset a while back, and it does look like they are using some of Powerset’s technology. Powerset had set up a demo on their site where they were showing how their semantic search worked with Wikipedia, and it appears that has found its way into Bing. It’s hard to tell what each of the search engines might be serving us next when it comes to how search results are selected and ranked, but it should be interesting to watch how each develops.

Thanks. It can be hard to get the big picture – so many different things go into how search engines work. Google has stated in the past that they look at more than 200 ranking signals when determining which pages to show in search results. Something like what I described in this patent is potentially just one small part.

Thank you. My thoughts? I like to try to include reasonably related words and synonyms on pages and sites when it comes to creating the content for those pages if possible. While I think that what this patent describes could potentially be very helpful, I’d rather not rely upon their analysis of user-data from query log files as much as taking matters into my own hands if I can.

I do like your example, and I appreciate your sharing it with us.

If I were to create a site about coin collecting, I would try to make sure that I included as many words and phrases as possible that might be seen as synonyms or related terms on my pages, such as: numismatics, coin grading, minted legal tender, currency, coin collecting, replica coins, antique coins, mint marks, coinage, doubloons, and more. Some of those terms I select might be highly competitive, and may seem difficult to compete for, but that doesn’t mean that they shouldn’t be included. As your site grows, it may just start ranking for those very competitive terms.

This is why the first thing a site or business owner has to understand is that it is all about their customers’ natural language: what is going through their minds when they are actually searching for information on a search engine. Many have forgotten that search engine optimization is not just about the search engine but about their end customers and users.

Of course, search engines have also been looking at anchor text, and text that may be associated with links, to determine what web pages may be relevant for in searches. That was true even before 2003. It’s possible that some of the traffic you have been seeing was related to text used in links that point to your pages rather than synonyms that the search engine might have associated with those pages. If Google is using something like this synonym approach, it might be hard to tell whether pages rank because of anchor text or because Google has associated other words with them through a contextual synonym method like this one.

I agree with you completely – understanding what words your audience might use to find your site is one of the most important things you can strive to learn when you put a site on the Web, and publish content on its pages.

I hope your holidays have been enjoyable so far, and that the new year is good to you.

I really like patent filings like this one that give us some insights into how a search engine might perform some of the things that it does. There may be some questions about how accurate something like this is on its own, but it is just one of many processes that the search engines may be following. If it helps make the search results even just a little bit better, then it’s a good thing for them to use.

It is possible that Google has used some kind of process involving synonyms or related phrases to identify duplicate content. The synonym identification process in this patent is one that focuses upon synonyms for query terms, and expanding those terms, based upon user data found in the log files of the search engine, rather than looking at content found on Web pages. So, it might not be helpful for finding duplicate pages.

But, one of Google’s phrase based indexing patent filings does look at related phrases to attempt to identify duplicates. Not quite synonyms, but an interesting approach regardless. It can be found at:

Interesting article. I’m a first timer on your site and I must say I am eager to read more of what lies within that head of yours.

This Context Specific thesaurus that Google would need to implement in order to capture every conceivable possibility for the use of the word “loop” for example might be a tad overkill.

First off, there is the possibility of getting tangled in a web of assumptions about the contextual definition of the word, and second, there is the speed of delivery once this concept comes full circle and the thesaurus becomes larger and larger.

Last December I wrote about another method that Google might use to learn about synonyms within the context of a sentence or phrase, which appears to expand upon the method mentioned in this patent filing. That post was How a Search Engine Might Find Synonyms to Use to Expand Search Queries, and it involved translating a phrase into another language, and then translating it back into the original language, to see if there were other alternative translations when it was translated back. For instance, if I translate “new car parts” into French, and then translate it back into English, I might end up with at least two different alternative English versions – “new car parts” and “new automobile parts.” Given the context, it would appear that “car” and “automobile” would be synonyms.

The use of language models, whether based solely upon English, or upon translations between languages, seems to be an area that Google is diving into head first. It might seem like overkill, but a statement from Google Vice President for Search Products, Marissa Mayer, last December 14th, told us:

“Imagine what it would be like if there was a tool built into the search engine which translated my search query into every language and then searched the entire world’s websites,” she says. “And then invoked the translation software a second and third time – to not only then present the results in your native language, but then translated those sites in full when you clicked through.”

This patent filing on expanded language search came out in January of last year. If Google’s ambitions are this large, then a context specific thesaurus might be a small accomplishment in comparison.

I have seen this in action quite a bit recently. Like your District of Columbia (DC) example, I’ve noticed that Google has done fantastically in this area. Of course Google knows that I’m from South Africa when I start searching it already has a pretty good idea as to how I might relate certain words. After all, chips here are what others might call crisps in other parts of the world.

Most noticeable examples I can think of at the moment are all in local search (supporting the idea that Google’s really looking to go local). Cities that your readers may know include Johannesburg, Cape Town or even Pietermaritzburg (start/finish of the Comrades Marathon). Searches for any of these cities with their abbreviated form provides similar results with the JHB, CT or even PMB highlighted as if they were indeed the search.

I wonder just how long it will take before Google’s assumption on what they know of you will enable it to return truly personalised search results for a search?

Thank you for sharing your observations – it’s interesting to see how Google is taking into account differences in language from different parts of the world in expanding queries and returning results.

You raise a point that is worth spending some time thinking about – when Google attempts to understand the intent behind a query, how much of that can be said to be personalization? Much of the process behind identifying synonyms within the context of a search relies upon looking at query logs from the search engine for a number of searchers. Personalization would also attempt to identify unique characteristics of the actual person doing the search.

This patent doesn’t tell us that Google might only look at query logs for searchers in South Africa to identify possible synonyms, but it would make sense that they would (more so when it comes to queries that include geographic terms). In a few of Google’s patent filings about personalization, they sometimes mention that they might create “profiles” for searchers, for web sites, and for query terms.

Each of those “profiles” could include information that could be used to personalize search results. A profile for a query term might include possible synonym expansions. A profile for a searcher might include information from past searching and browsing behavior, and groups of “interests” that a searcher might be seen to want to see in pages that they are shown. A profile for a web site might include more information about that site than just what keywords might appear upon a single page of the site.

So, the expansion of synonyms can be an important part of personalization.

We’ve seen this patent in Germany too these days. After a big brand update (similar to Vince in early 2009 in the U.S.), synonyms seemed to have incredible influence within the ranking algorithm. Nevertheless, they have pushed back now after some quite crazy SERPs on top keywords, with sites in top positions not even including the searched keywords.

Under the process described in this patent filing, synonyms are identified by an algorithm. It may incorporate a lot of user-based data about searches, query sessions, and search result selections, but it is an algorithmically based process.

That’s really interesting to hear. There’s always been the possibility of some top search results appearing that don’t contain keywords used in a query because of the influence of anchor text pointing to pages, but I expect that if many of the top results don’t, and include a large number of synonyms instead, that might lead to a fair amount of confusion and concern.

The challenge now moves to optimizing for synonyms that will be served in the SERPs of more competitive terms. Also, a sudden shift in referring keywords, without corresponding optimization work, might be a clue that synonyms are in play.

It’s an interesting area to explore, and I think it highlights thinking about the use of synonyms with the content on pages.

As for LSI, there were a number of companies that were (and may still be) claiming to use “LSI” to help businesses identify keywords to use on their pages that may not have been using quite the correct term or concept, just as there were many who used to claim that there was an “ideal” keyword density for pages of different types. Those methods aren’t helpful, and have never been.

What I find interesting is that, although the synonym concept is great, why the patent process had to be brought into the picture.

Synonyms are not patentable.

If we are talking about a specific way to program a search engine to get there OK. But, this idea that everything must be patented is slowing man’s progress substantially.

For example, IBM this year was awarded over 4900 patents. It is almost as if every time an IBM engineer gets a cup of coffee, their company patents the exact number of steps to the pot.

Man, in his selfishness, and unwillingness to share information without compensation is virtually and literally making an attempt to close his brother off from anything that might help him lead a better life, unless people are paid for the privilege.

It’s a sad world to think of the way we use and abuse our God given intelligence. God gives intelligence to us free of charge, and we charge others for its fruits.

Of course, synonyms aren’t patentable, but the process described is, as an automated process that takes a very large amount of user data, attempts to identify synonyms within the context of the words used, and expands searches if there’s enough confidence that expansion is appropriate. If this process were just a matter of looking at the words in a query, and seeing if there were any matches in a dictionary of synonyms, I’d probably be questioning this, too.

I know that there’s a lot of controversy about software patents, but I wonder if we would see many of the innovations that we do if patents weren’t available. The patent process is intended to spur innovation by granting the ability to exclude others for a limited amount of time for things that are patented. There are other approaches, such as the GPL and open source software which provide an alternative way for people to develop ideas and share their efforts.

Regardless of how you or I might feel about the patent process, whether it encourages or stifles innovation, the focus of my post isn’t so much on whether or not that process is broken or inappropriate. The fact is at this point that the patent was granted. Going beyond that, what does this process mean to search, to what you or I might see when we perform a search at Google? What might it mean for people who own web sites, and would like those sites to be found at Google?

Wow, you have blown me out of the water. I’ve read that LSI was not a feasible way to index web pages — that it was too large an undertaking — but I assumed the kinks would be worked out, and that at least on a smaller scale, LSI was in place. I guess not. Hm!

Well, certainly applying synonyms to place name variations is the right place to start.

I don’t believe that LSI is suited well for the Web. Approaches like the ones described in the synonyms patent filings that take advantage of a large amount of user-based data, as well as the creation of statistical-based language models can be used to expand which queries are searched for, as well as helping to re-rank search results that different searchers may see. The underlying rankings may still be based upon signals involving relevance and importance, but there’s an ever increasing amount of data that the search engines are trying to use to decide what to show searchers. These synonym approaches are only a couple of them.

Google did announce that they would start using synonyms to expand query terms when and where it seemed appropriate.

I’m not sure that the point that you raise is one that might be associated with the synonym process, however. It is a very interesting topic, but there might be something else that the search engine is doing that causes that result.

Bill,
Excellent article, very useful. Thank you! Understanding how Google works with these things is usually impossible for me. I know there is a logic to it, I just cannot always see it. Thank you for your enlightenment. God bless you. We put all the learning acquired in this post into practice in our SEO work here in Brazil. Thanks.

Thanks. It was interesting to see a few blog posts from Google on this topic after they had made the change to start expanding queries by showing results for synonyms as well. I’m wondering if we will see even more posts like that from them. I hope so.

Thank you. I’m not quite sure that I understand the question you are asking. Ideally you want to try to identify some keyword phrases that you want to rank well for on specific pages, and treat every page as an opportunity to possibly rank for different terms or phrases. I do like to try to include some synonyms of those specific terms on the pages I’ve chosen for those terms as well.

So, if I write a page about antique cars, I might use the word “automobiles” on that page as well.