Google Patent Granted on Semantic Units (Meaningful Compounds)

When searchers type a query into a search engine, it isn’t uncommon for them to use more than one word. It also isn’t unusual for those words to be a semantically meaningful phrase rather than just a list of keywords.

Multiple search terms entered by a user are often more useful if considered by the search engine as a single compound unit. Assume that a user enters the search terms “baldur’s gate download.”

The user intends for this query to return web pages that are relevant to the user’s intention of downloading the computer game called “baldur’s gate.” Although “baldur’s gate” includes two words, the two words together form a single semantically meaningful unit.

If the search engine is able to recognize “baldur’s gate” as a single semantic unit, called a compound herein, the search engine is more likely to return the web pages desired by the user.

A Google patent, originally filed in 2000, was granted this week on a method that enables a search engine to understand when more than one word is used together as a single semantically meaningful unit.

A search engine for searching a corpus improves the relevancy of the results by classifying multiple terms in a search query as a single semantic unit. A semantic unit locator of the search engine generates a subset of documents that are generally relevant to the query based on the individual terms within the query. Combinations of search terms that define potential semantic units from the query are then evaluated against the subset of documents to determine which combinations of search terms should be classified as a semantic unit. The resultant semantic units are used to refine the results of the search.

Paying attention to these types of compounds might find use in reranking search results so that pages which contain a compound are considered more relevant than documents that contain the individual words but not the compound.
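To make the reranking idea concrete, here is a minimal sketch in Python. It is not the patent's method: the function name rerank_by_compound is hypothetical, documents are plain strings, and a crude case-insensitive substring test stands in for real phrase matching against an index.

```python
from typing import List


def rerank_by_compound(ranked_docs: List[str], compound: str) -> List[str]:
    """Move documents that contain the compound ahead of those that do not.

    Assumption: ranked_docs is already in relevance order. sorted() is
    stable, so the original order is kept within each of the two groups.
    """
    phrase = compound.lower()
    # The key is False (sorts first) when the phrase is present.
    return sorted(ranked_docs, key=lambda doc: phrase not in doc.lower())


results = [
    "the gate near baldur's house",
    "baldur's gate walkthrough",
    "gate guide",
    "buy baldur's gate",
]
reranked = rerank_by_compound(results, "baldur's gate")
```

In this toy list, the two pages containing the exact phrase move to the top, and the pages that only mention the individual words keep their relative order below them.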

Use of a compound might also be helpful in finding query refinements – semantically meaningful alternatives. For the “baldur’s gate” example quoted above, a semantically meaningful alternative may be “baldur’s gate reviews” (i.e., written reviews of the game).

Conventional approaches to compounds in queries

Compounds in queries could be identified based upon a matching of a list of previously identified compounds and upon statistics that describe the relative frequency of occurrence of the compounds.

The first approach involves extracting compounds from the web, and looking for word sequences that occur with a statistically significant frequency. The problem with this method is that it would likely generate a much larger list of compounds than people would ever use to search with in queries, and only a small fraction of identified compounds would ever be used.
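As a rough illustration of the corpus-based approach, the sketch below counts adjacent word pairs across a handful of documents and keeps the pairs that appear at least a minimum number of times. The function name frequent_bigrams and the min_count cutoff are assumptions for illustration only; a real system would likely use a proper statistical significance test rather than a raw count.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def frequent_bigrams(documents: Iterable[str], min_count: int = 2) -> List[Tuple[str, int]]:
    """Return two-word sequences seen at least min_count times, most frequent first."""
    counts: Counter = Counter()
    for text in documents:
        tokens = text.lower().split()
        # zip(tokens, tokens[1:]) yields each pair of adjacent words.
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return [(bigram, n) for bigram, n in counts.most_common() if n >= min_count]


corpus = [
    "country western music",
    "classic country western songs",
    "western migration history",
]
compounds = frequent_bigrams(corpus)
```

Even on this tiny corpus the weakness is visible: the list is driven entirely by what appears in documents, with no regard for whether anyone would search for it.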

The second approach involves extracting compounds from query logs. That has problems of its own, which stem from how people phrase their searches. As the patent explains:

A disadvantage associated with finding compounds in query logs using statistical techniques is that word sequences occurring in query logs may not correspond to compounds in the documents. This is because queries, especially on the web, tend to be abbreviated forms of natural language sequences. For example, the words “mp3” and “download” may occur together often in query logs but “mp3 download” may not occur as a compound in a document.

Another issue, and where the word “semantic” comes into play in this document, is that the meaning of the query is important:

A disadvantage of both corpus and query log based techniques, and indeed of any technique relying purely on previously detected compounds and on statistics to segment a query, is that they tend to ignore the meaning of the query. Such techniques may identify a compound that is not consistent with the meaning of the query, which can negatively impact applications that rely on the compound as being a semantic unit within the query.

For example, the queries “country western mp3” and “leaving the old country western migration” both have the words “country” and “western” next to each other. Only for the first query, however, is “country western” a representative compound. Segmenting such queries correctly requires some understanding of the meaning of the query. In the second query, the compound “western migration” is more appropriate, although it occurs less frequently in general.

Finding semantic units

How are compounds identified based on the overall context of a user query?

1) Individual search terms in the query are matched to an index of the Web, and substrings of the query are generated. For each of those generated substrings, a value is calculated that relates to the portion of the identified documents that contains the substring. Semantic units are selected from the generated substrings based on those calculated values.

2) The list of relevant documents for those searches is refined based on the selected semantic units.

3) Semantic units might be chosen from a predetermined number of the most relevant documents in the list returned by the ranking component.

4) “Relevance” in this context could be defined based on factors that could include the proximity between query words (pages in which the query words are close to each other are considered more relevant) and the order of the words in the returned document (e.g., documents in which the query words are in the same order as in the query phrase are considered more relevant).

In other words, a search is performed by first identifying documents related to the individual terms in the query. Compounds are then selected using a methodology based on the rate of occurrence of the compound within the identified documents. Results are reranked based upon the use of those compounds.
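The steps above can be sketched in a few lines of Python. This is a simplified reading of the process, not the patent's implementation. The function names, the top_k and threshold parameters, and the rule that a candidate document must contain every query term are all assumptions, and an in-memory list of strings (already in ranked order) stands in for an index of the Web.

```python
from typing import Dict, List, Tuple


def query_substrings(terms: List[str], min_len: int = 2) -> List[Tuple[str, ...]]:
    """All contiguous multi-word substrings of the query, longest first."""
    n = len(terms)
    spans = [tuple(terms[i:j]) for i in range(n) for j in range(i + min_len, n + 1)]
    return sorted(spans, key=len, reverse=True)


def contains_phrase(tokens: List[str], phrase: Tuple[str, ...]) -> bool:
    """True if the phrase occurs as consecutive tokens in the document."""
    k = len(phrase)
    return any(tuple(tokens[i:i + k]) == phrase for i in range(len(tokens) - k + 1))


def find_semantic_units(query: str, docs: List[str],
                        top_k: int = 10, threshold: float = 0.5) -> Dict[str, float]:
    """Score each query substring by the fraction of top documents containing it."""
    terms = query.lower().split()
    tokenized = [doc.lower().split() for doc in docs]
    # Step 1: documents matching the individual terms (assumed: all of them),
    # limited to a predetermined number of top results.
    candidates = [t for t in tokenized if all(w in t for w in terms)][:top_k]
    if not candidates:
        return {}
    units: Dict[str, float] = {}
    for span in query_substrings(terms):
        fraction = sum(contains_phrase(t, span) for t in candidates) / len(candidates)
        if fraction >= threshold:  # hypothetical cutoff for "is a semantic unit"
            units[" ".join(span)] = fraction
    return units


pages = [
    "download baldur's gate for free",
    "baldur's gate patch download page",
    "the gate of baldur's castle download",
    "unrelated page about cooking",
]
units = find_semantic_units("baldur's gate download", pages)
```

With these four toy pages, three match all of the query terms, and “baldur's gate” appears as a phrase in two of the three, so it is the only substring that clears the cutoff. The selected units could then be used to rerank the candidate documents.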

A very interesting post. It is generally very enlightening to read the posts in this blog.

In Natural Language Processing, the term “multi word token” is frequently used for what this post calls a “Semantic Unit”. The problem is mostly discussed in the context of the “Named Entity Recognition” task, though it is not limited to it.

An interesting approach to recognizing “Meaningful Compounds” is the use of lexical chains, which has spawned a lot of publications and academic work, particularly in automatic text summarization.

Thanks for sharing your comments about this–you do a great job of explaining it.

When the individual search terms in the query are matched to an index of the Web, though, I’m assuming that they’re taking out the stop words, right?

What’s also interesting, though, is that I think this all plays a more important role as searchers continue to use more keywords in their searches (i.e., searches are getting longer). It used to be that people would use one or two words, then three, and now we’re seeing them use more and more words.

Thanks. I’m not sure that stop words are being removed during this process. The examples in the patent include the word “the” within the multi-word substrings being investigated to see if they are meaningful compounds.

And if we search Google for a string like “To be or not to be,” (without the quotes), there isn’t any mention of stop words being removed. I wonder if that is because the search engine is recognizing “to be” and “not to be” as semantically meaningful units.

There is a message about the use of the word “or” there, though:

Try uppercase “OR” to search for either of two terms. [details]

The role this may play in longer queries is interesting. If there is more than one semantically meaningful compound in a query, is this approach helpful, or could it focus so heavily on one compound that it pushes down results containing the second?

There’s a lot more to search, and to SEO, than most people realize. It’s not something that can really be automated in any way, and the topic and our knowledge of it evolve over time. The good thing is that you are looking for information, and learning, and that will help you significantly.

I like looking at patents and patent applications because they are primary sources of information – they come directly from the search engines themselves. I try not to look at them for the processes that they describe as much as for the assumptions being made in them, and the insights that they might provide into how the people building search engines approach the problems that they face.

Stick with it, try to learn something new every day, and you’ll surprise yourself by how much you know in a few months, and in a few years.
