Assumptions behind Web Searches

This is the first in a series of posts on Google Custom Search Engines.

If you’re interested in how search works on the Web, you may want to spend some time exploring Google Custom Search. It enables you to create a site search for an individual site, or a customized search engine on specific topics that may focus upon a number of sites that you can select.

There’s another reason to start looking at Google Custom Search Engines, or CSEs. A recently published patent application from Google describes how the search engine may use information from CSEs to influence what we might see in Google’s Web search. This post is an introduction to the topic, and it covers how search engines attempt to identify the intent behind queries and web pages.

The patent application, Aggregating Context Data for Programmable Search Engines, includes a fairly well written statement (for a patent application) about one of the difficulties that search engines face when trying to come up with results to show searchers in response to queries. I thought it was worth sharing here, and it provides a nice introduction to a longer exploration of how Google CSEs might be used to improve web search.

That statement by the patent filing’s inventor, Ramanathan V. Guha, begins with a paragraph about the focus of information systems such as search engines:

The development of information retrieval systems has predominantly focused on improving the overall quality of the search results presented to the user. The quality of the results has typically been measured in terms of precision, recall, or other quantifiable measures of performance. Information retrieval systems, or “search engines” in the context of the Internet and World Wide Web, use a wide variety of techniques to improve the quality and usefulness of the search results.

These techniques address every possible aspect of search engine design, from the basic indexing algorithms and document representation, through query analysis and modification, to relevance ranking and result presentation, methodologies too numerous to fully catalog here.
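To make the two measures named in that passage concrete, here is a minimal sketch of how precision and recall could be computed for a single query. The document ids and relevance judgments are made up for illustration; the patent filing does not describe any particular evaluation code.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids the engine returned (hypothetical)
    relevant:  document ids judged relevant for the query (hypothetical)
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: what fraction of the returned results were relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: what fraction of all relevant documents were returned.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Made-up example: 4 results returned, 2 of them relevant,
# out of 5 relevant documents overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d7", "d8", "d9"})
print(p, r)
```

Neither number says anything about why the searcher ran the query, which is the gap the rest of the patent's background section focuses on.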

In many of my previous posts about search engine patents, I’ve stressed that the most important thing that we often might learn from the patent filings isn’t necessarily an insight into particular processes that a search engine might use, though learning something about those can be interesting.

Instead, a patent filing often introduces us to some of the assumptions that a search engine might be making about the Web, searchers, and even search itself. This particular patent gives us a glimpse at a few of those assumptions, which it goes on to challenge later in the patent:

Regardless of the particular implementation technique, the fundamental architectural assumption for search engines has been that the search engine’s operational model is fixed and non-alterable by entities external to the system itself.

That is, the search engine operates essentially as a “black box” that receives a search query, processes the query using a preprogrammed search algorithm and relevance ranking model, and provides the search results. Even where the details of the search algorithm are publicly disclosed, the search engine itself still operates only according to this algorithm and nothing more.

But what if a new idea entered the mix on how search engines might learn about pages on the Web that challenged that assumption to a degree? What if some external influences, such as those brought in by Custom Search Engines could help solve another problem described in the “Background of Invention” section of the patent filing that I’m taking these quotes from?

That other problem involves how a search engine might understand the intent behind a query, and the identification of pages that might fit a searcher’s intent. As the patent tells us:

An inherent problem in the design of search engines is that the relevance of search results to a particular user depends on factors that are highly dependent on the user’s intent in conducting the search (in other words, the reason they are conducting the search) as well as the user’s circumstances (in other words, the facts pertaining to the user’s information need).

Thus, given the same query by two different users, a given set of search results can be relevant to one user and irrelevant to another, entirely because of the different intent and information needs. Most attempts at solving the problem of inferring a user’s intent typically depend on relatively weak indicators, such as static user preferences, or predefined methods of query reformulation that are nothing more than educated guesses about what the user is interested in based on the query terms.

Approaches such as these cannot fully capture user intent because such intent is itself highly variable and dependent on numerous situational facts that cannot be extrapolated from typical query terms.

The patent filing provides a fairly simple example – you search for the name of a model of a camera, such as “Canon Digital Rebel.” Your intent might be to find information about the features the camera offers, to compare prices of the camera on a number of web sites, to actually buy one of the cameras, or to get support or help or customer service as the owner of a Canon Digital Rebel.

From your query, it isn’t really certain what your intent might be. What if the search engine could provide additional questions that might help pinpoint the intent behind your search?
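One way to picture that idea is the short sketch below, which turns an ambiguous query into a set of follow-up choices. The four intents come from the camera example above; the suffix wording and the function names are assumptions made for illustration, not anything the patent filing specifies.

```python
# Hypothetical intents and query suffixes, based on the camera example.
INTENT_REFINEMENTS = {
    "features": "reviews and specifications",
    "compare prices": "price comparison",
    "buy": "buy online",
    "support": "support and manuals",
}


def refinement_prompts(query):
    """Build follow-up choices for a query whose intent is unclear."""
    return [
        {"intent": intent, "refined_query": f"{query} {suffix}"}
        for intent, suffix in INTENT_REFINEMENTS.items()
    ]


def apply_choice(query, chosen_intent):
    """Return the refined query for the intent the searcher picked.

    An unrecognized intent leaves the query unchanged.
    """
    suffix = INTENT_REFINEMENTS.get(chosen_intent)
    return f"{query} {suffix}" if suffix else query


for prompt in refinement_prompts("Canon Digital Rebel"):
    print(prompt["intent"], "->", prompt["refined_query"])
```

The point of the sketch is only that the searcher, rather than the engine, resolves the ambiguity by picking one of the offered choices.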

In this introductory section of its description, the patent also pokes holes in a couple of other assumptions that search engines often make.

The first is that the search engine can look at your past history of queries and web pages visited to try to make an educated guess at the kinds of pages that you might be interested in seeing. This “predicting the future” by looking at the past may not accurately reflect your overall interests, the particular informational need behind your search for the name of a camera, or the situational facts behind your search.

Another approach that search engines sometimes take is to assume that the way some queries are phrased by searchers can be evidence of an intent to perform a particular kind of search, based possibly upon your prior searches, or past searches by others. Again, those past searches may not be a very good indication of the intent behind your query.

For instance, since your query includes a product name, “Canon Digital Rebel,” a search engine might assume that you want shopping sites based upon other searches by other searchers leading them to choose pages where they can buy a camera. But what if you want more information about the features of the camera, or you have a problem with a Canon Digital Rebel that you already own? Showing you sites where you can buy the Canon camera may not be the best choice.
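The weakness of leaning on what other searchers wanted can be shown with a small sketch. The log of past intents below is invented for illustration, and picking the single most common intent is a simplifying assumption, not a description of how any real search engine works.

```python
from collections import Counter


def majority_intent(past_intents):
    """Pick the most common intent among past searchers of a query and
    report the share of searchers who wanted something else."""
    counts = Counter(past_intents)
    top_intent, top_count = counts.most_common(1)[0]
    missed_share = 1 - top_count / len(past_intents)
    return top_intent, missed_share


# Made-up log for "Canon Digital Rebel": 6 buyers, 3 researchers, 1 owner.
log = ["buy"] * 6 + ["features"] * 3 + ["support"]
intent, missed = majority_intent(log)
print(intent, round(missed, 2))
```

With this invented log, showing only shopping sites would suit the majority while leaving a sizable minority of searchers with results that don't match what they were after.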

I’ll be expanding upon that in a future post. In the meantime, if you’d like to start exploring one possible way that Google might start challenging the assumptions pointed out in the start of this patent, you may want to spend some time with the FAQ pages for Google Custom Search.

Hi Bill,
I wonder if Google is ever disappointed in the data they glean from CSE. I would think that with a topical e-commerce website the intent is inherited from the original search query and the site’s algorithmically determined level of commercial intent. I grabbed the first 100 CSE queries from a few sites and overwhelmingly they are terms with modifiers or items that they couldn’t find in their organic search engine. Much of what they search for on our CSE does not exist or has become extinct. Yet since the site is topical, they appear to still use the CSE with modifiers or refinements in hopes of finding it.

Obviously I don’t have CSE data from but a few sites but I would be interested to know if others see a lot of CSE queries for “phantom” terms.

[Going off topic] I see Google morphed the old Quick Scroll from a Chrome plug-in to an on-by-default toolbar feature. Throwing a pop-up on someone’s site without consent is walking on dangerous ground IMO. And for some queries it throws heat maps and funnels out the window, especially if you’re a content-rich e-commerce site.

Thanks. To a degree, I think that’s what Google tries to do with its advanced search page, though it includes some things that you don’t mention, and doesn’t include some others that you do.

I think another assumption that most search engineers operate under is that they need to keep a search box as simple as possible, without all the checkboxes or radio buttons. We are seeing more specialized search links in the sidebar on Google that allow you to filter search results in a good number of different ways as well. Those may be somewhat influenced by the query term used, but many of them are the same or similar regardless of the query used.

But I think that some of the ways of potentially refining a query in a meaningful way may only be possible after an initial search is performed using a query. I think this is most true when the words within a query may potentially have a number of different meanings, or may potentially evidence more than one potential intent behind the search.

Even a simple search for “pizza” could be an attempt to find a local pizza place, a search for recipes, a search for different brands, and more.

I don’t mind an “interactive” type search, where intelligent choices of query refinements are presented after a search that can help improve the search results that I see for my query significantly, and in a manner that does a good job of matching the intent behind my search.

Improvements in search may come from using information gathered from multiple custom search engines to create a range of potential query refinements. That information could be used by itself, or in combination with user data that shows what kinds of sites someone clicks upon when they search for something like “pizza,” or that tracks how someone refines their query after being unsatisfied with the results shown for a broad search like “pizza.”
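As a rough sketch of that kind of aggregation, the code below counts how often labels from several custom search engines are attached to the sites returned for a broad query, and offers the most common labels as refinements. The data layout (one dictionary per CSE, mapping a site to its labels), the example domains, and the function name are all assumptions for illustration; the patent filings do not spell out a structure like this.

```python
from collections import Counter


def suggest_refinements(result_sites, cse_annotations, top_n=3):
    """Count how often each CSE label is attached to the sites returned
    for a broad query, and return the most common labels."""
    counts = Counter()
    for annotations in cse_annotations:  # one dict per hypothetical CSE
        for site in result_sites:
            counts.update(annotations.get(site, ()))
    return [label for label, _ in counts.most_common(top_n)]


# Made-up annotations from three CSEs, using placeholder domains.
cses = [
    {"slicehouse.example": ["delivery"], "doughbook.example": ["recipes"]},
    {"slicehouse.example": ["delivery", "brands"]},
    {"doughbook.example": ["recipes"], "frozenpie.example": ["brands"]},
]
results_for_pizza = ["slicehouse.example", "doughbook.example"]
print(suggest_refinements(results_for_pizza, cses))
```

Click data or follow-up queries could be folded in as extra counts in the same way, though how much weight each source deserves is an open question.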

I’m not quite sure that this CSE approach is ready for prime time yet, but it does seem like it could potentially provide information that might not be accessible elsewhere.

If I were a search engineer, trying to find a way to use information gleaned from multiple CSEs, I would attempt to work out a way to test those CSEs, monitor them, and see how effectively they work, as well as attempt to determine if they were being used as a way of spamming the search engines.

One of the previously published patent applications about CSEs does actually describe some ways that Google might identify spam CSEs. Google’s patent on trust rank, which covers using annotations for pages like the labels in CSEs, describes creating reputation scores for the people who create those annotations.

With any ranking system, it can really help if there are ways to weigh or score any ranking signals that might be used, and to determine their authenticity (or if they are attempts to manipulate rather than provide useful editorial signals on the meaning of a page or site).

Obviously I don’t have CSE data from but a few sites but I would be interested to know if others see a lot of CSE queries for “phantom” terms.

I’d like to see that kind of information as well.

Thanks for raising the quick scroll issue – I’m going to be testing it out some.

I think you can do that to some degree, but there’s a limit. If you were going to change around Google’s Advanced Search page, what kinds of things would you add to it? It has some of those features that Mick mentions, but not all of them.

In an ideal world, you would also type in a query that makes it more likely that you’ll get a good match for what you intended to search for. Instead of just typing “pizza” into a search box, you would type in “pizza places in 90210” for example, to find a local pizza place, or “pizza recipes” to find out how to make pizza. Yet for either kind of intent, many people still just type “pizza” into a search box.

If you were looking for help with your “Canon Digital Rebel,” instead of just typing the brand and model names into a search box, you might also add a word like “support” to your query. But many people don’t.

Also, there are some topics that you may search for that you don’t know much about, and may not be able to come up with that perfect query that can help you find the results you want to see. Giving people some good choices for potential query refinements after their initial search could be very helpful.

Search engines likely all have multiple algorithms that can rank sites differently based upon things such as the query term used, the kind of intent that those queries might evidence, the locations and languages of searchers, and more. But ultimately, the goals of the search engines are the same – to try to present meaningful and useful results to searchers, so that searchers continue to use the search engines.

Understanding some of the assumptions behind how search engines work can help you make more intelligent choices when you have web pages that you hope will be found through those search engines.

Looking back on this, I still feel that Google hasn’t perfected its search results based on what type of content is searched for. This is something that I feel Google can really go after without having to think of the possible drawbacks; CSEs provide a way to get more accurate results, and there is no debating that. I still don’t see why Google hasn’t invested more of its time & money into it.

Google has been around for more than a decade now, and yet I suspect that if you ask anyone from Google about the maturity of search, chances are that they might tell you that it’s still in its infancy, and that there’s a lot more that can and will be done.

Google has put a fair amount of effort into CSEs, and into a lot of other areas as well. We don’t always hear about the progress that the search engine makes in areas like custom search engines, but there have been a number of patents filed from Google involving CSEs, and there’s a good chance that they play some role in helping rank pages in Web search, and helping come up with query refinements that you see in Web search as well.

This is a very interesting post. Great insight into the future of search.

I have to agree with Bill that the interactive type search offering query refinements is a better option than check boxes and probably a simpler solution. Those check box selections/preferences could change from search to search and become a burden.

Mick, I do like the principle behind the thought, which is to allow the user to be more interactive or control their search experience.

I just started reading your blog Bill. Very unique, unlike other typical SEO blogs. I love this angle and perspective for insight into the future of search. Thanks Cyrus for pointing me in Bill’s direction with your post “20 Future Facing SEO Blogs 2012”.

SEO by the Sea focuses upon SEO as the search engines tell us about it, from sources such as patents and white papers from the search engines. This information about SEO is tempered by years of experience from the author of the site, who has been doing SEO since the days when search engines started appearing on the Web.