How a Search Engine May Measure the Quality of Its Search Results

When you try to gauge how effective your website is, you may decide upon certain metrics to measure its impact. Those may differ based upon the objectives of your pages, but could include things like how many orders you receive for products you offer, how many phone calls you receive inquiring about your services, and how many people sign up for newsletters, subscribe to your RSS feed, or click upon ads on your pages. They could include whether people link to your pages, or tweet or +1 articles or blog posts that you’ve published. You may start looking at things like bounce rates on pages that have calls to action intended to have people click upon other links on those pages. You could consider how long people tend to stay upon your pages. There are a range of things you could look at, measure, and take action upon to determine how effective your site might be.

A search engine is no different, in that the people who run it want to know how effective their site is. A patent granted to Yahoo today explores how the search engine might evaluate pages ranking in search results for different queries, and looks at a range of possible measurements that it might use. While this patent is from Yahoo, expect that Google and Bing are doing some similar things. And while Bing is providing search data for Yahoo, Yahoo’s results might still be presented and formatted differently than Bing’s results, and include additional or different content as well. As a matter of fact, Yahoo recently updated its search results pages.

One of the problems you might run into when attempting to see how well your site works is determining how well the metrics you’ve chosen actually measure that. A problem that plagues large sites is that they are so large that it can be difficult to determine which metrics work best. Yahoo’s approach uses machine learning to determine the effectiveness of different “search success” metrics.

A system and method for development of search success metrics. A plurality of search engine result pages are collected and a target page success metric is determined for each page. A plurality of machine learned page success metrics are trained using a first subset of the search engine result pages and each result page’s respective target page success metric, wherein each of the machine learned page success metrics is trained to predict the target page success metric for each of the first subset of search engine result pages. A predicted target page success metric is predicted for each of a second subset of the search engine result pages using each of the machine learned page success metrics. The accuracy of each of the machine learned page success metrics in predicting the target page success metric associated with each of the second subset of search engine result pages is then evaluated.

One of the things I like to do when looking at a patent like this is to see if I can learn a little more about the people behind it. A look at inventor Laurence Wai’s LinkedIn profile shows that he is now the senior manager in charge of analytics at Groupon. The LinkedIn profile describes some of the work he did while at Yahoo, and also a little about his involvement in transitioning Bing results to fit into Yahoo pages. He is also co-author of a paper titled Web Search Result Summarization: Title Selection Algorithms and User Satisfaction (pdf), which includes as authors a couple of other Yahoo researchers as well as a Microsoft search engineer. The paper introduces the topic of “search success,” which is the focus of this patent.

The patent presents a number of different approaches to measuring search success including “presentation, ranking, diversity, query reformulation, SRP enhancements, and advertising.”

The focus behind this patent is to take a measurement that might have been shown in the past to be highly reliable in measuring the effectiveness of search results or pages, but which might be either too costly or too time-consuming to measure on an ongoing basis, and develop ways to predict how well a particular page might fulfill that metric. For example, if dwell time, or the amount of time someone spends upon a page, is a useful measurement for determining how well that page meets the needs of a searcher, are there other metrics that a machine learning system can use to predict dwell time for a page?
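To make that concrete, here’s a minimal sketch of the idea in Python. The data, the two toy predictors, and their input features (click-through rate and result position) are all invented for illustration; the patent’s actual candidate metrics would be machine-learned and far richer. The workflow matches the abstract: train candidate metrics on one subset of result pages, then score how well each one predicts the expensive target metric (here, dwell time) on a held-out subset.

```python
# Each record: (click_through_rate, result_position, observed_dwell_seconds)
# All values are made up for illustration.
pages = [
    (0.42, 1, 95), (0.31, 2, 80), (0.18, 3, 40),
    (0.09, 4, 22), (0.05, 5, 15), (0.38, 1, 88),
    (0.25, 2, 60), (0.12, 3, 35),
]
train, holdout = pages[:5], pages[5:]

def fit_ctr_model(data):
    """Least-squares slope through the origin: predicted dwell ~ k * CTR."""
    k = sum(c * d for c, _, d in data) / sum(c * c for c, _, _ in data)
    return lambda ctr, pos: k * ctr

def fit_position_model(data):
    """Average of dwell * position, applied as an inverse-position estimate."""
    k = sum(d * p for _, p, d in data) / len(data)
    return lambda ctr, pos: k / pos

def mean_abs_error(model, data):
    """Evaluate a candidate metric against the observed target metric."""
    return sum(abs(model(c, p) - d) for c, p, d in data) / len(data)

candidates = {"ctr": fit_ctr_model(train), "position": fit_position_model(train)}
for name, model in candidates.items():
    print(name, "holdout MAE:", round(mean_abs_error(model, holdout), 1))
```

On this toy data, the CTR-based candidate predicts the target dwell time more accurately than the position-based one, which is exactly the kind of comparison the patent’s evaluation step is meant to produce.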

The patent uses the phrase “search success” to refer to the overall ability to measure how effective the search engine might be in displaying useful search results to searchers. It also refers to “page success metrics” for different types or families of measurements that might reliably be used to evaluate the success of search results. These different classes of “page success metrics” could be ranked based upon how reliable they might be perceived to be, and the patent presents a general rule about them:

It is also generally true that the higher a class is ranked, the greater the cost of obtaining the metric.

The hardest and most valuable metric, direct feedback from a searcher on the value of results, is considered within the realm of unobtainium by the author of the patent, who tells us that there are presently no known techniques for “directly evaluating the user’s perceptions of search page results.”

Providing a searcher with the ability to report whether or not a set of search results were useful is close, and it’s regarded as helpful though it can be biased by limitations of self reporting.

Next on the list in a hierarchy of metrics are target page success metrics, such as click-through rates on search results. For instance, the search engine might look through its query logs and see whether or not pages were clicked within search results for specific queries, and which pages were clicked.
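As a rough illustration (the log format here is invented, not taken from the patent), per-result click-through rates for a query could be tallied from log entries like this:

```python
from collections import Counter

# Each entry: (query, clicked_url) with None for an abandoned search.
# Toy data for illustration only.
log = [
    ("espn", "espn.com"), ("espn", "espn.com"), ("espn", None),
    ("espn", "wikipedia.org/wiki/ESPN"), ("espn", "espn.com"),
]

impressions = sum(1 for q, _ in log if q == "espn")
clicks = Counter(url for q, url in log if q == "espn" and url)

# Click-through rate per result: clicks on that URL / query impressions
ctr = {url: n / impressions for url, n in clicks.items()}
print(ctr)  # espn.com gets 3 of 5 impressions, a CTR of 0.6
```

A real query log would of course be vastly larger and would distinguish positions, sessions, and timestamps, but the basic tally is the same idea.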

These types of clicks might be tempered with editorial judgments, such as whether or not the search results were in response to navigational or non-navigational queries, and whether a particular page might have been placed at the top of the search results because the query was perceived to be navigational. For example, if I type [espn] into a search box, chances are that I want to visit the ESPN website rather than search for information about ESPN. If people search for ESPN and tend to look at pages other than the ESPN website, it might cause the search engine to question the value of showing ESPN first in a set of search results.
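One simple heuristic for that kind of judgment (my own illustration, not the patent’s method) is to flag a query as navigational when a single result absorbs the overwhelming share of its clicks:

```python
from collections import Counter

def looks_navigational(clicked_urls, threshold=0.8):
    """True if one URL receives at least `threshold` of all clicks for a query."""
    counts = Counter(clicked_urls)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(clicked_urls) >= threshold

# Invented click data: [espn] clicks pile onto one site,
# while [car insurance] clicks are spread across many.
espn_clicks = ["espn.com"] * 9 + ["wikipedia.org/wiki/ESPN"]
insurance_clicks = ["geico.com", "progressive.com", "statefarm.com",
                    "geico.com", "allstate.com"]

print(looks_navigational(espn_clicks))       # True: 90% go to one place
print(looks_navigational(insurance_clicks))  # False: no dominant destination
```

The 0.8 threshold is arbitrary here; a search engine would presumably tune such a cutoff, or combine click concentration with editorial review.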

In addition to clicks on results, another metric that might be used to evaluate search results is dwell time. Rather than looking at the amount of time spent upon a page, this dwell time would be a comparison of the timestamps associated with different actions on a search result page.
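A small sketch of that timestamp comparison, using an invented event format: dwell time is measured as the gap between clicking a result and the searcher’s next action back on the search page, rather than anything measured on the destination page itself.

```python
# Invented SERP event log: (timestamp_seconds, action)
session = [
    (0,   "serp_shown"),
    (5,   "click_result_2"),
    (12,  "back_to_serp"),   # only 7s away: a likely quick bounce
    (15,  "click_result_1"),
    (140, "new_query"),      # 125s away: a much better sign
]

def dwell_times(events):
    """Pair each result click with the timestamp of the next SERP-side event."""
    dwells = []
    for (t1, action), (t2, _) in zip(events, events[1:]):
        if action.startswith("click_result"):
            dwells.append((action, t2 - t1))
    return dwells

print(dwell_times(session))  # [('click_result_2', 7), ('click_result_1', 125)]
```

Note that this only works when the searcher comes back to (or acts again on) the search engine; a click that ends the session leaves the dwell time unknown, which is one reason cheaper proxy metrics are attractive.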

The patent also refers to the use of a Discounted Cumulative Gain approach to determining the quality of search results, where the search engine might look to see if more highly relevant results appear more highly within search results. An interesting paper jointly written by researchers from Yahoo and Google explored some of the problems with that approach in 2009, and I wrote about it in the post Evaluating the Relevancy of Search Results Based upon Position.
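For reference, the standard DCG formula (a general information-retrieval measure, not specific to this patent) discounts each result’s relevance gain by its position, so a ranking that puts relevant results first scores higher than the same results in reverse order:

```python
import math

def dcg(relevances):
    """DCG = sum over positions i = 1, 2, ... of rel_i / log2(i + 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

good_ranking = [3, 2, 1, 0]   # most relevant results listed first
bad_ranking  = [0, 1, 2, 3]   # the same results, reversed

print(round(dcg(good_ranking), 3))  # higher score
print(round(dcg(bad_ranking), 3))   # lower score
```

The relevance grades (0–3 here) typically come from human editorial judgments, which is part of why this class of metric is expensive to collect at scale.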

Conclusion

If you’ve spent some time thinking about how a search engine might evaluate the quality of its search results, you may want to spend some time with the patent to learn more about some of the approaches that they might be exploring.

Google’s Panda updates are similar in that they focus upon identifying a series of metrics that can help identify relevant and higher quality results that might gain more clicks in search results and more successful searches. Like the process described in this Yahoo patent, one of the issues that Google needs to contend with is determining how well the different metrics they’ve decided upon using in Panda might predict click throughs, dwell time, and other search success measures.

30 thoughts on “How a Search Engine May Measure the Quality of Its Search Results”

Could Yahoo be making these changes because of the Google Panda update? Are they now assessing search metrics because they are considering making changes to their algorithm, or am I well off the mark?

This concept of “dwell time” is pretty intriguing and is of particular concern when you consider that what appears in the context of your particular link on a SERP may not be exactly what you want, but rather what the search engine guesses is most relevant.

It’s interesting to see this; I think it shows that Yahoo is doing something which should have a similar effect to the Google Panda update. As always, search engines are trying to improve their results; it’s just how they choose to measure relevancy that is the most intriguing. A lot of webmasters have trouble manually measuring their website’s success, while search engines have to do it through an algorithm.
Going to hedge my bets on social media playing a much bigger part in search in the future; measured correctly, it has to be one of the most truthful indicators of success.

I had lunch with a former Google search engineer who worked for Google over 4 years ago. He mentioned to me that even then they were measuring click-through rates relative to verticals to determine the quality of SERPs. A metric like CTR, combined with the amount of time it takes a searcher to return to the SERP via the back button, for example, may be a great way to test SERPs with high search/traffic volumes.

Take this a step further with Google being able to identify brands/entities. If a certain page showed up in a result set and a tracked user conducted brand searches combined with similar queries later, that could say a lot about the quality of the original search results.

Wait, so what exactly is “search success”? I went through your post understanding what the patent was looking for in ranking metrics (search success), but I still don’t entirely understand how Larry Wai defines that. I’d also like to point out that the kind of research you do into these patents (i.e., looking into who Larry Wai is) really sets your blog apart from others, Bill. I haven’t been able to read & comment as much lately because my blog has started to grow faster than I expected, but I’ll try to be more involved on this blog in the future like I have been in the past.

The direct feedback is to me probably the most important metric along with the amount of time spent clicking on top search results. If the top search results aren’t getting the clicks, there is a problem with the quality of the rankings.

This patent was filed in 2009, and Google Panda didn’t happen until 2011, so it’s not a reaction to Panda. The value of a search engine is tied to the quality and breadth of the results that it returns in response to searches. So, it’s not unusual for Google, Yahoo, and Bing to each have their own approach to estimating the value of those results.

The fact that the search engines are all attempting to come up with a process like Panda in some ways, where they find machine-measurable metrics that can predict things like click rate and dwell time, makes a lot of sense. If user reactions to pages are reliable ways to gauge how much people like pages they find in search results, then if you can find things to measure that might indicate that people will like those pages, it makes sense to use them.

I guess that they figure that if a very large number of people will click on a link to a particular page in a set of search results for a certain query, and a high percentage of them seem to go to that page, leave it quickly, and look at other pages or perform another search, that the page isn’t as relevant for that query as it might seem to be.

I think social media will play a larger role in the future, but probably not based solely upon something like popularity. I think that might make it too easy for people to game search results.

Instead, I think different social media participants will have different author or reputation scores that can influence how much of an impact they might have. Instead of a direct vote = increased ranking, an annotation or label from those social participants about a page or site in question might influence how it might rank for a range of queries.

We know that the Direct Hit search engine was using click-throughs as part of its search ranking algorithm because it was granted a patent on it. Ask Jeeves ended up buying Direct Hit and using its algorithm (along with other ranking signals), but I would suspect that all of the other search engines have been exploring the value of click-throughs as a ranking signal since. I think it makes a lot of sense to use other signals, and to use click-throughs as feedback to determine how well those other signals might predict the clicks.

Good point about the search engine tracking more than just individual searches, and looking at actual query sessions where clicks and subsequent searches might provide some interesting data. I believe that’s happening as well.

Search success is simply finding ways to measure how successful a search engine seems to be at delivering people to pages that might satisfy their queries.

Larry Wai defines a range of different ways to measure those types of success, from the ones that might provide the best measure of success but might be hard to gather information about, to others that might not be as effective in providing the best measures but are easier to collect information about.

It’s good to see you, Jon. I hope I do get to see you around here more frequently.

It’s hard to tell the motivations behind direct clicks, even though they seem to be a good indication that people like the results they see. If you can come up with another set of metrics that can predict things like direct clicks and dwell time, and then compare them to those direct measures, then you not only have a good measure to use consistently in the future, but also a set of things (the direct clicks and dwell time) to compare them to.

Thanks. There are a few reasons why I like to do things like learn about the people behind the patents. It allows you to see what kinds of things they’ve worked on in the past, and where. It can give you the chance to see if there are other things they’ve written or created that are similar as well. I’ve found a number of times in the past that when someone has published a patent for something, they’ve also written a paper about it that doesn’t have all the legal language of the patent. I try to link to those when I can find them.

Social media does give the search engines a whole different set of data to look at and think about how to use in ranking pages. It does make sense for them to consider how they can use that data, and experiment with its use.

Small and medium sized businesses can also get involved in social media to market and promote their businesses, share ideas with others, and so on. I think the quality of those interactions are more important than the quantity, especially from sites like Twitter and Google Plus.

There’s certainly been some improvement after Panda when it comes down to the quality of Google SERPs. However damaging for many spammers out there, the new update is arguably one of the most critical we’ve seen. It’s still unclear to me whether Google can and will guess a website’s bounce rate as a criterion to downgrade or upgrade it when the webmaster has NOT installed Analytics.

Success metrics are always super important – the tricky part is making sure that everyone in the team is aiming in the right direction.

Considering Yahoo’s previous predicament, I would suppose that market share would be a priority, followed by revenue. Setting aside the other benefits of Yahoo (in the last deck I saw, they still claimed to be the most visited portal on the web, they have Yahoo Mail, etc.), this could be brought about by increasing the stickiness of users choosing to conduct their searches using Yahoo.

The error parameter mentioned for the search results is interesting, where the error margin is determined by the relationship between the number of page views and the final page view with the successful result, although the first two results are dictated as navigational.

Excluding non-navigational links, they also seem to prioritise a feature based on “discounted cumulative gain”, where editorial searches are given a relevance ranking.

Considering that they’ve included parameters for search sessions within the patent, it seems like an important step in improving the results that they give, with a view to improving the user experience of Yahoo – which could result in growth in market share.

The only thing I perhaps would have expected to see is a forecast around the number of searches around a topic to define its result. So where a navigational search would result in a single click-through from a single search for a single session, for something non-navigational I would anticipate a longer tail of searches. (From the banking sector, I believe we forecasted this at between 8 and 12 searches.) That would include navigational terms (bank brands), generic terms (say, car insurance), and then long-tail niche terms (car insurance for a Vauxhall Tigra in Islington / cash back car insurance comparison site).

If the topic or sentiment of the search could be interpreted then perhaps a prediction on the subsequent search could be determined.

I think I’ve seen better search results after the Panda updates, though it appears that there has been some collateral damage along with those results, where sites with quality content are being impacted negatively.

It’s really not certain that Google is using bounce rate as a ranking signal, and it appears more likely that they are looking at other metrics that might predict things like bounce rate and dwell time instead.

Whether or not a site has Google Analytics installed shouldn’t really be a factor. Google has so many other ways to measure things like that, including noting the time when someone selects a page in search results and comparing it to the time of their next action on the search engine, as well as Google Toolbar browsing data, and possibly additional information from Google Chrome.

Interesting thoughts. A Microsoft patent from not too long ago also added information about browsing search trails to the mix, to see how people actually got to the “final” pages that they ended up upon as a result of a search. In some ways, that’s similar to the idea of following up to see which additional searches people conducted around a specific query or topic, to use additional user information beyond an initial search.

Yahoo does seem to have some competing interests, in both trying to get visitors to spend as much time on their own pages as well as providing better results that send visitors away to other pages. The higher the quality of search results they provide, the more people would return to search on their site.

There are some issues with using a discounted cumulative gain approach, in that the “predictions” of how many clicks results should have at certain places within search results can be skewed by the presence of some very relevant results immediately above them. (Navigational results are a good example, as you’ve pointed out.)

I’m not sure that a metric like how long someone stays on a page will be a direct ranking factor, but it does seem like a good way to predict rankings, and that’s the point of this patent.

If a search engine can look at different features on a page and determine whether those features are likely to increase or decrease the amount of time someone might spend on it, that may be what the search engines are doing, using actual dwell time, when available, as feedback to see how well their algorithm performs.

That’s the reason why I said in the conclusion of this post that this approach looks like what Google is trying to do with its Panda updates. They aren’t relying directly on user information, but rather on how they can identify different features that might predict that user information.

For example, if a web page is mostly advertisements, with little actual information about a specific topic related to the query term or phrase the page has been ranking well for, those particular features might indicate that people won’t spend much time on that page, but might instead click upon an ad on the page or return to the search results quickly. Being able to identify those features as negative ones that would probably result in a short dwell time would indicate to the search engine that perhaps it should rank that page lower.
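A purely hypothetical sketch of that kind of feature-based judgment, with hand-set rules standing in for what would really be learned weights (every feature name and threshold here is invented for illustration):

```python
def predicted_short_dwell(ad_block_count, main_content_words, query_term_mentions):
    """Crude hand-set rules standing in for a trained model's learned weights."""
    ad_heavy = ad_block_count >= 5          # page is dominated by ad units
    thin = main_content_words < 200         # very little actual content
    off_topic = query_term_mentions == 0    # content doesn't address the query
    return sum([ad_heavy, thin, off_topic]) >= 2  # two or more red flags

# An ad-heavy, thin page vs. a substantial, on-topic page
print(predicted_short_dwell(ad_block_count=7, main_content_words=120, query_term_mentions=1))  # True
print(predicted_short_dwell(ad_block_count=1, main_content_words=900, query_term_mentions=6))  # False
```

In the patent’s framework, a prediction like this would then be checked against observed dwell time wherever it is available, to see whether the features actually earn their keep.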

This ties in well with a change that I have seen on my own website. Like some of your other commenters, I have been reading a lot about social media and brand playing a bigger role in SERPs in 2011. I have just noticed an indicator of this with my own website.

My company name “Sussex SEO” is a locally searched term but my home page is set up around a more popular local term “web design Sussex”.
The company name should be harder to rank for in the SERPs. Even though the page is set up around a different term, it comes in higher for the company name, which has stiffer competition.
I’m sure that this has a lot to do with click through rate.
It makes a lot of sense that people typing “Sussex SEO” are largely people who are looking for me, hence the high click-through rate. People typing “web design Sussex” are looking for a web designer in Sussex. My percentage of clicks for this term is obviously much lower.

There’s a good probability that Google has associated the Sussex SEO phrase with your website, likely based upon click-throughs in search results as well as other factors, such as business profile links and other mentions helping to create that association on pages other than your own.

With the “web design sussex” phrase, there’s a good chance that Google doesn’t have as high a degree of confidence that, when people perform that search, they are looking for your site, so an entity association (or matching query/document category assignment, which is another possibility) between the phrase and your site may not be boosting your page in search results.

“There’s certainly been some improvement after Panda when it comes down to the quality of Google SERPs. However damaging for many spammers out there, the new update is arguably one of the most critical we’ve seen.”

No doubt wholesale penalization would remove a lot of bad pages, but at what cost to innocent sites? You could also see which strata of society are more likely to commit crimes and kill them all, just to be safe.

Bill, one day you should talk about the paradox of being penalized *sitewide* by bad user engagement statistics when Google can screw them up for you. The more you are penalized, the worse the terms you may rank for become, leading to unhappy users and more penalization for you. It becomes a spiral of death.

Google’s Panda doesn’t appear to directly include user-engagement statistics to rerank search results, but rather attempts to find features on sites, through a machine learning approach, that might predict things like click-throughs, dwell time, and other search success metrics.

As I’ve written in a few different posts, it does appear that there is some collateral damage, in that some sites that do offer quality content and good user experiences are being impacted by Panda, and I do wonder and worry about those sites. I am trying to share what information I can find to help impacted sites make changes that might help them avoid situations like that.

I’m still working through the instances where a site might be impacted on a sitewide level, and I think that does happen in a number of cases, but not all. I’d definitely recommend working on channels other than just a specific search engine to deliver visitors to your site, as well.

Great job researching the patent. The greatest metric would be direct user feedback.

“Providing a searcher with the ability to report whether or not a set of search results were useful is close, and it’s regarded as helpful though it can be biased by limitations of self reporting.”

I don’t see why having some sort of button to allow users to vote “helpful” or “not helpful” wouldn’t work. Of course there would be some bias, but it wouldn’t have to be weighted heavily in the results.

The best user feedback is probably a searcher explaining their thoughts and intentions as they are searching, whether that’s something they are doing directly to someone or something that they’ve recorded as they are performing searches. Why they choose certain query terms. What they might have expected to see in the search results. Why they might think some results might be inappropriate. What their thoughts are when they refine the queries they are using. Why they might give up on some searches altogether.

In my most recent post, How a Search Engine May Automate Web Spam Reports and Search Feedback, I’ve written about a Microsoft patent that appears to describe how they may be handling and using feedback from searchers. One of the points in the patent is that they might consider simple voting mechanisms. I think we’ve seen Google try something like that a few times in the past, such as the inclusion of smiley faces and sad faces in the Google Toolbar that you could use to vote upon a site that you visit, the present-day +1 button in search results (and on pages whose authors added it), and the present-day ability to block some sites for specific queries.