April 24, 2009

How *not* to rate a search engine

While I was demoing Live Search at the Web 2.0 Expo, people continually asked the same questions: “What makes Live different?” or “Show me some features that will make me want to switch from my search engine” or the extremely confrontational “Why do you think you’re better than Google?”

My first instinct was to dive in and show people the coolest features in Live Search (e.g., demoing Virtual Earth with an Xbox controller) or to let them play around with their own queries.

However, given my experience working for several startup search engines, I’ve come to realize that it's extremely difficult to convince someone that you’re better than another engine with words, features, or a few carefully chosen queries.

So, after a while, I started my demos with a caveat about the nature of a search engine: I implored my audience to try out Live Search for a week because, in the words of the immortal LeVar Burton of Reading Rainbow, “you don’t have to take my word for it.”

A search engine has an extremely simple interface: you type in some words and hope that the engine will cough up pointers to helpful Web sites or give you a direct answer. The inner workings of a search engine, i.e., how those results were produced, are completely opaque to the user. Hundreds of features are used to rank results so that the right Web sites and answers show up on a page when you type in some string of words. Those features don't surface as demonstrable chunks that can be easily summarized or understood.

Common mistakes when evaluating a search solution

Which brings me to the biggest mistake people make: judging a search engine by typing in a few queries and analyzing the results. There are many interrelated reasons that this methodology fails:

A few good/bad results don’t mean that all results will be good/bad – even if you try out five searches and all are good, how do you know if your sixth is also going to be good? That is, since you don’t know what is going on under the hood, you can’t make any predictions about the quality of future results.

It’s hard to select a representative cross section of queries – people usually try out a few navigational queries, a vanity query, and a few queries that are either damn-near impossible or extremely obscure. None of these sets represents an accurate cross-section of your monthly query log.

What you think is “good” may not be good for the majority of users – for navigational queries (e.g. “CNN.com”) the top result is clear. For more complicated queries, the top results are rarely obvious.

Queries are out of context – we had this problem at SideStep all of the time. During usability studies, users who were simply evaluating the look-and-feel of the product and scanning for cheap flights without any end goal gave feedback that was never as useful as that from users who were actually trying to buy a flight for a real trip. A search engine should help you complete tasks, not just give you a pretty page or links that look useful.

People tend to focus on the first result – some queries just require one result. But many queries should be judged by the diversity of interesting results.

There are probably countless other mistakes that are made during solo evaluations of search. Therefore, search engines big and small realize that problems of ranking and relevance – the core of any search project – are solved only by lots and lots and lots of data from lots and lots and lots of people. To solve this data problem, we need to collect data from real users. For example, we run many thousands of queries past human judges and look at mountains of click data from the production site. After applying advanced statistical techniques to this data, we get the information we need to create algorithms that turn your few (misspelled) words into a useful page of results.
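To give a flavor of what scoring judged queries looks like, here is a minimal sketch of NDCG (normalized discounted cumulative gain), one of the standard metrics applied to human relevance judgments. This is my own illustration of the general technique, not Live Search's actual evaluation pipeline; the graded 0/1/2 relevance scale below is an assumed convention.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance, discounted by log of rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG against the ideal ordering (results sorted best-first)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical judgments for one query's top five results:
# 2 = perfect, 1 = somewhat relevant, 0 = bad.
judged = [2, 0, 1, 1, 0]
print(round(ndcg(judged), 3))  # → 0.936
```

Averaged over thousands of judged queries, a score like this lets you compare two ranking algorithms statistically instead of eyeballing a handful of result pages – which is exactly the point of the post.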

As one of my colleagues at Powerset always likes to remind me: this is rocket science.


Comments

Great post Mark. I agree the simple search box completely belies what goes on underneath. It’s a disservice to users at large when reviewers generalize a five-query experience to billions of queries. Keep up the good work.

Mark, nice post. But I do have to pick on Powerset here. I attended a Powerset presentation before the acquisition and asked about evaluation metrics--especially since my experience as a beta tester had been underwhelming. I didn't get a straight answer--and I haven't to this day. Were there ever any published experiments using TREC, user studies, or some other evaluation methodology? Or is all of the evaluation work still guarded as trade secret?

Spot on Mark. Many who review or rate a search engine's capabilities know little about how search works; they only know how the results look. Even then, most compare it to what they already know (Google/Yahoo/Live/etc.) instead of removing their rose-colored glasses to look clearly at the advantages and disadvantages of something new. Obviously, the engine needs to perform for the user. But reviewers (often, media reviewers - not technical reviewers) judging engines based on just a couple of searches shows the short-sightedness of the reviewer, and not necessarily that of the engine. Thanks for the post.

I do agree with you. When I want an example for a talk or an article, I end up doing a bunch of searches to find just the right one to illustrate the point I'm trying to make. And if I can't find a good one, I re-examine my ideas.

But how many queries and what kind are enough to rate a search engine properly? Enterprise search isn't web search, and rarely has much control over the algorithm.

I have my own ideas, based on search log analysis, but most sites don't have the kind of traffic you are talking about. It turns out that a week's worth of search logs, if there are only 20,000 searches, has a very small head and a very long tail. And a whole lot of search spam and URL queries.

How I wish I had your mountains of queries and click data and money for human judges!
