Search Analytics for Your Site: Conversations with Your Customers

Chapter 8: Practical Tips for Improving Search

I’ll show you some specific benefits that come directly from site search analytics, starting with improving your site’s search system.

Plugging Gaps in Your Search Engine’s Index

It’s a dirty little secret that search engines don’t always index all the content they’re supposed to. The problem isn’t with the software itself. Rather, it’s due to the engine simply not knowing about content areas that exist on a given site. And the engine may not know because it’s likely that you don’t. If your site is actually a large set of subsites—typical of many enterprise-scale Web environments—then it’s simply hard to know what content is out there.

One solution is to analyze your top queries with zero results. Of those, identify which aren’t retrieving results because there simply is no content to match searchers’ needs.

Then look at those queries as a group. First, are you surprised by what you find? Do you see any patterns, anything in common at all? Would you have expected there to be content to match those queries? If so, who would have created it, and in what unit would they likely work?

You’re mostly there, Holmes; now go talk to someone in that part of your organization. Find out whether there is, indeed, content that needs to be crawled in order to match these null queries (and if there isn’t, gently recommend that it be created).
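If your search log can be exported as rows of query text plus the number of results returned, a few lines of Python will surface the top zero-result queries. This is only a sketch: the (query, result_count) row shape and the sample rows are assumptions, not the format of any particular search engine’s log.

```python
from collections import Counter


def top_zero_result_queries(log_rows, n=10):
    """Return the n most frequent queries that retrieved no results.

    log_rows is an iterable of (query, result_count) pairs. That shape is
    an assumption; adapt it to whatever your search log actually records.
    """
    counts = Counter(
        query.strip().lower()          # fold case so variants count together
        for query, result_count in log_rows
        if result_count == 0
    )
    return counts.most_common(n)


# Invented sample rows, for illustration only.
sample_log = [
    ("travel reimbursement", 0),
    ("Travel Reimbursement", 0),
    ("parking permit", 0),
    ("benefits", 12),
    ("travel reimbursement", 0),
]

for query, searches in top_zero_result_queries(sample_log):
    print(query, searches)
```

The list this produces still needs a human eye: you have to decide which of these queries fail because the content is missing from the index and which fail for other reasons, such as typos.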

Making Query Entry Easier by Fixing “the Box”

Most sites these days happily sport “the box,” a simple text-entry box (and an accompanying “search” button) that persists on every page in a fixed position (see Figure 1). It’s a life preserver of sorts—searchers know exactly where to look for it when they need to execute a search, and it works the same way wherever they find it.

Figure 1—Even a Web environment as large as IBM.com has a persistent simple search entry box.

If you have “the box” in place throughout your site, congratulations! But have you considered how wide it should be? It had better be wide enough to accommodate the majority of your queries. SSA will help you figure out how long your queries typically run, so you can plan your design of the text-entry box accordingly.

In this example, I analyzed AIGA’s top 500 unique queries for a specific month; these accounted for exactly 37% of all search activity (see Figure 2). I used Microsoft Excel’s LEN function to count the number of characters in each query and then calculated the queries’ mean and median lengths (10.648 and 10, respectively).

Sorting by query length shows that the maximum length among these 500 queries was 62 characters, but that is something of an outlier; the next longest was 36, then 28, and then the curve flattened out (apparently, Zipf is everywhere), as shown in Figure 3.

Figure 3—The short head of AIGA’s queries by length; notice the quick drop-off.

Based on this data, I might be safe using a search entry box with a width in the 15–20 character range. If horizontal real estate isn’t at a premium, a width of 30 characters would be even better.

I could take my analysis further and compare a sample from the long tail and see if query length differs greatly. Still, with this small sample, I’m safely addressing our most frequent queries and almost 40% of all search activity.
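The same length analysis is easy to reproduce outside a spreadsheet. The sketch below counts characters the way LEN does and reports the mean, the median, the maximum, and the narrowest width that would fit a chosen share of queries. The function name, the coverage_pct parameter, and the sample queries are all illustrative assumptions; they are not AIGA’s data.

```python
import math
import statistics


def query_length_summary(queries, coverage_pct=95):
    """Summarize query lengths in characters, like a spreadsheet LEN column."""
    lengths = sorted(len(q) for q in queries)
    # Position of the shortest length that still fits coverage_pct percent
    # of the queries (nearest-rank percentile).
    index = max(0, math.ceil(coverage_pct * len(lengths) / 100) - 1)
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "max": lengths[-1],
        "width_for_coverage": lengths[index],
    }


# Invented sample queries, for illustration only.
sample_queries = ["logo", "jobs", "design", "salary survey", "portfolio review"]
print(query_length_summary(sample_queries, coverage_pct=80))
```

Choosing a width from a coverage percentage rather than from the maximum keeps one 62-character outlier from dictating the design.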

Accommodating Strange Query Syntax

Once upon a time, prior to the advent of the Web, most online searching was done in either library catalogs or commercial databases that were hugely expensive to use (think hundreds of dollars by the hour) and usually horribly designed. Accordingly, searchers in those days were more than casual and definitely not lazy. They were quite motivated to learn all sorts of search tricks, like using Boolean operators (for example, OR, AND, and NOT), wildcards, ways to truncate terms, and lots of other weirdness.

These days, Google is good enough that most searchers can be lazy, entering a term or two, and expecting something reasonably good in return. Still, there are a few holdouts, and if your site is older or tends to be used by researchers or librarians, there’s a good chance that you may need to consider supporting old-style query syntax.

A simple way to check is to search your queries for such instances. In the example below, I used Google Analytics to filter a year’s worth of AIGA.org queries for the operator OR (see Figure 4). Of the over 75,000 unique queries for the entire year, only 121 unique queries (and 142 searches overall) included OR. And most of those queries were not using OR as a Boolean operator.

Figure 4—AIGA’s searchers aren’t using the Boolean OR operator very often.

The use of AND, however, was a bit more common; it was included in 1,596 unique queries and 2,205 searches overall. (NOT showed up in 84 unique queries and 112 searches overall.) But out of over 75,000 unique queries and 188,000 searches overall, the volume of searching using Boolean queries is still quite small—in the 1%–2% range—and the majority of those queries don’t use AND, OR, and NOT as Boolean operators. So AIGA is probably safe in not supporting Boolean operators in its query syntax.
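If you would rather count operators from an exported list of queries than filter in an analytics tool, something like the following sketch would do. It assumes your export gives you each unique query with its number of searches, and it only matches AND, OR, and NOT written in capitals, on the guess that capitals signal a deliberate operator. That guess is mine, and you would still want to read the matches before drawing conclusions.

```python
import re
from collections import Counter

# Whole-word, uppercase-only match. Lowercase "and" or "or" is treated as
# ordinary language here, which is an assumption rather than a rule.
OPERATOR_PATTERN = re.compile(r"\b(AND|OR|NOT)\b")


def operator_usage(query_counts):
    """Count unique queries and total searches that contain each operator.

    query_counts maps each unique query string to its number of searches.
    """
    unique_queries = Counter()
    total_searches = Counter()
    for query, searches in query_counts.items():
        # set() so a query using the same operator twice is counted once
        for operator in set(OPERATOR_PATTERN.findall(query)):
            unique_queries[operator] += 1
            total_searches[operator] += searches
    return unique_queries, total_searches


# Invented sample data, for illustration only.
sample_counts = {
    "logo AND identity": 3,
    "design or art": 5,
    "posters NOT film": 1,
}
unique_queries, total_searches = operator_usage(sample_counts)
print(dict(unique_queries), dict(total_searches))
```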

Determining What Your Best Bets Should Be

Best bets (aka “recommended links”) are simple. They’re search results that have been manually connected to a particular query. Why do this? Because search engines are robots, and robots aren’t always that effective at retrieving good search results.

In the example in Figure 5, the National Cancer Institute wanted to make sure that searchers always retrieved something useful when searching for melanoma. The organization manually attached three sets of best bet search results to the query, and these are displayed before the search engine’s automated results show up.

So, while best bets are simple, they’re also powerful. “Powerful” in the sense that they can really improve the search experience. And “powerful” in that they can be a weapon wielded in your organization’s political battles.

For example, who gets to determine which are the most appropriate best bets for a query? If, for example, your organization sells hardware products, and someone searches for a product’s name, should there be a best bet result from marketing? Or from sales? Or from the tech support department? Whose should be the highest priority? This sort of situation can spin out into a political firefight very quickly.

And who gets to determine which queries merit best bets in the first place? (And how many should there be at most? We’ve not seen this addressed by researchers, but three or four seems plenty, since you don’t want to completely obstruct the raw search results.)

Rather than let best bets become a political headache, use data—query data—to quell or at least blunt these battles. Look to the short head for common queries, and look at them longitudinally to determine which are the most persistent over time. The combination of popularity and persistence is a great driver for choosing queries that merit best bets.
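One rough way to combine popularity and persistence is to ask which queries make the top of the list month after month. The sketch below assumes you can produce a per-month tally of queries; the function name, the top_n and min_months parameters, and the sample numbers are illustrative, not a prescribed method.

```python
from collections import Counter


def best_bet_candidates(monthly_counts, top_n=50, min_months=None):
    """Return queries that rank in the top_n in at least min_months months.

    monthly_counts maps a month label to a Counter of query -> searches.
    By default a query must appear in the top_n in every month.
    """
    if min_months is None:
        min_months = len(monthly_counts)
    appearances = Counter()
    for counts in monthly_counts.values():
        for query, _ in counts.most_common(top_n):
            appearances[query] += 1
    return sorted(q for q, months in appearances.items() if months >= min_months)


# Invented sample data, for illustration only.
sample_months = {
    "2010-01": Counter({"jobs": 90, "election": 80, "events": 40}),
    "2010-02": Counter({"jobs": 85, "awards": 70, "events": 45}),
    "2010-03": Counter({"jobs": 95, "events": 50, "awards": 20}),
}
print(best_bet_candidates(sample_months, top_n=2))
```

In this made-up sample, a query like “election” is popular for one month and then disappears, so it drops out, while a query that stays near the top every month survives.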

If you do have multiple best bet candidates, consider prioritizing them by determining the relative importance of a particular query to different audience segments. Consider our hardware vendor once more: if the people searching a specific product name are three times more likely to be existing customers seeking to download drivers, rather than prospects who might be looking to learn about a product, then the data would suggest that tech support’s best bet should come first. Argument settled.

Helping Searchers Auto-Complete Their Queries

Unless you’ve been under a rock for the past few years, you’ve likely encountered sites that automatically complete your queries (also known as “type-ahead”). In effect, the search engine has been given enough information to predict what you want to search—or, at least, provide you with a few useful possibilities for you to select from. Auto-completion can help searchers save time entering a query. They can just click or tab over to their selection, rather than continue typing. And if they’re not exactly sure how to enter their query—perhaps they don’t know the proper spelling of a term—auto-completion will expose some useful possibilities.

Where does SSA fit in auto-completion? Well, it might be tempting to simply use all of your queries—or even your most frequent queries—as your auto-completion list, but beware: these queries are likely to be quite dirty in all senses. They’ll include typos, irrelevant terms, and terms that are dirty in the pornographic sense.

Rather than using raw queries, rely on a cleaned-up version. For example, you may already have a list of keywords associated with best bet search results. Given that they’re probably based on your frequent queries and that they’ve been scrubbed, they’re a great starting point. You might also consider using a tool that can perform entity extraction on your queries to give you a set of proper nouns for your auto-completion list. But again, you’ll still need to manually review such a list; no software application will be able to do that as well as you can.
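As a starting point for that cleanup, the sketch below merges case and spacing variants, drops rare queries (which are often typos), removes anything containing a blocked word, and then serves prefix matches. The blocklist, the min_count threshold, and the sample queries are placeholders I made up, and the output is meant to feed a manual review, not replace it.

```python
from collections import Counter


def build_suggestions(query_counts, blocklist, min_count=5):
    """Build a cleaned, frequency-ordered suggestion list from raw queries."""
    merged = Counter()
    for query, count in query_counts.items():
        # Fold case and collapse whitespace so variants count together.
        merged[" ".join(query.lower().split())] += count

    kept = []
    for query, count in merged.items():
        if count < min_count:
            continue  # rare queries are often typos
        if any(word in blocklist for word in query.split()):
            continue  # contains a blocked word
        kept.append((query, count))

    # Most frequent first, then alphabetical so the order is stable.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in kept]


def complete(prefix, suggestions, limit=5):
    """Return up to limit suggestions that start with the typed prefix."""
    prefix = prefix.lower()
    return [s for s in suggestions if s.startswith(prefix)][:limit]


# Invented sample data; "badword" stands in for a real blocklist entry.
raw_counts = {"poster": 40, "Poster ": 10, "portfolio": 25,
              "potser": 2, "badword gallery": 30}
suggestions = build_suggestions(raw_counts, blocklist={"badword"})
print(complete("po", suggestions))
```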

SSA can also help you identify metadata attributes and content types. Consider them candidates for items to add to an auto-completion list. You may find that you can go out and acquire certain metadata—say, place names—from commercial sources and insert them directly into your auto-completion list. (Just make sure that your newly added terms have content associated with them, or they’ll be navigational dead-ends.) Or you may already have the terms you need somewhere inside your organization.

For example, ESPN.com enables searchers to type ahead and retrieve names of professional athletes, as shown in Figure 6.

Improving a “No Results Found” Page

We’ve all seen error messages like these before. Some are unhelpful (see Figure 7), while others seem to go out of their way to make you feel like a lunkhead. Many sites are improving the messaging on their “file not found” pages, moving from 404 impersonality to a more helpful approach that suggests alternatives.

Similarly, there’s no reason not to go beyond default “results not found” pages and do even better. And SSA can help in a very simple way, as shown in Figure 8.

Figure 8—No “peeps” at the JellyBelly.com site? No worries; help is on the way.

Certainly, JellyBelly.com’s copy could be a tad more helpful. But more importantly, the company realizes that, in the context of a failed search, it’s a good idea to suggest other queries to try. These suggestions are frequent queries; even better would be suggesting queries with synonyms for the failed query term. (But let’s face it: there probably are no synonyms for “peeps.”) Either way, the searcher is now just one click away from more search results, rather than being made to feel like an idiot.
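Here is a minimal sketch of that idea: offer synonyms for the failed term first, when you have any, and fill the remaining slots with your most frequent queries. The synonym table and the popular-query list are hypothetical inputs that you would have to supply from your own SSA work; nothing here reflects how JellyBelly.com actually builds its page.

```python
def suggest_for_failed_query(failed_query, synonyms, popular_queries, limit=5):
    """Suggest alternative queries for a search that returned nothing.

    synonyms maps a lowercased query to a list of alternative queries.
    popular_queries is a list of frequent queries, most frequent first.
    """
    key = failed_query.strip().lower()
    suggestions = list(synonyms.get(key, []))  # synonyms come first
    for query in popular_queries:
        if query != key and query not in suggestions:
            suggestions.append(query)
    return suggestions[:limit]


# Invented sample data, for illustration only.
sample_synonyms = {"soda": ["soda pop", "cola"]}
sample_popular = ["jelly beans", "cola", "gift baskets"]

print(suggest_for_failed_query("Soda", sample_synonyms, sample_popular, limit=3))
print(suggest_for_failed_query("peeps", sample_synonyms, sample_popular))
```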

Helping Searchers Revise Their Queries to Get Better Results

As searchers get past the initial query entry interface and start to encounter search results pages, they become increasingly likely to invest more effort into finding what they need. There are many reasons for this: the scent of desired information may be getting stronger; they don’t luck out into great results the first go-round as they’d hoped; or they learn more about what they’re looking for as they engage with search results. Whatever the case may be, this is a good time to expose them to a higher level of search functionality than is afforded by the common starting point of “the box.”

It’s likely that your search engine already has many great features to help searchers revise their queries and massage their search results.

Unfortunately, it’s also likely that these features have been buried in a search system’s sad ghetto, infamously known as Advanced Search. (Did you know that “Advanced Search” is actually search engine vendor terminology for a “Miscellaneous bucket of features that we don’t know what you’ll do with [but we wish you the best of luck]?”)

Like any other kind of help feature, these types of search features work best when presented within the appropriate context of use. Your search engine vendor can’t or won’t help you figure out which of these features to provide to searchers and when, so it’s up to you to do the heavy lifting. Fortunately, SSA (and a little common sense) can help.

Two of the most common motivations for revising a search have to do with adjusting the volume of results—either the engine isn’t returning enough, or it’s returning too many. In the first case, you can guess that a null results page is too few—or you might set your threshold a little higher—say, five results. In the latter case, you might set a threshold of more than one or two screenfuls of results, because you’re fairly certain searchers won’t get past those initial sets of results.

In either case, look to integrate features from your Advanced Search interface that broaden or narrow, respectively. For example, the University of Alaska Fairbanks’ Advanced Search interface consists exclusively of a means for broadening your search results, as shown in Figure 9.

Figure 9—At the University of Alaska Fairbanks, Advanced Search means “broaden your search.” Why not expose these features when the search results need expanding?

The IRS, on the other hand, provides all sorts of ways to narrow a search from its Advanced Search interface (see Figure 10).

Clearly, Advanced Search means different things to different people. Rather than relying on that term having any sort of consistent meaning, consider simply designing your search results pages to incorporate whatever form of refinement makes sense given the situation: support expanded results if zero were retrieved, or ways to narrow results if too many were retrieved. This approach would likely be much more helpful than burying such features on an Advanced Search page.
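The decision itself can be as simple as a pair of thresholds. In the sketch below, five results or fewer triggers the broadening tools, and more than 20 (my stand-in for two screenfuls, assuming ten results per screen) triggers the narrowing tools. Both numbers are assumptions to tune against your own data.

```python
def refinement_mode(result_count, too_few=5, too_many=20):
    """Decide which refinement tools a results page should expose.

    Returns "broaden" when there are too few results, "narrow" when there
    are too many, and "none" otherwise. Both thresholds are assumptions.
    """
    if result_count <= too_few:
        return "broaden"
    if result_count > too_many:
        return "narrow"
    return "none"


for count in (0, 5, 12, 250):
    print(count, refinement_mode(count))
```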

Search Pre-Refinement: How Much Customization to Allow

For most sites, the simpler the search interface you offer, the better. Thanks to Google, searchers are deeply familiar with basic keyword searching methods and can use them effectively. Advanced Search tools don’t usually work any better, and, if poorly conceived, often work rather worse than basic keyword search. There are important exceptions, however, where encouraging searchers to provide additional criteria for their initial search makes good sense.

For sites where search is the almost universal method of finding things, especially where faceting is necessary, providing up-front search refinement makes good sense. If you manage a hotel site, an airline site, or a real estate site, then form-based (aka advanced) search is far more efficient than keyword search.

For example, on real estate sites, nearly all searches begin with a city name. For the vast majority of searchers, however, a citywide search, regardless of how the results are sorted, will return too many results to be used effectively. The results should be faceted by price, neighborhood, or categories like size or number of bedrooms. Airlines face a similar search problem. Almost every search begins with both a date and, at minimum, a trip leg (a from/to pair), as shown in Figure 11. The same is true for almost any travel application, including hotel and car rental sites.

Figure 11—A single field search box wouldn’t make sense for travel sites like Expedia.com.

For most industries, there is a fairly obvious subjective ordering of at least a few primary fields. As with most subjective orderings, however, there are always questions. For example, we may know that far more people will search a real estate site based on price or neighborhood than, say, pools. You don’t need analytics to tell you that. But not every question is quite so obvious. For home search, is price or neighborhood more important? Is number of bedrooms or bathrooms more important? Or do you need both? What about square feet?

The goal of analytics is to help answer, or give the necessary information to answer, questions like these. If you start with the assumption that the fewer choices you give searchers, the cleaner and better the interface, how do you decide how many choices and which choices are best?

In general, the goal of search refinement is to narrow the range of search results to some optimum level. You can test facet combinations on initial search relative to two different criteria:

In the real estate case shown in Figure 12, “Pct of Searches Used” captures how often visitors actually used a field when submitting a search. “Lift” measures whether visitors who searched using a field were more or less likely to submit a lead. Four fields (Neighborhood, Property Type, Square Feet, and Amenities) all showed negative lift—meaning visitors using these features were less likely to submit a lead than users who did not. In other words, on average they decreased the effectiveness of search when included in the initial search.

You’ll also note that Bedrooms and Bathrooms and Maximum Price worked very well as fields for an initial search. The bottom line is that analysis of the lift and number of results returned based on the initial search makes it much easier to make intelligent decisions about how many and which options are worth providing to searchers.

Figure 12—The data suggests that the presence of fields like “Neighborhood” in a form decreases the search’s effectiveness.
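The text above doesn’t spell out how lift was calculated, so the sketch below uses one common definition as an assumption: the lead rate among searches that used a field, divided by the lead rate among searches that did not, minus one. A negative number means searchers who used the field were less likely to submit a lead. The record layout and the sample searches are invented for illustration and are not the study’s data.

```python
def field_stats(searches, field):
    """Return (pct_of_searches_used, lift) for one search form field.

    searches is a list of dicts with "fields" (the set of field names the
    visitor filled in) and "lead" (True if the visit produced a lead).
    Lift is None when it cannot be computed.
    """
    used = [s for s in searches if field in s["fields"]]
    unused = [s for s in searches if field not in s["fields"]]
    pct_used = len(used) / len(searches)
    if not used or not unused:
        return pct_used, None
    rate_used = sum(s["lead"] for s in used) / len(used)
    rate_unused = sum(s["lead"] for s in unused) / len(unused)
    if rate_unused == 0:
        return pct_used, None
    # Assumed definition: relative difference in lead rate.
    return pct_used, rate_used / rate_unused - 1


# Invented sample searches, for illustration only.
sample_searches = [
    {"fields": {"Bedrooms", "Maximum Price"}, "lead": True},
    {"fields": {"Bedrooms"}, "lead": True},
    {"fields": {"Neighborhood"}, "lead": False},
    {"fields": {"Neighborhood", "Bedrooms"}, "lead": False},
    {"fields": {"Maximum Price"}, "lead": True},
    {"fields": set(), "lead": False},
    {"fields": {"Neighborhood"}, "lead": True},
    {"fields": set(), "lead": True},
]

for name in ("Neighborhood", "Bedrooms"):
    pct, lift = field_stats(sample_searches, name)
    print(name, round(pct, 3), round(lift, 3))
```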

In every case for this study, the least effective fielded searches produced, on average, very small result sets. That won’t always be the case. There is no single right answer to an optimal number of search results returned, particularly across different problem sets. People need to see more search results for houses than they would for kitchen blenders. The critical factor is the degree to which the search criteria can reliably identify what the visitor is looking for. Housing search is necessarily fairly fuzzy.

Designing Search Results Around Specialized Query Types

Certain query types are worth looking for as you dig into your query data, especially your long tail, because you can tune how your results are presented and how they can be sorted. Specialized types of queries may include such search terms as

Unique identification numbers, such as ISBNs, SKUs, and course codes

Proper nouns (names of people, places, or objects)

Acronyms

Dates

Navigational queries (URLs)

If unique IDs are usual suspects within your query data, you may be able to program your search engine to look for them (their syntax is usually consistent and predictable) and present custom search results for that specific type of query. So a commerce site might recognize a search for 0-38533-349-8 to be for a book’s ISBN, and would accordingly know to display a cover thumbnail and other information to help the searcher identify and purchase Kurt Vonnegut’s The Sirens of Titan.
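To show how predictable that syntax can be, here is a small sketch that recognizes a ten-digit ISBN. It strips hyphens and spaces, checks for nine digits followed by a digit or X, and then applies the standard ISBN-10 check: weight the characters 10 down to 1, and the total must be divisible by 11. What your search engine does once it spots one is up to you; the function name is my own.

```python
import re


def is_isbn10(query):
    """Return True if the query is a valid ISBN-10; hyphens and spaces are ignored."""
    digits = re.sub(r"[\s-]", "", query).upper()
    if not re.fullmatch(r"\d{9}[\dX]", digits):
        return False
    total = 0
    for position, char in enumerate(digits):
        value = 10 if char == "X" else int(char)  # X stands for 10
        total += (10 - position) * value          # weights run 10 down to 1
    return total % 11 == 0


for query in ("0-38533-349-8", "0-38533-349-9", "1012"):
    print(query, is_isbn10(query))
```

The checksum matters: without it, any ten-digit number, such as a phone number, would be mistaken for a book.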

A person’s name might work the same way, although name recognition isn’t as surefire as unique IDs. For example, you might teach your search engine to flag a mixed case string of two words separated by a comma (for example, Vonnegut, Kurt) as a name. You could also define a person’s name more loosely, such as a string of any two words in mixed case (for example, Kurt Vonnegut, but also Chopped Liver). The latter case has its obvious drawbacks, but could still be workable if your search engine is searching a federated collection of data stores. In such a case, you could have it default to display results from the staff directory first. If there is no one with the name “Chopped Liver” in your staff directory, your search engine could move on and grab results from elsewhere (like your intranet’s extensive collection of lunchroom menus).
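Both name heuristics, and the fall-through from staff directory to everything else, can be sketched in a few lines. The two patterns are deliberately naive (they would miss names like O’Brien or McDonald), and lookup_staff and search_everything are hypothetical stand-ins for whatever your federated search actually calls.

```python
import re

# "Vonnegut, Kurt": two capitalized words separated by a comma and a space.
COMMA_NAME = re.compile(r"^[A-Z][a-z]+, [A-Z][a-z]+$")
# "Kurt Vonnegut" (but also "Chopped Liver"): any two capitalized words.
TWO_WORD_NAME = re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$")


def looks_like_name(query):
    """Loose guess at whether a query is a person's name."""
    q = query.strip()
    return bool(COMMA_NAME.match(q) or TWO_WORD_NAME.match(q))


def route_query(query, lookup_staff, search_everything):
    """Try the staff directory first for name-like queries, then fall back."""
    if looks_like_name(query):
        hits = lookup_staff(query)
        if hits:
            return hits
    return search_everything(query)


# Hypothetical data stores, for illustration only.
staff = {"Kurt Vonnegut": ["Kurt Vonnegut (staff directory entry)"]}


def lookup_staff(query):
    return staff.get(query, [])


def search_everything(query):
    return ["general results for " + query]


for query in ("Kurt Vonnegut", "Chopped Liver", "lunch menu"):
    print(query, route_query(query, lookup_staff, search_everything))
```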

Place names, organization names, and acronyms are often already known. For example, you can purchase or steal lists of place names (or in some cases, like U.S. states, cull them from your memory). Your organization may have lists of its internal division names and its partners’ names. And it may also maintain a glossary of acronyms (which, by the way, you can grow from analyzing queries if you need to). If so, you can feed these to your search engine in advance and provide well-crafted search results using the same approach as you would for best bets.

Dates (and, less frequently, place names) are helpful in a slightly different way—rather than helping identify the most appropriate result, they’re useful in helping identify how to present results. The persistence of dates within queries suggests that the searcher is trying to either narrow or sort her search results. The fix, as the Financial Times found, is to enable searchers to sort and filter by date, as you can see from the search results page shown in Figure 13.

Financial Times could take this approach just a little further by setting the sorting to default to “Date” when date information is included in the query. (The current default for all queries seems to be “Relevance.”)
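A sketch of that default-switching idea follows. It treats a four-digit year from 1900 to 2099, or a month name followed by a day number, as date information. Those two rules are my own simplification, not anything the Financial Times is known to use, and they will misfire on queries like the novel title 1984.

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
MONTH_NAMES = ("january|february|march|april|may|june|july|august|"
               "september|october|november|december")
# A month name followed by a day number, such as "march 15".
MONTH_DAY = re.compile(rf"\b(?:{MONTH_NAMES}) \d{{1,2}}\b", re.IGNORECASE)


def default_sort(query):
    """Return "date" when the query seems to contain a date, else "relevance"."""
    if YEAR.search(query) or MONTH_DAY.search(query):
        return "date"
    return "relevance"


for query in ("budget speech 2009", "March 15 earnings",
              "may contain nuts", "interest rates"):
    print(query, "->", default_sort(query))
```

Requiring a day number after the month name keeps everyday words like “may” and “march” from flipping the sort order on their own.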

Finding URLs in query logs isn’t so strange: a searcher sees a box on a page and fills it in with a URL. Why should a search entry box be any different from an address entry box? In fact, because it happened so often, IBM Software found it worth addressing. Rather than punish searchers with an unpleasant 404-like “no results found” experience, IBM simply taught its search engine to redirect searchers to the desired page, as specified by the URL entered. It was a simple and straightforward way to show its searchers respect.
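How IBM implemented this isn’t described here, so treat the following as one possible sketch. It checks whether a query parses as a URL whose host is one of your own sites and, if so, returns the address to redirect to. The host name www.example.com and the function name are placeholders.

```python
from urllib.parse import urlparse


def redirect_target(query, site_hosts):
    """Return a URL to redirect to if the query is a URL on one of our hosts.

    site_hosts is a set of lowercase host names that belong to your site.
    Returns None when the query should go through normal search instead.
    """
    candidate = query.strip()
    if " " in candidate:
        return None  # URLs typed into a search box rarely contain spaces
    if "://" not in candidate:
        candidate = "http://" + candidate  # searchers often omit the scheme
    host = (urlparse(candidate).hostname or "").lower()
    if host in site_hosts:
        return candidate
    return None


hosts = {"www.example.com"}
for query in ("www.example.com/software/db2", "db2 download", "www.other.org"):
    print(query, "->", redirect_target(query, hosts))
```

Limiting redirects to your own hosts is a deliberate choice in this sketch: it keeps the search box from being used to bounce people to arbitrary sites.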

Designing Search Results Around Specialized Content Types

You can also tune your search results when you know that you have specific types of content that searchers might find especially interesting. In Chapter 3, “Pattern Analysis,” we used our Michigan State University example to illustrate how you might uncover potential content types that occur again and again in your site. Once you have those types in place, they serve as wonderful “building blocks” for tuning your search results in a powerful way.

In the example shown in Figure 14, the searcher is looking for a product that includes the number 1012. Hewlett Packard’s search engine has been taught to guess that such numeric strings are quite likely to be products. And HP has already identified a variety of content types—“Product overview,” “Supplies, options & accessories,” and so on—which are associated with products. (These are displayed in the center of the page under “Product quick links.”) In effect, HP has determined that queries for products ought to have certain content types shown automatically and expects that these are valuable to searchers. In fact, they seem far more valuable than the rough results found by the search engine (to the right, under “Search results”).

Hewlett Packard’s approach is very intelligent; in fact, it’s a souped-up version of best bets. HP has already done the hard work of determining what its content types are, and likely tags them as such in its content management system. It takes advantage of that effort by exposing appropriate content types when it encounters a specialized search, in this case, a product query.

Interestingly, HP shows content types that are geared toward existing customers rather than toward people who might be considering purchasing this particular product. The company’s research may have shown that more existing customers than prospects search its site for product information. And we’ve all heard how it’s cheaper to keep a customer than to acquire one.

Using SSA to Help Determine Search Scoping

Martin Belam, Information Architect, Guardian News & Media

One way to help searchers to get to the best result quicker is by “scoping” their search: limiting their results to content similar to the page they are currently viewing. For example, if they use the search box on a page in your “press office” section of the Web site, you would only return results from pages published by the press office, not the site as a whole.

While this can be very helpful to some searchers, particularly on sites with large silos of unrelated content, it can also second-guess their intentions. If you scope your content incorrectly, you might make it nearly impossible for the searchers to locate the information they need.

Site search analytics can help you to understand whether “scoped” search is really helping your searchers, but you have to make sure you are measuring the right things. If you have three different scoped search boxes on your site (for example, the news section, help section, and documentation section of your site or intranet), it is important that you can slice your data to examine behavior in each of those scopes in isolation. You’ll want to be able to see the top search terms that have been used on a particular area of the site, and you’ll also be very interested in those searches that have generated no results.

On the guardian.co.uk Web site, we scope search based on the section of the site that the searcher is currently visiting. This can get quite granular, so that if you are on guardian.co.uk/culture, you only receive results from our culture coverage. If you drill down further within that to the music section, then search results are restricted to just returning content that has been tagged with music. Search log analysis allows us to check whether this is helpful or not. For example, in January 2010, one of the top 10 searches on the music section of the site was for wire (see Figure 15). Because of scoping, we can guarantee that the searcher is only going to get articles about the recently reformed ’70s post-punk band, rather than our extensive coverage of the critically acclaimed TV show The Wire.

Figure 15—The top 10 searches within the Music scope of guardian.co.uk in the first weeks of January 2010.

Looking further down the most popular searches, however, reveals that we have a problem. In the top 40 searches on the Guardian’s music area at guardian.co.uk/music, we also see frequent queries for top columnist Charlie Brooker and for the crossword. These are most certainly not music-specific queries. Our answer is to make sure that more general, site-wide editorial best bets are also retrieved within a scoped search’s results. This gives the searchers the best of both worlds. If their query is specific to the area of the site, they get the narrow focus of scoped search. If their query is a navigational one, aimed at jumping context to another area of the site, then best bets will take them there (see Figure 16).

This means you’ll have to regularly monitor the generic queries within a scope to make sure that you are picking up all of the popular searches that would benefit from being an editorial best bet.

Figure 16—This search originated in the Music area of guardian.co.uk, as indicated by “You searched for ‘charlie brooker’ in Music.” Although the 26 results returned are specific to that area of the site, the “Editor’s picks,” our label for best bets, are generic and direct the searcher to useful links outside of the music scope.
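The combination of scoped results and site-wide best bets can be sketched as follows. This is a toy illustration, not the Guardian’s implementation: the documents, tags, and best bets are invented, and simple title matching stands in for a real search engine.

```python
def scoped_search(query, scope, documents, best_bets):
    """Return scoped results plus site-wide best bets for a query.

    documents is a list of dicts with "title" and "tags" (a set of scopes).
    best_bets maps a lowercased query to a list of editor-chosen links.
    """
    q = query.strip().lower()
    results = [
        doc["title"]
        for doc in documents
        if scope in doc["tags"] and q in doc["title"].lower()
    ]
    # Best bets ignore the scope, so navigational queries can jump context.
    return {"editors_picks": best_bets.get(q, []), "results": results}


# Invented sample content, for illustration only.
sample_documents = [
    {"title": "Wire announce reunion tour", "tags": {"music"}},
    {"title": "The Wire: series five review", "tags": {"tv"}},
    {"title": "Charlie Brooker on festival season", "tags": {"music"}},
]
sample_best_bets = {
    "charlie brooker": ["Charlie Brooker columns"],
    "crossword": ["Crosswords"],
}

print(scoped_search("wire", "music", sample_documents, sample_best_bets))
print(scoped_search("Charlie Brooker", "music", sample_documents, sample_best_bets))
```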

Summary

Identify gaps in your search engine’s index by finding top queries with zero results and then identifying which aren’t retrieving results because there is no content to match searchers’ needs.

Make your search entry box the right width by measuring the length of your queries.

Analyze queries to identify odd or unexpected queries and query syntax that your search engine should be configured to support.

SSA can identify good candidates for best bets and, when you have multiple best bets for a particular query, help you prioritize their order.

Use SSA to help create a cleaned up version of terms for query auto-completion.

SSA can improve your null results page by providing a list of your site’s most popular search terms to choose from or similar search terms based on synonyms.

Use SSA to better support query refinement (instead of relying on Advanced Search interfaces).

When multi-field search interfaces make more sense than a single box, use SSA to help determine which fields or facets to make searchable.

Look for specialized (and very important) query types and then develop specialized search results for those queries.

Conversely, identify important, consistently appearing content types, and use them to power results for important query types (like product names).

Use SSA to help determine and tune search “scopes” or zones within your site’s content.

For more on the “scent of information,” read Peter Pirolli and Stuart Card’s “Information Foraging,” in Psychological Review, 1999, Volume 106, Number 4 (pp. 643-675).