Custom Search Engine Topic Refinement

“In this article, you’ll learn the basics of search topic refinement – a powerful way to allow your users to ‘drill down’ through a search result set from a variety of different perspectives.”

Introduction

This tutorial is the first in a series designed to take you from Custom Search Engine (CSE) novice to CSE Ninja. This article moves briskly through the task of creating a Google Custom Search Engine (after all, the basics are pretty straightforward, and you can probably get your first CSE up and running in minutes just by filling in the blanks). The real fun begins when we focus on some advanced and powerful techniques that can set your CSE apart from the crowd.

If you are looking for an introduction to CSEs, please check out these two articles:

In this article, you’ll learn the basics of search topic refinement – a powerful way to allow your users to “drill down” through a search result set from a variety of different perspectives. In later articles, you’ll learn how to re-rank the results of a query so that the resources that best answer your users’ questions rise to the very top of the search engine results pages (SERPs). You’ll also learn other advanced techniques for customizing and adding value to your CSE, along with a variety of tips and tricks. Along the way, we’ll take a thorough look under the hood at the robust engine that powers Google CSEs.

Because I’m a Search Engine Optimization (SEO) guy, that’s going to be the vertical domain of the example custom search engine that we’ll build. Our example CSE will be a very simple resource that we can consult to get very focused answers to our SEO questions. That, in a nutshell, is the big advantage of CSEs: we get to decide where to search for answers so that the best resources rise to the top, making them easier for us to find.

Building a Simple Custom Search Engine

You should be able to follow along in this tutorial by performing each step yourself on the Google Co-op site, and I highly recommend that you do: you’ll get a much better sense of how CSEs work. You’ll find Google CSEs to be a very interactive environment, and you’re encouraged to experiment along the way to see how things work. This interactivity makes building CSEs fun, interesting, and (mostly) straightforward to learn. To perform the steps in this article, all you need is a web browser and a Google account.

To begin with, we’re going to build a very simple CSE, which ought to be up and running in a matter of moments. First, go to http://www.google.com/coop/cse/overview and click on the “Create a Search Engine” button at the bottom of the page. If you’re not already logged in, do so now. If you don’t already have a Google account, go get one here. They’re free. (Alternatively, you can use your Gmail account if you have one.)

Give your custom search engine a name and description, and fill in “SEO” (no quotes) for the Search engine keywords. Cut and paste the following list of URLs into the “Sites to search” textbox. Make sure to check the “Search only these sites” radio button. This ensures that your CSE only includes results from this set of pages, and not the whole web.

Check the “I have read and agree to the Terms of Service” box (go ahead and read them — it won’t take that long, and I don’t want Google to get mad at either of us). When you’re done, your form should look like this:

Click the “Next” button, and take your shiny new Custom Search Engine for a spin. Try typing “css” as a search term. You should see a preview with about 4 results, with an article from Jakob Nielsen’s excellent AlertBox on www.useit.com at the top of the list. Notice that it looks very much like normal Google search results. Now click on the “Finish” button, and admire your (very short) list of custom search engines.

You can take a more extensive tour of your CSE by clicking on the “homepage” link. Do that now, and try out a few more SEO-related search terms, like “title tag” and “white hat”. One other thing to try here is switching between searching just the sites we picked and the whole web, using the radio button at the top of the homepage. This will begin to give you a flavor of how the Custom Search Engine results can differ from the normal Google search results. When you’re done searching, take a few moments to notice the other features of your CSE home page.

In particular, you probably want to fill in your profile (click on “Edit Profile”), so that you can provide some details about who you are to others who may want to use your search engine. Your profile also identifies you if you decide to use Google’s built-in collaboration feature, which allows multiple users to cooperate on building a custom search engine. We’ll get to collaboration later in this series of articles. After you update your profile, navigate back to your CSE home page, and click on the “Edit this search engine” link.

You should now be looking at the Custom Search Engine Control Panel, on the “Basics” page. This page looks a lot like the form you used to create the search engine a moment ago, with one notable exception: the sites making up the CSE aren’t shown. Click on the “Sites” link near the top of the page to see the list of resources (individual web pages, groups of pages, and web sites) that are included in your CSE. You’ll use this page to add, delete, or customize the resources that are included in your CSE.

Let’s take a quick run through the other tabs. We’ll skip “Refinements” for now, since we’ll spend a lot of time with them pretty soon. The “Look and Feel” link takes you to a page where you can customize — you guessed it — the look and feel of your search results page. Don’t hesitate to play around with these settings and give your custom search engine its own unique look and feel. The “Code” link is for another advanced feature: hosting the search box, and optionally the results, right on your own site. This is actually pretty straightforward, and you can probably make it work without much help just by following the directions on the Code page, but we’ll also cover it in a future tutorial.

Moving along the Control Panel, “Collaboration” provides access to a powerful feature that enables a group of people to cooperate on building a custom search engine. This is another feature we’ll cover in a future tutorial. “Make Money” is probably important to a lot of people — it’s how you connect your AdSense account to the ads that appear in your CSE results so that you can get paid when people click on them.

“Advanced” is a place we’ll be spending a lot of time in the upcoming articles — when you’re comfortable slinging XML code on this page you will have earned your badge as a CSE expert — but for now, let’s move on to “Preview”. Whenever you make a change to your CSE, you can quickly check the results by performing a search on the Preview page. So let’s once again do a search on “css” and look at the results. You can see that we get at least 10 pages of results, so there’s a lot of relevant CSS material, even in our short list of sites.

Modifying the Custom Search Engine

Next, we’re going to modify the list of sites that we’re searching. We’ll do this for two reasons: 1) it gives us a chance to see how to use the Control Panel to modify an existing CSE; and 2) (the real reason) we want to limit our searches to producing a small set of results for now, so that we can easily spot what’s going on as we make changes.

Click on the “Sites” link at the top of the Control Panel. Click on the check box next to the “Add Sites” and “Delete” buttons in the “Included Sites” section of the page. This is a shortcut for checking the box next to each site individually. Now that they’re all checked, remove them from our CSE by clicking on the “Delete” button. Now click on the “Include sites” link. (Nifty Ajax interface, eh? Nice UI work again, Google!) Next, click on the “Include sites in bulk” link. Copy and paste the list of URLs below into the text box. Click on the “Include just the specific pages I have entered” radio button, then on the “Save” button.

Note that we’re listing specific pages here to make it easy to see what’s going on when we start looking at results. Normally, you’ll want to include larger sets of pages, for example a whole site or a whole directory within a site. You can use very flexible pattern matching in your URL list to identify which sets of pages your CSE should search. See here for more information on using URL pattern matching.
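To make the idea of URL patterns a little more concrete, here is a minimal Python sketch of how an include list with wildcards might be checked against a URL. Treat it as an approximation only: it uses Python’s shell-style wildcards from the fnmatch module, which is an assumption on my part and not a description of Google’s actual matching rules. The two wildcard patterns are made up for illustration.

```python
from fnmatch import fnmatchcase

# Illustrative include list. The wildcard entries are hypothetical;
# a trailing "*" is taken to mean "anything under this path".
PATTERNS = [
    "www.w3.org/TR/REC-CSS1",          # one specific page
    "www.stonetemple.com/articles/*",  # every page in a directory
    "www.mattcutts.com/*",             # a whole site
]


def normalize(url: str) -> str:
    """Drop the scheme so 'http://host/path' and 'host/path' compare equal."""
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            return url[len(scheme):]
    return url


def is_included(url: str, patterns=PATTERNS) -> bool:
    """Return True if the URL matches at least one include pattern."""
    bare = normalize(url)
    return any(fnmatchcase(bare, pattern) for pattern in patterns)


print(is_included("http://www.stonetemple.com/articles/css-and-seo.shtml"))
print(is_included("www.w3.org/TR/REC-CSS2"))
```

The first call matches the directory pattern, while the second matches nothing, because the only www.w3.org entry names one exact page. That is the trade-off between listing specific pages, as we do in this tutorial, and listing whole directories or sites.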

Now let’s go to the Preview page and survey our results. Type “css” in the search box and click the “Search” button. Not surprisingly, the article from www.w3.org is at the top of the list. As the standards body that approves CSS, the W3C is the authority after all, so it’s only fitting that it sits at the top of the hill. At the bottom of the results, click on “Next”, and note that there, way down at the very bottom of the last page of results (not to be overly dramatic), is an article by yours truly. We’ll soon enough remedy that situation and restore that result to the prominent position it deserves (yes, with CSEs we can do that, and without even using any special SEO voodoo), but you’ll have to read the next article to see how to do that. For now we’re going to do a few other fun things with the results.

Adding Refinements to the Custom Search Engine

You’ve Probably Already Seen Refinements

Let’s introduce our first real “advanced” feature: Refinements. Before we get too far with building our very own Refinements, let’s take a look at an example where Google uses this concept in its main results. Yes, that’s right — refinements have been part of Google “normal search” for some time. They’re built on the Google Co-op “Topics” architecture, which also underlies Custom Search Engines. So the kinds of things that Google does with Topics, which we’ll examine in a moment, are available for your Custom Search Engine. This just begins to hint at the power of Custom Search Engines. Let’s take a look at Refinements in action by going to www.google.com and searching on “bird flu” (or most any other medical topic). With “bird flu”, I currently get results like this:

Right below the News OneBox (Google’s name for specialized search results that appear at the top of the list of results), you’ll notice the caption “Refine results for bird flu:”. Below the caption is a series of text links, known as Refinement Labels. Go ahead and click on one, say “For health professionals”. Notice that the page refreshes and a new set of results is shown. Below many of these results is an annotation that looks like “Labeled Professionals”. What’s going on is that Google, working with various health authorities and other contributors, has established a set of categories that help users further refine their queries. They’ve then applied these categories to the web by labeling, or “annotating”, web pages, so that users can sort through the various categories.

The idea is that refinements often reflect an editorial “view” of the results that can’t usually be seen through a “normal” search. This editorial view can perform several functions: it can help sort out the “best” resources in a category by taking into account direct human input (see here for more on how the Health Topic is organized), and it can help “slice” the results along lines that aren’t usually surfaced in search queries. For example, it might be very difficult to locate an authoritative list of resources discussing bird flu, targeted at health professionals, through a normal search. The Health Topic surfaces this view with a single click, as we’ve just seen. You can read a lot more about Google Co-op Topics here and here.

How Refinements Interact with Regular Search

It’s important to know a few more things about Refinements. Note that some of the results in your refined results list do not have labels, and that some of these unlabeled results actually rank ahead of labeled results. This reveals a fundamental notion about Refinements that we’ll need to keep in mind: they operate on top of the basic Google search algorithms (the rules that Google uses to locate and rank results for your search query). There are several important consequences of this:

Results that show up in a refined result set must be in the Google index to begin with (i.e., the googlebot must have crawled and indexed the page). In other words, labeling a result will not put it into the refined result set if it wouldn’t otherwise have appeared in the results of a normal Google search on the same term.

The underlying Google selection and ranking algorithms are still in place. By labeling a web page, you’re providing Google with more information about that page that may affect how it’s displayed, especially when the user chooses to refine queries. But it’s still possible in most circumstances for pages that have not been labeled to seep into the refined results, as we see in this case. The interaction between refinement and the basic Google algorithms is complex, and should be viewed as a “knob” for adjusting results, but not as a means for completely controlling them. We’ll talk more about how applying labels (the process is called “annotating URLs”, or “annotation”) can affect rankings later on.

One final thing to notice about the refined results: some of the results have multiple labels. It’s entirely possible to include the same resource (web page) in several different categories, reflecting the fact that it’s often possible to look at the same page from multiple perspectives.
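One way to picture these points is with a toy model in Python. To be clear, this is only a sketch for building intuition: Google’s actual ranking logic isn’t public, and the score values, the multiplicative boost, and the example.org URLs below are all invented for illustration. The model treats a refinement as a nudge applied to pages that are already in the base result set, and lets a page carry more than one label.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    url: str
    base_score: float                         # stand-in for the normal ranking score
    labels: set = field(default_factory=set)  # a page may carry several labels


def refine(base_results, label, boost=1.5):
    """Re-rank the base result set in favor of pages carrying `label`.

    Only pages already in `base_results` can appear, so labeling never adds
    a page. Unlabeled pages are kept, so a strong unlabeled page can still
    outrank a labeled one. The boost factor is a made-up "knob".
    """
    def score(result):
        if label in result.labels:
            return result.base_score * boost
        return result.base_score

    return sorted(base_results, key=score, reverse=True)


# Hypothetical base results for a single query.
base_results = [
    Result("example.org/overview", 0.9),
    Result("example.org/clinical-guidance", 0.5, {"professionals"}),
    Result("example.org/faq", 0.4, {"professionals", "patients"}),
]

for result in refine(base_results, "professionals"):
    print(result.url, sorted(result.labels))
```

In this toy run the unlabeled overview page still comes first, because the boost isn’t large enough to lift either labeled page past it. That mirrors what we saw in the refined results: labels adjust the ranking rather than dictate it.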

Adding Our Own Refinements

Let’s get back to our SEO custom search engine and see how we might be able to use refinements. Once again, let’s return to the Preview page of the Control Panel, and search on “css”. One of the ways that SEO people like to categorize SEO advice is as “white hat” or “black hat”. Black hat practitioners may use techniques that are designed to increase their rankings in the short term, but that put their sites at higher risk of being penalized or banned by the search engines. A classic example of a black hat technique is using hidden text to load a page with keywords that, while not relevant or beneficial to the human visitor, can be indexed by search engine spiders and may cause a page to rank differently (better on certain terms) than it would by virtue of its visible content alone. If we can identify search results that offer white hat advice, and others that offer black hat advice, we could create a refinement that allows users to choose the kind of results they’d like to see.

Now we’re not really going to include black hat advice in our list because, well, we’re white hats and don’t think it’s in the long-term best interests of our clients (or anyone who considers their web site an asset) to use black hat techniques, so we don’t really want to promote them. Instead, we’re going to have a little fun and pick on one of our SEO colleagues. We’re going to treat his blog post here as if it were black hat advice (which it’s not; rest assured, this team is one of the most highly respected SEO firms in the business). We’ll do this just to give us an example. And, OK, because it’s always fun to tweak a fellow SEO. Indeed, if you go read the post and comments, you can see that reasonable people disagree on the finer points of what constitutes white hat, and what is considered black hat, especially in that gray area in the middle (sorry for the bad pun).

The first step in creating a refinement is, not surprisingly, to go to the Refinements page of the Control Panel (click on the “Refinements” link). Once there, click on the “Add Refinements” button at the bottom of the page. An entry box should appear in the middle of your screen (yet another nice Ajax implementation). Enter a Refinement name of “white hat”, and click on the “Search only selected sites.” button. Click on “Save”, then close the help window, and click on “Add Refinements” again to add a “black hat” refinement, again choosing “Search only selected sites.”, and again click on “Save”. You can now click on the “Go to sites tab” link in the popup help message (or exit, and simply click on the “Sites” link of the Control Panel), and we can begin applying labels. To apply a label to a URL, first select the URL with its checkbox. So we’ll click on the box next to www.seomoz.org/blogdetail.php?ID=1020, our surrogate for a black hat article. Now we’ll click on the “Label actions” drop-down list, and choose “black hat”. The form should update to reflect that you’ve applied this label to the selected URL.
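If it helps to think of labeling as data, the “check the box, then pick a label” step can be sketched in Python as adding a label to a URL-to-labels mapping. This is a hedged illustration, not the Control Panel’s real storage: the function names are mine, the label identifier black_hat is a simplified stand-in, and the XML element and attribute names (Annotations, Annotation, about, Label, name) reflect my understanding of Google’s annotations format and should be checked against Google’s own documentation before you rely on them.

```python
import xml.etree.ElementTree as ET


def apply_label(annotations, selected_urls, label):
    """Mimic 'check the boxes, then pick a label': add `label` to each URL."""
    for url in selected_urls:
        annotations.setdefault(url, set()).add(label)
    return annotations


def to_xml(annotations):
    """Serialize the URL -> labels mapping as an annotations-style XML string.

    The element and attribute names are an assumption about the format.
    """
    root = ET.Element("Annotations")
    for url in sorted(annotations):
        node = ET.SubElement(root, "Annotation", about=url)
        for label in sorted(annotations[url]):
            ET.SubElement(node, "Label", name=label)
    return ET.tostring(root, encoding="unicode")


# Label our stand-in "black hat" article, as in the step above.
annotations = {}
apply_label(annotations, ["www.seomoz.org/blogdetail.php?ID=1020"], "black_hat")
print(to_xml(annotations))
```

Because each URL maps to a set of labels, applying a second label to the same URL simply adds to the set, which matches the earlier observation that one page can carry several labels.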

Now let’s apply the “white hat” label to several URLs. Click on the checkboxes in the left column (first click on the www.seomoz.org URL we just labeled to deselect it) to choose the following URLs: www.mattcutts.com/blog/seo-mistakes-unwise-comments, www.w3.org/TR/REC-CSS1, www.searchenginejournal.com/?p=2211, and www.stonetemple.com/articles/css-and-seo.shtml. This time, choose “white hat” from the “Label actions” drop-down list. When you’re done, your Refinements page should look like this:

Let’s go take our CSE for another spin, and see what our refinement handiwork has wrought. Once again, from the Preview page, let’s search on “css”. This time, we should notice the now-familiar refinement interface. We’re offered the choice of refining our results by clicking on the “white hat” or the “black hat” label. Try it out. To get back the whole “unrefined” result set, just click on the “Clear refinements” link. The screenshot below shows the results you should see if you click on the “white hat” label.

Summary

In this tutorial, we learned how to build a simple CSE, modify its list of sites, and add value to it for our users with refinements. This article serves as an introduction to CSEs in general, and to the concept of labeling URLs. So far, we’ve seen how labels can be used to generate a refinement interface that lets our users “drill down” through the result set from different perspectives. This is just the beginning of what you can do with labels. In the next article, we’ll show you how to do a series of increasingly advanced things with labels, including customizing the refinement user interface, using multiple labels, and the extremely powerful concept of background labels, which open up the ability to manipulate not only which results are shown, but also the order in which they’re displayed.