Archive for the ‘search’ Category

I’m going to “Innovations in Web and Enterprise Search” at BCS next week

Search Solutions is a special one-day event dedicated to the latest innovations in web and enterprise search. In contrast to other major industry events, Search Solutions aims to be highly interactive and collegial, with attendance limited to 60-80 delegates.

I’ve been thinking about the search functionality for our online shop this week. I’ll write up our approach to search properly at a later date, but for now I thought I’d share the variety of search forms I’ve seen on other online shops.

Will from Distilled just ran a free webinar about the SEOMoz tools, so it seemed a good opportunity to learn more about what else is available from SEOMoz.

Will says that SEO tools (some free) give you three things:

Quick research (basic understanding)

Deep dive research (actionable insights)

Making things pretty for boss/client (ever important)

The Pro tools aren’t particularly cheap, so it was useful to have someone talk you through what the return on that investment would actually be. In places the data looks a lot like the stuff you get from your web analytics tool e.g. Google Analytics. But remember this is data on your competitors as well as your own site.

Using AutoTrader as an example, Will talked about

SEOToolbox: free tools. Will prefers Firefox plugins to some of these, but still likes and uses the Domain Age tool

Term Target: free, aggregates data on a given page, identifies keyphrases

Rank Tracker: Will was keen to stress that individual keyword ranking isn’t the important thing, though often your boss will demand it. Makes little graphs and will export to CSV. Can combine with analytics data, e.g. using the Google Analytics API

Firefox toolbar: Will loves this. Uses it more than any other SEO tool. Pro version better. Shows some PageRank-esque data for page and domain. Going up 1 MozRank point is equivalent to being 8x stronger, so decimal points are important. MozTrust is similar but restricted to links from trusted sites. Page Analysis also part of the toolbar? Alternative is Bronco tools.
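If Will’s one-point-equals-8x figure holds, MozRank behaves like a base-8 logarithmic scale, which makes the “decimal points matter” claim easy to sanity-check. A rough sketch (the 8x figure is from the talk; the extrapolation is mine):

```python
# Rough strength multiplier implied by a MozRank difference, assuming
# Will's "one point = 8x stronger" figure describes a logarithmic scale.
def mozrank_multiplier(old_score: float, new_score: float) -> float:
    """Return the implied link-strength multiplier between two MozRank scores."""
    return 8 ** (new_score - old_score)

# A full point is 8x; even half a point is nearly 3x,
# which is why the decimal places are worth watching.
print(round(mozrank_multiplier(4.0, 5.0), 1))  # 8.0
print(round(mozrank_multiplier(4.0, 4.5), 1))  # 2.8
```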

Linkscape: the tool SEOMoz are heavily investing in. Web graph of which pages link to each other on the web. Will doesn’t see an alternative to this. Free version does basic stuff. Pro version produces more data and prettier data. Will recommends the Adv Link Intelligence Report. You can get data on who links with “nofollow” which Will thinks is unique data.

Of the ‘public’ tools I mostly use the Adwords Keyword Tool, in spite of not using Adwords.

Try searching for ‘phones’. From the results you can see whether ‘cell phone’, ‘wireless phone’ or ‘mobile phone’ is the dominant language in your area. When there are labels that my team is arguing about, I’ll sometimes see if the Keyword Tool can add evidence to the argument.

So you’ve tested your site search. You’ve submitted some bugs. You’ve probably got lots of responses to those bugs along the lines of “oh, that’s just a config setting”, “you don’t understand – that’s a feature of how this product works” and “the search is fine, you just need to get the authors to do their metadata properly”.

Now the config statement is fine, so long as changing the configuration actually sorts the problem. Don’t sit back at this point. Either make the recommended changes yourself or insist the supplier does. Don’t close the bug until they’ve proved the point.

Changes you can usually make to the configuration

change the crawled pages

change the indexed fields

default query syntax

change stop/noise words, stemming and the thesaurus

ranking parameters

Be very, very careful if you are changing the ranking parameters. In fact, I’d suggest this is a mini-project in its own right. You’ll need to be able to make one change at a time and compare the new results with the old, across a large set of queries. You probably want to do this with someone who has experience with the specific search engine.

The other two scenarios/excuses are more problematic. If the search has a feature that you think makes the results bad, you’ll need to see if you can get it switched off or removed. If you can’t, you may have chosen the wrong product.

If your supplier thinks that teaching authors to do metadata properly is a simple goal then you may need a new supplier. This is hardly the attitude that made Google the search masters.

(I’m not contradicting my Best Bets post here: I think there are scenarios where properly motivated and focused editorial staff can do a better job than natural search results. But I’m not thinking of your average author, I mean your central web or search team. I mean people paid to care about search.)

You can change the guidelines/training for authors. You can probably get the current batch of authors to listen to some simple tips and pointers. They might remember. They might pass them on. But be realistic: how much control do you have over the authors? Metadata education is often a thankless and futile task. The best solutions are those that don’t require the authors to think about search, whether that is technology or intervention by search specialists.

Where the natural results just aren’t good enough and the authors can’t help there are things you can do on the search results page to help the user out.

Not really about testing but still coming soonish: Changing the interface

Set aside a reasonable block of time where you won’t be interrupted. Schedule later sessions bearing in mind the crawl timescales. If you make changes you’ll need to wait for the crawl to run before you can test again.

You need content in the system before you can test search. The ideal scenario is to be testing search once a site or system is fully populated with real content but this often isn’t possible. Don’t wait for the system to be populated if that means you won’t be able to make any technical changes.

So allow time for content creation as part of testing. You’ll probably want a mix of real content and dummy content that has been specifically written to test an aspect of search.

You’ll need to record the results so you need a spreadsheet.

Set up columns something like this: the query (linked if you are running the tests from here), whether the results are ok, a description of the issues, hypotheses about causes, changes or adjustments made to validate, bugs reported, screenshots (where necessary).
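The columns above can be seeded into a starter spreadsheet in a few lines. A minimal sketch: the column labels and the example row are my own naming, not a standard format.

```python
import csv

# Hypothetical starter spreadsheet for recording search-test results,
# using the columns suggested above. Column names are illustrative.
COLUMNS = [
    "query", "results_ok", "issues", "hypotheses",
    "changes_made", "bugs_reported", "screenshot",
]

def create_results_sheet(path: str) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        # Seed with an example row so the format is obvious to testers.
        writer.writerow({
            "query": "phones",
            "results_ok": "no",
            "issues": "top result is an admin page",
        })

create_results_sheet("search_tests_round1.csv")
```

Saving a fresh copy of the file per round gives you the versioned worksheets described below.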

Create new versions of the worksheet each time you test, and label accordingly. If you make changes to the content or the configuration then test again after the crawl has run.

Add queries to the spreadsheet as you go. No matter how good your original lists, you’ll explore other issues as you actually use the system.

I’m not merely testing. I’m attempting to analyse and resolve the issues. You could argue that I shouldn’t need to do this, I could just log all the issues with the supplier and get them to resolve them. In my experience it is more successful to do as much as possible yourself.

So what does ok mean? Inevitably it is subjective and it is also qualitative. You could compare with benchmarking metrics for the existing site but some part of the testing usually relies on the subjective judgement of the expert tester. Where time for testing is fixed, I raise the bar with different rounds of testing i.e. round one could be focusing on results that are patently unacceptable, with later rounds raising the standard of quality.

(This testing is in no way meant to replace user testing; the intention is more to test that the functionality works as promised and to get the results to the sort of quality that is worth putting in front of test participants!)

Mostly you’ll have no problem spotting bad results. Explaining the bad result is the challenge.

Possible sources of issues

Incomplete crawls. First check the search engine successfully completed a crawl. Testing is easiest if you can check yourself. Otherwise you’ll keep having to nag the suppliers/IT to tell you if the crawl went ok. Ask if there is an interface that shows how the crawl went and ask for access.

What is the default query syntax? This is a simple one to check off. If you thought the search was performing an OR search and it is actually running AND then that might well explain why you aren’t happy with the results. And vice versa.
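The AND/OR difference is easy to see with a toy example. This is only an illustration of the matching logic, not any particular engine’s implementation:

```python
# A toy illustration of why default query syntax matters: the same
# two-word query matches very different document sets under AND vs OR.
docs = {
    1: "mobile phone reviews",
    2: "mobile broadband deals",
    3: "phone accessories",
}

def search(query: str, mode: str = "AND") -> set[int]:
    terms = query.lower().split()
    if mode == "AND":
        return {i for i, text in docs.items() if all(t in text for t in terms)}
    return {i for i, text in docs.items() if any(t in text for t in terms)}

print(search("mobile phone", "AND"))  # {1}
print(search("mobile phone", "OR"))   # {1, 2, 3}
```

If you expected the narrow AND behaviour and the engine is silently running OR, the extra loosely-related results will look like a relevance failure when it is really a configuration issue.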

Documents/pages that shouldn’t be crawled? Pages I’ve seen in the results that shouldn’t have been there include:

admin pages (in one case the blocked profanity list!)

permission controlled pages

quiz answers

form thank-you page

user profile information

You may need to get rid of a lot of these pages before you can see the true quality of the results.

Documents/pages that should be crawled?

other specified domains in addition to your main site e.g. www.rnibcollege.ac.uk as well as www.rnib.org.uk

all sub-domains e.g. not just www.bbc.co.uk but also jobs.bbc.co.uk and news.bbc.co.uk.

pages regardless of their position in the site

Office and other documents

images, video, audio (depending on how you want these assets to appear)

What is being indexed within a document/page? You can check by creating a variety of dummy content and adding your test keyword to a different field on each piece of dummy content. Choose an unusual keyword that won’t be appearing in the rest of the content (I tend to use my mother’s Polish maiden name). Fields to check:

titles

URLs

meta descriptions and keywords

main page content

authors and other metadata relevant to your content set

navigation and page furniture (you’ll see this cause trouble more when the content set is small)

full content of Office documents, PDFs, etc.?

metadata attached to multimedia assets
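Generating that dummy content is mechanical: one page per field, with the tracer keyword placed in exactly one field each time. A sketch, with illustrative field names and a made-up tracer word:

```python
# Sketch of generating dummy pages for index-coverage testing: one page
# per field, each carrying the tracer keyword in a single field only.
# Field names and the tracer word are illustrative.
TRACER = "zablotskaya"  # pick something that won't occur in real content
FIELDS = ["title", "url", "meta_description", "body", "author"]

def make_test_pages(tracer: str) -> list[dict]:
    pages = []
    for field in FIELDS:
        page = {f: f"placeholder {f}" for f in FIELDS}
        page[field] = tracer  # tracer appears in exactly one field
        page["id"] = f"test-{field}"
        pages.append(page)
    return pages

for page in make_test_pages(TRACER):
    print(page["id"])
```

After the crawl, searching for the tracer word tells you exactly which fields made it into the index: each hit maps back to one field.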

What filters are being applied? Check for:

stop words

stemming

thesaurus

Ask if there is an interface where you can view/edit these filters. If not, ask for copies of the actual files.

What is affecting the ranking? This is complicated to test with any ease as most systems use a variety of factors and there’s usually a level of mystery in the supplier communications. Consider:

where the keyword appears

how many times the keyword appears

the ratio of keywords/article length

type of document

links to the document, text of those links, authority/rank of the linking page
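To make the factors above concrete, here is a hedged sketch of how such signals might combine into a single score. Real engines use proprietary formulas and weights; every number here is pure illustration:

```python
# A hedged sketch of how the ranking factors above might combine; real
# engines use proprietary weights, so these numbers are pure illustration.
def score(doc: dict, keyword: str) -> float:
    body = doc.get("body", "").lower()
    title = doc.get("title", "").lower()
    occurrences = body.split().count(keyword)
    length = max(len(body.split()), 1)
    s = 0.0
    if keyword in title:
        s += 3.0                       # keyword position: title is worth more
    s += 2.0 * (occurrences / length)  # keyword density, not raw count
    s += 0.5 * doc.get("inbound_links", 0)  # link-based authority
    return s

doc = {"title": "phone reviews", "body": "phone phone deals", "inbound_links": 4}
print(round(score(doc, "phone"), 2))  # 6.33
```

Even this toy version shows why one-change-at-a-time testing matters: nudging any single weight shifts the ordering of every document at once.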

If you’ve been told that your search system utilises “previous user behaviour” to adjust ranking then this can make testing a bit tricky. It also gives the suppliers a black box to hide behind if you don’t think the search is working right.

I’ve been told “don’t worry about testing search, this is a learning system”. Which sounds lovely but on day one the search results still need to be good enough to go live and you’re going to have to really work hard to get a grip on how the system is working. And who says it is learning the right lessons? In this particular scenario I doubled the amount of time I had set aside for testing.

In last week’s post about Best Bets I commented that search software is “certainly not good enough without a lot of work. A lot of expensive work. If your supplier says ‘the search is really good, you don’t need to worry about it’ then you definitely need to worry about it.”

Worrying about and testing search systems has been a common theme in my working life: whether that involves benchmarking the performance of an existing system, testing a new one prior to launch, or comparing vendors when choosing a new system.

I’ve had varying levels of exposure to APR Smartlogik, Google, Inktomi/Yahoo, Fast, Verity, Autonomy, SharePoint. At this moment I’m in the middle of testing and tweaking the search for a SharePoint powered website. The challenges are surprisingly similar to those I encountered when working with Muscat in 2001.

Having gone through such similar processes so many times, now seemed a good time to write it all down. I’ve divided my process into three stages: preparation, running the tests, and making changes.

Preparation

1. Ask the suppliers lots and lots of questions. You are after actual answers, testing their level of knowledge and letting them know that the quality of the search matters to you. Don’t rely wholly on the supplier’s answers. Find other users and do your own reading to validate what the supplier tells you.

Most important to find out:

Ranking criteria

What is configurable; of those configurations which have a graphical interface; and of those which have a user friendly graphical interface?

Other useful things to find out:

What query syntax is supported? What is the default syntax?

What are the stemming rules and which words are stop words? Ask for copies

Your list could be a simple list of terms, but you’ll find it easier to run many rounds of tests if you set your list up as HTTP links that will run the query in your test search engine.
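Generating those links is a one-liner once you know your engine’s query parameter. A sketch: the base URL and the “q” parameter name are stand-ins, so substitute whatever your search engine actually expects.

```python
from urllib.parse import urlencode

# Turning a plain term list into clickable test links. "q" is a stand-in
# keyword parameter; substitute whatever your search engine expects.
BASE = "http://www.example.com/search"

def query_links(terms: list[str]) -> list[str]:
    return [f"{BASE}?{urlencode({'q': t})}" for t in terms]

for link in query_links(["phones", "annual report", "contact us"]):
    print(link)
```

Paste the output into the query column of your test spreadsheet and each round of testing becomes one click per query.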

If you are testing multiple search engines and you have access to coding skills then you can set up the list to run automatically across the range of search engines and display your results back to you, saving lots of time. Or if you are running multiple rounds of testing on the same search system, an interface that checks to see if the results have changed since last time is invaluable.
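The changed-since-last-time check doesn’t need to store full result sets: a fingerprint per query is enough. A sketch of the comparison logic (the fetching of live results is left out, and the sample data is invented):

```python
import hashlib
import json

# Sketch of spotting result changes between test rounds: hash the
# ordered result URLs per query and compare against the previous round.
def fingerprint(results: list[str]) -> str:
    return hashlib.sha256(json.dumps(results).encode()).hexdigest()

def changed_queries(previous: dict, current_results: dict) -> list[str]:
    """previous maps query -> fingerprint; current_results maps query -> URL list."""
    return [q for q, res in current_results.items()
            if fingerprint(res) != previous.get(q)]

round1 = {"phones": fingerprint(["/a", "/b"]), "jobs": fingerprint(["/c"])}
round2_results = {"phones": ["/a", "/b"], "jobs": ["/d", "/c"]}
print(changed_queries(round1, round2_results))  # ['jobs']
```

Hashing the ordered list means a mere reshuffle of the same results still counts as a change, which is usually what you want when tuning ranking.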

But for most of us, we’ll be working from a list of queries and running them one by one.

I was working on a Best Bets system this week, which is essentially what I did 8 years ago on my first BBC project. It is nice to be working on something straightforward but I’ve had to do a lot of explaining of the concept. What follows is my advice if you are thinking about adding Best Bets to your search.

What are Best Bets?

Best Bets are essentially editorial picks that appear at the top of the search results. They are a manual intervention for use when the search engine isn’t delivering the best results for the users. Some sites use them to fix just a couple of problematic queries but others have built up extensive databases of thousands of best bets.

Some search systems have Best Bets functionality as standard (surprisingly, SharePoint is one of these) or you can have something bespoke added. The first system I ever worked with was just a basic text file that I edited and uploaded to a server – you should be able to get something better than that!
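At its core a Best Bets system is just a lookup table layered on top of the organic results, not unlike that old text file. A minimal sketch (the queries and URLs are invented):

```python
# A minimal sketch of a Best Bets lookup layered on top of organic
# results, roughly like the text-file system described above.
BEST_BETS = {
    "annual report": ["/about/annual-report"],
    "jobs": ["/work-for-us"],
}

def search_with_best_bets(query: str, organic: list[str]) -> list[str]:
    picks = BEST_BETS.get(query.lower().strip(), [])
    # Best Bets go first; drop duplicates from the organic list below.
    return picks + [url for url in organic if url not in picks]

print(search_with_best_bets("jobs", ["/news/jobs-cut", "/work-for-us"]))
# ['/work-for-us', '/news/jobs-cut']
```

Note the de-duplication: if the organic results already surface the pick, you don’t want it appearing twice on the page.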

A Bad Idea?

Kas Thomas thinks that we shouldn’t do best bets:

“In point of fact, the search software should do all this for you. After all, that’s its job: to return relevant results (automatically) in response to queries. Why would you sink tens (or hundreds) of thousands of dollars into an enterprise search system only to override it with a manually assembled collection of point-hacks? Sure, search is a hard problem. But if your search system is so poor at delivering relevant results that it can’t figure out what your users need without someone in IT explicitly telling it the answer, maybe you should search for a new search vendor.”

This is the sort of language I expect from the vendors but it is a bit surprising from industry analysts. Yes, the search systems should be good enough. But they’re not. They’re certainly not good enough without a lot of work. A lot of expensive work. If your supplier says “the search is really good, you don’t need to worry about it” then you definitely need to worry about it.

Oh, and IT shouldn’t be managing the Best Bets anyway. In the teams I’ve worked with, it has always been an editorial or product management role. After all, why would you build a simple tool to allow editorial intervention and then ask IT to put the content in?

A simple best bets solution that can be maintained by editorial/product teams rather than scarce technical experts (or, worse, expensive consultants) is often a better business solution than battling with the search algorithm to try and get it right for all scenarios. Particularly on a tight budget.

Other pros for Best Bets:

Just fixes that problem. It doesn’t change any other results. There’s no mysterious black box that has you banging your head against the desk wondering why, when you changed Property X to fix the results for Query Y, the results for Query Z changed like that.

Fixes the problem straight away. You don’t have to wait for the next crawl or even for an emergency crawl to finish. Sometimes it really is that important. Other times someone else thinks it really is that important and you want them to leave you alone now.

Buys you time whilst you improve the algorithm.

Managing Best Bets

The critics are however correct that Best Bets have some drawbacks. You have to create and maintain them. If you let the links break then you’ve created a worse user experience than the one you set out to fix.

Don’t go overboard. Only create them where there are clear problems

Plan for maintenance time. Who is going to add Best Bets and when? Do they have time to check existing Best Bets?

Make sure you have access to search logs so you can see what terms users might be having difficulties with

If possible, set up a broken and redirected link checker to run over the Best Bets
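The checker logic itself is simple; the interesting part is distinguishing broken links from silent redirects, since both degrade a Best Bet. A sketch with the HTTP fetch injected so the classification can be shown with canned data (everything here is illustrative):

```python
# Sketch of a broken/redirect checker for Best Bets. The fetch step is
# injected so the classification logic can be demonstrated without real
# network calls; in practice fetch would wrap urllib or similar.
def classify(status: int, requested_url: str, final_url: str) -> str:
    if status >= 400:
        return "broken"
    if final_url != requested_url:
        return "redirected"  # the target moved: update the Best Bet
    return "ok"

def check_best_bets(bets: dict, fetch) -> dict:
    """bets maps query -> URL; fetch(url) returns (status_code, final_url)."""
    report = {}
    for query, url in bets.items():
        status, final = fetch(url)
        report[query] = classify(status, url, final)
    return report

# Demo with a canned fetch standing in for real HTTP requests.
fake_responses = {
    "/jobs": (200, "/jobs"),
    "/annual-report": (200, "/about/annual-report"),  # silently redirected
    "/old-page": (404, "/old-page"),
}
print(check_best_bets(
    {"jobs": "/jobs", "report": "/annual-report", "contact": "/old-page"},
    lambda url: fake_responses[url],
))
# {'jobs': 'ok', 'report': 'redirected', 'contact': 'broken'}
```

Run it on a schedule and feed the “broken” and “redirected” rows straight into the maintenance time you planned above.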

And yes, do look at what your Best Bets tell you about the weakness of your search system. If you have the permissions and the skills you may be able to put that knowledge to use in improving the algorithm. But even if you can’t make the changes yourself and there’s no budget for incremental changes (which there often isn’t) then you can at least start building a business case for a search improvement project.

Designing the display

It is tempting to strongly highlight the Best Bets to draw attention to them but this is one area where usability testing tells us a different story.

Users demonstrate a very strong preference for the first ‘ordinary’ looking search result, which is presumably a behaviour they have learnt from web search engines. With search engines any result that is styled slightly differently is probably an ad. Some users didn’t even notice the existence of best bets when we had tried to draw attention to them. This may be a similar situation to banner blindness.

So don’t make a song and dance about it. We might feel the need to tell the user all the effort we’ve put into helping them but ultimately they just want the right result for their query. And they don’t care how it gets to the top of the results, so long as it is at the top of the results.

(Think about it. You’d never highlight a set of the results with a label saying “Brought to you by the IA tweaking the algorithm to weight page title more heavily”)

3 steps to happy Best Bets

In summary:

If the system you are buying doesn’t come with a built-in Best Bets system, see if you can get a simple one added on. Think of it as a safety net for once all the developers and project managers have packed up and left you to your own devices.

Put them at the top of the search results. If you feel the need to style them differently then keep the styling as minimal as possible

Inside the Index and Search Engines is 624 pages of lovely SharePoint search info. It is the sort of book that sets me apart from my colleagues. I was delighted when it arrived, everyone else was sympathetic.

The audience is “administrators” and “developers”. I’m never sure how technical they are imagining when they say “administrators” so I waded in anyway. The book defines topics for administrators as: managing the index file; configuring the end-user experience; managing metadata; search usage reports; configuring BDC applications; monitoring performance; administering protocol handlers and iFilters. I skimmed through the content for developers and found some useful nuggets in there too.

The book begins by setting the scene, with lots of fluff about why search matters and some slightly awkward praise for Microsoft’s efforts. It gets much more interesting later, so you can probably skip most of the introduction.

Content I found useful:

Chapter 1. Introducing Enterprise Search in SharePoint 2007

p.28-33 includes a comparison of features for a quick overview of Search Server, Search Server Express and SharePoint Server.

“Queries that are submitted first go through layers of word breakers and stemmers before they are executed against the content index file is available. Word breaking is a technique for isolating the important words out of the content, and stemmers store the variations on a word” p.32

Keyword query syntax p.44

maximum query length 1024 characters

by default is not case sensitive

defaults to AND queries

phrase searches can be run with quote marks

wildcard searching is not supported at the level of keyword syntax search queries. Developers could build this functionality using CONTAINS in the SQL query syntax

exclude words with a leading minus sign, e.g. -rnib

you can search for properties, e.g. rnib author:loasby

property searches can include prefix searches, e.g. author:loas

properties are ANDed unless it is the same property repeated (which would run as an OR search)

Search URL parameters p.50

k = keyword query

s = the scope

v = sort e.g “&v=date”
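Those three parameters make it easy to construct test-result URLs programmatically. A sketch using the k/s/v parameters listed above; the host and page name are illustrative, so check your own site’s results page path.

```python
from urllib.parse import urlencode

# Building a SharePoint 2007 results URL from the parameters listed
# above (k = keyword, s = scope, v = sort). The host is illustrative.
def results_url(base: str, keyword: str, scope: str = "", sort: str = "") -> str:
    params = {"k": keyword}
    if scope:
        params["s"] = scope
    if sort:
        params["v"] = sort
    return f"{base}?{urlencode(params)}"

print(results_url("http://portal.example.com/searchresults.aspx",
                  "annual report", sort="date"))
# http://portal.example.com/searchresults.aspx?k=annual+report&v=date
```

This pairs naturally with a spreadsheet of test queries: generate one link per query (and per scope) and each round of testing is a column of clicks.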

Chapter 4: The Search Usage Reports

Search queries report contains:

number of queries

query origin site collections

number of queries per scope

query terms

Search results report contains:

search result destination pages (which URL was clicked by users)

queries with zero results

most clicked best bets

search results with zero best bets

queries with low clickthrough

Data can be exported to Excel (useful if I need to share the data in an accessible format).

You cannot view data beyond the 30 day data window. The suggested solution is to export every report!

Where search analytics is concerned it appears the RNIB is actually doing what everyone else is doing i.e. using Google Analytics:

“The use of Google Analytics is very much on the increase. Just under a quarter of responding organisations (23%) now use Google Analytics exclusively compared to only 14% a year ago.
A further 57% of respondents are using Google Analytics in conjunction with another tool (up from 52% in 2008), which means that 80% of companies are now using Google for analytics compared to 66% last year…

The majority of responding companies believe that they have set up Google Analytics properly.
There is more doubt among those who do not use Google exclusively, with 23% of these
respondents saying they don’t know if it has been properly configured”

And I’m firmly in the latter 46% camp these days:

“since 2008 there has been an increase from 8% to 15% of companies who have two dedicated web analysts and a decrease in the proportion of companies who have one analyst (from 32% to 26%).
But while this is a positive development, it can also be seen that exactly the same proportion of companies (46%) report that they do not have any web analysts.”