Poor Data Quality Gives Enterprise Search a Bad Rap - Part 2

This is Part 2 of a two-article series, which delves into the details of document meta data quality issues and into running data quality checks on your search results. In Part 1 we introduced the issues around basic site indexing; this month, we look at document meta data and the quality of your search results.

Vocabulary: Meta Data: Extra attributes of your document, beyond just its raw text. Examples include Title, Last Modified Date, Author, etc. In HTML content, meta data is often defined with the <META> tag in the header of web pages.

Basic Meta Data

Basic meta data is common to almost all documents, and includes basic items like document title, last modified date, and document author. In some systems, the document summary is also a meta data field.

Here are some things to keep in mind:

Are your Dates Correct?

Quick Check - do a few simple searches with your search engine and look at the results list. Assuming the date is displayed in your results, look at them closely:

Signs of trouble:

Blank dates

All have today’s date or other recent date

The first one, blank dates, is pretty easy to spot. The second problem, which is far more common, is noticed far less often. If all the dates on your site are very similar, and many pages have the same date, you probably have a misconfiguration in your system somewhere; is it really likely that 90+% of your web site was updated yesterday? By default, many web servers return the current date and time in the HTTP "Last-Modified" header field, and many search engine spiders take this field to be the last modified date of that document. Thus, the dates of the documents incorrectly reflect the day the spider was last run.

Fixing this is heavily dependent on your web server, the application itself, and the search engine you are using. One possible solution, if you can't fix the HTTP header dates, is to have your application include a <META> "http-equiv" tag with the correct date – some search engines will pick this up and use it in place of the HTTP header date.

You can also include a time component in the Last-Modified field, if your application requires that level of precision.
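A quick script can spot the "everything was modified yesterday" symptom before a full audit. This sketch (the sample header values are made up for illustration) parses HTTP Last-Modified strings and reports what fraction of pages share the most common calendar date:

```python
# Sketch: flag a crawl where most Last-Modified headers fall on one
# calendar date -- a common symptom of servers returning the current
# time by default. Sample headers are made up for illustration.
from collections import Counter
from email.utils import parsedate_to_datetime

def most_common_date_share(http_dates):
    """Return (date, fraction) for the most frequent calendar date."""
    days = [parsedate_to_datetime(d).date() for d in http_dates]
    day, count = Counter(days).most_common(1)[0]
    return day, count / len(days)

headers = [
    "Tue, 05 Mar 2024 10:00:00 GMT",
    "Tue, 05 Mar 2024 10:00:01 GMT",
    "Tue, 05 Mar 2024 10:00:02 GMT",
    "Mon, 12 Feb 2024 08:30:00 GMT",
]
day, share = most_common_date_share(headers)
if share > 0.9:
    print(f"Suspicious: {share:.0%} of pages claim {day}")
```

In a real check you would feed in the Last-Modified headers gathered from your own crawl, and a share above 90% would be the red flag described above.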

Thorough Audit

As with other aspects of data quality, a thorough audit may be in order. A general strategy is to dump all dates to a file or database. The procedure for getting your search engine to export its meta data to a file or database varies by vendor: you might look for an export utility, use the vendor's API, or consider a 3rd-party application like NIE's SearchTrack (link).

You can do a programmatic scan for quite a few statistics; since dates can be treated as numbers, some basic mathematics can be applied:

Documents with blank dates

Min date / max date / average date

Calendar day with the least and most documents
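These statistics can be sketched in a few lines, assuming the dump contains one ISO-format date string per record, with an empty string for a blank date (adapt the parsing to whatever your engine's export actually produces):

```python
# Sketch: summary statistics over a dump of document dates, one ISO
# date per record (empty string = blank date). Adapt to your export.
from collections import Counter
from datetime import date

def date_stats(raw_dates):
    blanks = sum(1 for d in raw_dates if not d.strip())
    parsed = [date.fromisoformat(d) for d in raw_dates if d.strip()]
    ordinals = [d.toordinal() for d in parsed]
    return {
        "blank": blanks,
        "min": min(parsed),
        "max": max(parsed),
        "avg": date.fromordinal(sum(ordinals) // len(ordinals)),
        "busiest_day": Counter(parsed).most_common(1)[0][0],
    }

stats = date_stats(["2024-01-10", "", "2024-01-12", "2024-01-12"])
print(stats)
```

The "average date" works because dates convert cleanly to day numbers (ordinals) and back.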

Going further, you can create a histogram graph of dates: for each date, a bar showing how many pages carry that date, sorted with the most recent date first.

Good: Graph looks reasonably smooth

Suspect: Graph has 1 or 2 giant spikes or “stair-steps”. Do these correspond to major site updates?

Even if your site had a major overhaul, the content probably didn’t all change. Simple page changes such as navigation bar changes, layout changes, and even rotating ads can fool some systems into thinking that the content itself changed. When you see a spike in this graph related to site changes, ask yourself whether the content really changed; if it didn’t, then neither should the dates have, and you may have a configuration issue.
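The histogram itself is easy to sketch; the dates below are illustrative, and the 25% spike threshold is an arbitrary choice for demonstration:

```python
# Sketch: per-date histogram, most recent date first, with a crude
# spike check. The dates and the 25% threshold are illustrative.
from collections import Counter

dates = ["2024-03-01"] * 12 + ["2024-02-10"] * 3 + ["2024-01-05"] * 4
total = len(dates)
hist = Counter(dates)
for day, count in sorted(hist.items(), reverse=True):
    flag = "  <-- spike: site update or misconfiguration?" if count > 0.25 * total else ""
    print(f"{day}  {'#' * count}{flag}")
```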

How about your Titles?

As with dates, you can do a quick spot check and visually scan the titles in a few results lists.

Some easy things to spot:

No titles

URL or file names as titles (some engines default to this if no real title is found)

Overly short titles such as “info”, “FAQ”, etc.

Titles that are too long

Duplicate titles

Gibberish / control characters in titles

For a thorough audit consider using a script. Don’t forget to normalize titles before doing your analysis.

Trim spaces

Force empty strings to match nulls (both bad from user experience)

Lower case (for duplicate detection)
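The normalization steps above can be sketched as one small helper:

```python
# Sketch: title normalization per the checklist above.
def normalize_title(title):
    """Trim whitespace, map empty strings to None, lowercase for dedup."""
    if title is None:
        return None
    title = title.strip()
    return title.lower() if title else None

print(normalize_title("  FAQ "), normalize_title("   "))
```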

An automated script can certainly spot some problems automatically and definitively. Other data issues may or may not be "bad", depending on the context, and will require some human judgment; however, an audit program can at least flag suspect data and bring it to your attention for review.

Signs of trouble:

Red flag: How many are null or empty?

Red flag: How many are < 5 characters? (probably invalid)

Red flag: How many non-null duplicate titles do you have?

Yellow flag: strings with 8 bit characters, if you know your site should only have English content

Yellow flag: strings with HTML characters <, > and &

Yellow flag: How many are < 15 characters? (or some limit that you pick)

Yellow flag: How many are > 80 characters? (or some limit that you pick)

Yellow flag: Titles with long common prefixes:

This last item, long common prefixes, needs a bit more explanation. Some companies have practices where the full name of the company is prepended to every title (perhaps to aid in Google search results). Or perhaps each page’s title includes the lengthy name of the section of the site the page is on (perhaps trying to help visitors recognize which page they really want). But taken to the extreme, this can produce really bad titles.
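A title audit along these lines might look like the following sketch; the thresholds and sample titles are illustrative, and `commonprefix` only catches a prefix shared by all of the longer titles:

```python
# Sketch: flag suspect titles from an export. Thresholds (5/80 chars)
# and sample titles are illustrative; commonprefix only reports a
# prefix shared by ALL of the longer titles.
from collections import Counter
from os.path import commonprefix

def audit_titles(titles):
    nonblank = [t.strip() for t in titles if t and t.strip()]
    counts = Counter(t.lower() for t in nonblank)
    return {
        "empty": len(titles) - len(nonblank),
        "too_short": sum(1 for t in nonblank if len(t) < 5),
        "too_long": sum(1 for t in nonblank if len(t) > 80),
        "duplicates": [t for t, n in counts.items() if n > 1],
        "common_prefix": commonprefix([t for t in nonblank if len(t) >= 15]),
    }

report = audit_titles([
    "Acme Corp - Products - Widgets",
    "Acme Corp - Products - Gadgets",
    "Acme Corp - Products - Widgets",
    "",
    "FAQ",
])
print(report)
```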

Vocabulary: Vertical Application (repeated from Part 1): In this context, a highly specialized search application, which may be more complex than a “generic” web search application. Examples would include a pharmaceutical research database, legal evidence management and discovery, a corporate or technical documentation library, or managing regulatory and compliance documents.

Many specialized applications have meta data above and beyond the standard "title / date / author" variety. These vertical applications often contain fields used in reporting and processing, but these same fields can also be used in searches. Search engines must be configured to look for and record these additional attributes.

One way to get this data into search engines is for the main application to generate HTML content that includes <META> tags for each specific field. Many search engines readily understand this type of data. With other search engine setups, it may be necessary to configure the "database gateway" to include these additional fields.
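As a sketch, generating such <META> tags from an application's fields is straightforward; the field names below are invented for illustration:

```python
# Sketch: emit <META> tags for a vertical application's extra fields
# so a spider can pick them up. Field names are invented examples.
from html import escape

def meta_tags(fields):
    return "\n".join(
        f'<meta name="{escape(name)}" content="{escape(value)}">'
        for name, value in fields.items()
    )

html_head = meta_tags({
    "compound-id": "X-1742",
    "trial-phase": "Phase II",
    "author": "J. Smith",
})
print(html_head)
```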

Meta data can be subject to both spot check and automated data quality tests.

Things you can look at with Meta Data:

Compare the complete list of meta data items that SHOULD be there to what is actually being indexed by your search engine

Then for each of these fields:

Looking at all the documents, are any fields missing? Is this OK for this particular field? (some fields may not be required)

Are ALL the field values missing? This is certainly suspicious.

How many unique values does this field have? For a value-constrained field such as “author” or “color”, this should be much smaller than the total number of records. Sometimes this number is larger than it should be because the same data is entered in more than one form, such as “Firstname Lastname” vs. “Lastname, Firstname” – you may want to normalize this data during input, or when it is exported to the search engine.

For fields that should be unique, are there any duplicates?

For “typed” fields, run them through a validator. For example, for dates, have a script try to parse all non-blank dates.
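These per-field checks can be sketched as a small helper; the record layout (a list of dicts) and the field names are assumptions about your export format:

```python
# Sketch: per-field quality checks over exported records; the list-of-
# dicts layout and field names are assumptions about your export.
from datetime import date

def check_field(records, field, validator=None):
    values = [r.get(field) for r in records]
    present = [v for v in values if v not in (None, "")]
    result = {
        "missing": len(values) - len(present),
        "all_missing": not present,
        "unique": len(set(present)),
    }
    if validator:
        result["invalid"] = sum(1 for v in present if not validator(v))
    return result

def is_iso_date(s):
    try:
        date.fromisoformat(s)
        return True
    except ValueError:
        return False

records = [
    {"author": "Smith, Jane", "modified": "2024-02-01"},
    {"author": "Jane Smith",  "modified": "02/01/2024"},
    {"author": "",            "modified": "2024-02-03"},
]
author_report = check_field(records, "author")
date_report = check_field(records, "modified", is_iso_date)
print(author_report)
print(date_report)
```

Note how the sample data shows both problems at once: the same author entered in two forms, and one date that fails the validator.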

A majority of this search engine data quality series is focused on "data prep": making sure that your search engine has an accurate and up-to-date representation of your data. However, there are a few aspects of data quality that come into play only after data prep, when a user actually runs a search and looks at the results. This type of information can be found by looking at your search results, and also by looking at your search activity logs.

Looking at Search Activity Logs

Search logs are a marvelous resource for understanding not only your search engine, but also your visitors. Note: we’re not talking about "click-tracking" – you should be looking at reports that directly show the searches your users typed in, and what they got back for results. Old-fashioned click-tracking requires too much guesswork to interpret; search logs show you exactly what visitors were thinking.

Some basic things to look at in search activity reports:

What are the top 100 searches on your site?

Do they all bring back results?

What are the top 100 searches that return no results?

What searches return > 10% of your content?

Who are the most frequent visitors, and what are they searching for?

For each search, which document is most frequently clicked on?
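A sketch of the first few checks, assuming a simplified log export of one query and its result count per line (real log formats vary by engine):

```python
# Sketch: top searches and top zero-result searches from an activity
# log. The tab-separated "query<TAB>result-count" format is assumed.
from collections import Counter

log_lines = [
    "reset password\t12",
    "reset password\t12",
    "vpn setup\t0",
    "kwik start guide\t0",
    "vpn setup\t0",
]

queries = Counter()
zero_hits = Counter()
for line in log_lines:
    query, hits = line.rsplit("\t", 1)
    queries[query] += 1
    if int(hits) == 0:
        zero_hits[query] += 1

print("Top searches:", queries.most_common(3))
print("Top zero-result searches:", zero_hits.most_common(3))
```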

By looking at search logs, and taking appropriate actions based on your discoveries, you can:

Suggest more pertinent pages or adjust document ranking

Suggest related documents or vocabulary

Improve your content

Improve site navigation

Are Results Relevant?

Do a spot check by running your site’s top 10 searches and look at the results:

Do the documents returned seem relevant?

Can you think of better documents to return?

For a more thorough audit, create bar graphs showing the relevancy histogram of the first 100 documents for each of the 10 searches. For each segment of the score range (for example, 100% down to 95%), the bar represents how many documents fell in that range.

Normal: Is there a curve that slopes down as you go to the right?

Normal: Does the curve flatten out as you go to the right, and get steep near the left?

Is the curve smooth, or does it have big "steps"?
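Bucketing the scores into 5% ranges is easy once you can extract them; the scores below are made-up percentages for the first results of one search:

```python
# Sketch: bucket result scores into 5% ranges, highest bucket first.
# Scores are made-up percentages for the first results of one search.
def relevancy_histogram(scores, width=5):
    buckets = {}
    for s in scores:
        low = (min(int(s), 99) // width) * width
        buckets[low] = buckets.get(low, 0) + 1
    return dict(sorted(buckets.items(), reverse=True))

scores = [98, 96, 93, 91, 90, 72, 71, 70, 40]
h = relevancy_histogram(scores)
for low, count in h.items():
    print(f"{low:3d}-{low + 4}%  {'#' * count}")
```

A gap like the one between the 90% and 70% buckets here is exactly the kind of "step" worth investigating.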

Big "steps" indicate a scoring algorithm that might benefit from some tuning. Are users clicking on the documents that are returned by searches? Or are they instead, for example, clicking on the 4th or 5th document down for some searches? Users will show, though their actions, which searches are “working” and which are not.

Are they using the same spelling? This is particularly troublesome when companies intentionally misspell common words to get around trademark restrictions (“Kwik” vs. “Quick”, etc.).

Are they using the same punctuation as your documentation? This is particularly bothersome with product model and serial numbers. Your site should match searches whether the user types in "AV400", "AV 400" or "AV-400", for example.
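One common remedy is to normalize such variants on both the index side and the query side; a minimal sketch (the regex only handles a letters-then-digits pattern like the example above):

```python
# Sketch: fold "AV400" / "AV 400" / "AV-400" style variants to one
# form; the regex only handles a letters-then-digits pattern.
import re

def normalize_model(text):
    # Drop spaces/hyphens between a letter and a digit, then uppercase.
    return re.sub(r"(?<=[A-Za-z])[\s-]+(?=\d)", "", text).upper()

for variant in ("AV400", "AV 400", "av-400"):
    print(normalize_model(variant))
```

Applied consistently to both the indexed text and incoming queries, all three forms match the same documents.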