This is the first in a series of posts in which I’ll delve into details of the Content Lifecycle that I’ve previously written about – the first topics will be a series on Search – an aspect of the “Use” stage of that lifecycle.

I’ve been working in the area of enterprise search for about 5 years now for my employer. During that time, I’ve learned a lot about how our own search engine, QuickFinder, works with regard to crawled indexes and (to a lesser extent) file system indexes. I’ve also learned a lot more about the general issues of searching within the enterprise.

A lot of my thinking in terms of a search solution revolves around a constant question I hear from users: “Why doesn’t X show up when I search on the search terms Y?” Users might not always phrase their problems that way, but that’s what most issues ultimately boil down to.

In considering the many issues users face with finding content, I have come to find there are three principles at play when it comes to findability: Coverage, Identity and Relevance. What does each of these principles mean? The principle of Coverage relates to what is included in the index; the principle of Identity relates to how search results are identified; the principle of Relevance relates to when a content object shows up in search results.

First up – the principle of Coverage.

Issues with Coverage

The principle of Coverage is about the set of targets that a searcher might be able to find using your search tool. A content object must first be found by a search indexer in order to be a potential search results candidate; in other words, one answer to the user’s question above might be, “Because X isn’t in the search index!” There are many issues I’ve found that inhibit good coverage – that is, issues that keep objects from even being a potential search results candidate (much less a good search results candidate):

Lack of linkage – the simplest issue. Most web-based search engines will have a search indexer that operates as a crawler. To discover an object, a crawler must either be given a link to it or find a link to it along a path of links from a page it starts with. A web page on a site that has no links to it will not be indexed – i.e., it will not be a search results candidate.

JavaScript – a variation of the lack-of-linkage issue above. I have not yet found a crawler-based search indexer that will execute JavaScript. Any navigation on a page that depends on JavaScript to be displayed is effectively navigation that’s not there as far as search engines are concerned. JavaScript is quite common in menuing systems, so this issue can be problematic. To a user navigating in a browser that executes that JavaScript, it can be hard to understand why a crawler cannot follow the same links – “They’re right there in the menu!”

On our intranet, we have many secure areas – and, for the most part, we have a single sign-on capability provided by iChain. Our search engine is also able to authenticate while it’s crawling. However, at times, the means by which some applications achieve their single sign-on, while seemingly transparent to a user in a browser, relies on a combination of redirects and required HTTP header values that stymie the search indexer.

Similar to the previous issue – while probably 98+% of tools on our intranet support single sign-on, some do not. This can cause challenges for a crawler-based indexing engine: even among indexing engines that handle authentication, most support only one set of credentials.

Web applications that depend on a user performing a search (using a “local” search interface) to find content. Such a web application keeps its own content invisible to an enterprise search engine without some specific consideration.

Like many enterprise search engines, our search engine will allow (or perhaps require) that you define a set of domains that it is permitted to index. Depending on the search engine, it might only look at content in a specific set of domains you identify (x.novell.com and y.novell.com). Some search engines will allow you to specify the domain with wild-cards (*.novell.com). Some will allow you to specify that the crawler can follow links off of a “permitted” domain to a certain depth (I think of this as the “you can go one link away” rule). However, given that an enterprise search engine will not be indexing all domains, it can happen that some content your users need to find is not available via a domain that is being indexed.

The opposite problem can occur at times, especially with web applications that exhibit poor implementation of robots tags – you can end up with a crawler finding many, many content items that are very low quality or even useless as search results, or that are redundant. Some specific examples I’ve encountered with this issue include:

Web applications that allow a user to click on a link to edit data or content in the application – the link to the “edit view” is found by the crawler and then the “edit” view of an item becomes a potential search target.

Web applications that show a table of items and provide links (commonly in the header row of the table) to sort those lists by values in the different columns. The crawler ends up indexing every possible sort order of the data – not very useful.

The dreaded web-based calendar tool – when these have a “next month” link, a crawler can get stuck in an infinite (or nearly so anyway) link loop indexing days or months or years far out into the future.

Sites (common for a tool like a Wiki) that provide a “printer-friendly” version of a page. A crawler will find the link to the printer-friendly version and, unless the indexer is told not to index it (via robots tags), it will be included along with the actual article itself.
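To make a few of the issues above concrete, here is a minimal Python sketch of the kind of decision a crawler makes for each link it finds: is the URL in a permitted domain (or within "one link away" of one), and does it look like an edit view, a re-sorted list, or a printer-friendly duplicate? This is not how QuickFinder or any particular engine works; the parameter names (`action`, `sort`, `printable`) and the settings below are assumptions chosen purely for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, parse_qs

# Hypothetical configuration: illustrative values, not taken from any real engine.
PERMITTED_DOMAINS = ["*.novell.com"]   # wild-card domain rule
MAX_OFF_DOMAIN_HOPS = 1                # the "you can go one link away" rule
NOISE_PARAMS = {"sort", "printable"}   # assumed names for sort / print views
NOISE_ACTIONS = {"edit"}               # assumed value of an "action" parameter


def in_permitted_domain(url):
    """True if the URL's host matches one of the permitted domain patterns."""
    host = urlsplit(url).hostname or ""
    return any(fnmatchcase(host, pattern) for pattern in PERMITTED_DOMAINS)


def is_noise(url):
    """True if the URL looks like an edit view, a re-sorted list or a print view."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if NOISE_PARAMS.intersection(query):
        return True
    return any(action in NOISE_ACTIONS for action in query.get("action", []))


def should_index(url, off_domain_hops=0):
    """Decide whether a discovered URL is a search results candidate.

    off_domain_hops counts the links followed since leaving a permitted domain
    (0 means the URL was not reached by leaving a permitted domain).
    """
    if is_noise(url):
        return False
    if in_permitted_domain(url):
        return True
    return 0 < off_domain_hops <= MAX_OFF_DOMAIN_HOPS
```

With these assumed settings, `should_index("http://x.novell.com/wiki/Page?action=edit")` is rejected as noise, while a page on another domain is accepted only when it is a single link away from a permitted one. A calendar loop would need a further bound (for example, a limit on link depth), which this sketch leaves out.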

Addressing the issues

One important reminder on all of the above is that, while these represent a variety of issues, it will almost always be possible to index the content. It’s primarily a matter of recognizing when content that users are searching for is not being included, and then doing the work to index it. I find that it’s not necessarily the search engine but the need for some minor development work that will stall this at times. Weighing the cost of the work to make the content visible against the benefit of having the content as potential search targets may not show a compelling reason to do the work.

The most common solution to the above issues is to work with the content managers to identify (or possibly build) a good index page. No magic, really, but just the need to recognize the omission (or, in the case of the robots tag issue, the proliferation) of content and then the priority to act on it.
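Why an index page helps can be shown with a toy model of crawling. The sketch below treats a site as a dictionary mapping each page to the pages it links to, and "crawls" it breadth-first from a set of starting pages. The site, its paths and the `crawl` function are all hypothetical; a real crawler fetches and parses pages, but the reachability logic is the same idea.

```python
from collections import deque


def crawl(link_graph, seeds):
    """Return the set of pages reachable by following links from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: "/reports/q3" exists, but no page links to it.
site = {
    "/": ["/about", "/products"],
    "/products": ["/products/a"],
    "/reports/q3": [],
}
before = crawl(site, ["/"])

# Build an index page that links to the orphan and give it to the crawler.
site["/reports/index"] = ["/reports/q3"]
after = crawl(site, ["/", "/reports/index"])

print("/reports/q3" in before)  # the orphan is not a search results candidate
print("/reports/q3" in after)   # once linked from the index page, it is
```

Linking the index page from an existing page (rather than handing it to the crawler as a starting point) would work equally well in this model.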

Within my own enterprise, to address the issue of robots tags in web applications, I have taken a few specific steps:

First, I have been on an educational campaign with our developers for several years now – most developers quickly understand the value of making <title> tags dynamic in terms of how they interact with search. Because it’s normally such a small change in an application, they will simply “do the right thing” from the beginning once they understand the problem;

Second, part of our application development guidelines now formally includes standards around titling within web applications (this is really a detail of the education campaign);

Third, when I can, I try to ensure I have an opportunity to review web applications prior to their “go live” to be able to provide feedback to the development teams on the findability of items in their application.

Lastly, independent of “verification”, I have also provided a general methodology to our development teams to help them work through a strategy for titling and tagging in their web applications even in my absence.
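As an illustration of how small the change usually is, here is a minimal Python sketch of a helper a web application might use to emit a dynamic <title> along with a robots meta tag for views that should stay out of the index (such as edit views). The function names, the title ordering (most specific part first) and the example application are assumptions for illustration, not our actual development standard.

```python
from html import escape


def page_title(app_name, view, item_name=None):
    """Build a title from the most specific part to the least specific."""
    parts = [item_name, view, app_name]
    return " - ".join(part for part in parts if part)


def head_tags(app_name, view, item_name=None, indexable=True):
    """Return the <title> and robots <meta> tags for a page's <head>."""
    title = escape(page_title(app_name, view, item_name))
    robots = "index,follow" if indexable else "noindex,nofollow"
    return f'<title>{title}</title>\n<meta name="robots" content="{robots}">'


# Hypothetical edit view that should not become a search target.
print(head_tags("Expense Tool", "Edit", "Report 42", indexable=False))
```

Every page of the application then carries a title that identifies the item being shown, and views that should not become search targets say so themselves.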

In summary – addressing the Coverage principle is critical to ensure that the content your users are looking for is at least included in the index and is a potential search results candidate. In my next posts, I will address the principles that make a potential search results candidate a good search results candidate – the principles of Identity and Relevance.