The Sad State of Dates: Don't Let This Happen to Your Enterprise Search Application

Last Updated Mar 2009

By: Mark Bennett, NIE Enterprise Search - Issue 1 - April 25, 2003

Things change. Technologies, corporate fortunes, product lifecycles, politics, pretty much everything. Next to the title of a web page or document, the date that document was written provides incredibly valuable context. For example, if you're looking for something about the programming language Java and find interesting information, only to notice that the document was written in 1996, then the information is out of date, and possibly obsolete.

The inventors of the Web certainly realized this. A basic part of the HTTP specification, the definition of how data is sent across the World Wide Web, defines specific header fields, including the Last-Modified date. What a wonderful plan: every document will have, as part of its basic information, the date and time it was last modified (or created). Sadly, this part of the HTTP protocol has been largely ignored and is rarely populated with valid data. What is even more frustrating is that many Web spiders and indexers actually still believe this field.

Unfortunately, because so many web pages are generated or marked up dynamically, application servers often report the current date and time as the modification date, rather than the date the content was actually created or last changed. This effectively removes any way to separate older content from newer content, and undermines the usefulness of the Last-Modified field in the HTTP specification.
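One rough way to spot this behavior on your own servers is to compare the Last-Modified header with the Date header on the same response. The sketch below is only an illustration: the function name, the five-second tolerance, and the idea of treating a missing header as suspect are assumptions, not part of any standard or product.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def last_modified_is_suspect(headers, tolerance_seconds=5):
    """Guess whether a response's Last-Modified value can be trusted.

    `headers` is a plain dict of HTTP response headers. The rule used here
    is a hypothetical heuristic: a missing or unparseable Last-Modified, or
    one that is nearly identical to the Date header, suggests the server
    stamped the page at the moment it was served.
    """
    last_modified = headers.get("Last-Modified")
    served_at = headers.get("Date")
    if last_modified is None:
        return True   # no modification date reported at all
    if served_at is None:
        return False  # nothing to compare against, so give it the benefit of the doubt
    try:
        gap = parsedate_to_datetime(served_at) - parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return True   # one of the dates could not be parsed or compared
    return abs(gap) <= timedelta(seconds=tolerance_seconds)


if __name__ == "__main__":
    response_headers = {
        "Date": "Fri, 25 Apr 2003 10:00:00 GMT",
        "Last-Modified": "Fri, 25 Apr 2003 10:00:00 GMT",
    }
    print(last_modified_is_suspect(response_headers))
```

A page that was modified by coincidence just before it was fetched would also be flagged, so a check like this is best used to survey a whole site rather than to judge a single document.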

The impact from this problem can only grow worse over time. In 1997, searching for information on the public search portals brought back documents with incorrect dates, but the content itself was often new so the discrepancy was a minor inconvenience. Now, in 2003, much of the content is five years old, but search portals continue to report documents from 1997 as new. A search for Java syntax done today, with no ability to display results by date, may return obsolete content with no indication that the document is old and questionable.

If you are using search technology on your corporate web or intranet site, you probably have the same problem with your content. In fact, on corporate sites the problem is often worse, because your site is often the primary archive for legacy information, and you really may want to identify documents from a specific period. Which products were released in 2001? Was a certain policy in effect in June of 1999? What’s new on the web site? Without effective data management, you cannot answer these questions.

What can you do about the situation on your site? If you have control over your content and servers, which is typically the case for your own public and intranet sites, you can probably implement a solution to the problem.

In fact, there are at least two ways to make sure the Last-Modified date is being set correctly. The primary way, of course, is to configure your web server to report accurate Last-Modified dates. This is often not the default setting, and may take some doing. It may be even more difficult if your content is dynamically generated by an application server like ColdFusion or from a database. To make matters worse, since servers often report the file date and time, simple file modifications can invalidate the Last-Modified date.

Luckily, there is a better way to control the Last-Modified field: actually include the correct date and time within the content of the web page.

There is a special class of meta-tags that you can specify at the top of a web document that will override or augment the fields sent in the HTTP headers.
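As an illustration, the sketch below builds such a tag, assuming the tag meant here is the http-equiv form and that the date is written in the same format HTTP headers use. The function name is hypothetical; `formatdate` comes from Python's standard `email.utils` module.

```python
from email.utils import formatdate
from html import escape


def last_modified_meta_tag(timestamp):
    """Build a Last-Modified meta tag from a Unix timestamp (seconds since 1970).

    The date is written in HTTP date format, in GMT. The http-equiv form of
    the tag is an assumption about which meta tag is intended.
    """
    http_date = formatdate(timestamp, usegmt=True)
    return '<meta http-equiv="Last-Modified" content="%s">' % escape(http_date, quote=True)


if __name__ == "__main__":
    # 1051264800 is 25 April 2003, 10:00:00 GMT
    print(last_modified_meta_tag(1051264800))
```

The resulting line belongs inside the head section of the page, alongside the title.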

When your search engine spider finds this Last-Modified meta-tag in your documents, it will likely populate the field properly. In fact, most public search engine spiders will also respect this meta-tag, so the content on your public sites will be more useful to your visitors.
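To make the idea concrete, here is a minimal sketch of how a spider might prefer the date embedded in the page over the one in the HTTP header. It is not how any particular search engine works; the class and function names are hypothetical, and a real spider would also need to parse and validate the date it finds.

```python
from html.parser import HTMLParser


class MetaDateFinder(HTMLParser):
    """Collect the content of the first Last-Modified http-equiv meta tag."""

    def __init__(self):
        super().__init__()
        self.last_modified = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.last_modified is not None:
            return
        attributes = dict(attrs)  # attribute names arrive already lowercased
        if (attributes.get("http-equiv") or "").lower() == "last-modified":
            self.last_modified = attributes.get("content")


def effective_last_modified(html_text, headers):
    """Return the page's own date if it declares one, else the header value."""
    finder = MetaDateFinder()
    finder.feed(html_text)
    return finder.last_modified or headers.get("Last-Modified")


if __name__ == "__main__":
    page = (
        '<html><head><title>Policy</title>'
        '<meta http-equiv="Last-Modified" content="Tue, 01 Jun 1999 09:00:00 GMT">'
        '</head><body>...</body></html>'
    )
    headers = {"Last-Modified": "Fri, 25 Apr 2003 10:00:00 GMT"}
    print(effective_last_modified(page, headers))
```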

There are two caveats to keep in mind:

First, before you institute wholesale changes to your site, test the solution to verify that your particular search engine and spider will work as you expect.

Second, be aware that some browsers may report the web date and time sent in the HTTP header, regardless of any Last-Modified meta fields. When you view the Document Properties (from the File menu) you might see the HTTP header date, which may make it look as if the meta date is not being set properly; don't worry, the real test will be when you can sort your search results by date and see new content at the top of your list.

While modifying your HTML content can seem like a daunting task, there are a number of methods to automate the process using the file date and time, or even to scrub your content for dates. These tools can help you add the correct meta-tags to your content with minimum effort.
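As one possible approach, the sketch below stamps a static HTML file with a meta tag taken from the file's own date and time. The function names are hypothetical, the http-equiv form of the tag is an assumption, and the regular expressions are deliberately simple; try anything like this on a copy of your content first.

```python
import os
import re
from email.utils import formatdate

HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
EXISTING_TAG = re.compile(
    r"<meta\s[^>]*http-equiv\s*=\s*[\"']?last-modified", re.IGNORECASE
)


def add_last_modified_meta(html_text, timestamp):
    """Insert a Last-Modified meta tag right after the opening head tag.

    Pages that already carry the tag, or that have no head tag, are
    returned unchanged.
    """
    if EXISTING_TAG.search(html_text):
        return html_text
    match = HEAD_OPEN.search(html_text)
    if match is None:
        return html_text
    tag = '\n<meta http-equiv="Last-Modified" content="%s">' % formatdate(
        timestamp, usegmt=True
    )
    return html_text[: match.end()] + tag + html_text[match.end():]


def stamp_file(path, encoding="utf-8"):
    """Add the meta tag to one file, using the file's modification time."""
    timestamp = os.path.getmtime(path)  # read the date before touching the file
    with open(path, encoding=encoding) as handle:
        original = handle.read()
    updated = add_last_modified_meta(original, timestamp)
    if updated != original:
        with open(path, "w", encoding=encoding) as handle:
            handle.write(updated)
        # Rewriting the file changes its date, so put the old one back.
        os.utime(path, (timestamp, timestamp))
```

Reading the file date before writing, and restoring it afterwards, matters here: otherwise the act of adding the tag would itself make every document look new to the web server.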

What if you do not have control over the content and servers that publish your information? Your situation is a bit more "interesting", but not impossible. We'll explore some of those issues in upcoming issues of Enterprise Search.