Filtering User Searches with Metadata Fields

Last Updated Mar 2009

By: Mark Bennett, Volume 2 - Number 5 - April 2005

Studies show that the average Internet user query is between one and two words long, which tends to produce a very large number of results. Some companies have done a better job educating users of their internal search applications, but even within corporations we see short queries as the norm rather than the exception. Although search engine ranking algorithms can help a bit in some situations, there is another, simpler and more predictable strategy that corporate search engine managers can use: combining the available metadata with the search terms entered by the user.

A variety of customized search forms can be deployed on various parts of the Intranet web site. Each customized form can, by default, limit its scope of search to content associated with its own part of the Enterprise’s web content.

For example, an HR web site within your company can restrict the scope of searches entered into its search form to content from the HR portion of the site. And an Intranet Helpdesk might, by default, limit searches to corporate IT documentation and FAQs. This type of filtering can be done with hidden search form fields, or behind the scenes with query cooking as we described in Intelligent Query Pre-Processing in August of 2003.
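As a rough illustration of the query cooking route, the sketch below picks a default department based on the page where the search form lives. The path prefixes, department codes, and function name are assumptions invented for this example; they are not part of any search engine's API.

```python
# Hypothetical mapping from intranet path prefixes to department codes.
DEFAULT_SCOPES = {
    "/hr/": "hr",
    "/helpdesk/": "it",
}


def default_department(form_path):
    """Return the department a search form should default to, or None."""
    for prefix, department in DEFAULT_SCOPES.items():
        if form_path.startswith(prefix):
            return department
    return None  # no default scope: search the whole index
```

A form served from /hr/benefits/index.html would default to the "hr" department, while a form elsewhere on the site would search everything. The same value could just as easily be carried in a hidden form field.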

We will show examples of this for two popular search engines, Verity Ultraseek and Verity K2.

For this example, let’s assume that every document in the enterprise search engine index has a field called “department”. The exact method to define and populate a field like this is search-engine and data specific, and beyond the scope of this particular article.

Let’s further assume that a corporate end user is searching from the search form on the HR department site, and is looking for information about vacations. We will combine their search term, vacation, with the filter of, effectively, department = “hr”.

Note: Whether to use the <contains> or the <in> operator depends on whether “department” is a K2 Field or a Zone. The difference is beyond the scope of this article, but suffice it to say that you can generally search zones much faster than you can search fields.

In both cases, the scope of the search will now be limited to the HR Department’s content.

One note about the Ultraseek syntax. Programmers might find it odd that the vertical bar “|” acts as a virtual AND operator in Ultraseek, while in many computer languages the vertical bar represents an OR operator. In the case of Ultraseek, think of the "|" like a Unix pipe, where the output of one process can be “piped” through another process. Only the data that “passes through” the first process gets to move on to the second process. The Unix command:

grep pizza file.txt | grep beer

will only output lines from file.txt that have both the terms 'pizza' and 'beer'. This may have been the inspiration for the unusual syntax in Ultraseek.
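The same idea can be sketched in a few lines of Python. This is only an analogy for how chained filters behave like an AND; the grep function and the sample lines are made up for illustration and say nothing about how Ultraseek is implemented.

```python
def grep(term, lines):
    """Keep only the lines that contain term, like a very simple grep."""
    return [line for line in lines if term in line]


lines = [
    "pizza and beer on friday",
    "pizza only",
    "beer only",
]

# Equivalent in spirit to: grep pizza file.txt | grep beer
both = grep("beer", grep("pizza", lines))
print(both)  # ['pizza and beer on friday']
```

Each stage can only narrow what the previous stage let through, so the final output contains just the lines that satisfy both conditions.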

There is one additional change you can make to improve the relevance of your results in both Ultraseek and K2.

If you were to actually index data and use these test queries, you’d notice that the document summaries in the results list have both the words 'vacation' and 'HR' highlighted, even though the user only entered 'vacation'. This may confuse your end users, and should be avoided. But there is a more subtle problem as well: Ultraseek and K2 will use both words to calculate the document relevance. So a document which contains many instances of the term 'HR', which is quite likely given the source of the data, could easily be ranked higher than a document containing only a few occurrences of the word 'vacation', and that’s certainly not what we want.

Both Ultraseek and K2 have the concept of a “query filter”. A query filter looks like a regular full text query, but does not affect search results ranking, and its terms are not highlighted in the results list.
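To make the distinction concrete, here is a toy in-memory sketch. The documents, the term-counting score, and the function names are all invented for illustration; real engines use far more sophisticated ranking, but the effect of keeping the filter out of the scoring is the same in spirit.

```python
# Toy "index"; documents and scoring are assumptions made for this example.
DOCS = [
    {"id": "hr-policy", "department": "hr",
     "text": "hr hr hr hr handbook vacation"},
    {"id": "vacation-faq", "department": "hr",
     "text": "vacation vacation vacation requests"},
    {"id": "it-faq", "department": "it",
     "text": "vacation autoreply setup vacation"},
]


def score(text, terms):
    """Naive relevance: count how often the given terms occur."""
    words = text.split()
    return sum(words.count(term) for term in terms)


def search(ranking_terms, department):
    """Restrict to one department, then rank using only ranking_terms."""
    hits = [d for d in DOCS if d["department"] == department]
    hits = [d for d in hits if score(d["text"], ranking_terms) > 0]
    hits.sort(key=lambda d: score(d["text"], ranking_terms), reverse=True)
    return [d["id"] for d in hits]


# Filter term also counted for ranking: the HR-heavy page comes first.
print(search(["vacation", "hr"], "hr"))  # ['hr-policy', 'vacation-faq']

# Filter kept out of ranking: the vacation page comes first.
print(search(["vacation"], "hr"))  # ['vacation-faq', 'hr-policy']
```

In both calls the IT document is excluded by the department restriction; only the second call ranks purely on what the user typed.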

In Ultraseek, use the query:

department:hr || vacation

In K2, we'll use both the QueryText and SourceQueryText fields. Set QueryText to:

vacation

And set SourceQueryText to either:

department<contains>hr

- OR -

hr<in>department

Again, whether you use <contains> or <in> depends on whether you have defined department as a field or as a zone.

These queries will produce nearly identical results to the earlier examples, but 'HR' will not be highlighted, and documents about vacations should be ranked higher in the results list.

Notice the very subtle change in the Ultraseek query: we’ve swapped the single vertical bar “|” for a double vertical bar “||”. This simply excludes the left portion of the query from being highlighted or considered in the ranking of matches.
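If your search form is backed by a script, the filtered queries above could be assembled along these lines. This is a minimal sketch: the function names and the dictionary layout are hypothetical, no escaping or validation of user input is attempted, and how the values are actually handed to Ultraseek or K2 depends on your integration and is not shown.

```python
def ultraseek_filtered_query(user_terms, department):
    """Build the Ultraseek form used in this article: filter || user terms."""
    return f"department:{department} || {user_terms}"


def k2_filtered_query(user_terms, department, is_zone=False):
    """Return the two K2 values from this article as a plain dict.

    is_zone selects the <in> form (department defined as a zone) instead
    of the <contains> form (department defined as a field).
    """
    if is_zone:
        source = f"{department}<in>department"
    else:
        source = f"department<contains>{department}"
    return {"QueryText": user_terms, "SourceQueryText": source}


print(ultraseek_filtered_query("vacation", "hr"))
print(k2_filtered_query("vacation", "hr"))
print(k2_filtered_query("vacation", "hr", is_zone=True))
```

Keeping the user's terms and the department filter as separate arguments makes it harder to accidentally fold the filter back into the text that gets ranked and highlighted.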

If you find this article helpful, or would like more information, please drop us a line!