Eliminating Repetitive Text from Search Results

created 2/12/08

Requires Webglimpse 2.18.2 or above

Many websites repeat the same text on every page: navigation bars, legal disclaimers, or company information. While helpful to the browsing user, this repeated text can dilute real search results, because every page becomes a hit for any query that shares words with it. Webglimpse has a fairly simple solution for eliminating such extraneous hits.

The key is to filter out the repeated text prior to indexing. By default Webglimpse prefilters HTML files to remove unnecessary tags. A small change in the filter script will allow you to remove arbitrary chunks of code and text prior to indexing.

Follow these steps:

Make a copy of the htuml2txt.pl script from the /lib subdirectory of your Webglimpse installation. You might put the copy in the directory of the archive for which it is customized:
cp -a /usr/local/wg2/lib/htuml2txt.pl /usr/local/wg2/archives/1/htuml2txt.pl

Edit your new script and add SkipSection calls to eliminate the unwanted sections:
&wgFilter::SkipSection('[regexp matching first line to skip]',
'[regexp matching last line to skip]',
\@lines);
Add your calls next to the existing SkipSection calls, at approximately line 31 of the script.
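
Before editing the filter script, it can help to preview what a start/end pair would remove. The sketch below uses sed's range delete as a rough analogue of SkipSection, not as its implementation, so the handling of the boundary lines may differ. The sample page and the `<!-- begin nav -->` / `<!-- end nav -->` markers are hypothetical.

```shell
#!/bin/sh
# Preview which lines a start/end regexp pair would drop.
# The markers and the sample page are hypothetical; substitute patterns
# that match the repeated block on your own site.

preview_skip() {
    # $1 = regexp matching the first line to skip
    # $2 = regexp matching the last line to skip
    # Reads HTML on stdin and prints what remains on stdout.
    # '#' is the sed address delimiter, so the regexps must not contain '#'.
    sed "\\#$1#,\\#$2#d"
}

sample='<html>
<!-- begin nav -->
<a href="/">Home</a> <a href="/legal">Legal</a>
<!-- end nav -->
<p>Actual page content.</p>
</html>'

result=$(printf '%s\n' "$sample" | preview_skip '<!-- begin nav -->' '<!-- end nav -->')
printf '%s\n' "$result"
```

If the preview drops too much or too little, adjust the two regexps here first, then carry them over to the SkipSection call.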

Modify .glimpse_filters in your archive directory so that it uses your new copy of htuml2txt.pl rather than the default one in the /lib subdirectory.
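
As a rough illustration of that change, the sketch below swaps the script path in a throwaway copy of a filter file. The sample line (a filename pattern followed by a filter command) is an assumption and is not copied from a real install; inspect your own .glimpse_filters before editing it.

```shell
#!/bin/sh
# Point a filter file at the customized copy of htuml2txt.pl.
# Works on a temporary sample so nothing real is modified; the layout
# of the sample line is an assumption.

old=/usr/local/wg2/lib/htuml2txt.pl
new=/usr/local/wg2/archives/1/htuml2txt.pl

workdir=$(mktemp -d)
filters="$workdir/.glimpse_filters"
printf '*.html\t%s <\n' "$old" > "$filters"

# Keep a backup, then swap the path. '|' is the sed delimiter because
# the paths contain '/'. The '.' characters in $old act as regexp
# wildcards, which is harmless for this literal path.
cp "$filters" "$filters.bak"
sed "s|$old|$new|" "$filters.bak" > "$filters"

updated=$(cat "$filters")
printf '%s\n' "$updated"
rm -rf "$workdir"
```

Against a real archive, the same cp and sed lines would be run on the .glimpse_filters file in the archive directory, keeping the .bak copy until reindexing succeeds.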

Reindex your archive. Check that the files produced in the .cache or .remote directory look reasonable; if unwanted text remains or wanted content is missing, tune the regexps that match the beginning and end of each section.
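
One way to run that check is to search the produced files for a phrase taken from the repeated text. The following is a minimal sketch: the helper name `find_leftovers`, the demo files, and the phrase are all hypothetical, and it assumes a grep that supports `-r` (GNU and BSD grep do).

```shell
#!/bin/sh
# List files that still contain a phrase from the repeated text.
# An empty result suggests the skip regexps caught every occurrence.

find_leftovers() {
    # $1 = directory holding the files to check
    # $2 = literal phrase that should no longer appear
    # -r recurse, -l print file names only, -F match the phrase literally
    grep -rlF -- "$2" "$1"
}

# Demonstration on a throwaway directory with hypothetical file names.
demo=$(mktemp -d)
printf 'Actual page content.\n' > "$demo/clean.txt"
printf 'Home Legal Contact\nActual page content.\n' > "$demo/leaky.txt"

leftovers=$(find_leftovers "$demo" 'Home Legal Contact')
printf '%s\n' "$leftovers"
rm -rf "$demo"
```

Against a real archive, the first argument would be the archive's .cache or .remote directory, and any file listed is a candidate for a tighter start or end regexp.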