HTML filtering and description services for DL indexing of
HTML articles. The following two services are implemented.
[1] HtmlDescribe Service:
Describes HTML articles, currently based on the TITLE and Hn tags.
With this service, when you search in DigitalLibrarian,
titles are listed in the format:
title -- header1 -- header2 -- header3
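
As an illustration only, the description format above can be sketched in
Python. This is not the HtmlDescribe implementation; the names
describe_html and _HeadingCollector are hypothetical, and the sketch
assumes that "Hn" means H1 through H6 and that entries are listed in
document order.

```python
from html.parser import HTMLParser


class _HeadingCollector(HTMLParser):
    """Collects the text of TITLE and H1..H6 elements in document order."""

    # Assumption: "Hn" in the description covers H1 through H6.
    HEADING_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # finished title/heading strings
        self._current = None   # tag currently being collected, if any
        self._buffer = []      # text fragments of the current element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> and <h1> both match.
        if tag in self.HEADING_TAGS and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace and skip empty headings.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.parts.append(text)
            self._current = None


def describe_html(html_text):
    """Return 'title -- header1 -- header2 ...' for an HTML article."""
    collector = _HeadingCollector()
    collector.feed(html_text)
    collector.close()
    return " -- ".join(collector.parts)
```

Calling describe_html on an article whose TITLE is "My Article" and
whose headings are "Intro" and "Details" would give a single line
joining the three with " -- ".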
[2] HtmlFilter Service:
The purpose of this filter is to remove junk, such as
complete HTML anchors (<A ...>.*</A>) and simple HTML
tags (<...>), before the article text is handed over to
the indexing scanner. This should reduce the size
of .index.store somewhat (up to 20% compared with Version 0.91).
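
A rough Python sketch of this kind of filtering follows. It is not the
HtmlFilter implementation; filter_html and both patterns are
hypothetical, and the sketch assumes that a "complete anchor" means the
opening tag, the link text, and the closing tag together, and that
removed material is replaced by a space so neighbouring words do not
run together.

```python
import re

# Assumption: a complete anchor runs from <A ...> through the next </A>,
# including the link text in between.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)

# Assumption: a "simple tag" is anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def filter_html(html_text):
    """Strip anchors (with their text) and remaining tags before indexing."""
    without_anchors = ANCHOR_RE.sub(" ", html_text)
    return TAG_RE.sub(" ", without_anchors)
```

Regular expressions are enough for a sketch like this, but they do not
handle comments, scripts, or a '>' inside an attribute value; a real
filter would need to treat those cases separately.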