Build Your Own Search Engine With ht://dig

Watching all that data fly across your network is nice, but knowing that your users can find what they're looking for in the midst of it all is even nicer. With ht://dig, you've got a do-it-yourself search solution that helps them sift through the haystack for that needle.

Finding things is the #1 problem in this here computerized age. It's easier than ever to squirrel away terabytes of data. But then what? How to sort through all that lot to ever find anything? And why should we have to? In the olden days, file clerks, librarians, and secretaries took care of data storage and retrieval. Then along came computers, and suddenly it was decreed that file clerks, librarians, and secretaries were no longer necessary; that mere mortals like managers and sysadmins and programmers and other ordinary, unassuming humble personages could simply order their computers to do the work.


Well, us folks who remained firmly planted in the real world knew it wasn't going to work, and it didn't. Coders from all walks of life gave us powerful file-searching tools like find, grep, and locate. Database gurus gave us indexes and intricate queries. But these were still better suited as power tools for clerks, librarians, and secretaries; not magic lanterns for the masses. Then the world became entangled in the World Wide Web, and friendly, useful search engines became more imperative than ever. The great masses were not attuned to Boolean searches, and I know you've seen the eyes glaze as soon as you helpfully explained "Boolean logic consists of three logical operators: OR, AND, NOT...."

Fast-forward to the present. You have heroically managed to bring organization and sanity to the great masses of data under your care. All that remains is to construct a user-friendly search engine for the company Web sites. Maybe a public site or two, perchance an internal site or three. Look no farther than the excellent, customizable, easy-to-administer ht://dig.

ht://dig is a suite of several programs:

htfuzzy

htload

htmerge

htnotify

htsearch

rundig

htdig

htdump

Each one has its own man page. htdig crawls and parses HTML pages. rundig is the wrapper script that runs the indexing programs to build the index database. Also included are the htsearch CGI program and the forms that make up the search interface for users. You probably won't touch the other programs directly; they are called internally.

Getting Started

Once ht://dig is installed, edit /etc/htdig/htdig.conf first. Most of the file is self-explanatory. Be sure to customize the start_url: directive. You may designate a single site, or a space-delimited list of sites. htdig is well-behaved, and will stay strictly within the bounds of the URLs you specify:

start_url: http://websiteone.com http://websitetwo.com

Make sure that database_dir points to the directory you want to store your ht://dig database in. That's enough to get started. Now fire it up and index your selected sites:

# rundig -vvv > htdig.log

Turning on maximum verbosity and storing the output in a file lets you check that ht://dig found and indexed everything you wanted it to. After its run is finished, which can take a few minutes, test it out by opening the search page in a Web browser:

http://websiteone.com/search.html

This should bring up the beautiful light blue ht://dig search page, with dropdown menus and Boolean searches and everything.

Figure x.
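Before trusting the search page, it can help to skim the log from the rundig run for each site listed in start_url:. The helper below is only a sketch: count_url_lines is a made-up name, and it assumes the verbose log prints each URL it visits somewhere on a line, which may not hold for every ht://dig version.

```shell
#!/bin/sh
# Count the log lines that mention a URL, using a fixed-string match.
# Assumption: the verbose log mentions every URL it fetches on some line.
count_url_lines() {
    log_file=$1
    url=$2
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status at 0.
    grep -c -F -- "$url" "$log_file" || true
}
```

Run it as count_url_lines htdig.log http://websiteone.com. A count of zero for a site you expected to index is a hint to recheck start_url:.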

Customizing The Search Page

You'll probably want to customize the search page to match the rest of your site. Track down the search.html page and edit it just like any HTML page. Do the same for header.html, long.html, short.html, syntax.html, footer.html, nomatch.html, and wrapper.html.

Fine-Tuning Indexes

You've probably noticed that Google ignores common words like "to," "a," "the," and so forth. You can do the same, to keep htdig fast and useful. In /htdig.conf add the bad_word_list directive. Then make a list of words that you don't want indexed in a text file, one word per line. There should be a sample bad_words file to look at. Then name the file:

bad_word_list: bad_words.txt
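If you'd rather generate the word list than type it by hand, here is a minimal sketch that writes one word per line. The function name make_bad_words is hypothetical, and lowercasing and de-duplicating the words is a tidiness choice made here, not something ht://dig is known to require.

```shell
#!/bin/sh
# Write the given words to a file, one per line,
# lowercased, sorted, and with duplicates removed.
make_bad_words() {
    out_file=$1
    shift
    printf '%s\n' "$@" | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u > "$out_file"
}
```

For example, make_bad_words bad_words.txt to a the and of creates a five-line file that the bad_word_list directive can point at.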

The exclude_urls: directive tells htdig to not index the specified URLs. It is a good idea to not index your CGI directory, temp files, robots.txt, .htaccess, or Apache binaries: anything that is not meant to be shared.

You can also exclude certain file extensions, and a number of these should already be excluded by default with the bad_extensions: directive. htdig cannot parse binary files, so image files, binary executables, compressed archives, and sound files should be excluded.
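To get a feel for which URLs a set of exclude_urls: patterns would catch, you can try them out in the shell first. This sketch assumes the patterns are matched as plain substrings of the URL, which is how the ht://dig documentation describes exclude_urls:, but check that against your version. url_is_excluded is a hypothetical helper, not part of ht://dig.

```shell
#!/bin/sh
# Return 0 (true) if the URL contains any of the given substrings.
# Assumption: exclude patterns are plain substrings, not regexes.
url_is_excluded() {
    url=$1
    shift
    for pattern in "$@"; do
        case $url in
            # Quoting the variable makes the shell match it literally.
            *"$pattern"*) return 0 ;;
        esac
    done
    return 1
}
```

With the patterns /cgi-bin/ and .htaccess, url_is_excluded http://websiteone.com/cgi-bin/form.cgi /cgi-bin/ .htaccess succeeds, while the same call with http://websiteone.com/docs/index.html fails.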

Converting Files To HTML

Some non-ASCII text files can be converted to HTML by htdig, with a little help. For example, uncomment these lines in /htdig.conf to enable converting .pdf files:

You'll need either XPDF or Adobe's Acrobat Reader installed. XPDF usually does a better job of translating .pdfs and .ps files to text. You can also convert MS Word docs, PowerPoint files, Excel files, and extract links from Shockwave Flash files. To do this, you need external file converters, a corresponding entry under external_parsers:, and perhaps a tweak to the doc2html.pl script that comes with ht://dig. The doc2html.pl script works as-is with these conversion utilities:

catdoc -- extract text from Word documents

rtf2html -- convert RTF documents to HTML

pdftotext -- extract text from Adobe PDFs. Comes with XPDF

ps2ascii -- extract text from PostScript

pptHtml -- convert PowerPoint files to HTML

xlHtml -- convert Excel spreadsheets to HTML

swfparse -- extract links from Shockwave Flash files

These are all standard Linux utilities that you can find in the usual haunts. You can add as many more as you like, provided you edit the doc2html.pl script to call the new utilities. See the doc2html/README for more information.
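Before editing any parser settings, it's worth checking which of the converters listed above are actually installed. A minimal sketch follows; missing_tools is a made-up helper name, and it only checks that each program is on your PATH, not that it works.

```shell
#!/bin/sh
# Print the name of each listed utility that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The converter names come from the list above.
missing_tools catdoc rtf2html pdftotext ps2ascii pptHtml xlHtml swfparse
```

Anything it prints still needs to be installed. No output means all seven were found.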

Automating rundig

You'll probably want to create a cron job for updating the database periodically. rundig can suck up a lot of system resources, so schedule it for slack times. /etc/crontab is quick and easy, like this:
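As a hedged illustration of the /etc/crontab format (minute, hour, day of month, month, day of week, user, command), this sketch prints a nightly entry for review. The 2:30 a.m. schedule, the root user, the use of nice, and the /usr/bin/rundig path are all assumptions; crontab_line is a hypothetical helper.

```shell
#!/bin/sh
# Build a daily /etc/crontab line from its fields:
# minute hour day-of-month month day-of-week user command
crontab_line() {
    minute=$1
    hour=$2
    user=$3
    shift 3
    printf '%s %s * * * %s %s\n' "$minute" "$hour" "$user" "$*"
}

# Assumed values: 02:30 nightly, as root, at reduced priority.
# Check the real rundig path with: command -v rundig
crontab_line 30 2 root nice /usr/bin/rundig
# prints: 30 2 * * * root nice /usr/bin/rundig
```

Look the printed line over, adjust the time to your own slack period, and then add it to /etc/crontab by hand.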