Publisher's Note: we are very pleased to feature this valuable research on free and open source search engines on SearchTools.com. It was originally written around 2004 and revised in April 2006.

A Comparison of Free Search Engine Software [1]

Yiling Chen (yilingchen7 [at] yahoo [dot] com)

Abstract: This paper reviews nine search engine software packages (Alkaline, Fluid Dynamic, ht://Dig, Juggernautsearch, mnoGoSearch, Perlfect, SWISH-E, Webinator, and Webglimpse) which are free to users. Their features and functionalities are compared and contrasted with emphasis on searching mechanisms, crawler and indexer features, and searching features.

1. Motivation

The Internet and computer technology have immeasurably increased the availability of information. However, as the size of information systems increases, it becomes harder for users to retrieve relevant information. Search engines have been developed to facilitate fast information retrieval. There are many software packages for search engine construction on the Internet. The website searchtools.com alone lists more than 170 search tools, many of which are free or free for noncommercial use. With so many software packages, selecting suitable search engine software is as hard as, if not harder than, retrieving relevant information efficiently from websites. Motivated by a desire to aid website administrators in choosing a suitable search engine, this paper reviews basic information, features, and functionalities of nine free search engine software packages: Alkaline, Fluid Dynamic, ht://Dig, Juggernautsearch 1.0.1, mnoGoSearch, Perlfect, SWISH-E, Webinator, and Webglimpse 2.x.

The remainder of the paper starts with an introduction to free search engine software. Then, we summarize basic information, such as source code availability and platform compatibility, for the nine software packages. After that, their features and functionalities are compared and contrasted. Finally, we summarize our conclusions.

2. Introduction to Free Search Engine Software

Free search engine software can be found at websites such as searchtools.com, sourceforge.net, searchenginewatch.com, and codebeach.com. Some of the packages are freeware with only binary files distributed, while others are open source software. In general, however, free search engine software is not well documented and has undergone few formal tests, which makes it difficult to understand the functionalities it provides.

Depending on who provides the actual search service, free search tools can be categorized into remote site search services and server-side search engines. In the former, the indexer and query engine run on a remote server that stores the index file. At search time, a form on a user's local Web page sends a message to the remote search engine, which then sends the query results back to the user. A server-side search engine is what we usually think of as a search engine. It runs on the user's server and consumes that server's CPU time and disk space. In this paper, the term search engine refers only to server-side search engines.

According to what is indexed, search engines are classified as file system search engines and website search engines. File system search engines index only files in the server's local file system. Website search engines can index remote servers by feeding URLs to web crawlers. Most search engines combine the two functions, and can index both local file systems and remote servers. The nine search engine software packages compared here are all website search engines, some of which can index local file systems.

A fully functional website search engine software package should have the following four blocks:

A Web Crawler that retrieves documents by following hyperlinks from a set of starting URLs;

An Indexer that indexes the crawled documents using some indexing rules and saves the indexed results for searching;

A Query Engine that performs the actual search and returns ranked results;

An Interface that allows users to interact with the query engine.

The nine software packages we compare either have all four blocks or allow adding the missing blocks.

3. Basic Information of the Nine Search Engine Software Packages

This section provides some basic information about each search engine software package. The information includes licensing, where to find it, source code availability, documentation availability, implementation language, platform compatibility, completeness of the package, and who built it.

Licensing refers to whether the software is freeware or is free under certain conditions. Source code availability provides the website address for downloading the source code, if it is available. Documentation availability indicates where to find the documentation files. Implementation language tells which programming language was used to implement the software. Platform compatibility specifies which operating systems the software can run on. If the software package is fully functional, i.e., it has a web crawler, an indexer, a query engine, and a query interface, we consider the package to be complete. Who built it identifies the developers of the software.

A website administrator looking for a suitable software package can review this information first to decide whether a package is a potential candidate. For example, if the search engine software cannot be installed on the platform on which the web server is running, there is no need for the administrator to look into the specific features of the software. We summarize the basic information of the nine software packages in Table 1.

The original version, SWISH, was built by Kevin Hughes. In fall 1996, the UC Berkeley Library received permission from Kevin Hughes to implement bug fixes and enhancements to the original binary, hence SWISH-E.

4.1 Comparison Criteria

We compare and contrast the nine software packages from the following four perspectives.

4.1.1 Searching Mechanism

We consider the indexing method and the ranking method to be the searching mechanism of a search engine, since these two methods usually determine how much disk space the search engine requires, how fast the indexing process is, and how fast and accurate the search process is.

Indexing method

Most search engines operate on the principle that pre-indexed data is easier and faster to search than raw data. The form and quality of the index created from the original documents is of paramount importance to how searches are performed. The most commonly used indexing method is the full-text inverted index. It takes a large amount of disk space, and the indexing process is slow, because it keeps most of the information in a document. Another method is to index only the title, keywords, description, and author parts of a document. In this way, the indexing process can be very fast and the resulting index is relatively small. Some search engines have their own novel indexing methods: WebGlimpse uses two-level indexing, which we introduce later, and Alkaline applies its Cellular Expansion Algorithm, which is still kept as a technical secret.
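The full-text inverted index described above can be sketched in a few lines of Python. This is a minimal illustration of the data structure, not code from any of the packages reviewed; a real indexer would also record word positions and apply stemming and stop-word removal.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document IDs containing it.

    `docs` is a dict of {doc_id: text}. Only membership is recorded
    here; a full-text index would also store occurrence positions.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "free search engine software",
    2: "search engine comparison",
}
index = build_inverted_index(docs)
print(sorted(index["search"]))  # document IDs containing "search"
```

Looking up a query word is then a dictionary access rather than a scan over every document, which is why searching pre-indexed data is fast at the cost of index disk space.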

Relevance Ranking

The ranking method decides a document's relevance to a query. Factors such as word frequency in the document, word position in the text, and link popularity are usually considered. Different search engines take different factors into consideration.

4.1.2 Crawler and Indexer Features

We compare the following functionalities of built-in web crawlers and indexers.

Robot Exclusion Standard Support - Does the crawler respect the robot exclusion standard, i.e., does it avoid indexing documents disallowed in the robots.txt file?

Crawler Retrieval Depth Control - Can the administrator control the maximum depth that a crawler follows in a retrieval process?

Duplicate Detection - During crawling and indexing, can duplicate documents be detected and skipped rather than indexed?

File Formats to be Indexed - Which file formats can the crawler and indexer handle?

4.1.3 Searching Features

We also compare the searching features each package offers its users, such as Boolean search, phrase matching, attribute search, fuzzy search, word forms (stemming), wild cards, regular expressions, numeric data search, and natural language queries.

4.1.4 Other Features

International Language - Can the search engine support languages other than English?

Page Limit - How many pages can be indexed for the free version of the software? What is the theoretical or empirical limit?

Customizable Result Formatting - Can the result pages be customized to have a desired look and feel?

4.2 Features of the Nine Search Engine Software Packages

We compare and contrast the features of the nine search engine software packages according to the above comparison criteria. The main results are summarized in Table 2. Individual analyses are provided in the subsections that follow.

Table 2: Features and Functionalities of the Nine Search Engine Software Packages

4.2.1 Alkaline

Alkaline is a powerful search server. It supports most of the features we discussed here [2] [3].

Searching Mechanism

Alkaline uses the concept of "cellular expansion" to index and search documents. The cellular expansion algorithm is a technique for hashing and quickly finding short binary blobs. It is claimed that the algorithm makes searching for incomplete word forms across 500,000 documents blazing fast, but I have not been able to find any published description of this algorithm.

Alkaline uses an adaptive mechanism that is said to closely match the results to the elements searched: the more extensive the search query, the better the relevance the user gets. Word-weight ranking assigns different weights to words in the title, meta keywords, description, and text body. Alkaline's Weight option modifies the ranking weights. Another option Alkaline provides for changing the ranking is WeakWords; words in the WeakWords list are assigned lower weight.

Crawler and Indexer Features

Alkaline supports robot directives; AlkalineBOT is its registered robot name. Alkaline is compliant with /robots.txt directives. It will not follow a link if a <meta name="robots" content="nofollow"> tag is found, and it will not index document contents if a <meta name="robots" content="noindex"> tag is found. Specifying Robots=N in the configuration file disables Alkaline's robots support.

Alkaline allows administrators to define the maximum depth of URLs to follow. The MD5 digest mechanism [4] within Alkaline can identify and ignore symbolic links and duplicated documents, such as http://www.abc.com and http://www.abc.com/index.html.
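The content-digest idea behind Alkaline's duplicate detection can be sketched as follows. This is not Alkaline's actual implementation, only a minimal Python illustration of MD5 content hashing: identical bytes served under two different URLs produce the same digest, so the second copy is recognized and skipped.

```python
import hashlib

def is_duplicate(content, seen_digests):
    """Return True if this exact content was seen before.

    Two URLs such as http://www.abc.com and
    http://www.abc.com/index.html that serve identical bytes hash to
    the same digest, so only the first is indexed.
    """
    digest = hashlib.md5(content).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False

seen = set()
page = b"<html>same bytes</html>"
print(is_duplicate(page, seen))  # False: first time this content is seen
print(is_duplicate(page, seen))  # True: duplicate, would not be indexed
```

Hashing the full content keeps the memory per page constant (one digest) regardless of document size, which is why digest-based duplicate detection scales to large crawls.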

Alkaline can index html, htm, text, and shtml files. To index PDF, embedded Shockwave Flash objects, doc, rtf, LaTeX/TeX, WordPerfect, XML, and MPEG Layer 3 files, Alkaline needs external document filters. A retrieved document of one of these kinds can be passed to an external filter, processed by that filter, and then indexed based on the HTML output.

Alkaline supports retrieval of secured pages on password-protected sites (HTTP/1.0 BASIC authentication and NTLM support for Windows NT versions; no support for SSL).

- Wild card: Alkaline can use * to return a list of all indexed documents.

- Numeric Data Search: Alkaline indexes words such as quantity=15 in a special manner. Thus it can support searches such as quantity < 15, quantity = 15, or quantity > 15.

- Case Sensitivity: Alkaline chooses a case-sensitive search when at least one upper-case letter is present in a word.

Other Features

Alkaline does not support languages other than English. There is a theoretical limit of two billion documents that Alkaline can index, but the recommended usage is to index around 50,000-500,000 pages and 250,000 word forms. The layout of search results is fully customizable.

For detailed features of Alkaline, please refer to Appendix 1, which is the feature summary from the documentation of Alkaline [2].

4.2.2 Fluid Dynamic

Searching Mechanism

The Fluid Dynamic search engine uses attribute indexing [5]. A document's text, keywords, description, title, and address are all extracted and used for searching. Essentially, this is full-text indexing, but the "Max Characters: File" option lets one set the maximum number of bytes read from any document. Keeping it at a low value saves indexing time at the expense of search accuracy.

The ranking of documents is decided by the frequency of query words in the documents. Query words found in the title, keywords, or description parts of a document are given additional weight, which can be modified by changing the values of the "Multiplier: Title", "Multiplier: Keywords", and "Multiplier: Description" settings. Every time a search term is found in the web page text, one point is added to the web page's relevance. Every time a search term is found in the title, the value of the "Multiplier: Title" setting is added to the relevance; similar additions are made for the META keywords and description. Results can also be ranked by last-modified time, time the web page was last indexed, and their inverses.
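The scoring rule above can be sketched in Python. The multiplier values below are hypothetical placeholders, not Fluid Dynamic's defaults, and the function only illustrates the described arithmetic for a single query term.

```python
def relevance(term, page, multipliers):
    """Score one query term against one page, per the rule above:
    each occurrence in the body text adds 1 point; each occurrence in
    the title, keywords, or description adds that field's multiplier.
    """
    score = page["text"].lower().split().count(term)
    for field in ("title", "keywords", "description"):
        hits = page[field].lower().split().count(term)
        score += hits * multipliers[field]
    return score

page = {
    "title": "Search Engine Review",
    "keywords": "search software",
    "description": "comparing free search tools",
    "text": "this page reviews search engine software for site search",
}
# Hypothetical values for the Multiplier: Title/Keywords/Description settings.
mult = {"title": 10, "keywords": 5, "description": 5}
print(relevance("search", page, mult))  # 2 body hits + 10 + 5 + 5 = 22
```

Raising a multiplier shifts the ranking toward pages whose metadata, rather than body text, mentions the query word.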

Crawler and Indexer Features

Fluid Dynamic supports the Robot Exclusion Standard, i.e., it respects both the robots.txt file and the Robots META tags. The crawler can stop after each level of crawling to wait for manual approval, so an administrator is able to control the depth of crawling. It can detect duplicated pages and will not index them.

Fluid Dynamic can index html, htm, shtml, shtm, stm, and mp3 files. To index PDF files, it needs the xpdf helper utility from www.foolabs.com/xpdf. It cannot index servers protected by passwords.

- Boolean search: To express the fact that a page must contain a word, a '+' sign or "and" is placed in front of the word. To search for all pages not containing a word, a '-' sign or "not" is used. "or" or '|' means that the search term is preferred; additional preferred terms increase the ranking.

- Phrase Matching: Enclosing words in quotation marks causes them to be evaluated as a phrase.

- Attribute Search: Fluid Dynamic is able to limit search scope to URLs, titles, texts, or links by using url:value (host:value or domain:value), title:value, text:value, or link:value.

- Wild card: Fluid Dynamic uses * to represent one or more characters or symbols.
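The '+'/'-' query syntax above can be illustrated with a small filter. This is a sketch of the described semantics, not Fluid Dynamic's code; bare (preferred) terms are ignored here because they affect only ranking, not matching.

```python
def matches(query, text):
    """Evaluate a Fluid-Dynamic-style Boolean query against one page.

    '+word' means the word must be present, '-word' means it must be
    absent; unprefixed words are preferred terms that only influence
    ranking, so they do not affect whether the page matches.
    """
    words = set(text.lower().split())
    for term in query.lower().split():
        if term.startswith("+") and term[1:] not in words:
            return False
        if term.startswith("-") and term[1:] in words:
            return False
    return True

text = "free search engine software comparison"
print(matches("+search -commercial", text))  # True: has "search", lacks "commercial"
print(matches("+search -free", text))        # False: contains the excluded "free"
```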

Other Features

Fluid Dynamic is designed to search languages that use the Latin character set, including English, German, and Dutch. All Latin extended characters are reduced to their English equivalents. The query interface and result display are template-based and thus easy to customize; it is also easy to translate the user interface into non-English languages. There is no theoretical page limit for Fluid Dynamic, but disk space and CPU load impose a practical limit of about 100,000 documents.

For detailed features of Fluid Dynamic, please refer to Appendix 2, which is the feature summary from the documentation of Fluid Dynamic [5].

4.2.3 ht://Dig

Searching Mechanism

ht://Dig uses the most standard indexing method: the full-text inverted index. The relevance ranking method is word weight; word weights are said to be generally determined by the importance of the word in a document.

Crawler and Indexer Features

The crawler of ht://Dig supports the Robot Exclusion Standard. The depth of crawling can be limited by setting the maxhops option when running the crawling program, htdig. ht://Dig uses the signature of the document to detect duplicated pages, but it was reported that ht://Dig did not remove duplicates [7].

ht://Dig can index html and txt files by default. PDF, MS Word, PowerPoint, PostScript, and Excel files can be indexed with the aid of external parsers or converters; the path name of the external parser or converter must be specified in the configuration file.

ht://Dig can index protected servers. It can be configured to use a specific username and password when it retrieves documents on a password protected server.

- Boolean Search: AND is used to search for more than one keyword; OR is used to search for any of the keywords.

- Attribute Search: ht://Dig can be set to perform searches that only return documents whose URLs match a certain pattern. This differs from the concept of Attribute Search; we list it here because it is similar to searching only within a URL scope.

- Wild Card: Wild-card usage is not described in any documentation of ht://Dig. But the search engine at the Kennedy Space Center website [8], which is built using ht://Dig, supports powerful wild cards. More specifically,

- Entering 'f[a-z]ster' specifies the range of allowed characters that can fill in for that position.

Other Features

Both SGML entities, such as '&agrave;', and ISO-Latin-1 characters can be indexed and searched by ht://Dig. To support a specific language, we need to configure ht://Dig to use dictionary and affix files for the language of our choice by setting the locale attribute. There is no theoretical page limit; ht://Dig can usually index more than 100,000 pages. The output of a search can be easily customized using HTML templates.

For detailed features of ht://Dig, please refer to Appendix 3 which is the feature summary from the documentation of ht://Dig [9].

4.2.4 Juggernautsearch 1.0.1

The documentation of Juggernautsearch does not contain enough information to draw conclusions about whether it supports some of the features we discuss here. But Juggernautsearch uses a special indexing method that makes the indexing and searching process very fast.

Searching Mechanism

Juggernautsearch extracts the top keywords from a document and indexes only these keywords. The keywords are assigned word weights according to their frequency of appearance in the document, and the index file stores them in order of decreasing weight. At search time, only the keywords stored in the index file are examined, and the weights of the words in a document are used to calculate the relevance ranking. Since only the keywords are indexed and searched, indexing and searching are very fast and the index files take little disk space.
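The top-keyword approach can be sketched in Python. This is only an illustration of the idea, not Juggernautsearch's implementation; the cutoff k is an arbitrary choice here.

```python
from collections import Counter

def top_keywords(text, k=5):
    """Extract the k most frequent words of a document, in decreasing
    order of frequency, mirroring an index that keeps only the top
    keywords and their weights rather than the full text.
    """
    counts = Counter(text.lower().split())
    return counts.most_common(k)

text = ("search engine search index search "
        "engine crawler index ranking")
print(top_keywords(text, k=3))
```

Because only these (word, frequency) pairs are stored, the index stays tiny and searching is a lookup over a short list, but any word outside the top k is invisible to the engine, which is exactly why exclusion (Boolean NOT) queries cannot be guaranteed correct.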

Crawler and Indexer Features

Juggernautsearch supports the Robot Exclusion Standard. Its crawler is called Pagerunner. It does not provide control over the depth of crawling. Juggernautsearch can detect duplicated pages: it pre-scans retrieved URLs to remove unwanted URLs and URLs that have already been visited, and ensures that once indexed, a URL will not be crawled again in later crawl iterations. It cannot index protected sites.

Juggernautsearch supports Attribute Search: it can restrict a search to be performed only in URLs. Juggernautsearch does not support Boolean Search, which is related to its indexing method. A Boolean search that returns pages omitting a keyword can work only when the full document is available to search. Since Juggernautsearch extracts only the top few keywords, a search that excludes a word cannot guarantee that the word is absent from the document. The lack of Boolean search is the price of fast indexing and searching. In addition, Juggernautsearch does not support Phrase Matching, Numeric Data Search, or Natural Language Query.

Other Features

Juggernautsearch does not support languages other than English. It does not have a page limit, because the index file is very small.

Juggernautsearch has opened a challenge toward ht://Dig because of the criticism from some of the developers of ht://Dig. An interesting comparison between Juggernautsearch and ht://Dig can be found in [10]. The comparison table is attached as Appendix 4.

4.2.5 mnoGoSearch

Searching Mechanism

mnoGoSearch uses a full-text inverted index. Words in different parts of the document are assigned different weights. To determine the relevance of a document, mnoGoSearch considers several factors: the number of complete phrases (taking word weights into account), the number of query words found in the document, and the number of incomplete phrases (again taking word weights into account).

Crawler and Indexer Features

mnoGoSearch supports the Robot Exclusion Standard. The crawling depth of the crawler can be limited. By default, it can index html and txt files; with the aid of external parsers, pdf, ps, and doc files can be indexed. On servers supporting HTTP 1.1, mnoGoSearch can index mp3 files. It can also index SQL database text fields, and it has the ability to index password-protected servers.

- Phrase Matching: Words enclosed in double quotation marks are treated as a phrase in searching.

- Attribute Search: mnoGoSearch can limit search within documents with given tags, or with given URL substrings.

- Fuzzy Search: Supports synonyms and substring search.

- Word Forms: Supports word stemming.

- Wild Card: '%' can be used as the wild card to define URL limit, but it can not be used in ordinary search words.

Other Features

mnoGoSearch supports almost all known 8-bit character sets as well as some multi-byte charsets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, and utf8. The euc-kr, big5, gb2312, and shift-jis character sets are not supported by default, because their conversion tables are rather large, which increases the size of the executable files [11]. mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, and MacCyrillic. In terms of supported languages rather than character sets, mnoGoSearch can support around 700 languages, including most of the frequently used languages in the world.

mnoGoSearch can index several million documents. It provides PHP3, Perl, and C CGI access to the search engine, offering significant flexibility and options in arranging search results.

For detailed features of mnoGoSearch, please refer to Appendix 5 which is the feature summary from the documentation of mnoGoSearch [11].

4.2.6 Perlfect

Searching Mechanism

Perlfect implements the most standard indexing and ranking algorithms. It uses an inverted index. To calculate word weights, it applies Gerard Salton's algorithm [12]: the weight W of a term T in a document D is

W(T, D) = tf(T, D) * log(DN / df(T)),

where tf(T, D) is the frequency of term T in document D, DN is the total number of documents, and df(T) is the number of documents in which T appears, called the document frequency of T.
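Salton's weight can be computed directly from this formula. The sketch below uses the standard interpretation of df(T) as the number of documents containing T; it is an illustration of the formula, not Perlfect's code.

```python
import math

def tfidf(term, doc, docs):
    """Salton term weight: W(T, D) = tf(T, D) * log(DN / df(T)),
    where tf is the term's frequency in the document, DN the total
    number of documents, and df the number of documents containing
    the term.
    """
    tf = doc.lower().split().count(term)
    df = sum(1 for d in docs if term in d.lower().split())
    if df == 0:
        return 0.0  # term appears nowhere; carries no weight
    return tf * math.log(len(docs) / df)

docs = [
    "search engine software",
    "free software comparison",
    "search features compared",
]
# "search" occurs once in docs[0] and appears in 2 of the 3 documents,
# so its weight there is 1 * log(3/2).
print(tfidf("search", docs[0], docs))
```

The log(DN / df) factor downweights terms that appear in many documents, so a word occurring everywhere (df = DN) gets weight zero regardless of how often it occurs in a single page.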

Crawler and Indexer Features

Perlfect is the only search engine among the nine that does not support the Robot Exclusion Standard; thus, it is mainly designed for adding a search function to a single website. The depth of crawling cannot be controlled, and it cannot index protected servers.

Searching Features

Perlfect only supports the Boolean Search feature. A '+' sign is used to include a word, while a '-' sign is used to exclude a word.

Other Features

The result page of Perlfect can be shown in many different languages, such as German, French, and Italian. The user interface is fully customizable using the provided templates. Perlfect is a lightweight search engine; it can index only on the order of 1,000 documents.

For detailed features of Perlfect, please refer to Appendix 6, which is the feature summary in the documentation of Perlfect [13].

4.2.7 SWISH-E

Searching Mechanism

I have not been able to determine from publicly available documents which indexing and ranking methods SWISH-E uses.

Crawler and Indexer Features

The crawler supports the Robot Exclusion Standard. Its maximum depth of crawling can be controlled. SWISH-E cannot index protected servers.

SWISH-E can index html, xml, and txt files. With filters that convert other file types, such as MS Word documents, PDF, or gzipped files, into one of the file types that SWISH-E understands, it can index those as well. Files with extensions gif, xbm, au, mov, and mpg can be indexed, but their content cannot be.

- Boolean Search: and, or, and not are three logical operators of SWISH-E. The operators are case sensitive.

- Phrase Matching: Words in double quotation marks are treated as a phrase in searching.

- Attribute Search: SWISH-E allows users to specify certain META tags that can be used as document properties. Search can be limited to documents with specified properties.

- Fuzzy Search: SWISH-E supports soundex search.

- Word Forms: SWISH-E supports word stemming.

- Wild Card: * is used to replace single or multiple characters.

Other Features

SWISH-E supports all languages that use single-byte characters.

For detailed features of SWISH-E, please refer to Appendix 7 which is the feature summary from the documentation of SWISH-E [14].

4.2.8 Webinator

Searching Mechanism

Webinator uses an inverted index. The ranking algorithm takes into consideration relative word ordering, word proximity, database frequency, document frequency, and position in text. The relative importance of these factors in computing the quality of a hit can be altered under the Ranking Factors option.

Crawler and Indexer Features

The crawler supports the Robot Exclusion Standard. Its maximum depth of crawling can be controlled. Webinator cannot index protected servers.

Webinator can detect duplicates by hashing the textual content of the page and not storing any page whose hash code is already in the database. Files with extensions html, htm, txt, pdf, doc, swf, asp, jsp, shtml, jhtml, or phtml can be indexed by Webinator.

- Fuzzy Search: It lets you find "looks roughly like" or "sounds like" information. To invoke a fuzzy match, precede the word or pattern with the '%' character.

- Word Forms: Word stemming is supported.

- Wild Card: * can be used to match just the prefix of a word or to ignore the middle of something.

- Regular Expression: Users can find items that cannot be located with a simple wild-card search using the REX regular expression pattern matcher. To invoke it within a query, precede the expression with a '/'. For example, /19[789][0-9] finds years between 1970 and 1999.

- Numeric Data Search: It allows you to find quantities in textual information in any way they may be represented. To invoke a numeric value search within a query, precede the value with a '#'. For example, the query #>5000 may return the match "2.2 million".

- Natural Language Query: A query can be in the form of a sentence or question.
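As a concrete illustration of the regular-expression example above, the same pattern can be tried with Python's re module; Webinator itself uses its own REX matcher, so this only demonstrates the pattern, not Webinator's engine.

```python
import re

# The pattern from the example above: years 1970 through 1999.
# [789] allows the decade digit 7, 8, or 9; [0-9] allows any final digit.
pattern = re.compile(r"19[789][0-9]")

text = "Released in 1985, revised in 2004, archived since 1979."
print(pattern.findall(text))  # only the years inside the 1970-1999 range
```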

Other Features

Webinator does not support languages other than English. The free version of Webinator can only index about 10,000 pages. It has a customizable user interface.

For detailed features of Webinator, please refer to Appendix 8 which is the feature summary from the documentation of Webinator [15].

4.2.9 WebGlimpse 2.x

Searching Mechanism

WebGlimpse uses Glimpse as its indexer and query engine. Glimpse implements a two-level query method, which leads to small index files and fast index construction, and supports arbitrarily approximate matching. The two-level query method is a hybrid of the inverted index and sequential search with no indexing [16] [17].

The first step of the indexing process is to divide the whole collection into small pieces called blocks. The number of blocks cannot exceed 256, so that the address of a block can be stored in one byte. The whole collection is scanned word by word, and an index similar to a regular inverted index is created, with one notable exception. In an inverted index, every occurrence of every word is indexed with a pointer to the exact location of the occurrence. In Glimpse's index, every word is indexed, but not every occurrence: each entry in the index contains a word and the block numbers in which that word occurs. Since each block can be identified with one byte, and many occurrences of the same word are combined into one entry, the index is typically quite small.

The searching process consists of two phases. First, Glimpse searches the index for a list of all blocks that may contain a match to the query. Then, each such block is searched separately, using agrep to perform flexible sequential search. Because of the sequential search, arbitrarily approximate searches such as fuzzy search, word forms, regular expressions, and wild cards are easily supported.
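The two-level scheme can be sketched in Python. This is a toy illustration, not Glimpse's implementation: documents are assigned to blocks round-robin, the index maps each word only to block numbers (not occurrences), and a simple sequential scan stands in for agrep in the second phase.

```python
from collections import defaultdict

def build_block_index(docs, num_blocks=4):
    """Two-level index in the spirit of Glimpse: group documents into
    a small number of blocks, then record for each word only the
    blocks it occurs in, not every occurrence.
    """
    blocks = defaultdict(list)   # block number -> documents in that block
    index = defaultdict(set)     # word -> set of block numbers
    for i, text in enumerate(docs):
        block = i % num_blocks   # crude round-robin block assignment
        blocks[block].append(text)
        for word in text.lower().split():
            index[word].add(block)
    return blocks, index

def search(word, blocks, index):
    """Phase 1: look up the candidate blocks in the small index.
    Phase 2: scan only those blocks sequentially for real matches.
    """
    hits = []
    for block in index.get(word.lower(), ()):
        for text in blocks[block]:
            if word.lower() in text.lower().split():
                hits.append(text)
    return hits

docs = ["search engine software", "free software", "search comparison"]
blocks, index = build_block_index(docs, num_blocks=2)
print(search("search", blocks, index))
```

Because the index stores at most one byte-sized block number per (word, block) pair, it stays small; the flexibility comes from phase 2, where any sequential matcher (in Glimpse's case, agrep with approximate matching) can be applied to the few candidate blocks.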

Crawler and Indexer Features

WebGlimpse supports the Robot Exclusion Standard. The crawling depth can be controlled. It cannot index a protected server. By default, it can index html and txt files; with the aid of filters, it can index PDF and any other documents that can be filtered to plain text.

WebGlimpse can index all languages that use single-byte characters. But the output of the interface is not configurable unless the commercial version of the software is purchased.

For detailed features of WebGlimpse, please refer to Appendix 9 which is the feature summary from the documentation of WebGlimpse [18].

5. Conclusion

We compared and contrasted nine free search engine software packages; each has its pros and cons. Most of them support Boolean Search, Phrase Search, and Word Forms. ht://Dig has powerful wild-card support. Juggernautsearch and WebGlimpse have small index files and fast indexing processes. Webinator supports natural language queries, and it is the only search engine reviewed that can search for numeric values in text. mnoGoSearch excels in supporting multiple languages. Perl-script search engines such as Perlfect and RuterSearch are usually lightweight; they have less functionality, but they are easy to install and use. In a nutshell, choosing a search engine software package is a decision that should be based on matching requirements to software features.

References and Notes:

[1] The original version of this paper was finished in 2002 as a project paper for the course Information Sciences and Technology 511: Information Management (Information and Technology), taught by Dr. Lee Giles at the College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA. The author was a graduate student at the College of Information Sciences and Technology, The Pennsylvania State University, at that time. This version removes all obsolete content from the original.