Monday, December 1, 2008

Getting Started with Apache Lucene

The API of lucene is very simple to use. Here is overview of some of the objects required to start using Apache Lucene.

Document and Fields

The class "org.apache.lucene.document.Document" is necessary container for the index. Lucene requires all indexed objects to provide an instance of Document. Each document defines one or several fields ( (org.apache.lucene.document.Field). Fields contain classified information about the document or metadata related to document. A sample classification is for example the creation date of a file, author of the document etc. These fields allow you to search later for a specific information in this classification.

IndexWriter

The class org.apache.lucene.index.IndexWriter creates the index. Via the method addDocument you can add an existing Document to the index. The constructor for IndexWriter expects the directory to store the index, and the analyzer for the content of the files. In addition a boolean flag is handed over which indicates if the index should be created new or if an existing index should be extended.

Analyser

The class org.apache.lucene.analysis.standard.StandardAnalyzer provides a standard analyzer. This part is responsible to analyse the text and to filter out certain fill-words, e.g. "and".

Searcher and Query

The class org.apache.lucene.search.Searcher provides the search functionality. org.apache.lucene.search.IndexSearcher searches over an index. What is to be searched is provided via the query (org.apache.lucene.search.Query) class. The search allows wildcard search, e.g. *, ?, logical operations (AND, OR, NOT) and much more, e.g. fussy.

// To store an index on disk, use this instead (note that the
// parameter true will overwrite the index in that directory
// if one exists):
//Directory directory = FSDirectory.getDirectory("/tmp/testindex", true);
IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
iwriter.setMaxFieldLength(25000);