Indexing and Searching Image Files

Lucene.NET is a high-performance text retrieval library that Adelene uses to index and search image files.

Adelene Ng is a senior staff software engineer with Motorola. She holds a Ph.D. in Computer Science from the University of London. Adelene can be contacted at ng_a@hotmail.com.

Lucene is a high-performance, full-featured text retrieval library. Originally written in Java, it has since been ported to C++, C#, Perl, and Python. In this article, I show how Lucene.NET can be used to index and search image files captured by digital cameras. What makes this possible is that digital photos embed the camera settings and scene information as metadata. The specification for this metadata is the Exchangeable Image File Format (EXIF; www.exif.org). Examples of stored information include shutter speed, exposure settings, date and time, focal length, metering mode, and whether the flash was fired. Here I show how the EXIF information can be extracted from the images and indexed, and how user-specified criteria are then used to search that index of your image library. To keep the example simple, I limited the EXIF search criteria to the date range and user comments fields. All images that satisfy the search criteria are displayed as thumbnails.

In addition to Lucene, I use the NLog logging library, which has a programming interface similar to Apache log4net (logging.apache.org/log4net). I use NLog to write diagnostic messages to a file for logging and tracing purposes. To extract EXIF information, I use the ExifExtractor library. Although .NET already has utilities to extract EXIF information, they return raw data and tags, which would need further processing before this application could use them. For example, to extract the shutter speed, I would need to know the tag number, extract the tag, and then convert the data from ASCII to a number. ExifExtractor abstracts all these steps. To display the thumbnail images returned by the search, I use Marc Lievin's Image Thumbnail Viewer (www.codeproject.com/KB/cs/thumbnaildotnet2.aspx).

As Figure 1 illustrates, the ImageSearcher application is made up of six main classes:

ImageSearcherForm

ImageDialog

ImageViewer

ThumbnailController

ThumbnailImageEventHandler

BuildIndex

ImageSearcherForm is the main point of entry into the application. It lets users enter the directory where the images are stored, where the index directory is to be created, and what search parameters (start and end dates, user comments, and so on) to use. The remaining classes control the display of the thumbnail images in the status window. This portion reuses much of the code from Lievin's Image Thumbnail Viewer.

Figure 1: ImageSearcher classes and their relationships.

The BuildIndex class is where index creation and searching take place. To use Lucene, I first create an index by instantiating an IndexWriter. An IndexWriter is created using the following constructor:

where indexDir is the path to the index directory, the second argument is the analyzer used to process the text (here, StandardAnalyzer), and the last argument is a Boolean that, if set to true, creates the index or overwrites the existing one. If set to false, it appends to the existing index.
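A minimal sketch of a constructor call matching this description, assuming the Lucene.NET 2.x API, where IndexWriter accepts a path string, an analyzer, and a create flag. The class name IndexWriterSketch and the method Open are hypothetical and are not taken from the article's source code.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

public static class IndexWriterSketch
{
    // indexDir: path to the index directory (supplied by the caller).
    // create == true builds a new index or overwrites an existing one;
    // create == false appends to an index that already exists.
    public static IndexWriter Open(string indexDir, bool create)
    {
        return new IndexWriter(indexDir, new StandardAnalyzer(), create);
    }
}
```

Later Lucene.NET releases changed these constructors, so the exact overload may differ in the version you use.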

Analyzers in Lucene can be used to tokenize text, extract relevant words, remove common words, stem the words (that is, reduce them to the root form; for example, "edits," "editor," and "editing" are condensed to "edit"), and perform any other processing before storing it into the index. The common analyzers provided by Lucene are:

WhitespaceAnalyzer, which separates tokens based on whitespace.

SimpleAnalyzer, which tokenizes the string to a set of words and converts them to lowercase.

StandardAnalyzer, which tokenizes the string to a set of words, recognizing acronyms, e-mail addresses, host names, and so on, converts the tokens to lowercase, and discards the basic English stop words (a, an, the, to). It does not stem the words.

A Lucene index is a sequence of files. All searching is done on this index. The raw EXIF metadata has to be read from the image files and passed to Lucene, where it can be indexed and searched. The IndexWriter object is created in the BuildIndex constructor, which takes two arguments: the first is the directory containing your image files, and the second is the directory in which the index files are created.

Next, the IndexDocs() method is called. This method has one argument, which is the name of the directory containing your image files. It runs through each file in the specified directory, checks that it is a JPEG file, and creates an Image object from the file by calling Image.FromFile(filename) from the System.Drawing namespace:

Image img = Image.FromFile(file);
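The directory walk described above can be sketched as follows. The article does not say how the JPEG check is done, so testing the file extension is an assumption here, and the names JpegFinder, IsJpeg, and FindJpegs are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class JpegFinder
{
    // Assumption: a file counts as a JPEG if its extension is .jpg or .jpeg
    // (case-insensitive). The file contents are not inspected.
    public static bool IsJpeg(string path)
    {
        string ext = Path.GetExtension(path);
        return ext.Equals(".jpg", StringComparison.OrdinalIgnoreCase)
            || ext.Equals(".jpeg", StringComparison.OrdinalIgnoreCase);
    }

    // Returns the JPEG files found directly inside imageDir (no recursion).
    public static List<string> FindJpegs(string imageDir)
    {
        List<string> result = new List<string>();
        foreach (string file in Directory.GetFiles(imageDir))
        {
            if (IsJpeg(file))
            {
                result.Add(file);
            }
        }
        return result;
    }
}
```

Each path returned by FindJpegs could then be passed to Image.FromFile as shown above.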

Next, the ExifExtractor library is used to extract EXIF information from the image files. To keep the application simple, I extract only "Date Time" and "User Comment" EXIF data. EXIF data is extracted as follows:

Likewise, to extract the user comments EXIF information, we access the er object as follows:

string s2 = (String)er["User Comment"];
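EXIF stores date and time values as text in the fixed-width form "YYYY:MM:DD HH:MM:SS". Because that form is zero-padded, comparing two such strings character by character gives the same order as comparing the dates, which is convenient for a date range search. The sketch below, using only the .NET base class library, converts between that form and DateTime so that a start or end date entered by the user can be written in the same form as the stored value. The class name ExifDate and its methods are hypothetical and do not appear in the article's source code.

```csharp
using System;
using System.Globalization;

public static class ExifDate
{
    // The EXIF date-time layout, with the colons quoted as literals.
    private const string ExifFormat = "yyyy':'MM':'dd HH':'mm':'ss";

    // Parses an EXIF date-time string. EXIF ASCII values are NUL-terminated,
    // and a raw reader may leave the terminator in place, so it is trimmed.
    public static DateTime Parse(string exifDateTime)
    {
        string cleaned = exifDateTime.Trim('\0', ' ');
        return DateTime.ParseExact(cleaned, ExifFormat, CultureInfo.InvariantCulture);
    }

    // Formats a DateTime (for example, a user-entered start date)
    // in the same layout that EXIF uses.
    public static string Format(DateTime value)
    {
        return value.ToString(ExifFormat, CultureInfo.InvariantCulture);
    }
}
```

For example, ExifDate.Format(new DateTime(2008, 3, 15, 14, 22, 7)) yields "2008:03:15 14:22:07".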

The aforementioned EXIF tags are extracted from each image file. For each image file processed, a Document() object is created. This is created using the Document constructor as follows:

Document d = new Document();

Documents are the primary retrievable items from a Lucene query. Each Document object is made up of one or more field objects.

Fields represent a section of the Document. They contain the name of the section and the actual data associated with the section. Each field contains information that you query against or display in your search results. Because I use the filename, the date and time the picture was taken, and the user comments in the search results, each of these is added to the Document object as a field. Each field name has an associated value, which is the EXIF data extracted from the image file. Field values are a sequence of terms. I construct the Field object using the constructor:

where the first argument is the name of the field, the second argument is the value associated with this field name, the third argument indicates how the field is stored, and the last argument denotes how the field is indexed. In this application, the store is set to Field.Store.YES and the index is set to Field.Index.UN_TOKENIZED. The former stores the original field value in the index. The latter indexes the field's value as a single term without passing it through an Analyzer, so it can be searched by its exact value. Fields are added to the Document object using the Add method:
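A minimal sketch of the Field constructor and the Add method as described, assuming the Lucene.NET 2.x API. The field names "filename", "datetime", and "usercomment" and the class ImageDocumentSketch are hypothetical, since the article's listing does not show which names it uses.

```csharp
using Lucene.Net.Documents;

public static class ImageDocumentSketch
{
    // Builds one Document per image from values already extracted from EXIF.
    public static Document Build(string fileName, string dateTime, string userComment)
    {
        Document d = new Document();

        // Store each value as-is and index it as a single untokenized term.
        d.Add(new Field("filename", fileName,
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
        d.Add(new Field("datetime", dateTime,
                        Field.Store.YES, Field.Index.UN_TOKENIZED));

        // An image may have no user comment; Field rejects null values,
        // so fall back to an empty string (an assumption of this sketch).
        d.Add(new Field("usercomment", userComment ?? "",
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
        return d;
    }
}
```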

The Document object is then added to the IndexWriter instance using the following method:

idxWriter.AddDocument(d);

When indexing is complete, we optimize the index for search:

idxWriter.Optimize();

Finally, we close the index:

idxWriter.Close();

Once the index has been built, it can be searched. In this application, the search is activated by the user. After users have entered all the search parameters, they activate the search by clicking on the search button. To do the search, we first create an IndexSearcher instance that points to the directory containing the indices that have been created previously:

IndexSearcher searcher = new IndexSearcher(idxDir.FullName);

I use the RangeQuery object to search for documents whose field value falls within a range of terms, here the start and end dates. The range can be inclusive or exclusive of its end points. An instance of RangeQuery is created as follows:

The last argument to the RangeQuery constructor is a Boolean flag, which is set to true if the range is inclusive of its end points, and to false otherwise. The query instance is then passed as an argument to the Search method of the IndexSearcher instance:

Hits oHitColl = searcher.Search(query);
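A minimal sketch of how such a RangeQuery might be built, assuming the Lucene.NET 2.x constructor that takes a lower Term, an upper Term, and the inclusive flag. The field name "datetime" and the class DateRangeSketch are hypothetical. Both bounds must use the same text form as the indexed values, because terms are compared as strings.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class DateRangeSketch
{
    // start and end are date strings in the same form as the indexed field,
    // for example the EXIF layout "YYYY:MM:DD HH:MM:SS".
    public static RangeQuery Build(string start, string end, bool inclusive)
    {
        Term lower = new Term("datetime", start);
        Term upper = new Term("datetime", end);
        return new RangeQuery(lower, upper, inclusive);
    }
}
```

The returned query can be passed to searcher.Search as in the line above.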

This returns the documents that match the query. I extract the Document objects from the Hits object by calling the Doc method as follows:
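One way this loop might look, as a minimal sketch assuming the Lucene.NET 2.x Hits API (Length() and Doc(int)) and a hypothetical stored field named "filename". The class HitsSketch is illustrative and is not part of the article's source code.

```csharp
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Search;

public static class HitsSketch
{
    // Collects the stored "filename" value of every matching document,
    // in the order the hits are returned.
    public static List<string> FileNames(Hits hits)
    {
        List<string> names = new List<string>();
        for (int i = 0; i < hits.Length(); i++)
        {
            Document doc = hits.Doc(i);
            names.Add(doc.Get("filename"));
        }
        return names;
    }
}
```

The resulting file names could then be handed to the thumbnail display classes described earlier.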
