This is a rather precise description of the text for a German
speaker. The supported languages at the moment are Danish (da),
German (de), English (en), Spanish (es), Italian (it) and Norwegian
(no). Supporting other languages is merely a matter of adding
free dictionaries in an appropriate character set. Further options
are described in the extract man page; see man 1 extract.

Using libextractor in Your Projects

Listing 5 shows the code of a minimal program that uses
libextractor. Compiling minimal.c requires passing the option
-lextractor to GCC. The EXTRACTOR_KeywordList is a simple linked list
in which each entry holds a keyword and a keyword type. For details and
additional functions for loading plugins and manipulating the keyword
list, see the libextractor man page, man 3 libextractor. Java
programmers should know that a Java class that uses JNI to communicate
with libextractor is also available.

Listing 5. minimal.c shows the most important libextractor functions in
concert.
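
To make the linked-list description concrete, the following is a
minimal sketch of how such a list can be walked. It deliberately does
not include extractor.h: the SketchKeywordList type, the
SketchKeywordType values and both functions are hypothetical stand-ins
invented for illustration, assuming only what the text states, namely
that each entry holds a keyword and a keyword type.

```c
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real libextractor types.
   The real definitions live in extractor.h. */
typedef enum {
  SKETCH_UNKNOWN = 0,
  SKETCH_MIMETYPE,
  SKETCH_TITLE
} SketchKeywordType;

typedef struct SketchKeywords {
  char *keyword;                 /* the extracted string */
  SketchKeywordType keywordType; /* its classification */
  struct SketchKeywords *next;   /* next entry, or NULL at the end */
} SketchKeywordList;

/* Return the first keyword of the given type, or NULL if the
   list contains no entry of that type. */
const char *sketch_first_of_type(const SketchKeywordList *list,
                                 SketchKeywordType type) {
  for (; list != NULL; list = list->next)
    if (list->keywordType == type)
      return list->keyword;
  return NULL;
}

/* Count the entries in the list; an empty list is simply NULL. */
size_t sketch_count(const SketchKeywordList *list) {
  size_t n = 0;
  for (; list != NULL; list = list->next)
    n++;
  return n;
}
```

An application would typically run a loop of this shape over the list
returned by the library, for example to pick out the MIME type before
deciding how to display the remaining keywords.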

The hardest part of writing a new plugin for libextractor
is writing the actual parser for a specific format.
Nevertheless, the basic pattern is always the same. The plugin
library must be called libextractor_XXX.so, where XXX denotes the file
format handled by the plugin. The library must export a function
named libextractor_XXX_extract, with the signature shown in
Listing 6.

Listing 6. Signature of the function that each libextractor plugin must
export.

The argument filename specifies the name of the file being processed.
The argument data is a pointer to the contents of the file, which
typically are mmapped, and size is the file size. Most plugins do not
make use of the filename and simply parse data directly, starting by
verifying that the header of the data matches the specific format.

The argument prev is the list of keywords extracted so far by other
plugins for the file. The function is expected to return an updated
list of keywords. If the data does not match the format the plugin
expects, the plugin returns prev unchanged. Most plugins use a
function such as addKeyword (Listing 7) to extend the list.

Listing 7. The plugins return the metadata using a simple linked
list.
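
A helper of this kind can be sketched as follows. This is an
illustration under stated assumptions, not the code of any actual
plugin: the type names, the function name sketch_add_keyword, its int
return value and the choice to prepend new entries are all
hypothetical.

```c
#include <stdlib.h>

/* Hypothetical stand-ins, NOT the real libextractor types. */
typedef enum {
  SKETCH_UNKNOWN = 0,
  SKETCH_MIMETYPE,
  SKETCH_TITLE
} SketchKeywordType;

typedef struct SketchKeywords {
  char *keyword;
  SketchKeywordType keywordType;
  struct SketchKeywords *next;
} SketchKeywordList;

/* Prepend a new entry to *list. The keyword pointer is stored as
   is; it is not copied. Returns 0 on success. On allocation
   failure the list is left unchanged and -1 is returned. */
int sketch_add_keyword(SketchKeywordList **list,
                       char *keyword,
                       SketchKeywordType type) {
  SketchKeywordList *node = malloc(sizeof *node);
  if (node == NULL)
    return -1;
  node->keyword = keyword;
  node->keywordType = type;
  node->next = *list; /* old head becomes the second entry */
  *list = node;       /* new entry becomes the head */
  return 0;
}
```

Prepending keeps the operation constant-time, at the cost of the list
ending up in reverse order of insertion. Because only the pointer is
stored, whoever frees the list later also decides how the keyword
strings must have been allocated.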

A typical use of addKeyword is to add the MIME type once the file
format has been established. For example, the JPEG extractor (Listing
8) checks the first bytes of the JPEG header and then either
aborts or identifies the file as a JPEG. The strdup in the
code is important, because the string will be deallocated later,
typically in EXTRACTOR_freeKeywords(). A list of supported keyword
classifications (EXTRACTOR_MIMETYPE in the example) can be found in
the extractor.h header file.
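
The pattern described above, checking the header, returning prev on a
mismatch and otherwise adding a heap-allocated MIME type string, can
be sketched in a self-contained form. Everything named sketch_ or
Sketch here is a hypothetical stand-in rather than the real
libextractor API, and the use of unsigned char for data is an
assumption made so that the byte comparisons behave the same on every
platform. The two bytes tested, 0xFF 0xD8, are the start-of-image
marker that begins every JPEG file.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the real libextractor types. */
typedef enum { SKETCH_UNKNOWN = 0, SKETCH_MIMETYPE } SketchKeywordType;

typedef struct SketchKeywords {
  char *keyword;
  SketchKeywordType keywordType;
  struct SketchKeywords *next;
} SketchKeywordList;

/* Plugin-shaped entry point: returns prev untouched unless the
   data starts with the JPEG start-of-image marker. */
SketchKeywordList *sketch_jpeg_extract(const char *filename,
                                       const unsigned char *data,
                                       size_t size,
                                       SketchKeywordList *prev) {
  SketchKeywordList *node;
  (void) filename; /* unused, as in most plugins */
  if (data == NULL || size < 2 || data[0] != 0xFF || data[1] != 0xD8)
    return prev; /* not a JPEG */
  node = malloc(sizeof *node);
  if (node == NULL)
    return prev;
  /* Copy the literal to the heap: the list owner will free() it. */
  node->keyword = strdup("image/jpeg");
  if (node->keyword == NULL) {
    free(node);
    return prev;
  }
  node->keywordType = SKETCH_MIMETYPE;
  node->next = prev;
  return node;
}

/* Free every entry and its keyword string. Passing a string
   literal to free() would be undefined behavior, which is why
   the keyword above is duplicated first. */
void sketch_free_keywords(SketchKeywordList *list) {
  while (list != NULL) {
    SketchKeywordList *next = list->next;
    free(list->keyword);
    free(list);
    list = next;
  }
}
```

A real plugin would continue parsing after the MIME type has been
added, for instance to pick up comments or image dimensions, and
return the extended list at the end.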