GNU Libextractor

GNU Libextractor is a library used to extract meta data from files.
The goal is to provide developers of file-sharing networks, browsers
or WWW-indexing bots with a universal library to obtain simple
keywords and meta data to match against queries and to show to users
instead of only relying on filenames. libextractor contains the shell
command extract that, similar to the well-known file
command, can extract meta data from a file and print the results to
stdout.

GNU libextractor uses helper-libraries (plugins) to perform the actual
extraction. As a result, GNU libextractor can be extended simply by
installing additional plugins. Because writing robust parsers can be
difficult, GNU libextractor protects the main application from hanging
or crashing plugins by executing all plugins out-of-process.

Announcements about Libextractor and most other GNU software are made
on info-gnu (archive). If you only want to get notifications about
Libextractor, we suggest you subscribe to the project at freshmeat.

Security reports that should not be made immediately public can be
sent directly to the maintainer. If there is no response to an urgent
issue, you can escalate to the general security mailing list for advice.

Getting involved

Development of Libextractor, and GNU in general, is a volunteer effort,
and you can contribute. For information, please read How to help GNU.
If you'd like to get involved, it's a good idea to join the discussion
mailing list (see above).

A source package is here.
This binding has been packaged as a Python egg, available here.
A second Python binding that includes a binding for doodle can be found here.
A Perl binding is available on CPAN.
The latest version of the Perl binding is available using git clone git://git.perldition.org/File-Extractor.git/
A Ruby binding has been published here (mirror).
Another Ruby binding has been published here (mirror).
An initial draft of a PHP binding can be checked out using:

$ svn checkout https://gnunet.org/svn/Extractor-php

Translating Libextractor

To translate Libextractor's messages into other languages, please see
the Translation Project page for Libextractor. If you have a new
translation of the message strings, or updates to the existing strings,
please have the changes made in this repository. Only translations from
this site will be incorporated into Libextractor. For more information,
see the Translation Project.

Quick Introduction

Installation

The simplest way to install GNU libextractor is to use one of the binary
packages which are available online for many distributions. Note that
under Debian, the extract tool is in a separate
package extract and headers required to compile other
applications against libextractor are in libextractor-dev.
Thus, under Debian, you should install both packages, for example
with apt-get install extract libextractor-dev.

If you compile GNU libextractor from source instead, note that you need
various dependencies (read README for an up-to-date list) in order to
compile all of the plugins.

Using the extract tool

After installing GNU libextractor, the extract tool can be used to obtain
meta data from documents. By default, the extract tool uses the
canonical set of plugins, which consists of all format-specific
plugins supported by the current version of libextractor together with
the mime-type detection plugin. If you are a user of BibTeX, the
option -b is likely to come in handy: it automatically creates BibTeX
entries from documents that have been properly equipped with meta data.
Further options are described in the extract manpage (man 1 extract).

The following listing shows the code of a minimalistic program that
uses GNU libextractor. Compiling the fragment requires passing the
option -lextractor to gcc. For details and additional
functions for loading plugins and manipulating the keyword list, see
the libextractor manpage (man 3 libextractor).
Java programmers should note that a Java class that uses JNI to
communicate with libextractor is also available. Python programmers
will find that libextractor (since 0.5.0) can also be used from
Python by importing Extractor.

The hardest part of writing a new plugin for GNU libextractor is
writing the actual parser for a specific format. Nevertheless, the
basic pattern is always the same. The
plugin library must be called libextractor_XXX.so where XXX
denotes the file format supported by the plugin and must be placed in
the plugin directory (typically $PREFIX/lib/libextractor/).
The library must export a method EXTRACTOR_XXX_extract_method
with the following signature:
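As a hedged sketch of that signature, assuming the 1.x plugin API in
which the context type is struct EXTRACTOR_ExtractContext (normally
declared in extractor.h, and only forward-declared here so that the
fragment compiles on its own), a plugin for a hypothetical format
called "demo" would export:

```c
/* Assumption: a real plugin gets this declaration from <extractor.h>. */
struct EXTRACTOR_ExtractContext;

/* Entry point of a hypothetical libextractor_demo.so (XXX = demo). */
void
EXTRACTOR_demo_extract_method (struct EXTRACTOR_ExtractContext *ec)
{
  (void) ec;  /* a real plugin reads the file and reports items through ec */
}
```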

ec provides a callback to invoke with meta data as well as
functions for reading data from the file that is being processed.
Most plugins start by reading the first bytes of the file and checking
that the header of the data matches the specific format.
The extract function is expected to call ec->proc with each
meta data item found. ec->cls must be passed as the first
argument to proc and to the other functions invoked through ec.
Finally, ec->config is an arbitrary string of options that the plugin is
free to interpret. Most plugins ignore config.

If the meta data extracted is a string, the plugin is supposed to
convert it into the UTF-8 character set. However, in cases where
the character encoding used in the document is unknown, no conversion
should be done. Binary meta data can also be extracted. Plugins
indicate the format of the meta data using the format
argument to proc. Supported formats are UTF-8 strings, C
strings (for strings of unknown encoding) and binary data.

In addition to this rough categorization, the plugin is also supposed to
indicate the mime type of the meta data. For strings, that mime type
is most often text/plain. Finally, the plugin must specify
the meta data type. Common meta data types are "author",
"title" and "mime-type". The full signature of
the "proc" callback is:
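The sketch below is a reconstruction under assumptions, not a copy of
extractor.h: the callback typedef and the reduced context struct follow
the 1.x API as described above, while the enum values, the "DEMO" file
format and every demo_ name are invented so that the example runs
without the library. It shows a plugin checking the header, then
reporting one title as a C string with mime type text/plain.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Stand-ins for <extractor.h>; subset only, numeric values are placeholders. */
enum EXTRACTOR_MetaType { EXTRACTOR_METATYPE_MIMETYPE, EXTRACTOR_METATYPE_TITLE };
enum EXTRACTOR_MetaFormat { EXTRACTOR_METAFORMAT_UTF8,
                            EXTRACTOR_METAFORMAT_C_STRING,
                            EXTRACTOR_METAFORMAT_BINARY };

/* Assumed "proc" signature; returns 0 to continue extracting, 1 to abort. */
typedef int (*EXTRACTOR_MetaDataProcessor) (void *cls,
                                            const char *plugin_name,
                                            enum EXTRACTOR_MetaType type,
                                            enum EXTRACTOR_MetaFormat format,
                                            const char *data_mime_type,
                                            const char *data,
                                            size_t data_len);

struct EXTRACTOR_ExtractContext
{
  void *cls;                 /* closure, first argument of read and proc */
  const char *config;        /* plugin options, may be NULL */
  ssize_t (*read) (void *cls, void **data, size_t size);
  EXTRACTOR_MetaDataProcessor proc;
};

/* Plugin for a hypothetical format: magic "DEMO", then a title line. */
void
EXTRACTOR_demo_extract_method (struct EXTRACTOR_ExtractContext *ec)
{
  void *buf = NULL;
  char title[64];
  size_t len = 0;
  ssize_t n = ec->read (ec->cls, &buf, 4 + sizeof (title) - 1);
  const char *p = buf;

  if ( (n < 5) || (0 != memcmp (p, "DEMO", 4)) )
    return;                  /* header does not match: not our format */
  while ( (4 + len < (size_t) n) && ('\n' != p[4 + len]) )
    len++;
  if (0 == len)
    return;                  /* empty title, nothing to report */
  memcpy (title, &p[4], len);
  title[len] = '\0';
  /* encoding unknown, so report a C string; length includes the terminator */
  ec->proc (ec->cls, "demo", EXTRACTOR_METATYPE_TITLE,
            EXTRACTOR_METAFORMAT_C_STRING, "text/plain", title, len + 1);
}

/* Test host: serves a memory buffer and remembers the title reported. */
struct demo_host
{
  const char *file;
  size_t size;
  size_t pos;
  char title[64];
  unsigned int items;
};

static ssize_t
demo_read (void *cls, void **data, size_t size)
{
  struct demo_host *h = cls;
  size_t left = h->size - h->pos;

  if (size > left)
    size = left;
  *data = (void *) &h->file[h->pos];
  h->pos += size;
  return (ssize_t) size;
}

static int
demo_proc (void *cls, const char *plugin_name, enum EXTRACTOR_MetaType type,
           enum EXTRACTOR_MetaFormat format, const char *data_mime_type,
           const char *data, size_t data_len)
{
  struct demo_host *h = cls;

  (void) plugin_name; (void) format; (void) data_mime_type;
  if ( (EXTRACTOR_METATYPE_TITLE == type) && (data_len <= sizeof (h->title)) )
    memcpy (h->title, data, data_len);
  h->items++;
  return 0;                  /* keep extracting */
}

/* Run the plugin over an in-memory "file"; returns the number of items. */
unsigned int
demo_run (const char *file, size_t size, struct demo_host *h)
{
  struct EXTRACTOR_ExtractContext ec;

  memset (h, 0, sizeof (*h));
  h->file = file;
  h->size = size;
  ec.cls = h;
  ec.config = NULL;
  ec.read = &demo_read;
  ec.proc = &demo_proc;
  EXTRACTOR_demo_extract_method (&ec);
  return h->items;
}
```

In a real plugin only EXTRACTOR_demo_extract_method would exist; the
library supplies the context, and the host side shown here is there
just to exercise the plugin.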

Licensing

Libextractor is free software; you can redistribute it and/or modify it
under the
terms of the GNU General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your
option) any later version.