
A SmartFile Open Source project. Read more about how SmartFile
uses and contributes to Open Source software.

Introduction

Fulltext is meant to be used for full-text indexing of file contents for
search applications.

Fulltext is a library that makes converting various file formats to
plain text simple. It is mostly a wrapper around shell tools: it
executes the shell program, scrapes its output, and then post-processes
the result to pack as much text into as little space as possible.
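As a rough illustration of the "execute and scrape" step, the sketch below runs a command and captures its standard output as text. It is not Fulltext's actual implementation: the function name `run_and_scrape` is hypothetical, and the Python interpreter stands in for a real converter tool so the example runs without any extra packages installed.

```python
import subprocess
import sys


def run_and_scrape(command):
    """Run a command (a list of arguments) and return its stdout as text.

    Hypothetical helper for illustration only, not part of Fulltext.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,      # capture what the tool prints
        stderr=subprocess.DEVNULL,   # discard diagnostics
        check=True,                  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8", errors="replace")


# Stand-in for a real converter: a child Python process that prints some text.
command = [sys.executable, "-c", "print('Hello,   world\\n\\n  again')"]
raw_text = run_and_scrape(command)
```

The raw output still contains the tool's original spacing and line breaks, which is why a post-processing pass follows in practice.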

Supported formats

The following formats are supported using the command line apps listed.

Installing tools

Fulltext depends on the command line programs above, so it is not useful
until they are installed. Many of them can be installed via your system's
package manager. I use Fedora, where the following command installed most of the
required packages.

The package names may differ on other systems, but for the most part will be similar.

Usage

To use the library, pass a filename to the module-level .get()
function. An optional second argument, default, provides a string to
be returned in case of error. If you are not concerned with
exceptions, you can ignore them by providing a default. This is
similar to how the dict.get() method works.
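The default-on-error behaviour described above can be sketched with a small stand-in function. This is not the library's code: the `get` below only reads a file as plain text, and the sentinel used to detect "no default given" is an assumption made for the example.

```python
_MISSING = object()  # sentinel: tells "no default given" apart from default=None


def get(filename, default=_MISSING):
    """Hypothetical stand-in for the library's get(), for illustration only.

    Returns the file's text, or `default` if the file cannot be read.
    Without a default, the underlying exception propagates.
    """
    try:
        with open(filename, "r", encoding="utf-8", errors="replace") as handle:
            return handle.read()
    except OSError:
        if default is _MISSING:
            raise
        return default


# With a default, the error is swallowed and the default comes back.
print(get("does-not-exist.pdf", "< no content >"))  # prints "< no content >"
```

Calling `get("does-not-exist.pdf")` without a default raises instead, which is the right choice when the caller wants to log or handle failures.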

Post-processing

Some formats require additional care, which is handled in the
post-processing step. For example, unrtf is the tool used to convert
.rtf files to text. It prints a banner that includes the program version and
some document metadata. This header is removed in post-processing.

A simple regular expression is used to convert adjacent whitespace characters
to a single space.
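A minimal sketch of that whitespace step, using Python's `re` module. The function name `collapse_whitespace` and the exact pattern are assumptions for illustration; the library's own expression may differ.

```python
import re

# One or more adjacent whitespace characters (spaces, tabs, newlines).
_WHITESPACE = re.compile(r"\s+")


def collapse_whitespace(text):
    """Replace each run of whitespace with a single space."""
    return _WHITESPACE.sub(" ", text)


print(collapse_whitespace("one  two\n\n\tthree"))  # prints "one two three"
```

Because newlines count as whitespace, the result is a single line of text, which suits indexing but discards the original layout.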