StringifierContrib is organized as a set of plugins, each of which serializes one document format by delegating the work to a suitable backend.
For some formats there are alternative backends to choose from. For example, a DOC file can be serialized
by any of abiword, antiword, catdoc, soffice or wvWare. Use the one that best serves your needs and
is available on your platform. For instance, soffice is a very good document converter,
but it is rather demanding in terms of performance. The simpler backends suffice most of the time but may
extract text of inferior quality.

Backends for Word Documents

To index Word Documents (.doc) you will need to install one of the following:

antiword

abiword

catdoc

soffice

wvWare

Backend for PDF

To index .pdf files you need to install poppler-utils.

Backend for PPT

To index .ppt files you may select one of the following:

catdoc

ppthtml

soffice

Backends for DOCX, PPTX, XLSX

To index these file types, you will need to install the following tools from Sourceforge:

Backend for OpenDocument and Staroffice documents

Installing the Contrib

You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.

Open configure, and open the "Extensions" section. Go to the "Extensions Operation and Maintenance" tab, then to the "Install, Update or Remove extensions" tab, and click the "Search for Extensions" button.
Enter part of the extension name or description and press search. Select the desired extension(s) and click install. If an extension is already installed, it will not show up in the
search results.

You can also install from the shell by running the extension installer as the web server user: (Be sure to run as the webserver user, not as root!)

Configuration

There are a number of settings that need to be set in configure before you can use the Contrib.

Test of the Installation

Test if the installation was successful:

Check that antiword, abiword or wvHtml is in place: Type antiword, abiword or wvHtml at the prompt and check that the command exists.

Check that pdftotext is in place: Type pdftotext at the prompt and check that the command exists.

Check that ppthtml is in place: Type ppthtml at the prompt and check that the command exists.

Stringify some files (see below).
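
The command checks above can be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name check_backends is a hypothetical helper and not part of the extension. It uses the shell built-in `command -v` to test whether each command is on the PATH.

```shell
#!/bin/sh
# Report which backend commands are available on the PATH.
# check_backends is a hypothetical helper, not shipped with the extension.
check_backends() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            missing=$((missing + 1))
        fi
    done
    # The exit status is 0 only if every command was found.
    [ "$missing" -eq 0 ]
}

# Commands named in the checks above. Only one of the .doc backends
# (antiword, abiword, wvHtml) needs to be present, so a "missing" line
# for the others is not necessarily a problem.
check_backends antiword abiword wvHtml pdftotext ppthtml || true
```

A "missing" line for a backend you have configured points to an installation problem rather than a problem with the Contrib itself.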

Test of Stringification with stringify

Some users report problems with stringification: the stringifier script
fails or takes too long on attachments. Sometimes this results from
installation errors, especially in the installation of the backends used for
stringification.

stringify gives you the opportunity to test stringification in advance.

Usage: stringify file_name

The result shows which stringifier was used and the text it
extracted.
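
To try several attachments in one go, the call can be wrapped in a loop. This is a sketch under stated assumptions, not part of the extension: the STRINGIFY and LIMIT variables and the try_files function are hypothetical names, the stringify script is assumed to be reachable on the PATH (or via STRINGIFY), and the `timeout` command from GNU coreutils is assumed to be installed. Whether stringify signals failure through its exit status is also an assumption, so the sketch prints the amount of output as a second indicator.

```shell
#!/bin/sh
# Run the stringify script over several files and flag slow or failing ones.
# Assumptions: STRINGIFY points at the stringify script (default: "stringify"
# on the PATH) and GNU coreutils `timeout` is available.
STRINGIFY=${STRINGIFY:-stringify}
LIMIT=${LIMIT:-30}   # seconds allowed per file (arbitrary illustrative value)

try_files() {
    failed=0
    for f in "$@"; do
        if out=$(timeout "$LIMIT" "$STRINGIFY" "$f" 2>&1); then
            # A very small character count suggests that little text was extracted.
            echo "ok: $f (${#out} characters of output)"
        else
            # Non-zero status: the command failed, or timeout stopped it.
            echo "failed or timed out: $f"
            failed=$((failed + 1))
        fi
    done
    # The exit status is 0 only if no file failed.
    [ "$failed" -eq 0 ]
}

# Usage: try_files report.doc slides.ppt manual.pdf
```

Files reported as failed or timed out are the ones to check against the backend installation first.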

Further Development

In this extension, a plug-in mechanism is implemented, so that additional
stringifiers can be added without changing the existing code. All stringifier
plugins are stored in the directory lib/Foswiki/Contrib/Stringifier/Plugins.

You can add new stringifier plugins simply by adding new files to this directory. The minimum
requirements are:

The plugin must inherit from Foswiki::Contrib::StringifierContrib::Base

The plugin must register itself by __PACKAGE__->register_handler($application, $file_extension);

The plugin must implement the method $text = stringForFile ($filename)

All the stringifiers have unit tests associated with them, and we would
encourage you to provide unit tests for any you wish to contribute. See
Foswiki:Development/UnitTests for more information on unit testing.

Dependencies

| Name | Version | Description |
| File::Which | >0 | Required |
| Module::Pluggable | >0 | Required |
| Spreadsheet::ParseExcel | >0 | Required for .xls files |
| Spreadsheet::XLSX | >0 | One of Spreadsheet::ParseXLSX or xlsx2csv is required for .xlsx files |
| Encode | >0 | Required |
| Error | >0 | Required |
| catdoc | >0 | Optional |
| ppthtml | >0 | Required |
| pdftotext | >0 | Required for indexing .pdf files. Part of poppler-utils |
| soffice | >0 | One of antiword, abiword, soffice or wvWare is required for .doc and .docx files |
| antiword | >0 | One of antiword, abiword, soffice or wvWare is required for .doc files |
| abiword | >0 | One of antiword, abiword, soffice or wvWare is required for .doc files |
| wvWare | >0 | One of antiword, abiword, soffice or wvWare is required for .doc files |