The text extraction framework

A text extractor is just a plain old Java object (POJO). To create one, write a Java class that extends a single abstract class, TextExtractor.

The abstract class also contains fields and getters (not shown here) for the name and logger, which ModeShape sets automatically during repository initialization.

There are two abstract methods that must be implemented: supportsMimeType(...) and extractFrom(...). The first is fairly obvious: return true for every MIME type that the extractor is capable of processing. The extractFrom method is the meat of the implementation: it should process the BINARY value's contents and write the searchable text to the supplied Output object.
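To make the division of labor between the two methods concrete, here is a minimal, self-contained sketch. It does not use ModeShape's classes: SketchTextExtractor, its nested Output interface, the recordText method, and PlainTextExtractor are all hypothetical stand-ins, and the BINARY value is simplified to a plain InputStream. Check the actual TextExtractor Javadoc for the real signatures.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for the framework's abstract extractor class.
// The real ModeShape class has additional members and different parameter types.
abstract class SketchTextExtractor {

    // Hypothetical sink for the searchable text produced by an extractor.
    interface Output {
        void recordText(String text);
    }

    // Return true for every MIME type this extractor can process.
    abstract boolean supportsMimeType(String mimeType);

    // Read the content and write the searchable text to the output.
    // Assumption: the binary value is modeled here as a plain InputStream.
    abstract void extractFrom(InputStream content, Output output) throws IOException;
}

// A trivial extractor that handles plain text and passes it through unchanged.
class PlainTextExtractor extends SketchTextExtractor {

    @Override
    boolean supportsMimeType(String mimeType) {
        // Accept "text/plain" with or without parameters such as "; charset=utf-8".
        return mimeType != null
                && mimeType.toLowerCase(Locale.ROOT).startsWith("text/plain");
    }

    @Override
    void extractFrom(InputStream content, Output output) throws IOException {
        // readAllBytes requires Java 9 or later; the sketch assumes UTF-8 content.
        byte[] bytes = content.readAllBytes();
        output.recordText(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

In this sketch the caller is expected to consult supportsMimeType first and to invoke extractFrom only for content the extractor has claimed.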

Note that processStream(...) is a utility method that extractFrom can call. It opens the BINARY value's stream, processes the content, and ensures that the stream is always closed. Your implementation of extractFrom can therefore simply delegate to processStream(...).
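The pattern can be illustrated with a small, self-contained sketch. Every type here (SketchBinary, StreamOperation, StreamingExtractorSketch, Utf8Extractor) is a hypothetical stand-in rather than ModeShape's actual API, and the processStream shown is only an assumption about how such a helper might work: open the stream, run the supplied operation, and close the stream in all cases.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a BINARY value that can open a stream over its content.
interface SketchBinary {
    InputStream getStream() throws IOException;
}

// Hypothetical callback that does the real work on an already opened stream.
interface StreamOperation<T> {
    T execute(InputStream stream) throws Exception;
}

// Hypothetical stand-in for the abstract extractor class.
abstract class StreamingExtractorSketch {

    // Hypothetical sink for the extracted text.
    interface Output {
        void recordText(String text);
    }

    abstract void extractFrom(SketchBinary binary, Output output) throws Exception;

    // Opens the stream, runs the operation, and always closes the stream,
    // even when the operation throws (try-with-resources guarantees this).
    protected final <T> T processStream(SketchBinary binary,
                                        StreamOperation<T> operation) throws Exception {
        try (InputStream stream = binary.getStream()) {
            return operation.execute(stream);
        }
    }
}

// An extractor that delegates all stream handling to processStream.
class Utf8Extractor extends StreamingExtractorSketch {

    @Override
    void extractFrom(SketchBinary binary, Output output) throws Exception {
        // The lambda only sees an open stream; it never opens or closes it.
        String text = processStream(binary,
                stream -> new String(stream.readAllBytes(), StandardCharsets.UTF_8));
        output.recordText(text);
    }
}
```

The benefit of this shape is that resource handling lives in one place: each extractor supplies only the operation that reads the stream, so a forgotten close() cannot leak a stream.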

This can make your implementation a little easier, but feel free to implement the extractFrom method so that it processes the stream directly.