Introduction

After the MetadataDiscussion page was created, Jukka Zitting offered an example of how to get to recursive metadata when parsing with an AutoDetectParser. In addition to sharing Jukka's example, this page also offers some additional details on how, if you are willing to write your own ContentHandler, you can capture both text and metadata for each recursive document.

NOTE - This discussion of recursive metadata is from the point of view of what might be an oddball use case. The assumption of this page is NOT that you would want to take a container file, maybe a zip file, and extract all of the text and metadata into a single mega-representation of all of the text and metadata found in that container. Instead, this page assumes that what you really want to do is to extract the text for each document in the container, and be able to see each of these nested documents as a separate entity with its own text and metadata.

Jukka's Example

Here is the full source for Jukka's example for how to get access to nested metadata. This example writes the metadata for each nested document to standard output. More details about how Jukka's example works are available in subsections below.

The example starts by setting up recursive parsing. If you are parsing text files, Word documents, etc., then you'll never notice whether recursive parsing is enabled or not. If you are parsing containers like zip files and tar.gz files, the only way to get the text for the files inside those containers is to enable recursive parsing.

The way to enable recursive parsing is to create a ParseContext and add a parser to it as shown on the line context.set(Parser.class, parser). This is the parser that will be used to parse any nested documents.

The parse method is where you get access to the metadata. When the parser set in ParseContext is used to parse a nested document, a new Metadata object is created and passed to the parse method. Since the example put a RecursiveMetadataParser in the ParseContext, RecursiveMetadataParser's parse method is called. Before calling super.parse, the metadata object is empty. After super.parse returns, the metadata object contains all of the metadata the decorated parser found and System.out.println(metadata) prints all of the metadata to standard output.
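The flow described above can be modeled without Tika on the classpath. What follows is a minimal sketch, not Jukka's source: every type whose name starts with Sketch is a hypothetical stand-in (SketchParser for Parser, SketchContext for ParseContext, a plain Map for Metadata), and the container parser only pretends to parse. It shows the two moving parts: registering the decorator in the context so nested documents are routed back through it, and reading the metadata after the delegate returns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Tika's Parser interface.
interface SketchParser {
    void parse(SketchDoc doc, Map<String, String> metadata, SketchContext context);
}

// Hypothetical stand-in for a document; a container is a document with children.
class SketchDoc {
    final String name;
    final List<SketchDoc> children = new ArrayList<>();

    SketchDoc(String name, SketchDoc... kids) {
        this.name = name;
        for (SketchDoc kid : kids) {
            children.add(kid);
        }
    }
}

// Hypothetical stand-in for ParseContext: a registry keyed by type.
class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }
}

// Pretend container parser: fills in one metadata field, then hands each
// nested document to whatever parser the context holds, if any.
class SketchContainerParser implements SketchParser {
    @Override
    public void parse(SketchDoc doc, Map<String, String> metadata, SketchContext context) {
        metadata.put("resourceName", doc.name);
        SketchParser nested = context.get(SketchParser.class);
        if (nested == null) {
            return; // recursive parsing was not enabled
        }
        for (SketchDoc child : doc.children) {
            // Each nested document gets a fresh, empty metadata object.
            nested.parse(child, new LinkedHashMap<String, String>(), context);
        }
    }
}

// Decorator: delegates the real work, then looks at the filled-in metadata.
class SketchRecursiveMetadataParser implements SketchParser {
    private final SketchParser delegate;
    final List<Map<String, String>> seen = new ArrayList<>();

    SketchRecursiveMetadataParser(SketchParser delegate) {
        this.delegate = delegate;
    }

    @Override
    public void parse(SketchDoc doc, Map<String, String> metadata, SketchContext context) {
        // Before delegating, metadata is still empty.
        delegate.parse(doc, metadata, context);
        // After delegating, metadata holds whatever the delegate found.
        seen.add(metadata);
        System.out.println(metadata);
    }
}
```

To use the sketch, wrap a SketchContainerParser in a SketchRecursiveMetadataParser, register the wrapper with context.set(SketchParser.class, wrapper), and call wrapper.parse on a SketchDoc that has children. Because the wrapper records metadata only after the delegate returns, nested documents are reported before the container that holds them.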

What's Missing from Jukka's Example?

Jukka's example shows how you can get metadata for a nested document, but it doesn't show how you can get that metadata along with the text for that nested document.

If you only need the metadata, then this example is great. If instead you want to extract complete documents from containers including both text and metadata, then you need to do more.

Extracting Text is an Exercise for the Reader

A way to match up the metadata for a document with its text requires you to write your own ContentHandler that is able to identify text for individual nested documents. Since this page is called RecursiveMetadata and not HowToGetASeparateTextBodyForEachNestedDocument, no details are offered for how to implement that ContentHandler. While I was hoping there would be help for this in Tika's library, after quickly scanning all the handlers I could find in http://tika.apache.org/0.7/api/ I didn't see any that offered easy ways to get to the text for each contained document as a separate set of text.

Until someone writes a page on how to get the text for each separate document in a container as a separate body of text, writing this ContentHandler is an exercise left to the reader. I have written a ContentHandler that does this for the kinds of files and containers I have tested with, and if no one comes forward with an easy way to write this kind of ContentHandler, my experiences might become the start of yet another wiki page.
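As a starting point for that exercise, here is a minimal sketch of one possible approach, using only the SAX classes that ship with the JDK. It is not the handler mentioned above. It rests on a loud assumption: that each nested document arrives wrapped in a div element whose class attribute is package-entry. Check that against the events your Tika version actually emits before relying on it. The class name PerDocumentTextHandler is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects character data into one buffer per nested document.
class PerDocumentTextHandler extends DefaultHandler {
    // ASSUMPTION: nested documents are wrapped in <div class="package-entry">.
    static final String ENTRY_CLASS = "package-entry";

    // One open buffer per document currently being read; top is the innermost.
    private final Deque<StringBuilder> openTexts = new ArrayDeque<>();
    // For every open div, whether it started a nested document.
    private final Deque<Boolean> divOpensEntry = new ArrayDeque<>();
    // Finished texts, innermost documents first, outermost document last.
    final List<String> texts = new ArrayList<>();

    PerDocumentTextHandler() {
        openTexts.push(new StringBuilder()); // buffer for the outermost document
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(localName)) {
            return;
        }
        boolean entry = ENTRY_CLASS.equals(atts.getValue("class"));
        divOpensEntry.push(entry);
        if (entry) {
            openTexts.push(new StringBuilder());
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!"div".equals(localName)) {
            return;
        }
        // Only a div that opened a nested document closes one.
        if (divOpensEntry.pop()) {
            texts.add(openTexts.pop().toString().trim());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        openTexts.peek().append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // Whatever was not inside a nested document belongs to the outermost one.
        texts.add(openTexts.pop().toString().trim());
    }
}
```

The stack is what keeps containers inside containers separate: text is always appended to the innermost open document, and a document's text is finalized when its wrapping div closes. The sketch matches on localName, so it expects namespace-aware events.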

How to get Metadata with Text

Assuming that you have written your own ContentHandler, and that ContentHandler can be used to get the text for individual documents in a container, how can you associate the metadata for a document with that document's text?

The basic idea is that if you have gone to the trouble of implementing a ContentHandler capable of identifying text for each individual nested document, then if you can also get notifications for when a subdocument with separate metadata starts and ends, you can keep track of this metadata and associate it with the text you extract.
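The bookkeeping behind that idea can be sketched in plain Java. This is a hedged illustration, not code from Tika: NestedDocumentCollector and ExtractedDocument are hypothetical names, metadata is modeled as a plain Map, and the sketch assumes that a decorating parser calls beginDocument before delegating and endDocument after the delegate returns, while your ContentHandler forwards character data to text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical result type: one nested document with its own metadata and text.
class ExtractedDocument {
    final Map<String, String> metadata;
    final String text;

    ExtractedDocument(Map<String, String> metadata, String text) {
        this.metadata = metadata;
        this.text = text;
    }
}

// Hypothetical collector that pairs each document's text with its metadata.
class NestedDocumentCollector {
    // One text buffer per document currently being parsed; top is the innermost.
    private final Deque<StringBuilder> openTexts = new ArrayDeque<>();
    // Completed documents, nested ones before the container that holds them.
    final List<ExtractedDocument> documents = new ArrayList<>();

    // ASSUMPTION: called by the decorating parser just before it delegates.
    void beginDocument() {
        openTexts.push(new StringBuilder());
    }

    // ASSUMPTION: called by your ContentHandler for each run of characters.
    void text(String chunk) {
        if (!openTexts.isEmpty()) {
            openTexts.peek().append(chunk);
        }
    }

    // ASSUMPTION: called by the decorating parser after the delegate returns,
    // when the metadata object has been filled in.
    void endDocument(Map<String, String> metadata) {
        String text = openTexts.pop().toString();
        // Copy the metadata so later changes by the caller do not leak in.
        documents.add(new ExtractedDocument(new LinkedHashMap<String, String>(metadata), text));
    }
}
```

In a real setup, the parse method of the decorating parser would wrap its super.parse call between beginDocument and endDocument(metadata), and both the parser and the ContentHandler would share one collector instance. Text that belongs to a container itself, rather than to one of its nested documents, ends up on the container's own entry.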

Hopefully this example offers an idea of what you would have to do to get both the text and metadata for a nested document.

A Possibly Misplaced or Inappropriate Wish for Tika

While it is possible to get the text for each nested document in a container using Tika, and it is possible to get the metadata for each nested document, it would be nice if Tika offered an easy way to get both the text and the metadata for a nested document together as a single entity.

Tika seems to want to turn any file you give it into a single XHTML document, or into the stream of ContentHandler events you would get if you were parsing that single XHTML document. Containers that aren't logically a single document don't live comfortably inside this single-document model (containers that are logically single documents include OLE2 and .xlsx files). Because Tika does a great job of identifying and parsing a wide variety of container types, and because Tika is being extended to identify when a container is logically a single document and when it is logically many separate documents, it would be nice if there were a better way for Tika to return the metadata and text for containers that are logically many separate documents.