<span style="color:#ff0000">'''This pipelet is not yet available in our repository. As soon as the new Aperture release is available we will submit appropriate CQs and hopefully get permission to use it in our project.'''</span>

<span style="color:#ff0000">'''This pipelet is not available as we have switched from Aperture to Tika.'''</span>

It converts the document's content in ''AttachmentContent'' and stores the plain text result in ''AttachmentText''. The optional MimeType of ''AttachmentContent'' in ''AttachmentMimeType'' is used for conversion. If no MimeType is provided a MimeType identification is done inside the Pipelet using a '''MimeTypeIdentifier''' service.

This pipelet converts various document formats (such as PDF, Microsoft Office formats, OpenOffice formats, etc.) to plain text using [[SMILA/Glossary#Aperture|Aperture]] technology: binary attachment content can thus be converted to plain text and stored in an attribute. In addition, metadata properties of the document (such as title, author, etc.) can be extracted and written to record attributes. The optional MimeType of the document in ''MimeTypeAttribute'' is used for conversion. If no MimeType is provided, MimeType identification is done inside the pipelet using a <tt>MimeTypeIdentifier</tt> service.

The AperturePipelet supports the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in jobmanager workflows, records causing errors are dropped.

==== Supported document types ====

By default, SMILA contains only a subset of Aperture that supports the conversion of:
* plain text documents (of course ;-)
* XML documents
* RTF documents
* Adobe PDF documents
* Microsoft Office documents, both the old formats (doc, xls, ppt) and the new OOXML formats (docx, xlsx, pptx)
* Microsoft Visio documents
* OpenOffice documents (OpenDocument formats)

Note: The HTML extractor is currently not included because it depends on an HTML parser implementation licensed under the LGPL, which we are not allowed to redistribute. See below for hints on how to add Aperture extractors for further formats.

=== Configuration ===

{|
!Property!!Type!!Read Type!!Description
|-
|''inputType''||String : ''ATTACHMENT, ATTRIBUTE''||runtime||selects if the input is found in an attachment or attribute of the record. Usually it doesn't make sense to use "ATTRIBUTE" here because the documents to convert are binary content.
|-
|''outputType''||String : ''ATTACHMENT, ATTRIBUTE''||runtime||selects if output should be stored in an attachment or attribute of the record
|-
|''inputName''||String||runtime||name of input attachment or path to input attribute (process a String literal of attribute)
|-
|''outputName''||String||runtime||name of output attachment or path to output attribute for plain text (store result as String literal of attribute)
|-
|''ExtractProperties''||String||runtime||Specifies which metadata properties reported by Aperture for the document should be written to which record attribute. See below for details.
|-
|''MimeTypeAttribute''||String||runtime||Parameter referencing the attribute that contains the mime type of the document. The parameter (or the attribute) may be unset (null), in which case mime type detection is performed. If the attribute was not set, it is set to the detected mime type while the record is processed.
|-
|''FileExtensionAttribute''||String||runtime||Parameter referencing the attribute that contains the file extension of the file that was the source of the attachment content. If the mime type attribute is not specified or does not have a value, the file extension can be used to improve the automatic mime type detection. If not specified, the mime type detection is based on the attachment content only.
|}

Note that all properties are required and must be provided.

==== Configuring the Property Mapping ====

In addition to the plain text content, Aperture can extract metadata properties from documents, such as the title, author, publisher, dates of publication, etc. The names of these properties are URIs. Aperture uses URIs defined by

and probably there are others which we just did not discover yet. What is actually extracted depends very much on the documents. To check with your own documents, you can download one of the "aperture-eclipse-1.4.0" archives from [http://sourceforge.net/projects/aperture/files/Aperture/1.4.0/], unpack it and start <code>bin/fileinspector.(sh|bat)</code>. Open a document with it and you will see an RDF representation of the extracted metadata.

To store such metadata properties in SMILA records, you must specify the URLs of the properties you want to store in the ''ExtractProperties'' parameter. Usually this parameter contains a sequence of string values. Each string value can have one of the following formats:
* <code><Property-URL></code>: Add the values of this property to an attribute with the same name.
* <code><Property-URL>-><Attribute-Name></code>: Add the values of the property to the attribute with the given name.
* <code><Property-URL>->><Attribute-Name></code>: Store the values of the property in the attribute with the given name, removing existing values first.
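
The three entry formats can be told apart by looking for the arrow. The following is only a minimal sketch of that idea, not the pipelet's actual parsing code; the names <code>Mapping</code> and <code>parse_mapping</code> are hypothetical and introduced purely for illustration.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    property_uri: str  # property name as written in the configuration
    attribute: str     # target record attribute
    overwrite: bool    # True for "->>": remove existing values first


def parse_mapping(entry: str) -> Mapping:
    """Parse one ExtractProperties entry (hypothetical helper, not SMILA code)."""
    entry = entry.strip()
    # Check "->>" before "->", because "->" is a prefix of "->>".
    for arrow, overwrite in (("->>", True), ("->", False)):
        if arrow in entry:
            prop, attr = entry.split(arrow, 1)
            return Mapping(prop.strip(), attr.strip(), overwrite)
    # No arrow: the attribute gets the same name as the property.
    return Mapping(entry, entry, False)
```

Testing for <code>->></code> first matters: a naive split on <code>-></code> would leave a stray <code>></code> at the start of the attribute name.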

To improve readability, it is possible to abbreviate the property URLs by using namespace prefixes. The available prefixes are specified in [https://dev.eclipse.org/svnroot/rt/org.eclipse.smila/trunk/core/org.eclipse.smila.aperture/namespaces.properties namespaces.properties] in the <code>org.eclipse.smila.aperture</code> bundle. To add namespaces to this file, extend it and put it in the configuration area in directory <code>org.eclipse.smila.aperture</code>. Using the predefined namespaces you can use, for example:

If you use namespace abbreviations to specify the properties to extract, but don't specify target attributes, the target attributes will be the ''abbreviated'' URIs.
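
The effect of prefix abbreviation can be sketched as follows. This is an illustration under assumptions: the prefix table here is hypothetical (the real one is read from <code>namespaces.properties</code>), and the helper names <code>expand</code> and <code>resolve</code> are not part of SMILA. The <code>dc</code> entry uses the standard Dublin Core elements namespace; whether SMILA maps <code>dc</code> to exactly this URI is an assumption.

```python
# Hypothetical prefix table; the real one lives in namespaces.properties.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",  # assumed mapping
}


def expand(name, namespaces=NAMESPACES):
    """Expand "prefix:local" to a full URI if the prefix is known."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return name  # already a full URI, or an unknown prefix


def resolve(name):
    """Return (URI to look up, default target attribute).

    Without an explicit target, the attribute keeps the name as written,
    so an abbreviated property yields an abbreviated attribute name.
    """
    return expand(name), name
```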

If the property value reported by Aperture is a resource, the pipelet tries to find a display name for it. It checks the following properties in this order:
* <tt>nco:fullname</tt>
* <tt>nie:title</tt>
* <tt>nao:prefLabel</tt>
* <tt>rdfs:label</tt>
If none of them has a value for the resource, the URI of the resource is used as the attribute value.
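
The lookup order described above amounts to a first-match search with the resource URI as the fallback. A minimal sketch, assuming the properties of the resource are available as a plain dictionary keyed by abbreviated property name (this data layout and the function name <code>display_name</code> are assumptions for illustration):

```python
# Properties checked for a display name, in the documented order.
DISPLAY_NAME_PROPERTIES = ("nco:fullname", "nie:title", "nao:prefLabel", "rdfs:label")


def display_name(resource_uri, properties):
    """Return the first non-empty display property, else the resource URI."""
    for key in DISPLAY_NAME_PROPERTIES:
        value = properties.get(key)
        if value:
            return value
    return resource_uri
```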

It is possible to specify the complete mapping in a single string value. To do this, concatenate the single values from the sequence using a semicolon ";" as the separator. This makes it easier to use the AperturePipelet in the [[SMILA/Documentation/Worker/PipeletProcessorWorker | PipeletProcessorWorker]], which currently allows only simple string parameters for pipelet configuration.
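
Converting between the sequence form and the single-string form is a plain join and split on the semicolon. The sketch below only illustrates the format; the function names and the example property <code>dc:title</code> are hypothetical.

```python
def join_mappings(entries):
    """Concatenate mapping entries into one string parameter."""
    return ";".join(entries)


def split_mappings(value):
    """Split a single-string mapping back into entries, skipping empty parts."""
    return [part.strip() for part in value.split(";") if part.strip()]


# Example entries (illustrative only).
entries = ["dc:creator->Author", "dc:title->>Title"]
single = join_mappings(entries)
```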

In any case, the resulting attribute is
* a single <tt>Value</tt>, if only one value has been extracted and the value is not appended to previously existing values
* an <tt>AnySeq</tt> containing all values, if more than one value has been extracted or new values are appended to existing values.
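
The rule for the resulting attribute can be modelled with a plain dictionary standing in for the record and a Python list standing in for <tt>AnySeq</tt>. This is a sketch of the documented behaviour under those assumptions, not the pipelet's implementation; <code>store_values</code> is a hypothetical name.

```python
def store_values(record, attribute, values, overwrite=False):
    """Write extracted values to a record attribute.

    A single value is stored as a scalar; several values, or values
    appended to existing ones, are stored as a list.
    """
    existing = [] if overwrite else record.get(attribute, [])
    if not isinstance(existing, list):
        existing = [existing]  # a previously stored single value
    merged = existing + list(values)
    if not merged:
        return  # nothing extracted: this sketch leaves the record untouched
    record[attribute] = merged[0] if len(merged) == 1 else merged
```

What the pipelet does when overwriting is requested but nothing was extracted is not stated in this page; the sketch simply leaves the attribute as it is in that case.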

==== Example ====

For example, for a Word document with the value "ACME" as company and "John Doe" as creator, the resulting record would contain the plain text in the attribute <tt>Text</tt>, the value <tt>ACME</tt> in the attribute <tt><nowiki>http://schemas.openxmlformats.org/officeDocument/2006/extended-properties/Company</nowiki></tt>, and the value <tt>John Doe</tt> in the attribute <tt>dc:creator</tt>.

SMILA does not contain the complete Aperture distribution, because some converters need third-party libraries with problematic licenses that we are not allowed to distribute. However, it should be easy to include those parts of Aperture in your SMILA installation yourself:
* Download one of the <tt>aperture-eclipse-1.4.0</tt> archives from [http://sourceforge.net/projects/aperture/files/Aperture/1.4.0/].
* Unpack it.
* Copy the required bundles from <code>lib/aperture-libs</code> and <code>lib/required-libs</code> to <code>SMILA/plugins</code>.
* Add the new extractor bundles to the <code>config.ini</code> to activate them at system start.

For example, to add the HTML extractor, you must add the following bundles from Aperture to SMILA:

Bundle: org.eclipse.smila.aperture.pipelets.AperturePipelet

Description

This pipelet converts various document formats (such as PDF, Microsoft Office formats, OpenOffice formats, etc.) to plain text using Aperture technology: binary attachment content can thus be converted to plain text and stored in an attribute. In addition, metadata properties of the document (such as title, author, etc.) can be extracted and written to record attributes. The optional MimeType of the document in MimeTypeAttribute is used for conversion. If no MimeType is provided, MimeType identification is done inside the pipelet using a MimeTypeIdentifier service.

Supported document types

By default, SMILA contains only a subset of Aperture that supports the conversion of:

* plain text documents (of course ;-)
* XML documents
* RTF documents
* Adobe PDF documents
* Microsoft Office documents, both the old formats (doc, xls, ppt) and the new OOXML formats (docx, xlsx, pptx)
* Microsoft Visio documents
* OpenOffice documents (OpenDocument formats)

Note: The HTML extractor is currently not included because it depends on an HTML parser implementation licensed under the LGPL, which we are not allowed to redistribute. See below for hints on how to add Aperture extractors for further formats.

Configuration

{|
!Property!!Type!!Read Type!!Description
|-
|inputType||String : ATTACHMENT, ATTRIBUTE||runtime||selects if the input is found in an attachment or attribute of the record. Usually it doesn't make sense to use "ATTRIBUTE" here because the documents to convert are binary content.
|-
|outputType||String : ATTACHMENT, ATTRIBUTE||runtime||selects if output should be stored in an attachment or attribute of the record
|-
|inputName||String||runtime||name of input attachment or path to input attribute (process a String literal of attribute)
|-
|outputName||String||runtime||name of output attachment or path to output attribute for plain text (store result as String literal of attribute)
|-
|ExtractProperties||String||runtime||Specifies which metadata properties reported by Aperture for the document should be written to which record attribute. See below for details.
|-
|MimeTypeAttribute||String||runtime||Parameter referencing the attribute that contains the mime type of the document. The parameter (or the attribute) may be unset (null), in which case mime type detection is performed. If the attribute was not set, it is set to the detected mime type while the record is processed.
|-
|FileExtensionAttribute||String||runtime||Parameter referencing the attribute that contains the file extension of the file that was the source of the attachment content. If the mime type attribute is not specified or does not have a value, the file extension can be used to improve the automatic mime type detection. If not specified, the mime type detection is based on the attachment content only.
|}

Note that all properties are required and must be provided.

Configuring the Property Mapping

In addition to the plain text content, Aperture can extract metadata properties from documents like the title, author, publisher, dates of publication etc, ... The names of these properties are URIs. Aperture uses URIs defined by

and probably there are others which we just did not discover yet. It depends very much on the documents what is actually extracted. To check with your documents you can download one of the "aperture-eclipse-1.4.0" archives from [2], unpack it and start bin/fileinspector.(sh|bat). Open a document with it and you will see an RDF representation of the extracted metadata.

To store such metadata properties in SMILA records, you must specify the URLs of the properties you want to store in the ExtractProperties parameter. Usually this parameter contains a sequence of string values. Each string value can have one of the following formats:
* <code><Property-URL></code>: Add the values of this property to an attribute with the same name.
* <code><Property-URL>-><Attribute-Name></code>: Add the values of the property to the attribute with the given name.
* <code><Property-URL>->><Attribute-Name></code>: Store the values of the property in the attribute with the given name, removing existing values first.

To improve readability, it is possible to abbreviate the property URLs by using namespace prefixes. The available prefixes are specified in namespaces.properties in the org.eclipse.smila.aperture bundle. To add namespaces to this file, extend it and put it in the configuration area in directory org.eclipse.smila.aperture. Using the predefined namespaces you can use, for example:

If you use namespace abbreviations to specify the properties to extract, but don't specify target attributes, the target attributes will be the abbreviated URIs.

If the property value reported by Aperture is a resource, the pipelet tries to find a display name for it. It checks the following properties in this order:

* nco:fullname
* nie:title
* nao:prefLabel
* rdfs:label

If none of them has a value for the resource, the URI of the resource is used as the attribute value.

It is possible to specify the complete mapping in a single string value. To do this, concatenate the single values from the sequence using a semicolon ";" as the separator. This makes it easier to use the AperturePipelet in the PipeletProcessorWorker, which currently allows only simple string parameters for pipelet configuration.

In any case, the resulting attribute is

* a single Value, if only one value has been extracted and the value is not appended to previously existing values
* an AnySeq containing all values, if more than one value has been extracted or new values are appended to existing values.

Example

The following example shows how to configure the pipelet to extract the text from the attachment called Content and store the extracted text in the attribute Text. Additionally, the Company, Manager and Creator properties, if present in the document, will be stored in attributes named after their property URIs.

For example, for a Word document with the value "ACME" as company and "John Doe" as creator, the resulting record would contain the plain text in the attribute Text, the value ACME in the attribute http://schemas.openxmlformats.org/officeDocument/2006/extended-properties/Company, and the value John Doe in the attribute dc:creator.

Extending Aperture

SMILA does not contain the complete Aperture distribution, because some converters need third party libraries with problematic licenses that we are not allowed to distribute. However, it should be easy to include those parts of Aperture into your SMILA installation yourself: Just