Revision as of 04:15, 20 March 2013

This page describes the SMILA pipelets provided by bundle org.eclipse.smila.processing.pipelets.

org.eclipse.smila.processing.pipelets.FilterPipelet

Copies to the result only the IDs of those records in which a configurable single-valued attribute matches a configurable regular expression. This is useful for conditional processing while still pushing multiple records through the pipeline in a single request: instead of using BPEL conditions, use a FilterPipelet to select only the matching records into a new variable and use this variable as the input variable for the next pipelets. You can still use the original BPEL variable in the BPEL <reply> activity at the end of the pipeline to return all records as the final result.

Configuration

The configuration properties are read either from the _parameters attribute of each record or from the pipelet configuration.

{| class="wikitable"
! Property !! Type !! Read Type !! Description
|-
| filterAttribute || A string value || runtime || The name of the attribute to match
|-
| filterExpression || A string value || runtime || The regular expression to match the attribute value against
|}
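The filtering behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the pipelet's implementation: records are modelled as plain dictionaries, the function name `filter_record_ids` is hypothetical, and the sketch assumes that the expression must match the whole attribute value.

```python
import re


def filter_record_ids(records, filter_attribute, filter_expression):
    """Return the IDs of records whose single-valued attribute matches the regex."""
    pattern = re.compile(filter_expression)
    matching_ids = []
    for record in records:
        value = record.get(filter_attribute)
        # Records without the attribute, or with a non-string value, are skipped.
        # Assumption: the expression has to match the complete value.
        if isinstance(value, str) and pattern.fullmatch(value):
            matching_ids.append(record["_recordid"])
    return matching_ids


records = [
    {"_recordid": "doc1", "MimeType": "text/html"},
    {"_recordid": "doc2", "MimeType": "image/png"},
    {"_recordid": "doc3", "MimeType": "text/plain"},
    {"_recordid": "doc4"},
]
text_records = filter_record_ids(records, "MimeType", "text/.+")
```

Here `text_records` plays the role of the new BPEL variable, while `records` remains available for the final reply.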

Example

To get only those records in the textRecords BPEL variable that have a MimeType starting with "text", something like this could be used:

Configuration

Defines whether the HTML input is found in an attachment or in an attribute of the record

{| class="wikitable"
! Property !! Type !! Read Type !! Description
|-
| outputType || String: ATTACHMENT, ATTRIBUTE || runtime || Defines whether the plain text should be stored in an attachment or in an attribute of the record
|-
| inputName || String || runtime || Name of input attachment or path to input attribute (process literals of attribute)
|-
| outputName || String || runtime || Name of output attachment or path to output attribute for plain text (store result as literals of attribute)
|-
| defaultEncoding || String || runtime || Optional, default encoding to apply to documents when not specified in the documents themselves
|-
| removeContentTags || String || runtime || Comma-separated list of HTML tags (case-insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to "applet,frame,object,script,style". If the value is set, you must add the default tags explicitly to have their contents removed, too.
|-
| meta:<name> || String: attribute path || init || Store the content of the <META> tag with name="<name>" (case-insensitive) to the attribute named as the value of the property. E.g. a property named "meta:author" with value "authors" causes the content attributes of <META name="author" content="..."> tags to be stored in the attribute authors of the respective record.
|-
| tag:title || String: attribute path || init || Store the content of the <TITLE> tag to the attribute named as the value of the property.
|}

Example

This configuration extracts plain text from the HTML document in attachment "html" and stores the results to the attribute "text". It removes the complete content of heading tags <h1>, ..., <h4>. In addition to that, it looks for <meta> tags with names "author" and "keywords" and stores their contents in attributes "authors" and "keywords", respectively:

org.eclipse.smila.processing.pipelets.CopyPipelet

Description

This pipelet can be used to copy or move attribute values to other attributes or to copy or move a string value between attributes and/or attachments. It supports two execution modes:

COPY: copy the value from the input attribute/attachment to the output attribute/attachment

MOVE: same as COPY, but after that delete the value from the input attribute/attachment

When an attribute is copied to another attribute, the type remains the same. When copying an attachment to an attribute, a string value is created by assuming that the attachment is text in UTF-8 encoding. When copying an attribute value to an attachment, the attribute must have a single value, which is interpreted as a string value and converted to a byte array using UTF-8 encoding.

Configuration

{| class="wikitable"
! Property !! Type !! Read Type !! Description
|-
| inputType || String: ATTACHMENT, ATTRIBUTE || runtime || Selects whether the input is found in an attachment or in an attribute of the record
|-
| outputType || String: ATTACHMENT, ATTRIBUTE || runtime || Selects whether the output should be stored in an attachment or in an attribute of the record
|-
| inputName || String || runtime || Name of the input attachment or input attribute
|-
| outputName || String || runtime || Name of the output attachment or output attribute
|-
| mode || String: COPY, MOVE || runtime || Execution mode: copy the value or move (copy and delete) the value. Default is COPY.
|}
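The COPY and MOVE semantics and the UTF-8 conversion rules can be illustrated with a small Python sketch. It is not the pipelet's code: the record and its attachments are modelled as two dictionaries, and the function name `copy_value` is hypothetical.

```python
def copy_value(record, attachments, input_type, input_name,
               output_type, output_name, mode="COPY"):
    """Copy or move a value between attributes and attachments of one record."""
    source = attachments if input_type == "ATTACHMENT" else record
    target = attachments if output_type == "ATTACHMENT" else record
    value = source[input_name]

    if input_type == "ATTACHMENT" and output_type == "ATTRIBUTE":
        # Attachment bytes are assumed to be UTF-8 encoded text.
        value = value.decode("utf-8")
    elif input_type == "ATTRIBUTE" and output_type == "ATTACHMENT":
        # Only a single value can be turned into attachment bytes.
        if isinstance(value, (list, dict)):
            raise ValueError("attribute must be single-valued")
        value = str(value).encode("utf-8")

    target[output_name] = value

    # MOVE deletes the input afterwards (unless input and output are identical).
    if mode == "MOVE" and (source is not target or input_name != output_name):
        del source[input_name]


record = {}
attachments = {"Content": "Grüße aus SMILA".encode("utf-8")}
copy_value(record, attachments, "ATTACHMENT", "Content", "ATTRIBUTE", "TextContent")
```

After the call, `record["TextContent"]` holds the decoded string and the attachment is still present because the default mode is COPY.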

Example

This configuration shows how to copy the value of attachment 'Content' into the attribute 'TextContent':

org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet

Description

Extracts literal values from an attribute that has a nested map. The attributes in the nested map can have nested maps themselves. To address an attribute in the nested structure, a path needs to be specified. The pipelet supports different execution modes:

FIRST: selects only the first literal of the specified attribute

LAST: selects only the last literal of the specified attribute

ALL_AS_LIST: selects all literal values of the specified attribute and returns a list

ALL_AS_ONE: selects all literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)

This pipelet works only on attributes, not on attachments!

Note:
If the maps on the path are nested in sequences, the pipelet uses the first element of such a sequence.

Configuration

{| class="wikitable"
! Property !! Type !! Read Type !! Description
|-
| inputPath || String || runtime || The path to the input attribute with Literals
|-
| outputPath || String || runtime || The name of the attribute to store the extracted value(s) as Literals in (currently not a path, only a top-level attribute)
|-
| mode || String: FIRST, LAST, ALL_AS_LIST, ALL_AS_ONE || runtime || Execution mode. See above for details.
|-
| separator || String || runtime || The separation string used for mode ALL_AS_ONE. Default is a blank.
|}
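The four execution modes and the "first element of a sequence" rule can be sketched as follows. This is an illustration under assumptions: records are nested Python dictionaries and lists, the path is passed as a list of keys because the pipelet's path syntax is not shown here, and the function name `extract_literals` is hypothetical.

```python
def extract_literals(record, path, mode="FIRST", separator=" "):
    """Walk a nested record along `path` and select literals according to `mode`."""
    node = record
    for key in path[:-1]:
        node = node[key]
        # Maps nested in a sequence: only the first element is followed.
        if isinstance(node, list):
            node = node[0]

    value = node[path[-1]]
    literals = value if isinstance(value, list) else [value]

    if mode == "FIRST":
        return literals[0]
    if mode == "LAST":
        return literals[-1]
    if mode == "ALL_AS_LIST":
        return list(literals)
    if mode == "ALL_AS_ONE":
        return separator.join(str(v) for v in literals)
    raise ValueError("unknown mode: " + mode)


record = {"Contents": [{"Value": ["Hello", "World"]}, {"Value": ["ignored"]}]}
text = extract_literals(record, ["Contents", "Value"], mode="ALL_AS_ONE")
```

Only the first map in the `Contents` sequence is visited, so the second map's value does not appear in `text`.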

Example

This configuration can be applied to records provided by the FeedAgent. It shows how to access the subattribute 'Value' of attribute 'Contents', concatenating all values to one:

org.eclipse.smila.processing.pipelets.ReplacePipelet

Description

Searches for one or more patterns in the literal value of an attribute and substitutes the found occurrences by the configured replacements.

You can choose from different matching types:

entity: Every pattern is matched against the whole attribute value (with respect to the ignoreCase property) and the first matching pattern defines the new value of the attribute. If no pattern matches, the result is the current value of the attribute.

substring: All patterns that are part of the attribute value are replaced.
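The difference between the two matching types can be shown with a short Python sketch. It assumes that patterns are literal strings rather than regular expressions and that the mapping is an ordered pattern-to-replacement dictionary; the function name `replace_value` is hypothetical.

```python
import re


def replace_value(value, mapping, match_type="entity", ignore_case=False):
    """Apply pattern -> replacement pairs to a single attribute value."""
    flags = re.IGNORECASE if ignore_case else 0

    if match_type == "entity":
        # The first pattern equal to the whole value defines the new value.
        for pattern, replacement in mapping.items():
            if re.fullmatch(re.escape(pattern), value, flags):
                return replacement
        return value  # no pattern matched: keep the current value

    if match_type == "substring":
        # Every occurrence of every pattern inside the value is replaced.
        for pattern, replacement in mapping.items():
            value = re.sub(re.escape(pattern), lambda _m, r=replacement: r,
                           value, flags=flags)
        return value

    raise ValueError("unknown matching type: " + match_type)


language = replace_value("DE", {"de": "German", "en": "English"},
                         match_type="entity", ignore_case=True)
joined = replace_value("a&b&c", {"&": " and "}, match_type="substring")
```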

results: a slightly modified version of a result collector that provides methods to add a new record id to the list of result ids (results.addResult('...id...')) and to drop the current record from the same list (results.excludeCurrentRecord())

parameterAccessor: the ParameterAccessor instance for access to the configuration (e.g. parameterAccessor.getParameterAny("configMap").asMap().getLongValue("longValue")).

Please be aware that the intention of this pipelet is to write pipelines fast, but not to write fast pipelines - the script is parsed for every record. Don't use it for production environments where performance matters, but use it to develop an algorithm that you can put into your own pipelet.

Configuration

{| class="wikitable"
! Property !! Type !! Read Type !! Description
|-
| type || String || init || The MIME type of the scripting language, defaults to "text/javascript"
|-
| scriptFile || String || runtime || The path of the file that contains the script. Modifications of this file are observed on every execution of the pipelet.
|-
| script || String || init || The "inline" script, required unless scriptFile is specified (ignored in that case)
|-
| resultAttribute || String || runtime || The name of an attribute that will receive the result of the script (usually the result of the last expression)
|}

Examples

This configuration can be used to concatenate the values of two attributes and save the result into a third one:

org.eclipse.smila.processing.pipelets.ExecPipelet

Description

Executes an external program for each record.

This pipelet may be used to integrate native programs into the pipeline.

Attention: This pipelet may lead to security issues! Please be aware that although one can not change the executed command during runtime (as this parameter is only evaluated at initialization time), it is possible to change the arguments and input of the command using values in the processed record. Every "pipeline developer" should ensure that only arguments in the expected value range are processed (especially if the program is allowing files from the file system as arguments).

Configuration

{| class="wikitable"
! Property !! Type !! Read Type !! Description
|-
| command || String || init || The program to execute (including its path in the file system).
|-
| directory || String || runtime || The (optional) working directory for the command. The SMILA directory is used if not given.
|-
| parameters || Sequence of strings || runtime || The optional parameters given to the program (ignored if the attribute named by parametersAttribute exists).
|-
| parametersAttribute || String || runtime || The optional name of the attribute that contains the sequence of parameters given to the program.
|-
| inputAttachment || String || runtime || The optional name of the attachment that contains the bytes to send as input for the program.
|-
| outputAttachment || String || runtime || The optional name of the attachment that is filled with the standard output of the program.
|-
| exitCodeAttribute || String || runtime || The name of the attribute that is filled with the exit code of the program.
|-
| errorAttachment || String || runtime || The optional name of the attachment that is filled with the error output of the program.
|-
| failOnError || Either a boolean or a sequence of strings || runtime || Whether to mark a record as failed if the program returns an error code. Given either as a sequence of exit code ranges or as a boolean, where "true" means that every exit code except 0 is an error code. Defaults to false.
|}
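The two forms of failOnError can be illustrated with the following Python sketch. The notation for exit code ranges is not specified here, so the sketch assumes hypothetical strings of the form "low-high" or a single number, with non-negative codes only; the function name `is_failure` is likewise hypothetical.

```python
def is_failure(exit_code, fail_on_error=False):
    """Decide whether an exit code marks the record as failed."""
    if isinstance(fail_on_error, bool):
        # Boolean form: "true" treats everything except 0 as an error.
        return fail_on_error and exit_code != 0

    # Sequence form: assumed range notation "low-high" or a single code "n".
    for code_range in fail_on_error:
        low, sep, high = str(code_range).partition("-")
        upper = int(high) if sep else int(low)
        if int(low) <= exit_code <= upper:
            return True
    return False


default_result = is_failure(1)                 # failOnError defaults to false
strict_result = is_failure(1, True)            # any non-zero code is an error
ranged_result = is_failure(3, ["2-5", "9"])    # 3 lies within the range 2-5
```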

Examples

This configuration can be used to execute FFMPEG for transformation of an MP3 input file into a WAV output file:

org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet

Description

This pipelet is used to identify the MIME type of a document.
It uses an org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.

Configuration

The pipelet is configured using the <configuration> section inside the <invokePipelet> activity of the corresponding BPEL file. It provides the following properties:

org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet

Description

This pipelet identifies the language of textual input and stores the returned ISO 639 language code to some target attribute. It uses an org.eclipse.smila.common.language.LanguageIdentifier service to perform the actual identification. If the identification does not return a language, the specified DefaultLanguage (or DefaultAlternativeName) is returned. If no defaults are specified, no value is set.

The pipelet returns the detected language as an ISO 639 code. Where you need special language tags in your application, the pipelet is able to produce an alternative language code according to a configurable mapping. To define such a mapping, create the file SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties. The following shows an exemplary mapping:

Configuration

The pipelet is configured using the <configuration> section inside the <invokePipelet> activity of the corresponding BPEL file. It provides the following properties:

{| class="wikitable"
! Property !! Type !! Read Type !! Usage !! Description
|-
| ContentAttribute || String || runtime || Required || Name of the attribute containing the text whose language should be identified
|-
| LanguageAttribute || String || runtime || Optional || Name of the attribute to store the code of the identified language to
|-
| DefaultLanguage || String || runtime || Optional || Language code to set if no language could be detected. If not set and no language could be identified, the LanguageAttribute attribute remains empty.
|-
| AlternativeNameAttribute || String || runtime || Optional || Name of the attribute to store the alternative language code of the identified language to. The mapping defining this alternative code must be located in SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties (see above).
|-
| DefaultAlternativeName || String || runtime || Optional || Alternative language code to set if no language could be detected. If not set and no language could be identified, the AlternativeNameAttribute attribute remains empty.
|-
| UseCertainLanguagesOnly || Boolean || runtime || Optional || Boolean flag indicating whether to apply only those languages that were identified with a reasonable certainty (true) or all (false). Default is false.
|}
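How the properties interact when a language is or is not detected can be sketched in Python. This is a simplified model and not the pipelet's code: detection itself is left out, the mapping is a plain dictionary standing in for languageMapping.properties, and the function name `store_language` is hypothetical. The sketch also assumes that a detected language without a mapping entry leaves the alternative attribute unset.

```python
def store_language(record, detected, certain, mapping, config):
    """Write the language code and its alternative name according to the configuration."""
    # With UseCertainLanguagesOnly, an uncertain detection counts as "not detected".
    if config.get("UseCertainLanguagesOnly", False) and not certain:
        detected = None

    if detected is not None:
        language = detected
        alternative = mapping.get(detected)  # assumption: unmapped -> unset
    else:
        language = config.get("DefaultLanguage")
        alternative = config.get("DefaultAlternativeName")

    # Attributes are only written if they are configured and a value exists.
    if config.get("LanguageAttribute") and language is not None:
        record[config["LanguageAttribute"]] = language
    if config.get("AlternativeNameAttribute") and alternative is not None:
        record[config["AlternativeNameAttribute"]] = alternative
    return record


config = {
    "ContentAttribute": "Content",
    "LanguageAttribute": "Language",
    "AlternativeNameAttribute": "LanguageName",
    "DefaultLanguage": "en",
    "DefaultAlternativeName": "english",
}
mapping = {"de": "german", "en": "english"}
detected_record = store_language({}, "de", True, mapping, config)
fallback_record = store_language({}, None, False, mapping, config)
```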

Example

The following example could be used to identify the language of documents that are delivered by the File System Crawler or Web Crawler.

org.eclipse.smila.processing.pipelets.JSONReaderPipelet

Description

It is not possible to overwrite the record id of the record, even if a key "_recordid" exists in the JSON string.

Configuration

{| class="wikitable"
! Property !! Type !! Read Type !! Description
|-
| inputType || String: ATTACHMENT, ATTRIBUTE || init || Selects whether the JSON string is found in an attachment or in an attribute of the record
|-
| inputName || String || init || Name of the input attachment or input attribute that contains the JSON string
|-
| outputAttribute || String || init || The optional name of the attribute in the record where the generated object is put. If no attribute is specified and the object is a map, all contained attributes are written to the current record.
|}
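The effect of outputAttribute and the protection of the record ID can be sketched with Python's standard json module. The sketch models the record and its attachments as dictionaries, assumes attachments contain UTF-8 text, and uses the hypothetical function name `read_json`.

```python
import json


def read_json(record, attachments, input_type, input_name, output_attribute=None):
    """Parse a JSON string from the record and store the resulting object."""
    if input_type == "ATTACHMENT":
        raw = attachments[input_name].decode("utf-8")  # assumption: UTF-8 text
    else:
        raw = record[input_name]
    parsed = json.loads(raw)

    if output_attribute is not None:
        # The whole object is stored in the named attribute.
        record[output_attribute] = parsed
    elif isinstance(parsed, dict):
        # Without outputAttribute, a map is merged into the record itself,
        # but the record ID is never overwritten.
        for key, value in parsed.items():
            if key != "_recordid":
                record[key] = value
    return record


wrapped = read_json({"jsonString": '{"attribute1": "value1"}'}, {},
                    "ATTRIBUTE", "jsonString", output_attribute="jsonObject")
merged = read_json({"_recordid": "doc1",
                    "jsonString": '{"_recordid": "other", "attribute1": "value1"}'},
                   {}, "ATTRIBUTE", "jsonString")
```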

Examples

The following examples use this input object:

{"jsonString":"{\"attribute1\": \"value1\"}"}

This example unwraps the contents of the attribute "jsonString" into the attribute "jsonObject":