UIMA (Unstructured Information Management Architecture) is a platform for natural language
processing, originally developed by IBM but now maintained by the Apache Software Foundation.
It has many similarities to the GATE architecture – it represents documents as text plus
annotations, and allows users to define pipelines of analysis engines that manipulate the document
(or Common Analysis Structure in UIMA terminology) in much the same way as processing
resources do in GATE. The Apache UIMA SDK provides support for building analysis
components in Java and C++ and for running them either locally on one machine or as
services that can be accessed remotely. The SDK is available for download from
http://incubator.apache.org/uima/.

Clearly, it would be useful to be able to include UIMA components in GATE applications and
vice-versa, letting GATE users take advantage of UIMA’s flexible deployment options and
UIMA users access JAPE and the many useful plugins already available in GATE. This
chapter describes the interoperability layer provided as part of GATE to support this. The
UIMA-GATE interoperability layer is based on Apache UIMA 2.2.2. GATE 5.0 and
earlier included an implementation based on version 1.2.3 of the pre-Apache IBM UIMA
SDK.

The rest of this chapter assumes that you have at least a basic understanding of core UIMA
concepts, such as type systems, primitive and aggregate analysis engines (AEs), feature structures,
the format of AE XML descriptors, etc. It will probably be helpful to refer to the relevant sections
of the UIMA SDK User’s Guide and Reference (supplied with the SDK) alongside this
document.

There are two main parts to the interoperability layer:

A wrapper to allow a UIMA Analysis Engine (AE), whether primitive or aggregate, to
be used within GATE as a Processing Resource (PR).

A wrapper to allow a GATE processing pipeline (specifically a CorpusController) to
be used within UIMA as an AE.

The two components operate in very similar ways. Given a document in the source
form (either a GATE Document or a UIMA CAS), a document in the target form is
created with a copy of the source document’s text. Some of the annotations from the
source are transferred to the target, according to a mapping defined by the user, and
the target component is then run. Finally, some of the annotations on the updated
target document are then transferred back to the source, according to the user-defined
mapping.
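
The round trip described above can be illustrated with a toy model. This is only a sketch: the Doc and Ann classes, the process method and the type-name maps are hypothetical stand-ins invented for illustration, not GATE or UIMA API, and the sketch covers only newly added annotations on the way back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy model of the round trip; none of these classes are GATE or UIMA API. */
final class RoundTripSketch {

    /** A typed span over the document text. */
    static final class Ann {
        final String type;
        final int start;
        final int end;

        Ann(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Text plus annotations, standing in for a GATE Document or a UIMA CAS. */
    static final class Doc {
        final String text;
        final List<Ann> anns = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }
    }

    /** Copies only annotations whose type appears in the mapping, renaming the type. */
    static void transfer(Doc from, Doc to, Map<String, String> typeMapping) {
        for (Ann a : new ArrayList<>(from.anns)) {
            String mappedType = typeMapping.get(a.type);
            if (mappedType != null) {
                to.anns.add(new Ann(mappedType, a.start, a.end));
            }
        }
    }

    /** Runs the target component on a copy of the source and maps its results back. */
    static void process(Doc source, Map<String, String> inputs,
                        Map<String, String> outputs, Consumer<Doc> targetComponent) {
        Doc target = new Doc(source.text);   // 1. copy the source text
        transfer(source, target, inputs);    // 2. transfer the mapped input annotations
        targetComponent.accept(target);      // 3. run the target component
        transfer(target, source, outputs);   // 4. transfer the mapped outputs back
    }
}
```

In this model, a component that adds a sentence annotation to the target would cause a corresponding annotation to appear on the source only if the output mapping names its type.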

The rest of this chapter describes this process in more detail. Section 18.1 describes the GATE
AE wrapper, and Section 18.2 describes the UIMA CorpusController wrapper.

Embedding a UIMA analysis engine in a GATE application is a two-step process. First, you must
construct a mapping descriptor XML file to define how to map annotations between the UIMA
CAS and the GATE Document. This mapping file, along with the analysis engine descriptor, is
used to instantiate an AnalysisEnginePR which calls the analysis engine on an appropriately
initialized CAS. Examples of all the XML files discussed in this section are available in
examples/conf under the UIMA plugin directory.

Figure 18.1 shows the structure of a mapping descriptor. The inputs section defines how
annotations on the GATE document are transferred to the UIMA CAS. The outputs section
defines how annotations which have been added, updated and removed by the AE are transferred
back to the GATE document.

When a document is processed, each input definition creates one UIMA annotation of its UIMA
type (written uima.Type here) in the CAS for each GATE annotation of its GATE type (GATEType)
in the input annotation set, covering the same offsets in the text. If indexed is true, GATE will keep a record of
which GATE annotation gave rise to which UIMA annotation. If you wish to be able to
track updates to this annotation’s features and transfer the updated values back into
GATE, you must specify indexed="true". The indexed attribute defaults to false if
omitted.

Each contained feature element will cause the corresponding feature to be set on the generated
annotation. UIMA features can be string, integer or float valued, or can be a reference to another
feature structure, and this must be specified in the kind attribute. The feature’s value is specified
using a nested element, but exactly how this value is handled is determined by the
kind.

<string value="fixed string" /> A fixed string.

<docFeatureValue name="featureName" /> The value of the given named feature of
the current GATE document.

<gateAnnotFeatureValue name="featureName" /> The value of a given feature on
the current GATE annotation (i.e. the one on which the offsets of the UIMA annotation
are based).

<featureStructure type="uima.fs.Type">...</featureStructure> A
feature structure of the given type. The featureStructure element can itself contain
feature elements recursively.

The value is assigned to the feature according to the feature’s kind:

string

The value object’s toString() method is called, and the resulting String is set as
the string value of the feature.

int

If the value object is a subclass of java.lang.Number, its intValue() method is
called, and the result is set as the integer value of the feature. If the value object
is not a Number, its toString() method is called, and the resulting String is parsed using
Integer.parseInt(). If this succeeds, the integer result is used; if it fails, the feature
is set to zero.

float

As for int, except that Numbers are converted by calling floatValue(), and
non-Numbers are parsed using Float.parseFloat().

fs

The value object is assumed to be a FeatureStructure, and is used as-is. A
ClassCastException will result if the value object is not a FeatureStructure.

In particular, <featureStructure> value elements should only be used with features of kind fs.
While nothing will stop you using them with string features, the result will probably not be what
you expected.
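
The conversion rules for the string, int and float kinds can be sketched in plain Java as follows. This is an illustration of the rules as described above, not the code used by the interoperability layer: the class and method names are hypothetical, the zero fallback for float is inferred from "As for int", and the value is assumed to be non-null because the text does not say how null is handled.

```java
/** Hypothetical helper illustrating the documented conversion rules. */
final class FeatureValueCoercion {

    /** kind="string": the value object's toString() result is used. */
    static String toStringValue(Object value) {
        return value.toString();
    }

    /** kind="int": Numbers use intValue(); anything else is parsed, falling back to zero. */
    static int toIntValue(Object value) {
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        try {
            return Integer.parseInt(value.toString());
        } catch (NumberFormatException e) {
            return 0; // parse failure: the feature is set to zero
        }
    }

    /** kind="float": Numbers use floatValue(); anything else is parsed with Float.parseFloat(). */
    static float toFloatValue(Object value) {
        if (value instanceof Number) {
            return ((Number) value).floatValue();
        }
        try {
            return Float.parseFloat(value.toString());
        } catch (NumberFormatException e) {
            return 0f; // assumed by analogy with the int case
        }
    }
}
```

Note that intValue() truncates, so a GATE feature holding the Double 3.9 would arrive in UIMA as the integer 3 under these rules.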

Output definitions fall into three groups:

added

Annotations which have been added by the AE, and for which corresponding new
annotations are to be created in the GATE document.

updated

Annotations that were created by an input definition (with indexed="true")
whose feature values have been modified by the AE, and these values are to be
transferred back to the original GATE annotations.

removed

Annotations that were created by an input definition (with indexed="true")
which have been removed from the CAS [1] and whose source annotations are to be
removed from the GATE document.

For added annotations, a gateAnnotation definition has the mirror-image effect of an input
definition: for each UIMA annotation of the given type, a GATE annotation is created at the same
offsets and its feature values are set as specified by the feature elements. For a gateAnnotation
the feature elements do not have a kind, as features in GATE can have arbitrary Objects as values.
The possible feature value elements for a gateAnnotation are:

<string value="fixed string" /> A fixed string, as before.

<uimaFSFeatureValue name="uima.Type:FeatureName" kind="string|int|float" />
The value of the given feature of the current UIMA annotation. The feature name must be
specified in fully-qualified form, including the type on which it is defined. The kind is used in
a similar way as in input definitions:

string

The Java String object returned as the string value of the feature is used.

int

An Integer object is created from the integer value of the feature.

float

A Float object is created from the float value of the feature.

fs

The UIMA FeatureStructure object is returned. Since FeatureStructure objects
are not guaranteed to be valid once the CAS has been cleared, a downstream
GATE component must extract the relevant information from the feature
structure before the next document is processed. You have been warned.

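Feature names in uimaFSFeatureValue must be qualified with their type name, as the feature may
have been defined on a supertype of the annotation's type, rather than on the type itself. For
example, if a hypothetical type com.example.Person inherited a feature from a supertype
com.example.Entity, the feature name would be qualified with com.example.Entity, the type that
defines it.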

For updated annotations, there must have been an input definition with indexed="true" with the
same GATE and UIMA types. In this case, for each GATE annotation of the appropriate type, the
UIMA annotation that was created from it is found in the CAS. The feature definitions are then
used as in the added case, but here, the feature values are set on the original GATE annotation,
rather than on a newly created annotation.

For removed annotations, the feature definitions are ignored, and the annotation is removed from
GATE if the UIMA annotation which it gave rise to has been removed from the UIMA annotation
index.
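
The bookkeeping behind updated and removed outputs can be sketched as follows. This is a simplified illustration, not the wrapper's implementation: the Ann class, the method name and its parameters are hypothetical, and the sketch applies both rules in one pass whereas the real layer applies only the output definitions listed in the mapping descriptor.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of how indexed="true" supports updated and removed outputs. */
final class IndexedOutputSketch {

    /** Minimal annotation stand-in: an id plus a feature map. */
    static final class Ann {
        final int id;
        final Map<String, Object> features = new HashMap<>();

        Ann(int id) {
            this.id = id;
        }
    }

    /**
     * @param index       record of which GATE annotation gave rise to which UIMA annotation
     * @param uimaIndex   UIMA annotations still in the annotation index after the AE has run
     * @param gateAnns    the GATE annotation set, modified in place
     * @param uimaFeature name of the UIMA feature to read
     * @param gateFeature name of the GATE feature to write
     */
    static void applyUpdatedAndRemoved(Map<Ann, Ann> index, Set<Ann> uimaIndex, Set<Ann> gateAnns,
                                       String uimaFeature, String gateFeature) {
        for (Map.Entry<Ann, Ann> entry : index.entrySet()) {
            Ann gateAnn = entry.getKey();
            Ann uimaAnn = entry.getValue();
            if (uimaIndex.contains(uimaAnn)) {
                // updated: copy the value back onto the original GATE annotation
                gateAnn.features.put(gateFeature, uimaAnn.features.get(uimaFeature));
            } else {
                // removed: the UIMA annotation has left the index, so drop its source
                gateAnns.remove(gateAnn);
            }
        }
    }
}
```

Without the index there would be no way to tell which GATE annotation a surviving UIMA annotation came from, which is why both groups require indexed="true" on the input definition.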

Figure 18.2 shows a complete example mapping descriptor for a simple UIMA AE that takes tokens
as input and adds a feature to each token giving the number of lower case letters in the token’s
string. [2] In this case the UIMA feature that holds the number of lower case letters is called
LowerCaseLetters, but the GATE feature is called numLower. This demonstrates that the
feature names do not need to agree, so long as a mapping between them can be defined.

As well as the mapping file, you must provide the UIMA component descriptor that defines how to
access the AE that is to be called. This could be a primitive or aggregate analysis engine
descriptor, or a URI specifier giving the location of a remote Vinci or SOAP service. It is up to the
developer to ensure that the types and features used in the mapping descriptor are
compatible with the type system and capabilities of the AE, or a runtime error is likely to
occur.

To use a UIMA AE in GATE Developer, load the UIMA plugin and create a ‘UIMA Analysis
Engine’ processing resource. If using GATE Embedded rather than GATE Developer,
the class name is gate.uima.AnalysisEnginePR. The processing resource expects two
parameters:

analysisEngineDescriptor

The URL of the UIMA analysis engine descriptor (or URI
specifier, for a remote AE service). This must be a file: URL, as UIMA needs a file
path against which to resolve imports.

mappingDescriptor

The URL of the mapping descriptor file. This may be any kind of URL (file:, http:, or one
obtained from Class.getResource(), ServletContext.getResource(), etc.).
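
Since analysisEngineDescriptor must be a file: URL, it can be convenient to build it from a local path and check the protocol before creating the PR. The following is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and the file names in the usage note are placeholders.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper for preparing descriptor URLs. */
final class DescriptorUrls {

    /** Converts a local path into an absolute file: URL. */
    static URL fileUrl(String path) {
        try {
            return new File(path).getAbsoluteFile().toURI().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot convert to a URL: " + path, e);
        }
    }

    /** True if the URL is suitable as an analysisEngineDescriptor value. */
    static boolean isFileUrl(URL url) {
        return "file".equals(url.getProtocol());
    }
}
```

For example, DescriptorUrls.fileUrl("conf/MyAnalysisEngine.xml") yields a URL that passes the isFileUrl check, whereas a mapping descriptor fetched over http: would not, which is acceptable for mappingDescriptor but not for analysisEngineDescriptor.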

Any errors processing either of the descriptor files will cause an exception to be thrown. Once
instantiated, you can add the PR to a pipeline in the usual way. AnalysisEnginePR
implements LanguageAnalyser, so can be used in any of the standard GATE pipeline
types.

The PR takes the following runtime parameter (in addition to the document parameter which is set
automatically by a CorpusController):

annotationSetName

The annotation set to process. Any input mappings take annotations
from this set, and any output mappings place their new annotations in this set (added
outputs) or update the input annotations in this set (updated or removed). If not
specified, the default (unnamed) annotation set is used.

The Annotator implementation must be available for GATE to load. For an annotator written in
Java, this means that the JAR file containing the annotator class (and any other classes it depends
on) must be visible to the GATE classloader. The easiest way to achieve this is to put the JAR file
or files in a new directory, and create a creole.xml file in the same directory that references the
JARs.

This directory should then be loaded in GATE as a CREOLE plugin. Note that, due to the
complex mechanics of classloaders in Java, putting your JARs in GATE’s lib directory will not
work.

For annotators written in C++ you need to ensure that the C++ enabler libraries (available
separately from http://incubator.apache.org/uima/) and the shared library containing your
annotator are in a directory which is on the PATH (Windows) or LD_LIBRARY_PATH (Linux) when
GATE is run.

The process of embedding a GATE controller in a UIMA application is more or less the mirror
image of the process detailed in the previous section. Again, the developer must supply a mapping
descriptor defining how to map between UIMA and GATE annotations, and pass this, plus the
GATE controller definition, to an AE which performs the translation and calls the GATE
controller.

The mapping descriptor format is virtually identical to that described in Section 18.1.1, except
that the input definitions are <gateAnnotation> elements and the output definitions are
<uimaAnnotation> elements. The input and output definition elements support an extra attribute,
annotationSetName, which allows inputs to be taken from, and outputs to be placed in, different
annotation sets. For example, a hypothetical mapping could map com.example.Person
annotations into the default set and com.example.html.Anchor annotations to ‘a’ tags in the
‘Original markups’ set.

Figure 18.3 shows a mapping descriptor for an application that takes tokens and sentences produced
by some UIMA component and runs the GATE part of speech tagger to tag them with Penn TreeBank
POS tags. [3] In the example, no features are copied from the UIMA tokens, but they are still
declared with indexed="true", as the POS feature must be copied back from GATE.

The GATE application to embed is given as a standard ‘.gapp’ file, as produced by saving the
state of an application in the GATE GUI. The .gapp file encodes the information necessary to load
the correct plugins and create the various CREOLE components that make up the application.
The .gapp file must be fully specified and able to be executed with no user intervention other than
pressing the Go button. In particular, all runtime parameters must be set to their correct values
before saving the application state. Also, since paths to things like CREOLE plugin directories,
resource files, etc. are stored relative to the .gapp file’s location, you must not move the .gapp file
to a different directory unless you can keep all the CREOLE plugins it depends on at the
same relative locations. The ‘Export for Teamware’ option (Section 3.8.4) may help you
here.

GATEApplicationAnnotator is the UIMA annotator that handles mapping the CAS into a GATE
document and back again and calling the GATE controller. There is a template AE descriptor
XML file for the annotator provided in the conf directory. Most of the template file can be used
unchanged, but you will need to modify the type system definition and input/output capabilities
to match the types and features used in your mapping descriptor. If the mapping descriptor
references a type or feature that is not defined in the type system, a runtime error will
occur.

The annotator requires two external resources:

GateApplication

The .gapp file containing the saved application state.

MappingDescriptor

The mapping descriptor XML file.

These must be bound to suitable URLs, either by editing the resourceManagerConfiguration
section of the primitive descriptor, or by supplying the binding in an aggregate descriptor that
includes the GATEApplicationAnnotator as one of its delegates.

In addition, you may need to set the following Java system properties:

uima.gate.configdir

The path to the GATE config directory. This defaults to gate-config
in the same directory as uima-gate.jar.

uima.gate.siteconfig

The location of the sitewide gate.xml configuration file. This
defaults to site-gate.xml in the uima.gate.configdir directory.

uima.gate.userconfig

The location of the user-specific gate.xml configuration file. This
defaults to user-gate.xml in the uima.gate.configdir directory.
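
The way these properties and their defaults fit together can be sketched as follows. This is an illustration only, not the annotator's own code: the class name is hypothetical, the directory containing uima-gate.jar is passed in by the caller as an assumption, and the site and user defaults are taken to be relative to the directory named by uima.gate.configdir.

```java
import java.io.File;
import java.util.Properties;

/** Hypothetical resolver for the three configuration locations. */
final class GateConfigLocations {
    final File configDir;
    final File siteConfig;
    final File userConfig;

    /**
     * @param props  the properties to consult, typically System.getProperties()
     * @param jarDir directory assumed to contain uima-gate.jar
     */
    GateConfigLocations(Properties props, File jarDir) {
        // Default: a gate-config directory next to uima-gate.jar
        configDir = new File(props.getProperty("uima.gate.configdir",
                new File(jarDir, "gate-config").getPath()));
        // Defaults: site-gate.xml and user-gate.xml inside the config directory
        siteConfig = new File(props.getProperty("uima.gate.siteconfig",
                new File(configDir, "site-gate.xml").getPath()));
        userConfig = new File(props.getProperty("uima.gate.userconfig",
                new File(configDir, "user-gate.xml").getPath()));
    }
}
```

Setting only uima.gate.configdir moves both files together; setting uima.gate.siteconfig or uima.gate.userconfig overrides one file independently.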

The default config files are deliberately simplified from the standard versions supplied with GATE;
in particular, they do not load any plugins automatically (not even ANNIE). All the plugins
used by your application are specified in the .gapp file, and will be loaded when the
application is loaded, so it is best to avoid loading any others from gate.xml, to avoid
problems such as two different versions of the same plugin being loaded from different
locations.

In addition to the usual UIMA library JAR files, GATEApplicationAnnotator requires
a number of JAR files from the GATE distribution in order to function. In the first
instance, you should include gate.jar from GATE’s bin directory, and also all the
JAR files from GATE’s lib directory on the classpath. If you use the supplied Ant
build file, ant documentanalyser will run the document analyser with this classpath.
Depending on exactly which GATE plugins your application uses, you may be able to
exclude some of the lib JAR files (for example, you will not need Weka if you do not
use the machine learning plugin), but it is safest to start with them all. GATE will
load plugin JAR files through its own classloader, so these do not need to be on the
classpath.
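
If you are not using the supplied Ant build file, the GATE part of the classpath can be assembled by hand. The following sketch collects gate.jar from the bin directory and every JAR in the lib directory, as suggested above; the class and method names are hypothetical, the gateHome argument is assumed to be the root of a GATE installation laid out that way, and the UIMA library JARs still have to be added separately.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that builds the GATE portion of the classpath. */
final class GateClasspath {

    /** Returns bin/gate.jar followed by all lib/*.jar files, joined with the platform separator. */
    static String build(File gateHome) {
        List<String> entries = new ArrayList<>();
        entries.add(new File(new File(gateHome, "bin"), "gate.jar").getPath());

        // listFiles returns null if lib does not exist or cannot be read
        File[] libJars = new File(gateHome, "lib").listFiles((dir, name) -> name.endsWith(".jar"));
        if (libJars != null) {
            Arrays.sort(libJars); // stable ordering makes the result reproducible
            for (File jar : libJars) {
                entries.add(jar.getPath());
            }
        }
        return String.join(File.pathSeparator, entries);
    }
}
```

Plugin JARs are deliberately left out, since GATE loads those through its own classloader.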

[1] Strictly speaking, removed from the annotation index, as feature structures cannot be removed from the CAS entirely.

[2] The Java code implementing this AE is in the examples directory of the UIMA plugin. The AE descriptor and mapping file are in examples/conf.

[3] The .gapp file implementing this example is in the test/conf directory under the UIMA plugin, along with the mapping file and the AE descriptor that will run it.