Embedding GATE-based language processing in other applications using GATE Embedded (the
GATE API) is straightforward:

add
$GATE_HOME/bin/gate.jar
and
the
JAR
files
in
$GATE_HOME/lib
to
the
Java
CLASSPATH
($GATE_HOME
is
the
GATE
root
directory)

tell
Java
that
the
GATE
Unicode
Kit
is
an
extension:
-Djava.ext.dirs=$GATE_HOME/lib/ext
N.B.
This
is
only
necessary
for
GUI
applications
that
need
to
support
Unicode
text
input;
other
applications
such
as
command
line
or
web
applications
don’t
generally
need
GUK.

Instead of creating your processing resources individually using the Factory, you can create your
application in GATE Developer, save it using the ‘save application state’ option (see
Section 3.8.3), and then load the saved state from your code. This will automatically reload any
plugins that were loaded when the state was saved, you do not need to load them manually.

All CREOLE resources have some associated meta-data in the form of an entry in a special XML
file named creole.xml. The most important role of that meta-data is to specify the set of
parameters that a resource understands, which of them are required and which not, if they have
default values and what those are. The valid parameters for a resource are described in the
resource’s section of its creole.xml file or in Java annotations on the resource class – see Section
4.7.

All resource types have creation-time parameters that are used during the initialisation phase.
Processing Resources also have run-time parameters that get used during execution (see Section
7.5 for more details).

Controllers are used to define GATE applications and have the role of controlling the execution
flow (see Section 7.6 for more details).

This section describes how to create and delete CREOLE resources as objects
in a running Java virtual machine. This process involves using GATE’s Factory
class2,
and, in the case of LRs, may also involve using a DataStore.

CREOLE resources are Java Beans; creation of a resource object involves using a default
constructor, then setting parameters on the bean, then calling an init() method. The Factory
takes care of all this, makes sure that the GATE Developer GUI is told about what is happening
(when GUI components exist at runtime), and also takes care of restoring LRs from DataStores. Aprogrammer using GATE Embedded should never call the constructor of a resource:always use the Factory!

Creating a resource involves providing the following information:

fully qualified class name for the resource. This is the only required value. For all
the rest, defaults will be used if actual values are not provided.

values for the creation time parameters.†

initial values for resource features.†For an explanation on features see Section
7.4.2.

a name for the new resource;

†Parameters and features need to be provided in the form of a GATE Feature Map which is
essentially a java Map (java.util.Map) implementation, see Section 7.4.2 for more details on
Feature Maps.

Creating a resource via the Factory involves passing values for any create-time parameters that
require setting to the Factory’s createResource method. If no parameters are passed, the defaults
are used. So, for example, the following code creates a default ANNIE part-of-speech tagger:

Note that if the resource created here had any parameters that were both mandatory and had no
default value, the createResource call would throw an exception. In this case, all the information
needed to create a tagger is available in default values given in the tagger’s XML definition (in
plugins/ANNIE/creole.xml):

Note that the document created here is transient: when you quit the JVM the document will no
longer exist. If you want the document to be persistent, you need to store it in a DataStore (see
Section 7.4.5).

Apart from createResource() methods with different signatures, Factory also provides some
shortcuts for common operations, listed in table 7.1.

Method

Purpose

newFeatureMap()

Creates a new Feature Map (as used
in the example above).

newDocument(String content)

Creates a new GATE Document
starting from a String value that will
be used to generate the document
content.

newDocument(URL sourceUrl)

Creates a new GATE Document
using the text pointed by an URL to
generate the document content.

newDocument(URL sourceUrl,String encoding)

Same as above but allows the
specification of an encoding to
be used while downloading the
document content.

newCorpus(String name)

creates a new GATE Corpus with a
specified name.

Table 7.1: Factory Operations

GATE maintains various data structures that allow the retrieval of loaded resources. When a
resource is no longer required, it needs to be removed from those structures in order to remove all
references to it, thus making it a candidate for garbage collection. This is achieved using the
deleteResource(Resource res) method on Factory.

Simply removing all references to a resource from the user code
will NOT be enough to make the resource collect-able. Not calling
Factory.deleteResource() will lead to memory leaks!

As shown in the examples above, in order to use a CREOLE resource the relevant CREOLE plugin
must be loaded. Processing Resources, Visual Resources and Language Resources other
than Document, Corpus and DataStore all require that the appropriate plugin is first
loaded. When using Document, Corpus or DataStore, you do not need to first load a
plugin. The following API calls listed in table 7.2 are relevant to working with CREOLE
plugins.

Class gate.Gate

Method

Purpose

public static voidaddKnownPlugin(URLpluginURL)

adds the plugin to the list of known
plugins.

public static voidremoveKnownPlugin(URLpluginURL)

tells the system to ‘forget’ about
one previously known directory. If
the specified directory was loaded,
it will be unloaded as well - i.e. all
the metadata relating to resources
defined by this directory will be
removed from memory.

public static voidaddAutoloadPlugin(URLpluginUrl)

adds a new directory to the list of
plugins that are loaded automatically
at start-up.

public static voidremoveAutoloadPlugin(URLpluginURL)

tells the system to remove a plugin
URL from the list of plugins that
are loaded automatically at system
start-up. This will be reflected in the
user’s configuration data file.

Class gate.CreoleRegister

public voidregisterDirectories(URLdirectoryUrl)

loads a new CREOLE directory. The
new plugin is added to the list of
known plugins if not already there.

public voidregisterComponent(Class<?extends Resource> cls)

registers a single @CreoleResource
annotated class without the need for
a creole.xml file.

public voidremoveDirectory(URLdirectory)

unloads a loaded CREOLE plugin.

Table 7.2: Calls Relevant to CREOLE Plugins

If you are writing a GATE Embedded application and have a single resource class that will only be
used from your embedded code (and so does not need to be distributed as a complete plugin), and
all the configuration for that resource is provided as Java annotations on the class, then it is
possible to register the class with the CreoleRegister at runtime without needing to package it in
a JAR and provide a creole.xml file. You can pass the Class object representing your resource
class to Gate.getCreoleRegister().registerComponent() method and then create instances of
the resource in the usual way using Factory.createResource. Note that resources cannot be
registered this way in the developer GUI, and cannot be included in saved application states (see
section 7.8 below).

The content of a document can be any implementation of the
gate.DocumentContent interface; the features are <attribute, value> pairs stored a Feature Map.
Attributes are String values while the values can be any Java object.

The annotations are grouped in sets (see section 7.4.3). A document has a default (anonymous)
annotations set and any number of named annotations sets.

Documents are defined by the gate.Document interface and there is also a provided
implementation:

gate.corpora.DocumentImpl

: transient document. Can be stored persistently through
Java serialisation.

All CREOLE resources as well as the Controllers and the annotations can have attached meta-data
in the form of Feature Maps.

A Feature Map is a Java Map (i.e. it implements the java.util.Map interface) and holds
<attribute-name, attribute-value> pairs. The attribute names are Strings while the values can be
any Java Objects.

The use of non-Serialisable objects as values is strongly discouraged.

Feature Maps are created using the gate.Factory.newFeatureMap() method.

The actual implementation for FeatureMaps is provided by the
gate.util.SimpleFeatureMapImpl class.

Objects that have features in GATE implement the gate.util.FeatureBearer interface which
has only the two accessor methods for the object features: FeatureMap getFeatures() and voidsetFeatures(FeatureMap features).

A GATE document can have one or more annotation layers — an anonymous one, (also called
default), and as many named ones as necessary.

An annotation layer is organised as a Directed Acyclic Graph (DAG) on which the nodes are
particular locations —anchors— in the document content and the arcs are made out of
annotations reaching from the location indicated by the start node to the one pointed by the end
node (see Figure 7.1 for an illustration). Because of the graph metaphor, the annotation layers are
also called annotation graphs. In terms of Java objects, the annotation layers are represented using
the Set paradigm as defined by the collections library and they are hence named annotation sets.
The terms of annotation layer, graph and set are interchangeable and refer to the same concept
when used in this book.

Figure 7.1: The Annotation Graph model.

An annotation set holds a number of annotations and maintains a series of indices in order to
provide fast access to the contained annotations.

The GATE Annotation Sets are defined by the gate.AnnotationSet interface and there is a
default implementation provided:

gate.annotation.AnnotationSetImpl

annotation set implementation used by transient
documents.

The annotation sets are created by the document as required. The first time a particular
annotation set is requested from a document it will be transparently created if it doesn’t
exist.

Creates a new annotation between
two offsets, adds it to this set and
returns its id.

Integer add(Node start, Nodeend, String type, FeatureMapfeatures)

Creates a new annotation between
two nodes, adds it to this set and
returns its id.

boolean remove(Object o)

Removes an annotation from this set.

Nodes

Method

Purpose

Node firstNode()

Gets the node with the smallest
offset.

Node lastNode()

Gets the node with the largest offset.

Node nextNode(Node node)

Get the first node that is relevant for
this annotation set and which has the
offset larger than the one of the node
provided.

Set implementation

Iterator iterator()

int size()

Table 7.4: gate.AnnotationSet methods (general purpose).

Searching

AnnotationSet get(Long offset)

Select annotations by offset. This
returns the set of annotations whose
start node is the least such that it
is less than or equal to offset. If a
positional index doesn’t exist it is
created. If there are no nodes at or
beyond the offset parameter then it
will return null.

AnnotationSet get(LongstartOffset, Long endOffset)

Select annotations by offset. This
returns the set of annotations that
overlap totally or partially with the
interval defined by the two provided
offsets. The result will include all the
annotations that either:

start before the start offset and
end strictly after it

start at a position between the
start and the end offsets

AnnotationSet get(String type)

Returns all annotations of the
specified type.

AnnotationSet get(Set types)

Returns all annotations of the
specified types.

AnnotationSet get(String type,FeatureMap constraints)

Selects annotations by type and
features.

Set getAllTypes()

Gets a set of java.lang.String objects
representing all the annotation types
present in this annotation set.

An annotation, is a form of meta-data attached to a particular section of document content. The
connection between the annotation and the content it refers to is made by means of two pointers
that represent the start and end locations of the covered content. An annotation must also have a
type (or a name) which is used to create classes of similar annotations, usually linked together by
their semantics.

an Integer value. All annotations IDs are unique inside an annotation set.

In GATE Embedded, annotations are defined by the gate.Annotation interface and implemented
by the gate.annotation.AnnotationImpl class. Annotations exist only as members of
annotation sets (see Section 7.4.3) and they should not be directly created by means of a
constructor. Their creation should always be delegated to the containing annotation
set.

Fills this corpus with documents
created on the fly from selected files
in a directory. Uses a FileFilter
to select which files will be used
and which will be ignored. A simple
file filter based on extensions is
provided in the Gate distribution
(gate.util.ExtensionFileFilter).

Assuming that you have a DataStore already open called myDataStore, this code will ask the
datastore to take over persistence of your document, and to synchronise the memory representation
of the document with the disk storage:

When you want to restore a document (or other LR) from a datastore, you make the same
createResource call to the Factory as for the creation of a transient resource, but this time you
tell it the datastore the resource came from, and the ID of the resource in that datastore:

Processing Resources (PRs) represent entities that are primarily algorithmic, such as parsers,
generators or ngram modellers.

They are created using the GATE Factory in manner similar the Language Resources. Besides the
creation-time parameters they also have a set of run-time parameters that are set by the system
just before executing them.

Analysers are a particular type of processing resources in the sense that they always have a
document and a corpus among their run-time parameters.

The most used methods for Processing Resources are presented in table 7.7

Method

Purpose

void setParameterValue(StringparamaterName, ObjectparameterValue)

Sets the value for a specified
parameter. method inherited fromgate.Resource

voidsetParameterValues(FeatureMapparameters)

Sets the values for more parameters
in one step. method inherited fromgate.Resource

ObjectgetParameterValue(StringparamaterName)

Gets the value of a named parameter
of this resource. method inherited fromgate.Resource

Resource init()

Initialise this resource, and return it.
method inherited from gate.Resource

void reInit()

Reinitialises the processing resource.
After calling
this method the resource should be
in the state it is after calling init.
If the resource depends on external
resources (such as rules files) then the
resource will re-read those resources.
If the data used to create the resource
has changed since the resource has
been created then the resource will
change too after calling reInit().

void execute()

Starts the execution of this
Processing Resource.

void interrupt()

Notifies this PR that it should stop
its execution as soon as possible.

boolean isInterrupted()

Checks whether this PR has been
interrupted since the last time its
Executable.execute() method was
called.

Controllers are used to create GATE applications. A Controller handles a set of Processing
Resources and can execute them following a particular strategy. GATE provides a series of serial
controllers (i.e. controllers that run their PRs in sequence):

gate.creole.SerialController:

a serial controller that takes any kind of PRs.

gate.creole.SerialAnalyserController:

a serial controller that only accepts Language
Analysers as member PRs.

gate.creole.ConditionalSerialController:

a serial controller that accepts all types of
PRs and that allows the inclusion or exclusion of member PRs from the execution
chain according to certain run-time conditions (currently features on the document
being processed are used).

gate.creole.ConditionalSerialAnalyserController:

a serial controller that only
accepts Language Analysers and that allows the conditional run of member PRs.

Sometimes, particularly in a multi-threaded application, it is useful to be able to create an
independent copy of an existing PR, controller or LR. The obvious way to do this is to call
createResource again, passing the same class name, parameters, features and name, and for many
resources this will do the right thing. However there are some resources for which this may be
insufficient (e.g. controllers, which also need to duplicate their PRs), unsafe (if a PR uses
temporary files, for instance), or simply inefficient. For example for a large gazetteer this would
involve loading a second copy of the lists into memory and compiling them into a second identical
state machine representation, but a much more efficient way to achieve the same behaviour would
be to use a SharedDefaultGazetteer (see section 13.11), which can re-use the existing state
machine.

The GATE Factory provides a duplicate method which takes an existing resource instance and
creates and returns an independent copy of the resource. By default it uses the algorithm described
above, extracting the parameter values from the template resource and calling createResource to
create a duplicate. However, if a particular resource type knows of a better way to duplicate itself
it can implement the CustomDuplication interface, and provide its own duplicate method which
the factory will use instead of performing the default duplication algorithm. A caller who
needs to duplicate an existing resource can simply call Factory.duplicate to obtain
a copy, which will be constructed in the appropriate way depending on the resource
type.

Note that the duplicate object returned by Factory.duplicate will not necessarily be of the same
class as the original object. However the contract of Factory.duplicate specifies that where the
original object implements any of a list of core GATE interfaces, the duplicate can be assumed to
implement the same ones – if you duplicate a DefaultGazetteer the result may not be
an instance of DefaultGazetteer but it is guaranteed to implement the Gazetteer
interface.

GATE Embedded allows the persistent storage of applications in a format based on XML
serialisation. This is particularly useful for applications management and distribution. A developer
can save the state of an application when he/she stops working on its design and continue
developing it in a next session. When the application reaches maturity it can be deployed to the
client site using the same method.

When an application (i.e. a Controller) is saved, GATE will actually only save the values
for the parameters used to create the Processing Resources that are contained in the
application. When the application is reloaded, all the PRs will be re-created using the saved
parameters.

Many PRs use external resources (files) to define their behaviour and, in most cases, these files are
identified using URLs. During the saving process, all the URLs are converted relative URLs based
on the location of the application file. This way, if the resources are packaged together
with the application file, the entire application can be reliably moved to a different
location.

API access to application saving and loading is provided by means of two static methods on the
gate.util.persistence.PersistenceManager class, listed in table 7.8.

Method

Purpose

public static voidsaveObjectToFile(Object obj,File file)

Saves the data needed to re-create
the provided GATE object to the
specified file. The Object provided
can be any type of Language or
Processing Resource or a Controller.
The procedures may work for other
types of objects as well (e.g. it
supports most Collection types).

public static ObjectloadObjectFromFile(File file)

Parses the file specified (which needs
to be a file created by the above
method) and creates the necessary
object(s) as specified by the data
in the file. Returns the root of the
object tree.

Starting from GATE version 3.1, support for ontologies has been added. Ontologies are nominally
Language Resources but are quite different from documents and corpora and are detailed in
chapter 14.

Classes related to ontologies are to be found in the gate.creole.ontology package and its
sub-packages. The top level package defines an abstract API for working with ontologies while the
sub-packages contain concrete implementations. A client program should only use the classes and
methods defined in the API and never any of the classes or methods from the implementation
packages.

The entry point to the ontology API is the gate.creole.ontology.Ontology interface which is
the base interface for all concrete implementations. It provides methods for accessing the class
hierarchy, listing the instances and the properties.

Ontology implementations are available through plugins. Before an ontology language resource can
be created using the gate.Factory and before any of the classes and methods in the API
can be used, one of the implementing ontology plugins must be loaded. For details see
chapter 14.

An annotation schema (see Section 3.4.6) can be brought inside GATE through the creole.xml file.
By using the AUTOINSTANCE element, one can create instances of resources defined in
creole.xml. The gate.creole.AnnotationSchema (which is the Java representation of an annotation
schema file) initializes with some predefined annotation definitions (annotation schemas) as
specified by the GATE team.

Note: All the elements and their values must be written in lower case, as XML is defined
as case sensitive and the parser used for XML Schema inside GATE searches is case
sensitive.

In order to be able to write XML Schema definitions, the ones defined in GATE
(resources/creole/schema) can be used as a model, or the user can have a look at
http://www.w3.org/2000/10/XMLSchema for a proper description of the semantics of the elements
used.

compile the class, and any others that it uses, into a Java Archive (JAR) file;

write some XML configuration data for the new resource;

tell GATE the URL of the new JAR and XML files.

GATE Developer helps you with this process by creating a set of directories and files that
implement a basic resource, including a Java code file and a Makefile. This process is called
‘bootstrapping’.

For example, let’s create a new component called GoldFish, which will be a Processing Resource
that looks for all instances of the word ‘fish’ in a document and adds an annotation of type
‘GoldFish’.

menu select ‘BootStrap Wizard’, which will pop up the dialogue in figure 7.2. The meaning of the
data entry fields:

The ‘resource name’ will be displayed when GATE Developer loads the resource, and
will be the name of the directory the resource lives in. For our example: GoldFish.

‘Resource package’ is the Java package that the class representing the resource will be
created in. For our example: sheffield.creole.example.

‘Resource type’ must be one of Language, Processing or Visual Resource. In this
case we’re going to process documents (and add annotations to them), so we select
ProcessingResource.

‘Implementing class name’ is the name of the Java class that represents the resource.
For our example: GoldFish.

The ‘interfaces implemented’ field allows you to add other interfaces (e.g.
gate.creole.ControllerAwarePR4)
that you would like your new resource to implement. In this case we just leave the
default (which is to implement the gate.ProcessingResource interface).

The last field selects the directory that you want the new resource created in. For our
example: z:/tmp.

Now we need to compile the class and package it into a JAR file. The bootstrap wizard creates an
Ant build file that makes this very easy – so long as you have Ant set up properly, you can simply
run

ant jar

This will compile the Java source code and package the resulting classes into GoldFish.jar. If you
don’t have your own copy of Ant, you can use the one bundled with GATE - suppose your GATE
is installed at /opt/gate-5.0-snapshot, then you can use /opt/gate-5.0-snapshot/bin/antjar to build.

You can now load this resource into GATE; see Section 3.6. The default Java code that was
created for our GoldFish resource looks like this:

is shown in figure 7.3. GoldFish.java lives in the src/sheffield/creole/example directory.
creole.xml and build.xml are in the top GoldFish directory. The lib directory is for libraries;
the classes directory is where Java class files are placed; the doc directory is for documentation.
These last two, plus GoldFish.jar are created by Ant.

This process has the advantage that it creates a complete source tree and build structure for the
component, and the disadvantage that it creates a complete source tree and build structure for the
component. If you already have a source tree, you will need to chop out the bits you need from the
new tree (in this case GoldFish.java and creole.xml) and copy it into your existing
one.

In order to add a new document format, one needs to extend the gate.DocumentFormat class and
to implement an abstract method called:

1publicvoidunpackMarkup(Documentdoc)throws2DocumentFormatException

This method is supposed to implement the functionality of each format reader and to create
annotations on the document. Finally the document’s old content will be replaced with a new one
containing only the text between markups.

If one needs to add a new textual reader will extend the gate.corpora.TextualDocumentFormat and
override the unpackMarkup(doc) method.

This class needs to be implemented under the Java bean specifications because it will be
instantiated by GATE using Factory.createResource() method.

The init() method that one needs to add and implement is very important because in here the
reader defines its means to be selected successfully by GATE. What one needs to do is to add some
specific information into certain static maps defined in DocumentFormat class, that will be used at
reader detection time.

After that, a definition of the reader will be placed into the one’s creole.xml file and the reader will
be available to GATE.

We present for the rest of the section a complete three step example of adding such a reader. The
reader we describe in here is an XML reader.

Step 1

Create a new class called XmlDocumentFormat that extendsgate.corpora.TextualDocumentFormat.

Step 2

Implement the unpackMarkup(Document doc) which performs the required functionality for the
reader. Add XML detection means in init() method:

GATE Embedded can be used in multithreaded applications, so long as you observe a few
restrictions. First, you must initialise GATE by calling Gate.init() exactly once in your
application, typically in the application startup phase before any concurrent processing threads are
started.

Secondly, you must not make calls that affect the global state of GATE (e.g. loading or unloading
plugins) in more than one thread at a time. Again, you would typically load all the plugins your
application requires at initialisation time. It is safe to create instances of resources in multiple
threads concurrently.

Thirdly, it is important to note that individual GATE processing resources, language
resources and controllers are by design not thread safe – it is not possible to use a single
instance of a controller/PR/LR in multiple threads at the same time – but for a well
written resource it should be possible to use several different instances of the same
resource at once, each in a different thread. When writing your own resource classes you
should bear the following in mind, to ensure that your resource will be useable in this
way.

Avoid static data. Where possible, you should avoid using static fields in your class,
and you should try and take all configuration data via the CREOLE parameters
you declare in your creole.xml file. System properties may be appropriate for truly
static configuration, such as the location of an external executable, but even then it is
generally better to stick to CREOLE parameters – a user may wish to use two different
instances of your PR, each talking to a different executable.

Read parameters at the correct time. Init-time parameters should be read in the init()
(and reInit()) method, and for processing resources runtime parameters should be
read at each execute().

Use temporary files correctly. If your resource makes use of external temporary files
you should create them using File.createTempFile() at init or execute time, as
appropriate. Do not use hardcoded file names for temporary files.

If there are objects that can be shared between different instances of your resource,
make sure these objects are accessed either read-only, or in a thread-safe way. In
particular you must be very careful if your resource can take other resource instances
as init or runtime parameters (e.g. the Flexible Gazetteer, Section 13.7).

Of course, if you are writing a PR that is simply a wrapper around an external library that
imposes these kinds of limitations there is only so much you can do. If your resource cannot be
made safe you should document this fact clearly.

All the standard ANNIE PRs are safe when independent instances are used in different threads
concurrently, as are the standard transient document, transient corpus and controller
classes. A typical pattern of development for a multithreaded GATE-based application
is:

Develop your GATE processing pipeline in GATE Developer.

Save your pipeline as a .gapp file.

In your application’s initialisation phase, load n copies of the pipeline using
PersistenceManager.loadObjectFromFile() (see the Javadoc documentation for
details), or load the pipeline once and then make copies of it using Factory.duplicate
as described in section 7.7, and either give one copy to each thread or store them in a
pool (e.g. a LinkedList).

When you need to process a text, get one copy of the pipeline from the pool, and return
it to the pool when you have finished processing.

Alternatively you can use the Spring Framework as described in the next section to handle the
pooling for you.

GATE Embedded provides helper classes to allow GATE resources to be created and managed by
the Spring framework. For Spring 2.0 or later, GATE Embedded provides a custom namespace
handler that makes them extremely easy to use. To use this namespace, put the following
declarations in your bean definition file:

The gate-home, user-config-file, etc. and the <value> elements under <gate:preload-plugins>
are interpreted as Spring “resource” paths. If the value is not an absolute URL then Spring will
resolve the path in an appropriate way for the type of application context — in a web application
they are taken as being relative to the web app root, and you would typically use locations within
WEB-INF as shown in the example above. To use an absolute path for gate-home it is not
sufficient to use a leading slash (e.g. /opt/gate), for backwards-compatibility reasons Spring will
still resolve this relative to your web application. Instead you must specify it as a full URL, i.e.
file:/opt/gate.

The children of <gate:parameters> are Spring <entry/> elements, just as you would
write when configuring a bean property of type Map<String,Object>. <gate:url>
provides a way to construct a java.net.URL from a resource path as discussed above.
If it is possible to resolve the resource path as a file: URL then this form will be
preferred, as there are a number of areas within GATE which work better with file:
URLs than with other types of URL (for example plugins that run external processes,
or that use a URL parameter to point to a directory in which they will create new
files).

A note about types: The <gate:parameters> and <gate:features> elements define GATE
FeatureMaps. When using the simple <entry key="..." value="..." /> form, the entry values
will be treated as strings; Spring can convert strings into many other types of object using the
standard Java Beans property editor mechanism, but since a FeatureMap can hold any kind of
values you must use an explicit <value type="...">...</value> to tell Spring what type the
value should be.

There is an additional twist for <gate:parameters> – GATE has its own internal logic to convert
strings to other types required for resource parameters (see the discussion of default parameter
values in section 4.7.1). So for parameter values you have a choice, you can either use an
explicit <value type="..."> to make Spring do the conversion, or you can pass the
parameter value as a string and let GATE do the conversion. For resource parameters
whose type is java.net.URL, if you pass a string value that is not an absolute URL
(starting file:, http:, etc.) then GATE will treat the string as a path relative to the
creole.xml file of the plugin that defines the resource type whose parameter you are
setting. If this is not what you intended then you should use <gate:url> to cause Spring
to resolve the path to a URL before passing it to GATE. For example, for a JAPE
transducer, <entry key="grammarURL" value="grammars/main.jape" /> would resolve to
something like file:/path/to/webapp/WEB-INF/plugins/ANNIE/grammars/main.jape,
whereas

‘Customisers’ are used to customise the application after it is loaded. In the example above,
we load a singleton copy of an ontology which is then shared between all the separate
instances of the (prototype) application. The <gate:set-parameter> customiser accepts
all the same ways to provide a value as the standard Spring <property> element (a
”value” or ”ref” attribute, or a sub-element - <value>, <list>, <bean>, <gate:resource>
…).

The <gate:add-pr> customiser provides support for the case where most of the application is in a
saved state, but we want to create one or two extra PRs with Spring (maybe to inject other Spring
beans as init parameters) and add them to the pipeline.

By default, the <gate:add-pr> customiser adds the target PR at the end of the pipeline, but an
add-before or add-after attribute can be used to specify the name of a PR before (or after)
which this PR should be placed. Alternatively, an index attribute places the PR at a specific
(0-based) index into the pipeline. The PR to add can be specified either as a ‘ref’ attribute, or with
a nested <bean> or <gate:resource> element.

The above example defines the <gate:application> as a prototype-scoped bean, which means the
saved application state will be loaded afresh each time the bean is fetched from the bean factory
(either explicitly using getBean or implicitly when it is injected as a dependency of another bean).
However in many cases it is better to load the application once and then duplicate it as required
(as described in section 7.7), as this allows resources to optimise their memory usage, for example
by sharing a single in-memory representation of a large gazetteer list between several
instances of the gazetteer PR. This approach is supported by the <gate:duplicate>
tag.

The <gate:duplicate> tag acts like a prototype bean definition, in that each time it is fetched or
injected it will call Factory.duplicate to create a new duplicate of its template resource
(declared as a nested element or referenced by the template-ref attribute). However the tag also
keeps track of all the duplicate instances it has returned over its lifetime, and will ensure
they are released (using Factory.deleteResource) when the Spring context is shut
down.

The <gate:duplicate> tag also supports customisers, which will be applied to the newly-created
duplicate resource before it is returned. This is subtly different from applying the customisers to
the template resource itself, which would cause them to be applied once to the original resource
before it is first duplicated.

Finally, <gate:duplicate> takes an optional boolean attribute return-template. If set to false
(or omitted, as this is the default behaviour), the tag always returns a duplicate — the original
template resource is used only as a template and is not made available for use. If set to true, the
first time the bean defined by the tag is injected or fetched, the original template resource is
returned. Subsequent uses of the tag will return duplicates. Generally speaking, it is only
safe to set return-template="true" when there are no customisers, and when the
duplicates will all be created up-front before any of them are used. If the duplicates
will be created asynchronously (e.g. with a dynamically expanding pool, see below)
then it is possible that, for example, a template application may be duplicated in one
thread whilst it is being executed by another thread, which may lead to unpredictable
behaviour.

In a multithreaded application it is vital that individual GATE resources are not used in more
than one thread at the same time. Because of this, multithreaded applications that use GATE
Embedded often need to use some form of pooling to provided thread-safe access to GATE
components. This can be managed by hand, but the Spring framework has built-in tools to support
transparent pooling of Spring-managed beans. Spring can create a pool of identical objects, then
expose a single “proxy” object (offering the same interface) for use by clients. Each method call on
the proxy object will be routed to an available member of the pool in such a way as to
guarantee that each member of the pool is accessed by no more than one thread at a
time.

Since the pooling is handled at the level of method calls, this approach is not used to create a pool
of GATE resources directly — making use of a GATE PR typically involves a sequence of method
calls (at least setDocument(doc), execute() and setDocument(null)), and creating a pooling
proxy for the resource may result in these calls going to different members of the pool. Instead the
typical use of this technique is to define a helper object with a single method that internally
calls the GATE API methods in the correct sequence, and then create a pool of these
helpers. The interface gate.util.DocumentProcessor and its associated implementation
gate.util.LanguageAnalyserDocumentProcessor are useful for this. The DocumentProcessor
interface defines a processDocument method that takes a GATE document and performs some
processing on it. LanguageAnalyserDocumentProcessor implements this interface using a GATE
LanguageAnalyser (such as a saved “corpus pipeline” application) to do the processing. A pool of
LanguageAnalyserDocumentProcessor instances can be exposed through a proxy which can then
be called from several threads.

The machinery to implement this is all built into Spring, but the configuration typically required
to enable it is quite fiddly, involving at least three co-operating bean definitions. Since the
technique is so useful with GATE Embedded, GATE provides a special syntax to configure pooling
in a simple way. Given the <gate:duplicate id="theApp"> definition from the previous section
we can create a DocumentProcessor proxy that can handle up to five concurrent requests as
follows:

The <gate:pooled-proxy> element decorates a singleton bean definition. It converts the original
definition to prototype scope and replaces it with a singleton proxy delegating to a pool of
instances of the prototype bean. The pool parameters are controlled by attributes of the
<gate:pooled-proxy> element, the most important ones being:

max-size

The maximum size of the pool. If more than this number of threads try to call
methods on the proxy at the same time, the others will (by default) block until an
object is returned to the pool.

initial-size

The default behaviour of Spring’s pooling tools is to create instances in the pool
on demand (up to the max-size). This attribute instead causes initial-size instances
to be created up-front and added to the pool when it is first created.

when-exhausted-action-name

What to do when the pool is exhausted (i.e. there are
already max-size concurrent calls in progress and another one arrives). Should be set
to one of WHEN_EXHAUSTED_BLOCK (the default, meaning block the excess requests until
an object becomes free), WHEN_EXHAUSTED_GROW (create a new object anyway, even
though this pushes the pool beyond max-size) or WHEN_EXHAUSTED_FAIL (cause the
excess calls to fail with an exception).

Many more options are available, corresponding to the properties of the Spring
CommonsPoolTargetSource class. These allow you, for example, to configure a pool that
dynamically grows and shrinks as necessary, releasing objects that have been idle for a set amount
of time. See the JavaDoc documentation of CommonsPoolTargetSource (and the documentation
for Apache commons-pool) for full details.

Note that the <gate:pooled-proxy> technique is not tied to GATE in any way, it is simply an
easy way to configure standard Spring beans and can be used with any bean that needs to be
pooled, not just objects that make use of GATE.

These custom elements all define various factory beans. For full details, see the JavaDocs for
gate.util.spring (the factory beans) and gate.util.spring.xml (the gate: namespace
handler). The main Spring framework API documentation is the best place to look for more detail
on the pooling facilities provided by Spring AOP.

Note: the former approach using factory methods of the gate.util.spring.SpringFactory
class will still work, but should be considered deprecated in favour of the new factory
beans.

Once initialized, you can create GATE resources using the Factory in the usual way (for example,
see Section 7.1 for an example of how to create an ANNIE application). You should also
read Section 7.13 for important notes on using GATE Embedded in a multithreaded
application.

Instead of an initialization servlet you could also consider doing your initialization in a
ServletContextListener, or using Spring (see Section 7.14).

Groovy is a dynamic programming language based on Java. Groovy is not used in the core GATE
distribution, so to enable the Groovy features in GATE you must first load the Groovy plugin.
Loading this plugin:

provides access to the Groovy scripting console (configured with some extensions for
GATE) from the GATE Developer “Tools” menu.

provides a PR to run a Groovy script over documents.

provides a controller which uses a Groovy DSL to define its execution strategy.

enhances a number of core GATE classes with additional convenience methods that can
be used from any Groovy code including the console, the script PR, and any Groovy
class that uses the GATE Embedded API.

This section describes these features in detail, but assumes that the reader already has some
knowledge of the Groovy language. If you are not already familiar with Groovy you should read
this section in conjunction with Groovy’s own documentation at http://groovy.codehaus.org/.

To help scripting GATE in Groovy, the console is pre-configured to import all classes from the gate,
gate.annotation, gate.util, gate.jape and gate.creole.ontology packages of the core GATE
API5.
This means you can refer to classes and interfaces such as Factory, AnnotationSet, Gate, etc.
without needing to prefix them with a package name. In addition, the following (read-only)
variable bindings are pre-defined in the Groovy Console.

corpora: a list of loaded corpora LRs (Corpus)

docs: a list of all loaded document LRs (DocumentImpl)

prs: a list of all loaded PRs

apps: a list of all loaded Applications (AbstractController)

These variables are automatically updated as resources are created and deleted in GATE.

Here’s an example script. It finds all documents with a feature “annotator” set to “fred”, and puts
them in a new corpus called “fredsDocs”.

Why won’t the ‘Groovy executing’ dialog go away?
Sometimes, when you execute a Groovy script through the console, a dialog will appear, saying
“Groovy is executing. Please wait”. The dialog fails to go away even when the script has ended,
and cannot be closed by clicking the “Interrupt” button. You can, however, continue to use the
Groovy Console, and the dialog will usually go away next time you run a script. This is not a
GATE problem: it is a Groovy problem.

The Groovy scripting PR enables you to load and execute Groovy scripts as part of a GATE
application pipeline. The Groovy scripting PR is made available when you load the Groovy plugin
via the plugin manager.

inputASName: an optional annotation set intended to be used as input by the PR
(but note that the PR has access to all annotation sets)

outputASName: an optional annotation set intended to be used as output by the
PR (but note that the PR has access to all annotation sets)

scriptParams: optional parameters for the script. In a creole.xml file, these should
be specified as key=value pairs, each pair separated by a comma. For example:
’name=fred,type=person’ . In the GATE GUI, these are specified via a dialog.

As with the Groovy console described above, and with JAPE right-hand-side Java code, Groovy
scripts run by the scripting PR implicitly import all classes from the gate, gate.annotation,
gate.util, gate.jape and gate.creole.ontology packages of the core GATE API. The Groovy
scripting PR also makes available the following bindings, which you can use in your
scripts:

doc: the current document (Document)

corpus: the corpus containing the current document

content: the string content of the current document

inputAS: the annotation set specified by inputASName in the PRs runtime parameters

outputAS: the annotation set specified by outputASName in the PRs runtime
parameters

Note that inputAS and outputAS are intended to be used as input and output AnnotationSets.
This is, however, a convention: there is nothing to stop a script writing to or reading from any
AnnotationSet. Also, although the script has access to the corpus containing the document it is
running over, it is not generally necessary for the script to iterate over the documents in the corpus
itself – the reference is provided to allow the script to access data stored in the FeatureMap of the
corpus. Any other variables assigned to within the script code will be added to the
binding, and values set while processing one document can be used while processing a later
one.

In addition to the above bindings, one further binding is available to the script:

scriptParams: a FeatureMap with keys and values as specified by the scriptParams
runtime parameter

For example, if you were to create a scriptParams runtime parameter for your PR, with the keys
and values: ’name=fred,type=person’, then the values could be retrieved in your script via
scriptParams.name and scriptParams.type

A Groovy script may wish to do some pre- or post-processing before or after processing the
documents in a corpus, for example if it is collecting statistics about the corpus. To support this,
the script can declare methods beforeCorpus and afterCorpus, taking a single parameter. If the
beforeCorpus method is defined and the script PR is running in a corpus pipeline application, the
method will be called before the pipeline processes the first document. Similarly, if the
afterCorpus method is defined it will be called after the pipeline has completed processing of all
the documents in the corpus. In both cases the corpus will be passed to the method as
a parameter. If the pipeline aborts with an exception the afterCorpus method will
not be called, but if the script declares a method aborted(c) then this will be called
instead.

Note that because the script is not processing a particular document when these methods are
called, the usual doc, corpus, inputAS, etc. are not available within the body of the methods
(though the corpus is passed to the method as a parameter). The scriptParams variable is
available.

The following example shows how this technique could be used to build a simple tf/idf
index for a GATE corpus. The example is available in the GATE distribution as
plugins/Groovy/resources/scripts/tfidf.groovy. The script makes use of some of the utility
methods described in section 7.16.4.

The script needs to have the runtime parameter scriptParams set with keys and values as
follows:

regex: the Groovy regular expression that you want to match e.g. [^\s]*ing

type: the type of the annotation to create for each regex match, e.g. regexMatch

When the PR is run over a document, the script will first make a matcher over the document
content for the regular expression given by the regex parameter. It will iterate over all matches for
this regular expression, adding a new annotation for each, with a type as given by the type
parameter.

The Groovy plugin’s “Scriptable Controller” is a more flexible alternative to the standard pipeline
(SerialController) and corpus pipeline (SerialAnalyserController) applications and their
conditional variants. Like the standard controllers, a scriptable controller contains a list of
processing resources and can optionally be configured with a corpus, but unlike the standard
controllers it does not necessarily execute the PRs in a linear order. Instead the execution strategy
is controlled by a script written in a Groovy domain specific language (DSL), which is detailed in
the following sections.

To run a single PR from the scriptable controller’s list of PRs, simply use the PR’s name as a
Groovy method call:

1somePr()2"ANNIEEnglishTokeniser"()

If the PR’s name contains spaces or any other character that is not valid in a Groovy identifier, or
if the name is a reserved word (such as “import”) then you must enclose the name in single or
double quotes. You may prefer to rename the PRs so their names are valid identifiers. Also, if there
are several PRs in the controller’s list with the same name, they will all be run in the order in
which they appear in the list.

You can optionally provide a Map of named parameters to the call, and these will override the
corresponding runtime parameter values for the PR (the original values will be restored after the
PR has been executed):

If a corpus has been provided to the controller then you can iterate over all the documents in the
corpus using eachDocument:

1eachDocument{2tokeniser()3sentenceSplitter()4myTransducer()5}

The block of code (in fact a Groovy closure) is executed once for each document in the corpus
exactly as a standard corpus pipeline application would operate. The current document is available
to the script in the variable doc and the corpus in the variable corpus, and in addition any calls to
PRs that implement the LanguageAnalyser interface will set the PR’s document and corpus
parameters appropriately.

Calling allPRs() will execute all the controller’s PRs once in the order in which they appear in
the list. This is rarely useful in practice but it serves to define the default behaviour:
the initial script that is used by default in a newly instantiated scriptable controller is
eachDocument{allPRs()}, which mimics the behaviour of a standard corpus pipeline
application.

The basic DSL is extremely simple, but because the script is Groovy code you can use all the other
facilities of the Groovy language to do conditional execution, grouping of PRs, etc. The control
script has the same implicit imports as provided by the Groovy Script PR (section 7.16.2), and
additional import statements can be added as required.

For example, suppose you have a pipeline for multi-lingual document processing, containing PRs
named “englishTokeniser”, “englishGazetteer”, “frenchTokeniser”, “frenchGazetteer”,
“genericTokeniser”, etc., and you need to choose which ones to run based on a document feature:

As another example, suppose you have a particular JAPE grammar that you know is slow on
documents that mention a large number of locations, so you only want to run it on documents
with up to 100 Location annotations, and use a faster but less accurate one on others:

You can have more than one call to eachDocument, for example a controller that pre-processes some
documents, then collects some corpus-level statistics, then further processes the documents based
on those statistics.

As a final example, consider a controller to post-process data from a manual annotation task. Some
of the documents have been annotated by one annotator, some by more than one (the annotations
are in sets named “annotator1”, “annotator2”, etc., but the number of sets varies from document
to document).

Like the standard SerialAnalyserController, the scriptable controller implements the
LanugageAnalyser interface and so can itself be nested as a PR in another pipeline. When used in
this way, eachDocument does not iterate over the corpus but simply calls its closure once, with the
“current document” set to the document that was passed to the controller as a parameter. This is
the same logic as is used by SerialAnalyserController, which runs its PRs once only rather than
once per document in the corpus.

When you double-click on a scriptable controller in the resources tree of GATE Developer you see
the same controller editor that is used by the standard controllers. This view allows you to add
PRs to the controller and set their default runtime parameter values, and to specify the corpus
over which the controller should run. A separate view is provided to allow you to edit the Groovy
script, which is accessible via the “Control Script” tab (see figure 7.4). This tab provides a text
editor which does basic Groovy syntax highlighting (the same editor used by the Groovy Console).

Loading the Groovy plugin adds some additional methods to several of the core GATE API classes
and interfaces using the Groovy “mixin” mechanism. Any Groovy code that runs after the plugin
has been loaded can make use of these additional methods, including snippets run in the Groovy
console, scripts run using the Script PR, and any other Groovy code that uses the GATE
Embedded API.

The methods that are injected come from two classes. The gate.Utils class (part of the core GATE
API in gate.jar) defines a number of static methods that can be used to simplify common tasks
such as getting the string covered by an annotation or annotation set, finding the start or end
offset of an annotation (or set), etc. These methods do not use any Groovy-specific
types, so they are usable from pure Java code in the usual way as well as being mixed
in for use in Groovy. Additionally, the class gate.groovy.GateGroovyMethods (part
of the Groovy plugin) provides methods that use Groovy types such as closures and
ranges.

The added methods include:

Unified access to the start and end offsets of an Annotation, AnnotationSet or
Document: e.g. someAnnotation.start() or anAnnotationSet.end()

Simple
access to the DocumentContent or string covered by an annotation or annotation set:
document.stringFor(anAnnotation), document.contentFor(annotationSet)

Simple access to the length of an annotation or document, either as an int
(annotation.length()) or a long (annotation.lengthLong()).

A method
to construct a FeatureMap from any map, to support constructions like def params= [sourceUrl:’http://gate.ac.uk’, encoding:’UTF-8’].toFeatureMap()

A method to convert an annotation set into a List of annotations in the
order they appear in the document, for iteration in a predictable order:
annSet.inDocumentOrder().collect { it.type }

The each, eachWithIndex and collect methods for a corpus have been redefined to
properly load and unload documents if the corpus is stored in a datastore.

Various getAt methods to support constructions like annotationSet["Token"] (get
all Token annotations from the set), annotationSet[15..20] (get all annotations
between offsets 15 and 20), documentContent[0..10] (get the document content
between offsets 0 and 10).

A withResource method for any resource, which calls a closure with the
resource passed as a parameter, and ensures that the resource is properly
deleted when the closure completes (analagous to the default Groovy method
InputStream.withStream).

For full details, see the source code or javadoc documentation for these two classes.

To write the config data back to the XML file: Gate.writeUserConfig();.

Note that new config data will simply override old values, where the keys are the same. In this way
defaults can be set up by putting their values in the main gate.xml file, or the site gate.xml file;
they can then be overridden by the user’s gate.xml file.

If we have annotations about the same subject on the same document from different annotators,
we may need to merge those annotations to form a unified annotation. Two approaches
for merging annotations are implemented in the API, via static methods in the class
gate.util.AnnotationMerging.

The two methods have very similar input and output parameters. Each of the methods takes an
array of annotation sets, which should be the same annotation type on the same document from
different annotators, as input. A single feature can also be specified as a parameter (or given asnull
if no feature is to be specified).

The output is a map, the key of which is one merged annotation and the value of which represents
the annotators (in terms of the indices of the array of annotation sets) who support the
annotation. The methods also have a boolean input parameter to indicate whether or not the
annotations from different annotators are based on the same set of instances, which can be
determined by the static method public boolean isSameInstancesForAnnotators(AnnotationSet[]annsA) in the class gate.util.IaaCalculation. One instance corresponds to all the
annotations with the same span. If the annotation sets are based on the same set of instances,
the merging methods will ensure that the merged annotations are on the same set of
instances.

The two methods corresponding to those described for the Annotation Merging plugin described in
Section 19.12. They are:

The Method public static void mergeAnnogation(AnnotationSet[] annsArr, StringnameFeat,HashMap<Annotation,String>mergeAnns, intnumMinK, boolean isTheSameInstances) merges the annotations stored in the array
annsArr. The merged annotation is put into the map mergeAnns, with a key of the
merged annotation and value of a string containing the indices of elements in the
annotation set array annsArr which contain that annotation. NumMinK specifies the
minimal number of the annotators supporting one merged annotation. The boolean
parameter isTheSameInstances indicate if or not those annotation sets for merging are
based on the same instances.

Method public static void mergeAnnogationMajority(AnnotationSet[] annsArr, StringnameFeat, HashMap<Annotation, String>mergeAnns, boolean isTheSameInstances)
selects the annotations which the majority of the annotators agree on. The meanings
of parameters are the same as those in the above method.

1CREOLE stands for Collection of REusable Objects for Language Engineering