As soon as more-or-less faithful replication has evolved, then natural selection
begins to work. To say this is not to invoke some magic principle, some deus ex
machina; natural selection in this sense is a logical necessity, not a theory waiting
to be proved. It is inevitable that those cells more efficient at capturing and using
energy, and of replicating more faithfully, would survive and their progeny spread;
those less efficient would tend to die out, their contents re-absorbed and used by
others. Two great evolutionary processes occur simultaneously. The one, beloved
by many popular science writers, is about competition, the struggle for existence
between rivals. Darwin begins here, and orthodox Darwinians tend both to begin
and end here. But the second process, less often discussed today, perhaps because
less in accord with the spirit of the times, is about co-operation, the teaming up
of cells with particular specialisms to work together. For example, one type of
cell may evolve a set of enzymes enabling it to metabolise molecules produced
as waste material by another. There are many such examples of symbiosis
in today’s multitudinous world. Think, amongst the most obvious, of the
complex relationships we have with the myriad bacteria – largely Escherichia coli
– that inhabit our own guts, and without whose co-operation in our digestive
processes we would be unable to survive. In extreme cases, cells with different
specific specialisms may even merge to form a single organism combining both,
a process called symbiogenesis.

Symbiogenesis is now believed to have been the origin of mitochondria, the
energy-converting structures present in all of today’s cells, as well as the
photosynthesising chloroplasts present in green plants.

Steven Rose, The Future of the Brain: The Promise and Perils of Tomorrow’s
Neuroscience, 2005 (p. 18).

The majority of GATE plugins work well on any English language document (see Chapter 15 for
details on non-English language support). Some domains, however, produce documents that use
unusual terms, phrases or syntax. In such cases domain-specific processing resources are often
required in order to extract useful or interesting information. This chapter documents GATE
resources that have been developed for specific domains.

Documents from the biomedical domain pose a number of challenges, including a highly specialised
vocabulary, words that mix case and digits and so require unusual tokenization, and
common English words used in a domain-specific sense. Many of these problems can only be
solved through the use of domain-specific resources.

Some of the processing resources documented elsewhere in this user guide can be adapted with
little or no effort to help with processing biomedical documents. The Large Knowledge Base
Gazetteer (Section 13.9) can be initialized against a biomedical ontology such as Linked Life Data
in order to annotate many different domain-specific concepts. The Language Identification PR
(Section 15.1) can also be trained to differentiate between document domains instead of languages,
which could help target specific resources to specific documents using a conditional corpus
pipeline.

Many plugins can also be used “as is” to extract information from biomedical documents. For
example, the Measurements Tagger (Section 21.6) can be used to extract information about the
dose of a medication, or the weight of patients in a study.

The rest of this section, however, documents the resources included with GATE which are focused
purely on processing biomedical documents.

ABNER is A Biomedical Named Entity Recogniser [Settles 05]. It uses machine learning
(linear-chain conditional random fields, CRFs) to find entities such as genes, cell types, and DNA
in text. Full details of ABNER can be found at http://pages.cs.wisc.edu/~bsettles/abner/

To use ABNER within GATE, first load the Tagger_Abner plugin through the plugins
console, and then create a new ABNER Tagger PR in the usual way. The ABNER Tagger
PR has no initialization parameters and it does not require any other PRs to be run
prior to execution. Configuration of the tagger is performed using the following runtime
parameters:

abnerMode The ABNER model that will be used for tagging. The plugin can use one of
two previously trained machine learning models for tagging text, as provided by
ABNER:

BIOCREATIVE trained on the BioCreative corpus

NLPBA trained on the NLPBA corpus

annotationName The name of the annotations the tagger should create (defaults to
‘Tagger’). If left blank (or null) the name of each annotation is determined by the type of
entity discovered by ABNER (see below).

outputASName The name of the annotation set in which new annotations will be
created.

The tagger ﬁnds and annotates entities of the following types:

Protein

DNA

RNA

CellLine

CellType

If an annotationName is specified then these types will appear as features on the created
annotations; otherwise they will be used as the names of the annotations themselves.
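The naming rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the plugin's code: the function name_annotation and the feature key "type" are hypothetical names chosen for the example.

```python
def name_annotation(entity_type, annotation_name="Tagger"):
    """Return (annotation name, features) for one entity found by the tagger.

    Assumption: the feature key "type" is made up for this sketch; the
    feature name used by the real plugin is not stated in this section.
    """
    if annotation_name:
        # A name was given: every annotation shares it, and the entity
        # type is carried as a feature.
        return annotation_name, {"type": entity_type}
    # Blank or null name: the entity type becomes the annotation name.
    return entity_type, {}
```

With the default name, a Protein entity yields an annotation called Tagger carrying the entity type as a feature; with a blank or null name, it yields an annotation called Protein.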

ABNER does support training of models on other data, but this functionality is not
supported by the GATE wrapper.

MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS
Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus [Aronson &
Lang 10].

The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to
communicate with a remote (or local) MetaMap PrologBeans mmserver10 and MetaMap
distribution. This allows the content of specified annotations (or the entire document
content) to be processed by MetaMap and the results converted to GATE annotations and
features.

To use this plugin, you will need access to a remote MetaMap server, or install one locally by
downloading and installing the complete distribution:

The default mmserver10 host and port are localhost and 8066. To use a
different server host and/or port, see the MetaMap Java API documentation and specify the
--metamap_server_host and --metamap_server_port options within the metaMapOptions
run-time parameter.
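As a rough illustration, the value of metaMapOptions for a non-default server could be assembled as below. This is a sketch under stated assumptions: the helper build_metamap_options is hypothetical, the option names are written with two ASCII hyphens, and each option is assumed to be followed by its value after a space; check the MetaMap Java API documentation for the exact syntax.

```python
def build_metamap_options(base_options="-Xdt", host=None, port=None):
    """Build a metaMapOptions string (hypothetical helper).

    Assumption: the host and port options take their value as the next
    space-separated token. "-Xdt" is the default stated in this guide.
    """
    parts = [base_options] if base_options else []
    if host is not None:
        parts += ["--metamap_server_host", host]
    if port is not None:
        parts += ["--metamap_server_port", str(port)]
    return " ".join(parts)
```

Calling the helper with no arguments returns the default options unchanged, so a local server on the default port needs no extra configuration.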

annotateNegEx: set this to true to add NegEx features to annotations (NegExType
and NegExTrigger). See http://code.google.com/p/negex/ for more information
on NegEx

annotatePhrases: set to true to output MetaMap phrase-level annotations (generally
noun-phrase chunks). Only phrases containing a MetaMap mapping will be annotated.
Can be useful for post-coordination of phrase-level terms that do not exist in a
pre-coordinated form in UMLS.

inputASName: input Annotation Set name. Use in conjunction with inputASTypes
(see below). Unless specified, the entire document content will
be sent to MetaMap.

inputASTypes: only send the content of these annotations within inputASName to
MetaMap and add new MetaMap annotations inside each. Unless specified, the entire
document content will be sent to MetaMap.

inputASTypeFeature: send the content of this feature within inputASTypes
to MetaMap and wrap a new MetaMap annotation around each annotation in
inputASTypes. If the feature is empty or does not exist, then the annotation content
is sent instead.

metaMapOptions: set parameter-less MetaMap options here. Default is -Xdt
(truncate Candidates mappings, disallow derivational variants and do not use full
text parsing). See http://metamap.nlm.nih.gov/README_javaapi.html for more
details. NB: only set the -y parameter (word-sense disambiguation) if wsdserverctl
is running.

outputASName: output Annotation Set name.

outputASType: output annotation name to be used for all MetaMap annotations

outputMode: determines which mappings are output as annotations in the GATE
document, for each phrase:

AllCandidatesAndMappings: annotate both Candidate and final mappings.
This will usually result in multiple, overlapping annotations for each term/phrase.

AllMappings: annotate all the final MetaMap Mappings for each phrase. This
will result in fewer annotations with higher precision (e.g. for ‘lung cancer’ only
the complete phrase will be annotated as Neoplastic Process [neop]).

HighestMappingOnly: annotate only the highest scoring MetaMap Mapping
for each phrase. If two Mappings have the same score, the first returned by
MetaMap is output.

HighestMappingLowestCUI: Where there is more than one highest-scoring
mapping, return the mapping where the head word/phrase map event has the
lowest CUI.

HighestMappingMostSources: Where there is more than one highest-scoring
mapping, return the mapping where the head word/phrase map event has the
highest number of source vocabulary occurrences.

AllCandidates: annotate all Candidate mappings and not the final Mappings.
This will result in more annotations with less precision (e.g. for ‘lung cancer’ both
‘lung’ (bpoc) and ‘lung cancer’ (neop) will be annotated).
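The way the mapping-based modes choose among the final Mappings for one phrase can be sketched as follows. This is a simplified illustration, not the plugin's implementation: the Mapping record, its field names and the select_mappings function are hypothetical, scores are treated as plain numbers where larger is better (MetaMap's own score representation may differ), and CUIs are compared as equal-length strings. The candidate-based modes are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical stand-in for one final MetaMap Mapping of a phrase."""
    cui: str      # concept identifier of the head word/phrase
    score: int    # assumption: larger means better
    sources: int  # number of source vocabulary occurrences


def select_mappings(mappings, mode):
    """Return the Mappings that would be annotated for one phrase."""
    if not mappings:
        return []
    if mode == "AllMappings":
        return list(mappings)
    best = max(m.score for m in mappings)
    # Keep the order in which the mappings were returned.
    top = [m for m in mappings if m.score == best]
    if mode == "HighestMappingOnly":
        return [top[0]]  # ties: first returned wins
    if mode == "HighestMappingLowestCUI":
        return [min(top, key=lambda m: m.cui)]
    if mode == "HighestMappingMostSources":
        return [max(top, key=lambda m: m.sources)]
    raise ValueError("mode not covered by this sketch: " + mode)
```

The three HighestMapping modes differ only in how a tie for the best score is broken, which is why they share the same filtering step.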

taggerMode: determines whether all term instances are processed by MetaMap, the first
instance only, or the first instance with coreference annotations added. Only used if the
inputASTypes parameter has been set.

FirstOccurrenceOnly: only process and annotate the first instance of each term
in the document.

CoReference: process and annotate the first instance, and add coreference
annotations to the following instances.
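The difference between the two modes listed above can be sketched as a small planning function. This is an illustration only: plan_tagging and its action labels are hypothetical, terms are matched by exact string equality (an assumption), and any other mode value is treated here as processing every instance.

```python
def plan_tagging(terms, mode):
    """Decide what happens to each term instance, in document order.

    Returns (index, action, first_index) tuples, where action is
    "process", "skip" or "coreference" (labels made up for this sketch).
    """
    first_seen = {}
    plan = []
    for index, term in enumerate(terms):
        if term not in first_seen:
            # First instance of a term is always processed.
            first_seen[term] = index
            plan.append((index, "process", None))
        elif mode == "FirstOccurrenceOnly":
            plan.append((index, "skip", None))
        elif mode == "CoReference":
            # Later instances point back to the first one.
            plan.append((index, "coreference", first_seen[term]))
        else:
            plan.append((index, "process", None))
    return plan
```

Both modes process each distinct term once, which is what reduces the number of calls made for documents with many repeated terms.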

Support for using AbGene [Tanabe & Wilbur 02] (a modified version of the Brill tagger) to
annotate gene names within GATE is provided by the Tagger Framework plugin (Section
21.3).

AbGene needs to be downloaded and installed externally to GATE, and the example AbGene
GATE application, provided in the resources directory of the Tagger Framework plugin,
then needs to be modified accordingly.

A number of different biomedical language processing tools have been developed under the
auspices of the GENIA Project. Support is provided within GATE for using both the GENIA
sentence splitter and the tagger, which provides tokenization, part-of-speech tagging, shallow
parsing and named entity recognition.

To use either the GENIA sentence splitter or tagger within GATE, you need to have
downloaded and compiled the appropriate programs, which can then be called by the GATE PRs.

The GATE GENIA plugin provides the sentence splitter PR. The PR is configured through the
following runtime parameters:

annotationSetName the name of the annotation set in which the Sentence
annotations should be created

debug if true then details of calling the external process will be reported within the
message pane

splitterBinary the location of the GENIA sentence splitter binary

Support for the GENIA tagger within GATE is handled by the Tagger Framework which is
documented in Section 21.3.

Together these two components in a GATE pipeline provide a biomedical equivalent of ANNIE (minus
the orthographic coreference component). Such a pipeline is provided as an example within the GENIA
plugin (see footnote 4).

For more details on the GENIA tagger and its performance over biomedical text see [Tsuruoka et al. 05].

NormaGene is a web service, provided by the BiTeM group in Geneva. The service provides tools
for both gene tagging and normalization, although currently only tagging is supported by this
GATE wrapper.

The NormaGene Tagger PR is configured via two runtime parameters as follows:

annotationSetName the name of the annotation set in which the Gene annotations
should be created.

threshold the threshold at which an entity will be considered a gene (defaults to
0.6). Lower the threshold for short text input to get better results.
Tuning the threshold down helps to find complicated gene names in the text, but it
also increases the time taken to process the text.
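The effect of the threshold can be pictured with a small filter. This is a sketch, not the web service's logic: filter_genes is a hypothetical function, the candidate scores are invented for the example, and treating a score equal to the threshold as a gene is an assumption.

```python
def filter_genes(candidates, threshold=0.6):
    """Keep (text, score) candidates whose score reaches the threshold.

    Assumption: a score equal to the threshold counts as a gene.
    0.6 is the default stated in this guide.
    """
    return [(text, score) for text, score in candidates if score >= threshold]
```

Lowering the threshold can only add entities to the result, never remove them, which matches the advice to tune it down when gene names are being missed.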

4. The plugin contains a saved application, genia.xgapp, which includes both components. The runtime parameters of both components will need changing to point to your locally installed copies of the GENIA applications.