As soon as more-or-less faithful replication has evolved, then natural selection
begins to work. To say this is not to invoke some magic principle, some deus ex
machina; natural selection in this sense is a logical necessity, not a theory waiting
to be proved. It is inevitable that those cells more eﬃcient at capturing and using
energy, and of replicating more faithfully, would survive and their progeny spread;
those less eﬃcient would tend to die out, their contents re-absorbed and used by
others. Two great evolutionary processes occur simultaneously. The one, beloved
by many popular science writers, is about competition, the struggle for existence
between rivals. Darwin begins here, and orthodox Darwinians tend both to begin
and end here. But the second process, less often discussed today, perhaps because
less in accord with the spirit of the times, is about co-operation, the teaming up
of cells with particular specialisms to work together. For example, one type of
cell may evolve a set of enzymes enabling it to metabolise molecules produced
as waste material by another. There are many such examples of symbiosis in
today’s multitudinous world. Think, amongst the most obvious, of the complex
relationships we have with the myriad bacteria – largely Escherichia coli – that
inhabit our own guts, and without whose co-operation in our digestive processes
we would be unable to survive. In extreme cases, cells with diﬀerent speciﬁc
specialisms may even merge to form a single organism combining both, a process
called symbiogenesis.

Symbiogenesis is now believed to have been the origin of mitochondria, the
energy-converting structures present in all of today’s cells, as well as the
photosynthesising chloroplasts present in green plants.

Steven Rose, The Future of the Brain: The Promise and Perils of Tomorrow’s
Neuroscience, 2005 (p. 18).

The majority of GATE plugins work well on any English-language document (see Chapter 15 for
details on non-English language support). Some domains, however, produce documents that use
unusual terms, phrases or syntax. In such cases domain speciﬁc processing resources are often
required in order to extract useful or interesting information. This chapter documents GATE
resources that have been developed for speciﬁc domains.

Documents from the biomedical domain oﬀer a number of challenges, including a highly specialised
vocabulary, words that include mixed case and numbers requiring unusual tokenization, as well as
common English words used with a domain-speciﬁc sense. Many of these problems can only be
solved through the use of domain-speciﬁc resources.
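
As a rough illustration of the tokenization problem, the following Python sketch contrasts a naive tokenizer with a slightly more domain-aware one. Both regular expressions are assumptions made for this example; they are not the rules used by any GATE tokenizer.

```python
import re

# Naive rule: letters, digits and punctuation always become separate tokens.
NAIVE = re.compile(r"[A-Za-z]+|\d+|\S")

# Domain-aware rule (hypothetical): keep runs of letters and digits together,
# including parts joined by an internal hyphen, e.g. "IL-2" or "p53".
BIOMED = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|\S")


def tokenize(pattern, text):
    """Return the list of non-whitespace tokens matched by `pattern`."""
    return pattern.findall(text)


if __name__ == "__main__":
    text = "IL-2 activates p53."
    print(tokenize(NAIVE, text))   # ['IL', '-', '2', 'activates', 'p', '53', '.']
    print(tokenize(BIOMED, text))  # ['IL-2', 'activates', 'p53', '.']
```

The naive rule breaks gene and protein names into fragments that later components can no longer recognise, which is one reason domain-specific resources are needed.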

Some of the processing resources documented elsewhere in this user guide can be adapted with
little or no eﬀort to help with processing biomedical documents. The Large Knowledge Base
Gazetteer (Section 13.9) can be initialized against a biomedical ontology such as Linked Life Data
in order to annotate many diﬀerent domain-speciﬁc concepts. The Language Identiﬁcation PR
(Section 15.1) can also be trained to diﬀerentiate between document domains instead of languages,
which could help target speciﬁc resources to speciﬁc documents using a conditional corpus
pipeline.

Many plugins can also be used “as is” to extract information from biomedical documents. For
example, the Measurements Tagger (Section 23.7) can be used to extract information about the
dose of a medication, or the weight of patients participating in a study.

The rest of this section, however, documents the resources, included with or available to GATE,
that focus purely on processing biomedical documents.

ABNER is A Biomedical Named Entity Recogniser [Settles 05]. It uses machine learning
(linear-chain conditional random ﬁelds, CRFs) to ﬁnd entities such as genes, cell types, and DNA
in text. Full details of ABNER can be found at http://pages.cs.wisc.edu/~bsettles/abner/.

To use ABNER within GATE, ﬁrst load the Tagger_Abner plugin through the plugins
console, and then create a new ABNER Tagger PR in the usual way. The ABNER Tagger
PR has no initialization parameters and it does not require any other PRs to be run
prior to execution. Conﬁguration of the tagger is performed using the following runtime
parameters:

abnerMode The ABNER model that will be used for tagging. The plugin can use one of
two previously trained machine learning models for tagging text, as provided by
ABNER:

BIOCREATIVE trained on the BioCreative corpus

NLPBA trained on the NLPBA corpus

annotationName The name of the annotations the tagger should create (defaults to
‘Tagger’). If left blank (or null) the name of each annotation is determined by the type of
entity discovered by ABNER (see below).

outputASName The name of the annotation set in which new annotations will be
created.

The tagger ﬁnds and annotates entities of the following types:

Protein

DNA

RNA

CellLine

CellType

If an annotationName is speciﬁed then these types will appear as features on the created
annotations, otherwise they will be used as the names of the annotations themselves.
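
The effect of annotationName can be sketched in a few lines of Python. This is only a model of the behaviour described above, not the plugin's code: the dictionary layout and the feature key "type" are assumptions made for illustration.

```python
def make_annotation(entity_type, start, end, annotation_name="Tagger"):
    """Model how an ABNER entity could be turned into an annotation.

    If annotation_name is set, every annotation gets that name and the
    entity type is stored as a feature (the key "type" is hypothetical).
    If it is blank or None, the entity type becomes the annotation name.
    """
    if annotation_name:
        return {"name": annotation_name, "start": start, "end": end,
                "features": {"type": entity_type}}
    return {"name": entity_type, "start": start, "end": end, "features": {}}


if __name__ == "__main__":
    print(make_annotation("Protein", 0, 4))
    print(make_annotation("Protein", 0, 4, annotation_name=None))
```

With the default name, downstream rules match a single annotation type and inspect the feature; with a blank name, they match Protein, DNA, RNA, CellLine or CellType annotations directly.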

ABNER does support training of models on other data, but this functionality is not
supported by the GATE wrapper.

MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS
Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus [Aronson &
Lang 10].

The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to
communicate with a remote (or local) MetaMap PrologBeans mmserver and MetaMap
distribution. This allows the content of speciﬁed annotations (or the entire document
content) to be processed by MetaMap and the results converted to GATE annotations and
features.

To use this plugin, you will need access to a remote MetaMap server, or install one locally by
downloading and installing the complete distribution:

The default mmserver host and port are localhost and 8066. To use a
different server location and/or port, see the above API documentation and specify the
--metamap_server_host and --metamap_server_port options within the metaMapOptions
run-time parameter.

annotateNegEx: set this to true to add NegEx features to annotations (NegExType
and NegExTrigger). See http://code.google.com/p/negex/ for more information
on NegEx

annotatePhrases: set to true to output MetaMap phrase-level annotations (generally
noun-phrase chunks). Only phrases containing a MetaMap mapping will be annotated.
Can be useful for post-coordination of phrase-level terms that do not exist in a
pre-coordinated form in UMLS.

inputASName: input Annotation Set name. Use in conjunction
with inputASTypes (see below). Unless specified, the entire document content will
be sent to MetaMap.

inputASTypes: only send the content of these annotations within inputASName to
MetaMap and add new MetaMap annotations inside each. Unless speciﬁed, the entire
document content will be sent to MetaMap.

inputASTypeFeature: send the content of this feature within inputASTypes
to MetaMap and wrap a new MetaMap annotation around each annotation in
inputASTypes. If the feature is empty or does not exist, then the annotation content
is sent instead.

metaMapOptions: set parameter-less MetaMap options here. Default is -Xdt
(truncate Candidates mappings, disallow derivational variants and do not use full
text parsing). See http://metamap.nlm.nih.gov/README_javaapi.html for more
details. NB: only set the -y parameter (word-sense disambiguation) if wsdserverctl
is running.

outputASName: output Annotation Set name.

outputASType: output annotation name to be used for all MetaMap annotations

outputMode: determines which mappings are output as annotations in the GATE
document, for each phrase:

AllCandidatesAndMappings: annotate both Candidate and ﬁnal mappings.
This will usually result in multiple, overlapping annotations for each term/phrase

AllMappings: annotate all the final MetaMap Mappings for each phrase. This
will result in fewer annotations with higher precision (e.g. for ‘lung cancer’ only
the complete phrase will be annotated as Neoplastic Process [neop])

HighestMappingOnly: annotate only the highest scoring MetaMap Mapping
for each phrase. If two Mappings have the same score, the ﬁrst returned by
MetaMap is output.

HighestMappingLowestCUI: Where there is more than one highest-scoring
mapping, return the mapping where the head word/phrase map event has the
lowest CUI.

HighestMappingMostSources: Where there is more than one highest-scoring
mapping, return the mapping where the head word/phrase map event has the
highest number of source vocabulary occurrences.

AllCandidates: annotate all Candidate mappings and not the final Mappings.
This will result in more annotations with less precision (e.g. for ‘lung cancer’ both
‘lung’ (bpoc) and ‘lung cancer’ (neop) will be annotated).

taggerMode: determines whether all term instances are processed by MetaMap, the ﬁrst
instance only, or the ﬁrst instance with coreference annotations added. Only used if the
inputASTypes parameter has been set.

FirstOccurrenceOnly: only process and annotate the ﬁrst instance of each term
in the document

CoReference: process and annotate the ﬁrst instance and coreference following
instances
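
The outputMode values listed above differ only in which results are kept for a phrase. The Python sketch below models that selection under several simplifying assumptions: the Concept class and its fields are hypothetical, each mapping is reduced to its head concept, a larger score is taken to mean a better match, and the CUIs are placeholders. It is not the plugin's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    text: str      # matched text
    cui: str       # concept identifier (placeholder values in this sketch)
    score: int     # assumption: larger means a better match
    sources: int   # number of source vocabulary occurrences


def select(mode, candidates, mappings):
    """Return the concepts that would be annotated for one phrase."""
    if mode == "AllCandidatesAndMappings":
        return list(candidates) + list(mappings)
    if mode == "AllCandidates":
        return list(candidates)
    if mode == "AllMappings":
        return list(mappings)
    if not mappings:
        return []
    best = max(m.score for m in mappings)
    top = [m for m in mappings if m.score == best]  # keeps MetaMap's order
    if mode == "HighestMappingOnly":
        return [top[0]]                              # first returned wins ties
    if mode == "HighestMappingLowestCUI":
        return [min(top, key=lambda m: m.cui)]
    if mode == "HighestMappingMostSources":
        return [max(top, key=lambda m: m.sources)]
    raise ValueError("unknown outputMode: " + mode)


if __name__ == "__main__":
    lung = Concept("lung", "C0000001", 700, 3)
    cancer_a = Concept("lung cancer", "C0000003", 1000, 9)
    cancer_b = Concept("lung cancer", "C0000002", 1000, 5)
    print(select("AllCandidates", [lung, cancer_a], [cancer_a, cancer_b]))
    print(select("HighestMappingLowestCUI", [lung], [cancer_a, cancer_b]))
```

The three HighestMapping modes differ only in how they break ties between equally scored mappings, so they return the same result whenever a single mapping has the best score.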

This plugin wraps the GSpell API, from the National Library of Medicine Lexical Systems Group,
to add spelling suggestions to features in the input/output annotations deﬁned (default is Token).
The GSpell plugin has a number of options to customise its behaviour and to reduce the number
of false positives in the spelling suggestions. For example, it can ignore words and spelling
suggestions shorter than a given threshold, and regular expressions can be used to filter the
input to the spell checker. Two filters are provided by default: one ignores capitalised
abbreviations and words in all caps, the other ignores words starting or ending with a digit.
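
The kind of filtering described above can be sketched as follows. The regular expressions, the function name and the default threshold are assumptions chosen for illustration; the plugin's actual defaults may differ.

```python
import re

# Hypothetical equivalents of the two default filters.
ALL_CAPS = re.compile(r"^[A-Z]+$")     # abbreviations / words in all caps
DIGIT_EDGE = re.compile(r"^\d|\d$")    # words starting or ending with a digit


def should_spell_check(word, min_length=3, ignore=(ALL_CAPS, DIGIT_EDGE)):
    """Return True if `word` should be passed on to the spell checker."""
    if len(word) < min_length:
        return False
    return not any(pattern.search(word) for pattern in ignore)


if __name__ == "__main__":
    for word in ["asprin", "DNA", "p53", "5mg", "mg"]:
        print(word, should_spell_check(word))
```

Filtering before the spell checker is called keeps abbreviations and identifiers such as p53 from being "corrected", which is where most false positives in biomedical text would otherwise come from.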

There are two processing modes: WholePhrase, which will spell-check the content of deﬁned
annotations as a single phrase, and does not require any prior tokenization; and PhraseTokens,
which requires a tokenizer to have been run as a prior phase.

BADREX (identifying Biomedical Abbreviations using Dynamic Regular Expressions) [Gooch 12]
is a GATE plugin that annotates, expands and coreferences term-abbreviation pairs using
parameterisable regular expressions that generalise and extend the Schwartz-Hearst algorithm
[Schwartz & Hearst 03]. In addition it uses a subset of the inner–outer selection rules described
for the ALICE algorithm [Ao & Takagi 05]. Rather than simply extracting terms and their
abbreviations, it annotates them in situ and adds the corresponding long-form and short-form text
as features on each.

In coreference mode BADREX expands all abbreviations in the text that match the short
form of the most recently matched long-form–short-form pair. In addition, there is the
option of annotating and classifying common medical abbreviations extracted from
Wikipedia.
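
For readers unfamiliar with the Schwartz-Hearst algorithm that BADREX generalises, the following Python sketch shows its core matching step. It is a simplified port of the published algorithm, not BADREX's own code: the function names are hypothetical, only the "long form (SF)" pattern is handled, and the validity checks on the short form are omitted.

```python
import re

PAREN = re.compile(r"\(([^()]+)\)")


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.

    Returns the shortest suffix of `candidate` (starting at a word boundary)
    that contains the characters of `short_form` in order, or None.
    """
    s, l = len(short_form) - 1, len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # The first short-form character must match the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def abbreviation_pairs(sentence):
    """Return (long form, short form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for match in PAREN.finditer(sentence):
        short = match.group(1).strip()
        if not short:
            continue
        words = sentence[:match.start()].split()
        # Search window from the original paper: min(|SF| + 5, |SF| * 2) words.
        window = min(len(short) + 5, len(short) * 2)
        long_form = find_best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs.append((long_form, short))
    return pairs


if __name__ == "__main__":
    print(abbreviation_pairs("We trained a hidden Markov model (HMM) on text."))
    # [('hidden Markov model', 'HMM')]
```

BADREX differs from this sketch in that it annotates the pairs in situ and, in coreference mode, expands later occurrences of the short form.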

The MiniChem Tagger is a GATE plugin that uses a small set (~500) of chemistry morphemes
classified into 10 types (root, suffix, multiplier etc.), and some deterministic rules based on the
Wikipedia IUPAC entries, to identify chemical names, drug names and chemical formulae in
text.

Support for using AbGene [Tanabe & Wilbur 02] (a modiﬁed version of the Brill tagger), to
annotate gene names, within GATE is provided by the Tagger Framework plugin (Section
23.3).

AbGene needs to be downloaded and installed externally to GATE and then the example
AbGene GATE application, provided in the resources directory of the Tagger Framework
plugin, needs to be modified accordingly.

A number of diﬀerent biomedical language processing tools have been developed under the auspices
of the GENIA Project. Support is provided within GATE for using both the GENIA sentence
splitter and the tagger, which provides tokenization, part-of-speech tagging, shallow parsing and
named entity recognition.

To use either the GENIA sentence splitter or tagger within GATE you need to have
downloaded and compiled the appropriate programs which can then be called by the
GATE PRs.

The GATE GENIA plugin provides the sentence splitter PR. The PR is conﬁgured through the
following runtime parameters:

annotationSetName the name of the annotation set in which the Sentence
annotations should be created

debug if true then details of calling the external process will be reported within the
message pane

splitterBinary the location of the GENIA sentence splitter binary

Support for the GENIA tagger within GATE is handled by the Tagger Framework which is
documented in Section 23.3.

Together, these two components in a GATE pipeline provide a biomedical equivalent of ANNIE (minus
the orthographic coreference component). Such a pipeline is provided as an example within the GENIA
plugin: the saved application, genia.xgapp, includes both components. The runtime parameters of both
components will need changing to point to your locally installed copies of the GENIA applications.

For more details on the GENIA tagger and its performance over biomedical text see [Tsuruoka et al. 05].