I think they should introduce ’sleeping’ to the Olympics. It would be an excellent
field event, in which the ’athletes’ (for want of a better word) all lay down in beds,
just beyond where the javelins land, and the first one to fall asleep and not wake
up for three hours would win gold. I, for one, would be interested in seeing what
kind of personality would be suited to sleeping in a competitive environment.
…

GATE has a single model for information that describes documents, collections of documents
(corpora), and annotations on documents, based on attribute/value pairs. Attribute names are
strings; values can be any Java object. The API for accessing this feature data is Java’s Map
interface (part of the Collections API).

A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents
are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated with
them that stores attribute/value information about the resource. FeatureMaps are also used to
associate arbitrary information with ranges of documents (e.g. pieces of text) via the annotation
model (see below).

Documents have a DocumentContent which is a text at present (future versions may
add support for audiovisual content) and one or more AnnotationSets which are Java
Sets.

Annotations are organised in graphs, which are modelled as Java sets of Annotation. Annotations
may be considered as the arcs in the graph; they have a start Node and an end Node, an ID, a
type and a FeatureMap. Nodes have pointers into the source document, e.g. character
offsets.
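As a rough illustration of this model, the following self-contained sketch uses plain Java collections rather than GATE's own classes. The names SketchNode, SketchAnnotation and AnnotationModelSketch are hypothetical stand-ins and not part of the GATE API. Only the general shape is taken from the description above: feature maps are Java Maps, annotations are arcs between offset-bearing nodes, and annotation sets are Java Sets. The sample values are illustrative.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a node: a pointer (character offset) into the content.
final class SketchNode {
    final int offset;
    SketchNode(int offset) { this.offset = offset; }
}

// Hypothetical stand-in for an annotation: an arc between two nodes.
final class SketchAnnotation {
    final int id;
    final String type;
    final SketchNode start;
    final SketchNode end;
    // Attribute names are strings; values can be any Java object.
    final Map<String, Object> features = new HashMap<>();

    SketchAnnotation(int id, String type, int startOffset, int endOffset) {
        this.id = id;
        this.type = type;
        this.start = new SketchNode(startOffset);
        this.end = new SketchNode(endOffset);
    }

    // The text an annotation covers is recovered from its node offsets.
    String coveredText(String content) {
        return content.substring(start.offset, end.offset);
    }
}

final class AnnotationModelSketch {
    // Builds a small annotation set (a Java Set) with two annotations.
    static Set<SketchAnnotation> example() {
        Set<SketchAnnotation> set = new LinkedHashSet<>();
        SketchAnnotation token = new SketchAnnotation(1, "token", 0, 5);
        token.features.put("pos", "NP");
        set.add(token);
        SketchAnnotation name = new SketchAnnotation(6, "name", 0, 5);
        name.features.put("name_type", "person");
        set.add(name);
        return set;
    }
}
```

Because the annotations only point at the content through offsets, the content string itself is never modified when annotations are added.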

Annotation schemas provide a means to define types of annotations in GATE. GATE
uses the XML Schema language supported by W3C for these definitions. When using
the development environment to create/edit annotations, a component is available
(gate.gui.SchemaAnnotationEditor) which is driven by an annotation schema file. This
component will constrain the data entry process to ensure that only annotations that correspond
to a particular schema are created. (Another component allows unrestricted annotations to be
created.)

Schemas are resources just like other GATE components. Below we give some examples of such
schemas. Section 3.21 describes how to create new schemas.

This material is adapted from [Grishman 97], the TIPSTER Architecture Design document upon
which GATE version 1 was based. Version 2 has a similar model, although annotations are now
graphs, and instead of multiple spans per annotation each annotation now has a single start/end
node pair. The current model is largely compatible with [Bird & Liberman 99], and
roughly isomorphic with "stand-off markup" as latterly adopted by the SGML/XML
community.

Each example is shown in the form of a table. At the top of the table is the document being
annotated; immediately below the line with the document is a ruler showing the position (byte
offset) of each character (see TIPSTER Architecture Design Document).

Underneath this appear the annotations, one annotation per line. For each annotation is shown
its Id, Type, Span (start/end offsets derived from the start/end nodes), and Features.
Integers are used as the annotation Ids. The features are shown in the form name =
value.

The first example shows a single sentence and the result of three annotation procedures:
tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition.
Each token has a single feature, its part of speech (pos), using the tag set from the University of
Pennsylvania Tree Bank; each name also has a single feature, indicating the type of name: person,
company, etc.

Text

Cyndi savored the soup.
^0...^5...^10..^15..^20

Annotations

Id  Type      Span Start  Span End  Features
1   token     0           5         pos=NP
2   token     6           13        pos=VBD
3   token     14          17        pos=DT
4   token     18          22        pos=NN
5   token     22          23
6   name      0           5         name_type=person
7   sentence  0           23

Table 6.1: Result of annotation on a single sentence

Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple
illustration would be the decomposition of a sentence into tokens. A more complex case would be a
full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a
verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such
decompositions can be represented by annotations on nested sets of spans. Both of these are
illustrated in the second example, which is an elaboration of our first example to include parse
information. Each non-terminal node in the parse tree is represented by an annotation of type
parse.

Text

Cyndi savored the soup.
^0...^5...^10..^15..^20

Annotations

Id  Type      Span Start  Span End  Features
1   token     0           5         pos=NP
2   token     6           13        pos=VBD
3   token     14          17        pos=DT
4   token     18          22        pos=NN
5   token     22          23
6   name      0           5         name_type=person
7   sentence  0           23        constituents=[1],[2],[3],[4],[5]

Table 6.2: Result of annotations including parse information
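The nesting of spans in Table 6.2 can be checked mechanically. The sketch below is a minimal illustration, assuming a hypothetical Span class that reduces an annotation to its id, type and offsets; it is not GATE's annotation API. It collects the tokens whose spans lie inside the sentence span.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanNestingSketch {
    // Hypothetical value class: an annotation reduced to id, type and offsets.
    static final class Span {
        final int id;
        final String type;
        final int start;
        final int end;
        Span(int id, String type, int start, int end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Returns the ids of all spans of the given type that lie inside parent.
    static List<Integer> containedIds(Span parent, List<Span> all, String type) {
        List<Integer> ids = new ArrayList<>();
        for (Span s : all) {
            if (s != parent && s.type.equals(type)
                    && s.start >= parent.start && s.end <= parent.end) {
                ids.add(s.id);
            }
        }
        return ids;
    }

    // The token and sentence rows of Table 6.2.
    static List<Span> table62() {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(1, "token", 0, 5));
        spans.add(new Span(2, "token", 6, 13));
        spans.add(new Span(3, "token", 14, 17));
        spans.add(new Span(4, "token", 18, 22));
        spans.add(new Span(5, "token", 22, 23));
        spans.add(new Span(7, "sentence", 0, 23));
        return spans;
    }
}
```

For the sentence annotation, the ids recovered this way are the same ones listed in its constituents feature.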

In most cases, the hierarchical structure could be recovered from the spans. However, it may be
desirable to record this structure directly through a constituents feature whose value is a sequence
of annotations representing the immediate constituents of the initial annotation. For the
annotations of type parse, the constituents are either non-terminals (other annotations in the parse
group) or tokens. For the sentence annotation, the constituents feature points to the constituent
tokens. A reference to another annotation is represented in the table as "[ Annotation Id]"; for
example, "[3]" represents a reference to annotation 3. Where the value of a feature is
a sequence of items, these items are separated by commas. No special operations are
provided in the current architecture for manipulating constituents. At a less esoteric level,
annotations can be used to record the overall structure of documents, including in particular
documents which have structured headers, as is shown in the third example (Table
6.3).

Text

To: All Barnyard Animals
^0...^5...^10..^15..^20.
From: Chicken Little
^25..^30..^35..^40..
Date: November 10,1994
...^50..^55..^60..^65.
Subject: Descending Firmament
.^70..^75..^80..^85..^90..^95
Priority: Urgent
.^100.^105.^110.
The sky is falling. The sky is falling.
....^120.^125.^130.^135.^140.^145.^150.

Annotations

Id  Type       Span Start  Span End  Features
1   Addressee  4           24
2   Source     31          45
3   Date       53          69        ddmmyy=101194
4   Subject    78          98
5   Priority   109         115
6   Body       116         155
7   Sentence   116         135
8   Sentence   136         155

Table 6.3: Annotation showing overall document structure

If the Addressee, Source, ... annotations are recorded when the document is indexed for retrieval, it
will be possible to perform retrieval selectively on information in particular fields. Our final
example (Table 6.4) involves an annotation which effectively modifies the document. The current
architecture does not make any specific provision for the modification of the original text. However,
some allowance must be made for processes such as spelling correction. This information
will be recorded as a correction feature on token annotations and possibly on name
annotations:

Note that annotation types should consist of a single word with no spaces. Otherwise they may not
be recognised by other components such as JAPE transducers, and may create problems when
annotations are saved as inline (save preserving format).
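A simple guard against type names containing spaces might look like the following. This is a hypothetical helper written for illustration; the text above does not say that GATE itself performs such a check.

```java
final class AnnotationTypeNameSketch {
    // True when the name is non-empty and contains no whitespace characters.
    static boolean isSingleWord(String typeName) {
        if (typeName == null || typeName.isEmpty()) {
            return false;
        }
        for (int i = 0; i < typeName.length(); i++) {
            if (Character.isWhitespace(typeName.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```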

To view and edit annotation types, see Section 3.16. To add annotations of a new type, see Section
3.19. To add a new annotation schema, see Section 3.21.

By default GATE will try to identify the type of the document, then strip and convert any
markup into GATE’s annotation format. To disable this process, set the markupAware parameter
on the document to false.

When reading a document of one of these types, GATE extracts the text between tags (where such
exist) and creates a GATE annotation filled as follows:

The name of the tag becomes the annotation’s type, all of the tag’s attributes become
the annotation’s features, and the annotation spans the text covered by the tag. A few
exceptions to this rule apply to the RTF, Email and Plain Text formats; these are described
later in the input section for each of these formats.

The text between tags is extracted and appended to the GATE document’s content and all
annotations created from tags will be placed into a GATE annotation set named “Original
markups”.

The startNode and endNode are created from offsets referring to the beginning and the end of “A
piece of text” in the document’s content.
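The unpacking steps described above can be sketched with the SAX parser that ships with Java. This is an illustration only, not GATE's reader implementation: the class MarkupUnpackSketch and its nested Markup type are hypothetical, and only well-formed XML input is handled. Tag names become annotation types, attributes become features, and start and end offsets are taken from the length of the accumulated content.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class MarkupUnpackSketch extends DefaultHandler {
    // Hypothetical record of one annotation created from a tag.
    static final class Markup {
        final String type;
        final int start;
        int end;
        final Map<String, String> features = new HashMap<>();
        Markup(String type, int start) { this.type = type; this.start = start; }
    }

    final StringBuilder content = new StringBuilder();     // text between tags
    final List<Markup> annotations = new ArrayList<>();    // in order of opening tag
    private final Deque<Markup> open = new ArrayDeque<>(); // currently open elements

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        Markup m = new Markup(qName, content.length());
        for (int i = 0; i < atts.getLength(); i++) {
            m.features.put(atts.getQName(i), atts.getValue(i));
        }
        open.push(m);
        annotations.add(m);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop().end = content.length();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        content.append(ch, start, length);
    }

    // Parses the XML string and returns the handler holding content and annotations.
    static MarkupUnpackSketch unpack(String xml) {
        MarkupUnpackSketch handler = new MarkupUnpackSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler;
    }
}
```

In GATE the resulting annotations would be placed in the “Original markups” annotation set; here they are simply kept in a list.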

The documents supported by GATE have to be in one of the encodings accepted by Java. The
most popular is “UTF-8”, which is also the most storage-efficient Unicode encoding for predominantly ASCII text.
If, when loading a document in GATE, the encoding parameter is set to “” (the empty string), then
the default encoding of the platform will be used.
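The fallback rule for the encoding parameter can be expressed with the standard java.nio.charset.Charset class. The helper name EncodingSketch.resolve is hypothetical, and treating null like the empty string is an assumption made for this sketch.

```java
import java.nio.charset.Charset;

final class EncodingSketch {
    // An empty (or, by assumption, null) encoding name selects the platform default.
    static Charset resolve(String encoding) {
        if (encoding == null || encoding.isEmpty()) {
            return Charset.defaultCharset();
        }
        // Throws an unchecked exception if Java does not accept the name.
        return Charset.forName(encoding);
    }
}
```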

In order to successfully apply the document creation algorithm described above, GATE needs to
detect the proper reader to use for each document format. If the user knows in advance what kind
of document they are loading then they can specify the MIME type (e.g. text/html) using the init
parameter mimeType, and GATE will respect this. If an explicit type is not given, GATE attempts
to determine the type by other means, taking into consideration (where possible) the information
provided by three sources:

Document’s extension

The web server’s content type

Magic numbers detection

The first is the file extension (xml, htm, html, txt, sgm, rtf, etc.); the second is
the HTTP information sent by a web server about the content type of the document it is
serving (text/html, text/xml, etc.); and the third is a set of character sequences which are
ultimately number sequences. GATE can support multimedia documents if the right
reader is added to the framework. Multimedia documents are sometimes identified by a signature
consisting of a sequence of numbers; inside GATE these are called magic numbers. For
textual documents, certain character sequences form such magic numbers. Examples of magic
number sequences are provided in the Input section of each format supported by
GATE.

All of these tests are applied to each document read, after which a voting mechanism decides
which reader is best to associate with the document. The tests are ranked by priority.
The document’s extension test has the highest priority: if the system is in doubt about which
reader to choose, the one associated with the document’s extension is selected. The next
highest priority is given to the web server’s content type, and the third to magic
number detection. However, any two tests that identify the same MIME type take
precedence in deciding which reader is used. The web server test is not always
successful, as documents may be loaded from a local file system, and the
magic number detection test is not always applicable. In the next paragraphs we will
see how these tests are performed and what the general mechanism behind reader
detection is.
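One way to read the voting rule is sketched below. This is an interpretation of the description above, not GATE's actual code; the class and method names are hypothetical, and each argument is a MIME type string or null when the corresponding test produced no answer.

```java
final class ReaderVoteSketch {
    // Picks a MIME type from up to three guesses (null means "no answer").
    static String vote(String byExtension, String byServer, String byMagic) {
        // Any two tests that agree decide the outcome.
        if (byExtension != null
                && (byExtension.equals(byServer) || byExtension.equals(byMagic))) {
            return byExtension;
        }
        if (byServer != null && byServer.equals(byMagic)) {
            return byServer;
        }
        // Otherwise fall back on priority: extension, then server, then magic numbers.
        if (byExtension != null) {
            return byExtension;
        }
        if (byServer != null) {
            return byServer;
        }
        return byMagic;
    }
}
```

For instance, a file named with an html extension but served and sniffed as XML would be treated as XML, because two tests agree.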

The method that detects the proper reader is a static one, and it belongs to the gate.DocumentFormat
class. It uses the information stored in the maps filled by the init() method of each reader. This
method comes with three signatures:

The first two methods try to detect the right MimeType for the GATE document, and then
call the third one to return the reader associated with that MimeType. Of course, if an explicit
mimeType parameter was specified, GATE calls the third form of the method directly, passing the
specified type. GATE uses the implementation from “http://jigsaw.w3.org” for MIME
types.

The magic numbers test is performed using the information from the
magic2mimeTypeMap map. Each key in this map is searched for in the first bufferSize (the default
value is 2048) chars of text. The method that does this is called
runMagicNumbers(InputStreamReader aReader) and it belongs to the DocumentFormat class. More
details about it can be found in the GATE API documentation.
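The idea behind the magic numbers test can be sketched as follows. This is not the body of runMagicNumbers; the class MagicNumberSketch, its detect method and the sample map are hypothetical. The two sample signatures are the XML and HTML ones mentioned in this chapter, and matching them case-sensitively is an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

final class MagicNumberSketch {
    static final int BUFFER_SIZE = 2048; // default bufferSize given in the text

    // Returns the MIME type of the first signature found in the first
    // BUFFER_SIZE chars of the text, or null if none is found.
    static String detect(Reader reader, Map<String, String> magicToMime) {
        char[] buffer = new char[BUFFER_SIZE];
        int filled = 0;
        try {
            int n;
            while (filled < BUFFER_SIZE
                    && (n = reader.read(buffer, filled, BUFFER_SIZE - filled)) != -1) {
                filled += n;
            }
        } catch (IOException e) {
            throw new IllegalStateException("could not read document head", e);
        }
        String head = new String(buffer, 0, filled);
        for (Map.Entry<String, String> entry : magicToMime.entrySet()) {
            if (head.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Illustrative map from signature to MIME type.
    static Map<String, String> sampleMap() {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("<?xml version=\"1.0\"", "text/xml");
        map.put("<html", "text/html");
        return map;
    }
}
```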

In order to activate a reader to perform the unpacking, the CREOLE definition of a GATE document
defines a parameter called “markupAware”, initialized with a default value of true. This
parameter forces GATE to detect a proper reader for the document being read. If no reader is
found, the document’s content is loaded and presented to the user as in any text editor
(this applies to textual documents).

The next subsections investigate the particularities of each format and describe the file
extensions registered with each document format.

GATE permits the processing of any XML document and offers support for XML namespaces. It
benefits from the power of Apache’s Xerces parser and also makes use of Sun’s JAXP layer. Changing
the XML parser in GATE can be achieved by simply replacing the value of a Java system property
(“javax.xml.parsers.SAXParserFactory”).

GATE will accept any well-formed XML document as input. Although it could
validate XML documents against DTDs, it does not do so, because validation
is time-consuming and in many cases issues messages that are annoying for the
user.

There is an open problem with the general approach of reading XML, HTML and SGML
documents in GATE. As we previously said, the text covered by tags/elements is appended to the
GATE document content and a GATE annotation refers to this particular span of text. When
appending, in cases such as “end.</P><P>Start”, the last word of the previous annotation may be
concatenated with the first word of the annotation currently being created,
resulting in garbage input for GATE processing resources that operate at the text
surface.

Let’s take another example in order to better understand the problem:

<title>This is a title</title><p>This is a paragraph</p><a
href="#link">Here is an useful link</a>

When the markup is transformed to annotations, it is likely that the text from the document’s
content will be as follows:

This is a titleThis is a paragraphHere is an useful link

The annotations created will refer to the right parts of the text, but for GATE’s processing
resources (tokenizer, gazetteer, etc.) which work on this text, this will be a major disaster.
Therefore, to prevent this problem, GATE checks whether words are likely to be joined
and, if so, inserts a space between them. So, the text will look like
this after being loaded in GATE:

This is a title This is a paragraph Here is an useful link

There are cases when these words are meant to be joined, but they are just a few. This is why it’s
an open problem.
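A minimal sketch of such a check follows. The exact test GATE applies is not given in the text, so the rule used here (insert a space only when a letter or digit would directly follow a letter or digit) is an assumption, and the class name JoinGuardSketch is hypothetical.

```java
final class JoinGuardSketch {
    // Appends a chunk, inserting a space when two word characters would touch.
    static void append(StringBuilder content, String chunk) {
        if (chunk.isEmpty()) {
            return;
        }
        if (content.length() > 0
                && Character.isLetterOrDigit(content.charAt(content.length() - 1))
                && Character.isLetterOrDigit(chunk.charAt(0))) {
            content.append(' ');
        }
        content.append(chunk);
    }

    // Convenience wrapper: joins all chunks using the rule above.
    static String join(String... chunks) {
        StringBuilder content = new StringBuilder();
        for (String chunk : chunks) {
            append(content, chunk);
        }
        return content.toString();
    }
}
```

Note that any inserted space shifts the offsets of everything appended after it, so annotation offsets have to be computed against the content as it is built.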

The magic numbers test searches inside the document for the XML (<?xml version="1.0")
signature. It is also able to detect whether the XML document uses the semantics described in the GATE
document format DTD (see section 6.5.2) or uses other semantics.

GATE can provide persistence for its resources. Several layers of persistence are available,
up to and including database persistence. However, for some purposes, a light and simple level of
persistence is preferable. The types of persistent storage used for Language
Resources are:

Databases (like Oracle);

Java serialization;

XML serialization.

We describe the last of these here.

XML persistence doesn’t necessarily preserve all the objects belonging to the annotations,
documents or corpora. Their features can be of all kinds of objects, with various layers of nesting:
for example, lists containing lists containing maps, etc. Serializing these arbitrary data types in
XML is not a simple task; GATE does the best it can, and supports native Java types such as
Integers and Booleans, but where complex data types are used, information may be lost (the types
will be converted into Strings). GATE provides a full serialization of certain types of features
such as collections, strings and numbers. Collections can be serialized in full only if they
contain strings or numbers. All other features are serialized using their string
representation and, when read back, will all be strings rather than the original objects.
The consequences of this may be observed when performing evaluations (see the evaluation
section).
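The rule for which feature values keep their type can be approximated as below. This is a reading of the paragraph above rather than GATE's serialization code: the class name is hypothetical, and including Boolean among the preserved types follows the mention of Booleans as supported native types.

```java
import java.util.Collection;

final class FeatureSerializationSketch {
    // Strings and numbers are the item types named for collections.
    static boolean isStringOrNumber(Object value) {
        return value instanceof String || value instanceof Number;
    }

    // True when, under the rules described above, the value is expected to be
    // restored with its type intact rather than as a plain string.
    static boolean keepsType(Object value) {
        if (isStringOrNumber(value) || value instanceof Boolean) {
            return true;
        }
        if (value instanceof Collection) {
            for (Object item : (Collection<?>) value) {
                if (!isStringOrNumber(item)) {
                    return false; // nested or arbitrary objects are not preserved
                }
            }
            return true;
        }
        return false;
    }
}
```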

When GATE outputs an XML document it may do so in one of two ways:

When the original document that was imported into GATE was an XML document,
GATE can dump that document back into XML (possibly with additional markup
added);

For all document formats, GATE can dump its internal representation of the document
into XML.

In the former case, the XML output will be close to the original document. In the latter case, the
format is a GATE-specific one which can be read back by the system to recreate all the
information that GATE held internally for the document.

In order to understand why there are two types of XML serialization, one needs to understand the
structure of a GATE document. GATE allows a graph of annotations that refer to parts of the
text. Those annotations are grouped under annotation sets. Because of this structure, sometimes it
is impossible to save a document as XML using tags that surround the text referred by
the annotation, because tags crossover situations could appear (XML is essentially a
tree-based model of information, whereas GATE uses graphs). Therefore, in order to
preserve all annotations in a GATE document, a custom type of XML document was
developed.

The problem of crossover tags appears with GATE’s second option (the preserve format one),
which is implemented at the cost of losing certain annotations. GATE tries to restore
the original markup and, where possible, to add the annotations produced by GATE
in the same manner.
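The crossover condition can be stated precisely: two spans cross when they overlap but neither contains the other, so they cannot both be written as properly nested tags. The following sketch uses a hypothetical helper, not a GATE method.

```java
final class CrossoverSketch {
    // True when the spans [startA, endA) and [startB, endB) overlap
    // without one being nested inside the other.
    static boolean crosses(int startA, int endA, int startB, int endB) {
        boolean overlap = startA < endB && startB < endA;
        boolean aInsideB = startB <= startA && endA <= endB;
        boolean bInsideA = startA <= startB && endB <= endA;
        return overlap && !aInsideB && !bInsideA;
    }
}
```

Spans 0 to 10 and 5 to 15 cross, whereas 0 to 10 and 2 to 5 are nested and can be serialized as one tag inside another.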

How to access and make use of the two ways of XML serialization

Save As XML option

This option is available in GATE’s GUI in the pop-up menu associated with each language resource
(document or corpus). Saving a corpus as XML is done by calling save as XML on each
document of the corpus. This option saves all the annotations of a document together with their
features (applying the restrictions previously discussed), using the GateDocument.dtd:

<TextWithNodes>
<Node id="0"/>A TEENAGER <Node
id="11"/>yesterday<Node id="20"/> accused his parents of cruelty
by feeding him a daily diet of chips which sent his weight
ballooning to 22st at the age of l2<Node id="146"/>.<Node
id="147"/>
</TextWithNodes>

Note: All features that are neither numbers or strings, nor collections containing numbers or
strings, are discarded. With this option, GATE does not preserve those
features it cannot restore.

The preserve format option

This option is available in the GATE GUI from the popup menu of the annotations table.
If no annotation in this table is selected, then the option will restore the document’s
original markup. If certain annotations are selected, then the option will attempt to
restore the original markup and insert all the selected ones. When an annotation violates
the crossover condition, that annotation is discarded and a message is issued by
GATE.

This option makes it possible to generate an XML document with tags surrounding the text
each annotation refers to, and with features saved as attributes. All features which are collections, strings or numbers
are saved, and the others are discarded. However, when read back, only the attributes under the
GATE namespace (see below) are reconstructed differently from the others. This is because
GATE does not store in the XML document any information about a feature’s class or, for
collections, the class of the items. So, when read back, all features will become strings, except those
under the GATE namespace.

One will notice that all generated tags have an attribute called “gateId” under the
namespace “http://www.gate.ac.uk”. The attribute is used when the document is read back
in GATE, in order to restore the annotation’s old ID. This feature is needed because
it works in close cooperation with another attribute under the same namespace,
called “matches”. This attribute indicates annotations/tags that refer to the same
entity.
They are under this namespace because GATE is sensitive to them and treats them differently
from all other elements and their attributes, which fall under the general reading algorithm
described at the beginning of this section.

The “gateId” attribute under the GATE namespace is used to create an annotation whose ID is the value
indicated by this attribute. The “matches” attribute is used to create an ArrayList
whose items are Integers representing the IDs of the annotations that the current one
matches.

Example:

If the text being processed is as follows:

<Person gate:gateId="23">John</Person> and <Person
gate:gateId="25" gate:matches="23;25;30">John Major</Person> are
the same person.
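Turning a “matches” value such as the one in the example above into a list of Integers could be done along these lines. The helper is hypothetical; the semicolon separator is taken from the example, and ignoring surrounding whitespace is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class MatchesAttributeSketch {
    // Parses a value like "23;25;30" into the annotation IDs it lists.
    static List<Integer> parse(String value) {
        List<Integer> ids = new ArrayList<>();
        for (String part : value.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                ids.add(Integer.valueOf(trimmed));
            }
        }
        return ids;
    }
}
```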

Under GATE’s API, this option is available by calling gate.Document’s toXml(Set aSetContainingAnnotations) method. This method returns a string which is the XML
representation of the document on which the method was called. If called with null as a
parameter, then the method will attempt to restore only the original markup. If the parameter is
a set that contains annotations, then each annotation is tested against the crossover
restriction, and for those found to violate it, a warning will be issued and they will be
discarded.

In the next subsections we will show how these options apply to the other formats supported by
GATE.

The magic numbers test searches inside the document for the HTML (<html) signature. Some
HTML documents do not contain the HTML tag, so the magic numbers test may
fail for them.

There is a certain degree of customization for HTML documents in that GATE introduces new
lines into the document’s text content in order to obtain a readable form. The annotations will
refer to the pieces of text as described in the original document, but there will be a few extra new line
characters inserted.

After reading H1, H2, H3, H4, H5, H6, TR, CENTER, LI, BR and DIV tags, GATE will introduce a new
line (NL) char into the text. After a TITLE tag it will introduce two NLs. With P tags, GATE
will introduce one NL at the beginning of the paragraph and one at the end of the
paragraph. None of the newly added NLs is considered to be part of the text contained by the
tag.

The Save as XML option works exactly the same for all GATE’s documents so there is no
particular observation to be made for the HTML formats.

When attempting to preserve the original markup formatting, GATE will generate the document
in XHTML. The HTML document will look the same in any browser after being processed by GATE, but it
will use a different syntax.

SGML support in GATE is fairly light, as there is no freely available Java SGML
parser. GATE uses a light converter that attempts to transform the input SGML file into
well-formed XML. Because it does not make use of a DTD, the conversion might
not always be good. It is advisable to perform an SGML-to-XML conversion outside the
system (using some other specialized tool) before using the SGML document inside
GATE.

When reading a plain text document, GATE attempts to detect its paragraphs and add
“paragraph” annotations to the document’s “Original markups” annotation set. It does that by
detecting two consecutive NLs. The procedure works for both UNIX-style and DOS-style text
files.

Example:

If the plain text read is as follows:

Paragraph 1. This text belongs to the first paragraph.

Paragraph 2. This text belongs to the second paragraph

then two “paragraph” type annotations will be created in the “Original markups” annotation set
(referring to the first and second paragraphs), each with an empty feature map.
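Paragraph detection by consecutive newlines can be sketched with a regular expression that accepts both UNIX and DOS line endings. The class ParagraphSketch is hypothetical and simplified: it returns offset pairs rather than creating GATE annotations, and it treats a run of two or more line breaks as a single paragraph boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphSketch {
    // Two or more consecutive line breaks, each either "\n" or "\r\n".
    private static final Pattern BREAK = Pattern.compile("(\\r?\\n){2,}");

    // Returns one {start, end} offset pair per paragraph (end is exclusive).
    static List<int[]> paragraphs(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = BREAK.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.start() > start) {
                spans.add(new int[] {start, m.start()});
            }
            start = m.end();
        }
        if (start < text.length()) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }
}
```

Each offset pair would correspond to one “paragraph” annotation with an empty feature map.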

GATE is able to read email messages packed in one document (UNIX mailbox format). It detects
multiple messages inside such documents and for each message it creates annotations for all the
fields composing an e-mail, like date, from, to, subject, etc. The message’s body is analyzed and
paragraph detection is performed (just as in the plain text case). All annotations created have as
their type the name of the corresponding e-mail field, and they are placed in the “Original markups” annotation
set.

GATE attempts to detect lines such as “From someone@zzz.zzz.zzz Wed Sep 6 10:35:50 2000” in the
e-mail text. Those lines separate e-mail messages contained in one file. After that, for each field in
the e-mail message, annotations are created as follows:

The annotation type will be the name of the field, the feature map will be empty and the
annotation will span from the end of the field until the end of the line containing the e-mail
field.
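The field annotations described above might be located as in the following sketch. It is not GATE's e-mail reader: the class and its Field type are hypothetical, the separator test is a rough heuristic, and starting each span after the colon and any following blanks is an assumed reading of “the end of the field”.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailFieldSketch {
    // Hypothetical result type: field name plus the offsets of its value.
    static final class Field {
        final String type;
        final int start;
        final int end;
        Field(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // "Name: value" at the start of a line; the value runs to the end of the line.
    private static final Pattern FIELD =
            Pattern.compile("^([A-Za-z][A-Za-z0-9-]*):[ \\t]*(.*)$", Pattern.MULTILINE);

    // Rough test for mailbox separator lines such as
    // "From someone@zzz.zzz.zzz Wed Sep 6 10:35:50 2000".
    static boolean isSeparator(String line) {
        return line.startsWith("From ") && line.indexOf('@') > 0;
    }

    // Finds every header field in the block and records where its value lies.
    static List<Field> fields(String headerBlock) {
        List<Field> result = new ArrayList<>();
        Matcher m = FIELD.matcher(headerBlock);
        while (m.find()) {
            result.add(new Field(m.group(1), m.start(2), m.end(2)));
        }
        return result;
    }
}
```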

Support for input from and output to XML is described in section 6.5.2. In short:

GATE will read any well-formed XML document (it does not attempt to validate XML
documents). Markup will by default be converted into native GATE format.

GATE will write back into XML in one of two ways:

Preserving the original format and adding selected markup (for example to add
the results of some language analysis process to the document).

In GATE’s own XML serialisation format, which encodes all the data in a GATE
Document (as far as this is possible within a tree-structured paradigm – for 100%
non-lossy data storage use GATE’s RDBMS or binary serialisation facilities – see
section 4.7).

When using the GATE framework, object representations of XML documents such as
DOM or JDOM, or query and transformation languages such as XPath or XSLT, may be
used in parallel with GATE’s own Document representation (gate.Document) without
conflicts.