Tokenization and Sentence Annotation

Tokenization

Unit Symbols

Currency: tokenize currency symbols apart from the amount, e.g. $350 is two tokens, $ and 350. This makes sense because it is read as two words, and the number 350 functions in its usual way, combining with $ to form a compositional phrase (350 dollars).

Temperature symbols: separate temperatures such as 35°C into three tokens (for example, 35°C → 35, °, and C on separate lines). Rationale: F and C are different compositional constructs and should be treated like currency symbols ($ is a different construct from £).
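The two unit-symbol rules above can be sketched with a small regular expression. This is only an illustration, not the tool used for the corpus: the function name `split_unit_symbols` and the set of symbols in the pattern are assumptions made for the example.

```python
import re

# Illustrative pattern (an assumption, not an official symbol list):
#   1. a currency or degree symbol is always its own token
#   2. a number, allowing internal separators such as 1,200.50
#   3. a run of letters (e.g. the C in 35°C)
#   4. any other non-space character on its own
UNIT_PATTERN = re.compile(r"[$£€°]|\d+(?:[.,]\d+)*|[^\W\d_]+|\S")

def split_unit_symbols(text):
    """Split currency and temperature expressions into separate tokens."""
    return UNIT_PATTERN.findall(text)

print(split_unit_symbols("$350"))   # ['$', '350']
print(split_unit_symbols("35°C"))   # ['35', '°', 'C']
```

Each returned token would then be written on its own line in the tokenized file.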

Hyphenation

As a general rule, hyphenated words should be kept together. This is especially true of words that are determinative compounds, where the modifier cannot take a plural form and does not constitute an independent word. For example:

10-year plan (10-year is one token: if 10 were modifying year as an independent word, we would see 'years')

one-liners (note the plural -s inflects the whole 'one-liner'; separating 'one' would imply there is a word 'liners', and a subtype of that is one-liners, but actually this is the plural of the noun 'one-liner')

The same logic applies to participles combined with their argument, as well as to compounds with 'self':

energy-based (1 token)

self-proclaimed (1 token)
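A minimal sketch of the general rule, assuming a simple regex tokenizer (the name `tokenize_keeping_hyphens` is hypothetical). It keeps word-internal hyphens inside one token and does not attempt to detect the exceptions, which would need separate handling.

```python
import re

# A run of word characters, optionally continued by hyphen + word characters,
# is one token; any other non-space character is a token of its own.
HYPHENATED = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize_keeping_hyphens(text):
    """Tokenize text while keeping hyphenated words together."""
    return HYPHENATED.findall(text)

print(tokenize_keeping_hyphens("a 10-year plan"))
# ['a', '10-year', 'plan']
print(tokenize_keeping_hyphens("self-proclaimed, energy-based"))
# ['self-proclaimed', ',', 'energy-based']
```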

Some exceptions to keeping hyphenated items together are spans of time, where the hyphen means from-to, and hyphens coordinating items on the same level (copulative, non-determinative compounds).

URLs and symbols from the Web

Plurals with apostrophes

Many dates are written as if they contained a genitive 's. These items should be treated as plurals, and thus as single tokens. For example:

1600's (single token)
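As a rough illustration, a tokenizer can try the date-plural pattern before any other rule. The sketch below assumes that a true genitive 's is otherwise split off as its own token; that assumption, and the name `tokenize_apostrophes`, are not taken from the guidelines above.

```python
import re

# Alternatives are tried in order:
#   1. digits followed by 's (straight or curly apostrophe): one token
#   2. a genitive 's split off from its host (assumed convention)
#   3. any other word
#   4. any other non-space character
APOSTROPHE_PATTERN = re.compile(r"\d+['’]s\b|['’]s\b|\w+|[^\w\s]")

def tokenize_apostrophes(text):
    """Keep date plurals such as 1600's whole; split other 's forms."""
    return APOSTROPHE_PATTERN.findall(text)

print(tokenize_apostrophes("the 1600's"))   # ['the', "1600's"]
print(tokenize_apostrophes("John's book"))  # ['John', "'s", 'book']
```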

Indicating original spacing around tokens spelled together

Items which originally were spelled together but which will be tokenized separately should be surrounded with the <w> tag to indicate that there was no space between them in the original text (unless original spacing is trivial to infer). For example:

We distinguish original “can not” from “cannot” by adding <w> around the latter (it’s two tokens either way)

We distinguish original “apples / oranges” from “apples/oranges” by adding <w> around the latter (it’s three tokens either way)

Contractions such as “didn't” are not surrounded by <w>, as it is trivial to infer that the two tokens (i.e. “did” and “n't”) were originally written without an intervening space.

The <w> tag is not used in cases of morphologically complex words which are analyzed as single tokens, such as:

“graveyard”

“granddaughter”
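The decision of when to add the <w> tag can be sketched as follows. This is a hedged illustration: the function name, the short clitic list used to detect "trivial" cases, and the choice to put the tags on their own lines are all assumptions, not part of the guidelines.

```python
# Hypothetical list of clitic tokens whose original spacing is trivial to infer.
TRIVIAL_CLITICS = {"n't", "'s", "'ll", "'re", "'ve", "'d", "'m"}

def mark_original_spacing(original, tokens):
    """Return the token lines for one original string, adding <w>...</w>
    when several tokens were spelled together without spaces."""
    spelled_together = len(tokens) > 1 and "".join(tokens) == original
    trivial = all(t.lower() in TRIVIAL_CLITICS for t in tokens[1:])
    if spelled_together and not trivial:
        return ["<w>", *tokens, "</w>"]
    return list(tokens)

print(mark_original_spacing("cannot", ["can", "not"]))
# ['<w>', 'can', 'not', '</w>']
print(mark_original_spacing("can not", ["can", "not"]))
# ['can', 'not']
print(mark_original_spacing("didn't", ["did", "n't"]))
# ['did', "n't"]
print(mark_original_spacing("graveyard", ["graveyard"]))
# ['graveyard']
```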

Sentence Annotation

Segmentation

Full sentences are segmented using the <s> tag during XML markup (see TEI Markup).

The text is divided entirely into non-overlapping sentences, so that every token is part of exactly one sentence

No tokens are left outside of sentences, meaning that headings and image captions are also surrounded by <s> tags

It is possible for a caption to include multiple sentences, each enclosed in <s> tags

If direct speech subordinates multiple sentences, up to two sentences are allowed within a single sentence tag together with the main clause containing the speech verb. More than two sentences in direct speech should all receive separate <s> tags. The following examples illustrate this:

<s>John said: “I've had it. I'm not doing this anymore.”</s>

<s>John said:</s><s>“I've had it.</s><s>I'm not doing this anymore.</s><s>I'm going home.”</s>
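The direct speech rule can be expressed as a small function. This is a sketch under the assumption that the speech clause and the quoted sentences have already been identified; the name `segment_direct_speech` is hypothetical.

```python
def segment_direct_speech(speech_clause, quoted_sentences):
    """Apply the two-sentence rule for direct speech.

    Up to two quoted sentences stay in one <s> together with the clause
    containing the speech verb; with more than two, every sentence
    (and the speech clause) receives its own <s>.
    """
    if len(quoted_sentences) <= 2:
        return ["<s>" + " ".join([speech_clause, *quoted_sentences]) + "</s>"]
    return ["<s>" + part + "</s>" for part in [speech_clause, *quoted_sentences]]

two = segment_direct_speech(
    "John said:", ["“I've had it.", "I'm not doing this anymore.”"])
print(len(two))    # 1

three = segment_direct_speech(
    "John said:",
    ["“I've had it.", "I'm not doing this anymore.", "I'm going home.”"])
print(len(three))  # 4
```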

Sentence Types

Each sentence tag <s> receives a type attribute from the following list:

decl - declarative sentence (indicative)

imp - imperative

sub - subjunctive, including modals like would, could, but not indicative future 'will'

Exceptions and doubtful cases

In certain cases, what looks like a modal can actually be indicative, e.g. 'can' describing ability - this should be tagged as decl if it's simply a statement of fact:

<s type="decl">I ca n't swim</s>

A modal 'can' of potential, not ability, is tagged 'sub' (this is the more common case):

<s type="sub">You can find some in the supermarket</s> (not debating the hearer's ability to do so, just saying this is a possible option).

Similarly, 'will' can be used in a non-indicative way, in which case the sentence is tagged 'sub':

<s type="sub">Boys will be boys</s> (i.e. they may well behave as boys; this is not an indicative future claiming some boys will in fact be boys)

<s type="sub">I couldn't stand it if it spoke.</s> (i.e. whenever it might have spoken, I wouldn't have been able to stand it.)

When to use 'multiple'

The category 'multiple' is meant for sentences containing two (or more) complete clauses of varying types (e.g. do it and I don’t care how! – imp + decl)

The 'multiple' category does not apply when there is a main clause of one type and a subordinate clause of a different type, e.g. “washing the dishes, John noticed the burglar” - in this case, we have a normal declarative clause that has a subordinate gerund. It is not a gerund type (“ger”), since there is really only one main matrix clause: the past tense one with “noticed”.
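One way to picture the rule: only the types of complete main clauses are compared, and subordinate clauses are left out of the comparison. The helper below is a hypothetical sketch that assumes the main clause types have already been determined by the annotator.

```python
def sentence_type_from_main_clauses(main_clause_types):
    """Return 'multiple' only when complete main clauses differ in type.

    Subordinate clauses must not be included in main_clause_types.
    """
    if not main_clause_types:
        raise ValueError("a sentence needs at least one main clause type")
    if len(set(main_clause_types)) > 1:
        return "multiple"
    return main_clause_types[0]

# "do it and I don't care how!": imperative + declarative main clauses
print(sentence_type_from_main_clauses(["imp", "decl"]))  # multiple

# "washing the dishes, John noticed the burglar": one declarative main
# clause; the subordinate gerund clause is not counted
print(sentence_type_from_main_clauses(["decl"]))         # decl
```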

Prioritization when multiple types apply

There is a hierarchy among the sentence types that sometimes comes into play when a sentence fits two definitions. Specifically, being a question gets ‘first dibs’ on the sentence type. We might want to say that a sentence is both hypothetical and a question, for example “Would you do it if you could?”, but we only get one label, and whether or not something is a question is seen as more crucial, so this example gets the type “q” (yes/no question).
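The question-first rule can be sketched as a tie-breaker over candidate labels. Only the priority of “q” is stated above, so this hypothetical helper deliberately refuses to guess an order among other competing types.

```python
def resolve_sentence_type(candidates):
    """Pick one label when several types seem to apply to a sentence."""
    labels = set(candidates)
    if not labels:
        raise ValueError("no candidate types given")
    if "q" in labels:
        # Being a question takes priority over any other candidate.
        return "q"
    if len(labels) == 1:
        return next(iter(labels))
    # No ranking is sketched here for other combinations.
    raise ValueError("no priority rule for: " + ", ".join(sorted(labels)))

# "Would you do it if you could?" is both hypothetical and a question
print(resolve_sentence_type({"sub", "q"}))  # q
```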