Tokenization

In most cases, tokenization of the Irish English corpus is quite standard.

Part-of-speech tagging

Partial word

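For partial words, tag according to the target hypothesis, i.e. the full word the speaker appears to have intended. In the example below, dec receives the tag of its target, decided.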

Example

So uhm <,> then we all <.> dec </.> they all decided they wanted to go to the disco like but I had no money

Token  So  uhm  then  we   all  dec  they  all  decided
Tag    RB  UH   RB    PRP  RB   VVD  PRP   RB   VVD

Sometimes it is difficult to form a target hypothesis. In these cases, see the section UNCLEAR below.

Discourse markers

Null-to-low semantic value

Words with little or no semantic value are tagged as discourse markers (i.e. UH). These are usually affirmative responses in which the word carries less meaning than in its other uses. For example, well in “oh well” no longer has the sense it has in “the child behaved well”.

Examples

Oh right

Token  Oh  right
Tag    UH  UH

Ah cool

Token  Ah  cool
Tag    UH  UH

He rang her alright

Token  He   rang  her  alright
Tag    PRP  VVD   PRP  UH

Clause-final 'like'

Function: “retroactive focusing power, but more importantly, […] they can be interpreted as countering potential inferences, objections, or doubts.” (Miller & Weinert, 1995)

Since clause-final 'like' is extremely common and has neither (a) the same distribution nor (b) the same function as other forms of 'like', it should be tagged as UH.

All the people were out like.

Token  All  the  people  were  out  like
Tag    PDT  DT   NNS     VBD   IN   UH
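
The clause-final rule can be sketched as a small post-processing step over an already tagged clause. This is only an illustration: the function name is hypothetical, the input is assumed to be a single clause with markup removed, and the initial IN tag on 'like' is an assumed tagger output, not something this guideline prescribes.

```python
def tag_clause_final_like(tokens, tags):
    """Return a copy of tags in which a clause-final 'like' is tagged UH.

    Assumes tokens holds exactly one clause, with corpus markup
    and pause tokens already removed.
    """
    new_tags = list(tags)
    if tokens and tokens[-1].lower() == "like":
        new_tags[-1] = "UH"
    return new_tags


tokens = ["All", "the", "people", "were", "out", "like"]
# Assumed first-pass tags; 'like' is IN here purely for illustration.
tags = ["PDT", "DT", "NNS", "VBD", "IN", "IN"]
print(tag_clause_final_like(tokens, tags))
# ['PDT', 'DT', 'NNS', 'VBD', 'IN', 'UH']
```

The sketch only looks at the last token, so it cannot decide the ambiguous clause-medial cases; those still need the annotator's judgment.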

Misc.

ye

Example

Did she go out with ye.

Token  Did  she  go  out  with  ye
Tag    VVD  PRP  VV  IN   IN    PRP

UNCLEAR

Use either the target hypothesis or the tag XX.

N.B. XX is also used in the Switchboard Corpus for partial words and unclear parts of speech (Calhoun et al., 2010). Here, we tag partial words using the target hypothesis. Only if the target of the partial word is unclear should it be tagged as XX.
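
The fallback from target hypothesis to XX can be sketched as follows. The class and function names are hypothetical and exist only for this illustration; the decision about whether a target hypothesis exists remains the annotator's.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartialWord:
    surface: str                      # the transcribed fragment, e.g. "dec"
    target: Optional[str] = None      # target hypothesis, e.g. "decided"
    target_tag: Optional[str] = None  # tag of the target, e.g. "VVD"


def tag_partial_word(word: PartialWord) -> str:
    """Use the tag of the target hypothesis if there is one, else XX."""
    if word.target is not None and word.target_tag is not None:
        return word.target_tag
    return "XX"


# "dec" in "we all dec they all decided" has the target "decided".
print(tag_partial_word(PartialWord("dec", "decided", "VVD")))  # VVD
# A fragment with no recoverable target falls back to XX.
print(tag_partial_word(PartialWord("d")))                      # XX
```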

Notice that in this example, Speaker B interrupted Speaker A, and Speaker A was still listing the activities from their previous turn. These two turns should be annotated as distinct utterances even though they are closely related.

False Starts

False starts should be included in the utterance.

Example

<#> But uhm she 's she 's from Galway

Tokens     But uhm she 's she 's from
Utterance  UTTERANCE

Exceptions are false starts at the beginning of a sentence whose lexical items differ significantly from what follows. These should be segmented as distinct utterances. However, the distinction between a false start and topicalization may be ambiguous. In these cases, use your own judgment.

Example

<#> <.> Sat </.> who else <,>

Tokens     Sat        |  who else ,
Sentence   SENTENCE   |  SENTENCE
Utterance  UTTERANCE  |  UTTERANCE

Pauses

Pauses at the end of an utterance should be included.

Example

<#> Yeah <,> she was <{> <[> with her sister </[> <,> <#> She was going in shopping

Tokens     Yeah , she was with her sister ,  |  She was going in shopping
Utterance  UTTERANCE                         |  UTTERANCE

Sentence Boundaries

In most cases, pre-annotated sentence boundaries should be used as utterance boundaries.

Example

<#> So then uhm <,> what 'd I do Sunday then <#> Sunday I did nothing much

Tokens     So then uhm , what 'd I do Sunday then  |  Sunday I did nothing much
Sentence   SENTENCE                                |  SENTENCE
Utterance  UTTERANCE                               |  UTTERANCE
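
The default segmentation described in this section and under Pauses can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's tooling: the function name is hypothetical, every <#> is treated as an utterance boundary, <,> is kept as a "," token, and all other markup tags are simply dropped. The content of editorial spans such as <&> … </&> or <unclear> … </unclear> is not handled, and the false-start exception still has to be split by hand.

```python
import re
from typing import List


def split_utterances(line: str) -> List[List[str]]:
    """Split a transcript line into utterances, each a list of tokens."""
    utterances = []
    for chunk in line.split("<#>"):
        # Keep pauses as "," tokens so a final pause stays in its utterance.
        chunk = chunk.replace("<,>", " , ")
        # Drop remaining markup tags such as <{>, <[>, </[>, <.>, </.>.
        chunk = re.sub(r"</?[^<>\s]+>", " ", chunk)
        tokens = chunk.split()
        if tokens:
            utterances.append(tokens)
    return utterances


line = "<#> So then uhm <,> what 'd I do Sunday then <#> Sunday I did nothing much"
for utterance in split_utterances(line):
    print(utterance)
# ['So', 'then', 'uhm', ',', 'what', "'d", 'I', 'do', 'Sunday', 'then']
# ['Sunday', 'I', 'did', 'nothing', 'much']
```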

Constituent Parsing

Empty Categories

In speech, subject pronouns are frequently dropped. In these cases, null subjects should be marked as an empty category (NONE *).

Examples

<#> Met Nicole in town <#>

<#> Went in shopping for a while

Fragments

You may notice that the previous examples are annotated as fragments. The question is whether sentences of this kind should be annotated as fragments or as regular sentences. For example, a speaker giving a narrative in the first person may drop subject pronouns while still producing well-formed, complex sentences. We would expect these to be annotated as sentences, not fragments. However, the boundary is often fuzzy. This guideline therefore adopts the following definition of fragments (FRAG).

“FRAG marks those portions of text that appear to be clauses, but lack too many essential elements. Essential elements include phonologically overt nominal subjects and verbs.”

Interjections

Multiple interjections may appear in clusters or “streams”. Phrases containing multiple interjections should be annotated flat.

Example

<#> Oh right yeah
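
A flat annotation of such a stream can be sketched as below. The helper name is hypothetical, and the INTJ phrase label with UH leaves is assumed from the labels used elsewhere in this guideline.

```python
def flat_intj(tokens):
    """Bracket a stream of interjections as one flat INTJ phrase."""
    leaves = " ".join(f"(UH {token})" for token in tokens)
    return f"(INTJ {leaves})"


print(flat_intj(["Oh", "right", "yeah"]))
# (INTJ (UH Oh) (UH right) (UH yeah))
```

Every interjection is a direct daughter of INTJ, so no head has to be chosen among them.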

Clause-final LIKE

Clause-final LIKE is very frequent in the ICE Ireland Corpus, more so than either clause-initial or clause-medial LIKE (Schweinberger 2011). Many scholars consider clause-final LIKE to function as a focus marker with backward scope, i.e. modifying the previous clause (Harris 1993; Miller & Weinert 1999; Anderson 2000; Columbus 2009). Following these analyses, clause-final LIKE should be attached to the root.

Example

<#> What 's new like </[> <#>

However, it may sometimes be unclear whether LIKE is clause-final.

For example, in the phrase <[> So then <,> </[> she was asking like if we were going out Saturday night, LIKE is syntactically ambiguous.

Clause-initial

So then she was asking [like if we were going out Saturday night]

Clause-final

[So then she was asking like] if we were going out Saturday night

Since the corpus does not include recordings, this may be difficult to determine. Furthermore, the syntactic positions of LIKE are linked to their discourse-pragmatic function (Anderson 1998, 2000; Miller & Weinert 1995; Miller 2009).

The functions of LIKE within the linguistic literature include (Schweinberger 2011):

Hedging

Focusing

Buying Processing Time

Indicating the Passage is Hard to Follow

Holding the Floor

Signaling Minor Non-Equivalence Between What’s Said and What’s in Mind

Signaling Loose Talk/Marking Non-Literalness

Signaling Approximation

Introducing Exemplifications

Signaling Similarity

LIKE can therefore be functionally as well as syntactically ambiguous. In these cases, the annotator should rely on their own intuition about the form and function of the sentence containing LIKE.

Disfluencies

Reparandum and Repair

Several types of disfluency have two parts: (a) the reparandum and (b) the repair. The reparandum is the phrase that is subjected to repair, and the repair is the material that replaces it.

Example

<#> So uhm <,> then we all <.> dec </.> they all decided they wanted to go to the disco like

In this example, the reparandum is we all dec, and the repair is they all decided.

In these cases, the guideline follows the NXT-format Switchboard Corpus (Calhoun et al. 2009), in which the reparandum is subsumed within the category EDITED. The token dec, in the example above, is an unfinished token corresponding to decided. Unfinished categories should be annotated with the label UNF. The corresponding parse tree is represented below.
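
As a rough illustration of how EDITED and UNF apply to the example, the sketch below wraps the reparandum in EDITED and flags the unfinished token with UNF. It produces a flat bracketing only, not a full parse tree. The function name, the index-based interface and the way the two labels nest are assumptions made for this illustration.

```python
def mark_reparandum(tokens, start, end, unfinished=()):
    """Wrap tokens[start:end] in EDITED and flag unfinished tokens with UNF.

    start and end are token indices (end exclusive, start < end);
    unfinished holds the indices of unfinished words.
    """
    out = []
    for i, token in enumerate(tokens):
        leaf = f"(UNF {token})" if i in unfinished else token
        if i == start:
            leaf = "(EDITED " + leaf
        if i == end - 1:
            leaf = leaf + ")"
        out.append(leaf)
    return " ".join(out)


tokens = "So uhm then we all dec they all decided".split()
# The reparandum "we all dec" covers indices 3 to 5; "dec" is index 5.
print(mark_reparandum(tokens, 3, 6, unfinished={5}))
# So uhm then (EDITED we all (UNF dec)) they all decided
```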

Repetition

Stuttering or hesitation often results in repetition of a word, phrase, or sentence.

The repeated word or phrase (i.e. the second occurrence) should be included within the category REPEAT.

Examples

<#> But uhm she 's she 's from Galway as well though

<#> So she 's <&> laughter </&> she 's in great form like
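
Candidate repetitions can be located automatically before the annotator confirms them. The following is a minimal sketch, assuming markup has already been stripped and that only immediately adjacent repeats are of interest; the function name and the max_len limit are hypothetical.

```python
def find_repeats(tokens, max_len=3):
    """Return (start, end) spans of the second occurrence of each
    immediately repeated word sequence of up to max_len tokens."""
    lowered = [token.lower() for token in tokens]
    spans = []
    i = 0
    while i < len(lowered):
        matched = False
        # Prefer the longest repeated sequence starting at position i.
        for n in range(max_len, 0, -1):
            first = lowered[i:i + n]
            second = lowered[i + n:i + 2 * n]
            if len(second) == n and first == second:
                spans.append((i + n, i + 2 * n))
                i += n  # continue from the repeat itself
                matched = True
                break
        if not matched:
            i += 1
    return spans


tokens = "But uhm she 's she 's from Galway as well though".split()
print(find_repeats(tokens))
# [(4, 6)]
```

The span (4, 6) is the second "she 's", which is the part that goes inside REPEAT.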

Unknown, Uncertain or Un-bracketable

Unclear or unfamiliar words may sometimes appear in the transcript. The guideline again follows the NXT-format Switchboard Corpus (Calhoun et al. 2009), in which unknown, uncertain or un-bracketable material is subsumed within the category X.

Examples

<#> <[> Did you go <unclear> 1 syll </unclear> </[> </{>

<#> Derv

Adverbs

Sentence-initial 'so' - flat?
'then'

Version 1

#2. Uhm Friday night I didn't do much.
#11. Oh yeah unbelievable - 'Oh yeah' is INTJ together because possible MWE, but in general, each UH is an INTJ
#13. Went in shopping for a while - added (NONE *) before 'Went'.
#14. Buy anything - made it SQ - target hypothesis.

Version 2

#2. Broke - added (NONE *), short for 'I am broke.' therefore ADJP-PRD. Frag or S? Frag because incomplete, missing verb.
#3. What's new like - 'like' is phrase-final so append INTJ to phrase before.
#6. So uhm what else did I do then - sentence-initial 'so' must all be flat.
#11. Did you - FRAG? SQ? where's the verb - 'did'?
#14. Did you go XX - target hypothesis.
#15. Derv - category X.
#17. So yeah - RB is flat.
#19. …she was asking like - append INTJ UH like at the end of phrase.
#22. Oh right yeah - [Oh right] [yeah]
#31. Cushty - not NP-SBJ, missing verb.

#36. So uhm then we all dec they all decided… - need a label to state false start/disfluency

#73 That Cliona's mum
#74 That has Cliona - fragment of an SBAR? WHNP?
#75 Yeah that's right yeah - attach last 'yeah' to S or to VP?
#80 Oh right right - where does the constituency go? I made [Oh right] [right]
#82 Yeah I do yeah

When two interjections in a row…which is the head?
Phrase-final 'like' is at the end of phrase, inside.
'Sat' - single lexical item…fragment
v2 #14 UNCLEAR…target hypothesis.
'so uhm' #6 v2.
'so yeah' #17 v2.
#22 v2. 'oh right yeah'
tagset not same as ours…
#36 false start = frag, label?
#39 I'd say
#48 interjections at end of phrase…
#54 Did he? ← frag? SQ?
#59 So she's she's in great form like - FRAG as part of S? or on its own?
SQ - Do you…

KISS…while the tags follow the PENN standard tagset, it is not an exhaustive use for the following reasons…
Sentence initial 'so'…maybe some test…
CHANGE: Keep utterance boundary the same - helps with constituency, just make them fragments.

Dependency Parsing

#4 what was on ← was is the root.

No not really #7 - root = really, not clear, right side.

#24 who else ← root is on who

X is Y ← Y is root
X is with Y ← 'is' is root

#30 did you go UNCLEAR –> go→UNCLEAR is dep

#40 so then –> then→So mwe

discourse always from root.

Oh right - root where? #46 –> used mwe, but use tests to determine!

So i went home, i'd say #55 –> I'd say is discourse function or parataxis?

Uhm what s #57 –> 's' is root, right most and is verb?

#66 did she not –> 'did' is the root…

#70 did he –> 'did' is root

#74 ah cool –> mwe

#85 with Fred –> with is root because if Fred is root, cannot link together

#86 With Fred and Ciaran… –> the subject is 'I', therefore with is prep…def interesting…