Article Structure

Abstract

Introduction

In NLP, we often model annotation as if it reflected a single ground truth guided by an underlying linguistic theory.

Annotator disagreements across domains and languages

In this study, we had between two and ten individual annotators with degrees in linguistics annotate different kinds of English text with POS tags, e.g., newswire text (PTB WSJ Section 00), transcripts of spoken language (from a database containing transcripts of conversations, Talkbank), as well as Twitter posts.

Hard cases and annotation errors

In the previous section, we demonstrated that some disagreements are consistent across domains and languages.

Learning to detect annotation errors

In this section, we examine whether we can learn a classifier to distinguish between hard cases and annotation errors.
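The distinction the section draws can be mocked up in a few lines. The sketch below is an assumption-laden toy, not the authors' classifier: it labels a tag disagreement a "hard case" whenever the tag pair is one that the paper reports as recurrently disputed (e.g. ADP-ADV), and an "error" otherwise. The pair list and function name are illustrative only.

```python
# Toy stand-in for the hard-case vs. annotation-error classifier.
# HARD_PAIRS is an assumed subset of frequently-disputed universal
# tag pairs, not the authors' actual feature set.
HARD_PAIRS = {
    frozenset(("ADP", "ADV")),   # e.g. particle-like "up" in context
    frozenset(("VERB", "NOUN")), # gerunds, deverbal nouns
    frozenset(("ADJ", "NOUN")),  # noun compounds vs. attributive adjectives
}

def classify_disagreement(tag_a: str, tag_b: str) -> str:
    """Return 'hard case' for tag pairs known to be debatable, else 'error'."""
    return "hard case" if frozenset((tag_a, tag_b)) in HARD_PAIRS else "error"
```

A real system would of course condition on the token and its context as well, not just the tag pair.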

Related work

Jurgens (2014) presents work on detecting linguistically hard cases in the context of word sense annotations, e.g., cases where expert annotators will disagree, as well as differentiating between underspecified, overspecified and metaphoric cases.

Conclusion

In this paper, we show that disagreements between professional and lay annotators are systematic and consistent across domains, and that some of them are systematic across languages as well.

Topics

POS tags

Appears in 7 sentences as: POS tag (1) POS taggers (2) POS tags (5)

In Linguistically debatable or just plain wrong?

In this study, we had between two and ten individual annotators with degrees in linguistics annotate different kinds of English text with POS tags , e.g., newswire text (PTB WSJ Section 00), transcripts of spoken language (from a database containing transcripts of conversations, Talkbank), as well as Twitter posts.

Page 2, “Annotator disagreements across domains and languages”

We instructed annotators to use the 12 universal POS tags of Petrov et al. (2012).

Page 2, “Annotator disagreements across domains and languages”

2 Experiments with variation n-grams on WSJ (Dickinson and Meurers, 2003) and the French data led us to estimate that the fine-to-coarse mapping of POS tags disregards about 20% of observed tag-pair confusion types, most of which relate to fine-grained verb and noun distinctions, e.g., past participle versus past tense in “[..] criminal lawyers speculated/VBD vs. VBN that [..]”.

Lastly, we compare the disagreements of annotators on a French social media data set (Seddah et al., 2012), which we mapped to the universal POS tag set.

Page 3, “Annotator disagreements across domains and languages”

(2014) use small samples of doubly-annotated POS data to estimate annotator reliability and show how those metrics can be implemented in the loss function when inducing POS taggers to reflect the confidence we can put in annotations.

Page 5, “Related work”

They show that not biasing the theory towards a single annotator but using a cost-sensitive learning scheme makes POS taggers more robust and more applicable for downstream tasks.

social media

Appears in 4 sentences as: Social Media (1) social media (3)

In Linguistically debatable or just plain wrong?

NOUN VERB ADP/PRT ADV/NOUN (2) Noam likes social media

Page 1, “Introduction”

Besides these English data sets, we also obtained doubly-annotated POS data from the French Social Media Bank project (Seddah et al., 2012).3 All data sets, except the French one, are publicly available at http://lowlands.

Page 2, “Annotator disagreements across domains and languages”

Lastly, we compare the disagreements of annotators on a French social media data set (Seddah et al., 2012), which we mapped to the universal POS tag set.

n-gram

The longest sequence of words (n-gram) in a corpus that has been observed with a token being tagged differently in another occurrence of the same n-gram in the same corpus.

Page 3, “Hard cases and annotation errors”
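The variation n-gram idea above (Dickinson and Meurers, 2003) can be sketched directly: scan a tagged corpus for word n-grams whose focus token receives different tags across occurrences. Corpus format, function name, and the toy data are assumptions for illustration:

```python
# Sketch of variation n-gram detection over a toy tagged corpus.
from collections import defaultdict

def variation_ngrams(tagged_sents, n=3):
    """Map each word n-gram to the set of tags its middle (focus) token received."""
    seen = defaultdict(set)
    for sent in tagged_sents:              # sent: list of (word, tag) pairs
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            words = tuple(w for w, _ in window)
            seen[words].add(window[n // 2][1])  # tag of the focus token
    # keep only n-grams whose focus token was tagged inconsistently
    return {ng: tags for ng, tags in seen.items() if len(tags) > 1}

corpus = [
    [("to", "PRT"), ("work", "VERB"), ("up", "ADP"), ("a", "DET")],
    [("to", "PRT"), ("work", "VERB"), ("up", "ADV"), ("a", "DET")],
]
# "up" is tagged ADP in one occurrence of "work up a" and ADV in the other.
```

The original algorithm additionally looks for the *longest* such shared context; this sketch fixes n for brevity.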

For each variation n-gram that we found in WSJ-00, i.e., a word in various contexts and the possible tags associated with it, we present annotators with the cross product of contexts and tags.

Page 3, “Hard cases and annotation errors”
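The cross product of contexts and tags mentioned above is a simple enumeration; a sketch with assumed toy data:

```python
# Each annotation item pairs one context of the variation word with one
# candidate tag. Contexts and tags here are illustrative placeholders.
from itertools import product

contexts = ["[..] lawyers speculated that [..]", "[..] had speculated that [..]"]
tags = ["VBD", "VBN"]

judgement_items = list(product(contexts, tags))
# 2 contexts x 2 tags -> 4 items presented to annotators
```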

The figure shows, for instance, that the variation n-gram regarding ADP-ADV is the second most frequent one (dark gray), and approximately 70% of ADP-ADV disagreements are linguistically hard cases (light gray).