Tag Archives: scattergraph

This is a very broad question, ultimately answered empirically by the performance of a particular parser.

However, to predict performance, we might consider the types of structure that a parser is likely to find difficult and then examine a parsed corpus of speech and writing for key statistics.

Variables such as mean sentence length or main clause complexity are often cited as proxies for parsing difficulty. However, sentence length and complexity are likely to be poor guides in this case. Spoken data is not split into sentences by the speaker; rather, utterance segmentation is a matter of transcriber or annotator choice. In order to improve performance, an annotator might simply increase the number of sentence subdivisions. Complexity ‘per sentence’ is therefore also potentially misleading.

In the original London-Lund Corpus (LLC), spoken data was split by speaker turns, and phonetic tone units were marked. In the case of speeches, speaker turns could be very long compound ‘run-on’ sentences. In practice, when texts were parsed, speaker turns might be split at coordinators or following a sentence adverbial.

In this discussion paper we will use the British Component of the International Corpus of English (ICE-GB, Nelson et al. 2002) as a test corpus of parsed speech and writing. It is worth noting that both components were parsed together by the same tools and research team.

A very clear difference between speech and writing in ICE-GB is to be found in the degree of self-correction. The mean rate of self-correction in ICE-GB spoken data is 3.5% of words (the rate for writing is 0.4%). The spoken genre with the lowest level of self-correction is broadcast news (0.7%). By contrast, student examination scripts have around 5% of words crossed out by writers, followed by social letters and student essays, which have around 0.8% of words marked for removal.

However, self-correction can be addressed at the annotation stage, by removing it from the input to the parser, parsing this simplified sentence, and reintegrating the output with the original corpus string. To identify issues of parsing complexity, therefore, we need to consider the sentence minus any self-correction. Are there other factors that may make the spoken input stream more difficult to parse than written text?
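The remove, parse, reintegrate procedure for self-correction described above can be sketched as follows. This is only an illustration under stated assumptions: the `Token` class, the `ignored` flag and the `toy_parse` stand-in are hypothetical names invented for the example, and are not the ICE-GB annotation scheme or the parser actually used.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    ignored: bool = False  # assumption: True marks a self-corrected word


def strip_self_corrections(tokens):
    """Return the words to parse and their positions in the original string."""
    kept = [(i, t.word) for i, t in enumerate(tokens) if not t.ignored]
    positions = [i for i, _ in kept]
    words = [w for _, w in kept]
    return words, positions


def toy_parse(words):
    """Hypothetical stand-in for a real parser: labels every word 'X'."""
    return [(w, "X") for w in words]


def reintegrate(tokens, positions, analysis):
    """Align the parser output with the original token sequence."""
    by_position = dict(zip(positions, analysis))
    return [by_position.get(i, (t.word, "IGNORED")) for i, t in enumerate(tokens)]


def self_correction_rate(tokens):
    """Proportion of words marked as self-corrected."""
    return sum(t.ignored for t in tokens) / len(tokens)


# Invented utterance: "we we went to the the shop", first "we" and first "the" corrected.
utterance = [
    Token("we", ignored=True), Token("we"), Token("went"), Token("to"),
    Token("the", ignored=True), Token("the"), Token("shop"),
]
words, positions = strip_self_corrections(utterance)
result = reintegrate(utterance, positions, toy_parse(words))
```

The parser only ever sees the simplified word sequence, while the stored positions allow its output to be aligned with the original string, including the words that were set aside.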

Introduction

In a recent paper focusing on distributions of simple NPs (Aarts and Wallis, 2014), we found an interesting correlation across text genres in a corpus between two independent variables. For the purposes of this study, a “simple NP” was an NP consisting of a single-word head. What we found was a strong correlation between

the probability that an NP consists of a single-word head, p(single head), and

the probability that a single-word head is a personal pronoun, p(personal pronoun | single head).

Note that these two variables are independent because they do not compete, unlike, say, the probability that a single-word NP consists of a noun, vs. the probability that it is a pronoun. The scattergraph below illustrates the distribution and correlation clearly.

Scattergraph of text genres in ICE-GB; distributed (horizontally) by the proportion of all noun phrases consisting of a single word and (vertically) by the proportion of those single-word NPs that are personal pronouns; spoken and written, with selected outliers identified.
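As a rough illustration of how the two probabilities and their correlation across genres might be computed, here is a minimal sketch. The genre labels and frequency counts are invented for the example and are not ICE-GB figures, and Pearson's r is used as one possible correlation measure, which may not be the one used in the paper.

```python
from math import sqrt

# Hypothetical counts per genre (NOT ICE-GB data):
# (all NPs, NPs with a single-word head, personal pronouns among those)
counts = {
    "genre A": (1000, 600, 360),
    "genre B": (1000, 500, 200),
    "genre C": (1000, 400, 100),
}


def proportions(n_np, n_single, n_pronoun):
    """Return p(single head) and p(personal pronoun | single head)."""
    return n_single / n_np, n_pronoun / n_single


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# One (x, y) point per genre, as plotted on the scattergraph.
points = {genre: proportions(*c) for genre, c in counts.items()}
xs = [p[0] for p in points.values()]
ys = [p[1] for p in points.values()]
r = pearson_r(xs, ys)
```

Each genre contributes one point, with p(single head) on the horizontal axis and p(personal pronoun | single head) on the vertical axis. Note that the second proportion uses single-word NPs as its baseline, not all NPs.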