The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Introduction

Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate.

Replication is a big deal: if you publish a correlation between variable X and variable Y – say, that there is an increase in the use of the progressive over time, and that increase is statistically significant – you expect that this finding would be replicated were the experiment repeated.

I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.

There may be more than one reason why published studies have failed to obtain compatible results on repetition, and so it is worth sifting these out.

In this blog post, I want to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics.

Introduction

Recently, a number of linguists have begun to question the wisdom of assuming that linguistic change tends to follow an ‘S-curve’ or, more properly, logistic pattern. For example, Nevalainen (2015) offers a series of empirical observations that show that whereas data sometimes follows a continuous ‘S’, frequently this does not happen. In this short article I try to explain why this result should not be surprising.

The fundamental assumption of logistic regression is that a probability representing a true fraction, or share, of a quantity undergoing a continuous process of change by default follows a logistic pattern. This is a reasonable assumption in certain limited circumstances because an ‘S-curve’ is mathematically analogous to a straight line (cf. Newton’s first law of motion).
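
To make the ‘straight line’ analogy concrete, here is a minimal sketch in Python. It assumes a logistic curve with a hypothetical gradient k and midpoint x0 (names chosen for illustration only) and shows that the logit, or log-odds, transformation maps that curve onto a straight line whose gradient is k.

```python
import math

def logistic(x, k=1.0, x0=0.0):
    """Logistic ('S-curve') function with gradient k and midpoint x0."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Hypothetical parameters, chosen purely for illustration.
K, X0 = 0.8, 0.5

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]            # equally spaced values of x (e.g. time)
ps = [logistic(x, k=K, x0=X0) for x in xs]  # probabilities on the S-curve
ys = [logit(p) for p in ps]                 # the same points on the logit scale

# On the logit scale, each unit step in x adds the same amount (k) to y,
# which is what it means for the transformed curve to be a straight line.
steps = [b - a for a, b in zip(ys, ys[1:])]

for x, p, y in zip(xs, ps, ys):
    print(f"x = {x:5.1f}   p = {p:.4f}   logit(p) = {y:7.4f}")
```

The probabilities themselves are bounded between 0 and 1 and change unevenly, but their log-odds change by a constant amount per unit of x.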

Regression is a set of computational methods that attempt to find the closest match between an observed set of data and a function, such as a straight line, a polynomial, a power curve or, in this case, an S-curve. We say that the logistic curve is the underlying model we expect data to be matched against (regressed to). In another post, I comment on the feasibility of employing Wilson score intervals in an efficient logistic regression algorithm.
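
As a rough illustration of what ‘regressing to’ a logistic curve can mean, the sketch below fits a straight line by ordinary least squares to logit-transformed proportions. This is a deliberately simplified stand-in and not the algorithm discussed in the other post: it is unweighted, it ignores sample sizes, and it fails if any observed proportion is exactly 0 or 1. The data points and function names are invented for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_logistic(xs, ps):
    """Fit a logistic curve by fitting a straight line on the logit scale."""
    return fit_line(xs, [logit(p) for p in ps])

def predict(x, slope, intercept):
    """Probability predicted by the fitted logistic curve at x."""
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))

# Invented observations: x is a time index, p an observed proportion.
times = [0, 1, 2, 3, 4]
proportions = [0.10, 0.18, 0.35, 0.52, 0.70]

slope, intercept = fit_logistic(times, proportions)
for x, p in zip(times, proportions):
    print(f"x = {x}   observed = {p:.2f}   fitted = {predict(x, slope, intercept):.2f}")
```

Comparing the observed and fitted columns gives a quick sense of how closely a set of proportions follows an ‘S’.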

We have already noted that change is assumed to be continuous, which implies that the input variable (x) is real and linear, such as time (and not e.g. probabilistic). In this post we discuss different outcome variable types. What are the ‘limited circumstances’ in which logistic regression is mathematically coherent?

Introduction

One of the challenges for corpus linguists is that many of the distinctions that we wish to make are either not annotated in a corpus at all or, if they are represented in the annotation, unreliably annotated. This issue frequently arises in corpora to which an algorithm has been applied, but where the results have not been checked by linguists, a situation which is unavoidable with mega-corpora. However, this is a general problem. We would always recommend that cases be reviewed for accuracy of annotation.

A version of this issue also arises when checking for the possibility of alternation, that is, to ensure that items of Type A can be replaced by Type B items, and vice-versa. An example might be epistemic modal shall vs. will. Most corpora, including richly-annotated corpora such as ICE-GB and DCPSE, do not include modal semantics in their annotation scheme. In such cases the issue is not that the annotation is “imperfect”, rather that our experiment relies on a presumption that the speaker has the choice of either type at any observed point (see Aarts et al. 2013), but that choice is conditioned by the semantic content of the utterance.

The perspective that the study of linguistic data should be driven by studies of individual speaker choices has been the subject of attack from a number of linguists.

The first set of objections has come from researchers who have traditionally focused on linguistic variation expressed in terms of rates per word, or per million words.

No such thing as free variation?

As Smith and Leech (2013) put it: “it is commonplace in linguistics that there is no such thing as free variation” and that indeed multiple differing constraints apply to each term. On the basis of this observation they propose an ‘ecological’ approach, although in their paper this approach is not clearly defined.

Spoken categories, modal verbs and change over time

In a recently-published paper, Bowie, Wallis and Aarts (2013) demonstrate that observations regarding changes in the frequency of modal verbs over time are highly sensitive to differences in genre (‘register’ or ‘text category’). Our paper, although based on spoken British English, may shed some light on a recent dispute between Leech (2011) and Millar (2009) regarding how linguists should interpret corpus observations regarding changes in the modal verb system in written US English.

The following table summarises statistically significant percentage decreases and increases of individual modal verbs as a proportion of the number of tensed verb phrases (VPs that could conceivably take a modal verb), within different spoken genre subcategories of the Diachronic Corpus of Present-day Spoken English (DCPSE). The statistical test used examines differences in observed probabilities between samples, i.e. a Newcombe-Wilson test.
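
For readers unfamiliar with this kind of test, the following Python sketch shows one common formulation: compute a Wilson score interval for each observed proportion, combine the inner interval widths as Newcombe proposed, and treat the difference as significant if the resulting interval excludes zero. The frequencies in the example are invented and are not taken from DCPSE, and the function names are mine.

```python
import math

Z_95 = 1.959964  # two-tailed critical value of the Normal distribution, alpha = 0.05

def wilson_interval(p, n, z=Z_95):
    """Wilson score interval (lower, upper) for an observed proportion p out of n."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

def newcombe_wilson(p1, n1, p2, n2, z=Z_95):
    """Difference d = p2 - p1 with a Newcombe-Wilson interval (d, lower, upper)."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p2 - p1
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return d, lower, upper

def is_significant(p1, n1, p2, n2, z=Z_95):
    """True if the interval for the difference excludes zero."""
    _, lower, upper = newcombe_wilson(p1, n1, p2, n2, z)
    return lower > 0 or upper < 0

# Invented example: a modal found in 200 of 10,000 tensed VPs in the earlier
# sample and in 120 of 10,000 tensed VPs in the later sample.
f1, n1 = 200, 10000
f2, n2 = 120, 10000
p1, p2 = f1 / n1, f2 / n2

percent_change = 100 * (p2 - p1) / p1
d, lower, upper = newcombe_wilson(p1, n1, p2, n2)
print(f"change = {percent_change:.0f}%   d = {d:.4f}   interval = ({lower:.4f}, {upper:.4f})")
print("significant" if is_significant(p1, n1, p2, n2) else "ns")
```

In a table like the one below, a cell would report the percentage change only where a test of this kind finds the difference significant, and ‘ns’ otherwise.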

For our purposes the cited percentages do not matter, but the direction of travel (indicated by coloured cells) does.

Genre           can     may     could   might   shall   will    should  would   must    All
formal f2f      ns      ns      ns      ns      ns      ns      -60%    ns      -75%
informal f2f    27%     -42%    ns      47%     -32%    ns      ns      ns      -53%    ns
telephone       -37%    ns      -44%    ns      -56%    -30%    ns      -44%    ns      -35%
b. discussions  -41%    -59%    ns      ns      -83%    ns      ns      ns      -54%    -20%
b. interviews   ns      -61%    ns      -59%    ns      -41%    -55%    -32%    -57%    -35%
commentary      ns      ns      ns      ns      -93%    58%     ns      ns      -64%    ns
parliament      ns      ns      ns      ns      ns      -39%    ns      -30%    ns      -20%
legal x-exam    304%    ns      ns      ns      ns      ns      1,265%  254%    ns      157%
spontaneous     ns      ns      ns      ns      ns      ns      ns      ns      ns      ns
prepared sp.    ns      -63%    ns      ns      ns      327%    ns      -32%    -48%    ns
All genres      ns      -40%    -11%    ns      -48%    13%     -14%    -7%     -54%    -6%

Significant changes (α<0.05) in the proportion of individual core modals out of tensed verb phrases from the 1960s (LLC) to 1990s (ICE-GB) components in DCPSE, adapted from Bowie et al. 2013.

This study concerns modal verbs within text categories. Against a general baseline (words, verb phrases or tensed verb phrases), the total number of modals decreases over the period covered by the data (at least, noting the caveat, for spoken English data sampled comparably). Above, we employ tensed verb phrases as the most meaningful baseline out of the three. See ‘That vexed problem of choice’.

Note that if we take all genres together (bottom row in the table), except for will, every significant change is a decline in use, but in the (large) category of informal face-to-face conversation (second row from top), can and might are both significantly increasing.

Legal cross-examination is a predictable outlier, but broadcast interviews and discussions appear to generate very different results.