Understanding Keyness

In order to identify significant differences between two corpora, or between two parts of a corpus, we often use a statistical measure called keyness.

Before we get to any mathematical explanations, here are a few important things to keep in mind:

Whether calculated by log-likelihood or a chi-squared test, keyness is a significance test, meaning that it tests the amount of evidence we have for the existence of an effect. It does not tell us the size of that effect.

Chi-squared tests and log-likelihood tests are sensitive to the size of the corpora we are testing (our sample size). In essence, when we use really large corpora, we are likely to generate a lot of keywords with high keyness values. As Gries (2010) observes, “[M]any contemporary corpora basically guarantee that even minuscule effects will be highly significant.”

Imagine two highly simplified corpora. Each contains only 3 different words (cat, dog, and cow) and has a total of 100 words. The frequency counts are as follows:

Corpus A: cat 52; dog 17; cow 31

Corpus B: cat 9; dog 40; cow 31

Cat and dog would be key, as they are distributed differently across the corpora, but cow would not be, as its distribution is the same. Put another way, cat and dog are distinguishing features of the corpora; cow is not.
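To make this concrete, here is a short Python sketch of the chi-square calculation for each word in the toy corpora (the function and variable names are mine, chosen for illustration; the formula itself is introduced below):

```python
# Toy corpora: word frequencies, 100 words in each corpus
corpus_a = {"cat": 52, "dog": 17, "cow": 31}
corpus_b = {"cat": 9, "dog": 40, "cow": 31}
size_a, size_b = 100, 100

def chi_square(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square for a 2x2 contingency table:
    the word vs. all other words, in corpus A vs. corpus B."""
    n = size_a + size_b
    col_total = freq_a + freq_b  # total occurrences of the word
    # (observed, expected) pairs for the four cells
    cells = [
        (freq_a,          size_a * col_total / n),
        (size_a - freq_a, size_a * (n - col_total) / n),
        (freq_b,          size_b * col_total / n),
        (size_b - freq_b, size_b * (n - col_total) / n),
    ]
    return sum((o - e) ** 2 / e for o, e in cells)

for word in corpus_a:
    value = chi_square(corpus_a[word], size_a, corpus_b[word], size_b)
    print(word, round(value, 2))
```

Running this gives cat and dog high chi-square values, while cow's value is exactly 0: its identical distribution across the two corpora matches the expected frequencies perfectly.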

Normally, we use a concordancing program like AntConc or WordSmith to calculate keyness for us. While we can let these programs do the mathematical heavy lifting, it’s important that we have a basic understanding of what these calculations are and what exactly they tell us.

There are two common methods for calculating distributional differences: a chi-squared test (or χ² test) and log-likelihood. AntConc, for example, gives you the option of selecting one or the other under “Tool Preferences” in the Keyword settings.

For this post, I’m going to concentrate on the chi-squared test. Lancaster University has a nice explanation of log-likelihood here.

A chi-squared test measures the significance of an observed versus an expected frequency. To see how this works, I’m going to walk through an example reported by Rayson et al. (1997) and summarized by Baker (2010).

For this example, we’ll consider the distribution by sex of the word lovely. Here is their data:

|         | lovely | All other words | Total words |
|---------|--------|-----------------|-------------|
| Males   | 414    | 1714029         | 1714443     |
| Females | 1214   | 2592238         | 2593452     |
| Total   | 1628   | 4306267         | 4307895     |

Table 1: Use of 'lovely' by males and females in the spoken demographic section of the BNC

What we want to determine is the statistical significance of this distribution. To do so, we’ll apply the same formula that they did: a Pearson’s chi-squared test. Here is the formula:

χ² = Σ (O − E)² / E

O is the observed frequency, and E is the expected frequency if the independent variable (in this case sex) had no effect on the distribution of the word. The Σ is the mathematical symbol for ‘sum’. In our example, we need to calculate the sum of (or add together) four values, each computed as (observed minus expected) squared, divided by expected: (1) lovely used by males; (2) other words used by males; (3) lovely used by females; and (4) other words used by females.

In other words, our main calculations are for the four cell values of a 2×2 contingency table. The totals, the peripheral values in the table, are what we use to determine our expected frequencies.

To find the expected frequency for a value in our 2×2 table, we use the following formula:

(R * C) / N

That is, the row total (R) times the column total (C) divided by the number of words in the corpus (N).

The expected value of lovely for males is:

(1714443 * 1628) / 4307895 = 647.91
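The same arithmetic as a quick Python check (the variable names are mine):

```python
# Expected frequency = (row total * column total) / corpus total
row_total = 1714443  # total words spoken by males
col_total = 1628     # total occurrences of 'lovely'
n = 4307895          # total words in the corpus

expected = row_total * col_total / n
print(round(expected, 2))  # 647.91
```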

If we then fill out the expected frequencies for the rest of our table, it looks like this:

|         | lovely | All other words | Total words |
|---------|--------|-----------------|-------------|
| Males   | 647.91 | 1713795.09      | 1714443     |
| Females | 980.09 | 2592471.91      | 2593452     |
| Total   | 1628   | 4306267         | 4307895     |

Table 2: Expected use of 'lovely' in the spoken demographic section of the BNC

Now we can finish our calculations. For each of our table cells, we need to subtract the expected frequency from the observed frequency; multiply that value by itself; then divide the result by the expected frequency. The calculations for each cell would look like this:

|         | lovely                     | All other words                       | Total words |
|---------|----------------------------|---------------------------------------|-------------|
| Males   | (414 - 647.91)^2 / 647.91  | (1714029 - 1713795.09)^2 / 1713795.09 | 1714443     |
| Females | (1214 - 980.09)^2 / 980.09 | (2592238 - 2592471.91)^2 / 2592471.91 | 2593452     |
| Total   | 1628                       | 4306267                               | 4307895     |

Table 3: Calculating the chi-square values for our frequencies

When we complete those calculations, our contingency table looks like this:

|         | lovely | All other words | Total words |
|---------|--------|-----------------|-------------|
| Males   | 84.45  | 0.03            | 1714443     |
| Females | 55.83  | 0.02            | 2593452     |
| Total   | 1628   | 4306267         | 4307895     |

Table 4: Chi-square values for our frequencies

Now we just add together those four values:

84.45 + 0.03 + 55.83 + 0.02 = 140.3
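The whole calculation can be put together in a few lines of Python (a sketch of my own, using the observed values from Table 1):

```python
# Observed 2x2 table: rows = males/females, cols = 'lovely'/all other words
observed = [[414, 1714029], [1214, 2592238]]

row_totals = [sum(row) for row in observed]          # words per speaker group
col_totals = [sum(col) for col in zip(*observed)]    # 'lovely' vs. other words
n = sum(row_totals)                                  # total words in the corpus

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency: (R * C) / N
        chi_square += (o - e) ** 2 / e

print(round(chi_square, 1))  # 140.3
```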

If you want to check our calculations, you can use an online chi-square calculator. The most convenient ones take the values that concordancing programs typically generate: a word’s frequency in Corpus A, the total words in Corpus A, its frequency in Corpus B, and the total words in Corpus B.

Now the question is: what does this tell us? Sometimes corpus linguists just look at the top keywords (maybe the top 20). Alternatively, we might be interested in keywords from a particular lexical class (like modal verbs or pronouns), keywords that share a rhetorical function (like hedges or boosters), or keywords from a lexical field (like words related to the body).

The p-value is the probability that our observed frequencies are the result of chance. To find the p-value we need to look it up on a published table (or use an online calculator). In either case, we need to know the degrees of freedom. For a contingency table, the degrees of freedom are the number of rows minus one, times the number of columns minus one (not counting totals). For a 2×2 table like ours, that is 1 × 1 = 1.

For us, then, df = 1

We can check our value against a simplified table of critical values, where the degree of freedom is one:

| p-value | Critical χ² value (df = 1) |
|---------|----------------------------|
| 0.05    | 3.84                       |
| 0.01    | 6.63                       |
| 0.001   | 10.83                      |

The critical cutoff point for statistical significance is usually set at p < 0.05 (though it can also be p < 0.01). So a chi-square value above 3.84 would be considered significant. Our value is 140.3, so the distribution of lovely is significant (p < 0.0001). In other words, the probability that our observed distribution arose by chance is approaching zero.

The plot below illustrates how p-values are calculated. The keyness values are along the x-axis. Probabilities (for df = 1) are along the y-axis. The p-values are derived from the area under the curve at specific points. At 3.84, the area under the probability curve (shaded in red) represents 5% of the total area under that curve. So the p-value = 0.05 at that point. At 6.63, the area under the curve is 1% of the total. So the p-value = 0.01. And so on. Our value for lovely is 140.3, so it is far off to the right along the x-axis. The area under the curve at that point is approaching 0%.
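For one degree of freedom, that tail area can be computed without a statistics library, using the identity p = erfc(√(χ²/2)). This is a sketch of my own, not part of the original example:

```python
import math

def chi_square_p_value(chi2):
    """Upper-tail p-value for a chi-square statistic with df = 1."""
    return math.erfc(math.sqrt(chi2 / 2))

print(round(chi_square_p_value(3.84), 3))  # ~0.05, the usual cutoff
print(chi_square_p_value(140.3))           # effectively zero
```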

I want to leave you with a couple of tips, questions, and resources:

Under most circumstances, our concordancing software can calculate keyness values for us; however, if we are interested in multi-word units like phrasal verbs, we can use online chi-square or log-likelihood calculators to determine their keyness or use other tools like R.
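As a sketch of the alternative measure mentioned above, the log-likelihood (G²) statistic compares each observed frequency with its expected frequency using 2 Σ O ln(O/E). Here it is applied to the lovely data (my own code, not taken from the sources cited):

```python
import math

# Observed 2x2 table: rows = males/females, cols = 'lovely'/all other words
observed = [[414, 1714029], [1214, 2592238]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

log_likelihood = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency
        log_likelihood += o * math.log(o / e)
log_likelihood *= 2

print(round(log_likelihood, 1))
```

Like the chi-square value, this lands far above the df = 1 critical values, so the two tests agree that lovely is key here.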

Typically, our chi-square tests (or log-likelihood tests) in corpus linguistics will have one degree of freedom: we are comparing one corpus (our target corpus) to another (our reference corpus), and this comparison is a built-in function of the most popular concordancers. However, we can (and do) make comparisons with higher degrees of freedom. Remember that the degrees of freedom change the p-value thresholds.