Why corpus linguists should be wary of kidney stones and Simpson’s paradox

What does your intuition tell you when a phenomenon is counterintuitive?

It is my intuition that you cannot do good research without intuition. But, without safeguards, intuition can play tricks on you. Suppose that scientists compared the efficacy of two drugs in curing the same disease. They ran two clinical trials. After each trial, they calculated the percentage of patients cured by each drug. Tab. 1 summarizes the results.

Tab. 1. Number of successes in two fictitious clinical trials

          trial 1    trial 2
drug A    60%        10%
drug B    90%        30%

In the first trial, drug A cures 60% of the patients and drug B cures 90% of the patients. If you were to pick a drug to treat yourself, your intuition would tell you to choose drug B over drug A (I would). In the second trial, drug A cures 10% of the patients and drug B cures 30% of the patients. Again, you have every reason to go for drug B. After two trials, drug B is the big winner. Yet, you would be well advised to think twice before making a rushed decision, especially if your life depends on it. You might believe that the drugs have not been tested on the same number of patients, but this need not be the case.

Because percentages are notoriously unforgiving, and because contingency tables should never be interpreted without reference to the row/column totals (known as the marginal totals, i.e. the totals in the margins), let us examine the same table again, this time adding raw counts and the row totals (Tab. 2).

Tab. 2. Success rates in two fictitious clinical trials

          trial 1         trial 2         both trials
drug A    60/100 (60%)    1/10 (10%)      61/110 (55.45%)
drug B    9/10 (90%)      30/100 (30%)    39/110 (35.45%)

This time, the success rate of drug A (55.45%) is superior to the success rate of drug B (35.45%). The drug that you should take is in fact drug A, by far. This is nothing but a manifestation of Simpson’s paradox.
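The arithmetic behind Tab. 2 can be checked with a minimal Python sketch. The counts are copied from the table; the variable and function names are mine and purely illustrative.

```python
from fractions import Fraction

# (cured, treated) for each drug in each trial, copied from Tab. 2
counts = {
    "drug A": {"trial 1": (60, 100), "trial 2": (1, 10)},
    "drug B": {"trial 1": (9, 10), "trial 2": (30, 100)},
}


def pooled_rate(trials):
    """Add up the raw counts across trials, then compute one success rate."""
    cured = sum(c for c, _ in trials.values())
    treated = sum(t for _, t in trials.values())
    return Fraction(cured, treated)


# Success rate within each trial (exact fractions, no rounding)
per_trial = {
    drug: {trial: Fraction(c, t) for trial, (c, t) in trials.items()}
    for drug, trials in counts.items()
}
# Success rate once the two trials are merged
pooled = {drug: pooled_rate(trials) for drug, trials in counts.items()}

for drug in counts:
    rates = ", ".join(f"{t}: {float(r):.2%}" for t, r in per_trial[drug].items())
    print(f"{drug}  {rates}, both trials: {float(pooled[drug]):.2%}")
```

Drug B has the higher rate within each trial, yet the pooled rates come out at 55.45% for drug A and 35.45% for drug B. The reversal arises because each drug was mostly tested in a different trial: drug A in the trial where cures were frequent, drug B in the trial where they were rare.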

If you were a scientist, which drug would you choose?

A very short history of Simpson’s paradox

The phenomenon derives its name from Edward H. Simpson, who addressed the paradox in the early 1950s (Simpson 1951). However, it was first mentioned much earlier, by Pearson et al. (1899) and Yule (1903). The phenomenon was not called a paradox until Blyth (1972). For more details on the history of Simpson’s paradox, see Pearl (2000: chapter 6).

The paradox “describes a phenomenon whereby an association between two variables reverses sign upon conditioning on a third variable, regardless of the value taken by the latter” (Pearl 2013: Sect. 3.1). It has been found to occur in many scientific fields. The importance of spotting the paradox has been noted in disciplines where decision making is a matter of life and death, such as clinical trials and biostatistics.

Kidney stones

When you are given the raw counts in the second table above, you might think that the differences between the counts are too extreme and that the table is rigged. To convince you otherwise, let me provide a classic example, which was pointed out to me by Antoine Chambaz (Chambaz, Drouet & Memetea, to appear).

Charig et al. (1986) compared different methods of treating kidney stones to determine which was the most successful and the cheapest.

Of 1052 patients with renal calculi, 350 underwent open surgery, 350 percutaneous nephrolithotomy, 328 extracorporeal shockwave lithotripsy (ESWL), and 24 both percutaneous nephrolithotomy and ESWL. Treatment was defined as successful if stones were eliminated or reduced to less than 2 mm after three months.

The authors summarized their findings in a table such as Tab. 3.

Tab. 3. A contingency table summarizing Charig et al. (1986)

                                stone size < 2 cm    stone size ≥ 2 cm    total
open surgery                    81 (93%)             192 (73%)            273 (78%)
percutaneous nephrolithotomy    234 (87%)            55 (69%)             289 (83%)
ESWL                            200 (98%)            101 (82%)            301 (92%)

The table compares the success rate of three methods. Each cell displays the number of patients for whom the method was a success and the corresponding percentage. Focusing on the rightmost column (‘total’), Charig et al. rightly concluded that ESWL was the best method. The study also showed that it was the cheapest. The authors also observed that percutaneous nephrolithotomy performed better than open surgery.

Julious and Mullee (1994) looked at the same table and noticed that, for open surgery vs. percutaneous nephrolithotomy, the success rates reverse when stone size is taken into account: open surgery then has a higher success rate than percutaneous nephrolithotomy. Surprisingly, this is true regardless of the size of the kidney stone! For smaller stones, success rates of 93% and 87% are observed for open surgery and percutaneous nephrolithotomy respectively (open surgery is the winner). For larger stones, open surgery beats percutaneous nephrolithotomy again, with success rates of 73% and 69% respectively.

To sum up, the method (open surgery vs. percutaneous nephrolithotomy) and the outcome (success vs. failure) are both strongly associated with a third variable (stone size), which is known as a confounding variable. This is common in observational studies. What is unusual is the fact that the third variable (stone size) reverses the effect observed between the method and the outcome. This reversal is the hallmark of Simpson’s paradox.
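The reversal can be reproduced with a short Python sketch. The numbers of successes come from Tab. 3 and the totals of 350 patients per method come from the abstract quoted above. The subgroup sizes (87, 263, 270 and 80 patients) do not appear in Tab. 3: they are the figures commonly reported for this dataset, and should be treated as an assumption here.

```python
# (successes, patients) per method and stone size.
# Successes are from Tab. 3; the subgroup sizes are an assumption (the
# commonly reported figures), chosen so that each method totals 350 patients.
data = {
    "open surgery": {"< 2 cm": (81, 87), ">= 2 cm": (192, 263)},
    "percutaneous nephrolithotomy": {"< 2 cm": (234, 270), ">= 2 cm": (55, 80)},
}

# Success rate within each stone-size group
stratified = {
    method: {size: s / n for size, (s, n) in groups.items()}
    for method, groups in data.items()
}
# Success rate when stone size is ignored
pooled = {
    method: sum(s for s, _ in groups.values()) / sum(n for _, n in groups.values())
    for method, groups in data.items()
}
# Proportion of each method's patients who had large stones
large_share = {
    method: groups[">= 2 cm"][1] / sum(n for _, n in groups.values())
    for method, groups in data.items()
}

for method in data:
    by_size = ", ".join(f"{size}: {r:.0%}" for size, r in stratified[method].items())
    print(f"{method}: {by_size}, total: {pooled[method]:.0%}, "
          f"large stones: {large_share[method]:.0%} of patients")
```

Under these assumed group sizes, about three quarters of the open-surgery patients had large stones, against fewer than a quarter of the percutaneous nephrolithotomy patients. Open surgery was thus mostly applied to the harder cases, which drags its overall rate down even though it does better within each stone-size group.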

Again, if you were a doctor and had to decide which method to use to treat a patient, which one would you choose? Julious and Mullee (1994: 1480) do not answer this question. Rather, they advocate randomised trials:

The main reason why the success rate reversed is because the probability of having open surgery or percutaneous nephrolithotomy varied according to the diameter of the stones. In observational (non-randomised) studies comparing treatments it is likely that the initial choice of treatment would have been influenced by patients’ characteristics such as age or severity of condition; so any difference between treatments could be accounted for by these original factors. Such a situation may arise when a new treatment is being phased in over time. Randomised trials are therefore necessary to demonstrate any treatment effect.

Chambaz, Drouet & Memetea (to appear) and, more generally, experts in causal inference, propose a different take. They convincingly argue that the causal structure underlying the kidney stone example is underdetermined. By examining causal graphs, the authors reason as follows:

Intuitively given prior knowledge of cause and effect in the etiology of disease, it seems fair to assume (i) that both stone diameter and method do influence success causally, and (ii) that neither the method nor the success of the removal can causally influence stone diameter. The contrary would blatantly violate the chronology of events, which in general coincides with the causal ordering. (p.7)

The conjoined influence of diameter and method is at the core of a causal graph, which they name RHS.

Consequently, the choice of the RHS as the correct or most plausible causal representation of the scenario mandates the choice of open surgery. (ibid.)

Causal underdetermination also applies to the example involving two drugs in a clinical trial. Reformulating the issue in causal terms provides a sensible way out when Simpson’s paradox applies, i.e. when one ends up with two contradictory answers to the same question.

Visualizing Simpson’s paradox

Let me illustrate Simpson’s paradox graphically with R. The code below is available from my Nakala repository. First, we load the car package and create some data.

Note that the two graphs are identical with respect to how the data points are positioned. In the first graph, the correlation between variable X and variable Y is positive. In the second graph, the correlation between X and Y in each subgroup is negative.
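For readers who want to check the two correlations numerically, here is a minimal sketch in Python. It is not the R code from the repository mentioned above: the data are simulated, the group centres, sample sizes and noise level are arbitrary assumptions, and no plot is drawn.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so that the run is reproducible


def make_group(center, n=50):
    """Simulate n points that slope downwards around (center, center)."""
    xs = [center + rng.uniform(-2, 2) for _ in range(n)]
    ys = [center - (x - center) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


# Three hypothetical subgroups whose centres rise along the diagonal
groups = {name: make_group(c) for name, c in
          [("group 1", 2), ("group 2", 6), ("group 3", 10)]}

# Same points, with the grouping ignored
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]

pooled_r = pearson(all_x, all_y)
within_r = {name: pearson(xs, ys) for name, (xs, ys) in groups.items()}

print(f"all points together: r = {pooled_r:+.2f}")
for name, r in within_r.items():
    print(f"{name}: r = {r:+.2f}")
```

The pooled correlation is positive because the group centres climb from the bottom left to the top right, whereas inside each group the points slope downwards, so every within-group correlation is negative.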

What’s in it for linguists?

Let me now provide a linguistic illustration of Simpson’s paradox based on the dative dataset (Bresnan et al. 2007, Baayen 2013). It had gone unnoticed until we spotted it in Chambaz and Desagulier (2016).

Well known to linguists is the dative alternation, which consists of the prepositional dative construction (henceforth PD) and the ditransitive construction (or double-object construction, henceforth DO), as exemplified below:

PD    Anthony    gave    good advice    to Will.
      S_AGENT    V       O_THEME        O_RECIPIENT

DO    Anthony    gave    Will           good advice.
      S_AGENT    V       O_RECIPIENT    O_THEME

The dative event involves three participants: a giver (Anthony), someone who receives something (Will), and an entity transferred from the giver to the recipient (good advice). In terms of semantic roles, the giver is an agent, the participant receiving something is the recipient, and the entity transferred from the agent to the recipient is a theme. What alternates in this case is the realization of the recipient and the theme, one of which must be an object while the other can be either a direct object or a prepositional object.

In PD and DO, the theme can be either definite, as in (1) and (2), or indefinite, as in (3) and (4).

(1) Anthony bought the cake for Will. (PD)

(2) Anthony bought Will the cake. (DO)

(3) Anthony bought a cake for Will. (PD)

(4) Anthony bought Will a cake. (DO)

The contingency table below summarizes the count of prepositional datives (PD) and double-object datives (DO) when the theme is definite or indefinite in the dative dataset.

Tab. 4. Contingency table summarizing the count of prepositional datives (PD) and double-object datives (DO) when the theme is definite or indefinite

      theme: definite    theme: indefinite
PD    63                 378
DO    28                 858

On the basis of this table, let us calculate how much the probability of obtaining a PD differs depending on whether the theme is definite or indefinite, neglecting contextual information. This difference is known in statistics as the excess risk parameter ER(P). We calculate it as follows:

ER(P) = \frac{63}{63+28}-\frac{378}{378+858}

We find that ER(P) ≈ 38.65%. The associated 95% confidence interval is [28.82%, 48.48%]. On the other hand, when we take the context into account and average it out, we find a 5.68% decrease in the probability of obtaining a PD construction when the theme is switched from definite to indefinite, with a 95% confidence interval of [−8.65%, −2.72%]! The difference between ER(P) and our estimator suggests that confounding is at play. This is, again, an effect of Simpson’s paradox.
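The naive estimate ER(P) can be recomputed from Tab. 4 with a few lines of Python. The sketch below also computes a plain Wald interval for a difference of two proportions; this choice is my assumption, since the method behind the reported interval is not stated, so the bounds may differ slightly from those given above. The context-adjusted estimate is not reproduced, as it requires the full dative dataset.

```python
from math import sqrt

# Counts from Tab. 4
pd_definite, pd_indefinite = 63, 378
do_definite, do_indefinite = 28, 858

n_definite = pd_definite + do_definite        # all datives with a definite theme
n_indefinite = pd_indefinite + do_indefinite  # all datives with an indefinite theme

# Proportion of PD for each kind of theme
p_definite = pd_definite / n_definite
p_indefinite = pd_indefinite / n_indefinite

# Excess risk: difference between the two proportions
er = p_definite - p_indefinite

# Wald standard error for a difference of two independent proportions
# (an assumption: the interval reported in the text may rely on another method)
se = sqrt(p_definite * (1 - p_definite) / n_definite
          + p_indefinite * (1 - p_indefinite) / n_indefinite)
z = 1.96  # approximate 97.5th percentile of the standard normal distribution
lower, upper = er - z * se, er + z * se

print(f"ER(P) = {er:.2%}")
print(f"approximate 95% Wald interval: [{lower:.2%}, {upper:.2%}]")
```

The point estimate matches the 38.65% given above. What this sketch cannot show is the adjustment for context, which is precisely where the sign of the effect changes.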

The bottom line

Go back to Tab. 2. Suppose that, instead of testing drugs, you compare the presence/absence of two alternating constructions in a corpus (alternant A and alternant B). The corpus breaks down into a written subcorpus and a spoken subcorpus.

If you split the corpus into its written and spoken components, you conclude that alternant B occurs more often than alternant A, regardless of the context. If you focus on the rightmost column, and disregard the written vs. spoken distinction, you conclude that alternant A is more frequent than alternant B.

The bottom line is that there is nothing wrong with working with your intuition. On the contrary, your intuition should trigger a warning and tell you to think twice before taking the trends in a contingency table at face value, especially when you add a confounding variable.
