On September 21st, Romy van der Lee and Naomi Ellemers published a paper in PNAS in which they claim to have found compelling evidence of gender bias against women in the allocation of NWO Veni-grants in the period 2010-2012.

The day after, I posted a blog post in Dutch criticising this study (and the day after that an abridged version in English). In these posts, I explained how the significance of the result is due to Simpson’s paradox – thus a statistical artefact rather than true evidence for gender bias. This blog post sparked an amount of public interest which was new to me. I normally publish on linear algebra, (minor) improvements to statistical procedures and other topics that are generally regarded as boring. This time, I’ve been interviewed by Nature, Science and various Dutch academic newspapers. (Great evidence of how post-peer review and blog posts are Science 2.0 – but that’s another topic.)

Last week, an abridged and updated version of my blog post appeared as a peer-reviewed letter in PNAS.

Independently, Beate Volker and Wouter Steenbeek had their letter published in PNAS a few days later.

Van der Lee and Ellemers responded to both letters (response 1 and 2). In their responses they misinterpret the consequences of Simpson’s paradox. I wasn’t planning on responding again – my time is limited – but since they repeat this incorrect interpretation in multiple responses as well as in the newspaper, I find it important to outline why their statistical reasoning is flawed.

In this blog post I will outline that a correct interpretation of Simpson’s paradox results in insignificance of many p-values and not just the one I focussed on in my criticism. In their response to my letter, Van der Lee and Ellemers wrote:

“Further, Simpson’s paradox cannot explain that fewer women than men are selected for the next phase in each step of the review procedure”.

In their response to Volker and Steenbeek, they phrased this as:

“Simpson’s paradox also cannot account for the observation that in every step of the review procedure women are less likely than men to be prioritized.”

It is clear from this figure that the gender bias seems to increase in each step of the process. It is true that I, in my letter, focussed on gender bias in the final step – the number of awarded grants. This, however, was due to the word count limit that PNAS imposes and not because the other steps cannot be explained by Simpson’s paradox as well: they can.

It is easier to show this through a constructed example than through the true NWO data. Suppose that the setting is as follows. The funding agency has two research disciplines, A and B. Both receive 100 applications, and three stages (pre-selection, interviews, awards) decide who gets funded. Gender bias is present in neither field A nor field B: gender is no issue in this example. However, the percentage of applications by women differs per field, and so does the number of applications that receive funding.

Field A receives 100 applications: 75 by men and 25 by women. Finally, 40 applications will be funded. So 60 applicants receive bad news, which is equally distributed over the three steps: in each step, 20 scientists will be disappointed. In the case of total absence of gender bias (and coincidence), this leads to the following table:

| Field A | # M | # F | % M | % F |
|---|---|---|---|---|
| Step 0: Applications | 75 | 25 | 75% | 25% |
| Step 1: Pre-selection | 60 | 20 | 75% | 25% |
| Step 2: Interviews | 45 | 15 | 75% | 25% |
| Step 3: Funding | 30 | 10 | 75% | 25% |

As you can see, in each step the gender ratio is 75%-25%. No gender bias at all.

Field B also receives 100 applications: 50 by men and 50 by women. Out of these 100, only 10 will be funded: in each step 30 applications lose out. This leads to the following table:

| Field B | # M | # F | % M | % F |
|---|---|---|---|---|
| Step 0: Applications | 50 | 50 | 50% | 50% |
| Step 1: Pre-selection | 35 | 35 | 50% | 50% |
| Step 2: Interviews | 20 | 20 | 50% | 50% |
| Step 3: Funding | 5 | 5 | 50% | 50% |

Thus also no gender bias in Field B. If we combine the tables for fields A and B (by simply adding up the frequencies for each cell), we obtain:

| Field A + B combined | # M | # F | % M | % F |
|---|---|---|---|---|
| Step 0: Applications | 125 | 75 | 62.5% | 37.5% |
| Step 1: Pre-selection | 95 | 55 | 63.3% | 36.7% |
| Step 2: Interviews | 65 | 35 | 65.0% | 35.0% |
| Step 3: Funding | 35 | 15 | 70.0% | 30.0% |

Converting these percentages into a graph similar to Van der Lee and Ellemers’ Figure 1 provides:

The pattern from the table and figure is very clear: in each step of the process men seem to be favoured at the cost of women. Although the percentages for this example are obviously different from those in the NWO data, the type of pattern is the same. Since in my example there is no gender bias whatsoever, Van der Lee and Ellemers’ claim that “Simpson’s paradox also cannot account for the observation that in every step of the review procedure women are less likely than men to be prioritized” evidently is false. The power of paradoxes should not be underestimated.
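For readers who want to check the arithmetic themselves, here is a minimal Python sketch of the constructed example. The counts are the made-up numbers from the tables for fields A and B (not NWO data), and the names `FIELD_A`, `FIELD_B`, `combine` and `share_of_women` are simply chosen for this illustration.

```python
# Counts (men, women) still in the running after each step, per field.
# These are the constructed numbers from the example, not real NWO data.
STEPS = ["Applications", "Pre-selection", "Interviews", "Funding"]
FIELD_A = [(75, 25), (60, 20), (45, 15), (30, 10)]
FIELD_B = [(50, 50), (35, 35), (20, 20), (5, 5)]


def share_of_women(men, women):
    """Percentage of the remaining applications that come from women."""
    return 100.0 * women / (men + women)


def combine(field_a, field_b):
    """Add the two fields cell by cell, as done for the combined table."""
    return [(ma + mb, fa + fb) for (ma, fa), (mb, fb) in zip(field_a, field_b)]


combined = combine(FIELD_A, FIELD_B)

for name, table in [("A", FIELD_A), ("B", FIELD_B), ("A+B", combined)]:
    shares = [round(share_of_women(m, f), 1) for m, f in table]
    print(name, dict(zip(STEPS, shares)))
```

Within each field the share of women is constant over the steps (25% in A, 50% in B), yet in the combined counts it drops from 37.5% to 30.0%. The only ingredients are the different proportions of female applicants and the different rejection rates per field.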

As a final note: as outlined above, the significant results claimed by Van der Lee and Ellemers are lost once correct statistical reasoning is applied. It is important though to realise that the absence of significant gender bias does not imply that there is no gender bias. There could be, and it is important to find out whether – and where! – this is the case or not. To conclude, I quote Volker and Steenbeek, who write:

More in-depth analyses with statistical techniques that overcome the above-mentioned issues are needed before jumping to conclusions about gender inequality in grant awards.

In the early seventies, the University of California, Berkeley received serious negative attention due to supposed gender bias in graduate admissions. The data for fall 1973 clearly seemed to point in this direction:

| | Nr. of applications | Admissions |
|---|---|---|
| Male | 8442 | 44% |
| Female | 4321 | 35% |

Out of 8442 male applicants, 44% were admitted, whereas out of the 4321 female applicants, only 35% were admitted. The χ2-test on the 2×2 frequency table (or any other sensible test for 2×2 tables) will give a very significant result, with a p-value smaller than one in a billion. A scrutiny of the data in Science by Bickel, Hammel and O’Connell (1975) revealed that there was no evidence for gender bias. This apparently counterintuitive result was due to the interaction with an external variable. Not all departments at the university had the same admission rate, and there was a relation between the proportion of female applications and the admission rate.
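The size of that aggregate effect can be reproduced with a few lines of Python. This is only a sketch: the admitted counts are reconstructed by rounding the quoted percentages (an assumption, the exact counts are in Bickel et al.), the test is the plain Pearson χ2 without continuity correction, and the function name `chi_square_2x2` is made up for this illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Assumption: counts rebuilt from the rounded percentages quoted in the text.
men_total, women_total = 8442, 4321
men_admitted = round(0.44 * men_total)
women_admitted = round(0.35 * women_total)

stat, p_value = chi_square_2x2(
    men_admitted, men_total - men_admitted,
    women_admitted, women_total - women_admitted,
)
print(f"chi-square = {stat:.1f}, p = {p_value:.2e}")
```

The test only looks at the pooled table, so it cannot distinguish bias from a difference in which departments men and women apply to.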

Competitive departments such as English received relatively many female applications, whereas departments such as chemistry, with a surplus of male applications, were much less selective. When studying the male/female admissions on a departmental level, the supposed gender bias disappeared. (For the fall 1973 data, there even was evidence of bias in favour of women.) This paradox is termed spurious correlation or Simpson’s paradox, after the British statistician Edward Simpson. (For a recent open access paper on Simpson’s paradox in psychological science, see Kievit, Frankenhuis, Waldorp and Borsboom, 2013.)

The authors correctly point at another pitfall: although there seemed to be evidence of bias (in favour of women) for fall 1973, there is no such evidence for other years. A significant result once in a number of years could just be coincidence.

In the analysis by Van der Lee and Ellemers the same two flaws occur in a setting not too dissimilar from the one discussed above. Based on the results of n = 2,823 grant applications to the “VENI programme” of the Netherlands Organisation for Scientific Research, NWO, in the years 2010, 2011 and 2012, the authors conclude that the data “provide compelling evidence of gender bias in personal grant applications to obtain research funding”. One of the main results this claim is based upon is the following table:

| | Applications | Successful |
|---|---|---|
| Male | 1635 | 17.7% |
| Female | 1188 | 14.9% |

When applying a standard χ2-test to the data, the authors find a just-significant p-value of .045. Not only is it questionable to call a p-value this close to 0.05 “compelling evidence”; due to Simpson’s paradox, this p-value is simply wrong.

In the supplementary table S1 (Van der Lee and Ellemers, 2015), available online without paywall, a breakdown of the 2,823 grant applications per discipline is presented. The proportion of female applicants varies from 11.8% (physics) to 51.4% (health sciences), and the total success rate varies from 13.4% (social sciences) to 26.3% (chemical sciences).

Proportion of applications by female scientists vs total success rate. Size of the markers is proportional to number of applications within the discipline.

The figure above visualises these data and immediately shows a clear negative relation between the proportion of female applicants and the total success rate (i.e. the rate for men and women combined). In four out of the nine disciplines, women have a higher success rate than men, and in five out of nine, men have a higher success rate than women. When taking into account that multiple comparisons are performed, for none of the disciplines is the gender bias – either in favour of women or in favour of men – significant (at the α = .05 level). Thus, when taking into account the spurious correlation, the “compelling evidence” is lost.

Bickel et al. (1975) pointed at a second pitfall: focussing on the year(s) where the difference was significant and ignoring the other year(s) where it was not. Again, a similar situation occurs here. NWO publishes the results of all VENI rounds since its establishment in 2002 until 2015 (except for 2012) on its website. In some years, such as 2011, men received relatively more grants than women; and in other years, such as 2010 and 2015, the reverse was true. The z-test for log-odds ratio only provides a significant sign of gender bias in favour of men for the years 2010 (z = 2.002, p = .023) and 2011 (z = 1.752, p = .040) and a significant gender bias in favour of women for 2002 (z = 2.005, p = .022). When applying the Bonferroni correction for multiple comparisons, none of these gender biases is significant.
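The Bonferroni step is simple enough to spell out. The sketch below uses only the three p-values quoted above; the number of comparisons (13 rounds, i.e. 2002 up to and including 2015 without 2012) is my reading of the sentence about the NWO website and should be treated as an assumption, as should the variable names.

```python
ALPHA = 0.05

# Assumption: one test per published VENI round, 2002..2015 without 2012.
N_ROUNDS = len([year for year in range(2002, 2016) if year != 2012])

# p-values as reported in the text for the only nominally significant years.
reported_p = {2002: 0.022, 2010: 0.023, 2011: 0.040}

# Bonferroni: compare each p-value against alpha divided by the number of tests.
threshold = ALPHA / N_ROUNDS
significant = {year: p <= threshold for year, p in reported_p.items()}

print(f"{N_ROUNDS} tests, per-test threshold = {threshold:.4f}")
for year, p in sorted(reported_p.items()):
    verdict = "significant" if significant[year] else "not significant"
    print(f"{year}: p = {p:.3f} -> {verdict}")
```

Even under the far more lenient assumption that only these three years were tested, the threshold would be .05/3 ≈ .0167, which is still below all three reported p-values.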

Conclusion. Van der Lee and Ellemers failed to recognise the dependence of the results on the different NWO disciplines. Furthermore, they focused on results during a three-year period, whereas the results of the other periods in which VENI-grants were awarded did not confirm the just-significant results for 2010-2012. As a consequence, the conclusion of “compelling evidence of gender bias” is inappropriate. In the data, there is no evidence for gender bias (which does not have to mean that there is no gender bias). In discussions on institutional sexual discrimination, it is important to stay factual.

Furthermore, I find it worrying that this analysis got published. Simpson’s paradox is one of statistics’ best-known paradoxes (I teach it yearly to a new batch of psychology students in Groningen) and PNAS is a high-ranking journal with an impact factor of nearly ten. This paper – where conclusions are drawn on the basis of flawed methodology – is not an exception. Apparently, the current peer-review system is inadequate in filtering out methodological flaws in papers. If a system doesn’t work, it should be changed.

Final note. The paper by Van der Lee and Ellemers focusses on more tests than just the one criticised by me here. However, these other tests make use of related data (e.g. the number of applicants that go through to the interview-stage) and it is not unlikely that Simpson’s paradox plays a role there too. (The data provided in the paper was insufficient for me to check this.) And even if it does not: the authors are providing interpretations to effects with tiny effect sizes (partial eta-squareds of 0.006(!))… Furthermore, the paper contains a section on “language use” in NWO documents. My comments do not apply to this section.