In the early seventies, the University of California, Berkeley received serious negative attention due to supposed gender bias in graduate admissions. The data for fall 1973 clearly seemed to point in this direction:

          Nr. of applications   Admissions
Male      8442                  44%
Female    4321                  35%

Out of 8442 male applicants, 44% were admitted, whereas out of the 4321 female applicants, only 35% were admitted. The χ²-test on the 2×2 frequency table (or any other sensible test for 2×2 tables) will give a very significant result, with a p-value smaller than one in a billion. A scrutiny of the data in Science by Bickel, Hammel and O’Connell (1975) revealed that there was no evidence for gender bias. This apparently counterintuitive result was due to the interaction with an external variable. Not all departments at the university had the same admission rate, and there was a relation between the proportion of female applications and the admission rate.
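To give an idea of how extreme the aggregate result is, here is a minimal sketch of the χ²-test on this 2×2 table using only the Python standard library. The admitted counts are an assumption: they are reconstructed from the rounded percentages in the table, so they may be off by a few applicants from the real counts. The function name `chi2_2x2` is just a helper for this illustration.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# ASSUMPTION: counts reconstructed from the rounded percentages (44% and 35%).
men_total, women_total = 8442, 4321
men_admitted = round(0.44 * men_total)
women_admitted = round(0.35 * women_total)

chi2, p = chi2_2x2(men_admitted, men_total - men_admitted,
                   women_admitted, women_total - women_admitted)
print(f"chi2 = {chi2:.1f}, p = {p:.1e}")
```

Under this reconstruction the p-value is far below one in a billion, in line with the claim above. It says nothing yet about what happens within departments.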

Competitive departments such as English received relatively many female applications, whereas departments such as chemistry, with a surplus of male applications, were much less selective. When studying the male/female admissions on a departmental level, the supposed gender bias disappeared. (For the fall 1973 data, there even was evidence of bias in favour of women.) This paradox is termed spurious correlation or Simpson’s paradox, after the British statistician Edward Simpson. (For a recent open access paper on Simpson’s paradox in psychological science, see Kievit, Frankenhuis, Waldorp and Borsboom, 2013.)
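The mechanism can be made concrete with a small toy example. The numbers below are hypothetical and are not the Berkeley data; they are chosen only so that women have the higher admission rate within each department while men have the higher rate overall.

```python
# HYPOTHETICAL numbers, chosen only to illustrate the mechanism.
# Each entry is (applicants, admitted) per department and gender.
departments = {
    "less selective": {"men": (800, 480), "women": (200, 130)},
    "competitive":    {"men": (200, 40),  "women": (800, 200)},
}

def rate(applicants, admitted):
    """Admission rate as a fraction."""
    return admitted / applicants

# Accumulate [applicants, admitted] per gender across departments.
totals = {"men": [0, 0], "women": [0, 0]}
for name, groups in departments.items():
    for gender, (applicants, admitted) in groups.items():
        totals[gender][0] += applicants
        totals[gender][1] += admitted
    print(f"{name:>15}: men {rate(*groups['men']):.0%}, "
          f"women {rate(*groups['women']):.0%}")

overall = {gender: rate(*t) for gender, t in totals.items()}
print(f"{'overall':>15}: men {overall['men']:.0%}, women {overall['women']:.0%}")
```

In this toy setting women do better in both departments, yet the pooled rate favours men, simply because most women applied to the competitive department.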

The authors, correctly, point at another pitfall: although there seemed to be evidence of bias (in favour of women) for fall 1973, there is no such evidence for other years. A significant result once in a number of years could just be coincidence.

In the analysis by Van der Lee and Ellemers the same two flaws occur in a setting not too dissimilar from the one discussed above. Based on the results of n = 2,823 grant applications to the “VENI programme” of the Netherlands Organisation for Scientific Research, NWO, in the years 2010, 2011 and 2012, the authors conclude that the data “provide compelling evidence of gender bias in personal grant applications to obtain research funding”. One of the main results this claim is based upon is the following table:

          Applications   Successful
Male      1635           17.7%
Female    1188           14.9%

When applying a standard χ²-test to the data, the authors find a just significant p-value of .045. Not only is it questionable to denote a p-value this close to 0.05 as “compelling evidence”; due to Simpson’s paradox, this p-value simply is wrong.

In the supplementary table S1 (Van der Lee and Ellemers, 2015), available online without paywall, a breakdown of the 2,823 grant applications per discipline is presented. The proportion of female applicants varies from 11.8% (physics) to 51.4% (health sciences), and the total success rate varies from 13.4% (social sciences) to 26.3% (chemical sciences).

Proportion of applications by female scientists vs total success rate. Size of the markers is proportional to number of applications within the discipline.

The figure above visualises these data and immediately shows a clear negative relation between the proportion of female applicants and the total success rate (i.e. the rate for men and women combined). In four out of the nine disciplines, women have a higher success rate than men, and in five out of nine, men have a higher success rate than women. When taking into account that multiple comparisons are performed, for none of the disciplines the gender bias – either in favour of women or in favour of men – is significant (at the α = .05 level). Thus, when taking into account the spurious correlation, the “compelling evidence” is lost.

Bickel et al. (1975) pointed at a second pitfall: focussing on the year(s) where the difference was significant and ignoring the other year(s) where it was not. Again, a similar situation occurs here. NWO publishes the results of all VENI rounds since its establishment in 2002 until 2015 (except for 2012) on its website. In some years, such as 2011, men received relatively more grants than women; and in other years, such as 2010 and 2015, the reverse was true. The z-test for log-odds ratio only provides a significant sign of gender bias in favour of men for the years 2010 (z = 2.002, p = .023) and 2011 (z = 1.752, p = .040) and a significant gender bias in favour of women for 2002 (z = 2.005, p = .022). When applying the Bonferroni correction for multiple comparisons none of these gender biases are significant.
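The Bonferroni step can be sketched as follows. The three z-values are the ones reported above; the reported p-values correspond to one-sided tests. As a deliberately lenient assumption, the number of comparisons m is set to 3 (only the three reported tests). Using the actual number of yearly rounds would make m larger and the correction stricter.

```python
import math

def one_sided_p(z):
    """Upper-tail probability P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z_values = [2.002, 1.752, 2.005]  # z statistics reported in the text
alpha = 0.05
# ASSUMPTION: m = 3 is the most lenient choice (only the three reported tests);
# correcting for every yearly round would use a larger m.
m = len(z_values)

results = []
for z in z_values:
    p = one_sided_p(z)
    p_adj = min(1.0, m * p)  # Bonferroni-adjusted p-value
    results.append((z, p, p_adj))
    print(f"z = {z:.3f}: p = {p:.3f}, Bonferroni-adjusted p = {p_adj:.3f}")

any_significant = any(p_adj < alpha for _, _, p_adj in results)
print("any significant after correction:", any_significant)
```

Even with this lenient m, none of the adjusted p-values falls below .05, so a larger m cannot change the conclusion.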

Conclusion. Van der Lee and Ellemers failed to recognise the dependence of the results on the different NWO disciplines. Furthermore, they focused on results during a three-year period, whereas the results of the other periods in which VENI grants were provided did not confirm the just significant results for 2010-2012. As a consequence, the conclusion of “compelling evidence of gender bias” is inappropriate. In the data, there is no evidence for gender bias (which does not have to mean that there is no gender bias). In discussions on institutional sexual discrimination, it is important to stay factual.

Furthermore, I find it worrying that this analysis gets published. Simpson’s paradox is one of statistics’ most well-known paradoxes (I teach it yearly to a new batch of psychology students in Groningen) and PNAS is a high-ranking journal with an impact factor of nearly ten. This paper – where conclusions are drawn on the basis of flawed methodology – is not an exception. Apparently, the current peer-review system is inadequate in filtering out methodological flaws in papers. If a system doesn’t work, it should be changed.

Final note. The paper by Van der Lee and Ellemers focusses on more tests than just the one criticised by me here. However, these other tests make use of related data (e.g. the number of applicants that go through to the interview-stage) and it is not unlikely that Simpson’s paradox plays a role there too. (The data provided in the paper was insufficient for me to check this.) And even if it does not: the authors are providing interpretations to effects with tiny effect sizes (partial eta-squareds of 0.006(!))… Furthermore, the paper contains a section on “language use” in NWO documents. My comments do not apply to this section.

There’s not a single link. You can find it by typing “jaarverslag 2011” (‘year report 2011’) etc. in the search box at the NWO-site. I have received a table with the data (not split out per discipline); send me an e-mail if you want it.

‘In some years, such as 2011, men received relatively more grants than women; and in other years, such as 2010 and 2015, the reverse was true. The z-test for log-odds ratio only provides a significant sign of gender bias in favour of men for the years 2010 (z = 2.002, p = .023) and 2011 (z = 1.752, p = .040) and a significant gender bias in favour of women for 2002 (z = 2.005, p = .022). When applying the Bonferroni correction for multiple comparisons none of these gender biases are significant.’

The first and second sentences contradict each other regarding the direction of bias in 2010 (a typo, I suppose?).

Leaving this aside, what’s the reasoning behind your application of statistical tests here? The question whether there was bias in, say, 2010 is not a question of inferential statistics. The data *is* the population data.

If, however, the question is whether the data stem from a (hypothetical) infinite pool with bias, why would you split up the available data? This is bound to artificially reduce power. Let’s say I test whether men (on average) are heavier than women. I go about this by weighing two males and two females a day. After 25 days I test across the full sample (n=100) and find the hypothesised weight ‘advantage’ for men. What good would it do to split up the sample according to the 25 days of weighing, leaving 25 samples with n=4? Applying 25 statistical tests and Bonferroni correction would probably yield no statistically significant result. But it would not be a very meaningful way of posing the question either. The only thing it would tell me is that there is likely no ginormous effect size (about a metric ton in this particular case).

The only inference hypothesis I can see here would state that annual data represent random samples from a pool and that this pool is biased. If that’s the case, the best inference will rely on all available data (i.e. pooled across available years). What result do you get for that?

Yes, that is a typo. The second sentence should have read: “The z-test for log-odds ratio only provides a significant sign of gender bias in favour of men for the years *2011* (z = 2.002, p = .023) and *2012* (z = 1.752, p = .040) and [..]”

The main reason for me to do a statistical test is to be comparable with the Van der Lee & Ellemers paper, whose authors also do statistical tests. But we indeed have the peculiar situation that we do know the whole population. With a statistical test one could say something about the odds of a hypothetical grant application by a female submitted in 2010-2012.
Pooling all data is not ideal either since this would only work under the assumption that there are no long term trends in time. It could’ve been theoretically possible, for instance, that the VENI system was highly gender biased, but that after a few years new policies have been made to counter this and that, by now, these policies have gone too far such that there’s bias in the opposite direction. Then, pooling all together would not give a sensible result.

The reasoning seems a little inconsistent. On the one hand you state the stats are motivated by the aim to be comparable with the original paper. But then you go on to test a different hypothesis from theirs.

If you hold the hypothesis that there is an interaction with time, that’s surely an interesting question to follow up on. But it’s simply different from the one they are asking. The original paper seems to test a main effect of gender (w/o using all available data, which surely is silly). I fail to see how the interaction hypothesis you hold should be *mandatory* and attempts to test a main effect invalid. Crucially, inferential stats seem meaningful to me wrt the hypothesis of a main effect they are testing (making inferences about future grant rounds etc). But if the question becomes specific to certain years, we indeed have the relevant population data and tests of significance become absurd.

In short, it seems they tested a hypothesis that can meaningfully be addressed by stats, whereas you aim to test a different hypothesis for which this is not the case.

My guess is that the authors do not study some time-interaction-effect, since they focus on just a small subset in time. My guess for the explanation that they didn’t focus on *all* data from 2002-2012 is that they didn’t receive the data for all years in the detailed level they needed for some analyses (e.g. drop-out per stage in the process) from NWO.

If you look at 2002-2015 combined, the following numbers arise (note: this is what I found in a draft excel, the ‘real’ numbers could be very slightly different, this draft hadn’t been typo-checked):
Total applications by men: 6358
Total applications by women: 4526
Successful men: 1164
Successful women: 779
The standard chi-square test yields a p-value of around 0.14. The alternative tests (Fisher’s exact test, Yates’ correction, etc.) will yield roughly the same, non-sig, p-value.
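For anyone who wants to check this, here is a minimal sketch that recomputes the pooled test from the four draft numbers above, with and without Yates’ continuity correction, using only the Python standard library. The helper name `chi2_2x2` is mine, and the same caveat applies: the input is the draft table, not typo-checked data.

```python
import math

def chi2_2x2(a, b, c, d, yates=False):
    """Chi-square test for the 2x2 table [[a, b], [c, d]], optionally Yates-corrected."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction shrinks |ad - bc| by n/2 (not below zero).
        diff = max(0.0, diff - n / 2)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Draft totals for 2002-2015 combined, as listed above.
men_total, men_success = 6358, 1164
women_total, women_success = 4526, 779

table = (men_success, men_total - men_success,
         women_success, women_total - women_success)
chi2, p = chi2_2x2(*table)
chi2_y, p_y = chi2_2x2(*table, yates=True)
print(f"Pearson: chi2 = {chi2:.2f}, p = {p:.3f}")
print(f"Yates:   chi2 = {chi2_y:.2f}, p = {p_y:.3f}")
```

Both variants give a p-value in the neighbourhood of 0.14 to 0.15, so the choice of correction does not matter for the conclusion.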