Twitter was appalled. Philip Moriarty, in a much-retweeted plea, said: “Ugh. *Please* stop giving credence to simplistic metrics like the h-index. V. damaging”. David Colquhoun, with whom I agree on many things, responded like an exorcist confronted with the spawn of the devil, arguing that any use of metrics would just encourage universities to pressurise staff to increase their H-indices.

Now, as I’ve explained before, I don’t particularly like metrics. In fact, my latest proposal is to drop both REF and metrics and simply award funding on the basis of the number of research-active people in a department. But I’ve become intrigued by the loathing of metrics that is revealed whenever a metrics-based system is suggested, particularly since some of the arguments put forward do seem rather illogical.

Odd idea #1 is that doing a study relating metrics to funding outcomes is ‘giving credence’ to metrics. It’s not. What would give credence would be if the prediction of REF outcomes from H-index turned out to be very good. We already know that whereas it seems to give reasonable predictions for sciences, it’s much less accurate for humanities. It will be interesting to see how things turn out for the REF, but it’s an empirical question.

Odd idea #2 is that use of metrics will lead to gaming. Of course it will! Gaming will be a problem for any method of allocating money. The answer to gaming, though, is to be aware of how this might be achieved and to block obvious strategies, not to dismiss any system that could potentially be gamed. I suspect the H-index is less easy to game than many other metrics - though I’m aware of one remarkable case where a journal editor has garnered an impressive H-index from papers published in his own journals, with numerous citations to his own work. In general, though, those of us without editorial control are more likely to get a high H-index from publishing smaller amounts of high-quality science than churning out pot-boilers.
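For readers who have not met it, the H-index is the largest number h such that h papers each have at least h citations. The point about pot-boilers can be illustrated with a minimal Python sketch. The two citation lists are invented purely for illustration and do not describe any real researcher, and the function name `h_index` is my own choice.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the paper at this rank still has enough citations
        else:
            break  # every later paper has even fewer citations
    return h


# Hypothetical citation counts, made up for this example.
few_strong_papers = [60, 45, 40, 33, 28, 25, 21, 18]
many_weak_papers = [3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(h_index(few_strong_papers))  # 8 papers, all well cited
print(h_index(many_weak_papers))   # 20 papers, few citations each
```

With these made-up numbers the eight well-cited papers give an H-index of 8, whereas the twenty rarely cited papers give an H-index of only 2. Adding more uncited papers to the second list would not change its score.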

Odd idea #3 is the assumption that the REF’s system of peer review is preferable to a metric. At the HEFCE metrics meeting I attended last month, almost everyone was in favour of complex, qualitative methods of assessing research. David Colquhoun argued passionately that to evaluate research you need to read the publications. To disagree with that would be like slamming motherhood and apple pie. But, as Derek Sayer has pointed out, it is inevitable that the ‘peer review’ component of the REF will be flawed, given that panel members are required to evaluate several hundred submissions in a matter of weeks. The workload is immense and cannot involve the careful consideration of the content of books or journal articles, many of which will be outside the reader’s area of expertise.

My argument is a pragmatic one: we are currently engaged in a complex evaluation exercise that is enormously expensive in time and money, that has distorted incentives in academia, and that cannot be regarded as a ‘gold standard’. So, as an empirical scientist, my view is that we should be looking hard at other options, to see whether we might be able to achieve similar results in a more cost-effective way.

Different methods can be compared in terms of the final result, and also in terms of unintended consequences. For instance, in its current manifestation, the REF encourages universities to take on research staff shortly before the deadline – as satirised by Laurie Taylor (see Appointments section of this article). In contrast, if departments were rewarded for a high H-index, there would be no incentive for such behaviour. Also, staff members who were not principal investigators but who made valuable contributions to research would be appreciated, rather than threatened with redundancy. Use of an H-index would also avoid the invidious process of selecting staff for inclusion in the REF.

I suspect, anyhow, we will find predictions from the H-index are less good for REF than for RAE. One difficulty for Mryglod et al. is that it is not clear whether the Units of Assessment they base their predictions on will correspond to those used in REF. Furthermore, in REF, a substantial proportion of the overall score comes from impact, evaluated on the basis of case studies. To quote from the REF2014 website: “Case studies may include any social, economic or cultural impact or benefit beyond academia that has taken place during the assessment period, and was underpinned by excellent research produced by the submitting institution within a given timeframe.” My impression is that impact was included precisely to capture an aspect of academic quality that was orthogonal to traditional citation-based metrics, and so this should weaken any correlation of outcomes with H-index.

Be this as it may, I’m intrigued by people’s reactions to the H-index suggestion, and wondering whether this relates to the subject one works in. For those in arts and humanities, it is particularly self-evident that we cannot capture all the nuances of departmental quality from an H-index – and indeed, it is already clear that correlations between H-index and RAE outcomes are relatively low in these disciplines. These academics work in fields where complex, qualitative analysis is essential. Interestingly, RAE outcomes in arts and humanities (as with other subjects) are pretty well predicted by departmental size, and it could be argued that this would be the most effective way of allocating funds.

Those who work in the hard sciences, on the other hand, take precision of measurement very seriously. Physicists, chemists and biologists are often working with phenomena that can be measured precisely and unambiguously. Their dislike for an H-index might, therefore, stem from awareness of its inherent flaws: it varies with subject area and can be influenced by odd things, such as high citations arising from notoriety.

Psychologists, though, sit between these extremes. The phenomena we work with are complex. Many of us strive to treat them quantitatively, but we are used to dealing with measurements that are imperfect but ‘good enough’. To take an example from my own research: years ago I wanted to measure the severity of children’s language problems, and I was using an elicitation task, where the child was shown pictures and asked to say what was happening. The test had a straightforward scoring system that gave indices of the maturity of the content and grammar of the responses. Various people, however, criticised this as too simple. I should take a spontaneous language sample, I was told, and do a full grammatical analysis. So, being young and impressionable, I did. I ended up spending hours transcribing tape-recordings from largely silent children, and hours more mapping their utterances onto a complex grammatical chart. The outcome: I got virtually the same result from the two processes – one which took ten minutes and the other which took two days.

Psychologists evaluate their measures in terms of how reliable (repeatable) they are and how validly they do what they are supposed to do. My approach to the REF is the same as my approach to the rest of my work: try to work with measures that are detailed and complex enough to be valid for their intended purpose, but no more so. To work out whether a measure fits that bill, we need to do empirical studies comparing different approaches – not just rely on our gut reaction.
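One simple way to run that kind of comparison is to score the same cases with a quick measure and a detailed one, and see how closely the two sets of scores agree. The sketch below computes a Pearson correlation in plain Python. The scores are hypothetical numbers invented for illustration, not data from my own study or from any REF or RAE analysis, and a correlation is only one of several ways of judging agreement.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one measure is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same eight cases on two measures (invented data).
quick_scores = [12, 18, 9, 22, 15, 7, 20, 14]       # the ten-minute measure
detailed_scores = [31, 44, 25, 52, 39, 20, 49, 35]  # the two-day measure

r = pearson_r(quick_scores, detailed_scores)
print(round(r, 3))
```

If the correlation is close to 1, as it is for these made-up scores, the detailed measure is telling us little that the quick one does not. Whether that level of agreement is good enough depends on what the measure is going to be used for.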