Systematic assessments of the research performance of academic institutions are increasingly common around the world. A key question for the design of such systems is whether and how bibliometrics should be incorporated. This column argues that bibliometrics can perform well at identifying quality in some fields, while providing cost-effective and transparent review. Peer review is found to be no guarantor of quality, though it may be essential in the evaluation of certain fields.

The periodic assessment of the performance of academic institutions has become common across Europe. Systematic research assessment is also conducted elsewhere (for example, in Australia), though it is less developed in some major academic systems, notably that of the US. Further, the design of these systems differs along many dimensions, including the period of review, the implications for funding, and the way bibliometric indicators and peer review are combined to form a judgement of research quality and volume.

A 2008 European Commission report outlines the variety of systems that exist or have existed (European Commission 2008), as does the recent Stern Review of the UK Research Excellence Framework (REF) (BIS 2016). The recommendations of the Stern Review, which have just been released, are wide ranging: all research staff should be submitted to the REF, an average number of outputs per unit of assessment rather than a set number per person should be returned, that average should be set quite low for cost reasons, and bibliometrics should be used to support peer review.

Several recent papers have addressed alternative designs. As much of the work has been undertaken by economists, the evaluation of economics has been a primary case study for these articles.

However, a definitive answer on the optimal design of such a system remains elusive. For example, in order to evaluate the quality of research outputs, many systems rely on some kind of ranking for journal outputs. These systems often rank journals differently, calling into question whether any single one of these rankings could accurately reflect ‘true’ quality. Indeed, the diversity of rankings suggests an arbitrary component in a review could result from the specific features of the ranking on which it is based. Hudson (2013) addresses this issue for the case of economics by building an overall picture of journal rankings in economics from a variety of different and sometimes inconsistent published ranking systems. Such a meta-ranking could represent a consensus view. On the other hand, the European Commission (2008) points out that exactly how one evaluates outputs depends on the priorities of the system conducting the review. For example, some countries prioritise outputs published in the local language, an effect that is not taken into account in the rankings evaluated by Hudson (2013). It is clear, then, that the ‘right’ design for research evaluation will depend on local preferences. Designing a single, European-wide system may not be desirable with such a diversity of goals and priorities.
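To make the idea of a meta-ranking concrete, the sketch below combines several inconsistent journal rankings by averaging each journal's scaled position across the lists in which it appears. This is only an illustrative aggregation rule, not the method used by Hudson (2013); the journal names, the rankings and the function `meta_ranking` are all hypothetical.

```python
from statistics import mean


def meta_ranking(rankings):
    """Combine several journal rankings into one consensus ordering.

    `rankings` is a list of lists, each ordered best-first. A journal's
    score in one list is its position scaled to [0, 1] (0 = best).
    Journals missing from a list are skipped for that list.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, journal in enumerate(ranking):
            scaled = position / (n - 1) if n > 1 else 0.0
            scores.setdefault(journal, []).append(scaled)
    # Lower mean scaled position means better consensus standing;
    # ties are broken alphabetically by journal name.
    return sorted(scores, key=lambda j: (mean(scores[j]), j))


# Hypothetical journals and rankings, for illustration only.
ranking_a = ["J1", "J2", "J3", "J4"]
ranking_b = ["J2", "J1", "J4", "J3"]
ranking_c = ["J1", "J3", "J2"]  # this list does not cover J4

consensus = meta_ranking([ranking_a, ranking_b, ranking_c])
print(consensus)
```

A rule of this kind treats every source ranking as equally informative. A system that prioritises, say, local-language outlets would need different weights, which is exactly the point about local preferences made above.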

Even aside from linguistic issues, it is not clear how to measure what constitutes high-quality research output. For example, recent discussions surrounding the REF have centred on whether citations should be used, with the idea that high quality is revealed by an output’s ability to spawn future research streams. Laband (2013) and Sgroi and Oswald (2013) evaluate the role of citations in conducting quality evaluations and how one should combine output counts with citations to generate an overall view of quality. Even if one agrees that citations are a good indicator of quality, however, these analyses are technically complex. This complexity could make implementation of their methods problematic – any review process must be sufficiently transparent to recruit supporters from the population that is under review. Indeed, some academics have claimed that they will boycott the most recent Italian research review, which renders it of little use in evaluating a national research system (Baccini and De Nicolao 2016). Any usable system must attain ‘buy in’ to be effective. To do so, it must be viewed as both accurate and understandable.

Further, as Lord Stern has noted, research assessments are intended to raise overall research quality, and if the review process is not sufficiently understandable to all constituents, it is hard to see how the process will influence behaviour positively and predictably. Clearly, an accurate and understandable system can potentially influence research via ‘targeting behaviour’ by academics. A system that is either not accurate or not understandable might simply induce unproductive changes in where and how outputs are published. Similarly, a system that is subject to manipulation, such as the system proposed by Wang et al. (2016), could also generate unproductive effort rather than the desired increase in research quality.

In practice, research assessment can be costly. The 2014 REF assessment in the UK has been estimated to cost anywhere from £121 million to over £1 billion, depending on how fully one prices the exercise (Jump 2015), with an often-cited middle ground of about £250 million (BIS 2016). A worrying element, pointed out by the Stern Review, is that this cost has risen from £66 million in 2008. This, together with the variety of designs available, raises the question of how one should conduct such an exercise with both cost and accuracy in mind.

As the time spent by peer reviewers (and staff at the institutions undergoing review) is a major component of costs, a bibliometric system could, at first sight, have advantages. On the other hand, bibliometrics may not be suited to some fields. There may also be elements of the review, such as impact, that are not conducive to bibliometrics.

The Stern Review has argued that peer review should be retained despite its cost. However, given these high costs, we cannot take for granted that peer review is any guarantor of accuracy. Gans and Shepherd (1994) note that the work of Nobel Prize and Clark Medal winners was not well identified by discipline-based peer reviewers, a point echoed by Starbuck (2005, 2006). More recently, Bertocchi et al. (2014) investigate bibliometrics and find that peer review-based and bibliometrics-based approaches are in relatively good agreement, at least for the subject areas that they study, although they note that the peer review in their sample was perhaps contaminated by bibliometrics. Still, if peer review is not clearly better, then there is a trade-off to consider between cost and review method.

One way to assess bibliometrics as an alternative, at least for some fields, is to ask how their accuracy as indicators of quality might be validated. In other words, if we cannot necessarily validate bibliometrics with peer review, we should look for other external validation. Such validation benchmarks have not been investigated in the literature, but they offer a promising test of whether bibliometrics perform well at identifying quality research.

New research

In a new paper, we address this lacuna by building simple, readily understandable bibliometrics and then evaluating how they perform (Régibeau and Rockett 2016). We use a series of benchmarks including a simulated ‘department’ composed solely of Nobel Prize winners, a composite leading department made up of members of generally top-ranked US academic departments, and a ‘market forces’ department based on reputational rankings.

We find that our simple bibliometrics can distinguish between these benchmarks well, although some designs work better than others. For example, we find that finer grids of outlet rankings can better distinguish top quality, and that a larger number of outputs does better as well. Citations also improve the performance of the bibliometrics, especially when citations are allowed to accumulate over two assessment periods. Our paper also presents a ‘back of the envelope’ costing for our bibliometrics – they appear to be much more cost effective than peer review, while not necessarily being less accurate at the group level (e.g. the department level), at least for certain fields.
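The effect of grid fineness and citations can be illustrated with a toy calculation. The sketch below is not the metric used in Régibeau and Rockett (2016); the tier weights, the citation weight, the departments and the function `department_score` are all assumptions chosen only to show the mechanism. Each output is a hypothetical pair of journal tier (1 = best) and citation count, and a department's score is the mean over its members of the sum of their best outputs.

```python
def department_score(outputs, tier_weights, citation_weight=0.0, top_n=4):
    """Mean per-person score, counting each person's `top_n` best outputs.

    `outputs` maps a person to a list of (tier, citations) pairs.
    """
    person_scores = []
    for papers in outputs.values():
        scored = sorted(
            (tier_weights[tier] + citation_weight * cites
             for tier, cites in papers),
            reverse=True,
        )
        person_scores.append(sum(scored[:top_n]))
    return sum(person_scores) / len(person_scores)


# Hypothetical grids: the coarse grid only separates tiers 1-2 from 3-4,
# while the fine grid gives every tier its own weight.
coarse_grid = {1: 1, 2: 1, 3: 0, 4: 0}
fine_grid = {1: 4, 2: 3, 3: 2, 4: 1}

# Hypothetical departments: (journal tier, citations) per output.
benchmark = {"a1": [(1, 50), (1, 30)], "a2": [(1, 40), (2, 10)]}
comparator = {"b1": [(2, 5), (2, 8)], "b2": [(2, 3), (2, 6)]}

coarse_gap = (department_score(benchmark, coarse_grid)
              - department_score(comparator, coarse_grid))
fine_gap = (department_score(benchmark, fine_grid)
            - department_score(comparator, fine_grid))
cited_gap = (department_score(benchmark, fine_grid, citation_weight=0.1)
             - department_score(comparator, fine_grid, citation_weight=0.1))

print(coarse_gap, fine_gap, cited_gap)
```

In this made-up example the coarse grid cannot tell the two departments apart, the fine grid separates them, and adding a citation term widens the gap further. The `top_n` parameter plays the role of the number of outputs submitted per person.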

Overall, then, our results indicate that properly designed bibliometrics could work well to get a sufficiently accurate review whilst still constraining costs, at least for some fields. This is broadly supportive of the Stern Review's recommendation that bibliometrics be used to support peer review. At the same time, however, we present a number of other findings that suggest additional design elements for the research evaluation process.

One of these is that more outputs tend to generate a finer distinction of quality. The reduction to two outputs, floated in the Stern Review, does not do as good a job of distinguishing top-quality departments in our simulations. On the other hand, the Stern Review allows for flexibility in the number of submissions per person. Even so, when we repeat the exercise on our raw data with an average of two submissions per person, the benchmark populations are not well distinguished from the comparator departments.

At the department level – where individuals are grouped together – our rankings are not completely stable. In other words, even with no change in composition of the underlying group being reviewed, the rankings change over evaluation cycles with random variation in publication rhythm. This lack of stability suggests employing a citation indicator that reflects lagged performance, which would smooth the variability of the ranking. In earlier work, we also note that as one shrinks the (simulated) size of a unit of assessment, the ranking – unsurprisingly – gets less stable (Régibeau and Rockett 2015). At the individual level, there is considerable variation. As smaller departments should be expected to perform in a less stable manner, a university’s reaction to changes in the measured performance should be muted. For the same reason, research assessment results should not be relied upon to evaluate individual performance.
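The link between unit size and stability described above follows from simple averaging, and a short simulation shows it. The sketch below is not the simulation in Régibeau and Rockett (2015); it assumes, purely for illustration, that every member has the same underlying quality and that each person's measured output in a cycle is that quality plus independent normal noise. The function `score_spread` and all parameter values are hypothetical.

```python
import random
from statistics import pstdev


def score_spread(dept_size, cycles=2000, quality=10.0, noise_sd=3.0, seed=0):
    """Standard deviation of a department's mean score across cycles.

    Composition and underlying quality never change; only the random
    'publication rhythm' of each member varies from cycle to cycle.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(cycles):
        outputs = [rng.gauss(quality, noise_sd) for _ in range(dept_size)]
        scores.append(sum(outputs) / dept_size)
    return pstdev(scores)


spread_small = score_spread(dept_size=5)
spread_large = score_spread(dept_size=50)
print(f"spread, 5 members:  {spread_small:.2f}")
print(f"spread, 50 members: {spread_large:.2f}")
```

Under these assumptions the spread of the department score shrinks roughly with the square root of the number of members, so a unit of 5 fluctuates about three times as much as a unit of 50 even though nothing real has changed. With `dept_size=1` the same function gives the spread of an individual's score.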

Concluding remarks

Overall, there are reasons to support bibliometrics-based review beyond cost considerations. Even simple metrics can perform well at identifying quality for some fields, while providing cost-effective and transparent review. Peer review does not appear to be a guarantor of quality, although it may be needed to evaluate certain disciplines and certain elements of research evaluation that go beyond the academic quality of outputs. The number of outputs should not be cut too far if one wishes to identify quality. While such a reduction might save costs, some fields could achieve a larger saving by relying on bibliometrics. Citations appear to help with quality rankings, and could also serve to smooth the natural variance of rankings. This is especially true if they are allowed to accumulate over a longer period of time.