Pages

Wednesday, March 22, 2017

Academia is fucked-up. So why isn’t anyone doing something about it?

A week or so ago, a list of perverse incentives in academia made the rounds. It offers examples like “rewarding an increased number of citations” that – instead of encouraging work of high quality and impact – results in inflated citation lists, an academic tit-for-tat which has become standard practice. Likewise, rewarding a high number of publications doesn’t produce more good science, but merely finer slices of the same science.

It’s not like perverse incentives in academia are news. I wrote about this problem ten years ago, referring to it as the confusion of primary goals (good science) with secondary criteria (like, for example, the number of publications). I later learned that Steven Pinker made the same distinction for evolutionary goals, referring to it as ‘proximate’ vs ‘ultimate’ causes.

The difference can be illustrated in a simple diagram (see below). A primary goal is a local optimum in some fitness landscape – it’s where you want to go. A secondary criterion is the first approximation for the direction towards the local optimum. But once you’re on the way, higher-order corrections must be taken into account, otherwise the secondary criterion will miss the goal – often badly.
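The point can be made concrete in a small numerical sketch. The landscape and all numbers below are invented for illustration: a "secondary criterion" is modeled as the first-order direction toward the goal, fixed once at the start and never refined. Optimizing it at first improves the primary goal, then moves away from it.

```python
import math

# Toy "fitness landscape": the primary goal is the maximum at (3, 2).
# All numbers here are invented for illustration.
def fitness(x, y):
    return -((x - 3.0) ** 2 + (y - 2.0) ** 2)

# Secondary criterion: the first-order direction toward the goal,
# computed once at the starting point (0, 0) and never updated.
norm = math.hypot(3.0, 2.0)
dx, dy = 3.0 / norm, 2.0 / norm

# "Optimizing the criterion" means walking ever further along it:
# fitness first improves, then degrades once we pass the optimum.
for steps in (3, 6, 9):
    x, y = steps * dx, steps * dy
    print(steps, round(fitness(x, y), 2))
```

Refining the measure, in this picture, means recomputing the direction as you go instead of walking blindly along the initial approximation.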

The number of publications, to come back to this example, is a good first-order approximation. Publications demonstrate that a scientist is alive and working, is able to think up and finish research projects, and – provided the papers are published in peer-reviewed journals – that their research meets the quality standard of the field.

To second approximation, however, increasing the number of publications does not necessarily also lead to more good science. Two short papers don’t fit as much research as do two long ones. Thus, to second approximation we could take into account the length of papers. Then again, the length of a paper is only meaningful if it’s published in a journal that has a policy of cutting superfluous content. Hence, you have to further refine the measure. And so on.

This type of refinement isn’t specific to science. You can see in many other areas of our lives that, as time passes, the means to reach desired goals must be more carefully defined to make sure they still lead where we want to go.

Take sports as an example. As new technologies arise, the Olympic committee has added many additional criteria on which shoes or clothes athletes are permitted to wear and which drugs make for an unfair advantage, and it has had to rethink what distinguishes a man from a woman.

Or tax laws. The Bible left it at “When the crop comes in, give a fifth of it to Pharaoh.” Today we have books full of ifs and thens and whatnots so incomprehensible I suspect it’s no coincidence suicide rates peak during tax season.

It’s debatable of course whether current tax laws indeed serve a desirable goal, but I don’t want to stray into politics. Relevant here is only the trend: Collective human behavior is difficult to organize, and it’s normal that secondary criteria to reach primary goals must be refined as time passes.

The need to quantify academic success is a recent development. It’s a consequence of changes in our societies, of globalization, increased mobility and connectivity, and is driven by the increased total number of people in academic research.

Academia has reached a size where accountability is both important and increasingly difficult. Unless you work in a tiny subfield, you almost certainly don’t know everyone in your community and can’t read every single publication. At the same time, people are more mobile than ever, and applying for positions has never been easier.

This means academics need ways to judge colleagues and their work quickly and accurately. It’s not optional – it’s necessary. Our society changes, and academia has to change with it. It’s either adapt or die.

But what has been academics’ reaction to this challenge?

The most prevalent reaction I witness is nostalgia: The wish to return to the good old times. Back then, you know, when everyone on the committee had the time to actually read all the application documents and was familiar with all the applicants’ work anyway. Back then when nobody asked us to explain the impact of our work and when we didn’t have to come up with 5-year plans. Back then, when they recommended that pregnant women smoke.

Well, there’s no going back in time, and I’m glad the past has passed. I therefore have little patience for such romantic talk: It’s not going to happen, period. Good measures for scientific success are necessary – there’s no way around it.

Another common reaction is the claim that quality isn’t measurable – more romantic nonsense. Everything is measurable, at least in principle. In practice, many things are difficult to measure. That’s exactly why measures have to be improved constantly.

Then, inevitably, someone will bring up Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” But that is clearly wrong. Sorry, Goodhart. If you indeed optimize the measure, you get exactly what you asked for. The problem is that often the measure wasn’t what you wanted to begin with.

Using the terminology introduced above, Goodhart’s Law can be reformulated as: “When people optimize a secondary criterion, they will eventually reach a point where further optimization diverts from the main goal.” But our reaction to this should be to improve the measure, not throw in the towel and complain “It’s not possible.”

This stubborn denial of reality, however, has an unfortunate consequence: Academia has gotten stuck with the simple-but-bad secondary criteria that are currently in use: number of publications, the infamous h-index, the journal impact factor, renowned co-authors, positions held at prestigious places, and so on.

We all know they’re bad measures. But we use them anyway because we simply don’t have anything better. If your director/dean/head/board is asked to demonstrate how great your place is, they’ll fall back on the familiar number of publications, and as a bonus point out who has recently published in Nature. I’ve seen it happen. I just had to fill in a form for the institute’s board in which I was asked for my h-index and my paper count.

Last week, someone asked me if I’d changed my mind in the ten years since I wrote about this problem first. Needless to say, I still think bad measures are bad for science. But I think that I was very, very naïve to believe just drawing attention to the problem would make any difference. Did I really think that scientists would see the risk to their discipline and do something about it? Apparently that’s exactly what I did believe.

Of course nothing like this happened. And it’s not just because I’m a nobody who nobody’s listening to. Concerns similar to mine have been raised with increasing frequency by more widely known people in more popular outlets, like Nature and Wired. But nothing’s changed.

The biggest obstacle to progress is that academics don’t want to admit the problem is of their own making. Instead, they blame others: policy makers, university administrators, funding agencies. But these merely use measures that academics themselves are using.

The result has been lots of talk and little action. But what we really need is a practical solution. And of course I have one on offer: open-source software that allows every researcher to customize their own measure for what they think is “good science,” based on the available data. That would include the number of publications and their citations. But there is much more information in the data that currently isn’t used.

You might want to know whether someone’s research connects areas that are only loosely connected. Or how many single-authored papers they have. You might want to know how well their keyword-cloud overlaps with that of your institute. You might want to develop a measure for how “deep” and “broad” someone’s research is – two terms that are often used in recommendation letters but that are extremely vague.
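To give a rough sense of what such a customizable measure could look like, here is a minimal sketch. Every field name, weight, and number below is hypothetical, not an existing data schema: the point is only that two evaluators can weight the same underlying data differently.

```python
# Minimal sketch of a customizable researcher metric.
# All field names, weights, and values are hypothetical.
def custom_score(record, weights):
    """Combine per-researcher indicators into one number."""
    return sum(weights[k] * record.get(k, 0.0) for k in weights)

# One (invented) researcher record.
record = {
    "papers": 40,                 # publication count
    "citations_per_paper": 12.0,
    "single_author_frac": 0.25,   # fraction of single-authored papers
    "field_overlap": 0.5,         # keyword overlap with your institute
}

# Two evaluators, two notions of "good science".
hiring_committee = {"citations_per_paper": 1.0, "field_overlap": 10.0}
lone_wolf_fan = {"single_author_frac": 20.0, "papers": 0.25}

print(custom_score(record, hiring_committee))  # 17.0
print(custom_score(record, lone_wolf_fan))     # 15.0
```

The same record, ranked two ways: exactly the kind of local variety a one-size-fits-all index cannot provide.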

Such individualized measures wouldn’t only automatically update as people revise criteria, but they would also counteract the streamlining of global research and encourage local variety.

Why isn’t this happening? Well, besides me there’s no one to do it. And I have given up trying to get funding for interdisciplinary research. The inevitable response I get is that I’m not qualified. Of course it’s correct – I’m not qualified to code and design a user-interface. But I’m totally qualified to hire some people and kick their asses. Trust me, I have experience kicking ass. Price tag to save academia: An estimated 2 million Euro for 5 years.

What else has changed in the last ten years? I’ve found out that it’s possible to get paid for writing. My freelance work has been going well. The main obstacle I’ve faced is lack of time, not lack of opportunity. And so, when I look at academia now, I do it with one leg outside. What I see is that academia needs me more than I need academia.

The current incentives are extremely inefficient and waste a lot of money. But nothing is going to change until we admit that solving the problem is our own responsibility.

Maybe, when I write about this again, ten years from now, I’ll not refer to academics as “us” but as “they.”

61 comments:

Very interesting. Too many papers are being published, and I read somewhere that 50% of papers are never read. How about restricting people? You're only allowed to write one paper a year, so you'd better make sure it's good, because you get judged on its impact. That would prioritise quality over quantity.

Pinker didn't introduce the idea of proximate vs ultimate causation in biology, or even those terms. The idea goes back at least to Darwin, and the terms go back at least to Tinbergen, and I think before that.

Thank you, brilliant and insightful as always. Your solution only addresses the volume question though, not how to improve the judgement of the quality of the "finer slice" or judge the quality of something novel. You would be improving the quality of the silos and that's a good thing to do but it isn't the whole problem...

There is a statistical methodology well known in sports, top military units, and portions of business. Define the objectives, identify those who have achieved the objectives, and study how this correlates with other things you can measure. This is the theme of the popular book and movie Moneyball, and of much of the work of economics Nobelist Daniel Kahneman and Amos Tversky.

Hire weird (autism!), fund early; Darwin. Genius is ugly, then redefining beauty (Google hiring practices). "Deviation from professional trajectory" is obscene. To dissolve utterly insoluble in everything copper phthalocyanine pigment in Plexiglas, use nickel superalloy metallurgy. That is insane! It creates a 30 wt-% masterbatch dye. “This is not the solution we were seeking.” To criticize is to volunteer.

Great post. I suspect everyone thinks about it every once in a while. But most people, myself included, think someone else should solve it. Great to see a volunteer.

The problem is not the coding. If you have a decent idea, coders will flock around you. I think you could get it worked out in an Open Source project.

So what is the idea?

A kind of Google PageRank-like system? Given a profile (search term), come up with a ranked list of candidates (pages)? Or a network analysis where you look for scientists who are hubs? Or a facebook/twitter-like "follow"/"friend" system? (Already done by ResearchGate and LinkedIn.)

All these ideas could be implemented using standard bibliometric data, if you can get at the data (Elsevier is not the most forthcoming with theirs). But they all have their downsides.

I would think that a lot of academics have developed their own methods for relating publication statistics to quality of research, if only in self-defense when asked to submit such documentation. Admittedly, I think that because I did it for myself, and have to assume I'm not super clever.

But one of the factors that tended to limit me was the effort and time involved in getting the numbers and putting them into a framework that could be analysed and updated without having to be redone each time. (In my case, I've settled on Excel and a spreadsheet format that supports some of what I want, but not all.)

Furthermore, the more sources of statistics that become available (and here I'm thinking mostly of citation statistics), the more uncertainty there is in the reports. One has to spend a lot of time working the individual numbers to make sure that you end up with something consistent and reliable.

And then there's the problem of convincing the people you deliver the report to as to why your method is appropriate - especially when it involves even more onerous application. (For example, I think you want to compare statistics with 'equivalent' groups as baselines or controls, which means doing a lot more work.)

And, of course, there's always Eugene Garfield in the back of the mind, repeating (sepulchrally) the mantra: "Never use citation statistics to study anything smaller than a large group. Certainly, never for assessing individual performance."

So, you piqued my curiosity with your reference to your home-grown methodology. Do you have a reference to where I can get information about it?

Excellent analysis! Let me recommend something that has to do with your last words in this post. It is an interview that I made for one of my blogs. You'll find the original text in English at the same post in the link below. https://www.producoeseclipse.com/sociedade/2017/2/25/o-fim-da-escola-entrevista-com-zak-slayback

Interesting read, thanks for writing about this very important problem.

It would be great if there were some standard format for expressing the scientometrically relevant metadata of papers (coauthors, citation graph edges, keywords, institutions, date, etc.), plus a repository where anyone could easily query this data and download the result (or one repository for each field or journal or whatever), plus a library of functions for performing common tasks with the data. Then anyone with basic python skills could test out whatever metric they came up with. This would be more flexible than a single monolithic application for computing metrics, and would probably be simpler to implement as well.
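A minimal sketch of the kind of record format and helper function imagined here. The schema, the paper IDs, and the helper are all hypothetical, but they show how little plumbing a common-task library would need once the metadata exists in a standard form.

```python
# Hypothetical standard metadata records: ids, authorship,
# citation-graph edges, and keywords. All data is invented.
papers = [
    {"id": "p1", "authors": ["A", "B"], "cites": ["p2"],       "keywords": ["qg"]},
    {"id": "p2", "authors": ["B"],      "cites": [],           "keywords": ["qg", "cosmo"]},
    {"id": "p3", "authors": ["C"],      "cites": ["p1", "p2"], "keywords": ["cosmo"]},
]

def citation_counts(papers):
    """Count incoming citations by walking the 'cites' edges."""
    counts = {p["id"]: 0 for p in papers}
    for p in papers:
        for target in p["cites"]:
            counts[target] += 1
    return counts

print(citation_counts(papers))  # {'p1': 1, 'p2': 2, 'p3': 0}
```

Any other metric (co-authorship hubs, keyword overlap, and so on) would be a similarly short function over the same records.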

There are some existing publicly available datasets with this sort of information, but as far as I can tell, what exists seems to be pretty sparse and lacking. One of the bigger ones would be the OpenCitations project, although I think they just have articles from the open-access subset of PubMed Central, and within this, they only have the citations of about 120,000 different papers. I hope this sort of thing expands and spreads to other fields; it could be the same sort of boon for scientometrics that datasets like MNIST and ImageNet have been for machine learning.

Wow. Passionately and cogently written. Glad I’m not ensnared in it. Good luck in your efforts; I want science to continue to make progress.

You mean science administrators and scientists make sausage just like legislators?

Goodhart’s law sounds like a version of Joseph Heller’s Catch-22: if you don’t measure a process, it will go to hell. If you measure it, the measurement will lead to corrupt game playing and the process will go to hell. Catch-22 is a classic on how an organization’s goals can splinter. 1,000 shares of Milo Minderbinder Enterprises for your silk parachute that you need to get out of this airplane alive.

Along with the universal law of gravitation and the principle of least action, we must add the Peter Principle and the Dilbert principle. (Of course science is not exempt; it is a human enterprise.)

How do you compute h-indices for people who are part of the CMS or ATLAS collaborations (3,000 members apiece)?

Firstly, thank you for not equating academia with science. I've been in the sciences for 4 decades and have rarely rubbed shoulders with academia. As has been tossed out at conferences... the fundamental properties of physics may be finalized in California, but it won't be in the physics department at Berkeley but at the research labs at Microsoft.

Also, 'academia' is not just the USA but rather the world. In my own field, China leads research and the whole hierarchy is geared towards its own concept of status, reward, etc. and doesn't equate with what we would recognize in the West.

Anyways, there is a whole lot of 'stuff' going on in science that doesn't involve the dysfunctional culture of current American academia.

I don't agree with your proposal at all. First of all, the elaboration of new metrics is completely arbitrary. Is a good researcher someone who publishes more? Or someone with impact (citations, papers in prestigious journals, long careers, etc.)? Maybe someone who has a good social network? The root problem with these metrics is that there is no unambiguous and universal definition of scientific success. This is in sharp contrast to your simplified discussion of the optimization problem. There is no unambiguous or well-defined function to optimize or approximate. It is thus pure folly to try to predict scientific discoveries (the real goal of a research institution) with the use of metrics. Hence all proposals are flawed and will favour some people for no real reason. (On the same note, do you think it is a coincidence that the new metric for journal success, CiteScore, proposed by Elsevier favours their journals and is detrimental to Nature and Science journals? A nice analysis of CiteScore: http://eigenfactor.org/projects/posts/citescore.php)

All the new "improved metrics" will only increase the size of the already crowded list of metrics. This reminds me of the excellent XKCD on standards (https://xkcd.com/927/), and this situation is doomed to happen with your proposal.

At best, you are proposing a system in which the metric can easily (and often) be changed. In this way, people cannot game the results too much. This is analogous to the role of sortition in diminishing corruption in the early Greek democracies. Unfortunately, in this case, achieving any kind of consensus on which new metric to adopt is going to be difficult and prone to controversy (since it could even, in principle, lead to a new kind of hidden bias).

I once toyed with the idea of using machine learning algorithms to help select candidates (this can be more general than science). In this way, you don't need to design and preprogram some metrics; let the AI find its own recipe for selecting the candidates who offer the highest probability of success in their fields. But this also has drawbacks. Firstly, you would need a big database of candidates with an evaluation of their success (this is hard to do because we need much information spanning years). After all, the mantra in machine learning right now is “a good dataset is better than a good algorithm”. Second, we then need to trust the choice of the machines. I like reminding people that right now, most AIs are like autistic savants. They can arrive at solutions but cannot explain their process (or exact logical steps). So, even with machine learning, I think there should be humans involved in the selection process.

Yes, I know the paper; I've written about this various times. Yes, it's a similar problem in the life sciences as in my area, though each discipline sees its own variants of the issue. E.g., in areas where p-values play a large role you end up with people who are skilled at finding or creating correlations where there are none. In areas like mine, you end up with people who are skilled at producing basically unfalsifiable models that will feed them for a lifetime. Neither is 'good science' but it's amply rewarded.

It's painful, to some extent, how obvious the problem is when you look at it from the outside. When you're part of the game, that's just normal. I'm feeling a little bipolar switching between the two perspectives. Best,

Frankly I get the impression you haven't thought much about neither the problem nor my proposal.

It's correct that there's no universal definition for 'good science' - this is exactly why I am saying we need a metric for scientists that everyone can customize individually.

You seem to assume that I want everyone to converge on one metric in the end. This isn't so. I merely want to give people an interface that makes it easier for them to find out who or what research fits their bill. (We can then talk about how to address known social and cognitive biases.)

The problem with the existing metrics is that they aren't created to help academics in their work-life, they are created to help administrators. This isn't what we need, and not what anybody wants, hence nobody's happy.

This is not a profitable project. Indeed I think that any attempt to make it profitable would inevitably backfire and compromise the very purpose. And nobody's going to 'flock' around me without money.

Much of it isn't interesting work, I'm afraid; it's basically (as Tom anticipates in his comment) aggregating very different but already existing databases and devising ways to read them out in various forms, the former being the tedious part, the latter the interesting part. It isn't only that it will be difficult to combine the different types of data; it would also require quite some negotiation, because much of the citation and co-authorship stats will need some agreement from publishers to work with.

But yes, once the basis is there, I'd be optimistic that volunteers would be interested to see what can be done with it and essentially work out some templates for metrics that others can use without putting in all too much effort. It should be of some academic interest too.

In any case, if I'd seen any way to get it done without funding, I'd have done it. 2 million Euro is the typical ERC 5 year grant. It's basically what you need to pay 3 people for 5 years and that's what it'll take. Unfortunately it's way above the funding you can even apply for at most philanthropic institutions, and the ERC will just throw me out in the pre-selection for lack of qualification, meaning it isn't even worth the effort of handing in an application.

Yes, that's pretty close to what I have in mind. Except that you'll have to get it to a level where no coding skills are required from most users. Basically, the goal would be that you could upload a list of names (think of a list of applicants), and in return get their scores on the various indicators in your metric. You'd use it to make a pre-selection of whose applications to look at more closely.

I believe that this would remedy various problems very quickly, for example that presently academia is very dominated by who-knows-whom. The reason is rather simple: if someone recognizes your name, they're more likely to read your stuff and pay attention to your application, so it's a rich-get-richer game. A data-based approach might bring up people you'd never have known of otherwise.

I'm not sure why many people seem to think of using metrics as something bad. I think they have a great potential to even the playing field and to reduce bias. Best,

Trying hard, but I have to remind myself not to conflate academia with science. You're right of course that there are national differences. But the differences are more in the external organization of academia than in the internal one. E.g., some national funding agencies try to limit how much reviewers pay attention to long publication lists by allowing applicants to list only a limited number of papers. Well meant, but of course entirely futile as long as reviewers themselves look up publication lists to form an opinion (if they don't know the applicant to begin with).

bee: when I first entered academia (at a major research university) over 40 years ago, citations to one's work were a major determinant in deciding whether to give a person tenure. This was eventually dropped since it was misleading and inhibited collaboration between young unestablished untenured professors and older, more established tenured professors. Unfortunately, it was replaced by counting the number of a person's research grants. The administration claimed that this criterion was a way to judge how a person's work was viewed by his colleagues, but this is of course a sham - the real purpose is to pressure a young researcher to bring in money to the university so that the university can 'get a cut' of the grant money as 'overhead expenses' (in places like Las Vegas, this practice is known as 'skimming'). The story goes that when asked to review an individual who was up for promotion, Feynman declined, saying that surely the individual's colleagues at the same university should be far more qualified to judge the merits of the individual, since they interacted with him on a daily basis. The fact is that university administrators always claim they are judging individuals based on their academic achievements, but this is rarely true; their criteria are far more materialistic (just as teaching evaluations are not used to improve teaching but rather to provide a weapon for administrators to get rid of people who do not contribute adequately to the university's bottom line - its coffers). The entire promotion process at universities in the US is fundamentally corrupt and not intended to promote either research or educational excellence.

I've toyed with the idea, but in the end I don't think it makes sense. Everybody thinks and concludes at their own pace. What we really need is to make best use of each person's individual skills and allow for more variety, rather than expecting everyone to do the same - publish often, or publish once a year, both will force some people to work inefficiently.

Where's all your citations, Sabine? :-) Only joking -- they (citations) always seem to conspire to lose audience or make you look elitist, and yet they're essential when making an argument referring to prior work.

I found the putative refutation of Goodhart's law to be so poorly expressed that I couldn't understand it. "Poorly" as opposed to "well" expressed. (I realise English is likely not your native language).

Can I quantify the deficiencies in expression? Well, I suppose I could get 100 native English speakers with sufficient relevant expertise (knowledge of Goodhart's law) to read it and see how many understand. That would require a lot of human energy (and therefore money) finding the qualified people and getting them to do the work.

But can I quantify these deficiencies--so as to be able to issue a judgement that Sabine's expression is, say, 10% worse than Joe's-- without anyone having to read the passages in question? No, this cannot be done.

The problem with all the quantifiable proxies is that, as Goodhart's law rightly says, they can and will be gamed. And, more to the point, they are an attempt to write the necessity for human judgement out of "equations" that have everything to do with human judgement. Which is why human qualitative judgement is absolutely necessary and every attempt to spare those doing the assessing of the hard tasks of becoming expert in the field and reading the research in question is doomed to fail.

So count me among the nostalgics who are not likely to find your argument against them (viz. "Well, there’s no going back in time, and I’m glad the past has passed.") to be particularly persuasive. In order to be what they are supposed to be, academic institutions require that expert human judgement be exercised. And if we can no longer afford it, then we can no longer afford properly academic institutions and are instead replacing them with something else. Not to see this as a decline is simply perverse.

Let me see there... You don't understand what I am explaining but rather than asking a question, you blame me for supposedly bad writing? Like, clearly it's my fault when you don't get what I lay out? All I said is that any measure is a good measure - for itself. Problems arise only if what you wanted to optimize isn't what you measure. So it comes down to finding the right measure. Is that so hard to grasp?

Human judgement is subject to a great many biases, starting with communal reinforcement ahead of all. If anything perverts academia, it's people who don't realize the limits of their own objectivity.

"An open-source software that allows every researcher to customize their own measure for what they think is “good science” based on the available data." This idea is obviously interesting, in principle. Nevertheless, it depends on a very fundamental assumption that is not necessarily true: "that people are honest". The paper written by Edwards and Roy seems to me a clear demonstration that people do anything to survive. Since the culture of survival (at any cost) is strong enough to contaminate universities, I seriously doubt that your proposal could work. People always find ways to survive, including alliances with other people who want to survive too. And since universities do not survive without public opinion, I believe that our civilization has reached its peak and is now just decaying. That's normal, Sabine. Even you work and think in terms of survival. You wrote "Maybe, when I write about this again, ten years from now, I’ll not refer to academics as “us” but as “they.”" In other words, you are tired of Academia, and you feel that maybe you should follow another path, for your own sake. This need to survive at any cost is human nature. And this human nature is our doom. You are like this, I am like this, almost everybody is like this. Who knows? Maybe I'm wrong. If that is the case, then your post promises a major contribution to humanity.

The problem you wrote of is embedded in modern society as a whole, which is; with the advent of computers, massive volumes of data, and spreadsheets we’ve grown fixated on quantifying data and using those statistics as a primary measure of success or justification for an action. This is true of government, corporate, media, and unfortunately academia.

Your illustrated diagram makes an excellent point. As a mathematician you know better than most how easy and often statistics are compiled in a way that gives a very misleading impression or result; another reminder that even something as precise as mathematics can be misleading if misused or not fully understood. I applaud your essay and truly hope more join you in speaking out about the danger of this, not just in academia but also in society overall; unfortunately I think the practice is here to stay.

@adonai: "Nevertheless, it depends on a very fundamental assumption that is not necessarily true: 'that people are honest'."

The theory behind such systems would be "honest signaling". In simple terms, you must show you are good by doing something that you can only do when you are actually good.

The citation index is such a signaling system: To obtain a lot of citations, you must publish "good" papers that people want to cite. But we see it can be subverted by the coordinated action of groups of scientists.

Cryptographers have designed proof of work systems for just such situations.

"Frankly I get the impression you haven't thought much about neither the problem nor my proposal."

And with this small sentence, you completely shut down the dialogue on my counter-arguments. But I still strongly disagree, and I can assure you I have thought a lot about both this problem and your proposal (which is not quite original).

"The problem with the existing metrics is that they aren't created to help academics in their work-life, they are created to help administrators. This isn't what we need, and not what anybody wants, hence nobody's happy."

This is false. Most metrics originate from academics (scientists) who try to rank their colleagues or their own work, not just for administrative purposes. Just look at the literature on the subject. Most of them derive from the h-index, which was suggested in 2005 by Jorge E. Hirsch, a physicist at UCSD, as a tool for determining theoretical physicists' relative quality. REF: https://arxiv.org/pdf/physics/0508025.pdf

They have been hijacked by administrators as a short-cut to quickly judge CVs. Now most administrations give them an inflated importance they were never intended to have. That is the real problem.

"It's correct that there's no universal definition for 'good science' - this is exactly why I am saying we need a metric for scientists that everyone can customize individually."

Here again, I disagree with the intent (promoting more metrics), and frankly I don't see how this even addresses the problem academia is facing. Institutions and/or academics could in principle create their own metrics and grow overly attached to them even if there is no real reason to do so. As you clearly state yourself, "good science" does not have a universal definition. But as I pointed out in my original comment, the problem is deeper than that, because there is no real quantitative predictor of future "good science". The whole exercise is folly.

On another note, from your blog and comments, I really don't think you understand Goodhart's law. Even if what you measure is what you want to optimize, you will have problems. Let's use a hockey analogy (which is actually real history). Teams wanted to quantify the value of players for recruiting purposes. At first, they decided to keep a tally of the number of goals scored by each player. This incentive pushed players to become more aggressive in their control of the puck. Everybody wanted to score a goal. Many individuals increased their scores. But overall teamwork decreased, leading to fewer total goals scored by each team. The quality of play was badly degraded. This feedback of the measure on the system is the essence of Goodhart's law: any time you choose a measure to drive an incentive, that measure stops reflecting the underlying reality of the system (even if it previously did).

Later, the hockey leagues changed their metrics to better capture teamwork, crediting all players on the ice when their team scores (and debiting them when the opposing team does; you can read up on this). Sometimes I think that since science (or academia) is also a team sport, the metrics should not be focused on individuals. But things are really not that simple in academia. But now I am ranting...

"Which is why human qualitative judgement is absolutely necessary and every attempt to spare those doing the assessing of the hard tasks of becoming expert in the field and reading the research in question is doomed to fail."

I understand your anxiety about taking human qualitative judgement out of the picture, but quantifiable metrics don't necessarily exclude the use of human judgement. Say you are a person "doing the assessing", and you ask some number of people who had read and understood the research in question to rank its significance. Then you could get a quantifiable metric by averaging those ranks. Is this any worse than reading and understanding the work yourself, and making a decision based on that?
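The averaging scheme described above is easy to make concrete. This is a minimal sketch of my own; the paper names and rank values are purely illustrative:

```python
from statistics import mean

def aggregate_significance(ratings: dict[str, list[int]]) -> dict[str, float]:
    """Average the significance ranks that expert readers assigned to each paper."""
    return {paper: mean(ranks) for paper, ranks in ratings.items()}

# Hypothetical data: each paper mapped to ranks (1-10) from readers
# who actually read and understood it.
ratings = {
    "paper-A": [8, 9, 7],
    "paper-B": [4, 5, 6],
}
scores = aggregate_significance(ratings)
# paper-A averages 8, paper-B averages 5
```

The resulting number is quantifiable and comparable, yet every input to it is a human qualitative judgement, which is the point being made above.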

A (kernel of a) thought about how we might achieve metrics that are both scalable and incorporate an appropriate degree of human qualitative judgement: every time a scientist reads a paper and forms an opinion on it, that is potentially valuable information that could be used for things like recommending papers for others to read, or for hiring, funding, or admissions decisions. Today, only a tiny fraction of this information is used for these purposes. In principle, a large fraction of it could be stored and accessed digitally, and incorporated into metrics. Obviously this entails many practical challenges, but I think it's worth thinking about as a long-term goal.

By the way, if you want samples for your proposed experiment: I understood the passage about Goodhart's Law just fine.

I understand the sentiment, but I simply don't have the time to assemble literature references for blogposts.

Sabine, I don't want to make this thread much longer than it already is -- you've really touched a nerve with this post -- but I did want to say that I sympathise with your views.

The citation count is simply a metric whose measurement can be automated, while missing what it's actually measuring. In other words, the decision to use it was probably a "tick in the box" justifying someone's employment.

Although I'm out of academia, I recognise a similarity with institutions being judged on how many papers they publish; this lets them cut people who don't publish enough, and even recruit people from other countries simply because they have a good publishing record -- the papers might be of no research value at all, but counting them is easy.

Maybe a very vague analogy can be found in the music business where people are expected to produce albums regularly, irrespective of whether they have any artistic value.

So how does the Internet manage online publications, such as blog posts? Well, they have the ubiquitous 'Like' button, and that requires people to read the material and evaluate it on that basis. Still easy but at least it has some merit.

This is interesting because I know of two people who, dissatisfied with existing tools/methods, arrived at practical solutions now widely used in science: Donald Knuth and Stephen Wolfram. They gradually changed their minds about the subject during the implementation process, because it required a greater effort than they had originally planned.

Which "counterargument" have I supposedly ignored? You're merely making statements without any argument whatsoever. Take your first sentence "the elaboration of new metrics is completely arbitrary". You could level the same criticism at the economic system. "the price of a good is completely arbitrary". Yeah, that's right. You use the system itself to find the optimum. In principle it doesn't even matter which direction the first approximation goes, as long as you take into account the feedback.

You are probably right that most metrics originate from scientists, because the development of metrics has become a field of science. Not that this bears any relevance for what I explained above: Most of these metrics aren't of practical use. Yes, the h-index is - as I said, we're stuck with the least bad option. I don't know why you think this is an argument in your favor.

Neither do I know why you quote the eigenfactor (and loads of similar measures), which arguably are not devised by scientists for scientists on any level.

I don't know what your example of Goodhart's law is supposed to teach me. You say that individuals did increase their scores, so it seems to have worked. You complain that this degraded team play. Well, that wasn't what you were trying to optimize; hence the problem. Having said this, it's an example of a collective action problem. Of course, if you don't correctly take into account backreaction, you'll not be able to do a good optimization.

In any case, I agree that academia is a team sport to some extent and that the present measures aren't taking this into account. Best,

However, I don't think it will take 5 years to implement it (more like 1 or 2 with a small team). With modern technologies you can create a working prototype in a relatively short period of time. Most of the work is probably keeping up to date with all the external data sources.

If you start the project (say on github), I'm willing to make some contributions to it.

"number of publications, the infamous h-index, the journal impact factor, renown co-authors, positions held at prestigious places, and so on"

Number of publications: bad, because perhaps someone writes many worthless papers.

Number of citations (not mentioned, but common): three problems: 1) it is difficult to assess the contribution to a multi-author paper (it is rare to include a percentage in the author list, though I have seen this a few times), 2) not just the quality of the paper but also other things determine how often a paper is cited, 3) for the same number of citations it is probably better to have them distributed over more papers.

The "infamous" h-index tries to take this into account. (You have an h index of h if you have h papers with at least h citations.) This is better than just counting papers or just counting citations. Most of the criticism of it is criticism of bibliometry in general. It can be made even better:

One way is the g index: you have a g index of g if you have g papers which have on average g citations (or, equivalently, have a total of g*g citations). Why is this better? Someone with 30 papers cited 30 times each will have an h index of 30. So will someone with 50 papers, 30 of which have been cited 1000 times each and the other 20 have been cited 25 times each. In other words, the h index is a good idea, but moves too far in the other direction.

Another way (and of course one needs both) is to divide the number of citations by the number of authors. So 15 citations on a single-author paper count as much as 225 citations on a 15-author paper.

This normalized g index makes counting papers and citations better, but still suffers from the other disadvantages of bibliometry, some of which are mentioned above. Also, consider that the criteria necessary for moving from a mention in the acknowledgements to co-authorship vary widely.
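To pin down the two definitions, here is a short sketch (my own code, not from the comment) computing both indices on the example numbers given above:

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    cited = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(cited) if c >= i + 1)

def g_index(citations: list[int]) -> int:
    """Largest g such that the top g papers have at least g*g citations in total."""
    cited = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(cited):
        total += c
        if total >= (i + 1) ** 2:
            g = i + 1
    return g

# The example above: both researchers have h = 30, but g tells them apart.
steady = [30] * 30                # 30 papers with 30 citations each
star = [1000] * 30 + [25] * 20    # 30 papers at 1000 citations, 20 at 25
assert h_index(steady) == h_index(star) == 30
assert g_index(steady) == 30
assert g_index(star) == 50        # capped by the number of papers

# Author normalization: 225 citations shared by 15 authors count as 15.
assert 225 / 15 == 15
```

Note this variant of the g-index caps g at the number of papers, which matches the numbers in the example.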

The journal impact factor is bullshit. It's like believing that one should make the basketball team even though one is a midget just because one's neighbours were tall. The journal impact factor is the average number of citations per paper. However, it has been shown that journals with high impact factors have a highly skewed distribution: a few papers are really highly cited, but most are not (perhaps even less than the average in low-impact journals).

Renowned co-authors? This is probably more due to luck than anything else.

Positions held at prestigious places: Often this gets one on the short list. However, it should in a sense do the opposite: increase the threshold on other criteria for getting hired. Why? Someone with a prestigious position has more possibilities, so it is OK to expect more than of someone of equal quality in a less prestigious position. Yes, he might have got the prestigious position because of quality. However, even if he did, he might have burned out by now. Or he might have got it for some other reason. If this is accepted, then one just has to land one prestigious position early on, then on the basis of this get the next one, and so on, without doing high-quality work (again, one should demand more of someone during his time at a prestigious position).

Hard data shows that h-index is correlated extremely strongly with the square root of the number of papers a person has published. So it's not much better than counting papers. This was written up in the Notices of the American Mathematical Society.

Some while ago I wrote a timeplan on this and a budget estimate and so on. You're right that it won't take 5 years to come up with a prototype. I estimated 3 years for a user-friendly version and added 2 years for which staff will be needed for the inevitably necessary fixes and updates. It's of little use to just throw something at people and then say, now good-bye and good luck.

But thanks for your interest. Can you send me an email to hossi[at]fias.uni-frankfurt.de?

@Philip Helbig: "So 15 citations on a single-author paper count as much as 225 citations on a 15-author paper."

So being part of the team that found gravitational waves or the Higgs boson will earn you fewer points than writing a single-author paper in PLoS One that you cite yourself in two other PLoS One papers.

All these new measures are based on the idea of the lone scientific genius. That worked for Newton and Einstein, but not for CERN or the Human Genome Project. And even in Einstein's case, we can argue whether the people who helped him were sufficiently acknowledged in the public/scientific arena at the time.

Even Watson and Crick based their discovery on other people's data who were "inadequately" acknowledged at the time.

Science is a social endeavor. Scientists should be rewarded points for collaborative progress.

Sabine,I an but... an engineer-wannabe-physicist, but I enjoy your posts immensely. Work is more quantifiable as an engineer I think. My take is that your quandry is another result/reaction to a very closed circle of "experts" taking us we know not where. At one time I thought there was "truth" and science was in pursuit of it. Now I realize it is often political [the corruption of much of "climate science/academia" by gov money strikes me as outrageous], and there is no such thing as "truth." Indeed, physics looks like magic [religion?] the more I see of it and its explanations :-) I still feel we are only a lifetime into the future, and the wondrous science and engineering is yet/never to be revealed. I am an optimist in that respect. But I really wonder how we are to get there... The future looks very unfree with closed circles of experts deciding things. There must [have to be...] changes in the scheme of things to get there. Go for it. Just an engineer thinking above his station. Or maybe a "non expert" point of view. John

The present is always a sad time for it always comes too late. The great spirits of the 19th century mostly had an unhappy life to find that out:
http://www.facstaff.bucknell.edu/tcassidy/Abel.pdf
Some tried to be wiser, perhaps out of noticing this accumulation of mishaps or out of their own inherited sense of dignity:
https://en.wikipedia.org/wiki/Alexander_Grothendieck#Retirement_into_reclusion_and_death
https://en.wikipedia.org/wiki/Grigori_Perelman#Possible_withdrawal_from_mathematics
As Col. Kurtz magnificently put it once, "It is judgement that defeats us"; nature, though, wisely does not need any such...
https://www.youtube.com/watch?v=70D_oYAMhsE

@Philip Helbig: "So 15 citations on a single-author paper count as much as 225 citations on a 15-author paper."

So being part of the team that found gravitational waves or the Higgs boson will earn you fewer points than writing a single-author paper in PLoS One that you cite yourself in two other PLoS One papers.

Don't be silly. Don't compare apples and oranges. Obviously, if you are counting citations at all, they have to be in comparable journals. Also, correct for self-citations if you think that they are bogus. The fact that you need to caricature my position shows how weak yours is. As I've indicated above, I'm sceptical about bibliometry in general. It's almost as bad as saying "I like music which is in the charts" because it is in the charts. However, all else being equal, yes, 15 citations on a single-author paper is worth more than 225 citations on a 15-author paper. If you don't believe me, accomplish both, and let us know which was more work for you. (Of course, the gravitational-wave and Higgs-boson papers have more than 15 authors and more than 225 citations. But do the math. Of course, some of the authors did more than others, but this is difficult for outsiders to judge. Publication order might indicate a ranking (but not a percentage of work done), might be random, might be alphabetical. In some fields, the senior author is usually listed last.)

All these new measures are based on the idea of the lone scientific genius. That worked for Newton and Einstein, but not for CERN or the Human Genome Project.

The problem is that now the pendulum is swinging too far in the other direction, disadvantaging those who don't work in huge groups. They still exist, even though they might not be able to afford a PR department like the big collaborations can.

And even in Einstein's case, we can argue whether the people who helped him were sufficiently acknowledged in the public/scientific arena at the time.

Einstein himself certainly acknowledged those who helped him. Hopefully you won't claim that his first wife made a significant contribution to SR.

Even Watson and Crick based their discovery on other people's data who were "inadequately" acknowledged at the time.

To some extent, perhaps, but the case is overblown. Also, keep in mind that Nobel Prizes are not awarded posthumously.

Science is a social endeavor. Scientists should be rewarded points for collaborative progress.

Why? Surely it should be the results which matter. If social endeavors lead to good results, the results are their own reward. If not, then why should the scientific community reward them? And there is some science done by people working alone, for whatever reason. Why go out of your way to penalize them?

The problem afflicts only theoretical physics. Applied physics, including CERN and LIGO, and materials and optics science, are in pretty good shape. In those disciplines it's clear when people are doing it right: their apparatus works. But it's impossible to know if string theory (for instance) is being done right since it has no contact with the real world (i.e., experiment). The solution is obvious. Stop funding pie-in-the-sky research. This will have the unexpected benefit of finally allowing advances to be made - by amateurs. Smart unconventional thinkers will be heard, when the establishment is no longer there to drown their voices out.

This solution is unacceptable, of course, to theoretical physicists: they'll lose all that money (and attention). Plus, the straitjacket of your education conditions you to think inside the box. Fortunately this de-funding will happen naturally in the next decades. A new generation of leaders (like Trump) isn't going to continue wasting money and brainpower on stuff like string theory and LQG, when roads and bridges are failing, people are starving, and wars are looming due to overpopulation. They'll have to deal with real-world problems.

Meanwhile, you believe (incorrectly) that AI will create artificial conscious brains. For the sake of argument, let's suppose you're right. Such brains would achieve very high IQs - 400, 1000 - within 100 years. Then, if the solution to the puzzle really requires massive IQ (BTW, it doesn't, a mere 160 is enough), they'll do it. It doesn't matter whether we figure out the "secrets of the Old One" now, or 100 years from now, since it has no practical use.

To young people thinking of a career in theoretical physics, I say: don't. But if you can't stay away from it because it's so fascinating, learn it as a hobby, while making a living at the patent office.

I believe the problem lies in too many people chasing too few funds. The net result is that people have to game the system. There are various ways of going about this, but none of them includes seriously questioning whatever you think the peer reviewers believe. Some reviewers may accept such questioning, but enough do not. Accordingly, the strategy seems to be to form large cooperative groups (multi-author papers -- some have up to sixty authors), cite each other, and, whatever you do, never question the work of anyone else in the group. What happens next is we get your progressively thinner slicing. The main purpose for most academic scientists is to get funding, prizes, and promotion within the institution. You cannot blame them for this, but it does not encourage reconsidering past potential mistakes.

There will be exceptions to this, but I believe only too many scientists are deeply conservative. They want to build on what is already there. That is fine for experimentalists, but not so good for theory.

I think a weak will combined with an irresolute disposition is what forces the thing Matthew Rapaport mentioned. The reasonable response, the one I naively expected physicists to prefer, would be to abolish that system. The fate of the whole world swings on physics but too many weaklings prefer to hide quietly in their departments, and therefore, in my opinion don't even deserve those departmental positions.

For an interesting look at the accuracy of published research:
http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124
https://www.theatlantic.com/magazine/archive/2010/11/lies-damned-lies-and-medical-science/308269/

"The h-index, the citation rating, impact factors and the aspiring researcher" by Til Wykes, Sonya Lipczynska & Martin Guha. Some extracts: "Clearly, some impartial method has to be used for selecting academics for appointment, for promotion, and, most especially, for the distribution of research funding. Without one there is a very strong risk of nepotism and favouritism, leading to unsuitable appointments."

"The time of year in which papers are published then becomes largely unpredictable, although evidence from physics suggests that July is a good month to submit (Schreiber, 2012)." (A reference given in the article.)

Finally the holiday message last paragraph is interesting. This article is in open access.

I'm late to the party, but I would like to point out the standard book on this topic: Robert D. Austin, Measuring and Managing Performance in Organizations.

Mr. Austin uses the term "measurement dysfunction" to describe the situation when workers optimize aspects of their work that are measured to the detriment of aspects that are not measured. In the absence of complete supervision, any incentive scheme based on some kind of performance metrics inevitably leads to such behaviour.

This is also why I am skeptical of your claim "But I’m totally qualified to hire some people and kick their asses", because you cannot even tell whether the people you hired are doing a good job or not. And if you tried to, chances are they would start doing a worse job.

I don't know if folks are still reading, but I wanted to make a few points about your great post.

First of all, I really liked how you focused on customized metrics. We should remember that we don't need a total ordering of academics; rather, we need to find the best fit for a particular circumstance. The practical difficulty is, I think, not just getting the data and software set up, but also that it becomes yet another administrative thing an academic needs to do -- like most of us, I'd rather just do my research!

That ties into a second point. Better metrics aren't the only tool in the box for avoiding optimizing the wrong thing. Another is simply to lower the stakes, so researchers can weight metrics less and instead do the work they want to do. This is the counterargument to Chí-Thanh Christopher Nguyễn's point -- measurement dysfunction only applies if the incentive scheme is actually what is primarily incentivizing people. An academic researcher's personal incentive scheme is a mix of personal intellectual goals and externally imposed goals, and we only need the summed incentive vector to point towards actually better performance, not necessarily the external part by itself. Personal goals are a nontrivial part; that's why managing academics is like herding cats.

But how to lower the stakes? Of course, as many commenters mentioned, just having more money around would help, but dividing the pot differently can help too. For areas like mathematics and theoretical physics, where the costs of research are low, having relatively easy-to-get small grants is the way forward. Canada's NSERC Discovery grant system has traditionally been good at this (though not historically without various other flaws), but there is a perpetual fear that it will move to the superstar model that has dominated in the US and elsewhere.

Finally, let's not discount the nostalgia crowd too quickly. Whatever better incentive or evaluation system we devise, I would like to see it run for at least a few years in parallel with the nostalgia system, because the nostalgia system can at least give a rich evaluation (albeit biased in various ways), and without that richness in actual operation to compare against, I don't think we can really evaluate a new system.

You write, "measurement dysfunction only applies if the incentive scheme is actually what is primarily incentivizing people."

The idea that incentive schemes inevitably lead to measurement dysfunction is backed up by decades of research. It is not necessarily always true that workers will perform worse when offered a reward; however, it is especially true for knowledge workers like scientists and computer programmers.

If you don't want to read the whole book from Robert Austin, there is a nice summary of results between a quarter and half a century old in a now-classic Harvard Business Review article:

"As for productivity, at least two dozen studies over the last three decades have conclusively shown that people who expect to receive a reward for completing a task or for doing that task successfully simply do not perform as well as those who expect no reward at all. These studies examined rewards for children and adults, males and females, and included tasks ranging from memorizing facts to creative problem-solving to designing collages. In general, the more cognitive sophistication and open-ended thinking that was required, the worse people performed when working for a reward." (emphasis mine)

Karen, one more comment about your remark "we only need the summed incentive vector to point towards actually better performance".

Alfie Kohn specifically addresses this in his article:

"Outside of psychology departments, few people distinguish between intrinsic and extrinsic motivation. Those who do assume that the two concepts can simply be added together for best effect. Motivation comes in two flavors, the logic goes, and both together must be better than either alone. But studies show that the real world works differently."

You cannot just sum up intrinsic and extrinsic motivation. Introducing extrinsic motivators will affect intrinsic motivation, almost universally to the detriment of the latter.

Two things. (0) One more proxy for not reading others' research does not solve the problem. Look at any easy-to-browse criterion for judging a researcher's work as a compression problem: you want to compress N bits (of research results from articles, etc.) into S bits, with S << N. You also want the function from N bits to S bits to have the good properties of a hash function. Now, there are very strong limits to that. Indeed, the function which takes an article, reads off its first bits containing the title of the journal, and then outputs a number according to the importance of the journal is very crude. Maybe one can do better. But basically one can't do better than the identity function for articles which are really good. (Oh wait, maybe there is a function which compresses the article such that the ratio S/N is significantly better for articles which historically turned out to be very good. But I doubt it.)

(1) But people are doing something for science. It's called Open Science. Forget for a moment about the problem of sharing research (it has good technical solutions already), about the problem of hoarding attention for a research subject, forget even about the problem of getting funds for the research. Concentrate on the core of the scientific method: validation. As any research article is only a story told about the real research, and as peer review consists mainly in reading said article, just accept that this is not enough. Share instead the whole body of the research (a decision to be taken only by the author or lab, not mandatory), in such a way that another interested party can validate the research (by reproducing it, executing programs, doing statistics on the data, etc.). The other party does not have to pass an absolute judgement on the shared research (because a fully, objectively justified one is not possible). Instead, they can form an opinion about the research, or use it creatively, based on the actual research, without having to resort to a proxy like the opinion of an editor.

There are two advantages of OS: anybody can start doing this without waiting for the opinions of the community to change, and any research contribution has a better chance of surviving if it is OS than if it is not.