Abstract thoughts about online review systems

As many people have pointed out, to get to a new and better system for dealing with mathematical papers, a positive strategy of actually setting up a new system might work rather better than complaining about the current system. Or rather, since it seems unlikely that one can simply invent ex nihilo a system that’s satisfactory in all respects, one should set up systems (in the plural) and see which ones work and catch on.

I’ve already had a go at suggesting a system, back in this post and this post. Another system that has been advocated, which I also like the sound of, is free-floating “evaluation boards” that offer their stamps of approval to papers that are on the arXiv. (I associate this idea with Andrew Stacey, though I think that in this area there are several good ideas that have been had independently by several people.) But instead of discussing particular systems, which runs the risk that one ends up arguing about incidental details, I want to try to adopt a more “axiomatic” approach, and think about what it is that we want these new systems to do. Once we’re clear on that, we have a more straightforward problem to solve: how do we achieve most efficiently what we want to achieve?

The first “axiom” I suggest is one that pretty well everyone seems to agree about, so I won’t argue for it. It’s that the internet takes care of dissemination very nicely (assuming that we bother to make our papers available). So this is not a function we should worry about.

Slightly more controversially, I would suggest that typesetting and copy-editing, services that journals provide to some extent, can be ignored in this discussion. We’re all capable of producing reasonably nice versions of our papers (or at least almost all of us are) for the arXiv, so we shouldn’t let the additional fine tuning be a major consideration when we think about what we want out of a new system. Basically, if you want your paper to look good, it’s up to you to put in the work to make it look good (though getting feedback from other people may well be helpful — that’s another matter).

Why do we need an official “mark of quality” at all? One might argue that when a paper is written, then either it isn’t interesting enough to get people’s attention, or it gets people’s attention and those people provide feedback (in particular drawing attention to potential mistakes if they exist). Thus, the papers that matter get looked at and understood well before they come out in journals, and the ones that don’t are quietly forgotten.

What about an excellent result produced by a newcomer to a field? Is there not a danger that that result will be unjustly ignored? Well, yes, but I would suggest that if you are such a newcomer, then what will really make a difference to your situation is not the pat on the back you’ll get if you’re lucky by having your excellent result accepted by a good journal. Long before that happens, you should get in touch with experts, go to conferences, and so on. If your result really is an important contribution, people will be very pleased to have their attention drawn to it.

The trouble with the kind of argument I sketched out in the previous paragraph but one is that we often have to judge mathematicians in areas other than our own. For example, we might be on a hiring committee for a job with hundreds of applicants. If so, we need quick, and therefore crude, methods of evaluation. Journals give us one such method: we just skim through the publication list and get a sense of the quality of the journals that a candidate has managed to publish in. You can argue, and I would argue, that this measure is too crude. But if you’ve got to get a list down from 500 to 60, say, then the difference between spending 30 seconds per candidate and a minute per candidate is over three hours of mind-numbingly tedious work. So I’ll adopt as my second “axiom” that we need some kind of “metric”. It’s an unfortunate necessity, to be sure, and one would hope that crude metrics were used only for a preliminary sifting when there is a very long list of candidates, and not, say, to distinguish between two candidates on a shortlist. But it’s still something that some of us sometimes need.

Since I’m trying to discuss the fundamentals here, let me briefly address the question of whether the notion of the “quality” of a piece of mathematics makes sense. We certainly talk as though it makes sense, but is there something objective that underlies the seemingly subjective judgments that we make the whole time?

I think it is well worth thinking quite hard about this question. What makes a piece of mathematics good? I don’t just mean the extreme cases such as a theorem that opens up a new field or solves a major open problem. Those are the easy cases. But if we’ve got one journal that’s “a little bit below Annals” and another that “accepts high-quality papers” but is regarded as a step below the first, can we say what it is that papers in the first journal have got that papers in the second lack?

Obviously we’re never going to come up with precise criteria that would allow us to give a numerical measure of quality to people’s papers. But in a way that is what we’re trying to do. We behave as though such a measure exists, even if it is extremely hard to calculate, and fondly imagine that our journal system dimly reflects the “true quality”.

Let me attempt to describe a few different quality scales, since a single linear scale doesn’t seem right.

1. Solving an open problem.

A result will cause a stir if it solves a long-standing open problem that has been worked on by many highly reputable mathematicians. [This is an inductive definition, since a reputable mathematician is one who has produced excellent papers.] An extreme example is the solution of Fermat’s last theorem by Wiles and Taylor/Wiles. A less extreme (but still pretty unusual) example is the recent solution of the Erdős distance problem by Guth and Katz.

There’s an implicit measure here: how long has the problem been around, how many mathematicians have worked on it, and how good are they? In my area people will know, or at least be pretty sure, that such-and-such a problem has been worked on by, say, Noga Alon or Jean Bourgain, and that will place a very high lower bound on the achievement of someone who manages to solve it.

But there’s another measure that is also important, which is what one might call the size of the potential audience. Is your problem of interest to all mathematicians, or all number theorists, or all analytic number theorists, or all mathematicians working on estimates for exponential sums, or all mathematicians working on refinements to what we know about Waring’s problem, or …?

2. Introducing ideas/definitions/techniques that change the way other mathematicians tackle a range of problems in some field. (Extreme example: the discovery of the Seiberg-Witten equations.)

Again there are two measures one might apply here. How radical and new were the ideas? How many top mathematicians could reasonably be said (i) to be instantly capable of recognising their significance but (ii) to have failed to notice them? That’s a measure of something like the “cleverness” of the ideas. But then there is the breadth again: how big a circle of mathematicians will have their lives changed by these ideas?

3. Making progress on a well-established project.

Quite a lot of good papers neither solve a pre-existing problem nor introduce a technique that changes a field. Rather, they make an incremental contribution to a research programme. I suppose a fairly extreme example of this is some of the work that was done on the classification of finite simple groups: it was hard and needed significant expertise, but, barring unexpected difficulties, it was always going to get done. (Just to be clear, I’m not saying that that’s a good description of every part of the classification.) The measures of quality here could be technical difficulty, expertise needed, even length. And of course all that should be multiplied by something like the size of the circle of mathematicians for whom that particular research programme is a truly central one.

4. Doing something difficult.

Leaving aside the interest of a result altogether, one of the things that makes it good, or at least is positively correlated with quality, is how difficult it is. This is particularly hard to measure, because what we want to measure is not absolute difficulty but difficulty taking into account what people could already do. If you’re not an expert in an area, then a proof may look extraordinarily difficult, because the author (all too often) has not taken the trouble to say what was truly new and what was merely a long but standard argument. Here of course the opinion of a referee can be extremely helpful. It doesn’t happen as often as I’d like, but at least in principle a referee can say, “The proof of Lemma 2.3 looks very long and complicated, but it goes along the lines that I’d expect,” or, “When I read the statement of Lemma 4.1 my first reaction was to think that it couldn’t be true, but then I understood what was going on. This is a truly important new idea.”

5. New ideas.

Now that I’ve used that phrase, let me give it a category to itself. One important measure of a paper is the degree to which that paper is going to help other mathematicians with their own research. It can do this in many ways: providing new techniques, proving statements that can be applied to solve other problems, providing proof templates that can be imitated. One nice thing it can do is enlarge our stock of ideas. It’s hard to say exactly what this means, but we do have a sense that “this paper contains two main ideas” and things like that. Maybe a loose definition would be that an idea is a piece of mathematics that can be condensed into a slogan, such as, “If the characteristic function of a set has no non-trivial large Fourier coefficients, then the set behaves like a random set.”

As ever, some new ideas (such as “try a random example”) are immensely broad, while others are considerably narrower.

No doubt the list I’ve just produced of possible attributes of a paper is incomplete. The point I’m making, however, is that there could be something to be gained from trying to be more precise about what it is we are looking for in a paper. For instance, if we are giving a stamp of quality to a paper whose main merit is that it solves a problem that was in the literature, we might want to be able to give a stamp that everyone in the relevant area understands as, “This paper solves a problem that Noga Alon has undoubtedly thought about. It is of strong interest to specialists in Ramsey theory and some interest to other combinatorialists.”

Could we hope to have a quality stamp as fine-grained as that? Possibly, but if not then there is an alternative, which one might call mini-reviews. Our evaluation procedures, whatever form they took, could result in crude stamps (for the very quick evaluations) coupled with more detailed, but short, justifications for those stamps. What I’ve just written could, for instance, be made slightly more formal as follows. “This paper solves a problem that was posed in 1995 and has attracted the attention of several of the top names in the field. It is of strong interest to specialists in Ramsey theory and some interest to other combinatorialists.” People writing such justifications would be instructed to convey as accurately as they could, in narrative terms, how good a paper was according to at least some of the criteria outlined earlier. (When it comes down to it, I think the main thing I’d want to get a sense of from one of these justifications is the breadth of interest: that is, the size of the circle of mathematicians likely to be interested in the paper. But it’s still interesting to know what kind of mathematical contribution one is dealing with.)

If you could call up, online, a list of people’s papers together with very brief summaries of that kind, then you would have an excellent way of getting a good feel for someone’s achievements. Perhaps it would be a good way of whittling a list of job candidates down from 60 to the half dozen that you really want to look into carefully.
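As a loose illustration of what such a summary might look like in machine-readable form, here is a minimal sketch. The field names, scales, and example values are my own invention for illustration, not part of the post's proposal: the point is simply that a crude stamp (for snap judgments) can sit alongside a short narrative justification covering the dimensions discussed above (kind of contribution, breadth of interest).

```python
from dataclasses import dataclass

# Hypothetical sketch of the "crude stamp plus mini-review" idea.
# All names and scales here are assumptions made up for illustration.

@dataclass
class MiniReview:
    arxiv_id: str
    stamp: int           # crude overall grade, e.g. 1-3, for very quick sifting
    contribution: str    # e.g. "open problem", "new ideas", "programme progress"
    breadth: str         # the circle of mathematicians likely to be interested
    justification: str   # one or two narrative sentences

def cv_line(r: MiniReview) -> str:
    """Render the one-line summary one might call up next to a paper on a CV."""
    return f"[{r.arxiv_id}] stamp {r.stamp} ({r.contribution}; {r.breadth}): {r.justification}"

review = MiniReview(
    arxiv_id="0000.00000",
    stamp=3,
    contribution="open problem",
    breadth="strong interest to Ramsey theorists, some to other combinatorialists",
    justification="Solves a problem posed in 1995 that attracted several top names in the field.",
)
print(cv_line(review))
```

A hiring committee doing a first sift could sort on `stamp` alone, then read the justifications only for the candidates who survive the cut.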

I’m in danger of getting away from the abstractness of the discussion and actually making a concrete suggestion, which is partly because I think that there is something that the current system does not provide (roughly speaking, a way of saying how good a paper is that’s more precise than one’s rough impression of the quality of the journal it’s in but less precise than a well-written expert reference) that would be extremely useful. So I have felt the need to explain in a little detail what that missing something in the current system is.

After all that, I have arrived at the following general principles that I think should apply to any new system of evaluating papers.

1. It should be independent of whatever mechanisms we use for dissemination and formatting.

2. It should give us a “metric” that we can use to make very quick judgments in situations where that is an unfortunate necessity.

3. It should also provide us with a more precise idea of the kind of contribution that a paper makes — how genuinely difficult, how new, how broadly interesting, etc.

I’ve forgotten one other important thing, so let me state it and then discuss it.

4. It should be at least as good as the current system at identifying papers that are wrong.

One of the things that the current system is supposed to do is give us confidence that all those results in the literature are correct. My own belief is that this is an illusion, and that many published results have not been carefully scrutinized by their referees. I also think that this doesn’t matter too much, because the results that really matter, the ones that contribute to the big story of mathematics, the ones that form the lower floors of some big mathematical edifice, are, by and large, scrutinized carefully. It’s just that the scrutiny is more or less independent of the journal system.

Anyhow, the wording in 4. is chosen carefully: I’d be unhappy with a new system that left us significantly less confident in the correctness of our generally accepted results, but I also think that there is not much danger of that.

Here are a couple more features that I think a new system should have if it is to have any chance of success.

5. It should be able to coexist with the current system.

The point here is that we can’t just jump from one system to another. Rather, a new system has to get started and be seen to work, and then the old system can afford to retire gracefully.

6. It should carry authority.

What I mean here is that I can imagine a group of people setting up a wonderful paper-evaluation system and a large proportion of mathematicians not taking it seriously and not regarding it as “counting”. If that happened, then people would not be ready to rely on the evaluations from those systems when it came to building up their CVs.

In practice, the effect of this, at least to start with, could be that the new evaluation systems have to mimic journals more than one might ideally want. For instance, suppose you had an idea for a variable stamp of approval — a paper could be passed at level 1, 2 or 3, say. That probably wouldn’t work to begin with, because it would require people to buy into your system. But if you had three pseudo-journal names and made it known that one was for amazing papers, one for very good ones, and one for good ones, then it would be more like the old system (in the sense that people could make their snap judgments based on some notion of journal quality), and people wouldn’t worry that their CVs were going to be underestimated or misunderstood.

As for more radical departures from the current system, such as websites where people review papers and build up reputation points, they are likely to have to run for a long time before any young mathematician would feel able to submit a paper to one of those sites and nowhere else. But maybe it would be good to get some of these things going so as to start the process of familiarization and, perhaps, acceptance.

One final point I should make before stopping (really it belongs earlier, but I can’t face reorganizing this post) is that the journal system performs rather different functions at different quality levels. At the top, people getting papers into the very best journals may thereby improve their chances of getting jobs at the very best universities. But there are plenty of people whose needs are rather different. I have often had to referee papers that seem completely uninteresting. But these papers will often have a string of references and will state that they are answering questions from those references. In short, there are little communities of mathematicians out there who are carrying out research that has no impact at all on what one might call the big story of mathematics. Nevertheless, it is good that these researchers exist. To draw an analogy with sport, if you want your country to produce grand slam winners at tennis, you want to have a big infrastructure that includes people competing at lower levels, people who play for fun, and so on. Similarly with mathematics, if we want the subject to thrive, we need our own big infrastructure, with excellent teachers (who are likely to be better if they are doing research, whatever that research is like), large numbers of mathematics departments rather than just a few, and so on. This creates a need for some way of distinguishing papers that are bona fide research papers (even if not terribly interesting) from nonsense.

So I’ll add another requirement of a system.

7. Amongst other things it should be able to provide a “minimal stamp of approval”, the meaning of which is something along the lines of “This is a genuine piece of mathematical research and it looks correct.”

At the moment this minimal stamp of approval is obtained if you can get your paper published somewhere (with the possible exception of Chaos, Solitons and Fractals).

What I find encouraging about the current situation is that it does seem to be possible to imagine a continuous (and monotonically improving) path from where we are now to where we might want to be in the not too distant future. We could begin by setting up cheap but relatively conventional alternatives such as open-access electronic journals. But it would then be easy for the editors of an electronic journal to experiment with the refereeing procedure. (For example, one innovation I would like to see is the process of evaluation being separated from the process of carefully reading a paper and making comments about its presentation. The latter could be done non-anonymously and in consultation with the author. Or the author could even commission this work in advance from a colleague and submit the paper with something like a certification of correctness. The person who provided that certification would not be anonymous, so their reputation would be on the line if the paper turned out not to be correct.) And if different electronic journals had different refereeing procedures, the stage would be set for significantly different procedures of a kind that many people, including me, enjoy fantasizing about. (If you do too, then you may enjoy this blog discussion.)

60 Responses to “Abstract thoughts about online review systems”

What is inquiry? And how can we tell if a potential contribution makes an actual contribution to it? Questions like these often arise, as far as mathematical inquiry goes, in trying to build heuristic problem solvers, theorem-provers, and other sorts of mathematical amanuenses.

Charles S. Peirce, who pursued the ways of inquiry more doggedly than any thinker I have ever read, sifted the methods of “fixing belief” into four main types — Tenacity, Authority, Plausibility (à priori pleasingness), and full-fledged Scientific Inquiry.

Actually, the current system could be “tweaked” to much the same effect:

1. Papers are first deposited on arxiv or something similar.

2. The role of the “journals” is now to collect the best ones in a particular field and list them with reviews. For example (I am a chemist): the American Chemical Society top 100 papers of week 2, 2012 or the ACS top 100 papers in Theoretical and Computation Chemistry for 2011, or Nature top 50 papers in January, 2012. (It’s a bit like Faculty of 1000.)

3. A paper can be highlighted in several “journals”, so a paper on a CV would be xxx, arxiv citation, listed in Nature week 8, 2012, ACS week 2, 2012 and TCC Best of 2011. Even now I would mention if one of my papers was selected by F1000 or mentioned on a blog.

4. These “journals” could be free or require a subscription – it doesn’t matter, since the paper is freely available on arxiv, though one might have to pay to see the comments. Really popular sites could generate some money from advertising: not as much as before, but the costs are lower.

5. Papers could be submitted to these “journals”, but now one could submit to several journals at once, and initially overlooked papers could work their way up the “journal ladder”.

I like parts of this idea: a “journal” can now simply be “a coherent peer-reviewed window” into papers on the arxiv… an author submits an arxiv link, the editors send it to reviewers, and if accepted, then it goes into the “list of links” that now comprises the journal.

But we want to avoid authors submitting to several journals at once, or the overall load on referees will increase.

To the list of qualities I would also add: suggesting interesting problems/conjectures. An excellent example in group theory is Charles Leedham-Green and Mike Newman suggesting the coclass conjectures.

Now, we do have a system that does (or is supposed to do) a lot of what you suggest, namely MathSciNet. Of course, to make it do what you want, it has to be taken a lot more seriously by everyone involved. For example, reviews should not just quote the abstract.

I think one area that needs to be carefully addressed is the anonymity of referees, or whoever carries out the evaluation in the new system (I think you implicitly assume this). The opinion of a referee needs to be independent and that’s more challenging if that opinion goes against the immediate feelings of the community.

Also I think that any sort of reputation system will only serve to make the situation more difficult. After all, the best way to accrue reputation is to tell the community what they expect to hear, and so a reputation system promotes that impulse over the impulse to be critical.

My own view is that a decoupling is needed: decisions about whether to accept or reject should be taken anonymously, but checking for mistakes and helping with presentation is better done non-anonymously and in consultation with the author. I see a reputation system as more to do with the latter and as a recognition of conscientiousness, reliability and hard work. One could think of it not so much as a reputation system and more as a points system, not that different from what some grant awarding bodies offer to people who have handled some of their proposals.

I agree that a reputation system connected with people’s judgments of quality raises all sorts of problems, and I would be against it.

Thank you for the clarification. I am very happy with your reply and contrary to my reply to a different post below do think that I could be convinced that some kind of points system may by appropriate. Perhaps.

I’d like to suggest that it may well be worth taking a closer look at the various stackexchange sites. These sites, for example mathoverflow, cstheory and of course stackoverflow, draw attention – uncompensated, very high quality attention – from some of the most highly regarded researchers and programmers in these fields. These sites also tend to do a marvelous job at separating the wheat from the chaff, prominently highlighting questions and answers that are ‘interesting’ and ‘important’, and weeding out those that are not.

They also provide an interesting and egalitarian way of including contributors and reviewers of differing experience.

Rather than trying to come up with something new from scratch, it might be worth considering how this approach could be augmented to supplant the ‘journal’. I suppose the primary challenge would be determining how to more effectively incorporate and define reputation scores, and then how to define the submission space and formatting.

But these models are stunning in their usability as well as their ability to draw top tier contributors.

Well, this is where I disagree. I think that the stackexchange websites are brilliant, and that is reflected in their success and popularity. However, I don’t think this means that the reputation system is something to be replicated in other scenarios.

As a community we accept that judging people on metrics based on impact factor scores is crude, yet we jump at the chance to use reputation scores. I think this is mainly because a number standing in for reputation is immediate and accessible, but alas I don’t think it’s a good measure to pick reviewers (and, I fear, authors) by.

I favour the good judgement of the editor (or whoever replaces the role of editor), who can find reviewers based on experience, on contacts, or on any metric they care to choose.

I can’t seem to reply to a reply from my phone, so I’ll try to do it here. First, I should have properly read the previous discussions, so my comment was probably pointless, although my opinion stands. Second, with regard to James, I completely agree – that is what I meant by a need to revise that part of those systems. But if the primary goal is to disseminate and promote valuable work, then I still think that that general template is by far the most desirable, and as a current graduate student I find those locations to be some of the most vibrant and stimulating. I recognize that they present pragmatic issues but probably disagree with regard to their importance.

I think we need to be a little careful not to ‘throw out the baby with the bath water’.

I think it would be unwise for mathematicians to go it alone with a new system to replace scholarly journals. There are many benefits to be gained from remaining within the confines of the wider academic community.

Journals play many important, different roles in Academia, many of which may not be relevant to mathematics, but should in any case not be forgotten.

I would suggest that we should focus our efforts on improving the current system rather than looking for a replacement.

I believe the online petition by Prof. Patrick O. Brown provides a good precedent of how to proceed: namely, to encourage the use of open access journals through pledges, and set up new open access journals where there is a need. Certainly with these new journals there is room to experiment in order to attempt to improve the current system.

There are some interesting related comments, including a description of a possible two-tier online review/publication system (basic acceptance followed by annotated selection and refereeing), in this article by arXiv founder Paul Ginsparg (from 2003).

My incentive is to have my own work contribute to the community. I do the work anyway, it’s a shame to just throw it away. It’s hard work reading a paper carefully, and personally I’d love some place to write: “I’ve checked this work and there are these three typos, and she ignored this subcase, but the paper is 100% correct.”

In the first, the question of free-floating “evaluation boards” is concretely addressed, in different terms. There have been attempts to make “arxiv overlay” journals from existing journals; they reverted to traditional publishing owing to funding failure. I see this as a good example that the transition to a new “administrative framework” for journals can be done “continuously”, that is, independently from other aspects of the review/rating system; change can be modular and reversible to some extent.

In the second blog post link, the question of the value and format of journal evaluation is addressed implicitly. It is assumed that we actually need replacements for Elsevier journals; this implies that we need editorial boards with a scope, and detailed aspects of this are discussed. I think that puts in evidence several requirements of review systems, which can be found in your posts.

Advances in Mathematics is an example of an editorial board concerned with stamping papers primarily for overall impact on mathematics and difficulty. Other boards/journals close to it are “Duke, JEMS, Transactions, Compositio, Amer. Jour. of Math.”.

Journal of Combinatorial Theory: Series A yields a lesser stamp in many respects, though perhaps higher in “combinatorial” difficulty/prerequisites for understanding the paper (these are part of your rating parameter proposals here in your post).

Regarding Discrete Math, David Speyer’s post refers to you.

Journal of Algebra is seen as important, and hard to replace, in the “stamps”/certification it provides and in its scope, which probably owes much to the expertise of its board.

(A side note: Each journal is in itself a statement that its subject, scope, is important for mathematics and society. This adds depth to ratings.)

I think this discussion supports the idea of finer rating you propose here, to be done explicitly by journal editors, especially if a flexible system were provided. Much of it is already implicit in a sense. It would have to allow progressive learning by editors. Other desirable features are centralization, related to authoritativeness, and being easy to analyze (i.e. come in a website that would provide statistical tools).

To expand on this: in the comments of that post by David Speyer I had given my ideas about why journals are important, and what could be done to develop our rating system incrementally. They are close to what you propose in the present and previous posts, but differ in that I focus on adding an independent website to the present journal system: (perhaps) contrary to you, I assume it is a good basis to build upon, and I would say I do not see it replaced by mathoverflow or “something like a cross between the arXiv, a social networking site, Amazon book reviews” (but this can be interpreted broadly):

1. Having a journal and articles “objective review” site, gathering “metrics” like citation counts, impact factor, eigenfactor. This could include your proposed ratings (difficulty, novelty, etc.) to be provided by journals (=review boards). I stress the importance of being able to play with many metrics to be able to spot possible unwanted biases and gaming/manipulation attempts of the system, and personalize the overall rating for different needs (say hiring a leading probabilist, a beginning combinatorialist, or a college teacher).

This metrics part could at first be implemented in a wiki, with little structure. It would gather “objective” data.

2. There should then be a journal and article “subjective review” part, integrated with the objective part, where mathematicians could review journals as was done in the SBS blog post and comments, with the possibility of giving “stars” (for various aspects), similar to appliance, book, or restaurant review sites. Journal and article review data could be analyzed together when analyzing, for instance, a researcher, a research group, or a university. Article-only reviews could be pitted against journal reviews, objective against subjective data, etc.; the goal is to have much flexibility for statistical analysis.

3. (Rather separate proposal) In reply to Peter Krautzberger’s link to http://www.papercritic.com, I wish we could comment directly on the arxiv website (by Cornell) to centralize and make open discussions of math articles, instead of the current system where we rely on a blog post or mathoverflow question. The author is unavoidably connected to the website because she has submitted her article there. Though she could disable comments for privacy (that would render more meaningful the usual “Comments: Welcome!”).

So if anybody wanted to set up a wiki on journal “metrics”, a journal review site, or add a comments feature to an arXiv mirror (even the Cornell one), that could be a great start.

One thing I really like about this post is that it recognizes the diversity in how we evaluate mathematical work. It is essential that any future system preserve this diversity and not lock us into a single metric.

However, there are two other aspects I consider essential:

One is unquantifiability. The fact that the current system cannot be quantified is an enormous benefit in making it robust. The problem is that when you formalize quality metrics, there is a powerful temptation to change one’s behavior to optimize these metrics. This could be almost subconscious, or it could be a deliberate attempt to game the system. One should not underestimate how widespread this would be: I am certain that a large majority of mathematicians would be at least somewhat susceptible to this influence, and a substantial minority would try to game the system. (For comparison, think of the incredible influence college rankings have had on the admissions behavior of US universities.)

In particular, I would be strongly opposed to any system of assigning points or scores, even in several dimensions. It’s easy to come up with a system that is roughly correlated with what we want to know, up until the point at which people start changing their behavior based on it, but I do not believe it is possible to do this in a robust and incentive-compatible way. I feel we are much better off with vagueness than with any possible formalization.

The second essential property is that the system should be impersonal. In particular, giving a personal endorsement or stamp of approval may be possible, but it should be far from the norm. Instead, certification should come from impersonal organizations (such as journals). This is to avoid two dangers:

1. There would be enormous demand for endorsements from, say, Gowers or Tao, and even less famous mathematicians would experience similar behavior on a smaller scale. Requests for personal intervention are not a good way to organize things.

2. Junior mathematicians would worry about their relationships with senior mathematicians. They may feel pressure to suck up by writing fawning reviews or endorsements, they may worry about being unfairly seen as sucking up, they may worry that others are getting ahead by doing so, etc. This is a case where people can benefit greatly from not having freedom: if one cannot write a non-anonymous review, or if it would at least stand out as highly unconventional, then one needn’t worry about pressure to write one. (Note that pressure might be felt even if the senior mathematicians have no such intention.)

I agree that vagueness is good. Also, as I’ve said in comments above, I agree that evaluations are best done anonymously, for the reasons you give. However, much of what referees do is not evaluation but suggestions to the author to improve his or her paper. I see no reason for that to be anonymous and good reasons for it not to be (since it would allow the person reading a paper to interact with the author).

It was partly to avoid points and scores that I suggested the idea of brief narrative evaluations. Here’s another sample of the kind of thing I mean. “The notion of graph limit makes it possible to present streamlined proofs of many facts in extremal graph theory. These proofs are qualitative, but can often be made quantitative — but in any case the bounds are often huge so there is often not much to be lost by using purely qualitative arguments. This paper introduces an important new idea and will be of interest to all extremal combinatorialists.”

The point here is that I don’t give numerical scores for things like originality and breadth of interest, but I do try to convey in a few words how it looks to me. (Somebody could try to use accounts like this to create their own private numerical scores when judging candidates for a job.)

Because these are evaluations, I would suggest that they be done anonymously.

I believe understanding the balance between vagueness(es) and precision(s) is the very heart of the matter.

I think that trying to model this (e.g. with ODEs) leads to very sensitive dynamics. We will eventually have to rely on administrators (hiring committees, editors, politicians) who have great analytical tools for quickly detecting gaming opportunities and attempts and for accurately valuing important contributions, and who have the best reputation and “purity” of incentives, together with society keeping them in check (through journal reviews, for instance, in case the “administrators” are editors). If we try to formalize this (e.g. as a game), I think we will find much sensitivity to the fine print of the rules: as we keep adding layers of sophistication, we will see small effects leading to dramatic changes.

So I think we are not out of the woods. That is, I think one could show that controlling society requires a large amount of energy in itself (say, of the same order of magnitude as running it), and that all attempts (such as we are making here) at finding an ideal system will be difficult and will succeed only temporarily, precisely because we are essentially as powerful as the people who would game the system (the players are the regulators).

I have found myself “gaming” the system on MathOverflow subconsciously. I’ve noticed that I should be a little long-winded, and I should try to provide links to terms. But perhaps that isn’t “gaming”; perhaps that’s just improving my work-product.

Thanks Kevin, that is one aspect I had in mind: the fuzziness between “positive” and “negative” gaming, or awareness of the system, knowing which details are useful and generously rated by others. How advantaged are those who know the system and think a lot about it? Is someone who focuses more on research and less on publishing, polishing a CV, etc. not treated unfairly by the system when her contribution is very important?
Working out how and when this can degenerate is where quantitative thinking is necessary, and I am convinced that will be (provably) tough.
Everybody accepts a certain amount of administrative duties, outreach, CV embellishment, applications, and small touches to one’s writing that are generously valued and relatively easy to make. But precisely how much of this is useful, how much should be promoted or discouraged, and how?

Stemming from the same principles: contrast the situation of mathematicians, who focus on basic research and forgo most decision power over resource allocation, with that of investment bankers or top executives who directly decide their own compensation, often nontrivial percentages of revenues they directly influence, setting the price of an iPod at, say, $199 when it could perhaps be $173, or the price of a journal bundle at $500,000 instead of $412,000 or $150,000, or a commission on trading at $5 per trade instead of $2.86.

I think there needs to be a greater emphasis on the revision process as part of publication. Regarding “4. It should be at least as good as the current system at identifying papers that are wrong”: rather than emphasizing only the filtering out of bad ideas, I think equal emphasis should be given to providing useful critiques and a structured way for authors to make improvements. For example, a paper may not be as interesting as it could be unless an additional corollary is proved. Newer mathematicians in particular may not recognize these opportunities to improve their work. This is feedback that is currently given in the publishing review process. I’ve been examining content-heavy question/answer posts on MathOverflow, and about 40% of these posts have at least one answer that is revised. One could argue that revisions and corrections will happen no matter what, and should also be happening through informal channels. But at the moment a formal mechanism for improving papers is part of the publishing process, and it should explicitly be part of the axioms.

Your list of criteria for evaluating a paper seems to be biased towards a certain type of mathematics.

2. Introducing ideas/definitions/techniques: I have the impression that the deeper the idea the less it is likely to affect people on the spot (very often a new concept is just ignored by the already working mathematicians who are busy with their own stuff and it takes some time before it is appreciated properly through the work of younger people). So asking “how big a circle of mathematicians will have their lives changed by these ideas” is probably not the right question (and has some smell of “impact factor”)

4. Difficulty for the sake of difficulty is somewhat pointless (although if you want to hire a young mathematician, it could be a good criterion that they have technical power). Also, it is not clear to me what “difficult” means: for example, proving that a given number is irrational is usually truly difficult, but once the proof has been found it can often be transformed into a problem for undergraduate students (this is, for example, the case for Apéry’s proof of the irrationality of $\zeta(3)$).

5. New ideas: same comment as 2.

As someone pointed out, we already have something that could do most of what you suggest, namely MathSciNet (or Zentralblatt). One thing that could make it better would be the possibility of adding later comments (like “this result is wrong because…; the correct statement would be…” or “the ideas contained in the proof of such-and-such were very influential in the development of my paper…”; i.e., anything factual concerning correctness, and anything positive that the paper contributed).

Yes, I second this. Math is supposed to be about making things transparent, not about making them opaque. I hate this aspect of math. Suppose two authors independently discover a theorem, one giving an extremely difficult proof and the other an extremely easy proof. Obviously the latter is the better paper! And yet the former is more likely to be accepted by a high-end journal.

If we’re speaking of axiomatizing paper reputation, we should have a principle like: “The reputation of a paper cannot be increased by the author intentionally withholding an easier proof and submitting a harder one on purpose.”

Difficulty for the sake of difficulty is pointless. I don’t know anyone who’d disagree with that. But solving a difficult problem is probably, all other things being equal, a greater achievement than solving an easy problem. At any rate, knowing how difficult a result was provides a useful piece of data even if it is rarely the main criterion by which we judge the quality of a paper. And I’d be the first to agree that proofs can sometimes be difficult because the author has not taken the trouble to find an easier proof. (But then we have to distinguish carefully between the difficulty of finding an argument and the difficulty for others of understanding it afterwards.)

I have read part but not all of the discussions on this blog, in other blogs and on Google+. I have only thought a little about these issues and I am afraid I am going to repeat (sometimes voluntarily but mostly involuntarily) some of the suggestions made in those discussions, but it is more convenient for me to try to describe a pseudo-coherent model that uses two interlinked rankings. Here it goes.

A scientist should be associated with two numbers, similar to Google’s PageRank:

– an AuthorRank

– a ReviewerRank.

These two numbers would reflect the reputation (value?) of the researcher in the two major activities/roles of a scientist: that of producing new and interesting results, and that of judging/checking/validating the results of others. These numbers would be calculated also adopting an algorithm similar to PageRank (see below).

Each scientist should have an account with two corresponding modes: Author and Reviewer. The first would be associated with the real name of the scientist, while the second would allow the scientist to act anonymously. Anyone could open an account, but the Reviewer mode would be activated only upon referral from an official institution (university?) or after having built enough AuthorRank. This would reduce the risk of people polluting the system with bad behavior in Reviewer mode, and of accounts opened just to rig the system.

Each “published” (“arXived”?) paper should be open for discussion (commenting, suggestions, etc.) and for voting. Votes would be cast by scientists in their Reviewer (anonymous) mode, with only the ReviewerRank displayed and having an effect (although the Author mode would have an indirect effect; see later). The vote cast by a Reviewer with a higher ReviewerRank should count more than the vote cast by a Reviewer with a lower ReviewerRank (in this sense the system is PageRank-inspired). In principle one could even keep track separately (besides the total count) of the votes coming from people with high ReviewerRank (much as on Rotten Tomatoes one can check the ratings of the “top critics”).

The AuthorRank would (should?) influence the ReviewerRank by adding to it. The rationale is that if one is a good author, he/she is probably able to judge properly the works of others, even if he/she does not dedicate much time to reviewing and to building the ReviewerRank with an intense reviewing activity.

The researcher would take part in the discussion on his/her article in his/her Author mode. His AuthorRank would increase thanks to the votes given to the article and potentially to the votes given to the activity of the author in the discussion on the author’s paper (e.g., replying effectively to the comments/questions of the Reviewers). The AuthorRank would also increase with citations of his/her paper by other papers. As in the calculation of PageRank, this increase would depend on the AuthorRank of the authors of the citing paper. The point is to make the quality of the citations at least as important as the number of the citations. The ReviewerRank of a Reviewer would increase thanks to the votes of both the Authors and the other Reviewers for constructive feedback, good comments, helpful suggestions.

There could be tags associated to papers to indicate the fields and subfields of research: one could then even end up with Author and Reviewer ranks in each subfield, depending on the votes associated to both the uploads (papers published) and the discussions in a particular field. This would make more objective saying “this person is a leader in this field but also an expert in this other field”.

As a result of this system, a researcher would be associated with his/her Author and Reviewer ranks, possibly (sub)split by field/subfield. Also, each paper in the list of papers would have an associated score. Committees evaluating a candidate for a job should then be able to get a good sense of the ability of the person in a given field/subfield, as well as of his/her contribution to the community through his/her referee activity.
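Purely as a toy illustration of the dynamics described above (the function names, the mixing constant, and the exact update rule are my own assumptions, not part of the proposal), the mutual dependence between ReviewerRank-weighted votes and AuthorRank might be sketched like this, iterated toward a fixed point in PageRank fashion:

```python
# Toy model of the AuthorRank/ReviewerRank feedback loop described above.
# All names, the mixing constant alpha, and the update rule are illustrative
# assumptions, not a specification of the commenter's proposal.

def paper_score(votes, reviewer_rank):
    """Weighted average of votes: a vote from a high-ReviewerRank reviewer counts more."""
    total = sum(reviewer_rank[r] * v for r, v in votes.items())
    weight = sum(reviewer_rank[r] for r in votes)
    return total / weight if weight else 0.0

def update_ranks(papers, authors, reviewer_rank, author_rank, alpha=0.3, iters=20):
    """papers: {paper_id: {"author": name, "votes": {reviewer: vote}}}.
    Iterate the mutually dependent ranks toward a fixed point, PageRank-style."""
    for _ in range(iters):
        # AuthorRank: mean score of the author's papers under current vote weights.
        for a in authors:
            scores = [paper_score(p["votes"], reviewer_rank)
                      for p in papers.values() if p["author"] == a]
            author_rank[a] = sum(scores) / len(scores) if scores else 0.0
        # ReviewerRank: prior reviewing reputation blended with AuthorRank,
        # so good authors are presumed to be competent reviewers.
        for r in reviewer_rank:
            reviewer_rank[r] = (1 - alpha) * reviewer_rank[r] + alpha * author_rank.get(r, 0.0)
    return author_rank, reviewer_rank
```

With two equally ranked reviewers voting 1.0 and 0.5 on a single paper, `paper_score` is the plain average 0.75; give one reviewer a higher rank and the score shifts toward that reviewer's vote, which is exactly the weighting the proposal asks for.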

Oh please, let there never ever be something like ‘AuthorRank’ or ‘ReviewerRank’! What a shame to reduce working minds to two numbers.

This is not meant as a general complaint about reducing people to numbers, but I really do not see any point in it here. It is certainly possible in some way, but it is not that helpful, and most importantly it is very dangerous (since numbers can be manipulated in many, many ways).

What would be nice is to have some means of answering the question “Is this guy a good mathematician?”, or perhaps “Is this guy a better mathematician than this other guy?” (whatever “better” may mean here; probably “better suited for this job/task”). But this can (and should) be done by means other than just two numbers.

I agree that we should not reduce people to numbers, but we need metrics, and for a number of reasons. Two are:

1) to make a system where papers are submitted, reviewed and voted—like the one that is discussed here—work better, in particular more efficiently, with the “signal” coming from “good” reviewers amplified;

2) to improve over a system (the one adopted, for example, by funding agencies) where the bare number of citations or the h-index is often the figure of merit.

I am confident we can agree that not all citations are created equal; it therefore makes a lot of sense (to me) to try to capture this fact.

Numbers and rankings should be used and not abused. As long as we agree that having one (or two, as I suggest) of those is helpful, I believe we should make it (them) the best possible.

Let me remark that, if one keeps track of the subfield in which ranking points are earned, one could split the rankings in a subfield-dependent way. That is potentially a good way to identify experts in particular (sub)fields.

Or punish bad writing? Good writing is a professional responsibility. Good expository writing should be rewarded though because currently there is not enough of this. On the whole though, I feel very uncomfortable with metrics.

I am a little worried about making commenting on papers and improving the exposition non-anonymous for reasons to do with the discussion surrounding criterion 7.

Suppose you get a paper to referee that is completely uninteresting. Right now, you do it in large part because it is a favor both to the journal and the mathematical community and also a personal favor to the editor, whom you probably know. As part of that refereeing, you not only give an evaluation but also check correctness and make some comments improving the exposition.

Now suppose that evaluation and improvement of exposition are decoupled and the latter is no longer anonymous. In that case, it seems to me that the latter no longer feels like a service to the mathematical community (especially for uninteresting papers) so much as a personal service to the author. This shift in whom you are responsible to would probably make you much less willing to do it, since the author is some stranger you have no personal connection to.

One of the strengths of the current system is that, even for uninteresting papers, it assigns someone who is responsible, not to the author but to a third party, for making suggestions that improve the paper.

There’s another issue that I was unaware of until recently. At the ICM in Hyderabad, I ended up on a bus with many mathematicians living/working/from Africa. In the course of discussions I mentioned the arXiv and its value, and was floored to find that I was the only one (in my immediate surroundings, anyway) who would even consider posting to the arXiv. The objection was over priority; I argued that posting to the arXiv established priority, while the others on the bus asserted that it opened one up to having their work stolen.

The gist of the fear was that they would post some result, and then somebody in the 1st world would generalize or simplify or strengthen or steal or whatever, and publish that before they could, without proper attribution. They had the impression that publication times are much less for 1st worlders (their term, not mine), and that anyway we had more access to conferences and the community and so our work is known long *before* publication.

I’m not sure what to make of this, as I don’t think it accurate in any detail. But, it wasn’t the opinion of one person, it was of nearly a busload. It is relevant to this discussion, but I’m not sure how. Perhaps it motivates:

9. The new review system should actively defend the priority of results and proper attribution, and be agnostic to the gender, race, country of origin, country of residence, religion, personal politics, and employer of contributing mathematicians.

Part of this is that more traditionally educated mathematicians (as most will be in countries with less developed university systems) tend still to view the arXiv as a place for “preprints”. This tendency is exacerbated in countries where curricular evaluation is heavily based on publication counts (weighted by impact factor), in which “publication” in a venue like the arXiv may not count at all (a point that may be hard to understand for some in the US), or may count only if there happens to be an evaluator competent to argue for counting it.

Since I see putting something on the ArXiv as publishing it, I see it as a way to establish priority, but if I were to instead see it as a sort of preliminary publication, I might indeed be worried about losing priority …

Working in India, let me give a personal perspective on Kevin O’Bryant’s comment. I have had the experience of a paper bouncing from journal to journal until it finally became partially obsolete because of work done after my first submission. Leaving aside the question of whether the repeated bouncing has to do with bias (that is not for me to say), it is the inefficiency of the serial submission system that is truly annoying when one is actually communicating through publication (and preprints) rather than through conferences etc.

It would be much more efficient to have detailed comments, ideally not necessarily anonymous (to encourage careful thought from the commenter), gathered together early. The judgement, including of the quality of papers can be made by editors (perhaps asking trusted mathematicians for more comments).

One thing about going from “serial” to “parallel” article submission, which could be allowed if journals were just boards that give a seal of approval: if it goes “massively parallel”, the quantity of papers under review will grow very quickly, and there won’t be enough mathematicians to read them all. So it seems there should be at least some kind of policy that prevents one from submitting to 10 journals at once.

One can avoid this in two ways:
(1) A large part of the review process could be common to all journals, so there is no duplication.
(2) If the information about what has been submitted where is public, then someone indulging in massively parallel submission will be ignored by all the journals.

The late Joseph Ransdell (1931–2010), who did more to keep C.S. Peirce’s thought alive on the Web than anyone else I know, had a particular interest in the issues surrounding open peerage and publication. Synchronicity being what it is, the members of the Peirce List have been conducting a slow reading of one of Joe’s papers on the subject, in which he examined the work of Paul Ginsparg on open access and of Peter Skagestad on intelligence augmentation in the light of Peirce’s theory of signs, a.k.a. semiotic. Here is the paper:

Almost any initial attempt will be imperfect, so setting up an error-correcting mechanism will be important. The system should be flexible enough to amend itself and even change its mission statement if need be. (This very proposal is evidence that well-established systems need to be revisited.) Stack Exchange is brilliant at this to a degree, but I would argue that the entrenched power among high-rated users causes some sclerosis.

In order to have an error-correcting system, or be capable of changing the mission statement in other than a random way, one has to have an independent sense of the objective. In practice, this usually means a number of independent but converging operations that tell you when and how far your system has gone off course.

NSF allows two pages for biographical sketches in grant applications, and for years I’ve been using that space to report as many different forms of recognition as I can. Good papers naturally gather citations and positive summaries in followup papers (and blog entries and so on), and the whole point of a CV is to efficiently report the highlights. Not being able to report anything other than journal acceptances would be quite embarrassing—“The referees guessed that my papers would amount to something, but apparently they were wrong.”

I’m puzzled by the notion that it’s important for paper evaluations to be anonymous. Here’s one of many examples of evaluations contained in my public writing: “The split-radix FFT computes a size-n complex DFT, when n is a large power of 2, using just 4n lg n-6n+8 arithmetic operations on real numbers. This operation count was first announced in 1968, stood unchallenged for more than thirty years, and was widely believed to be best possible. Recently James Van Buskirk posted software demonstrating that the split-radix FFT is _not_ optimal.” There are certainly many busy administrators who simply want to know that Van Buskirk’s paper ended up appearing in _Computing_, impact factor 0.959, but there are also many people who know me and who will assign my statement far higher weight than any mere journal acceptance.

I’m also puzzled by the claim that non-anonymous paper evaluations will create an unprecedented flood to senior researchers of “requests for personal intervention”, pressure on junior researchers “to suck up by writing fawning reviews”, etc. There’s nothing new about senior researchers having tremendous personal power to promote their favorite junior researchers—but the people who have this power also tend to be people who place a high value upon research excellence. Someone who spams senior researchers with “fawning reviews” and “requests for personal intervention” (“Prof. Dr. Gauss, I just wrote a blog entry saying how awesome you are! What job openings do you have for me?”) will be laughed at and eventually blacklisted, while someone who writes good papers, points them out to the authors of previous papers, gives good talks, etc. will be rewarded.

Certainly there are horror stories of senior researchers abusing their power, and these horror stories are enough justification for _allowing_ anonymous paper evaluations. In particular, if a student is scared to publicly issue a negative evaluation such as “1993 Aiello Subbarao ‘A conjecture in addition chains related to Scholz’s conjecture’: The stated conjecture is identical to Scholz’s conjecture. The constructions are special cases of Hansen’s 1959 l^0 construction. The computations are tiny portions of the l^0 computations done years ago” then the student should have—and does have, and will always have—a way to hide behind someone more senior who is vouching for the evaluation.

But trying to _require_ anonymity would be a radical change from the current system. Anonymizing all paper evaluations would require anonymizing all followup papers (i.e., all papers); prohibiting even the slightest mention of other people’s papers in personal blogs; and I don’t even want to think about the consequences for conferences. We’d lose the ability to make informed judgments about differences between communities (“TCC people think that provable leakage resilience is a fantastic step forward while CHES people think it’s garbage”) and we probably wouldn’t even be successful at stopping the occasional abuses.

In case you haven’t seen it, you might be interested to read ‘The Five Stars of Online Journal Articles — a Framework for Article Evaluation’ by David Shotton: http://www.dlib.org/dlib/january12/shotton/01shotton.html .
I like your suggestion of ‘justifications’ to convey, in narrative terms, how good a paper is according to a set of criteria. I think all articles, in all disciplines (I come from the life sciences), should have structured summaries (in addition to the traditional abstract) that include the sorts of things you mention in your quality scales and that can be easily understood by non-specialists.

Electronic archives can and should almost entirely supplant paper journals. They are way cheaper. They are way more available. They are way faster.

Refereeing can be way better on an electronic forum, IF the forum is set up to allow it. Make reader comments appendable to papers on the archive. Over time this will produce far better refereeing than any referee report does now. Also, papers could be rated by readers with a score of, say, 0-100, and statistics kept. Self-indexing and self-cross-referencing would also build up in these comments, if the forum is set up to allow it.

Anonymous refereeing by unaccountable, unpaid non-experts simply does not work well. I think anonymous refereeing should be abolished; that is my personal view. I have for many years refused to referee anything anonymously, and I insist on INTERACTIVE, not batch, refereeing, where I get to correspond with the authors two-way if I feel that is going to be useful. That works faster and better. But the academic community is stuck in the stone age and refuses to see the obvious, which is disgusting. Academia should be leading the world into more advanced progress, not trying to hold the world back and stop progress. But the latter is how it has been for most of my scientific career. Even the makers of the electronic archives themselves, in correspondence with me, have refused to do things like add commenting capability because it would be “impolite.” What the hell? It’s as if these people are intentionally trying to keep the flaws of paper journals. They also implemented an “old boy network” scheme where you had to get your paper “vouched for” by somebody to put it on the archive. Not that the voucher was saying your paper was valid or anything; just that they knew you. What the hell? The old boys’ club is a bug, not a feature.

Perpetual public refereeing by everybody, especially the few readers who actually care about a paper, is way more useful than a secret referee report, often written by somebody who does not care and who refuses to give their name because they are a useless, incompetent wimp with an agenda.

Furthermore, referees themselves could be rated within the electronic system, and those who tend to get high ratings could then get higher weight when judging a paper. It would also be easy to, e.g., get the average score of a paper from raters with real names only (anonymous raters removed), etc.

And there is really no longer much if any reason papers should be rejected. I mean, there are bad papers and wrong papers, but if they are branded as such, they can still be there (and may even be useful).

And there is no longer much if any reason for editorial boards etc.

A few things worry me about e-archives, such as permanence (what if there is a solar storm? cyberwar?) and formats going obsolete. They could demand that large libraries worldwide install mirrors of the archive, which’d partly solve the permanence problem.

I notice several people are worried that having quantifiable metrics would encourage people to “game” the system. I submit that relying on (supposedly) unquantifiable metrics constitutes security through obscurity; such metrics don’t make it substantially harder to game the system, they merely make it almost impossible to detect such manipulation.

Unrelatedly, the discussion of ‘breadth of interest’ and of denoting the relevance of papers to various fields and subfields suggests a natural application of the idea of a “radial category”. A paper on random graphs might be close to the centre of the ‘graph theory’ category, while peripheral to the ‘random structures’ category; or it might contain insights or methods applicable to random structures in general in which case it would be closer to the centre of that category also. This suggests that a network of electronic journals should be organised not as a collection of distinct journals (“The Journal of Foo”, “Results in Bar”, “Analytic Baz Theory”) but rather as a kind of “space of papers”, a mesh of overlapping radial categories; one would browse not by reading a single journal linearly, nor by searching for a textual phrase and reading matching papers, but by iteratively narrowing the scope of one’s view; correspondingly the detail would increase, and lower-ranked papers would become visible. Perhaps at low detail levels, one would see not only (highly-ranked) papers, but also authors (“So-and-so publishes in this area”).

It might be worth noting that this model is applicable to the browsing of many kinds of data; essentially anything which can be categorised by a radial-hierarchical hybrid system, and which carries a quality metric (for deciding which items to show at lower detail levels).

Jon,
The distinction between measures and targets appears to be (if not identical to, then isomorphic to) that between instrumental and terminal values.
It is a general problem that our terminal values are rarely directly measurable, and it is for this reason that instrumental values exist. The alternative would be for us to read every single paper in our field, because that is the only way to determine, without use of an intermediate instrumental value, the paper’s worth to us. Instrumental values are, thus, a necessary evil.
Unquantifiable measures are still measures, and our current ‘target measure’, publication in a high-status journal, is still susceptible to gaming, it’s just less visible. (There’s a reason why the “bean-counters” prefer transparent measures).
I agree that impact factor is a poor choice of instrument, but its manipulability is a result of its mathematical characteristics: it sums, over all authors, the number of times each author cites the paper. Clearly this (and other simple-sum metrics, including a naïve StackExchange-style reputation system) is vulnerable to distortions practised by a few. What is needed is a metric which is nonlinear, such that citations amongst an already strongly connected group carry less weight than citations from unrelated individuals. A simple (and probably overly naïve) system of this kind might be that each new citation scores some increasing function of the length of the shortest pre-existing path on the citation graph (considered as undirected) joining the citer to the cited.

[…] Thousands of people joined the boycott and started talking of making new, lower-cost journals, or even radical new alternatives. Izabella Laba pointed out that there is a big risk of inventing something even worse if we […]

Hi,
I’m not sure if this is the right place to mention it, but for the past month or so I’ve been running http://arxaliv.org/ which provides an interface for commenting and voting on papers from the arXiv or other open sources. It also supports building moderated communities, so we set up an editorial board to review submitted papers of general mathematical interest, with the intent of posting one per week.
I’d appreciate any help and advice on how to make this useful, and whether it is something that would be helpful.
Best,
Ralph