Computational Complexity and other fun stuff in math and computer science from Lance Fortnow and Bill Gasarch

Friday, June 16, 2006

The H-Number

Thomas Schwentick sends me a link to an h-number calculator maintained by Michael Schwartzbach. Jorge Hirsch developed the h-number or h-index as a measure of the scientific output of a researcher.

A scientist has index h if h of his/her Np papers have at
least h citations each, and the other (Np - h) papers have
no more than h citations each.

The h-index discounts researchers who have one or two highly cited
articles or books, or those researchers who just churn out mediocre
papers.

There are loads of problems with the h-index. Google scholar and other
citations counters are inaccurate because of trouble parsing and
disambiguating papers. Citation counts do not accurately measure the
quality of the paper—a paper that opens a field will get many more
citations than a paper that closes it. The h-index rewards fads
and cliques who always cite each other's work. The h-index gives greater weight to more senior scientists and doesn't separate those who had good early careers from those still going strong.

Having said that, we do love to compare ourselves with our colleagues
in any way possible. An automated calculator does not work well for
even mildly common names but it works great for "Fortnow" and
while my h-index of 23 does not put me among the h-number
elite, I'll take it.

I can second Macneil's comment. From sifting Google Scholar's results for some queries, I learned that after removing duplicates, nonrefereed publications, and self-citations by a selected author, about 50% of the citations remain.

As for the h-index, if you want a single number extracted from the "citation histogram", then IMHO the h-index is not a bad invariant. But Lance is correct that any citation measure should be taken with a grain of salt.

From rough and hopelessly inaccurate estimates, if you work in complexity theory then your h-index will be half of an "equivalent" researcher in algorithms, which will be half of an "equivalent" researcher in a more applied community, simply because of the size of the community and the accepted citation patterns.

if you work in complexity theory then your h-index will be half of an "equivalent" researcher in algorithms,

That would be true today, but the exact opposite was the case 15 years ago. It just goes to show how, as Lance pointed out, it rewards fads. I won't venture which one is/was the fad, algorithms or complexity.

In this age of computers there is no real reason to take the base of the publications tower as a proxy for the actual area. As has been pointed out, h x [# citations of most cited paper]/1000 is a better measure, and indeed adding the total number of citations for all papers above the h number would be an even better measure.
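For concreteness, here is a sketch of the two measures this comment proposes. The function names and sample counts are hypothetical, and I read "all papers above the h number" as the h most cited papers (the h-core); the helper follows Hirsch's definition from the post:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
    return h

def scaled_h(citations):
    """First proposal: h times the most cited paper's count, over 1000."""
    return h_index(citations) * max(citations, default=0) / 1000

def h_core_citations(citations):
    """Second proposal (one reading): total citations of the h-core,
    i.e. the h most cited papers."""
    h = h_index(citations)
    return sum(sorted(citations, reverse=True)[:h])

counts = [120, 40, 15, 9, 9, 3, 1]   # hypothetical citation profile
print(h_index(counts))               # → 5
print(scaled_h(counts))              # → 0.6  (5 * 120 / 1000)
print(h_core_citations(counts))      # → 193  (120 + 40 + 15 + 9 + 9)
```

Unlike the bare h-index, both variants are sensitive to the shape of the citation histogram above the h cutoff, which is exactly the information the comment argues should not be thrown away.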

Lastly, it is not so clear that a person who has published a thousand little theorems is truly a worse scientist than one who has tackled two large conjectures. You don't agree? Paul Erdos was accused of this for most of his life, yet over the last two decades of his life it became very clear that many of those "little theorems" were gateways to entire areas of research.

Suppose for a moment that we could all be driven solely by the desire for scientific progress. In other words, the effort we put into research has nothing to do with the number of papers we get out, the number of awards we receive, the amount of grant money we are given, or the prestige of our jobs. Does anyone argue that we wouldn't see a huge increase in our rate of progress?

What if we chose one important problem, and decided that no one would get individual credit for its solution. We would have a wiki devoted to the problem, where the entire community could offer their insights, suggestions, potential approaches, possible connections, ... Of course leaders and experts would emerge who could coordinate the effort--suggesting that specialized groups work on various branches of attack, reporting their progress regularly. Once a person or group of people had come up with some new insight, they would write a survey-like report on their findings, suitable for the rest of the community to digest, and so on.

How efficient are we as a community? Maybe you'll say that this is already what happens in some sense, but for instance, there are plenty of important areas where a good book/comprehensive survey only gets written 10-15 years into the research. People's intuitions and failed approaches never get widely disseminated because it doesn't "pay" to write those things down. We don't maximize people's particular abilities: some are better at writing, some are better at very technical calculations, some people are more creative, more apt at describing an approach than at executing it, some researchers are very good at holding the entire area in their heads and making non-trivial connections, etc. How far away are we from optimal?

Suppose for a moment that we could all be driven solely by the desire for scientific progress...

I'm not suggesting that we adopt this strategy; it's completely infeasible. I'm just curious about the price of anarchy. In other words, is it conceivable that cooperatively we could condense 100 years of research into 50? 20?

Suppose for a moment that we could all be driven solely by the desire for scientific progress.... Does anyone argue that we wouldn't see a huge increase in our rate of progress?

Absolutely we wouldn't. Once we remove the incentives for people to perform top-notch work, guess what? People wouldn't perform top-notch work. This is not wild speculation; we have the data. Just look at countries or universities in which compensation is uncorrelated with performance. You'll find that they are much more likely to be scientific underperformers. By the same token, you can see an upturn in productivity when these places reintroduce performance-based compensation.

We don't maximize people's particular abilities: some are better at writing, some are better at very technical calculations, some people are more creative, more apt at describing an approach than at executing it, some researchers are very good at holding the entire area in their heads and making non-trivial connections, etc. How far away are we from optimal?

This premise is also not obviously true. As discussed in this blog before, people tend to seek collaborators with complementary strengths as it is already. An "idea person" will tend to collaborate with people who are good at follow through. A non-experimentalist will partner with an experimentalist. A bad writer will partner with a good writer, etc.

Sure, there is friction in the system as we know it (NSF grant writing process anyone?) but most of it is not derived from h indices and other such evaluation criteria.

It's amazing that the concept of a hypothetical is so difficult for supposedly intelligent people to grasp. It's as if Roughgarden-Tardos started their paper: "let's compare selfish behavior to an optimal route assignment. hah! everyone would just give up because they have no incentive, we shouldn't even write this paper. we're such idiots..."

"I'm just curious about the price of anarchy. In other words, is it conceivable that cooperatively we could condense 100 years of research into 50? 20?"

There will always be something inefficient about anything we do. However, given that, I would be very surprised if we were missing something that much better.

At most your arguments would suggest some subsidies for "clean up" work: e.g., organizing failed results, revising and clarifying existing works, doing surveys on more obscure topics... If we were to publicly fund people to do this, we might see some advantages.

However, at what cost? First, we'd have to pay people who were fairly knowledgeable to do this clean up work. Wouldn't it be better to assign these smart people into doing new research instead? We'd get more results that way, at the slight cost of everything not being quite as clean.

Then there is the dead-weight loss: if we have people who can copy edit and revise the writings of others, well, you just might find people are less concerned about writing well in the first place. (Why bother doing that final spell check if someone else is going to do it for you?) As a result, most of the work of this "clean up" crew would be unnecessary.

Given the large leap of faith you require regarding incentives, I imagine that such a leap would not be much different from subsidizing "clean up" people. It's just not worth the trade-off.

Having people off doing their own separate research, sometimes overlapping with other researchers (and thus a bit redundant), is a great way to do a breadth-first search for the exciting solutions. A centralized program would not be as effective: the mavericks couldn't convince others to follow them until they've pretty much solved the problem in the first place anyway.

Just imagine your struggle to get us excited about your own idea. Don't you think the research wiki would be much the same experience?

I actually like the idea of "research wiki". Of course, only experiments and time would tell which format is most appropriate. But, in a sense, this is where the research communities are going. Thanks to the web, we have preprints, data, preliminary surveys etc available to everyone. Blog posts often include half-baked ideas, intuitions etc. So a wiki is a natural step.

As far as the incentives go, I think they are out there. People write wiki articles for the pleasure of having the topic explained, and let's not forget that their name is there too. Sooner or later people will start acknowledging, or even citing, web page notes or blog posts. Why not wiki posts ?

Forget about a wiki. I always thought it would be really nice to get, say, 20-30 top people in a room over a weekend (or maybe for a week?) to work on a specific problem and see what progress could be made. Given that they have the rest of the year to work on papers with fewer co-authors; given the prestige of being invited to such a thing in the first place (maybe we could mix it up with some fresh grad students, too); and given that people already waste a couple of days sitting at (mostly) mind-numbing talks anyway, it's not completely infeasible to imagine this.

(PS: workshops don't count. They are not nearly specific enough, and if you've ever been to one, you know how little progress is usually made.)

Forget about a wiki. I always thought it would be really nice to get, say, 20-30 top people in a room over a weekend...

This would, in most instances, produce very, very little. I don't think that latency can be improved much over a small group of 2-3 dedicated people; the most 20 people are going to do in a weekend is get through introductions and dinner.

What's more, you'd never get these people to agree on what to work on, and in such a short time the experts will quickly grow tired of explaining "trivialities" to everyone else.

For a weekend workshop I agree. On the other hand, a week-long workshop with a preannounced topic and fifteen or so participants, starting with four or five hours of presentations followed by work, could make substantial progress. There would be a need for a follow-up workshop with a self-selected subset of participants plus a few key additions. Such a series of workshops might well resolve a problem that is ripe for tackling, such as bringing down the time complexity of the Agrawal et al. primality testing algorithm to n^3.

"There would be a need for a follow up workshop with self-selected subset of participants plus a few key additions"

It would probably be a virtual requirement that everyone have tenure. Work done toward blogs or wikis or even large brainstorming sessions just doesn't impress universities. You would need to be a researcher with nothing to lose from the deal. (And even then, tenured faculty still work towards promotions, and so would not dedicate as much time to something that would not help them.)

Anyway, I have doubts that brainstorming works very well as a way to solve large intellectual problems. If it did, why hasn't Microsoft (or some other generous donor with a strong interest in science) assembled such a group and done so already? We are not insects with collective minds. We are individuals who cooperate and compete with each other. And it's not for lack of social opportunity that we haven't solved these problems more quickly.

You answered that question in your own post: lack of incentives. A thirty-person team solution to primality in n^3 would yield very little benefit to each of the participants.

But since you ask, placing together a large team of scientists with unicity of mind to tackle large problems was tried successfully during WWII, viz. the Manhattan Project, the Rad Lab at MIT, and Bletchley Park in England.

Just to be clear, we cannot solve P=NP during week long workshops. This approach would only work for important problems whose solution is within a few years reach, with or without a workshop.

In a face-to-face setting, having more than 3 or 4 people interrupts the flow of research; the social aspect rather than the research aspect dominates. Would the same happen with a wiki? The pattern of comments on research postings on blogs seems to suggest that it would.

On the topic of workshops, in a completely separate domain, the Center for Language and Speech Processing at JHU runs a set of three summer workshops every year that last for about eight weeks. They are typically composed of 3-4 senior researchers, 5-6 grad students and maybe 2 undergrads. There's a fairly lengthy topic selection process, and, on average, at least one of the three over any given summer is quite successful at making progress. I've never participated, but I know many people who have, and they say that it's a great but exhausting process (18 hours a day in a lab, etc.). This is much more on the experimental side, but it seems to work quite well. It's also a great educational experience for the students, especially the undergrads, who often use it as a jumping-off point for their careers.