Friday, October 21, 2011

Academia and Innovation?

Robert A. Muenchen is maintaining a report here on the popularity of R, a programming environment for statistics.

He's got a bunch of measures, but these really caught my eye. A site called Rexer Analytics did a survey in 2010 asking respondents which pieces of software they used in 2009. These were the results:

So, R is at the top of the list. KDnuggets did a similar poll, and returned very similar results.

The takeaway so far is that a lot of people who do data analysis use R. A plurality, even. That is the zeitgeist.

Now we come to the results that worry me. Muenchen also did an analysis of Google Scholar citations of software packages, and produced this graph.

Clearly R has a pretty sharply rising slope, but it still comes in fourth after a bunch of software that, frankly, only academics can use because they get institutional licenses.

My worry isn't that academics should be using R (even though I think they should). It has more to do with the fact that people in academia like to think of themselves as the forward thinkers and the innovators of new ideas. But in this regard they are clearly trailing the trend that everyone else is setting. Maybe it's fitting that the SPSS curve looks not unlike what I'd imagine an ivory tower to be.

4 comments:

Having spent a career watching (and sometimes accidentally leading) the IT infiltration of academia, I think what we're seeing is that the academics who really understand statistics and need to use it use R (see, for instance, http://cscs.umich.edu/~crshalizi/weblog/), while the ones who don't are following the Herd, which is led, often enough, by clueless administrators who believe whatever they're told by techies, and techies who know their jobs are hitched to the software they're familiar with.

Academics have always trumpeted their leadership, but the brass isn't very shiny anymore.

It may also simply be that the people using R are not publishing as prolifically as those using SPSS. I can imagine many psychologists and psycholinguists who have been using SPSS for years and years and are afraid to change (the "behind the curve" people), but are also publishing workhorses. Testing this would require comparing the number of publications per capita for users of R and users of SPSS.

Honestly I think it's a certain kind of scientist that cites their statistics package, and it's probably less so the type that uses R. The main purpose of citing a software package is in lieu of explaining what you did, since you probably don't even know: you just did whatever the software package does by default. When I report an ANOVA I do in R, I just say I did a Type II ANOVA. It doesn't matter what software I used to do it: I would get the same numbers if I did it by hand. Many people who cite SPSS would probably stare blankly at me if I asked them if they knew what sort of ANOVA they did, but I know if they cite SPSS that they're doing Type III, because that's SPSS's default.

Following up on John: it's also a certain kind of journal that wants you to cite your statistics package.

It's a pity that more people don't cite R and its packages; the lack of citations makes life harder for those of us who write software. I once reviewed a book using SAS and R where the author cited and gave index entries for SAS PROCs but not for R packages.