Friday, January 09, 2009

Data Analysts Captivated by R's power: New York Times

This article is about the open source statistical environment, R (http://r-project.org). It is an implementation of the S language created by John Chambers for the purpose of data analysis. There is another implementation, the commercial package S-Plus, generally considered to be number three among statistics packages (after SAS and SPSS).

The fact that it is open source means one thing: people can inspect the source code, determine its correctness, and add to it to improve the implementation. In practice, R has become a playground for statistics researchers. Because they can inspect the inner workings of R, they know every little step being done, and they can make it do exactly what they want. Generally, people won't change what is already there without a very good reason that is justified to the maintainers. Numerous test suites exist to ensure the integrity of the code (at least the core parts). Many packages exist for R, in many cases written by academics who release the package to go along with papers they have published and books they have written. These are also heavily tested, as the authors stake their academic reputation on these packages. Distributing the packages also makes it much easier to sell their books, and it encourages people to read and cite their journal articles, because the methods are easier to use when software is already readily accessible.

Open source software also attracts critics, especially when there is a commercial competitor. The article quotes people from the SAS Institute, maker of the top statistical package around. One passage reads:

SAS says it noticed R's rising popularity at universities, despite educational discounts on its own software, but it dismisses the technology as being of interest to a limited set of people working on very hard tasks.

"I think it addresses a niche market for high-end data analysts that want free, readily available code," said Anne H. Milley, director of technology product marketing at SAS. She adds, "We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet."

In this case, both paragraphs misrepresent the issue. What open source software attracts are people who (1) need to know exactly what the software is doing and (2) have needs that were not apparent to the writers of the software.

In Ms. Milley's comment, the question is: do the people who build engines for aircraft know more about what the software should be doing, or does SAS? If the engine designers know what the software should be doing, maybe they should be the ones writing it (and testing, validating, and verifying it). The SAS code is a black box to people who are very smart and do not need or want a black box. In particular, the acceptability of numeric code should not depend on how much you paid for it, but on the testing and validation done. That is done through inspection of the code or through a test suite. R's code and its packages can be inspected. SAS's code cannot. The test suite may very well be freely available too, because the researchers who initially developed the methodology had to prove its correctness in the open to the academic community that peer reviewed the work when it was first done. And nowadays, that work is first done in R.
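As a rough illustration of what validating numeric code through a test suite can look like, here is a minimal sketch in Python. It is not taken from R or SAS; the function names and the data are made up for the example. It checks a variance routine against a value worked out by hand, and then checks that shifting the data by a constant leaves the answer unchanged.

```python
import math


def sample_variance(values):
    """Unbiased sample variance, computed in one pass (Welford's update)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count < 2:
        raise ValueError("need at least two observations")
    return m2 / (count - 1)


def run_checks():
    """Tiny 'test suite': anyone can read it, rerun it, and add cases."""
    # Hand-computed case: the mean is 5, the squared deviations sum to 32,
    # so the sample variance is 32 / 7.
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    assert math.isclose(sample_variance(data), 32.0 / 7.0)

    # Property check: adding a constant must not change the variance.
    shifted = [x + 1e6 for x in data]
    assert math.isclose(sample_variance(shifted), 32.0 / 7.0, rel_tol=1e-6)
    return True
```

The point is not this particular routine but that both the code and its checks are open to inspection, which is the kind of evidence that matters more than the price tag.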

The second group are people who know more about the subject matter than the commercial software builders. First, a digression. Most of the people who have written code for R are researchers in statistics (academic or corporate). One of the obvious contrasts is S-Plus, the commercial implementation of S. Most of the programmers who write S-Plus have backgrounds in computer programming. The result is that S-Plus is generally regarded as faster and as making better use of resources. But methodology gets developed in R first, and the methods are more correct, because the subject matter experts actually wrote the code. This is true across the board. In many niche areas, the methodology is written in R because the subject matter is too small for the mainstream statistical programmers at SAS to put their time into. And this is in addition to the "high-end" uses that SAS has, because the people at SAS don't have time to learn the nuances of every use that requires a statistical environment, or the subtleties of every application. They have to program to the mean, and for people who only want a black box that spits back numbers.

What is the biggest obstacle to open source software acceptance in numeric uses? The requirement for certifications. SAS has the money to certify its product for regulatory use. What remains is the need to certify the environment around the statistical package, which companies involved in regulated activities must then do themselves. Some of that work has also been done for those who use R, in the document R: Regulatory Compliance and Validation Issues - A Guidance Document for the Use of R in Regulated Clinical Trial Environments (http://www.r-project.org/doc/R-FDA.pdf).

My own involvement? I once wrote some code in an R project, R-GLPK. It provided documentation and examples for the use of a linear programming package from within R. Why? Because it was conceivable that people performing data analysis would, in the midst of the analysis, use linear programming to produce intermediate results. Or that an analysis might use intermediate results as inputs to a linear program. Or that R might just happen to be the platform where other work was done, and now someone needed to solve a linear program. Or any of a multitude of things that you don't go to SAS for. So an open source package that connects to something that does one thing well just made things that much easier.
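To make the "intermediate results as inputs to a linear program" idea concrete, here is a small sketch in Python. It does not use R-GLPK or GLPK's interface; the function names, the data, and the brute-force solver are all assumptions made up for the illustration. An analysis step estimates per-unit profits from samples, and those estimates become the objective of a two-variable linear program.

```python
from itertools import combinations
from statistics import mean


def solve_lp_2d(objective, constraints, eps=1e-9):
    """Maximize c1*x + c2*y subject to a*x + b*y <= rhs for each (a, b, rhs).

    Brute-force vertex enumeration, for illustration only. It assumes the
    feasible region is bounded and non-empty; a real solver would be used
    for anything serious. Sign constraints such as x >= 0 must be passed
    explicitly, e.g. (-1, 0, 0).
    """
    c1, c2 = objective
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundary lines never meet in a vertex
        # Intersection of the two boundary lines (Cramer's rule).
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        if all(a * x + b * y <= rhs + eps for a, b, rhs in constraints):
            value = c1 * x + c2 * y
            if best is None or value > best[0]:
                best = (value, x, y)
    if best is None:
        raise ValueError("no feasible vertex found")
    return best


def plan_production():
    """Hypothetical workflow: the analysis feeds the linear program."""
    # Step 1 (analysis): estimate profit per unit from made-up samples.
    profit_a = mean([3.0, 5.0, 4.0])
    profit_b = mean([2.0, 4.0, 3.0])
    # Step 2 (optimization): made-up resource limits, plus x >= 0, y >= 0.
    constraints = [(2, 1, 8), (1, 2, 7), (-1, 0, 0), (0, -1, 0)]
    return solve_lp_2d((profit_a, profit_b), constraints)
```

Calling plan_production() returns the best objective value together with the chosen x and y. Having both steps in one environment is the convenience the package was meant to provide.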