Learn more about using open source R for big data analysis, predictive modeling, data science and more from the staff of Revolution Analytics.

April 30, 2013

SAS Big Data Analytics Benchmark (Part One)

by Thomas Dinsmore

On April 26, SAS published on its website an undated Technical Paper entitled Big Data Analytics: Benchmarking SAS, R and Mahout. In the paper, the authors (Allison J. Ames, Ralph Abbey and Wayne Thompson) describe a recent project to compare model quality, product completeness and ease of use for two SAS products together with open source R and Apache Mahout.

Today and next week, I will post a two-part review of the SAS paper. In today's post I will cover simple factual errors and disclosure issues; next week's post will cover the authors' methodology and findings.

Mistakes and Errors

This section covers simple mistakes by the authors.

(1) In Table 2, the authors claim to have tested Mahout "7.0". I assume they mean Mahout 0.7, the most current release.

(2) In the "Overall Completeness" section, the authors write that "R uses the change in Akaike's Information Criteria (AIC) when it evaluates variable importance in stepwise logistic regression whereas SAS products use a change in R-squared as a default." This statement is wrong. SAS products use R-squared to evaluate variable importance in stepwise linear models, but not for logistic regression (where the R-squared concept does not apply). Review of SAS documentation confirms that SAS products use the Wald Chi-Square to evaluate variable importance in stepwise logistic regression.

(3) Table 3 in the paper states that R does not support ensemble models. This is incorrect: packages such as randomForest, gbm and ipred all provide ensemble methods.

(4) The "Overall Modeler Effort" section includes this statement: "Bewerunge (2011) found that R could not model a data set larger than 1.3 GB because of the object-oriented programming environment within R." The cited paper (linked here) makes no such general claim about R; it simply notes the capacity of one small PC and does not demonstrate any link between R's object-oriented approach and its use of memory. The authors also fail to state that in Bewerunge's tests R ran faster than SAS in every test it was able to run, and that Bewerunge (a long-time SAS Alliance Partner) drew no conclusions about the relative merits of SAS and R.
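To make point (2) concrete, here is a minimal sketch of stepwise selection for a logistic model in open source R, where step() evaluates candidate moves by the change in AIC. The data are simulated purely for illustration:

```r
# Simulated data: three candidate predictors, binary response.
set.seed(1)
df <- data.frame(
  y  = rbinom(100, 1, 0.5),
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = rnorm(100)
)

# Fit the full logistic model, then let step() add/drop terms.
# step() only accepts a move when it lowers the AIC.
full    <- glm(y ~ x1 + x2 + x3, data = df, family = binomial)
reduced <- step(full, direction = "both", trace = 0)

AIC(reduced) <= AIC(full)  # always TRUE: step() never increases AIC
```

Note that nothing in this workflow involves R-squared; the selection criterion printed in step()'s trace is AIC throughout.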

Disclosure Issues

Benchmarking studies should provide sufficient information about how the testing was performed; this makes it possible for readers to make informed decisions about how well the results generalize to everyday experience. For tests of model quality, publishing the actual data used in the benchmark ensures that the results are replicable.

As we know from the debate over the Reinhart-Rogoff findings, even the best-trained and credentialed individuals can commit simple coding errors. We invite the authors to make the data used in the benchmark study available to the SAS and R communities.

In addition, we think that additional disclosures by the authors will help readers evaluate the methodology and interpret findings from the paper. These include:

(1) Additional detail about the testing environment. I'll remark on the obvious differences in the hardware provisioning in next week's post, but for now I will simply note that the HPA environment described in the paper does not appear to match any existing Greenplum production appliances;

(2) Actual R packages used for the benchmark;

(3) Size of the data sets (in gigabytes);

(4) Actual sample sizes for the training and validation sets for each method, together with more detail about the sampling methods used;

(5) Details of the model parameter settings used for each method and product;

(6) The value of "priors" used for each model run (which alone may explain the observed differences in event precision);

(7) In the results tables, detailed model quality statistics for each test, including sensitivity, specificity, precision and accuracy, the actual confusion matrices and method-specific diagnostics;

(8) Detailed model quality tables for the Marketing and Telecom problems, which are not disclosed in the paper.

We invite readers to review the paper and share their thoughts in the Comments section below.

Comments

It doesn't seem that the authors did their due diligence with regard to the R ecosystem. I would just add:

1. Logistic regression can be performed out of memory in open-source R using the biglm or speedglm packages.

2. Out-of-memory random forests are available via bigrf.

3. It seems pretty naive to leave things like the % correctly classified blank for R just because they are not reported by the summary function. Sure, R doesn't spit out tons of irrelevant quantities by default, but that is by design, not by deficiency. Is table(data$variable, predict(model, newdata = data)) too hard for them?

4. The whole "object orientation leads to big-data problems" argument really makes me wonder whether they have a solid understanding of computer science. R does have more limited functionality for out-of-memory data, but that is not due to object orientation.

5. Regarding reading in data: the claim that "missing values must be designated 'NA' in the original data" is false. See the na.strings parameter of read.table. Though who knows how they read in their data, as it is not stated.

6. What is going on with random forests? They report 29% correct classification for SAS, 0.8% for R and 0.1% for Mahout. No mention of these wildly different numbers is made in the text.
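A minimal sketch of point 1, assuming the biglm package is installed (the data, formula and chunk size here are illustrative only):

```r
# Out-of-memory-style logistic regression with biglm::bigglm, which
# updates the fit in chunks rather than building one large model
# frame in RAM. Simulated data stand in for a real large data set.
set.seed(1)
df <- data.frame(x = rnorm(1000))
df$y <- rbinom(1000, 1, plogis(df$x))

if (requireNamespace("biglm", quietly = TRUE)) {
  fit <- biglm::bigglm(y ~ x, data = df, family = binomial(),
                       chunksize = 100)
  print(coef(fit))
}
```

On a genuinely large problem the data argument would be a chunked reader (for example, a connection or database cursor) rather than an in-memory data frame; the chunked update is what keeps the memory footprint bounded.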
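Points 3 and 5 above can be sketched in a few lines of base R (the file contents and data are made up for illustration):

```r
# Point 5: read.csv/read.table can map any token to NA via na.strings,
# so missing values need not be coded "NA" in the source file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1.2,0", ".,1", "0.5,0"), tmp)
d <- read.csv(tmp, na.strings = c("NA", "."))
is.na(d$x[2])  # TRUE: the "." was read as missing

# Point 3: % correctly classified from a fitted model via predict()/table().
set.seed(1)
df <- data.frame(x = rnorm(200))
df$y <- rbinom(200, 1, plogis(2 * df$x))
fit  <- glm(y ~ x, data = df, family = binomial)
pred <- as.integer(predict(fit, newdata = df, type = "response") > 0.5)
cm   <- table(actual = df$y, predicted = pred)  # confusion matrix
accuracy <- sum(pred == df$y) / nrow(df)        # % correctly classified
```

Nothing here is exotic; both computations are one-liners once the model is fitted, which is why leaving those cells blank for R is hard to excuse.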

It is also interesting that they mention issues with R's use of RAM, yet allocate R the machine with the least of it. The machine used for Enterprise Miner was exactly the same except that it had 21 GB of RAM instead of 15. The laptop I'm on now has more than that (16 GB).
What is step 4, "readable format (.csv)", in the time-to-analyze for R? If they used the RMySQL package, are they saying they used it for only one part, and manually exported a generated CSV file from SAS in order to load it into MySQL? It seems obvious that a pipeline like SAS -> R -> CSV -> SAS -> MySQL -> R would take up some time. There also seems to be nothing involved with SAS: you just have source data, and in 10 minutes you have a model. I am not sure that is a good thing; how many assumptions does it implement for you, and are they good or bad? They show time spent partitioning the data in Mahout and R but not in SAS, yet they talk about that step happening, and that step even seems to have some questionable practice behind it.
Are they saying that they actually trained the SAS model on an oversampled set because the tool forces this, with no arguments to override that behavior? They claim R and SAS HPAS do not support ensembles, yet both have random forests, which are ensembles. The NA and accuracy issues almost suggest that if there wasn't a GUI widget or drop-down to do something, it wasn't done. It would be nice if the study were reproducible, but that is not easy when you use point-and-click tools.

Can you also comment on the following paper, which claims R's mixed-model procedure is inferior to SAS in terms of inflated Type I error? Thanks.
http://onlinelibrary.wiley.com/doi/10.1002/sim.4265/abstract

Non-linear mixed models can be pretty sensitive to the number of quadrature points used. They used 20 points, which I assume would be enough for the models being estimated, but perhaps not if they are getting significantly different results from the SAS algorithm.

The largest data set used for this paper has 2 million observations. I'm shocked that this is their definition of "big data." We routinely deal with data sets that have more than 50 million observations.

This is one of the many problems with the phrase "big data" - no one defines how big "big" is.

I want to see benchmarks of truly big datasets - on the order of 4 billion or more "facts." Anyone know of anything like this?