Learn more about using open source R for big data analysis, predictive modeling, data science and more from the staff of Revolution Analytics.

statistics

February 26, 2015

Distcomp, a new R package available on GitHub from a group of Stanford researchers, has the potential to significantly advance the practice of collaborative computing with large data sets distributed over separate sites that may be unwilling to explicitly share data. The fundamental idea is to be able to rapidly set up a web service based on Shiny and OpenCPU technology that manages and performs a series of master / slave computations which require sharing only intermediate results. The particular target application for distcomp is any group of medical researchers who would like to fit a statistical model using the data from several data sets, but face daunting difficulties with data aggregation or are constrained by privacy concerns. Distcomp and its methodology, however, ought to be of interest to any organization with data spread across multiple heterogeneous database environments.

Setting up the distcomp environment requires some preliminary work and out-of-band communication among the collaborators. In the first step, the lead investigator uses a distcomp function to invoke a browser-based Shiny application to describe the location of her data set, the variables to be used in the computation, the model formula and other metadata necessary to describe the computation.

Next, the investigator invokes another distcomp function to move the metadata and a copy of the local data set to a computation server with a unique identifier. Once the master server is in place, collaborating investigators at remote locations perform a similar process to set up slave computation servers at their sites. When the lead investigator receives the URLs pointing to the slave servers, she is ready to kick off the computation.

All of the details of this setup process are described in this paper by Narasimhan et al. The paper also describes two non-trivial computations, a distributed rank-k singular value decomposition and a distributed, stratified Cox model, that are of interest in their own right. The algorithm and code for the stratified Cox model ought to be useful to data scientists in a number of fields working on time-to-event models. A really nice feature of the algorithm is that it only requires each site to independently optimize the partial likelihood function using its local data. The master process uses the partial likelihood information from all of the sites to compute a final estimate of the coefficients and their variances.
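To make the "share only intermediate results" idea concrete, here is a minimal base R sketch. It does not use distcomp's functions, and the protocol is an assumption: each site is modelled as a closure that evaluates its local log partial likelihood at a coefficient proposed by the master, and the master maximizes the sum, which is the stratified Cox log partial likelihood with one stratum per site. The helper names (`make_site`, `make_site_loglik`, `total_loglik`) and the simulated data are hypothetical.

```r
# Toy sketch (not the distcomp API): a stratified Cox fit in which each site
# exposes only its local log partial likelihood, never its raw data.
set.seed(1)

# Hypothetical helper: simulate one site's data with a single covariate x.
make_site <- function(n, beta, baseline_rate) {
  x <- rnorm(n)
  time <- rexp(n, rate = baseline_rate * exp(beta * x))  # event times
  censor <- rexp(n, rate = baseline_rate / 2)             # censoring times
  data.frame(x = x,
             time = pmin(time, censor),
             status = as.integer(time <= censor))
}

# Each site wraps its private data in a closure that returns the local
# log partial likelihood at a proposed coefficient (continuous times, no ties).
make_site_loglik <- function(d) {
  d <- d[order(d$time), ]
  function(beta) {
    eta <- beta * d$x
    # Risk set at the i-th ordered time is rows i..n, hence the reversed cumsum.
    log_risk <- log(rev(cumsum(rev(exp(eta)))))
    sum(d$status * (eta - log_risk))
  }
}

true_beta <- 0.7
sites <- list(make_site(200, true_beta, 0.5),
              make_site(150, true_beta, 2.0),   # different baseline hazard
              make_site(300, true_beta, 1.0))
site_logliks <- lapply(sites, make_site_loglik)

# Master: the stratified log partial likelihood is the sum over sites.
total_loglik <- function(beta) {
  sum(vapply(site_logliks, function(f) f(beta), numeric(1)))
}

fit <- optimize(total_loglik, interval = c(-5, 5), maximum = TRUE)
beta_hat <- fit$maximum
beta_hat
```

Only scalar likelihood values cross the site boundary here. A fuller version would also return gradients and second derivatives so that the master could run Newton steps and report variances.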

There are several nice aspects to this work:

It builds on the cumulative work of the R community to provide a big league, big data application around open source R.

It provides a flexible paradigm for implementing distributed / parallel applications that leverages existing R algorithms (e.g. the Cox model makes use of code in the survival package).

It illustrates the ease with which R projects can be deployed in web services applications with Shiny and other R-centric software such as DeployR.

It provides an alternative to building out infrastructure and aggregating data before realizing the benefits of a big data computation. (Prototyping calculations with distcomp might also serve to justify the expense and effort of developing centralized infrastructure.)

It recognizes that privacy and other social concerns are important in big data applications and provides a model for respecting some of the social requirements for dealing with sensitive data.

Distcomp is new work and the developers acknowledge several limitations. (So far, they have only built out two algorithms and they don’t have a way to easily deal with factor data across the distributed data sets.) Nevertheless, the project appears to show great promise.

Last year, the ASA began working with the newly formed Sense About Science USA to re-launch STATS.org, a statistics informational and resource hub for journalists and anyone interested in how numbers shape science and society. The project will help reporters gain access to the statistical, data-driven perspective of stories on which they are working.

Through STATS.org, journalists are connected with statisticians who are experts on specific topics to provide them understandable statistical advice and explanation. This connection is critical to helping raise the understanding of the media on the statistical issues in their stories. For the first time many reporters will have access to a statistical expert who can help them interpret a scientific study and convey that meaning to the public.

In addition, ASA member statisticians provide background information and the statistical science perspective on timely news stories, and STATS.org writers, who write about quantitative concepts in a very readable, easily understandable style, produce articles. These articles are featured on the STATS.org website for reporters and others interested in statistical matters to read and learn about statistics.

“Unfortunately, to make her argument, the author confuses several different aspects of confidence, evidence, belief, and decision-making. The purpose here is to point out the confusion and clarify the statistical issues. This article is not about climate change; it’s about statistics. Oreskes’ mistaken interpretation of these statistical ideas do not imply that climate change is under question; the evidence for climate change consists of mechanistic as well as statistical arguments, and has little to do with the topic under discussion here: a misinterpretation of what is called the p-value,” wrote Lavine.

You, too, can help ASA and STATS.org. The next time you read a news article that misstates or misinterprets statistical data or concepts, take a minute to forward a copy of it to the ASA or STATS.org. Also, if you see a key statistical concept that is consistently misreported in the media, let us know about it as well.

February 13, 2015

The World Cup of Cricket starts this week. (C'mon Aussie!) Cricket isn't well-known amongst many of my American friends or colleagues, so when I'm asked about it I usually point them to this video, which gives a good sense of the game:

Actually, this Vox article and this ESPN video do a much better job of describing the game. One thing the ESPN video doesn't mention (besides not listing all 11 ways to be out) is the possibility of a draw. In test match cricket, it's entirely possible for a match lasting five days to end without a winner. The reason is that a test match lasts four innings (each team gets to bat twice), and is also limited to five days. If time runs out before all four innings are complete, and the trailing team is still at bat, the game is declared a draw. (The idea is that the trailing team might have caught up if only the game could continue.) The strategy for the trailing team, if they don't think they can achieve outright victory, is to instead play for time and go for the draw. (The winner of a test series is the team with the most wins over five five-day test matches.)

Playing for five days without a definitive outcome can try the patience of the modern sporting fan, so one-day cricket was born. Here, there are just two innings per game: each team is given a fixed number of overs (sets of six balls) with which to score, after which the innings is automatically over and the other team has an opportunity to bat. As the name suggests, the game is over in a day and one team or the other will be declared the winner (unless there is an exact tie in scores).

There's an interesting statistical angle here, which is related to interruptions in the game. Let's say we're halfway through the second innings and Australia is at bat with 142 runs to England's 204. Normally, Australia would need to score 63 more runs (the "target") to win. Now suppose it starts to rain, and the game is suspended for an hour. To keep the game from running long, Australia will be given fewer overs to bat, and their target will be reduced as well. But the target isn't reduced in exact proportion to the overs removed, to reflect the fact that more runs are generally scored in the latter part of an innings. The exact calculation is based on statistical analysis of cricket games, and is a great example of censored data analysis. (The basic idea is to be able to forecast what the final score would have been in games that are interrupted.) The calculation is known as the Duckworth-Lewis method, named after the two British statisticians who devised it. People often talk about statistics and baseball in the same breath, but this is the only example I can think of where statistical modeling is such an important part of a sport. (If you can think of others, let me know in the comments!)
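To see why a resource-based target differs from a simple pro-rata reduction, here is a toy sketch in R. It is not the official Duckworth-Lewis table: the resource curve, its parameter `b`, the 10 overs lost, and the decision to ignore wickets are all hypothetical choices made for illustration. The only idea borrowed from the method, as I understand it, is that the first-innings score is scaled by the share of run-scoring resources the chasing side actually gets. "Target" here means the total Australia must reach.

```r
# Toy illustration only: NOT the official Duckworth-Lewis resource table.
# Hypothetical resource curve: share of a 50-over innings' run-scoring
# resources still available with `overs_left` overs to go (wickets ignored).
resource_left <- function(overs_left, b = 0.04, full = 50) {
  (1 - exp(-b * overs_left)) / (1 - exp(-b * full))
}

first_innings_score <- 204   # England's total from the example
overs_left_at_stop  <- 25    # rain arrives halfway through the chase
overs_lost          <- 10    # hypothetical length of the interruption

# Resources the chasing side loses to the rain.
lost <- resource_left(overs_left_at_stop) -
  resource_left(overs_left_at_stop - overs_lost)
resources_available <- 1 - lost

# Revised target: scale the first-innings score by available resources.
revised_target <- floor(first_innings_score * resources_available) + 1

# Naive alternative: scale by the share of overs actually available.
proportional_target <- floor(first_innings_score * (50 - overs_lost) / 50) + 1

c(original = first_innings_score + 1,
  revised = revised_target,
  proportional = proportional_target)
```

The real method also conditions on wickets in hand, which this sketch leaves out entirely.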

Well, that's all for this week — I'm off to watch the cricket! See you back here on Monday.

January 26, 2015

From the "statistician humour" department, today's xkcd cartoon will ring a bell for anyone who's ever published (or read!) a scientific article including a P-value for a statistical test:

If finding P-value excuses is a common activity for you (and let's hope not!) then R has you covered with the Significantly Improved Significance Test. This R code from Rasmus Bååth will automatically annotate your P-values between 0.05 and 0.12 with excuses like "suggestive of statistical significance", "weakly non-significant" or "quasi-significant". Bonus points for links in the comments to real journal articles that actually use these excuses!
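For a flavour of what such an annotator might look like, here is a small sketch in base R. It is not Rasmus Bååth's code: the function name `annotate_p` and the way the interval from 0.05 to 0.12 is split into equal bands are assumptions made purely for illustration, using the three excuses quoted above.

```r
# Hypothetical helper, not the original author's code: attach a
# tongue-in-cheek excuse to any P-value in [0.05, 0.12).
excuses <- c("suggestive of statistical significance",
             "weakly non-significant",
             "quasi-significant")

annotate_p <- function(p) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1))
  # Split [0.05, 0.12) into equal-width bands, one per excuse.
  band <- cut(p, breaks = seq(0.05, 0.12, length.out = length(excuses) + 1),
              labels = excuses, right = FALSE)
  label <- as.character(band)
  label[p < 0.05] <- "significant"
  label[p >= 0.12] <- "not significant"
  sprintf("p = %.3f (%s)", p, label)
}

annotate_p(c(0.03, 0.06, 0.09, 0.11, 0.40))
```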

January 20, 2015

From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Master of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.

This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on GitHub.

As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.

Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as

plot(0:20, choose(n=20, k=0:20))

Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.
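A slightly longer sketch of the same exploration: dividing the binomial coefficients by 2^n turns them into fair-coin binomial probabilities, which can then be compared with the normal density that has the same mean and standard deviation. Only base R functions are used; the variable names are mine.

```r
n <- 20
k <- 0:n

# choose() is vectorised over k, so one call yields all 21 coefficients.
counts <- choose(n, k)

# Dividing by 2^n turns the counts into fair-coin binomial probabilities.
probs <- counts / 2^n
stopifnot(isTRUE(all.equal(probs, dbinom(k, size = n, prob = 0.5))))

# Overlay the normal density with the same mean and standard deviation.
plot(k, probs, type = "h", xlab = "k", ylab = "probability")
curve(dnorm(x, mean = n / 2, sd = sqrt(n) / 2), add = TRUE, lty = 2)
```

The spikes are what Galton's quincunx produces physically, and the dashed curve is the bell shape they approach as n grows.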

Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.

Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the GitHub repository are welcome, as are observations or ruminations on the process of learning data science and R.

Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through community@revolutionanalytics.com if you are interested in working with us. Stay tuned for progress reports.

January 08, 2015

KatRisk, a Berkeley-based catastrophe modeling company specializing in wind and flood risk, has put three R- and Shiny-powered interactive demos on their website. Together these provide a nice introduction to the practical aspects of weather-based risk modeling and give a good indication of the kinds of data that are important. Two of the models, the US & Caribbean Hurricane Model and the Asia Typhoon Model, provide a tremendous amount of information, but they require a little bit of background knowledge to understand the data required to drive them and the computed loss statistics.

The Flood Data Lookup Model, however, can really hit home for anybody. Just bring up the model, type in the address of the location of interest and press the red "Geocode" button to get the associated longitude and latitude. Then click on the "Get Data" button. The resulting information will give you an idea of the level of risk for the property and let you know what a 100-year flood and a 500-year flood would look like. Next, switch to the "Flood Map" tab and press the "Get Map" button to see some of the information overlaid on a Google map.
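As a reminder of what those return periods mean, here is a short worked calculation in R. It uses the conventional definition (a T-year flood has annual exceedance probability 1/T) and assumes independent years. The post does not describe how KatRisk defines or computes its figures, so treat this as background arithmetic rather than a description of their model. The function name and the 30-year horizon are hypothetical.

```r
# A "T-year flood" is conventionally the level with annual exceedance
# probability 1 / T. Assuming independent years, the chance of seeing at
# least one such flood over a given horizon is 1 - (1 - 1/T)^years.
prob_at_least_one <- function(return_period, years) {
  1 - (1 - 1 / return_period)^years
}

horizon <- 30  # hypothetical horizon, e.g. the length of a mortgage
round(c(flood_100yr = prob_at_least_one(100, horizon),
        flood_500yr = prob_at_least_one(500, horizon)), 3)
```

Over 30 years the chance of at least one 100-year flood comes out at roughly one in four, which is why the label can be misleading.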

Not being able to resist the opportunity to have Google Maps google Google, I thought it would be interesting to see how bad things could get at the Googleplex.

Uh oh! The Googleplex gets a pretty high KatRisk score. A 100-year flood would put the place under 7 feet of water!

Not to worry though: Google has already completed their first round of feasibility tests for a navy. (Nobody does long range planning like Google.)

The KatRisk models are based on R code that makes heavy use of data.table for fast table lookups of the risk results. As the company says on their website:

KatRisk has developed a suite of analytic tools to make it easy to access our data and models. We use open source software tools including R Shiny for our web applications. By using R shiny we can develop on-line products that can also easily be deployed to a client site. Our software is completely open, so if you decide to host our analytical tools you will be able to see all of the details in easy to understand and modify R code.

For some details on the underlying analytics, have a look at this previous post that was based on a talk Dag Lohmann gave to the Bay Area useR Group last year.

So, go ahead and compute your KatRisk score, but please do be mindful of the company's request not to run the model for more than 3 locations in one day.

January 06, 2015

I love creating spatial data visualizations in R. With the ggmap package, I can easily download satellite imagery which serves as a base layer for the data I want to represent. In the code below, I show you how to visualize sampled soil attributes among 16 different rice fields in Uruguay.

First, I download data that is used in "Spatial Data Analysis in Ecology and Agriculture using R" by Dr. Richard Plant. (This is an excellent book to get your feet wet working with spatial data in R.) After the data has been downloaded, I create a function that builds a custom soil attribute plot for each unique field found in the rice yield data. Then, I customize the output to include larger spatial points and a custom gradient that goes from dark orange to dark purple for clarity.

Finally, once all the plots are generated, I arrange them into a single plot.

The plot shows the pH of the soil in 16 fields belonging to 9 different farmers. The second-to-last plot, field 15 of farmer L, appears to have higher pH values than the rest.

January 02, 2015

In a recent post, where I presented some R-related highlights of November's H2O World conference, I singled out and described talks by Trevor Hastie and John Chambers and remarked that it would be nice if the videos were made available. Well, thanks to the generosity of the folks at H2O, I got my wish.

Here is the video of Professor Hastie's talk.

This video represents a master class on machine learning where in 40 minutes or so Professor Hastie conducts a tour that starts with basic decision trees and goes all the way to building learning ensembles with the Lasso. Along the way, he presents the salient ideas on bagging, random forests and boosting. The treatment of boosting is succinct and elegant, covering some remarkable features of the family of boosting algorithms. For example, Professor Hastie describes how training error in AdaBoost can reach zero and stay there but testing error can continue to improve, how superior performance can be achieved with boosting algorithms by using only tree stumps, and how the stagewise additive modeling "slows down the rate of overfitting". The really deep insight comes in the discussion about viewing AdaBoost as an algorithm that fits additive logistic regression models with an exponential loss function. This, in turn, leads to a discussion of Jerome Friedman's Gradient Boosting Machine and more general boosting algorithms that can accommodate multiple kinds of loss functions. These are the models implemented in R's gbm package.
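For readers who want to see the stagewise idea in code, here is a minimal AdaBoost with decision stumps in base R. It is an illustrative sketch, not code from the talk and not the gbm package's implementation. The simulated data, the function names, and the choice of 50 rounds are all hypothetical. The stump weight and the reweighting step follow the exponential-loss view of AdaBoost mentioned above.

```r
# Minimal AdaBoost with decision stumps on one feature; labels are -1 / +1.
set.seed(42)
n <- 200
x <- runif(n, -3, 3)
y <- ifelse(sin(2 * x) > 0, 1, -1)   # a target no single stump can fit

# Best weighted stump: predict `sign` when x > threshold, `-sign` otherwise.
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (thr in sort(unique(x))) {
    for (s in c(-1, 1)) {
      pred <- ifelse(x > thr, s, -s)
      err <- sum(w[pred != y])
      if (err < best$err) best <- list(thr = thr, sign = s, err = err)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x > stump$thr, stump$sign, -stump$sign)
}

adaboost <- function(x, y, rounds = 50) {
  w <- rep(1 / length(y), length(y))
  stumps <- vector("list", rounds)
  alpha <- numeric(rounds)
  for (m in seq_len(rounds)) {
    stump <- fit_stump(x, y, w)
    err <- min(max(stump$err, 1e-10), 1 - 1e-10)   # guard against log(0)
    alpha[m] <- 0.5 * log((1 - err) / err)
    # Exponential-loss reweighting: misclassified points gain weight.
    w <- w * exp(-alpha[m] * y * predict_stump(stump, x))
    w <- w / sum(w)
    stumps[[m]] <- stump
  }
  list(stumps = stumps, alpha = alpha)
}

# Classify with the sign of the weighted vote of the first `rounds` stumps.
predict_adaboost <- function(model, x, rounds = length(model$alpha)) {
  score <- rep(0, length(x))
  for (m in seq_len(rounds)) {
    score <- score + model$alpha[m] * predict_stump(model$stumps[[m]], x)
  }
  ifelse(score >= 0, 1, -1)
}

model <- adaboost(x, y)
train_error <- function(rounds) mean(predict_adaboost(model, x, rounds) != y)
c(one_stump = train_error(1), fifty_stumps = train_error(50))
```

Each stump alone is a weak classifier on this target, yet the weighted vote can represent the alternating pattern, which is the point about stumps made in the talk.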

I think this video of John Chambers reminiscing about his time at Bell Labs working with John Tukey is destined to become an important part of the historical record for Statistics. There are many remembrances of Tukey to be found online, but I don't know of any other visual record by someone of John Chambers' stature who interacted with Tukey as a colleague and professional statistician.

In just a few minutes, Chambers paints a balanced and revealing portrait that humanizes and captures some of the complexity of this icon of modern statistics. I especially like the story in the Q & A portion of the talk where John describes Tukey's propensity for "mischief" and his delight in inventing new words (like boxplot "hinges") that rankled many of his statistician colleagues, but apparently particularly upset the British statisticians.

December 19, 2014

Johns Hopkins Biostatistics Professor (and presenter of Data Analysis at Coursera) Jeff Leek has published his list of awesome things other people did in 2014. It's well worth following the links in his 38 entries, where you'll find a wealth of useful resources in teaching, statistics, data science, and data visualization.

December 15, 2014

Visualizing complex survey data is something of an art. If the data have been collected and aggregated to geographic units (say, counties or states), a choropleth is one option. But if the data aren't so neatly arranged, making visual sense often requires some form of smoothing to represent them on a map.

R, of course, has a number of features and packages to help you, not least the survey package and the various mapping tools. Swmap (short for "survey-weighted maps") is a collection of R scripts that visualize some public data sets, for example this cartogram of transportation share of household spending based on data from the 2012-2013 Consumer Expenditure Survey.