Taking R to the Limit: Parallelism and Big Data

In a two-part series at the Los Angeles R User Group[*], Ryan Rosario took a look at the many ways you can take the R language to the limits of high-performance computing.

In Part I (see video at this link; slides and code also available), Ryan focuses on the various methods of parallel computing in R. There’s some great material here on explicit parallelism, especially if you’re looking to get into the nuts and bolts of the Rmpi package. Ryan also gives several examples of using the snow and snowfall packages for fine-grained parallel computing. If you don’t want to think too hard about the details of parallel programming, but just want to use the power of your hardware to speed up "embarrassingly parallel jobs", Ryan also covers implicit parallelism with the multicore package, and shows how to simplify things even further with foreach[**]. Part I wraps up with a brief look at high-performance computing with GPUs: computations can be very fast, but the tools available still aren’t very user-friendly. If you’re thinking about getting into parallel computing with R, Part I of Ryan’s talk gives a great overview of the possibilities available. It also includes some advice about when not to try parallel computing:

“Each iteration should execute computationally-intensive work. Scheduling tasks has overhead, and can exceed the time to complete the work itself for small jobs.”

This sage advice is worth taking to heart. My personal (but unscientific) rule of thumb is that it’s worth trying parallelism only when each iteration takes longer than the time it takes to get up and pour a cup of coffee. (Then again, the coffee pot is less than 5m from my desk.)

In Part II (video coming soon, slides and code available now), Ryan looks at the various tools available to break the constraints of R storing all data in memory, and perform analysis of very large data sets from within the R environment. Much of the presentation is focused on the bigmemory and ff packages, which use different techniques to store data on disk instead of in memory. In the former case, there’s an interesting example of combining both foreach and bigmemory to speed up processing of the airline delay data set, along with an example of doing linear regression on the data. (Revolution’s Joseph Rickert does a similar analysis using the forthcoming RevoScaleR package in this white paper, where the computation is automatically parallelized and runs somewhat faster. I’ll be talking more about RevoScaleR and showing a demonstration of this analysis in a webinar on Wednesday.) Ryan compares ff and bigmemory and finds that performance-wise they’re much the same, but does note an interesting aspect of ff: if you need to create extremely long vectors, it can help. The goal of ff is to get rid of the following message:

If you’ve been thinking about getting into MapReduce and/or Hadoop, Ryan has some great introductory materials beginning at slide 49. He gives several examples of using parallel programming tools to speed up map/reduce processing with the mapReduce package, and if you want to play with Hadoop but don’t program in Java, Ryan also shows how to use the HadoopStreaming package to drive Hadoop directly from R. If you want more power in controlling Hadoop, Ryan also touches briefly on the Rhipe package.