Tag: R parallel computation

Hadley Wickham has just announced the release of a new R package, "reshape2", which is (as Hadley wrote) "a reboot of the reshape package". Alongside it, Hadley announced the release of plyr 1.2.1 (now faster and with support for parallel computation!).
Both releases are exciting thanks to the significant speed increases they bring.

Yet in the case of the new plyr package, an even more interesting addition is the new parallel processing backend.

Reminder: what the `plyr` package is all about

plyr is a set of tools for a common set of problems: you need to __split__ up a big data structure into homogeneous pieces, __apply__ a function to each piece and then __combine__ all the results back together. For example, you might want to:

fit the same model to each patient subset of a data frame

quickly calculate summary statistics for each group

perform group-wise transformations like scaling or standardising
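As a quick illustration (mine, not from the plyr documentation), here is a sketch of the group-wise summary case using ddply; the data frame and column names are made up:

```r
library(plyr)

# made-up data: two groups of measurements
df <- data.frame(group = rep(c("a", "b"), each = 5),
                 value = c(1:5, 6:10))

# split df by group, apply the summary, and combine the results
res <- ddply(df, .(group), summarise,
             mean_value = mean(value),
             n          = length(value))
res
```

The same result could be assembled by hand with split() and sapply(), but ddply keeps the grouping labels attached and returns a tidy data frame in one call.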

It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

totally consistent names, arguments and outputs

convenient parallelisation through the foreach package

input from and output to data.frames, matrices and lists

progress bars to keep track of long running operations

built-in error recovery, and informative error messages

labels that are maintained across all transformations

Considerable effort has been put into making plyr fast and memory efficient, and in many cases plyr is as fast as, or faster than, the built-in functions.

What’s new in `plyr` (1.2.1)

The exciting news in the new plyr release is the added support for parallel processing.

l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that, when TRUE, applies functions in parallel using a parallel backend registered with the foreach package.
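To make the new argument concrete, here is a sketch (mine, not Hadley's) of a parallel ddply call, assuming the doMC backend on unix/linux; the data frame is made up, and without a registered backend the call simply falls back to serial execution with a warning:

```r
library(plyr)
library(doMC)    # unix/linux backend; Windows users need doSMP, discussed below
registerDoMC(2)  # register 2 workers

# made-up data: four groups of values
df <- data.frame(g = rep(letters[1:4], each = 25), y = rnorm(100))

# the usual split-apply-combine call, now run across the registered workers
res <- ddply(df, .(g), summarise, m = mean(y), .parallel = TRUE)
```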

The new version also includes some minor changes and bug fixes, all of which can be read about here.

In the original announcement, Hadley gave an example of using the new parallel backend with the doMC package for unix/linux. For Windows (the OS I'm using) you should use the doSMP package (as David mentioned in his post earlier today). However, this package is currently only released for "REvolution R" and not yet for R 2.11 (see more about it here). But thanks to the kind help of Tao Shi, there is a solution for Windows users who want a parallel processing backend for plyr.

Recently, the REvolution blog announced the release of doSMP, an R package which offers support for symmetric multicore processing (SMP) on Windows.
This means you can now speed up loops in R by running iterations in parallel on a multi-core or multi-processor machine, giving Windows users what was until recently available only to Linux/Mac users through the doMC package.

Installation

For now, doSMP is not available on CRAN, so in order to get it you will need to download the REvolution R distribution "R Community 3.2" (they will ask you to supply your e-mail, but I trust REvolution won't do anything too bad with it…)
If you already have R installed and want to keep using it (rather than the REvolution distribution, as was the case with me), you can navigate to the library folder inside the REvolution installation and copy all the package folders from there to the library folder of your own R installation.
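As an alternative to copying folders (my own suggestion, not from REvolution's instructions), you can point your existing R installation at the REvolution library folder with .libPaths(); the path below is hypothetical and must be adjusted to, and actually exist on, your machine:

```r
# hypothetical path to the REvolution "R Community 3.2" library folder
revo_lib <- "C:/Revolution/R-Community-3.2/library"

# prepend it to the library search path
# (note: .libPaths() silently drops paths that do not exist on disk)
.libPaths(c(revo_lib, .libPaths()))

# packages installed there (doSMP, foreach, ...) can then be loaded as usual
.libPaths()
```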

If you are using R 2.11.0, you will also need to download (and install) the revoIPC package from here: revoIPC package – download link (required for running doSMP on Windows).
(Thanks to Tao Shi for making this available!)

Usage

Once you have the folders in place, you can load the packages and do something like this:

require(doSMP)
workers <- startWorkers(2)  # my computer has 2 cores
registerDoSMP(workers)

# create a function to run in each iteration of the loop
check <- function(n) {
  for (i in 1:1000) {
    sme <- matrix(rnorm(100), 10, 10)
    solve(sme)
  }
}

times <- 10  # number of times to run the loop

# comparing the running time of each loop
system.time(x <- foreach(j = 1:times) %dopar% check(j))  # 2.56 seconds
# (notice that the first run will be slower, because of R's lazy loading)
system.time(for (j in 1:times) x <- check(j))            # 4.82 seconds

# stop the workers
stopWorkers(workers)
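Putting the pieces together: this last sketch (mine, not from the original posts) combines the doSMP backend with plyr's new .parallel argument, which is the point of this whole setup; the data frame is made up:

```r
require(doSMP)
require(plyr)

workers <- startWorkers(2)
registerDoSMP(workers)

# made-up data: four groups of values
df <- data.frame(g = rep(1:4, each = 25), y = rnorm(100))

# group-wise summary, with the groups spread across the registered workers
res <- ddply(df, .(g), summarise, m = mean(y), .parallel = TRUE)

stopWorkers(workers)
```

As with the foreach loop above, the parallel version only pays off when the per-group work is substantial; for a toy summary like this, the serial call will be faster.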

Points to notice:

You will only benefit from the parallelism if the body of the loop performs time-consuming operations; otherwise, R's serial loops will be faster.

Notice that on the first run, the foreach loop could be slow because of R’s lazy loading of functions.

I am using startWorkers(2) because my computer has two cores; if your computer has more (for example, 4), use a higher number.

Lastly – if you want more usage examples, look at the "ParallelR Lite User's Guide", included with the REvolution R Community 3.2 installation in the "doc" folder.

Updates

(15.5.10): The new R version (2.11.0) doesn't work with doSMP, and will return the following error:

So far no solution has been found, other than using the REvolution R distribution or R 2.10.
A thread on the subject was started recently to report the problem. Updates will be posted here if someone comes up with a better solution.

Thanks to Tao Shi, there is now a solution to the problem. You'll need to download the revoIPC package from here: revoIPC package – download link (required for running doSMP on Windows).
Install the package on your R distribution and follow all of the other steps detailed earlier in this post. It will then work fine on R 2.11.0.

Update 2: Notice that I added, at the beginning of the post, a download link to all the packages required for running a parallel foreach with R 2.11.0 on Windows (that is, until they are uploaded to CRAN).