

February 26, 2015

Distcomp, a new R package available on GitHub from a group of Stanford researchers, has the potential to significantly advance the practice of collaborative computing with large data sets distributed over separate sites that may be unwilling to explicitly share data. The fundamental idea is to be able to rapidly set up a web service, based on Shiny and OpenCPU technology, that manages and performs a series of master/slave computations which require sharing only intermediate results. The particular target users for distcomp are groups of medical researchers who would like to fit a statistical model using the data from several sites, but who face daunting difficulties with data aggregation or are constrained by privacy concerns. Distcomp and its methodology, however, ought to be of interest to any organization with data spread across multiple heterogeneous database environments.

Setting up the distcomp environment requires some preliminary work and out-of-band communication among the collaborators. In the first step, the lead investigator uses a distcomp function to launch a browser-based Shiny application where she describes the location of her data set, the variables to be used in the computation, the model formula, and other metadata necessary to define the computation.

Next, the investigator invokes another distcomp function to move the metadata and a copy of the local data set to a computation server, which is assigned a unique identifier. Once the master server is in place, collaborating investigators at remote locations perform a similar process to set up slave computation servers at their sites. When the lead investigator receives the URLs pointing to the slave servers, she is ready to kick off the computation.

All of the details of this setup process are described in this paper by Narasimhan et al. The paper also describes two non-trivial computations, a distributed rank-k singular value decomposition and a distributed, stratified Cox model, that are of interest in their own right. The algorithm and code for the stratified Cox model ought to be useful to data scientists in a number of fields working on time-to-event models. A really nice feature of the algorithm is that it only requires each site to independently optimize the partial likelihood function using its local data. The master process uses the partial likelihood information from all of the sites to compute a final estimate of the coefficients and their variances.
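To make the flavor of this concrete, here is a toy sketch in R: independent sites fit local Cox models (using the survival package) and share only summary statistics, which a master pools by inverse-variance weighting. This is a simplification for illustration, not the distcomp algorithm itself, which works with the full partial likelihood:

library(survival)

set.seed(42)
# Simulate three "sites", each holding its own private survival data
sites <- lapply(1:3, function(i) {
  n <- 200
  x <- rnorm(n)
  data.frame(time = rexp(n, rate = exp(0.5 * x)),
             status = rbinom(n, 1, 0.8),
             x = x)
})

# Each site fits the model locally and shares only a summary:
# its coefficient estimate and variance (no individual-level data)
local_fits <- lapply(sites, function(d) {
  fit <- coxph(Surv(time, status) ~ x, data = d)
  c(beta = unname(coef(fit)), var = unname(vcov(fit)[1, 1]))
})

# The "master" pools the site summaries by inverse-variance weighting
w <- vapply(local_fits, function(f) 1 / f[["var"]], numeric(1))
b <- vapply(local_fits, function(f) f[["beta"]], numeric(1))
c(estimate = sum(w * b) / sum(w), se = sqrt(1 / sum(w)))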

There are several nice aspects to this work:

It builds on the cumulative work of the R community to provide a big league, big data application around open source R.

It provides a flexible paradigm for implementing distributed / parallel applications that leverages existing R algorithms (e.g. the Cox model makes use of code in the survival package).

It illustrates the ease with which R projects can be deployed as web services with Shiny and other R-centric software such as DeployR.

It provides an alternative to building out infrastructure and aggregating data before realizing the benefits of a big data computation. (Prototyping calculations with distcomp might also serve to justify the expense and effort of developing centralized infrastructure.)

It recognizes that privacy and other social concerns are important in big data applications and provides a model for respecting some of the social requirements for dealing with sensitive data.

Distcomp is new work and the developers acknowledge several limitations. (So far, they have only built out two algorithms and they don’t have a way to easily deal with factor data across the distributed data sets.) Nevertheless, the project appears to show great promise.

Tracking progress of parallel computing tasks

Parallel programming can help speed up the total completion time of your project. However, for tasks that take a long time to run, you may wish to track the progress of the task while it is running.

This seems like a simple request, but it turns out to be remarkably hard to achieve. The reason boils down to this:

Each parallel worker is running in a different session of R

In some parallel computing setups, the workers don’t communicate with the initiating process until the final combining step

So, if it is difficult to track progress directly, what can be done?

It seems to me the typical answers to this question fall into three different classes:

Use operating system monitoring tools, i.e. tools external to R.

Print messages to a file (or connection) in each worker, then read from this file, again outside of R

Use specialist back-ends that support this capability, e.g. the Redis database and the doRedis package

This is an area with many avenues of exploration, so I plan to briefly summarize each method and point to at least one question on StackOverflow that may help.

Method 1: Use external monitoring tools.

In his answer to this question, Dirk Eddelbuettel mentions that parallel back ends like MPI and PVM have job monitors, such as slurm and TORQUE. However, simpler tools like snow do not have monitoring facilities. In this case, you may be forced to use methods like printing diagnostic messages to a file.

For parallel jobs using the doParallel backend, you can use standard operating system monitoring tools to see if the job is running on multiple cores. For example, on Windows you can use the Task Manager: once the script starts, the CPU utilization pane shows each core going to maximum.
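To make that concrete, here is a minimal sketch of a CPU-bound doParallel job you could watch from the outside with such tools (the workload is made up purely for illustration):

library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

# While this loop runs, watch the per-core utilization in Task Manager
# (Windows) or top/htop (Linux / Mac)
res <- foreach(i = 1:100, .combine = c) %dopar% sum(rnorm(1e6))

stopCluster(cl)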

Method 2: Print messages to a file (or connection) in each worker, then read from this file, again outside of R

Sometimes it may be sufficient, or desirable, to print status messages from each of the workers. Simply adding a print() statement will not work, since the parallel workers do not share the standard output of the master job.

Steve Weston, the author of foreach (and one of the original founders of Revolution Analytics) wrote an excellent answer to this question.

Steve notes that output produced by the snow workers gets thrown away by default, but that you can use the "outfile" argument to makeCluster() to change that. Setting outfile to the empty string ("") prevents snow from redirecting the output, often resulting in the output from your print messages showing up on the terminal of the master process.

He suggests creating and registering your cluster with something like:

library(doSNOW)
cl <- makeCluster(4, outfile = "")
registerDoSNOW(cl)

He continues: Your foreach loop doesn't need to change at all. This works with both SOCK clusters and MPI clusters using Rmpi built with Open MPI. On Windows, you won't see any output if you're using Rgui. If you use Rterm.exe instead, you will. In addition to your own output, you'll see messages produced by snow which can also be useful.

Also note that this solution seems to work with doSNOW, but is not supported by the doParallel backend.
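Putting the pieces together, a complete sketch of this approach might look like the following; the cat() messages from each worker should show up on the master's terminal (use Rterm.exe rather than Rgui on Windows), and the task body is just a stand-in for real work:

library(doSNOW)

cl <- makeCluster(4, outfile = "")  # "": don't redirect worker output
registerDoSNOW(cl)

result <- foreach(i = 1:8, .combine = c) %dopar% {
  cat(sprintf("worker %d: starting task %d\n", Sys.getpid(), i))
  Sys.sleep(1)  # stand-in for real work
  i^2
}

stopCluster(cl)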

Method 3: Use specialist back-ends that support this capability, e.g. the Redis database and the doRedis package

Specifically, the R package rredis enables message passing between R and Redis, and the doRedis package allows you to use foreach with Redis as the parallel backend. What’s interesting about Redis is that the database lets you create queues from which each parallel worker fetches jobs. This allows for a dynamic network of workers, even across different machines.
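A minimal doRedis sketch (this assumes a Redis server is already running on localhost; the queue name "jobs" is arbitrary):

library(doRedis)

registerDoRedis("jobs")                   # use a Redis queue as the backend
startLocalWorkers(n = 2, queue = "jobs")  # workers can also join from other machines

result <- foreach(i = 1:10, .combine = c) %dopar% sqrt(i)

removeQueue("jobs")

Because the tasks sit in a named Redis queue, you can watch the queue drain from outside R, for example with the redis-cli command line tool, which gives you a crude but effective progress indicator.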

February 20, 2015

I had a most interesting exchange with an industry analyst firm recently, which suggested that application marketplaces are critical to the success of analytical tools, and that Revolution Analytics was remiss in not creating one.

I must say, I was taken aback somewhat. In considering the suggestion, I was left suspecting that the expressed enthusiasm was for marketplaces built to do for various commercial products what the R community already enjoys.

We responded that the R community already enjoys a richly-furnished “marketplace” for R extensions, algorithms, applications, adapters, techniques and educational assets. I’m speaking, of course, of CRAN and the 6000+ CRAN packages.

Is CRAN a marketplace? Perhaps the typical goals of a marketplace hold an answer:

Foster success among groups and individuals working to enhance the capability of a particular tool or solution;

CRAN enables collaboration on vertically-specific problems — one need only search on a single field such as genomics to see examples of communities of practice sharing their work via CRAN.

CRAN is big and growing vigorously. A couple of weeks ago, I surveyed new contributions to CRAN in the first 2½ weeks of 2015. The list included hundreds of new packages for genomics and life sciences, chemistry, natural resources, biology, ecology, forestry, agronomy, astronomy, drug research, healthcare delivery, finance, and government.

If R users struggle with CRAN anywhere, they do so perhaps as victims of CRAN’s huge success. With 6285 contributed packages as of this writing, CRAN can be unwieldy for new users. But a solution exists: new users often rely on the wisdom of the R community, as expressed through the “base” and “recommended” packages that are typically installed with R. These provide a rich but tractable “starter set” of frequently used techniques, algorithms, connectors and other methods.

Among the typical goals of a marketplace is one not targeted by CRAN: direct monetization. Reflecting its open source heritage, CRAN provides free distribution to end users, who, I suspect, generally find “free” to be a great advantage.

CRAN can no doubt be improved and extended. For example, CRAN carries only the latest versions of most packages, introducing potential repeatability problems when sharing scripts. That is why we built what we call the Managed R Archive Network (MRAN). MRAN extends CRAN and takes daily snapshots of it, addressing issues of reproducibility while continuing to expose the entirety of CRAN.

And we eat our own dog food: we use MRAN to distribute Revolution R Open, our new and accelerated version of open source R, with source code. We also distribute introductory material on R targeted at new users, and provide significantly improved search tools.

At Revolution Analytics, we do not believe the R community needs another marketplace for packages. We believe that CRAN amply fills the role, perhaps better than any vendor-specific application marketplace available today. With extensions such as MRAN and the Reproducible R Toolkit, it excels at its job.

One need only compare CRAN to any other marketplace and look at the number of packages contributed independently. CRAN is a shining example of the open source model operating as intended: it continues to foster a thriving community with shared interests that is building innovative additions to the R ecosystem, including not just components and solutions, but also educational assets, data assets, and many others.

February 19, 2015

For the past few years, the Strata + Hadoop World Conference in San Jose has kicked off my personal conference season. With its focus on data science, Strata always seems to present some interesting R-related talks, and I am looking forward to the various events over the next couple of days. But Strata, and other large conferences like the JSM, are just too big to easily find presentations that explicitly highlight R. So, if you would like to plan your conference season around R, the following list presents some R-themed conferences taking place mostly during the first half of this year.

The GDR Ecology Statistics group will be holding the GDR EcoStat meeting on the 12th and 13th of March in Lyon. The conference will be organized around ten themes, including evolutionary ecology and population dynamics and demography, “that transcend biological models”. The group lists a number of R packages on their website, and I expect that the conference will produce many R-related talks.

Several prominent R developers will attend the rOpenSci Unconf (an “unhackathon” this year), which will take place on the 26th and 27th of March in San Francisco. This is an invitation-only event; however, if you would like to attend, there is still time to nominate yourself.

The Fourth Annual Joint Conference of the Upstate Chapters of the American Statistical Association will take place in Geneseo, New York on April 10th and 11th. Professor Kosuke Imai from Princeton University, author of several R packages, will be the keynote speaker. The deadline for abstract submission is February 27th.

The Applied Statistics in Public Policy Evaluation workshop will be held on April 22nd through April 25th at the Universidad Santo Tomás in Bogotá. Hadley Wickham will be one of the keynote speakers. The call for papers ends on March 8th.

The UCLA DataFest, which will provide plenty of opportunity for students to do some serious R coding, will take place over the May 2nd weekend. The organizers are still looking for sponsors.

R/Finance 2015: Applied Finance with R will be held on May 29th and 30th at the University of Illinois at Chicago. Emanuel Derman (Models Behaving Badly and My Life as a Quant) will be one of the keynote speakers. R/Finance is my personal favorite. It is a relatively small, single track conference featuring high quality talks of varying length, very little fluff, and plenty of opportunities to network with some very knowledgeable R adepts.

The ASA Wisconsin Chapter Big Data Conference, which will be held in Milwaukee on June 5th, will feature Revolution Analytics’ David Smith as a keynote speaker. The website is not in place yet, but if you would like to know more please contact the Secretary/Treasurer of the Wisconsin Chapter of the ASA for more info: Elizabeth Smith <elsmith@mcw.edu>.

The 9th International Conference on Extreme Value Analysis will be held at the University of Michigan in Ann Arbor from June 15th to 19th. Stilian Stoev, one of the conference organizers, writes: “. . . much of the state-of-the-art software for statistics of extremes is written in R. We plan to organize a small hands-on workshop prior to the main conference, where students and practitioners will be introduced to the most popular R-packages for extreme value analysis. In fact, it is likely that the people who are currently maintaining these packages will be the instructors!” Abstracts are due on February 27th.

missData2015, the missing value conference, will be held at Agrocampus Ouest in Rennes on June 18th and 19th. Stef van Buuren, maintainer and author of mice, and Christophe Biernacki, an author of the Mixmod package, are among the invited speakers. Poster submission is open until April 1st. Follow the event on Twitter.

Quatrième Rencontres R (the fourth annual French R meeting) will be held in Grenoble on June 24th through 26th. The purpose of these yearly meetings is to provide “a place of exchange and sharing of ideas on the use of the software R in different disciplines (visualization, applied statistics, biostatistics and bioinformatics, Bayesian statistics, data analysis, Big Data, modeling, machine learning, high performance computing ...)”. The deadline for submissions is April 7th.

The 9th RMetrics Summer Workshop will be held from June 25th through June 27th at the Villa Hatt in Zurich. Topics this year will include vulnerabilities, structural stability, and the stress resistivity of financial markets.

R in Insurance 2015 will be held in Amsterdam on June 29th. "The central theme is how one can use R as a primary tool for insurance risk management, analysis and modelling." The submission deadline for abstracts is March 28th.

BioC2015, the annual conference of the Bioconductor project, will be held in Seattle on July 20th through 22nd. The conference website is just being built, so please check back later for details.

The biannual ALACIP Escuela (ALACIP School for Policy Analysis) conference, which will be held at the Pontifical Catholic University of Peru (PUCP) in Lima on July 21st through 24th, will feature several days of R workshops.

EARL 2015 (Effective use of the R Language) London will be held on September 14th through 16th. Abstract submission closes on March 31st. Revolution Analytics' Andrie de Vries will be speaking. Follow the event on Twitter.

EARL 2015 Boston will be held from November 2nd through 4th. David Smith will be among the speakers.

I am sure that I have not managed to generate a complete list, and I apologize if I have missed something important. But please do send me a note, and I will at least get the event listed in Revolution Analytics' Community Calendar.

February 18, 2015

During October 2014 we announced RRT (the Reproducible R Toolkit), which consists of the checkpoint package and MRAN. In January, David Smith followed up with another post about reproducibility using Revolution R Open. The latest update to the checkpoint package adds several new features:

Allow users to specify any folder location as the checkpoint library location. Previously, checkpoint always installed packages in the location ~/.checkpoint. This is still the default, but now you can change this, for example to store your checkpoint packages on a USB drive.

Add option to run checkpoint() without scanning for packages. This option answers the use case where you run code in a production environment, and you are already certain that all package dependencies are installed in the .checkpoint folder. In this special case, not scanning for packages leads to lower latency and better performance.

You can now specify that your checkpoint project depends on a specific version of R. (A sketch illustrating these new options follows this list.)

Removed dependency on the knitr package. If the knitr package is available on your machine, then checkpoint will scan all rmarkdown script files in the project for package dependencies.
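A minimal sketch showing these new options together; the argument names are my reading of the release notes above, and the snapshot date, path and R version are hypothetical:

library(checkpoint)

checkpoint("2015-02-18",
           checkpointLocation = "/media/usb", # install packages here, not ~/.checkpoint
           scanForPackages = FALSE,           # skip the dependency scan (production use)
           R.version = "3.1.2")               # declare a dependency on a specific R version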

Also, several enhancements:

Progress reporting while installing packages. If you happen to scan a project with many R scripts, the scanning process can take some time. The checkpoint function now provides a progress bar indicator.

Inform user when packages are found that don't exist in the MRAN snapshot.

Include direct namespace calls with :: or ::: in the scan for packages, for example package::foo() or package:::bar(). This is in addition to any occurrences of library() or require() statements.

Return diagnostic information from checkpoint(). Previously, checkpoint() always returned NULL. Now checkpoint() invisibly returns a list with diagnostic information, e.g. which packages were found during the scan process.

Improve messages when scanning project for packages. Rather than providing a cryptic message for each package, now checkpoint() prints a helpful message and lists all files that could not be scanned.

Improve handling of checking for knitr availability. For example, if checkpoint() finds an rmarkdown file, but knitr is not available, you get a helpful warning message with the name of the file that could not be scanned.

Added vignette with sample code.

Performance improvements:

checkpoint() now checks whether the required packages are already installed, and doesn't re-install them if so.

Finally, some bug fixes that were potentially annoying:

No longer displays warnings when installing base packages.

When encountering checkpoint() no longer throws an error.

You can easily download and install the latest version from GitHub as follows (with thanks to the devtools package):
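# The repository path below is an assumption about where the package lives
devtools::install_github("RevolutionAnalytics/checkpoint")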

February 10, 2015

Information about technology business ecosystems is valuable to established companies and startups alike. Fortunately, CrunchBase, the world’s most comprehensive dataset of startup activity, captures quite a bit of such information. Founded in 2007 by Mike Arrington, CrunchBase began as a simple crowd-sourced database to track startups covered on TechCrunch. Today, you’ll find about 650K profiles of people and companies, maintained by tens of thousands of contributors. Venture capital firms have willingly shared this information so that others could benefit. It's also accessible to everyone as an API, and to researchers as a downloadable workbook.

rcrunchbase is an R client to the CrunchBase API developed by Tarak Shah of UC Berkeley. It has several helpful functions that aim to create a compositional query flow. As much as possible, complex queries can be built up from simple requests. The intent is to have rcrunchbase handle the messy stuff while you focus on getting the data you want.

As an example, let us explore the relationships between companies through their founding teams. The 'PayPal Mafia', for example, is a group of former PayPal employees and founders who have since founded and developed additional technology companies such as Tesla Motors, LinkedIn, Palantir Technologies, SpaceX, YouTube, Yelp, and Yammer. You can read about the PayPal Mafia on Wikipedia and in the San Jose Mercury News.

Let's find out more about the Paypal Mafia from CrunchBase. To get started you will first need to sign up to get an API key for CrunchBase access, and then install the package with the command:

devtools::install_github("tarakc02/rcrunchbase")

The following code lists the current and past PayPal team members who are in CrunchBase.

library(rcrunchbase)
library(magrittr)

# Start by looking up the node details of a company
pp <- crunchbase_get_details("organization/paypal")
ls.str(pp[[1]])
names(pp[[1]])

# Get the paths to the collections corresponding to the company's
# "current team" and "past team"
str(pp[[1]]$relationships)
crunchbase_expand_section(pp, c("current_team", "past_team"))

These functions can be combined in diverse ways, resulting in a much richer and more expressive approach to the API. To take full advantage of their compositional nature, it’s useful to have a “piping” operator (such as magrittr's %>%, loaded above) to pass the results of one function to the next. For example, one could find the list of companies that PayPal's current and past teams have invested in. A sketch of such a pipeline, using only the functions shown above (the full chain from the original post is not reproduced here):
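library(rcrunchbase)
library(magrittr)

crunchbase_get_details("organization/paypal") %>%
  crunchbase_expand_section(c("current_team", "past_team"))
# ... from here, one would follow each team member's profile in the same
# piped style to collect the organizations they founded or invested in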

The result is a list of over 300 companies! That's a huge impact by the PayPal Mafia.

The CrunchBase database is a graph database, and gathering data through the API can be time-consuming. CrunchBase also publishes an xls workbook with information on companies, funding rounds, and acquisitions (available to academics and CrunchBase venture partners). More about that in another blog post.

What interesting questions would you ask of such rich data on the startup ecosystem? Please comment below...

February 04, 2015

Hadley Wickham's testthat package has been a boon for R package authors, making it easy to write tests that verify your code is working correctly, and alerting you when changes to your code inadvertently break things.

For the RHadoop project, though, developer Antonio Piccolboni needed a different testing framework, one that allows tests with random input values for functions. The Haskell language has a "QuickCheck" package that does this, so Antonio wrote a similar package for R, also called quickcheck. The key function is called (naturally) test, and the post shows an example of it in action, testing a user-defined function called "identity".
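The original snippet isn't reproduced here, but a sketch of what such a test might look like, assuming test() accepts a function whose default arguments are quickcheck generators (rany() is the generator named below):

library(quickcheck)

# Property: identity() returns its input unchanged, checked across
# many randomly generated inputs
identity <- function(x) x
test(function(x = rany()) identical(identity(x), x))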

The rany function generates the random inputs passed to the identity function, and in this case the test is run 100 times with fresh random values. Quickcheck supports generating inputs of various R data types (double, character, etc.) and can even generate a mixture of R object types to test functions that support inputs of multiple types. And when errors are detected, the repro function will tell you exactly what inputs generated the error so you can track it down.

Mathieu used the R language and OpenStreetMap data to construct the image, which colorizes each street according to the compass direction it points. Orthogonal streets are colored the same, so regular grids appear as swathes of uniform color. A planned city like Chicago would appear as a largely monochrome grid, but Paris exhibits much more variation. (You can see many other cities in this DataPointed.net article.) As this article in the French edition of Slate explains, the very history of Paris itself is encapsulated in the colored segments. You can easily spot Napoleon's planned boulevards as they cut through the older medieval neighborhoods, and agglomerated villages like Montmartre appear as rainbow-hued nuggets.

January 12, 2015

The team at RStudio have just released an update to the immensely useful dplyr package, making it even more powerful for manipulating data frames in R. The new 0.4.0 version adds new "verbs" to the syntax for mutating joins (left join, right join, etc.), filtering joins, and set operations (intersection and union). There's also some new documentation to help you get started with dplyr, including a vignette on using data frames with dplyr and a printable cheatsheet on data wrangling with dplyr and tidyr. Check out all the updates at the RStudio blog post linked below.
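As a quick, made-up illustration of the difference between a mutating join and a filtering join (note that some of these verbs existed before 0.4.0):

library(dplyr)

band <- data_frame(name = c("Mick", "John", "Paul"),
                   band = c("Stones", "Beatles", "Beatles"))
plays <- data_frame(name = c("John", "Paul", "Keith"),
                    plays = c("guitar", "bass", "guitar"))

left_join(band, plays, by = "name")  # mutating join: adds the plays column
semi_join(band, plays, by = "name")  # filtering join: keeps only matching rows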