Daily news about using open source R for big data analysis, predictive modeling, data science, and visualization since 2008

February 2018

February 21, 2018

Modern machine learning platforms like Tensorflow have to date been used mainly by the computer science crowd, for applications like computer vision and language understanding. But as JJ Allaire pointed out in his keynote at the RStudio conference earlier this month (embedded below), there's a wealth of applications in the data science domain that have yet to be widely explored using these techniques. This includes things like time series forecasting, logistic regression, latent variable models, and censored data analysis (including survival analysis and failure data analysis).

The keras package for R provides a flexible, high-level interface for specifying machine learning models. (RStudio also provides some nice features when using the package, including a dynamically-updated convergence chart to show progress.) Networks defined with keras are flexible enough to specify models for data science applications, which can then be optimized using frameworks like Tensorflow (as opposed to traditional maximum-likelihood techniques), without limitations on data set size and with the ability to take advantage of modern computational hardware.
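As a sketch of what that high-level interface looks like, here is a logistic regression expressed as a one-layer network in keras. (This assumes the keras package and a TensorFlow backend are installed; the layer sizes are illustrative only.)

```r
library(keras)

# Logistic regression as a neural network: a single dense unit with a
# sigmoid activation, trained by gradient descent rather than by
# maximum likelihood as glm() would do
model <- keras_model_sequential() %>%
  layer_dense(units = 1, activation = "sigmoid", input_shape = 10)

model %>% compile(
  optimizer = "sgd",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)
```

The same pattern scales up: adding layers to the pipeline yields the deeper architectures used for the other applications mentioned above.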

For learning materials, RStudio's Tensorflow Gallery provides a good place to get started with several worked examples using real-world data. The book Deep Learning with R (Chollet and Allaire) provides even more worked examples translated from the original Python. If you want to dive into the mathematical underpinnings, the book Deep Learning (Goodfellow et al) provides the details there.

February 16, 2018

You probably saw Boston Dynamics' robots achieve another milestone this week: not only can one of their robots open and pass through a door, it will cooperate and politely hold the door as a fellow robot passes through:

February 15, 2018

While ggplot2 (and its various extensions) is often the go-to package for graphics in R these days, if you need to step outside the boundaries of what ggplot2 can do, you can always step back to base R graphics (and the built-in lattice package) and customize to your heart's content.

The problem is that (unlike for ggplot2) the default look for base graphics is kinda ... meh. That being said, the base graphics system offers almost unlimited flexibility, both via function options and via the par system for modifying layouts and graphic defaults. As Colin Gillespie explains in a recent blog post, you can take a scatterplot that looks, by default, like this:
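To give a flavor of the kind of customization involved, here is a base-R sketch that adjusts margins, tick marks, and point styling via par() and plot() options. (The specific settings are illustrative, not taken from the original post.)

```r
# Spruce up a default base-R scatterplot using par() settings
pdf(file.path(tempdir(), "scatter.pdf"))  # write to a file-based device
par(mar = c(4, 4, 1, 1),   # tighter plot margins
    mgp = c(2.2, 0.5, 0),  # pull axis titles and labels closer to the axes
    tcl = -0.3,            # shorter tick marks
    las = 1)               # horizontal axis labels
plot(mpg ~ wt, data = mtcars,
     pch = 19, col = rgb(0.2, 0.4, 0.8, 0.6),  # filled, semi-transparent points
     bty = "l",                                # L-shaped box instead of full frame
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
grid(col = "grey85", lty = 1)
dev.off()
```

Because par() settings persist for the device, one block of settings can restyle every subsequent plot in a script.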

February 13, 2018

I was genuinely chuffed to get a shout-out in the most recent episode of Not So Standard Deviations, the awesome statistics-and-R themed podcast hosted by Hilary Parker and Roger Peng. In that episode, Roger recounts his recent discovery of the Microsoft ecosystem of tools for R, which he (jokingly) dubbed the "Microsoft-verse".

While we're flattered by the allusion to the tidyverse, in general Microsoft's developments with R are designed to work with the entire R ecosystem rather than be distinct from it. Here's a quick overview of what Microsoft has developed around R. It's in three sections: the first two don't require any special version of R, and only the third section requires a Microsoft-specific R distribution.

Thanks again to Hilary and Roger for another entertaining episode of NSS Deviations and for giving me the impetus to write all of this down. (This started as an email, but I quickly realized it was getting too long and became this blog post instead.) If you have any questions or feedback, let me know in the comments section of this post.

R available from within Microsoft products

You can call R from within some data-oriented Microsoft products, and apply R functions (from base R, from packages, or R functions you've written) to the data they contain.

Open source R tools and packages from Microsoft

Microsoft provides various open source tools to help people use R. This includes R packages published on CRAN and on GitHub.

Microsoft R Open, Microsoft's distribution of open-source R. The only differences from CRAN R are that it comes bundled with the Intel Math Kernel Library (which makes vector and matrix operations faster on multi-core machines), and that it uses a static mirror of CRAN so packages don't change from day to day (but you can always use the "checkpoint" package described below to get the latest-and-greatest, if you want).
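The kind of workload that benefits from a multi-threaded BLAS like MKL is plain matrix arithmetic; the R code itself is unchanged, only the linear algebra library underneath differs. A quick base-R timing sketch (the speedup you see depends on which BLAS your R is linked against):

```r
# Cross-product of a large matrix: t(m) %*% m, delegated to the BLAS.
# With a multi-threaded BLAS (such as MKL) this uses all available cores.
set.seed(42)
m <- matrix(rnorm(1000 * 1000), nrow = 1000)
timing <- system.time(xtx <- crossprod(m))
print(timing)
dim(xtx)
```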

MRAN, which is the download repository for Microsoft R Open, and also hosts daily archive snapshots of the entire CRAN system (from 2015 to the present). These snapshots are used for reproducibility by Microsoft R Open, the checkpoint package (see below), and anyone who wants a non-changing CRAN image. (The Rocker docker images are configured to use these static snapshots, for example.)

The checkpoint package, which provides a simple interface to those static CRAN snapshots, for reproducibility. (In short: add checkpoint("2018-02-13") to make R install and use packages from that date for your project, now and in the future, including when you share scripts with someone else.)

The foreach and iterators packages, for parallel programming. Microsoft also provides "backends" that run the iterations of a foreach loop on different parallel computing systems, like doParallel (use a local machine or cluster) and doAzureParallel (spin up a cluster in Azure and run the parallel iterations there).
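The appeal of foreach is that the loop itself doesn't change when you swap backends. A minimal sketch with a local doParallel backend (assuming both packages are installed; registering doAzureParallel instead would send the same loop to Azure):

```r
library(foreach)
library(doParallel)

# Register a local backend with 2 worker processes
cl <- makeCluster(2)
registerDoParallel(cl)

# %dopar% distributes the iterations across the registered workers
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2
stopCluster(cl)

squares  # 1 4 9 16
```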

Microsoft has implemented a suite of algorithms for statistics and machine learning. These either serve as replacements for existing R functions or packages, or add new capabilities. They are designed for performance and to work without data size limitations. (In general, these algorithms are closed-source and only available within Microsoft R products, and not on CRAN or Github.)

The RevoScaleR package provides new implementations of some of R's statistical functions (for example, rxGlm is the equivalent of R's glm) that are designed to work with data sizes much larger than available memory. It also uses parallel computing to speed things up when running on a multi-core server, in a Hadoop or Spark cluster, or in a SQL Server database.
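To see the correspondence, here is an ordinary glm() fit on a built-in dataset, with the equivalent RevoScaleR call shown as a comment (rxGlm only runs inside Microsoft R products, so it isn't executed here; the formula interface mirrors glm):

```r
# Base R: logistic regression on the built-in infert dataset
fit <- glm(case ~ age + spontaneous, data = infert, family = binomial())
coef(fit)

# RevoScaleR equivalent, for out-of-memory data (requires Microsoft R):
# fit_rx <- rxGlm(case ~ age + spontaneous, data = infert, family = binomial())
```

For in-memory data the results agree; the rx functions earn their keep when the data no longer fits in RAM.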

These packages can optionally make use of the XDF file format, a binary on-disk data format designed for performance and parallel processing. Within R, XDF objects behave much like data frames, and you can also apply tidyverse data manipulation functions with the dplyrXdf package.

The mrsdeploy package allows you to publish custom R scripts and functions (including ones using the packages above) to a server as an API that can be called from other applications.

February 12, 2018

One of the greatest things about the R community is its diversity. This is largely thanks to organizations like Forwards and R-Ladies, who have been instrumental in welcoming women and other under-represented groups to the world of R. Likewise, conferences in the R community encourage diversity, with open codes of conduct, facilitations like on-site child-care, and by offering scholarships for travel and lodging to encourage attendees from diverse backgrounds.

Here are three upcoming R community events that are offering diversity scholarships:

The rOpenSci Unconference (May 21-22, Seattle) is now accepting self-nominations to participate. You can request travel support as part of the application process, and the organizers "strongly encourage applications from women and other underrepresented genders, people of color, people who are LGBTQ, people with disabilities or any other underrepresented minorities in research". Applications are open now.

The core booster did not survive its sea landing, but the core mission was a success: Starman and the Roadster are now in a solar orbit that carries them out between Mars and the asteroid belt, and they will likely continue on that path for billions of years.

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task, taking up around 80% of the time spent. DataExplorer is one package whose sole mission is to minimize that 80%, and to make the process enjoyable. As a result, one fundamental design principle is to be extremely user-friendly: most of the time, a single function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible enough with input data classes that you should be able to throw in any data.frame-like object. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover in a later part of the post.

Enough said; let's look at some code, shall we?

Take the BostonHousing dataset from the mlbench library:

library(mlbench)
data("BostonHousing", package = "mlbench")

Initial Visualization

Without knowing anything about the data, my first 3 tasks are almost always:

library(DataExplorer)
plot_missing(BostonHousing) ## Are there missing values, and what is the missing data profile?
plot_bar(BostonHousing) ## What does the categorical frequency for each discrete variable look like?
plot_histogram(BostonHousing) ## What is the distribution of each continuous variable?

While there are not many interesting insights from plot_missing and plot_bar, below is the output from plot_histogram.

February 06, 2018

The Data Science Virtual Machine was featured on a recent episode of the AI Show with Seth Juarez and Gopi Kumar. If you want a quick and easy way to spin up a virtual machine with all of the data science tools you'll ever need — including R and RStudio — already installed and ready to go, this video explains what the Data Science Virtual Machine is used for and (at 21:00) how to launch one in the Azure portal.