Daily news about using open source R for big data analysis, predictive modeling, data science, and visualization since 2008


September 21, 2016

Hadley Wickham, co-author (with Garrett Grolemund) of R for Data Science and RStudio's Chief Scientist, has focused much of his R package development on the un-sexy but critically important part of the data science process: data management. In the Tidy Tools Manifesto, he proposes four basic principles for any computer interface for handling data:

The tidyverse also loads purrr, for functional programming with data, and ggplot2, for data visualization using the grammar of graphics.

Installing the tidyverse package also installs for you (but doesn't automatically load) a raft of other packages to help you work with dates/time, strings, factors (with the new forcats package), and statistical models. It also provides various packages for connecting to remote data sources and data file formats.

Simply put, tidyverse puts a complete suite of modern data-handling tools into your R session, and provides an essential toolbox for any data scientist using R. (Also, it's a lot easier to simply add library(tidyverse) to the top of your script rather than the dozen or so library(...) calls previously required!) Hadley regularly updates these packages, and you can easily update them in your R installation using the provided tidyverse_update() function.
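In practice, that means a script header like this (a minimal sketch, assuming the package has been installed from CRAN):

```r
# Install once from CRAN, then attach the core packages with a single call
install.packages("tidyverse")
library(tidyverse)   # attaches ggplot2, dplyr, tidyr, readr, purrr, tibble

# Check for (and install) newer versions of the tidyverse packages
tidyverse_update()
```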

For more on tidyverse, check out Hadley's post on the RStudio blog, linked below.

September 07, 2016

Take a satellite image, and extract the pixels into a uniform 3-D color space. Then run a clustering algorithm on those pixels to extract a number of clusters. The centroids of those clusters then make a representative palette of the image. Here's the palette of Chicago:

The palette of Chicago

The R package earthtones by Will Cornwell, Mitch Lyons, and Nick Murray — now available on CRAN — does all this for you. Pass the get_earthtones function a latitude and longitude, and it will grab the Google Earth tile at the requested zoom level (8 works well for cities) and generate a palette with the desired number of colors. This Shiny app by Homer Strong uses the earthtones package to make the process even easier: it grabs your current location for the first palette, or you can pass in an address and it geolocates it for another. That's what I used to create the image above. (Another Shiny app by Andrew Clark shows the size of the clusters as a bar chart, but I prefer the simple palettes.) There are a few more examples below, and you can see more in the earthtones vignette. If you find more interesting palettes, let us know where in the world you found them in the comments.
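A minimal call looks something like this (the coordinates below are approximately downtown Chicago, and the argument names are those used in the package's documentation; adjust zoom and the number of colors to taste):

```r
library(earthtones)

# Fetch the Google Earth tile for downtown Chicago and cluster its
# pixels into a five-color palette
get_earthtones(latitude = 41.88, longitude = -87.63,
               zoom = 8, number_of_colors = 5)
```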

(Here's how to create that chart in R.) But note the scale at the bottom of the chart, mapping measles cases to a color on the rainbow. Here, we'll zoom in on it:

The scale you choose for a heat map is very important, and has a major impact on how the viewer will interpret the data presented. This scale has been chosen with care: while most of the scale is red, very few of the data cells are red (because the distribution of measles cases is skewed, thanks in particular to the introduction of a vaccine in 1963). A naively chosen scale would wash out the data.

The actual colors you choose are important too. The physics, technology, and neuroscience behind the interpretation of colors is surprisingly complex, but this talk on the default color schemes used in Python's matplotlib does a great job of explaining:

You can easily use the viridis color scales in R as well, thanks to the viridis package by Simon Garnier, which is available on CRAN. The package provides four heatmap color schemes (viridis, magma, plasma, and inferno), all carefully chosen for optimized perception and usefulness for color-impaired viewers.

You can find several examples of using the viridis color palettes in the package vignette, both for base R graphics (including raster) and ggplot2. To get started, just install.packages("viridis") to install the package from CRAN.
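Here's a quick sketch of both approaches, using data sets that ship with R and ggplot2:

```r
library(viridis)

# Base R: the classic volcano heatmap, with a perceptually uniform scale
image(volcano, col = viridis(200))

# ggplot2: map a continuous fill aesthetic to the viridis scale
library(ggplot2)
ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster() +
  scale_fill_viridis()
```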

August 31, 2016

You download the data and complete your analysis with ample time to spare. Then, just before deadline, your collaborator lets you know that they've "fixed a data error". Now, you have to do your analysis all over again. This is the reproducibility horror story:

But while knitr solves a good chunk[*] of the reproducibility problem, there's one complicating factor it doesn't deal with: updated R packages. In the same way that a collaborator updating the data triggers a restart, someone updating an R package your script uses can also affect your results. (That someone was likely you, working on a different R project.) The checkpoint package for R solves that problem by letting you "lock in" the package versions you use with a project. It's easy to use: all you need to do is add a line like checkpoint("2016-08-31") to the beginning of your script, which:

Downloads all the packages used by your project (those mentioned in files in the current folder), as they were on August 31, 2016

Installs them in a folder specific to this project (so they're independent from other R projects), and

Makes sure R uses those package versions when you run your script
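A typical script header looks like this (the library() calls below are hypothetical examples of whatever your analysis actually uses):

```r
library(checkpoint)

# Scan this project's scripts for library()/require() calls and pin
# CRAN to its state on this date; packages install into a
# project-specific folder
checkpoint("2016-08-31")

# From here on, these resolve to the package versions available
# on CRAN as of August 31, 2016
library(dplyr)
library(ggplot2)
```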

It does some clever things to avoid re-downloading packages when it doesn't need to, and to avoid keeping multiple copies of the same package version, but that's the basic gist. Checkpoint also makes it really easy to share code with others, because you can be confident they'll also get the packages they need to make your script work. You can learn more about the checkpoint package here and in this vignette, and just install it from CRAN to get started. (If you use Microsoft R Open you don't even need to download it: it's already included.)

August 17, 2016

R has some good tools for importing data from spreadsheets, among them the readxl package for Excel and the googlesheets package for Google Sheets. But these only work well when the data in the spreadsheet are arranged as a rectangular table, and not overly encumbered with formatting or generated with formulas. As Jenny Bryan pointed out in her recent talk at the useR!2016 conference (embedded below, or download PDF slides here), in practice few spreadsheets have "a clean little rectangle of data in the upper-left corner", because most people use spreadsheets not just as a file format for data retrieval, but also as a reporting/visualization/analysis tool.
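When the data do form a clean rectangle, importing really is a one-liner (the workbook and sheet names below are hypothetical):

```r
library(readxl)

# Read the first sheet of a well-formed workbook into a data frame
sales <- read_excel("sales.xlsx")

# Or target a specific sheet by name
sales_q2 <- read_excel("sales.xlsx", sheet = "Q2")
```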

Nonetheless, for a practicing data scientist, there's a lot of useful data locked up in these messy spreadsheets that needs to be imported into R before we can begin analysis. As just one example given by Jenny in her talk, this spreadsheet was included as one of 15,000 spreadsheet attachments (one with 175 tabs!) in the Enron Corpus.

To make it easier to import data into R from messy spreadsheets like this, Jenny and co-author Richard G. FitzJohn created the jailbreakr package. The package is in its early stages, but it can already import Excel (xlsx format) and Google Sheets into R as new "linen" objects, from which small sub-tables can easily be extracted as data frames. It can also print spreadsheets in a condensed text-based format with one character per cell — useful if you're trying to figure out why an apparently simple spreadsheet isn't importing as you expect. (Check out the "weekend getaway winner" story near the end of Jenny's talk for a great example.)

The jailbreakr package isn't yet on CRAN, but if you want to try it out you can download it from the Github repository (or even contribute!) at the link below.
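Since it isn't on CRAN, installation goes through devtools; the repository path below is an assumption, so verify it against the link in the post:

```r
# install.packages("devtools")  # if not already installed
devtools::install_github("rsheets/jailbreakr")
```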

August 16, 2016

In 2014, Illinois passed into law the creation of a medical cannabis pilot program. As my son has cancer and marijuana could greatly help with some of his symptoms, we eagerly applied for a card when registration was available early in 2015. The first dispensaries were not available until November 2015. At that time there were 9 dispensaries; the PDF file with a table of dispensary names and locations provided by the Illinois Department of Health was an adequate way to find a dispensary.

In the time that dispensaries have been available, my son has been in various hospitals and facilities in and around the city of Chicago. First we were in Park Ridge, then Hyde Park, then Hinsdale, then downtown Chicago and now finally back home in Oak Park. As we moved around the city, I would use that same PDF file to locate the dispensary closest to me. The list has grown from 9 names and addresses to 40 today. With 40 entries, the PDF table format is not at all useful for showing where the dispensaries are located. The entries are listed in the order of the license issue date, making it all the more difficult to see which dispensaries might be easiest for me to visit.

So one weekend I decided to create a map of all the current locations. Keeping in mind that more dispensaries will be available in the future, I wanted to create code that would read the official list of registered dispensaries, so that updates would be easy as more entries were added.

I knew I could read the text of the file in R using pdftools, and could put the locations onto a google map using googleVis. The hardest part of the code was trying to filter out the noise included in the text and reliably get the name, address, and phone number of each dispensary into a data frame. A few handy gsub statements worked their magic and I was left with data ready for mapping.
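A sketch of that extraction step, assuming a hypothetical file name and that the unwanted header lines contain a predictable phrase:

```r
library(pdftools)

# Pull the raw text out of the PDF, one element per page
pages <- pdf_text("dispensaries.pdf")

# Split pages into individual lines, then drop blanks and page headers
lines <- unlist(strsplit(pages, "\n"))
lines <- lines[!grepl("^\\s*$", lines)]               # blank lines
lines <- lines[!grepl("Illinois Department", lines)]  # repeated headers

# Collapse runs of whitespace into a delimiter so the name, address,
# and phone fields can be split apart reliably
fields <- gsub("\\s{2,}", "|", trimws(lines))
```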

I added in some geocoding to get the longitude and latitude, thanks to this tip.
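That step uses ggmap's geocode() function, which queries the Google geocoding API (the address below is hypothetical):

```r
library(ggmap)

# Look up longitude and latitude for a dispensary address;
# returns a data frame with lon and lat columns
coords <- geocode("127 W North Ave, Chicago, IL")
```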

Finally, after the data manipulation, the code to produce the map itself is rather straightforward:
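In outline, it comes down to a single gvisMap() call; the data frame and its column names below are assumptions standing in for the cleaned-up dispensary data:

```r
library(googleVis)

# gvisMap wants a single "lat:long" location column and a tooltip column
dispensaries <- data.frame(
  LatLong = c("41.885:-87.624", "41.794:-87.590"),
  Tip     = c("Dispensary A", "Dispensary B")
)

map <- gvisMap(dispensaries, locationvar = "LatLong", tipvar = "Tip",
               options = list(showTip = TRUE, mapType = "normal"))
# plot(map)  # opens the map in a browser
```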

August 11, 2016

Data Science is all about getting access to interesting data, and it is really nice when some kind soul not only points out an interesting data set but also makes it easy for you to access it. Below is a list of 17 R packages that appeared on CRAN between May 1st and August 8th that, in one way or another, provide access to publicly available data.

dataone: The dataone R package enables R scripts to search, download and upload science data and metadata from/to the DataONE Federation. The website describes DataOne as "a community driven project providing access to data across multiple member repositories, supporting enhanced search and discovery of Earth and environmental data". The package comes with several vignettes including this overview.

dataRetrieval: Package to retrieve USGS and EPA hydrologic and water quality data, officially supported by USGS. The vignette gives several examples of downloading interesting data sets.

eechidna: Provides the data from the 2013 Australian Federal Election and tools to analyze it. There are several nicely done vignettes. The following plot, which shows election results by polling place, comes from the vignette on plotting polling stations.

getHFdata: Provides functions to download and aggregate high frequency trading data for Brazilian instruments directly from the Bovespa ftp site. There is a vignette to get you started.

osi: Provides a connector to the Open Source Initiative API that provides machine-readable data about open source software licenses.

pewdata: Provides for reproducible, programmatic retrieval of survey data sets from the Pew Research Center. The vignette shows how to set up and use the package. Look here for an interesting poll about what Americans know about science.

[Update: added the dataRetrieval package, at the suggestion of Laura DeCicco.]

Editor's note: This is Joe's last post to Revolutions as a member of the Microsoft team: he is heading on for further adventures in the world of R. We want to thank Joe for his many contributions to the blog over the past 6 years, and please join us in wishing him well!

August 04, 2016

My guess is that a good many statistics students first encounter the bivariate Normal distribution as one or two hastily covered pages in an introductory textbook, and then don't think much about it again until someone asks them to generate two random variables with a given correlation structure. Fortunately for R users, a little searching on the internet will turn up several nice tutorials with R code explaining various aspects of the bivariate Normal. For this post, I have gathered together a few examples and tweaked the code a little to make comparisons easier.

Here are five different ways to simulate random samples from a bivariate Normal distribution with a given mean and covariance matrix.

To set up for the simulations, this first block of code defines N, the number of random samples to simulate, the means of the random variables, and the covariance matrix. It also provides a small function for drawing confidence ellipses on the simulated data.
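To give the flavor of what follows, here is one of the five approaches, using mvrnorm() from the MASS package (the mean vector and covariance matrix below are illustrative values, not the ones from the post; any positive-definite covariance matrix works):

```r
library(MASS)

set.seed(42)
N     <- 1000                       # number of samples to simulate
mu    <- c(1, 2)                    # means of the two variables
sigma <- matrix(c(4, 2,             # covariance matrix: variances 4 and 3,
                  2, 3), ncol = 2)  # covariance 2

# Draw N samples from the bivariate Normal; result is an N x 2 matrix
samples <- mvrnorm(n = N, mu = mu, Sigma = sigma)

colMeans(samples)   # close to (1, 2)
cov(samples)        # close to sigma
```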

July 28, 2016

My impression is that the JSM has become ever more R-friendly over recent years. With two sessions organized around R tools and several talks featuring R packages, this year may turn out to be the beginning of a new era, one where conference organizers see value in putting R on the agenda and prospective speakers perceive it to be advantageous to mention R, an R package, or a Shiny app in their abstract.

As should be expected, the vast majority of the presentations will focus on statistics or the application of statistical methods, and not on the underlying computational platform. Nevertheless, based on past experience I would be very surprised if there is not quite a bit more R talk buzzing around the conference.

If you are going to Chicago, please stop by the Microsoft booth, 232. We would be happy to tell you how we are using R at Microsoft, and even more interested in hearing your opinion about what Microsoft should be doing with R. Also look for us at the opening night mixer (Sunday 6 - 8PM in the Expo Hall) and the Student Mixer (Monday 6 - 7:30PM in the Chicago Hilton Hotel).

Here follows my R Users Guide to JSM 2016. I have organized the talks by session number and included information on times and room numbers.