Daily news about using open source R for big data analysis, predictive modeling, data science, and visualization since 2008

April 07, 2016

Book Review: Graphical Data Analysis with R

by Joseph Rickert

Basically, there are two kinds of graphics or plots you can make from a data set: (1) those that allow you to see what is going on with the data, and (2) those you make to communicate what you have found to someone else. When making the first kind, you want to select plots that will enable you to see as much as possible while taking great care not to fool yourself. With the second kind, you ought to select plots that make the features of the data you want to communicate seem obvious. They should be focused on the story you are trying to tell, be free from clutter and have an impact on your target audience.

Antony Unwin’s Graphical Data Analysis with R (CRC Press 2015) is a very good read that thoroughly discusses the process and principles behind plots of the first kind while offering considerable guidance about producing those of the second kind. In 14 chapters that extend to nearly 300 pages, Unwin makes superb use of the R language to develop the principles of Graphical Data Analysis (GDA) while demonstrating the interplay of plot making and basic statistical inference that together make for a comprehensive, exploratory analysis of a data set.

The Preface and Chapter 1 set the scene by defining the scope of GDA, illustrating some of its basic principles and offering a new metaphor that ought to replace the tired and misleading idea that good graphics let you “drill down” into the data. This mechanical notion assumes you know where to drill and possess a clear idea of what you are looking for. In stark contrast, Unwin offers the metaphor of a photographer who takes many photographs of an object from multiple angles and in different lighting conditions in order to “grasp a whole object”. Throughout the book Unwin hammers home his central idea that many graphics of a dataset should be drawn “maybe even a large number of them, where each contributes something to the overall picture”.

Chapter 2 presents a fairly comprehensive review of the relevant literature for GDA, acknowledging the contributions of Cleveland, Cook, Murrell, Wickham and many others and pointing out software, websites, texts and other resources that a student should find helpful. Because, as the author explains, “There is no complex theory about graphics”, it is the practice of experts and successful exemplars that comprise the foundations of the subject. I found this sketch of GDA’s background helpful in coming to see GDA as an emergent discipline in its own right.

Chapters 3 through 7 discuss graphics for continuous variables, categorical data, looking for structure and associations, multivariate continuous data and multivariate categorical data. These chapters comprise the core of the text, describing most of the basic plots available in the R arsenal through numerous examples with multiple data sets. Chapter 6 on Investigating Multivariate Continuous Data provides as thorough a discussion of parallel coordinate plots as you are likely to find anywhere.

Not only does Unwin describe the logic and technique of the plots in these core chapters but he is careful to provide thoughtful and measured interpretations. Consider plots 5.5 and 5.6 and their captions, the second and third plots in a series of three describing the geyser data set.

FIGURE 5.5: The same scatter plot with bivariate density contours. There is evidence of three concentrations of data, two univariate outliers (one eruption with low duration and one with a high waiting time until the next eruption), and one bivariate outlier.

FIGURE 5.6: Another version of the scatterplot, but now with highest density regions based on a bivariate density estimate. There is less evidence of three data concentrations than in the previous plot and there is a slightly different set of possible outliers.

Of the many possible examples from the book, I have selected these plots to illustrate Unwin’s style because they show how a simple scatter plot enhanced with sophisticated, “built in” mathematical tools can illuminate the complexities of the data. They also serve to illustrate the value of Unwin’s injunction that a good graphic ought to be accompanied by a thought-through caption and highlight the ephemeral boundary between seeing and interpreting.
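For readers who want to try something similar, here is a minimal sketch of a geyser scatter plot overlaid with bivariate density contours. It is not Unwin's code. It assumes the ggplot2 and MASS packages are installed and that the geyser data in question is the one shipped with MASS; the axis labels are my own.

```r
# A rough sketch, not the book's code: geyser scatter plot with
# contours of a two-dimensional kernel density estimate.
# Assumption: the data set is MASS::geyser (columns: waiting, duration).
library(ggplot2)

data(geyser, package = "MASS")

p <- ggplot(geyser, aes(x = duration, y = waiting)) +
  geom_point() +
  geom_density_2d() +  # contour lines from a 2D kernel density estimate
  labs(x = "duration (minutes)", y = "waiting (minutes)")

print(p)
```

Changing the bandwidth or swapping the contour layer for a different density display gives a different view of the same points, which is exactly the kind of variation the two captions describe.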

Chapter 11 – Graphics for Time Series provides some good advice and proposes a few alternative suggestions for plotting time series data, including the use of parallel coordinate plots and calendar plots (see the openair package). It is unusual to find any extended discussion on plotting time series data. At the very least, this chapter summarizes a good amount of common-sense wisdom.

Chapter 13 – Some Notes on Graphics with R provides a short discussion of R’s graphic systems along with tips on R coding for graphics and offers some suggestions for dealing with large data sets.

Chapter 14 provides a very brief summary of the text and a short assessment of the strengths and weaknesses of GDA itself.

The text has some nice pedagogical features that make it appealing for self-study. Every chapter is preceded by a short summary of what the chapter is about, ends with a list of the main points covered and offers some exercises that extend the material covered in the text, often pointing to additional R packages. I particularly enjoyed Unwin's brief historical references that accompany examples drawn from the HistData package, like the context surrounding the Charge of the Light Brigade.

Graphical Data Analysis with R will certainly be valuable to anyone wanting to create better graphics in R. It is sufficiently rich in well-coded ggplot2 examples that it will serve as a good reference even after the basic principles have been assimilated. But, in my view, the book has more to offer than examples and ought to be read more closely, and more widely. Although the examples primarily make use of the ggplot2 system, the principles that they illustrate are much more general, making them useful to anyone with some experience using any robust statistical plotting system including base R graphics, the lattice package, Python, Matlab and others. Also, as I mentioned above, the text provides a fairly complete discussion of exploratory data analysis and could easily be used as the basis for a short course in this subject.

Moreover, Unwin’s careful treatment of the plots he draws, his use of multiple data sets, and his ongoing discussion of the interplay between graphical analysis and statistical inference make Graphical Data Analysis with R a suitable text for an alternative first course in statistics. To my mind, a future scientist or intelligent consumer of plots and statistical information would be better served by a modern computational approach to working with data than by most standard introductory courses in statistics which, in the end, amount to little more than a futile attempt to explain what a p-value isn’t.

Who can say how any individual text will be read and valued in these days when e-books dilute the integrity of any printed texts by merging them into the rushing stream of online content? Even so, I think that many readers will find that Graphical Data Analysis with R illuminates a shadowy corner of statistical analysis and stands a good chance of becoming recognized as a foundational text for GDA.