Book review: Interactive Graphics for Data Analysis

I am happy to provide the following review of this interesting book by Martin and Simon, who are readers of Junk Charts. Martin also publishes a blog, and he's the one who has created bumps charts for the Tour de France races (which also appear in the book).

Interactive Graphics for Data Analysis is an advanced book written by two researchers who have deep experience developing graphics software. People who like to go beyond the basics will find it a useful addition to the literature.

To give you an idea of the level of sophistication, just in Chapter 1 (titled Interactivity), the two authors utilize set operations, SQL statements, and parallel coordinate plots. They assume you have some sense of what those are. That said, those sections can be skipped without interrupting the flow of the book.

The following key messages from these authors are worth repeating:

There is a distinction between statistical graphics and data graphics. Underlying trends and patterns in the data are often made clear by performing statistical analyses on the data, with the results added to charts (e.g. loess lines). When dealing with very large data sets, statistical charts (such as box plots) are found to be much more scalable, precisely because they do not attempt to put every data point onto the page.
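To make the scalability point concrete, here is a minimal Python sketch (mine, not the authors'). It assumes simulated normal data and a hypothetical helper named five_number_summary. A basic box plot draws the same five numbers whether the data set has a hundred rows or a hundred thousand.

```python
import random
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a basic box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
small = [rng.gauss(0, 1) for _ in range(100)]       # simulated data (assumption)
large = [rng.gauss(0, 1) for _ in range(100_000)]   # simulated data (assumption)

summary_small = five_number_summary(small)
summary_large = five_number_summary(large)

# Both summaries have five entries, however many points went in.
print(len(summary_small), len(summary_large))
```

A scatter plot of the larger sample would need 100,000 marks. The box plot still needs five numbers (plus any outliers one chooses to draw, which this sketch ignores).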

The authors stress the need to look at a variety of charts when doing exploratory data analysis. This is because most chart types do certain things well but not others.

Throughout the book, they make much hay of the problem of "over-plotting", that is, overlapping data. This happens when data is abundant, or when values are concentrated in a narrow range. A great illustration of this problem is the parallel coordinates plot, which can look entirely different depending on which lines are plotted on top of which other lines. (The charts on the right are identical except for the order in which the lines are plotted.) Common strategies for coping with over-plotting include "jittering" and varying transparency. Many of these strategies have issues of their own.
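For readers who have not met jittering, here is a rough Python sketch of the idea. It is my own illustration, not code from the book, and the jitter function, the ratings data and the amount of 0.2 are all made up. A small random offset is added to each value so that tied points no longer sit exactly on top of each other.

```python
import random


def jitter(values, amount, seed=0):
    """Shift each value by a random offset drawn from [-amount, +amount]."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical survey ratings with many ties, so plotted points would overlap.
ratings = [1, 1, 1, 2, 2, 3, 3, 3, 3]
jittered = jitter(ratings, amount=0.2)

distinct_before = len(set(ratings))   # 3 distinct plotting positions
distinct_after = len(set(jittered))   # ties are broken after jittering
print(distinct_before, distinct_after)
```

The sketch also hints at one of the "issues of their own": the plotted positions are no longer the true values, and a different seed gives a different picture.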

They also point out that the look of many multivariate charts (such as mosaic charts) depends on the sorting of the data. This is a key weakness of many such plots. Just think about this the next time you create a stacked column chart.

The book is divided into two sections: Principles and Examples. The second half, the Examples section, consists of case studies in which the authors show how to investigate the structure of a given data set.

The example of using the fatty-acid contents of Italian olive oils to deduce their regional origin is a good visualization of how the statistical technique of classification trees works. Here is the telling diagram:

Notice that data with the same color are oils from the same region, the rectangular sections are results of the statistical classification procedure, and we would like to see most (if not all) of the data within each section having the same color.
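The "same color within each section" criterion can be put into numbers. The Python sketch below is my own illustration with made-up section names and region labels, not the book's data or method. For each rectangular section, it computes the share of points that belong to the most common region.

```python
from collections import Counter


def purity(labels):
    """Share of points in a section that belong to its most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical sections produced by a classification procedure; each list
# holds the (made-up) region label of every oil that fell into that section.
sections = {
    "section A": ["South", "South", "South", "North"],
    "section B": ["North", "North", "North", "North"],
    "section C": ["Sardinia", "Sardinia", "North"],
}

purities = {name: purity(labels) for name, labels in sections.items()}
for name, value in purities.items():
    print(f"{name}: {value:.2f}")
```

A purity of 1.0 corresponds to a section showing a single color. Lower values mean the section mixes oils from several regions.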

***

Without a doubt, graphics designers should be aware of the issues raised by these authors. The book appears to be written for students who are creating statistical software (complete with end-of-chapter exercises). I'm left wondering what users of graphics software can do with this information, because much of this material relates to the design of graphics software. Knowing these issues makes you want to do things the software may not be designed to do efficiently. For example, most software packages I have used do not have a simple toggle to sort categorical variables by various means (alphabetical, increasing or decreasing frequency, increasing or decreasing value of another variable, etc.).
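To show what such a toggle would amount to, here is a small Python sketch of the three orderings. The records and variable names are invented for illustration. The same categories are ordered alphabetically, by decreasing frequency, and by the increasing mean of a second variable.

```python
from collections import Counter

# Hypothetical (category, value) records; the names and numbers are made up.
records = [
    ("pear", 3.0), ("apple", 1.0), ("pear", 5.0),
    ("fig", 9.0), ("apple", 2.0), ("pear", 1.0),
]

counts = Counter(category for category, _ in records)

totals = {}
for category, value in records:
    totals[category] = totals.get(category, 0.0) + value
means = {category: totals[category] / counts[category] for category in counts}

# Three orderings for the same categorical axis.
alphabetical = sorted(counts)
by_frequency = sorted(counts, key=lambda c: (-counts[c], c))  # most frequent first
by_mean_value = sorted(counts, key=lambda c: (means[c], c))   # smallest mean first

print(alphabetical)   # ['apple', 'fig', 'pear']
print(by_frequency)   # ['pear', 'apple', 'fig']
print(by_mean_value)  # ['apple', 'pear', 'fig']
```

Each ordering would produce a different-looking stacked column or mosaic chart from the same data, which is why a one-click switch between them would be so handy.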

Comments

You close your review with the question: "I'm left wondering what users of graphic software can do with this information...".

First, the book gives a background on Mondrian, which you failed to mention. Get it at http://rosuda.org/Mondrian/.
Second, the book is more for those wanting to have a background on visualization software like Mondrian rather than designing software.
Third, what can you do with it? I use Mondrian all the time for large datasets [environmental data analysis]. It blows your socks off. I have never seen anything more practical.

Henk: If the book is intended as a manual or introduction to Mondrian, then I think the authors are being too shy about it, e.g. they could have titled the book Interactive Graphics for Data Analysis using Mondrian. Apart from the appendix which is the Mondrian manual, I did not get the sense that it is only about Mondrian; in the main text, the authors use other software like DataDesk and R to illustrate concepts, in addition to Mondrian.
So thanks for pointing that out, and I am glad to hear of the positive experience.

It might be worthwhile to comment on both views, which are both OK, depending on your perspective.

Our idea was to illustrate concepts and principles of interactive graphical data analysis as generically as possible. We explicitly wanted to avoid writing a manual/cookbook on Mondrian. That said, while working on the first part of the book, you can either think "these are the things I should implement in a graphical tool for data analysis", or "when I use (something like) Mondrian, these are the things I should expect to do." I guess the latter group of readers is far bigger than the first one.

The second part (the examples), though, should finally get you to the "real work" on data sets. Using Mondrian for this part is certainly far easier than using most other tools - but if you are an R addict, you can also work on it with R.

I attended a short course by Martin and Simon at JSM 2009 that was based on their book. It was very well presented. I also recommend the book by Antony Unwin, Martin Theus, and Heike Hofmann called Graphics of Large Datasets, which has a similar perspective on creating effective graphics (and software).

To be clear, there are several commercial products that share features that are present in Mondrian. For example, I work at SAS, and we produce both JMP and SAS/IML Studio. As you mention, DataDesk is another similar product.