What I use for data visualization

Depending on the nature of the problem, data size, and deliverable, I still draw upon an array of tools for data visualization. As I survey the Design track at next month’s Strata conference, I see creators and power users of visualization tools that many data scientists have come to rely on. Several pioneers will lead sessions on (new) tools for creating static and interactive charts, against small and massive data sets.

The Grammar of Graphics
To this day, R (specifically ggplot2) is a tool I turn to for producing static visualizations. Even the simplest charts allow me to quickly spot data problems and anomalies, and a tool like ggplot2 can accomplish a lot in very few lines of code. Charts produced by ggplot2 look much nicer than simple R plots, and once you get past the initial learning curve, they are easy to fine-tune and customize.

Hadley Wickham, the creator of ggplot2, is speaking on two new domain-specific languages (ggvis and dplyr) that make it easy for R users to declaratively create interactive web graphics. As Hadley describes it, ggvis is an interactive Grammar of Graphics for R. As more data scientists turn to interactive visualizations that can be shared through web browsers, ggvis is the natural next tool for ggplot2 users.

d3 and JavaScript
For interactive web visualizations, I previously turned to Google Charts and protovis. But I (and many other protovis fans) began migrating over to d3.js when it was announced in 2011. Since then I’ve used its versatility and power to create standard static charts and highly customized, interactive visualizations. If you’re new to d3, Scott Murray is leading an introductory tutorial at Strata that I highly recommend (Scott is a popular instructor and author).

As I noted in a recent post, I’m currently using Python and Scala as my general purpose programming languages. I’m pleased to see the PyData community embrace visualization tools from other languages. Brian Granger, one of the leaders of the IPython community, is giving a talk on how IPython’s architecture allows users to leverage tools like d3 from within IPython notebooks.
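To make the idea of driving d3 from Python concrete, here is a minimal sketch using only the standard library. It is not taken from Brian's talk: the helper name `d3_bar_chart_html`, its parameters, and the choice of the d3 v7 script URL are assumptions made for illustration. The function serializes a Python list to JSON and embeds it in an HTML fragment whose script asks d3 to draw one bar per value.

```python
import json


def d3_bar_chart_html(values, width=300, bar_height=20, div_id="chart"):
    """Return an HTML fragment that draws `values` as horizontal bars with d3.

    Hypothetical helper for illustration; assumes the browser can load d3
    from d3js.org before the inline script runs.
    """
    data_json = json.dumps(list(values))  # hand the Python data to JavaScript as JSON
    height = bar_height * len(values)     # one row per value
    return f"""
<div id="{div_id}"></div>
<script src="https://d3js.org/d3.v7.min.js"></script>
<script>
  const data = {data_json};
  // Map data values onto pixel widths.
  const scale = d3.scaleLinear()
      .domain([0, d3.max(data)])
      .range([0, {width}]);
  d3.select("#{div_id}")
    .append("svg")
      .attr("width", {width})
      .attr("height", {height})
    .selectAll("rect")
    .data(data)
    .join("rect")
      .attr("y", (d, i) => i * {bar_height})
      .attr("width", d => scale(d))
      .attr("height", {bar_height} - 2)
      .attr("fill", "steelblue");
</script>
"""


if __name__ == "__main__":
    # Write a standalone page that can be opened in any browser.
    with open("bars.html", "w", encoding="utf-8") as f:
        f.write(d3_bar_chart_html([3, 1, 4, 1, 5]))
```

Inside a notebook, the same string could be passed to `IPython.display.HTML` instead of being written to a file. Whether the external script finishes loading before the inline script runs depends on the notebook front end, so treat that step as something to verify in your own environment.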

If you need to create interactive web visualizations on large data sets, check out Superconductor, a new open source project that originated from UC Berkeley’s Par Lab. It leverages high-level, simple, domain-specific languages that automatically find and exploit parallelism. One example renders 100,000 time-series data points, spread across hundreds of line graphs, with JavaScript controls for real-time zooming and panning.

Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.