This notebook illustrates some basic techniques for graphing data using Python. Here we focus on scatterplots, histograms, and boxplots. Other plotting techniques such as contour plots and heatmaps are also possible.

There are many Python libraries that support visualization. Here we focus on five of them:

Matplotlib is a mature and powerful graphing library for Python. It comes with two main interfaces. The Matplotlib API is low-level and object-oriented. The pyplot interface is somewhat more abstract, and to some degree resembles MATLAB "handle graphics". We will mainly use the pyplot interface here, which we load next:

In [4]:

import matplotlib.pyplot as plt

First we make a basic scatterplot of two of the systolic blood pressure measurements (paired by person).
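As a minimal sketch of such a scatterplot, the cell below uses synthetic data in place of the NHANES measurements. The array names `sbp1` and `sbp2` and the simulated values are assumptions made for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the NHANES data: two systolic readings per person (mmHg).
rng = np.random.default_rng(0)
sbp1 = rng.normal(120, 15, size=200)
sbp2 = sbp1 + rng.normal(0, 5, size=200)  # second reading tracks the first

plt.figure()
plt.plot(sbp1, sbp2, "o", alpha=0.5)  # one point per person
plt.xlabel("First systolic BP (mmHg)")
plt.ylabel("Second systolic BP (mmHg)")
plt.grid(True)
```

Because the two readings are paired by person, the points should cluster around the diagonal.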

Next we will make a more elaborate histogram plot in which the blood pressure distribution is split by gender. To do this, first we will recode the gender variable (refer to the NHANES codebook for coding information).
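The recoding and the split histogram might look roughly like the following sketch. It uses synthetic data and assumes the gender variable is named `RIAGENDR` with codes 1 (male) and 2 (female), and that the blood pressure variable is named `BPXSY1`. These names and codes should be checked against the codebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the NHANES data (column names are assumptions).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "RIAGENDR": rng.integers(1, 3, size=500),  # assumed coding: 1 or 2
    "BPXSY1": rng.normal(120, 15, size=500),   # systolic BP (mmHg)
})

# Recode the numeric gender codes as readable labels.
df["gender"] = df["RIAGENDR"].map({1: "Male", 2: "Female"})

# Overlay one semi-transparent histogram per gender.
plt.figure()
for label, grp in df.groupby("gender"):
    plt.hist(grp["BPXSY1"], bins=20, alpha=0.5, label=label)
plt.xlabel("Systolic BP (mmHg)")
plt.ylabel("Frequency")
plt.legend()
```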

Next we will make a side-by-side boxplot showing the blood pressure distribution by age group. To do this we need a list of arrays containing the blood pressure data for each age group. We can obtain this by using pandas cut and groupby.
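One way to build the list of arrays is sketched below with synthetic data. The column names `RIDAGEYR` and `BPXSY1` and the age cut points are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the NHANES data (column names are assumptions).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "RIDAGEYR": rng.integers(18, 80, size=600),  # ages 18 to 79
    "BPXSY1": rng.normal(120, 15, size=600),     # systolic BP (mmHg)
})

# Bin ages into right-closed intervals: (17, 30], (30, 40], ...
df["agegrp"] = pd.cut(df["RIDAGEYR"], bins=[17, 30, 40, 50, 60, 70, 80])

# Collect one array of blood pressure values per age group.
groups = df.groupby("agegrp", observed=True)["BPXSY1"]
labels = [str(key) for key, _ in groups]
data = [values.dropna().to_numpy() for _, values in groups]

plt.figure()
plt.boxplot(data)
plt.xticks(range(1, len(data) + 1), labels)  # boxes sit at positions 1..k
plt.xlabel("Age group (years)")
plt.ylabel("Systolic BP (mmHg)")
```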

With some effort, any aspect of these plots can be customized. By passing return_type='dict', we obtain a dictionary that contains objects corresponding to each box in the boxplot. We can then set the properties of each part as desired. Below we customize the boxes and remove the "fliers".
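The sketch below illustrates this kind of customization using the pandas `DataFrame.boxplot` method on synthetic data. The column names, age bins, and styling choices are assumptions. When `by` is given, pandas returns a Series indexed by column name, so the dictionary for our single column is retrieved with `result["BPXSY1"]`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the NHANES data (column names are assumptions).
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "RIDAGEYR": rng.integers(18, 80, size=600),
    "BPXSY1": rng.normal(120, 15, size=600),
})
df["agegrp"] = pd.cut(
    df["RIDAGEYR"],
    bins=[17, 30, 40, 50, 60, 70, 80],
    labels=["18-30", "31-40", "41-50", "51-60", "61-70", "71-80"],
).astype(str)

# With `by`, the return value is a Series of dicts keyed by column name.
result = df.boxplot(column="BPXSY1", by="agegrp", return_type="dict")
parts = result["BPXSY1"]

# Restyle the boxes and hide the outlier markers ("fliers").
plt.setp(parts["boxes"], color="purple", linewidth=2)
plt.setp(parts["fliers"], visible=False)
```

The dictionary also has entries such as `"whiskers"`, `"caps"`, and `"medians"`, which can be styled in the same way.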

We can further improve the plot by including the sample sizes in the labels. There are various ways to do this. The approach we use here is to create a new variable in the dataframe whose values are the labels as we would like them to appear in the plot (with the sample size included).
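A minimal sketch of this labeling approach, again with synthetic data, is shown below. The column names and the new variable `agegrp_n` are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the NHANES data (column names are assumptions).
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "RIDAGEYR": rng.integers(18, 80, size=600),
    "BPXSY1": rng.normal(120, 15, size=600),
})
df["agegrp"] = pd.cut(
    df["RIDAGEYR"],
    bins=[17, 30, 40, 50, 60, 70, 80],
    labels=["18-30", "31-40", "41-50", "51-60", "61-70", "71-80"],
).astype(str)

# Count the non-missing blood pressure values in each person's age group.
n = df.groupby("agegrp")["BPXSY1"].transform("count")

# New variable holding the label exactly as it should appear on the axis.
df["agegrp_n"] = df["agegrp"] + "\n(n=" + n.astype(str) + ")"

df.boxplot(column="BPXSY1", by="agegrp_n")
plt.suptitle("")  # drop the automatic "Boxplot grouped by ..." title
plt.xlabel("Age group (years)")
plt.ylabel("Systolic BP (mmHg)")
```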

The FacetGrid is a convenient way to make an array of plots. This can be used, for example, to stratify on certain variables while exploring the relationships among other variables. Next we use FacetGrid to look at the relationship between age and blood pressure within gender/ethnicity subgroups.

mpld3 is a library that allows you to make interactive plots. There are many forms of interactivity, such as allowing the user to click on a point and view some identifying information about that point.

In [27]:

import mpld3
from mpld3 import plugins

The next cell produces a scatterplot of systolic blood pressure versus age. You can hover over each point to see the subject identifier (SEQN) corresponding to the point. For performance reasons, and to reduce overplotting, we show only 1000 points.