Thursday, 26 June 2014

ggpie: pie graphs in ggplot2

How does one make a pie chart in R using ggplot2? (If you're impatient: see the final code here).

I know, I know, pie charts are often not very good ways of displaying data. It is hard to visually compare the relative sizes of slices (particularly the smaller ones and if they are scattered with larger ones in between), and you get no sense of scale.

However, sometimes a good ol' pie chart can convey your point in a way that many of the general public will find easy to understand. For example, this is how I've spent my "work" hours this week so far:

(I manually inserted the line break into "marking exams" so it wouldn't get cut off - not sure how to automatically do this).

Terrible, I know. But the plot conveys quite clearly the point I want to make: I wasted away almost half (!) of my week so far playing nethack when I should have been doing my PhD, which is much more time than I have spent on my PhD itself!

First we construct a stacked bar chart, coloured by the activity type (fill=activity). Note that x=1 is a dummy variable purely so that ggplot has an x variable to plot by (we will remove the label later).

(Aside: if x is a factor (e.g. factor("dummy")) for some reason the bar does not get full width, and the resulting pie chart has a funny little hole in the middle).

(Aside: does anyone find it weird that although my x axis was 1 and my y axis was time, on the resultant graph the x axis is now 'time' and the y axis is now 1?)

Let's tweak this so it doesn't look so bad.

First, some pure aesthetics: let's outline each slice of the pie in black using color='black' in geom_bar(). This causes there to be an ugly black line and outline on each square of the legend, so we remove that.

Now, I should also remove the '0', '5', '10', '15' tick marks/labels, being the cumulative number of hours spent so far, since they're not meaningful.

However, I want to label each slice of the pie, and it is convenient to put my labels in place of the '0', '5', etc.

In terms of their position, they should be located at the midpoint of each pie slice. Think back to the stacked bar chart I produced at the start, and recall that the y axis (time) shows cumulative hours spent.

This means the y coordinate of the end of each slice is given by cumsum(df$time) (cumulative sum of time spent so far), and so the coordinate of the midpoint of each slice is given by:

y.breaks <- cumsum(df$time) - df$time/2
y.breaks

## [1] 7.60 18.90 24.60 27.75 29.90 31.65

In order to implement this in the pie chart, we use scale_y_continuous with the breaks argument being the coordinates we've just calculated, and the labels argument being the activity name.

(Note: for some reason the pie plots have an unreasonably large amount of white space, and the theme(*.margin) settings are an attempt to control that. However, I still get a lot of vertical space that I'm not sure how to compress).

Clearly the labels could do with more work (if they are too long they go out of the plot boundary, or they bump into the pie chart), but not bad for an hour and a half's work :)

To remove the grey, circular outline (formerly the grid): ``panel.grid = element_blank()``. By setting ``xlab`` and ``ylab`` to NULL, you remove blank space. I'm still struggling to understand how/when to use ``aes_string()`` versus plain ``aes()`` and when to set ``1`` to a factor or not, e.g. ``aes(x = factor(1), ...`` versus ``aes_string(x = 1, ...``