Pivot Tables in R with dplyr

The prolific Hadley Wickham did it again! This time he came up, together with Romain Francois, with an amazing library for data manipulation that turns the task of making Pivot Tables in R a real breeze. Enter dplyr. Along the lines of ggplot2, also from the same main author, dplyr implements a grammar of data manipulation and also introduces a new syntax using “pipe” operators.

What is a “pipe” operator? Well, the best way to learn about it is to use it. Ready to experiment with dplyr? Let’s get started.

3. Install and load the dplyr library

dplyr is available directly from CRAN (the latest version is 0.2 at the time of this writing). It can be installed and made available to R with the following commands:

Installing dplyr

R

1

2

>install.packages("dplyr")

>library(dplyr)

After having loaded the dplyr library, you may get some warnings that some objects are masked by other packages. This is ok and it is not to worry about at this time.

4. Getting familiar with dplyr

dplyr comes with a set of functions that can be used to perform the most common manipulation on data. They all work on data frames and table data frames, a new “smarter” version of data frames supported by dplyr.

You can convert a data frame to a table data frame through the tbl_df() function:

Converting data to a table data frame

R

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

data_tdf<-tbl_df(data)

>data_tdf

Source:local data frame[799x5]

Country Salesperson Order.Date OrderID Order.Amount

1UK Suyama37812102491863.40

2USA Peacock37813102523597.90

3USA Peacock37814102501552.60

4UK Dodsworth37817102552490.50

5USA Leverling3781710251654.06

6UK Buchanan3781810248440.00

7USA Leverling37818102531444.80

8USA Leverling3781910256517.80

9USA Peacock37824102571119.90

10UK Buchanan3782510254556.62

.................

>

While converting to a table data frame is strictly not necessary to use dplyr, the new format provides a smarter way to display large data sets with many variables (columns). Since our data set is pretty limited, we will keep working with our data in a data frame format. Note however that the output of most dplyr functions will be a table data frame, then converted back transparently to a data frame.

5. Data manipulation verbs

dplyr introduces five data manipulation verbs, namely filter(), arrange(), select(), mutate() and summarise(). When it comes to the task of producing Pivot Tables, summarise() is our working horse.

Note to my american friends: you can use summarize() in place of summarise() and it will work as well! 🙂

summarise() takes multiple rows of a data frame and summarises (collapses) them into a single one by applying a function, for example sum() to sum their values or mean() to calculate the average.

summarise() alone is not very useful though because it does not allow any grouping of the results before applying the summarizing function to the data. Since Pivot Tables are obtained by first grouping the rows according to the value of a variable (column) and then applying a summarizing function to each group, we need a way to group rows in dplyr first. Enter group_by().

6. Group_by()

group_by() enables data manipulation verbs to be applied to each subgroup of data, bringing then back the result of each group in a single data frame.

Back to our sample data, we want to obtain the total amount each Sales Person has sold. Using dplyr, first we group the data by Salesperson with group_by(), then apply summarise() to each group to find the total sum.

Since group_by() should be called first and the results passed to summarise(), we end up with the following fully working but quite convoluted syntax:

group_by() and summarise() plain syntax

R

1

2

3

4

5

6

7

8

9

10

11

12

13

>summarise(group_by(data,Salesperson),Order.Amount=sum(Order.Amount))

Source:local data frame[9x2]

Salesperson Order.Amount

1Buchanan68792.25

2Callahan123032.67

3Davolio182500.09

4Dodsworth75048.04

5Fuller162503.78

6King116962.99

7Leverling201196.27

8Peacock225763.68

9Suyama72527.63

While the results are exactly as expected (please compare with the previous articles), the nested functions we had to use are pretty hard to read. Not surprisingly, the authors of dplyr have come up with a brilliant idea to simplify this complexity and make chained data operation extremely easy.

7. The pipe operator

What we are doing with the previous function calls is basically a chain of operations like this:

take data

group data by Salesperson using group_by()

take the result from the previous step and calculate the total for each group (aka Salesperson) it using summarise()

In other words, data goes into group_by() and the result of group_by() goes into summarise() producing the final pivot table.

The “pipe” operator introduced by dplyr does exactly this. It sends a piece of data as input to a function and then allows the output from the function to go into another function and so on. Using the pipe operator, we can produce the exact same pivot tables with the following much more readable syntax:

Where %>% is the pipe operator. Basically we have taken our data frame data, we have sent it to group_by() telling it to group by Salesperson, then we have taken the output from group_by() and sent it to summarise() telling it to summarise each group by calculating the total of Order. Amount within each group. Isn’t it great? 🙂

No unnecessary repetition of data, no need to use temporary variable to store the results, no need to nest functions. Everything happens automatically behind the scenes and is amazingly fast on large data sets thanks to the use of optimized C/C++ code. Great job Handy and Romain!!!

8. Filtering

How about doing some filtering too? Say that, like we did in a previous article, we want to product a Pivot Tables only for the Sales person in USA. We can achieve this easily with dplyr using the filter() verb. We filter the data before grouping and summarising them. Here is the syntax using again the pipe operator:

9. Unleashing the power of dplyr

dplyr is particularly powerful also because it also allows multiple operations to be performed at once on the same data.

For the sake of this example, let’s add to our data set a fictitious Quantity column, representing the quantity sold for each order. We do this by generating a normally distributed Unit Sales Price with mean = 10 and standard deviation = 1, then calculating the Quantity as Order.Amount / Unit.Sales.Price.

Even if calculating the Quantity requires the Unit Sales Price to be defined first, with dplyr we can do everything at once, that is in one single pass, using mutate(). Here is the syntax and the outcome:

Incidentally, note that the visualization of our modified data set wraps-around. Not very user friendly. Converting it to a table data frame makes its visualization nicer:

data as table data frame

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

>tbl_df(data)

Source:local data frame[799x7]

Country Salesperson Order.Date OrderID Order.Amount Quantity

1UK Suyama37812102491863.40201.77949

2USA Peacock37813102523597.90307.72409

3USA Peacock37814102501552.60173.32053

4UK Dodsworth37817102552490.50266.73936

5USA Leverling3781710251654.0667.22134

6UK Buchanan3781810248440.0040.88416

7USA Leverling37818102531444.80142.07045

8USA Leverling3781910256517.8056.71080

9USA Peacock37824102571119.90115.06074

10UK Buchanan3782510254556.6260.60626

....................

Variables notshown:Unit.Sales.Price(dbl)

Not only the new class table data frame shows the dimensions of the data frame as [799 x 7], but it prints nicely without wrapping around and specifying that one variable, Unite.Sales.Price, was not displayed.

Ok, now that we have our Quantities in place, we want to produce a Pivot Tables that has, for each sales person, the total amount sold and the average sold quantity. This is very easy too to achieve with dplyr. Here is the syntax:

In fact summarise() can process multiple variables at once, each one with a different summarisation function, or apply different summarisation functions to the same variable.

Say, for example, we want to calculate for each sales person the total amount sold AND the maximum order size. This is very simple with dplyr compared to Excel, where you can only calculate one summary per column (unless you duplicate a column):

9 Comments

How do we get multiple column labels in the pivot table? I would like to have the year or the month as a column. If there are orders in three different year, you would have 3 columns; 2012,2013,2014 with order totals for each row.

Hi Marcel,
the most elegant way to do what you ask for is to use the new tidyr package, also from Hadley Wickham (the same author of dplyr). tidyr is an evolution of reshape and reshape2 and uses the same pipe syntax as dplyr. Once you have your pivot table with the Total.Order amount for each Year, Country and Salesperson, you can spread it over the years using spread() from tidyr with the following syntax:

Very usefull, thank you.
I was trying to add a column with SUM (as Excel does by default).
I met two problems:
1> when adding something like mutate(Total=sum(2003:2005)) to the pipe, I got 6012 value for every row (so it’s treated as values, not column names),
2> looking for solution, I tried to change column names to y2003, y2004, y2005, but could not achieve it in pipe

Hi Milos,
you can use colSums() for that. If your data frame is a:
mutate(a, Total=colSum(a))
will add a Total column to it with totals by row. sum(2003:2005) would simply return 2003+2004+2005 as 2003:2005 is equivalent to the vector c(2003,2004,2005)
Hope this helps.
Ciao,
Marco.

Is it possible to pass a dynamic variable (let’s call it grp_var) to group_by(), which may take the value ‘Salesperson’ or ‘Country’ depending on certain condition(s)? In effect, I would like to set the dimension for ‘Rows’ of the pivot dynamically. Any pointers to this? Thanks in advance!