Pivot Tables in R with melt and cast

If you are making the transition from Excel to R and still can’t figure out how to quickly obtain pivot tables like Excel has, this article is for you!

Actually it is pretty easy to produce Pivot Tables in R. All you need is a package called reshape by Hadley Wickham (yes, the same prolific author of plyr and ggplot2) and some understanding of how reshape “thinks” and works.

If you are interested in an alternative, easier but less powerful, method to create Pivot Tables in R using aggregate, you can read my other post here.

For this tutorial, we will be using a sample data set called Salespeople PivotTable report, which is used in many articles by Microsoft teaching the basics of Pivot Tables in Excel. The link to the sample data file can be found at the bottom of this article.

We will first obtain each result in Excel and then reproduce it with R. I recommend that you download the sample data file and follow along.

1. Download the sample data file

Use the link at the bottom of the article and save the file somewhere on your hard drive. So far so good!

2. Open the file in Excel and save a CSV copy of the first sheet

Double-click the file from the location where you saved it to open it inside Excel. The Excel worksheet has 3 tabs. We will be using the first one, containing the source data, and the last one, which contains a ready-made Pivot Table to play with. If you do not know how to make Pivot Tables in Excel, I recommend you read the Microsoft tutorial. In this article we will focus on how to obtain the same results with R and we will use the ready-made pivot table on sheet 3.

Once you are on the first sheet of the sample data set (the one with the source data), save a copy of it as CSV (Comma Separated Values) file. This will generate a file which is easier to import into R. R has packages which allow you to read directly Excel files without converting them to CSV, but for the sake of simplicity we will stick to CSV.

Before saving as CSV, change the format of column E (Order Amount) to General so that the orders amounts appear as numbers with 2 decimals and not with a $ sign in front and comma separated thousands. If you export to CSV without doing this first, there will be extra steps required in R to convert the sales amounts to numeric (see the note at the end of the article).

3. Load the CSV file in R

This is an easy step accomplished with the read.CSV function as follows.

Reading the CSV file into R

R

1

data<-read.CSV("file_name.txt",header=TRUE)

file_name.txt is the name of the CSV file you created from Excel. If it is not in the current working directory in R, you can either change R’s working directory with setwd or move the file to the current working directory or add a path to the file before the file name.

At this point you should have the same base data open in Excel and loaded into R. In R the first lines will look like this:

First 6 rows of data in R

1

2

3

4

5

6

7

Country Salesperson Order.Date OrderID Order.Amount

1UK Suyama10/07/2003102491863.40

2USA Peacock11/07/2003102523597.90

3USA Peacock12/07/2003102501552.60

4UK Dodsworth15/07/2003102552490.50

5USA Leverling15/07/200310251654.06

6UK Buchanan16/07/200310248440.00

4. Your first Pivot Table in R with melt and cast

Here we go. This is what we have been waiting for, the rest was just preparation to get to this point. Ready?

We will re-produce in R a pivot table like the default one that appears in the 3rd tab of the sample Excel file and which looks like this.

Basically it shows how much each sales person has sold in total across all orders and and countries (note that the Country filter is set to All). It also shows a Grand Total which is the sum of all sales.

To get the job done in R we will use the reshape package. There are other packages available which can achieve the same results, but reshape is particularly versatile and easy to use once you have grasped some basic concepts. (And, hey, it has been written by Hadley Wickham, so we _have_ to use it!)

We can install (if not already installed) and load the reshape package with:

Installing and loading reshape

R

1

2

install.packages("reshape")

library(reshape)

If you have ggplot2 installed and loaded, reshape will already be available.

The reshape package contains two functions that are key to easily generate pivot tables in R. They are melt and cast.

melt transforms a data frame from the original format to a so called long format, where all the observed variables (called measures) appear, together with their respective value, in two adjacent columns named variable and value. Each row of this new data format is identified by a unique combination of the id variables, also part of the original data frame.

An example will clarify how melt works. Let’s take the first 6 rows of our data set.

First 6 rows of data in R

1

2

3

4

5

6

7

Country Salesperson Order.Date OrderID Order.Amount

1UK Suyama10/07/2003102491863.40

2USA Peacock11/07/2003102523597.90

3USA Peacock12/07/2003102501552.60

4UK Dodsworth15/07/2003102552490.50

5USA Leverling15/07/200310251654.06

6UK Buchanan16/07/200310248440.00

It contains 5 columns. The first 4 identify each order by a combination of Country, Salesperson, Order.Date and OrderID. These are all non-numeric and there are no calculations we can do on them except, maybe, counting their frequency. Using melt‘s terminology, Country, Salesperson, Order.Date, OrderID are id variables, while Order.Amount, which is a numeric and which is the one we would like to sum up in our pivot table, is a measure.

When melting your data, you can indicate multiple id variables and also multiple measure variables. The id variable will appear as-is in the resulting melted format, while the measure variable will be stacked in the variable column with their respective value in the value column.

Note that melt preserves all data. Nothing is lost or modified, just the way the data are collected within the data frame. It is therefore possible to “un-melt” the melted data and go back to the original format at any time.

Let’s melt our sales data frame and see what the result looks like.

Melting the sales data

R

1

data.m<-melt(data,id=c(1:4),measure=c(5))

Pretty simple. Instead of listing the column names we have used their numerical identifiers. We have indicated to melt that columns 1 to 4 ( Country, Salesperson, Order.Date, OrderID) are id variables while column 5 (Order.Amount) is a measure variable.

Here is how the first 6 rows of data.m look like at this point:

First 6 rows of data after melt

1

2

3

4

5

6

7

Country Salesperson Order.Date OrderID variable value

1UK Suyama10/07/200310249Order.Amount1863.40

2USA Peacock11/07/200310252Order.Amount3597.90

3USA Peacock12/07/200310250Order.Amount1552.60

4UK Dodsworth15/07/200310255Order.Amount2490.50

5USA Leverling15/07/200310251Order.Amount654.06

6UK Buchanan16/07/200310248Order.Amount440.00

As said the column Order.Amount has been “melted” into a variable column with the respective values in the value column.

Now that our data have been melted (and you have to do this only once, unless you change idea on what you want as id and as measure), we are ready to cast them to build the pivot table.

cast requires us to indicate, beside a reference to the melted data, how we want to re-aggregate the values.

The basic syntax is (and, yes, we are omitting some parameters that we won’t need in this tutorial):

formula and fun.aggregate are the most important two because they indicate how we want to reshape the data and which functions to use for aggregating the values.

Think of formula as saying:

1

what doyou want asrows~what doyou want ascolumns

While fun.aggregate says how you want to aggregate the values of the variable(s) in order to reshape them as described by formula. This is the equivalent of selecting “Count of”, or “Sum of” etc. in Excel.

Going back to our example will make this point clearer.

We want to produce a pivot table which contains the total sales for each sales person. Therefore we chose Salesperson as row and variable (which is Sales.Amount) as column. The aggregate function will be sum to obtain the sum of the variable (Sales.Amount) for each sales person. Here the command:

Casting the data to make the pivot table

R

1

data.c<-cast(data.m,Salesperson~variable,sum)

And here is the resulting pivot table:

Pivot table produced with cast

1

2

3

4

5

6

7

8

9

10

Salesperson Order.Amount

1Buchanan68792.25

2Callahan123032.67

3Davolio182500.09

4Dodsworth75048.04

5Fuller162503.78

6King116962.99

7Leverling201196.27

8Peacock225763.68

9Suyama72527.63

As expected (or not?!?) is identical to the Excel one. Congratulations on your first pivot table in R!

But wait before you pat yourself on the shoulder. In Excel there is an extra row showing the Grand Total and we don’t have it in R. How can we fix it?

5. Adding Grand Totals

That’s easy actually. Look back at the syntax for cast. There is an argument called margins which comes to rescue. margins accepts a vector that can contain the strings “grand_col” and “grand_row”. This adds an extra column or extra row (or both) with a grand total to the resulting data frame. The grand total is obtained through the same fun.aggregate.

Let’s add a row with the grand total.

Casting the data to make the pivot table

R

1

data.c<-cast(data.m,Salesperson~variable,sum,margins=c("grand_row"))

And here it is, identified by (all):

Pivot table with grand total row added

1

2

3

4

5

6

7

8

9

10

11

Salesperson Order.Amount

1Buchanan68792.25

2Callahan123032.67

3Davolio182500.09

4Dodsworth75048.04

5Fuller162503.78

6King116962.99

7Leverling201196.27

8Peacock225763.68

9Suyama72527.63

10(all)1228327.40

Well done!

6. And how to Filter

Ok, ok. You are a picky one! I know you noted that our pivot table lacks one more feature respect to the Excel one. It has in fact no ability to Filter by Country. Did you get it yet? No!?!

Yes, the solution is in the other parameter for cast which we did not consider so far (beside those we omitted): subset.

subset allows us to limit the casting to only a subset of the melted data selected according to certain criteria. It requires a vector of TRUE or FALSE corresponding to the rows of the melted data we want to select (TRUE) or exclude (FALSE) before the casting is performed.

Say that we want to calculate the performance only for the American sales people. We need to select only those rows for which Country==”USA”. This is exactly the additional condition we need to specify within cast and it corresponds to selecting the Filter Country = USA a the top of the Excel pivot table.

And here is the filtered pivot. The grand total has also been adjusted accordingly.

Pivot filtered on Country==

1

2

3

4

5

6

7

Salesperson Order.Amount

1Callahan123032.7

2Davolio182500.1

3Fuller162503.8

4Leverling201196.3

5Peacock225763.7

6(all)894996.5

Congratulations, now we are really done! 🙂

I hope you have enjoyed this tutorial on how to make simple pivot tables in R. Please leave a comment below if you would like to see more tutorials on the same topic.

Till next time!

Note: Stripping $ and commas and converting to numeric

Ok, so you decided not to change the format of Column E (Amount) to General before saving the table as CSV. I had warned you. Now your punishment will be terrible and will include… regular expressions! Read on.

Once you open the CSV file in R, you can verify that the column Order.Amount has indeed been imported as a Factor. In order to be able to do calculations on it, we need to convert it to numeric first. This can be easily achieved using the function as.numeric, however the $ sign at the beginning and the commas separating the thousands will cause it to generate unexpected results. Before running through as.numeric we need therefore to strip both the $ sign and the commas. This can be quickly done using gsub and regular expressions. Teaching you regular expressions goes beyond the scope of this article, but the following code does the job.

Strip $ and , from Amount

R

1

2

data$Order.Amount<-gsub("([/$,])","",data$Order.Amount)

data$Order.Amount<-as.numeric(data$Order.Amount)

Obviously the two steps above can be combined into one by nesting gsub into as.numeric.

Source data:

Update: it seems that the Salespeople PivotTable report sample data file is no longer available on Microsoft’s web site. You can still get a copy from the link below.

Originally the file was located here:
http://office.microsoft.com/en-001/templates/salespeople-pivottable-report-TC001091651.aspx?CTT=5&origin=HA001087565

Hi Arianna,
the easiest way to do this is to use the new dplyr package by Hadley Wickham. Please check out the article I just published on Pivot Tables in R with dplyr. It contains examples about aggregating multiple variables with different summary statistics. including sum, average and max. I believe it should help you to solve your problem, otherwise please feel free to ask again.
Cheers,
Marco.

Hi Edward,
you can always add extra calculated columns (aka variables) in the usual way (data$new.column = data$column.1 / data$column.2) or you can take advantage of libraries like dplyr that make the task easy. I just wrote a new article about using dplyr to make Pivot Tables, maybe you can take a look and see if it helps to solve your problem. I added an example on calculating new columns from existing ones and then pivoting them.
Cheers,
Marco.

It is a great article and I loved it. But instead of sum, i want to calculate: Average, count, percentage count etc. I tried to replace sum with average etc but it is giving me an error stating: object ‘count’ not found. Could you please help me out.

Hi Rupinder,
It is difficult to help without seeing your code. I just wrote an article on using dplyr to make Pivot Tables. dplyr makes it easier to calculate multiple statistics at once for the same variable. Please have a look and let me know if it works for you.
Ciao,
Marco.

thanks Marco…a minor tip: after importing the file into R and in the step when I did the cast operation, I got the error message
“Error in Summary.factor(integer(0), na.rm = FALSE) :
sum not meaningful for factors”
…this drove me around the bend till I realized that the Amount column, though it looked like numbers was in text form. The solution is to go back to the data.csv file in Excel, select the Amount column, and change its format from General t Number. Hope this helps
ajit

Hi Ajit,
thanks for your comment. Indeed, one should always check with the str() command with which types the variables are being imported into R. This is known to be a source of headaches, especially when variables get or do not get converted to factors. Good that you were able to spot the problem and solve it.
Ciao,
Marco.

After seeing the results of your little example (and realizing the flexibility of the coding vs. the traditional Excel “point&click” orgies plus the data outpout format which is better suitable for further analysis than the default Excel pivot result), I am determinded to dare the transition to R (…daunting as the task my seem; the learning curve is really steep for a start – but the reward is bound to come….).

One very positive & motivating factor is the amount of information and help that is available for R from people like you – so you may take this compliment literally personal; but you can also take this as a “pars-pro-toto” acknowledgement for the entire R-community 🙂

Hi Ulf,
sehr gerne!
It is nice to receive comments like yours as they provide the energy to keep sharing. I am glad you are making the transition from Excel to R. When I introduced it into my company for a special financial analysis project which used to be done with Excel, we cut the total time from many days to few hours. The quality and our confidence in the results also greatly improved to everyone’s satisfaction.
By the way, I also started to make a series of YouTube videos for anyone wanting to make the move from Excel to R. They are called R for Excel Users and show side by side how the same tasks are done in Excel and in R. You may want to check them out, maybe they are useful to you too. I just need to find the time to continue the series. 🙂
For any questions, please feel free to contact me, I will be glad to help if I can.
MfG,
Marco.

Hi Zoe,
not easy to guess what could be wrong without seeing an example of your data and the exact syntax of the cast command you are issuing (if it was in your original comment it didn’t come through unfortunately).
I would recommend that you first use str() on your data frame to check you have the right data types in it, then check your cast syntax. Are you referring a variable which is not defined anywhere?
Cheers,
Marco.

Hi Marc,
I am trying to migrate from Excel to R and find your articles very easy to comprehend.
In this context, I am stuck with an issue. I have a “Price” variable with $ appended and converted it into numeric using the gsub and as.numeric functions mentioned by you. However, when I am melting it, it is reconverting it into factor. Consequently, when I am trying to aggregate it in the Cast function, I am getting a factor error. Please suggest a way out.
Thanks
Prashanth

Hi Prashanth,
melt and cast should not be changing the type of the variables they work with, so I am wondering whether your Price variable has the right type in the first place. Did you check with str() that your conversion to numeric has been successful? Please post some sample data if you like, so I can simulate what you are doing.
Cheers,
Marco.