Introduction

We use summarise() with aggregate functions, which take a vector of values and return a single number. The function summarise_each() offers an alternative to summarise() that produces identical results.
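As a small illustration of what "aggregate function" means here, the following sketch uses base R only. The vector x holds made-up values for illustration and is not taken from mtcars.

```r
# Hypothetical example values, for illustration only
x <- c(10, 20, 30, 40)

# Aggregate functions collapse a whole vector into a single value
length(mean(x))
length(max(x))

# range() returns two values, so it is not an aggregate function in this sense
length(range(x))
```

Functions like mean(), min() and max() are therefore suitable inside summarise(), while a function such as range() is not.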

This post compares the behavior of summarise() and summarise_each() along two factors that we can control:

1. How many variables to manipulate

1A. single variable

1B. more than one variable

2. How many functions to apply to each variable

2A. single function

2B. more than one function

resulting in the following four cases:

Case 1: apply one function to one variable

Case 2: apply many functions to one variable

Case 3: apply one function to many variables

Case 4: apply many functions to many variables

These four cases will also be tested with and without group_by().

The mtcars data frame

For this article we will use the well-known mtcars data frame.

We will first transform it into a tbl_df object; the underlying data.frame is unchanged, but a much better print method becomes available.
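The claim that the conversion keeps a regular data frame can be checked directly. The sketch below assumes a recent dplyr installation and uses as_tibble(), which newer dplyr versions recommend in place of tbl_df(); the object name cars_tbl is only illustrative.

```r
library(dplyr)

# as_tibble() is used here on the assumption of a recent dplyr;
# tbl_df() plays the same role in the older versions this article targets
cars_tbl <- as_tibble(mtcars)

# The result is still a data.frame; the extra classes only change printing
inherits(cars_tbl, "data.frame")
class(cars_tbl)

# Dimensions and column names are unchanged
identical(dim(cars_tbl), dim(mtcars))
identical(names(cars_tbl), names(mtcars))
```

One detail to keep in mind: by default as_tibble() does not carry over row names, so the car names stored as row names in mtcars are not part of the converted object.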

Finally, to keep this article tidy and clean, we will select only three variables of interest:

library(dplyr)

mtcars <- mtcars %>%
  tbl_df() %>%
  select(cyl, mpg, disp)

Case 1: apply one function to one variable

In this case, summarise() is the simplest candidate.

# without group
mtcars %>%
  summarise(mean_mpg = mean(mpg))

## Source: local data frame [1 x 1]
##
##   mean_mpg
##      (dbl)
## 1 20.09062

# with group
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

## Source: local data frame [3 x 2]
##
##     cyl mean_mpg
##   (dbl)    (dbl)
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000

We could use summarise_each() as well, but its usage results in a loss of clarity.

# without group
mtcars %>%
  summarise_each(funs(mean), mean_mpg = mpg)

## Source: local data frame [1 x 1]
##
##   mean_mpg
##      (dbl)
## 1 20.09062

# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(mean), mean_mpg = mpg)

## Source: local data frame [3 x 2]
##
##     cyl mean_mpg
##   (dbl)    (dbl)
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000

Case 2: apply many functions to one variable

In this case we can use both functions summarise() and summarise_each().

Function summarise() has a more intuitive syntax:

# without group
mtcars %>%
  summarise(min_mpg = min(mpg), max_mpg = max(mpg))

## Source: local data frame [1 x 2]
##
##   min_mpg max_mpg
##     (dbl)   (dbl)
## 1    10.4    33.9

# with group
mtcars %>%
  group_by(cyl) %>%
  summarise(min_mpg = min(mpg), max_mpg = max(mpg))

## Source: local data frame [3 x 3]
##
##     cyl min_mpg max_mpg
##   (dbl)   (dbl)   (dbl)
## 1     4    21.4    33.9
## 2     6    17.8    21.4
## 3     8    10.4    19.2

The names of the output variables are specified directly in the call, in the simple form max_mpg = max(mpg).

When we apply many functions to one variable, summarise_each() provides a more compact and tidy notation:

# without group
mtcars %>%
  summarise_each(funs(min, max), mpg)

## Source: local data frame [1 x 2]
##
##     min   max
##   (dbl) (dbl)
## 1  10.4  33.9

# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(min, max), mpg)

## Source: local data frame [3 x 3]
##
##     cyl   min   max
##   (dbl) (dbl) (dbl)
## 1     4  21.4  33.9
## 2     6  17.8  21.4
## 3     8  10.4  19.2

The names of the output variables are given by the names of the functions: min and max. In this case we lose the name of the variable the functions are applied to. If we prefer something like min_mpg and max_mpg, we have to rename the functions we call within funs():

# without group
mtcars %>%
  summarise_each(funs(min_mpg = min, max_mpg = max), mpg)

## Source: local data frame [1 x 2]
##
##   min_mpg max_mpg
##     (dbl)   (dbl)
## 1    10.4    33.9

# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(min_mpg = min, max_mpg = max), mpg)

## Source: local data frame [3 x 3]
##
##     cyl min_mpg max_mpg
##   (dbl)   (dbl)   (dbl)
## 1     4    21.4    33.9
## 2     6    17.8    21.4
## 3     8    10.4    19.2

Case 3: apply one function to many variables

This case is very similar to Case 2: both summarise() and summarise_each() can be used.

The function summarise() again has the more intuitive syntax, and the names of the output variables can be specified in the usual simple form, for example mean_mpg = mean(mpg):

# without group
mtcars %>%
  summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))

## Source: local data frame [1 x 2]
##
##   mean_mpg mean_disp
##      (dbl)     (dbl)
## 1 20.09062  230.7219

# with group
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))

## Source: local data frame [3 x 3]
##
##     cyl mean_mpg mean_disp
##   (dbl)    (dbl)     (dbl)
## 1     4 26.66364  105.1364
## 2     6 19.74286  183.3143
## 3     8 15.10000  353.1000

When we apply one function to many variables, summarise_each() provides a more compact and tidy notation:

# without group
mtcars %>%
  summarise_each(funs(mean), mpg, disp)

## Source: local data frame [1 x 2]
##
##        mpg     disp
##      (dbl)    (dbl)
## 1 20.09062 230.7219

# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(mean), mpg, disp)

## Source: local data frame [3 x 3]
##
##     cyl      mpg     disp
##   (dbl)    (dbl)    (dbl)
## 1     4 26.66364 105.1364
## 2     6 19.74286 183.3143
## 3     8 15.10000 353.1000

The names of the output variables are given by the names of the variables: mpg and disp. In this case we lose track of the function applied to the variables: mean(). We might prefer names such as mean_mpg and mean_disp. To achieve this, we rename the variables we pass to ... within summarise_each():

# without group
mtcars %>%
  summarise_each(funs(mean), mean_mpg = mpg, mean_disp = disp)

## Source: local data frame [1 x 2]
##
##   mean_mpg mean_disp
##      (dbl)     (dbl)
## 1 20.09062  230.7219

# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(mean), mean_mpg = mpg, mean_disp = disp)

## Source: local data frame [3 x 3]
##
##     cyl mean_mpg mean_disp
##   (dbl)    (dbl)     (dbl)
## 1     4 26.66364  105.1364
## 2     6 19.74286  183.3143
## 3     8 15.10000  353.1000

Case 4: apply many functions to many variables

As in the previous cases, both summarise() and summarise_each() are valid alternatives.

The function summarise() again has the more intuitive syntax, and the names of the output variables can be specified in the usual simple form: max_mpg = max(mpg)
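The summarise() version of this case is not shown in the article. A minimal sketch of what it might look like follows, assuming the same mtcars columns used throughout; the object names res and res_by_cyl are only illustrative. Each output column is written out by hand, which is why the call grows quickly with the number of variables and functions.

```r
library(dplyr)

# without group: one named expression per variable/function pair
res <- mtcars %>%
  summarise(
    min_mpg  = min(mpg),
    max_mpg  = max(mpg),
    min_disp = min(disp),
    max_disp = max(disp)
  )
res

# with group: same expressions, evaluated once per value of cyl
res_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    min_mpg  = min(mpg),
    max_mpg  = max(mpg),
    min_disp = min(disp),
    max_disp = max(disp)
  )
res_by_cyl
```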

When we apply many functions to many variables, summarise_each() provides a more compact and tidy notation:

# without group
mtcars %>%
  summarise_each(funs(min, max), mpg, disp)

## Source: local data frame [1 x 4]
##
##   mpg_min disp_min mpg_max disp_max
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472

# with a single group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(min, max), mpg, disp)

## Source: local data frame [3 x 5]
##
##     cyl mpg_min disp_min mpg_max disp_max
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0

The names of the output variables are given by the notation variable_function, i.e. mpg_min, disp_min, etc.

Naming output variables with a different notation, such as function_variable, does not appear to be possible within the call to summarise_each().

This goal has to be achieved with a separate instruction:

# without group
mtcars %>%
  summarise_each(funs(min, max), mpg, disp) %>%
  setNames(c("min_mpg", "min_disp", "max_mpg", "max_disp"))

## Source: local data frame [1 x 4]
##
##   min_mpg min_disp max_mpg max_disp
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472

# with group
# the first name must match the grouping variable, cyl
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(min, max), mpg, disp) %>%
  setNames(c("cyl", "min_mpg", "min_disp", "max_mpg", "max_disp"))

## Source: local data frame [3 x 5]
##
##     cyl min_mpg min_disp max_mpg max_disp
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0
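Typing the full vector of names by hand becomes error-prone as the number of columns grows. One possible alternative, sketched here with base R only, derives the function_variable names from the variable_function names with a regular expression. The helper swap_name_parts() is hypothetical and not part of dplyr, and the data frame res is a hand-built stand-in that copies the grouped result printed earlier in this case.

```r
# Hypothetical helper: turns "variable_function" into "function_variable".
# Names without an underscore (such as the grouping column) are left unchanged.
swap_name_parts <- function(nms) {
  sub("^(.+)_([^_]+)$", "\\2_\\1", nms)
}

# Stand-in for the grouped summarise_each() result, values copied from the output
res <- data.frame(
  cyl      = c(4, 6, 8),
  mpg_min  = c(21.4, 17.8, 10.4),
  disp_min = c(71.1, 145.0, 275.8),
  mpg_max  = c(33.9, 21.4, 19.2),
  disp_max = c(146.7, 258.0, 472.0)
)

# Rename every column in one step, without listing the names explicitly
res <- setNames(res, swap_name_parts(names(res)))
names(res)
```

The pattern splits each name at its last underscore, so it assumes that function names contain no underscore themselves.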

Conclusions

When using functions returning results of length one we have two possible candidate verbs:

summarise()

summarise_each()

The function summarise() has a simpler syntax, while summarise_each() has a more compact notation.

As a consequence, summarise() seems more appropriate when dealing with a single variable or a single function. As the number of variables or functions grows, summarise_each() becomes the better choice.

The function summarise_each() has its own way of assigning names to the output variables:

Case 2: apply many functions to one variable

The names of the output variables are given by the names of the functions. In this case we lose the name of the variable the functions are applied to.

Case 3: apply one function to many variables

The names of the output variables are given by the names of the variables. In this case we lose track of the function applied to the variables.

Case 4: apply many functions to many variables

The names of the output variables are given by the notation variable_function. Naming output variables with a different notation does not appear to be possible within the call to summarise_each().
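A note for readers on newer versions of dplyr: summarise_each() and funs() have since been deprecated, and from dplyr 1.0.0 the across() helper covers the same ground. The sketch below is not part of the comparison above and assumes dplyr 1.0.0 or later; its .names argument allows the function_variable naming directly within the call.

```r
library(dplyr)

# Case 4 with across(): many functions applied to many variables.
# Assumes dplyr >= 1.0.0; {.fn} is the function name, {.col} the column name.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(c(mpg, disp), list(min = min, max = max), .names = "{.fn}_{.col}")
  )

names(res)
```

Note that across() orders the output by variable first and by function second, so the columns come out as min_mpg, max_mpg, min_disp, max_disp.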

Andrea Spanò is an RStudio certified instructor who has worked as an R trainer and consultant for over 20 years. Andrea graduated in Statistics from the University of Siena and obtained a Master’s degree in Applied Statistics at University College London. He runs the Quantide consulting firm and teaches the postgraduate course on Big Data Management at Luiss University.