Data Manipulation: Subsetting

Subsetting a data frame is one of the most basic and necessary data manipulation techniques in R. If you are brand new to data analysis: a data frame is the most common data storage object in R, and a subset is a collection of rows from that data frame selected by certain criteria.

Data Frame

        V1   V2   V3   V4   V5   V6   V7
Row1
Row2
Row3
Row4
Row5
Row6

Subset

        V1   V2   V3   V4   V5   V6   V7
Row2
Row5
Row6
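The picture above can be sketched in a few lines of R. This is only an illustration: the data frame `df`, its made-up values and the `keep` vector are assumptions chosen to mirror the diagram, not part of the FanGraphs data.

```r
# Toy data frame with six rows and seven columns (values are made up).
df <- data.frame(matrix(1:42, nrow = 6, ncol = 7))
colnames(df) <- paste0("V", 1:7)
rownames(df) <- paste0("Row", 1:6)

# Keep only rows 2, 5 and 6; every column is retained.
keep <- c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
df.sub <- df[keep, ]

print(rownames(df.sub))
```

The subset has the same seven columns as the original; only the set of rows changes.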

The Data

For this example, I’m using data from FanGraphs. You can get the exact data set here, and it is also available on my GitHub. The data set contains players’ names, teams, seasons and stats. We can create a subset based on any one or more of these variables.

The Code

I’m going to show four different ways to subset data frames: using a boolean vector, using the which() function, using the subset() function and using the filter() function from the dplyr package. All of these are different ways to do the same thing. The dplyr package is fast and easy to code, which matters most when an operation sits inside a loop or will otherwise be run repeatedly. It is my recommended subsetting method, so let’s start with that.

dplyr

The filter() function requires the dplyr package to be loaded in your R session. Loading dplyr masks the filter() function from the default stats package. R prints a message about this when you load the package, but you don’t need to worry about it.

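#NL East teams (the same list used in the built-in examples below)
NL.East <- c('Marlins', 'Nationals', 'Mets', 'Braves', 'Phillies')

#Finds players not in the NL East and who have more than 30 home runs.
data.sub.5 <- filter(data, !(Team %in% NL.East), HR > 30)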

The filter() function is rather simple to use. In the example above, you specify the data frame you want to use and then write one or more true/false expressions, which filter() uses to decide which rows to keep. The output of the function is saved into a separate variable, so we can reuse the original data frame for other subsets. I put a few other examples in the full code to demonstrate how it works.
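To try filter() without downloading anything, here is a minimal sketch on a hypothetical stand-in for the FanGraphs data. The player names and numbers are invented, and only the `Team` and `HR` columns match names used in this post. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for the FanGraphs data; all values are invented.
data <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  Team = c("Marlins", "Cubs", "Dodgers", "Mets", "Cubs"),
  HR   = c(35, 41, 12, 33, 30),
  stringsAsFactors = FALSE
)
NL.East <- c("Marlins", "Nationals", "Mets", "Braves", "Phillies")

# Conditions separated by commas are combined with AND.
# dplyr::filter() is spelled out so the call cannot pick up stats::filter().
data.sub <- dplyr::filter(data, !(Team %in% NL.East), HR > 30)

print(data.sub)
```

Only player B is returned: A and D play in the NL East, C has too few home runs, and E has exactly 30, which does not satisfy `HR > 30`.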

Built-in Functions

#method 1 -- using a T/F vector
data.sub.1 <- data[data$Team == 'Marlins', ]

#method 2 -- which()
data.sub.2 <- data[which(data$Team == 'Marlins'), ]

#method 3 -- subset()
data.sub.3 <- subset(data, subset = (Team == 'Marlins'))

#other comparison functions
data.sub.4 <- data[data$HR > 30, ] #greater than
data.sub.5 <- data[data$HR < 30, ] #less than
data.sub.6 <- data[data$AVG > .320 & data$PA > 600, ] #dual requirements using AND (&)
data.sub.8 <- data[data$HR > 40 | data$SB > 30, ] #dual requirements using OR (|)
data.sub.9 <- data[data$Team %in% c('Marlins', 'Nationals', 'Mets', 'Braves', 'Phillies'), ] #finds values in a vector
data.sub.10 <- data[data$Team != '- - -', ] #removes players who played for two teams

If you don’t want to use the dplyr package, you can accomplish the same thing using the basic functionality of R. Method 1 uses a boolean vector to select rows for the subset. Method 2 uses the which() function, which returns the indices of the TRUE values in a boolean vector. Both techniques index into the original data frame by row to create the subset.
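One practical difference between the two indexing methods shows up when the column contains missing values. The sketch below uses an invented four-row data frame (not the FanGraphs data) with an NA in `Team` to illustrate it.

```r
# Invented example data with one missing team.
data <- data.frame(
  Name = c("A", "B", "C", "D"),
  Team = c("Marlins", NA, "Mets", "Marlins"),
  stringsAsFactors = FALSE
)

is.marlin <- data$Team == "Marlins"    # TRUE NA FALSE TRUE

# A boolean index containing NA produces a row filled with NA.
sub.bool <- data[is.marlin, ]

# which() returns only the positions of TRUE, so the NA is skipped.
sub.which <- data[which(is.marlin), ]

print(nrow(sub.bool))
print(nrow(sub.which))
```

Here the boolean-vector subset has three rows (one of them entirely NA), while the which() subset has only the two Marlins rows.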

The subset() function works much like the filter() function, except the syntax is slightly different and you don’t have to install a separate package.
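As a small illustration of that syntax, the sketch below compares subset() with the equivalent bracket indexing on an invented data frame. The values are made up, and the `select` argument shown at the end is an optional extra that this post does not otherwise use.

```r
# Invented example data.
data <- data.frame(
  Name = c("A", "B", "C", "D"),
  Team = c("Marlins", "Cubs", "Marlins", "Mets"),
  HR   = c(35, 41, 12, 33),
  stringsAsFactors = FALSE
)

# Inside subset(), column names are written bare; no data$ prefix is needed.
sub.a <- subset(data, Team == "Marlins" & HR > 30)

# The same rows using a boolean vector.
sub.b <- data[data$Team == "Marlins" & data$HR > 30, ]

# subset() can also keep only some columns via the select argument.
sub.c <- subset(data, HR > 30, select = c(Name, HR))

print(sub.a)
print(sub.c)
```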

Efficiency

While subset() works in a similar fashion to filter(), it doesn’t perform the same way. Some data manipulation happens only once or a few times in a project, but many projects subset constantly, often inside a loop. A gain that looks insignificant for a single run adds up quickly when it is multiplied over many runs.

I timed how long each of the four techniques took to run the same [complex] subset on a 500,000-row data frame.

Time to Subset 500,000 Rows

Subset Method      Elapsed Time (sec)
boolean vector     0.87
which()            0.33
subset()           0.81
dplyr filter()     0.21

The dplyr filter() function was the quickest of the four, which is why I prefer to use it.
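If you want to run a comparison like this yourself, here is a rough sketch using system.time(). It is not the benchmark behind the table above: the 500,000-row data frame `big` is synthetic, the condition is a simple stand-in for the subset I timed, and the numbers you get will depend on your machine and R version. The dplyr timing only runs if the package is installed.

```r
set.seed(1)
n <- 500000

# Synthetic data; team names and home run counts are randomly generated.
big <- data.frame(
  Team = sample(c("Marlins", "Mets", "Cubs", "Dodgers"), n, replace = TRUE),
  HR   = sample(0:50, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Time the same subset with each base R technique.
t.bool   <- system.time(s1 <- big[big$Team == "Marlins" & big$HR > 30, ])
t.which  <- system.time(s2 <- big[which(big$Team == "Marlins" & big$HR > 30), ])
t.subset <- system.time(s3 <- subset(big, Team == "Marlins" & HR > 30))

timings <- c(
  boolean = t.bool[["elapsed"]],
  which   = t.which[["elapsed"]],
  subset  = t.subset[["elapsed"]]
)

# Add dplyr::filter() only when dplyr is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  t.filter <- system.time(s4 <- dplyr::filter(big, Team == "Marlins", HR > 30))
  timings <- c(timings, filter = t.filter[["elapsed"]])
}

print(timings)
```

A single run is noisy, so repeating each call several times and averaging gives a fairer comparison.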

The full code I used to write up this tutorial is available on my GitHub.