8. How do I filter rows of a pandas DataFrame by column value? (video)¶

In [33]:

# read a dataset of top-rated IMDb movies into a DataFramemovies=pd.read_csv('http://bit.ly/imdbratings')movies.head()

Out[33]:

star_rating

title

content_rating

genre

duration

actors_list

0

9.3

The Shawshank Redemption

R

Crime

142

[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...

1

9.2

The Godfather

R

Crime

175

[u'Marlon Brando', u'Al Pacino', u'James Caan']

2

9.1

The Godfather: Part II

R

Crime

200

[u'Al Pacino', u'Robert De Niro', u'Robert Duv...

3

9.0

The Dark Knight

PG-13

Action

152

[u'Christian Bale', u'Heath Ledger', u'Aaron E...

4

8.9

Pulp Fiction

R

Crime

154

[u'John Travolta', u'Uma Thurman', u'Samuel L....

In [34]:

# examine the number of rows and columnsmovies.shape

Out[34]:

(979, 6)

Goal: Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes.

In [35]:

# create a list in which each element refers to a DataFrame row: True if the row satisfies the condition, False otherwisebooleans=[]forlengthinmovies.duration:iflength>=200:booleans.append(True)else:booleans.append(False)

In [36]:

# confirm that the list has the same length as the DataFramelen(booleans)

Out[36]:

979

In [37]:

# examine the first five list elementsbooleans[0:5]

Out[37]:

[False, False, True, False, False]

In [38]:

# convert the list to a Seriesis_long=pd.Series(booleans)is_long.head()

Out[38]:

0 False
1 False
2 True
3 False
4 False
dtype: bool

In [39]:

# use bracket notation with the boolean Series to tell the DataFrame which rows to displaymovies[is_long]

Out[39]:

star_rating

title

content_rating

genre

duration

actors_list

2

9.1

The Godfather: Part II

R

Crime

200

[u'Al Pacino', u'Robert De Niro', u'Robert Duv...

7

8.9

The Lord of the Rings: The Return of the King

PG-13

Adventure

201

[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...

17

8.7

Seven Samurai

UNRATED

Drama

207

[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...

78

8.4

Once Upon a Time in America

R

Crime

229

[u'Robert De Niro', u'James Woods', u'Elizabet...

85

8.4

Lawrence of Arabia

PG

Adventure

216

[u"Peter O'Toole", u'Alec Guinness', u'Anthony...

142

8.3

Lagaan: Once Upon a Time in India

PG

Adventure

224

[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...

157

8.2

Gone with the Wind

G

Drama

238

[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...

204

8.1

Ben-Hur

G

Adventure

212

[u'Charlton Heston', u'Jack Hawkins', u'Stephe...

445

7.9

The Ten Commandments

APPROVED

Adventure

220

[u'Charlton Heston', u'Yul Brynner', u'Anne Ba...

476

7.8

Hamlet

PG-13

Drama

242

[u'Kenneth Branagh', u'Julie Christie', u'Dere...

630

7.7

Malcolm X

PG-13

Biography

202

[u'Denzel Washington', u'Angela Bassett', u'De...

767

7.6

It's a Mad, Mad, Mad, Mad World

APPROVED

Action

205

[u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

In [40]:

# simplify the steps above: no need to write a for loop to create 'is_long' since pandas will broadcast the comparisonis_long=movies.duration>=200movies[is_long]# or equivalently, write it in one line (no need to create the 'is_long' object)movies[movies.duration>=200]

Out[40]:

star_rating

title

content_rating

genre

duration

actors_list

2

9.1

The Godfather: Part II

R

Crime

200

[u'Al Pacino', u'Robert De Niro', u'Robert Duv...

7

8.9

The Lord of the Rings: The Return of the King

PG-13

Adventure

201

[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...

17

8.7

Seven Samurai

UNRATED

Drama

207

[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...

78

8.4

Once Upon a Time in America

R

Crime

229

[u'Robert De Niro', u'James Woods', u'Elizabet...

85

8.4

Lawrence of Arabia

PG

Adventure

216

[u"Peter O'Toole", u'Alec Guinness', u'Anthony...

142

8.3

Lagaan: Once Upon a Time in India

PG

Adventure

224

[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...

157

8.2

Gone with the Wind

G

Drama

238

[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...

204

8.1

Ben-Hur

G

Adventure

212

[u'Charlton Heston', u'Jack Hawkins', u'Stephe...

445

7.9

The Ten Commandments

APPROVED

Adventure

220

[u'Charlton Heston', u'Yul Brynner', u'Anne Ba...

476

7.8

Hamlet

PG-13

Drama

242

[u'Kenneth Branagh', u'Julie Christie', u'Dere...

630

7.7

Malcolm X

PG-13

Biography

202

[u'Denzel Washington', u'Angela Bassett', u'De...

767

7.6

It's a Mad, Mad, Mad, Mad World

APPROVED

Action

205

[u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

In [41]:

# select the 'genre' Series from the filtered DataFramemovies[movies.duration>=200].genre# or equivalently, use the 'loc' methodmovies.loc[movies.duration>=200,'genre']

9. How do I apply multiple filter criteria to a pandas DataFrame? (video)¶

In [42]:

# read a dataset of top-rated IMDb movies into a DataFramemovies=pd.read_csv('http://bit.ly/imdbratings')movies.head()

Out[42]:

star_rating

title

content_rating

genre

duration

actors_list

0

9.3

The Shawshank Redemption

R

Crime

142

[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...

1

9.2

The Godfather

R

Crime

175

[u'Marlon Brando', u'Al Pacino', u'James Caan']

2

9.1

The Godfather: Part II

R

Crime

200

[u'Al Pacino', u'Robert De Niro', u'Robert Duv...

3

9.0

The Dark Knight

PG-13

Action

152

[u'Christian Bale', u'Heath Ledger', u'Aaron E...

4

8.9

Pulp Fiction

R

Crime

154

[u'John Travolta', u'Uma Thurman', u'Samuel L....

In [43]:

# filter the DataFrame to only show movies with a 'duration' of at least 200 minutesmovies[movies.duration>=200]

Out[43]:

star_rating

title

content_rating

genre

duration

actors_list

2

9.1

The Godfather: Part II

R

Crime

200

[u'Al Pacino', u'Robert De Niro', u'Robert Duv...

7

8.9

The Lord of the Rings: The Return of the King

PG-13

Adventure

201

[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...

17

8.7

Seven Samurai

UNRATED

Drama

207

[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...

78

8.4

Once Upon a Time in America

R

Crime

229

[u'Robert De Niro', u'James Woods', u'Elizabet...

85

8.4

Lawrence of Arabia

PG

Adventure

216

[u"Peter O'Toole", u'Alec Guinness', u'Anthony...

142

8.3

Lagaan: Once Upon a Time in India

PG

Adventure

224

[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...

157

8.2

Gone with the Wind

G

Drama

238

[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...

204

8.1

Ben-Hur

G

Adventure

212

[u'Charlton Heston', u'Jack Hawkins', u'Stephe...

445

7.9

The Ten Commandments

APPROVED

Adventure

220

[u'Charlton Heston', u'Yul Brynner', u'Anne Ba...

476

7.8

Hamlet

PG-13

Drama

242

[u'Kenneth Branagh', u'Julie Christie', u'Dere...

630

7.7

Malcolm X

PG-13

Biography

202

[u'Denzel Washington', u'Angela Bassett', u'De...

767

7.6

It's a Mad, Mad, Mad, Mad World

APPROVED

Action

205

[u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

Understanding logical operators:

and: True only if both sides of the operator are True

or: True if either side of the operator is True

In [44]:

# demonstration of the 'and' operatorprint(TrueandTrue)print(TrueandFalse)print(FalseandFalse)

True
False
False

In [45]:

# demonstration of the 'or' operatorprint(TrueorTrue)print(TrueorFalse)print(FalseorFalse)

True
True
False

Rules for specifying multiple filter criteria in pandas:

use & instead of and

use | instead of or

add parentheses around each condition to specify evaluation order

Goal: Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama'

In [46]:

# CORRECT: use the '&' operator to specify that both conditions are requiredmovies[(movies.duration>=200)&(movies.genre=='Drama')]

Out[46]:

star_rating

title

content_rating

genre

duration

actors_list

17

8.7

Seven Samurai

UNRATED

Drama

207

[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...

157

8.2

Gone with the Wind

G

Drama

238

[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...

476

7.8

Hamlet

PG-13

Drama

242

[u'Kenneth Branagh', u'Julie Christie', u'Dere...

In [47]:

# INCORRECT: using the '|' operator would have shown movies that are either long or dramas (or both)movies[(movies.duration>=200)|(movies.genre=='Drama')].head()

Out[47]:

star_rating

title

content_rating

genre

duration

actors_list

2

9.1

The Godfather: Part II

R

Crime

200

[u'Al Pacino', u'Robert De Niro', u'Robert Duv...

5

8.9

12 Angry Men

NOT RATED

Drama

96

[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...

7

8.9

The Lord of the Rings: The Return of the King

PG-13

Adventure

201

[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...

9

8.9

Fight Club

R

Drama

139

[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...

13

8.8

Forrest Gump

PG-13

Drama

142

[u'Tom Hanks', u'Robin Wright', u'Gary Sinise']

Goal: Filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action'

In [48]:

# use the '|' operator to specify that a row can match any of the three criteriamovies[(movies.genre=='Crime')|(movies.genre=='Drama')|(movies.genre=='Action')].head(10)# or equivalently, use the 'isin' methodmovies[movies.genre.isin(['Crime','Drama','Action'])].head(10)

# specify which columns to include by nameufo=pd.read_csv('http://bit.ly/uforeports',usecols=['City','State'])# or equivalently, specify columns by positionufo=pd.read_csv('http://bit.ly/uforeports',usecols=[0,4])ufo.columns

Out[50]:

Index([u'City', u'Time'], dtype='object')

Question: When reading from a file, how do I read in only a subset of the rows?

In [51]:

# specify how many rows to readufo=pd.read_csv('http://bit.ly/uforeports',nrows=3)ufo

# save the DataFrame of dummy variables and concatenate them to the original DataFrameembarked_dummies=pd.get_dummies(train.Embarked,prefix='Embarked').iloc[:,1:]train=pd.concat([train,embarked_dummies],axis=1)train.head()

# read a dataset of movie reviewers into a DataFrameuser_cols=['user_id','age','gender','occupation','zip_code']users=pd.read_table('http://bit.ly/movieusers',sep='|',header=None,names=user_cols,index_col='user_id')users.head()

Out[251]:

age

gender

occupation

zip_code

user_id

1

24

M

technician

85711

2

53

F

other

94043

3

23

M

writer

32067

4

24

M

technician

43537

5

33

F

other

15213

In [252]:

users.shape

Out[252]:

(943, 4)

In [253]:

# detect duplicate zip codes: True if an item is identical to a previous itemusers.zip_code.duplicated().tail()

c:\Users\Kevin\Anaconda\lib\site-packages\pandas\core\generic.py:2701: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value

Problem: That statement involves two operations, a __getitem__ and a __setitem__. pandas can't guarantee whether the __getitem__ operation returns a view or a copy of the data.

If __getitem__ returns a view of the data, __setitem__ will affect the 'movies' DataFrame.

But if __getitem__ returns a copy of the data, __setitem__ will not affect the 'movies' DataFrame.

In [272]:

# the 'content_rating' Series has not changedmovies.content_rating.isnull().sum()

Out[272]:

3

Solution: Use the loc method, which replaces the 'NOT RATED' values in a single __setitem__ operation.

# create a DataFrame only containing movies with a high 'star_rating'top_movies=movies.loc[movies.star_rating>=9,:]top_movies

Out[275]:

star_rating

title

content_rating

genre

duration

actors_list

0

9.3

The Shawshank Redemption

R

Crime

142

[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...

1

9.2

The Godfather

R

Crime

175

[u'Marlon Brando', u'Al Pacino', u'James Caan']

2

9.1

The Godfather: Part II

R

Crime

200

[u'Al Pacino', u'Robert De Niro', u'Robert Duv...

3

9.0

The Dark Knight

PG-13

Action

152

[u'Christian Bale', u'Heath Ledger', u'Aaron E...

Goal: Fix the 'duration' for 'The Shawshank Redemption'.

In [276]:

# overwrite the relevant cell with the correct durationtop_movies.loc[0,'duration']=150

c:\Users\Kevin\Anaconda\lib\site-packages\pandas\core\indexing.py:465: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s

Problem: pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'.

In [277]:

# 'top_movies' DataFrame has been updatedtop_movies

Out[277]:

star_rating

title

content_rating

genre

duration

actors_list

0

9.3

The Shawshank Redemption

R

Crime

150

[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...

1

9.2

The Godfather

R

Crime

175

[u'Marlon Brando', u'Al Pacino', u'James Caan']

2

9.1

The Godfather: Part II

R

Crime

200

[u'Al Pacino', u'Robert De Niro', u'Robert Duv...

3

9.0

The Dark Knight

PG-13

Action

152

[u'Christian Bale', u'Heath Ledger', u'Aaron E...

In [278]:

# 'movies' DataFrame has not been updatedmovies.head(1)

Out[278]:

star_rating

title

content_rating

genre

duration

actors_list

0

9.3

The Shawshank Redemption

R

Crime

142

[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...

Solution: Any time you are attempting to create a DataFrame copy, use the copy method.

# view the option descriptions (including the default and current values)pd.describe_option()

display.chop_threshold : float or None
if set to a float value, all float values smaller then the given threshold
will be displayed as exactly 0 by repr and friends.
[default: None] [currently: None]
display.colheader_justify : 'left'/'right'
Controls the justification of column headers. used by DataFrameFormatter.
[default: right] [currently: right]
display.column_space No description available.
[default: 12] [currently: 12]
display.date_dayfirst : boolean
When True, prints and parses dates with the day first, eg 20/01/2005
[default: False] [currently: False]
display.date_yearfirst : boolean
When True, prints and parses dates with the year first, eg 2005/01/20
[default: False] [currently: False]
display.encoding : str/unicode
Defaults to the detected encoding of the console.
Specifies the encoding to be used for strings returned by to_string,
these are generally strings meant to be displayed on the console.
[default: UTF-8] [currently: UTF-8]
display.expand_frame_repr : boolean
Whether to print out the full DataFrame repr for wide DataFrames across
multiple lines, `max_columns` is still respected, but the output will
wrap-around across multiple "pages" if its width exceeds `display.width`.
[default: True] [currently: True]
display.float_format : callable
The callable should accept a floating point number and return
a string with the desired format of the number. This is used
in some places like SeriesFormatter.
See formats.format.EngFormatter for an example.
[default: None] [currently: <built-in method format of str object at 0x000000000CAB1F58>]
display.height : int
Deprecated.
[default: 60] [currently: 60]
(Deprecated, use `display.max_rows` instead.)
display.large_repr : 'truncate'/'info'
For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
show a truncated table (the default from 0.13), or switch to the view from
df.info() (the behaviour in earlier versions of pandas).
[default: truncate] [currently: truncate]
display.latex.escape : bool
This specifies if the to_latex method of a Dataframe uses escapes special
characters.
method. Valid values: False,True
[default: True] [currently: True]
display.latex.longtable :bool
This specifies if the to_latex method of a Dataframe uses the longtable
format.
method. Valid values: False,True
[default: False] [currently: False]
display.latex.repr : boolean
Whether to produce a latex DataFrame representation for jupyter
environments that support it.
(default: False)
[default: False] [currently: False]
display.line_width : int
Deprecated.
[default: 80] [currently: 80]
(Deprecated, use `display.width` instead.)
display.max_categories : int
This sets the maximum number of categories pandas should output when
printing out a `Categorical` or a Series of dtype "category".
[default: 8] [currently: 8]
display.max_columns : int
If max_cols is exceeded, switch to truncate view. Depending on
`large_repr`, objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited.
In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the width of the terminal and print a truncated object which fits
the screen width. The IPython notebook, IPython qtconsole, or IDLE
do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 20] [currently: 20]
display.max_colwidth : int
The maximum width in characters of a column in the repr of
a pandas data structure. When the column overflows, a "..."
placeholder is embedded in the output.
[default: 50] [currently: 1000]
display.max_info_columns : int
max_info_columns is used in DataFrame.info method to decide if
per column information will be printed.
[default: 100] [currently: 100]
display.max_info_rows : int or None
df.info() will usually show null-counts for each column.
For large frames this can be quite slow. max_info_rows and max_info_cols
limit this null check only to frames with smaller dimensions than
specified.
[default: 1690785] [currently: 1690785]
display.max_rows : int
If max_rows is exceeded, switch to truncate view. Depending on
`large_repr`, objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited.
In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the height of the terminal and print a truncated object which fits
the screen height. The IPython notebook, IPython qtconsole, or
IDLE do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 60] [currently: 60]
display.max_seq_items : int or None
when pretty-printing a long sequence, no more then `max_seq_items`
will be printed. If items are omitted, they will be denoted by the
addition of "..." to the resulting string.
If set to None, the number of items to be printed is unlimited.
[default: 100] [currently: 100]
display.memory_usage : bool, string or None
This specifies if the memory usage of a DataFrame should be displayed when
df.info() is called. Valid values True,False,'deep'
[default: True] [currently: True]
display.mpl_style : bool
Setting this to 'default' will modify the rcParams used by matplotlib
to give plots a more pleasing visual style by default.
Setting this to None/False restores the values to their initial value.
[default: None] [currently: None]
display.multi_sparse : boolean
"sparsify" MultiIndex display (don't display repeated
elements in outer levels within groups)
[default: True] [currently: True]
display.notebook_repr_html : boolean
When True, IPython notebook will use html representation for
pandas objects (if it is available).
[default: True] [currently: True]
display.pprint_nest_depth : int
Controls the number of nested levels to process when pretty-printing
[default: 3] [currently: 3]
display.precision : int
Floating point output precision (number of significant digits). This is
only a suggestion
[default: 6] [currently: 2]
display.show_dimensions : boolean or 'truncate'
Whether to print out dimensions at the end of DataFrame repr.
If 'truncate' is specified, only print out the dimensions if the
frame is truncated (e.g. not display all rows and/or columns)
[default: truncate] [currently: truncate]
display.unicode.ambiguous_as_wide : boolean
Whether to use the Unicode East Asian Width to calculate the display text
width.
Enabling this may affect to the performance (default: False)
[default: False] [currently: False]
display.unicode.east_asian_width : boolean
Whether to use the Unicode East Asian Width to calculate the display text
width.
Enabling this may affect to the performance (default: False)
[default: False] [currently: False]
display.width : int
Width of the display in characters. In case python/IPython is running in
a terminal this can be set to None and pandas will correctly auto-detect
the width.
Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
terminal and hence it is not possible to correctly detect the width.
[default: 80] [currently: 80]
io.excel.xls.writer : string
The default Excel writer engine for 'xls' files. Available options:
'xlwt' (the default).
[default: xlwt] [currently: xlwt]
io.excel.xlsm.writer : string
The default Excel writer engine for 'xlsm' files. Available options:
'openpyxl' (the default).
[default: openpyxl] [currently: openpyxl]
io.excel.xlsx.writer : string
The default Excel writer engine for 'xlsx' files. Available options:
'xlsxwriter' (the default), 'openpyxl'.
[default: xlsxwriter] [currently: xlsxwriter]
io.hdf.default_format : format
default format writing format, if None, then
put will default to 'fixed' and append will default to 'table'
[default: None] [currently: None]
io.hdf.dropna_table : boolean
drop ALL nan rows when appending to a table
[default: False] [currently: False]
mode.chained_assignment : string
Raise an exception, warn, or no action if trying to use chained assignment,
The default is warn
[default: warn] [currently: warn]
mode.sim_interactive : boolean
Whether to simulate interactive mode for purposes of testing
[default: False] [currently: False]
mode.use_inf_as_null : boolean
True means treat None, NaN, INF, -INF as null (old way),
False means None and NaN are null, but INF, -INF are not null
(new way).
[default: False] [currently: False]

In [296]:

# search for specific options by namepd.describe_option('rows')

display.max_info_rows : int or None
df.info() will usually show null-counts for each column.
For large frames this can be quite slow. max_info_rows and max_info_cols
limit this null check only to frames with smaller dimensions than
specified.
[default: 1690785] [currently: 1690785]
display.max_rows : int
If max_rows is exceeded, switch to truncate view. Depending on
`large_repr`, objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited.
In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the height of the terminal and print a truncated object which fits
the screen height. The IPython notebook, IPython qtconsole, or
IDLE do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 60] [currently: 60]

# reset all of the options to their default valuespd.reset_option('all')

height has been deprecated.
line_width has been deprecated, use display.width instead (currently both are
identical)

c:\Users\Kevin\Anaconda\lib\site-packages\ipykernel\__main__.py:2: FutureWarning:
mpl_style had been deprecated and will be removed in a future version.
Use `matplotlib.pyplot.style.use` instead.
from ipykernel import kernelapp as app