Group By: split-apply-combine

By “group by” we are referring to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria

Applying a function to each group independently

Combining the results into a data structure
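The three steps can be sketched by hand as follows. This is a minimal illustration, not how pandas implements grouping internally; the DataFrame, its column names ("key", "value"), and the variable names are assumptions made for the example.

```python
import pandas as pd

# Illustrative data; the column names and values are assumptions.
df = pd.DataFrame(
    {
        "key": ["a", "b", "a", "b", "a"],
        "value": [1, 2, 3, 4, 5],
    }
)

# Split: iterating over a GroupBy object yields (group name, sub-frame) pairs.
pieces = {name: group for name, group in df.groupby("key")}

# Apply: compute a function on each piece independently.
sums = {name: group["value"].sum() for name, group in pieces.items()}

# Combine: assemble the per-group results into a single data structure.
manual = pd.Series(sums, name="value")

# The built-in aggregation performs all three steps in one call.
builtin = df.groupby("key")["value"].sum()
```

Both manual and builtin hold one sum per distinct key. Doing the split step yourself is mainly useful when the per-group work does not fit a built-in method.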

Of these, the split step is the most straightforward. In fact, in many situations you may wish to split the data set into groups and do something with those groups yourself. In the apply step, we might wish to do one of the following:

Aggregation: computing a summary statistic (or statistics) about each group. Some examples:

Compute group sums or means

Compute group sizes / counts

Transformation: perform some group-specific computations and return a like-indexed object. Some examples:

Standardizing data (z-score) within a group

Filling NAs within groups with a value derived from each group

Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

Discarding data that belongs to groups with only a few members

Filtering out data based on the group sum or mean

Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into any of the above three categories.
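The kinds of apply step listed above can be illustrated with a small sketch. The data and the column names ("key", "value") are assumptions chosen for the example; the methods used (mean, size, transform, filter) are standard GroupBy methods.

```python
import pandas as pd

# Illustrative data; group "c" deliberately has a single member.
df = pd.DataFrame(
    {
        "key": ["a", "a", "a", "b", "b", "c"],
        "value": [1.0, 2.0, 3.0, 10.0, 30.0, 100.0],
    }
)
grouped = df.groupby("key")["value"]

# Aggregation: one summary value per group.
group_means = grouped.mean()
group_sizes = grouped.size()

# Transformation: the result is indexed like the original object.
# Here each value is centered on the mean of its own group.
centered = grouped.transform(lambda x: x - x.mean())

# Filtration: keep only rows that belong to groups with at least two members.
kept = df.groupby("key").filter(lambda g: len(g) >= 2)
```

Note the difference in shape: group_means and group_sizes have one entry per group, centered has one entry per original row, and kept is a subset of the original rows.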

Since the set of object instance methods on pandas data structures is generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools), in which grouped computations such as a SQL GROUP BY query can be written directly.
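For instance, a SQL query that groups on two columns and aggregates two others (shown in the comment below) might be expressed in pandas roughly as follows. The table name, column names, and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical table; the column names mirror a generic SQL query such as
#   SELECT Column1, Column2, AVG(Column3), SUM(Column4)
#   FROM SomeTable
#   GROUP BY Column1, Column2
some_table = pd.DataFrame(
    {
        "Column1": ["x", "x", "y", "y"],
        "Column2": ["p", "p", "p", "q"],
        "Column3": [1.0, 3.0, 5.0, 7.0],
        "Column4": [10, 20, 30, 40],
    }
)

# Group on both key columns, then apply a different aggregation per column.
result = (
    some_table.groupby(["Column1", "Column2"])
    .agg({"Column3": "mean", "Column4": "sum"})
    .reset_index()  # turn the group keys back into ordinary columns
)
```

The reset_index call is optional; without it the group keys form the index of the result, which is often more convenient for further pandas operations.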

Splitting an object into groups

pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), call groupby with a specification of that mapping. Among the ways the mapping can be specified:

For DataFrame objects, a string indicating a column to be used to group. Of course df.groupby('A') is just syntactic sugar for df.groupby(df['A']), but it makes life simpler.

For DataFrame objects, a string indicating an index level to be used to group.

A list of any of the above things

Collectively we refer to the grouping objects as the keys. For example, consider the following DataFrame:
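A minimal sketch of such a DataFrame and of a few kinds of keys is given here. The column names ("A", "B", "C") and the values are assumptions made for illustration.

```python
import pandas as pd

# Illustrative DataFrame with two label columns and one numeric column.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo"],
        "B": ["one", "one", "two", "two", "one"],
        "C": [1, 2, 3, 4, 5],
    }
)

# A column name as the key, and the equivalent explicit Series key.
by_name = df.groupby("A")["C"].sum()
by_series = df.groupby(df["A"])["C"].sum()

# A list of keys groups on every combination that occurs in the data.
by_both = df.groupby(["A", "B"])["C"].sum()
```

Here by_name and by_series contain the same sums, and by_both is indexed by (A, B) pairs.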

Note

New in version 0.20.

A string passed to groupby may refer to either a column or an index level. If a string matches both a column name and an index level name then a warning is issued and the column takes precedence. This will result in an ambiguity error in a future version.
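The following sketch shows a string key resolving to an index level in one call and to a column in another. The level names ("outer", "inner") and column names ("kind", "value") are assumptions; they are deliberately distinct so that the ambiguous case described above does not arise.

```python
import pandas as pd

# Illustrative frame with a named two-level index and two columns.
index = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["outer", "inner"]
)
df = pd.DataFrame(
    {"kind": ["x", "y", "x", "y"], "value": [1, 2, 3, 4]}, index=index
)

# "outer" names an index level and "kind" names a column; both work as keys.
by_level = df.groupby("outer")["value"].sum()
by_column = df.groupby("kind")["value"].sum()

# A list of keys may mix index level names and column names.
by_mixed = df.groupby(["outer", "kind"])["value"].sum()
```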

pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group and thus the output of aggregation functions will only contain unique index values:
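A minimal sketch of this behaviour, using a Series with illustrative values whose index labels each appear twice:

```python
import pandas as pd

# The index labels 1, 2, 3 each occur twice.
s = pd.Series([1, 2, 3, 10, 20, 30], index=[1, 2, 3, 1, 2, 3])

# Group on the (non-unique) index itself.
grouped = s.groupby(level=0)

first = grouped.first()  # first value seen for each label
last = grouped.last()    # last value seen for each label
total = grouped.sum()    # sum of all values sharing a label
```

Each result has the index [1, 2, 3], with one entry per distinct label regardless of how many rows carried that label.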