Pandas for data.table Users

December 2018 · 15 minute read

R and Python are both great languages for data analysis. While they are remarkably
similar in some aspects, they are drastically different in others. In this post,
I will focus on the similarities and differences between Pandas and data.table,
two of the most prominent data manipulation packages in Python/R.

There is alreay an excellent post that nicely compares basic data manipulations by Fisseha Berhane,
which you can find on this website

I want to focus on how more advanced data.table operations map to Pandas,
such as inplace modifications etc.

Let’s get started by setting up our environment. I will use the reticulate
package to work with both R and Python in tandem.

Assignment: copy vs in-place

R is a functional language and therefore data is generally immutable, e.g. ‘changing’ a
dataframe is not possible. R will copy the original dataframe, apply your updates
and store it as a new dataframe. While immutable data makes it easier to reason
about your code, it takes more time and memory to perform computations compared to
modifying data in-place. This is one of the primary reasons why data.table is
so blazzing fast and memory efficient. It avoids unnecessary copies and directly
modifies your original data.frame (see my post “Using data.table deep copy” for more
infos).

So how can add a column by reference?

New column: in-place & copy

We will start by adding a new column that is simply a multiple of an existing one:

In Pandas transform is used to broadcast a groupby result back to the original
dataframe, i.e. transform is similar to SQL window functions. I used a lambda
function in my code above, but you could simply write transform('sum') in this
simple case as well.

Unfortunately, transform works on columns in sequence, i.e. first column a is
passed to our lambda function and then column b. So this code will fail:

# If you want to use character vectors:
# setkeyv(df_r, 'a')
# Query data.table based on key
df_index[.(1:3), .(a, b, c)]

## a b c
## 1: 1 7 q
## 2: 2 8 q
## 3: 3 9 q

Setting a key in data.table physically reorders the rows by reference in increasing order
and sets a sorted attribute for the key columns. So setting a key is equivalent to
creating a clustered columnstore index in SQL Server. You can also generate non-clustered
indices like so:

df_index = r.df_r
# Set index on existing DataFrame:
# Alternatively, you can include an index when you create the DataFrame
df_index.set_index(keys=['a'], drop=True, inplace=True)
# Check if index is set correctly:
print("Check index: \n", df_index, '\n')
# Filter with index

Chaining

Chaining operations can make code more readable. If you are coming from the tidyverse
fear not, data.table also works with magrittr:)

library(magrittr)
df_r[,.(b, c)] %>%
.[b > 9]

## b c
## 1: 10 q
## 2: 11 w
## 3: 12 w

You can also chain method calls in Pandas using \\. Alternatively, you can skip the
backslash and put the entire block in ().

df_res = df_py[['b', 'c']] \
.loc[df_py['b'] > 9]
print(df_res)

## b c
## 3 10 q
## 4 11 w
## 5 12 w

Do by group

In this section I want to show you how you can conveniently run calculations per group. Imagine
you want to fit a model per group. One way to do it is to loop over the entire data.table and
filter each run by the respective group. More convenient in my opinion is the following structure:

df_list = df_r[,list(data = list(.SD)), by = 'c']
df_list

## c data
## 1: q <data.table>
## 2: w <data.table>

We created a new data.table where we collapsed the data per group into a list. This has a couple
of advantages:

You can calculate rolling statistics for multiple columns in one function call by
using .SD. Note that shift() and rollapply() take the ordering of your columns
as given!

In Pandas, it is not as straight forward I am afraid, because .shift() and .rolling()
behave differently. While .shift() does not create a new index, .rolling() does. This
means that you need to be careful when you assign the results from .rolling() back to
your dataframe (check if the indices match!). To complicate things further, you cannot simply chain .shift(n).rolling(m), because .shift() will remove the grouping and therefore .rolling() results can be incorrect.

In Pandas the rolling method also supports a time period offset (only valid for datetimelike indices). By default, Pandas rolling method uses aligns the series to the right, but centering is also possible. Partial application (i.e. using a smaller window than specified at the beginning) is not possible at the moment.

Some final notes

However, while data.table is very explicit about when assignment is made by reference
(:= and all set* functions) Pandas is less so. There is actually no guarantee that Pandas
performs operations in-place, even if you specify inplace=True, see for example this stackoverflow thread.

Also note that modifying data in-place can make your operations more difficult to
understand/debug.

If you are coming from R, be careful when you assign one Pandas DataFrame column to
another DataFrame. Pandas will match the columns based on indices!

I hope you found this post useful! If you find any errors, please open an issue on Github.