Fast-Pandas: Benchmark for different operations in pandas against various dataframe sizes.

Fast Pandas

A Benchmarked Pandas Cheat Sheet

Pandas is one of the most flexible and powerful tools available for data scientists and developers. Because it is so flexible, a given task can often be performed in several ways. This project aims to benchmark the different available methods in such situations; moreover, there is a special section for functions found in both numpy and pandas.

Rev 2 changes:

Added NaN handling functions to numpy benchmarks.

Performed Numpy benchmarks on ndarrays (previously they were only tested on pandas Series).

Tested df.values for looping through dataframe rows.

Introduction:

This project is intended not only to show the obtained results but also to provide others with a simple method for benchmarking different operations and sharing their results.

The first parameter passed to the class constructor is df_generator, which is simply a function that generates a random dataframe. This function has to be defined in terms of df_size so that dataframes of increasing size are generated. The second parameter is the list of functions to be evaluated, while the last one is the title of the resulting plot.
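To make the three parameters concrete, here is a minimal, self-contained sketch of the same idea. It is not the project's class: run_benchmark, make_df, size_powers and loops are hypothetical names chosen for illustration, and the timing is done with the standard timeit module.

```python
import timeit

import numpy as np
import pandas as pd


def make_df(df_size):
    # Random dataframe whose number of rows is controlled by df_size.
    rng = np.random.default_rng(0)
    return pd.DataFrame(rng.integers(1, 100, size=(df_size, 2)), columns=["A", "B"])


def pandas_sum(df):
    return df["A"].sum()


def numpy_sum(df):
    return np.sum(df["A"].values)


def run_benchmark(df_generator, functions, title, size_powers=(2, 3, 4), loops=20):
    # Hypothetical stand-in for the benchmark class described above.
    results = {func.__name__: [] for func in functions}
    for power in size_powers:
        df = df_generator(10 ** power)  # one dataframe per size
        for func in functions:
            total = timeit.timeit(lambda: func(df), number=loops)
            results[func.__name__].append(total / loops)  # average seconds per call
    return {"title": title, "sizes": [10 ** p for p in size_powers], "times": results}


report = run_benchmark(make_df, [pandas_sum, numpy_sum], "Pandas Sum vs Numpy Sum")
```

The returned dictionary holds one list of average runtimes per function, in the same order as the dataframe sizes, which is all that is needed to draw a plot of time against size.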

Calling plot_results() will show and save a plot like the one shown below containing two subplots:

The first subplot shows the average time each function took to run against different dataframe sizes. Note that this is a semilog plot, i.e. the y-axis is shown in log scale.

The second subplot shows how other functions performed with respect to the first function.
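One way to express such a relative comparison is to divide each function's runtimes by those of the first function. The sketch below assumes that direction of the ratio (other divided by first); the function names and the timing values are made up for illustration and are not measurements.

```python
def relative_to_first(times):
    """times maps a function name to its average runtimes, one per dataframe size."""
    names = list(times)
    baseline = times[names[0]]  # the first function is the reference
    return {
        name: [t / b for t, b in zip(times[name], baseline)]
        for name in names[1:]
    }


# Made-up runtimes in microseconds, only to exercise the helper.
times = {"func_a": [20, 40, 1000], "func_b": [10, 40, 500]}
ratios = relative_to_first(times)
```

With this convention a value below 1 means the function was faster than the first one at that dataframe size, and a value above 1 means it was slower.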

You can clearly see that pandas sum is slightly faster than numpy sum for dataframes below one million rows. This is quite surprising: shouldn't a pandas function have more Python overhead and be much slower? Well, not exactly; check out the second section to learn more.

Results Summary:

[1] The method df.values is very fast; however, it consumes a lot of memory. The itertuples method comes second in performance and is recommended in most cases.
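For reference, the two looping styles mentioned here look roughly like this. The dataframe and the row operation (summing A * B) are assumptions made for illustration, not the benchmark's actual workload.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})


def loop_values(df):
    # df.values materializes the whole frame as a single ndarray before looping.
    total = 0
    for a, b in df.values:
        total += a * b
    return total


def loop_itertuples(df):
    # itertuples yields one namedtuple per row without building that array.
    total = 0
    for row in df.itertuples(index=False):
        total += row.A * row.B
    return total
```

Both functions return the same total; the difference is in how the rows are produced. Note also that df.values on a frame with mixed column types has to upcast everything to a common dtype, which can add to the memory cost.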

eval_method shows interesting, erratic behavior that I could not explain; however, I repeated the test several times with different mathematical operations and reproduced the same results every time.

2 - Pandas vs Numpy.

A few general notes regarding this section:

There are four different ways of calling most functions here, namely: df["A"].func(), np.func(df["A"]), np.func(df["A"].values), and np.nanfunc(df["A"].values).

np.func(df["A"]) would call df["A"].func() if the latter is defined; thus, it is always slower than calling df["A"].func() directly. This was pointed out by u/aajjccrr.

np.func(df["A"].values) is the fastest when your dataset has no NaNs.

df["A"].func() is faster than np.nanfunc(df["A"].values), and hence it is generally recommended to use it.
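The four call styles from the notes above, written out with sum as the example function. The small column containing one NaN is an assumption added to show why the NaN-aware variants matter; this is only a sketch of the call forms, not a timing comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan, 4.0]})

series_result = df["A"].sum()          # pandas method, skips NaN by default
np_on_series = np.sum(df["A"])         # numpy function handed a Series
np_on_values = np.sum(df["A"].values)  # numpy function on the raw ndarray
nan_aware = np.nansum(df["A"].values)  # numpy NaN-aware variant on the ndarray
```

On this column the plain numpy call on the ndarray returns NaN, while the pandas method and np.nansum both ignore the missing value. That is why np.func(df["A"].values) is only a safe choice when the data has no NaNs.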

This section tests the performance of functions that are found in both numpy and pandas.

The same behavior observed in sum appears here; however, pandas neither outperforms nor even approaches numpy for large dataframes.

Extra notes:

Extra parameters:

The class constructor has three other optional parameters:

"user_df_size_powers": List[int] containing the log10(sizes) of the test_dfs
"user_loop_size_powers": List[int] containing the log10(sizes) of the loops_sizes
"largest_df_single_test" (default = True)

You can pass custom sizes for the dataframes and loops used in benchmarking. This is suggested when there seems to be noise in the results, i.e. when you are unable to maintain consistency over different runs.
The third parameter, largest_df_single_test, is set to True by default, since the last dataframe has 100 million rows and, for some operations, even a single run takes a long time to complete.
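Since both lists hold log10 values, the actual sizes are powers of ten. The snippet below only illustrates that conversion; the particular powers, and the choice to pair larger dataframes with fewer loops, are assumptions rather than the project's defaults.

```python
# Hypothetical values: each entry is log10 of the size it stands for.
user_df_size_powers = [2, 3, 4]
user_loop_size_powers = [3, 2, 1]

df_sizes = [10 ** p for p in user_df_size_powers]      # rows per test dataframe
loop_sizes = [10 ** p for p in user_loop_size_powers]  # repetitions per dataframe

# Pair each dataframe size with the number of repetitions used to time it.
plan = list(zip(df_sizes, loop_sizes))
```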

Warnings:

The benchmarker will warn you if the results returned by the evaluated functions are not identical. You might not need to worry about that, as was shown in the benchmarking of the np.unique function above.
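One plausible reason for such a harmless mismatch is ordering: the pandas unique method returns values in order of first appearance, whereas np.unique returns them sorted. The check below is a sketch of that situation and is not the benchmarker's own comparison code.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 3, 2, 1])

pandas_unique = s.unique()          # order of first appearance
numpy_unique = np.unique(s.values)  # sorted ascending

# Element-by-element comparison fails, yet both hold the same set of values.
identical = np.array_equal(pandas_unique, numpy_unique)
same_members = set(pandas_unique) == set(numpy_unique)
```

Here identical is False while same_members is True, so a strict equality check would raise a warning even though both functions found the same unique values.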

Future work:

-Using the median, the minimum, or the average of the best three runs instead of the mean, as those statistics are less prone to noise.

-Benchmarking memory consumption.
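As a rough sketch of the first item, the alternative statistics can be computed from repeated timings with the standard library alone. The summarize helper and the timed statement are hypothetical and serve only as an illustration.

```python
import statistics
import timeit


def summarize(timings):
    # Sort so the fastest runs come first.
    ordered = sorted(timings)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "minimum": ordered[0],
        "best_three_mean": statistics.mean(ordered[:3]),
    }


# Seven repeats of 100 executions each; every entry is a total time in seconds.
timings = timeit.repeat("sum(range(1000))", repeat=7, number=100)
summary = summarize(timings)
```

A single slow run caused by background activity pulls the mean up, but it leaves the minimum and the best-three average untouched and barely moves the median.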

Got something on your mind you would like to benchmark? We are waiting for your results.