Pandas dataframes are a great way of preserving relational database structure while gaining a huge API of vectorized functions for cleaning and analysis. However, I’ve always been frustrated by the lack of merge-between functionality in pandas. So I’ve come up with a fairly simple method that returns a merge-between-like result set with good runtime and manageable memory usage, and that is also easy to parallelize.

Using SQL, I can easily perform a left join between two datasets on a between condition, which associates each left table key with all matching data from the right table. For example:

SELECT a.*,
       b.*
FROM a
LEFT JOIN b
    ON b.timestamp >= a.start_time
    AND b.timestamp <= a.end_time

will return all the rows in b that have timestamps between the start and end time for each row in a. Admittedly, it can be very slow, but it’s a useful process. As of pandas .2, there’s no simple way to perform this operation.

Pandas .2 does have merge_asof implemented, which matches a to the nearest entry on b in a specified direction. But for many data analysis applications, I don’t want the closest entry, I want all the entries in between. For example, I might want to take the mean over a time period, or do a more sophisticated aggregation. Or, maybe my timestamps aren’t perfectly aligned, so I’d rather pull in a range of data for each entry in a and perform an estimation from the values in b. Either way, merge_asof doesn’t cut it.
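To make the difference concrete, here is a minimal sketch on toy data. The frames a and b and their values are made up for illustration; only pd.merge_asof itself is the real pandas function.

```python
import pandas as pd

# Hypothetical toy data: one interval in `a`, three timestamped rows in `b`.
a = pd.DataFrame({
    "start_time": pd.to_datetime(["2017-01-01 00:00:00"]),
    "end_time": pd.to_datetime(["2017-01-01 00:01:00"]),
})
b = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-01-01 00:00:10",
        "2017-01-01 00:00:30",
        "2017-01-01 00:02:00",
    ]),
    "value": [1.0, 2.0, 3.0],
})

# merge_asof needs both frames sorted on their keys and returns at most
# one match per left row: here, the first b row at or after start_time.
nearest = pd.merge_asof(
    a, b, left_on="start_time", right_on="timestamp", direction="forward"
)

# What a between-join should return instead: every b row inside the interval.
in_between = b[
    (b["timestamp"] >= a.loc[0, "start_time"])
    & (b["timestamp"] <= a.loc[0, "end_time"])
]

print(len(nearest), len(in_between))
```

merge_asof gives me a single row of b for the interval, while the between condition matches two rows, which is the result set I actually want to aggregate over.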

If all the start and end times in a are distinct (no overlapping events), then I can easily interpolate the data in a and b by concatenating, sorting on the timestamp, and boolean masking for data that falls between the same start and end, though the runtime isn’t very good. I don’t want to go through the extra data cleaning steps to make my data distinct, which may not be the right approach anyway. Instead, I’m going to use a little iterating and some fancy indexing to return an efficient merge-between-like result set. It’s also naturally suited to parallelization, since I can do the operations in distinct chunks (like splitting time series data by date), group by a unique identifier from my left table, and return a condensed result set. Since the condensed result set has at most len(left table) rows, and Python can free a chunk’s intermediate objects once the function that processed it returns, I can also manage memory usage by intelligent chunking, and process more data than I can load into working memory at any one time.
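As a rough sketch of the chunking idea, the example below splits both tables by calendar date and condenses each chunk to one row per left-table id. The function name summarize_chunk, the column names, and the toy values are assumptions for illustration, and the sketch assumes no interval crosses midnight.

```python
import pandas as pd

def summarize_chunk(ads_chunk, site_chunk):
    """Return one row per ad in the chunk: how many site rows fall in its window."""
    rows = []
    for ad in ads_chunk.itertuples(index=False):
        # Label slicing on a sorted DatetimeIndex includes both endpoints.
        window = site_chunk.loc[ad.start_time:ad.end_time]
        rows.append({"id": ad.id, "n_actions": len(window)})
    return pd.DataFrame(rows, columns=["id", "n_actions"])

# Hypothetical toy data spanning two days.
site = pd.DataFrame(
    {"action": ["click", "purchase", "click"]},
    index=pd.to_datetime(
        ["2017-01-01 10:00:05", "2017-01-01 10:00:20", "2017-01-02 09:00:00"]
    ),
).sort_index()
ads = pd.DataFrame({
    "id": [0, 1],
    "start_time": pd.to_datetime(["2017-01-01 10:00:00", "2017-01-02 08:59:30"]),
    "end_time": pd.to_datetime(["2017-01-01 10:00:30", "2017-01-02 09:00:30"]),
})

# Process each date independently; only the condensed results are kept.
results = []
for date, ads_chunk in ads.groupby(ads["start_time"].dt.date):
    site_chunk = site[site.index.date == date]
    results.append(summarize_chunk(ads_chunk, site_chunk))
summary = pd.concat(results, ignore_index=True)
print(summary)
```

Because each call only sees one date's worth of data and returns a small frame, the loop body could be handed to a process pool or run against files loaded one date at a time.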

Let’s create some random data using numpy’s methods for sampling with replacement. In this example, I have a website selling merchandise, and three different ad campaigns promoting my website. The ad campaigns are running on multiple channels, and have different spot lengths. I want to know if during the ad spots, there’s any notable activity on my website.

Let’s say my website sells socks, shirts, pants, backpacks and pencils, and records the time when a user either clicks, adds an item to their cart, or purchases it. This is clearly a bit cleaner than any data you’ll ever find in the wild, but gets the job done.
For my advertisement data, I’ll assume my ads have three different sets of content, and can be 30, 45, or 60 seconds long. Each ad will also have a unique integer id.

You can use pandas functions to create a series with times, but I want to demonstrate the functionality using timeseries data without any assumptions of regularity.
First, this means that I can use data of arbitrary length and sparsity/density. Many real-world time series data sets are naturally extremely long, with some dense and some sparse periods. Resampling (as much of the pandas documentation recommends) is extremely slow on long data sets. If some of my data is nanoseconds apart, and some of my data is hours apart, resampling to nanoseconds will create an absurdly huge dataframe. Resampling to minutes combines my denser data, and I want to preserve its granularity.
Second, this approach works on time series data after minimal data cleaning. A website might have multiple users performing actions at the same time, so duplicates are valid. I may be running two ads on different channels that overlap with each other — also valid.
The following generates two dataframes, one with my website information, and one with my advertisement information. The website dataframe has 10,000 rows, each representing an action on a product, and the ad dataframe has 200 rows, each representing a unique advertisement that ran on 1/1/2017. The website dataframe is sorted by timestamp, and has a timestamp index.
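A minimal sketch of how such data could be generated is below. The frame names website_actions and advertisements and the columns id, start_time, end_time match the code later in this post; the product, action, and channel column names, the 'click' and 'add_to_cart' labels, the channel names, and the seed are my assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only so the sketch is repeatable
day_start = pd.Timestamp("2017-01-01")
seconds_in_day = 24 * 60 * 60

# Website actions: 10,000 rows at random seconds of the day (duplicates allowed).
n_actions = 10_000
website_actions = pd.DataFrame(
    {
        "product": rng.choice(["socks", "shirt", "pants", "backpack", "pencil"], n_actions),
        "action": rng.choice(["click", "add_to_cart", "purchase"], n_actions),
    },
    index=day_start + pd.to_timedelta(rng.integers(0, seconds_in_day, n_actions), unit="s"),
)
website_actions.index.name = "timestamp"
website_actions = website_actions.sort_index()

# Advertisements: 200 rows, each with a unique integer id and a 30/45/60 second spot.
n_ads = 200
start_time = day_start + pd.to_timedelta(rng.integers(0, seconds_in_day, n_ads), unit="s")
length = pd.to_timedelta(rng.choice([30, 45, 60], n_ads), unit="s")
advertisements = pd.DataFrame(
    {
        "id": np.arange(n_ads),
        "content": rng.choice(["foo", "bar", "baz"], n_ads),
        "channel": rng.choice(["channel_1", "channel_2", "channel_3"], n_ads),
        "start_time": start_time,
        "end_time": start_time + length,
    }
)
```

Sampling with replacement means overlapping ads and simultaneous website actions show up naturally, which is exactly the messiness I want the method to handle.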

I try to never iterate over dataframes (or use apply), but in this case, it’s necessary, since a given website row can correspond to arbitrarily many ads and vice versa. For each ad, we can index into the website dataframe between the start and end times to find the actions during the ad. If we label each slice with the unique ad id, we can simply concatenate all the slices together and return all the actions during each ad, labeled by the ad id.

import pandas as pd

ad_actions = []
# pass over the dataframe. itertuples is the most efficient row iterator, and we don't need the index
for tup in advertisements.itertuples(index=False):
    # use the advertisement start and end times to slice the timestamp-indexed website df
    # (label slicing on a sorted DatetimeIndex includes both endpoints, like SQL BETWEEN)
    site_slice = website_actions.loc[tup.start_time:tup.end_time].copy()
    # label the copy (not a view of website_actions) with the ad id
    site_slice['id'] = tup.id
    # add it to the list of slices
    ad_actions.append(site_slice)
# when we're done iterating, concatenate all the slices into our final dataframe
ad_actions = pd.concat(ad_actions)

As mentioned above, an advantage of this method is that you can define your own split-apply-combine functions to use on the chunks of data. Since we preserved the unique ad id for each slice of website data, we can group on the id and easily merge our final results back into the advertisement dataframe to do further aggregations.
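Here is a small sketch of that group-and-merge step. The two toy frames stand in for the real advertisements and ad_actions, and the summary column names n_actions and n_purchases are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the advertisement table and the labeled slices.
advertisements = pd.DataFrame({"id": [0, 1, 2], "content": ["foo", "bar", "foo"]})
ad_actions = pd.DataFrame(
    {"id": [0, 0, 1], "action": ["click", "purchase", "click"]},
    index=pd.to_datetime(
        ["2017-01-01 00:00:05", "2017-01-01 00:00:20", "2017-01-01 00:10:00"]
    ),
)

# Split by ad id, apply the aggregations, combine into one row per ad.
per_ad = (
    ad_actions.groupby("id")
    .agg(
        n_actions=("action", "size"),
        n_purchases=("action", lambda s: (s == "purchase").sum()),
    )
    .reset_index()
)

# Left merge keeps ads with no site activity; fill their counts with 0.
summary = advertisements.merge(per_ad, on="id", how="left")
summary[["n_actions", "n_purchases"]] = (
    summary[["n_actions", "n_purchases"]].fillna(0).astype(int)
)
print(summary)
```

The left merge matters here: an ad that ran while nobody was on the site has no rows in ad_actions, and I still want it in the final table with zero counts.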

One simple analysis task might be to determine which ad content generated the most revenue. Let’s say we make $1 for every pair of socks we sell, $2.50 for every shirt, $3.25 for every pair of pants, and $2.75 for every backpack, and we lose $.05 for every pencil we sell. One way of looking at the effectiveness of an ad is comparing what we could have sold to what we did sell. We could have sold anything that appears in our website dataframe, since that represents items people clicked on, items people added to their carts (but didn’t check out), and final sales. We sold anything that has action = ‘purchase’.
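One way this calculation might be written is sketched below, reading "missed revenue" as potential revenue minus realized revenue. The per-item figures come from the paragraph above; the function name, the product and action column names, and the tiny hand-made frames (used so the arithmetic is easy to check) are assumptions.

```python
import pandas as pd

# Per-item profit, taken from the figures in the text.
PROFIT = {"socks": 1.00, "shirt": 2.50, "pants": 3.25, "backpack": 2.75, "pencil": -0.05}

def missed_revenue_by_content(ad_actions, advertisements):
    """Potential revenue minus realized revenue, summed per ad content."""
    df = ad_actions.merge(advertisements[["id", "content"]], on="id", how="left")
    # Every row is something we could have sold.
    df["potential"] = df["product"].map(PROFIT)
    # Only rows with action == 'purchase' were actually sold.
    df["realized"] = df["potential"].where(df["action"] == "purchase", 0.0)
    totals = df.groupby("content")[["potential", "realized"]].sum()
    return (totals["potential"] - totals["realized"]).rename("missed_revenue")

# Tiny hypothetical input.
advertisements = pd.DataFrame({"id": [0, 1], "content": ["foo", "bar"]})
ad_actions = pd.DataFrame({
    "id": [0, 0, 1, 1],
    "product": ["socks", "shirt", "pants", "pencil"],
    "action": ["purchase", "click", "add_to_cart", "click"],
})
missed = missed_revenue_by_content(ad_actions, advertisements)
print(missed)
```

In the toy input, 'foo' sold the socks but missed the shirt (2.50), and 'bar' missed the pants while avoiding a pencil loss (3.25 - 0.05 = 3.20). Run on the full ad_actions frame, the same function gives one number per content.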

Since the values are randomly generated, every run will produce a different result.

I see:

content    missed_revenue
'bar'      2311.05
'baz'      2069.90
'foo'      2005.45

Not the most sophisticated analysis! For example, I haven’t taken into account the way content may have overlapped during ad time. I also don’t know if ‘foo’ content missed the least revenue because people declined to buy the most pencils. I can add a little more granularity to my results by counting the pencils per ad:
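A sketch of the pencil count, again on hand-made frames so the numbers can be verified by eye. The product column name and the n_pencils label are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for the advertisement table and the labeled slices.
advertisements = pd.DataFrame(
    {"id": [0, 1, 2, 3], "content": ["foo", "bar", "foo", "bar"]}
)
ad_actions = pd.DataFrame({
    "id": [0, 0, 1, 2, 2, 2],
    "product": ["pencil", "socks", "pencil", "pencil", "pencil", "shirt"],
    "action": ["click", "purchase", "purchase", "click", "purchase", "click"],
})

# Count pencil rows per ad; ads with no pencil activity get an explicit 0.
pencils = ad_actions[ad_actions["product"] == "pencil"]
pencils_per_ad = (
    pencils.groupby("id").size()
    .reindex(advertisements["id"].to_numpy(), fill_value=0)
    .rename_axis("id")
    .rename("n_pencils")
    .reset_index()
)

# Average pencil count per ad for each set of content.
by_content = (
    advertisements.merge(pencils_per_ad, on="id")
    .groupby("content")["n_pencils"]
    .mean()
)
print(by_content)
```

Filling in zeros before averaging keeps ads with no pencil activity from being silently dropped, which would otherwise inflate the per-content mean.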

It doesn’t look like any set of content is correlated with whether or not site visitors look at or purchase pencils (which makes sense, since the numpy sampling functions sample from a uniform distribution by default, though you can weight them however you like).