Tips and Tricks

Aggregating into Strings without String_agg or Array_agg in Amazon Redshift

Periscope Data

|

December 30, 2014

Getting the Top Purchases for Each Product

From time to time, any analyst will want to know the “top n instances” of something. For example, as the holidays approach, a toy store may want to know who the top customers of certain products are, so they can prepare special marketing for those customers.

However, those of us on other databases have to do without. In particular, Amazon Redshift doesn’t yet support these functions. In this post, we’ll show you how to use window functions and self joins to find the top 5 purchasers of each product.

Counting the Purchasers

Let’s start with a simple group-and-count that gives us a table of how many times any customer purchased any product:

select

products.name as product_name,

customers.name as customer_name,

count(1) purchase_count

from products

join purchases on purchases.product_id = products.id

groupby1, 2

This gives us the simple table we were expecting:

Ranking Purchasers

Now, for each product and purchaser, we want to know that purchaser’s rank for that product. Are they the top purchaser? The second to the top?

The row_number window function makes this easy. Here’s how:

select

product_name,

customer_name,

row_number() over (

partition by product_name orderby purchase_count desc

) as rank

from (

select

products.name product_name,

customers.name customer_name,

count(1) purchase_count

from products

join purchases on purchases.product_id = products.id

groupby1, 2

) purchase_counts

We’ve put our previous query into a subquery named purchase_counts.

Then, in our row_number window function, we partitioned by product_name because we want a rank per product, and we ordered by purchase_count desc so that the customer with the most purchases has rank 1.

This gives us a handy customer rank for each product, seen here:

Putting the Top 5 Purchasers into One Row

Now that we have a unique rank for each purchaser on each product, if we want the top 5 purchasers, all we have to do is join this table to itself 5 times!

Assuming we have the above results in a purchase_ranks table, here’s the query:

select

pr0.product_name,

pr1.customer_name

||', '|| pr2.customer_name

||', '|| pr3.customer_name

||', '|| pr4.customer_name

||', '|| pr5.customer_name as top_purchasing_customers

from

purchase_ranks pr0

join purchase_ranks pr1

on pr0.product_name = pr1.product_name and pr1.rank =1

join purchase_ranks pr2

on pr0.product_name = pr2.product_name and pr2.rank =2

join purchase_ranks pr3

on pr0.product_name = pr3.product_name and pr3.rank =3

join purchase_ranks pr4

on pr0.product_name = pr4.product_name and pr4.rank =4

join purchase_ranks pr5

on pr0.product_name = pr5.product_name and pr5.rank =5

The first key for each self join is product_name. That keeps each row about one product.

The second key selects which customer we are bringing in. The first self join brings the top purchaser by requiring pr1.rank = 1. The second self join brings in the second-to-the-top purchaser by requiring pr2.rank = 2, and so on.

In the select clause, we concatenate these customers’ names together, separated by commas. Here’s the result:

Paring down the Results

As you can see, since there are 5 self joins, there are 5 resulting identical rows for each product. To bring it down to one row per product, all we need to do is wrap our query in a select distinct:

selectdistinct product_name, top_purchasing_customers from (

select

pr0.product_name,

pr1.customer_name

||', '|| pr2.customer_name

||', '|| pr3.customer_name

||', '|| pr4.customer_name

||', '|| pr5.customer_name as top_purchasing_customers

from

purchase_ranks pr0

join purchase_ranks pr1

on pr0.product_name = pr1.product_name and pr1.rank =1

join purchase_ranks pr2

on pr0.product_name = pr2.product_name and pr2.rank =2

join purchase_ranks pr3

on pr0.product_name = pr3.product_name and pr3.rank =3

join purchase_ranks pr4

on pr0.product_name = pr4.product_name and pr4.rank =4

join purchase_ranks pr5

on pr0.product_name = pr5.product_name and pr5.rank =5

) products_with_top_purchasers

This gives us the final results we’re looking for!

Putting It all Together

Pulling out original purchase counts into a with clause, we get a final query that looks like this:

with purchase_ranks as (

select

product_name,

customer_name,

row_number() over (

partition by product_name orderby purchase_count desc

) as rank

from (

select

products.name as product_name,

customers.name as customer_name,

count(1) as purchase_count

from products

join purchases on purchases.product_id = products.id

groupby1, 2

) purchase_counts

)

selectdistinct product_name, top_purchasing_customers from (

select

pr0.product_name,

pr1.customer_name

||', '|| pr2.customer_name

||', '|| pr3.customer_name

||', '|| pr4.customer_name

||', '|| pr5.customer_name as top_purchasing_customers

from

purchase_ranks pr0

join purchase_ranks pr1

on pr0.product_name = pr1.product_name and pr1.rank =1

join purchase_ranks pr2

on pr0.product_name = pr2.product_name and pr2.rank =2

join purchase_ranks pr3

on pr0.product_name = pr3.product_name and pr3.rank =3

join purchase_ranks pr4

on pr0.product_name = pr4.product_name and pr4.rank =4

join purchase_ranks pr5

on pr0.product_name = pr5.product_name and pr5.rank =5

) products_with_top_purchasers

Now that we have the top 5 purchasers for each product, we can tailor our marketing specifically to them for the holiday season!

Periscope Data and Amazon Web Services combine to provide the fastest and easiest way to deliver scalable, high-performance and secure cloud analytics.