Guest Post: Bucketing in SQL

Use SQL to group data so that you can spot trends, uncover opportunities, and see where and how your business generates value.

Eric Feng

This is a guest post by Eric Feng of Numeracy, a company specializing in lightweight business intelligence tools. Before he joined Numeracy, Eric worked at Dropbox and Sentry in developer relations, engineering and product.

Bucketing, also known as binning, is useful for finding groupings in continuous data (particularly numbers and time stamps). While it’s often used to generate histograms, bucketing can also be used to group rows by business-defined rules. I’ll walk through simple bucketing of various data types as well as custom buckets. Finally, to illustrate how these types of bucketing might be used, I’m going to walk through an example with Numeracy’s own schemas and mock data on when our users connect a database.

Bucketing Numbers

The most common and naive method of bucketing is truncating. trunc drops the fractional part of a float, rounding it toward zero to an integer. As a bonus, if you’re trying to bucket values to the nearest 10, trunc takes a second argument for the number of decimal places to keep; pass -1 to truncate to the tens place.
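
As a quick sketch, here’s how that might look against a hypothetical purchases table with a numeric amount column (neither is part of this post’s schema), bucketing order values into $10-wide groups:

select trunc(amount, -1) as amount_bucket,  -- e.g., 47.80 lands in the 40 bucket
       count(1) as order_count
from purchases
group by 1
order by 1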

Bucketing Time

Like trunc for numbers, date_trunc is used to bucket timestamps together. This is particularly useful for tracking user activity, where you’d like to smooth out weekend dips by bucketing weeks together. It accepts a wide variety of fields to truncate to, from microseconds to millennia.
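
For instance, to count weekly activity against the same events table used in the example later in this post:

select date_trunc('week', created) as week,
       count(1) as event_count
from events
group by 1
order by 1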

If you want to bucket by intervals that date_trunc doesn’t support, such as every 15 minutes, your second option is to convert the timestamp into a number (namely epoch, i.e. seconds since 1970), divide by the desired interval in seconds, and then trunc the result.
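
Here’s a sketch of that approach, bucketing the events table into 15-minute (900-second) intervals and converting the result back into a timestamp:

select to_timestamp(trunc(extract(epoch from created) / 900) * 900) as bucket_start,
       count(1) as event_count
from events
group by 1
order by 1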

Note that extracting other fields (like hour or day) does not work this way: extract returns that component of the timestamp as a value, rather than converting the whole timestamp into that unit. For example, extract(hour from timestamp '2019-03-01 17:45:00') returns 17, the hour of the day, not the number of hours since 1970. If you try to build intervals from those fields, results can vary in unexpected ways, so I’d recommend you stick with epoch.

Custom Buckets

You can create custom buckets with the Postgres width_bucket function. The function compares the value against a sorted array of thresholds, treating each as the lower bound of a bucket, and returns the number of the last threshold the value reaches (or 0 if it falls below the first). That’s why it’s important to always have your buckets sorted in ascending order (easily done with Postgres’ intarray extension). For the array {10, 20, 30}, you effectively get the four ranges <10, 10-20, 20-30, and ≥30.
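
To make that concrete, here’s how a few values land against the {10, 20, 30} array:

select width_bucket(5,  array[10, 20, 30]) as below_first,  -- 0: less than 10
       width_bucket(25, array[10, 20, 30]) as third_range,  -- 2: between 20 and 30
       width_bucket(99, array[10, 20, 30]) as above_last    -- 3: 30 or more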

While it’s possible to duplicate this logic by hand with a CASE expression, it’s much more verbose. width_bucket also has the advantage of taking its thresholds as a value, so you can generate or even store bucket boundaries as SQL values instead of hard-coding them in your statements.
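
For comparison, here’s the hand-rolled CASE equivalent of the {10, 20, 30} example, with the boundaries baked into the statement (some_table and its numeric column x are hypothetical):

select case
         when x < 10 then 0
         when x < 20 then 1
         when x < 30 then 2
         else 3
       end as bucket,
       count(1) as row_count
from some_table
group by 1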

Numeracy Example

Let’s look at our example: tracking when Numeracy users first connect a database.

First, here’s how we might pull the “time to connect” from raw event tables with ‘organization.created’ and ‘database.connected’ event types:

with conn_event as (
  -- first time each organization connected a database
  select org_id, min(created) as created
  from events
  where type = 'database.connected'
  group by 1
),
org_event as (
  -- when each organization was created
  select org_id, min(created) as created
  from events
  where type = 'organization.created'
  group by 1
),
time_to_connect as (
  -- seconds between signup and first connection
  -- (null for organizations that never connected)
  select extract(epoch from conn_event.created - org_event.created) as seconds
  from org_event
  left join conn_event
    on org_event.org_id = conn_event.org_id
)

We’ll assume all of the following statements are prefixed with these CTEs. For a first pass, let’s group by day.

select trunc(seconds / 3600 / 24) as day,
       count(1) as org_count
from time_to_connect
group by 1
order by 1

If you run this, you can see almost everyone connects within a week. Most, in fact, connect within a day of signing up. Our intuition tells us that the users connecting on the first day actually connect within minutes. However, grouping the entire dataset by minute would produce hundreds of thousands of buckets from the long tail of users who connect months later. Instead, we can group by minute while capping the view at the first 24 hours (1,440 minutes):

select trunc(seconds / 60) as minutes,
       count(1) as org_count
from time_to_connect
where trunc(seconds / 60) <= 1440
group by 1
order by 1

Segmenting Users Into Custom Buckets

In most businesses, you want to define “personas”: types of users who find your software valuable. In Numeracy’s case, it’s easy to see three distinct buckets: users who connect in the first 15 minutes, in the first week, or later. Having defined and reached consensus on these, I might want to categorize these users in my queries and examine other attributes. For this, we’ll use width_bucket.

select width_bucket(seconds / 60, '{15, 10080}') as bucket,
       count(1) as org_count
from time_to_connect
group by 1

In this case, I’ve passed the array {15, 10080} to the width_bucket function. If a user connects in less than 15 minutes, width_bucket returns 0. If between 15 and 10,080 minutes (one week), it returns 1. If more than a week, 2.

In Numeracy’s case, if you run the queries above, you’ll see that 60% of organizations that ever connect a database do so in the first 15 minutes. Another 30% connect within the next 7 days. The final 10% trickle in over the long tail afterwards.

Now that we’ve bucketed our users, we can answer questions like, “Which bucket converts at the highest rate?” In Numeracy’s case, our initial intuition told us that users who instantly connected were more likely to convert. However, our highest conversion rates were actually amongst the second group: users who returned within a week. In retrospect, this makes sense: people with database credentials on hand are more likely to be on ops teams, whereas Numeracy is built for business or product teams.
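
As a minimal sketch, the comparison might look like the query below. It assumes the time_to_connect CTE is extended to also carry org_id, and it invents a hypothetical 'subscription.created' event type to mark conversion; neither is part of the schema above.

select width_bucket(ttc.seconds / 60, '{15, 10080}') as bucket,
       count(1) as org_count,
       count(conv.org_id) as converted_count,
       round(100.0 * count(conv.org_id) / count(1), 1) as conversion_pct
from time_to_connect ttc  -- assumes org_id was added to this CTE
left join (
  -- hypothetical event type marking a conversion
  select org_id
  from events
  where type = 'subscription.created'
  group by 1
) conv on conv.org_id = ttc.org_id
group by 1
order by 1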

The goal of bucketing is to group user behaviors into segments that can inform your marketing efforts and product design. By using buckets to organize data in different ways, you can detect trends that might otherwise have gone unnoticed, determine where your business currently generates the most value, and find unexploited opportunities.
