How many times have you started off building a complicated analytical SQL query like this?

SELECT
.
.
.
.
uh???
.
.
SELECT
*
FROM
.
.
.
SELECT
.
.
.

And you get stuck trying to figure out exactly what you want to select. You’re thinking about averages, group by’s, the order of your results or some change you want to see over time and the query editor is just sitting there, taunting you, because in SQL, you have to know up front what you want to select into your final results.

It’s paralyzing…but if you keep reading, you’re going to know why it’s happening and you’ll never freeze up at a SELECT statement again.

You Are Writing Your Queries Backwards

Any time you start with a SELECT statement and you’re attempting to summarize complicated data, you’re starting off backwards.

SQL is a descriptive language. Every time you write an analytical SQL query, you are attempting to describe the underlying data in your tables. If you’re coming from a typical programming background, this is really different from how imperative programming languages work: you state what result you want, not the steps to compute it. Yes, there are SQL tools for transforming data that start looking and feeling like other types of programming with loops and control flow and all that, but here I’m just talking about summarizing, analyzing, and transforming data to get insight into the underlying meaning of it.

Every time you reach for your SELECT statement first, you’re going to feel shaky because you haven’t explored the foundation of the data you’re trying to explain. Not only is that uncertainty a stumbling block for composing the query, but later on, if you were wrong about any of your initial assumptions, you end up with invalid conclusions in your summary data.

Instead, treat the SELECT statement like a summary of your findings.

Could you summarize a book for someone if you never even opened a page? Or read a chapter? Trying to fill in your SELECT statement first is like writing your book report sentence by sentence, as you finish reading each chapter. I guess you could do that…but you’re sure to miss the big picture and probably a lot of interconnected insights if you don’t wait until you’ve read the whole book.

Build Your Query Chapter by Chapter

Let’s work through an example to put this idea into practice. Suppose we have a group of students who take a quiz in class every day. You want to track their progress by seeing how their quiz scores are changing over time and you’ve decided that looking at the change in their rolling two-week average is the best way to measure that change.

Before you read this post, you might be tempted to sit down and write:

SELECT
student.first_name,
student.last_name,
avg(quiz.score),
.
.
.
wait...how do I limit the average for the last two weeks....and get one score per day...

And now we’re frozen in the SELECT statement again.

No! We said that approach was backwards. So how do we reverse the query-writing process? At first this might feel really basic and slow and painful, but building chapter by chapter is how you read any book, right?

An Important Point About Actually Writing SQL

SQL is fragile!

When you crack open a blank editor and start composing a new query, one misplaced function or statement can make the whole query collapse in on itself. And you might not even realize it until it’s too late. You can get three or four steps through tweaking and editing your query and suddenly find out that you’ve grouped your data wrong or changed the way a subquery does some calculation and all your results are skewed.

If you’ve just been changing your main query, then it can be really hard to unwind the steps. That’s why I recommend a very simple iterative approach to building your queries.

1. Start with a simple query

2. Run it against your database

3. Check the results

4. Looking good?

5. Copy the working query and paste it below itself

6. Make one small change to the copy of your working query

7. Repeat steps 2 - 6 until you have the data you want

If you build queries up step-by-step like this, you can easily move forward and back in time, both to check your query logic and sanity check the data. That’s what we’re about to do to solve this analytics problem.

Get Oriented

Now back to our students and quizzes example…

Let’s start by writing some simple orientation queries so we know more about our data set. There are basically two ways to get oriented: by schema and by the shape of the data.

Exploring the schema gives you a sense for what tables we will use, what’s in each table, and how they are related.

If you’ve ever worked on a large, legacy code base or with a complicated data set, I’m sure you’ve seen the limits of this type of exploration. Old, unused columns still show up in legacy schemas. Sparsely populated tables show up just as clearly in a schema as the most heavily used tables. It’s a bit like looking at a map of a country with just the city and state boundaries (schema view) vs looking at a map that shows population density or traffic patterns at rush hour (data-shape view).

To understand the shape of the data, ask questions like, how many students are there?
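Here’s a minimal sketch of what those first orientation counts could look like. It uses Python’s built-in sqlite3 module and a handful of made-up rows so it runs without a database server; the SQL strings are the part that matters. The students table name and the score column are my assumptions, since only quiz_results and quizzes appear in the queries in this post.

```python
import sqlite3

# Made-up miniature data set. The students table name and the score column
# are assumptions; quiz_results and quizzes appear in the post's own queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE quizzes (id INTEGER PRIMARY KEY, quiz_date TEXT);
    CREATE TABLE quiz_results (student_id INTEGER, quiz_id INTEGER, score INTEGER);
    INSERT INTO students VALUES (1, 'Ann', 'Alpha'), (2, 'Bob', 'Beta'), (3, 'Cy', 'Gamma');
    INSERT INTO quizzes VALUES (1, '2018-10-01'), (2, '2018-10-02');
    -- Student 3 has no result for quiz 1.
    INSERT INTO quiz_results VALUES (1, 1, 90), (1, 2, 80), (2, 1, 70), (2, 2, 75), (3, 2, 60);
""")

# One tiny orientation query per table.
counts = {}
for table, sql in [
    ("students", "SELECT count(1) FROM students"),
    ("quizzes", "SELECT count(1) FROM quizzes"),
    ("quiz_results", "SELECT count(1) FROM quiz_results"),
]:
    counts[table] = conn.execute(sql).fetchone()[0]

# If every student had taken every quiz, we would expect this many results.
expected_if_complete = counts["students"] * counts["quizzes"]
print(counts, "expected if complete:", expected_if_complete)
```

Comparing the actual quiz_results count with students times quizzes is the quick sanity check: any gap means some student and quiz combinations have no result.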

Hmmm, we would have 9100 quiz results if 100 students had each taken all 91 quizzes, but the table holds fewer rows than that. Are we dealing with student absences? Or is there some other pattern in the shape of the data? At this point, missing quiz results won’t necessarily impact our ability to answer the question (moving two-week average score per student), but we should understand the shape of the data we’re describing.

Sometimes you get lucky and a single GROUP BY will reveal significant results that explain the missing quiz results, but it looks like that’s not the case here.
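As a sketch of that single GROUP BY, here is a “quizzes per student” count run against a few made-up rows with Python’s built-in sqlite3 module. The score column is an assumption and the data is invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quiz_results (student_id INTEGER, quiz_id INTEGER, score INTEGER);
    -- Made-up rows: student 3 took only one of the two quizzes.
    INSERT INTO quiz_results VALUES (1, 1, 90), (1, 2, 80), (2, 1, 70), (2, 2, 75), (3, 2, 60);
""")

# One row per student, with the number of quizzes that student took.
quizzes_per_student = conn.execute("""
    SELECT student_id, count(1) AS quizzes
    FROM quiz_results
    GROUP BY student_id
    ORDER BY quizzes DESC, student_id
""").fetchall()
print(quizzes_per_student)
```

On a real data set this returns one row per student, which is a lot of rows to scan by eye for a pattern.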

If the pattern doesn’t reveal itself at the top-level, you can group the counts themselves, to see if there’s a pattern to the distribution of quizzes taken per student.

To write the query, we make our “quizzes per student” query a subquery that we can select from and basically repeat the group by and count approach.

That query looks like this:

SELECT
    quizzes,
    count(1) as count
FROM (
    SELECT student_id, count(1) as quizzes FROM quiz_results GROUP BY student_id
) r
GROUP BY quizzes
ORDER BY count DESC;

And now we get a much more intelligible result:

quizzes | count
---------+-------
91 | 49
61 | 30
31 | 21
(3 rows)

Which tells us that about half the students did take all 91 quizzes, 30 took 61 and the last 21 students only took 31 quizzes. With clean groupings like this, there’s probably some kind of cohort-level reason for the number. In other words, if missing quizzes were due to student absences, you’d expect to see a wider range in “number of quizzes per student” because the count of absences per student would likely have more variation.

If you really wanted to keep poking at the data, you’d have one other nice entry point: date.

Here’s our distribution of quiz results by date:

SELECT
    quiz_date,
    count(1) as results
FROM
    quiz_results qr
JOIN
    quizzes q
ON
    qr.quiz_id = q.id
GROUP BY quiz_date
ORDER BY quiz_date DESC;

I’m not going to put all 91 results here, but if you’re following along and run the query, you’ll see a pattern that explains the “quizzes per student” data, namely, a new group of students joined every month and everyone who was participating at the time took every quiz.

Even building up this example, I played with half a dozen different ways to poke at the data and get a feel for its shape. I hope that’s what you take away from this section more than anything else: simple, exploratory queries help you get the shape of a data set and build your intuition about the correctness of your query results, which really comes in handy when you try to draw complicated conclusions about the data later on.

Get Specific

Once you have a feel for the quiz results we’re working with, the next thing to do is take a very specific example all the way from start to finish.

We want to build up a specific case where we look at:

one student’s scores

the average of all those scores

the average of all those scores for multiple two-week periods

With these specific results, we can check the summary data in the more complicated case of calculating the moving two-week average for all students and be confident that we’ve done our summary calculation correctly.

First, pick a student:

SELECT * FROM quiz_results WHERE student_id = 2

Wait, that’s not going to give us rich enough data to build intuition about what’s happening in the data, is it? Let’s spend a little more time joining onto the student and quiz tables to inflate our normalized data:
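A join along these lines would do it. This is a sketch under assumptions: the students table, its id, first_name and last_name columns, and the score column are my guesses at the schema, and the rows are made up. It runs on Python’s built-in sqlite3 module so you can try it without a database server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE quizzes (id INTEGER PRIMARY KEY, quiz_date TEXT);
    CREATE TABLE quiz_results (student_id INTEGER, quiz_id INTEGER, score INTEGER);
    -- Made-up rows for illustration only.
    INSERT INTO students VALUES (1, 'Ann', 'Alpha'), (2, 'Bob', 'Beta');
    INSERT INTO quizzes VALUES (1, '2018-10-01'), (2, '2018-10-02');
    INSERT INTO quiz_results VALUES (1, 1, 90), (2, 2, 75), (2, 1, 70);
""")

# Replace the bare ids with the name and date they point to.
rows = conn.execute("""
    SELECT s.first_name, s.last_name, q.quiz_date, qr.score
    FROM quiz_results qr
    JOIN students s ON s.id = qr.student_id
    JOIN quizzes q ON q.id = qr.quiz_id
    WHERE qr.student_id = 2
    ORDER BY q.quiz_date
""").fetchall()
for row in rows:
    print(row)
```

With names and dates in each row, it is much easier to eyeball whether a later average makes sense.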

This next step is going to feel like a little bit of SQL magic, but there is a key concept that lets us combine the last two queries (and more) into a single function call: window functions. Window functions in SQL allow you to perform an aggregation (e.g. COUNT, SUM, AVG, etc.) over a particular window of data that is a subset of the total data you are scanning in a single query.

With a window function, instead of one query for each two-week period, where the “window” is specified in the WHERE clause, we can calculate all the moving averages in a single query. We can specify the range of time within the magical window function and get a two-week average for each row of data, i.e. each quiz date in our result set.
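For contrast, this is roughly what the one-query-per-period approach looks like. It is a sketch with made-up data (one quiz per day for 16 days, scores cycling from 80 to 84), an assumed score column, and Python’s built-in sqlite3 module standing in for the real database.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quizzes (id INTEGER PRIMARY KEY, quiz_date TEXT);
    CREATE TABLE quiz_results (student_id INTEGER, quiz_id INTEGER, score INTEGER);
""")
# Made-up data: one quiz per day for 16 days, all taken by student 2.
for i in range(16):
    quiz_date = (date(2018, 10, 1) + timedelta(days=i)).isoformat()
    conn.execute("INSERT INTO quizzes VALUES (?, ?)", (i + 1, quiz_date))
    conn.execute("INSERT INTO quiz_results VALUES (2, ?, ?)", (i + 1, 80 + i % 5))

# The "window" lives in the WHERE clause as a pair of dates.
period_avg_sql = """
    SELECT avg(qr.score)
    FROM quiz_results qr
    JOIN quizzes q ON q.id = qr.quiz_id
    WHERE qr.student_id = 2
      AND q.quiz_date BETWEEN ? AND ?
"""
# Every two-week period needs its own run of the query.
first_period = conn.execute(period_avg_sql, ("2018-10-01", "2018-10-14")).fetchone()[0]
second_period = conn.execute(period_avg_sql, ("2018-10-02", "2018-10-15")).fetchone()[0]
print(first_period, second_period)
```

These hand-picked periods are still worth keeping around: they give you specific numbers to compare against once the window function version is in place.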

Here’s an example of a window function, ROW_NUMBER, that illustrates the windowing clearly. ROW_NUMBER assigns a row number based on the window definition in the OVER clause, which here is just ORDER BY quiz_date.
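A minimal sketch of that ROW_NUMBER query, again with made-up rows, an assumed score column, and Python’s built-in sqlite3 module (window functions need SQLite 3.25 or newer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quizzes (id INTEGER PRIMARY KEY, quiz_date TEXT);
    CREATE TABLE quiz_results (student_id INTEGER, quiz_id INTEGER, score INTEGER);
    INSERT INTO quizzes VALUES (1, '2018-10-01'), (2, '2018-10-02'), (3, '2018-10-03');
    -- Made-up rows, inserted out of date order on purpose.
    INSERT INTO quiz_results VALUES (2, 3, 100), (2, 1, 80), (2, 2, 90);
""")

# ROW_NUMBER() numbers the rows in the order given inside OVER (...).
rows = conn.execute("""
    SELECT q.quiz_date,
           qr.score,
           ROW_NUMBER() OVER (ORDER BY q.quiz_date) AS row_num
    FROM quiz_results qr
    JOIN quizzes q ON q.id = qr.quiz_id
    WHERE qr.student_id = 2
    ORDER BY q.quiz_date
""").fetchall()
for row in rows:
    print(row)
```

The numbering follows quiz_date, not the order the rows were inserted in, which is the ordering the averages will rely on next.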

There’s one more concept we need to add here, which might not be obvious from our last query. We need to define a boundary around which rows get included in the average calculation. Since we’re aiming to get to the two-week average, we want to include only the preceding two weeks in the average score each day (or, for the first two weeks, as many quizzes as are available up to and including the current day).

Our results for 2018-10-05 are different now. Last time we got 86.4 and now we have 91.2. If you calculate out the averages, you can see that we are now averaging 3 numbers (the current row plus the 2 preceding) instead of the 4 (the current row plus everything preceding it) we averaged when we didn’t specify which rows to use.

This is a perfect example of a small, easy-to-miss, but important detail. If you just apply the formula to the aggregate results instead of taking the time to look at the expanded row data, you might never see the calculation change.
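To watch the frame change the calculation on rows you can check by hand, here is a sketch with four made-up scores (80, 90, 100, 70) and an assumed score column, run through Python’s built-in sqlite3 module. When OVER has an ORDER BY but no frame, the default frame runs from the first row through the current row, so the first average is a running average; ROWS 2 PRECEDING limits the second one to the current row plus the two before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quizzes (id INTEGER PRIMARY KEY, quiz_date TEXT);
    CREATE TABLE quiz_results (student_id INTEGER, quiz_id INTEGER, score INTEGER);
    INSERT INTO quizzes VALUES
        (1, '2018-10-01'), (2, '2018-10-02'), (3, '2018-10-03'), (4, '2018-10-04');
    -- Made-up scores for one student.
    INSERT INTO quiz_results VALUES (2, 1, 80), (2, 2, 90), (2, 3, 100), (2, 4, 70);
""")

rows = conn.execute("""
    SELECT q.quiz_date,
           qr.score,
           -- No frame: averages every row up to and including this one.
           round(avg(qr.score) OVER (ORDER BY q.quiz_date), 1) AS running_avg,
           -- Explicit frame: this row plus the two rows before it.
           round(avg(qr.score) OVER (ORDER BY q.quiz_date ROWS 2 PRECEDING), 1) AS last_3_avg
    FROM quiz_results qr
    JOIN quizzes q ON q.id = qr.quiz_id
    WHERE qr.student_id = 2
    ORDER BY q.quiz_date
""").fetchall()
for row in rows:
    print(row)
```

The two columns agree for the first three days and only split on the fourth, where the running average divides four scores by 4 and the framed average divides the last three by 3.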

Now that we know how to apply the window function correctly, we can build our final query for the specific case.

First, make three small changes:

add ROWS 13 PRECEDING to get the two-week average

round the averages for ease of reading

change the order of the displayed results (the query’s final ORDER BY is independent of the ORDER BY inside the OVER clause, so we can calculate and display rows with different sort orders)
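Put together, those three changes might look like the sketch below. The data is made up (one quiz per day for 16 days, scores cycling from 80 to 84), the score column is an assumption, and Python’s built-in sqlite3 module stands in for the real database. ROWS 13 PRECEDING covers the current row plus the 13 before it, which is 14 days when there is exactly one quiz per day.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quizzes (id INTEGER PRIMARY KEY, quiz_date TEXT);
    CREATE TABLE quiz_results (student_id INTEGER, quiz_id INTEGER, score INTEGER);
""")
# Made-up data: one quiz per day for 16 days, all taken by student 2.
for i in range(16):
    quiz_date = (date(2018, 10, 1) + timedelta(days=i)).isoformat()
    conn.execute("INSERT INTO quizzes VALUES (?, ?)", (i + 1, quiz_date))
    conn.execute("INSERT INTO quiz_results VALUES (2, ?, ?)", (i + 1, 80 + i % 5))

rows = conn.execute("""
    SELECT q.quiz_date,
           qr.score,
           -- The window is ordered oldest to newest and spans 14 rows.
           round(avg(qr.score) OVER (ORDER BY q.quiz_date ROWS 13 PRECEDING), 1) AS two_week_avg
    FROM quiz_results qr
    JOIN quizzes q ON q.id = qr.quiz_id
    WHERE qr.student_id = 2
    -- The display order is newest first and does not affect the window.
    ORDER BY q.quiz_date DESC
""").fetchall()
for row in rows:
    print(row)
```

For the first 13 days the frame simply holds fewer rows, so those averages cover everything available so far.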

That looks like kind of a mess! You can see that Bernadette’s average for 2018-12-31 is wrong.

To fix this summary calculation, we need one more key window function concept: PARTITION BY. We can partition the data so that we can do calculations for each student individually, just like we did for Bernadette, but all at once.

add student_id, which we know is unique, so that we can cleanly group scores by student

add PARTITION BY student_id in the window function

and ORDER BY student_id so we can see Bernadette’s results (which are now correct again)
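Here is a sketch of those changes with two made-up students and two quiz days, small enough to verify by hand. The score column is an assumption, and the query runs on Python’s built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quizzes (id INTEGER PRIMARY KEY, quiz_date TEXT);
    CREATE TABLE quiz_results (student_id INTEGER, quiz_id INTEGER, score INTEGER);
    INSERT INTO quizzes VALUES (1, '2018-10-01'), (2, '2018-10-02');
    -- Made-up scores: student 1 scores low, student 2 scores high.
    INSERT INTO quiz_results VALUES (1, 1, 50), (1, 2, 60), (2, 1, 90), (2, 2, 100);
""")

rows = conn.execute("""
    SELECT qr.student_id,
           q.quiz_date,
           qr.score,
           round(avg(qr.score) OVER (
               PARTITION BY qr.student_id   -- restart the window for each student
               ORDER BY q.quiz_date
               ROWS 13 PRECEDING
           ), 1) AS two_week_avg
    FROM quiz_results qr
    JOIN quizzes q ON q.id = qr.quiz_id
    ORDER BY qr.student_id, q.quiz_date
""").fetchall()
for row in rows:
    print(row)
```

Each student’s averages are built only from that student’s own scores, so student 2’s first average is 90.0 rather than something dragged down by student 1’s rows.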

For completeness, I would probably run through specific averages for one more student and then check their results in the final aggregate, but we seem to be heading in the right direction.

One More Time: Step-by-Step

Thanks for following along through this example!

I know building queries up section-by-section like this probably seems tedious. It can definitely feel that way at times, but this approach has saved me from making small mistakes so many times that I’m convinced it’s worth it.

Just to say it again, the step-by-step approach to building complicated analytical SQL queries is:

1. Start with a simple query

2. Run it against your database

3. Check the results

4. Looking good?

5. Copy the working query and paste it below itself

6. Make one small change to the copy of your working query

7. Repeat steps 2 - 6 until you have the data you want

If you follow those steps, you’ll have proof that your data adds up and you’ll have a much stronger feel for what’s inside your tables.