Understanding Posrgres Window Functions

Posted on Thursday, Jun 02, 2016

Recently I stumbled into a very handy feature of Postgres — window functions. A window function gives you a “window” into other rows which are somehow related to the current row. You can then preform a calculation across that set of rows. Window functions behave much like regular aggregate functions, with the caveat that the rows do not become grouped into a single row! Let’s explore this feature, and why it might be useful.

Let’s imagine that we have an application that is used to record sales data for different stores. To keep things simple, we will say that the application’s database has only three tables: sales, users, and stores.

Let’s break this down a bit. The store_id column is self explanatory. The sum aggregates the amount from every row that has the same store_id as the current row. The OVER keyword is what causes the aggregate function to be treated as a window function, and compute the aggregate across the rows in the window without grouping.

But how do we define the window? How does Postgres know which subset of rows we want to aggregate? The answer lies in the PARTITION keyword. The partition is what divides the table rows into distinct window frames. In this case we partition the table by the store_id.

One thing to note is that we can use an aggregate function as a window function. We could have just as easily written

SELECTstore_id,COUNT(*)OVER(PARTITIONBYstore_id)FROMsales

Using the Rank Function

Postgres provides a number of window specific functions which you can find here. One of the most common use cases for window functions is to rank a window of rows using the rank function.

Lets say we wanted to rank each sales record by the amount entered for a given store_id. We can do this easily with the rank function:

You may notice that if a two or more rows have the same amount, they receive the same rank. The rank function will assign a unique number to each distinct row, but it leaves a gap when there are multiple rows with the same rank. Take a look at the rows with a store_id of 1. We can see that there are two ranks of 2, so the next rank is incremented not to 3, but to 4. the dense_rank function does not do this and instead the row with a rank of 4 would have a rank of 3.

These have been some pretty simple examples of window functions. Let’s move onto something a little bit more interesting.

Complex Window Partitions

Let’s say that in our application we want to be able to get the most recent sales data entered for a given month/year combination and a given store. This problem can be solved handily using window functions.

Let’s start by crafting a query to give us the store_id, the month of the entered date, the year of the entered_date, and the amount of the sale.

Note that we are using the EXTRACT function which allows us to retrieve subfields such as year, hour, day, month from date/time values.

The next thing we want to do is to partition the table. The original intent of our query was to grab the most recent sales data entered for a given month/year combination and a given store. So our partition should be defined as follows “for each row, get me the rows with the same month/year in the entered date, and the same store_id”. In other words, we need to partition by the extracted month, year, and store_id.

Now here’s the problem. Unlike a GROUP BY statement, we can’t partition over several columns. So how can we get the window frames we want?

The solution I found was to use the string function concat, and grouping by the concatenated value of the extracted month/year, and the store_id.

Now the next step is to apply a where clause that selects only the sales with the highest rank. Since we can’t apply a WHERE clause to an aliased column that is not on the table itself, we have to use a subquery in our FROM statement: