This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Window Functions (OVER Clause)—Help Make a Difference

If I had to name one concept in standard SQL that I thought was the most important one, and that is worth Microsoft’s investment for future versions of SQL Server, I’d say window functions, hands down, without a doubt. Window functions are a subset of what the standard calls set functions, meaning, functions that are applied to a set of rows. The term window is used to describe the set of rows that the function operates on, and the language provides a clause called OVER where you provide the window specification. So what’s the big deal, and what makes window functions more important than other features that are missing in SQL Server? There are so many reasons… But first I’ll give a bit more background about window functions, and then I’ll get to the reasons and demonstrate use cases…

First, to clarify, SQL Server 2005 already introduced some support for window functions—the ranking calculations: ROW_NUMBER, RANK, DENSE_RANK and NTILE, and partial support for window aggregate functions with only the partitioning part implemented. SQL Server 2005 was a great release for developers with so many cool and practical T-SQL features. The number of solutions that I simplified and optimized just with the ROW_NUMBER function and CTEs is amazing. Still, there are many standard features related to window functions that SQL Server didn’t yet implement (as of SQL Server 2008 R2) and that can help address quite a wide variety of business problems with simpler and more efficient solutions.

These days the next major release of Microsoft SQL Server—version 11—is being developed. These are pivotal days for candidate features where decisions are made whether they will or will not make it to the final release. And even though I think that more complete support for window functions is so important to developers and to the success of SQL Server, I’m not sure at all that we will see those in the product. This is time for us as part of the SQL Server community to express our strong opinion. Hopefully Microsoft will realize how important it is for us to have those features in the product, as well as to show that the SQL Server community’s opinion matters.

In this article I will explain some of the key features that are missing in SQL Server and why it’s important to add support for such features. If you share my opinion, and haven’t done so already, you can cast your vote in the following feature request items:

Like with any thing in life that you’re not aware of, you don’t know how it can help you if you don’t know that it exists. My feeling is that many developers are not really aware of the capabilities of the standard window functions and therefore Microsoft doesn’t see a lot of demand for it. Education and raising the topic to people’s awareness is therefore key to the realization of the benefits, and as a consequence, encourage people to ask Microsoft for more support.

The unfortunate part is that all of SQL Server’s leading competitors; including Oracle, DB2 and Teradata for some time now already have a far more complete support for window functions. So even though my focus and passion is for SQL Server, I sometimes find myself in the awkward situation of demoing standard SQL window functions on Oracle when teaching or presenting.

So what’s missing…

The most important missing features are probably ordering and framing options for window aggregate functions. Other key features that are still missing are distribution and offset functions, and reusability of window definitions. More details shortly.

Why are window functions so powerful?

SQL is often referred to as a set-based language. The reason is that the language is based on the relational model, which in turn is based, in part, on mathematical set theory. When writing SQL queries you’re supposed to deal with a table (or relation, which is a set) as a whole, as opposed to the table’s individual rows. Also, since sets have no order, you’re not supposed to make any assumptions in regards to the physical ordering of the data.

The reality is that for many developers set-based thinking is far from being intuitive, and it can take a few good years to truly think in SQL terms. This is why often developers tend to use cursors—because using those feel like an extension to what they already know. Cursors allow you to deal with one row at a time, and also rely on specified order of the data.

Window functions have an ingenious design. They do operate on sets, or windows, while allowing you to indicate ordering as part of the calculation where relevant. Not to confuse with cursors, window functions allow defining ordering for the calculation without making any expectations in regards to ordering of the input data given to the query or the output coming out of the query. In other words, no relational concepts are violated. Ordering is only part of the specification of the calculation. Similarly, other common elements in querying problems, like partitioning, framing of applicable rows, are all intuitive parts of the window specification. So in a sense, I see window functions as bridging the big gap that exists between cursor/iterative and set-based thinking.

Now, that’s a lot of words before showing even one example. So let’s look at a few more concrete examples of some of the missing features…

As mentioned, currently window aggregate functions support only a partitioning element. What’s missing are ordering and framing options. The standard supports an ORDER BY clause to define ordering in the window and ROWS and RANGE clauses that frame the window based on the defined ordering. A classic example that would benefit from ordering and framing is running totals. Consider the following Accounts table definition:

The table represents deposit (positive value) and withdrawal (negative value) transactions in bank accounts. You need to calculate at each point what the account balance was. Like with many querying problems there’s a partitioning element (actid), ordering element (tranid), and a measure that the calculation applies to (val). Window aggregate functions in standard SQL support all three elements. Here’s how you would express the query calculating the balance at each point for each account:

But besides the fact that these solutions are not as straightforward and intuitive as the one using a window function, there’s a big problem with the way SQL Server currently optimizes the subquery and join solutions. Assuming you defined a covering index on the partitioning column, followed by the ordering column, and including the aggregated measure, for each row SQL Server will scan all rows with the same partitioning value and an ordering value that is less than or equal to the current. Given p partitions with r rows in average, and fairly even distribution of rows in partitions, the total number of rows processed in such a plan is pr + p(r + r^2)/2. This means that in respect to the partition size, the algorithmic complexity, or scaling, of the solution s quadratic (N^2). That’s bad. The window function form lends itself to good optimization, especially with the fast track case like the above (rows between unbounded preceding and current row). It should be straightforward to the optimizer to optimize this query with one ordered scan of the index, translating to simply pr rows being scanned.

Another example for running totals is querying a table called EmpOrders with a row for each employee and month, and calculating the cumulative performance for each employee and month; in other words, the total value for the employee from the beginning of his/her activity until the current month. Here’s how you would express it with a window aggregate:

There are many business examples where ordering and framing options can be useful besides calculating account balances. Those include inventory, running totals for reporting, moving averages, and so on. Here’s an example for a query calculating the average of the last three recorded periods:

SELECT empid, ordermonth,AVG(qty) OVER(PARTITION BY empidORDER BY ordermonthROWS BETWEEN 2 PRECEDINGAND CURRENT ROW) AS avg_last_threeFROM Sales.EmpOrders;

There are also various temporal querying problems where running totals serve part of the solution.

For simplicity I showed examples where framing is based on the ROWS clause where you indicate an offset in terms of number of rows. The standard also supports a RANGE clause that allows indicating an offset in terms of values, such as time intervals, as in the following example returning the average of the last three months:

The SQL standard defines several offset functions that would make developers’ life so much easier compared to the tools available today for similar needs. Among the missing offset functions are LAG and LEAD, returning a value from a row in a given offset from the current row based on specified ordering. For example, the following query will return, for each current order, also the order date of the previous and next orders:

There are lots of business examples for the usefulness of these functions, like recency calculations, trend analysis, and others. Here’s an example for a query addressing recency calculations, returning the difference in terms of days between the current and previous orders:

Other missing offset functions are FIRST_VALUE, LAST_VALUE, returning the value from the first or last rows in the partition based on specified ordering. Here’s an example returning the value of the first and last orders per customer with each order:

Standard SQL supports window distribution functions that performing statistical calculations. Specifically it supports the PERCENT_RANK and CUM_DIST functions, calculating a percentile rank and cumulative distribution. These functions give you a relative rank of a row in respect to other rows in the window partition, expressed as ratio/percent. The specific formulas used by the two variants are:

PERCENT_RANK: (RK-1)/(NR-1), where RK = rank, NR = number of rows in partition

CUME_DIST: NP/NR, where NP = number of rows preceding or peer with current row (same as next rank - 1)

Here’s an example using these functions:

SELECT custid, COUNT(*) AS numorders,
PERCENT_RANK() OVER(ORDER BY COUNT(*)) AS percentrank,
CUME_DIST() OVER(ORDER BY COUNT(*)) AS cumedistFROM Sales.OrdersGROUP BY custid;

Suppose you need to write several window functions that rely on the same window definition (or part of it). You will end up with a lot of repetition of code. Standard SQL has a clause called WINDOW that allows naming a window definition or part of it, making it reusable. For example, instead of:

SELECT empid, ordermonth, qty,SUM(qty) OVER ( PARTITION BY empidORDER BY ordermonthROWS BETWEEN UNBOUNDED PRECEDINGAND CURRENT ROW ) AS run_sum_qty,AVG(qty) OVER ( PARTITION BY empidORDER BY ordermonthROWS BETWEEN UNBOUNDED PRECEDINGAND CURRENT ROW ) AS run_avg_qty,FROM Sales.EmpOrders;

You would use:

SELECT empid, ordermonth, qty,SUM(qty) OVER W1 AS run_sum_qty,AVG(qty) OVER W1 AS run_avg_qty,FROM Sales.EmpOrders
WINDOW W1 AS ( PARTITION BY empidORDER BY ordermonthROWS BETWEEN UNBOUNDED PRECEDINGAND CURRENT ROW );

As you can see, with the WINDOW clause the code is shorter, more readable, and easier to maintain.

Conclusion

I showed just part of the standard support for window functions that SQL Server is still missing. There’s more, like window frame exclusion. There are also other set functions not implemented, like ordered set functions, and so on. But here I wanted to make a point in hope that Microsoft will realize how important it is to add such support in SQL Server 11. If you feel so as well, help make a difference by voting for the items, write about the topic, talk about it, increasing people’s awareness. Hopefully this request will find open ears. As a reminder, here are the open items for some of the requests for enhancements: