SQL Server 2005's APPLY, Part 2

Last month I started a two-part series about SQL Server 2005’s APPLY operator. I covered the fundamentals of APPLY, explaining the three modes in which you can use it: CROSS APPLY, OUTER APPLY, and implicit APPLY. Now that I’ve covered the fundamentals, this month I discuss more advanced uses for APPLY, including efficient parallelism of queries, reuse of column aliases, aggregate over columns, unpivoting, and inlining scalar functions.

In the examples in this article, I use sample databases called InsideTSQL2008 and Performance that I originally created for my books. To create and populate those databases, download the source code for the book Inside Microsoft SQL Server 2008: T-SQL Querying (Microsoft Press, 2009). Use the file InsideTSQL2008 to create and populate the sample database InsideTSQL2008. Use the first part of the source code file for Chapter 4 (Query Tuning) to create and populate the Performance database.

Efficient Parallelism of Queries

The APPLY operator can be used to address a wide variety of problems—typically, problems that involve a partitioning element (e.g., an employee, a customer, a shipper). One interesting use of APPLY is to utilize parallelism efficiently in a certain class of problems. I initially learned about this technique from Adam Machanic, a SQL Server MVP.

To demonstrate the technique, let’s use the Performance sample database. Run the following code to create an index on the Orders table:

USE Performance;
CREATE INDEX idx1 ON dbo.Orders(empid, orderid);

Suppose that you need to calculate for each row in the Orders table two row numbers: one partitioned by empid and ordered by orderid and another partitioned by empid and ordered by orderid DESC (or any other order that’s different from the first function’s order). You can use the following query to achieve this task:

SELECT empid, orderid,
  ROW_NUMBER() OVER(PARTITION BY empid ORDER BY orderid) AS rownum_asc,
  ROW_NUMBER() OVER(PARTITION BY empid ORDER BY orderid DESC) AS rownum_desc
FROM dbo.Orders;

Recall that earlier you created an index called idx1 on (empid, orderid). The tricky part here is that the two functions have different ordering specifications, and in such a case, SQL Server can rely on an index ordering to support only one of them. For the other function, SQL Server has to sort the rows. Figure 1 shows the execution plan for this query.

SQL Server uses parallelism for the sort that supports one of the functions, but it has to sort by both the partitioning element (empid) and the ordering element (orderid DESC). I ran this query on a machine with eight logical CPUs (if you have fewer logical CPUs, you can mimic a similar machine for test purposes by using the startup parameter -P8), and it took 3 seconds to finish with a hot cache and results discarded.

A more efficient strategy and utilization of parallelism is to filter each partition’s rows through the index and apply a sort only to one partition’s rows at a time. To achieve this, you can use the APPLY operator, like so:
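A minimal sketch of such a query, assuming the Performance database's Employees table holds one row per employee (the exact listing may differ):

```sql
SELECT E.empid, A.orderid, A.rownum_asc, A.rownum_desc
FROM dbo.Employees AS E
  CROSS APPLY (SELECT orderid,
                 -- no PARTITION BY needed; each application sees one employee's rows
                 ROW_NUMBER() OVER(ORDER BY orderid) AS rownum_asc,
                 ROW_NUMBER() OVER(ORDER BY orderid DESC) AS rownum_desc
               FROM dbo.Orders AS O
               WHERE O.empid = E.empid) AS A;
```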

Observe that the plan scans the employees clustered index first, and then for each employee, performs a seek in the index on Orders to filter the current employee’s rows. Then the sort is applied only to one employee’s rows. You can also see here the use of an optimization technique called few outer rows, in which the optimizer realizes that a small number of rows are returned from the outer part of the Nested Loops join and uses a Repartition Streams exchange operator to evenly distribute the rows to the different threads that then handle the work in the inner part of the join. This query finished in 1 second on the same machine—one third of the runtime of the previous query.

When you’re done, run the following code for cleanup:

DROP INDEX idx1 ON dbo.Orders;

Reuse of Column Aliases

Suppose that you have a query with lengthy expressions that you need to repeat more than once. For example, you need to write a query against the Sales.Orders table in the InsideTSQL2008 database, and in this query filter only orders that were placed prior to the last day of the month, and return for each order the respective beginning-of-month and end-of-month dates. It would be nice if you could alias the expression that calculates the beginning-of-month date (call it beginning_of_month), then use that alias in the expression that calculates the end-of-month date (call it end_of_month), and then use that alias in the query filter. The following attempt, however, results in errors.
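A sketch of such an attempt, assuming classic DATEADD/DATEDIFF date-rounding expressions (the exact expressions in the original may differ):

```sql
SELECT orderid, orderdate,
  DATEADD(month, DATEDIFF(month, '19000101', orderdate), '19000101')
    AS beginning_of_month,
  -- fails: Invalid column name 'beginning_of_month'
  DATEADD(day, -1, DATEADD(month, 1, beginning_of_month)) AS end_of_month
FROM Sales.Orders
-- fails: Invalid column name 'end_of_month'
WHERE orderdate < end_of_month;
```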

To understand the reasons for the errors, you need to understand certain aspects in the design of SQL. You can’t reuse aliases defined in the SELECT clause within the same SELECT clause because conceptually the language evaluates all expressions that appear in the same logical query processing phase (SELECT, in our case) at the same point in time. You also can’t reuse aliases defined in the SELECT clause in the WHERE clause, because the WHERE clause is conceptually evaluated prior to the SELECT clause. The conceptual evaluation order of the clauses is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.

One option is to repeat the logic in all expressions in the query, like so:
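Continuing the sketch from the failed attempt, the repeated-logic version would look like this:

```sql
SELECT orderid, orderdate,
  DATEADD(month, DATEDIFF(month, '19000101', orderdate), '19000101')
    AS beginning_of_month,
  DATEADD(day, -1, DATEADD(month, 1,
    DATEADD(month, DATEDIFF(month, '19000101', orderdate), '19000101')))
    AS end_of_month
FROM Sales.Orders
WHERE orderdate < DATEADD(day, -1, DATEADD(month, 1,
  DATEADD(month, DATEDIFF(month, '19000101', orderdate), '19000101')));
```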

This results in a lot of redundancy and increases the likelihood of errors.

Another option is to gradually build your solution query in steps, using CTEs in which you don’t need to repeat the expressions, as Listing 1 shows. However, this layered approach adds its own complexity to the solution and is a bit verbose even though you don’t need to repeat the expressions.
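A minimal sketch of such a CTE-based solution, with each step defining an alias that the next step can use:

```sql
WITH C1 AS
(
  SELECT orderid, orderdate,
    DATEADD(month, DATEDIFF(month, '19000101', orderdate), '19000101')
      AS beginning_of_month
  FROM Sales.Orders
),
C2 AS
(
  SELECT orderid, orderdate, beginning_of_month,
    DATEADD(day, -1, DATEADD(month, 1, beginning_of_month)) AS end_of_month
  FROM C1
)
SELECT orderid, orderdate, beginning_of_month, end_of_month
FROM C2
WHERE orderdate < end_of_month;
```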

It turns out that you can use the APPLY operator to handle this need in an elegant and more concise manner than the alternatives. Remember that APPLY is a table operator that’s evaluated in the FROM clause of the query. If you have multiple table operators, they’re evaluated from left to right, and each has access to the table result produced by the previous table operator. Also, the FROM clause is the first query clause to be conceptually evaluated—before the WHERE clause and the others. This means that if you define a column alias in a table expression used by an APPLY operator, all subsequent table operators, as well as query clauses, have access to that alias. All this leads to the ability to handle our current querying task in the following elegant manner:
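A sketch of that APPLY-based form: each CROSS APPLY defines one alias, and later operators and clauses can reference it.

```sql
SELECT orderid, orderdate, beginning_of_month, end_of_month
FROM Sales.Orders
  CROSS APPLY (SELECT DATEADD(month,
                 DATEDIFF(month, '19000101', orderdate), '19000101'))
    AS A1(beginning_of_month)
  -- A2 can reuse the alias defined by A1
  CROSS APPLY (SELECT DATEADD(day, -1, DATEADD(month, 1, beginning_of_month)))
    AS A2(end_of_month)
WHERE orderdate < end_of_month;
```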

Aggregate Over Columns

Our next example for using the APPLY operator is to calculate aggregates over columns. To create the sample data for this example, use the code in Listing 2. This code creates and populates a table called Sales in the tempdb database for test purposes.

Listing 2: Code to Create and Populate the Sales Table

USE tempdb;
IF OBJECT_ID('dbo.Sales', 'U') IS NOT NULL DROP TABLE dbo.Sales;
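A minimal version of the rest of the listing, assuming the monthly quantity columns are named qty01 through qty12 (the sample values are taken from the Figure 4 output):

```sql
CREATE TABLE dbo.Sales
(
  custid    VARCHAR(10) NOT NULL,
  salesyear INT         NOT NULL,
  qty01 INT, qty02 INT, qty03 INT, qty04 INT, qty05 INT, qty06 INT,
  qty07 INT, qty08 INT, qty09 INT, qty10 INT, qty11 INT, qty12 INT,
  CONSTRAINT PK_Sales PRIMARY KEY(custid, salesyear)
);

INSERT INTO dbo.Sales VALUES
  ('A', 2010, 90, 41, 75,  9, 85,  6, 65,  5, 30, 90, 11, 71),
  ('A', 2011, 29, 29,  8, 95,  1, 16, 36, 74, 59, 43, 31, 49),
  ('B', 2009, 29, 51, 92, 15,  2, 45, 26, 90, 34, 14, 25,  9),
  ('B', 2010, 39,  8, 94, 25, 30, 35, 42, 75, 62,  7, 98, 19),
  ('B', 2011, 39, 22, 41, 56,  5, 27,  2, 22, 32, 52, 74, 26);
```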

The Sales table holds the monthly sales quantities. The table has one row for each year and one column for each month. Why the table was designed like this isn’t really our concern—perhaps the data was imported from a spreadsheet in this manner. We have a very specific task: Write a query that calculates the minimum and maximum monthly quantities in each year.

The APPLY operator can be used to handle this task elegantly and very efficiently by combining it with the enhanced VALUES clause available in SQL Server 2008 and later. The VALUES clause can be used to define a derived table, like so:

(VALUES(<row_constructor>),(<row_constructor>),...) AS D(<col_list>);

Relying on this capability, you can use the APPLY operator with the VALUES clause to turn each source row to 12 target rows (one for each month) and have all monthly quantities in one result column (call it qty). Then you can perform any aggregates that you need on that result column. Here’s what the solution query looks like:
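A sketch of such a solution query, assuming the monthly columns are named qty01 through qty12:

```sql
SELECT custid, salesyear,
  qty01 AS [01], qty02 AS [02], qty03 AS [03], qty04 AS [04],
  qty05 AS [05], qty06 AS [06], qty07 AS [07], qty08 AS [08],
  qty09 AS [09], qty10 AS [10], qty11 AS [11], qty12 AS [12],
  A.mn, A.mx
FROM dbo.Sales
  -- VALUES turns the 12 columns into 12 rows with one qty column
  CROSS APPLY (SELECT MIN(qty), MAX(qty)
               FROM (VALUES(qty01),(qty02),(qty03),(qty04),(qty05),(qty06),
                     (qty07),(qty08),(qty09),(qty10),(qty11),(qty12))
                 AS D(qty)) AS A(mn, mx);
```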

Figure 4 shows the output of this query. Although this approach isn’t a conventional way to use the VALUES clause, it’s certainly a valid, supported method.

Figure 4: Output of query calculating aggregates over columns

custid  salesyear  01  02  03  04  05  06  07  08  09  10  11  12  mn  mx
------  ---------  --  --  --  --  --  --  --  --  --  --  --  --  --  --
A       2010       90  41  75   9  85   6  65   5  30  90  11  71   5  90
A       2011       29  29   8  95   1  16  36  74  59  43  31  49   1  95
B       2009       29  51  92  15   2  45  26  90  34  14  25   9   2  92
B       2010       39   8  94  25  30  35  42  75  62   7  98  19   7  98
B       2011       39  22  41  56   5  27   2  22  32  52  74  26   2  74

Unpivoting

The last example, in which you needed to calculate aggregates over columns, involved first unpivoting the monthly quantity columns into rows. You could achieve this with the UNPIVOT operator, but performance tests I've done show that the technique with the APPLY operator is faster. And if speed alone isn't enough of a reason to switch to using the APPLY operator to handle unpivoting, perhaps the next example will be.

Suppose you needed to unpivot multiple sets of columns. To demonstrate this need, I’ll use a table called Sales (a different one than in the previous example) that you create and populate by running the code in Listing 3.

Listing 3: Code to Re-Create and Populate the Sales Table

USE tempdb;
IF OBJECT_ID('dbo.Sales', 'U') IS NOT NULL DROP TABLE dbo.Sales;
GO
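A minimal version of the rest of the listing, with the column names described in the text (the sample values are illustrative):

```sql
CREATE TABLE dbo.Sales
(
  custid  VARCHAR(10) NOT NULL PRIMARY KEY,
  qty2009 INT,   qty2010 INT,   qty2011 INT,
  val2009 MONEY, val2010 MONEY, val2011 MONEY
);

-- illustrative sample values
INSERT INTO dbo.Sales(custid, qty2009, qty2010, qty2011,
                      val2009, val2010, val2011)
  VALUES('A', 606, 113, 781, 4632.00, 6877.00, 4815.00),
        ('B', 243, 861, 637, 2362.00, 8974.00, 3583.00);
```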

The Sales table has a row for each customer, a set of three columns for yearly quantities (qty2009, qty2010, qty2011), and a set of three columns for yearly values (val2009, val2010, val2011). Your task is to unpivot each customer source row to three target rows (one for each year) and in each result row return the customer ID, year, quantity, and value. Try to do this with the UNPIVOT operator, and you'll realize that you need to write two UNPIVOT queries—one for each set of columns to be unpivoted—and join the results. But with APPLY and the VALUES clause, you can use a technique similar to the one in the aggregates over columns example, constructing three result rows out of each source row with row constructors, like so:
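A sketch of that query: each row constructor pairs a year with its quantity and value columns, so both sets are unpivoted in one pass.

```sql
SELECT custid, salesyear, qty, val
FROM dbo.Sales
  CROSS APPLY (VALUES(2009, qty2009, val2009),
                     (2010, qty2010, val2010),
                     (2011, qty2011, val2011)) AS A(salesyear, qty, val);
```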

Inlining Scalar Functions

Developers love using functions to encapsulate logic—which is what programmability best practices encourage. Therefore, you see a wide use of user-defined functions (UDFs) in T-SQL code. Consider scalar UDFs. Suppose that the UDF is called in a query and a column from the table is passed as input, as in the following example:

SELECT col1, dbo.MyFunction(col1) AS expr1 FROM dbo.T1;

If the function has a body with flow that includes more than just a single scalar expression, you'd expect to pay whatever cost is involved with the UDF call for each of the rows involved in the query. But what if the UDF just encapsulates a return of a scalar expression—either a result of a query or just an expression? It turns out that SQL Server doesn't currently attempt to inline such an expression in the calling query—and this carries a major performance penalty when compared with embedding the original expression directly in the calling query.

I’ll demonstrate the performance problem with an example against the Performance database. Consider the following query returning orders placed on the last day of the year:

The UDF-based query is certainly clearer and more concise, but it takes 5 seconds to finish! The reason is twofold. First, SQL Server unfortunately doesn't inline the UDF, and therefore you pay some overhead for each of the function's calls. Second, using a T-SQL scalar UDF prevents parallelism. But all hope is not lost; even though this might sound like a bizarre idea, you can turn the function into an inline table-valued UDF, like so:
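A sketch of the inline table-valued form and a query that invokes it with CROSS APPLY (the function name is illustrative):

```sql
CREATE FUNCTION dbo.EndOfYearTVF(@dt AS DATETIME) RETURNS TABLE
AS
RETURN
  SELECT DATEADD(year, DATEDIFF(year, '19001231', @dt), '19001231')
    AS end_of_year;
GO

SELECT orderid, orderdate
FROM dbo.Orders
  CROSS APPLY dbo.EndOfYearTVF(orderdate) AS A
WHERE orderdate = A.end_of_year;
```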

This time, the UDF got inlined prior to optimization because it’s an inline table-valued UDF. You basically get the same query plan as with the first query form that didn’t include a UDF call, and therefore the performance is the same. The query finishes in under a second using a parallel scan.

Give APPLY a Try

The APPLY operator was an underappreciated feature when it was first introduced in SQL Server 2005. But with time, people have found elegant, efficient, and creative ways to use APPLY. These days, I find myself using APPLY quite a lot in my code. If you’re not using APPLY extensively yet, I hope you’ll give it a try.
