Saturday, September 15, 2012

```sql
max(value) keep (dense_rank last order by mydate) over (partition by relation_nr)
```

Unfortunately, when you start searching for the "keep" clause, you won't find anything in the Oracle documentation (and hopefully, because of this blog post, people will now have a reference). Of course Oracle documents these functions; you just have to know that they are called FIRST and LAST in the SQL Language Reference.

Even though these functions were already introduced in version 9, I've seen lots of code that could have used them, but didn't. And that's a pity, because it's a wasted opportunity to write shorter and faster code. The common use case I'm talking about is a detail table with a validity period: typically a column startdate, and optionally an enddate. For such a table, you often need to know the values of the currently valid row.
An example: suppose we have a table RELATIONS, and for each relation we want to know its address at a certain point in time:
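The original post showed the tables and data here. A minimal sketch of what they may have looked like, reconstructed from the description below; the column names, constraints, and the concrete street names and postal codes are my assumptions, not the original data:

```sql
-- Hypothetical reconstruction of the example schema
create table relations
( id   number       not null primary key
, name varchar2(30) not null
);

create table relation_addresses
( relation_id number       not null references relations(id)
, startdate   date         not null
, address     varchar2(30) not null
, postal_code varchar2(6)  not null
, city        varchar2(30) not null
, constraint ra_pk primary key (relation_id, startdate)  -- ties on startdate are impossible
);

insert into relations values (1, 'Oracle Nederland');
insert into relations values (2, 'Ciber Nederland');

-- Oracle Nederland: two addresses, the current one at the Hertogswetering
insert into relation_addresses values (1, date '1995-01-01', 'Some old street 1',    '1111AA', 'Utrecht');
insert into relation_addresses values (1, date '2005-01-01', 'Hertogswetering 163',  '3543AS', 'Utrecht');

-- Ciber Nederland: four addresses; currently Den Bosch, with a future one in Nieuwegein
insert into relation_addresses values (2, date '2000-01-01', 'Some street 1',        '2222BB', 'Eindhoven');
insert into relation_addresses values (2, date '2006-01-01', 'Some street 2',        '3333CC', 'Rotterdam');
insert into relation_addresses values (2, date '2010-01-01', 'Some street 3',        '5211AB', 'Den Bosch');
insert into relation_addresses values (2, date '2013-01-01', 'Some street 4',        '3430AB', 'Nieuwegein');
```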

Relation "Oracle Nederland" has two addresses, and its current address being at the Hertogswetering. And fictively, relation "Ciber Nederland" has four addresses. The current address is the Den Bosch one. And I've also recorded a future address in Nieuwegein. Note that, in real life, the latter three are all Ciber offices currently in use.
To get the active relation addresses on October 1st, 2012, I can use this query:
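The query itself is missing here; based on the description that follows, it was along these lines (table and column names are assumptions):

```sql
-- row_number variant: rank addresses per relation, newest first,
-- then keep only the top-ranked row per relation
select relation_id
     , address
     , postal_code
     , city
  from ( select ra.relation_id
              , ra.address
              , ra.postal_code
              , ra.city
              , row_number() over
                  ( partition by ra.relation_id
                    order by ra.startdate desc ) rn
           from relation_addresses ra
          where ra.startdate <= :reference_date
       )
 where rn = 1
```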

Here you compute a row_number, partitioning the result set per relation_id and ordering by startdate in descending order. This means that, per relation_id, the row with the most recent startdate before the reference date gets row_number 1. By using an inline view, we can filter on the outcome of the analytic function and select only the rows with row_number 1. In forums, you'll often see this solution advised. Compared to the correlated subquery, this query selects only once from table RELATION_ADDRESSES. However, you can do even better by just adding three "keep clause" functions to the original query:
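This query was also lost; reconstructed from the description (identifiers assumed), it would read:

```sql
-- keep clause variant: a plain GROUP BY, where each aggregate keeps
-- only the value belonging to the most recent startdate per relation
select relation_id
     , max(address)     keep (dense_rank last order by startdate) address
     , max(postal_code) keep (dense_rank last order by startdate) postal_code
     , max(city)        keep (dense_rank last order by startdate) city
  from relation_addresses
 where startdate <= :reference_date
 group by relation_id
```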

The three extra aggregate functions all do a "dense_rank last order by startdate", meaning "sort the rows by startdate, and keep only the rows with the most recent startdate". If more rows share that most recent startdate, the max function at the start tells Oracle to pick the maximum address/postal_code/city. However, since (relation_id,startdate) is unique, ties are impossible, and so the max function is just a formality; I could equally have used min.

The query is shorter and -to me- clearer at first glance. However, the main reason for my enthusiasm for the aggregate functions FIRST and LAST is that they're simply faster. To show this, let's execute those queries against a table with 300,000 rows: 100,000 relations with 3 addresses each:
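The test setup is missing from this copy of the post. One hedged way to generate such a table, including the 254-bucket histograms mentioned below (the generator expression and index are my assumptions, not the original script):

```sql
-- 300,000 rows: relation_id cycles 1..100000 three times, and because
-- 100000 mod 3 = 1, each relation gets three distinct startdates
create table relation_addresses as
select mod(level - 1, 100000) + 1                   relation_id
     , date '2000-01-01' + mod(level, 3) * 365      startdate
     , 'Address '     || level                      address
     , lpad(mod(level, 10000), 4, '0') || 'AB'      postal_code
     , 'City '        || mod(level, 100)            city
  from dual
connect by level <= 300000;

create index ra_i on relation_addresses (relation_id, startdate);

begin
  dbms_stats.gather_table_stats
  ( ownname    => user
  , tabname    => 'RELATION_ADDRESSES'
  , method_opt => 'for all columns size 254'  -- histograms with 254 buckets
  );
end;
/
```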

Note that I created histograms with 254 buckets, just to make the optimizer see that it should full scan the table despite the "startdate <= :REFERENCE_DATE" predicate.
This next query should give a clue what's in the table:
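The query is lost here; a plausible sketch of a summary query along those lines (a guess, not the original):

```sql
-- Summarize the test data: total rows, distinct relations,
-- and the range of startdates
select count(*)                    total_rows
     , count(distinct relation_id) relations
     , min(startdate)              first_startdate
     , max(startdate)              last_startdate
  from relation_addresses
```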

Note that this query takes longer than the correlated subquery above: .97 seconds versus .66 seconds. The HASH JOIN ANTI took .49 seconds (.66 - .05 - .12), whereas computing the ROW_NUMBER took .84 seconds (.93 - .09). So here, on my laptop, I avoided the .05 seconds of the INDEX FAST FULL SCAN, but spent .35 seconds (.84 - .49) more on the computation. When I/O is more expensive than on my laptop, the time of the first query will likely go up and the times will be closer to each other.
Now the keep clause variant:

The shortest query, the shortest plan and the fastest execution. The SORT GROUP BY immediately reduces the number of intermediate rows from 297K to 100K, whereas the WINDOW SORT PUSHED RANK had to compute the row_number for all 297K rows first.

PS: this topic and much more are covered in an upcoming Live Virtual Seminar for Oracle University on October 2nd

I was rereading this article and discovered my comment and your reply, which I did not notice at the time. Sorry for the delay in answering your question.

The idea behind my variant is to use the index to access the desired rows, then get the rows using TABLE ACCESS BY USER ROWID. This results in fewer total consistent gets than the full scan, provided there are several addresses per relation.
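A sketch of how I understand that variant (the exact query is not shown in the thread; identifiers and bind name are assumptions). The subquery can be answered from the index on (relation_id, startdate) alone, since rowids are stored in the index entries, and the outer query then fetches only the winning rows via TABLE ACCESS BY USER ROWID:

```sql
-- rowid variant: find the rowid of the most recent address per relation
-- from the index, then fetch just those rows from the table
select ra.*
  from relation_addresses ra
 where ra.rowid in
       ( select max(rowid) keep (dense_rank last order by startdate)
           from relation_addresses
          where startdate <= :reference_date
          group by relation_id
       )
```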

In my tests today, with 300 or more addresses per relation I get far fewer consistent gets. With 30 addresses per relation I get somewhat fewer consistent gets, but I need to use the FIRST_ROWS hint to get the plan I want.

My suggestion was not intended to replace your solution generally: it's a variant that could be useful when the expected result set is small compared to the table size.

Which version did you test this on? I'm not sure if anything has changed in the last couple of years, but look at http://www.oramoss.com/blog/2010/03/12/keep-dense_rank-versus-row_number-further-details/ where Jeff mentions that the row_number approach works better as the number of "other" columns needed in the result set (corresponding to the MAX/MIN value) increases.

This is version 11.2.0.2. Thanks for the reference to Jeff Moss' blog. Unfortunately, he forgot to mention the database version as well. My test above was with three columns. I also tested with 1 and 6 columns, but I didn't see any significant changes in runtime, so I left those out. My guess is that I'll never encounter a situation where the number of keep-clause aggregate functions is so high that the analytic function wins.