Analytical SQL Functions – theory and examples – Part 2 on the Order By and Windowing Clauses

Windowing Clause

Of course instead of having Oracle apply an implicit, default windowing clause as result of the order by, you can define explicit windows yourself. When no windowing clause is specified, the window is exactly the same as the partition.

Note that for some functions, the order by clause is mandatory, such as for FIRST_VALUE, LAST_VALUE, ROW_NUMBER, LAG and LEAD, RANK and DENSE_RANK.

select ename
, job
, sal
, hiredate
, first_value(sal) over ( partition by job
order by hiredate
range between current row
and unbounded following
) job_avg
, first_value(sal) over ( partition by job
order by hiredate
rows between current row
and 2 following
) job_avg
from emp
where sal &lt; 2500
order
by job

In plain language: the window can be expressed as a (value) range or in terms of rows.

range

A range means: an interval expressed as two boundaries with respect to the current row. Or in even plainer language: from the record with a value of X lower (=preceding) than the current row to the record with a value Y lower (=preceding) than the current row. Where it says lower (=preceding) in the sentence, you can replace it with higher(=following). That is, the range can start and end before, start before and end after or start and end after the current record.

Range can only be used with numeric or date values (a range cannot be expressed in terms of varchar2 or blob, which seems logical). A logical offset can be specified with constants such as RANGE 10 PRECEDING, or an expression that evaluates to a constant, or by an interval specification like RANGE INTERVAL N DAY/MONTH/YEAR PRECEDING or an expression that evaluates to an interval. With logical offset, there can only be one expression in the ORDER BY expression list in the function, with type compatible to NUMERIC if offset is numeric, or DATE if an interval is specified.

An example – calculate the number of employees who earn between 200 less and 400 more than each employee:

select ename
, sal
, count(sal) over ( order by sal asc
range between 200 preceding
and 400 following
) num_of_emps_in_range
from emp
order
by sal asc

This tells us that there are 5 employees who earn between 200 less and 400 more than MARTIN. They are: ADAMS, WARD, MILLER, TURNER and ALLEN. Note that the current record is included in the window in this example, since the start point is before it while the end point is after it. The value of NUM_OF_EMPS_IN_RANGE for MARTIN is 6: himself and these 5 others.

Note too that where the end-point falls outside the partition – even though in this example the partition coincides with entire set of records – it is simply cut-off. Oracle has no problem with you specifying a window with boundaries far outside the scope of the actual range of values.

Also note that the boundaries for a range window can be specified using the keyword UNBOUNDED. Instead of specifying a ridiculously large value to ensure all records from a certain startpoint or until a certain endpoint are included, you can use this keyword.

An example – give me the number of employees earning more than the current employee:

select ename
, sal
, count(sal) over ( order by sal asc
range between 1 following
and unbounded following
) num_of_emps_earning_more
from emp
order
by sal asc

For example for ALLEN: there are 6 employees earning more than him. The Salary range window starting at 1601 (ALLEN’s salary+1) and extending indefinitely into the following salary values contains six records: from CLARK to KING.

As I will discuss later, the values specifying the boundaries of a range window do not have to be constants, they can just as well be expression: column values, functions etc. Let’s run a simple query that uses a range window that depends on the job: for CLERKS the salary window will be 250 on both sides of the current record, for managers 750 and for the others 500:

select ename
, job
, case job
when 'CLERK' then 250
when 'MANAGER' then 750
else 500
end window_size
, sal
, count(sal) over ( order by sal asc
range between case job
when 'CLERK' then 250
when 'MANAGER' then 750
else 500
end preceding
and case job
when 'CLERK' then 250
when 'MANAGER' then 750
else 500
end following
) num_of_emps_in_range
from emp
order
by sal asc

MARTIN, who is a SALESMAN, has a window size of 500. That means that we count the employees who earn more than 500 less or less than 500 more than MARTIN. That adds up to 8 people: from SMITH to ALLEN.

As said earlier, range windows can be used for Numeric and Date values. For Date Values the range can be expressed with numeric value (interpreted as the number of days before and/or after a date) or as INTERVAL DAY TO SECOND or YEAR TO MONTH; see SQL Reference on INTERVAL datatype.

This query counts the number of employees hired between 90 before and one year and 3 months after each employee.

Here we see that in the period from 90 days before to a year and three months after KING was hired, 5 other employees were hired (remember that the value 6 displayed for KING includes KING himself). We can easily see that this period around KING’s hiredate starts at 30-JUN-1981 and ends at 18-FEB-1983. In this period, everyone from TURNER to MILLER was hired.

Note by the way how you can use the functions FIRST_VALUE and LAST_VALUE to retrieve values from the first and last record in a window. This is described by Tom Kyte in his Expert One-on-One Oracle, Apress, 2003).

select ename
, hiredate
, first_value(hiredate) over ( order by hiredate asc
range BETWEEN INTERVAL '90' DAY PRECEDING
AND INTERVAL '1-3' YEAR TO MONTH FOLLOWING
) first_value_in_window
, last_value(hiredate) over ( order by hiredate asc
range BETWEEN INTERVAL '90' DAY PRECEDING
AND INTERVAL '1-3' YEAR TO MONTH FOLLOWING
) last_value_in_window
, count(hiredate) over ( order by hiredate asc
range BETWEEN INTERVAL '90' DAY PRECEDING
AND INTERVAL '1-3' YEAR TO MONTH FOLLOWING
) num_of_emps_in_range
from emp
order
by hiredate asc

rows

The window can also be expressed in terms of a number of rows instead of a range. Rows offer the advantage of not being limited to NUMBER and DATE values as well as ORDER BY expressions on single value. So with using a ROWS expression for the window, you can order the records by more than one column.

The specification of the windowing clause using ROWS is almost the same as definition using RANGE:

The window is specified by indicating a startpoint and an endpoint (‘the window-boundaries’) in terms of a number of rows. When the window is specified as a range, there is no telling upfront how many rows the window (if any) will contain. With rows, you exactly what number the window will contain – but you have no clue what the spread in values within that window will be. Note: you do not always know for sure how many rows there will be in the window; if you specify unbounded precding or unbounded following you can usually not predict how many rows the window will contain. But even with a fixed number preceding or following, the actual number or records in the window is not certain. For example with a window set up as ROWS 5 PRECEDING and 3 FOLLOWING, you would expect the window to always contain 9 rows. However, the first 5 records will have a smaller window, because they do not have 5 preceding; the first record has none preceding, the second only one etc. The same of course applies to the last three records in the partition.

Here is an example of using a ROWS based window. This query selects for each employee the average salary for all colleagues in the same job that were hired later than the employee. Note that we could have done this query with a RANGE based window just as easily. I do not know whether – when you can use both RANGE or ROWS – there is a preference for either one, from the viewpoint of performance for example. One would expect that if there is any preference, it would be to use rows whenever you can – but that is just a hunch.

select ename
, job
, sal
, hiredate
, avg(sal) over ( partition by job
order by hiredate
rows between 1 following
and unbounded following
) job_avg
from emp
order
by job

The value of rows in the windowing clause is also a numeric expression that can be derived from column values or calculations upon them. The next query calculates for each employee the average salary over the current record and the salaries for a number of employees with the same job, earning less than the current employee; the number of employees is calculated as 1% of the salary of the current row – and is therefore dynamic.

Static or Fixed Windows

The Window clause can not be set to a fixed window such as: the first and second row in the partition. For example see the following reporting requirement: Select for the numbers 4 and lower on the salary-ranking per department the difference with the average salary of the numbers 2 and 3. For all employees their name, deptno and salary. Your first idea could be something like the following:

Unfortunately, that is not allowed. Unbounded is perhaps something like infinity in mathematics; infinity minus x is still infinity.

There is a workaround in this case – and probably for most cases. If you use an in-line view to add the row_number within the partition of interest to each record, we have the information we need to work with a quasi-fixed window.

For our example, the next query will do the trick. In the in-line view we select all columns required in the outer query, enriched with the row_number (ranking) of the employee in terms of salary within the department. In the outer query, for all employees ranked 4 or below, we will calculate the difference between his salary and the average of the salaries of the numbers 2 and 3 in that department. To get at the numbers 2 and 3, we need a fixed window ( rows between unbounded -2 preceding and unbounded-3 preceding). As we discussed, that is not allowed. However, now that we know the position of each employee in the partition, we can easily find the window that contains the number 2 and 3 in the department: the (current-rownumber – 2 preceding) takes us to the nr2 in the partition and (current-rownumber – 3 preceding) takes us to the nr 3. So defining the window as

rows between rn-2 preceding
and rn-3 preceding

, we have set up a fixed window that always contains the 2nd and 3rd row in the partition.

Note: we cannot use a negative argument for the rows preceding or following (ORA-01428: argument ‘-1′ is out of range). The employees who come 1st or 2nd in their department would give us negative arguments if we were to use rn-2 and rn-3. Therefore we use the

greatest(rn-2,0)

and

greatest(rn-3,0)

. Since the CASE effectively blocks out the result of all employees ranked 1, 2 or 3, it does not really matter what the avg(sal) returns.

They tell us for example that JAMES earns 600 less than the average of the numbers 2 and 3 in his department (deptno 30). The numbers two and three apparently are ALLEN and TURNER who make 1600 and 1500 which averages to 1550 which is indeed 600 more than the 940 JAMES is earning.

Using the row_number as additional information for each record, gathered from an inline-view, we were able to work with static windows!

SECOND_VALUE

SQL provides the FIRST_VALUE (and also LAST_VALUE) function. FIRST_VALUE allows us to get the value of a certain column (or expression) for the first record in a window. Note that by using IGNORE_NULLS , you can ensure that the first NON NULL value is returned, even that does not belong to the first – or even the second or third – record in the window.

An example of using FIRST_VALUE is the following, to find out the name of the highest paid employee in each department:

select distinct
deptno
, first_value(ename) over (partition by deptno
order by sal desc
)
from emp
order
by deptno

Results:

DEPTNO FIRST_VALU
---------- ----------
10 CLARK
20 KING
30 BLAKE

There is not built-in SECOND_VALUE or ONE_BUT_LAST_VALUE etc., but using the fixed window ‘trick’ we discussed above, it is quite easy to simulate a SECOND_VALUE function. To get the SECOND_VALUE in a window, we will instead get at the FIRST_VALUE in a window that has a boundary that is one lower than the original window. For example, to find the name of the runner-up in terms of salary in each department, we would execute the following query:

Lag and Lead

When the fixed window contains a single record, such as in the example of the SECOND_VALUE approach, you can also make use of the LAG and LEAD functions. These functions allow you to retrieve the value from a row either preceding the current row or trailing it by an indicated number of rows.

LEAD provides access to more than one row of a table at the same time without a self join. Given a series of rows returned from a query and a position of the cursor, LEAD provides access to a row at a given physical offset beyond that position. If you do not specify offset, then its default is 1. The optional default value is returned if the offset goes beyond the scope of the table. If you do not specify default, then its default value is null. You cannot use LEAD or any other analytic function for value_expr. That is, you cannot nest analytic functions, but you can use other built-in function expressions for value_expr. For example the following query returns the next hiree in the same job looking from a certain employee:

meta

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 206 other subscribers

Email Address

About

AMIS is internationally recognized for its deep technological insight in Oracle technology. This knowledge is reflected in the presentations we deliver at international conferences such as Oracle OpenWorld, Hotsos and many user conferences around the world. Our AMIS Technology Blog, the most referred Oracle technology knowledge base outside the oracle.com domain. However you arrived here, we appreciate your interest in AMIS. Link to our Google+ Profile AMIS