Description

Having recently tried (a slightly modified version of) Joe Celko's query to compute the median (see below) I noticed inconsistencies in the results.
Given a small, say, 10 row, table it works as expected. Given a larger table, approx. 100,000 rows, it seem like the row_number () ... function is applied before the where clause has filtered the virtual table in the CTE. Then when the where clause is applied there will be gaps in the row_number () numbers for each row that is filtered out. Remove the where clause and it works as expected regardless of the number of rows in the table.
I don't have the SQL standard(s) available so I can't tell if this is the correct behaviour but it seem very counter-intuitive.
create table [dbo].[t] (
[Id] [bigint] identity (1,1) NOT NULL,
[a] [int] NOT NULL, -- values between 1 and 100,000
[b] [smallint] NOT NULL, -- values between 1 and 15, mostly 1 and 2
[c] [bigint] NOT NULL, -- typical values between -500,00 and 200,00
[d] [bigint] NOT NULL,
[e] [datetime] NOT NULL default (getdate ()),
constraint [pk_t] primary key nonclustered (
[Id] asc
)
) on [primary]
create index ix_a on dbo.t;
create clustered index ix_e on dbo.t;
-- add 10 rows and it works as expected
-- add 100,000 rows and the result
with SortedData (x, hi, lo)
as (
select c, -- integer, positive or negative
row_number () over (order by c asc),
row_number () over (order by c desc),
from t
where t.b = 1
)
select avg (x * 1.0) as median
from SortedData
where hi in (lo, lo+1, lo-1);

Assign To

We're closing this Connect item as Not Repro since the problem was related to duplicate values in the input. Please see earlier comments for a full explanation of the observed behavior and code to generate the desired result.

If you have any questions or if you feel that this item has been closed prematurely, please contact us again.

Thanks again for helping to improve SQL Server by providing us with feedback.

Susan PriceProgram ManagerSQL Server Database Engine

Posted by Microsoft on 8/20/2009 at 12:19 PM

Hello,

Thank you for filing your concerns with the SQL Server database engine. I believe Adam (see Community Discussion) has correctly diagnosed the problem (thanks Adam!) as duplicate c values. If the row numbers in hi and lo are not appropriately sorted within the rows for duplicate c values, then the where clause (WHERE hi IN (lo, lo+1, lo-1)) may not be true for any row. When I ran Adam's code, the median value was -155. Without explicitly sorting by id within rows with duplicate c values, the (partial) output is:c hi lo-154 3322 3308-154 3323 3309-154 3324 3310-155 3303 3311-155 3304 3312-155 3305 3313-155 3306 3314-155 3307 3315-155 3308 3316-155 3309 3317-155 3310 3318-155 3311 3319-155 3312 3320-155 3313 3321-155 3314 3322-155 3315 3323-155 3316 3324-155 3317 3325-156 3293 3326You can see that hi and lo are never equal. When I run the code the breaks the ties (orders by c, id), the output is:

Since Adam has provided an explanation of the behavior you observed and code to generate the result you want, I will close this Connect item. If you have any questions or comments, or if you feel I have closed this item prematurely, please contact us again.

Thank you,

Susan PriceProgram ManagerSQL Server Database Engine

Posted by Adam Machanic on 8/18/2009 at 8:04 AM

The problem is duplicate C values. There is no guarantee that each row will be ordered the same way, so if you have duplicates you can get different row numbers. The following, for example, is totally valid:

c lo hi- -- --1 1 11 2 41 3 21 4 3

... you need to break the ties by ordering by the PK.

--

Here's a query to show all of this in more detail. I've fixed errors in your code to make it work:

Adam is brilliant and clear as usual. Other SQLs will see two ROW_NUMBER() OVER(.. ORDER BY x ASC) and ROW_NUMBER() OVER(.. ORDER BY x DESC) and optimize it using the algebra that DESC = (COUNT(*) - ASC +1); this might be why nir_scorpio expected his answer. But SQL Server does not yet do this kind of optimization and even if it did, you should not count on it (old SQL Server guys: remember when GROUP BY did an ORDER BY under the covers? ). Those are two different function calls and should be done in parallel as teh SELECT list is constructed according to Standards.

However, there is another consideration: NULLs in the median set.

Standard SQL has a pattern for all aggregate functions:1) The name is three letters, such as MDN(). I have no idea why this is so. 2) There is an optional MDN(ALL <expr>) or MDN(DISTINCT <expr>) to remove or retain redundant duplicates. This leads to MAX(DISTINCT x) being valid syntax and really dumb semantics, but I digress. 3) NULLs are removed before the aggregate computation.

Item 3 might be a problem. Originally NULLs sorted in an implementation defined order. That meant they were:

1) Always last, regardless of ASC or DESC2) Always first, regardless of ASC or DESC3) Always higher than the values of a data type4) Always lower than the values of a data type

The current SQL Standard allows you to add a NULLS [FIRST | LAST] subclause to the [ASC | DESC] qualifiers in the ORDER BY clause. But SQL Server dos not have it yet. It would be a good idea to remove NULLs in the CTE to keep the spirit of aggregation in SQL. And to consider a SELECT DISTINCT to mimic MDN(DISTINCT..)