David Morrison's Blog

Ever had to do statistical analysis on data in SQL Server? Maybe for reporting, or to generate data for things such as KPIs in SSAS. One aspect of data analysis I found interesting and challenging in equal measure is percentiles. What is a percentile, I hear you ask? (If you know what a percentile is, just skip past this bit.)

Well, imagine you and 99 other people take a test and that the test is scored numerically. Now, people being people, most would like to know how many of their fellow exam takers they've beaten (I know I would). So in our example, if you've done pretty well and scored better than 80 of your fellow 99 exam takers, then you are better than 80% of them and hence you are in the 80th percentile. This is of course a very simplistic example, but you get the idea.

Personally I feel this is more of a common thing in everyday life in America than it is here in the UK, but it does crop up a lot in reporting and KPIs.

Which way!?

Now in my mind there are two quite distinct "directions" you can take in calculating percentiles. The first is the same as in the example above: you take a score, compare it to the rest of its set or "pot" and work out its individual position. The second is where you have a set of percentile "milestones" you wish to calculate (5th, 25th, 50th, 75th and 95th for example), then for each set or "pot" of data you have, you find the value within that pot that is the closest to each of your milestones.

NB: If you try to do something like this in Excel, it's worth noting that the PERCENTILE function in Excel will actually fabricate a figure that is mathematically exact for the percentile you ask for, based on the pot of data it's given. Although technically this is more numerically accurate, in my opinion in 99.9% of cases it is overkill, simply because if you calculate percentile milestones from a pot of data, you should only ever compare values from that pot to those milestones, making the additional effort of calculating the precise value rather than finding the closest actual value pointless. My preferred method is to select an actual value from the set that is closest to each of the required milestones.

Each "direction" has its own merits and downsides, but I think the decision boils down to volume, purpose and required accuracy. So, using my exam example from earlier, imagine 100 million people took that test, and now think of the following set of possible scenarios:

You want to generate and send each person an individualised result sheet. Unless you had to give each person their exact percentile, would you really want to run that calculation 100 million times? Probably not. What's more likely is that you'd take the milestone approach; even if you set 100 milestones (one per percentile), it would be cheaper to calculate these first and then compare each value to the milestones to get each person's rough percentile.

You allow the test takers to log in to a website and see a personalised score sheet, and let's say you know that only around 10% of the test takers will log in and do this. In this case it would probably be best to generate each individual score as it's requested from the website.

Ultimately the choice is up to you but I’d say the milestone approach generally covers more bases, so that will be the focus for this post from here on in.

A worked example

So before we dive into the SQL, I wanted to run through how this works logically. I feel this will make the SQL easier to follow, as you can concentrate on how the code is working rather than what it's doing, if that makes sense.

So let's take our demo data pot below.

This is a fictitious set of sales figures by store and sales person; each person's value is unique within a store and each unique person has one row (imagine this is a monthly total of sales, for example).

So the first thing that needs to happen is that the data needs to be sorted, smallest to largest. As you can see, I have already done this. The next step is to give each row a position or row number, based on its now sorted location in the pot (see below).

Then for each row we need to work out its percentile in relation to the pot. We do this simply by dividing 100 by the number of rows in the pot (29 in our case), then multiplying the resulting figure by that row's row number. So the calc would look something like this: (100 / Recs In Pot) * Row Number

This then gives us each record's percentile value relative to its pot (see below):

Now we've got the percentile figure for every row, what we need to do is evaluate each row's suitability to represent each one of our milestones. In this example I'll be using the 5th, 25th, 50th, 75th and 95th percentiles as my required milestones.

So to determine each row's suitability to represent each milestone, we do a calculation per required milestone, subtracting the required milestone from the row's percentile value and taking the "absolute" value of the result (this is a mathematical term; basically it ensures you always get back a positive figure). This then gives us, for each row's percentile value, a "distance" from each milestone. The calculation for the 5th percentile would look something like this: ABS(Row Percentile - 5)

See the below for the results for these calculations for our demo data pot:

What this leaves us with is a list of candidate values for each of our milestones. From here it is simply a matter of getting the row with the minimum distance value for each of our milestones (hence numerically the closest, in either direction, to the percentile milestone). That row's value (the sale value, not the percentile value) then becomes the value for that given milestone.
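The steps above can be sketched in a few lines of T-SQL. This is a minimal, self-contained illustration using a made-up five-row pot (the table and values are hypothetical, and I'm only showing the distance to the 50th milestone to keep it short):

```sql
-- Hypothetical demo pot: one sale value per row
declare @Pot table (SaleValue money);
insert @Pot values (10), (20), (30), (40), (50);

-- Sort, number, work out each row's percentile, then its distance from the 50th milestone
select SaleValue,
       PositionInPot,
       Percentile,                              -- 20, 40, 60, 80, 100 for these five rows
       abs(Percentile - 50) as DistanceFrom50th -- ABS(Row Percentile - 50)
from
(
    select SaleValue,
           row_number() over (order by SaleValue) as PositionInPot,
           (100.00 / count(*) over ()) * row_number() over (order by SaleValue) as Percentile
    from @Pot
) as T
order by DistanceFrom50th;
```

The row that sorts to the top is the candidate whose value would represent the 50th percentile milestone for this pot.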

This leaves us with a very tidy table as below:

Then, in your report or wherever you need to know which percentile bracket your row falls into, all you need to do is compare the row's value to the above percentile table. So to see if your row is in the 25th to 50th percentile range you'd do something like: If My Row Value >= Percentile25Value and My Row Value <= Percentile50Value then ...

All make sense? Good!

To the code!

So by now hopefully you understand logically how this should work, so let's see how to do it in SQL Server!

OK, first of all we need to establish our data set. In this example I am using the AdventureWorks database, working with the [Sales].[SalesOrderDetail] and [Sales].[SalesOrderHeader] tables. I'm using CTEs here because that's my style; there are a few ways you can do this, but this is my preferred way. As with all my posts, I always say this may not be the out-and-out best way (or maybe it is), but it works for me and works well.

The first part of the code looks like this:

;with SalesDetail as
(
    SELECT OrderYear, SalesPersonID, LineTotal
          ,(100.00 / RecsInPot) * PositionInPot as Percentile
    FROM
    (
        SELECT year(SOH.OrderDate) as OrderYear, SOH.SalesPersonID, SOD.LineTotal
              ,count(*) over (partition by year(OrderDate), SalesPersonID) as RecsInPot
              ,row_number() over (partition by year(OrderDate), SalesPersonID order by LineTotal) as PositionInPot
        FROM Sales.SalesOrderDetail as SOD
        JOIN Sales.SalesOrderHeader as SOH on SOD.SalesOrderID = SOH.SalesOrderID
        where SOH.SalesPersonID is not null
    ) as T
)

This example varies slightly from the logical worked example above in that this dataset has many "pots" of data within it. In this instance a pot is defined by the OrderYear (the year from the SalesOrderHeader OrderDate column) and the SalesPersonID. When we're done we will be able to see how each sale in each year ranked against all other sales that year, with the inclusion of which sales person made the sale; i.e. sales person 123 made a sale of XYZ in 2009: how did that sale compare to all other sales by all other sales people that year?

As in the logical example, we take the data, count the number of records per pot (RecsInPot) and then rank the rows in each pot from lowest to highest based on the value of the sale (PositionInPot). From these figures we then work out each row's percentile figure (Percentile).

Then for the next phase we add a second CTE based on the results of the first CTE as below:

As you can see, I have named this CTE "Candidates", and if you'll remember back to the logical example earlier, this is doing exactly the same thing: assessing the suitability of each LineTotal value to represent each of the required milestones (I'm using 5, 25, 50, 75 and 95 here, but obviously these can be any values you like).
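Based on that description, a sketch of what a Candidates CTE along these lines could look like, continuing the CTE chain from SalesDetail above (the distance column names are my own, not necessarily the original ones):

```sql
,Candidates as
(
    select OrderYear, SalesPersonID, LineTotal, Percentile
          ,abs(Percentile - 5)  as DistanceFrom5th
          ,abs(Percentile - 25) as DistanceFrom25th
          ,abs(Percentile - 50) as DistanceFrom50th
          ,abs(Percentile - 75) as DistanceFrom75th
          ,abs(Percentile - 95) as DistanceFrom95th
    from SalesDetail
)
```

One distance column per required milestone, exactly as in the logical walkthrough.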

As it says on the tin, the "Targets" CTE then works out, for each milestone, the smallest percentile distance in the available set. After this it is just a matter of matching up the targets with the candidates, which then gives us the value for each milestone. To do this we add the following final bit to the code:

What we're doing here is pretty straightforward: we select from the Candidates CTE, joining onto the Targets CTE where they match on order year and sales person and where any of the candidate distances match one of the targets, thus making that row's sales value the value that represents the matched milestone. We then pivot each of the percentile values up to make one row per "pot", which can easily be joined onto the data later on in reports etc. to find a given percentile range for each value.
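Pieced together from the description above, the Targets CTE and final select might look something like this. I'm assuming the Candidates CTE exposes one distance column per milestone (DistanceFrom5th and so on; those names are my own), and I'm sketching the pivot with MIN/CASE style conditional aggregates rather than the PIVOT operator, so the exact shape may differ from the original:

```sql
,Targets as
(
    -- Smallest distance to each milestone, per pot
    select OrderYear, SalesPersonID
          ,min(DistanceFrom5th)  as Target5th
          ,min(DistanceFrom25th) as Target25th
          ,min(DistanceFrom50th) as Target50th
          ,min(DistanceFrom75th) as Target75th
          ,min(DistanceFrom95th) as Target95th
    from Candidates
    group by OrderYear, SalesPersonID
)
-- Match candidates to targets and pivot up to one row per pot
select C.OrderYear, C.SalesPersonID
      ,min(case when C.DistanceFrom5th  = T.Target5th  then C.LineTotal end) as Percentile5Value
      ,min(case when C.DistanceFrom25th = T.Target25th then C.LineTotal end) as Percentile25Value
      ,min(case when C.DistanceFrom50th = T.Target50th then C.LineTotal end) as Percentile50Value
      ,min(case when C.DistanceFrom75th = T.Target75th then C.LineTotal end) as Percentile75Value
      ,min(case when C.DistanceFrom95th = T.Target95th then C.LineTotal end) as Percentile95Value
from Candidates as C
join Targets as T on C.OrderYear = T.OrderYear
                 and C.SalesPersonID = T.SalesPersonID
group by C.OrderYear, C.SalesPersonID
```

The MIN around each CASE simply collapses the matched rows down to one value per milestone per pot (and breaks any ties towards the lower sale value).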

For reference, the resulting output looks like the below. I'm only showing the top 10 here, but run the code for yourself against AdventureWorks to get the full set.

As you can see, this is a small, lightweight reference table that can easily be used to find where a value would fall.

Now that's about it, apart from a quick note to say that if you want to test this yourself, I've added a couple of indexes as below, just to support the query. They aren't essential, but they make enough of a difference to be worth putting on.
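Supporting indexes along these lines would cover the join and the partitioning columns; the index names and exact key/include choices here are my own, not necessarily the originals:

```sql
-- Illustrative supporting indexes; names and key columns are assumptions
create nonclustered index IX_SalesOrderHeader_SalesPersonID_OrderDate
    on Sales.SalesOrderHeader (SalesPersonID, OrderDate)
    include (SalesOrderID);

create nonclustered index IX_SalesOrderDetail_SalesOrderID
    on Sales.SalesOrderDetail (SalesOrderID)
    include (LineTotal);
```

The idea is simply to let the join and the window functions read narrow, ordered index pages rather than scanning the base tables.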

Yes, it's true; despite what I have been told for years, I can officially confirm it: size does matter! I am of course talking about the sizing of columns using the variable length data types in SQL Server, what else could I mean!?

Back in the early days ..

So this came to me after watching Klaus Aschenbrenner speaking at SQLBits 8; his session was on the internals of the storage engine in SQL Server and how it all worked. As he explained, this area of SQL Server can be a bit of a black box, and I guess for a lot of people this is fine. After all, I certainly don't know, or need to know, the inner workings of a lot of the technology and gadgets I use on a day to day basis (I don't know or care how my TV works, for example, I just like watching the Simpsons on it).

Anyway, I digress. After Klaus's session I remembered back to when I first started using SQL Server (around 11 years ago, which actually makes me feel really old!) and I asked a few of the guys I worked with at the time, "If these varchar columns only use space for the data they store, then why not just declare them all as varchar(8000)?". They all looked around at each other and after a minute or two came up with something about using the database design to restrict what the user can store in the table by limiting the column widths.

While this made some sense, I was still left thinking, "Well, why do they let you set a size at all? Just call it varchar without setting a size, make it accept 8000 characters (4000 for nvarchar) and let the DB handle it". This, to me, with the knowledge I had at the time, made much more sense, because the SQL Server storage engine was a black box to me, and obviously to the guys I was working with at the time as well.

Back to the future

Obviously time passes and we all learn and improve our skills and knowledge. So, as I mentioned earlier, listening to Klaus reminded me of this unanswered question from the past, and I kind of wished I could go back in time and answer my younger self by saying...

“Size matters! and here’s why…”

So at a basic level, SQL Server stores all its data in 8KB chunks it calls "pages". There are several types of pages, but the ones we are looking at are the three different types of data pages: IN_ROW, ROW_OVERFLOW and LOB_DATA.

Pages are then logically lumped into 64KB groups of 8 (8 x 8KB), each of which is referred to as an "extent".

There are two kinds of extents, "uniform" and "mixed". I'm not going to go into extents too much (there are a bunch of great resources out there if you want to look into this; I may even do another blog post on the subject), but basically a uniform extent is one where all 8 of its pages hold data for the same object (i.e. table or index etc.) and a mixed extent, as you may have guessed, is one whose pages hold data for multiple objects (example below).

So building on this, let's start with a table (heap or b-tree, the principles are the same). When you first create this table, SQL Server automatically creates a partition for you (tables and indexes can be spread across multiple partitions, but I'm not going to go into detail about partitions in this post). Each partition then has something called an "allocation unit" for each type of "page" (see page types above) it is using (I emphasise using as this is the important bit), which tells SQL Server which extents are being used by / belong to this object. Example as below:

Page types, what they can store and variable vs fixed width

Now, if you remember back to a moment ago when I mentioned the three different types of data pages, let's look at what each is used for.

Fixed width data types, unlike variable width data types, can only ever be stored on IN_ROW type pages, which enforces a row size limit for fixed width data type columns of around 8053 bytes (the in-row limit being 8060 bytes with roughly a 7 byte row overhead; I say roughly as this can increase). Rows can be larger than this using variable width data type columns, as this limit only applies to the fixed width types.

LOB_DATA pages are for storing data from large data types such as varchar(max), text, image and xml.

ROW_OVERFLOW pages are used to store data from variable width data types, but only when the potential maximum width of a row exceeds the 8053 (ish) byte limit. So to clarify, "potential maximum width" means the summed width of all columns using both fixed width and variable width data types, taking the declared length of variable width columns as their width (so a varchar(50) would count as a column 50 bytes wide).

The Finale

So you might have spotted the crux of this post by now. Basically put, if your potential maximum row size is going to exceed the 8053 (ish) byte limit, then SQL Server moves the variable width data type data into a ROW_OVERFLOW type page, and you now need to maintain two types of pages per row. This means two allocation units per partition, and you now have to pull two pages into memory to read each row; as all I/O in SQL Server is done at the page level, you're basically doubling your I/O. Now think how that would scale over millions or hundreds of millions of rows...
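You can check this for yourself: the sys.partitions and sys.allocation_units catalog views show which allocation unit types a table is using and how many pages each holds. A sketch along these lines (the table name is hypothetical; swap in one of your own):

```sql
-- Which allocation unit types does this table use, and how many pages in each?
select object_name(p.object_id) as TableName
      ,au.type_desc          -- IN_ROW_DATA, ROW_OVERFLOW_DATA or LOB_DATA
      ,au.total_pages
from sys.partitions as p
join sys.allocation_units as au
    on au.container_id = p.partition_id
where p.object_id = object_id('dbo.MyWideTable');  -- hypothetical table name
```

If ROW_OVERFLOW_DATA shows up with a non-trivial page count, you're paying that extra I/O on every row read.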

So if you don't really need to store those 8000 characters in your varchar column, then don't declare it at that size, as you may be giving yourself a massive I/O hit.

So there you are, size does matter! Hopefully this has helped, and now when you're designing your tables you won't be so fast and loose with the sizes of your variable width columns.

I recently came across an interesting (I use "interesting" subjectively) situation: we had a stored procedure that, when run on a dev server, produced the desired results, but when run on the live server returned no results. It didn't error, it just didn't return any results. The purpose of the stored procedure was effectively to take a passed-in date and return all rows from a specific table that fell into the same calendar working week. After some investigation, it turned out that the dev server had the user's login set up with a default_language of "English" (which is actually American English...) and the live server had a default_language of "British English".

As a little side note, the default language for a login is set when the login is created; if you don't specify a default language, the default language from the SQL Server install is used. To see what your current language is, you use a combination of two little bits of code as below:

select @@Langid
exec sys.sp_helplanguage

The first part simply returns an ID number, which correlates to the "langid" column returned by the stored proc. This language then, amongst other things, affects two very important connection-based options for working with dates in SQL Server: @@DATEFIRST and dateformat. So let's take a look at what these both are and why they are important when working with dates in SQL Server.

@@Datefirst

Let's start with @@DATEFIRST. This tells SQL Server which day of the week (Monday, Tuesday, Wednesday etc.) is the beginning of the week. It is simply a number between 1 and 7 that represents the days of the week in the following order:

Monday

Tuesday

Wednesday

Thursday

Friday

Saturday

Sunday

This is important as in a lot of cases the default value for this, as set by your default language, isn’t Monday as some may expect. Below is a table illustrating the various @@DATEFIRST values for each language.

As you can see, by default it's always either 1 or 7 (Monday or Sunday). The default for "US English" is 7, whereas the default for "British English" is 1. You can see what your current @@DATEFIRST is by running this simple bit of code:

select @@DATEFIRST

and you can change it for the current session by using the SET command as below:

SET DATEFIRST {Number | @NumberVar}

Or, of course, you can modify your login to have a different default language.

Dateformat

And then there's dateformat. This tells SQL Server how to interpret and convert dates passed into SQL Server as strings; for example, a value of dmy, as you would expect, means day month year. It doesn't, however, affect the display of dates stored in SQL Server or how they are stored.

So for example, if you pass a date as a string into a stored procedure, let's say '01-02-2010': as I'm from England, in the real world I would expect this to be 1st February 2010; sounds reasonable, doesn't it?! Now, what if your dateformat is "mdy", which is the default for "US English"? Well, then SQL Server would interpret this value as 2nd January 2010, but as it's still a perfectly valid date it would accept the value! You might never know it's happened until a red faced someone from sales comes and asks you why the big sale they just made didn't show in the monthly sales report.
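You can see this for yourself in a couple of lines; the same string comes back as two different dates depending on the session's dateformat:

```sql
set dateformat mdy;
select cast('01-02-2010' as datetime);  -- interpreted as 2nd January 2010

set dateformat dmy;
select cast('01-02-2010' as datetime);  -- interpreted as 1st February 2010
```

Both casts succeed without so much as a warning, which is exactly why this class of bug is so easy to miss.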

You can change the date format using set dateformat dmy (note the lack of quotes around the dmy part), but this can be a bit of a minefield for a number of reasons; for example, the date format ydm is not supported by the date, datetime2 and datetimeoffset data types. Just be aware of what impact this will have when / if you do it. You can also change both settings by using the set language command, but I'm not going to go into that now; have a look in BOL for yourself if you're interested.

Back to our story

So that's an overview of datefirst and dateformat and why it's quite important you know what they are; but back to my "interesting" situation. If you'll remember, we had a stored proc that took in a date as a string and returned all rows within the same calendar week. To illustrate, I've put together a little stored proc using AdventureWorks as below:

CREATE procedure [Sales].[GetOrdersInWeek] @OrderDate smalldatetime as
declare @BeginOfWeek smalldatetime = dateadd(dd, -(datepart(dw, @OrderDate) - 2), @OrderDate)
declare @EndOfWeek smalldatetime = dateadd(dd, 4, @BeginOfWeek)
SELECT *
FROM Sales.SalesOrderHeader as SOH
where OrderDate >= @BeginOfWeek
and OrderDate <= @EndOfWeek
GO

The idea behind the code is to return all rows where the order date is in the same week as the date passed in via the @OrderDate parameter. The bulk of the "smarts" in this proc is the line that sets the @BeginOfWeek variable, so let's dissect that line. (At this point I'd like to caveat that this may not be the "best" or "smartest" way to get a week beginning, but it works and demonstrates the point I'm trying to communicate well.)

So starting from the inside out (as this is how, logically, it should be calculated), we have the following: datepart(dw, @OrderDate). Passing the "dw" value as the first parameter of the datepart function returns the weekday (based on our current @@DATEFIRST value) of our passed @OrderDate parameter.

So for example, let's say we have a language of "US English", which gives us a @@DATEFIRST of 7. Now let's also say we select the value from datepart, passing in the dw parameter and the date '01-25-2011' (note the month, day, year date format; this is due to having US English as our language), which is a Tuesday. This would return a value of 3. "Why?" I hear you cry! Well, because a datefirst value of 7 tells SQL Server the beginning of each calendar week is Sunday, which makes Sunday weekday 1, hence making Tuesday weekday 3. Simples!

The next level of our nesting, (datepart(dw,@OrderDate) - 2), as you can see, then subtracts two from the value of our dw datepart; you'll see why, and why this is very important, in a second.

The next and final level, dateadd(dd, -(datepart(dw,@OrderDate) - 2), @OrderDate), takes the result from above, turns it into a negative number, and "adds" (which, as any maths professor will tell you, actually subtracts) that resulting number of days, using the "dd" parameter value, to our @OrderDate parameter, giving us the calendar "start" of the week.

Worked Example

So let's work this through, assuming a language of "US English" with all its associated default attributes.

We pass in a date of '01-25-2011' (25th January 2011, which is a Tuesday); the inner datepart(dw) gives us 3; the next section subtracts 2 from this, so now we have 1; then we turn this into a negative number and "add" (subtract) that number of days to the passed-in date, which gives us '01-24-2011', which is the Monday and calendar beginning of that week!

The logic works exactly the same if you pass any date in for the given week, it always resolves back to the Monday for that calendar week.

To get the @EndOfWeek we then just add 4 days to the @BeginOfWeek to get the Friday for that calendar week.
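The worked example above can be sanity-checked in a few lines, forcing datefirst to 7 so the arithmetic matches (I'm using the unambiguous 'yyyymmdd' string format here so the date literal can't itself fall foul of dateformat):

```sql
set datefirst 7;  -- Sunday = day 1, as per US English

declare @OrderDate smalldatetime = '20110125';  -- Tuesday 25th January 2011
declare @BeginOfWeek smalldatetime = dateadd(dd, -(datepart(dw, @OrderDate) - 2), @OrderDate);

select @BeginOfWeek as BeginOfWeek               -- Monday 24th January 2011
      ,dateadd(dd, 4, @BeginOfWeek) as EndOfWeek; -- Friday 28th January 2011
```

Try swapping in any other date from that week; it always resolves back to the same Monday.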

Why -2, you say...?

So back to the "-2" bit of the @BeginOfWeek construct. If you follow the logic through, you'll see very quickly that the -2 figure is totally dependent on the datefirst being 7; should datefirst change, it would break the logic of our code and we would start getting back all kinds of seemingly odd results. This was what was causing the problem in the stored procedure I mentioned at the beginning of this article.

So how do we combat this? We could change this figure to match the @@DATEFIRST value, but whose @@DATEFIRST? What if there are many users, all with different languages and hence different datefirst values?

The solution (yay!)

My solution was to do the following to the procedure:

CREATE procedure [Sales].[GetOrdersInWeek]
@OrderDate smalldatetime
as
declare @BeginOfWeek smalldatetime
declare @EndOfWeek smalldatetime
declare @DF int
--get the current @@DATEFIRST for preservation
select @DF = @@DATEFIRST
--set the datefirst to 7, this way we always know what it is
set datefirst 7
--now do the date work, knowing our datefirst is 7
set @BeginOfWeek = dateadd(dd, -(datepart(dw,@OrderDate) - 2) , @OrderDate)
set @EndOfWeek = dateadd(dd, 4, @BeginOfWeek)
--set @@DATEFIRST back to whatever it was when we started
set datefirst @DF
SELECT *
FROM Sales.SalesOrderHeader as SOH
where OrderDate >= @BeginOfWeek
and OrderDate <= @EndOfWeek

So what I've done is add a new int variable called @DF; at the top of the proc I select the current @@DATEFIRST into this variable. After that I set the datefirst to 7; that way I know what my datefirst is and don't have to worry about it being different. Then at the end of the code I set the datefirst again, to the value of the @DF variable, effectively putting it back to what it was when we started!

And that's it really! Doing that solved our issue. I hope you enjoyed this post and got something out of it.