New windowing functions in XLeratorDB

Jun29

Written by:Charles Flock6/29/2013 4:46 PM

With the release of SQL Server 2012, SQL Server users were finally able to use aggregate functions in a window, enabling calculations like running sums and moving averages. XLeratorDB/windowing put these capabilities into SQL Server 2005 and SQL Server 2008 as well as providing dozens of calculations that are not available in SQL Server 2012. With our latest release of XLeratorDB/windowing we add 14 new functions to the 38 existing functions.

We are happy to announce the availability of windowing function for SQL Server 2005 and above that allow you to calculate running and/or moving totals for: skewness; kurtosis; the Sharpe ratio; the information ratio; and Students t test.

Kurtosis and skewness, like standard deviation and variance, are measures of central tendency. Specifically, variance is calculated using the second central moment (µ2), skewness is calculated using the third central moment (µ3), and kurtosis is calculated using the fourth central moment (µ4). Standard deviation is simply the square root of the variance.

The equations for calculating the central moments are relatively straightforward.

From a computer processing standpoint, however, they pose a special challenge in that they require the calculation of the difference between each element in the data set and the average of the data set implying two passes through the data: one to calculate the average and another to calculate the differences. In terms of a windowing-type function, this implies holding all the elements in memory and recalculating every time a new row is added to the resultant table. This would be painfully slow and would, in fact, represent at least the same overhead as doing a self-join.

Thankfully, there are single-pass algorithms available that allow for rapid, numerically stable calculations of each of the central moments.

Let’s look at what this means. First, let’s put some data into a table.

Prior to SQL Server 2012 to calculate the running population variance you would have to use what is known as a self-join, which would look something like this.

SELECT m1.x

,VARP(m2.x)as VAR_P

FROM #m m1

JOIN #m m2

ON m2.rn <= m1.rn

GROUPBY m1.x, m1.rn

ORDERBY m1.rn

This produces the following result.

x VAR_P

----------- ----------------------

102 0

142 400

110 298.666666666667

110 236

96 252.8

101 227.472222222221

99 210.244897959184

96 201.25

96 190.83950617284

126 208.560000000001

98 197.537190082644

105 181.354166666667

108 167.514792899409

126 179.882653061224

72 249.493333333333

114 237.83984375

111 225.065743944637

100 214.83950617284

118 210.448753462603

84 224.710000000001

SQL Server 2012 offers a significant improvement over this, by eliminating the need for a self-join. Thus, the same result can be achieved in SQL Server 2012 using the following syntax.

SELECTx

,VARP(x)OVER (ORDERBYrn)asVAR_P

FROM#m

ORDERBYrn

Which is obviously much simpler and much less computationally intensive than the self-join. This produces the following result.

x VAR_P

----------- ----------------------

102 0

142 400

110 298.666666666667

110 236

96 252.8

101 227.472222222221

99 210.244897959184

96 201.25

96 190.83950617284

126 208.560000000001

98 197.537190082644

105 181.354166666667

108 167.514792899409

126 179.882653061224

72 249.493333333333

114 237.83984375

111 225.065743944637

100 214.83950617284

118 210.448753462603

84 224.710000000001

Of course, if you are not a SQL Server 2012 user, this syntax is not available to you unless you use the XLeratorDB/windowing library. You can now calculate the running population variance with a TSQL statement that looks like this.

SELECT x

,wct.RunningVARP(x,ROW_NUMBER()OVER (ORDERBY rn),NULL)as VAR_P

FROM #m

ORDERBY rn

This produces the following result.

x VAR_P

----------- ----------------------

102 0

142 400

110 298.666666666667

110 236

96 252.8

101 227.472222222222

99 210.244897959184

96 201.25

96 190.839506172839

126 208.56

98 197.537190082645

105 181.354166666667

108 167.514792899408

126 179.882653061225

72 249.493333333333

114 237.83984375

111 225.065743944637

100 214.839506172839

118 210.448753462604

84 224.71

Obviously, XLeratorDB takes advantage of the windowing capabilities of the ROW_NUMBER() function in SQL Server 2005 and above to provide this capability.

In this graph, we see that as the dataset grows, the processing time increases exponentially for the self-join whereas the windowing functions are more or less unaffacted.

It is interesting to note that the self-join is actually somewhat faster for a small number of rows (<10,000) than using the windowing functions.

However, there is an additional advantage to using the XLeratorDB functions: numeric stability. The SQL Server central tendency functions are numerically unstable. Let’s look at what that means in the following example.

SELECTrn

,CAST(xasfloat)asx

,VARP(x)OVER (ORDERBYrn)asVAR_P

FROM (VALUES

(1,900.000001580507),

(2,900.000003161014),

(3,900.000000948304),

(4,900.000001580507),

(5,900.000003161014),

(6,900.00000118538),

(7,900.000001896608),

(8,900.00000237076)

)n(rn,x)

This produces the following result.

rn x VAR_P

----------- ---------------------- ----------------------

1 900.000001580507 0

2 900.000003161014 0

3 900.000000948304 0

4 900.000001580507 0

5 900.000003161014 0

6 900.00000118538 0

7 900.000001896608 0

8 900.00000237076 0

While the population variance is small it is not zero. Here’s what the results look like using the XLeratorBD function.

SELECTrn

,CAST(xasfloat)asx

,wct.RunningVARP(x,ROW_NUMBER()OVER (ORDERBYrn),NULL)asVAR_P

FROM (VALUES

(1,900.000001580507),

(2,900.000003161014),

(3,900.000000948304),

(4,900.000001580507),

(5,900.000003161014),

(6,900.00000118538),

(7,900.000001896608),

(8,900.00000237076)

)n(rn,x)

This produces the following result.

rn x VAR_P

----------- ---------------------- ----------------------

1 900.000001580507 0

2 900.000003161014 6.24500540873906E-13

3 900.000000948304 8.6597425224703E-13

4 900.000001580507 6.68215689240038E-13

5 900.000003161014 8.23341639369267E-13

6 900.00000118538 7.98840428231102E-13

7 900.000001896608 6.84911544100513E-13

8 900.00000237076 6.20499911840382E-13

The differences become more pronounced in the calculation of standard deviation (which is the square root of the variance).

As you can see, SQL Server 2012 has produced a population standard deviation of zero. While the standard deviation is small, it is clearly not zero. We have previously written about this problem in the SQL Server aggregate functions (see one-pass or two), but it’s surprising that the problem persists in SQL Server 2012.

We have taken the same care to assure numerical stability in the calculation of kurtosis and skewness, both in our implementation of the aggregates and in the windowing version. SQL Server does not provide built-in functions for kurtosis and skewness.

The Sharpe ratio measures the ratio between the mean difference and the sample standard deviation of the differences for an investment return and a risk-free rate. It is very similar to Cohen’s D which may be used to measure the effect size in Student’s t test. It is also similar to the Information ratio, with the exception that the Information ratio compares investment returns to the return on a benchmark rather than to the risk-free rate. Since Sharpe, Information, and t test are so closely related we have created windowing versions of all three.

The Sharpe and Information ratio implementations allow you to use prices or returns as input into the function, though consistency is required in the window. In other words, you cannot switch to prices from returns half-way through the window. We have also implemented an exact flag, which when true will have the function return a NULL when the actual window size is smaller than the specified window size. In other words, if you want to do a 36-month rolling window of the Sharpe ratio, you have the ability to specify whether or not you want a non-NULL value returned when there are less than 36 rows in the window (which will usually be at the beginning of the data set).

The Sharpe ratio implementation is based on Sharpe’s 1994 formula and is arithmetically the same as the Information ratio. The only difference between the functions is that the Information ratio will calculate benchmark return from prices if the prices flag is set to true, whereas the Sharpe ratio always treats the risk-free rate as a rate and never as a price.

The TTEST windowing functions support the paired, equal variance and unequal variance tests. The paired t test requires that the both columns passed into the function be non-NULL to be included in the calculation. The equal and unequal variance tests simply require that one of the columns not be NULL.

For more information about these functions and all of our windowing functions, see the full windowing documentation.

Whether you are using SQL Server 2005, 2008, or 2012, we think that you will find these functions useful. If there are other functions you would like to see, please contact us at support@westclintech.com.