Logistic Regression in SQL Server

Mar4

Written by:Charles Flock3/4/2015 10:08 AM

We have added 5 new functions in the latest release of XLeratorDB / statistics 2008. Four of these functions, LOGIT, LOGITSUM, LOGITPRED, and LOGITPROB are SQL Server implementations of the logistic regression, often referred to as the logit regression. The remaining function, VIF, estimates the variance inflation factor, tolerance, and R2 for a set of independent variables allowing you to test for collinearity.

Given a set of independent variables X and dichotomous outcomes {0, 1} Y, you can use the logistic regression to calculate the probability that Y = 1.

Their CHD data set provides 100 data points consisting of a patient's age and the absence {0} or presence {1} of coronary heart disease (this full example is explored in the LOGIT documentation on our website). A logistic regression of this data attempts to answer a question about the probability of coronary heart disease being present in a patient based on their age, since age is the only independent variable in our data.

The following table provides a summary of the CHD data.

Coronary Heart Disease

Age Group

n

Absent

Present

20 – 29

10

9

1

30 – 34

15

13

2

35 – 39

12

9

3

40 – 44

15

10

5

45 – 49

13

7

6

50 – 54

8

3

5

55 – 59

17

4

13

60 – 69

10

2

8

Total

100

57

43

Given that there is only a single independent variable, the table above would lead us to the conclusion that the probability that coronary heart disease is present in someone between the ages of 60 and 60 is approximately 80% while the probability that coronary heart disease is present in someone between the ages of 20 and 29 is approximately 10%.

Let's take the CHD dataset and put it into a temp table, #chd, and then use the LOGIT function to calculate the coefficients.

This returns the following values, but we are really only concerned with the first two.

stat_name idx stat_val

---------- ----------- ----------------------

b 0 -5.30945337391905

b 1 0.1109211422069

se 0 1.13365463681529

se 1 0.0240598358749988

z 0 -4.68348401840844

z 1 4.61022023521619

pval 0 2.82039453738119E-06

pval 1 4.02242615828777E-06

Wald 0 21.9350225506872

Wald 1 21.2541306171968

LL0 NULL -68.3314913574166

LLM NULL -53.6765463471564

chisq NULL 29.3098900205205

df NULL 1

p_chisq NULL 6.16800830270808E-08

AIC NULL 111.353092694313

BIC NULL 116.563433066289

Nobs NULL 100

rsql NULL 0.214468391061534

rsqcs NULL 0.254051637397186

rsqn NULL 0.340992792718107

D NULL -0.482789163126702

Iterations NULL 5

Converged NULL 1

This tells us that based on age we would estimate the probability that coronary heart disease is present in a 24 year-old as

SELECT

1/(1 +EXP(-(-5.30945 + 0.11092 * 24)))

or approximately 6.6%.

We have also implemented a LOGITSUM function so that you can analyze grouped data. Instead of passing in a column containing the dichotomous results for each observation you can pass in counts of success and failures for each grouping of the data. For example, let's say our coronary heart disease data were presented to us in the following way:

mean age chd N

---------------------- ----------- -----------

25.4 1 10

32 2 15

36.9167 3 12

42.3333 5 15

47.2308 6 13

51.875 5 8

56.8824 13 17

63 8 10

We could use the LOGITSUM function.

SELECT

*

INTO

#chdsum

FROM (VALUES

(25.4,1,10)

,(32,2,15)

,(36.9167,3,12)

,(42.3333,5,15)

,(47.2308,6,13)

,(51.875,5,8)

,(56.8824,13,17)

,(63,8,10)

)p([mean age],chd,N)

SELECT

*

FROM

wct.LOGITSUM('

SELECT

[mean age],

chd,

N-chd

FROM

#chdsum'

,2

,3

)

This produces the following result.

stat_name idx stat_val

---------- ----------- ----------------------

b 0 -5.20385902245454

b 1 0.108651473280877

se 0 1.11798776031707

se 1 0.0237329519129878

z 0 -4.65466546876925

z 1 4.57808508942444

pval 0 3.24507143474021E-06

pval 1 4.69251822153992E-06

Wald 0 21.6659106261529

Wald 1 20.9588630860104

LL0 NULL -68.3314913574166

LLM NULL -54.1790842644491

chisq NULL 28.3048141859351

df NULL 1

p_chisq NULL 1.03637918925054E-07

AIC NULL 112.358168528898

BIC NULL 117.568508900874

Nobs NULL 100

rsql NULL 0.207113979394092

rsqcs NULL 0.246516487110749

rsqn NULL 0.330878975046836

D NULL -0.464151599148838

Iterations NULL 5

Converged NULL 1

Now, when we calculate the probability of the presence of coronary heart disease in a 24 year-old, the formula would be:

SELECT

1/(1 +EXP(-(-5.20386 + 0.10865 * 24)))

or approximately 6.9%.

You can use the LOGITPROB or the LOGITPRED functions to calculate the probability that Y = 1. For example,

SELECT

wct.LOGITPRED('-5.20386, 0.10865','24')

produces the same result as the previous SELECT statement. When there are many independent variables and/or many predictions to be made, the LOGITPRED and LOGITPROB functions are simpler to use than explicitly embedding the equation in a SELECT statement.

The VIF function is used to detect collinearity and should be used in conjunction with not only the LOGIT and LOGITSUM functions but also with LINEST, LINEST_q, LOGEST, and LOGEST_q. VIF works by performing ordinary least squares regression on all the combinations of independent variables and calculating the R2 for each combination. From the R2 value, it then calculates the tolerance and the variance inflation factors. Based upon the results of this table-valued function you can determine whether or not to eliminate any of the independent variables from the regression analysis.

You can read more about these function and find more examples in our on-line documentation. You can also try out these functions today by downloading the free 15-day trial and installing it on your SQL Server database. We think that these functions are a great tool to have inside SQL Server as it permits you to do the analysis right where the data is, without having to drag the data across the network, and let's you use simple SQL commands instead of having to learn how to use another tool. We hope that you like it and, as always, if there is something that you would like to see in XLeratorDB, just send us an e-mail at support@westclintech.com.