Statistics Project 2

* This was an assignment given in my first year as part of a general statistics course. It provided a dataset and required the use of Minitab (statistics software) to analyse the data and answer questions about it. Some understanding of statistics may be needed to interpret many of the figures, though the text should explain them adequately. Many of the original figures and tables from Minitab are not present here. *

———————————————————————————————————————————————————————-

Question 1

Aim

We wish to identify a 95% confidence interval for the Beck depression score at baseline (which is a measure of baseline depression) in our study population. Our study population consists of 120 patients in Glasgow, who have received and are recovering from major surgery (the type of surgery itself is unspecified but for the purposes of this report we will assume that the induced effects are comparable).

Data description

We are looking specifically at the “Beck.Pre” variable, described as being the “Beck depression score at baseline” and a higher Beck score indicates a more depressed individual. There are 120 subjects in the study population.

Initial Impression

We will now display descriptive statistics for our variable of interest (Figure 1) as well as display the data graphically using a boxplot (Figure.2).

Figure.1

Variable    N  N*   Mean  SE Mean  StDev  Minimum     Q1  Median      Q3  Maximum
BECK.PRE  117   3  8.564    0.580  6.270    0.000  4.000   7.000  11.000   35.000

From the descriptive statistics we can see that 117 subjects have a recorded Beck.Pre score (3 values are missing), and that the mean score is 8.564 with a standard deviation of 6.270. The interquartile range in the boxplot appears fairly symmetric; however, the distribution has a slight positive skew (the mean of 8.564 exceeds the median of 7.000) and a number of outliers. Informally, both the descriptive statistics and the boxplot suggest an approximately normal distribution with a slightly longer upper tail.

Formal Analysis

We must now address the underlying assumptions associated with this data. The first assumption is that the sample is a random sample of the wider population of individuals undergoing major surgery in Glasgow, and as such is representative of this population. We will assume that this sample was collected randomly from a number of hospitals across Glasgow and is suitably representative of the wider population.

The second assumption is that the sample mean follows an approximately normal distribution. By the central limit theorem, the sampling distribution of the mean is approximately normal for samples of more than about 30 subjects, even when the data themselves are somewhat skewed, and so with 117 subjects we will accept this assumption. Having accepted the assumption of normality, we can use a parametric method (a one-sample t interval) to obtain a 95% confidence interval for the Beck.Pre variable (Figure.3).

Figure.3

Variable N Mean StDev SE Mean 95% CI

BECK.PRE 117 8.564 6.270 0.580 (7.416, 9.712)

Conclusion

As we can see from the one-sample t test, the 95% confidence interval for the mean Beck depression score ranges from 7.416 to 9.712. This means that we can be 95% confident that the mean baseline Beck depression score in the wider population of patients lies between 7.4 and 9.7. The interval describes the population mean, not the score of any individual patient.
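The interval in Figure.3 can be reproduced from the summary statistics alone. The sketch below is only an illustration in Python (the report itself used Minitab); the critical value `T_CRIT_116` is an assumption read from a standard t table for 116 degrees of freedom rather than something computed here.

```python
import math

# Summary statistics reported in Figure.1 / Figure.3
n = 117
mean = 8.564
stdev = 6.270

# Two-sided 95% critical value of the t distribution with n - 1 = 116
# degrees of freedom. Assumption: taken from a t table, not computed.
T_CRIT_116 = 1.9806

se_mean = stdev / math.sqrt(n)      # standard error of the mean
margin = T_CRIT_116 * se_mean       # half-width of the interval
lower, upper = mean - margin, mean + margin

print(f"SE mean = {se_mean:.3f}")
print(f"95% CI  = ({lower:.3f}, {upper:.3f})")
```

The standard error and the interval limits should agree with the Minitab output to three decimal places.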

Question 2

Aim

We have a sample of 97 subjects undergoing surgery (the total number of subjects for whom valid measurements of both the dep.pre and dep.post variables were available), and each was classified into one of two categories before and after surgery: Beck<=8 or Beck>8. These categories relate to the Beck depression score of an individual patient, a score of 8 or less being considered normal and a score of more than 8 being considered an indicator of depression. These data can be used first to establish whether there is a significant change in the proportion of patients who are depressed before and after surgery and, if so, to estimate confidence intervals for this change which can be applied to the wider population of patients undergoing surgery.

Informal Analysis

In order to compare the proportions of patients in each depression category before and after surgery, it is necessary to display the data as percentages using a cross tabulation.

Tabulated statistics: depression diagnosis (table 1)

Using frequencies in Count

Rows: BeckPost Columns: BeckPre

          Beck<=8  Beck>8     All
Beck<=8     55.67    8.25   63.92
Beck>8      12.37   23.71   36.08
All         68.04   31.96  100.00

Cell Contents: % of Total

From table 1, the percentage of patients with a Beck score less than or equal to 8 decreased after surgery (68.04% before surgery and 63.92% after), while the percentage of patients with a Beck score greater than 8 increased (31.96% before surgery and 36.08% after). There does appear to be a difference between the proportions of patients placed in the “normal” and “depressed” categories before and after surgery, and this can be more clearly illustrated using a clustered bar chart (figure.4). However, the change appears to be fairly small, and a formal analysis must be performed to determine whether it is significant.

Formal Analysis

Prior to carrying out a test of marginal homogeneity, we must first establish our hypotheses:

Null Hypothesis: For each of the categories of depression, the proportion of individuals in that category does not change after surgery in the population of patients.

Alternative Hypothesis: For at least one of the categories of depression, the proportion of individuals in that category changes after surgery in the population of patients.

We may now proceed to carry out a test of marginal homogeneity, which in this case will be a Wald test (figure 5).

Wald Test of Marginal Homogeneity – Figure 5

Observed Value of Test Statistic is 0.81

Degrees of Freedom are 1

P value is 0.3691

From the Wald test of marginal homogeneity, the observed value of the Wald test statistic is 0.81, with one degree of freedom. This corresponds to a p-value of 0.3691, which is greater than the 0.05 significance level, and as such there is insufficient evidence of a change in the population proportions in either of the depression categories after surgery. Because of this, it is unnecessary to investigate the size of any changes further using Bonferroni intervals.
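For a 2x2 table of paired classifications, the Wald statistic depends only on the two "changed category" cells. The Python sketch below is an illustration rather than the Minitab routine itself. The cell counts are an assumption: they are reconstructed from the percentages in table 1 with n = 97 (for example, 55.67% of 97 is 54), since the raw counts are not given in the text.

```python
import math
from statistics import NormalDist

# Cell counts reconstructed from the percentages in table 1 (assumption:
# the raw counts are not in the text; 55.67% of 97 = 54, and so on).
n = 97
normal_both = 54       # Beck<=8 before and after surgery
improved = 8           # Beck>8 before, Beck<=8 after
worsened = 12          # Beck<=8 before, Beck>8 after
depressed_both = 23    # Beck>8 before and after surgery
assert normal_both + improved + worsened + depressed_both == n

# Change in the proportion classed as depressed (after minus before)
diff = (worsened - improved) / n

# Wald variance estimate for a difference between paired proportions
var_diff = ((worsened + improved) / n - diff ** 2) / n

wald = diff ** 2 / var_diff

# A chi-squared variable with 1 df is a squared standard normal, so its
# upper-tail probability equals a two-sided normal tail probability.
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(wald)))

print(f"Wald statistic = {wald:.2f}, p-value = {p_value:.4f}")
```

With these reconstructed counts the statistic and p-value should agree with Figure 5, which supports the reconstruction.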

Conclusion

We conclude from the test of marginal homogeneity that there is no evidence of a significant change in the population proportions of depressed and non-depressed individuals after surgery. While the initial analysis suggested a small shift in the proportions between the depression groups before and after surgery, the formal analysis shows that a shift of this size is consistent with chance variation. Therefore we cannot conclude that surgery changes the proportion of patients classified as depressed.

Question 3

Aim

We would like to ascertain whether or not there is a significant average change in STAI score between baseline (STAI.pre) and 6 days after surgery (STAI.post) in our study population (STAI is a measure of anxiety where a higher score corresponds to higher anxiety). To do this we will carry out a hypothesis test and calculate a 95% confidence interval using a suitable method.

Initial Impression

We will first display descriptive statistics for both of our variables of interest (Figure.6 and Figure.7 for STAI.pre and STAI.post respectively) as well as display the spread of these variables graphically using a boxplot (Figure.8, boxplots will be displayed beside one another in the same graph for ease of interpretation).

Figure.6

Variable    N  N*   Mean  SE Mean  StDev  Minimum     Q1  Median     Q3  Maximum
STAI.PRE  116   4  38.83     1.18  12.66    20.00  29.00   37.00  47.00    80.00

Figure.7

Variable     N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
STAI.POST  103  17  33.272    0.975  9.900   20.000  24.000  33.000  39.000   60.000

From the descriptive statistics (Figures 6/7), we can see that the mean anxiety score for a patient prior to surgery is 38.83 with a standard deviation of 12.66, while the mean anxiety score for a patient 6 days after surgery is 33.272 with a standard deviation of 9.900. The boxplots in figure.8 display a comparable interquartile range; however the medians are closer than the means found in the descriptive statistics. Informally, both the descriptive statistics and the boxplots suggest a normal data distribution.

Formal Analysis

Prior to performing a hypothesis test, we must first address modelling assumptions present for this data (ascertaining normality will also allow us to decide on an appropriate hypothesis/confidence interval test).

The first assumption is that the data has been collected randomly and is a representative sample of the wider population of patients undergoing major surgery in Glasgow. We will assume that these patients were selected by a properly random selection process and from a number of hospitals around Glasgow. As such, the assumption of this sample as representative of the wider population is valid.

Next we must check the assumption of normality. The numbers of subjects with STAI.pre and STAI.post scores (116 and 103 respectively) are well above the 30 or so at which the central limit theorem makes the sampling distribution of a mean approximately normal, so we will accept the assumption of normality for the purposes of the test.

Having determined a normal distribution, we can carry out a parametric test to attain the 95% confidence interval, which will be given as part of the hypothesis test. As these data points are measurements of one variable in the same group of individuals before surgery as well as 6 days after surgery, it is appropriate to use a paired t test as the hypothesis test (Figure.9).

Using this method of hypothesis testing, the null hypothesis (H0) is that there is no difference between the mean STAI scores before and after surgery in the wider population, while the alternative hypothesis (H1) is that there is a difference between the mean STAI scores before and after surgery in the wider population.

From the paired t test we can infer that, due to the low p-value (<0.05), there is a significant difference between the mean values of STAI.pre and STAI.post, and we can reject the null hypothesis. The 95% confidence interval for this difference ranges from 2.32 to 7.64, and as this interval does not include 0 we can conclude that there is a significant positive difference between these variables. Since the test subtracts the post score from the pre score, the confidence interval tells us that patients' STAI scores are lower six days after surgery than prior to surgery, decreasing on average by between 2.32 and 7.64 STAI points. In simpler terms, patients are on average less anxious 6 days after surgery than they are before undergoing surgery.
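The mechanics of a paired t test can be sketched in a few lines of Python. This is an illustration only: the scores below are hypothetical and are NOT the study data, and the helper names `paired_differences` and `paired_t_statistic` are made up for this example. It shows two details that matter here: only patients with both measurements contribute, and the sign of the result depends on the order of subtraction (pre minus post).

```python
import math
from statistics import mean, stdev

def paired_differences(pre, post):
    """Return pre - post for every subject with both scores recorded."""
    return [a - b for a, b in zip(pre, post)
            if a is not None and b is not None]

def paired_t_statistic(diffs):
    """t statistic for H0: mean difference = 0, on len(diffs) - 1 df."""
    se = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / se

# Hypothetical scores for five patients (not the study data);
# None marks a missing measurement.
pre_scores = [40, 35, None, 50, 30]
post_scores = [34, 33, 31, None, 26]

diffs = paired_differences(pre_scores, post_scores)
t_stat = paired_t_statistic(diffs)

print("differences (pre - post):", diffs)
print("t statistic:", round(t_stat, 2))
```

A positive mean difference under this convention corresponds to scores falling after surgery, which is the direction reported above.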

Question 4

Aim

We wish to build a regression model which elucidates the dependence of the STAI.post variable on the covariates: STAI.pre, age, EXTRO, NEURO, NART and BECK.pre. It may be unnecessary to include some of these variables in the final model, so it will be necessary to assess their individual relationship with the response variable and among the other covariates, and eliminate extraneous variables. The final regression model may be used to predict the STAI.post score of a future patient in the wider population of patients undergoing surgery, using the correct covariates.

Informal Analysis

The first step in building a regression model is in examining the relationships between these variables using a matrix plot, given in figure 10.

It is also necessary to produce a list of correlation coefficients to help determine the relationships between these variables, and these coefficients are given in figure 11.

Figure.11

           STAI.PRE     AGE   EXTRO   NEURO    NART  BECK.PRE
AGE          -0.186
              0.046

EXTRO        -0.275  -0.196
              0.004   0.039

NEURO        -0.110  -0.000   0.089
              0.319   1.000   0.427

NART          0.194  -0.143   0.151  -0.143
              0.038   0.122   0.114   0.191

BECK.PRE      0.530  -0.028  -0.211  -0.062   0.326
              0.000   0.765   0.026   0.576   0.000

STAI.POST     0.340   0.169  -0.024  -0.251   0.163     0.207
              0.001   0.088   0.816   0.021   0.101     0.039

Cell Contents: Pearson correlation
               P-Value

Observing the plots (figure 10), the explanatory variable which appears to have the most significant relationship with STAI.post is STAI.pre, which appears to have a positive relationship with STAI.post. Likewise, BECK.pre and to a lesser degree NART also appear to have a positive relationship with STAI.post, however these relationships are not as obvious as the aforementioned relationship between STAI.pre & STAI.post. The NEURO variable appears to have a negative relationship with STAI.post. In terms of outliers, making any specific comment is unnecessary other than to mention that in general each plot appears quite scattershot, containing numerous possible outliers with no easily identifiable abnormal trends.

From the correlation coefficients (figure 11), the variable STAI.pre has the strongest relationship with STAI.post (r = 0.340), as the matrix plots had indicated. After STAI.pre, the strongest relationship is the negative relationship between NEURO and STAI.post (r = -0.251), followed by the positive relationship between BECK.pre and STAI.post (r = 0.207). The remaining variables (age, extroversion and NART) have weak relationships with STAI.post, with p-values higher than the 0.05 significance level (0.088, 0.816 and 0.101 respectively).

Seven of the relationships between explanatory variables have a significant p-value (below 0.05). Of particular note are the relationship between BECK.pre and STAI.pre (r = 0.530) and the relationship between BECK.pre and NART (r = 0.326), as these are the strongest in the group and both have a p-value of <0.001, which indicates a very significant relationship.

Having assessed the relationships between the explanatory variables and STAI.post, as well as between the explanatory variables themselves, we may now progress to using multiple regression methods to eliminate unnecessary variables, leaving only those upon which STAI.post is most dependent.

Formal Analysis

Considering that a number of explanatory variables were correlated with one another, it is now necessary to use stepwise regression (figure 12) to eliminate extraneous variables from the model.

Figure.12

Alpha-to-Enter: 0.05  Alpha-to-Remove: 0.05

Response is STAI.POST on 6 predictors, with N = 79
N(cases with missing observations) = 41  N(all cases) = 120

Step             1       2       3
Constant    20.604   3.430   4.052

STAI.PRE     0.308   0.348   0.327
T-Value       3.99    4.49    4.26
P-Value      0.000   0.000   0.000

AGE                   0.26    0.27
T-Value               2.17    2.26
P-Value              0.033   0.027

NEURO                         -2.1
T-Value                      -2.01
P-Value                      0.048

S             9.39    9.17    8.99
R-Sq         17.12   21.96   25.95
R-Sq(adj)    16.04   19.91   22.98
Mallows Cp    10.1     7.1     5.0

The stepwise regression process has identified three explanatory variables which have been deemed the most predictive of STAI.post. The most useful single explanatory variable is STAI.pre (R-sqadj = 16.04), which accounts for 16% of the variability of STAI.post. The next most useful variable is age followed by NEURO. Cumulatively, the variables: STAI.pre, age and NEURO have an r-squared adjusted value of 22.98, therefore they explain approximately 23% of the variability of STAI.post.

We must now check the assumptions associated with multiple regression, namely: linearity, constant spread and normality.

Using figures 13-16, we can see that there are no increasing/decreasing patterns, nor are there weaving patterns, and as such both the assumption of linearity (random errors have zero mean) and the assumption of constant spread (random errors have constant variance) can be accepted. Finally, figure 17 shows that the data points mostly adhere to the fitted line, and the assumption of normality (random errors follow a normal probability model) is accepted.

The regression output is as follows (figure 18).

Figure 18

The regression equation is

STAI.POST = 2.24 + 0.328 STAI.PRE + 0.300 AGE – 2.11 NEURO

82 cases used, 38 cases contain missing values

Predictor     Coef  SE Coef      T      P
Constant     2.239    7.893   0.28  0.777
STAI.PRE   0.32801  0.07558   4.34  0.000
AGE         0.2997   0.1116   2.69  0.009
NEURO       -2.115    1.028  -2.06  0.043

S = 8.86271  R-Sq = 26.8%  R-Sq(adj) = 24.0%

Analysis of Variance

Source          DF       SS      MS     F      P
Regression       3  2239.34  746.45  9.50  0.000
Residual Error  78  6126.72   78.55
Total           81  8366.06

Source    DF   Seq SS
STAI.PRE   1  1350.91
AGE        1   555.84
NEURO      1   332.59

Unusual Observations

Obs  STAI.PRE  STAI.POST     Fit  SE Fit  Residual  St Resid
 10      68.0     20.000  37.743   2.724   -17.743    -2.10R
 32      60.0     60.000  39.915   1.995    20.085     2.33R
 39      41.0     58.000  32.307   1.239    25.693     2.93R
 53      80.0     53.000  42.702   3.471    10.298     1.26 X
 60      25.0     45.000  24.188   1.880    20.812     2.40R
 62      46.0     38.000  44.727   4.449    -6.727    -0.88 X
 64      32.0     56.000  33.407   1.559    22.593     2.59R
 84      45.0     49.000  42.395   3.906     6.605     0.83 X
 88      20.0     44.000  26.520   1.805    17.480     2.01R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

The p-values for STAI.pre, age and NEURO (<0.001, 0.009 and 0.043 respectively) are all significant; each is therefore a useful predictor of STAI.post in the presence of the other two.

The coefficient for STAI.pre is 0.328, meaning that for any given age and NEURO score, the average STAI.post anxiety score increases by 0.328 points for every one point increase in STAI.pre anxiety score.

The coefficient for age is 0.2997, meaning that for any given STAI.pre and NEURO score, the average STAI.post anxiety score increases by 0.2997 points for every one year increase in age.

The coefficient for NEURO is -2.115, meaning that for any given STAI.pre and age score, the average STAI.post anxiety score decreases by 2.115 points for every one unit increase in NEURO.

The R-squared adjusted value is given as 24%; that is, twenty-four percent of the total variability in STAI.post score is explained by its dependence on the selected explanatory variables: STAI.pre, age and NEURO.
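The summary line of Figure 18 follows directly from its Analysis of Variance table. As a rough check (a Python sketch, not part of the original Minitab work), the sums of squares and degrees of freedom can be combined as follows.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table in Figure 18
ss_regression = 2239.34
ss_residual = 6126.72
ss_total = 8366.06
df_residual = 78
df_total = 81

# Proportion of the total variability explained by the regression
r_sq = ss_regression / ss_total

# Adjusted version: compares variances (SS / df) rather than raw sums of
# squares, so it is penalised for each coefficient that is estimated.
r_sq_adj = 1 - (ss_residual / df_residual) / (ss_total / df_total)

# Residual standard deviation, reported by Minitab as S
s = math.sqrt(ss_residual / df_residual)

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, S = {s:.5f}")
```

The three values should match the S, R-Sq and R-Sq(adj) figures printed beneath the coefficient table.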

In total there are 9 unusual observations. Six of these (10, 32, 39, 60, 64 and 88) have standardized residuals greater than 2 in magnitude and are outliers. These outliers have less of an effect than the three observations (53, 62 and 84) which possess unusual covariate values (large leverage) and so have more potential to affect the model detrimentally. Removing these three observations may improve the model; however, in this case their removal causes the R-squared adjusted value to drop from 24% to 19.4% (Figure 19), and as such the final model will retain these observations.
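The R and X flags can be recovered from the columns of the Unusual Observations table in Figure 18. The Python sketch below uses the standard least-squares relationship SE Fit = S x sqrt(leverage). The leverage cut-off of 3p/n is an assumption (a common rule of thumb, with p the number of coefficients including the constant); the text does not state which rule Minitab applied.

```python
import math

S = 8.86271          # residual standard deviation from Figure 18
N_CASES = 82         # cases used in the fit
N_COEFFICIENTS = 4   # constant, STAI.PRE, AGE, NEURO

# (observation, SE Fit, residual) copied from the Unusual Observations table
rows = [
    (10, 2.724, -17.743), (32, 1.995, 20.085), (39, 1.239, 25.693),
    (53, 3.471, 10.298), (60, 1.880, 20.812), (62, 4.449, -6.727),
    (64, 1.559, 22.593), (84, 3.906, 6.605), (88, 1.805, 17.480),
]

# Assumption: flag leverage above 3p/n (a common rule of thumb)
leverage_cutoff = 3 * N_COEFFICIENTS / N_CASES

std_resid = {}
large_residual, high_leverage = set(), set()
for obs, se_fit, resid in rows:
    leverage = (se_fit / S) ** 2            # since SE Fit = S * sqrt(leverage)
    std_resid[obs] = resid / (S * math.sqrt(1 - leverage))
    if abs(std_resid[obs]) > 2:
        large_residual.add(obs)             # would be flagged R
    if leverage > leverage_cutoff:
        high_leverage.add(obs)              # would be flagged X

print("large standardized residual:", sorted(large_residual))
print("large leverage:", sorted(high_leverage))
```

Under these assumptions the two sets should coincide with the observations marked R and X in the table.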

Figure 19

The regression equation is

STAI.POST = 3.60 + 0.291 STAI.PRE + 0.300 AGE – 2.55 NEURO

79 cases used, 38 cases contain missing values

Predictor     Coef  SE Coef      T      P
Constant     3.596    8.008   0.45  0.655
STAI.PRE   0.29118  0.08116   3.59  0.001
AGE         0.2996   0.1122   2.67  0.009
NEURO       -2.549    1.386  -1.84  0.070

S = 8.87630  R-Sq = 22.5%  R-Sq(adj) = 19.4%

Conclusion

Our final regression model finds that the STAI.post measure is most dependent on the STAI.pre, age and NEURO covariates, in that order. The model was simplified using stepwise regression to include only those terms, as they were deemed to make the most significant contribution to the regression relationship. Using this model, it would be possible to predict the likely STAI.post score of a future patient in the wider population of patients undergoing surgery, given a record of the pertinent covariates used in the equation from Figure 18 (which retains all observations): STAI.POST = 2.24 + 0.328 STAI.PRE + 0.300 AGE - 2.11 NEURO.
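As an illustration of how such a prediction would be made, the Python sketch below applies the coefficients reported in Figure 18, the fit that retains all observations as decided in the formal analysis. The function name `predict_stai_post` and the example patient are hypothetical. In particular, the NEURO value of 3 is made up, since the scale of that variable is not described in the text.

```python
# Coefficients as reported in Figure 18 (all observations retained)
COEFFICIENTS = {
    "constant": 2.239,
    "stai_pre": 0.32801,
    "age": 0.2997,
    "neuro": -2.115,
}

def predict_stai_post(stai_pre, age, neuro):
    """Predicted mean STAI.post for the given covariate values."""
    c = COEFFICIENTS
    return (c["constant"]
            + c["stai_pre"] * stai_pre
            + c["age"] * age
            + c["neuro"] * neuro)

# Hypothetical patient (values are illustrative, not from the data set)
predicted = predict_stai_post(stai_pre=40, age=60, neuro=3)
print(f"Predicted STAI.post: {predicted:.1f}")

# Raising STAI.pre by one point, with age and NEURO held fixed, raises
# the prediction by the STAI.pre coefficient.
one_point_more = predict_stai_post(stai_pre=41, age=60, neuro=3)
print(f"Effect of +1 STAI.pre: {one_point_more - predicted:.3f}")
```

Since the model explains only about a quarter of the variability in STAI.post, a prediction of this kind is a rough guide to the average score for similar patients rather than a precise forecast for one individual.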

Question 5

Aim

Using the final model produced in question four, we wish to include the SEX variable and assess its contribution to the final R-squared adjusted value of the model. The sex of a patient is trivial to obtain, so if it were found to significantly improve the model it would be, from a logistical standpoint, useful in predicting the STAI.post of a future subject in the wider population of patients undergoing surgery.

Formal Analysis

We may now revisit the stepwise regression performed in the formal analysis section of question 4 (figure 12), and introduce the SEX variable in addition to the original covariates (figure 20).

Figure 20

Alpha-to-Enter: 0.05  Alpha-to-Remove: 0.05

Response is STAI.POST on 7 predictors, with N = 79
N(cases with missing observations) = 41  N(all cases) = 120

Step             1       2       3
Constant    20.604   3.430   4.052

STAI.PRE     0.308   0.348   0.327
T-Value       3.99    4.49    4.26
P-Value      0.000   0.000   0.000

AGE                   0.26    0.27
T-Value               2.17    2.26
P-Value              0.033   0.027

NEURO                         -2.1
T-Value                      -2.01
P-Value                      0.048

S             9.39    9.17    8.99
R-Sq         17.12   21.96   25.95
R-Sq(adj)    16.04   19.91   22.98
Mallows Cp    12.5     9.4     7.2

In this case, the stepwise regression model does not opt to include SEX as a significant contributor and so it is necessary to force the variable to be included in the model by selecting it as a predictor to be used in every model (figure 21).

Figure 21

Alpha-to-Enter: 0.05  Alpha-to-Remove: 0.05

Response is STAI.POST on 7 predictors, with N = 79
N(cases with missing observations) = 41  N(all cases) = 120

Step             1       2       3       4
Constant    30.364  15.454  -3.054  -2.403

SEX            2.6     5.0     5.4     5.4
T-Value       0.79    1.62    1.80    1.83
P-Value      0.435   0.110   0.075   0.071

STAI.PRE             0.330   0.375   0.354
T-Value               4.25    4.81    4.60
P-Value              0.000   0.000   0.000

AGE                           0.28    0.28
T-Value                       2.31    2.40
P-Value                      0.024   0.019

NEURO                                 -2.1
T-Value                              -2.03
P-Value                              0.046

S             10.3    9.29    9.04    8.85
R-Sq          0.79   19.87   25.21   29.16
R-Sq(adj)     0.00   17.76   22.21   25.33
Mallows Cp    29.8    11.6     8.0     5.8

This model finds that individually, SEX is essentially useless as a predictor of STAI.post (R-sqadj of 0). However, including the other covariates the final R-sqadj value is 25.33%, higher than the value found in figure 12 by 2.35 percentage points.
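Because adding any predictor can only raise the plain R-squared, the comparison above relies on the adjusted value, which charges a penalty for each extra term. The Python sketch below illustrates that penalty using the R-Sq values and N = 79 from the final steps of figures 12 and 21. The function name `adjusted_r_squared` is introduced here for illustration only.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Adjusted R-squared for a model with a constant plus n_predictors terms."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Final step of figure 12: STAI.PRE, AGE, NEURO (R-Sq = 25.95%)
without_sex = adjusted_r_squared(0.2595, n_cases=79, n_predictors=3)

# Final step of figure 21: SEX, STAI.PRE, AGE, NEURO (R-Sq = 29.16%)
with_sex = adjusted_r_squared(0.2916, n_cases=79, n_predictors=4)

print(f"without SEX: {without_sex:.2%}")
print(f"with SEX:    {with_sex:.2%}")
print(f"gain: {100 * (with_sex - without_sex):.2f} percentage points")
```

The results should agree with the R-Sq(adj) rows of the two figures up to rounding of the printed R-Sq values. The raw R-Sq rises by about 3.2 points when SEX is added, but roughly 0.9 of that is removed by the penalty for the extra predictor.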

From figure 22, the coefficient for SEX is 4.226, meaning that for any given STAI.pre, age and NEURO score, the average STAI.post anxiety score is 4.226 points higher where SEX indicates that the patient is male. This interpretation of the coefficient means that men are likely to be more anxious after surgery than women. SEX has the largest coefficient in the model, but this reflects the scale on which it is measured (a two-level variable) rather than its significance: in figure 21 its p-value (0.071) is above 0.05, unlike those of the other covariates.

Conclusion

Given the output of the regression model, we can conclude that after forcing the inclusion of SEX, men are estimated to have a STAI.post score around 4 points higher than women with the same STAI.pre, age and NEURO values, and as such are likely to be more anxious after surgery. The R-sq(adj) value of this model is 25.3%, higher than that found in figure 18 (24%), indicating that this model accounts for slightly more of the variability in STAI.post score.

It is difficult to comment on the inclusion of SEX into the final model designed in question 4. The stepwise procedure did not select SEX at the 0.05 level, and when its inclusion was forced its p-value was 0.071 (figure 21), so the evidence for its usefulness is weak. We are given the R-squared adjusted values of both models, but whether or not the increase of 1.3 percentage points given the inclusion of SEX is worthwhile, or whether the simpler model found in question 4 would be best, is fairly subjective. Further tests of both models using new sets of patient data would be the most useful method of determining the viability of the more complicated model over the simpler one.

Question 6

In this report we investigated a number of uses for common measurements taken on patients undergoing surgery. We first established a likely range of values (7.4-9.7) for the mean Beck depression score in the population of patients, and then found no evidence that surgery alters the proportion of patients classed as depressed. STAI scores, however, do change after surgery, meaning that while we found no evidence that a patient’s level of depression is affected by surgery, patients are likely to become less anxious after surgery.

In addition we found that the variables STAI.pre, age and NEURO could be used to calculate a likely measure of STAI.post (using what is known as a regression model), which is useful as it would allow for a likely estimate of patient anxiety after surgery to be made prior to the surgery itself. The inclusion of the sex variable in the calculation of this estimate may be beneficial to the accuracy of the prediction; however complicating the model in this manner may be detrimental. Therefore, no certain recommendation on this matter has been made.