11 Exploratory Data Analysis: Objectives
- Examine the relationship between two continuous variables using a scatter plot.
- Quantify the degree of association between two continuous variables using correlation statistics.
- Avoid potential misuses of the correlation coefficient.
- Obtain Pearson correlation coefficients.

17 Exploratory Data Analysis: Hypothesis Testing for a Correlation
Correlation coefficient test:
H0: ρ = 0
Ha: ρ ≠ 0
Population parameter: ρ. Sample statistic: r.
- A p-value does not measure the magnitude of the association.
- Sample size affects the p-value.
Rejecting the null hypothesis only means that you can be confident that the true population correlation is not 0. A small p-value can occur (as with many statistics) because of a very large sample size. Even a very small correlation coefficient can be statistically significant with a large enough sample size. Therefore, it is important to also look at the value of r itself to see whether it is meaningfully large.
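
The effect of sample size on significance can be checked numerically. The sketch below is written in Python rather than with the course's SAS tasks, and uses the standard t statistic for testing H0: ρ = 0, t = r·sqrt((n − 2)/(1 − r²)). The function name and the example values (r = 0.05, n = 30 and n = 10000) are illustrative assumptions, not values from the slides.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0; it has n - 2 degrees of freedom."""
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# The same weak correlation (assumed r = 0.05) at two assumed sample sizes.
t_small = correlation_t_statistic(0.05, 30)
t_large = correlation_t_statistic(0.05, 10_000)
print(f"n = 30:    t = {t_small:.2f}")
print(f"n = 10000: t = {t_large:.2f}")
```

With the larger sample the statistic exceeds the usual two-sided 5% critical value of about 1.96, even though r itself is tiny; with n = 30 it does not.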

19 Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect
Correlation does not imply causation.
Besides causality, could other reasons account for a strong correlation between two variables?
- A strong correlation between two variables does not mean that a change in one variable causes the other variable to change, or vice versa.
- Sample correlation coefficients can be large because of chance or because both variables are affected by other variables.
[Figure: scatter plot of Weight and Height]

20 Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect
Correlation does not imply causation.
[Figure: scatter plot of Weight and Height]
A strong correlation between two variables does not mean that a change in one variable causes the other variable to change, or vice versa.

21 Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect
Correlation does not imply causation.
Sample correlation coefficients can be large because of chance or because both variables are affected by other variables.

23 Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect?
SAT score: tied to college entrance or not?
X: the percentage of students who take the SAT exam in each state
Y: SAT scores
Sample correlation coefficients can be large because of chance or because both variables are affected by other variables.
There are many reasons for the varying participation rates. Some states have lower participation because their students primarily take the rival ACT standardized test. Others have rules requiring even students who are not college-bound to take the test. In low-participation states, often only the highest-performing students choose to take the SAT.

24 Exploratory Data Analysis: Avoiding Common Errors: Types of Relationships
Pearson correlation coefficient: r → 0?
- The Pearson correlation coefficient measures linear relationships.
- A Pearson correlation coefficient close to 0 indicates that there is not a strong linear relationship between two variables.
- A Pearson correlation coefficient close to 0 does not mean that there is no relationship of any kind between the two variables.
[Figure: example scatter plots labeled curvilinear, parabolic, and quadratic]
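
A small numeric illustration of this point, as a Python sketch with made-up data: y is determined exactly by x through y = x², yet the Pearson coefficient is 0 because the relationship is not linear. The helper name pearson_r and the data are assumptions for illustration only.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: a perfect parabola on x values symmetric around 0.
x = list(range(-5, 6))
y = [v * v for v in x]
r = pearson_r(x, y)
print(f"Pearson r for y = x^2: {abs(r):.3f}")
```

A scatter plot of the same data would show the parabola immediately, which is why the slides pair correlations with plots.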

26 Exploratory Data Analysis: Avoiding Common Errors: Outliers
What to do with an outlier? First ask why it is an outlier: is the value valid, or is it an error?
- Compute two correlation coefficients and report both coefficients.
- Collect data.
- Replicate data: collect data at a fixed value of x (in this case, x = 10). This determines whether the data point is unusual.
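
Computing two coefficients can be sketched as follows in Python. The data are made up (nine points on a straight line plus one unusual point at x = 10), and the sketch assumes the two coefficients are computed with and without that point; pearson_r is an illustrative helper, not part of the course material.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: the last observation (x = 10, y = 0) is the unusual point.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

r_with = pearson_r(x, y)                  # all ten observations
r_without = pearson_r(x[:-1], y[:-1])     # unusual point left out
print(f"with the point:    r = {r_with:.3f}")
print(f"without the point: r = {r_without:.3f}")
```

Reporting both values shows how much a single observation drives the result.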

27 Exploratory Data Analysis: Scenario: Exploring Data Using Correlation and Scatter Plots
Fitness
In exercise physiology, an objective measure of aerobic fitness is how efficiently the body can absorb and use oxygen (oxygen consumption). Subjects participated in a predetermined exercise run of 1.5 miles. Measurements of oxygen consumption as well as several other continuous measurements, such as age, pulse, and weight, were recorded. The researchers are interested in determining whether any of these other variables can help predict oxygen consumption.

29 Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots
- What is the Pearson correlation coefficient of Oxygen_Consumption with Run_Time?
- What is the p-value for the correlation of Oxygen_Consumption with Performance?

33 Exploratory Data Analysis
Question 1.
The correlation between tuition and rate of graduation at U.S. colleges is ___. What does this mean?
a) The way to increase graduation rates at your college is to raise tuition.
b) Increasing graduation rates is expensive, causing tuition to rise.
c) Students who are richer tend to graduate more often than poorer students.
d) None of the above.
Answer: d

50 Simple Linear Regression: Performing Simple Linear Regression
Question 3.
In the model Y = X, if the parameter estimate (slope) of X is 0, then which of the following is the best guess (predicted value) for Y when X is equal to 13?
a) 13
b) The mean of Y
c) A random number
d) The mean of X
Answer: b
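
The reasoning behind this answer can be checked with a short Python sketch. In least squares the intercept equals mean(Y) − slope × mean(X), so a slope of 0 leaves the mean of Y as the prediction for every X. The function name and the data are made up for illustration.

```python
from typing import Sequence, Tuple


def fit_simple_regression(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (intercept, slope) for the model y = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data chosen so that the fitted slope is exactly 0.
x = [1, 2, 3, 4]
y = [2, 4, 4, 2]
intercept, slope = fit_simple_regression(x, y)
prediction_at_13 = intercept + slope * 13
mean_y = sum(y) / len(y)
print(f"slope = {slope}, prediction at X = 13: {prediction_at_13}, mean of Y: {mean_y}")
```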

51 Simple Linear Regression: Confidence and Prediction Intervals
For the mean:
- A 95% confidence interval for the mean says that you are 95% confident that your interval contains the true population mean of Y for a particular X.
- Confidence intervals become wider as you move away from the mean of the independent variable. This reflects the fact that your estimates become more variable as you move away from the means of X and Y.
For prediction:
- A 95% prediction interval is one that you are 95% confident contains a new observation.
- Prediction intervals are wider than confidence intervals because single observations have more variability than sample means.
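
Both interval types have the form estimate ± (t critical value) × (standard error), so their widths can be compared through the standard errors alone. The Python sketch below uses the textbook formulas for simple linear regression, with s the root mean squared error: s·sqrt(1/n + (x0 − x̄)²/Sxx) for the mean and s·sqrt(1 + 1/n + (x0 − x̄)²/Sxx) for a new observation. The function name and data are made up, and the t critical value is left out because the Python standard library has no t quantile function.

```python
import math
from typing import Sequence, Tuple


def interval_standard_errors(x: Sequence[float], y: Sequence[float],
                             x0: float) -> Tuple[float, float]:
    """Standard errors at x0 for (the mean of Y, a new observation of Y)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))                  # root mean squared error
    leverage = 1.0 / n + (x0 - mean_x) ** 2 / sxx
    return s * math.sqrt(leverage), s * math.sqrt(1.0 + leverage)


# Made-up data; the mean of x is 3.5.
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
se_mean_center, se_pred_center = interval_standard_errors(x, y, 3.5)
se_mean_edge, se_pred_edge = interval_standard_errors(x, y, 6.0)
print(f"at x = 3.5: mean SE = {se_mean_center:.3f}, prediction SE = {se_pred_center:.3f}")
print(f"at x = 6.0: mean SE = {se_mean_edge:.3f}, prediction SE = {se_pred_edge:.3f}")
```

The standard error for a prediction is larger than the one for the mean at every x0, and both grow as x0 moves away from the mean of x.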

52 Simple Linear Regression: Confidence and Prediction Intervals
Question 4.
Suppose you have a 95% confidence interval around the mean. How do you interpret it?
a) The probability is .95 that the true population mean of Y for a particular X is within the interval.
b) You are 95% confident that a newly sampled value of Y for a particular X is within the interval.
c) You are 95% confident that your interval contains the true population mean of Y for a particular X.
Answer: c

66 Multiple Regression: Common Applications
Multiple linear regression is a powerful tool for the following tasks:
- Prediction, which is used to develop a model to predict future values of a response variable (Y) based on its relationships with other predictor variables (Xs).
- Analytical or explanatory analysis, which is used to develop an understanding of the relationships between the response variable and the predictor variables.
Myers (1999) refers to four applications of regression:
- Prediction
- Variable screening
- Model specification
- Parameter estimation

67 Multiple Regression: Analysis versus Prediction in Multiple Regression
Prediction
- The terms in the model, the values of their coefficients, and their statistical significance are of secondary importance.
- The focus is on producing a model that is the best at predicting future values of Y as a function of the Xs.
$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$

68 Multiple Regression: Analysis versus Prediction in Multiple Regression
Analytical or Explanatory Analysis
- The focus is on understanding the relationship between the dependent variable and the independent variables.
- Consequently, the statistical significance of the coefficients is important, as well as their magnitudes and signs.
$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$

70 Multiple Regression: Hypothesis Testing for Multiple Regression
Null hypothesis: The regression model does not fit the data better than the baseline model.
Alternative hypothesis: The regression model does fit the data better than the baseline model.
Question 4.
Match each statement with the corresponding decision: a) Reject the null hypothesis, or b) Fail to reject the null hypothesis.
1. At least one slope of the regression in the population is not 0, and at least one predictor variable explains a significant amount of variability in the response variable.
2. No predictor variable explains a significant amount of variability in the response variable.
3. The estimated linear regression model does not fit the data better than the baseline model.
Answers: 1 a, 2 b, 3 b

73 Multiple Regression: Adjusted R²
$R^2_{adj} = 1 - \frac{(n - i)(1 - R^2)}{n - p}$
where
- i = 1 if there is an intercept and 0 otherwise
- n = the number of observations used to fit the model
- p = the number of parameters in the model
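
As a quick numeric check of the formula, the Python sketch below implements it directly. The function name and the example values are made up; they only illustrate that adding parameters can lower the adjusted statistic even when R² itself rises slightly.

```python
def adjusted_r_square(r_square: float, n: int, p: int, has_intercept: bool = True) -> float:
    """Adjusted R-square: 1 - (n - i)(1 - R^2) / (n - p), with i = 1 for an intercept."""
    if n <= p:
        raise ValueError("n must exceed the number of parameters p")
    i = 1 if has_intercept else 0
    return 1.0 - (n - i) * (1.0 - r_square) / (n - p)


# Made-up example: 20 observations, a 2-parameter model versus a 5-parameter model.
small_model = adjusted_r_square(0.50, n=20, p=2)
large_model = adjusted_r_square(0.51, n=20, p=5)
print(f"2 parameters, R2 = 0.50: adjusted R2 = {small_model:.4f}")
print(f"5 parameters, R2 = 0.51: adjusted R2 = {large_model:.4f}")
```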

75 Multiple Regression: Performing Multiple Linear Regression
- What is the p-value of the overall model?
- Should we reject the null hypothesis or not?
- Based on our evidence, do we reject the null hypothesis that the parameter estimate is 0?

82 Model Building and Interpretation: Objectives
- Explain the Linear Regression task options for model selection.
- Describe model selection options and interpret output to evaluate the fit of several models.

89 Model Building and Interpretation: Viewing Mallows' Cp Statistic
Mallows' criterion: Cp ≤ p
[Figure: partial output]
- In this output, how many models have a value for Cp that is less than or equal to p?
- Which of these models has the fewest parameters?
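
For reference, the usual textbook definition is Cp = SSE_p / MSE_full − n + 2p, where SSE_p is the error sum of squares of the candidate model, MSE_full is the mean squared error of the full model, n is the number of observations, and p is the number of parameters in the candidate model including the intercept. The Python sketch below applies that definition and the criterion Cp ≤ p; the function names and all numbers are made up for illustration and do not come from the course output.

```python
def mallows_cp(sse_p: float, mse_full: float, n: int, p: int) -> float:
    """Mallows' Cp for a candidate model with p parameters (intercept included)."""
    if mse_full <= 0:
        raise ValueError("mse_full must be positive")
    return sse_p / mse_full - n + 2 * p


def meets_mallows_criterion(cp: float, p: int) -> bool:
    """Mallows' criterion: the candidate model satisfies Cp <= p."""
    return cp <= p


# Made-up candidates fitted to n = 20 observations; the full model has MSE = 10.
cp_three = mallows_cp(sse_p=180.0, mse_full=10.0, n=20, p=3)
cp_four = mallows_cp(sse_p=160.0, mse_full=10.0, n=20, p=4)
print(f"p = 3: Cp = {cp_three}, meets criterion: {meets_mallows_criterion(cp_three, 3)}")
print(f"p = 4: Cp = {cp_four}, meets criterion: {meets_mallows_criterion(cp_four, 4)}")
```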

91 Model Building and Interpretation: Viewing Mallows' Cp Statistic
Question 5.
What happens when you use the all-possible regressions method? Select all that apply.
a) You compare the R-square, adjusted R-square, and Cp statistics to evaluate the models.
b) SAS computes all possible models.
c) You choose a selection method (stepwise, forward, or backward).
d) SAS ranks the results.
e) You cannot reduce the number of models in the output.
f) You can produce a plot to help identify models that satisfy criteria for the Cp statistic.
Answer: a, b, d, f

92 Model Building and Interpretation: Viewing Mallows' Cp Statistic
Question 6.
Match each description with one of the following: a) Mallows' criterion for Cp, b) Hocking's criterion for Cp, c) adjusted R-square.
1. Preferred over R-square for evaluating multiple linear regression models (takes into account the number of terms in the model).
2. Useful for parameter estimation.
3. Useful for prediction.
Answers: 1 c, 2 b, 3 a

93 Model Building and Interpretation: Using Automatic Model Selection

94 Model Building and Interpretation: Using Automatic Model Selection

99 Model Building and Interpretation: The Stepwise Selection Approach to Model Building
Forward
Forward selection starts with no variables in the model, then adds the most significant variable at each step until no remaining variable is significant. A variable that has been added is not removed, even if it later becomes non-significant.

100 Model Building and Interpretation: The Stepwise Selection Approach to Model Building
Backward
Backward selection starts with all variables in the model, then removes the least significant variable at each step until all remaining variables are significant. Once a variable is removed, it cannot re-enter the model.

101 Model Building and Interpretation: The Stepwise Selection Approach to Model Building
Stepwise
Stepwise selection combines the ideas of forward and backward selection. Like forward selection, it starts with no variables and adds the most significant variable at each step. Like backward selection, it can also drop variables that have become non-significant, one at a time. The stepwise method stops when all terms in the model are significant and all terms outside the model are not significant.
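
The control flow of the three methods can be sketched in Python as follows. This is not how SAS implements them: the p-values come from a hypothetical caller-supplied function p_value(var, terms), where a real analysis would refit the regression at each step. The function names, the toy p-value table, and the 0.05 levels are all assumptions for illustration.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence

# Hypothetical callback: p-value of `var` in a model containing `terms`.
PValueFn = Callable[[str, FrozenSet[str]], float]


def best_to_add(model: List[str], candidates: Sequence[str],
                p_value: PValueFn, entry_level: float) -> Optional[str]:
    """Most significant variable outside the model, if its p-value is below entry_level."""
    best, best_p = None, entry_level
    for var in candidates:
        if var in model:
            continue
        p = p_value(var, frozenset(model) | {var})
        if p < best_p:
            best, best_p = var, p
    return best


def worst_to_drop(model: List[str], p_value: PValueFn,
                  stay_level: float) -> Optional[str]:
    """Least significant variable in the model, if its p-value is above stay_level."""
    worst, worst_p = None, stay_level
    terms = frozenset(model)
    for var in model:
        p = p_value(var, terms)
        if p > worst_p:
            worst, worst_p = var, p
    return worst


def forward_selection(candidates, p_value, entry_level):
    model: List[str] = []
    while True:
        var = best_to_add(model, candidates, p_value, entry_level)
        if var is None:
            return model
        model.append(var)          # once added, never removed


def backward_selection(candidates, p_value, stay_level):
    model = list(candidates)
    while True:
        var = worst_to_drop(model, p_value, stay_level)
        if var is None:
            return model
        model.remove(var)          # once removed, never re-entered


def stepwise_selection(candidates, p_value, entry_level, stay_level):
    model: List[str] = []
    for _ in range(2 * len(candidates) + 1):   # guard against add/drop cycling
        var = best_to_add(model, candidates, p_value, entry_level)
        if var is None:
            break
        model.append(var)
        while True:                            # drop anything no longer significant
            drop = worst_to_drop(model, p_value, stay_level)
            if drop is None:
                break
            model.remove(drop)
    return model


# Toy p-values (made up): A looks significant alone but not once B is in the model.
TOY = {
    ("A", frozenset("A")): 0.01, ("B", frozenset("B")): 0.03, ("C", frozenset("C")): 0.60,
    ("A", frozenset("AB")): 0.40, ("B", frozenset("AB")): 0.02,
    ("A", frozenset("AC")): 0.01, ("C", frozenset("AC")): 0.70,
    ("B", frozenset("BC")): 0.01, ("C", frozenset("BC")): 0.90,
    ("A", frozenset("ABC")): 0.40, ("B", frozenset("ABC")): 0.01, ("C", frozenset("ABC")): 0.80,
}


def toy_p_value(var: str, terms: FrozenSet[str]) -> float:
    return TOY.get((var, terms), 1.0)


forward_model = forward_selection(["A", "B", "C"], toy_p_value, 0.05)
backward_model = backward_selection(["A", "B", "C"], toy_p_value, 0.05)
stepwise_model = stepwise_selection(["A", "B", "C"], toy_p_value, 0.05, 0.05)
print(forward_model, backward_model, stepwise_model)
```

In the toy table, forward selection keeps A even after B makes it non-significant, while stepwise selection drops it; that difference is the one the slides describe.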

114 Homework: Exercise 1
1.1 Describing the Relationship between Continuous Variables
Generate scatter plots and correlations for the variables Age, Weight, Height, and the circumference measures versus the variable PctBodyFat2.
Important! The Correlation task limits you to 10 variables at a time for scatter plot matrices, so for this exercise, look at the relationships with Age, Weight, and Height separately from the circumference variables (Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist).
Note: Correlation tables can be created using more than 10 VAR variables at a time.
- What variable has the highest correlation with PctBodyFat2?
- What is the value for the coefficient?
- Is the correlation statistically significant at the 0.05 level?
- Can straight lines adequately describe the relationships?
- Are there any outliers that you should investigate?
Generate correlations among the variables Age, Weight, and Height, and among the circumference measures. Are there any notable relationships?

115 Homework: Exercise 2
2.1 Fitting a Simple Linear Regression Model
Use the BodyFat2 data set for this exercise.
- Perform a simple linear regression with PctBodyFat2 as the response variable and Weight as the predictor.
- What is the value of the F statistic and the associated p-value? How would you interpret this with regard to the null hypothesis?
- Write the predicted regression equation.
- What is the value of the R² statistic? How would you interpret this?
- Produce predicted values for PctBodyFat2 when Weight is 125, 150, 175, 200, and 225 (see the SAS code below).
- What are the predicted values?
- What is the predicted value of PctBodyFat2 when Weight is 150?
SAS code:
   /* Copy the data and correct the Height value recorded as 29.5 */
   data BodyFat2;
      set sasuser.BodyFat;
      if Height=29.5 then Height=69.5;
   run;

   /* Weight values to score, taken from the exercise text */
   data BodyFatToScore;
      input Weight @@;
      datalines;
   125 150 175 200 225
   ;
   run;

116 Homework: Exercise 3
3.1 Performing Multiple Regression
Using the BodyFat2 data set, run a regression of PctBodyFat2 on the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.
- Compare the ANOVA table with that from the model with only Weight in the previous exercise. What is different?
- How do the R² and the adjusted R² compare with these statistics for the Weight regression demonstration?
- Did the estimate for the intercept change? Did the estimate for the coefficient of Weight change?

117 Homework: Exercise 3
3.2 Simplifying the Model
Rerun the model in the previous exercise, but eliminate the variable with the highest p-value. Compare the result with the previous model.
- Did the p-value for the model change notably?
- Did the R² and adjusted R² change notably?
- Did the parameter estimates and their p-values change notably?
3.3 More Simplifying of the Model
Rerun the model in the previous exercise, but eliminate the variable with the highest p-value.
- How did the output change from the previous model?
- Did the number of parameters with a p-value less than 0.05 change?

118 Homework: Exercise 4
4.1 Using Model Building Techniques
Use the BodyFat2 data set to identify a set of "best" models.
- Using the Mallows' Cp option, use an all-possible regressions technique to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Hint: select the best 60 models based on Cp to compare.
- Use a stepwise regression method to select a candidate model. Try Forward selection, Backward selection, and Stepwise selection.
- How many variables would result from a model using Forward selection and a significance level for entry criterion of 0.05, instead of the default of 0.50?
