16 Multiple Linear Regression Analysis: Parameter Estimation
The goal of an estimator is to provide an estimate of a particular statistic based on the data. There are several ways to characterize estimators:
- Bias: an unbiased estimator is correct on average, i.e. its expected value equals the true parameter value. Each parameter is neither consistently over- nor under-estimated.
- Likelihood: the maximum likelihood (ML) estimator is the one that makes the observed data most likely. ML estimators are not always unbiased for small N.
- Efficiency: an estimator with lower variance is more efficient, in the sense that it is likely to be closer to the true value across samples. The "best" estimator is the one with minimum variance among all estimators.

17 Multiple Linear Regression Analysis
A linear model can be written as
$$y = X\beta + \varepsilon$$
where:
- $y$ is an N-dimensional column vector of observations
- $\beta$ is a (k+1)-dimensional column vector of unknown parameters
- $\varepsilon$ is an N-dimensional random column vector of unobserved errors
Matrix X is written as
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{Nk} \end{pmatrix}$$
The first column of X is the vector $\mathbf{1}$, so that the first coefficient $\beta_0$ is the intercept. The unknown coefficient vector $\beta$ is estimated by minimizing the residual sum of squares $(y - X\beta)^\top (y - X\beta)$, which yields the OLS estimator $\hat{\beta} = (X^\top X)^{-1} X^\top y$.
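As an illustration, a minimal numpy sketch of the OLS estimator on simulated data (all values below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])  # first column: intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# beta_hat = (X'X)^{-1} X'y; solving the normal equations avoids an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
rss = residuals @ residuals
print(beta_hat, rss)
```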

18 Multiple Linear Regression Analysis: Model Assumptions
The OLS estimator can be considered the best linear unbiased estimator (BLUE) of $\beta$ provided some basic assumptions regarding the error term are satisfied:
- Mean of errors is zero: $E(\varepsilon_i) = 0$
- Errors have a constant variance: $\mathrm{Var}(\varepsilon_i) = \sigma^2$
- Errors from different observations are independent of each other: $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
- Errors follow a normal distribution.
- Errors are uncorrelated with the explanatory variables.

19 Multiple Linear Regression Analysis: Interpreting the Multiple Regression Model
(Figure: Venn diagram of the overlapping variance of predictors X1 and X2.)
For a multiple regression model, $\beta_1$ should be interpreted as the change in y when a unit change is observed in x1 and x2 is kept constant. This interpretation is not very clear when x1 and x2 are not independent.
- Misunderstanding: $\beta_i$ always measures the effect of $x_i$ on E(y), independent of the other x variables.
- Misunderstanding: a statistically significant $\beta$ value establishes a cause-and-effect relationship between x and y.

20 Multiple Linear Regression Analysis: Explanatory Power
If the model is useful, at least one estimated $\beta$ must be $\neq 0$.
But wait: what is the chance of having at least one estimated $\beta$ significant if I have 2 purely random x variables?
For each $\beta$, prob(b declared $\neq$ 0 at the 5% level) = 0.05. The chance that at least one happens to come out significant is:
Prob(b1 $\neq$ 0 or b2 $\neq$ 0) = 1 - prob(b1 = 0 and b2 = 0) = 1 - (0.95)^2 = 0.0975
Implication?
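A small simulation can make this concrete. The sketch below (hypothetical data; significance assessed with the usual coefficient t-test at alpha = 0.05) counts how often at least one of two pure-noise predictors comes out "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, reps, hits = 50, 2000, 0
for _ in range(reps):
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # 2 random predictors
    y = rng.normal(size=N)                                      # y unrelated to X
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (N - 3)                                        # error variance, N-k-1 df
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    p = 2 * stats.t.sf(np.abs(b / se), df=N - 3)
    if (p[1:] < 0.05).any():                                    # either slope "significant"?
        hits += 1
print(hits / reps)   # close to 1 - 0.95**2 = 0.0975
```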

23 Multiple Linear Regression Analysis
The R^2 statistic measures the overall contribution of the Xs. We then test the hypothesis:
H0: $\beta_1 = \dots = \beta_k = 0$
H1: at least one parameter is nonzero
Since there is no probability distribution form for R^2, the F statistic is used instead:
$$F = \frac{R^2 / k}{(1 - R^2)/(N - k - 1)}$$

25 Multiple Linear Regression Analysis: Variable Selection
How many variables should be included in the model? Basic strategies:
- Sequential forward
- Sequential backward
- Forced entry
The first two strategies determine a suitable number of explanatory variables using the semi-partial correlation as criterion and a partial F-statistic calculated from the residual sums of squares of the restricted (RSS1) and unrestricted (RSS) models:
$$F = \frac{(RSS_1 - RSS)/(k - k_1)}{RSS/(N - k - 1)}$$
where k and k1 denote the number of regressors in the unrestricted and restricted models, and N is the number of observations.
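A sketch of the partial F-test on simulated data (the variable names and data-generating values here are hypothetical, chosen only to illustrate the formula):

```python
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of an OLS fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(2)
N = 80
x1, x2, x3 = rng.normal(size=(3, N))
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=N)

X_full = np.column_stack([np.ones(N), x1, x2, x3])   # unrestricted: k regressors
X_restr = np.column_stack([np.ones(N), x1])          # restricted: k1 regressors
k, k1 = 3, 1
RSS, RSS1 = rss(X_full, y), rss(X_restr, y)

F = ((RSS1 - RSS) / (k - k1)) / (RSS / (N - k - 1))
p = stats.f.sf(F, k - k1, N - k - 1)
print(F, p)   # small p: the extra regressors improve the fit
```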

26 Multiple Linear Regression Analysis: The Semi-Partial Correlation
(Figure: Venn diagram of the shared variance of Y, X and Z.)
The semi-partial correlation measures the relationship between a predictor and the outcome, controlling for the relationship between that predictor and any others already in the model. It measures the unique contribution of a predictor to explaining the variance of the outcome.

27 Multiple Linear Regression Analysis: Testing the Regression Coefficients
An unbiased estimator for the error variance is
$$\hat{\sigma}^2 = \frac{RSS}{N - k - 1}$$
The regression coefficients are tested for significance under the null hypothesis $H_0\!: \beta_i = 0$ using a standard t-test:
$$t = \frac{\hat{\beta}_i}{\sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{ii}}}$$
where $[(X^\top X)^{-1}]_{ii}$ denotes the ith diagonal element of the matrix $(X^\top X)^{-1}$. The denominator is also referred to as the standard error of the regression coefficient $\hat{\beta}_i$.
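The same quantities in a short numpy sketch (data simulated purely for illustration): the error variance estimate, the coefficient standard errors from the diagonal of $\hat{\sigma}^2 (X^\top X)^{-1}$, and the resulting t-statistics and p-values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N, k = 60, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
sigma2 = e @ e / (N - k - 1)             # unbiased estimator of the error variance
se = np.sqrt(sigma2 * np.diag(XtX_inv))  # standard errors of the coefficients
t = beta / se
p = 2 * stats.t.sf(np.abs(t), df=N - k - 1)
print(np.column_stack([beta, se, t, p]))
```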

28 Multiple Linear Regression Analysis
Which X is contributing the most to the prediction of Y?
The relative sizes of the raw coefficients (bs) cannot be compared, because each is tied to its variable's scale, but the $\beta$s (Betas; standardized bs) can be interpreted. The intercept a is the mean of Y, which is zero when Y is standardized.

29 Multiple Linear Regression Analysis: Cross-Validation
Can the regression equation be generalized to other data? This can be evaluated by randomly separating a data set into two halves: estimate the regression equation with one half, apply it to the other half, and see how well it predicts.
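A minimal split-half cross-validation sketch (simulated data; the R^2 computed on the held-out half indicates how well the equation generalizes):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 1.5, -0.7]) + rng.normal(size=N)

idx = rng.permutation(N)
train, test = idx[: N // 2], idx[N // 2 :]

b = np.linalg.lstsq(X[train], y[train], rcond=None)[0]  # estimate on one half
y_hat = X[test] @ b                                     # predict the other half
ss_res = np.sum((y[test] - y_hat) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
print(1 - ss_res / ss_tot)  # cross-validated R^2
```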

31 Multiple Linear Regression Analysis: The Revised Levene's Test
(Figure: residuals e plotted against x or E(y).)
Divide the residuals into two (or more) groups based on the level of x. Under constant error variance, the variances of the two groups, and hence the means of their absolute residual deviations, are supposed to be equal; a standard t-test can be used to test the difference in means. A large t indicates nonconstant variance.
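One way to carry this out in practice is sketched below (simulated heteroscedastic data; scipy's Levene test with center='median' is the Brown-Forsythe variant of this idea, applied to the residual groups):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N = 100
x = rng.uniform(0, 10, size=N)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)   # error variance grows with x

X = np.column_stack([np.ones(N), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

low = e[x <= np.median(x)]    # residuals for low x
high = e[x > np.median(x)]    # residuals for high x
stat, p = stats.levene(low, high, center='median')
print(stat, p)                # small p: variance is not constant
```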

32 Multiple Linear Regression Analysis: Detecting Outliers and Influential Observations
- Influential points are those whose exclusion causes a major change in the fitted line. They can be detected with "leave-one-out" cross-validation.
- If $|e_i| > 4s$, the observation is considered an outlier.
- A true outlier should not simply be removed; it should be explained.

33 Multiple Linear Regression Analysis: Generalized Least Squares
A Generalized Least Squares (GLS) model can be used instead of OLS regression in the case of autocorrelated error terms (e.g. in distributed-lag models):
$$\hat{\beta}_{GLS} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y$$
where $\Omega$ is the covariance matrix of the errors.

39 Comparing More Than Two Groups: ANOVA
ANOVA deals with situations with one observation per object and three or more groups of objects. The most important question is, as usual: do the numbers in the groups come from the same population, or from different populations?

40 One-Way ANOVA: Example
Assume "treatment results" from 13 soil plots in three different regions:
- Region A: 24, 26, 31, 27
- Region B: 29, 31, 30, 36, 33
- Region C: 29, 27, 34, 26
H0: the treatment results are from the same population of results.
H1: they are from different populations.

41 ANOVA: Comparing the Groups
Averages within groups:
- Region A: 27
- Region B: 31.8
- Region C: 29
Total average: 383/13 ≈ 29.46
The variance around the means matters for the comparison: we must compare the variance within the groups to the variance between the group means.

42 ANOVA: Variance Within and Between Groups
Sum of squares within groups:
$$SSW = \sum_{i} \sum_{j} (x_{ij} - \bar{x}_i)^2$$
Sum of squares between groups:
$$SSB = \sum_{i} n_i (\bar{x}_i - \bar{x})^2$$
The number of observations and the sizes of the groups have to be taken into account!

43 ANOVA: Adjusting for Group Sizes
$$MSW = \frac{SSW}{n - K} \qquad MSB = \frac{SSB}{K - 1}$$
Both are estimates of the population variance of the error under H0, where n is the number of observations and K is the number of groups. If the populations are normal, with the same variance, then we can show that under the null hypothesis
$$F = \frac{MSB}{MSW} \sim F(K - 1,\, n - K)$$
Reject at confidence level $\alpha$ if $F > F_{\alpha}(K - 1,\, n - K)$.
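Applying this to the example data from the three regions, both by hand and with scipy (f_oneway returns the same F and p-value):

```python
import numpy as np
from scipy import stats

groups = [np.array([24, 26, 31, 27]),          # Region A
          np.array([29, 31, 30, 36, 33]),      # Region B
          np.array([29, 27, 34, 26])]          # Region C
n = sum(len(g) for g in groups)                # 13 observations
K = len(groups)
grand = np.concatenate(groups).mean()          # total average, about 29.46

ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)       # within-groups SS
ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # between-groups SS
F = (ssb / (K - 1)) / (ssw / (n - K))
print(F, stats.f.sf(F, K - 1, n - K))
print(stats.f_oneway(*groups))                 # same F and p-value
```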

46 ANOVA: When to Use Which Method
In situations where we have one observation per object and want to compare two or more groups:
- Use non-parametric tests if you have enough data:
  - For two groups: Mann-Whitney U-test (Wilcoxon rank sum)
  - For three or more groups: Kruskal-Wallis
- If data analysis indicates that the assumption of normally distributed independent errors is OK:
  - For two groups: t-test (equal or unequal variances assumed)
  - For three or more groups: ANOVA

47 Two-Way ANOVA (Without Interaction)
In two-way ANOVA, data fall into categories in two different ways: each observation can be placed in a table. Example: both the type of fertilization and the crop type should influence soil properties. Sometimes we are interested in studying both categories; sometimes the second category is used only to reduce unexplained variance, in which case it is called a blocking variable.

48 Sums of Squares for Two-Way ANOVA
Assume K categories and H blocks, with one observation $x_{ij}$ for each category i and each block j, so we have n = KH observations.
- Mean for category i: $\bar{x}_{i\cdot} = \frac{1}{H} \sum_{j} x_{ij}$
- Mean for block j: $\bar{x}_{\cdot j} = \frac{1}{K} \sum_{i} x_{ij}$
- Overall mean: $\bar{x} = \frac{1}{n} \sum_{i} \sum_{j} x_{ij}$
Illustrate in a table!

51 Two-Way ANOVA (With Interaction)
The setup above assumes that the blocking variable influences outcomes in the same way in all categories (and vice versa). Interaction between the blocking variable and the categories can be checked by extending the model with an interaction term.

52 Sums of Squares for Two-Way ANOVA (With Interaction)
Assume K categories and H blocks, with L observations $x_{ij1}, x_{ij2}, \dots, x_{ijL}$ for each category i and each block j, so we have n = KHL observations.
- Mean for category i: $\bar{x}_{i\cdot\cdot} = \frac{1}{HL} \sum_{j} \sum_{l} x_{ijl}$
- Mean for block j: $\bar{x}_{\cdot j\cdot} = \frac{1}{KL} \sum_{i} \sum_{l} x_{ijl}$
- Mean for cell ij: $\bar{x}_{ij\cdot} = \frac{1}{L} \sum_{l} x_{ijl}$
- Overall mean: $\bar{x} = \frac{1}{n} \sum_{i} \sum_{j} \sum_{l} x_{ijl}$
Illustrate in a table!
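A sketch of a two-way ANOVA with interaction using statsmodels (the layout, factor names, and data below are all hypothetical, chosen only to mirror the K-by-H-by-L setup above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical layout: K=3 fertilization categories, H=2 crop-type blocks,
# L=4 observations per cell, so n = K*H*L = 24.
rng = np.random.default_rng(6)
cats, blocks, L = ['f1', 'f2', 'f3'], ['c1', 'c2'], 4
rows = [{'cat': c, 'block': b, 'y': rng.normal(loc=i + j)}
        for i, c in enumerate(cats) for j, b in enumerate(blocks)
        for _ in range(L)]
df = pd.DataFrame(rows)

# Main effects plus the interaction term C(cat):C(block)
model = smf.ols('y ~ C(cat) + C(block) + C(cat):C(block)', data=df).fit()
print(anova_lm(model))   # F-tests for both factors and their interaction
```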

55 Notes on ANOVA
- All analysis of variance (ANOVA) methods are based on the assumptions of normally distributed and independent errors.
- The same problems can be described using the regression framework. We get exactly the same tests and results!
- There are many extensions beyond those mentioned.

57 MANOVA
Multiple DVs could be analysed using multiple ANOVAs, but:
- The familywise (FW) error rate increases with each additional ANOVA.
- Scores on the DVs are likely correlated: non-independent, and taken from the same subjects.
- Results are hard to interpret if multiple ANOVAs are significant.
MANOVA solves this by conducting only one overall test: it creates a "composite" DV and tests the composite DV for significance.

58 MANOVA
The composite DV is a linear combination of the DVs, i.e., a discriminant function, or root. The weights maximally separate the groups on the composite DV:
C = W1·Y1 + W2·Y2 + W3·Y3 + … + Wn·Yn
where C is a subject's score on the composite DV, the Yi are the scores on each of the DVs, and the Wi are the weights, one for each DV. A composite DV is required for each main effect and interaction.

59 MANOVA
Considering the DVs together can enhance power. Frequency distributions show considerable overlap between groups on the individual DVs, while the ellipses that reflect the DVs in combination show less overlap: small differences on each DV combine to make a larger multivariate difference.

63 MANOVA
The deviation score for the first subject is $d = (25.89,\ 20.78)^\top$. The squared deviation is obtained by multiplying by the transpose:
$$d\,d^\top = \begin{pmatrix} 670 & 538 \\ 538 & 431 \end{pmatrix}$$
SS are on the diagonal: $(25.89)^2 \approx 670$ and $(20.78)^2 \approx 431$. Cross-products are on the off-diagonals: $(25.89)(20.78) \approx 538$. Summing these squared-deviation matrices over subjects yields the SSCP matrix.

64 MANOVA
The "squaring" of a matrix is carried out by multiplying it by its transpose. The transpose is obtained by flipping the matrix about its diagonal. To multiply, the ijth element of the resulting matrix is obtained as the sum of products of the ith row of A and the jth column of A'. For a column vector, the transpose is a row vector, and the product $d\,d^\top$ is a square matrix.
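The slide's arithmetic in numpy, using the deviation vector from the example above:

```python
import numpy as np

d = np.array([25.89, 20.78])   # deviation scores for the first subject
sq = np.outer(d, d)            # d d': "squaring" the vector by its transpose
print(sq)   # diagonal: SS (about 670 and 431); off-diagonal: cross-product (about 538)
```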

66 MANOVA
In ANOVA, variance estimates (MS) are obtained from the SS for significance testing using the F-statistic. In MANOVA, variance estimates (determinants) are obtained from the SSCP matrices for significance testing, e.g. using Wilks' Lambda ($\Lambda$):
ANOVA   MANOVA
SS    ~ SSCP
MS    ~ |SSCP|
F     ~ $\Lambda$
Note that F and $\Lambda$ are inverse to one another: a small $\Lambda$ corresponds to a large effect.

67 MANOVA
The determinant of a 2x2 matrix is given by
$$\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$$
The determinants of the error SSCP matrix and of the (effect + error) SSCP matrix are required to test the interaction. Wilks' Lambda for the interaction is obtained by:
$$\Lambda = \frac{|SSCP_{error}|}{|SSCP_{interaction} + SSCP_{error}|}$$
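A short numpy sketch of this computation; the two 2x2 SSCP matrices below are hypothetical, invented only to exercise the determinant formula:

```python
import numpy as np

# Hypothetical 2x2 SSCP matrices for an effect and for error
sscp_effect = np.array([[ 60.0,  20.0],
                        [ 20.0,  40.0]])
sscp_error  = np.array([[800.0, 300.0],
                        [300.0, 600.0]])

# Wilks' Lambda = |S_error| / |S_effect + S_error|
lam = np.linalg.det(sscp_error) / np.linalg.det(sscp_effect + sscp_error)
print(lam)   # near 1: little effect; near 0: strong effect
```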

68 MANOVA
If the effect is small, then $\Lambda$ approaches 1.0. Here SDT was small, and $\Lambda$ was 0.91. Eta squared for MANOVA is:
$\eta^2 = 1 - \Lambda_{Effect} = 1 - 0.91 = 0.09$
The interaction accounts for only 9% of the variance in the group means on the composite DV.

69 MANOVA: Assumptions
- Must have more cases per cell than the number of DVs: avoids singularity, enhances power.
- Linear relations among all DVs, and between DVs and covariates (COVs).
- Multivariate normality: the sampling distribution of the means of all DVs, and of all linear combinations of DVs, is normal.
- Homogeneity of variance-covariance matrices: rationalizes pooling of the error estimates.
- Can be extended to within-subjects and mixed designs; repeated measures are treated as new DVs.

75 Discriminant Analysis
Discriminant analysis is used to predict group membership from a set of continuous predictors. Analogy to MANOVA: in MANOVA, linearly combined DVs are created to answer the question whether groups can be separated. The same "DVs" can be used to predict group membership!

76 Discriminant Analysis
What is the goal of discriminant analysis?
- Perform dimensionality reduction "while preserving as much of the class discriminatory information as possible".
- Seek directions along which the classes are best separated.
- Take into consideration the scatter within classes as well as the scatter between classes.

77 Discriminant Analysis
MANOVA and Discriminant Analysis (DA) are mathematically identical but differ in emphasis:
- DA is usually concerned with grouping objects (classification) and testing how well objects are classified (one grouping variable, one or more predictor variables). The discriminant functions are identical to the canonical correlations between the groups on one side and the predictors on the other side.
- MANOVA is applied to test whether groups differ significantly from each other (one or more grouping variables, one or more predictor variables).

79 Discriminant Analysis: Assumptions
- A small number of samples may lead to overfitting:
  - If there are more DVs than objects in any cell, the cell becomes singular and cannot be inverted.
  - With only a few more cases than DVs, equality of the covariance matrices is likely to be rejected.
  - With a small objects/DV ratio, power is likely to be very small.
- Multivariate normality: the means of the various DVs in each cell, and all linear combinations of them, are normally distributed.
- Absence of outliers: significance assessment is very sensitive to outlying cases.
- Homogeneity of covariance matrices: DA is relatively robust to violations of this assumption if inference is the focus of the analysis, but not in classification.

80 Discriminant Analysis: Assumptions (continued)
- For classification purposes DA is highly influenced by violations of the last assumption, since subjects will tend to be classified into the groups with the largest variance.
- Homogeneity of the class variances can be assessed by pairwise plots of the scores on the first discriminant functions.
- LDA assumes linear relationships between all predictors within each group. Violations tend to reduce power rather than increase alpha.
- Absence of multicollinearity/singularity in each cell of the design: avoid redundant predictors.

81 Discriminant Analysis: Interpreting a Two-Group Discriminant Function
In the two-group case, discriminant function analysis is analogous to multiple regression; two-group discriminant analysis is also called Fisher linear discriminant analysis. In general, in the two-group case we fit a linear equation of the type
c = a + d1·x1 + d2·x2 + … + dm·xm
where a is a constant, d1 through dm are regression coefficients, and c is the predicted class. The interpretation of the results of a two-group problem is straightforward and closely follows the logic of multiple regression: the variables with the largest (standardized) regression coefficients are the ones that contribute most to the prediction of group membership.
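A two-group example using scikit-learn's LinearDiscriminantAnalysis (the two groups below are simulated; coef_ holds the discriminant weights d1..dm):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
# Two hypothetical groups in 3 predictor variables, shifted means
X = np.vstack([rng.normal(loc=0.0, size=(40, 3)),
               rng.normal(loc=1.0, size=(40, 3))])
y = np.repeat([0, 1], 40)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.coef_)          # discriminant weights d1..dm
print(lda.score(X, y))    # proportion of cases correctly classified
print(lda.predict(X[:5])) # predicted group membership
```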

82 Discriminant Analysis: Discriminant Functions for Multiple Groups
When there are more than two groups, we can estimate more than one discriminant function. For instance, when there are three groups, there is one function for discriminating between group 1 and groups 2 and 3 combined, and another function for discriminating between group 2 and group 3.
Canonical analysis: in a multiple-group discriminant analysis, the first function provides the most overall discrimination between groups, the second provides the second most, and so on. All functions are independent (orthogonal). Computationally, a canonical correlation analysis is performed that determines the successive functions and canonical roots. The number of functions that can be calculated is
min(number of groups - 1, number of variables).

83 Discriminant Analysis: Eigenvalues
Eigenvalues can be interpreted as the proportion of variance accounted for by the correlation between the respective canonical variates. Successive eigenvalues are of smaller and smaller size: first, compute the weights that maximize the correlation of the two sum scores. After this first root has been extracted, find the weights that produce the second largest correlation between sum scores, subject to the constraint that the next set of sum scores does not correlate with the previous one, and so on.
Canonical correlations: if the square root of the eigenvalues is taken, the resulting numbers can be interpreted as correlation coefficients. Because the correlations pertain to the canonical variates, they are called canonical correlations.

84 Discriminant Analysis
Suppose there are C classes. Let $\mu_i$ be the mean vector of class i, i = 1, 2, …, C, let $N_i$ be the number of samples in class i, and let $N = \sum_{i=1}^{C} N_i$ be the total number of samples.
Within-class scatter matrix:
$$S_w = \sum_{i=1}^{C} S_i, \qquad S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^\top$$
Between-class scatter matrix:
$$S_b = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^\top$$
where $\mu = \frac{1}{N} \sum_{x} x$ is the mean of the entire data set.

86 Discriminant Analysis: Linear Transformation Implied by LDA
The LDA solution is given by the eigenvectors of the generalized eigenvector problem
$$S_b\, u = \lambda\, S_w\, u$$
The linear transformation is given by a matrix U whose columns are the eigenvectors of the above problem. Important: since $S_b$ has at most rank C-1, the maximum number of eigenvectors with non-zero eigenvalues is C-1 (i.e., the maximum dimensionality of the sub-space is C-1).
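A sketch tying the last two slides together: build $S_w$ and $S_b$ from simulated three-class data (all values hypothetical) and solve the generalized eigenproblem with scipy, which handles the symmetric case directly:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(8)
# Three hypothetical classes in 4 dimensions, shifted means
classes = [rng.normal(loc=m, size=(30, 4)) for m in (0.0, 1.0, 2.0)]
mu = np.vstack(classes).mean(axis=0)                 # mean of the entire data set

# Within-class and between-class scatter matrices
Sw = sum((c - c.mean(0)).T @ (c - c.mean(0)) for c in classes)
Sb = sum(len(c) * np.outer(c.mean(0) - mu, c.mean(0) - mu) for c in classes)

# Generalized symmetric eigenproblem Sb u = lambda Sw u
evals, evecs = eigh(Sb, Sw)
order = np.argsort(evals)[::-1]          # largest eigenvalues first
U = evecs[:, order[:2]]                  # at most C-1 = 2 useful directions
print(evals[order][:3])                  # only the first C-1 are non-negligible
```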

87 Discriminant Analysis
Does $S_w^{-1}$ always exist? If $S_w$ is non-singular, we can obtain a conventional eigenvalue problem by writing
$$S_w^{-1} S_b\, u = \lambda\, u$$
In practice, $S_w$ is often singular when the analysis involves more variables than cases.