3.03 The regression model

Inferential statistics are concerned with making inferences from relations found in the sample to relations in the population. Inferential statistics help us decide, for example, whether the differences between groups that we see in our data are strong enough to provide support for our hypothesis that group differences exist in general, in the entire population.
We will start by considering the basic principles of significance testing: the sampling and test statistic distribution, p-value, significance level, power and type I and type II errors. Then we will consider a large number of statistical tests and techniques that help us make inferences for different types of data and different types of research designs. For each individual statistical test we will consider how it works, for what data and design it is appropriate and how results should be interpreted. You will also learn how to perform these tests using freely available software.
For those who are already familiar with statistical testing: We will look at z-tests for 1 and 2 proportions, McNemar's test for dependent proportions, t-tests for 1 mean (paired differences) and 2 means, the Chi-square test for independence, Fisher’s exact test, simple regression (linear and exponential) and multiple regression (linear and logistic), one-way and factorial analysis of variance, and non-parametric tests (Wilcoxon, Kruskal-Wallis, sign test, signed-rank test, runs test).

From this lesson

Simple regression

In this module we’ll see how to describe the association between two quantitative variables using simple (linear) regression analysis. Regression analysis allows us to model the relation between two quantitative variables and, based on our sample, decide whether a 'real' relation exists in the population. Regression analysis is more useful than just calculating a correlation coefficient, since it allows us to assess how well our regression line fits the data, and it helps us to identify outliers and to predict scores on the dependent variable for new cases.

Taught by

Annemarie Zand Scholten

Assistant Professor

Emiel van Loon

Assistant Professor

Transcript

In this video, we'll look at the population regression equation and see how it models the relation between the predictor and the response variable in the population. We'll see that it describes the linear relation between the population means of the conditional distributions of the response variable. Don't worry, you'll understand what that means by the end of this video.
Up until now I've been a bit too simplistic in my explanation of regression. Earlier, we looked at an example where we predicted the popularity of cat videos, measured simply as the number of video views, using the cat's age as the predictor. In that very small sample, we only considered the relation between these variables for that particular sample. But of course, my hope is that the regression equation describes the relation in general, so not just for this sample but for the entire population. If it does, then I understand the world a little bit better. It also means I can generate useful predictions for new cases: I'll be able to predict the popularity of new videos of my kitten and my older cat. So the goal is to model the relation at the population level. Later on we'll see how we can draw inferences about this model of the population.
Okay, that all sounds pretty abstract. So why exactly do we model in the population? Of course, it's impossible, but suppose for a second that we could gather information about cat age and popularity for all cat videos that have ever been available online. You can imagine that if we focus just on one-year-old cats, we would find a huge number of videos with varying popularity scores; they won't all be the same. The distribution might look something like this, with a mean of [INAUDIBLE]. We could do the same for one-and-a-half-year-old cats, 2.74-year-old cats, 5.38-year-old cats, and so on. For each possible cat age, we will find a distribution of varying popularity scores.
For example, for the one-and-a-half-year-old cats with a mean of 43.75, for the 2.74-year-old cats with a mean of [INAUDIBLE], and so on. In simple linear regression we assume that, conditional on the value of the predictor, in other words for any given cat age, the shape of the distribution of popularity scores looks exactly the same. Assuming that the relation is perfectly linear, the population regression line goes through the means of these distributions. More formally, the line describes the population means of the conditional response distributions, which are assumed to have the same shape and the same standard deviation.
We can express this using the following equation: mu sub y, the conditional population mean of the response variable y, equals alpha plus beta times x, the predictor, with the same standard deviation sigma at every x. This looks very similar to the expression we used earlier for the sample regression line, but with three differences. One is that we use Greek symbols for the intercept and regression coefficient to indicate that we're talking about the population regression equation. Another difference is that besides the parameters alpha and beta, we also specify the parameter sigma. Although this parameter is not in the equation, it is an essential part of the model. It will become important later on when we use the model for inferential purposes. The final difference is that we describe the population means, not predicted values for individual cases.
Modeling the means of the conditional distributions per cat age is important because it allows for natural variation around the regression line in the population. If we modeled the predicted response value for individual cases, for example by saying y sub i = alpha + beta times x sub i, it would mean that we expect all cat videos of all one-year-old cats in the population to have exactly the same popularity score. And of course that would be very unlikely.
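Written out in symbols, the population equation described above might look as follows. This is a sketch in common textbook notation. The sample equation on the first line, with Latin letters a and b and a hat on y, is an assumption about the earlier notation, since the transcript does not spell it out.

```latex
% Sample regression line (assumed notation): predicted value for a case
\hat{y} = a + b x

% Population regression equation: mean of the conditional distribution of y at a given x
\mu_y = \alpha + \beta x

% Third model parameter: the conditional standard deviation, assumed equal at every x
\sigma
```

Reading the second line for a single cat age, say x = 1, gives the mean popularity of all videos of one-year-old cats, not the score of any one video.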
There is a way to express the model at the individual level, by introducing an error term. The model looks like this: y sub i = alpha + beta times x sub i + epsilon sub i. Epsilon indicates the variation around the conditional mean. It describes the conditional distributions we just saw. Conditional on the value of x, the errors are assumed to be normally distributed, with a mean of zero and the same standard deviation sigma at every value of x. Since the inclusion of epsilon can be confusing, many textbooks don't present the model in this form. But I mention it so that if you see an expression like this, you don't get confused and think it's an entirely different model.
Okay, back to our model of the conditional means. Ideally, the population regression line fits perfectly and goes exactly through all these means. Of course, it's unlikely that the means will line up perfectly in the population; they probably won't. But the straight line is assumed to be a close enough approximation to result in a useful model for description or prediction.
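To make the individual-level model with the error term more concrete, here is a minimal simulation sketch. The values of ALPHA, BETA and SIGMA are hypothetical and chosen only for illustration; they are not taken from the video. The helper simulate_scores is likewise an invented name. The code draws many scores at a few cat ages and compares each simulated mean and standard deviation with what the model says they should be.

```python
import random
import statistics

# Hypothetical population parameters (illustration only, not from the lecture)
ALPHA = 50.0   # intercept: mean popularity at cat age 0
BETA = -4.0    # slope: change in mean popularity per year of cat age
SIGMA = 10.0   # standard deviation of the errors, the same at every age


def simulate_scores(age, n, rng):
    """Draw n scores for one cat age using y_i = alpha + beta * x_i + epsilon_i,
    where epsilon_i is normal with mean 0 and standard deviation SIGMA."""
    return [ALPHA + BETA * age + rng.gauss(0.0, SIGMA) for _ in range(n)]


rng = random.Random(42)  # fixed seed so repeated runs give the same draws
ages = [1.0, 1.5, 2.74, 5.38]

summary = {}
for age in ages:
    scores = simulate_scores(age, 20000, rng)
    summary[age] = (statistics.mean(scores), statistics.stdev(scores))
    model_mean = ALPHA + BETA * age  # mu_y = alpha + beta * x
    print(f"age {age}: model mean {model_mean:.2f}, "
          f"simulated mean {summary[age][0]:.2f}, "
          f"simulated sd {summary[age][1]:.2f}")
```

In a run like this, the individual scores at each age scatter widely, while the simulated means should sit close to the line alpha + beta times x and the simulated standard deviations should all be close to sigma. That is the sense in which the equation for the means and the equation with epsilon describe the same model.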