Useful facts about independence

In the case of many dimensions, we follow the same idea. Before doing that, we state without proof two useful facts about independence of random variables (real-valued, not vectors).

Theorem 1. Suppose variables $X_1,\dots,X_n$ have densities $p_1,\dots,p_n$. Then they are independent if and only if their joint density is a product of individual densities:

$p_{X_1,\dots,X_n}(t_1,\dots,t_n)=p_1(t_1)\cdots p_n(t_n).$

Theorem 2. If variables $X_1,\dots,X_n$ are (jointly) normal, then they are independent if and only if they are uncorrelated:

$Cov(X_i,X_j)=0$ for all $i\ne j$.

The necessity part (independence implies uncorrelatedness) is trivial.

Normal vectors

Let $z_1,\dots,z_n$ be independent standard normal variables. A standard normal variable is defined by its density $p(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}$, so all of $z_1,\dots,z_n$ have the same density. We achieve independence, according to Theorem 1, by defining their joint density to be a product of individual densities.

Definition 2. For a matrix $A$ and vector $b$ of compatible dimensions, a normal vector $X$ is defined by

$X=Az+b,$

where $z=(z_1,\dots,z_n)^T$ is the vector of independent standard normals defined above.

Properties. $EX=b$ and $Var(X)=AA^T$

(recall that the variance matrix of a vector is always symmetric and nonnegative definite).
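As a quick check of these properties, here is a short derivation (a sketch; it uses only linearity of expectations, $Ez=0$ and $Var(z)=E[zz^T]=I$ for the standard normal vector):

$EX=E(Az+b)=A\,Ez+b=b,$

$Var(X)=E\big[(X-EX)(X-EX)^T\big]=E\big[(Az)(Az)^T\big]=A\,E[zz^T]\,A^T=A\,Var(z)\,A^T=AA^T.$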

Distributions derived from normal variables

In the definitions of the standard distributions (chi square, t distribution and F distribution) there is no reference to any sample data. Unlike statistics, which by definition are functions of sample data, these and other standard distributions are theoretical constructs. Statistics are developed in such a way as to have a distribution equal, or asymptotically equal, to one of the standard distributions. This allows practitioners to use tables developed for the standard distributions.

Exercise 1. Prove that $\chi^2_n/n$ converges to 1 in probability, where $\chi^2_n=z_1^2+\dots+z_n^2$ is a chi-square variable with $n$ degrees of freedom.

Proof. For a standard normal $z$ we have $Ez^2=1$ and $Ez^4=3$ (both properties can be verified in Mathematica). Hence, $E(\chi^2_n/n)=\frac{1}{n}\sum_{i=1}^n Ez_i^2=1$ and

$Var(\chi^2_n/n)=\frac{1}{n^2}\sum_{i=1}^n Var(z_i^2)=\frac{1}{n^2}\sum_{i=1}^n\big(Ez_i^4-(Ez_i^2)^2\big)=\frac{2}{n}\to 0.$

By the Chebyshev inequality, this implies convergence of $\chi^2_n/n$ in probability to its mean 1.
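For readers who prefer a numerical illustration, here is a small simulation sketch in Stata (the variable names r10 and r1000 are mine, not from the original post): it shows that $\chi^2_n/n$ is much more tightly concentrated around 1 for $n=1000$ than for $n=10$.

```stata
* Simulation sketch (hypothetical example): chi2(n)/n concentrates near 1 as n grows
clear
set seed 123
set obs 10000
generate r10   = rchi2(10)/10      // chi-square with 10 df, divided by 10
generate r1000 = rchi2(1000)/1000  // chi-square with 1000 df, divided by 1000
summarize r10 r1000                // the standard deviation shrinks roughly like sqrt(2/n)
```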

Application: estimating sigma squared

Consider the multiple regression

$y=X\beta+u$    (1)

where

(a) the regressors are assumed deterministic, (b) the number of regressors $k$ is smaller than the number of observations $n$, (c) the regressors are linearly independent, and (d) the errors are homoscedastic and uncorrelated:

$Var(u_i)=\sigma^2$ for all $i$ and $Cov(u_i,u_j)=0$ for $i\ne j$; in matrix form, $Var(u)=\sigma^2 I$.    (2)

Usually students remember that $\beta$ should be estimated and don't pay attention to estimation of $\sigma^2$. Partly this is because $\sigma^2$ does not appear explicitly in equation (1) and partly because the result on estimation of the error variance is more complex than the result on the OLS estimator of $\beta$.

Definition 1. Let $\hat\beta$ be the OLS estimator of $\beta$. Then $\hat{y}=X\hat\beta$ is called the fitted value and $e=y-\hat{y}$ is called the residual.
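To indicate where this is heading (the derivation is not reproduced here), the standard unbiased estimator of $\sigma^2$ is built from the residuals; with $k$ regressors (columns of $X$) and $n$ observations it is usually written as

$s^2=\frac{e^Te}{n-k}=\frac{1}{n-k}\sum_{i=1}^n e_i^2,\qquad Es^2=\sigma^2.$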

Generalized Pythagoras theorem

Exercise 1. For $X$ with linearly independent columns put $P=X(X^TX)^{-1}X^T$ and $Q=I-P$. Show that $P$ and $Q$ are symmetric and idempotent, that $PX=X$, and that $Pa$ and $Qb$ are orthogonal for any vectors $a,b$, so that $\|Pa+Qb\|^2=\|Pa\|^2+\|Qb\|^2$ (the generalized Pythagoras theorem).

Ordinary Least Squares (OLS) estimator derivation

Problem statement. A vector $y\in\mathbb{R}^n$ (the dependent vector) and vectors $x^{(1)},\dots,x^{(k)}\in\mathbb{R}^n$ (independent vectors or regressors) are given. The OLS estimator is defined as the vector $\beta$ which minimizes the total sum of squares

$S(\beta)=\sum_{i=1}^n\big(y_i-\beta_1 x_i^{(1)}-\dots-\beta_k x_i^{(k)}\big)^2.$

Denoting $X=\big(x^{(1)},\dots,x^{(k)}\big)$ (the $n\times k$ matrix with the regressors as columns), we see that $S(\beta)=\|y-X\beta\|^2$ and that finding the OLS estimator means approximating $y$ with vectors from the image $\operatorname{Img}(X)=\{X\beta:\beta\in\mathbb{R}^k\}$. The vectors $x^{(1)},\dots,x^{(k)}$ should be linearly independent; otherwise the solution will not be unique.

Assumption. $x^{(1)},\dots,x^{(k)}$ are linearly independent. This, in particular, implies that $\det(X^TX)\ne 0$, so that $(X^TX)^{-1}$ exists.

Exercise 2. Show that the OLS estimator is

$\hat\beta=\big(X^TX\big)^{-1}X^Ty.$    (2)

Proof. By Exercise 1 we can use the projector $P=X\big(X^TX\big)^{-1}X^T$ and its complement $Q=I-P$. Since $X\beta$ belongs to the image of $X$, $P$ doesn't change it: $PX\beta=X\beta$. Denoting also $Qy=y-Py$ we have

$\|y-X\beta\|^2=\|Qy+(Py-X\beta)\|^2=\|Qy\|^2+\|Py-X\beta\|^2$ (by Exercise 1).

This shows that $\|Qy\|^2$ is a lower bound for $\|y-X\beta\|^2$. This lower bound is achieved when the second term is made zero. From

$Py-X\beta=X\big[(X^TX)^{-1}X^Ty-\beta\big]$

we see that the second term is zero if $\beta$ satisfies (2).

Usually the above derivation is applied to a dependent vector of the form $y=X\beta+u$, where $u$ is a random vector with mean zero. But it holds without this assumption. See also the simplified derivation of the OLS estimator.
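As an optional numerical check (a sketch using Stata's built-in auto data set, which is my choice of example, not the original post's), one can compute $(X^TX)^{-1}X^Ty$ in Mata and compare it with the output of regress:

```stata
* Hypothetical check: the OLS formula reproduces the coefficients from -regress-
sysuse auto, clear
regress price mpg weight
mata:
    y = st_data(., "price")
    X = (st_data(., ("mpg", "weight")), J(rows(y), 1, 1))  // regressors plus a constant column
    b = invsym(X'X)*X'y                                    // (X'X)^{-1} X'y
    b'                                                     // should match the regress output
end
```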

Question 4. Note that the matrix (1) is symmetric (elements above the main diagonal equal their mirror siblings below that diagonal). This means that some terms in the second sum on the right of (5) are repeated twice. If you group equal terms in (5), what do you get?

Answer 4. The idea is to write

$\sum_{i\ne j}Cov(X_i,X_j)=2\sum_{i<j}Cov(X_i,X_j),$

that is, to join equal elements above and below the main diagonal in (1). For this, you need to figure out how to write a sum of the elements that are above the main diagonal. Make a bigger version of (1) (with more off-diagonal elements) to see that the elements above the main diagonal are listed in the sum $\sum_{i<j}Cov(X_i,X_j)$. This sum can also be written as $\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}Cov(X_i,X_j)$. Hence, (5) is the same as

$Var\Big(\sum_{i=1}^n X_i\Big)=\sum_{i=1}^n Var(X_i)+2\sum_{i<j}Cov(X_i,X_j).$    (7)

Unlike (6), this equation is applicable when there is autocorrelation.
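For concreteness (a small worked case of my own, not from the original text), with $n=3$ the grouping looks like this:

$Var(X_1+X_2+X_3)=Var(X_1)+Var(X_2)+Var(X_3)+2\big[Cov(X_1,X_2)+Cov(X_1,X_3)+Cov(X_2,X_3)\big].$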

Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (for example, the suspect is guilty) and the alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than the other probabilities involved. This is why what is desirable to prove is usually designated as the alternative.

Usually in books you can see the following table.

                          Decision taken
State of nature           Fail to reject null        Reject null
Null is true              Correct decision           Type I error
Null is false             Type II error              Correct decision

This table is not good enough because there is no link to probabilities. The next video does fill in the blanks.

Violations of classical assumptions

This will be a simple post explaining the common observation that "in Economics, variability of many variables is proportional to those variables". Make sure to review the assumptions; they tend to slip from memory. We consider the simple regression

$y_i=a+bx_i+e_i,\qquad i=1,\dots,n.$    (1)

One of the classical assumptions is

Homoscedasticity. All errors have the same variance: $Var(e_i)=\sigma^2$ for all $i$.

We discuss its opposite, which is

Heteroscedasticity. Not all errors have the same variance. It would be wrong to write this as "$Var(e_i)\ne\sigma^2$ for all $i$" (which would mean that all errors have variance different from $\sigma^2$). You can write that not all $Var(e_i)$ are the same, but it's better to use the verbal definition.

Remark about Video 1. The dashed lines can represent mean consumption. Then the fact that variation of a variable grows with its level becomes more obvious.

Video 1. Case for heteroscedasticity

Figure 1. Illustration from Dougherty: as x increases, variance of the error term increases

Homoscedasticity was used in the derivation of the OLS estimator variance; under heteroscedasticity that expression is no longer valid. There are other implications, which will be discussed later.

Companies example. The Samsung Galaxy Note 7 battery fires and explosions that caused two recalls cost the smartphone maker at least $5 billion. There is no way a small company could have such losses.

GDP example. The error in measuring US GDP is on the order of $200 bln, which is comparable to the GDP of Kazakhstan. However, the standard deviation of the ratio error/GDP seems to be about the same across countries, as long as the underground economy is not too big. Often the assumption that the standard deviation of the regression error is proportional to one of the regressors is plausible.

To see if the regression error is heteroscedastic, you can look at the graph of the residuals or use statistical tests.
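As an illustration of both options (a hypothetical sketch; y and x stand for whatever dependent variable and regressor you are using), Stata provides a residual plot and the Breusch-Pagan test right after regress:

```stata
* Hypothetical sketch: eyeballing and testing for heteroscedasticity after OLS
regress y x
rvfplot          // residual-versus-fitted plot; a widening spread suggests heteroscedasticity
estat hettest    // Breusch-Pagan / Cook-Weisberg test for heteroscedasticity
```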

Violations of classical assumptions

This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".

Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?

When the model is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified" I say "true model", as opposed to a "wrong model".

A1. What if the existence condition is violated? If the variance of the regressor is zero, the OLS estimator does not exist. The fitted line is supposed to be vertical, and you can regress $x$ on $y$ instead. Violation of the existence condition in the case of multiple regression leads to multicollinearity, and that's where economic considerations are important.

A2. The convenience condition is called so because, when it is violated (that is, when the regressor is stochastic), there are still ways to deal with the problem: finite-sample theory and large-sample theory.

A3. What if the errors in (1) have means different from zero? This question can be divided in two: 1) the means of the errors are all the same, $Ee_i=c$ for all $i$, and 2) the means are different. Read the post about centering and see if you can come up with the answer for the first question. The means may be different because of omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.

Nonlinear least squares

Here we explain the idea, illustrate the possible problems in Mathematica and, finally, show the implementation in Stata.

Idea: minimize RSS, as in ordinary least squares

Observations come in pairs $(x_1,y_1),\dots,(x_n,y_n)$. In the case of ordinary least squares, we approximated the $y$'s with functions linear in the parameters, possibly nonlinear in the $x$'s. Now we use a function $f(x,\theta)$ which may be nonlinear in the parameter vector $\theta$. We still minimize RSS, which takes the form

$RSS(\theta)=\sum_{i=1}^n\big(y_i-f(x_i,\theta)\big)^2.$

Nonlinear least squares estimators are the values of $\theta$ that minimize RSS. In general, it is difficult to find a formula (closed-form solution), so in practice software, such as Stata, is used for RSS minimization.
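To make this concrete (an example of my choosing, in the spirit of the exponential model discussed below), take $f(x,\theta)=\theta_1 e^{\theta_2 x}$. Then

$RSS(\theta_1,\theta_2)=\sum_{i=1}^n\big(y_i-\theta_1 e^{\theta_2 x_i}\big)^2,$

and the first-order conditions $\partial RSS/\partial\theta_1=0$, $\partial RSS/\partial\theta_2=0$ are nonlinear in $\theta_2$, so, unlike in OLS, they cannot be solved explicitly; hence the need for iterative minimization.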

Simplified idea and problems in one-dimensional case

Suppose we want to minimize $RSS(\theta)$, where $\theta$ is a single scalar parameter. The Newton algorithm (the default in Stata) is an iterative procedure that consists of the following steps:

Select the initial value $\theta_0$.

Find the derivative (or tangent) of RSS at $\theta_0$. Make a small step in the descent direction (indicated by the derivative) to obtain the next value $\theta_1$.

Repeat Step 2, using $\theta_1$ as the starting point, until the difference between the values of the objective function at two successive points becomes small. The last point will approximate the minimizing point.
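Written out for the one-dimensional case (a schematic form of the classical Newton step; Stata's implementation has refinements not shown here), Step 2 replaces the current point by

$\theta_{m+1}=\theta_m-\frac{RSS'(\theta_m)}{RSS''(\theta_m)},$

that is, the step is taken against the sign of the derivative, with its size scaled by the curvature.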

Problems:

The minimizing point may not exist.

When it exists, it may not be unique. In general, there is no way to find out how many local minimums there are and which ones are global.

The point delivered by the algorithm depends on the initial point.

See Video 1 for illustration in the one-dimensional case.

Video 1. NLS geometry

Problems illustrated in Mathematica

Here we look at three examples of nonlinear functions, two of which are considered in Dougherty. The first one is a power function (it can be linearized by applying logs) and the second is an exponential function (it cannot be linearized). The third function gives rise to two minimums. The possibilities are illustrated in Mathematica.

Video 2. NLS illustrated in Mathematica

Finally, implementation in Stata

Here we show how to 1) generate a random vector, 2) create a vector of initial values, and 3) program a nonlinear dependence.
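A minimal sketch of what this can look like (my own example, not a transcript of the video; the variable names x and y and the exponential form are assumptions):

```stata
* Hypothetical sketch: simulate data and fit a nonlinear dependence with -nl-
clear
set seed 123
set obs 100
generate x = 10*runiform()                 // 1) generate a random regressor
generate y = 2*exp(0.3*x) + rnormal(0, 5)  //    and a noisy exponential dependence
* 2)-3) initial values are supplied inside the braces of the -nl- specification
nl (y = {b1=1}*exp({b2=0.1}*x))
```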

Alternatives to simple regression in Stata

In the post on running simple regression we looked at the dependence of EARNINGS on S (years of schooling). At the end of that post I suggested thinking about possible variations of the model. Specifically, could the dependence be nonlinear? We consider two answers to this question.

Quadratic regression

This name is used for the quadratic dependence of the dependent variable on the independent variable. For our variables the dependence is

$EARNINGS=\beta_1+\beta_2 S+\beta_3 S^2+u.$

Note that the dependence on S is quadratic but the right-hand side is linear in the parameters, so we still are in the realm of linear regression. Video 1 shows how to run this regression.

Video 1. Running quadratic regression in Stata
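One way to run it from the command line (a sketch; the video uses the menus, and the variable names follow Dougherty's file):

```stata
* Hypothetical sketch: quadratic regression of EARNINGS on S
generate S2 = S^2          // add the squared regressor
regress EARNINGS S S2
* alternatively, factor-variable notation avoids creating S2 by hand:
* regress EARNINGS c.S##c.S
```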

Nonparametric regression

The general way to write this model is

$EARNINGS=f(S)+u,$

where the functional form of $f$ is not specified.

The beauty and power of nonparametric regression consists in the fact that we don't need to specify the functional form of the dependence of EARNINGS on S. Therefore there are no parameters to interpret; there is only the fitted curve. There is also the estimated equation of the nonlinear dependence, which is too complex to consider here. I have already illustrated the difference between parametric and nonparametric regression. See in Video 2 how to run nonparametric regression in Stata.
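For reference, a command-line sketch (my example; the video may use a different command or the menus):

```stata
* Hypothetical sketch: nonparametric fit of EARNINGS on S
lpoly EARNINGS S          // local polynomial smoother with a plot of the fitted curve
* in Stata 15 or later one can also try:
* npregress kernel EARNINGS S
```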

Running simple regression in Stata is, well, simple. It's just a matter of a couple of clicks. Try to turn it into a small research project.

Obtain descriptive statistics for your data (Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics). Look at all that stuff you studied in introductory statistics: units of measurement, means, minimums, maximums, and correlations. Knowing the units of measurement will be important for interpreting regression results; correlations will predict signs of coefficients, etc. In your report, don't just mechanically repeat all those measures; try to find and discuss something interesting.
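The same descriptive statistics can be obtained from the command line (a sketch; the variable names follow Dougherty's file, and you will likely want more variables than the two shown):

```stata
* Hypothetical sketch: descriptive statistics and correlations before regressing
summarize EARNINGS S        // means, standard deviations, minimums, maximums
correlate EARNINGS S        // pairwise correlations help predict signs of coefficients
```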

After running the regression, report the estimated equation. It is called a fitted line and in our case looks like this: Earnings = -13.93+2.45*S (use descriptive names and not abstract X, Y). To see if the coefficient of S is significant, look at its p-value, which is smaller than 0.001. This tells us that at all levels of significance larger than or equal to 0.001 the null hypothesis that the coefficient of S is zero is rejected, that is, the coefficient is significant. This follows from the definition of the p-value. Nobody cares about significance of the intercept. Report also the p-value of the F statistic. It characterizes the significance of all nontrivial regressors (those other than the intercept) and is important in case of multiple regression. The last statistic to report is R squared.
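From the command line the whole exercise is one command (a sketch; the numbers quoted above come from Dougherty's data set):

```stata
* Hypothetical sketch: simple regression of EARNINGS on S
regress EARNINGS S
* the output contains the coefficient estimates, their p-values,
* the p-value of the F statistic and R-squared discussed above
```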

Think about possible variations of the model. Could the dependence of Earnings on S be nonlinear? What other determinants of Earnings would you suggest from among the variables in Dougherty's file?