Monday, January 30, 2012

Now that I got these outlier and influence measures (see last post), we can look at the Anscombe example, Anscombe Quartet

I took the example from matplotlib here. With a few changes, mainly adding the observation number. I get this graph

The first example is what we would usually expect and shouldn't have any problems for estimating the regression line. The second example is nonlinear, but we are fitting only a linear model. The third example has an outlier in y, which pulls the regression line towards it. The last example has an outlier in x.

Interpreting the last case is a bit tricky. The slope of the regression line is dominated by observation 7, if we think it is a reliable (correctly measured) observation, then it is very useful since it contains a lot of information. If we doubt the reliability of that measurement, then we are left without any information about the slope and the effect of the explanatory variable.
(I don't remember where I read this, someone argued that we shouldn't through out x outliers.)

Now we can run a loop over the four problems and check the influence diagnostic table, and also Ramsey's RESET test for linearity.

Everything looks mostly ok, the RESET test doesn't reject the linear specification. Observation 2 might be an outlier, the studentized residual is at around 2, which is about the threshold for the t-distribution.

Observation 2 and 7 look like outliers. However, the RESET test strongly rejects the hypothesis that the linear specification is correct. So we better look for another model, for example including polynomial terms for the explanatory variable.

What's going on with observation 7? The "-1.#IO"stands for numpy.NaN. All the measures that rely on leaving out observation 7 are indeterminate, since without observation 7 we cannot identify a slope.
The diagonal value of the hat matrix and the dfbeta for the slope indicate that this observation is very influential.

Since I didn't like the NaN results in the previous example, I made one change. I added an additional observation that has an x value slightly larger the the large group, 8.01 insted of 8.00, the y value was chosen so it lies on the predicted line. The new observation is the first observation, the x outlier is now at index 8.

What happened to our influence measures? Both observations 0 and 8 have zero influence, since we only need one of them to identify the slope. Leave one observation out doesn't help if we have two "similar" (in terms of parameter estimation) outliers.

Since this was interesting, let's try another variation. This time I change the value of the new observation to lie below the regression line (6.999 instead of 7.006). One more table and we are done.

Now we see that the outlier measures like the studentized residuals are very small for observation 8, however the influence measures, diagonal value of hat matrix, Cook's distance, dffits and dfbeta, are all large and clearly signal that this is a very influential observation.

Extra comment:

The RESET test in the versions of example four has different degrees of freedom than the first three examples, even though it uses the same code. After looking at this surprising result a bit, it turns out that because of the structure of the explanatory variable (only 2 or 3 distinct x values) the auxilliary regression for the RESET test has reduced rank.

Sunday, January 29, 2012

Suppose we run a simple regression, and want to know whether there are possible outliers, or observations that have a large influence on the estimated parameters. In the past, I added several specification test to statsmodels. These are tests that check whether your model assumptions are "correct" or whether there are violations.
(This is sloppy phrasing, they are statistical tests after all.)

What is missing so far in statsmodels are the usual outlier and influence measures. (Although there are robust linear models.)

Trying to do something similar as an extension to statsmodels, I ended up with the table below. This is supposed to correspond to the bottom tables on the above linked page.
All numbers look the same, but I'm still missing `Cov Ratio` and haven't looked for the prediction intervals yet. I also have dfbetas, but ran out of time to add them nicely to the table. Some of the measures have associated pvalues or recommended thresholds that can be used to interpret the results and find which observations might "mess up" the estimation.

The html for this table was created with scikits.statsmodels.iolib.table.SimpleTable, but I never tried to fine tune the html options for a nice result. The text table that is getting printed in the python shell is

This was the fun part, and I managed to get it done in a few (or more) hours this weekend. The rest sounds more like work.

Some comments on the implementation.

The first measures can be calculated efficiently from only the original regression, the second part of this list of measures requires a loop that estimates the model dropping each observation one at a time (Leave One Observation Out). The latter can get expensive if the number of observations is large.

Initially I wanted to cache all Leave One Observation Out result instances, but this might get expensive in terms of memory, since at least some copies of the data will be made and stored together with the result instances. Instead, I finally decided to loop once but store all the attributes that will be required for the various measures. (Right now, I actually needed only the estimate of the error variance and the parameter estimates).

If the number of observations is large, then calculating the Leave One Observation Out results for dropping every observation will be a waste, because many observations will not be likely candidates for the "full treatment". The, not yet implemented, idea is to calculate the cheap measures first, and use them to select those observations that are likely candidates for the more precise LOOO loop.

Also still missing: plots.
I did some cleanup for the regression plots, but the outlier plots still look plain vanilla.

Quote: "Parc Jean-Drapeau is proud to welcome a North-American premiere – a Snow
Village which has a hotel, restaurant, replicas of familiar Montréal
buildings and lighted snow tunnels to explore all winter long. This
Finnish concept attraction allows visitors and clients to enjoy a unique
Montréal-winter experience."

I'm definitely NOT going to book a room there. But if anyone is interested, here's the link.

Previously I was trying out mainly the contour plots, since I didn't get 3d to look nice enough.

For example the following shows a mixture of two bivariate normal distributions and the estimate by gaussian_kde. The colored areas are the differences between kde and the true distribution, in the blue areas the kde is too large, in the redish areas the kde is too small. It's a contour plot version for showing that gaussian_kde with default settings for the bandwidth lowers the hills and fills the valleys in the case of bimodal distributions.