8.31.2012

Watercolor regression

I'm in a rush, so I will explain this better later. But Andrew Gelman posted my idea for a type of visually-weighted regression that I jokingly called "watercolor regression" without a picture, so its a little tough to see what I was talking about. Here is the same email but with the pictures to go along with it. The code to do you own watercolor regression is here as the 'SMOOTH' option to vwregress. (Update: I've added a separate cleaner function watercolor_reg.m for folks who want to tinker with the code but don't want to wade through all the other options built into vwregress. Update 2: I've added watercolor regression as a second example in the revised paper here.)

This was the email with figures included:

Two small follow-ups based on the discussion (the second/bigger one is to address your comment about the 95% CI edges).

1. I realized that if we plot the confidence intervals as a solid color that fades (eg. using the "fixed ink" scheme from before) we can make sure the regression line also has heightened visual weight where confidence is high by plotting the line white. This makes the contrast (and thus visual weight) between the regression line and the CI highest when the CI is narrow and dark. As the CI fade near the edges, so does the contrast with the regression line. This is a small adjustment, but I like it because it is so simple and it makes the graph much nicer.

My posted code has been updated to do this automatically.

2. You and your readers didn't like that the edges of the filled CI were so sharp and arbitrary. But I didn't like that the contrast between the spaghetti lines and the background had so much visual weight. So to meet in the middle, I smoothed the spaghetti plot to get a nonparametric estimate of the probability that the conditional mean is at a given value:

To do this, after generating the spaghetti through bootstrapping, I estimate a kernel density of the spaghetti in the Y dimension for each value of X. I set the visual-weighting scheme so it still "preserves ink" along a vertical line-integral, so the distribution dims where it widens since the ink is being "stretched out". To me, it kind of looks like a watercolor painting -- maybe we should call it a "watercolor regression" or something like that.

The watercolor regression turned out to be more of a coding challenge than I expected, because the bandwidth for the kernel smoothing has to adjust to the width of the CI. And since several people seem to like R better than Matlab, I attached 2 figs to show them how I did this. Once you have the bootstrapped spaghetti plot:

I defined a new coordinate system that spanned the range of bootstrapped estimates for each value in X

The kernel smoothing is then executed along the vertical columns of this new coordinate system.

I've updated the code posted online to include this new option. This Matlab code will generate a similar plot using my vwregress function:

x = randn(100,1);

e = randn(100,1);

y = 2*x+x.^2+4*e;

bins = 200;

color = [.5 0 0];

resamples = 500;

bw = 0.8;

vwregress(x, y, bins, bw, resamples, color, 'SMOOTH');

NOTE TO R USERS: The day after my email to andrew, Felix Schönbrodt posted a nice similar variant with code in R here.

Update: For overlaid regressions, I still prefer the simpler visually-weighted line (last two figs here) since this is what overlaid watercolor regressions look like:

It might look better if the scheme made the blue overlay fade from blue-to-clear rather than blue-to-white, but then it would be mixing (in the color sense) with the red so the overlaid region would then start looking like very dark purple. If someone wants to code that up, I'd like to see it. But I'm predicting it won't look so nice.

Thanks.a) I understand the issues with line-width, why all this is tricky.

b) I can think of two different approaches. One would be a filled contour plot, with 3 levels for 1, 2 and 3 SD.

The other would retain the smooth shading without ink conservation. Of course, that would make the darker areas look bigger where the CI was bigger, which might be counterproductive. That's why I thought your conservation approach was likely on the right track. I merely suggested trying these because I long ago learned that my intuition was insufficient to predict how these would look.