My interest lies in finding the "right" correlation between a continuous IV ($x$) and a continuous DV ($y$).

At first I ran a simple linear regression:
$$
y=a+b_1 x
$$
However, lots of other factors influence $y$ besides $x$. One of them is a categorical variable $c$, with $n+1$ categories. Using dummy coding, I ran the regression:
$$
y = a + b_1x + b_2c_1 + b_3c_2 + \ldots + b_{n+1}c_n
$$
$c$ and $x$ are not orthogonal, so I then added interaction effects:
$$
y = a + b_1x + b_2c_1 + b_3c_2 + \ldots + b_{n+1}c_n + b_{n+2}xc_1 + \ldots + b_{2n+1}xc_n
$$
The interaction effect is significant overall (although the coefficients for some of the individual $x \times$ category terms are not). The main effect of $x$ is significant in all three models, and so are most of the category coefficients.

What I noticed, however, is that the coefficient $b_1$ is somewhat different in each of the three models. What is the correct way to interpret the correlation of $x$ and $y$?

Hint: the coefficient $b_1$ depends (strongly) on the form of dummy coding. Another hint: what precisely could be meant by the correlation of $x$ and $y$? It appears that you have introduced many correlations, each depending on values of the covariate $c$.
– whuber♦ Jan 13 '13 at 19:02

2 Answers

Don't think of it in terms of the coefficients, but rather in terms of the "effect" of the IV. In the unadjusted regression, these are the same thing. In the model with interactions, you are now looking at the "effect" of the IV within each category of $c$. For a given category, you just add $b_1$ and the coefficient on that category's interaction term; $b_1$ alone is the "effect" of the IV in the group with the omitted dummy term.
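
Written out with the question's numbering (a sketch, where $k$ is an index I am introducing for the category whose dummy is $c_k$), the slope of $y$ on $x$ in the third model is:
$$
\frac{\partial y}{\partial x} = b_1 + b_{n+1+k} \quad \text{when } c_k = 1, \qquad \frac{\partial y}{\partial x} = b_1 \quad \text{in the omitted category.}
$$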

If you work through this, it should become pretty clear why $b_1$ changes between the models.
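
To make this concrete, here is a minimal sketch in plain Python on made-up data (the three categories, their values, and all function names are my own assumptions, not anything from the question). It leans on two standard least-squares facts: the common slope in the dummies-only model is the pooled within-category slope, and the full interaction model reproduces a separate line for each category.

```python
# Hypothetical toy data: three categories of c, each lying exactly on its own line.
# Each entry maps a category label to (x values, y values).
groups = {
    "A": ([1, 2, 3, 4], [2.0, 3.0, 4.0, 5.0]),      # y = 1 + 1.0 * x
    "B": ([3, 4, 5, 6], [6.0, 8.0, 10.0, 12.0]),    # y = 0 + 2.0 * x
    "C": ([5, 6, 7, 8], [12.5, 13.0, 13.5, 14.0]),  # y = 10 + 0.5 * x
}


def cross_dev(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    return cross_dev(x, y) / cross_dev(x, x)


# Model 1: y = a + b1*x, ignoring c entirely.
all_x = [v for x, _ in groups.values() for v in x]
all_y = [v for _, y in groups.values() for v in y]
b1_model1 = ols_slope(all_x, all_y)

# Model 2: dummies for c plus one common slope.
# The common slope equals the pooled within-category slope.
b1_model2 = (sum(cross_dev(x, y) for x, y in groups.values())
             / sum(cross_dev(x, x) for x, _ in groups.values()))


def model3(reference):
    """Model 3 (dummies and interactions) with `reference` as the omitted dummy.

    Returns b1 and a dict of interaction coefficients, one per other category.
    """
    slopes = {k: ols_slope(x, y) for k, (x, y) in groups.items()}
    b1 = slopes[reference]
    interactions = {k: s - b1 for k, s in slopes.items() if k != reference}
    return b1, interactions


print("model 1 b1:", round(b1_model1, 3))
print("model 2 b1:", round(b1_model2, 3))
for ref in groups:
    b1, inter = model3(ref)
    print(f"model 3 b1 with {ref} omitted:", round(b1, 3), inter)
```

In this toy example the third model's $b_1$ is simply the slope within whichever category is omitted, so it changes when the reference category changes, while $b_1$ plus an interaction coefficient always recovers that category's own slope.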

Thanks DL. On a mathematical level, I understand and agree. What I'm interested in is a (justified) general statement about the correlation I can make to someone who knows very little statistics. For example, is it correct to say "among all categories c, the average correlation of x with y is b_1", and if so, which b_1 should I use here? The one in model 1, 2, or 3?
– Makore Jan 13 '13 at 19:43

The correlation is a bivariate statistic that would involve only x and y and would not have an intercept. None of what you've given is a correlation.

But, given that there is an interaction with other variables, the overall correlation between $x$ and $y$, while mathematically correct, will be misleading.

In one of your replies you write:

For example, is it correct to say "among all categories c, the
average correlation of x with y is b_1", and if so, which b_1 should I
use here? The one in model 1, 2, or 3?

If you want to make a statement of this type, you can get a weighted average of the correlations (not the regression parameters) in the different categories of c. But I don't think this is a good approach.
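
For completeness, a minimal sketch of such a weighted average, assuming weights proportional to the number of observations in each category (that weighting, the toy data, and the names are my own choices for illustration, and other weightings are possible). The data are deliberately built so that the within-category correlations and the overall correlation disagree in sign.

```python
from math import sqrt

# Hypothetical toy data: x and y rise together inside each category of c,
# but the categories themselves are stacked so that y falls as x rises overall.
groups = {
    "A": ([1, 2, 3, 4], [8, 9, 9, 11]),
    "B": ([4, 5, 6, 7], [5, 6, 7, 7]),
    "C": ([7, 8, 9, 10], [1, 3, 3, 4]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Correlation of x and y within each category of c.
within = {k: pearson_r(x, y) for k, (x, y) in groups.items()}

# Average of the within-category correlations, weighted by category size.
sizes = {k: len(x) for k, (x, _) in groups.items()}
weighted_avg_r = sum(sizes[k] * within[k] for k in groups) / sum(sizes.values())

# Correlation of x and y ignoring c altogether.
all_x = [v for x, _ in groups.values() for v in x]
all_y = [v for _, y in groups.values() for v in y]
overall_r = pearson_r(all_x, all_y)

print("within-category r:", within)
print("weighted average r:", weighted_avg_r)
print("overall r:", overall_r)
```

Here every within-category correlation is positive while the overall correlation is negative, which is one way a single summary number can mislead.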

Given this goal:

What I'm interested in is a (justified) general statement about the
correlation I can make to someone who knows very little statistics.

What I would do is present a graph of the variables x and y, with the dots on the scatterplot colored differently for each level of c, and with regression lines (or possibly loess lines) superimposed on top.
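
A rough sketch of that kind of plot, assuming `numpy` and `matplotlib` are available; the data, category labels, and output file name are invented for illustration. It draws a straight least-squares line per category rather than a loess line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical toy data: category label -> (x values, y values).
groups = {
    "A": ([1, 2, 3, 4], [8, 9, 9, 11]),
    "B": ([4, 5, 6, 7], [5, 6, 7, 7]),
    "C": ([7, 8, 9, 10], [1, 3, 3, 4]),
}

fig, ax = plt.subplots()
for label, (x, y) in groups.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Dots for this category; matplotlib picks a new color on each call.
    points = ax.scatter(x, y, label=f"c = {label}")

    # Least-squares line fitted within this category only.
    slope, intercept = np.polyfit(x, y, 1)
    xs = np.array([x.min(), x.max()])
    ax.plot(xs, intercept + slope * xs, color=points.get_facecolor()[0])

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("x_vs_y_by_c.png")
```

Each category gets its own color and its own fitted line, so the reader can see directly whether the slopes differ.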

This wouldn't be a single simple statement, but sometimes there is no good single simple statement.