* One of the things I talked about a lot in class today
was why the df for the unequal-variance t-test can be so different from the df
for the equal-variance t-test. In simple terms, the equal-variance t-test
takes all the data equally into account, but in the unequal-variance t-test
the standard error of the difference can be dominated by the smaller sample
(see above, how the sociologists' standard error of the mean is similar to
the standard error of the difference). You can then think of the
unequal-variance t-test as taking only the smaller sample into account when
estimating the variance of the difference, which is why df is 6 rather than
970. But note also that the two tests have the same substantive interpretation
(no significant difference), meaning the wild difference between the df of the
two tests does not determine the answer. See my Excel file and also Stata's
documentation on t-tests (either printed doc or online PDFs) for
Satterthwaite's formula.
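
* A minimal sketch of Satterthwaite's approximation, to show why the df
collapses toward the size of the smaller group. The group sizes (6 sociologists,
966 nurses) come from the tabulate output in these notes; the two standard
deviations are hypothetical values chosen only for illustration, so the result
will not reproduce the Stata output exactly.

```python
import math

def satterthwaite_df(s1, n1, s2, n2):
    """Approximate df for the unequal-variance t-test (Satterthwaite)."""
    v1 = s1 ** 2 / n1  # squared standard error of the mean, group 1
    v2 = s2 ** 2 / n2  # squared standard error of the mean, group 2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Group sizes taken from the notes.
n_soc, n_nurse = 6, 966

# HYPOTHETICAL standard deviations (not from the data).
sd_soc, sd_nurse = 30000.0, 25000.0

df_equal = n_soc + n_nurse - 2
df_unequal = satterthwaite_df(sd_soc, n_soc, sd_nurse, n_nurse)

# Share of the variance of the difference contributed by the small group.
var_soc = sd_soc ** 2 / n_soc
var_diff = var_soc + sd_nurse ** 2 / n_nurse
share_small = var_soc / var_diff

print("equal-variance df:  ", df_equal)
print("unequal-variance df:", round(df_unequal, 2))
print("SE of sociologists' mean:", round(math.sqrt(var_soc), 1))
print("SE of the difference:    ", round(math.sqrt(var_diff), 1))
print("share of variance from small group:", round(share_small, 4))
```

With a tiny group next to a huge one, nearly all of the variance of the
difference comes from the tiny group, so the approximate df sits close to that
group's own n - 1.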

* See my Excel file for a graphical example of why a straight
line does not fit the relationship between age and income. The relationship is
a parabola, an upside-down "U", and so we need a second-order age
term to fit it…

* Note that in the first model age is barely significant
at all, but here both age and age squared are highly significant, and the
R-squared of the model has gone up quite a bit (but still has room for
improvement).
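
* A small sketch of why the squared term matters, using made-up data rather
than the class data set: income here is an exact upside-down "U" in age,
peaking at 45 (a hypothetical shape and scale). The helper functions `solve`,
`ols`, and `r_squared` are written for this illustration only.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(rows, y, coef):
    """Share of the variance in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    fitted = [sum(c * x for c, x in zip(coef, r)) for r in rows]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# HYPOTHETICAL data: ages 25 to 65, income peaking at age 45.
ages = [float(a) for a in range(25, 66)]
income = [60000 - 50 * (a - 45) ** 2 for a in ages]

linear_rows = [[1.0, a] for a in ages]         # constant + age
quad_rows = [[1.0, a, a * a] for a in ages]    # constant + age + age squared

b_lin = ols(linear_rows, income)
b_quad = ols(quad_rows, income)
r2_lin = r_squared(linear_rows, income, b_lin)
r2_quad = r_squared(quad_rows, income, b_quad)

print("linear model R-squared:   ", round(r2_lin, 4))
print("quadratic model R-squared:", round(r2_quad, 4))
print("age-squared coefficient negative:", b_quad[2] < 0)
```

Real income data are noisy, so the contrast will be less extreme than in this
sketch, but the logic is the same: a straight line cannot rise and then fall,
while a negative age-squared coefficient can.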

. tabulate occ1990_reduced

occ1990_redu |
         ced |      Freq.     Percent        Cum.
-------------+-----------------------------------
      nurses |        966       68.37       68.37
sociologists |          6        0.42       68.79
     lawyers |        441       31.21      100.00
-------------+-----------------------------------
       Total |      1,413      100.00

. table occ1990_reduced sex, contents(freq mean incwage) row col

----------------------------------------------------
occ1990_redu |                 Sex
         ced |        Male       Female        Total
-------------+--------------------------------------
      nurses |          62          904          966
             | 48602.45161   36777.9281  37536.85197
             |
sociologists |           2            4            6
             |       39200      42662.5  41508.33333
             |
     lawyers |         308          133          441
             | 80236.42208  59704.73684  74044.32653
             |
       Total |         372        1,041        1,413
             | 74743.46774  39729.70893  48947.76858
----------------------------------------------------

* Why we do multiple regression: we want to control for
potential confounding variables. In this case, we might worry that the
apparent advantage of lawyers over nurses is due to the fact that the lawyers
are mostly male and the nurses mostly female. So let's regress income on both
occupation and sex at the same time.

* OK, even after accounting for the fact that women make
less money than men, lawyers still earn significantly more than nurses. So the
lawyer-nurse gap is not just a function of the gender distribution in the two
occupations. In fact, if you look at the table above, you see that male lawyers
make a lot more than male nurses, and female lawyers make a lot more than
female nurses.
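
* The same point as arithmetic, using only the cell counts and mean incwage
values from the table above. This is a sketch of the comparison, not a
replacement for the regression; the names `cells`, `gap_within`, and
`overall_mean` are made up for the example.

```python
# (occupation, sex) -> (count, mean incwage), copied from the table above.
cells = {
    ("nurses", "male"): (62, 48602.45161),
    ("nurses", "female"): (904, 36777.9281),
    ("lawyers", "male"): (308, 80236.42208),
    ("lawyers", "female"): (133, 59704.73684),
}

def gap_within(sex):
    """Lawyer minus nurse mean income among people of one sex."""
    return cells[("lawyers", sex)][1] - cells[("nurses", sex)][1]

def overall_mean(occ):
    """Mean income for an occupation, pooling men and women."""
    n = sum(cells[(occ, s)][0] for s in ("male", "female"))
    total = sum(cells[(occ, s)][0] * cells[(occ, s)][1] for s in ("male", "female"))
    return total / n

raw_gap = overall_mean("lawyers") - overall_mean("nurses")
male_gap = gap_within("male")
female_gap = gap_within("female")

print("raw lawyer-nurse gap:     ", round(raw_gap, 2))
print("gap among men:            ", round(male_gap, 2))
print("gap among women:          ", round(female_gap, 2))
```

Both within-sex gaps are smaller than the raw gap, so some of the raw gap does
reflect the gender mix of the two occupations, but both remain large and
positive.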

* Notice that the predicted values and the actual values
in our 3x2=6 cells do not coincide. That is because our model had only 4 terms
(a constant, two occupation dummies, and one sex dummy), and so cannot fit the
6 cells exactly. Another way to think about this is that the 3 occupations have
different gender income gaps, but our model above allowed for only 1 general
gender income gap. If we want to fit all 6 cells exactly, we need to allow the
gender gap to vary across occupations.
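
* A sketch of the 4-term versus 6-term comparison, built from the cell counts
and means in the table above. It relies on the fact that, when every predictor
is constant within a cell, least squares on the individual data gives the same
coefficients as weighted least squares on the cell means, with cell counts as
weights. The dummy coding (nurses and males as the reference categories) and
the use of all 1,413 cases are assumptions about how the model was set up.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

# (occupation, sex, count, mean incwage), copied from the table above.
cells = [
    ("nurses", "male", 62, 48602.45161),
    ("nurses", "female", 904, 36777.9281),
    ("sociologists", "male", 2, 39200.0),
    ("sociologists", "female", 4, 42662.5),
    ("lawyers", "male", 308, 80236.42208),
    ("lawyers", "female", 133, 59704.73684),
]

def design_row(occ, sex, interact):
    """Dummy-coded predictors; ASSUMED reference cell is male nurses."""
    soc = 1.0 if occ == "sociologists" else 0.0
    law = 1.0 if occ == "lawyers" else 0.0
    fem = 1.0 if sex == "female" else 0.0
    row = [1.0, soc, law, fem]
    if interact:
        row += [soc * fem, law * fem]  # lets the gender gap differ by occupation
    return row

def predicted_cell_means(interact):
    """Weighted least squares on the cell means, weights = cell counts."""
    rows = [design_row(o, s, interact) for o, s, _, _ in cells]
    w = [n for _, _, n, _ in cells]
    y = [mean for _, _, _, mean in cells]
    k = len(rows[0])
    xtwx = [[sum(wi * r[i] * r[j] for wi, r in zip(w, rows)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(wi * r[i] * yi for wi, r, yi in zip(w, rows, y)) for i in range(k)]
    coef = solve(xtwx, xtwy)
    return [sum(c * x for c, x in zip(coef, r)) for r in rows]

additive = predicted_cell_means(interact=False)   # 4 terms
saturated = predicted_cell_means(interact=True)   # 6 terms

for (occ, sex, n, mean), p4, p6 in zip(cells, additive, saturated):
    print(f"{occ:12s} {sex:6s} actual={mean:10.2f} "
          f"4-term={p4:10.2f} 6-term={p6:10.2f}")

worst_additive = max(abs(p - c[3]) for p, c in zip(additive, cells))
worst_saturated = max(abs(p - c[3]) for p, c in zip(saturated, cells))
```

The 6-term model has one parameter per cell, so its predictions match the
actual cell means; the 4-term model forces a single gender gap on all three
occupations and misses in every cell.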