Observational studies are performed to study the relationships among
variables. Unlike controlled experimental designs, where only certain
variables are allowed to vary (at prespecified levels), in observational
studies the variables are simply observed and recorded. Often some of the
variables are controlled as much as possible. Consider a long-term study of
a drug involving humans, where one variable that needs to be controlled is
diet. Diet guidelines are set, but some of the human subjects will probably
break them from time to time (or maybe often). Contrast this with a lab
setting, where the diet of animals can be controlled.

In observational studies, cause and effect are hard (often impossible)
to establish. But associations and predictabilities among variables can
be investigated. Such associations and predictabilities may be further
studied in a lab setting.

Here's a simple example. Let Y be the weight of a baseball player
and let X be the height of a baseball player. Recall the scatter
plot, which is given by:

If we enter this data set into the data box and choose regression, we get the prediction equation (Wilcoxon):

Predicted Wt = -228.57 + 5.71*Height.
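As a minimal sketch of how this prediction equation is used, the snippet below plugs a few heights into it. The coefficients are copied from the equation above; the function name `predict_weight` and the sample heights are illustrative assumptions, and the units (inches and pounds) are inferred from the "pounds per inch" interpretation of the slope.

```python
# Coefficients copied from the fitted prediction equation in the text.
INTERCEPT = -228.57  # pounds
SLOPE = 5.71         # pounds per inch of height


def predict_weight(height_in):
    """Return the predicted weight (pounds) for a height in inches."""
    return INTERCEPT + SLOPE * height_in


# Sample heights chosen only for illustration; they are not from the data set.
for height in (70, 74, 78):
    print(height, "inches ->", round(predict_weight(height), 2), "pounds")
```

The equation describes the range of heights seen in the data; plugging in heights far outside that range (a very short height gives a negative weight) is not meaningful.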

There is an association between height and weight: an increasing relationship.
We predict a baseball player's weight in terms of his height. A confidence
interval for the slope parameter is (4.2, 7.2); hence, we predict
the weight of a ball player to increase by between 4 and 7 pounds for each
additional inch of height (4 to 7 pounds per inch). We are not saying that
being taller causes a player to be heavier; that would be absurd. But we
are observing an association between height and weight: if a ball
player is taller, then he is more likely to be heavier.
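The arithmetic behind this interpretation can be sketched as follows: scaling both endpoints of the slope interval by a height difference gives a range for the difference in predicted weight. The interval endpoints come from the text; the function name `weight_difference_range` is a hypothetical helper introduced here.

```python
# Confidence interval for the slope, copied from the text (pounds per inch).
SLOPE_CI = (4.2, 7.2)


def weight_difference_range(extra_inches):
    """Range of the difference in predicted weight for a height difference.

    Each endpoint of the slope interval is multiplied by the height
    difference; sorting keeps the result ordered for negative differences.
    """
    low, high = SLOPE_CI
    return tuple(sorted((low * extra_inches, high * extra_inches)))


# Two players whose heights differ by 2 inches.
print(weight_difference_range(2))
```

This describes an association across players, not what would happen to one player if his height changed.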

To make better predictions, there may be other variables to consider.
In the height-weight data, a measure of body build would be useful. In
a more advanced class, we would discuss these issues.

We do need to emphasize one thing concerning observational studies.
There must be a reason to explore associations and predictions. An example
here is worth thousands of words. Let Y be the number of deaths
per 100,000 in England for a year in the late 1800s and let X be
the number of church weddings (in thousands) in England for that year.
There is no reason to seek an association between these variables. But
suppose we do. The data are given in Appendix A.
The scatter plot of the data is:

The relationship is linear. In fact, the pattern is quite tight. Taken at
face value, the plot says: to reduce deaths, reduce church weddings! Of
course, a third variable is producing this pattern: time. These data were
recorded over a span of years. Here is a plot of the death rate versus year:

Church attendance, and with it the number of church weddings, also dropped
over these years. Hence both variables decrease with respect to year and
thus have an increasing relationship when plotted against each other. That
solves the puzzle. Time is called a lurking variable here.
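The effect of a lurking time variable can be illustrated with a small sketch. The two series below are synthetic, not the Appendix A data: each falls steadily over 32 hypothetical years, with unrelated fixed wiggle patterns standing in for year-to-year noise so that the result is reproducible. All names and numbers in the snippet are assumptions made for illustration. The raw series are strongly correlated, but once the linear trend in time is removed from each, almost no correlation remains.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def detrend(ts, ys):
    """Residuals from the least-squares line of ys on ts."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    slope = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
             / sum((t - mt) ** 2 for t in ts))
    return [y - (my + slope * (t - mt)) for t, y in zip(ts, ys)]


# Synthetic series (NOT the Appendix A data): both decrease with time.
years = list(range(32))
wiggle_a = [1 if t % 2 == 0 else -1 for t in years]  # +1, -1, +1, -1, ...
wiggle_b = [1 if t % 4 < 2 else -1 for t in years]   # +1, +1, -1, -1, ...
series_a = [100 - 1.0 * t + 2 * w for t, w in zip(years, wiggle_a)]
series_b = [50 - 0.5 * t + w for t, w in zip(years, wiggle_b)]

r_raw = pearson(series_a, series_b)
r_detrended = pearson(detrend(years, series_a), detrend(years, series_b))
print("raw correlation:         ", round(r_raw, 3))
print("after removing the trend:", round(r_detrended, 3))
```

Removing a straight-line trend is only one simple way to account for time; it is used here because both synthetic series were built with linear trends.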

In an observational study, make sure you include variables for
which a relationship makes sense. If a paradox occurs (such
as death rate and church wedding rate), look for a lurking variable.

Exercise 12.4.1

1. (From Bhattacharyya and Johnson (1977), Statistical Concepts and Methods, New York: Wiley.) Below are used-car prices (in thousands of dollars) for a
foreign compact (1970s data) with their ages in years.