Having read the Statalist FAQ and previous correspondence about general
statistical questions, I hope no one minds . . . .

Among my teaching duties in my medical school and family practice
residency is "critical appraisal of the medical literature." I try to
go over principles of good design and valid analysis. A question
frequently comes up when we discuss randomized controlled trials.
These articles almost always include a "Table 1" that describes the
baseline demographic and clinical variables of the two arms (placebo
and active drug, say). There are usually *a lot* of baseline
measurements, and each one is usually listed with a "P value"
indicating whether the placebo and active-drug subjects differed on
that measurement.

Then the manuscript goes on to describe the rest of the study, and the
results . . .

If the results show an advantage for the active drug, readers (including
my students and residents) will often go back to "Table 1" and say, "Oh
but look, the samples were not identical. Blah-blah was significantly
higher in the placebo arm to begin with. Therefore I can't accept these
results as valid."

I've never agreed with that. So I want to outline my chain of reasoning
here and see if I've got it straight.

There are two premises in a randomized controlled trial with two arms:

1. The two samples are drawn randomly from the same population
2. The active drug actually has no effect (the null hypothesis)

And then there are the results (R).

If 1 and 2 are both true, we can look at R and calculate how likely we
were to see results that "extreme" or more so. That's the P value. If
P < the conventional 0.05, we say, "Gee, if 1 and 2 are both true, we
*might* have seen results R, but only 5% of the time or less, and that's
pretty unlikely. But we *did* see R. Therefore either 1 or 2 must be
untrue. And I'm confident my randomization was solid. Therefore 2 must
be untrue, and the drug really does have an effect."
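
(If it helps to make that concrete, here is a rough simulation sketch
in Stata; the sample size, seed, and variable names are just made up
for illustration. It simulates 1,000 two-arm trials in which premises
1 and 2 both hold, and counts how often the outcome comparison still
gives P < 0.05. The count should come out near 50, i.e. about 5%.)

    * Minimal sketch: 1,000 trials where both premises hold,
    * i.e. random assignment and a drug with no effect at all.
    clear all
    set seed 20240101
    local reject 0
    forvalues t = 1/1000 {
        quietly {
            clear
            set obs 200
            gen byte arm = runiform() < 0.5    // premise 1: random assignment
            gen outcome = rnormal()            // premise 2: no drug effect
            ttest outcome, by(arm)
            if r(p) < 0.05 local ++reject
        }
    }
    display "Trials with P < 0.05: `reject' of 1000"   // expect roughly 50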

There is nothing in this chain of reasoning that requires the samples
to be identical/indistinguishable. And for every 20 baseline variables
compared, you'd *expect* about 1 of them to have a P value < 0.05 even
when the randomization was done perfectly. The statistical techniques
have "built-in" accommodation for this chance variation, so it does
not invalidate the conclusions.
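
(Again as a rough Stata sketch with made-up names and sizes: one
properly randomized trial, 20 baseline variables drawn from the same
population in both arms, and a "Table 1" t test on each. Typically
about one of them comes out "significant," and, assuming the 20
comparisons are independent, the chance that at least one does is
1 - 0.95^20, or about 64%.)

    * Minimal sketch: 20 "Table 1" comparisons in one well-randomized trial.
    clear
    set seed 54321
    set obs 400
    gen byte arm = runiform() < 0.5        // random assignment
    local sig 0
    forvalues i = 1/20 {
        gen x`i' = rnormal()               // baseline variable, same population in both arms
        quietly ttest x`i', by(arm)
        if r(p) < 0.05 {
            local ++sig
            display "x`i': P = " %6.4f r(p)
        }
    }
    display "`sig' of 20 baseline comparisons have P < 0.05"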

It is a difficult concept for my learners to grasp. Or maybe I've got
it wrong?