More on detection, attribution and estimation 2: Incredible confidence intervals

For the first, rather natural example, let's assume we are trying to measure some simple non-negative quantity such as the mass of an apple. We have a set of scales with a random (but well-characterised) error of ±50g (Gaussian, at 1 standard deviation). That is, if we take a calibrated mass of value X, repeatedly use the scales and plot a histogram of the result of each measurement, the outputs will form a nice Gaussian shape with mean X and standard deviation 50g. [OK, I know I'm doing this at a very boring pace, but I need to make sure it is all clearly set out.]

One obvious and very natural way to create a confidence interval for the apple's mass is to take a single measurement (call the observed mass m) and then write down (m-50,m+50), which is a symmetric 68% confidence interval for M, the true mass. That is to say, if we were to hypothetically repeat this experiment numerous times, and construct the set of confidence intervals (mi-50,mi+50) where i indexes the measurements generated in the experiments, then 68% of these intervals would include M, 16% would be wholly greater than M and 16% wholly smaller (this is guaranteed by the specified observational uncertainty). We don't, of course, actually do this infinite experiment - but this is precisely what is meant by "(m-50,m+50) is a symmetrical 68% confidence interval for M". [Are you asleep yet? The punchline is coming up...]

It is incorrect to interpret the specific confidence interval (m-50,m+50) as implying that M lies in that range with probability 68% (and above and below with probability 16%).

To see why this is the case, consider the following: what if the reading happens to be m=40g? (Which it might well be, if the true mass is say 80g.) Is the confidence interval (-10,90) really a symmetric credible interval at the 68% level? That is to say, would anyone believe that the apple's mass is <-10g with probability 16% (would they bet on it - if so, please point them this way...)? Of course not. Any symmetric 68% credible interval (ie an interval so that one believes M lies in it with probability 68%, and below and above with probability 16% each) must necessarily be truncated somewhere above zero. Yet it is trivial to show that the confidence interval as constructed above is entirely valid. Under repeated observations, 16% of the measurements will be lower than M-50, 16% greater than M+50, and the remainder in between, so the population of confidence intervals has exactly the statistical properties required of it.
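This is easy to check numerically. Here is a minimal simulation sketch (the true mass of 80g is an assumed value, purely for illustration) confirming that the intervals have exactly the advertised coverage, even though a fair fraction of them extend below zero:

```python
import random

random.seed(1)
M = 80.0    # true mass in grams (hypothetical value for illustration)
SD = 50.0   # standard deviation of the measurement error
N = 100_000

covered = below = above = dips_negative = 0
for _ in range(N):
    m = random.gauss(M, SD)     # one noisy measurement
    lo, hi = m - SD, m + SD     # the "obvious" 68% confidence interval
    if hi < M:
        below += 1              # interval wholly below the true mass
    elif lo > M:
        above += 1              # interval wholly above it
    else:
        covered += 1
    if lo < 0:
        dips_negative += 1      # interval extends to negative masses

print(f"coverage: {covered / N:.3f}")                          # close to 0.683
print(f"missed low/high: {below / N:.3f} / {above / N:.3f}")   # ~0.159 each
print(f"fraction reaching below 0g: {dips_negative / N:.3f}")
```

The coverage is exactly as required of a 68% CI, and yet (with these numbers) roughly a quarter of the individual intervals include negative masses - which is precisely the point.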

One can, perhaps, state that "negative mass is not statistically inconsistent with the measurement" or maybe even say "negative mass cannot be ruled out by the measurement", but these statements cannot be interpreted as implying that anyone thinks the apple actually has negative mass!

There are other examples of non-credible confidence intervals that are quite striking. Here's one I found on the web (description lightly modified):

Let's say we want to estimate a parameter x. Let's ignore all the available measurements entirely! In their place, start by using a random number generator to generate y uniformly in [0,1]. If y < 0.68, define the confidence interval to be the whole real line; otherwise, define it to be the empty interval. That's it! Again, this routine trivially generates a 68% CI - that is, exactly 68% of the time, the CI contains x, whatever value x takes. But neither of the two possible intervals that the algorithm generates is credible at the 68% level - it should be clear that one of the possible intervals contains x (and the other does not) with certainty, even without knowing what x is.
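For concreteness, here is that degenerate procedure as a few lines of Python (the value of x is arbitrary, since the "data" are never consulted):

```python
import random

random.seed(0)
x = 3.14      # the unknown parameter; any value gives the same behaviour
N = 100_000

hits = 0
for _ in range(N):
    y = random.random()    # ignore all measurements; just roll the dice
    if y < 0.68:
        # "confidence interval" = whole real line: certainly contains x
        hits += 1
    # else: "confidence interval" = the empty set: certainly misses x

print(f"coverage: {hits / N:.3f}")   # close to 0.68, as required of a 68% CI
```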

In the next part, I'll try to reconcile these results with the underlying theory.

10 comments:

I'm trying to follow this, having had only grad-level statistics in the 1970s for bio/psych, and much appreciate your making it available.

One probably naive question -- how does your view describe the choice to ignore the "incredible" low values detected by NASA's TOMS early on?

I recall the system had discovered the 'ozone hole' but the scientists didn't see the data, because the low numbers had been automatically ignored based on assumptions about what was credible; eventually ground-based teams' reports caused the assumptions to be changed. Here's one cite:

http://darwin.nap.edu/books/0309092353/html/116.html

"The timing of the Nimbus-7 mission included a period of rapid deepening and discovery of the ozone hole. The significant lowering in total ozone over Antarctica caused a rethinking of the autonomous ground quality-assurance programs that otherwise would reject the “unrealistic” low values."

That's an interesting one. I don't know the story in detail, but scientists often use similar simple outlier-rejection techniques. I'd view this as a Bayesian prior belief that the instrument is much more likely to return a bad value, than that a massive rapid change is likely to occur in reality.

Perhaps some people might try to waffle about Kuhnian paradigm-shifts at this point, but it seems like an example of rather reasonable and rational behaviour - once the evidence built up, people changed their opinions pretty quickly. Of course, there are denialists over the ozone issue too...
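[The sort of outlier-rejection filter being described can be sketched in a couple of lines - note that the threshold and the readings below are entirely invented, for illustration only. The prior belief "values this low must be instrument error" is baked into a hard floor, so a genuine excursion below the floor never reaches the analyst.]

```python
# Hypothetical quality-control filter: readings below a "physically
# plausible" floor are assumed to be instrument error and silently dropped.
PLAUSIBLE_FLOOR = 180.0   # e.g. total ozone in Dobson units (invented value)

def qc_filter(readings, floor=PLAUSIBLE_FLOOR):
    """Keep only readings at or above the plausibility floor."""
    return [r for r in readings if r >= floor]

# Invented daily readings, including a genuine low anomaly:
readings = [310.0, 295.0, 150.0, 140.0, 300.0]
print(qc_filter(readings))   # the real low values vanish before analysis
```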

A couple more sources on that delayed discovery -- cautionary as we now watch far more instruments indirectly via computers.

"Murphy never sleeps, but that's no reason to poke him with a sharp stick." -- www.nancybuttons.com

"The Antarctic ozone hole was first observed by ground-based measurements from Halley Bay on the Antarctic coast, during the years 1980-84. (At about the same time, an ozone decline was seen at the Japanese Antarctic station of Syowa; this was less dramatic than those seen at Halley since Syowa is about 1000 km further north, and did not receive as much attention.) It has since been confirmed by satellite measurements as well as ground-based measurements elsewhere on the continent, on islands in the Antarctic ocean and at Ushaia, at the tip of Patagonia. With hindsight, one can see the hole beginning to appear in the data around 1976, but it grew much more rapidly in the 1980's."

http://www.hu.mtu.edu/~mmcooper/old%20homepage/classes/ozone.html

Satellite measurements showing massive depletion of ozone around the south pole were becoming available at the same time. However, these were initially rejected as unreasonable by data quality control algorithms (they were filtered out as errors since the values were unexpectedly low); the ozone hole was only detected in satellite data when the raw data was reprocessed following evidence of ozone thinning in in situ observations. (Wikipedia)

I just want to point out that in your apple-weighing experiment you've made conflicting assumptions. First you accepted a normal distribution for the measured value, but later stated that the mass couldn't be negative. It seems you've made some prior (oops, there's that word again) assumptions about what value the mass of an apple may have.

That's not quite right. I start off with the premise that the measurement error is normal. Then I find that the measurement implies a non-zero likelihood for a negative mass, and also that the natural confidence interval extends to negative values.

I have indeed made a prior assumption that the apple's mass cannot be negative. I think that's an entirely reasonable prior assumption!

I think you've criticized bayesian language for a case which doesn't include a bayesian analysis. An analyst that applied a reasonable prior to the apple's mass (m>0 for instance) would never end up with a confidence interval that included negative values.

The absurdity is easy to see. In your analysis, you've said that the confidence interval will always be centered on the measured value. Since the gaussian measurement error can produce negative measurements even when M>0, unreasonable (incredible) confidence intervals are guaranteed. To say that a non-frequentist interpretation is incorrect is very disingenuous. Only frequentist methods were used. If anything, this is a demonstration of the problems with failing to apply bayesian methods.
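[For reference, the Bayesian calculation alluded to here is straightforward to sketch. With a flat prior on M ≥ 0 and the Gaussian likelihood from the post, the posterior is a normal distribution truncated at zero, so any central 68% credible interval has both endpoints non-negative. A minimal standard-library version, using the m=40g, sd=50g numbers from the post and a simple bisection for the normal quantile:]

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection (adequate for a sketch)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

m, sd = 40.0, 50.0          # observed mass and error sd, as in the post
# Flat prior on M >= 0  =>  posterior is N(m, sd) truncated below at 0.
c0 = norm_cdf(-m / sd)      # Gaussian mass below zero, cut off by the prior
Z = 1.0 - c0                # normalising constant of the posterior

def posterior_quantile(p):
    return m + sd * norm_ppf(c0 + p * Z)

lo, hi = posterior_quantile(0.16), posterior_quantile(0.84)
print(f"68% credible interval: ({lo:.1f}, {hi:.1f})")   # both endpoints >= 0
```

[With these numbers the interval comes out near (19, 97) grams: shifted and stretched relative to the raw (-10, 90) confidence interval, and entirely non-negative, as a credible interval for a mass must be.]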

You describe it as a problem of failing to apply Bayesian methods, but there are legitimate ways of applying both frequentist and Bayesian methods here, so long as one recognises that they are answering different questions. The confidence interval is fine as it stands, so long as one accepts that it is a confidence interval!

The real problem IMO arises when the answer to a frequentist analysis is presented in a Bayesian manner (ie, presenting a confidence interval as a credible interval). Unfortunately, this is what some of the climate literature on detection and attribution seems to do. In fact, rumour has it that one of the main figures in the field actually believes it is a valid (or even the correct) approach...

If your original post wasn't meant as a critique of bayesian analysis, then I don't have a problem with it.

It sounds like you're saying that the term "confidence interval" can only describe an interval that is arrived at by frequentist methods, while "credible interval" should refer to bayes-derived intervals. Is that true? I think that such a subtle distinction is bound to cause more confusion than it solves. I also think it's a distinction without a practical difference, as both are trying to answer the same question, i.e. what is the value of an unknown population parameter. In a comparison of any two or more methods, the interval performance would be evaluated in exactly the same way.

It sounds like you're saying that the term "confidence interval" can only describe an interval that is arrived at by frequentist methods, while "credible interval" should refer to bayes-derived intervals. Is that true?

I don't pretend to speak for them, but I think Bayesians would generally insist on it (eg here), and Frequentists who don't are usually those who are unaware of the distinction :-) I can count myself as a member of the latter group until fairly recently, I might add.

I think that such a subtle distinction is bound to cause more confusion than it solves.

Is it not more confusing to use the same term to describe two different things? It has certainly confused me in the past. As I've shown in these examples, it is easy to create confidence intervals that are not credible, and even when their non-credible nature is not so immediately clear, this does not mean that they actually are valid credible intervals.

The problem as I see it is that frequentist methods don't actually attempt to answer the question "what is the value of the parameter" at all. However, people sometimes interpret their results as if they do.

The problem as I see it is that frequentist methods don't actually attempt to answer the question "what is the value of the parameter" at all. However, people sometimes interpret their results as if they do.

I agree completely, and that's what puts me in the bayesian camp. Maybe the frequent misinterpretation of confidence intervals by frequentists is what initially caused my distrust. Any time a decision needs to be made based on available data, bayesian methods are required. I haven't found a counterexample yet. Thanks for the discussion.