Posted on 12 November 2010 by Maarten Ambaum

Climate science relies heavily on statistics to test hypotheses. For example, we may want to ask whether the global mean temperature has really risen over the past ten years. A standard answer is to calculate a temperature trend from data and then ask whether this temperature trend is “significantly” upward; many scientists would then use a so-called significance test to answer this question. But it turns out that this is precisely the wrong thing to do.

This poor practice appears to be widespread. A new paper in the Journal of Climate reports that three quarters of papers in a randomly selected issue of the same journal used significance tests in this misleading way. It is fair to say, though, that most of the time, significance tests are only one part of the evidence provided.

The post by Alden Griffith on the 11th of August 2010 lucidly points to some of the problems with significance tests. Here we summarize the findings from the Journal of Climate paper, which explores how it is possible that significance tests are so widely misused and misrepresented in the mainstream climate science literature.

Not surprisingly, preprints of the paper have been enthusiastically picked up by those on the sceptic side of the climate change debate. We had better find out what is really happening here.

Consider a scientist who is interested in measuring some effect and who does an experiment in the lab. Now consider the following thought process that the scientist goes through:

1. My measurement stands out from the noise.

2. So my measurement is not likely to be caused by noise.

3. It is therefore unlikely that what I am seeing is noise.

4. The measurement is therefore positive evidence that there is really something happening.

5. This provides evidence for my theory.

This apparently innocuous train of thought contains a serious logical fallacy, and it appears at a spot where not many people notice it.

To the surprise of most, the logical fallacy occurs between step 2 and step 3. Step 2 says that there is a low probability of finding our specific measurement if our system just produced noise. Step 3 says that there is a low probability that the system just produces noise. These sound the same but they are entirely different.

This can be compactly described using Bayesian statistics, which relies heavily on conditional probabilities. We use notations such as p(M|N) to mean the probability that M is true if N is known to be true, that is, the probability of M, given N. Now say that M is the statement “I observe this effect” and N is the statement “My system just produces noise”. Step 2 in our thought experiment says that p(M|N) is low. Step 3 says that p(N|M) is low. As you can see, the conditionals are swapped; these probabilities are not the same. We call this the error of the transposed conditional.
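To see that the two conditionals really are different numbers, it can help to count cases. The sketch below uses invented counts for 1000 imaginary experiments (the counts and variable names are assumptions for illustration, not taken from the paper) and computes both p(M|N) and p(N|M) from the same table.

```python
# Invented counts for 1000 imaginary experiments (illustration only).
# N: "my system just produces noise"; M: "I observe this effect".
counts = {
    ("noise", "effect_seen"): 45,
    ("noise", "nothing_seen"): 855,
    ("real", "effect_seen"): 80,
    ("real", "nothing_seen"): 20,
}

# Totals for the two conditions we want to condition on.
n_noise = counts[("noise", "effect_seen")] + counts[("noise", "nothing_seen")]
n_effect_seen = counts[("noise", "effect_seen")] + counts[("real", "effect_seen")]

# p(M|N): among the noise-only experiments, how often is the effect seen?
p_m_given_n = counts[("noise", "effect_seen")] / n_noise

# p(N|M): among the experiments where the effect is seen, how many were noise?
p_n_given_m = counts[("noise", "effect_seen")] / n_effect_seen

print(f"p(M|N) = {p_m_given_n:.2f}")  # 0.05
print(f"p(N|M) = {p_n_given_m:.2f}")  # 0.36
```

With these made-up counts, p(M|N) is only 5%, yet more than a third of the observed effects come from pure noise.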

How about a significance test? A significance test in fact returns a value of p(M|N), the so-called p-value. In this context N is called the “null-hypothesis”. It returns the probability of observing an outcome (M: we observe an upward trend in the temperature record) given that the null-hypothesis is true (N: in reality there is no upward trend, there are just natural variations).
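As a rough illustration of what such a test computes, here is a minimal sketch of one simple kind of significance test, a permutation test for a trend. It is not the test used in any particular paper. The data are invented, and the null hypothesis used here (every ordering of the values is equally likely) is an assumption that ignores the year-to-year persistence of real temperature series.

```python
from itertools import permutations

def trend(values):
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator

def permutation_p_value(values):
    """Fraction of all orderings whose trend is at least the observed one.

    Null hypothesis N (an assumption of this sketch): the order of the
    values carries no information, so every ordering is equally likely.
    """
    observed = trend(values)
    orderings = list(permutations(values))
    # Small tolerance so rounding error cannot exclude the observed ordering.
    at_least_as_large = sum(1 for o in orderings if trend(o) >= observed - 1e-12)
    return at_least_as_large / len(orderings)

# Invented temperature anomalies for six consecutive years (not real data).
anomalies = [0.10, 0.18, 0.21, 0.30, 0.33, 0.41]
p_value = permutation_p_value(anomalies)
print(p_value)  # equals 1/720: only the observed ordering rises steadily
```

The number returned is the probability of seeing a trend at least this large if N holds. Nothing in the calculation refers to the probability of N itself.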

The punchline is that we are not at all interested in this probability. We are interested in the probability p(N|M), the probability that the null hypothesis is true (N: there is no upward temperature trend, just natural variability) given that we observe a certain outcome (M: we observe some upward trend in the temperature record).

Climate sceptics want to argue that p(N|M) is high (“Whatever your data show me, I still think there is no real trend; probably this is all just natural variability”), while many climate scientists have tried to argue that p(N|M) is low (“Look at the data: it is very unlikely that this is just natural variability”). Note that low p(N|M) means that the logical opposite of the null-hypothesis (not N: there really is an upward temperature trend) is likely to be true.

Who is right? There are many independent reasons to believe that p(N|M) is low; standard physics for example. However many climate scientists have shot themselves in the foot by publishing low values of p(M|N) (in statistical parlance, low p(M|N) means a “statistically significant result”) and claiming that this is positive evidence that p(N|M) is low. Not so.

We can make some progress though. Bayes' theorem shows how the two probabilities are related. The aforementioned paper shows in detail how this works. It also shows how significance tests can be used; typically to debunk false hypotheses. These aspects may be the subject of a further post.
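The relation itself is short enough to write down. The sketch below is a generic statement of Bayes' theorem for a hypothesis and its opposite, not the specific calculation in the paper; the function name and all input numbers are assumptions chosen for illustration.

```python
def posterior_null(p_m_given_n, p_m_given_not_n, prior_n):
    """p(N|M) from Bayes' theorem.

    p_m_given_n:     p(M|N), probability of the observation if N holds
    p_m_given_not_n: p(M|not N), probability of the observation otherwise
    prior_n:         p(N), probability of N before seeing the observation
    """
    # p(M), the total probability of the observation under both alternatives.
    evidence = p_m_given_n * prior_n + p_m_given_not_n * (1 - prior_n)
    return p_m_given_n * prior_n / evidence

# Same p(M|N) = 0.05 throughout; only the other two inputs are varied.
for prior_n in (0.1, 0.5, 0.9):
    for p_m_given_not_n in (0.8, 0.1):
        result = posterior_null(0.05, p_m_given_not_n, prior_n)
        print(f"p(N)={prior_n}, p(M|not N)={p_m_given_not_n}: p(N|M)={result:.3f}")
```

The same p(M|N) of 0.05 is compatible with a small or a large p(N|M), depending on the prior p(N) and on how probable the observation is when N is false.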

In the meantime, we need to live with the fact that “statistically significant” results are not necessarily in any relevant sense significant. This doesn't mean that those results are false or irrelevant. It just means that the significance test does not provide a way of quantifying the validity of some hypothesis.

So next time someone shows you a “statistically significant” result, do tell them: “I don't care how low your p-value is. Show me the physics and tell me the size of the effect. Then we can discuss whether your hypothesis makes sense.” Stop quibbling about meaningless statistical smoke and mirrors.

Comments

I was playing around with statistics a few weeks ago. It helps me understand Tamino :-)

Then this claim below crossed my mind, just like Dr. Ambaum:

In the meantime, we need to live with the fact that “statistically significant” results are not necessarily in any relevant sense significant.

I think you could statistically correlate car sales and global warming, for instance, and it would mean nothing. It's the underlying physics AND the statistics that will give you the evidence - which is the case.

Daniel's chart 'proves' that global warming is caused by lack of pirates... but in the past several years piracy has been booming off the coast of Somalia! We should start seeing temperatures turn around now! :]

I recently saw an article in a journal that supported AGW but the numbers weren't significant at the p<0.05 level. So AGW isn't real because every supporting evidence needs to be above the 95% certainty level.

Really, this is a very formalized demonstration of why syllogistic argumentation is not a conclusive or reliable means of establishing truth-values. I'm reminded that the great US philosopher Charles Sanders Peirce (19th C) whose father Benjamin Peirce was a pioneer in statistical theory (esp. outliers) himself, admonished his readers that any argument depending on syllogism was to be implicitly mistrusted. He claimed the better means of understanding is by examining the substantial implications and possibilities of relations between things exhaustively, instead of attempting to fit them into formal logic.

Actually there might be a tenuous link with pirates and warming.
eg. I think it was the Royal Navy that eliminated a lot of piracy.

But they had to chop a lot of trees down to do it, plus the age of ironclads and battleships (coal use) meant pirates needed to be more sophisticated with access to a better income stream to afford a steam boat with heavy guns.

Also piracy became a state-sanctioned aim during the world wars with submarines, but the motive wasn't to steal produce. Although maybe that was the Nazis' big mistake. They should have stolen the convoys, rather than sinking them?

I'm unsure why people are so quick to ascribe global warming to pirates, when clearly the opposite is more likely, i.e. that global warming is causing a precipitous decline in the number of pirates. This only makes sense, as the increased heat will tend to make our young people lethargic, and so less likely to get up to go to pirate tryouts, and to attend piracy school.

At the same time, as someone whose brain hurts whenever I think about probabilities in any sense beyond my chances of finally winning the lottery, I must admit that I find statistics and statisticians as annoying as piracy and pirates... perhaps even more so.

If only global warming had such a negative impact on statisticians! Alas, and alack, I fear that the opposite is the case. I'm far more cognizant of statisticians in this woefully warming world.

I also have no doubt that statisticians keep Bayesian eye patches in their desk drawers, to be worn in complete secrecy in the privacy of their lairs, while performing their heinous acts of statismancy and probabalism. The line between pirate and statistician is, I fear, as blurry as the line between p(M|N) and p(N|M).

Thank you Dr Ambaum, I'm sure to use this explication elsewhere. I noticed the fallacy between #2 and #3, but I thought power analysis was going to come into play as a patch (actually, I initially flinched at #3 because the scientist should be thinking that noise + effect is being observed). Typically when one fails to detect a 'significant' effect, one can't accept the null hypothesis but can do a power analysis to determine the strength of effect that he/she should have been able to detect. On the flip side, however, a frequentist wouldn't worry too much if he/she detected a 'significant' effect -- the problem with power only really occurs in one direction. Your post here is about a broader issue than I first thought, and you are saying that frequentist statistics are always(?) misleading relative to Bayesian methods.

I'll have to look at this more carefully (I keep telling myself to learn the Bayesian approach, but I still haven't sat down and done it). I had thought that the main misuse of frequentist statistics was in post-hoc analyses of existing data from uncontrolled experiments. That was the other thing I thought you were getting at: that JOC authors were obtaining data, visualizing them, and then deciding to do frequentist tests (after conscious or sub-conscious pre-selection). That's obviously wrong, to me, and I know it happens in my field (biology). I didn't think planned application of frequentist stats in a controlled experimental design was problematic. Time to learn...

General discussion of broad categories of evidence for global warming should go in an appropriate thread, such as this or this.

Also, please note that in a series of visits over the past month, you've left at least five versions of the same comment about ice cores, in five different threads. Most of them have now been deleted or redirected here.

Please try to post your comments in the appropriate thread and then stick with them there, rather than spreading discussions across many different threads. This helps make the site more readable for everyone.

"The travesty, of course, is that we cannot account for the number of pirates empirically measured via apprehension or by sinking of their crafts vs that predicted by Disney movies. Latest measurements of the briny deep suggest some may have fled to Davy Jones' Locker" says the study lead auteur Calypso Cousteau.

While the pirate information is entertaining, this post claims that 75% of peer-reviewed climate papers use the wrong statistics. I find this a very interesting claim. How could so many people, including skeptical statisticians like McIntyre, overlook this simple mistake? The linked thread by Alden Griffith discusses these types of statistics. He finds only a very small difference in the numbers (92% using Bayesian statistics versus 92.4% using significance tests). Perhaps scientists use significance tests because there is little difference between the two and significance tests are easier to do. The post suggests significance tests are not useful, while Griffith seems to suggest there is little difference. Can someone who knows statistics explain how different these analyses really are?

In the meantime, we need to live with the fact that “statistically significant” results are not necessarily in any relevant sense significant. This doesn't mean that those results are false or irrelevant. It just means that the significance test does not provide a way of quantifying the validity of some hypothesis.

So next time someone shows you a “statistically significant” result, do tell them: “I don't care how low your p-value is. Show me the physics and tell me the size of the effect. Then we can discuss whether your hypothesis makes sense.”

The big challenge for using Bayesian statistics is choosing the prior probabilities. Bayesian proponents argue that at least that approach forces the decision maker (i.e., scientist) to be explicit about their assumptions. But in practice, most scientists don't bother going through that. Instead they happily rely on the messier and less quantitative but nonetheless completely legitimate approach of treating these non-Bayesian statistical test results as just some pieces of the large body of evidence they use to make their subjectively probabilistic decisions about scientific hypotheses and theories. In doing so, they don't really rely on all the quantitative information that nominally is included in the 5% or whatever percent significance levels. Instead they tend to treat those percentages only as rough indicators of strength of evidence. Consequently, the scientists tend not to be much misled by the incorrectness of those numbers for the particular decisions being made.

I have to disagree with your application of Bayesian statistics; the scientists should not be bothering with them.

When do Bayesian statistics matter? When the prior probability is extreme (very likely or very unlikely). So if the chance that a woman your age has breast cancer is 1 in 1000, and mammograms have a 1 in 100 false positive rate, and you had one done as part of a routine checkup and it came back positive, Bayesian statistics tells us that chances are you don't have cancer.
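The arithmetic behind this example can be sketched as follows. The prior and the false positive rate come from the example; the assumption that the test never misses a real cancer is not stated there and is added here only to make the calculation complete.

```python
# Numbers from the example above.
prior_cancer = 1 / 1000        # chance that a woman of this age has cancer
false_positive_rate = 1 / 100  # p(positive | no cancer)

# Assumption not given in the example: the test never misses a real cancer.
sensitivity = 1.0              # p(positive | cancer)

# Total probability of a positive result, with or without cancer.
p_positive = sensitivity * prior_cancer + false_positive_rate * (1 - prior_cancer)

# Bayes' theorem: probability of cancer given the positive result.
p_cancer_given_positive = sensitivity * prior_cancer / p_positive

print(f"{p_cancer_given_positive:.3f}")  # 0.091
```

Under these assumptions a positive result raises the probability of cancer from 0.1% to about 9%, so it is still much more likely that there is no cancer. A less sensitive test would give an even lower figure.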

But when your prior probability is something medium, it isn't likely to affect the significance of the result. What's more... just how do you establish the prior probability? By counting planets where the climate sensitivity is above 2 degrees per doubling of CO2 and those where it's below? And if you're already pretty certain that you know what the answer is, what are you adding by doing the experiment? Let's say the existing body of evidence leads you to be 99% certain, and your experiment doesn't cause that figure to budge, do you now show using Bayesian statistics that combining your result with the prior gives you 99% confidence and, presto! a statistically significant publishable result! Of course not.

Another problem with this is that it's that prior (is Global Warming real?) which is precisely what we want to figure out, not the "real" posterior (is it really warming at the moment?). We want P(N), not P(N|M). Asking how to get P(N|M) from P(M|N) is getting a few steps ahead -- you also want to know P(M2|N) and P(M3|N) and P(M4|N) and all the other pieces of evidence before you do that calculation. And if someone else finds further evidence and publishes a paper showing P(M5|N), well now that Bayesian analysis you did in your paper to get P(N|M1..M4) is out of date. But that calculation of P(M4|N) stands, and will forever be useful as a piece of the evidence used to assess P(N).

Bayesian analysis provides a way of thinking about how to combine all the pieces of evidence to form your conclusion, but the proper role of research is to establish those individual pieces of evidence. Establish the symptoms if you will. One experiment is your family history, another the mammogram, another the biopsy. We don't calculate whether the mammogram is positive or negative by considering your family history, rather they stand as separate results which we then combine to make an inference. And in this analogy we can't perfectly do the Bayesian calculation because we don't really know what fraction of the population has cancer, except for what we infer through these tests. But you don't subject patients to tests that tell 1 in 5 healthy people they have cancer, and so likewise we demand statistical significance.

The real abuse of statistical significance is among deniers who tout statistical insignificance as evidence of something. (See the misunderstanding of Phil Jones' statement on the statistical significance of warming.) No statistically significant result is no result; it is not evidence for the null hypothesis; it is not evidence for anything because statistical insignificance is always achievable with little enough data no matter what is going on. For real evidence that warming has stopped, you want statistically significant evidence that warming is not above a certain rate (let's say .05 degrees per decade). That would be something if that existed.
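The claim that insignificance is always achievable with little enough data can be sketched with the textbook formula for the standard error of a least-squares trend. The sketch assumes one value per year, independent noise of known size, and invented numbers for the trend and the noise; none of these come from an actual temperature record.

```python
import math

def trend_to_error_ratio(trend_per_year, noise_std, n_years):
    """Ratio of a linear trend to the standard error of its least-squares fit.

    Assumes one value per year and independent (white) noise; real
    temperature series are autocorrelated, which makes the ratio smaller.
    """
    # Sum of squared deviations of the years 0..n-1 from their mean.
    spread = n_years * (n_years**2 - 1) / 12
    standard_error = noise_std / math.sqrt(spread)
    return trend_per_year / standard_error

# Invented values: 0.017 degrees per year of warming, 0.1 degrees of noise.
for n_years in (5, 10, 15, 30):
    ratio = trend_to_error_ratio(0.017, 0.1, n_years)
    print(n_years, round(ratio, 2))
```

As a rough rule of thumb, a ratio below about 2 would not count as significant at the 5% level. The same underlying trend falls below that threshold for the short records and well above it for the long one.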

I have not read Dr Ambaum's paper because it is not yet available in my bibliographic source but I look forward to it. I think that Eric L's comments are particularly insightful and it would be good if the Dr. would respond to them. More generally, I do experimental research in the social sciences and have always been aware, since my very first statistics course, of the difference between statistical significance and quantitative differences in effects. I cannot believe that climate scientists are not aware of this or do not understand the distinction. If you have taken advanced stat courses, you understood it or you failed the courses. In my studies, which are generally on very large samples of families (several thousand experimental and control members) it is relatively easy to show statistical significance when the quantitative differences in treatment effects are relatively small. And the first question that arises, especially from practitioners, is if this is real how much or how many cases should we expect to occur in which these changes will be observed? Statistical significance does indicate that an observed difference is real. There are other ways of answering the question, is the size of the effect minor, moderate or large?

Thank you for this. I see problems more often (or more obviously) in popular articles about health-related research than climate-related research. Sometimes it's journos drawing invalid conclusions, on occasion it's been researchers apparently deliberately misrepresenting their own research (eg obesity). I'm not sure if some disciplines get better training in stats, or have access to professional statisticians.

Climate science is heavily reliant on proxy data and hence statistical methods, probably more than other empirical sciences are. It's no surprise that this part of the science comes under such heavy scrutiny. However, I would imagine that if climate scientists were in any doubt relative to natural variability they would not have published so many peer reviewed articles.

So rather than bring up the subject we should be asking why so much literature on the subject if it has little or no merit?

There is a high probability that only an off topic post by a skeptic will be flagged while more egregiously off topic posts about pirates will go unanswered.

Based on Ambaum's statement that 3/4 of the articles in a recent randomly picked issue of a prestigious climate publication contained this error, is it likely that the papers that the IPCC uses in its publications are tainted? Ambaum further stated that this number was up from an issue ten years earlier, where the error occurred only 1/2 the time.

I have seen what Ambaum alludes to in his paper, an increased use of computer programs to analyze data without understanding the underlying reasoning. You will typically see this on tests when asking students to take the sin(pi/3)/cos(pi/3)/sqrt(3). A calculator dependent student will more often than not get this wrong.

Temperature anomaly is a low signal to noise ratio quantity. I'd sure like to see a study of the proper use of statistics in deriving that quantity. In fact it seems like there was one in a past topic. Can't quite recall the name at the moment.

@muoncounter "No, but it does mean that 75% of climate denier posts are misleading -- and that's significant."
Guess I'm not seeing the connection to "climate deniers". What is a "climate denier" anyway? Someone who denies that there is such a thing as climate? I wasn't aware that the Journal of Climate was an anti-anthropogenic global warming publication. After all they put out this, "Global Warming is Unequivocal: The Evidence from NOAA" 5/6/2010.

The Pirate Chart was used to illustrate Alexandre's point, that just because things can be correlated doesn't mean that the correlation itself has any meaning.

Just because comments by skeptics get flagged for being off-topic doesn't mean comments by those who believe in climate science do not get flagged for being off-topic. Check out the Deleted Comments bin sometime. I've had comments land there before; I can also guarantee I'll end up there again sometime. Comments that are off-topic get deleted; fact of life here.

No it means 75% of all climate research is in part misleading. This is surely not a "denier"/fear-mongerer issue.

In fact given that many on this website believe almost all peer-reviewed literature is in support of AGW then this paper is a critique of the mainstream science, "deniers" should be left out of the discussion because this paper has not researched the space where the audience of this website believe "deniers" predominantly publish. Let's stay within the bounds of the published work.

"No it means 75% of all climate research is in part misleading. This is surely not a "denier"/fear-mongerer issue. "

Lacks in rigor, not "is in part misleading".

I love the way that HR and others latch on to one paper critical of statistical analysis in science, and immediately cast aside all the supposed "skepticism" they show towards published work.

I imagine it's because HR and others believe this shows some gaping problem with climate science that undercuts the fundamental overwhelming scientific consensus that increasing CO2 will warm the planet somewhere between 1.5 and 4.5 C per doubling.

Classic example of confirmation bias. Based on essentially a sample size of 1 issue of 1 climate science publication, author Ambaum demonstrated at least one instance of misuse of significance testing in approximately three-fourths of the articles in the issue. No surveying of other publications in the field, no controls to other publications in other fields. Again, a sample size of 1.

Based on that, HumanityRules conflates that into

"No it means 75% of all climate research is in part misleading."

Sad. There was a time when I thought you had something constructive to offer, HumanityRules. Now I find I can't take you seriously anymore as it seems you aren't even trying, preferring to serve up inflammatory distortions instead.

OK. Look here, where we discussed the gross generalization in a 'published work' stating that many climate scientists are computer illiterate. In this case, the gross generalization was "this paper suggest 75% of climate science papers use statistical significance in a "misleading" way".

My point was and remains: Broad generalizations like these include everyone in the affected class. That includes Watt$, Godd@rd, Mc&tyre and the like. If you want to stick with this nonsense, that requires that 75% of climate change denier posts are misleading.

Better to drop both the name-calling ('fear-mongering'? really?) and the gross generalizing. Then maybe we can have an intelligent conversation.

In the post you seemed to object to, I was referring to making claims based on statistical insignificance, as when many climate deniers misunderstood Phil Jones' remarks about warming since 1995 not being statistically significant as evidence that warming has stopped. A statistically insignificant warming trend isn't evidence either way. This is not the sort of error Dr. Ambaum is talking about. Are you aware of instances where climate scientists have made this error?

TOP,
Be sure you are not misinterpreting the author as saying climate scientists should be making weaker claims or that they are publishing "statistically significant" results that if tested the way the author thinks they should be would be insignificant. Chances are climate scientists would use Bayesian statistics to show that they can make even stronger claims of confidence. For example, because the physics of climate lead you to believe it should be warming with high probability, you can combine this prior probability with your analysis of the temperature data to give an even stronger confidence in the existence of a warming trend than you would have otherwise. If Phil Jones had followed Dr. Ambaum's advice when calculating statistical significance, he would have said something far less useful to those trying to cast doubt on warming.

But I think he was right not to do it that way, as I mentioned above. And I should qualify that by saying I haven't read the paper, only this post, so maybe I don't understand what it is Dr. Ambaum thinks they should be doing when analyzing data.

Thank you all for your reactions to my post. I hope you don't mind if I stick my oar in on some of the topics you raised. If I have overlooked something, please let me know. Sorry for the somewhat rambling response here ...

Re post 1, and the pirate-global mean temperature correlation: Alexandre is of course right to say that we need statistics and physics to make any progress. What I am highlighting, though, is not that specific issue (which is serious and important in itself). I am highlighting that significance tests are used to give certain statistical results higher "credibility" than others, based on a largely spurious test. So it is the selection of statistical results that I am objecting to, not the statistical results per se.

Some posts (specifically Steve L) refer to the frequentist vs Bayesian discussion. This is interesting in itself, but in my paper I am simply applying Bayes' equation, which a frequentist would also accept as indisputable. The difference comes in the interpretation of the meaning of these probabilities. Indeed, significance tests have a clear frequentist flavour, while hypothesis tests have a much more Bayesian flavour. I think it is hard to escape that scientific hypotheses naturally fit a Bayesian framework. Nonetheless, I think the distinction between Bayesian and frequentist interpretations is largely irrelevant to the discussion at hand.

Several posts point out that scientists should know about this and also that climate science should not be singled out. Indeed, in my paper I point to more general references which highlight the misuse of significance tests in a wide spectrum of fields (medicine, economics, sociology, psychology, biology, ...). In fact, I suspect that your average research psychologist knows more about the pitfalls of significance tests than the climate scientist. In those "softer" fields, people have had to rely mainly on statistics from the start and therefore needed to know how to use statistics from day one. In those fields, many people have pointed this problem out (and it still seems to persist).

Climate science has always been a subfield of physics, where significance tests are largely irrelevant. I bet that most physicists (by training, I am a theoretical physicist myself) didn't get a stats course in their curriculum! However, these days more and more geographical thinking seems to enter the field of climate science with the resulting lack of rigour and physical underpinning. Many climate scientists have become geographers of their model worlds!

Also, the point I am making is not new: many people are aware of the problems with significance tests, and many people have pointed it out before (although most practitioners probably believe that climate scientists would know better). It boggles the mind that the error keeps on being propagated - surely an interesting question for a psychologist or sociologist to get their teeth into. I do have an opinion about why this may be, but that would make this post even longer.

Regarding the somewhat rambling posts about 75% of papers being misleading in part: I claim that 75% of papers (in my own paper I clearly state that this is based on just 1 (one) sample and make no claim regarding its statistical significance!) make a technical misuse of significance tests: they use them to select or highlight certain statistical results in favour of others.

Perhaps I should write a post where I discuss what significance tests can be used for (largely for debunking fake hypotheses, but even this is an application with its own pitfalls). However, this is generally not how significance tests are presented in the literature. The latter of course follows from the fact that very few scientists would publish negative results (in fact, they would probably have a hard time getting them past the reviewers).

Some people, including John Cook himself, pointed me to a post by Tamino. Tamino also highlights some further points from my original paper. Let me just add two little comments to Tamino's interesting post: Tamino states that "I’ve certainly struggled to emphasize to colleagues that a highly significant statistical result does not prove that one’s hypothesis is true, it merely negates the null hypothesis." This is again the error of the transposed conditional: a low p-value does not negate the null-hypothesis, it just indicates that our statistical result would be unlikely in case the null-hypothesis were true. It is remarkable how easily we can stray into this error. Tamino also seems to indicate that the p-value does provide useful quantitative information. I cannot find any evidence in his post of this. Yes, the p-value is quantitative, but its usefulness is never really made clear. The p-value is perhaps an indication of the signal-to-noise ratio; a high p-value means that it will be difficult to see any evidence of any claimed effect. A low p-value indicates very little really: we want to study the validity of some hypothesis assuming it is false; some attempt at a reductio ad absurdum proof of your hypothesis - unfortunately it is not quite that ...

I strongly disagree that scientists should not bother with Bayesian statistics, especially in the case of statistical significance tests. There is rather more to Bayesianism than Bayes' rule (which is a fundamental law of probability whether Bayesian or frequentist); the very definition of what a probability actually is, is an argument in favour of the Bayesian framework in this case. The problem with the frequentist approach to statistical significance tests is that it fundamentally cannot assign a probability to the truth of a hypothesis, because a hypothesis is either true or it isn't; its truth is not a random variable and has no long run frequency (the frequentist definition of a probability). Unfortunately the probability of the alternative hypothesis being true is exactly what we want to know! Fortunately the Bayesian definition of probability is based on the state of knowledge regarding the truth of a proposition, so the Bayesian framework can directly assign a probability to the truth of a hypothesis. Generally in science it is best to carefully formulate the question you want to ask, and then choose a method that is capable of giving a direct answer to that question. As such the Bayesian approach is perfectly respectable, if not preferable. The frequentist approach can only give an indirect answer, telling you the likelihood of the observations assuming the null hypothesis is true, and leaving it up to you to decide what to conclude from that. Most of the problems with frequentist statistical tests lie in mistaking the indirect answer to the key question for a direct (Bayesian) one.

The Bayesian approach is more than a means of aggregating evidence; one of the most important benefits of the Bayesian approach is that it gives a mechanism to properly incorporate the fact that you know you don't know something, by assigning a non- or minimally-informative prior on it and marginalising it out of the analysis. For instance, if you want to model the impacts of climate change, it is incorrect to assume we know the exact value of climate sensitivity (for instance by picking the maximum likelihood value); instead we should integrate it out by computing an average of the impacts for each value of climate sensitivity weighted by its plausibility according to what we do know.
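
To make the idea of integrating out climate sensitivity concrete, here is a minimal sketch. The sensitivity values, their plausibility weights and the impact function are all made up for illustration (none of them come from this discussion or from any study), and the "best guess" is simply the most plausible value in the list, standing in for a maximum likelihood value.

```python
def expected_impact(sensitivities, weights, impact):
    """Average the impact over all sensitivity values, weighted by plausibility."""
    total_weight = sum(weights)
    weighted_sum = sum(w * impact(s) for s, w in zip(sensitivities, weights))
    return weighted_sum / total_weight


def impact(sensitivity):
    """Hypothetical impact function: damage grows faster than linearly."""
    return sensitivity ** 2


# Hypothetical sensitivities (degrees per CO2 doubling) and plausibility weights.
sensitivities = [1.5, 2.0, 3.0, 4.5, 6.0]
weights = [0.05, 0.20, 0.45, 0.22, 0.08]

# Marginalised answer: average over everything we consider plausible.
marginal = expected_impact(sensitivities, weights, impact)

# Plug-in answer: pretend the single most plausible value is exactly right.
best_guess = sensitivities[weights.index(max(weights))]
plug_in = impact(best_guess)

print(f"marginalised impact: {marginal:.2f}")
print(f"plug-in impact:      {plug_in:.2f}")
```

With a nonlinear impact function the two answers differ, which is the reason for not simply plugging in a single best estimate.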

"When do Bayesian statistics matter? When the prior probability is extreme (very likely or very unlikely). So if the chance of a woman your age has breast cancer is 1 in 1000, and mammograms have a 1 in 100 false positive rate, and you had one done as part of a routine checkup and it came back positive, Bayesian statistics tells us that chances are you don't have cancer."

In this case, the Bayesian result exactly coincides with that from the frequentist approach. The only difference is that the Bayesian approach allows you to formulate the question in terms of an individual patient, rather than a randomly selected member of some population with the same test results.
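
The arithmetic behind the mammogram example can be sketched as follows. The prior (1 in 1000) and the false positive rate (1 in 100) are taken from the quoted comment; the sensitivity of the test is not given there, so the value of 0.9 below is purely an assumption for illustration.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """Bayes' rule: probability of disease given a positive test result."""
    true_positives = prior * sensitivity
    false_positives = (1.0 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)


prior = 1.0 / 1000.0                # from the quoted comment
false_positive_rate = 1.0 / 100.0   # from the quoted comment
sensitivity = 0.9                   # assumed, not stated in the comment

p = posterior_probability(prior, sensitivity, false_positive_rate)
print(f"P(cancer | positive test) = {p:.3f}")
```

Even if the test never missed a real case (sensitivity of 1), the posterior probability would stay below 10%, because the false positives from the healthy majority outnumber the true positives.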

"Bayesian statistics that combining your result with the prior gives you 99% confidence and, presto! a statistically significant publishable result! Of course not."

Indeed not! Bayesian conclusions are only as strong as the priors used, if you could show the priors were unreasonable then you could reject the result of the test (and the paper). If you can't question the prior, you are logically forced to accept the result of the test. The good thing about the Bayesian approach is that the priors are explicitly stated. If you disagree with the use of priors on the hypothesis, you could always use a "significance test" based on Bayes factors instead, where the priors (on the hypotheses) do not appear in the analysis.

"And if someone else finds further evidence and publishes a paper showing P(M5|N), well now that Bayesian analysis you did in your paper to get P(N|M1..M4) is out of date."

That is equally true of any frequentist analysis - if your information changes, your view on the truth of the hypothesis should also change, whatever form of analysis you choose.

"But that calculation of P(M4|N) stands, and will forever be useful as a piece of the evidence used to assess P(N)."

That is only correct if M4 is independent of M1-M3 & M5 (otherwise it is the so-called Naive Bayes approach), which in the case of climate change is rather unlikely as rising levels of atmospheric carbon dioxide are posited to be a causal factor for a great many phenomena.

"And in this analogy we can't perfectly do the Bayesian calculation because we don't really know what fraction of the population has cancer"

This is incorrect, the whole point of the Bayesian formulation is that it allows you to deal rationally with the fact that you don't know something, or that you have imperfect knowledge of something. You choose a prior distribution that captures what you do and don't know about it and marginalise. The perfect Bayesian calculation reflects the consequences of that uncertainty.

", except for what we infer through these tests."

This is incorrect, the operational priors are estimated from epidemiological studies, not just from diagnostic tests followed by biopsies.

"But you don't subject patients to tests that tell 1 in 5 healthy people they have cancer"

Neither a competent Bayesian nor a competent frequentist statistician would do so.

Eric L. @19: I agree there, however given sufficient data it is similarly virtually always possible to get a statistically significant result even if the effect size is negligible, which is the flip side to the same coin. A common criticism of frequentist statistical tests is that we almost always know from prior knowledge that the null hypothesis is false from the outset. For instance with temperature trends, do we really think the trend is actually exactly zero?
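
A quick way to see the "flip side" is to hold a negligible effect fixed and let the sample size grow. The sketch below is idealised: it assumes independent data with a known standard deviation, a two-sided z-test of "mean = 0", and a sample mean that happens to equal the true effect exactly. The effect size of 0.01 standard deviations is a made-up number.

```python
import math


def two_sided_p_value(effect, sigma, n):
    """p-value of a z-test of 'mean = 0' when the sample mean equals `effect`."""
    z = effect / (sigma / math.sqrt(n))
    # For a standard normal, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


effect = 0.01   # hypothetical, negligible effect (in units of sigma)
sigma = 1.0

for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,d}   p = {two_sided_p_value(effect, sigma, n):.3g}")
```

The effect is equally negligible in all three cases; only the amount of data changes, yet the p-value goes from unremarkable to vanishingly small.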

Anyway the differences between the two frameworks are a fascinating topic in their own right; you need a really solid understanding of both frameworks to know which tool to use for which job.

PS: I would like to highlight Tom Dayton's excellent contribution (no 16, above - I don't know how to include internal links - sorry). I agree very much with what he says.

But may I just add that significance tests are perhaps not as innocent as he makes them out to be. Indeed, they are usually only a small part of the evidence, but I have been involved in discussions where an important part of the argument was whether a certain link, as measured by linear correlation, was "significant" (in the statistical meaning). This was very much an instance of explorative data analysis, where some link was posited, with only tenuous indications this link should be there, and where significance tests were an important part of the argument. Interestingly, that claimed link has now become part of mainstream climate literature (I am referring to "annular modes" which appear to indicate a connection between Atlantic and Pacific pressure patterns) and a large number of people have by now stopped worrying whether this implied link is really present. This is a feature of significance tests in general: perhaps many people do not mean to say that a low p-value is evidence for their hypothesis, but publishing the low p-value along with phrases such as "this or that effect is significant at the 95% level" certainly seems to imply that they want to use these statistics as positive evidence at face value.

The terminology used in reporting the results of statistical tests is indeed a thorny issue. Tamino's comment that "it merely negates the null hypothesis" would have been O.K. if he had instead written "is enough for us to reject the null hypothesis". The difference, while subtle, is very important; "negating" the null hypothesis is a statement that the null hypothesis is false, while "rejecting" the null hypothesis is merely a statement that we have made a subjective (if perhaps very reasonable) choice not to believe the null hypothesis based on the evidence - but stops well short of saying that it is false. Essentially it is only a convention that we "reject" the null hypothesis if the p-value falls below some critical value (which is also a subjective choice) - nothing more.

The two phrases we should use would be something along the lines of "we can reject the null hypothesis" or "we are unable to reject the null hypothesis" - the frequentist test doesn't really give a basis to make any statement about the alternative hypothesis (note the alternative hypothesis doesn't actually appear in the frequentist test - so perhaps that isn't surprising!).

Perhaps I misunderstand you and Tamino, but a low p-value cannot objectively be used to reject a null-hypothesis; it simply does not contain the required information to do so. I formalize this in my paper, if you like to know more.

On the other hand, a high p-value indicates that the presented evidence is easily consistent with the null hypothesis. This is not evidence that the null-hypothesis is true; the evidence could also be consistent with the alternative hypothesis. A significance test simply contains no information either way. Using Occam's razor we can then conclude that there is no evidence for our hypothesis, so we better stick with the null-hypothesis. It is Occam's razor that makes the argument here, not the significance test.

Hi Maarten, I fully agree the p-value can't be used to draw a fully objective conclusion; it is a subjective choice to disregard the null hypothesis based on a convention/tradition amongst frequentist statisticians (Occam's razor being a large part of the motivation) - nothing more (I checked with a frequentist colleague before I wrote that ;o). The distinction between rejecting and negating the null hypothesis is the key to the point I was making. Essentially we need to employ a form of words that emphasises the fact that we are choosing (not to) accept the null hypothesis, rather than that we have established that it is (not) true. When I was taught stats, that was the motivation given for saying "we reject the null hypothesis" rather than a positive statement about the alternative hypothesis or claiming that the null hypothesis is false. Essentially rejecting implies a choice, rather than a rational necessity.

In short - I agree!

BTW, the p-value fallacy doesn't just appear in science, I have seen this error made in statistical methodology papers I have reviewed. It certainly isn't limited to climatology!

Perhaps I just need to see an example of Bayesian significance testing done right to understand the way you and Dr Ambaum think this should be done.

"one of the most important benefits of the Bayesian approach is that it gives mechanism to properly incorporate the fact that you know you don't know something, by assigning a non- or minimally-informative prior on it and marginalising it out of the analysis"

Does a Bayesian analysis with a minimally informative prior often lead to a different result than a frequentist approach?

I must confess that my knowledge of Bayesian statistics comes entirely from studying data mining/machine learning, so there may be a side to this I'm missing from not having studied more stats. In that class one thing we were taught is that if you don't really know the prior the most common thing to do is assume it's 50/50. Is that the sort of thing you mean by minimally informative prior?

"Bayesian conclusions are only as strong as the priors used, if you could show the priors were unreasonable then you could reject the result of the test (and the paper). If you can't question the prior, you are logically forced to accept the result of the test."

It still seems to me to be a question of what the point of the work you're doing is and what you can add to the body of knowledge. Let's assume I am an expert in dendrochronology, and I core a few trees in my backyard. Now I need to calculate a prior probability for observing warming in that data set. One way I might do that is by looking at the evidence from atmospheric physics and other areas outside my expertise and decide how likely this should be, but why would I be the one to do this when that really isn't my field and I'm likely to screw it up, I just know all there is to know about tree rings? Or are you suggesting I use a non-informative prior? Let's say I did the full analysis and found that with 99% confidence given changes in various forcings and our range of sensitivity estimates the data should show an upward trend of .15 degree/decade or more. And then I did some calculations on my little data set that any frequentist would sneer at and calculated a posterior probability of 99% for my hypothesis. Have I used my knowledge as a dendrochronologist to contribute anything to the state of our knowledge about climate? My result comes from my prior calculation, the part of my work I'm least qualified to do, meanwhile the actual data I've collected is superfluous (and I should have collected more of it, as a frequentist statistical significance test would have told me).

I do think a Bayesian analysis by someone who was an expert in such things that combined various lines of evidence from many subfields of climate research and tried to establish probabilities for various climate related hypotheses would be an interesting work, but it's not reasonable or useful to expect every researcher to do this in the process of establishing their result, and indeed Dr. Ambaum's research shows pretty conclusively that most would not be competent to do it. If on the other hand you want most scientists to replace frequentist significance tests with Bayesian tests with non-informative priors to show they've learned at least that much about stats and know what their confidence values mean, I guess I'm okay with that, but I doubt it would change anyone's results much beyond changing the confidence values by a small amount.

I do think scientists should not put their confidence values front and center as if they are the results, better to focus on estimating the magnitudes of effects, but do some kind of confidence calculation just to keep yourself honest and make yourself less likely to publish garbage. But if you think that's the main value that comes from confidence calculations in science (and I do) rather than determining whether we should be 96% certain or 99.3%, then a frequentist approach will generally work okay and if the result is your paper leads people to believe that climate sensitivity is 3.2 when you really do have good reason to believe it is 3.2, then your paper isn't particularly misleading just because there may be a better way you could have done your confidence calculation.

In my opinion you are making too much of the frequentist vs Bayesian discussion. I think it is not that central to whether you think significance tests are useful or not. Also a frequentist would agree with the statement that the p-value does not contain enough information to calculate the probability of the truth of a hypothesis, or the null hypothesis (such statements can be perfectly well framed in frequentist terms).

Regarding the dendrochronologist, this is an example that is very interesting. Equation 6 in my paper states how to view this. It is simply Bayes equation written in terms of prior and posterior odds:

posterior odds = prior odds x p(M| not N) / p(M|N)

where I used the notation as in the post above (note the p(M|N) is the p-value). So whether your confidence in the global warming hypothesis has been increased by your tree work depends on whether the p-value is smaller than the probability to see your measurement in situations that we know there is global warming. This statement is independent of the prior odds; the actual posterior odds of course do depend on the prior odds. In other words, every single measurement increases our knowledge (changes our confidence in a hypothesis) in the same way; this is independent of whether you were a "believer" or not to start with.
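
As a small illustration of the odds form of Bayes' equation above, the sketch below applies the same measurement to three different prior odds. The notation follows the post (N is the null hypothesis, M the measurement, and p(M|N) plays the role of the p-value); the numerical values are hypothetical and chosen only to show the mechanics.

```python
def posterior_odds(prior_odds, p_m_given_not_n, p_m_given_n):
    """Odds form of Bayes' equation, with odds taken in favour of 'not N'."""
    return prior_odds * p_m_given_not_n / p_m_given_n


def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis to a probability."""
    return odds / (1.0 + odds)


p_m_given_n = 0.05       # hypothetical p-value, p(M|N)
p_m_given_not_n = 0.50   # hypothetical chance of seeing M if there is an effect

for prior in (0.1, 1.0, 10.0):   # sceptic, undecided, "believer"
    post = posterior_odds(prior, p_m_given_not_n, p_m_given_n)
    print(f"prior odds {prior:5.1f} -> posterior odds {post:6.1f} "
          f"(probability {odds_to_probability(post):.2f})")
```

In this example every set of prior odds is multiplied by the same factor of ten; the posterior odds differ only because the starting points differ. If p(M| not N) were no larger than the p-value, the measurement would not raise anyone's confidence, however small the p-value.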

This discussion is getting quite long now. I will probably write another post with some of this stuff in sometime soon where I can also comment on the suggestion by HumanityRules. I think John Cook agreed that I could send in another guest post about this subject anyway.

Best wishes to all and thank you very much for your interest in this post and for an interesting discussion,

Maarten, statistics is never used in natural sciences in a way you put it. That is, it simply does not make sense to talk about the probability of hypotheses being true (or false).

It's either true or false. Of course it is entirely possible we are ignorant about its truth value; in that case one should say I do not know (a perfectly legitimate scientific stance), but it surely has a truth value, even if no one was able to determine it so far (provided of course the hypothesis makes sense in the first place).

The Bayesian method you describe could only serve as a heuristic device, but only if we had clear (quantifiable!) picture of prior probabilities regarding our own ignorance. That's almost never the case. If we knew how ignorant we were (having a reliable structural model of our own ignorance), most of the job required to overcome this ignorance would already be completed. However, when heuristics is most needed, we are at the edge of utter darkness, just feeling our way around, not even equipped to make educated guesses about Bayesian priors of our own state of mind regarding the subject matter. In cases like that almost any fractional understanding is better than fake formal methods to arrive at a reasonable conclusion regarding the way forward.

It may be different for decision makers (like politicians or business people) who rely on expert advice in certain matters, but are not equipped to actually understand and evaluate the detailed reasoning behind those expert opinions (they only digest the executive summary, anyway). They may well wonder how likely it is the experts have got it right, and in complicated cases it makes perfect sense for them to seek a quantified description of uncertainty. To ask an independent group of experts to give an estimate of prior probabilities and build a Bayesian model to evaluate reliability of expert propositions may be a way forward. However, in practice extra rounds like that are seldom better than honest expert meta-opinion, expressed in plain language.

There is a more restricted domain where statistics can (and do) come into play in natural sciences. That's measurement laden with noise.

However, in this case there is no room for theoretical ambiguity. We should know pretty much everything about how the signal we are looking for is supposed to look, along with the statistical properties of the noise behind which it is hiding. This knowledge should take the form of a bunch of true propositions about the phenomenon under scrutiny, none of which has a dubious truth value expressible in a probabilistic form.

If this knowledge is given, we should be able to build an adequate statistical model which enables us to recover the signal from noise as much as possible.

Of course the first thing to do is not to rely on statistical speculations, but to improve the signal to noise ratio of measurement whenever it is practicable. Unfortunately in climate studies most of the noise is not from the measurement procedure itself, but it is weather noise, that is, an inherent property of the system itself. There is no way to get rid of it during the measurement phase.

Weather is an open thermodynamic system, and as such it works on the edge of chaos, in other words it is always in critical state (by way of SOC - Self Organized Criticality). Systems like this are characterized by system variables with pink noise characteristics (the noise has random phase and the same power in each octave).

Pink noise is scale invariant with no lower cutoff frequency, therefore system variables like this do not make a natural distinction between weather and climate, no matter how long the averaging window is (that is, how low the upper cutoff). Pink noise is never stationary; it has an arbitrarily long autocorrelation scale.

This is why it is a bit tricky to look for trend (as signal) in a climate variable laden with weather noise. A simple model of a linear trend plus some stationary noise would surely not do (even if mainstream climate science is almost always guilty of using such simplistic models).

Pink noise can have spontaneous excursions on all scales, including extremely low frequency ones (well in the supposed climate range of 30+ years).

You say "A standard answer [to the question if temperatures are rising or not] is to calculate a temperature trend from data and then ask whether this temperature trend is “significantly” upward; many scientists would then use a so-called significance test to answer this question. But it turns out that this is precisely the wrong thing to do."

Yes, but it is not wrong just because the result of an otherwise correctly applied significance test is misused, but in most cases people also apply the wrong significance test (that fails to take into account the very long autocorrelation timescale).
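
To give a rough feel for why autocorrelation matters for a significance test, here is a sketch using a much simpler stand-in than pink noise: a first-order autoregressive (AR(1)) process, for which a commonly used approximation of the number of effectively independent samples is n(1 - r)/(1 + r), with r the lag-1 autocorrelation. This swaps in a different noise model from the one argued for above, and it would, if anything, understate the problem for noise with a very long autocorrelation timescale. The sample size and the values of r are hypothetical.

```python
def effective_sample_size(n, lag1_autocorrelation):
    """Approximate number of independent samples for AR(1) noise."""
    r = lag1_autocorrelation
    return n * (1.0 - r) / (1.0 + r)


n = 120   # e.g. ten years of monthly values (hypothetical)

for r in (0.0, 0.5, 0.9):
    n_eff = effective_sample_size(n, r)
    print(f"lag-1 autocorrelation {r:.1f}: {n} samples behave like about {n_eff:.1f}")
```

A test that treats all 120 values as independent would overstate the significance of any trend whenever r is well above zero.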

The above statements on weather (or climate) noise, critical state, self-organized criticality, pink noise, etc. are simply true statements with no further qualification whatsoever. It is not likely they are true, not even 100% sure, they are simply adequate descriptions of certain aspects of the behavior of open thermodynamic systems with many degrees of freedom.

Still, they are entirely missing from IPCC reports, prepared by experts for decision makers. Phrases like "pink noise" (or "1/f noise") are not even mentioned under http://ipcc.ch. Funny.

In Schmidt and Mann's response to McShane and Wyner, Schmidt and Mann calculate a 99% probability of the last decade being the warmest in the record using the Lasso statistical technique from MW. They then discount that probability to likely (66-90%) claiming unidentified measurement uncertainty and possible systematic errors (page 3). When they discount their statistics that much, does the difference between Bayesian and Frequentist really amount to anything? If climate scientists use frequentist statistics and then discount the result to account for unknown errors they will still have conservative estimates of the actual effects.

"Also a frequentist would agree with the statement that the p-value does not contain enough information to calculate the probability of the truth of a hypothesis, or the null hypothesis (such statements can be perfectly well framed in frequentist terms)."

The first part is certainly true, however the second is not; the frequentist framework does not allow probabilistic statements to be made concerning particular hypotheses. Frequentist statistics can assign probabilities to the occurrence of errors in repeated application of statistical tests, but that is not the same thing (I checked this with my vastly experienced frequentist colleague and he concurs).

If it were true, frequentists could construct a credible interval, rather than a confidence interval, by considering the hypothesis that the true value of a statistic lies within a particular interval. But as far as I know, frequentists cannot construct a credible interval - however I'd be very interested to hear otherwise.

(1) You state that weather is in a state of Self Organized Criticality - SOC. I have been unable to find any references that indicate this; do you have a paper to link to on this subject? A statistical analysis of unforced noise in the climate? While water vapor, ice, and condensation are critical point transitions, weather doesn't seem to display the same behavior as a whole.

In particular, a pink noise 1/f relationship would indicate the largest variations on low frequencies, where what we observe (glacial cycles, for example) is a fairly direct tracking of climate variables (temperature, ice cover, etc.) to historic forcings.

(2) The universe is what it is - that's the final arbiter of our theories. However, our knowledge is imperfect, and our hypotheses are probabilistic, as per the first definition of probability. We can only state that a particular hypothesis is more probable than others given the evidence, the statistics of our data. And whether using Bayesian or frequentist methods, we can estimate from the statistics the probability (second definition) that our hypothesis is supported by that data. That's how induction works, and how we can learn something new.

We can be pretty sure, but we can only work with the evidence we have - we don't have perfect knowledge of anything.

At a certain point we become certain enough to label a particular hypothesis a fact. Gravity, evolution, and, it appears, climate change fall into that category as well. But even the strongest "fact" is supported by our inductive conclusion that the laws of physics are consistent over space and time, and won't change on us - incredibly well supported, but the rules could change tomorrow. Crystalline proofs of the type you describe would be nice, but they don't exist.

BP, I second KR's comment. Relevant also are my responses to Eric (skeptic)'s claim that science has no place for probabilities of the correctness of theories other than 100% certainty, on the thread The Science Isn't Settled. Start with my most recent comment and work backward by clicking the embedded links to the previous comments.