Statistical significance is a measure of how much statistical evidence there is to suggest that an observed effect was due to the independent variable (what is being tested) rather than chance. A significant effect, therefore, is one which would most likely be observed in subsequent experiments - i.e., it's a real effect, not just a fluke. The easiest way to demonstrate significance is with a larger sample size, as it is less likely that a larger number of data points would conspire to be incorrect by chance.

The word "significant", in this sense, does not mean "large" or "important" as it does in the everyday use of the word. Statistically significant effects can, in fact, be very small indeed, although larger sample sizes are required to demonstrate the significance of smaller effects.

In frequentist statistical approaches, statistical significance often arises when reporting the results of hypothesis testing. An alternative hypothesis (that there is an effect) is favored - and a null hypothesis (that there is not an effect) is rejected - if the experimental evidence shows a significant difference from the null hypothesis. If a significant difference is not present, the null hypothesis remains favored and the alternative hypothesis is rejected, because there's a good possibility that the effect was just random.

Abuse

A common misinterpretation of statistical significance is that "a lack of statistical significance" or "no significant difference" are synonymous with "not enough data to draw a conclusion." In fact, a non-significant result is a conclusion in favor of the null hypothesis -- there isn't enough evidence to throw it out and accept the alternative hypothesis. A better way of interpreting "non-significant" is "there's a good chance that the outcome was a random event, and we can't differentiate between two competing effects".

A similar abuse of statistics occurs when journalists or certain agenda pushers ignore the concept of significance entirely, leading to false information being given out to people. In 2005, a report commissioned by the UK government concluded that there had been "no significant increase in drug use in UK schools". Not content with the conclusion that "things aren't that bad, actually", a few newspapers jumped on the report and decided to draw their own conclusions. In their frankly amateurish search for something to data mine, they noticed that cocaine use in schools went from 1% to 2%. These figures had been rounded off for the summary; the actual values were 1.4% and 1.9%, so about a 36% increase rather than a 100% increase. They had their smoking gun: despite what the government concluded, cocaine use had doubled, cocaine was flooding the playground and the government were covering it up. However, the government's conclusion was more accurate, because it took into account significance, clustering and the fact that the use of many different drugs had been polled. If you test many variables, the chances of one of them showing a clear trend by chance increase, and so tests for significance have to be altered appropriately. Upon doing the actual maths, the results were actually very insignificant, essentially produced by accident and the random chance that the sample would have fallen on a cluster of individuals using drugs that wasn't representative of the whole population.[1]
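
As a rough check of the arithmetic above, the following Python sketch compares the increase implied by the rounded figures with the one implied by the unrounded figures. The function name relative_increase is made up for this illustration, and the calculation says nothing about significance; it only shows how much the rounding inflated the headline number.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Figures as rounded off for the summary (1% and 2%).
rounded = relative_increase(1.0, 2.0)

# Figures before rounding (1.4% and 1.9%).
unrounded = relative_increase(1.4, 1.9)

print(f"rounded figures:   {rounded:.0f}% increase")
print(f"unrounded figures: {unrounded:.1f}% increase")
```

The rounded figures suggest a doubling, while the unrounded figures give an increase of a little under 36%.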

Problems with statistical significance

In most statistical methods, significance is defined by some threshold value, often referred to as the alpha level. This threshold is usually set at 0.05 or less. This means that, if there were really no effect, there would be a less than five percent chance of getting results at least this extreme by chance alone. There is nothing fundamentally magic about an alpha level of 0.05, yet after many generations of use in analysis it seems to have taken on a certain magical value for many sciences. If a statistical test comes back with p=0.04 the results are called significant, and if p=0.06 they are called non-significant.

With this standard alpha level, about 1 in 20 tests should come back significant when there really is no effect. This can and does happen frequently, so it is wrong to assume that a good p-value means you're completely certain; it's still all about probability. In individual experiments that run many statistical tests this is a problem: if you run 40 tests where there is no real effect, about 2 of them will show an effect that is not really there. The chance of getting at least one such false positive is often referred to as the family-wise error rate; it is difficult to control for, but some measures can be used. While it is easy to see this problem in a single set of experiments in a single paper, the same phenomenon emerges if a bunch of single experiments are published in multiple papers. With the thousands of experiments run every day all over the world, a very large number of them will show statistical significance when there really is no effect at all. Publication bias in journals exacerbates this problem, because journals rarely publish experiments that show only a non-effect (i.e., "failed" experiments) and are much more likely to publish papers that show an effect. So you wind up with a massive uncontrolled bias in the published papers towards showing statistical significance where there really is none.
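
To put rough numbers on this, here is a small Python sketch. It assumes that every test is independent and that there is no real effect behind any of them; both function names are invented for illustration. Under those assumptions the expected number of false positives is the number of tests times alpha, and the chance of at least one false positive is 1 - (1 - alpha)^m for m tests.

```python
def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results when no effect is real."""
    return num_tests * alpha


def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests.

    Assumes the tests are independent and every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 5, 20, 40):
    print(
        f"{m:>3} tests: expect {expected_false_positives(m):.2f} false positives, "
        f"P(at least one) = {family_wise_error_rate(m):.3f}"
    )
```

Under these assumptions, 40 tests give an expected 2 false positives and roughly an 87% chance of at least one.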

(Ab)use in pseudoscience

This is one reason why picking out a single test in a single paper to make a point is meaningless. It is a common tactic in pseudoscience to search through thousands of papers to find the one result that is significant and makes the desired point. Real science must be backed by the preponderance of evidence, and experimental results need to be replicated repeatedly and reliably before they are incorporated into the body of accepted knowledge. This is why scientific consensus is important, and why quacks and cranks who go against this consensus do not gain points by finding a single example in a paper that might support their claims.

The problems above are due mostly to the use of frequentist approaches to statistical analysis. There is a growing movement of scientists who are encouraging the use of Bayesian statistics. Bayesian approaches are not subject to the same sort of systematic error propagation issues as frequentist approaches (although they are subject to their own unique sets of issues).

P-value fishing

"P-value fishing" is a pejorative term for a statistical sleight of hand often abused by cranks and those with an agenda to push. There are two common ways to get a statistically significant result that doesn't mean much at all. The first is, in studies with a large number of variables, to run comparisons of all the variables and hope that something comes out significant. Proper methodology dictates that the experimenter choose beforehand which variables are being compared and run post-hoc corrections on any further comparisons. Otherwise, just comparing as many variables as possible will eventually turn up a significant result, though it's likely to be statistical noise. The post-hoc correction accounts for how many of these comparisons would be expected to come out significant by chance, so if the analysis comes up with no more significant results than the correction allows for, they are still insignificant.[2]

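One widely used correction of this kind is the Bonferroni correction, which divides the alpha level by the number of comparisons. It is used here only as an example; the text above does not say which correction is meant. The p-values in this Python sketch are made up for illustration, as is the function name.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values stay significant after a Bonferroni correction.

    Each p-value is compared against alpha divided by the number of
    comparisons, rather than against alpha itself.
    """
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from ten comparisons in one study (not real data).
p_values = [0.003, 0.04, 0.20, 0.011, 0.65, 0.048, 0.09, 0.33, 0.72, 0.51]

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni_significant(p_values)

print("significant without correction:", sum(uncorrected))
print("significant after correction:  ", sum(corrected))
```

With these made-up values, four comparisons fall below 0.05 on their own, but only one survives the corrected threshold of 0.005.
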
The second trick is to fish for p-values by cranking up the number of subjects until significance is achieved. Normally it's good to have more subjects, but the data should be interpreted in light of the sample size. What often happens with a large subject pool is that even a slight difference in means becomes significant even though the effect size is close to nothing. This is why it's important to look at the effect size in addition to the p-value.[3]
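
The Python sketch below illustrates this with a two-sample z-test. It assumes equal group sizes, a known common standard deviation and an observed difference in means of exactly 0.02 standard deviations (a tiny effect size). These assumptions, and the function name, are chosen for illustration only.

```python
import math


def two_sided_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test.

    Assumes both groups have n_per_group subjects, share a known standard
    deviation, and that the observed difference in means equals
    effect_size standard deviations.
    """
    # Standard error of the difference in means is sd * sqrt(2 / n),
    # so the z statistic is effect_size * sqrt(n / 2).
    z = effect_size * math.sqrt(n_per_group / 2.0)
    # For a standard normal, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


tiny_effect = 0.02  # difference in means, in units of standard deviations

for n in (100, 1_000, 10_000, 100_000):
    p = two_sided_p_value(tiny_effect, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>7}: p = {p:.6f} ({verdict} at alpha = 0.05)")
```

The effect size is the same tiny 0.02 in every row. Only the number of subjects changes, yet the p-value drops below 0.05 once the groups are large enough.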

Proposed solutions to the problems

One approach has been to argue that statistics needs to lose its magical status in science as something analogous to a proof, and should instead be seen as an argument. The p-value is just one piece in the broader perspective and should be weighed against other types of evidence. Rather than assigning a magical threshold value, p-values can be reported directly, allowing people to integrate them with other evidence in making their conclusions. If the other evidence is weak, maybe a p-value of 0.05 is not convincing; if all the other evidence is strong, maybe a p-value of 0.1 is good enough.