“Quality control” (rather than “hypothesis testing” or “inference” or “discovery”) as a better metaphor for the statistical processes of science

I’ve been thinking for a while that the default ways in which statisticians think about science—and in which scientists think about statistics—are seriously flawed, sometimes even crippling scientific inquiry in some subfields, in the way that bad philosophy can do.

Here’s what I think are some of the default modes of thought:

– Hypothesis testing, in which the purpose of data collection and analysis is to rule out a null hypothesis (typically, zero effect and zero systematic error) that nobody believes in the first place;

– Inference, which can work in the context of some well-defined problems (for example, studying trends in public opinion or estimating parameters within an agreed-upon model in pharmacology), but which doesn’t capture the idea of learning from the unexpected;

– Discovery, which sounds great but which runs aground when thinking about science as a routine process: can every subfield of science really be having thousands of “discoveries” a year? Even to ask this question seems to cheapen the idea of discovery.

A more appropriate framework, I think, is quality control, an old idea in statistics (dating at least to the 1920s; maybe Steve Stigler can trace the idea back further), but a framework that, for whatever reason, doesn’t appear much in academic statistical writing or in textbooks outside of the subfield of industrial statistics or quality engineering. (For example, I don’t know that quality control has come up even once in my own articles and books on statistical methods and applications.)

Why does quality control have such a small place at the statistical table? That’s a topic for another day. Right now I want to draw the connections between quality control and scientific inquiry.

Consider some thread or sub-subfield of science, for example the incumbency advantage (to take a political science example) or embodied cognition (to take a much-discussed example from psychology). Different research groups will publish papers in an area, and each paper is presented as some mix of hypothesis testing, inference, and discovery, with the mix among the three having to do with some combination of researchers’ tastes, journal publication policies, and conventions within the field.

The “replication crisis” (which has been severe with embodied cognition, not so much with incumbency advantage, in part because to replicate an election study you have to wait a few years until sufficient new data have accumulated) can be summarized as:

– Hypotheses that seemed soundly rejected in published papers cannot be rejected in new, preregistered, and purportedly high-powered studies;

– Inferences from different published papers appear to be inconsistent with each other, casting doubt on the entire enterprise;

– Seeming discoveries do not appear in new data, and different published discoveries can even contradict each other.

In a “quality control” framework, we’d think of different studies in a sub-subfield as having many sources of variation. One of the key principles of quality control is to avoid getting faked out by variation—to avoid naive rules such as reward the winner and discard the loser—and instead to analyze and then work to reduce uncontrollable variation.

Applying the ideas of quality control to threads of scientific research, the goal would be to get better measurement, and stronger links between measurement and theory—rather than to give prominence to surprising results and to chase noise. From a quality control perspective, our current system of scientific publication and publicity is perverse: it yields misleading claims, is inefficient, and it rewards sloppy work.

The “rewards sloppy work” thing is clear from a simple decision analysis. Suppose you do a study of some effect theta, and your study’s estimate will be centered around theta but with some variance. A good study will have low variance, of course. A bad study will have high variance. But what are the rewards? What gets published is not theta but the estimate. The higher the estimate (or, more generally, the more dramatic the finding), the higher the reward! Of course, if you have a noisy study with high variance, your estimate of theta can also be low or even negative—but you don’t need to publish those results; instead you can look in your data for something else. The result is an incentive to have noise.
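This selection effect is easy to simulate. In the sketch below (a hypothetical setup; the effect size, standard errors, and publication threshold are all made-up numbers), a careful lab and a noisy lab study the same small effect, and only dramatic estimates get “published”:

```python
import random

random.seed(1)

def published_estimates(true_theta, se, n_studies, threshold):
    """Simulate many studies of the same effect; 'publish' only the
    estimates that clear a dramatic-finding threshold."""
    estimates = [random.gauss(true_theta, se) for _ in range(n_studies)]
    return [e for e in estimates if e > threshold]

theta = 0.1  # small true effect
careful = published_estimates(theta, se=0.05, n_studies=1000, threshold=0.5)
noisy = published_estimates(theta, se=1.0, n_studies=1000, threshold=0.5)

# The noisy lab clears the publication bar far more often, and its
# published estimates grossly overstate the true effect.
print(len(careful), len(noisy))
print(sum(noisy) / len(noisy))
```

The high-variance study wins under this reward structure even though it carries less information, which is exactly the perverse incentive described above.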

The above decision analysis is unrealistically crude—for one thing, your measurements can’t be obviously bad or your paper probably won’t get published, and you are required to present some token, such as a p-value, to demonstrate that your findings are stable. Unfortunately, those tokens can be too cheap to be informative, so a lot of effort has to be expended to make research projects look scientific.

But all this is operating under the paradigms of hypothesis testing, inference, and discovery, which, I’ve argued, do not provide a good model for the scientific process.

Move now to quality control, where each paper is part of a process, and the existence of too much variation is a sign of trouble. In a quality-control framework, we’re not looking for so-called failed or successful replications; we’re looking at a sequence of published results—or, better still, a sequence of data—in context.
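As a concrete sketch of what that could look like, here is a minimal Shewhart-style individuals chart applied to a hypothetical sequence of effect estimates from successive studies (the numbers, including the one aberrant study, are invented for illustration):

```python
def control_limits(xs):
    """Center line and 3-sigma limits for an individuals (XmR) chart,
    with sigma estimated from the mean moving range."""
    center = sum(xs) / len(xs)
    moving_ranges = [abs(a - b) for a, b in zip(xs[1:], xs[:-1])]
    sigma = (sum(moving_ranges) / len(moving_ranges)) / 1.128  # d2 for n=2
    return center, center - 3 * sigma, center + 3 * sigma

# Hypothetical sequence of published effect estimates in one sub-subfield.
estimates = [0.12, 0.08, 0.15, 0.11, 0.10, 0.13,
             0.09, 0.14, 0.12, 0.75, 0.11, 0.10]
center, lower, upper = control_limits(estimates)

# Points outside the limits suggest a special cause: not a discovery to
# celebrate or a failed replication to lament, but a signal to investigate.
flagged = [i for i, x in enumerate(estimates) if not lower <= x <= upper]
```

The point of the chart is not to accept or reject any single study but to look at the whole sequence in context, asking whether its variation is consistent with a stable process.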

I was discussing some of this with Ron Kenett, and he sent me two papers on quality control.


The Empire of Chance, a book by Gerd Gigerenzer and others, does discuss quality control statistics—the Neyman-Pearson tradition. If I recall correctly, that is whence p-values came: for deciding when a sampled set of mass-produced products contains enough flaws to scrap the sampled production run.

I was going to say – you don’t have to look far for the quality control tradition, wasn’t that Student’s entire job for the Guinness Brewery? Testing grain and beer batches for consistency & quality at economical costs.

Yes, the Neyman-Pearson approach is related to accept-reject rules in inspection. But this is not the same as the quality control of Shewhart, Deming, etc., a key principle of which was not just to measure error rates but to work on reducing the variation of the process.

Yes, now you are making even more sense to me. I think, more fundamentally, that some social sciences are becoming less and less distinguishable from marketing enterprises, and current measures have only reinforced this trend. But with preprints, preregistrations, blogs, & other venues, we can begin to examine research more continuously & hopefully more transparently.

Great post! My first thought after digesting it was “Cargo Cult Statistics”.

As Tom and gwern note, quality control is part of the underlying tradition. But the original, primitive techniques have been hijacked as incantations, with no real appreciation of why they matter (and have been superseded by stronger measures).

Traditional statisticians unintentionally reinforce the cult by continuing to structure statistical education along lines that give “hypothesis testing”, “inference”, and “p-value” their power.

There’s a whole body of quality literature that would inform future iterations of this discussion.

The first principle is that what gets measured gets controlled. If we have a good measure for “variation of the [scientific] process,” we can reduce it.

Donald Wheeler’s writings (http://spcpress.com) are largely about measurement in the presence of uncertainty and measuring process behavior (e.g. he refers to control charts as “natural process behavior charts”). While I think that statisticians often get bent out of shape by the liberties Wheeler takes, much of his work seems to me directly relevant to the idea of dealing with noise and measurement uncertainty in assessing scientific results.

Al Endres wrote a whole book on applying Juran’s philosophy of quality to corporate R&D environments: Improving R&D Performance the Juran Way (http://a.co/0GFVWbE). I think there’s sufficient overlap between the scientific process discussed here and the quality-based R&D processes discussed in the book that much of the book should be relevant.

The part that I think is missing in the above is the selection of appropriate p-values (or likelihood ratios, or whatever). This, I think, is a risk-management issue rather than a statistics issue. Throughout the quality literature, the underlying application of risk management is understood but not often discussed, risk management being intimately tied to business strategy and controlled through project-management methods. I think this is discussed most explicitly in the quality literature on lot quality assurance sampling, where alpha and beta levels are indirectly selected (usually by the customer), and updated or changed based on a business evaluation of the acceptable risk. There are probably some relevant lessons for science there.
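For lot quality assurance sampling specifically, the alpha and beta of a single-sampling plan are just binomial tail probabilities, so the risk trade-off can be computed directly. A minimal sketch (the plan and the quality levels are made-up numbers, not taken from any standard):

```python
from math import comb

def binom_cdf(c, n, p):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

# Hypothetical plan: inspect n = 80 items, accept the lot if at most c = 3
# are defective.
n, c = 80, 3
aql, ltpd = 0.01, 0.08  # assumed acceptable and rejectable quality levels

alpha = 1 - binom_cdf(c, n, aql)  # producer's risk: a good lot is rejected
beta = binom_cdf(c, n, ltpd)      # consumer's risk: a bad lot is accepted
```

Here the customer’s tolerance for beta, not any statistical convention, is what drives the choice of n and c—which is the risk-management point made above.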

Andrew said, “Move now to quality control, where each paper is part of a process, and the existence of too much variation is a sign of trouble. In a quality-control framework, we’re not looking for so-called failed or successful replications; we’re looking at a sequence of published results—or, better still, a sequence of data—in context.”

I think this describes part of quality, but it doesn’t mention the important point that each study in the sequence should incorporate improvements based on studying what was less than optimal (e.g., measurement, design, model) in the preceding studies.

“One might think of two golfers playing iron shots from the fairway. One aims for the green: she either meets the specification for being on the green, in which case she is ready to go into the next stage of production, putting, or she misses the green and is out of specification, doing rework in the rough. The other player is not content with the criterion of being either on or off the green. She takes the hole itself for her target and aims for the flag with the criterion of getting as close to the hole as possible; as her game improves her mean square error becomes smaller, and she works continually to reduce it even further.”

Discussing this in terms of “Quality Control” as meaning “Statistical Quality Control” is too limited. I will speak as someone who has designed components for nuclear reactors, where high quality is essential, and you can’t afford to be wrong about it.

A preferred term, BTW, is “Quality Assurance”, which is broader than “QC”. The premier concept is that you cannot inspect quality into a product. You have to have a reliable design and production system in place. Then you can use inspection to make sure that system still operates as intended. If it doesn’t, then you try to diagnose it and make adjustments.

As one example, let’s say you have specified the use of a certain kind of stainless steel pipe. Perhaps it will go into making a steam generator. That pipe has to have been made from certain starting material, and fabricated in certain ways. It will come with certification that provides some reassurance that this is the case. For example (a made-up but representative kind of thing), perhaps a certain lubricant can’t be used when drawing the material into the finished pipe, because that lubricant gets into the grain boundaries and will later cause cracking after years of exposure to fast neutron flux. After cleaning you can’t detect that the lubricant was used, but it will still cause cracking.

Now it can happen that the pipe stock may somehow get separated from its certification. If that happens, you can never use the material for your steam generator; inspection alone will not provide adequate assurance that the material really is what you need.

Of course, there will be strong financial motivations to attempt to fake certification, and to claim that it will be all right to use defective materials anyway. A good QA system (and good management) will be able to resist such pressures… others may not.

That’s why the recently publicized case of faked certification of materials by a Japanese steel company is a really big deal, at least when the materials are to be used for critical parts in the aerospace industry.

There are some obvious analogies to our statistical issues. For example, given noisy data, or data whose provenance you are unsure of, you cannot inspect – i.e., verify by statistical analysis – quality into the study results. But note that the larger point goes beyond statistics per se. It has to do with the overall flow and reliability of the work. It also has to do with the integrity of management, good adjustment of incentives, and so on.

Shewhart did not recommend the use of p-values for ‘statistical control’, which is neither a matter of estimation nor of testing a hypothesis. Such a view stems from the “Criterion of Meaning: Every sentence in order to have definite scientific meaning must be practically or at least theoretically verifiable as either true or false upon the basis of experimental measurements either practically or theoretically obtainable by carrying out a definite and previously specified operation in the future. The meaning of such a sentence is the method of its verification” (op. cit. Deming’s obituary of Shewhart, Review of the International Statistical Institute, 36(3), 1968, pp. 372-375).

“Every sentence in order to have definite scientific meaning must be practically or at least theoretically verifiable as either true or false upon the basis of experimental measurements either practically or theoretically obtainable by carrying out a definite and previously specified operation in the future. The meaning of such a sentence is the method of its verification.”

That may be fine for established scientific matters (although since one cannot prove a negative – for some value of “negative” – we may wonder), but it’s not such a good way to specify how scientific progress and discovery can get made.

Shewhart rather used the term ‘predict’ as against ‘prove’. His first two postulates (“All chance systems of causes are not alike in the sense that they enable us to predict the future in terms of the past” and “Systems of chance causes do exist in nature such that we can predict the future in terms of the past even though the causes be unknown. Such a system of chance causes is termed constant”) deal with this predictability aspect (in a scientific and/or non-scientific problem). His third postulate (“It is physically possible to find and eliminate chance causes of variation not belonging to a constant system”) deals with improvement. For example, there are both ‘rational’ and ‘speculative’ behaviours in a financial market. If the market is largely rational (aka only chance variation is present), future returns are predictable. This predictability is lost when irrational behaviours dominate (special causes). Financial markets can be improved (e.g., through regulatory interventions).

Andrew, I think a better term for what you’re talking about is “quality improvement” rather than “quality control.” But I’m thrilled that you are talking about it! Donald Wheeler is the great explainer of why that word “control” in “statistical process control” doesn’t mean exactly what it sounds like it means: https://www.spcpress.com/pdf/DJW129.pdf.

Thanks for this link. I especially appreciate the comments starting from “And this is where the nomenclature gets in the way” (about a quarter of the way down p. 2) through “These changes are not as hard to get used to as they might seem at first, and they avoid the red herring of ‘control limits’” (about 3/4 of the way down p. 3).

I think that when NHST is eventually replaced by better methodology, this will be part of it. It starts when journals insist that novel statistical approaches (in fields other than statistics) be approved by expert statisticians.

This is, IMO, one of two ideas that could revolutionize scientific methodology. The other essential component is that researchers be clear on whether they are determining causation or merely correlation. Correlation is comparatively easy, often illuminating, and fits within the NHST framework, assuming no forking paths and such. But determination of causation requires far more effort. You need to build a cause tree (is this the same idea as multi-level modeling? Not sure exactly what that means.) and show that not only does your purported cause strongly correlate with the effect, but also that other plausible causes do not correlate. I envision a future in which your cause/effect relationship by NHST will only be published if the question has already been diagrammed in a cause tree that is acceptable to a consensus of researchers. If there is no cause tree for your effect, you need to show the tree yourself.

There are books about quality by Redman (formerly of Bell Labs), English, Loshin, and others. A big issue is getting the data into the computer as accurately as possible. As statisticians, before analyzing the data for distributional issues, we need to eliminate ‘duplicate’ records, for which we might use the Fellegi-Sunter model of record linkage (JASA 1969), or fill in missing values of variables or replace ‘erroneous’ values (such as a child under sixteen being married) using the Fellegi-Holt model of statistical data editing (JASA 1976). The means of implementation of record linkage are mostly computer science, and the means of implementation of edit/imputation involve considerable OR (set covering, integer programming, etc.).
If you have 10% error in your data, how will you analyze the data? How will you even know how much error you have?
Meng has introduced an additional error source, where a somewhat erroneous administrative or big-data source is linked with a high-quality survey sample: http://www.stat.harvard.edu/Faculty_Content/meng/COPSS_50.pdf
Although he assumes that there is no linkage error, most of the time the joined survey sample and administrative data source cannot be properly analyzed. If there is linkage error (due to the absence of unique identifiers), then the error will be much greater: https://sites.google.com/site/dinaworkshop2015/invited-speakers
There have been seventeen papers in the last fifty-two years on adjusting statistical analyses for linkage error. I believe that the problem is 5-10% solved.
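A small simulation shows why even modest linkage error matters. In this hypothetical sketch, x comes from one file and y from a linked file with a true relationship between them; 10% of records are linked to the wrong partner, and the estimated correlation is attenuated accordingly:

```python
import random

random.seed(7)

n, eps = 10_000, 0.10  # sample size and assumed linkage error rate
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]  # true relationship: y = x + noise

# Mislink a fraction eps of records: each receives some other record's y.
y_linked = y[:]
bad = random.sample(range(n), int(eps * n))
shuffled = bad[:]
random.shuffle(shuffled)
for i, j in zip(bad, shuffled):
    y_linked[i] = y[j]

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)
    var_a = sum((u - ma) ** 2 for u in a) / len(a)
    var_b = sum((v - mb) ** 2 for v in b) / len(b)
    return cov / (var_a * var_b) ** 0.5

r_true = corr(x, y)           # correlation under correct linkage
r_linked = corr(x, y_linked)  # attenuated by the 10% mislinkage
```

The analyst who sees only the linked file observes the attenuated correlation and has no internal evidence of how much linkage error is present, which is the point of the comment above.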