In academia, several cases have recently come to light of very well-known scientists who fabricated their data out of thin air.

In some instances these papers had been cited many times by other researchers, and some had even been praised. Thus, when the truth came to light, it also appeared to the public that peer review in science is poor at catching fraud.

In light of these cases, how can a reviewer perform at least some sanity check that the data is (most likely) not fabricated? Suggesting fabrication could do great injury to the researcher, but I think there should be some kind of mechanism to control for it.

I am not sure how to detect fabricated data without reproducing it by the same methods the paper's authors describe. I think the sanity test you mention depends on the field and the complexity of the experiments.
– scaaahu Jan 30 '13 at 6:27

Indeed, but for example, a friend once told me to just take their results and see how well they fit normal samples.
– Leon palafox Jan 30 '13 at 6:46

@Leonpalafox that only works if their results might reasonably be expected to have a Normal distribution. Lots of things do not have a Normal distribution.
– EnergyNumbers Jan 30 '13 at 7:16

5 Answers

There is only one reliable way to do it, which is to try to replicate their results.

The unreliable, but not completely useless, way is to see if the numbers fit Benford's Law.
Benford's Law describes the distribution of the first digit of many very diverse data sets: the digit d appears as the leading digit with probability log10(1 + 1/d), so about 30.1% of values begin with a 1 while only about 4.6% begin with a 9.
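
As a rough illustration of what such a screening could look like, here is a minimal Python sketch (mine, not part of the original answer), assuming SciPy is available and that the data is of a kind for which Benford's law is plausible in the first place:

    import math
    from collections import Counter
    from scipy.stats import chisquare  # one-way chi-square goodness-of-fit test

    def benford_first_digit(values):
        """Compare leading-digit frequencies against Benford's law.

        A small p-value flags deviation from Benford; on its own this is
        suggestive at best, never proof of fabrication.
        """
        # Leading non-zero digit of each value (zeros are skipped)
        digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
        counts = Counter(digits)
        observed = [counts.get(d, 0) for d in range(1, 10)]
        # Benford's law: P(first digit = d) = log10(1 + 1/d)
        expected = [len(digits) * math.log10(1 + 1 / d) for d in range(1, 10)]
        return chisquare(observed, f_exp=expected)

Even a clear rejection here only says the digits are un-Benford-like, which, as the comments below point out, is common in perfectly honest data.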

For that to work, you need large batches of raw data (and they need to be the kind of data that can be expected to follow Benford's distribution in the first place). Most papers only include curated/analyzed/post-processed data, for the sake of clarity.
– F'x Jan 30 '13 at 11:44

I would perhaps amend this to make clear that Benford's law can be applied to any digit (not just the first! its distribution just changes depending on the characteristics of the data). The citation you provide makes this clear right in the title, and some papers examine the trailing digits rather than the leading ones.
– Andy W Jan 30 '13 at 12:55

@Andy W In fraud detection, it's actually common to use the second digit. The reason is that it's considered that people who are faking data are quite likely to get the first digits about right, for example they might change "16" to "15", but they are less likely to pay attention to second digits. Nevertheless, some people claim that Benford's Law is pretty useless, for example: vote.caltech.edu/content/…
– Flounderer Jan 30 '13 at 21:28

Quite frankly, I see Benford's Law as close to useless for detecting fraud, both because of false positives and because of false negatives. In the cases of fraud (or gross mistakes, I am not sure which) that I have seen, Benford's Law would not have been useful: the article reported the results of computer simulations with one important parameter not taken into account, and presented the data as if that parameter were irrelevant. So the data measured "something", just not what the paper claimed it did, and there is no reason the incorrect results would be less likely to follow Benford's Law than correct ones.
– user6114 Mar 1 '13 at 20:54

Edit

After thinking about some of the points raised in the comments, I would like to expand on my answer, but also defend its form against the criticism that it is so vague as to be unhelpful. [In case you are wondering what the original answer was, it is roughly the sections 'Looking for mistakes' and 'Trusting your feelings'.]

Benford's law

Benford's law is only one of many statistical techniques that can be and have been used to detect fraud or bias. It has become widely known, probably partly because it is simple to apply, but also simple to justify in a 'hand-waving' way.

However, its validity is much more limited than @EnergyNumbers (who calls it "the unreliable way") implies. As originally formulated, Benford's law says that if you take a large collection of numbers with different sources, contexts, meanings or units, the logarithmic distribution of leading digits emerges. This is a very interesting statement, but it has little utility in detecting fraud.

The claim that Benford's law, whether applied to first or second digits, should hold for a particular set of observations of a single variable is an extremely strong one. There are many, many natural examples of well-formed, non-fraudulent datasets to which Benford's law does not apply, and several other digit distributions could reasonably arise in bona fide data. You may or may not be able to justify the assertion for your own data; what you should not do is blindly apply Benford's law to various sets of numbers and start forming opinions about their reliability.

It is a serious statistical technique that requires non-trivial statistical understanding to apply. The same goes for checking normality: unless you have a good understanding of how normal distributions arise, you will not be able to form a theory of why some particular quantity should be normally distributed, and in that case any test for departure from normality is useless.
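
To illustrate with a made-up example (mine, not part of the original answer): a normality test applied to perfectly honest but skewed data "fails", and tells you nothing about fraud.

    import numpy as np
    from scipy.stats import shapiro  # Shapiro-Wilk test of normality

    rng = np.random.default_rng(0)
    # Genuine-looking but non-normal data: log-normal measurements such as
    # reaction times or incomes are common in honest research
    honest_but_skewed = rng.lognormal(mean=0.0, sigma=0.5, size=200)

    stat, p = shapiro(honest_but_skewed)
    print(f"Shapiro-Wilk p = {p:.4g}")  # tiny p: non-normal, yet no fraud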

Why this answer doesn't go into any statistical detail

The original answer I gave, below, tries to err on the side of not handing 'formulas' over to people who possibly don't understand their use. I tried, perhaps not very successfully, to suggest starting places for thinking about how and maybe why people either fake results or unconsciously introduce bias.

This kind of forensics is in some ways very similar to other statistics, but it has some very important differences. If you are looking for a signal in noise, you might form two hypotheses, both of which imply that the data is random, but with, say, different means or distributions. If you are looking for cheating, you have to remember that fraudulent data is not in any sense random. Spotting it involves teasing apart (possibly) three elements: the real numbers, the deliberate adjustment, and any pseudo-random perturbation that might have been added to mask the adjustment.

I believe that in order to properly apply a forensic test to a set of data, you first need to develop a proper theory of why the test might be meaningful. This entails hypothesizing about exactly how the data might have been manipulated. For example, Benford's law was successfully used to investigate whether China's GDP growth figures (in %) were being rounded up when they had a high second digit: http://ftalphaville.ft.com/2013/01/14/1333552/chinas-non-conforming-gdp-growth/ (registration required).
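
For reference, the second-digit version of the law used in that kind of analysis can be derived directly; a quick sketch (mine, not the article's):

    import math

    # P(second digit = d) = sum over first digits k of log10(1 + 1/(10k + d))
    second_digit = {d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
                    for d in range(10)}
    for d, p in second_digit.items():
        print(d, round(p, 4))
    # The probabilities fall gently from about 0.120 for '0' to about 0.085
    # for '9', so systematic rounding-up shows up as a shortage of high
    # second digits and a surplus of zeros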

Taking a whole battery of tests and applying them to some data might allow you to get to the stage of theorizing, but it can't get you any further. This is why in the first few paragraphs of my original reply, I talked in very general terms about how faked data might differ from genuine data. These are supposed to give you places to go looking for anomalies, which you later investigate rigorously.

Looking for mistakes cheaters might make

Starting points can be things like testing whether the numbers fit the conclusion too closely. If an experiment was done on several groups of test subjects, all of which are supposed to be identical, you would expect the success rate in each group to be close to the overall average, but not too close. Some researchers who made up their results had every group's success rate equal to the overall average to the nearest integer.
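
One way to make "too close" precise, assuming each group's successes are binomial counts, is a left-tail chi-square test in the spirit of Fisher's famous analysis of Mendel; this sketch is my illustration, not a recipe from the answer:

    from scipy.stats import chi2

    def too_good_to_be_true(successes, trials):
        """Left-tail chi-square test for suspiciously homogeneous groups.

        Returns the probability of agreement with the pooled rate at least
        this good arising by chance; a tiny value is a red flag.
        """
        pooled = sum(successes) / sum(trials)
        stat = sum((s - n * pooled) ** 2 / (n * pooled * (1 - pooled))
                   for s, n in zip(successes, trials))
        return chi2.cdf(stat, df=len(successes) - 1)

For example, four groups of 100 subjects each reporting exactly 50 successes gives a left-tail probability of zero; agreement that perfect should essentially never happen by chance.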

If you ask someone to make up the results of 20 successive coin tosses, they will deviate from statistical likelihood because, for example, they will not include enough runs of 5 heads in a row. People usually think such runs are less likely than they actually are. Look out for things which are "too random" or "too regular".
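
A quick simulation (my sketch, not part of the original answer) shows how common long runs really are:

    import random

    def longest_run(seq):
        """Length of the longest run of identical consecutive outcomes."""
        best = cur = 1
        for prev, nxt in zip(seq, seq[1:]):
            cur = cur + 1 if prev == nxt else 1
            best = max(best, cur)
        return best

    trials = 100_000
    hits = sum(
        longest_run([random.random() < 0.5 for _ in range(20)]) >= 5
        for _ in range(trials)
    )
    print(f"P(run of 5+ in 20 tosses) ~ {hits / trials:.2f}")  # roughly 0.46

Nearly half of all genuine sequences of 20 tosses contain a run of five or more identical outcomes, which fabricators reliably underestimate.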

Researchers into election fraud have had some success looking at the last two digits of numbers to see whether doubled pairs like '11' or '22' occur less often than they should, because humans making up "random" numbers tend to avoid them. This applies in the specific case where the numbers are large enough that the trailing digits should be uniform, but no rounding should have been applied. This test would not have detected the Chinese GDP rounding, or manipulations in which leading digits are adjusted.
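
A sketch of that check (mine; it assumes SciPy's exact binomial test and counts whose last two digits really should be uniform):

    from scipy.stats import binomtest  # exact binomial test (SciPy >= 1.7)

    def repeated_pair_deficit(counts):
        """Test whether doubled trailing pairs (00, 11, ..., 99) occur less
        often than the 10% expected when last digits are uniform."""
        pairs = [str(n)[-2:] for n in counts if n >= 10]
        repeats = sum(p[0] == p[1] for p in pairs)
        return binomtest(repeats, n=len(pairs), p=0.1, alternative="less")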

The mathematician Borel weighed the loaf of bread his baker gave him each day and decided that the average was too far below the standard weight of a loaf to support the hypothesis that the baker wasn't making underweight bread. He confronted the baker, who promised to make the loaves heavier. Borel continued to weigh his bread. The average weight was now high enough, but when he studied the distribution of weights he realized it corresponded to what you would get by always taking the maximum of several observations from a normal distribution. He concluded that the baker always gave him the biggest loaf on the rack, but that the average loaf was still below spec.

This is a classic illustration of how someone might falsify results: by taking the best result from several runs. In order to reason about the distributions, it was first necessary to understand how this method of cheating works.
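
A small simulation of the Borel scenario; the spec, mean, standard deviation and rack size below are purely hypothetical:

    import numpy as np

    rng = np.random.default_rng(1)
    spec, mean, sd, k = 1000.0, 950.0, 30.0, 5  # grams; k loaves per rack

    single = rng.normal(mean, sd, size=10_000)               # honest sampling
    biggest = rng.normal(mean, sd, size=(10_000, k)).max(axis=1)

    for label, x in [("single loaf", single), ("biggest of 5", biggest)]:
        print(f"{label}: mean = {x.mean():.1f} g, sd = {x.std():.1f} g")
    # The max-of-five sample has a higher mean but a visibly narrower,
    # right-skewed spread - the signature Borel recognized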

Or suppose someone had a batch of results but threw out those they didn't like. Has this introduced unlikely biases in the make-up of the remaining test subjects? For example, patients are supposed to have been chosen at random, but there are fewer old people than one would expect. In general, if any data were rejected, you should test for dependence between rejection and the other variables.
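
For instance, with invented numbers, a contingency-table test of whether rejection is independent of age:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows = kept / rejected, columns = under 60 / 60+
    table = np.array([[180, 90],
                      [ 15, 35]])

    chi2_stat, p, dof, expected = chi2_contingency(table)
    print(f"p = {p:.4f}")  # a small p says rejection depends on age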

Sometimes real data has a particular bias or noise which is lost in faked data. In the Simonsohn paper cited below, he looked at a psychological study in which subjects were asked how much they would pay for a T-shirt. Unlike in other, genuine studies, the results did not cluster around multiples of $5.
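
A toy version of that check (my construction; Simonsohn's actual analysis is more careful):

    def share_at_multiples_of_5(amounts):
        """Fraction of whole-dollar answers landing on multiples of $5.

        Genuine "how much would you pay?" data clusters on round numbers;
        the absence of such clustering was itself the anomaly.
        """
        whole = [int(a) for a in amounts if float(a).is_integer()]
        return sum(a % 5 == 0 for a in whole) / len(whole)

    print(share_at_multiples_of_5([20, 25, 18, 30, 15, 22, 10]))  # 5/7 here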

Another thing which can be hard to look for, but which is very damning, is to work out what the results would look like if no effect were present, and then check whether, for example, a single digit has been changed or a round number has been added.

Sometimes people genuinely do introduce biases unconsciously, because they believe in their theories or want to succeed. This can mean making very small adjustments with a large cumulative effect, such as rounding up numbers which should be rounded down.

Trusting your 'feelings'

The other thing you need to try is to get a 'feeling' for something dodgy outside of the actual numbers. Again, all this does is give you a place where you can try to build a proper statistical hypothesis and then test it against the data.

A mathematics professor once said to me that you can spot false proofs by two things: either the work becomes very complicated at the point where it is wrong, or the wrong step is skipped over as obvious. Not quite the same situation, I know, but very complicated data-handling procedures could be designed to be difficult to replicate (or could be the point where the researcher manipulated the data until she got what she wanted). Saying that something like 'cleaning' or 'normalization' was done, without explaining exactly what was done, can also be a red flag.

If there is a very standard source of data of a particular type and someone didn't use it, or used it but not in its original form, why not? People often give a citation justifying some supposedly straightforward manipulation they perform on the data to clean it or get it into a more convenient form. Usually, but not always, this reference should be to a standard textbook on statistics or experimental design, or to some paper everyone in the field knows. If it is to something extremely obscure, is that justified by the obscurity of the topic? Does the cited work actually say what they claim it does?

How to proceed

I have tried to promote the general skill of trying to understand how people fake things, why and how they mix the truth with fabrication (or sometimes are subject to unconscious bias), and what constitutes strong evidence of anomaly. Looking at case studies, of which Simonsohn's paper is a great example, can help. Stephen Jay Gould's famous book 'The Mismeasure of Man', on the face of it a political tract critical of biological determinism, is also a collection of many case studies of deliberately or accidentally biased scientific work.

If you think something is fishy but you don't have the analytical tools to prove it, then you need to research the specific statistical tests that apply to such cases. Among academics, most statistics is not done to detect fraud, and even with good quantitative skills you may not have this knowledge. The example of Borel is a good one in that many of us don't know offhand what the distribution of the 'biggest loaf to hand' should be, given some reasonable assumptions about the distribution of loaf sizes.

However, as a researcher you should definitely have the skills to go and find this out from a book. Asking a statistician is a very important technique which may or may not be a last resort, depending on how friendly your statistician is.

I don't think "If an experiment was done on several groups of test subjects, all of which are supposed to be identical, then you would expect the success rate in each group to be close to the overall average, but not too close" is very useful advice. For instance, Fisher's critique of Mendel was based on the data being more similar than would be expected by chance, and Uri Simonsohn's recent work is based on similar observations of data being less random than one would usually expect.
– Andy W Jan 30 '13 at 13:06

At a minimum it would be nice to supplement the answer with actual citations to support the statement that this "is just one of many statistical techniques that can be and have been used to detect fraud or bias". This is certainly true, but pointing to some examples would at least give the reader a means to make progress on his own.
– Andy W Jan 30 '13 at 13:25

@AndyW, what is your understanding of the words "but not too close"? If you need 'conditions for evaluating the variability', you should ask a statistician. As for providing examples, I tried to write something that would be a decent starting point for someone who isn't an expert. I did provide one (simple) example, that of Borel. The paper you cite is excellent; you should add it as an answer, otherwise I am happy to edit and mention it.
– jwg Jan 30 '13 at 13:31

There are enough answers here already, and I would happily upvote yours if you add the Simonsohn paper (please provide a reference for Borel as well; I'm not familiar with that story and the Wikipedia page does not mention it)! My point was that the quoted statements provide no effective way to evaluate results, because statements like "too close" are so ambiguous they can mean anything to anyone. Taken literally, the statement about "too random" or "too regular" applies to every set of numbers (because if a set is not random it has some regular structure).
– Andy W Jan 30 '13 at 13:43

@jwg: I think Andy's objection is precisely that you really don't know how to explain the word "too". Don't just say "ask a statistician". Any academic working in an experimental discipline needs a strong enough statistics background to give, understand, and apply a precise definition of "too close".
– JeffE Jan 30 '13 at 16:53

Fabrication of data is not easy for a reviewer to detect. You may try tricks on raw numerical data, if the numbers come in large quantities and can be expected to follow a normal distribution. But even if the tests say there is some likelihood the data was fabricated, that is still not "proof". You would need at least a strong likelihood to prevent publication, and that is not easy to come by.

If you look at accounts of retracted papers and the retraction process, you will find that the culprit is usually identified not by the numbers alone but by other facts: he has a very rapid publication rate compared to others in the field, he behaves oddly and does not allow coauthors to see the raw data, things like that. In most cases there was nothing even a very diligent reviewer could have done. That is sad, but it is the truth of the matter in most cases.

This is an example of selection bias: the fraudsters who were found out published too frequently or behaved secretively. Nothing about this tells you that there aren't many papers out there by people who are more cautious about how they fabricate results, but which could still be discovered by careful examination by the reviewer. If there isn't enough detail in the paper to check for fabrication, or if there is a strong likelihood without proof, you should reject the paper and explain why in the review.
– jwg Jan 30 '13 at 13:09

Sometimes, "too good to be true" is true not only in academia but also in sports.
– scaaahu Jan 30 '13 at 13:18

@jwg I wouldn't agree with your last sentence. In fact, that's not the editorial policy of most journals I know: raw data is simply not required for review and publication. As a reviewer, you should definitely follow the journal policy…
– F'x Jan 30 '13 at 13:21

@F'x I didn't say "raw data" but "enough detail". The editorial policy of most journals includes the idea that reviewers should make a reasonable effort to advise the editors on whether or not a paper is bogus.
– jwg Jan 30 '13 at 13:25

"As a reviewer, you should definitely follow the journal policy…" — But reviewers don't decide; they only recommend. I think it's perfectly reasonable for a referee to recommend rejection on the grounds that raw data is unavailable, as long as they also give the editor enough information to make a decision if, by journal policy, those grounds are insufficient to reject. With enough recommendations from the community, the journal policy might change.
– JeffE Jan 30 '13 at 17:43

It depends on the nature of the data. If the data is presented in the form of pictures (such as photos of biological experiments, like Western blots), you can check for traces of image manipulation. Guidelines for examining photographic data are available from the Council of Science Editors.

There has been related work in survey research on how to detect interviewer falsification of survey responses, sometimes referred to as curb-siding (when an interviewer is allegedly sitting on the curb next to the house where they were supposed to be doing an interview). See the collection of practices from the Survey Research Methods Section of the American Statistical Association, and a system for detecting interviewer falsification from RTI, one of the top three US survey research organizations.

The general findings usually run along these lines: falsifying interviewers get the first moments (means, proportions) about right, but are lousy with the second moments (variances and correlations). They avoid extreme answers, which compresses the variance, and they do not know well enough how variables go together, which distorts the correlations.
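
One way such screening might be sketched, assuming two batches of numeric responses to the same item (the function and variable names here are mine):

    import numpy as np
    from scipy.stats import levene  # Levene's test for equal variances

    def flag_compressed_variance(reference, suspect):
        """Compare response spread between a trusted batch and a suspect
        interviewer's batch; fabricated interviews tend to avoid extreme
        answers and so show smaller variance.
        """
        stat, p = levene(reference, suspect)
        narrower = np.var(suspect) < np.var(reference)
        return p, narrower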

Not much of that may be applicable to the natural sciences, though. I would suggest enlisting a local statistician: many statistics departments run consulting courses for their graduate students and welcome requests for expertise from other disciplines.