To Throw Away Data: Plagiarism as a Statistical Crime

“The distortion of a text,” says Freud in Moses and Monotheism, “is not unlike a murder. The difficulty lies not in the execution of the deed but in doing away with the traces.” —James Wood

Much has been written on the ethics of plagiarism. One aspect that has received less notice is plagiarism’s role in corrupting our ability to learn from data: We propose that plagiarism is a statistical crime. It involves the hiding of important information regarding the source and context of the copied work in its original form. Such information can dramatically alter the statistical inferences made about the work.

In statistics, throwing away data is a no-no. From a classical perspective, inferences are determined by the sampling process: point estimates, confidence intervals and hypothesis tests all require knowledge of (or assumptions about) the probability distribution of the observed data. In a Bayesian analysis, it is necessary to include in the model all variables that are relevant to the data-collection process. In either case, we are generally led to faulty inferences if we are given data from urn A and told they came from urn B.
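The urn example can be made concrete with a small simulation. The sketch below is only an illustration under assumed numbers: the proportions 0.30 and 0.50, the sample size of 200, and the function names are all hypothetical choices, not taken from any real data. It builds a standard 95 percent interval for the proportion of red balls in urn B, first from draws that really come from urn B and then from draws that come from urn A but are labeled as urn B.

```python
import math
import random

# Hypothetical proportions of red balls in each urn (assumed for illustration).
P_URN_A = 0.30
P_URN_B = 0.50


def wald_interval(successes, n, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(p_source, p_claimed, n=200, reps=2000, seed=0):
    """Fraction of intervals that contain p_claimed when the draws
    actually come from an urn whose red-ball proportion is p_source."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        successes = sum(rng.random() < p_source for _ in range(n))
        low, high = wald_interval(successes, n)
        if low <= p_claimed <= high:
            hits += 1
    return hits / reps


# Data really from urn B, analyzed as urn B.
honest = coverage(P_URN_B, P_URN_B)
# Data from urn A, but we are told they came from urn B.
mislabeled = coverage(P_URN_A, P_URN_B)

print("coverage with correct source:", honest)
print("coverage with mislabeled source:", mislabeled)
```

Under these assumptions the interval procedure is the same in both runs; only the information about where the data came from differs. With the correct label, the interval captures urn B's proportion at close to its nominal rate. With the wrong label, it almost never does.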

A statistical perspective on plagiarism might seem relevant only to cases in which raw data are unceremoniously and secretively transferred from one urn to another. But statistical consequences also result from plagiarism of a very different kind of material: stories. To underestimate the importance of contextual information, even when it does not concern numbers, is dangerous.

Perhaps the most prominent statistician to have repeatedly published material written by others without attribution is Edward Wegman, formerly of the Office of Naval Research and currently a professor at George Mason University. The case is especially interesting because Wegman has a distinguished record of public service and scholarship (he received the Founders Award in 2002 from the American Statistical Association) and because one of the plagiarized documents was part of a report on climate change delivered to the U.S. Congress. The ethical dimensions of this copying seem clear enough: By taking others’ work without giving credit—even copying from Wikipedia at one point (see the appendix to this essay)—Wegman and his research team were implicitly claiming expertise on subjects in which they were not experts. Wegman continues to deny having plagiarized, even in the face of direct evidence that several of his publications (on topics ranging from network analysis to color vision) include unattributed material previously published by others.

We shall avoid speculating about the motives for plagiarism here. Generally, however, the ethical dilemma seems analogous to that faced by a person who robs a store to feed his or her family, or by a politician who lies to achieve a larger political goal. In all of these cases, the behavior in question is generally recognized to be unethical, so if the action is to be deemed ethical in its broader context, it can only be because it serves some larger, more important goal. In Wegman’s case, no such argument about a larger context has been made (perhaps because that would require admitting the ethical violation in the first place).

The Wegman case came to public notice after the Canadian blog Deep Climate found the first few pages of material in the report to be plagiarized from a book by Ray Bradley, one of the authors whose work was attacked in that document. The blog post stirred others to study this and other documents written by Wegman and his students, at which point additional incidents of copying without attribution turned up. In 2011, a related article by Wegman and a collaborator in the journal Computational Statistics and Data Analysis was formally retracted by the publisher on grounds of plagiarism.

Despite the human and political drama of the Wegman case, it may not appear immediately interesting from the standpoint of statistics. Perhaps counterintuitively, a purely qualitative example reveals why this appearance is wrong.