How Bad Data Practice Is Leading To Bad Research

Two weeks ago, I spoke on a panel at the Association of American Publishers’ (AAP) Professional and Scholarly Publishing (PSP) pre-conference on “Fake News -- Fake Scholarship: Improving Accuracy and Transparency in Professional and Scholarly Publishing” that explored how academia is being affected in some of the same ways that “fake news” has impacted the public information environment. In particular, the problem of bad data practice is spreading as “big data” approaches reach non-traditional disciplines, leading to a growing body of flawed data-driven research. How is a lack of proper data practice damaging the scholarly landscape?

In my presentation, I broke the world of bad data practice into five key themes: Honest Statistical/Computing Error, Honest Misunderstanding of Data, Honest Misapplication of Methods, Honest Failure to Normalize, and Malicious Manipulation, all made worse by the poor citation practices of Copy-Paste Google Scholar-ship.

Perhaps the most basic and common kind of bad data practice is honest statistical or computing error. This can be as simple as an Excel spreadsheet formula error or as complex as a subtle buffer overflow in a sprawling piece of simulation software that randomly changes a critical parameter or output value. Today's large data-driven projects are often ad-hoc affairs, involving multiple software packages glued together into workflows never imagined or intended by their creators. A file format conversion package used to glue two tools together might silently truncate double precision floating point numbers to single precision, damaging the data enough to yield a false conclusion without the users ever understanding what is happening, or even what exactly a double versus single precision floating point number is. Common off-the-shelf software may also implement a limited version of an algorithm with special edge cases or restrictions on its input, and even widely used software can encounter strange issues when pressed into unusual analytic scenarios.
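To make the failure mode concrete, here is a minimal sketch in Python (the value and threshold are invented for illustration, not taken from any real study) of how a silent double-to-single precision conversion can quietly flip a conclusion:

```python
# Illustrative only: a value computed at double precision that a glue step
# silently truncates to single precision.
import numpy as np

estimate = np.float64(0.049999999999)   # hypothetical value produced by the analysis tool
converted = np.float32(estimate)        # file conversion step silently downcasts it

print(estimate < 0.05)    # True: the original value sits just below the 0.05 threshold
print(converted < 0.05)   # False: rounding to single precision pushes it past the threshold
```

Nothing in the output flags the downcast; the only symptom is a conclusion that quietly changes.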

Some journals have begun adding a separate statistical review process that can catch these kinds of issues, especially those involving software limitations, but the only real way of ferreting out honest error is to encourage open data, open software and open workflows to the greatest extent possible. When others are able to replicate and thoroughly examine an entire analysis from start to finish, it makes it far more likely the community will catch these kinds of unintentional errors.

Even when the statistical and computational elements of a project are correctly implemented, misunderstandings of the underlying data can undermine or entirely nullify a study’s findings. Fields without a history of data-driven analysis are increasingly adopting data analytics, while established data-intensive fields are reaching for datasets outside their domain expertise. Even with sound statistical approaches, the data being analyzed may bear little resemblance to what the authors think it represents.

For example, Twitter’s free 1% stream has become a de facto dataset for studying global human and natural activity. Yet, on a daily basis, I see studies cross my inbox that do things like map English-language GPS-tagged tweets to report on anti-regime views in rural Syria. Or offer conclusions on domestic image and video portrayals of social issues in areas of the world where there is no cellular service, no 20 megapixel-equipped smartphones with unlimited 4G data plans and precious few citizens live-streaming their daily lives.

Even when using traditional library resources intended for academic research, subtle issues can impact an analysis, such as wire stories being excluded from the full text searchable archive of a major newspaper due to licensing issues. A lack of understanding of the differences between online, print and broadcast media frequently leads researchers to base claims on online searchable archives alone. Just because a dataset is easy to access and available in a machine-friendly format does not in any way suggest it is amenable to a particular analytic question.

As research across all fields increasingly looks globally, English-language Western sources remain a primary data source for understanding the rest of the world. Indeed, even within prominent social sciences fields, there are still large portions of the academic community that believe you can fully and completely understand the entire world using just a handful of English-language American and European newspapers – no need to look at local sources or other languages. Yet, these same researchers are forever surprised that reading a few English-language American newspapers didn’t give them the same level of detail as a European study that consumed local papers across Europe in the local languages of each country. Unfortunately, such studies appear even in the top methodological and statistical journals, having passed both peer and statistical review – their statistics are accurate, but the reviewers lack the understanding of the data to ask whether it is relevant to the questions posed of it.

It is critical that researchers spend the time to fully investigate a dataset’s suitability for their analysis and not be afraid to ask the questions that aren’t addressed in the documentation. For example, one widely used global social sciences dataset, relied upon in media and government reports alike, was assumed by the research community to draw on local language sources in all of the areas of the world it analyzed. The authors never claimed this; users simply assumed that anyone positioning themselves as the authoritative catalog of the rural non-English-speaking world would be using local sources. Yet, for years no one asked just what sources the authors were using to build their dataset. It turns out they rely almost exclusively on major English language, primarily American, news outlets. The authors had not misrepresented anything; they simply lacked the domain knowledge to recognize what they were missing by not using local sources. More to the point, the academic, governmental and media users never stopped to ask what those sources were and just assumed from their own domain knowledge that anyone cataloging the world would use local ones.

In addition, few human-coded datasets outside certain disciplines publish inter- and intra-coder reliability metrics, and many rely on single-coder workflows. More troubling, as I’ve documented extensively, American Institutional Review Boards (IRBs) now routinely permit the use of stolen data in research, arguing that as long as a dataset can be downloaded from somewhere on the web or dark web, it does not matter whether it stems from an illegal hacking incident or data breach. Stolen data presents unique challenges in that its release frequently comes with a particular motivation: it may represent a hand curation of the original data designed to capture or portray the conclusion the attackers wish to publicize while excluding all of the mitigating content.

The underlying problem is that the academic world prioritizes new discovery over documentation and validation of the data that drives those discoveries. It is very hard to publish a dataset documentary that dives deeply into a dataset, documenting its sources, nuances, limitations and strengths. A study that misunderstands a dataset and accidentally misuses it to derive splashy new conclusions can become a seminal work that drives prestige, grant funding and grad students, whereas a critical work documenting that a dataset cannot support entire classes of research in which it is nonetheless the dominant source will struggle to find a journal or conference willing to publish it.

To address this, journals should consider adding dedicated data review, much as many of them have added statistical review processes. Traditional peer reviewers are frequently just as unfamiliar with the nuances and incompatibilities of a dataset as the authors are, and may even be selected because of their frequent use of the dataset to examine similar kinds of questions, rather than for familiarity with its limitations. A specialized review process focused on how well a dataset aligns with the questions asked of it would certainly catch many of these studies. However, only the adoption of open data requirements will truly address honest accidental misuse of data, since it allows the broader scholarly community most familiar with those datasets to help evaluate their use; even this must be coupled with encouragement of data documentary and validation studies.

Journals must also be more stringent about enforcing and encouraging their open data requirements. When a Board of Reviewing Editors member of Science declined to provide a replication dataset or answer any questions about a paper published in a different venue, Science noted that while it encourages its BoRE members to adhere to its open data requirements in all their work, it declined to comment on whether it would take any action against board members who refuse to do so. If there are no consequences for refusing to provide open data, then it is unclear what incentives scholars have to do so.

Of course, even when the dataset is perfectly aligned with the questions being asked of it, the ever-increasing availability of powerful analytic techniques means results gleaned from those datasets may still be incorrect. Incredibly powerful statistical and analytic methods are now available in point-and-click interfaces that bring even the most advanced methodologies to the masses. Yet, few software packages warn about employing a method in an inappropriate way or help users avoid misinterpreting results.

I regularly see network analysis papers that perform clustering on networks using algorithms that rely on random seeds and claim that their results represent the one “true” clustering of the network. These authors typically aren’t maliciously attempting to misrepresent their work; they simply lack the familiarity with the tools they are using to understand that a random seed is involved (or even what a random seed is) and that if you click the “cluster” button 10 times you may get 10 very different results.
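A small sketch of this behavior, assuming a recent networkx release that ships a built-in Louvain implementation (the graph and seeds are arbitrary choices for illustration):

```python
# Run the same seed-dependent clustering algorithm ten times and count how
# many distinct partitions of the same graph it returns.
import networkx as nx

G = nx.les_miserables_graph()   # small example graph bundled with networkx

partitions = [nx.community.louvain_communities(G, seed=s) for s in range(10)]

# Canonicalize each partition as a set of sets so ordering doesn't matter.
distinct = {frozenset(frozenset(c) for c in p) for p in partitions}
print(f"{len(distinct)} distinct clusterings from 10 runs")
```

Any claim about the one clustering of such a network needs to account for this run-to-run variation, for example by reporting stability across many seeds.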

In areas like content analysis that have long histories of computational approaches, it is not uncommon to see papers from other disciplines that take any available tool and apply it without understanding that language evolves over time. For example, users will take tools like 1961’s General Inquirer, dating from the punch card era, and use it to mine Twitter, or grab popular dictionaries from the 1980s that come up at the top of a quick Google search and use them to mine Facebook posts, without understanding that emojis, hashtags and the meanings of words have changed over time – “cool” may mean that something is “neat” rather than refer to temperature, while “throwing shade” may not refer to the amount of light reaching the surface of an object. Conversely, prominent papers will use modern, circa-2010s sentiment dictionaries to analyze digitized historical materials from 200 years ago.
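As a toy illustration (the word list below is invented for the example and is not drawn from General Inquirer or any real lexicon), a dictionary built before these senses existed simply cannot see them:

```python
# A dated, made-up lexicon that only knows the literal senses of words.
literal_lexicon = {
    "cool": "low temperature",
    "shade": "blocked sunlight",
}

post = "that new album is cool but the replies are just throwing shade"
for word in post.split():
    if word in literal_lexicon:
        print(f"{word!r} tagged as {literal_lexicon[word]!r}")

# Both hits are wrong for this post: "cool" expresses approval and
# "throwing shade" expresses criticism, senses the old lexicon cannot represent.
```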

Identifying honest misapplication of methodology is particularly difficult given that few non-technical journals and conferences require precise reporting of the specific algorithms and parameters used to run an analysis. In many journals, papers will simply say “we applied sentiment mining” or “we used a network clustering algorithm.” Even papers in prestigious methodology venues will frequently include phrases like “we performed human cleaning of the data prior to the analysis” without listing what precisely that entailed. Neural network analyses may list the package used and even some code snippets, but fail to provide the specific hyperparameters used.

Algorithmic review can help with such methodological error, but adding “open methodology” to “open data” requirements is the only real path forward. Requiring authors to precisely document the tools, algorithms and configurations/parameters used to perform an analysis would go a long way towards helping others evaluate and replicate research.
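One hedged sketch of what that could look like in practice is a machine-readable methods record saved alongside the analysis; every field name, version and parameter below is hypothetical and only illustrates the level of detail being argued for:

```python
import json
import platform

# Hypothetical methods record: tools, versions, algorithms and parameters
# captured precisely enough for someone else to rerun the analysis.
methods_record = {
    "environment": {"python": platform.python_version()},
    "packages": {"networkx": "3.2.1", "numpy": "1.26.4"},   # pinned versions (illustrative)
    "preprocessing": {
        "human_cleaning": "removed exact-duplicate and non-English posts; two coders, disagreements adjudicated",
    },
    "analysis": {
        "algorithm": "Louvain community detection",
        "parameters": {"resolution": 1.0, "threshold": 1e-07, "seed": 42},
    },
}

with open("methods.json", "w") as handle:
    json.dump(methods_record, handle, indent=2)
```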

One of the least understood sources of error, at least in media studies, is the absolute need to normalize results. Media analyses that perform counts of articles, social media posts or other outputs must normalize those counts to account for continual changes in the underlying population. Simply creating a graph showing that the number of political tweets increased dramatically from 2006 to 2013 is not particularly meaningful given that the total daily volume of tweets increased substantially over that period.

Graphing the total number of newspaper articles that mentioned a given topic over time is similarly rather meaningless without normalizing by the changing size of the underlying papers. The total annual number of articles published in the New York Times has shrunk linearly by nearly half over the last 60 years. Thus, if you plot the raw number of articles per year mentioning a US research university, you end up with a fairly stable horizontal trendline over those 60 years. In contrast, once you account for the paper’s shrinking total output, 6,000 articles in 2005 represent a much greater fraction of its annual output than 6,000 articles did in 1945. Normalizing reveals the true upward linear trendline: as a percent of the paper’s output, mentions of research universities nearly tripled, from 5% to 13%, over the 1945-2005 period.
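A minimal sketch of that normalization step, using hypothetical total-output figures chosen only so the shares match the 5% and 13% cited above (the real denominators would come from the archive itself):

```python
# Raw mention counts are flat, but the share of total output is not.
mentions = {1945: 6_000, 2005: 6_000}          # articles mentioning research universities (from the text)
total_output = {1945: 120_000, 2005: 46_000}   # hypothetical total articles published each year

for year in sorted(mentions):
    share = mentions[year] / total_output[year]
    print(f"{year}: {mentions[year]:,} mentions = {share:.0%} of total output")
# 1945: 6,000 mentions = 5% of total output
# 2005: 6,000 mentions = 13% of total output
```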

Unfortunately, normalizing is very difficult for typical academic researchers to perform. Geographic normalization of social media requires access to baselines that are rarely provided by the platforms, while few news search engines provide total volume curves or inventories to permit such normalization.

Of course, not all errors in data-driven scholarship are the result of honest mistakes. As noted earlier, historical practice in many fields has permitted the use of vague language like “we performed a field study” or “we compiled a list of 10,000 partisan Twitter accounts” or “we performed sentiment mining,” creating an opening for bad actors to exploit this lack of detail to maliciously adjust their data. Traditional norms around closed datasets allow papers with questionable data to pass through even the most rigorous peer review processes. While ethical and legal considerations mean that not all data can be made available for inspection, studies which are based entirely on data must find ways of making at least the secondary products of that data available for scrutiny.

The rise of easy-to-use image manipulation software has made it increasingly easy for scholars in image-intensive fields like biology to alter medical imagery, with a sharp rise in alleged manipulation after the 2003 release of Adobe Creative Suite and its point-and-click ease. Manipulation can be difficult for even the most prestigious journals to catch. When PNAS was notified of concerns with the images in a paper it had published, a spokesperson said it “asked two scientific subject matter experts to look at the published version of the images … The experts had opposite opinions about whether duplication had occurred. Because the authors asserted that the images had not been duplicated, we did not take further action at the time.” The New York Times hired its own experts, who concluded that the images had indeed been manipulated, to which PNAS responded “the experts the NYTimes [sic] recently consulted appear to have more experience in image forensics and were able to provide more sophisticated analyses. After the Times contacted us, we secured two additional independent analyses that came to the same conclusion of probable duplication.” When asked why the Times, a newspaper, had better image forensics experts available than PNAS, a spokesperson for the journal said it had no further comment.

That a newspaper could identify image manipulation where PNAS could not suggests that journals must be ever more vigilant for the possibility of manipulated data, whether in numeric or image form, and that they may need to seek out more sophisticated data and image forensics experts to assist them. As deep learning-powered image and video manipulation becomes mainstream, it will become increasingly difficult for even the best experts to identify altered material. It also highlights the role that blind trust plays in data evaluation – papers with prestigious coauthors are often given the benefit of the doubt, with significant data concerns dismissed. In the end, however, science is about skepticism and all data should be treated as suspect until verified.

In this era of copy-paste Google Scholar-ship, not all error is the result of data manipulation. Information in papers doesn’t just enter via the findings of the authors – the myriad citations in academic papers are also a key source of wrong information. Much of this error is due to authors in a hurry copy-pasting citations from other papers via Google Scholar. These errors then propagate through the academic landscape as other authors copy-paste the incorrect information from those papers, and so on.

Yet, key changes in what constitutes an acceptable citation have also shifted the scholarly landscape’s understanding of what is “fact.” The Editor in Chief of a major scientific journal informed me recently that his journal now accepts citations to personal blogs, social media posts and other non-academic content as the sole support for a claim in a paper, raising the question of just what distinguishes the scientific literature from random tweets. Another major journal recently published an editorial that referenced as fact several rumors sourced from a newspaper article that cited a single anonymous source as having overheard the rumors, and which have been forcefully denied on the record by the organization they are about. When asked if they had verified the claims in any way, the journal said it would have no comment. Scholarship builds on other scholarship, and when the landscape of basic “facts” that forms the scaffolding of future discoveries is built upon questionable foundations, this has real ramifications for our understanding of knowledge and how we interpret the data at hand.

Of course, politics, personal relationships and a myriad other non-data factors can play significant roles in the kinds of datasets and methodologies prevalent in each discipline.

Putting this all together, poor data practice, from honest statistical error to misunderstanding of data and methods, failure to normalize, and malicious manipulation, coupled with copy-paste Google Scholar-ship, threatens to call into question many of the findings of data-driven research. It is creating a dangerous landscape where a single honest spreadsheet error can reshape government policy, where “data” is conflated with “truth” and where dubious results are accepted as “fact” through the gilded veneer of data.