It may be surprising to find that there is little guidance available for how to effectively handle problematic data. Although professional guidelines cover general principles of data collection and data analysis, “researcher degrees of freedom” for how data are cleaned and edited are evident (Simmons, Nelson, & Simonsohn, 2011). Without scientific guidelines that clearly indicate what is and is not appropriate for editing messy data, many researchers and scientists proceed idiosyncratically. For all practical purposes, when presented with the same messy data, different scientists will “fix” the data problem in different ways, yielding widely differing results (Leahey, Entwisle, & Einaudi, 2003).

For example, Lam and colleagues (2013) demonstrated that the data editing approach used for handling data inconsistencies in Global Youth Tobacco Survey data resulted in different estimates of smoking prevalence. As they aptly note, “accurate comparisons between two studies can be made only if the same approach in handling inconsistent data is used.” Documentation on data editing is required to reproduce and compare study results. This has particular implications for meta-analyses.
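To make the point concrete, here is a minimal sketch of how two defensible editing rules for the same inconsistency can yield different prevalence estimates. The records, field names, and rules below are invented for illustration; they are not from the Lam et al. study.

```python
# Hypothetical illustration: a respondent reports never smoking but also a
# nonzero cigarettes-per-day count. Approach A drops inconsistent records;
# approach B recodes them as smokers. Both are plausible edits, yet they
# produce different prevalence estimates from the same raw data.

records = [
    {"ever_smoked": "yes", "cigs_per_day": 5},
    {"ever_smoked": "no",  "cigs_per_day": 0},
    {"ever_smoked": "no",  "cigs_per_day": 3},   # inconsistent record
    {"ever_smoked": "yes", "cigs_per_day": 10},
]

def prevalence_drop(recs):
    """Approach A: exclude inconsistent records from the denominator."""
    clean = [r for r in recs
             if not (r["ever_smoked"] == "no" and r["cigs_per_day"] > 0)]
    return sum(r["ever_smoked"] == "yes" for r in clean) / len(clean)

def prevalence_recode(recs):
    """Approach B: treat nonzero cigarette use as evidence of smoking."""
    return sum(r["ever_smoked"] == "yes" or r["cigs_per_day"] > 0
               for r in recs) / len(recs)

print(prevalence_drop(records))    # 2/3, about 0.667
print(prevalence_recode(records))  # 3/4, i.e. 0.75
```

Unless both studies document which rule they applied, their prevalence figures cannot be meaningfully compared.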

Understanding that different data cleaning approaches produce different results, Datacorp has worked to standardize data cleaning processes. These data cleaning procedures can be applied across a broad spectrum of data that are used for a variety of purposes.

Recently, the Obama administration issued an Executive Order to ensure government data are accessible and useful. Project Open Data (http://project-open-data.github.io/) is intended to improve transparency and spawn innovation.

Agencies are tasked with thinking about data as an asset. In doing so, agencies are encouraged to manage data from its start and release it to the public in an open format that encourages discovery. Anticipated benefits of the new policy include greater productivity and significant cost savings. For instance, data transparency could lead to a reduction in the duplication of data collection efforts and increase secondary data analysis.

Data sharing efforts can be easily undermined by poor data documentation and data management, two critical activities necessary to ensure data quality and the likelihood that existing data are utilized.

Data handling typically gets short shrift in scholarly publications. While researchers painstakingly report study methods and measurement tools, data handling is largely invisible. Some argue this practice needs to change to ensure research credibility (e.g., Simmons, Nelson, & Simonsohn, 2011), but publishers do not necessarily want to give up page space for these details (Leahey, 2008).

Yet, published studies are difficult, if not impossible, to replicate, retractions are on the rise, and there is some resistance to data sharing.

The inability to replicate (http://www.omsj.org/corruption/scientists-elusive-goal-reproducing-study-results) is often attributed to a lack of detailed information about study procedures or data handling, given the limited detail journals allow or require authors to report. Retractions, on the other hand, are becoming more common (Fang, Steen, & Casadevall, 2012). They can be the result of a simple data handling mistake, like forgetting to apply a sample weight or reporting a finding based on a miscoded value, or can be attributed to fraud and falsified data (http://retractionwatch.wordpress.com/).
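The sample-weight mistake mentioned above is easy to demonstrate. The values and weights below are invented for this sketch, assuming a design where one group was oversampled and therefore down-weighted.

```python
# Hypothetical illustration: forgetting to apply survey weights changes an
# estimate. The outcome indicator and weights are invented for this example.
values  = [1, 1, 0, 0, 0]            # indicator of some outcome per respondent
weights = [0.5, 0.5, 2.0, 2.0, 2.0]  # design weights (oversampled group down-weighted)

# Naive estimate: treats every respondent as equally representative.
unweighted = sum(values) / len(values)

# Correct estimate: weights each respondent by their design weight.
weighted = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(unweighted)  # 0.4
print(weighted)    # 1/7, about 0.143
```

A reported prevalence of 40% versus 14% from the same file is exactly the kind of discrepancy that forces a retraction.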

Despite these problems, Leahey (2008) found that funding agencies, institutional review boards, and publishers reported that data editing oversight fell outside of “the stages and domains of current gatekeeping activity.” All three gatekeepers agreed it is incumbent on the researcher to ensure data are handled properly and ethically. However, some journals are taking proactive steps to improve study transparency and reproducibility (http://www.nature.com/ni/journal/v14/n5/full/ni.2603.html).

It appears these issues should be addressed, but how? What do you think?

Fang, F. C., Steen, R. G., & Casadevall, A. (2012). Misconduct accounts for the majority of retracted scientific publications. Proceedings of the National Academy of Sciences of the United States of America, 109(42), 17028-17033.

How much do bad data cost? Some suggest bad data might translate into a 3 trillion dollar problem (http://hollistibbetts.sys-con.com/node/1975126), although the true costs are hard to determine based on reported study data (Haug, Zachariassen, & van Liempd, 2011). What we do know is that poor quality data are costly, no matter how you look at it.

Automated cleaning rules reduce the cost of data cleaning, speed up the process, and impose a level of standardization that is not possible with manual data cleaning approaches.
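A minimal sketch of what rule-based cleaning can look like is shown below. The field names, rules, and records are invented for illustration; the point is that each rule is a check plus a fix applied uniformly to every record, with a log that doubles as data handling documentation.

```python
# A minimal sketch of automated, rule-based cleaning (fields and rules are
# hypothetical). Every record passes through the same rules, giving the
# standardization that ad hoc manual edits cannot, and the log records
# exactly which edits were applied to which records.

RULES = [
    # (description, check, fix)
    ("age in plausible range", lambda r: 0 <= r["age"] <= 120,
     lambda r: {**r, "age": None}),                  # set implausible age to missing
    ("state code uppercased",  lambda r: r["state"].isupper(),
     lambda r: {**r, "state": r["state"].upper()}),  # normalize state codes
]

def clean(records):
    cleaned, log = [], []
    for i, rec in enumerate(records):
        for desc, check, fix in RULES:
            if not check(rec):
                rec = fix(rec)
                log.append((i, desc))   # audit trail for documentation
        cleaned.append(rec)
    return cleaned, log

data = [{"age": 34, "state": "ri"}, {"age": 340, "state": "MA"}]
out, log = clean(data)
print(out)  # "ri" uppercased; implausible age 340 set to None
print(log)  # which rule fired on which record
```

Because the log is generated mechanically, documenting how the data were edited requires no extra effort from the analyst.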

Real-time data may be appropriate for simple analyses (e.g., averages, distributions). Such summaries can provide a general idea about performance and potentially inform decision-making, but using uncleaned real-time data in more complex analyses, such as statistical testing and predictive modeling, may prove more costly in the long run than fully cleaning the data (http://www.mjdatacorp.com/does-data-cleaning-matter-a-resounding-yes/). Thus, the short-term costs of implementing a data cleaning process must be weighed against the long-term costs of drawing the wrong conclusions.

On the one hand, delayed data analysis may lead to fatal delays in budget planning; on the other hand, financial decisions based on inaccurate data can be costly. After all, what policy-maker wants to be put in the embarrassing position of having to retract what they thought was a “data-driven” decision?

When considering the costs of data management and data cleaning, keep in mind that you can pay now or you can pay later.

Eleven years ago we made what turned out to be a fortunate discovery at Datacorp. In the course of natural staff turnover, we learned that data managers and analysts didn’t all do their work the same way. It’s one of those discoveries you don’t necessarily make until it’s staring you in the face: you’ve got a deadline, and no one is 100% sure how a dataset has been handled or what the best and most efficient way to proceed is. We decided straight away that formal data management policies and procedures were needed.

We formed a Data Management Steering Committee made up of two scientists, two data analysts, a data manager, and a research assistant. We deliberately included a group of people that represented the company’s policy makers as well as end users. The group created a transparent infrastructure for everything from standardized cleaning rules to the company’s overarching file structure. We reaped the benefits immediately: anyone could find anything in our file structure at any time. Most importantly, though, the full benefits of this effort were realized through the standardization of our practices.

The Data Management Steering Committee expanded and evolved into our present day Data Governance Council. Our Data Governance Council makes all high-level decisions about how data are handled. It is responsible for data quality, data policies, processes, and risk management related to data handling. For instance, when we instituted our variable warehouse, there were a host of decisions that had to be made from both a scientific and technical perspective. Occasionally the council revisits issues as company needs change over time.

Eleven years later, we continue to see the value of this early effort. Our policies and practices have made it easier to train up new staff, transition projects from one staff member to another, oversee and monitor project progress, and understand how data have been handled from start to finish. Additionally, our clients benefit from the consistency and efficiencies applied in the work delivered to them.