Tackling genomic data corruption with a new tool

Data conversions and formatting defaults can cause unwanted transformations to scientific data. A recently published article in BMC Genomics demonstrates a possible solution to the issue of data corruption in genomics.

As scientific research becomes more reliant on published datasets and online resources, the effects of data corruption are a growing concern. This is particularly the case in the field of genomics where researchers often use online depositories and published supplements for their analyses.

Problems can occur when data is altered by formatting defaults in programs such as Microsoft Excel. For genomics researchers, the trouble starts when gene names are read as dates. For example, a gene name such as ‘SEPT2’ may be converted to 2016/09/02 when the program automatically formats the entry as a date.
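To make the mechanism concrete, here is a minimal Python sketch of how a gene symbol can be mistaken for a date. It is a simplified heuristic written for illustration, not the parser Excel actually uses: the function name `guess_date`, the list of month prefixes and the matching pattern are all assumptions.

```python
import re
from datetime import date

# Assumed, simplified list of prefixes a spreadsheet might read as a month.
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "MARCH": 3, "APR": 4, "MAY": 5,
          "JUN": 6, "JUL": 7, "AUG": 8, "SEP": 9, "SEPT": 9, "OCT": 10,
          "NOV": 11, "DEC": 12}

# Letters, an optional hyphen, then one or two digits (e.g. SEPT2, DEC-1).
PATTERN = re.compile(r"^([A-Za-z]+)-?(\d{1,2})$")

def guess_date(symbol, year):
    """Return the date a symbol could be mistaken for, or None (hypothetical helper)."""
    match = PATTERN.match(symbol.strip())
    if match is None:
        return None
    month = MONTHS.get(match.group(1).upper())
    if month is None:
        return None
    try:
        return date(year, month, int(match.group(2)))
    except ValueError:
        # The digits do not form a valid day for that month.
        return None

sept2 = guess_date("SEPT2", 2016)  # date-like: 2 September 2016
tp53 = guess_date("TP53", 2016)    # not date-like: None
```

Under these assumptions ‘SEPT2’ maps to 2 September 2016, matching the example above, while a symbol such as ‘TP53’ is left alone because its letters are not a month prefix.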

The huge scale of this problem was highlighted by Ziemann et al. in an article recently published in Genome Biology. The researchers scanned leading genomics journals and found that one fifth of papers containing supplementary gene lists harboured these errors. Altered gene names were present in 987 supplementary files from 704 publications. Interestingly, the journals with the highest impact factors were more likely to contain the corrupted data.

Finding this high rate of data corruption in published materials is disconcerting to the genomics community, who utilise these data sources heavily. This error can be difficult to reverse, and there have been few known solutions to remediate it.

To combat this troubling issue, a team from the Health Research Institute Germans Trias i Pujol (IGTP) in Spain has developed a new software tool named Truke to help retain the integrity of genomic data. The new software, which is described in an article recently published in BMC Genomics, restores corrupted symbols to their original gene names through methods such as library referencing and reverse engineering.

Using data sourced from the National Center for Biotechnology Information (NCBI) database, a library of error-prone gene symbols is collated and their corresponding ‘dates’ are predicted. When the user’s datasets are scanned by the system, the false dates are identified and converted back to their original gene names. Safeguards are in place to avoid introducing new errors. For example, where a false date could correspond to more than one gene name, such as SEPT1 or SEPT-1, the discrepancy is flagged to the user for action.
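The library approach described above can be sketched in a few lines of Python. This is an illustration of the general idea only, not Truke's code (which is written in R): the short symbol list stands in for the NCBI-derived library, the date format follows the 2016/09/02 example given earlier, and the names `predicted_date`, `build_library` and `restore` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed month prefixes; a real library would be derived from NCBI symbols.
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "MARCH": 3, "APR": 4, "MAY": 5,
          "JUN": 6, "JUL": 7, "AUG": 8, "SEP": 9, "SEPT": 9, "OCT": 10,
          "NOV": 11, "DEC": 12}
SYMBOL_RE = re.compile(r"^([A-Z]+)-?(\d{1,2})$")

def predicted_date(symbol, year=2016):
    """Predict the date string a symbol would turn into, or None if it is safe."""
    match = SYMBOL_RE.match(symbol.upper())
    if match is None or match.group(1) not in MONTHS:
        return None
    day = int(match.group(2))
    if not 1 <= day <= 31:  # coarse check; month lengths are ignored here
        return None
    return f"{year}/{MONTHS[match.group(1)]:02d}/{day:02d}"

def build_library(symbols, year=2016):
    """Map each predicted date string to every symbol that could produce it."""
    library = defaultdict(list)
    for symbol in symbols:
        date_string = predicted_date(symbol, year)
        if date_string is not None:
            library[date_string].append(symbol)
    return dict(library)

def restore(values, library):
    """Replace unambiguous false dates; flag ambiguous ones for the user."""
    restored, flagged = [], []
    for value in values:
        candidates = library.get(value)
        if candidates is None:
            restored.append(value)          # not a known false date
        elif len(candidates) == 1:
            restored.append(candidates[0])  # exactly one possible origin
        else:
            restored.append(value)          # left unchanged on purpose
            flagged.append((value, candidates))
    return restored, flagged

# Illustrative symbol list, not real NCBI data.
library = build_library(["SEPT1", "SEPT-1", "SEPT2", "DEC1", "TP53"])
restored, flagged = restore(["2016/09/02", "TP53", "2016/09/01"], library)
```

In this sketch the value 2016/09/02 is restored to SEPT2, whereas 2016/09/01 is left untouched and reported with both SEPT1 and SEPT-1 as candidates, mirroring the safeguard of handing ambiguous cases back to the user rather than guessing.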

The Truke web application was built using R, Shiny, HTML and Bootstrap 2, and is freely available for use at http://maplab.cat/truke.

The article presents a potential solution to a problem that is commonly overlooked by both scientists and publishers. As more and more data is published each year, software tools of this kind may help to lessen these threats to data integrity.