What if scientists don’t really know what’s in their vials and lab dishes? A research team has analyzed dozens of data sets from human genomics studies and found that nearly half of them have a sexual identity problem—they’re labeled as coming from a male but the data suggest they must be from a female, or vice versa. These mix-ups, likely due to accidental mislabeling of the data at some point, but possibly also from cell contamination in the original samples, could have untold effects on the validity of comparisons in genomics experiments conducted worldwide, according to the group, which last week posted its findings on bioRxiv, a site for preprints that have not yet been formally peer reviewed.

The disputed data sets describe a tissue’s transcriptome—the array of messenger RNAs (mRNAs) produced when genes in cells turn on to manufacture a protein. Although much work has been done in recent years to reduce errors in studies of RNA transcriptomes, computational biologist Lilah Toker and her colleagues at the University of British Columbia, Vancouver, in Canada, kept noticing errors in how samples were labeled after they performed routine quality checks of data sets. “At some point we were wondering if this is just because we are doing so much data analysis, or is it actually something much more widespread,” Toker says.

Toker and her colleagues then examined the transcriptomes from 70 publicly available data sets for human tissue samples, trying to corroborate the sex of the tissues by looking for mRNAs from male- or female-specific genes. They found discrepancies between the labeled sex and the mRNA results in 32 out of the 70 data sets…
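To make the check concrete, here is a minimal sketch of how a labeled sex might be compared against sex-specific mRNA levels. It is not the team's actual pipeline. The marker genes (XIST, which is expressed from the inactive X chromosome in female cells, and the Y-linked genes RPS4Y1 and KDM5D), the `margin` threshold, the function names, and all expression values are assumptions chosen for illustration; the article does not name the genes or cutoffs the researchers used.

```python
# Illustrative sketch only: marker genes, threshold, and data are assumptions,
# not details taken from the study described in the article.

FEMALE_MARKERS = ("XIST",)           # expressed from the inactive X chromosome
MALE_MARKERS = ("RPS4Y1", "KDM5D")   # genes located on the Y chromosome


def infer_sex(expression, margin=1.0):
    """Infer sex from a dict of gene -> log2 expression.

    Returns "F" or "M" when one marker group clearly dominates,
    or None when markers are missing or the signal is ambiguous.
    """
    female = [expression[g] for g in FEMALE_MARKERS if g in expression]
    male = [expression[g] for g in MALE_MARKERS if g in expression]
    if not female or not male:
        return None
    diff = sum(female) / len(female) - sum(male) / len(male)
    if diff > margin:
        return "F"
    if diff < -margin:
        return "M"
    return None


def find_mismatches(samples):
    """Return (sample_id, labeled_sex, inferred_sex) for each conflicting sample."""
    flagged = []
    for sample_id, labeled, expression in samples:
        inferred = infer_sex(expression)
        if inferred is not None and inferred != labeled:
            flagged.append((sample_id, labeled, inferred))
    return flagged


# Made-up expression values for three hypothetical samples.
samples = [
    ("s1", "F", {"XIST": 9.5, "RPS4Y1": 3.1, "KDM5D": 2.8}),
    ("s2", "M", {"XIST": 9.2, "RPS4Y1": 3.0, "KDM5D": 3.2}),   # labeled male, looks female
    ("s3", "M", {"XIST": 3.0, "RPS4Y1": 10.1, "KDM5D": 8.4}),
]

print(find_mismatches(samples))  # [('s2', 'M', 'F')]
```

Samples with an ambiguous signal are left unflagged rather than guessed at. That seems the safer default, since a mixed signal could reflect contamination, which the article raises as one possible cause, rather than a labeling error.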