Objective Errors within scientific publications contribute to research irreproducibility. A collection of highly similar cancer research publications (CorpusP) was recently identified, and 38 of 48 of these publications (79%) included nucleotide sequence(s) whose identities, according to blastn analyses, did not match their experimental use (either targeting an identified gene or serving as a nontargeting control). To expand the capacity to identify other studies that may incorrectly describe nucleotide sequence reagents, we aimed to design a semiautomated tool that checks the claimed use of nucleotide sequence reagents against indisputable facts from blastn homology searches; claims made in the literature were also checked using Google Scholar searches.

Design From a given publication, seek & blastn, a semiautomated tool, automatically extracts gene identifiers and nucleotide sequences (15 to 90 bases) using named entity recognition techniques (thesaurus and rules). The sentence containing each sequence is automatically analyzed (using finite-state machines) to assign a claimed status (targeting or nontargeting) that is compared with the most likely status according to blastn analysis. Claimed status within the literature can be further assessed by Google Scholar searches. The approach was built using the CorpusP publications and further analyzed using a set of 154 unknown studies (CorpusU) retrieved using studies from CorpusP and the “PubMed similar” functionality.
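The extraction and comparison steps described above can be sketched as follows. This is a minimal illustration only, not the seek & blastn implementation: it assumes sequences are written as contiguous uppercase runs of A, C, G, T, or U, it replaces the finite-state machines with a simple keyword rule, and it takes the blastn-derived status as an input rather than running blastn. All function names, cue lists, and example sentences (including the gene name GENE1 and both sequences) are hypothetical.

```python
import re

# Assumption: a reagent sequence is a contiguous run of 15 to 90 uppercase
# bases, not embedded in a longer alphabetic token.
SEQ_RE = re.compile(r"(?<![A-Za-z])[ACGTU]{15,90}(?![A-Za-z])")

# Hypothetical cue lists; a keyword rule stands in for the finite-state
# machines mentioned in the abstract.
NONTARGETING_CUES = ("nontargeting", "non-targeting", "scrambled", "control")
TARGETING_CUES = ("targeting", "against", "specific for")


def extract_sequences(sentence):
    """Return all candidate nucleotide sequences (15 to 90 bases) in a sentence."""
    return SEQ_RE.findall(sentence)


def claimed_status(sentence):
    """Guess the claimed status of the reagent described in a sentence."""
    text = sentence.lower()
    # Check nontargeting cues first, because "nontargeting" contains "targeting".
    if any(cue in text for cue in NONTARGETING_CUES):
        return "nontargeting"
    if any(cue in text for cue in TARGETING_CUES):
        return "targeting"
    return "unknown"


def is_flagged(claimed, blastn_status):
    """Flag a sequence when a known claim disagrees with the blastn-derived status."""
    return claimed != "unknown" and claimed != blastn_status


# Made-up sentences for illustration only.
s1 = "The siRNA targeting GENE1 was 5'-ACGTACGTACGTACGTACGTA-3'."
s2 = "A scrambled siRNA (5'-GATTACAGATTACAGATTACA-3') served as a negative control."

for sentence in (s1, s2):
    print(extract_sequences(sentence), claimed_status(sentence))
```

In this sketch a flagged sequence is only a candidate for follow-up: the blastn-derived status would come from a separate homology search, and a mismatch would still need review by an expert reader.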

Results In CorpusP and CorpusU, 48 of 48 (100%) and 111 of 154 (73%) publications, respectively, included nucleotide sequences that were extracted using seek & blastn. Application of seek & blastn identified the 38 of 48 studies (79%) in CorpusP that appear to have incorrectly employed nucleotide sequence reagent(s). More nontargeting than targeting sequences were accurately predicted to have been used incorrectly (37 of 47 [78.7%] vs 19 of 294 [6.5%]). Furthermore, the analysis of nucleotide sequences flagged by seek & blastn predicted that 30 of 154 CorpusU studies (19%) may have incorrectly employed nucleotide sequence reagent(s). However, the automated use of seek & blastn faces challenges. Overall, 10 of 341 sequences (2.9%) in CorpusP were not extracted and 11 of 341 (3.2%) were incorrectly extracted, and claims were either not identified or incorrectly identified for 19 of 341 sequences (5.6%). Furthermore, gene identifier variations may complicate the analysis of targeting sequences. Application of seek & blastn therefore currently requires follow-up analyses by life science expert peers.

Conclusions Preliminary use of seek & blastn suggests that the incorrect use of nucleotide sequence reagents may frequently go undetected and may represent an underestimated source of error in life science publications. Text mining and text analysis tools such as seek & blastn may therefore provide valuable support in helping peers identify obvious errors in the published or forthcoming scientific literature.

1Molecular Oncology Laboratory, Children’s Cancer Research Unit, Kids Research Institute, The Children’s Hospital at Westmead, Westmead, Australia; 2The University of Sydney Discipline of Child and Adolescent Health, The Children’s Hospital at Westmead, Westmead, Australia; 3University of Grenoble Alpes, The National Center for Scientific Research (CNRS), Grenoble, France, cyril.labbe@imag.fr

Conflict of Interest Disclosures: Springer-Nature is funding a PhD student within the research group of Cyril Labbé. This PhD project is exploring methods to detect automatically generated scientific papers. Funding from Springer-Nature did not support the work described in this abstract.