Abstract

Motivation:

Current plant and animal genomic studies are often based on newly assembled genomes that have not been properly consolidated. In this scenario, misassembled regions can easily lead to false-positive findings. Although quality control scores are included within genotyping protocols, they are usually employed to evaluate individual sample quality rather than reference sequence reliability. We propose a statistical model that combines quality control scores across samples in order to detect incongruent patterns at every genomic region. Our model is inherently robust since common artifact signals are expected to be shared between independent samples over misassembled regions of the genome.

Results:

The reliability of our protocol has been extensively tested through different experiments and organisms with accurate results, improving on state-of-the-art methods. Our analysis demonstrates synergistic relations between quality control scores and allelic variability estimators that improve the detection of misassembled regions, and it is able to find strong artifact signals even within the human reference assembly. Furthermore, we demonstrate how our model can be trained to properly rank the confidence of a set of candidate variants obtained from new independent samples.

General scheme of the methodology. (a) The LGP is constructed from sample reads that cover regions across the genome. (b) Then, specific markers of interest can be evaluated by contrasting their corresponding window value against the stored empirical distributions. Finally, the CES is computed to obtain the definitive diagnosis.
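The idea of contrasting a window value against a stored empirical distribution can be illustrated with a minimal sketch. It is a generic empirical tail-fraction check, not the LGP or CES computation, whose definitions are not given here. The function names, the use of a single score per window, and the `alpha` threshold are all assumptions made for illustration.

```python
from bisect import bisect_left


def empirical_tail_fraction(background, value):
    """Fraction of background values that are >= value (hypothetical helper)."""
    if not background:
        raise ValueError("background distribution must not be empty")
    ordered = sorted(background)
    # bisect_left returns the count of values strictly smaller than `value`.
    n_below = bisect_left(ordered, value)
    return (len(ordered) - n_below) / len(ordered)


def flag_windows(window_values, background, alpha=0.05):
    """Indices of windows whose value lies in the upper tail of the background.

    `alpha` is an assumed cut-off, not a value taken from the method.
    """
    return [
        i
        for i, v in enumerate(window_values)
        if empirical_tail_fraction(background, v) <= alpha
    ]


if __name__ == "__main__":
    background = list(range(1, 101))  # toy background scores 1..100
    print(flag_windows([10, 96, 100], background))
```

In this toy setting, only windows whose score is matched or exceeded by at most 5% of the background values are flagged. A real analysis would presumably combine several quality control scores across samples rather than thresholding a single one.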

Distribution of CES values depending on similarity score for Ahy (a), Sce (b) and Ath (c). CES was also plotted for Ath patched regions (d), split into deletions (DEL), insertions (INS), substitutions (SUBS) and the set of randomly selected loci (B) that represents the background variability state of the genome. Distributions of REAPR values are also represented for the same categories: Ahy (e), Sce (f), Ath (g) and Ath patches (h).

CES distribution values for Hsa analysis. Clear differences are shown between patched and random regions of the genome (a). Also, CES showed a clear correlation with the number of mismatches between the NGS protocol and the validation SNP array (b). Interestingly, the false-positive variants of an independent set of samples fall at the end of the ranking (c). The mean cumulative density function (cdf) of false positives is depicted (d) with clear differences between REAPR (light red curve) and our methodology (black curve).