SVDiscovery yielding far greater than expected number of polymorphic sites

I ran SVPreprocess and SVDiscovery on two loci in the human genome, each about 2 Mb in size, using high-coverage WGS data from 200 samples aligned to hg38 as input. One locus contains multiple paralogous genes, whereas the other does not. SVDiscovery yielded ~4000 variants in each of the two loci, which seems excessive given the size of these loci. Of these, one locus had 6 variants with a filter status of "PASS" and the other had 35. I kept the minimum size for SV deletion discovery at the default (100 bp), so this seems unlikely to be explained by SNVs or indels. It would be great to get your thoughts on why this might be happening.

Best Answer

The output of SVDiscovery includes a line for every candidate variant that was evaluated. You should focus only on the variants where the filter field is PASS unless you want to adjust the default filtering for some reason (see below). So it sounds like the yield is 6 and 35, which is probably more in line with your expectations.

For some applications, people have relaxed the default filtering to select more sites and so increase sensitivity, sometimes using machine learning approaches. But as you note, most of the lines in the VCF represent candidates that are not true variants (e.g. due to misalignment, chimerism, or sequencing error) and should be rejected. You may also need to apply more stringent filters to the variants marked PASS by SVDiscovery. Typically we would take the PASS variants through to genotyping and then apply additional filters.
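If it helps to see how the ~4000 lines break down, here is a minimal Python sketch that tallies the FILTER column of a VCF and pulls out the PASS records. It relies only on the standard VCF layout (tab-separated, FILTER in the seventh column, multiple filter names joined by semicolons). The function names and any file path you pass in are my own placeholders, not part of Genome STRiP, and this is just a quick inspection aid rather than a replacement for the pipeline's own filtering.

```python
import gzip
from collections import Counter

FILTER_COLUMN = 6  # FILTER is the 7th tab-separated column in a VCF record


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text (path is a placeholder)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def tally_filters(lines):
    """Count how often each FILTER value occurs across the data lines.

    A record carrying several filters (e.g. "A;B") is counted once per
    filter name, so the totals can exceed the number of records.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= FILTER_COLUMN:
            continue  # skip malformed lines
        for name in fields[FILTER_COLUMN].split(";"):
            counts[name] += 1
    return counts


def pass_records(lines):
    """Yield only the data lines whose FILTER field is exactly PASS."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > FILTER_COLUMN and fields[FILTER_COLUMN] == "PASS":
            yield line.rstrip("\n")
```

For example, `with open_vcf("discovery.vcf") as fh: print(tally_filters(fh))` (the file name is hypothetical) prints the per-filter counts, and `sum(1 for _ in pass_records(fh))` on a freshly opened handle should give the 6 or 35 figure for each locus.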
