Maximum size for HaplotypeCaller

I'm running HaplotypeCaller and was wondering what the recommended cohort size is. Would 30 to 40 families (100 to 150 individuals) be too large? Is there a recommended maximum for the number of independent alleles or the number of individuals? How does the run time scale with additional individuals?

Answers

We no longer recommend running HaplotypeCaller on multisample cohorts because of performance issues (runtimes grow exponentially as you add individuals). Instead, we have a new workflow that completely bypasses these issues. It involves running HC individually on each sample in GVCF mode, then performing a joint analysis with a new tool called GenotypeGVCFs. This is dramatically faster and more powerful than multisample calling. Have a look at this document, which explains the process in more detail: http://www.broadinstitute.org/gatk/guide/article?id=3893
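As a rough illustration of the two-step workflow described above, here is a minimal Python sketch that only builds the command lines (it does not run them). It assumes GATK 3.x-style invocation (`-T HaplotypeCaller ... --emitRefConfidence GVCF` and `-T GenotypeGVCFs ... --variant`); the jar path, reference name, and sample file names are hypothetical placeholders. Check the tool documentation for your GATK version, since the required arguments have changed between releases.

```python
from pathlib import Path

GATK_JAR = "GenomeAnalysisTK.jar"  # assumption: path to a GATK 3.x jar
REFERENCE = "reference.fasta"      # hypothetical reference file name


def haplotypecaller_gvcf_cmd(bam, reference=REFERENCE, jar=GATK_JAR):
    """Build the per-sample HaplotypeCaller command in GVCF mode.

    Returns the argument list and the name of the gVCF it would write.
    """
    gvcf = str(Path(bam).with_suffix(".g.vcf"))
    cmd = [
        "java", "-jar", jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "--emitRefConfidence", "GVCF",
        "-o", gvcf,
    ]
    return cmd, gvcf


def genotype_gvcfs_cmd(gvcfs, output, reference=REFERENCE, jar=GATK_JAR):
    """Build one joint GenotypeGVCFs command over all per-sample gVCFs."""
    cmd = ["java", "-jar", jar, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    cmd += ["-o", output]
    return cmd
```

Each per-sample command can be run independently (for example, one cluster job per BAM), and the joint step is run once over the resulting gVCFs. The argument lists can be passed to `subprocess.run(cmd, check=True)` if you want to execute them from Python.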

Hi, I have a follow-up question about GenotypeGVCFs. Is there an advantage to running it on an ethnically matched cohort of individuals versus running it on one individual or one family at a time? Is it population-aware?

Sorry for the late reply. GenotypeGVCFs is indeed population-aware in the sense that it genotypes each sample in light of the evidence reported for all samples. Running it on a matched cohort (rather than on individuals or families) will therefore increase your ability to call variants for which the supporting evidence is weak in any one sample but is also observed in other samples of the population. And the larger the population, the more power you have to correctly identify rare variants.

Hi, another follow-up question. I work with a lab that accumulates samples over time. Would it be fine to just re-genotype everyone together each time we get a new batch of samples, or should we try to separate samples of different ethnicities into different genotyping cohorts? Does the presence of non-ethnically matched samples disturb anything?

For example, we could genotype 70 African American individuals together. Or, we could genotype them along with 30 Asian individuals and 300 European American individuals. Will the first method produce better results than the second?

You can just re-genotype everyone together; the presence of non-matched samples should not cause any disruptive effects. I think you might even get some benefits if some of your individuals have some admixed history.
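For a lab that accumulates samples over time, one way to "re-genotype everyone together" is to keep every per-sample gVCF and rebuild a single joint command whenever a batch arrives. The sketch below assumes a hypothetical directory layout with one subfolder of `*.g.vcf` files per batch, and GATK 3.x-style arguments; the jar and reference names are placeholders, and the helper names are made up for illustration.

```python
from pathlib import Path


def collect_gvcfs(gvcf_root):
    """Return every per-sample gVCF found under gvcf_root, across all batches."""
    return sorted(str(p) for p in Path(gvcf_root).rglob("*.g.vcf"))


def joint_genotyping_cmd(gvcf_root, output,
                         reference="reference.fasta",   # hypothetical name
                         jar="GenomeAnalysisTK.jar"):   # assumed GATK 3.x jar
    """Build one GenotypeGVCFs command covering old and new samples alike."""
    gvcfs = collect_gvcfs(gvcf_root)
    if not gvcfs:
        raise ValueError(f"no .g.vcf files found under {gvcf_root}")
    cmd = ["java", "-jar", jar, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    cmd += ["-o", output]
    return cmd
```

With this layout, only the new samples need a HaplotypeCaller run; the existing gVCFs are reused and just the joint genotyping step is repeated over the full set.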