Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Jump to another community

Calling complex pedigrees with HaplotypeCaller

I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees?

The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend?

And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)?

@astrand‌ Not quite: you should run HaplotypeCaller on each sample individually with the --emitRefConfidence option, then run GenotypeGVCFs on all the resulting GVCFs together. This is a new(ish) workflow for joint analysis that gives best results while being computationally efficient. See the methods documentation for more details.

All good except the first step -- creating single-sample VCFs in regular mode is redundant if you go with the GVCF workflow. And to be clear, the output of GenotypeGVCFs will be a single VCF containing all the samples that you put through it together.

Workflow

To call variants with the GATK using pedigree information, you should base your workflow on the Best Practices recommendations -- the principles detailed there all apply to pedigree analysis.

But there is one crucial addition: you should make sure to pass a pedigree file (PED file) to all GATK walkers that you use in your workflow. Some will deliver better results if they see the pedigree data.

At the moment there are two of the standard annotations affected by pedigree:

Allele Frequency (computed on founders only)

Inbreeding coefficient (computed on founders only)

@gauthier said:
The new Genotype Refinement Pipeline is already in the codebase and should be available via the nightly builds. It has the capability (via CalculateGenotypePosteriors) to derive posterior genotype probabilities (in the new PP format field) based on the genotype likelihoods of the other members of the trio. (Genotypes will be modified based on these posteriors if necessary.) You can pass in population allele counts from HapMap or 1000 Genomes to help inform the posteriors as well. There's also a new possibleDeNovo annotation that can be applied with VariantAnnotator after CGP to tag high- and low-confidence de novo mutations in the trio offspring if that's something you're interested in.

There's some information in the CalculateGenotypePosteriors tool docs, but more comprehensive documentation is forthcoming pending the completion of Geraldine_VdAuwera's trip abroad.

If I'm not very clear, then it's because I'm not quite sure of the advantages of using pedigree information with the various GATK tools.

@tommycarstensen HC doesn't use the pedigree for calling variants (regardless of mode), but the ped file gets used for calculating annotations like InbreedingCoeff if you request them. The tools that are able to use the ped file are the ones that clearly require it (like PhaseByTransmission) plus the annotator-capable tools (HC, UG, VariantAnnotator and GenotypeGVCFs) if you request an annotation that has to do with population structure.

Therefore, in order to annotate all my samples together (multi-sample variant calling by using GenotypeGVCFs) do I need to provide all the bam files (using -I ) once again here as an input? along with all the GVCFs (using --variant ), in order to get a raw single VCF?

I don't believe GenotypeGVCFs takes bam files as input. It also seems to me that you shouldn't have to annotate twice... Seems to me that you could just pass the pedigree file when running GenotypeGVCFs and you would be fine. I have not used ped files with these tools though, so someone more experienced should confirm/correct my assertion.

I also have a family (of plants) for which I want to define haplotypes. I used to call Haplotypes using Haplotype Caller and phasing them by PhaseByTransmission. Has using PhaseByTransmission now become redundant when Haplotype Caller and CalculateGenotypePosteriors are used? What about ReadBackedPhasing? It also does physical phasing similiar to HC. What is the difference? In which order are the tools recommended to be used?

@Geraldine_VdAuwera said:
All good except the first step -- creating single-sample VCFs in regular mode is redundant if you go with the GVCF workflow. And to be clear, the output of GenotypeGVCFs will be a single VCF containing all the samples that you put through it together.

Would you recommend the 4 following steps for pedigree sample analysis:

1 Run HaplotypeCaller in GVCF mode on each sample bam file to create single-sample GVCFs (using --emitConfidence GVCF, --variant_index_type LINEAR, --variant_index_parameter 128000)
2. run CombineGVCFs per ~200 samples
3. Run GenotypeGVCFs on all the gvcf files together to get raw VCFs
4. Run VQSR on these raw VCFs
5. genotype refinement pipeline (CalculateGenotypePosteriors,VariantFiltration,VariantAnnotator,SnpEff)

Since I have about 1200 samples, would you recommend process these samples per chromosome in step 2 and 3 ?

If your plant family is in trios or you know the parents and the offspring, you can use PhaseByTransmission to get more accurate phasing.

If you are interested in exact genotypes, you should use CalculateGenotypePosteriors.

If you are interested in all the tools, you can run HaplotypeCaller to get phasing. Then, you can run Calculate Genotype Posteriors to get the updated genotypes and posteriors. Then, if you want, you can run PhaseByTransmission again to re-phase after the genotypes and posteriors have been updated.

ReadBackedPhasing is redundant, but it does come in handy when merging MNPs, because Haplotype Caller does not currently do it.

If you are interested in all the tools, you can run HaplotypeCaller to get phasing. Then, you can run Calculate Genotype Posteriors to get the updated genotypes and posteriors. Then, if you want, you can run PhaseByTransmission again to re-phase after the genotypes and posteriors have been updated.

I still have a question on this.... I run the HapotypeCaller on each genotype individually, then I use GenotypeGVCFs to merge the variants of within a family of mother, father and 35 children. After that I run CalculateGenotypePosteriors on the merged genomic vcf. In order to run PhasebyTransmission I create 35 individual vcfs each containing the parents and one children using SelectVariants. The resulting haplotypes of the parents after running PhaseByTransmission differ between the phased vcfs depending on the childs haplotypes. Do you have a recommondation on how to deal with this?

It looks like what is happening is that the father (Villard Blanc's) genotype at that position is not really known, so whatever makes sense given the child is what is being output. Notice the PLs for the father. 0,0,376 means we are not sure if he is 0/0 or 0/1. When the child is 0/1 and the mother is 0/0, it makes sense to assume the father is 0/1. Unfortunately, the site is ambiguous because of the PLs, so there may not be anything we can do.