1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"

QUAL is a key: the name of the annotation we want to look at

30.0 is a value: the threshold that we want to use to evaluate variant quality against

> is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".

4. Filtering on sample/genotype-level properties

You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. Variant Filtration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1).

5. Important caveats

Sensitivity to case and type

You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:

Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g.45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.

6. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The bets way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

Examples using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:

Example using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

This document describes why you might want to extract a subset of variants from a callset and how you would achieve this.

Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). The GATK tool that we use the most for subsetting calls in various ways is SelectVariants; it enables easy and convenient subsetting of VCF files according to many criteria.

Select Variants operates on VCF files (also sometimes referred to as ROD in our documentation, for Reference Ordered Data) provided at the command line using the GATK's built in --variant option. You can provide multiple VCF files for Select Variants, but at least one must be named 'variant' and this will be the file (or set of files) from which variants will be selected. Other files can be used to modify the selection based on concordance or discordance between the callsets (see --discordance / --concordance arguments in the tool documentation).

There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name, or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency. This is done using JEXL expressions; make sure to read the linked document for details, especially the section on working with complex expressions.

Note that in the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset. See further below for an explanation of how that works.

Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.

Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

isVariant => is there an ALT allele?

isPolymorphic => is some sample non-ref in the samples?

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:

If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.

How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB MARY LINDA
1/0:20 0/0:30 1/1:50

In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB MARY
1/0:20 0/0:30

The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB MARY
1/0:20 ./.:.

with AN=2, AC=1, AF=0.5, and DP=20.

Additional information

For information on how to construct regular expressions for use with this tool, see the method article on variant filtering with JEXL, or "Summary of regular-expression constructs" section here for more hardcore reading.

This is not exactly new (it was fixed in GATK 3.0) but it's come to our attention that many people are unaware of this bug, so we want to spread the word since it might have some important impacts on people's results.

Affected versions: 2.x versions up to 2.8 (not sure when it started)

Affected tool: SelectVariants

Trigger conditions: Extracting a subset of samples with SelectVariants while using multi-threading (-nt)

Effects: Genotype-level fields (such as AD) swapped among samples

This bug no longer affects any tools in versions 3.0 and above, but callsets generated with earlier versions may need to be checked for consistency of genotype-level annotations. Our sincere apologies if you have been affected by this bug, and our thanks to the users who reported experiencing this issue.

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.

Unified Genotyper

Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

Improved the indexing scheme for gVCF outputs using the reference calculation model.

The reference calculation model now works with reduced reads.

Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.

Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

Variant Recalibrator

Disable tranche plots in INDEL mode.

Various VQSR optimizations in both runtime and accuracy. Some particular details include: for very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build; annotations are ordered by the difference in means between known and novel instead of by their standard deviation; removed the training set quality score threshold; now uses 2 gaussians by default for the negative model; numBad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.

Reduce Reads

Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

Diagnose Targets

Added calculation for GC content.

Added an option to filter the bases based on their quality scores.

Combine Variants

Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.

Miscellaneous

SplitSamFile now produces an index with the BAM.

Length metric updates to QualifyMissingIntervals.

Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.

I am using BWA/GATK Haplotype caller on core exom sequences.
In my pipeline I use an extended (250bp padding around exons for indels/50bp for SNPS) bed interval file to identify variants.
Following that, I use SelectVariants to filter out core variations with narrower bed coordinates (50bp padding for indels, 10bp for SNPs).

Everything is fine with SNPs, and insertions, however in case of large deletions SelectVariant is filtering out large deletions that are started outside of the core bed coordinates, even though the end of the deletion overlaps the narrower bed coordinates.

To visualise the problem I am attaching an IGV view, where the extended list contains the large deletion, however in the core list it is absent.

Is there any way to fix this issue? Calling the variants twice with different bed coordinates would require unnecessary computation power.

Had a JEXL syntax question that I was hoping to get some thoughts on. Let's say that I have a VCF file with 10 samples in it, S1 through S10. I would like to select the sites where among 5 specific samples, 3 are homozygous variant. So something like:

select ' vc.getHomVarCount(S1, S2, S3, S4, S5) >= 10 '

However this syntax doesn't work unfortunately. Is there a way to specify how to do something like this using vc.GetHomVarCount()?

I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.

Now I wish to generate some genotype specific FASTA sequences using these files and a reference sequence; i.e. a sequence for each genotype which is the same as the reference sequence but with the SNPs specific to each genotype in place of their counterparts in the reference sequence.

Now I know that there is variation in the genotypes. Here is a picture visualizing three exemplar genotypes that I generated by loading the VCF file into Geneious.

http://imgur.com/LRZHIcw

I then try to create individual VCF files for each genotype using this:

Here the Xs represent the location on the reference sequence of the regions of interest.

I did this in a loop and got identical sequences for every genotypes. I then implemented it individually for the 3 exemplar genotypes in the picture above and in both cases I get identical sequences for every genotype. Interestingly they are not the reference sequence.

I have used VariantFiltration to filter the variants based on quality thresholds. Subsequently, SelectVariants to select the filter passed variants using "--excludeFiltered" flag in the command line. Could it be possible to use SelectVariants to select only the filter failed variants? I could not find it in the documentation of the tool. Is there any such option available?

I currently have many WES gVCFs called with GATK 3.x HaplotypeCaller, and I'm now looking to combine them and run GenotypeGVCFs. Unfortunately, I forgot to add the "-L" argument to HC to reduce the size of the resulting gVCFs, and CombineGVCFs looks like it's taking much longer than I expect it to.

Is there any potential problem with using the "-L" argument to SelectVariants to reduce the size of my gVCFs and then use those smaller gVCFs in the CombineGVCFs stage (and beyond), or do I have to re-call HaplotypeCaller again? Would it be better to extend the boundaries of the target file by a certain amount to avoid recalling HaplotypeCaller?

I have a joint VCF that contains three samples, I want to filter the VCF based on the DP in one sample. I noticed that SelectVariants can only select variants based on "INFO" field in the VCF. Any suggestions?

In this case, I get no results ( although GATK executes succesfully without errors). However when I exclude the third clause I get the expected results namely those variants that are (1)heterozygous and (2) The ALT_AD exceeds 8 reads. Like so:

I am experimenting with JEXL expressions in order to do some hard filtering on variants. I wonder if there is a method to tell the filter expression to operate on the current sample (assuming here a single sample VCF file) without knowing the sample name a priori e.g.

vc.getGenotype("Sample1").isHet()

Works just fine to sample heterozygous positions from a VCF where I know the sample will be called "Sample1", but can I refer to a sample name programmatically, e.g. something like: vc.getGenotype( sample() ).isHet()

Sorry if this is a really dumb question. (BTW I realise you could use a genotype_filter e.g. --genotypeFilterExpression "isHet == 1" to do the same thing, but I want to annotate the VCF in the FORMAT/FILTER field, rather than the INFO field for downstream variant selection operations.

I split my vcf using snpsift - so I run it on each chromosome seperately. For 12 of the 38 chromosomes I get the error - the others work fine. I re-extracted one of the chromosomes that failed, to be sure it wasn't a snpsift issue, and it still fails.
Any ideas on what the problem might be?

ERROR stack trace

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:189)
at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:336)
at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:320)
at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:279)
at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:225)
at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:148)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:990)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:772)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:285)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:185)
... 14 more
Caused by: java.io.EOFException
at htsjdk.tribble.util.LittleEndianInputStream.readFully(LittleEndianInputStream.java:138)
at htsjdk.tribble.util.LittleEndianInputStream.readLong(LittleEndianInputStream.java:80)
at htsjdk.tribble.index.linear.LinearIndex$ChrIndex.read(LinearIndex.java:271)
at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:363)
at htsjdk.tribble.index.linear.LinearIndex.(LinearIndex.java:101)
... 19 more

Thanks to you, we've got an amazing .vcf file just bursting with SNPs. If my file contains five replicates of a control group and five replicates of a treatment group (so 10 individuals total) and I'd like to select variants which are the same across all members of the control group, the same across all members of the replicate group, but different between the two groups, can I accomplish that with SelectVariants and JEXL? Or via some even easier method?

I am wondering if it would be okay to use the "SelectVariants" tool to select mutants specific to the tumor sample in a pair of tumor and normal samples. I tried the following command, and it seems working. But I am suspecting whether it differentiates the genotype differences (e.g. 0/1 in tumor sample and 0/0 in normal sample) or coverage differences (e.g. 0/1 in tumor sample and ./. in normal sample).

Could someone help me understand "SelectVariants" for my question? Thanks a lot!

Hi,
I successfully ran a variant calling pipeline and got the raw.vcf file obtained by haplotype caller which contains exactly 4,484,688 records. And then I processed it further by extracting SNPs and INDELs to seperate VCF files using SelectVariants , and it yielded a vcf for raw SNPs with 85,239 records and a vcf for raw INDELs with 14,501 records. As you can see there is a significant decrease of the total number of records. where almost 4 million records were ignored. Could you please explain me what's the process behind all these and why is this happening ?

Would it be possible to add an option to keep the original read depth (DP) of a variant in the INFO field when subsetting multi-sample VCF files using SelectVariants. This is already implemented for AC, AF, and AN values (--keepOriginalAC).

Is there a Walker to select a certain number of random variants from a VCF file? I have checked only the SelectVariants, but the only option similar is --select_random_fraction, that select a fraction of mutations instead of exact number.

I'm trying to get specific variants from a VCF file. I'm using SelectVariants' '--concordance'. This option is outputting any variant in the same genome position without regards for the actual REF and ALTs fields. Is there a way to get the variants matching REF exactly and ALTs at least containing the ALT in the concordance file?

I'm analyzing two exome deep sequencing libraries, one from cancer cells and the other from normal cells.
I have been through the GATK best practices to end with a recalibrated filtered vcf file (my last step was the Variant Recalibration).

Now I would like to keep the variants (variations) found between normal and cancer cells and keep only those which have been retained
according to plots produced during the previous step i.e VQSR.

I'm not sure which parameters I should select in my command line in order to achieve my goal.

I have searched through the documentation on your website, but still not sure...

Is there a way to include only variant sites and no-calls in your final vcf. I know during SNP calls you can only emit variants, or only confident sites or all. However is there a way to reduce your vcf in the end to only variant sites (vsqr passed) and places where no calls could be made. So the end vcfs have only variant sites and missing data - and everything that is not listed in the vcf file is reference. I need such a file for merging with other vcf files - so that every position that is not in the vcfs while merging can be called ref.

So far i have called snps with emit-all and done vsqr - I now want to reduce vcfs in size by excluding NO_VARINATION sites (but want to keep information on "missing" sites)

I just wanted to select variants from a VCF with 42 samples. After 3 hours I got the following Error. How to fix this? please advise. Thanks
I had the same problem when I used "VQSR". How can I fix this problem?

ERROR stack trace

java.lang.NegativeArraySizeException
at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:97)
at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:116)
at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:84)
at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:73)
at net.sf.samtools.util.AbstractIterator.next(AbstractIterator.java:57)
at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:46)
at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:24)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35)
at org.broad.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40)
at org.broad.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:428)
at org.broad.tribble.index.IndexFactory$FeatureIterator.next(IndexFactory.java:390)
at org.broad.tribble.index.IndexFactory.createIndex(IndexFactory.java:288)
at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:278)
at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:388)
at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:274)
at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:211)
at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:140)
at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208)
at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:964)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:758)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:284)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

When I use SelectVariants -fraction option more than once, I always get the same variant sites. The same happens for RandomlySplitVariants. Is there a way of randomly taking a fraction of the vcf that is not always the same, but truly random (which means it will pick different sites each time, of course with a certain probability of including common sites)?

I hope I have not duplicated the question since I did not find solution.

Suppose I have one variant dataset which just includes variants from ONE sample . If i have another outer datasets (not my test dataset), I can produce variants that are unique in my test call dataset by using --discordance argument like this with no problem:

However, If my sample dataset includes one clinical affected sample and three controls, I want produce all the variants of this affected sample that are unique within this dataset ( discordance of this affected sample comparing with three controls), what tools or commends I can use?

I am working on a Queue script that uses the selectVariants walker. Two of the arguments that I am trying to use both use an enumerated type: restrictAllelesTo and selectTypeToInclude. I have tried passing these as strings however I get java type mismatch errors. What is the simplest way to pass these parameters to the selectVariant walker in the qscript?

I was wondering if you could use the toolkit to generate a separate VCF file containing only SNPs that are found at a predetermined chromosome and base pair position. I have a plink file which I want to convert back to VCF format and it seems unbelievably hard to do so I thought this may be a good way to get around that problem?

I am aware that vcftools offers this function with the "--positions " option, however for some reason I am getting far more variants than I listed and there is nothing wrong that is obvious with my listed positions/vcf file.

ERROR stack trace

java.lang.RuntimeException:
Line: chr1 2277268 rs140545544 CCACA C,CCACACA 515.33 PASS . GT:GQ:DP:PL:AD 2/2:5:21:.,.,381,.,5,0:1,12 1/1:45:21:723,45,0,.,.,.:0,15
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:68)
at org.broad.tribble.TribbleIndexedFeatureReader$QueryIterator.readNextRecord(TribbleIndexedFeatureReader.java:342)
at org.broad.tribble.TribbleIndexedFeatureReader$QueryIterator.next(TribbleIndexedFeatureReader.java:297)
at org.broad.tribble.TribbleIndexedFeatureReader$QueryIterator.next(TribbleIndexedFeatureReader.java:261)
at org.broadinstitute.sting.gatk.refdata.utils.FeatureToGATKFeatureIterator.next(FeatureToGATKFeatureIterator.java:60)
at org.broadinstitute.sting.gatk.refdata.utils.FeatureToGATKFeatureIterator.next(FeatureToGATKFeatureIterator.java:42)
at org.broadinstitute.sting.gatk.iterators.PushbackIterator.next(PushbackIterator.java:65)
at org.broadinstitute.sting.gatk.iterators.PushbackIterator.element(PushbackIterator.java:51)
at org.broadinstitute.sting.gatk.refdata.SeekableRODIterator.next(SeekableRODIterator.java:223)
at org.broadinstitute.sting.gatk.refdata.SeekableRODIterator.next(SeekableRODIterator.java:66)
at org.broadinstitute.sting.utils.collections.RODMergingIterator$Element.next(RODMergingIterator.java:72)
at org.broadinstitute.sting.utils.collections.RODMergingIterator.next(RODMergingIterator.java:111)
at org.broadinstitute.sting.gatk.datasources.providers.RodLocusView.next(RodLocusView.java:122)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$MapDataIterator.next(TraverseLociNano.java:173)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$MapDataIterator.next(TraverseLociNano.java:154)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:271)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:145)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:100)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:283)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
Caused by: java.lang.NumberFormatException: For input string: "."
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:481)
at java.lang.Integer.valueOf(Integer.java:582)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeInts(AbstractVCFCodec.java:703)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:664)
at org.broadinstitute.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:114)
at org.broadinstitute.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:131)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:309)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:241)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:220)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:46)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:65)
... 25 more

No error stack trace with version 2.7

I tried to validate vcf file using ValidateVariants too, but getting same error.

Hi Team,
I have a VCF which I'd like to filter by variant frequency. The problem is, my frequencies are percentages rather than decimals. Is there a workaround in JEXL which allows it to parse the '%' operator as a percentage (or ignore it entirely) rather than considering the field a string upon seeing the modulo operator?
The VCF also has two columns in the format column (a normal and a tumor). Is it possible to drill down into these using just the genotypeFilterExpression/genotypeFilterName flags or must do something else?

Is there a best practice for finding compound heterozygotes using GATK? I can easily find recessive pattern variants using the SelectVariants tool, however, I have not been able to find a way to select compound hets. I am sure the Broad has a straightforward way. Is it somehow integrated into the GATK tool set?

I have considered using something like Gemini, but I would prefer to keep tools use to fewer product lines whenever possible.

First post. Thank you for these amazing tools. I have spent two days pulling my hair out, trying all enumerations, searching the documentation and forums, and in the end I come to you for help. Please forgive me if these topics have been covered elsewhere.

I have several VCFs generated by SomaticSniper that I'd like to filter based on the SomaticScore (SSC in the FORMAT field). I was working with VariantFiltration and SelectVariants, and trying to use different options, as well as regular expressions, to select those calls with a SSC over 40. I have been unable to do so. I also looked into trying to figure out JEXL, and using the last command listed on the documentation page, about using the VariantContext feature to drill into an array. I cannot get it to recognize the SSC column of the FORMAT field and then filter for the TUMOR sample.

Using VariantFiltration I was using -select (but I understand now that this searches the INFO field only). I was then using the --genotypeFilterExpression, but it would not add the FT tag to the FORMAT field as it said it would, it would just apply PASS to everything.

Using SelectVariants, I was using -sn to select the TUMOR sample and then using -select_expressions, but I guess this also only works on the INFO field. I had been trying to use --sample_expression which gives the ability to use a regular expression, but then I never had good results; it wouldn't do any filtering, and output the entire input file. Does the regular expression only apply to the sample name, and not the content of each line? Trying to select SSC over 40 from a line like this

I am not a coder, as you can probably see, but I'm trying to get this worked out. This output the entire file still, with SSC values above and below 40.

Looking into use the vc.getGenotype array access, I could not find much documentation about VariantContext; I was looking through the files on github, looking through the code and looking for samples, since the .getAD() from the documentation seems to work, but alas, there is no .getSCC() available. Is using vc. the best way to drill into an array (the FORMAT field) and search for what I want?

I didn't post all the code and output, trying to keep this as short as possible. I can post pastebin outputs if that would be helpful. Thank you, David

Just noticed that SelectVariants produces "ERROR MESSAGE: Invalid argument value '>' at position 10" when I use the '-select' parameter with the syntax given in your third example [-select "QD > 10.0"].
I'm using GATK 2.6-5, java/1.7.0_25.
It worked well without the whitespace in the expression [-select "QD>10.0"]. Hope that's not just me ;-)
Also, does the example miss the line continuation character just before the '-select' option?

As I said in my last post about splitting my 11 samples from the recalibrated VCF file. I now have a different question which is how to set up a criteria to select variants from this 11-sample-combined VCF. My criteria would be DP >= 20 and # of ALT reads >= 10. I know the AD is the sum of both REF and ALT reads, but I was wondering if there's any way to select by the # of ALT and DP >=20?

Should I use the "-T SelectVariants" or "-T VariantFiltration"? I am using GATK 2.5 on a remote Mac OS X server by the way.

I was just wondering if anyone uses GATK's SelectVariants walker to call de novo mutations (Mendelian violations) and, if so, what -mvq cut-off do they use? My data is exome sequencing with a large range of read depths - from mean target coverage of 14X to >50X.

I have a vcf file of indels, and an interval file of single-base positions I am interested in. I would like to select all of the variants in the file that overlap the positions I'm looking at. Using select variants with the interval list I have returns nothing, because the indels are not contained within any intervals. I don't want to just add an arbitrary amount of padding to my intervals because I don't want to include other nearby variants that don't actually overlap my sites.

I hope I'm not making a duplicate question here, but I couldn't find anything regarding this in the forum.

Basically, what I want to do is to use SelectVariants to filter against another call set, but I do not want to be as strict as using -discordance (i.e. 100% discordance rate between the two call sets). I want to say for example: "filter call set A against variants that occur in >90% of call set B".

Hello I would like to subset a VCF file to only save a few specific regions of the whole genome. I know some of your tools allow for an interval list to be used to subset the region analyzed. Do you have a tool or are you aware of a tool that would allow me to quickly do this from an interval list or something similar? I could make a little script myself, but I figure sub setting and printing out a specific genomic region of interest in a VCF file has to be a solved problem by GATK.

I'm trying to use JEXL to filter variants but something isn't working and I can't figure it out. I'm hoping someone can point me in the right direction. My VCF file contains an INFO field 1000g2012Apr_ALL. Some of the variants in my VCF have an entry for this field, some don't. I want to filter my VCF file for entries that are below a certain value or are NULL (empty).

Just thought I should report a possible small bug with SelectVariant --maxIndelSize as I think it's only filtering insertions greater than the given value and not deletions.
Of course using JEXL expressions I can still get the variants I'm after...

After this I still would have to filter out the private 0/0 calls and doing this for a large multisample VCF means entering this command for all the combinations which is not really nice.

Surely this must be possible with GATK. Does anyone know how to do this with GATK.

Maybe it is somewhere in the SelectVariants? The --discordance option looked promissing but there is something about that the samples should be the same? Or is it possible to write another variant walker or a JEXL expression?

I wouldn't classify this as a bug, but thought I would just put this up here in anyways.

I'm creating a vcf file from exome targeted sequencing, using unified genotyper to just call SNVs on one sample, then using SelectVariants instead of VariantFiltration using hard-cutoffs. This is just a intermediate QC file (something that we can use to do a rough TiTv estimate, concordance to GWAS arrays, etc) in order to make the determination as to whether or not this sample was good enough as is to go into the pool of samples to run unified genotyper/haplotyper caller on together and then VQSR.

So when using the expression;

-select 'MQRankSum > -12.5'

any non-reference homozygote that has no reads containing an alternate allele logically won't have this annotation calculated and since this is missing then this record is removed. Nothing major. I plan on redoing it using the VariantFiltration walker instead and the just use the SelectVariants to pull out PASS records, but I though I would put this up in any case. I was trying to think of a way to use a more complex expression to say if GT was 0/1 then MQRankSum, but couldn't wrap around my head for the case were GT was 1/1 and MQRankSum was present and greater than -12.5.

I was using GenomeAnalysisTK-2.2-4-g4a174fb at the time and my command line is below.

Firstly, the SelectVariants documentation describes selecting 1000 random variants from a vcf using '-number 1000', however when I try that (with the command "java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta --variant variants.vcf -o 1000.vcf -number 1000") it produces the error 'Argument with the name number isn't defined'. Trying --number instead doesn't make any difference, while the --help output does not list this argument (GATK 2.2.2). It this option no longer available?

I used the UnifiedGenotyper (GATK 1.6) on a multi-sample set to call variants, and for some of the positions I get multiple mutated alleles. The genotype entries in the combined VCF file look like (GT:AD:DP:GQ:PL):

0/1:94,11,0:124:22.18:22,0,2485,209,2500,2709

0/2:27,0,54:81:99:1651,1726,2695,0,968,836

so it's three AD values per entry. Running SelectVariants yields the following line for the second example from above:

GT:AD:DP:GQ 0/1:27,0,54:81:99

Although it changed the genotype from 0/2 to 0/1, it did not update the AD field. I checked the forums, but I could not really find anything discussing specifically the update of AD, except for the GATK 2.2 release notes where it says SelectVariants: "Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file."

I was wondering whether you could confirm if cases like the one above would benefit from the bugfix, or if the bug description applies to something else.

I'm looking to find all the entries that change between two calls to UG on the same data. I would like to find all the entries where the call in the variant tract are different from those in the comparison track. So in effect I want those entries that would not be result from -using -conc in SelectVariants. From the documentation is is unclear if the -disc option does this:

A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track.

What if the comp is HOM_VAR and the variant track is HET? Or if they are both HET but disagree on the specific allele?

I have completed filtering my SNP data using VariantFiltration, and now I want to use SelectVariants to output all calls marked "PASS" in the FILTER field. I used the following script, but only the header information writes to the output file.

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of QUAL scores that are very high, not a single call has a PASS associated with it in the Filter field- all are "." If I use SelectVaraints to filter the VCF based on high QUAL or DP values, or combination, the Filter field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.

So, I tried using VariantFiltration followed by SelectVariants. The variant filtration seems to work fine adding FT tag to the format field. And then I am trying to get records having FT tag using the following commands

Hi, I wanted to double check my methods for some targeted capture data. I ran 96 samples through UG to produce a multisample VCF. I separated snps and indels into separate files using SelectVariants, and applied filters:

I then went back through with SelectVariants, pulling out each sample one at a time into their own filtered VCF.

My results are... lets say, wrong. I am wondering if it would be better practice to select each sample first and then apply the filters, or if it does not matter and my errors lie elsewhere. Thank you.

I have been trying get variants out of a VCF file where the Allele Frequency (AF) is greater than 4%. I have tried both VariantFiltration and SelectVariants but I get different errors with each. Here is my call for SelectVariants:

java.lang.ArithmeticException: Double coercion: java.util.ArrayList:([0.010, 0.010])
at org.apache.commons.jexl2.JexlArithmetic.toDouble(JexlArithmetic.java:1023)
at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:699)
at org.apache.commons.jexl2.JexlArithmetic.greaterThan(JexlArithmetic.java:790)
at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:796)
at org.apache.commons.jexl2.parser.ASTGTNode.jjtAccept(ASTGTNode.java:18)
at org.apache.commons.jexl2.Interpreter.interpret(Interpreter.java:232)
at org.apache.commons.jexl2.ExpressionImpl.evaluate(ExpressionImpl.java:65)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.evaluateExpression(VariantJEXLContext.java:267)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:233)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:118)
at org.broadinstitute.sting.utils.variantcontext.VariantContextUtils.match(VariantContextUtils.java:293)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.filter(VariantFiltration.java:331)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:270)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:80)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

For both I have tried variations of double quotes and different sigfigs. Also, it works when I select on parameters other than AF.

My current workflow for analysing mouse exome-sequencing (based on v4 of Best Practices) can require me to use slightly different VCFs as --knownSites or --known parameters in BQSR, indel realignment etc. Basically, I have a "master" VCF that I subset using SelectVariants. The choice of subset largely depends on the strain of the mice being sequenced but also on other things such as AF'. It'd be great to be able to do this on-the-fly in conjunction with--known' in tools that required knownSites rather than having to create project-specific (or even tool-specific) VCFs.

Is there a way to do this that I've overlooked? Is this a feature that might be added to GATK?