
Calling variants in RNAseq

Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as examples of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline produces both false positive (wrong variant call) and false negative (missed variant) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article, as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback on their experiences applying our recommendations to their data.

The workflow

1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using the STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:
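As a sketch (the directory path and thread count are placeholders you should adjust for your system), an index build looks like this:

```bash
# create a dedicated directory to hold the genome index
mkdir /path/to/hg19_index

# build the STAR index from the reference FASTA
STAR --runMode genomeGenerate --genomeDir /path/to/hg19_index \
     --genomeFastaFiles hg19.fa --runThreadN <n>
```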

3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads, developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clips any sequences overhanging into the intronic regions.

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means "unknown" and is therefore meaningless to GATK). So we use GATK's ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.
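Putting this step together, a sketch of the SplitNCigarReads command (file names are placeholders; check your GATK version's tool documentation for the exact arguments) is:

```bash
java -jar GenomeAnalysisTK.jar \
     -T SplitNCigarReads \
     -R ref.fasta \
     -I dedupped.bam \
     -o split.bam \
     -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 \
     -U ALLOW_N_CIGAR_READS
```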

4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.

6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it performs much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform "dangling head merging" operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument, which was previously required, is no longer necessary, as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we set the minimum phred-scaled confidence threshold for calling variants to 20, but you can lower this to increase sensitivity if needed.
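As an example, a HaplotypeCaller command incorporating the arguments discussed above might look like this (file names are placeholders):

```bash
java -jar GenomeAnalysisTK.jar \
     -T HaplotypeCaller \
     -R ref.fasta \
     -I split.bam \
     -dontUseSoftClippedBases \
     -stand_call_conf 20.0 \
     -o output.vcf
```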

7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter out clusters of at least 3 SNPs within a window of 35 bases by adding -window 35 -cluster 3 to your command. This filter recommendation is specific to RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).
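Combining the cluster filter and the hard filters above, an example VariantFiltration command (file names are placeholders) might be:

```bash
java -jar GenomeAnalysisTK.jar \
     -T VariantFiltration \
     -R ref.fasta \
     -V input.vcf \
     -window 35 -cluster 3 \
     -filterName FS -filter "FS > 30.0" \
     -filterName QD -filter "QD < 2.0" \
     -o filtered.vcf
```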

Please note that we selected these hard filtering values in an attempt to optimize both sensitivity and specificity together. By applying the hard filters, some real sites will get filtered out. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positive calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of true variants that were filtered out (false negatives). As explained in the text, there is a tradeoff that comes with applying filters:

Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (SplitNCigarReads) cleans up many false positive calls that are caused by splicing inaccuracies of the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionality, increase the true positive rate and minimize the false positive rate, as well as develop statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.

Comments

@mmterpstra: I believe the GATK pipeline should only be used for calling variants and not for anything else. Therefore I think you should process your reads in parallel and apply CollectRnaSeqMetrics to the resulting BAM. I believe that things like MarkDuplicates will modify the results significantly...

Hi @sirian‌
"I actually don't quite understand how splitNcigarReads can remove "intronic overhang" if it is before the N cigar. How does it know it is intronic, without any annotation information"
You are right, the tool does not try to "guess" the splice positions, but uses the ones that were already discovered. In some cases STAR (and this is true for other aligners that we tried) finds the correct splice position for some of the reads, but not for others. In those cases you will see in IGV (see the screenshot we provided in the doc) that some of the reads were correctly split, while others have intronic overhang and lead to FPs in the calling step (unlike your case, where there are no correctly split reads). Those are the cases that are fixed by the SplitNCigarReads tool. We are aware that some of the FPs we still have are due to errors similar to your case; we have some ideas and we hope to fix those in the future.

btw - In your case it might be fixed by adding the known annotations (GENCODE or something like it) as input. We chose not to use them since we prefer to have a pipeline that can work based on the actual data. However, this is just our recommendation; you can try other things and let us know if they work better for you. We would like to see such cases in order to improve our development.

Hope it makes more sense now, let me know if not.

Thanks!
Did you mean using reference annotations in STAR alignment? SplitNcigarReads doesn't seem to have such an option.
I'm considering removing any variants that fall outside of exons based on RefSeq annotation, because theoretically RNA-seq reads should be from exons only (if not consider immature mRNA).

@sirin - yes, I meant using reference annotations in the STAR alignment (although it might be useful to have an option for that input in SplitNCigarReads, thanks for the idea).
The problem is that you will limit yourself to the known annotations, which, based on our discussions with other people in the field (though I never checked it myself), cover only about half of the real exons; thus you will throw away all the un-annotated exons and any novel splice junctions that your data might have.

I actually tried redoing the alignment with the Ensembl annotation, but that variant still exists. That position is in an exon according to Ensembl, but not according to RefSeq. It would be more stringent to use RefSeq, but I'm worried it may be too stringent.
Anyway, glad to know you will improve GATK for this issue and I'm looking forward to it.
Thanks.

Hi,
I'm wondering, does GATK support or plan to support calling variants between two BAM files, for example comparing one RNAseq dataset with paired DNAseq data to identify RNA editing events? Ideally it would also support biological replicates.
Thanks for the nice tool again.

@s6juncheng‌ We do plan to provide such an option/tool. We are already doing something like that with allele-specific expression (I'm currently working on that) and collaborating with a few groups to do the same with RNA editing. When we have tools that are ready to be used by the community we will be glad to share them, as we always do.

Hello GATK team. I was trying to run SplitNCigarReads on some RNA-seq data aligned using the PRADA pipeline (BWA against a genome+transcriptome reference, then all reads converted back to genome coordinates). I tried using the "--refactor_NDN_cigar_string" option from the recent "vnightly-2014-09-12-gb0687a8" build (due to the presence of "1N1D1N"-like reads), but received this error before the walker began traversal of any reads:

##### ERROR stack trace
java.lang.UnsupportedOperationException
at java.util.AbstractList.add(AbstractList.java:148)
at java.util.AbstractList.add(AbstractList.java:108)
at org.broadinstitute.gatk.tools.walkers.rnaseq.SplitNCigarReads.initialize(SplitNCigarReads.java:150)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

We do not consider the known RNA editing sites as FPs, so we do not filter them out. In fact, one application of our pipeline is to find RNA editing events. One can filter out the known RNA editing sites (using -XL, for example, or other ways that @Geraldine_VdAuwera‌ can suggest) if you have them as an interval file or VCF file. However, I can't think of a way to prevent the haplotype assembly from using those sites in the current pipeline.

For the variant calls at splice sites, we believe that the previous studies that used older tools had many FPs in those locations due to the limitations of the aligners and tools. Currently we are not filtering those types of calls, since the HC and the alignment pipeline are able to do a much better job in those regions, and for some applications it is important to be able to call such sites. Also, in this case there is no easy way to ignore these sites in the haplotype assembly step; however, if it turns out to improve the output of the pipeline, we might include such an option in a future version of the tools. If you see cases of FP calls that are due to this reason, please share them with us so we can make our pipeline even better.

Also, can you tell me a little bit about the purpose of RNAseq variant calling in your project? I try to learn more about the use cases of our pipeline so we can have them in mind while developing new tools and recommendations.

Hello @ami,
Thanks for the fascinating pipeline. I have some questions about the pipeline.

In the DNAseq pipeline, there is a "joint genotyping" step which merges all the GVCF files from all samples. For RNAseq it seems we should run each sample separately? And that leads to the next question:

How do we deal with lanes and samples in RNAseq? The same as with DNAseq (get recalibrated files separately --> merge lanes of the same sample --> realign)? Or should we run through each pair of fastq files separately?

And for an organism without dbSNP, in the base quality recalibration step, should we do the same as with DNAseq (hard-filter high-quality results to use in place of dbSNP --> run BQSR)?

Thank you for your prompt and extensive answer. Let me first start by briefly explaining our project. We recently performed eQTL and ASE using only genotypes derived from public RNA-seq samples (http://biorxiv.org/lookup/doi/10.1101/007633). We are here only interested in the genomic variation, since these contribute to eQTL and ASE effects. Since we started that project the number of available samples has doubled, and in the near future we want to include these samples as well. In contrast to our current work, we would then also like to discover novel variants instead of only calling known variants. For this reason I also asked if GVCF mode will be made available for RNA-seq as well; Geraldine told me that work on this is in progress.

Great that you now have higher accuracy around the splice sites; I agree that these are biologically very interesting to call. I'm still a bit worried about the RNA editing sites though. In essence you are dealing with a non-diploid region at the RNA level: the 2 real copies from the DNA plus 1 or 2 haplotypes where the editing event took place. As long as this does not adversely affect the genotyping for nearby sites I'm happy; then I can simply exclude the editing sites at a later time. It would be a pity though if the nearby accuracy drops due to an editing event; is this something that you assessed?

We would by the way be happy to beta test the GVCF method in combination with RNA-seq genotyping when the software is ready for this. We are using the Geuvadis RNA-seq data as a benchmarking set since these samples are also part of the 1000G project.

@ami may respond in more detail, but FYI it is already technically possible to emit GVCFs for RNAseq data, if you're interested in testing this. You just combine the specific arguments of both modes. We'd love to hear whether this produces meaningful results for you and what are the chief error modes you observe. By the way, if you like living on the cutting edge, you can try the in-development versions which are made available every night; see nightly builds in the downloads section.

I get a similar error as @egeulgen: a few warning lines for each sample, "Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD". I just copied and pasted the command line in this tutorial, only changing the file name to my own VCF file. The BAM file was aligned by MapSplice2; instead of outputting quality score > 20, I output all of the variant regions with qual > 0.

@GooderPanQi‌, we never tried that ourselves, but there is no reason why it should not work. You should probably follow the DNA pipeline, using GVCFs, with the specific parameters for RNA. We were just discussing it with our friends at Mount Sinai, and they will also try it.
Just to be clear - we do not have experience with that, which is the reason it is not in the recommendations, but we will be happy to get feedback from you and other users who are willing to try it, and hopefully, based on that, we can update the recommendations at some point.

@ami, we ran the "variant calling" step, but something went wrong: "Invalid command line: The parameter recoverDanglingHeads is deprecated. This argument is deprecated since version 3.3". Does GATK version 3.3 no longer support the parameter "recoverDanglingHeads"?

Hi @ami and @Geraldine,
First of all, thanks for this excellent workflow! Easy to follow and very descriptive!
I do have, however, some remarks concerning read groups and mapping quality. You wrote in the workflow that these need to be adjusted after running STAR. Are you aware that this can be done already within STAR? With the parameter "--outSAMmapqUnique" you can change the mapping quality assigned to unique mappings from the default 255 to 60. With the "--outSAMattrRGline" parameter you can add a read group line to the SAM header. I further think it's useful to mention that STAR can directly output mappings in a coordinate-sorted BAM file using "--outSAMtype BAM SortedByCoordinate". If you prefer Picard sorting, you could at least start with an unsorted BAM using "--outSAMtype BAM Unsorted".
Just to let you know.
Best,
c
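For reference, the options mentioned above could be combined in a single STAR invocation along these lines (an untested sketch; sample names, paths, and thread count are placeholders, and exact option behavior may vary by STAR version):

```bash
STAR --genomeDir /path/to/hg19_index \
     --readFilesIn reads_1.fq reads_2.fq \
     --outSAMmapqUnique 60 \
     --outSAMattrRGline ID:sample1 SM:sample1 PL:ILLUMINA \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN <n>
```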

@corlagon Thanks! Are these capabilities in the latest version of STAR? We know Alex has been making a lot of upgrades, and we haven't yet updated this doc to fit, so I wouldn't be surprised if that was the reason for these redundancies.

@corlagon‌ - thanks, we are aware of all those changes, since we do touch base with Alex. I assume that in the next versions of the pipeline some or all of those changes will be included (in fact Alex added some of those options after our joint discussions).
Currently we are focusing on using the existing pipeline for several applications, and we plan to revisit the pipeline (and improve it further) after we get more experience with several of its main use cases.

Please continue to give us feedback and comments, to help us make sure we provide the best recommendations.

Hi @ami and @Geraldine, thanks for this excellent workflow. Since we used STAR to align the raw RNA reads, we found a problem: the number of records in the aligned BAM (19291348, 'samtools flagstat' was used) from STAR is higher than the number of raw sequencing reads (18698820, wc -l fq). So how should we compute the mapped rate for RNAseq with STAR?

New member here. I was wondering whether you would consider adding your "dangling-head" functionality to the Unified Genotyper as well. I've decided to use UG for RNAseq variant calling because my samples are heterogeneous primary tumours which likely contain a variable and unpredictable number of haplotypes in a given region, and so I didn't think HC was appropriate. I realize GATK does not support variant calling for tumour samples, but other tools (eg. MuTect) do not support RNAseq, so I tried to make the best of it.

It turns out that this RNAseq pipeline performs quite well for my samples, in that I get strong concordance between calls made from RNAseq and whole exome for the same individual (~90%), given adequate coverage of the variant. From manual inspection of my alignments, it seems that many of my false positive calls from RNAseq are caused by these "dangling heads" misaligned to introns. Something like "--recoverDanglingHeads" in UG would clean a lot of this up, I think. Just a suggestion.

@santayana‌ can you upload an IGV screenshot of such an example? I don't think you are thinking about "dangling heads" in the same way that we are, but I want to be sure.
[I will let @Geraldine_VdAuwera‌ reply about any future development of UG]

Thanks for the quick reply Ami. The top window is whole exome and the bottom window is RNAseq from the same individual. It seems that the variant is only present in reads that only extend a few bases into the intron, suggesting to me that this a misalignment.

Unfortunately it is impossible to add this functionality to the UnifiedGenotyper, since it does not perform any reassembly or realignment of the reads. In any case, the UnifiedGenotyper is now permanently deprecated, which means that we will no longer make any fixes or improvements to it. Going forward, we are only putting development effort into HaplotypeCaller.

For the record, the HaplotypeCaller is neither more nor less appropriate for tumor samples than UnifiedGenotyper, but at least it does handle RNAseq properly, so I would recommend you use HC rather than UG. In future we may be able to provide support for tumor RNAseq, but that's still a long way down the road.

Thank you for your reply. It may be a moot point now, but please let me follow up your comment about HC vs UG for tumour samples, since this is a point of confusion for my colleagues and me. UG seems to me to be more suitable than HC for genetically heterogeneous tumour samples for the following reason: Say a tumour has two subclones: one contains a heterozygous mutation A, and one contains a heterozygous mutation B. These mutations modify adjacent bases, but they are mutually exclusive: no cell (and no haplotype) contains both A and B mutations. As I understand it, HC calls the two likeliest haplotypes, which in this case would be 1) wild type for both A and B, and 2) mutant for one of A or B (whichever is more abundant). In other words, it is bound to miss one of the mutations, because it cannot handle more than two haplotypes. UG, on the other hand, considers each mutation independently, so, as long as the heterozygous mutant genotype has higher likelihood than wild type, the mutation will be called, even if the mutant allele frequency deviates strongly away from 0.5. Hence, A and B are likely to be called by UG. In my tests, heterozygous mutants can be called with minor allele frequencies as low as 0.1, presumably because a heterozygous genotype is still more likely than a wild type genotype when sufficient depth is present.

Hi @santayana‌
1) Since it is an alignment issue, I just wanted to make sure you are using 2-pass STAR and using the splice junctions from the first round in the second round. Do you?
2) That type of problem is supposed to be fixed by SplitNCigarReads, and not by "dangling heads", which fixes missed true variants at the beginning of exons. If you do use SplitNCigarReads in your pipeline, it might still not fix that specific error, since the default number of mismatches used to decide whether to clip the overhang is 2. You might change it to one, see how many real events and how many FPs you remove, and decide based on that tradeoff whether to use it.
3) I don't think you are right regarding your interpretation of how the HC will handle your suggested case. If there is no read to support the A+B haplotype, it won't be called as one event even though the bases are consecutive. Let me bring @valentin, who is actively developing the HC, into the discussion.
4) As Geraldine suggested, the multi-ploidy option of HC can solve many of the more complex cases that might give an advantage to a single-locus caller; we are currently testing it with RNA editing calling, and hopefully will publish even better recommendations for such cases in the near future.

1) Yes, I am following the pipeline that you have set out here, with the only exception of substituting UG for HC.
2) Ah OK, I understand now this is a SplitNCigarReads issue. I'll try changing the default option - thanks!
3) That's my point: one of A or B won't be called in the sample even though they are both present.
4) I suppose I could experiment with the -ploidy option, although I imagine the intent of this option is to handle non-diploid organisms, and not the issue of clonal heterogeneity of tumours discussed here (although broad and focal copy number changes are common in cancer, which makes calculating the number of haplotypes even more complicated). The extent of tumour heterogeneity is an active area of research, but single cell sequencing data suggests it is very extensive. If I set -ploidy > 10, wouldn't that slow things down a lot?

By BLATting the sequence, the called indel allele here is just the intron (starting from the 3rd nt in the allele sequence). You can see nearly all reads got correctly spliced by the mapping and clipped by SplitNCigarReads. I don't understand how HC could call it as an indel. And nearly all alignments in the exon region end with TC, which is exactly the end of that exon. So even if the intron was mis-called as an indel, the allele should be TC instead of T. However, when HC called it as an indel, it even ignored the last C.

This depends on the aligner identifying the splice junction correctly. If it doesn't, then the reads won't be mapped with the N cigar elements that the split & trim tool needs to do its job. Did you use STAR or something else?

Yes, I used STAR. It must have identified the correct splice junction. The alignment plot I showed above is the input for HC; it's after split & trim. You can see almost all reads get clipped at the correct junction. I believe the mapping and splitting/trimming did the right job here, but HC made a wrong call. I don't know how that can be.

The call metrics in your VCF record suggest that the HC sees a majority of reads spanning the exon, with a deletion. Maybe the reads are getting remapped unexpectedly. Try having HC generate the output bam (see tech doc for argument details) as that will show you what it thinks the region looks like after reassembly.

hi GATK dev.,
I am using the GATK Best Practices for variant calling from RNA-seq, and I was wondering if I could specify the minimum read depth in HaplotypeCaller. I went through the documentation but couldn't find anything similar.

RNA-seq data, by design, would have unequal depth across exons and I think a parameter that takes the expression value for the gene before deciding read-depth threshold would be very useful.

Is it possible to specify depth in any way, gene-wise or as one value, in HC?

@Sheila
Hi,
I appended read depths to the variants called by HC, and the minimum for the variant allele is 1 (filtered depth, of course). So I am a little relieved that HC doesn't blindly refuse to call at low depth.
What started my inquiry was the failure of HC to call variants where the gene was expressed at a low level, like 2 reads (after dedup.) with 1 WT and 1 mutant. I know that it's there because for the same sample I have evidence from whole exome.
So, I was wondering if the thresholds (whatever HC uses) could be tweaked for a list of positions (those from whole exome, say) to comprehensively identify somatic mutations that are expressed in RNA.
This would be very helpful in prioritizing antigenic somatic mutations with more potential (seen in RNA => more potential).

Just a thought - @Sheila, do you think @amitm can use GGA mode? We never tested it with RNA-seq, but in theory it should work and might be able to correctly call those positions. I am also not sure whether, in this case, the HC will ignore any minimum threshold it might have.

@ami Unfortunately GGA mode is buggy in HC. (We've had an improvement request to fix it for a while, but it's just very low priority.)

Very low read depth is always going to be challenging to deal with. What you could do to improve sensitivity to this sort of case would be to do a separate run in which you tell the tools to ignore duplication flags (which is the equivalent of not marking duplicates -- there is an option to do this in the latest nightly build). Then you can evaluate and merge the results intelligently. You'd want to give priority to the calls made on the dedupped version, but include calls made only on the non-dedupped data at sites where DP is low in the dedupped data, in case they can rescue real variants that were overlooked due to DP in the dedupped data. Does that make sense?

@Geraldine_VdAuwera
Thank you for the suggestion. I will try it, though it will take a lot of hand-holding for the data. But this should be able to pull out the calls affected by DP. I have used this approach on targeted MiSeq data for low-coverage regions.
I notice that there is an -L parameter in HC. Can it accept single base positions, rather than intervals, for restricting SNV calling in RNA-seq only to positions identified in whole-exome (or any type of DNA-seq from a paired sample)?
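For what it's worth, -L does accept single-base positions: an interval written as chr:pos with no end coordinate covers one base, and you can list one such interval per line in a .intervals file. A minimal sketch with made-up positions:

```shell
# One interval per line; chr:pos with no end coordinate is a single base
printf 'chr1:12345\nchr2:67890\n' > exome_sites.intervals

# Then restrict calling with: -L exome_sites.intervals
cat exome_sites.intervals
```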

ERROR MESSAGE: The platform (platform) associated with read group GATKSAMReadGroupRecord @RG:id is not a recognized platform. Allowable options are ILLUMINA,SLX,SOLEXA,SOLID,454,LS454,COMPLETE,PACBIO,IONTORRENT,CAPILLARY,HELICOS,UNKNOWN

I believe this to be directly tied to the directive detailed below that preceded the 2-pass STAR alignment:

@brohawndg You can run AddOrReplaceReadGroups directly on the bam file. You'll need to replace the stand-in values in the command by actual values, e.g. you'd write RGPL=ILLUMINA if you are running on ILLUMINA data, instead of RGPL=platform. Sorry if that wasn't clear in the doc.
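For example, a full command might look like this (Picard syntax; all the RG values here are placeholders to replace with your own):

```shell
# Add a read group to an existing BAM; every RG* value below is a
# placeholder -- substitute your actual IDs, library, and sample names
java -jar picard.jar AddOrReplaceReadGroups \
    I=star_aligned.bam \
    O=star_aligned.rg.bam \
    RGID=FlowCell1.Lane1 \
    RGLB=lib1 \
    RGPL=ILLUMINA \
    RGPU=FlowCell1.Lane1.Unit1 \
    RGSM=sample1
```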

Hi,
The organism I work on doesn't have a good reference genome, so I have to rely entirely on a de novo assembled transcriptome. In that case, can I use the STAR aligner, or can I make use of the sorted BAM that was generated using Bowtie2 and samtools? Kindly guide me.

First I'd like to apologise for not using the correct terminology here, as I'm an engineer, not a biologist.
I'm following the above pipeline, trying to find variants between the RNAseq of two "groups" of plants.
For each group, three "samples" were sequenced, so I have six BAM files: G1S1, G1S2, G1S3, G2S1, G2S2, G2S3

How do I find variants between the two groups?
Variants between the groups and the reference are not the goal.

We would like to compare the transcriptome sequence of two genotypes: wild-type and mutant.
Each genotype has three technical replicates.
We've already aligned the reads to a reference genome, following the above protocol, but we are uncertain about the variant calling step.

That said, your original description made enough sense already. You'll want to call variants the same way as recommended for pretty much all experimental designs (see the HaplotypeCaller GVCF workflow). Once you have joint calls for all samples together, it'll be a matter of subsetting variants based on e.g. whether they are unique or shared in wild-type vs. mutant (depending on what you're looking for).
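A rough sketch of that GVCF workflow using the six BAM names from above, assuming GATK 3.x syntax (the reference and output file names are placeholders):

```shell
# Step 1: per-sample calling in GVCF mode (repeat for all six BAMs)
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R ref.fasta -I G1S1.bam \
    -ERC GVCF -o G1S1.g.vcf

# Step 2: joint genotyping across all samples at once
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R ref.fasta \
    -V G1S1.g.vcf -V G1S2.g.vcf -V G1S3.g.vcf \
    -V G2S1.g.vcf -V G2S2.g.vcf -V G2S3.g.vcf \
    -o joint.vcf
```

Once joint.vcf exists, group-specific variants can be extracted by comparing the six genotype columns.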

Sorry for the late response. You are looking at the correct workflow.
Your command is fine. The only thing is that -stand_call_conf 20.0 -stand_emit_conf 20.0 will not be taken into account: the GVCF will output all potential variants regardless of quality score. You can set your thresholds in GenotypeGVCFs instead.

Hi,
Currently I'm working on calling variants from RNA-seq data. Following this workflow, I got VCF files for my sequencing data. But there are no '0/0' (wild-type genotype) calls in my VCF file. I wonder if this is correct. Is it because I did the "calling variants" step for only one sample? Is there a way to see wild-type alleles?
Thanks
Hongji

Hi - I'm following this pipeline on some RNA samples. Everything seems normal until I get to mapping against the generated splice junction reference with STAR. The final STAR alignments are 99% unmapped. The STAR log says:
% of reads unmapped: too short | 99.11%
The final STAR output SAM file is 300 MB, down from 30 GB.
Anyone else experience this ?

Could you please confirm what kind of non-ref allele HC reports? I mean the case where the allele at the RNA level (the expressed one) is not the one at the DNA level. In the final reported result, would it be the reference allele plus the non-ref allele as observed in the data, i.e. the expressed one?

The <NON_REF> allele is representative of any alleles at the site that are not the reference allele or the called alternate allele. Basically, it is any allele that is present at the site but does not have enough evidence to be called as an alternate allele.

For example, let's say you have 20 reads with good qualities covering a site, and the reference at the site is A. 10 reads have an A, 9 reads have a T, and 1 read has a G. The reference will be A, the alternate will be T, and NON-REF will represent the G.
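In GVCF output, that example site would look roughly like this (an illustrative record, not real tool output; the AD field lists the depths for A, T, and <NON_REF> respectively):

```
#CHROM  POS    ID  REF  ALT          ...  FORMAT  sample1
chr1    12345  .   A    T,<NON_REF>  ...  GT:AD   0/1:10,9,1
```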

I am trying to run this pipeline to call variants in my RNAseq data. I got my SAM files from the STAR aligner. However, when I try to run AddOrReplaceReadGroups using Picard, I get the following error message:
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File checK_nobT.bam; Line 1
Line: NB500921:39:HK5VYBGXX:1:11309:16590:5060 0 Supercontig_2.9 702412 255 151M1S * 0 0 CAGCAAGCTCGATCATCACCAAAACCCGCTCCAAAGACCACCACCACCACCCCAACCGTAAATTCTTCACCATGCACCTTGGACTCTACCTTGCTTGGGCTCTGGTTGTACTCCACTCTGTGGTAGCTCACCCGCAGTGCTACGAACCCTGN AAAAAEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEAAEEEEEAEEEEEEEEEEEEEEEEEEAEEEAEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEEEEEEEEEEEAAEAA<EEEEAEEEAAA<EAAEA
at htsjdk.samtools.SAMLineParser.reportErrorParsingLine(SAMLineParser.java:439)
at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:324)
at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:248)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:236)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:212)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:124)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Then I thought of using samtools to convert the SAM to BAM and running Picard AddOrReplaceReadGroups on the BAM file, but I got the same error message as above. I am confused about how the error persists after SAM-to-BAM conversion. Also, I noticed an extra 'N' inserted at the end of the sequence shown in the error message.

Then your file is unusable if it's not even readable by that tool. Perhaps something went wrong during your alignment job. I suggest redoing it from scratch -- and make sure you use the most recent version of STAR.
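A quick way to hunt for such corrupted records in a text SAM file is to compare the SEQ and QUAL lengths line by line. A sketch using awk (in SAM, column 10 is SEQ and column 11 is QUAL; aligned.sam is a placeholder name):

```shell
# Print the line number and read name of any record where SEQ and QUAL
# lengths differ, skipping header lines (which start with '@')
awk '!/^@/ && length($10) != length($11) { print NR": "$1 }' aligned.sam
```

Any line this prints is malformed and will make htsjdk-based tools like Picard fail with the length(QUAL) != length(SEQ) error.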

I got a suggestion to use the tail command on my aligned SAM file and found out that the SAM file was corrupted. Now I am redoing the alignment from the start. However, I was wondering if I could use the options --outSAMattrRGline to add read groups and --outSAMtype BAM SortedByCoordinate to get the final output as a sorted BAM. Will this affect the downstream processes?
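To my knowledge these two STAR options can be combined; a sketch with placeholder paths and read-group values (check the STAR manual for your version before running):

```shell
# Align, emit a coordinate-sorted BAM directly, and attach a read group
# (index path, FASTQ names, and all RG values are placeholders)
STAR --runThreadN 8 \
     --genomeDir ./star_index \
     --readFilesIn sample_R1.fastq sample_R2.fastq \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattrRGline ID:sample1 SM:sample1 PL:ILLUMINA LB:lib1 \
     --outFileNamePrefix sample1.
```

This would save the separate AddOrReplaceReadGroups and sort steps, as long as the resulting BAM header carries the read group the downstream GATK tools expect.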

Also, does the file size of the FASTQ affect the alignment? I have 23 samples, and the FASTQ file sizes range from 12 GB to 30 GB.

I don't completely understand why you remove duplicate reads from RNAseq data. I know that for exome-seq this is done to remove PCR amplification artifacts. In RNA-seq, the high number of duplicate reads is often considered a measure of RNA expression, so I guess that for SNVs this could reflect the number of clones with the mutation. Removing duplicates might affect the real frequency of the mutation. Please could you give feedback on this observation? Many thanks

For the specific pipeline of variant discovery, we remove the duplicate reads for the same reason that you mentioned: to remove sequencing errors that might be duplicated in the PCR process and thus be counted as real variants. The assumption is that if there is a real variant in the RNA, you will have enough reads that are not exact duplicates supporting that variant. I would not believe any variant whose support comes only from duplicate reads. You are right that if we are interested in the exact frequency of the alleles (and not just in the HET vs HOM state) we might want to keep the duplicated reads, but I would only do that in a second pass: after discovering all the HET sites, call again on the specific sites of interest, as we and others do in allele-specific expression pipelines.

At this time the RNAseq pipeline is only meant for single-sample processing, yes. We have not finished testing how the tools perform on RNAseq for multiple samples.

Lanes and samples should be treated the same way as for DNAseq.

...

Has the limitation to analyzing single samples been resolved yet?

Also, when analyzing a sample that has FASTQs from multiple lanes, you recommended that we treat them the same way that DNAseq samples would be treated. However, is it really necessary to re-run the Split'N'Trim step again AFTER we've merged the BAMs from the different lanes? I can understand why one would want to re-run the dedup, realignment and recalibration steps, but the Split'N'Trim step seems unnecessary, and is throwing errors for a few of my samples, i.e. it's complaining about a read that has a start position that occurs after its stop position ("The stop position 90336265 is less than start 90336266 in contig 15"). Unfortunately, I don't have access to all the intermediate BAM files to determine the step that introduced that error, but can re-run them if necessary.

If not, what are the recommended steps for dealing with samples that have reads from multiple lanes?

If the suggestion is still true (to only apply it once, on the merged BAM), I'd suggest that you add a clarification, as it currently appears that you recommend that users run the full pipeline on the lane-specific BAMs as well as the merged BAM - hence running the Split'N'Trim step twice.

This is a misunderstanding; apologies if our docs weren't clear. We certainly don't recommend running all the preprocessing steps twice. The only step that you would run twice if you have multiple lanes per sample is marking duplicates. We'll clarify.

But no, we haven't yet validated multi-sample analysis of RNAseq. There isn't any obvious technical obstacle, we just haven't tested it yet so we can't guarantee results.

@Geraldine_VdAuwera said:
This is a misunderstanding; apologies if our docs weren't clear. We certainly don't recommend running all the preprocessing steps twice. The only step that you would run twice if you have multiple lanes per sample is marking duplicates. We'll clarify.

But no, we haven't yet validated multi-sample analysis of RNAseq. There isn't any obvious technical obstacle, we just haven't tested it yet so we can't guarantee results.

Hmmm..... well, I just got an answer about when to run the Split'N'Trim step. GATK v3.5 will throw an error if you try to apply any of the subsequent analytical steps without having run the Split'N'Trim step first, i.e. the RealignerTargetCreator walker threw an error before I could run any of the subsequent steps.

With respect to the pre-processing steps, are you absolutely sure that we only need to re-run the dedup step? According to article 3060 (https://www.broadinstitute.org/gatk/guide/article?id=3060), it clearly states that we should re-run the dedup, realignment & base recalibration steps:

Per-sample pre-processing
At this point you perform an extra round of marking duplicates. This eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).

The example data becomes:

sample1.merged.dedup.bam
sample2.merged.dedup.bam
Then you run indel realignment and base recalibration (BQSR).

Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

Base recalibration will be applied per-lane (as it should be) if you assigned appropriate read group information in your data.

[...] we run the initial steps of the pre-processing workflow (mapping and sorting) separately on each individual read group, then we merge the data to produce a single BAM file for each sample (aggregation). Then we re-run Mark Duplicates, run Indel Realignment and run Base Recalibration on the aggregated per-sample BAM files.

This is further detailed in bullet point format that boils down to:

run each [per-lane] file through mapping, sorting and marking duplicates (dedup)

merge and run mark duplicates per-sample (actually this can be done in a single step by running MarkDups on all lane files at once for a sample)

then you run indel realignment and base recalibration per-sample

It does not state that you should do realignment and BQSR in the per-lane processing. That used to be the case (a couple of years ago), but no longer.

If you're running on RNAseq, you simply insert the Split'N'Trim step after the second round of MarkDuplicates.
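For reference, the Split'N'Trim step at that point is the SplitNCigarReads tool; in GATK 3.x syntax it looks roughly like this (file names are placeholders):

```shell
# Split reads spanning splice junctions and trim overhangs, reassigning
# STAR's MAPQ 255 to 60 so downstream GATK tools accept the reads
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads \
    -R ref.fasta \
    -I sample.merged.dedup.bam \
    -o sample.merged.dedup.split.bam \
    -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 \
    -U ALLOW_N_CIGAR_READS
```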

@Geraldine_VdAuwera said:
...
It does not state that you should do realignment and BQSR in the per-lane processing. That used to be the case (a couple of years ago), but no longer.

If you're running on RNAseq, you simply insert the Split'N'Trim step after the second round of MarkDuplicates.

Ok, apologies for my misunderstanding! In that case, will it pose a problem that I ran the Split'N'Trim, realignment & BQSR steps on each lane prior to merging? I realize that it wasn't necessary, but will it cause issues with the downstream analysis, i.e. could the data somehow be over-normalized and/or over-pre-processed?

No, it shouldn't hurt. The second Split'N'Trim will do nothing, the second realignment will only "complete" the realignment process started per-lane, and BQSR should not make any significant changes if the recalibration worked well the first time around (you can check the recalibration plots: the blue and pink dots should be in mostly the same places).

I performed this pipeline and the results in the vcf generally match what I'm seeing in the STAR-aligned RNA-seq .bam file via IGV, which is fantastic. The STAR alignment produced a relatively high mapping rate (>80%).

However, I'm noticing a huge loss of reads after HaplotypeCaller (please see below) in the majority of samples, some even as high as 95-98% filtered out. Is this 'normal' behavior for GATK RNA-seq variant calling, and how would you recommend I troubleshoot this if not? Notably, this is an analysis of a capture-based RNA-seq library (Illumina Access, ~20K genes).

Yes @npriedig, I have the exact same issue (but with even worse numbers).
I am wondering if this is due to a recent update of STAR or Picard that leads to more stringent rules for duplicate marking or mapping quality.
I don't think this error was reported before, and I did not see @ami, @Geraldine or @Sheila mentioning this problem with the pipeline.