Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?

Did we give you a GitHub issue to track?

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community

Attention:
The frontline support team will be slow on the forum because we are occupied with the GATK Workshop on March 21st and 22nd 2019. We will be back and more available to answer questions on the forum on March 25th 2019.

Comments

First, thanks a lot for this document. I am testing it and will probably report on it soon.
In the meantime, I have a question about the alignment part. In STAR, it is possible to build the index with a set of canonical splicing junctions, in addition to the reference sequence (as shown in the second pass of the alignment). Has it been tried to do the first-pass with a known annotation (RefSeq or Gencode, for instance). Does it decrease the performances?
Also, a typo in the command for the generation of the index, in 2ndPass-STAR: the parameter genomeFastaFiles need a double-dash:
--genomeFastaFiles

Hi @NicolasRobine‌, thanks for pointing out the error in the STAR command.
We did not use a known annotation file as you suggest since we would like to have a pipeline that is not dependent on such annotations (that are constantly being updated and yet are probably still missing many splice junctions).
One can use it and I I would assume it won't decrease the performance. If you do it, it will be great to get a feedback about it.

at net.sf.picard.sam.SamFileHeaderMerger.mergeProgramGroups(SamFileHeaderMerger.java:321)
at net.sf.picard.sam.SamFileHeaderMerger.<init>(SamFileHeaderMerger.java:170)
at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource$SAMReaders.<init>(SAMDataSource.java:885)
at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource$SAMResourcePool.createNewResource(SAMDataSource.java:795)
at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource$SAMResourcePool.getAvailableReaders(SAMDataSource.java:766)
at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource.<init>(SAMDataSource.java:277)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.createReadsDataSource(GenomeAnalysisEngine.java:872)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:761)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:284)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):

I think the latest version of Picard no longer produces this issue (or there's an option to disable writing PG tags). To work with these files, you'll need to reheader them in order to get GATK to accept them as input.

I had a comment regarding SplitNCigarReads and how it assigns the same QNAME for the split reads. In the SAM spec it says that "In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given." However, SplitNCigarReads doesn't change anything else in the reads except the CIGAR and start position. In other words, the FLAG isn't changed in any way to reflect the split reads. I wonder if this is intentional?

The BAM file produced by SplitNCigarReads is only intended to be used for variant calling by the HaplotypeCaller. So that information is not meant to be usable by other programs, if that's what you are concerned about.

You can if you want. We found that we got best results overall (including indel calls) with the STAR 2-pass method. But you can choose to use something different if you prefer. It's your data, your analysis

Hi Geraldine, I am currently running GATK for my RNAseq data set (after running TopHat with bowtie2). It is great that you have integrated a tool that takes care of reads with Ncigar. I used to filter out all spliced reads and run GATK. I was wondering if it would be possible to keep rest of the read (somehow, may be in separate intermediate file) and make use of them as well, rather than throwing them away? I know I want too much but just an idea.
My second question; would it hurt if we use known variation databases for VQSR in case of RNAseq (human)? I am testing both VQSR and hard filtering but wondering the side effects?

@zzq : Just to give more info to @Geraldine_VdAuwera‌ answer - if you care only about SNPs, Tophat will perform only slightly worse than star (** based on our data ** ). Probably most of the differences will of variants that are close to the splicing positions and/or close to Indels. If you do care about Indels as well, star will perform much better (when using the default parameters of both aligners).
As we try to get more familiar with the data that our users have - can you please also mention what type of data you have?

I was wondering if it would be possible to keep rest of the read (somehow, may be in separate intermediate file) and make use of them as well, rather than throwing them away?

I'm not sure exactly what you mean, can you please clarify? With the new tool nothing really gets thrown away; the reads with Ns are split but all non-N bases are used, for the most part.

would it hurt if we use known variation databases for VQSR in case of RNAseq (human)?

This is something we're currently testing; @ami may be able to tell you more about the very latest findings, but my understanding is that VQSR should work. There may just be some tweaks required to get the best results.

Since the pipeline is currently recommended for single sample (as we did not test it yet with many samples), we do not have recommendations or enough experience with VQSR for that pipeline (you need to see enough data in order to use VQSR, which is not the currently the case with the RNA-seq pipeline). We are planning to use VQSR in the future versions of the best practices pipelines.

As we try to get more familiar with the data that our users have - can you please also mention what type of data you have? Do you have data from many samples (enough for VQSR)?

@Geraldine_VdAuwera‌ What I meant by Ncigar was the reads that are represented with an "N" notation in cigar in their bam/sam files, which basically shows that they are spliced reads (specific to RNAseq). I did not mean the reads with Ns (ambiguous bases). As far as I understand from the explanations of the tool above ("3. Split'N'Trim and reassign mapping qualities"), the reads that carry N in their cigar stings are trimmed off of their intronic site. This way tool gets rid of possible miss-matches. Please correct me if I am wrong. What I suggested to keep was the part of the read that is trimmed off.

@ami‌ My data is strand specific total RNAseq and I have ~40 samples. I ran GATK on individual samples but also planning to run it on all at the same time even though I don't know up to which level it will effect the calls.

Ah, I see -- you're referring to "dangling" heads and tails. When they occur later, within the HaplotypeCaller (due to the reads being moved during the assembly step), they are "rescued" and placed in the appropriate location if possible. When they occur at the Split'N'Trim step we currently can't rescue them, but I believe the plan is to move that functionality into HaplotypeCaller, so they will ultimately also be rescued wherever possible. @ami, is this the right way to put it?

@ami If you try to align RNAseq reads using directly bowtie/bwa, you will not get any spliced reads (without using any soft-clipping option etc.). Then, when you visualize the alignment on IGV, you would see many miss-matches especially in intron spanning reads. These would introduce many false positives when you call variations. However; a splice aligner splices these intron spanning reads and flags the cigar string with 'N'. At this point GATK needs to handle cigar string properly. So, my understanding is that SplitNCigarReads tool is for this purpose, right? I might have been a little unclear, sorry. Thanks for your time and effort.

Everything you describe is right. The splice aligner will add N's to the cigar only to gap between 2 exons, and this is the part that we remove with the SplitNCigarReads, so you don't (implicitly) lose information and there is no part of the read that I can think that we wold like to keep as in your suggestion "What I suggested to keep was the part of the read that is trimmed off" - the part that is trimmed is only N's. We do trim some "dangling" heads and tails that have many mismatches since they usually should be spliced by the aligner by somehow were missed.
Does it make sense now?

Hi all:
I have failed when i use SplitNCigarReads(mapping by tophat2). It give me an error like this :
Cannot split this read (might be an empty section between Ns, for example 1N1D1N): 10M628N2D203N90M
Should I filter these reads mapped by this style ?
What is more , which tools can do this? (does samtools can finish it ?)
thanks!

@zzg‌ which aligner are you using? such a Cigar string does not make any sense, so I would suspect such results... You can write a GATK walker to filter such reads (I don't think that any of the tools can do it right now). Even if you do filter these reads out, I would be suspicious with the other reads...

hey,
First of thanks really very much for this!
I've already run the HaplotypeCaller and also run VariantFiltration on the resulting VCFs. However, for each read I get 1-3 warning lines like this:
"Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD"
I thought this is because some lines in the VCF files don't contain QD. Is that possible? I don't think it should bother me that much because I only get the error for 1 to 3 lines per sample. Should I check what lines don't contain QD?
I'm using the latest GATK version(v3.1-1-g07a4bf8) and the line for variant filtration is exactly the same you suggest using, above. The process doesn't abort and I have the resulting filtered vcfs.

@ami‌@Geraldine_VdAuwera‌ I mapped my reads by tophat2. I have filtered these reads by a simple perl script :perl -e 'while(<>){chomp;@tmp=split/\t/;if($tmp[0]=~/^\@/ or $tmp[5] !~ /(\d+)N(\d+)D(\d+)N/){print "$_\n"}}' input.sam > filtered.bam .Until now , it works all right.

@zzq‌ I've discussed this with the TopHat developers; it seems to be a side effect of the use of a transcriptome resource in the alignment process. What's happening here is probably that the read maps to a transcript that includes only two of three exons, with the middle one getting skipped at transcription time. We will need to add some functionality to handle this properly. In the meantime, I would suggest filtering out these reads by name (since your script identified them) using the ReadName filter.

first of all, thank you so much for this tutorial and this software. I have a question. I have RNA-seq data generated from F1 mice (B6 X NOD), therefore het at every loci. B6 is the reference genome, NOD the alternate. I applied exactly your command line, with the exception of starting with bam files aligned with Tophat2. I used SNPs and INDELs reference for NOD from Sanger Institute.
The vcf files before and after variant filtering somehow only show genotypes from het and homo-alt (GT = 0/1 and 1/1). Am I missing something so that I do not see GT=0/0 ?

@davidpz‌ Can you explain a little bit more what are you trying to do? Do you just want to check the genotypes of the new pipeline given the known alleles from the Sanger Institute (as you wrote: " I used SNPs and INDELs reference for NOD from Sanger Institute.") or do you want to compare the calls of the new pipeline and the calls that you have from the Sanger Institute (on those known sites? or on all genome?). There are different tools for those 2 different directions, and we will be able to give you better suggestions.
As we try to get feedback on the new pipeline, I would be happy to get some feedback after you will do you comparisons... especially for INDELs (since you are using tophat2) Also, out if curiosity, why do you use tophat2?

Thank you for your feedback Geraldine and Ami.@Geraldine_VdAuwera‌ : thank you, I am trying these options. @ami‌ : 1) My specific question is to study allelic expression in single cell RNA-seq data: the NOD allele and B6 allele. I am not interested in finding new variants, just to call the one that are in the SNP and INDEL reference files from Sanger. I am using 3'digital gene expression so that one read corresponds to one transcript in the original cell. So in theory, counting the reads with an NOD variant, and reads with a B6 variant, should provide me the solution. I was thinking of using the DP info from the VCF file generated by GATK.
2) I have no specific reason to use Tophat2 instead of STAR, yet. I already had the bam files aligned with tophat2, and wanted to test out the pipeline first. I do not know STAR yet. For single cell, I just had a better mapping than bwa.
I am in Boston (HMS), I would be very happy to discuss in person, if you are interested, and I would less likely make analytic errors.

Thanks so much for this wonderful pipeline!
I am currently on the indel realignment step (-T IndelRealigner) and I am getting a large number of results saying "Not attempting realignment in interval SomeChr:Location because there are too many reads." I'm guessing this is because RNA-seq will have far deeper coverage at certain points than WGS or WES. I see I can increase the cutoff for analysis using -maxReads.
1) Do you suggest this for RNA-seq?
2) If you do suggest it, is there an upper limit that I should not try beyond?

@sboyle‌ how many reads do you have? When I tested it on 250M reads, I got that several times (~170 locations), so it is expected, but I would not be worry about it, it should not have impact on the variant calling results.
If you really care about the specific regions that are ignored, since the gene there is something that is your main study, or if you see a variants (inconsistent indels, cluster of SNPs) that look as they might be as a results of ignoring the realignment, you can defiantly increase the threshold, but we didn't do that for our data.
Also - what aligner are you using? and how good is the quality of your data? my suggestion assume for 2-pass star and high quality (2x76, > 30M reads) data

Thank you for your comment @ami. I have around 120M reads. I just tried raising the -maxReads value to 100000 and there are still several regions claiming too many reads. I am just hoping this does not cause a large number of incorrect indels to sneak through. Has anyone else looked at this situation?
I'm using Star 1-pass, high quality data, 2X101 bp. Does the 2-pass help significantly? I read the alignment comparison paper by Par Engstrom in Pau Bertones lab "Systematic evaluation of spliced alignment programs for RNA-seq data" and it did not seem that the 2-pass was that much better, however, maybe for variant calling it is?

Update: actually, increasing to 1000000 actually is helping a lot. I only have 3 positions that aren't realigning so far.

Based on our results it was helping in removing many FP calls close to splicing positions. I would recommend using it. If you do and have your conclusions based on the calls with and without it, please share them with us!
If you are planning on using HC, I wouldn't be so worry about the indel realigment issue, since it is manly important for the BQSR step, but the HC won't need it.

I was wondering if any of you guys have heard of this approach for calling SNVs. Mapping reads to an index including splice junction sequences. Please check this out and tell me what you think. lilab.stanford.edu/SNPiR/readme

Hi @yasinkaymaz‌ , we didn't evaluate it but we collaborate with that team in order to compare and improve both pipelines together. I think that the main differences will be in the regions that are close to the spice positions, since they are filtering those calls and our approach is to fix those regions. I would expect that their pipeline will have better specificity (since they have more filters and since they map to to an index including splice junction sequences) and worse sensitivity. We probably call more TP but also more FP.
The other 2 main big differences are the ability of our pipeline to also INDELs and the up to date tools that we are using (newer version of BWA and HC instead of UG - a different and improve type of caller, you can read about the differences between the tools on line). Again, this is my guess based on the different steps and tools in the pipeline, but we didn't compare the results yet.
If you do it before us, please update us.
Are you interesting only in SNPs?

Hi @ami‌ , thanks for the comment. I have run my samples through your pipeline and got some pretty encouraging results. Only issue got my attention was that splice site regions with in exon boundaries (2-3bp inside right before intron) were hot spot for indels and SNVs. I was not sure if those calls were true or alignment artifact. Right now I am trying to set this SNPiR pipeline and run for a parallel comparison. One caveat is, I think, the splice junction sequences that you concatenate to original genome are probably based on known annotated junctions. I don't think they truly account for novel genes/splice junctions, which will introduce noise.
Yes, I am also interested in indels as well as SNVs.

Hi @yasinkaymaz,
We do not use any known annotated junctions, to avoid issues like you mentioned ("I don't think they truly account for novel genes/splice junctions, which will introduce noise") but instead we use the first pass of the alignment to identify the potential spice junctions (that are specific to the given data and thus there is no problem if they are novel).

Did you use the 2-pass star protocol as we suggested? If so, it will be great if we can get a snapshot of regions that are hot spot for indels and SNVs, so we can understand the problem and see if it is something that we fix. If you are willing to provide us such data, @Geraldine_VdAuwera‌ can instruct you what should be done.

@ami I was talking about SNPiR pipeline when I mentioned known junctions because authors provide a set of junction sequences to be used as alignment targets. To me this is a prior info so you need to know where those junctions are to be able to generate such sequences.
I used tophat as a splice aligner instead of star for my initial pipeline. I would be happy to have your comments on my results once I summarize them.

Using Tophat2 might create the "hot spots", but it will be good to verify it when you will have some data for us to check. You can also try the 2-pass star and see if you still see that problem.
We appreciate your feedback, thanks.

I have ran STAR a few times and do not forget to set sjdbOverhang to readlength-1!
(read like Spliced Junction Data-Base Overhang. Overhang is the amount of sequence a read extends into the exon from another exon. At most this can be readlength-1 or else this read will not have any anchor in the other exon).

Also the parameters alignSJoverhangMin = 5 and alignSJDBoverhangMin = 3 might explain the 'hot spots' that 'yasinkaymaz' mentioned. Depending on the usage of annotation one of the two will be used. The reads not passing the alignSJoverhangMin/alignSJDBoverhangMin are possibly written as unspliced reads with alignment errors. Maybe for variant discovery the best setting is alignSJDBoverhangMin = 1 or so.
Good luck on perfecting improving RNA-seq calling.

Am very interested to hear of anyone's experience comparing the lilab.stanford.edu/SNPiR/readme pipeline with the 'GATK' best practices protocol above. I'm about to run my RNA-seq data through each to compare...

@mmterpstra, thanks for the suggestions, we haven't test all the STAR parameters extensively, but we do discuss that with Alex Dobin (the developer of star) and we appreciate any suggestions based on our users experience, so thanks again.
btw - I assume that @yasinkaymaz 's hot spots were due to different reasons since he is using tophat and not star.
"The reads not passing the alignSJoverhangMin/alignSJDBoverhangMin are possibly written as unspliced reads with alignment errors" - in this case, if the SJ is found in other reads, the SplitNCigarReads tool will trim the part with the alignment errors

Hi @mmterpstra‌ , @ami is right. I used tophat as a splice aligner not star. I personally think reads that span introns or overlap splice acceptor/donor sites (hot spots in my results) should be realigned more sensitively (blat is a good option). That's why I wanted to try SNPiR pipeline. I am still working on it in my free time and I will definitely let you guys know about the results.

@yasinkaymaz‌ - thanks again for your comments. When you have time, can you please generate a snapshot of those hot spots... we mentioned them few times, and I'm curious to see how they actually look like. Thanks!

I have two more questions
1) Should I still follow RNA seq best prectice if I use dthe transcriptome assembly as reference?
2) How should I annotate my SNPs from the pipeline, if I don't have an annotated genome file?

@Keifa_1983‌ why do you use BWA? If you are using the bwa-mem it can work, but you should know that it was not meant to be run on RNA-seq data. I expect that you will get many false calls due to that reason (although BWA-mem split reads, it does not remove the overhang parts of the reads). If you still want to try it (I won't recommend it unless you have a really good reason to do it) You should not use reassign mapping qualities and the splitNCigarRead tool.

can this workflow be used with RNAseq from cancer samples (where ploidy may be an issue) or should HC be replaced by mutect in this workflow? does anyone have experience using this workflow with cancer samples?

@nbahlis‌ HC is not able to appropriately handle the cancer-specific issues such as uncertain ploidy. But MuTect will have issues with the RNA-specific issues. So, there is not currently one tool that is completely appropriate for handling cancer RNAseq data. In time this will change but for now you may want to try both and see what gives you the most meaningful results. Good luck!

Is anybody having suggestion to process RNA-seq data with the aim to perform allele specific expression analyses ?
I'm currently using HC to call variants, and planned to use DP field from the VCF file generated by GATK.
I was wondering if I should use read downsampling ?
I'm working on 2 x 100 60M reads aligned with STAR 2-passes and processed as suggested in GATK Best Practices for RNA-seq.

Glad to hear it's working for you! We're looking at methods to do allele specific analysis, but we haven't got anything ready for public use yet. In the meantime what you can do is use the AD values to estimate allele ratios. I wouldn't necessarily recommend downsampling unless you find you have problems with excessive depth. @ami may comment on how downsampling would affect the analysis, I'm not sure.

We trim the sequence reads for adapter and low quality before alignment to the reference but as it has been mentioned in the previous posts STAR should be executed with option 'sjdbOverhang' set to readlength-1. Does that mean that it needs all the reads to be of same length, meaning no trimmed reads when aligning with STAR?

For RNA-seq analysis in cohort, could I use HC in a two step procedure as recommended for DNA (HC + GenotypeGVCFs) ?
In the "Best Pratices for RNA-seq" it appears we should use HC to process each sample individually for calling and genotyping.
But there is no explanation for doing thing in a different way than with cohort DNA-seq analysis.
I'm a curious guy and I really want to know why !

Hi @Mikebesanski‌,
I'm also not sure what is the effect of downsampling on the allele specific expression analysis, although I think you won't see big differences if you use it or not (since the random downsampling should not change the ratio between the alleles too much). The best thing to do is to run the HC (and any other downstream tools) with and without the downsampling on a small region and compare the results. In general this is the best way to check those kind of things, since it might also depend on your specific data. Such a comparison should not take too much time or compute resources.
We (and I'm sure that other users as well) will highly(!) appreciate if you can report your conclusions and thoughts after you do it.

Hi (again) @Mikebesanski‌,
The only (critical) reason that we don't have such recommendations is that we didn't evaluate such project (with more then one sample simultaneously) so we can't share any conclusions. Again, the best approach is probably to roll up the sleeves and compare the two or more reasonable options on part of the data. And of course not to forget to share also those results with our community. It can help others and you may also get comments/suggestions from us and form our other very smart (and curious like you) users.
In the future, we would probably have some recommendations for such projects and we will work with more and more data.

Hi @vsmarwah,
It is a very specific question about STAR options and I wouldn't want to answer without being 100% sure about it. I think you should ask the developer of the STAT tool.

[However, your question gave me an idea, and I would suggest to Alex Dobin, which will give a talk at the Broad in few weeks, to find a way or a platform where he can also answer questions regarding his part of the pipeline].

Thanks a lot @ami‌.
I'm currently running the pipeline on RNA-seq data in a "two step manner" (HC + GenotypeGVCFs) and without any downsampling.
As soon as I perform some comparisons (with the "one step manner" with HC only and/or with downsampling), it will be a pleasure to let you know.

This might not be neccecary but the gatk bundle is still missing some formats (For the star/picard tools genes.gtf/genes.refflat/rrna.interval_list) that should be logical to include or to describe the setup with the prerequisites pages.

Unless I'm mistaken, those files are not used in the basic workflow we describe, so that's why we wouldn't include them outright in the bundle. However, if you can tell me what is the utility of these files and if that is of general interest to users, we can talk about including them in the next version.

That's a fair point. We're generally hesitant to take on the task of maintaining additional files because it adds overhead, so I can't guarantee we'll actually start bundling these, but I'll bring this up in our next group meeting. I'll let you know what comes of it.

Just a small question. If you have a RNA-seq samples that has multiple runs/lanes where do you need to do the SplitNCigarReads part? So for each lane separate or at the merged bam? I asume the merged bam but just want to be sure.

I ran this pipeline (from alignment to haplotypecaller) on 70 RNA-seq samples and used GenotypeGVCFs to join them into a single VCF file which has 1.5million variants. After filtration (following this pipeline) and annotation by Snpeff, I found that 70% of the variants are intronic and 14% are intergenic. This looks weird because RNA-seq reads should be mostly exonic (if we suppose the library contained mostly matured mRNA).
My question is:
1. Should I include transcriptome information in read alignment?
2. Or should I specify a target file when running haplotypecaller?
3. Or you think it's fine to have so many introinc/intergenic variants. (Either because of novel transcripts, or unmatured RNA, from 70 samples?)
Thanks.

By the way, the annotation says there are just ~61,000 coding variants. Considering it's a combination from 70 samples, the number is much lower than what we regularly get from exome-seq. I know variants may not be called in the genes with low expression, but I just wonder what's the range other people get.

@sirian‌ We haven't yet fully investigated the interaction between the GVCF and the RNAseq modes, so for now only single-sample calling is supported for RNAseq. We know some users have started to try this and seem to be getting reasonable results, but we can't guarantee that it works properly until we've investigated it ourselves in more detail.

@Geraldine_VdAuwera said:
sirian‌ We haven't yet fully investigated the interaction between the GVCF and the RNAseq modes, so for now only single-sample calling is supported for RNAseq. We know some users have started to try this and seem to be getting reasonable results, but we can't guarantee that it works properly until we've investigated it ourselves in more detail.

Hi,
I'm working with Cancer RNAseq data aiming for finding RNA editing sites. I worked through this pipeline. As discussed in this forum, it seams no tool is good at calling variance (DNA, RNA difference in my case) from cancer RNAseq sample.
My question is the vcf file I got seams do not have dbSNP identifier information for the SNPs. Id the program does not provide this information automatically and thus I have to add the annotation information by some other approaches?
Best,
Jun

Thanks for making it work with RNA-seq data. My concern is the issue, as you mentioned, that HC will miss heterozygous when reads supporting the alternative allele are extremely fewer comparing to the reference supporting reads. So I have a few questions related to the issue:

1) Did you check how many of the variants missed due to this issue? We intend to study RNA-editing events and it could happen in a very low frequency resulting in exact the same scenario as the issue, only it is not a heterozygous on the DNA level. Well, we don't have DNA-seq data, so hope the pipeline to detect variants from RNA-seq data is very sensitive.
2) I am not sure what do you mean by DNA-aware mode of the HC. Did you mean to run HC again without -recoverDanglingHeads -dontUseSoftClippedBases and merge results? Or you mean to run HC again on DNAseq data of the same sample? I have to say this is not available for us.
3) With the command to call variants suggested here, if I change it to output GVCF, will those sites that does not satisfy stand_call_conf 20.0 stand_emit_conf 20.0 but does have reads of alternative alleles be properly recorded? I mean if GT,AD and GQ fields of those sites will reflect the existing of alternative alleles.

That is why we are working on functionality to jointly analyze both RNAseq and the corresponding DNAseq, so the caller can compare what is the "real" genotype of the sample (per the DNA) and how the RNAseq-based call matches or differs from that (indicating either differential allele expression or RNA editing, depending on the case). But that is not yet ready.

1) I don't have any numbers on hand, sorry. Perhaps @ami can comment on that. In any case though, if you don't have the corresponding DNAseq to compare against, I'm not sure I see how you are going to identify RNA-editing events...?

2) This is what I refer to at the start of this comment, but like I said it's not ready. In the meantime you can produce the DNA-based calls yourself, and set up an analysis that compares the calls from both in a pair-wise manner.

3) If you set the HC to output a GVCF, the thresholds will be ignored, and all alternate alleles will be recorded accordingly. Note however that we have not yet validated the RNAseq and GVCF functionalities in combination.

, although we have to deal with lots of false positives and carefully justify what we find. Surely the DNAseq of the same sample will help, but not sure if the cost is worth. For now I think the sensitivity of a RNAseq variant caller is more important than its specificity.

Hmm, I see, fair enough. You'll still need to get gDNA and cDNA sequenced to validate your candidate sites, though. You could save a lot of time by just sequencing the genome (or exome) up front, and eliminate a lot of candidates that won't pan out based on the DNAseq. But that's a question of managing the time/money cost tradeoffs (we don't judge).

I think the HC in GVCF mode is your best tradeoff, as it will yield the highest sensitivity. Normally we don't advise analysing the raw gVCF file directly (instead, you put it through GenotypeGVCFs will all the other samples in your cohort) but in your case you'll probably want to look at the raw allele counts.

I tried exactly as described in this tutorial. The result I got have some some like this:
RefReadCount AltReadCount
1 88

This kind of SNPs usually present in dbsnps.
Actually, this is not so uncommon in the result. I think the 1 reference read count most likely because of technical error. I'm not sure why they are not filtered, because they are listed in dbsnps thus increased the confidence?
In this case, is it a good choice to add a filter criteria that read counts less or equal to 2 are not considered?

I intended to use AD to calculate allele specific expression as you mentioned in this forum before (in this case 1,143. The '1' obviously because of technical error. What I surprised before is all this cases, the 'variances' are present in dbsnp). Now I know I should also consider GT which in this case is 1/1, means homozygous.

I did run splitNCigarReads, but I still got some variants that resulted from wrong splicing. In the following example, the alignment at the top (above RefSeq track) are the alignments from STAR mapping, before splitting. The alignments at the bottom (below RefSeq track) are after running splitNcigarReads, indel realignment, and base recalibration. As you can see, reads were incorrectly spliced by STAR and the overhang mismatches were not trimmed off by splitNcigarReads. The three mismatches at the end of the subreads on the left actually match the beginning of the exon on the right.
Could you comment on this? Thanks.

Thanks.
I used exactly the same command in the pipeline on this page. I actually don't quite understand how splitNcigarReads can remove "intronic overhang" if it is before the N cigar. How does it know it is intronic, without any annotation information?

as i have to identify an variant that we have found in diseased samples using exome seq, sanger seq as heterozygous and now with RNA seq but my RNA-seq is giving this info i.e homozygous for alternate allele
chrxxx XXXXXXXX . C T 70.28 . AC=2;AF=1.00;AN=2;DP=3;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=95.00;MQ0=0;QD=23.43 GT:AD:DP:GQ:PL 1/1:0,3:3:9:98,9,0

PL(1/1=0 ) shows less likely for this variant to be homozygous variant

@h_asif‌ It is called as homozygous for alternate allele (hom-var), HOWEVER, the quality of the call and the genotype is low, since you only have a coverage of 3 reads in that site. As you can see in the PL the differences between the hom-var and the heterozygous are very low, so even if it is a real variant (which is hard to say based on the RNA seq data in this case, although you do have other data sets that show that there is variant in this position) it is not clear if it is hom-var or heterozygous (ideally you want to see 0 for the called genotype and bigger difference from the other possible genotypes), but either way, it's clear that the subject is definitely not "hom-ref" (=homozygous with the reference allele)