I’m working on RNA-seq analysis of a non-model plant. After doing de novo transcriptome assembly from about 400 million PE reads and blast against nr database, I found out there is almost much contamination from various species of fusarium (a kind of fungus), Acanthamoeba castellani (a kind of amoebae), Crassostrea gigas (kind of oyster), Aplysia californica (sea slug), Saccoglossus kowalevskii (acorn worm) and capitella teleta (polychaete worm).

Could you please let me know how I can remove these contaminations, if there is any database for that? Also, please advise me if contamination removal should be performed on initial sequencing reads or assembly.

Make sure that what you are seeing are actual contaminants, and not simply low evalue best hits. When I look for contamination I use megablast and I expect a very high evalue and high percent identity >95%. If I find that a few transcripts belong to specific species of bacteria/fungus, I download that genome and filter my reads with the tools Brian mentions below against that genome. Another thing to do is a GC content plot of your draft assembly. Different species have different GC contents, so if you have multiple peaks in such a histogram, it may indicate contamination. I believe QUAST can give you GC content plots online if you input your assembly.

Thanks, Adrian. Unfortunately, I found high e-value and identity for contamination. Also, there were multiple peaks for GC histogram indicating contamination as you kindly mentioned. So, I have to move for contamination removal. It would be great if you could please let me know the suitable parameters for filtering contamination.

It's best to remove contamination as early as possible, to avoid chimeric assemblies and reduce assembly time, memory requirements, and fragmentation. The disadvantage is that it is harder to identify read-level contamination because reads are short, so sometimes it makes more sense to do these things iteratively - assemble, identify contamination, map to the contaminant genomes, remove contaminant reads, reassemble what's left over.

It seems extremely unlikely to me that you actually have contamination from so many different unrelated sources, unless they were multiplexed together. Is your lab researching all of these organisms? I've never seen a situation in which anything is heavily contaminated with multiple different animals. Often, there is heavy contamination by one organism, or trace contamination from multiple (particularly microbes). It's more likely that these BLAST hits are wrong (perhaps, the contaminant organism is not in the databases, even at a high taxonomic level). Rather than BLASTing in protein space, I suggest you do so in DNA space (against nt) to see if you can get some long exact matches to a specific organism rather than some genes that might be heavily conserved; that would give greater confidence.

I wrote several tools for decontamination, particularly BBSplit and Seal, both of which map reads to multiple genomes simultaneously to bin them into one file each by the best-matching organism, so that you can get a clean assembly of each one. They require assemblies, though. Seal may work better with transcriptomes than BBSplit. If this is cross-contamination (due to multiple projects being handled together or multiplexed together), CrossBlock is an even better solution.

Another thing to note is that you cannot quantify contamination by mapping the assemblies to nr/nt, particularly in transcriptomes. You have to use the raw reads, because even 1% contamination at the read level can assemble into 50% or more of your assembly. So, have you quantified the contamination at the read level? Again, I use Seal for that (since BLAST is way too slow to process all reads) once the contaminant organisms have been identified, but you can just blast a few thousand of the reads, instead.

Thanks Brian for your helpful reply. Sequencing has been done by sequencing center, not our lab, so everything is possible. This assembly also resulted from merging several assemblies at different k-mers, so it may contain over contaminants. Actually, the various species of fusarium genus is the most abundant contaminant. However, they must be removed at the first as you kindly explained. Regarding your suggestion for doing blast against nt, assume the contaminants also exist in the blast results, how we should deal with them, they have to remove, again.

No, I have not checked the contamination at the read level. As I found on net Crossblock is suitable when the contamination level is low and the genome identity is lower than 97%, so it may not a good choice for my case which there is various species of single genus (fusarium), so I have to use Seal, do you agree with me? Please correct me if I’m wrong.

Crossblock is a little different than Seal/BBSplit - it requires the raw reads from all libraries handled and multiplexed together. If your sequencing was done elsewhere you probably would not be able to obtain that.

So, yes, you can use Seal. Seal is much like BBDuk except it better supports multiple references that might share some kmers. But both Seal and BBDuk require the genomes of the contaminants. You can look on NCBI (or elsewhere) to see if you can find the complete genomes of the contam organisms you are hitting. If you can find all of the genomes of your contaminating organisms, you can remove all the contamination with BBDuk/Seal and not bother with the rest of this post.

It is very important to quantify your level of contamination at the read level. If it turns out that you have 100x coverage of your primary genome and 5x coverage of a contaminant genome, it's easy to separate them based on coverage, before or after assembly, without needing to find the specific references of the contam organisms. I suggest that you try this:

1) Take a single assembly with a relatively short kmer (say, 31-40), map all the reads to it, and generate a contig-level gc vs coverage or length vs coverage scatterplot (you can do that with pileup.sh's output). It's likely that you will see a very clear cluster of main-genome contigs with high coverage, and other clusters with a different gc and lower coverage. In that case you can bin the reads by coverage using BBNorm, or bin the contigs by coverage using filterbycoverage, then separate the reads by mapping to to fake referencee files, one with all the high-coverage contigs and one with all the low-coverage contigs. Then reassemble the high-coverage contigs.

Many thanks Brian for being here and help. Your second suggestion, filtering based on coverage, sounds that is not an easy task at least for me as an almost beginner. Please let me know if there is a tutorial or any documentation for it. Fortunately, I found the genome of the most contaminant at NCBI, however not complete genome, so I can go ahead with Seal, (can be problematic no complete genome, here?). It will be great if I have your recommendation to set the parameters for getting the best results from Seal. As I read, it seems that kmer is a major parameter, yes? which kmer do you suggest for 100bp PE reads?

Normally I use 31, which is Seal's max, and its default. There's not much reason to use anything else for 100bp reads. The main sensitivity factors are hdist (hamming distance, number of mismatches allowed per kmer) and mkf (min kmer fraction, the fraction of a read's kmers that must be shared with the reference). In this case I suspect the genomes will be too large to use a hamming distance above zero. Setting mkf to a value above zero will increase the specificity.

But, it's safe to just use the defaults. If you don't have the target organism's genome, there's little point in using Seal, so I'll just give the BBDuk command:

Thanks so much Brian. I just updated my previous post, I found the genome of most contaminants at NCBI, however, they are not complete genome. Please let me know if it can be problematic for using Seal? So sorry I did not understand your last sentence and suggestion for using bbduck, why you suggested it.

Seal and BBDuk are about the same for filtering against a list of contaminants. Seal is better if you want to bin the input between different references. In your case, you simply have a list of contaminants you want to remove; you don't need to do any binning. For example:

...would create lots of output files, one for every contig inside any of the references, plus one for the reads that don't match any of the references, which is what you want to keep. The BBDuk command:

I used your suggested tool, bbduk, for contamination filtering. It sounds great, so fast. It reported just 2.3% of contamination for 11 contaminant reference genomes. I did transcriptome assembly on clean data with Trinity and conducted the blastn of the assembly against one of the prominent contaminant genome (as before indicated by blastx against nr). Based on blastn results, there was still 55 contigs (with the high percentage of identity and low e-value) matched with the contaminant genome. Of course the contamination was dramatically decreased after using bbduk tool, however, the existing 55 contigs is my concern. Since I also plan to do assembly with a multi-k-mers assembler and merge different assemblies, even so tiny contamination in the each assembly may be problematic in the final assembly. Could you please let me know your idea about it, why does (small) contamination remain even after using bbduk tool? Is it possible to increase the stringency of the software for avoiding any contamination?

First off - 2.3% contamination is really quite severe. That is far higher than you would expect from barcode misassignment, chimera formation, or most of the other common culprits. Sounds like physical contamination to me.

You can increase the stringency with BBDuk by decreasing the kmer length (to, say, 25) and/or increasing the hamming distance to 1 (flag hdist=1 [which needs more memory] or qhdist=1 [which needs more time]). Both of these will help detect reads with large numbers of errors. If you use BBDuk to remove all reads containing kmers in the reference, it is not possible to assemble contigs matching that reference using kmers as long or longer than what you specified for BBDuk. But, I think Trinity defaults to 25-mers. So, contaminant reads with reference 25-mers (but not 31-mers) can still be assembled. Also, reads from a similar but not identical organism can also be assembled.

Alternatively, you could just filter again with BBDuk using the contaminant contigs as the reference, with k=25 and hdist=1, which would be the easiest approach.

It's also possible to use mapping, rather than kmer-matching, for filtering. Mapping will sometimes have higher sensitivity, though for maximal sensitivity, you need to use multiple approaches. For example:

Thank you very much for your helpful response. I didn’t really think that 2.3% is severe contamination. Thanks for letting me know it. As far as I found the maximum k-mer in the bbduk is 31-mer and doing assembly with smaller or larger than this k-mer would be assembled the contaminated reads, which still remains. As I mentioned in my previous post, I plan to do various assemblies with different k-mers (using a multiple k-mer assembler), and then merge different assemblies to get the final assembly. So, I think mapping, as you kindly suggested me, may be better than increasing the stringency of the bbduk tool in this case. So, I plan to map the clean reads resulted from bbduk tool against the genome of contaminants, hope I get rid of any contamination. I wish the bbmap was accepted the several genomes at one time. Sorry, I have paired reads in two files and so, getting the unmapped (clean) reads as two separate files will be desired, but I could not find the related flag for getting two output files in the bbmap.sh, could you please advise me on this issue? Brian, please correct me if I’m wrong.

Hi Brian, I am trying mapping. As you kindly suggested I concatenated all contaminant reference genomes that is about 3GB in size, so mapping step is moving so slow. Is there any way to increase the speed with no lose much sensitivity of the task?

Thanks so much Brian. Sorry for maybe a stupid question, with your suggested flags to speed the mapping process, could you please let me know if the sensitivity of the task still high and acceptable?. you know, removing the contaminants reads and making a high-quality transcriptome assembly with clean reads are of my big concerns.

"minratio=0.8" will reduce the sensitivity to roughly 90% identity. For contaminant removal that's usually adequate. You could leave that flag off, though. The others will only reduce sensitivity very slightly, and "qtrim=r trimq=10 untrim" will actually increase the sensitivity.

Finally, I could run the program with your suggested commands. The speed was really increased, thanks again for all your help. Just one thing, since I used the flags of qtrim=r trimq=10 untrim for increasing the speed and sounds that they are related to trimming, could you please let me know if it's an important to apply this program bbmap.sh) before or after trimming? however, I used it on the trimmed reads. Sorry if the question is so basic.

When you run BBMap with "qtrim=r trimq=10 untrim", it will quality-trim the right end of the read to Q10, then map the read, then restore the trimmed bases afterward (as soft-clipped bases). This increases the sensitivity over not trimming, but does not throw away any data, which I find useful. You can get identical mapping results by trimming to Q10 prior to mapping, but then you would lose the trimmed bases.

Thank you very much, Brian for all your help and developing such a great tool. We usually use the threshold of Q20 for trimming, so it sounds that using the bbmap after or before the trimming could not generate different results. Please correct me if I'm wrong.

Thanks for your reply. However, I don't know why there is still a bit contamination. At first, I used bbduk program (with default settings) for removing contamination, the output was then subjected to bbmap.sh with the flags of qtrim=r trimq=10 untrim fast maxindel=10 maxsites=1, as you kindly instructed me. Finally, I run Trinity on these clean reads and conducted the blastn of the transcriptome assembly against the contaminant genomes for fast evaluating the probable contamination; based on blastn results, there is still a bit contamination even after using both bbduk and bbmap programs, actually 747 transcripts of the 70275 transcripts turned out the best hit. Could you please share me your opinion about it? Please let me know if there is another way to remove the remaining contamination or this contamination level is trivial and I can go ahead with these results?

Are they hitting an organism that you used for filtering? What percent identity and length are the hits, and what organism are they hitting? And how long are these transcripts? Basically, how confident are you that they are contaminants?

Yes, I run blastn of the assembly against those contaminant genomes used for filtering. The percent identity is 70-100% and the subject length is from 400 to 9,386,848 bp. Regarding the length of transcripts that showed the best match with contaminant sequences in the blastn, the min, max and average length are 300, 7700, and 1350 bp. As I mentioned in my previous post, for fast assessment of the contamination, I did blastn of the assembly against contaminant genomes (I concatenated all contaminant genomes). Brian, please help me out what should I do in your professional view. Please let me know why there is still a bit contamination and is there another way to remove the remaining contamination or should I move forward with these data?

A 7700bp match with >99% identity is almost certainly the organism it hits. But shorter hits, under 1000bp with 70% identity, are not very informative except for homology; you can't say for sure if they are contaminants.

If you are confident they are contaminants, you could remove the transcripts, and remove all the reads that map to them. It's hard to say why some contamination slipped through, but it's probably because the reads did not match the contaminant references closely enough. You can, of course, increase the sensitivity of read decontamination. For BBDuk, you could decrease K to 27 and add the flag "qhdist=1"; for BBMap, you could run at default parameters instead of the fast parameters. But, some elements like ribosomes are highly conserved and may appear to be contaminants when they are not, and increasing the sensitivity of decontamination will incur more false positive removals of such things.

Sometimes you can get a feel for contamination by looking at the gc content (or tetramer PCA plot) of all the sequences; if the ones blasting to other organisms have the same gc content as the others, there's a good chance that they are not, in fact, contaminants. You can get gc content like this:

Thank you for being here and help. I just did blastn of the assembly against contaminant genomes and not sure for real contamination.

Brian I run stats.sh for gc content evaluation as you kindly suggested me. Actually, I separately run the script for assembly file with and without contaminant transcripts (those were the best hits in the blastn) as well as for contaminant transcripts and calculated the gc average, which sounds there is a real small contamination. The gc average of assembly file with and without contaminant transcripts was 0.385262 and 0.385067, respectively. It was 0.410834 for contaminant sequences. Based on these results, would you please let me know your idea about it?. Is it necessary to again run bbduk and bbmap scripts with your suggested settings to remove these contaminations?, in fact if I tend to do assembly with multiple k-mers and merge them, the small remaining contamination won't be accumulated and created a big problem?.

In the case of re-running bbduk and bbmap, you told me decreasing K to 27 and add the flag "qhdist=1" for bbduk and for bbmap run it with default settings instead of fast parameters, could you please advise me it's better to run these programs on original reads or those clean reads that have been already subjected to these scripts?

I'm spending lots of time on this issue and you helped me much, I don't know how to thank you.

It does not matter whether you run on the original or already decontaminated reads, since the new settings are more sensitive. And a GC% of 38.5 for "noncontaminant" and 41.0 for "contaminant" sequences are really close; based on the GC, it seems likely that the "contaminants" are actually from the correct genome and not contaminants after all. But, that's not really sufficient evidence.

I am not really sure whether or not these remaining sequences are contaminants; that's something you'll have to decide. If you want to be extra cautious, just remove them.

Thanks a lot, Brian. To be honest, I have no enough experience to make a right decision at these situations. I don't really know to re-run the scripts of bbduk and bbmap. It would be great if you please let me know what do you do if you were in my shoes.

After the read decontamination step that you already performed, I would probably assume that the blast hits under 99% identity are probably spurious, and leave them in. The best approach would be to sequence the organism again (from a new DNA sample) and only keep the contigs supported by reads from both sequencing runs; it's not really possible to absolutely determine what is genomic and what is contaminant without sequencing again. But short of doing that... you basically have to determine which is more important, false positives or false negatives. In RNA-seq you don't really expect to recover the whole transcriptome anyway so it's probably OK to throw away transcripts that might be contamination.

Ultimately, you could go either way (keep them all, discard them all, or just discard the ones with hits of at least 99% identity) and it would be a defensible decision. But I would probably just discard the high-identity hits.

Thanks a lot, Brian. As you kindly advise me I'll discard just the high identity hits since I believe you. Sorry for another question, I should remove those hits and corresponding reads, yes? Is there any tool within the bbmap package for this purpose? or whether can I use those hits as a reference file and do mapping with bbmap.sh?

Just one thing about bbduk, could you please told me if it's better to add the flags of qtrim and trimq for removing contamination with bbduk?, you know the amount of matched reads (contaminants) was about 500 MB using k=31 while it changed to about 4 GB using k=27, much difference.

Wow, that's a huge difference... just bear in mind, that does not mean all of those reads are certainly contaminants, just that they all share 27-mers with suspected contaminant organisms. Meaning... they are probably contaminants.

Quality-trimming can increase the speed and sensitivity of BBMap, but it will not increase the sensitivity of filtering with BBDuk - it will strictly reduce the sensitivity. So, it is not recommended prior to running BBDuk.

Hmmm... there's nothing like that currently. Would it be more convenient if I made the bitbucket repo public so you could pull from it? Or - what do you find to be the most convenient way to download the latest version of software via the command line?

Right now I have to download the tar from sourceforge, so I need to find the link. Then I need to untar it, and it untars into bbmap/, so I need to rename it to the current ver so that I know what ver it was. It would be awesome if you could use git or svn or even a constant wget link that always gives you altest ver with a small file called VERSION would be great.

'...use the raw reads, because even 1% contamination at the read level can assemble into 50% or more of your assembly...'
I found your words are very correct and want to cite it in my benchmark paper. Could you kindly let me know this is your finding or cite from other paper?
Thank you so much.