I'd like to introduce another member of the BBTools package, BBDuk. "Duk" stands for Decontamination Using Kmers. BBDuk is extremely fast, scalable, and memory-efficient, while maintaining greater sensitivity and specificity than other tools. It can do lots of different things, for example -
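
For example, adapter trimming looks something like this (file names here are placeholders; the flags are explained below):

bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1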

(if your data is very low quality, you may wish to use more sensitive settings of hdist=2 k=21)

...where "ktrim=r" is for right-trimming (3' adapters), "ktrim=l" is for left-trimming (5' adapters), and "ktrim=N" just masks the kmers with "N". "k" specifies the kmer size to use (it must be at most the length of the adapters), while "mink" allows shorter kmers to be used at the ends of the read (for example, mink=11 allows 11-mers for the last 11 bases). "hdist" means "hamming distance"; hdist=1 allows one mismatch. Instead of "ref=file" you can alternatively (or additionally) say "literal=ACTGGT,TTTGGTG" for those two literal strings. The BBTools package currently includes TruSeq and Nextera adapter sequences in /bbmap/resources/ as truseq.fa.gz, truseq_rna.fa.gz, and nextera.fa.gz. You can specify whether BBDuk looks for the reverse-complement of the reference sequences as well as the forward sequence with the flag "rcomp=t" or "rcomp=f"; by default it looks for both. You can also restrict kmer operations such as adapter-trimming to only read 1 or read 2 with the "skipr1" or "skipr2" flags, or restrict them to the leftmost or rightmost X bases of a read with the "restrictleft" or "restrictright" flags. For normal paired-end fragment libraries, I recommend adding the flags "tbo", which specifies to also trim adapters based on pair overlap detection (which does not require known adapter sequences), and "tpe", which specifies to trim both reads to the same length (in the event that an adapter kmer was only detected in one of them).

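For quality trimming, the command looks something like this (file names here are placeholders):

bbduk.sh -Xmx1g in=reads.fq out=clean.fq qtrim=rl trimq=10
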
This will quality-trim to Q10 using the Phred algorithm, which is much more accurate than naive trimming. "qtrim=rl" means it will trim the left and right sides; you can instead set "qtrim=l" or "qtrim=r".

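For contaminant filtering, something like this (the phix reference file name is a placeholder for the copy shipped in /bbmap/resources/):

bbduk.sh -Xmx1g in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt
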
This will remove all reads that have a 31-mer match to phix (a common Illumina spike-in, which is included in /bbmap/resources/), again allowing one mismatch. The "outm" stream will catch reads that matched a reference kmer. This allows you to split a set of reads based on the presence of some sequence. "stats" will produce a report of which contaminant sequences were seen, and how many reads had them.

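For kmer masking, something like this (illustrative file names):

bbduk.sh -Xmx1g in=ecoli.fa out=ecoli_masked.fa ref=salmonella.fa k=25 ktrim=N
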
This will mask all 25-mers in E.coli that are also shared by Salmonella, by converting them to N. You can change them to some other letter instead, like X.

BBDuk can handle single-ended reads, pairs in two files, or pairs interleaved in a single file (which will be autodetected based on read names, or can be overridden with the 'int=t' or 'int=f' flag). For example:

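(a representative command; file names are placeholders)

bbduk.sh in=reads.fq out1=clean1.fq out2=clean2.fq qtrim=r trimq=10 minlen=25
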
This will process the input as interleaved (which is forced since there are two output files). It will trim to Q10 on the right end only, and throw away reads shorter than 25bp after trimming. If one read is too short and the other read is OK, both will be thrown away, to preserve pair ordering. This can be changed with the "rieb" (removeIfEitherBad) flag; setting it to false will keep the reads unless both of them are too short.

BBDuk can process fasta, fastq, scarf, qual, and sam files, raw, gzipped, or bzipped. It can also handle references that are very large (even the human genome) in a reasonable amount of time and memory, if you increase the -Xmx parameter; it needs around 15 to 30 bytes per kmer. It can do operations such as quality-trimming and contaminant-filtering in the same pass, but not two different operations using kmers (such as left and right kmer trimming), although BBDuk2 can do that. BBDuk can also do various other filtering procedures such as complexity filtering, length filtering, gc-content filtering, average-quality filtering, chastity-filtering, and filtering by number of Ns.

For more details and features, you can run bbduk.sh with no parameters for the help menu, or ask a question in this thread.

Note! If your OS does not support shellscripts, you would replace

bbduk.sh -Xmx1g

with

java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF

...but leave the rest of the command the same. "C:\bbmap\current" would vary depending on where you unzipped bbmap.

The "-Xmx1g" flag tells BBDuk to use 1GB of RAM. When using the shellscript, BBDuk does not strictly need that flag and can autodetect the amount of memory available. When using a large reference, or a large value of "hdist" or "edist" (hamming distance and edit distance, both of which greatly increase the amount of memory needed), you may need to set this higher.

P.S. When processing dual files, instead of "in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq", you can simply use "in=r#.fq out=clean#.fq". The "#" symbol will be replaced by 1 and 2.

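The synthetic reads for this test were generated with addadapters.sh, roughly like this (a sketch consistent with the flags described below; file names are placeholders):

addadapters.sh in=reads.fq out=dirty.fq ref=gruseq.fa right int=f
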
"int=f" makes the reads treated as single-ended, to simplify things. "right" means the adapters will be 3' type. So, they will be added at a random location from 0 to 149, and possibly run off the 3' end of the read, but the read length stays at 150. If the adapter ends before the end of the read, random bases are used to fill the rest. Approximately 50% of the reads get adapters, and 50% don't. After the adapter is added, each of the adapter bases is possibly changed to a new base, with a probability from the read's quality score for that base, to simulate sequencing error.

Next, I ran 3 different programs - Trimmomatic, Cutadapt, and BBDuk, and measured their performance. This was on a 16-core, dual-socket Sandy Bridge E node with 128GB RAM, reserved for exclusive use, and only interacting with local disk, so the rest of the cluster had no impact on the tests. Hyperthreading was enabled.

So, Cutadapt is extremely slow, and BBDuk takes both the speed and efficiency wins by a large margin. Of course, accuracy is more important than speed, so I graded the results. addadapters.sh already replaced each read's original name with a synthetic name indicating its original length and the length that the read should be after trimming. For example, "@0_150_11" means that the read was originally 150bp, but should be 11bp after trimming, because an adapter was added at position 11 (0-based). This allows quantification of both the number of bases correctly and incorrectly removed. Ideally, "Adapters Remaining" and "Non-Adapter Removed" should both be zero after trimming. For reference, this is what the untrimmed file looks like:

BBDuk performs quite well, vastly outperforming Cutadapt and Trimmomatic on every metric. Trimmomatic and Cutadapt both do quite poorly, though of the two, Cutadapt has both a higher true-positive rate and a much lower false-positive rate than Trimmomatic, so it takes second place in accuracy. If anyone has any other adapter-trimming tools they commonly use, please reply and I'll be happy to test them with the same methodology. Also, if anyone has suggestions for better parameters, please reply; I have no experience with either tool, so I'm basically using the defaults.

All of this is testable and repeatable - you can use your own data and your own adapters, or the attached "gruseq.fa.gz" file and this qual file to replicate my results. The exact numbers will depend somewhat on the quality values of the real data, and very slightly on the organism.

P.S. You can get slightly better accuracy with two passes using different values of k and hdist, like this:
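
For instance (a sketch; the specific k and hdist values in the second pass are illustrative, not necessarily the exact ones I used):

bbduk.sh in=reads.fq out=trimmed1.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
bbduk.sh in=trimmed1.fq out=trimmed2.fq ref=adapters.fa ktrim=r k=19 hdist=2 tpe tbo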

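For example (file names are placeholders):

bbduk.sh in=reads.fq out=trimmed.fq ref=truseq.fa.gz ktrim=r k=23 mink=11 hdist=1
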
You can use whatever adapter sequence you want (with the "ref=" flag), and I include the Illumina truseq adapters with the BBMap package, in the "resources" directory. The command above will allow up to 1 mismatch (hdist=1) and trim as few as 11 trailing adapter bases. You can of course increase the number of mismatches and decrease the minimal number of trailing bases that are trimmed. If you have really low-quality reads with an average insert size close to the read length (and thus a high rate of adapters) you should probably set hdist=2 and mink=6 to remove more of them, but be aware that those settings will increase the false positive detection rate.

It's best to do adapter-trimming first, then quality-trimming, because if you do quality-trimming first, sometimes adapters will be partially trimmed and become too short to be recognized as adapter sequence. When you run BBDuk with both quality-trimming and adapter-trimming in the same run, it will do adapter-trimming first, then quality-trimming.
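
For example, a combined command might look like this (illustrative file names and settings):

bbduk.sh in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 qtrim=r trimq=10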

Thank you a lot for your nice and fast reply.
I wanted to inform you that, based on my first simple tests on Illumina data, your method performed best, followed closely by Trimmomatic.
The other methods tested were Cutadapt and fastx, but those are way behind BBDuk and Trimmomatic.

Sorry if I write on the wrong post, but I wanted to ask you about Dedupe.
I wanted to know if it is possible to output simple stats about the repeated sequences.
For example:
number of copies, and length (I am assuming 100% identity among the copies).
Alternatively, a single file with all the repetitions would also do.
For now, if I am not wrong, the option "outd" gives a file with only 1 copy of each of the repeated sequences.

"outd" will print every duplicate removed. So if some read has 5 copies, then 1 copy will go to "out" and 4 copies will go to "outd". Therefore I think that's what you're looking for.
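
For example (illustrative file names):

dedupe.sh in=contigs.fa out=unique.fa outd=duplicates.fa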

By default only exact duplicates and exact containments will be removed. So if you have a 200bp contig X and a 100bp contig Y such that Y is a substring of X, then Y will be removed and X will be kept. You can adjust this with the "ac" (absorb containment) flag.

Also, if the input is paired reads, they will only be removed if both reads exactly match another pair.

Hello Brian
Thank you for your reply
Actually I realised the copies were there just after I posted my question
anyway, it was not clear to me that only n-1 of the copies go to the file specified by "outd"

Dear Brian,
Thank you for BBDuk. Great tool!
I have been using it so far for quality and adapter trimming, but there is one additional trimming step that I would like to do with it and I am not sure if and how it is possible with BBDuk.
The first 10 bases of my reads are usually of high quality, but I observe that their GCAT-content is not as even as it should be. So I would like to cut the first 10 bases of all reads during the quality and adapter trimming. Is that possible with BBDuk?
I have been doing this with nesoni clip so far, but it would be great if I could do all the trimming in BBDuk in one or two steps.
Thank you,
Manuel Kleiner

A couple of comments. First, this is currently possible with Reformat but not with the released version of BBDuk. I'll update it later today or this week so that it will be possible to do both adapter and positional trimming in one command. Until then, you can do this:

reformat.sh in=reads.fq out=trimmed.fq ftl=10

...where "ftl" means "force trim left".

Second, it's important to make sure these bases should really be trimmed. We have been generating some Nextera libraries recently with very erratic base frequency for the first 20 bases:

The top is the base composition histogram before adapter-trimming, and the bottom is after (this has read 1 from 0-150 and read 2 from 152-302); note how the right part of the read looks much better after adapter trimming. But the first 20 bases look terrible! However, I mapped the adapter-trimmed reads to the assembly with BBMap using the 'mhist' flag, which generates a histogram of the rates of match/substitution/insertion/deletion rates by read position:

The error rate is a little higher for the first few bases, but still well under 1%, so we are not going to trim the first 20bp off of those reads, as was initially proposed. The reads are accurate even though the base composition is highly biased, because the fragmentation was not random (this uses some kind of enzyme). Generally, before you trim off bases because of a skewed base composition histogram, I suggest mapping to see if there actually is a higher error rate there.
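
The mapping command was along these lines (a sketch; file names are placeholders):

bbmap.sh in=trimmed.fq ref=assembly.fa mhist=mhist.txt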

Consider a read whose quality scores are 40, 40, 40, 40, 2, 2, 2, 2, 40, 2. What I would term "naive trimming" to Q10 would trim only the last base (which has quality 2), and stop because the next base has Q40. This would leave 4 internal bases with Q2, which is not desirable.

The Phred algorithm would trim the last 6 bases, because their average quality (calculated by summing the error probabilities) is 2.79, which is below 10: the mean error probability of those 6 bases is (5*10^-0.2 + 10^-4)/6 ≈ 0.526, and -10*log10(0.526) ≈ 2.79. Trimming regions with average quality below a threshold gives the optimal result in terms of the ratio of retained bases to the expected number of errors.

In that case, it would trim all but the first 4 bases. On the other hand, in this case:

40, 40, 40, 2, 2, 2, 40, 40, 40, 40

...it would trim all but the LAST four bases, because there are more high-quality bases on the right end than the left end. The result after trimming is always guaranteed to be the largest possible region with an average quality above the threshold, such that no flanking sub-region has average quality below the threshold. Illumina reads typically have the lowest quality on the ends and higher quality in the middle, but you can get low quality in the middle if there's a laser failure or air bubbles in the middle of the run. A single bad base in the middle won't cause half the read to be trimmed unless the average-quality threshold is set pretty high, but multiple bad bases in the middle will cause either the first or last half of the read to be trimmed to get rid of them.