I'd like to introduce a new read-pairing tool, BBMerge, and present a comparison with some similar tools, FLASH, PEAR, COPE, and fastq-join. BBMerge has actually existed since Aug. 2012 but I only recently got around to benchmarking it.

First, what does BBMerge do? Illumina paired reads can be intentionally generated with an insert size (the length of the template molecule) shorter than the sum of the lengths of read 1 and read 2. For example, at JGI, the bulk of our data is 2x150bp fragment libraries designed to have an average insert size of 270bp. Because the reads overlap, they can be merged into a single, longer read. Single reads are easier to assemble, and longer reads give a better assembly; furthermore, the merging process can detect or correct errors where the reads overlap, yielding a lower error rate than the raw reads. But if the reads are merged incorrectly – yielding a single read with a different length than the actual insert size of the original molecule – errors will be added. So for this process to be useful, it is crucial that reads be merged conservatively enough that the advantages outweigh the disadvantages of incorrect merges.
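The core idea can be sketched in a few lines of Python. This is illustrative only – it is not BBMerge's actual algorithm, which also uses quality values and more careful statistics – but it shows what "merging by overlap" means:

```python
# Toy overlap-merger: slide the reverse-complemented read 2 along read 1,
# score each candidate overlap, and merge at the longest acceptable one.

def revcomp(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def merge_pair(r1, r2, min_overlap=12, max_mismatches=0):
    """Try to merge r1 with the reverse complement of r2.

    Returns the merged read, or None if no acceptable overlap is found.
    The implied insert size is len(r1) + len(r2) - overlap.
    """
    r2rc = revcomp(r2)
    # Prefer longer overlaps: a longer match is less likely to be spurious.
    for overlap in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        a = r1[len(r1) - overlap:]
        b = r2rc[:overlap]
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        if mismatches <= max_mismatches:
            return r1 + r2rc[overlap:]
    return None
```

The danger described above is visible here: a short, spurious match can win by chance, producing a merged read of the wrong length, which is why real mergers weight overlap length and mismatches conservatively.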

On to the comparison. I generated 4 million synthetic pairs of 2x150bp reads from E.coli, with errors mimicking the Illumina quality profile. The insert size mean was 270bp and the standard deviation was 30bp, with the min and max insert sizes capped at 3 standard deviations. The exact command was:

For fastq-join (from ea-utils), I swept the -p parameter from 0 to 25 and the -m parameter from 4 to 12; only the -p sweep is shown, as it gave better results. Sample command:

fastq-join r1.fq r2.fq -o fqj -p 0

All programs (except fastq-join, which does not support the option) were configured to disallow insert sizes shorter than 150bp. This is because FLASH (which appears to be currently the most popular merger) does not seem to be capable of handling reads with insert sizes shorter than read length. If FLASH is forced to allow such overlaps (using the -O flag), the results it produces from such reads are 100% wrong, resulting in performance that seems bad to the point of being hard to believe, which would undermine this analysis. So I limited the range of insert sizes to a minimum of 180, which FLASH can handle. The output was graded by my grading script to determine the percent correct and incorrect:

grademerge.sh in=x.fq

Then the results were plotted in a ROC curve, attached. The X-axis is logarithmic in one figure and linear in the other.

FLASH achieves a slightly higher merging rate at its default sensitivity, but does so at the expense of an extremely high false-positive rate. BBMerge has a systematically lower false-positive rate than FLASH at any sensitivity. fastq-join yields the highest merging rate, at the cost of a false-positive rate second only to PEAR's. COPE is very competitive with FLASH and strictly exceeds it with "-N 0"; unfortunately, it is unstable at that setting. COPE is stable at "-N 1" (the default) but is generally inferior to FLASH at that setting. PEAR appears to be inferior to all other options across the board.

Next, I tested the speed. Since BBMerge, FLASH, and PEAR are multithreaded, I tested with various numbers of threads to determine peak performance. The tests were run on a 4x8-core Sandy Bridge E node with no other processes running; these had hyperthreading enabled, for 32 cores and 64 threads. Input came from a ramdisk and was sent to a ramdisk, so I/O bandwidth was not limiting. The results attached to this post indicate what happens when input was from local disk and output went to a networked filesystem, but that was I/O limited; the image embedded below reflects the ramdisk run.

FLASH has a slight advantage with a low number of threads, but BBMerge ultimately scales better to reach a higher peak of 235 megabasepairs/s (Mbp/s) versus FLASH’s 160 Mbp/s. BBMerge’s peak came at 20 threads and FLASH’s at 36 threads. 235 Mbp/s in this case equates to 499 megabytes per second and BBMerge was probably input-limited (by CPU usage of the input thread) at that point. PEAR was very slow so I did not run the full suite of speed testing on it, but it achieved 11.8 Mbp/s at 16 threads (on slightly different hardware). The single-threaded programs (COPE and fastq-join) beat PEAR up to around 16 threads, but never touched BBMerge or FLASH. Interestingly, PEAR is the only program that scales well with hyperthreading; I speculate that the program does a lot of floating-point operations.

In conclusion, for this dataset, BBMerge had substantially better specificity at similar sensitivity to all competitors. fastq-join had the highest merge rate, by a slight margin, but at the cost of a false positive rate that I would consider unusable in a production environment. The speeds of BBMerge and FLASH are similar per thread, with FLASH being around 10% higher; but BBMerge appears to be more scalable, and as such exceeded 150% of FLASH's peak throughput. So, for this dataset, unless false positives are irrelevant, BBMerge wins.

It appears imprudent to use FLASH (or most other mergers) at the default sensitivity, due to the huge increase in false positives for virtually no gain in true positives. On the ROC curve, FLASH's rightmost point is the default "-x 0.25", while the third from the right is "-x 0.15", which gives a much better signal-to-noise ratio (SNR) of 33.3dB and 71.1% merged, compared to the default 28.8dB and 73.8% merged. If you desire the highest merge rate regardless of false positives, fastq-join appears optimal. For libraries that have all N-containing reads removed, it may be that COPE is slightly better than FLASH, though I have not verified this. If false positives are detrimental to your project, I recommend BBMerge at default sensitivity, which gave a 42.7dB SNR and a 55.8% merge rate for this data. The merge rate can be increased using the "loose" or "vloose" flags, at the expense of more false positives, but I believe that for most projects, minimizing noise is more important than maximizing signal.
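To clarify the units: I am taking the SNR figures to be the ratio of correctly to incorrectly merged reads, expressed in decibels. A minimal sketch, with made-up counts for illustration:

```python
import math

def merge_snr_db(correct, incorrect):
    """Correct-to-incorrect merge ratio expressed in decibels."""
    return 10 * math.log10(correct / incorrect)

# e.g. 1,000 correct merges for every incorrect one corresponds to 30 dB;
# each additional 10 dB is another factor of 10 in the ratio.
```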

BBMerge supports interleaved reads, ASCII-33 or 64, gzip, fasta, scarf, and various other file formats. It also makes handy insert-size distribution histograms. Everything in the BBTools package is compatible with any OS that supports Java - Linux, Windows, Mac, even cell phones. The shellscripts may only work in Linux, but the actual programs will run anywhere. For more details, run the shellscript with no arguments, or post a question here.

I have updated the speed results by moving the input and output files to a ramdisk to eliminate I/O bottlenecks, and finishing all datapoints for PEAR. Results attached. Also, if anyone has a different program they use for read merging that they want added to the roundup, please post it here before my test data is purged from /scratch in ~6 weeks.

Quote:

Originally Posted by GenoMax

Brian: Have you put together a single help document for all the programs in BBMap? I am not sure if you have done that.

I am losing track of all the programs that are now part of the BBMap suite.

I am too... I definitely need to put together such a document! Hopefully soon. I have an internal document describing them but it's not totally relevant outside our cluster.

I've been using it over the last couple of weeks and it has been really quick. I'm actually using it to run adapter trimming (with small inserts, all of your reads should overlap), so adding some options for adapter matching would be handy (although I know you already have a trimming script in the BBTools suite).

You may want to add COPE to your tests since it is published. Also, there are a number of other unpublished (I think) programs on this blog post, but I don't think it's necessary to test every script out there. I mention COPE because it works with Fasta, whereas the other tools require Fastq, so that is a major convenience that this tool provides.

There are a couple of caveats with COPE that I can share. 1) The paired reads have to be in the same order in the forward and reverse files. The program will run just fine if they are not, but the number of reads merged will be very low (this would be a nice thing to check for). 2) There is an option for setting the overlap mode and this has to be given or the program will fail silently ("-m 0" is the simple overlap mode).
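Caveat 1 is easy to guard against with a quick spot-check of the read names before running COPE. A rough sketch (the name parsing here is a simplifying assumption – it handles the common /1 and /2 suffix convention, nothing more):

```python
from itertools import islice, zip_longest

def read_names(fastq_path):
    """Yield the read name (first whitespace-separated token, minus any
    trailing /1 or /2) from every 4-line FASTQ record."""
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:
                name = line.rstrip().split()[0]
                if name.endswith("/1") or name.endswith("/2"):
                    name = name[:-2]
                yield name

def pairs_in_sync(fq1, fq2, check=1000):
    """Spot-check the first `check` records: mates must share a name."""
    for n1, n2 in islice(zip_longest(read_names(fq1), read_names(fq2)), check):
        if n1 != n2:
            return False
    return True
```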

Quote:

Originally Posted by jimmybee

I've been using it over the last couple of weeks and it has been really quick. I'm actually using it to run adapter trimming (with small inserts, all of your reads should overlap), so adding some options for adapter matching would be handy (although I know you already have a trimming script in the BBTools suite).

jimmybee,

FYI, the latest version of BBMerge by default rejects insert sizes shorter than 35 and does not look for insert sizes shorter than 25, so if you have super-short insert libraries, you might want to reduce those with these flags:
mininsert=12 minoi=12

"mininsert" will reject anything shorter than that value as "too short", while "minoi" (minoverlapinsert) will tell the program to not look for inserts shorter than that. The difference is subtle. Sorry, the default values listed for them in the shellscript are wrong; I will fix that with the next release. Of course for most purposes you don't ever see insert sizes that short but you can in certain experiments involving tiny RNAs.
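To illustrate the distinction (illustrative logic only, not the actual BBMerge source): "minoi" bounds the overlap search itself, while "mininsert" classifies whatever the search finds.

```python
def classify_insert(found_insert, mininsert=35, minoi=25):
    """found_insert is the best insert size the overlap search detected,
    or None. The search itself never proposes an insert below minoi."""
    if found_insert is None:
        return "no overlap found"
    if found_insert < minoi:
        return "never proposed"   # excluded from the search space entirely
    if found_insert < mininsert:
        return "rejected: too short"
    return "merged"
```

So with the defaults, an apparent insert of 30 is found but rejected as "too short", while an insert of 20 is never even considered.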

I have no plans to explicitly add adapter matching to BBMerge, but it sounds like it may be useful to add a mode that trims reads with insert sizes shorter than read length rather than merging them. Also, it would be fairly easy to add overlap-detection to BBDuk for greater adapter-trimming sensitivity. Hmmm, I'll probably do both.

I know that solution doesn't logically follow from the error message, but from what I remember, you will get either no messages on failure with this program, or nondescript messages. If that doesn't work, I'd check the input again, and then try without the "-s 33" as a last suggestion. It could be the input, but I'm guessing it's just the combination of options, or lack thereof, that is causing the issue.

Thanks SES, that solved it. I guess all outputs must be specified. I've attached the new graphs and updated the inline images in the first post.

COPE performed much better when I set the flag "-N 0", because 5% of my synthetic reads contain an N; but it was unstable with that flag, and seg-faulted at any value of -c under 0.92. I included both settings, "-N 1" (the default) and "-N 0". Overall it was very close to FLASH when I swept the -c parameter (match ratio cutoff) from 1.0 down to 0.75 (the default) - slightly better with "-N 0" (but unstable) and slightly worse with "-N 1" (but stable). There are also -d and -B parameters that probably impact the merge rate, which I left at the defaults, so as with most of the other programs, it's still possible that better results could be achieved. Sample command:

I also added fastq-join (from ea-utils), which was generally outperformed by FLASH. I swept through the -p (from 0 to 25) and -m (from 4 to 12) parameters; only the -p sweep is shown, as it gave better results. Sample command:

fastq-join r1.fq r2.fq -o fqj -p 0

COPE and fastq-join were similar in speed - both were much faster than PEAR and slightly slower than BBMerge or FLASH, when the multithreaded programs were limited to a single thread. fastq-join is slightly faster than COPE.

Is there a list of parameters that I can use? For example, insert size and CPU threads. I tried to assemble a set of ~24 million paired-end reads and found that only 44.59% were assembled. Is there a way to increase that (e.g. set loose)? Also, the input reads were 100bp in length, yet the 10th percentile of insert size was 74. Why is that? And how can the insert size range as high as 188?

If you run the shellscript bbmerge.sh with no parameters, it will display the help information. To answer your specific questions:

By default it uses all available threads, but you can override that with the "t=" flag. You can get a higher merge rate at the expense of more false positives with the "loose" or "vloose" (very loose) flag. It's also possible to more finely tune it with other flags like "minoverlap", "margin", "entropy", and "mismatches", but those are more complicated to tweak. You can use the "mininsert=" flag to ban merges with an insert size shorter than that value.

2x100bp reads can have a 74bp insert or a 188bp insert size. 74bp insert means that the molecule being sequenced was shorter than read length, and as a result the data collection continued off the end of the genomic sequence and into the adapter. So, before merging, the reads each contained 26bp of junk. And a 188bp insert size means that the reads overlapped by 12 base pairs. BBMerge does not look for overlaps shorter than 12bp in the default mode; the shorter the overlap, the more likely that it's a false positive.
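The arithmetic above, written out (assuming both reads are the same length):

```python
def pair_geometry(read_len, insert):
    """For a pair of equal-length reads, return (overlap, adapter):
    overlap = genomic bases covered by both reads,
    adapter = junk bases read past the end of the insert, per read."""
    overlap = max(0, min(insert, 2 * read_len - insert))
    adapter = max(0, read_len - insert)
    return overlap, adapter

# 2x100bp, 74bp insert:  reads cover the whole 74bp insert and each
#                        carries 26bp of adapter junk -> (74, 26)
# 2x100bp, 188bp insert: a 12bp overlap, no adapter -> (12, 0)
```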

The higher the error rate of reads, the fewer successful joins there will be. If you have sufficient coverage (at least 30x average) you can try error-correcting the reads first with my error correction tool; that should increase the merge rate.

Quote:

Originally Posted by jimmybee

I've been using it over the last couple of weeks and it has been really quick. I'm actually using it to run adapter trimming (with small inserts, all of your reads should overlap), so adding some options for adapter matching would be handy (although I know you already have a trimming script in the BBTools suite).

jimmybee,

I added some relevant flags and modes, in version 32.32. Thanks for the suggestions!

First, BBMerge now has a new flag, "tbo" (trimbyoverlap). Rather than joining reads, it will just trim the adapter sequence off the ends of reads that have an apparent insert size shorter than read length.

Second, I added the overlap capability to BBDuk as well, with the same flag, "tbo". So if you are trimming adapters from fragment libraries with the normal read orientation, you can trim both by specifying an adapter sequence file and by overlap at the same time, which drastically reduces the false negative rate (reads that did not have adapters completely removed) - by around a factor of 12. I highly recommend using both for fragment libraries, as the combination outperforms either one alone.
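The "tbo" idea can be sketched in a few lines (illustrative Python, not the actual implementation): if the detected insert size is shorter than the read length, everything past the insert is adapter, so trim both reads back to the insert length instead of merging them.

```python
def trim_by_overlap(r1, r2, insert):
    """r1/r2 are raw read sequences; insert is the insert size detected by
    overlap, or None if no overlap was found. Returns the trimmed pair."""
    read_len = len(r1)
    if insert is not None and insert < read_len:
        # Reads run past the end of the molecule into adapter: keep only
        # the first `insert` bases of each read.
        return r1[:insert], r2[:insert]
    return r1, r2
```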


-Brian

Thanks for the reply. It turns out that I confused your "insert size" with the "inner distance" of paired-end reads. Since the ~25 million reads were exome-seq data, a 44.594% assembly rate is a bit lower than I expected. If the merging were perfect, it would mean 50% of the fragments were actually longer than 188bp. I hope you can improve merging for fragments whose "insert size" is longer than 188 in the future.

Quote:

Originally Posted by Woody

Thanks for the reply. It turns out that I confused your "insert size" with the "inner distance" of paired-end reads. Since the ~25 million reads were exome-seq data, a 44.594% assembly rate is a bit lower than I expected. If the merging were perfect, it would mean 50% of the fragments were actually longer than 188bp. I hope you can improve merging for fragments whose "insert size" is longer than 188 in the future.

Woody,

Did you try error-correcting and using the 'vloose' setting? Both of those can substantially improve the merge rate.

If you want to see the actual, unbiased insert-size distribution of your data, you can generate a histogram by mapping: