BBNorm is a kmer-based normalization tool for NGS reads, which may also be used for error-correction and generating kmer-frequency plots. It is extremely fast and memory-efficient due to the use of atomic counters and probabilistic data structures.

First, what is normalization? Many assemblers perform poorly in the presence of too much data, and data with irregular coverage, such as MDA-amplified single cells or metagenomes. And even if an assembler performs well with these datasets (Spades, for example, does a very good job with single cells, though still benefits from normalization in my tests), more data will increase the runtime and memory usage, potentially to the point that the data cannot be assembled.

Subsampling randomly discards some percentage of the reads, to reduce the average coverage, and is computationally cheap (you can do that quickly with reformat). However, if you have one area with 10000x coverage and another area with 10x coverage, subsampling to 1% will reduce the 10000x area to 100x - a good level for some assemblers - but the 10x area to 0.1x, which cannot be assembled. So with irregular coverage, it is not ideal!

Normalization discards reads with a probability based on the local coverage. So, for example, if you normalize to a target 40x coverage, reads in regions with 10x coverage will all be retained; in an 80x region they will be discarded with a 0.5 probability; and in a 10000x area they will be discarded with a 0.004 probability. As a result, after normalization, areas with coverage above the target will be reduced to the target, and areas below the target will be left untouched. This generally makes the data much easier to assemble and typically increases continuity (L50) by a substantial amount.

To normalize to 40x coverage with BBNorm, and discard reads with an apparent depth under 2x (which typically indicates the reads have errors):

bbnorm.sh in=reads.fq out=normalized.fq target=40 mindepth=2

Error-correction is also useful in many cases when you have sufficient depth and wish to avoid false-positive variant calls, or achieve a higher mapping rate, or want to merge paired reads via overlap, or assemble high error-rate data with an assembler that is not very tolerant of errors, or reduce the memory usage of a DeBruijn assembler. Like normalization, error-correction is not universally a good idea - JGI does not normalize and error-correct all data prior to use, for example - but it is highly beneficial in many situations. Also, BBNorm only corrects substitution errors, not indels, since that is the error mode that occurs in Illumina data. In other words, it will NOT error-correct PacBio or 454 data, which feature indel errors. BBNorm is currently being used for error-correction by the OLC assembler Omega.

To error-correct reads with BBNorm:

ecc.sh in=reads.fq out=corrected.fq

To error-correct and normalize at the same time, just add the flag "ecc" when running BBNorm.

Lastly, BBNorm can be used for producing kmer frequency histograms, and binning reads based on coverage depth. The histograms are useful to determine things like the ploidy of an organism, the genome size and coverage, the heterozygousity rate, and the presence of contaminants (which typically have drastically different coverage than the genome of interest).

To generate a kmer frequency histogram:

khist.sh in=reads.fq hist=histogram.txt

You can also use the "hist" flag during normalization or error-correction, for the histogram of input reads, for free; "histout" will generate the histogram of output reads, but cost an additional pass.

So, what is the difference between bbnorm.sh, ecc.sh, and khist.sh? They all call the same program, just with different default parameters (you can see the exact parameters by looking at the bottom of the shellscripts). If you specify all parameters manually, they are equivalent.

How does BBNorm work, and why is it better than other tools?

BBNorm counts kmers; by default, 31-mers. It reads the input once to count them. Then it reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with probability based on the ratio of the desired coverage to the median of the counts of a read's kmers. For error-correction, situations where there are adjacent kmers in a read with drastically different frequencies - for example, differing by a factor of 180 - are detected; the offending base is altered to a different base, if doing so will restore the kmer to a similar frequency as its adjacent kmers.

BBNorm is memory-efficient because it does not explicitly store kmers - everything is in a probabilistic data structure called a count-min sketch. As a result, BBNorm will never run out of memory, slow down, or use disk, no matter how much data you have or how big the genome is. Rather, the accuracy will decline as the table's loading increases - but because kmers are not explicitly stored, it can store several times more than an explicit data structure (such as Google Sparse Hash). And for normalization, the reduction in accuracy at extremely high loading does not matter, because the median is used - so even if multiple kmers within a read have an incorrectly high count, they will not even be considered, and thus the results will not be affected at all. As a result - in practice, you should use all available memory even for a tiny genome with a small number of reads; but even for a huge genome with very high coverage, BBNorm will still work, and produce good results quickly on a computer with limited memory.

Speedwise, BBNorm is multithreaded in all stages, using atomic counters which do not require locking - this allows it to scale efficiently with processor core counts.

BBTools has another program functionally related to BBNorm, "kmercountexact.sh". It does NOT use probabilistic data structures, and uses locking rather than atomic counters, and as a result may not scale as well, and will run out of memory on large datasets. However, it is still extremely fast and memory-efficient - using ~15 bytes per kmer (with an optional count-min-sketch prefilter to remove low-count error kmers). It cannot normalize or error-correct, but it can generate the exact kmer count of a dataset as well as the exact kmer frequency histogram (and do rudimentary peak calling for genome size estimation). In practice, when I am interested in kmer frequency histograms, I use KmerCountExact for isolates, and BBNorm for metagenomes.

How does BBNorm compare to normalize_by_median from the khmer package? The implementation (apart from language and possibly better usage of processor cors) sounds very similar.

The implementation is a bit different in a couple of respects. Normalization can preferentially retain reads with errors, since they have a low apparent coverage; as a result, normalized data - particularly from single-cells - will often have a much higher error rate than the original data, even if low-depth reads are discarded. BBNorm, by default, uses 2-pass normalization which allows it - if there is sufficient initial depth - to preferentially discard low-quality reads, and still hit the target depth with a very narrow peak. So, if you look at the post-normalization kmer frequency histogram, BBNorm's output will have substantially fewer error kmers and a substantially narrower peak. This can be confirmed by mapping; the error rate in the resulting data is much lower.

I'm working on publishing BBNorm, which will have comparative benchmarks versus other normalization tools, but in my informal testing it's way faster and yields better assemblies than the two other normalizers I have tested. The specific way that the decision is made on whether or not to discard a read has a huge impact on the end results, as does the way in which pairs are handled, and exactly how kmers are counted, and how a kmer's frequency is used to estimate a read's depth. BBNorm has various heuristics in place for each of these that substantially improved assemblies compared to leaving the heuristic disabled; my earlier description of discarding a read or not based on the median frequency of the read's kmers is actually a gross oversimplification. Also, using error-correction in conjunction with normalization leads to different results, as it can make it easier to correctly determine the depth of a read.

I guess I would say the the theory is similar, but the implementation is probably very different than other normalizers.

I'm trying to do some normalization but I want to set my target coverage to 10X rather than 40X. Is there any way to change that in BBNorm? I tried target=10, but it still says 40X on the run description.

By default, BBNorm will run 2 passes. The first pass will normalize to some depth higher than the ultimate desired depth, and the second pass will normalize to the target depth. This allows, in the first pass, preferential discarding of reads that are low quality. So the result from the second pass should still be a target of 10x.

You can instead set "passes=1" which will aim for the target on the first pass and not do a second pass. This is slightly faster but will typically yield data with more errors. Neither is universally better, though.

If you are going to target a depth of 10x, it's important to also reduce "mindepth" - by default it is 6, which is appropriate for 40x but not for 10x. Probably 2 would be better. Everything with apparent depth below that gets discarded.

Yes, it can. BBNorm will (by default, it can be changed) discard pairs based on the depth of the lower mate, so if read 1 has high coverage and read 2 has low coverage, the pair will not be discarded. If both are high depth, they will be discarded.

1) This is kind of tricky. Typically, though, I would recommend normalizing data independently if it is different (such as different insert size) since it has a different use, and you don't want it all mixed together anyway. If it is the same type - for example, 2x150bp reads with short inserts - then I would normalize it all together regardless of whether it came from a different platform or library, because it will all be used the same way.

2) I recommend pre-processing (adapter trimming, contaminant removal, quality-trimming or filtering) prior to normalization, because those processes all remove spurious kmers that make it harder to determine read depth, and thus improve the normalization results.

3) If you target 40x coverage for a 1.2Gbp genome, BBNorm should output approximately 20*1.2Gbp of data. Normally it will go a little bit over to try to ensure everywhere has at least 40x.

4) Normalizing RNA-seq data can certainly be done prior to assembly. But if you have 2 isoforms of a gene - one that uses exons 1, 2, and 3, and one that only uses exons 1 and 3, and one of them is expressed 100x more highly than the other, then after normalization, the less-expressed isoform may not get assembled, only the more abundant one. So there are definite disadvantages. But, it's worth trying if you get a bad assembly from the raw data.

I am wondering what depth you need and what depth is optimal (if the two differ) for proper read error correcting using BBMap or any other error correcting program. The Quake website mentioned >15x coverage but a quick round of Googling hasn't given me much more than that.

The reason I ask is because I have a couple lanes of MiSeq data (600 cycle PE sequencing), which individually total to about 3x coverage of my genome each. Therefore, a kmer based error correction wouldn't work well, even if I were to concatenate the two together. We do have an additional HiSeq lane (100bp PE) and a few GAII lanes (so 50-60x coverage total), so we have the option of concatenating all of the datasets together (though one GAII lane isn't paired-end). However, then we would have the separate the individual lanes back out, since we next plan to merge the MiSeq reads to create longer, 454-like reads.

Therefore, my second question is about what workflow would be best to accomplish this task? Are there some settings in ecc.sh or the like that would allow decent error correction with low coverage? Or alternatively, is there an easy way of separating data from different lanes if we were to concatenate a bunch together to give the coverage necessary to confidently correct? Thanks in advance for the help.

At default settings, BBNorm will correct using kmers that occur a minimum of 22 times (echighthresh flag). Whether 50x is sufficient to do a good job depends on things like the ploidy of the organism, how heterozygous it is, and the uniformity of your coverage. 50x of fairly uniform coverage is plenty for a haploid organism, or a diploid organism like human with a low het rate, but not for a wild tetraploid plant. You can always reduce the setting, of course, but I would not put it below ~16x typically. You can't get good error-correction with 3x depth nor matter what you do. Bear in mind that the longer the kmer is compared to read length, the lower the kmer depth will be compared to read depth.

To deal with multiple different data sources, you can run BBNorm with the "extra" flag to use additional data to build kmer frequency tables but not as output, like this:

Then cat all the files together, and error-correct them:ecc.sh in=combined.fq out=ecc.fq ordered int=f

Then demultiplex:demuxbyname.sh in=ecc.fq out=demuxed_%.fq names=miseq,gaII_1 int=f
(note the % symbol; it will be replaced by a name)

That will keep all the read order the same. So, if all the reads are initially either single-ended or interleaved (i.e. one file per library) pairs will be kept together, and you can de-interleave them afterward if you want. You can convert between 2-file and interleaved with reformat.sh.

Is it possible to use BBNorm to normalize filtered subreads from PacBio sequencing?

BBNorm seems to work well for my Illumina MiSeq data.
But, when I tried it for PacBio filtered subreads (fastq format), the resulting output file was empty.

Actually, I'm working on PacBio sequencing data from MDA-amplified single cell genomes. When I assembled the data using HGAP (SMRT Analysis), the results were not good (too many rather short contigs). Some of my colleagues told me that uneven coverage might be the cause of bad assembly. So, I've been trying to normalize raw data.

BBNorm cannot be used for raw PacBio data, as the error rate is too high; it is throwing everything away as the apparent depth is always 1, since all the kmers are unique. Normalization uses 31-mers (by default) which requires that, on average, the error rate is below around 1/40 (so that a large number of kmers will be error-free). However, raw PacBio data has on average a 1/7 error rate, so most programs that use long kmers will not work on it at all. BBMap, on the other hand, uses short kmers (typically 9-13) and it can process PacBio data, but does not do normalization - a longer kmer is needed.

PacBio CCS or "Reads of Insert" that are self-corrected, with multiple passes to drop the error rate below 3% or so, could be normalized by BBNorm. So, if you intentionally fragment your reads to around 3kbp or less, but run long movies, then self-correct, normalization should work.

PacBio data has a very flat coverage distribution, which is great, and means that typically it does not need normalization. But MDA'd single cells have highly variable coverage regardless of the platform, and approaches like HGAP to correct by consensus of multiple reads covering the same locus will not work anywhere that has very low coverage. I think your best bet is really to shear to a smaller fragment size, self-correct to generate "Reads of Insert", and use those to assemble. I doubt normalization will give you a better assembly with error-corrected single-cell PacBio data, but if it did, you would have to use custom parameters to not throw away low-coverage data (namely, "mindepth=0 lowthresh=0"), since a lot of the single-cell contigs have very low coverage. BBNorm (and, I imagine, all other normalizers) have defaults set for Illumina reads.

Hi Brian
I am trying to assemble a genome from single cell amplification using spades and subsequently using tetramer + coverage analysis (i.e. CONCOCT) to remove "contaminating reads" which seem to be created at the sequencing. If I understood correctly bbnorm will keep enough information to allow coverage information to be used subsequently to bin contigs. Is that the case?

"High-depth k-mers,presumably derived from MDA amplification bias, cause problems in the assembly, especially if the k-mer depth varies in orders of magnitude for different regions of the genome. We removed reads representing high-abundance k-mers (>64x k-mer depth) and
trimmed reads that contain unique k-mers. The results of the k-mer based coverage normalization are shown in table S8."

Hi suzumar, if you apply any kind of coverage normalization (whether with khmer or bbnorm) prior to assembly, you won't be able to use those reads to calculate the coverage of the contigs. However, you can go back and use the unnormalized reads to calculate the contig coverage.

Re the science paper, I don't know what they actually did; if you can find the software and/or detailed instructions in their methods, I'd be happy to comment.