"What BLAST's max-target-sequences doesn't do" (2015 blog post, with links to follow up posts in 2018), an issue Suaji Kumar and I called a scary BLAST+ bug to do with alignment limits, which the NCBI BLAST team viewed as expected behaviour of a heuristic setting.

"we examined the new example and it became clear that the demonstrated behavior was a bug, resulting from an overly aggressive optimization, introduced in 2012 for BLASTN and MegaBLAST (DNADNA alignments). This bug has been fixed in the BLAST+ 2.8.1 release, due out in December 2018. The aberrant behavior seems to occur only in alignments with an extremely large number of gaps, which is the case in the example provided by Shah and collaborators."

So, what was this aggressive optimisation in BLASTN and MegaBLAST, and how did it combine with the internal candidate alignment limit and database order to produce counter intuitive results? It turns out to explain the details I'd not yet followed up from blog post part three.

2018-12-07

My November blog posts discussing the BLAST+ tools behaviour with an alignment limit setting (see What BLAST's max-target-sequences doesn't do, and the links from it), touched on database order, which comes into play as a tie breaker.

Well, how is the BLAST database order defined? It turns out to be the reverse of the FASTA file used with makeblastdb, or in other words: Last-in, First-out (LIFO).

This is important because one key setting which the internal limit on the number of alignments (N_i) that BLAST+ considers depends on, is if composition-based statistics (CBS) are being used. This is the default with BLASTP, but not for MEGABLAST (i.e. the blastn binary).

The key point is that requesting N=1 alignments, but otherwise the blastp tool's default settings, gives an internal limit N_i = 2*N + 50 = 52, but with the blastn tool you get an internal alignment limit N_i = 10. Evidently the BLAST+ developers were comfortable with a lower limit, so I presume there is less chance of the hit ordering changing in the final stages of the algorithm, but this emphasises why it is especially important to avoid duplicates in a nucleotide BLAST database.

I could reproduce their initial example locally. I could indeed see the database order coming into play - but so far nothing that cannot be explained by using this as a tie breaker. De-duplicating their database to make it non-redundant greatly improves things, but some of the queries still showed the (December 2015) problem where using -max_target_seqs 1 does not give the expected top alignment.

Here I focus on what might be the most quoted part of Shah et al. (2018), which is causing what I consider to be unwarranted panic:

To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.

This does not seem to be the case. If I am misreading their message, I am not alone. See for example Emma Bell's blog post, or John Walshaw's comments on my 2015 post. It is possible Shah et al. have found a separate issue, but since no test case was given, that cannot currently be verified.