A key element in evaluating the quality of a pairwise sequence alignment
is the "substitution matrix", which assigns a score for aligning any possible
pair of residues. The theory of amino acid substitution matrices is described
in [1], and applied to DNA sequence comparison in [2]. In general, different
substitution matrices are tailored to detecting similarities among sequences
that have diverged to differing degrees [1-3]. A single matrix may nevertheless
be reasonably efficient over a relatively broad range of evolutionary change
[1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the
best for detecting most weak protein similarities. For particularly long
and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed
statistical theory for gapped alignments has not been developed, and the best
gap costs to use with a given substitution matrix are determined empirically.

Short alignments need to be relatively strong (i.e. have a higher percentage
of matching residues) to rise above background noise. Such short but strong
alignments are more easily detected using a matrix with a higher "relative
entropy" [1] than that of BLOSUM-62. In particular, short query sequences
can only produce short alignments, and therefore database searches with
short queries should use an appropriately tailored matrix. The BLOSUM series
does not include any matrices with relative entropies suitable for the shortest
queries, so the older PAM matrices [5,6] may be used instead. For proteins,
a provisional table of recommended substitution matrices and gap costs for
various query lengths is:

The raw score of an alignment is the sum of the scores for aligning pairs of
residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap
costs" which charge the score -a for the existence of a gap, and the score -b
for each residue in the gap. Thus a gap of k residues receives a total score
of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
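
The affine gap scoring described above can be sketched in a few lines of
Python. This is a minimal illustration, not the BLAST implementation: the
function names, the use of '-' to mark gap positions, and the toy pair scores
(+5 for a match, -4 for a mismatch) are all assumptions made for the example,
as are the gap costs a=11 and b=1 used in the usage note.

```python
def gap_score(k, a, b):
    """Score of one gap of k residues under affine costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(a + b * k)


def raw_score(row1, row2, pair_score, a, b):
    """Raw score of a gapped alignment given as two equal-length rows.

    '-' marks a gap position (an assumed convention). A run of '-' within
    one row counts as a single gap of that length.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_row = None  # row holding the currently open gap: 1, 2, or None
    for x, y in zip(row1, row2):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            this_row = 1 if x == "-" else 2
            if gap_row != this_row:
                total -= a  # existence cost, charged once per gap
                gap_row = this_row
            total -= b      # per-residue cost, charged for every gap column
        else:
            total += pair_score(x, y)
            gap_row = None  # an aligned pair closes any open gap
    return total


def toy_score(x, y):
    """Hypothetical pair scores standing in for a real substitution matrix."""
    return 5 if x == y else -4
```

With these assumed values, raw_score("ACDEFG", "AC--FG", toy_score, a=11, b=1)
gives 4*5 - (11 + 2*1) = 7: four matching pairs plus one gap of length 2.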

To convert a raw score S into a normalized score S' expressed in bits,
one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are
parameters dependent upon the scoring system (substitution matrix and gap
costs) employed [7-9]. For determining S', the more important of these
parameters is lambda. The "lambda ratio" quoted here is the ratio of the
lambda for the given scoring system to that for one using the same substitution
scores, but with infinite gap costs [8]. This ratio indicates what proportion
of information in an ungapped alignment must be sacrificed in the hope of
improving its score through extension using gaps. We have found empirically
that the most effective gap costs tend to be those with lambda ratios in the
range 0.8 to 0.9.
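
As a rough sketch of the two calculations above, the bit score formula and the
lambda ratio can be written as follows. The function names are hypothetical,
and the numeric lambda and K values in the usage lines are illustrative
assumptions of the kind reported for a protein scoring system; they are not
taken from this text and should be replaced with the parameters of the scoring
system actually in use.

```python
import math


def bit_score(raw, lam, k):
    """Normalized score in bits: S' = (lambda*S - ln K) / ln 2."""
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw - math.log(k)) / math.log(2)


def lambda_ratio(lam_gapped, lam_ungapped):
    """Ratio of the gapped lambda to the lambda for infinite gap costs."""
    if lam_ungapped <= 0:
        raise ValueError("ungapped lambda must be positive")
    return lam_gapped / lam_ungapped


# Illustrative parameter values (assumptions, not from the text above).
LAMBDA_GAPPED = 0.267
K_GAPPED = 0.041
LAMBDA_UNGAPPED = 0.318

ratio = lambda_ratio(LAMBDA_GAPPED, LAMBDA_UNGAPPED)
in_recommended_range = 0.8 <= ratio <= 0.9
example_bits = bit_score(100, LAMBDA_GAPPED, K_GAPPED)
```

Because lambda multiplies the raw score while K enters only through its
logarithm, a change in lambda shifts S' far more than a comparable change in K
for all but the lowest-scoring alignments.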