3. Sensitivity and specificity

How can we measure how good a pattern is at detecting a particular
sequence feature? There are two different, complementary measures that
can be used to describe how well a pattern performs:

Sensitivity is the fraction of the true matches that
actually are correctly predicted as matches by the pattern. If the
pattern is too stringent (to avoid garbage), then too few of the true
matches will be identified by it.

Specificity is the fraction of the sequences
predicted as matches that really are true matches. If the pattern is
too inclusive (to catch as many as possible), then a lot of garbage
will also be falsely identified by it.

We want a pattern to be both sensitive and specific, so that all true
matches are found, and nothing but true matches. But as usual, nature
will rarely allow us to find such a pattern; we must try to find a
good compromise.

A regular expression can easily be designed that has 100% sensitivity:
just use the expression that matches
anything:.* (dot, star). Everything will be
identified as matching this expression, so all true positives will
also be identified, and none will be left out (i.e. no false
negatives).

Conversely, it is easy to prepare a regexp that is 100% specific: just
use a regexp that exactly matches one of the known members of the
domain family. Only that single member will be predicted, and no
others, (i.e. no false positives).

Let us use an analogy: we have an exam (svenska:
"tentamen") to check that a class of students actually have learned
what they should about bioinformatics. The exam must of course be
designed so that those students that actually have learned their stuff
should pass, while those who haven't learned anything will fail.

If the exam is too difficult, then those who manage
to pass it will most likely really know their stuff. But there will be
many students among those who fail (for stupid reasons, like being too
stressed by the difficult questions) who really should have passed it
but didn't (false negatives). The exam is specific, but it is
not sensitive.

If the exam is too easy, then there will be many
among those who pass it who have just made wild guesses, and really
haven't learned anything. But on the other hand, of the students who
really have learned what they should, not many will fail (even the
stress-sensitive will get some answers right). The exam is
sensitive, but it is not specific.

Let us define this in a little more technical, precise terms. We
define the notions of true and false positive and negative matches in
the following way:

The pattern says

Yes, a match

No, not a match

Reality

Yes, a match

True positive; TPos

False negative; FNeg

No, not a match

False positive; FPos

True negative; TNeg

If we use these definitions, then the following holds:

Sensitivity = TPos / (TPos + FNeg)

Specificity = TPos / (TPos + FPos)

Warning: in the medical context (diagnosis), specificity is defined
differently: there it is TNeg / (FPos + TNeg).