I'm looking at a paper that uses several sequence logos to illustrate the consensus sequence of certain sites.

Here is the most important of the sequence logos I'm interested in:

The explanations I found about the exact meaning of this visualization were rather abstract and mathematical. I understand that the height of the letter stack is indicating in some way how strongly conserved that position is, but I have some difficulty judging the absolute values.

When are the positions considered to be significantly conserved? The paper where this data is from mentions the positions 5 and 6 as characteristic for the consensus sequence, which is not really obvious to me from this graph.

How do I interpret the absolute value on the y-axis? What does it actually mean if position 16 has around 0.6 bits of information?

I'm not so much interested in an exact mathematical treatment, but a general guide on how to read these graphs and how to interpret them in practice.

For this kind of logo you compare a number of sequences. The height on the y-axis is indeed an indicator of the frequency of a certain base. The sequence logos I know use a height of 1 for positions where only one base occurs and go lower for less conserved positions. I might have some papers about the generation of these logos in my database; I can look this up later today if you are interested.
– Chris, Jul 31 '14 at 15:36

Finally, bioinformatics questions are coming to this site. Anyway, everything is in the main paper. I will explain it in an answer soon!
– Devashish Das, Aug 1 '14 at 12:49

DNA/RNA only has 4 nucleotides, and information content is usually expressed using log2. So a fully conserved site of, say, 10 sequences contains log2((10/10)/0.25) = log2(4) = 2 bits of information. The 0.25 is the probability of a given nucleotide, and the overall probability is represented by the height of the letter.
– jello, Aug 1 '14 at 15:55

1 Answer

The starting point is an alignment for the region under investigation.

A. Basically, to get the height of the stack of letters for every position one has to calculate the degree of certainty about the residue (= degree of conservation) at this position in the sequences belonging to this class. I'll explain what it means:

The key parameter in this context is information. 1 bit of information can be understood as the amount of "knowledge" you get when receiving an answer to one yes/no question.

Let's say we have equal proportions of all four bases at a certain position in a set of DNA sequences. To "guess" a base at this position in some sequence we have to ask two binary questions. For example: "is it Pyrimidine?" If 'yes' we'd ask "is it 'T'?" Otherwise: "is it G?". So, when the frequencies are equal we can get 2 bits of information from one observation of a base at this position in a sequence.

If the frequencies are skewed, we already have some idea about the base before making an individual observation. If we already know, say, that only G and A occur at this position in a 1:1 ratio, we can just ask "is it G?", so we clearly get 1 bit. When the ratios are not even (or when the number of alternative states is not 2^n), the analogy with the questions becomes much less clear and we have to resort to the very simple formula for Shannon entropy. Intuition and a brief inspection of this formula suggest that when the ratios of the residues are biased, we always get less information than in the case of equal frequencies (2 bits in the case of DNA).
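
For reference, the Shannon entropy mentioned above can be written as follows, where $p_b$ (a symbol introduced here for illustration) is the observed frequency of base $b$ at the position, and terms with $p_b = 0$ are taken to contribute zero:

$$H = -\sum_{b \in \{A, C, G, T\}} p_b \log_2 p_b$$

With four equally frequent bases this gives $H = 2$ bits, and with only G and A at a 1:1 ratio it gives $H = 1$ bit, which matches the question-counting argument.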

Now, to estimate the "degree of certainty", we simply calculate the information based on the observed frequencies of the different residues at this position and subtract it from the theoretical maximum (again: 2 bits in the case of DNA):

certainty = maximal information possible - actual information

This value defines the height of a stack of letters at each position.

The maximum value (2 bits) is observed when we always have the same base, because the "actual information" is zero (there is no need to ask questions; we "know" the answer anyway). The minimum (0 bits) is when we have no idea about the base, i.e. all four bases are equally frequent. One bit is obtained if we have, say, just two equally frequent bases. And, for example, a value of about 0.6 bits would be observed if the same base, say A, occurred 68.5% of the time and C, G and T each occurred 10.5% of the time.
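
The values quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function name `stack_height` is made up for illustration, and it ignores the small-sample correction that logo-drawing tools typically apply when the alignment contains few sequences.

```python
import math

def stack_height(freqs, max_bits=2.0):
    """Stack height in bits: maximal information minus Shannon entropy.

    freqs    -- observed base frequencies at one position (should sum to 1)
    max_bits -- theoretical maximum, 2 bits for DNA/RNA (4 possible bases)
    """
    # Bases that never occur (p == 0) contribute nothing to the entropy.
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)
    return max_bits - entropy

print(round(stack_height([1.0, 0.0, 0.0, 0.0]), 2))          # always the same base -> 2.0
print(round(stack_height([0.5, 0.5, 0.0, 0.0]), 2))          # two equally frequent bases -> 1.0
print(round(stack_height([0.25, 0.25, 0.25, 0.25]), 2))      # all four equally frequent -> 0.0
print(round(stack_height([0.685, 0.105, 0.105, 0.105]), 2))  # the example from the text -> 0.6
```

So a stack of about 0.6 bits, as at position 16 in the question, corresponds to a clear but far from exclusive preference for one base.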

B. As you probably already know: the proportions of all alternative bases are shown as relative heights of individual letters.
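
Putting A and B together, the height of each individual letter is its frequency multiplied by the height of the whole stack. The sketch below illustrates this for a single alignment column; the function name `letter_heights` and the example column are hypothetical, and the small-sample correction used by real logo tools is again left out.

```python
import math
from collections import Counter

def letter_heights(column, alphabet="ACGT"):
    """Return the height (in bits) of each letter for one alignment column.

    column   -- string with the residue observed in each sequence at this position
    alphabet -- allowed residues; its size sets the maximum (log2(4) = 2 bits for DNA)
    """
    counts = Counter(column)
    total = sum(counts[b] for b in alphabet)
    freqs = {b: counts[b] / total for b in alphabet}
    # Shannon entropy of the observed frequencies (absent bases contribute 0).
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    stack = math.log2(len(alphabet)) - entropy  # total stack height (part A)
    # Each letter takes a share of the stack proportional to its frequency (part B).
    return {b: freqs[b] * stack for b in alphabet}

# Hypothetical column from 8 aligned sequences: 6 x A and 2 x G.
heights = letter_heights("AAAAAAGG")
for base, h in heights.items():
    print(base, round(h, 3))
```

In this example the A is three times as tall as the G, and C and T have zero height, so they would not appear in the logo at all.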