Introduction to Shannon entropy

Motivated by what I perceive to be a deep misunderstanding of the concept of entropy, I have decided to take us on a journey into the world of entropy. Recent confusion about how to calculate entropy for mutating genes will be addressed in some detail.

I will start by referencing Shannon’s seminal paper on entropy and then slowly expand the discussion to include the formulas relevant for calculating the entropy in the genome.

Suppose we have a set of possible events with probabilities $p_1, p_2, \ldots, p_n$. The entropy for such a situation is defined by:

$$H = -\sum_{i=1}^{n} p_i \log p_i$$

I will apply this formula to a simplified example in which there are two events with probabilities p and q, where q = 1 - p since the probabilities sum to 1. The entropy in the case of two possibilities with probabilities p and q = 1 - p is given by:

$$H = -(p \log p + q \log q)$$

Figure 1: Entropy for a system with two states. The horizontal axis describes the probability p and the vertical axis the entropy as calculated using the formula above
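The curve in Figure 1 can be reproduced numerically. The following is a minimal sketch, assuming a base-2 logarithm (so that entropy is measured in bits) and the usual convention that 0 log 0 = 0; the function name `binary_entropy` is my own and not taken from Shannon's paper.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-state system with probabilities p and q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if p in (0.0, 1.0):
        return 0.0  # assumed convention: 0 * log(0) is taken to be 0
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

# Sample a few points along the horizontal axis of the figure.
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:.2f}  H = {binary_entropy(p):.4f} bits")
```

The printed values rise from zero, peak in the middle of the range, and fall back to zero symmetrically.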

Some important observations:

1. Entropy is zero when p=0 or p=1

2. Entropy is maximum when p=q=0.5

Observation 1 can be expressed more generally: if the probability for any one of the variables is 1, the entropy is minimal (zero).

Observation 2 can be made more universal: the entropy for n variables is maximum when $p_i = 1/n$ for all i, or in other words, for a uniform distribution.
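Both generalized observations can be checked numerically. This is a rough sketch, again assuming base-2 logarithms and the convention 0 log 0 = 0; the helper `shannon_entropy` and the example distributions are hypothetical illustrations, not data from any of the papers discussed here.

```python
import math

def shannon_entropy(probs, base=2.0):
    """H = -sum(p_i * log(p_i)); terms with p_i = 0 are skipped (0 log 0 = 0)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(-p * math.log(p, base) for p in probs if p > 0)

n = 4
certain = [1.0, 0.0, 0.0, 0.0]   # one outcome has probability 1
skewed = [0.7, 0.1, 0.1, 0.1]    # somewhere in between
uniform = [1.0 / n] * n          # p_i = 1/n for all i

print("certain:", shannon_entropy(certain))
print("skewed: ", shannon_entropy(skewed))
print("uniform:", shannon_entropy(uniform), "vs log2(n) =", math.log2(n))
```

For n equally likely outcomes the entropy equals log n in the chosen base, which is the largest value any distribution over n outcomes can reach.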

Extending Shannon entropy to the genome

Various people have taken the work by Shannon and applied it, quite successfully, to the genome.

Now we can extend this using Adami’s 2002 paper: Adami C. (2002) “What is complexity?”, BioEssays 24, 1085-1094.

The entropy of an ensemble of sequences X, in which sequences $s_i$ occur with probabilities $p_i$, is calculated as

$$H(X) = -\sum_i p_i \log p_i$$

where the sum goes over all different genotypes i in X. By setting the base of the logarithm to the size of the monomer alphabet, the maximal entropy (in the absence of selection) is given by $H_{\max} = L$, and in fact corresponds to the maximal information that can potentially be stored in a sequence of length L. The amount of information a population X stores about the environment E is now given by:

$$I(X:E) = H_{\max} - H(X) = L - H(X)$$

The entropy of an ensemble of sequences is estimated by summing up the entropy at every site along the sequence. The per-site entropy is given by:

$$H(j) = -\sum_{i \in \{A,C,G,T\}} p_i(j) \log p_i(j)$$

for site j, where $p_i(j)$ denotes the probability of finding nucleotide i at position j. The entropy is now approximated by the sum of per-site entropies:

$$H(X) \approx \sum_{j=1}^{L} H(j)$$

so that an approximation for the physical complexity of a population of sequences with length L is obtained by

$$C = L - \sum_{j=1}^{L} H(j)$$

From this it should be obvious that $H(j)$ is zero when a particular mutation becomes fixed in the genome, since one of the probabilities will be 1 and the others zero.
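The per-site calculation can be sketched in a few lines of code. The example below assumes that the probabilities at each site are estimated simply as observed nucleotide frequencies in an aligned sample (a plug-in estimate that ignores finite-sample bias) and that logarithms are taken base 4, so that each site contributes at most 1. The function names and the toy population are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # monomer alphabet; its size sets the base of the logarithm

def per_site_entropy(column: str) -> float:
    """Entropy of one alignment column, using observed frequencies as probabilities."""
    total = len(column)
    h = 0.0
    for count in Counter(column).values():
        p = count / total
        h -= p * math.log(p, len(ALPHABET))
    return h

def physical_complexity(sequences) -> float:
    """Approximate complexity: sequence length minus the sum of per-site entropies."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    columns = ("".join(s[j] for s in sequences) for j in range(length))
    return length - sum(per_site_entropy(col) for col in columns)

# Toy population: sites 0 and 1 are fixed, site 2 is partly variable,
# and site 3 shows every nucleotide equally often.
population = ["ACGT", "ACGA", "ACCG", "ACTC"]
for j in range(len(population[0])):
    column = "".join(s[j] for s in population)
    print(f"site {j}: column {column}  entropy = {per_site_entropy(column):.3f}")
print("approximate physical complexity:", physical_complexity(population))
```

The fixed sites contribute zero entropy and therefore count fully toward the complexity, while the uniformly variable site contributes nothing to it.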

For those who wonder:

Schneider has some interesting results which show what happens to the entropy in the genome with and without selection. Not surprisingly, without selection the information remains roughly constant around zero.

Figure 2 (from Schneider’s talk): Notice how the information in the genome increases under selection but disappears once selective constraints are removed. Once selection is added, the information quickly approaches the theoretical values.

From Jerry’s to Shannon

Jerry’s entropy can be related to Shannon’s using the following transformations:

For the human genome we look at a particular location and observe over many samples and we find , , and , we calculate the total number of states to be:

Some relevant papers

Abstract: If DNA were a random string over its alphabet {A,C,G,T}, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than five-fold (1.66) and may lead to better performance in microbiological pattern recognition applications.

One of our main contributions, and the principle source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM).

Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches are implemented and experimentally compared.

We also report lower entropy estimates for coding regions extracted from a large collection of non-redundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: i) predict the next amino acid based on inexact polypeptide matches, and ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of non-redundant coding sequences.