Monday, June 25, 2012

As biologists most of us feel little bit awkward with mathematics and statistics, and more so with the application of mathematics and statistics in interpreting biological data. Recently we see a meteoric rise in availability of biological sequences. However, interpreting them properly still is a bottleneck.
HMMs are there for a long time now, very valuable in speech recognition, weather prediction etc. But recently, we find a surge in application of HMMs in interpreting biological sequences such as gene prediction, profile searches, domain prediction etc. A beginner however gets awed by the complexity and finds it difficult to understand completely how this amazing method/tool can help interpret sequences. Here is my attempt to simplify things a bit.

What is HMM?

Profile HMMs are based on position specific scoring matrices. Lets take the example of the following diagram where we have multiple sequences aligned.
The following are the characteristics:
1. Columns in a multiple sequence alignment is called as a state.
2. Number of alphabets taken together is called as order of the markov chain e.g; zero order, first order and so on.
3. PSSM(Position specific scoring matrix) is calculated for calculating the match states probabilities.

In a HMM, there is a transition probability and an emission probability and there is a state. Lets look at the simplest HMM heuristic:

Suppose say we have a multiple sequence alignment, as represented in Figure - 1.

Figure 1

[ state -> is the column in a multiple sequence alignment.Transition probabilities are
determined using deletions and insertions. Emission probabilities
at each state are determined by counting the occurrence of each
character in each column. ]

So, lets try to generate the Emission and Transition probabilities from this alignment from the figure.

There are multiple algorithms that can be used to build and train HMMs.

i) EM, or Baum-Welsch, algorithm.

This algorithm uses a randomMSA, such as the
one constructed by ClustalW. The number of match states is
defined as the number of conserved columns. This algorithm uses a
two stage, iterative process; first, it uses the aligned residues to
build a model, and then it realigns the sequences to the model.
These two steps are repeated until the model no longer changes.

ii)Viterbi algorithm:

Uses dynamic programming
to calculate the best estimate of the emission probabilities.

iii)“Surgery” algorithm:

Uses dynamic
adjustments of the HMM length to calculate the best
models.

HMMER software comes with a nice user guide. I recommend you read this guide from cover to cover and if you still did not get it, then continue reading my post.

Use of HMM in aligning against another sequence or profile: After we create the HMM, we can use it to
search for sequences that are similar to the ones used to create the
HMM. So, we use the model to estimate the probability that a
sequence belongs to the model. This probability is often
expressed as negative log likelihood score and is dependent on the
length of the sequence and the length of the model. Once we have
a score, we need a way to assess its significance, just as we did for
pairwise sequence alignments.

HMMs have some distinct advantages.
They are built on a formal probabilistic basis. They also allow
for more sensitive searching than pairwise alignments do. The
probabilistic theory can be used to guide the scoring parameters; the
theory also allows us to train an HMM using unaligned sequences if we
don’t know or trust the alignment. There is a consistent
theory behind gap and insertion penalties. Finally, less skill
and human intervention is needed to train a good HMM than to train a
hand-constructed profile. Because of this, we can create
libraries of hundreds of profile HMMs and apply them on a large
scale.

Its one major drawback is that its
probabilistic basis does not capture higher-order correlations between
sequence positions. It assumes that the identity of a particular
position is independent of the identity of all of the other positions
in the sequence, and we know that not to be true.

Publicly available program HMMER3 just does this.Components of Hmmer3 Program:
phmmer: Makes a blastp like search.
jackhmmer: Iteratively searches a sequence against a sequence database (something similar to Psi-blast).hmmbuild: Builds a profile HMM from multiple sequence alignment.hmmsearch: Searches a profile hmm against a sequence database(Query: HMM, target: AA).hmmscan: Searches a sequence against a profile hmm database(Query: AA; target: HMM e.g; pfam).hmmalign: Make a multiple sequence alignment of many sequences to a existing profile.

hmmconvert: Converts profiles into hmm format and vice versa.hmmemit: Generate sample sequences from profiles.hmmfetch: Get a profile HMM by name or accession number from the HMM databasehmmpress: Format an hmm database into a binary format to be used by hmmscan.hmmstat: Show summary statistics for each profile in an hmm database.

In the next post, we will discuss about Applications of Profile search and command line options of Hmmer3 with examples.

Thursday, June 07, 2012

Do you want to incorporate the environment into your
everyday life through environmental activism, advocacy, social justice and
event planning? Interning for Terra-Scapes Environmental Consulting is a unique
opportunity for a student with a passion for the environment. Engaging the community through educational initiatives
about current environmental concerns will be your goal. The Environmental Organizer will work on
several pertinent environmental campaigns ranging from protecting our oceans
and fisheries to the prohibition of additional chicken farms on the Eastern
Shore.

The Environmental Organizer will have the opportunity
to gain experience and forge connections in the environmental and green
business communities. The position entails working with the following prominent
environmental organizations: Pew
Environment Group, Sierra Club, Chesapeake Bay Foundation, Lynnhaven River Now,
Nature Conservancy, and more.

Responsibilities:

·Research, organize, and execute logistics for environmental
workshops and events.

·Recruit and communicate with new and existing volunteers.

·Strengthen and build coalitions with new and existing groups,
scientists, students, elected officials, and businesses.

·Work with the media on press releases, advisories, letters to the
editor, and opinion editorials.