2
BLAST: “Basic Local Alignment Search Tool”. Given a sequence, how do we find its ‘relatives’ in a database of millions of sequences? An exhaustive search using pairwise alignment is computationally infeasible. BLAST is a fast heuristic method that finds good, but not necessarily optimal, matches. Basis: good alignments should contain “words” that match each other very well (identical, or with a very high alignment score).

4
BLAST  The scoring matrix: example:BLOSUM62 The expected score of two random sequences aligned to each other is negative. High scores of alignments between two random sequences follow the Extreme Value Distribution.

5
BLAST The general idea of the original BLAST:
(1) Find matches of words in the database. This is done efficiently using a pre-processed database and a hashing technique.
(2) Extend each match to find longer local alignments.
(3) Report results exceeding a predefined threshold.
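The steps above can be sketched in a few lines. This is a toy illustration, not the real BLAST implementation: the database is pre-processed by hashing every word of length w to its positions, each query word is looked up, and each hit is extended to the right with a simple +1/−1 score and a drop-off rule.

```python
# Toy sketch of the original BLAST idea (illustrative only): hash-based word
# lookup in a pre-processed database, followed by ungapped hit extension.

def index_words(db_seq, w=3):
    """Pre-process the database: map each w-letter word to its positions."""
    index = {}
    for i in range(len(db_seq) - w + 1):
        index.setdefault(db_seq[i:i + w], []).append(i)
    return index

def extend_hit(query, db_seq, qpos, dpos, w=3, drop_off=2):
    """Ungapped extension to the right with a +1/-1 score, stopping when the
    running score drops more than drop_off below the best score seen."""
    score = best = w  # the seed word matches exactly
    i = 0
    while qpos + w + i < len(query) and dpos + w + i < len(db_seq):
        score += 1 if query[qpos + w + i] == db_seq[dpos + w + i] else -1
        best = max(best, score)
        if best - score > drop_off:
            break
        i += 1
    return best

def blast_search(query, db_seq, w=3):
    """Find seed words shared with the database and extend each one."""
    index = index_words(db_seq, w)
    return [(q, d, extend_hit(query, db_seq, q, d, w))
            for q in range(len(query) - w + 1)
            for d in index.get(query[q:q + w], [])]
```

The hash lookup is what avoids the exhaustive pairwise alignment: only positions sharing an exact word with the query are ever examined.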

14
The “two-hit” method. Two sequences with good similarity should share more than one word. Extend matches only when two hits occur on the same diagonal within a certain distance. This by itself reduces sensitivity, so it is paired with a lower per-word threshold T to compensate.
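A minimal sketch of the diagonal bookkeeping behind the two-hit rule (the function name and the distance cutoff are illustrative): a word hit at query position q and database position d lies on diagonal d − q, and extension is triggered only when a second non-overlapping hit lands on the same diagonal within `max_dist` positions.

```python
# Illustrative two-hit detection: remember the last hit seen on each diagonal,
# and trigger an extension when a second hit falls on the same diagonal
# within max_dist positions (and past the first word, so they don't overlap).

def two_hit_seeds(hits, word_len=3, max_dist=40):
    """hits: list of (q, d) word-hit coordinates, q in query, d in database.
    Returns (diagonal, d_first, d_second) for each pair that triggers."""
    by_diagonal = {}
    triggers = []
    for q, d in sorted(hits, key=lambda h: h[1]):
        diag = d - q
        prev = by_diagonal.get(diag)
        if prev is not None and word_len <= d - prev <= max_dist:
            triggers.append((diag, prev, d))
        by_diagonal[diag] = d
    return triggers
```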

15
BLAST There are many variants with different specialties. For example: Gapped BLAST allows the detection of gapped alignments. PSI-BLAST (“Position-Specific Iterated BLAST”) detects weak relationships by automatically constructing a Position-Specific Score Matrix. …

16
Discrete time finite Markov Chain Possible states: a finite discrete set S = {E_1, E_2, …, E_s}. From time t to t+1, the process makes a stochastic movement from one state to another. The Markov property: if the process is at E_j at time t, then the probability that it is at E_k at time t+1 depends only on E_j. The temporally homogeneous transition probabilities property: Prob(E_j → E_k) is independent of the time t.
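Both properties show up directly when simulating such a chain. The sketch below uses a two-state chain with illustrative transition probabilities: each move is drawn from the transition row of the current state only (Markov property), using the same matrix at every step (temporal homogeneity).

```python
import random

# A minimal two-state chain simulation (states and probabilities here are
# illustrative). The next state depends only on the current state, and the
# same transition matrix P is used at every time step.

P = {"E1": {"E1": 0.9, "E2": 0.1},
     "E2": {"E1": 0.8, "E2": 0.2}}

def step(state, rng):
    """One stochastic movement according to the transition row of `state`."""
    u, cum = rng.random(), 0.0
    for nxt, p in P[state].items():
        cum += p
        if u < cum:
            return nxt
    return nxt  # guard against floating-point round-off

def simulate(start, t, seed=0):
    """Run the chain for t steps from `start`; returns the state path."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(t):
        path.append(step(path[-1], rng))
    return path
```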

18
Discrete time finite Markov Chain Consider the two-step transition probability from E_i to E_j:

p_ij^(2) = Σ_k p_ik p_kj

This is the (i,j)-th element of P·P, so the two-step transition matrix is P^2. Extending this argument, the t-step transition matrix is P^t.
----------------------------------------------------------------------------
Absorbing states: p_ii = 1. Once the chain enters such a state, it stays there. We don’t consider these.
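The t-step result is easy to check numerically. A small sketch with illustrative transition numbers:

```python
# The two-step transition matrix is P*P, and the t-step matrix is P^t.

def mat_mul(A, B):
    """Plain matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(P, t):
    """t-th matrix power, starting from the identity matrix."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result

P = [[0.9, 0.1],
     [0.8, 0.2]]
P2 = mat_pow(P, 2)  # P2[i][j] = sum_k p_ik * p_kj
```

Each row of P^t still sums to 1, as it must for transition probabilities.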

19
Discrete time finite Markov Chain The stationary state: the probability of being in each state stays constant over time. A distribution π satisfying π = πP is a stationary state. For finite, aperiodic, irreducible Markov chains, π exists and is unique. Periodic: a state can only be returned to at times t_0, 2t_0, 3t_0, …, with t_0 > 1. Irreducible: any state can eventually be reached from any state.
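For a finite, aperiodic, irreducible chain, simply iterating π ← πP converges to the unique stationary distribution. A sketch with illustrative transition numbers:

```python
# Power iteration for the stationary distribution pi = pi * P.

def stationary(P, tol=1e-12, max_iter=100000):
    """Iterate pi <- pi P from the uniform distribution until convergence."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

pi = stationary([[0.9, 0.1],
                 [0.8, 0.2]])  # approximately [8/9, 1/9]
```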

20
Discrete time finite Markov Chain Note: the Markov chain is an elegant model for many situations in bioinformatics, e.g. evolutionary processes, sequence analysis, etc. BUT the two Markov assumptions may not hold true.

22
Hidden Markov Model The emissions are probabilistic, state-dependent and time-independent. Example: a two-state chain with transition probabilities E1 → E2 = 0.1 (so E1 → E1 = 0.9) and E2 → E1 = 0.8 (so E2 → E2 = 0.2), and emission probabilities E1: A = 0.5, B = 0.5; E2: A = 0.25, B = 0.75. We don’t observe the states of the Markov chain, but we observe the emissions. If both E1 and E2 have the same chance to initiate the chain, and we observe the sequence “BBB”, what is the most likely state sequence that the chain went through?
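For a chain this small, the question can be answered by brute force before introducing any clever algorithm. The sketch below uses a toy two-state model with illustrative parameters: it enumerates every possible state path Q and picks the one maximizing the joint probability P(Q, “BBB”).

```python
from itertools import product

# Brute-force search over all S^T = 2^3 state paths for a toy two-state HMM
# (all numbers here are illustrative).

init = {"E1": 0.5, "E2": 0.5}
trans = {"E1": {"E1": 0.9, "E2": 0.1},
         "E2": {"E1": 0.8, "E2": 0.2}}
emit = {"E1": {"A": 0.5, "B": 0.5},
        "E2": {"A": 0.25, "B": 0.75}}

def joint(path, obs):
    """P(Q = path, O = obs) for the model above."""
    p = init[path[0]] * emit[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p

paths = list(product(["E1", "E2"], repeat=3))   # all 2^3 state sequences
best = max(paths, key=lambda q: joint(q, "BBB"))
```

This enumeration is exactly what becomes infeasible for long sequences, which motivates the dynamic-programming algorithms that follow.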

24
Hidden Markov Model Parameters: the initial distribution, the transition probabilities, and the emission probabilities. Common questions: How to efficiently calculate the probability of the emissions, P(O)? How to find the most likely hidden state sequence, argmax_Q P(Q | O)? How to find the most likely parameters, i.e. those maximizing the likelihood of the observed data?

28
The Forward and Backward Algorithm If there are a total of S states, and we study an emission sequence from T steps of the chain, then #(all possible Q) = S^T. When either S or T is big, direct calculation is impossible. Let q_t denote the hidden state at time t, and let b_i denote the emission probabilities from state E_i. Consider the “forward variables”:

α_t(i) = P(O_1 … O_t, q_t = E_i)

i.e. the probability of the emissions up to step t, with the chain at state E_i at step t.

29
The Forward and Backward Algorithm At step 1: α_1(i) = π_i b_i(O_1). At step t+1: α_{t+1}(j) = [Σ_i α_t(i) p_ij] b_j(O_{t+1}). By induction, we know all α_T(i). Then P(O) can be found by P(O) = Σ_i α_T(i). A total of about 2TS^2 computations are needed.
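The recursion above fits in a few lines. The sketch below runs it on a toy two-state model with illustrative parameters:

```python
# Forward algorithm sketch: alpha[t][i] = P(O_1 .. O_t, chain at state i
# at step t); P(O) is the sum of the final forward variables.

def forward(obs, states, init, trans, emit):
    alpha = [{i: init[i] * emit[i][obs[0]] for i in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * trans[i][j] for i in states) * emit[j][o]
                      for j in states})
    return alpha, sum(alpha[-1].values())   # P(O) = sum_i alpha_T(i)

states = ["E1", "E2"]
init = {"E1": 0.5, "E2": 0.5}
trans = {"E1": {"E1": 0.9, "E2": 0.1},
         "E2": {"E1": 0.8, "E2": 0.2}}
emit = {"E1": {"A": 0.5, "B": 0.5},
        "E2": {"A": 0.25, "B": 0.75}}

alpha, p_obs = forward("BBB", states, init, trans, emit)
```

Each step reuses the previous column of forward variables instead of re-enumerating all S^T paths, which is where the 2TS^2 cost comes from.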

30
The Forward and Backward Algorithm Back to our simple example: find P(“BBB”). Now, what saved us the computing time? Reusing the shared components. And what allowed us to do that? The short memory of the Markov chain.

31
The Forward and Backward Algorithm The backward algorithm: define

β_t(i) = P(O_{t+1} … O_T | q_t = E_i)

i.e. the probability of the emissions after step t, given the chain is at state E_i at step t. Starting from β_T(i) = 1, from T−1 down to 1 we can iteratively calculate: β_t(i) = Σ_j p_ij b_j(O_{t+1}) β_{t+1}(j).
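The backward recursion can be sketched the same way, on the same kind of toy two-state model (illustrative parameters). As a consistency check, P(O) is also recoverable from the backward variables at step 1.

```python
# Backward algorithm sketch: beta[t][i] = P(O_{t+1} .. O_T | state i at
# step t), filled in from T-1 down to 1, starting from beta_T(i) = 1.

def backward(obs, states, init, trans, emit):
    beta = [{i: 1.0 for i in states}]            # beta at step T
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(trans[i][j] * emit[j][o] * nxt[j]
                               for j in states) for i in states})
    # Consistency check: P(O) = sum_i init_i * b_i(O_1) * beta_1(i)
    p_obs = sum(init[i] * emit[i][obs[0]] * beta[0][i] for i in states)
    return beta, p_obs

states = ["E1", "E2"]
init = {"E1": 0.5, "E2": 0.5}
trans = {"E1": {"E1": 0.9, "E2": 0.1},
         "E2": {"E1": 0.8, "E2": 0.2}}
emit = {"E1": {"A": 0.5, "B": 0.5},
        "E2": {"A": 0.25, "B": 0.75}}

beta, p_obs = backward("BBB", states, init, trans, emit)
```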

32
The Forward and Backward Algorithm Back to our simple example: observing “BBB”, we can compute the backward variables β_t(i) for each state and step.

34
The Viterbi Algorithm To find: argmax_Q P(Q | O). Since P(Q | O) = P(Q, O) / P(O), and P(O) does not depend on Q, to maximize the conditional probability we can simply maximize the joint probability. Define:

δ_t(i) = max over q_1 … q_{t−1} of P(q_1 … q_{t−1}, q_t = E_i, O_1 … O_t).

35
The Viterbi Algorithm Then our goal becomes finding max_i δ_T(i), using the recursion δ_{t+1}(j) = [max_i δ_t(i) p_ij] b_j(O_{t+1}). So this problem is solved by dynamic programming.

36
The Viterbi Algorithm

Time:        1                  2          3
E1:    0.5 × 0.5  = 0.25      0.15       0.0675
E2:    0.5 × 0.75 = 0.375     0.05625    0.01125

The calculation proceeds from the left border to the right border, while in sequence alignment it proceeds from the upper-left corner to the right and lower borders. Again, why is this efficient calculation possible? The short memory of the Markov chain!
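The table above can be reproduced by a short Viterbi sketch on a toy two-state model (illustrative parameters), including the backtrace that recovers the most likely state path:

```python
# Viterbi sketch: delta[t][i] = best joint probability of any state path
# ending in state i at step t; back-pointers recover the best path.

def viterbi(obs, states, init, trans, emit):
    delta = [{i: init[i] * emit[i][obs[0]] for i in states}]
    back = []
    for o in obs[1:]:
        prev = delta[-1]
        row, ptr = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] * trans[i][j])
            ptr[j] = best_i
            row[j] = prev[best_i] * trans[best_i][j] * emit[j][o]
        delta.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda i: delta[-1][i])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path, delta[-1][last]

states = ["E1", "E2"]
init = {"E1": 0.5, "E2": 0.5}
trans = {"E1": {"E1": 0.9, "E2": 0.1},
         "E2": {"E1": 0.8, "E2": 0.2}}
emit = {"E1": {"A": 0.5, "B": 0.5},
        "E2": {"A": 0.25, "B": 0.75}}

path, prob = viterbi("BBB", states, init, trans, emit)
```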

37
The Estimation of parameters When the topology of an HMM is known, how do we find the initial distribution, the transition probabilities and the emission probabilities from a set of emitted sequences? Difficulties: the number of parameters is usually not small (even the simple chain in our example has 5 parameters), and the landscape of the likelihood function is normally bumpy, so finding the global optimum is hard.

38
The Estimation of parameters The general idea of the Baum-Welch method:
(1) Start with some initial values.
(2) Re-estimate the parameters. The previously mentioned forward and backward probabilities are used here, and it is guaranteed that the likelihood of the data improves with the re-estimation.
(3) Repeat (2) until a local optimum is reached.
We will discuss the details next time. To be continued