2 Evolution
Evolution of new organisms is driven by:
- Diversity: different individuals carry different variants of the same basic blueprint.
- Mutations: the DNA sequence can change through single-base changes, deletion or insertion of DNA segments, etc.
- Selection bias.

5 Primate evolution
A phylogeny (also called a phylogenetic tree) is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species.

6 Historical Note
- Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria).
- Since then, the focus has been on objective criteria for constructing phylogenetic trees.
- Thousands of articles have appeared in the last decades.
- Important for many aspects of biology:
  - Classification
  - Understanding biological mechanisms

7 Morphological vs. Molecular
- Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.
- Modern biological methods make it possible to use molecular features:
  - Gene sequences
  - Protein sequences
- Analysis is based on homologous sequences (e.g., globins) in different species.

13 Theory of Evolution
- Basic idea: speciation events lead to the creation of different species.
- Speciation is caused by physical separation into groups where different genetic variants become dominant.
- Any two species share a (possibly distant) common ancestor.

14 Phylogenetic trees
[Figure: a tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
- Leaves: current-day species.
- Nodes: hypothetical most recent common ancestors.
- Edge lengths: “time” from one speciation to the next.

15 Types of Trees
A natural model to consider is that of rooted trees.
[Figure: rooted tree; the root is labeled "Common Ancestor"]

16 Types of trees
- An unrooted tree represents the same phylogeny without the root node.
- Depending on the model, data from current-day species often do not distinguish between different placements of the root.

18 Positioning Roots in Unrooted Trees
We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest.
[Figure: tree on Aardvark, Bison, Chimp, Dog, Elephant, with Falcon attached at the proposed root]

19 Dangers of Gene Duplication
[Figure: a gene duplication followed by speciation events, giving gene copies 1A, 2A, 3A and 1B, 2B, 3B in species 1, 2, 3]
If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree that does not represent the phylogeny of the host species of the given sequences, because duplication does not create new species.
In the sequel we assume that all given sequences are orthologs.

20 Types of Data
- Distance-based: the input is a matrix of distances between species. A distance can be the fraction of residues on which two sequences disagree, an alignment score between them, etc.
- Character-based: the input is a multiple sequence alignment (MSA). Sequences consist of characters (e.g., residues) that are examined separately.
- Genome/proteome-based: the input is whole genome or proteome sequences, with no MSA or obvious distance definition.
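
The "fraction of residues they disagree on" can be illustrated with a short sketch. It assumes the sequences are already aligned to equal length and ignores gaps; the function names and the toy sequences are made up for illustration.

```python
from itertools import combinations

def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)

def distance_matrix(seqs):
    """Build a symmetric distance table d[(a, b)] from named aligned sequences."""
    d = {(name, name): 0.0 for name in seqs}
    for a, b in combinations(seqs, 2):
        d[a, b] = d[b, a] = p_distance(seqs[a], seqs[b])
    return d

# Hypothetical toy alignment (not from the slides).
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACGA"}
dm = distance_matrix(toy)
```

A distance table of this kind is the input to the distance-based methods that follow.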

21 Tree Construction: Two Popular Methods
- Distance-based: a weighted tree that realizes the distances between the objects (or comes close to doing so).
- Character-based: a tree that optimizes an objective function based on all characters in the input sequences (the major methods are parsimony and likelihood).
We start with distance-based methods, considering the following question: given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny that best “fits” the distances.

22 Exact solution: Additive sets
Given a set M of L objects with an L×L distance matrix such that:
- d(i,i) = 0, and d(i,j) > 0 for i ≠ j;
- d(i,j) = d(j,i);
- d(i,k) ≤ d(i,j) + d(j,k) for all i, j, k.
Can we construct a weighted tree that realizes these distances?
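
The three conditions on the distance matrix can be checked directly. The sketch below assumes the matrix is a square list of lists; the function name, the tolerance and the two small matrices are illustrative.

```python
def is_metric(d, tol=1e-9):
    """Check the three conditions above on a square matrix d."""
    n = len(d)
    for i in range(n):
        if abs(d[i][i]) > tol:                         # d(i,i) = 0
            return False
        for j in range(n):
            if i != j and d[i][j] <= 0:                # d(i,j) > 0 for i != j
                return False
            if abs(d[i][j] - d[j][i]) > tol:           # symmetry
                return False
            for k in range(n):
                if d[i][k] > d[i][j] + d[j][k] + tol:  # triangle inequality
                    return False
    return True

# Hypothetical matrices (not from the slides).
good = [[0, 14, 9], [14, 0, 13], [9, 13, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]   # violates the triangle inequality: 5 > 1 + 1
```

The check takes O(L^3) time because of the triangle inequality over all triples.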

23 Additive sets (cont.)
We say that the set M of L objects is additive if there is a tree T with positive edge weights, L of whose nodes correspond to the L objects, such that for all i, j: d(i,j) = d_T(i,j), the length of the path from i to j in T.
Note: sometimes the tree is required to be binary, and then the edge weights are only required to be non-negative.

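24 Three-object sets are always additive
For L = 3 there is always a (unique) tree with one internal node.
[Figure: leaves i, j, k joined at the internal node m by edges of weights a, b, c]
Thus the three edge weights are determined by the three pairwise distances.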
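
As a worked illustration, assume a, b, c are the weights of the edges from m to i, j, k respectively (the figure itself is not available, so this labeling is an assumption). Then d(i,j) = a + b, d(i,k) = a + c and d(j,k) = b + c, and solving these three equations gives the weights:

```python
def three_point_edges(d_ij, d_ik, d_jk):
    """Solve a+b = d_ij, a+c = d_ik, b+c = d_jk for the edge weights (a, b, c)."""
    a = (d_ij + d_ik - d_jk) / 2   # edge m-i
    b = (d_ij + d_jk - d_ik) / 2   # edge m-j
    c = (d_ik + d_jk - d_ij) / 2   # edge m-k
    return a, b, c

# Hypothetical distances (not from the slides).
a, b, c = three_point_edges(5, 6, 7)
```

The triangle inequality guarantees that none of the three weights is negative.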

25 How about four objects?
For L = 4, not all sets of four objects are additive; e.g., there is no tree that realizes the distances in the figure.
[Figure: four objects i, j, k, l with pairwise distances 2 and 3]

26 The Four Points Condition
Theorem: A set M of L objects is additive iff any subset of four objects can be labeled i, j, k, l so that:
d(i,k) + d(j,l) = d(i,l) + d(k,j) ≥ d(i,j) + d(k,l)
We call {{i,j},{k,l}} the “split” of {i,j,k,l}.
[Figure: tree on i, j, k, l with the split {{i,j},{k,l}}]
Proof: Additivity ⇒ 4 Points Condition: by the figure...
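
The condition says that, of the three ways to pair up four objects, the two largest pair-sums must be equal. A minimal sketch of that test over all quartets follows. It assumes the input is already a metric given as a list of lists; the names and both example matrices are hypothetical.

```python
from itertools import combinations

def quartet_ok(d, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pair-sums are equal."""
    sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
    return abs(sums[2] - sums[1]) <= tol

def is_additive(d, tol=1e-9):
    """Brute-force four-point test over all quartets: O(L^4) time."""
    return all(quartet_ok(d, *q, tol=tol) for q in combinations(range(len(d)), 4))

# Path lengths in a tree A-x (5), x-C (4), x-y (3), y-B (6), y-D (2); order A, B, C, D.
tree_like = [[0, 14, 9, 10], [14, 0, 13, 8], [9, 13, 0, 9], [10, 8, 9, 0]]
# Four points on a cycle with sides 2 and diagonals 3; order i, j, k, l.
square = [[0, 2, 3, 2], [2, 0, 2, 3], [3, 2, 0, 2], [2, 3, 2, 0]]
```

For `tree_like` the pair-sums are 23, 17 and 23, so the condition holds. For `square` they are 4, 6 and 4, so it fails.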

30 Induction step for L > 4
- Remove object L from the set.
- By induction, there is a tree T’ for {1, 2, …, L-1}.
- For each pair of labeled nodes (i,j) in T’, let a_ij, b_ij, c_ij be defined by the following figure:
[Figure: i, j and L joined at the node m_ij; edge labels a_ij, b_ij, c_ij]

31 Induction step
- Pick i and j that minimize c_ij.
- T is constructed by adding L (and possibly m_ij) to T’, as in the figure. Then d(i,L) = d_T(i,L) and d(j,L) = d_T(j,L).
- It remains to prove that for each k ≠ i, j: d(k,L) = d_T(k,L).
[Figure: L attached to T’ at m_ij; edge labels a_ij, b_ij, c_ij]

32 Induction step (cont.)
Let k ≠ i, j be an arbitrary node in T’, and let n be the branching point of k on the path from i to j.
By the minimality of c_ij, {{i,j},{k,L}} is not a “split” of {i,j,k,L}. So assume WLOG that {{i,L},{j,k}} is a “split” of {i,j,k,L}.
[Figure: T’ with nodes i, j, k, the branching point n, and L attached at m_ij]

34 From Additive Distance to a Tree
By following the proof, the four-point condition can be used to construct a tree from a distance matrix, or to decide that there is no such tree (namely, that the distance is not additive).
But this algorithm goes over all quartets, resulting in O(L^4) steps for L species, which is too slow.
The most popular method for constructing trees for additive sets uses the neighbor joining approach.

35 Constructing additive trees: The neighbor joining problem
Let i, j be sisters (neighboring leaves) in a tree, let k be their father, and let m be any other vertex.
Using the equation d(k,m) = (d(i,m) + d(j,m) - d(i,j))/2, we can compute the distances from k to all other leaves.
This suggests the following method to construct a tree from an additive distance matrix:
- Find sisters i, j in the tree.
- Replace i, j by their father k, and recursively construct a tree T for the smaller set.
- Add i, j as children of k in T.
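
The replacement step can be sketched as follows, assuming the three-point relation d(k,m) = (d(i,m) + d(j,m) - d(i,j))/2 for the father k of sisters i and j. The dict-of-dicts layout, the function name and the example distances are illustrative choices.

```python
def replace_sisters_by_parent(d, i, j, k):
    """Return a new distance table in which sisters i, j are replaced by their father k."""
    others = [m for m in d if m not in (i, j)]
    new_d = {m: {n: d[m][n] for n in others} for m in others}
    new_d[k] = {k: 0.0}
    for m in others:
        # Three-point relation for the leaves i, j, m meeting at k.
        dist = (d[i][m] + d[j][m] - d[i][j]) / 2
        new_d[k][m] = new_d[m][k] = dist
    return new_d

# Hypothetical additive distances from the tree A-x (5), x-C (4), x-y (3), y-B (6), y-D (2).
names = ["A", "B", "C", "D"]
rows = [[0, 14, 9, 10], [14, 0, 13, 8], [9, 13, 0, 9], [10, 8, 9, 0]]
d = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

smaller = replace_sisters_by_parent(d, "B", "D", "y")
```

In this example B and D are sisters with father y, and the computed distances match the path lengths in the tree: d(y,A) = 5 + 3 = 8 and d(y,C) = 3 + 4 = 7.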

36 Neighbor Finding
How can we find a pair of sisters (neighboring leaves) from distances alone?
Closest nodes are not necessarily neighboring leaves.
[Figure: tree on leaves A, B, C, D]
Next, we show a way to find neighbors from distances.

37 Neighbor Finding: Saitou & Nei method
Theorem (Saitou & Nei): Assume d is additive, with all tree edge weights positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are sister taxa in the tree.
[Figure: leaves i, j, k, l, m and subtrees T1, T2]
The proof is rather involved and is skipped.

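39 A simpler neighbor finding method
- Select an arbitrary (fixed) node r.
- For each pair of labeled nodes (i,j), let C(i,j) be defined by the following expression (also see figure):
  C(i,j) = (d(r,i) + d(r,j) - d(i,j))/2
[Figure: C(i,j) is the length of the path from r to the point where the paths to i and j branch apart]
Claim: Let i, j be such that C(i,j) is maximized. Then i and j are neighboring leaves.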

40 Sisters Identification: Example
Select arbitrarily r = A.
C(B,C) = ( )/2 = 5
C(B,D) = ( )/2 = 8
C(C,D) = ( )/2 = 5
[Figure: weighted tree on leaves A, B, C, D]
Claim: Let i, j be such that C(i,j) is maximized. Then i and j are neighboring leaves.
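
The selection rule can be tried out in a few lines. The sketch assumes C(i,j) = (d(r,i) + d(r,j) - d(i,j))/2, i.e. the distance from r to the point where the paths to i and j separate. The distances below are hypothetical: they come from a made-up tree chosen so that the three C values equal the 5, 8, 5 of the example, and they are not the numbers from the original figure.

```python
from itertools import combinations

def c_value(d, r, i, j):
    """Assumed definition: distance from r to the branching point of i and j."""
    return (d[r][i] + d[r][j] - d[i][j]) / 2

def find_sisters(d, r):
    """Return the pair (i, j) of non-reference nodes that maximizes C(i,j)."""
    candidates = [x for x in d if x != r]
    return max(combinations(candidates, 2), key=lambda p: c_value(d, r, *p))

# Hypothetical tree: A-x (5), x-C (4), x-y (3), y-B (6), y-D (2).
names = ["A", "B", "C", "D"]
rows = [[0, 14, 9, 10], [14, 0, 13, 8], [9, 13, 0, 9], [10, 8, 9, 0]]
d = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

sisters = find_sisters(d, "A")
```

With r = A the maximum is C(B,D) = 8, so B and D are reported as neighboring leaves, which agrees with the tree used to generate the distances.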

41 Neighbor Joining Algorithm
- Set M to contain all leaves, and select a root r. |M| = L.
- If L = 2, return a tree of two vertices.
- Iteration:
  - Choose i, j such that C(i,j) is maximal.
  - Create a new vertex k, and update the distances.
  - Remove i, j from M, and add k to M.
  - Recursively construct a tree on the smaller set.
  - When done, add i, j as children of k, at distances d(i,k) and d(j,k).
[Figure: sisters i, j, their new father k, and another vertex m]
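
The steps above can be sketched as a loop instead of a recursion. This is a naive illustration under stated assumptions, not a reference implementation: C(i,j) is taken to be (d(r,i) + d(r,j) - d(i,j))/2, the input is an additive dict-of-dicts table with at least two leaves, and the internal node names `node0`, `node1`, ... are invented here.

```python
from itertools import combinations

def neighbor_joining_c(d, r):
    """Return the tree as a list of (u, v, weight) edges; r is the fixed reference leaf."""
    d = {u: dict(row) for u, row in d.items()}   # work on a copy
    edges, next_id = [], 0
    while len(d) > 2:
        cands = [x for x in d if x != r]
        # Maximizing d(r,i) + d(r,j) - d(i,j) is the same as maximizing C(i,j).
        i, j = max(combinations(cands, 2),
                   key=lambda p: d[r][p[0]] + d[r][p[1]] - d[p[0]][p[1]])
        k = f"node{next_id}"
        next_id += 1
        d_ik = (d[i][j] + d[i][r] - d[j][r]) / 2   # three-point relation on i, j, r
        edges += [(i, k, d_ik), (j, k, d[i][j] - d_ik)]
        others = [m for m in d if m not in (i, j)]
        row_k = {m: (d[i][m] + d[j][m] - d[i][j]) / 2 for m in others}
        del d[i], d[j]
        for m in others:                           # drop i, j and add k in every row
            del d[m][i], d[m][j]
            d[m][k] = row_k[m]
        row_k[k] = 0.0
        d[k] = row_k
    u, v = d                                       # the two remaining vertices
    edges.append((u, v, d[u][v]))
    return edges

# Hypothetical additive distances from the tree A-x (5), x-C (4), x-y (3), y-B (6), y-D (2).
names = ["A", "B", "C", "D"]
rows = [[0, 14, 9, 10], [14, 0, 13, 8], [9, 13, 0, 9], [10, 8, 9, 0]]
dist = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

tree_edges = neighbor_joining_c(dist, "A")
```

On this input the loop first joins B and D (edge weights 6 and 2), then C with the new node (weights 4 and 3), and finally connects A at distance 5, recovering the generating tree.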

42 Complexity of Neighbor Joining Algorithm
Naive implementation:
- Initialization: Θ(L^2) to compute the C(i,j)’s.
- Each iteration:
  - O(L) to update {C(i,k) : i ∈ M} for the new node k.
  - O(L^2) to find the maximal C(i,j).
- Total of O(L^3).

43 Complexity of Neighbor Joining Algorithm
Using a heap to store the C(i,j)’s:
- Initialization: Θ(L^2) to compute and heapify the C(i,j)’s.
- Each iteration:
  - O(1) to find the maximal C(i,j).
  - O(L log L) to delete {C(m,i), C(m,j)} and add C(m,k) for all vertices m.
- Total of O(L^2 log L).
(Implementation details are omitted.)

44 Reconstructing Trees from Additive Matrices
Given a distance matrix constituting an additive metric, the topology of the corresponding additive tree is unique.
Q: Do we have to test additivity before running NJ?
A: This would be bad news, as this takes O(L^4) time!
[Figure: a distance matrix on A, B, C, D, E and the corresponding additive tree]

45 Reconstructing Trees from Additive Matrices
Q: Do we have to test additivity before running NJ?
A: No. By Saitou & Nei, if the matrix is additive, NJ will construct the correct tree. The algorithm need not know in advance whether the matrix is additive.

46 NJ Algorithm: Example
- Identify i, j as neighbours if their divergence is minimal.
- Combine i, j into a new node u.
- Update the distance matrix.
- If only 3 nodes are left, finish.
Let r_i be the sum of distances from i to every other node.
[Figure: tree with nodes i, j, k, l, m, n and edge lengths 0.1 and 0.4]
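
The slide does not spell out the divergence formula, so the sketch below uses the common textbook form of the Saitou & Nei criterion as an assumption: pick the pair minimizing (n-2)·d(i,j) - r_i - r_j, and place the new node u at distance d(i,j)/2 + (r_i - r_j)/(2(n-2)) from i. The function names and the example distances are illustrative.

```python
from itertools import combinations

def row_sums(d):
    """r_i: sum of distances from i to every other node."""
    return {i: sum(d[i].values()) for i in d}

def pick_pair(d):
    """Pair minimizing the assumed criterion (n-2)*d(i,j) - r_i - r_j; needs n > 2."""
    n, r = len(d), row_sums(d)
    return min(combinations(d, 2),
               key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])

def branch_lengths(d, i, j):
    """Distances from i and from j to the new node u that joins them."""
    n, r = len(d), row_sums(d)
    d_iu = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    return d_iu, d[i][j] - d_iu

# Hypothetical additive distances from the tree A-x (5), x-C (4), x-y (3), y-B (6), y-D (2).
names = ["A", "B", "C", "D"]
rows = [[0, 14, 9, 10], [14, 0, 13, 8], [9, 13, 0, 9], [10, 8, 9, 0]]
d = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

pair = pick_pair(d)
```

In a four-leaf unrooted tree both cherries, here (A, C) and (B, D), tie for the minimum, so either is a valid answer. `branch_lengths(d, "B", "D")` gives the edge weights 6 and 2 of the generating tree.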

50 Reconstructing Trees from non-Additive Matrices
Q: What if the distance matrix is not additive?
A: We could still run NJ!
Q: But can anything be said about the resulting tree?
A: Not really. The resulting tree topology could even vary according to the way ties are resolved along the way.

51 Almost Additive Matrix
A distance matrix d’ is “almost additive” if there exists an additive matrix d such that |d(i,j) - d’(i,j)| < μ/2 for all i, j, where μ is the smallest edge weight of the tree T realizing d.
Atteson: If d’ is almost additive with respect to a tree T, then the output of NJ is a tree T’ with the same topology as T.
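
A small check of closeness to an additive matrix may help make the definition concrete. It assumes the usual statement of Atteson's condition: every entry of d’ differs from the additive d by less than half the smallest edge weight of T. The function name and the matrices are hypothetical.

```python
def within_half_min_edge(d_prime, d, min_edge):
    """True if max |d(i,j) - d'(i,j)| is strictly below min_edge / 2."""
    n = len(d)
    worst = max(abs(d[i][j] - d_prime[i][j]) for i in range(n) for j in range(n))
    return worst < min_edge / 2

# Additive distances from a hypothetical tree whose smallest edge weight is 2.
d = [[0, 14, 9, 10], [14, 0, 13, 8], [9, 13, 0, 9], [10, 8, 9, 0]]

noisy = [row[:] for row in d]
noisy[0][1] = noisy[1][0] = 14.5        # off by 0.5, below 2/2

too_noisy = [row[:] for row in d]
too_noisy[0][1] = too_noisy[1][0] = 15.0   # off by 1.0, not below 2/2
```

The check needs the additive matrix d and its tree to be known, so it illustrates the definition and is not a practical test on real data.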

58 2. Calculation of New Distances
After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:
Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = ( )/2 = 88.55
[Figure: subtree Mon-Hum joining Human and Monkey, with Spinach outside]
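
The averaging rule can be sketched as a table update. The species names follow the slide, but the numeric distances below are hypothetical placeholders, since the slide's operands are not available; the function name and the extra species Rice are also made up.

```python
def join_by_average(dist, a, b, joined):
    """Replace a and b by a joined node whose distances are simple averages."""
    others = [x for x in dist if x not in (a, b)]
    new = {x: {y: dist[x][y] for y in others} for x in others}
    new[joined] = {joined: 0.0}
    for x in others:
        avg = (dist[x][a] + dist[x][b]) / 2
        new[joined][x] = new[x][joined] = avg
    return new

# Hypothetical distances (not the values from the slide).
names = ["Monkey", "Human", "Spinach", "Rice"]
rows = [[0, 10, 90, 84], [10, 0, 86, 80], [90, 86, 0, 40], [84, 80, 40, 0]]
dist = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

joined = join_by_average(dist, "Monkey", "Human", "MonHum")
```

Here Dist[Spinach, MonHum] becomes (90 + 86)/2 = 88.0. Distances between species not involved in the join are left unchanged.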