Title: Phylogenetic Tree
1Phylogenetic Tree
2Phylogenetic Tree What it is
- Drawing an evolutionary tree from characteristics of organisms or from some measured distances between them
- Represented as a tree where nodes are the organisms/objects and arcs are the proximity between the respective nodes
- Based on how close the organisms are
3Phylogenetic Tree Motivation
- Pure curiosity: biological science
- One species can be studied in place of a related one
  - Drug tests on monkeys for humans
  - Rare species can be spared in a study
- Drug design based on the evolution of micro-organisms: AIDS/flu vaccine and drug design depends on how they evolve
- Tracking pathogen sources
- Genesis, archaeology, ...
4Phylogenetic Tree topology
- Evolutionary distance is not the same as elapsed time: the former is a crude approximation of the latter (if distance can be calculated at all)
- Leaves are objects; internal nodes may or may not be objects (they may represent hypothetical ancestors)
- Mostly binary trees, sometimes not
5Phylogenetic Tree source data types
- Discrete characters
  - Does it have a long beak?
  - Could be Boolean or multi-valued
  - Provided in matrix form (objects × characters)
- Numerical distance matrix
  - Symmetric pairwise distances measured by some means, e.g., by aligning sequences
- Continuous character: the character value is in a numerical domain
6Characters for phylogeny
- Characters should be relevant in the context of phylogeny: this depends on the user scientist
- Characters should be independent: inherited without interference between the characters (eye color and hair color may not be a good combination in a character set)
- All characters must evolve from the same ancestor: we presume that (1) it is a tree, (2) it is a connected tree
- Closest objects are called homologous: the maximum possible number of characters have the same or related values
7Phylogeny using character state matrix
- A state is a tuple with a value for each character (a value could be unassigned)
- An internal node may be a state without any object assigned to it
- Leaves are where the states correspond to objects with the respective assigned characters
- P 178: a source character state matrix
8Phylogeny using character state matrix Problems
- Convergent evolution: two non-homologous objects (most characters do not match, loosely speaking) happen to have the same value on a character (this needs a cycle in the graph)
9Phylogeny using character state matrix Problems
- In one case evolution suggests the value of character c evolves from long to short, in another case the reverse: confusion over the direction of evolution
- Again, the tree property would be violated to accommodate this
10Character domain types
- Domain of character c could be
  - red <-> blue <-> yellow <-> green
- c cannot evolve from blue to green without taking the value yellow first
- c is ordered
- c can be directed and ordered, instead of undirected as above
11Perfect phylogeny
- Noise-free input
- Each edge in the phylogeny is a transition of the respective character's value
- All nodes with the same value for a character must form a subtree (with the transition at its root)
- Such a tree is a perfect phylogeny
12Perfect phylogeny problem
- Given a character state matrix, does there exist a perfect phylogeny over it?
- P 178: the table does not have a perfect phylogeny (presume transitions are always 0 -> 1). Why?
- P 180: a table and its perfect phylogeny
- What do you do when you do not have a perfect phylogeny? Presume the data is noisy and minimize errors in drawing a perfect phylogeny
13Perfect phylogeny problem
- You can always try all possible trees over the objects and check whether each tree is a perfect phylogeny or not
- The total number of such trees is $\prod_{i=3}^{n} (2i-5)$: exponential in the number of objects n
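A minimal sketch of how quickly that product grows, assuming n counts the objects (leaves) and the product runs over i = 3..n as on the slide. The function name is illustrative.

```python
def count_trees(n):
    """Evaluate the product over i = 3..n of (2i - 5)."""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 5
    return count

for n in (4, 5, 10, 20):
    print(n, count_trees(n))
```

Already at 10 objects there are more than two million candidate trees, which is why exhaustive search is only an existence argument.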
14Perfect phylogeny problem to check existence
(Boolean matrix)
15Perfect phylogeny problem to check existence
(Boolean matrix)
- Organize the character state matrix column-wise: for each column i, the set of objects with value 1 is $O_i$
- Every pair $O_i$ and $O_k$ should satisfy
  - either $O_i \subseteq O_k$
  - or $O_k \subseteq O_i$
  - or $O_i \cap O_k = \emptyset$
- Either one belongs to the other or they do not overlap at all
- If they partially overlap, no perfect phylogeny exists
16Perfect phylogeny problem to check existence
(Boolean matrix)
- On the contrary, suppose $O_i$ and $O_k$ overlap and a perfect phylogeny exists
- Say character i labels the edge (u, v): v and its subtree have i = 1, but all other nodes have i = 0
- Suppose three objects a, b, and c such that $a, b \in O_i$ but c is not: a and b are in the subtree of v and c is not there
- But suppose $b, c \in O_k$ and a is not: b and c must belong to some other subtree separated by edge k
- Contradiction
17Perfect phylogeny problem to check existence
(Boolean matrix)
- When no overlap exists
- Contained sets go within the same subtree: if $O_i \subseteq O_k$, then the i-subtree is a subtree of the k-subtree
- Disjoint sets are separate subtrees
- This proves the "if and only if" of the condition for perfect phylogeny
- Algorithm for checking: pairwise checking of object sets takes $O(m^2)$ comparisons for m characters, and each set-overlap check takes additional time
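The pairwise condition can be checked directly with sets. This is a minimal sketch, assuming the matrix is a list of rows (objects) of 0/1 values (characters). The function name is illustrative.

```python
from itertools import combinations

def columns_compatible(matrix):
    """True if every pair of column sets is nested or disjoint."""
    n_chars = len(matrix[0]) if matrix else 0
    # O_j = set of objects (row indices) holding a 1 in column j
    sets = [{i for i, row in enumerate(matrix) if row[j]} for j in range(n_chars)]
    return all(a <= b or b <= a or not (a & b)
               for a, b in combinations(sets, 2))
```

This is the quadratic-in-m check described above; each subset or intersection test costs up to O(n) on top of that.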
18Perfect phylogeny problem Algorithm (Boolean
matrix)
- Sort the columns by number of 1s (descending)
- Scan each row: for each box holding a 1, record the column number of the rightmost 1 to its left in that row
- Scan each column: every box holding a 1 should agree on that recorded number
- Complexity: O(mn) count, O(m log m) sort, O(mn) index matrix creation, O(mn) checking over the index matrix; total O(mn), presuming n > log m
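The three steps can be sketched as follows. This is an illustrative reading of the slide, not the textbook's code. It assumes a 0/1 matrix given as a list of rows, drops duplicate and all-zero columns first (an added simplification), and folds the row scan and the column check into one pass.

```python
def has_perfect_phylogeny(matrix):
    """Existence check for a Boolean character state matrix."""
    if not matrix:
        return True
    n_chars = len(matrix[0])
    # Columns as tuples; duplicates and all-zero columns are dropped (assumption)
    cols = {tuple(row[j] for row in matrix) for j in range(n_chars)}
    cols = sorted((c for c in cols if any(c)), key=sum, reverse=True)
    # expected[j]: 1-based column of the nearest 1 to the left, 0 if none
    expected = [None] * len(cols)
    for i in range(len(matrix)):
        last = 0
        for j, col in enumerate(cols):
            if col[i]:
                if expected[j] is None:
                    expected[j] = last      # first box seen in this column
                elif expected[j] != last:
                    return False            # boxes of column j disagree
                last = j + 1
    return True
```

Unlike the pairwise set comparison, this needs only one sort and one pass over the matrix.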
19Perfect phylogeny problem Algorithm (Boolean
matrix)
- Exercise: try the algorithm on tables 6.1 (p 178) and 6.2 (p 180)
- Construction algorithm: (1) sort the characters/columns as in the existence check (descending number of 1s) and take them in that order; (2) for each object, (3) for each character the object has: (4) if an edge for the character exists, put the object at its end, (5) else create an edge and put the object at its end; (6, cosmetic step) if a leaf node holds more than one object, create an edge for each object
- O(nm)
- Exercise: try it on table 6.2, p 180
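A minimal sketch of construction steps (1) to (5), assuming the matrix is known to admit a perfect phylogeny, the ancestral state is 0, and columns are taken in descending number of 1s. The nested-dict tree layout and the names are illustrative, and the cosmetic step (6) is left out.

```python
def build_perfect_phylogeny(matrix, names):
    """Return the root of a tree of dicts: {'objects': [...], 'children': {char: node}}."""
    n_chars = len(matrix[0])
    # (1) characters ordered by descending number of 1s
    order = sorted(range(n_chars), key=lambda j: -sum(row[j] for row in matrix))
    root = {"objects": [], "children": {}}
    for name, row in zip(names, matrix):            # (2) each object
        node = root
        for j in order:                             # (3) each character it has
            if row[j]:
                # (4)/(5) follow the edge for character j, creating it if missing
                node = node["children"].setdefault(
                    j, {"objects": [], "children": {}})
        node["objects"].append(name)
    return root
```

Each object walks from the root along the edges of the characters it carries, so the work is one pass over the matrix.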
20Perfect phylogeny problem Algorithm (non-Boolean
matrix, but)
- If there are two states per character but the order of transition is not known, then presume an order
- Majority state is 0, minority state is 1 (more ancestors are available)
- The same Lemma must be applied after this presumption: no overlapping sets of objects
21Phylogeny problem arbitrary domain size,
unordered characters
- (Def) Triangulated graph: no big hole, i.e., every cycle with > 3 vertices has a short-cut edge
- Sub-trees of a tree form a triangulated graph (as intersection graph?)
- (Def) Intersection graph over subsets: subsets are nodes, with edges between pairs of overlapping subsets
22Phylogeny problem arbitrary domain size,
unordered characters
- Fig 6.7, p187: intersection graph for Table 6.3 (p188), not triangulated yet
- (Def) c-Triangulated graph: add edges to the intersection graph G only between nodes of different characters; if the graph can be made triangulated this way, then G is c-triangulated
- Fig 6.7 is c-triangulated
23Phylogeny problem arbitrary domain size,
unordered characters
- A character state matrix admits a perfect phylogeny iff it translates to a c-triangulated graph
- Creating/checking c-triangulation is NP-hard (related to the max-clique problem)
24Phylogeny problem arbitrary domain size,
unordered characters 2 characters
- For 2 characters, the intersection graph is bipartite
- A perfect phylogeny exists iff the state intersection graph is acyclic
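The acyclicity test for two characters can be sketched with a union-find structure. The input format (one pair of state values per object) and the function name are assumptions for illustration.

```python
def two_char_perfect_phylogeny(rows):
    """rows: list of (state of character 1, state of character 2), one per object."""
    # One node per (character, state); one edge per distinct state pair seen
    edges = {(("c1", a), ("c2", b)) for a, b in rows}
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False                    # this edge would close a cycle
        parent[ru] = rv
    return True
```

Objects sharing the same pair of states contribute a single edge, so duplicates among the objects never create a cycle.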
25Phylogeny construction arbitrary domain size,
unordered characters 2 characters
- Algorithm
  - (1) Construct the intersection graph
  - (2) Make a node for each edge (the intersection of the objects in the old nodes now goes to the new node)
  - (3) Connect new nodes if they have overlapping objects
  - (4) A spanning tree of the graph is the phylogeny
  - (5, cosmetic step) Objects huddled on a node should be put on separate leaves
- Try it on Table 6.4 (p190), and check against Fig 6.8 (p189)
26When Perfect Phylogeny does not exist
- Eliminate problematic characters: which ones is an optimization problem (remove the minimum number of characters). Compatibility criterion
- Minimize convergence (a character goes back to its previous value). Parsimony criterion
- Both are NP-complete problems
27When Perfect Phylogeny does not exist Parsimony
- Compatibility problem: does there exist a subset of characters, of at least a given size, such that Lemma 6.1 (non-overlapping sets of objects) is valid, i.e., a perfect phylogeny exists?
- Equivalent to the K-clique problem: does there exist a fully connected subgraph (clique) with K or more nodes?
28When Perfect Phylogeny does not exist Parsimony
- Polynomial transformation from Clique to the compatibility problem: nodes become characters, with 3 objects for each edge carrying specific character values
- Every pair of NP-complete problems has two-way polynomial transformations
- Compatibility can also be polynomially transformed to Clique: characters become nodes, and pairs of non-overlapping (compatible) characters become edges
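The characters-to-nodes view can be illustrated by brute force on a small Boolean matrix: treat two characters as adjacent when their object sets are nested or disjoint, and look for the largest clique. This sketch is exponential in the number of characters and only suits toy inputs. The function name is illustrative.

```python
from itertools import combinations

def largest_compatible_subset(matrix):
    """Return column indices of a largest mutually compatible character set."""
    m = len(matrix[0])
    sets = [frozenset(i for i, row in enumerate(matrix) if row[j])
            for j in range(m)]

    def compatible(a, b):
        return a <= b or b <= a or not (a & b)

    # Try subset sizes from largest to smallest; the first clique found wins
    for k in range(m, 0, -1):
        for subset in combinations(range(m), k):
            if all(compatible(sets[i], sets[j])
                   for i, j in combinations(subset, 2)):
                return list(subset)
    return []
```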
29Phylogeny with Distance Matrix
30Phylogeny with Distance Matrix
- Input is a distance matrix (square, symmetric) between all pairs of objects, instead of a character state matrix
- Output is a phylogeny with objects as leaves and distances as arc labels
31Phylogeny with Distance Matrix
- Additive matrix: one for which you can draw a tree where the distance between every pair of leaves on the tree equals the distance in the matrix
- Matrices are unlikely to be additive in practice
- For a non-additive matrix, minimize the deviation over the tree: an NP-hard problem
32Phylogeny with Distance Matrix
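- Typically we have 2 matrices: (1) upper bounds on distances, and (2) lower bounds
- Metric space
  - $d_{ij} > 0$ for $i \ne j$, $d_{ii} = 0$, $d_{ij} = d_{ji}$, for all i, j
  - $d_{ij} \le d_{ik} + d_{kj}$
- Additive metric spaces follow the 4-point condition (for a suitable labelling i, j, k, l of any four objects)
  - $d_{ij} + d_{kl} = d_{ik} + d_{jl} \ge d_{il} + d_{jk}$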
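The 4-point condition says that, among the three ways of pairing up four objects, the two largest sums of distances are equal. A minimal sketch of that test, assuming a symmetric matrix given as a list of lists and a small numerical tolerance (both illustrative choices):

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Check the 4-point condition on every quadruple of objects."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:    # the two largest sums must tie
            return False
    return True
```

For n objects this inspects every quadruple, so it costs O(n^4) and is meant as a definition check rather than an efficient algorithm.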
33Phylogeny with Distance Matrix
- The tree should have internal nodes of degree 3 (Fig 6.9, p194)
- Arc xy is to be split proportionately at c, adding a node z by arc cz, so that distances xz and zy are proper
34Phylogeny with Distance Matrix
- $M_{xz} = d_{xc} + d_{zc}$
- $M_{yz} = d_{yc} + d_{zc}$
- $M_{xy} = d_{xc} + d_{yc}$
- Three equations, three unknowns $d_{xc}$, $d_{yc}$, $d_{zc}$ to be solved for
- The tree drawn is unique for 3 objects x, y and z
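Adding and subtracting the three equations gives closed forms for the unknowns, for example $d_{xc} = (M_{xy} + M_{xz} - M_{yz})/2$. A small sketch, with illustrative names:

```python
def three_point_tree(m_xy, m_xz, m_yz):
    """Solve for the arc lengths from x, y, z to the central node c."""
    d_xc = (m_xy + m_xz - m_yz) / 2
    d_yc = (m_xy + m_yz - m_xz) / 2
    d_zc = (m_xz + m_yz - m_xy) / 2
    return d_xc, d_yc, d_zc

print(three_point_tree(17, 21, 30))   # (4.0, 13.0, 17.0)
```

The lengths are non-negative exactly when the three distances satisfy the triangle inequality.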
35Phylogeny with Distance Matrix
- Adding a 4th object w is the same as adding the 3rd object z
- Add it between the older objects x and y, splitting xy at c2
- If c2 coincides with c, ignore this and redo the same on arc zc
- Object w may hang (from c2) between x and z or between y and z, but will not have 2 different opportunities
36Phylogeny with Distance Matrix
- The uniqueness of the tree remains valid for any k objects, $k \ge 4$, for a metric additive distance matrix
- The algorithm may have to try all possible places to split an arc, but there will be a unique position in a metric additive space
37Phylogeny Ultrametric tree
- Exercise: get the MST of a complete graph over table 6.5 (p195)
- Ultrametric tree construction
- Input: distance matrices for high cut-off Mh and low cut-off Ml (table 6.6, p 201)
- Output: a phylogeny where leaf-to-leaf distances are within the bounds provided by the 2 matrices (fig 6.16, p202)
38Phylogeny Ultrametric tree
- Algorithm
- Compute MST T over Mh (algorithm?): provides the basis for the structure of the tree
- Compute cut-off values for each edge of T using Ml: provides the basis for distances on the tree edges
- Compute the ultrametric tree U and find the distance on each arc using the cut-offs
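For the first step, any standard MST algorithm works. The sketch below uses Kruskal's algorithm with union-find on a full distance matrix. The choice of Kruskal and the toy matrix are assumptions for illustration, not values from the textbook tables.

```python
def mst_edges(dist):
    """Kruskal's algorithm; returns MST edges as (i, j, weight) tuples."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    edges = sorted((dist[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge joins two components
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

toy = [[0, 2, 6, 7], [2, 0, 5, 8], [6, 5, 0, 3], [7, 8, 3, 0]]
print(mst_edges(toy))
```

Sorting the returned edges by non-increasing weight gives the order in which Step 2.1 picks roots.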
39Phylogeny Ultrametric tree
- Step 2.1: input is T, output is a rooted tree R whose internal nodes represent edges of T
- Sort the edges of MST T by weight (from Mh), non-increasing
- Pick up edges in sorted order, each as the root in its iteration
- The path between the two end nodes must go via the root: the two end nodes of the edge should be in two different subtrees
- The next edge in the sort to be picked up is one that has the corresponding node (x) on the respective side of the previous root (xy)
- Continue until no edge for a node (x) is left (all such xy are picked up); then the node x is on a leaf
40Phylogeny Ultrametric tree
- Step 2.2 (cut-off)
- For each pair of nodes (x, y) look at the path in R
- See which node is the least common ancestor, say (ab); note that each internal node represents an edge
- Look up table Ml: if Ml_xy is more than the current cut-off(ab), replace it with Ml_xy
- In other words, the highest Ml value on any edge on the path from x to y in T should be its distance on the ultrametric tree
- In the example on p201-202, root (ad) is updated for all pairs of nodes on opposite sides: EB(1), ED(1), AD(4), AB(3), CB(4), CD(3)
41Phylogeny Ultrametric tree
- Step 3 (ultrametric tree): recompute R the same way as before
- But now put distances on the internal nodes
- Height of an internal node is its cut-off / 2
- Note: computation of R starts with the root and goes downwards
- Adjust distances between the nodes as heights are being calculated
- Done
42Phylogeny UPGMA
- Intra-cluster mean distance
- Inter-cluster distance
- WPGMA distances from the root to every branch
tip are equal
43Phylogeny UPGMA
- Intra-cluster mean distance
      a    b    c    d    e
a     0   17   21   31   23
b    17    0   30   34   21
c    21   30    0   28   39
d    31   34   28    0   43
e    23   21   39   43    0
44Phylogeny UPGMA
         (a,b)     c      d      e
(a,b)      0     25.5   32.5    22
c        25.5     0      28     39
d        32.5     28     0      43
e         22      39     43     0
45Phylogeny UPGMA
The smallest entry is $D_1(a,b) = 17$, so a and b are joined first, at a new node u.
Setting $d(a,u) = d(b,u) = D_1(a,b)/2$, the branches joining a and b to u then have lengths $d(a,u) = d(b,u) = 17/2 = 8.5$, assuming an ultrametric space.
46Metric Ultra-metric space
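An ultrametric on a set M is a distance function d such that for all $x, y, z \in M$, one has
- $d(x,y) \ge 0$
- $d(x,y) = 0$ iff $x = y$
- $d(x,y) = d(y,x)$ (symmetry)
- $d(x,z) \le \max\{d(x,y), d(y,z)\}$ (strong triangle or ultrametric inequality)
For a metric space, the last condition is the weaker triangle inequality $d(x,z) \le d(x,y) + d(y,z)$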
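The strong triangle inequality is easy to test on a small matrix. A minimal sketch, assuming a symmetric matrix with zero diagonal given as a list of lists; the function name and tolerance are illustrative.

```python
from itertools import permutations

def is_ultrametric(d, tol=1e-9):
    """Check d(x,z) <= max(d(x,y), d(y,z)) for every ordered triple."""
    return all(d[x][z] <= max(d[x][y], d[y][z]) + tol
               for x, y, z in permutations(range(len(d)), 3))
```

Equivalently, in every triple the two largest distances are equal, which is what lets all leaves sit at the same height below the root.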
47Comparing phylogenies
$D_2((a,b),c) = (D_1(a,c) \times 1 + D_1(b,c) \times 1)/(1+1) = (21+30)/2 = 25.5$
$D_2((a,b),d) = (D_1(a,d) + D_1(b,d))/2 = (31+34)/2 = 32.5$
$D_2((a,b),e) = (D_1(a,e) + D_1(b,e))/2 = (23+21)/2 = 22$
48Phylogeny UPGMA
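The smallest entry of $D_2$ is $D_2((a,b),e) = 22$, so (a,b) and e are joined at a new node v, with $d(e,v) = 22/2 = 11$.
We deduce the missing branch length: $d(u,v) = d(e,v) - d(a,u) = d(e,v) - d(b,u) = 11 - 8.5 = 2.5$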
49Phylogeny UPGMA
The updated distances are calculated by proportional averaging:
$D_3(((a,b),e),c) = (D_2((a,b),c) \times 2 + D_2(e,c) \times 1)/(2+1) = (25.5 \times 2 + 39 \times 1)/3 = 30$
Thanks to this proportional average, the calculation of this new distance accounts for the larger size of the (a,b) cluster (two elements) with respect to e (one element). Similarly:
$D_3(((a,b),e),d) = (D_2((a,b),d) \times 2 + D_2(e,d) \times 1)/(2+1) = (32.5 \times 2 + 43 \times 1)/3 = 36$
Replace the proportional average with the simple mean and you get WPGMA.
50Phylogeny UPGMA
             ((a,b),e)    c     d
((a,b),e)        0       30    36
c               30        0    28
d               36       28     0
51Phylogeny UPGMA
There is a single entry to update, keeping in mind that the two elements c and d each have a contribution of 1 in the average computation:
$D_4((c,d),((a,b),e)) = (D_3(c,((a,b),e)) \times 1 + D_3(d,((a,b),e)) \times 1)/(1+1) = (30 \times 1 + 36 \times 1)/2 = 33$
Final step: the final $D_4$ matrix is
             ((a,b),e)   (c,d)
((a,b),e)        0         33
(c,d)           33          0
52Phylogeny UPGMA
53Phylogeny UPGMA
Time complexity: $O(n^3)$ for a naive implementation, down to $O(n^2 \log n)$
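The worked example can be reproduced with a short naive UPGMA sketch. The data layout (a dict of pairwise distances keyed by cluster ids) and the returned merge list are illustrative choices, and this version is the cubic-time one.

```python
def upgma(names, dist):
    """Naive UPGMA; returns merges as (left, right, distance, node height)."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}     # id -> (label, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    merges = []
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        (li, si), (lj, sj) = clusters.pop(i), clusters.pop(j)
        new = {}
        for k in clusters:
            dik = d[min(i, k), max(i, k)]
            djk = d[min(j, k), max(j, k)]
            # proportional (size-weighted) average of the two old distances
            new[(k, next_id)] = (si * dik + sj * djk) / (si + sj)
        d = {key: v for key, v in d.items() if i not in key and j not in key}
        d.update(new)
        clusters[next_id] = (f"({li},{lj})", si + sj)
        merges.append((li, lj, dij, dij / 2))
        next_id += 1
    return merges

D1 = [[0, 17, 21, 31, 23],
      [17, 0, 30, 34, 21],
      [21, 30, 0, 28, 39],
      [31, 34, 28, 0, 43],
      [23, 21, 39, 43, 0]]
for merge in upgma(list("abcde"), D1):
    print(merge)
```

The merge distances come out as 17, 22, 28 and 33, matching the matrices $D_1$ to $D_4$ in the slides. Using equal weights instead of `si` and `sj` in the average would give WPGMA.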
54Phylogeny Neighbor-joining
- Bottom up, as for UPGMA, but non-rooted
- Distance matrix transformed to Q-matrix (negative)
- Min Q-value used to connect cluster pairs
- Cluster distance update formula (in distance-space, not Q-space)
- Iterate: d -> Q -> cluster_join -> d_update -> iterate
- Topology: additive distances, node-pair distances are conserved
- Assumption: Balanced Minimum Evolution
- Greedy optimization (underlying linear programming)
- Fast (?), $O(n^3)$ complexity for n nodes
- Can recover the correct tree even if the source d-matrix is somewhat noisy
- Wiki: https://en.wikipedia.org/wiki/Neighbor_joining
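The d -> Q step uses the standard neighbor-joining formula $Q(i,j) = (n-2)\,d(i,j) - \sum_k d(i,k) - \sum_k d(j,k)$. The sketch below computes Q and picks the pair to join; the 5-object matrix holds illustrative values, and the distance update and iteration are left out.

```python
def q_matrix(d):
    """Neighbor-joining Q-matrix for a symmetric distance matrix."""
    n = len(d)
    row_sum = [sum(row) for row in d]
    return [[0 if i == j else (n - 2) * d[i][j] - row_sum[i] - row_sum[j]
             for j in range(n)] for i in range(n)]

def pair_to_join(d):
    """Indices (i, j), i < j, of the minimum off-diagonal Q-value."""
    q = q_matrix(d)
    n = len(d)
    return min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: q[p[0]][p[1]])

example = [[0, 5, 9, 9, 8],
           [5, 0, 10, 10, 9],
           [9, 10, 0, 8, 7],
           [9, 10, 8, 0, 3],
           [8, 9, 7, 3, 0]]
print(pair_to_join(example))
```

Note that the joined pair (0, 1) is not the closest pair in the distance matrix (that would be 3 and 4), which is the point of transforming to Q-space first.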
55Phylogeny General Comments
- Morphological and molecular (sequences)
- Distance matrix
- Maximum parsimony
- Maximum likelihood and Bayesian inference
- Post-analysis of Tree-support evaluation
- Shortcomings
- convergent evolution, horizontal gene-transfer
- hybrids, or non-binary tree, or phylogeny network
- missing species/taxa
- https://en.wikipedia.org/wiki/Computational_phylogenetics
56Comparing phylogenies
- Two trees are expected to be isomorphic
- All objects should be on the leaves; if not, make it so
- Pick up a node u and its sibling v on T1
- Look for u in T2, and if its sibling is not v return False
- If the sibling is v, then merge uv into its parent (and remove the subtree with u and v)
- Continue bottom up until both T1 and T2 become single-node trees, then return True
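A compact way to get the same yes/no answer for rooted, leaf-labelled trees is to compare canonical forms. This swaps the sibling-merging procedure above for a different technique (sorting children recursively), so it is a sketch of the test and not of those exact steps. Trees are assumed to be nested tuples with string leaf labels that contain no commas or parentheses.

```python
def canonical(tree):
    """Canonical string of a rooted tree given as nested tuples of labels."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canonical(child) for child in tree)) + ")"

def same_phylogeny(t1, t2):
    """True if the two trees are isomorphic with matching leaf labels."""
    return canonical(t1) == canonical(t2)

print(same_phylogeny((("a", "b"), "c"), ("c", ("b", "a"))))
print(same_phylogeny((("a", "b"), "c"), (("a", "c"), "b")))
```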