Phylogeny II: Parsimony, ML, SEMPHY

Slides: 45

Provided by: NirFri

1
Phylogeny II: Parsimony, ML, SEMPHY
2
Phylogenetic Tree

Topology: bifurcating
Leaves: 1…N
Internal nodes: N+1…2N-2

3
Character Based Methods

We start with a multiple alignment
Assumptions
All sequences are homologous
Each position in alignment is homologous
Positions evolve independently
No gaps
We seek to explain the evolution of each position
in the alignment

4
Parsimony

Character-based method
A way to score trees (but not to build trees!)
Assumptions
Independence of characters (no interactions)
Best tree is one where minimal changes take place

5
A Simple Example

What is the parsimony score of

A CAGGTA
B CAGACA
C CGGGTA
D TGCACT
E TGCGTA
6
A Simple Example
A CAGGTA
B CAGACA
C CGGGTA
D TGCACT
E TGCGTA

Each column is scored separately.
Let's look at the first column
Minimal tree has one evolutionary change

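[Figure: tree for the first column; the leaves carry C, C, C, T, T, and one branch carries the single change between T and C]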
7
Evaluating Parsimony Scores

How do we compute the Parsimony score for a given
tree?
Traditional Parsimony
Each base change has a cost of 1
Weighted Parsimony
Each change is weighted by the score c(a,b)

8
Traditional Parsimony
Solved independently for each position
Linear time solution

[Figure: example tree with node labels a, a, {a,g}, a]
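The linear-time solution for unit costs is usually attributed to Fitch: each internal node gets the intersection of its children's candidate sets, or their union plus one change when the intersection is empty. The sketch below is a minimal illustration on the five sequences from the simple example. The topology `(((A,B),C),(D,E))` and the function names are assumptions made for the example and are not taken from the slides.

```python
def fitch_column(tree, states):
    """Return (candidate state set, number of changes) for one alignment column.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping each leaf name to its character in this column
    """
    if isinstance(tree, str):               # leaf: observed character, no changes
        return {states[tree]}, 0
    left_set, left_cost = fitch_column(tree[0], states)
    right_set, right_cost = fitch_column(tree[1], states)
    common = left_set & right_set
    if common:                              # children can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-column scores; columns are scored independently."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_column(tree, {name: seq[col] for name, seq in alignment.items()})[1]
        for col in range(length)
    )


alignment = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA",
             "D": "TGCACT", "E": "TGCGTA"}
tree = ((("A", "B"), "C"), ("D", "E"))      # hypothetical topology
first_column = {name: seq[0] for name, seq in alignment.items()}

print(fitch_column(tree, first_column)[1])  # 1 change in the first column
print(parsimony_score(tree, alignment))     # 8 for this assumed topology
```

A different topology would in general give a different total, which is why the score is a property of the tree and not of the alignment alone.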
9
Evaluating Weighted Parsimony

Dynamic programming on the tree
$S(i,a)$ = cost of the tree rooted at $i$ if $i$ is labeled by $a$
Initialization
For each leaf $i$, set $S(i,a) = 0$ if $i$ is labeled by $a$; otherwise $S(i,a) = \infty$
Iteration
If $k$ is a node with children $i$ and $j$, then
$S(k,a) = \min_b \left( S(i,b) + c(a,b) \right) + \min_b \left( S(j,b) + c(a,b) \right)$
Termination
Cost of the tree is $\min_a S(r,a)$, where $r$ is the root
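The recursion above can be sketched directly as a recursive function over a nested-tuple tree. This is a minimal illustration, not the presenter's implementation: the topology, the column (the sixth column of the simple example), and the transition/transversion weights in `ts_tv_cost` are all assumptions chosen for the example.

```python
import math

ALPHABET = "ACGT"
PURINES = {"A", "G"}


def sankoff(tree, states, cost):
    """Return {a: S(node, a)} for the root of `tree` on one alignment column."""
    if isinstance(tree, str):               # initialization at a leaf
        return {a: 0 if a == states[tree] else math.inf for a in ALPHABET}
    s_i = sankoff(tree[0], states, cost)    # tables of the two children
    s_j = sankoff(tree[1], states, cost)
    return {
        a: min(s_i[b] + cost(a, b) for b in ALPHABET)
           + min(s_j[b] + cost(a, b) for b in ALPHABET)
        for a in ALPHABET
    }


def unit_cost(a, b):
    """Traditional parsimony: every change costs 1."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2


tree = ((("A", "B"), "C"), ("D", "E"))      # hypothetical topology
column = {"A": "A", "B": "A", "C": "A", "D": "T", "E": "A"}

unit_total = min(sankoff(tree, column, unit_cost).values())
weighted_total = min(sankoff(tree, column, ts_tv_cost).values())
print(unit_total, weighted_total)           # 1 2
```

With unit costs the result matches the traditional parsimony count; the weighted run charges 2 because the single A/T change is a transversion under the assumed weights.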

10
Cost of Evaluating Parsimony

Score is evaluated on each position independently.
Scores are then summed over all positions.
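If there are n nodes, m characters, and k
possible values for each character, then
complexity is O(nmk) for traditional parsimony
and O(nmk^2) for the weighted dynamic program,
where each table entry takes a minimum over the
k possible child labels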
By keeping traceback information, we can
reconstruct most parsimonious values at each
ancestor node

11
Maximum Parsimony
Position    1 2 3 4 5 6 7 8 9 10
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
How many possible unrooted trees?
12
Maximum Parsimony
How many possible unrooted trees?
Position    1 2 3 4 5 6 7 8 9 10
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
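For N labeled leaves, the number of distinct unrooted bifurcating topologies is the double factorial (2N-5)!! = 3 · 5 · ... · (2N-5), so four species give 3 trees. The short sketch below computes this count; the function name is an assumption made for the example.

```python
def count_unrooted_trees(n_leaves):
    """Number of unrooted bifurcating trees on n labeled leaves: (2n - 5)!!."""
    if n_leaves < 3:
        return 1                            # a single (degenerate) topology
    count = 1
    for factor in range(3, 2 * n_leaves - 4, 2):   # 3, 5, ..., 2n - 5
        count *= factor
    return count


for n in (4, 5, 10):
    print(n, count_unrooted_trees(n))       # 4 3 / 5 15 / 10 2027025
```

The count grows so quickly that exhaustive enumeration is only practical for a handful of species.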
13
Maximum Parsimony
How many substitutions?
MP
14
Maximum Parsimony
Pos 1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
15
Maximum Parsimony
Pos 1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
16
Maximum Parsimony
4
1 - G
2 - C
3 - T
4 - A
17
Maximum Parsimony
Pos 1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
18
Maximum Parsimony
Pos 1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
19
Maximum Parsimony
4
1 - G
2 - A
3 - A
4 - G
20
Maximum Parsimony
21
Maximum Parsimony
Pos 1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
Substitutions per position: 0 3 2 2 0 1 1 1 1 3 (total 14)
22
Searching for Trees
23
Searching for the Optimal Tree

Exhaustive Search
Very intensive
Branch and Bound
A compromise
Heuristic
Fast
Usually starts with a neighbor-joining (NJ) tree

24
Phylogenetic Tree Assumptions

Topology: bifurcating
Leaves: 1…N
Internal nodes: N+1…2N-2
Lengths: t = {t_i} for each branch
Phylogenetic tree = (Topology, Lengths) = (T, t)

25
Probabilistic Methods

The phylogenetic tree represents a generative
probabilistic model (like HMMs) for the observed
sequences.
Background probabilities q(a)
Mutation probabilities P(a | b, t)
Models for evolutionary mutations
Jukes Cantor
Kimura 2-parameter model
Such models are used to derive the probabilities

26
Jukes Cantor model

A model for mutation rates

Mutation occurs at a constant rate
Each nucleotide is equally likely to mutate into
any other nucleotide with rate α.

27
Kimura 2-parameter model

Allows a different rate for transitions and
transversions.

28
Mutation Probabilities

The rate matrix R is used to derive the mutation
probability matrix S
S is obtained by integration. For Jukes Cantor
q can be obtained by setting t to infinity
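For Jukes Cantor, the textbook closed form is 1/4 + 3/4·exp(-4αt) for staying at the same nucleotide and 1/4 - 1/4·exp(-4αt) for each of the three possible changes, where α is the per-nucleotide rate from the Jukes Cantor slide. The sketch below builds that matrix and shows the limit of large t numerically. The function name, the default `alpha=1.0`, and the time values are assumptions made for the example.

```python
import math

NUCLEOTIDES = "ACGT"


def jukes_cantor(t, alpha=1.0):
    """Substitution probabilities after time t under Jukes Cantor.

    alpha is the rate at which a nucleotide changes into each of the
    other three nucleotides (assumed value; not given in the slides).
    """
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay              # no net change
    diff = 0.25 - 0.25 * decay              # change to one specific other base
    return {a: {b: same if a == b else diff for b in NUCLEOTIDES}
            for a in NUCLEOTIDES}


at_zero = jukes_cantor(0.0)                 # identity matrix: nothing has changed
short_time = jukes_cantor(0.1)
long_time = jukes_cantor(50.0)              # every entry approaches 1/4

print(short_time["A"]["A"], short_time["A"]["C"])
print(long_time["A"]["A"], long_time["A"]["C"])
```

Each row sums to 1 for every t, and for large t every entry tends to 1/4, which is the uniform background distribution q for this model.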

29
Mutation Probabilities

Both models satisfy the following properties
Lack of memory
Reversibility
Exist stationary probabilities Pa s.t.
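In their usual textbook form, lack of memory means that the matrix for time s+t is the product of the matrices for s and t, and reversibility means that P_a · P(a to b, t) = P_b · P(b to a, t) for the stationary probabilities P_a. The sketch below checks both numerically for the Kimura 2-parameter model using its standard closed form. The rate values, the time values, and the function names are assumptions made for the example.

```python
import math

NUCLEOTIDES = "AGCT"
PURINES = {"A", "G"}


def kimura(t, alpha, beta):
    """Kimura 2-parameter probabilities: alpha = transition rate,
    beta = rate of each of the two possible transversions."""
    tv = 0.25 - 0.25 * math.exp(-4.0 * beta * t)
    ts = (0.25 + 0.25 * math.exp(-4.0 * beta * t)
          - 0.5 * math.exp(-2.0 * (alpha + beta) * t))
    same = 1.0 - ts - 2.0 * tv

    def entry(a, b):
        if a == b:
            return same
        return ts if (a in PURINES) == (b in PURINES) else tv

    return {a: {b: entry(a, b) for b in NUCLEOTIDES} for a in NUCLEOTIDES}


def compose(p, q):
    """Matrix product: evolve with p, then with q."""
    return {a: {c: sum(p[a][b] * q[b][c] for b in NUCLEOTIDES)
                for c in NUCLEOTIDES}
            for a in NUCLEOTIDES}


alpha, beta, s, t = 2.0, 0.5, 0.3, 0.7      # hypothetical values
two_steps = compose(kimura(s, alpha, beta), kimura(t, alpha, beta))
one_step = kimura(s + t, alpha, beta)

# Lack of memory: evolving for s and then t equals evolving for s + t.
memoryless = all(abs(two_steps[a][b] - one_step[a][b]) < 1e-12
                 for a in NUCLEOTIDES for b in NUCLEOTIDES)
# Reversibility with the uniform stationary distribution (1/4 each).
reversible = all(abs(0.25 * one_step[a][b] - 0.25 * one_step[b][a]) < 1e-12
                 for a in NUCLEOTIDES for b in NUCLEOTIDES)
print(memoryless, reversible)
```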

30
Probabilistic Approach

Given P,q, the tree topology and branch lengths,
we can compute

[Figure: tree with nodes x1, x2, x3, x4, x5 and branch lengths t1, t2, t3, t4]
31
Computing the Tree Likelihood

We are interested in the probability of observed
data given tree and branch lengths
Computed by summing over internal nodes
This can be done efficiently using a tree upward
traversal pass.

32
Tree Likelihood Computation

Define $P(L_k \mid a)$ = prob. of leaves below node $k$ given that $x_k = a$
Init: for leaves, $P(L_k \mid a) = 1$ if $x_k = a$; 0 otherwise
Iteration: if $k$ is a node with children $i$ and $j$, then
Termination: Likelihood is
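This upward pass is commonly known as Felsenstein's pruning algorithm. In its standard form the iteration multiplies, over the two children, the sum over child states b of P(b | a, t_child) · P(L_child | b), and the likelihood is the sum over a of q(a) · P(L_root | a). The sketch below is a minimal illustration under stated assumptions: Jukes Cantor probabilities with rate 1, uniform q, made-up branch lengths, and a tree shaped like the five-node example (root x5 with children x4 and x3; x4 with children x1 and x2). It cross-checks the result against an explicit sum over the internal nodes.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def jc_prob(child, parent, t, alpha=1.0):
    """Jukes Cantor probability of `child` given `parent` after time t."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if child == parent else 0.25 - 0.25 * decay


def conditional(tree, leaves):
    """Return {a: P(L_k | a)} for the node k at the root of `tree`.

    tree is a leaf name (str) or a pair ((left, t_left), (right, t_right)).
    """
    if isinstance(tree, str):
        return {a: 1.0 if leaves[tree] == a else 0.0 for a in NUCLEOTIDES}
    result = {a: 1.0 for a in NUCLEOTIDES}
    for child, t in tree:
        below = conditional(child, leaves)
        for a in NUCLEOTIDES:
            result[a] *= sum(jc_prob(b, a, t) * below[b] for b in NUCLEOTIDES)
    return result


def likelihood(tree, leaves):
    """Sum over root states, weighted by the uniform background q(a) = 1/4."""
    root = conditional(tree, leaves)
    return sum(0.25 * root[a] for a in NUCLEOTIDES)


def brute_force(leaves, t1, t2, t3, t4):
    """Explicit sum over the internal nodes x4 and x5, for comparison."""
    return sum(
        0.25 * jc_prob(x4, x5, t4) * jc_prob(leaves["x3"], x5, t3)
        * jc_prob(leaves["x1"], x4, t1) * jc_prob(leaves["x2"], x4, t2)
        for x4, x5 in product(NUCLEOTIDES, repeat=2)
    )


# Hypothetical branch lengths: t1 = 0.1, t2 = 0.2, t4 = 0.3, t3 = 0.4.
tree = (((("x1", 0.1), ("x2", 0.2)), 0.3), ("x3", 0.4))
leaves = {"x1": "A", "x2": "A", "x3": "G"}
print(likelihood(tree, leaves), brute_force(leaves, 0.1, 0.2, 0.4, 0.3))
```

The explicit sum has one term per joint assignment of the internal nodes, so it grows exponentially with tree size, while the upward pass visits each node once.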

33
Maximum Likelihood (ML)

Score each tree by
Assumption of independent positions
Branch lengths t can be optimized
Gradient ascent
EM
We look for the highest scoring tree
Exhaustive search
Sampling methods (Metropolis)

34
Optimal Tree Search

Perform search over possible topologies

Parameter space
Parametric optimization (EM)
Local Maxima
35
Computational Problem

Such procedures are computationally expensive!
Computation of optimal parameters, per candidate,
requires non-trivial optimization step.
Spend non-negligible computation on a candidate,
even if it is a low scoring one.
In practice, such learning procedures can only
consider small sets of candidate structures

36
Structural EM

Idea: use parameters found for the current topology
to help evaluate new topologies.
Outline
Perform search in (T, t) space.
Use EM-like iterations
E-step: use current solution to compute expected
sufficient statistics for all topologies
M-step: select new topology based on these
expected sufficient statistics

37
The Complete-Data Scenario

Suppose we observe H, the ancestral sequences.

38
Expected Likelihood

Start with a tree (T0,t0)
Compute
Formal justification
Define
Theorem
Consequence: improvement in expected score ⇒
improvement in likelihood

39
Proof

Theorem
Simple application of Jensen's inequality
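The Jensen step can be sketched as follows. This is the standard EM argument written under assumed notation, since the slide's own formulas did not survive in the transcript: D is the observed sequences, H the ancestral sequences, l(T,t) = log P(D | T,t), and Q(T,t) is the expected complete-data log-likelihood computed with the current tree (T^0, t^0).

```latex
\begin{align*}
Q(T,t) &= \sum_{H} P(H \mid D, T^0, t^0)\, \log P(D, H \mid T, t) \\[4pt]
l(T,t) - l(T^0,t^0)
  &= \log \frac{P(D \mid T,t)}{P(D \mid T^0,t^0)}
   = \log \sum_{H} P(H \mid D, T^0, t^0)\,
       \frac{P(D, H \mid T, t)}{P(D, H \mid T^0, t^0)} \\[4pt]
  &\ge \sum_{H} P(H \mid D, T^0, t^0)\,
       \log \frac{P(D, H \mid T, t)}{P(D, H \mid T^0, t^0)}
   = Q(T,t) - Q(T^0,t^0)
\end{align*}
```

The inequality is Jensen's inequality applied to the concave logarithm, so any (T,t) that raises Q by some amount raises the log-likelihood by at least that amount.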

40
Algorithm Outline
Unlike standard EM for trees, we compute all
possible pairwise statistics
Time: $O(N^2 M)$
41
Algorithm Outline
Pairwise weights
This stage also computes the branch length for
each pair (i,j)
42
Algorithm Outline
Max. Spanning Tree
Fast greedy procedure to find tree
By construction, Q(T,t) ≥ Q(T0,t0)
Thus, l(T,t) ≥ l(T0,t0)
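One common greedy procedure for a maximum spanning tree is Kruskal's algorithm run on edges sorted by decreasing weight. The slides do not say which procedure is used, so the sketch below is only an illustration; the node names and the pairwise weights are made up for the example.

```python
def maximum_spanning_tree(weights):
    """Kruskal's algorithm on edges sorted by decreasing weight.

    weights -- dict mapping an unordered pair (i, j) to its weight
    Returns the list of chosen edges.
    """
    parent = {}

    def find(x):
        """Union-find root lookup with path halving."""
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    for (i, j), _weight in ranked:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                # edge joins two components: keep it
            parent[root_i] = root_j
            chosen.append((i, j))
    return chosen


# Hypothetical pairwise weights between four nodes.
weights = {("a", "b"): 5.0, ("a", "c"): 1.0, ("b", "c"): 4.0,
           ("a", "d"): 2.0, ("c", "d"): 3.0, ("b", "d"): 0.5}
tree_edges = maximum_spanning_tree(weights)
print(tree_edges, sum(weights[edge] for edge in tree_edges))
```

Because the score of a tree decomposes into a sum of pairwise weights, the spanning tree with the largest total weight is the highest-scoring topology over the given nodes.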
43
Algorithm Outline
Fix Tree
Remove redundant nodes
Add nodes to break large degree
This operation preserves likelihood:
l(T1,t) = l(T,t) ≥ l(T0,t0)
44
Assessing trees: the Bootstrap