Title: Phylogenetic Tree Reconstruction
1. Phylogenetic Tree Reconstruction
2. Phylogenetic Tree
- A tree that represents the relationships among a
set of species or genetic sequences is called a
phylogenetic tree.
3. Phylogenetic Tree
- Phylogeny: the relationship of species.
- Leaves: species or sequences (OTUs, operational
taxonomic units).
- Internal nodes: ancestors of particular groups of
the OTUs.
- Branch length: the degree of relatedness between
the species or sequences corresponding to the
nodes at the endpoints of the branch.
4. Orthologues / Paralogues
- Genes which diverged because of speciation are
called orthologues. Above: a tree of orthologues
based on a set of alpha haemoglobins.
- Genes which diverged by gene duplication are
called paralogues. Below: a tree of paralogues,
the alpha, beta, gamma, delta, epsilon, zeta and
theta chains of human haemoglobins, and
myoglobin.
5. Rooted / Unrooted Tree
6. Counting Trees
7. Counting Trees
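As a rough illustration of why exhaustive tree search quickly becomes infeasible, the standard counts of binary trees on n labelled leaves, (2n-5)!! unrooted and (2n-3)!! rooted, can be computed as below. This is a minimal sketch; the function names are illustrative and do not come from the slides.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n: int) -> int:
    """Number of unrooted binary trees on n >= 3 labelled leaves: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n - 5)


def count_rooted_trees(n: int) -> int:
    """Number of rooted binary trees on n >= 2 labelled leaves: (2n - 3)!!."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (4, 5, 10):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For 10 leaves this gives 2,027,025 unrooted and 34,459,425 rooted trees.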
8. Rooting the tree
To root a tree mentally, imagine that the tree is
made of string. Grab the string at the root
and tug on it until the ends of the string (the
taxa) fall opposite the root.
9. Steps of phylogenetic tree reconstruction
- Choosing a family of homologous sequences.
- Aligning the sequences and obtaining a reduced
multiple alignment by discarding the columns that
contain gaps.
- Inferring a phylogenetic tree from the reduced
multiple alignment.
10. Methods of Phylogenetic tree reconstruction
- Maximum parsimony methods
- Distance methods
- Probabilistic methods arising from the maximum
likelihood approach
11. Maximum Parsimony Method
          Site
  OTU  1 2 3 4 5 6 7 8 9
  ----------------------
  1    T C A G A T C A A
  2    T T A G A A C A A
  3    T T C G A T C G A
  4    T T C T A A G G A
- Target: find rooted tree topologies, not branch
lengths.
- Principle: search for the tree that requires the
smallest number of character state changes
between the OTUs (the sequences are assumed to
evolve in the most economical way).
- Operation: use only the informative sites, that
is, sites with at least two different kinds of
residues, each of which is found in
at least two of the OTU sequences.
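The informative-site rule can be checked mechanically on the four OTUs in the table. The sketch below is only an illustration; `informative_sites` is a name chosen here, not something from the slides.

```python
from collections import Counter


def informative_sites(sequences):
    """Return the 1-based positions of parsimony-informative sites.

    A site is informative when at least two different residues occur there
    and each of them is found in at least two sequences.
    """
    sites = []
    for position, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        shared = [residue for residue, count in counts.items() if count >= 2]
        if len(shared) >= 2:
            sites.append(position)
    return sites


# The four OTUs from the table, one string per row.
otus = ["TCAGATCAA", "TTAGAACAA", "TTCGATCGA", "TTCTAAGGA"]

if __name__ == "__main__":
    print(informative_sites(otus))
```

For this alignment only sites 3, 6 and 8 are informative; the other six columns cannot favour one topology over another.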
12. Maximum Parsimony
13. Maximum Parsimony
14. Traditional Parsimony
- If N (the number of OTUs) is moderate, scoring
every candidate tree with Fitch's algorithm is
realistic.
- If N is large, the branch and bound algorithm
should be used.
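For a fixed topology, Fitch's algorithm counts the minimum number of state changes site by site. The following is a minimal sketch, assuming a rooted binary tree written as nested pairs of leaf names (a representation chosen here for brevity) and the four OTUs of the parsimony table.

```python
def fitch_site(tree, states):
    """Return (candidate_states, changes) for one site of a rooted binary tree.

    tree   : a leaf name, or a pair (left_subtree, right_subtree)
    states : dict mapping each leaf name to its residue at this site
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Otherwise take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Sum the Fitch change counts over all alignment columns."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in sequences.items()})[1]
        for i in range(length)
    )


sequences = {"1": "TCAGATCAA", "2": "TTAGAACAA", "3": "TTCGATCGA", "4": "TTCTAAGGA"}
topologies = [(("1", "2"), ("3", "4")), (("1", "3"), ("2", "4")), (("1", "4"), ("2", "3"))]

if __name__ == "__main__":
    for topology in topologies:
        print(topology, parsimony_score(topology, sequences))
```

On this alignment the three topologies need 7, 8 and 9 changes respectively, so parsimony prefers grouping OTUs 1 and 2 against 3 and 4.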
15. Distance Methods
- Reconstruct trees (rooted or unrooted, depending
on the method) from a set of pairwise distances,
d = (d_ij), between the sequences in a fixed
reduced multiple alignment.
- Given a tree relating the OTUs, obtain a
tree-generated distance matrix d'.
- Question: is d = d'? In other words, is the
distance function additive?
16. Additivity
- Theorem (four-point condition): Let d be a
distance function on M (a set of N OTUs) with
N ≥ 4. Then d is additive if and only if the
following condition holds for every set of four
distinct indices 1 ≤ i, j, k, l ≤ N: two of the sums
d_ij + d_kl, d_ik + d_jl, d_il + d_jk
coincide and are greater than or equal to the
third one. (Saitou and Nei, 1987, Mol. Biol.
Evol.)
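The four-point condition translates directly into a check over all quadruples of OTUs. The sketch below assumes the distances are given as a symmetric list of lists; the example matrix is made up here from a four-leaf tree with leaf edges 1, 2, 3, 4 and an internal edge of length 5.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix d."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        # After sorting, the two largest sums must coincide; the smallest
        # one is then automatically less than or equal to them.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical tree ((A, B), (C, D)): leaf edges 1, 2, 3, 4, internal edge 5.
tree_distances = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

# The same matrix with one entry perturbed is no longer additive.
perturbed = [row[:] for row in tree_distances]
perturbed[0][3] = perturbed[3][0] = 12

if __name__ == "__main__":
    print(is_additive(tree_distances), is_additive(perturbed))
```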
17. Additivity
18. Distance function
- The distance score is counted as either
  - the number of mismatched positions in the alignment, or
  - the number of sequence positions that must be changed
    to generate the second sequence.
- Success depends on the degree to which the distances
among a set of sequences can be made additive on a
predicted evolutionary tree.
19. Example of Distance Analysis
- Distances can be shown as a table
  A  ACGCGTTGGGCGATGGCAAC
  B  ACGCGTTGGGCGACGGTAAT
  C  ACGCATTGAATGATGATAAT
  D  ACACATTGAGTGATAATAAT
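The distance table for these four sequences can be recomputed by counting mismatched positions. This is a small sketch of that calculation; the helper names are illustrative.

```python
def mismatch_distance(s, t):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))


seqs = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}

# Full symmetric table, keyed by pairs of sequence names.
distance = {
    (p, q): mismatch_distance(seqs[p], seqs[q]) for p in seqs for q in seqs
}

if __name__ == "__main__":
    names = list(seqs)
    print("   " + "  ".join(names))
    for p in names:
        print(p, " ".join(f"{distance[(p, q)]:2d}" for q in names))
```

The resulting distances are A-B = 3, A-C = 7, A-D = 8, B-C = 6, B-D = 7 and C-D = 3. The sums d_AC + d_BD and d_AD + d_BC are both 14 and exceed d_AB + d_CD = 6, so this small matrix happens to satisfy the four-point condition.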
20. Neighbour joining
- Very popular method
- Produces an unrooted tree
- Assumes additivity: the distance between a pair of
leaves equals the sum of the lengths of the edges
connecting them, that is, d = d'.
- Constructs the tree by sequentially joining subtrees
21. Neighbor Joining: once we know the correct (i, j)
pair
22. Neighbour Joining
23. Neighbor joining algorithm
24. Neighbour Joining: why not pick the (i, j) pair
with the smallest distance?
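The closest pair of leaves need not be neighbours when branch lengths are unequal, which is why neighbour joining corrects each distance by the average distance to all other nodes. Below is a minimal sketch of the standard algorithm, using the criterion Q(i, j) = (n - 2) d_ij - r_i - r_j with r_i the sum of distances from i. The example matrix is hypothetical: it is generated from a tree in which A and B are neighbours (leaf edges 1 and 4), C and D are neighbours (leaf edges 1 and 4) and the internal edge has length 1, so the smallest distance is between the non-neighbours A and C.

```python
def neighbour_joining(names, d):
    """Return the edges (node, other_node, length) of an unrooted tree.

    names : list of leaf labels
    d     : dict of dicts, d[a][b] is the distance between a and b
    """
    dist = {a: {b: float(d[a][b]) for b in names if b != a} for a in names}
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] is the summed distance from a to every other active node.
        r = {a: sum(dist[a][b] for b in nodes if b != a) for a in nodes}
        pairs = [(a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]]
        # Join the pair minimising Q, not simply the pair with the smallest d_ij.
        i, j = min(pairs, key=lambda p: (n - 2) * dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
        new = f"U{len(edges) // 2}"  # hypothetical label for the new internal node
        len_i = 0.5 * dist[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = dist[i][j] - len_i
        edges += [(i, new, len_i), (j, new, len_j)]
        # Distances from the new internal node to every remaining node.
        dist[new] = {}
        for k in nodes:
            if k not in (i, j):
                dist[new][k] = dist[k][new] = 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, dist[a][b]))
    return edges


names = ["A", "B", "C", "D"]
d = {
    "A": {"B": 5, "C": 3, "D": 6},
    "B": {"A": 5, "C": 6, "D": 9},
    "C": {"A": 3, "B": 6, "D": 5},
    "D": {"A": 6, "B": 9, "C": 5},
}

if __name__ == "__main__":
    for edge in neighbour_joining(names, d):
        print(edge)
```

On this matrix the first join is A with B, although d_AC = 3 is the smallest entry, and the recovered edge lengths match the generating tree. Picking the smallest distance instead would have joined A and C and produced the wrong topology.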
25. Example of Distance Analysis
- Using this information, a tree can be drawn
  A  ACGCGTTGGGCGATGGCAAC
  B  ACGCGTTGGGCGACGGTAAT
  C  ACGCATTGAATGATGATAAT
  D  ACACATTGAGTGATAATAAT
26. Drawbacks of neighbor joining
- In practice, the distance function is often a
pseudodistance function, which does not satisfy
the four-point condition. (Even the triangle
inequality may fail to hold.)
- The algorithm may produce more than one tree,
these trees may have branches of negative
length, and the matrix d' may not coincide with the
original distance matrix d.
27. Special distance function: ultrametric distance
- Definition: A distance function d on a set M of
OTUs is called ultrametric if, for any three
distinct elements x_i, x_j, x_k, two of the
distances d_ij, d_ik, d_jk coincide and are
greater than or equal to the third.
- It satisfies the four-point condition.
- It is additive, and can be recovered from the
generated phylogenetic tree.
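The three-point definition can be tested in the same way as the four-point condition. The sketch below uses two made-up matrices: one generated from a clock-like tree, and one that is additive but not ultrametric.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix d."""
    for i, j, k in combinations(range(len(d)), 3):
        smallest, middle, largest = sorted([d[i][j], d[i][k], d[j][k]])
        # The two largest distances of every triple must coincide.
        if abs(largest - middle) > tol:
            return False
    return True


# Hypothetical clock-like tree: x0 and x1 join at height 1, x2 joins at height 3.
clock_like = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]

# Hypothetical additive matrix (tree with unequal leaf edges) that is not ultrametric.
unequal_rates = [
    [0, 5, 3, 6],
    [5, 0, 6, 9],
    [3, 6, 0, 5],
    [6, 9, 5, 0],
]

if __name__ == "__main__":
    print(is_ultrametric(clock_like), is_ultrametric(unequal_rates))
```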
28. UPGMA: Unweighted Pair Group Method with
Arithmetic mean (sequential clustering method)
29. UPGMA: distance function between two clusters
30. UPGMA
31. UPGMA step 1: combine B and C
32. UPGMA step 2: combine BC and D
(10 + 12)/2
(4 + 6)/2
33. UPGMA step 3: combine A and E
34. UPGMA step 4: combine AE and BCD
35. UPGMA Result
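The clustering steps can be sketched in a few lines, taking the distance between two clusters to be the average of all pairwise distances between their members. The distance matrix below is hypothetical; it was chosen only so that the merges occur in the same order as in the steps above (B with C, then BC with D, then A with E, then AE with BCD) and does not reproduce the numbers on the slides.

```python
from itertools import combinations


def upgma(names, d):
    """Return (tree, merges): a nested-tuple tree and a list of (left, right, height)."""
    size = {name: 1 for name in names}
    dist = {frozenset((a, b)): float(d[a][b]) for a, b in combinations(names, 2)}
    active = list(names)
    merges = []
    while len(active) > 1:
        # Join the two closest clusters; the new node sits at half their distance.
        a, b = min(combinations(active, 2), key=lambda p: dist[frozenset(p)])
        new = (a, b)
        merges.append((a, b, dist[frozenset((a, b))] / 2))
        size[new] = size[a] + size[b]
        for c in active:
            if c not in (a, b):
                # Size-weighted mean, equal to the average over all member pairs.
                dist[frozenset((new, c))] = (
                    size[a] * dist[frozenset((a, c))] + size[b] * dist[frozenset((b, c))]
                ) / size[new]
        active = [c for c in active if c not in (a, b)] + [new]
    return active[0], merges


names = ["A", "B", "C", "D", "E"]
d = {
    "A": {"B": 8, "C": 8, "D": 8, "E": 6},
    "B": {"A": 8, "C": 2, "D": 4, "E": 8},
    "C": {"A": 8, "B": 2, "D": 4, "E": 8},
    "D": {"A": 8, "B": 4, "C": 4, "E": 8},
    "E": {"A": 6, "B": 8, "C": 8, "D": 8},
}

if __name__ == "__main__":
    tree, merges = upgma(names, d)
    print(tree)
    for left, right, height in merges:
        print(left, right, height)
```

Because this example matrix is ultrametric, the node heights 1, 2, 3 and 4 reproduce the input distances exactly.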
37. When UPGMA fails
38. Maximum Likelihood method
- Assumption: maximum likelihood supposes a model
of evolution along tree branches.
- Strategy: find the parameters (tree, branch lengths,
substitution rate) that maximize the likelihood
assigned to the data.
- Note: the model of evolution does not include
insertion and deletion of nucleotides.
- In the Phylip package: program PROTML
39. Probabilistic Methods
- The phylogenetic tree represents a generative
probabilistic model (like HMMs) for the observed
sequences.
- Background probabilities: q(a)
- Mutation probabilities: P(a -> b, t)
- Models for evolutionary mutations:
  - Jukes-Cantor
  - Kimura model
  - Felsenstein model
  - Hasegawa-Kishino-Yano model
40. Jukes-Cantor model
- A model for mutation rates
- Mutation occurs at a constant rate
- Each nucleotide is equally likely to mutate into
any other nucleotide, with rate alpha.
41. Kimura 2-parameter model
- Allows different rates for transitions and
transversions.
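To make the two-parameter idea concrete, the sketch below evaluates the standard closed-form substitution probabilities of the Kimura model. It assumes a particular parametrisation that is not spelled out in the slides: `alpha` is the rate of the single transition available to a nucleotide and `beta` is the rate of each of its two transversions.

```python
import math


def k2p_probabilities(alpha, beta, t):
    """Return (p_same, p_transition, p_transversion) after time t.

    p_transversion is the probability of ONE specific transversion;
    every nucleotide has two possible transversion targets.
    """
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_transition = 0.25 + 0.25 * e1 - 0.5 * e2
    p_transversion = 0.25 - 0.25 * e1
    return p_same, p_transition, p_transversion


if __name__ == "__main__":
    # Transitions four times as fast as each transversion (illustrative values).
    same, transition, transversion = k2p_probabilities(alpha=0.4, beta=0.1, t=1.0)
    print(same, transition, transversion, same + transition + 2 * transversion)
```

When `alpha` equals `beta` the three values collapse to the Jukes-Cantor case, in which all substitutions are equally likely.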
42. Mutation Probabilities
- The rate matrix R is used to derive the mutation
probability matrix S.
- S is obtained by integration, S(t) = exp(Rt). For
Jukes-Cantor this has a simple closed form.
- q can be obtained by setting t to infinity.
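For the Jukes-Cantor model the integration gives the well-known closed form: a nucleotide is unchanged after time t with probability (1 + 3 exp(-4 alpha t))/4 and becomes any specific other nucleotide with probability (1 - exp(-4 alpha t))/4. The sketch below builds S(t) from these two values and shows the limit for large t; the function name and the dictionary layout are choices made here for illustration.

```python
import math

NUCLEOTIDES = "ACGT"


def jukes_cantor_matrix(alpha, t):
    """Substitution probabilities S(t)[a][b] under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 * (1.0 + 3.0 * decay)
    different = 0.25 * (1.0 - decay)
    return {
        a: {b: (same if a == b else different) for b in NUCLEOTIDES}
        for a in NUCLEOTIDES
    }


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 100.0):
        row = jukes_cantor_matrix(alpha=1.0, t=t)["A"]
        print(t, [round(row[b], 4) for b in NUCLEOTIDES])
```

As t grows every row approaches (1/4, 1/4, 1/4, 1/4), which is the background distribution q for this model.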
43. Mutation Probabilities
- All models satisfy the following properties:
- Markovian property:
  P(a -> c, s + t) = sum_b P(a -> b, s) P(b -> c, t)
- Reversibility: there exist stationary
probabilities P_a s.t.
  P_a P(a -> b, t) = P_b P(b -> a, t)
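44. Probabilistic Approach
- Given P, q, the tree topology and branch lengths,
we can compute the likelihood of the observed
sequences.
45. Computing the Tree Likelihood
46. Tree Likelihood Computation
- Define P(L_k | a): the probability of the leaves
below node k given that x_k = a.
- Initialization: for leaves, P(L_k | a) = 1 if
x_k = a, and 0 otherwise.
- Iteration: if k is a node with children i and j
(on branches of length t_i and t_j), then
  P(L_k | a) = [sum_b P(a -> b, t_i) P(L_i | b)] [sum_c P(a -> c, t_j) P(L_j | c)]
- Termination: the likelihood is
  sum_a q(a) P(L_root | a)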
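The recursion can be written out for a single alignment column as follows. This is a minimal sketch, not the implementation used in any package: it assumes the Jukes-Cantor model with uniform background probabilities, and it represents a tree as nested pairs `((child_i, t_i), (child_j, t_j))` with a leaf given by its observed nucleotide, a layout invented here for brevity.

```python
import math

NUCLEOTIDES = "ACGT"
Q = {a: 0.25 for a in NUCLEOTIDES}  # background probabilities q(a)


def jc_prob(a, b, t, alpha=1.0):
    """Jukes-Cantor probability P(a -> b, t)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def conditional_likelihoods(node):
    """Return {a: P(L_k | a)} for the subtree rooted at node."""
    if isinstance(node, str):
        # Initialization: a leaf is certain about its observed nucleotide.
        return {a: 1.0 if a == node else 0.0 for a in NUCLEOTIDES}
    (child_i, t_i), (child_j, t_j) = node
    below_i = conditional_likelihoods(child_i)
    below_j = conditional_likelihoods(child_j)
    # Iteration: combine the two children, summing over their unknown states.
    return {
        a: sum(jc_prob(a, b, t_i) * below_i[b] for b in NUCLEOTIDES)
        * sum(jc_prob(a, c, t_j) * below_j[c] for c in NUCLEOTIDES)
        for a in NUCLEOTIDES
    }


def site_likelihood(root):
    """Termination: average the root's conditional likelihoods over q."""
    below_root = conditional_likelihoods(root)
    return sum(Q[a] * below_root[a] for a in NUCLEOTIDES)


if __name__ == "__main__":
    inner = (("A", 0.1), ("C", 0.2))
    root = ((inner, 0.3), ("A", 0.4))
    print(site_likelihood(root))
```

For a whole alignment the per-column likelihoods would be multiplied (or their logarithms summed), which relies on the columns being treated as independent.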
47. Maximum Likelihood (ML)
- Score each tree by the likelihood of the data
given the tree and its branch lengths
  - Assumption of independent positions
- Branch lengths t can be optimized
  - Gradient ascent
  - EM
- We look for the highest scoring tree
  - Exhaustive
  - Sampling methods (Metropolis)
48. Optimal Tree Search
- Perform search over possible topologies
49. Computational Problem
- Such procedures are computationally expensive!
- Computation of optimal parameters, per candidate,
requires a non-trivial optimization step.
- We spend non-negligible computation on a candidate,
even if it is a low scoring one.
- In practice, such learning procedures can only
consider small sets of candidate structures.
50. Max Likelihood versus Parsimony
- (Example from BSA p. 225)
- Choose tree T, with unequal branch lengths.
- Generate 1000 sequences of length N according to
the probabilistic model.
- (A) Reconstruction by ML; (B) Reconstruction by
Parsimony
51. Max Likelihood versus NJ
- (Example from BSA p. 225)
- Choose tree T, with unequal branch lengths.
- Generate 1000 sequences of length N according to
the probabilistic model.
- (A) Reconstruction by ML; (B) Reconstruction by NJ
Conclusion: ML infers the right tree as N gets
larger. If the probabilistic model is correct,
the ML distances should be very close to additive,
so the NJ method also predicts the correct
tree.
52. Phylip: practicalities
- Menu-driven, no command line
- Input file format
  - First line: <number of sequences> <number of
    letters per sequence>
  - Next lines: sequences
    - The first ten characters are the sequence name
    - Then the sequence follows. Spaces and newlines are
      allowed.
    - Dashes (-) signify gaps
- Example
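The layout described above can be produced with a few lines of code. The sketch below writes a small, made-up alignment in that format; `to_phylip` and the sequence names are hypothetical, and only the rules listed above (counts on the first line, a ten-character name field, dashes for gaps) are assumed.

```python
def to_phylip(sequences):
    """Format a dict {name: aligned_sequence} as Phylip input text."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    (length,) = lengths
    lines = [f"{len(sequences)} {length}"]
    for name, seq in sequences.items():
        # The name occupies exactly ten characters: truncate, then pad with spaces.
        lines.append(name[:10].ljust(10) + seq)
    return "\n".join(lines) + "\n"


# Made-up alignment for illustration only.
alignment = {
    "human": "ACGT-ACGTA",
    "chimpanzee_copy": "ACGTTACG-A",
    "mouse": "ACGATACGTA",
}

if __name__ == "__main__":
    print(to_phylip(alignment), end="")
```

Names longer than ten characters are cut off here, so two long names that share their first ten characters would collide; a real pipeline would need to rename them first.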
53. The End