Title: Luonan Chen
1Phylogenetic Prediction and Evolutionary Tree
Chapter 6
2Methods
- Phenetic approach: clustering approach, Fitch-Margoliash method, Neighbor-joining method
- -- based on similarity
- Cladistic approach: maximum parsimony and maximum likelihood approaches
- -- based on genealogy
- -- Cladistic approaches are superior to clustering techniques but require more CPU time
3Distance Matrix (Phenetic approach depends on
distance matrix)
- DNA
- -- Hamming distance, e.g. D = 2 for agbc and agta
- -- Levenshtein distance (edit distance): the minimal number of deletions, insertions or substitutions needed to change one string into the other, e.g. D = 3 for ag-tcc and cgctca
- Protein
- -- PAM matrix, BLOSUM matrix
Similarity: a measure of the number of sequence positions that match in an alignment.
Distance: the number of positions that are different and that must be changed to convert one sequence into the other.
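The two DNA distances above can be checked with a short Python sketch. The function names are illustrative, and the gap character in ag-tcc is dropped before the edit distance is computed, which is an assumption about how that example is meant to be read.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(s, t):
    """Minimal number of deletions, insertions or substitutions turning s into t."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]                          # turning s[:i] into the empty string
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute, or match for free
        prev = curr
    return prev[-1]


print(hamming("agbc", "agta"))          # 2
print(levenshtein("agtcc", "cgctca"))   # 3
```

The edit distance uses the usual dynamic-programming table, kept one row at a time so memory stays proportional to the length of the second string.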
4Amino Acid Distances
- Distances between amino acid sequences are a bit more complicated to calculate.
- Some amino acids can replace one another with relatively little effect on the structure and function of the final protein, while other replacements can be functionally devastating.
- From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation, while others require two or even three changes in the DNA sequence.
- In practice, what has been done is to calculate tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks, i.e. PAM and BLOSUM.
5The PAM 250 scoring matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4   4  -5   4
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
Dayhoff, M., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345-352. M. Dayhoff, ed., National Biomedical Research Foundation, Silver Spring, MD.
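As a rough illustration of how such a matrix is used, the sketch below scores two aligned peptides by summing the matrix entry for each aligned pair. Only a handful of entries copied from the matrix above are included, and the two peptides are made up for the example.

```python
# A few PAM 250 entries taken from the matrix above; the matrix is symmetric,
# so each pair is stored as an unordered frozenset.
PAM250_SUBSET = {
    frozenset("A"): 2,     # A-A
    frozenset("W"): 17,    # W-W
    frozenset("RK"): 3,    # R-K
    frozenset("IL"): 2,    # I-L
    frozenset("CW"): -8,   # C-W
}


def alignment_score(seq1, seq2, matrix):
    """Sum of substitution scores over the columns of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(matrix[frozenset((a, b))] for a, b in zip(seq1, seq2))


# Hypothetical peptides: W-W (17) + K-R (3) + L-I (2) + A-A (2)
print(alignment_score("WKLA", "WRIA", PAM250_SUBSET))   # 24
```

Conservative replacements such as K-R or L-I still contribute positive scores, whereas an unlikely replacement such as C-W is heavily penalised.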
6Phenetic Methods
- Computer algorithms based on the phenetic model rely on distance methods to build trees from sequence data.
- Phenetic methods count each base of sequence difference equally, so a single event that creates a large change in sequence (insertion/deletion or recombination) will move two sequences far apart on the final tree.
- Phenetic approaches generally lead to faster algorithms and they often have nicer statistical properties for molecular data.
- The phenetic approach is popular with molecular evolutionists because it relies heavily on objective character data (such as sequences) and it requires relatively few assumptions.
7Clustering Algorithms
- Clustering algorithms use distances to calculate phylogenetic trees. These trees are based solely on the relative numbers of similarities and differences between a set of sequences.
- Start with a matrix of pairwise distances.
- Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa.
8UPGMA
- The simplest of the distance methods is UPGMA (Unweighted Pair Group Method using Arithmetic averages).
- Minimize the total edge length of the tree.
- Many multiple alignment programs such as PILEUP use a variant of UPGMA to create a dendrogram of DNA sequences, which is then used to guide the multiple alignment algorithm.
9Clustering Method (UPGMA) (Molecular clock hypothesis: uniform rate of mutation in the tree branches)
Other distance matrices (e.g. PAM) can be used
average distance
Improvement: choosing an outgroup
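A minimal UPGMA sketch in Python follows. The five-sequence distance table is an assumption made for illustration, and the tree is returned as nested tuples rather than drawn. At every step the two clusters with the smallest average distance are merged, and under the molecular clock hypothesis the new node is placed at half that distance.

```python
from itertools import combinations

# Assumed pairwise distances between five sequences (illustrative only).
raw = {("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
       ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
       ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10}
dist = {frozenset(pair): value for pair, value in raw.items()}


def upgma(names, dist):
    """Return (tree, merges); each merge is (leaves joined, node height)."""
    clusters = [(name, (name,)) for name in names]   # (subtree, leaves in it)
    merges = []

    def average(c1, c2):
        # Arithmetic average over all leaf pairs between the two clusters.
        pairs = [(p, q) for p in c1[1] for q in c2[1]]
        return sum(dist[frozenset(pq)] for pq in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1[1] + c2[1], average(c1, c2) / 2))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0], merges


tree, merges = upgma("ABCDE", dist)
print(tree)
for leaves, height in merges:
    print("".join(leaves), round(height, 2))
```

With this table D and E are joined first, then C, then A and B, and the two groups meet at the root. Because every leaf ends up equally far from the root, the result is only trustworthy when the rate of change really is uniform across branches.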
10Fitch-Margoliash (FM) Algorithm-I (Minimize total edge length at each step)
Distance table for 5 sequences
- 1. The most closely related sequences are D and E. A new table is made with the remaining sequences combined.
- 2. The average distances from D and E to ABC are calculated.
11Fitch-Margoliash Algorithm-II
- 3. Distance D-E = d + e, D-ABC = d + m, E-ABC = e + m, where m = g + (c + 2f + a + b)/3 is the average distance from the D-E node to A, B and C. Then d = 4, e = 6.
- 4. D and E are treated as a single DE. A new table is calculated from the average distances between A, B, C and DE.
- 5. The next most closely related sequences are C and DE. We have a new table. Similarly, we have c = 9 and n = 10 (n = g + (d + e)/2).
- 6. In the same way, we finally obtain a = 10, b = 12, g = 5 and f = 20. Then we have the tree.
How about starting with A and B?
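The steps above can be sketched in Python. The distance table itself is not reproduced in these notes, so the values below are an assumption, chosen because they give back the branch lengths quoted in the steps (d = 4, e = 6, c = 9, n = 10, g = 5, a = 10, b = 12, f = 20). The helper names are illustrative.

```python
# Assumed pairwise distances; the table from the slide is not available here.
raw = {("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
       ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
       ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10}
dist = {frozenset(pair): value for pair, value in raw.items()}


def avg(group1, group2):
    """Average distance between every leaf in group1 and every leaf in group2."""
    pairs = [(p, q) for p in group1 for q in group2]
    return sum(dist[frozenset(pq)] for pq in pairs) / len(pairs)


def three_point(d_xy, d_xz, d_yz):
    """Branch lengths from x, y and z to the single node that joins them."""
    return ((d_xy + d_xz - d_yz) / 2,
            (d_xy + d_yz - d_xz) / 2,
            (d_xz + d_yz - d_xy) / 2)


# Steps 1-3: D and E are closest; A, B, C act as one composite sequence.
d, e, m = three_point(avg("D", "E"), avg("D", "ABC"), avg("E", "ABC"))
# Steps 4-5: C is joined to DE; A and B form the composite.
c, n, _ = three_point(avg("C", "DE"), avg("C", "AB"), avg("DE", "AB"))
g = n - (d + e) / 2            # n runs from the C node down to D and E
# Step 6: A and B are resolved last; C, D, E form the composite.
a, b, p = three_point(avg("A", "B"), avg("A", "CDE"), avg("B", "CDE"))
f = p - (c + 2 * g + d + e) / 3   # p is the average path from the AB node to C, D, E

for name, value in zip("abcdefg", (a, b, c, d, e, f, g)):
    print(name, round(value, 6))
```

Each step is the same three-point calculation: two of the three distances are averages to a composite group, and the internal branch is whatever remains after the already known lengths are subtracted.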
12Neighbor Joining
- The Neighbor Joining method is the most popular way to build trees from distance measurements (Saitou and Nei 1987, Mol. Biol. Evol. 4:406).
- Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.
- The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).
- Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989)).
13Neighbor-joining Method (No requirement for molecular clock hypothesis)
- The algorithm is the same as the FM method except for the choice of pair.
- The choice of pair is based on the minimal sum of the branch lengths of the tree.
- <Algorithm>
- Step 1: Star-like connection, S0 = a + b + c + d + e = 78.5.
- Step 2: Compare the distances of all pairs and choose the pair with the minimal overall length: SAB = 67.7, SBC = 81, SCE = 76, SED = 70. Choose the AB pair.
- Step 3: As in Steps 2 and 3 of the FM method, calculating the average distance from AB to CDE, we have a = 10, b = 12.
- Step 4: As in the FM method, calculate a new table with A and B forming a single pair.
- Continue the computation to find the next pair until finished.
Overall length when A and B are joined
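The pair choice in Steps 1 and 2 can be sketched in Python using the tree-length formula of Saitou and Nei. The distance table is an assumption, since it is not reproduced in these notes; with these values the sketch gives back the quoted S0 = 78.5, SAB of about 67.7, SBC = 81 and SCE = 76. Function names are illustrative.

```python
from itertools import combinations

# Assumed pairwise distances; the table from the slide is not available here.
raw = {("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
       ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
       ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10}
dist = {frozenset(pair): value for pair, value in raw.items()}
names = "ABCDE"


def d(x, y):
    return dist[frozenset((x, y))]


def star_length(names):
    """Total branch length of the star tree: sum of all distances / (N - 1)."""
    return sum(d(x, y) for x, y in combinations(names, 2)) / (len(names) - 1)


def joined_length(i, j, names):
    """Total branch length of the tree in which i and j are joined first."""
    n = len(names)
    rest = [k for k in names if k not in (i, j)]
    to_pair = sum(d(i, k) + d(j, k) for k in rest) / (2 * (n - 2))
    among_rest = sum(d(k, l) for k, l in combinations(rest, 2)) / (n - 2)
    return to_pair + d(i, j) / 2 + among_rest


print("S0 =", star_length(names))
for i, j in combinations(names, 2):
    print(i + j, round(joined_length(i, j, names), 1))
best_pair = min(combinations(names, 2), key=lambda p: joined_length(*p, names))
print("join first:", "".join(best_pair))
```

Note that A and B are chosen even though D and E have the smallest raw distance: the criterion is the total length of the resulting tree, not the distance of the pair itself, which is what removes the need for a molecular clock.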
14Cladistic Methods
- For character data about the physical traits of organisms (such as morphology of organs etc.) and for deeper levels of taxonomy, the cladistic approach is almost certainly superior.
- Cladistic methods are often difficult to implement with molecular data because all of the assumptions are generally not satisfied.
15Cladistic methods
- Cladistic methods are based on the assumption that a set of sequences evolved from a common ancestor by a process of mutation and selection without mixing (hybridization or other horizontal gene transfers).
- These methods work best if a specific tree, or at least an ancestral sequence, is already known so that comparisons can be made between a finite number of alternate trees rather than calculating all possible trees for a given set of sequences.
16Parsimony
- Parsimony is the most popular method for reconstructing ancestral relationships.
- Parsimony allows the use of all known evolutionary information in building a tree.
- In contrast, distance methods compress all of the differences between pairs of sequences into a single number.
17Building Trees with Parsimony
- Parsimony involves evaluating all possible trees and giving each a score based on the number of evolutionary changes that are needed to explain the observed data.
- The best tree is the one that requires the fewest base changes for all sequences to derive from a common ancestor.
18Cladistic Method: Maximum Parsimony Method (minimizing the overall number of evolutionary changes (mutations))
Sequences: ATCT, ATGG, TCCA, TTCA
Optimal Solution
4 mutations
7 mutations
Problem: varying rates of evolution
Note: gaps are treated as a fifth base
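Scoring one candidate tree can be sketched with Fitch's small-parsimony counting, which is one standard way to count the minimum number of changes on a fixed tree. The labels s1 to s4 and the three tree shapes are assumptions: the trees drawn on the slide are not reproduced in these notes, so the counts printed here need not coincide with the mutation counts quoted above.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    A tree is either a leaf name or a pair (left subtree, right subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                       # children agree: no extra change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Total minimum number of changes over all sites of the alignment."""
    length = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
               for i in range(length))


# The four sequences from the slide; the labels s1..s4 are hypothetical.
seqs = {"s1": "ATCT", "s2": "ATGG", "s3": "TCCA", "s4": "TTCA"}
# The three possible unrooted groupings of four sequences.
trees = [(("s1", "s2"), ("s3", "s4")),
         (("s1", "s3"), ("s2", "s4")),
         (("s1", "s4"), ("s2", "s3"))]
for tree in trees:
    print(tree, parsimony_score(tree, seqs))
```

Maximum parsimony then keeps the tree with the smallest score. With four sequences there are only three groupings to compare, but the number of trees grows very quickly with the number of sequences, which is why the method is costly.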
19Maximum Likelihood
- The method of Maximum Likelihood attempts to reconstruct a phylogeny using an explicit model of evolution.
- This method works best when it is used to test (or improve) an existing tree.
- Even with simple models of evolutionary change, the computational task is enormous, making this the slowest of all phylogenetic methods.
20Assumptions for Maximum Likelihood
- The frequencies of DNA transitions (C<->T, A<->G) and transversions (C or T <-> A or G).
- The assumptions for protein sequence changes are taken from the PAM matrix, and are quite likely to be violated in real data.
- Since each nucleotide site is assumed to evolve independently, the likelihood is calculated separately for each site. The product of the likelihoods for each site provides the overall likelihood of the observed data.
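The product over sites can be illustrated on the smallest possible case, two sequences separated by a single branch. The sketch below swaps in the Jukes-Cantor model, which treats all substitutions as equally likely and therefore ignores the transition/transversion distinction listed above. That simplification, the grid search, and the two example sequences are all assumptions made for illustration.

```python
import math


def jc69_prob(x, y, t):
    """Jukes-Cantor probability that base x is seen as y after branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def log_likelihood(seq1, seq2, t):
    """Log of the product of per-site likelihoods (sites are independent)."""
    # 0.25 is the assumed equal frequency of the starting base at each site.
    return sum(math.log(0.25 * jc69_prob(x, y, t)) for x, y in zip(seq1, seq2))


s1 = "ACGTTTGGGGCCCCTTT"
s4 = "ACGCTCGGGGCCCCTTT"

# Crude maximum likelihood estimate: try branch lengths 0.001, 0.002, ..., 1.0.
grid = [i / 1000 for i in range(1, 1001)]
best_t = max(grid, key=lambda t: log_likelihood(s1, s4, t))

# For this model the maximum is also known in closed form.
p = sum(x != y for x, y in zip(s1, s4)) / len(s1)
closed_form = -0.75 * math.log(1 - 4 * p / 3)
print(best_t, round(closed_form, 4))
```

Working with the sum of logarithms rather than the raw product avoids numerical underflow for long alignments. On a real tree the same per-site calculation has to be repeated for every candidate topology and set of branch lengths, which is where the computational cost comes from.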
21Improving Efficiency by Compression
- S1 ACGTTTGGGGCCCCTTT
- S2 ACGTTTGGGGCCCCTTT
- S3 ACGTTTGGGGCCCCTTT
- S4 ACGCTCGGGGCCCCTTT
Data Compression
- S1 A C G T T
- S2 A C G T T
- S3 A C G T T
- S4 A C G C T
- Number of sites sharing each column pattern: 1, 5, 5, 2, 4
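The compression step can be sketched in Python by counting identical alignment columns. The variable names are illustrative. Each distinct column pattern then needs its likelihood computed only once and is weighted by its count.

```python
from collections import Counter

seqs = ["ACGTTTGGGGCCCCTTT",   # S1
        "ACGTTTGGGGCCCCTTT",   # S2
        "ACGTTTGGGGCCCCTTT",   # S3
        "ACGCTCGGGGCCCCTTT"]   # S4

# zip(*seqs) walks the alignment column by column; Counter keeps the
# patterns in order of first appearance.
patterns = Counter(zip(*seqs))

for column, count in patterns.items():
    print("".join(column), count)
```

Here 17 sites collapse to 5 distinct patterns, so a per-site likelihood would be evaluated 5 times instead of 17, with the log-likelihood of each pattern multiplied by its count.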
22Conclusions
- Given the huge variety of methods for computing phylogenies, how can the biologist determine which method is best for analyzing a given data set?
- Published papers that address phylogenetic issues generally make use of several different algorithms and data sets in order to support their conclusions.
- In some cases different methods of analysis can work synergistically.
- Neighbor Joining methods generally produce just one tree, which can help to validate a tree built with the parsimony or maximum likelihood method.
- Using several alternate methods can give an indication of the robustness of a given conclusion.