Luonan Chen - PowerPoint PPT Presentation

1
Phylogenetic Prediction and Evolutionary Tree
Chapter 6
  • Luonan Chen

2
Methods
  • Phenetic approach: clustering approach,
    Fitch-Margoliash method, Neighbor-joining method
  • -- based on similarity
  • Cladistic approach: maximum parsimony and maximum
    likelihood approaches
  • -- based on genealogy
  • -- Cladistic approaches are superior to
    clustering techniques but require more CPU time

3
Distance Matrix (the phenetic approach depends on a
distance matrix)
  • DNA
  • -- Hamming distance, e.g. D = 2 for agbc and agta
  • -- Levenshtein distance (edit distance): the minimal
    number of deletions, insertions or substitutions
    needed to change one string into the other. E.g. D = 3 for
    ag-tcc and cgctca
  • Protein
  • -- PAM matrix, BLOSUM matrix

Similarity: a measure of the number of sequence
positions that match in an alignment.
Distance: the number of positions that are different and
that must be changed to convert one sequence into the other.
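The two DNA distances named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation; the function names `hamming` and `levenshtein` are chosen here for convenience, and the test strings are the ones quoted on the slide (with the gap removed from ag-tcc).

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free when x == y)
            ))
        previous = current
    return previous[-1]


print(hamming("agbc", "agta"))         # slide example: D = 2
print(levenshtein("agtcc", "cgctca"))  # slide example: D = 3
```

The Levenshtein routine keeps only two rows of the dynamic-programming table, which is enough when only the distance (and not the alignment itself) is needed.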
4
Amino Acid Distances
  • Distances between amino acid sequences are a bit
    more complicated to calculate.
  • Some amino acids can replace one another with
    relatively little effect on the structure and
    function of the final protein while other
    replacements can be functionally devastating.
  • From the standpoint of the genetic code, some
    amino acid changes can be made by a single DNA
    mutation while others require two or even three
    changes in the DNA sequence.
  • In practice, what has been done is to calculate
    tables of frequencies of all amino acid
    replacements within families of related protein
    sequences in the databanks, e.g. the PAM and BLOSUM matrices.
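The genetic-code point above (one, two or three DNA changes per amino acid replacement) can be checked directly. The sketch below assumes the standard genetic code, written as the usual 64-letter string in T, C, A, G codon order; `min_base_changes` is a name introduced here for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def min_base_changes(aa1: str, aa2: str) -> int:
    """Fewest nucleotide substitutions needed to turn a codon for aa1 into one for aa2."""
    codons1 = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    codons2 = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons1
        for c2 in codons2
    )


print(min_base_changes("D", "E"))  # Asp -> Glu: 1 change
print(min_base_changes("M", "W"))  # Met -> Trp: 2 changes
print(min_base_changes("M", "Y"))  # Met -> Tyr: 3 changes
```

This count captures only the mutational distance; it says nothing about the structural or functional cost of a replacement, which is what the empirical PAM and BLOSUM tables try to reflect.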

5
The PAM 250 scoring matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4   4  -5   4
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Dayhoff, M., Schwartz, R.M., Orcutt, B.C.
(1978) A model of evolutionary change in
proteins. In Atlas of Protein Sequence and
Structure, vol. 5, suppl. 3, pp. 345-352. M. Dayhoff,
ed., National Biomedical Research Foundation,
Silver Spring, MD.
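A scoring matrix of this kind is typically used by summing the table entry for each aligned pair of residues. The sketch below is only an illustration: `PAM250_SUBSET` holds a handful of entries copied from the table above, and `alignment_score` is a hypothetical helper for ungapped alignments, not part of any library.

```python
# A few PAM250 entries copied from the table above; keys are unordered residue pairs.
PAM250_SUBSET = {
    frozenset("A"): 2,     # A-A
    frozenset("W"): 17,    # W-W
    frozenset("DE"): 3,    # D-E
    frozenset("IL"): 2,    # I-L
    frozenset("KR"): 3,    # K-R
    frozenset("GW"): -7,   # G-W
}


def alignment_score(seq1: str, seq2: str) -> int:
    """Sum of substitution scores over an ungapped alignment of equal-length peptides."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(PAM250_SUBSET[frozenset((x, y))] for x, y in zip(seq1, seq2))


print(alignment_score("AWDL", "AWEI"))  # 2 + 17 + 3 + 2
print(alignment_score("AKG", "ARW"))    # 2 + 3 - 7
```

Using `frozenset` keys makes the lookup symmetric, so only the lower triangle of the table needs to be stored.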
6
Phenetic Methods
  • Computer algorithms based on the phenetic model
    rely on distance methods to build trees from
    sequence data.
  • Phenetic methods count each base of sequence
    difference equally, so a single event that
    creates a large change in sequence
    (insertion/deletion or recombination) will move
    two sequences far apart on the final tree.
  • Phenetic approaches generally lead to faster
    algorithms and they often have nicer statistical
    properties for molecular data.
  • The phenetic approach is popular with molecular
    evolutionists because it relies heavily on
    objective character data (such as sequences) and
    it requires relatively few assumptions.

7
Clustering Algorithms
  • Clustering algorithms use distances to calculate
    phylogenetic trees. These trees are based solely
    on the relative numbers of similarities and
    differences between a set of sequences.
  • Start with a matrix of pairwise distances
  • Cluster methods construct a tree by linking the
    least distant pairs of taxa, followed by
    successively more distant taxa.

8
UPGMA
  • The simplest of the distance methods is the UPGMA
    (Unweighted Pair Group Method using Arithmetic
    averages)
  • Minimize the total edge length of tree.
  • Many multiple alignment programs such as PILEUP
    use a variant of UPGMA to create a dendrogram of
    DNA sequences which is then used to guide the
    multiple alignment algorithm.

9
Clustering Method (UPGMA)
(Molecular clock hypothesis: uniform rate of mutation in the tree
branches)
  • Other distance matrices (e.g. PAM) can be used
  • Clusters are compared by their average distance
  • Improvement: choosing an outgroup
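The UPGMA procedure can be sketched as repeated merging of the two clusters with the smallest average distance. The five-sequence table used here is an assumption: the slide's own table did not survive in this transcript, so it was rebuilt from the branch lengths quoted in the Fitch-Margoliash example (a = 10, b = 12, c = 9, d = 4, e = 6, g = 5, f = 20). The function names are chosen for illustration.

```python
from itertools import combinations

# Assumed distance table for sequences A-E (reconstructed, see the lead-in).
DIST = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}


def leaf_distance(x, y):
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def average_distance(c1, c2):
    """Arithmetic mean of all leaf-to-leaf distances between two clusters."""
    return sum(leaf_distance(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))


def upgma(leaves):
    """Return the merges as (cluster1, cluster2, node_height), in the order they are made."""
    clusters = [(leaf,) for leaf in leaves]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        # Under the molecular clock, the new node sits at half the cluster distance.
        merges.append((c1, c2, average_distance(c1, c2) / 2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


for c1, c2, height in upgma("ABCDE"):
    print("+".join(c1), "joins", "+".join(c2), "at height", round(height, 2))
```

Placing each new node at half the inter-cluster distance is exactly where the uniform-rate assumption enters: both sides of a join are given equal depth.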
10
Fitch-Margoliash (FM) Algorithm-I
(Minimize total edge length at each step)
Distance table for 5 sequences
  • 1. The most closely related sequences are D and
    E. A new Table is made with the remaining
    sequences combined.
  • 2. The average distances from D and E to ABC are
    calculated.

11
Fitch-Margoliash Algorithm-II
  • 3. Distance D-E = d + e; D-ABC = d + m; E-ABC = e + m,
    where m = g + (c + 2f + a + b)/3 is the average distance
    from the D/E node to A, B and C.
  • Then d = 4, e = 6.
  • 4. D and E are treated as a single cluster DE. A new
    table is calculated from the average distances
    between A, B, C and DE.
  • 5. The next most closely related pair is C
    with DE, which gives a new table. Similarly, we have
    c = 9 and n = 10, where n = g + (d + e)/2.
  • 6. In the same way, we finally obtain a = 10, b = 12, g = 5
    and f = 20. Then we have the tree.

How about starting with A and B ?
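The branch lengths quoted above can be reproduced with the three-point formula that Fitch-Margoliash applies at each step. This is a sketch under one stated assumption: the distance table is not present in the transcript, so the one below was rebuilt from the quoted branch lengths. The helper names `avg` and `three_point` are introduced here for illustration.

```python
# Assumed distance table for sequences A-E (reconstructed from the quoted branch lengths).
DIST = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}


def dist(x, y):
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def avg(group1, group2):
    """Average distance between two groups of sequences, e.g. avg("D", "ABC")."""
    return sum(dist(x, y) for x in group1 for y in group2) / (len(group1) * len(group2))


def three_point(d_xy, d_xz, d_yz):
    """Branch lengths (x, y, z) of the three-leaf tree that fits the three distances."""
    return ((d_xy + d_xz - d_yz) / 2,
            (d_xy + d_yz - d_xz) / 2,
            (d_xz + d_yz - d_xy) / 2)


# Steps 1-3: D and E against the remaining sequences A, B, C.
d, e, m = three_point(avg("D", "E"), avg("D", "ABC"), avg("E", "ABC"))
# Steps 4-5: C against the DE cluster, with A and B as the rest.
c, n, _ = three_point(avg("C", "DE"), avg("C", "AB"), avg("DE", "AB"))
g = n - (d + e) / 2
# Step 6: A and B against the CDE cluster; p is the averaged path from the AB node to C, D, E.
a, b, p = three_point(avg("A", "B"), avg("A", "CDE"), avg("B", "CDE"))
f = p - (c + (g + d) + (g + e)) / 3

print([round(v, 2) for v in (a, b, c, d, e, f, g)])
```

Each composite length (m, n, p) is an average path to a cluster, so the interior branches g and f are obtained by subtracting the already-known branch lengths inside that cluster.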
12
Neighbor Joining
  • The Neighbor Joining method is the most popular
    way to build trees from distance measurements
  • (Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
  • Neighbor Joining corrects the UPGMA method for
    its (frequently invalid) assumption that the same
    rate of evolution applies to each branch of a
    tree.
  • The distance matrix is adjusted for differences
    in the rate of evolution of each taxon (branch).
  • Neighbor Joining has given the best results in
    simulation studies and it is the most
    computationally efficient of the distance
    algorithms (N. Saitou and T. Imanishi, Mol.
    Biol. Evol. 6:514 (1989)).

13
Neighbor-joining Method (no
requirement for the molecular clock hypothesis)
  • The algorithm is the same as the FM method except for the
    choice of pair.
  • The choice of pair is based on the minimal sum of the
    branch lengths of the tree.
  • <Algorithm>
  • Step 1: Star-like connection, S0 = a + b + c + d + e = 78.5.
  • Step 2: Compare the distances of all pairs and
    choose the pair with the minimal overall length:
    S_AB = 67.7, S_BC = 81, S_CE = 76, S_ED = 70. Choose the AB pair.
  • Step 3: As in Steps 2 and 3 of the FM method,
    calculate the average distances from A and B to CDE; we
    have a = 10, b = 12.
  • Step 4: As in the FM method, calculate a new table with
    A and B forming a single pair.
  • Continue the computation to find the next
    pair until finished.

Overall length when AB joining
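The pair-selection criterion of Steps 1 and 2 can be sketched with the total-branch-length formulas of Saitou and Nei (1987). The distance table is an assumption (rebuilt from the branch lengths quoted in the Fitch-Margoliash example, since the slide's table is missing from this transcript), and the function names are illustrative.

```python
from itertools import combinations

TAXA = "ABCDE"
# Assumed distance table for sequences A-E (reconstructed, see the lead-in).
DIST = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}


def dist(x, y):
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def star_length(taxa):
    """S0: total branch length of the star tree, sum of all distances / (N - 1)."""
    return sum(dist(x, y) for x, y in combinations(taxa, 2)) / (len(taxa) - 1)


def joined_length(i, j, taxa):
    """S_ij: total branch length of the tree in which i and j are joined first."""
    n = len(taxa)
    rest = [t for t in taxa if t not in (i, j)]
    to_rest = sum(dist(i, k) + dist(j, k) for k in rest) / (2 * (n - 2))
    among_rest = sum(dist(x, y) for x, y in combinations(rest, 2)) / (n - 2)
    return to_rest + dist(i, j) / 2 + among_rest


scores = {pair: joined_length(*pair, TAXA) for pair in combinations(TAXA, 2)}
best = min(scores, key=scores.get)

print("S0 =", star_length(TAXA))
for pair, s in sorted(scores.items(), key=lambda item: item[1]):
    print("S_" + "".join(pair), "=", round(s, 1))
print("join first:", "".join(best))
```

With this table the sketch reproduces S0 = 78.5, S_AB = 67.7, S_BC = 81 and S_CE = 76, and selects the AB pair. The D-E pair evaluates to about 72.7 rather than the 70 quoted in Step 2, so the reconstructed table should be treated as illustrative only.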
14
Cladistic Methods
  • For character data about the physical traits of
    organisms (such as morphology of organs etc.)
    and for deeper levels of taxonomy, the cladistic
    approach is almost certainly superior.
  • Cladistic methods are often difficult to
    implement with molecular data because all of the
    assumptions are generally not satisfied.

15
Cladistic methods
  • Cladistic methods are based on the assumption
    that a set of sequences evolved from a common
    ancestor by a process of mutation and selection
    without mixing (hybridization or other horizontal
    gene transfers).
  • These methods work best if a specific tree, or at
    least an ancestral sequence, is already known so
    that comparisons can be made between a finite
    number of alternate trees rather than calculating
    all possible trees for a given set of sequences.

16
Parsimony
  • Parsimony is the most popular method for
    reconstructing ancestral relationships.
  • Parsimony allows the use of all known
    evolutionary information in building a tree
  • In contrast, distance methods compress all of the
    differences between pairs of sequences into a
    single number

17
Building Trees with Parsimony
  • Parsimony involves evaluating all possible trees
    and giving each a score based on the number of
    evolutionary changes that are needed to explain
    the observed data.
  • The best tree is the one that requires the fewest
    base changes for all sequences to derive from a
    common ancestor.
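Scoring every candidate tree by its number of changes can be sketched with Fitch's counting rule, a standard way to obtain the minimum number of substitutions on a fixed tree. The four sequences W, X, Y, Z below are hypothetical toy data, not taken from the slides, and trees are written as nested tuples.

```python
# Hypothetical toy alignment (not from the presentation).
SEQS = {"W": "AAG", "X": "AAC", "Y": "TTG", "Z": "TTC"}

# The three possible unrooted topologies for four taxa, rooted on the internal branch.
TREES = {
    "WX|YZ": (("W", "X"), ("Y", "Z")),
    "WY|XZ": (("W", "Y"), ("X", "Z")),
    "WZ|XY": (("W", "Z"), ("X", "Y")),
}


def fitch(tree, column):
    """Return (possible ancestral states, minimum changes) for one alignment column."""
    if isinstance(tree, str):  # a leaf: its observed base, no changes yet
        return {column[tree]}, 0
    left_states, left_cost = fitch(tree[0], column)
    right_states, right_cost = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:  # the children can agree, so no extra change is needed here
        return shared, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Total number of changes the tree needs, summed over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


scores = {name: parsimony_score(tree, SEQS) for name, tree in TREES.items()}
print(scores)
print("most parsimonious:", min(scores, key=scores.get))
```

Exhaustive enumeration like this is only feasible for a handful of taxa, because the number of possible trees grows extremely quickly with the number of sequences.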

18
Cladistic Method: Maximum Parsimony Method
(minimizing the overall number of evolutionary changes
(mutations))
Sequences: ATCT, ATGG, TCCA, TTCA
[Figure: two candidate trees, the optimal solution with
4 mutations and an alternative with 7 mutations]
Problem: varying rates of evolution. Note: gaps
are treated as a fifth base.
19
Maximum Likelihood
  • The method of Maximum Likelihood attempts to
    reconstruct a phylogeny using an explicit model
    of evolution.
  • This method works best when it is used to test
    (or improve) an existing tree.
  • Even with simple models of evolutionary change,
    the computational task is enormous, making this
    the slowest of all phylogenetic methods.

20
Assumptions for Maximum Likelihood
  • The frequencies of DNA transitions (C<->T, A<->G)
    and transversions (C or T <-> A or G).
  • The assumptions for protein sequence changes are
    taken from the PAM matrix, and are quite likely
    to be violated in real data.
  • Each nucleotide site is assumed to evolve independently,
    so the likelihood of the tree is calculated separately for each site.
    The product of the likelihoods for each site
    provides the overall likelihood of the observed
    data.
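The product-over-sites idea can be sketched for the simplest possible case, two sequences joined by a single branch. Note the swapped-in model: this uses the Jukes-Cantor model, which treats all substitutions alike and ignores the transition/transversion distinction mentioned above. The two sequences are S1 and S4 from the compression slide later in this transcript; the function names are chosen here for illustration.

```python
import math

S1 = "ACGTTTGGGGCCCCTTT"
S4 = "ACGCTCGGGGCCCCTTT"


def site_likelihoods(seq1, seq2, t):
    """Per-site likelihoods under Jukes-Cantor for branch length t (substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability that a base is unchanged
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    # 0.25 is the assumed equilibrium frequency of the base in the first sequence.
    return [0.25 * (p_same if x == y else p_diff) for x, y in zip(seq1, seq2)]


def log_likelihood(seq1, seq2, t):
    """Sites are independent, so the log of the product is the sum of the logs."""
    return sum(math.log(value) for value in site_likelihoods(seq1, seq2, t))


# Closed-form Jukes-Cantor estimate of t from the proportion of differing sites.
p = sum(x != y for x, y in zip(S1, S4)) / len(S1)
t_hat = -0.75 * math.log(1 - 4 * p / 3)

for t in (t_hat - 0.05, t_hat, t_hat + 0.05):
    print(round(t, 3), round(log_likelihood(S1, S4, t), 4))
```

Working with log-likelihoods avoids numerical underflow, since the raw product of many per-site probabilities quickly becomes too small to represent.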

21
Improving Efficiency by Compression
  • S1 ACGTTTGGGGCCCCTTT
  • S2 ACGTTTGGGGCCCCTTT
  • S3 ACGTTTGGGGCCCCTTT
  • S4 ACGCTCGGGGCCCCTTT

Data Compression
  • S1 A C G T T
  • S2 A C G T T
  • S3 A C G T T
  • S4 A C G C T
  • Number of sites sharing each pattern: 1, 5, 5, 2, 4
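The compression step amounts to collapsing identical alignment columns and remembering how often each one occurs. A minimal sketch using the four sequences from this slide (the function name `compress` is introduced here for illustration):

```python
from collections import Counter

ALIGNMENT = {
    "S1": "ACGTTTGGGGCCCCTTT",
    "S2": "ACGTTTGGGGCCCCTTT",
    "S3": "ACGTTTGGGGCCCCTTT",
    "S4": "ACGCTCGGGGCCCCTTT",
}


def compress(alignment):
    """Collapse identical alignment columns into unique site patterns with counts."""
    columns = ("".join(column) for column in zip(*alignment.values()))
    return Counter(columns)  # patterns stay in first-seen order


patterns = compress(ALIGNMENT)
for pattern, count in patterns.items():
    print(pattern, count)
```

Because sites are treated as independent, a per-site likelihood only has to be computed once per pattern and can then be raised to the power of its count, here 5 evaluations instead of 17.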

22
Conclusions
  • Given the huge variety of methods for computing
    phylogenies, how can the biologist determine what
    is the best method for analyzing a given data
    set?
  • Published papers that address phylogenetic issues
    generally make use of several different
    algorithms and data sets in order to support
    their conclusions.
  • In some cases different methods of analysis can
    work synergistically
  • Neighbor Joining methods generally produce just
    one tree, which can help to validate a tree built
    with the parsimony or maximum likelihood method
  • Using several alternate methods can give an
    indication of the robustness of a given
    conclusion.