Phylogenetic Analysis - PowerPoint PPT Presentation

About This Presentation
Title:

Phylogenetic Analysis

Slides: 53
Provided by: johnsa98

Transcript and Presenter's Notes


1
Phylogenetic Analysis
2
General comments on phylogenetics
  • Phylogenetics is the branch of biology that deals
    with evolutionary relatedness
  • Uses some measure of evolutionary relatedness,
    e.g., morphological features

3
  • Phylogenetics on sequence data is an attempt to
    reconstruct the evolutionary history of those
    sequences
  • Relationships between individual sequences are
    not necessarily the same as those between the
    organisms they are found in

4
  • The ultimate goal is to be able to use sequence
    data from many sequences to give information
    about phylogenetic history of organisms
  • Phylogenetic relationships are usually depicted
    as trees, with branches representing ancestors of
    children; the tips at the bottom of the tree
    (individual organisms) are leaves. Individual
    branch points are nodes.

5
Phylogenetic trees
[Figure: two example trees over leaves A, B, C, and D, labeled "A rooted tree" and "An unrooted tree"; the axis annotations read "time" and "time?".]
6
  • We will only consider binary trees: edges split
    only into two branches (daughter edges)
  • Rooted trees have an explicit ancestor; the
    direction of time is explicit in these trees
  • Unrooted trees do not have an explicit ancestor;
    the direction of time is undetermined in such
    trees

7
Types of phylogenetic analysis methods
  • Phenetic (distance methods): trees are
    constructed based on observed characteristics,
    not on evolutionary history
  • Cladistic (parsimony and maximum likelihood
    methods): trees are constructed based on fitting
    observed characteristics to some model of
    evolutionary history

8
Similarity and Homology
  • The evolutionary relationship between sequences
    is inferred from the similarity of the sequences
  • Similarity is a measurable quantity (e.g.,
    identity, alignment score, etc.)
  • Homology is the inference from sequence
    similarity data that sequences are evolutionarily
    related

9
Sequence alignments
  • Aligning sequences gives information about:
  • Similarity
  • Areas of sequences that are conserved through
    evolution

11
The real problem
  • How do we compare sequences?
  • Seq 1: CTGCACTA
  • Seq 2: CACTA
  • or: C---ACTA
  • Scoring tries to approximate evolution: scores
    for substitutions and for gaps
    (insertions/deletions)
  • Score = sum of terms for substitutions and for
    gaps (sequence as character string)

12
Sequence alignment I
  • Simplest scoring: 1 for match, 0 for no match
  • Ungapped, with Seq 2 under the last five
    positions of Seq 1 (score 5):
    CTGCACTA
       CACTA
  • Gapped (score 5):
    CTGCACTA
    C---ACTA
13
Sequence alignment II
  • Slightly more advanced scoring: 1 for match, 0
    for no match, -1 for gap
  • Ungapped, with Seq 2 under the last five
    positions of Seq 1 (score 5):
    CTGCACTA
       CACTA
  • Gapped (score 2):
    CTGCACTA
    C---ACTA
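
The two scoring schemes above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name and its keyword parameters are assumptions, and it expects two already-aligned strings of equal length with "-" marking a gap.

```python
def score_alignment(a, b, match=1, mismatch=0, gap=-1):
    """Sum per-column scores for two aligned strings of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap        # column contains a gap
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


# Simplest scheme (gaps cost nothing) versus the scheme with a -1 gap penalty.
print(score_alignment("CTGCACTA", "C---ACTA", gap=0))
print(score_alignment("CTGCACTA", "C---ACTA", gap=-1))
```

With the gapped alignment from the slide, the first call gives 5 and the second gives 2, matching the scores shown.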
14
        G   C   A   T
    G   1   0   0   0
    C   0   1   0   0
    A   0   0   1   0
    T   0   0   0   1

        G   C   A   T
    G   1  -1  -1  -1
    C  -1   1  -1  -1
    A  -1  -1   1  -1
    T  -1  -1  -1   1

  • Identity scoring matrices: top, simple form;
    below, with mismatch penalty

15
In-class exercise II
  • Using the advanced scoring method, calculate the
    scores for the following pairs of nucleotide
    sequences:

CCTGGGCTATGC
CAGGGTT-TGC

CCTGGGCTATGC
CA-GGG-TTTGC
16
What about proteins?
  • Chemistry of amino acids means that some
    substitutions in the sequence are better than
    others
  • Substitution matrix: empirically derived scores
    for frequency of substitution of each amino acid
    for all 19 others.

17
BLOSUM 62 Substitution matrix
18
In-class exercise III
  • Using the BLOSUM62 substitution matrix and a gap
    penalty of -2, score the following pairs of
    protein sequences (do not penalize end gaps)

YIHMNVFLSFML
RVGAANFPNPRL

YIHMNVFLSFML
FIHMNLFVSFML

YIHMNVFLSFML
IHMNLFV--SFML

YIHMNVFLSFML
IVLSMMFFLNHY
19
Dynamic programming strategy
  • Break alignment problem into small pieces
  • Optimize first piece
  • Then extend into second piece; since first piece
    is optimized already, program only needs to
    optimize extension
  • Continue until end of comparison

20
          H    E    E
     0   -6  -12  -18
H   -6   10    4   -2
E  -12    4   16   10
A  -18   -2   10   15
E  -24   -8    4   16
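
The matrix above can be reproduced with the standard global-alignment recurrence: each cell is the best of a diagonal step plus a substitution score, or a step from above or from the left plus a gap penalty. The sketch below is only an illustration. The slides do not state the scoring parameters, so the gap penalty of -6 and the substitution scores in `SCORES` are assumptions chosen because they are consistent with the numbers shown.

```python
GAP = -6  # assumed linear gap penalty, read off the first row and column

# Assumed symmetric substitution scores, keyed by the sorted residue pair.
SCORES = {
    ("H", "H"): 10,
    ("E", "E"): 6,
    ("A", "E"): -1,
    ("E", "H"): 0,
    ("A", "H"): -2,
}


def sub(x, y):
    """Look up the substitution score for residues x and y."""
    return SCORES[tuple(sorted((x, y)))]


def fill_matrix(rows, cols):
    """Fill the dynamic programming matrix for sequences rows and cols."""
    F = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i in range(1, len(rows) + 1):
        F[i][0] = i * GAP                      # leading gaps in cols
    for j in range(1, len(cols) + 1):
        F[0][j] = j * GAP                      # leading gaps in rows
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub(rows[i - 1], cols[j - 1]),  # align pair
                F[i - 1][j] + GAP,                                # gap in cols
                F[i][j - 1] + GAP,                                # gap in rows
            )
    return F


for row in fill_matrix("HEAE", "HEE"):
    print(row)
```

Each cell depends only on three already-computed neighbors, which is the "optimize the first piece, then extend" idea: the bottom-right cell holds the score of the best global alignment.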
23
Why multiple alignments?
  • Alignment of more than two sequences
  • Usually gives better information about conserved
    regions and function (more data)
  • Better estimate of significance when using a
    sequence of unknown function
  • Must use multiple alignments when establishing
    phylogenetic relationships

24
Dynamic programming extended to many dimensions?
  • No: uses up too much computer time and space
  • E.g., 200 amino acids in a pairwise alignment:
    must evaluate 4 x 10^4 matrix elements
  • If 3 sequences, 8 x 10^6 matrix elements
  • If 6 sequences, 6.4 x 10^13 matrix elements
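
The counts above are just the sequence length raised to the number of sequences. A short check, with a hypothetical helper name:

```python
def dp_cells(length, n_sequences):
    """Cells in a full dynamic programming matrix over n sequences."""
    return length ** n_sequences


for n in (2, 3, 6):
    print(f"{n} sequences of 200 residues: {dp_cells(200, n):.1e} cells")
```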

25
  • Need to find a more efficient method
  • Sacrifice the certainty of an optimum alignment
    for a good alignment that is found faster

26
Feng-Doolittle algorithm
  • Does all pairwise alignments and scores them
  • Converts pairwise scores to distances
  • $D = -\log S_{eff} = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}}$
  • $S_{obs}$: pairwise alignment score
  • $S_{rand}$: expected score for random alignment
  • $S_{max}$: average of self-alignments of the two
    sequences

27
  • As $S_{obs}$ approaches $S_{rand}$ (increasing
    evolutionary distance), $S_{eff}$ goes down; to
    make the distance measure positive, use the -log
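
As a small illustration of the score-to-distance conversion, the sketch below computes D from the three scores. The function name and the example numbers are assumptions, not values from the slides.

```python
import math


def feng_doolittle_distance(s_obs, s_rand, s_max):
    """D = -log S_eff, with S_eff = (S_obs - S_rand) / (S_max - S_rand)."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("observed score is no better than random")
    return -math.log(s_eff)


# Hypothetical scores: identical sequences, then a more diverged pair.
print(feng_doolittle_distance(s_obs=100, s_rand=0, s_max=100))
print(feng_doolittle_distance(s_obs=50, s_rand=0, s_max=100))
```

Identical sequences give a distance of zero, and the distance grows as the observed score drops toward the random expectation.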

28
  • Once the distances have been calculated,
    construct a guide tree (more in the phylogeny
    class); it tells what order to group the
    sequences in
  • Sequences can be aligned with sequences or
    groups; groups can be aligned with groups

29
  • Sequence-sequence alignments: dynamic programming
  • Sequence-group alignments: all possible pairwise
    alignments between sequence and group are tried;
    highest scoring pair is how it gets aligned to
    group
  • Group-group alignments: all possible pairwise
    alignments of sequences between groups are tried;
    highest scoring pair is how groups get aligned

30
Example
[Figure: diagram over Seq1 to Seq5, with successive steps labeled Alignment 1, Alignment 2, Alignment 3, and Final alignment.]
31
  • Notice that this method does not guarantee the
    optimum alignment, just a good one.
  • Gaps are preserved from alignment to alignment:
    once a gap, always a gap

32
Distance methods
  • Measuring distance: just as when we talked
    about multiple alignment, distance represents all
    the differences at the various positions. These
    differences can be treated as equal or weighted
    according to empirical knowledge of substitution
    rates

33
  • Another way to say this is that there is a set
    of distances $d_{ij}$, one for each pair of
    sequences i, j in the dataset. $d_{ij}$ can be the
    fraction f of sites u where residues $x_i$ and
    $x_j$ differ, or $d_{ij}$ can be such a fraction
    but weighted in some way (e.g. Jukes-Cantor
    distance)
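
Both kinds of distance can be sketched briefly. The fraction of differing sites is computed directly; the Jukes-Cantor correction shown is the standard nucleotide form, d = -(3/4) ln(1 - 4f/3). Function names are assumptions, and gap columns are not treated specially in this sketch.

```python
import math


def p_distance(x, y):
    """Fraction f of aligned sites at which the two sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    differing = sum(1 for a, b in zip(x, y) if a != b)
    return differing / len(x)


def jukes_cantor(f):
    """Jukes-Cantor corrected distance for a nucleotide difference fraction f."""
    if f >= 0.75:
        raise ValueError("correction is undefined for f >= 0.75")
    return -0.75 * math.log(1 - 4 * f / 3)


f = p_distance("ACGTACGT", "ACGAACGA")
print(f, jukes_cantor(f))
```

The corrected distance is always at least as large as f, because it accounts for multiple substitutions at the same site.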

34
Clustering algorithms
  • UPGMA: this is the distance clustering method
    that is used in pileup to make the guide tree
  • $d_{ij}$ is the average distance between pairs of
    sequences found in two clusters, $C_i$ and $C_j$.
  • Text's notation: $|C_i|$ = number of sequences
    in $C_i$

35
  • The algorithm in the text means just what we said
    before: find the closest distance between two
    sequences, cluster those; then find the next
    closest distance, cluster those. As sequences are
    added to existing clusters, find the average
    distance between existing clusters
  • Work through the notation!
  • UPGMA assumes a molecular clock mechanism of
    evolution
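
The clustering loop described above can be sketched as follows. This is a simplified illustration that returns only the tree topology as nested tuples; branch heights are left out, the cluster-to-cluster average is recomputed from the leaf distances each time rather than updated incrementally, and the input format (a dict keyed by frozenset pairs) is an assumption.

```python
from itertools import combinations


def upgma(leaf_dist):
    """Cluster leaves by average distance; return the topology as nested tuples.

    leaf_dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    leaves = sorted({leaf for pair in leaf_dist for leaf in pair})
    # Each cluster is (tree, members): tree is a leaf name or a nested pair.
    clusters = [(leaf, [leaf]) for leaf in leaves]

    def average(ci, cj):
        # Mean of all leaf-to-leaf distances between the two clusters.
        total = sum(leaf_dist[frozenset((a, b))] for a in ci[1] for b in cj[1])
        return total / (len(ci[1]) * len(cj[1]))

    while len(clusters) > 1:
        ci, cj = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters = [c for c in clusters if c is not ci and c is not cj]
        clusters.append(((ci[0], cj[0]), ci[1] + cj[1]))
    return clusters[0][0]


# Hypothetical distances between four sequences.
distances = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6,
    frozenset("AD"): 10, frozenset("BD"): 10, frozenset("CD"): 10,
}
print(upgma(distances))
```

In this example A and B are joined first, then C joins that cluster, and D is added last.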

36
  • Neighbor-joining corrects for UPGMA's assumption
    of the same rate of evolution for each branch by
    modifying the distance matrix to reflect
    different rates of change.
  • The net difference between sequence i and all
    other sequences is
  • $r_i = \sum_k d_{ik}$

37
  • The rate-corrected distance matrix is then
  • $M_{ij} = d_{ij} - (r_i + r_j)/(n - 2)$
  • Join the two sequences whose $M_{ij}$ is minimal,
    then calculate the distance from this new node k
    to every other sequence m using
  • $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$
  • Again correct for rates and join nodes.
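
One round of the procedure above can be sketched as a single function: compute the net differences, build the rate-corrected matrix, pick the minimal pair, and compute distances from the new node. The function names, the dict-of-dicts matrix format, and the example distances are assumptions made for this illustration.

```python
def nj_step(names, d):
    """One neighbor-joining step on a symmetric dict-of-dicts matrix d."""
    n = len(names)
    if n < 3:
        raise ValueError("need at least three nodes for the rate correction")
    # Net difference r_i between each node and all others.
    r = {i: sum(d[i][k] for k in names if k != i) for i in names}
    best = None
    for a in range(n):
        for b in range(a + 1, n):
            i, j = names[a], names[b]
            m_ij = d[i][j] - (r[i] + r[j]) / (n - 2)   # rate-corrected distance
            if best is None or m_ij < best[0]:
                best = (m_ij, i, j)
    _, i, j = best
    # Distances from the new node (joining i and j) to every other node m.
    to_new_node = {m: (d[i][m] + d[j][m] - d[i][j]) / 2
                   for m in names if m not in (i, j)}
    return i, j, to_new_node


def symmetric(pairs):
    """Expand {(a, b): value} into a symmetric dict-of-dicts."""
    d = {}
    for (a, b), value in pairs.items():
        d.setdefault(a, {})[b] = value
        d.setdefault(b, {})[a] = value
    return d


# Hypothetical distances in which B and D evolve faster than A and C.
example = symmetric({("A", "B"): 4, ("A", "C"): 3, ("A", "D"): 5,
                     ("B", "C"): 5, ("B", "D"): 7, ("C", "D"): 4})
print(nj_step(["A", "B", "C", "D"], example))
```

In this example the smallest raw distance is between A and C, which is the pair a clock-based clustering would join first. After the rate correction, A and B are joined instead (tied with C and D), which is the point of the correction.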

38
In-class exercise I
  • Retrieve the file named phylo2 from bioinfI.list
    in my directory
  • Open it in the editor, select all the sequences
  • Select Functions → Evolution → PAUPSearch. In
    Tree Optimality Criterion choose distance; in
    Method for Obtaining Best Tree choose heuristic.
    Leave everything else as default (make sure the
    bootstrap option is not selected)
  • Select Run. Inspect output

39
Parsimony methods
  • Parsimony methods are based on the idea that the
    most probable evolutionary pathway is the one
    that requires the smallest number of changes from
    some ancestral state
  • For sequences, this implies treating each
    position separately and finding the minimal
    number of substitutions at each position

40
Example of parsimonious tree building
  • Tree on left requires only one change, tree on
    right requires two; the left tree is most
    parsimonious
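
Counting the minimal number of changes at one position can be done with Fitch's small-parsimony procedure, a technique the slides do not name; it is used here only as one common way to do the counting. The tree encoding (nested tuples of leaf names) and the site pattern below are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimal changes) for one site.

    tree is a leaf name (str) or a (left, right) tuple; states maps each
    leaf name to the residue observed at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children can share a state, so no extra change is needed here.
        return common, left_cost + right_cost
    # Children disagree: one change is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site minimal change counts over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[pos] for name, seq in alignment.items()})[1]
        for pos in range(length)
    )


site = {"A": "A", "B": "A", "C": "G", "D": "G"}
print(fitch((("A", "B"), ("C", "D")), site)[1])
print(fitch((("A", "C"), ("B", "D")), site)[1])
```

For this site pattern the first topology needs one change and the second needs two, so the first is the more parsimonious tree.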

41
  • Parsimony methods assign a cost to each tree
    available to the dataset, then screen those trees
    and select the most parsimonious
  • Screening all the trees available to even a
    smallish dataset would take too much time. The
    branch and bound method builds trees with
    increasing numbers of leaves but abandons a
    topology whenever the current tree has a bigger
    cost than a complete tree already found

42
In-class exercise II
  • Use same data set and program as in exercise I,
    but choose maximum parsimony. Use heuristic for
    the tree building method.
  • Inspect your tree. Compare it to the distance
    generated tree.

43
Maximum likelihood methods
  • Maximum likelihood reconstructs a tree according
    to an explicit model of evolution. For the given
    model, no other method will work as well
  • But, such models must be simple, because the
    method is computationally intensive

44
  • Actually, all the other methods discussed
    implicitly use a simple model of evolution
    similar to the typical model made explicit in
    maximum likelihood
  • All sites selectively neutral
  • All mutate independently, forward and reverse
    rates equal, given by μ

45
  • Also assume discrete generations and sites change
    independently
  • Given this model, can calculate the probability
    that a site with initial nucleotide i will change
    to nucleotide j within time t
  • $P^t_{ij} = \delta_{ij} e^{-\mu t} + (1 - e^{-\mu t}) g_j$,
    where $\delta_{ij} = 1$ if $i = j$ and
    $\delta_{ij} = 0$ otherwise, and where $g_j$ is the
    equilibrium frequency of nucleotide j
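
The transition probability can be written out directly. This is a sketch under the model as stated: the function name, the rate value, and the uniform equilibrium frequencies are assumptions chosen for the example.

```python
import math


def transition_probability(i, j, t, mu, freqs):
    """Probability that a site in state i is in state j after time t."""
    decay = math.exp(-mu * t)              # chance that no event has occurred
    same = 1.0 if i == j else 0.0          # the delta term
    return same * decay + (1.0 - decay) * freqs[j]


freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
row = {j: transition_probability("A", j, t=0.5, mu=1.0, freqs=freqs)
       for j in "ACGT"}
print(row, sum(row.values()))
```

Each row of probabilities sums to one. At t = 0 the site stays in its initial state, and for large t the probabilities approach the equilibrium frequencies.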

46
  • The likelihood that some site is in state i at
    the kth node of a tree is $L_i(k)$
  • The likelihoods for all states for each site for
    each node are calculated separately; the product
    of the likelihoods for each site gives the
    overall likelihood for the observed data
  • Different tree topologies are searched to find
    the highest overall likelihood
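
One way to compute these per-node likelihoods is the recursive "pruning" calculation, which the slides do not name; the sketch below uses it as an assumption. It treats $L_i(k)$ as the likelihood of the data below node k given state i at k. The tree encoding (a leaf name, or a pair of (child, branch length) tuples), the rate, and the frequencies are all hypothetical.

```python
import math

BASES = "ACGT"


def p(i, j, t, mu, freqs):
    """Transition probability under the simple model from the slides."""
    decay = math.exp(-mu * t)
    return (decay if i == j else 0.0) + (1.0 - decay) * freqs[j]


def conditional(node, site, mu, freqs):
    """L_i(k) for every state i at this node, for one alignment column."""
    if isinstance(node, str):
        # Leaf: likelihood is 1 for the observed residue and 0 otherwise.
        return {i: 1.0 if site[node] == i else 0.0 for i in BASES}
    result = {i: 1.0 for i in BASES}
    for child, t in node:
        below = conditional(child, site, mu, freqs)
        for i in BASES:
            result[i] *= sum(p(i, j, t, mu, freqs) * below[j] for j in BASES)
    return result


def site_likelihood(tree, site, mu, freqs):
    """Weight the root likelihoods by the equilibrium frequencies."""
    root = conditional(tree, site, mu, freqs)
    return sum(freqs[i] * root[i] for i in BASES)


def alignment_likelihood(tree, alignment, mu, freqs):
    """Product of the per-site likelihoods over all columns."""
    length = len(next(iter(alignment.values())))
    total = 1.0
    for pos in range(length):
        site = {name: seq[pos] for name, seq in alignment.items()}
        total *= site_likelihood(tree, site, mu, freqs)
    return total


uniform = {b: 0.25 for b in BASES}
tree = ((("s1", 0.1), ("s2", 0.1)), 0.2), ("s3", 0.3)
print(alignment_likelihood(tree, {"s1": "AC", "s2": "AC", "s3": "AT"}, 1.0, uniform))
```

A tree search would call `alignment_likelihood` for each candidate topology and keep the one with the highest value. Real implementations work with log-likelihoods to avoid underflow on long alignments.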

47
  • Maximum likelihood is arguably the gold standard
    for phylogenetic analysis, but because of its
    computational intensity it can only be used for
    select data, and only after much initial fine
    tuning of many parameters of sequence alignments
  • Often used to distinguish between several already
    generated trees

48
Assessing trees
  • The bootstrap: randomly sample all positions
    (columns in an alignment) with replacement
    (meaning some columns can be repeated) while
    conserving the number of positions; build a large
    dataset of these randomized samples

49
Bootstrap alignment process
50
  • Then use your method (distance, parsimony,
    likelihood) to generate another tree
  • Do this a thousand or so times
  • Note that if the assumptions the method is based
    on hold, you should always get the same tree from
    the bootstrapped alignments as you did originally
  • The frequency of some feature of your phylogeny
    in the bootstrapped set gives some measure of the
    confidence you can have for this feature
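
The resampling and counting steps can be sketched as follows. The function names are assumptions, and `build_tree` and `has_feature` are placeholders for whichever tree-building method and tree feature are being assessed; neither is defined by the slides.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, has_feature,
                      replicates=1000, seed=0):
    """Fraction of bootstrap replicates whose tree shows the feature."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        tree = build_tree(bootstrap_alignment(alignment, rng))
        if has_feature(tree):
            hits += 1
    return hits / replicates


alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "AGGTTC"}
print(bootstrap_alignment(alignment, random.Random(1)))
```

Because the same column indices are applied to every sequence, each resampled column is a column of the original alignment, and the number of positions is unchanged.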

51
In-class exercise III
  • Use the same dataset, select distance again.
    This time, select the bootstrap box.
  • In options, make sure to select the box labelled
    Save a file containing PAUP screen output. Take
    defaults for everything else. Run.
  • Inspect your output. In particular, look at the
    paup.log file and compare it to the
    paupdisplay.figure file.

52
  • Repeat for the maximum parsimony method.
  • Were the original trees (not bootstrapped)
    meaningful?