Phylogenetic Analysis

Slides: 53

Provided by: johnsa98

Transcript and Presenter's Notes

1
Phylogenetic Analysis
2
General comments on phylogenetics

Phylogenetics is the branch of biology that deals
with evolutionary relatedness
It uses some measure of evolutionary relatedness,
e.g., morphological features

Phylogenetics on sequence data is an attempt to
reconstruct the evolutionary history of those
sequences
Relationships between individual sequences are
not necessarily the same as those between the
organisms they are found in

The ultimate goal is to be able to use sequence
data from many sequences to give information
about phylogenetic history of organisms
Phylogenetic relationships are usually depicted as
trees, with branches representing ancestors of
"children"; the tips at the bottom of the tree
(individual organisms) are leaves. Individual
branch points are nodes.

5
Phylogenetic trees
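[Figure: an unrooted tree and a rooted tree, each with leaves A, B, C, and D. The rooted tree has an explicit time axis; on the unrooted tree the direction of time is marked "time?".]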
6

We will only consider binary trees: edges split
only into two branches (daughter edges)
Rooted trees have an explicit ancestor; the
direction of time is explicit in these trees
Unrooted trees do not have an explicit ancestor;
the direction of time is undetermined in such
trees

7
Types of phylogenetic analysis methods

Phenetic: trees are constructed based on
observed characteristics, not on evolutionary
history
Cladistic: trees are constructed based on fitting
observed characteristics to some model of
evolutionary history

Phenetic: distance methods
Cladistic: parsimony and maximum likelihood methods
8
Similarity and Homology

The evolutionary relationship between sequences
is inferred from the similarity of the sequences
Similarity is a measurable quantity (e.g.,
identity, alignment score, etc.)
Homology is the inference from sequence
similarity data that sequences are evolutionarily
related

9
Sequence alignments

Aligning sequences gives information about:
Similarity
Areas of sequences that are conserved through
evolution

10
The real problem

How do we compare sequences?
Seq 1   CTGCACTA
Seq 2   CACTA
or      C---ACTA

Scoring tries to approximate evolution: scores
for substitutions and for gaps
(insertions/deletions)
Score = sum of terms for substitutions and for
gaps (sequence as character string)

12
Sequence alignment I

Simplest scoring: 1 for match, 0 for no match

CTGCACTA        CTGCACTA
   CACTA        C---ACTA

Score 5         Score 5
13
Sequence alignment II

Slightly more advanced scoring: 1 for match, 0
for no match, -1 for gap

CTGCACTA        CTGCACTA
   CACTA        C---ACTA

Score 5         Score 2
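
The two scoring schemes above can be checked with a short sketch. It assumes the first alignment places CACTA under the last five letters of CTGCACTA (written here with leading '-' characters) and that end gaps are not penalized; both assumptions are inferred from the scores on the slides, and the function name and parameters are illustrative only.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1, free_end_gaps=True):
    """Score two equal-length gapped strings column by column ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")

    def end_gap_columns(seq):
        # Columns covered by leading or trailing '-' characters.
        lead = len(seq) - len(seq.lstrip("-"))
        trail = len(seq) - len(seq.rstrip("-"))
        return set(range(lead)) | set(range(len(seq) - trail, len(seq)))

    free = set()
    if free_end_gaps:
        # Assumption: gaps hanging off either end cost nothing.
        free = end_gap_columns(top) | end_gap_columns(bottom)

    total = 0
    for col, (a, b) in enumerate(zip(top, bottom)):
        if a == "-" or b == "-":
            if col not in free:
                total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


offset_score = score_alignment("CTGCACTA", "---CACTA")   # end gaps only
gapped_score = score_alignment("CTGCACTA", "C---ACTA")   # three internal gaps
```

Setting gap=0 gives the simplest scheme, where both alignments score the same; with gap=-1 the internally gapped alignment loses one point per gap column.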
14

    G   C   A   T
G   1   0   0   0
C   0   1   0   0
A   0   0   1   0
T   0   0   0   1

    G   C   A   T
G   1  -1  -1  -1
C  -1   1  -1  -1
A  -1  -1   1  -1
T  -1  -1  -1   1

Identity scoring matrices: simple form (top) and
form with mismatch penalty (below)

15
In-class exercise II

Using the advanced scoring method, calculate the
scores for the following pairs of nucleotide
sequences:

CCTGGGCTATGC CAGGGTT-TGC
CCTGGGCTATGC CA-GGG-TTTGC
16
What about proteins?

The chemistry of amino acids means that some
substitutions in the sequence are better than
others
Substitution matrix: empirically derived scores
for the frequency of substitution of each amino
acid by each of the 19 others.

17
BLOSUM 62 Substitution matrix
18
In-class exercise III

Using the BLOSUM62 substitution matrix and a gap
penalty of -2, score the following pairs of
protein sequences (do not penalize end gaps)

YIHMNVFLSFML    RVGAANFPNPRL
YIHMNVFLSFML    FIHMNLFVSFML
YIHMNVFLSFML    IHMNLFV--SFML
YIHMNVFLSFML    IVLSMMFFLNHY
19
Dynamic programming strategy

Break the alignment problem into small pieces
Optimize the first piece
Then extend into the second piece; since the first
piece is already optimized, the program only needs
to optimize the extension
Continue until the end of the comparison

20
          H    E    E
     0   -6  -12  -18
H   -6   10    4   -2
E  -12    4   16   10
A  -18   -2   10   15
E  -24   -8    4   16
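
The table above is consistent with a global (Needleman-Wunsch style) dynamic programming fill. The sketch below reproduces it under assumptions the slide does not state: a linear gap penalty of -6 (read off the first row and column) and the substitution scores listed in SCORES, which were chosen only because they regenerate these numbers.

```python
def global_alignment_table(rows, cols, score, gap):
    """Fill the score table F, where F[i][j] is the best score for
    aligning rows[:i] with cols[:j]."""
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # leading gaps in the column sequence
    for j in range(1, m + 1):
        F[0][j] = j * gap          # leading gaps in the row sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = frozenset((rows[i - 1], cols[j - 1]))
            F[i][j] = max(
                F[i - 1][j - 1] + score[pair],  # align the two residues
                F[i - 1][j] + gap,              # gap in the column sequence
                F[i][j - 1] + gap,              # gap in the row sequence
            )
    return F


# Assumed scores (not given on the slide), keyed by unordered residue pair.
SCORES = {
    frozenset("H"): 10,
    frozenset("E"): 6,
    frozenset("HE"): 0,
    frozenset("HA"): -2,
    frozenset("EA"): -1,
}

table = global_alignment_table("HEAE", "HEE", SCORES, gap=-6)
for row in table:
    print(row)
```

Each cell depends only on its three already-filled neighbors, which is the "optimize the first piece, then extend" idea: nothing to the upper left is ever recomputed.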
23
Why multiple alignments?

Alignment of more than two sequences
Usually gives better information about conserved
regions and function (more data)
Better estimate of significance when using a
sequence of unknown function
Must use multiple alignments when establishing
phylogenetic relationships

24
Dynamic programming extended to many dimensions?

No: it uses up too much computer time and space
E.g., for sequences of 200 amino acids, a pairwise
alignment must evaluate 4 x 10^4 matrix elements
If 3 sequences, 8 x 10^6 matrix elements
If 6 sequences, 6.4 x 10^13 matrix elements

Need to find a more efficient method
Sacrifice the certainty of an optimum alignment
for a good alignment that is found faster
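
The element counts quoted above are just the sequence length raised to the number of sequences. A minimal sketch, assuming all sequences have the same length (the function name is illustrative):

```python
def dp_matrix_elements(sequence_length, n_sequences):
    """Number of cells in a full dynamic programming matrix when every
    sequence has the same length."""
    return sequence_length ** n_sequences


for n in (2, 3, 6):
    print(n, "sequences:", dp_matrix_elements(200, n), "elements")
```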

26
Feng-Doolittle algorithm

Does all pairwise alignments and scores them
Converts pairwise scores to distances
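$D = -\log S_{\mathrm{eff}} = -\log \dfrac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\max} - S_{\mathrm{rand}}}$
$S_{\mathrm{obs}}$: pairwise alignment score
$S_{\mathrm{rand}}$: expected score for a random alignment
$S_{\max}$: average of the self-alignment scores of the two
sequences

As $S_{\mathrm{obs}}$ approaches $S_{\mathrm{rand}}$ (increasing evolutionary
distance), $S_{\mathrm{eff}}$ goes down; to make the distance
measure positive, use the $-\log$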
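
As a minimal sketch of the score-to-distance conversion described above, the function below takes the observed score, the random-alignment score, and the two self-alignment scores. The function and parameter names are illustrative, and how the random score is estimated is left outside the sketch.

```python
import math


def feng_doolittle_distance(s_obs, s_rand, s_self_a, s_self_b):
    """D = -log(S_eff), with S_eff = (S_obs - S_rand) / (S_max - S_rand)."""
    s_max = (s_self_a + s_self_b) / 2.0   # average of the two self-alignments
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # The pair scores no better than random, so the log is undefined.
        raise ValueError("observed score must exceed the random score")
    return -math.log(s_eff)


identical = feng_doolittle_distance(100, 0, 100, 100)   # S_eff = 1
diverged = feng_doolittle_distance(50, 0, 100, 100)     # S_eff = 0.5
```

Identical sequences give a distance of zero, and the distance grows as the observed score falls toward the random score.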

Once the distances have been calculated,
construct a guide tree (more in the phylogeny
class); it tells what order to group the sequences
Sequences can be aligned with sequences or
groups; groups can be aligned with groups

Sequence-sequence alignments: dynamic programming
Sequence-group alignments: all possible pairwise
alignments between sequence and group are tried;
the highest scoring pair determines how the
sequence gets aligned to the group
Group-group alignments: all possible pairwise
alignments of sequences between groups are tried;
the highest scoring pair determines how the groups
get aligned

30
Example
[Figure: guide tree over Seq1 to Seq5 showing the progressive steps Alignment 1, Alignment 2, Alignment 3, and the final alignment.]
31

Notice that this method does not guarantee the
optimum alignment, just a good one.
Gaps are preserved from alignment to alignment:
once a gap, always a gap

32
Distance methods

Measuring distance: just like when we talked
about multiple alignment, distance represents all
the differences at the various positions. These
differences can be treated as equal or weighted
according to empirical knowledge of substitution
rates

Another way to say this is that there is a set
of distances dij between each pair of sequences
i,j in the dataset. dij can be the fraction f of
sites u where residues xi and xj differ, or dij
can be such a fraction but weighted in some way
(e.g. Jukes-Cantor distance)
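
Both kinds of distance can be sketched in a few lines. The fraction f is computed over aligned columns, and the Jukes-Cantor correction for nucleotides is d = -(3/4) ln(1 - 4f/3). Skipping columns that contain a gap is an assumption of this sketch, not something the slides specify.

```python
import math


def p_distance(seq_i, seq_j):
    """Fraction f of compared sites where the two aligned sequences differ."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_i, seq_j):
        if a == "-" or b == "-":
            continue                      # assumption: ignore gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def jukes_cantor_distance(f):
    """Jukes-Cantor corrected distance for a nucleotide difference fraction f."""
    if f >= 0.75:
        raise ValueError("correction is undefined for f >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


f = p_distance("ACGTACGT", "ACGAACGA")    # 2 of 8 sites differ
d = jukes_cantor_distance(f)
```

The corrected distance is always at least as large as f, because it accounts for multiple substitutions at the same site.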

34
Clustering algorithms

UPGMA: this is the distance clustering method
that is used in pileup to make the guide tree
dij is the average distance between pairs of
sequences found in two clusters, Ci and Cj.
Text's notation: |Ci| is the number of sequences in Ci

The algorithm in the text means just what we said
before: find the closest distance between two
sequences and cluster those, then find the next
closest distance and cluster those. As sequences
are added to existing clusters, find the average
distance between existing clusters
Work through the notation!
UPGMA assumes a molecular clock mechanism of
evolution
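
The clustering loop described above can be sketched as follows. The representation (nested tuples for clusters, a dictionary keyed by unordered leaf pairs for distances) and the choice to place each new node at half the average distance are assumptions of this sketch; the toy distances are made up for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by average pairwise distance.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Returns (root, heights): the tree as nested tuples and the height of
    every node.
    """
    clusters = {name: [name] for name in names}   # node -> member leaves
    heights = {name: 0.0 for name in names}

    def average_distance(ci, cj):
        total = sum(dist[frozenset((p, q))]
                    for p in clusters[ci] for q in clusters[cj])
        return total / (len(clusters[ci]) * len(clusters[cj]))

    while len(clusters) > 1:
        # Find the closest pair of current clusters and merge them.
        ci, cj = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        d = average_distance(ci, cj)
        merged = (ci, cj)
        clusters[merged] = clusters.pop(ci) + clusters.pop(cj)
        heights[merged] = d / 2.0                 # molecular clock assumption
    (root,) = clusters
    return root, heights


toy = {frozenset(p): d for p, d in {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}.items()}
root, heights = upgma(["A", "B", "C", "D"], toy)
```

Averaging over the original leaf distances, rather than over previously merged values, is what makes every pair of sequences count equally.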

Neighbor-joining corrects for UPGMA's assumption
of the same rate of evolution for each branch by
modifying the distance matrix to reflect
different rates of change.
The net difference between sequence i and all
other sequences is
$r_i = \sum_k d_{ik}$
37

The rate-corrected distance matrix is then
$M_{ij} = d_{ij} - (r_i + r_j)/(n - 2)$
Join the two sequences whose $M_{ij}$ is minimal, then
calculate the distance from this new node k to all
other sequences m using
$d_{km} = (d_{im} + d_{jm} - d_{ij})/2$
Again correct for rates and join nodes.
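
A minimal sketch of the join loop follows, using the net divergence r, the rate-corrected matrix M, and the update rule for distances to the new node. It returns only the topology (branch lengths are omitted), and the data structures and toy distances are assumptions made for illustration. In the toy data the smallest raw distance is between A and C, yet the rate correction still pairs A with B.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return an unrooted topology as nested tuples.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}

        def corrected(pair):
            i, j = pair
            return d[frozenset(pair)] - (r[i] + r[j]) / (n - 2)

        i, j = min(combinations(nodes, 2), key=corrected)
        new = (i, j)
        for m in nodes:
            if m not in (i, j):
                # Distance from the new node to every remaining node.
                d[frozenset((new, m))] = (d[frozenset((i, m))]
                                          + d[frozenset((j, m))]
                                          - d[frozenset((i, j))]) / 2.0
        nodes = [m for m in nodes if m not in (i, j)] + [new]
    return tuple(nodes)


toy = {frozenset(p): v for p, v in {
    ("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
    ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}.items()}
topology = neighbor_joining(["A", "B", "C", "D"], toy)
```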

38
In-class exercise I

Retrieve the file named phylo2 from bioinfI.list
in my directory
Open it in the editor, select all the sequences
Select Functions > Evolution > PAUPSearch. In
Tree Optimality Criterion choose distance; in
Method for Obtaining Best Tree choose heuristic.
Leave everything else as default (make sure the
bootstrap option is not selected)
Select Run. Inspect output

39
Parsimony methods

Parsimony methods are based on the idea that the
most probable evolutionary pathway is the one
that requires the smallest number of changes from
some ancestral state
For sequences, this implies treating each
position separately and finding the minimal
number of substitutions at each position

40
Example of parsimonious tree building

The tree on the left requires only one change; the
tree on the right requires two. The left tree is
the most parsimonious

Parsimony methods assign a cost to each tree
available to the dataset, then screen the trees
and select the most parsimonious
Screening all the trees available to even a
smallish dataset would take too much time. The
branch and bound method builds trees with
increasing numbers of leaves but abandons a
topology whenever the current tree already has a
bigger cost than the best complete tree found so
far
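
One common way to assign a cost to a given tree is Fitch's algorithm, which counts the minimum number of substitutions at each site. The slides do not name a specific algorithm, so this is an illustrative choice; the tree encoding (nested pairs of leaf names) and the toy alignment are also assumptions.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is a leaf name or a pair (left, right); states maps each leaf
    name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No shared state: one substitution is needed on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, alignment):
    """Sum the per-site minimum changes, treating each site separately."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[site] for name, seq in alignment.items()})[1]
        for site in range(length)
    )


alignment = {"A": "A", "B": "A", "C": "G", "D": "G"}
cost_grouped = parsimony_cost((("A", "B"), ("C", "D")), alignment)
cost_mixed = parsimony_cost((("A", "C"), ("B", "D")), alignment)
```

Grouping the sequences that share a state needs one change, while the mixed grouping needs two, so the first tree is the more parsimonious.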

42
In-class exercise II

Use same data set and program as in exercise I,
but choose maximum parsimony. Use heuristic for
the tree building method.
Inspect your tree. Compare it to the distance
generated tree.

43
Maximum likelihood methods

Maximum likelihood reconstructs a tree according
to an explicit model of evolution. For the given
model, no other method will work as well
But such models must be simple, because the
method is computationally intensive

Actually, all the other methods discussed
implicitly use a simple model of evolution
similar to the typical model made explicit in
maximum likelihood:
All sites selectively neutral
All mutate independently, forward and reverse
rates equal, given by $\mu$

Also assume discrete generations and sites change
independently
Given this model, can calculate the probability
that a site with initial nucleotide i will change
to nucleotide j within time t:
$P^t_{ij} = \delta_{ij} e^{-\mu t} + (1 - e^{-\mu t}) g_j$,
where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$
otherwise, and where $g_j$ is the
equilibrium frequency of nucleotide j
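
The transition probability above is easy to evaluate directly. This sketch assumes the rate symbol on the slide is a single rate mu, and it uses equal equilibrium frequencies of 0.25 purely as an example; the function name is illustrative.

```python
import math


def transition_probability(i, j, t, mu, equilibrium):
    """Probability that a site in state i is in state j after time t."""
    decay = math.exp(-mu * t)
    delta = 1.0 if i == j else 0.0
    return delta * decay + (1.0 - decay) * equilibrium[j]


# Example frequencies only; real analyses estimate these from the data.
freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
row = {j: transition_probability("A", j, t=1.0, mu=0.5, equilibrium=freqs)
       for j in freqs}
```

Two sanity checks follow from the formula: at t = 0 nothing has changed, and as t grows the probabilities approach the equilibrium frequencies regardless of the starting nucleotide.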

The likelihood that some site is in state i at
the kth node of a tree is Li(k)
The likelihoods for all states for each site for
each node are calculated separately; the product
of the likelihoods for each site gives the
overall likelihood for the observed data
Different tree topologies are searched to find
the highest overall likelihood

Maximum likelihood is arguably the gold standard
for phylogenetic analysis, but because of its
computational intensity it can only be used for
select data and only after much initial fine
tuning of many parameters of sequence alignments
Often used to distinguish between several already
generated trees

48
Assessing trees

The bootstrap: randomly sample all positions
(columns in an alignment) with replacement,
meaning some columns can be repeated, but
conserving the number of positions. Build a large
dataset of these randomized samples

49
Bootstrap alignment process
50

Then use your method (distance, parsimony,
likelihood) to generate a tree from each
bootstrapped alignment
Do this a thousand or so times
Note that if the assumptions the method is based
on hold, you should always get the same tree from
the bootstrapped alignments as you did originally
The frequency of some feature of your phylogeny
in the bootstrapped set gives some measure of the
confidence you can have for this feature
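
The resampling step and the frequency count can be sketched as below. The build_tree argument is a hypothetical callable standing in for whichever tree-building method is used; it must return something hashable (for example a nested tuple), and here whole trees are counted rather than individual features.

```python
import random
from collections import Counter


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the length.

    alignment maps sequence names to aligned strings of equal length.
    """
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in picks)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, replicates=1000, seed=0):
    """Fraction of bootstrap replicates that yield each distinct tree."""
    rng = random.Random(seed)
    counts = Counter(build_tree(bootstrap_columns(alignment, rng))
                     for _ in range(replicates))
    return {tree: n / replicates for tree, n in counts.items()}


alignment = {"s1": "ACGT", "s2": "AGGT", "s3": "TCGA"}
sample = bootstrap_columns(alignment, random.Random(1))
```

Because every sequence uses the same list of picked columns, each resampled column is still a real column of the original alignment; only the column frequencies change.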

51
In-class exercise III

Use the same dataset, select distance again.
This time, select the bootstrap box.
In options, make sure to select the box labelled
Save a file containing PAUP screen output. Take
defaults for everything else. Run.
Inspect your output. In particular, look at the
paup.log file and compare it to the
paupdisplay.figure file.

Repeat for the maximum parsimony method.
Were the original trees (not bootstrapped)
meaningful?
