Title: MICA 8006 Protein Sequence Analysis
1 MICA 8006 Protein Sequence Analysis
  Sept. 28 - alignment methods
  Sept. 30 - phylogenetic methods
  Dr. Steven Cannon
  Feel free to email with questions: cann0010_at_umn.edu
2 Which sequences are most related in this alignment?
3 Which sequences are most related in this alignment?
  Indels removed, sorted by pairwise ID
4 What is most related in this alignment?
  Indels removed, sorted by pairwise ID -- average distance tree, Jalview
5 Outline
  - Terms
  - Clustering
  - Distance methods
    - UPGMA
    - NJ
  - Parsimony
  - Maximum likelihood
  - Bayesian likelihood
  - Bootstrapping
  - Programs, data formats
  - Examples
6 Sequence --> tree

      5    10
  Alpha     ABCDEFGHIK
  Beta      AB--EFGHIK
  Gamma     ?BCDSFG??
  Delta     CIKDEFGHIK
  Epsilon   DIKDEFGHIK

       --------Gamma
       !
    --2     --Epsilon
    !  !  --4
    !  --3  --Delta
    1     !
    !     -----Beta
    !
    -----------Alpha
7 Why phylogenetic trees?
  Branch lengths may be used to indicate the number of changes or the amount of evolution.
  [Figure: tree with internal nodes 1, 2, and 3 and tips Beta, Epsilon, Delta, Gamma, and one truncated label ("A"), drawn with unequal branch lengths]
8 Some terms
  - Dendrogram or cladogram: no branch lengths; topology only
  - Phylogram, phylogeny: usually indicates branch lengths
  - Clade: subtree; group of sequences
9 Basic clustering methods
  Calculate distances between each pair of sequences.
  - Single linkage: similarity between any 2 groups = minimum pairwise difference (or maximum similarity) between any 2 members of the 2 groups
  - Complete linkage: similarity between any 2 groups = maximum pairwise difference (or minimum similarity) between any 2 members of the 2 groups
  - Average linkage: takes the average of the similarities between members of the 2 clusters
    - UPGMA (Unweighted Pair-Group Method using Arithmetic averages)
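The three linkage rules above differ only in how the between-group pairwise distances are summarized. A minimal sketch, assuming four hypothetical sequences A-D with made-up distances (none of these values come from the slides):

```python
from itertools import product

# Hypothetical pairwise distances (illustrative values only).
PAIRWISE = {
    frozenset(("A", "C")): 0.4,
    frozenset(("A", "D")): 0.6,
    frozenset(("B", "C")): 0.5,
    frozenset(("B", "D")): 0.7,
}


def group_distance(group_1, group_2, pairwise, method):
    """Distance between two groups under a given linkage rule."""
    # Every distance between a member of group_1 and a member of group_2.
    between = [pairwise[frozenset((x, y))] for x, y in product(group_1, group_2)]
    if method == "single":      # minimum pairwise difference
        return min(between)
    if method == "complete":    # maximum pairwise difference
        return max(between)
    if method == "average":     # arithmetic mean of all pairs
        return sum(between) / len(between)
    raise ValueError(f"unknown linkage method: {method}")


single = group_distance(["A", "B"], ["C", "D"], PAIRWISE, "single")
complete = group_distance(["A", "B"], ["C", "D"], PAIRWISE, "complete")
average = group_distance(["A", "B"], ["C", "D"], PAIRWISE, "average")
```

For the same two groups, single linkage reports the closest pair, complete linkage the most distant pair, and average linkage a value in between.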
10 DISTANCE methods
  - Operate on distance matrices, not individual characters.
    - loss of information
    - branch lengths may be negative (non-interpretable)
  - A problem: the distance matrix usually generates conflicts, because real distances are rarely additive (they do not fit any single tree exactly).
11 DISTANCE methods: UPGMA, WPGMA
  o Calculate distances between all taxa.
  o Group the most similar.
  o Calculate distance between new node and the other taxa.
  o Similarity between any single taxon and a group is the average of the pairwise similarities between that taxon and each member of the group.
  o Repeat until all nodes and groups have been joined.
  o Assumes an additive tree, and equal rates of evolution.
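The steps above can be sketched as a short loop. This is a minimal illustration, not a production implementation: the function name, the string labels for merged clusters, and the input layout are assumptions made here, and the distances are the I/II/III values from the worked example in these slides.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the list of merges as (cluster_label, node_height) pairs.

    dist maps a pair of names, e.g. ("I", "II"), to their distance.
    """
    sizes = {name: 1 for name in names}          # active clusters -> leaf count
    d = {frozenset(k): v for k, v in dist.items()}
    merges = []
    while len(sizes) > 1:
        # Group the most similar pair of active clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2        # equal rates: split distance evenly
        new = f"({a},{b})"
        # Distance from the new node to every other cluster: average over
        # all leaves, i.e. weighted by cluster size (this is the "U" in UPGMA).
        for c in sizes:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[new] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
        merges.append((new, height))
    return merges


DIST = {("I", "II"): 0.20, ("I", "III"): 0.50, ("II", "III"): 0.50}
merges = upgma(["I", "II", "III"], DIST)
```

WPGMA would differ only in the update step, using the plain mean of the two cluster distances instead of the size-weighted mean.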
12 DISTANCE methods: UPGMA, WPGMA

  I    aagtcatgct
  II   aaatcaggct
  III  cagacagtca

  Distance matrix
         I      II     III
  I      -
  II     0.20   -
  III    0.50   0.50   -

  I
   \ 0.10
    \          0.40
     .---------------- III
    /
   / 0.10
  II
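The matrix values are consistent with a simple proportion of mismatched sites (a p-distance). The slides do not name the distance measure, so treating it as a p-distance is an assumption; a minimal sketch that reproduces the three values:

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(x != y for x, y in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


SEQS = {"I": "aagtcatgct", "II": "aaatcaggct", "III": "cagacagtca"}

# One entry per unordered pair of sequences.
matrix = {
    (m, n): p_distance(SEQS[m], SEQS[n]) for m, n in combinations(SEQS, 2)
}
```

I and II differ at 2 of 10 sites, and III differs from each of them at 5 of 10 sites.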
13 DISTANCE methods: UPGMA, WPGMA
  (Same sequences, distance matrix, and tree as on the previous slide.)
  'Distortion' is the difference between the observed matrix and a matrix derived from the resulting tree.
  How to minimize the distortion? The Fitch and Margoliash method finds a tree with the least distortion in the t(t-1)/2 pairwise distances of t taxa. (Costly)
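For three taxa the branch lengths can be solved exactly from the three pairwise distances, which gives a way to check the 0.10 / 0.10 / 0.40 tree and to compute a distortion. A minimal sketch: the function names are hypothetical, and the unweighted sum of squared differences used here is an assumption (the Fitch and Margoliash criterion is usually described as weighting each term by the inverse square of the observed distance).

```python
def three_taxon_branches(d12, d13, d23):
    """Branch lengths of the unrooted 3-taxon tree that fits the distances exactly."""
    b1 = (d12 + d13 - d23) / 2
    b2 = (d12 + d23 - d13) / 2
    b3 = (d13 + d23 - d12) / 2
    return b1, b2, b3


def distortion(observed, from_tree):
    """Unweighted sum of squared differences between two distance tables."""
    return sum((observed[pair] - from_tree[pair]) ** 2 for pair in observed)


observed = {("I", "II"): 0.20, ("I", "III"): 0.50, ("II", "III"): 0.50}
b1, b2, b3 = three_taxon_branches(0.20, 0.50, 0.50)

# Path lengths through the tree: each pair is joined by two branches.
from_tree = {("I", "II"): b1 + b2, ("I", "III"): b1 + b3, ("II", "III"): b2 + b3}
total_distortion = distortion(observed, from_tree)
```

With only three taxa the fit is exact, so the distortion is zero up to rounding; conflicts appear once four or more taxa are involved.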
14 DISTANCE methods: Neighbor joining
  - Fast
  - Handles branch lengths badly
  - Not model-based
  - Performs poorly with many sequences
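Neighbor joining does not simply join the closest pair; it picks the pair that minimizes a corrected criterion, Q(i,j) = (n-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k). A minimal sketch of that single selection step, using a hypothetical five-taxon matrix (the values are illustrative and not from the slides):

```python
from itertools import combinations

# Hypothetical distances among five taxa (illustrative values only).
RAW = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7,
    ("d", "e"): 3,
}
D = {frozenset(pair): value for pair, value in RAW.items()}
NAMES = ["a", "b", "c", "d", "e"]


def nj_pick_pair(names, d):
    """Return the pair with the smallest Q value, plus the full Q table."""
    n = len(names)
    # Sum of distances from each taxon to all the others.
    row_sum = {
        i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names
    }
    q = {
        (i, j): (n - 2) * d[frozenset((i, j))] - row_sum[i] - row_sum[j]
        for i, j in combinations(names, 2)
    }
    best = min(q, key=q.get)
    return best, q


best_pair, q_table = nj_pick_pair(NAMES, D)
```

Here d and e are the closest pair (distance 3), but a and b have the lowest Q and are joined first, because Q also accounts for how far each taxon is from everything else.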
15 Parsimony
  Parsimony = simple, stingy. The simplest theory is preferable.
  The preferred tree requires the smallest number of changes to explain the differences among the sequences.

  Site     1  2  3  4
  Seq 1    A  G  G  A
  Seq 2    A  G  G  G
  Seq 3    A  A  C  A
  Seq 4    A  A  C  G
16 Parsimony

  Tree I
    1 A G G A        A A C A 3
              \    /
               ----
              /    \
    2 A G G G        A A C G 4

  Tree II
    1 A G G A        A G G G 2
              \    /
               ----
              /    \
    3 A A C A        A A C G 4

  Tree III
    1 A G G A        A G G G 2
              \    /
               ----
              /    \
    4 A A C G        A A C A 3

  Site     1  2  3  4
  Seq 1    A  G  G  A
  Seq 2    A  G  G  G
  Seq 3    A  A  C  A
  Seq 4    A  A  C  G

  Site 1: uninformative.
  Site 2: G --> A supports tree I.
  Site 3: G --> C supports tree I.
  Site 4: A --> G supports tree II.
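The site-by-site reasoning above can be checked by counting the minimum number of changes each tree needs. A minimal sketch using Fitch's counting rule on nested tuples; the tree encoding and function names are assumptions made here for illustration, while the sequences and groupings are the ones on the slide.

```python
SEQS = {"1": "AGGA", "2": "AGGG", "3": "AACA", "4": "AACG"}

# Each tree is written as its two pairs of neighbors.
TREES = {
    "I": (("1", "2"), ("3", "4")),
    "II": (("1", "3"), ("2", "4")),
    "III": (("1", "4"), ("2", "3")),
}


def fitch(tree, states):
    """Return (possible_states, min_changes) for one site on a nested-tuple tree."""
    if isinstance(tree, str):                 # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:                                # children agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, seqs):
    """Total number of changes over all sites."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for site in range(n_sites):
        states = {name: seq[site] for name, seq in seqs.items()}
        total += fitch(tree, states)[1]
    return total


scores = {name: tree_length(tree, SEQS) for name, tree in TREES.items()}
```

Tree I needs 4 changes, tree II needs 5, and tree III needs 6, so tree I is the most parsimonious: sites 2 and 3 favor it, and only site 4 favors tree II.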
17 Maximum likelihood
  Allows for multiple character changes on a branch.
  Likelihood of data = probability of the data (sequences), given the model (tree + sequence model): L_D = P(D|H)
  To evaluate a tree: for each sequence and character in the sequence, calculate the probability of observing that character, given a substitution model and an ancestral state in that tree.
  Probabilities for each aligned position are multiplied to get the tree likelihood.
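The "multiply the per-position probabilities" step can be illustrated on the simplest possible tree: two sequences joined by one branch. This sketch assumes the Jukes-Cantor (JC69) nucleotide model, which the slides do not specify, and uses sequences I and II from the distance example; a real protein analysis would use an amino-acid model and a full tree.

```python
import math


def jc69_site_prob(x, y, t):
    """Probability of observing bases x and y at one site, branch length t (JC69)."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay        # base unchanged along the branch
    p_other = 0.25 - 0.25 * decay       # changed to one specific other base
    # 0.25 is the equilibrium frequency of the starting base.
    return 0.25 * (p_same if x == y else p_other)


def log_likelihood(seq_a, seq_b, t):
    """Sum of log site probabilities (equivalent to multiplying the probabilities)."""
    return sum(
        math.log(jc69_site_prob(x, y, t)) for x, y in zip(seq_a, seq_b)
    )


SEQ_I = "aagtcatgct"
SEQ_II = "aaatcaggct"

# Try a few candidate branch lengths and keep the most likely one.
candidates = [0.05, 0.1, 0.23, 0.5, 1.0]
best_t = max(candidates, key=lambda t: log_likelihood(SEQ_I, SEQ_II, t))
```

Working with logs avoids numerical underflow when many small probabilities are multiplied. The best candidate is somewhat larger than the observed 0.20 proportion of differences, because the model allows for multiple changes at the same site.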
18 Bayesian likelihood
  A search method and refinement (with a twist) on maximum likelihood.
  Probability of the model (tree) given the data (seqs, matrix) = probability of the data given the tree x the prior probability of the tree, normalized over all possible trees for this number of taxa.
  MCMC -- Markov Chain Monte Carlo -- walk through tree space, (usually) accepting more likely trees.
  Our belief about the phylogeny changes during the search. We end with probabilities for each examined tree and clade.
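The number of possible trees grows very quickly with the number of taxa, which is why the search samples tree space instead of enumerating it. A minimal sketch that counts unrooted, strictly bifurcating trees using the standard product 3 x 5 x ... x (2n-5); the function name is hypothetical.

```python
def num_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa labeled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


counts = {n: num_unrooted_trees(n) for n in (3, 4, 5, 10)}
```

Four taxa give the 3 trees considered in the parsimony example; ten taxa already give more than two million.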
19 Bootstrapping likelihood
  Generate 1000 pseudo-alignments by sampling alignment columns with replacement, and calculate a tree for each.
  Get the consensus and count clade frequencies.
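Sampling with replacement here means drawing whole alignment columns, so each pseudo-alignment has the same length as the original but some columns appear more than once and others not at all. A minimal sketch of generating the replicates only (tree building and consensus are left out); the function name and the use of the I/II/III sequences as the alignment are choices made for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-alignment built from columns sampled with replacement."""
    n_cols = len(next(iter(alignment.values())))
    # Draw n_cols column indices; repeats are allowed.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {
        name: "".join(seq[c] for c in cols) for name, seq in alignment.items()
    }


ALIGNMENT = {"I": "aagtcatgct", "II": "aaatcaggct", "III": "cagacagtca"}

rng = random.Random(0)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(ALIGNMENT, rng) for _ in range(1000)]
```

The same column indices are used for every sequence in a replicate, so the rows stay aligned with each other.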
20 Programs, data formats
  - PAUP: Nexus format
  - Phylip: Phylip / New Hampshire / Newick format
    (((a:0.2,b:0.2):0.4,(c:0.3,(d:0.5,((e:0.5,f:0.2):0.0,g:0.3):0.0):0.0):0.1):0.6,(h:0.6,i:0.3):0.3,j:0.5)
  - Mega2, etc.
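In the Newick format, parentheses group sister taxa, commas separate siblings, and a number after a colon is the length of the branch leading to that node. A minimal sketch of a reader for simple strings of this kind; it is a teaching example with hypothetical function names, not a full parser (no quoted names, comments, or whitespace handling), and the test string is a short made-up tree.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                          # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                      # consume ","
                children.append(node())
            pos += 1                          # consume ")"
        # Optional node name.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ":".
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return node()


def leaf_names(tree):
    """List the tip labels from left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


tree = parse_newick("((a:0.2,b:0.2):0.4,c:0.3);")
tips = leaf_names(tree)
```

Internal nodes come back with an empty name, and the root has no branch length.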
21 Neighbor joining on default Clustalw alignment
22 Neighbor joining on cleaned Clustalw alignment, bootstrap
23 Parsimony on cleaned alignment
24 Parsimony on cleaned alignment, with ML branches
25 Parsimony on cleaned alignment, with ML branches
   rooted(?) what about branch lengths?
26 NBS seqs in legumes. Medicago in red
27 NBS seqs, non-TIR subfam. NJ, rooted
28 NBS seqs, non-TIR subfam. ParsML, rooted
29 NBS seqs, non-TIR subfam. ParsML, rooted
30 Some key points
  - Prepare a good alignment.
  - Think about the results.
  - Compare alternate methods.
  - Consider bootstrap, branch lengths, rooting.
  - Add additional sequences, species for context.
  - Consider gene duplication, loss, genomic context.