Title: Likelihood surface (2-dimensions)
1Likelihood surface (2-dimensions)
For multiple parameters, imagine extending to
multiple dimensions.
2Defining what a tree means
unrooted tree (used when the root isn't known)
rooted tree (all real trees are rooted)
ancestral sequence
time vaguely radiates out from somewhere near the
center
divergence time is the sum of (horizontal)
branch lengths
11Tom Nicholas's favorite tree (vertebrate Cyp450)
12A tree has topology and distances
Note - topologically, these are the SAME tree. In
general, two trees are the same if they can be
inter-converted by branch rotations.
13The number of tree topologies grows extremely fast
3 leaves: 3 branches, 1 internal node, 1 topology (3 insertions)
4 leaves: 5 branches, 2 internal nodes, 3 topologies (x3) (5 insertions)
5 leaves: 7 branches, 3 internal nodes, 15 topologies (x5) (7 insertions)
In general, an unrooted tree with N leaves has 2N - 3 branches, N - 2 internal nodes, and (2N - 5)!! = 1 x 3 x 5 x ... x (2N - 5) topologies, a count that grows factorially with N.
14There are many rooted trees for each unrooted tree
For each unrooted tree, there are 2N - 3 times as
many rooted trees, where N is the number of
leaves (the root can be placed on any of the 2N - 3 branches).
15Trees can be enormously complex
This tree has 203 7-pass chemosensory protein
sequences and could adopt >>>10^100 topologies.
Even a tree of 20 sequences could adopt 2 x 10^20
topologies.
Note that this does NOT include branch length
differences!
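The counts on the last three slides can be checked with a short script. This is a minimal sketch, assuming strictly bifurcating trees: each new leaf can be inserted on any existing branch, so the unrooted count is the product 1 x 3 x 5 x ... x (2N - 5), and rooting multiplies it by the 2N - 3 branches. The function names are made up for illustration.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_topologies(n_leaves):
    """Number of unrooted bifurcating topologies for n_leaves >= 3."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n_leaves - 5)


def rooted_topologies(n_leaves):
    """The root can sit on any of the 2N - 3 branches of an unrooted tree."""
    return (2 * n_leaves - 3) * unrooted_topologies(n_leaves)


for n in (3, 4, 5, 20, 203):
    unrooted = unrooted_topologies(n)
    # Report the order of magnitude via the digit count, since the
    # larger values overflow a float.
    print(n, "leaves: about 10^%d unrooted topologies" % (len(str(unrooted)) - 1))
```

Python integers are arbitrary precision, so the 203-leaf case is computed exactly rather than approximated.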
16Duplication and Divergence
- Full gene duplication can occur in two patterns
- 1) speciation (orthology)
- 2) duplication within species (paralogy)
- Gene duplication within a species probably occurs only during aberrant DNA rearrangements.
- Immediately after duplication, the two copies of a gene are fully redundant, and one often mutates to non-functionality and fixes in the population (pseudogene).
- Some of the time (unclear how often) the two copies diverge to become functionally distinct enough that each function is under negative selection and both are retained.
17Orthology and paralogy trees
(duplication)
18How duplications can arise
1) genome duplication: 2N → 4N (e.g. failure in mitosis)
2) unequal crossing-over (non-allelic recombination)
3) replication slippage (seems to happen only with very short sequences)
4) transposon hops run amok
5) events involving nonhomology-based repair of double-strand breaks (often called nonhomologous end joining, NHEJ)
19Main mechanism in animals is probably non-allelic
recombination (NAR)
mispairing of existing duplication (red box)
duplication product
deletion product
20To reconstruct species phylogeny
- sequence orthologs from each species
- construct the gene tree: it probably corresponds to the species tree (question: assuming you construct the gene tree correctly, what could violate this?)
- do this for more than one gene to be sure: you should get (nearly) the same tree every time.
21Species phylogeny (and a footnote on
humanocentrism)
This tree is amazingly misleading
correct tree
Not to mention 1) the straight line to humans,
2) leaving out >90% of all species, including
three hominoids (bonobo, siamang, gibbon), and 3)
the fact that WE are an African ape.
22notice that small cold things are implicitly
less evolved
notice large size (not to mention white male)
23How do you make a tree from data?
- data typically takes the form of either a multiple sequence alignment or a set of all pairwise sequence alignments.
- common methods for going from the data to a tree include
- - Distance Matrix methods (e.g. Neighbor-Joining)
- - Parsimony
- - Maximum-likelihood
- in theory, the tree-enumerating methods (especially maximum-likelihood) are superior, but they are not always practical for large trees.
24Distance matrix methods
- methods based on a set of pairwise distances (no multiple alignment needed, though pairwise distances usually come from one).
- all methods (in essence) build a tree that tries to best match the distances.
- usual standard for best match is the least squares of the tree distances compared to the real pairwise distances
Let Dm be the matrix distances and Dt be the tree distances. Find the tree (an internally consistent set of Dt values) that minimizes the sum of squared differences over all pairs of leaves:
$$\sum_{i<j} \left( Dm_{ij} - Dt_{ij} \right)^2$$
25Example distance matrix (fraction amino acid
divergence)
fraction divergence
          EGL-2   dEag    rEag    UNC-103  dErg    HERG
EGL-2     0.000   0.276   0.353   0.542    0.547   0.525
dEag      0.276   0.000   0.305   0.512    0.501   0.508
rEag      0.353   0.305   0.000   0.533    0.515   0.510
UNC-103   0.542   0.512   0.533   0.000    0.274   0.255
dErg      0.547   0.501   0.515   0.274    0.000   0.263
HERG      0.525   0.508   0.510   0.255    0.263   0.000
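To make the least-squares idea concrete, here is a small sketch that uses three entries of this matrix (EGL-2, dEag, rEag). It assumes the simplest possible tree, three leaves joined at one internal node, so each tree distance Dt is the sum of two branch lengths. The function names and the "guess" branch lengths are hypothetical, chosen only for illustration.

```python
from itertools import combinations

# Matrix distances Dm for three of the sequences in the example matrix.
DM = {
    frozenset(("EGL-2", "dEag")): 0.276,
    frozenset(("EGL-2", "rEag")): 0.353,
    frozenset(("dEag", "rEag")): 0.305,
}


def tree_distances(branch):
    """Dt for a three-leaf tree: the path between two leaves is the sum
    of their two branch lengths to the single internal node."""
    return {frozenset((a, b)): branch[a] + branch[b]
            for a, b in combinations(branch, 2)}


def least_squares_score(dm, dt):
    """Sum of squared differences between matrix and tree distances."""
    return sum((dm[pair] - dt[pair]) ** 2 for pair in dm)


def exact_three_leaf_fit(dm, a, b, c):
    """With three leaves the three branch lengths can match Dm exactly."""
    d_ab, d_ac, d_bc = (dm[frozenset(p)] for p in ((a, b), (a, c), (b, c)))
    return {a: (d_ab + d_ac - d_bc) / 2,
            b: (d_ab + d_bc - d_ac) / 2,
            c: (d_ac + d_bc - d_ab) / 2}


# Hypothetical first guess: all three branches the same length.
guess = {"EGL-2": 0.15, "dEag": 0.15, "rEag": 0.15}
fit = exact_three_leaf_fit(DM, "EGL-2", "dEag", "rEag")

print("score of guess:", least_squares_score(DM, tree_distances(guess)))
print("score of fit:  ", least_squares_score(DM, tree_distances(fit)))
```

With three leaves the fit is exact, so the score drops to (numerically) zero. With four or more leaves there are more pairwise distances than branches, so in general no tree matches every distance and the least-squares score stays above zero.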
26Least-squares solution
- There are methods that directly solve the least-squares problem
- However, they are computationally slow and rarely used (maximum-likelihood better if feasible).
- Fortunately, there are more direct approximations that work remarkably well, most notably Neighbor-Joining.
- The direct methods use various types of sequential clustering, of which the simplest is UPGMA (a horrid acronym that stands for something I can never remember).
27Sequential clustering approach (UPGMA)
repeat until all clustered
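The slide's diagram is not in the transcript, so the following is only a sketch of the usual UPGMA loop, assuming average linkage: find the two closest clusters, merge them, take the distance from the merged cluster to any other as the average over all leaf pairs, and repeat until all are clustered. It runs on the example distance matrix from the earlier slide and tracks only the topology, not branch lengths. The function names are made up for illustration.

```python
from itertools import combinations

NAMES = ["EGL-2", "dEag", "rEag", "UNC-103", "dErg", "HERG"]
D = [
    [0.000, 0.276, 0.353, 0.542, 0.547, 0.525],
    [0.276, 0.000, 0.305, 0.512, 0.501, 0.508],
    [0.353, 0.305, 0.000, 0.533, 0.515, 0.510],
    [0.542, 0.512, 0.533, 0.000, 0.274, 0.255],
    [0.547, 0.501, 0.515, 0.274, 0.000, 0.263],
    [0.525, 0.508, 0.510, 0.255, 0.263, 0.000],
]


def average_distance(d, members_a, members_b):
    """Mean pairwise distance between the leaves of two clusters."""
    total = sum(d[i][j] for i in members_a for j in members_b)
    return total / (len(members_a) * len(members_b))


def upgma(names, d):
    """Return (nested-parenthesis topology, list of merges in order)."""
    # Each cluster is (tuple of leaf indices, text label).
    clusters = [((i,), names[i]) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(d, clusters[p[0]][0], clusters[p[1]][0]),
        )
        (members_a, label_a), (members_b, label_b) = clusters[a], clusters[b]
        dist = average_distance(d, members_a, members_b)
        merges.append((label_a, label_b, dist))
        merged = (members_a + members_b, "(%s,%s)" % (label_a, label_b))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][1], merges


topology, merges = upgma(NAMES, D)
for left, right, dist in merges:
    print("join %s + %s at distance %.4f" % (left, right, dist))
print(topology)
```

The first join is simply the smallest entry in the matrix (UNC-103 and HERG at 0.255), which is the behavior that Neighbor-Joining modifies.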
28Neighbor-Joining Algorithm (NJ)
Essentially as on previous slide, but a correction
for distance to other leaves is made.
Specifically, for two leaves i and j, we denote
the set of all other leaves as L, and the size of
that set as |L|, and we compute a corrected
distance Dij from the raw distance between i and j
and the distances from i and j to the leaves in L.
here's an intuitive rationale (consider
clustering the first two leaves)
29Neighbor-Joining corrects for different rates of
evolution on branches
these two branches have changed faster
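The exact formula from the Neighbor-Joining slide is not in the transcript. As a stand-in, the sketch below uses the standard Neighbor-Joining criterion, Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k), which may be written differently from the slide's version. Dividing Q by n - 2 gives the raw distance minus the (scaled) total distance of i and of j to the other leaves, which is the kind of correction described above. The function names are hypothetical, and the data are the example matrix from the earlier slide.

```python
NAMES = ["EGL-2", "dEag", "rEag", "UNC-103", "dErg", "HERG"]
D = [
    [0.000, 0.276, 0.353, 0.542, 0.547, 0.525],
    [0.276, 0.000, 0.305, 0.512, 0.501, 0.508],
    [0.353, 0.305, 0.000, 0.533, 0.515, 0.510],
    [0.542, 0.512, 0.533, 0.000, 0.274, 0.255],
    [0.547, 0.501, 0.515, 0.274, 0.000, 0.263],
    [0.525, 0.508, 0.510, 0.255, 0.263, 0.000],
]


def nj_q_matrix(d):
    """Standard NJ criterion for every pair (i, j) with i < j."""
    n = len(d)
    totals = [sum(row) for row in d]  # total distance from each leaf
    return {
        (i, j): (n - 2) * d[i][j] - totals[i] - totals[j]
        for i in range(n)
        for j in range(i + 1, n)
    }


def first_nj_pair(names, d):
    """Pair that Neighbor-Joining would cluster first (smallest Q)."""
    q = nj_q_matrix(d)
    i, j = min(q, key=q.get)
    return names[i], names[j], q[(i, j)]


def closest_raw_pair(names, d):
    """Pair with the smallest uncorrected distance, for comparison."""
    n = len(d)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    i, j = min(pairs, key=lambda p: d[p[0]][p[1]])
    return names[i], names[j]


print("raw closest pair:", closest_raw_pair(NAMES, D))
print("NJ first pair:   ", first_nj_pair(NAMES, D))
```

On this matrix the raw closest pair is UNC-103/HERG (0.255), but the corrected criterion picks EGL-2/dEag first, because those two leaves are on average farther from everything else.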
30Parsimony method
- intuitively appealing: find the tree that can explain the observed sequences with the smallest number of changes.
- BUT requires enumeration of tree topologies, so not feasible with very large data sets (also inferior to ML methods, which are not much more computationally intensive).
1 AAG
2 AAA
3 GGA
4 AGA
Maximum-likelihood method (next class) is similar
except that it accounts for rates of change and
sums over all possible histories.
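The four example sequences are small enough to score every topology. The sketch below counts the minimum number of changes per site with Fitch's algorithm, a standard way to do this that the slide does not name, and tries the three possible groupings of four leaves. Each tree is written as a rooted nested tuple for convenience, since the Fitch count does not depend on where the root is placed.

```python
SEQS = {"1": "AAG", "2": "AAA", "3": "GGA", "4": "AGA"}

# The three unrooted topologies for four leaves, written as nested tuples.
TOPOLOGIES = [
    (("1", "2"), ("3", "4")),
    (("1", "3"), ("2", "4")),
    (("1", "4"), ("2", "3")),
]


def fitch(tree, site):
    """Return (possible states at this node, changes below it) for one site."""
    if isinstance(tree, str):  # a leaf: its observed base, no changes
        return {SEQS[tree][site]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, site)
    right_states, right_changes = fitch(right, site)
    common = left_states & right_states
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # Children disagree: one change is required on this part of the tree.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree):
    """Total minimum number of changes summed over all alignment columns."""
    n_sites = len(next(iter(SEQS.values())))
    return sum(fitch(tree, site)[1] for site in range(n_sites))


for tree in TOPOLOGIES:
    print(tree, parsimony_score(tree))
```

Only the middle site (A, A, G, G) distinguishes the topologies: grouping 1 with 2 and 3 with 4 explains it with one change, while the other two groupings need two.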
31Assignment for Thursday
1) get the set of assigned sequences (course website) and load them with Bonsai.
2) build a pairwise tree with them, then recompute the tree using uncorrected identities (because it is more intuitive than scores and corrections for this assignment).
3) find the pairwise distance matrix values that underlie the tree and describe what you think they mean (it is simple).
4) use these distance matrix values to hand-compute the first cluster by the N-J algorithm (you don't have to give exact numbers, just indicate what leaves are clustered and how the correction is made in principle).
5) challenge: explain qualitatively what changes about the tree when you toggle between corrected identities and uncorrected identities.
Send in an image file of the tree, the tab-delimited text version of the pairwise distance values, the hand computation, and your explanation in part 5.