Title: Phylogenetic trees:
1 Phylogenetic trees: What to look for and where?
Lessons from Statistical Physics
Elchanan Mossel, U.C. Berkeley and Microsoft Research
mossel_at_stat.berkeley.edu, www.stat.berkeley.edu/mossel/
2 Statistical physics
- Statistical physics is a sub-field of mathematical physics studying complex systems with simple microscopic interactions.
- The Ising model on a graph $G = (V,E)$ is a probability measure (Gibbs distribution) on the space of configurations $\sigma : V \to \{-1,1\}$ such that $P[\sigma]$ is given by
- $\exp\big(\sum_{(v,w) \in E} \sigma(v)\sigma(w)/T\big)/Z = \exp\big(\beta \sum_{(v,w) \in E} \sigma(v)\sigma(w)\big)/Z$
- Or, $\mathrm{Weight}(\sigma) = \exp\big(\beta \sum_{u \sim v} \sigma(u)\sigma(v)\big)$
- Traditionally studied on cubes in $\mathbb{Z}^d$.
[Figure: The Ising model on a 200 x 200 grid]
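The Gibbs weight above can be made concrete on a very small graph. The following is a minimal sketch, assuming a brute-force enumeration of all configurations (feasible only for a handful of vertices); the function names and the 4-cycle example are illustrative choices, not part of the talk.

```python
import itertools
import math

def ising_weight(sigma, edges, beta):
    """Unnormalized Gibbs weight exp(beta * sum over edges of sigma(v) * sigma(w))."""
    return math.exp(beta * sum(sigma[v] * sigma[w] for v, w in edges))

def ising_distribution(vertices, edges, beta):
    """Brute-force Gibbs distribution over all 2^|V| configurations (tiny graphs only)."""
    configs = [dict(zip(vertices, spins))
               for spins in itertools.product([-1, 1], repeat=len(vertices))]
    weights = [ising_weight(s, edges, beta) for s in configs]
    Z = sum(weights)  # partition function
    return [(s, w / Z) for s, w in zip(configs, weights)]

# Hypothetical example: a 4-cycle at beta = 0.5.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
dist = ising_distribution(vertices, edges, beta=0.5)
most_likely = max(dist, key=lambda pair: pair[1])[0]
```

For positive beta the two constant configurations (all +1 and all -1) carry the largest weight, which is the "strong correlations" intuition in its simplest form.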
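3 Statistical physics - intuition
- The Ising model on the $n \times n$ grid is given by
- $\exp\big(\sum_{(v,w) \in E} \sigma(v)\sigma(w)/T\big)/Z = \exp\big(\beta \sum_{(v,w) \in E} \sigma(v)\sigma(w)\big)/Z$
- We expect that
- $T$ small, $\beta$ large $\Rightarrow$ strong correlations
- $\mathrm{Corr}(\sigma_{\mathrm{boundary}}, \sigma_0) > \epsilon > 0$ for all $n$.
- $T$ large, $\beta$ small $\Rightarrow$ weak correlations
- $\mathrm{Corr}(\sigma_{\mathrm{boundary}}, \sigma_0) \to 0$ as $n \to \infty$.
[Figure: a box of side $2n$ with the origin 0 at its center and the boundary marked]
- Onsager (1944) proved it where
- Critical $\beta = \beta_c = \ln(1 + 2^{1/2})/2$
- For most other graphs, we know very little
[Figure: The Ising model on a 200 x 200 grid; the caption compares $\beta$ with $\beta_c$]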
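As a quick numerical check of the critical value quoted from Onsager, the sketch below evaluates $\ln(1 + \sqrt{2})/2$ and verifies the equivalent characterization $\sinh(2\beta_c) = 1$. The variable names are illustrative.

```python
import math

# Critical inverse temperature for the Ising model on the square grid.
beta_c = math.log(1 + math.sqrt(2)) / 2

# Equivalent characterization of the same constant: sinh(2 * beta_c) = 1.
check = math.sinh(2 * beta_c)
```

Numerically, `beta_c` is about 0.4407.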
4 Statistical physics on trees
- The Ising model on a tree $T = (V,E)$ is given by
- $\exp\big(\sum_{(v,w) \in E} \beta(v,w)\,\sigma(v)\,\sigma(w)\big)/Z$
- It is equivalent to the following model:
- Let $r$ be a root (chosen arbitrarily).
- Let $\sigma(r) = \pm 1$ with probability ½ and for
- Each edge $(u,v)$ directed away from the root, let
- $\sigma(v) = \sigma(u)$ with probability $\theta(u,v)$.
- $\sigma(v)$ is independent $\pm 1$ otherwise.
- $\theta(u,v) = \big(e^{\beta(u,v)} - e^{-\beta(u,v)}\big) / \big(e^{\beta(u,v)} + e^{-\beta(u,v)}\big)$
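The root-to-leaves description above translates directly into a sampler. This is a minimal sketch, assuming the tree is given as a hypothetical `children` dictionary and the edge parameters as a dictionary keyed by directed edge; the small example tree and the value beta = 0.8 are made up for illustration.

```python
import math
import random

def theta_from_beta(beta):
    """theta = (e^beta - e^-beta) / (e^beta + e^-beta), i.e. tanh(beta)."""
    return math.tanh(beta)

def broadcast(children, root, theta, rng):
    """Sample spins on a rooted tree.

    children: dict mapping a vertex to the list of its children.
    theta: dict mapping each directed edge (u, v) to its copy probability.
    """
    sigma = {root: rng.choice([-1, 1])}       # uniform root spin
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            if rng.random() < theta[(u, v)]:
                sigma[v] = sigma[u]            # copy the parent
            else:
                sigma[v] = rng.choice([-1, 1]) # fresh independent spin
            stack.append(v)
    return sigma

# Hypothetical example tree: r -> a, b and a -> c, d, with beta = 0.8 on every edge.
children = {"r": ["a", "b"], "a": ["c", "d"]}
theta = {(u, v): theta_from_beta(0.8) for u in children for v in children[u]}
sample = broadcast(children, "r", theta, random.Random(0))
```

With this parametrization the correlation across a single edge is exactly theta(u,v), which is why the two descriptions of the model agree.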
5 Ising Model on binary Trees
[Figure: three regimes labeled "low", "interm." and "high"; panels marked "bias" or "no bias", two of them for a "typical boundary"]
- Unique Gibbs measure: $\forall e$, $2\theta(e) \le 1$
- Extremality: $\forall e$, $2\theta(e) > 1$ and $\forall e$, $2\theta^2(e) \le 1$
- Non-extremality: $\forall e$, $2\theta(e)^2 > 1$
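The three regimes can be summarized as a small classifier. This is a sketch under two assumptions: every edge has the same parameter theta, and the boundary cases are inclusive (the thresholds sit at theta = 1/2 and theta = 1/sqrt(2)). The function name is illustrative.

```python
def regime(theta):
    """Classify a constant edge parameter theta on the binary tree."""
    if 2 * theta <= 1:
        return "unique Gibbs measure"
    if 2 * theta ** 2 <= 1:
        return "extremality"
    return "non-extremality"

examples = {t: regime(t) for t in (0.3, 0.6, 0.8)}
```

Here 0.3 falls in the uniqueness regime, 0.6 in the intermediate (extremal) regime, and 0.8 in the non-extremal regime.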
6 Statistical physics on trees: History
- Uniqueness studied by Bethe (1930s).
- Extremality phase more recently: Spitzer 75, Higuchi 77, Bleher-Ruiz-Zagrebnov 95, Evans-Kenyon-Peres-Schulman 2000, Ioffe 99, M 98, Haggstrom-M 2000, Kenyon-M-Peres 2001, Martinelli-Sinclair-Weitz 2003, Martin 2003
- Many problems are still open.
- Extremality has rich connections with
- Noisy computation/communication
- von Neumann 53, Evans-Schulman 00
- Mixing of Markov chains: Berger-Kenyon-Mossel-Peres 01, Martinelli-Sinclair-Weitz 05
- Spin glasses and random SAT problems: Parisi, Mezard, Montanari; Mezard-Montanari 06
7 Phylogeny
- Phylogeny is the true evolutionary relationships between groups of living things
[Figure: family tree with Noah, Shem, Ham, Japheth, Cush, Mizraim, Kannan]
8 History of Phylogeny
- Intuitively: animal kingdom or plant kingdom.
- More scientifically: morphology, fossils, etc. (Darwin)
- But: Is a human more like a great ape or like a chimpanzee?
[Figure labels: "No brain, can't move"; "Stupid, walks"; "Stupid, swims"; "Stupid, flies"; "Too smart, barely moves"]
9 Molecular Phylogeny
- Molecular Phylogeny: Based on DNA, RNA or protein sequences of organisms.
- Mutation mechanisms:
- Substitutions
- Transpositions
- Insertions, Deletions, etc.
- Will only consider substitutions
- and assume sequences are aligned.
[Figure: family tree with a sequence at each node]
Noah: acctga
Shem: acctga, Ham: acctaa, Japheth: acctga
Put: acctga, Cush: acctga, Mizraim: agctga, Kannan: acctga
10 Simplifying assumptions: models
- Assumption: Letters of sequences (characters) evolve independently and identically.
- CFN model: The first stochastic model, invented by Cavender, Farris and Neyman (70s)
- Let $\sigma(r) = \pm 1$ with probability ½ and for
- Each edge $(u,v)$ directed away from the root, let
- $\sigma(v) = \sigma(u)$ with probability $\theta(u,v)$.
- $\sigma(v)$ is independent $\pm 1$ otherwise.
- This is exactly the Ising model on the evolutionary tree!
- Dictionary: the two spin values $\pm 1$ correspond to the purine group (A, G) and the pyrimidine group (C, T).
- Some results can be generalized to other models.
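To connect DNA sequences with the two-state model, each letter can be mapped to a spin by its chemical group. The sketch below uses the standard grouping (purines A and G, pyrimidines C and T); which group is sent to +1 is an arbitrary choice made here, and the function name is illustrative.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}

def to_spins(seq):
    """Map a DNA string to a list of +/-1 spins (purine -> +1, pyrimidine -> -1)."""
    spins = []
    for ch in seq.lower():
        if ch in PURINES:
            spins.append(+1)
        elif ch in PYRIMIDINES:
            spins.append(-1)
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return spins

reference = to_spins("acctga")
within_group = to_spins("acctaa")   # g -> a stays inside the purine group
across_group = to_spins("agctga")   # c -> g crosses between the groups
```

Under this encoding a substitution within a group is invisible, while a substitution between groups flips one spin.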
11 Simplifying assumptions: trees
- Assumption 1: Evolution is on a tree.
- Assumption 2: Trees are binary, i.e. all internal degrees are 3.
- Given a set of species (labeled vertices) X, an X-tree is a tree which has X as the set of leaves.
- Two X-trees T1 and T2 are identical if there's a graph isomorphism between T1 and T2 that is the identity map on X.
- Most results extend to trees all of whose internal degrees are at least 3.
[Figure: example X-trees on the leaves a, b, c, d, with internal vertices labeled u, v, w]
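One way to test whether two X-trees are identical, without searching for an isomorphism, is to compare the leaf bipartitions (splits) induced by their edges: for trees whose internal degrees are all at least 3, equal split sets mean identical X-trees. The following is a minimal sketch of that check; the edge-list representation, function names and quartet examples are assumptions made for illustration.

```python
def leaf_splits(edges, leaves):
    """Return the set of non-trivial leaf bipartitions induced by the edges of a tree."""
    leaves = frozenset(leaves)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    ref = min(leaves)  # canonical side of a split: the one NOT containing this leaf
    splits = set()
    for u, v in edges:
        # Collect the vertices reachable from v without crossing the edge (u, v).
        seen, stack = {v}, [v]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and not (x == v and y == u):
                    seen.add(y)
                    stack.append(y)
        side = leaves & seen
        if ref in side:
            side = leaves - side
        if 1 < len(side) < len(leaves) - 1:  # skip splits that isolate a single leaf
            splits.add(side)
    return splits

def same_x_tree(edges1, edges2, leaves):
    """Compare two X-trees by their split sets (internal labels are ignored)."""
    return leaf_splits(edges1, leaves) == leaf_splits(edges2, leaves)

X = ["a", "b", "c", "d"]
t1 = [("a", "u"), ("b", "u"), ("u", "w"), ("c", "w"), ("d", "w")]  # ab | cd
t2 = [("a", "u"), ("c", "u"), ("u", "w"), ("b", "w"), ("d", "w")]  # ac | bd
t3 = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]  # ab | cd, relabeled
```

Here `same_x_tree(t1, t3, X)` holds because only the internal labels differ, while `t1` and `t2` pair the leaves differently.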
12 The Phylogenetic Challenge
[Figure: time axis; an evolutionary model turns ancestral genetic sequences into contemporary genetic sequences; the tree itself is marked "??"]
- How to reconstruct the phylogenetic tree from genetic data of contemporary species?
13 Phylogeny
- Tree is unknown.
- Given sequences at the leaves of the tree.
- Want to reconstruct the tree (un-rooted).
- How hard is it as a function of
- $n$ = size of tree (number of leaves).
- $k$ = length of sequences.
14 Phylogeny
15 n and k
Length of sequence!
- Interested to know the number of characters $k$ needed to reconstruct the tree with $n$ leaves.
- Erdos-Steel-Szekely-Warnow 96:
- If $\epsilon < \theta(e) < 1 - \epsilon$ for all $e$,
- Tree can be recovered from
- Sequences of length $k = n^c$.
- In polynomial time.
- Question: How about shorter sequences?
- Previously, best lower bound on sequence length is $k = \Omega(\log n)$.
- However, in practice
- Sometimes hard to find long sequences.
- Short sequences often suffice.
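The role of the sequence length k can be seen in how pairwise information is estimated from the leaves. Under the CFN/Ising model the expected product of the spins at two vertices equals the product of theta(e) along the path between them, so minus the logarithm of the empirical correlation estimates an additive tree distance, and the estimate sharpens as k grows. The sketch below illustrates this kind of estimate; it is not claimed to be the algorithm of Erdos-Steel-Szekely-Warnow, and the function names are assumptions.

```python
import math

def estimate_correlation(seq_a, seq_b):
    """Empirical correlation of two aligned +/-1 sequences of equal length k."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(x * y for x, y in zip(seq_a, seq_b)) / len(seq_a)

def estimate_distance(seq_a, seq_b):
    """Additive distance estimate -log(correlation); infinite if correlation <= 0."""
    corr = estimate_correlation(seq_a, seq_b)
    return -math.log(corr) if corr > 0 else math.inf

# Hypothetical aligned spin sequences of length k = 8 that differ at one site.
a = [1, 1, 1, 1, -1, -1, -1, -1]
b = [1, 1, 1, -1, -1, -1, -1, -1]
corr_ab = estimate_correlation(a, b)
dist_ab = estimate_distance(a, b)
```

For far-apart leaves the true correlation is tiny, so many characters are needed before the empirical value is distinguishable from zero; this is one intuition for why long sequences help.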
16 Lesson 1: Phylogenetic lower bound for forgetful trees
- Th [M 2004, Trans AMS]:
- If $2\theta^2(e) < 1$ for all $e$ then we show
- A lower bound on sequence length of $k = n^c$, where
- $c > 0$ is a function of $\theta = \max_e \theta(e)$ and
- $c \to \infty$ as $\theta \to 0$.
- Th [M 2003, JCB]:
- Similar theorem for general mutation models if mutation rates are high.
- Proofs are easy.
17 Poly. lower bound for Phylogeny
[Figure: tree whose top part $X_T$ is unknown ("?") down to level $L$; the subtrees below are known; the bottom $q-L$ levels carry the $k$ characters]
- If for all $k$ characters we can couple the bottom $q-L$ levels, then $X$ is independent of the data.
- By forgetfulness of the tree, if $k < n^c$, $X$ is independent of the data with high probability.
- Similar idea can be used to test trees (M-Riesenfeld)
18 Lesson 2: Recent history is easy
- In the proof of the lower bound, the deep convergences were hard to reconstruct.
- Theorem [M 04]:
- If $\epsilon < \theta(e) < 1 - \epsilon$ for all $e$, then
- most of the tree can be
- reconstructed from
- sequences of length $k = O(\log n)$.
- "Most of the tree" = a forest $F$ such that the true tree is obtained from $F$ by adding $o(n)$ edges.
- Results were refined, with experiments, in Daskalakis-Hill-Jaffe-Mihaescu-Mossel-Rao
- Proof is not easy; based on Distorted Metrics.
19 Lesson 3: Species that remember their past can reconstruct their history.
- Thm [Daskalakis-Mossel-Roch, to appear STOC 06]:
- If $2\theta^2(e) > 1$ for all $e$ then
- The tree can be recovered with high probability from sequences of length
- $k = O(\log n)$.
- Solves M. Steel's favourite conjecture
- Builds on [M 2004, Trans AMS]
- Hard proof: mixes probability, algorithms, statistical physics.
20 Proof Sketch: Logarithmic reconstruction
- Two parts of the proof:
- I. Statistical / algorithmic.
- II. Probability / statistical physics.
- By the forest result we may recover a forest containing 90% of the edges of the tree from $O(\log n)$ samples.
- Doesn't use the condition $2\theta^2 > 1$.
21 Logarithmic Reconstruction
- II. Here we use the condition that $2\theta^2 > 1$ in order to estimate the characters at the inner nodes of the forest.
Like I.
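To illustrate why the condition 2θ² > 1 matters for estimating inner characters, the sketch below broadcasts a root spin down a complete binary tree with a constant theta and guesses the root by a plain majority vote over the leaves. Plain majority is used for illustration only; it is not claimed to be the estimator used in the proof, and the function names, depth and trial counts are assumptions.

```python
import random

def sample_leaves(depth, theta, rng):
    """Broadcast a uniform root spin down a complete binary tree; return (root, leaf spins)."""
    root = rng.choice([-1, 1])
    level = [root]
    for _ in range(depth):
        nxt = []
        for s in level:
            for _child in range(2):
                # Copy the parent with probability theta, else draw a fresh spin.
                nxt.append(s if rng.random() < theta else rng.choice([-1, 1]))
        level = nxt
    return root, level

def majority(spins, rng):
    """Sign of the sum of the spins, with ties broken at random."""
    total = sum(spins)
    if total == 0:
        return rng.choice([-1, 1])
    return 1 if total > 0 else -1

def success_rate(depth, theta, trials, seed=0):
    """Fraction of trials in which the leaf majority equals the root spin."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        root, leaves = sample_leaves(depth, theta, rng)
        hits += majority(leaves, rng) == root
    return hits / trials

remembering = success_rate(depth=6, theta=0.9, trials=2000)  # 2 * theta^2 > 1
forgetful = success_rate(depth=6, theta=0.3, trials=2000)    # 2 * theta^2 < 1
```

In the first case the leaves retain substantial information about the root; in the second the majority vote is close to a coin flip.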
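22 Ising Model on binary Trees
[Figure: the three regimes "low", "interm." and "high" with panels marked "bias" or "no bias" for a "typical boundary", now annotated with $k = \Omega(n^c)$, "Most of tree from $k = O(\log n)$" and $k = O(\log n)$]
- Unique Gibbs measure: $\forall e$, $2\theta(e) \le 1$
- Extremality: $\forall e$, $2\theta(e) > 1$ and $\forall e$, $2\theta^2(e) \le 1$
- Non-extremality: $\forall e$, $2\theta(e)^2 > 1$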
23 Many more challenges to come
- We know very little
- We don't understand methods used in practice:
- Maximum Likelihood (NP-hard on arbitrary data: Chor-Tuller 05, Roch 05)
- Markov Chain Monte Carlo (can be exponentially slow on mixtures: M-Vigoda 05).
- In what sense is Parsimony = Maximum Likelihood? (2 conjectures by Steel)
- Other mutation models: rates across sites, gene order, etc.
- All the problems on Gibbs measures on trees