
Title: Phylogenetic trees:

1
Phylogenetic trees: What to look for and where?
Lessons from Statistical Physics
Elchanan Mossel, U.C. Berkeley and Microsoft Research
mossel_at_stat.berkeley.edu, www.stat.berkeley.edu/mossel/
2
Statistical physics

Statistical physics is a sub-field of mathematical physics studying complex systems
with simple microscopic interactions.
The Ising model on a graph $G(V,E)$ is a probability measure (Gibbs distribution) on the
space of configurations $\sigma : V \to \{-1, 1\}$ such that $P[\sigma]$ is given by
$\exp\big(\sum_{(v,w) \in E} \sigma(v)\sigma(w)/T\big)/Z = \exp\big(\beta \sum_{(v,w) \in E} \sigma(v)\sigma(w)\big)/Z$
Or, $\mathrm{Weight}(\sigma) = \exp\big(\beta \sum_{u \sim v} \sigma(u)\sigma(v)\big)$.
Traditionally studied on cubes in $\mathbb{Z}^d$.
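As a minimal sketch of the definition above, the following Python enumerates every configuration of a small graph and normalizes the weights exp(beta * sum of sigma(v) sigma(w)) by Z. The function name `ising_distribution` and the 4-cycle example are illustrative assumptions, not part of the talk, and brute-force enumeration is only feasible for a handful of vertices.

```python
import itertools
import math


def ising_distribution(vertices, edges, beta):
    """Return {configuration tuple: probability} for the Ising model on a small graph."""
    weights = {}
    for spins in itertools.product((-1, 1), repeat=len(vertices)):
        sigma = dict(zip(vertices, spins))
        # Sum of sigma(v) * sigma(w) over all edges (v, w).
        interaction = sum(sigma[v] * sigma[w] for v, w in edges)
        weights[spins] = math.exp(beta * interaction)
    z = sum(weights.values())  # partition function Z
    return {spins: w / z for spins, w in weights.items()}


# Hypothetical example: a cycle on four vertices.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
dist = ising_distribution(vertices, edges, beta=0.5)
```

For beta > 0 the two fully aligned configurations receive the largest probability, which matches the intuition that neighboring spins prefer to agree.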

The Ising model on 200 x 200 grid
3
Statistical physics - intuition

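The Ising model on the $n \times n$ grid is given by
$\exp\big(\sum_{(v,w) \in E} \sigma(v)\sigma(w)/T\big)/Z = \exp\big(\beta \sum_{(v,w) \in E} \sigma(v)\sigma(w)\big)/Z$
We expect that:
T small, $\beta$ large $\Rightarrow$ strong correlations:
$\mathrm{Corr}(\sigma_{\text{boundary}}, \sigma_0) > \epsilon > 0$ for all n.
T large, $\beta$ small $\Rightarrow$ weak correlations:
$\mathrm{Corr}(\sigma_{\text{boundary}}, \sigma_0) \to 0$ as $n \to \infty$.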

[Figure: grid with labels "2n", "0", and "boundary".]

Onsager (1944) proved it, where the critical $\beta$ is
$\beta_c = \ln(1 + 2^{1/2})/2$.
For most other graphs, we know very little.
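For orientation, the critical value can be evaluated numerically. This small Python sketch assumes the convention beta = 1/T used in the formula for the grid model; the variable names are illustrative.

```python
import math

# Critical inverse temperature of the two-dimensional Ising model.
beta_c = math.log(1 + math.sqrt(2)) / 2   # about 0.4407
t_c = 1 / beta_c                          # about 2.2692 under beta = 1/T

# An equivalent characterization: sinh(2 * beta_c) equals 1.
check = math.sinh(2 * beta_c)
```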

The Ising model on 200 x 200 grid, $\beta = \beta_c$
4
Statistical physics on trees

The Ising model on a tree $T(V,E)$ is given by
$\exp\big(\sum_{(v,w) \in E} \beta(v,w)\,\sigma(v)\sigma(w)\big)/Z$
It is equivalent to the following model:
Let r be a root (chosen arbitrarily).
Let $\sigma(r) = \pm 1$ with probability ½ each, and for
each edge (u,v) directed away from the root, let
$\sigma(v) = \sigma(u)$ with probability $\theta(u,v)$;
$\sigma(v)$ is an independent uniform $\pm 1$ otherwise.
$\theta(u,v) = \big(e^{\beta(u,v)} - e^{-\beta(u,v)}\big) / \big(e^{\beta(u,v)} + e^{-\beta(u,v)}\big)$
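The root-to-leaves description above can be sketched in Python as follows. The names `theta_from_beta` and `broadcast`, the dictionary-based tree representation, and the example tree are assumptions made for illustration; the copy-or-resample rule is the one stated on the slide, and theta is tanh(beta).

```python
import math
import random


def theta_from_beta(beta):
    """theta = (e^beta - e^-beta) / (e^beta + e^-beta), i.e. tanh(beta)."""
    return math.tanh(beta)


def broadcast(children, root, theta, rng):
    """Sample spins on a rooted tree.

    children: dict mapping a node to the list of its children.
    theta:    dict mapping each directed edge (u, v) to its copy probability.
    """
    sigma = {root: rng.choice((-1, 1))}  # the root is uniform on {-1, +1}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, ()):
            if rng.random() < theta[(u, v)]:
                sigma[v] = sigma[u]               # copy the parent
            else:
                sigma[v] = rng.choice((-1, 1))    # independent uniform spin
            stack.append(v)
    return sigma


# Hypothetical example tree with the same beta on every edge.
children = {"r": ["a", "b"], "a": ["c", "d"]}
tree_edges = [("r", "a"), ("r", "b"), ("a", "c"), ("a", "d")]
theta = {edge: theta_from_beta(0.7) for edge in tree_edges}
sample = broadcast(children, "r", theta, random.Random(0))
```

Under this rule the expected product of the spins at the two ends of an edge equals theta for that edge, which is why theta is a natural edge parameter.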


5
Ising Model on binary Trees
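[Figure: binary trees at low, intermediate, and high temperature, each marked "bias" or "no bias", with the "typical boundary" case indicated.]
Unique Gibbs measure: $\forall e,\ 2\theta(e) \le 1$
Extremality: $\forall e,\ 2\theta(e) > 1$ and $\forall e,\ 2\theta^2(e) \le 1$
Non-extremality: $\forall e,\ 2\theta(e)^2 > 1$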
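The three regimes can be expressed as a small classifier. This Python sketch assumes, for simplicity, that every edge of the binary tree has the same theta; the function name `regime` is hypothetical.

```python
def regime(theta):
    """Classify a binary tree with a common edge parameter theta in [0, 1]."""
    if 2 * theta <= 1:
        return "unique Gibbs measure"
    if 2 * theta ** 2 <= 1:
        return "extremality"
    return "non-extremality"


labels = [regime(t) for t in (0.4, 0.6, 0.8)]
```

The two thresholds are theta = 1/2 and theta = 2^(-1/2), so the intermediate regime covers theta between 0.5 and roughly 0.707.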
6
Statistical physics on trees: History

Uniqueness studied by Bethe (1930s).
Extremality phase more recently: Spitzer 75, Higuchi 77, Bleher-Ruiz-Zagrebnov 95,
Evans-Kenyon-Peres-Schulman 2000, Ioffe 99, M 98, Haggstrom-M 2000,
Kenyon-M-Peres 2001, Martinelli-Sinclair-Weitz 2003, Martin 2003.
Many problems are still open.
Extremality has rich connections with:
Noisy computation/communication: von Neumann 53, Evans-Schulman 00.
Mixing of Markov chains: Berger-Kenyon-Mossel-Peres 01, Martinelli-Sinclair-Weitz 05.
Spin glasses and random SAT problems: Parisi, Mezard, Montanari; Mezard-Montanari 06.

7
Phylogeny

Phylogeny is the true evolutionary relationship
between groups of living things.

[Figure: family tree with Noah, Shem, Ham, Japheth, Cush, Mizraim, Kannan.]
8
History of Phylogeny

Intuitively: animal kingdom or plant kingdom.
More scientifically: morphology, fossils, etc. (Darwin).
But: is a human more like a great ape or like a chimpanzee?

[Figure labels: "No brain, can't move"; "Stupid, walks"; "Stupid, swims"; "Stupid, flies"; "Too smart, barely moves".]
9
Molecular Phylogeny

Molecular Phylogeny: Based on DNA, RNA or protein sequences of organisms.
Mutation mechanisms:
Substitutions
Transpositions
Insertions, Deletions, etc.
We will only consider substitutions and assume sequences are aligned.

[Figure: family tree of Noah, Shem, Ham, Japheth, Put, Cush, Mizraim, Kannan, each labeled with a 6-letter DNA sequence (acctga, acctaa, or agctga).]
10
Simplifying assumptions: models

Assumption: Letters of sequences (characters) evolve independently and identically.
CFN model: The first stochastic model, invented by Cavender, Farris and Neyman (70s).
Let $\sigma(r) = \pm 1$ with probability ½ each, and for
each edge (u,v) directed away from the root, let
$\sigma(v) = \sigma(u)$ with probability $\theta(u,v)$;
$\sigma(v)$ is an independent uniform $\pm 1$ otherwise.
This is exactly the Ising model on the evolutionary tree!
Dictionary: C, T = +1 (pyrimidine group); A, G = -1 (purine group).
Some results can be generalized to other models.
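To make the dictionary concrete, the following Python sketch encodes aligned DNA sequences as +1/-1 characters and computes the empirical correlation between two sequences. It uses the standard biochemical grouping (A and G are purines, C and T are pyrimidines); which group is mapped to +1 is an arbitrary choice, and the function names `encode` and `correlation` are assumptions for illustration.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def encode(seq):
    """Map a DNA string to a list of spins: purine -> -1, pyrimidine -> +1."""
    spins = []
    for ch in seq.lower():
        if ch in PURINES:
            spins.append(-1)
        elif ch in PYRIMIDINES:
            spins.append(+1)
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return spins


def correlation(seq_a, seq_b):
    """Average product of spins over the aligned positions of two sequences."""
    xa, xb = encode(seq_a), encode(seq_b)
    if len(xa) != len(xb):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x * y for x, y in zip(xa, xb)) / len(xa)


same_class = correlation("acctga", "acctaa")   # g -> a stays within the purines
one_flip = correlation("acctga", "agctga")     # c -> g changes the class at one site
```

Under the CFN model the expected product of the spins at two vertices is the product of theta(e) over the edges on the path between them, so such empirical correlations carry information about path lengths in the tree.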

11
Simplifying assumptions: trees

Assumption 1: Evolution is on a tree.
Assumption 2: Trees are binary, i.e., all internal degrees are 3.
Given a set of species (labeled vertices) X, an X-tree is a tree which has X as the set of
leaves.
Two X-trees $T_1$ and $T_2$ are identical if there is a graph isomorphism between $T_1$ and $T_2$ that is the
identity map on X.
Most results extend to trees all of whose internal degrees are at least 3.
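One way to test whether two X-trees are identical in the sense above is to compare the leaf bipartitions (splits) obtained by deleting each edge: for trees whose internal degrees are at least 3, equal split sets correspond to an isomorphism fixing the leaves. The Python sketch below follows that idea; the function names and the example quartet trees are hypothetical, and internal vertex names are arbitrary.

```python
def splits(edges, leaves):
    """Return the set of leaf bipartitions induced by removing each tree edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = frozenset(leaves)
    result = set()
    for u, v in edges:
        # Collect everything reachable from v without crossing the edge (u, v).
        seen, stack = {v}, [v]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and not (x == v and y == u):
                    seen.add(y)
                    stack.append(y)
        side = frozenset(seen & leaves)
        result.add(frozenset((side, leaves - side)))
    return result


def same_x_tree(edges1, edges2, leaves):
    """Compare two X-trees by their split sets (internal names are ignored)."""
    return splits(edges1, leaves) == splits(edges2, leaves)


# Hypothetical quartet trees on the leaves a, b, c, d.
t1 = [("a", "u"), ("b", "u"), ("u", "w"), ("c", "w"), ("d", "w")]  # ab | cd
t2 = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]  # ab | cd, renamed
t3 = [("a", "u"), ("c", "u"), ("u", "w"), ("b", "w"), ("d", "w")]  # ac | bd
```

Here `t1` and `t2` differ only in the names of their internal vertices, while `t3` pairs the leaves differently.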

[Figure: example X-trees on the leaves a, b, c, d, with internal vertices labeled u, v, w.]
12
The Phylogenetic Challenge
[Figure labels: "Time", "Genetic sequences", "Evolutionary model", "Contemporary genetic sequences", "??".]

How can we reconstruct the phylogenetic tree from the genetic
data of contemporary species?

13
Phylogeny

Tree is unknown.
Given: sequences at the leaves of the tree.
Want: to reconstruct the tree (un-rooted).
How hard is it as a function of
n = size of the tree (number of leaves),
k = length of sequences?

14
Phylogeny
15

n and k
Length of sequence!

We want to know the number of characters k needed to
reconstruct the tree with n leaves.

Erdős-Steel-Székely-Warnow 96:
If $\epsilon < \theta(e) < 1 - \epsilon$ for all e,
the tree can be recovered from
sequences of length $k = n^c$
in polynomial time.

Question: How about shorter sequences?
Previously, the best lower bound on sequence length
was $k = \Omega(\log n)$.
However, in practice:
Sometimes it is hard to find long sequences.
Short sequences often suffice.
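To give a flavor of how a tree can be read off from leaf data, here is a sketch of the classical four-point method on a single quartet of leaves. It is offered as an illustration under stated assumptions, not as the algorithm of the cited paper: correlations between leaves are turned into distances via d = -ln(correlation), which is additive along paths because correlations multiply along paths, and the pairing with the smallest sum of distances is chosen. The names `quartet_split` and `log_distance` and the example values are hypothetical.

```python
import math


def log_distance(corr):
    """Additive distance from a positive correlation: d = -ln(corr)."""
    return -math.log(corr)


def quartet_split(dist, a, b, c, d):
    """Return the pairing of {a, b, c, d} with the smallest sum of distances."""
    sums = {
        ((a, b), (c, d)): dist[a, b] + dist[c, d],
        ((a, c), (b, d)): dist[a, c] + dist[b, d],
        ((a, d), (b, c)): dist[a, d] + dist[b, c],
    }
    return min(sums, key=sums.get)


# Exact correlations for a quartet tree ab | cd with theta = 0.8 on all five edges:
# sibling leaves are two edges apart, all other pairs three edges apart.
theta = 0.8
corr = {
    ("a", "b"): theta ** 2, ("c", "d"): theta ** 2,
    ("a", "c"): theta ** 3, ("a", "d"): theta ** 3,
    ("b", "c"): theta ** 3, ("b", "d"): theta ** 3,
}
dist = {pair: log_distance(value) for pair, value in corr.items()}
split = quartet_split(dist, "a", "b", "c", "d")
```

With real data the correlations are estimated from k characters, so the choice is only reliable when the estimates are accurate enough; this is where the dependence of k on n enters.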

16

Lesson 1: Phylogenetic lower bound for forgetful trees

Th [M2004; Trans AMS]:
If $2\theta^2(e) < 1$ for all e, then we show
a lower bound on sequence length of $k = n^c$, where
$c > 0$ is a function of $\theta = \max_e \theta(e)$ and
$c \to \infty$ as $\theta \to 0$.
Th [M2003; JCB]:
Similar theorem for general mutation models if
mutation rates are high.
Proofs are easy.

17

Poly. lower bound for Phylogeny

Proof by coupling

[Figure: coupling diagram with labels $X_T$, "?", L, q - L, k, and "Known".]

If for all k characters we can couple the bottom q - L
levels, then X is independent of the data.
By forgetfulness of the tree, if $k < n^c$, X is
independent of the data with high probability.
A similar idea can be used to test trees
(M-Riesenfeld).

18

Lesson 2: Recent history is easy

In the proof of the lower bound, the deep
convergences were hard to reconstruct.
Theorem [M04]:
If $\epsilon < \theta(e) < 1 - \epsilon$ for all e, then
"most" of the tree can be
reconstructed from
sequences of length $k = O(\log n)$.
"Most of the tree" = a forest F such that the true
tree is obtained from F by adding o(n) edges.
Results were refined, with experiments, in
Daskalakis-Hill-Jaffe-Mihaescu-Mossel-Rao.
Proof is not easy; it is based on Distorted Metrics.

19

Lesson 3: Species that remember their past can
reconstruct their history.

Thm [Daskalakis-Mossel-Roch; to appear, STOC 06]:
If $2\theta^2(e) > 1$ for all e, then
the tree can be recovered with high probability
from sequences of length
$k = O(\log n)$.
Solves M. Steel's favourite conjecture.
Builds on [M2004; Trans AMS].
Hard proof: mixes probability, algorithms,
statistical physics.

20

Proof Sketch: Logarithmic reconstruction

Two parts of the proof:
I. Statistical / algorithmic.
II. Probability / statistical physics.
By the forest result we may recover a forest
containing 90% of the edges of the tree from
O(log n) samples.
This doesn't use the condition $2\theta^2 > 1$.

21

Logarithmic Reconstruction

II. Here we use the condition that $2\theta^2 > 1$ in
order to estimate the characters at the inner
nodes of the forest.
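The role of the condition can be illustrated with a small simulation. The sketch below broadcasts a spin down a complete binary tree and guesses the root by a plain majority vote over the leaves; this is a simple stand-in chosen for illustration, not the estimator used in the cited proof. The function names, the depth, the trial count, and the seed are all assumptions.

```python
import random


def simulate_leaves(depth, theta, rng):
    """Broadcast a uniform root spin down a complete binary tree; return (root, leaves)."""
    root = rng.choice((-1, 1))
    level = [root]
    for _ in range(depth):
        next_level = []
        for spin in level:
            for _ in range(2):  # two children per vertex
                if rng.random() < theta:
                    next_level.append(spin)                   # copy the parent
                else:
                    next_level.append(rng.choice((-1, 1)))    # independent uniform spin
        level = next_level
    return root, level


def majority(leaves, rng):
    """Sign of the sum of the leaf spins, with ties broken uniformly at random."""
    total = sum(leaves)
    if total == 0:
        return rng.choice((-1, 1))
    return 1 if total > 0 else -1


def success_rate(depth, theta, trials, seed=0):
    """Fraction of trials in which the majority vote equals the root spin."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        root, leaves = simulate_leaves(depth, theta, rng)
        hits += majority(leaves, rng) == root
    return hits / trials


high = success_rate(depth=8, theta=0.9, trials=2000)   # 2 * theta^2 = 1.62 > 1
low = success_rate(depth=8, theta=0.3, trials=2000)    # 2 * theta^2 = 0.18 < 1
```

In such experiments the guess is typically much better than chance when 2 * theta^2 > 1 and close to a coin flip when 2 * theta^2 < 1, in line with the regimes described in the talk.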

Like I.
22
Ising Model on binary Trees
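[Figure: binary trees at low, intermediate, and high temperature, each marked "bias" or "no bias", with the "typical boundary" case indicated, annotated with "$k = \Omega(n^c)$", "most of the tree from $k = O(\log n)$", and "$k = O(\log n)$".]
Unique Gibbs measure: $\forall e,\ 2\theta(e) \le 1$
Extremality: $\forall e,\ 2\theta(e) > 1$ and $\forall e,\ 2\theta^2(e) \le 1$
Non-extremality: $\forall e,\ 2\theta(e)^2 > 1$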
23

Many more challenges to come