From Branching Processes to Phylogeny: A History of Reproduction
Elchanan Mossel, U.C. Berkeley
2. Trees in genetics
- Darwin / today: the tree of species.
- Today: the tree of fathers.
- Today: the tree of mothers.
- Today: descendants for a few generations, if there are no crossovers (the graph has no cycles).
[Figure: a genealogical tree with Noah at the root; children Shem, Ham, Japheth; grandchildren Cush, Put, Mizraim, Canaan.]
3. Information on the tree
- Tree of species: characteristics. Won't discuss.
- Tree of mothers: mitochondria.
- Tree of fathers: Y chromosome.
- No crossover: full DNA.
[Figure: the tree of species annotated with characteristics: "No brain, can't move", "Stupid, flies", "Stupid, walks", "Stupid, swims", "Too smart, barely moves".]
[Figure: the Noah genealogy with a DNA sequence at each vertex (acctga at the root; acctaa, agctga, etc. below), illustrating mutations accumulating along edges.]
4. Genetic mutation models
- How do the DNA and other genetic data mutate along each edge of the tree?
- Actual mutation is complicated (for Y: indels, SNPs, microsatellites, minisatellites).
- A simplified model: split the sequence into subsequences; all subsequences have the same mutation rate.
- That's the standard model in theoretical phylogeny.
5. Formal model: finite state space
- A finite set A of information values.
- A tree T = (V, E) rooted at r.
- Each vertex v in V has information σ_v in A.
- Each edge e = (v, u), where v is the parent of u, has a mutation matrix M_e of size |A| × |A|.
- A sample of the process is (σ_v)_{v in T}.
- For each sample, we observe σ_v for v in B(T), where B(T) is the boundary of the tree, |B(T)| = n.
- We are given k independent samples σ^1_B(T), ..., σ^k_B(T).
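A minimal sketch of drawing one sample of this tree-indexed Markov process in Python. The tree encoding (a children dict) and the nested-dict representation of the matrices M_e are illustrative assumptions, not from the talk:

```python
import random

def sample_tree_process(root, children, M, A, rng=random):
    """Draw one sample (s_v) of the tree-indexed Markov chain: the root
    state is uniform on A; each child's state is drawn from the row of
    its edge's mutation matrix indexed by the parent's state."""
    s = {root: rng.choice(A)}
    stack = [root]
    while stack:
        v = stack.pop()
        for u in children.get(v, []):
            row = M[(v, u)][s[v]]  # row[b] = P(s_u = b | s_v)
            s[u] = rng.choices(A, weights=[row[b] for b in A])[0]
            stack.append(u)
    return s
```

Restricting the returned dict to the leaves gives one boundary sample σ_B(T); k independent calls give the k samples.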
6. Formal model: infinite state space
- An infinite set A, assuming no back mutations: homoplasy-free models.
- Defined on an un-rooted tree T = (V, E).
- Each edge e has a (non-mutation) parameter θ(e).
- Sampling: perform percolation, where edge e is open with probability θ(e).
- All vertices v in the same open cluster get the same color σ_v; different clusters get different colors.
- We are given k independent samples σ^1_B(T), ..., σ^k_B(T).
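A sketch of one sample of the homoplasy-free model; the union-find scaffolding and integer "colors" are my own illustrative choices:

```python
import itertools
import random

def sample_homoplasy_free(vertices, edges, theta, rng=random):
    """One sample of the homoplasy-free model: open each edge e
    independently with probability theta[e], then give every open
    cluster its own fresh color (a new integer)."""
    parent = {v: v for v in vertices}      # union-find forest
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    for (u, w) in edges:
        if rng.random() < theta[(u, w)]:   # edge is open: merge clusters
            parent[find(u)] = find(w)
    fresh = itertools.count()
    color = {}
    return {v: color.setdefault(find(v), next(fresh)) for v in vertices}
```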
7. Reconstructing the topology
- Given σ^1_B(T), ..., σ^k_B(T), we want to reconstruct the topology, i.e., the structure of the underlying tree on the set of labeled leaves.
- Formally, we want to find, for all u and v in B(T), their graph-metric distance d(u, v).
- Assume all internal degrees are at least 3.
- Assume B(T) consists only of leaves (vertices of degree 1).
[Figure: two candidate trees on the same leaves u, v, w, with a mutation matrix M_e on each edge.]
8. Summary: conjectures and results

Statistical physics           | Phylogeny
------------------------------|----------------------------------
Binary tree in ordered phase  | k = O(log n) (conjectured)
Binary tree unordered         | k = poly(n) (conjectured)
Percolation                   | Homoplasy-free; critical θ = 1/2
Ising model                   | CFN; critical 2θ² = 1
Sub-critical random cluster   | High mutation

Problems: How general is this? What is the critical point? (extremality vs. spectral)
9. Homoplasy-free models and percolation
- Th1 [M2002]: Suppose that n = 3·2^q and
- T is a uniformly chosen (q+1)-level 3-regular tree.
- For all e, θ(e) < θ, with θ < 1/2.
- Then in order to reconstruct the topology with probability ≥ 0.1, at least k = θ^{-q} = n^{O(-log θ)} samples are needed.
- Th2 [M2002]: Suppose that T has n leaves and
- for all e, 1/2 + ε < θ(e) < 1 - ε.
- Then k = O(log n - log δ) samples suffice to reconstruct the topology with probability 1 - δ.
10. The CFN model
- The Cavender-Farris-Neyman model.
- 2 data types: +1 and -1.
- Mutation along edge e: with probability θ(e), copy the data from the parent; otherwise, choose +1/-1 with probability 1/2, independently of everything else.
- Thm [CFN]: Suppose that for all e, 1 - ε > θ(e) > ε > 0.
- Then, given k samples of the process at the n leaves,
- it is possible to reconstruct the underlying topology with probability 1 - δ, if k = n^{O(-log ε)}.
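Since each edge copies the parent spin with probability θ(e) and otherwise rerandomizes, the correlation E[σ_u σ_v] equals the product of the θ(e) along the path between u and v; this is the fact the reconstruction proof relies on. A quick Monte Carlo check on a two-edge path (the parameters and seed are illustrative):

```python
import random

def cfn_path_sample(thetas, rng=random):
    """Spins along a path: root spin uniform in {+1,-1}; each step copies
    the previous spin w.p. theta, else draws a fresh uniform spin."""
    s = [rng.choice((-1, 1))]
    for t in thetas:
        s.append(s[-1] if rng.random() < t else rng.choice((-1, 1)))
    return s

rng = random.Random(0)
k, thetas = 200_000, (0.8, 0.9)
est = sum(s[0] * s[-1]
          for s in (cfn_path_sample(thetas, rng) for _ in range(k))) / k
# est should be close to the product 0.8 * 0.9 = 0.72
```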
11. Phase transition for the CFN model
- Th1 [M2002]: Suppose that n = 3·2^q and
- T is a uniformly chosen (q+1)-level 3-regular tree.
- For all e, θ(e) < θ, with 2θ² < 1.
- Then in order to reconstruct the topology with probability ≥ 0.1, at least k = 2^{-q} θ^{-2q} = n^{O(-log θ)} samples are needed.
- Th2 [M2002]: Suppose that T is a balanced tree on n leaves and
- for all e, θ_min < θ(e) < θ_max, with 2θ²_min > 1 and θ_max < 1.
- Then k = O(log n - log δ) samples suffice to reconstruct the topology with probability 1 - δ.
12. Metric spaces on trees
- Let D be a positive function on the edges E.
- Define D(u, v) = Σ_{e on path(u,v)} D(e).
- Claim: Given D(v, u) for all v and u in B(T), it is possible to reconstruct the topology of T.
- Proof: It suffices to find d(u, v) for all u, v in B(T), where d is the graph-metric distance.
- d(u1, u2) = 2 iff for all w1 and w2 it holds that
- D(u1, u2; w1, w2) = D(u1, w1) + D(u2, w2) - D(u1, u2) - D(w1, w2) is non-negative (four-point condition).
[Figure: the two possible pairings of the quartet u1, u2, w1, w2.]
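The four-point quantity can be evaluated directly from the pairwise distances. A sketch in Python; the quartet below, with unit edge lengths, is a made-up example:

```python
def dist(D, a, b):
    """Look up the symmetric leaf-to-leaf distance."""
    return D[frozenset((a, b))]

def four_point(D, u1, u2, w1, w2):
    """D(u1,u2;w1,w2) = D(u1,w1) + D(u2,w2) - D(u1,u2) - D(w1,w2).
    For an additive tree metric it is >= 0 for every w1, w2 exactly
    when the tree splits the quartet as (u1,u2 | w1,w2)."""
    return (dist(D, u1, w1) + dist(D, u2, w2)
            - dist(D, u1, u2) - dist(D, w1, w2))

# Example quartet: siblings u1,u2 and siblings w1,w2, all five edges of
# length 1, so the sibling pairs are at distance 2 and cross pairs at 3.
D = {frozenset(p): d for p, d in [(("u1", "u2"), 2), (("w1", "w2"), 2),
                                  (("u1", "w1"), 3), (("u1", "w2"), 3),
                                  (("u2", "w1"), 3), (("u2", "w2"), 3)]}
```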
13. Metric spaces on trees
- Similarly, d(u1, u2) ≤ 4 iff for all w1 and w2 s.t. d(ui, wj) ≥ 4 for i = 1, 2 and j = 1, 2, it holds that D(u1, u2; w1, w2) ≥ 0.
- Remark 1: If D(u1, u2; w1, w2) > 0, then D(u1, u2; w1, w2) ≥ 2 min_e D(e). Therefore, it suffices to know D with accuracy min_e D(e)/4.
- Remark 2: For a balanced tree, in order to reconstruct the underlying topology up to l levels from the boundary, it suffices to know D(u, v) for all u and v s.t. d(u, v) ≤ 2l + 2 (with accuracy min_e D(e)/4).
14. Proof of CFN theorem
- Define D(e) = -log θ(e).
- D(u, v) = -log(Cov(σ_v, σ_u)), where Cov(σ_v, σ_u) = E[σ_v σ_u].
- Estimate Cov(σ_v, σ_u) by the empirical correlation Cor(σ_v, σ_u) = (1/k) Σ_t σ_v^t σ_u^t.
- We need D with accuracy min_e D(e)/4, i.e.,
- Cor = (1 ± c(θ)) Cov.
- Cor(σ_v, σ_u) is an average of k i.i.d. ±1 variables with expected value Cov(σ_v, σ_u).
- Cov(σ_v, σ_u) may be as small as θ^{2 depth(T)} = n^{-O(-log θ)}.
- Given k = n^{O(-log θ)} samples, it is possible to estimate D and therefore reconstruct T with high probability.
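Spelled out, the two covariance facts behind the proof (a standard computation for the CFN model: along a single edge the child copies the parent with probability θ(e) and is otherwise a fresh uniform spin, so the edge correlation is θ(e), and by the Markov property this multiplies along paths):

```latex
\mathrm{Cov}(\sigma_u,\sigma_v)
  = \prod_{e \in \mathrm{path}(u,v)} \theta(e),
\qquad\text{hence}\qquad
D(u,v) = -\log \mathrm{Cov}(\sigma_u,\sigma_v)
       = \sum_{e \in \mathrm{path}(u,v)} -\log\theta(e)
       = \sum_{e \in \mathrm{path}(u,v)} D(e).
```

So D is an additive tree metric with edge weights D(e) = -log θ(e), and the four-point method of slide 12 applies to it.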
15. Extensions
- Steel 94: a trick extends this to general M_e, provided det(M_e) is bounded away from 0, 1, and -1 (i.e., det(M_e) avoids (-1, -1+ε), (-ε, ε), and (1-ε, 1)),
- using D(e) = -log |det(M_e)| (more or less).
- ESSW: If the tree has small depth, then k is smaller.
- For balanced trees: in order to reconstruct up to l levels from the boundary, it suffices to have k = Θ(log n).
- Proof: Cov(σ_v, σ_u) for u and v with d(u, v) ≤ 2l + 2 is at least θ^{2l+2}.
16. What's next?
- It's important to minimize the number of samples.
- Do we need k = n^{O(1)}, or
- do we need only k = polylog(n)?
- Since the number of trees on n leaves is exponential in n log n, and each sample consists of n bits, we need at least Ω(log n) samples.
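The counting can be made concrete: the number of unrooted binary topologies on n labeled leaves is (2n-5)!!, a standard phylogenetics fact, so its logarithm grows like n log n, and with n bits per sample at least order log n samples are needed. A quick check:

```python
import math

def num_topologies(n):
    """(2n-5)!! = 3 * 5 * ... * (2n-5): the number of unrooted binary
    tree topologies on n >= 3 labeled leaves."""
    r = 1
    for i in range(3, 2 * n - 4, 2):
        r *= i
    return r

def min_samples(n):
    """Information lower bound: k samples of n bits must encode the
    topology, so k >= log2(#topologies) / n, which grows like log2(n)."""
    return math.log2(num_topologies(n)) / n
```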
17. The Ising model on the binary tree
- The (free) Ising-Gibbs measure on the binary tree:
- Set s_r, the root spin, to be +/- with probability 1/2.
- For all (parent, child) pairs (v, w), set s_w = s_v with probability θ; otherwise s_w = +/- with probability 1/2.
- Different perspective: here the topology is known, and we look at a single sample.
18. The Ising model on the binary tree
- Studied in statistical physics: Spitzer 75, Higuchi 77, Bleher-Ruiz-Zagrebnov 95, Evans-Kenyon-Peres-Schulman 2000, M 98.
- Interesting phenomenon: a double phase transition (different from the Ising model on Z^d).
- When 2θ ≤ 1: unique Gibbs measure.
- When 2θ² ≤ 1: the free measure is extremal.
- In other words, the mutual information between the root and the boundary tends to 0 with the depth.
19. The Ising model on the binary tree
From BRZ or EKPS. Mutual information: I(s_r, s_∂) = H(s_∂) + H(s_r) - H(s_r, s_∂).

Temp | θ           | s_r given s_∂ ≡ 1 | Unique? | I(s_r, s_∂) | Free measure
high | < 1/2       | unbiased          | yes     | → 0         | extremal
med. | (1/2, 1/√2) | biased            | no      | → 0         | extremal
low  | > 1/√2      | biased            | no      | inf > 0     | non-extremal

Remark: the 2θ² = 1 phase transition is also the transition for the mixing time of Glauber dynamics for the Ising model on the tree (with Kenyon and Peres).
20. Lower bound on number of samples
- Th1 [M2002]: Suppose that n = 3·2^q and
- T is a uniformly chosen (q+1)-level 3-regular tree.
- For all e, θ(e) < θ, with 2θ² < 1.
- Then in order to reconstruct the topology with probability ≥ 0.1, at least k = 2^{-q} θ^{-2q} = n^{O(-log θ)} samples are needed.
- The proof of the lower bound uses information inequalities.
- Lemma 1 [EKPS]: For the single-sample process on the binary tree of q levels, I(s_r, s_∂) ≤ (2θ²)^q.
21. Lower bound on number of samples
- Lemma 2 [Fano's inequality]: X → Y.
- We want to reconstruct the random variable X given the random variable Y,
- where X takes m values.
- Let p_e be the error probability. Then
- 1 + p_e log2(m) ≥ H(X | Y).
- Lemma 3 [Data Processing Lemma]: X → Y → Z.
- If X and Z are independent given Y, then
- I(X, Z) ≤ min{ I(X, Y), I(Y, Z) }.
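Combining the two lemmas in the form the slides use (with X uniform on m values, so H(X) = log2 m):

```latex
H(X \mid Y) = H(X) - I(X;Y) = \log_2 m - I(X;Y),
\quad\text{so } 1 + p_e \log_2 m \ge H(X \mid Y) \text{ gives}\quad
p_e \;\ge\; 1 - \frac{I(X;Y) + 1}{\log_2 m}.
```

With data processing bounding I(X; Y), a small mutual information between the topology and the data forces the error probability close to 1.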
22. Lower bound on number of samples
- Assume that the topology of the bottom q - l levels is known, i.e., we know d(u, v) whenever d(u, v) ≤ 2(q - l).
- Then the conditional distribution of the topology T of the first l levels is uniform.
- Let s_l^t, for t = 1, ..., k, be the data at level l for sample t. Recall that we are given the data (s_∂^t).
- By Data Processing, I(T; (s_∂^t)) ≤ I(T; (s_l^t)).
[Figure: the tree split into the unknown top l levels (X), the level-l spins (Y), and the known bottom q - l levels with the boundary data (Z), repeated over the k samples.]
23. Lower bound on number of samples
- By independence of the k samples, I(T; Data) ≤ k · I(T; s_∂).
- After some manipulations and by the EKPS Lemma, I(T; s_∂) ≤ 2^l (2θ²)^{q-l}.
- So I(T; Data) ≤ k · 2^l (2θ²)^{q-l}.
- Let m be the number of possible topologies of the top l levels. Since T is chosen uniformly, H(T) = log2(m).
- By Fano's Lemma, the probability of error in reconstructing the topology, p_e, satisfies p_e ≥ 1 - (k · 2^l (2θ²)^{q-l} + 1) / log2(m).
- In order to get p_e < 1 - ε, we need k = Ω(2^{-l} (2θ²)^{-(q-l)} log2(m)).
24. Upper bound on number of samples
- Th2 [M2002]: Suppose that T is a balanced tree on n leaves and
- for all e, θ_min < θ(e) < θ_max, with 2θ²_min > 1 and θ_max < 1.
- Then k = O(log n - log δ) samples suffice to reconstruct the topology with probability 1 - δ.
- This proves a conjecture of M. Steel.
25. Algorithmic aspects of the phase transition
- Higuchi 77: For the Ising model, when 2θ² > 1, on all q-level binary trees: E[s_r Maj(s_∂)] > ε > 0, where Maj is the majority function (ε is independent of q).
- Looks good for phylogeny, because we can apply Maj even when we do not know the topology.
- But it doesn't work when θ is non-constant:
- all edges in the blue area have θ = θ_1;
- all edges in the black area have θ = θ_2;
- θ_1 < θ_2, and θ_2 is close to 1.
- Then Maj(s_∂) is very close to the majority of the black tree,
- the majority of the black tree is very close to s_v,
- but s_v and s_r are weakly correlated.
[Figure: a binary tree with root r, an internal vertex v, blue (low-θ) edges near the root and black (high-θ) edges below v.]
26. Algorithmic aspects of the phase transition
- Mossel 98: For the Ising model, when 2θ² > 1, on all q-level binary trees: E[s_r Rec-Maj_l(s_∂)] > ε > 0, where Rec-Maj_l is the recursive majority function of l levels (ε is independent of q; l depends on θ).
[Figure: Rec-Maj_l for l = 1.]
- Looks bad for phylogeny, as we need to know the tree topology.
- But the main lemma of Mossel 98 is extendable to non-constant θ.
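An illustrative sketch of the recursive-majority estimator. Ternary blocks are used here purely to avoid ties; in the talk the blocks come from the tree structure (2^l descendants per block), with any ties broken at random:

```python
def rec_maj(spins, arity=3):
    """Recursive majority on +1/-1 values: take the majority within
    consecutive blocks of size `arity`, then recurse until one value
    remains. len(spins) should be a power of `arity`."""
    if len(spins) == 1:
        return spins[0]
    blocks = [spins[i:i + arity] for i in range(0, len(spins), arity)]
    return rec_maj([1 if sum(b) > 0 else -1 for b in blocks], arity)
```

Unlike plain majority, the recursive version aggregates locally first, which is what makes the estimate robust level by level.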
27. Main Lemma [M2001]
- Lemma: Suppose that 2θ²_min > 1. Then there exist l and η > 0 such that the CFN model on the binary tree of l levels with
- θ(e) ≥ θ_min for all e not adjacent to ∂T, and
- θ(e) ≥ η θ_min for all e adjacent to ∂T,
- satisfies E[s_r Maj(s_∂)] ≥ η.
- Roughly: given data of quality ≥ η, we can reconstruct the root with quality ≥ η.
28. Reconstructing the topology [M2001]
- The algorithm: repeat the following.
- Reconstruct the topology up to l levels from the boundary using the four-point method.
- For each sample, reconstruct the data l levels from the boundary using the majority algorithm.
- Recall: reconstruction near the boundary takes O(log n) samples.
- By the main lemma, the quality stays above η.
- Remark: The same algorithm gives (almost) tight upper bounds also when 2θ²_min < 1.
29. Remarks and open problems
- For the CFN model, the algorithm is very nice:
- polynomial time;
- adaptive (we don't need to know θ_min and θ_max in advance);
- nearly optimal.
- Main problem: extending the main lemma to non-balanced trees and other mutation models (reconstructing the local topology still works).
- Secondary problem: extending the lower bounds to other models.
30. Proving the Main Lemma
- We need to estimate E[s_r Maj(s_∂)]. The estimate has two parts.
- Case 1: For all e adjacent to ∂T, θ(e) is small. Here we use a perturbation argument, i.e., we estimate the partial derivatives of E[s_r Maj(s_∂)] with respect to the various variables (using something like Russo's formula).
- Case 2: Some e adjacent to ∂T has large θ(e). Here we use percolation-theory arguments.
- Both cases use isoperimetric estimates for the discrete cube.