From Branching Processes to Phylogeny: A History of Reproduction
Elchanan Mossel, U.C. Berkeley
2. Trees in genetics
- Darwin / today: the tree of species.
- Today: the tree of fathers.
- Today: the tree of mothers.
- Today: descendants for a few generations, if there are no crossovers (the graph has no cycles).
[Figure: a genealogical tree with Noah at the root; children Shem, Ham, Japheth; grandchildren Cush, Put, Mizraim, Canaan.]
3. Information on the tree
- Tree of species: characteristics. Won't discuss.
- Tree of mothers: mitochondria.
- Tree of fathers: Y chromosome.
- No crossover: full DNA.
[Figure: the tree of species annotated with characteristics: "No brain, can't move", "Stupid, flies", "Stupid, walks", "Stupid, swims", "Too smart, barely moves".]
[Figure: the Noah genealogy with a DNA sequence at each vertex (acctga at the root; acctaa, agctga, etc. below), illustrating mutations accumulating along edges.]
4. Genetic mutation models
- How do the DNA and other genetic data mutate along each edge of the tree?
- Actual mutation is complicated (for Y: indels, SNPs, microsatellites, minisatellites).
- A simplified model: split the sequence into subsequences; all subsequences have the same mutation rate.
- That's the standard model in theoretical phylogeny.
5. Formal model: finite state space
- A finite set A of information values.
- A tree T = (V, E) rooted at r.
- Each vertex v in V has information σ_v in A.
- Each edge e = (v, u), where v is the parent of u, has a mutation matrix M_e of size |A| × |A|.
- A sample of the process is (σ_v)_{v in T}.
- For each sample, we observe σ_v for v in B(T), where B(T) is the boundary of the tree, |B(T)| = n.
- We are given k independent samples σ^1_B(T), ..., σ^k_B(T).
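A minimal sketch of drawing one sample of this tree-indexed Markov process in Python. The tree encoding (a children dict) and the nested-dict representation of the matrices M_e are illustrative assumptions, not from the talk:

```python
import random

def sample_tree_process(root, children, M, A, rng=random):
    """Draw one sample (s_v) of the tree-indexed Markov chain: the root
    state is uniform on A; each child's state is drawn from the row of
    its edge's mutation matrix indexed by the parent's state."""
    s = {root: rng.choice(A)}
    stack = [root]
    while stack:
        v = stack.pop()
        for u in children.get(v, []):
            row = M[(v, u)][s[v]]  # row[b] = P(s_u = b | s_v)
            s[u] = rng.choices(A, weights=[row[b] for b in A])[0]
            stack.append(u)
    return s
```

Restricting the returned dict to the leaves gives one boundary sample σ_B(T); k independent calls give the k samples.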
6. Formal model: infinite state space
- An infinite set A, assuming no back mutations: homoplasy-free models.
- Defined on an un-rooted tree T = (V, E).
- Each edge e has a (non-mutation) parameter θ(e).
- Sampling: perform percolation, where edge e is open with probability θ(e).
- All vertices v in the same open cluster get the same color σ_v; different clusters get different colors.
- We are given k independent samples σ^1_B(T), ..., σ^k_B(T).
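A sketch of one sample of the homoplasy-free model; the union-find scaffolding and integer "colors" are my own illustrative choices:

```python
import itertools
import random

def sample_homoplasy_free(vertices, edges, theta, rng=random):
    """One sample of the homoplasy-free model: open each edge e
    independently with probability theta[e], then give every open
    cluster its own fresh color (a new integer)."""
    parent = {v: v for v in vertices}      # union-find forest
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    for (u, w) in edges:
        if rng.random() < theta[(u, w)]:   # edge is open: merge clusters
            parent[find(u)] = find(w)
    fresh = itertools.count()
    color = {}
    return {v: color.setdefault(find(v), next(fresh)) for v in vertices}
```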
7. Reconstructing the topology
- Given σ^1_B(T), ..., σ^k_B(T), we want to reconstruct the topology, i.e., the structure of the underlying tree on the set of labeled leaves.
- Formally, we want to find, for all u and v in B(T), their graph-metric distance d(u, v).
- Assume all internal degrees are at least 3.
- Assume B(T) consists only of leaves (vertices of degree 1).
[Figure: two candidate trees on the same leaves u, v, w, with a mutation matrix M_e on each edge.]
8. Summary: conjectures and results

Statistical physics           | Phylogeny
------------------------------|----------------------------------
Binary tree in ordered phase  | k = O(log n) (conjectured)
Binary tree unordered         | k = poly(n) (conjectured)
Percolation                   | Homoplasy-free; critical θ = 1/2
Ising model                   | CFN; critical 2θ² = 1
Sub-critical random cluster   | High mutation

Problems: How general is this? What is the critical point? (extremality vs. spectral)
9. Homoplasy-free models and percolation
- Th1 [M2002]: Suppose that n = 3·2^q and
- T is a uniformly chosen (q+1)-level 3-regular tree.
- For all e, θ(e) < θ, with θ < 1/2.
- Then in order to reconstruct the topology with probability ≥ 0.1, at least k = θ^{-q} = n^{O(-log θ)} samples are needed.
- Th2 [M2002]: Suppose that T has n leaves and
- for all e, 1/2 + ε < θ(e) < 1 - ε.
- Then k = O(log n - log δ) samples suffice to reconstruct the topology with probability 1 - δ.
10. The CFN model
- The Cavender-Farris-Neyman model.
- 2 data types: +1 and -1.
- Mutation along edge e: with probability θ(e), copy the data from the parent; otherwise, choose +1/-1 with probability 1/2, independently of everything else.
- Thm [CFN]: Suppose that for all e, 1 - ε > θ(e) > ε > 0.
- Then, given k samples of the process at the n leaves,
- it is possible to reconstruct the underlying topology with probability 1 - δ, if k = n^{O(-log ε)}.
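Since each edge copies the parent spin with probability θ(e) and otherwise rerandomizes, the correlation E[σ_u σ_v] equals the product of the θ(e) along the path between u and v; this is the fact the reconstruction proof relies on. A quick Monte Carlo check on a two-edge path (the parameters and seed are illustrative):

```python
import random

def cfn_path_sample(thetas, rng=random):
    """Spins along a path: root spin uniform in {+1,-1}; each step copies
    the previous spin w.p. theta, else draws a fresh uniform spin."""
    s = [rng.choice((-1, 1))]
    for t in thetas:
        s.append(s[-1] if rng.random() < t else rng.choice((-1, 1)))
    return s

rng = random.Random(0)
k, thetas = 200_000, (0.8, 0.9)
est = sum(s[0] * s[-1]
          for s in (cfn_path_sample(thetas, rng) for _ in range(k))) / k
# est should be close to the product 0.8 * 0.9 = 0.72
```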
11. Phase transition for the CFN model
- Th1 [M2002]: Suppose that n = 3·2^q and
- T is a uniformly chosen (q+1)-level 3-regular tree.
- For all e, θ(e) < θ, with 2θ² < 1.
- Then in order to reconstruct the topology with probability ≥ 0.1, at least k = 2^{-q} θ^{-2q} = n^{O(-log θ)} samples are needed.
- Th2 [M2002]: Suppose that T is a balanced tree on n leaves and
- for all e, θ_min < θ(e) < θ_max, with 2θ²_min > 1 and θ_max < 1.
- Then k = O(log n - log δ) samples suffice to reconstruct the topology with probability 1 - δ.
12. Metric spaces on trees
- Let D be a positive function on the edges E.
- Define D(u, v) = Σ_{e on path(u,v)} D(e).
- Claim: Given D(v, u) for all v and u in B(T), it is possible to reconstruct the topology of T.
- Proof: It suffices to find d(u, v) for all u, v in B(T), where d is the graph-metric distance.
- d(u1, u2) = 2 iff for all w1 and w2 it holds that
- D(u1, u2; w1, w2) = D(u1, w1) + D(u2, w2) - D(u1, u2) - D(w1, w2) is non-negative (four-point condition).
[Figure: the two possible pairings of the quartet u1, u2, w1, w2.]
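The four-point quantity can be evaluated directly from the pairwise distances. A sketch in Python; the quartet below, with unit edge lengths, is a made-up example:

```python
def dist(D, a, b):
    """Look up the symmetric leaf-to-leaf distance."""
    return D[frozenset((a, b))]

def four_point(D, u1, u2, w1, w2):
    """D(u1,u2;w1,w2) = D(u1,w1) + D(u2,w2) - D(u1,u2) - D(w1,w2).
    For an additive tree metric it is >= 0 for every w1, w2 exactly
    when the tree splits the quartet as (u1,u2 | w1,w2)."""
    return (dist(D, u1, w1) + dist(D, u2, w2)
            - dist(D, u1, u2) - dist(D, w1, w2))

# Example quartet: siblings u1,u2 and siblings w1,w2, all five edges of
# length 1, so the sibling pairs are at distance 2 and cross pairs at 3.
D = {frozenset(p): d for p, d in [(("u1", "u2"), 2), (("w1", "w2"), 2),
                                  (("u1", "w1"), 3), (("u1", "w2"), 3),
                                  (("u2", "w1"), 3), (("u2", "w2"), 3)]}
```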
13. Metric spaces on trees
- Similarly, d(u1, u2) ≤ 4 iff for all w1 and w2 s.t. d(ui, wj) ≥ 4 for i = 1, 2 and j = 1, 2, it holds that D(u1, u2; w1, w2) ≥ 0.
- Remark 1: If D(u1, u2; w1, w2) > 0, then D(u1, u2; w1, w2) ≥ 2 min_e D(e). Therefore, it suffices to know D with accuracy min_e D(e)/4.
- Remark 2: For a balanced tree, in order to reconstruct the underlying topology up to l levels from the boundary, it suffices to know D(u, v) for all u and v s.t. d(u, v) ≤ 2l + 2 (with accuracy min_e D(e)/4).
14. Proof of CFN theorem
- Define D(e) = -log θ(e).
- D(u, v) = -log(Cov(σ_v, σ_u)), where Cov(σ_v, σ_u) = E[σ_v σ_u].
- Estimate Cov(σ_v, σ_u) by the empirical correlation Cor(σ_v, σ_u) = (1/k) Σ_t σ_v^t σ_u^t.
- We need D with accuracy min_e D(e)/4, i.e.,
- Cor = (1 ± c(θ)) Cov.
- Cor(σ_v, σ_u) is an average of k i.i.d. ±1 variables with expected value Cov(σ_v, σ_u).
- Cov(σ_v, σ_u) may be as small as θ^{2 depth(T)} = n^{-O(-log θ)}.
- Given k = n^{O(-log θ)} samples, it is possible to estimate D and therefore reconstruct T with high probability.
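Spelled out, the two covariance facts behind the proof (a standard computation for the CFN model: along a single edge the child copies the parent with probability θ(e) and is otherwise a fresh uniform spin, so the edge correlation is θ(e), and by the Markov property this multiplies along paths):

```latex
\mathrm{Cov}(\sigma_u,\sigma_v)
  = \prod_{e \in \mathrm{path}(u,v)} \theta(e),
\qquad\text{hence}\qquad
D(u,v) = -\log \mathrm{Cov}(\sigma_u,\sigma_v)
       = \sum_{e \in \mathrm{path}(u,v)} -\log\theta(e)
       = \sum_{e \in \mathrm{path}(u,v)} D(e).
```

So D is an additive tree metric with edge weights D(e) = -log θ(e), and the four-point method of slide 12 applies to it.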
15. Extensions
- Steel 94: a trick extends this to general M_e, provided det(M_e) is bounded away from 0, 1, and -1 (i.e., det(M_e) avoids (-1, -1+ε), (-ε, ε), and (1-ε, 1)),
- using D(e) = -log |det(M_e)| (more or less).
- ESSW: If the tree has small depth, then k is smaller.
- For balanced trees: in order to reconstruct up to l levels from the boundary, it suffices to have k = Θ(log n).
- Proof: Cov(σ_v, σ_u) for u and v with d(u, v) ≤ 2l + 2 is at least θ^{2l+2}.
16. What's next?
- It's important to minimize the number of samples.
- Do we need k = n^{O(1)}, or
- do we need only k = polylog(n)?
- Since the number of trees on n leaves is exponential in n log n, and each sample consists of n bits, we need at least Ω(log n) samples.
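The counting can be made concrete: the number of unrooted binary topologies on n labeled leaves is (2n-5)!!, a standard phylogenetics fact, so its logarithm grows like n log n, and with n bits per sample at least order log n samples are needed. A quick check:

```python
import math

def num_topologies(n):
    """(2n-5)!! = 3 * 5 * ... * (2n-5): the number of unrooted binary
    tree topologies on n >= 3 labeled leaves."""
    r = 1
    for i in range(3, 2 * n - 4, 2):
        r *= i
    return r

def min_samples(n):
    """Information lower bound: k samples of n bits must encode the
    topology, so k >= log2(#topologies) / n, which grows like log2(n)."""
    return math.log2(num_topologies(n)) / n
```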
17. The Ising model on the binary tree
- The (free) Ising-Gibbs measure on the binary tree:
- Set s_r, the root spin, to be +/- with probability 1/2.
- For all (parent, child) pairs (v, w), set s_w = s_v with probability θ; otherwise s_w = +/- with probability 1/2.
- Different perspective: here the topology is known, and we look at a single sample.
18. The Ising model on the binary tree
- Studied in statistical physics: Spitzer 75, Higuchi 77, Bleher-Ruiz-Zagrebnov 95, Evans-Kenyon-Peres-Schulman 2000, M 98.
- Interesting phenomenon: a double phase transition (different from the Ising model on Z^d).
- When 2θ ≤ 1: unique Gibbs measure.
- When 2θ² ≤ 1: the free measure is extremal.
- In other words, the mutual information between the root and the boundary tends to 0 with the depth.
19. The Ising model on the binary tree
From BRZ or EKPS. Mutual information: I(s_r, s_∂) = H(s_∂) + H(s_r) - H(s_r, s_∂).

Temp | θ           | s_r given s_∂ ≡ 1 | Unique? | I(s_r, s_∂) | Free measure
high | < 1/2       | unbiased          | yes     | → 0         | extremal
med. | (1/2, 1/√2) | biased            | no      | → 0         | extremal
low  | > 1/√2      | biased            | no      | inf > 0     | non-extremal

Remark: the 2θ² = 1 phase transition is also the transition for the mixing time of Glauber dynamics for the Ising model on the tree (with Kenyon and Peres).
20. Lower bound on number of samples
- Th1 [M2002]: Suppose that n = 3·2^q and
- T is a uniformly chosen (q+1)-level 3-regular tree.
- For all e, θ(e) < θ, with 2θ² < 1.
- Then in order to reconstruct the topology with probability ≥ 0.1, at least k = 2^{-q} θ^{-2q} = n^{O(-log θ)} samples are needed.
- The proof of the lower bound uses information inequalities.
- Lemma 1 [EKPS]: For the single-sample process on the binary tree of q levels, I(s_r, s_∂) ≤ (2θ²)^q.
21. Lower bound on number of samples
- Lemma 2 [Fano's inequality]: X → Y.
- We want to reconstruct the random variable X given the random variable Y,
- where X takes m values.
- Let p_e be the error probability. Then
- 1 + p_e log2(m) ≥ H(X | Y).
- Lemma 3 [Data Processing Lemma]: X → Y → Z.
- If X and Z are independent given Y, then
- I(X, Z) ≤ min{ I(X, Y), I(Y, Z) }.
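Combining the two lemmas in the form the slides use (with X uniform on m values, so H(X) = log2 m):

```latex
H(X \mid Y) = H(X) - I(X;Y) = \log_2 m - I(X;Y),
\quad\text{so } 1 + p_e \log_2 m \ge H(X \mid Y) \text{ gives}\quad
p_e \;\ge\; 1 - \frac{I(X;Y) + 1}{\log_2 m}.
```

With data processing bounding I(X; Y), a small mutual information between the topology and the data forces the error probability close to 1.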
22. Lower bound on number of samples
- Assume that the topology of the bottom q - l levels is known, i.e., we know d(u, v) whenever d(u, v) ≤ 2(q - l).
- Then the conditional distribution of the topology T of the first l levels is uniform.
- Let s_l^t, for t = 1, ..., k, be the data at level l for sample t. Recall that we are given the data (s_∂^t).
- By Data Processing, I(T; (s_∂^t)) ≤ I(T; (s_l^t)).
[Figure: the tree split into the unknown top l levels (X), the level-l spins (Y), and the known bottom q - l levels with the boundary data (Z), repeated over the k samples.]
23. Lower bound on number of samples
- By independence of the k samples, I(T; Data) ≤ k · I(T; s_∂).
- After some manipulations and by the EKPS Lemma, I(T; s_∂) ≤ 2^l (2θ²)^{q-l}.
- So I(T; Data) ≤ k · 2^l (2θ²)^{q-l}.
- Let m be the number of possible topologies of the top l levels. Since T is chosen uniformly, H(T) = log2(m).
- By Fano's Lemma, the probability of error in reconstructing the topology, p_e, satisfies p_e ≥ 1 - (k · 2^l (2θ²)^{q-l} + 1) / log2(m).
- In order to get p_e < 1 - ε, we need k = Ω(2^{-l} (2θ²)^{-(q-l)} log2(m)).
24. Upper bound on number of samples
- Th2 [M2002]: Suppose that T is a balanced tree on n leaves and
- for all e, θ_min < θ(e) < θ_max, with 2θ²_min > 1 and θ_max < 1.
- Then k = O(log n - log δ) samples suffice to reconstruct the topology with probability 1 - δ.
- This proves a conjecture of M. Steel.
25. Algorithmic aspects of the phase transition
- Higuchi 77: For the Ising model, when 2θ² > 1, on all q-level binary trees: E[s_r Maj(s_∂)] > ε > 0, where Maj is the majority function (ε is independent of q).
- Looks good for phylogeny, because we can apply Maj even when we do not know the topology.
- But it doesn't work when θ is non-constant:
- all edges in the blue area have θ = θ_1;
- all edges in the black area have θ = θ_2;
- θ_1 < θ_2, and θ_2 is close to 1.
- Then Maj(s_∂) is very close to the majority of the black tree,
- the majority of the black tree is very close to s_v,
- but s_v and s_r are weakly correlated.
[Figure: a binary tree with root r, an internal vertex v, blue (low-θ) edges near the root and black (high-θ) edges below v.]
26. Algorithmic aspects of the phase transition
- Mossel 98: For the Ising model, when 2θ² > 1, on all q-level binary trees: E[s_r Rec-Maj_l(s_∂)] > ε > 0, where Rec-Maj_l is the recursive majority function of l levels (ε is independent of q; l depends on θ).
[Figure: Rec-Maj_l for l = 1.]
- Looks bad for phylogeny, as we need to know the tree topology.
- But the main lemma of Mossel 98 is extendable to non-constant θ.
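An illustrative sketch of the recursive-majority estimator. Ternary blocks are used here purely to avoid ties; in the talk the blocks come from the tree structure (2^l descendants per block), with any ties broken at random:

```python
def rec_maj(spins, arity=3):
    """Recursive majority on +1/-1 values: take the majority within
    consecutive blocks of size `arity`, then recurse until one value
    remains. len(spins) should be a power of `arity`."""
    if len(spins) == 1:
        return spins[0]
    blocks = [spins[i:i + arity] for i in range(0, len(spins), arity)]
    return rec_maj([1 if sum(b) > 0 else -1 for b in blocks], arity)
```

Unlike plain majority, the recursive version aggregates locally first, which is what makes the estimate robust level by level.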
27. Main Lemma [M2001]
- Lemma: Suppose that 2θ²_min > 1. Then there exist l and η > 0 such that the CFN model on the binary tree of l levels with
- θ(e) ≥ θ_min for all e not adjacent to ∂T, and
- θ(e) ≥ η θ_min for all e adjacent to ∂T,
- satisfies E[s_r Maj(s_∂)] ≥ η.
- Roughly: given data of quality ≥ η, we can reconstruct the root with quality ≥ η.
28. Reconstructing the topology [M2001]
- The algorithm: repeat the following.
- Reconstruct the topology up to l levels from the boundary using the four-point method.
- For each sample, reconstruct the data l levels from the boundary using the majority algorithm.
- Recall: reconstruction near the boundary takes O(log n) samples.
- By the main lemma, the quality stays above η.
- Remark: The same algorithm gives (almost) tight upper bounds also when 2θ²_min < 1.
29. Remarks and open problems
- For the CFN model, the algorithm is very nice:
- polynomial time;
- adaptive (we don't need to know θ_min and θ_max in advance);
- nearly optimal.
- Main problem: extending the main lemma to non-balanced trees and other mutation models (reconstructing the local topology still works).
- Secondary problem: extending the lower bounds to other models.
30. Proving the Main Lemma
- We need to estimate E[s_r Maj(s_∂)]. The estimate has two parts.
- Case 1: For all e adjacent to ∂T, θ(e) is small. Here we use a perturbation argument, i.e., we estimate the partial derivatives of E[s_r Maj(s_∂)] with respect to the various variables (using something like Russo's formula).
- Case 2: Some e adjacent to ∂T has large θ(e). Here we use percolation-theory arguments.
- Both cases use isoperimetric estimates for the discrete cube.