1
From Branching Processes to Phylogeny: A history of reproduction
Elchanan Mossel, U.C. Berkeley
2
Trees in genetics
  • Darwin and today: the tree of species.
  • Today: the tree of fathers.
  • Today: the tree of mothers.
  • Today: descendants for a few generations, if there are no
    crossovers (so the graph has no cycles).

[Figure: a genealogical tree: Noah; his sons Shem, Ham, and Japheth;
and Ham's sons Cush, Put, Mizraim, and Canaan.]
3

Information on the tree
  • Tree of species: characteristics. Won't discuss.
  • Tree of mothers: mitochondrial DNA.
  • Tree of fathers: the Y chromosome.
  • No crossover: full DNA.

[Figure: a species tree with leaves labeled "no brain, can't move",
"stupid, flies", "stupid, walks", "stupid, swims", "too smart,
barely moves"; and the genealogical tree (Noah; Shem, Ham, Japheth;
Cush, Put, Mizraim, Canaan) with aligned sequences at the leaves,
mostly acctga, with mutated copies acctaa and agctga.]
4
Genetic mutation models
  • How do DNA and other genetic data mutate along each edge of
    the tree?
  • Actual mutation is complicated (for Y: indels, SNPs,
    microsatellites, minisatellites).
  • A simplified model: split the sequence into subsequences; all
    subsequences have the same mutation rate.
  • That's the standard model in theoretical phylogeny.

5
Formal model: finite state space
  • A finite set A of information values.
  • A tree T = (V, E) rooted at r.
  • Each vertex v in V has information σ_v in A.
  • Each edge e = (v, u), where v is the parent of u, has a
    mutation matrix M_e of size |A| by |A|.
  • A sample of the process is (σ_v)_{v in T}.
  • For each sample, we know σ_v for v in B(T),
  • where B(T) is the boundary of the tree, |B(T)| = n.
  • We are given k independent samples σ^1_{B(T)}, …, σ^k_{B(T)}
    (a sampling sketch follows below).
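
To make the model concrete, here is a minimal Python sketch (not from
the talk) of drawing one sample of the process; the alphabet, the
child-list tree encoding, and the root distribution are illustrative
assumptions.

    import random

    # Toy instance of the finite-state model: states A, a rooted tree as a
    # child-list dict, and a row-stochastic mutation matrix M_e per edge
    # e = (parent, child).
    A = ["a", "c", "g", "t"]

    def sample_process(children, root, M, root_dist):
        """Draw (sigma_v) for all v: the root state is drawn from root_dist,
        and each child u of v is drawn from the row of M[(v, u)] indexed by
        the parent's state sigma_v."""
        sigma = {root: random.choices(A, weights=root_dist)[0]}
        stack = [root]
        while stack:
            v = stack.pop()
            for u in children.get(v, []):
                row = M[(v, u)][A.index(sigma[v])]
                sigma[u] = random.choices(A, weights=row)[0]
                stack.append(u)
        return sigma

The k independent samples σ^1_{B(T)}, …, σ^k_{B(T)} are then k
independent calls, restricted to the leaf entries of sigma.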

6
Formal model: infinite state space
  • An infinite set A, assuming no back-mutations:
    homoplasy-free models.
  • Defined on an unrooted tree T = (V, E).
  • Each edge e has a (non-mutation) parameter λ(e).
  • Sample: perform percolation; edge e is open with
    probability λ(e).
  • All vertices v in the same open cluster have the same color
    σ_v. Different clusters get different colors.
  • We are given k independent samples σ^1_{B(T)}, …, σ^k_{B(T)}
    (a sampling sketch follows below).
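
A matching sketch for this model (illustrative; union-find is my
choice for tracking the open clusters): open each edge independently
with probability λ(e), then give each open cluster a fresh color.

    import itertools
    import random

    def sample_homoplasy_free(edges, lam, rng=random):
        """edges: list of vertex pairs (u, v) of the unrooted tree.
        lam: dict mapping each edge to its non-mutation parameter lambda(e).
        Returns sigma: vertex -> color of its open cluster; distinct
        clusters get distinct colors, so no color ever mutates back."""
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x
        for (u, v) in edges:
            find(u); find(v)                    # register both endpoints
            if rng.random() < lam[(u, v)]:      # edge open: merge clusters
                parent[find(u)] = find(v)
        fresh = itertools.count()
        colors = {}
        return {x: colors.setdefault(find(x), next(fresh)) for x in parent}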

7

Reconstructing the topology
  • Given σ^1_{B(T)}, …, σ^k_{B(T)}, we want to reconstruct the
    topology, i.e., the structure of the underlying tree on the
    set of labeled leaves.
  • Formally, we want to find, for all u and v in B(T), their
    graph-metric distance d(u, v).
  • Assume all internal degrees are at least 3.
  • Assume B(T) consists only of leaves (vertices of degree 1).

[Figure: two trees on leaves u, v, w, with a mutation matrix M_e on
each edge.]
8

Summary: conjectures and results

  Statistical physics           | Phylogeny
  ------------------------------+------------------------------------
  Binary tree in ordered phase  | k = O(log n)   (conjecture)
  Binary tree unordered         | k = poly(n)    (conjecture)
  Percolation                   | Homoplasy-free; critical λ = 1/2
  Ising model                   | CFN; critical 2θ² = 1
  Sub-critical random cluster   | High mutation

Problems: How general is this? What is the critical point
(extremality vs. spectral)?
9

Homoplasy-free models and percolation
  • Th1 [M 2002]: Suppose that n = 3 · 2^q and
  • T is a uniformly chosen (q + 1)-level 3-regular tree, and
  • for all e, λ(e) < λ, with λ < 1/2.
  • Then in order to reconstruct the topology with probability
    ≥ 0.1, at least k = Ω(λ^{−q}) = n^{Ω(−log λ)} samples are
    needed.
  • Th2 [M 2002]: Suppose that T has n leaves and
  • for all e, 1/2 + ε < λ(e) < 1 − ε.
  • Then k = O(log n + log(1/δ)) samples suffice to reconstruct
    the topology with probability 1 − δ.

10
The CFN model
  • The Cavender-Farris-Neyman model.
  • Two data types: +1 and −1.
  • Mutation along edge e: with probability θ(e), copy the data
    from the parent; otherwise, choose +1/−1 with probability ½
    each, independently of everything else.
  • Thm [CFN]: Suppose that for all e, 1 − ε > θ(e) > ε > 0.
  • Then, given k samples of the process at the n leaves,
  • it is possible to reconstruct the underlying topology with
    probability 1 − δ, if k = n^{O(−log ε)}.

11
Phase transition for the CFN model
  • Th1 [M 2002]: Suppose that n = 3 · 2^q and
  • T is a uniformly chosen (q + 1)-level 3-regular tree, and
  • for all e, θ(e) < θ, with 2θ² < 1.
  • Then in order to reconstruct the topology with probability
    > 0.1, at least k = Ω(2^{−q} θ^{−2q}) = n^{Ω(−log(2θ²))}
    samples are needed.
  • Th2 [M 2002]: Suppose that T is a balanced tree on n leaves
    and
  • for all e, θ_min < θ(e) < θ_max, with 2θ_min² > 1 and
    θ_max < 1.
  • Then k = O(log n + log(1/δ)) samples suffice to reconstruct
    the topology with probability 1 − δ.

12
Metric spaces on trees
  • Let D be a positive function on the edges E.
  • Define D(v, u) = Σ_{e on the path from v to u} D(e).
  • Claim: Given D(v, u) for all v and u in B(T), it is possible
    to reconstruct the topology of T.
  • Proof: It suffices to find d(u, v) for all u, v in B(T),
    where d is the graph-metric distance.
  • d(u1, u2) = 2 iff for all w1 and w2 it holds that
  • D(u1, u2; w1, w2) = D(u1, w1) + D(u2, w2) − D(u1, u2) −
    D(w1, w2) is non-negative (four-point condition; a test
    sketch follows after the figure).

[Figure: the two possible quartet topologies on the leaves u1, u2,
w1, w2.]
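
The claim turns directly into a test, sketched here (function names
are mine; D is assumed given as a symmetric dict-of-dicts of leaf
distances): u1 and u2 are at graph distance 2, i.e., form a cherry,
exactly when the four-point quantity is non-negative for every pair
of witnesses.

    def four_point(D, u1, u2, w1, w2):
        """D(u1,u2; w1,w2) = D(u1,w1) + D(u2,w2) - D(u1,u2) - D(w1,w2)."""
        return D[u1][w1] + D[u2][w2] - D[u1][u2] - D[w1][w2]

    def is_cherry(D, u1, u2, leaves):
        """d(u1, u2) = 2 iff D(u1,u2; w1,w2) >= 0 for all witnesses."""
        others = [w for w in leaves if w not in (u1, u2)]
        return all(four_point(D, u1, u2, w1, w2) >= 0
                   for w1 in others for w2 in others if w1 != w2)

By Remark 1 on the next slide, it suffices to know D up to accuracy
min_e D(e)/4; the test then compares against a threshold between 0
and 2 min_e D(e) instead of against 0.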
13
Metric spaces on trees
  • Similarly, d(u1, u2) = 4 iff for all w1 and w2 such that
    d(u_i, w_j) ≥ 4 for i = 1, 2 and j = 1, 2, it holds that
  • D(u1, u2; w1, w2) ≥ 0.
  • Remark 1: If D(u1, u2; w1, w2) > 0, then
  • D(u1, u2; w1, w2) ≥ 2 min_e D(e). Therefore, it suffices to
    know D with accuracy min_e D(e)/4.
  • Remark 2: For a balanced tree, in order to reconstruct the
    underlying topology up to l levels from the boundary, it
    suffices to know D(u, v) for all u and v s.t. d(u, v) ≤ 2l + 2
    (with accuracy min_e D(e)/4).

14
Proof of CFN theorem
  • Define D(e) = −log θ(e).
  • Then D(u, v) = −log Cov(σ_v, σ_u), where Cov(σ_v, σ_u) =
    E[σ_v σ_u].
  • Estimate Cov(σ_v, σ_u) by Cor(σ_v, σ_u), where
  • Cor(σ_v, σ_u) = (1/k) Σ_{t=1}^k σ_v^t σ_u^t.
  • Need D with accuracy m = min_e D(e)/4 = c_ε, i.e.,
  • Cor = (1 ± c_ε) Cov.
  • Cor(σ_v, σ_u) is an average of k i.i.d. ±1 variables with
    expected value Cov(σ_v, σ_u).
  • Cov(σ_v, σ_u) may be as small as ε^{2 depth(T)} = n^{−O(−log ε)}.
  • Given k = n^{O(−log ε)} samples, it is possible to estimate D,
    and therefore to reconstruct T, with high probability (an
    estimator sketch follows below).
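
A sketch of the resulting estimator (illustrative; returning infinity
for non-positive empirical correlations is my choice for handling
sampling noise):

    import math

    def estimate_distance(samples, u, v):
        """Empirical D(u, v) = -log Cor(sigma_u, sigma_v); each sample is a
        dict mapping the leaves to +/-1, and Cor is the average of
        sigma_u^t * sigma_v^t over the k samples."""
        cor = sum(s[u] * s[v] for s in samples) / len(samples)
        if cor <= 0:
            return math.inf   # correlation too weak for a finite estimate
        return -math.log(cor)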

15
Extensions
  • Steel 94: A trick extends this to general M_e, provided that
    det(M_e) ∉ [−1, −1 + ε] ∪ [−ε, ε] ∪ [1 − ε, 1],
  • using D(e) = −log |det(M_e)| (more or less; a sketch follows
    below).
  • ESSW: If the tree has small depth, then k is smaller.
  • For balanced trees: in order to reconstruct up to l levels
    from the boundary, it suffices to have k = Θ(log n).
  • Proof: Cov(σ_v, σ_u) for u and v with d(u, v) ≤ 2l + 2 is at
    least ε^{2l + 2}.
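
In this extension one can plug the empirical joint frequency matrix
of a leaf pair into a log-determinant; a bare-bones version (numpy
assumed; the marginal correction terms behind the "more or less"
are omitted):

    import numpy as np

    def logdet_distance(F):
        """F[i, j]: fraction of samples with state i at leaf u and state j
        at leaf v. The quantity is additive along paths because the
        determinant of a product of mutation matrices factorizes."""
        return -np.log(abs(np.linalg.det(F)))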

16
What's next?
  • It's important to minimize the number of samples.
  • Do we need k = n^{O(1)}, or does k = polylog(n) suffice?
  • Since the number of trees on n leaves is exponential in
    n log n, and each sample consists of n bits, we need at least
    Ω(log n) samples; the count is written out below.
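
Written out, using the standard count of unrooted binary trees on n
labeled leaves:

    \[
      \#\{\text{topologies}\} = (2n-5)!! = e^{\Theta(n \log n)},
      \qquad
      2^{kn} \ge \#\{\text{topologies}\}
      \;\Longrightarrow\;
      k = \Omega(\log n).
    \]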

17
The Ising model on the binary tree
  • The (free) Ising-Gibbs measure on the binary tree:
  • Set σ_r, the root spin, to be +/− with probability ½ each.
  • For each (parent, child) pair (v, w), set σ_w = σ_v with
    probability θ; otherwise, set σ_w = +/− with probability ½
    each.
  • A different perspective: the topology is known, and we look
    at a single sample.

[Figure: a binary tree with +/− spins at its vertices.]
18
The Ising model on the binary tree
  • Studied in statistical physics: Spitzer 75, Higuchi 77,
    Bleher-Ruiz-Zagrebnov 95, Evans-Kenyon-Peres-Schulman 2000,
    M 98.
  • Interesting phenomenon: a double phase transition (different
    from the Ising model on Z^d).
  • When 2θ ≤ 1, there is a unique Gibbs measure.
  • When 2θ² ≤ 1, the free measure is extremal.
  • In other words:

19
The Ising model on the binary tree
From BRZ or EKPS (mutual information
I(σ_r; σ_∂) = H(σ_r) + H(σ_∂) − H(σ_r, σ_∂)):

  Temp  | θ            | σ_r given σ_∂ ≡ +1 | Unique? | I(σ_r; σ_∂) | Free measure
  high  | < 1/2        | unbiased           | yes     | → 0         | extremal
  med.  | (1/2, 1/√2)  | biased             | no      | → 0         | extremal
  low   | > 1/√2       | biased             | no      | inf > 0     | non-extremal

Remark: the 2θ² = 1 phase transition is also the transition for the
mixing time of the Glauber dynamics for the Ising model on the tree
(with Kenyon and Peres).
20
Lower bound on number of samples
  • Th1 [M 2002]: Suppose that n = 3 · 2^q and
  • T is a uniformly chosen (q + 1)-level 3-regular tree, and
  • for all e, θ(e) < θ, with 2θ² < 1.
  • Then in order to reconstruct the topology with probability
    > 0.1, at least k = Ω(2^{−q} θ^{−2q}) = n^{Ω(−log(2θ²))}
    samples are needed.
  • The proof of the lower bound uses information inequalities.
  • Lemma 1 [EKPS]: For the single-sample process on the binary
    tree of q levels, I(σ_r; σ_∂) ≤ (2θ²)^q.

21
Lower bound on number of samples
  • Lemma 2 [Fano's inequality]: X → Y.
  • We want to reconstruct the random variable X given the
    random variable Y;
  • X takes m values.
  • Let p_e be the error probability. Then
  • 1 + p_e log₂(m) ≥ H(X | Y).
  • Lemma 3 [Data Processing Lemma]: X → Y → Z.
  • If X and Z are independent given Y, then
  • I(X; Z) ≤ min { I(X; Y), I(Y; Z) }.

22
Lower bound on number of samples
  • Assume that the topology of the bottom q − l levels is
    known, i.e., we know d(u, v) whenever d(u, v) ≤ 2(q − l).
  • Then the conditional distribution of the topology of the
    top l levels is uniform.
  • Let σ_l^t, for t = 1, …, k, be the data at level l for
    sample t. Recall that we are given the data (σ_∂^t).
  • By Data Processing, with X = the topology of the top l
    levels, Y = (σ_l^t) and Z = (σ_∂^t), we get
    I(X; Z) ≤ min { I(X; Y), I(Y; Z) }.

[Figure: the chain X → Y → Z, where X is the topology of the top l
levels, Y is the k samples of data at level l, Z is the k samples of
boundary data, and the bottom q − l levels are known.]
23
Lower bound on number of samples
  • By independence, I(Y; Z) ≤ Σ_{t=1}^k I(σ_l^t; σ_∂^t).
  • After some manipulations and by the EKPS Lemma,
    I(σ_l^t; σ_∂^t) ≤ 2^l (2θ²)^{q−l}.
  • So I(X; Z) ≤ k 2^l (2θ²)^{q−l}.
  • Let m be the number of possible topologies of the top l
    levels. Since T is chosen uniformly, H(X) = log₂ m.
  • Now H(X | Z) = H(X) − I(X; Z) ≥ log₂ m − k 2^l (2θ²)^{q−l}.
  • By Fano's Lemma, the probability of error in reconstructing
    the topology, p_e, satisfies 1 + p_e log₂ m ≥ H(X | Z).
  • In order to get p_e < 1 − δ, we need
    k = Ω((2θ²)^{−(q−l)}) = Ω(2^{−q} θ^{−2q}).


24
Upper bound on number of samples
  • Th2 [M 2002]: Suppose that T is a balanced tree on n leaves
    and
  • for all e, θ_min < θ(e) < θ_max, with 2θ_min² > 1 and
    θ_max < 1.
  • Then k = O(log n + log(1/δ)) samples suffice to reconstruct
    the topology with probability 1 − δ.
  • Proves a conjecture of M. Steel.


25
Algorithmic aspects of phase transition
  • Higuchi 77: For the Ising model, when 2θ² > 1, on every
    q-level binary tree E[σ_r · Maj(σ_∂)] > ε > 0, where Maj is
    the majority function (ε is independent of q).
  • Looks good for phylogeny, because we can apply Maj even when
    we do not know the topology.
  • But it doesn't work when θ is non-constant:
  • All edges in the blue area have θ_1.
  • All edges in the black area have θ_2.
  • θ_1 < θ_2, and θ_2 is close to 1.
  • Maj(σ_∂) is very close to Maj of the black tree.
  • Maj of the black tree is very close to σ_v.
  • σ_v and σ_r are weakly correlated.

[Figure: a tree rooted at r; v is the root of the black subtree, and
the rest of the tree is blue.]
26
Algorithmic aspects of phase transition
  • Mossel 98: For the Ising model, when 2θ² > 1, on every
    q-level binary tree E[σ_r · Rec-Maj_l(σ_∂)] > ε > 0, where
    Rec-Maj_l is the recursive majority function of l levels
    (ε is independent of q; l depends on θ).

[Figure: Rec-Maj_l for l = 1.]
  • Looks bad for phylogeny, as we need to know the tree
    topology (a sketch of Rec-Maj_l follows below).
  • But the main lemma of Mossel 98 is extendable to
    non-constant θ.
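
A minimal sketch of Rec-Maj_l on the leaf data of a balanced binary
tree (assuming the leaves are listed in tree order and their number
is a power of 2^l; random tie-breaking is my choice):

    import random

    def majority(block, rng):
        s = sum(block)
        if s == 0:
            return rng.choice([1, -1])   # break ties at random
        return 1 if s > 0 else -1

    def rec_maj(values, l, rng=random):
        """Rec-Maj_l: repeatedly collapse each block of 2**l consecutive
        +/-1 values (one l-level subtree) to its majority, until a single
        value, the root estimate, remains."""
        b = 2 ** l
        while len(values) > 1:
            values = [majority(values[i:i + b], rng)
                      for i in range(0, len(values), b)]
        return values[0]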

27
Main Lemma [M 2001]
  • Lemma: Suppose that 2θ_min² > 1. Then there exist l and
    α > 0 such that the CFN model on the binary tree of l levels
    with
  • θ(e) ≥ θ_min, for all e not adjacent to ∂T, and
  • θ(e) ≥ α · θ_min, for all e adjacent to ∂T,
  • satisfies E[σ_r · Maj(σ_∂)] ≥ α.
  • Roughly: given data of quality ≥ α, we can reconstruct the
    root with quality ≥ α (a simulation sketch follows below).
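
The quantity E[σ_r · Maj(σ_∂)] in the lemma is easy to probe by
simulation; a sketch for the constant-θ case (parameters and trial
count are arbitrary choices). For θ with 2θ² > 1, i.e., θ > 1/√2 ≈
0.707, the estimate stays bounded away from 0 as q grows; below the
threshold it decays.

    import random

    def cfn_sample(q, theta, rng):
        """One CFN sample on the q-level balanced binary tree: each child
        copies its parent with probability theta, else is uniform +/-1.
        Returns the root spin and the 2**q boundary spins."""
        root = rng.choice([1, -1])
        level = [root]
        for _ in range(q):
            level = [s if rng.random() < theta else rng.choice([1, -1])
                     for s in level for _ in range(2)]
        return root, level

    def maj_quality(q, theta, trials=20000, seed=0):
        """Monte Carlo estimate of E[sigma_r * Maj(sigma_boundary)]."""
        rng = random.Random(seed)
        total = 0
        for _ in range(trials):
            root, leaves = cfn_sample(q, theta, rng)
            s = sum(leaves)
            maj = rng.choice([1, -1]) if s == 0 else (1 if s > 0 else -1)
            total += root * maj
        return total / trials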

28
Reconstructing the topology [M 2001]
  • The algorithm: repeat the following two steps.
  • Reconstruct the topology up to l levels from the boundary,
    using the four-point method.
  • For each sample, reconstruct the data l levels from the
    boundary, using the majority algorithm.
  • Recall: reconstruction near the boundary takes O(log n)
    samples.
  • By the main lemma, the quality stays above α.
  • Remark: The same algorithm gives (almost) tight upper bounds
    also when 2θ_min² < 1.

29
Remarks and open problems
  • For the CFN model, the algorithm is very nice:
  • Polynomial time.
  • Adaptive (we don't need to know θ_min and θ_max in advance).
  • Nearly optimal.
  • Main problem: extending the main lemma to non-balanced trees
    and to other mutation models (reconstructing the local
    topology still works).
  • Secondary problem: extending the lower bounds to other
    models.

30
Proving the Main Lemma
  • We need to estimate E[σ_r · Maj(σ_∂)]. The estimate has two
    parts:
  • Case 1: For all e adjacent to ∂T, θ(e) is small. Here we use
    a perturbation argument, i.e., we estimate the partial
    derivatives of E[σ_r · Maj(σ_∂)] with respect to the various
    variables (using something like Russo's formula).
  • Case 2: Some e adjacent to ∂T has a large θ(e). Here we use
    percolation-theory arguments.
  • Both cases use isoperimetric estimates for the discrete cube.