Title: Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci
1. Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci (and a couple of unrelated observations)
- Elchanan Mossel, UC Berkeley
- Joint work with Sebastien Roch, Microsoft Research
- At the Newton Institute, Dec 07
2. Lecture Plan
- A simple observation about gene trees and population trees.
- A comment on optimal and absolutely converging tree reconstruction.
- A comment on generic models.
- A comment on network reconstruction.
- Disclaimer: last talk, so a bit philosophical (but would be happy to provide hard technical proofs).
3. Gene Trees and Population Trees
- Main goal in phylogenetics:
- Recovering species/population histories.
- Data: current genes.
- Issue: in recent populations, gene trees may differ from population trees.
- Model for evolution of trees in populations:
- Coalescence.
- Fixed-size population N.
- Each individual chooses a random parent in the previous generation.
- Branch length = (number of generations) / N.
- Main question: how to reconstruct population trees from gene trees?
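A minimal Python sketch of the coalescence model on this slide, assuming a haploid population of fixed size N in which each of two sampled lineages picks a uniformly random parent in the previous generation. The function names, and the convention of dividing the generation count by N to express time as a branch length, are illustrative assumptions rather than anything specified in the talk.

```python
import random


def generations_to_coalescence(pop_size, rng):
    """Trace two distinct individuals back in time until they share a parent."""
    a, b = 0, 1  # two distinct individuals in the current generation
    generations = 0
    while a != b:
        # Each lineage independently picks a uniformly random parent.
        a = rng.randrange(pop_size)
        b = rng.randrange(pop_size)
        generations += 1
    return generations


def mean_coalescence_time(pop_size, trials, seed=0):
    """Average coalescence time of two lineages, in units of pop_size generations."""
    rng = random.Random(seed)
    total = sum(generations_to_coalescence(pop_size, rng) for _ in range(trials))
    return total / trials / pop_size


if __name__ == "__main__":
    print(mean_coalescence_time(pop_size=50, trials=4000))
```

In each generation the two lineages merge with probability 1/N, so the waiting time is geometric with mean N generations, i.e. about 1 in the rescaled units. This randomness in merge times is what lets gene trees differ from the population tree when population branches are short.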
4. Gene Trees: The Engineering Approach
- Two common engineering approaches.
- Approach 1:
- Assume all genes come from a single tree.
- Kubatko-Degnan: inconsistent.
- Approach 2:
- Build a tree for each gene on its own.
- Take the majority tree.
- Degnan-Rosenberg: inconsistent.
- Q: What should be done instead?
5. Gene Trees: A Rigorous Approach
- M-Roch: a consistent estimator of the molecular distance $d(P_1,P_2)$ between two populations is
- $D(P_1,P_2) = \min_{g \in \mathrm{Genes}} d_g(P_1,P_2)$
- $\Rightarrow$ distances between populations are identifiable.
- $\Rightarrow$ the tree is identifiable.
- Under standard coalescence assumptions, get a good rate:
- $P(\text{topology error}) \le (\#\,\text{pops}) \exp(-c \cdot \#\,\text{genes})$
- c: shortest branch length.
- The estimator can be plugged into any distance-based method for reconstructing trees.
- In M-Roch, use NJ, but it works similarly for:
- Short quartets (ESSW)
- Distorted metrics and forests (M)
- etc.
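The minimum-over-genes estimator can be sketched in a few lines of Python. This is only an illustration: the data layout (one dictionary of pairwise distances per gene, keyed by population pairs) and all the numbers are hypothetical, and the step of turning the resulting distances into a tree (NJ or another distance-based method) is left out.

```python
from itertools import combinations


def min_gene_distances(populations, per_gene):
    """For each pair of populations, take the minimum distance over all genes."""
    estimate = {}
    for p, q in combinations(populations, 2):
        estimate[(p, q)] = min(d[(p, q)] for d in per_gene)
    return estimate


# Hypothetical per-gene distances for three populations.
populations = ["P1", "P2", "P3"]
per_gene = [
    {("P1", "P2"): 0.30, ("P1", "P3"): 0.90, ("P2", "P3"): 0.85},
    {("P1", "P2"): 0.95, ("P1", "P3"): 0.80, ("P2", "P3"): 0.40},  # discordant gene
    {("P1", "P2"): 0.25, ("P1", "P3"): 0.82, ("P2", "P3"): 0.88},
]

D = min_gene_distances(populations, per_gene)
closest = min(D, key=D.get)  # the pair with the smallest estimated distance
```

The intuition is that two lineages cannot coalesce before their populations merge, so each per-gene distance overshoots the population distance, and the minimum over many genes is the tightest available bound.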
6. Comments on Absolute Convergence
- Algorithmic paradigm: want to reconstruct a tree on
- n species using
- sequence length L and
- running time T.
- Absolute convergence: L = poly(n), T = poly(n).
- Q: Is this the best we can do?
7. Resolution of Steel's Conjecture
- Short branches: all branches $< l_c$. Long branches: all branches $> l_c$.
- $l_c$ depends on the mutation model but not on the tree, tree size, etc.
- Short branches: sequence length $L = c \log n$ (Daskalakis-M-Roch 06).
- Long branches: sequence length $L = n^C$ (M 04).
- n = number of species.
8. The Algorithmic Challenge
- Conj: for short branches, if data is generated from the model,
- ML identifies the correct tree using
- L = O(log n) samples
- (best bound known is L = exp(O(n))).
- Conclusion: in order to beat ML, need algorithms with L = O(log n).
- Challenge: the constant in the O is important!
- Challenge: deal with short/long branches (contract edges, output a forest).
- Challenge: general mutation models (not just CFN, JC).
- Comment: rigorous methods have running time guarantees.
- Comment: for L = poly(n), know how to deal with all challenges:
- ESSW
- M07 (forests, long edges).
- Gronau et al. (short edges).
9. On Generic Parameters
- From Rhodes' talk: generic models are easier to identify.
- Typically generic parameters.
- How about generic trees?
10. Mixtures and Phenomena in High Dims
- The geometry of high dimensions:
- Almost every collection of k vectors is almost orthogonal in high enough dimension n.
- M-Roch (in preparation):
- For every k, as $n \to \infty$, the probability that a mixture of k trees on n leaves is identifiable goes to 1.
- Holds for most reasonable measures on the space of trees and most mutation models.
- Basic idea: in generic situations one can (almost) cluster samples according to trees.
- Gives an efficient algorithm.
- Similar results hold for rates across sites.
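The high-dimensional geometry fact quoted on this slide is easy to check numerically. The sketch below is an illustration only: it assumes the k vectors are drawn with independent standard Gaussian coordinates (one of many "reasonable measures"), and the function name is hypothetical. It says nothing about the tree-mixture result itself.

```python
import math
import random


def max_abs_cosine(k, n, seed=0):
    """Largest |cosine| between any two of k random Gaussian vectors in dimension n."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]

    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(y * y for y in v))
        return dot / (norm_u * norm_v)

    return max(
        abs(cosine(vectors[i], vectors[j]))
        for i in range(k)
        for j in range(i + 1, k)
    )


if __name__ == "__main__":
    for dim in (10, 100, 1000, 10000):
        print(dim, max_abs_cosine(k=5, n=dim))
```

For fixed k the cosines are of order $1/\sqrt{n}$, so the printed values shrink toward zero as the dimension grows.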
11. A Comment on Dynamic Programming
- Q (Zhang):
- Given a tree, is it possible to find the
- most informative k species?
- In terms of parsimony?
- In terms of ML?
- Note:
- If we know the parsimony/ML score for the left/right sub-trees, we know it for the root.
- Q: Can we use dynamic programming?
- A: Yes, but with the right data structure.
- Information per node:
- Discrete version of
- the set
- of achievable distributions.
- Called density evolution in coding theory / spin-glass theory.
- Additive error 1/poly(n).
[Figure: two diagrams, each with labels L, L1 and L2]
12. Hardness of Distinguishing Network Models with Hidden Nodes
- Basic question: is it possible to recover a network G from observations at a subset of the nodes?
- Easier question: suppose we observe X1,...,Xr. Is it possible to determine if they come from nodes S in G1 or nodes T in G2?
- Problem: it may be that the two distributions are the same.
- Assume: the two distributions are different (large total variation distance).
- Q: Assuming the two distributions are different, how hard is it to tell if the samples are coming from G1 or G2?
- Related question: what is a computational model of a biologist?
[Figure: networks G1 and G2]
13. The Distinguishing Problem for Trees
- Q: Assuming the two distributions are different, how hard is it to tell if the samples are coming from T1 or T2?
- Note: for trees the problem is easy.
- Perform a likelihood test.
- Easy to do efficiently (peeling, pruning, dynamic programming).
- Number of samples needed: poly(n).
[Figure: trees T1 and T2]
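A hedged sketch of the likelihood test for trees, using pruning (peeling) under the binary symmetric CFN model. The tree encoding (a leaf is a name, an internal node is a list of `(child, flip_probability)` pairs), the uniform root prior, the two quartet trees and their edge parameters are all assumptions made for this example.

```python
import math
import random


def partial(tree, pattern):
    """Return [P(leaves below | node = 0), P(leaves below | node = 1)]."""
    if isinstance(tree, str):  # leaf: its state is observed
        s = pattern[tree]
        return [1.0 if s == 0 else 0.0, 1.0 if s == 1 else 0.0]
    result = [1.0, 1.0]
    for child, p_flip in tree:
        below = partial(child, pattern)
        for state in (0, 1):
            # The child keeps the parent's state w.p. 1 - p_flip, flips w.p. p_flip.
            result[state] *= (1 - p_flip) * below[state] + p_flip * below[1 - state]
    return result


def log_likelihood(tree, samples):
    """Log-likelihood of independent leaf patterns, with a uniform root state."""
    total = 0.0
    for pattern in samples:
        l0, l1 = partial(tree, pattern)
        total += math.log(0.5 * l0 + 0.5 * l1)
    return total


def sample(tree, state, rng, out):
    """Draw one leaf pattern by propagating `state` down the tree."""
    if isinstance(tree, str):
        out[tree] = state
        return
    for child, p_flip in tree:
        child_state = 1 - state if rng.random() < p_flip else state
        sample(child, child_state, rng, out)


def cherry(x, y, p):
    return [(x, p), (y, p)]


# Two hypothetical quartet trees on leaves a, b, c, d with different topologies.
T1 = [(cherry("a", "b", 0.05), 0.1), (cherry("c", "d", 0.05), 0.1)]
T2 = [(cherry("a", "c", 0.05), 0.1), (cherry("b", "d", 0.05), 0.1)]

rng = random.Random(0)
samples = []
for _ in range(2000):
    pattern = {}
    sample(T1, rng.randrange(2), rng, pattern)
    samples.append(pattern)

# Likelihood test: pick the tree under which the data is more likely.
chosen = "T1" if log_likelihood(T1, samples) > log_likelihood(T2, samples) else "T2"
```

Each pattern costs time linear in the number of nodes, which is why the test is efficient for trees. No comparable recursion is available for general networks with hidden nodes.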
14. Two Models of a Biologist
- The computationally limited biologist: cannot solve hard computational problems; in particular, cannot sample from general G-distributions.
- The computationally unlimited biologist:
- Can sample from any distribution.
- Related to the following problem:
- Can nature solve computationally hard problems?
From Shapiro at Weizmann
15. Hardness Results
- The computationally limited biologist (Bogdanov-M): the distinguishing problem can be solved efficiently iff NP = RP.
- The computationally unlimited biologist (Bogdanov-M):
- The problem is at least zero-knowledge hard.
- Zero-knowledge problem: can we decide if samples from a computationally efficient distribution are coming from the uniform distribution?
- Related to cryptography.
[Figure: networks G1 and G2]
16. Reconstructing Networks
- Motivation: abundance of stochastic networks in biology, social networks, neuroscience, etc.
- A network defines a distribution as follows:
- $G = (V,E)$: graph on $[n] = \{1,2,\dots,n\}$.
- Distribution defined on $A^V$, where A is some finite set.
- To each clique C in G, associate a function $\psi_C : A^C \to \mathbb{R}$ and
- $P[\sigma] \propto \prod_C \psi_C(\sigma_C)$
- Called a Markov random field, factorized distribution, etc.
- Directed models are also common.
- Markov property: if S separates A from B, then
- $\sigma_A$ and $\sigma_B$ are conditionally independent
- given $\sigma_S$.
17. Reconstructing Networks
- Task 1: given samples of $\sigma$, find G.
- Task 2: given samples of $\sigma$ restricted to a set S, find G.
- Will consider the problem when n is large and the maximum degree d is small.
- (Note that the specification of the model is of size max(n,,exp(max C)).)
18. Reconstructing Networks: A Trivial Algorithm
- Lower bound (Bresler-M-Sly):
- In order to recover G of max-deg d, need at least c d log n samples.
- Pf: follows by counting the number of networks.
- Upper bound (Bresler-M-Sly):
- If the distribution is non-degenerate, c d log n samples suffice.
- Trivial algorithm:
- For each $v \in V$:
- Enumerate over candidate neighborhoods N(v).
- For each $w \in V$, check if $\sigma_v$ is independent of $\sigma_w$ given $\sigma_{N(v)}$.
- Non-degeneracy:
- For every v and every $w \in N(v)$ there exist two assignments to N(v), $\sigma^1$ and $\sigma^2$, that differ at w and satisfy $d_{TV}(P(\sigma_v \mid \sigma^1), P(\sigma_v \mid \sigma^2)) \ge \varepsilon$.
- For the soft-core model it suffices to have, for all $\psi_{u,v}$:
- $\max_{a,b,c,d} \left| (\psi(c,a) - \psi(d,a)) - (\psi(c,b) - \psi(d,b)) \right| > \varepsilon$
- Running time: $O(n^{d+1} \log n)$.
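The enumeration idea can be sketched in Python for binary variables. This is a rough illustration under stated assumptions, not the algorithm as analysed by Bresler-M-Sly: candidate neighborhoods are tried in order of increasing size, conditional independence is tested by comparing empirical conditional frequencies against a hand-picked threshold `eps`, and sparsely populated conditioning cells are skipped via `min_count`. The test data is a simple Markov chain (a path graph); all names and parameter values are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def depends(samples, v, w, cond, eps, min_count):
    """True if the empirical law of x[v] visibly changes with x[w] for some fixed x[cond]."""
    counts = defaultdict(lambda: [0, 0])  # (cond values, x[w]) -> [n, number with x[v] == 1]
    for x in samples:
        cell = counts[(tuple(x[u] for u in cond), x[w])]
        cell[0] += 1
        cell[1] += x[v]
    by_cond = defaultdict(list)
    for (cond_vals, _), (n, ones) in counts.items():
        if n >= min_count:  # ignore cells with too few samples
            by_cond[cond_vals].append(ones / n)
    # For binary x[v], the gap in P(x[v] = 1) is the total variation distance.
    return any(max(ps) - min(ps) > eps for ps in by_cond.values())


def find_neighborhood(samples, n, v, d, eps, min_count):
    """Smallest set U with |U| <= d such that x[v] looks independent of the rest given x[U]."""
    others = [u for u in range(n) if u != v]
    for size in range(d + 1):
        for cand in combinations(others, size):
            rest = [w for w in others if w not in cand]
            if not any(depends(samples, v, w, cand, eps, min_count) for w in rest):
                return set(cand)
    return None  # no candidate of size <= d passed the test


def learn_neighborhoods(samples, n, d, eps=0.15, min_count=50):
    return {v: find_neighborhood(samples, n, v, d, eps, min_count) for v in range(n)}


def sample_chain(n, flip, num_samples, seed=0):
    """Binary Markov chain: each node copies its predecessor, flipped w.p. `flip`."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        x = [rng.randrange(2)]
        for _ in range(n - 1):
            x.append(x[-1] ^ (rng.random() < flip))
        samples.append(x)
    return samples


if __name__ == "__main__":
    data = sample_chain(n=4, flip=0.2, num_samples=20000)
    print(learn_neighborhoods(data, n=4, d=2))
```

A set that passes the test must contain every true neighbor, so under non-degeneracy the smallest passing set is N(v). Enumerating all candidate sets of size up to d for each of the n nodes is what drives the polynomial-in-n, exponential-in-d running time.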
19. A Trivial Algorithm: Related Results
- Trivial algorithm:
- For each $v \in V$:
- Enumerate over candidate neighborhoods N(v).
- For each $w \in V$, check if $\sigma_v$ is independent of $\sigma_w$ given $\sigma_{N(v)}$.
- Related work:
- The algorithm was suggested before.
- Abbeel, D. Koller, A. Ng: without restrictions, learn a model whose KL distance from the generating model is small (no guarantee of obtaining the true model; in order to get O(1) KL distance need poly samples).
- M. J. Wainwright, P. Ravikumar, J. D: use L1 regularization to get the true model for Ising models; sampling complexity $O(d^5 \log n)$, no running time bounds.
- Other related work assumes a special form of the potentials $\psi$.
20. Variants of the Trivial Algorithm
[Figure label: possible w's]
- If the graph has exponential decay of correlations,
- $\mathrm{Corr}(\sigma_u, \sigma_v) \le \exp(-c\, d(u,v))$,
- it suffices to enumerate over N(v)
- among the w correlated with v.
- Running time: $O(n^2 \log n + n f(d))$.
- Missing nodes: suppose G is triangle-free;
- then a variant of the algorithm can
- find one hidden node.
- Idea (with M. Biskup's help):
- Run the algorithm as if the node is not hidden.
- Noise: the algorithm tolerates small
- amounts of noise (statistical robustness).
- Q: What about higher amounts of noise?
- (From Bresler-M-Sly)
21. Higher Noise: Non-Identifiable Example
- Bresler-M-Sly: example of non-identifiability.
- Consider
- G1: path of length 2,
- G2: triangle + noise.
- Assume an Ising model with random interactions and random noise.
- Then, with constant probability, one cannot distinguish between the models.
- Ising: $P[\sigma] \propto \prod_{(u,v) \in E} \exp(\beta\, \sigma(u)\, \sigma(v))$
- Intuitive reason: the dimension of the distribution
- is 3 in both cases.
[Figure: hidden nodes and observed nodes]
23.
- Sebastien Roch
- Costis Daskalakis
- Andrej Bogdanov
24. Fascinating workshop
- Principal Organiser: Professor Mike Steel (University of Canterbury, NZ)
- Organisers: Professor Vincent Moulton (University of East Anglia) and Dr Katharina Huber (University of East Anglia)
- Sponsored by the Allan Wilson Centre for Molecular Ecology and Evolution
- As part of a great program. Organisers: Professor V Moulton (East Anglia), Professor M Steel (Canterbury) and Professor D Huson (Tübingen)