Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci

About This Presentation

Description:

Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci & a couple of unrelated observations. Elchanan Mossel, UC Berkeley

Slides: 25

Provided by: Sebas49

Transcript and Presenter's Notes

1
Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci & a couple of unrelated observations

Elchanan Mossel, UC Berkeley
Joint work with Sebastien Roch, Microsoft Research
At Newton Institute, Dec 07

2
Lecture Plan

A simple observation about gene trees and population trees.
A comment on optimal and absolute converging tree reconstruction.
A comment on generic models.
A comment on network reconstruction.
Disclaimer: Last talk, so a bit philosophical (but would be happy to provide hard technical proofs).

3
Gene Trees and Population Trees

Main goal in phylogenetics: recovering species/population histories.
Data: current genes.
Issue: In recent populations, gene trees may differ from population trees.
Model for the evolution of trees in populations: coalescence.
Fixed-size population N.
Each individual chooses a random parent in the previous generation.
# generations = N × branch length.
Main question: How to reconstruct population trees from gene trees?
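The coalescence model on this slide can be illustrated with a minimal simulation sketch. It assumes a haploid population of fixed size N in which every lineage picks a uniformly random parent each generation; the function names are hypothetical and not taken from the talk. Two lineages merge in a given generation with probability 1/N, so the average time to coalescence should come out close to N generations.

```python
import random


def pairwise_coalescence_time(pop_size, rng):
    """Generations until two lineages pick the same parent (assumed haploid model)."""
    generations = 0
    a, b = 0, 1  # two distinct individuals sampled in the present generation
    while a != b:
        # Each lineage independently chooses a uniformly random parent.
        a = rng.randrange(pop_size)
        b = rng.randrange(pop_size)
        generations += 1
    return generations


def mean_coalescence_time(pop_size, trials, seed=0):
    """Average pairwise coalescence time over independent simulated trials."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_time(pop_size, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # The printed average should be close to the population size.
    print(mean_coalescence_time(pop_size=50, trials=20000))
```

Because the coalescence time is random, two genes sampled from the same pair of populations can merge at different depths, which is why gene trees may disagree with the population tree.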

4
Gene Trees The Engineering Approach

Two common engineering approaches:
Approach 1:
Assume all genes come from a single tree.
Kubatko-Degnan: Inconsistent.
Approach 2:
Build a tree for each gene on its own.
Take the majority tree.
Degnan-Rosenberg: Inconsistent.
Q: What should be done instead?

5
Gene Trees A Rigorous Approach

M-Roch: A consistent estimator of the molecular distance $d(P_1,P_2)$ between two populations is
$D(P_1,P_2) = \min \{ d_g(P_1,P_2) : g \in \text{Genes} \}$
$\Rightarrow$ distances between populations are identifiable.
$\Rightarrow$ the tree is identifiable.
Under standard coalescence assumptions, get a good rate:
$P(\text{topology error}) \le (\#\text{pops}) \cdot \exp(-c \cdot \#\text{genes})$,
where c depends on the shortest branch length.
The estimator can be plugged into any distance-based method for reconstructing trees.
In M-Roch we use NJ, but it works similarly for:
Short quartets (ESSW)
Distorted metrics and forests (M)
etc.
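The minimum-over-genes estimator can be sketched in a few lines. This is only an illustration under assumptions: each gene is represented by a dictionary of pairwise distances keyed by unordered population pairs, and the helper names (`pair`, `min_distance_estimate`, `closest_pair`) are hypothetical. The output is a single distance table that could then be handed to any distance-based method such as NJ.

```python
from itertools import combinations


def pair(p, q):
    """Unordered key for a pair of populations."""
    return frozenset((p, q))


def min_distance_estimate(gene_distances, populations):
    """D(P1, P2) = min over genes g of d_g(P1, P2), for every pair of populations."""
    estimate = {}
    for p, q in combinations(populations, 2):
        estimate[pair(p, q)] = min(d[pair(p, q)] for d in gene_distances)
    return estimate


def closest_pair(estimate):
    """Pair of populations with the smallest estimated distance."""
    return min(estimate, key=estimate.get)


if __name__ == "__main__":
    # Hypothetical per-gene distances for three populations A, B, C.
    genes = [
        {pair("A", "B"): 0.9, pair("A", "C"): 1.4, pair("B", "C"): 1.5},
        {pair("A", "B"): 0.6, pair("A", "C"): 1.1, pair("B", "C"): 1.3},
        {pair("A", "B"): 1.2, pair("A", "C"): 1.0, pair("B", "C"): 1.2},
    ]
    D = min_distance_estimate(genes, ["A", "B", "C"])
    print(sorted(closest_pair(D)), D[pair("A", "B")])
```

Taking the minimum rather than an average is the point of the construction: a gene can coalesce no more recently than the populations split, so the smallest observed gene distance approaches the population distance as more genes are added.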

6
Comments on Absolute Convergence

Algorithmic paradigm: Want to reconstruct a tree on n species using
sequence length L and
running time T.
Absolute convergence: L = poly(n), T = poly(n).
Q: Is this the best we can do?

7
Resolution of Steel's Conjecture

Short branches: all branches < $l_c$. Long branches: all branches > $l_c$.
$l_c$ depends on the mutation model but not on the tree, tree size, etc.
Short branches: sequence length $L = c \log n$ (Daskalakis-M-Roch 06).
Long branches: sequence length $L = n^C$ (M 04).
n = number of species.

8
The algorithmic challenge

Conjecture: For short branches, if the data is generated from the model, ML identifies the correct tree using $L = O(\log n)$ samples (the best known bound is $L = \exp(O(n))$).
Conclusion: In order to beat ML, need algorithms with $L = O(\log n)$.
Challenge: The constant in the O is important!
Challenge: Deal with both short and long branches (contract edges, output a forest).
Challenge: General mutation models (not just CFN, JC).
Comment: Rigorous methods have running-time guarantees.
Comment: For L = poly(n), we know how to deal with all challenges:
ESSW
M07 (forests, long edges).
Gronau et al. (short edges).

9
On generic parameters

From Rhodes' talk: Generic models are easier to identify.
Typically: generic parameters.
How about generic trees?

10
Mixtures and Phenomena in High Dims

The geometry of high dimensions:
Almost every collection of k vectors is almost orthogonal in high enough dimension n.
M-Roch (in preparation):
For every k, as $n \to \infty$ the probability that a mixture of k trees on n leaves is identifiable goes to 1.
Holds for most reasonable measures on the space of trees and most mutation models.
Basic idea: In generic situations one can (almost) cluster samples according to trees.
Gives an efficient algorithm.
Similar results hold for rates across sites.

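The high-dimensional phenomenon quoted on this slide (a few random vectors are almost orthogonal once the dimension is large) is easy to check numerically. The sketch below assumes vectors drawn uniformly from the unit sphere, which is one reasonable reading of "almost every collection"; the function names are hypothetical. It only illustrates the geometric fact, not the tree-mixture result itself.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Uniform random direction, obtained by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(k, dim, seed=0):
    """Largest |cosine| between any two of k random unit vectors in the given dimension."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(k)]
    worst = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


if __name__ == "__main__":
    # The largest pairwise cosine typically shrinks roughly like 1/sqrt(dim).
    for dim in (10, 100, 1000, 10000):
        print(dim, max_abs_cosine(k=5, dim=dim))
```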
11
A Comment on Dynamic Programming

Q (Zhang):
Given a tree, is it possible to find the most informative k species?
In terms of parsimony?
In terms of ML?
Note:
If we know the parsimony/ML score for the left/right subtrees, we know it for the root.
Q: Can we use dynamic programming?
A: Yes, but with the right data structure.
Information per node: a discrete version of the set of achievable distributions.
Called density evolution in coding theory / spin-glass theory.
Additive error 1/poly(n).

[Figure: nodes labeled L, L1, L2]
12
Hardness of Distinguishing Network Models with
Hidden Nodes

Basic question: Is it possible to recover a network G from observations at a subset of the nodes?
Easier question: Suppose we observe $X_1, \dots, X_r$. Is it possible to determine whether they come from nodes S in G1 or nodes T in G2?
Problem: It may be that the two distributions are the same.
Assume: The two distributions are different (large total variation distance).
Q: Assuming the two distributions are different, how hard is it to tell whether the data is coming from G1 or G2?
Related question: What is a computational model of a biologist?

[Figure: networks G1 and G2]
13
The distinguishing problem for Trees

Q: Assuming the two distributions are different, how hard is it to tell whether the data is coming from T1 or T2?
Note: For trees the problem is easy.
Perform a likelihood test.
Easy to do efficiently (peeling, pruning, dynamic programming).
# samples needed: poly(n).
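The likelihood test for trees can be sketched with the pruning (peeling) recursion. The example below is a minimal illustration under assumptions not stated on the slide: a binary-state symmetric (CFN-style) model, a uniform root, and trees encoded as nested tuples `(left, p_left, right, p_right)` where each `p` is the flip probability on that edge. All names are hypothetical.

```python
import math


def partial_likelihoods(tree, pattern):
    """Pruning step: [P(leaf data below | node state 0), P(leaf data below | node state 1)]."""
    if isinstance(tree, str):  # a leaf is represented by its name
        return [1.0 if pattern[tree] == state else 0.0 for state in (0, 1)]
    left, p_left, right, p_right = tree
    below_left = partial_likelihoods(left, pattern)
    below_right = partial_likelihoods(right, pattern)
    result = []
    for state in (0, 1):
        # Sum over the child's state: stay with prob 1 - p, flip with prob p.
        via_left = (1 - p_left) * below_left[state] + p_left * below_left[1 - state]
        via_right = (1 - p_right) * below_right[state] + p_right * below_right[1 - state]
        result.append(via_left * via_right)
    return result


def log_likelihood(tree, patterns):
    """Log-likelihood of independent site patterns under a uniform root state."""
    total = 0.0
    for pattern in patterns:
        l0, l1 = partial_likelihoods(tree, pattern)
        total += math.log(0.5 * l0 + 0.5 * l1)
    return total


def likelihood_test(tree1, tree2, patterns):
    """Return 'T1' or 'T2', whichever tree gives the data the higher likelihood."""
    if log_likelihood(tree1, patterns) >= log_likelihood(tree2, patterns):
        return "T1"
    return "T2"


if __name__ == "__main__":
    p = 0.1
    t1 = (("a", p, "b", p), p, ("c", p, "d", p), p)  # topology ab|cd
    t2 = (("a", p, "c", p), p, ("b", p, "d", p), p)  # topology ac|bd
    sites = [{"a": 0, "b": 0, "c": 1, "d": 1}] * 20
    print(likelihood_test(t1, t2, sites))
```

Each site costs time linear in the number of nodes, which is what makes the test efficient on trees; no comparable recursion is available for general networks.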

T1
T2
14
Two Models of a Biologist

The Computationally Limited Biologist: Cannot solve hard computational problems; in particular, cannot sample from general G-distributions.
The Computationally Unlimited Biologist: Can sample from any distribution.
Related to the following problem: Can nature solve computationally hard problems?

From Shapiro at Weizmann
15
Hardness Results

The Computationally Limited Biologist (Bogdanov-M): The distinguishing problem can be solved efficiently iff NP = RP.
The Computationally Unlimited Biologist (Bogdanov-M): The problem is at least zero-knowledge hard.
Zero-knowledge problem: Can we decide whether samples from a computationally efficient distribution are coming from the uniform distribution?
Related to cryptography.

[Figure: networks G1 and G2]
16
Reconstructing Networks

Motivation: abundance of stochastic networks in biology, social networks, neuroscience, etc.
A network defines a distribution as follows:
$G = (V, E)$: graph on $[n] = \{1, 2, \dots, n\}$.
The distribution is defined on $A^V$, where A is some finite set.
To each clique C in G, associate a function $\psi_C : A^C \to \mathbb{R}$, and
$P[\sigma] \propto \prod_C \psi_C(\sigma_C)$
Called a Markov random field, factorized distribution, etc.
Directed models are also common.
Markov property: If S separates A from B, then $\sigma_A$ and $\sigma_B$ are conditionally independent given $\sigma_S$.
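A small worked example may help make the factorized distribution and the Markov property concrete. The sketch below assumes the simplest case, an Ising model with spins in {-1, +1}, pairwise potentials exp(beta * spin_u * spin_v) on the edges, and a single hypothetical coupling `beta`; it computes the distribution exactly by enumeration, so it is only suitable for a handful of nodes.

```python
import math
from itertools import product


def ising_distribution(n, edges, beta):
    """Exact P[sigma], proportional to the product over edges of exp(beta * s_u * s_v)."""
    weights = {}
    for sigma in product((-1, 1), repeat=n):
        weights[sigma] = math.exp(sum(beta * sigma[u] * sigma[v] for u, v in edges))
    z = sum(weights.values())  # normalizing constant
    return {sigma: w / z for sigma, w in weights.items()}


def conditional(dist, target, value, given):
    """P(sigma_target = value | sigma_i = s for every (i, s) in `given`)."""
    num = den = 0.0
    for sigma, p in dist.items():
        if all(sigma[i] == s for i, s in given.items()):
            den += p
            if sigma[target] == value:
                num += p
    return num / den


if __name__ == "__main__":
    path = ising_distribution(3, [(0, 1), (1, 2)], beta=0.7)
    # On the path 0 - 1 - 2, node 1 separates node 0 from node 2,
    # so the value of node 2 should not change the conditional law of node 0.
    print(conditional(path, 0, 1, {1: 1, 2: 1}))
    print(conditional(path, 0, 1, {1: 1, 2: -1}))
```

On the path the two printed values agree, while adding the edge (0, 2) to form a triangle removes the separation and makes them differ.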

17
Reconstructing Networks

Task 1: Given samples of $\sigma$, find G.
Task 2: Given samples of $\sigma$ restricted to a set S, find G.
Will consider the problem when n is large and the maximum degree d is small.
(Note that the specification of the model is of size $\max(n, \exp(\max_C |C|))$.)

18
Reconstructing Networks A Trivial Algorithm

Lower bound (Bresler-M-Sly):
In order to recover G of max-degree d, need at least c d log n samples.
Pf: follows by counting the number of networks.
Upper bound (Bresler-M-Sly):
If the distribution is non-degenerate, c d log n samples suffice.
Trivial algorithm:
For each $v \in V$:
Enumerate over candidate neighborhoods N(v).
For each $w \in V$, check whether $\sigma_v$ is independent of $\sigma_w$ given $\sigma_{N(v)}$.
Non-degeneracy:
For every v and every $w \in N(v)$ there exist two assignments $\sigma_1$ and $\sigma_2$ to N(v) that differ at w and satisfy
$d_{TV}(P(\sigma_v \mid \sigma_1), P(\sigma_v \mid \sigma_2)) > \epsilon$.
For the soft-core model it suffices to have, for all $\psi_{u,v}$:
$\max_{a,b,c,d}$ ψ(c,a)-ψ(d,a) ψ(c,b)-ψ(d,b) > ε
Running time: $O(n^{d+1} \log n)$.
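The enumeration idea behind the trivial algorithm can be sketched on a toy instance. To keep the example deterministic, the sketch below replaces the sample-based independence test with exact conditional probabilities of a small Ising model (an idealized, infinite-sample version); the function names, the tolerance `tol`, and the choice to return the smallest separating set are assumptions made for illustration, not the authors' implementation.

```python
import math
from itertools import combinations, product


def ising_distribution(n, edges, beta):
    """Exact Ising distribution on spins {-1, +1}^n (small n only)."""
    weights = {}
    for sigma in product((-1, 1), repeat=n):
        weights[sigma] = math.exp(sum(beta * sigma[u] * sigma[v] for u, v in edges))
    z = sum(weights.values())
    return {sigma: w / z for sigma, w in weights.items()}


def prob_plus(dist, v, given):
    """P(sigma_v = +1 | the spins fixed in the dict `given`)."""
    num = den = 0.0
    for sigma, p in dist.items():
        if all(sigma[i] == s for i, s in given.items()):
            den += p
            if sigma[v] == 1:
                num += p
    return num / den


def separates(dist, n, v, candidate, tol):
    """True if sigma_v is independent of each remaining sigma_w given sigma_candidate."""
    others = [w for w in range(n) if w != v and w not in candidate]
    for spins in product((-1, 1), repeat=len(candidate)):
        given = dict(zip(candidate, spins))
        base = prob_plus(dist, v, given)
        for w in others:
            for s in (-1, 1):
                extended = dict(given)
                extended[w] = s
                if abs(prob_plus(dist, v, extended) - base) > tol:
                    return False
    return True


def recover_neighborhoods(dist, n, max_degree, tol=1e-9):
    """For each node, the smallest candidate set (size <= max_degree) that separates it."""
    neighborhoods = {}
    for v in range(n):
        rest = [u for u in range(n) if u != v]
        for size in range(max_degree + 1):
            found = next((c for c in combinations(rest, size)
                          if separates(dist, n, v, c, tol)), None)
            if found is not None:
                neighborhoods[v] = set(found)
                break
    return neighborhoods


if __name__ == "__main__":
    dist = ising_distribution(4, [(0, 1), (1, 2), (2, 3)], beta=0.5)
    print(recover_neighborhoods(dist, 4, max_degree=2))
```

With finite samples the exact comparison would be replaced by an empirical test with a threshold tied to the non-degeneracy parameter; the enumeration over candidate sets of size at most d is what drives the polynomial running time with exponent growing in d.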

19
A Trivial Algorithm Related Result

Trivial algorithm:
For each $v \in V$:
Enumerate over candidate neighborhoods N(v).
For each $w \in V$, check whether $\sigma_v$ is independent of $\sigma_w$ given $\sigma_{N(v)}$.
Related work:
The algorithm was suggested before.
Abbeel, D. Koller, A. Ng: without restrictions, learn a model whose KL distance from the generating model is small (no guarantee of obtaining the true model; in order to get O(1) KL distance, need poly samples).
M. J. Wainwright, P. Ravikumar, J. D. Lafferty: use L1 regularization to get the true model for Ising models; sample complexity $O(d^5 \log n)$; no running time bounds.
Other related work assumes a special form of the potentials $\psi$.

20
Variants of the Trivial Algorithm

If the graph has exponential decay of correlations, $\mathrm{Corr}(\sigma_u, \sigma_v) \le \exp(-c\, d(u,v))$:
It suffices to enumerate over N(v) among the possible w's correlated with v.
Running time: $O(n^2 \log n + n f(d))$.
Missing nodes: Suppose G is triangle-free; then a variant of the algorithm can find one hidden node.
Idea (with M. Biskup's help): Run the algorithm as if the node is not hidden.
Noise: The algorithm tolerates small amounts of noise (statistical robustness).
Q: What about higher amounts of noise?
(From Bresler-M-Sly)

21
Higher Noise Non Identifiable Example

Bresler-M-Sly: Example of non-identifiability.
Consider:
G1: path of length 2,
G2: triangle + noise.
Assume an Ising model with random interactions and random noise.
Then, with constant probability, one cannot distinguish between the models.
Ising: $P[\sigma] \propto \prod_{(u,v) \in E} \exp(\beta\, \sigma(u)\, \sigma(v))$
Intuitive reason: the dimension of the distribution is 3 in both cases.

[Figure labels: hidden nodes, observed nodes]
22
(No Transcript)
23

Sebastien Roch
Costis Daskalakis
Andrej Bogdanov

24
Fascinating workshop.
Principal Organiser: Professor Mike Steel (University of Canterbury, NZ).
Organisers: Professor Vincent Moulton (University of East Anglia) and Dr Katharina Huber (University of East Anglia).
Sponsored by: Allan Wilson Centre for Molecular Ecology and Evolution.
As part of a great program.
Organisers: Professor V Moulton (East Anglia), Professor M Steel (Canterbury) and Professor D Huson (Tubingen).
