Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci

About This Presentation
Description:

Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci & a couple of unrelated observations. Elchanan Mossel, UC Berkeley.

Slides: 25
Provided by: Sebas49

Transcript and Presenter's Notes


1
Incomplete Lineage Sorting: Consistent Phylogeny
Estimation From Multiple Loci & a couple of
unrelated observations
  • Elchanan Mossel, UC Berkeley
  • Joint work with Sebastien Roch, Microsoft
    Research
  • At Newton Institute Dec 07

2
Lecture Plan
  • A simple observation about gene trees and
    population trees.
  • A comment on optimal and absolute converging
    tree reconstruction
  • A comment on Generic models.
  • A comment on Network Reconstruction.
  • Disclaimer: last talk, a bit philosophical (but
    would be happy to provide hard technical
    proofs).

3
Gene Trees and Population Trees
  • Main goal in phylogenetics: recovering
    species/population histories.
  • Data: current genes.
  • Issue: in recent populations, gene trees may
    differ from population trees.
  • Model for the evolution of trees in populations:
    coalescence.
  • Fixed population size N.
  • Each individual chooses a random parent in the
    previous generation.
  • Number of generations = N × branch length.
  • Main question: how to reconstruct population
    trees from gene trees?
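
The random-parent model above can be simulated directly. The following is a minimal sketch, assuming a haploid population of fixed size N; the function names are hypothetical and not from the talk. It traces two lineages backwards until they pick the same parent, which is why branch lengths are naturally measured in units of N generations.

```python
import random

def pairwise_coalescence_time(pop_size, rng):
    """Generations back until two sampled lineages share a parent.

    Each lineage picks a uniformly random parent in the previous
    generation, as in the fixed-size population model.
    """
    generations = 0
    a, b = 0, 1  # two distinct individuals in the current generation
    while a != b:
        a = rng.randrange(pop_size)
        b = rng.randrange(pop_size)
        generations += 1
    return generations

def mean_coalescence_time(pop_size, trials, seed=0):
    """Average coalescence time over independent trials."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_time(pop_size, rng) for _ in range(trials))
    return total / trials
```

Two lineages meet in a given generation with probability 1/N, so the waiting time is geometric with mean N; `mean_coalescence_time(50, 4000)` should therefore come out close to 50.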

4
Gene Trees: The Engineering Approach
  • Two common engineering approaches.
  • Approach 1:
  • Assume all genes come from a single tree.
  • Kubatko-Degnan: inconsistent.
  • Approach 2:
  • Build a tree for each gene on its own.
  • Take the majority tree.
  • Degnan-Rosenberg: inconsistent.
  • Q: What should be done instead?

5
Gene Trees: A Rigorous Approach
  • M-Roch: a consistent estimator of the molecular
    distance $d(P_1,P_2)$ between two populations is
  • $D(P_1,P_2) = \min \{ d_g(P_1,P_2) : g \in \text{Genes} \}$
  • $\Rightarrow$ distances between populations are
    identifiable.
  • $\Rightarrow$ the tree is identifiable.
  • Under standard coalescence assumptions, get a good
    rate:
  • $P(\text{topology error}) \le (\#\text{pops}) \exp(-c \cdot \#\text{genes})$
  • $c$ = shortest branch length.
  • The estimator can be plugged into any
    distance-based method for reconstructing trees.
  • In M-Roch, NJ is used, but it works similarly for:
  • Short quartets (ESSW)
  • Distorted metrics and forests (M)
  • etc.
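
The minimum-over-genes estimator can be sketched in a few lines. This is an illustrative sketch only: the input layout (one dictionary of pairwise distances per gene) and the name `min_gene_distance` are assumptions, not part of the talk. The intuition is that gene lineages from two populations can only coalesce before the populations split, so each per-gene distance is at least the population distance and the minimum is the tightest estimate.

```python
from itertools import combinations

def min_gene_distance(gene_distances, populations):
    """D(P1, P2) = min over genes g of d_g(P1, P2).

    gene_distances: list of dicts, one per gene, mapping
    frozenset({P1, P2}) to that gene's molecular distance.
    """
    estimate = {}
    for p, q in combinations(populations, 2):
        pair = frozenset((p, q))
        estimate[pair] = min(d[pair] for d in gene_distances)
    return estimate

# Toy, made-up distances for three populations and three genes.
AB, AC, BC = frozenset("AB"), frozenset("AC"), frozenset("BC")
genes = [
    {AB: 0.30, AC: 0.52, BC: 0.55},
    {AB: 0.22, AC: 0.61, BC: 0.50},
    {AB: 0.41, AC: 0.50, BC: 0.58},
]
D = min_gene_distance(genes, "ABC")
```

The resulting table `D` is what would be handed to a distance-based method such as NJ.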

6
Comments on Absolute Convergence
  • Algorithmic paradigm: want to reconstruct a tree
    on n species using
  • sequence length L and
  • running time T.
  • Absolute convergence: L = poly(n), T = poly(n).
  • Q: Is this the best we can do?

7
Resolution of Steel's Conjecture
  • Short branches: all branches < $l_c$.
  • Long branches: all branches > $l_c$.
  • $l_c$ depends on the mutation model but not on
    the tree, tree size, etc.
  • Short branches: sequence length $L = c \log n$
    (Daskalakis-M-Roch 06).
  • Long branches: sequence length $L = n^C$ (M 04).
  • Here n is the number of species.

8
The algorithmic challenge
  • Conjecture: for short branches, if the data is
    generated from the model,
  • ML identifies the correct tree using
  • L = O(log n) samples
  • (the best known bound is L = exp(O(n))).
  • Conclusion: in order to beat ML, need
    algorithms with L = O(log n).
  • Challenge: the constant in the O is important!
  • Challenge: deal with short/long branches
    (contract edges, output a forest).
  • Challenge: general mutation models (not just CFN,
    JC).
  • Comment: rigorous methods have running time
    guarantees.
  • Comment: for L = poly(n), we know how to deal
    with all challenges:
  • ESSW
  • M07 (forests, long edges).
  • Gronau et al. (short edges).

9
On generic parameters
  • From Rhodes' talk: generic models are easier to
    identify.
  • Typically: generic parameters.
  • How about generic trees?

10
Mixtures and Phenomena in High Dims
  • The Geometry of High Dimensions
  • Almost every collection of k vectors is almost
    orthogonal in high enough dimension n.
  • M-Roch (in preparation):
  • For every k, as $n \to \infty$ the probability that
    a mixture of k trees on n leaves is identifiable
    goes to 1.
  • Holds for most reasonable measures on the space
    of trees and most mutation models.
  • Basic idea: in generic situations one can (almost)
    cluster samples according to trees.
  • Gives an efficient algorithm.
  • Similar results hold for rates across sites.
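
The high-dimensional geometry fact is easy to check numerically. The sketch below is an illustration under an assumption not stated on the slide: the k vectors are drawn with independent standard Gaussian coordinates. It reports the largest absolute cosine between any two of them; the helper name `max_abs_cosine` is hypothetical.

```python
import math
import random

def max_abs_cosine(k, n, seed=0):
    """Largest |cos(angle)| among k random Gaussian vectors in dimension n."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    worst = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            dot = sum(x * y for x, y in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot) / (norms[i] * norms[j]))
    return worst
```

For fixed k the typical cosine is of order 1/sqrt(n), so `max_abs_cosine(5, 10000)` should be far below 0.1, while small n gives noticeably larger values.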

11
A Comment on Dynamic Programming
  • Q (Zhang):
  • Given a tree, is it possible to find the
    most informative k species?
  • In terms of parsimony?
  • In terms of ML?
  • Note:
  • If we know the parsimony/ML score for the
    left/right subtrees, we know it for the root.
  • Q: Can we use dynamic programming?
  • A: Yes, but with the right data structure.
  • Information per node: a discrete version of the
    set of achievable distributions.
  • Called density evolution in coding theory /
    spin-glass theory.
  • Additive error 1/poly(n).
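
The note that subtree scores determine the root score is the usual bottom-up recursion. As a hedged illustration, here is the standard Fitch recursion for the parsimony score of a single character; it is not the density-evolution data structure mentioned above, and the nested-tuple tree encoding is an assumption made for this sketch.

```python
def fitch(tree, leaf_states):
    """Return (state_set, score) for a rooted binary tree.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    leaf_states: dict mapping leaf name to its observed character.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_score = fitch(tree[0], leaf_states)
    right_set, right_score = fitch(tree[1], leaf_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra mutation needed.
        return common, left_score + right_score
    # Children disagree: one mutation is charged at this node.
    return left_set | right_set, left_score + right_score + 1

tree = (("a", "b"), ("c", "d"))
_, score_clustered = fitch(tree, {"a": "A", "b": "A", "c": "C", "d": "C"})
_, score_mixed = fitch(tree, {"a": "A", "b": "C", "c": "A", "d": "C"})
```

Each node only needs the pair (state set, score) from its two children, which is the sense in which the left/right subtree information suffices for the root.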

[Figure: tree diagrams with node labels L, L1, L2]

12
Hardness of Distinguishing Network Models with
Hidden Nodes
  • Basic question: is it possible to recover a
    network G from observations at a subset of the
    nodes?
  • Easier question: suppose we observe
    $X_1, \ldots, X_r$. Is it possible to determine if
    they come from nodes S in G1 or nodes T in G2?
  • Problem: it may be that the two distributions
    are the same.
  • Assume: the two distributions are different
    (large total variation distance).
  • Q: Assuming the two distributions are different,
    how hard is it to tell if the samples are coming
    from G1 or G2?
  • Related question: what is a computational model
    of a biologist?

[Figure: networks G1 and G2]
13
The distinguishing problem for Trees
  • Q: Assuming the two distributions are different,
    how hard is it to tell if the samples are coming
    from T1 or T2?
  • Note: for trees the problem is easy.
  • Perform a likelihood test.
  • Easy to do efficiently (peeling, pruning,
    dynamic programming).
  • Number of samples needed: poly(n).
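
The likelihood test can be sketched with the pruning (peeling) recursion. The sketch below assumes the two-state symmetric CFN model with a uniform root state; the tree encoding (an internal node is a tuple of `(child, flip_probability)` pairs) and all function names are illustrative assumptions, not taken from the talk.

```python
import math

def partial_likelihoods(tree, site):
    """Pruning step: [P(data below | state 0), P(data below | state 1)]."""
    if isinstance(tree, str):
        return [1.0 if site[tree] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, flip in tree:
        below = partial_likelihoods(child, site)
        for s in (0, 1):
            # The child keeps state s with prob 1 - flip, flips otherwise.
            result[s] *= (1 - flip) * below[s] + flip * below[1 - s]
    return result

def log_likelihood(tree, sites):
    """Log-likelihood of independent sites under a uniform root state."""
    total = 0.0
    for site in sites:
        l0, l1 = partial_likelihoods(tree, site)
        total += math.log(0.5 * l0 + 0.5 * l1)
    return total

def likelihood_test(tree1, tree2, sites):
    """Return which of the two candidate trees explains the sites better."""
    better = log_likelihood(tree1, sites) >= log_likelihood(tree2, sites)
    return "T1" if better else "T2"

def cherry(x, y, p=0.1):
    return ((x, p), (y, p))

t1 = ((cherry("a", "b"), 0.1), (cherry("c", "d"), 0.1))  # ab | cd
t2 = ((cherry("a", "c"), 0.1), (cherry("b", "d"), 0.1))  # ac | bd
sites = [{"a": 0, "b": 0, "c": 1, "d": 1}] * 5
winner = likelihood_test(t1, t2, sites)
```

Each site costs time linear in the number of leaves, so the whole test is polynomial in n and in the number of samples.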

[Figure: trees T1 and T2]
14
Two Models of a Biologist
  • The Computationally Limited Biologist: cannot
    solve hard computational problems; in particular,
    cannot sample from a general G-distribution.
  • The Computationally Unlimited Biologist:
  • Can sample from any distribution.
  • Related to the following problem:
  • Can nature solve computationally hard problems?

From Shapiro at Weizmann
15
Hardness Results
  • The Computationally Limited Biologist
    (Bogdanov-M): the distinguishing problem can be
    solved efficiently iff NP = RP.
  • The Computationally Unlimited Biologist
    (Bogdanov-M):
  • The problem is at least zero-knowledge hard.
  • Zero-knowledge problem: can we decide if samples
    from a computationally efficient distribution
    are coming from the uniform distribution?
  • Related to cryptography.

[Figure: networks G1 and G2]
16
Reconstructing Networks
  • Motivation: abundance of stochastic networks in
    biology, social networks, neuroscience, etc.
  • A network defines a distribution as follows:
  • $G = (V, E)$: graph on $[n] = \{1, 2, \ldots, n\}$.
  • The distribution is defined on $A^V$, where A is
    some finite set.
  • To each clique C in G, associate a function
    $\psi_C : A^C \to \mathbb{R}$ and
  • $P[\sigma] \propto \prod_C \psi_C(\sigma_C)$
  • Called a Markov random field, factorized
    distribution, etc.
  • Directed models are also common.
  • Markov property: if S separates A from B, then
    $\sigma_A$ and $\sigma_B$ are conditionally
    independent given $\sigma_S$.
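
For a very small graph the factorized distribution and the Markov property can be checked by brute force. This is a minimal sketch under stated assumptions: nodes are numbered from 0, the potentials are positive, and the helper names `mrf_distribution` and `conditional` are hypothetical. The example uses a path 0 - 1 - 2 with Ising-type edge potentials, where node 1 separates node 0 from node 2.

```python
import itertools
import math

def mrf_distribution(n, alphabet, potentials):
    """Normalized product of clique potentials over all assignments.

    potentials: dict mapping a clique (tuple of nodes) to a function of
    the tuple of their states, returning a positive weight.
    """
    weights = {}
    for sigma in itertools.product(alphabet, repeat=n):
        w = 1.0
        for clique, psi in potentials.items():
            w *= psi(tuple(sigma[v] for v in clique))
        weights[sigma] = w
    z = sum(weights.values())  # normalizing constant
    return {sigma: w / z for sigma, w in weights.items()}

def conditional(dist, target, given):
    """Law of sigma_target given the partial assignment `given`."""
    out = {}
    for sigma, p in dist.items():
        if all(sigma[v] == s for v, s in given.items()):
            out[sigma[target]] = out.get(sigma[target], 0.0) + p
    total = sum(out.values())
    return {state: p / total for state, p in out.items()}

def edge(states):
    return math.exp(0.7 * states[0] * states[1])

dist = mrf_distribution(3, (-1, 1), {(0, 1): edge, (1, 2): edge})
with_plus = conditional(dist, 0, {1: 1, 2: 1})
with_minus = conditional(dist, 0, {1: 1, 2: -1})
```

Here `with_plus` and `with_minus` agree up to rounding: once the state at node 1 is fixed, the state at node 2 carries no further information about node 0.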

17
Reconstructing Networks
  • Task 1: given samples of $\sigma$, find G.
  • Task 2: given samples of $\sigma$ restricted to a
    set S, find G.
  • Will consider the problem when n is large and the
    maximum degree d is small.
  • (Note that specification of the model is of size
    max(n,,exp(max C)) )

18
Reconstructing Networks: A Trivial Algorithm
  • Lower bound (Bresler-M-Sly):
  • In order to recover G of max-degree d, need at
    least c d log n samples.
  • Proof follows by counting networks.
  • Upper bound (Bresler-M-Sly):
  • If the distribution is non-degenerate, c d log n
    samples suffice.
  • Trivial algorithm:
  • For each $v \in V$:
  • Enumerate over N(v).
  • For each $w \in V$, check if $\sigma_v$ is
    independent of $\sigma_w$ given $\sigma_{N(v)}$.
  • Non-degeneracy:
  • For every v and every $w \in N(v)$ there exist two
    assignments to N(v), $\sigma_1$ and $\sigma_2$,
    that differ at w and
    $d_{TV}(P(\sigma_v \mid \sigma_1), P(\sigma_v \mid \sigma_2)) > \epsilon$.
  • For soft-core models it suffices to have, for all
    $\psi_{u,v}$,
  • $\max_{a,b,c,d} |(\psi(c,a) - \psi(d,a)) - (\psi(c,b) - \psi(d,b))| > \epsilon$.
  • Running time: $O(n^{d+1} \log n)$.
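
The enumeration above can be sketched on a toy instance. To keep the example short it makes two simplifying assumptions that are not in the slides: it works with the exact joint distribution of a small Ising model instead of samples, and it accepts the smallest candidate neighborhood that passes the independence check. With samples, the exact comparison would be replaced by a thresholded comparison of empirical conditional distributions. All names are hypothetical.

```python
import itertools
import math

def ising_distribution(n, edges, beta):
    """Exact joint law of an Ising model with spins in {-1, +1}."""
    weights = {}
    for sigma in itertools.product((-1, 1), repeat=n):
        energy = sum(sigma[u] * sigma[v] for u, v in edges)
        weights[sigma] = math.exp(beta * energy)
    z = sum(weights.values())
    return {sigma: w / z for sigma, w in weights.items()}

def conditional(dist, v, given):
    """Law of sigma_v given the partial assignment `given` (node -> spin)."""
    out = {-1: 0.0, 1: 0.0}
    for sigma, p in dist.items():
        if all(sigma[u] == s for u, s in given.items()):
            out[sigma[v]] += p
    total = out[-1] + out[1]
    return {s: p / total for s, p in out.items()}

def separates(dist, n, v, cand, tol):
    """True if, given sigma_cand, sigma_v is independent of each other sigma_w."""
    for w in range(n):
        if w == v or w in cand:
            continue
        for spins in itertools.product((-1, 1), repeat=len(cand)):
            given = dict(zip(cand, spins))
            plus = conditional(dist, v, {**given, w: 1})
            minus = conditional(dist, v, {**given, w: -1})
            if abs(plus[1] - minus[1]) > tol:
                return False
    return True

def reconstruct(dist, n, max_degree, tol=1e-9):
    """For each v, take the smallest candidate neighborhood that separates it."""
    edges = set()
    for v in range(n):
        others = [u for u in range(n) if u != v]
        for size in range(max_degree + 1):
            hits = [c for c in itertools.combinations(others, size)
                    if separates(dist, n, v, c, tol)]
            if hits:
                edges.update(frozenset((v, u)) for u in hits[0])
                break
    return edges

path = [(0, 1), (1, 2), (2, 3)]
recovered = reconstruct(ising_distribution(4, path, beta=0.5), n=4, max_degree=2)
```

On this path `recovered` contains exactly the three path edges. The nested enumeration over candidate neighborhoods of size at most d and over the remaining nodes w is what makes the cost grow like a power of n with exponent depending on d.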

19
A Trivial Algorithm: Related Result
  • Trivial algorithm:
  • For each $v \in V$:
  • Enumerate over N(v).
  • For each $w \in V$, check if $\sigma_v$ is
    independent of $\sigma_w$ given $\sigma_{N(v)}$.
  • Related work:
  • The algorithm was suggested before.
  • Abbeel, D. Koller, A. Ng: without restrictions,
    learn a model whose KL distance from the
    generating model is small (no guarantee of
    obtaining the true model; in order to get O(1)
    KL distance, need poly samples).
  • M. J. Wainwright, P. Ravikumar, J. D: use L1
    regularization to get the true model for Ising
    models; sampling complexity $O(d^5 \log n)$, no
    running time bounds.
  • Other related work assumes a special form of the
    potentials $\psi$.

20
Variants of the Trivial Algorithm
[Figure label: possible w's]
  • If the graph has exponential decay of
    correlations,
  • $\mathrm{Corr}(\sigma_u, \sigma_v) \le \exp(-c\, d(u,v))$,
  • it suffices to enumerate over N(v)
    among the w correlated with v.
  • Running time: $O(n^2 \log n + n f(d))$.
  • Missing nodes: suppose G is triangle free;
    then a variant of the algorithm can
    find one hidden node.
  • Idea (with M. Biskup's help):
  • Run the algorithm as if the node is not hidden.
  • Noise: the algorithm tolerates small
    amounts of noise (statistical robustness).
  • Q: What about higher amounts of noise?
  • (From Bresler-M-Sly)

21
Higher Noise: Non-Identifiable Example
  • Bresler-M-Sly: example of non-identifiability.
  • Consider
  • G1: path of length 2,
  • G2: triangle + noise.
  • Assume an Ising model with random interactions
    and random noise.
  • Then, with constant probability, one cannot
    distinguish between the models.
  • Ising: $P[\sigma] \propto \prod_{(u,v) \in E} \exp(\beta\, \sigma(u) \sigma(v))$
  • Intuitive reason: the dimension of the
    distribution is 3 in both cases.

[Figure: hidden nodes and observed nodes]
23
  • Sebastien Roch
  • Costis Daskalakis
  • Andrej Bogdanov

24
Fascinating workshop
  • Principal Organiser: Professor Mike Steel
    (University of Canterbury, NZ)
  • Organisers: Professor Vincent Moulton
    (University of East Anglia) and Dr Katharina
    Huber (University of East Anglia)
  • Sponsored by: Allan Wilson Centre for Molecular
    Ecology and Evolution
As part of a great program
  • Organisers: Professor V Moulton (East Anglia),
    Professor M Steel (Canterbury) and Professor
    D Huson (Tübingen)