Title: Phylogenetic Networks
1Phylogenetic Networks
- Anna Tholse
- MS Thesis Defense
- Department of Computer Science
- July 10, 2003
2Outline
- Background
- Network generation
- Distance measure for networks
- Network reconstruction
- Conclusion
3Phylogenetic Trees
- Mathematical model for representing evolutionary histories among taxa
- Rooted or unrooted
- Leaves: taxa for which we have sampled data
- Internal nodes: hypothetical ancestors
4Model Trees
- Not enough biological datasets exist
- Algorithms for simulating the true phylogeny have been developed
- Underlying model of topology
- Uniform random: all topologies are equally likely
- Birth-death: well-balanced topologies, biologically meaningful
5Sequence Evolution
- The model tree itself does not provide all the information; we also need sequence data
- Mutational changes on the sequence (nucleotide substitutions, insertions and deletions, and recombination)
- Seq-gen (Rambaut and Grassly, 1997)
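A minimal sketch of evolving sequences down a model tree, assuming substitutions only and a single per-edge substitution probability `p`. This is not Seq-gen and does not reproduce its models; the function names, the `tree` dictionary layout, and `p` are all hypothetical choices made for illustration.

```python
import random

NUCLEOTIDES = "ACGT"

def mutate(seq, p, rng):
    """Return a copy of seq in which each site is substituted with probability p."""
    out = []
    for base in seq:
        if rng.random() < p:
            # Substitute with one of the three other nucleotides, chosen uniformly.
            out.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def evolve(tree, root, root_seq, p, rng):
    """Evolve root_seq down the tree and return a dict leaf -> sequence.

    tree maps each internal node to its list of children (assumed layout);
    nodes that do not appear as keys are treated as leaves.
    """
    leaves = {}
    stack = [(root, root_seq)]
    while stack:
        node, seq = stack.pop()
        children = tree.get(node, [])
        if not children:
            leaves[node] = seq
        for child in children:
            # Each edge applies an independent round of substitutions.
            stack.append((child, mutate(seq, p, rng)))
    return leaves

# Example: a three-taxon rooted tree ((A,B),C).
example_tree = {"root": ["x", "C"], "x": ["A", "B"]}
leaf_seqs = evolve(example_tree, "root", "ACGTACGT", 0.1, random.Random(0))
```

Only the leaf sequences are returned, mirroring the fact that a reconstruction method sees sampled data for the leaves only.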
6Distance Measure
- Assess the performance of a reconstruction method by computing the distance between the inferred phylogeny and the model phylogeny
- Robinson-Foulds
- Every edge e in a leaf-labeled tree T defines a bipartition $b_e$
- T is encoded by $C(T) = \{ b_e : e \in E(T) \}$
- n-3 internal edges in an unrooted binary tree on n leaves
7Robinson-Foulds Cont.
- False positive rate: $FP = |C(T_2) \setminus C(T_1)| / (n-3)$
- False negative rate: $FN = |C(T_1) \setminus C(T_2)| / (n-3)$
- RF value: $(FP + FN) / 2$
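The rates above can be sketched in a few lines, assuming each tree is already given as the set of bipartitions of its internal edges, that both trees are binary on the same n taxa, and that T1 is the model tree and T2 the inferred tree (the slides do not state which is which). The helper names `split` and `rf_rates` are hypothetical.

```python
def split(side, taxa):
    """Canonical form of a bipartition: an unordered pair of leaf sets."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})

def rf_rates(model_splits, inferred_splits, n_taxa):
    """Return (FP, FN, RF) for two binary trees on the same n_taxa leaves."""
    internal = n_taxa - 3  # internal edges in an unrooted binary tree
    fp = len(inferred_splits - model_splits) / internal
    fn = len(model_splits - inferred_splits) / internal
    return fp, fn, (fp + fn) / 2

taxa = "ABCDE"
# Model tree ((A,B),C,(D,E)) and inferred tree ((A,C),B,(D,E)).
model = {split("AB", taxa), split("DE", taxa)}
inferred = {split("AC", taxa), split("DE", taxa)}
print(rf_rates(model, inferred, len(taxa)))  # (0.5, 0.5, 0.5)
```

Storing each bipartition as an unordered pair makes AB|CDE and CDE|AB compare equal, so the set differences count edges rather than orientations.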
8Reconstruction Methods
- Maximum parsimony: optimization method
- Produces the tree (or trees) that needs the fewest evolutionary changes between sequences
- Maximum likelihood: optimization method
- Produces the tree (or trees) most likely to give rise to the given sequences
- Neighbor-joining: distance method
- Greedily builds a tree that approximately minimizes the total branch length
9Simulations
- Important for testing the performance of phylogeny reconstruction methods
- Can generate test sets in arbitrarily large numbers with different settings
- Parameter space is large
- Test a large range of parameters and do many runs for each setting to estimate the variance
10Simulation Flow
- Create model topology
- Evolve sequences on the topology
- Feed resulting leaf sequences to the studied reconstruction method
- Compute distance between inferred phylogeny and model phylogeny
11Outline
- Background
- Network generation
- Distance measure for networks
- Network reconstruction
- Conclusion
12Related Work
- SplitsTree (Huson, 1998) and NeighborNet (Bryant and Moulton, 2002): representing incompatible splits.
- Ancestral Recombination Graphs (Hudson, 1983 and Griffiths and Marjoram, 1996): extension of the coalescent model.
- Lateral Gene Transfer (Hallett et al. 2001, 2003): detection and representation.
13Non-treelike Evolutionary Events
- Not all evolutionary events can be captured by a tree
- Hybridization: two lineages (edges) combine and create a new lineage (edge)
- Lateral gene transfer: genetic material from one lineage (edge) is transferred to another lineage
- Definition: a homolog is a member of a chromosome pair
14Hybridization
- A network (a) and its two induced trees (b,c)
- Different kinds of hybridization
- Diploid: same number of homologs as its 2 parents
- Polyploid: double the number of homologs of its 2 parents
- Auto-polyploid: double the number of homologs of its parent lineage
15Network Representation
- Rooted directed acyclic graphs (DAGs)
- Three kinds of nodes
- Root node: indegree 0 and outdegree 2
- Tree node
- indegree 1 and outdegree 2 (internal node)
- indegree 1 and outdegree 0 (leaf node)
- Network node
- indegree 2 and outdegree 1 (diploid or polyploid)
- indegree 1 and outdegree 1 (auto-polyploid)
16Network Representation Cont.
(Figure legend: --- network edge)
- Two kinds of directed edges
- e(u,v) is a tree edge iff v is a tree node
- e(u,v) is a network edge iff v is a network node
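The node and edge definitions on the last two slides can be sketched as a classifier over a list of directed edges. The edge-list representation and the function names are assumptions for illustration, not the thesis implementation.

```python
from collections import Counter

def classify_nodes(edges):
    """Label each node 'root', 'tree', or 'network' from its in/out degree."""
    indeg = Counter(v for _, v in edges)
    outdeg = Counter(u for u, _ in edges)
    kinds = {}
    for node in set(indeg) | set(outdeg):
        degrees = (indeg[node], outdeg[node])
        if degrees == (0, 2):
            kinds[node] = "root"
        elif degrees in ((1, 2), (1, 0)):      # internal node or leaf
            kinds[node] = "tree"
        elif degrees in ((2, 1), (1, 1)):      # (allo)hybrid or auto-polyploid
            kinds[node] = "network"
        else:
            raise ValueError(f"node {node!r} has unsupported degrees {degrees}")
    return kinds

def classify_edges(edges):
    """An edge (u, v) is a network edge iff its head v is a network node."""
    kinds = classify_nodes(edges)
    return {(u, v): "network edge" if kinds[v] == "network" else "tree edge"
            for u, v in edges}

# Example: one hybrid node h with parents a and b.
example = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
           ("b", "h"), ("b", "y"), ("h", "z")]
print(classify_edges(example)[("a", "h")])  # network edge
```

Because the edge type depends only on the head node, both edges entering `h` are network edges and every other edge is a tree edge.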
17Evolutionary Events
- Extinction: a new node u is created at the end of a lineage; no new lineage is started from u
- Speciation: a new node u is created at the end of a lineage, and two new lineages are started from u
- Hybridization: a new node u is created
- when two lineages combine (diploid or polyploid)
- when one lineage creates u and the new lineage from u has double the number of homologs (auto-polyploid)
18Network Generation I
- Start with one node (the root) and two sequences (the homologs); set up an initial speciation that starts two lineages
- Consider, at any time t, all existing lineages; with probability p an evolutionary event takes place
- Hybridization: find a coexisting lineage that also seeks to hybridize (the evolutionary distance between the two lineages cannot be too large)
- Evolve sequences at each new node created
19Network Generation II
- Generate a birth-death tree; let the time at the root be 0 and the time at the last generated leaf be t_l, and let t_i be in the range [0, t_l]
- Find all nodes whose time does not exceed t_i but whose children's times do
- Calculate the evolutionary distance between the found nodes
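The last two steps can be sketched as follows. Two assumptions are made here that the slides do not confirm: "whose children do" is read as every child lying beyond t_i, and the evolutionary distance between two nodes is taken to be the elapsed time along the tree path joining them. The dictionary layout and function names are hypothetical.

```python
def nodes_spanning(children, time, t_i):
    """Nodes created no later than t_i whose children are all created after t_i."""
    found = []
    for node, kids in children.items():
        if kids and time[node] <= t_i and all(time[k] > t_i for k in kids):
            found.append(node)
    return sorted(found)

def tree_distance(parent, time, u, v):
    """Elapsed time along the path from u up to the common ancestor and down to v."""
    ancestors = set()
    x = u
    while x is not None:
        ancestors.add(x)
        x = parent.get(x)      # the root has no parent entry
    x = v
    while x not in ancestors:
        x = parent[x]
    return (time[u] - time[x]) + (time[v] - time[x])

# Example tree ((A,B),(C,D)) with node creation times.
children = {"r": ["a", "b"], "a": ["A", "B"], "b": ["C", "D"]}
parent = {c: p for p, kids in children.items() for c in kids}
time = {"r": 0.0, "a": 1.0, "b": 2.0, "A": 3.0, "B": 3.5, "C": 4.0, "D": 4.0}

candidates = nodes_spanning(children, time, 2.5)
print(candidates)                                 # ['a', 'b']
print(tree_distance(parent, time, "a", "b"))      # 3.0
```

A pair of candidates would then be eligible for a hybridization only if this distance stays under some threshold, in line with the constraint that the two lineages cannot be too far apart.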
20Network Sequence Evolution
- In our simpler model we evolve sequences after the phylogeny is created.
- Seq-gen2 evolves the two homologs simultaneously. Similar to Seq-gen, but at each network node, the node randomly inherits one of the parents' two homologs.
- Output the evolved pairs of sequences for each leaf in the network.
21Outline
- Background
- Network generation
- Distance measure for networks
- Network reconstruction
- Conclusion
22Network Distance Measure
(Figure: a rooted network in which the edge e runs from node u to node v)
- Each edge e in a rooted network induces a tripartition on the leaves
- e = (u,v): X(e) = {A,B}, Y(e) = {C,D}, Z(e) = {E,F}
23Robinson-Foulds Extended
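- $FP(N_1,N_2) = |\{ e_2 \in E(N_2) : \neg\exists\, e_1 \in E(N_1),\ e_1 \equiv e_2 \}| \,/\, |E(N_2)|$
- $FN(N_1,N_2) = |\{ e_1 \in E(N_1) : \neg\exists\, e_2 \in E(N_2),\ e_1 \equiv e_2 \}| \,/\, |E(N_1)|$
- $RF(N_1,N_2) = (FN(N_1,N_2) + FP(N_1,N_2)) / 2$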
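A minimal sketch of the extended measure, assuming each network is supplied as a list with one tripartition (X, Y, Z) per edge and that two edges are equivalent exactly when their tripartitions are equal. The equivalence relation used in the thesis may be more refined; the names `tri` and `extended_rf` are hypothetical.

```python
def tri(x, y, z):
    """Hashable tripartition (X(e), Y(e), Z(e)) of the leaf set."""
    return (frozenset(x), frozenset(y), frozenset(z))

def extended_rf(trips1, trips2):
    """Return (FP, FN, RF) for networks N1 and N2 given per-edge tripartitions."""
    set1, set2 = set(trips1), set(trips2)
    # FP: fraction of N2's edges with no equivalent edge in N1.
    fp = sum(1 for t in trips2 if t not in set1) / len(trips2)
    # FN: fraction of N1's edges with no equivalent edge in N2.
    fn = sum(1 for t in trips1 if t not in set2) / len(trips1)
    return fp, fn, (fn + fp) / 2

# Toy input: N1 has four edges, N2 has three, and two tripartitions are shared.
t1 = tri("AB", "CD", "EF")
t2 = tri("A", "", "BCDEF")
t3 = tri("CD", "", "ABEF")
t4 = tri("E", "", "ABCDF")
t5 = tri("EF", "", "ABCD")
fp, fn, rf = extended_rf([t1, t2, t3, t4], [t1, t2, t5])
```

Unlike the tree case, each rate is normalized by the edge count of its own network, since two networks on the same taxa need not have the same number of edges.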
24Convergence
- Convergence might cause the metric to return 0 in cases where N1 and N2 do NOT have the same topology
(Figure: X and Y make up a convergent set in both networks)
25Class I and Class II Networks
- Class I: network does not contain a convergent set
- Class II: network contains a convergent set
- Low probability of a Class II network
26Measure is a Provable Metric
- The pair (N,m), where N is the space of Class I phylogenetic networks and m(.,.) is our error measure, is a metric space.
- For more details and proofs see the thesis or: Linder et al., An Error Metric for Phylogenetic Networks, Technical Report, Department of Computer Science, 2003.
27Evaluating the Metric: Experimental Setup
- Number of network nodes: 0, 1, 2, 3, 4, and 5
- Number of taxa: 10, 20, 40, and 80
- Sequence lengths: 25, 50, 100, 250 and 500
- Scaling: 0.1, 0.5, 1 and 2
- Methods: SplitsTree, neighbor-joining, and greedy maximum parsimony (using PAUP)
28Experimental Results
80 taxa, edge scaling 0.5, sequence length 500
- SplitsTree introduces too many network nodes
29Experimental Results Cont.
40 taxa, edge scaling 0.5, sequence length 1000
- Error rate grows as a function of the number of
network nodes present in model network
30Experimental Results Cont.
80 taxa, edge scaling 0.5, 1 network node in model network
- MP and NJ: slow decrease in error as sequence length increases
- Metric performance: performs as expected; it neither under- nor overemphasizes the importance of network nodes
31Outline
- Background
- Network generation
- Distance measure for networks
- Network reconstruction
- Conclusion
32Network Reconstruction
- Inferred phylogeny might look significantly different from model phylogeny due to:
- Extinction
- Missing data in taxon sampling
- Lineage has undergone two or more simultaneous hybridization events
33Related Work
- Most parsimonious network (Fitch, 1997)
- Parsimony for reconstructing evolutionary
histories when recombination is present (Hein,
1990)
34Parsimony on Phylogenetic Networks
- The evolutionary history of a site i in a set S of sequences that evolved on a network N is captured by one of the trees induced by N.
- Parsimony score of a network N leaf-labeled by a set of taxa S is
$NCost(N,S) = \sum_i Cost(N,i)$, where $Cost(N,i) = \min_{T \in \mathcal{T}(N)} TCost(T,i)$ and $\mathcal{T}(N)$ is the set of trees induced by N
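The score above can be sketched by taking, for every site, the cheapest cost over the induced trees. This sketch assumes the induced trees are supplied as rooted binary nested tuples of leaf names and uses Fitch's algorithm as TCost; the slides do not say how TCost is computed, so that choice and the function names are assumptions.

```python
def fitch_cost(tree, site_states):
    """Fitch parsimony cost of one site on a rooted binary tree (nested tuples)."""
    def visit(node):
        if isinstance(node, str):                 # leaf: its observed state
            return {site_states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # Disjoint state sets force one extra change at this node.
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]

def ncost(induced_trees, seqs):
    """Sum over sites of the minimum Fitch cost over all induced trees."""
    length = len(next(iter(seqs.values())))
    total = 0
    for i in range(length):
        site = {taxon: seq[i] for taxon, seq in seqs.items()}
        total += min(fitch_cost(tree, site) for tree in induced_trees)
    return total

t_ab = (("A", "B"), ("C", "D"))
t_ac = (("A", "C"), ("B", "D"))
seqs = {"A": "AA", "B": "AC", "C": "CA", "D": "CC"}
print(ncost([t_ab, t_ac], seqs))  # 2: each site picks the tree that fits it best
print(ncost([t_ab], seqs))        # 3
```

The example shows why the score can only go down as trees are added: each site is free to choose a different tree, so a network never scores worse than any single tree it contains.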
35Fixed-tree Maximum Parsimony on Phylogenetic
Networks (FTMPPN)
- Input: a tree T, leaf-labeled by a set of aligned sequences S, and a bound B
- Output: a phylogenetic network N (containing T) with at most B network nodes, with leaves labeled by S and internal nodes labeled by additional sequences, that minimizes NCost(N,S)
36FTMPPN Investigated
- Determine whether the best-scoring phylogenetic network shows a bias towards fewer, the same, or more network nodes than are present in the model network
- Assess whether the inferred topology is accurate compared to the model topology
37FTMPPN Investigated Cont.
- Take a network N, leaf-labeled by a set of sequences S
- Find all the trees that are contained within N
- Introduce at most B network nodes to N to minimize NCost
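The step of finding the trees contained within N can be sketched by keeping exactly one incoming edge for every node of indegree 2. The edge-list representation and the name `induced_trees` are assumptions; the sketch leaves the resulting indegree-1, outdegree-1 nodes in place rather than suppressing them.

```python
from itertools import product

def induced_trees(edges):
    """Enumerate the edge sets obtained by keeping one parent per hybrid node."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    hybrids = sorted(v for v, ps in parents.items() if len(ps) == 2)
    trees = []
    # One choice of parent per hybrid node gives one induced tree.
    for choice in product(*(parents[h] for h in hybrids)):
        kept = dict(zip(hybrids, choice))
        trees.append(sorted((u, v) for u, v in edges
                            if v not in kept or kept[v] == u))
    return trees

network = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
           ("b", "h"), ("b", "y"), ("h", "z")]
for tree in induced_trees(network):
    print(tree)
```

A network with k such nodes yields at most 2^k edge sets this way, which is why the enumeration is only practical for a small number of network nodes.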
38Experimental Setup
- Used a subset of the generated data from previous experiments
- Used model networks directly
- Added at most 5 network nodes to each tree of the network
39Number of Network Nodes (inferred vs. model)
80 taxa, edge scaling 0.5, sequence length 500
- The heuristic infers too many network nodes
40Topological Accuracy
40 taxa, edge scaling 0.5, sequence length 1000
- Poor: the inferred network nodes do not correspond to the ones in the model network
41Conclusions
- Computational tools for phylogenetic network generation and sequence evolution on the networks
- Metric for the evaluation of the performance of phylogenetic network reconstruction methods
- Reconstructed network topologies might look different from the model network
42Conclusions Cont.
- Reconstruction using maximum parsimony, where each site takes its minimum cost over all induced trees, appears ill-suited. A better way might be to take the average cost of each site over all trees?
43Acknowledgements
- Bernard Moret
- UT Austin: Randy Linder, Luay Nakhleh, Anneke Padolina, Jerry Sun, Ruth Timme, and Tandy Warnow