Title: Working with Trees in the Phyloinformatic Age
1. Working with Trees in the Phyloinformatic Age
- William H. Piel
- Yale Peabody Museum
- Hilmar Lapp
- NESCent, Duke University
2. Dealing with the Growth of Phyloinformatics
- Trees Too Many
- Search, organize, triage, summarize, synthesize
- Review existing methods
- Describe queries for BioSQL phylo extension
- Making generic queries
- Trees Too Big
- Visualizing and manipulating large trees
- Demo PhyloWidget
3. Searching Stored Trees
- Path Enumerations
- Nested Sets
- Adjacency Lists
- Transitive Closure
4. Dewey system
5. Find clade for Z (ltCSDs)
Label  Path
Root   0
NULL   0.1
A      0.1.1
B      0.1.2
NULL   0.2
NULL   0.2.1
C      0.2.1.1
D      0.2.1.2
E      0.2.2
Find the common path prefix, reading from the left (C and D share 0.2.1), then select every node under it:
SELECT * FROM nodes WHERE (path LIKE '0.2.1%');
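The prefix lookup can be sketched outside the database as well. The snippet below is a minimal illustration that uses only the paths from the table; the function names `common_prefix` and `clade_members` are hypothetical and not part of any tool named in these slides.

```python
# Dewey-style paths copied from the table above (leaf labels only).
PATHS = {
    "A": "0.1.1",
    "B": "0.1.2",
    "C": "0.2.1.1",
    "D": "0.2.1.2",
    "E": "0.2.2",
}


def common_prefix(paths):
    """Longest shared run of path components, reading from the left."""
    split = [p.split(".") for p in paths]
    prefix = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)


def clade_members(prefix, paths=PATHS):
    """Labels whose path equals the prefix or sits below it."""
    return sorted(
        label
        for label, path in paths.items()
        if path == prefix or path.startswith(prefix + ".")
    )


mrca_cd = common_prefix([PATHS["C"], PATHS["D"]])
clade_cd = clade_members(mrca_cd)
```

Comparing whole components (and matching on `prefix + "."`) avoids a pitfall of a bare `LIKE '0.2.1%'`, which would also match a sibling path such as `0.2.10` in a tree with ten or more children at one node.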
6.
- ATreeGrep
- Uses special suffix indexing to optimize speed
- Shasha, D., J. T. L. Wang, H. Shan and K. Zhang. 2002. ATreeGrep: Approximate Searching in Unordered Trees. Proceedings of the 14th SSDBM, Edinburgh, Scotland, pp. 89-98.
- Crimson
- Uses nested subtrees to avoid long strings
- Zheng, Y., S. Fisher, S. Cohen, S. Guo, J. Kim, and S. B. Davidson. 2006. Crimson: A Data Management System to Support Evaluating Phylogenetic Tree Reconstruction Algorithms. 32nd International Conference on Very Large Data Bases, ACM, pp. 1231-1234.
7. Searching Stored Trees
- Path Enumerations
- Nested Sets
- Adjacency Lists
- Metrics
- Transitive Closure
8. Depth-first traversal scoring each node with a left and right ID
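A minimal sketch of that traversal, assuming the example tree is ((A,B),((C,D),E)) and that node ids follow preorder. The nested-tuple representation and the name `number_nodes` are illustrative choices, not taken from the slides.

```python
# Tree ((A,B),((C,D),E)) as nested tuples; leaves are strings.
TREE = (("A", "B"), (("C", "D"), "E"))


def number_nodes(tree):
    """Return (node_id, label, left_id, right_id) rows in preorder."""
    rows = []
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1              # left ID is assigned on the way down
        left = counter
        index = len(rows)
        rows.append(None)         # reserve this node's preorder slot
        if not isinstance(node, str):
            for child in node:
                visit(child)
        counter += 1              # right ID is assigned on the way back up
        label = node if isinstance(node, str) else None
        rows[index] = (index + 1, label, left, counter)

    visit(tree)
    return rows


ROWS = number_nodes(TREE)
```

For this tree the root receives (1, 18) and the fifth node in preorder receives (8, 17), which agrees with the Left/Right table on the "Minimum Spanning Clade of Node 5" slide.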
9. Minimum Spanning Clade of Node 5
Label  Left  Right
Root   1     18
NULL   2     7
A      3     4
B      5     6
NULL   8     17
NULL   9     14
C      10    11
D      12    13
E      15    16
SELECT * FROM nodes INNER JOIN nodes AS include
  ON (nodes.left_id BETWEEN include.left_id AND include.right_id)
WHERE include.node_id = 5;
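To try the nested-set query end to end, the sketch below loads the table into an in-memory SQLite database. It assumes node ids follow preorder (so node 5 is the row with Left 8, Right 17) and a column named `node_label`; the aliases `member` and `clade_root` and the function `clade_labels` are illustrative.

```python
import sqlite3

# (node_id, node_label, left_id, right_id), following the table above.
# Assumption: node_id values follow preorder.
NESTED_ROWS = [
    (1, None, 1, 18), (2, None, 2, 7), (3, "A", 3, 4), (4, "B", 5, 6),
    (5, None, 8, 17), (6, None, 9, 14), (7, "C", 10, 11), (8, "D", 12, 13),
    (9, "E", 15, 16),
]

CLADE_SQL = """
SELECT member.node_label
FROM nodes AS member
INNER JOIN nodes AS clade_root
  ON (member.left_id BETWEEN clade_root.left_id AND clade_root.right_id)
WHERE clade_root.node_id = ?
  AND member.node_label IS NOT NULL
ORDER BY member.left_id
"""


def clade_labels(node_id):
    """Leaf labels inside the clade rooted at node_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT, "
        "left_id INTEGER, right_id INTEGER)"
    )
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", NESTED_ROWS)
    labels = [row[0] for row in conn.execute(CLADE_SQL, (node_id,))]
    conn.close()
    return labels
```

A single range comparison returns the whole subtree, which is the main attraction of nested sets; the cost is that inserting a node means renumbering part of the tree.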
10.
- PhyloFinder
- Duhong Chen et al.
- http://pilin.cs.iastate.edu/phylofinder/
- Mackey, A. 2002. Relational Modeling of Biological Data: Trees and Graphs. Bioinformatics Technology Conference. http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
11. Searching Stored Trees
- Path Enumerations
- Nested Sets
- Adjacency Lists
- Metrics
- Transitive Closure
13. SQL Query to find parent node of node D
SELECT * FROM nodes AS parent
INNER JOIN nodes AS child ON (child.parent_id = parent.node_id)
WHERE child.node_label = 'D';
...but going any further up or down the tree requires an external procedure to navigate it, one level per query.
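The "external procedure" can be as simple as a loop that repeats the parent lookup until it reaches the root. This is a sketch under assumptions: the tree is ((A,B),((C,D),E)), the node ids are illustrative, and `open_db` and `ancestors` are hypothetical helper names.

```python
import sqlite3

# (node_id, parent_id, node_label) for ((A,B),((C,D),E)); ids are illustrative.
ADJACENCY_ROWS = [
    (1, None, None), (2, 1, None), (3, 2, "A"), (4, 2, "B"),
    (5, 1, None), (6, 5, None), (7, 6, "C"), (8, 6, "D"), (9, 5, "E"),
]

PARENT_OF_LABEL = """
SELECT parent.node_id
FROM nodes AS parent
INNER JOIN nodes AS child ON (child.parent_id = parent.node_id)
WHERE child.node_label = ?
"""

PARENT_OF_NODE = """
SELECT parent_id FROM nodes WHERE node_id = ? AND parent_id IS NOT NULL
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, "
        "parent_id INTEGER, node_label TEXT)"
    )
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", ADJACENCY_ROWS)
    return conn


def ancestors(conn, label):
    """Walk from a labeled node to the root: one query per level."""
    path = []
    row = conn.execute(PARENT_OF_LABEL, (label,)).fetchone()
    while row is not None:
        path.append(row[0])
        row = conn.execute(PARENT_OF_NODE, (row[0],)).fetchone()
    return path
```

The number of round trips grows with the depth of the node, which is why adjacency lists alone scale poorly for clade queries on deep trees.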
14. Searching Stored Trees
- Path Enumerations
- Nested Sets
- Adjacency Lists
- Metrics
- Transitive Closure
15. Searching trees by distance metrics
- USim distance
- Wang, J. T. L., H. Shan, D. Shasha and W. H. Piel. 2005. Fast Structural Search in Phylogenetic Databases. Evolutionary Bioinformatics Online 1: 37-46.

   A  B  C  D
A  0  1  2  2
B  1  0  2  2
C  2  2  0  1
D  2  2  1  0

   A  B  C  D
A  0  1  2  3
B  1  0  2  3
C  1  1  0  2
D  1  1  1  0
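The slide does not say which trees the two matrices describe. One reading that reproduces both of them exactly is to let each entry count the edges from the row taxon up to its most recent common ancestor with the column taxon, for the trees ((A,B),(C,D)) and (((A,B),C),D). The sketch below implements that reading; it is an interpretation, not the published USim implementation, and `up_matrix` is a hypothetical name.

```python
def up_matrix(tree, taxa):
    """Entry [i][j]: edges from taxa[i] up to its common ancestor with taxa[j]."""
    chains = {}  # leaf label -> ancestor positions, nearest ancestor first

    def visit(node, position, above):
        if isinstance(node, str):
            chains[node] = above
        else:
            for i, child in enumerate(node):
                visit(child, position + (i,), [position] + above)

    visit(tree, (), [])

    def up(x, y):
        if x == y:
            return 0
        shared = set(chains[y])
        for steps, ancestor in enumerate(chains[x], start=1):
            if ancestor in shared:
                return steps
        raise ValueError("taxa are not in the same tree")

    return [[up(x, y) for y in taxa] for x in taxa]


TAXA = ["A", "B", "C", "D"]
BALANCED = up_matrix((("A", "B"), ("C", "D")), TAXA)
LADDER = up_matrix(((("A", "B"), "C"), "D"), TAXA)
```

Under this reading the first matrix is symmetric because the tree is balanced, while the ladder-shaped tree gives an asymmetric matrix: D is always one edge below the root, but A is three edges below it.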
16. Searching Stored Trees
- Path Enumerations
- Nested Sets
- Adjacency Lists
- Transitive Closure
17. Transitive Closure
- Finding paths between vertices on a graph
- DB2 and Oracle have special functions
- SELECT * FROM Edge START WITH (child_id = A AND tree_id = T) CONNECT BY (PRIOR parent_id = child_id) AND (PRIOR tree_id = tree_id)
- Nakhleh, L., D. Miranker, F. Barbancon, W. H. Piel, and M. Donoghue. 2003. Requirements of phylogenetic databases. Third IEEE Symposium on Bioinformatics and Bioengineering, pp. 141-148.
- Paths can be precomputed and stored, as in BioSQL
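Databases without CONNECT BY can often express the same walk with a standard recursive common table expression. The sketch below swaps in that technique using SQLite's `WITH RECURSIVE`; it is not the Oracle syntax from the slide. The lowercase `edge` table, its columns, the node ids, and the CTE name `ancestry` are assumptions made for illustration.

```python
import sqlite3

# (child_id, parent_id, tree_id) for ((A,B),((C,D),E)); ids are illustrative,
# with node 1 as the root and node 7 standing for leaf C.
EDGES = [
    (2, 1, 1), (3, 2, 1), (4, 2, 1), (5, 1, 1),
    (6, 5, 1), (7, 6, 1), (8, 6, 1), (9, 5, 1),
]

ANCESTRY_SQL = """
WITH RECURSIVE ancestry(node_id, distance) AS (
    SELECT parent_id, 1
    FROM edge
    WHERE child_id = :start AND tree_id = :tree
    UNION ALL
    SELECT e.parent_id, ancestry.distance + 1
    FROM edge AS e
    JOIN ancestry ON e.child_id = ancestry.node_id
    WHERE e.tree_id = :tree
)
SELECT node_id, distance FROM ancestry ORDER BY distance
"""


def ancestors_with_distance(start, tree=1):
    """All (ancestor_id, distance) pairs above `start`, nearest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (child_id INTEGER, parent_id INTEGER, tree_id INTEGER)"
    )
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", EDGES)
    rows = conn.execute(ANCESTRY_SQL, {"start": start, "tree": tree}).fetchall()
    conn.close()
    return rows
```

The recursion stops on its own at the root, because no edge row has the root as its child.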
18. Dealing with the Growth of Phyloinformatics
- Trees Too Many
- Search, organize, triage, summarize, synthesize
- Review existing methods
- Describe queries for BioSQL phylo extension
- Making generic queries
- Trees Too Big
- Visualizing and manipulating large trees
- Demo PhyloWidget
19. BioSQL
- http://www.biosql.org/
- Schema for persistent storage of sequences and features, tightly integrated with BioPerl (and BioPython, BioJava, and BioRuby)
- phylodb extension designed at NESCent Hackathon
- Perl command-line interface by Jamie Estill, GSoC
20. Index of all paths from ancestors to descendants
CREATE TABLE node_path (
  child_node_id integer,
  parent_node_id integer,
  distance integer);
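One way to fill such a table is to walk from every node to the root and emit a row per ancestor. The sketch below assumes the tree ((A,B),((C,D),E)) with illustrative preorder node ids, and it leaves out zero-length self paths; whether a given schema stores those is a separate design decision. `build_node_path` is a hypothetical helper name.

```python
# child -> parent map for ((A,B),((C,D),E)); node 1 is the root.
PARENT = {2: 1, 3: 2, 4: 2, 5: 1, 6: 5, 7: 6, 8: 6, 9: 5}


def build_node_path(parent):
    """Return (child_node_id, parent_node_id, distance) for every ancestor pair."""
    rows = []
    for child in sorted(parent):
        ancestor, distance = parent.get(child), 1
        while ancestor is not None:
            rows.append((child, ancestor, distance))
            ancestor, distance = parent.get(ancestor), distance + 1
    return rows


NODE_PATH = build_node_path(PARENT)
```

The table grows with the sum of all node depths (16 rows for this 9-node tree), so the index trades storage for queries that need no recursion.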
21. Find all paths where A and B share a common parent_node_id
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id
  AND nA.node_label = 'A'
  AND pB.child_node_id = nB.node_id
  AND nB.node_label = 'B';
22. ...of those paths, select the one that has the shortest path (the most recent common ancestor)
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id
  AND nA.node_label = 'A'
  AND pB.child_node_id = nB.node_id
  AND nB.node_label = 'B'
ORDER BY pA.distance
LIMIT 1;
23. ...of those paths, select the one that has the longest path (the root)
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id
  AND nA.node_label = 'A'
  AND pB.child_node_id = nB.node_id
  AND nB.node_label = 'B'
ORDER BY pA.distance DESC
LIMIT 1;
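The shortest-path and longest-path variants can be checked against a toy database. The sketch below uses an in-memory SQLite database holding the tree ((A,B),((C,D),E)); the node ids, the reduced `nodes` table (id and label only), and the helpers `open_db` and `common_ancestor` are assumptions for illustration.

```python
import sqlite3

# Illustrative data: child -> parent map and leaf labels; node 1 is the root.
PARENT = {2: 1, 3: 2, 4: 2, 5: 1, 6: 5, 7: 6, 8: 6, 9: 5}
LABELS = {3: "A", 4: "B", 7: "C", 8: "D", 9: "E"}

COMMON_ANCESTOR_SQL = """
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id AND nA.node_label = ?
  AND pB.child_node_id = nB.node_id AND nB.node_label = ?
ORDER BY pA.distance {direction}
LIMIT 1
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT)")
    conn.execute(
        "CREATE TABLE node_path (child_node_id INTEGER, "
        "parent_node_id INTEGER, distance INTEGER)"
    )
    for node_id in range(1, 10):
        conn.execute("INSERT INTO nodes VALUES (?, ?)", (node_id, LABELS.get(node_id)))
    # Precompute every ancestor path, as in the node_path index.
    for child in PARENT:
        ancestor, distance = PARENT.get(child), 1
        while ancestor is not None:
            conn.execute(
                "INSERT INTO node_path VALUES (?, ?, ?)", (child, ancestor, distance)
            )
            ancestor, distance = PARENT.get(ancestor), distance + 1
    return conn


def common_ancestor(conn, a, b, nearest=True):
    """Nearest (ASC) or most distant (DESC) ancestor shared by labels a and b."""
    sql = COMMON_ANCESTOR_SQL.format(direction="ASC" if nearest else "DESC")
    row = conn.execute(sql, (a, b)).fetchone()
    return row[0] if row else None
```

With these ids, the nearest shared ancestor of A and B is node 2 and the most distant one is the root, node 1.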
24. Find the maximum spanning clade (i.e. the subtree) for each tree that includes A and B but not C
SELECT e.parent_id AS parent, e.child_id AS child, ch.node_label, pt.tree_id
FROM node_path p, edges e, nodes pt, nodes ch
WHERE e.child_id = p.child_node_id
  AND pt.node_id = e.parent_id
  AND ch.node_id = e.child_id
  AND p.parent_node_id IN (
      SELECT pA.parent_node_id
      FROM node_path pA, node_path pB, nodes nA, nodes nB
      WHERE pA.parent_node_id = pB.parent_node_id
        AND pA.child_node_id = nA.node_id
        AND nA.node_label = 'A'
        AND pB.child_node_id = nB.node_id
        AND nB.node_label = 'B')
  AND NOT EXISTS (
      SELECT 1
      FROM node_path np, nodes n
      WHERE np.child_node_id = n.node_id
        AND n.node_label = 'C'
        AND np.parent_node_id = p.parent_node_id);
25. Find trees that contain a clade that includes A and B but not C
SELECT DISTINCT t.tree_id, t.name
FROM node_path p, nodes ch, trees t
WHERE ch.node_id = p.child_node_id
  AND ch.tree_id = t.tree_id
  AND p.parent_node_id IN (
      SELECT pA.parent_node_id
      FROM node_path pA, node_path pB, nodes nA, nodes nB
      WHERE pA.parent_node_id = pB.parent_node_id
        AND pA.child_node_id = nA.node_id
        AND nA.node_label = 'A'
        AND pB.child_node_id = nB.node_id
        AND nB.node_label = 'B')
  AND NOT EXISTS (
      SELECT 1
      FROM node_path np, nodes n
      WHERE np.child_node_id = n.node_id
        AND n.node_label = 'C'
        AND np.parent_node_id = p.parent_node_id);
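To see the tree-level query discriminate between topologies, the sketch below stores two toy trees, ((A,B),C) and ((A,C),B), and runs the same query shape with the labels passed as parameters. Tree names, node ids, and the helpers `open_db` and `trees_with_clade` are invented for the example; an `ORDER BY` is added only to make the result order deterministic.

```python
import sqlite3

# Two toy trees: tree 1 is ((A,B),C), tree 2 is ((A,C),B). All ids are illustrative.
TREES = [(1, "tree_AB_C"), (2, "tree_AC_B")]
NODES = [  # (node_id, tree_id, node_label)
    (1, 1, None), (2, 1, None), (3, 1, "A"), (4, 1, "B"), (5, 1, "C"),
    (6, 2, None), (7, 2, None), (8, 2, "A"), (9, 2, "C"), (10, 2, "B"),
]
PARENT = {2: 1, 3: 2, 4: 2, 5: 1, 7: 6, 8: 7, 9: 7, 10: 6}

CLADE_SQL = """
SELECT DISTINCT t.tree_id, t.name
FROM node_path p, nodes ch, trees t
WHERE ch.node_id = p.child_node_id
  AND ch.tree_id = t.tree_id
  AND p.parent_node_id IN (
      SELECT pA.parent_node_id
      FROM node_path pA, node_path pB, nodes nA, nodes nB
      WHERE pA.parent_node_id = pB.parent_node_id
        AND pA.child_node_id = nA.node_id AND nA.node_label = ?
        AND pB.child_node_id = nB.node_id AND nB.node_label = ?)
  AND NOT EXISTS (
      SELECT 1 FROM node_path np, nodes n
      WHERE np.child_node_id = n.node_id
        AND n.node_label = ?
        AND np.parent_node_id = p.parent_node_id)
ORDER BY t.tree_id
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trees (tree_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, "
        "tree_id INTEGER, node_label TEXT)"
    )
    conn.execute(
        "CREATE TABLE node_path (child_node_id INTEGER, "
        "parent_node_id INTEGER, distance INTEGER)"
    )
    conn.executemany("INSERT INTO trees VALUES (?, ?)", TREES)
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", NODES)
    for child in PARENT:
        ancestor, distance = PARENT.get(child), 1
        while ancestor is not None:
            conn.execute(
                "INSERT INTO node_path VALUES (?, ?, ?)", (child, ancestor, distance)
            )
            ancestor, distance = PARENT.get(ancestor), distance + 1
    return conn


def trees_with_clade(conn, include_a, include_b, exclude):
    """Trees holding a clade with include_a and include_b but not exclude."""
    return conn.execute(CLADE_SQL, (include_a, include_b, exclude)).fetchall()
```

Only the first tree has a node above A and B that does not also sit above C, so it is the only hit for "A and B but not C"; swapping the roles of B and C returns the second tree instead.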
26. Find trees that contain a clade that includes (A, B, C) but not D or E
SELECT qry.tree_id, MIN(qry.name) AS "tree_name"
FROM (
    SELECT DISTINCT ON (n.node_id) n.node_id, t.tree_id, t.name
    FROM trees t, nodes n,
         (SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id
          FROM nodes inN, node_path inP
          WHERE inN.node_label IN ('A','B','C')
            AND inP.child_node_id = inN.node_id
          GROUP BY inN.tree_id, inP.parent_node_id
          HAVING COUNT(inP.child_node_id) = 3
          ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
    WHERE n.node_id IN (lca.parent_node_id)
      AND t.tree_id = n.tree_id
      AND NOT EXISTS (
          SELECT 1
          FROM nodes outN, node_path outP
          WHERE outN.node_label IN ('D','E')
            AND outP.child_node_id = outN.node_id
            AND outP.parent_node_id = lca.parent_node_id)
      AND EXISTS (
          SELECT c.tree_id
          FROM trees c, nodes q
          WHERE q.node_label IN ('D','E')
            AND q.tree_id = c.tree_id
            AND c.tree_id = t.tree_id
          GROUP BY c.tree_id
          HAVING COUNT(c.tree_id) = 2)
) AS qry
GROUP BY (qry.tree_id)
HAVING COUNT(qry.node_id) = 1;
27. Here's a faster, cleaner version
SELECT t.tree_id, t.name
FROM trees t
INNER JOIN (
    SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id, inN.tree_id
    FROM nodes inN, node_path inP
    WHERE inN.node_label IN ('A','B','C')
      AND inP.child_node_id = inN.node_id
    GROUP BY inN.tree_id, inP.parent_node_id
    HAVING COUNT(inP.child_node_id) = 3
    ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
  USING (tree_id)
WHERE NOT EXISTS (
    SELECT 1
    FROM nodes outN, node_path outP
    WHERE outN.node_label IN ('D','E')
      AND outP.child_node_id = outN.node_id
      AND outP.parent_node_id = lca.parent_node_id)
  AND EXISTS (
    SELECT c.tree_id
    FROM trees c, nodes q
    WHERE q.node_label IN ('D','E')
      AND q.tree_id = c.tree_id
      AND c.tree_id = t.tree_id
    GROUP BY c.tree_id
    HAVING COUNT(c.tree_id) = 2);
28. Matching a whole tree means querying for all clades
- (A, B) but not C, D, E
- (C, D) but not A, B, E
- (C, D, E) but not A, B
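The list of clade queries for a whole tree can be derived mechanically: every internal node below the root contributes one "include these, exclude the rest" pair. A minimal sketch, assuming the tree ((A,B),((C,D),E)) written as nested tuples; `leaves` and `clade_queries` are hypothetical names.

```python
# Tree ((A,B),((C,D),E)) as nested tuples; leaves are strings.
TREE = (("A", "B"), (("C", "D"), "E"))


def leaves(node):
    """All leaf labels under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def clade_queries(tree):
    """(ingroup, outgroup) label lists for each internal node below the root."""
    all_taxa = set(leaves(tree))
    queries = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            ingroup = sorted(leaves(node))
            outgroup = sorted(all_taxa - set(ingroup))
            queries.append((ingroup, outgroup))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return queries


QUERIES = clade_queries(TREE)
```

The root is skipped because a clade that contains every taxon excludes nothing, so it cannot distinguish one topology from another.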
29. Dealing with the Growth of Phyloinformatics
- Trees Too Many
- Search, organize, triage, summarize, synthesize
- Review existing methods
- Describe queries for BioSQL phylo extension
- Making generic queries
- Trees Too Big
- Visualizing and manipulating large trees
- Demo PhyloWidget
30. Mining trees for interesting, general relationship questions
(((Sus_scrofa,Hippopotamus),Balaenoptera),Equus_caballus)
vs
((Sus_scrofa,(Hippopotamus,Balaenoptera)),Equus_caballus)
31. Even with perfectly resolved OTUs, you will still fail to hit relevant trees
32.
Step 1: for each clade in all trees in the database, run a stem query on a classification tree (e.g. NCBI)
Step 2: label each node with an NCBI taxon id (if there is a match)
Step 3: do the same for the query tree
Stem Queries
- Node 2 (>A, B - C, D, E)
- Node 3 (>A - B, C, D, E)
- Node 4 (>B - A, C, D, E)
- Node 5 (>C, D, E - A, B)
- Node 6 (>C, D - A, B, E)
- Node 7 (>C - A, B, D, E)
- Node 8 (>D - A, B, C, E)
- Node 9 (>E - A, B, C, D)
33. Rename nodes according to their deepest stem query
34. Dealing with the Growth of Phyloinformatics
- Trees Too Many
- Search, organize, triage, summarize, synthesize
- Review existing methods
- Describe queries for BioSQL phylo extension
- Making generic queries
- Trees Too Big
- Visualizing and manipulating large trees
- Demo PhyloWidget
35. PhyloWidget
- Greg Jordan
- Google Summer of Code student
- Nick Goldman's group, EBI
- Java Applet
- Uses the Processing graphics library
- Originally a graphical phylogenetic query and display tool for TreeBASE, BioSQL, etc.
- Can be used for
- Manipulating, visualizing large trees
- Building supertrees through pruning and grafting
36. Thanks