Working with Trees in the Phyloinformatic Age

1
Working with Trees in the Phyloinformatic Age
  • William H. Piel
  • Yale Peabody Museum
  • Hilmar Lapp
  • NESCent, Duke University

2
Dealing with the Growth of Phyloinformatics
  • Trees: Too Many
  • Search, organize, triage, summarize, synthesize
  • Review existing methods
  • Describe queries for BioSQL phylo extension
  • Making generic queries
  • Trees: Too Big
  • Visualizing and manipulating large trees
  • Demo: PhyloWidget

3
Searching Stored Tree
  • Path Enumerations
  • Nested Sets
  • Adjacency Lists
  • Transitive Closure

4
Dewey system
5
Find the clade for Z (here, the clade containing C and D)
Label  Path
Root   0
NULL   0.1
A      0.1.1
B      0.1.2
NULL   0.2
NULL   0.2.1
C      0.2.1.1
D      0.2.1.2
E      0.2.2
Find the common pattern, matching from the left:
SELECT * FROM nodes WHERE (path LIKE '0.2.1%')
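The prefix search above can be tried out with a minimal sketch like the following. It assumes an in-memory SQLite database and a hypothetical two-column nodes table (label, path); neither comes from the slides. The pattern is tightened slightly so that a prefix such as 0.2.1 cannot also match a sibling path such as 0.2.10.

```python
import sqlite3

# Hypothetical table layout: one row per node with its Dewey-style path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (label TEXT, path TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [
        ("Root", "0"), (None, "0.1"), ("A", "0.1.1"), ("B", "0.1.2"),
        (None, "0.2"), (None, "0.2.1"), ("C", "0.2.1.1"),
        ("D", "0.2.1.2"), ("E", "0.2.2"),
    ],
)


def clade(prefix):
    """Return (label, path) rows for the node at `prefix` and its descendants."""
    # Match the node itself, or any path that continues with a dot,
    # so that 0.2.1 does not accidentally match 0.2.10.
    return conn.execute(
        "SELECT label, path FROM nodes "
        "WHERE path = ? OR path LIKE ? ORDER BY path",
        (prefix, prefix + ".%"),
    ).fetchall()


clade_rows = clade("0.2.1")
print(clade_rows)
```

The result holds the unlabeled node 0.2.1 plus leaves C and D, i.e. the clade that the slide's query selects.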
6
  • ATreeGrep
  • Uses special suffix indexing to optimize speed
  • Shasha, D., J. T. L. Wang, H. Shan and K. Zhang.
    2002. ATreeGrep: Approximate Searching in
    Unordered Trees. Proceedings of the 14th SSDBM,
    Edinburgh, Scotland, pp. 89-98.
  • Crimson
  • Uses nested subtrees to avoid long strings
  • Zheng, Y., S. Fisher, S. Cohen, S. Guo, J. Kim,
    and S. B. Davidson. 2006. Crimson: A Data
    Management System to Support Evaluating
    Phylogenetic Tree Reconstruction Algorithms. 32nd
    International Conference on Very Large Data
    Bases, ACM, pp. 1231-1234.

7
Searching Stored Tree
  • Path Enumerations
  • Nested Sets
  • Adjacency Lists
  • Metrics
  • Transitive Closure

8
Depth-first traversal scoring each node with a
left and right ID
9
Minimum Spanning Clade of Node 5
Label         Left  Right
(unlabeled)   1     18
(unlabeled)   2     7
A             3     4
B             5     6
(unlabeled)   8     17
(unlabeled)   9     14
C             10    11
D             12    13
E             15    16
SELECT * FROM nodes INNER JOIN nodes AS include
ON (nodes.left_id BETWEEN include.left_id
AND include.right_id) WHERE include.node_id = 5
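A minimal sketch of the nested-set query, run against an in-memory SQLite database. The node_id values 1 to 9 are an assumption: they number the rows of the table above from top to bottom, which makes node 5 the node with left/right values 8 and 17. The join alias is renamed to clade_root purely for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, label TEXT, "
    "left_id INTEGER, right_id INTEGER)"
)
# Assumed numbering: node_id follows the row order of the slide's table.
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        (1, None, 1, 18), (2, None, 2, 7), (3, "A", 3, 4), (4, "B", 5, 6),
        (5, None, 8, 17), (6, None, 9, 14), (7, "C", 10, 11),
        (8, "D", 12, 13), (9, "E", 15, 16),
    ],
)

# Every node whose left_id falls inside node 5's interval is in its clade.
subtree = conn.execute(
    """
    SELECT nodes.node_id, nodes.label
    FROM nodes
    INNER JOIN nodes AS clade_root
      ON nodes.left_id BETWEEN clade_root.left_id AND clade_root.right_id
    WHERE clade_root.node_id = 5
    ORDER BY nodes.left_id
    """
).fetchall()

subtree_ids = [node_id for node_id, _ in subtree]
leaf_labels = [label for _, label in subtree if label is not None]
print(subtree_ids, leaf_labels)
```

A single range comparison retrieves the whole subtree, which is the main appeal of nested sets; the cost is that inserting a node means renumbering left and right IDs.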
10
  • PhyloFinder
  • Duhong Chen et al.
  • http://pilin.cs.iastate.edu/phylofinder/
  • Mackey, A. 2002. Relational Modeling of
    Biological Data: Trees and Graphs. Bioinformatics
    Technology Conference.
    http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html

11
Searching Stored Tree
  • Path Enumerations
  • Nested Sets
  • Adjacency Lists
  • Metrics
  • Transitive Closure

13
SQL query to find the parent node of node D
SELECT * FROM nodes AS parent INNER JOIN nodes
AS child ON (child.parent_id =
parent.node_id) WHERE child.node_label = 'D'
...but going beyond one level requires an external
procedure to navigate the tree.
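To illustrate what such an external procedure might look like, here is a minimal sketch that walks parent_id links one query at a time. It assumes an in-memory SQLite database and the example tree ((A,B),((C,D),E)) with node ids 1 to 9 assigned in preorder; the ancestors helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, "
    "node_label TEXT, parent_id INTEGER)"
)
# Assumed example tree ((A,B),((C,D),E)); the root (node 1) has no parent.
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        (1, None, None), (2, None, 1), (3, "A", 2), (4, "B", 2),
        (5, None, 1), (6, None, 5), (7, "C", 6), (8, "D", 6), (9, "E", 5),
    ],
)


def ancestors(label):
    """Walk parent_id links from the labelled node up to the root."""
    row = conn.execute(
        "SELECT parent_id FROM nodes WHERE node_label = ?", (label,)
    ).fetchone()
    path = []
    # One round trip to the database per level of the tree.
    while row is not None and row[0] is not None:
        path.append(row[0])
        row = conn.execute(
            "SELECT parent_id FROM nodes WHERE node_id = ?", (row[0],)
        ).fetchone()
    return path


d_ancestors = ancestors("D")
print(d_ancestors)
```

The loop issues one query per level, which is why adjacency lists alone scale poorly for deep trees.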
14
Searching Stored Tree
  • Path Enumerations
  • Nested Sets
  • Adjacency Lists
  • Metrics
  • Transitive Closure

15
Searching trees by distance metrics: USim distance
Wang, J. T. L., H. Shan, D. Shasha and
W. H. Piel. 2005. Fast Structural Search in
Phylogenetic Databases. Evolutionary
Bioinformatics Online, 1: 37-46
    A  B  C  D
A   0  1  2  2
B   1  0  2  2
C   2  2  0  1
D   2  2  1  0

    A  B  C  D
A   0  1  2  3
B   1  0  2  3
C   1  1  0  2
D   1  1  1  0
16
Searching Stored Tree
  • Path Enumerations
  • Nested Sets
  • Adjacency Lists
  • Transitive Closure

17
Transitive Closure
  • Finding paths between vertices on a graph
  • DB2 and Oracle have special functions
  • FROM Edge START WITH (child_id = A AND tree_id =
    T) CONNECT BY (PRIOR parent_id = child_id) AND
    (PRIOR tree_id = tree_id)
  • Nakhleh, L., D. Miranker, F. Barbancon, W. H.
    Piel, and M. Donoghue. 2003. Requirements of
    phylogenetic databases. Third IEEE Symposium on
    Bioinformatics and Bioengineering, pp. 141-148.
  • Paths can be precomputed and stored: BioSQL
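Where CONNECT BY is not available, a recursive common table expression can play a similar role. The sketch below is a stand-in, not the syntax from the slide: it assumes SQLite, a hypothetical edge table, and the example tree ((A,B),((C,D),E)) with node ids 1 to 9 in preorder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edge (tree_id INTEGER, parent_id INTEGER, child_id INTEGER)"
)
# Assumed example tree ((A,B),((C,D),E)) stored as parent -> child edges.
conn.executemany(
    "INSERT INTO edge VALUES (1, ?, ?)",
    [(1, 2), (2, 3), (2, 4), (1, 5), (5, 6), (6, 7), (6, 8), (5, 9)],
)

# Start at one child and repeatedly follow child -> parent edges,
# staying within the same tree (compare START WITH / CONNECT BY PRIOR).
ANCESTORS_SQL = """
WITH RECURSIVE up(node_id, depth) AS (
    SELECT parent_id, 1 FROM edge
    WHERE child_id = :start AND tree_id = :tree
    UNION ALL
    SELECT e.parent_id, up.depth + 1
    FROM edge AS e JOIN up ON e.child_id = up.node_id
    WHERE e.tree_id = :tree
)
SELECT node_id, depth FROM up ORDER BY depth
"""

ancestors_of_7 = conn.execute(ANCESTORS_SQL, {"start": 7, "tree": 1}).fetchall()
print(ancestors_of_7)
```

Each result row is an ancestor of node 7 (leaf C) together with its distance from that leaf.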

18
Dealing with the Growth of Phyloinformatics
  • Trees: Too Many
  • Search, organize, triage, summarize, synthesize
  • Review existing methods
  • Describe queries for BioSQL phylo extension
  • Making generic queries
  • Trees: Too Big
  • Visualizing and manipulating large trees
  • Demo: PhyloWidget

19
BioSQL: http://www.biosql.org/
  • Schema for persistent storage of sequences and
    features, tightly integrated with BioPerl (also
    BioPython, BioJava, and BioRuby)
  • phylodb extension designed at NESCent Hackathon
  • Perl command-line interface by Jamie Estill, GSoC
20
Index of all paths from ancestors to descendants
CREATE TABLE node_path (
  child_node_id integer,
  parent_node_id integer,
  distance integer
)
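One way such an index could be filled is sketched below: for every node, climb to the root and record each ancestor with its distance. This is an illustration only, not the BioSQL loader; the EDGES list and the example tree ((A,B),((C,D),E)) with preorder node ids are assumptions.

```python
import sqlite3

# Assumed example tree ((A,B),((C,D),E)) as (parent_id, child_id) pairs.
EDGES = [(1, 2), (2, 3), (2, 4), (1, 5), (5, 6), (6, 7), (6, 8), (5, 9)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node_path (child_node_id INTEGER, "
    "parent_node_id INTEGER, distance INTEGER)"
)

parent_of = {child: parent for parent, child in EDGES}

paths = []
for child in parent_of:
    ancestor, distance = parent_of[child], 1
    # Climb towards the root, recording one row per ancestor.
    while ancestor is not None:
        paths.append((child, ancestor, distance))
        ancestor = parent_of.get(ancestor)  # None once the root is passed
        distance += 1

conn.executemany("INSERT INTO node_path VALUES (?, ?, ?)", paths)

path_count = conn.execute("SELECT COUNT(*) FROM node_path").fetchone()[0]
paths_for_d = conn.execute(
    "SELECT parent_node_id, distance FROM node_path "
    "WHERE child_node_id = 8 ORDER BY distance"
).fetchall()
print(path_count, paths_for_d)
```

The table grows with the sum of node depths rather than the number of nodes, so storage is traded for ancestor lookups that need no recursion.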
21
Find all paths where A and B share a common
parent_node_id
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B'
22
...of those paths, select the one that has the
shortest path
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B'
ORDER BY pA.distance
LIMIT 1
23
...of those paths, select the one that has the
longest path
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B'
ORDER BY pA.distance DESC
LIMIT 1
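The shortest-path and longest-path variants can be compared side by side with a small sketch. It assumes SQLite, the example tree ((A,B),((C,D),E)) with preorder node ids, and a hypothetical common_ancestor helper that only switches the sort direction of the query shown on the slides.

```python
import sqlite3

# Assumed example tree ((A,B),((C,D),E)): child -> parent, and leaf labels.
PARENT = {2: 1, 3: 2, 4: 2, 5: 1, 6: 5, 7: 6, 8: 6, 9: 5}
LABEL = {3: "A", 4: "B", 7: "C", 8: "D", 9: "E"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT)")
conn.execute(
    "CREATE TABLE node_path (child_node_id INTEGER, "
    "parent_node_id INTEGER, distance INTEGER)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)", [(n, LABEL.get(n)) for n in range(1, 10)]
)
for child in PARENT:
    anc, dist = PARENT[child], 1
    while anc is not None:
        conn.execute("INSERT INTO node_path VALUES (?, ?, ?)", (child, anc, dist))
        anc, dist = PARENT.get(anc), dist + 1

COMMON = """
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id AND nA.node_label = ?
  AND pB.child_node_id = nB.node_id AND nB.node_label = ?
ORDER BY pA.distance {direction} LIMIT 1
"""


def common_ancestor(a, b, nearest=True):
    """Nearest (shortest path) or farthest (longest path) shared ancestor."""
    sql = COMMON.format(direction="ASC" if nearest else "DESC")
    return conn.execute(sql, (a, b)).fetchone()[0]


lca_cd = common_ancestor("C", "D")
lca_ae = common_ancestor("A", "E")
farthest_cd = common_ancestor("C", "D", nearest=False)
print(lca_cd, lca_ae, farthest_cd)
```

Sorting ascending yields the most recent common ancestor of the two leaves, while sorting descending yields the root of the tree.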
24
Find the maximum spanning clade (i.e. the
subtree) for each tree that includes A and B but
not C
SELECT e.parent_id AS parent, e.child_id AS child,
       ch.node_label, pt.tree_id
FROM node_path p, edges e, nodes pt, nodes ch
WHERE e.child_id = p.child_node_id
AND pt.node_id = e.parent_id
AND ch.node_id = e.child_id
AND p.parent_node_id IN (
      SELECT pA.parent_node_id
      FROM node_path pA, node_path pB, nodes nA, nodes nB
      WHERE pA.parent_node_id = pB.parent_node_id
      AND pA.child_node_id = nA.node_id
      AND nA.node_label = 'A'
      AND pB.child_node_id = nB.node_id
      AND nB.node_label = 'B')
AND NOT EXISTS (
      SELECT 1 FROM node_path np, nodes n
      WHERE np.child_node_id = n.node_id
      AND n.node_label = 'C'
      AND np.parent_node_id = p.parent_node_id)
25
Find trees that contain a clade that includes A
and B but not C
SELECT DISTINCT t.tree_id, t.name
FROM node_path p, nodes ch, trees t
WHERE ch.node_id = p.child_node_id
AND ch.tree_id = t.tree_id
AND p.parent_node_id IN (
      SELECT pA.parent_node_id
      FROM node_path pA, node_path pB, nodes nA, nodes nB
      WHERE pA.parent_node_id = pB.parent_node_id
      AND pA.child_node_id = nA.node_id
      AND nA.node_label = 'A'
      AND pB.child_node_id = nB.node_id
      AND nB.node_label = 'B')
AND NOT EXISTS (
      SELECT 1 FROM node_path np, nodes n
      WHERE np.child_node_id = n.node_id
      AND n.node_label = 'C'
      AND np.parent_node_id = p.parent_node_id)
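A minimal sketch that runs the tree-level query against two made-up trees, ((A,B),C) and ((A,C),B). The SQLite database, the NODES dictionary, the tree names, and the node ids are all assumptions chosen so that only the first tree contains a clade with A and B but not C.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trees (tree_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT, tree_id INTEGER);
CREATE TABLE node_path (child_node_id INTEGER, parent_node_id INTEGER, distance INTEGER);
""")

# Hypothetical data: tree 1 is ((A,B),C) and tree 2 is ((A,C),B).
# Each entry maps node_id -> (label, parent_id, tree_id).
NODES = {
    1: (None, None, 1), 2: (None, 1, 1), 3: ("A", 2, 1),
    4: ("B", 2, 1), 5: ("C", 1, 1),
    11: (None, None, 2), 12: (None, 11, 2), 13: ("A", 12, 2),
    14: ("C", 12, 2), 15: ("B", 11, 2),
}
conn.executemany("INSERT INTO trees VALUES (?, ?)", [(1, "tree1"), (2, "tree2")])
for node_id, (label, parent, tree_id) in NODES.items():
    conn.execute("INSERT INTO nodes VALUES (?, ?, ?)", (node_id, label, tree_id))
    distance = 1
    # Record every ancestor of this node together with its distance.
    while parent is not None:
        conn.execute(
            "INSERT INTO node_path VALUES (?, ?, ?)", (node_id, parent, distance)
        )
        parent, distance = NODES[parent][1], distance + 1

QUERY = """
SELECT DISTINCT t.tree_id, t.name
FROM node_path p, nodes ch, trees t
WHERE ch.node_id = p.child_node_id
  AND ch.tree_id = t.tree_id
  AND p.parent_node_id IN (
        SELECT pA.parent_node_id
        FROM node_path pA, node_path pB, nodes nA, nodes nB
        WHERE pA.parent_node_id = pB.parent_node_id
          AND pA.child_node_id = nA.node_id AND nA.node_label = 'A'
          AND pB.child_node_id = nB.node_id AND nB.node_label = 'B')
  AND NOT EXISTS (
        SELECT 1 FROM node_path np, nodes n
        WHERE np.child_node_id = n.node_id
          AND n.node_label = 'C'
          AND np.parent_node_id = p.parent_node_id)
"""
hits = conn.execute(QUERY).fetchall()
print(hits)
```

In the second tree the only node above both A and B is the root, which also sits above C, so that tree is filtered out.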
26
Find trees that contain a clade that includes (A,
B, C) but not D or E
SELECT qry.tree_id, MIN(qry.name) AS "tree_name"
FROM (
  SELECT DISTINCT ON (n.node_id)
         n.node_id, t.tree_id, t.name
  FROM trees t, nodes n,
       (SELECT DISTINCT ON (inN.tree_id)
               inP.parent_node_id
        FROM nodes inN, node_path inP
        WHERE inN.node_label IN ('A','B','C')
        AND inP.child_node_id = inN.node_id
        GROUP BY inN.tree_id, inP.parent_node_id
        HAVING COUNT(inP.child_node_id) = 3
        ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
  WHERE n.node_id IN (lca.parent_node_id)
  AND t.tree_id = n.tree_id
  AND NOT EXISTS (
        SELECT 1 FROM nodes outN, node_path outP
        WHERE outN.node_label IN ('D','E')
        AND outP.child_node_id = outN.node_id
        AND outP.parent_node_id = lca.parent_node_id)
  AND EXISTS (
        SELECT c.tree_id FROM trees c, nodes q
        WHERE q.node_label IN ('D','E')
        AND q.tree_id = c.tree_id
        AND c.tree_id = t.tree_id
        GROUP BY c.tree_id
        HAVING COUNT(c.tree_id) = 2)
) AS qry
GROUP BY (qry.tree_id)
HAVING COUNT(qry.node_id) = 1
27
Here's a faster, cleaner version:
SELECT t.tree_id, t.name
FROM trees t
INNER JOIN (
      SELECT DISTINCT ON (inN.tree_id)
             inP.parent_node_id, inN.tree_id
      FROM nodes inN, node_path inP
      WHERE inN.node_label IN ('A','B','C')
      AND inP.child_node_id = inN.node_id
      GROUP BY inN.tree_id, inP.parent_node_id
      HAVING COUNT(inP.child_node_id) = 3
      ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
USING (tree_id)
WHERE NOT EXISTS (
      SELECT 1 FROM nodes outN, node_path outP
      WHERE outN.node_label IN ('D','E')
      AND outP.child_node_id = outN.node_id
      AND outP.parent_node_id = lca.parent_node_id)
AND EXISTS (
      SELECT c.tree_id FROM trees c, nodes q
      WHERE q.node_label IN ('D','E')
      AND q.tree_id = c.tree_id
      AND c.tree_id = t.tree_id
      GROUP BY c.tree_id
      HAVING COUNT(c.tree_id) = 2)
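DISTINCT ON is a PostgreSQL extension, so the sketch below swaps in a different, portable formulation rather than reproducing the query above: it groups node_path rows by ancestor and keeps ancestors that cover every included label and no excluded label. Like the slide's query, it also requires the tree to contain all excluded labels. The SQLite database, the load and trees_with_clade helpers, and the three test trees are assumptions.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT, tree_id INTEGER);
CREATE TABLE node_path (child_node_id INTEGER, parent_node_id INTEGER, distance INTEGER);
""")
next_id = itertools.count(1)


def load(tree_id, subtree, ancestors=()):
    """Insert a nested-tuple tree: strings are leaves, tuples are clades."""
    node_id = next(next_id)
    label = subtree if isinstance(subtree, str) else None
    conn.execute("INSERT INTO nodes VALUES (?, ?, ?)", (node_id, label, tree_id))
    # `ancestors` is ordered nearest-first, so the position is the distance.
    for distance, anc in enumerate(ancestors, start=1):
        conn.execute("INSERT INTO node_path VALUES (?, ?, ?)", (node_id, anc, distance))
    if label is None:
        for child in subtree:
            load(tree_id, child, (node_id,) + ancestors)


def trees_with_clade(include, exclude):
    """Tree ids with a clade holding all of `include` and none of `exclude`."""
    inc = ",".join("?" * len(include))
    exc = ",".join("?" * len(exclude))
    sql = f"""
        SELECT DISTINCT anc.tree_id
        FROM node_path p
        JOIN nodes leaf ON leaf.node_id = p.child_node_id
        JOIN nodes anc ON anc.node_id = p.parent_node_id
        GROUP BY anc.tree_id, p.parent_node_id
        HAVING SUM(CASE WHEN leaf.node_label IN ({inc}) THEN 1 ELSE 0 END) = ?
           AND SUM(CASE WHEN leaf.node_label IN ({exc}) THEN 1 ELSE 0 END) = 0
           AND (SELECT COUNT(*) FROM nodes q
                WHERE q.tree_id = anc.tree_id
                  AND q.node_label IN ({exc})) = ?
        ORDER BY anc.tree_id
    """
    params = [*include, len(include), *exclude, *exclude, len(exclude)]
    return [row[0] for row in conn.execute(sql, params)]


load(1, ((("A", "B"), "C"), ("D", "E")))   # has a clade (A,B,C) without D, E
load(2, (("A", "B"), (("C", "D"), "E")))   # A, B, C only meet at the root
load(3, (("A", "B"), "C"))                 # lacks D and E altogether

matches = trees_with_clade(["A", "B", "C"], ["D", "E"])
print(matches)
```

Because the labels are passed as parameters, the same function serves any include and exclude sets without rewriting the SQL by hand.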
28
Matching a whole tree means querying for all
clades:
(A, B) but not C, D, E
(C, D) but not A, B, E
(C, D, E) but not A, B
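The list of clade queries can be derived mechanically from a tree. The sketch below assumes trees written as nested Python tuples and uses hypothetical helper names (leaves, clades, clade_queries); two trees then match when their clade sets are equal, regardless of the order in which children are drawn.

```python
def leaves(subtree):
    """Return the set of leaf labels below a node (a str is a leaf)."""
    if isinstance(subtree, str):
        return frozenset([subtree])
    return frozenset().union(*(leaves(child) for child in subtree))


def clades(tree):
    """Collect the leaf set of every internal node below the root."""
    found = set()

    def visit(subtree, is_root):
        if isinstance(subtree, str):
            return
        if not is_root:
            found.add(leaves(subtree))
        for child in subtree:
            visit(child, False)

    visit(tree, True)
    return found


def clade_queries(tree):
    """One (include, exclude) pair per clade, as sorted label lists."""
    everything = leaves(tree)
    return sorted(
        (sorted(clade), sorted(everything - clade)) for clade in clades(tree)
    )


TREE = (("A", "B"), (("C", "D"), "E"))
SAME_TOPOLOGY = (("E", ("D", "C")), ("B", "A"))
DIFFERENT = (("A", "C"), (("B", "D"), "E"))

for include, exclude in clade_queries(TREE):
    print("({}) but not {}".format(", ".join(include), ", ".join(exclude)))
print(clades(TREE) == clades(SAME_TOPOLOGY), clades(TREE) == clades(DIFFERENT))
```

Each (include, exclude) pair corresponds to one database query of the kind shown on the preceding slides.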
29
Dealing with the Growth of Phyloinformatics
  • Trees: Too Many
  • Search, organize, triage, summarize, synthesize
  • Review existing methods
  • Describe queries for BioSQL phylo extension
  • Making generic queries
  • Trees: Too Big
  • Visualizing and manipulating large trees
  • Demo: PhyloWidget

30
Mining trees for interesting, general,
relationship questions
(((Sus_scrofa, Hippopotamus),Balaenoptera),Equus_caballus)
vs
((Sus_scrofa, (Hippopotamus,Balaenoptera)),Equus_caballus)
31
Even with perfectly resolved OTUs, you will
still fail to hit relevant trees
32
Step 1: for each clade in all trees in the database,
run a stem query on a classification tree (e.g. NCBI)
Step 2: label each node with an NCBI taxon id (if
there is a match)
Step 3: do the same for the query tree
Stem Queries:
Node 2 (>A, B - C, D, E)
Node 3 (>A - B, C, D, E)
Node 4 (>B - A, C, D, E)
Node 5 (>C, D, E - A, B)
Node 6 (>C, D - A, B, E)
Node 7 (>C - A, B, D, E)
Node 8 (>D - A, B, C, E)
Node 9 (>E - A, B, C, D)
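The stem queries listed above can be generated from the tree itself. This sketch assumes the example tree ((A,B),((C,D),E)) written as nested tuples and node numbers assigned in preorder starting at 1 for the root; the stem_queries helper is hypothetical, and it covers only the generation of the query strings, not the lookup against a classification.

```python
def leaves(subtree):
    """Return the set of leaf labels below a node (a str is a leaf)."""
    if isinstance(subtree, str):
        return frozenset([subtree])
    return frozenset().union(*(leaves(child) for child in subtree))


def stem_queries(tree):
    """Map node number -> '(>ingroup - outgroup)' for every non-root node."""
    all_leaves = leaves(tree)
    queries = {}
    counter = 0

    def visit(subtree):
        nonlocal counter
        counter += 1  # assumed numbering: preorder, root is node 1
        inside = leaves(subtree)
        outside = all_leaves - inside
        if outside:  # the root has no outgroup, hence no stem query
            queries[counter] = "(>{} - {})".format(
                ", ".join(sorted(inside)), ", ".join(sorted(outside))
            )
        if not isinstance(subtree, str):
            for child in subtree:
                visit(child)

    visit(tree)
    return queries


TREE = (("A", "B"), (("C", "D"), "E"))
QUERIES = stem_queries(TREE)
for node, query in QUERIES.items():
    print("Node", node, query)
```

Running the same function over the query tree gives the stem queries needed for Step 3.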
33
Rename nodes according to their deepest stem
query
34
Dealing with the Growth of Phyloinformatics
  • Trees: Too Many
  • Search, organize, triage, summarize, synthesize
  • Review existing methods
  • Describe queries for BioSQL phylo extension
  • Making generic queries
  • Trees: Too Big
  • Visualizing and manipulating large trees
  • Demo: PhyloWidget

35
PhyloWidget
  • Greg Jordan
  • Google Summer of Code student
  • Nick Goldman's group, EBI
  • Java Applet
  • Uses the Processing graphics library
  • Originally intended as a graphical phylogenetic
    query and display tool for TreeBASE, BioSQL, etc.
  • Can be used for:
  • Manipulating and visualizing large trees
  • Building supertrees through pruning and grafting

36
Thanks