Title: Efficiently Answering Reachability Queries on Large Directed Graphs
1. Efficiently Answering Reachability Queries on Large Directed Graphs
- Ruoming Jin
- Kent State University
- Joint work with Yang Xiang (KSU), Ning Ruan
(KSU), and Haixun Wang (IBM T.J. Watson)
2. Reachability Query
The problem: given two vertices u and v in a directed graph G, is there a path from u to v?
[Figure: example DAG with vertices 1 to 15]
- Query(1, 11): Yes
- Query(3, 9): No
Directed graph → DAG (directed acyclic graph), obtained by coalescing the strongly connected components
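The reduction from a general directed graph to a DAG can be sketched as follows. This is a minimal illustration using Kosaraju's two-pass algorithm, not the authors' implementation; the function name `condense` and the 0-based vertex numbering are assumptions made for the example.

```python
def condense(n, edges):
    """Coalesce strongly connected components (Kosaraju's algorithm).

    n     -- number of vertices, numbered 0..n-1 (assumed numbering)
    edges -- iterable of directed edges (u, v)
    Returns (comp, count, dag_edges): comp[v] is the component id of v,
    count is the number of components, dag_edges the edges of the DAG.
    """
    edges = list(edges)
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    # Pass 1: record vertices in order of DFS completion.
    # (Recursive DFS for brevity; very large graphs need an iterative one.)
    seen = [False] * n
    finish = []

    def dfs1(u):
        seen[u] = True
        for v in adj[u]:
            if not seen[v]:
                dfs1(v)
        finish.append(u)

    for u in range(n):
        if not seen[u]:
            dfs1(u)

    # Pass 2: DFS on the reversed graph in decreasing finish order;
    # each tree found this way is one strongly connected component.
    comp = [-1] * n

    def dfs2(u, c):
        comp[u] = c
        for v in radj[u]:
            if comp[v] == -1:
                dfs2(v, c)

    count = 0
    for u in reversed(finish):
        if comp[u] == -1:
            dfs2(u, count)
            count += 1

    # Keep only edges that connect two different components.
    dag_edges = {(comp[u], comp[v]) for u, v in edges if comp[u] != comp[v]}
    return comp, count, dag_edges
```

After this step, two vertices in the same component trivially reach each other, and every other query can be asked on the component DAG.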
3. Applications
- XML
- Biological networks
- Ontology
- Knowledge representation (Lattice operation)
- Object programming (Class relationship)
- Distributed systems (Reachable states)
Graph Databases
4. Prior Work
2-HOP (O(n m^(1/2)) and O(n^4)), HOPI, and heuristic algorithms
5. Limitations of Tree-Based Approaches
- Finding a good tree cover is expensive
- A tree cover cannot represent some common types of DAGs, such as grids
- Compression limitations
  - Chain (1 parent, 1 child)
  - Tree (1 parent, multiple children)
- Most existing methods that use a tree cover are strongly affected by how many edges are left uncovered
6. Overview of Path-Tree
- Chain → Tree → Path-Tree (2 parents / multiple children)
- A path-tree cover is a spanning subgraph of G in a tree shape (T)
- A node in the tree T corresponds to a path in G, and an edge in T corresponds to the edges between two paths in G
- A 3-tuple labeling exists for any path-tree and answers reachability queries in O(1)
7. Path-Tree in a Nutshell
[Figure: example DAG with vertices 1 to 15, partitioned into paths P1 to P4]
The path-graph is not necessarily a planar graph. The reachability between any two nodes can be answered in O(1).
8. Key Problems
- How to construct a path-tree?
  - Algorithm
- How can a path-tree help with reachability queries?
  - Labeling
  - Transitive closure compression
- How does path-tree compare with the existing methods?
  - Optimality
9. Constructing Path-Tree
- Step 1: Path-decomposition of the DAG
- Step 2: Minimal equivalent edge set between any two paths
- Step 3: Path-graph construction
- Step 4: Path-tree cover extraction
10. Step 1: Path-Decomposition
[Figure: the example DAG (vertices 1 to 15) partitioned into paths P1 to P4; each vertex carries a (PID, SID) pair, for example (2, 5)]
For any two nodes u and v in the same path, u → v if and only if u.sid ≤ v.sid.
A simple linear-time algorithm based on topological sort can produce a path-decomposition.
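One possible greedy decomposition along these lines is sketched below. The slide does not spell out the algorithm, so the extension rule (follow the unassigned successor that comes first in topological order) and the names `path_decomposition` and `reaches_on_same_path` are assumptions for illustration only.

```python
from collections import deque


def topological_order(adj):
    """Kahn's algorithm; adj[u] lists the out-neighbours of vertex u."""
    n = len(adj)
    indeg = [0] * n
    for u in range(n):
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(u for u in range(n) if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order


def path_decomposition(adj):
    """Partition the vertices of a DAG into vertex-disjoint paths.

    Returns (paths, label) where label[v] = (pid, sid), both 1-based.
    """
    order = topological_order(adj)
    rank = [0] * len(adj)
    for i, v in enumerate(order):
        rank[v] = i
    label = {}
    paths = []
    for start in order:
        if start in label:
            continue
        pid = len(paths) + 1
        path = []
        v = start
        while v is not None:
            label[v] = (pid, len(path) + 1)
            path.append(v)
            # Assumed rule: extend with the earliest unassigned successor.
            free = [w for w in adj[v] if w not in label]
            v = min(free, key=lambda w: rank[w]) if free else None
        paths.append(path)
    return paths, label


def reaches_on_same_path(label, u, v):
    """Same-path rule from the slide: u reaches v iff u.sid <= v.sid."""
    (pid_u, sid_u), (pid_v, sid_v) = label[u], label[v]
    if pid_u != pid_v:
        raise ValueError("rule applies only to vertices on the same path")
    return sid_u <= sid_v
```

Each vertex is assigned once and its adjacency list is scanned once, so the sketch runs in time linear in the size of the graph.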
11. Step 2: Minimal Equivalent Edge Set
- The reachability between any two paths can be captured by a unique minimal set of edges
[Figure: two views of the edges from path P1 to path P2 (P1 → P2) in the example DAG]
The edges in the minimal equivalent edge set do not cross (they are always parallel)!
12. Step 3: Path-Graph Construction
The weight of a path-graph edge reflects the cost we pay in the transitive closure computation if that edge is excluded from the path-tree.
[Figure: the example DAG with paths P1 to P4, next to the weighted directed path-graph over P1 to P4 with its edge weights]
13. Step 4: Extracting Path-Tree Cover
[Figure: the weighted directed path-graph over P1 to P4 and the maximal directed spanning tree extracted from it]
Chu-Liu/Edmonds algorithm, O(m + k log k)
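To make the extraction step concrete, here is a sketch of the Chu-Liu/Edmonds idea in its simple cycle-contraction form. It is slower than the bound quoted on the slide, it returns only the total weight of the maximum spanning arborescence rather than its edges, and it assumes the root path is known in advance (the slide does not say how the root is chosen). The function name and the integer vertex ids, which stand for paths, are hypothetical.

```python
def max_arborescence_weight(n, root, edges):
    """Weight of a maximum spanning arborescence rooted at `root`.

    n     -- number of vertices (paths), numbered 0..n-1
    edges -- list of (u, v, w) directed weighted edges
    Returns None if some vertex cannot be reached from the root.
    """
    NEG = float("-inf")
    total = 0
    while True:
        # 1. For every non-root vertex pick the heaviest incoming edge.
        best = [NEG] * n
        pre = [-1] * n
        for u, v, w in edges:
            if u != v and v != root and w > best[v]:
                best[v], pre[v] = w, u
        if any(best[v] == NEG for v in range(n) if v != root):
            return None
        best[root] = 0

        # 2. Detect cycles among the chosen edges.
        comp = [-1] * n   # contracted id of each vertex
        mark = [-1] * n   # which walk last touched the vertex
        cycles = 0
        for i in range(n):
            total += best[i]
            v = i
            while v != root and mark[v] != i and comp[v] == -1:
                mark[v] = i
                v = pre[v]
            if v != root and comp[v] == -1:
                # The walk closed on itself: label the whole cycle.
                u = pre[v]
                while u != v:
                    comp[u] = cycles
                    u = pre[u]
                comp[v] = cycles
                cycles += 1
        if cycles == 0:
            return total

        # 3. Contract each cycle to one vertex and reduce edge weights
        #    by the weight of the chosen edge they would replace.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = cycles
                cycles += 1
        edges = [(comp[u], comp[v], w - best[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n = comp[root], cycles
```

For example, with the hypothetical weights `[(0, 1, 5), (0, 2, 1), (1, 2, 4), (2, 1, 6), (2, 3, 2), (1, 3, 3)]` and root 0, the greedy choice creates the cycle 1 ↔ 2, which the contraction step resolves in favour of the edges 0 → 1, 1 → 2, and 1 → 3.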
14. Key Problems
- How to construct a path-tree?
  - Algorithm
- How can a path-tree help with reachability queries?
  - Labeling
  - Transitive closure compression
- How does path-tree compare with the existing methods?
  - Optimality
15. 3-Tuple Labeling for Reachability
[Figure: the example DAG with paths P1 to P4, next to the path-tree whose nodes P1 to P4 carry the interval labels [1,1], [2,2], [1,3], and [1,4]]
- Interval labeling (2-tuple): high-level description of the paths (Pi → Pj?)
- DFS labeling (1-tuple)
16. DFS Labeling
[Figure: the example DAG with paths P1 to P4; each vertex is annotated with its DFS label (1 to 15)]
- Start from the first vertex in the root path
- Always try to visit the next vertex in the same path first
- Label a node once all its neighbors have been visited
- L(v) = N - x, where x is the number of nodes that have already been labeled
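The labeling rule above can be sketched as a post-order DFS. The sketch assumes that N is the total number of vertices (the slide does not define N) and that each adjacency list already places the same-path successor first; the function name `dfs_labels` is hypothetical.

```python
def dfs_labels(adj, root):
    """Assign L(v) = N - x, where x counts the vertices labeled before v.

    adj[v] -- out-neighbours of v, same-path successor listed first
              (assumed ordering)
    root   -- first vertex of the root path
    """
    n = len(adj)
    label = [0] * n
    visited = [False] * n
    labeled = 0

    def visit(v):
        nonlocal labeled
        visited[v] = True
        for w in adj[v]:
            if not visited[w]:
                visit(w)
        # All neighbours of v are done, so v receives the next label.
        label[v] = n - labeled
        labeled += 1

    # Start at the root, then cover any vertex the root cannot reach.
    for start in [root, *range(n)]:
        if not visited[start]:
            visit(start)
    return label
```

Because a vertex is labeled only after all its successors, every edge u → v in a DAG ends up with L(u) < L(v), which is what the query test on DFS labels relies on.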
17. 3-Tuple Labeling for Reachability
[Figure: the example DAG with DFS labels on the vertices, next to the path-tree whose nodes P1 to P4 carry the interval labels [1,1], [2,2], [1,3], and [1,4]]
u → v if and only if
1) Interval label: I(u) ⊇ I(v)
2) DFS label: L(u) ≤ L(v)
- Query(9, 15): P4 [1,4] ⊇ P1 [1,1] and 5 < 15, so Yes
- Query(9, 2)?
- Query(5, 9)?
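A constant-time check of the two conditions might look like the sketch below. The direction of the comparisons (interval containment and L(u) ≤ L(v)) is inferred from the worked example Query(9, 15) on the slide; the `Label` type and its field names are hypothetical.

```python
from typing import NamedTuple


class Label(NamedTuple):
    lo: int   # start of the interval of the vertex's path in the path-tree
    hi: int   # end of that interval
    dfs: int  # DFS label L(v) of the vertex


def reaches(u: Label, v: Label) -> bool:
    """u reaches v iff I(u) contains I(v) and L(u) <= L(v)."""
    interval_ok = u.lo <= v.lo and v.hi <= u.hi
    return interval_ok and u.dfs <= v.dfs
```

With the values from the slide, vertex 9 is `Label(1, 4, 5)` (path P4, DFS label 5) and vertex 15 is `Label(1, 1, 15)` (path P1, DFS label 15), so `reaches` returns true for Query(9, 15). Both tests are a fixed number of integer comparisons, which is where the O(1) query time comes from.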
18. Transitive Closure Compression
The path-tree cover (including labeling) can be constructed in O(m + n log n).
[Figure: example DAG with vertices 1 to 15]
An efficient procedure can compute and compress the transitive closure in O(mk), where k is the number of paths in the path-tree.
19. Key Problems
- How to construct a path-tree?
  - Algorithm
- How can a path-tree help with reachability queries?
  - Labeling
  - Transitive closure compression
- How does path-tree compare with the existing methods?
  - Optimality
20. Theoretical Analysis
- Optimal Path-Tree Cover (OPTC) problem
  - Given a path-decomposition, what is the optimal path-tree cover to maximally compress the transitive closure?
  - OptIndex: weight assignment based on computing the predecessor set
- Optimal Path-Decomposition (OPD) problem
  - Assuming we use only a path-decomposition to compress the transitive closure, what is the optimal path-decomposition to maximally compress the transitive closure?
  - Minimal-cost flow problem
- What is the overall optimal path-decomposition?
21. Superiority of Path-Tree Cover
- The optimal tree cover is a special case of path-tree cover in which each vertex corresponds to a single path and the weights are based on OptIndex.
- The path-tree cover approach can compress the transitive closure to a size smaller than or equal to that of the optimal tree cover approach (and consequently the optimal chain cover approach).
22. Experimental Evaluation
- Implementation in C
- 12 real datasets used in the Dual-labeling paper and the GRIPP paper
- Synthetic datasets
  - Sparse DAG with edge density 2
- AMD Opteron 2.0 GHz / 2 GB / Linux
- PTree1 (OptIndex) and PTree2
- Mainly compared with Optimal Tree Cover
23. Real Datasets
24. Experimental Result (Real Data)
On average 10 times better than Tree
On average 3 times better than Tree
25-27. Experimental Results (Synthetic Data)
28. Conclusion
- A novel path-tree structure is proposed to help compress the transitive closure and answer reachability queries
- Path-tree has the potential to be integrated with other existing methods to further improve the efficiency of reachability query processing
29. Thanks!!
30. Step 3: Path-Graph Construction
The weight of a path-graph edge reflects the penalty incurred if that edge is excluded from the path-tree.
[Figure: the example DAG with paths P1 to P4, next to the weighted directed path-graph over P1 to P4 with its edge weights]
31. Step 2: Constructing Minimal Equivalent Edge Set (Pi → Pj)
1) Order the vertices in Pi and Pj in decreasing order
2) Find the first vertex v in Pj that Pi can reach
3) Find the last vertex u in Pi that reaches v
4) Remove all the edges that cross (u, v)
5) Repeat steps 2 to 4
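The steps above can be sketched as a single scan, under the assumption that only direct edges from Pi to Pj are considered and that each edge is given as a pair (a, b) of sequence ids, with a the position of its source on Pi and b the position of its target on Pj. An edge is then redundant when another edge leaves Pi at or after a and enters Pj at or before b. The function name is hypothetical, and the scan is an equivalent reformulation rather than a literal transcription of the steps.

```python
def minimal_equivalent_edges(edges):
    """Reduce the edges from path Pi to path Pj to a minimal equivalent set.

    edges -- iterable of (a, b): sid of the source on Pi, sid of the
             target on Pj (assumed encoding)
    Returns the surviving edges sorted by source position.
    """
    kept = []
    best_b = None  # smallest target sid among the edges seen so far
    # Scan sources from last to first; for equal sources, earliest target first.
    for a, b in sorted(set(edges), key=lambda e: (-e[0], e[1])):
        # Every edge seen so far starts at or after a, so (a, b) is
        # redundant unless it enters Pj strictly earlier than all of them.
        if best_b is None or b < best_b:
            kept.append((a, b))
            best_b = b
    return sorted(kept)
```

In the returned list both the source and the target positions are strictly increasing, which is the "edges do not cross" property of the minimal equivalent edge set.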