Title: High-Performance Computation for Path Problems in Graphs
1 High-Performance Computation for Path Problems in Graphs
Aydin Buluç, John R. Gilbert
University of California, Santa Barbara
SIAM Conf. on Applications of Dynamical Systems, May 20, 2009
Support: DOE Office of Science, MIT Lincoln Labs, NSF, DARPA, SGI
2 Horizontal-vertical decomposition [Mezic et al.]
Slide courtesy of the Igor Mezic group, UCSB
3 Combinatorial Scientific Computing
"I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were sparse in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care."
- Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize for Economics
4 A few directions in CSC
- Hybrid discrete & continuous computations
- Multiscale combinatorial computation
- Analysis, management, and propagation of uncertainty
- Economic & game-theoretic considerations
- Computational biology & bioinformatics
- Computational ecology
- Knowledge discovery & machine learning
- Relationship analysis
- Web search and information retrieval
- Sparse matrix methods
- Geometric modeling
- . . .
5 The Parallel Computing Challenge
Two Nvidia 8800 GPUs: > 1 TFLOPS
LANL / IBM Roadrunner: > 1 PFLOPS
Intel 80-core chip: > 1 TFLOPS
- Parallelism is no longer optional, in every part of a computation.
6 The Parallel Computing Challenge
- Efficient sequential algorithms for graph-theoretic problems often follow long chains of dependencies
- Several parallelization strategies, but no silver bullet
  - Partitioning (e.g. for preconditioning PDE solvers)
  - Pointer-jumping (e.g. for connected components)
  - Sometimes it just depends on what the input looks like
- A few simple examples . . .
7 Sample kernel: Sort logically triangular matrix
[Figure: original matrix and its permutation to unit upper triangular form]
- Used in sparse linear solvers (e.g. Matlab's)
- Simple kernel; abstracts many other graph operations (see next)
- Sequential: linear time, simple greedy topological sort (sketch below)
- Parallel: no known method is efficient in both work and span; one parallel step per level, with arbitrarily long dependent chains
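As an illustration of the sequential kernel, here is a minimal Python/SciPy sketch of a greedy (Kahn-style) topological sort; it assumes the matrix can be symmetrically permuted to triangular form, and the function name and use of scipy.sparse are mine, not the solver internals the slide refers to.

```python
import numpy as np
import scipy.sparse as sp

def triangular_order(A):
    """Greedy (Kahn-style) topological sort of the DAG underlying A.

    Assumes A is an n-by-n scipy.sparse matrix whose rows and columns can be
    symmetrically permuted to upper triangular form, i.e. the directed graph
    with an edge i -> j for every off-diagonal nonzero A[i, j] is acyclic.
    Returns perm such that A[perm][:, perm] is upper triangular; runs in time
    linear in the number of nonzeros.
    """
    A = sp.csr_matrix(A)
    n = A.shape[0]
    # in-degree of each vertex (column), ignoring diagonal entries
    coo = A.tocoo()
    off_diag = coo.row != coo.col
    indeg = np.zeros(n, dtype=int)
    np.add.at(indeg, coo.col[off_diag], 1)

    ready = [v for v in range(n) if indeg[v] == 0]   # current sources of the DAG
    perm = []
    while ready:
        v = ready.pop()
        perm.append(v)
        for w in A.indices[A.indptr[v]:A.indptr[v + 1]]:   # out-edges of v
            if w == v:
                continue
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    if len(perm) != n:
        raise ValueError("matrix is not permutable to triangular form")
    return np.array(perm)
```

The loop touches each vertex and nonzero once, matching the linear-time claim; the serial `ready.pop()` chain is exactly the dependence that frustrates parallelization.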
8 Bipartite matching
[Figure: bipartite graph on row vertices 1-5 and column vertices 1-5, with its adjacency matrix A]
- Perfect matching: set of edges that hits each vertex exactly once
- Matrix permutation to place nonzeros (or heavy elements) on the diagonal
- Efficient sequential algorithms based on augmenting paths (sketch below)
- No known work/span efficient parallel algorithms
9 Strongly connected components
[Figure: graph G(A) and its symmetric permutation P A Pᵀ to block triangular form]
- Symmetric permutation to block triangular form
- Diagonal blocks are strong Hall (irreducible / strongly connected)
- Sequential: linear time by depth-first search [Tarjan] (sketch below)
- Parallel: divide & conquer; work and span depend on input [Fleischer, Hendrickson, Pinar]
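A small SciPy sketch of the sequential computation (illustrative only; the toy graph and variable names are mine): compute the strongly connected components and group each one into a contiguous diagonal block of a symmetric permutation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

# Toy directed graph with two strongly connected components, {0,1,2} and {3,4}.
A = sp.csr_matrix(np.array([
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]))

ncomp, labels = connected_components(A, directed=True, connection='strong')

# Group vertices by component: each strong component becomes one diagonal
# block of P A P^T.  (Ordering the components topologically, not shown,
# is what yields the full block *triangular* form.)
perm = np.argsort(labels)
PAPt = A[perm][:, perm]
print(ncomp, labels)
print(PAPt.toarray())
```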
10 Horizontal-vertical decomposition
- Defined and studied by Mezic et al. in a dynamical systems context
- Strongly connected components, ordered by levels of the DAG
- Efficient linear-time sequential algorithms
- No work/span efficient parallel algorithms known
11 Strong components of 1M-vertex RMAT graph
12 Dulmage-Mendelsohn decomposition
13 Applications of D-M decomposition
- Strongly connected components of directed graphs
- Connected components of undirected graphs
- Permutation to block triangular form for Ax = b
- Minimum-size vertex cover of bipartite graphs
- Extracting vertex separators from edge cuts for arbitrary graphs
- Nonzero structure prediction for sparse matrix factorizations
14 Strong Hall components are independent of choice of matching
15 The Primitives Challenge
- By analogy to numerical linear algebra . . .
- What should the combinatorial BLAS look like?
16 Primitives for HPC graph programming
- Visitor-based, multithreaded [Berry, Gregor, Hendrickson, Lumsdaine]
  - search templates natural for many algorithms
  - relatively simple load balancing
  - complex thread interactions, race conditions
  - unclear how applicable to standard architectures
- Array-based, data parallel [G, Kepner, Reinhardt, Robinson, Shah]
  - relatively simple control structure
  - user-friendly interface
  - some algorithms hard to express naturally
  - load balancing not so easy
- Scan-based, vectorized [Blelloch]
- We don't know the right set of primitives yet!
17 Array-based graph algorithms study [Kepner, Fineman, Kahn, Robinson]
18 Multiple-source breadth-first search
[Figure: adjacency matrix Aᵀ and multi-source frontier matrix X]
19 Multiple-source breadth-first search
[Figure: the sparse product AᵀX advances every frontier by one level]
20 Multiple-source breadth-first search
[Figure: Aᵀ, X, and AᵀX]
- Sparse array representation ⇒ space efficient
- Sparse matrix-matrix multiplication ⇒ work efficient (sketch below)
- Span and load balance depend on the matrix-mult implementation
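A minimal NumPy/SciPy sketch of the idea on these three slides (my own illustrative code, not the Combinatorial BLAS): each column of a sparse frontier matrix X carries one BFS, and a single sparse product AᵀX advances all of them one level.

```python
import numpy as np
import scipy.sparse as sp

def multi_source_bfs(A, sources):
    """Level-synchronous BFS from several sources at once.

    A: n-by-n scipy.sparse adjacency matrix, A[i, j] != 0 for edge i -> j.
    Each column of the sparse frontier X holds one source, so one sparse
    matrix-matrix product A^T X advances all frontiers by a level (the
    (or, and)-semiring product of the slides; here the numerical values
    merely count discovered parents and only the nonzero pattern matters).
    Returns levels[v, s] = BFS level of vertex v from sources[s], or -1.
    """
    n = A.shape[0]
    k = len(sources)
    At = sp.csr_matrix(A).T.tocsr()
    X = sp.csc_matrix((np.ones(k), (sources, np.arange(k))), shape=(n, k))

    levels = -np.ones((n, k), dtype=int)
    levels[sources, np.arange(k)] = 0

    level = 0
    while X.nnz:
        level += 1
        Y = (At @ X).tocoo()                   # one step of every BFS at once
        keep = levels[Y.row, Y.col] == -1      # drop already-visited vertices
        rows, cols = Y.row[keep], Y.col[keep]
        levels[rows, cols] = level
        X = sp.csc_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, k))
    return levels

# Example: edges 0->1, 0->2, 1->2; BFS from sources 0 and 2.
A = sp.csr_matrix(np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]]))
print(multi_source_bfs(A, [0, 2]))   # column 0: levels 0,1,1; column 1: -1,-1,0
```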
21 Matrices over semirings
- Matrix multiplication C = AB (or matrix/vector)
  - C(i,j) = A(i,1)·B(1,j) + A(i,2)·B(2,j) + … + A(i,n)·B(n,j)
- Replace the scalar operations · and + by ⊗ and ⊕
  - ⊗: associative, distributes over ⊕, identity 1
  - ⊕: associative, commutative, identity 0; the 0 annihilates under ⊗
- Then C(i,j) = A(i,1)⊗B(1,j) ⊕ A(i,2)⊗B(2,j) ⊕ … ⊕ A(i,n)⊗B(n,j)
- Examples: (×, +); (and, or); (+, min); . . .
- Same data reference pattern and control flow (sketch below)
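A tiny Python sketch of that point: the triple loop is unchanged and only the scalar operations are swapped. The (min, +) example computes two-hop shortest distances; function and variable names are illustrative.

```python
import numpy as np

def semiring_matmul(A, B, oplus, otimes, zero):
    """Generic C = A "times" B over a semiring (oplus, otimes, zero).

    Identical data-reference pattern and control flow to ordinary dense
    matrix multiplication; only the scalar operations differ.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.full((n, m), zero, dtype=A.dtype)
    for i in range(n):
        for j in range(m):
            acc = zero
            for l in range(k):
                acc = oplus(acc, otimes(A[i, l], B[l, j]))
            C[i, j] = acc
    return C

INF = np.inf
# (+, min) "tropical" semiring: one product relaxes all two-hop paths.
W = np.array([[0., 3., INF],
              [INF, 0., 1.],
              [7., INF, 0.]])
D = semiring_matmul(W, W, oplus=min, otimes=lambda a, b: a + b, zero=INF)
print(D)   # two-hop shortest distances, e.g. D[0, 2] == 4
```

Squaring the weight matrix about log₂ n times over this semiring yields all-pairs shortest paths, the connection exploited in the APSP slides below.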
22 SpGEMM: Sparse Matrix × Sparse Matrix [Buluç, G]
- Shortest path calculations (APSP)
- Betweenness centrality
- BFS from multiple source vertices
- Subgraph / submatrix indexing
- Graph contraction
- Cycle detection
- Multigrid interpolation & restriction
- Colored intersection searching
- Applying constraints in finite element modeling
- Context-free parsing
23 Distributed-memory parallel sparse matrix multiplication
- 2D block layout
- Outer product formulation (serial sketch below)
- Sequential hypersparse kernel
- Asynchronous MPI-2 implementation
- Experiments: TACC Lonestar cluster
- Good scaling to 256 processors
[Plot: time vs. number of cores, 1M-vertex RMAT graph]
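A serial Python/SciPy sketch of the outer-product formulation only (no 2D blocking, MPI, or hypersparse kernel; the code and names are mine): C is accumulated as a sum of rank-1 products of columns of A with rows of B, which is the local contribution each processor computes in the 2D algorithm.

```python
import numpy as np
import scipy.sparse as sp

def spgemm_outer_product(A, B):
    """Outer-product formulation of sparse C = A * B.

    C is the sum over k of the sparse outer product of column k of A with
    row k of B.  In the distributed 2D-blocked algorithm, each processor
    owns blocks of A and B and contributes such partial products to C.
    """
    A = sp.csc_matrix(A)
    B = sp.csr_matrix(B)
    n, m = A.shape[0], B.shape[1]
    C = sp.csr_matrix((n, m))
    for k in range(A.shape[1]):
        col_k = A[:, [k]]            # sparse n x 1 column of A
        row_k = B[[k], :]            # sparse 1 x m row of B
        if col_k.nnz and row_k.nnz:
            C = C + col_k @ row_k    # rank-1 update
    return C

A = sp.random(6, 6, density=0.3, format='csc', random_state=0)
B = sp.random(6, 6, density=0.3, format='csr', random_state=1)
assert abs(spgemm_outer_product(A, B) - A @ B).sum() < 1e-12
```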
24 All-Pairs Shortest Paths
- Directed graph with costs on edges
- Find least-cost paths between all reachable vertex pairs
- Several classical algorithms with
  - work ~ matrix multiplication
  - span ~ log² n
- Case study of an implementation on a multicore architecture: the graphics processing unit (GPU)
25 GPU characteristics
- Powerful: two Nvidia 8800s > 1 TFLOPS
- Inexpensive: $500 each
But . . .
- Difficult programming model
- One instruction stream drives 8 arithmetic units
- Performance is counterintuitive and fragile
- Memory access pattern has subtle effects on cost
- Extremely easy to underutilize the device
- Doing it wrong easily costs 100x in time
26 Recursive All-Pairs Shortest Paths
- Based on the R-Kleene algorithm
- Well suited for GPU architecture:
  - Fast matrix-multiply kernel
  - In-place computation ⇒ low memory bandwidth
  - Few, large MatMul calls ⇒ low GPU dispatch overhead
  - Recursion stack on host CPU, not on multicore GPU
- Careful tuning of GPU code
Block partition of the matrix: [A B; C D]; ⊕ is min, ⊗ is add.
A = A* (recursive call);  B = AB;  C = CA;  D = D ⊕ CB;
D = D* (recursive call);  B = BD;  C = DC;  A = A ⊕ BC
(NumPy sketch below)
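A dense NumPy sketch of this recursion (my own transcription of the block formulas above, with an unoptimized (min, +) product standing in for the tuned GPU matrix-multiply kernel):

```python
import numpy as np

def minplus(A, B):
    """(min, +) matrix product: relaxes every path through B's row index."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def rkleene_apsp(W):
    """Recursive (R-Kleene style) all-pairs shortest paths, in place.

    W: dense matrix of edge weights, np.inf for missing edges, 0 diagonal.
    Follows the block recursion from the slide:
      A = A*; B = AB; C = CA; D = min(D, CB);
      D = D*; B = BD; C = DC; A = min(A, BC).
    Plain NumPy sketch of the structure only, not the GPU implementation.
    """
    n = W.shape[0]
    if n == 1:
        return W
    h = n // 2
    A, B = W[:h, :h], W[:h, h:]
    C, D = W[h:, :h], W[h:, h:]

    A[:] = rkleene_apsp(A)
    B[:] = minplus(A, B)
    C[:] = minplus(C, A)
    D[:] = np.minimum(D, minplus(C, B))
    D[:] = rkleene_apsp(D)
    B[:] = minplus(B, D)
    C[:] = minplus(D, C)
    A[:] = np.minimum(A, minplus(B, C))
    return W

INF = np.inf
W = np.array([[0., 2., INF, INF],
              [INF, 0., 3., INF],
              [INF, INF, 0., 1.],
              [5., INF, INF, 0.]])
print(rkleene_apsp(W))   # e.g. distance 0 -> 3 is 2 + 3 + 1 = 6
```

All the work is in a few large `minplus` calls on half-size blocks, which is why the GPU version can replace them with one fast matrix-multiply kernel and keep only the recursion stack on the host.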
27 Execution of Recursive APSP
28 APSP Experiments and observations
Speedup of the 128-core Nvidia 8800, relative to:
  1-core CPU: 120x - 480x
  16-core CPU: 17x - 45x
  Iterative algorithm, 128-core GPU: 40x - 680x
  MSSSP, 128-core GPU: 3x
[Plot: time vs. matrix dimension]
- Conclusions
  - High performance is achievable but not simple
  - Carefully chosen and optimized primitives will be key
29 H-V decomposition
- A span-efficient, but not work-efficient, method for H-V decomposition uses APSP to determine reachability
30 Reachability: Transitive closure
- APSP ⇒ transitive closure of the adjacency matrix
- Strong components identified by symmetric nonzeros (sketch below)
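A small NumPy sketch of both bullets (illustrative only; a boolean closure by repeated squaring stands in for the APSP computation, since only the nonzero pattern is needed):

```python
import numpy as np

def transitive_closure(A):
    """Reachability by repeated squaring of (I | A) over the boolean semiring.

    A: dense boolean adjacency matrix.  R[i, j] is True iff j is reachable
    from i; APSP produces the same nonzero pattern, which is how the slides
    obtain it.
    """
    n = A.shape[0]
    R = A.astype(bool) | np.eye(n, dtype=bool)
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        R = (R.astype(int) @ R.astype(int)) > 0   # boolean matrix square
    return R

def strong_components(A):
    """Vertices i and j share a strong component iff R[i, j] and R[j, i],
    i.e. the symmetric part of the closure's nonzero pattern."""
    R = transitive_closure(A)
    S = R & R.T
    _, labels = np.unique(S, axis=0, return_inverse=True)
    return labels

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=bool)
print(strong_components(A))   # two components: {0, 1} and {2, 3}
```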
31 H-V structure: Acyclic condensation
- The acyclic condensation is a sparse matrix-matrix product (sketch below)
- Levels identified by APSP for longest paths
- Practically speaking, a parallel method would compromise between work and span efficiency
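A SciPy sketch of the first bullet (my own illustrative code): contracting each strong component to a single vertex is the triple product SᵀAS, where S is a sparse vertex-to-component indicator matrix.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

def condensation(A):
    """Acyclic condensation as a sparse matrix-matrix (triple) product.

    S is the n-by-k indicator matrix mapping vertices to their strong
    components; S^T A S is the adjacency matrix of the component DAG
    (graph contraction expressed as SpGEMM).  H-V levels can then be
    read off as longest-path lengths in this DAG.
    """
    A = sp.csr_matrix(A)
    n = A.shape[0]
    k, labels = connected_components(A, directed=True, connection='strong')
    S = sp.csr_matrix((np.ones(n), (np.arange(n), labels)), shape=(n, k))
    C = (S.T @ A @ S).tocoo()
    keep = C.row != C.col        # drop self-loops so the result is a DAG
    C = sp.csr_matrix((C.data[keep], (C.row[keep], C.col[keep])), shape=(k, k))
    return C, labels

A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 0, 0, 1],
                            [0, 0, 1, 0]]))
C, labels = condensation(A)
print(labels, C.toarray())   # one DAG edge between the two components
```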
32 Remarks
- Combinatorial algorithms are pervasive in scientific computing and will become more so.
- Path computations on graphs are powerful tools, but efficiency is a challenge on parallel architectures.
- Carefully chosen and implemented primitive operations are key.
- Lots of exciting opportunities for research!