Title: High-Performance Computation for Path Problems in Graphs
1 High-Performance Computation for Path Problems in Graphs
Aydin Buluç, John R. Gilbert
University of California, Santa Barbara
SIAM Conf. on Applications of Dynamical Systems, May 20, 2009
Support: DOE Office of Science, MIT Lincoln Labs, NSF, DARPA, SGI
2 Horizontal-vertical decomposition [Mezic et al.]
Slide courtesy of the Igor Mezic group, UCSB
3 Combinatorial Scientific Computing
"I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were sparse in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care."
- Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize for Economics
4 A few directions in CSC
- Hybrid discrete & continuous computations
- Multiscale combinatorial computation
- Analysis, management, and propagation of uncertainty
- Economic & game-theoretic considerations
- Computational biology & bioinformatics
- Computational ecology
- Knowledge discovery & machine learning
- Relationship analysis
- Web search and information retrieval
- Sparse matrix methods
- Geometric modeling
- . . .
5 The Parallel Computing Challenge
Two Nvidia 8800 GPUs: > 1 TFLOPS
LANL / IBM Roadrunner: > 1 PFLOPS
Intel 80-core chip: > 1 TFLOPS
- Parallelism is no longer optional, in every part of a computation.
6 The Parallel Computing Challenge
- Efficient sequential algorithms for graph-theoretic problems often follow long chains of dependencies
- Several parallelization strategies, but no silver bullet
  - Partitioning (e.g. for preconditioning PDE solvers)
  - Pointer-jumping (e.g. for connected components)
  - Sometimes it just depends on what the input looks like
- A few simple examples . . .
7 Sample kernel: Sort logically triangular matrix
[Figure: original matrix and its permutation to unit upper triangular form]
- Used in sparse linear solvers (e.g. Matlab's)
- Simple kernel; abstracts many other graph operations (see next)
- Sequential: linear time, simple greedy topological sort (sketch below)
- Parallel: no known method is efficient in both work and span; one parallel step per level, with arbitrarily long dependent chains
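As an illustration of the sequential kernel, here is a minimal Python/SciPy sketch of a greedy (Kahn-style) topological sort; it assumes the matrix can be symmetrically permuted to triangular form, and the function name and use of scipy.sparse are mine, not the solver internals the slide refers to.

```python
import numpy as np
import scipy.sparse as sp

def triangular_order(A):
    """Greedy (Kahn-style) topological sort of the DAG underlying A.

    Assumes A is an n-by-n scipy.sparse matrix whose rows and columns can be
    symmetrically permuted to upper triangular form, i.e. the directed graph
    with an edge i -> j for every off-diagonal nonzero A[i, j] is acyclic.
    Returns perm such that A[perm][:, perm] is upper triangular; runs in time
    linear in the number of nonzeros.
    """
    A = sp.csr_matrix(A)
    n = A.shape[0]
    # in-degree of each vertex (column), ignoring diagonal entries
    coo = A.tocoo()
    off_diag = coo.row != coo.col
    indeg = np.zeros(n, dtype=int)
    np.add.at(indeg, coo.col[off_diag], 1)

    ready = [v for v in range(n) if indeg[v] == 0]   # current sources of the DAG
    perm = []
    while ready:
        v = ready.pop()
        perm.append(v)
        for w in A.indices[A.indptr[v]:A.indptr[v + 1]]:   # out-edges of v
            if w == v:
                continue
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    if len(perm) != n:
        raise ValueError("matrix is not permutable to triangular form")
    return np.array(perm)
```

The loop touches each vertex and nonzero once, matching the linear-time claim; the serial `ready.pop()` chain is exactly the dependence that frustrates parallelization.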
8 Bipartite matching
[Figure: bipartite graph on row vertices 1-5 and column vertices 1-5, with its adjacency matrix A]
- Perfect matching: set of edges that hits each vertex exactly once
- Matrix permutation to place nonzeros (or heavy elements) on the diagonal
- Efficient sequential algorithms based on augmenting paths (sketch below)
- No known work/span efficient parallel algorithms
9 Strongly connected components
[Figure: graph G(A) and its symmetric permutation P A Pᵀ to block triangular form]
- Symmetric permutation to block triangular form
- Diagonal blocks are strong Hall (irreducible / strongly connected)
- Sequential: linear time by depth-first search [Tarjan] (sketch below)
- Parallel: divide & conquer; work and span depend on input [Fleischer, Hendrickson, Pinar]
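A small SciPy sketch of the sequential computation (illustrative only; the toy graph and variable names are mine): compute the strongly connected components and group each one into a contiguous diagonal block of a symmetric permutation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

# Toy directed graph with two strongly connected components, {0,1,2} and {3,4}.
A = sp.csr_matrix(np.array([
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]))

ncomp, labels = connected_components(A, directed=True, connection='strong')

# Group vertices by component: each strong component becomes one diagonal
# block of P A P^T.  (Ordering the components topologically, not shown,
# is what yields the full block *triangular* form.)
perm = np.argsort(labels)
PAPt = A[perm][:, perm]
print(ncomp, labels)
print(PAPt.toarray())
```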
10 Horizontal-vertical decomposition
- Defined and studied by Mezic et al. in a dynamical systems context
- Strongly connected components, ordered by levels of the DAG
- Efficient linear-time sequential algorithms
- No work/span efficient parallel algorithms known
11 Strong components of 1M-vertex RMAT graph
12 Dulmage-Mendelsohn decomposition
13 Applications of D-M decomposition
- Strongly connected components of directed graphs
- Connected components of undirected graphs
- Permutation to block triangular form for Ax = b
- Minimum-size vertex cover of bipartite graphs
- Extracting vertex separators from edge cuts for arbitrary graphs
- Nonzero structure prediction for sparse matrix factorizations
14 Strong Hall components are independent of choice of matching
15 The Primitives Challenge
- By analogy to numerical linear algebra . . .
- What should the combinatorial BLAS look like?
16 Primitives for HPC graph programming
- Visitor-based, multithreaded [Berry, Gregor, Hendrickson, Lumsdaine]
  - search templates natural for many algorithms
  - relatively simple load balancing
  - complex thread interactions, race conditions
  - unclear how applicable to standard architectures
- Array-based, data parallel [G, Kepner, Reinhardt, Robinson, Shah]
  - relatively simple control structure
  - user-friendly interface
  - some algorithms hard to express naturally
  - load balancing not so easy
- Scan-based, vectorized [Blelloch]
- We don't know the right set of primitives yet!
17 Array-based graph algorithms study [Kepner, Fineman, Kahn, Robinson]
18 Multiple-source breadth-first search
[Figure: adjacency matrix Aᵀ and multi-source frontier matrix X]
19 Multiple-source breadth-first search
[Figure: the sparse product AᵀX advances every frontier by one level]
20 Multiple-source breadth-first search
[Figure: Aᵀ, X, and AᵀX]
- Sparse array representation ⇒ space efficient
- Sparse matrix-matrix multiplication ⇒ work efficient (sketch below)
- Span and load balance depend on the matrix-mult implementation
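A minimal NumPy/SciPy sketch of the idea on these three slides (my own illustrative code, not the Combinatorial BLAS): each column of a sparse frontier matrix X carries one BFS, and a single sparse product AᵀX advances all of them one level.

```python
import numpy as np
import scipy.sparse as sp

def multi_source_bfs(A, sources):
    """Level-synchronous BFS from several sources at once.

    A: n-by-n scipy.sparse adjacency matrix, A[i, j] != 0 for edge i -> j.
    Each column of the sparse frontier X holds one source, so one sparse
    matrix-matrix product A^T X advances all frontiers by a level (the
    (or, and)-semiring product of the slides; here the numerical values
    merely count discovered parents and only the nonzero pattern matters).
    Returns levels[v, s] = BFS level of vertex v from sources[s], or -1.
    """
    n = A.shape[0]
    k = len(sources)
    At = sp.csr_matrix(A).T.tocsr()
    X = sp.csc_matrix((np.ones(k), (sources, np.arange(k))), shape=(n, k))

    levels = -np.ones((n, k), dtype=int)
    levels[sources, np.arange(k)] = 0

    level = 0
    while X.nnz:
        level += 1
        Y = (At @ X).tocoo()                   # one step of every BFS at once
        keep = levels[Y.row, Y.col] == -1      # drop already-visited vertices
        rows, cols = Y.row[keep], Y.col[keep]
        levels[rows, cols] = level
        X = sp.csc_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, k))
    return levels

# Example: edges 0->1, 0->2, 1->2; BFS from sources 0 and 2.
A = sp.csr_matrix(np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]]))
print(multi_source_bfs(A, [0, 2]))   # column 0: levels 0,1,1; column 1: -1,-1,0
```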
21 Matrices over semirings
- Matrix multiplication C = AB (or matrix/vector)
  - C(i,j) = A(i,1)·B(1,j) + A(i,2)·B(2,j) + … + A(i,n)·B(n,j)
- Replace the scalar operations · and + by ⊗ and ⊕
  - ⊗: associative, distributes over ⊕, identity 1
  - ⊕: associative, commutative, identity 0; the 0 annihilates under ⊗
- Then C(i,j) = A(i,1)⊗B(1,j) ⊕ A(i,2)⊗B(2,j) ⊕ … ⊕ A(i,n)⊗B(n,j)
- Examples: (×, +); (and, or); (+, min); . . .
- Same data reference pattern and control flow (sketch below)
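A tiny Python sketch of that point: the triple loop is unchanged and only the scalar operations are swapped. The (min, +) example computes two-hop shortest distances; function and variable names are illustrative.

```python
import numpy as np

def semiring_matmul(A, B, oplus, otimes, zero):
    """Generic C = A "times" B over a semiring (oplus, otimes, zero).

    Identical data-reference pattern and control flow to ordinary dense
    matrix multiplication; only the scalar operations differ.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.full((n, m), zero, dtype=A.dtype)
    for i in range(n):
        for j in range(m):
            acc = zero
            for l in range(k):
                acc = oplus(acc, otimes(A[i, l], B[l, j]))
            C[i, j] = acc
    return C

INF = np.inf
# (+, min) "tropical" semiring: one product relaxes all two-hop paths.
W = np.array([[0., 3., INF],
              [INF, 0., 1.],
              [7., INF, 0.]])
D = semiring_matmul(W, W, oplus=min, otimes=lambda a, b: a + b, zero=INF)
print(D)   # two-hop shortest distances, e.g. D[0, 2] == 4
```

Squaring the weight matrix about log₂ n times over this semiring yields all-pairs shortest paths, the connection exploited in the APSP slides below.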
22 SpGEMM: Sparse Matrix × Sparse Matrix [Buluç, G]
- Shortest path calculations (APSP)
- Betweenness centrality
- BFS from multiple source vertices
- Subgraph / submatrix indexing
- Graph contraction
- Cycle detection
- Multigrid interpolation & restriction
- Colored intersection searching
- Applying constraints in finite element modeling
- Context-free parsing
23 Distributed-memory parallel sparse matrix multiplication
- 2D block layout
- Outer product formulation (serial sketch below)
- Sequential hypersparse kernel
- Asynchronous MPI-2 implementation
- Experiments: TACC Lonestar cluster
- Good scaling to 256 processors
[Plot: time vs. number of cores, 1M-vertex RMAT graph]
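A serial Python/SciPy sketch of the outer-product formulation only (no 2D blocking, MPI, or hypersparse kernel; the code and names are mine): C is accumulated as a sum of rank-1 products of columns of A with rows of B, which is the local contribution each processor computes in the 2D algorithm.

```python
import numpy as np
import scipy.sparse as sp

def spgemm_outer_product(A, B):
    """Outer-product formulation of sparse C = A * B.

    C is the sum over k of the sparse outer product of column k of A with
    row k of B.  In the distributed 2D-blocked algorithm, each processor
    owns blocks of A and B and contributes such partial products to C.
    """
    A = sp.csc_matrix(A)
    B = sp.csr_matrix(B)
    n, m = A.shape[0], B.shape[1]
    C = sp.csr_matrix((n, m))
    for k in range(A.shape[1]):
        col_k = A[:, [k]]            # sparse n x 1 column of A
        row_k = B[[k], :]            # sparse 1 x m row of B
        if col_k.nnz and row_k.nnz:
            C = C + col_k @ row_k    # rank-1 update
    return C

A = sp.random(6, 6, density=0.3, format='csc', random_state=0)
B = sp.random(6, 6, density=0.3, format='csr', random_state=1)
assert abs(spgemm_outer_product(A, B) - A @ B).sum() < 1e-12
```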
24 All-Pairs Shortest Paths
- Directed graph with costs on edges
- Find least-cost paths between all reachable vertex pairs
- Several classical algorithms with
  - work ~ matrix multiplication
  - span ~ log² n
- Case study of an implementation on a multicore architecture: the graphics processing unit (GPU)
25 GPU characteristics
- Powerful: two Nvidia 8800s > 1 TFLOPS
- Inexpensive: $500 each
But . . .
- Difficult programming model
- One instruction stream drives 8 arithmetic units
- Performance is counterintuitive and fragile
- Memory access pattern has subtle effects on cost
- Extremely easy to underutilize the device
- Doing it wrong easily costs 100x in time
26 Recursive All-Pairs Shortest Paths
- Based on the R-Kleene algorithm
- Well suited for GPU architecture:
  - Fast matrix-multiply kernel
  - In-place computation ⇒ low memory bandwidth
  - Few, large MatMul calls ⇒ low GPU dispatch overhead
  - Recursion stack on host CPU, not on multicore GPU
- Careful tuning of GPU code
Block partition of the matrix: [A B; C D]; ⊕ is min, ⊗ is add.
A = A* (recursive call);  B = AB;  C = CA;  D = D ⊕ CB;
D = D* (recursive call);  B = BD;  C = DC;  A = A ⊕ BC
(NumPy sketch below)
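A dense NumPy sketch of this recursion (my own transcription of the block formulas above, with an unoptimized (min, +) product standing in for the tuned GPU matrix-multiply kernel):

```python
import numpy as np

def minplus(A, B):
    """(min, +) matrix product: relaxes every path through B's row index."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def rkleene_apsp(W):
    """Recursive (R-Kleene style) all-pairs shortest paths, in place.

    W: dense matrix of edge weights, np.inf for missing edges, 0 diagonal.
    Follows the block recursion from the slide:
      A = A*; B = AB; C = CA; D = min(D, CB);
      D = D*; B = BD; C = DC; A = min(A, BC).
    Plain NumPy sketch of the structure only, not the GPU implementation.
    """
    n = W.shape[0]
    if n == 1:
        return W
    h = n // 2
    A, B = W[:h, :h], W[:h, h:]
    C, D = W[h:, :h], W[h:, h:]

    A[:] = rkleene_apsp(A)
    B[:] = minplus(A, B)
    C[:] = minplus(C, A)
    D[:] = np.minimum(D, minplus(C, B))
    D[:] = rkleene_apsp(D)
    B[:] = minplus(B, D)
    C[:] = minplus(D, C)
    A[:] = np.minimum(A, minplus(B, C))
    return W

INF = np.inf
W = np.array([[0., 2., INF, INF],
              [INF, 0., 3., INF],
              [INF, INF, 0., 1.],
              [5., INF, INF, 0.]])
print(rkleene_apsp(W))   # e.g. distance 0 -> 3 is 2 + 3 + 1 = 6
```

All the work is in a few large `minplus` calls on half-size blocks, which is why the GPU version can replace them with one fast matrix-multiply kernel and keep only the recursion stack on the host.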
27 Execution of Recursive APSP
28 APSP Experiments and observations
Speedup of the 128-core Nvidia 8800, relative to:
  1-core CPU: 120x - 480x
  16-core CPU: 17x - 45x
  Iterative algorithm, 128-core GPU: 40x - 680x
  MSSSP, 128-core GPU: 3x
[Plot: time vs. matrix dimension]
- Conclusions
  - High performance is achievable but not simple
  - Carefully chosen and optimized primitives will be key
29 H-V decomposition
- A span-efficient, but not work-efficient, method for H-V decomposition uses APSP to determine reachability
30 Reachability: Transitive closure
- APSP ⇒ transitive closure of the adjacency matrix
- Strong components identified by symmetric nonzeros (sketch below)
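A small NumPy sketch of both bullets (illustrative only; a boolean closure by repeated squaring stands in for the APSP computation, since only the nonzero pattern is needed):

```python
import numpy as np

def transitive_closure(A):
    """Reachability by repeated squaring of (I | A) over the boolean semiring.

    A: dense boolean adjacency matrix.  R[i, j] is True iff j is reachable
    from i; APSP produces the same nonzero pattern, which is how the slides
    obtain it.
    """
    n = A.shape[0]
    R = A.astype(bool) | np.eye(n, dtype=bool)
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        R = (R.astype(int) @ R.astype(int)) > 0   # boolean matrix square
    return R

def strong_components(A):
    """Vertices i and j share a strong component iff R[i, j] and R[j, i],
    i.e. the symmetric part of the closure's nonzero pattern."""
    R = transitive_closure(A)
    S = R & R.T
    _, labels = np.unique(S, axis=0, return_inverse=True)
    return labels

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=bool)
print(strong_components(A))   # two components: {0, 1} and {2, 3}
```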
31 H-V structure: Acyclic condensation
- The acyclic condensation is a sparse matrix-matrix product (sketch below)
- Levels identified by APSP for longest paths
- Practically speaking, a parallel method would compromise between work and span efficiency
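A SciPy sketch of the first bullet (my own illustrative code): contracting each strong component to a single vertex is the triple product SᵀAS, where S is a sparse vertex-to-component indicator matrix.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

def condensation(A):
    """Acyclic condensation as a sparse matrix-matrix (triple) product.

    S is the n-by-k indicator matrix mapping vertices to their strong
    components; S^T A S is the adjacency matrix of the component DAG
    (graph contraction expressed as SpGEMM).  H-V levels can then be
    read off as longest-path lengths in this DAG.
    """
    A = sp.csr_matrix(A)
    n = A.shape[0]
    k, labels = connected_components(A, directed=True, connection='strong')
    S = sp.csr_matrix((np.ones(n), (np.arange(n), labels)), shape=(n, k))
    C = (S.T @ A @ S).tocoo()
    keep = C.row != C.col        # drop self-loops so the result is a DAG
    C = sp.csr_matrix((C.data[keep], (C.row[keep], C.col[keep])), shape=(k, k))
    return C, labels

A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 0, 0, 1],
                            [0, 0, 1, 0]]))
C, labels = condensation(A)
print(labels, C.toarray())   # one DAG edge between the two components
```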
32 Remarks
- Combinatorial algorithms are pervasive in scientific computing and will become more so.
- Path computations on graphs are powerful tools, but efficiency is a challenge on parallel architectures.
- Carefully chosen and implemented primitive operations are key.
- Lots of exciting opportunities for research!