1. Sparse Matrices and Combinatorial Algorithms in Star-P: Status and Plans
April 8, 2005
2. Matlab's sparse matrix design principles
- All operations should give the same results for sparse and full matrices (almost all)
- Sparse matrices are never created automatically, but once created they propagate
- Performance is important -- but usability, simplicity, completeness, and robustness are more important
- Storage for a sparse matrix should be O(nonzeros)
- Time for a sparse operation should be O(flops) (as nearly as possible)
- Star-P dsparse matrices: same principles, some different tradeoffs
4. Sparse matrix operations
- dsparse layout, same semantics as ddense
- For now, only row distribution
- Matrix operators: +, -, max, etc.
- Matrix indexing and concatenation:
  A(1:3, [4 5 2]) = B(:, 7) * C
- A \ b by direct methods
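A minimal Matlab session exercising these operations; in Star-P the same syntax would apply to dsparse matrices (hypothetical data, illustrative only):

    n = 1000;
    A = sprand(n, n, 0.01) + speye(n);  % random sparse matrix, kept nonsingular
    B = sprand(n, n, 0.01);
    C = A + B;                          % elementwise matrix operators
    m = max(A, [], 2);                  % rowwise max
    A(1:3, [4 5 2]) = B(1:3, 1:3);      % indexed assignment
    D = [A B];                          % concatenation
    x = A \ rand(n, 1);                 % linear solve by direct methods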
5. Star-P sparse data structure
- Full:
  - 2-dimensional array of real or complex numbers
  - (nrows x ncols) memory
- Sparse:
  - compressed row storage
  - about (2*nzs + nrows) memory
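To make the storage bound concrete, here is compressed row storage sketched with plain Matlab arrays for a tiny hypothetical matrix (serial Matlab itself stores sparse matrices by compressed columns):

    % A = [ 0 5 0
    %       7 0 8
    %       0 0 9 ]
    rowptr = [1 2 4 5];   % row i occupies entries rowptr(i) : rowptr(i+1)-1
    colind = [2 1 3 3];   % column index of each nonzero, row by row
    vals   = [5 7 8 9];   % numerical values
    % memory: nzs entries each for colind and vals, nrows+1 for rowptr,
    % i.e. about 2*nzs + nrows words
    i = 2;
    cols = colind(rowptr(i):rowptr(i+1)-1);   % columns of row 2: [1 3]
    v    = vals(rowptr(i):rowptr(i+1)-1);     % values of row 2:  [7 8]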
6. Star-P distributed sparse data structure
[Figure: the rows of the matrix are distributed in contiguous blocks across processors P0, P1, P2, ..., Pn]
- Each processor stores:
  - # of local nonzeros
  - range of local rows
  - nonzeros in CSR form
7. Sorting in Star-P
- [V, perm] = sort(V)
- Useful primitive for some sparse array algorithms, e.g. the sparse constructor
- Star-P uses a parallel sample sort
8. The sparse( ) constructor
- A = sparse(I, J, V, nr, nc)
- Input: ddense vectors I, J, V and dimensions nr, nc
- A(I(k), J(k)) = V(k) for each k
- Sort triples (i, j, v) by (i, j)
- Adjust for desired row distribution
- Locally convert to compressed row indices
- Sum values with duplicate indices
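A small serial example of the constructor's semantics, including the duplicate-summing rule (hypothetical data):

    I = [1 3 1 1];
    J = [2 1 2 3];
    V = [10 20 30 40];
    A = sparse(I, J, V, 3, 3);
    % the (1,2) index appears twice, so A(1,2) = 10 + 30 = 40;
    % full(A) is
    %   [ 0 40 40
    %     0  0  0
    %    20  0  0 ]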
9. Sparse matrix times dense vector
- y = A * x
- The first call to matvec caches a communication schedule for matrix A. Later calls to multiply any vector by A use the cached schedule.
- Communication and computation overlap.
10. Sparse linear systems
- x = A \ b
- Matrix division uses MPI-based direct solvers:
  - SuperLU_dist: nonsymmetric, static pivoting
  - MUMPS: nonsymmetric multifrontal
  - PSPASES: Cholesky
- ppsetoption('SparseDirectSolver', 'SUPERLU')
- Iterative solvers implemented in Star-P
- Ongoing work on preconditioners
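Usage sketch; ppsetoption is the call named on this slide, though the exact argument quoting is a reconstruction:

    ppsetoption('SparseDirectSolver', 'SUPERLU');   % select SuperLU_dist
    x = A \ b;                                      % solved by the chosen MPI-based solver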
11. Sparse eigenvalue solvers
- [V, D] = eigs(A), etc.
- Implemented via MPI-based PARPACK (Lehoucq, Maschhoff, Sorensen, Yang)
- Work in progress
12. Application: Fluid dynamics

    function lambda = peigs(A, B, sigma, iter, tol)
    % inverse iteration with shift sigma for the generalized
    % eigenproblem A*x = lambda*B*x
    [m, n] = size(A);
    C = A - sigma * B;
    y = rand(m, 1);
    for k = 1:iter
        q = y ./ norm(y);
        v = B * q;
        y = C \ v;                    % sparse direct solve each iteration
        theta = dot(q, y);
        res = norm(y - theta * q);
        if res < tol
            break
        end
    end
    lambda = 1 / theta;               % eigenvalue of the shifted problem;
                                      % sigma + 1/theta recovers that of (A, B)

- Modeling density-driven instabilities in miscible fluids (Goyal, Meiburg)
- Groundwater modeling, oil recovery, etc.
- Mixed finite difference / spectral method
- Large sparse generalized eigenvalue problem
13. Page ranking example
- Uses power method, i.e. matvec and vector arithmetic
- Page ranking demo from SC03 on a web crawl of mit.edu (170,000 pages)
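A minimal serial power-method sketch of page ranking, assuming G(i,j) = 1 when page j links to page i; the damping factor and tolerance are conventional choices, not taken from the demo:

    n = size(G, 1);
    d = full(sum(G, 1))';              % out-degree of each page
    d(d == 0) = 1;                     % guard against dangling pages
    alpha = 0.85;                      % damping factor
    x = ones(n, 1) / n;
    for k = 1:100
        xnew = alpha * (G * (x ./ d)) + (1 - alpha) / n;
        if norm(xnew - x, 1) < 1e-8, break; end
        x = xnew;
    end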
14. Combinatorial Scientific Computing
- Sparse matrix methods; machine learning; search and information retrieval; adaptive multilevel modeling; geometric meshing; computational biology; . . .
- How will combinatorial methods be used by people who don't understand them in complete detail?
15. Analogy: Matrix division in Matlab
- x = A \ b
- Works for either full or sparse A
- Is A square?
  - no -> use QR to solve least squares problem
- Is A triangular or permuted triangular?
  - yes -> sparse triangular solve
- Is A symmetric with positive diagonal elements?
  - yes -> attempt Cholesky after symmetric minimum degree
- Otherwise
  - -> use LU on A(:, colamd(A))
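A runnable serial sketch of this decision tree for sparse A (simplified: it tests only unpermuted upper triangular form, and real backslash uses cheaper structural tests):

    function x = backslash_sketch(A, b)
    [m, n] = size(A);
    if m ~= n
        [C, R] = qr(A, b);                % sparse least squares via QR
        x = R \ C;
        return
    end
    if nnz(tril(A, -1)) == 0              % upper triangular: back substitution
        x = zeros(n, 1);
        for i = n:-1:1
            x(i) = (b(i) - A(i, i+1:n) * x(i+1:n)) / A(i, i);
        end
        return
    end
    if isequal(A, A') && all(diag(A) > 0)
        p = symamd(A);                    % symmetric minimum degree
        [R, flag] = chol(A(p, p));
        if flag == 0                      % Cholesky succeeded
            x(p, 1) = R \ (R' \ b(p));
            return
        end
    end
    q = colamd(A);                        % general case: LU with colamd
    [L, U, P] = lu(A(:, q));
    x(q, 1) = U \ (L \ (P * b));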
16. Combinatorics in Star-P
- A sparse matrix language is a good start on primitives for combinatorial computing.
- Random-access indexing: A(i,j)
- Neighbor sequencing: find(A(i,:))
- Sparse table construction: sparse(I, J, V)
- Other primitives we are building: sorting, searching, pointer-jumping, connectivity and paths, . . .
17. Matching and depth-first search in Matlab
- dmperm: Dulmage-Mendelsohn decomposition
- Square, full-rank A:
  - [p, q, r] = dmperm(A)
  - A(p,q) is block upper triangular with nonzero diagonal
  - also: strongly connected components of a directed graph
  - also: connected components of an undirected graph
- Arbitrary A:
  - [p, q, r, s] = dmperm(A)
  - maximum-size matching in a bipartite graph
  - minimum-size vertex cover in a bipartite graph
  - decomposition into strong Hall blocks
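A small usage example on a hypothetical 3-by-3 matrix:

    A = sparse([1 1 2 3 3], [1 3 2 1 3], 1, 3, 3);
    [p, q, r] = dmperm(A);
    B = A(p, q);    % block upper triangular with nonzero diagonal;
                    % block k occupies rows and columns r(k) : r(k+1)-1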
18. Connected components (work in progress)
- Sequential Matlab uses depth-first search (dmperm), which doesn't parallelize well
- Shiloach-Vishkin algorithm (see the sketch below):
  - repeat
    - Link every (super)vertex to a random neighbor
    - Shrink each tree to a supervertex by pointer jumping
  - until no further change
- Hybrid SV / search method under construction
- Other potential graph kernels:
  - Breadth-first search (after Husbands, LBNL)
  - Bipartite matching (after Riedy, UCB)
  - Strongly connected components (after Pinar, LBNL)
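A compact serial rendering of the Shiloach-Vishkin loop, built from the array primitives of slide 16; for simplicity it hooks each vertex to its smallest-labeled neighbor rather than a random one:

    function D = sv_components(A)
    % A: sparse symmetric adjacency matrix of an undirected graph.
    % D(v) converges to a canonical label for v's component.
    n = size(A, 1);
    [i, j] = find(A);
    D = (1:n)';                              % every vertex is its own root
    while true
        Dold = D;
        hook = accumarray(i, D(j), [n 1], @min, inf);  % smallest neighbor label
        D = min(D, hook);                    % link step
        while any(D(D) ~= D)                 % shrink by pointer jumping
            D = D(D);
        end
        if isequal(D, Dold), break; end      % no further change
    end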
19. SSCA#2 -- Overview
- Randomly sized cliques
- Most intra-clique edges present
- Random inter-clique edges, gradually 'thinning' with distance
- Integer and character-string edge labels
- Randomized vertex numbers
- Scalable data generator
- Kernel 1: Graph Construction
- Kernel 2: Classify Large Sets (Search)
- Kernel 3: Subgraph Extraction
- Kernel 4: Graph Clustering / Partitioning
20. HPCS Benchmark Spectrum: SSCA#2
21. Scalable Data Generator
- Produce edge tuples containing the start vertex, end vertex, and weight for each directed edge
- Hierarchical nature, with random-sized cliques connected by edges between some of the cliques
- Inter-clique edges are assigned using a random distribution that represents a hierarchical thinning as vertices are further apart
- Weights are either random integer values or randomly selected character strings
- Random permutations to avoid induced locality:
  - vertex numbers (local processor memory locality)
  - parallel layout (global memory locality)
- Data sets must be developed in their entirety before graph construction
- May be parallelized; data tuple vertex naming schemes must be globally consistent
- Data generator will not be timed
- Statistics will be saved for use in later verification stages
22. Kernel 1: Graph Construction
- Construct the multigraph in a format usable by all subsequent kernels
- No subsequent modifications will be permitted to benefit specific kernels
- Graph construction will be timed
23. Kernel 2: Sort on Selected Edge Weights
- Sort on edge weights to determine the vertex pairs with:
  - the largest integer weights
  - a desired string weight
- The two vertex-pair lists will be saved for use in the subsequent kernel:
  - start set SI for integer weights
  - start set SC for character-string weights
- Sorting weights will be timed
24. Kernel 3: Extract Subgraphs
- Extract a series of subgraphs formed by following paths of length k from a start set of initial vertices
- The set of initial vertices will be determined from the previous kernel
- Produce a series of subgraphs consisting of vertices and edges on paths of length k from the vertex pairs in start sets SI and SC
- A possible computational kernel for graph extraction is breadth-first search
- Extracting graphs will be timed
25. Kernel 4: Partition Graph Using a Clustering Algorithm
- Apply a graph clustering algorithm to partition the vertices of the graph into subgraphs no larger than a maximum size, so as to minimize the number of edges that need to be cut, e.g.:
  - Kernighan and Lin
  - Sangiovanni-Vincentelli, Chen, and Chua
- The output of this kernel will be a description of the clusters and the edges that form the interconnecting network
- This step should identify the structure in the graph specified in the data generation phase
- It is important that the implementation of this kernel not use a priori knowledge of the details of the data generator or the statistics collected during graph generation
- Identifying the clusters and the interconnecting network will be timed
26. Concise SSCA#2
- A serial Matlab implementation of SSCA#2
- Coding style is simple and data-parallel
- Our starting point for the Star-P implementation
- Edge labels are integers only, not yet strings
27. Data generation
- Using Lincoln Labs' serial Matlab generator at present
- Data-parallel Star-P generator under construction
- Generate edge triples as distributed dense vectors
28. Kernel 1: Graph construction
- cSSCA2 K1: 30 lines of executable serial Matlab
- Multigraph is a cell array of simple directed graphs
- Graphs are dsparse matrices, created by sparse( )
29. Kernel 2: Search in large sets
- cSSCA2 K2: 12 lines of executable serial Matlab
- Uses max( ) and find( ) to locate edges of max weight:

    ng = length(G.edgeWeights);    % number of graphs
    maxwt = 0;
    for g = 1:ng
        maxwt = max(maxwt, max(max(G.edgeWeights{g})));
    end
    intedges = zeros(0, 2);
    for g = 1:ng
        [intstart, intend] = find(G.edgeWeights{g} == maxwt);
        intedges = [intedges; intstart intend];
    end
30. Kernel 3: Subgraph extraction
- cSSCA2 K3: 25 lines of executable serial Matlab
- Returns subgraphs consisting of vertices and edges within specified distance of specified starting vertices
- Sparse matrix-vector product for breadth-first search (sketched below)
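A serial sketch of breadth-first search by repeated sparse matvec, assuming G(i,j) is nonzero for a directed edge from i to j; this shows the loop structure, not the cSSCA2 code itself:

    function reach = bfs_by_matvec(G, start, k)
    % vertices reachable from 'start' along paths of length <= k
    n = size(G, 1);
    reach = logical(sparse(start, 1, 1, n, 1));   % visited set
    front = reach;                                % current frontier
    for level = 1:k
        next = (G' * double(front)) ~= 0;         % out-neighbors of frontier
        front = next & ~reach;                    % keep unseen vertices only
        if ~any(front), break; end
        reach = reach | front;
    end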
31. Kernel 4: Partitioning / clustering
- cSSCA2 K4: serial prototype exists, not yet parallel
- Different algorithm from Lincoln Labs' executable spec, which uses seed-growing breadth-first search
- cSSCA2 uses spectral methods for clustering, based on eigenvectors (Fiedler vectors) of the graph
- One version seeks small balanced edge cuts (graph partitioning); another seeks clusters with low isoperimetric number (clustering)
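One spectral bisection step, sketched serially for a connected graph; the Laplacian construction is standard, but the eigs options and the median split are conventional choices, not read from the cSSCA2 code:

    function [part1, part2] = fiedler_split(A)
    % A: sparse symmetric adjacency matrix of a connected graph
    n = size(A, 1);
    d = full(sum(A, 2));
    L = spdiags(d, 0, n, n) - A;      % graph Laplacian L = D - A
    [V, D] = eigs(L, 2, 'sa');        % two algebraically smallest eigenpairs
    [ignore, ix] = max(diag(D));      % the nonzero one belongs to the Fiedler vector
    f = V(:, ix);
    part1 = find(f >= median(f));     % balanced split at the median
    part2 = find(f <  median(f));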