Title: pOSKI: A Library to Parallelize OSKI
1. pOSKI: A Library to Parallelize OSKI
- Ankit Jain
- Berkeley Benchmarking and OPtimization (BeBOP) Project - bebop.cs.berkeley.edu
- EECS Department, University of California, Berkeley
- April 28, 2008
2. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
3. pOSKI Goals
- Provide a simple serial interface to exploit the parallelism in sparse kernels (focus on SpMV for now)
- Target multicore architectures
- Hide the complex process of parallel tuning while exposing its cost
- Use heuristics, where possible, to limit the search space
- Design it to be extensible so it can be used in conjunction with other parallel libraries (e.g. ParMETIS)

Take Sam's work and present it in a distributable, easy-to-use format.
4. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
5. OSKI: Optimized Sparse Kernel Interface
- Sparse kernels tuned for the user's matrix and machine
  - Hides complexity of run-time tuning
- Low-level BLAS-style functionality
  - Sparse matrix-vector multiply (SpMV), triangular solve (TrSV), ...
  - Includes fast locality-aware kernels: AᵀAx, ...
- Target: cache-based superscalar uniprocessors
- Faster than standard implementations
  - Up to 4x faster SpMV, 1.8x TrSV, 4x AᵀAx
- Written in C (can call from Fortran)

Note: all speedups listed are from sequential platforms in 2005.
6. How OSKI Tunes (Overview)
[Flowchart] Library install-time (offline): 1. build for target architecture; 2. benchmark. This produces benchmark data and generated code variants.
Application run-time: the user's matrix, a workload from program monitoring, and history feed heuristic models; the library 1. evaluates the models and 2. selects a data structure and code, returning a matrix handle to the user for kernel calls.
Extensibility: advanced users may dynamically add code variants and heuristic models to the system.
7. Cost of Tuning
- Non-trivial run-time tuning cost: up to 40 mat-vecs
  - Dominated by conversion time
- Design point: user calls tune routine explicitly
  - Exposes cost
  - Tuning time limited using estimated workload
    - Provided by user or inferred by library
- User may save tuning results
  - To apply on future runs with similar matrix
  - Stored in human-readable format
8. Optimizations Available in OSKI
- Optimizations for SpMV (bold ⇒ heuristics)
  - Register blocking (RB): up to 4x over CSR
  - Variable block splitting: 2.1x over CSR, 1.8x over RB
  - Diagonals: 2x over CSR
  - Reordering to create dense structure + splitting: 2x over CSR
  - Symmetry: 2.8x over CSR, 2.6x over RB
  - Cache blocking: 3x over CSR
  - Multiple vectors (SpMM): 7x over CSR
  - And combinations...
- Sparse triangular solve
  - Hybrid sparse/dense data structure: 1.8x over CSR
- Higher-level kernels
  - AAᵀx, AᵀAx: 4x over CSR, 1.8x over RB
  - Aᵏx: 2x over CSR, 1.5x over RB

Note: all speedups listed are from sequential platforms in 2005.
9. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
10. How pOSKI Tunes (Overview)
[Flowchart] Library install-time (offline): build for target architecture, then run the parallel benchmark, producing parallel benchmark data and parallel heuristic models.
Application run-time (online): pOSKI load-balances the user's matrix into submatrices. Each submatrix goes to an OSKI instance, which evaluates its heuristic models (fed by the parallel benchmark data, history, and generated code variants) and selects a data structure and code, yielding an OSKI matrix handle. pOSKI evaluates a parallel model per submatrix and accumulates the per-submatrix handles into a single pOSKI matrix handle, returned to the user for kernel calls.
11. Where the Optimizations Occur
12. Current Implementation
- The Serial Interface
  - Represents the S∘P composition of the ParLab proposal; the parallelism is hidden under the covers
  - Each serial-looking function call triggers a set of parallel events
- Manages its own thread pool
  - Supports up to the number of threads supported by the underlying hardware
- Manages thread and data affinity
13. Additional Future Interface
- The Parallel Interface
  - Represents the P∘P composition of the ParLab proposal
  - Meant for expert programmers
  - Can be used to share threads with other parallel libraries
  - No guarantees of thread or data affinity management
- Example use: y ← AᵀAx codes
  - Alternate between SpMV and a preconditioning step
  - Share threads between pOSKI (for SpMV) and some parallel preconditioning library
- Example use: UPC code
  - Explicitly parallel execution model
  - User partitions the matrix based on information pOSKI would not be able to infer
14. Thread and Data Affinity (1/3)
- Memory access times are non-uniform (cache-coherent NUMA, or ccNUMA) on modern multi-socket, multicore architectures
- Modern OS: first-touch policy in allocating memory
- Thread migration between locality domains is expensive
  - In ccNUMA, a locality domain is a set of processor cores together with locally connected memory, which can be accessed without resorting to a network of any kind
- For now, we have to deal with these OS policies ourselves. The ParLab OS group is trying to solve these problems in order to hide such issues from the programmer.
15. Thread and Data Affinity (2/3)
- The problem with malloc() and free()
  - malloc() first looks for free pages on the heap and only then asks the OS to allocate new pages
  - If the available free pages reside on a different locality domain, malloc() still hands them out
  - Autotuning codes are malloc()- and free()-intensive, so this is a huge problem
16. Thread and Data Affinity (3/3)
- The solution: managing our own memory
  - One large chunk (heap) allocated at the beginning of tuning, per locality domain
  - Size of this heap controlled by user input through the environment variable P_OSKI_HEAP_IN_GB
  - Rare case: the allocated space is not big enough
    - Stop all threads
    - Free all allocated memory
    - Grow the amount of space significantly across all threads and locality domains
    - Print a strong warning to the user
17. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
18. Justification
- OSKI's benchmarking
  - Single-threaded
  - All the memory bandwidth is given to this one thread
- pOSKI's benchmarking
  - Benchmarks 1, 2, 4, ... threads (up to the hardware limit) in parallel
  - Each thread uses up memory bandwidth, which resembles run-time conditions more accurately
  - When each instance of OSKI chooses appropriate data structures and algorithms, it uses the data from this parallel benchmark
19. Results (1/2)
20. Results (2/2)
- Justifies the need for search
- Need heuristics to reduce this, since the multicore search space is expanding exponentially
21. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
22. Goals
- Target multi-node, multicore architectures
- Design: build an MPI layer on top of pOSKI
  - MPI is a starting point
- Tuning parameters
  - Balance of Pthreads and MPI tasks
    - Rajesh has found that for collectives, the balance is not always clear
  - Identifying whether there are potential performance gains from assigning some of the threads (or cores) to only handle sending/receiving of messages
- Status
  - Just started; should have an initial version in the next few weeks
- Future work
  - Explore UPC for communication
  - Distributed load balancing, workload generation
23. Questions?
- pOSKI Goals
- OSKI Overview
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
24. Extra Slides
25. Motivation: The Difficulty of Tuning
- n = 21216
- nnz = 1.5 M
- kernel: SpMV
- Source: NASA structural analysis problem
26. Speedups on Itanium 2: The Need for Search
27. Extra Slides
- Some Current Multicore Machines
28. Rad Lab Opteron
29. Niagara 2 (Victoria Falls)
30. NERSC Power5 (Bassi)
31. Cell Processor
32. Extra Slides
- SpBLAS and OSKI Interfaces
33. SpBLAS Interface
- Create a matrix handle
- Assert matrix properties
- Insert matrix entries
- Signal the end of matrix creation ← tuning happens here
- Call operations on the handle
- Destroy the handle
34. OSKI Interface
- The basic OSKI interface has a subset of the matrix creation interface of the Sparse BLAS, exposes the tuning step explicitly, and supports a few extra kernels (e.g., AᵀAx).
- The OSKI interface was designed with the intent of implementing the Sparse BLAS using OSKI under the hood.
35. Extra Slides
36. Challenges of a Parallel Automatic Tuner
- Search space increases exponentially with the number of parameters
- Parallelization across architectural parameters
  - Across multiple threads
  - Across multiple cores
  - Across multiple sockets
- Parallelizing the data of a given problem
  - Across rows, across columns, or checkerboard
  - Based on user input in v1
  - Future versions can integrate ParMETIS or other graph partitioners
37. A Memory Footprint Minimization Heuristic
- The problem: the search space is too large ⇒ auto-tuning takes too long
- The rate of increase in aggregate memory bandwidth over time is not as fast as the rate of increase in processing power per machine
- Our two-step tuning process
  1. Calculate the top 20 memory-efficient configurations on Thread 0
  2. Each thread finds its optimal block size for its sub-matrix from the list in Step 1