Title: pOSKI: A Library to Parallelize OSKI
1. pOSKI: A Library to Parallelize OSKI
- Ankit Jain
- Berkeley Benchmarking and OPtimization (BeBOP) Project - bebop.cs.berkeley.edu
- EECS Department, University of California, Berkeley
- April 28, 2008
2. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
3. pOSKI Goals
- Provide a simple serial interface to exploit the parallelism in sparse kernels (focus on SpMV for now)
- Target multicore architectures
- Hide the complex process of parallel tuning while exposing its cost
- Use heuristics, where possible, to limit the search space
- Design it to be extensible so it can be used in conjunction with other parallel libraries (e.g. ParMETIS)

Take Sam's work and present it in a distributable, easy-to-use format.
4. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
5. OSKI: Optimized Sparse Kernel Interface
- Sparse kernels tuned for the user's matrix and machine
  - Hides complexity of run-time tuning
- Low-level BLAS-style functionality
  - Sparse matrix-vector multiply (SpMV), triangular solve (TrSV), ...
  - Includes fast locality-aware kernels: AᵀAx, ...
- Target: cache-based superscalar uniprocessors
- Faster than standard implementations
  - Up to 4x faster SpMV, 1.8x TrSV, 4x AᵀAx
- Written in C (can call from Fortran)

Note: all speedups listed are from sequential platforms in 2005.
6. How OSKI Tunes (Overview)
[Flowchart] Library install-time (offline): 1. build for target architecture; 2. benchmark. This produces benchmark data and generated code variants.
Application run-time: the user's matrix, a workload from program monitoring, and history feed heuristic models; the library 1. evaluates the models and 2. selects a data structure and code, returning a matrix handle to the user for kernel calls.
Extensibility: advanced users may dynamically add code variants and heuristic models to the system.
7. Cost of Tuning
- Non-trivial run-time tuning cost: up to 40 mat-vecs
  - Dominated by conversion time
- Design point: user calls tune routine explicitly
  - Exposes cost
  - Tuning time limited using estimated workload
    - Provided by user or inferred by library
- User may save tuning results
  - To apply on future runs with similar matrix
  - Stored in human-readable format
8. Optimizations Available in OSKI
- Optimizations for SpMV (bold ⇒ heuristics)
  - Register blocking (RB): up to 4x over CSR
  - Variable block splitting: 2.1x over CSR, 1.8x over RB
  - Diagonals: 2x over CSR
  - Reordering to create dense structure + splitting: 2x over CSR
  - Symmetry: 2.8x over CSR, 2.6x over RB
  - Cache blocking: 3x over CSR
  - Multiple vectors (SpMM): 7x over CSR
  - And combinations...
- Sparse triangular solve
  - Hybrid sparse/dense data structure: 1.8x over CSR
- Higher-level kernels
  - AAᵀx, AᵀAx: 4x over CSR, 1.8x over RB
  - Aᵏx: 2x over CSR, 1.5x over RB

Note: all speedups listed are from sequential platforms in 2005.
9. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
10. How pOSKI Tunes (Overview)
[Flowchart] Library install-time (offline): build for target architecture, then run the parallel benchmark, producing parallel benchmark data and parallel heuristic models.
Application run-time (online): pOSKI load-balances the user's matrix into submatrices. Each submatrix goes to an OSKI instance, which evaluates its heuristic models (fed by the parallel benchmark data, history, and generated code variants) and selects a data structure and code, yielding an OSKI matrix handle. pOSKI evaluates a parallel model per submatrix and accumulates the per-submatrix handles into a single pOSKI matrix handle, returned to the user for kernel calls.
11. Where the Optimizations Occur
12. Current Implementation
- The Serial Interface
  - Represents the S∘P composition of the ParLab proposal; the parallelism is hidden under the covers
  - Each serial-looking function call triggers a set of parallel events
- Manages its own thread pool
  - Supports up to the number of threads supported by the underlying hardware
- Manages thread and data affinity
13. Additional Future Interface
- The Parallel Interface
  - Represents the P∘P composition of the ParLab proposal
  - Meant for expert programmers
  - Can be used to share threads with other parallel libraries
  - No guarantees of thread or data affinity management
- Example use: y ← AᵀAx codes
  - Alternate between SpMV and a preconditioning step
  - Share threads between pOSKI (for SpMV) and some parallel preconditioning library
- Example use: UPC code
  - Explicitly parallel execution model
  - User partitions the matrix based on information pOSKI would not be able to infer
14. Thread and Data Affinity (1/3)
- Memory access times are non-uniform (cache-coherent NUMA, or ccNUMA) on modern multi-socket, multicore architectures
- Modern OS: first-touch policy in allocating memory
- Thread migration between locality domains is expensive
  - In ccNUMA, a locality domain is a set of processor cores together with locally connected memory, which can be accessed without resorting to a network of any kind
- For now, we have to deal with these OS policies ourselves. The ParLab OS group is trying to solve these problems in order to hide such issues from the programmer.
15. Thread and Data Affinity (2/3)
- The problem with malloc() and free()
  - malloc() first looks for free pages on the heap and only then asks the OS to allocate new pages
  - If the available free pages reside on a different locality domain, malloc() still hands them out
  - Autotuning codes are malloc()- and free()-intensive, so this is a huge problem
16. Thread and Data Affinity (3/3)
- The solution: managing our own memory
  - One large chunk (heap) allocated at the beginning of tuning, per locality domain
  - Size of this heap controlled by user input through the environment variable P_OSKI_HEAP_IN_GB
  - Rare case: the allocated space is not big enough
    - Stop all threads
    - Free all allocated memory
    - Grow the amount of space significantly across all threads and locality domains
    - Print a strong warning to the user
17. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
18. Justification
- OSKI's benchmarking
  - Single-threaded
  - All the memory bandwidth is given to this one thread
- pOSKI's benchmarking
  - Benchmarks 1, 2, 4, ... threads (up to the hardware limit) in parallel
  - Each thread uses up memory bandwidth, which resembles run-time conditions more accurately
  - When each instance of OSKI chooses appropriate data structures and algorithms, it uses the data from this parallel benchmark
19. Results (1/2)
20. Results (2/2)
- Justifies the need for search
- Need heuristics to reduce this, since the multicore search space is expanding exponentially
21. Outline
- pOSKI Goals
- OSKI Overview (slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
22. Goals
- Target multi-node, multicore architectures
- Design: build an MPI layer on top of pOSKI
  - MPI is a starting point
- Tuning parameters
  - Balance of Pthreads and MPI tasks
    - Rajesh has found that for collectives, the balance is not always clear
  - Identifying whether there are potential performance gains from assigning some of the threads (or cores) to only handle sending/receiving of messages
- Status
  - Just started; should have an initial version in the next few weeks
- Future work
  - Explore UPC for communication
  - Distributed load balancing, workload generation
23. Questions?
- pOSKI Goals
- OSKI Overview
- pOSKI Design
- Parallel Benchmark
- MPI-SpMV
24. Extra Slides
25. Motivation: The Difficulty of Tuning
- n = 21216
- nnz = 1.5 M
- kernel: SpMV
- Source: NASA structural analysis problem
26. Speedups on Itanium 2: The Need for Search
27. Extra Slides
- Some Current Multicore Machines
28. Rad Lab Opteron
29. Niagara 2 (Victoria Falls)
30. NERSC Power5 (Bassi)
31. Cell Processor
32. Extra Slides
- SpBLAS and OSKI Interfaces
33. SpBLAS Interface
- Create a matrix handle
- Assert matrix properties
- Insert matrix entries
- Signal the end of matrix creation ← tuning happens here
- Call operations on the handle
- Destroy the handle
34. OSKI Interface
- The basic OSKI interface has a subset of the matrix creation interface of the Sparse BLAS, exposes the tuning step explicitly, and supports a few extra kernels (e.g., AᵀAx).
- The OSKI interface was designed with the intent of implementing the Sparse BLAS using OSKI under the hood.
35. Extra Slides
36. Challenges of a Parallel Automatic Tuner
- Search space increases exponentially with the number of parameters
- Parallelization across architectural parameters
  - Across multiple threads
  - Across multiple cores
  - Across multiple sockets
- Parallelizing the data of a given problem
  - Across rows, across columns, or checkerboard
  - Based on user input in v1
  - Future versions can integrate ParMETIS or other graph partitioners
37. A Memory Footprint Minimization Heuristic
- The problem: the search space is too large ⇒ auto-tuning takes too long
- The rate of increase in aggregate memory bandwidth over time is not as fast as the rate of increase in processing power per machine
- Our two-step tuning process
  1. Calculate the top 20 memory-efficient configurations on Thread 0
  2. Each thread finds its optimal block size for its sub-matrix from the list in Step 1