pOSKI: A Library to Parallelize OSKI
1
pOSKI: A Library to Parallelize OSKI
  • Ankit Jain
  • Berkeley Benchmarking and OPtimization (BeBOP)
    Project
  • bebop.cs.berkeley.edu
  • EECS Department, University of California,
    Berkeley
  • April 28, 2008

2
Outline
  • pOSKI Goals
  • OSKI Overview
  • (Slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
  • pOSKI Design
  • Parallel Benchmark
  • MPI-SpMV

3
pOSKI Goals
  • Provide a simple serial interface to exploit the
    parallelism in sparse kernels (focus on SpMV for
    now)
  • Target Multicore Architectures
  • Hide the complex process of parallel tuning while
    exposing its cost
  • Use heuristics, where possible, to limit search
    space
  • Design it to be extensible so it can be used in
    conjunction with other parallel libraries (e.g.
    ParMETIS)

Take Sam's work and present it in a distributable, easy-to-use format.
4
Outline
  • pOSKI Goals
  • OSKI Overview
  • (Slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
  • pOSKI Design
  • Parallel Benchmark
  • MPI-SpMV

5
OSKI: Optimized Sparse Kernel Interface
  • Sparse kernels tuned for the user's matrix and machine
  • Hides complexity of run-time tuning
  • Low-level, BLAS-style functionality
  • Sparse matrix-vector multiply (SpMV), triangular solve (TrSV), ...
  • Includes fast locality-aware kernels: AᵀAx, ...
  • Targets cache-based superscalar uniprocessors
  • Faster than standard implementations
  • Up to 4x faster SpMV, 1.8x TrSV, 4x AᵀAx
  • Written in C (can call from Fortran)

Note: All speedups listed are from sequential platforms in 2005.
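As a baseline for what OSKI tunes, a plain CSR SpMV kernel (a minimal reference sketch, not OSKI's generated code) looks like this:

```c
#include <stddef.h>

/* y = alpha*A*x + beta*y, with A in compressed sparse row (CSR) format.
 * rowptr has nrows+1 entries; colind and val have rowptr[nrows] entries. */
void csr_spmv(size_t nrows, const size_t *rowptr, const size_t *colind,
              const double *val, double alpha, const double *x,
              double beta, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double acc = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
            acc += val[k] * x[colind[k]];
        y[i] = alpha * acc + beta * y[i];
    }
}
```

The irregular access to x through colind is what makes this kernel memory-bound and worth tuning.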
6
How OSKI Tunes (Overview)
[Flow diagram]
  • Library install-time (offline): 1. Build for target arch. 2. Benchmark → benchmark data, generated code variants
  • Application run-time: workload from program monitoring, history, and the matrix feed into 1. Evaluate heuristic models 2. Select data structure and code → to user: matrix handle for kernel calls
  • Extensibility: advanced users may write and dynamically add code variants and heuristic models to the system
7
Cost of Tuning
  • Non-trivial run-time tuning cost: up to 40 mat-vecs
  • Dominated by conversion time
  • Design point: the user calls the tune routine explicitly
  • Exposes the cost
  • Tuning time is limited using an estimated workload
  • Provided by the user or inferred by the library
  • The user may save tuning results
  • To apply on future runs with a similar matrix
  • Stored in human-readable format
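The "exposes its cost" design point can be made concrete: if tuning costs the equivalent of 40 untuned mat-vecs (the figure above) and the tuned kernel is s times faster, tuning pays off after 40 / (1 - 1/s) calls. A sketch of this break-even arithmetic (illustrative numbers only):

```c
/* Number of SpMV calls after which an up-front tuning cost
 * (measured in untuned mat-vec equivalents) is amortized.
 * speedup must be > 1. */
double breakeven_calls(double tuning_cost_matvecs, double speedup)
{
    /* each tuned call saves (1 - 1/speedup) of one untuned call */
    return tuning_cost_matvecs / (1.0 - 1.0 / speedup);
}
```

E.g., with a 2x speedup, a 40-mat-vec tuning cost is recovered after 80 calls, which is why the workload estimate matters.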

8
Optimizations Available in OSKI
  • Optimizations for SpMV (bold = heuristic available)
  • Register blocking (RB): up to 4x over CSR
  • Variable block splitting: 2.1x over CSR, 1.8x over RB
  • Diagonals: 2x over CSR
  • Reordering to create dense structure + splitting: 2x over CSR
  • Symmetry: 2.8x over CSR, 2.6x over RB
  • Cache blocking: 3x over CSR
  • Multiple vectors (SpMM): 7x over CSR
  • And combinations...
  • Sparse triangular solve
  • Hybrid sparse/dense data structure: 1.8x over CSR
  • Higher-level kernels
  • AAᵀx, AᵀAx: 4x over CSR, 1.8x over RB
  • Aᵏx (matrix powers): 2x over CSR, 1.5x over RB

Note: All speedups listed are from sequential platforms in 2005.
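Register blocking, the biggest win above, stores small dense r×c tiles so the destination values stay in registers across the inner loop. A simplified sketch for fixed 2×2 blocks (OSKI generates many r×c variants; partially dense tiles are padded with explicit zeros):

```c
#include <stddef.h>

/* y += A*x with A in 2x2 block CSR (BCSR): each stored block is a dense
 * 2x2 tile, 4 doubles in row-major order. browptr indexes block rows;
 * bcolind gives each block's block-column index. */
void bcsr22_spmv(size_t nbrows, const size_t *browptr, const size_t *bcolind,
                 const double *val, const double *x, double *y)
{
    for (size_t I = 0; I < nbrows; I++) {
        double y0 = 0.0, y1 = 0.0;            /* register-resident */
        for (size_t k = browptr[I]; k < browptr[I + 1]; k++) {
            const double *b = val + 4 * k;
            double x0 = x[2 * bcolind[k]];
            double x1 = x[2 * bcolind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I]     += y0;
        y[2 * I + 1] += y1;
    }
}
```

The trade-off the heuristic weighs: fewer index loads per nonzero versus the explicit zeros added to fill the tiles.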
9
Outline
  • pOSKI Goals
  • OSKI Overview
  • (Slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
  • pOSKI Design
  • Parallel Benchmark
  • MPI-SpMV

10
How pOSKI Tunes (Overview)
[Flow diagram]
  • Library install-time (offline): pOSKI builds for the target arch. and runs a parallel benchmark → parallel benchmark data and parallel heuristic models; underneath, OSKI builds for the target arch. and benchmarks → benchmark data, generated code variants, heuristic models
  • Application run-time (online): the matrix is load-balanced into submatrices; for each submatrix, pOSKI evaluates the parallel model and OSKI (using history) evaluates its models and selects a data structure and code → an OSKI matrix handle per submatrix for kernel calls
  • pOSKI accumulates the handles → to user: a pOSKI matrix handle for kernel calls
11
Where the Optimizations Occur
12
Current Implementation
  • The Serial Interface
  • Represents the S→P composition of the ParLab proposal: the parallelism is hidden under the covers
  • Each serial-looking function call triggers a set of parallel events
  • Manages its own thread pool
  • Supports up to the number of threads supported by the underlying hardware
  • Manages thread and data affinity

13
Additional Future Interface
  • The Parallel Interface
  • Represents the P→P composition of the ParLab proposal
  • Meant for expert programmers
  • Can be used to share threads with other parallel libraries
  • No guarantees of thread or data affinity management
  • Example use: y ← AᵀAx codes
  • Alternate between SpMV and a preconditioning step
  • Share threads between pOSKI (for SpMV) and some parallel preconditioning library
  • Example use: UPC code
  • Explicitly parallel execution model
  • The user partitions the matrix based on information pOSKI would not be able to infer

14
Thread and Data Affinity (1/3)
  • Memory access times are non-uniform on modern multi-socket, multi-core architectures (ccNUMA: cache-coherent non-uniform memory access)
  • Modern OSes use a first-touch policy when allocating memory
  • Thread migration between locality domains is expensive
  • In ccNUMA, a locality domain is a set of processor cores together with locally connected memory, which can be accessed without resorting to a network of any kind
  • For now, we have to deal with these OS policies ourselves. The ParLab OS group is trying to solve these problems in order to hide such issues from the programmer.

15
Thread and Data Affinity (2/3)
  • The problem with malloc() and free()
  • malloc() first looks for free pages on the heap and then asks the OS to allocate new pages
  • If the available free pages reside in a different locality domain, malloc() still allocates them
  • Autotuning codes are malloc()- and free()-intensive, so this is a huge problem

16
Thread and Data Affinity (3/3)
  • The solution: managing our own memory
  • One large chunk (heap) is allocated at the beginning of tuning, per locality domain
  • The size of this heap is controlled by user input through the environment variable P_OSKI_HEAP_IN_GB
  • Rare case: the allocated space is not big enough
  • Stop all threads
  • Free all allocated memory
  • Grow the amount of space significantly across all threads and locality domains
  • Print a strong warning to the user
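The large-chunk scheme above can be sketched as a simple bump allocator (illustrative only; the names and alignment policy here are assumptions, not pOSKI's actual implementation):

```c
#include <stdlib.h>
#include <stddef.h>

/* Illustrative bump allocator: one large chunk grabbed up front (e.g. one
 * per locality domain), then carved out without further malloc() calls,
 * so pages are first touched -- and thus placed -- by the owning thread. */
typedef struct {
    char  *base;   /* start of the chunk */
    size_t size;   /* total bytes */
    size_t used;   /* bytes handed out so far */
} domain_heap;

int heap_init(domain_heap *h, size_t bytes)
{
    h->base = malloc(bytes);
    h->size = bytes;
    h->used = 0;
    return h->base != NULL;
}

/* Returns NULL when the chunk is exhausted (the "rare case" above, where
 * pOSKI stops all threads, frees everything, regrows, and warns). */
void *heap_alloc(domain_heap *h, size_t bytes)
{
    size_t aligned = (bytes + 15) & ~(size_t)15;   /* 16-byte alignment */
    if (h->used + aligned > h->size)
        return NULL;
    void *p = h->base + h->used;
    h->used += aligned;
    return p;
}

void heap_destroy(domain_heap *h) { free(h->base); h->base = NULL; }
```

Allocation is a pointer bump, so the malloc()/free()-intensive tuning loop never crosses locality domains.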

17
Outline
  • pOSKI Goals
  • OSKI Overview
  • (Slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
  • pOSKI Design
  • Parallel Benchmark
  • MPI-SpMV

18
Justification
  • OSKI's benchmarking:
  • Single-threaded
  • All the memory bandwidth is given to this one thread
  • pOSKI's benchmarking:
  • Benchmarks 1, 2, 4, ..., threads (up to the hardware limit) in parallel
  • Each thread consumes memory bandwidth, which resembles run-time conditions more accurately
  • When each instance of OSKI chooses appropriate data structures and algorithms, it uses the data from this parallel benchmark

19
Results (1/2)
20
Results (2/2)
  • Justifies the need for search
  • Heuristics are needed to reduce it, since the multicore search space is expanding exponentially

21
Outline
  • pOSKI Goals
  • OSKI Overview
  • (Slides adapted from Rich Vuduc's SIAM CSE 2005 talk)
  • pOSKI Design
  • Parallel Benchmark
  • MPI-SpMV

22
Goals
  • Target multi-node, multi-core architectures
  • Design: build an MPI layer on top of pOSKI
  • MPI is a starting point
  • Tuning parameters:
  • The balance of pthreads and MPI tasks
  • Rajesh has found that for collectives, the balance is not always clear
  • Identifying whether there are potential performance gains from assigning some of the threads (or cores) to only handle sending/receiving of messages
  • Status:
  • Just started; an initial version should be ready in the next few weeks
  • Future work:
  • Explore UPC for communication
  • Distributed load balancing, workload generation
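One ingredient of load balancing across MPI tasks and threads is splitting the matrix rows so each partition carries roughly equal nonzeros. A greedy sketch (illustrative; pOSKI's actual partitioning, or a graph partitioner like ParMETIS, is more sophisticated):

```c
#include <stddef.h>

/* Split nrows rows into nparts contiguous chunks with roughly equal
 * nonzeros. rowptr is the CSR row pointer; on return, part p owns rows
 * split[p] .. split[p+1]-1. Greedy rule: cut as soon as the running
 * nonzero count reaches the next ideal boundary p/nparts of the total. */
void partition_rows_by_nnz(size_t nrows, const size_t *rowptr,
                           size_t nparts, size_t *split)
{
    size_t total = rowptr[nrows];
    size_t p = 1;
    split[0] = 0;
    for (size_t i = 0; i < nrows && p < nparts; i++) {
        if (rowptr[i + 1] * nparts >= p * total)
            split[p++] = i + 1;
    }
    while (p <= nparts)          /* close any remaining (empty) parts */
        split[p++] = nrows;
}
```

Equal-row splits can be badly skewed when a few rows are dense, which is why the cut points follow nnz rather than row count.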

23
Questions?
  • pOSKI Goals
  • OSKI Overview
  • pOSKI Design
  • Parallel Benchmark
  • MPI-SpMV

24
Extra Slides
  • Motivation for Tuning

25
Motivation: The Difficulty of Tuning
  • n = 21,216
  • nnz = 1.5 M
  • kernel = SpMV
  • Source: NASA structural analysis problem

26
Speedups on Itanium 2: The Need for Search
27
Extra Slides
  • Some Current Multicore Machines

28
Rad Lab Opteron
29
Niagara 2 (Victoria Falls)
30
NERSC Power5 (Bassi)
31
Cell Processor
32
Extra Slides
  • SpBLAS and OSKI Interfaces

33
SpBLAS Interface
  • Create a matrix handle
  • Assert matrix properties
  • Insert matrix entries
  • Signal the end of matrix creation
  • Call operations on the handle
  • Destroy the handle

↑ Tune here (at the end of matrix creation)
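The lifecycle above can be sketched with a toy handle API (hypothetical names, not the real Sparse BLAS signatures):

```c
#include <stdlib.h>
#include <stddef.h>

/* Toy sparse handle mirroring the SpBLAS lifecycle above (hypothetical
 * API). Entries are collected as COO triplets; "end of creation" is the
 * point where a library like OSKI would convert and tune the structure. */
typedef struct {
    size_t *row, *col;
    double *val;
    size_t  nnz, cap;
    int     finalized;     /* set by handle_end(); tuning happens there */
} sp_handle;

sp_handle *handle_create(size_t cap)            /* 1. create handle */
{
    sp_handle *h = calloc(1, sizeof *h);
    h->row = malloc(cap * sizeof *h->row);
    h->col = malloc(cap * sizeof *h->col);
    h->val = malloc(cap * sizeof *h->val);
    h->cap = cap;
    return h;
}

int handle_insert(sp_handle *h, size_t i, size_t j, double v)
{                                                /* 3. insert entries */
    if (h->finalized || h->nnz == h->cap) return 0;
    h->row[h->nnz] = i; h->col[h->nnz] = j; h->val[h->nnz] = v;
    h->nnz++;
    return 1;
}

void handle_end(sp_handle *h)                    /* 4. end creation: tune */
{
    h->finalized = 1;
}

int handle_spmv(const sp_handle *h, const double *x, double *y)
{                                                /* 5. operate: y += A*x */
    if (!h->finalized) return 0;
    for (size_t k = 0; k < h->nnz; k++)
        y[h->row[k]] += h->val[k] * x[h->col[k]];
    return 1;
}

void handle_destroy(sp_handle *h)                /* 6. destroy handle */
{
    free(h->row); free(h->col); free(h->val); free(h);
}
```

The point of the handle design: because all entries are in hand when creation ends, the library is free to pick any internal format there.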
34
OSKI Interface
  • The basic OSKI interface has a subset of the matrix-creation interface of the Sparse BLAS, exposes the tuning step explicitly, and supports a few extra kernels (e.g., AᵀAx).
  • The OSKI interface was designed with the intent of implementing the Sparse BLAS using OSKI under the hood.

35
Extra Slides
  • Other Ideas for pOSKI

36
Challenges of a Parallel Automatic Tuner
  • The search space increases exponentially with the number of parameters
  • Parallelization across architectural parameters:
  • Across multiple threads
  • Across multiple cores
  • Across multiple sockets
  • Parallelizing the data of a given problem:
  • Across rows, across columns, or checkerboard
  • Based on user input in v1
  • Future versions can integrate ParMETIS or other graph partitioners

37
A Memory Footprint Minimization Heuristic
  • The problem: the search space is too large → auto-tuning takes too long
  • The rate of increase in aggregate memory bandwidth over time is not as fast as the rate of increase in processing power per machine
  • Our two-step tuning process:
  • 1. Calculate the 20 most memory-efficient configurations on Thread 0
  • 2. Each thread finds its optimal block size for its sub-matrix from the list in Step 1
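Step 1 needs a footprint estimate per r×c block size. A sketch of such an estimate (the formula and 4-byte index width are assumptions for illustration; OSKI estimates the fill ratio by sampling the matrix):

```c
#include <stddef.h>

/* Estimated BCSR footprint in bytes for block size r x c:
 *   stored values:        nnz * fill * 8   (fill >= 1 counts the explicit
 *                                           zeros that pad partial blocks)
 *   block column indices: (nnz * fill / (r*c)) * 4
 *   block row pointer:    (nrows/r + 1) * 4
 * fill is the measured fill ratio for this block size (an input here). */
double bcsr_footprint_bytes(size_t nnz, size_t nrows,
                            size_t r, size_t c, double fill)
{
    double stored = (double)nnz * fill;
    return stored * 8.0
         + stored / (double)(r * c) * 4.0
         + ((double)nrows / (double)r + 1.0) * 4.0;
}
```

Ranking all candidate (r, c, fill) triples by this number and keeping the top 20 gives Thread 0's shortlist; each thread then only times that shortlist on its own sub-matrix.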