Title: STAPL: A High Productivity Programming Infrastructure for Parallel & Distributed Computing
1 STAPL: A High Productivity Programming Infrastructure for Parallel & Distributed Computing
- Lawrence Rauchwerger
- Parasol Lab, Dept of Computer Science
- Texas A&M University
- http://parasol.tamu.edu/rwerger/
2 Motivation
- Parallel programming is costly
- Parallel programs are not portable
- Scalability & efficiency are (usually) poor
- Dynamic programs are even harder
- Small-scale parallel machines are ubiquitous
3 Our Approach: STAPL
- STAPL: a parallel components library
  - Extensible, open-ended
- Parallel superset of STL
  - Sequential inter-operability
- Layered architecture: User - Developer - Specialist
  - Extensible
  - Portable (only the lowest layer needs to be specialized)
- High productivity environment
  - Components have (almost) sequential interfaces
4 STAPL Specification
- STL Philosophy
- Shared Object View
  - User layer: no explicit communication
  - Machine layer: architecture-dependent code
- Distributed Objects
  - no replication
  - no software coherence
- Portable efficiency
- Runtime System virtualizes the underlying architecture
- Concurrency & Communication Layer
  - SPMD (for now) parallelism
5 STAPL Applications
- Motion Planning
  - Probabilistic roadmap methods for motion planning, with applications to protein folding, intelligent CAD, animation, robotics, etc.
- Molecular Dynamics
  - A discrete event simulation that computes interactions between particles.
[Figure: Protein Folding]
6 STAPL Applications
- Seismic Ray Tracing
  - Simulation of the propagation of seismic rays in the earth's crust.
- Particle Transport Computation
  - Efficient massively parallel implementation of discrete ordinates particle transport calculation.
[Figure: Particle Transport Simulation]
7 STAPL Overview
[Diagram: pAlgorithms, pRanges, and pContainers layered over the Runtime System (Scheduler, Executor)]
- Data is stored in pContainers
  - Parallel equivalents of all STL containers, plus more (e.g., pGraph)
- STAPL provides generic pAlgorithms
  - Parallel equivalents of STL algorithms, plus more (e.g., list ranking)
- pRanges bind pAlgorithms to pContainers (see the sketch below)
  - Similar to STL iterators, but also support parallelism
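As a rough illustration of the intended usage shape, compare an STL call with its hypothetical STAPL counterpart. Only the STL half below is verified API; the stapl:: spellings are assumptions patterned on the pContainer/pAlgorithm/pRange names on this slide.

  #include <algorithm>
  #include <vector>

  void stl_sort(std::vector<int>& v) {
    std::sort(v.begin(), v.end());     // iterators bind the algorithm to the container
  }

  // Hypothetical STAPL analogue (assumed spellings, shown for shape only):
  //   stapl::p_vector<int> pv(n);     // pContainer
  //   stapl::p_sort(pv.get_prange()); // pRange binds the pAlgorithm to the pContainer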
8 STAPL Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI Communication Infrastructure
- Applications using STAPL
9 pContainer Overview
- pContainer: a distributed (no replication) data structure with parallel (thread-safe) methods
- Ease of Use
  - Shared object view
  - Handles data distribution and remote data access internally (no explicit communication)
- Efficiency
  - De-centralized distribution management
  - OO design to optimize specific containers
  - Minimal overhead over STL containers
- Extendability
  - A set of base classes with basic functionality
  - New pContainers can be derived from the base classes with extended and optimized functionality
10 pContainer Layered Architecture
- pContainer provides different views for users with different needs/levels of expertise
- Basic User view
  - a single address space
  - interfaces similar to STL containers
- Advanced User view
  - access to data distribution info to optimize methods
  - can provide customized distributions that exploit knowledge of the application
11 pContainer Design
- Base Sequential Container
  - STL containers are used to store the data
- Distribution Manager
  - provides the shared object view
- BasePContainer: combines the two (see the sketch below)
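A minimal sketch of this layering, assuming a block distribution; all class and member names here are illustrative, not STAPL's actual declarations.

  #include <cstddef>
  #include <vector>

  // Distribution manager: maps a global index to its owner and local offset,
  // giving the shared-object view with no replication (assumed block layout).
  struct DistributionManagerSketch {
    int nthreads;
    std::size_t block;
    int owner(std::size_t gid) const { return int(gid / block) % nthreads; }
    std::size_t local(std::size_t gid) const { return gid % block; }
  };

  // BasePContainer: pairs local storage (a base sequential STL container)
  // with the distribution manager; derived pContainers extend this.
  template <typename T>
  class BasePContainerSketch {
  protected:
    std::vector<T> local_data;          // base sequential container
    DistributionManagerSketch dist;
  public:
    BasePContainerSketch(int p, std::size_t b) : dist{p, b} {}
    bool is_local(std::size_t gid, int tid) const { return dist.owner(gid) == tid; }
  };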
12 STAPL Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI Communication Infrastructure
- Applications using STAPL
13 pRange Overview
- Interface between pAlgorithms and pContainers
  - pAlgorithms are expressed in terms of pRanges
  - pContainers provide pRanges
- Similar to an STL iterator
  - Parallel programming support
  - Expresses the computation as a parallel task graph
  - Stores DDGs used in processing subranges
- Less abstract than an STL iterator
  - Access to pContainer methods
- Expresses the Data/Task Parallelism Duality
14 pRange
- View of a work space
  - Set of tasks in a parallel computation
- Can be recursively partitioned into subranges
  - Defined on disjoint portions of the work space
- Leaf subrange in the hierarchy
  - Represents a single task
  - Smallest schedulable entity
- Task (see the sketch below)
  - Function object to apply
  - Description of the data to which the function is applied
  - Using the same function for all subranges results in SPMD
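In code, the task notion above might look like this sketch; the types are illustrative assumptions, not STAPL declarations.

  #include <cstddef>
  #include <functional>

  // A leaf subrange: a disjoint slice of the work space (the data
  // description) paired with the function object to apply. This is the
  // smallest schedulable entity.
  struct TaskSketch {
    std::size_t begin, end;                  // description of the data
    std::function<void(std::size_t)> work;   // function object to apply
  };
  // Scheduling the same `work` over every subrange yields SPMD execution.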
15 pRange Example
[Diagram: a pRange defined on application data stored in a pMatrix; six subranges are distributed across threads 1 and 2, each subrange carrying a function object, with dependences between subranges]
- Each subrange is a task
- The boundary of each subrange is a set of cut edges
- A subrange may contain data from several threads
- If the pRange partition matches the data distribution, then all data access is local
16 pRange Example
- Subranges of the pRange
  - Matrix elements lie in several subranges
  - Each subrange has a boundary and a function object
  - A subrange may contain data from several threads
- The pMatrix is distributed
  - If the subrange partition matches the data distribution, then all data access is local
- DDGs can be defined on subranges of the pRange and on elements inside each subrange (no DDG is shown here)
- Partitioning of subranges
  - Subranges can be recursively partitioned
  - Each subrange has a function object
17 Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI Communication Infrastructure
- Applications using STAPL
18 pAlgorithms
- A pAlgorithm is a set of parallel task objects (sketched below)
  - input for parallel tasks is specified by the pRange
  - (intermediate) results are stored in pContainers
  - ARMI is used for communication between parallel tasks
- pAlgorithms in STAPL
  - Parallel counterparts of STL algorithms are provided in STAPL
  - STAPL contains additional parallel algorithms
    - List ranking
    - Parallel Strongly Connected Components
    - Parallel Euler Tour
    - etc.
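To make the structure concrete, here is a minimal mock of a pAlgorithm-style parallel for_each built on std::thread; it shows only the shape (one task per subrange; results and communication omitted) and is not STAPL's implementation.

  #include <thread>
  #include <vector>

  // Mock p_for_each: one task per leaf subrange, each applying the same
  // work function to its local slice (compare std::for_each(first, last, f)).
  template <typename T, typename Func>
  void p_for_each_sketch(std::vector<std::vector<T>>& subranges, Func f) {
    std::vector<std::thread> tasks;
    for (auto& sr : subranges)
      tasks.emplace_back([&sr, f] { for (auto& x : sr) f(x); });
    for (auto& t : tasks) t.join();     // all tasks complete
  }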
19 Algorithm Adaptivity in STAPL
- Problem: parallel algorithms are highly sensitive to
  - Architecture: number of processors, memory interconnection, cache, available resources, etc.
  - Environment: thread management, memory allocation, operating system policies, etc.
  - Data characteristics: input type, layout, etc.
- Solution: adaptively choose the best algorithm from a library of options at run-time
- Adaptive patterns?
20 Adaptive Framework
- Overview of approach
  - Given: multiple implementation choices for the same high-level algorithm
  - STAPL installation: analyze each pAlgorithm's performance on the system and create a selection model
  - Program execution: gather parameters, query the model, and use the predicted algorithm
[Diagram: installation benchmarks over architecture, algorithm, and environment feed a data repository and performance model; user code compiled with STAPL's parallel algorithm choices yields an adaptive executable that combines data characteristics and run-time tests to pick the selected algorithm]
21 Model Generation
- Installation benchmarking
  - Determine parameters that may affect performance (e.g., num procs, input size, algorithm-specific attributes)
  - Run all algorithms on a sampling of the instance space
  - Insert timings from the runs into the performance database
- Create a decision model
  - A generic interface enables learners to compete
  - Currently: decision tree, neural net, naïve Bayes classifier
  - Winner selected based on predicted accuracy (10-way validation test)
  - The winning learner outputs a query function in C: func predict_pAlgorithm(attribute1, attribute2, ...) (illustrated below)
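The emitted query function might look like the following sketch; the attribute list mirrors the sorting models on slide 28, but the branch structure and thresholds are invented for illustration.

  enum class SortAlgo { SampleSort, RadixSort, ColumnSort };

  // Shape of a compiled decision model: nested branches over the gathered
  // attributes. Thresholds below are placeholders, not learned values.
  SortAlgo predict_pAlgorithm(int p, long n, bool integer_keys,
                              double dist_norm, long max_value) {
    if (integer_keys && max_value < (1L << 20))
      return SortAlgo::RadixSort;     // small key range -> few radix passes
    if (dist_norm < 0.05)
      return SortAlgo::SampleSort;    // nearly sorted input
    return SortAlgo::ColumnSort;
  }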
22 Runtime Algorithm Selection
- Gather parameters
  - Immediately available (e.g., num procs)
  - Computed (e.g., disorder estimate for sorting)
- Query model and execute (see the sketch below)
  - The query function is statically linked at compile time
  - Current work: dynamic linking with online model refinement
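Putting both steps together in a sketch: the query-function shape from the previous block is reused, and estimate_dist_norm is a hypothetical helper standing in for the computed disorder estimate.

  #include <vector>

  enum class SortAlgo { SampleSort, RadixSort, ColumnSort };
  SortAlgo predict_pAlgorithm(int p, long n, bool integer_keys,
                              double dist_norm, long max_value); // statically linked query function

  double estimate_dist_norm(const std::vector<int>& v);          // hypothetical computed parameter

  void adaptive_sort(std::vector<int>& v, int nprocs) {
    double d = estimate_dist_norm(v);          // gathered at run time
    switch (predict_pAlgorithm(nprocs, long(v.size()), true, d, 1L << 30)) {
      case SortAlgo::SampleSort: /* dispatch sample sort */ break;
      case SortAlgo::RadixSort:  /* dispatch radix sort  */ break;
      case SortAlgo::ColumnSort: /* dispatch column sort */ break;
    }
  }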
23 Experiments
- Investigated two operations
  - Parallel sorting
  - Parallel matrix multiplication
- Three platforms
  - 128-processor SGI Altix
  - 1152-node, dual-processor Xeon cluster
  - 68-node, 16-way IBM SMP cluster
24 Parallel Sorting Algorithms
- Sample Sort
  - Samples to define processor bucket thresholds
  - Scan and distribute elements to buckets
  - Each processor sorts its local elements
- Radix Sort
  - Parallel version of the linear-time sequential approach
  - Passes over the data multiple times, each time considering r bits
- Column Sort
  - O(lg n) running time
  - Requires 4 local sorts and 4 communication steps
  - Uses the pMatrix data structure for workspace
25 Sorting Attributes
- Attributes used to model the sorting decision
  - Processor count
  - Data type
  - Input size
  - Max value
    - Smaller value ranges may favor radix sort by reducing the required passes
  - Presortedness
    - Generate training data by varying the initial state (sorted, random, reversed) and displacement
    - Quantified at runtime with a normalized average distance metric derived from input sampling (see the sketch below)
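The slides do not give the exact formula, so the runnable sketch below is one plausible reading: the average distance of each element from its sorted position, normalized by n. It returns 0 for sorted input and about 0.5 for reversed input, matching the dist_norm interpretation on slide 29.

  #include <algorithm>
  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Average displacement of each element from its sorted position,
  // normalized by n. In practice this would run on a small sample of v.
  double dist_norm_sketch(const std::vector<int>& v) {
    std::vector<int> sorted(v);
    std::sort(sorted.begin(), sorted.end());
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
      std::size_t pos = std::lower_bound(sorted.begin(), sorted.end(), v[i])
                        - sorted.begin();      // element's sorted position
      total += std::fabs(double(pos) - double(i));
    }
    return v.empty() ? 0.0 : total / (double(v.size()) * double(v.size()));
  }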
26 Training Set Creation
- 1000 training inputs per platform, by uniform random sampling of the parameter space
- (P 64 on the Linux cluster; frost used only for sorted and reverse-sorted inputs)
27 Model Error Rate
- Model accuracy with all training inputs is 94%, 98%, and 94% on the Cluster, Altix, and SMP Cluster, respectively
28 Selection Models per Platform
- Altix: F(p, dist_norm)
- Cluster: F(p, n, dt, dist_norm, max)
- SMP Cluster: F(p, n, dt, dist_norm, max)
29 Parallel Sorting Experimental Results
- Altix model: decision tree
[Figure: validation (N = 120M) on Altix]
- Interpretation of dist_norm
  - 0: sorted
  - 0.5: reversed
  - intermediate values: random
30 Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI Communication Infrastructure
- Applications using STAPL
31 Current Implementation Protocols
- Shared-Memory (OpenMP/Pthreads)
  - shared request queues
- Message Passing (MPI-1.1)
  - sends/receives
- Mixed-Mode
  - combination of MPI with threads
  - flat view of parallelism (for now)
  - takes advantage of shared memory
32 STAPL Run-Time System
- Scheduler
  - Determines an execution order (DDG)
  - Policies
    - Automatic: static, block, dynamic, partial self-scheduling, complete self-scheduling
    - User-defined
- Executor
  - Executes the DDG
  - Processor assignment
  - Synchronization and communication
[Diagram: a pAlgorithm mapped by the run-time onto Clusters 1, 2, and 3]
33 ARMI: STAPL Communication Infrastructure
- ARMI: Adaptive Remote Method Invocation
  - abstraction of shared-memory and message-passing communication layers
  - the programmer expresses fine-grain parallelism that ARMI adaptively coarsens
  - support for sync, async, point-to-point, and group communication
- ARMI can be as easy/natural as shared memory and as efficient as message passing
34 ARMI Communication Primitives
- armi_sync
  - question: ask a thread something
  - blocking version
    - the function doesn't return until the answer is received from the RMI
  - non-blocking version (see the sketch below)
    - the function returns without the answer
    - the program can poll with rtnHandle.ready() and then access ARMI's return value with rtnHandle.value()
- collective operations
  - armi_broadcast, armi_reduce, etc.
  - can adaptively set groups for communication
- arguments are always passed by value
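A sketch of the non-blocking polling pattern: the names armi_sync, ready(), and value() come from this slide, while the signatures and the std::future-based mock underneath are assumptions, not the real ARMI implementation.

  #include <chrono>
  #include <future>
  #include <iostream>

  // Mock of the non-blocking return handle named on this slide.
  template <typename T>
  struct RtnHandle {
    std::future<T> result;
    bool ready() {
      return result.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
    }
    T value() { return result.get(); }
  };

  // Mock non-blocking armi_sync: "ask a thread something", return at once.
  RtnHandle<int> armi_sync(int dest_thread, int (*method)(int), int arg) {
    return { std::async(std::launch::async, method, arg) };  // args by value
  }

  int remote_square(int x) { return x * x; }

  int main() {
    RtnHandle<int> h = armi_sync(/*dest_thread=*/1, remote_square, 7);
    while (!h.ready()) { /* overlap local computation with communication */ }
    std::cout << h.value() << "\n";    // prints 49
  }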
35 ARMI Synchronization Primitives
- armi_fence, armi_barrier
  - tree-based barrier
  - implements a distributed termination algorithm to ensure that all outstanding ARMI requests have been sent, received, and serviced
- armi_wait
  - blocks until at least one (possibly more) ARMI request is received and serviced
- armi_flush
  - empties the local send buffer, pushing outstanding ARMI requests to remote destinations
36 ARMI Request Scheduling
- Future work
  - Optimizing communication with a communication thread
    - If servicing multiple computation threads, aggregate all messages to a single remote host and send them together
    - If using processing resources distinct from the computation, spend free cycles optimizing communication (e.g., aggregation factor, discovering the program's communication patterns)
  - Explore the benefits of extreme platform customization
    - Further clarify the layers of abstraction within ARMI to ease specialized implementations
    - Customize the implementation and consistency model to the target platform
37 Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI Communication Infrastructure
- Applications using STAPL
38 Particle Transport
Q: What is the particle transport problem?
A: Particle transport is all about counting particles (such as neutrons). Given a physical volume, we want to know how many particles there are, along with their locations, directions, and energies.
Q: Why is it an important problem?
A: It is needed for the accurate simulation of complex physical systems such as nuclear reactions, and it requires an estimated 50-80% of the total execution time in multi-physics simulations.
39 Transport Problem Applications
- Oil Well Logging Tool
  - Shaft dug at a possible well location
  - Radioactive sources placed in the shaft with monitoring equipment
  - Simulation allows for verification of new techniques and tools
40 Discrete Ordinates Method
- Iterative method for solving the first-order form of the transport equation (discretizes ...)
[Algorithm listing]
41 Discrete Ordinates Method
42 TAXI Algorithm
43 Transport Sweeps
- Involves a sweep of the spatial grid for each direction in Ω.
- For orthogonal grids there are only eight distinct sweep orderings.
- Note: a full transport sweep must process every direction.
44 Multiple Simultaneous Sweeps
- One approach is to process each direction sequentially.
- Another approach is to process all directions simultaneously.
  - Requires processors to sequentially process each direction during the sweep.
45 Sweep Dependence
- Each sweep direction generates a unique dependence graph; a sweep starting from cell 1 is shown.
- For example, cell 3 must wait until cell 1 has been processed, and must itself be processed before cells 5 and 7.
- Note that all cells in the same diagonal plane can be processed simultaneously (see the sketch below).
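The diagonal-plane observation can be demonstrated with a small runnable sketch: on an orthogonal 2D grid swept from one corner, every cell on anti-diagonal i + j = d depends only on cells of diagonal d - 1, so all cells within a diagonal could run in parallel. The grid size and the cell update below are illustrative assumptions.

  #include <cstdio>

  int main() {
    const int N = 4;
    double cell[N][N] = {};                 // 2D stand-in for the spatial grid
    cell[0][0] = 1.0;                       // sweep starts at the corner cell
    for (int d = 1; d <= 2 * (N - 1); ++d)  // process one diagonal per step
      for (int i = 0; i < N; ++i) {         // cells with i + j == d: independent
        int j = d - i;
        if (j < 0 || j >= N) continue;
        double up   = (i > 0) ? cell[i - 1][j] : 0.0;   // upstream neighbors
        double left = (j > 0) ? cell[i][j - 1] : 0.0;
        cell[i][j] = 0.5 * (up + left);     // placeholder cell update
      }
    std::printf("corner value: %g\n", cell[N - 1][N - 1]);
  }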
46 pRange Dependence Graph
[Diagram: sweep dependence graphs for angle-sets A, B, C, and D; numbers are cellset indices, colors indicate processors]
47 Adding a Reflecting Boundary
48 Opposing Reflecting Boundary
49 Strong Scalability
- System specs
  - Large, dedicated IBM cluster at LLNL
  - 68 nodes
  - 16 375-MHz Power 3 processors and 16 GB RAM per node
  - Nodes connected by an IBM SP switch
- Problem info
  - 64x64x256 grid
  - 6,291,456 unknowns
50 Work in Progress (Open Topics)
- STAPL Algorithms
- STAPL Adaptive Containers
- ARMI v2 (multi-threaded, communication pattern library)
- STAPL RTS -- K42 interface
- A Compiler for STAPL
  - A high-level, source-to-source compiler
  - Understands STAPL blocks
  - Optimizes composition
  - Automates composition
  - Generates checkers for STAPL programs
51 References
- [1] "STAPL: An Adaptive, Generic Parallel C++ Library", Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato, and Lawrence Rauchwerger, 14th Workshop on Languages and Compilers for Parallel Computing (LCPC), Cumberland Falls, KY, August 2001.
- [2] "ARMI: An Adaptive, Platform Independent Communication Library", Steven Saunders and Lawrence Rauchwerger, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), San Diego, CA, June 2003.
- [3] "Finding Strongly Connected Components in Parallel in Particle Transport Sweeps", W. C. McLendon III, B. A. Hendrickson, S. J. Plimpton, and L. Rauchwerger, Thirteenth ACM Symposium on Parallel Algorithms and Architectures (SPAA), Crete, Greece, July 2001.
- [4] "A Framework for Adaptive Algorithm Selection in STAPL", N. Thomas, G. Tanase, O. Tkachyshyn, J. Perdue, N. M. Amato, and L. Rauchwerger, ACM SIGPLAN 2005 Symposium on Principles and Practice of Parallel Programming (PPoPP), Chicago, IL, June 2005 (to appear).