STAPL: A High Productivity Programming Infrastructure for Parallel

Transcript and Presenter's Notes
1
STAPL: A High Productivity Programming
Infrastructure for Parallel and Distributed
Computing
  • Lawrence Rauchwerger
  • Parasol Lab, Dept. of Computer Science
  • Texas A&M University
  • http://parasol.tamu.edu/rwerger/

2
Motivation
  • Parallel programming is costly
  • Parallel programs are not portable
  • Scalability and efficiency are (usually) poor
  • Dynamic programs are even harder
  • Small-scale parallel machines are ubiquitous

3
Our Approach: STAPL
  • STAPL: a parallel components library
  • Extensible, open ended
  • Parallel superset of STL
  • Sequential inter-operability
  • Layered architecture: User / Developer /
    Specialist
  • Portable (only the lowest layer needs to be
    specialized)
  • High productivity environment
  • components have (almost) sequential interfaces

4
STAPL Specification
  • STL philosophy
  • Shared object view
  • User layer: no explicit communication
  • Machine layer: architecture-dependent code
  • Distributed objects
  • no replication
  • no software coherence
  • Portable efficiency
  • Runtime system virtualizes the underlying
    architecture
  • Concurrency and communication layer
  • SPMD (for now) parallelism

5
STAPL Applications
  • Motion Planning
  • Probabilistic Roadmap Methods for motion planning
    with application to protein folding, intelligent
    CAD, animation, robotics, etc.
  • Molecular Dynamics
  • A discrete event simulation that computes
    interactions between particles.

Protein Folding
6
STAPL Applications
  • Seismic Ray Tracing
  • Simulation of the propagation of seismic rays
    in the earth's crust.
  • Particle Transport Computation
  • Efficient Massively Parallel Implementation of
    Discrete Ordinates Particle Transport Calculation.

Particle Transport Simulation
7
STAPL Overview
(Architecture diagram: pAlgorithm, pRange, pContainer, Scheduler, Executor, Runtime System)
  • Data is stored in pContainers
  • Parallel equivalents of all STL containers, and
    more (e.g., pGraph)
  • STAPL provides generic pAlgorithms
  • Parallel equivalents of STL algorithms, and more
    (e.g., list ranking)
  • pRanges bind pAlgorithms to pContainers
  • Similar to STL iterators, but also support
    parallelism

8
STAPL Overview
  • pContainers
  • pRange
  • pAlgorithms
  • RTS: ARMI Communication Infrastructure
  • Applications using STAPL

9
pContainer Overview
  • pContainer: a distributed (no replication)
    data structure with parallel (thread-safe)
    methods
  • Ease of use
  • Shared object view
  • Handles data distribution and remote data access
    internally (no explicit communication)
  • Efficiency
  • De-centralized distribution management
  • OO design to optimize specific containers
  • Minimal overhead over STL containers
  • Extensibility
  • A set of base classes with basic functionality
  • New pContainers can be derived from the base
    classes with extended and optimized functionality
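The layered pContainer design can be sketched in ordinary C++. The class and member names below are our own invention (not STAPL's API): one base sequential container per partition plus a distribution manager that maps global indices to (partition, offset), which is what gives callers a shared-object view with no explicit communication.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative analogue of a pContainer (not STAPL's real classes): a
// "pVector" built from one sequential container per partition plus a
// distribution manager that maps a global index to (partition, local
// offset), giving users a single-address-space view.
class ToyPVector {
public:
    ToyPVector(std::size_t n, std::size_t num_partitions)
        : parts_(num_partitions), n_(n) {
        // Block distribution: partition i owns a contiguous chunk.
        chunk_ = (n + num_partitions - 1) / num_partitions;
        for (std::size_t i = 0; i < n; ++i)
            parts_[i / chunk_].push_back(0);
    }

    // Distribution manager: global index -> (partition, local offset).
    std::pair<std::size_t, std::size_t> locate(std::size_t i) const {
        return {i / chunk_, i % chunk_};
    }

    // Shared object view: users index globally, never per-partition.
    int& operator[](std::size_t i) {
        auto [p, off] = locate(i);
        return parts_[p][off];
    }

    std::size_t size() const { return n_; }
    std::size_t partitions() const { return parts_.size(); }

private:
    std::vector<std::vector<int>> parts_;  // base sequential containers
    std::size_t n_, chunk_;
};
```

An advanced user, in STAPL's terms, would be one who replaces the block distribution in the constructor and `locate` with an application-specific one; the basic user only ever sees `operator[]`.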

10
pContainer Layered Architecture
  • pContainer provides different views for users
    with different needs/levels of expertise
  • Basic User view
  • a single address space
  • interfaces similar to STL containers
  • Advanced User view
  • access to data distribution info to optimize
    methods
  • can provide customized distributions that
    exploit knowledge of application

11
pContainer Design
  • Base Sequential Container
  • STL Containers used to store data
  • Distribution Manager
  • provides shared object view
  • BasePContainer

12
STAPL Overview
  • pContainers
  • pRange
  • pAlgorithms
  • RTS: ARMI Communication Infrastructure
  • Applications using STAPL

13
pRange Overview
  • Interface between pAlgorithms and pContainers
  • pAlgorithms are expressed in terms of pRanges
  • pContainers provide pRanges
  • Similar to an STL iterator
  • Parallel programming support
  • Expression of the computation as a parallel task graph
  • Stores the data dependence graphs (DDGs) used in
    processing subranges
  • Less abstract than an STL iterator
  • Access to pContainer methods
  • Expresses the Data/Task Parallelism Duality

14
pRange
  • View of a work space
  • Set of tasks in a parallel computation
  • Can be recursively partitioned into subranges
  • Defined on disjoint portions of the work space
  • Leaf subrange in the hierarchy
  • Represents a single task
  • Smallest schedulable entity
  • Task
  • Function object to apply
  • Using the same function for all subranges results
    in SPMD
  • Description of the data to which the function is
    applied
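The recursive partitioning described above can be sketched in a few lines of C++. The types and names here are ours, not STAPL's: a subrange is a disjoint slice of the work space, partitioning recurses until leaves reach a grain size, and each leaf is the smallest schedulable task, paired with one function object (the same function everywhere gives SPMD-style execution).

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative sketch of pRange partitioning (names are ours, not STAPL's).
struct ToySubrange {
    std::size_t begin, end;  // disjoint portion of the work space
};

// Recursively split [begin, end) until each leaf holds at most `grain`
// elements, collecting the leaves -- the schedulable tasks.
void partition(std::size_t begin, std::size_t end, std::size_t grain,
               std::vector<ToySubrange>& leaves) {
    if (end - begin <= grain) {
        leaves.push_back({begin, end});
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    partition(begin, mid, grain, leaves);
    partition(mid, end, grain, leaves);
}

// Minimal "executor": apply one function object to every leaf task.
void for_each_leaf(const std::vector<ToySubrange>& leaves,
                   const std::function<void(ToySubrange)>& task) {
    for (const auto& sr : leaves) task(sr);
}
```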

15
pRange Example
(Figure: a pRange defined on application data stored in a pMatrix,
partitioned into subranges 1-6; subranges are assigned to threads 1 and 2,
each subrange carries a function object, and edges mark dependences
between subranges.)
  • Each subrange is a task
  • Boundary of each subrange is a set of cut edges
  • Data from several threads in subrange
  • If pRange partition matches data distribution
    then data access is all local

16
pRange Example
  • Subranges of the pRange
  • Matrix elements lie in several subranges
  • Each subrange has a boundary and a function
    object
  • Data from several threads in a subrange
  • pMatrix is distributed
  • If the subrange partition matches the data
    distribution, then all data access is local
  • DDGs can be defined on subranges of the pRange
    and on elements inside each subrange
  • No DDG is shown here
  • Partitioning of subranges
  • Subranges can be recursively partitioned

17
Overview
  • pContainers
  • pRange
  • pAlgorithms
  • RTS: ARMI Communication Infrastructure
  • Applications using STAPL

18
pAlgorithms
  • A pAlgorithm is a set of parallel task objects
  • input for the parallel tasks is specified by the pRange
  • (intermediate) results are stored in pContainers
  • ARMI is used for communication between parallel tasks
  • pAlgorithms in STAPL
  • Parallel counterparts of STL algorithms provided
    in STAPL
  • STAPL contains additional parallel algorithms
  • List ranking
  • Parallel strongly connected components
  • Parallel Euler tour
  • etc.
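The pAlgorithm pattern above has a simple shape that we can sketch (our own minimal code, not STAPL's): the pRange supplies one subrange per task, each task writes a partial result, and a reduction combines them, as in a parallel count. A real pAlgorithm would run the tasks concurrently on separate processors; the executor loop below is sequential for simplicity.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch of a pAlgorithm-style parallel count: one task per subrange,
// per-task (intermediate) results, then a reduction over them.
long p_count_equal(const std::vector<int>& data, int value,
                   std::size_t num_tasks) {
    std::vector<long> partial(num_tasks, 0);  // per-task results
    std::size_t chunk = (data.size() + num_tasks - 1) / num_tasks;
    for (std::size_t t = 0; t < num_tasks; ++t) {  // "executor" over tasks
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(data.size(), lo + chunk);
        for (std::size_t i = lo; i < hi; ++i)
            if (data[i] == value) ++partial[t];
    }
    // Reduction combining the partial results.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```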

19
Algorithm Adaptivity in STAPL
  • Problem: parallel algorithms are highly sensitive to
  • Architecture: number of processors, memory
    interconnection, cache, available resources, etc.
  • Environment: thread management, memory
    allocation, operating system policies, etc.
  • Data characteristics: input type, layout, etc.
  • Solution: adaptively choose the best algorithm
    from a library of options at run-time
  • Adaptive patterns?

20
Adaptive Framework
  • Overview of approach
  • Given: multiple implementation choices for the
    same high-level algorithm
  • STAPL installation: analyze each pAlgorithm's
    performance on the system and create a selection
    model
  • Program execution: gather parameters, query the
    model, and use the predicted algorithm

(Figure: at installation time, benchmarks over architecture, algorithm,
and environment attributes fill a data repository and train a performance
model; at run time, the adaptive executable built from the user's code
combines data characteristics and run-time tests to pick the selected
algorithm.)
21
Model generation
  • Installation benchmarking
  • Determine parameters that may affect
    performance (e.g., number of processors, input
    size, algorithm-specific parameters)
  • Run all algorithms on a sampling of the instance
    space
  • Insert timings from the runs into the performance
    database
  • Create a decision model
  • Generic interface enables learners to compete
  • Currently: decision tree, neural net, naïve Bayes
    classifier
  • Winner chosen by predicted accuracy (10-way
    validation test)
  • Winning learner outputs a query function in C++:
    predict_pAlgorithm(attribute1, attribute2, ...)
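As a concrete illustration, the emitted query function could look like the hand-written decision tree below. The attribute names, thresholds, and algorithm labels are invented for illustration; the real function is generated by the winning learner from the benchmark database.

```cpp
#include <cstddef>
#include <string>

// Hypothetical shape of the generated query function: a small decision
// tree mapping run-time attributes to an algorithm choice.  All thresholds
// here are made up; a trained model would supply its own.
std::string predict_pAlgorithm(std::size_t num_procs, std::size_t input_size,
                               double dist_norm, long max_value) {
    if (dist_norm < 0.05)          // nearly sorted input
        return "sample_sort";
    if (max_value < (1L << 16))    // small key range: few radix passes
        return "radix_sort";
    if (num_procs >= 64 && input_size > 100000000)
        return "column_sort";
    return "sample_sort";
}
```

At run time the executable simply calls this function with the gathered attributes and dispatches to the chosen implementation.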

22
Runtime Algorithm Selection
  • Gather parameters
  • Immediately available (e.g., number of processors)
  • Computed (e.g., disorder estimate for sorting)
  • Query the model and execute
  • Query function is statically linked at compile time
  • Current work: dynamic linking with online model
    refinement

23
Experiments
  • Investigated two operations
  • Parallel sorting
  • Parallel matrix multiplication
  • Three platforms
  • 128-processor SGI Altix
  • 1152-node, dual-processor Xeon cluster
  • 68-node, 16-way IBM SMP cluster

24
Parallel Sorting Algorithms
  • Sample sort
  • Samples to define per-processor bucket thresholds
  • Scan and distribute elements to buckets
  • Each processor sorts its local elements
  • Radix sort
  • Parallel version of the linear-time sequential
    approach
  • Passes over the data multiple times, each time
    considering r bits
  • Column sort
  • O(lg n) bound on running time
  • Requires 4 local sorts and 4 communication steps
  • Uses the pMatrix data structure for workspace
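The three sample-sort phases listed above (sample to pick thresholds, scatter into buckets, sort each bucket locally) can be sketched in a single-process C++ function. This is our own simplification, not STAPL's implementation: for brevity it samples the whole input when choosing splitters, and each bucket stands in for one processor's local work.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Single-process sketch of sample sort.  Concatenating the sorted buckets
// yields a fully sorted sequence because the splitters order the buckets.
std::vector<int> sample_sort(std::vector<int> data, std::size_t buckets) {
    if (data.size() < 2 || buckets < 2) {
        std::sort(data.begin(), data.end());
        return data;
    }
    // 1. Sample, then pick buckets-1 splitters as bucket thresholds.
    std::vector<int> sample = data;   // simplification: sample everything
    std::sort(sample.begin(), sample.end());
    std::vector<int> splitters;
    for (std::size_t i = 1; i < buckets; ++i)
        splitters.push_back(sample[i * sample.size() / buckets]);
    // 2. Scan the input and distribute each element to its bucket.
    std::vector<std::vector<int>> bucket(buckets);
    for (int x : data) {
        std::size_t b = std::upper_bound(splitters.begin(), splitters.end(),
                                         x) - splitters.begin();
        bucket[b].push_back(x);
    }
    // 3. Each "processor" sorts its local bucket; concatenate in order.
    std::vector<int> out;
    for (auto& bk : bucket) {
        std::sort(bk.begin(), bk.end());
        out.insert(out.end(), bk.begin(), bk.end());
    }
    return out;
}
```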

25
Sorting Attributes
  • Attributes used to model the sorting decision
  • Processor count
  • Data type
  • Input size
  • Max value
  • Smaller value ranges may favor radix sort by
    reducing the required passes
  • Presortedness
  • Generate data by varying the initial state (sorted,
    random, reversed) and displacement
  • Quantify at runtime with a normalized average
    distance metric derived from input sampling
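One plausible form of such a presortedness estimate is sketched below; the slide does not give STAPL's exact metric, so this is an assumption for illustration: sample every k-th element, find the position it would occupy in sorted order, and average the normalized displacement. A sorted input scores 0 and more disordered inputs score higher.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical "normalized average distance" presortedness estimate
// (our assumption, not STAPL's exact formula): average over sampled
// elements of |sorted position - current position| / n.
double dist_norm_estimate(const std::vector<double>& data,
                          std::size_t stride) {
    if (data.size() < 2) return 0.0;
    std::vector<double> sorted = data;
    std::sort(sorted.begin(), sorted.end());
    double total = 0.0;
    std::size_t samples = 0;
    for (std::size_t i = 0; i < data.size(); i += stride) {
        // Position the sampled element would occupy in sorted order.
        std::size_t pos = std::lower_bound(sorted.begin(), sorted.end(),
                                           data[i]) - sorted.begin();
        total += std::fabs(double(pos) - double(i)) / double(data.size());
        ++samples;
    }
    return total / double(samples);
}
```

With `stride > 1` only a sample of the input is inspected, keeping the run-time cost of the estimate small relative to the sort itself.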

26
Training Set Creation
  • 1000 training inputs per platform by uniform
    random sampling of the space defined below

P ≤ 64 on the Linux cluster; frost only for sorted
and reverse-sorted inputs
27
Model Error Rate
  • Model accuracy with all training inputs is 94%,
    98%, and 94% on the Cluster, Altix, and SMP Cluster

28
Altix: F(p, n, dt, dist_norm, max)
Cluster: F(p, n, dt, dist_norm, max)
SMP Cluster: F(p, dist_norm)
29
Parallel Sorting Experimental Results
  • Altix model
  • Decision tree

Validation (N = 120M) on Altix
Interpretation of dist_norm: 0 = sorted, 0.5 = reversed,
values in between = random
30
Overview
  • pContainers
  • pRange
  • pAlgorithms
  • RTS: ARMI Communication Infrastructure
  • Applications using STAPL

31
Current Implementation Protocols
  • Shared-Memory (OpenMP/Pthreads)
  • shared request queues
  • Message Passing (MPI-1.1)
  • sends/receives
  • Mixed-Mode
  • combination of MPI with threads
  • flat view of parallelism (for now)
  • take advantage of shared-memory

32
STAPL Run-Time System
  • Scheduler
  • Determines an execution order (DDG)
  • Policies
  • Automatic: static, block, dynamic, partial
    self-scheduling, complete self-scheduling
  • User-defined
  • Executor
  • Executes the DDG
  • Processor assignment
  • Synchronization and communication

(Figure: the run-time system maps a pAlgorithm onto clusters 1-3)
33
ARMI STAPL Communication Infrastructure
  • ARMI: Adaptive Remote Method Invocation
  • abstraction of shared-memory and message-passing
    communication layers
  • programmer expresses fine-grain parallelism that
    ARMI adaptively coarsens
  • support for sync, async, point-to-point, and
    group communication
  • ARMI can be as easy/natural as shared memory and
    as efficient as message passing

34
ARMI Communication Primitives
  • armi_sync
  • question: ask a thread something
  • blocking version
  • function doesn't return until the answer is
    received from the RMI
  • non-blocking version
  • function returns without the answer
  • program can poll with rtnHandle.ready() and then
    access ARMI's return value with rtnHandle.value()
  • collective operations
  • armi_broadcast, armi_reduce, etc.
  • can adaptively set groups for communication
  • arguments are always passed by value
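The non-blocking pattern on this slide (issue the request, poll `rtnHandle.ready()`, then read `rtnHandle.value()`) maps naturally onto `std::future`. The sketch below is a stand-alone analogue, not ARMI itself: `std::async` stands in for a remote method invocation, `wait_for(0)` plays the role of `ready()`, and `get()` the role of `value()`.

```cpp
#include <chrono>
#include <future>

int remote_work(int x) { return x * x; }  // stands in for the remote method

// Issue an asynchronous "RMI", poll for completion while (notionally)
// overlapping other work, then collect the answer.
int poll_then_get(int arg) {
    std::future<int> rtnHandle =
        std::async(std::launch::async, remote_work, arg);
    // Poll, like rtnHandle.ready(), until the answer arrives.
    while (rtnHandle.wait_for(std::chrono::milliseconds(0)) !=
           std::future_status::ready) {
        /* overlap computation with communication here */
    }
    return rtnHandle.get();  // like rtnHandle.value()
}
```

The blocking `armi_sync` version would simply call `get()` immediately, with no polling loop.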

35
ARMI Synchronization Primitives
  • armi_fence, armi_barrier
  • tree-based barrier
  • implements distributed termination algorithm to
    ensure that all outstanding ARMI requests have
    been sent, received, and serviced
  • armi_wait
  • blocks until at least one (possibly more) ARMI
    request is received and serviced
  • armi_flush
  • empties local send buffer, pushing outstanding
    ARMI requests to remote destinations

36
ARMI Request Scheduling
  • Future Work
  • Optimizing communication with a communication thread
  • If servicing multiple computation threads,
    aggregate all messages to a single remote host
    and send them together
  • If using different processing resources than the
    computation, spend free cycles optimizing
    communication (e.g., aggregation factor, discovering
    program communication patterns)
  • Explore benefits of extreme platform
    customization
  • Further clarify layers of abstraction within ARMI
    to ease specialized implementations
  • Customize implementation, consistency model to
    target platform

37
Overview
  • pContainers
  • pRange
  • pAlgorithms
  • RTS: ARMI Communication Infrastructure
  • Applications using STAPL

38
Particle Transport
Q: What is the particle transport problem?
A: Particle transport is all about counting
particles (such as neutrons). Given a physical
volume, we want to know how many particles there
are, along with their locations, directions, and energies.
Q: Why is it an important problem?
A: It is needed for the accurate simulation of complex
physical systems such as nuclear
reactions. It requires an estimated 50-80% of the
total execution time in multi-physics simulations.
Particle Transport is an important problem.
39
Transport Problem Applications
  • Oil Well Logging Tool
  • Shaft dug at possible well location
  • Radioactive sources placed in shaft with
    monitoring equipment
  • Simulation allows for verification of new
    techniques and tools

40
Discrete Ordinates Method
An iterative method for solving the first-order form
of the transport equation; it discretizes the angular
variable into a finite set of directions (ordinates).
Algorithm
41
Discrete Ordinates Method
42
TAXI Algorithm
43
Transport Sweeps
  • Involves a sweep of the spatial grid for each
    direction in Ω.
  • For orthogonal grids there are only eight
    distinct sweep orderings.
  • Note: a full transport sweep must process each
    direction.

44
Multiple Simultaneous Sweeps
  • One approach is to process each direction
    sequentially.
  • Another approach is to process all directions
    simultaneously.
  • The sequential approach requires processors to
    handle one direction at a time during the sweep.

45
Sweep Dependence
  • Each sweep direction generates a unique
    dependence graph. A sweep starting from cell 1
    is shown.
  • For example, cell 3 must wait until cell 1 has
    been processed and must be processed before cells
    5 and 7.
  • Note that all cells in the same diagonal plane
    can be processed simultaneously.
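The diagonal-plane observation above is easy to make concrete: in a sweep from one corner of an n×n orthogonal grid, every cell on the same anti-diagonal (equal i + j) has its upstream neighbors already processed, so all cells on a diagonal can run simultaneously. The sketch below (our own code, not the TAXI implementation) groups cells by diagonal into a valid sweep order.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Group the cells of an n x n grid into wavefronts for a sweep starting at
// cell (0,0): cell (i,j) depends on (i-1,j) and (i,j-1), so every cell on
// diagonal i+j can be processed once diagonal i+j-1 is done.
std::vector<std::vector<std::pair<std::size_t, std::size_t>>>
sweep_wavefronts(std::size_t n) {
    std::vector<std::vector<std::pair<std::size_t, std::size_t>>>
        waves(2 * n - 1);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            waves[i + j].push_back({i, j});
    return waves;
}
```

Processing the returned groups in order respects every dependence, while the cells inside each group are independent of one another, which is exactly the parallelism the sweep exploits.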

46
pRange Dependence Graph
angle-set A
angle-set B
angle-set C
angle-set D
  • Numbers are cellset indices
  • Colors indicate processors

47
Adding a reflecting boundary
48
Opposing reflecting boundary
49
Strong Scalability
  • System specs
  • Large, dedicated IBM cluster at LLNL
  • 68 nodes
  • 16 × 375 MHz POWER3 processors and 16 GB RAM per node
  • Nodes connected by an IBM SP switch
  • Problem info
  • 64x64x256 grid
  • 6,291,456 unknowns

50
Work in Progress (Open Topics)
  • STAPL Algorithms
  • STAPL Adaptive Containers
  • ARMI v2 (multi-threaded, communication pattern
    library)
  • STAPL RTS -- K42 Interface
  • A Compiler for STAPL
  • A high-level, source-to-source compiler
  • Understands STAPL blocks
  • Optimizes composition
  • Automates composition
  • Generates checkers for STAPL programs

51
References
  • [1] "STAPL: An Adaptive, Generic Parallel C++
    Library", Ping An, Alin Jula, Silvius Rus,
    Steven Saunders, Tim Smith, Gabriel Tanase,
    Nathan Thomas, Nancy Amato and Lawrence
    Rauchwerger, 14th Workshop on Languages and
    Compilers for Parallel Computing (LCPC),
    Cumberland Falls, KY, August, 2001.
  • [2] "ARMI: An Adaptive, Platform Independent
    Communication Library", Steven Saunders and
    Lawrence Rauchwerger, ACM SIGPLAN Symposium on
    Principles and Practice of Parallel Programming
    (PPoPP), San Diego, CA, June, 2003.
  • [3] "Finding Strongly Connected Components in
    Parallel in Particle Transport Sweeps", W. C.
    McLendon III, B. A. Hendrickson, S. J. Plimpton,
    and L. Rauchwerger, Thirteenth ACM Symposium on
    Parallel Algorithms and Architectures (SPAA),
    Crete, Greece, July, 2001.
  • [4] "A Framework for Adaptive Algorithm Selection
    in STAPL", N. Thomas, G. Tanase, O. Tkachyshyn,
    J. Perdue, N. M. Amato, L. Rauchwerger, ACM
    SIGPLAN 2005 Symposium on Principles and Practice
    of Parallel Programming (PPoPP), Chicago, IL,
    June, 2005. (to appear)