Title: Engineering Analysis of High Performance Parallel Programs
1. Engineering Analysis of High Performance Parallel Programs
- David Culler
- Computer Science Division
- U.C. Berkeley
- http://www.cs.berkeley.edu/~culler
2. Traditional Parallel Programming Tools
- Focus on showing what the program did and when it did it - microscopic analysis of deterministic events
- Oriented towards initial development of small programs on small data sets and small machines
- Instrumentation
  - traces, counters, profiles
- Visualization
- Examples
  - AIMS, PTOOLS, PPP
  - Pablo, Paradyn, ..., Delphi
  - ACTS TAU - Tuning and Analysis Utilities
3. Example: Pablo
4. Beyond Zeroth-order Analysis
- Basic level: get to a system design that is reasonable and behaves properly under ideal conditions
- Subject the system to various stresses to understand its operating regime and gain deeper insight into its dynamic behavior
- Combine empirical data with analytical models
- Iterate
- From "What?" to "What if?"
(figure: max displacement vs. wind speed)
5. Approach: Framework for Parameterized Sensitivity Analysis
- Framework performs analysis over numerous runs
  - statistical filtering
  - vary parameter of interest
- Provides a means of combining data to isolate effects of interest => ROBUSTNESS
(framework diagram: problem data set generator, well-developed parallel program, instrumentation tools, study parameter, machine characterizers - procs, comm. perf., cache, scheduling, ... - feeding visualization and modeling)
6. Example: NAS Parallel Benchmarks
- Fix problem size (NPB2.2 class A)
- Two different architectures
  - NOW UltraSPARC Cluster (170 MHz)
  - SGI Origin (250 MHz)
- Six application kernels
  - BT - Block Tridiagonal solve
  - SP - Scalar Pentadiagonal solve
  - LU - Sparse LU
  - MG - Multigrid
  - IS - Integer Sort
  - FT - 3D FFT
- Examine sensitivity to P (# procs)
  - time(P), speedup(P) = Time(1)/Time(P)
7. Single Processor Performance
8. Simplest Example: Performance(P)
- NPB2.2 on NOW and Origin 2000 (250 MHz)
9. Understanding Speedup
- SpeedUp(P) = T1 / MAX_p (Tcompute + Tcomm + Twait)
- Tcompute = (work/P + extra) x efficiency
- With message passing (e.g., MPI), communication time and wait time are indistinguishable
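Restating the decomposition on this slide in standard notation (the model is the slide's; the notation, and the reading of "x efficiency" as a per-process factor, are mine):

```latex
\[
\mathrm{SpeedUp}(P) \;=\;
\frac{T_1}{\displaystyle \max_{p = 1..P}\bigl(T_{\mathrm{compute}}(p) + T_{\mathrm{comm}}(p) + T_{\mathrm{wait}}(p)\bigr)},
\qquad
T_{\mathrm{compute}}(p) \;=\; \Bigl(\tfrac{\mathrm{work}}{P} + \mathrm{extra}\Bigr)\times\mathrm{efficiency}
\]
```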
10. A more austere metric...
- Time spent doing thing X
- Total Time_X(P) = Σ_{i=1..P} Time_X(i)
- Constant for perfect speedup
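A minimal sketch of how this metric can be collected with MPI (my own illustration, not the instrumentation used in the study): each rank accumulates its own Time_X, and a reduction sums it across the P processes.

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: accumulate the time each rank spends "doing thing X"
 * (here an arbitrary compute_step() stands in for X), then sum
 * across all P ranks.  Under perfect speedup the summed value
 * would stay constant as P grows. */
static void compute_step(void) { /* application work would go here */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double time_x = 0.0;                     /* this rank's Time_X(i) */
    for (int iter = 0; iter < 100; iter++) {
        double t0 = MPI_Wtime();
        compute_step();
        time_x += MPI_Wtime() - t0;
    }

    double total_time_x = 0.0;               /* Total Time_X(P) */
    MPI_Reduce(&time_x, &total_time_x, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (rank == 0)
        printf("Total Time_X(%d) = %f s\n", nprocs, total_time_x);

    MPI_Finalize();
    return 0;
}
```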
11. Where Time is Spent (P)
- Reveal basic processor and network loading (vs. P)
12. Where Time is Spent (P)
- Reveal basic processor and network loading (vs. P)
- Basis for model derivation - comm(P)
13. Why do comm. costs increase?
- total volume?
- volume per processor?
- message overhead?
- contention?
14. Communication Volume (P)
15. Communication Structure (P)
16. Understanding Efficiency (P, M)
- Want to understand both what load the program is placing on the system and how well the system is handling that load
- => characterize the capability of the system via simple benchmarks (rather than advertised peaks)
- => combine with measured load for a predictive model, compare (see the sketch below)
(figure annotations: 30 MB/s, 150 MB/s)
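A minimal sketch of the "predict and compare" step, assuming a simple latency/bandwidth cost model and benchmark-derived numbers like the 30 MB/s and 150 MB/s figures above (the model form, function names, and all constants here are my own placeholders):

```c
#include <stdio.h>

/* Sketch: predict per-process communication time from measured
 * machine capability (a simple latency/bandwidth model), then
 * compare against the communication time actually observed. */
typedef struct {
    double latency_s;        /* per-message cost from a microbenchmark */
    double bandwidth_MBps;   /* e.g., ~30 MB/s (cluster) or ~150 MB/s (Origin) */
} machine_t;

static double predict_comm_time(machine_t m, double n_msgs, double volume_MB)
{
    return n_msgs * m.latency_s + volume_MB / m.bandwidth_MBps;
}

int main(void)
{
    machine_t now    = { 30e-6, 30.0 };   /* hypothetical characterization */
    double n_msgs    = 5.0e4;             /* measured message count        */
    double volume_MB = 400.0;             /* measured comm volume          */
    double measured  = 18.0;              /* measured comm time (s)        */

    double predicted = predict_comm_time(now, n_msgs, volume_MB);
    printf("predicted %.2f s, measured %.2f s, efficiency %.2f\n",
           predicted, measured, predicted / measured);
    return 0;
}
```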
17. Communication Efficiency
18. Tools => Improvements in Run Time
- Efficiency analysis (vs. parameters) gives insight into where to improve the system or the program
  - use traditional profiling to see where in the program the bad stuff happens
  - or go back and tune the system to do better
19. Why does comp. time decrease?
- Combining trace generation with simulation provides new structural insight (a sketch follows)
- Here, clear knees in the program working set shift with machine size (P)
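A minimal sketch of what "trace generation plus simulation" means here, under my own assumptions (direct-mapped cache, made-up line size, synthetic trace): replay a recorded address trace through a simple cache model at several cache sizes and record the miss rate; the working-set knees show up as sharp drops in that curve.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE_BYTES 32   /* assumed line size; the study's parameters may differ */

/* Replay an address trace through a direct-mapped cache of the given
 * size and return the miss rate.  Sweeping cache_bytes over, say,
 * 8 KB .. 4 MB produces the miss-rate-vs-size curve whose knees mark
 * the program's working sets. */
static double miss_rate(const uint64_t *trace, size_t n, size_t cache_bytes)
{
    size_t nlines = cache_bytes / LINE_BYTES;
    uint64_t *tags = malloc(nlines * sizeof *tags);
    for (size_t i = 0; i < nlines; i++)
        tags[i] = UINT64_MAX;                 /* empty line */

    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t line = trace[i] / LINE_BYTES;
        size_t   set  = (size_t)(line % nlines);
        if (tags[set] != line) {              /* miss: fill the line */
            tags[set] = line;
            misses++;
        }
    }
    free(tags);
    return (double)misses / (double)n;
}

int main(void)
{
    /* Tiny synthetic "trace" standing in for one generated by
     * instrumenting the application. */
    enum { N = 1 << 20 };
    uint64_t *trace = malloc(N * sizeof *trace);
    for (size_t i = 0; i < N; i++)
        trace[i] = (i * 64) % (256 * 1024);   /* sweep a 256 KB footprint */

    for (size_t kb = 8; kb <= 4096; kb *= 2)
        printf("%4zu KB cache: miss rate %.4f\n", kb,
               miss_rate(trace, N, kb * 1024));

    free(trace);
    return 0;
}
```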
20. Constant Problem Size Scaling
(figure legend values: 4, 8, 16, 32, 64, 128, 256)
21. LU Working Sets
- Sharp drop in miss rate from 512 KB to 1024 KB
- WS captured at 1024 KB per processor
- As cache size increases (< 32 KB), miss rate decreases at a constant rate
- New effect in the 100s of KB to MB range
22. LU Working Sets
- Constant-problem-size (CPS) scaling means a smaller and smaller problem per processor
- Smaller WS requirement
- Miss rate curve moves to the left with P
23. LU Working Sets
- Given a fixed machine, we only observe a vertical slice of the graph
24. LU Working Sets
(figure panels: Cluster, Origin)
25. Working Sets
(figure panels: LU, IS, BT, FT, MG, SP)
- There is a Cost to scaling when, at larger machine size, the miss rate increases
- There is a Benefit to scaling when, at larger machine size, the miss rate decreases
- Processing efficiency is determined by the interaction between the changes in working set and the size of the machine
26. Sensitivity to Multiprogramming
- Parallel machines are increasingly general purpose
  - multiprogramming, or at least interrupts and daemons
- Many ideal programs are very sensitive to perturbations
- Message passing is loosely coupled, but the implementation may not be!
27. Tools => Improvements in Run Time
- The MPI implementation spin-waits on send until the network is available (or the queue is not full), and on receive-complete
- Should use two-phase spin-block instead (sketched below)
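A minimal sketch of the two-phase spin-block idea, under my own assumptions about the polling interface (this is not the actual MPI implementation discussed in the talk): spin for a bounded interval in case the message arrives quickly, then yield between polls so competing processes and daemons can run.

```c
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

/* Two-phase wait on an MPI request: spin briefly (cheap when the
 * message is already close), then yield the CPU between polls.
 * The spin budget is a made-up tuning knob, not a value from the talk. */
static void spin_block_wait(MPI_Request *req, double spin_budget_s)
{
    int done = 0;
    double t0 = MPI_Wtime();

    /* Phase 1: poll without giving up the processor. */
    while (!done && MPI_Wtime() - t0 < spin_budget_s)
        MPI_Test(req, &done, MPI_STATUS_IGNORE);

    /* Phase 2: keep polling, but yield between polls (a real
     * implementation could block on an OS event here instead). */
    while (!done) {
        sched_yield();
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    if (rank == 1) {
        MPI_Request req;
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        spin_block_wait(&req, 1e-4);      /* spin up to 100 us, then yield */
        printf("rank 1 received %d\n", value);
    } else if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```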
28. Sensitivity to Seemingly Unrelated Activity
- The mechanism for doing parameter studies naturally extends to getting statistically valid data through multiple samples at each point
  - tend to get crisp, fast results in the wee hours
- Extend the study outside the app
- Example: two programs on a big Origin (64 P)

                          alone        together
  8-processor IS run      4.71 sec     6.18 sec
  36-processor SP run     26.36 sec    65.28 sec
29. Repeatability
- The variance across repeated runs is a key result for production codes - the real world is not ideal
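A minimal sketch of the repeatability bookkeeping (my own illustration, with hypothetical numbers): given the run times from repeated samples at one parameter point, report the mean and the spread rather than a single number.

```c
#include <math.h>
#include <stdio.h>

/* Mean and sample standard deviation of repeated run times. */
static void summarize(const double *t, int n, double *mean, double *stddev)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += t[i];
    *mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++)
        ss += (t[i] - *mean) * (t[i] - *mean);
    *stddev = (n > 1) ? sqrt(ss / (n - 1)) : 0.0;
}

int main(void)
{
    /* Hypothetical repeated-run times (seconds) for one configuration. */
    double runs[] = { 26.4, 27.1, 26.8, 31.9, 26.5 };
    int n = sizeof runs / sizeof runs[0];

    double mean, stddev;
    summarize(runs, n, &mean, &stddev);
    printf("mean %.2f s, stddev %.2f s over %d runs\n", mean, stddev, n);
    return 0;
}
```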
30. Understanding the Platform
- A very simple example: broadcast(M, P)
  - vary M, P
  - repeat
(timing structure: MPI barrier; start time; MPI bcast; MPI barrier; end time)
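A minimal, self-contained version of such a bcast(M, P) microbenchmark, under my own assumptions about the exact timing placement (barrier, timestamp, broadcast, barrier, timestamp) and repetition count:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* bcast(M, P) microbenchmark sketch: broadcast an M-byte buffer across
 * all P ranks, REPS times, reporting each iteration's time on rank 0.
 * M comes from the command line; P is however many ranks are launched. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    long m = (argc > 1) ? atol(argv[1]) : 1024;   /* message size in bytes */
    char *buf = malloc(m);
    const int REPS = 100;

    for (int r = 0; r < REPS; r++) {
        MPI_Barrier(MPI_COMM_WORLD);              /* line everyone up      */
        double t0 = MPI_Wtime();
        MPI_Bcast(buf, (int)m, MPI_BYTE, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);              /* wait until all finish */
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("bcast(%ld, %d) rep %3d: %g s\n", m, nprocs, r, t1 - t0);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

As on the following slides, the first repetition would typically be discarded or reported separately.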
31. NOW bcast(m, p)
32. Origin mean bcast(m, p)
33. NOW bcast(1024, p)
34. Origin bcast(1024, p)
35. NOW bcast(1024, 16) repetitions (discarded first iteration)
36. Origin bcast(1024, 16) repetitions (discarded first iteration)
37. Origin bcast(1024, 16) repetitions - 10x
38. Origin bcast(1024, 16) repetitions
39. Origin bcast(1M, 16) repetitions
40. Discussion
- Apply engineering analysis to your parallel engineering analysis codes!
- Isolate components
- Introduce controlled variations
  - processors
  - data set
  - communication rate
  - repetition
- Identify trouble spots
41. To read more
- Parallel Computer Architecture: A Hardware/Software Approach, Culler and Singh, Morgan Kaufmann
- Architectural Requirements and Scalability of the NAS Parallel Benchmarks, Wong, Martin, Arpaci-Dusseau, and Culler, Proc. of SC99
- Building MPI for Multi-Programming Systems Using Implicit Information, Wong, Arpaci-Dusseau, and Culler, 6th European PVM/MPI User's Group Meeting
- http://www.cs.berkeley.edu/~culler/papers