Transcript and Presenter's Notes

Title: Parallel Application Scaling, Performance, and Efficiency


1
Parallel Application Scaling, Performance, and
Efficiency
  • David Skinner
  • NERSC/LBL

2
Parallel Scaling of MPI Codes
  • A practical talk on using MPI, with a focus on:
  • Distribution of work within a parallel program
  • Placement of computation within a parallel
    computer
  • Performance costs of various types of
    communication
  • Understanding scaling performance terminology

3
Topics
  • Introduction
  • Load Balance
  • Synchronization
  • Simple stuff
  • File I/O
  • Performance profiling

4
Let's introduce these topics through a familiar
example: Sharks and Fish II
  • Sharks and Fish II: N² force summation in
    parallel
  • E.g. 4 CPUs evaluate forces for a global
    collection of 125 fish
  • Domain decomposition: each CPU is in charge of
    31 fish, but keeps a fairly recent copy of all
    the fishes' positions (replicated data)
  • It is not possible to uniformly decompose
    problems in general, especially in many
    dimensions
  • Luckily this problem has fine granularity and is
    2D; let's see how it scales

5
Sharks and Fish II Program
  • Data
  • n_fish is global
  • my_fish is local
  • fish[i] = (x, y, ...)
  • Dynamics

MPI_Allgatherv(myfish_buf, len[rank], ...);
for (i = 0; i < my_fish; i++)
  for (j = 0; j < n_fish; j++)      // i != j
    a[i] += g * mass[j] * (fish[i] - fish[j]) / r_ij;
// Move fish
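
A fuller sketch of this replicated-data step, offered as a hedged illustration: the Fish struct, the force_step signature, the recvcounts/displs arrays, and the MPI_IN_PLACE form of the gather are choices made here, not details taken from the original fish_sim program; the force expression simply follows the pseudocode above.

#include <math.h>
#include <mpi.h>

typedef struct { double x, y; } Fish;   /* 2D positions, as in the example */

/* fish[]     : replicated array of all n_fish positions
   my_first   : index of this rank's first fish; my_fish : how many it owns
   recvcounts : doubles (2 per fish) contributed by each rank; displs : offsets */
void force_step(Fish *fish, int n_fish, int my_first, int my_fish,
                const double *mass, double g,
                int *recvcounts, int *displs,
                double *ax, double *ay, MPI_Comm comm)
{
    /* Refresh the replicated copy of every fish's position; each rank's
       contribution is taken in place from its own slice of fish[]. */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   fish, recvcounts, displs, MPI_DOUBLE, comm);

    /* N^2 force summation, but only over the fish this rank owns. */
    for (int i = my_first; i < my_first + my_fish; i++) {
        ax[i] = ay[i] = 0.0;
        for (int j = 0; j < n_fish; j++) {
            if (j == i) continue;                          /* i != j */
            double dx = fish[i].x - fish[j].x;
            double dy = fish[i].y - fish[j].y;
            double r  = sqrt(dx * dx + dy * dy) + 1e-12;   /* avoid divide by 0 */
            ax[i] += g * mass[j] * dx / r;
            ay[i] += g * mass[j] * dy / r;
        }
    }
}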
6
Sharks and Fish II: How fast?
  • Running on the machine seaborg.nersc.gov
  • 100 fish can move 1000 steps in
  • 1 task → 5.459 s
  • 32 tasks → 2.756 s (a 1.98x speedup)
  • 1000 fish can move 1000 steps in
  • 1 task → 511.14 s
  • 32 tasks → 20.815 s (a 24.6x speedup)
  • What's the best way to run?
  • How many fish do we really have?
  • How large a computer do we have?
  • How much computer time (i.e. allocation) do we
    have?
  • How quickly, in real wall time, do we need the
    answer?

7
Scaling, a good 1st step: do runtimes make sense?
Running fish_sim for 100-1000 fish on 1-32 CPUs
we see
[Figure: runtimes for 1 task and for 32 tasks]
8
Scaling Walltimes
Walltime is (all-)important, but let's define some
other scaling metrics
9
Scaling definitions
  • Scaling studies involve changing the degree of
    parallelism. Will we change the problem size as well?
  • Strong scaling
  • Fixed problem size
  • Weak scaling
  • Problem size grows with additional resources
  • Speedup = Ts / Tp(n)
  • Efficiency = Ts / (n · Tp(n))

Be aware there are multiple definitions for
these terms
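
As a concrete check of these definitions, a minimal sketch that applies them to the 1000-fish timings from the earlier slide (Ts = 511.14 s serial, Tp = 20.815 s on n = 32 tasks):

#include <stdio.h>

int main(void)
{
    /* Strong-scaling numbers quoted earlier in the talk. */
    double Ts = 511.14;                   /* 1 task, 1000 fish, 1000 steps */
    double Tp = 20.815;                   /* 32 tasks, same problem        */
    int    n  = 32;

    double speedup    = Ts / Tp;          /* ~24.6x */
    double efficiency = Ts / (n * Tp);    /* ~0.77  */

    printf("speedup = %.1f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}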
10
Scaling Speedups
11
Scaling Efficiencies
Remarkably smooth! Often algorithm and
architecture make the efficiency landscape quite
complex
12
Scaling Analysis
  • Why does efficiency drop?
  • Serial code sections → Amdahl's law
  • Surface-to-volume ratio → communication bound
  • Algorithm complexity or switching
  • Communication protocol switching

→ Whoa!
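
For the first bullet, a minimal sketch of Amdahl's law: if a fraction s of the runtime is serial, speedup is capped at 1 / (s + (1 - s)/n), and hence at 1/s as n grows. The serial fraction below is illustrative, not measured from the fish code.

#include <stdio.h>

/* Amdahl's law: a serial fraction s caps speedup at 1/s no matter
   how many tasks n are added. */
static double amdahl(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    double s = 0.05;                      /* hypothetical 5% serial code */
    int n[] = { 1, 32, 1024 };
    for (int i = 0; i < 3; i++)
        printf("n = %4d  speedup <= %5.1f\n", n[i], amdahl(s, n[i]));
    return 0;                             /* the n -> infinity limit is 1/s = 20 */
}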
13
Scaling Analysis
  • In general, changing problem size and concurrency
    exposes or removes compute resources. Bottlenecks
    shift.
  • In general, the first bottleneck wins.
  • Scaling brings additional resources too.
  • More CPUs (of course)
  • More cache(s)
  • More memory bandwidth in some cases

14
Scaling Superlinear Speedup
[Figure: speedup vs. number of CPUs (OMP)]
15
Scaling Communication Bound
64 tasks: 52% comm
192 tasks: 66% comm
768 tasks: 79% comm
  • MPI_Allreduce buffer size is 32 bytes.
  • Q: What resource is being depleted here?
  • A: Small-message latency
  • Compute per task is decreasing
  • Synchronization rate is increasing
  • Surface-to-volume ratio is increasing
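
One way to see why the communication fraction climbs: a toy model, with made-up work and latency constants (not seaborg measurements), in which each step costs W/n of compute plus a latency-dominated collective that grows slowly with task count. As n rises the compute term shrinks and the latency term takes over.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double W     = 1.0;       /* hypothetical total compute per step (s)        */
    double alpha = 2e-4;      /* hypothetical per-stage collective latency (s)  */
    int tasks[]  = { 64, 192, 768 };

    for (int i = 0; i < 3; i++) {
        int n = tasks[i];
        double compute = W / n;                    /* shrinks with n      */
        double comm    = alpha * log2((double)n);  /* grows slowly with n */
        printf("n = %3d   comm fraction = %4.0f%%\n",
               n, 100.0 * comm / (compute + comm));
    }
    return 0;
}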

16
Topics
  • Introduction
  • Load Balance
  • Synchronization
  • Simple stuff
  • File I/O

17
Load Balance cartoon
[Figure: timelines for an unbalanced vs. a balanced run of a "universal app"; the gap between the two is the time saved by load balance]
18
Load Balance performance data
Communication time: 64 tasks show 200s, 960 tasks
show 230s
[Figure: MPI ranks sorted by total communication time]
19
Load Balance code
while (1) {
  do_flops(N_i);
  MPI_Alltoall(...);
  MPI_Allreduce(...);
}
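
A runnable sketch of this pattern with one deliberately slow rank; do_flops and the 2x work factor are illustrative. Every other rank pays for the slow one as wait time inside the synchronizing collective.

#include <mpi.h>
#include <stdio.h>

static double do_flops(long n)            /* stand-in compute kernel */
{
    double x = 0.0;
    for (long k = 0; k < n; k++)
        x += 1e-9 * (double)k;
    return x;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long n = 10000000L;
    if (rank == 0)
        n *= 2;                           /* one slow task is all it takes */

    double t0 = MPI_Wtime();
    double x  = do_flops(n), sum;
    double t1 = MPI_Wtime();
    MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t2 = MPI_Wtime();

    /* Fast ranks report their wait for rank 0 as time in the collective. */
    printf("rank %d: compute %.3f s, collective (incl. wait) %.3f s\n",
           rank, t1 - t0, t2 - t1);
    MPI_Finalize();
    return 0;
}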

20
Load Balance real code
[Figure: time per MPI rank in a real code (axes: MPI rank →, time →)]
21
Load Balance analysis
  • The 64 slow tasks (with more compute work) cause
    30 seconds more communication in the 960-task run
  • That is 960 x 30 s = 28,800 CPU-seconds (8 CPU-hours) of
    unproductive computing
  • All imbalance requires is one slow task and a
    synchronizing collective!
  • Pair problem size and concurrency well.
  • Parallel computers allow you to waste time faster!

22
Load Balance FFT
  • Q: When is imbalance good? A: When it leads to a
    faster algorithm.

23
Dynamical Load Balance Motivation
24
Load Balance Summary
  • Imbalance most often a byproduct of data
    decomposition
  • Must be addressed before further MPI tuning can
    happen
  • Good software exists for graph partitioning /
    remeshing
  • Dynamical load balance may be required for
    adaptive codes
  • For regular grids consider padding or contracting

25
Topics
  • Introduction
  • Load Balance
  • Synchronization
  • Simple stuff
  • File I/O
  • Performance profiling

26
Scaling of MPI_Barrier()
[Figure: scaling of MPI_Barrier(); note the four orders of magnitude]
27
Synchronization definition
  • MPI_Barrier(MPI_COMM_WORLD)
  • T1 = MPI_Wtime()
  • e.g. MPI_Allreduce()
  • T2 = MPI_Wtime() - T1

How synchronizing is MPI_Allreduce?
  • For a code running on N tasks, what is the
    distribution of the T2s?
  • The average and width of this distribution tell
    us how synchronizing (e.g.) MPI_Allreduce is
  • Completion semantics of MPI functions
  • Local: leave based on local logic
    (MPI_Comm_rank)
  • Partially synchronizing: leave after messaging
    M < N tasks (MPI_Bcast, MPI_Reduce)
  • Fully synchronizing: leave only after everyone
    else enters (MPI_Barrier, MPI_Allreduce)
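
A minimal sketch of this measurement; summarizing the T2 distribution by its min, max, and average is a choice made here, whereas IPM and the plots in this talk show the full distribution.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, ntasks;
    double x = 1.0, y, T1, T2, tmin, tmax, tsum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    MPI_Barrier(MPI_COMM_WORLD);               /* line everyone up first */
    T1 = MPI_Wtime();
    MPI_Allreduce(&x, &y, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    T2 = MPI_Wtime() - T1;

    /* Width of the T2 distribution ~ how synchronizing the call is. */
    MPI_Reduce(&T2, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&T2, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&T2, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("T2 over %d tasks: min %.2e  avg %.2e  max %.2e\n",
               ntasks, tmin, tsum / ntasks, tmax);

    MPI_Finalize();
    return 0;
}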

28
seaborg.nersc.gov
  • It's very hard to discuss synchronization outside
    of the context of a particular parallel computer
  • So we will examine parallel application scaling
    on an IBM SP, which is largely applicable to
    other clusters

29
seaborg.nersc.gov basics
IBM SP
[Figure: 380 nodes connected by the Colony switch (two adapters, CSS0 and CSS1), with HPSS attached]

Resource        Speed    Size
Registers       3 ns     2560 B
L1 Cache        5 ns     32 KB
L2 Cache        45 ns    8 MB
Main Memory     300 ns   16 GB
Remote Memory   19 us    7 TB
GPFS            10 ms    50 TB
HPSS            5 s      9 PB
  • 6080 dedicated CPUs, 96 shared login CPUs
  • Hierarchy of caching, speeds not balanced
  • Bottleneck determined by first depleted resource

30
MPI on the IBM SP
  • 2-4096 way concurrency
  • MPI-1 and MPI-2
  • GPFS-aware MPI-IO
  • Thread safety
  • Ranks on the same node bypass the switch

31
Seaborg point-to-point messaging
Switch bandwidth is often stated in optimistic
terms
[Figure: measured point-to-point bandwidth, intranode vs. internode]
32
MPI seaborg.nersc.gov
Intra- and inter-node communication:

MP_EUIDEVICE (fabric)   Bandwidth (MB/sec)   Latency (usec)
css0                    500 / 350            9 / 21
css1                    X                    X
csss                    500 / 350            9 / 21
(values are intranode / internode)

  • Lower latency → can satisfy more syncs/sec
  • What is the benefit of two adapters?
  • Can a single
33
Inter-Node Bandwidth
  • Tune message size to optimize throughput
  • Aggregate messages when possible

[Figure: internode bandwidth vs. message size for csss and css0]
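
An illustrative sketch of the aggregation advice above: rank 0 pays one latency for a single packed send versus one latency per tiny message. The message count and sizes are arbitrary; run with at least 2 tasks.

#include <mpi.h>
#include <stdio.h>

#define NMSG 1024

int main(int argc, char **argv)
{
    int rank;
    double buf[NMSG];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NMSG; i++) buf[i] = i;

    if (rank == 0) {
        double t0 = MPI_Wtime();
        for (int i = 0; i < NMSG; i++)                   /* NMSG latency hits */
            MPI_Send(&buf[i], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        MPI_Send(buf, NMSG, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);  /* one latency hit */
        double t2 = MPI_Wtime();
        printf("%d tiny sends: %.3e s, one aggregated send: %.3e s\n",
               NMSG, t1 - t0, t2 - t1);
    } else if (rank == 1) {
        for (int i = 0; i < NMSG; i++)
            MPI_Recv(&buf[i], 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(buf, NMSG, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}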
34
MPI Performance is often Hierarchical
[Figure: bandwidth vs. message size, intranode and internode]
  • Message size and task placement are key
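
As an aside on the placement half of that statement: modern MPI (MPI-3) can report which ranks share a node, one way for a code to exploit the intra/inter hierarchy. This call postdates the MPI-1/MPI-2 stack described in this talk, so treat it purely as an illustration.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, node_rank, node_size;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks that can share memory (i.e., that live on the same node)
       end up together in node_comm. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    printf("world rank %d is local rank %d of %d on its node\n",
           world_rank, node_rank, node_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}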

35
MPI Latency not always 1 or 2 numbers
The set of all possible latencies describes the
interconnect from the application perspective
36
Synchronization measurement
  • MPI_Barrier(MPI_COMM_WORLD)
  • T1 = MPI_Wtime()
  • e.g. MPI_Allreduce()
  • T2 = MPI_Wtime() - T1

How synchronizing is MPI_Allreduce?
For a code running on N tasks, what is the
distribution of the T2s? Let's measure this.
37
Synchronization MPI Collectives
Beyond load balance there is a distribution of
MPI timings intrinsic to the MPI call itself
[Figure: distribution of collective timings, 2048 tasks]
38
Synchronization Architecture
and from the machine itself
t is the frequency of kernel process scheduling,
Unix cron, et al.
39
Intrinsic Synchronization Alltoall
40
Intrinsic Synchronization Alltoall
Architecture makes a big difference!
41
This leads to variability in Execution Time
42
Synchronization Summary
  • As a programmer you can control
  • Which MPI calls you use (it's not required to use
    them all).
  • Message sizes, problem size (maybe)
  • The temporal granularity of synchronization
  • Language writers and system architects control
  • How hard it is to do the last two above
  • The intrinsic amount of noise in the machine

43
Topics
  • Introduction
  • Load Balance
  • Synchronization
  • Simple stuff
  • File I/O
  • Performance profiling

44
Simple Stuff
  • Parallel programs are easier to mess up than
    serial ones. Here are some common pitfalls.

45
What's wrong here?
46
MPI_Barrier
  • Is MPI_Barrier time bad? Probably. Is it
    avoidable?
  • Three cases:
  • The stray / unknown / debug barrier
  • The barrier which is masking compute imbalance
  • Barriers used for I/O ordering

Often very easy to fix
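
For the third case, one sketch of an easy fix: instead of N barrier-ordered prints (or writes), gather the per-rank values to rank 0 and emit them once, with no barrier at all. The variable names here are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_result = 42.0 + rank;       /* stand-in per-rank value */

    /* Replaces: for (r = 0; r < size; r++) { MPI_Barrier(...);
       if (rank == r) printf(...); }                              */
    double *all = NULL;
    if (rank == 0)
        all = malloc(size * sizeof(double));
    MPI_Gather(&my_result, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        for (int r = 0; r < size; r++)
            printf("rank %d: %f\n", r, all[r]);
        free(all);
    }
    MPI_Finalize();
    return 0;
}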
47
Topics
  • Introduction
  • Load Balance
  • Synchronization
  • Simple stuff
  • File I/O
  • Performance profiling

48
Parallel File I/O Strategies
[Figure: parallel file I/O strategies (MPI tasks and disk)]
Some strategies fall down at scale
49
Parallel File I/O Metadata
  • A parallel file system is great, but it is also
    another place to create contention.
  • Avoid unneeded disk I/O; know your file system
  • Often avoid file-per-task I/O strategies when
    running at scale
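
A hedged sketch of one alternative to file-per-task output, using MPI-IO (which the talk notes is GPFS-aware on seaborg): every rank writes a disjoint slice of a single shared file with a collective call, so only one file's metadata is touched. The file name and sizes are illustrative.

#include <mpi.h>

#define N_LOCAL 1000                      /* doubles written per task (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    double data[N_LOCAL];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N_LOCAL; i++)
        data[i] = (double)rank;

    /* One shared file instead of one file per task. */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its own slice at a rank-computed offset, collectively. */
    MPI_Offset offset = (MPI_Offset)rank * N_LOCAL * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, data, N_LOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}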

50
Topics
  • Introduction
  • Load Balance
  • Synchronization
  • Simple stuff
  • File I/O
  • Performance profiling

51
Performance Profiling
  • Most of the tables and graphs in this talk were
    generated using IPM (http://ipm-hpc.sf.net)
  • On seaborg, do module load ipm, run as you
    normally would, and you get a brief summary to
    stdout.
  • More detailed performance profiles are generated
    from an XML record written by IPM.
  • Within 24 hours after your job completes you
    should be able to find a performance summary of
    your job online:
  • https://www.nersc.gov/nusers/status/llsum/

52
How to use IPM basics
  • 1) Do module load ipm, then run normally
  • 2) Upon completion you get
  • Maybe that's enough. If so you're done.
  • Have a nice day.

IPMv0.85
command   : ../exe/pmemd -O -c inpcrd -o res   (completed)
host      : s05405               mpi_tasks : 64 on 4 nodes
start     : 02/22/05/10:03:55    wallclock : 24.278400 sec
stop      : 02/22/05/10:04:17    %comm     : 32.43
gbytes    : 2.57604e00 total     gflop/sec : 2.04615e00 total

53
Want more detail? IPM_REPORT=full

IPMv0.85
command   : ../exe/pmemd -O -c inpcrd -o res   (completed)
host      : s05405               mpi_tasks : 64 on 4 nodes
start     : 02/22/05/10:03:55    wallclock : 24.278400 sec
stop      : 02/22/05/10:04:17    %comm     : 32.43
gbytes    : 2.57604e00 total     gflop/sec : 2.04615e00 total

              total      <avg>      min          max
wallclock     1373.67    21.4636    21.1087      24.2784
user          936.95     14.6398    12.68        20.3
system        227.7      3.55781    1.51         5
mpi           503.853    7.8727     4.2293       9.13725
%comm                    32.4268    17.42        41.407
gflop/sec     2.04614    0.0319709  0.02724      0.04041
gbytes        2.57604    0.0402507  0.0399284    0.0408173
gbytes_tx     0.665125   0.0103926  1.09673e-05  0.0368981
gbyte_rx      0.659763   0.0103088  9.83477e-07  0.0417372
54
Want more detail? IPM_REPORT=full

                 total        <avg>        min          max
PM_CYC           3.00519e11   4.69561e09   4.50223e09   5.83342e09
PM_FPU0_CMPL     2.45263e10   3.83223e08   3.3396e08    5.12702e08
PM_FPU1_CMPL     1.48426e10   2.31916e08   1.90704e08   2.8053e08
PM_FPU_FMA       1.03083e10   1.61067e08   1.36815e08   1.96841e08
PM_INST_CMPL     3.33597e11   5.21245e09   4.33725e09   6.44214e09
PM_LD_CMPL       1.03239e11   1.61311e09   1.29033e09   1.84128e09
PM_ST_CMPL       7.19365e10   1.12401e09   8.77684e08   1.29017e09
PM_TLB_MISS      1.67892e08   2.62332e06   1.16104e06   2.36664e07

                 time         calls        <%mpi>       <%wall>
MPI_Bcast        352.365      2816         69.93        22.68
MPI_Waitany      81.0002      185729       16.08        5.21
MPI_Allreduce    38.6718      5184         7.68         2.49
MPI_Allgatherv   14.7468      448          2.93         0.95
MPI_Isend        12.9071      185729       2.56         0.83
MPI_Gatherv      2.06443      128          0.41         0.13
MPI_Irecv        1.349        185729       0.27         0.09
MPI_Waitall      0.606749     8064         0.12         0.04
MPI_Gather       0.0942596    192          0.02         0.01

55
Detailed profiling based on message size
[Figures: breakdown per MPI call, and per MPI call and buffer size]
56
Summary
Happy Scaling!
  • Introduction ✓
  • Load Balance ✓
  • Synchronization ✓
  • Simple stuff ✓
  • File I/O ✓

57
Other sources of information
  • MPI Performance
  • http://www-unix.mcs.anl.gov/mpi/tutorial/perf/mpiperf/
  • Seaborg MPI Scaling
  • http://www.nersc.gov/news/reports/technical/seaborg_scaling/
  • MPI Synchronization
  • Fabrizio Petrini, Darren J. Kerbyson, Scott
    Pakin, "The Case of the Missing Supercomputer
    Performance: Achieving Optimal Performance on the
    8,192 Processors of ASCI Q", in Proc.
    SuperComputing, Phoenix, November 2003.
  • Domain decomposition
  • http://www.ddm.org/
  • google: space filling decomposition, etc.
  • Metis
  • http://www-users.cs.umn.edu/karypis/metis