Title: Programming for Performance Part II
1. Programming for Performance: Part II
2. Orchestration for Performance
- Reducing amount of communication
  - Inherent: change logical data sharing patterns in algorithm
  - Artifactual: exploit spatial and temporal locality in extended hierarchy
  - Techniques often similar to those on uniprocessors
- Structuring communication to reduce cost
3. Reducing Artifactual Communication
- Message passing model
  - Communication and replication are both explicit
  - Even artifactual communication is in explicit messages
- Shared address space model
  - More interesting from an architectural perspective
  - Occurs transparently due to interactions of program and system
    - sizes and granularities in extended memory hierarchy
- Use shared address space to illustrate issues
4. Exploiting Temporal Locality
- Structure algorithm so working sets map well to hierarchy
  - often techniques to reduce inherent communication do well here
  - schedule tasks for data reuse once assigned
- Solver example: blocking
- More useful when O(n^(k+1)) computation is done on O(n^k) data
  - many linear algebra computations (factorization, matrix multiply)
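The blocking idea can be sketched concretely; below is a minimal Python blocked matrix multiply, where the block size `bsize` is a hypothetical tuning parameter standing in for cache capacity (the loop structure, not the language, is the point):

```python
def matmul_blocked(A, B, n, bsize):
    # C = A @ B computed in bsize x bsize blocks, so that each block of
    # A, B, and C is reused many times while it is resident in cache
    # (O(n^3) computation on O(n^2) data).
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bsize):
        for jj in range(0, n, bsize):
            for kk in range(0, n, bsize):
                for i in range(ii, min(ii + bsize, n)):
                    for k in range(kk, min(kk + bsize, n)):
                        a = A[i][k]
                        Brow = B[k]
                        Crow = C[i]
                        for j in range(jj, min(jj + bsize, n)):
                            Crow[j] += a * Brow[j]
    return C
```

With `bsize` chosen so three blocks fit in cache, each element fetched is reused about `bsize` times before eviction, instead of once per sweep.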
5. Exploiting Spatial Locality
- Besides capacity, granularities are important:
  - Granularity of allocation
  - Granularity of communication or data transfer
  - Granularity of coherence
- Major spatial-related causes of artifactual communication:
  - Conflict misses
  - Data distribution/layout (allocation granularity)
  - Fragmentation (communication granularity)
  - False sharing of data (coherence granularity)
- All depend on how spatial access patterns interact with data structures
  - Fix problems by modifying data structures, or layout/alignment
- Examine later in context of architectures
  - one simple example here: data distribution in SAS solver
6. Spatial Locality Example
- Repeated sweeps over 2-d grid, each time adding 1 to elements
- Natural 2-d versus higher-dimensional array representation
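The allocation-granularity contrast between the two representations can be made concrete with a back-of-the-envelope sketch (Python, with hypothetical sizes: a 1024 x 1024 grid, 256 x 256 blocks per processor, 1024 elements per page):

```python
def pages_touched_2d(n, block, page_elems):
    # Row-major 2-d array: processor 0's block is `block` separate row
    # fragments, each landing on a page shared with other processors' data.
    pages = set()
    for i in range(block):
        for j in range(block):
            pages.add((i * n + j) // page_elems)
    return len(pages)

def pages_touched_4d(block, page_elems):
    # Higher-dimensional (block-major) layout: the processor's block is one
    # contiguous run of block*block elements, so its pages are few and private.
    return -(-(block * block) // page_elems)  # ceiling division
```

Under these assumed sizes the 2-d layout spreads one processor's block across four times as many pages, all shared at allocation granularity, which is exactly the data-distribution problem the slide points at.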
7. Tradeoffs with Inherent Communication
- Partitioning grid solver: blocks versus rows
  - Blocks still have a spatial locality problem on remote data
  - Row-wise can perform better despite a worse inherent communication-to-computation ratio
- Result depends on n and p
8. Example Performance Impact
- Equation solver on SGI Origin2000
- Long cache block: 128 bytes
- Grid sizes: 512 x 512 and 12K x 12K
9. Architectural Implications of Locality
- Communication abstraction that makes exploiting locality easy
  - For cache-coherent SAS, e.g.
- Size and organization of levels of memory hierarchy
  - cost-effectiveness: caches are expensive
  - caveats: flexibility for different and time-shared workloads
- Replication in main memory useful? If so, how to manage?
  - hardware, OS/runtime, program?
- Granularities of allocation, communication, coherence (?)
  - small granularities => high overheads, but easier to program
- Machine granularity (resource division among processors, memory...)
10. Orchestration for Performance
- Reducing amount of communication
  - Inherent: change logical data sharing patterns in algorithm
  - Artifactual: exploit spatial and temporal locality in extended hierarchy
  - Techniques often similar to those on uniprocessors
- Structuring communication to reduce cost
11. Structuring Communication
- Given an amount of communication (inherent or artifactual), goal is to reduce cost
- Cost of communication as seen by process:
  C = f * ( o + l + (n_c / m) / B + t_c - overlap )
  - f = frequency of messages
  - o = overhead per message (at both ends)
  - l = network delay per message
  - n_c = total data sent
  - m = number of messages
  - B = bandwidth along path (determined by network, NI, assist)
  - t_c = cost induced by contention per message
  - overlap = amount of latency hidden by overlap with computation or other communication
- Portion in parentheses is cost of a message (as seen by processor)
  - That portion, ignoring overlap, is the latency of a message
- Goal: reduce terms in latency and increase overlap
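The cost expression translates directly into a small model (a sketch; the example numbers below are illustrative, not measurements):

```python
def comm_cost(f, o, l, nc, m, B, tc, overlap):
    # C = f * (o + l + (nc/m)/B + tc - overlap), where nc/m is the
    # average message size, so (nc/m)/B is per-message transfer time.
    per_message = o + l + (nc / m) / B + tc
    return f * (per_message - overlap)
```

For example, with 1000 messages of 1 KB each, 1 us overhead, 2 us delay, 1 GB/s bandwidth, and no contention or overlap, each message costs 4 us, for 4 ms total; the model makes it easy to see which term dominates before deciding what to optimize.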
12. Reducing Overhead
- Can reduce number of messages m or overhead per message o
- o is usually determined by hardware or system software
  - Program should try to reduce m by coalescing messages
  - More control when communication is explicit
- Coalescing data into larger messages:
  - Easy for regular, coarse-grained communication
  - Can be difficult for irregular, naturally fine-grained communication
    - may require changes to algorithm and extra work
      - coalescing data and determining what to send, and to whom
    - will discuss more under implications for programming models later
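Coalescing can be sketched as simple per-destination batching (hypothetical message representation; real codes would pack into contiguous buffers):

```python
from collections import defaultdict

def coalesce(sends):
    # sends: list of (dest, payload) pairs issued by fine-grained code.
    # Group them so each destination receives one larger message, paying
    # the per-message overhead o once per destination instead of once
    # per payload.
    batches = defaultdict(list)
    for dest, payload in sends:
        batches[dest].append(payload)
    return dict(batches)
```

With per-message overhead o, the overhead term drops from len(sends) * o to one o per destination, at the price of buffering and the bookkeeping the slide mentions (deciding what to send, and to whom).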
13. Reducing Network Delay
- Network delay component = f * h * t_h
  - h = number of hops traversed in network
  - t_h = link+switch latency per hop
- Reducing f: communicate less, or make messages larger
- Reducing h:
  - Map communication patterns to network topology
    - e.g. nearest-neighbor on mesh and ring; all-to-all
  - How important is this?
    - used to be a major focus of parallel algorithms
    - depends on number of processors, and how large t_h is relative to other components
      - t_h covers a single phit in pipelined networks, but a whole message in store-and-forward networks
    - less important on modern machines (overheads, processor count, multiprogramming)
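As a sketch of the hop count h that topology mapping controls: on a k x k mesh with dimension-ordered routing, h is just Manhattan distance (node numbering here is an assumed row-major convention):

```python
def hops_mesh(src, dst, k):
    # Nodes numbered row-major on a k x k mesh; hop count with
    # dimension-ordered routing is the Manhattan distance.
    sx, sy = src % k, src // k
    dx, dy = dst % k, dst // k
    return abs(sx - dx) + abs(sy - dy)
```

A nearest-neighbor pattern mapped onto the mesh pays h = 1 per message, while a poor mapping can pay up to 2(k-1) hops; whether that difference matters depends on how large h * t_h is next to o and the transfer time.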
14. Reducing Contention
- All resources have nonzero occupancy
  - Memory, communication controller, network link, etc.
  - Finite bandwidth for serving transactions
- Effects of contention:
  - Increased end-to-end cost for messages
  - Reduced available bandwidth for individual messages
  - Causes imbalances across processors
- Particularly insidious performance problem
  - Easy to ignore when programming
  - Slows down messages that don't even need that resource
    - by causing other, dependent resources to also congest
  - Effect can be devastating: don't flood a resource!
15. Types of Contention
- Network contention and end-point contention (hot-spots)
- Location and module hot-spots
  - Location: e.g. accumulating into a global variable, barrier
    - solution: tree-structured communication
  - Module: all-to-all personalized communication in matrix transpose
    - solution: stagger accesses by different processors to the same node temporally
- In general, reduce burstiness; may conflict with making messages larger
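The tree-structured fix for a location hot-spot can be sketched as a pairwise combining reduction: instead of p messages converging on one node, each step halves the number of contributors, so no node receives more than one message per step and the depth is about log2(p):

```python
def tree_reduce(values):
    # Pairwise combining: roughly log2(p) steps, rather than p updates
    # serialized at a single hot location.
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:            # odd contributor passes through
            nxt.append(vals[-1])
        vals = nxt
        steps += 1
    return vals[0], steps
```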
16. Overlapping Communication
- Cannot afford to stall for high latencies
  - even on uniprocessors!
- Overlap with computation or communication to hide latency
- Requires extra concurrency (slackness), higher bandwidth
- Techniques:
  - Prefetching
  - Block data transfer
  - Overlap
  - Multithreading
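A back-of-the-envelope model of why overlap needs slackness: with enough independent work to compute on iteration i while fetching data for iteration i+1, each step costs max(comp, comm) instead of comp + comm (a sketch with made-up times, assuming perfect overlap):

```python
def total_time(iters, comp, comm, overlapped):
    # Without overlap, every iteration stalls for its communication.
    # With perfect overlap, only the first transfer is exposed and the
    # slower of the two activities sets the pace thereafter.
    if overlapped:
        return comm + iters * max(comp, comm)
    return iters * (comp + comm)
```

Note that once comm exceeds comp, overlap no longer hides everything: the program becomes bandwidth-bound, which is the "higher bandwidth" requirement on the slide.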
17. Summary of Tradeoffs
- Different goals often have conflicting demands:
  - Load balance
    - fine-grain tasks
    - random or dynamic assignment
  - Communication
    - usually coarse-grain tasks
    - decompose to obtain locality: not random/dynamic
  - Extra work
    - coarse-grain tasks
    - simple assignment
  - Communication cost
    - big transfers: amortize overhead and latency
    - small transfers: reduce contention
18. Processor-Centric Perspective
[Figure: execution-time breakdown (time in seconds, 0-100) into Busy-useful, Busy-overhead, Data-local, Data-remote, and Synchronization components, for (a) sequential execution and (b) parallel execution on four processors P0-P3]
19. Relationship between Perspectives
20. Summary
- Speedup_prob(p) <= Sequential Work / max over processors (Work + Synch Wait Time + Comm Cost + Extra Work)
- Goal is to reduce denominator components
- Both programmer and system have a role to play
- Architecture cannot do much about load imbalance or too much communication
- But it can:
  - reduce incentive for creating ill-behaved programs (efficient naming, communication and synchronization)
  - reduce artifactual communication
  - provide efficient naming for flexible assignment
  - allow effective overlapping of communication
21. Workload-Driven Architectural Evaluation
22. Evaluation in Uniprocessors
- Evaluation:
  - For existing systems: comparison and procurement evaluation
  - For future systems: careful extrapolation from known quantities
- Standard benchmarks
  - Measured on a wide range of machines and successive generations
- Measurements and technology assessment => features => simulation => new design
  - Simulator: simulate the design with and without a feature
  - Benchmarks run through the simulator to obtain results
  - Together with cost and complexity, decisions are made
23. Difficult Enough for Uniprocessors
- Workloads need to be renewed and reconsidered
  - Input data sets affect key interactions
  - Changes from SPEC92 to SPEC95 to SPEC98
- Simulation is time-consuming
  - Accurate simulators costly to develop and verify
- Good evaluation leads to good design
- Quantitative evaluation increasingly important for multiprocessors
  - Maturity of architecture, and greater continuity among generations
  - It's a grounded, engineering discipline now
- Good evaluation is critical, and we must learn to do it right
24. More Difficult for Multiprocessors
- What is a representative workload?
  - Software model has not stabilized
- Many architectural and application degrees of freedom
  - Huge design space: number of processors, other architectural and application parameters
  - Impact of these parameters and their interactions can be huge
  - High cost of communication
- What are the appropriate metrics?
- Simulation is expensive
  - Realistic configurations and sensitivity analysis difficult
  - Larger design space, but more difficult to cover
- Understanding of parallel programs as workloads is critical
  - Particularly the interaction of application and architectural parameters
25. A Lot Depends on Sizes
- Application parameters and number of processors affect inherent properties
  - Load balance, communication, extra work, temporal and spatial locality
- Interactions with organization parameters of extended memory hierarchy affect artifactual communication and performance
- Effects often dramatic, sometimes small: application-dependent
  [Figure: performance versus problem size; labels: Barnes-Hut, Grid points (N)]
- Understanding size interactions and scaling relationships is key
26. Outline
- Performance and scaling (of workload and architecture)
  - Techniques
  - Implications for behavioral characteristics and performance metrics
- Evaluating a real machine
  - Choosing workloads
  - Choosing workload parameters
  - Choosing metrics and presenting results
- Evaluating an architectural idea/tradeoff through simulation
- Public-domain workload suites
27. Measuring Performance
- Absolute performance
  - Most important to end user
- Performance improvement due to parallelism
  - Speedup(p) = Performance(p) / Performance(1), always
- Performance = Work / Time, always
  - Work is determined by input configuration of the problem
  - If work is fixed, can measure performance as 1/Time
  - Or retain explicit work measure (e.g. transactions/sec, bonds/sec)
  - Still w.r.t. a particular configuration, and still what's measured is time
- Speedup(p) = Time(1) / Time(p), or the ratio of work rates (Work(p)/Time(p)) / (Work(1)/Time(1))
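The two speedup formulations can be sketched side by side (illustrative numbers; the rate form matters once work is no longer fixed across configurations):

```python
def speedup_fixed(time_1, time_p):
    # Fixed work: Speedup(p) = Time(1) / Time(p)
    return time_1 / time_p

def speedup_rate(work_p, time_p, work_1, time_1):
    # Explicit work measure (e.g. transactions/sec):
    # Speedup(p) = (Work_p / Time_p) / (Work_1 / Time_1)
    return (work_p / time_p) / (work_1 / time_1)
```

When the work is the same on both configurations, the two definitions coincide; they diverge only for scaled problems, which is exactly why the next slides worry about scaling.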
28. Scaling: Why Worry?
- Fixed problem size is limited
- Too small a problem:
  - May be appropriate for a small machine
  - Parallelism overheads begin to dominate benefits for larger machines
    - Load imbalance
    - Communication-to-computation ratio
    - May even achieve slowdowns
  - Doesn't reflect real usage, and inappropriate for large machines
- Too large a problem:
  - Difficult to measure improvement (may not be runnable on a single processor)
29. Too Large a Problem
- Suppose the problem is realistically large for a big machine
- May not fit in a small machine
  - Can't run
  - Thrashing to disk
  - Working set doesn't fit in cache
- Fits at some p, leading to superlinear speedup
- Finally, users want to scale problems as machines grow
30. Demonstrating Scaling Problems
- Small Ocean and big equation solver problems on SGI Origin2000
31. Questions in Scaling
- Under what constraints to scale the application?
- What are the appropriate metrics for performance improvement?
  - work is not fixed any more, so time is not enough
- How should the application be scaled?
- Definitions:
  - Scaling a machine: can scale power in many ways
    - Assume adding identical nodes, each bringing memory
  - Problem size: vector of input parameters, e.g. N = (n, q, Δt)
    - Determines work done
    - Distinct from data set size and memory usage
- Start by assuming it's only one parameter n, for simplicity