Title: Parallel Programming: Performance Issues
1. Parallel Programming: Performance Issues
2. What's involved in the performance of a system?
- Data generation, storage, and transmission
- Design, implementation, and maintenance costs
- Reuse, portability, and scalability potential
3. Why worry about quantifying performance?
- To be able to compare algorithms based on
- Efficiency
- Speedup (vs. sequential)
- Scalability
4. Some ways to model performance
- Amdahl's Law
- Extrapolation from Observations
- Asymptotic Analysis
5. Amdahl's Law
- Every algorithm has a sequential component that will limit the performance of a parallel algorithm
- The law: if the sequential component of an algorithm accounts for 1/s of the program's execution time, then the maximum possible speedup that can be achieved on a parallel computer is s
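As a quick illustration (a minimal sketch, not taken from the slides), the bound follows from holding the sequential fraction fixed while the rest of the work is divided among P processors:

```python
# Amdahl's law: if a fraction f = 1/s of the work is inherently sequential,
# speedup on P processors is at most 1 / (f + (1 - f)/P), which tends to s.
def amdahl_speedup(seq_fraction: float, p: int) -> float:
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / p)

# A 5% sequential component (s = 20) caps speedup at 20 no matter how many
# processors are used.
for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))   # 3.48, 9.14, 15.42, 19.64
```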
6. Amdahl's Law
- Generally holds when parallelizing sequential programs incrementally
- Does not hold overall
- Why?
- The execution time of any sequential section of a task does not limit the number of tasks that could be run in parallel or how those tasks might be further decomposed
7. Extrapolation from Observations
- "We implemented the algorithm on parallel computer X and achieved a speedup of 10.8 on 12 processors with problem size N = 100."
- Why is this a problem?
- ONE DATA POINT!!
8. Extrapolation from Observations
- T = N + N²/P
- This algorithm partitions the computationally demanding O(N²) component of the algorithm but replicates the O(N) component on every processor. There are no other sources of overhead.
9. Extrapolation from Observations
- T = N + N²/P
- T = (N + N²)/P + 100
- This algorithm partitions all the computation but introduces an additional cost of 100
10. Extrapolation from Observations
- T = N + N²/P
- T = (N + N²)/P + 100
- T = (N + N²)/P + 0.6P²
- This algorithm also partitions all the computation but introduces an additional cost of 0.6P²
11. Extrapolation from Observations
- T = N + N²/P
- T = (N + N²)/P + 100
- T = (N + N²)/P + 0.6P²
- All of these algorithms have a speedup of roughly 10.8 when P = 12 and N = 100
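Checking the claim directly (a quick sketch; the parenthesised forms of the second and third models are as reconstructed above):

```python
# The three performance models above, with sequential time T1 = N + N^2.
def t_seq(n):        return n + n**2
def model_1(n, p):   return n + n**2 / p                 # replicates the O(N) part
def model_2(n, p):   return (n + n**2) / p + 100         # fixed overhead of 100
def model_3(n, p):   return (n + n**2) / p + 0.6 * p**2  # overhead growing as P^2

n, p = 100, 12
for name, model in (("1", model_1), ("2", model_2), ("3", model_3)):
    print(name, round(t_seq(n) / model(n, p), 1))   # speedups ~10.8, ~10.7, ~10.9
```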
12. Extrapolation from Observations (figure)
13. Extrapolation from Observations (figure)
14. Asymptotic Analysis
- "Asymptotic analysis reveals that the algorithm requires O(N log N) time on O(N) processors."
- In other words, there exists a constant c and a minimum problem size N0 such that for all N > N0, cost(N) ≤ c · N log N on N processors
15. Asymptotic Analysis
- What's wrong with this?
- It works for large values of N and P, but is not necessarily applicable to your problem
16. Asymptotic Analysis
- Example 1: 10N + N log N
- For problems with N < 1024, the 10N term dominates and must be considered
- Example 2: 1000 N log N vs. 10N²
- Standard asymptotic analysis says that the first algorithm is better, even though for problems with N < 996 the second is faster
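The crossover points can be checked directly (a quick sketch using base-2 logarithms, the convention that makes the 1024 and 996 thresholds come out):

```python
import math

# Example 1: N*log2(N) only overtakes 10*N once log2(N) > 10, i.e. N > 1024,
# so below that the "low-order" 10N term is the larger of the two.
print(10 * 1000 > 1000 * math.log2(1000))            # True at N = 1000

# Example 2: 10*N^2 < 1000*N*log2(N) exactly while N < 100*log2(N),
# which holds up to roughly N = 996 despite the worse asymptotic growth.
for n in (500, 996, 2000):
    print(n, 10 * n**2 < 1000 * n * math.log2(n))     # True, True, False
```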
17. Developing Models
- A good performance model should
- Explain observations
- Predict future observations
- Abstract unimportant details
- None of the three previous models successfully extrapolates to all of the possible data points
- Full computer-system modeling is too detailed to be practical
18. Developing Models
- Execution time (T) as a function of problem size (N), number of processors (P), number of tasks (U), and other algorithm-specific elements
- T = f(N, P, U, ...)
19. Developing Models
- Execution time is the time from when the first processor begins to operate on the problem until the last processor ends operation
- Note that this definition does not hold for timeshared systems
20. Developing Models (figure)
21. Developing Models (figure)
22. Developing Models (figure)
23. Developing Models
- While not precisely accurate, these formulations give us a way of calculating execution time that is useful for other estimations
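The formulations referred to above are not reproduced here, but they rest on the standard decomposition of execution time into the computation, communication, and idle components listed a few slides below. A minimal sketch of that decomposition (the per-processor times in the example are hypothetical):

```python
# Execution time as the per-processor average of computation, communication,
# and idle time summed over all P processors.
def execution_time(comp, comm, idle):
    """comp, comm, idle: per-processor times, each a list of length P."""
    p = len(comp)
    return (sum(comp) + sum(comm) + sum(idle)) / p

# Hypothetical 4-processor example (times in seconds).
print(execution_time([10, 10, 10, 10], [1, 1, 1, 1], [1, 1, 1, 1]))   # 12.0
```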
24. Developing Models
- Our goal is to develop mathematical models that have acceptable accuracy while being as simple as possible
- To reduce complexity
- Use the idealized multi-computer model
- Use scale analysis to remove operations that are insignificant in scale
- Use empirical studies to fine-tune
25. Components of Execution Time
- Computation time
- Dependent on problem size, processors, processor types, memory available, etc.
- Communication time
- Intraprocess vs. interprocess
- Startup time (t_s) plus transfer time per word (t_w): a message of L words costs t_s + t_w · L
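A minimal helper for this cost model (the parameter values below are hypothetical):

```python
def message_time(ts: float, tw: float, length: int) -> float:
    """Cost of one message of `length` words: startup plus per-word transfer."""
    return ts + tw * length

# Hypothetical parameters: 50 microsecond startup, 1 microsecond per word.
print(message_time(ts=50e-6, tw=1e-6, length=1000))   # 0.00105 s
```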
26. Components of Execution Time (figure)
27. Components of Execution Time
- Idle time
- Minimize idle time by overlapping communication and computation
- Create multiple tasks per processor
28. Calculating Execution Time: An Example
- Finite difference (like the atmosphere model), grid size N × N × Z
- The same computation is performed at each grid point, so the total computation time is T_comp = t_c · N² · Z, where t_c is the average computation time at a single grid point
29. Calculating Execution Time: An Example
- Each task exchanges its boundary grid points with each of two neighboring tasks, so the per-task communication time is T_comm = 2 · (t_s + t_w · L), where L is the number of boundary points sent to each neighbor
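Putting the two pieces together gives a per-task time of roughly T = t_c N² Z / P + 2(t_s + t_w L). A minimal sketch, assuming a 1-D decomposition in which each boundary exchange carries L = N · Z points (the slide's figure is not reproduced, so that slice size is an assumption), with hypothetical parameter values:

```python
def fd_time(n, z, p, tc, ts, tw):
    """Per-task time for the finite-difference example: the N*N*Z grid's
    computation divided across P tasks, plus two boundary exchanges of
    N*Z points each (the assumed slice size)."""
    t_comp = tc * n * n * z / p
    t_comm = 2 * (ts + tw * n * z)
    return t_comp + t_comm

# Hypothetical parameters (seconds): tc = 1e-6, ts = 5e-5, tw = 1e-6.
print(fd_time(n=512, z=10, p=64, tc=1e-6, ts=5e-5, tw=1e-6))   # ~0.051 s
```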
30. Efficiency and Speedup
- Efficiency is the fraction of time that processors spend doing useful work
- It is used to compare algorithms irrespective of problem size
31. Efficiency and Speedup
- These relative calculations are based on single-processor runs
- They do not give a good measure of merit
- They are good for judging scalability
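For reference, the relative measures referred to here are conventionally defined from a single-processor run of the same program: speedup S = T(1)/T(P) and efficiency E = T(1)/(P · T(P)). A small sketch with made-up timings:

```python
def relative_speedup(t1: float, tp: float) -> float:
    """Speedup relative to the same program on one processor."""
    return t1 / tp

def relative_efficiency(t1: float, tp: float, p: int) -> float:
    """Fraction of the P processors' time spent on useful work."""
    return t1 / (p * tp)

# Hypothetical timings: 100 s on one processor, 10 s on 12 processors.
print(relative_speedup(100.0, 10.0))          # 10.0
print(relative_efficiency(100.0, 10.0, 12))   # ~0.83
```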
32. Scalability Analysis
- Efficiency decreases when P, t_s, and t_w increase
- Efficiency increases when N, Z, and t_c increase
- Execution time decreases with increasing P but has a lower bound
- Execution time increases when N, Z, t_c, t_s, and t_w increase
33. Scalability Analysis
- Along with empirical data, models can be used to answer some useful questions
- Does the algorithm meet design requirements on the target parallel computer?
- How adaptable is the algorithm?
- How does the algorithm compare with other algorithms for the same problem?
34. Scalability with Fixed Problem Size
- Determine how execution time and efficiency vary with an increasing number of processors and a fixed problem size
- What is the fastest I can solve problem A on computer X?
35. Scalability with Fixed Problem Size
- With a fixed problem size, efficiency generally decreases as the number of processors grows, and execution time may even increase if the performance model includes a positive power of P (tasks/problem size)
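The third model from the extrapolation slides illustrates the point: with N fixed, its 0.6P² term eventually outweighs the gain from dividing the work, so execution time has a minimum and then rises. A quick check:

```python
# Third model from the extrapolation slides with N fixed at 100:
# time falls as P grows, bottoms out near P ~ 20, then rises again.
n = 100
for p in (4, 12, 24, 48, 96):
    t = (n + n**2) / p + 0.6 * p**2
    print(p, round(t, 1))   # 2534.6, 928.1, 766.4, 1592.8, 5634.8
```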
36. Scalability with Fixed Problem Size (figure)
37. Scalability with Fixed Problem Size (figure)
38. Scalability with Scaled Problem Size
- Consider how the amount of computation performed must scale with the number of processors to keep efficiency constant
- This function of N (the size of our grid) is called an isoefficiency function
- An isoefficiency function of O(P) is scalable; O(P^x) with x > 1 is not
39. Scalability with Scaled Problem Size
- In order to maintain constant efficiency, this relation must continue to hold as the problem size increases
- Single-processor time must increase at the same rate as total parallel time, or equivalently, the amount of essential computation must increase at the same rate as the overhead
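One way to see the equivalence in the last bullet (a sketch using the standard relative-efficiency definition, not an equation taken from the slides): write the total parallel cost P·T_P as essential computation plus overhead,

$$E = \frac{T_1}{P\,T_P}, \qquad P\,T_P = T_1 + T_{\text{overhead}} \quad\Longrightarrow\quad E = \frac{1}{1 + T_{\text{overhead}}/T_1}$$

so E stays constant exactly when the essential computation T_1 grows at the same rate as T_overhead, which in turn means T_1 keeps pace with the total parallel time P·T_P.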
40. Execution Profiles
- If a performance model gives us bad news, how do we find out where to fix our algorithm?
- Create an execution profile
- Track the values for the variables in your performance model and plot them
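A minimal sketch of such a profile computed straight from a model (the two-term model and its parameter values here are hypothetical): tabulate each component's share of predicted execution time as P grows.

```python
# Share of predicted execution time taken by each model component versus P,
# i.e. an execution profile read straight off a (hypothetical) two-term model.
def components(n, p, tc=1e-6, ts=5e-5, tw=1e-6):
    comp = tc * n * n / p          # partitioned computation
    comm = 2 * (ts + tw * n)       # two boundary exchanges per step
    return comp, comm

n = 512
for p in (4, 16, 64, 256):
    comp, comm = components(n, p)
    total = comp + comm
    print(p, f"comp {comp / total:.0%}", f"comm {comm / total:.0%}")
```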
41. Execution Profiles (figure)
42. Experimental Studies
- There are simply too many factors to take into account to produce a completely accurate performance model
- Experimental data must supplement the model
43. Experimental Studies
- Early studies can be used to determine parameter values such as startup time, transfer time per word, etc.
- They can also be used to verify a performance model
44. Experimental Design
- Identify the data we want
- Run for multiple data points and, if possible, different numbers of processors
- The point is to get accurate and reproducible results
45. Causes of Variations in Experimental Results
- Nondeterministic algorithm
- Inaccurate timer
- Startup and shutdown costs
- Interference from other programs
- Contention (network traffic)
- Random resource allocation (OS use of randomness)
46. Fitting Data to Models
- Plot data on a graph
- Example: fitting message time T = t_s + t_w · L to observed data
- t_w is the slope
- t_s is the intercept when L = 0
- Least-squares analysis
- Be imaginative
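A minimal sketch of the least-squares step, with made-up message timings: fit measured times against message length L; the slope estimates t_w and the intercept estimates t_s.

```python
import numpy as np

# Hypothetical measurements: (message length in words, observed time in seconds).
lengths = np.array([0, 256, 512, 1024, 2048, 4096])
times = np.array([52e-6, 310e-6, 567e-6, 1.08e-3, 2.10e-3, 4.15e-3])

# Least-squares fit of T = ts + tw*L: the slope estimates tw, the intercept ts.
tw, ts = np.polyfit(lengths, times, 1)
print(f"tw ~ {tw:.2e} s/word, ts ~ {ts:.2e} s")
```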
47. Evaluating Implementations
- Be prepared for your model and your implementation data to be different
- But they shouldn't be too different
- Get execution profiles for the actual running code
- Compare them with the execution profiles created earlier
48. Evaluating Implementations
- Execution time is higher than expected
- Usually extra overhead you didn't model
- Load imbalances (probably a processor issue)
- Replicated computation (unnecessary sequential processing)
- Tool/algorithm mismatch
- Competition for bandwidth (communication is more complicated than modeled)
49. Evaluating Implementations
- Execution time is lower than expected
- Hardware effects (cache, internal bus speeds, etc.)
- Search anomalies (the parallel version is not comparable with the sequential version)
50. Refining the Communication Cost Model
- Most networks do not have a dedicated communication path linking each pair of nodes
- This can cause rerouting, switching, or blocking of messages
- The number of wires (links) that must be crossed between any two nodes is the distance between them
- The maximum distance between two processors is the diameter of the network
51. Refining the Communication Cost Model
- Competition for bandwidth requires that we scale our communication model by the number of processors needing to send data over the same link concurrently (S)
- This does not account for retransmission costs
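One common way to write this refinement (a sketch consistent with the bullet above, with hypothetical parameter values) is to scale the per-word term by S:

```python
def contended_message_time(ts: float, tw: float, length: int, s: int) -> float:
    """Message cost with the per-word term scaled by S, the number of
    processors sending over the same link concurrently."""
    return ts + tw * s * length

# Hypothetical parameters; the same 1000-word message with and without sharing.
print(contended_message_time(50e-6, 1e-6, 1000, s=1))   # 0.00105 s
print(contended_message_time(50e-6, 1e-6, 1000, s=4))   # 0.00405 s
```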
52. Interconnection Networks
- The number of processors needing to share bandwidth changes based on the type of network
- Crossbar switching
- Bus-based
- Ethernet
- Mesh
- Hypercube
- Multistage Interconnection
53. Crossbar Switching Networks
- Uses O(N²) switches
- Incurs no penalty for concurrent communication
54. Bus-based Networks
- Serial communication over a bus
55. Ethernet Networks
- A form of bus-based network
- Processor must have exclusive use of the bus
56. Mesh Networks
- Form of crossbar switch network
- Processors are linked in chains so that a message
may need to pass through several processors to
reach its destination
57. Mesh Networks (figure)
58. Mesh Networks (figure: collision)
59. Hypercube Networks
- Similar to a mesh network
- The diameter of the network is equal to the dimension of the hypercube (log2 P for P processors)
60. Hypercube Networks (figure)
61. Multistage Interconnection Networks
- Uses switches to connect nodes
- Fewer than O(P²) switches
- Every message (in unidirectional networks) uses the same number of wires, so the distance is constant between any two processors
62. Multistage Interconnection Networks (figure)
63. Input/Output
- Applications with large I/O requirements
- Checkpoints
- Simulation data
- Data analysis
- Out-of-core computation
64. Input/Output
- I/O can be thought of as communication with an I/O device, using the same formulas as for processor-to-processor communication
- Concurrent reads/writes can be managed just like really slow processors
65. Summary
- How to create performance models
- How to use performance models
- Early, to compare algorithms
- Later, to verify algorithm choices and fine-tune the model
- Finally, to compare against data and determine flaws in the algorithm or implementation