Parallel Programming: Performance Issues
Transcript of a PowerPoint presentation (66 slides). Provided by: LeeMcC.
1
Parallel Programming: Performance Issues
  • September 10, 2002

2
What's involved in the performance of a system?
  • Data generation, storage, and transmission
  • Design, implementation, and maintenance costs
  • Reuse, portability, and scalability potential

3
Why worry about quantifying performance?
  • To be able to compare algorithms based on
  • Efficiency
  • Speedup (vs. sequential)
  • Scalability

4
Some ways to model performance
  • Amdahl's Law
  • Extrapolation from Observations
  • Asymptotic Analysis

5
Amdahl's Law
  • Every algorithm has a sequential component that
    will limit the performance of a parallel
    algorithm
  • The law: If the sequential component of an
    algorithm accounts for 1/s of the program's
    execution time, then the maximum possible speedup
    that can be achieved on a parallel computer is s

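The law stated above can be sketched numerically. The function name and sample fraction below are illustrative, not from the slides: with a sequential fraction f = 1/s, speedup on P processors is 1 / (f + (1 - f)/P), which approaches s as P grows.

```python
# Minimal sketch of Amdahl's Law (illustrative names and values).
def amdahl_speedup(f, p):
    """Speedup on p processors when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / p)

# With a 10% sequential component (s = 10), speedup approaches,
# but never exceeds, 10 no matter how many processors are added.
print(amdahl_speedup(0.10, 12))
print(amdahl_speedup(0.10, 1000))
```

Even with a thousand processors the speedup stays just under 10, which is the point of the law.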
6
Amdahl's Law
  • Generally holds when parallelizing sequential
    programs incrementally
  • Does not hold overall
  • Why?
  • The execution time of any sequential section of a
    task does not limit the number of tasks that
    could be run in parallel or how those tasks might
    be further decomposed

7
Extrapolation from Observations
  • "We implemented the algorithm on parallel
    computer X and achieved a speedup of 10.8 on 12
    processors with problem size N = 100."
  • Why is this a problem?
  • ONE DATA POINT!!

8
Extrapolation from Observations
  • T = N + N²/P
  • This algorithm partitions the computationally
    demanding O(N²) component of the algorithm but
    replicates the O(N) component on every processor.
    There are no other sources of overhead.

9
Extrapolation from Observations
  • T = N + N²/P
  • T = (N + N²)/P + 100
  • This algorithm partitions all the computation but
    introduces an additional cost of 100

10
Extrapolation from Observations
  • T = N + N²/P
  • T = (N + N²)/P + 100
  • T = (N + N²)/P + 0.6P²
  • This algorithm also partitions all the
    computation but introduces an additional cost of
    0.6P²

11
Extrapolation from Observations
  • T = N + N²/P
  • T = (N + N²)/P + 100
  • T = (N + N²)/P + 0.6P²
  • All of these algorithms have a speedup of about
    10.8 when P = 12 and N = 100
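The single data point can be checked directly. The sketch below (function names are illustrative) evaluates all three models at P = 12, N = 100, taking the sequential time as T1 = N + N²; each model lands near the measured speedup of 10.8 even though the models diverge badly at other values of P and N.

```python
# Sketch: three models that agree at one data point but nowhere else.
def t1(n):            return n + n**2                     # sequential time
def model_a(n, p):    return n + n**2 / p                 # replicates O(N) part
def model_b(n, p):    return (n + n**2) / p + 100         # fixed overhead
def model_c(n, p):    return (n + n**2) / p + 0.6 * p**2  # overhead grows with P

n, p = 100, 12
for model in (model_a, model_b, model_c):
    print(round(t1(n) / model(n, p), 1))   # all close to 10.8
```

Try P = 1000 with the same models and the predictions separate dramatically, which is why one data point proves nothing.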

12
Extrapolation from Observations
13
Extrapolation from Observations
14
Asymptotic Analysis
  • Asymptotic analysis reveals that the algorithm
    requires O(N log N) time on O(N) processors.
  • In other words, there exists a constant c and a
    minimum problem size N0 such that for all N > N0,
    cost(N) ≤ c · N log N on N processors

15
Asymptotic Analysis
  • What's wrong with this?
  • Works for large values of N and P, but is not
    necessarily applicable to your problem

16
Asymptotic Analysis
  • Example 1: 10N + N log N
  • For problem sizes N < 1024, the 10N term
    dominates the expression and must be considered
  • Example 2: 1000N log N vs. 10N²
  • Standard asymptotic analysis says that the first
    algorithm is better, even though for problem
    sizes N < 996 the second is faster

17
Developing Models
  • A good performance model should
  • Explain observations
  • Predict future observations
  • Abstract unimportant details
  • None of the three previous approaches
    successfully extrapolates to all of the possible
    data points
  • Fully detailed computer-system modeling is too
    complex to be practical

18
Developing Models
  • Execution time (T) as a function of problem size
    (N), processors (P), tasks (U), and other
    algorithm-specific parameters
  • T = f(N, P, U, …)

19
Developing Models
  • Execution time is the time between when the first
    processor begins to operate on the problem until
    the last processor ends operation
  • Note that this definition does not hold for
    timeshared systems

20
Developing Models
21
Developing Models
22
Developing Models
23
Developing Models
  • While not precisely accurate, these formulations
    give us a way of calculating execution time that
    is useful for other estimations

24
Developing Models
  • Our goal is to develop mathematical models that
    have acceptable accuracy while being as simple as
    possible
  • To reduce complexity
  • Use the idealized multi-computer model
  • Use scale analysis to remove insignificant
    operations that are small in scale
  • Use empirical studies to fine tune

25
Components of Execution Time
  • Computation time
  • Dependent on problem size, processors, processor
    types, memory available, etc.
  • Communication time
  • Intraprocess vs. Interprocess
  • Startup time plus transfer time per word
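The "startup time plus transfer time per word" bullet is the standard linear message-cost model. A minimal sketch, with illustrative function name and made-up parameter values:

```python
# Sketch: linear communication-cost model.
# Sending L words costs T_msg = ts + tw * L, where ts is the per-message
# startup time and tw is the transfer time per word.
def comm_time(ts, tw, length):
    return ts + tw * length

# Illustrative (not measured) parameters: 50 us startup, 0.5 us per word.
print(comm_time(50e-6, 0.5e-6, 1000))
```

Note that for short messages the startup term ts dominates, which is why batching many small messages into one large one usually pays off.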

26
Components of Execution Time
27
Components of Execution Time
  • Idle Time
  • Minimize idle time by overlapping communications
    and computation
  • Create multiple tasks per processor

28
Calculating Execution Time: An Example
  • Finite Difference (like the Atmosphere Model)

Grid size N × N × Z. The same computation is
performed at each grid point, so the total
computation time per step is T_comp = tc · N² · Z,
where tc is the average computation time at a
single grid point.
29
Calculating Execution Time: An Example
Each task exchanges its boundary data points with
two neighboring tasks.
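Putting the two slides together gives a per-step time model. The sketch below assumes a 1-D decomposition of the grid into P slices, with each processor exchanging two boundary faces of N·Z points per step; the decomposition and function name are assumptions, not stated on the slides.

```python
# Hedged sketch of a per-step time model for the N x N x Z grid,
# assuming a 1-D decomposition into P slices (an assumption).
def fd_step_time(n, z, p, tc, ts, tw):
    compute = tc * n * n * z / p           # this processor's grid points
    communicate = 2 * (ts + tw * n * z)    # two neighbor face exchanges
    return compute + communicate
```

With this shape, computation shrinks as 1/P while communication stays constant, so efficiency must fall as P grows at fixed N, which is what the scalability slides below observe.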
30
Efficiency and Speedup
  • Efficiency is the fraction of time that
    processors spend doing useful work
  • This is used to compare algorithms irrespective
    of problem size

31
Efficiency and Speedup
  • These relative calculations are based on single
    processor runs
  • They do not give a good measure of merit
  • Are good for judging scalability

32
Scalability Analysis
  • Efficiency decreases as P, ts, and tw increase
  • Efficiency increases as N, Z, and tc increase
  • Execution time decreases with increasing P but
    has a lower bound
  • Execution time increases as N, Z, tc, ts,
    and tw increase

33
Scalability Analysis
  • Along with empirical data, models can be used to
    answer some useful questions
  • Does the algorithm meet design requirements on
    the target parallel computer?
  • How adaptable is the algorithm?
  • How does the algorithm compare with other
    algorithms for the same problem?

34
Scalability with Fixed Problem Size
  • Determine how execution time and efficiency vary
    with increasing number of processors and fixed
    problem size
  • How fast can I solve problem A on computer X?

35
Scalability with Fixed Problem Size
  • While efficiency generally decreases as the
    number of processors grows, execution time may
    actually increase with P if the performance model
    contains a positive power of P

36
Scalability with Fixed Problem Size
37
Scalability with Fixed Problem Size
38
Scalability with Scaled Problem Size
  • Consider how the amount of computation performed
    must scale with the number of processors to keep
    efficiency constant
  • This function of N (the size of our grid) is
    called an isoefficiency function
  • An isoefficiency of O(P) is highly scalable;
    O(P^x) with x > 1 is not

39
Scalability with Scaled Problem Size
  • In order to maintain constant efficiency, this
    relation must hold with increasing problem size
  • Single-processor time must increase at the same
    rate as total parallel time or the amount of
    essential computation must increase at the same
    rate as overhead
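A toy numeric sketch of isoefficiency (the model and names are illustrative, not from the slides): suppose T1 = N and Tp = N/P + P, so the total parallel overhead is P². Then E = N / (N + P²), and holding E constant forces N to grow like P².

```python
# Hedged sketch: isoefficiency for a toy model with total overhead P**2.
def efficiency(n, p):
    return n / (n + p**2)

# Scaling the problem as N = P**2 holds efficiency fixed at 0.5.
for p in (10, 20, 40):
    print(p, efficiency(p**2, p))
```

Here the essential computation (N) grows at the same rate as the overhead (P²), which is exactly the condition the slide states.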

40
Execution Profiles
  • If a performance model gives us bad news, how do
    we find out where to fix our algorithm?
  • Create an execution profile
  • Track the values for the variables in your
    performance model and plot them

41
Execution Profiles
42
Experimental Studies
  • There are simply too many factors to take into
    account to produce a completely accurate
    performance model
  • Experimental data must supplement the model

43
Experimental Studies
  • Early studies can be used to determine parameter
    values such as startup time, transfer time per
    word, etc.
  • Also can be used to verify a performance model

44
Experimental Design
  • Identify the data we want
  • Run for multiple data points and, if possible,
    different numbers of processors
  • The point is to get accurate and reproducible
    results

45
Causes of Variations in Experimental Results
  • Nondeterministic algorithm
  • Inaccurate timer
  • Startup and shutdown costs
  • Interference from other programs
  • Contention (network traffic)
  • Random resource allocation (OS use of randomness)

46
Fitting Data to Models
  • Plot data on a graph
  • Example: plot message time against message
    length L
  • tw is the slope
  • ts is the intercept at L = 0
  • Least-squares analysis
  • Be imaginative
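The slope/intercept fit mentioned above can be sketched with a hand-rolled least-squares line (the variable names and the synthetic "measurements" are illustrative; real data would come from timing runs):

```python
# Sketch: recover ts (intercept) and tw (slope) from (L, time) pairs
# by ordinary least squares on the line time = ts + tw * L.
true_ts, true_tw = 50e-6, 0.5e-6                  # pretend "unknown" values
lengths = [100, 200, 400, 800, 1600]
times = [true_ts + true_tw * L for L in lengths]  # synthetic, noise-free data

n = len(lengths)
mx = sum(lengths) / n
my = sum(times) / n
tw_fit = sum((x - mx) * (y - my) for x, y in zip(lengths, times)) \
         / sum((x - mx) ** 2 for x in lengths)    # slope
ts_fit = my - tw_fit * mx                         # intercept at L = 0
print(ts_fit, tw_fit)
```

With real, noisy measurements the fit would only approximate ts and tw, which is where the "be imaginative" advice (outlier inspection, repeated runs) comes in.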

47
Evaluating Implementations
  • Be prepared for your model and your
    implementation data to be different
  • But they shouldn't be too different
  • Get execution profiles for actual running code
  • Compare that with the execution profiles created
    earlier

48
Evaluating Implementations
  • Execution time is higher than expected
  • Usually extra overhead you didn't model
  • Load imbalances (probably processor issue)
  • Replicated computation (unnecessary sequential
    processing)
  • Tool/algorithm mismatch
  • Competition for bandwidth (communication is more
    complicated than modeled)

49
Evaluating Implementations
  • Execution time is lower than expected
  • Hardware effects (cache, internal bus speeds,
    etc.)
  • Search anomalies (the parallel version is not
    comparable with the sequential version)

50
Refining the Communication Cost Model
  • Most networks do not have a dedicated
    communication path linking each pair of nodes
  • This can cause rerouting, switching, or blocking
    of messages
  • The number of wires (links) that must be crossed
    between any two nodes is the distance between
    them
  • The maximum distance between two processors is
    the diameter of the network

51
Refining the Communication Cost Model
  • Competition for bandwidth requires that we scale
    our communication model by the number of
    processors needing to send data over the same
    link concurrently (S)
  • Does not account for retransmit costs
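The scaled model described above can be sketched as a one-line extension of the earlier linear cost model (function name is illustrative): when S processors must send over the same link concurrently, the per-word term is effectively multiplied by S.

```python
# Sketch: communication cost under bandwidth competition.
# S processors sharing one link multiply the per-word transfer cost;
# retransmission costs are not modeled, as the slide notes.
def comm_time_shared(ts, tw, length, s):
    return ts + tw * s * length

# Four senders sharing a link quadruple the transfer term:
print(comm_time_shared(50e-6, 0.5e-6, 1000, 4))
```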

52
Interconnection Networks
  • The number of processors needing to share
    bandwidth changes based on the type of network
  • Crossbar switching
  • Bus-based
  • Ethernet
  • Mesh
  • Hypercube
  • Multistage Interconnection

53
Crossbar Switching Networks
  • Uses O(N²) switches
  • Incurs no penalty for concurrent communication

54
Bus-based Networks
  • Serial communication over a bus

55
Ethernet Networks
  • A form of bus-based network
  • Processor must have exclusive use of the bus

56
Mesh Networks
  • Form of crossbar switch network
  • Processors are linked in chains so that a message
    may need to pass through several processors to
    reach its destination

57
Mesh Networks
58
Mesh Networks
Collision
59
Hypercube Networks
  • Similar to Mesh network
  • Diameter of the network is equal to the
    dimension of the hypercube (log2 P for P nodes)
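The diameter claim above follows from how hypercube nodes are labeled: in a d-dimensional hypercube, nodes carry d-bit labels and two nodes are linked iff their labels differ in exactly one bit, so the distance between two nodes is the Hamming distance of their labels. A small sketch (illustrative function name):

```python
# Sketch: routing distance in a hypercube is the Hamming distance
# between node labels (count of differing bits).
def hypercube_distance(a, b):
    return bin(a ^ b).count("1")

# In a 4-dimensional (16-node) hypercube, the farthest pair differs
# in all 4 bits, so the diameter is 4 = log2(16).
print(hypercube_distance(0b0000, 0b1111))
```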

60
Hypercube Networks
61
Multistage Interconnection Networks
  • Uses switches to connect nodes
  • Fewer than O(P²) switches
  • Every message (in unidirectional networks) uses
    the same number of wires so that distance is
    constant between any two processors

62
Multistage Interconnection Networks
63
Input/Output
  • Applications with large I/O requirements
  • Checkpoints
  • Simulation data
  • Data analysis
  • Out-of-core computation

64
Input/Output
  • Can be thought of as communication with an I/O
    device, using the same formulas used for
    processor-to-processor communication
  • Concurrent reads and writes can be modeled as
    communication with very slow processors

65
Summary
  • How to create performance models
  • How to use performance models
  • Early to compare algorithms
  • Later to verify algorithm choices and fine-tune
    the model
  • Finally to compare against data and determine
    flaws in the algorithm or implementation