Parallel Programming: Performance Issues
Transcript of a PowerPoint presentation (66 slides). Provided by: LeeMcC.
1
Parallel Programming: Performance Issues
  • September 10, 2002

2
What's involved in the performance of a system?
  • Data generation, storage, and transmission
  • Design, implementation, and maintenance costs
  • Reuse, portability, and scalability potential

3
Why worry about quantifying performance?
  • To be able to compare algorithms based on
  • Efficiency
  • Speedup (vs. sequential)
  • Scalability

4
Some ways to model performance
  • Amdahl's Law
  • Extrapolation from Observations
  • Asymptotic Analysis

5
Amdahl's Law
  • Every algorithm has a sequential component that
    will limit the performance of a parallel
    algorithm
  • The law: If the sequential component of an
    algorithm accounts for 1/s of the program's
    execution time, then the maximum possible speedup
    that can be achieved on a parallel computer is s

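The law stated above can be sketched numerically. The function name and sample fraction below are illustrative, not from the slides: with a sequential fraction f = 1/s, speedup on P processors is 1 / (f + (1 - f)/P), which approaches s as P grows.

```python
# Minimal sketch of Amdahl's Law (illustrative names and values).
def amdahl_speedup(f, p):
    """Speedup on p processors when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / p)

# With a 10% sequential component (s = 10), speedup approaches,
# but never exceeds, 10 no matter how many processors are added.
print(amdahl_speedup(0.10, 12))
print(amdahl_speedup(0.10, 1000))
```

Even with a thousand processors the speedup stays just under 10, which is the point of the law.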
6
Amdahl's Law
  • Generally holds when parallelizing sequential
    programs incrementally
  • Does not hold overall
  • Why?
  • The execution time of any sequential section of a
    task does not limit the number of tasks that
    could be run in parallel or how those tasks might
    be further decomposed

7
Extrapolation from Observations
  • "We implemented the algorithm on parallel
    computer X and achieved a speedup of 10.8 on 12
    processors with problem size N = 100."
  • Why is this a problem?
  • ONE DATA POINT!!

8
Extrapolation from Observations
  • T = N + N²/P
  • This algorithm partitions the computationally
    demanding O(N²) component of the algorithm but
    replicates the O(N) component on every processor.
    There are no other sources of overhead.

9
Extrapolation from Observations
  • T = N + N²/P
  • T = (N + N²)/P + 100
  • This algorithm partitions all the computation but
    introduces an additional cost of 100

10
Extrapolation from Observations
  • T = N + N²/P
  • T = (N + N²)/P + 100
  • T = (N + N²)/P + 0.6P²
  • This algorithm also partitions all the
    computation but introduces an additional cost of
    0.6P²

11
Extrapolation from Observations
  • T = N + N²/P
  • T = (N + N²)/P + 100
  • T = (N + N²)/P + 0.6P²
  • All of these algorithms have a speedup of about
    10.8 when P = 12 and N = 100
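The single data point can be checked directly. The sketch below (function names are illustrative) evaluates all three models at P = 12, N = 100, taking the sequential time as T1 = N + N²; each model lands near the measured speedup of 10.8 even though the models diverge badly at other values of P and N.

```python
# Sketch: three models that agree at one data point but nowhere else.
def t1(n):            return n + n**2                     # sequential time
def model_a(n, p):    return n + n**2 / p                 # replicates O(N) part
def model_b(n, p):    return (n + n**2) / p + 100         # fixed overhead
def model_c(n, p):    return (n + n**2) / p + 0.6 * p**2  # overhead grows with P

n, p = 100, 12
for model in (model_a, model_b, model_c):
    print(round(t1(n) / model(n, p), 1))   # all close to 10.8
```

Try P = 1000 with the same models and the predictions separate dramatically, which is why one data point proves nothing.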

12
Extrapolation from Observations
13
Extrapolation from Observations
14
Asymptotic Analysis
  • Asymptotic analysis reveals that the algorithm
    requires O(N log N) time on O(N) processors.
  • In other words, there exists a constant c and a
    minimum problem size N0 such that for all N > N0,
    cost(N) ≤ c · N log N on N processors

15
Asymptotic Analysis
  • What's wrong with this?
  • Works for large values of N and P, but is not
    necessarily applicable to your problem

16
Asymptotic Analysis
  • Example 1: 10N + N log N
  • For problem sizes N < 1024, the 10N term
    dominates the expression and must be considered
  • Example 2: 1000N log N vs. 10N²
  • Standard asymptotic analysis says that the first
    algorithm is better, even though for problem
    sizes N < 996 the second is faster

17
Developing Models
  • A good performance model should
  • Explain observations
  • Predict future observations
  • Abstract unimportant details
  • None of the three previous approaches
    successfully extrapolates to all of the possible
    data points
  • Fully detailed computer-system modeling is too
    complex to be practical

18
Developing Models
  • Execution time (T) as a function of problem size
    (N), processors (P), tasks (U), and other
    algorithm-specific parameters
  • T = f(N, P, U, …)

19
Developing Models
  • Execution time is the time between when the first
    processor begins to operate on the problem until
    the last processor ends operation
  • Note that this definition does not hold for
    timeshared systems

20
Developing Models
21
Developing Models
22
Developing Models
23
Developing Models
  • While not precisely accurate, these formulations
    give us a way of calculating execution time that
    is useful for other estimations

24
Developing Models
  • Our goal is to develop mathematical models that
    have acceptable accuracy while being as simple as
    possible
  • To reduce complexity
  • Use the idealized multi-computer model
  • Use scale analysis to remove insignificant
    operations that are small in scale
  • Use empirical studies to fine tune

25
Components of Execution Time
  • Computation time
  • Dependent on problem size, processors, processor
    types, memory available, etc.
  • Communication time
  • Intraprocess vs. Interprocess
  • Startup time plus transfer time per word
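The "startup time plus transfer time per word" bullet is the standard linear message-cost model. A minimal sketch, with illustrative function name and made-up parameter values:

```python
# Sketch: linear communication-cost model.
# Sending L words costs T_msg = ts + tw * L, where ts is the per-message
# startup time and tw is the transfer time per word.
def comm_time(ts, tw, length):
    return ts + tw * length

# Illustrative (not measured) parameters: 50 us startup, 0.5 us per word.
print(comm_time(50e-6, 0.5e-6, 1000))
```

Note that for short messages the startup term ts dominates, which is why batching many small messages into one large one usually pays off.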

26
Components of Execution Time
27
Components of Execution Time
  • Idle Time
  • Minimize idle time by overlapping communications
    and computation
  • Create multiple tasks per processor

28
Calculating Execution Time: An Example
  • Finite Difference (like the Atmosphere Model)

Grid size N × N × Z. The same computation is
performed at each grid point, so the total
computation time per step is T_comp = tc · N² · Z,
where tc is the average computation time at a
single grid point.
29
Calculating Execution Time: An Example
Each task exchanges its boundary data points with
two neighboring tasks.
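Putting the two slides together gives a per-step time model. The sketch below assumes a 1-D decomposition of the grid into P slices, with each processor exchanging two boundary faces of N·Z points per step; the decomposition and function name are assumptions, not stated on the slides.

```python
# Hedged sketch of a per-step time model for the N x N x Z grid,
# assuming a 1-D decomposition into P slices (an assumption).
def fd_step_time(n, z, p, tc, ts, tw):
    compute = tc * n * n * z / p           # this processor's grid points
    communicate = 2 * (ts + tw * n * z)    # two neighbor face exchanges
    return compute + communicate
```

With this shape, computation shrinks as 1/P while communication stays constant, so efficiency must fall as P grows at fixed N, which is what the scalability slides below observe.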
30
Efficiency and Speedup
  • Efficiency is the fraction of time that
    processors spend doing useful work
  • This is used to compare algorithms irrespective
    of problem size

31
Efficiency and Speedup
  • These relative calculations are based on single
    processor runs
  • They do not give a good measure of merit
  • Are good for judging scalability

32
Scalability Analysis
  • Efficiency decreases as P, ts, and tw increase
  • Efficiency increases as N, Z, and tc increase
  • Execution time decreases with increasing P but
    has a lower bound
  • Execution time increases as N, Z, tc, ts,
    and tw increase

33
Scalability Analysis
  • Along with empirical data, models can be used to
    answer some useful questions
  • Does the algorithm meet design requirements on
    the target parallel computer?
  • How adaptable is the algorithm?
  • How does the algorithm compare with other
    algorithms for the same problem?

34
Scalability with Fixed Problem Size
  • Determine how execution time and efficiency vary
    with increasing number of processors and fixed
    problem size
  • How fast can I solve problem A on computer X?

35
Scalability with Fixed Problem Size
  • While efficiency generally decreases as the
    number of processors grows, execution time may
    actually increase with P if the performance model
    contains a positive power of P

36
Scalability with Fixed Problem Size
37
Scalability with Fixed Problem Size
38
Scalability with Scaled Problem Size
  • Consider how the amount of computation performed
    must scale with the number of processors to keep
    efficiency constant
  • This function of N (the size of our grid) is
    called an isoefficiency function
  • An isoefficiency of O(P) is highly scalable;
    O(P^x) with x > 1 is not

39
Scalability with Scaled Problem Size
  • In order to maintain constant efficiency, this
    relation must hold with increasing problem size
  • Single-processor time must increase at the same
    rate as total parallel time or the amount of
    essential computation must increase at the same
    rate as overhead
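A toy numeric sketch of isoefficiency (the model and names are illustrative, not from the slides): suppose T1 = N and Tp = N/P + P, so the total parallel overhead is P². Then E = N / (N + P²), and holding E constant forces N to grow like P².

```python
# Hedged sketch: isoefficiency for a toy model with total overhead P**2.
def efficiency(n, p):
    return n / (n + p**2)

# Scaling the problem as N = P**2 holds efficiency fixed at 0.5.
for p in (10, 20, 40):
    print(p, efficiency(p**2, p))
```

Here the essential computation (N) grows at the same rate as the overhead (P²), which is exactly the condition the slide states.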

40
Execution Profiles
  • If a performance model gives us bad news, how do
    we find out where to fix our algorithm?
  • Create an execution profile
  • Track the values for the variables in your
    performance model and plot them

41
Execution Profiles
42
Experimental Studies
  • There are simply too many factors to take into
    account to produce a completely accurate
    performance model
  • Experimental data must supplement the model

43
Experimental Studies
  • Early studies can be used to determine parameter
    values such as startup time, transfer time per
    word, etc.
  • Also can be used to verify a performance model

44
Experimental Design
  • Identify the data we want
  • Run for multiple data points and, if possible,
    different numbers of processors
  • The point is to get accurate and reproducible
    results

45
Causes of Variations in Experimental Results
  • Nondeterministic algorithm
  • Inaccurate timer
  • Startup and shutdown costs
  • Interference from other programs
  • Contention (network traffic)
  • Random resource allocation (OS use of randomness)

46
Fitting Data to Models
  • Plot data on a graph
  • Example: plot message time against message
    length L
  • tw is the slope
  • ts is the intercept at L = 0
  • Least-squares analysis
  • Be imaginative
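The slope/intercept fit mentioned above can be sketched with a hand-rolled least-squares line (the variable names and the synthetic "measurements" are illustrative; real data would come from timing runs):

```python
# Sketch: recover ts (intercept) and tw (slope) from (L, time) pairs
# by ordinary least squares on the line time = ts + tw * L.
true_ts, true_tw = 50e-6, 0.5e-6                  # pretend "unknown" values
lengths = [100, 200, 400, 800, 1600]
times = [true_ts + true_tw * L for L in lengths]  # synthetic, noise-free data

n = len(lengths)
mx = sum(lengths) / n
my = sum(times) / n
tw_fit = sum((x - mx) * (y - my) for x, y in zip(lengths, times)) \
         / sum((x - mx) ** 2 for x in lengths)    # slope
ts_fit = my - tw_fit * mx                         # intercept at L = 0
print(ts_fit, tw_fit)
```

With real, noisy measurements the fit would only approximate ts and tw, which is where the "be imaginative" advice (outlier inspection, repeated runs) comes in.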

47
Evaluating Implementations
  • Be prepared for your model and your
    implementation data to be different
  • But they shouldn't be too different
  • Get execution profiles for actual running code
  • Compare that with the execution profiles created
    earlier

48
Evaluating Implementations
  • Execution time is higher than expected
  • Usually extra overhead you didn't model
  • Load imbalances (probably processor issue)
  • Replicated computation (unnecessary sequential
    processing)
  • Tool/algorithm mismatch
  • Competition for bandwidth (communication is more
    complicated than modeled)

49
Evaluating Implementations
  • Execution time is lower than expected
  • Hardware effects (cache, internal bus speeds,
    etc.)
  • Search anomalies (the parallel version is not
    comparable with the sequential version)

50
Refining the Communication Cost Model
  • Most networks do not have a dedicated
    communication path linking each pair of nodes
  • This can cause rerouting, switching, or blocking
    of messages
  • The number of wires (links) that must be crossed
    between any two nodes is the distance between
    them
  • The maximum distance between two processors is
    the diameter of the network

51
Refining the Communication Cost Model
  • Competition for bandwidth requires that we scale
    our communication model by the number of
    processors needing to send data over the same
    link concurrently (S)
  • Does not account for retransmit costs
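The scaled model described above can be sketched as a one-line extension of the earlier linear cost model (function name is illustrative): when S processors must send over the same link concurrently, the per-word term is effectively multiplied by S.

```python
# Sketch: communication cost under bandwidth competition.
# S processors sharing one link multiply the per-word transfer cost;
# retransmission costs are not modeled, as the slide notes.
def comm_time_shared(ts, tw, length, s):
    return ts + tw * s * length

# Four senders sharing a link quadruple the transfer term:
print(comm_time_shared(50e-6, 0.5e-6, 1000, 4))
```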

52
Interconnection Networks
  • The number of processors needing to share
    bandwidth changes based on the type of network
  • Crossbar switching
  • Bus-based
  • Ethernet
  • Mesh
  • Hypercube
  • Multistage Interconnection

53
Crossbar Switching Networks
  • Uses O(N²) switches
  • Incurs no penalty for concurrent communication

54
Bus-based Networks
  • Serial communication over a bus

55
Ethernet Networks
  • A form of bus-based network
  • Processor must have exclusive use of the bus

56
Mesh Networks
  • Form of crossbar switch network
  • Processors are linked in chains so that a message
    may need to pass through several processors to
    reach its destination

57
Mesh Networks
58
Mesh Networks
Collision
59
Hypercube Networks
  • Similar to Mesh network
  • Diameter of the network is equal to the
    dimension of the hypercube (log2 P for P nodes)
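The diameter claim above follows from how hypercube nodes are labeled: in a d-dimensional hypercube, nodes carry d-bit labels and two nodes are linked iff their labels differ in exactly one bit, so the distance between two nodes is the Hamming distance of their labels. A small sketch (illustrative function name):

```python
# Sketch: routing distance in a hypercube is the Hamming distance
# between node labels (count of differing bits).
def hypercube_distance(a, b):
    return bin(a ^ b).count("1")

# In a 4-dimensional (16-node) hypercube, the farthest pair differs
# in all 4 bits, so the diameter is 4 = log2(16).
print(hypercube_distance(0b0000, 0b1111))
```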

60
Hypercube Networks
61
Multistage Interconnection Networks
  • Uses switches to connect nodes
  • Fewer than O(P²) switches
  • Every message (in unidirectional networks) uses
    the same number of wires so that distance is
    constant between any two processors

62
Multistage Interconnection Networks
63
Input/Output
  • Applications with large I/O requirements
  • Checkpoints
  • Simulation data
  • Data analysis
  • Out-of-core computation

64
Input/Output
  • Can be thought of as communication with an I/O
    device, using the same formulas used for
    processor-to-processor communication
  • Concurrent reads and writes can be modeled as
    communication with very slow processors

65
Summary
  • How to create performance models
  • How to use performance models
  • Early to compare algorithms
  • Later to verify algorithm choices and fine-tune
    the model
  • Finally to compare against data and determine
    flaws in the algorithm or implementation