Lecture 4 Analytical Modeling of Parallel Programs

1
Lecture 4 Analytical Modeling of Parallel Programs
  • Parallel Computing
  • Fall 2008

2
Performance Metrics for Parallel Systems
  • Number of processing elements p
  • Execution Time
  • Parallel runtime: the time that elapses from the
    moment a parallel computation starts to the
    moment the last processing element finishes
    execution.
  • Ts: serial runtime
  • Tp: parallel runtime
  • Total Parallel Overhead T0
  • The total time collectively spent by all the
    processing elements minus the running time
    required by the fastest known sequential
    algorithm for solving the same problem on a
    single processing element.
  • T0 = pTp - Ts (see the sketch below)
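
As a small illustration of T0 = pTp - Ts, here is a minimal Python
sketch (not part of the original slides); the runtimes and processor
count are made-up numbers for illustration only.

    def total_overhead(p, tp, ts):
        """Total parallel overhead: T0 = p * Tp - Ts."""
        return p * tp - ts

    # Hypothetical runtimes (seconds) and processor count.
    ts = 100.0   # serial runtime Ts
    tp = 30.0    # parallel runtime Tp on p processing elements
    p = 4
    print(total_overhead(p, tp, ts))  # 4 * 30 - 100 = 20 seconds of overhead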

3
Performance Metrics for Parallel Systems
  • Speedup S
  • The ratio of the serial runtime of the best
    sequential algorithm for solving a problem to the
    time taken by the parallel algorithm to solve the
    same problem on p processing elements.
  • S = Ts(best) / Tp
  • Example: adding n numbers: Tp = Θ(log n),
    Ts = Θ(n), S = Θ(n / log n) (see the sketch
    below)
  • Theoretically, speedup can never exceed the
    number of processing elements p (S ≤ p).
  • Proof: Assume the speedup is greater than p.
    Then each processing element spends less than
    Ts/p time solving the problem, so a single
    processing element could emulate the p
    processing elements and solve the problem in
    fewer than Ts units of time. This is a
    contradiction because speedup, by definition, is
    computed with respect to the best sequential
    algorithm.
  • Superlinear speedup: in practice, a speedup
    greater than p is sometimes observed. This
    usually happens when the work performed by the
    serial algorithm is greater than that performed
    by its parallel formulation, or when hardware
    features put the serial implementation at a
    disadvantage.
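
As a quick numerical illustration of S = Ts/Tp for the
adding-n-numbers example, a minimal Python sketch follows; the
unit-time model (Ts = n, Tp = log2 n) is an idealization of the Θ
bounds above, not an exact timing formula.

    import math

    def speedup(ts, tp):
        """Speedup: serial runtime of the best algorithm over parallel runtime."""
        return ts / tp

    # Idealized adding-n-numbers example on n processing elements:
    # Ts ~ n unit-time additions, Tp ~ log2(n) parallel steps.
    n = 1 << 20
    print(speedup(n, math.log2(n)))   # ~ n / log2(n); large, but still below p = n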

4
Example for Superlinear speedup
  • Superlinear speedup
  • Example 1: Superlinear effects from caches. With
    a problem instance of size A and a 64 KB cache,
    the cache hit rate is 80%. Assume a cache latency
    of 2 ns and a DRAM latency of 100 ns; the average
    memory access time is then
    2 × 0.8 + 100 × 0.2 = 21.6 ns. If the computation
    is memory bound and performs one FLOP per memory
    access, this corresponds to a processing rate of
    46.3 MFLOPS. With a problem instance of size A/2
    per processor and the same 64 KB cache, the cache
    hit rate is higher, i.e., 90%; 8% of the data
    comes from local DRAM and the remaining 2% comes
    from remote DRAM with a latency of 400 ns, so the
    average memory access time is
    2 × 0.9 + 100 × 0.08 + 400 × 0.02 = 17.8 ns. The
    corresponding execution rate at each processor is
    56.18 MFLOPS, and for two processors the total
    processing rate is 112.36 MFLOPS. The speedup is
    then 112.36 / 46.3 = 2.43! (This arithmetic is
    checked in the sketch below.)
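
The arithmetic in this example can be checked directly. Below is a
minimal Python sketch using the latencies and hit rates stated above
(2 ns cache, 100 ns local DRAM, 400 ns remote DRAM); it only
reproduces the average-access-time and MFLOPS calculations.

    # Average memory access time (ns), one FLOP per memory access.
    t_one = 2 * 0.80 + 100 * 0.20                # 21.6 ns on one processor
    t_two = 2 * 0.90 + 100 * 0.08 + 400 * 0.02   # 17.8 ns per processor on two

    mflops_one = 1000.0 / t_one        # ~46.3 MFLOPS
    mflops_two = 2 * (1000.0 / t_two)  # ~112.36 MFLOPS across both processors
    print(mflops_one, mflops_two, mflops_two / mflops_one)  # speedup ~2.43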

5
Example for Superlinear speedup
  • Superlinear speedup
  • Example 2: Superlinear effects due to exploratory
    decomposition. Explore the leaf nodes of an
    unstructured tree. Each leaf has a label
    associated with it, and the objective is to find
    a node with a specified label, say S. The
    solution node is the rightmost leaf in the tree.
    A serial formulation of this problem based on
    depth-first tree traversal explores the entire
    tree, i.e. all 14 nodes, taking 14 units of time.
    Now consider a parallel formulation in which the
    left subtree is explored by processing element 0
    and the right subtree by processing element 1.
    The total work done by the parallel algorithm is
    only 9 nodes, and the corresponding parallel time
    is 5 units. The speedup is then 14/5 = 2.8.

6
Performance Metrics for Parallel Systems (cont.)
  • Efficiency E
  • The ratio of speedup to the number of processing
    elements.
  • E = S / p
  • A measure of the fraction of time for which a
    processing element is usefully employed.
  • Example: adding n numbers on n processing
    elements: Tp = Θ(log n), Ts = Θ(n),
    S = Θ(n / log n), E = Θ(1 / log n)
  • Cost (also called work or processor-time product)
    W
  • The product of parallel runtime and the number of
    processing elements used.
  • W = Tp × p
  • Example: adding n numbers on n processing
    elements: W = Θ(n log n) (see the sketch below).
  • Cost-optimal: the cost of solving a problem on a
    parallel computer has the same asymptotic growth
    (in Θ terms), as a function of the input size, as
    the fastest-known sequential algorithm on a
    single processing element.
  • Problem size W
  • The number of basic computation steps in the best
    sequential algorithm to solve the problem on a
    single processing element.
  • W = Ts of the fastest known algorithm to solve
    the problem on a sequential computer.
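
A short Python sketch of E = S/p and W = p * Tp for the
adding-n-numbers example, under the same idealized Ts = n,
Tp = log2 n assumption as before (not part of the original slides).

    import math

    def efficiency(s, p):
        """Efficiency: E = S / p."""
        return s / p

    def cost(tp, p):
        """Cost (work, processor-time product): W = p * Tp."""
        return p * tp

    n = 1 << 20                     # add n numbers on p = n processing elements
    ts, tp, p = n, math.log2(n), n
    s = ts / tp
    print(efficiency(s, p))         # ~ 1 / log2(n): shrinks as n grows
    print(cost(tp, p))              # ~ n * log2(n), worse than Θ(n): not cost-optimal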

7
Parallel vs Sequential Computing: Amdahl's Law
  • Theorem 0.1 (Amdahl's Law) Let f, 0 ≤ f ≤ 1, be
    the fraction of a computation that is inherently
    sequential. Then the maximum obtainable speedup S
    on p processors is S ≤ 1 / (f + (1 - f)/p).
  • Proof. Let T be the sequential running time for
    the computation. fT is the time spent on the
    inherently sequential part of the program. On p
    processors the remaining computation, if fully
    parallelizable, would achieve a running time of
    at most (1 - f)T/p. The running time of the
    parallel program on p processors is thus the sum
    of the execution times of the sequential and
    parallel components, that is, fT + (1 - f)T/p.
    The maximum obtainable speedup is therefore
    S ≤ T / (fT + (1 - f)T/p) = 1 / (f + (1 - f)/p),
    and the result is proven (a numerical sketch
    follows below).
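
Amdahl's bound written as a small Python function; the f = 0.1 loop
below is just an illustration of how the speedup approaches 1/f as p
grows, not data from the slides.

    def amdahl_speedup(f, p):
        """Maximum speedup with inherently sequential fraction f on p processors."""
        return 1.0 / (f + (1.0 - f) / p)

    for p in (2, 8, 64, 1024):
        print(p, amdahl_speedup(0.10, p))   # approaches 1/f = 10 as p -> infinity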

8
Amdahl's Law
  • Amdahl used this observation to advocate building
    even more powerful sequential machines, since one
    cannot gain much by using parallel machines. For
    example, if f = 10%, then S ≤ 10 as p → ∞. The
    underlying assumption in Amdahl's Law is that the
    sequential component of a program is a constant
    fraction of the whole program. In many instances,
    as problem size increases, the fraction of the
    computation that is inherently sequential
    decreases. In many cases, even a speedup of 10 is
    quite significant by itself.
  • In addition, Amdahl's Law is based on the premise
    that parallel computing always tries to minimize
    parallel time. In some cases a parallel computer
    is used instead to increase the problem size that
    can be solved in a fixed amount of time. In
    weather prediction, for example, this would
    increase the accuracy of, say, a three-day
    forecast, or would allow a more accurate five-day
    forecast.

9
Parallel vs Sequential Computing: Gustafson's Law
  • Theorem 0.2 (Gustafson's Law) Let the execution
    time of a parallel algorithm consist of a
    sequential segment fT and a parallel segment
    (1 - f)T, where the sequential segment is
    constant. The scaled speedup of the algorithm is
    then S = (fT + (1 - f)Tp) / (fT + (1 - f)T)
    = f + (1 - f)p.
  • For f = 0.05 and p = 20, we get S = 19.05,
    whereas Amdahl's Law gives S ≈ 10.26 (see the
    sketch below).

                      1 processor        p processors
    sequential part   fT                 fT
    parallel part     (1 - f)Tp          (1 - f)T
    total time        T(f + (1 - f)p)    T

  • Amdahl's Law assumes that the problem size is
    fixed when it deals with scalability; Gustafson's
    Law assumes that the running time is fixed.
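
The f = 0.05 comparison above is consistent with p = 20 processors
(an assumption, since p is not stated on the slide); a minimal Python
sketch of both laws:

    def amdahl_speedup(f, p):
        """Fixed problem size: S = 1 / (f + (1 - f)/p)."""
        return 1.0 / (f + (1.0 - f) / p)

    def gustafson_scaled_speedup(f, p):
        """Fixed running time: S = f + (1 - f) * p."""
        return f + (1.0 - f) * p

    f, p = 0.05, 20   # p = 20 assumed; it reproduces the numbers on the slide
    print(gustafson_scaled_speedup(f, p))   # 19.05
    print(amdahl_speedup(f, p))             # ~10.26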

10
Brent's Scheduling Principle (Emulation)
  • Suppose we have an efficient parallel algorithm
    with unlimited parallelism, i.e. an algorithm
    that runs on zillions of processors. In practice
    zillions of processors may not be available;
    suppose we have only p processors. A question
    that arises is what we can do to run the
    efficient zillion-processor algorithm on our
    limited machine.
  • One answer is emulation: simulate the
    zillion-processor algorithm on the p-processor
    machine.
  • Theorem 0.3 (Brent's Principle) Let a parallel
    algorithm require m operations and run in
    parallel time t. Then running this algorithm on a
    machine with only p processors requires time at
    most m/p + t.
  • Proof: Let mi be the number of computational
    operations at the i-th step, i.e.
    m1 + m2 + ... + mt = m. If we assign the p
    processors on the i-th step to work on these mi
    operations, they can conclude in time
    ⌈mi/p⌉ ≤ mi/p + 1. Thus the total running time on
    p processors is at most the sum over i of
    ⌈mi/p⌉ ≤ m/p + t (see the sketch below).
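
A minimal Python sketch of the emulation bound m/p + t; the per-step
operation counts m_i are hypothetical, and the code only evaluates the
schedule length, it does not execute real work.

    import math

    def brent_emulation_time(ops_per_step, p):
        """Emulated time: each step's m_i operations are run p at a time."""
        return sum(math.ceil(m_i / p) for m_i in ops_per_step)

    ops = [7, 3, 5, 1, 4]        # hypothetical m_i per step; m = 20, t = 5
    p = 4
    m, t = sum(ops), len(ops)
    print(brent_emulation_time(ops, p))   # 7 steps when emulated on p = 4
    print(m / p + t)                      # Brent's bound: m/p + t = 10.0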

11
End
  • Thank you!