1
Parallel Computing Explained: Parallel Performance Analysis
  • Slides Prepared from the CI-Tutor Courses at NCSA
  • http://ci-tutor.ncsa.uiuc.edu/
  • By
  • S. Masoud Sadjadi
  • School of Computing and Information Sciences
  • Florida International University
  • March 2009

2
Agenda
  • 1 Parallel Computing Overview
  • 2 How to Parallelize a Code
  • 3 Porting Issues
  • 4 Scalar Tuning
  • 5 Parallel Code Tuning
  • 6 Timing and Profiling
  • 7 Cache Tuning
  • 8 Parallel Performance Analysis
  • 8.1 Speedup
  • 8.2 Speedup Extremes
  • 8.3 Efficiency
  • 8.4 Amdahl's Law
  • 8.5 Speedup Limitations
  • 8.6 Benchmarks
  • 8.7 Summary
  • 9 About the IBM Regatta P690

3
Parallel Performance Analysis
  • Now that you have parallelized your code and have
    run it on a parallel computer using multiple
    processors, you may want to know the performance
    gain that parallelization has achieved.
  • This chapter describes how to compute parallel
    code performance.
  • Often the performance gain is not perfect, and
    this chapter also explains some of the reasons
    for limitations on parallel performance.
  • Finally, this chapter covers the kinds of
    information you should provide in a benchmark,
    and some sample benchmarks are given.

4
Speedup
  • The speedup of your code tells you how much
    performance gain is achieved by running your
    program in parallel on multiple processors.
  • A simple definition is that speedup is the length
    of time it takes a program to run on a single
    processor, divided by the time it takes to run on
    multiple processors (a worked example appears at
    the end of this slide).
  • Speedup generally ranges between 0 and p, where p
    is the number of processors.
  • Scalability
  • When you compute with multiple processors in a
    parallel environment, you will also want to know
    how your code scales.
  • The scalability of a parallel code is defined as
    its ability to achieve performance proportional
    to the number of processors used.
  • As you run your code with more and more
    processors, you want to see the performance of
    the code continue to improve.
  • Computing speedup is a good way to measure how a
    program scales as more processors are used.
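  • For example, with illustrative timings (not
    measurements from the course): if the serial run
    takes T1 = 100 seconds and the run on 8 processors
    takes T8 = 20 seconds, the speedup is
    S8 = T1 / T8 = 100 / 20 = 5, which is sub-linear,
    since perfect speedup on 8 processors would be 8.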

5
Speedup
  • Linear Speedup
  • If it takes one processor an amount of time t to
    do a task and if p processors can do the task in
    time t / p, then you have perfect or linear
    speedup (Sp = p).
  • That is, running with 4 processors improves the
    time by a factor of 4, running with 8 processors
    improves the time by a factor of 8, and so on.
  • This is shown in the following illustration.
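  • In symbols, this follows directly from the
    definition of speedup: Sp = t / (t / p) = p.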

6
Speedup Extremes
  • The extremes of speedup happen when speedup is
  • greater than p, called super-linear speedup,
  • less than 1.
  • Super-Linear Speedup
  • You might wonder how super-linear speedup can
    occur. How can speedup be greater than the number
    of processors used?
  • The answer usually lies with the program's memory
    use. When using multiple processors, each
    processor only gets part of the problem compared
    to the single processor case. It is possible that
    the smaller problem can make better use of the
    memory hierarchy, that is, the cache and the
    registers. For example, the smaller problem may
    fit in cache when the entire problem would not.
  • When super-linear speedup is achieved, it is
    often an indication that the sequential code, run
    on one processor, had serious cache miss
    problems.
  • The most common programs that achieve
    super-linear speedup are those that solve dense
    linear algebra problems.
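  • To make the cache argument concrete with assumed,
    illustrative sizes: if a 64 MB array is divided
    among 32 processors, each processor works on only
    2 MB of it, which may fit entirely in a 4 MB
    per-processor cache even though the full array
    does not.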

7
Speedup Extremes
  • Parallel Code Slower than Sequential Code
  • When speedup is less than one, it means that the
    parallel code runs slower than the sequential
    code.
  • This happens when there isn't enough computation
    to be done by each processor.
  • The overhead of creating and controlling the
    parallel threads outweighs the benefits of
    parallel computation, and it causes the code to
    run slower.
  • To eliminate this problem you can try to increase
    the problem size or run with fewer processors.

8
Efficiency
  • Efficiency is a measure of parallel performance
    that is closely related to speedup and is often
    also presented in a description of the
    performance of a parallel program.
  • Efficiency with p processors is defined as the
    ratio of speedup with p processors to p.
  • Efficiency is a fraction that usually ranges
    between 0 and 1.
  • Ep = 1 corresponds to perfect speedup of Sp = p.
  • You can think of efficiency as describing the
    average speedup per processor.
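  • Continuing the illustrative timings used earlier
    for speedup: a speedup of S8 = 5 on 8 processors
    gives an efficiency of E8 = 5 / 8 = 0.625, that
    is, each processor delivers on average about 63%
    of its ideal contribution.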

9
Amdahl's Law
  • An alternative formula for speedup is Amdahl's
    Law, attributed to Gene Amdahl, one of America's
    great computer scientists.
  • This formula, introduced in 1967, states that no
    matter how many processors are used in a parallel
    run, a program's speedup will be limited by its
    fraction of sequential code.
  • That is, almost every program has a fraction of
    the code that doesn't lend itself to parallelism.
  • This is the fraction of code that will have to be
    run with just one processor, even in a parallel
    run.
  • Amdahl's Law defines speedup with p processors as
    follows
  • Sp = 1 / ( f + (1 - f) / p )
  • Where the term f stands for the fraction of
    operations done sequentially with just one
    processor, and the term (1 - f) stands for the
    fraction of operations done in perfect
    parallelism with p processors.
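  • As a rough illustration (not part of the original
    slides), the short C sketch below evaluates this
    formula for an assumed sequential fraction
    f = 0.05 and a few processor counts, showing how
    the speedup levels off

        #include <stdio.h>

        /* Amdahl's Law: Sp = 1 / (f + (1 - f) / p), where f is the
           sequential fraction of the code and p is the number of
           processors. */
        static double amdahl_speedup(double f, int p)
        {
            return 1.0 / (f + (1.0 - f) / p);
        }

        int main(void)
        {
            const double f = 0.05;          /* assumed: 5% sequential code */
            const int procs[] = {1, 4, 16, 64, 256};
            const int n = sizeof(procs) / sizeof(procs[0]);

            for (int i = 0; i < n; i++)
                printf("p = %3d   Sp = %6.2f\n",
                       procs[i], amdahl_speedup(f, procs[i]));

            /* As p grows, Sp approaches 1 / f = 20, the plateau
               predicted by the law. */
            return 0;
        }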

10
Amdahl's Law
  • The sequential fraction of code, f, is a unitless
    measure ranging between 0 and 1.
  • When f is 0, meaning there is no sequential code,
    then speedup is p, or perfect parallelism. This
    can be seen by substituting f = 0 in the formula
    above, which results in Sp = p.
  • When f is 1, meaning there is no parallel code,
    then speedup is 1, or there is no benefit from
    parallelism. This can be seen by substituting
    f = 1 in the formula above, which results in
    Sp = 1.
  • This shows that Amdahl's speedup ranges between 1
    and p, where p is the number of processors used
    in a parallel processing run.
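  • Written out, the two substitutions are
    Sp = 1 / ( 0 + (1 - 0) / p ) = p when f = 0, and
    Sp = 1 / ( 1 + (1 - 1) / p ) = 1 when f = 1.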

11
Amdahl's Law
  • The interpretation of Amdahl's Law is that
    speedup is limited by the fact that not all parts
    of a code can be run in parallel.
  • Substituting in the formula, when the number of
    processors goes to infinity, your code's speedup
    is still limited by 1 / f (this limit is worked
    out at the end of this slide).
  • Amdahl's Law shows that the sequential fraction
    of code has a strong effect on speedup.
  • This helps to explain the need for large problem
    sizes when using parallel computers.
  • It is well known in the parallel computing
    community that you cannot take a small
    application and expect it to show good
    performance on a parallel computer.
  • To get good performance, you need to run large
    applications, with large data array sizes, and
    lots of computation.
  • The reason for this is that as the problem size
    increases, the opportunity for parallelism grows
    and the sequential fraction shrinks, and so does
    its importance for speedup.
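  • Worked out, the limit mentioned above is
    Sp = 1 / ( f + (1 - f) / p ) -> 1 / f as p grows
    without bound. With the illustrative value
    f = 0.05, for example, speedup can never exceed
    20, no matter how many processors are used.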

12
Agenda
  • 8 Parallel Performance Analysis
  • 8.1 Speedup
  • 8.2 Speedup Extremes
  • 8.3 Efficiency
  • 8.4 Amdahl's Law
  • 8.5 Speedup Limitations
  • 8.5.1 Memory Contention Limitation
  • 8.5.2 Problem Size Limitation
  • 8.6 Benchmarks
  • 8.7 Summary

13
Speedup Limitations
  • This section covers some of the reasons why a
    program doesn't get perfect speedup. Some of the
    reasons for limitations on speedup are
  • Too much I/O
  • Speedup is limited when the code is I/O bound.
  • That is, when there is too much input or output
    compared to the amount of computation.
  • Wrong algorithm
  • Speedup is limited when the numerical algorithm
    is not suitable for a parallel computer.
  • You need to replace it with a parallel algorithm.
  • Too much memory contention
  • Speedup is limited when there is too much memory
    contention.
  • You need to redesign the code with attention to
    data locality.
  • Cache reutilization techniques will help here.

14
Speedup Limitations
  • Wrong problem size
  • Speedup is limited when the problem size is too
    small to take best advantage of a parallel
    computer.
  • In addition, speedup is limited when the problem
    size is fixed.
  • That is, when the problem size doesn't grow as
    you compute with more processors.
  • Too much sequential code
  • Speedup is limited when there's too much
    sequential code.
  • This is shown by Amdahl's Law.
  • Too much parallel overhead
  • Speedup is limited when there is too much
    parallel overhead compared to the amount of
    computation.
  • These are the additional CPU cycles accumulated
    in creating parallel regions, creating threads,
    synchronizing threads, spin/blocking threads, and
    ending parallel regions.
  • Load imbalance
  • Speedup is limited when the processors have
    different workloads.
  • The processors that finish early will be idle
    while they are waiting for the other processors
    to catch up.

15
Memory Contention Limitation
  • Gene Golub, a professor of Computer Science at
    Stanford University, writes in his book on
    parallel computing that the best way to define
    memory contention is with the word delay.
  • When different processors all want to read or
    write into the main memory, there is a delay
    until the memory is free.
  • On the SGI Origin2000 computer, you can determine
    whether your code has memory contention problems
    by using SGI's perfex utility.
  • The perfex utility is covered in the Cache Tuning
    lecture in this course.
  • You can also refer to SGI's manual page, man
    perfex, for more details.
  • On the Linux clusters, you can use the hardware
    performance counter tools to get information on
    memory performance.
  • On the IA32 platform, use perfex, vprof,
    hmpcount, psrun/perfsuite.
  • On the IA64 platform, use vprof, pfmon,
    psrun/perfsuite.

16
Memory Contention Limitation
  • Many of these tools can be used with the PAPI
    performance counter interface.
  • Be sure to refer to the man pages and webpages on
    the NCSA website for more information.
  • If the output of the utility shows that memory
    contention is a problem, you will want to use
    some programming techniques for reducing memory
    contention.
  • A good way to reduce memory contention is to
    access elements from the processor's cache memory
    instead of the main memory.
  • Some programming techniques for doing this are
  • Access arrays with unit stride.
  • Order nested do loops (in Fortran) so that the
    innermost loop index is the leftmost index of the
    arrays in the loop. For the C language, the order
    is the opposite: the innermost loop index should
    be the rightmost array index (see the sketch after
    this list).
  • Avoid specific array sizes that are the same as
    the size of the data cache or that are exact
    fractions or exact multiples of the size of the
    data cache.
  • Pad common blocks.
  • These techniques are called cache tuning
    optimizations. The details for performing these
    code modifications are covered in the section on
    Cache Optimization of this lecture.
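  • As an illustrative sketch (assumed array size, not
    code from the course), the two C routines below
    differ only in loop order. The first walks the
    rightmost index in the innermost loop, giving
    unit-stride access and good cache reuse; the
    second strides through memory and reuses the
    cache poorly.

        #define N 1024

        /* Cache-friendly in C: the innermost loop varies the rightmost
           index j, so consecutive iterations touch adjacent memory
           locations (unit stride). */
        void scale_rowwise(double a[N][N], double s)
        {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    a[i][j] *= s;
        }

        /* Cache-unfriendly in C: the innermost loop varies the leftmost
           index i, so consecutive accesses are N doubles apart and most
           of each cache line is wasted. */
        void scale_columnwise(double a[N][N], double s)
        {
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    a[i][j] *= s;
        }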

17
Problem Size Limitation
  • Small Problem Size
  • Speedup is almost always an increasing function
    of problem size.
  • If there's not enough work to be done by the
    available processors, the code will show limited
    speedup.
  • The effect of small problem size on speedup is
    shown in the following illustration.

18
Problem Size Limitation
  • Fixed Problem Size
  • When the problem size is fixed, you can reach a
    point of negative returns when using additional
    processors.
  • As you compute with more and more processors,
    each processor has less and less computation to
    perform.
  • The additional parallel overhead, compared to the
    amount of computation, causes the speedup curve
    to start turning downward as shown in the
    following figure.

19
Benchmarks
  • Eventually it will be time to report the parallel
    performance of your application code.
  • You will want to show a speedup graph with the
    number of processors on the x axis, and speedup
    on the y axis.
  • Some other things you should report and record
    are
  • the date you obtained the results
  • the problem size
  • the computer model
  • the compiler and the version number of the
    compiler
  • any special compiler options you used

20
Benchmarks
  • When doing computational science, it is often
    helpful to find out what kind of performance your
    colleagues are obtaining.
  • In this regard, NCSA has a compilation of
    parallel performance benchmarks online at
    http://www.ncsa.uiuc.edu/UserInfo/Perf/NCSAbench/.
  • You might be interested in looking at these
    benchmarks to see how other people report their
    parallel performance.
  • In particular, the NAMD benchmark is a report
    about the performance of the NAMD program that
    does molecular dynamics simulations.

21
Summary
  • There are many good texts on parallel computing
    which treat the subject of parallel performance
    analysis. Here are two useful references
  • Scientific Computing: An Introduction with
    Parallel Computing, Gene Golub and James Ortega,
    Academic Press, Inc.
  • Parallel Computing: Theory and Practice, Michael
    J. Quinn, McGraw-Hill, Inc.