Slides Prepared from the CI-Tutor Courses at NCSA - PowerPoint PPT Presentation

1
Parallel Computing Explained: Scalar Tuning

Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By
S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009

2
Agenda

1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
4.1 Aggressive Compiler Options
4.2 Compiler Optimizations
4.3 Vendor Tuned Code
4.4 Further Information

3
Scalar Tuning

If you are not satisfied with the performance of
your program on the new computer, you can tune
the scalar code to decrease its runtime.
This chapter describes many of these techniques:
The use of the most aggressive compiler options
The improvement of loop unrolling
The use of subroutine inlining
The use of vendor-supplied tuned code
The detection of cache problems and their
solutions are presented in the Cache Tuning
chapter.

4
Aggressive Compiler Options

For the SGI Origin2000 and the Linux clusters, the
main optimization switch is
-On, where n ranges from 0 to 3.
-O0 turns off all optimizations.
-O1 and -O2 do beneficial optimizations that will
not affect the accuracy of results.
-O3 specifies the most aggressive optimizations.
It takes the most compile time, may produce
changes in accuracy, and turns on software
pipelining.

5
Aggressive Compiler Options

It should be noted that -O3 might carry out loop
transformations that produce incorrect results in
some codes.
It is recommended that one compare the answer
obtained from Level 3 optimization with one
obtained from a lower-level optimization.
On the SGI Origin2000 and the Linux clusters, -O3
can be used together with -OPT:IEEE_arithmetic=n
(n=1, 2, or 3) and -mp (or -mp1), respectively, to
enforce operation conformance to the IEEE standard
at different levels.
On the SGI Origin2000, the option
-Ofast=ip27
is also available. This option specifies the
most aggressive optimizations that are
specifically tuned for the Origin2000 computer.
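
The recommendation to compare -O3 answers against a lower optimization level can be sketched as a small checking routine. This is only an illustration: the function names (max_rel_diff, results_agree), the idea that both builds leave their results in arrays of doubles, and the choice of tolerance are all assumptions, not part of the original course material.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_double(double v)
{
    return v < 0.0 ? -v : v;
}

/* Largest relative difference between two result arrays of length n.
   ref  : results from the lower optimization level (used as reference)
   test : results from the -O3 build
   Both names are hypothetical; adapt them to how your program stores output. */
double max_rel_diff(const double *ref, const double *test, size_t n)
{
    assert(ref != NULL && test != NULL);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Scale by |ref[i]|, but never by less than 1, so that values
           near zero do not blow up the relative difference. */
        double scale = abs_double(ref[i]) > 1.0 ? abs_double(ref[i]) : 1.0;
        double diff = abs_double(test[i] - ref[i]) / scale;
        if (diff > worst)
            worst = diff;
    }
    return worst;
}

/* Returns 1 if the two result sets agree to within tol, 0 otherwise. */
int results_agree(const double *ref, const double *test, size_t n, double tol)
{
    return max_rel_diff(ref, test, n) <= tol;
}
```

A suitable tolerance depends on the application; small differences from reordered floating-point operations are expected, while large ones suggest a transformation that changed the answer.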

6
Agenda

1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
4.1 Aggressive Compiler Options
4.2 Compiler Optimizations
4.2.1 Statement Level
4.2.2 Block Level
4.2.3 Routine Level
4.2.4 Software Pipelining
4.2.5 Loop Unrolling
4.2.6 Subroutine Inlining
4.2.7 Optimization Report
4.2.8 Profile-guided Optimization (PGO)
4.3 Vendor Tuned Code
4.4 Further Information

7
Compiler Optimizations

The various compiler optimizations can be
classified as follows:
Statement Level Optimizations
Block Level Optimizations
Routine Level Optimizations
Software Pipelining
Loop Unrolling
Subroutine Inlining
Each of these is described in the following
sections.

8
Statement Level

Constant Folding
Replace simple arithmetic operations on constants
with the pre-computed result.
y = 5+7 becomes y = 12
Short Circuiting
Avoid executing parts of conditional tests that
are not necessary.
if (I.eq.J .or. I.eq.K) expression
when I = J, immediately compute the expression
Register Assignment
Put frequently used variables in registers.
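
The first two statement-level optimizations can be illustrated with a short C sketch. All names here (folded_constant, either_matches, equals_counted, calls_to_second_test) are hypothetical and exist only for this illustration. Note one difference from the Fortran on the slide: in C the || operator is guaranteed by the language to stop after a true left operand, whereas for Fortran's .or. the slide describes short circuiting as something the compiler may do.

```c
#include <assert.h>

int calls_to_second_test = 0;  /* counts how often the second test runs */

/* Second comparison, instrumented so we can see whether it was evaluated. */
int equals_counted(int a, int b)
{
    calls_to_second_test++;
    return a == b;
}

/* Constant folding: the compiler can replace 5 + 7 with 12 at compile time. */
int folded_constant(void)
{
    int y = 5 + 7;
    return y;
}

/* Short circuiting: when i == j the second comparison is never executed. */
int either_matches(int i, int j, int k)
{
    return (i == j) || equals_counted(i, k);
}

/* Self-check showing the behavior described above. */
void check_statement_level(void)
{
    calls_to_second_test = 0;
    assert(folded_constant() == 12);
    assert(either_matches(3, 3, 9));      /* first test true ...        */
    assert(calls_to_second_test == 0);    /* ... second test skipped    */
    assert(either_matches(3, 4, 3));      /* first test false ...       */
    assert(calls_to_second_test == 1);    /* ... second test evaluated  */
}
```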

9
Block Level

Dead Code Elimination
Remove unreachable code and code that is never
executed or used.
Instruction Scheduling
Reorder the instructions to improve memory
pipelining.

10
Routine Level

Strength Reduction
Replace expressions in a loop with an expression
that takes fewer cycles.
Common Subexpression Elimination
Expressions that appear more than once are
computed once, and the result is substituted for
each occurrence of the expression.
Constant Propagation
Compile time replacement of variables with
constants.
Loop Invariant Elimination
Expressions inside a loop that don't change with
the do loop index are moved outside the loop.
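
The routine-level optimizations above can be pictured by writing out by hand what a compiler might do. The sketch below is an assumption-laden illustration in C, not compiler output: the function names (scale_naive, scale_optimized) and the particular expression are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* As the programmer might write it. */
void scale_naive(double *out, const double *x, size_t n, double a, double b)
{
    assert(out != NULL && x != NULL);
    size_t stride = 8;  /* a variable whose value is known at compile time */
    for (size_t i = 0; i < n; i++) {
        /* (a + b) appears twice and does not change with i;
           i * stride is a multiplication by the loop index. */
        out[i] = (a + b) * x[i] + (a + b) + (double)(i * stride);
    }
}

/* Roughly what the optimizations above could turn it into. */
void scale_optimized(double *out, const double *x, size_t n, double a, double b)
{
    assert(out != NULL && x != NULL);
    const double s = a + b;  /* common subexpression, computed once and
                                moved out of the loop as a loop invariant */
    size_t offset = 0;       /* strength reduction: the multiply i * stride
                                becomes a running addition */
    for (size_t i = 0; i < n; i++) {
        out[i] = s * x[i] + s + (double)offset;
        offset += 8;         /* constant propagation: stride replaced by 8 */
    }
}
```

Both functions produce the same values; the second simply does less work per iteration.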

11
Software Pipelining

Software pipelining allows the mixing of
operations from different loop iterations in each
iteration of the hardware loop. It is used to get
the maximum work done per clock cycle.
Note: On the R10000s there is out-of-order
execution of instructions, and software
pipelining may actually get in the way of this
feature.
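
To make the idea of mixing operations from different iterations concrete, here is a hand-written sketch in C. It is only an analogy for what the compiler does at the instruction level: the function names (scale_plain, scale_pipelined) are hypothetical, and a real compiler schedules machine instructions rather than rewriting source code.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: each iteration loads, multiplies, then stores. */
void scale_plain(double *out, const double *in, size_t n, double factor)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * factor;
}

/* Hand-pipelined version: the load for iteration i+1 sits in the same
   loop body as the multiply and store for iteration i. */
void scale_pipelined(double *out, const double *in, size_t n, double factor)
{
    assert(out != NULL && in != NULL);
    if (n == 0)
        return;
    double loaded = in[0];              /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {
        double next = in[i + 1];        /* load belonging to iteration i+1 */
        out[i] = loaded * factor;       /* multiply and store of iteration i */
        loaded = next;
    }
    out[n - 1] = loaded * factor;       /* epilogue: finish the last element */
}
```

The prologue and epilogue are the price of the overlap: work is started before the loop and finished after it so that the steady-state loop body always has independent operations available.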

12
Loop Unrolling

The loop's stride (or step) value is increased,
and the body of the loop is replicated. It is
used to improve the scheduling of the loop by
giving a longer sequence of straight line code.
An example of loop unrolling follows:
Original Loop
do I = 1, 99
  c(I) = a(I) + b(I)
enddo
Unrolled Loop
do I = 1, 99, 3
  c(I) = a(I) + b(I)
  c(I+1) = a(I+1) + b(I+1)
  c(I+2) = a(I+2) + b(I+2)
enddo
There is a limit to the amount of unrolling that
can take place because there are a limited number
of registers.
On the SGI Origin2000, loops are unrolled to a
level of 8 by default. You can unroll to a level
of 12 by specifying
f90 -O3 -OPT:unroll_times_max=12 ... prog.f
On the IA32 Linux cluster, the corresponding flags
are -unroll and -unroll0 for unrolling and no
unrolling, respectively.
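
The slide's example works without extra code because 99 is a multiple of 3. For a general trip count, an unrolled loop needs a cleanup loop for the leftover iterations. The following C sketch shows one way to write that by hand; the function names (add_plain, add_unrolled3) are hypothetical, and compilers that unroll automatically generate their own equivalent of the cleanup loop.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element per iteration. */
void add_plain(double *c, const double *a, const double *b, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Unrolled by 3: the stride becomes 3 and the body is replicated. */
void add_unrolled3(double *c, const double *a, const double *b, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);
    size_t i = 0;
    for (; i + 2 < n; i += 3) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
    }
    /* Cleanup: handles the 0, 1, or 2 elements left when n is not a
       multiple of 3. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```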

13
Subroutine Inlining

Subroutine inlining replaces a call to a
subroutine with the body of the subroutine
itself.
One reason for using subroutine inlining is that
when a subroutine is called inside a do loop that
has a huge iteration count, subroutine inlining
may be more efficient because it removes the
overhead of the call in every iteration.
However, the chief reason for using it is that do
loops that contain subroutine calls may not
parallelize.
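
As a rough picture of what inlining does, the C sketch below shows a loop that calls a small routine and the same loop with the routine body substituted in by hand. The names (axpy_term, axpy_with_call, axpy_inlined) are hypothetical. In C, the inline keyword is only a hint; whether the call is actually inlined is up to the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* Small routine called once per loop iteration. */
static inline double axpy_term(double alpha, double x, double y)
{
    return alpha * x + y;
}

/* Loop containing a call: without inlining, every iteration pays for
   the call, and some compilers will not parallelize such a loop. */
void axpy_with_call(double *y, const double *x, size_t n, double alpha)
{
    assert(y != NULL && x != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = axpy_term(alpha, x[i], y[i]);
}

/* The same loop after inlining by hand: the call is replaced by the
   body of the routine. */
void axpy_inlined(double *y, const double *x, size_t n, double alpha)
{
    assert(y != NULL && x != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```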

14
Subroutine Inlining

On the SGI Origin2000 computer, there are several
options to invoke inlining:
Inline all routines except those specified to
-INLINE:never
f90 -O3 -INLINE:all prog.f
Inline no routines except those specified to
-INLINE:must
f90 -O3 -INLINE:none prog.f
Specify a list of routines to inline at every
call
f90 -O3 -INLINE:must=subrname prog.f
Specify a list of routines never to inline
f90 -O3 -INLINE:never=subrname prog.f
On the Linux clusters, the following flags can
invoke function inlining:
-ip
inline function expansion for calls defined
within the current source file
-ipo
inline function expansion for calls defined in
separate files

15
Optimization Report

Intel 9.x and later compilers can generate
reports that provide useful information on
optimization done on different parts of your
code.
To generate such optimization reports in a file
filename, add the flag -opt-report-file filename.
If you have a lot of source files to process
simultaneously, and you use a makefile to
compile, you can also use make's "suffix" rules
to have optimization reports produced
automatically, each with a unique name. For
example,
.f.o:
	ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f
creates optimization reports that are named
identically to the original Fortran source but
with the suffix ".f" replaced by ".opt".

16
Optimization Report

To help developers and performance analysts
navigate through the usually lengthy optimization
reports, the NCSA program OptView is designed to
provide an easy-to-use and intuitive interface
that allows the user to browse through their own
source code, cross-referenced with the
optimization reports.
OptView is installed on NCSA's IA64 Linux cluster
under the directory /usr/apps/tools/bin. You can
either add that directory to your UNIX PATH or
you can invoke optview using an absolute path
name. You'll need to be using the X-Window system
and to have set your DISPLAY environment variable
correctly for OptView to work.
OptView can provide a quick overview of which
loops in a single source file or across
multiple files are highly optimized and which
might need further work. For a detailed
description of how to use OptView, see
http://perfsuite.ncsa.uiuc.edu/OptView/

17
Profile-guided Optimization (PGO)

Profile-guided optimization allows Intel
compilers to use valuable runtime information to
make better decisions about function inlining and
interprocedural optimizations to generate faster
code. Its methodology is as follows:

18
Profile-guided Optimization (PGO)

First, you do an instrumented compilation by
adding the -prof-gen flag in the compile process:
icc -prof-gen -c a1.c a2.c a3.c
icc a1.o a2.o a3.o -lirc
Then, you run the program with a representative
set of data to generate the dynamic information
files, which are identified by the .dyn suffix.
These files contain valuable runtime information
for the compiler to do better function inlining
and other optimizations.
Finally, the code is recompiled with the
-prof-use flag to use the runtime information:
icc -prof-use -ipo -c a1.c a2.c a3.c
A profile-guided optimized executable is
generated.

19
Vendor Tuned Code

Vendor math libraries have codes that are
optimized for their specific machine.
On the SGI Origin2000 platform, Complib.sgimath
and SCSL are available.
On the Linux clusters, Intel MKL is available.
Ways to link to these libraries are described in
Section 3 - Porting Issues.

20
Further Information

SGI IRIX man and www pages
man opt
man lno
man inline
man ipa
man perfex
Performance Tuning for the Origin2000 at
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
Linux clusters help and www pages
ifort/icc/icpc help (Intel)
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
http://perfsuite.ncsa.uiuc.edu/OptView/