Slides Prepared from the CI-Tutor Courses at NCSA - PowerPoint PPT Presentation

About This Presentation
Title:

Slides Prepared from the CI-Tutor Courses at NCSA

Description:

Parallel Computing Explained Scalar Tuning Slides Prepared from the CI-Tutor Courses at NCSA http://ci-tutor.ncsa.uiuc.edu/ By S. Masoud Sadjadi School of Computing ... – PowerPoint PPT presentation

Slides: 21
Provided by: sadj5
Learn more at: http://users.cis.fiu.edu
Transcript and Presenter's Notes


1
Parallel Computing Explained: Scalar Tuning
  • Slides Prepared from the CI-Tutor Courses at NCSA
  • http://ci-tutor.ncsa.uiuc.edu/
  • By S. Masoud Sadjadi
  • School of Computing and Information Sciences
  • Florida International University
  • March 2009

2
Agenda
  • 1 Parallel Computing Overview
  • 2 How to Parallelize a Code
  • 3 Porting Issues
  • 4 Scalar Tuning
  • 4.1 Aggressive Compiler Options
  • 4.2 Compiler Optimizations
  • 4.3 Vendor Tuned Code
  • 4.4 Further Information

3
Scalar Tuning
  • If you are not satisfied with the performance of
    your program on the new computer, you can tune
    the scalar code to decrease its runtime.
  • This chapter describes several of these techniques:
  • The use of the most aggressive compiler options
  • The improvement of loop unrolling
  • The use of subroutine inlining
  • The use of vendor-supplied tuned code
  • The detection of cache problems and their
    solutions are presented in the Cache Tuning
    chapter.

4
Aggressive Compiler Options
  • For the SGI Origin2000 and the Linux clusters,
    the main optimization switch is
  • -On, where n ranges from 0 to 3.
  • -O0 turns off all optimizations.
  • -O1 and -O2 do beneficial optimizations that will
    not affect the accuracy of results.
  • -O3 specifies the most aggressive optimizations.
    It takes the most compile time, may produce
    changes in accuracy, and turns on software
    pipelining.

5
Aggressive Compiler Options
  • Note that -O3 might carry out loop
    transformations that produce incorrect results in
    some codes.
  • It is recommended that you compare the answer
    obtained with -O3 against one obtained at a
    lower optimization level.
  • On the SGI Origin2000 and the Linux clusters, -O3
    can be used together with -OPT:IEEE_arithmetic=n
    (n = 1, 2, or 3) and -mp (or -mp1), respectively,
    to enforce conformance to the IEEE standard at
    different levels.
  • On the SGI Origin2000, the option
  • -Ofast=ip27
  • is also available. This option specifies the
    most aggressive optimizations that are
    specifically tuned for the Origin2000 computer.
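One reason aggressive optimization can change accuracy is that floating-point addition is not associative, so a compiler that reorders a sum may round differently. The following is a minimal C sketch of that effect, not taken from the course; the function names are illustrative assumptions, and the reordering is written out by hand rather than produced by any particular compiler flag.

```c
/* Sum evaluated left to right, as written in the source. */
double sum_as_written(double a, double b, double c)
{
    return (a + b) + c;
}

/* The same sum after a reassociation that an aggressive optimizer
   may be allowed to apply. */
double sum_reassociated(double a, double b, double c)
{
    return a + (b + c);
}

/* Returns 1 when the two results agree to within tol, 0 otherwise.
   This mirrors the advice to compare -O3 answers with answers from a
   lower optimization level. */
int results_agree(double x, double y, double tol)
{
    double diff = x > y ? x - y : y - x;
    return diff <= tol;
}
```

With a = 1e20, b = -1e20, and c = 1.0, the first ordering keeps the 1.0 while the second loses it to rounding, so comparing the two results with a small tolerance reports a mismatch.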

6
Agenda
  • 1 Parallel Computing Overview
  • 2 How to Parallelize a Code
  • 3 Porting Issues
  • 4 Scalar Tuning
  • 4.1 Aggressive Compiler Options
  • 4.2 Compiler Optimizations
  • 4.2.1 Statement Level
  • 4.2.2 Block Level
  • 4.2.3 Routine Level
  • 4.2.4 Software Pipelining
  • 4.2.5 Loop Unrolling
  • 4.2.6 Subroutine Inlining
  • 4.2.7 Optimization Report
  • 4.2.8 Profile-guided Optimization (PGO)
  • 4.3 Vendor Tuned Code
  • 4.4 Further Information

7
Compiler Optimizations
  • The various compiler optimizations can be
    classified as follows:
  • Statement Level Optimizations
  • Block Level Optimizations
  • Routine Level Optimizations
  • Software Pipelining
  • Loop Unrolling
  • Subroutine Inlining
  • Each of these is described in the following
    sections.

8
Statement Level
  • Constant Folding
  • Replace simple arithmetic operations on constants
    with the pre-computed result.
  •       y = 5 + 7 becomes y = 12
  • Short Circuiting
  • Avoid executing parts of conditional tests that
    are not necessary.
  •       if (I.eq.J .or. I.eq.K) expression
    when I = J, compute the expression immediately
    without testing I.eq.K
  • Register Assignment
  • Put frequently used variables in registers.
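The first two statement-level ideas can be sketched in C. This is an illustration under assumptions rather than the course's own example: the names FOLDED, equals_k, matches_either, and second_test_count are hypothetical. Note one difference from the Fortran case on the slide: in C the || operator is required by the language to short-circuit, whereas for Fortran's .or. it is left to the compiler.

```c
/* Constant folding: an enum initializer must be a constant expression,
   so 5 + 7 is evaluated at compile time. */
enum { FOLDED = 5 + 7 };

/* Counts how many times the second comparison is actually evaluated. */
static int second_test_calls = 0;

static int equals_k(int i, int k)
{
    second_test_calls++;
    return i == k;
}

/* Short circuiting: equals_k() is not called when i == j, because the
   result of the || is already known to be true. */
int matches_either(int i, int j, int k)
{
    return (i == j) || equals_k(i, k);
}

/* Reports how often the second test ran. */
int second_test_count(void)
{
    return second_test_calls;
}
```

Calling matches_either(3, 3, 9) leaves the counter untouched, while matches_either(3, 4, 3) increments it once.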

9
Block Level
  • Dead Code Elimination
  • Remove unreachable code and code that is never
    executed or used.
  • Instruction Scheduling
  • Reorder the instructions to improve memory
    pipelining.
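Dead code elimination can be pictured with a small before-and-after pair in C. This is a hand-written sketch with hypothetical function names; a real compiler performs the transformation on its internal representation rather than on source text.

```c
/* As written: contains a dead store and an unreachable branch. */
int twice_as_written(int x)
{
    int result = x * 100;   /* dead store: overwritten before it is read */
    if (0) {
        return -1;          /* unreachable: the condition is always false */
    }
    result = x * 2;
    return result;
}

/* What remains once the dead store and unreachable branch are removed. */
int twice_after_elimination(int x)
{
    return x * 2;
}
```

Both functions return the same value for any input where x * 100 does not overflow; the second simply does less work.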

10
Routine Level
  • Strength Reduction
  • Replace expressions in a loop with an expression
    that takes fewer cycles.
  • Common Subexpression Elimination
  • Expressions that appear more than once are
    computed once, and the result is substituted for
    each occurrence of the expression.
  • Constant Propagation
  • Compile time replacement of variables with
    constants.
  • Loop Invariant Elimination
  • Expressions inside a loop that don't change with
    the do loop index are moved outside the loop.
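Three of the routine-level optimizations above can be applied by hand to one loop, which makes their effect easier to see. The sketch below is an assumption-laden illustration in C (the function and parameter names are invented for this example); it does not show constant propagation, and an optimizing compiler would normally do all of this automatically.

```c
#include <stddef.h>

/* As written: recomputes scale * offset twice and multiplies
   i * stride on every iteration. */
void fill_as_written(double *out, size_t n, double scale,
                     double offset, size_t stride)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (double)(i * stride) + (scale * offset) * (scale * offset);
    }
}

/* After the optimizations have been applied by hand. */
void fill_optimized(double *out, size_t n, double scale,
                    double offset, size_t stride)
{
    /* Common subexpression elimination: scale * offset computed once. */
    const double product = scale * offset;
    /* Loop invariant moved outside the loop. */
    const double invariant = product * product;
    /* Strength reduction: the multiply i * stride becomes a running add. */
    size_t index = 0;

    for (size_t i = 0; i < n; i++) {
        out[i] = (double)index + invariant;
        index += stride;
    }
}
```

Both functions fill the array with the same values; the second performs one multiplication pair in total instead of three multiplications per iteration.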

11
Software Pipelining
  • Software pipelining allows the mixing of
    operations from different loop iterations in each
    iteration of the hardware loop. It is used to get
    the maximum work done per clock cycle.
  • Note: On the R10000s there is out-of-order
    execution of instructions, and software
    pipelining may actually get in the way of this
    feature.
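The idea of mixing operations from different iterations can be imitated at source level. The following C sketch is only an analogy under stated assumptions: real software pipelining is done by the compiler on machine instructions, the function names are hypothetical, and the arrays x and y are assumed not to overlap.

```c
#include <stddef.h>

/* Plain loop: each iteration loads, multiplies, and stores one element. */
void scale_plain(const double *x, double *y, size_t n, double a)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i];
    }
}

/* Hand-pipelined loop: each pass through the body multiplies and stores
   element i while already loading element i + 1. */
void scale_pipelined(const double *x, double *y, size_t n, double a)
{
    if (n == 0) {
        return;
    }
    double loaded = x[0];                 /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {
        double product = a * loaded;      /* compute for element i */
        loaded = x[i + 1];                /* load for element i + 1 */
        y[i] = product;                   /* store for element i */
    }
    y[n - 1] = a * loaded;                /* epilogue: last compute and store */
}
```

The prologue and epilogue handle the first load and the last store, which have no partner iteration to overlap with.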

12
Loop Unrolling
  • The loop's stride (or step) value is increased,
    and the body of the loop is replicated. It is
    used to improve the scheduling of the loop by
    giving a longer sequence of straight-line code.
    An example of loop unrolling follows:
  • Original loop:
      do I = 1, 99
        c(I) = a(I) + b(I)
      enddo
  • Unrolled loop:
      do I = 1, 99, 3
        c(I)   = a(I)   + b(I)
        c(I+1) = a(I+1) + b(I+1)
        c(I+2) = a(I+2) + b(I+2)
      enddo
  • There is a limit to the amount of unrolling that
    can take place because there are a limited number
    of registers.
  • On the SGI Origin2000, loops are unrolled to a
    level of 8 by default. You can unroll to a level
    of 12 by specifying
  • f90 -O3 -OPT:unroll_times_max=12 ... prog.f
  • On the IA32 Linux cluster, the corresponding flags
    are -unroll and -unroll0 for unrolling and no
    unrolling, respectively.
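The slide's example uses a trip count of 99, which divides evenly by the unroll factor of 3. A general version also needs a cleanup loop for leftover iterations. The following C sketch shows one way to write that by hand; the function names are illustrative assumptions, and a compiler's own unrolling may be organized differently.

```c
#include <stddef.h>

/* Original loop: one element per iteration. */
void add_rolled(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Unrolled by 3: the stride is 3 and the body is replicated three times. */
void add_unrolled3(const double *a, const double *b, double *c, size_t n)
{
    size_t i = 0;

    /* Main unrolled loop: runs while three full elements remain. */
    for (; i + 2 < n; i += 3) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
    }

    /* Cleanup loop: handles the 0, 1, or 2 elements left over. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

For n = 99 the cleanup loop does nothing; for n = 100 it processes the one remaining element.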

13
Subroutine Inlining
  • Subroutine inlining replaces a call to a
    subroutine with the body of the subroutine
    itself.
  • One reason for using subroutine inlining is that
    when a subroutine is called inside a do loop that
    has a huge iteration count, inlining may be more
    efficient because it removes the call overhead
    paid on every iteration.
  • However, the chief reason for using it is that do
    loops that contain subroutine calls may not
    parallelize.
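Inlining can be pictured as replacing the call inside the loop with the callee's body. The C sketch below does this by hand with hypothetical function names; it illustrates the transformation itself, not the output of any specific compiler option.

```c
#include <stddef.h>

/* Small routine called once per loop iteration. */
static double axpy_element(double alpha, double x, double y)
{
    return alpha * x + y;
}

/* Loop with a call in its body: pays call overhead on every iteration,
   and the call may hide the loop's data dependences from the compiler. */
void axpy_with_calls(double alpha, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = axpy_element(alpha, x[i], y[i]);
    }
}

/* Same loop after inlining: the body of axpy_element replaces the call. */
void axpy_inlined(double alpha, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Once the call is gone, the loop body is plain arithmetic on array elements, which is the form that loop optimizers and parallelizers can analyze most easily.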

14
Subroutine Inlining
  • On the SGI Origin2000 computer, there are several
    options to invoke inlining:
  • Inline all routines except those specified to
    -INLINE:never
      f90 -O3 -INLINE:all prog.f
  • Inline no routines except those specified to
    -INLINE:must
      f90 -O3 -INLINE:none prog.f
  • Specify a list of routines to inline at every
    call
      f90 -O3 -INLINE:must=subrname prog.f
  • Specify a list of routines never to inline
      f90 -O3 -INLINE:never=subrname prog.f
  • On the Linux clusters, the following flags can
    invoke function inlining:
  • Inline function expansion for calls defined
    within the current source file: -ip
  • Inline function expansion for calls defined in
    separate files: -ipo

15
Optimization Report
  • Intel 9.x and later compilers can generate
    reports that provide useful information on
    optimization done on different parts of your
    code.
  • To generate such optimization reports in a file
    filename, add the flag -opt-report-file filename.
  • If you have a lot of source files to process
    simultaneously, and you use a makefile to
    compile, you can also use make's "suffix" rules
    to have optimization reports produced
    automatically, each with a unique name. For
    example,
  • .f.o:
  •         ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f
  • creates optimization reports that are named
    identically to the original Fortran source but
    with the suffix ".f" replaced by ".opt".

16
Optimization Report
  • To help developers and performance analysts
    navigate through the usually lengthy optimization
    reports, the NCSA program OptView is designed to
    provide an easy-to-use and intuitive interface
    that allows the user to browse through their own
    source code, cross-referenced with the
    optimization reports.
  • OptView is installed on NCSA's IA64 Linux cluster
    under the directory /usr/apps/tools/bin. You can
    either add that directory to your UNIX PATH or
    you can invoke optview using an absolute path
    name. You'll need to be using the X-Window system
    and to have set your DISPLAY environment variable
    correctly for OptView to work.
  • OptView can provide a quick overview of which
    loops in a source file, or across multiple
    source files, are highly optimized and which
    might need further work. For a detailed
    description of how to use OptView, see
    http://perfsuite.ncsa.uiuc.edu/OptView/

17
Profile-guided Optimization (PGO)
  • Profile-guided optimization allows Intel
    compilers to use runtime information to make
    better decisions about function inlining and
    interprocedural optimizations, and so generate
    faster code. The methodology has three steps:

18
Profile-guided Optimization (PGO)
  • First, do an instrumented compilation by
    adding the -prof-gen flag to the compile step:
  • icc -prof-gen -c a1.c a2.c a3.c
  • icc a1.o a2.o a3.o -lirc
  • Then, run the program with a representative
    set of data to generate the dynamic information
    files, which have the .dyn suffix.
  • These files contain runtime information that
    lets the compiler do better function inlining
    and other optimizations.
  • Finally, recompile the code with the
    -prof-use flag to use the runtime information:
  • icc -prof-use -ipo -c a1.c a2.c a3.c
  • A profile-guided optimized executable is
    generated.

19
Vendor Tuned Code
  • Vendor math libraries have codes that are
    optimized for their specific machine.
  • On the SGI Origin2000 platform, Complib.sgimath
    and SCSL are available.
  • On the Linux clusters, Intel MKL is available.
    Ways to link to these libraries are described in
    Section 3 - Porting Issues.

20
Further Information
  • SGI IRIX man and www pages
  • man opt
  • man lno
  • man inline
  • man ipa
  • man perfex
  • Performance Tuning for the Origin2000 at
    http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
  • Linux clusters help and www pages
  • ifort/icc/icpc help (Intel)
  • http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
  • http://perfsuite.ncsa.uiuc.edu/OptView/