Slides Prepared from the CI-Tutor Courses at NCSA - PowerPoint PPT Presentation

1
Parallel Computing Explained: Timing and Profiling
  • Slides Prepared from the CI-Tutor Courses at NCSA
  • http://ci-tutor.ncsa.uiuc.edu/
  • By
  • S. Masoud Sadjadi
  • School of Computing and Information Sciences
  • Florida International University
  • March 2009
  • (Additional Slides by Javier Delgado)

2
Agenda
  • 1 Parallel Computing Overview
  • 2 How to Parallelize a Code
  • 3 Porting Issues
  • 4 Scalar Tuning
  • 5 Parallel Code Tuning
  • 6 Timing and Profiling
  • 6.1 Timing
  • 6.1.1 Timing a Section of Code
  • 6.1.1.1 CPU Time
  • 6.1.1.2 Wall clock Time
  • 6.1.2 Timing an Executable
  • 6.1.3 Timing a Batch Job
  • 6.2 Profiling
  • 6.2.1 Profiling Tools
  • 6.2.2 Profile Listings
  • 6.2.3 Profiling Analysis
  • 6.3 Further Information

3
Timing and Profiling
  • Now that your program has been ported to the new
    computer, you will want to know how fast it runs.
  • This chapter describes how to measure the speed
    of a program using various timing routines.
  • The chapter also covers how to determine which
    parts of the program account for the bulk of the
    computational load so that you can concentrate
    your tuning efforts on those computationally
    intensive parts of the program.
  • This chapter also gives a summary of some
    available profiling tools.

4
Timing
  • In the following sections, we'll discuss timers
    and review the profiling tools ssrun and prof on
    the Origin and vprof and gprof on the Linux
    Clusters. The specific timing functions described
    are:
  • Timing a section of code
  • FORTRAN: etime, dtime, cpu_time for CPU time
  • FORTRAN: time and f_time for wallclock time
  • C: clock for CPU time
  • C: gettimeofday for wallclock time
  • Timing an executable
  • time a.out
  • Timing a batch run
  • busage
  • qstat
  • qhist

5
CPU Time
  • etime
  • A section of code can be timed using etime.
  • It returns the elapsed CPU time in seconds since
    the program started.
  • real*4 tarray(2), time1, time2, timeres
  • beginning of program
  • time1 = etime(tarray)
  • start of section of code to be timed
  • lots of computation
  • end of section of code to be timed
  • time2 = etime(tarray)
  • timeres = time2 - time1

6
CPU Time
  • dtime
  • A section of code can also be timed using dtime.
  • It returns the elapsed CPU time in seconds since
    the last call to dtime.
  • real*4 tarray(2), timeres
  • beginning of program
  • timeres = dtime(tarray)
  • start of section of code to be timed
  • lots of computation
  • end of section of code to be timed
  • timeres = dtime(tarray)
  • rest of program
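
For C programmers, the same "time since the last call" idea can be sketched on top of the standard clock function. This is only an illustrative analogue, not the FORTRAN library routine: the name cpu_delta is hypothetical, and on the first call the value is measured from whatever origin the implementation uses for clock (commonly close to program start).

```c
#include <time.h>

/* Hypothetical dtime-like helper (not a library routine): returns the CPU
   time, in seconds, consumed since the previous call to cpu_delta().
   On the first call it returns the CPU time since clock()'s
   implementation-defined origin. */
double cpu_delta(void)
{
    static clock_t last = 0;   /* clock() value seen at the previous call */
    clock_t now = clock();
    double delta = (double)(now - last) / (double)CLOCKS_PER_SEC;

    last = now;                /* the next call measures from this point */
    return delta;
}
```

Calling cpu_delta() once just before the section of interest resets the reference point; the value returned by the next call, made just after the section, is then the CPU time of that section.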

7
CPU Time
  • The etime and dtime Functions
  • User time.
  • This is returned as the first element of tarray.
  • It's the CPU time spent executing user code.
  • System time.
  • This is returned as the second element of tarray.
  • It's the time spent executing system calls on
    behalf of your program.
  • Sum of user and system time.
  • This is the function value that is returned.
  • It's the time that is usually reported.
  • Metric.
  • Timings are reported in seconds.
  • Timings are accurate to 1/100th of a second.
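
The split into user time, system time, and their sum can also be observed from C. The sketch below assumes a POSIX system and uses getrusage; the helper name etime_like and its array convention are hypothetical, chosen only to mirror the tarray layout described above.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper mirroring the etime convention described above:
   tarray[0] receives the user CPU time, tarray[1] the system CPU time
   (both in seconds), and the return value is their sum.
   Returns -1.0 if getrusage fails. */
double etime_like(double tarray[2])
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;

    /* struct timeval stores whole seconds plus microseconds */
    tarray[0] = (double)ru.ru_utime.tv_sec + 1.0e-6 * (double)ru.ru_utime.tv_usec;
    tarray[1] = (double)ru.ru_stime.tv_sec + 1.0e-6 * (double)ru.ru_stime.tv_usec;
    return tarray[0] + tarray[1];
}
```

As with etime, a section of code would be timed by calling the helper before and after the section and subtracting the two return values.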

8
CPU Time
  • Timing Comparison Warnings
  • For the SGI computers
  • The etime and dtime functions return the MAX time
    over all threads for a parallel program.
  • This is the time of the longest thread, which is
    usually the master thread.
  • For the Linux Clusters
  • The etime and dtime functions are contained in
    the VAX compatibility library of the Intel
    FORTRAN Compiler.
  • To use this library include the compiler flag
    -Vaxlib.
  • Another warning: do not put calls to etime and
    dtime inside a do loop. The overhead is too
    large.
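
One way to judge whether a timer is cheap enough to call frequently is to measure the cost of the timer call itself. The sketch below does this in C for the clock function rather than for etime or dtime, so it only illustrates the idea; the helper names wall_seconds and timer_call_cost are hypothetical.

```c
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Current wall-clock time in seconds, from gettimeofday. */
static double wall_seconds(void)
{
    struct timeval tp;

    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + 1.0e-6 * (double)tp.tv_usec;
}

/* Hypothetical helper: estimates the average cost, in seconds, of a
   single clock() call by timing ncalls back-to-back calls. */
double timer_call_cost(long ncalls)
{
    volatile clock_t sink = 0;  /* volatile so the stores are not optimized away */
    double start, stop;
    long i;

    if (ncalls <= 0)
        return 0.0;

    start = wall_seconds();
    for (i = 0; i < ncalls; i++)
        sink = clock();
    stop = wall_seconds();

    (void)sink;
    return (stop - start) / (double)ncalls;
}
```

If the per-call cost is comparable to the work done in one loop iteration, the timer calls belong outside the loop, around the whole loop, as the warning above suggests.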

9
CPU Time
  • cpu_time
  • Of the systems covered here, the cpu_time routine
    is available only on the Linux clusters, where it
    is a component of the Intel FORTRAN compiler
    library. (cpu_time is also a standard Fortran 95
    intrinsic.)
  • It provides substantially higher resolution and
    has substantially lower overhead than the older
    etime and dtime routines.
  • It can be used as an elapsed timer.
  • real*8 time1, time2, timeres
  • beginning of program
  • call cpu_time(time1)
  • start of section of code to be timed
  • lots of computation
  • end of section of code to be timed
  • call cpu_time(time2)
  • timeres = time2 - time1
  • rest of program

10
CPU Time
  • clock
  • For C programmers, one can call the cpu_time
    routine using a FORTRAN wrapper or call the
    standard library function clock, which can be
    used to determine elapsed CPU time.
  • #include <time.h>
  • static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
  • double time1, time2, timeres;
  • time1 = (clock()*iCPS);
  • /* do some work */
  • time2 = (clock()*iCPS);
  • timeres = time2 - time1;

11
Wall clock Time
  • time
  • For the Origin, the function time returns the
    time since 00:00:00 GMT, Jan. 1, 1970.
  • It is a means of getting the elapsed wall clock
    time.
  • The wall clock time is reported in integer
    seconds.
  • external time
  • integer*4 time1, time2, timeres
  • beginning of program
  • time1 = time( )
  • start of section of code to be timed
  • lots of computation
  • end of section of code to be timed
  • time2 = time( )
  • timeres = time2 - time1

12
Wall clock Time
  • f_time
  • For the Linux clusters, the appropriate FORTRAN
    function for elapsed time is f_time.
  • integer*8 f_time
  • external f_time
  • integer*8 time1, time2, timeres
  • beginning of program
  • time1 = f_time()
  • start of section of code to be timed
  • lots of computation
  • end of section of code to be timed
  • time2 = f_time()
  • timeres = time2 - time1
  • As above for etime and dtime, the f_time function
    is in the VAX compatibility library of the Intel
    FORTRAN Compiler. To use this library include the
    compiler flag -Vaxlib.

13
Wall clock Time
  • gettimeofday
  • For C programmers, wallclock time can be obtained
    by using the very portable routine gettimeofday.
  • #include <stddef.h> /* definition of NULL */
  • #include <sys/time.h> /* definition of timeval
    struct and prototyping of gettimeofday */
  • double t1, t2, elapsed;
  • struct timeval tp;
  • int rtn;
  • ....
  • ....
  • rtn = gettimeofday(&tp, NULL);
  • t1 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
  • ....
  • /* do some work */
  • ....
  • rtn = gettimeofday(&tp, NULL);
  • t2 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
  • elapsed = t2 - t1;
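
The difference between wall clock time and CPU time shows up clearly when a process is idle. The following sketch, which assumes a POSIX system, takes both measurements around a call to sleep; the helper name time_a_sleep is hypothetical and exists only for this illustration.

```c
#include <stddef.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleeps for the given number of whole seconds and
   reports how much wall-clock time (*wall) and CPU time (*cpu), both in
   seconds, passed in the meantime. */
void time_a_sleep(unsigned int seconds, double *wall, double *cpu)
{
    struct timeval tp;
    double w1, w2;
    clock_t c1, c2;

    gettimeofday(&tp, NULL);
    w1 = (double)tp.tv_sec + 1.0e-6 * (double)tp.tv_usec;
    c1 = clock();

    sleep(seconds);  /* idle: the wall clock advances, CPU time barely does */

    c2 = clock();
    gettimeofday(&tp, NULL);
    w2 = (double)tp.tv_sec + 1.0e-6 * (double)tp.tv_usec;

    *wall = w2 - w1;
    *cpu  = (double)(c2 - c1) / (double)CLOCKS_PER_SEC;
}
```

For a compute-bound section the two values are usually close on an otherwise idle machine; a large gap between them tends to point at waiting (I/O, sleeping, or competition for the processor) rather than computation.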

14
Timing an Executable
  • To time an executable (if using a csh or tcsh
    shell, explicitly call /usr/bin/time):
  • time [options] a.out
  • where options can be -p for a simple output or
    -f format, which allows the user to display more
    than just time-related information.
  • Consult the man pages on the time command for
    format options.
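
The kind of measurement the time command reports (real, user, and system time of a child process) can be sketched in C on a POSIX system. This is not a description of how /usr/bin/time is implemented; the helper names tv_seconds and time_command are hypothetical.

```c
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Converts a struct timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* Hypothetical helper: runs the program argv[0] with the NULL-terminated
   argument list argv, waits for it to finish, and reports its real
   (wall clock), user, and system time in seconds.
   Returns the child's exit status, or -1 on failure. */
int time_command(char *const argv[], double *real, double *user, double *sys)
{
    struct timeval start, stop;
    struct rusage before, after;
    int status;
    pid_t pid;

    /* CPU time already accumulated by earlier, waited-for children */
    getrusage(RUSAGE_CHILDREN, &before);
    gettimeofday(&start, NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);            /* reached only if execvp failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    gettimeofday(&stop, NULL);
    getrusage(RUSAGE_CHILDREN, &after);

    *real = tv_seconds(stop) - tv_seconds(start);
    *user = tv_seconds(after.ru_utime) - tv_seconds(before.ru_utime);
    *sys  = tv_seconds(after.ru_stime) - tv_seconds(before.ru_stime);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing an argument list such as { "./a.out", NULL } would time the executable from the slide in the same spirit as time a.out.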

15
Timing a Batch Job
  • Time of a batch job running or completed.
  • Origin
  • busage jobid
  • Linux clusters
  • qstat jobid for a running job
  • qhist jobid for a completed job

17
Profiling
  • Profiling determines where a program spends its
    time.
  • It detects the computationally intensive parts of
    the code.
  • Use profiling when you want to focus attention
    and optimization efforts on those loops that are
    responsible for the bulk of the computational
    load.
  • Most codes follow the 90-10 Rule.
  • That is, 90% of the computation is done in 10% of
    the code.

18
Profiling Tools
  • Profiling Tools on the Origin
  • On the SGI Origin2000 computer there are
    profiling tools named ssrun and prof.
  • Used together they do profiling, or what is
    called hot spot analysis.
  • They are useful for generating timing profiles.
  • ssrun
  • The ssrun utility collects performance data for
    an executable that you specify.
  • The performance data is written to a file named
    "executablename.exptype.id".
  • prof
  • The prof utility analyzes the data file created
    by ssrun and produces a report.
  • Example
  • ssrun -fpcsamp a.out
  • prof -h a.out.fpcsamp.m12345 > prof.list

19
Profiling Tools
  • Profiling Tools on the Linux Clusters
  • On the Linux clusters the profiling tools are
    still maturing. There are currently several
    efforts to produce tools comparable to the ssrun
    and perfex tools.
  • gprof
  • Basic profiling information can be generated
    using the OS utility gprof.
  • First, compile the code with the compiler flags
    -p -g for the Intel compiler (-g on the Intel
    compiler does not change the optimization level)
    or -pg for the GNU compiler.
  • Second, run the program.
  • Finally, analyze the resulting gmon.out file using
    the gprof utility: gprof executable gmon.out.
  • efc -O -p -g -o foo foo.f
  • ./foo
  • gprof foo gmon.out
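
To try the three steps above on something small, a toy workload with one cheap routine and one routine that dominates the run time is enough. The sketch below is in C rather than FORTRAN and is purely illustrative: the names initialize, sum_data, run_workload, and the array size N are all hypothetical.

```c
#include <stddef.h>

#define N 1000

static double data[N];

/* Cold part: called once, does very little work. */
void initialize(void)
{
    size_t i;

    for (i = 0; i < N; i++)
        data[i] = (double)i;
}

/* Hot part: called many times, so a profile should attribute most of
   the run time to this function. */
double sum_data(void)
{
    double s = 0.0;
    size_t i;

    for (i = 0; i < N; i++)
        s += data[i];
    return s;
}

/* Driver: one cheap setup call followed by many calls to the kernel. */
double run_workload(long repetitions)
{
    double total = 0.0;
    long r;

    initialize();
    for (r = 0; r < repetitions; r++)
        total += sum_data();
    return total;
}
```

Calling run_workload from a main function with a large repetition count, compiling with the GNU compiler's -pg flag, running the program, and then running gprof on the executable and gmon.out follows the same compile, run, analyze sequence. With optimization enabled the compiler may inline the small functions, which changes what the profile shows.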

20
The Performance API (PAPI)
  • Provides an interface to hardware performance
    counters integrated in CPU
  • Provides more in-depth details about resource
    utilization
  • E.g. cache misses, instructions per second
  • Used by perfex, mpitrace, perfsuite, and other
    profiling tools
  • Requires kernel patch to deploy on Linux

21
Profiling Tools
  • Profiling Tools on the Linux Clusters
  • vprof
  • On the IA32 platform there is a utility called
    vprof that provides performance information using
    the PAPI instrumentation library.
  • To instrument the whole application requires
    recompiling and linking to vprof and PAPI
    libraries.
  • setenv VMON PAPI_TOT_CYC
  • ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
  • ./md
  • /usr/apps/tools/vprof/bin/cprof -e md vmon.out

22
Profile Listings
  • Profile Listings on the Origin
  • Prof Output First Listing
  • The first listing gives the number of cycles
    executed in each procedure (or subroutine). The
    procedures are listed in descending order of
    cycle count.

    Cycles      %   Cum %   Secs  Proc
  --------  -----   -----   ----  ------
  42630984  58.47   58.47   0.57  VSUB
   6498294   8.91   67.38   0.09  PFSOR
   6141611   8.42   75.81   0.08  PBSOR
   3654120   5.01   80.82   0.05  PFSOR1
   2615860   3.59   84.41   0.03  VADD
   1580424   2.17   86.57   0.02  ITSRCG
   1144036   1.57   88.14   0.02  ITSRSI
    886044   1.22   89.36   0.01  ITJSI
    861136   1.18   90.54   0.01  ITJCG

23
Profile Listings
  • Profile Listings on the Origin
  • Prof Output Second Listing
  • The second listing gives the number of cycles per
    source code line.
  • The lines are listed in descending order of cycle
    count.

    Cycles      %   Cum %   Line  Proc
  --------  -----   -----   ----  ------
  36556944  50.14   50.14   8106  VSUB
   5313198   7.29   57.43   6974  PFSOR
   4968804   6.81   64.24   6671  PBSOR
   2989882   4.10   68.34   8107  VSUB
   2564544   3.52   71.86   7097  PFSOR1
   1988420   2.73   74.59   8103  VSUB
   1629776   2.24   76.82   8045  VADD
    994210   1.36   78.19   8108  VSUB
    969056   1.33   79.52   8049  VADD
    483018   0.66   80.18   6972  PFSOR

24
Profile Listings
  • Profile Listings on the Linux Clusters
  • gprof Output First Listing
  • The listing gives a 'flat' profile of functions
    and routines encountered, sorted by 'self
    seconds' which is the number of seconds accounted
    for by this function alone.

Flat profile:

Each sample counts as 0.000976562 seconds.

    %   cumulative     self                 self      total
  time     seconds  seconds     calls    us/call    us/call  name
 -----  ----------  -------  --------  ---------  ---------  -----------
 38.07        5.67     5.67       101   56157.18  107450.88  compute_
 34.72       10.84     5.17  25199500       0.21       0.21  dist_
 25.48       14.64     3.80                                  SIND_SINCOS
  1.25       14.83     0.19                                  sin
  0.37       14.88     0.06                                  cos
  0.05       14.89     0.01     50500       0.15       0.15  dotr8_
  0.05       14.90     0.01       100      68.36      68.36  update_
  0.01       14.90     0.00                                  f_fioinit
  0.01       14.90     0.00                                  f_intorange
  0.01       14.90     0.00                                  mov
  0.00       14.90     0.00         1       0.00       0.00  initialize_

25
Profile Listings
  • Profile Listings on the Linux Clusters
  • gprof Output Second Listing
  • The second listing gives a 'call-graph' profile
    of functions and routines encountered. The
    definitions of the columns are specific to the
    line in question. Detailed information is
    contained in the full output from gprof.

Call graph

index  % time   self  children  called             name
-----  ------   ----  --------  -----------------  ----------------
[1]      72.9   0.00     10.86                     main [1]
                5.67      5.18  101/101                compute_ [2]
                0.01      0.00  100/100                update_ [8]
                0.00      0.00  1/1                    initialize_ [12]
---------------------------------------------------------------------
                5.67      5.18  101/101                main [1]
[2]      72.8   5.67      5.18  101                compute_ [2]
                5.17      0.00  25199500/25199500      dist_ [3]
                0.01      0.00  50500/50500            dotr8_ [7]
---------------------------------------------------------------------
                5.17      0.00  25199500/25199500      compute_ [2]
[3]      34.7   5.17      0.00  25199500           dist_ [3]
---------------------------------------------------------------------
                                                       <spontaneous>
[4]      25.5   3.80      0.00                     SIND_SINCOS [4]

26
Profile Listings
  • Profile Listings on the Linux Clusters
  • vprof Listing
  • This listing (produced using the -e option to
    cprof) displays not only cycles consumed by
    functions (a flat profile) but also the lines in
    the code that contribute to those functions.

Columns correspond to the following events:
    PAPI_TOT_CYC - Total cycles (1956 events)

File Summary:
    100.0%  /u/ncsa/gbauer/temp/md.f

Function Summary:
     84.4%  compute
     15.6%  dist

Line Summary:
     67.3%  /u/ncsa/gbauer/temp/md.f:106
     13.6%  /u/ncsa/gbauer/temp/md.f:104
      9.3%  /u/ncsa/gbauer/temp/md.f:166
      2.5%  /u/ncsa/gbauer/temp/md.f:165
      1.5%  /u/ncsa/gbauer/temp/md.f:102
      1.2%  /u/ncsa/gbauer/temp/md.f:164
      0.9%  /u/ncsa/gbauer/temp/md.f:107
      0.8%  /u/ncsa/gbauer/temp/md.f:169
      0.8%  /u/ncsa/gbauer/temp/md.f:162
      0.8%  /u/ncsa/gbauer/temp/md.f:105

27
Profile Listings
  • Profile Listings on the Linux Clusters
  • vprof Listing (cont.)

      0.7%  /u/ncsa/gbauer/temp/md.f:149
      0.5%  /u/ncsa/gbauer/temp/md.f:163
      0.2%  /u/ncsa/gbauer/temp/md.f:109
      0.1%  /u/ncsa/gbauer/temp/md.f:100

 100   0.1%   do j=1,np
 101            if (i .ne. j) then
 102   1.5%       call dist(nd,box,pos(1,i),pos(1,j),rij,d)
 103              ! attribute half of the potential energy to particle 'j'
 104  13.6%       pot = pot + 0.5*v(d)
 105   0.8%       do k=1,nd
 106  67.3%         f(k,i) = f(k,i) - rij(k)*dv(d)/d
 107   0.9%       enddo
 108            endif
 109   0.2%   enddo

28
Profiling Analysis
  • The program being analyzed in the previous Origin
    example has approximately 10000 source code
    lines, and consists of many subroutines.
  • The first profile listing shows that over 50% of
    the computation is done inside the VSUB
    subroutine.
  • The second profile listing shows that line 8106
    in subroutine VSUB accounted for 50% of the total
    computation.
  • Going back to the source code, line 8106 is a
    line inside a do loop.
  • By putting an OpenMP compiler directive in front
    of that do loop, you can get 50% of the program to
    run in parallel with almost no work on your part.
  • Since the compiler has rearranged the source
    lines, the line numbers given by ssrun/prof only
    give you an area of the code to inspect.
  • To view the rearranged source, use the options
  • f90 -FLIST:=ON
  • cc -CLIST:=ON
  • For the Intel compilers, the appropriate options
    are
  • ifort -E
  • icc -E
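
Placing an OpenMP directive in front of a hot loop can be sketched as follows. The example is in C, and the loop body is an assumption: the slides do not show what VSUB computes, so an element-wise vector subtraction stands in for it here, and the function name vsub and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for a hot loop: c[i] = a[i] - b[i].
   Each iteration is independent of the others, so the iterations can be
   shared among threads. The directive takes effect only when the code is
   compiled with OpenMP enabled (for example, gcc -fopenmp); otherwise the
   loop runs serially with the same result. */
void vsub(int n, const double *a, const double *b, double *c)
{
    int i;

#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < n; i++)
        c[i] = a[i] - b[i];
}
```

The directive is the only change to the serial loop, which is why this kind of tuning takes little effort once profiling has identified the loop. It is only safe when no iteration reads a value written by another iteration.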

29
MPE and Jumpshot
  • MPE is a tracing library that comes with MPI
  • Jumpshot is a graphical application for analyzing
    the MPE output
  • MPE requires inserting code at specific locations
    to be analyzed
  • Display options are specified in the code (e.g.
    show MPI_Bcast events as dotted blue lines)

30
Jumpshot
31
Perfsuite
  • Collection of tools, utilities, and libraries for
    software performance analysis
  • Intel architectures only
  • Provides many in-depth statistics
  • Operations per cycle, Cache miss/hit data, etc.
  • Not difficult to use (but may be difficult to
    compile)
  • mpiexec -np NN psrun wrf.exe
  • psprocess wrf.exe.NN_n.xml
  • Requires PAPI kernel patch for showing most
    information

32
Perfsuite Graphical App
http://perfsuite.ncsa.uiuc.edu/examples/GenIDLEST/
33
CEPBA Tools
  • Developed at the European Center for Parallelism
    at Barcelona
  • Currently not free
  • Provide text-based and graphical applications
    for
  • Execution analysis and optimization
  • Execution prediction
  • 3 Main tools
  • Mpitrace, Dimemas, Paraver

34
CEPBA Tools
  • Powerful, but complex
  • Requires PAPI kernel patch for showing most
    information
  • May require application to be recompiled
  • Very large trace files for long executions and/or
    high number of processors (e.g. over 10GB)

35
CEPBA Tools
Source: Barcelona Supercomputing Center
http://www.bsc.es/plantillaA.php?cat_id=479
36
Visualizing with Paraver
  • Process
  • (Compile application with mpitrace libraries
    linked)
  • Execute application (and preload mpitrace
    libraries if not linked to the application)
  • Convert individual trace files to a Paraver file
  • Chop paraver trace file, if it is too big

37
Paraver Screenshots
38
Dimemas
  • Estimate impact of code changes without changing
    the code
  • Estimate execution time on slightly different
    architectures

39
Further Information
  • SGI Irix
  • man etime
  • man 3 time
  • man 1 time
  • man busage
  • man timers
  • man ssrun
  • man prof
  • Origin2000 Performance Tuning and Optimization
    Guide
  • Linux Clusters
  • man 3 clock
  • man 2 gettimeofday
  • man 1 time
  • man 1 gprof
  • man 1B qstat
  • Intel Compilers
  • Vprof on NCSA Linux Cluster