Title: Slides Prepared from the CI-Tutor Courses at NCSA
1Parallel Computing Explained: Timing and Profiling
- Slides Prepared from the CI-Tutor Courses at NCSA
- http://ci-tutor.ncsa.uiuc.edu/
- By
- S. Masoud Sadjadi
- School of Computing and Information Sciences
- Florida International University
- March 2009
- (Additional Slides by Javier Delgado)
2Agenda
- 1 Parallel Computing Overview
- 2 How to Parallelize a Code
- 3 Porting Issues
- 4 Scalar Tuning
- 5 Parallel Code Tuning
- 6 Timing and Profiling
- 6.1 Timing
- 6.1.1 Timing a Section of Code
- 6.1.1.1 CPU Time
- 6.1.1.2 Wall clock Time
- 6.1.2 Timing an Executable
- 6.1.3 Timing a Batch Job
- 6.2 Profiling
- 6.2.1 Profiling Tools
- 6.2.2 Profile Listings
- 6.2.3 Profiling Analysis
- 6.3 Further Information
3Timing and Profiling
- Now that your program has been ported to the new computer, you will want to know how fast it runs.
- This chapter describes how to measure the speed of a program using various timing routines.
- The chapter also covers how to determine which parts of the program account for the bulk of the computational load, so that you can concentrate your tuning efforts on those computationally intensive parts of the program.
- This chapter also gives a summary of some available profiling tools.
4Timing
- In the following sections, we'll discuss timers and review the profiling tools ssrun and prof on the Origin and vprof and gprof on the Linux Clusters. The specific timing functions described are:
- Timing a section of code
- FORTRAN
- etime, dtime, cpu_time for CPU time
- time and f_time for wallclock time
- C
- clock for CPU time
- gettimeofday for wallclock time
- Timing an executable
- time a.out
- Timing a batch run
- busage
- qstat
- qhist
5CPU Time
- etime
- A section of code can be timed using etime.
- It returns the elapsed CPU time in seconds since the program started.

    real*4 tarray(2), time1, time2, timeres
    ! ... beginning of program
    time1 = etime(tarray)
    ! ... start of section of code to be timed
    ! ... lots of computation
    ! ... end of section of code to be timed
    time2 = etime(tarray)
    timeres = time2 - time1
6CPU Time
- dtime
- A section of code can also be timed using dtime.
- It returns the elapsed CPU time in seconds since the last call to dtime.

    real*4 tarray(2), timeres
    ! ... beginning of program
    timeres = dtime(tarray)    ! first call resets the reference point
    ! ... start of section of code to be timed
    ! ... lots of computation
    ! ... end of section of code to be timed
    timeres = dtime(tarray)    ! CPU time used since the previous call
    ! ... rest of program
7CPU Time
- The etime and dtime Functions
- User time.
- This is returned as the first element of tarray.
- It's the CPU time spent executing user code.
- System time.
- This is returned as the second element of tarray.
- It's the time spent executing system calls on behalf of your program.
- Sum of user and system time.
- This is the function value that is returned.
- It's the time that is usually reported.
- Metric.
- Timings are reported in seconds.
- Timings are accurate to 1/100th of a second.
8CPU Time
- Timing Comparison Warnings
- For the SGI computers:
- The etime and dtime functions return the MAX time over all threads for a parallel program.
- This is the time of the longest thread, which is usually the master thread.
- For the Linux Clusters:
- The etime and dtime functions are contained in the VAX compatibility library of the Intel FORTRAN Compiler.
- To use this library include the compiler flag -Vaxlib.
- Another warning: Do not put calls to etime and dtime inside a do loop. The overhead is too large.
9CPU Time
- cpu_time
- The cpu_time routine is available only on the Linux clusters as it is a component of the Intel FORTRAN compiler library.
- It provides substantially higher resolution and has substantially lower overhead than the older etime and dtime routines.
- It can be used as an elapsed timer.

    real*8 time1, time2, timeres
    ! ... beginning of program
    call cpu_time(time1)
    ! ... start of section of code to be timed
    ! ... lots of computation
    ! ... end of section of code to be timed
    call cpu_time(time2)
    timeres = time2 - time1
    ! ... rest of program
10CPU Time
- clock
- For C programmers, one can call the cpu_time routine using a FORTRAN wrapper or call the C library function clock, which can be used to determine elapsed CPU time.

    #include <time.h>
    /* seconds per clock tick */
    static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
    double time1, time2, timeres;
    ...
    time1 = (clock()*iCPS);
    ...
    /* do some work */
    ...
    time2 = (clock()*iCPS);
    timeres = time2 - time1;
11Wall clock Time
- time
- For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970.
- It is a means of getting the elapsed wall clock time.
- The wall clock time is reported in integer seconds.

    external time
    ! time is declared integer so that implicit typing does not treat it as real
    integer*4 time, time1, time2, timeres
    ! ... beginning of program
    time1 = time( )
    ! ... start of section of code to be timed
    ! ... lots of computation
    ! ... end of section of code to be timed
    time2 = time( )
    timeres = time2 - time1
12Wall clock Time
- f_time
- For the Linux clusters, the appropriate FORTRAN function for elapsed time is f_time.

    integer*8 f_time
    external f_time
    integer*8 time1, time2, timeres
    ! ... beginning of program
    time1 = f_time()
    ! ... start of section of code to be timed
    ! ... lots of computation
    ! ... end of section of code to be timed
    time2 = f_time()
    timeres = time2 - time1

- As above for etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib.
13Wall clock Time
- gettimeofday
- For C programmers, wallclock time can be obtained by using the very portable routine gettimeofday.

    #include <stddef.h>    /* definition of NULL */
    #include <sys/time.h>  /* definition of timeval struct and
                              prototyping of gettimeofday */
    double t1, t2, elapsed;
    struct timeval tp;
    int rtn;
    ....
    ....
    rtn = gettimeofday(&tp, NULL);
    t1 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
    ....
    /* do some work */
    ....
    rtn = gettimeofday(&tp, NULL);
    t2 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
    elapsed = t2 - t1;
14Timing an Executable
- To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time):

    time [options] a.out

- where options can be -p for a simple output or -f format, which allows the user to display more than just time-related information.
- Consult the man pages on the time command for format options.
15Timing a Batch Job
- Time of a batch job running or completed.
- Origin
- busage jobid
- Linux clusters
- qstat jobid for a running job
- qhist jobid for a completed job
17Profiling
- Profiling determines where a program spends its time.
- It detects the computationally intensive parts of the code.
- Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load.
- Most codes follow the 90-10 Rule.
- That is, 90% of the computation is done in 10% of the code.
18Profiling Tools
- Profiling Tools on the Origin
- On the SGI Origin2000 computer there are profiling tools named ssrun and prof.
- Used together they do profiling, or what is called hot spot analysis.
- They are useful for generating timing profiles.
- ssrun
- The ssrun utility collects performance data for an executable that you specify.
- The performance data is written to a file named "executablename.exptype.id".
- prof
- The prof utility analyzes the data file created by ssrun and produces a report.
- Example

    ssrun -fpcsamp a.out
    prof -h a.out.fpcsamp.m12345 > prof.list
19Profiling Tools
- Profiling Tools on the Linux Clusters
- On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun and perfex tools.
- gprof
- Basic profiling information can be generated using the OS utility gprof.
- First, compile the code with the compiler flags -p -g for the Intel compiler (-g on the Intel compiler does not change the optimization level) or -pg for the GNU compiler.
- Second, run the program.
- Finally, analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out.

    efc -O -p -g -o foo foo.f
    ./foo
    gprof foo gmon.out
20The Performance API (PAPI)
- Provides an interface to hardware performance counters integrated in the CPU
- Provides more in-depth details about resource utilization
- E.g. cache misses, instructions per second
- Used by perfex, mpitrace, perfsuite, and other profiling tools
- Requires kernel patch to deploy on Linux
21Profiling Tools
- Profiling Tools on the Linux Clusters
- vprof
- On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library.
- Instrumenting the whole application requires recompiling and linking to the vprof and PAPI libraries.

    setenv VMON PAPI_TOT_CYC
    ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
    ./md
    /usr/apps/tools/vprof/bin/cprof -e md vmon.out
22Profile Listings
- Profile Listings on the Origin
- Prof Output First Listing
- The first listing gives the number of cycles
executed in each procedure (or subroutine). The
procedures are listed in descending order of
cycle count.

    Cycles        %   Cum %  Secs  Proc
    --------  -----  ------  ----  ------
    42630984  58.47   58.47  0.57  VSUB
     6498294   8.91   67.38  0.09  PFSOR
     6141611   8.42   75.81  0.08  PBSOR
     3654120   5.01   80.82  0.05  PFSOR1
     2615860   3.59   84.41  0.03  VADD
     1580424   2.17   86.57  0.02  ITSRCG
     1144036   1.57   88.14  0.02  ITSRSI
      886044   1.22   89.36  0.01  ITJSI
      861136   1.18   90.54  0.01  ITJCG
23Profile Listings
- Profile Listings on the Origin
- Prof Output Second Listing
- The second listing gives the number of cycles per source code line.
- The lines are listed in descending order of cycle count.

    Cycles        %   Cum %  Line  Proc
    --------  -----  ------  ----  ------
    36556944  50.14   50.14  8106  VSUB
     5313198   7.29   57.43  6974  PFSOR
     4968804   6.81   64.24  6671  PBSOR
     2989882   4.10   68.34  8107  VSUB
     2564544   3.52   71.86  7097  PFSOR1
     1988420   2.73   74.59  8103  VSUB
     1629776   2.24   76.82  8045  VADD
      994210   1.36   78.19  8108  VSUB
      969056   1.33   79.52  8049  VADD
      483018   0.66   80.18  6972  PFSOR
24Profile Listings
- Profile Listings on the Linux Clusters
- gprof Output First Listing
- The listing gives a 'flat' profile of functions
and routines encountered, sorted by 'self
seconds' which is the number of seconds accounted
for by this function alone.

    Flat profile:

    Each sample counts as 0.000976562 seconds.
      %    cumulative    self                 self      total
     time    seconds   seconds     calls   us/call    us/call  name
    -----  ----------  -------  --------  --------  ---------  -----------
    38.07        5.67     5.67       101  56157.18  107450.88  compute_
    34.72       10.84     5.17  25199500      0.21       0.21  dist_
    25.48       14.64     3.80                                 SIND_SINCOS
     1.25       14.83     0.19                                 sin
     0.37       14.88     0.06                                 cos
     0.05       14.89     0.01     50500      0.15       0.15  dotr8_
     0.05       14.90     0.01       100     68.36      68.36  update_
     0.01       14.90     0.00                                 f_fioinit
     0.01       14.90     0.00                                 f_intorange
     0.01       14.90     0.00                                 mov
     0.00       14.90     0.00         1      0.00       0.00  initialize_
25Profile Listings
- Profile Listings on the Linux Clusters
- gprof Output Second Listing
- The second listing gives a 'call-graph' profile
of functions and routines encountered. The
definitions of the columns are specific to the
line in question. Detailed information is
contained in the full output from gprof.

    Call graph:

    index  % time   self  children        called         name
    -----  ------  -----  --------  -------------------  ----------------
    [1]      72.9   0.00     10.86                       main [1]
                    5.67      5.18        101/101            compute_ [2]
                    0.01      0.00        100/100            update_ [8]
                    0.00      0.00          1/1              initialize_ [12]
    ----------------------------------------------------------------------
                    5.67      5.18        101/101            main [1]
    [2]      72.8   5.67      5.18        101            compute_ [2]
                    5.17      0.00   25199500/25199500       dist_ [3]
                    0.01      0.00      50500/50500          dotr8_ [7]
    ----------------------------------------------------------------------
                    5.17      0.00   25199500/25199500       compute_ [2]
    [3]      34.7   5.17      0.00   25199500            dist_ [3]
    ----------------------------------------------------------------------
                                                             <spontaneous>
    [4]      25.5   3.80      0.00                       SIND_SINCOS [4]
26Profile Listings
- Profile Listings on the Linux Clusters
- vprof Listing
- The listing below (produced using the -e option to cprof) displays not only cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions.

    Columns correspond to the following events:
        PAPI_TOT_CYC - Total cycles (1956 events)

    File Summary:
        100.0%  /u/ncsa/gbauer/temp/md.f

    Function Summary:
         84.4%  compute
         15.6%  dist

    Line Summary:
         67.3%  /u/ncsa/gbauer/temp/md.f:106
         13.6%  /u/ncsa/gbauer/temp/md.f:104
          9.3%  /u/ncsa/gbauer/temp/md.f:166
          2.5%  /u/ncsa/gbauer/temp/md.f:165
          1.5%  /u/ncsa/gbauer/temp/md.f:102
          1.2%  /u/ncsa/gbauer/temp/md.f:164
          0.9%  /u/ncsa/gbauer/temp/md.f:107
          0.8%  /u/ncsa/gbauer/temp/md.f:169
          0.8%  /u/ncsa/gbauer/temp/md.f:162
          0.8%  /u/ncsa/gbauer/temp/md.f:105
27Profile Listings
- Profile Listings on the Linux Clusters
- vprof Listing (cont.)

          0.7%  /u/ncsa/gbauer/temp/md.f:149
          0.5%  /u/ncsa/gbauer/temp/md.f:163
          0.2%  /u/ncsa/gbauer/temp/md.f:109
          0.1%  /u/ncsa/gbauer/temp/md.f:100

    100   0.1%        do j=1,np
    101                 if (i .ne. j) then
    102   1.5%            call dist(nd,box,pos(1,i),pos(1,j),rij,d)
    103                   ! attribute half of the potential energy to particle 'j'
    104  13.6%            pot = pot + 0.5*v(d)
    105   0.8%            do k=1,nd
    106  67.3%              f(k,i) = f(k,i) - rij(k)*dv(d)/d
    107   0.9%            enddo
    108                 endif
    109   0.2%        enddo
28Profiling Analysis
- The program being analyzed in the previous Origin example has approximately 10,000 source code lines and consists of many subroutines.
- The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine.
- The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation.
- Going back to the source code, line 8106 is a line inside a do loop.
- By putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part.
- Since the compiler has rearranged the source lines, the line numbers given by ssrun/prof give you an area of the code to inspect.
- To view the rearranged source use the option
- f90 -FLIST:=ON
- cc -CLIST:=ON
- For the Intel compilers, the appropriate options are
- ifort -E
- icc -E
29MPE and Jumpshot
- MPE is a tracing library that comes with MPI
- Jumpshot is a graphical application for analyzing the MPE output
- MPE requires inserting code at specific locations to be analyzed
- Display options are specified in the code (e.g. show MPI_Broadcast events in dotted blue lines)
30Jumpshot
31Perfsuite
- Collection of tools, utilities, and libraries for software performance analysis
- Intel architectures only
- Provides many in-depth statistics
- Operations per cycle, cache miss/hit data, etc.
- Not difficult to use (but may be difficult to compile)

    mpiexec -np NN psrun wrf.exe
    psprocess wrf.exe.NN_n.xml

- Requires PAPI kernel patch for showing most information
32Perfsuite Graphical App
http://perfsuite.ncsa.uiuc.edu/examples/GenIDLEST/
33CEPBA Tools
- Developed at the European Center for Parallelism at Barcelona
- Currently not free
- Provide text-based and graphical applications for
- Execution analysis and optimization
- Execution prediction
- 3 Main tools
- Mpitrace, Dimemas, Paraver
34CEPBA Tools
- Powerful, but complex
- Requires PAPI kernel patch for showing most information
- May require application to be recompiled
- Very large trace files for long executions and/or high number of processors (e.g. over 10GB)
35CEPBA Tools
Source: Barcelona SuperComputing Center
http://www.bsc.es/plantillaA.php?cat_id=479
36Visualizing with Paraver
- Process
- (Compile application with mpitrace libraries linked)
- Execute application (and preload mpitrace libraries if not linked to the application)
- Convert individual trace files to a Paraver file
- Chop Paraver trace file, if it is too big
37Paraver Screenshots
38Dimemas
- Estimate impact of code changes without changing the code
- Estimate execution time on slightly different architectures
39Further Information
- SGI Irix
- man etime
- man 3 time
- man 1 time
- man busage
- man timers
- man ssrun
- man prof
- Origin2000 Performance Tuning and Optimization Guide
- Linux Clusters
- man 3 clock
- man 2 gettimeofday
- man 1 time
- man 1 gprof
- man 1B qstat
- Intel Compilers Vprof on NCSA Linux Cluster