Slides Prepared from the CI-Tutor Courses at NCSA


Provided by: sadj5

Learn more at: http://users.cs.fiu.edu


Transcript and Presenter's Notes

1
Parallel Computing Explained: Timing and Profiling

Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By
S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009
(Additional Slides by Javier Delgado)

2
Agenda

1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
6.1 Timing
6.1.1 Timing a Section of Code
6.1.1.1 CPU Time
6.1.1.2 Wall clock Time
6.1.2 Timing an Executable
6.1.3 Timing a Batch Job
6.2 Profiling
6.2.1 Profiling Tools
6.2.2 Profile Listings
6.2.3 Profiling Analysis
6.3 Further Information

3
Timing and Profiling

Now that your program has been ported to the new
computer, you will want to know how fast it runs.
This chapter describes how to measure the speed
of a program using various timing routines.
The chapter also covers how to determine which
parts of the program account for the bulk of the
computational load so that you can concentrate
your tuning efforts on those computationally
intensive parts of the program.
This chapter also gives a summary of some
available profiling tools.

4
Timing

In the following sections, we'll discuss timers
and review the profiling tools ssrun and prof on
the Origin and vprof and gprof on the Linux
Clusters. The specific timing functions described
are:
Timing a section of code
FORTRAN
etime, dtime, cpu_time for CPU time
time and f_time for wall clock time
C
clock for CPU time
gettimeofday for wall clock time
Timing an executable
time a.out
Timing a batch run
busage
qstat
qhist

5
CPU Time

etime
A section of code can be timed using etime.
It returns the elapsed CPU time in seconds since
the program started.
real*4 tarray(2), time1, time2, timeres
! beginning of program
time1 = etime(tarray)
! start of section of code to be timed
! lots of computation
! end of section of code to be timed
time2 = etime(tarray)
timeres = time2 - time1

6
CPU Time

dtime
A section of code can also be timed using dtime.
It returns the elapsed CPU time in seconds since
the last call to dtime.
real*4 tarray(2), timeres
! beginning of program
timeres = dtime(tarray)
! start of section of code to be timed
! lots of computation
! end of section of code to be timed
timeres = dtime(tarray)
! rest of program
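
For readers working in C, the dtime calling pattern (each call returns the CPU time used since the previous call) can be imitated with the standard clock function. This is only a sketch: the names delta_cpu_seconds and burn_cpu are hypothetical, not library routines, and the static state makes the helper unsuitable for multithreaded use.

```c
#include <time.h>

/* Hypothetical helper (not a library routine): returns the CPU seconds
 * consumed since the previous call, imitating dtime's calling pattern.
 * The first call returns the CPU seconds used so far by the program. */
double delta_cpu_seconds(void)
{
    static clock_t last = 0;          /* CPU ticks at the previous call */
    clock_t now = clock();
    double delta = (double)(now - last) / (double)CLOCKS_PER_SEC;
    last = now;
    return delta;
}

/* Hypothetical busy work so that there is something to time. */
double burn_cpu(long n)
{
    volatile double sum = 0.0;
    for (long i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}
```

A typical use is to call delta_cpu_seconds() once just before the section of interest, discard that value, run the section, and then call it again to read the section's CPU time.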

7
CPU Time

The etime and dtime Functions
User time.
This is returned as the first element of tarray.
It's the CPU time spent executing user code.
System time.
This is returned as the second element of tarray.
It's the time spent executing system calls on
behalf of your program.
Sum of user and system time.
This is the function value that is returned.
It's the time that is usually reported.
Metric.
Timings are reported in seconds.
Timings are accurate to 1/100th of a second.
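
The same user/system split is available to C programs through the POSIX routine getrusage, which the slides do not cover. The sketch below is an analogy rather than a description of how etime is implemented; the struct cpu_times and the function cpu_times_now are hypothetical names chosen to mirror tarray(1), tarray(2) and the function value.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Mirrors the layout of etime's tarray: user time, then system time. */
struct cpu_times {
    double user;    /* CPU seconds spent executing user code */
    double sys;     /* CPU seconds spent in system calls for this process */
};

/* Convert a struct timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* Hypothetical helper: fills *t with the CPU time used so far and returns
 * user + system time, or -1.0 if getrusage fails. */
double cpu_times_now(struct cpu_times *t)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    t->user = tv_seconds(ru.ru_utime);
    t->sys  = tv_seconds(ru.ru_stime);
    return t->user + t->sys;
}
```

Calling cpu_times_now before and after a section and subtracting the results gives the section's user, system, and total CPU time, in the same way as the two etime calls.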

8
CPU Time

Timing Comparison Warnings
For the SGI computers
The etime and dtime functions return the MAX time
over all threads for a parallel program.
This is the time of the longest thread, which is
usually the master thread.
For the Linux Clusters
The etime and dtime functions are contained in
the VAX compatibility library of the Intel
FORTRAN Compiler.
To use this library include the compiler flag
-Vaxlib.
Another warning: Do not put calls to etime and
dtime inside a do loop. The overhead is too
large.
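
The size of the timer overhead can be estimated rather than assumed. The C sketch below uses gettimeofday as a stand-in for whichever timer is being considered (the function timer_call_overhead is a hypothetical name). It times a batch of back-to-back timer calls from outside the loop and divides by the number of calls.

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall clock time in seconds (microsecond granularity). */
static double wall_seconds(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + 1.0e-6 * (double)tp.tv_usec;
}

/* Hypothetical helper: average cost in seconds of one gettimeofday call,
 * estimated by timing `calls` back-to-back calls from the outside. */
double timer_call_overhead(long calls)
{
    struct timeval scratch;
    double start = wall_seconds();
    for (long i = 0; i < calls; i++)
        gettimeofday(&scratch, NULL);
    double stop = wall_seconds();
    return (stop - start) / (double)calls;
}
```

If the result is comparable to the run time of one loop iteration, timing inside the loop will distort the measurement. The two timer calls should then bracket the whole loop, as they do in this helper.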

9
CPU Time

cpu_time
The cpu_time routine is available only on the
Linux clusters as it is a component of the Intel
FORTRAN compiler library.
It provides substantially higher resolution and
has substantially lower overhead than the older
etime and dtime routines.
It can be used as an elapsed timer.
real*8 time1, time2, timeres
! beginning of program
call cpu_time(time1)
! start of section of code to be timed
! lots of computation
! end of section of code to be timed
call cpu_time(time2)
timeres = time2 - time1
! rest of program

10
CPU Time

clock
C programmers can call the cpu_time routine
through a FORTRAN wrapper, or call the standard
library function clock to determine elapsed CPU
time.
#include <time.h>
static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
double time1, time2, timeres;
time1 = (clock()*iCPS);
/* do some work */
time2 = (clock()*iCPS);
timeres = time2 - time1;

11
Wall clock Time

time
For the Origin, the function time returns the
time since 00:00:00 GMT, Jan. 1, 1970.
It is a means of getting the elapsed wall clock
time.
The wall clock time is reported in integer
seconds.
external time
integer*4 time, time1, time2, timeres
! beginning of program
time1 = time( )
! start of section of code to be timed
! lots of computation
! end of section of code to be timed
time2 = time( )
timeres = time2 - time1

12
Wall clock Time

f_time
For the Linux clusters, the appropriate FORTRAN
function for elapsed time is f_time.
integer*8 f_time
external f_time
integer*8 time1, time2, timeres
! beginning of program
time1 = f_time()
! start of section of code to be timed
! lots of computation
! end of section of code to be timed
time2 = f_time()
timeres = time2 - time1
As above for etime and dtime, the f_time function
is in the VAX compatibility library of the Intel
FORTRAN Compiler. To use this library include the
compiler flag -Vaxlib.

13
Wall clock Time

gettimeofday
For C programmers, wallclock time can be obtained
by using the very portable routine gettimeofday.
#include <stddef.h> /* definition of NULL */
#include <sys/time.h> /* definition of timeval struct and prototyping of gettimeofday */
double t1, t2, elapsed;
struct timeval tp;
int rtn;
....
....
rtn = gettimeofday(&tp, NULL);
t1 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
....
/* do some work */
....
rtn = gettimeofday(&tp, NULL);
t2 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
elapsed = t2 - t1;
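
The difference between CPU time and wall clock time is easiest to see on a section of code that waits instead of computing. The sketch below combines clock and gettimeofday from the preceding slides and times a sleep. The function names are hypothetical, and nanosleep is a POSIX routine not mentioned in the slides.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* CPU seconds used by this process so far, via clock. */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Wall clock seconds since the epoch, via gettimeofday. */
double wall_seconds(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + 1.0e-6 * (double)tp.tv_usec;
}

/* Sleep for `ms` milliseconds and report how much CPU time and wall clock
 * time passed. Sleeping uses almost no CPU, so *cpu stays near zero while
 * *wall is close to ms/1000. */
void time_a_sleep(long ms, double *cpu, double *wall)
{
    struct timespec req;
    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;

    double c0 = cpu_seconds();
    double w0 = wall_seconds();
    nanosleep(&req, NULL);
    *cpu = cpu_seconds() - c0;
    *wall = wall_seconds() - w0;
}
```

A program that spends its time waiting on I/O, on other processes, or on a busy machine shows the same pattern: a small CPU time and a large wall clock time.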

14
Timing an Executable

To time an executable (if using a csh or tcsh
shell, explicitly call /usr/bin/time):
time [options] a.out
where options can be -p for a simple output or
-f format, which allows the user to display more
than just time-related information.
Consult the man pages on the time command for
format options.
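
What the time command reports can be reproduced in a few lines of C with POSIX calls. This is a sketch of the idea, not the actual source of /usr/bin/time. The names run_times and time_command are hypothetical. It starts the command as a child process, waits for it, and reads the wall clock with gettimeofday and the child's CPU usage with getrusage.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct run_times {
    double real;    /* elapsed wall clock seconds */
    double user;    /* child CPU seconds in user code */
    double sys;     /* child CPU seconds in system calls */
};

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* Hypothetical helper: run argv[0] with the NULL-terminated argument list
 * argv, wait for it, and fill *t. Returns the child's exit status (127 if
 * the exec failed), or -1 if fork or waitpid failed or the child was
 * terminated by a signal. */
int time_command(char *const argv[], struct run_times *t)
{
    struct timeval w0, w1;
    struct rusage r0, r1;
    int status;

    getrusage(RUSAGE_CHILDREN, &r0);   /* totals for earlier children */
    gettimeofday(&w0, NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                    /* only reached if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    gettimeofday(&w1, NULL);
    getrusage(RUSAGE_CHILDREN, &r1);   /* now includes the finished child */

    t->real = tv_seconds(w1) - tv_seconds(w0);
    t->user = tv_seconds(r1.ru_utime) - tv_seconds(r0.ru_utime);
    t->sys  = tv_seconds(r1.ru_stime) - tv_seconds(r0.ru_stime);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The three fields correspond to the real, user, and sys lines printed by time -p.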

15
Timing a Batch Job

Time of a batch job running or completed.
Origin
busage jobid
Linux clusters
qstat jobid for a running job
qhist jobid for a completed job

17
Profiling

Profiling determines where a program spends its
time.
It detects the computationally intensive parts of
the code.
Use profiling when you want to focus attention
and optimization efforts on those loops that are
responsible for the bulk of the computational
load.
Most codes follow the 90-10 Rule.
That is, 90% of the computation is done in 10% of
the code.
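
Before reaching for a profiling tool, the same question (what share of the time goes where) can be answered coarsely by hand with the clock timer. The sketch below accumulates CPU time per named region and reports each region's percentage. The region labels and function names are hypothetical, and this manual approach is only a stand-in for what a profiler does automatically and at much finer granularity.

```c
#include <time.h>

/* Hypothetical region labels for a program with two phases. */
enum { REGION_SETUP, REGION_SOLVE, REGION_COUNT };

/* CPU seconds accumulated per region. */
static double region_seconds[REGION_COUNT];

/* Add the CPU time between two clock() readings to a region's total. */
void region_add(int region, clock_t start, clock_t stop)
{
    region_seconds[region] += (double)(stop - start) / (double)CLOCKS_PER_SEC;
}

/* Percentage of all recorded CPU time spent in one region (0 if none). */
double region_percent(int region)
{
    double total = 0.0;
    for (int i = 0; i < REGION_COUNT; i++)
        total += region_seconds[i];
    return total > 0.0 ? 100.0 * region_seconds[region] / total : 0.0;
}
```

Each phase is bracketed with two clock() readings that are passed to region_add. A region that reports about 90% is where the tuning effort belongs.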

18
Profiling Tools

Profiling Tools on the Origin
On the SGI Origin2000 computer there are
profiling tools named ssrun and prof.
Used together they do profiling, or what is
called hot spot analysis.
They are useful for generating timing profiles.
ssrun
The ssrun utility collects performance data for
an executable that you specify.
The performance data is written to a file named
"executablename.exptype.id".
prof
The prof utility analyzes the data file created
by ssrun and produces a report.
Example
ssrun -fpcsamp a.out
prof -h a.out.fpcsamp.m12345 > prof.list

19
Profiling Tools

Profiling Tools on the Linux Clusters
On the Linux clusters the profiling tools are
still maturing. There are currently several
efforts to produce tools comparable to the ssrun
and perfex tools.
gprof
Basic profiling information can be generated
using the OS utility gprof.
First, compile the code with the compiler flags
-p -g for the Intel compiler (-g on the Intel
compiler does not change the optimization level)
or -pg for the GNU compiler.
Second, run the program.
Finally, analyze the resulting gmon.out file using
the gprof utility: gprof executable gmon.out.
efc -O -p -g -o foo foo.f
./foo
gprof foo gmon.out

20
The Performance API (PAPI)

Provides an interface to hardware performance
counters integrated in CPU
Provides more in-depth details about resource
utilization
E.g. cache misses, instructions per second
Used by perfex, mpitrace, perfsuite, and other
profiling tools
Requires kernel patch to deploy on Linux

21
Profiling Tools

Profiling Tools on the Linux Clusters
vprof
On the IA32 platform there is a utility called
vprof that provides performance information using
the PAPI instrumentation library.
To instrument the whole application requires
recompiling and linking to vprof and PAPI
libraries.
setenv VMON PAPI_TOT_CYC
ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
./md
/usr/apps/tools/vprof/bin/cprof -e md vmon.out

22
Profile Listings

Profile Listings on the Origin
Prof Output First Listing
The first listing gives the number of cycles
executed in each procedure (or subroutine). The
procedures are listed in descending order of
cycle count.

  Cycles       %   Cum %   Secs   Proc
--------   -----   -----   ----   ----
42630984   58.47   58.47   0.57   VSUB
 6498294    8.91   67.38   0.09   PFSOR
 6141611    8.42   75.81   0.08   PBSOR
 3654120    5.01   80.82   0.05   PFSOR1
 2615860    3.59   84.41   0.03   VADD
 1580424    2.17   86.57   0.02   ITSRCG
 1144036    1.57   88.14   0.02   ITSRSI
  886044    1.22   89.36   0.01   ITJSI
  861136    1.18   90.54   0.01   ITJCG

23
Profile Listings

Profile Listings on the Origin
Prof Output Second Listing
The second listing gives the number of cycles per
source code line.
The lines are listed in descending order of cycle
count.

  Cycles       %   Cum %   Line   Proc
--------   -----   -----   ----   ----
36556944   50.14   50.14   8106   VSUB
 5313198    7.29   57.43   6974   PFSOR
 4968804    6.81   64.24   6671   PBSOR
 2989882    4.10   68.34   8107   VSUB
 2564544    3.52   71.86   7097   PFSOR1
 1988420    2.73   74.59   8103   VSUB
 1629776    2.24   76.82   8045   VADD
  994210    1.36   78.19   8108   VSUB
  969056    1.33   79.52   8049   VADD
  483018    0.66   80.18   6972   PFSOR

24
Profile Listings

Profile Listings on the Linux Clusters
gprof Output First Listing
The listing gives a 'flat' profile of functions
and routines encountered, sorted by 'self
seconds' which is the number of seconds accounted
for by this function alone.

Flat profile:

Each sample counts as 0.000976562 seconds.
  %   cumulative     self                self      total
 time    seconds  seconds      calls  us/call    us/call  name
-----  ---------  -------  ---------  --------  ---------  -----------
38.07       5.67     5.67        101  56157.18  107450.88  compute_
34.72      10.84     5.17   25199500      0.21       0.21  dist_
25.48      14.64     3.80                                  SIND_SINCOS
 1.25      14.83     0.19                                  sin
 0.37      14.88     0.06                                  cos
 0.05      14.89     0.01      50500      0.15       0.15  dotr8_
 0.05      14.90     0.01        100     68.36      68.36  update_
 0.01      14.90     0.00                                  f_fioinit
 0.01      14.90     0.00                                  f_intorange
 0.01      14.90     0.00                                  mov
 0.00      14.90     0.00          1      0.00       0.00  initialize_

25
Profile Listings

Profile Listings on the Linux Clusters
gprof Output Second Listing
The second listing gives a 'call-graph' profile
of functions and routines encountered. The
definitions of the columns are specific to the
line in question. Detailed information is
contained in the full output from gprof.

Call graph:

index  % time   self  children  called               name
-----  ------   ----  --------  -------------------  ----------------
[1]      72.9   0.00     10.86                       main [1]
                5.67      5.18  101/101                  compute_ [2]
                0.01      0.00  100/100                  update_ [8]
                0.00      0.00  1/1                      initialize_ [12]
----------------------------------------------------------------------
                5.67      5.18  101/101                  main [1]
[2]      72.8   5.67      5.18  101                  compute_ [2]
                5.17      0.00  25199500/25199500        dist_ [3]
                0.01      0.00  50500/50500              dotr8_ [7]
----------------------------------------------------------------------
                5.17      0.00  25199500/25199500        compute_ [2]
[3]      34.7   5.17      0.00  25199500             dist_ [3]
----------------------------------------------------------------------
                                                         <spontaneous>
[4]      25.5   3.80      0.00                       SIND_SINCOS [4]

26
Profile Listings

Profile Listings on the Linux Clusters
vprof Listing
The listing below, from vprof (using the -e option
to cprof), displays not only the cycles consumed by
each function (a flat profile) but also the source
lines that contribute to those functions.

Columns correspond to the following events:
  PAPI_TOT_CYC - Total cycles (1956 events)

File Summary:
 100.0%  /u/ncsa/gbauer/temp/md.f

Function Summary:
  84.4%  compute
  15.6%  dist

Line Summary:
  67.3%  /u/ncsa/gbauer/temp/md.f:106
  13.6%  /u/ncsa/gbauer/temp/md.f:104
   9.3%  /u/ncsa/gbauer/temp/md.f:166
   2.5%  /u/ncsa/gbauer/temp/md.f:165
   1.5%  /u/ncsa/gbauer/temp/md.f:102
   1.2%  /u/ncsa/gbauer/temp/md.f:164
   0.9%  /u/ncsa/gbauer/temp/md.f:107
   0.8%  /u/ncsa/gbauer/temp/md.f:169
   0.8%  /u/ncsa/gbauer/temp/md.f:162
   0.8%  /u/ncsa/gbauer/temp/md.f:105

27
Profile Listings

Profile Listings on the Linux Clusters
vprof Listing (cont.)

   0.7%  /u/ncsa/gbauer/temp/md.f:149
   0.5%  /u/ncsa/gbauer/temp/md.f:163
   0.2%  /u/ncsa/gbauer/temp/md.f:109
   0.1%  /u/ncsa/gbauer/temp/md.f:100

 100    0.1%    do j=1,np
 101              if (i .ne. j) then
 102    1.5%        call dist(nd,box,pos(1,i),pos(1,j),rij,d)
 103                ! attribute half of the potential energy to particle 'j'
 104   13.6%        pot = pot + 0.5*v(d)
 105    0.8%        do k=1,nd
 106   67.3%          f(k,i) = f(k,i) - rij(k)*dv(d)/d
 107    0.9%        enddo
 108              endif
 109    0.2%    enddo

28
Profiling Analysis

The program being analyzed in the previous Origin
example has approximately 10000 source code
lines, and consists of many subroutines.
The first profile listing shows that over 50% of
the computation is done inside the VSUB
subroutine.
The second profile listing shows that line 8106
in subroutine VSUB accounted for 50% of the total
computation.
Going back to the source code, line 8106 is a
line inside a do loop.
By putting an OpenMP compiler directive in front of
that do loop, you can get 50% of the program to
run in parallel with almost no work on your part.
Because the compiler has rearranged the source
lines, the line numbers given by ssrun/prof only
give you an area of the code to inspect.
To view the rearranged source, use the options
f90 -FLIST:=ON
cc -CLIST:=ON
For the Intel compilers, the appropriate options
are
ifort -E
icc -E
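
How much the 50% figure can buy is bounded by Amdahl's law, which the slides do not state but which follows directly from the serial/parallel split. The sketch below is a rough upper bound. The function name amdahl_speedup is hypothetical, and it assumes the parallelized loop scales perfectly, which real loops rarely do.

```c
/* Upper bound on overall speedup (Amdahl's law) when a fraction p of the
 * run time is parallelized across n processors and the rest stays serial.
 * Assumes the parallel part scales perfectly with n. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With p = 0.5, four processors give a bound of 1/(0.5 + 0.125) = 1.6, and no number of processors can reach a factor of 2. This is why the remaining entries in the profile listing still matter after the first loop is parallelized.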

29
MPE and Jumpshot

MPE is a tracing library distributed with the
MPICH implementation of MPI
Jumpshot is a graphical application for analyzing
the MPE output
MPE requires inserting code at the specific
locations to be analyzed
Display options are specified in the code (e.g.
show MPI_Bcast events in dotted blue lines)

30
Jumpshot
31
Perfsuite

Collection of tools, utilities, and libraries for
software performance analysis
Intel architectures only
Provides many in-depth statistics
Operations per cycle, Cache miss/hit data, etc.
Not difficult to use (but may be difficult to
compile)
mpiexec -np NN psrun wrf.exe
psprocess wrf.exe.NN_n.xml
Requires PAPI kernel patch for showing most
information

32
Perfsuite Graphical App
http://perfsuite.ncsa.uiuc.edu/examples/GenIDLEST/
33
CEPBA Tools

Developed at the European Center for Parallelism
of Barcelona (CEPBA)
Currently not free
Provide text-based and graphical applications
for
Execution analysis and optimization
Execution prediction
3 Main tools
Mpitrace, Dimemas, Paraver

34
CEPBA Tools

Powerful, but complex
Requires PAPI kernel patch for showing most
information
May require application to be recompiled
Very large trace files for long executions and/or
high number of processors (e.g. over 10GB)

35
CEPBA Tools
Source: Barcelona Supercomputing Center
http://www.bsc.es/plantillaA.php?cat_id=479
36
Visualizing with Paraver

Process
(Compile application with mpitrace libraries
linked)
Execute application (and preload mpitrace
libraries if not linked to the application)
Convert individual trace files to a Paraver file
Chop paraver trace file, if it is too big

37
Paraver Screenshots
38
Dimemas

Estimate impact of code changes without changing
the code
Estimate execution time on slightly different
architectures

39
Further Information

SGI Irix
man etime
man 3 time
man 1 time
man busage
man timers
man ssrun
man prof
Origin2000 Performance Tuning and Optimization
Guide
Linux Clusters
man 3 clock
man 2 gettimeofday
man 1 time
man 1 gprof
man 1B qstat
Intel Compilers
Vprof on NCSA Linux Cluster
