Title: Parallel Performance

1 High-Performance Grid Computing and Research Networking
Performance
Presented by Diego Lopez and Javier Munoz
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi at cs dot fiu dot edu
2 Acknowledgements
- The content of many of the slides in these lecture notes has been adapted from the online resources prepared previously by the people listed below. Many thanks!
- Henri Casanova
- Principles of High Performance Computing
- http://navet.ics.hawaii.edu/~casanova
- henric@hawaii.edu
3 Code Performance
- We will mostly talk about how to make code go fast, hence the "high performance"
- Performance conflicts with other concerns
- Correctness
- You will see that when trying to make code go fast one often breaks it
- Readability
- Fast code typically requires more lines!
- Modularity can hurt performance
- e.g., too many classes
- Portability
- Code that is fast on machine A can be slow on machine B
- At the extreme, highly optimized code is not portable at all, and in fact is done in hardware!
4 Why Performance?
- To do a time-consuming operation in less time
- I am an aircraft engineer
- I need to run a simulation to test the stability of the wings at high speed
- I'd rather have the result in 5 minutes than in 5 hours, so that I can complete the aircraft's final design sooner
- To do an operation before a tighter deadline
- I am a weather prediction agency
- I am getting input from weather stations/sensors
- I'd like to make the forecast for tomorrow before tomorrow
5 Why Performance?
- To do a high number of operations per second
- I am the CTO of Amazon.com
- My Web server gets 1,000 hits per second
- I'd like my Web server and my databases to handle 1,000 transactions per second so that customers do not experience bad delays
- Also called scalability
- Amazon does process several GBytes of data per second
6 Performance as Time
- Time between the start and the end of an operation
- Also called running time, elapsed time, wall-clock time, response time, latency, execution time, ...
- Most straightforward measure: "my program takes 12.5s on a 3.5GHz Pentium"
- Can be normalized to some reference time
- Must be measured on a dedicated machine
7 Performance as Rate
- Used often so that performance can be independent of the size of the application
- e.g., compressing a 1MB file takes 1 minute and compressing a 2MB file takes 2 minutes: the performance is the same
- Millions of instructions / sec (MIPS)
- MIPS = instruction count / (execution time x 10^6) = clock rate / (CPI x 10^6)
- But Instruction Set Architectures are not equivalent
- 1 CISC instruction = many RISC instructions
- Programs use different instruction mixes
- May be ok for the same program on the same architecture
8 Performance as Rate
- Millions of floating point operations / sec (MFlops)
- Very popular, but often misleading
- e.g., a high MFlops rate in a stupid algorithm could come with poor application performance
- Application-specific rates
- Millions of frames rendered per second
- Millions of amino-acids compared per second
- Millions of HTTP requests served per second
- Application-specific metrics are often preferable, and other metrics may be misleading
- MFlops can be application-specific too, though
- For instance
- I want to add two n-element vectors
- This requires 2n Floating Point Operations
- Therefore MFlops is a good measure (see the sketch below)
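
A sketch matching the 2n flop count above, reading "add" as accumulating all the elements of both vectors (the function name and the use of doubles are illustrative assumptions):

    /* About 2n floating-point additions for two n-element vectors,
       so the achieved rate in MFlops is 2n / (elapsed_seconds * 1e6).
       elapsed_seconds would come from a wall-clock timer, as shown
       later in this lecture. */
    double add_two_vectors(const double *a, const double *b, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += a[i] + b[i];      /* 2 flops per iteration */
        return s;
    }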
9 Peak Performance?
- Resource vendors always talk about the peak performance rate
- computed based on the specifications of the machine
- For instance
- I build a machine with 2 floating point units
- Each unit can do an operation in 2 cycles
- My CPU runs at 1GHz
- Therefore I have a 2 x 1/2 x 1GHz = 1 GFlops machine
- Problem
- In real code you will never be able to use the two floating point units constantly
- Data needs to come from memory and causes the floating point units to be idle
- Typically, real code achieves only an (often small) fraction of the peak performance
10 Benchmarks
- Since many performance metrics turn out to be misleading, people have designed benchmarks
- Example: the SPEC benchmarks
- Integer benchmark
- Floating point benchmark
- These benchmarks are typically a collection of several codes that come from real-world software
- The question "what is a good benchmark?" is difficult
- If the benchmarks do not correspond to what you'll do with the computer, then the benchmark results are not relevant to you
11 How About GHz?
- This is often the way in which people say that a computer is better than another
- More instructions per second for a higher clock rate
- Faces the same problems as MIPS
- But usable within a specific architecture

    Processor    Clock Rate   SPEC FP2000 Benchmark
    IBM Power3   450 MHz       434
    Intel PIII   1.4 GHz       456
    Intel P4     2.4 GHz       833
    Itanium-2    1.0 GHz      1356
12 Program Performance
- In this class we're not really concerned with determining the performance of a compute platform (whichever way it is defined)
- Instead we're concerned with improving a program's performance
- For a given platform, take a given program
- Run it and measure its wall-clock time
- Enhance it, run it again, and quantify the performance improvement
- i.e., the reduction in wall-clock time
- For each version compute its performance
- preferably as a relevant performance rate
- so that you can say "the best implementation we have so far goes this fast" (perhaps as a % of the peak performance)
13 Speedup
- We need a metric to quantify the impact of your performance enhancement
- Speedup = ratio of old time to new time
- old time = 2h
- new time = 1h
- speedup = 2h / 1h = 2
- Sometimes one talks about a "slowdown" in case the enhancement is not beneficial
- Happens more often than one thinks
14 Parallel Performance
- The notion of speedup is completely generic
- By using a rice cooker I've achieved a 1.20 speedup for rice cooking
- For parallel programs one defines the Parallel Speedup (we'll just say "speedup")
- Parallel program takes time T1 on 1 processor
- Parallel program takes time Tp on p processors
- Parallel Speedup(p) = T1 / Tp
- In the ideal case, if my sequential program takes 2 hours on 1 processor, it takes 1 hour on 2 processors: called linear speedup
15 Speedup
[Plot: speedup vs. number of processors, showing linear speedup, superlinear speedup (!!), and sub-linear speedup curves]
16 Superlinear Speedup?
- There are several possible causes
- Algorithm
- e.g., with optimization problems, throwing many processors at it increases the chances that one will get lucky and find the optimum fast
- Hardware
- e.g., with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. disk)
17 Bad News: Amdahl's Law
- Consider a program whose execution consists of two phases
- One sequential phase
- One phase that can be perfectly parallelized (linear speedup)
- Sequential program: old time T = T1 + T2
- T1: time spent in the phase that cannot be parallelized
- T2: time spent in the phase that can be parallelized
- Parallel program: new time T' = T1' + T2'
- T1' = T1
- T2' < T2: time spent in the parallelized phase
18 Back to Amdahl's Law
- Sequential program: T = T1 + T2; parallel program: T' = T1' + T2' with T1' = T1
- f = T2 / (T1 + T2)
- fraction of the sequential execution time that is spent in the parallelizable phase
- p = number of processors = T2 / T2'
- linear speedup in the parallelizable phase
- T' = T1 + T2' = T1 + T2/p = T - T2 + T2/p = T - f T + f T / p
- Overall parallel speedup = T / T'
- Amdahl's Law: Speedup(p) = 1 / (1 - f + f/p)
19 Amdahl's Law Example
- Plot of 1/(1 - f + f/p) for 4 values of f and for increasing values of p
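
The plot is easy to reproduce. Below is a minimal C sketch that tabulates Amdahl's Law for a few values of f (the four fractions chosen here are illustrative, not necessarily the ones on the original plot):

    #include <stdio.h>

    /* Tabulate Amdahl's Law: Speedup(p) = 1 / (1 - f + f/p). */
    int main(void) {
        const double f[] = {0.5, 0.8, 0.9, 0.99};   /* assumed sample fractions */
        printf("%4s", "p");
        for (int j = 0; j < 4; j++) printf("   f=%.2f", f[j]);
        printf("\n");
        for (int p = 1; p <= 1024; p *= 2) {
            printf("%4d", p);
            for (int j = 0; j < 4; j++)
                printf("   %6.2f", 1.0 / (1.0 - f[j] + f[j] / p));
            printf("\n");
        }
        return 0;
    }

Even with f = 0.99, the speedup is capped at 1/(1 - f) = 100 no matter how many processors are used, which is the diminishing-returns lesson of the next slide.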
20 Lessons from Amdahl's Law
- It's a law of diminishing returns
- If a significant fraction of the code (in terms of time spent in it) is not parallelizable, then parallelization is not going to be good
- It sounds obvious, but people new to high performance computing often forget how bad Amdahl's Law can be
- Luckily, many applications can be almost entirely parallelized, so the sequential fraction 1 - f is small
21 Parallel Efficiency
- Definition: Eff(p) = S(p) / p
- Typically < 1, unless linear or superlinear speedup
- Used to measure how well the processors are utilized
- If increasing the number of processors by a factor of 10 increases the speedup only by a factor of 2, perhaps it's not worth it: efficiency drops by a factor of 5
- Important when purchasing a parallel machine, for instance: if, due to the application's behavior, efficiency is low, forget buying a large cluster
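
As a sketch, speedup and efficiency can be computed directly from measured times (t1 and tp are assumed to come from actual timed runs):

    /* Speedup and parallel efficiency from measured wall-clock times. */
    double speedup(double t1, double tp)           { return t1 / tp; }
    double efficiency(double t1, double tp, int p) { return (t1 / tp) / p; }

    /* e.g., t1 = 100s and tp = 10s on p = 64 processors:
       speedup = 10, Eff(64) = 10/64 = 0.16 -- poor utilization */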
22 Scalability
- Measure of the effort needed to maintain efficiency while adding processors
- For a given problem size, plot Eff(p) for increasing values of p
- It should stay close to a flat line
- Isoefficiency: at which rate does the problem size need to be increased to maintain efficiency?
- By making a problem ridiculously large, one can typically achieve good efficiency
- Problem: is that how the machine/code will be used?
23 Performance Measures
- This is all well and good, but how does one measure the performance of a program in practice?
- Two issues
- Measuring wall-clock times
- We'll see how it can be done shortly
- Measuring performance rates
- Measure wall-clock time (see above)
- Count the number of operations (frames, flops, amino-acids, whatever makes sense for the application)
- Either by actively counting (e.g., incrementing a counter in the code)
- Or by looking at the code and figuring out how many operations are performed
- Divide the count by the wall-clock time
24 Measuring time by hand?
- One possibility would be to do this by just looking at a clock, launching the program, and looking at the clock again when the program terminates
- This of course has some drawbacks
- Poor resolution
- Requires the user's attention
- Therefore operating systems provide ways to time programs automatically
- UNIX provides the time command
25 The UNIX time Command
- You can put time in front of any UNIX command you invoke
- When the invoked command completes, time prints out timing (and other) information
- time ls /home/casanova/ -la -R
- 0.520u 1.570s 0:20.58 10.1% 0+0k 570+105io 0pf+0w
- 0.520u: 0.52 seconds of user time
- 1.570s: 1.57 seconds of system time
- 0:20.58: 20.58 seconds of wall-clock time
- 10.1%: 10.1% of the CPU was used
- 0+0k: memory used (text + data)
- 570+105io: 570 input, 105 output (file system I/O)
- 0pf+0w: 0 page faults and 0 swaps
26 User, System, Wall-Clock?
- User time: time that the code spends executing user code (i.e., not system calls)
- System time: time that the code spends executing system calls
- Wall-clock time: time from start to end
- Wall-Clock ≥ User + System
- in our example: 20.58 > 0.52 + 1.57
- Why?
- because the process can be suspended by the O/S due to contention for the CPU by other processes
- because the process can be blocked waiting for I/O
27 Using time
- It's interesting to know what the user time and the system time are
- for instance, if the system time is really high, it may be that the code does too many calls to malloc(), for instance
- But one would really need more information to fix the code (it is not always clear which system calls may be responsible for the high system time)
- Wall-clock - system - user = I/O + time suspended
- If the system is dedicated, suspended = 0
- Therefore one can estimate the cost of I/O
- If I/O is really high, one may want to look at reducing I/O or doing I/O better
- Therefore, time can give us insight into bottlenecks, and it gives us wall-clock time
- Measurements should be done on dedicated systems
28 Dedicated Systems
- Measuring the performance of a code must be done on a quiescent, unloaded machine
- the machine only runs the standard O/S processes
- The machine must be dedicated
- No other user can start a process
- The user measuring the performance only runs the minimum amount of processes
- basically, a shell
- In the class we will use machines in dedicated mode
- Nevertheless, one should always present measurement results as averages over several experiments
- Because the (small) load imposed by the O/S is not deterministic
- In your assignments, always show averages over 10 experiments, or more if asked to do so explicitly
29 Drawbacks of UNIX time
- The time command has poor resolution
- Only milliseconds
- Sometimes we want higher precision, especially if our performance improvements are in the 1-2% range
- time times the whole code
- Sometimes we're only interested in timing some part of the code, for instance the one that we are trying to optimize
- Sometimes we want to compare the execution time of different sections of the code
30 Timing with gettimeofday
- gettimeofday from the standard C library
- Measures the number of microseconds since midnight, Jan 1st 1970, expressed in seconds and microseconds

    #include <sys/time.h>

    struct timeval start;
    ...
    gettimeofday(&start, NULL);
    printf("%ld,%ld\n", start.tv_sec, start.tv_usec);
    ...

- Can be used to time sections of code
- Call gettimeofday at the beginning of the section
- Call gettimeofday at the end of the section
- Compute the elapsed time, e.g., in seconds:
- (end.tv_sec*1000000.0 + end.tv_usec - start.tv_sec*1000000.0 - start.tv_usec) / 1000000.0
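
Putting this together, below is a minimal, self-contained sketch of the pattern just described; the loop being timed is only a placeholder for the code section of interest:

    #include <stdio.h>
    #include <sys/time.h>

    int main(void) {
        struct timeval start, end;
        double sum = 0.0;

        gettimeofday(&start, NULL);               /* beginning of section */
        for (long i = 0; i < 100000000L; i++)     /* placeholder workload */
            sum += (double)i;
        gettimeofday(&end, NULL);                 /* end of section */

        /* elapsed wall-clock time in seconds, as on the slide above */
        double elapsed = (end.tv_sec * 1000000.0 + end.tv_usec
                        - start.tv_sec * 1000000.0 - start.tv_usec) / 1000000.0;
        printf("sum = %g, elapsed = %f seconds\n", sum, elapsed);
        return 0;
    }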
31 Other Ways to Time Code
- ntp_gettime() (Internet RFC 1589)
- Sort of like gettimeofday, but reports the estimated error on the time measurement
- Not available for all systems
- Part of the GNU C Library
- Java: System.currentTimeMillis()
- Known to have resolution problems, with resolution coarser than 1 millisecond!
- Solution: use a native interface to a better timer
- Java: System.nanoTime()
- Added in J2SE 5.0
- Probably not accurate at the nanosecond level
- Tons of high-precision timing options for Java can be found on the Web
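
On POSIX systems, clock_gettime is another option worth mentioning here: it reports timestamps with nanosecond resolution (the actual accuracy is platform-dependent). A minimal sketch:

    #include <stdio.h>
    #include <time.h>

    /* CLOCK_MONOTONIC is unaffected by system clock adjustments.
       Older systems may require linking with -lrt. */
    int main(void) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... section of code to time ... */
        clock_gettime(CLOCK_MONOTONIC, &end);
        double elapsed = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("elapsed = %f seconds\n", elapsed);
        return 0;
    }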
32 Why is Performance Poor?
- Performance is poor because the code suffers from a performance bottleneck
- Definition
- An application runs on a platform that has many components
- CPU, memory, operating system, network, hard drive, video card, etc.
- Pick a component and make it faster
- If the application performance increases, that component was the bottleneck!
33 Removing a Bottleneck
- Brute force: hardware upgrade
- Is sometimes necessary
- But can only get you so far and may be very costly
- e.g., memory technology
- Instead, modify the code
- The bottleneck is there because the code uses a resource heavily or in a non-intelligent manner
- We will learn techniques to alleviate bottlenecks at the software level
34 Identifying a Bottleneck
- It can be difficult
- You're not going to change the memory bus just to see what happens to the application
- But you can run the code on a different machine and see what happens
- One approach
- Know/discover the characteristics of the machine
- Instrument the code with gettimeofday calls everywhere
- Observe the application execution on the machine
- Tinker with the code
- Run the application again
- Repeat
- Reason about what the bottleneck is
35 A better approach: profiling
- A profiler is a tool that monitors the execution of a program and reports the amount of time spent in different functions
- Useful to identify the expensive functions
- Profiling cycle
- Compile the code with the profiler
- Run the code
- Identify the most expensive function
- Optimize that function
- call it less often if possible
- make it faster
- Repeat until you can't think of any ways to further optimize the most expensive function
- UNIX has a good, free profiler called gprof
36 Profiler Types based on Output
- Flat profiler
- Flat profilers compute the average call times from the calls, and do not break down the call times based on the callee or the context
- Call-graph profiler
- Call-graph profilers show the call times and frequencies of the functions, and also the call chains involved, based on the callee. However, context is not preserved
37 Methods of data gathering
- Event-based profilers
- All of the programming languages listed here have event-based profilers
- Java: JVM-Profiler Interface; the JVM API provides hooks to the profiler for trapping events like calls, class load/unload, and thread enter/leave
- Python: the Python profilers are the profile module and hotshot, which are call-graph based; they use 'sys.setprofile()' to trap events like c_call/return/exception and python_call/return/exception
- Statistical profilers
- Some profilers operate by sampling. A sampling profiler probes the target program's program counter at regular intervals using operating system interrupts. Sampling profiles are typically less accurate and specific, but allow the target program to run at near full speed
- Other profilers instrument the target program with additional instructions to collect the required information. Instrumenting the program can cause changes in its performance, causing inaccurate results and heisenbugs. Instrumenting can potentially be very specific, but slows down the target program as more specific information is collected
- The resulting data are not exact, but a statistical approximation. The actual amount of error is usually more than one sampling period. In fact, if a value is n times the sampling period, the expected error in it is the square root of n sampling periods
- Some of the most commonly used statistical profilers are GNU's gprof, OProfile, and SGI's Pixie
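
To make the sampling idea concrete, here is a toy statistical profiler in C. It is a sketch only: real profilers like gprof sample the program counter, whereas this one samples a phase variable, and the loop sizes are arbitrary:

    #include <stdio.h>
    #include <signal.h>
    #include <sys/time.h>

    /* Every time SIGPROF fires (a fixed interval of CPU time),
       record which phase of the program was executing. */
    static volatile sig_atomic_t current_phase = 0;
    static volatile long samples[2] = {0, 0};

    static void on_sigprof(int sig) { (void)sig; samples[current_phase]++; }

    int main(void) {
        struct itimerval it = {{0, 10000}, {0, 10000}};  /* 10 ms period */
        signal(SIGPROF, on_sigprof);
        setitimer(ITIMER_PROF, &it, NULL);

        volatile double x = 0.0;
        current_phase = 0;
        for (long i = 0; i < 200000000L; i++) x += 1.0;  /* phase 0 */
        current_phase = 1;
        for (long i = 0; i < 600000000L; i++) x += 1.0;  /* phase 1 */

        /* Expect roughly a 1:3 sample ratio, matching the work ratio. */
        printf("phase 0: %ld samples, phase 1: %ld samples\n",
               samples[0], samples[1]);
        return 0;
    }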
38 Methods of data gathering
- Instrumentation
- Manual: done by the programmer, e.g., by adding instructions to explicitly calculate runtimes
- Compiler assisted: e.g., "gcc -pg ..." for gprof, "quantify g++ ..." for Quantify
- Binary translation: the tool adds instrumentation to a compiled binary. Example: ATOM
- Runtime instrumentation: the code is instrumented directly before execution. The program run is fully supervised and controlled by the tool. Examples: PIN, Valgrind
- Runtime injection: more lightweight than runtime instrumentation. Code is modified at runtime to have jumps to helper functions. Example: DynInst
- Hypervisor: data are collected by running the (usually) unmodified program under a hypervisor. Example: SIMMON
- Simulator: data are collected by running under an instruction set simulator. Example: SIMMON
39 Using gprof
- Compile your code using gcc with the -pg option
- Run your code until completion
- Then run gprof with your program's name as the single command-line argument
- Example:

    gcc -pg prog.c -o prog
    ./prog
    gprof prog > profile_file

- The output file contains all the profiling information
40 Profiling output
- The content of the file is explained in detail in the file itself
- At the beginning of the file is a summary of which fraction of the time is spent in which function
- In the middle section is a detailed entry for each function
- At the end of the file is a function index, in which each function is assigned a number in brackets, e.g., [3]
41 Profiling Output
- Flat profiling summary:

      %   cumulative   self
    time    seconds   seconds  name
    30.9      0.77      0.77   ___multadd_D2A [1]
    16.9      1.19      0.42   _scheduler <cycle 1> [3]
    15.3      1.57      0.38   _scandir [5]
     9.2      1.80      0.23   _NSLookupAndBindSymbolHint [6]
     6.4      1.96      0.16   _job <cycle 1> [8]
     4.4      2.07      0.11   _NSIsSymbolNameDefinedHint [9]
     1.6      2.11      0.04   _hash_nkey [10]
     1.6      2.15      0.04   _pthread_key_create [11]
     1.2      2.18      0.03   ___quorem_D2A [12]
     1.2      2.21      0.03   __mh_dylib_header [13]
     1.2      2.24      0.03   _probe_submitter [14]
     1.2      2.27      0.03   _request_submitter [15]

- self seconds: time spent in the function itself
- cumulative seconds: time spent in the function and its children
42 Profiling output
- The middle section of the file provides detailed information for each function
- Entry format:

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Can vary depending on the version of gprof
- You should really read the explanations in the file to be sure
43 Using gprof
- Get the gprof output
- Understand the output
- Identify the function that has the highest self time
- Try to optimize that function
- make it faster by removing bottlenecks
- call it less often
- Repeat until there is no improvement
- Go on to the next function
44 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Function: func [1]
45 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
46 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
47 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
[Call graph: f1 and f2 call func; func calls c]
48 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
[Call graph with call counts: f1 calls func 80 times and f2 calls func 52 times (func is called 132 times in total); func calls c 32 times (c is called 5231 times in total)]
49 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
- 23.1% of the time is spent in func(); 2.12 seconds are spent in func() itself; 4.23 seconds are spent in the children of func()
50 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
- 1.21 seconds are spent calling func() in f1(); 0.69 seconds are spent calling func() in f2()
- 3.10 seconds are spent calling func's children in f1(); 1.13 seconds are spent calling func's children in f2()
51 GNU gprof
- An instrumenting profiler available for every UNIX-like system
52 Using gprof (GNU profiler)
- Compile and link your program with profiling enabled:

    cc -g -c myprog.c utils.c -pg
    cc -o myprog myprog.o utils.o -pg

- Execute your program to generate a profile data file
- The program will run normally (but slower) and will write the profile data into a file called gmon.out just before exiting
- The program should exit using the exit() function
- Run gprof to analyze the profile data:

    gprof myprog
53 Example Program
54 Understanding Flat Profile
- The flat profile shows the total amount of time your program spent executing each function
- If a function was not compiled for profiling, and didn't run long enough to show up on the program counter histogram, it will be indistinguishable from a function that was never called
55 Flat profile: % time
56 Flat profile: Cumulative seconds
57 Flat profile: Self seconds
- Number of seconds accounted for by this function alone
58 Flat profile: Calls
- Number of times the function was invoked
59 Flat profile: Self seconds per call
- Average number of seconds per call spent in this function alone
60 Flat profile: Total seconds per call
- Average number of seconds per call spent in this function and its descendants
61 Call Graph: call tree of the program
- Called by: main()
- Descendants: doit()
- Current function: g()
62 Call Graph: understanding each line
- Unique index of this function
- Percentage of the total time spent in this function and its children
- Total amount of time spent in this function
- Total time propagated into this function by its children
- Number of times the function was called
- Current function: g()
63 Call Graph: parents' numbers
- Time that was propagated directly from the function into this parent
- Time that was propagated from the function's children into this parent
- Number of times this parent called the function / total number of times the function was called
- Current function: g()
64 Call Graph: children's numbers
- Amount of time that was propagated directly from the child into the function
- Amount of time that was propagated from the child's children to the function
- Number of times this function called the child / total number of times the child was called
- Current function: g()
65 How gprof works
- Instruments the program to count calls
- Watches the program running, samples the PC every 0.01 sec
- Statistical inaccuracy: a fast function may get only 0 or 1 samples
- The run should be long enough compared with the sampling period
- Several gmon.out files can be combined into a single report (e.g., with gprof -s)
- The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth. This is because samples of the program counter are taken at fixed intervals of run time
- Number-of-calls figures are derived by counting, not sampling. They are completely accurate and will not vary from run to run if your program is deterministic
- Profiling with inlining and other optimizations needs care
66 Gprof example

    #include <stdio.h>
    #include <stdlib.h>

    int a(void) {
      int i = 0, g = 0;
      while (i++ < 100000) g += i;
      return g;
    }
    int b(void) {
      int i = 0, g = 0;
      while (i++ < 400000) g += i;
      return g;
    }
    int main(int argc, char *argv[]) {
      int iterations;
      if (argc != 2) {
        printf("Usage: %s <No of Iterations>\n", argv[0]);
        exit(-1);
      } else {
        iterations = atoi(argv[1]);
      }
      printf("No of iterations = %d\n", iterations);
      while (iterations--) { a(); b(); }
      return 0;
    }
67 Gprof example

    Flat profile:
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
     80.24    63.85     63.85    50000     1.28     1.28  b
     20.26    79.97     16.12    50000     0.32     0.32  a
68Gprof example
index time self children called
name
ltspontaneousgt 1 100.0 0.00 79.97
main 1 63.85
0.00 50000/50000 b 2
16.12 0.00 50000/50000 a
3 ----------------------------------------------
- 63.85 0.00 50000/50000
main 1 2 79.8 63.85 0.00 50000
b 2 ---------------------------------------
-------- 16.12 0.00
50000/50000 main 1 3 20.2 16.12
0.00 50000 a 3 ----------------------
-------------------------
69 Gprof example: Kernel Time

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int a(void) { sleep(1); return 0; }
    int b(void) { sleep(4); return 0; }

    int main(int argc, char *argv[]) {
      int iterations;
      if (argc != 2) {
        printf("Usage: %s <No of Iterations>\n", argv[0]);
        exit(-1);
      } else {
        iterations = atoi(argv[1]);
      }
      printf("No of iterations = %d\n", iterations);
      while (iterations--) { a(); b(); }
      return 0;
    }
70Gprof example Kernel Time
Flat profile Each sample counts as 0.01
seconds. no time accumulated cumulative
self self total time seconds
seconds calls Ts/call Ts/call name 0.00
0.00 0.00 120 0.00 0.00
sigprocmask 0.00 0.00 0.00 61
0.00 0.00 __libc_sigaction 0.00 0.00
0.00 61 0.00 0.00 sigaction
0.00 0.00 0.00 60 0.00
0.00 nanosleep 0.00 0.00 0.00
60 0.00 0.00 sleep 0.00 0.00
0.00 30 0.00 0.00 a 0.00
0.00 0.00 30 0.00 0.00 b
0.00 0.00 0.00 21 0.00
0.00 _IO_file_overflow 0.00 0.00 0.00
3 0.00 0.00 _IO_new_file_xsputn
0.00 0.00 0.00 2 0.00
0.00 _IO_new_do_write 0.00 0.00 0.00
2 0.00 0.00 __find_specmb 0.00
0.00 0.00 2 0.00 0.00
__guard_setup 0.00 0.00 0.00 1
0.00 0.00 _IO_default_xsputn 0.00
0.00 0.00 1 0.00 0.00
_IO_doallocbuf
71 VTune performance analyzer
- To squeeze every bit of power out of the Intel architecture!
72 VTune Modes/Features
- Time- and event-based, system-wide sampling provides developers with the most accurate representation of their software's actual performance with negligible overhead
- Call graph profiling provides developers with a pictorial view of program flow to quickly identify critical functions and call sequences
- Counter monitor allows developers to readily track system activity during runtime, which helps them identify system-level performance issues
73 Sampling mode
- Monitors all active software on your system
- including your application, the OS, JIT-compiled Java class files, Microsoft .NET files, 16-bit applications, 32-bit applications, and device drivers
- Application performance is not impacted during data collection
74 Sampling Mode Benefits
- Low-overhead, system-wide profiling helps you identify which modules and functions are consuming the most time, giving you a detailed look at your operating system and application
- Benefits of sampling
- Profiling to find hotspots: find the modules, functions, lines of source code, and assembly instructions that are consuming the most time
- Low overhead: the overhead incurred by sampling is typically about one percent
- No need to instrument code: you do not need to make any changes to your code to profile with sampling
75 How does sampling work?
- Sampling interrupts the processor after a certain number of events and records the execution information in a buffer area. When the buffer is full, the information is copied to a file. After saving the information, the program resumes operation. In this way, VTune maintains very low overhead (about one percent) while sampling
- Time-based sampling collects samples of active instruction addresses at regular time-based intervals (1 ms by default)
- Event-based sampling collects samples of active instruction addresses after a specified number of processor events
- After the program finishes, the samples are mapped to modules and stored in a database within the analyzer program
76 Events counted by VTune
- Basic events: clock cycles, retired instructions
- Instruction execution: instruction decode, issue and execution, data and control speculation, and memory operations
- Cycle accounting events: stall cycle breakdowns
- Branch events: branch prediction
- Memory hierarchy: instruction prefetch, instruction and data caches
- System events: operating system monitors, instruction and data TLBs
- About 130 different events in the Pentium 4 architecture!
77 Viewing Sampling Results
- Process view
- all the processes that ran on the system during data collection
- Thread view
- the threads that ran within the processes you select in Process view
- Module view
- the modules that ran within the selected processes and threads
- Hotspot view
- the functions within the modules you select in Module view
78 Call Graph Mode
- Provides a pictorial view of program flow to quickly identify critical functions and call sequences
- Call graph profiling reveals
- the structure of your program on a function level
- the number of times a function is called from a particular location
- the time spent in each function
- functions on a critical path
79 Call Graph Screenshot
- The function summary pane
- Critical path, displayed as red lines: the call sequence in the application that took the most time to execute
- Switch to call-list view
80 Call Graph (Cont.)
- Additional info is available by hovering the mouse over the functions
- Wait time: how much time was spent waiting for an event to occur
81 Jump to Source view
82 Call Graph: Call List View
- Caller functions are the functions that called the focus function
- Callee functions are the functions that are called by the focus function
83 Counter Monitor
- Use the Counter Monitor feature of VTune to collect and display performance counter data. Counter monitor selectively polls performance counters, which are grouped categorically into performance objects
- With the VTune analyzer, you can
- monitor selected counters in performance objects
- correlate performance counter data with data collected by other features in the VTune analyzer, such as sampling
- trigger the collection of counter data on events other than a periodic timer
84 Counter Monitor
85 Removing bottlenecks
- Now we know how to
- identify expensive sections of the code
- measure their performance
- compare to some notion of peak performance
- decide whether performance is unacceptably poor
- figure out what the physical bottleneck is
- A very common bottleneck: memory
86 The Memory Bottleneck
- Memory is a very common bottleneck that programmers often don't think about
- When you look at code, you often pay more attention to computation
- a[i] = b[j] + c[k]
- The accesses to the 3 arrays take more time than doing the addition
- For the code above, memory is the bottleneck for most machines!
87 Why the Memory Bottleneck?
- In the 70's, everything was balanced
- The memory kept pace with the CPU
- n cycles to execute an instruction, n cycles to bring in a word from memory
- No longer true
- CPUs have gotten 1,000x faster
- Memory has gotten 10x faster and 1,000,000x larger
- Flops are free, bandwidth is expensive, and processors are STARVED for data
88 Memory Latency and Bandwidth
- The performance of memory is typically defined by latency and bandwidth (or rate)
- Latency: time to read one byte from memory
- measured in nanoseconds these days
- Bandwidth: how many bytes can be read per second
- measured in GB/sec
- Note that bandwidth ≠ 1 / latency!
- Reading 2 bytes in sequence may be cheaper than reading one byte only
- Let's see why...
89Latency and Bandwidth
memory bus
Memory
CPU
- Latency time for one data item to go from
memory to the CPU - In-flight number of data items that can be in
flight on the memory bus - Bandwidth Capacity / Latency
- Maximum number of items I get in a latency period
- Pipelining
- initial delay for getting the first data item
latency - if I ask for enough data items (gt in-flight),
after latency seconds, I receive data items at
rate bandwidth - Just like networks, hardware pipelines, etc.
90 Latency and Bandwidth
- Why is memory bandwidth important?
- Because it gives us an upper bound on how fast one could feed data to the CPU
- Why is memory latency important?
- Because typically one cannot feed the CPU data constantly and one gets impacted by the latency
- Latency numbers must be put in perspective with the clock rate, i.e., measured in cycles
91 Current Memory Technology

    Memory           Latency   Peak Bandwidth
    DDR400 SDRAM     10 ns      6.4 GB/sec
    DDR533 SDRAM     9.4 ns     8.5 GB/sec
    DDR2-533 SDRAM   11.2 ns    8.5 GB/sec
    DDR2-600 SDRAM   13.3 ns    9.6 GB/sec
    DDR2-667 SDRAM   ???       10.6 GB/sec
    DDR2-800 SDRAM   ???       12.8 GB/sec

source: http://www.xbitlabs.com/articles/memory/display/ddr2-ddr_2.html (best CAS / RAS-to-CAS / RAS-precharge settings)
92 Memory Bottleneck Example
- Fragment of code: a[i] = b[j] + c[k]
- Three memory references: 2 reads, 1 write
- One addition: can be done in one cycle
- If the memory bandwidth is 12.8 GB/sec, then the rate at which the processor can access integers (4 bytes) is 12.8 x 1024 x 1024 x 1024 / 4 ≈ 3.4 GHz
- The above code needs to access 3 integers
- Therefore, the rate at which the code gets its data is ≈ 1.1 GHz
- But the CPU could perform additions at 4 GHz!
- Therefore: memory is the bottleneck
- And we assumed memory worked at its peak!!!
- We ignored other possible overheads on the bus
- In practice the gap can be around a factor of 15 or higher
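
A sketch of a microbenchmark in the spirit of this example: stream through three integer arrays and report the achieved memory rate (the array size is an arbitrary choice, large enough to defeat the caches; error checking is omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    /* Each iteration moves 3 ints (2 reads + 1 write) = 12 bytes. */
    int main(void) {
        long n = 10 * 1000 * 1000;
        int *a = malloc(n * sizeof(int));
        int *b = malloc(n * sizeof(int));
        int *c = malloc(n * sizeof(int));
        for (long i = 0; i < n; i++) { b[i] = 1; c[i] = 2; }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (long i = 0; i < n; i++)
            a[i] = b[i] + c[i];
        gettimeofday(&t1, NULL);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("achieved: %.2f GB/sec\n", 12.0 * n / sec / 1e9);
        return 0;
    }

Comparing the printed figure against the peak bandwidth of the previous slide gives a concrete feel for how far real code sits from the peak.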
93 Dealing with memory
- How have people been dealing with the memory bottleneck?
- Computers are built with a memory hierarchy
- Registers, multiple levels of cache, main memory
- Data is brought in in bulk (a cache line) from a lower level (slow, cheap, big) to a higher level (fast, expensive, small)
- Hopefully, data brought in as a cache line will be (re)used soon
- temporal locality
- spatial locality
- Programs must be aware of the memory hierarchy (at least to some extent)
- Makes life difficult when writing for performance
- But it is necessary on most systems
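
Spatial locality in practice: C stores 2-D arrays in row-major order, so the two loop orders below touch exactly the same data but use cache lines very differently; the row-major version is typically several times faster (a sketch; N is an arbitrary size chosen to exceed the caches):

    #define N 4096
    static double m[N][N];

    /* Good spatial locality: consecutive accesses fall in the
       same cache line (C arrays are stored row by row). */
    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Poor spatial locality: consecutive accesses are N*8 bytes
       apart, so almost every access misses in the cache. */
    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }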
94 Memory and parallel programs
- Rule of thumb: make sure that concurrent processes spend most of their time working on their own data in their own memory (principle of locality)
- Place data near computation
- Avoid modifying shared data
- Access data in order and reuse it
- Avoid indirection and linked data structures
- Partition the program into independent, balanced computations
- Avoid adaptive and dynamic computations
- Avoid synchronization and minimize inter-process communications
- The perfect parallel program: no communication between processors
- Locality is what makes (efficient) parallel programming painful in many cases
- As a programmer you must constantly have a mental picture of where all the data is with respect to where the computation is taking place
95 Memory and parallel programs
- What also makes parallel computing a pain is distributed-memory programming
- e.g., on a cluster
- Some computer architects are taking a new approach: design computers without locality
- i.e., no memory hierarchy!
- Only a BIG (relatively) slow shared memory
- Massive multi-threading on many processors/cores
- Write one's application as TONS of threads and run it on a massively multithreaded architecture
- Key idea: hide latency via parallelism
- If I have tons of threads, chances are that some of them will have something useful to do while others are waiting for the very distant (relatively) memory
96 Multithreaded Supercomputing?
- How about a machine with (fast) support for thousands of threads?
- for instance, hundreds of processors/cores
- each processor has support for tens of hardware threads
- More hardware is needed in each core/processor
- e.g., an instruction pointer per thread
- logic for switching between threads
- Can get expensive for many threads
- One may want to replicate many of the units (e.g., the ALU) because there is no point in supporting 15 threads if there is only one ALU
- It may not be worth it for Intel to support more extensive hyper-threading
- But, if high-performance scientific applications could be written easily and performed well on such an architecture, it would be big news
97 Multithreaded Supercomputers
- The TERA MTA was the first such machine, and although very few machines were sold, it made a big splash
- Let's look at Cray's El Dorado project
98 El Dorado Processor
99 Eldorado System
100El Dorado Philosophy
- Data locality is no longer important
- Load Balancing is no longer important
- Regular computation is no longer important
- Finding the most parallelism in an application is
whats important - Not representative of the mainstream of computing
- We will mostly NOT use that philosophy in our
parallel programming examples - although you may like it better