Title: Parallel Performance

1 High-Performance Grid Computing and Research Networking
Performance
Presented by Diego Lopez and Javier Munoz
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi at cs dot fiu dot edu
2 Acknowledgements
- The content of many of the slides in these lecture notes has been adapted from the online resources prepared previously by the people listed below. Many thanks!
- Henri Casanova
- Principles of High Performance Computing
- http://navet.ics.hawaii.edu/~casanova
- henric@hawaii.edu
3 Code Performance
- We will mostly talk about how to make code go fast, hence the "high performance"
- Performance conflicts with other concerns
- Correctness
- You will see that when trying to make code go fast one often breaks it
- Readability
- Fast code typically requires more lines!
- Modularity can hurt performance
- e.g., too many classes
- Portability
- Code that is fast on machine A can be slow on machine B
- At the extreme, highly optimized code is not portable at all, and in fact is done in hardware!
4 Why Performance?
- To do a time-consuming operation in less time
- I am an aircraft engineer
- I need to run a simulation to test the stability of the wings at high speed
- I'd rather have the result in 5 minutes than in 5 hours, so that I can complete the aircraft's final design sooner
- To do an operation before a tighter deadline
- I am a weather prediction agency
- I am getting input from weather stations/sensors
- I'd like to make the forecast for tomorrow before tomorrow
5 Why Performance?
- To do a high number of operations per second
- I am the CTO of Amazon.com
- My Web server gets 1,000 hits per second
- I'd like my Web server and my databases to handle 1,000 transactions per second so that customers do not experience bad delays
- Also called scalability
- Amazon does process several GBytes of data per second
6 Performance as Time
- Time between the start and the end of an operation
- Also called running time, elapsed time, wall-clock time, response time, latency, execution time, ...
- Most straightforward measure: "my program takes 12.5s on a 3.5GHz Pentium"
- Can be normalized to some reference time
- Must be measured on a dedicated machine
7 Performance as Rate
- Used often so that performance can be independent of the size of the application
- e.g., compressing a 1MB file takes 1 minute and compressing a 2MB file takes 2 minutes: the performance is the same
- Millions of instructions / sec (MIPS)
- MIPS = instruction count / (execution time x 10^6) = clock rate / (CPI x 10^6)
- But Instruction Set Architectures are not equivalent
- 1 CISC instruction = many RISC instructions
- Programs use different instruction mixes
- May be ok for the same program on the same architecture
8 Performance as Rate
- Millions of floating point operations / sec (MFlops)
- Very popular, but often misleading
- e.g., a high MFlops rate in a stupid algorithm could come with poor application performance
- Application-specific rates
- Millions of frames rendered per second
- Millions of amino-acids compared per second
- Millions of HTTP requests served per second
- Application-specific metrics are often preferable, and other metrics may be misleading
- MFlops can be application-specific too, though
- For instance
- I want to add two n-element vectors
- This requires 2n Floating Point Operations
- Therefore MFlops is a good measure (see the sketch below)
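
A sketch matching the 2n flop count above, reading "add" as accumulating all the elements of both vectors (the function name and the use of doubles are illustrative assumptions):

    /* About 2n floating-point additions for two n-element vectors,
       so the achieved rate in MFlops is 2n / (elapsed_seconds * 1e6).
       elapsed_seconds would come from a wall-clock timer, as shown
       later in this lecture. */
    double add_two_vectors(const double *a, const double *b, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += a[i] + b[i];      /* 2 flops per iteration */
        return s;
    }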
9 Peak Performance?
- Resource vendors always talk about the peak performance rate
- computed based on the specifications of the machine
- For instance
- I build a machine with 2 floating point units
- Each unit can do an operation in 2 cycles
- My CPU runs at 1GHz
- Therefore I have a 2 x 1/2 x 1GHz = 1 GFlops machine
- Problem
- In real code you will never be able to use the two floating point units constantly
- Data needs to come from memory and causes the floating point units to be idle
- Typically, real code achieves only an (often small) fraction of the peak performance
10 Benchmarks
- Since many performance metrics turn out to be misleading, people have designed benchmarks
- Example: the SPEC benchmarks
- Integer benchmark
- Floating point benchmark
- These benchmarks are typically a collection of several codes that come from real-world software
- The question "what is a good benchmark?" is difficult
- If the benchmarks do not correspond to what you'll do with the computer, then the benchmark results are not relevant to you
11 How About GHz?
- This is often the way in which people say that a computer is better than another
- More instructions per second for a higher clock rate
- Faces the same problems as MIPS
- But usable within a specific architecture

    Processor    Clock Rate   SPEC FP2000 Benchmark
    IBM Power3   450 MHz       434
    Intel PIII   1.4 GHz       456
    Intel P4     2.4 GHz       833
    Itanium-2    1.0 GHz      1356
12 Program Performance
- In this class we're not really concerned with determining the performance of a compute platform (whichever way it is defined)
- Instead we're concerned with improving a program's performance
- For a given platform, take a given program
- Run it and measure its wall-clock time
- Enhance it, run it again, and quantify the performance improvement
- i.e., the reduction in wall-clock time
- For each version compute its performance
- preferably as a relevant performance rate
- so that you can say "the best implementation we have so far goes this fast" (perhaps as a % of the peak performance)
13 Speedup
- We need a metric to quantify the impact of your performance enhancement
- Speedup = ratio of old time to new time
- old time = 2h
- new time = 1h
- speedup = 2h / 1h = 2
- Sometimes one talks about a "slowdown" in case the enhancement is not beneficial
- Happens more often than one thinks
14 Parallel Performance
- The notion of speedup is completely generic
- By using a rice cooker I've achieved a 1.20 speedup for rice cooking
- For parallel programs one defines the Parallel Speedup (we'll just say "speedup")
- Parallel program takes time T1 on 1 processor
- Parallel program takes time Tp on p processors
- Parallel Speedup(p) = T1 / Tp
- In the ideal case, if my sequential program takes 2 hours on 1 processor, it takes 1 hour on 2 processors: called linear speedup
15 Speedup
[Plot: speedup vs. number of processors, showing linear speedup, superlinear speedup (!!), and sub-linear speedup curves]
16 Superlinear Speedup?
- There are several possible causes
- Algorithm
- e.g., with optimization problems, throwing many processors at it increases the chances that one will get lucky and find the optimum fast
- Hardware
- e.g., with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. disk)
17 Bad News: Amdahl's Law
- Consider a program whose execution consists of two phases
- One sequential phase
- One phase that can be perfectly parallelized (linear speedup)
- Sequential program: old time T = T1 + T2
- T1: time spent in the phase that cannot be parallelized
- T2: time spent in the phase that can be parallelized
- Parallel program: new time T' = T1' + T2'
- T1' = T1
- T2' < T2: time spent in the parallelized phase
18 Back to Amdahl's Law
- Sequential program: T = T1 + T2; parallel program: T' = T1' + T2' with T1' = T1
- f = T2 / (T1 + T2)
- fraction of the sequential execution time that is spent in the parallelizable phase
- p = number of processors = T2 / T2'
- linear speedup in the parallelizable phase
- T' = T1 + T2' = T1 + T2/p = T - T2 + T2/p = T - f T + f T / p
- Overall parallel speedup = T / T'
- Amdahl's Law: Speedup(p) = 1 / (1 - f + f/p)
19 Amdahl's Law Example
- Plot of 1/(1 - f + f/p) for 4 values of f and for increasing values of p
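
The plot is easy to reproduce. Below is a minimal C sketch that tabulates Amdahl's Law for a few values of f (the four fractions chosen here are illustrative, not necessarily the ones on the original plot):

    #include <stdio.h>

    /* Tabulate Amdahl's Law: Speedup(p) = 1 / (1 - f + f/p). */
    int main(void) {
        const double f[] = {0.5, 0.8, 0.9, 0.99};   /* assumed sample fractions */
        printf("%4s", "p");
        for (int j = 0; j < 4; j++) printf("   f=%.2f", f[j]);
        printf("\n");
        for (int p = 1; p <= 1024; p *= 2) {
            printf("%4d", p);
            for (int j = 0; j < 4; j++)
                printf("   %6.2f", 1.0 / (1.0 - f[j] + f[j] / p));
            printf("\n");
        }
        return 0;
    }

Even with f = 0.99, the speedup is capped at 1/(1 - f) = 100 no matter how many processors are used, which is the diminishing-returns lesson of the next slide.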
20 Lessons from Amdahl's Law
- It's a law of diminishing returns
- If a significant fraction of the code (in terms of time spent in it) is not parallelizable, then parallelization is not going to be good
- It sounds obvious, but people new to high performance computing often forget how bad Amdahl's Law can be
- Luckily, many applications can be almost entirely parallelized, so the sequential fraction 1 - f is small
21 Parallel Efficiency
- Definition: Eff(p) = S(p) / p
- Typically < 1, unless linear or superlinear speedup
- Used to measure how well the processors are utilized
- If increasing the number of processors by a factor of 10 increases the speedup only by a factor of 2, perhaps it's not worth it: efficiency drops by a factor of 5
- Important when purchasing a parallel machine, for instance: if, due to the application's behavior, efficiency is low, forget buying a large cluster
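
As a sketch, speedup and efficiency can be computed directly from measured times (t1 and tp are assumed to come from actual timed runs):

    /* Speedup and parallel efficiency from measured wall-clock times. */
    double speedup(double t1, double tp)           { return t1 / tp; }
    double efficiency(double t1, double tp, int p) { return (t1 / tp) / p; }

    /* e.g., t1 = 100s and tp = 10s on p = 64 processors:
       speedup = 10, Eff(64) = 10/64 = 0.16 -- poor utilization */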
22 Scalability
- Measure of the effort needed to maintain efficiency while adding processors
- For a given problem size, plot Eff(p) for increasing values of p
- It should stay close to a flat line
- Isoefficiency: at which rate does the problem size need to be increased to maintain efficiency?
- By making a problem ridiculously large, one can typically achieve good efficiency
- Problem: is that how the machine/code will be used?
23 Performance Measures
- This is all well and good, but how does one measure the performance of a program in practice?
- Two issues
- Measuring wall-clock times
- We'll see how it can be done shortly
- Measuring performance rates
- Measure wall-clock time (see above)
- Count the number of operations (frames, flops, amino-acids, whatever makes sense for the application)
- Either by actively counting (e.g., incrementing a counter in the code)
- Or by looking at the code and figuring out how many operations are performed
- Divide the count by the wall-clock time
24 Measuring time by hand?
- One possibility would be to do this by just looking at a clock, launching the program, and looking at the clock again when the program terminates
- This of course has some drawbacks
- Poor resolution
- Requires the user's attention
- Therefore operating systems provide ways to time programs automatically
- UNIX provides the time command
25 The UNIX time Command
- You can put time in front of any UNIX command you invoke
- When the invoked command completes, time prints out timing (and other) information
- time ls /home/casanova/ -la -R
- 0.520u 1.570s 0:20.58 10.1% 0+0k 570+105io 0pf+0w
- 0.520u: 0.52 seconds of user time
- 1.570s: 1.57 seconds of system time
- 0:20.58: 20.58 seconds of wall-clock time
- 10.1%: 10.1% of the CPU was used
- 0+0k: memory used (text + data)
- 570+105io: 570 input, 105 output (file system I/O)
- 0pf+0w: 0 page faults and 0 swaps
26 User, System, Wall-Clock?
- User time: time that the code spends executing user code (i.e., not system calls)
- System time: time that the code spends executing system calls
- Wall-clock time: time from start to end
- Wall-Clock ≥ User + System
- in our example: 20.58 > 0.52 + 1.57
- Why?
- because the process can be suspended by the O/S due to contention for the CPU by other processes
- because the process can be blocked waiting for I/O
27 Using time
- It's interesting to know what the user time and the system time are
- for instance, if the system time is really high, it may be that the code does too many calls to malloc(), for instance
- But one would really need more information to fix the code (it is not always clear which system calls may be responsible for the high system time)
- Wall-clock - system - user = I/O + time suspended
- If the system is dedicated, suspended = 0
- Therefore one can estimate the cost of I/O
- If I/O is really high, one may want to look at reducing I/O or doing I/O better
- Therefore, time can give us insight into bottlenecks, and it gives us wall-clock time
- Measurements should be done on dedicated systems
28 Dedicated Systems
- Measuring the performance of a code must be done on a quiescent, unloaded machine
- the machine only runs the standard O/S processes
- The machine must be dedicated
- No other user can start a process
- The user measuring the performance only runs the minimum amount of processes
- basically, a shell
- In the class we will use machines in dedicated mode
- Nevertheless, one should always present measurement results as averages over several experiments
- Because the (small) load imposed by the O/S is not deterministic
- In your assignments, always show averages over 10 experiments, or more if asked to do so explicitly
29 Drawbacks of UNIX time
- The time command has poor resolution
- Only milliseconds
- Sometimes we want higher precision, especially if our performance improvements are in the 1-2% range
- time times the whole code
- Sometimes we're only interested in timing some part of the code, for instance the one that we are trying to optimize
- Sometimes we want to compare the execution time of different sections of the code
30 Timing with gettimeofday
- gettimeofday from the standard C library
- Measures the number of microseconds since midnight, Jan 1st 1970, expressed in seconds and microseconds

    #include <sys/time.h>

    struct timeval start;
    ...
    gettimeofday(&start, NULL);
    printf("%ld,%ld\n", start.tv_sec, start.tv_usec);
    ...

- Can be used to time sections of code
- Call gettimeofday at the beginning of the section
- Call gettimeofday at the end of the section
- Compute the elapsed time, e.g., in seconds:
- (end.tv_sec*1000000.0 + end.tv_usec - start.tv_sec*1000000.0 - start.tv_usec) / 1000000.0
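
Putting this together, below is a minimal, self-contained sketch of the pattern just described; the loop being timed is only a placeholder for the code section of interest:

    #include <stdio.h>
    #include <sys/time.h>

    int main(void) {
        struct timeval start, end;
        double sum = 0.0;

        gettimeofday(&start, NULL);               /* beginning of section */
        for (long i = 0; i < 100000000L; i++)     /* placeholder workload */
            sum += (double)i;
        gettimeofday(&end, NULL);                 /* end of section */

        /* elapsed wall-clock time in seconds, as on the slide above */
        double elapsed = (end.tv_sec * 1000000.0 + end.tv_usec
                        - start.tv_sec * 1000000.0 - start.tv_usec) / 1000000.0;
        printf("sum = %g, elapsed = %f seconds\n", sum, elapsed);
        return 0;
    }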
31 Other Ways to Time Code
- ntp_gettime() (Internet RFC 1589)
- Sort of like gettimeofday, but reports the estimated error on the time measurement
- Not available for all systems
- Part of the GNU C Library
- Java: System.currentTimeMillis()
- Known to have resolution problems, with resolution coarser than 1 millisecond!
- Solution: use a native interface to a better timer
- Java: System.nanoTime()
- Added in J2SE 5.0
- Probably not accurate at the nanosecond level
- Tons of high-precision timing options for Java can be found on the Web
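
On POSIX systems, clock_gettime is another option worth mentioning here: it reports timestamps with nanosecond resolution (the actual accuracy is platform-dependent). A minimal sketch:

    #include <stdio.h>
    #include <time.h>

    /* CLOCK_MONOTONIC is unaffected by system clock adjustments.
       Older systems may require linking with -lrt. */
    int main(void) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... section of code to time ... */
        clock_gettime(CLOCK_MONOTONIC, &end);
        double elapsed = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("elapsed = %f seconds\n", elapsed);
        return 0;
    }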
32 Why is Performance Poor?
- Performance is poor because the code suffers from a performance bottleneck
- Definition
- An application runs on a platform that has many components
- CPU, memory, operating system, network, hard drive, video card, etc.
- Pick a component and make it faster
- If the application performance increases, that component was the bottleneck!
33 Removing a Bottleneck
- Brute force: hardware upgrade
- Is sometimes necessary
- But can only get you so far and may be very costly
- e.g., memory technology
- Instead, modify the code
- The bottleneck is there because the code uses a resource heavily or in a non-intelligent manner
- We will learn techniques to alleviate bottlenecks at the software level
34 Identifying a Bottleneck
- It can be difficult
- You're not going to change the memory bus just to see what happens to the application
- But you can run the code on a different machine and see what happens
- One approach
- Know/discover the characteristics of the machine
- Instrument the code with gettimeofday calls everywhere
- Observe the application execution on the machine
- Tinker with the code
- Run the application again
- Repeat
- Reason about what the bottleneck is
35 A better approach: profiling
- A profiler is a tool that monitors the execution of a program and reports the amount of time spent in different functions
- Useful to identify the expensive functions
- Profiling cycle
- Compile the code with the profiler
- Run the code
- Identify the most expensive function
- Optimize that function
- call it less often if possible
- make it faster
- Repeat until you can't think of any ways to further optimize the most expensive function
- UNIX has a good, free profiler called gprof
36 Profiler Types based on Output
- Flat profiler
- Flat profilers compute the average call times from the calls, and do not break down the call times based on the callee or the context
- Call-graph profiler
- Call-graph profilers show the call times and frequencies of the functions, and also the call chains involved, based on the callee. However, context is not preserved
37 Methods of data gathering
- Event-based profilers
- All of the programming languages listed here have event-based profilers
- Java: JVM-Profiler Interface; the JVM API provides hooks to the profiler for trapping events like calls, class load/unload, and thread enter/leave
- Python: the Python profilers are the profile module and hotshot, which are call-graph based; they use 'sys.setprofile()' to trap events like c_call/return/exception and python_call/return/exception
- Statistical profilers
- Some profilers operate by sampling. A sampling profiler probes the target program's program counter at regular intervals using operating system interrupts. Sampling profiles are typically less accurate and specific, but allow the target program to run at near full speed
- Other profilers instrument the target program with additional instructions to collect the required information. Instrumenting the program can cause changes in its performance, causing inaccurate results and heisenbugs. Instrumenting can potentially be very specific, but slows down the target program as more specific information is collected
- The resulting data are not exact, but a statistical approximation. The actual amount of error is usually more than one sampling period. In fact, if a value is n times the sampling period, the expected error in it is the square root of n sampling periods
- Some of the most commonly used statistical profilers are GNU's gprof, OProfile, and SGI's Pixie
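
To make the sampling idea concrete, here is a toy statistical profiler in C. It is a sketch only: real profilers like gprof sample the program counter, whereas this one samples a phase variable, and the loop sizes are arbitrary:

    #include <stdio.h>
    #include <signal.h>
    #include <sys/time.h>

    /* Every time SIGPROF fires (a fixed interval of CPU time),
       record which phase of the program was executing. */
    static volatile sig_atomic_t current_phase = 0;
    static volatile long samples[2] = {0, 0};

    static void on_sigprof(int sig) { (void)sig; samples[current_phase]++; }

    int main(void) {
        struct itimerval it = {{0, 10000}, {0, 10000}};  /* 10 ms period */
        signal(SIGPROF, on_sigprof);
        setitimer(ITIMER_PROF, &it, NULL);

        volatile double x = 0.0;
        current_phase = 0;
        for (long i = 0; i < 200000000L; i++) x += 1.0;  /* phase 0 */
        current_phase = 1;
        for (long i = 0; i < 600000000L; i++) x += 1.0;  /* phase 1 */

        /* Expect roughly a 1:3 sample ratio, matching the work ratio. */
        printf("phase 0: %ld samples, phase 1: %ld samples\n",
               samples[0], samples[1]);
        return 0;
    }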
38 Methods of data gathering
- Instrumentation
- Manual: done by the programmer, e.g., by adding instructions to explicitly calculate runtimes
- Compiler assisted: e.g., "gcc -pg ..." for gprof, "quantify g++ ..." for Quantify
- Binary translation: the tool adds instrumentation to a compiled binary. Example: ATOM
- Runtime instrumentation: the code is instrumented directly before execution. The program run is fully supervised and controlled by the tool. Examples: PIN, Valgrind
- Runtime injection: more lightweight than runtime instrumentation. Code is modified at runtime to have jumps to helper functions. Example: DynInst
- Hypervisor: data are collected by running the (usually) unmodified program under a hypervisor. Example: SIMMON
- Simulator: data are collected by running under an instruction set simulator. Example: SIMMON
39 Using gprof
- Compile your code using gcc with the -pg option
- Run your code until completion
- Then run gprof with your program's name as the single command-line argument
- Example:

    gcc -pg prog.c -o prog
    ./prog
    gprof prog > profile_file

- The output file contains all the profiling information
40 Profiling output
- The content of the file is explained in detail in the file itself
- At the beginning of the file is a summary of which fraction of the time is spent in which function
- In the middle section is a detailed entry for each function
- At the end of the file is a function index, in which each function is assigned a number in brackets, e.g., [3]
41 Profiling Output
- Flat profiling summary:

      %   cumulative   self
    time    seconds   seconds  name
    30.9      0.77      0.77   ___multadd_D2A [1]
    16.9      1.19      0.42   _scheduler <cycle 1> [3]
    15.3      1.57      0.38   _scandir [5]
     9.2      1.80      0.23   _NSLookupAndBindSymbolHint [6]
     6.4      1.96      0.16   _job <cycle 1> [8]
     4.4      2.07      0.11   _NSIsSymbolNameDefinedHint [9]
     1.6      2.11      0.04   _hash_nkey [10]
     1.6      2.15      0.04   _pthread_key_create [11]
     1.2      2.18      0.03   ___quorem_D2A [12]
     1.2      2.21      0.03   __mh_dylib_header [13]
     1.2      2.24      0.03   _probe_submitter [14]
     1.2      2.27      0.03   _request_submitter [15]

- self seconds: time spent in the function itself
- cumulative seconds: time spent in the function and its children
42 Profiling output
- The middle section of the file provides detailed information for each function
- Entry format:

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Can vary depending on the version of gprof
- You should really read the explanations in the file to be sure
43 Using gprof
- Get the gprof output
- Understand the output
- Identify the function that has the highest self time
- Try to optimize that function
- make it faster by removing bottlenecks
- call it less often
- Repeat until there is no improvement
- Go on to the next function
44 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Function: func [1]
45 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
46 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
47 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
[Call graph: f1 and f2 call func; func calls c]
48 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
[Call graph with call counts: f1 calls func 80 times and f2 calls func 52 times (func is called 132 times in total); func calls c 32 times (c is called 5231 times in total)]
49 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
- 23.1% of the time is spent in func(); 2.12 seconds are spent in func() itself; 4.23 seconds are spent in the children of func()
50 Profiling output

    index  % time   self  children   called     name
                    1.21      3.10   80/132         f1 [111]
                    0.69      1.13   52/132         f2 [123]
    [1]      23.1   2.12      4.23   132        func [1]
                    4.23      0.00   32/5231        c [39]

- Parents: f1 [111], f2 [123]
- Function: func [1]
- Children: c [39]
- 1.21 seconds are spent calling func() in f1(); 0.69 seconds are spent calling func() in f2()
- 3.10 seconds are spent calling func's children in f1(); 1.13 seconds are spent calling func's children in f2()
51 GNU gprof
- An instrumenting profiler available for every UNIX-like system
52 Using gprof (GNU profiler)
- Compile and link your program with profiling enabled:

    cc -g -c myprog.c utils.c -pg
    cc -o myprog myprog.o utils.o -pg

- Execute your program to generate a profile data file
- The program will run normally (but slower) and will write the profile data into a file called gmon.out just before exiting
- The program should exit using the exit() function
- Run gprof to analyze the profile data:

    gprof myprog
53 Example Program
54 Understanding Flat Profile
- The flat profile shows the total amount of time your program spent executing each function
- If a function was not compiled for profiling, and didn't run long enough to show up on the program counter histogram, it will be indistinguishable from a function that was never called
55 Flat profile: % time
56 Flat profile: Cumulative seconds
57 Flat profile: Self seconds
- Number of seconds accounted for by this function alone
58 Flat profile: Calls
- Number of times the function was invoked
59 Flat profile: Self seconds per call
- Average number of seconds per call spent in this function alone
60 Flat profile: Total seconds per call
- Average number of seconds per call spent in this function and its descendants
61 Call Graph: call tree of the program
- Called by: main()
- Descendants: doit()
- Current function: g()
62 Call Graph: understanding each line
- Unique index of this function
- Percentage of the total time spent in this function and its children
- Total amount of time spent in this function
- Total time propagated into this function by its children
- Number of times the function was called
- Current function: g()
63 Call Graph: parents' numbers
- Time that was propagated directly from the function into this parent
- Time that was propagated from the function's children into this parent
- Number of times this parent called the function / total number of times the function was called
- Current function: g()
64 Call Graph: children's numbers
- Amount of time that was propagated directly from the child into the function
- Amount of time that was propagated from the child's children to the function
- Number of times this function called the child / total number of times the child was called
- Current function: g()
65 How gprof works
- Instruments the program to count calls
- Watches the program running, samples the PC every 0.01 sec
- Statistical inaccuracy: a fast function may get only 0 or 1 samples
- The run should be long enough compared with the sampling period
- Several gmon.out files can be combined into a single report (e.g., with gprof -s)
- The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth. This is because samples of the program counter are taken at fixed intervals of run time
- Number-of-calls figures are derived by counting, not sampling. They are completely accurate and will not vary from run to run if your program is deterministic
- Profiling with inlining and other optimizations needs care
66 Gprof example

    #include <stdio.h>
    #include <stdlib.h>

    int a(void) {
      int i = 0, g = 0;
      while (i++ < 100000) g += i;
      return g;
    }
    int b(void) {
      int i = 0, g = 0;
      while (i++ < 400000) g += i;
      return g;
    }
    int main(int argc, char *argv[]) {
      int iterations;
      if (argc != 2) {
        printf("Usage: %s <No of Iterations>\n", argv[0]);
        exit(-1);
      } else {
        iterations = atoi(argv[1]);
      }
      printf("No of iterations = %d\n", iterations);
      while (iterations--) { a(); b(); }
      return 0;
    }
67 Gprof example

    Flat profile:
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
     80.24    63.85     63.85    50000     1.28     1.28  b
     20.26    79.97     16.12    50000     0.32     0.32  a
68Gprof example
index time self children called
name
ltspontaneousgt 1 100.0 0.00 79.97
main 1 63.85
0.00 50000/50000 b 2
16.12 0.00 50000/50000 a
3 ----------------------------------------------
- 63.85 0.00 50000/50000
main 1 2 79.8 63.85 0.00 50000
b 2 ---------------------------------------
-------- 16.12 0.00
50000/50000 main 1 3 20.2 16.12
0.00 50000 a 3 ----------------------
-------------------------
69 Gprof example: Kernel Time

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int a(void) { sleep(1); return 0; }
    int b(void) { sleep(4); return 0; }

    int main(int argc, char *argv[]) {
      int iterations;
      if (argc != 2) {
        printf("Usage: %s <No of Iterations>\n", argv[0]);
        exit(-1);
      } else {
        iterations = atoi(argv[1]);
      }
      printf("No of iterations = %d\n", iterations);
      while (iterations--) { a(); b(); }
      return 0;
    }
70Gprof example Kernel Time
Flat profile Each sample counts as 0.01
seconds. no time accumulated cumulative
self self total time seconds
seconds calls Ts/call Ts/call name 0.00
0.00 0.00 120 0.00 0.00
sigprocmask 0.00 0.00 0.00 61
0.00 0.00 __libc_sigaction 0.00 0.00
0.00 61 0.00 0.00 sigaction
0.00 0.00 0.00 60 0.00
0.00 nanosleep 0.00 0.00 0.00
60 0.00 0.00 sleep 0.00 0.00
0.00 30 0.00 0.00 a 0.00
0.00 0.00 30 0.00 0.00 b
0.00 0.00 0.00 21 0.00
0.00 _IO_file_overflow 0.00 0.00 0.00
3 0.00 0.00 _IO_new_file_xsputn
0.00 0.00 0.00 2 0.00
0.00 _IO_new_do_write 0.00 0.00 0.00
2 0.00 0.00 __find_specmb 0.00
0.00 0.00 2 0.00 0.00
__guard_setup 0.00 0.00 0.00 1
0.00 0.00 _IO_default_xsputn 0.00
0.00 0.00 1 0.00 0.00
_IO_doallocbuf
71 VTune performance analyzer
- To squeeze every bit of power out of the Intel architecture!
72 VTune Modes/Features
- Time- and event-based, system-wide sampling provides developers with the most accurate representation of their software's actual performance with negligible overhead
- Call graph profiling provides developers with a pictorial view of program flow to quickly identify critical functions and call sequences
- Counter monitor allows developers to readily track system activity during runtime, which helps them identify system-level performance issues
73 Sampling mode
- Monitors all active software on your system
- including your application, the OS, JIT-compiled Java class files, Microsoft .NET files, 16-bit applications, 32-bit applications, and device drivers
- Application performance is not impacted during data collection
74 Sampling Mode Benefits
- Low-overhead, system-wide profiling helps you identify which modules and functions are consuming the most time, giving you a detailed look at your operating system and application
- Benefits of sampling
- Profiling to find hotspots: find the modules, functions, lines of source code, and assembly instructions that are consuming the most time
- Low overhead: the overhead incurred by sampling is typically about one percent
- No need to instrument code: you do not need to make any changes to your code to profile with sampling
75 How does sampling work?
- Sampling interrupts the processor after a certain number of events and records the execution information in a buffer area. When the buffer is full, the information is copied to a file. After saving the information, the program resumes operation. In this way, VTune maintains very low overhead (about one percent) while sampling
- Time-based sampling collects samples of active instruction addresses at regular time-based intervals (1 ms by default)
- Event-based sampling collects samples of active instruction addresses after a specified number of processor events
- After the program finishes, the samples are mapped to modules and stored in a database within the analyzer program
76 Events counted by VTune
- Basic events: clock cycles, retired instructions
- Instruction execution: instruction decode, issue and execution, data and control speculation, and memory operations
- Cycle accounting events: stall cycle breakdowns
- Branch events: branch prediction
- Memory hierarchy: instruction prefetch, instruction and data caches
- System events: operating system monitors, instruction and data TLBs
- About 130 different events in the Pentium 4 architecture!
77 Viewing Sampling Results
- Process view
- all the processes that ran on the system during data collection
- Thread view
- the threads that ran within the processes you select in Process view
- Module view
- the modules that ran within the selected processes and threads
- Hotspot view
- the functions within the modules you select in Module view
78 Call Graph Mode
- Provides a pictorial view of program flow to quickly identify critical functions and call sequences
- Call graph profiling reveals
- the structure of your program on a function level
- the number of times a function is called from a particular location
- the time spent in each function
- functions on a critical path
79 Call Graph Screenshot
- The function summary pane
- Critical path, displayed as red lines: the call sequence in the application that took the most time to execute
- Switch to call-list view
80 Call Graph (Cont.)
- Additional info is available by hovering the mouse over the functions
- Wait time: how much time was spent waiting for an event to occur
81 Jump to Source view
82 Call Graph: Call List View
- Caller functions are the functions that called the focus function
- Callee functions are the functions that are called by the focus function
83 Counter Monitor
- Use the Counter Monitor feature of VTune to collect and display performance counter data. Counter monitor selectively polls performance counters, which are grouped categorically into performance objects
- With the VTune analyzer, you can
- monitor selected counters in performance objects
- correlate performance counter data with data collected by other features in the VTune analyzer, such as sampling
- trigger the collection of counter data on events other than a periodic timer
84 Counter Monitor
85 Removing bottlenecks
- Now we know how to
- identify expensive sections of the code
- measure their performance
- compare to some notion of peak performance
- decide whether performance is unacceptably poor
- figure out what the physical bottleneck is
- A very common bottleneck: memory
86 The Memory Bottleneck
- Memory is a very common bottleneck that programmers often don't think about
- When you look at code, you often pay more attention to computation
- a[i] = b[j] + c[k]
- The accesses to the 3 arrays take more time than doing the addition
- For the code above, memory is the bottleneck for most machines!
87 Why the Memory Bottleneck?
- In the 70's, everything was balanced
- The memory kept pace with the CPU
- n cycles to execute an instruction, n cycles to bring in a word from memory
- No longer true
- CPUs have gotten 1,000x faster
- Memory has gotten 10x faster and 1,000,000x larger
- Flops are free, bandwidth is expensive, and processors are STARVED for data
88 Memory Latency and Bandwidth
- The performance of memory is typically defined by latency and bandwidth (or rate)
- Latency: time to read one byte from memory
- measured in nanoseconds these days
- Bandwidth: how many bytes can be read per second
- measured in GB/sec
- Note that bandwidth ≠ 1 / latency!
- Reading 2 bytes in sequence may be cheaper than reading one byte only
- Let's see why...
89Latency and Bandwidth
memory bus
Memory
CPU
- Latency time for one data item to go from
memory to the CPU - In-flight number of data items that can be in
flight on the memory bus - Bandwidth Capacity / Latency
- Maximum number of items I get in a latency period
- Pipelining
- initial delay for getting the first data item
latency - if I ask for enough data items (gt in-flight),
after latency seconds, I receive data items at
rate bandwidth - Just like networks, hardware pipelines, etc.
90 Latency and Bandwidth
- Why is memory bandwidth important?
- Because it gives us an upper bound on how fast one could feed data to the CPU
- Why is memory latency important?
- Because typically one cannot feed the CPU data constantly and one gets impacted by the latency
- Latency numbers must be put in perspective with the clock rate, i.e., measured in cycles
91 Current Memory Technology

    Memory           Latency   Peak Bandwidth
    DDR400 SDRAM     10 ns      6.4 GB/sec
    DDR533 SDRAM     9.4 ns     8.5 GB/sec
    DDR2-533 SDRAM   11.2 ns    8.5 GB/sec
    DDR2-600 SDRAM   13.3 ns    9.6 GB/sec
    DDR2-667 SDRAM   ???       10.6 GB/sec
    DDR2-800 SDRAM   ???       12.8 GB/sec

source: http://www.xbitlabs.com/articles/memory/display/ddr2-ddr_2.html (best CAS / RAS-to-CAS / RAS-precharge settings)
92 Memory Bottleneck Example
- Fragment of code: a[i] = b[j] + c[k]
- Three memory references: 2 reads, 1 write
- One addition: can be done in one cycle
- If the memory bandwidth is 12.8 GB/sec, then the rate at which the processor can access integers (4 bytes) is 12.8 x 1024 x 1024 x 1024 / 4 ≈ 3.4 GHz
- The above code needs to access 3 integers
- Therefore, the rate at which the code gets its data is ≈ 1.1 GHz
- But the CPU could perform additions at 4 GHz!
- Therefore: memory is the bottleneck
- And we assumed memory worked at its peak!!!
- We ignored other possible overheads on the bus
- In practice the gap can be around a factor of 15 or higher
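
A sketch of a microbenchmark in the spirit of this example: stream through three integer arrays and report the achieved memory rate (the array size is an arbitrary choice, large enough to defeat the caches; error checking is omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    /* Each iteration moves 3 ints (2 reads + 1 write) = 12 bytes. */
    int main(void) {
        long n = 10 * 1000 * 1000;
        int *a = malloc(n * sizeof(int));
        int *b = malloc(n * sizeof(int));
        int *c = malloc(n * sizeof(int));
        for (long i = 0; i < n; i++) { b[i] = 1; c[i] = 2; }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (long i = 0; i < n; i++)
            a[i] = b[i] + c[i];
        gettimeofday(&t1, NULL);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("achieved: %.2f GB/sec\n", 12.0 * n / sec / 1e9);
        return 0;
    }

Comparing the printed figure against the peak bandwidth of the previous slide gives a concrete feel for how far real code sits from the peak.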
93 Dealing with memory
- How have people been dealing with the memory bottleneck?
- Computers are built with a memory hierarchy
- Registers, multiple levels of cache, main memory
- Data is brought in in bulk (a cache line) from a lower level (slow, cheap, big) to a higher level (fast, expensive, small)
- Hopefully, data brought in as a cache line will be (re)used soon
- temporal locality
- spatial locality
- Programs must be aware of the memory hierarchy (at least to some extent)
- Makes life difficult when writing for performance
- But it is necessary on most systems
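
Spatial locality in practice: C stores 2-D arrays in row-major order, so the two loop orders below touch exactly the same data but use cache lines very differently; the row-major version is typically several times faster (a sketch; N is an arbitrary size chosen to exceed the caches):

    #define N 4096
    static double m[N][N];

    /* Good spatial locality: consecutive accesses fall in the
       same cache line (C arrays are stored row by row). */
    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Poor spatial locality: consecutive accesses are N*8 bytes
       apart, so almost every access misses in the cache. */
    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }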
94 Memory and parallel programs
- Rule of thumb: make sure that concurrent processes spend most of their time working on their own data in their own memory (principle of locality)
- Place data near computation
- Avoid modifying shared data
- Access data in order and reuse it
- Avoid indirection and linked data structures
- Partition the program into independent, balanced computations
- Avoid adaptive and dynamic computations
- Avoid synchronization and minimize inter-process communications
- The perfect parallel program: no communication between processors
- Locality is what makes (efficient) parallel programming painful in many cases
- As a programmer you must constantly have a mental picture of where all the data is with respect to where the computation is taking place
95 Memory and parallel programs
- What also makes parallel computing a pain is distributed-memory programming
- e.g., on a cluster
- Some computer architects are taking a new approach: design computers without locality
- i.e., no memory hierarchy!
- Only a BIG (relatively) slow shared memory
- Massive multi-threading on many processors/cores
- Write one's application as TONS of threads and run it on a massively multithreaded architecture
- Key idea: hide latency via parallelism
- If I have tons of threads, chances are that some of them will have something useful to do while others are waiting for the very distant (relatively) memory
96 Multithreaded Supercomputing?
- How about a machine with (fast) support for thousands of threads?
- for instance, hundreds of processors/cores
- each processor has support for tens of hardware threads
- More hardware is needed in each core/processor
- e.g., an instruction pointer per thread
- logic for switching between threads
- Can get expensive for many threads
- One may want to replicate many of the units (e.g., the ALU) because there is no point in supporting 15 threads if there is only one ALU
- It may not be worth it for Intel to support more extensive hyper-threading
- But, if high-performance scientific applications could be written easily and performed well on such an architecture, it would be big news
97 Multithreaded Supercomputers
- The TERA MTA was the first such machine, and although very few machines were sold, it made a big splash
- Let's look at Cray's El Dorado project
98 El Dorado Processor
99 Eldorado System
100El Dorado Philosophy
- Data locality is no longer important
- Load Balancing is no longer important
- Regular computation is no longer important
- Finding the most parallelism in an application is
whats important - Not representative of the mainstream of computing
- We will mostly NOT use that philosophy in our
parallel programming examples - although you may like it better