1
High-Performance Grid Computing and Research
Networking
Performance
Presented by Diego Lopez and Javier Munoz
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
2
Acknowledgements
  • The content of many of the slides in these lecture
    notes has been adapted from the online resources
    prepared previously by the people listed below.
    Many thanks!
  • Henri Casanova
  • Principles of High Performance Computing
  • http://navet.ics.hawaii.edu/~casanova
  • henric_at_hawaii.edu

3
Code Performance
  • We will mostly talk about how to make code go
    fast, hence the high performance
  • Performance conflicts with other concerns
  • Correctness
  • You will see that when trying to make code go
    fast one often breaks it
  • Readability
  • Fast code typically requires more lines!
  • Modularity can hurt performance
  • e.g., Too many classes
  • Portability
  • Code that is fast on machine A can be slow on
    machine B
  • At the extreme, highly optimized code is not
    portable at all, and in fact is done in hardware!

4
Why Performance?
  • To do a time-consuming operation in less time
  • I am an aircraft engineer
  • I need to run a simulation to test the stability
    of the wings at high speed
  • I'd rather have the result in 5 minutes than in 5
    hours so that I can complete the aircraft final
    design sooner.
  • To do an operation before a tighter deadline
  • I am a weather prediction agency
  • I am getting input from weather stations/sensors
  • Id like to make the forecast for tomorrow before
    tomorrow

5
Why Performance?
  • To do a high number of operations per seconds
  • I am the CTO of Amazon.com
  • My Web server gets 1,000 hits per second
  • I'd like my Web server and my databases to handle
    1,000 transactions per second so that customers
    do not experience bad delays
  • Also called scalability
  • Amazon does process several GBytes of data per
    second

6
Performance as Time
  • Time between the start and the end of an
    operation
  • Also called running time, elapsed time,
    wall-clock time, response time, latency,
    execution time, ...
  • Most straightforward measure: my program takes
    12.5s on a 3.5GHz Pentium
  • Can be normalized to some reference time
  • Must be measured on a dedicated machine

7
Performance as Rate
  • Used often so that performance can be independent
    of the size of the application
  • e.g., compressing a 1MB file takes 1 minute;
    compressing a 2MB file takes 2 minutes; the
    performance is the same
  • Millions of instructions / sec (MIPS)
  • MIPS = instruction count / (execution time × 10^6)
    = clock rate / (CPI × 10^6)
  • But Instruction Set Architectures are not
    equivalent
  • 1 CISC instruction = many RISC instructions
  • Programs use different instruction mixes
  • May be OK for the same program on the same architecture
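  • A quick worked example (hypothetical numbers): a
    1GHz processor with an average CPI of 2 executes
    10^9 / (2 × 10^6) = 500 MIPS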

8
Performance as Rate
  • Millions of floating point operations /sec
    (MFlops)
  • Very popular, but often misleading
  • e.g., a high MFlops rate for a stupid algorithm
    can still mean poor application performance
  • Application-specific
  • Millions of frames rendered per second
  • Millions of amino-acids compared per second
  • Millions of HTTP requests served per second
  • Application-specific metrics are often preferable
    where generic ones may be misleading
  • MFlops can be application-specific too, though
  • For instance
  • I want to add two n-element vectors
  • This requires n floating point operations (one
    addition per element)
  • Therefore MFlops is a good measure
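A minimal sketch of such a measurement (the vector size N and the
timing approach are our own choices, not from the lecture):

#include <stdio.h>
#include <sys/time.h>

#define N 1000000                        /* hypothetical vector size */

static double a[N], b[N], c[N];

int main(void) {
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];              /* N floating point additions */
    gettimeofday(&end, NULL);

    double seconds = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1000000.0;
    /* application-specific rate: operations performed / wall-clock time */
    printf("%.1f MFlops (c[0] = %f)\n", N / seconds / 1e6, c[0]);
    return 0;
}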

9
Peak Performance?
  • Resource vendors always talk about peak
    performance rate
  • computed based on specifications of the machine
  • For instance
  • I build a machine with 2 floating point units
  • Each unit can do an operation in 2 cycles
  • My CPU is at 1GHz
  • Therefore I have a 2 × 1/2 × 1GHz = 1 GFlops machine
  • Problem
  • In real code you will never be able to use the
    two floating point units constantly
  • Data needs to come from memory and causes the
    floating point units to be idle
  • Typically, real code achieves only an (often
    small) fraction of the peak performance

10
Benchmarks
  • Since many performance metrics turn out to be
    misleading, people have designed benchmarks
  • Example: the SPEC benchmarks
  • Integer benchmark
  • Floating point benchmark
  • These benchmarks are typically a collection of
    several codes that come from real-world
    software
  • The question "what is a good benchmark?" is
    difficult
  • If the benchmarks do not correspond to what
    you'll do with the computer, then the benchmark
    results are not relevant to you

11
How About GHz?
  • This is often the way in which people say that a
    computer is better than another
  • More instructions per second for a higher clock
    rate
  • Faces the same problems as MIPS
  • But usable within a specific architecture

Processor    Clock Rate   SPEC FP2000 Benchmark
IBM Power3   450 MHz      434
Intel PIII   1.4 GHz      456
Intel P4     2.4 GHz      833
Itanium-2    1.0 GHz      1356
12
Program Performance
  • In this class we're not really concerned with
    determining the performance of a compute platform
    (whichever way it is defined)
  • Instead we're concerned with improving a
    program's performance
  • For a given platform, take a given program
  • Run it and measure its wall-clock time
  • Enhance it, run it and quantify the performance
    improvement
  • i.e., the reduction in wall-clock time
  • For each version compute its performance
  • preferably as a relevant performance rate
  • so that you can say "the best implementation we
    have so far goes this fast" (perhaps as a
    percentage of the peak performance)

13
Speedup
  • We need a metric to quantify the impact of your
    performance enhancement
  • Speedup = ratio of old time to new time
  • old time = 2h
  • new time = 1h
  • speedup = 2h / 1h = 2
  • Sometimes one talks about a slowdown in case
    the enhancement is not beneficial
  • Happens more often than one thinks

14
Parallel Performance
  • The notion of speedup is completely generic
  • By using a rice cooker I've achieved a 1.20
    speedup for rice cooking
  • For parallel programs one defines the Parallel
    Speedup (we'll just say speedup)
  • Parallel program takes time T1 on 1 processor
  • Parallel program takes time Tp on p processors
  • Parallel Speedup(p) = T1 / Tp
  • In the ideal case, if my sequential program takes
    2 hours on 1 processor, it takes 1 hour on 2
    processors: this is called linear speedup

15
Speedup
[Plot: speedup vs. number of processors, showing linear speedup,
superlinear speedup, and sub-linear speedup curves]
16
Superlinear Speedup?
  • There are several possible causes
  • Algorithm
  • e.g., with optimization problems, throwing many
    processors at it increases the chances that one
    will get lucky and find the optimum fast
  • Hardware
  • e.g., with many processors, it is possible that
    the entire application data resides in cache (vs.
    RAM) or in RAM (vs. Disk)

17
Bad News: Amdahl's Law
  • Consider a program whose execution consists of
    two phases
  • One sequential phase
  • One phase that can be perfectly parallelized
    (linear speedup)

T1 = time spent in the phase that cannot be parallelized
T2 = time spent in the phase that can be parallelized

Sequential program: old time T = T1 + T2
Parallel program:   T1 unchanged, T2' < T2
T2' = time spent in the parallelized phase
New time T' = T1 + T2'
18
Back to Amdahl's Law

[Diagram: sequential program (T1 + T2) vs. parallel program (T1 + T2')]

  • f = T2 / (T1 + T2)
  • Fraction of the sequential execution time that is
    spent in the parallelizable phase
  • p = number of processors = T2 / T2'
  • Linear speedup
  • T' = T1 + T2' = T − T2 + T2/p = T − fT + fT/p
  • Overall parallel speedup = T / T'
  • Amdahl's Law: Speedup(p) = 1 / (1 − f + f/p)
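To make the formula concrete, here is a small sketch (the helper name
amdahl_speedup and the sample values of f and p are our own); it also
prints the parallel efficiency S(p)/p defined a few slides later:

#include <stdio.h>

/* Amdahl's Law: upper bound on speedup when a fraction f of the
   sequential execution time is perfectly parallelizable */
static double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    double f[] = { 0.50, 0.90, 0.99 };   /* hypothetical fractions */
    int    p[] = { 2, 8, 64, 1024 };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++) {
            double s = amdahl_speedup(f[i], p[j]);
            printf("f=%.2f  p=%4d  speedup=%7.2f  efficiency=%.3f\n",
                   f[i], p[j], s, s / p[j]);
        }
    return 0;
}

Even with f = 0.99, the speedup on 1024 processors stays below 100:
1 / (0.01 + 0.99/1024) ≈ 91.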

19
Amdahl's Law Example
Plot of 1/(1 − f + f/p) for 4 values of f and
for increasing values of p
20
Lessons from Amdahl's Law
  • It's a law of diminishing returns
  • If a significant fraction of the code (in terms
    of time spent in it) is not parallelizable, then
    parallelization is not going to be good
  • It sounds obvious, but people new to high
    performance computing often forget how bad
    Amdahl's law can be
  • Luckily, many applications can be almost entirely
    parallelized, so the sequential fraction 1 − f is
    small

21
Parallel Efficiency
  • Definition: Eff(p) = S(p) / p
  • Typically < 1, unless linear or superlinear
    speedup
  • Used to measure how well the processors are
    utilized
  • If increasing the number of processors by a
    factor of 10 increases the speedup only by a
    factor of 2, perhaps it's not worth it:
    efficiency drops by a factor of 5
  • Important when purchasing a parallel machine: for
    instance, if efficiency is low due to the
    application's behavior, forget buying a large
    cluster

22
Scalability
  • Measure of the effort needed to maintain
    efficiency while adding processors
  • For a given problem size, plot Eff(p) for
    increasing values of p
  • It should stay close to a flat line
  • Isoefficiency: at which rate does the problem
    size need to be increased to maintain efficiency?
  • By making a problem ridiculously large, one can
    typically achieve good efficiency
  • Problem: is that how the machine/code will be used?

23
Performance Measures
  • This is all well and good, but how does one
    measure the performance of a program in practice?
  • Two issues
  • Measuring wall-clock times
  • We'll see how it can be done shortly
  • Measuring performance rates
  • Measure wall-clock time (see above)
  • Count the number of operations (frames, flops,
    amino-acids: whatever makes sense for the
    application)
  • Either by actively counting (e.g., incrementing a
    counter)
  • Or by looking at the code and figuring out how
    many operations are performed
  • Divide the count by the wall-clock time

24
Measuring time by hand?
  • One possibility would be to do this by just
    looking at a clock, launching the program,
    looking at the clock again when the program
    terminates
  • This of course has some drawbacks
  • Poor resolution
  • Requires the user's attention
  • Therefore operating systems provide ways to time
    programs automatically
  • UNIX provides the time command

25
The UNIX time Command
  • You can put time in front of any UNIX command you
    invoke
  • When the invoked command completes, time prints
    out timing (and other) information
  • time ls /home/casanova/ -la -R
  • 0.520u 1.570s 0:20.56 10.1% 0+0k 570+105io 0pf+0w
  • 0.520u: 0.52 seconds of user time
  • 1.570s: 1.57 seconds of system time
  • 0:20.56: 20.56 seconds of wall-clock time
  • 10.1%: 10.1% of the CPU was used
  • 0+0k: memory used (text + data)
  • 570+105io: 570 input, 105 output (file system I/O)
  • 0pf+0w: 0 page faults and 0 swaps

26
User, System, Wall-Clock?
  • User Time: time that the code spends executing
    user code (i.e., non system calls)
  • System Time: time that the code spends executing
    system calls
  • Wall-Clock Time: time from start to end
  • Wall-Clock > User + System
  • in our example: 20.56 > 0.52 + 1.57
  • Why?
  • because the process can be suspended by the O/S
    due to contention for the CPU by other processes
  • because the process can be blocked waiting for
    I/O

27
Using time
  • It's interesting to know what the user time and
    the system time are
  • for instance, if the system time is really high,
    it may be that the code does too many calls to
    malloc(), for instance
  • But one would really need more information to fix
    the code (it is not always clear which system
    calls may be responsible for the high system time)
  • Wall-clock − system − user = I/O + suspended
  • If the system is dedicated, suspended = 0
  • Therefore one can estimate the cost of I/O (see
    the worked example after this list)
  • If I/O is really high, one may want to look at
    reducing I/O or doing I/O better
  • Therefore, time can give us insight into
    bottlenecks and gives us wall-clock time
  • Measurements should be done on dedicated systems
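Worked example, using the time output from slide 25: 20.56 − 0.52 −
1.57 = 18.47 seconds are unaccounted for, so on a dedicated machine
roughly 18.5 of the 20.56 seconds of that run were spent blocked on
I/O.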

28
Dedicated Systems
  • Measuring the performance of a code must be done
    on a quiescent, unloaded machine
  • the machine only runs the standard O/S processes
  • The machine must be dedicated
  • No other user can start a process
  • The user measuring the performance only runs the
    minimum amount of processes
  • basically, a shell
  • In the class we will use machines in dedicated
    mode
  • Nevertheless, one should always present
    measurement results as averages over several
    experiments
  • Because the (small) load imposed by the O/S is
    not deterministic
  • In your assignments, always show averages over 10
    experiments, or more if asked to do so explicitly

29
Drawbacks of UNIX time
  • The time command has poor resolution
  • Only milliseconds
  • Sometimes we want higher precision, especially
    if our performance improvements are in the 1-2%
    range
  • time times the whole code
  • Sometimes we're only interested in timing some
    part of the code, for instance the one that we
    are trying to optimize
  • Sometimes we want to compare the execution time
    of different sections of the code

30
Timing with gettimeofday
  • gettimeofday from the standard C library
  • Measures the time since midnight, Jan 1st 1970
    (the UNIX epoch), expressed in seconds and
    microseconds
  • #include <sys/time.h>
    struct timeval start;
    ...
    gettimeofday(&start, NULL);
    printf("%ld,%ld\n", start.tv_sec, start.tv_usec);
    ...
  • Can be used to time sections of code
  • Call gettimeofday at the beginning of the section
  • Call gettimeofday at the end of the section
  • Compute the elapsed time in seconds
  • e.g., (end.tv_sec * 1000000.0 + end.tv_usec -
    start.tv_sec * 1000000.0 - start.tv_usec) /
    1000000.0
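Assembled into a complete, runnable sketch (the elapsed helper and
the placeholder workload are our own):

#include <stdio.h>
#include <sys/time.h>

/* elapsed time between two gettimeofday() readings, in seconds */
static double elapsed(struct timeval start, struct timeval end) {
    return (end.tv_sec * 1000000.0 + end.tv_usec
          - start.tv_sec * 1000000.0 - start.tv_usec) / 1000000.0;
}

int main(void) {
    struct timeval start, end;
    double x = 0.0;

    gettimeofday(&start, NULL);
    for (int i = 1; i <= 10000000; i++)   /* section of code to time */
        x += 1.0 / i;
    gettimeofday(&end, NULL);

    printf("section took %.6f seconds (x = %f)\n",
           elapsed(start, end), x);
    return 0;
}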

31
Other Ways to Time Code
  • ntp_gettime() (Internet RFC 1589)
  • Sort of like gettimeofday, but reports the
    estimated error on the time measurement
  • Not available for all systems
  • Part of the GNU C Library
  • Java: System.currentTimeMillis()
  • Known to have resolution problems, with
    resolution coarser than 1 millisecond!
  • Solution: use a native interface to a better
    timer
  • Java: System.nanoTime()
  • Added in J2SE 5.0
  • Probably not accurate at the nanosecond level
  • Tons of high-precision timing tips for Java on the Web

32
Why is Performance Poor?
  • Performance is poor because the code suffers from
    a performance bottleneck
  • Definition
  • An application runs on a platform that has many
    components
  • CPU, Memory, Operating System, Network, Hard
    Drive, Video Card, etc.
  • Pick a component and make it faster
  • If the application performance increases, that
    component was the bottleneck!

33
Removing a Bottleneck
  • Brute force Hardware Upgrade
  • Is sometimes necessary
  • But can only get you so far and may be very
    costly
  • e.g., memory technology
  • Instead, modify the code
  • The bottleneck is there because the code uses a
    resource heavily or in a non-intelligent manner
  • We will learn techniques to alleviate bottlenecks
    at the software level

34
Identifying a Bottleneck
  • It can be difficult
  • You're not going to change the memory bus just to
    see what happens to the application
  • But you can run the code on a different machine
    and see what happens
  • One Approach
  • Know/discover the characteristics of the machine
  • Instrument the code with gettimeofday calls
    everywhere
  • Observe the application execution on the machine
  • Tinker with the code
  • Run the application again
  • Repeat
  • Reason about what the bottleneck is

35
A better approach: profiling
  • A profiler is a tool that monitors the execution
    of a program and that reports the amount of time
    spent in different functions
  • Useful to identify the expensive functions
  • Profiling cycle
  • Compile the code with the profiler
  • Run the code
  • Identify the most expensive function
  • Optimize that function
  • call it less often if possible
  • make it faster
  • Repeat until you can't think of any ways to
    further optimize the most expensive function
  • UNIX has a good, free profiler called gprof

36
Profiler Types based on Output
  • Flat profiler
  • Flat profilers compute the average call times
    and do not break down the call times based on
    the callee or the context
  • Call-Graph profiler
  • Call-graph profilers show the call times and
    frequencies of the functions, and also the
    call-chains involved based on the callee;
    however, context is not preserved

37
Methods of data gathering
  • Event-based profilers
  • Many programming languages have event-based
    profilers; two examples:
  • Java: the JVM Profiler Interface (JVMPI) provides
    hooks to profilers for trapping events like
    calls, class load/unload, and thread enter/leave
  • Python: the profile and hotshot modules are
    call-graph based, and use sys.setprofile() to
    trap events like c_call/return/exception and
    python call/return/exception
  • Statistical profilers
  • Some profilers operate by sampling. A sampling
    profiler probes the target program's program
    counter at regular intervals using operating
    system interrupts. Sampling profiles are
    typically less accurate and specific, but allow
    the target program to run at near full speed
  • Some profilers instrument the target program with
    additional instructions to collect the required
    information. Instrumenting the program can change
    its performance, causing inaccurate results and
    heisenbugs. Instrumenting can potentially be very
    specific, but slows down the target program as
    more specific information is collected
  • The resulting data are not exact, but a
    statistical approximation. The actual amount of
    error is usually more than one sampling period.
    In fact, if a value is n times the sampling
    period, the expected error in it is the
    square root of n sampling periods
  • Some of the most commonly used statistical
    profilers are GNU's gprof, OProfile, and SGI's
    Pixie

38
Methods of data gathering
  • Instrumentation
  • Manual: done by the programmer, e.g., by adding
    instructions to explicitly calculate runtimes
  • Compiler assisted: e.g., "gcc -pg ..." for
    gprof, "quantify g++ ..." for Quantify
  • Binary translation: the tool adds instrumentation
    to a compiled binary. Example: ATOM
  • Runtime instrumentation: the code is instrumented
    directly before execution. The program run is
    fully supervised and controlled by the tool.
    Examples: PIN, Valgrind
  • Runtime injection: more lightweight than runtime
    instrumentation. Code is modified at runtime to
    have jumps to helper functions. Example: DynInst
  • Hypervisor: data are collected by running the
    (usually) unmodified program under a hypervisor.
    Example: SIMMON
  • Simulator: data are collected by running under an
    instruction set simulator. Example: SIMMON

39
Using gprof
  • Compile your code using gcc with the -pg
    option
  • Run your code until completion
  • Then run gprof with your program's name as the
    single command-line argument
  • Example:
  • gcc -pg prog.c -o prog
  • ./prog
  • gprof prog > profile_file
  • The output file contains all profiling information

40
Profiling output
  • The content of the file is explained in detail in
    the file itself
  • At the beginning of the file is a summary of
    which fraction of the code is spent in which
    function
  • In the middle section is a detailed entry for
    each function
  • At the end of the file is a function index, in
    which each function is assigned a number in
    brackets, e.g., [3]

41
Profiling Output
  • Flat profiling summary:

    %time  cumulative  self
           seconds     seconds  name
     30.9  0.77        0.77     ___multadd_D2A [1]
     16.9  1.19        0.42     _scheduler <cycle 1> [3]
     15.3  1.57        0.38     _scandir [5]
      9.2  1.80        0.23     _NSLookupAndBindSymbolHint [6]
      6.4  1.96        0.16     _job <cycle 1> [8]
      4.4  2.07        0.11     _NSIsSymbolNameDefinedHint [9]
      1.6  2.11        0.04     _hash_nkey [10]
      1.6  2.15        0.04     _pthread_key_create [11]
      1.2  2.18        0.03     ___quorem_D2A [12]
      1.2  2.21        0.03     __mh_dylib_header [13]
      1.2  2.24        0.03     _probe_submitter [14]
      1.2  2.27        0.03     _request_submitter [15]

    self seconds: time spent in the function itself
    cumulative seconds: running sum of the self times of
    this function and all functions above it
42
Profiling output
  • The middle section of the file provides detailed
    information for each function
  • Entry format:

    index  %time  self  children  called    name
                  1.21  3.10      80/132        f1 [111]
                  0.69  1.13      52/132        f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231       c [39]

  • Can vary depending on the version of gprof
  • You should really read the explanations in the
    file to be sure

43
Using gprof
  • Get the gprof output
  • Understand the output
  • Identify the function that has the highest self
    time
  • Try to optimize that function
  • make it faster by removing bottlenecks
  • call it less often
  • Repeat until there is no improvement
  • Go on to the next function

44
Profiling output
    index  %time  self  children  called    name
                  1.21  3.10      80/132        f1 [111]
                  0.69  1.13      52/132        f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231       c [39]

Function: func [1]
45
Profiling output
    index  %time  self  children  called    name
                  1.21  3.10      80/132        f1 [111]
                  0.69  1.13      52/132        f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231       c [39]

Parents: f1 [111], f2 [123]
Function: func [1]
46
Profiling output
    index  %time  self  children  called    name
                  1.21  3.10      80/132        f1 [111]
                  0.69  1.13      52/132        f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231       c [39]

Parents: f1 [111], f2 [123]
Function: func [1]
Children: c [39]
47
Profiling output
    index  %time  self  children  called    name
                  1.21  3.10      80/132        f1 [111]
                  0.69  1.13      52/132        f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231       c [39]

Parents: f1 [111], f2 [123]
Function: func [1]
Children: c [39]

[Call graph: f1 and f2 call func; func calls c]
48
Profiling output
    index  %time  self  children  called    name
                  1.21  3.10      80/132        f1 [111]
                  0.69  1.13      52/132        f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231       c [39]

Parents: f1 [111], f2 [123]
Function: func [1]
Children: c [39]

[Call graph with call counts: f1 calls func 80 times and f2 calls
func 52 times (func is called 132 times in total); func calls c 32
times (c is called 5231 times in total)]
49
Profiling output
    index  %time  self  children  called    name
                  1.21  3.10      80/132        f1 [111]
                  0.69  1.13      52/132        f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231       c [39]

Parents: f1 [111], f2 [123]
Function: func [1]
Children: c [39]

23.1% of the time is spent in func(); 2.12 seconds
are spent in func() itself; 4.23 seconds are spent
in func()'s children
50
Profiling output
    index  %time  self  children  called    name
                  1.21  3.10      80/132        f1 [111]
                  0.69  1.13      52/132        f2 [123]
    [1]    23.1   2.12  4.23      132       func [1]
                  4.23  0.00      32/5231       c [39]

Parents: f1 [111], f2 [123]
Function: func [1]
Children: c [39]

1.21 seconds are spent calling func() from f1(),
and 0.69 seconds from f2(); 3.10 seconds are spent
in func()'s children on behalf of f1(), and 1.13
seconds on behalf of f2()
51
GNU gprof
  • Instrumenting profiler for every UNIX-like system

52
Using gprof GNU profiler
  • Compile and link your program with profiling
    enabled
  • cc -g -c myprog.c utils.c -pg
  • cc -o myprog myprog.o utils.o -pg
  • Execute your program to generate a profile data
    file
  • The program will run normally (but slower) and
    will write the profile data into a file called
    gmon.out just before exiting
  • The program should exit using the exit() function
  • Run gprof to analyze the profile data
  • gprof myprog

53
Example Program
54
Understanding Flat Profile
  • The flat profile shows the total amount of time
    your program spent executing each function.
  • If a function was not compiled for profiling, and
    didn't run long enough to show up on the program
    counter histogram, it will be indistinguishable
    from a function that was never called

55
Flat profile: %time
56
Flat profile: cumulative seconds
57
Flat profile: self seconds
Number of seconds accounted for by this function
alone
58
Flat profile: calls
Number of times the function was invoked
59
Flat profile: self seconds per call
Average number of seconds per call spent in this
function alone
60
Flat profile: total seconds per call
Average number of seconds spent in this function
and its descendants per call
61
Call Graph: call tree of the program
Called by: main()
Descendants: doit()
Current function: g()
62
Call Graph: understanding each line
Total time propagated into this function by its
children
Unique index of this function
Number of times the function was called
Current function: g()
Total amount of time spent in this function
Percentage of the total time spent in this
function and its children
63
Call Graph: parents' numbers
Time that was propagated from the function's
children into this parent
Number of times this parent called the function
/ total number of times the function was called
Time that was propagated directly from the
function into this parent
Current function: g()
64
Call Graph: children's numbers
Number of times this function called the child
/ total number of times this child was called
Current function: g()
Amount of time that was propagated directly from
the child into the function
Amount of time that was propagated from the
child's children to the function
65
How gprof works
  • Instruments the program to count calls
  • Watches the program running; samples the PC every
    0.01 sec
  • Statistical inaccuracy: a fast function may take
    0 or 1 samples
  • The run should be long enough compared with the
    sampling period
  • Can combine several gmon.out files into a single
    report
  • The output from gprof gives no indication of
    parts of your program that are limited by I/O or
    swapping bandwidth. This is because samples of
    the program counter are taken at fixed intervals
    of run time
  • Number-of-calls figures are derived by counting,
    not sampling. They are completely accurate and
    will not vary from run to run if your program is
    deterministic
  • Profiling with inlining and other optimizations
    needs care

66
Gprof example
#include <stdio.h>
#include <stdlib.h>

int a(void) {
    int i = 0, g = 0;
    while (i++ < 100000)
        g += i;
    return g;
}

int b(void) {
    int i = 0, g = 0;
    while (i++ < 400000)
        g += i;
    return g;
}

int main(int argc, char *argv[]) {
    int iterations;
    if (argc != 2) {
        printf("Usage %s <No of Iterations>\n", argv[0]);
        exit(-1);
    } else
        iterations = atoi(argv[1]);
    printf("No of iterations %d\n", iterations);
    while (iterations--) {
        a();
        b();
    }
    return 0;
}
67
Gprof example
Flat profile:
Each sample counts as 0.01 seconds.
  %    cumulative   self              self     total
 time    seconds   seconds    calls  ms/call  ms/call  name
 80.24     63.85     63.85    50000     1.28     1.28  b
 20.26     79.97     16.12    50000     0.32     0.32  a
68
Gprof example
index  %time    self  children    called       name
                                               <spontaneous>
[1]    100.0    0.00    79.97                  main [1]
               63.85     0.00  50000/50000         b [2]
               16.12     0.00  50000/50000         a [3]
-----------------------------------------------
               63.85     0.00  50000/50000      main [1]
[2]     79.8   63.85     0.00  50000            b [2]
-----------------------------------------------
               16.12     0.00  50000/50000      main [1]
[3]     20.2   16.12     0.00  50000            a [3]
-----------------------------------------------
69
Gprof example Kernel Time
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int a(void) {
    sleep(1);
    return 0;
}

int b(void) {
    sleep(4);
    return 0;
}

int main(int argc, char *argv[]) {
    int iterations;
    if (argc != 2) {
        printf("Usage %s <No of Iterations>\n", argv[0]);
        exit(-1);
    } else
        iterations = atoi(argv[1]);
    printf("No of iterations %d\n", iterations);
    while (iterations--) {
        a();
        b();
    }
    return 0;
}
70
Gprof example Kernel Time
Flat profile:
Each sample counts as 0.01 seconds.
 no time accumulated

  %    cumulative   self              self     total
 time    seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00      0.00      120     0.00     0.00   sigprocmask
 0.00      0.00      0.00       61     0.00     0.00   __libc_sigaction
 0.00      0.00      0.00       61     0.00     0.00   sigaction
 0.00      0.00      0.00       60     0.00     0.00   nanosleep
 0.00      0.00      0.00       60     0.00     0.00   sleep
 0.00      0.00      0.00       30     0.00     0.00   a
 0.00      0.00      0.00       30     0.00     0.00   b
 0.00      0.00      0.00       21     0.00     0.00   _IO_file_overflow
 0.00      0.00      0.00        3     0.00     0.00   _IO_new_file_xsputn
 0.00      0.00      0.00        2     0.00     0.00   _IO_new_do_write
 0.00      0.00      0.00        2     0.00     0.00   __find_specmb
 0.00      0.00      0.00        2     0.00     0.00   __guard_setup
 0.00      0.00      0.00        1     0.00     0.00   _IO_default_xsputn
 0.00      0.00      0.00        1     0.00     0.00   _IO_doallocbuf
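Why every entry shows 0.00 here: a() and b() spend essentially all of
their time blocked inside sleep(), and gprof's samples are taken only
while the program is executing on the CPU, so almost no time is
attributed to any function.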
71
VTune performance analyzer
  • To squeeze every bit of power out of the Intel
    architecture!

72
VTune Modes/Features
  • Time- and Event-Based, System-Wide Sampling
    provides developers with the most accurate
    representation of their software's actual
    performance with negligible overhead
  • Call Graph Profiling provides developers with a
    pictorial view of program flow to quickly
    identify critical functions and call sequences
  • Counter Monitor allows developers to readily
    track system activity during runtime which helps
    them identify system level performance issues

73
Sampling mode
  • Monitors all active software on your system
  • including your application, the OS, JIT-compiled
    Java class files, Microsoft .NET files, 16-bit
    applications, 32-bit applications, and device drivers
  • Application performance is not impacted during
    data collection

74
Sampling Mode Benefits
  • Low-overhead, system-wide profiling helps you
    identify which modules and functions are
    consuming the most time, giving you a detailed
    look at your operating system and application
  • Benefits of sampling
  • Profiling to find hotspots. Find the module,
    functions, lines of source code and assembly
    instructions that are consuming the most time
  • Low overhead. Overhead incurred by sampling is
    typically about one percent
  • No need to instrument code. You do not need to
    make any changes to code to profile with sampling

75
How does sampling work?
  • Sampling interrupts the processor after a certain
    number of events and records the execution
    information in a buffer area. When the buffer is
    full, the information is copied to a file. After
    saving the information, the program resumes
    operation. In this way, VTune maintains very
    low overhead (about one percent) while sampling
  • Time-based sampling collects samples of active
    instruction addresses at regular time-based
    intervals (1 ms by default)
  • Event-based sampling collects samples of active
    instruction addresses after a specified number of
    processor events
  • After the program finishes, the samples are
    mapped to modules and stored in a database within
    the analyzer program.

76
Events counted by VTune
  • Basic Events: clock cycles, retired instructions
  • Instruction Execution: instruction decode, issue
    and execution, data and control speculation, and
    memory operations
  • Cycle Accounting Events: stall cycle breakdowns
  • Branch Events: branch prediction
  • Memory Hierarchy: instruction prefetch,
    instruction and data caches
  • System Events: operating system monitors,
    instruction and data TLBs

About 130 different events in the Pentium 4
architecture!
77
Viewing Sampling Results
  • Process view
  • all the processes that ran on the system during
    data collection
  • Thread view
  • the threads that ran within the processes you
    select in Process view
  • Module view
  • the modules that ran within the selected
    processes and threads
  • Hotspot view
  • the functions within the modules you select in
    Module view

78
Call Graph Mode
  • Provides a pictorial view of program flow to
    quickly identify critical functions and call
    sequences
  • Call graph profiling reveals
  • Structure of your program on a function level
  • Number of times a function is called from a
    particular location
  • The time spent in each function
  • Functions on a critical path.

79
Call Graph Screenshot
[Screenshot: the function summary pane]
Critical Path (displayed as red lines): the call
sequence in the application that took the most
time to execute
Switch to Call-list View
80
Call Graph (Cont.)
Additional info is available by hovering the mouse
over the functions
Wait time: how much time is spent waiting for an
event to occur
81
Jump to Source view
82
Call Graph Call List View
Caller Functions are the functions that called
the Focus Function
Callee Functions are the functions that are called
by the Focus Function
83
Counter Monitor
  • Use the Counter Monitor feature of VTune to
    collect and display performance counter data.
    Counter Monitor selectively polls performance
    counters, which are grouped categorically into
    performance objects.
  • With the VTune analyzer, you can
  • Monitor selected counters in performance objects.
  • Correlate performance counter data with data
    collected by other features in the VTune
    analyzer, such as sampling.
  • Trigger the collection of counter data on events
    other than a periodic timer.

84
Counter Monitor
85
Removing bottlenecks
  • Now we know how to
  • identify expensive sections of the code
  • measure their performance
  • compare to some notion of peak performance
  • decide whether performance is unacceptably poor
  • figure out what the physical bottleneck is
  • A very common bottleneck memory

86
The Memory Bottleneck
  • The memory is a very common bottleneck that
    programmers often don't think about
  • When you look at code, you often pay more
    attention to computation
  • a[i] = b[j] + c[k]
  • The accesses to the 3 arrays take more time than
    doing the addition
  • For the code above, the memory is the bottleneck
    for most machines!

87
Why the Memory Bottleneck?
  • In the 70s, everything was balanced
  • The memory kept pace with the CPU
  • n cycles to execute an instruction, n cycles to
    bring in a word from memory
  • No longer true
  • CPUs have gotten 1,000x faster
  • Memory has gotten 10x faster and 1,000,000x
    larger
  • Flops are free, bandwidth is expensive, and
    processors are STARVED for data

88
Memory Latency and Bandwidth
  • The performance of memory is typically defined by
    Latency and Bandwidth (or Rate)
  • Latency: time to read one byte from memory
  • measured in nanoseconds these days
  • Bandwidth: how many bytes can be read per second
  • measured in GB/sec
  • Note that bandwidth ≠ 1 / latency!
  • Reading 2 bytes in sequence may be cheaper than
    reading one byte only
  • Let's see why...

89
Latency and Bandwidth
[Diagram: CPU connected to Memory via the memory bus]

  • Latency: time for one data item to go from
    memory to the CPU
  • In-flight: number of data items that can be in
    flight on the memory bus
  • Bandwidth = Capacity / Latency
  • Maximum number of items I get in a latency period
  • Pipelining
  • initial delay for getting the first data item =
    latency
  • if I ask for enough data items (> in-flight),
    after latency seconds, I receive data items at
    rate bandwidth
  • Just like networks, hardware pipelines, etc.
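A numeric illustration (combining the DDR400 latency and DDR2-800
bandwidth from the table on the next slide, purely for scale):
fetching a single byte costs about 10 ns, but streaming 1 KB costs
roughly 10 ns + 1024 bytes / 12.8 GB/sec ≈ 10 + 80 = 90 ns, far less
than the 1024 × 10 ns = 10,240 ns a byte-at-a-time view would predict.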

90
Latency and Bandwidth
  • Why is memory bandwidth important?
  • Because it gives us an upper bound on how fast
    one could feed data to the CPU
  • Why is memory latency important?
  • Because typically one cannot feed the CPU data
    constantly and one gets impacted by the latency
  • Latency numbers must be put in perspective with
    the clock rate, i.e., measured in cycles

91
Current Memory Technology
Memory           Latency   Peak Bandwidth
DDR400 SDRAM     10 ns     6.4 GB/sec
DDR533 SDRAM     9.4 ns    8.5 GB/sec
DDR2-533 SDRAM   11.2 ns   8.5 GB/sec
DDR2-600 SDRAM   13.3 ns   9.6 GB/sec
DDR2-667 SDRAM   ???       10.6 GB/sec
DDR2-800 SDRAM   ???       12.8 GB/sec

source: http://www.xbitlabs.com/articles/memory/display/ddr2-ddr_2.html
(best CAS / RAS-to-CAS / RAS-precharge settings)
92
Memory Bottleneck Example
  • Fragment of code: a[i] = b[j] + c[k]
  • Three memory references: 2 reads, 1 write
  • One addition can be done in one cycle
  • If the memory bandwidth is 12.8GB/sec, then the
    rate at which the processor can access integers
    (4 bytes) is 12.8 × 1024 × 1024 × 1024 / 4 ≈ 3.4GHz
  • The above code needs to access 3 integers
  • Therefore, the rate at which the code gets its
    data is ≈ 1.1GHz
  • But the CPU could perform additions at 4GHz!
  • Therefore: the memory is the bottleneck
  • And we assumed memory worked at its peak!!!
  • We ignored other possible overheads on the bus
  • In practice the gap can be around a factor of 15 or
    higher
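One could observe this gap empirically with a sketch like the
following (the array size and the use of static int arrays are our
own choices; the measured numbers will vary by machine):

#include <stdio.h>
#include <sys/time.h>

#define N (1 << 24)            /* 16M ints: much larger than the caches */

static int a[N], b[N], c[N];

int main(void) {
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];    /* 1 addition, 3 memory accesses */
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec)
               + (t1.tv_usec - t0.tv_usec) / 1e6;
    double bytes = 3.0 * N * sizeof(int);     /* 2 reads + 1 write */
    printf("additions: %.2f GHz (a[0]=%d)\n", N / sec / 1e9, a[0]);
    printf("bandwidth: %.2f GB/sec\n", bytes / sec / 1e9);
    return 0;
}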

93
Dealing with memory
  • How have people been dealing with the memory
    bottleneck?
  • Computers are built with a memory hierarchy
  • Registers, Multiple Levels of Cache, Main memory
  • Data is brought in in bulk (a cache line) from a
    lower level (slow, cheap, big) to a higher level
    (fast, expensive, small)
  • Hopefully data brought in in a cache line will be
    (re)used soon
  • temporal locality
  • spatial locality (see the sketch after this list)
  • Programs must be aware of the memory hierarchy
    (at least to some extent)
  • Makes life difficult when writing for performance
  • But is necessary on most systems
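A classic sketch of spatial locality (the matrix size N and the now()
helper are our own): both loops do the same work, but the row-by-row
loop walks consecutive addresses along cache lines, while the
column-by-column loop strides across them and typically runs several
times slower.

#include <stdio.h>
#include <sys/time.h>

#define N 4096

static double m[N][N];         /* 128 MB: much larger than the caches */

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void) {
    double sum = 0.0, t;

    t = now();                 /* row by row: consecutive addresses */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    printf("row-major:    %.3f s\n", now() - t);

    t = now();                 /* column by column: strides of N doubles */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    printf("column-major: %.3f s (sum=%f)\n", now() - t, sum);
    return 0;
}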

94
Memory and parallel programs
  • Rule of thumb: make sure that concurrent
    processes spend most of their time working on
    their own data in their own memory (principle of
    locality)
  • Place data near computation
  • Avoid modifying shared data
  • Access data in order and reuse it
  • Avoid indirection and linked data structures
  • Partition the program into independent, balanced
    computations
  • Avoid adaptive and dynamic computations
  • Avoid synchronization and minimize inter-process
    communications
  • The perfect parallel program: no communication
    between processors
  • Locality is what makes (efficient) parallel
    programming painful in many cases
  • As a programmer you must constantly have a mental
    picture of where all the data is with respect to
    where the computation is taking place

95
Memory and parallel programs
  • What also makes parallel computing a pain is
    distributed-memory programming
  • e.g., on a cluster
  • Some computer architects are taking a new
    approach: design computers without locality
  • i.e., no memory hierarchy!
  • Only a BIG (relatively) slow shared memory
  • Massive multi-threading on many processors/cores
  • Write one's application as TONS of threads and
    run it on a massively multithreaded architecture
  • Key idea: hide latency via parallelism
  • If I have tons of threads, chances are that some
    of them will have something useful to do while
    others are waiting for the very distant
    (relatively) memory
96
Multithreaded Supercomputing?
  • How about a machine with (fast) support for
    thousands of threads?
  • for instance hundreds of processors/cores
  • each processor has support for tens of hardware
    threads
  • More hardware is needed in each core/processor
  • e.g., an Instruction Pointer per thread
  • logic for switching between threads
  • Can get expensive for many threads
  • One may want to replicate many of the units
    (e.g., ALU) because there is no point in
    supporting 15 threads if there is only one ALU
  • It may not be worth it for Intel to support more
    extensive hyper-threading
  • But, if high-performance scientific applications
    could be written easily and perform well on such
    an architecture it would be big news

97
Multithreaded Supercomputers
  • The TERA MTA was the first such machine, and
    although very few machines were sold, it made a
    big splash
  • Let's look at Cray's El Dorado project

98
El Dorado Processor
99
El Dorado System
100
El Dorado Philosophy
  • Data locality is no longer important
  • Load balancing is no longer important
  • Regular computation is no longer important
  • Finding the most parallelism in an application is
    what's important
  • Not representative of the mainstream of computing
  • We will mostly NOT use that philosophy in our
    parallel programming examples
  • although you may like it better