Slides: 50

Provided by: caam1

Transcript and Presenter's Notes

1
HPCToolkit: Multi-platform Tools for Analyzing Node Performance
John Mellor-Crummey, Robert Fowler, Nathan Tallent, Gabriel Marin
Department of Computer Science, Rice University
http://hipersoft.cs.rice.edu/hpctoolkit/
2
What Makes a Program Fast?

Good algorithm
Good data structure
Efficient code

3
Analysis and Tuning Questions

How can we tell if a program has good performance?
How can we tell that it doesn't?
If performance is not good, how can we pinpoint where?
How can we tell why?
What can we do about it?

4
What about Parallel Codes?

Even partitioning of computation
Minimum communication
Low communication overhead
Good parallelism

5
Tuning a Parallel Code

Analyzing parallelism
Get the parallelism right
Analyze node performance
Tune that as well

6
A Digression: Parallel Line Sweep

Good parallel performance requires suitable partitioning
Tightly-coupled computations are problematic
Line-sweep computations: ADI integration, among others

do j = 1, n
  do i = 2, n
    a(i,j) = ... a(i-1,j)
Recurrences make parallelization difficult with BLOCK partitionings
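
A minimal Python sketch of the recurrence above, assuming a simple additive update (the slide does not show the actual right-hand side). It illustrates the point that the columns j are independent of each other, while the i loop must run in order; splitting the i dimension across processors therefore serializes the sweep.

```python
def line_sweep(a):
    """Sweep along the first dimension: a[i][j] depends on a[i-1][j].

    The additive update is an assumed stand-in for the right-hand
    side that the slide leaves out.
    """
    n_rows = len(a)
    n_cols = len(a[0])
    for j in range(n_cols):          # columns are independent of each other
        for i in range(1, n_rows):   # rows must be visited in order
            a[i][j] = a[i][j] + a[i - 1][j]
    return a


grid = [[1, 2], [3, 4], [5, 6]]
swept = line_sweep([row[:] for row in grid])
print(swept)
```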
7
Parallelizing Line Sweeps with Block Partitionings
Approach 1: Only compute along local dimensions
Local sweeps along x and z
Local sweep along y
Transpose
Transpose back

Fully parallel computation
High communication volume: transpose ALL data

8
Coarse-Grain Pipelining
Approach 2: Compute along partitioned dimensions
Partial serialization induces wavefront parallelism with block partitioning
9
Coarse-Grain Pipelining
Approach 2: Compute along partitioned dimensions
Partial serialization induces wavefront parallelism with block partitioning
[Figure labels: Processor 0, Processor 1, Processor 2, Processor 3]
10
Multipartitioning

Style of skewed-cyclic distribution
Each processor owns a tile between each pair of
cuts along each distributed dimension

11
Multipartitioning

Enables full parallelism for a sweep along any partitioned dimension

[Figure labels: Processor 0, Processor 1, Processor 2, Processor 3]
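
As an illustration of the skewed-cyclic idea, here is a small Python sketch for an assumed two-dimensional case with p processors and a p x p grid of tiles. The diagonal mapping `(col - row) mod p` is one simple choice made for this sketch, not necessarily the mapping used in the talk. Every tile row and every tile column contains each processor exactly once, so a sweep along either dimension keeps all processors busy.

```python
def diagonal_owner(tile_row, tile_col, num_procs):
    """Owner of a tile in a num_procs x num_procs tiling (assumed 2D case)."""
    return (tile_col - tile_row) % num_procs


def owners_grid(num_procs):
    """Table of tile owners, indexed as [tile_row][tile_col]."""
    return [[diagonal_owner(r, c, num_procs) for c in range(num_procs)]
            for r in range(num_procs)]


owner_table = owners_grid(4)
for row in owner_table:
    print(row)
```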
12
Parallelizing Line Sweeps

13
Understanding Node Performance

The rest of this talk will focus on this topic.

14
The Setting: Modern Computer Systems

Microprocessor-based architectures
Deeply-pipelined processors with internal parallelism
out-of-order superscalar: Alpha
multiple functional units
circuitry to dynamically determine dependences and dispatch instructions
many instructions can be in flight at once
VLIW: Itanium
issue a fixed-size bundle of instructions each cycle
bundles tailored to mix of available functional units
compiler pre-determines what instructions execute in parallel
Complex memory hierarchy
non-blocking, multi-level caches
TLB

15
The Setting: Modern Scientific Applications

Multi-lingual programs
Many source files
Complex build process
Typical
Multiple directories
Multiple makefiles
Incomplete automation
External libraries in binary-only form

16
The Problem: Programming Modern Microprocessor Systems Efficiently

Architectural sweet spot often differs from the application's
Example:
Architecture: modest cache sizes, long cache lines
Irregular particle application:
accesses large amounts of data in an irregular access pattern
most of each long cache line goes unread
almost no temporal reuse
Gap between peak and typical performance is growing
5-10% of peak is common today
Gap between processor speed and memory speed is growing
Performance analysis and tuning is necessary!

17
Performance Monitoring Hardware

Purpose
capture information about performance-critical details that is otherwise inaccessible
cycles in flight, TLB misses, mispredicted branches, etc.
What it does
characterize events and measure durations
record information about an instruction as it executes
Two flavors of performance monitoring hardware
aggregate performance event counters
sample events during execution: cycles, board cache misses
limitation: out-of-order execution smears attribution of events
ProfileMe instruction execution trace hardware
a set of boolean flags indicating occurrence of events (e.g., traps, replays, etc.) + cycle counters
limitation: not all sources of delay are counted; attribution is sometimes unintuitive

18
Performance Tool Goals

Support large, multi-lingual applications
a mix of Fortran, C, C++
external libraries
thousands of procedures
hundreds of thousands of lines
we must avoid
manual instrumentation
significantly altering the build process
frequent recompilation
Multi-platform
Scalable data collection
Analyze both serial and parallel codes
Effective presentation of analysis results
intuitive enough for physicists and engineers to use
detailed enough to meet the needs of compiler writers

19
HPCToolkit System Overview
application source
20
HPCToolkit System Overview
[Workflow diagram. Labels: application source, compilation, linking, binary object code, profile execution, performance profile, interpret profile, binary analysis, program structure, source correlation, hyperlinked database, hpcviewer]

launch unmodified, optimized application binaries
collect statistical profiles of events of interest

21
HPCToolkit System Overview

decode instructions and combine with profile data

22
HPCToolkit System Overview

extract loop nesting information from executables

23
HPCToolkit System Overview

synthesize new metrics by combining metrics
relate metrics, structure, and program source

24
HPCToolkit System Overview

support top-down analysis with interactive viewer
analyze results anytime, anywhere

25
HPCToolkit System Overview
[Workflow diagram. Labels: application source, compilation, linking, binary object code, profile execution, performance profile, interpret profile, binary analysis, program structure, source correlation, hyperlinked database, hpcviewer]
26
Data Collection

Support analysis of unmodified, optimized binaries
Inserting code to start, stop and read counters has many drawbacks, so don't do it!
nested measurements skew results
Use hardware performance monitoring to collect statistical profiles of events of interest
Different platforms have different capabilities
event-based counters: MIPS, IA64, Pentium
ProfileMe instruction tracing: Alpha
Different capabilities require different approaches

27
Sample-based Performance Analysis

Events are sampled when
an aggregate performance counter exceeds a threshold
an instruction is selected for ProfileMe tracing
Each time a sample occurs
note the program counter
record information in a histogram
Map sampled PC values back to source lines
Advantages
provides a high-level view of where events happen during execution
can be started at launch time without prior preparation
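
The counter-overflow style of sampling described above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not how papirun or DCPI are implemented: `trace` is a hypothetical stand-in for the hardware, listing how many monitored events each executed instruction caused.

```python
from collections import Counter


def sample_profile(trace, threshold):
    """Build a PC histogram from a trace of (pc, events) pairs.

    A sample is taken whenever the running event count reaches
    `threshold`, mimicking a counter-overflow interrupt.
    """
    histogram = Counter()
    counter = 0
    for pc, events in trace:
        counter += events
        while counter >= threshold:
            histogram[pc] += 1      # note the program counter
            counter -= threshold    # re-arm the counter
    return histogram


# Hypothetical loop body of three instructions, executed ten times;
# the instruction at 0x404 causes most of the events.
trace = [(0x400, 1), (0x404, 3), (0x408, 1)] * 10
hist = sample_profile(trace, threshold=4)
hottest_pc = hist.most_common(1)[0][0]
```

Most samples land on the instruction that causes most events, but a few land on its neighbors, which is one reason line-level data is treated as approximate later in the talk.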

28
Data Collection: papirun for Linux

PAPI: Performance API
interface to hardware performance monitors
supports many platforms
papirun: open source sample-based profiling
preload monitoring library before launching application
inspect load map to set up sampling for all load modules
record PC samples for each module along with load map
Linux IA64 and IA32
papiprof: prof-like tool
output styles
XML for use with hpcview
plain text

29
Data Collection: DCPI and ProfileMe

Alpha ProfileMe
EV67 records info about an instruction as it executes
mispredicted branches, memory access replay traps
more accurate attribution of events
DCPI: (Digital) Continuous Profiling Infrastructure
sample processor counters and instructions continuously during execution of all code
all programs
shared libraries
operating system
support both on-line and off-line data analysis
to date, we use only off-line analysis

30
HPCToolkit System Overview
31
Linux papiprof

prof-like tool for use with papirun
based on Curtis Janssen's vprof
uses GNU binutils to perform PC → source mapping
interpret profiles collected with papirun
map counts associated with instruction addresses back to (file, function, source line) triples
output styles
ASCII profile format
XML-based profile format for use with HPCView
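
The address-to-source attribution step can be sketched as follows. This Python example does not use GNU binutils; it assumes a small, hypothetical line table (start address, file, function, line) of the kind that debugging information provides, and shows how per-address counts roll up into per-line counts.

```python
import bisect

# Hypothetical line table, sorted by start address.
LINE_TABLE = [
    (0x1000, "solver.c", "relax", 10),
    (0x1010, "solver.c", "relax", 11),
    (0x1040, "solver.c", "residual", 42),
]
STARTS = [entry[0] for entry in LINE_TABLE]


def lookup(pc):
    """Return the (file, function, line) triple whose range contains pc."""
    index = bisect.bisect_right(STARTS, pc) - 1
    if index < 0:
        return None  # address precedes the first known entry
    _, filename, function, line = LINE_TABLE[index]
    return (filename, function, line)


def attribute(pc_counts):
    """Sum per-address counts into per-source-line counts."""
    totals = {}
    for pc, count in pc_counts.items():
        key = lookup(pc)
        totals[key] = totals.get(key, 0) + count
    return totals


profile = attribute({0x1004: 7, 0x1008: 3, 0x1014: 5, 0x1044: 2})
```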

32
Metric Synthesis with xprof (Alpha)

Interpret DCPI samples into useful metrics
Transform low-level data to higher-level metrics
DCPI: ProfileMe information associated with PC values
project ProfileMe data into useful equivalence classes
decode instruction type info in application binary at each PC
FLOP
memory operation
integer operation
fuse the two kinds of information
retired instructions + instruction type yield:
retired FLOPs
retired integer operations
retired memory operations
Map back to source code like papiprof
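
A minimal sketch of the fusion step, assuming two hypothetical inputs keyed by PC: retired-instruction sample counts and the instruction class decoded from the binary. The names and numbers are invented for illustration and are not xprof's data format.

```python
from collections import defaultdict

# Hypothetical inputs: retired-instruction sample counts per PC, and the
# instruction class decoded from the binary at each PC.
retired = {0x2000: 40, 0x2004: 25, 0x2008: 10, 0x200C: 25}
instr_class = {0x2000: "flop", 0x2004: "memory",
               0x2008: "integer", 0x200C: "flop"}


def synthesize(retired_counts, classes):
    """Project per-PC retired counts into per-class derived metrics."""
    metrics = defaultdict(int)
    for pc, count in retired_counts.items():
        # PCs with no decoded class fall into an "other" bucket.
        metrics["retired_" + classes.get(pc, "other")] += count
    return dict(metrics)


derived = synthesize(retired, instr_class)
```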

33
HPCToolkit System Overview
34
Why Binary Analysis?

Problems
Line-level performance statistics may be
inaccurate, and offer a myopic view of program
performance
Interesting performance for scientific programs
is at the loop level
Approach
recover loop information from an application
binary

35
Program Structure Recovery with bloop

Parse instructions in an executable using GNU binutils
Analyze branches to identify basic blocks
Construct control flow graph using branch target analysis
be careful with machine conventions and delay slots!
Use interval analysis to identify natural loop nests
Map machine instructions to source lines with symbol table
dependent on accurate debugging information!
Normalize output to recover source-level view
Platforms: Alpha Tru64, MIPS IRIX, Linux IA64, Linux IA32, Solaris SPARC
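
The first steps above (finding basic blocks and control-flow edges from branches) can be sketched on a toy instruction list. The encoding is an assumption made for this example: it has no delay slots and is not bloop's representation. The final line flags backward edges as loop candidates, which is a simplification that only works for this straight-line layout; it is not the interval analysis the slide refers to.

```python
def basic_blocks(instrs):
    """Split a toy instruction list into basic blocks and CFG edges.

    Each instruction is (opcode, target), where target is an instruction
    index for "br" (unconditional) and "brc" (conditional), else None.
    """
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in ("br", "brc"):
            leaders.add(target)            # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)         # so does the instruction after a branch
    starts = sorted(leaders)
    blocks = [(s, e - 1) for s, e in zip(starts, starts[1:] + [len(instrs)])]
    block_of = {s: b for b, (s, _) in enumerate(blocks)}
    edges = set()
    for b, (_, end) in enumerate(blocks):
        op, target = instrs[end]
        if op in ("br", "brc"):
            edges.add((b, block_of[target]))
        if op != "br" and end + 1 < len(instrs):
            edges.add((b, block_of[end + 1]))   # fall-through edge
    return blocks, edges


# A single loop: instructions 1-3 form the body, 3 branches back to 1.
code = [("op", None), ("op", None), ("op", None), ("brc", 1), ("op", None)]
blocks, edges = basic_blocks(code)
back_edges = {(src, dst) for src, dst in edges if dst <= src}
```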

36
Sample Flowgraph from an Executable

Loop nesting structure
blue: outermost level
red: loop level 1
green: loop level 2

Observation: optimization complicates program structure!
37
Normalizing Program Structure
Constraint: each source line must appear at most once

Coalesce duplicate lines
(1) if duplicate lines appear in different loops
find least common ancestor in scope tree; merge corresponding loops along the paths to each of the duplicates
purpose: re-rolls loops that have been split
(2) if duplicate lines appear at multiple levels in a loop nest
discard all but the innermost instance
purpose: handles loop-invariant code motion
apply (1) and (2) repeatedly until a fixed point is reached
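
Rule (2) alone can be sketched as below. The encoding of the scope tree (each line occurrence carries the tuple of enclosing loop ids, outermost first) is hypothetical, and the sketch covers neither rule (1) nor the fixed-point iteration.

```python
def keep_innermost(occurrences):
    """Apply rule (2): drop an occurrence if the same line also appears
    deeper in the same loop nest.

    Each occurrence is (line, path), where path is a tuple of loop ids
    from outermost to innermost.
    """
    kept = []
    for line, path in occurrences:
        shadowed = any(
            other_line == line
            and len(other_path) > len(path)
            and other_path[:len(path)] == path
            for other_line, other_path in occurrences
        )
        if not shadowed:
            kept.append((line, path))
    return kept


# Line 301 was hoisted out of loop "L2" by loop-invariant code motion,
# so it shows up both in "L1" and in the nested "L2".
occ = [(297, ("L1",)), (301, ("L1",)), (301, ("L1", "L2")), (318, ("L1", "L2"))]
result = keep_innermost(occ)
```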

38
Recovered Program Structure

<LM n="/apps/smg98/test/smg98">
  ...
  <F n="/apps/smg98/struct_linear_solvers/smg_relax.c">
    <P n="hypre_SMGRelaxFreeARem">
      <L b="146" e="146">
        <S b="146" e="146"/>
      </L>
    </P>
    <P n="hypre_SMGRelax">
      <L b="297" e="328">
        <S b="297" e="297"/>
        <L b="301" e="328">
          <S b="301" e="301"/>
          <L b="318" e="325">
            <S b="318" e="325"/>
          </L>
          <S b="328" e="328"/>
        </L>
        <S b="302" e="302"/>

39
HPCToolkit System Overview
40
Data Correlation

Problem
any one performance measure provides a myopic view
some measure potential causes (e.g. cache misses)
some measure effects (e.g. cycles)
cache misses are not always a problem
event counter attribution is inaccurate for out-of-order processors
Approaches
multiple metrics for each program line
computed metrics, e.g. cache miss rate
eliminate mental arithmetic
serve as a key for sorting
hierarchical structure
line-level attribution errors still give good loop-level information

41
HPCToolkit System Overview
42
HPCViewer Screenshot
[Screenshot with labeled panes: Annotated Source View, Metrics, Navigation]
43
Flattening for Top Down Analysis

Problem
strict hierarchical view of a program is too rigid
want to compare program components at the same level as peers
Solution
enable a scope's descendants to be flattened to compare their children as peers

[Figure labels: Current scope, flatten, unflatten]
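
One flattening step can be sketched as follows. The (name, cost, children) tuples are an assumed representation of scopes, not hpcviewer's data model; the point is only that replacing each non-leaf scope by its children lets procedures from different files be ranked against each other.

```python
def flatten(children):
    """One flattening step: replace each non-leaf scope by its children."""
    result = []
    for scope in children:
        _, _, grandchildren = scope
        result.extend(grandchildren if grandchildren else [scope])
    return result


# Hypothetical scope tree: two files, three procedures, costs in samples.
tree = [
    ("file_a.f", 70, [("proc_1", 50, []), ("proc_2", 20, [])]),
    ("file_b.f", 30, [("proc_3", 30, [])]),
]
peers = sorted(flatten(tree), key=lambda scope: scope[1], reverse=True)
names = [name for name, _, _ in peers]
```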
44
Using HPCTools Toolkit on Linux
[Workflow diagram; labels grouped by tool]
source, a.out
mpiexec papirun -e PAPI_L2_TCM a.out (raw profile data)
bloop a.out (program structure)
papiprof (portable profile)
...
hpcview (portable profile, XML database, configuration file)
45
hpcview Configuration File
<HPCVIEW>
  <TITLE name="POP 4-way shmem, model_size=medium" />
  <PATH name="." />
  <PATH name="./compile" />
  <PATH name="../sshmem" />
  <PATH name="../source" />
  <METRIC name="pcc" displayName="Cycles">
    <FILE name="pop.fcy_hwc.pxml" />
  </METRIC>
  <METRIC name="dc" displayName="L1 miss">
    <FILE name="pop.fdc_hwc.pxml" />
  </METRIC>
  <METRIC name="dsc" displayName="L2 miss">
    <FILE name="pop.fdsc_hwc.pxml" />
  </METRIC>
  <METRIC name="fp" displayName="FP insts">
    <FILE name="pop.fgfp_hwc.pxml" />
  </METRIC>
  <METRIC name="rat" displayName="cy per FLOP">
    <COMPUTE>
      <math> <apply> <divide/>
        <ci>pcc</ci> <ci>fp</ci> </apply> </math>
    </COMPUTE>
  </METRIC>
</HPCVIEW>
Callouts on the slide:
Heading on display (TITLE)
Paths to interesting source directories (PATH)
Metrics defined by platform-independent profile files (METRIC, FILE)
Expression for derived metric (COMPUTE)
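
The derived metric in the configuration file divides cycles (pcc) by FP instructions (fp). A minimal Python sketch of that computation, using hypothetical per-loop sample counts; returning None when no FP instructions were sampled is a choice made for this sketch, not documented hpcview behavior.

```python
def cycles_per_flop(metrics):
    """Evaluate the derived metric pcc / fp for one scope."""
    if metrics["fp"] == 0:
        return None  # avoid dividing by zero
    return metrics["pcc"] / metrics["fp"]


# Hypothetical per-loop sample counts, keyed by the metric names
# used in the configuration file.
scopes = {
    "loop@301": {"pcc": 9000, "fp": 3000},
    "loop@318": {"pcc": 4000, "fp": 0},
}
rat = {name: cycles_per_flop(m) for name, m in scopes.items()}
```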
46
Some Uses for HPCToolkit

Identifying unproductive work
where is the program spending its time not performing FLOPs?
Memory hierarchy issues
bandwidth utilization: misses x line size / cycles
exposed latency: ideal vs. measured
Cross-architecture or compiler comparisons
what program features cause performance differences?
Gap between peak and observed performance
loop balance vs. machine balance?
Evaluating load balance in a parallelized code
how do profiles for different processes compare?
47
Assessment of HPCToolkit Functionality

Top down analysis focuses attention where it
belongs
sorted views put the important things first
Integrated browsing interface facilitates
exploration
rich network of connections makes navigation
simple
Hierarchical, loop-level reporting facilitates
analysis
more sensible view when statement-level data is
imprecise
Binary analysis handles multi-lingual
applications and libraries
succeeds where language- and compiler-based tools can't
Sample-based profiling, aggregation and derived
metrics
reduce manual effort in analysis and tuning cycle
Multiple metrics provide a better picture of
performance
Multi-platform data collection
Platform independent analysis tool

48
csprof/csbuild/csview

Collect call stack (gprof-like) profiles of
application binaries without prior arrangement
no special compiler options or link step
any compiled language or mix thereof
Collect call tree annotated with sample counts
at each node in the tree
Build call graph using call tree data
Use viewer for interactive exploration of call
graph profile data
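
The call tree and call graph described above can be sketched in Python as follows. The sample stacks are invented, and the dictionary-based tree is an assumption for illustration; it is not csprof's data structure.

```python
def build_call_tree(stack_samples):
    """Insert each sampled call stack (outermost frame first) into a tree.

    Nodes are dicts: {"count": samples that ended here, "children": {...}}.
    """
    root = {"count": 0, "children": {}}
    for stack in stack_samples:
        node = root
        for frame in stack:
            node = node["children"].setdefault(
                frame, {"count": 0, "children": {}})
        node["count"] += 1
    return root


def call_graph_edges(node, caller=None, edges=None):
    """Collapse the call tree into a set of (caller, callee) edges."""
    edges = set() if edges is None else edges
    for callee, child in node["children"].items():
        if caller is not None:
            edges.add((caller, callee))
        call_graph_edges(child, callee, edges)
    return edges


samples = [("main", "solve", "relax"), ("main", "solve", "relax"),
           ("main", "solve"), ("main", "setup", "relax")]
tree = build_call_tree(samples)
edges = call_graph_edges(tree)
```

In the tree, `relax` appears twice (once under `solve`, once under `setup`) with separate counts; the call graph merges both into a single `relax` node with two incoming edges.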

49
What's Next?