Transcript and Presenter's Notes

Title: A Pattern-Based Approach to Automated Application Performance Analysis


1
A Pattern-Based Approach to Automated
Application Performance Analysis
  • Nikhil Bhatia, Shirley Moore,
  • Felix Wolf, and Jack Dongarra
  • Innovative Computing Laboratory
  • University of Tennessee
  • {bhatia, shirley, fwolf, dongarra}@cs.utk.edu
  •  
  • Bernd Mohr
  • Zentralinstitut für Angewandte Mathematik
  • Forschungszentrum Jülich
  • b.mohr@fz-juelich.de
  •  

2
KOJAK Project
  • Collaborative research project between
  • University of Tennessee
  • Forschungszentrum Jülich
  • Automatic performance analysis
  • MPI and/or OpenMP applications
  • Parallel communication analysis
  • CPU and memory analysis
  • WWW
  • http://icl.cs.utk.edu/kojak/
  • http://www.fz-juelich.de/zam/kojak/
  • Contact
  • kojak@cs.utk.edu

3
KOJAK Team
  • People
  • Nikhil Bhatia
  • Jack Dongarra
  • Marc-André Hermanns
  • Bernd Mohr
  • Shirley Moore
  • Felix Wolf
  • Brian Wylie
  • Sponsors
  • U.S. Department of Defense
  • U.S. Department of Energy
  • Forschungszentrum Jülich

4
KOJAK / EXPERT Architecture
[Architecture diagram]
  • Semiautomatic instrumentation: source code → OPARI / TAU → instrumented
    source code → compiler / linker (POMP / PMPI libraries) → executable
    linked against the EPILOG and PAPI libraries; run, optionally
    instrumented with DPCL
  • Automatic analysis: EPILOG trace file → EXPERT analyzer (reads the
    trace via EARL) → analysis report → CUBE
  • Manual analysis: EPILOG trace file → trace converter → VTF3 trace
    file → VAMPIR
5
Tracing
  • Recording of individual time-stamped program
    events as opposed to aggregated information
  • Entering and leaving a function
  • Sending and receiving a message
  • Typical event records include
  • Timestamp
  • Process or thread identifier
  • Event type
  • Type-specific information
  • Event trace
  • Sequence of events in chronological order
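A minimal sketch of what such an event record and trace might look like in Python (field names here are illustrative, not the actual EPILOG record layout):

    from dataclasses import dataclass, field

    @dataclass
    class Event:
        time: float        # timestamp
        loc: int           # process or thread identifier
        etype: str         # event type: 'ENTER', 'EXIT', 'SEND', 'RECV', ...
        data: dict = field(default_factory=dict)   # type-specific information

    # An event trace is a chronologically ordered sequence of such records
    trace = [
        Event(0.10, 0, 'ENTER', {'region': 'master'}),
        Event(0.25, 0, 'SEND',  {'dest': 1, 'tag': 42}),
        Event(0.40, 1, 'RECV',  {'src': 0, 'tag': 42}),
        Event(0.55, 0, 'EXIT',  {'region': 'master'}),
    ]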

6
Tracing (2)
Process A (original):
    void master() { ... send(B, tag, buf); ... }

Process A (instrumented; the monitor writes the trace records):
    void master() {
        trace(ENTER, 1); ...
        trace(SEND, B); send(B, tag, buf); ...
        trace(EXIT, 1);
    }

Process B (original):
    void slave() { ... recv(A, tag, buf); ... }

Process B (instrumented):
    void slave() {
        trace(ENTER, 2); ...
        recv(A, tag, buf); trace(RECV, A); ...
        trace(EXIT, 2);
    }
7
Low-level View of Performance Behavior
8
Automatic Performance Analysis
  • Transformation of low-level performance data
  • Take event traces of MPI/OpenMP applications
  • Search for execution patterns
  • Calculate mapping
  • (problem, call path, system resource) → time
  • Display in performance browser

9
EXPERT
  • Offline trace analyzer
  • Input format: EPILOG
  • Transforms traces into compact representation of
    performance behavior
  • Mapping of call paths, processes, or threads into
    metric space
  • Implemented in C++
  • KOJAK 1.0 version was in Python
  • We still maintain a development version in Python
    to validate design changes
  • Uses EARL library to access event trace

10
EARL Library
  • Provides random access to individual events
  • Computes links between corresponding events
  • E.g., From RECV to SEND event
  • Identifies groups of events that represent an
    aspect of the program's execution state
  • E.g., all SEND events of messages in transit at a
    given moment
  • Implemented in C++
  • Makes extensive use of STL
  • Language bindings
  • C++
  • Python
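As a rough illustration, an EARL-like access layer over an in-memory event list might offer random access, link following, and execution-state queries as sketched below (method names and the matching logic are assumptions for this sketch, not the documented EARL interface):

    class TraceAccess:
        def __init__(self, events):
            self.events = events             # chronologically ordered event dicts

        def event(self, pos):
            return self.events[pos]          # random access to an individual event

        def send_of(self, recv_pos):
            # follow the link from a RECV event back to its matching SEND
            recv = self.events[recv_pos]
            for pos in range(recv_pos - 1, -1, -1):
                e = self.events[pos]
                if e['type'] == 'SEND' and e['dest'] == recv['loc'] and e['tag'] == recv['tag']:
                    return pos
            return None

        def in_transit(self, pos):
            # execution-state query: SEND positions whose message has not
            # yet been received before position pos
            matched = {self.send_of(p) for p, e in enumerate(self.events[:pos])
                       if e['type'] == 'RECV'}
            return [p for p, e in enumerate(self.events[:pos])
                    if e['type'] == 'SEND' and p not in matched]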

11
Pattern Specification
  • Pattern
  • Compound event
  • Set of primitive events (constituents)
  • Relationships between constituents
  • Constraints
  • Patterns specified as C++ classes (also have a
    Python implementation for rapid prototyping)
  • Provides callback method to be called upon
    occurrence of a specific event type in event
    stream (root event)
  • Uses links or state information to find remaining
    constituents
  • Calculates (call path, location) matrix
    containing the time spent on a specific behavior
    in a particular (call path, location) pair
  • Location can be a process or a thread

12
Pattern Specification (2)
  • Profiling patterns
  • Simple profiling information
  • E.g., how much time was spent in MPI calls?
  • Described by pairs of events
  • ENTER and EXIT of certain routine (e.g., MPI)
  • Patterns describing complex inefficiency
    situations
  • Usually described by more than two events
  • e.g., late sender or synchronization before
    all-to-all operations
  • All patterns are arranged in an inclusion
    hierarchy
  • Inclusion of execution-time interval sets
    exhibiting the performance behavior
  • e.g., execution time includes communication time

13
Pattern Hierarchy
14
Basic Search Strategy
  • Register each pattern for specific event type
  • Type of root event
  • Read the trace file once from the beginning to
    the end
  • Depending on the type of the current event
  • Invoke callback method of pattern classes
    registered for it
  • Callback method
  • Accesses additional events to identify remaining
    constituents
  • To do this it may follow links or obtain state
    information
  • Pattern from an implementation viewpoint
  • Set of events held together by links and
    state-set boundaries
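A compact sketch of this single-pass dispatch loop (illustrative only; the production analyzer is written in C++):

    class Analyzer:
        def __init__(self, trace, patterns):
            self.trace = trace
            self.callbacks = {}                  # event type -> registered callbacks
            for p in patterns:
                p.register(self)                 # each pattern subscribes itself

        def subscribe(self, etype, callback):
            self.callbacks.setdefault(etype, []).append(callback)

        def run(self):
            for event in self.trace:             # read the trace once, front to back
                for cb in self.callbacks.get(event['type'], []):
                    cb(event)                    # callback follows links / queries
                                                 # state to find the remaining
                                                 # constituents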

15
Late Sender
[Time-line diagram (location vs. time): process A enters MPI_RECV and sits idle until process B enters MPI_SEND and the message arrives. Legend: ENTER, EXIT, SEND, RECV events; message link.]
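As a sketch, the severity of a Late Sender instance could be derived from the two ENTER timestamps, assuming the receiver blocks from entering MPI_RECV until the sender enters MPI_SEND (the ENTER_SEND key is an assumption mirroring the compound-event dictionaries used in the Python code later in this deck):

    def late_sender_idle_time(ls):
        enter_recv = ls['ENTER_RECV']['time']   # receiver enters MPI_RECV (early)
        enter_send = ls['ENTER_SEND']['time']   # sender enters MPI_SEND (late)
        return max(0.0, enter_send - enter_recv)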
16
Late Sender / Wrong Order
[Time-line diagram (location vs. time, processes A, B, C): the receiver idles in MPI_RECV waiting for one message although another message, sent earlier, is already available, i.e., messages are received in the wrong order. Legend: ENTER, EXIT, SEND, RECV events; message links.]
17
Improved Search Strategy in KOJAK 2
  • Exploit specialization relationships among
    different patterns
  • Pass on compound-event instances from more
    general pattern (class) to more specific pattern
    (class)
  • Along a path in the pattern hierarchy
  • Previous implementation
  • Patterns could register only for primitive events
    (e.g., RECV)
  • New implementation
  • Patterns can publish compound events
  • Patterns can register for primitive events and
    compound events
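A sketch of how such a publish/subscribe chain might work (illustrative; it assumes each compound event carries its type under a 'type' key, which is not necessarily how the C++ implementation does it):

    class Analyzer:
        def __init__(self):
            self.callbacks = {}

        def subscribe(self, etype, callback):
            self.callbacks.setdefault(etype, []).append(callback)

        def notify(self, etype, event):
            for cb in self.callbacks.get(etype, []):
                compound = cb(event)
                if compound is not None:
                    # e.g. P2P turns a RECV into a RECV_OP, LateSender turns a
                    # RECV_OP into a LATE_SEND consumed by MsgsWrongOrderLS
                    self.notify(compound['type'], compound)

        def run(self, trace):
            for event in trace:
                self.notify(event['type'], event)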

18
Pathway of Example Pattern
19
Late-Sender instances are published
class P2P(Pattern):
    ...
    def register(self, analyzer):
        analyzer.subscribe('RECV', self.recv)
    def recv(self, recv):
        ...
        return recv_op

class LateSender(Pattern):
    ...
    def parent(self):
        return "P2P"
    def register(self, analyzer):
        analyzer.subscribe('RECV_OP', self.recv_op)
    def recv_op(self, recv_op):
        if ...:
            return ls
        else:
            return None
20
... and reused
class MsgsWrongOrderLS(Pattern):
    ...
    def parent(self):
        return "LateSender"
    def register(self, analyzer):
        analyzer.subscribe('LATE_SEND', self.late_send)
    def late_send(self, ls):
        pos    = ls['RECV']['pos']
        loc_id = ls['RECV']['loc_id']
        queue  = self._trace.queue(pos, -1, loc_id)
        if queue and queue[0] < ls['SEND']['pos']:
            loc_id   = ls['ENTER_RECV']['loc_id']
            cnode_id = ls['ENTER_RECV']['cnodeptr']
            self._severity.add(cnode_id, loc_id,
                               ls['IDLE_TIME'])
        return None
21
Profiling Patterns
  • Previous implementation: every pattern class did
    three things upon the occurrence of an EXIT event
  • Identify matching ENTER event
  • Filter based on call-path characteristics
  • Accumulate time or counter values
  • Current implementation
  • Do 1. and 3. in a centralized fashion for all
    patterns
  • Do 2. after the end of the trace file has been
    reached, for each pattern separately
  • Once per call path instead of once per call-path
    instance
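A sketch of the centralized matching and accumulation (illustrative names; per-pattern call-path filtering then runs once per call path after the whole trace has been read):

    import collections

    call_stack = collections.defaultdict(list)     # location -> stack of open ENTERs
    time_spent = collections.defaultdict(float)    # (call path, location) -> time

    def on_enter(event):
        call_stack[event['loc_id']].append(event)

    def on_exit(event):
        enter = call_stack[event['loc_id']].pop()           # 1. match the ENTER
        key = (enter['cnodeptr'], event['loc_id'])
        time_spent[key] += event['time'] - enter['time']    # 3. accumulate time

    def finalize(patterns):
        # 2. filter by call-path characteristics, once per call path per pattern
        for pattern in patterns:
            for (cnode, loc), t in time_spent.items():
                if pattern.matches(cnode):
                    pattern.severity.add(cnode, loc, t)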

22
Representation of Performance Behavior
  • Three-dimensional matrix
  • Performance property (pattern)
  • Call tree
  • Process or thread
  • Uniform mapping onto time
  • Each cell contains fraction of
    execution time
    (severity)
  • E.g., waiting time, overhead
  • Each dimension is organized in a hierarchy

[Examples of the three hierarchies: performance property (Execution → Specific Behavior), call tree (Main → Subroutine), system (Machine → SMP Node → Process → Thread)]
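A minimal sketch of this mapping as a sparse three-dimensional matrix (illustrative, not the actual EXPERT data structure):

    # (performance property, call path, location) -> fraction of execution time
    severity = {}

    def add_severity(prop, cnode, location, time):
        key = (prop, cnode, location)
        severity[key] = severity.get(key, 0.0) + time

    # hypothetical example entry
    add_severity('Late Sender', 'main/sweep/MPI_Recv', 'process 3, thread 0', 0.42)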
23
Single-Node Performance in EXPERT
  • How do my processes and threads perform
    individually?
  • CPU performance
  • Memory performance
  • Analysis of parallelism performance
  • Temporal and spatial relationships between
    run-time events
  • Analysis of CPU and memory performance
  • Hardware counters
  • Analysis
  • EXPERT identifies (call path, thread) tuples
    whose occurrence rate of a certain event is above
    or below a certain threshold
  • Use entire execution time of those tuples as
    severity (upper bound)
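A sketch of this threshold test (hypothetical names; the counter values would come from PAPI hardware counters recorded in the trace):

    def low_fp_rate_tuples(tuples, threshold):
        # tuples: iterable of dicts with 'cnode', 'thread', 'fp_ops', 'time'
        for t in tuples:
            rate = t['fp_ops'] / t['time']
            if rate < threshold:                  # FP rate below threshold
                # charge the entire execution time of the tuple as severity,
                # which is an upper bound for the time lost to the problem
                yield (t['cnode'], t['thread'], t['time'])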

24
Profiling Patterns (Examples)
  • Total: execution time including idle threads
  • Execution: execution time
  • CPU and memory performance
  • L1 Data Cache: L1 data miss rate above average
  • Floating Point: FP rate below average
  • FM ratio: FP-to-memory-operation ratio
  • MPI and OpenMP
  • MPI: time spent in MPI API calls
  • OpenMP: time spent in OpenMP API calls
  • Idle Threads: time lost on unused CPUs during
    OpenMP sequential execution
25
Complex Patterns (Samples)
  • MPI
  • Late Sender: blocked receiver
  • Late Receiver: blocked sender
  • Messages in Wrong Order: waiting for new messages
    although older messages are ready
  • Wait at N x N: waiting for the last participant in
    an N-to-N operation
  • Late Broadcast: waiting for the sender in a
    broadcast operation
  • OpenMP
  • Wait at Barrier: waiting time in explicit or
    implicit barriers
  • Lock Synchronization: waiting for a lock owned by
    another thread
26
KOJAK Time Model
[Diagram: threads 0.0 to 0.3 of process 0 and threads 1.0 to 1.3 of process 1 plotted over time; each location's CPU reservation time is divided between the Execution and Idle Threads performance properties.]
27
CUBE Uniform Behavioral Encoding
  • Abstract data model of performance behavior
  • Portable data format (XML)
  • Documented C API to write CUBE files
  • Generic presentation component
  • Performance-data algebra

[Diagram: KOJAK, CONE, TAU, and other performance tools write CUBE (XML) data, which is displayed by the CUBE GUI.]
28
CUBE Data Model
  • Most performance data are mappings of aggregated
    metric values onto program and system resources
  • Performance metrics
  • Execution time, floating-point operations, cache
    misses
  • Program resources (static and dynamic)
  • Functions, call paths
  • System resources
  • Cluster nodes, processes, threads
  • Hierarchical organization of each dimension
  • Inclusion of metrics, e.g., cache misses are a
    subset of memory accesses
  • Source code hierarchy, call tree
  • Nodes hosting processes, processes spawning
    threads

[Diagram: the three dimensions Metric, Program, and System]
29
CUBE GUI
  • Design emphasizes simplicity by combining a small
    number of orthogonal features
  • Three coupled tree browsers
  • Each node labeled with metric value
  • Limited set of actions
  • Selecting a metric / call path
  • Breakdown of aggregated values
  • Expanding / collapsing nodes
  • Collapsed node represents entire subtree
  • Expanded node represents only itself without
    children
  • Scalable because level of detail can be adjusted
  • Separate documentation: http://icl.cs.utk.edu/kojak/cube/

30
CUBE GUI (2)
[Screenshot annotations: Which type of problem? Where in the source code / which call path? Which process or thread? How severe is the problem?]
31
New Patterns for Analysis of Wavefront Algorithms
  • Parallelization scheme used for particle
    transport problems
  • Example: ASCI benchmark SWEEP3D
  • Three-dimensional domain (i,j,k)
  • Two-dimensional domain decomposition (i,j)

DO octants
  DO angles in octant
    DO k planes
      ! block i-inflows
      IF neighbor (E/W) MPI_RECV(E/W)
      ! block j-inflows
      IF neighbor (N/S) MPI_RECV(N/S)
      ... compute grid cell ...
      ! block i-outflows
      IF neighbor (E/W) MPI_SEND(E/W)
      ! block j-outflows
      IF neighbor (N/S) MPI_SEND(N/S)
    END DO k planes
  END DO angles in octant
END DO octants
32
Pipeline Refill
  • Wavefronts from different directions
  • Limited parallelism upon pipeline refill
  • Four new late-sender patterns
  • Refill from NW, NE, SE, SW
  • Definition of these patterns required
  • Topological knowledge
  • Recognition of direction change

33
Addition of Topological Knowledge to KOJAK
  • Idea: map performance data onto topology
  • Detect higher-level events related to the
    parallel algorithm
  • Link occurrences of patterns to such higher-level
    events
  • Visually expose correlations of performance
    problems with topological characteristics
  • Recording of topological information in EPILOG
  • Extension of the data format to include different
    topologies (e.g., Cartesian, graph)
  • MPI wrapper functions for applications using MPI
    topology functions
  • Instrumentation API for applications not using
    MPI topology functions

34
Recognition of Direction Change
  • Maintain a FIFO queue for each process that
    records the directions of messages received
  • Directions calculated using topological
    information
  • Wavefronts propagate along diagonal lines
  • Each wavefront has a horizontal and a vertical
    component, corresponding to one of the receive/send
    pairs in the sweep() routine
  • Two potential wait states at the moment of a
    direction change, each resulting from one of the
    two receive statements
  • Specialization of late sender pattern
  • No assumptions about specifics of the computation
    performed, so applicable to a broad range of
    wavefront algorithms
  • Extension to a three-dimensional data
    decomposition should be straightforward
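A sketch of the direction bookkeeping (illustrative; directions are derived from the Cartesian coordinates recorded in the trace):

    import collections

    # per-process FIFO of the directions of recently received messages
    recv_directions = collections.defaultdict(collections.deque)

    def direction(sender_coord, receiver_coord):
        # sign of the coordinate difference in the 2-D (i, j) decomposition
        di = receiver_coord[0] - sender_coord[0]
        dj = receiver_coord[1] - sender_coord[1]
        return ((di > 0) - (di < 0), (dj > 0) - (dj < 0))

    def on_receive(process, sender_coord, receiver_coord):
        d = direction(sender_coord, receiver_coord)
        dirs = recv_directions[process]
        changed = bool(dirs) and dirs[-1] != d    # wavefront changed direction:
        dirs.append(d)                            # candidate for a refill pattern
        return changed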

35
New Topology Display
  • Exposes the correlation of wait states identified
    by pattern analysis with the topological
    characteristics of the affected processes by
    visually mapping their severity onto the virtual
    topology
  • The figure below shows a rendering of the
    distribution of late-sender times for pipeline
    refill from the North-West (i.e., the upper left
    corner).
  • The corner reached by the wavefront last incurs
    most of the waiting time, whereas processes closer
    to the origin of the wavefront incur less.

36
Future Work
  • Definition of new patterns for detecting
    inefficient program behavior
  • Based on hardware counter metrics (including
    derived metrics) and routine and loop level
    profile data
  • Based on combined analysis of profile and trace
    data
  • Architecture-specific patterns, e.g.,
    topology-based, Cray X1
  • Patterns related to algorithmic classes (similar
    to wavefront approach)
  • Power consumption/temperature
  • More scalable trace file analysis
  • Parallel/distributed approach to pattern analysis
  • Online analysis

37
EXPERT MPI Patterns
  • MPI
  • Time spent on MPI calls.
  • Communication
  • Time spent on MPI calls used for communication.
  • Collective
  • Time spent on collective communication.
  • Early Reduce
  • Collective communication operations that send
    data from all processes to one destination
    process (i.e., n-to-1) may suffer from waiting
    times if the destination process enters the
    operation earlier than its sending counterparts,
    that is, before any data could have been sent.
    The property refers to the time lost as a result
    of that situation.
  • Late Broadcast
  • Collective communication operations that send
    data from one source process to all processes
    (i.e., 1-to-n) may suffer from waiting times if
    destination processes enter the operation earlier
    than the source process, that is, before any data
    could have been sent. The property refers to the
    time lost as a result of that situation.

38
EXPERT MPI Patterns (2)
  • Wait at N x N
  • Collective communication operations that send
    data from all processes to all processes (i.e.,
    n-to-n) exhibit an inherent synchronization among
    all participants, that is, no process can finish
    the operation until the last process has started.
    The time until all processes have entered the
    operation is measured and used to compute the
    severity.
  • Point to Point
  • Time spent on point-to-point communication.
  • Late Receiver
  • A send operation is blocked until the
    corresponding receive operation is called. This
    can happen for several reasons. Either the MPI
    implementation is working in synchronous mode by
    default or the size of the message to be sent
    exceeds the available MPI-internal buffer space
    and the operation is blocked until the data is
    transferred to the receiver.

39
EXPERT MPI Patterns (3)
  • Messages in Wrong Order (Late Receiver)
  • A Late Receiver situation may be the result of
    messages that are sent in the wrong order. If a
    process sends messages to processes that are not
    ready to receive them, the sender's MPI-internal
    buffer may overflow so that from then on the
    process needs to send in synchronous mode causing
    a Late Receiver situation.
  • Late Sender
  • Refers to the time wasted when a call to a
    blocking receive operation (e.g., MPI_Recv or
    MPI_Wait) is posted before the corresponding send
    operation has been started.
  • Messages in Wrong Order (Late Sender)
  • A Late Sender situation may be the result of
    messages that are received in the wrong order. If
    a process expects messages from one or more
    processes in a certain order while these
    processes are sending them in a different order,
    the receiver may need to wait longer for a
    message because this message may be sent later
    while messages sent earlier are ready to be
    received.
  • IO (MPI)
  • Time spent on MPI file IO.

40
EXPERT MPI Patterns (4)
  • Synchronization (MPI)
  • Time spent on MPI barrier synchronization.
  • Wait at Barrier (MPI)
  • This covers the time spent on waiting in front of
    an MPI barrier. The time until all processes have
    entered the barrier is measured and used to
    compute the severity.

41
EXPERT OpenMP Patterns
  • OpenMP
  • Time spent on the OpenMP run-time system.
  • Flush (OpenMP)
  • Time spent on flush directives.
  • Fork (OpenMP)
  • Time spent by the master thread on team creation.
  • Synchronization (OpenMP)
  • Time spent on OpenMP barrier or lock
    synchronization. Lock synchronization may be
    accomplished using either API calls or critical
    sections.

42
EXPERT OpenMP Patterns (2)
  • Barrier (OpenMP)
  • The time spent on implicit (compiler-generated)
    or explicit (user-specified) OpenMP barrier
    synchronization. As already mentioned, implicit
    barriers are treated similarly to explicit ones.
    The instrumentation procedure replaces an
    implicit barrier with an explicit barrier
    enclosed by the parallel construct. This is done
    by adding a nowait clause and a barrier directive
    as the last statement of the parallel construct.
    In cases where the implicit barrier cannot be
    removed (i.e., parallel region), the explicit
    barrier is executed in front of the implicit
    barrier, which will be negligible because the
    team will already be synchronized when reaching
    it. The synthetic explicit barrier appears in the
    display as a special implicit barrier construct.
  • Explicit (OpenMP)
  • Time spent on explicit OpenMP barriers.
  • Implicit (OpenMP)
  • Time spent on implicit OpenMP barriers.
  • Wait at Barrier (Explicit)
  • This covers the time spent on waiting in front of
    an explicit (user-specified) OpenMP barrier. The
    time until all processes have entered the barrier
    is measured and used to compute the severity.

43
EXPERT OpenMP Patterns (3)
  • Wait at Barrier (Implicit)
  • This covers the time spent on waiting in front of
    an implicit (compiler-generated) OpenMP barrier.
    The time until all processes have entered the
    barrier is measured and used to compute the
    severity.
  • Lock Competition (OpenMP)
  • This property refers to the time a thread spends
    waiting for a lock previously acquired by another
    thread.
  • API (OpenMP)
  • Lock competition caused by OpenMP API calls.
  • Critical (OpenMP)
  • Lock competition caused by critical sections.
  • Idle Threads
  • Idle times caused by sequential execution before
    or after an OpenMP parallel region.