1
Scalable Dynamic Instrumentation for Bluegene/L
  • by Gregory Lee

2
Acknowledgements
  • Lawrence Livermore National Laboratory employees
    (the brains behind the project)
    • Dong Ahn, Bronis de Supinski, and Martin Schulz
  • Fellow Student Scholars
    • Steve Ko (University of Illinois) and Barry
      Rountree (University of Georgia)
  • Advisor
    • Allan Snavely

3
Bluegene/L
  • 64K compute nodes, two 700 MHz processors per
    node
  • Compute nodes run a custom lightweight kernel:
    no multitasking, limited system calls
  • Dedicated I/O nodes for additional system calls
    and external communication
  • 1,024 I/O nodes with the same architecture as the
    compute nodes

4
Why Dynamic Instrumentation?
  • Parallel applications often have long compile
    times and even longer run times
  • Typical method (printf or log file)
    • Modify the code to output intermediate steps,
      recompile, rerun
  • Static code instrumentation?
    • Need to restart the application
    • Where to put the instrumentation?
  • Dynamic instrumentation
    • No need to recompile or rerun the application
    • Instrumentation can be easily added and removed

5
DynInst
  • API that allows a tool to insert code snippets
    into a running application
  • Interfaces with debugger facilities (e.g., ptrace)
  • Machine-independent interface via abstract syntax
    trees
  • Uses trampoline code
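
  As a rough illustration of the API, here is a minimal sketch of a
  DynInst mutator that attaches to a running process and inserts a
  counter increment at a function's entry point. The executable path,
  PID, and function name ("compute") are illustrative, and the exact
  class and method signatures vary across DynInst versions.

    #include "BPatch.h"
    #include "BPatch_process.h"
    #include "BPatch_image.h"
    #include "BPatch_function.h"
    #include "BPatch_point.h"
    #include "BPatch_snippet.h"

    int main() {
        BPatch bpatch;                       // DynInst library singleton

        // Attach to a running mutatee (path and PID are illustrative).
        BPatch_process *proc = bpatch.processAttach("./a.out", 12345);
        BPatch_image *image = proc->getImage();

        // Look up the target function in the mutatee's image.
        BPatch_Vector<BPatch_function *> funcs;
        image->findFunction("compute", funcs);

        // Build a snippet (an abstract syntax tree): counter = counter + 1.
        BPatch_variableExpr *counter = proc->malloc(*image->findType("int"));
        BPatch_arithExpr addOne(BPatch_assign, *counter,
            BPatch_arithExpr(BPatch_plus, *counter, BPatch_constExpr(1)));

        // Insert the snippet at the function's entry point; DynInst
        // generates the trampoline code behind the scenes.
        BPatch_Vector<BPatch_point *> *entry = funcs[0]->findPoint(BPatch_entry);
        proc->insertSnippet(addOne, *entry);

        proc->continueExecution();
        return 0;
    }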

6
DynInst Trampoline Code
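
  This slide's figure shows the control flow DynInst generates: the
  instruction at the instrumentation point is replaced with a jump to a
  base trampoline, which saves state, runs the snippet in a
  mini-trampoline, executes the relocated instruction, and jumps back.
  The sketch below only mimics that control flow with ordinary function
  pointers (no actual binary rewriting), purely to illustrate the idea.

    #include <cstdio>

    // Hypothetical stand-ins: in real DynInst these are machine-code
    // sequences patched into the running binary, not C++ functions.
    static long snippet_hits = 0;

    void original_work() { std::printf("doing work\n"); }

    void mini_trampoline() {        // runs the instrumentation snippet
        ++snippet_hits;
    }

    void base_trampoline() {
        // (save registers -- implicit in C++ here)
        mini_trampoline();          // jump to the mini-trampoline (snippet)
        original_work();            // execute the relocated instruction(s)
        // (restore registers and jump back past the patched point)
    }

    // The "patched" call site: control that used to reach original_work()
    // now flows through the trampoline chain instead.
    void (*instrumented_work)() = base_trampoline;

    int main() {
        for (int i = 0; i < 3; ++i) instrumented_work();
        std::printf("snippet executed %ld times\n", snippet_hits);
        return 0;
    }
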
7
Dynamic Probe Class Library (DPCL)
  • API built on DynInst
  • Higher level of abstraction
  • Instrumentation expressed as probes built from
    C-like expressions
  • Adds the ability to instrument multiple processes
    in a parallel application
  • Allows data to be sent from the application back
    to the tool
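
  A rough sketch of the tool side of a DPCL session, following the flow
  in the DPCL tutorial cited on slide 47: initialize, connect and attach
  to a process, walk its source hierarchy, install probes, and enter the
  event loop. The hostname and PID are illustrative, and the method
  names here should be treated as approximate rather than exact.

    #include <dpcl.h>   // DPCL tool-side API

    int main() {
        Ais_initialize();                  // initialize the DPCL runtime

        // Connect and attach to one process of the parallel application
        // (hostname and PID are illustrative).
        Process p("node001", 12345);
        AisStatus sts = p.bconnect();      // blocking connect to the daemon
        sts = p.battach();                 // gain control of the process

        // Walk the source hierarchy: program -> modules -> functions.
        SourceObj program = p.get_program_object();
        for (int m = 0; m < program.child_count(); ++m) {
            SourceObj module = program.child(m);
            module.bexpand(p);             // expand the module to see functions
            // ...locate instrumentation points, build ProbeExp probes, and
            // install/activate them on the process (blocking b* calls)...
        }

        Ais_main_loop();                   // service callbacks and probe data
        return 0;
    }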

8
Original DPCL structure
9
Scaling Issues with DPCL
  • Tool requires one socket each per super daemon
    and daemon
    • Strains system limits
    • Huge bottleneck at the tool
  • BG/L compute nodes can't support daemons
    • No multitasking
    • The app normally sends data to the daemon via
      shared memory on the same node

10
Multicast Reduction Network
  • Developed at the University of Wisconsin-Madison
  • Creates a tree-topology network between the front
    end tool and the compute node application
    processes
  • Scalable multicast from the front end to the
    compute nodes
  • Upstream, downstream, and synchronization filters
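
  A hedged sketch of what an MRNet front end looks like: build the tree
  from a topology file, open a stream over all back-ends with a built-in
  sum filter, multicast a request, and receive a single aggregated
  reply. Constructor and constant names changed across MRNet releases,
  so treat them as approximate; the topology file and back-end
  executable names are illustrative.

    #include "mrnet/MRNet.h"
    using namespace MRN;

    int main() {
        // Instantiate the tree from a topology file; MRNet launches the
        // internal processes and the back-end executable on the leaves.
        const char *backend_argv[] = { NULL };
        Network *net = new Network("topology.top", "tool_backend",
                                   backend_argv);

        // One stream over all back-ends, with a built-in upstream sum
        // filter and a wait-for-all synchronization filter.
        Communicator *comm = net->get_BroadcastCommunicator();
        Stream *stream = net->new_Stream(comm, TFILTER_SUM,
                                         SFILTER_WAITFORALL);

        // Multicast a request down the tree...
        int tag = 100;                     // illustrative message tag
        stream->send(tag, "%d", 1);
        stream->flush();

        // ...and receive one already-aggregated reply: the sum filter
        // combines the back-end values at each internal tree node.
        int sum = 0;
        PacketPtr pkt;
        stream->recv(&tag, pkt);
        pkt->unpack("%d", &sum);

        delete net;
        return 0;
    }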

11
Typical Tool Topology
12
Tool Topology with MRNet
13
DynInst Modifications
  • Perform instrumentation remotely from the CNs, on
    BG/L's I/O nodes
  • Instrument multiple processes per daemon
  • Interface with CIOD (BG/L's control and I/O
    daemon)

14
DPCL Front End Modifications
  • Move from a process-oriented to an
    application-oriented view
  • Job launch via Launchmon
    • Starts the application and daemon processes
    • Gathers process information (e.g., PIDs and
      hostnames)
  • Statically link the application to the runtime
    library

15
DPCL Daemon Modifications
  • Removed super daemons
  • Commands processed through a DPCL filter
  • Ability to instrument multiple processes
    • A single message is de-multiplexed for all
      processes (see the sketch below)
    • Callbacks must cycle through the daemon's
      ProcessD objects
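
  A purely illustrative sketch of that de-multiplexing (the Command and
  ProcessD types here are hypothetical stand-ins, not the real daemon
  classes): one command from the front end is applied to every process
  the daemon manages, and callback servicing likewise cycles through all
  of the daemon's ProcessD objects.

    #include <vector>

    // Hypothetical stand-ins for the daemon's internal types; the real
    // DPCL daemon's ProcessD interface differs.
    struct Command { int type; };

    struct ProcessD {
        int pid;
        void apply(const Command &c) { (void)c; /* instrument/control pid */ }
        void poll_callbacks()        { /* forward pending data to the tool */ }
    };

    struct Daemon {
        std::vector<ProcessD> procs;   // one entry per local app process

        // A single message from the front end is de-multiplexed to every
        // process this daemon manages.
        void handle(const Command &c) {
            for (ProcessD &p : procs) p.apply(c);
        }

        // Callbacks likewise cycle through all ProcessD objects.
        void service_callbacks() {
            for (ProcessD &p : procs) p.poll_callbacks();
        }
    };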

16
MRNet-based DPCL on BG/L
17
Current State
  • DynInst not yet fully functional on BG/L
    • Can control application processes, but not yet
      instrument them
  • Ported to the Multiprogrammatic Capability
    Cluster (MCR)
    • 1,152 dual-processor nodes with 2.4 GHz
      Pentium 4 Xeon processors
    • 11 teraflops
    • Linux cluster
    • Fully functional DynInst

18
Performance Tests
  • uBG/L
    • 1,024 nodes (one rack) of BG/L
    • 8 compute nodes per I/O node
  • MCR
  • Latency test
  • Throughput test
  • DPCL performance

19
MRNet Latency Test
  • Created an MRNet communicator containing a single
    compute node
  • The front end sends a packet to the compute node
    and awaits an ack
  • Average over 1,000 send-message / receive-ack
    pairs
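
  A sketch of the measurement loop, with send_packet/recv_ack as
  hypothetical stand-ins for the actual MRNet stream calls:

    #include <chrono>
    #include <cstdio>

    // Hypothetical stand-ins for the MRNet stream calls.
    void send_packet() { /* stub: would call the stream's send/flush */ }
    void recv_ack()    { /* stub: would block on the stream's recv */ }

    int main() {
        const int iterations = 1000;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            send_packet();                 // front end -> compute node
            recv_ack();                    // wait for the compute node's ack
        }
        auto end = std::chrono::steady_clock::now();
        double total_us =
            std::chrono::duration<double, std::micro>(end - start).count();
        std::printf("average round-trip latency: %.2f us\n",
                    total_us / iterations);
        return 0;
    }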

20
MRNet Latency Results
21
MRNet Throughput Test
  • Each compute node sends a fixed number of packets
    to the front end
    • 100 to 1,000 packets, in intervals of 100
      packets
    • With and without a sum filter
  • Each data point represents the best of at least 5
    runs
    • Avoids system noise
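
  A sketch of the front-end side of this measurement; receive_all_packets
  is a hypothetical stand-in for draining the stream (one aggregated
  packet with the sum filter, or one packet per compute node without it),
  and the best of five timed runs is kept for each packet count:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>

    // Hypothetical stand-in: block until the expected data has reached
    // the front end.
    void receive_all_packets(int packets_per_node) { (void)packets_per_node; }

    int main() {
        const int runs = 5;   // best of at least 5 runs to avoid system noise
        for (int packets = 100; packets <= 1000; packets += 100) {
            double best_s = 1e30;
            for (int r = 0; r < runs; ++r) {
                auto t0 = std::chrono::steady_clock::now();
                receive_all_packets(packets);
                auto t1 = std::chrono::steady_clock::now();
                best_s = std::min(best_s,
                    std::chrono::duration<double>(t1 - t0).count());
            }
            std::printf("%4d packets/node: best time %.6f s\n",
                        packets, best_s);
        }
        return 0;
    }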

22
MCR one process per CN MRNet topology
23
MCR MRNet One Proc Throughput
24
MCR One Proc Filter Speedup
25
MCR two processes per CN MRNet topology
26
MCR MRNet Two Procs Throughput
27
MCR Two Proc Filter Speedup
28
uBG/L MRNet topology
29
uBG/L MRNet Throughput Results
30
uBG/L Filter Speedups
31
MRNet Performance Conclusions
  • Moderate latency
  • Scalability
    • Scales well for most test cases
    • Some problems at extreme points
    • A smart DPCL tool would not place this much
      stress on communication
  • Filters very effective
    • For balanced tree topologies
    • For large numbers of nodes

32
MRNet DPCL Performance Tests
  • Tests on both uBG/L and MCR
  • 3 tests
    • Simple DPCL command latency
    • Master daemon optimization results
    • Attach latency

33
Blocking Command Latency
  • Time to construct and send a simple command and
    receive all acks from the daemons
  • Measures the minimal overhead of sending a
    command

34
Blocking Command Latency - MCR
35
Blocking Command Latency - uBG/L
36
Master Daemon Optimization
  • Some data sent to the tool is redundant
    • Executable information, e.g. module names,
      function names, and instrumentation points
  • Only one daemon needs to send this data
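
  An illustrative sketch of the optimization (function names are
  hypothetical, not the actual daemon code): every daemon still
  acknowledges the command, but only the designated master daemon sends
  the redundant executable information up the tree.

    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the daemon's reporting paths.
    void send_ack() { /* stub: would send an ack up the MRNet tree */ }
    void send_executable_info(const std::vector<std::string> &modules,
                              const std::vector<std::string> &functions) {
        /* stub: would serialize and send module/function/point data */
        (void)modules; (void)functions;
    }

    void report_module_data(bool is_master_daemon,
                            const std::vector<std::string> &modules,
                            const std::vector<std::string> &functions) {
        // Module names, function names, and instrumentation points are the
        // same in every process image, so only one daemon needs to send them.
        if (is_master_daemon)
            send_executable_info(modules, functions);
        send_ack();   // every daemon still acknowledges the command
    }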

37
Optimization Results - MCR
38
Optimization Speedup - MCR
39
Optimization Results - uBG/L
40
Optimization Speedup - uBG/L
41
Attach Latency (Optimized) - MCR
42
Attach Latency (Optimized) - uBG/L
43
DPCL Performance Conclusion
  • Scales very well
  • Optimization benefits most at larger numbers of
    nodes
  • uBG/L shows long pre-attach and attach times
    • Could be 8x worse on the full BG/L
    • Room for optimization

44
Interface Extension Contexts
  • More control over process selection
    • vs. a single process or the entire application
  • Create MPI communicator-like contexts
  • Ability to take advantage of MRNet's filters
  • Can specify a context for any DPCL command
    • Defaults to the world context
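
  A purely hypothetical sketch of how contexts could look to a tool
  writer; every type and method name below is invented for illustration
  and is not the actual extended DPCL interface.

    #include <vector>

    // Invented, illustrative types: a Context selects a subset of the
    // application's processes, much as an MPI communicator selects ranks.
    struct Context {
        std::vector<int> ranks;            // process ranks in this context
    };

    struct Application {
        Context world() const { return Context{}; }   // all processes
        Context create_context(const std::vector<int> &ranks) const {
            return Context{ranks};
        }
        // Any command can optionally take a context; with no explicit
        // context it would default to the world context.
        void install_probe(const Context &ctx /*, probe, point, ... */) {
            (void)ctx;  // would multicast only to ctx.ranks via MRNet,
                        // letting filters aggregate replies from that subset
        }
    };

    int main() {
        Application app;
        Context evens = app.create_context({0, 2, 4, 6});  // illustrative
        app.install_probe(evens);          // instrument only these ranks
        app.install_probe(app.world());    // or the whole application
        return 0;
    }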

45
Interface Extension Results
  • Currently fully implemented and functional
  • Need to perform tests to demonstrate its utility
  • Can utilize MRNet filters
  • Less intrusive on the application

46
Conclusion
  • More application-oriented view using MRNet
  • Scales well under the tests performed
  • Need to test on larger machines
  • Contexts allow arbitrary placement of
    instrumentation

47
References
  • M. Schulz, D. Ahn, A. Bernat, B. R. de Supinski,
    S. Y. Ko, G. Lee, and B. Rountree. Scalable
    Dynamic Binary Instrumentation for Blue Gene/L.
  • L. DeRose, T. Hoover Jr., and J. K.
    Hollingsworth. The Dynamic Probe Class Library:
    An Infrastructure for Developing Instrumentation
    for Performance Tools.
  • P. Roth, D. Arnold, and B. Miller. MRNet: A
    Software-Based Multicast/Reduction Network for
    Scalable Tools.
  • B. Buck and J. K. Hollingsworth. An API for
    Runtime Code Patching.
  • D. M. Pase. Dynamic Probe Class Library (DPCL)
    Tutorial and Reference Guide. IBM, 1998.

48
  • Questions?
  • Comments?
  • Ideas?