Title: Scalable Dynamic Instrumentation for Blue Gene/L
Acknowledgements
- Lawrence Livermore National Laboratory employees (the brains behind the project): Dong Ahn, Bronis de Supinski, and Martin Schulz
- Fellow Student Scholars: Steve Ko (University of Illinois) and Barry Rountree (University of Georgia)
- Advisor: Allan Snavely
Blue Gene/L
- 64K compute nodes, two 700 MHz processors per node
- Compute nodes run a custom lightweight kernel: no multitasking, limited system calls
- Dedicated I/O nodes handle additional system calls and external communication
- 1024 I/O nodes with the same architecture as the compute nodes
Why Dynamic Instrumentation?
- Parallel applications often have long compile times and even longer run times
- Typical method (printf or a log file)
  - Modify the code to output intermediate steps, recompile, rerun
- Static code instrumentation?
  - Need to restart the application
  - Where to put the instrumentation?
- Dynamic instrumentation
  - No need to recompile or rerun the application
  - Instrumentation can be easily added and removed
DynInst
- API that allows tools to insert code snippets into a running application (see the sketch below)
- Interfaces with debugger facilities (e.g., ptrace)
- Machine-independent interface via abstract syntax trees
- Uses trampoline code
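As a concrete illustration of the API, a minimal DyninstAPI (BPatch) mutator might look like the sketch below. The binary path, PID, and function names (compute, log_entry) are placeholders, exact signatures vary across Dyninst versions, and error checking is omitted.

```cpp
// Minimal DyninstAPI (BPatch) sketch: attach to a running process and
// insert a call to log_entry() at every entry point of compute().
// Path, PID, and function names are placeholders; error checks omitted.
#include "BPatch.h"
#include "BPatch_process.h"
#include "BPatch_image.h"
#include "BPatch_function.h"
#include "BPatch_point.h"
#include "BPatch_snippet.h"
#include <vector>

int main() {
    BPatch bpatch;                      // one BPatch object per mutator
    BPatch_process *proc =
        bpatch.processAttach("/path/to/app", 12345);   // placeholder PID
    BPatch_image *image = proc->getImage();

    // Look up the function to instrument and the probe function to call.
    std::vector<BPatch_function *> targets, probes;
    image->findFunction("compute", targets);
    image->findFunction("log_entry", probes);

    // Build the snippet as an abstract syntax tree: a no-argument call.
    std::vector<BPatch_snippet *> args;
    BPatch_funcCallExpr callProbe(*probes[0], args);

    // Splice the snippet in at the entry point(s) of compute().
    const std::vector<BPatch_point *> *entries =
        targets[0]->findPoint(BPatch_entry);
    proc->insertSnippet(callProbe, *entries);

    proc->continueExecution();          // resume the instrumented mutatee
    proc->detach(true);                 // detach, leaving the probe active
    return 0;
}
```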
DynInst Trampoline Code
Dynamic Probe Class Library (DPCL)
- API built on DynInst
- Higher level of abstraction: instrumentation as probes, built from C-like expressions
- Adds the ability to instrument multiple processes in a parallel application
- Allows data to be sent from the application to the tool (see the sketch below)
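A rough sketch of what a DPCL tool looks like, based on the DPCL tutorial and reference (Pase, 1998). Treat the exact method signatures as assumptions from that documentation; the host name, PID, and probe.o are placeholders.

```cpp
// Rough DPCL tool sketch. Class and method names follow the DPCL
// documentation, but the exact signatures here are assumptions; the host,
// PID, and "probe.o" are placeholders, and error checking is omitted.
#include <dpcl.h>

int main(int argc, char *argv[]) {
    Ais_initialize();                      // set up the DPCL client library

    Process proc("hostname", 12345);       // placeholder host and PID
    proc.bconnect();                       // blocking connect ("b" prefix)

    // Load the object file that contains the probe function.
    ProbeModule pm("probe.o");
    proc.bload_module(&pm);

    // Build a C-like probe expression: call the module's first function.
    ProbeExp probe = pm.get_reference(0).call(0, NULL);

    // Walk the source hierarchy to an instrumentation point (illustrative:
    // first function of the first module, first exclusive point).
    SourceObj program = proc.get_program_object();
    SourceObj module = program.child(0);
    module.bexpand(proc);
    SourceObj function = module.child(0);
    InstPoint point = function.exclusive_point(0);

    // Install and activate the probe in the running process.
    ProbeHandle handle;
    proc.binstall_probe(1, &probe, &point, &handle);
    proc.bactivate_probe(1, &handle);

    Ais_main_loop();                       // service callbacks and probe data
    return 0;
}
```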
Original DPCL Structure
Scaling Issues with DPCL
- The tool requires one socket each per super daemon and daemon
  - Strains system limits
  - Huge bottleneck at the tool
- BG/L compute nodes can't support daemons
  - No multitasking
  - The application normally sends data to a local daemon through shared memory
Multicast Reduction Network (MRNet)
- Developed at the University of Wisconsin-Madison
- Creates a tree topology network between the front-end tool and the compute node application processes
- Scalable multicast from the front end to the compute nodes
- Upstream, downstream, and synchronization filters (see the front-end sketch below)
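For concreteness, a minimal MRNet front end using the built-in sum filter might look like this sketch. The topology file and back-end path are placeholders, and the CreateNetworkFE entry point follows later MRNet releases than the version used in this work.

```cpp
// Minimal MRNet front-end sketch: build the tree from a topology file,
// multicast one request, and receive a single sum-reduced reply.
// "topology.top" and "./backend_exe" are placeholders.
#include "mrnet/MRNet.h"
#include <cstdio>

using namespace MRN;

int main() {
    const char *be_argv[] = { NULL };   // no extra back-end arguments
    Network *net =
        Network::CreateNetworkFE("topology.top", "./backend_exe", be_argv);

    // One stream spanning every back-end, with an upstream integer-sum
    // filter and a synchronization filter that waits for all children.
    Communicator *comm = net->get_BroadcastCommunicator();
    Stream *stream = net->new_Stream(comm, TFILTER_SUM, SFILTER_WAITFORALL);

    int tag = FirstApplicationTag;      // application tags start here
    stream->send(tag, "%d", 1);         // multicast down the tree
    stream->flush();

    PacketPtr pkt;
    int sum = 0;
    stream->recv(&tag, pkt);            // one packet: the reduced total
    pkt->unpack("%d", &sum);
    printf("back-ends reporting (summed in-network): %d\n", sum);

    delete net;                         // tears the tree down
    return 0;
}
```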
Typical Tool Topology
Tool Topology with MRNet
DynInst Modifications
- Perform instrumentation remotely from the compute nodes on BG/L's I/O nodes
- Instrument multiple processes per daemon
- Interface with CIOD
DPCL Front End Modifications
- Move from a process-oriented to an application-oriented view
- Job launch via LaunchMON (see the sketch below)
  - Starts the application and daemon processes
  - Gathers process information (e.g., PIDs, hostnames)
- Statically link the application to the runtime library
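A sketch of the launch step, assuming LaunchMON's later public LMON_fe_* front-end interface; the launcher and daemon paths and argument lists are placeholders.

```cpp
// Launch sketch assuming LaunchMON's LMON_fe_* front-end API: start the
// parallel job and co-locate one tool daemon per node. All paths and
// argument lists are placeholders.
#include <lmon_api/lmon_fe.h>
#include <cstddef>

int main() {
    int session;
    LMON_fe_init(LMON_VERSION);         // version handshake with the library
    LMON_fe_createSession(&session);

    char *launcher_argv[] = { (char *)"srun", (char *)"-n", (char *)"64",
                              (char *)"./app", NULL };       // placeholders
    // Launch the job and spawn "./dpcl_daemon" alongside it.
    LMON_fe_launchAndSpawnDaemons(session, "localhost",
                                  "/usr/bin/srun", launcher_argv,
                                  "./dpcl_daemon", NULL,     // daemon argv
                                  NULL, NULL);               // FE<->BE payloads

    // LMON_fe_getProctable() can then return the hostname/PID pairs the
    // front end needs to map daemons onto application processes.
    return 0;
}
```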
DPCL Daemon Modifications
- Removed super daemons
- Commands processed through a DPCL filter
- Ability to instrument multiple processes (see the sketch below)
  - A single message is de-multiplexed for all processes
  - Callbacks must cycle through the daemon's ProcessD objects
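An illustrative sketch of the de-multiplexing step, with hypothetical Command, ProcessD, and Daemon types standing in for the real daemon code:

```cpp
// Illustrative sketch with hypothetical Command, ProcessD, and Daemon
// types: one command arrives off the tree and is de-multiplexed to every
// process the daemon manages.
#include <vector>

struct Command { int opcode; /* ... payload ... */ };

class ProcessD {                    // per-process state object
public:
    explicit ProcessD(int pid) : pid_(pid) {}
    void apply(const Command &cmd) { (void)cmd; /* instrument/control pid_ */ }
private:
    int pid_;
};

class Daemon {
public:
    void add_process(int pid) { procs_.emplace_back(pid); }

    // One message in, one operation per managed process out.
    void handle(const Command &cmd) {
        for (ProcessD &p : procs_)  // cycle through all ProcessD objects
            p.apply(cmd);
    }
private:
    std::vector<ProcessD> procs_;
};
```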
MRNet-based DPCL on BG/L
Current State
- DynInst not fully functional on BG/L
  - Can control application processes, but not instrument them
- Ported to the Multiprogrammatic Capability Cluster (MCR)
  - 1,152 nodes with dual 2.4 GHz Pentium 4 Xeon processors
  - 11 teraflops
  - Linux cluster
  - Fully functional DynInst
Performance Tests
- uBG/L
  - 1024 nodes (one rack) of BG/L
  - 8 compute nodes per I/O node
- MCR
- Latency test
- Throughput test
- DPCL performance
MRNet Latency Test
- Created an MRNet communicator containing a single compute node
- The front end sends a packet to the compute node and awaits an ack
- Average of 1000 send-message/receive-ack pairs (see the timing sketch below)
MRNet Latency Results
MRNet Throughput Test
- Each compute node sends a fixed number of packets to the front end
  - 100 to 1000 packets in 100-packet intervals
- With and without a sum filter (a sketch of such a filter follows the list)
- Each data point represents the best of at least 5 runs
  - Avoids system noise
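The sum filter here behaves like MRNet's built-in integer sum (TFILTER_SUM). A custom upstream filter with the same effect would look roughly like this; the six-argument signature and the exported format string follow later MRNet releases.

```cpp
// Custom upstream sum filter sketch (the experiments used MRNet's built-in
// TFILTER_SUM; this shows the equivalent logic). The six-argument filter
// signature follows later MRNet releases.
#include "mrnet/MRNet.h"
#include "mrnet/NetworkTopology.h"
#include <vector>

using namespace MRN;

extern "C" {

// MRNet looks up "<filter>_format_string" to type-check packets.
const char *IntSum_format_string = "%d";

void IntSum(std::vector<PacketPtr> &packets_in,
            std::vector<PacketPtr> &packets_out,
            std::vector<PacketPtr> & /* packets_out_reverse */,
            void ** /* filter state */,
            PacketPtr & /* config params */,
            const TopologyLocalInfo & /* topology info */)
{
    // Add up the integer carried by each child's packet...
    int sum = 0;
    for (unsigned int i = 0; i < packets_in.size(); i++) {
        int val;
        packets_in[i]->unpack("%d", &val);
        sum += val;
    }
    // ...and forward a single packet upstream in their place.
    PacketPtr out(new Packet(packets_in[0]->get_StreamId(),
                             packets_in[0]->get_Tag(), "%d", sum));
    packets_out.push_back(out);
}

}  // extern "C"
```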
MCR One Process per CN MRNet Topology
MCR MRNet One Process Throughput
MCR One Process Filter Speedup
MCR Two Processes per CN MRNet Topology
MCR MRNet Two Processes Throughput
MCR Two Processes Filter Speedup
uBG/L MRNet Topology
uBG/L MRNet Throughput Results
uBG/L Filter Speedups
MRNet Performance Conclusions
- Moderate latency
- Scalability
  - Scales well for most test cases
  - Some problems at extreme points
  - A smart DPCL tool would not place this much stress on communication
- Filters are very effective
  - For balanced tree topologies
  - For large numbers of nodes
MRNet DPCL Performance Tests
- Tests on both uBG/L and MCR
- 3 tests
  - Simple DPCL command latency
  - Master daemon optimization results
  - Attach latency
Blocking Command Latency
- Time to construct and send a simple command and receive all acks from the daemons
- Measures the minimal overhead of sending a command
Blocking Command Latency - MCR
Blocking Command Latency - uBG/L
Master Daemon Optimization
- Some data sent to the tool is redundant
  - Executable information, e.g., module names, function names, and instrumentation points
- Only one daemon needs to send this data (see the sketch below)
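A sketch of the idea, with hypothetical types and a hypothetical send_to_tool() helper: in an SPMD job every process runs the same binary, so only the designated master daemon forwards the static executable data.

```cpp
// Illustrative sketch with hypothetical types and a hypothetical
// send_to_tool() helper: every process runs the same binary, so only the
// designated master daemon forwards the static executable information.
#include <string>
#include <vector>

struct ExecutableInfo {
    std::vector<std::string> modules;     // module names
    std::vector<std::string> functions;   // function names
    // ... instrumentation points ...
};

// Hypothetical: pack the info and send it upstream over MRNet.
void send_to_tool(const ExecutableInfo &info) { (void)info; }

void report_executable_info(bool is_master_daemon,
                            const ExecutableInfo &info) {
    if (!is_master_daemon)
        return;              // redundant: the master already sent it
    send_to_tool(info);
}
```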
Optimization Results - MCR
Optimization Speedup - MCR
Optimization Results - uBG/L
Optimization Speedup - uBG/L
Attach Latency (Optimized) - MCR
Attach Latency (Optimized) - uBG/L
DPCL Performance Conclusion
- Scales very well
- The optimization benefits most at larger numbers of nodes
- uBG/L shows long pre-attach and attach times
  - Could be 8x worse on full BG/L
  - Room for optimization
Interface Extension: Contexts
- More control over process selection
  - vs. a single process or the entire application
- Create MPI communicator-like contexts (see the hypothetical sketch below)
- Ability to take advantage of MRNet's filters
- Can specify a context for any DPCL command
  - Defaults to the world context
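A hypothetical sketch of how such contexts could be used; the Context class and install_probe() function below are illustrative names, not the actual extended DPCL interface.

```cpp
// Hypothetical sketch of the context extension; Context and install_probe()
// are illustrative names, not the actual extended DPCL interface.
#include <utility>
#include <vector>

class Context {                        // selects a subset of processes,
public:                                // much like an MPI communicator
    explicit Context(std::vector<int> ranks) : ranks_(std::move(ranks)) {}
    const std::vector<int> &ranks() const { return ranks_; }
private:
    std::vector<int> ranks_;
};

// Any DPCL-style command takes a context; defaulting to a world context
// covering the whole application is omitted here.
void install_probe(const Context &ctx) {
    // Multicast the install request only to daemons owning ctx.ranks(),
    // letting MRNet filters aggregate the acks from just that subset.
    (void)ctx;
}

int main() {
    Context evens({0, 2, 4, 6});       // instrument only the even ranks
    install_probe(evens);
    return 0;
}
```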
Interface Extension Results
- Currently fully implemented and functional
- Still need to perform tests that demonstrate its utility
  - Can utilize MRNet filters
  - Less intrusive on the application
Conclusion
- More application-oriented view using MRNet
- Scales well under the tests performed
  - Need to test on larger machines
- Contexts allow arbitrary placement of instrumentation
References
- M. Schulz, D. Ahn, A. Bernat, B. R. de Supinski, S. Y. Ko, G. Lee, and B. Rountree. Scalable Dynamic Binary Instrumentation for Blue Gene/L.
- L. DeRose, T. Hoover Jr., and J. K. Hollingsworth. The Dynamic Probe Class Library: An Infrastructure for Developing Instrumentation for Performance Tools.
- P. Roth, D. Arnold, and B. Miller. MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools.
- B. Buck and J. K. Hollingsworth. An API for Runtime Code Patching.
- D. M. Pase. Dynamic Probe Class Library (DPCL) Tutorial and Reference Guide. IBM, 1998.
- Questions?
- Comments?
- Ideas?