Title: Scalable Dynamic Instrumentation for Blue Gene/L
Acknowledgements
- Lawrence Livermore National Laboratory employees (the brains behind the project): Dong Ahn, Bronis de Supinski, and Martin Schulz
- Fellow Student Scholars: Steve Ko (University of Illinois) and Barry Rountree (University of Georgia)
- Advisor: Allan Snavely
Blue Gene/L
- 64K compute nodes, two 700 MHz processors per node
- Compute nodes run a custom lightweight kernel: no multitasking, limited system calls
- Dedicated I/O nodes handle additional system calls and external communication
- 1024 I/O nodes with the same architecture as the compute nodes
Why Dynamic Instrumentation?
- Parallel applications often have long compile times and even longer run times
- Typical method (printf or a log file)
  - Modify the code to output intermediate steps, recompile, rerun
- Static code instrumentation?
  - Need to restart the application
  - Where to put the instrumentation?
- Dynamic instrumentation
  - No need to recompile or rerun the application
  - Instrumentation can be easily added and removed
DynInst
- API that allows tools to insert code snippets into a running application (see the sketch below)
- Interfaces with debugger facilities (e.g., ptrace)
- Machine-independent interface via abstract syntax trees
- Uses trampoline code
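As a concrete illustration of the API, a minimal DyninstAPI (BPatch) mutator might look like the sketch below. The binary path, PID, and function names (compute, log_entry) are placeholders, exact signatures vary across Dyninst versions, and error checking is omitted.

```cpp
// Minimal DyninstAPI (BPatch) sketch: attach to a running process and
// insert a call to log_entry() at every entry point of compute().
// Path, PID, and function names are placeholders; error checks omitted.
#include "BPatch.h"
#include "BPatch_process.h"
#include "BPatch_image.h"
#include "BPatch_function.h"
#include "BPatch_point.h"
#include "BPatch_snippet.h"
#include <vector>

int main() {
    BPatch bpatch;                      // one BPatch object per mutator
    BPatch_process *proc =
        bpatch.processAttach("/path/to/app", 12345);   // placeholder PID
    BPatch_image *image = proc->getImage();

    // Look up the function to instrument and the probe function to call.
    std::vector<BPatch_function *> targets, probes;
    image->findFunction("compute", targets);
    image->findFunction("log_entry", probes);

    // Build the snippet as an abstract syntax tree: a no-argument call.
    std::vector<BPatch_snippet *> args;
    BPatch_funcCallExpr callProbe(*probes[0], args);

    // Splice the snippet in at the entry point(s) of compute().
    const std::vector<BPatch_point *> *entries =
        targets[0]->findPoint(BPatch_entry);
    proc->insertSnippet(callProbe, *entries);

    proc->continueExecution();          // resume the instrumented mutatee
    proc->detach(true);                 // detach, leaving the probe active
    return 0;
}
```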
DynInst Trampoline Code
Dynamic Probe Class Library (DPCL)
- API built on DynInst
- Higher level of abstraction: instrumentation as probes, built from C-like expressions
- Adds the ability to instrument multiple processes in a parallel application
- Allows data to be sent from the application to the tool (see the sketch below)
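A rough sketch of what a DPCL tool looks like, based on the DPCL tutorial and reference (Pase, 1998). Treat the exact method signatures as assumptions from that documentation; the host name, PID, and probe.o are placeholders.

```cpp
// Rough DPCL tool sketch. Class and method names follow the DPCL
// documentation, but the exact signatures here are assumptions; the host,
// PID, and "probe.o" are placeholders, and error checking is omitted.
#include <dpcl.h>

int main(int argc, char *argv[]) {
    Ais_initialize();                      // set up the DPCL client library

    Process proc("hostname", 12345);       // placeholder host and PID
    proc.bconnect();                       // blocking connect ("b" prefix)

    // Load the object file that contains the probe function.
    ProbeModule pm("probe.o");
    proc.bload_module(&pm);

    // Build a C-like probe expression: call the module's first function.
    ProbeExp probe = pm.get_reference(0).call(0, NULL);

    // Walk the source hierarchy to an instrumentation point (illustrative:
    // first function of the first module, first exclusive point).
    SourceObj program = proc.get_program_object();
    SourceObj module = program.child(0);
    module.bexpand(proc);
    SourceObj function = module.child(0);
    InstPoint point = function.exclusive_point(0);

    // Install and activate the probe in the running process.
    ProbeHandle handle;
    proc.binstall_probe(1, &probe, &point, &handle);
    proc.bactivate_probe(1, &handle);

    Ais_main_loop();                       // service callbacks and probe data
    return 0;
}
```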
Original DPCL Structure
Scaling Issues with DPCL
- The tool requires one socket each per super daemon and daemon
  - Strains system limits
  - Huge bottleneck at the tool
- BG/L compute nodes can't support daemons
  - No multitasking
  - The application normally sends data to a local daemon through shared memory
Multicast Reduction Network (MRNet)
- Developed at the University of Wisconsin-Madison
- Creates a tree topology network between the front-end tool and the compute node application processes
- Scalable multicast from the front end to the compute nodes
- Upstream, downstream, and synchronization filters (see the front-end sketch below)
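For concreteness, a minimal MRNet front end using the built-in sum filter might look like this sketch. The topology file and back-end path are placeholders, and the CreateNetworkFE entry point follows later MRNet releases than the version used in this work.

```cpp
// Minimal MRNet front-end sketch: build the tree from a topology file,
// multicast one request, and receive a single sum-reduced reply.
// "topology.top" and "./backend_exe" are placeholders.
#include "mrnet/MRNet.h"
#include <cstdio>

using namespace MRN;

int main() {
    const char *be_argv[] = { NULL };   // no extra back-end arguments
    Network *net =
        Network::CreateNetworkFE("topology.top", "./backend_exe", be_argv);

    // One stream spanning every back-end, with an upstream integer-sum
    // filter and a synchronization filter that waits for all children.
    Communicator *comm = net->get_BroadcastCommunicator();
    Stream *stream = net->new_Stream(comm, TFILTER_SUM, SFILTER_WAITFORALL);

    int tag = FirstApplicationTag;      // application tags start here
    stream->send(tag, "%d", 1);         // multicast down the tree
    stream->flush();

    PacketPtr pkt;
    int sum = 0;
    stream->recv(&tag, pkt);            // one packet: the reduced total
    pkt->unpack("%d", &sum);
    printf("back-ends reporting (summed in-network): %d\n", sum);

    delete net;                         // tears the tree down
    return 0;
}
```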
Typical Tool Topology
Tool Topology with MRNet
DynInst Modifications
- Perform instrumentation remotely from the compute nodes on BG/L's I/O nodes
- Instrument multiple processes per daemon
- Interface with CIOD
DPCL Front End Modifications
- Move from a process-oriented to an application-oriented view
- Job launch via LaunchMON (see the sketch below)
  - Starts the application and daemon processes
  - Gathers process information (e.g., PIDs, hostnames)
- Statically link the application to the runtime library
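A sketch of the launch step, assuming LaunchMON's later public LMON_fe_* front-end interface; the launcher and daemon paths and argument lists are placeholders.

```cpp
// Launch sketch assuming LaunchMON's LMON_fe_* front-end API: start the
// parallel job and co-locate one tool daemon per node. All paths and
// argument lists are placeholders.
#include <lmon_api/lmon_fe.h>
#include <cstddef>

int main() {
    int session;
    LMON_fe_init(LMON_VERSION);         // version handshake with the library
    LMON_fe_createSession(&session);

    char *launcher_argv[] = { (char *)"srun", (char *)"-n", (char *)"64",
                              (char *)"./app", NULL };       // placeholders
    // Launch the job and spawn "./dpcl_daemon" alongside it.
    LMON_fe_launchAndSpawnDaemons(session, "localhost",
                                  "/usr/bin/srun", launcher_argv,
                                  "./dpcl_daemon", NULL,     // daemon argv
                                  NULL, NULL);               // FE<->BE payloads

    // LMON_fe_getProctable() can then return the hostname/PID pairs the
    // front end needs to map daemons onto application processes.
    return 0;
}
```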
DPCL Daemon Modifications
- Removed super daemons
- Commands processed through a DPCL filter
- Ability to instrument multiple processes (see the sketch below)
  - A single message is de-multiplexed for all processes
  - Callbacks must cycle through the daemon's ProcessD objects
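An illustrative sketch of the de-multiplexing step, with hypothetical Command, ProcessD, and Daemon types standing in for the real daemon code:

```cpp
// Illustrative sketch with hypothetical Command, ProcessD, and Daemon
// types: one command arrives off the tree and is de-multiplexed to every
// process the daemon manages.
#include <vector>

struct Command { int opcode; /* ... payload ... */ };

class ProcessD {                    // per-process state object
public:
    explicit ProcessD(int pid) : pid_(pid) {}
    void apply(const Command &cmd) { (void)cmd; /* instrument/control pid_ */ }
private:
    int pid_;
};

class Daemon {
public:
    void add_process(int pid) { procs_.emplace_back(pid); }

    // One message in, one operation per managed process out.
    void handle(const Command &cmd) {
        for (ProcessD &p : procs_)  // cycle through all ProcessD objects
            p.apply(cmd);
    }
private:
    std::vector<ProcessD> procs_;
};
```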
MRNet-based DPCL on BG/L
Current State
- DynInst not fully functional on BG/L
  - Can control application processes, but not instrument them
- Ported to the Multiprogrammatic Capability Cluster (MCR)
  - 1,152 nodes with dual 2.4 GHz Pentium 4 Xeon processors
  - 11 teraflops
  - Linux cluster
  - Fully functional DynInst
Performance Tests
- uBG/L
  - 1024 nodes (one rack) of BG/L
  - 8 compute nodes per I/O node
- MCR
- Latency test
- Throughput test
- DPCL performance
MRNet Latency Test
- Created an MRNet communicator containing a single compute node
- The front end sends a packet to the compute node and awaits an ack
- Average of 1000 send-message/receive-ack pairs (see the timing sketch below)
MRNet Latency Results
MRNet Throughput Test
- Each compute node sends a fixed number of packets to the front end
  - 100 to 1000 packets in 100-packet intervals
- With and without a sum filter (a sketch of such a filter follows the list)
- Each data point represents the best of at least 5 runs
  - Avoids system noise
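The sum filter here behaves like MRNet's built-in integer sum (TFILTER_SUM). A custom upstream filter with the same effect would look roughly like this; the six-argument signature and the exported format string follow later MRNet releases.

```cpp
// Custom upstream sum filter sketch (the experiments used MRNet's built-in
// TFILTER_SUM; this shows the equivalent logic). The six-argument filter
// signature follows later MRNet releases.
#include "mrnet/MRNet.h"
#include "mrnet/NetworkTopology.h"
#include <vector>

using namespace MRN;

extern "C" {

// MRNet looks up "<filter>_format_string" to type-check packets.
const char *IntSum_format_string = "%d";

void IntSum(std::vector<PacketPtr> &packets_in,
            std::vector<PacketPtr> &packets_out,
            std::vector<PacketPtr> & /* packets_out_reverse */,
            void ** /* filter state */,
            PacketPtr & /* config params */,
            const TopologyLocalInfo & /* topology info */)
{
    // Add up the integer carried by each child's packet...
    int sum = 0;
    for (unsigned int i = 0; i < packets_in.size(); i++) {
        int val;
        packets_in[i]->unpack("%d", &val);
        sum += val;
    }
    // ...and forward a single packet upstream in their place.
    PacketPtr out(new Packet(packets_in[0]->get_StreamId(),
                             packets_in[0]->get_Tag(), "%d", sum));
    packets_out.push_back(out);
}

}  // extern "C"
```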
MCR One Process per CN MRNet Topology
MCR MRNet One Process Throughput
MCR One Process Filter Speedup
MCR Two Processes per CN MRNet Topology
MCR MRNet Two Processes Throughput
MCR Two Processes Filter Speedup
uBG/L MRNet Topology
uBG/L MRNet Throughput Results
uBG/L Filter Speedups
MRNet Performance Conclusions
- Moderate latency
- Scalability
  - Scales well for most test cases
  - Some problems at extreme points
  - A smart DPCL tool would not place this much stress on communication
- Filters are very effective
  - For balanced tree topologies
  - For large numbers of nodes
MRNet DPCL Performance Tests
- Tests on both uBG/L and MCR
- 3 tests
  - Simple DPCL command latency
  - Master daemon optimization results
  - Attach latency
Blocking Command Latency
- Time to construct and send a simple command and receive all acks from the daemons
- Measures the minimal overhead of sending a command
Blocking Command Latency - MCR
Blocking Command Latency - uBG/L
Master Daemon Optimization
- Some data sent to the tool is redundant
  - Executable information, e.g., module names, function names, and instrumentation points
- Only one daemon needs to send this data (see the sketch below)
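A sketch of the idea, with hypothetical types and a hypothetical send_to_tool() helper: in an SPMD job every process runs the same binary, so only the designated master daemon forwards the static executable data.

```cpp
// Illustrative sketch with hypothetical types and a hypothetical
// send_to_tool() helper: every process runs the same binary, so only the
// designated master daemon forwards the static executable information.
#include <string>
#include <vector>

struct ExecutableInfo {
    std::vector<std::string> modules;     // module names
    std::vector<std::string> functions;   // function names
    // ... instrumentation points ...
};

// Hypothetical: pack the info and send it upstream over MRNet.
void send_to_tool(const ExecutableInfo &info) { (void)info; }

void report_executable_info(bool is_master_daemon,
                            const ExecutableInfo &info) {
    if (!is_master_daemon)
        return;              // redundant: the master already sent it
    send_to_tool(info);
}
```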
Optimization Results - MCR
Optimization Speedup - MCR
Optimization Results - uBG/L
Optimization Speedup - uBG/L
Attach Latency (Optimized) - MCR
Attach Latency (Optimized) - uBG/L
DPCL Performance Conclusion
- Scales very well
- The optimization benefits most at larger numbers of nodes
- uBG/L shows long pre-attach and attach times
  - Could be 8x worse on full BG/L
  - Room for optimization
Interface Extension: Contexts
- More control over process selection
  - vs. a single process or the entire application
- Create MPI communicator-like contexts (see the hypothetical sketch below)
- Ability to take advantage of MRNet's filters
- Can specify a context for any DPCL command
  - Defaults to the world context
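A hypothetical sketch of how such contexts could be used; the Context class and install_probe() function below are illustrative names, not the actual extended DPCL interface.

```cpp
// Hypothetical sketch of the context extension; Context and install_probe()
// are illustrative names, not the actual extended DPCL interface.
#include <utility>
#include <vector>

class Context {                        // selects a subset of processes,
public:                                // much like an MPI communicator
    explicit Context(std::vector<int> ranks) : ranks_(std::move(ranks)) {}
    const std::vector<int> &ranks() const { return ranks_; }
private:
    std::vector<int> ranks_;
};

// Any DPCL-style command takes a context; defaulting to a world context
// covering the whole application is omitted here.
void install_probe(const Context &ctx) {
    // Multicast the install request only to daemons owning ctx.ranks(),
    // letting MRNet filters aggregate the acks from just that subset.
    (void)ctx;
}

int main() {
    Context evens({0, 2, 4, 6});       // instrument only the even ranks
    install_probe(evens);
    return 0;
}
```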
Interface Extension Results
- Currently fully implemented and functional
- Still need to perform tests that demonstrate its utility
  - Can utilize MRNet filters
  - Less intrusive on the application
Conclusion
- More application-oriented view using MRNet
- Scales well under the tests performed
  - Need to test on larger machines
- Contexts allow arbitrary placement of instrumentation
References
- M. Schulz, D. Ahn, A. Bernat, B. R. de Supinski, S. Y. Ko, G. Lee, and B. Rountree. Scalable Dynamic Binary Instrumentation for Blue Gene/L.
- L. DeRose, T. Hoover Jr., and J. K. Hollingsworth. The Dynamic Probe Class Library: An Infrastructure for Developing Instrumentation for Performance Tools.
- P. Roth, D. Arnold, and B. Miller. MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools.
- B. Buck and J. K. Hollingsworth. An API for Runtime Code Patching.
- D. M. Pase. Dynamic Probe Class Library (DPCL) Tutorial and Reference Guide. IBM, 1998.
- Questions?
- Comments?
- Ideas?