Title: Shadow Profiling: Hiding Instrumentation Costs with Parallelism
Slide 1: Shadow Profiling - Hiding Instrumentation Costs with Parallelism
- Tipp Moseley
- Alex Shye
- Vijay Janapa Reddi
- Dirk Grunwald
- (University of Colorado)
- Ramesh Peri
- (Intel Corporation)
Slide 2: Motivation
- An ideal profiler will:
  - Collect arbitrarily detailed and abundant information
  - Incur negligible overhead
- A real profiler, e.g., using Pin, satisfies condition 1
  - But the cost is high:
    - 3X for BBL counting
    - 25X for loop profiling
    - 50X or higher for memory profiling
- A real profiler, e.g., PMU sampling or code patching, satisfies condition 2
  - But the detail is very coarse
Slide 3: Motivation
[Figure: profiling approaches arranged by detail vs. overhead - sampling tools (VTune, DCPI, OProfile, PAPI, pfmon, Pin Probes) are cheap but coarse; instrumentation tools (Pintools, Valgrind, ATOM) are detailed but slow; Bursty Tracing (Sampled Instrumentation), Novel Hardware, and Shadow Profiling target both detail and low overhead]
Slide 4: Goal
- To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead
- Enable developers to focus on other things
Slide 5: The Big Idea
- Stems from fault-tolerance work on deterministic replication
- Periodically fork(), profile shadow processes
Time | CPU 0           | CPU 1   | CPU 2   | CPU 3
 0   | Orig. (Slice 0) | Slice 0 |         |
 1   | Orig. (Slice 1) | Slice 0 | Slice 1 |
 2   | Orig. (Slice 2) | Slice 0 | Slice 1 | Slice 2
 3   | Orig. (Slice 3) | Slice 3 | Slice 1 | Slice 2
 4   | Orig. (Slice 4) | Slice 3 | Slice 4 | Slice 2
 5   |                 | Slice 3 | Slice 4 |
 6   |                 |         | Slice 4 |

(Assuming instrumentation overhead of 3X: each shadow slice occupies a CPU for three time steps.)
Slide 6: Challenges
- Threads
- Shared Memory
- Asynchronous Interrupts
- System Calls
- JIT overhead
- Overhead vs. Number of CPUs
  - Maximum speedup is the number of CPUs
  - If profiler overhead is 50X, need at least 51 CPUs to run in real time (probably many more)
- Too many complications to ensure deterministic replication
Slide 7: Goal (Revised)
- To create a profiler capable of sampling detailed traces (bursts) with negligible overhead
- Trade abundance for low overhead
- Like SimPoints or SMARTS (but not as smart)
Slide 8: The Big Idea (Revised)
- Do not strive for a full, deterministic replica
- Instead, profile many short, mostly deterministic bursts
- Profile a fixed number of instructions
- Fake it for system calls
- Must not allow the shadow to side-effect the system
Time | CPU 0           | CPU 1   | CPU 2   | CPU 3
 0   | Orig. (Slice 0) | Slice 0 |         | Spyware
 1   | Orig. (Slice 1) | Slice 0 |         | Spyware
 2   | Orig. (Slice 2) | Slice 0 | Slice 1 | Spyware
 3   | Orig. (Slice 3) |         | Slice 1 | Spyware
 4   | Orig. (Slice 4) |         | Slice 1 | Spyware
Slide 9: Design Overview
Slide 10: Design Overview
- Monitor uses Pin Probes (code patching)
- Application runs natively
- Monitor receives a periodic timer signal and decides when to fork()
- After fork(), the child uses PIN_ExecuteAt() functionality to switch Pin from Probe to JIT mode
- Shadow process profiles as usual, except for handling of special cases
- Monitor logs special read() system calls and pipes the results to shadow processes
Slide 11: System Calls
- For SPEC CPU2000, system calls occur around 35 times per second
- Forking after each one puts lots of pressure on CoW pages and the Pin JIT engine
- 95% of dynamic system calls can be safely handled
- Some system calls can be allowed to execute (49%)
  - getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, ...
Slide 12: System Calls
- Some can be replaced, with success assumed (39%)
  - write, ftruncate, writev, unlink, rename, ...
- Some are handled specially, but execution may continue (1.8%)
  - mmap2, open(creat), mmap, mprotect, mremap, fcntl
- read() is special (5.4%)
  - For reads from pipes/sockets, the data must be logged from the original app
  - For reads from files, the file must be closed and reopened after the fork() because the OS file pointer is not duplicated
- ioctl() is special (4.8%)
  - Frequent in perlbmk
  - Behavior is device-dependent; the safest action is to simply terminate the segment and re-fork()
Slide 13: Other Issues
- Shared Memory
  - Disallow writes to shared memory
- Asynchronous Interrupts (userspace signals)
  - Since we are only mostly deterministic, no longer an issue
  - When the main program receives a signal, pass it along to live children
- JIT Overhead
  - After each fork(), it is like Pinning a new program
  - Warmup is too slow
  - Use Persistent Code Caching [CGO'07]
Slide 14: Multithreaded Programs
- Issue: fork() does not duplicate all threads
  - Only the thread that called fork()
- Solution:
  - Barrier all threads in the program and store their CPU state
  - Fork the process and clone new threads for those that were destroyed
    - The address space is identical; only register state was really lost
  - In each new thread, restore the previous CPU state
    - Modified clone() handling in the Pin VM
  - Continue execution; virtualize thread IDs for relevant system calls
Slide 15: Tuning Overhead
- Load
  - Number of active shadow processes
  - Tested 0.125, 0.25, 0.5, 1.0, 2.0
- Sample Size
  - Number of instructions to profile
  - Longer samples for less overhead, more data
  - Shorter samples for more evenly dispersed data
  - Tested 1M, 10M, 100M
Slide 16: Experiments
- Value Profiling
  - Typical overhead 100X
  - Accuracy measured by Difference in Invariance
- Path Profiling
  - Typical overhead 50% - 10X
  - Accuracy measured by the percent of hot paths detected (2% threshold)
- All experiments use the SPEC2000 INT benchmarks with the ref data set
- Arithmetic mean of 3 runs presented
Slide 17: Results - Value Profiling Overhead
- Overhead versus native execution
- Several configurations less than 1%
- Path profiling exhibits similar trends
Slide 18: Results - Value Profiling Accuracy
- All configurations within 7% of a perfect profile
- Lower is better
Slide 19: Results - Path Profiling Accuracy
- Most configurations over 90% accurate
- Higher is better
- Some benchmarks (e.g., 176.gcc, 186.crafty, 187.parser) have millions of paths, but few are hot
Slide 20: Results - Page Fault Increase
- Proportional increase in page faults (Shadow/Native)
Slide 21: Results - Page Fault Rate
- Difference in page faults per second experienced by the native application
Slide 22: Future Work
- Improve stability for multithreaded programs
- Investigate the effects of different persistent code cache policies
- Compare sampling policies
  - Random (current)
  - Phase/event-based
  - Static analysis
- Study convergence
- Apply the technique to
  - Profile-guided optimizations
  - Simulation techniques
Slide 23: Conclusion
- Shadow Profiling allows collection of bursts of detailed traces
- Accuracy is over 90%
- Incurs negligible overhead
  - Often less than 1%
- With increasing numbers of cores, allows developers' focus to shift from profiling to applying optimizations