Title: Shadow Profiling: Hiding Instrumentation Costs with Parallelism
Slide 1: Shadow Profiling - Hiding Instrumentation Costs with Parallelism
- Tipp Moseley
- Alex Shye
- Vijay Janapa Reddi
- Dirk Grunwald
- (University of Colorado)
- Ramesh Peri
- (Intel Corporation)
Slide 2: Motivation
- An ideal profiler will:
  - Collect arbitrarily detailed and abundant information
  - Incur negligible overhead
- A real profiler, e.g., using Pin, satisfies condition 1
  - But the cost is high:
    - 3X for BBL counting
    - 25X for loop profiling
    - 50X or higher for memory profiling
- A real profiler, e.g., PMU sampling or code patching, satisfies condition 2
  - But the detail is very coarse
Slide 3: Motivation
[Figure: profiling approaches arranged by detail vs. overhead - sampling tools (VTune, DCPI, OProfile, PAPI, pfmon, Pin Probes) are cheap but coarse; instrumentation tools (Pintools, Valgrind, ATOM) are detailed but slow; Bursty Tracing (Sampled Instrumentation), Novel Hardware, and Shadow Profiling target both detail and low overhead]
Slide 4: Goal
- To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead
- Enable developers to focus on other things
Slide 5: The Big Idea
- Stems from fault-tolerance work on deterministic replication
- Periodically fork(), profile shadow processes
Time | CPU 0           | CPU 1   | CPU 2   | CPU 3
 0   | Orig. (Slice 0) | Slice 0 |         |
 1   | Orig. (Slice 1) | Slice 0 | Slice 1 |
 2   | Orig. (Slice 2) | Slice 0 | Slice 1 | Slice 2
 3   | Orig. (Slice 3) | Slice 3 | Slice 1 | Slice 2
 4   | Orig. (Slice 4) | Slice 3 | Slice 4 | Slice 2
 5   |                 | Slice 3 | Slice 4 |
 6   |                 |         | Slice 4 |

(Assuming instrumentation overhead of 3X: each shadow slice occupies a CPU for three time steps.)
Slide 6: Challenges
- Threads
- Shared Memory
- Asynchronous Interrupts
- System Calls
- JIT overhead
- Overhead vs. Number of CPUs
  - Maximum speedup is the number of CPUs
  - If profiler overhead is 50X, need at least 51 CPUs to run in real time (probably many more)
- Too many complications to ensure deterministic replication
Slide 7: Goal (Revised)
- To create a profiler capable of sampling detailed traces (bursts) with negligible overhead
- Trade abundance for low overhead
- Like SimPoints or SMARTS (but not as smart)
Slide 8: The Big Idea (Revised)
- Do not strive for a full, deterministic replica
- Instead, profile many short, mostly deterministic bursts
- Profile a fixed number of instructions
- Fake it for system calls
- Must not allow the shadow to side-effect the system
Time | CPU 0           | CPU 1   | CPU 2   | CPU 3
 0   | Orig. (Slice 0) | Slice 0 |         | Spyware
 1   | Orig. (Slice 1) | Slice 0 |         | Spyware
 2   | Orig. (Slice 2) | Slice 0 | Slice 1 | Spyware
 3   | Orig. (Slice 3) |         | Slice 1 | Spyware
 4   | Orig. (Slice 4) |         | Slice 1 | Spyware
Slide 9: Design Overview
Slide 10: Design Overview
- Monitor uses Pin Probes (code patching)
- Application runs natively
- Monitor receives a periodic timer signal and decides when to fork()
- After fork(), the child uses PIN_ExecuteAt() functionality to switch Pin from Probe to JIT mode
- Shadow process profiles as usual, except for handling of special cases
- Monitor logs special read() system calls and pipes the results to shadow processes
Slide 11: System Calls
- For SPEC CPU2000, system calls occur around 35 times per second
- Forking after each one puts lots of pressure on CoW pages and the Pin JIT engine
- 95% of dynamic system calls can be safely handled
- Some system calls can be allowed to execute (49%)
  - getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, ...
Slide 12: System Calls
- Some can be replaced, with success assumed (39%)
  - write, ftruncate, writev, unlink, rename, ...
- Some are handled specially, but execution may continue (1.8%)
  - mmap2, open(creat), mmap, mprotect, mremap, fcntl
- read() is special (5.4%)
  - For reads from pipes/sockets, the data must be logged from the original app
  - For reads from files, the file must be closed and reopened after the fork() because the OS file pointer is not duplicated
- ioctl() is special (4.8%)
  - Frequent in perlbmk
  - Behavior is device-dependent; the safest action is to simply terminate the segment and re-fork()
Slide 13: Other Issues
- Shared Memory
  - Disallow writes to shared memory
- Asynchronous Interrupts (userspace signals)
  - Since we are only mostly deterministic, no longer an issue
  - When the main program receives a signal, pass it along to live children
- JIT Overhead
  - After each fork(), it is like Pinning a new program
  - Warmup is too slow
  - Use Persistent Code Caching [CGO'07]
Slide 14: Multithreaded Programs
- Issue: fork() does not duplicate all threads
  - Only the thread that called fork()
- Solution:
  - Barrier all threads in the program and store their CPU state
  - Fork the process and clone new threads for those that were destroyed
    - The address space is identical; only register state was really lost
  - In each new thread, restore the previous CPU state
    - Modified clone() handling in the Pin VM
  - Continue execution; virtualize thread IDs for relevant system calls
Slide 15: Tuning Overhead
- Load
  - Number of active shadow processes
  - Tested 0.125, 0.25, 0.5, 1.0, 2.0
- Sample Size
  - Number of instructions to profile
  - Longer samples for less overhead, more data
  - Shorter samples for more evenly dispersed data
  - Tested 1M, 10M, 100M
Slide 16: Experiments
- Value Profiling
  - Typical overhead 100X
  - Accuracy measured by Difference in Invariance
- Path Profiling
  - Typical overhead 50% - 10X
  - Accuracy measured by the percent of hot paths detected (2% threshold)
- All experiments use the SPEC2000 INT benchmarks with the ref data set
- Arithmetic mean of 3 runs presented
Slide 17: Results - Value Profiling Overhead
- Overhead versus native execution
- Several configurations less than 1%
- Path profiling exhibits similar trends
Slide 18: Results - Value Profiling Accuracy
- All configurations within 7% of a perfect profile
- Lower is better
Slide 19: Results - Path Profiling Accuracy
- Most configurations over 90% accurate
- Higher is better
- Some benchmarks (e.g., 176.gcc, 186.crafty, 187.parser) have millions of paths, but few are hot
Slide 20: Results - Page Fault Increase
- Proportional increase in page faults (Shadow/Native)
Slide 21: Results - Page Fault Rate
- Difference in page faults per second experienced by the native application
Slide 22: Future Work
- Improve stability for multithreaded programs
- Investigate the effects of different persistent code cache policies
- Compare sampling policies
  - Random (current)
  - Phase/event-based
  - Static analysis
- Study convergence
- Apply the technique to
  - Profile-guided optimizations
  - Simulation techniques
Slide 23: Conclusion
- Shadow Profiling allows collection of bursts of detailed traces
- Accuracy is over 90%
- Incurs negligible overhead
  - Often less than 1%
- With increasing numbers of cores, allows developers' focus to shift from profiling to applying optimizations