Title: Guiding Ispike with Instrumentation and Hardware (PMU) Profiles CGO
1 Guiding Ispike with Instrumentation and Hardware (PMU) Profiles
CGO'04 Tutorial, 3/21/04
- CK. Luk
- chi-keung.luk_at_intel.com
- Massachusetts Microprocessor Design Center
- Intel Corporation
2 What is Ispike?
- A post-link optimizer for Itanium/Linux
  - No source code required
  - Memory-centric optimizations
    - Code layout + prefetching, data layout + prefetching
- Significant speedups over compiler-optimized programs
  - 10% average speedup over gcc -O3 on SPEC CINT2000
- Profile usages
  - Understanding program characteristics
  - Driving optimizations automatically
  - Evaluating the effectiveness of optimizations
3 Profiles used by Ispike
Granularity      Hardware Profiles (pfmon)   Instrumentation Profiles (pin)   Usages
Per inst.        PC sample                   ---                              Identifying hot spots
Per inst. line   I-EAR (I-Cache)             ---                              Inst. prefetching
Per inst. line   I-EAR (I-TLB)               ---                              ---
Per branch       BTB                         Edge profile                     Code layout, data layout, and other opts
Per load         D-EAR (D-Cache)             Load-latency profile             Data prefetching
Per load         D-EAR (D-TLB)               ---                              ---
Per load         D-EAR (stride)              Stride profile                   Data prefetching
4 Profile Example: D-EAR (cache)
[Figure: top 10 loads in the D-EAR profile of the MCF benchmark, showing total sampled miss latency per load, broken down into latency buckets]
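To make the shape of such a profile concrete, here is a minimal sketch of how per-load D-EAR samples could be aggregated into total sampled miss latency and latency buckets, then ranked. The sample format (a list of `(load_pc, miss_latency)` pairs), the bucket boundaries in `BUCKET_BOUNDS`, and both function names are assumptions made for illustration; they are not pfmon's output format or Ispike's code.

```python
from collections import defaultdict

# Hypothetical latency bucket upper bounds in cycles; the last bucket is open-ended.
BUCKET_BOUNDS = (8, 16, 32, 64, 128)


def bucket_label(latency):
    """Return the label of the latency bucket that contains `latency`."""
    lower = 0
    for upper in BUCKET_BOUNDS:
        if latency <= upper:
            return f"{lower + 1}-{upper}"
        lower = upper
    return f">{BUCKET_BOUNDS[-1]}"


def summarize_dear_samples(samples, top_n=10):
    """Aggregate (load_pc, miss_latency) samples per load.

    Returns the top_n loads by total sampled miss latency as a list of
    (load_pc, total_latency, {bucket_label: sample_count}) tuples.
    """
    totals = defaultdict(int)
    buckets = defaultdict(lambda: defaultdict(int))
    for load_pc, latency in samples:
        totals[load_pc] += latency
        buckets[load_pc][bucket_label(latency)] += 1
    # Rank loads by the total latency their samples account for.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(pc, total, dict(buckets[pc])) for pc, total in ranked[:top_n]]
```

Ranking by total sampled latency rather than by sample count favors loads whose misses are expensive, which is the quantity the slide's chart reports.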
5 Profile Analysis Tools
- A set of tools written for visualizing and analyzing profiles, e.g.,
  - Control flow graph (CFG) viewer
  - Code-layout viewer
  - Load-latency comparator
6 CFG Viewer
- For evaluating the accuracy of profiles
7 Code-layout Viewer
- For evaluating code-layout optimization
8 Load-latency Comparator
- For evaluating data-layout optimization and data
prefetching
9 Deriving New Profiles from PMUs
- New profile types can be derived from PMUs
- Two examples
  - Consumer stall cycles
  - D-cache miss strides
10 Consumer Stall Cycles
Basic block A                PC-sample count
I1: ld8 r2 = [r3]            N1
    /* other instructions */
I2: add r2 = r2, 1           N2
I3: st8 [r3] = r2            N3
- Question
  - How many cycles of stall are experienced by I2?
  - (Note: not necessarily the load latency of I1)
- Method
  - PC-sample count is proportional to (stall cycles × frequency)
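The proportionality above can be turned into a rough per-execution estimate. The sketch below assumes two things the slide does not state: that each PC sample stands for a fixed number of cycles (`sampling_period_cycles`), and that the execution frequency of the basic block is available from another profile, such as an edge profile. The function name and parameters are hypothetical.

```python
def estimate_cycles_per_execution(sample_count, sampling_period_cycles, exec_frequency):
    """Estimate average cycles attributed to an instruction per execution.

    sample_count * sampling_period_cycles approximates the total cycles
    spent at the instruction (assumption: one sample per fixed period);
    dividing by how often it ran gives a per-execution figure.
    """
    if exec_frequency <= 0:
        raise ValueError("exec_frequency must be positive")
    return sample_count * sampling_period_cycles / exec_frequency


# Illustrative numbers only: 300 samples at I2, one sample per 100,000 cycles,
# and basic block A executed 2,000,000 times.
print(estimate_cycles_per_execution(300, 100_000, 2_000_000))  # prints 15.0
```

Because the estimate divides by frequency, a hot instruction with a large sample count is not automatically reported as a long stall; only a high count relative to how often the block runs is.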
11 D-cache Miss Strides
- Problem
  - Detect strides that are statically unknown
12 D-EAR based Stride Profiling
- Sample load misses in 2 phases: skipping and inspection
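A two-phase scheme of this kind can be sketched in software over a recorded miss stream. The version below is only an illustration under assumptions: the input is a list of `(load_pc, address)` miss events, the skipping phase takes one sample every `skip_interval` misses, and that sample opens an inspection phase that also samples each of the next `inspect_count` misses. The inspection length and all names are hypothetical, and the code does not program the PMU.

```python
from collections import Counter, defaultdict


def two_phase_stride_profile(misses, skip_interval=100, inspect_count=4):
    """Build per-load stride histograms from (load_pc, address) miss events.

    Skipping phase: one miss in every `skip_interval` is sampled.
    Inspection phase: the next `inspect_count` misses are all sampled, and
    address deltas between consecutive sampled misses of the same load
    are counted as candidate strides.
    """
    strides = defaultdict(Counter)
    skipped = 0
    remaining = 0      # misses left in the current inspection phase
    last_addr = {}     # last sampled address per load in this phase
    for load_pc, addr in misses:
        if remaining == 0:
            skipped += 1
            if skipped < skip_interval:
                continue           # skipping phase: miss is not sampled
            skipped = 0
            remaining = inspect_count
            last_addr = {}         # this sample opens a new inspection phase
        else:
            remaining -= 1
        if load_pc in last_addr:
            strides[load_pc][addr - last_addr[load_pc]] += 1
        last_addr[load_pc] = addr
    return strides


def dominant_stride(histogram):
    """Return the most frequent stride in a Counter, or None if it is empty."""
    return histogram.most_common(1)[0][0] if histogram else None
```

Deltas are only taken inside one inspection phase, since two samples separated by a skipping phase are many misses apart and their address difference says little about the stride.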
13 Performance Evaluation
- Instrumentation vs. PMU profiles
  - Profiling overhead
  - Performance impact
- Ispike optimizations
  - Code layout, instruction prefetching, data layout, data prefetching, inlining, global-data optimization, scalar optimizations
- Baseline compilers
  - Intel Electron compiler (ecc), version 8.0 Beta, -O3
  - GNU C compiler (gcc), version 3.2, -O3
- Benchmarks
  - SPEC CINT2000 (profiled with training inputs, measured with reference inputs)
- System
  - 1GHz Itanium 2, 16KB L1I/16KB L1D, 256KB L2, 3MB L3, 16GB memory
  - Red Hat Enterprise Linux AS with 2.4.18 kernel
14 Performance Gains with PMU Profiles
Sampling rates: BTB (1 sample/10K branches); D-EAR cache (1 sample/100 load misses); D-EAR stride (1 sample/100 misses in skipping, 1 sample/miss in inspection)
[Figure: performance gains over the Gcc 3.2 -O3 and Ecc 8.0 -O3 baselines]
- Up to 40% gain
- Geo. means: 8.5% over Ecc and 9.9% over Gcc
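For readers unfamiliar with how a geometric-mean gain is formed from per-benchmark results, a minimal sketch follows. It assumes each gain is given in percent and is first converted to a speedup ratio; the function name is hypothetical and the slide's per-benchmark numbers are not reproduced here.

```python
import math


def geometric_mean_gain(percent_gains):
    """Geometric mean of speedups, returned as a percent gain.

    Each gain g (in percent) becomes a speedup ratio 1 + g/100, the
    ratios are averaged geometrically, and the result is converted back.
    """
    if not percent_gains:
        raise ValueError("need at least one gain")
    log_sum = sum(math.log(1 + g / 100.0) for g in percent_gains)
    return (math.exp(log_sum / len(percent_gains)) - 1) * 100.0
```

Averaging ratios geometrically keeps one large outlier (such as a 40% gain on a single benchmark) from dominating the summary the way an arithmetic mean would.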
15 Cycle Breakdown (Ecc Baseline)
- Helps show whether individual optimizations are doing a good job
16 PMU Profiling Overhead
- Overhead reduced from 58% to 23% when lowering the BTB sampling rate by 10x.
- Overhead reduced to 3% when lowering the D-EAR sampling rate by 10x.
17 Instrumentation Profiling Overhead
- Why is the overhead so large?
  - Training runs are too short to amortize the dynamic compilation cost
  - Techniques like ephemeral instrumentation yet to be applied
18 PMU vs. Instrumentation (Perf. Gains)
[Figure: performance gains, annotated with profiling overhead: >60x, 59%, 24%, 3%]
- PMU profiles can be as good as instrumentation profiles
  - Could be even better in some cases (e.g., mcf)
- However, possible performance drops when samples are too sparse
  - E.g., gap and parser when Stride <1/1000, 1/1>
19 Reference
- Ispike: A Post-link Optimizer for the Intel Itanium Architecture, by Luk et al. In Proceedings of CGO'04. http://www.cgo.org/papers/01_82_luk_ck.pdf