Guiding Ispike with Instrumentation and Hardware (PMU) Profiles (CGO'04 Tutorial)

1
Guiding Ispike with Instrumentation and Hardware (PMU) Profiles
CGO'04 Tutorial, 3/21/04
  • CK. Luk
  • chi-keung.luk_at_intel.com
  • Massachusetts Microprocessor Design Center
  • Intel Corporation

2
What is Ispike?
  • A post-link optimizer for Itanium/Linux
    • No source code required
  • Memory-centric optimizations
    • Code layout and instruction prefetching; data
      layout and data prefetching
  • Significant speedups over compiler-optimized
    programs
    • 10% average speedup over gcc -O3 on SPEC
      CINT2000
  • Profile usages
    • Understanding program characteristics
    • Driving optimizations automatically
    • Evaluating the effectiveness of optimizations

3
Profiles used by Ispike
Granularity    | Hardware Profiles (pfmon) | Instrumentation Profiles (pin) | Usages
---------------+---------------------------+--------------------------------+-----------------------------------------
Per inst.      | PC sample                 | ---                            | Identifying hot spots
Per inst. line | I-EAR (I-Cache)           | ---                            | Inst. prefetching
Per inst. line | I-EAR (I-TLB)             | ---                            | ---
Per branch     | BTB                       | Edge profile                   | Code layout, data layout, and other opts
Per load       | D-EAR (D-Cache)           | Load-latency profile           | Data prefetching
Per load       | D-EAR (D-TLB)             | ---                            | ---
Per load       | D-EAR (stride)            | Stride profile                 | Data prefetching
4
Profile Example: D-EAR (cache)
Top 10 loads in the D-EAR profile of the MCF benchmark
[Chart labels: total sampled miss latency; latency buckets]
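A ranking like the one on this slide could be produced from raw D-EAR samples roughly as sketched below. The sketch assumes each sample has already been decoded into a (load PC, miss latency in cycles) pair. The function names, the sample format, and the bucket boundaries are illustrative assumptions, not pfmon's output format or Ispike's implementation.

```python
from collections import defaultdict

# Hypothetical latency-bucket upper bounds in cycles (chosen for illustration only).
BUCKET_BOUNDS = (8, 16, 64, 256)

def bucket_of(latency):
    """Return the index of the first bucket whose upper bound covers the latency."""
    for i, bound in enumerate(BUCKET_BOUNDS):
        if latency <= bound:
            return i
    return len(BUCKET_BOUNDS)  # overflow bucket for anything above the last bound

def top_loads(samples, n=10):
    """Rank loads by total sampled miss latency.

    samples: iterable of (load_pc, latency_cycles) pairs.
    Returns a list of (load_pc, total_latency, bucket_histogram), hottest first.
    """
    total = defaultdict(int)
    hist = defaultdict(lambda: [0] * (len(BUCKET_BOUNDS) + 1))
    for pc, latency in samples:
        total[pc] += latency
        hist[pc][bucket_of(latency)] += 1
    ranked = sorted(total, key=lambda pc: total[pc], reverse=True)[:n]
    return [(pc, total[pc], hist[pc]) for pc in ranked]
```

Sorting by total latency rather than by sample count favors loads whose misses are expensive, which matches the "total sampled miss latency" label on the chart.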
5
Profile Analysis Tools
  • A set of tools written for visualizing and
    analyzing profiles, e.g.,
    • Control flow graph (CFG) viewer
    • Code-layout viewer
    • Load-latency comparator

6
CFG Viewer
  • For evaluating the accuracy of profiles
7
Code-layout Viewer
  • For evaluating code-layout optimization
8
Load-latency Comparator
  • For evaluating data-layout optimization and data
    prefetching

9
Deriving New Profiles from PMUs
  • New profile types can be derived from PMUs
  • Two examples:
    • Consumer stall cycles
    • D-cache miss strides

10
Consumer Stall Cycles
Basic block A, with the PC-sample count of each instruction:
  N1   I1: ld8 r2 = [r3]
       /* other instructions */
  N2   I2: add r2 = r2, 1
  N3   I3: st8 [r3] = r2
  • Question
    • How many stall cycles does I2 experience?
      (Note: not necessarily the load latency of I1)
  • Method
    • PC-sample count is proportional to (stall cycles
      × frequency)
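The proportionality in the Method bullet can be turned into a per-execution estimate, as in the minimal sketch below. It assumes a fixed PC-sampling period in cycles and a known execution count for basic block A (for example from an edge profile). The function and parameter names are hypothetical, and the result is the average number of cycles attributed to the instruction per execution, so any cycles that are not stalls are included as well.

```python
def cycles_per_execution(sample_count, sampling_period, block_frequency):
    """Estimate the average cycles attributed to one instruction per execution.

    sample_count:    PC samples that landed on the instruction (e.g. N2 for I2)
    sampling_period: cycles between consecutive PC samples (assumed constant)
    block_frequency: number of times the enclosing basic block executed
    """
    if block_frequency <= 0:
        raise ValueError("block_frequency must be positive")
    # Total cycles attributed to the instruction, divided by how often it ran.
    return sample_count * sampling_period / block_frequency

# Hypothetical numbers: 500 samples on I2, one sample every 10,000 cycles,
# and basic block A executed 1,000,000 times.
estimate = cycles_per_execution(500, 10_000, 1_000_000)
```

Dividing by the block frequency is what separates a frequently executed, cheap instruction from a rarely executed one that stalls for a long time; both can show the same raw sample count.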

11
D-cache Miss Strides
  • Problem
    • Detect strides that are statically unknown

12
D-EAR based Stride Profiling
  • Sample load misses in two phases (skipping and
    inspection)
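A two-phase scheme of this kind might look like the sketch below, which works on an already recorded list of load misses. The phase lengths, the function names, and the rule of taking the most frequent address delta per load are assumptions made for illustration; the slides do not specify how Ispike derives a stride from the sampled misses.

```python
from collections import Counter, defaultdict

def two_phase_sample(misses, skip_len=1000, skip_rate=100, inspect_len=8):
    """Alternate a sparse skipping phase with a dense inspection phase.

    misses: list of (load_pc, address) pairs in program order.
    Skipping phase:   keep 1 of every skip_rate misses, for skip_len misses.
    Inspection phase: keep every miss, for inspect_len misses.
    Returns (sparse_samples, inspection_runs).
    """
    sparse, runs = [], []
    period = skip_len + inspect_len
    for start in range(0, len(misses), period):
        skipping = misses[start:start + skip_len]
        sparse.extend(skipping[::skip_rate])            # 1 sample per skip_rate misses
        run = misses[start + skip_len:start + period]   # 1 sample per miss
        if run:
            runs.append(run)
    return sparse, runs

def dominant_strides(runs):
    """For each load PC, return the most frequent address delta between
    successive sampled misses of that load within one inspection run."""
    deltas = defaultdict(Counter)
    for run in runs:
        last = {}  # last sampled address per load PC, reset for every run
        for pc, addr in run:
            if pc in last:
                deltas[pc][addr - last[pc]] += 1
            last[pc] = addr
    return {pc: counts.most_common(1)[0][0] for pc, counts in deltas.items()}

# Hypothetical trace: one load missing on every 64th byte.
trace = [(0x40, 0x1000 + 64 * i) for i in range(5000)]
sparse, runs = two_phase_sample(trace)
strides = dominant_strides(runs)
```

Deltas are only computed inside an inspection run because misses sampled during skipping are too far apart for their address difference to say anything about the stride.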

13
Performance Evaluation
  • Instrumentation vs. PMU profiles
    • Profiling overhead
    • Performance impact
  • Ispike optimizations
    • Code layout, instruction prefetching, data
      layout, data prefetching, inlining, global-data
      optimization, scalar optimizations
  • Baseline compilers
    • Intel Electron compiler (ecc), version 8.0 Beta,
      -O3
    • GNU C compiler (gcc), version 3.2, -O3
  • Benchmarks
    • SPEC CINT2000 (profiled with training inputs,
      measured with reference inputs)
  • System
    • 1 GHz Itanium 2, 16 KB L1I/16 KB L1D, 256 KB L2,
      3 MB L3, 16 GB memory
    • Red Hat Enterprise Linux AS with 2.4.18 kernel

14
Performance Gains with PMU Profiles
Sampling rates: BTB (1 sample/10K branches), D-EAR
cache (1 sample/100 load misses), D-EAR stride (1
sample/100 misses in skipping, 1 sample/miss in
inspection)
[Chart series: gcc 3.2 -O3 baseline; ecc 8.0 -O3 baseline]
  • Up to 40% gain
  • Geometric means: 8.5% over ecc and 9.9% over gcc
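For readers who want to reproduce this kind of summary from their own measurements, a geometric-mean gain can be computed as in the sketch below. The function name is hypothetical and the input values are made up for illustration; they are not the per-benchmark results behind the numbers on this slide.

```python
import math

def geomean_gain(speedups):
    """Geometric-mean gain in percent.

    speedups: per-benchmark ratios of baseline time to optimized time,
              so 1.10 means the optimized binary is 10% faster.
    """
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive ratios")
    # Average in log space, then convert back to a ratio.
    mean_ratio = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    return (mean_ratio - 1.0) * 100.0

# Made-up ratios for two benchmarks: no change and a 21% speedup.
gain = geomean_gain([1.0, 1.21])  # about 10 (percent)
```

The geometric mean is the usual choice for averaging ratios because one large outlier, such as the 40% best case, pulls it up less than it would an arithmetic mean.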

15
Cycle Breakdown (Ecc Baseline)
  • Helps show whether individual optimizations are
    doing a good job

16
PMU Profiling Overhead
  • Overhead reduced from 58% to 23% when lowering
    the BTB sampling rate by 10x.
  • Overhead reduced to 3% when lowering the D-EAR
    sampling rate by 10x.

17
Instrumentation Profiling Overhead
  • Why is the overhead so large?
    • Training runs are too short to amortize the
      dynamic compilation cost
    • Techniques like ephemeral instrumentation have
      yet to be applied

18
PMU vs. Instrumentation (Perf. Gains)
[Chart labels, profiling overhead: >60x, 59%, 24%, 3%]
  • PMU profiles can be as good as instrumentation
    profiles
  • Could be even better in some cases (e.g., mcf)
  • However, performance can drop when samples are
    too sparse
  • E.g., gap and parser with stride sampling at
    <1/1000, 1/1>

19
Reference
  • Ispike: A Post-link Optimizer for the Intel
    Itanium Architecture, by Luk et al. In
    Proceedings of CGO'04.

http://www.cgo.org/papers/01_82_luk_ck.pdf