Guiding Ispike with Instrumentation and Hardware (PMU) Profiles CGO

About This Presentation

Description:

Guiding Ispike with Instrumentation and Hardware (PMU) Profiles. CGO'04 Tutorial, 3/21/04. CK. Luk, chi-keung.luk_at_intel.com, Massachusetts Microprocessor Design Center

Slides: 20

Provided by: chik153

Transcript and Presenter's Notes

1
Guiding Ispike with Instrumentation and Hardware (PMU) Profiles
CGO'04 Tutorial, 3/21/04

CK. Luk
chi-keung.luk_at_intel.com
Massachusetts Microprocessor Design Center
Intel Corporation

2
What is Ispike?

A post-link optimizer for Itanium/Linux
No source code required
Memory-centric optimizations
Code layout and instruction prefetching, data layout and data prefetching
Significant speedups over compiler-optimized programs
10% average speedup over gcc -O3 on SPEC CINT2000
Profile usages
Understanding program characteristics
Driving optimizations automatically
Evaluating the effectiveness of optimizations

3
Profiles used by Ispike
Granularity     Hardware Profiles (pfmon)  Instrumentation Profiles (pin)  Usages
--------------  -------------------------  ------------------------------  ----------------------------------------
Per inst.       PC sample                  ---                             Identifying hot spots
Per inst. line  I-EAR (I-Cache)            ---                             Inst. prefetching
Per inst. line  I-EAR (I-TLB)              ---                             ---
Per branch      BTB                        Edge profile                    Code layout, data layout, and other opts
Per load        D-EAR (D-Cache)            Load-latency profile            Data prefetching
Per load        D-EAR (D-TLB)              ---                             ---
Per load        D-EAR (stride)             Stride profile                  Data prefetching
4
Profile Example: D-EAR (cache)
Top 10 loads in the D-EAR profile of the MCF benchmark
[Chart: total sampled miss latency per load, broken down into latency buckets]
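A chart like this one can be built from raw D-EAR samples by summing the sampled miss latency per load and splitting it into latency buckets. The following is a minimal sketch only: the `(load_pc, latency)` sample format, the bucket boundaries in `BUCKET_BOUNDS`, and the function names are assumptions for illustration, not the format pfmon emits or the buckets used on the slide.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds in cycles; the boundaries used in the
# slide's chart are not given in the transcript.
BUCKET_BOUNDS = (8, 16, 64, 256)


def bucket_label(latency):
    """Return the label of the first bucket whose upper bound covers latency."""
    for bound in BUCKET_BOUNDS:
        if latency <= bound:
            return f"<={bound}"
    return f">{BUCKET_BOUNDS[-1]}"


def top_loads(samples, n=10):
    """Rank load PCs by total sampled miss latency.

    samples: iterable of (load_pc, miss_latency_cycles) pairs (assumed format).
    Returns a list of (load_pc, total_latency, {bucket_label: latency}),
    sorted with the highest total first and truncated to n entries.
    """
    totals = defaultdict(int)
    buckets = defaultdict(lambda: defaultdict(int))
    for pc, latency in samples:
        totals[pc] += latency
        buckets[pc][bucket_label(latency)] += latency
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(pc, total, dict(buckets[pc])) for pc, total in ranked[:n]]


if __name__ == "__main__":
    # Made-up samples, for illustration only.
    demo = [(0x4000, 12), (0x4000, 200), (0x4010, 7), (0x4020, 300), (0x4010, 9)]
    for pc, total, per_bucket in top_loads(demo):
        print(hex(pc), total, per_bucket)
```

Ranking by total latency rather than by sample count keeps a load with few but very long misses visible next to a load with many short ones.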
5
Profile Analysis Tools

A set of tools written for visualizing and
analyzing profiles, e.g.,
Control flow graph (CFG) viewer
Code-layout viewer
Load-latency comparator

6
CFG Viewer
For evaluating the accuracy of profiles
7
Code-layout Viewer
For evaluating code-layout optimization
8
Load-latency Comparator

For evaluating data-layout optimization and data
prefetching

9
Deriving New Profiles from PMUs

New profile types can be derived from PMUs
Two examples
Consumer stall cycles
D-cache miss strides

10
Consumer Stall Cycles
Basic block A, with the PC-sample count of each instruction:
I1: ld8 r2 = [r3]        N1
    /* other instructions */
I2: add r2 = r2, 1       N2
I3: st8 [r3] = r2        N3

Question
How many stall cycles does I2 experience?
(Note: not necessarily the load latency of I1)
Method
PC-sample count is proportional to (stall cycles × frequency)
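Since the PC-sample count is proportional to the product of stall cycles and frequency, dividing an instruction's sample count by its execution frequency leaves a value proportional to its stall cycles. The sketch below illustrates that arithmetic under stated assumptions: the execution counts are taken as given (for example from an edge profile), and the function names and the numbers in the demo are hypothetical.

```python
def relative_stall(pc_samples, exec_count):
    """Return a value proportional to the stall cycles of one instruction.

    Assumes pc_samples is proportional to stall_cycles * exec_count, so
    dividing by exec_count leaves stall cycles in arbitrary units.
    """
    if exec_count <= 0:
        raise ValueError("exec_count must be positive")
    return pc_samples / exec_count


def rank_by_stall(block):
    """Rank instructions by estimated stall, largest first.

    block: list of (name, pc_samples, exec_count) tuples (assumed format).
    """
    scored = [(name, relative_stall(samples, count)) for name, samples, count in block]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up counts: all three instructions sit in the same basic block,
    # so they share one execution count.
    block_a = [("I1", 20, 1000), ("I2", 480, 1000), ("I3", 30, 1000)]
    for name, stall in rank_by_stall(block_a):
        print(name, stall)
```

Within one basic block the frequency is the same for every instruction, so the sample counts can be compared directly; the division only matters when comparing instructions from blocks with different execution counts.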

11
D-cache Miss Strides

Problem
Detect strides that are statically unknown

12
D-EAR based Stride Profiling

Sample load misses with 2 phases
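The two phases can be sketched as a small simulation over a stream of load misses: a skipping phase that lets most misses pass, followed by an inspection phase that records every miss so that address deltas of the same load can be compared. This is a sketch under assumptions, not Ispike's implementation: the `(load_pc, address)` stream format, the length of the inspection phase (`INSPECT_LENGTH`), and the choice of the most frequent delta as the stride are all hypothetical. Only the idea of two phases and the skipping rate of 1 sample per 100 misses come from the slides.

```python
from collections import Counter, defaultdict

SKIP_PERIOD = 100    # skipping phase: 1 sample per 100 misses (from the slides)
INSPECT_LENGTH = 8   # hypothetical: misses recorded per inspection phase


def profile_strides(miss_stream, skip_period=SKIP_PERIOD, inspect_length=INSPECT_LENGTH):
    """Collect per-load address deltas using skipping and inspection phases.

    miss_stream: iterable of (load_pc, miss_address) pairs (assumed format).
    Returns {load_pc: Counter({delta: occurrences})}.
    """
    deltas = defaultdict(Counter)
    last_addr = {}   # last address seen per load in the current inspection phase
    skipped = 0      # misses seen so far in the current skipping phase
    remaining = 0    # misses left to record in the current inspection phase
    for pc, addr in miss_stream:
        if remaining == 0:
            # Skipping phase: ignore misses until the skip_period-th one.
            skipped += 1
            if skipped < skip_period:
                continue
            # That miss triggers a new inspection phase.
            skipped = 0
            remaining = inspect_length
            last_addr.clear()
        # Inspection phase: every miss is recorded.
        if pc in last_addr:
            deltas[pc][addr - last_addr[pc]] += 1
        last_addr[pc] = addr
        remaining -= 1
    return deltas


def dominant_stride(delta_counts):
    """Return the most frequent delta, or None if no delta was observed."""
    if not delta_counts:
        return None
    stride, _count = delta_counts.most_common(1)[0]
    return stride


if __name__ == "__main__":
    # Made-up miss stream: one load walking memory with a 64-byte stride.
    stream = [(0x10, 0x1000 + 64 * i) for i in range(1000)]
    print(dominant_stride(profile_strides(stream)[0x10]))
```

Deltas are only computed inside an inspection phase because samples taken 100 misses apart in the skipping phase are too far apart for their address difference to reveal the stride.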

13
Performance Evaluation

Instrumentation vs. PMU profiles
Profiling overhead
Performance impact
Ispike optimizations
Code layout, instruction prefetching, data layout, data prefetching, inlining, global-data optimization, scalar optimizations
Baseline compilers
Intel Electron compiler (ecc), version 8.0 Beta, -O3
GNU C compiler (gcc), version 3.2, -O3
Benchmarks
SPEC CINT2000 (profiled with training inputs, measured with reference inputs)
System
1GHz Itanium 2, 16KB L1I/16KB L1D, 256KB L2, 3MB L3, 16GB memory
Red Hat Enterprise Linux AS with 2.4.18 kernel

14
Performance Gains with PMU Profiles
Sampling rates:
BTB: 1 sample/10K branches
D-EAR cache: 1 sample/100 load misses
D-EAR stride: 1 sample/100 misses in skipping, 1 sample/miss in inspection
[Charts: gcc 3.2 -O3 baseline; ecc 8.0 -O3 baseline]