Automatic Measurement of Instruction Cache Capacity in X-Ray

1
Automatic Measurement of Instruction Cache
Capacity in X-Ray
  • Kamen Yotov
  • kyotov_at_us.ibm.com
  • IBM T. J. Watson Research Center
  • Joint work with
  • Tyler Steele, Sandra Jackson,
  • Keshav Pingali, Paul Stodghill
  • Department of Computer Science
  • Cornell University

2
Motivation: self-optimizing software
  • Goal: portable performance
  • Self-optimizing software
  • Generates code with parameters whose optimal
    values depend on the platform (hardware / OS /
    compiler)
  • Determines experimentally optimal parameter
    values
  • Uses native C compiler to produce library
  • Examples: ATLAS, FFTW, SPIRAL, ...

3
Example: Register Blocking for MMM
  • Hardware parameters
  • Number of FP registers (NR)
  • I-Cache Capacity (ICC)
  • A simple model for the register tile size for
    MMM
  • Yotov et al., IEEE'05
  • MU x NU + MU + NU + Temp ≤ NR
  • KU (unroll of K loop)
  • does not depend on NR
  • depends on ICC
  • Need to know NR and ICC!
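The register-tile model can be turned into a small search over MU and NU. The following is a minimal sketch, assuming the constraint reads MU x NU + MU + NU + Temp ≤ NR and that the preferred tile is the feasible one with the largest MU x NU product; `struct tile`, `choose_tile`, and the tie-breaking rule are hypothetical and not taken from the slides.

```c
#include <assert.h>

/* Hypothetical helper type: register tile (MU x NU) for the MMM kernel. */
struct tile { int mu; int nu; };

/* Assumed model: mu*nu + mu + nu + temp <= nr.
   Returns the feasible tile with the largest mu*nu, preferring the
   more square tile on ties; returns {0, 0} if no tile fits. */
static struct tile choose_tile(int nr, int temp)
{
    struct tile best = {0, 0};
    int mu, nu;
    assert(nr >= 0 && temp >= 0);
    for (mu = 1; mu <= nr; ++mu) {
        for (nu = 1; nu <= nr; ++nu) {
            int used = mu * nu + mu + nu + temp;
            int gain = mu * nu;
            int best_gain = best.mu * best.nu;
            int skew = mu > nu ? mu - nu : nu - mu;
            int best_skew = best.mu > best.nu ? best.mu - best.nu
                                              : best.nu - best.mu;
            if (used > nr)
                continue;               /* tile does not fit in the registers */
            if (gain > best_gain || (gain == best_gain && skew < best_skew)) {
                best.mu = mu;
                best.nu = nu;
            }
        }
    }
    return best;
}
```

Under these assumptions, `choose_tile(16, 1)` selects a 3 x 3 tile, since 9 + 3 + 3 + 1 = 16 registers are used. KU is not part of this search because, per the slide, it depends on ICC rather than NR.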

4
Why not consult the manuals?
  • Self-optimizing systems
  • Require online manuals
  • Actual hardware values vs. number available for
    optimization
  • For software optimization, hardware values may
    not be relevant
  • (e.g.) number of hardware registers may not be
    equal to number of registers available for
    holding program values (register 0 on SPARC)
  • Incomplete
  • Parameters like capacity and line size of
    off-chip caches vary from model to model
  • Even same model of computer may be shipped with
    different cache organizations
  • Not usually documented in processor manuals
  • Moving Target

5
Automatic Measurement Tools
  • lmbench
  • OS benchmark, some CPU / Memory benchmarks
  • Larry McVoy, BitMover, Inc.
  • Carl Staelin, HP
  • Calibrator
  • Memory hierarchy benchmark
  • Stefan Manegold
  • Centrum voor Wiskunde en Informatica
  • MOB
  • Memory hierarchy benchmark
  • Josep Blanquer, Robert Chalmers
  • University of California Santa Barbara

6
X-Ray
  • Set of micro-benchmarks in ANSI C89
  • Download and compile on any architecture
    (portable)
  • Deduce hardware parameter values from timing
    results
  • Some amount of O/S specific code
  • High-resolution timing routines
  • Super-page allocation
  • Currently support Linux
  • Windows, Solaris, IRIX, and AIX in the works
  • Paradox
  • Compiler optimizations may contaminate timing
    results
  • Cannot afford to turn off all optimizations

7
Example: Latency of Integer ADD (Step by Step)
  • t = gettime();
  • r1 += r2;
  • return gettime() - t;

Problem: hard to measure small time intervals
accurately
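To see why a single operation cannot be timed directly, it helps to measure how coarse the timer itself is. The sketch below assumes a POSIX system and implements a hypothetical `gettime()` on top of `clock_gettime(CLOCK_MONOTONIC, ...)`; the slides do not say how X-Ray implements its timing routine, and `timer_step_ns` is an illustrative name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical gettime(): monotonic time in nanoseconds (POSIX). */
static double gettime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Smallest nonzero step observed between consecutive timer reads. */
static double timer_step_ns(int samples)
{
    double best = 0.0;
    int i;
    assert(samples > 0);
    for (i = 0; i < samples; ++i) {
        double t0 = gettime();
        double t1;
        do {
            t1 = gettime();
        } while (t1 == t0);             /* spin until the clock advances */
        if (best == 0.0 || t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

An integer ADD typically completes in well under the step this reports, which is why a single `r1 += r2` between two `gettime()` calls cannot be resolved.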
8
Step by Step (cont.)
  • t = gettime();
  • while (--R) // R is the number of repetitions
  •   r1 += r2;
  • return gettime() - t;

Problem: loop overhead
9
Step by Step (cont.)
  • t = gettime();
  • i = R / U;
  • while (--i) { // loop unrolled U times
  •   r1 += r2;
  •   r1 += r2;
  •   ........
  •   r1 += r2;
  • }
  • return gettime() - t;

Problem: compiler optimizations
10
Step by Step (cont.)
  • t = gettime();
  • i = R / U;
  • switch (v) {
  • case 0: loop:
  • case 1: r1 += r2;
  • case 2: r1 += r2;
  • .................
  • case U: r1 += r2;
  •   if (--i)
  •     goto loop;
  • }
  • if (!v) return gettime() - t; else use(r1, r2);

Solution: volatile int v = 0;
11
Latency of integer ADD: nano-benchmark C code
  • Want to measure
  • r1 += r2
  • Generate C code from specification
  • <r1 += r2, <r1, r2: int>>
  • volatile int v = 0;
  • volatile int vr = 0;
  • register int r1 = vr;
  • register int r2 = vr;
  • t = gettime();
  • i = R / U;
  • switch (v) {
  • case 0: loop:
  • case 1: r1 += r2;
  • case 2: r1 += r2;
  • .................
  • case U: r1 += r2;
  •   if (--i)
  •     goto loop;
  • }
  • if (!v)
  •   return gettime() - t;
  • else
  •   use(r1, r2);
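The nano-benchmark skeleton can be written out as a compilable function. This is a sketch under stated assumptions, not the code X-Ray generates: the unroll factor `U` is fixed at 8, `gettime()` is a hypothetical wrapper around POSIX `clock_gettime`, and a volatile `sink` stands in for `use(r1, r2)`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

#define U 8                     /* assumed unroll factor */

static volatile int v = 0;      /* always 0, but the compiler cannot prove it */
static volatile int vr = 0;     /* opaque source for the initial values */
static volatile int sink;       /* stand-in for use(r1, r2) */

/* Hypothetical gettime(): monotonic time in nanoseconds (POSIX). */
static double gettime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Estimated nanoseconds per dependent integer add.
   R is rounded down to a multiple of U; returns 0.0 if R < U. */
static double add_latency_ns(long R)
{
    register int r1 = vr;
    register int r2 = vr;
    long iters = R / U;
    long i = iters;
    double t;
    assert(R >= 0);
    if (iters == 0)
        return 0.0;
    t = gettime();
    switch (v) {                /* every case label is a possible entry point */
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    case 3: r1 += r2;
    case 4: r1 += r2;
    case 5: r1 += r2;
    case 6: r1 += r2;
    case 7: r1 += r2;
    case 8: r1 += r2;
        if (--i)
            goto loop;
    }
    if (!v)
        return (gettime() - t) / (double)(iters * U);
    sink = r1 + r2;             /* not reached at run time; keeps r1 live */
    return 0.0;
}
```

Because `v` is volatile, the compiler must assume control can enter at any case label and that `r1` may be consumed afterwards, so it is discouraged from merging or deleting the chain of dependent adds. The result still includes the residual loop overhead of one decrement and branch per `U` adds.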

12
X-Ray architecture
13
Instruction Throughput
  • Specification
  • Control Engine

N = 3, B = 1
14
Micro-benchmarks in X-Ray
  • CPU
  • Frequency
  • Instruction Latency
  • Instruction Throughput
  • Instruction Existence
  • FPU on embedded processors
  • FMA on general purpose processors
  • SMP and SMT
  • Memory Hierarchy
  • Number of Registers of various types (int, float,
    SSE, ...)
  • Multilevel Caches, TLB
  • Associativity
  • Block Size
  • Capacity
  • Latency
  • Instruction Cache Capacity

15
Previous Approaches for Memory Hierarchy
Parameters
  • Saavedra Benchmark (Hennessy-Patterson)
  • Accesses elements of an array a constant stride
    apart
  • Measures average memory access time
  • Deficiencies
  • Considers all levels simultaneously
  • Works only for capacities that are powers-of-2
  • Suffers from a number of implementation level
    deficiencies
  • Constant stride accesses
  • Loop overhead problems
  • Overlapping memory operations
  • Prone to compiler optimizations
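For reference, a constant-stride benchmark in the spirit of the one described above might look like the following. It is a simplified sketch, not the Saavedra benchmark itself: `stride_access_ns` and its parameters are hypothetical, `gettime()` is again assumed to wrap POSIX `clock_gettime`, and the code deliberately keeps the listed weaknesses (loop overhead is included in the timing, and independent loads may overlap).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical gettime(): monotonic time in nanoseconds (POSIX). */
static double gettime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average nanoseconds per access when reading every stride-th byte of a
   buffer of `size` bytes, `passes` times over.
   Returns -1.0 on invalid arguments or allocation failure. */
static double stride_access_ns(size_t size, size_t stride, int passes)
{
    unsigned char *buf;
    volatile unsigned char *a;  /* volatile: every load must be performed */
    size_t i, count = 0;
    unsigned sum = 0;
    double t;
    int p;
    if (size == 0 || stride == 0 || passes <= 0)
        return -1.0;
    buf = calloc(size, 1);
    if (buf == NULL)
        return -1.0;
    a = buf;
    t = gettime();
    for (p = 0; p < passes; ++p) {
        for (i = 0; i < size; i += stride) {
            sum += a[i];
            ++count;
        }
    }
    t = gettime() - t;
    free(buf);
    (void)sum;
    assert(count > 0);
    return t / (double)count;
}
```

Sweeping `size` and `stride` and plotting the result gives the familiar staircase curve, but every level of the hierarchy contributes to each point, which is the first deficiency listed above.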

16
Example: Isolation of lower cache levels
  • Idea for Ln measurements
  • Use sequences as for L1 measurements
  • Make L1..Ln-1 transparent to measurements
  • Unique in isolating the behavior of Ln so that
    all higher levels miss
  • Approach
  • Use sequences of sequences
  • Convolution of sequences


17
Measuring I-Cache Capacity
  • Approach for Data Cache does not work
  • Array of pointers → code sequence with branches
  • Such branches are very predictable
  • Nearly impossible to get precise timing
  • Measure time to execute special code sequence of
    size N statements
  • Find the biggest N for which there is no
    significant increase in time per statement

18
Nano-benchmark
  • Similar to Instruction Throughput
  • Parameters (1, 4)
  • Grow length N
  • Code size computed
  • (char *)finish - (char *)start

19
Sensitivity
  • Graph for Pentium M
  • 9 more in the paper
  • Performance oscillates
  • Even after averaging out noise
  • Cannot wait for jump
  • Need more robust measurement

20
Control Engine Script
  • Start with N = 256
  • Compute
  • Mean
  • Standard deviation
  • For
  • Binary-search
  • Detect jump when time is more than

21
Experimental Results
22
Pentium 4
  • Does not cache ISA instructions, but uops
  • Trace cache
  • Measure the number of instructions
  • Smoothing in the nano-benchmark minimum of time
    in

23
Conclusions
  • X-Ray: a framework and tool
  • First to measure instruction cache capacity
  • Algorithms for precise measurements of some
    important hardware parameters
  • Experimental results on many modern architectures
  • Other X-Ray resources
  • Memory Hierarchy parameter measurement appeared
    at SIGMETRICS'05
  • CPU parameter measurement appeared at QEST'05
  • Improving X-Ray is work in progress

24
Current and Future Work
  • 2-address vs. 3-address code
  • Out-of-Order execution
  • Number of physical registers
  • Number / type of functional units
  • Cache
  • bandwidth
  • write mode
  • sharedness
  • replacement policy

25
Thank you!
  • My E-Mail
  • kamen_at_yotov.org
  • kyotov_at_us.ibm.com
  • Cornell Group homepage
  • http://iss.cs.cornell.edu
  • This work emerged from a joint project with David
    Padua's group at UIUC
  • http://polaris.cs.uiuc.edu/newframework.html
  • Download X-Ray!
  • http://iss.cs.cornell.edu/software/x-ray.aspx