Title: Automatic Measurement of Instruction Cache Capacity in X-Ray
1 Automatic Measurement of Instruction Cache Capacity in X-Ray
- Kamen Yotov
- kyotov_at_us.ibm.com
- IBM T. J. Watson Research Center
- Joint work with
- Tyler Steele, Sandra Jackson,
- Keshav Pingali, Paul Stodghill
- Department of Computer Science
- Cornell University
2 Motivation: self-optimizing software
- Goal: portable performance
- Self-optimizing software
- Generates code with parameters whose optimal values depend on the platform (hardware / OS / compiler)
- Determines optimal parameter values experimentally
- Uses native C compiler to produce library
- Examples: ATLAS, FFTW, SPIRAL, ...
3 Example: Register Blocking for MMM
- Hardware parameters
- Number of FP registers (NR)
- I-Cache Capacity (ICC)
- A simple model for the register tile size for MMM
- Yotov et al., IEEE'05
- $MU \times NU + MU + NU + \mathit{Temp} \le NR$
- KU (unroll of K loop)
- does not depend on NR
- depends on ICC
- Need to know NR and ICC!
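To make the role of NR concrete, here is a minimal sketch of how a code generator might pick a register tile once NR is known. It assumes the model reads MU x NU + MU + NU + Temp <= NR, and it assumes that the preferred tile is the feasible one with the largest MU x NU; the slide states only the constraint. The names `reg_tile` and `choose_register_tile` are hypothetical and are not part of X-Ray or ATLAS.

```c
/* Hypothetical helper: choose a register tile (MU x NU) for MMM.
 * Constraint taken from the slide: MU*NU + MU + NU + Temp <= NR.
 * Assumption: among feasible tiles, prefer the largest MU*NU. */
typedef struct {
    int mu;
    int nu;
} reg_tile;

reg_tile choose_register_tile(int nr, int temp)
{
    reg_tile best = {0, 0};   /* {0, 0} means no feasible tile exists */
    int mu, nu;

    for (mu = 1; mu <= nr; mu++) {
        for (nu = 1; nu <= nr; nu++) {
            /* Register demand grows with nu, so stop at the first misfit. */
            if (mu * nu + mu + nu + temp > nr)
                break;
            if (mu * nu > best.mu * best.nu) {
                best.mu = mu;
                best.nu = nu;
            }
        }
    }
    return best;
}
```

With NR = 16 and Temp = 1 the search settles on a 3 x 3 tile, which uses 9 + 3 + 3 + 1 = 16 registers. KU is not handled here because, as the slide notes, it depends on ICC rather than NR.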
4 Why not consult the manuals?
- Self-optimizing systems
- Require online manuals
- Actual hardware values vs. number available for optimization
- For software optimization, hardware values may not be relevant
- (e.g.) number of hardware registers may not be equal to number of registers available for holding program values (register 0 on SPARC)
- Incomplete
- Parameters like capacity and line size of off-chip caches vary from model to model
- Even same model of computer may be shipped with different cache organizations
- Not usually documented in processor manuals
- Moving Target
5 Automatic Measurement Tools
- lmbench
- OS benchmark, some CPU / Memory benchmarks
- Larry McVoy, BitMover, Inc.
- Carl Staelin, HP
- Calibrator
- Memory hierarchy benchmark
- Stefan Manegold
- Centrum voor Wiskunde en Informatica
- MOB
- Memory hierarchy benchmark
- Josep Blanquer, Robert Chalmers
- University of California Santa Barbara
6 X-Ray
- Set of micro-benchmarks in ANSI C89
- Download and compile on any architecture (portable)
- Deduce hardware parameter values from timing results
- Some amount of O/S specific code
- High-resolution timing routines
- Super-page allocation
- Currently supports Linux
- Windows, Solaris, IRIX, and AIX in the works
- Paradox
- Compiler optimizations may contaminate timing results
- Cannot afford to turn off all optimizations
7 Example: Latency of Integer ADD (Step by Step)
    t = gettime();
    r1 += r2;
    return gettime() - t;
Problem: hard to measure small time intervals accurately
8 Step by Step (cont.)
    t = gettime();
    while (--R)    // R is number of repetitions
        r1 += r2;
    return gettime() - t;
Problem: loop overhead
9 Step by Step (cont.)
    t = gettime();
    i = R / U;
    while (--i)    // loop unrolled U times
    {
        r1 += r2;
        r1 += r2;
        ........
        r1 += r2;
    }
    return gettime() - t;
Problem: compiler optimizations
10 Step by Step (cont.)
    t = gettime();
    i = R / U;
    switch (v)
    {
        case 0: loop:
        case 1: r1 += r2;
        case 2: r1 += r2;
        .................
        case U: r1 += r2;
                if (--i)
                    goto loop;
    }
    if (!v) return gettime() - t; else use(r1, r2);
Solution: volatile int v = 0
11 Latency of integer ADD: nano-benchmark C code
- Want to measure
    r1 += r2
- Generate C code from specification
    <r1 += r2, <r1, r2: int>>
    volatile int v = 0;
    volatile int vr = 0;
    register int r1 = vr;
    register int r2 = vr;
    t = gettime();
    i = R / U;
    switch (v)
    {
        case 0: loop:
        case 1: r1 += r2;
        case 2: r1 += r2;
        .................
        case U: r1 += r2;
                if (--i)
                    goto loop;
    }
    if (!v)
        return gettime() - t;
    else
        use(r1, r2);
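The step-by-step construction can be put together as a compilable function. This is only a sketch under stated assumptions, not the code X-Ray generates: `clock()` stands in for the slides' `gettime()` (X-Ray uses OS-specific high-resolution timers), the unroll factor is fixed at U = 8, and `use`, `sink`, and `add_latency` are hypothetical names.

```c
#include <time.h>

/* Assumption: clock() stands in for gettime(); it is portable but coarse,
 * so callers should pass a large repetition count. */
static double gettime(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical sink: makes r1 and r2 observable so the ADDs are not dead code. */
static volatile int sink;

static void use(int r1, int r2)
{
    sink = r1 + r2;
}

/* Estimated seconds per dependent integer ADD, with the loop unrolled 8 times. */
double add_latency(long reps)
{
    volatile int v = 0;    /* always 0, but the compiler cannot assume so */
    volatile int vr = 0;   /* hides the initial register values */
    register int r1 = vr;
    register int r2 = vr;
    long n = reps / 8;     /* loop iterations: i = R / U on the slides */
    long i;
    double t;

    if (n < 1)
        n = 1;
    i = n;
    t = gettime();
    switch (v) {
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    case 3: r1 += r2;
    case 4: r1 += r2;
    case 5: r1 += r2;
    case 6: r1 += r2;
    case 7: r1 += r2;
    case 8: r1 += r2;
        if (--i)
            goto loop;
    }
    if (!v)
        return (gettime() - t) / (8.0 * (double)n);
    use(r1, r2);           /* never reached at run time; keeps r1 and r2 live */
    return -1.0;
}
```

Because `v` is volatile, the compiler has to assume that any case label may be the entry point, so it cannot merge the eight additions into one. Calling `add_latency(100000000L)` and multiplying the result by the clock frequency gives a rough latency in cycles.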
12 X-Ray architecture
13 Instruction Throughput
N=3, B=1
14 Micro-benchmarks in X-Ray
- CPU
- Frequency
- Instruction Latency
- Instruction Throughput
- Instruction Existence
- FPU on embedded processors
- FMA on general purpose processors
- SMP and SMT
- Memory Hierarchy
- Number of Registers of various types (int, float, SSE, ...)
- Multilevel Caches, TLB
- Associativity
- Block Size
- Capacity
- Latency
- Instruction Cache Capacity
15 Previous Approaches for Memory Hierarchy Parameters
- Saavedra Benchmark (Hennessy-Patterson)
- Accesses elements of an array constant stride apart
- Measures average memory access time
- Deficiencies
- Considers all levels simultaneously
- Works only for capacities that are powers-of-2
- Suffers from a number of implementation level deficiencies
- Constant stride accesses
- Loop overhead problems
- Overlapping memory operations
- Prone to compiler optimizations
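For reference, the constant-stride scheme criticized above can be sketched roughly as follows. This is an illustrative reconstruction of the general idea and not the original Saavedra code: the function name `stride_access_time`, the use of `clock()`, and the `volatile` accumulator are assumptions. The sketch deliberately keeps the listed weaknesses, since the loop overhead is timed together with the accesses and the independent loads can overlap.

```c
#include <stdlib.h>
#include <time.h>

/* Average seconds per access when reading every `stride`-th element of an
 * n-element int array, repeated `passes` times.
 * Returns a negative value on bad arguments or allocation failure. */
double stride_access_time(size_t n, size_t stride, int passes)
{
    int *a;
    volatile int acc = 0;   /* keeps the loads from being optimized away */
    size_t i, accesses = 0;
    int p;
    clock_t t0, t1;

    if (n == 0 || stride == 0 || passes <= 0)
        return -1.0;
    a = (int *)calloc(n, sizeof *a);
    if (a == NULL)
        return -1.0;

    t0 = clock();
    for (p = 0; p < passes; p++) {
        for (i = 0; i < n; i += stride) {
            acc += a[i];    /* constant-stride access */
            accesses++;
        }
    }
    t1 = clock();

    free(a);
    return ((double)(t1 - t0) / CLOCKS_PER_SEC) / (double)accesses;
}
```

Sweeping `n` and `stride` over powers of two and plotting the returned averages gives the familiar staircase curves. Every cache level contributes to each point on those curves, which is the first deficiency listed above.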
16 Example: Isolation of lower cache levels
- Idea for Ln measurements
- Use sequences as for L1 measurements
- Make L1 ... Ln-1 transparent to measurements
- Unique in isolating the behavior of Ln so that all higher levels miss
- Approach
- Use sequences of sequences
- Convolution of sequences
17 Measuring I-Cache Capacity
- Approach for Data Cache does not work
- Array of pointers -> Code sequence with branches
- Such branches are very predictable
- Nearly impossible to get precise timing
- Measure time to execute special code sequence of size N statements
- Find the biggest N for which there is no significant increase in time per statement
18 Nano-benchmark
- Similar to Instruction Throughput
- Parameters (1, 4)
- Grow length N
- Code size computed
- (char *)finish - (char *)start
19 Sensitivity
- Graph for Pentium M
- 9 more in the paper
- Performance oscillates
- Even after averaging out noise
- Cannot wait for jump
- Need more robust measurement
20 Control Engine Script
- Start with N = 256
- Compute
- Mean
- Standard deviation
- For
- Binary-search
- Detect jump when time is more than
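The search for the jump can be sketched as follows. The slide's actual threshold did not survive in this transcript, so this sketch assumes that a jump means "more than k standard deviations above the baseline mean", with the baseline taken from repeated runs at the starting N. The doubling phase before the binary search is also an assumption. `measure_fn`, `fake_measure`, `is_jump`, and `find_capacity` are hypothetical names, and `fake_measure` is a synthetic stand-in for a real timing run.

```c
/* Hypothetical callback: time per statement for a code sequence of n statements. */
typedef double (*measure_fn)(long n);

/* Synthetic stand-in for a real measurement: flat up to 5000 statements,
 * then noticeably slower. */
double fake_measure(long n)
{
    return n <= 5000 ? 1.0 : 3.0;
}

/* Assumed criterion: t exceeds mean by more than k standard deviations.
 * Compared in squared form so that no sqrt() is needed. */
static int is_jump(double t, double mean, double var, double k)
{
    double d = t - mean;
    return d > 0.0 && d * d > k * k * var;
}

/* Largest N without a jump, or -1 if no jump is found up to n_max. */
long find_capacity(measure_fn measure, long n0, int samples, double k, long n_max)
{
    double sum = 0.0, sumsq = 0.0, mean, var, t;
    long lo, hi, mid;
    int s;

    if (n0 < 1 || samples < 1)
        return -1;

    /* Baseline mean and variance from repeated runs at the starting size. */
    for (s = 0; s < samples; s++) {
        t = measure(n0);
        sum += t;
        sumsq += t * t;
    }
    mean = sum / samples;
    var = sumsq / samples - mean * mean;
    if (var < 0.0)
        var = 0.0;   /* guard against rounding */

    /* Double N until a jump is bracketed between lo and hi. */
    lo = n0;
    hi = 2 * n0;
    for (;;) {
        if (hi > n_max)
            return -1;
        if (is_jump(measure(hi), mean, var, k))
            break;
        lo = hi;
        hi *= 2;
    }

    /* Binary search: lo never shows a jump, hi always does. */
    while (hi - lo > 1) {
        mid = lo + (hi - lo) / 2;
        if (is_jump(measure(mid), mean, var, k))
            hi = mid;
        else
            lo = mid;
    }
    return lo;
}
```

With the synthetic model, `find_capacity(fake_measure, 256, 5, 3.0, 1L << 20)` returns 5000. The result is a statement count; turning it into bytes requires the measured code size of the sequence.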
21 Experimental Results
22 Pentium 4
- Does not cache ISA instructions, but uops
- Trace cache
- Measure the number of instructions
- Smoothing in the nano-benchmark: minimum of time in
23 Conclusions
- X-Ray: A framework and tool
- First to measure instruction cache capacity
- Algorithms for precise measurements of some important hardware parameters
- Experimental results on many modern architectures
- Other X-Ray resources
- Memory Hierarchy parameter measurement appeared at SIGMETRICS'05
- CPU parameter measurement appeared at QEST'05
- Improving X-Ray is work in progress
24 Current and Future Work
- 2-address vs. 3-address code
- Out-of-Order execution
- Number of physical registers
- Number / type of functional units
- Cache
- bandwidth
- write mode
- sharedness
- replacement policy
25 Thank you!
- My E-Mail
- kamen_at_yotov.org
- kyotov_at_us.ibm.com
- Cornell Group homepage
- http://iss.cs.cornell.edu
- This work emerged from a joint project with David Padua's group at UIUC
- http://polaris.cs.uiuc.edu/newframework.html
- Download X-Ray!
- http://iss.cs.cornell.edu/software/x-ray.aspx