Title: The SGI Pro64 Compiler Infrastructure
1 The SGI Pro64 Compiler Infrastructure - A Tutorial
- Guang R. Gao (U of Delaware), J. Dehnert (SGI)
- J. N. Amaral (U of Alberta), R. Towle (SGI)
2 Acknowledgement
- The SGI Compiler Development Teams
- The MIPSpro/Pro64 Development Team
- University of Delaware CAPSL Compiler Team
- These individuals contributed directly to this tutorial:
  A. Douillet (UDel), F. Chow (Equator), S. Chan (Intel), W. Ho (Routefree),
  Z. Hu (UDel), K. Lesniak (SGI), S. Liu (HP), R. Lo (Routefree),
  S. Mantripragada (SGI), C. Murthy (SGI), M. Murphy (SGI),
  G. Pirocanac (SGI), D. Stephenson (SGI), D. Whitney (SGI), H. Yang (UDel)
3 What is Pro64?
- A suite of optimizing compiler tools for Linux/Intel IA-64 systems
- C, C++, and Fortran90/95 compilers
- Conforming to the IA-64 Linux ABI and API standards
- Open to all researchers/developers in the community
- Compatible with the HP Native User Environment
4 Who Might Want to Use Pro64?
- Researchers: test new compiler analysis and optimization algorithms
- Developers: retarget to another architecture/system
- Educators: a compiler teaching platform
5 Outline
- Background and Motivation
- Part I: An overview of the SGI Pro64 compiler infrastructure
- Part II: The Pro64 code generator design
- Part III: Using Pro64 in compiler research and development
- SGI Pro64 support
- Summary
6 PART I: Overview of the Pro64 Compiler
7 Outline
- Logical compilation model and component flow
- WHIRL Intermediate Representation
- Inter-Procedural Analysis (IPA)
- Loop Nest Optimizer (LNO) and Parallelization
- Global optimization (WOPT)
- Feedback
- Design for debugability and testability
8 Logical Compilation Model
[Diagram: the driver (sgicc/sgif90/sgiCC) forks and execs each phase.
Data path: source (.c/.C/.f) -> front end (gfec/gfecc/mfef90) -> WHIRL (.B/.I)
-> IPA -> back end (be, as) -> obj (.o) -> linker (ld) -> a.out/.so]
9 Components of Pro64
- Front end
- Interprocedural Analysis and Optimization
- Loop Nest Optimization and Parallelization
- Global Optimization
- Code Generation
10 Data Flow Relationship Between Modules
[Diagram: the front ends (gfec, gfecc, f90) emit Very High WHIRL as .B files
(.I files after the inliner); I/O lowering is applied only for f90. Under
-IPA, Local IPA and Main IPA run, followed by lowering to High WHIRL; under
-O3, LNO runs on High WHIRL. The WHIRL-to-C (.w2c.c/.w2c.h) and
WHIRL-to-Fortran (.w2f.f) translators can emit source at this level. At -O0
everything is lowered directly for CG; at -O2/-O3 the main optimizer runs,
then lowering continues through Mid WHIRL to Low WHIRL and into CG. Either
path may be taken. Phases can be disabled with -phase woff.]
11 Front Ends
- C front end based on gcc
- C++ front end based on g++
- Fortran90/95 front end from MIPSpro
12 Intermediate Representation
- IR is called WHIRL
- Tree structured, with references to the symbol table
- Maps used for local or sparse annotation
- Common interface between components
- Multiple languages, multiple targets
- Same IR, 5 levels of representation
- Continuous lowering during compilation
- Optimization strategy tied to level
13 IPA Main Stage
- Analysis
- alias analysis
- array section
- code layout
- Optimization (fully integrated)
- inlining
- cloning
- dead function and variable elimination
- constant propagation
14 IPA Design Features
- User transparent
- No makefile changes
- Handles DSOs, unanalyzed objects
- Provides info (e.g. alias analysis, procedure properties) smoothly to:
  - loop nest optimizer
  - main optimizer
  - code generator
15 Loop Nest Optimizer/Parallelizer
- All languages (including OpenMP)
- Loop level dependence analysis
- Uniprocessor loop level transformations
- Automatic parallelization
16 Loop Level Transformations
- Based on unified cost model
- Heuristics integrated with software pipelining
- Loop vector dependency info passed to CG
- Loop Fission
- Loop Fusion
- Loop Unroll and Jam
- Loop Interchange
- Loop Peeling
- Loop Tiling
- Vector Data Prefetching
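To make the flavor of these transformations concrete, here is a minimal C sketch (illustrative only, not Pro64 code) of loop interchange: both versions compute the same result, but the interchanged loops walk the row-major array contiguously, which is what the cost model is weighing.

```c
#include <assert.h>

#define N 64

/* Innermost loop strides down the columns of a row-major array,
   so successive accesses touch different cache lines. */
void scale_colmajor(double a[N][N], double b[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            b[i][j] = 2.0 * a[i][j];
}

/* Loop interchange (legal here: iterations are independent):
   the innermost loop now walks contiguous memory. */
void scale_rowmajor(double a[N][N], double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = 2.0 * a[i][j];
}
```

The legality check (no loop-carried dependence is reversed) is exactly what the loop-level dependence analysis above provides.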
17 Parallelization
- Automatic
- Array privatization
- Doacross parallelization
- Array section analysis
- Directive based
- OpenMP
- Integrated with automatic methods
18 Global Optimization Phase
- SSA is unifying technology
- Use only SSA as program representation
- All traditional global optimizations implemented
- Every optimization preserves SSA form
- Can reapply each optimization as needed
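Why SSA makes global optimizations easy to reapply can be seen in miniature: because each SSA name has exactly one definition, a single forward pass folds constants. This toy C sketch (hypothetical, not WOPT's implementation) propagates constants through a straight-line SSA sequence.

```c
#include <assert.h>

enum { OP_CONST, OP_ADD, OP_MUL };

/* One SSA definition: either a constant, or an operation over
   earlier SSA names (indices into the defs array). */
typedef struct {
    int op;          /* OP_CONST, OP_ADD, or OP_MUL */
    int k;           /* constant value for OP_CONST */
    int src1, src2;  /* SSA names read by OP_ADD / OP_MUL */
} SsaDef;

/* Each SSA name is defined exactly once, so visiting the defs in
   order folds every value; returns the folded value of name n. */
int const_prop(const SsaDef defs[], int n, int val[]) {
    for (int i = 0; i <= n; i++) {
        switch (defs[i].op) {
        case OP_CONST: val[i] = defs[i].k; break;
        case OP_ADD:   val[i] = val[defs[i].src1] + val[defs[i].src2]; break;
        case OP_MUL:   val[i] = val[defs[i].src1] * val[defs[i].src2]; break;
        }
    }
    return val[n];
}
```

The output is still in SSA form, which is the property the slide emphasizes: each optimization can be rerun without a rebuild of the representation.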
19Pro64 Extensions to SSA
- Representing aliases and indirect memory
operations (Chow et al, CC 96) - Integrated partial redundancy elimination (Chow
et al, PLDI 97 Kennedy et al, CC 98, TOPLAS 99) - Support for speculative code motion
- Register promotion via load and store placement
(Lo et al, PLDI 98)
20 Feedback
- Used throughout the compiler
- Instrumentation can be added at any stage
- Explicit instrumentation data incorporated where inserted
- Instrumentation data maintained and checked for consistency through program transformations
21 Design for Debugability (DFD) and Testability (DFT)
- DFD and DFT built in from the start
- Can build with extra validity checks
- Simple option specification used to:
  - Substitute components known to be good
  - Enable/disable full components or specific optimizations
  - Invoke alternative heuristics
  - Trace individual phases
22 Where to Obtain the Pro64 Compiler and its Support
- SGI source download
  - http://oss.sgi.com/projects/Pro64/
- University of Delaware Pro64 Support Group
  - http://www.capsl.udel.edu/pro64
  - pro64@capsl.udel.edu
23 PART II: Overview of the Pro64 Code Generator
24 Outline
- Code generator flow diagram
- WHIRL/CGIR and TARG-INFO
- Hyperblock formation and predication (HBF)
- Predicate Query System (PQS)
- Loop preparation (CGPREP) and software pipelining
- Global and local instruction scheduling (IGLS)
- Global and local register allocation (GRA, LRA)
25 Flowchart of Code Generator
[Diagram: WHIRL -> WHIRL-to-TOP lowering -> CGIR (quad op list) -> EBO ->
Control Flow Opt I, EBO -> Hyperblock Formation and Critical-Path Reduction
(supported by the Predicate Query System, PQS) -> inner loop processing
(unrolling, EBO, loop prep, software pipelining) -> Control Flow Opt II, EBO
-> IGLS pre-pass -> GRA, LRA, EBO -> IGLS post-pass -> Control Flow Opt ->
Code Emission. EBO = extended basic block optimization: peephole, etc.]
26 WHIRL
- Abstract syntax tree based
- Symbol table links, map annotations
- Base representation is simple and efficient
- Used through several phases with lowering
- Designed for multiple target architectures
27 From WHIRL to CGIR: An Example
(a) Source: int a[]; int i; int aa; ... aa = a[i];
(b) WHIRL: [tree: ST aa <- LD a[CVTL32(i) * 4]]
(c) CGIR:
- T1 = sp + &a
- T2 = ld T1
- T3 = sp + &i
- T4 = ld T3
- T5 = sxt T4
- T6 = T5 << 2
- T7 = T6
- T8 = T2 + T7
- T9 = ld T8
- T10 = sp + &aa
- st T10, T9
28 Code Generation Intermediate Representation (CGIR)
- TOPs (Target Operations) are quads
- Operands/results are TNs
- Basic block nodes in control flow graph
- Load/store architecture
- Supports predication
- Flags on TOPs (copy ops, integer add, load, etc.)
- Flags on operands (TNs)
29 From WHIRL to CGIR (cont'd)
- Information passed:
  - alias information
  - loop information
  - symbol table and maps
30 The Target Information Table (TARG_INFO)
- Objective:
  - Parameterized description of a target machine and system architecture
  - Separates architecture details from the compiler's algorithms
  - Minimizes compiler changes when targeting a new architecture
31 The Target Information Table (TARG_INFO) (cont'd)
- Based on an extension of the Cydra tables, with major improvements
- Architectures already targeted:
  - Whole MIPS family
  - IA-64
  - IA-32
  - SGI graphics processors (earlier version)
32 Flowchart of Code Generator
33 Hyperblock Formation and Predicated Execution
- Hyperblock: a single-entry multiple-exit control-flow region
  - loop body, hammock region, etc.
- Hyperblock formation algorithm
  - Based on Scott Mahlke's method [Mahlke96]
  - But less aggressive tail duplication
34 Hyperblock Formation Algorithm
Phases: Region Identification -> Block Selection -> Tail Duplication -> If Conversion
- Candidate regions: hammock regions, innermost loops, general regions (path based)
- Paths sorted by priorities (frequency, size, length, etc.)
- Inclusion of a path is guided by its impact on resources, scheduling height, and priority level
- Internal branches are removed via predication
- Predicate reuse
- Objective: keep the scheduling height close to that of the highest priority path
35 Hyperblock Formation - An Example
(a) Source:
aa = a[i]; bb = b[i];
switch (aa) {
case 1:  if (aa < tabsiz) aa = tab[aa];
case 2:  if (bb < tabsiz) bb = tab[bb];
default: ans = aa + bb;
}
(b) CFG with blocks 1-8
(c) Hyperblock formation with aggressive tail duplication: hyperblocks H1 and
H2, with blocks 4-8 duplicated across the paths
36 Hyperblock Formation - An Example (cont'd)
(a) CFG with blocks 1-8
(b) Hyperblock formation with aggressive tail duplication (H1, H2, with
blocks duplicated across paths)
(c) Pro64 hyperblock formation (H1, H2, with far less duplication)
37 Features of the Pro64 Hyperblock Formation (HBF) Algorithm
- Forms good rather than maximal hyperblocks
- Avoids unnecessary duplication
- No reverse if-conversion
- Hyperblocks are not a barrier to global code motion later in IGLS
38 Predicate Query System (PQS)
- Purpose: gather information and provide interfaces allowing other phases to
  make queries regarding the relationships among predicate values
- PQS functions (examples):
  - BOOL PQSCG_is_disjoint(PQS_TN tn1, PQS_TN tn2)
  - BOOL PQSCG_is_subset(PQS_TN_SET tns1, PQS_TN_SET tns2)
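The two queries are easy to picture if each predicate is modeled as the set of condition outcomes under which it is true. This bitmask sketch is purely illustrative (the real PQS tracks predicate-defining ops symbolically), but it mirrors the two interface functions above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model a predicate as the set of if-converted paths on which it
   evaluates true, one bit per path. */
typedef uint32_t pred_set;

/* Disjoint: the predicates are never true together, so ops guarded
   by them can never both execute (e.g. p and !p from one compare). */
bool pqs_is_disjoint(pred_set a, pred_set b) { return (a & b) == 0; }

/* Subset: whenever a predicate in `a` is true, one in `b` is too,
   so code guarded by `b` dominates code guarded by `a`. */
bool pqs_is_subset(pred_set a, pred_set b) { return (a & ~b) == 0; }
```

Phases such as scheduling and register allocation use exactly these relations to decide whether two predicated ops can share a resource.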
39 Flowchart of Code Generator
40 Loop Preparation and Optimization for Software Pipelining
- Loop canonicalization for SWP
- Read/write removal (register aware)
- Loop unrolling (resource aware)
- Recurrence removal or extension
- Prefetch
- Forced if-conversion
41 Pro64 Software Pipelining Method: Overview
- Test for SWP-amenable loops
- Extensive loop preparation and optimization before application [DeTo93]
- Use a lifetime-sensitive SWP algorithm [Huff93]
- Register allocation after scheduling, based on Cydra 5 [RLTS92, DeTo93]
- Handles both while and do loops
- Smooth switching to normal scheduling if not successful
42 Pro64 Lifetime-Sensitive Modulo Scheduling for Software Pipelining
- Features:
  - Tries to place an op ASAP or ALAP to minimize register pressure
  - Slack scheduling
  - Limited backtracking
  - Operation-driven scheduling framework
[Diagram: compute Estart/Lstart for all unplaced ops; choose a good op and
place it into the current partial schedule within its Estart/Lstart range,
ejecting conflicting ops if needed; when all ops are placed, register
allocate; if allocation succeeds, done, otherwise repeat.]
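The Estart/Lstart bounds in the loop above are a longest-path computation over the dependence graph. This C sketch (hypothetical latencies, acyclic dependences only; the real scheduler also handles loop-carried edges modulo the initiation interval) shows the idea, with slack = Lstart - Estart.

```c
#include <assert.h>

#define MAX_OPS 16

/* edge[i][j] = latency of dependence i -> j, or -1 if none.
   Ops are assumed topologically ordered (i < j for every edge). */
void estart_lstart(int n, int edge[MAX_OPS][MAX_OPS],
                   int len, int estart[], int lstart[]) {
    for (int j = 0; j < n; j++) {
        /* earliest start: longest latency path from any source */
        estart[j] = 0;
        for (int i = 0; i < j; i++)
            if (edge[i][j] >= 0 && estart[i] + edge[i][j] > estart[j])
                estart[j] = estart[i] + edge[i][j];
    }
    for (int i = n - 1; i >= 0; i--) {
        /* latest start: schedule length minus longest path to a sink */
        lstart[i] = len;
        for (int j = i + 1; j < n; j++)
            if (edge[i][j] >= 0 && lstart[j] - edge[i][j] < lstart[i])
                lstart[i] = lstart[j] - edge[i][j];
    }
}
```

An op with zero slack is on the critical path and is the kind of op the scheduler wants to place first.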
43 Flowchart of Code Generator
44 Integrated Global and Local Scheduling (IGLS) Method
- The basic IGLS framework integrates global code motion (GCM) with local scheduling [MaJD98]
- IGLS extended to hyperblock scheduling
- Performs profitable code motion between hyperblock regions and normal regions
45 IGLS Phase Flow Diagram
- Hyperblock Scheduling (HBS)
- Global Code Motion (GCM): block priority selection, motion selection, target selection
- Local Code Scheduling (LCS)
46 Advantages of the Extended IGLS Method - The Example Revisited
- Advantages:
  - No rigid boundaries between hyperblocks and non-hyperblocks
  - GCM moves code into and out of a hyperblock according to profitability
[Diagram: (a) Pro64 hyperblocks H1 and H2 over the example CFG;
(b) profitable duplication producing H3]
47 Software Pipelining vs. Normal Scheduling
[Diagram: if an inner loop is an SWP-amenable candidate, inner loop
processing applies software pipelining; on success, proceed to code
emission. If the loop is not a candidate, or pipelining fails or is not
profitable, fall back to IGLS and GRA/LRA before code emission.]
48 Flowchart of Code Generator
49 Global and Local Register Allocation (GRA/LRA)
[Flow: from pre-pass IGLS -> GRA (with an LRA register request, LRA-RQ) ->
LRA -> to post-pass IGLS]
- LRA-RQ provides an estimate of local register requirements
- GRA allocates global variables using a priority-based register allocator
  [ChowHennessy90, Chow83, Briggs92]
- Incorporates IA-64 specific extensions, e.g. register stack usage
50 Local Register Allocation (LRA)
- Assign_Registers: uses a reverse linear scan
- Reordering: depth-first ordering on the DDG
[Flow: Assign_Registers; on failure, Fix_LRA - instruction reordering the
first time, then spill global / spill local registers.]
51 Future Research Topics for the Pro64 Code Generator
- Hyperblock formation
- Predicate query system
- Enhanced speculation support
52 PART III: Using Pro64 in Compiler Research and Development
53 Outline
- General Remarks
- Case Study I: Integration of a new instruction reordering algorithm to
  minimize register pressure [Govind, Yang, Amaral, Gao 2000]
- Case Study II: Design and evaluation of an induction pointer prefetching
  algorithm [Stoutchinin, Douillet, Amaral, Dehnert, Gao 2000]
54 Case I
- Introduction of the Minimum Register Instruction Sequence (MRIS) problem
  and a proposed solution
- Problem formulation
- The proposed algorithm
- Pro64 porting experience
  - Where to start
  - How to start
- Results
- Summary
55 Researchers
- R. Govindarajan (Indian Inst. of Science)
- Hongbo Yang (Univ. of Delaware)
- Chihong Zhang (Conexant)
- José Nelson Amaral (Univ. of Alberta)
- Guang R. Gao (Univ. of Delaware)
56 The Minimum Register Instruction Sequence Problem
Given a data dependence graph G, derive an instruction sequence S for G that
is optimal in the sense that its register requirement is minimum.
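The "register requirement" of a sequence can be measured directly: a value occupies a register from its definition to its last use, and the requirement is the maximum number of simultaneously live values. This C sketch (a hypothetical helper and example DDG, not the paper's algorithm) scores two legal orderings of the same graph.

```c
#include <assert.h>

#define MAX_OPS 16

typedef struct {
    int nuses;
    int use[2];   /* value ids (defining ops) this op reads */
} Op;

/* Register requirement of a schedule: each op defines value id == its
   index; a value is live from its definition position to its last use. */
int max_live(const Op ops[], int n, const int order[]) {
    int pos[MAX_OPS], last[MAX_OPS];
    for (int t = 0; t < n; t++) pos[order[t]] = t;
    for (int id = 0; id < n; id++) last[id] = pos[id];
    for (int id = 0; id < n; id++)
        for (int u = 0; u < ops[id].nuses; u++)
            if (pos[id] > last[ops[id].use[u]])
                last[ops[id].use[u]] = pos[id];
    int maxlive = 0;
    for (int t = 0; t < n; t++) {
        int live = 0;
        for (int id = 0; id < n; id++)
            if (pos[id] <= t && t <= last[id]) live++;
        if (live > maxlive) maxlive = live;
    }
    return maxlive;
}
```

Issuing all loads up front maximizes pressure; interleaving them with their consumers, as an MRIS-aware sequence does, lowers it without changing the dependences honored.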
57 A Motivating Example
(a) DDG; (b) Instruction Sequence 1; (c) Instruction Sequence 2
- Observation: register requirements drop 25% from (b) to (c)!
58 Motivation
- IA-64 style processors:
  - Reduce spills in the local register allocation phase
  - Reduce Local Register Allocation (LRA) requests in the Global Register
    Allocation (GRA) phase
  - Reduce overall register pressure on a per-procedure basis
- Out-of-order issue processors:
  - Instruction reordering buffer
  - Register renaming
59 How to Solve the MRIS Problem?
(a) Concepts:
- Register lineages
- Live ranges of lineages
- Lineage interference
(b) DDG
(c) Lineages: L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g)
62 How to Solve the MRIS Problem? (cont'd)
Lineages: L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g)
Questions:
- Can L1 and L2 share the same register?
- Can L2 and L3 share the same register?
- Can L1 and L4 share the same register?
- Can L2 and L4 share the same register?
63 Lineage Interference Graph
Lineages: L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g)
(a) Original DDG with nodes a-h; (b) Lineage Interference Graph (LIG)
Question: Is the lower bound of the required registers 3?
Challenge: Derive a Heuristic Register Bound (HRB)!
64 Our Solution Method
[Flow: DDG -> form Lineage Interference Graph (LIG) -> derive HRB ->
extended list scheduling guided by HRB -> a good instruction sequence]
- A good construction algorithm for the LIG
- An effective heuristic method to calculate the HRB
- An efficient scheduling method (does not backtrack)
65 Pro64 Porting Experience
- Porting plan and design
- Implementation
- Debugging and validation
- Evaluation
66 Implementation
- Dependence graph construction
- LIG formation
- LIG construction and coloring
- The reordering algorithm implementation
67 Porting Plan and Design
- Understand the compiler infrastructure
- Understand the register model (mainly from targ_info,
  e.g. ../common/targ_info/abi/ia64):
  - register classes (int, float, predicate, app, control)
  - register save/restore conventions: caller/callee save, return value,
    argument passing, stack pointer, etc.
68 Register Allocation
[Flow: GRA -> LRA at block level -> Assign_Registers; on failure,
Fix_LRA_Blues: reschedule, local code motion, spill global or local
registers.]
69 Implementation (cont'd)
- DDG construction: use native service routines, e.g. CG_DEP_Compute_Graph
- LIG coloring: use native support for the set package (e.g. bitset.c)
- Scheduler implementation: native vector package support (e.g. cg_vector.cxx)
- Access the dependence graph using native service functions: ARC_succs,
  ARC_preds, ARC_kind
70 Debugging and Validation
- Trace file flags:
  - -tt54 0x1: general trace of LRA
  - -tt45 0x4: dependence graph building
  - -tr53: Target Operations (TOPs) before LRA
  - -tr54: TOPs after LRA
71 Evaluation
- Static measurement
  - Fat point: -tt54 0x40
- Dynamic measurement
  - Hardware counters in the R12K, read with perfex
72 Evaluation (cont'd)
- For the MIPS R12K (SPEC95fp), the lineage-based algorithm reduces the
  number of loads executed by 12%, the number of stores by 14%, and the
  execution time by 2.5% over a baseline.
- It is slightly better than the algorithm in the MIPSpro compiler.
73 Case II
Design and Evaluation of an Induction Pointer Prefetching Algorithm
74 Researchers
- Artour Stoutchinin (STMicroelectronics)
- José Nelson Amaral (Univ. of Alberta)
- Guang R. Gao (Univ. of Delaware)
- Jim Dehnert (Silicon Graphics Inc.)
- Suneel Jain (Narus Inc.)
- Alban Douillet (Univ. of Delaware)
75 Motivation
The important loops of many programs are pointer-chasing loops that access
recursive data structures through induction pointers.
Example:
max = 0;
current = head;
while (current != NULL) {
    if (current->key > max)
        max = current->key;
    current = current->next;
}
76 Problem Statement
- How to identify pointer-chasing recurrences?
- How to decide whether there are enough processor resources and memory
  bandwidth to profitably prefetch an induction pointer?
- How to efficiently integrate induction pointer prefetching with loop
  scheduling, based on the profitability analysis?
77 Prefetching Costs
- More instructions to issue
- More memory traffic
- Longer code (disruption in the instruction cache)
- Displacement of potentially good data from the cache
Before prefetching:
    t226 = lw 0x34(t228)
After prefetching:
    t226 = lw 0x34(t228)
    tmp  = subu t226, t226s
    tmp  = addu tmp, tmp
    tmp  = addu t226, tmp
    pref 0x0(tmp)
    t226s = t226
78 What to Prefetch? When to Prefetch It?
A good optimizing compiler should only prefetch data that will actually be
referenced.
It should prefetch far enough in advance to prevent a cache miss when the
reference occurs.
But not too far in advance, because the data might be evicted from the cache
before it is used, or might displace data that will be referenced again.
79 Prefetch Address
In order to prefetch, the compiler must calculate
addresses that will be referenced in future
iterations of the loop.
For loops that access regular data structures,
such as vectors and matrices, compilers can use
static analysis of the array indexes to compute
the prefetching addresses.
How can we predict future values of induction
pointers?
80 Key Intuition
Recursive data structures are often allocated at regular intervals.
Example:
curr = head = (item *) malloc(sizeof(item));
while ((curr->key = get_key()) != NULL) {
    curr = curr->next = (item *) malloc(sizeof(item));
    other_memory_allocations();
}
curr->next = NULL;
81 Pre-Fetching Technique
Example:
max = 0;
current = head;
tmp = current;
while (current != NULL) {
    if (current->key > max)
        max = current->key;
    current = current->next;
    stride = current - tmp;
    prefetch(current + stride * k);
    tmp = current;
}
82 Prefetch Sequence (R10K)
In our implementation, the stride is recomputed in every iteration of the
loop, making it tolerant of (infrequent) stride changes.
stride    = addr - addr.prev
stride    = stride * k
addr.pref = addr + stride
addr.prev = addr
pref addr.pref
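The stride speculation above can be checked in plain C: when nodes happen to sit at regular intervals (here, in one array), the address predicted from the last observed stride matches the node k links ahead. This is a hypothetical test helper, not compiler output.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node { int key; struct node *next; } node;

/* Walk the list computing stride = addr - addr.prev each iteration
   and predicting addr + k*stride, exactly as the inserted prefetch
   sequence does.  Returns how many predictions hit the node that is
   actually k links ahead. */
int count_stride_hits(node *head, int k) {
    int hits = 0;
    node *prev = head;
    for (node *cur = head; cur != NULL; cur = cur->next) {
        node *ahead = cur;
        for (int i = 0; i < k && ahead != NULL; i++) ahead = ahead->next;
        if (ahead != NULL) {
            ptrdiff_t stride = (char *)cur - (char *)prev;
            if ((char *)cur + (ptrdiff_t)k * stride == (char *)ahead)
                hits++;
        }
        prev = cur;
    }
    return hits;
}
```

Only the first iteration mispredicts (its observed stride is zero); once the stride settles, every prediction lands on the correct future node, which is why the recomputed stride tolerates infrequent layout changes.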
83 Identification of Pointer-Chasing Recurrences
A surprisingly simple method works well: look in the intermediate code for
recurrence circuits containing only loads with constant offsets.
Examples:
node = ptr->next          =>  r1 <- load r2, offset_next
ptr = node->ptr           =>  r2 <- load r1, offset_ptr
current = current->next   =>  r2 <- load r1
                              r1 <- load r2, offset_next
84 Profitability Analysis
Goal: balance the gains and costs of prefetching.
Although we use resource estimates analogous to those done for software
pipelining, we consider loop bodies with control flow.
How do we estimate the resources available for prefetching in a basic block
B that belongs to many data dependence recurrences?
85 Software Pipelining
- What limits the speed of a loop?
  - Data dependences: recurrence initiation interval (recMII)
  - Processor resources: resource initiation interval (resMII)
  - Memory accesses: memory initiation interval (memMII)
[Diagram: a modulo schedule drawn on a time axis]
86 Data Dependences (recMII)
The recurrence minimum initiation interval (recMII) is given by the maximum,
over all dependence cycles, of the cycle's total latency divided by its
total dependence distance.
for i = 0 to N - 1 do
    a: X[i] = X[i-1] + R[i]
    b: Y[i] = X[i] + Z[i-1]
    c: Z[i] = Y[i] + 1
end
[DDG with edges labeled (dist, lat)]
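The bound this slide refers to is the standard recurrence constraint; with dist(e) the dependence distance and lat(e) the latency on each edge e of a dependence cycle c, it reads:

```latex
\mathrm{recMII} \;=\; \max_{c}\;
  \left\lceil
    \frac{\sum_{e \in c} \mathrm{lat}(e)}{\sum_{e \in c} \mathrm{dist}(e)}
  \right\rceil
```

Intuitively, the latencies around a cycle must be paid once every dist iterations, so no schedule can initiate iterations faster than the worst cycle allows.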
87 The recMII for Loops with Control Flow
An instruction of a basic block B can belong to many recurrences (with
distinct control paths). We define the recurrence MII of a load operation L
as the maximum recMII over all recurrences c with L in c (L in c means that
the operation L is part of the recurrence c).
[Control Flow Graph]
88 Processor Resources (resMII)
A basic block B may belong to multiple control paths. We define the resource
constraint of a basic block B as the maximum resMII over all control paths
that execute B.
[Control Flow Graph]
89 Available Memory Bandwidth
Processors with non-blocking caches can support up to k outstanding cache
misses without stalling. We define the available memory bandwidth S(B) of a
basic block B as the minimum of k - m(p) over all control paths that execute
B, where m(p) is the number of expected cache misses in each control path p.
[Control Flow Graph]
90 Profitability Analysis (cont'd)
Adding prefetch code for an induction pointer L in a basic block B is
profitable if both: (1) the MII due to recurrences that contain L is greater
than the resMII after prefetch insertion, and (2) there is enough memory
bandwidth to enable another cache miss without causing stalls.
91 Computing Available Memory Bandwidth
To compute the available memory bandwidth of a
control path we need to estimate how many cache
misses are expected in that control path.
We use a graph coloring technique over a cache
miss interference graph to predict which memory
references are likely to incur a miss.
92 The Miss Interference Graph
Two memory references interfere if:
1. They are both expected to miss the cache
2. They can both be issued in the same iteration of the loop
3. They do not fall into the same cache line
Miss Interference Graph assumptions:
1. Loop-invariant references are cache hits (global-pointer relative,
   stack-pointer relative, etc.)
2. Memory references on mutually exclusive control paths do not interfere
3. References relative to the same base address interfere only if their
   relative offset is larger than the cache line
93 Prefetching Algorithm
DoPrefetch(P, V, E)
1.  C <- pointer-chasing recurrences
2.  R <- prioritized list of induction pointer loads in C
3.  N <- prioritized list of other loads (not in C)
4.  O <- R ∪ N
5.  mark each L in O as a cache miss
6.  for each L in O, L in B
7.      do if recMII_P(B) > resMII_P(B) and S(B) > 0
8.          then add prefetch for L to B
9.               mark L as a cache hit
10.     endif
11. endfor
94 An Example
mcf: minimal cost flow optimizer (Konrad-Zuse Informatics Center, Berlin)
1   while (arcin) {
2       tail = arcin->tail;
3       if (tail->time + arcin->org_cost > latest) {
4           arcin = (arc_t *) tail->mark;
5           continue;
        }
6       arc_cost = tail->potential + head_potential;
7       if (red_cost < 0) {
8           if (new_arcs < MAX_NEW_ARCS) {
9               insert_new_arc(arcnew, new_arcs, tail, head,
                               arc_cost, red_cost);
10              new_arcs++;
11          } else if ((cost_t) arcnew[0].flow > red_cost)
12              replace_weaker_arc(arcnew, tail, head,
                                   arc_cost, red_cost);
        }
13      arcin = (arc_t *) tail->mark;
    }
95 An Example (cont'd)
96 An Example (cont'd)
B1:
1. t228 = lw 0x0(t226)
2. t229 = lw 0x14(t226)
3. t230 = lw 0x38(t228)
4. t231 = addu t229, t230
5. t232 = slt t220, 0
6. bne B3, t232, 0
B2:
7. t226 = lw 0x34(t228)
8. b B8
B3:
9.  t234 = lw 0x2c(t228)
10. t235 = subu t225, t234
11. t233 = addiu t235, 0x1e
12. bgez B7, t233
B4: insert_new_arc()
B5: replace_weaker_arc()
B6:
13. t236 = slt t209, t175
14. beq B6, t236, 0
B7:
15. t226 = lw 0x34(t228)
B8:
15. bne B1, t226, 0
99 An Example (cont'd)
B1:
1. t228 = lw 0x0(t226)
2. t229 = lw 0x14(t226)
3. t230 = lw 0x38(t228)
4. t231 = addu t229, t230
5. t232 = slt t220, 0
6. bne B3, t232, 0
B2:
7. t226 = lw 0x34(t228)
8. b B10
B7:
15. t226 = lw 0x34(t228)
(blocks B3, B4, B5, B6, B8 as before)
100 An Example (cont'd)
B1:
1. t228 = lw 0x0(t226)
1. tmp  = subu t228, t228s
1. tmp  = addu tmp, tmp
1. tmp  = addu t228, tmp
1. pref 0x34(tmp)
1. t228s = t228
2. t229 = lw 0x14(t226)
3. t230 = lw 0x38(t228)
4. t231 = addu t229, t230
5. t232 = slt t220, 0
6. bne B3, t232, 0
B2:
7. t226 = lw 0x34(t228)
7. tmp  = subu t226, t226s
7. tmp  = addu tmp, tmp
7. tmp  = addu t226, tmp
7. pref 0x0(tmp)
7. t226s = t226
8. b B10
B7:
15. t226 = lw 0x34(t228)
15. tmp  = subu t226, t226s
15. tmp  = addu tmp, tmp
15. tmp  = addu t226, tmp
15. pref 0x0(tmp)
15. t226s = t226
B8:
(blocks B3, B4, B5, B6 as before)
101 When Pointer Prefetch Works
102 When Pointer Prefetch Does Not Help
103 Summary of Attributes
- Software-only implementation
- Simple candidate identification
- Simple code transformation
- No impact on user data structures
- Simple profitability analysis, local to the loop
- Performance degradations are rare and minor
104 Open Questions
- How often is the speculated stride correct?
- Can instrumentation feedback help?
- How well does the speculative prefetch work with other recursive data
  structures: trees, graphs, etc.?
- How well does this approach work for read/write recursive data structures?
105 Related Work (Software)
- Luk-Mowry (ASPLOS-96)
  - Greedy prefetching, history-pointer prefetching, data-linearization
    prefetching
  - Change the data structure storage
- Lipasti et al. (Micro-95)
  - Prefetching pointers at procedure call sites
- Liu-Dimitri-Kaeli (Journal of Syst. Arch.-99)
  - Maintains a table of offsets for prefetching
106 Related Work (Hardware)
- Roth-Moshovos-Sohi (ASPLOS, 1998)
- Gonzalez-Gonzalez (ICS, 1997)
- Mehrotra (Urbana-Champaign, 1996)
- Chen-Baer (Trans. Computers, 1995)
- Charney-Reeves (Trans. Comp., 1994)
- Jegou-Temam (ICS, 1993)
- Fu-Patel (Micro, 1992)
107 Execution Time Measurements
108 Prefetch Improvement
109 L1 Cache Misses
110 L2 Cache Misses
111 TLB Misses
112 Benchmarks
gcc     GNU C compiler
li      Lisp interpreter
mcf     Minimal cost flow solver
parser  Syntactic parser of English
twolf   Place and route simulator
mlp     Multi-layer perceptron simulator
ft      Minimum spanning tree algorithm