Title: The SGI Pro64 Compiler Infrastructure
1 The SGI Pro64 Compiler Infrastructure - A Tutorial
- Guang R. Gao (U of Delaware), J. Dehnert (SGI)
- J. N. Amaral (U of Alberta), R. Towle (SGI)
2 Acknowledgement
- The SGI Compiler Development Teams
- The MIPSpro/Pro64 Development Team
- University of Delaware CAPSL Compiler Team
- These individuals contributed directly to this tutorial:
  A. Douillet (UDel), F. Chow (Equator), S. Chan (Intel), W. Ho (Routefree),
  Z. Hu (UDel), K. Lesniak (SGI), S. Liu (HP), R. Lo (Routefree),
  S. Mantripragada (SGI), C. Murthy (SGI), M. Murphy (SGI),
  G. Pirocanac (SGI), D. Stephenson (SGI), D. Whitney (SGI), H. Yang (UDel)
3 What is Pro64?
- A suite of optimizing compiler tools for Linux/Intel IA-64 systems
- C, C++, and Fortran90/95 compilers
- Conforming to the IA-64 Linux ABI and API standards
- Open to all researchers/developers in the community
- Compatible with the HP Native User Environment
4 Who Might Want to Use Pro64?
- Researchers: test new compiler analysis and optimization algorithms
- Developers: retarget to another architecture/system
- Educators: a compiler teaching platform
5 Outline
- Background and Motivation
- Part I: An overview of the SGI Pro64 compiler infrastructure
- Part II: The Pro64 code generator design
- Part III: Using Pro64 in compiler research and development
- SGI Pro64 support
- Summary
6 PART I: Overview of the Pro64 Compiler
7 Outline
- Logical compilation model and component flow
- WHIRL Intermediate Representation
- Inter-Procedural Analysis (IPA)
- Loop Nest Optimizer (LNO) and Parallelization
- Global optimization (WOPT)
- Feedback
- Design for debugability and testability
8 Logical Compilation Model
[Diagram: the driver (sgicc/sgif90/sgiCC) forks and execs each phase.
Data path: source (.c/.C/.f) -> front end (gfec/gfecc/mfef90) -> WHIRL (.B/.I)
-> IPA -> back end (be, as) -> obj (.o) -> linker (ld) -> a.out/.so]
9 Components of Pro64
- Front end
- Interprocedural Analysis and Optimization
- Loop Nest Optimization and Parallelization
- Global Optimization
- Code Generation
10 Data Flow Relationship Between Modules
[Diagram: the front ends (gfec, gfecc, f90) emit Very High WHIRL as .B files
(.I files after the inliner); I/O lowering is applied only for f90. Under
-IPA, Local IPA and Main IPA run, followed by lowering to High WHIRL; under
-O3, LNO runs on High WHIRL. The WHIRL-to-C (.w2c.c/.w2c.h) and
WHIRL-to-Fortran (.w2f.f) translators can emit source at this level. At -O0
everything is lowered directly for CG; at -O2/-O3 the main optimizer runs,
then lowering continues through Mid WHIRL to Low WHIRL and into CG. Either
path may be taken. Phases can be disabled with -phase woff.]
11 Front Ends
- C front end based on gcc
- C++ front end based on g++
- Fortran90/95 front end from MIPSpro
12 Intermediate Representation
- IR is called WHIRL
- Tree structured, with references to the symbol table
- Maps used for local or sparse annotation
- Common interface between components
- Multiple languages, multiple targets
- Same IR, 5 levels of representation
- Continuous lowering during compilation
- Optimization strategy tied to level
13 IPA Main Stage
- Analysis
- alias analysis
- array section
- code layout
- Optimization (fully integrated)
- inlining
- cloning
- dead function and variable elimination
- constant propagation
14 IPA Design Features
- User transparent
- No makefile changes
- Handles DSOs, unanalyzed objects
- Provides info (e.g. alias analysis, procedure properties) smoothly to:
  - loop nest optimizer
  - main optimizer
  - code generator
15 Loop Nest Optimizer/Parallelizer
- All languages (including OpenMP)
- Loop level dependence analysis
- Uniprocessor loop level transformations
- Automatic parallelization
16 Loop Level Transformations
- Based on unified cost model
- Heuristics integrated with software pipelining
- Loop vector dependency info passed to CG
- Loop Fission
- Loop Fusion
- Loop Unroll and Jam
- Loop Interchange
- Loop Peeling
- Loop Tiling
- Vector Data Prefetching
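To make the flavor of these transformations concrete, here is a minimal C sketch (illustrative only, not Pro64 code) of loop interchange: both versions compute the same result, but the interchanged loops walk the row-major array contiguously, which is what the cost model is weighing.

```c
#include <assert.h>

#define N 64

/* Innermost loop strides down the columns of a row-major array,
   so successive accesses touch different cache lines. */
void scale_colmajor(double a[N][N], double b[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            b[i][j] = 2.0 * a[i][j];
}

/* Loop interchange (legal here: iterations are independent):
   the innermost loop now walks contiguous memory. */
void scale_rowmajor(double a[N][N], double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = 2.0 * a[i][j];
}
```

The legality check (no loop-carried dependence is reversed) is exactly what the loop-level dependence analysis above provides.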
17 Parallelization
- Automatic
- Array privatization
- Doacross parallelization
- Array section analysis
- Directive based
- OpenMP
- Integrated with automatic methods
18 Global Optimization Phase
- SSA is unifying technology
- Use only SSA as program representation
- All traditional global optimizations implemented
- Every optimization preserves SSA form
- Can reapply each optimization as needed
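Why SSA makes global optimizations easy to reapply can be seen in miniature: because each SSA name has exactly one definition, a single forward pass folds constants. This toy C sketch (hypothetical, not WOPT's implementation) propagates constants through a straight-line SSA sequence.

```c
#include <assert.h>

enum { OP_CONST, OP_ADD, OP_MUL };

/* One SSA definition: either a constant, or an operation over
   earlier SSA names (indices into the defs array). */
typedef struct {
    int op;          /* OP_CONST, OP_ADD, or OP_MUL */
    int k;           /* constant value for OP_CONST */
    int src1, src2;  /* SSA names read by OP_ADD / OP_MUL */
} SsaDef;

/* Each SSA name is defined exactly once, so visiting the defs in
   order folds every value; returns the folded value of name n. */
int const_prop(const SsaDef defs[], int n, int val[]) {
    for (int i = 0; i <= n; i++) {
        switch (defs[i].op) {
        case OP_CONST: val[i] = defs[i].k; break;
        case OP_ADD:   val[i] = val[defs[i].src1] + val[defs[i].src2]; break;
        case OP_MUL:   val[i] = val[defs[i].src1] * val[defs[i].src2]; break;
        }
    }
    return val[n];
}
```

The output is still in SSA form, which is the property the slide emphasizes: each optimization can be rerun without a rebuild of the representation.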
19Pro64 Extensions to SSA
- Representing aliases and indirect memory
operations (Chow et al, CC 96) - Integrated partial redundancy elimination (Chow
et al, PLDI 97 Kennedy et al, CC 98, TOPLAS 99) - Support for speculative code motion
- Register promotion via load and store placement
(Lo et al, PLDI 98)
20 Feedback
- Used throughout the compiler
- Instrumentation can be added at any stage
- Explicit instrumentation data incorporated where inserted
- Instrumentation data maintained and checked for consistency through program transformations
21 Design for Debugability (DFD) and Testability (DFT)
- DFD and DFT built in from the start
- Can build with extra validity checks
- Simple option specification used to:
  - Substitute components known to be good
  - Enable/disable full components or specific optimizations
  - Invoke alternative heuristics
  - Trace individual phases
22 Where to Obtain the Pro64 Compiler and its Support
- SGI source download
  - http://oss.sgi.com/projects/Pro64/
- University of Delaware Pro64 Support Group
  - http://www.capsl.udel.edu/pro64
  - pro64@capsl.udel.edu
23 PART II: Overview of the Pro64 Code Generator
24 Outline
- Code generator flow diagram
- WHIRL/CGIR and TARG-INFO
- Hyperblock formation and predication (HBF)
- Predicate Query System (PQS)
- Loop preparation (CGPREP) and software pipelining
- Global and local instruction scheduling (IGLS)
- Global and local register allocation (GRA, LRA)
25 Flowchart of Code Generator
[Diagram: WHIRL -> WHIRL-to-TOP lowering -> CGIR (quad op list) -> EBO ->
Control Flow Opt I, EBO -> Hyperblock Formation and Critical-Path Reduction
(supported by the Predicate Query System, PQS) -> inner loop processing
(unrolling, EBO, loop prep, software pipelining) -> Control Flow Opt II, EBO
-> IGLS pre-pass -> GRA, LRA, EBO -> IGLS post-pass -> Control Flow Opt ->
Code Emission. EBO = extended basic block optimization: peephole, etc.]
26 WHIRL
- Abstract syntax tree based
- Symbol table links, map annotations
- Base representation is simple and efficient
- Used through several phases with lowering
- Designed for multiple target architectures
27 From WHIRL to CGIR: An Example
(a) Source: int a[]; int i; int aa; ... aa = a[i];
(b) WHIRL: [tree: ST aa <- LD a[CVTL32(i) * 4]]
(c) CGIR:
- T1 = sp + &a
- T2 = ld T1
- T3 = sp + &i
- T4 = ld T3
- T5 = sxt T4
- T6 = T5 << 2
- T7 = T6
- T8 = T2 + T7
- T9 = ld T8
- T10 = sp + &aa
- st T10, T9
28 Code Generation Intermediate Representation (CGIR)
- TOPs (Target Operations) are quads
- Operands/results are TNs
- Basic block nodes in control flow graph
- Load/store architecture
- Supports predication
- Flags on TOPs (copy ops, integer add, load, etc.)
- Flags on operands (TNs)
29 From WHIRL to CGIR (cont'd)
- Information passed:
  - alias information
  - loop information
  - symbol table and maps
30 The Target Information Table (TARG_INFO)
- Objective:
  - Parameterized description of a target machine and system architecture
  - Separates architecture details from the compiler's algorithms
  - Minimizes compiler changes when targeting a new architecture
31 The Target Information Table (TARG_INFO) (cont'd)
- Based on an extension of the Cydra tables, with major improvements
- Architectures already targeted:
  - Whole MIPS family
  - IA-64
  - IA-32
  - SGI graphics processors (earlier version)
32 Flowchart of Code Generator
33 Hyperblock Formation and Predicated Execution
- Hyperblock: a single-entry multiple-exit control-flow region
  - loop body, hammock region, etc.
- Hyperblock formation algorithm
  - Based on Scott Mahlke's method [Mahlke96]
  - But less aggressive tail duplication
34 Hyperblock Formation Algorithm
Phases: Region Identification -> Block Selection -> Tail Duplication -> If Conversion
- Candidate regions: hammock regions, innermost loops, general regions (path based)
- Paths sorted by priorities (frequency, size, length, etc.)
- Inclusion of a path is guided by its impact on resources, scheduling height, and priority level
- Internal branches are removed via predication
- Predicate reuse
- Objective: keep the scheduling height close to that of the highest priority path
35 Hyperblock Formation - An Example
(a) Source:
aa = a[i]; bb = b[i];
switch (aa) {
case 1:  if (aa < tabsiz) aa = tab[aa];
case 2:  if (bb < tabsiz) bb = tab[bb];
default: ans = aa + bb;
}
(b) CFG with blocks 1-8
(c) Hyperblock formation with aggressive tail duplication: hyperblocks H1 and
H2, with blocks 4-8 duplicated across the paths
36 Hyperblock Formation - An Example (cont'd)
(a) CFG with blocks 1-8
(b) Hyperblock formation with aggressive tail duplication (H1, H2, with
blocks duplicated across paths)
(c) Pro64 hyperblock formation (H1, H2, with far less duplication)
37 Features of the Pro64 Hyperblock Formation (HBF) Algorithm
- Forms good rather than maximal hyperblocks
- Avoids unnecessary duplication
- No reverse if-conversion
- Hyperblocks are not a barrier to global code motion later in IGLS
38 Predicate Query System (PQS)
- Purpose: gather information and provide interfaces allowing other phases to
  make queries regarding the relationships among predicate values
- PQS functions (examples):
  - BOOL PQSCG_is_disjoint(PQS_TN tn1, PQS_TN tn2)
  - BOOL PQSCG_is_subset(PQS_TN_SET tns1, PQS_TN_SET tns2)
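The two queries are easy to picture if each predicate is modeled as the set of condition outcomes under which it is true. This bitmask sketch is purely illustrative (the real PQS tracks predicate-defining ops symbolically), but it mirrors the two interface functions above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model a predicate as the set of if-converted paths on which it
   evaluates true, one bit per path. */
typedef uint32_t pred_set;

/* Disjoint: the predicates are never true together, so ops guarded
   by them can never both execute (e.g. p and !p from one compare). */
bool pqs_is_disjoint(pred_set a, pred_set b) { return (a & b) == 0; }

/* Subset: whenever a predicate in `a` is true, one in `b` is too,
   so code guarded by `b` dominates code guarded by `a`. */
bool pqs_is_subset(pred_set a, pred_set b) { return (a & ~b) == 0; }
```

Phases such as scheduling and register allocation use exactly these relations to decide whether two predicated ops can share a resource.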
39 Flowchart of Code Generator
40 Loop Preparation and Optimization for Software Pipelining
- Loop canonicalization for SWP
- Read/write removal (register aware)
- Loop unrolling (resource aware)
- Recurrence removal or extension
- Prefetch
- Forced if-conversion
41 Pro64 Software Pipelining Method: Overview
- Test for SWP-amenable loops
- Extensive loop preparation and optimization before application [DeTo93]
- Use a lifetime-sensitive SWP algorithm [Huff93]
- Register allocation after scheduling, based on Cydra 5 [RLTS92, DeTo93]
- Handles both while and do loops
- Smooth switching to normal scheduling if not successful
42 Pro64 Lifetime-Sensitive Modulo Scheduling for Software Pipelining
- Features:
  - Tries to place an op ASAP or ALAP to minimize register pressure
  - Slack scheduling
  - Limited backtracking
  - Operation-driven scheduling framework
[Diagram: compute Estart/Lstart for all unplaced ops; choose a good op and
place it into the current partial schedule within its Estart/Lstart range,
ejecting conflicting ops if needed; when all ops are placed, register
allocate; if allocation succeeds, done, otherwise repeat.]
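The Estart/Lstart bounds in the loop above are a longest-path computation over the dependence graph. This C sketch (hypothetical latencies, acyclic dependences only; the real scheduler also handles loop-carried edges modulo the initiation interval) shows the idea, with slack = Lstart - Estart.

```c
#include <assert.h>

#define MAX_OPS 16

/* edge[i][j] = latency of dependence i -> j, or -1 if none.
   Ops are assumed topologically ordered (i < j for every edge). */
void estart_lstart(int n, int edge[MAX_OPS][MAX_OPS],
                   int len, int estart[], int lstart[]) {
    for (int j = 0; j < n; j++) {
        /* earliest start: longest latency path from any source */
        estart[j] = 0;
        for (int i = 0; i < j; i++)
            if (edge[i][j] >= 0 && estart[i] + edge[i][j] > estart[j])
                estart[j] = estart[i] + edge[i][j];
    }
    for (int i = n - 1; i >= 0; i--) {
        /* latest start: schedule length minus longest path to a sink */
        lstart[i] = len;
        for (int j = i + 1; j < n; j++)
            if (edge[i][j] >= 0 && lstart[j] - edge[i][j] < lstart[i])
                lstart[i] = lstart[j] - edge[i][j];
    }
}
```

An op with zero slack is on the critical path and is the kind of op the scheduler wants to place first.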
43 Flowchart of Code Generator
44 Integrated Global and Local Scheduling (IGLS) Method
- The basic IGLS framework integrates global code motion (GCM) with local scheduling [MaJD98]
- IGLS extended to hyperblock scheduling
- Performs profitable code motion between hyperblock regions and normal regions
45 IGLS Phase Flow Diagram
- Hyperblock Scheduling (HBS)
- Global Code Motion (GCM): block priority selection, motion selection, target selection
- Local Code Scheduling (LCS)
46 Advantages of the Extended IGLS Method - The Example Revisited
- Advantages:
  - No rigid boundaries between hyperblocks and non-hyperblocks
  - GCM moves code into and out of a hyperblock according to profitability
[Diagram: (a) Pro64 hyperblocks H1 and H2 over the example CFG;
(b) profitable duplication producing H3]
47 Software Pipelining vs. Normal Scheduling
[Diagram: if an inner loop is an SWP-amenable candidate, inner loop
processing applies software pipelining; on success, proceed to code
emission. If the loop is not a candidate, or pipelining fails or is not
profitable, fall back to IGLS and GRA/LRA before code emission.]
48 Flowchart of Code Generator
49 Global and Local Register Allocation (GRA/LRA)
[Flow: from pre-pass IGLS -> GRA (with an LRA register request, LRA-RQ) ->
LRA -> to post-pass IGLS]
- LRA-RQ provides an estimate of local register requirements
- GRA allocates global variables using a priority-based register allocator
  [ChowHennessy90, Chow83, Briggs92]
- Incorporates IA-64 specific extensions, e.g. register stack usage
50 Local Register Allocation (LRA)
- Assign_Registers: uses a reverse linear scan
- Reordering: depth-first ordering on the DDG
[Flow: Assign_Registers; on failure, Fix_LRA - instruction reordering the
first time, then spill global / spill local registers.]
51 Future Research Topics for the Pro64 Code Generator
- Hyperblock formation
- Predicate query system
- Enhanced speculation support
52 PART III: Using Pro64 in Compiler Research and Development
53 Outline
- General Remarks
- Case Study I: Integration of a new instruction reordering algorithm to
  minimize register pressure [Govind, Yang, Amaral, Gao 2000]
- Case Study II: Design and evaluation of an induction pointer prefetching
  algorithm [Stoutchinin, Douillet, Amaral, Dehnert, Gao 2000]
54 Case I
- Introduction of the Minimum Register Instruction Sequence (MRIS) problem
  and a proposed solution
- Problem formulation
- The proposed algorithm
- Pro64 porting experience
  - Where to start
  - How to start
- Results
- Summary
55 Researchers
- R. Govindarajan (Indian Inst. of Science)
- Hongbo Yang (Univ. of Delaware)
- Chihong Zhang (Conexant)
- José Nelson Amaral (Univ. of Alberta)
- Guang R. Gao (Univ. of Delaware)
56 The Minimum Register Instruction Sequence Problem
Given a data dependence graph G, derive an instruction sequence S for G that
is optimal in the sense that its register requirement is minimum.
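The "register requirement" of a sequence can be measured directly: a value occupies a register from its definition to its last use, and the requirement is the maximum number of simultaneously live values. This C sketch (a hypothetical helper and example DDG, not the paper's algorithm) scores two legal orderings of the same graph.

```c
#include <assert.h>

#define MAX_OPS 16

typedef struct {
    int nuses;
    int use[2];   /* value ids (defining ops) this op reads */
} Op;

/* Register requirement of a schedule: each op defines value id == its
   index; a value is live from its definition position to its last use. */
int max_live(const Op ops[], int n, const int order[]) {
    int pos[MAX_OPS], last[MAX_OPS];
    for (int t = 0; t < n; t++) pos[order[t]] = t;
    for (int id = 0; id < n; id++) last[id] = pos[id];
    for (int id = 0; id < n; id++)
        for (int u = 0; u < ops[id].nuses; u++)
            if (pos[id] > last[ops[id].use[u]])
                last[ops[id].use[u]] = pos[id];
    int maxlive = 0;
    for (int t = 0; t < n; t++) {
        int live = 0;
        for (int id = 0; id < n; id++)
            if (pos[id] <= t && t <= last[id]) live++;
        if (live > maxlive) maxlive = live;
    }
    return maxlive;
}
```

Issuing all loads up front maximizes pressure; interleaving them with their consumers, as an MRIS-aware sequence does, lowers it without changing the dependences honored.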
57 A Motivating Example
(a) DDG; (b) Instruction Sequence 1; (c) Instruction Sequence 2
- Observation: register requirements drop 25% from (b) to (c)!
58 Motivation
- IA-64 style processors:
  - Reduce spills in the local register allocation phase
  - Reduce Local Register Allocation (LRA) requests in the Global Register
    Allocation (GRA) phase
  - Reduce overall register pressure on a per-procedure basis
- Out-of-order issue processors:
  - Instruction reordering buffer
  - Register renaming
59 How to Solve the MRIS Problem?
(a) Concepts:
- Register lineages
- Live ranges of lineages
- Lineage interference
(b) DDG
(c) Lineages: L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g)
62 How to Solve the MRIS Problem? (cont'd)
Lineages: L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g)
Questions:
- Can L1 and L2 share the same register?
- Can L2 and L3 share the same register?
- Can L1 and L4 share the same register?
- Can L2 and L4 share the same register?
63 Lineage Interference Graph
Lineages: L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g)
(a) Original DDG with nodes a-h; (b) Lineage Interference Graph (LIG)
Question: Is the lower bound of the required registers 3?
Challenge: Derive a Heuristic Register Bound (HRB)!
64 Our Solution Method
[Flow: DDG -> form Lineage Interference Graph (LIG) -> derive HRB ->
extended list scheduling guided by HRB -> a good instruction sequence]
- A good construction algorithm for the LIG
- An effective heuristic method to calculate the HRB
- An efficient scheduling method (does not backtrack)
65 Pro64 Porting Experience
- Porting plan and design
- Implementation
- Debugging and validation
- Evaluation
66 Implementation
- Dependence graph construction
- LIG formation
- LIG construction and coloring
- The reordering algorithm implementation
67 Porting Plan and Design
- Understand the compiler infrastructure
- Understand the register model (mainly from targ_info,
  e.g. ../common/targ_info/abi/ia64):
  - register classes (int, float, predicate, app, control)
  - register save/restore conventions: caller/callee save, return value,
    argument passing, stack pointer, etc.
68 Register Allocation
[Flow: GRA -> LRA at block level -> Assign_Registers; on failure,
Fix_LRA_Blues: reschedule, local code motion, spill global or local
registers.]
69 Implementation (cont'd)
- DDG construction: use native service routines, e.g. CG_DEP_Compute_Graph
- LIG coloring: use native support for the set package (e.g. bitset.c)
- Scheduler implementation: native vector package support (e.g. cg_vector.cxx)
- Access the dependence graph using native service functions: ARC_succs,
  ARC_preds, ARC_kind
70 Debugging and Validation
- Trace file flags:
  - -tt54 0x1: general trace of LRA
  - -tt45 0x4: dependence graph building
  - -tr53: Target Operations (TOPs) before LRA
  - -tr54: TOPs after LRA
71 Evaluation
- Static measurement
  - Fat point: -tt54 0x40
- Dynamic measurement
  - Hardware counters in the R12K, read with perfex
72 Evaluation (cont'd)
- For the MIPS R12K (SPEC95fp), the lineage-based algorithm reduces the
  number of loads executed by 12%, the number of stores by 14%, and the
  execution time by 2.5% over a baseline.
- It is slightly better than the algorithm in the MIPSpro compiler.
73 Case II
Design and Evaluation of an Induction Pointer Prefetching Algorithm
74 Researchers
- Artour Stoutchinin (STMicroelectronics)
- José Nelson Amaral (Univ. of Alberta)
- Guang R. Gao (Univ. of Delaware)
- Jim Dehnert (Silicon Graphics Inc.)
- Suneel Jain (Narus Inc.)
- Alban Douillet (Univ. of Delaware)
75 Motivation
The important loops of many programs are pointer-chasing loops that access
recursive data structures through induction pointers.
Example:
max = 0;
current = head;
while (current != NULL) {
    if (current->key > max)
        max = current->key;
    current = current->next;
}
76 Problem Statement
- How to identify pointer-chasing recurrences?
- How to decide whether there are enough processor resources and memory
  bandwidth to profitably prefetch an induction pointer?
- How to efficiently integrate induction pointer prefetching with loop
  scheduling, based on the profitability analysis?
77 Prefetching Costs
- More instructions to issue
- More memory traffic
- Longer code (disruption in the instruction cache)
- Displacement of potentially good data from the cache
Before prefetching:
    t226 = lw 0x34(t228)
After prefetching:
    t226 = lw 0x34(t228)
    tmp  = subu t226, t226s
    tmp  = addu tmp, tmp
    tmp  = addu t226, tmp
    pref 0x0(tmp)
    t226s = t226
78 What to Prefetch? When to Prefetch It?
A good optimizing compiler should only prefetch data that will actually be
referenced.
It should prefetch far enough in advance to prevent a cache miss when the
reference occurs.
But not too far in advance, because the data might be evicted from the cache
before it is used, or might displace data that will be referenced again.
79 Prefetch Address
In order to prefetch, the compiler must calculate
addresses that will be referenced in future
iterations of the loop.
For loops that access regular data structures,
such as vectors and matrices, compilers can use
static analysis of the array indexes to compute
the prefetching addresses.
How can we predict future values of induction
pointers?
80 Key Intuition
Recursive data structures are often allocated at regular intervals.
Example:
curr = head = (item *) malloc(sizeof(item));
while ((curr->key = get_key()) != NULL) {
    curr = curr->next = (item *) malloc(sizeof(item));
    other_memory_allocations();
}
curr->next = NULL;
81 Pre-Fetching Technique
Example:
max = 0;
current = head;
tmp = current;
while (current != NULL) {
    if (current->key > max)
        max = current->key;
    current = current->next;
    stride = current - tmp;
    prefetch(current + stride * k);
    tmp = current;
}
82 Prefetch Sequence (R10K)
In our implementation, the stride is recomputed in every iteration of the
loop, making it tolerant of (infrequent) stride changes.
stride    = addr - addr.prev
stride    = stride * k
addr.pref = addr + stride
addr.prev = addr
pref addr.pref
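The stride speculation above can be checked in plain C: when nodes happen to sit at regular intervals (here, in one array), the address predicted from the last observed stride matches the node k links ahead. This is a hypothetical test helper, not compiler output.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node { int key; struct node *next; } node;

/* Walk the list computing stride = addr - addr.prev each iteration
   and predicting addr + k*stride, exactly as the inserted prefetch
   sequence does.  Returns how many predictions hit the node that is
   actually k links ahead. */
int count_stride_hits(node *head, int k) {
    int hits = 0;
    node *prev = head;
    for (node *cur = head; cur != NULL; cur = cur->next) {
        node *ahead = cur;
        for (int i = 0; i < k && ahead != NULL; i++) ahead = ahead->next;
        if (ahead != NULL) {
            ptrdiff_t stride = (char *)cur - (char *)prev;
            if ((char *)cur + (ptrdiff_t)k * stride == (char *)ahead)
                hits++;
        }
        prev = cur;
    }
    return hits;
}
```

Only the first iteration mispredicts (its observed stride is zero); once the stride settles, every prediction lands on the correct future node, which is why the recomputed stride tolerates infrequent layout changes.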
83 Identification of Pointer-Chasing Recurrences
A surprisingly simple method works well: look in the intermediate code for
recurrence circuits containing only loads with constant offsets.
Examples:
node = ptr->next          =>  r1 <- load r2, offset_next
ptr = node->ptr           =>  r2 <- load r1, offset_ptr
current = current->next   =>  r2 <- load r1
                              r1 <- load r2, offset_next
84 Profitability Analysis
Goal: balance the gains and costs of prefetching.
Although we use resource estimates analogous to those done for software
pipelining, we consider loop bodies with control flow.
How do we estimate the resources available for prefetching in a basic block
B that belongs to many data dependence recurrences?
85 Software Pipelining
- What limits the speed of a loop?
  - Data dependences: recurrence initiation interval (recMII)
  - Processor resources: resource initiation interval (resMII)
  - Memory accesses: memory initiation interval (memMII)
[Diagram: a modulo schedule drawn on a time axis]
86 Data Dependences (recMII)
The recurrence minimum initiation interval (recMII) is given by the maximum,
over all dependence cycles, of the cycle's total latency divided by its
total dependence distance.
for i = 0 to N - 1 do
    a: X[i] = X[i-1] + R[i]
    b: Y[i] = X[i] + Z[i-1]
    c: Z[i] = Y[i] + 1
end
[DDG with edges labeled (dist, lat)]
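The bound this slide refers to is the standard recurrence constraint; with dist(e) the dependence distance and lat(e) the latency on each edge e of a dependence cycle c, it reads:

```latex
\mathrm{recMII} \;=\; \max_{c}\;
  \left\lceil
    \frac{\sum_{e \in c} \mathrm{lat}(e)}{\sum_{e \in c} \mathrm{dist}(e)}
  \right\rceil
```

Intuitively, the latencies around a cycle must be paid once every dist iterations, so no schedule can initiate iterations faster than the worst cycle allows.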
87 The recMII for Loops with Control Flow
An instruction of a basic block B can belong to many recurrences (with
distinct control paths). We define the recurrence MII of a load operation L
as the maximum recMII over all recurrences c with L in c (L in c means that
the operation L is part of the recurrence c).
[Control Flow Graph]
88 Processor Resources (resMII)
A basic block B may belong to multiple control paths. We define the resource
constraint of a basic block B as the maximum resMII over all control paths
that execute B.
[Control Flow Graph]
89 Available Memory Bandwidth
Processors with non-blocking caches can support up to k outstanding cache
misses without stalling. We define the available memory bandwidth S(B) of a
basic block B as the minimum of k - m(p) over all control paths that execute
B, where m(p) is the number of expected cache misses in each control path p.
[Control Flow Graph]
90 Profitability Analysis (cont'd)
Adding prefetch code for an induction pointer L in a basic block B is
profitable if both: (1) the MII due to recurrences that contain L is greater
than the resMII after prefetch insertion, and (2) there is enough memory
bandwidth to enable another cache miss without causing stalls.
91 Computing Available Memory Bandwidth
To compute the available memory bandwidth of a
control path we need to estimate how many cache
misses are expected in that control path.
We use a graph coloring technique over a cache
miss interference graph to predict which memory
references are likely to incur a miss.
92 The Miss Interference Graph
Two memory references interfere if:
1. They are both expected to miss the cache
2. They can both be issued in the same iteration of the loop
3. They do not fall into the same cache line
Miss Interference Graph assumptions:
1. Loop-invariant references are cache hits (global-pointer relative,
   stack-pointer relative, etc.)
2. Memory references on mutually exclusive control paths do not interfere
3. References relative to the same base address interfere only if their
   relative offset is larger than the cache line
93 Prefetching Algorithm
DoPrefetch(P, V, E)
1.  C <- pointer-chasing recurrences
2.  R <- prioritized list of induction pointer loads in C
3.  N <- prioritized list of other loads (not in C)
4.  O <- R ∪ N
5.  mark each L in O as a cache miss
6.  for each L in O, L in B
7.      do if recMII_P(B) > resMII_P(B) and S(B) > 0
8.          then add prefetch for L to B
9.               mark L as a cache hit
10.     endif
11. endfor
94 An Example
mcf: minimal cost flow optimizer (Konrad-Zuse Informatics Center, Berlin)
1   while (arcin) {
2       tail = arcin->tail;
3       if (tail->time + arcin->org_cost > latest) {
4           arcin = (arc_t *) tail->mark;
5           continue;
        }
6       arc_cost = tail->potential + head_potential;
7       if (red_cost < 0) {
8           if (new_arcs < MAX_NEW_ARCS) {
9               insert_new_arc(arcnew, new_arcs, tail, head,
                               arc_cost, red_cost);
10              new_arcs++;
11          } else if ((cost_t) arcnew[0].flow > red_cost)
12              replace_weaker_arc(arcnew, tail, head,
                                   arc_cost, red_cost);
        }
13      arcin = (arc_t *) tail->mark;
    }
95 An Example (cont'd)
96 An Example (cont'd)
B1:
1. t228 = lw 0x0(t226)
2. t229 = lw 0x14(t226)
3. t230 = lw 0x38(t228)
4. t231 = addu t229, t230
5. t232 = slt t220, 0
6. bne B3, t232, 0
B2:
7. t226 = lw 0x34(t228)
8. b B8
B3:
9.  t234 = lw 0x2c(t228)
10. t235 = subu t225, t234
11. t233 = addiu t235, 0x1e
12. bgez B7, t233
B4: insert_new_arc()
B5: replace_weaker_arc()
B6:
13. t236 = slt t209, t175
14. beq B6, t236, 0
B7:
15. t226 = lw 0x34(t228)
B8:
15. bne B1, t226, 0
99 An Example (cont'd)
B1:
1. t228 = lw 0x0(t226)
2. t229 = lw 0x14(t226)
3. t230 = lw 0x38(t228)
4. t231 = addu t229, t230
5. t232 = slt t220, 0
6. bne B3, t232, 0
B2:
7. t226 = lw 0x34(t228)
8. b B10
B7:
15. t226 = lw 0x34(t228)
(blocks B3, B4, B5, B6, B8 as before)
100 An Example (cont'd)
B1:
1. t228 = lw 0x0(t226)
1. tmp  = subu t228, t228s
1. tmp  = addu tmp, tmp
1. tmp  = addu t228, tmp
1. pref 0x34(tmp)
1. t228s = t228
2. t229 = lw 0x14(t226)
3. t230 = lw 0x38(t228)
4. t231 = addu t229, t230
5. t232 = slt t220, 0
6. bne B3, t232, 0
B2:
7. t226 = lw 0x34(t228)
7. tmp  = subu t226, t226s
7. tmp  = addu tmp, tmp
7. tmp  = addu t226, tmp
7. pref 0x0(tmp)
7. t226s = t226
8. b B10
B7:
15. t226 = lw 0x34(t228)
15. tmp  = subu t226, t226s
15. tmp  = addu tmp, tmp
15. tmp  = addu t226, tmp
15. pref 0x0(tmp)
15. t226s = t226
B8:
(blocks B3, B4, B5, B6 as before)
101 When Pointer Prefetch Works
102 When Pointer Prefetch Does Not Help
103 Summary of Attributes
- Software-only implementation
- Simple candidate identification
- Simple code transformation
- No impact on user data structures
- Simple profitability analysis, local to the loop
- Performance degradations are rare and minor
104 Open Questions
- How often is the speculated stride correct?
- Can instrumentation feedback help?
- How well does the speculative prefetch work with other recursive data
  structures: trees, graphs, etc.?
- How well does this approach work for read/write recursive data structures?
105 Related Work (Software)
- Luk-Mowry (ASPLOS-96)
  - Greedy prefetching, history-pointer prefetching, data-linearization
    prefetching
  - Change the data structure storage
- Lipasti et al. (Micro-95)
  - Prefetching pointers at procedure call sites
- Liu-Dimitri-Kaeli (Journal of Syst. Arch.-99)
  - Maintains a table of offsets for prefetching
106 Related Work (Hardware)
- Roth-Moshovos-Sohi (ASPLOS, 1998)
- Gonzalez-Gonzalez (ICS, 1997)
- Mehrotra (Urbana-Champaign, 1996)
- Chen-Baer (Trans. Computers, 1995)
- Charney-Reeves (Trans. Comp., 1994)
- Jegou-Temam (ICS, 1993)
- Fu-Patel (Micro, 1992)
107 Execution Time Measurements
108 Prefetch Improvement
109 L1 Cache Misses
110 L2 Cache Misses
111 TLB Misses
112 Benchmarks
gcc     GNU C compiler
li      Lisp interpreter
mcf     Minimal cost flow solver
parser  Syntactic parser of English
twolf   Place and route simulator
mlp     Multi-layer perceptron simulator
ft      Minimum spanning tree algorithm