Optimal Global Scheduling for Itanium Processor Family


Slides: 20

Provided by: sebastia66


Transcript and Presenter's Notes


1
Optimal Global Scheduling for Itanium Processor Family
Sebastian Winkel, Saarland University, Germany
2
Itanium Challenges

Well-known difficulties of greedy instruction scheduling heuristics:
  They cannot consider all interactions between decisions about code motion, predication and speculation.
  In particular: how aggressively should features be applied which require additional execution slots?
    Too conservative use: opportunities are wasted.
    Overeager use: resource shortage could spoil the benefit.
  Even local scheduling is NP-complete.
  Validation of the result is difficult.
Idea: apply integer programming to obtain globally optimal and provably correct solutions.

3
Optimal Global Schedules

Minimal global schedule length
  (defined as the sum of the local schedule lengths, weighted by execution frequency)
Maximal performance
  (instruction and data cache effects are not considered)
Intended applications
  As an optimization tool for performance-critical software components such as encryption routines, probably as a postpass solution.
  As a research tool to obtain insights into the potential of the architecture:
    How much parallelism can be statically extracted by using code motion, predication and speculation?
    How large is the room for improvement in scheduling heuristics?

4
Integer Linear Programming (ILP)

Minimize $c^T x$
subject to $Ax \le b$
with x integral: $x \in \mathbb{Z}^n$
The solution space consists of the integer points inside a polyhedron.
Computing an optimal solution is NP-complete due to the integrality constraint.
Dropping the integrality constraint gives the relaxed problem, which can be solved in polynomial time. Its solution value is a lower bound on the integral problem.
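The relation between the integral problem and its relaxation can be illustrated on a toy instance. The following is a minimal sketch, not taken from the presentation: the matrix `A`, the vectors `b` and `c`, and all function names are made up for illustration, and the relaxed problem is solved by checking the vertices of a small bounded two-dimensional polygon rather than by simplex.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical toy instance: minimize c^T x subject to A x <= b.
A = [(6, 4), (1, 2), (-1, 0), (0, -1)]   # the last two rows encode x >= 0, y >= 0
b = [24, 6, 0, 0]
c = (-5, -4)

def feasible(x, y):
    """True if the point (x, y) satisfies every row of A x <= b."""
    return all(a0 * x + a1 * y <= bi for (a0, a1), bi in zip(A, b))

def objective(x, y):
    return c[0] * x + c[1] * y

def relaxed_optimum():
    """Optimum without integrality: a bounded LP attains it at a vertex."""
    values = []
    for (r1, b1), (r2, b2) in combinations(list(zip(A, b)), 2):
        det = r1[0] * r2[1] - r1[1] * r2[0]
        if det == 0:
            continue  # parallel boundary lines do not intersect
        # Cramer's rule for the intersection of the two boundary lines.
        x = Fraction(b1 * r2[1] - r1[1] * b2, det)
        y = Fraction(r1[0] * b2 - b1 * r2[0], det)
        if feasible(x, y):
            values.append(objective(x, y))
    return min(values)

def integral_optimum(box=range(0, 7)):
    """Optimum over the integer points, by enumerating a small bounding box."""
    return min(objective(x, y)
               for x, y in product(box, repeat=2) if feasible(x, y))

print(relaxed_optimum(), integral_optimum())
```

In this instance the relaxed optimum lies at the fractional vertex (3, 3/2), so its value is strictly below the best value reachable at an integer point.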

5
Requirements for the ILP Model

An ILP model for scheduling should be
Correct: no incorrect schedule may be feasible.
Complete: at least one optimal schedule must be feasible.
Compact: as many non-optimal schedules as possible are excluded from the search space.
Simple: use as much abstraction and unification as possible, to have as few variables and constraints as possible.
Efficient: describe a polyhedron with many integral vertices; this requires some intuition and experience.

Needs to be proven
Crucial to solution efficiency
6
ILP Model: Basic Structure

Source block s(n) of an instruction n: the block where it originates from before scheduling.
Code motion moves it to destination blocks.
We collect the possible destination blocks in a set:
  all predecessors of s(n) which are postdominated by s(n)
  all successors of s(n) which are dominated by s(n)
For each instruction, each of its destination blocks and each time step therein, we define a binary variable (for non-speculative instructions).
[Formula not preserved in the transcript: definition of the binary x variables]
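The two rules for collecting destination blocks can be sketched on a small control-flow graph. This is a rough illustration under several assumptions that are not in the slides: the CFG is acyclic with one entry and one exit, "predecessors" and "successors" are read transitively, the source block itself is counted as a destination, and all names (`dominators`, `destination_blocks`, the block labels) are hypothetical.

```python
def reverse(succ):
    """Turn a successor map into a predecessor map."""
    pred = {v: [] for v in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    return pred

def dominators(succ, entry):
    """Iterative dominator sets: dom[v] holds every block that dominates v."""
    pred = reverse(succ)
    nodes = set(succ)
    dom = {v: set(nodes) for v in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for v in nodes - {entry}:
            incoming = [dom[p] for p in pred[v]]
            new = {v} | (set.intersection(*incoming) if incoming else set())
            if new != dom[v]:
                dom[v], changed = new, True
    return dom

def reachable(succ, start):
    """All blocks reachable from start via one or more edges."""
    seen, stack = set(), list(succ[start])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(succ[v])
    return seen

def destination_blocks(succ, entry, exit_, source):
    pred = reverse(succ)
    dom = dominators(succ, entry)
    postdom = dominators(pred, exit_)   # dominators of the reversed CFG
    up = {blk for blk in reachable(pred, source) if source in postdom[blk]}
    down = {blk for blk in reachable(succ, source) if source in dom[blk]}
    return {source} | up | down

# Diamond-shaped CFG: A branches to B and C, which join in D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(sorted(destination_blocks(cfg, "A", "D", "D")))
print(sorted(destination_blocks(cfg, "A", "D", "B")))
```

In the diamond, an instruction from the join block D may move into every block above it, while an instruction from B has no other destination because B neither postdominates A nor dominates D.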
7
Code Motion with Compensation Copies
Example for code motion
[Figure: example control-flow graph with add and store instructions, predicates p1 and p2, a predicated copy "(p2) store", and the add's source block marked]
Purpose of upward code motion: execute instructions as early as possible.
Primary purpose of downward code motion: free execution slots for upward code motion.
8
ILP Model: Variables
For each instruction n and each of its destination blocks A, we use a further variable, a. [Formula not preserved in the transcript]
We couple the x and a variables with the following constraints. [Formula not preserved]
A copy of instruction n is scheduled on all execution paths going through s(n) before block A.
[Figure: blocks A and B]
Global constraints
  Assignment: for each instruction n [formula not preserved], with a new single empty exit block [symbol not preserved].
  n is scheduled on all execution paths going through its source block.
9
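ILP Model: Constraints
Global constraints
  Precedence: let instruction n depend on m (and let A be a destination block of both n and m). [Formula not preserved in the transcript]
Local constraints
  Precedence: n must not be scheduled at the same or at an earlier time step than m. [Formula not preserved; its right-hand side is ≤ 1]
  Resource: limits the number of scheduled instructions per time step which use the same resource type. [Formula not preserved; its right-hand side is ≤ R_k]
[Figure: dependence edge from m to n, labelled 1]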
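The two local constraints can be read as conditions that any schedule of a single block has to satisfy. The following sketch checks them directly on a given schedule instead of encoding them as ILP inequalities; the data layout, the resource names "M" and "I" and the limits are illustrative assumptions, not the model from the presentation, and every dependence is assumed to have latency 1.

```python
from collections import Counter

def check_local_schedule(schedule, deps, resource_of, limits):
    """Return a list of violated local constraints (empty if none).

    schedule:    instruction -> time step within the block
    deps:        pairs (m, n) meaning n depends on m
    resource_of: instruction -> resource type k
    limits:      resource type k -> R_k, the units available per time step
    """
    problems = []
    # Precedence: n must be scheduled strictly later than m.
    for m, n in deps:
        if schedule[n] <= schedule[m]:
            problems.append(f"precedence: {n} must come after {m}")
    # Resource: at most R_k instructions of type k per time step.
    usage = Counter((t, resource_of[i]) for i, t in schedule.items())
    for (t, k), count in sorted(usage.items()):
        if count > limits[k]:
            problems.append(f"resource: {count} x {k} at step {t}, limit {limits[k]}")
    return problems

deps = [("ld", "add"), ("add", "st")]
resource_of = {"ld": "M", "st": "M", "add": "I", "cmp": "I"}
limits = {"M": 2, "I": 2}

bad = {"ld": 0, "cmp": 0, "add": 1, "st": 1}    # st shares a step with add
good = {"ld": 0, "cmp": 0, "add": 1, "st": 2}
print(check_local_schedule(bad, deps, resource_of, limits))
print(check_local_schedule(good, deps, resource_of, limits))
```

An ILP formulation states the same conditions as linear inequalities over the binary scheduling variables, so that the solver only ever considers schedules for which this check would return an empty list.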
10
Cyclic Scheduling Regions

So far: only acyclic scheduling regions,
but loops are frequent and performance-critical.
We allow code motion
  into loops (easy)
  out of loops (only upward)
Cyclic upward code motion out of loops: not only into predecessors of the loop header, but also along every backedge.

[Figure: cyclic code motion example around a loop header, with the loads ld r1 = r32 and ld r1 = r33 and several copies of add r2 = 8, r1]
11
Integrating Control Speculation

Two reasons why an instruction cannot be speculated:
  it could trigger a false exception (e.g. a load)
  it could overwrite a live value (e.g. a concurrent def)
Instructions for both possibilities (use speculation / do not use it) are included in the ILP.
The ILP solver decides optimally where to apply speculation.

[Figure: two blocks define r8 with ld r8 = r32 and ld r8 = r33, followed by add r9 = 1, r8. With control speculation, ld.s r8a = r32 and ld.s r8b = r33 are hoisted, and each original load has the alternative chk.s r8a, mov r8 = r8a or chk.s r8b, mov r8 = r8b.]
12
Objective Function

For each block A, we define a variable $T_A$ equal to the local schedule length.
The objective is to minimize the global schedule length $\sum_A f_A T_A$, with execution frequency estimates $f_A$.
Additional optimization goal: stall minimization
  We assume that every load is an L1 hit.
  If a load misses the L1 cache, the processor may stall at the first use of the loaded value (if it is not yet available).
  Observation: stall time is inversely proportional to the load-use distance in the schedule.
  Idea: perform a second run of the ILP solver which maximizes the load-use distances without increasing the schedule length.
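The frequency-weighted objective and the idea of a second, tie-breaking run can be mimicked on a handful of candidate schedules. This is only a sketch with invented numbers: the candidate list stands in for the solver's search space, and summing the load-use distances is one possible reading of "maximizes the load-use distances", not necessarily the presentation's exact formulation.

```python
def global_schedule_length(local_lengths, frequencies):
    """Sum of the local schedule lengths T_A, weighted by the frequency f_A."""
    return sum(frequencies[blk] * local_lengths[blk] for blk in local_lengths)

def pick_schedule(candidates, frequencies):
    """Stage 1: minimal weighted length. Stage 2: among those, maximal load-use distance."""
    best = min(global_schedule_length(c["T"], frequencies) for c in candidates)
    shortest = [c for c in candidates
                if global_schedule_length(c["T"], frequencies) == best]
    return max(shortest, key=lambda c: sum(c["load_use"]))

# Hypothetical data: block A is executed ten times as often as block B.
frequencies = {"A": 10, "B": 1}
candidates = [
    {"name": "s1", "T": {"A": 3, "B": 5}, "load_use": [1, 1]},
    {"name": "s2", "T": {"A": 3, "B": 5}, "load_use": [2, 3]},
    {"name": "s3", "T": {"A": 4, "B": 2}, "load_use": [5, 5]},
]
print(pick_schedule(candidates, frequencies)["name"])
```

Schedule s3 has the largest load-use distances but is rejected in the first stage because shortening the rarely executed block B does not pay for lengthening the hot block A.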
13
Experimental Setup
We created a postpass tool to compare our approach directly with Intel's compiler for Itanium.
[Figure: tool flow. Intel compiler 6.0.1, assembly file, reconstruct and minimize data and control dependences (DDG, CFG), generate ILP, CPLEX 8.0 solver, interpret and display solution]

False data dependences are removed via register renaming.
Several optimizations are performed, e.g.
  redundant DDG edges are removed
  for each instruction, a global ASAP-ALAP range is computed from the DDG ⇒ impossible destination blocks/time steps are discarded
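The ASAP-ALAP pruning can be sketched for a single acyclic dependence graph. The version below is a simplified local variant (the presentation computes a global range across blocks); the instruction names, latencies and the horizon are made up, and the node list is assumed to be topologically sorted.

```python
def asap_alap(nodes, edges, horizon):
    """Earliest and latest feasible time step per instruction.

    nodes:   instructions in topological order of the DDG
    edges:   triples (m, n, latency) meaning n depends on m
    horizon: last available time step
    """
    asap = {v: 0 for v in nodes}
    for v in nodes:                       # forward pass over the DDG
        for m, n, lat in edges:
            if m == v:
                asap[n] = max(asap[n], asap[m] + lat)
    alap = {v: horizon for v in nodes}
    for v in reversed(nodes):             # backward pass over the DDG
        for m, n, lat in edges:
            if n == v:
                alap[m] = min(alap[m], alap[n] - lat)
    return asap, alap

nodes = ["ld", "cmp", "add", "st"]
edges = [("ld", "add", 2), ("add", "st", 1)]
asap, alap = asap_alap(nodes, edges, horizon=4)
for v in nodes:
    print(v, asap[v], alap[v])
```

Only time steps inside the range from asap to alap need a scheduling variable; here the store can only sit in steps 3 or 4, so variables for steps 0 to 2 would be discarded before the ILP is generated.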

14
ILP Generation and Solving

Input: currently four selected hot routines from SPEC CINT2000
  no software pipelining, floating-point instructions or calls handled yet
Results, shown for longest_match from 164.gzip as an example:
  154 instructions, 20 basic blocks, 2 nested loops
  ILP size: 2878 constraints, 1794 variables
The ILP solver uses
  simplex to solve the relaxed problem
  branch-and-bound to obtain an integral solution

[Figure: branch-and-bound tree. Root: relaxed problem, X1 is fractional. Below it: an infeasible node; a node where X2 is fractional; an integral solution found with objective value 28; a node with lower bound 28.35 ⇒ cutoff]
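The interplay of relaxation, branching and cutoff can be sketched on a much smaller problem than a scheduling ILP. The example below is an illustration only: it uses a hypothetical min-cost covering problem (choose items of minimal total cost whose weights reach a demand) because its relaxed problem can be solved greedily, which avoids the need for a simplex implementation. It is not the solver configuration used in the presentation.

```python
import math
from fractions import Fraction

def relaxed_bound(items, demand):
    """Relaxed problem: cheapest cost per weight first, last item fractional."""
    bound = Fraction(0)
    for cost, weight in items:
        if demand <= 0:
            break
        take = min(Fraction(1), Fraction(demand, weight))
        bound += take * cost
        demand -= take * weight
    return bound if demand <= 0 else math.inf   # inf: relaxation infeasible

def branch_and_bound(costs, weights, demand):
    """Minimize sum(cost_i * x_i) s.t. sum(weight_i * x_i) >= demand, x binary."""
    order = sorted(range(len(costs)), key=lambda i: Fraction(costs[i], weights[i]))
    items = [(costs[i], weights[i]) for i in order]
    best = math.inf

    def visit(k, cost_so_far, remaining):
        nonlocal best
        if remaining <= 0:                 # integral solution: rest stays 0
            best = min(best, cost_so_far)
            return
        lower = cost_so_far + relaxed_bound(items[k:], remaining)
        if lower >= best:                  # cutoff (also covers infeasible nodes)
            return
        cost, weight = items[k]
        visit(k + 1, cost_so_far + cost, remaining - weight)   # branch x_k = 1
        visit(k + 1, cost_so_far, remaining)                   # branch x_k = 0

    visit(0, 0, demand)
    return best

print(branch_and_bound([4, 3, 5, 6], [3, 2, 4, 5], demand=7))
```

As in the tree on the slide, a node is abandoned as soon as the lower bound from its relaxation is no better than the best integral solution found so far.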
15
Solution Process
ILOG CPLEX 7.100, licensed to "mpi-saarbruecken", options: e m b
Tried aggregator 3 times.
MIP Presolve eliminated 2523 rows and 1152 columns.
MIP Presolve modified 1802 coefficients.
Aggregator did 90 substitutions.
Reduced MIP has 2208 rows, 1541 columns, and 25900 nonzeros.

   Nodes                                      Cuts/
 Node  Left   Objective  IInf  Best Integer   Best Node      ItCnt     Gap   Variable B
    0     0     26.5000    89                   26.5000        516
                27.0000    66               Fractcuts: 89      656
                27.0387    55                  Cuts: 67        773
                27.2122    56                   Cuts: 5        778
   15     9     32.3000     0       32.3000     27.2122       1540  15.75%   X_104__5_9 U
   24    12     32.0000     0       32.0000     27.2500       1743  14.84%   X_117__5_15 D
   26     5     29.0000     0       29.0000     27.2500       1745   6.03%   X_117__5_8 U
   32     2     28.0000     0       28.0000     27.2500       1921   2.68%   X_85__5_7 U

Implied bound cuts applied: 4
Mixed integer rounding cuts applied: 3
Gomory fractional cuts applied: 32
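The Gap column of the log can be recomputed from the Best Integer and Best Node columns. The sketch below uses CPLEX's documented definition of the relative MIP gap; the function name `relative_gap` is a hypothetical helper, and the four value pairs are copied from the log rows in which an integral solution was found.

```python
def relative_gap(best_integer, best_node):
    """Relative MIP gap: |best integer - best node| / (1e-10 + |best integer|)."""
    return abs(best_integer - best_node) / (1e-10 + abs(best_integer))

# (Best Integer, Best Node) pairs from the four incumbent rows of the log.
rows = [(32.3, 27.2122), (32.0, 27.25), (29.0, 27.25), (28.0, 27.25)]
gaps = [round(100 * relative_gap(bi, bn), 2) for bi, bn in rows]
print(gaps)  # [15.75, 14.84, 6.03, 2.68]
```

The gap shrinks from both sides: each new incumbent lowers the best integer value, while cuts and branching raise the best node value, which is the lower bound on the optimum.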
16
Example
[Figure: input schedule and output schedule side by side, annotated with execution frequencies 310, 250, 200, 160, 130, 100, 82, 63]

Inner loop of longest_match
8 basic blocks, 7 early exits
Already highly optimized by Intel's compiler
Optimal schedule is 26% shorter
17
Example
[Figure: input schedule and output schedule side by side]

get_bb_from_scratch
Intel's compiler here does not
  use cyclic code motion
  speculate concurrent defs
Optimal schedule is 35% shorter

18
Conclusion and Outlook

We have presented a formal description of global scheduling
  with upward / downward / cyclic code motion
  automated generation of compensation code
  integrated control speculation
For small routines, optimal schedules can be computed within seconds.
Planned extensions
  partially-ready code motion
  long latency instructions
  stall minimization
Early experiments are very promising. Benchmark runs on a real Itanium 2 machine would be very interesting.

Thank you! Teşekkürler!
19
The Path-based View

Global scheduling: a transformation between schedules which rearranges instructions and preserves the control-flow structure.
A transformation from schedule $\sigma_1$ to schedule $\sigma_2$ preserves correctness if, for each instruction and for each execution path, the following holds:
  The instruction occurs along this path in $\sigma_1$ ⇔ it occurs along this path in $\sigma_2$ (assignment criterion).
  All data dependences on other instructions along this path are preserved (precedence criterion).
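The two criteria can be checked by brute force on a small acyclic control-flow graph. This sketch makes simplifying assumptions that are not in the presentation: a schedule is a map from block to an ordered instruction list (one instruction per time step), the assignment criterion is read as an equivalence, and all function and instruction names are hypothetical.

```python
def paths(succ, block, exit_):
    """Enumerate all execution paths from block to exit_ in an acyclic CFG."""
    if block == exit_:
        yield [block]
        return
    for nxt in succ[block]:
        for rest in paths(succ, nxt, exit_):
            yield [block] + rest

def along(schedule, path):
    """Instructions in the order they execute along one path."""
    return [ins for block in path for ins in schedule[block]]

def preserves_correctness(succ, entry, exit_, before, after, deps):
    for path in paths(succ, entry, exit_):
        old, new = along(before, path), along(after, path)
        # Assignment criterion: the same instructions occur along the path.
        if set(old) != set(new):
            return False
        # Precedence criterion: m still executes before n for each dependence.
        position = {ins: i for i, ins in enumerate(new)}
        for m, n in deps:
            if m in position and n in position and position[m] >= position[n]:
                return False
    return True

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
before = {"A": ["cmp"], "B": ["st"], "C": [], "D": ["add"]}
# Upward code motion of add out of D needs a copy in both B and C.
with_copies = {"A": ["cmp"], "B": ["add", "st"], "C": ["add"], "D": []}
missing_copy = {"A": ["cmp"], "B": ["add", "st"], "C": [], "D": []}
deps = [("cmp", "add")]
print(preserves_correctness(cfg, "A", "D", before, with_copies, deps))
print(preserves_correctness(cfg, "A", "D", before, missing_copy, deps))
```

The second transformation fails because the path through C no longer contains the add, which is the situation compensation copies are meant to prevent.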
