Title: Optimal Global Scheduling for Itanium Processor Family
1 Optimal Global Scheduling for Itanium Processor Family
Sebastian Winkel, Saarland University, Germany
2 Itanium Challenges
- Well-known difficulties of greedy instruction scheduling heuristics
  - they cannot consider all interactions between decisions about code motion, predication, speculation
  - especially: how aggressively should features be applied which require additional execution slots?
    - too conservative use: opportunities are wasted
    - overeager use: resource shortage could spoil the benefit
  - even local scheduling is NP-complete
  - validation of the result is difficult
- Idea: apply integer programming to obtain globally optimal and provably correct solutions
3 Optimal Global Schedules
- Optimal here means minimal global schedule length
  - (defined as the sum of the local schedule lengths, weighted by execution frequency)
- It does not necessarily mean maximal performance
  - (instruction and data cache effects are not considered)
- Intended applications
  - As an optimization tool for performance-critical software components like encryption routines
    - probably as a postpass solution
  - As a research tool to obtain insights into the potential of the architecture
    - How much parallelism can be statically extracted by using code motion, predication and speculation?
    - How large is the room for improvement in scheduling heuristics?
4 Integer Linear Programming (ILP)
- Minimize
  - $c^T x$
- Subject to
  - $Ax \le b$
- With x integral
  - $x \in \mathbb{Z}^n$
- Solution space consists of integer points inside a polyhedron
- Computing an optimal solution is NP-hard due to the integrality constraint
- Dropping the integrality constraint gives the relaxed problem, which can be solved in polynomial time; its solution value is a lower bound on the integral problem
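The relation between the integral problem and its relaxation can be illustrated with a tiny example. The sketch below is not taken from the talk: it assumes the form "minimize c·x subject to Ax ≤ b", uses a made-up two-variable instance, and solves the relaxation by enumerating vertices (only sensible for toy sizes; a real solver would use simplex). The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

def solve_relaxation_2d(c, A, b):
    """Minimize c.x subject to A x <= b for two real variables.

    A bounded LP attains its optimum at a vertex, so we intersect every
    pair of constraint lines and keep the best feasible intersection.
    """
    best = None
    rows = list(zip(A, b))
    for (a1, b1), (a2, b2) in combinations(rows, 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if det == 0:
            continue  # parallel lines, no unique intersection
        # Cramer's rule for the intersection point
        x = Fraction(b1 * a2[1] - a1[1] * b2, det)
        y = Fraction(a1[0] * b2 - b1 * a2[0], det)
        if all(r[0] * x + r[1] * y <= rhs for r, rhs in rows):
            value = c[0] * x + c[1] * y
            if best is None or value < best[0]:
                best = (value, (x, y))
    return best

def solve_integral_2d(c, A, b, box):
    """Same problem with x integral, by brute force over range(box) per variable."""
    best = None
    for x, y in product(range(box), repeat=2):
        if all(r[0] * x + r[1] * y <= rhs for r, rhs in zip(A, b)):
            value = c[0] * x + c[1] * y
            if best is None or value < best[0]:
                best = (value, (x, y))
    return best

# Hypothetical instance: minimize -x - y with 2x + 2y <= 5, x >= 0, y >= 0
c = [-1, -1]
A = [[2, 2], [-1, 0], [0, -1]]
b = [5, 0, 0]
relaxed = solve_relaxation_2d(c, A, b)
integral = solve_integral_2d(c, A, b, box=4)
print(relaxed[0], integral[0])  # -5/2 -2
```

The relaxed optimum (-5/2) lies below the integral optimum (-2), which is the lower-bound property stated on the slide.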
5 Requirements for the ILP Model
- An ILP model for scheduling should be
  - Correct: no incorrect schedule may be feasible
  - Complete: at least one optimal schedule must be feasible
  - Compact: as many non-optimal schedules as possible are excluded from the search space
  - Simple
    - use as much abstraction and unification as possible
    - to have as few variables and constraints as possible
  - Efficient
    - describe a polyhedron with many integral vertices
    - requires some intuition and experience
[Callouts on the slide: "Needs to be proven" and "Crucial to solution efficiency"]
6 ILP Model: Basic Structure
- Source block s(n) of an instruction n: the block where it originates from before scheduling
- Code motion moves it to destination blocks
- We collect the possible destination blocks in a set; for non-speculative instructions these are
  - all predecessors of s(n) which are postdominated by s(n)
  - all successors of s(n) which are dominated by s(n)
- For each instruction, each of its destination blocks and each time step therein, we define a binary variable x
7 Code Motion with Compensation Copies
- Purpose of upward code motion: execute instructions as early as possible
- Primary purpose of downward code motion: free execution slots for upward code motion
[Figure: example for code motion, showing copies of an add moved relative to the add's source block and a store moved downward, where one copy becomes a predicated "(p2) store"]
8 ILP Model: Variables
- For each instruction n and each of its destination blocks A, we use an additional variable (the a variable) with the meaning: a copy of instruction n is scheduled on all execution paths going through s(n) before block A
- We couple the x and a variables with constraints [equations not preserved]
- Global constraints
  - Assignment: for each instruction n, with a new single empty exit block, we require that n is scheduled on all execution paths going through its source block
[Figure: blocks A and B]
9 ILP Model: Constraints
- Global constraints
  - Precedence: let instruction n depend on m (and let A be a destination block of both n and m)
- Local constraints
  - Precedence: n must not be scheduled at the same or at an earlier time step than m
  - Resource: limits the number of scheduled instructions per time step which use the same resource type k to at most R_k
[Figure: dependence from m to n]
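The two local constraints can be read as a feasibility check on a single block's schedule. The following is a minimal sketch of that reading, not the ILP formulation itself; the data layout, the function name and the resource limits in the example are assumptions made for illustration.

```python
from collections import Counter

def check_local_schedule(time_of, deps, resource_of, capacity):
    """Check one block's schedule against the local constraints.

    time_of:     instruction -> time step inside the block
    deps:        list of (m, n) pairs, meaning n depends on m
    resource_of: instruction -> resource type k
    capacity:    resource type k -> limit R_k per time step
    Returns a list of violations (empty if the schedule is feasible).
    """
    violations = []
    # Precedence: n must be strictly later than m.
    for m, n in deps:
        if m in time_of and n in time_of and time_of[n] <= time_of[m]:
            violations.append(("precedence", m, n))
    # Resource: count instructions per (time step, resource type).
    usage = Counter((t, resource_of[i]) for i, t in time_of.items())
    for (t, k), count in sorted(usage.items()):
        if count > capacity[k]:
            violations.append(("resource", t, k))
    return violations

# Hypothetical resource types and limits, chosen only for the example.
resource_of = {"ld": "M", "ld2": "M", "st": "M", "add": "I"}
capacity = {"M": 2, "I": 2}
deps = [("ld", "add")]

ok = {"ld": 0, "ld2": 0, "add": 1, "st": 1}
bad = {"ld": 0, "ld2": 0, "st": 0, "add": 0}
print(check_local_schedule(ok, deps, resource_of, capacity))   # []
print(check_local_schedule(bad, deps, resource_of, capacity))
```

In the ILP the same conditions appear as linear inequalities over the binary scheduling variables, so the solver never produces a schedule that this check would reject.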
10 Cyclic Scheduling Regions
- So far: only acyclic scheduling regions
  - but loops are frequent
  - and performance-critical
- We allow code motion
  - into loops (easy)
  - out of loops (only upward)
- Cyclic upward code motion out of loops
  - not only into predecessors of the loop header, but also along every backedge
[Figure: loop example around the loop header, with loads ld r1 = [r32] and ld r1 = [r33] and several copies of add r2 = 8, r1]
11 Integrating Control Speculation
- Two reasons why an instruction cannot be speculated
  - it could trigger a false exception (e.g. a load)
  - it could overwrite a live value (e.g. a concurrent def)
- Instructions for both possibilities (use / do not use speculation) are included in the ILP
- The ILP solver decides optimally where to apply speculation
[Figure: original loads ld r8 = [r32] and ld r8 = [r33] with the use add r9 = 1, r8; speculative variant with ld.s r8a = [r32] and ld.s r8b = [r33] moved upward, and chk.s r8a, mov r8 = r8a and chk.s r8b, mov r8 = r8b at the original positions]
12 Objective Function
- For each block A, we define a variable $T_A$ equal to the local schedule length
- Objective: minimize the global schedule length $\sum_A f_A T_A$ with execution frequency estimates $f_A$
- Additional optimization goal: stall minimization
  - We assume that every load is an L1 hit
  - If a load misses the L1 cache, the processor may stall at the first use of the loaded value (if not yet available)
  - Observation: stall time is inversely proportional to the load-use distance in the schedule
  - Idea: perform a second run of the ILP solver which maximizes the load-use distances without increasing schedule length
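As a small worked example of the objective, the sketch below evaluates the frequency-weighted sum of local schedule lengths for two schedules of the same region. Block names, lengths and frequencies are made up for illustration and do not come from the experiments in the talk.

```python
def global_schedule_length(local_length, frequency):
    """Sum of local schedule lengths T_A, weighted by execution frequency f_A."""
    return sum(frequency[block] * local_length[block] for block in local_length)

# Hypothetical region with three blocks.
frequency = {"A": 100, "B": 80, "C": 20}
before = {"A": 3, "B": 4, "C": 2}   # local schedule lengths in cycles
after = {"A": 3, "B": 3, "C": 2}    # one cycle saved in the hot block B

g_before = global_schedule_length(before, frequency)
g_after = global_schedule_length(after, frequency)
print(g_before, g_after)  # 660 580
```

Saving a cycle in a frequently executed block reduces the objective far more than saving one in a cold block, which is why the weights matter when the solver trades code motion between blocks.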
13 Experimental Setup
We created a postpass tool to compare our approach directly with Intel's compiler for Itanium
[Figure: tool flow. Intel compiler 6.0.1 -> assembly file -> reconstruct and minimize data and control dependences (DDG, CFG) -> generate ILP -> CPLEX 8.0 solver -> interpret and display solution]
- False data dependences are removed via register renaming
- Several optimizations are performed, e.g.
  - redundant DDG edges are removed
  - for each instruction, a global ASAP-ALAP range is computed from the DDG, so that impossible destination blocks/time steps are discarded
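The ASAP-ALAP idea can be sketched for a single block as follows. This is a simplified, local version written for illustration (the tool computes a global range across blocks, which is not reproduced here); the edge latencies, the horizon and the function name are assumptions.

```python
def asap_alap(instrs, edges, horizon):
    """Earliest and latest feasible time step per instruction.

    instrs:  list of instruction names
    edges:   list of (m, n, latency): n may start `latency` steps after m
    horizon: number of time steps available (steps 0 .. horizon - 1)
    """
    succs = {i: [] for i in instrs}
    preds = {i: [] for i in instrs}
    for m, n, lat in edges:
        succs[m].append((n, lat))
        preds[n].append((m, lat))

    # Topological order of the dependence graph (Kahn's algorithm).
    indegree = {i: len(preds[i]) for i in instrs}
    ready = [i for i in instrs if indegree[i] == 0]
    order = []
    while ready:
        i = ready.pop()
        order.append(i)
        for n, _ in succs[i]:
            indegree[n] -= 1
            if indegree[n] == 0:
                ready.append(n)

    asap, alap = {}, {}
    for i in order:  # longest path from the sources
        asap[i] = max((asap[m] + lat for m, lat in preds[i]), default=0)
    for i in reversed(order):  # longest path to the sinks, counted from the end
        alap[i] = min((alap[n] - lat for n, lat in succs[i]), default=horizon - 1)
    return {i: (asap[i], alap[i]) for i in instrs}

# Hypothetical dependence chain ld -> add -> st plus an independent mov.
ranges = asap_alap(["ld", "add", "st", "mov"],
                   [("ld", "add", 2), ("add", "st", 1)], horizon=5)
print(ranges)
```

Every time step outside an instruction's range can be dropped before the ILP is generated, which removes the corresponding binary variables from the model.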
14 ILP Generation and Solving
- Input: currently four selected hot routines from SPEC CINT2000
  - no software pipelining, floating-point instructions or calls handled yet
- Results shown by example for longest_match from 164.gzip
  - 154 instructions, 20 basic blocks, 2 nested loops
  - ILP size: 2878 constraints, 1794 variables
- The ILP solver uses
  - simplex to solve the relaxed problem
  - branch-and-bound to obtain an integral solution
[Figure: branch-and-bound tree. Root: relaxed problem, X1 is fractional. Child nodes: infeasible; X2 is fractional; integral solution found with objective value 28; lower bound 28.35, therefore cutoff]
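The branch-and-bound principle from the figure can be sketched on a much simpler problem. The example below deliberately swaps in a 0/1 knapsack (a maximization problem) instead of the scheduling ILP, and a greedy fractional relaxation instead of simplex, so bounds are upper bounds rather than lower bounds. It only illustrates how a relaxation bound cuts off subtrees; it is not how CPLEX is implemented, and all names and numbers are illustrative.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Maximize total value of chosen items without exceeding capacity."""
    # Sort by value density so the fractional relaxation is a greedy fill.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    best = 0

    def relaxed_bound(k, value, room):
        """Optimum of the relaxed subproblem: items k.. may be taken fractionally."""
        for i in range(k, len(v)):
            if w[i] <= room:
                room -= w[i]
                value += v[i]
            else:
                return value + v[i] * room / w[i]
        return value

    def branch(k, value, room):
        nonlocal best
        if value > best:
            best = value  # every node is a feasible integral solution
        if k == len(v):
            return
        if relaxed_bound(k, value, room) <= best:
            return  # cutoff: this subtree cannot beat the incumbent
        if w[k] <= room:
            branch(k + 1, value + v[k], room - w[k])  # branch: take item k
        branch(k + 1, value, room)                    # branch: skip item k

    branch(0, 0, capacity)
    return best

print(knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50))  # 220
```

In the minimization setting of the talk the roles are mirrored: a node whose relaxed lower bound (28.35 in the figure) is not better than the best integral solution found so far (28) is cut off.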
15 Solution Process
ILOG CPLEX 7.100, licensed to "mpi-saarbruecken", options: e m b
Tried aggregator 3 times.
MIP Presolve eliminated 2523 rows and 1152 columns.
MIP Presolve modified 1802 coefficients.
Aggregator did 90 substitutions.
Reduced MIP has 2208 rows, 1541 columns, and 25900 nonzeros.

   Nodes                                     Cuts/
 Node  Left   Objective  IInf  Best Integer  Best Node     ItCnt    Gap   Variable B
    0     0     26.5000    89                  26.5000       516
                27.0000    66                Fractcuts: 89   656
                27.0387    55                Cuts: 67        773
                27.2122    56                Cuts: 5         778
   15     9     32.3000     0       32.3000    27.2122      1540  15.75%  X_104__5_9 U
   24    12     32.0000     0       32.0000    27.2500      1743  14.84%  X_117__5_15 D
   26     5     29.0000     0       29.0000    27.2500      1745   6.03%  X_117__5_8 U
   32     2     28.0000     0       28.0000    27.2500      1921   2.68%  X_85__5_7 U

Implied bound cuts applied: 4
Mixed integer rounding cuts applied: 3
Gomory fractional cuts applied: 32
16 Example
- Inner loop of longest_match
- 8 basic blocks, 7 early exits
- Already highly optimized by Intel's compiler
- Optimal schedule is 26% shorter
[Figure: input schedule and output schedule, annotated with execution frequencies 310, 250, 200, 160, 130, 100, 82, 63]
17 Example
- get_bb_from_scratch
- Intel's compiler here does not
  - use cyclic code motion
  - speculate concurrent defs
- Optimal schedule is 35% shorter
[Figure: input schedule and output schedule]
18 Conclusion and Outlook
- We have presented a formal description of global scheduling
  - with upward / downward / cyclic code motion
  - automated generation of compensation code
  - integrated control speculation
- For small routines, optimal schedules can be computed within seconds
- Planned extensions
  - partially-ready code motion
  - long latency instructions
  - stall minimization
- Early experiments are very promising
- Benchmark runs on a real Itanium 2 machine would be very interesting
Thank you! Tesekkürler!
19 The Path-based View
- Global scheduling: a transformation between schedules which rearranges instructions and preserves the control-flow structure
- A transformation from a first schedule to a second schedule preserves correctness if for each instruction and for each execution path the following holds
  - The instruction occurs along this path in the first schedule if and only if it occurs along this path in the second schedule (Assignment criterion)
  - All data dependences on other instructions along this path are preserved (Precedence criterion)
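The two criteria can be checked by brute force on small acyclic regions, which gives a feel for what the ILP constraints must guarantee. The sketch below is an illustration under assumptions, not the validation method from the talk: schedules are dicts from block to an ordered instruction list, copies of an instruction share its name, each instruction is additionally required to occur at most once per path, and all function names are hypothetical.

```python
def all_paths(cfg, block, exit_block):
    """All execution paths from block to exit_block in an acyclic CFG."""
    if block == exit_block:
        return [[block]]
    return [[block] + rest
            for succ in cfg[block]
            for rest in all_paths(cfg, succ, exit_block)]

def trace(schedule, path):
    """Instructions executed along a path, in order."""
    return [ins for block in path for ins in schedule.get(block, [])]

def preserves_correctness(cfg, entry, exit_block, before, after, deps):
    """deps is a list of (m, n) pairs, meaning n depends on m."""
    for path in all_paths(cfg, entry, exit_block):
        old, new = trace(before, path), trace(after, path)
        # Assignment criterion: same instructions along the path, no duplicates.
        if set(old) != set(new) or len(new) != len(set(new)):
            return False
        # Precedence criterion: m still comes before n along the path.
        for m, n in deps:
            if m in old and n in old and new.index(m) > new.index(n):
                return False
    return True

# Hypothetical diamond CFG; the add is moved upward out of the join block D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
before = {"A": ["cmp"], "B": ["ld"], "C": ["st"], "D": ["add"]}
with_copies = {"A": ["cmp"], "B": ["ld", "add"], "C": ["st", "add"], "D": []}
missing_copy = {"A": ["cmp"], "B": ["ld", "add"], "C": ["st"], "D": []}
deps = [("ld", "add")]

print(preserves_correctness(cfg, "A", "D", before, with_copies, deps))   # True
print(preserves_correctness(cfg, "A", "D", before, missing_copy, deps))  # False
```

The second schedule fails because the path through C no longer executes the add, which is exactly the situation compensation copies are meant to prevent.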