Optimal Global Scheduling for Itanium Processor Family

1
Optimal Global Scheduling for Itanium Processor Family
Sebastian Winkel, Saarland University, Germany
2
Itanium Challenges
  • Well-known difficulties of greedy instruction
    scheduling heuristics:
  • they cannot consider all interactions between
    decisions about code motion, predication, and
    speculation
  • especially: how aggressively should features
    that require additional execution slots be
    applied? If used too conservatively,
    opportunities are wasted; if used overeagerly,
    a resource shortage could spoil the benefit
  • even local scheduling is NP-complete
  • validation of the result is difficult
  • Idea: apply integer programming to obtain
    globally optimal and provably correct
    solutions

3
Optimal Global Schedules
  • minimal global schedule length
  • (defined as the sum of the local schedule
    lengths, weighted by frequency)
  • maximal performance
  • (instruction and data cache effects are not
    considered)
  • Intended Applications
  • As an optimization tool for performance-critical
    software components like encryption routines
  • probably as a postpass solution
  • As a research tool to obtain insights into the
    potential of the architecture
  • How much parallelism can be statically extracted
    by using code motion, predication and
    speculation?
  • How large is the room for improvements in
    scheduling heuristics?

4
Integer Linear Programming (ILP)
  • Minimize $c^T x$
  • Subject to $Ax \le b$
  • With $x$ integral: $x \in \mathbb{Z}^n$
  • The solution space consists of the integer
    points inside a polyhedron
  • Computing an optimal solution is NP-complete due
    to the integrality constraint
  • Dropping the integrality constraint gives the
    relaxed problem, which can be solved in
    polynomial time ⇒ its solution value is a lower
    bound on that of the integral problem
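The lower-bound property of the relaxed problem can be seen on a toy instance. The following is a minimal sketch, not part of the presentation: the data `c`, `A`, `b` are made up, the constraints are assumed to be written as A x ≤ b, and the relaxation is solved by enumerating vertices, which only works for two variables and a bounded feasible region.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical toy instance: minimize c^T x subject to A x <= b.
c = [-1, -1]
A = [[2, 2], [-1, 0], [0, -1], [1, 0], [0, 1]]
b = [5, 0, 0, 2, 2]

def feasible(x):
    """True if x satisfies every row of A x <= b."""
    return all(sum(a * xi for a, xi in zip(row, x)) <= bi
               for row, bi in zip(A, b))

def objective(x):
    return sum(ci * xi for ci, xi in zip(c, x))

def relaxed_optimum():
    """LP optimum for two variables: best feasible vertex (bounded case)."""
    best = None
    for (r1, b1), (r2, b2) in combinations(zip(A, b), 2):
        det = r1[0] * r2[1] - r1[1] * r2[0]
        if det == 0:
            continue  # parallel constraint lines: no unique intersection
        # Cramer's rule for the intersection of the two constraint lines.
        x = Fraction(b1 * r2[1] - r1[1] * b2, det)
        y = Fraction(r1[0] * b2 - b1 * r2[0], det)
        if feasible((x, y)) and (best is None or objective((x, y)) < best):
            best = objective((x, y))
    return best

def integral_optimum(lo=0, hi=2):
    """Brute force over the integer points of the box [lo, hi]^2."""
    return min(objective(p)
               for p in product(range(lo, hi + 1), repeat=2)
               if feasible(p))

print(relaxed_optimum(), integral_optimum())
```

On this instance the relaxed optimum lies at a fractional vertex and is strictly smaller than the best integral objective value, which is the gap that branch-and-bound has to close.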

5
Requirements for the ILP Model
  • An ILP model for scheduling should be:
  • Correct: no incorrect schedule may be feasible
  • Complete: at least one optimal schedule must be
    feasible
  • Compact: as many non-optimal schedules as
    possible are excluded from the search space
  • Simple: use as much abstraction and unification
    as possible, to have as few variables and
    constraints as possible
  • Efficient: describe a polyhedron with many
    integral vertices; requires some intuition and
    experience

Needs to be proven
Crucial to solution efficiency
6
ILP Model: Basic Structure
  • Source block s(n) of an instruction n: the block
    where it originates before scheduling
  • Code motion moves it to destination blocks
  • We collect the possible destination blocks in a
    set:
  • all predecessors of s(n) which are
    postdominated by s(n)
  • all successors of s(n) which are dominated by
    s(n)
  • For each instruction, each of its destination
    blocks and each time step therein, we define a
    binary variable

for non-speculative instructions
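The destination-block rule for non-speculative instructions can be sketched in a few lines. This is an illustrative reading, not the author's implementation: "predecessors" and "successors" are taken transitively, the source block itself is assumed to belong to the set, every block is assumed reachable from the entry and able to reach the exit, and the function names and the example CFG are hypothetical.

```python
def reachable(graph, start):
    """All blocks reachable from start by following at least one edge."""
    seen, stack = set(), list(graph[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen

def reverse(graph):
    rev = {n: [] for n in graph}
    for n, succs in graph.items():
        for s in succs:
            rev[s].append(n)
    return rev

def dominators(graph, entry):
    """Iterative data-flow solution: dom[b] is the set of blocks dominating b."""
    preds = reverse(graph)
    dom = {n: set(graph) for n in graph}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in graph:
            if n == entry:
                continue
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def destination_blocks(cfg, entry, exit_block, source):
    dom = dominators(cfg, entry)
    rev = reverse(cfg)
    pdom = dominators(rev, exit_block)  # postdominators via the reversed CFG
    upward = {b for b in reachable(rev, source) if source in pdom[b]}
    downward = {b for b in reachable(cfg, source) if source in dom[b]}
    return {source} | upward | downward

# Hypothetical CFG: A -> B, B -> {C, D}, C -> E, D -> E, E -> F
cfg = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": ["F"], "F": []}
print(sorted(destination_blocks(cfg, "A", "F", "C")))
```

In this example an instruction from block C cannot leave C without speculation or compensation copies, whereas an instruction from B or E may be placed in any block.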
7
Code Motion with Compensation Copies
  • Purpose of upward code motion: execute
    instructions as early as possible
  • Primary purpose of downward code motion: free
    execution slots for upward code motion

[Figure: Example for code motion, showing copies of add and store instructions, a store predicated with (p2), predicates p1 and p2, and the add's source block]
8
ILP Model: Variables
For each instruction and each of its destination blocks we use an additional variable a.
We couple the x and a variables with the following constraints, with this meaning:
a copy of instruction n is scheduled on all execution paths going through s(n) before block A.
[Figure: blocks A and B]
Global constraints: Assignment. For each instruction n, with a new single empty exit block:
n is scheduled on all execution paths going through its source block.
9
ILP Model: Constraints
Global constraints: Precedence. Let instruction n depend on m (and let A be a destination block of both n and m).
Local constraints: Precedence. n must not be scheduled at the same or at an earlier time step than m.
Resource: limits the number of scheduled instructions per time step which use the same resource type.
[Figure: dependence edge from m to n labeled 1. Surviving constraint right-hand sides: ≤ R_k and ≤ 1]
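The two local constraint families described on this slide can be checked on a candidate schedule for a single block. The sketch below is a plain feasibility checker under stated assumptions, not the ILP formulation itself: unit latency is assumed for every dependence, and the instruction names, resource types and limits are hypothetical.

```python
from collections import Counter

def check_local_schedule(time, resource, deps, limits):
    """Return the violated constraints of one block's schedule.

    time:     instruction -> time step
    resource: instruction -> resource type k
    deps:     (m, n) pairs, meaning n depends on m
    limits:   resource type k -> R_k, the number of units per time step
    """
    violations = []
    # Precedence: n must not be at the same or an earlier time step than m.
    for m, n in deps:
        if time[n] <= time[m]:
            violations.append(("precedence", m, n))
    # Resource: at most R_k instructions of type k per time step.
    usage = Counter((time[i], resource[i]) for i in time)
    for (t, k), count in sorted(usage.items()):
        if count > limits[k]:
            violations.append(("resource", t, k))
    return violations

resource = {"ld1": "M", "ld2": "M", "add": "I"}
deps = [("ld1", "add")]
limits = {"M": 1, "I": 2}  # hypothetical: one memory unit, two integer units

bad = {"ld1": 0, "ld2": 0, "add": 0}
good = {"ld1": 0, "ld2": 1, "add": 1}
print(check_local_schedule(bad, resource, deps, limits))
print(check_local_schedule(good, resource, deps, limits))
```

In the ILP these conditions become linear inequalities over the binary variables, so the solver never generates a schedule that such a checker would reject.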
10
Cyclic Scheduling Regions
  • So far only acyclic scheduling regions
  • but loops are frequent
  • and performance-critical
  • We allow code motion
  • into loops (easy)
  • out of loops (only upward)
  • Cyclic upward code motion out of loops:
  • not only into predecessors of the loop
    header, but also along every backedge

[Figure: loop with its loop header; the instructions shown are ld r1 = [r32], ld r1 = [r33], and several copies of add r2 = 8, r1]
11
Integrating Control Speculation
  • Two reasons why an instruction
    cannot be speculated:
  • it could trigger a false exception (e.g. a load)
  • it could overwrite a live value (e.g. a
    concurrent def)
  • Instructions for both possibilities (speculation
    used / not used) are included in the ILP
  • The ILP solver decides optimally where to apply
    speculation

[Figure: instructions shown: ld.s r8a = [r32], ld.s r8b = [r33]; ld r8 = [r32], ld r8 = [r33]; chk.s r8a, mov r8 = r8a; chk.s r8b, mov r8 = r8b; add r9 = 1, r8]
12
Objective Function
  • For each block A, we define a variable T_A equal
    to the local schedule length
  • The objective is the sum of the T_A, weighted
    with execution frequency estimates f_A
  • Additional optimization goal: stall minimization
  • We assume that every load is an L1 hit
  • If a load misses the L1 cache, the processor may
    stall at the first use of the loaded value (if
    not yet available)
  • Observation: stall time is inversely
    proportional to the load-use distance in the
    schedule
  • Idea: perform a second run of the ILP solver
    which maximizes the load-use distances without
    increasing schedule length
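The frequency-weighted objective can be illustrated with a small calculation. This is only a sketch of the weighted sum described earlier in the talk (global schedule length as the sum of local schedule lengths, weighted by frequency); the block names, lengths and frequencies are invented for the example.

```python
def global_schedule_length(local_length, frequency):
    """Sum over all blocks A of f_A * T_A."""
    return sum(frequency[a] * local_length[a] for a in local_length)

# Hypothetical frequencies: the loop body runs 100 times per call.
frequency = {"preheader": 1, "loop_body": 100, "exit": 1}

before = {"preheader": 2, "loop_body": 6, "exit": 3}
# Moving work out of the hot loop lengthens a cold block but shortens the hot one.
after = {"preheader": 4, "loop_body": 5, "exit": 3}

print(global_schedule_length(before, frequency))
print(global_schedule_length(after, frequency))
```

The second schedule is longer in the preheader yet has a smaller weighted total, which is why the objective weights each local length by how often the block executes.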
13
Experimental Setup
We created a postpass tool to compare our approach directly with Intel's compiler for Itanium.
[Figure: tool flow. Intel compiler 6.0.1 → assembly file → reconstruct and minimize data and control dependences (DDG, CFG) → generate ILP → CPLEX 8.0 solver → interpret and display solution]
  • False data dependences are removed via register
    renaming
  • Several optimizations are performed, e.g.:
  • redundant DDG edges are removed
  • for each instruction, a global ASAP-ALAP range
    is computed from the DDG ⇒ impossible
    destination blocks/time steps are discarded
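An ASAP-ALAP range can be computed with two passes over the dependence graph. The sketch below handles a single acyclic DDG with one time axis, which is simpler than the global, multi-block range mentioned on the slide; the instruction names, latencies and the `last_step` horizon are assumptions made for the example.

```python
from graphlib import TopologicalSorter

def asap_alap(nodes, edges, last_step):
    """edges maps (m, n) to a latency: n starts at least that many steps after m."""
    preds = {n: set() for n in nodes}
    for m, n in edges:
        preds[n].add(m)
    order = list(TopologicalSorter(preds).static_order())

    # Forward pass: earliest step at which each instruction can start.
    asap = {n: 0 for n in nodes}
    for n in order:
        for m in preds[n]:
            asap[n] = max(asap[n], asap[m] + edges[(m, n)])

    # Backward pass: latest step that still lets all successors finish in time.
    alap = {n: last_step for n in nodes}
    for m in reversed(order):
        for n in nodes:
            if (m, n) in edges:
                alap[m] = min(alap[m], alap[n] - edges[(m, n)])
    return asap, alap

nodes = ["ld", "add", "st", "mov"]
edges = {("ld", "add"): 2, ("add", "st"): 1}  # hypothetical latencies
asap, alap = asap_alap(nodes, edges, last_step=4)
print(asap, alap)
```

Any time step outside the interval from `asap[n]` to `alap[n]` needs no binary variable for instruction n, which is how such ranges shrink the ILP before the solver runs.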

14
ILP Generation and Solving
  • Input: currently four selected hot routines from
    SPEC CINT2000
  • no software pipelining, floating-point
    instructions, or calls handled yet
  • Results, as an example, for longest_match from
    164.gzip:
  • 154 instructions, 20 basic blocks, 2 nested loops
  • ILP size: 2878 constraints, 1794 variables
  • The ILP solver uses
  • simplex to solve the relaxed problem
  • branch-and-bound to obtain an integral solution

[Figure: branch-and-bound tree with node labels: relaxed problem (X1 is fractional); infeasible; X2 is fractional; integral solution found with objective value 28; lower bound 28.35 ⇒ cutoff]
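The cutoff in the figure follows from the lower-bound property of the relaxation. Below is a minimal sketch of that pruning rule for a minimization problem; the function name and the node list are hypothetical, and only the values 28 and 28.35 come from the slide.

```python
def can_cut_off(node_lower_bound, incumbent_value):
    """A subtree cannot beat the incumbent if its relaxed optimum is not below it."""
    return incumbent_value is not None and node_lower_bound >= incumbent_value

incumbent = 28.0  # objective value of the integral solution already found
open_nodes = {"left": 27.25, "right": 28.35}  # hypothetical relaxation values

surviving = [name for name, bound in open_nodes.items()
             if not can_cut_off(bound, incumbent)]
print(surviving)
```

The node with bound 28.35 is discarded without further branching, because every integral solution below it costs at least 28.35, which is worse than the solution with value 28.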
15
Solution Process
ILOG CPLEX 7.100, licensed to "mpi-saarbruecken", options: e m b
Tried aggregator 3 times.
MIP Presolve eliminated 2523 rows and 1152 columns.
MIP Presolve modified 1802 coefficients.
Aggregator did 90 substitutions.
Reduced MIP has 2208 rows, 1541 columns, and 25900 nonzeros.

   Nodes                                        Cuts/
 Node  Left   Objective  IInf  Best Integer     Best Node  ItCnt     Gap  Variable B
    0     0     26.5000    89                     26.5000    516
                27.0000    66               Fractcuts: 89    656
                27.0387    55                    Cuts: 67    773
                27.2122    56                     Cuts: 5    778
   15     9     32.3000     0       32.3000       27.2122   1540  15.75%  X_104__5_9 U
   24    12     32.0000     0       32.0000       27.2500   1743  14.84%  X_117__5_15 D
   26     5     29.0000     0       29.0000       27.2500   1745   6.03%  X_117__5_8 U
   32     2     28.0000     0       28.0000       27.2500   1921   2.68%  X_85__5_7 U

Implied bound cuts applied: 4
Mixed integer rounding cuts applied: 3
Gomory fractional cuts applied: 32
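The Gap column of the log can be reproduced from the Best Integer and Best Node columns. The sketch below assumes the usual relative MIP gap, |best integer − best node| / |best integer|, and uses only numbers that appear in the log; the function name is hypothetical.

```python
def relative_gap_percent(best_integer, best_node):
    """Relative gap between incumbent and best remaining bound, in percent."""
    return 100.0 * abs(best_integer - best_node) / abs(best_integer)

# (Best Integer, Best Node) pairs taken from the four incumbent lines of the log.
log_rows = [(32.3, 27.2122), (32.0, 27.25), (29.0, 27.25), (28.0, 27.25)]
gaps = [round(relative_gap_percent(bi, bn), 2) for bi, bn in log_rows]
print(gaps)
```

The computed values match the logged gaps, which shows how each new incumbent (32.3, 32, 29, 28) tightens the distance to the best bound of 27.25.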
16
Example
  • Inner loop
  • of longest_match
  • 8 basic blocks, 7 early exits
  • Already highly optimized by Intel's compiler
  • Optimal schedule is 26% shorter

[Figure: input schedule and output schedule side by side, annotated with execution frequencies 310, 250, 200, 160, 130, 100, 82, 63]
17
Example
  • get_bb_from_scratch
  • Intel's compiler here does not
  • use cyclic code motion
  • speculate concurrent defs
  • Optimal schedule is 35% shorter

[Figure: input schedule and output schedule side by side]

18
Conclusion and Outlook
  • We have presented a formal description of global
    scheduling
  • with upward / downward / cyclic code motion
  • automated generation of compensation code
  • integrated control speculation
  • For small routines, optimal schedules can
    be computed within seconds
  • Planned extensions
  • partially-ready code motion
  • long latency instructions
  • stall minimization
  • Early experiments are very promising. Benchmark
    runs on a real Itanium 2 machine would be very
    interesting

Thank you! Tesekkürler!
19
The Path-based View
  • Global scheduling: a transformation between
    schedules which rearranges instructions and
    preserves the control-flow structure
  • A transformation from schedule S1 to schedule S2
    preserves correctness if for each instruction and
    for each execution path the following holds:
  • The instruction occurs along this path in S1 ⇔ it
    occurs along this path in S2 (Assignment
    criterion)
  • All data dependences on other instructions along
    this path are preserved (Precedence criterion)
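The assignment criterion can be tested directly on small acyclic examples by enumerating execution paths. This sketch reads the criterion as an equivalence (an instruction occurs on a path before the transformation exactly when it occurs on it afterwards), ignores the precedence criterion, and uses a hypothetical diamond-shaped CFG and instruction names.

```python
def all_paths(cfg, entry, exit_block):
    """Enumerate all entry-to-exit paths of an acyclic CFG."""
    paths, stack = [], [[entry]]
    while stack:
        path = stack.pop()
        if path[-1] == exit_block:
            paths.append(path)
            continue
        for succ in cfg[path[-1]]:
            stack.append(path + [succ])
    return paths

def assignment_holds(cfg, entry, exit_block, before, after):
    """before/after map each block to the list of instructions scheduled in it."""
    instrs = {i for sched in (before, after)
              for block in sched.values() for i in block}
    for path in all_paths(cfg, entry, exit_block):
        for i in instrs:
            in_before = any(i in before.get(b, []) for b in path)
            in_after = any(i in after.get(b, []) for b in path)
            if in_before != in_after:
                return False
    return True

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
before = {"B": ["store"], "D": ["add"]}

with_copies = {"B": ["store", "add"], "C": ["add"]}  # add hoisted with a compensation copy
missing_copy = {"B": ["store", "add"]}               # add is lost on the path through C
hoisted_store = {"A": ["store"], "D": ["add"]}       # store now also runs on the path through C

print(assignment_holds(cfg, "A", "D", before, with_copies))
print(assignment_holds(cfg, "A", "D", before, missing_copy))
print(assignment_holds(cfg, "A", "D", before, hoisted_store))
```

Only the variant with a compensation copy on both branches passes; the other two either drop the add on one path or execute the store on a path where it did not occur before.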