Title: Balanced Scheduling
1. Advanced Compilers, CMPSCI 710, Spring 2003
Balanced Scheduling
- Emery Berger
- University of Massachusetts, Amherst
2. Topics
- Last time
- Instruction scheduling
- Gibbons & Muchnick
- This time
- Balanced scheduling
- Kerns & Eggers
3. List Scheduling, Redux
- Build dependence dag
- Choose instructions from ready list
- Schedule using heuristics (Gibbons & Muchnick):
- Instruction with greatest latency
- Instruction with most successors
- Instruction on critical path
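A minimal sketch of such a list scheduler in Python; the Instr representation, default latencies, and field names are illustrative assumptions, not the Gibbons & Muchnick implementation:

    from dataclasses import dataclass, field

    @dataclass
    class Instr:
        name: str
        latency: int = 1                            # assumed execution latency
        succs: list = field(default_factory=list)   # instructions that depend on this one
        preds: int = 0                              # number of unscheduled predecessors

    def critical_path(instrs):
        """Longest-latency path from each instruction to a leaf of the dependence dag."""
        cp = {}
        def walk(i):
            if i.name not in cp:
                cp[i.name] = i.latency + max((walk(s) for s in i.succs), default=0)
            return cp[i.name]
        for i in instrs:
            walk(i)
        return cp

    def list_schedule(instrs):
        cp = critical_path(instrs)
        ready = [i for i in instrs if i.preds == 0]   # roots of the dag
        schedule = []
        while ready:
            # Heuristic priority: critical path, then latency, then successor count.
            ready.sort(key=lambda i: (cp[i.name], i.latency, len(i.succs)), reverse=True)
            i = ready.pop(0)
            schedule.append(i.name)
            for s in i.succs:                         # release newly ready successors
                s.preds -= 1
                if s.preds == 0:
                    ready.append(s)
        return schedule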
4. Fly in the Ointment
- When scheduling loads, assume a hit in the primary cache
- On older architectures, this makes sense
- Stall execution on cache miss
- But newer architectures are non-blocking
- Processor executes other instructions while a load is in progress
- Good: creates more ILP, but...
5. Scheduling Options
- Now what?
- Assume cache miss takes N cycles
- N typically 10 or more
- Do we schedule the load
- Anticipating a 1-cycle delay (a hit)?
- optimistic
- Or an N-cycle delay (a miss)?
- pessimistic
6. Optimistic vs. Pessimistic
Optimistic:  L0 X2 X1 X3 X4
Pessimistic: L0 X2 X3 X1 X4
- Optimistic: fine for hits, inferior for misses
- Pessimistic: fine for hits, better for misses
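To make the comparison concrete, a small cycle counter under an assumed machine model: single issue, one instruction per cycle, and an instruction that uses a load's result stalls until the load completes. Treating X1 as the consumer of L0 and a miss as 3 cycles are assumptions purely for illustration:

    def cycles(schedule, uses, latency):
        """Issue cycles for a schedule; names starting with 'L' are loads."""
        ready_at = {}                              # cycle when each load's result is available
        clock = 0
        for instr in schedule:
            clock += 1                             # next issue slot
            if instr in uses:                      # stall until the consumed load is done
                clock = max(clock, ready_at[uses[instr]])
            if instr.startswith('L'):
                ready_at[instr] = clock + latency
        return clock

    uses = {'X1': 'L0'}                            # assumed dependence: X1 consumes L0's result
    for latency in (1, 3):                         # 1 = hit, 3 = an assumed miss
        opt = cycles(['L0', 'X2', 'X1', 'X3', 'X4'], uses, latency)
        pes = cycles(['L0', 'X2', 'X3', 'X1', 'X4'], uses, latency)
        print(latency, opt, pes)                   # hit: 5 vs 5; miss: 6 vs 5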
7. Optimistic vs. Pessimistic, Multiple Loads
Optimistic:  L1 X1 L2 X2 X3
Pessimistic: L1 X1 X2 L2 X3
- Optimistic: better for hits, same for misses
- Pessimistic: worse for hits, same for misses
8. Balanced Scheduling
- Key insights
- No fixed estimate of memory latency is best
- Schedule based on available parallelism in the code
- Load-level parallelism
- Balanced scheduling
- Computes each weight separately
- Takes other possible instructions into account
- Space out loads, using available instructions as filler
9. Balanced Scheduling, Example
Balanced: L0 X2 X3 X1 X4
- Maximizes distance between L0 and X1
- Good in case of miss
10. Balanced Scheduling, Example
- W = load instruction weight
- W = 5: over-estimate
- Greedy schedule
- W = 1: under-estimate
- Lazy schedule
- Balanced scheduler
- W = 3 (load-level parallelism)
11. Balanced Scheduling, Results
- Always achieves fewest interlocks
12. Algorithm Idea
- Examine each instruction i in dag
- Determine which loads can run in parallel with i
- Use all (or part) of i's execution time to cover the latency of loads
13. Balanced Scheduling, Weight Calculation
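A minimal Python sketch of this weight calculation, following the description on the surrounding slides; the dag encoding (a depends_on predicate and a can_run_in_parallel test over instruction names) and the base weight of 1 per load are assumptions, and the definitive formulation is in Kerns & Eggers' paper:

    def connected_components(nodes, depends_on):
        """Group loads that are linked (in either direction) by dependences."""
        comps, remaining = [], set(nodes)
        while remaining:
            frontier = {remaining.pop()}
            comp = set()
            while frontier:
                n = frontier.pop()
                comp.add(n)
                linked = {m for m in remaining if depends_on(n, m) or depends_on(m, n)}
                remaining -= linked
                frontier |= linked
            comps.append(comp)
        return comps

    def longest_load_path(comp, depends_on):
        """Maximum number of loads on any dependence chain within a component."""
        def depth(n, seen=()):
            succs = [m for m in comp if depends_on(m, n) and m not in seen]
            return 1 + max((depth(m, seen + (n,)) for m in succs), default=0)
        return max(depth(n) for n in comp)

    def balanced_weights(instrs, loads, can_run_in_parallel, depends_on):
        """For each instruction i, split i's issue slot across the loads it can hide."""
        weight = {l: 1.0 for l in loads}                  # assumed base weight of one cycle
        for i in instrs:
            avail = [l for l in loads if l != i and can_run_in_parallel(i, l)]
            for comp in connected_components(avail, depends_on):
                m = longest_load_path(comp, depends_on)   # loads on the longest load path
                for l in comp:
                    weight[l] += 1.0 / m                  # add 1/(# of loads) per the slides
        return weight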
14. Balanced Scheduling, Example
- Locate longest load paths in connected components
- Add 1/(# of loads) to loads' weights
15. Balanced Scheduling, Example II
- Consider instruction X1
- Locate longest load paths in connected components
- Add 1/(# of loads) to loads' weights
- These additions are the contributions of X1
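Applying the sketch above to a tiny made-up case shows how a single instruction's contribution depends on whether the available loads form a chain (all names here are hypothetical):

    # Two loads that can both run in parallel with X1.
    loads = ['L1', 'L2']
    always = lambda i, l: True

    # Independent loads: two components, so X1 contributes 1 to each weight.
    indep = lambda a, b: False
    print(balanced_weights(['X1'], loads, always, indep))    # {'L1': 2.0, 'L2': 2.0}

    # L2 depends on L1: one component whose longest load path has 2 loads,
    # so X1's single issue slot is split, contributing 1/2 to each weight.
    chain = lambda a, b: (a, b) == ('L2', 'L1')
    print(balanced_weights(['X1'], loads, always, chain))    # {'L1': 1.5, 'L2': 1.5}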
16. Balanced Scheduling, All Weights
17. Balanced Scheduling Algorithm
- After computing weights, perform list scheduling where:
- Priority = weight plus max priority of successors
- Break ties:
- Largest delta between consumed and defined registers
- Rank based on successors in dag that would be exposed
- Select instruction generated earliest
- Bottom-up scheduler
- Reverse-order, schedule from leaves toward roots
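A sketch of the priority computation described here, with the same assumed representation as before; the three tie-breakers are noted in a comment rather than implemented:

    def priorities(instrs, weight, succs):
        """priority(i) = weight(i) + max priority among i's successors in the dag.

        weight: balanced weights for loads, an assumed weight of 1 for other
        instructions; succs maps each instruction to its dependents (empty for leaves).
        """
        prio = {}
        def visit(i):
            if i not in prio:
                prio[i] = weight[i] + max((visit(s) for s in succs[i]), default=0)
            return prio[i]
        for i in instrs:
            visit(i)
        return prio

    # The bottom-up list scheduler would then pick the highest-priority ready
    # instruction, breaking ties by register delta, exposed successors, and
    # original instruction order, per this slide.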
18. Balanced Scheduling, Example I
Balanced: L0 X2 X3 X1 X4
19. Balanced Scheduling, Example II
20. Limitations
- Performed after register allocation
- But introduces false dependences
- Reuse of registers ⇒ dag has extra edges
- Can be fixed with software register renaming
- Had to modify gcc's RTL
- Approach required manual pipelining
- Profile-based feedback
- Benchmarks based on FORTRAN, converted to C with f2c
- Can't disambiguate memory
- Adds many edges to dag
21. Workaround: Simulate Fortran
- Modify code to avoid aliases
- Improves results, but incorrect!
- Needs advanced alias analysis
22. Empirical Results
- Evaluated using simulation
- 3% to 18% improvement over regular scheduler across different models
- Mean: 9.9%
- Unfortunately
- No results presented without the above-mentioned modifications
23. Conclusion
- Balanced scheduling
- Spreads out instructions to cover load latency
- Based on exploitable load-level parallelism
- Effective at improving performance
- Modulo methodological limitations
- Not so great for C/C++, possibly useful for Java
- Next time: interprocedural analysis
- ACDI Ch. 19, pp. 607-636, 641-656