Emery Berger - PowerPoint PPT Presentation

About This Presentation
Title:

Emery Berger

Description:

Title: Multiprocessor Memory Allocation Last modified by: Emery Berger Created Date: 2/24/2000 4:19:41 AM Document presentation format: Custom Other titles – PowerPoint PPT presentation

Slides: 24
Provided by: uma61

Transcript and Presenter's Notes


1
Advanced Compilers
CMPSCI 710, Spring 2003
Balanced Scheduling
  • Emery Berger
  • University of Massachusetts, Amherst

2
Topics
  • Last time
  • Instruction scheduling
  • Gibbons & Muchnick
  • This time
  • Balanced scheduling
  • Kerns & Eggers

3
List Scheduling, Redux
  • Build dependence dag
  • Choose instructions from ready list
  • Schedule using heuristics (Gibbons & Muchnick)
  • Instruction with greatest latency
  • Instruction with most successors
  • Instruction on critical path
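
A minimal Python sketch of the list-scheduling loop above. The function name list_schedule, the five-instruction dag, the latencies, and the order in which the three heuristics are combined into one sort key are assumptions made for illustration; they are not taken from Gibbons & Muchnick.

```python
def list_schedule(nodes, succs, latency):
    """Forward list scheduling for a single-issue machine.

    nodes:   instructions in original program order
    succs:   dict mapping an instruction to its dag successors
    latency: cycles before an instruction's result may be used
    Returns (order, issue): issue order and issue cycle per instruction.
    """
    # Build the predecessor lists of the dependence dag.
    preds = {n: [] for n in nodes}
    for u in nodes:
        for v in succs.get(u, ()):
            preds[v].append(u)

    # Critical-path length: latency-weighted longest path to a leaf.
    cp = {}
    def crit(n):
        if n not in cp:
            cp[n] = latency[n] + max((crit(m) for m in succs.get(n, ())), default=0)
        return cp[n]

    issue, order, cycle = {}, [], 0
    while len(order) < len(nodes):
        # Ready list: unscheduled, and every predecessor's result is available.
        ready = [n for n in nodes if n not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle
                         for p in preds[n])]
        if ready:
            # Assumed key order: latency, then successor count, then critical path.
            # Ties fall back to original program order.
            pick = max(ready, key=lambda n: (latency[n], len(succs.get(n, ())), crit(n)))
            issue[pick] = cycle
            order.append(pick)
        cycle += 1  # an empty ready list means a stall cycle
    return order, issue


# Hypothetical dag: L0 -> X1 -> X4, with X2 and X3 independent.
nodes = ["L0", "X1", "X2", "X3", "X4"]
succs = {"L0": ["X1"], "X1": ["X4"]}
latency = {"L0": 2, "X1": 1, "X2": 1, "X3": 1, "X4": 1}
order, issue = list_schedule(nodes, succs, latency)
print(" ".join(order))
```

The load latency is the only number the scheduler knows about memory, which is why the choice of that number matters so much on the following slides.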

4
Fly in the Ointment
  • When scheduling loads, assume hit in primary
    cache
  • On older architectures, this makes sense
  • Stall execution on cache miss
  • But newer architectures are nonblocking
  • Processor executes other instructions while load
    in progress
  • Good: creates more ILP, but...

5
Scheduling Options
  • Now what?
  • Assume cache miss takes N cycles
  • N typically 10 or more
  • Do we schedule the load
  • Anticipating a 1-cycle delay (a hit)? (optimistic)
  • Or an N-cycle delay (a miss)? (pessimistic)

6
Optimistic vs. Pessimistic
Optimistic: L0 X2 X1 X3 X4
Pessimistic: L0 X2 X3 X1 X4
  • Optimistic: fine for hits, inferior for misses
  • Pessimistic: fine for hits, better for misses
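
To make the trade-off concrete, the following Python sketch counts stall cycles for the two schedules above on a single-issue, in-order machine with a nonblocking cache. The dependence edges (L0 -> X1 -> X4, with X2 and X3 independent), the function name run, and the miss delay of 10 cycles are assumptions; the slide's dag figure is not reproduced in this transcript.

```python
def run(schedule, preds, delay):
    """Simulate a schedule; return (total_cycles, stall_cycles).

    An instruction stalls only until its own operands are available:
    a consumer may issue 1 + delay[p] cycles after producer p issued.
    Instructions missing from `delay` have zero extra delay.
    """
    issue = {}
    cycle = 0
    stalls = 0
    for instr in schedule:
        earliest = max((issue[p] + 1 + delay.get(p, 0)
                        for p in preds.get(instr, ())), default=0)
        start = max(cycle, earliest)
        stalls += start - cycle       # cycles lost waiting for operands
        issue[instr] = start
        cycle = start + 1
    return cycle, stalls


# Assumed dependences: X1 uses the value loaded by L0, X4 uses X1.
preds = {"X1": ["L0"], "X4": ["X1"]}
optimistic = ["L0", "X2", "X1", "X3", "X4"]
pessimistic = ["L0", "X2", "X3", "X1", "X4"]

for name, delay in (("hit", {"L0": 1}), ("miss", {"L0": 10})):
    print(name, "optimistic", run(optimistic, preds, delay),
          "pessimistic", run(pessimistic, preds, delay))
```

Under these assumptions both schedules run stall-free on a hit, and the pessimistic schedule loses one stall cycle fewer on a miss because one more independent instruction sits between L0 and X1.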

7
Optimistic vs. Pessimistic, Multiple Loads
Optimistic: L1 X1 L2 X2 X3
Pessimistic: L1 X1 X2 L2 X3
  • Optimistic: better for hits, same for misses
  • Pessimistic: worse for hits, same for misses

8
Balanced Scheduling
  • Key insights
  • No fixed estimate of memory latency is best
  • Schedule based on available parallelism in the code
  • Load-level parallelism
  • Balanced scheduling
  • Computes each weight separately
  • Takes other possible instructions into account
  • Space out loads, using available instructions as
    filler

9
Balanced Scheduling, Example
Balanced: L0 X2 X3 X1 X4
  • Maximizes distance between L0 and X1
  • Good in case of miss

10
Balanced Scheduling, Example
  • W = load instruction weight
  • W = 5: over-estimate
  • Greedy schedule
  • W = 1: under-estimate
  • Lazy schedule
  • Balanced scheduler
  • W = 3 (load-level parallelism)

11
Balanced Scheduling, Results
  • Always achieves fewest interlocks

12
Algorithm Idea
  • Examine each instruction i in dag
  • Determine which loads can run in parallel with i
  • Use all (or part) of i's execution time to cover
    latency of loads

13
Balanced Scheduling, Weight Calculation
  • Time complexity?

14
Balanced Scheduling, Example
  • Locate longest load paths in connected components
  • Add 1/(# of loads) to loads' weights

15
Balanced Scheduling, Example II
  • Consider instruction X1
  • Locate longest load paths in connected components
  • Add 1/(# of loads) to loads' weights
  • These are the contributions of X1
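
The weight calculation walked through on these slides can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: every instruction is assumed to start with weight 1 and to offer one cycle of cover, components are taken as undirected connected components of the instructions independent of i, and the names balanced_weights, reachable and max_loads_on_path, as well as the example dag, are hypothetical.

```python
from collections import defaultdict

def reachable(start, edges):
    """All nodes reachable from start along edges (start itself excluded)."""
    seen, stack = set(), list(edges.get(start, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(edges.get(n, ()))
    return seen

def max_loads_on_path(comp, succs, loads):
    """Largest number of loads on any dag path that stays inside comp."""
    memo = {}
    def best(n):
        if n not in memo:
            tail = max((best(m) for m in succs.get(n, ()) if m in comp), default=0)
            memo[n] = (1 if n in loads else 0) + tail
        return memo[n]
    return max(best(n) for n in comp)

def balanced_weights(nodes, succs, loads):
    preds = defaultdict(list)
    for u in nodes:
        for v in succs.get(u, ()):
            preds[v].append(u)
    weight = {n: 1.0 for n in nodes}          # assumption: initial weight 1
    for i in nodes:
        # Instructions that can run in parallel with i: neither
        # (transitive) predecessors nor successors of i.
        related = reachable(i, succs) | reachable(i, preds) | {i}
        indep = {n for n in nodes if n not in related}
        seen = set()
        for root in nodes:
            if root not in indep or root in seen:
                continue
            # Collect the undirected connected component containing root.
            comp, stack = set(), [root]
            while stack:
                n = stack.pop()
                if n in comp:
                    continue
                comp.add(n)
                for m in list(succs.get(n, ())) + preds[n]:
                    if m in indep and m not in comp:
                        stack.append(m)
            seen |= comp
            chances = max_loads_on_path(comp, succs, loads)
            if chances:
                # i's cycle is shared among the loads it may have to cover.
                for load in comp & loads:
                    weight[load] += 1.0 / chances
    return weight


# Hypothetical dag: L0 -> X1 -> X4, with X2 and X3 independent of the load.
nodes = ["L0", "X1", "X2", "X3", "X4"]
succs = {"L0": ["X1"], "X1": ["X4"]}
weights = balanced_weights(nodes, succs, loads={"L0"})
print(weights["L0"])
```

On this assumed dag only X2 and X3 can run in parallel with L0, so each contributes a full cycle and the load's weight becomes 3.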

16
Balanced Scheduling, All Weights

17
Balanced Scheduling Algorithm
  • After computing weights, perform list scheduling
    where
  • Priority = weight plus max priority of successors
  • Break ties
  • Largest delta between consumed and defined
    registers
  • Rank based on successors in dag that would be
    exposed
  • Select instruction generated earliest
  • Bottom-up scheduler
  • Reverse-order, schedule from leaves toward roots
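
The priority rule above (weight plus the maximum priority of any successor) can be sketched in Python as below. The function name priorities, the dag, and the weights are assumptions for illustration; the tie-breaking rules and the bottom-up scheduling loop itself are not shown.

```python
def priorities(nodes, succs, weight):
    """Priority of n = weight[n] + max priority among n's dag successors."""
    prio = {}
    def visit(n):
        if n not in prio:
            prio[n] = weight[n] + max((visit(m) for m in succs.get(n, ())), default=0)
        return prio[n]
    for n in nodes:
        visit(n)
    return prio


# Hypothetical dag L0 -> X1 -> X4, with an assumed balanced weight of 3 for L0.
nodes = ["L0", "X1", "X2", "X3", "X4"]
succs = {"L0": ["X1"], "X1": ["X4"]}
weight = {"L0": 3, "X1": 1, "X2": 1, "X3": 1, "X4": 1}
prio = priorities(nodes, succs, weight)
print(prio)
```

A leaf's priority is just its own weight, so a load's larger weight raises the priority of the load and of everything above it in the dag.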

18
Balanced Scheduling, Example I
Balanced: L0 X2 X3 X1 X4

19
Balanced Scheduling, Example II

20
Limitations
  • Performed after register allocation
  • But introduces false dependences
  • Reuse of registers ⇒ dag has extra edges
  • Can be fixed with software register renaming
  • Had to modify gcc's RTL
  • Approach required manual pipelining
  • Profile-based feedback
  • Benchmark based on FORTRAN, converted to C with
    f2c
  • Can't disambiguate memory
  • Adds many edges to dag
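
The false-dependence point above can be illustrated with a small Python sketch: reusing a register after allocation adds anti and output edges to the dag, and renaming each definition to a fresh name removes them. The instruction encoding (destination, list of sources), the register names, and the functions dependences and rename are hypothetical; this is not how the gcc RTL changes mentioned on the slide were implemented.

```python
def dependences(code):
    """Return a set of (i, j, kind) edges for straight-line code.

    code is a list of (dest, [srcs]). True edges link each use to its
    most recent definition; anti and output edges are conservative and
    may include edges already implied by others.
    """
    edges = set()
    last_def = {}
    for j, (dest, srcs) in enumerate(code):
        for s in srcs:
            if s in last_def:
                edges.add((last_def[s], j, "true"))
        for i in range(j):
            di, si = code[i]
            if dest in si:
                edges.add((i, j, "anti"))     # j overwrites a value i reads
            if dest == di:
                edges.add((i, j, "output"))   # i and j write the same register
        last_def[dest] = j
    return edges

def rename(code):
    """Give every definition a fresh name (assumes no source is named v<k>)."""
    current, out = {}, []
    for k, (dest, srcs) in enumerate(code):
        new_srcs = [current.get(s, s) for s in srcs]
        current[dest] = f"v{k}"
        out.append((current[dest], new_srcs))
    return out


# r1 is reused for a second, unrelated load.
code = [("r1", ["a"]), ("r2", ["r1"]), ("r1", ["b"]), ("r3", ["r1"])]
before = dependences(code)
after = dependences(rename(code))
print(sorted(e for e in before if e[2] != "true"))
print(sorted(after))
```

After renaming, the second load no longer has to wait for the first load's consumer, which is the kind of load-level parallelism the scheduler needs to see.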

21
Workaround: Simulate Fortran
  • Modify code to avoid aliases
  • Improves results, but incorrect!
  • Needs advanced alias analysis

22
Empirical Results
  • Evaluated using simulation
  • 3% to 18% improvement over regular scheduler
    across different models
  • Mean 9.9%
  • Unfortunately
  • No results presented without above-mentioned
    modifications

23
Conclusion
  • Balanced scheduling
  • Spreads out instructions to cover load latency
  • Based on exploitable load-level parallelism
  • Effective at improving performance
  • Modulo methodological limitations
  • Not so great for C/C++, possibly useful for Java
  • Next time: interprocedural analysis
  • ACDI Ch. 19, pp. 607-636, 641-656