CS 7810 Lecture 4

Provided by: RajeevBala4
1
CS 7810 Lecture 4
Overview of Steering Algorithms, based on
"Dynamic Code Partitioning for Clustered Architectures,"
R. Canal, J-M. Parcerisa, A. Gonzalez, UPC-Barcelona, IJPP '01
2
Bottlenecks
  • Recap from Complexity-Effective Superscalars
  • Wakeup+Select and Bypass have the longest delays and represent atomic operations
  • Pipelining will prevent back-to-back operations
  • Increased issue width / window size / wire delays exacerbate the problem (also for the register file and cache)

3
Dependence-Based Microarchitecture
Instruction stream (identical on slides 3-14):
r3 ← r1 + r2
r4 ← r3 + r2
r5 ← r4 + r2
r6 ← r4 + r2
r7 ← r6 + r2
r8 ← r5 + r2
r9 ← r1 + r2
FIFOs
(empty)
Rdy Operands
r1 = 1, r2 = 1, r3 = 0
4
Dependence-Based Microarchitecture
FIFOs
r3 ← r1 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 0
5
Dependence-Based Microarchitecture
FIFOs
r4 ← r3 + r2 | r3 ← r1 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 0
6
Dependence-Based Microarchitecture
FIFOs
r5 ← r4 + r2 | r4 ← r3 + r2 | r3 ← r1 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 0
7
Dependence-Based Microarchitecture
FIFOs
r5 ← r4 + r2 | r4 ← r3 + r2 | r3 ← r1 + r2
r6 ← r4 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 0
8
Dependence-Based Microarchitecture
FIFOs
r5 ← r4 + r2 | r4 ← r3 + r2 | r3 ← r1 + r2
r7 ← r6 + r2 | r6 ← r4 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 0
9
Dependence-Based Microarchitecture
FIFOs
r8 ← r5 + r2 | r5 ← r4 + r2 | r4 ← r3 + r2 | r3 ← r1 + r2
r7 ← r6 + r2 | r6 ← r4 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 0
10
Dependence-Based Microarchitecture
FIFOs
r8 ← r5 + r2 | r5 ← r4 + r2 | r4 ← r3 + r2 | r3 ← r1 + r2
r7 ← r6 + r2 | r6 ← r4 + r2
r9 ← r1 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 0
11
Dependence-Based Microarchitecture
FIFOs
r8 ← r5 + r2 | r5 ← r4 + r2 | r4 ← r3 + r2 | r3 ← r1 + r2
r7 ← r6 + r2 | r6 ← r4 + r2
r9 ← r1 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 0
r1 ? r2 ?
12
Dependence-Based Microarchitecture
FIFOs
r8 ← r5 + r2 | r5 ← r4 + r2 | r4 ← r3 + r2
r7 ← r6 + r2 | r6 ← r4 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 1
r3 ? r9 ?
13
Dependence-Based Microarchitecture
FIFOs
r8 ← r5 + r2 | r5 ← r4 + r2
r7 ← r6 + r2 | r6 ← r4 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 1
r4 ?
14
Dependence-Based Microarchitecture
FIFOs
r8 ← r5 + r2
r7 ← r6 + r2
Rdy Operands
r1 = 1, r2 = 1, r3 = 1
r5 ? r6 ?
15
Pros and Cons
  • Wakeup and select over a subset of issue queue entries (only FIFO heads)
  • Under-utilization as FIFOs do not get filled (causes about 5% IPC loss), but it is not hard to increase their sizes
  • You still need an operand-rdy table
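
The FIFO walkthrough on slides 3-14 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation. The steering rule is inferred from the frames: append to a FIFO whose tail produces one of the sources, otherwise take an empty FIFO. The one-cycle result latency, the default of four FIFOs, and the names steer_to_fifos and issue_order are assumptions made for illustration.

```python
from collections import deque

# Instruction stream from slides 3-14, written as (dest, sources).
STREAM = [
    ("r3", ("r1", "r2")),
    ("r4", ("r3", "r2")),
    ("r5", ("r4", "r2")),
    ("r6", ("r4", "r2")),
    ("r7", ("r6", "r2")),
    ("r8", ("r5", "r2")),
    ("r9", ("r1", "r2")),
]


def steer_to_fifos(stream, num_fifos=4):
    """Place each instruction behind its producer if that producer is a FIFO tail."""
    fifos = [deque() for _ in range(num_fifos)]
    for dest, srcs in stream:
        # Prefer a FIFO whose tail instruction writes one of our sources.
        target = next((f for f in fifos if f and f[-1][0] in srcs), None)
        if target is None:
            # Otherwise start a new dependence chain in an empty FIFO.
            target = next((f for f in fifos if not f), None)
        if target is None:
            # Assumption: with no empty FIFO, dispatch would stall.
            raise RuntimeError("no empty FIFO available")
        target.append((dest, srcs))
    return fifos


def issue_order(fifos, ready):
    """Each cycle, issue every FIFO head whose sources are all ready."""
    ready = set(ready)
    cycles = []
    while any(fifos):
        issued = [f.popleft() for f in fifos
                  if f and all(s in ready for s in f[0][1])]
        if not issued:
            raise RuntimeError("no FIFO head has ready operands")
        # Assumption: results become visible to dependents in the next cycle.
        ready.update(dest for dest, _ in issued)
        cycles.append([dest for dest, _ in issued])
    return cycles


fifos = steer_to_fifos(STREAM)
print([[dest for dest, _ in f] for f in fifos])
print(issue_order(fifos, {"r1", "r2"}))
```

Only the FIFO heads are examined in issue_order, which is the point of the first bullet above: wakeup and select run over a handful of entries rather than the whole window.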

16
Clustered Microarchitectures
17
Clustered Microarchitectures
  • Simplifies wakeup+select and bypassing
  • Dependence-based, hence most communication is local
  • Low porting requirements on register file, issue queue
  • IPC loss of 6.3%, but a clock speed improvement

18
Clustered Microarchitectures
  • Two primary motivations
    • hard to design 8-way machines in future technologies
    • the FP cluster is idle most of the time
  • Advantages
    • Few entries, few ports → low delays → fast clocks, simple pipelines
    • Not every instruction is penalized for wire delays
    • Potential for large windows and high ILP
    • Design and verification costs do not scale up (?)

19
Dependences
r1 ← r2 + r3    cl-1
r4 ← r1 + r2    cl-1
r5 ← r6 + r7    cl-2
r8 ← r5 + r1    ?
  • During rename, steer dependent instructions to the same cluster
  • However, we do not know about converging chains (can have workarounds: traces/compilers)
  • If the assigned cluster is full, do we stall or go elsewhere? Not clarified in the paper

20
Load Imbalance
  • All instructions in 1 cluster → zero communication, but zero utilization of other resources
  • Six ready instructions in cl-1 and two in cl-2 → more contention and wasted issue slots
  • Ready instructions in each should be equal; however, instruction readiness happens long after instruction steering

21
Load Imbalance Metrics
  • Metrics
    • Instrs in each cluster
    • Unissued instrs that could have issued elsewhere (note latency between steer and issue)
  • The second metric does not help much
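
The two metrics can be written down as small functions. This is a rough sketch under stated assumptions: the function names are hypothetical, the per-cluster issue width of 4 in the usage lines is assumed (the slides do not give it), and the second metric is approximated as the number of surplus ready instructions that would fit into idle issue slots in other clusters.

```python
def count_imbalance(instrs_per_cluster):
    """Metric 1: spread in the number of instructions held by each cluster."""
    return max(instrs_per_cluster) - min(instrs_per_cluster)


def missed_issue_slots(ready_per_cluster, issue_width):
    """Metric 2 (approximation): ready instructions left unissued in one
    cluster that would have fit into idle issue slots elsewhere."""
    surplus = sum(max(0, r - issue_width) for r in ready_per_cluster)
    idle = sum(max(0, issue_width - r) for r in ready_per_cluster)
    return min(surplus, idle)


# Example from the Load Imbalance slide: six ready instructions in cl-1, two in cl-2.
print(count_imbalance([6, 2]))
print(missed_issue_slots([6, 2], issue_width=4))  # issue width of 4 is assumed
```

The second metric is measured at issue time, while the steering decision it is meant to guide happened many cycles earlier, which is consistent with the remark that it does not help much.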

22
Instruction Assignment
[Figure: Reg-rename and Instr steer feeding two clusters; labels: IQ, IQ, Regfile, Regfile, F, F, F, F; 40 regs in each cluster]

Original code:
r1 ← r2 + r3
r4 ← r1 + r2
r5 ← r6 + r7
r8 ← r1 + r5

After renaming and steering:
p21 ← p2 + p3
p22 ← p21 + p2
p42 ← p21
p41 ← p56 + p57
p43 ← p42 + p41

r1 is mapped to p21 and p42: this will influence steering and instr commit. On average, only 8 replicated regs.
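
A toy renamer shows how one logical register can end up mapped to a physical register in each cluster. This is a sketch only: the class name ClusterRenamer, the numbering of physical registers (p1-p40 in cluster 0, p41-p80 in cluster 1), and the treatment of live-in values as already present in the requesting cluster are all assumptions, so the register numbers differ from those on the slide. Freeing registers at commit is left out.

```python
class ClusterRenamer:
    """Toy rename table for a clustered machine with per-cluster register files."""

    def __init__(self, num_clusters=2, regs_per_cluster=40):
        # Assumed numbering: cluster 0 owns p1..p40, cluster 1 owns p41..p80.
        self.free = [
            list(range(c * regs_per_cluster + 1, (c + 1) * regs_per_cluster + 1))
            for c in range(num_clusters)
        ]
        # logical register -> {cluster: physical register}
        self.copies = {}

    def _alloc(self, cluster):
        return self.free[cluster].pop(0)

    def rename(self, dest, srcs, cluster):
        """Rename one instruction steered to `cluster`; returns the ops to dispatch."""
        ops, phys_srcs = [], []
        for s in srcs:
            held = self.copies.setdefault(s, {})
            if not held:
                # Assumption: a live-in value is already present in this cluster.
                held[cluster] = self._alloc(cluster)
            elif cluster not in held:
                # Source lives only in another cluster: replicate it with a copy.
                remote = next(iter(held.values()))
                held[cluster] = self._alloc(cluster)
                ops.append(("copy", held[cluster], (remote,)))
            phys_srcs.append(held[cluster])
        new = self._alloc(cluster)
        # A new definition supersedes all earlier copies (commit-time freeing omitted).
        self.copies[dest] = {cluster: new}
        ops.append(("op", new, tuple(phys_srcs)))
        return ops


rn = ClusterRenamer()
program = [("r1", ("r2", "r3"), 0), ("r4", ("r1", "r2"), 0),
           ("r5", ("r6", "r7"), 1), ("r8", ("r1", "r5"), 1)]
for dest, srcs, cluster in program:
    print(dest, rn.rename(dest, srcs, cluster))
print(rn.copies["r1"])
```

After the last instruction, r1 has a mapping in both clusters, which mirrors the p21/p42 pair on the slide. A steering heuristic can consult this table to see which clusters need no copy, and commit has to release every replica when r1 is overwritten.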
23
Assignment by the Compiler
  • ISA modification
  • Less accurate notion of load
  • Depends on good branch prediction, memory dependence prediction, cache miss prediction, contention modeling, etc.
  • Dynamic mechanisms can add pipeline stages

24
Steering Heuristics
  • Simple Register Mapping Based Steering (Simple-RMBS): if communication cannot be avoided, pick a random cluster
  • Balanced-RMBS: if communication cannot be avoided, pick the less-loaded cluster
  • Advanced-RMBS: if significant imbalance, pick the less-loaded cluster, else use Balanced-RMBS
  • Modulo-steering: assignment alternates between clusters
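
The four heuristics can be sketched as follows. This is a simplified reading of the bullets, not the paper's exact algorithm. Assumptions: `where` maps each register to the set of clusters holding a copy, `load` is the instruction count per cluster, "significant imbalance" means the count difference exceeds a hypothetical `threshold`, and when several clusters need no communication the same tie-break (random or less-loaded) is applied among them.

```python
import random


def comm_free(srcs, where, num_clusters):
    """Clusters that already hold a copy of every source register."""
    return [c for c in range(num_clusters)
            if all(c in where.get(s, ()) for s in srcs)]


def less_loaded(load, candidates=None):
    """Least-loaded cluster, optionally restricted to a candidate list."""
    candidates = range(len(load)) if candidates is None else candidates
    return min(candidates, key=lambda c: load[c])


def simple_rmbs(srcs, where, load, rng=random):
    free = comm_free(srcs, where, len(load))
    # Communication unavoidable -> random cluster.
    return rng.choice(free) if free else rng.randrange(len(load))


def balanced_rmbs(srcs, where, load):
    free = comm_free(srcs, where, len(load))
    # Communication unavoidable -> less-loaded cluster.
    return less_loaded(load, free) if free else less_loaded(load)


def advanced_rmbs(srcs, where, load, threshold):
    # Significant imbalance overrides the register mapping entirely.
    if max(load) - min(load) > threshold:
        return less_loaded(load)
    return balanced_rmbs(srcs, where, load)


def modulo_steer(instr_index, num_clusters):
    # Ignores dependences and load; simply alternates between clusters.
    return instr_index % num_clusters


where = {"r1": {0}, "r5": {1}}
load = [6, 2]
print(balanced_rmbs(["r1"], where, load))
print(balanced_rmbs(["r1", "r5"], where, load))
print(advanced_rmbs(["r1"], where, load, threshold=3))
```

The threshold is the tuning knob: a low value behaves like pure load balancing, and a high value reduces Advanced-RMBS to Balanced-RMBS.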

25
Results
  • Modulo steering: too much communication
  • Balanced and Simple RMBS do well (27% and 22% better than the base); less than 3 comms per 100 instructions (a single bus is enough); assuming zero comm-cost isolates effect of workload imbalance
  • Advanced RMBS performs 35% better than base
  • The max possible improvement (UB model) is 44%

26
Other Results
  • Scheduling constraints limit improvements for FP programs
  • The compiler can do better than what Fig. 10 indicates
  • Palacharla algorithm doesn't do as well: no load considerations and few FIFOs → more communication

27
Optimizations
  • Information on converging chains (slices)
  • First-fit and Mod-N
  • Identify critical source operands
  • Interconnect-sensitive steering
  • Stalls in dispatch

28
Future Trends
  • Increased wire delays and more transistors →
    • each cluster is smaller
    • more clusters
    • latency across clusters is higher
  • Load imbalance and communication become worse: the best heuristic/threshold will depend on the assumed model/latency
  • Data cache access time increases

29
Dynamic Cluster Allocation
  • At some point, using more clusters can increase communication costs and worsen performance
  • More clusters → larger windows/FUs → more ILP
    → more communication penalties
  • Steering heuristic should take degree of ILP into account (ISCA '03)

30
Other Recent Papers
  • Hierarchical interconnect designs: Aggarwal and Franklin
  • Distributed data caches: UPC
  • Power-efficiency of clustered designs: Zyuban and Kogge
  • TRIPS processor: UT-Austin (compiler mapping)

31
Important Problems
  • Cluster allocation to threads
  • Design of interconnects
  • Latency tolerance
  • Exploiting heterogeneity
  • 3D design
  • Power efficiency and temperature
  • Branch fan-out

[Figure labels: L1D, L2, F, E]
32
Next Week's Paper
  • "The Optimal Logic Depth per Pipeline Stage is 6 to 8 FO4 Inverter Delays," UT-Austin/Compaq, ISCA '02
