CS 7810 Lecture 4 - PowerPoint PPT Presentation

About This Presentation

Description:

The max possible improvement (UB model) is 44%. Other Results ... Load imbalance and communication become worse; the best heuristic/threshold will depend ...

Slides: 34

Provided by: RajeevBala4

Learn more at: https://users.cs.utah.edu

Transcript and Presenter's Notes

1
CS 7810 Lecture 4
Overview of Steering Algorithms, based on
"Dynamic Code Partitioning for Clustered Architectures"
R. Canal, J-M. Parcerisa, A. Gonzalez, UPC-Barcelona, IJPP '01
2
Bottlenecks

Recap from Complexity-Effective Superscalars
Wakeup+Select and Bypass have the longest delays and represent atomic operations
Pipelining them will prevent back-to-back execution of dependent operations
Increased issue width / window size / wire delays exacerbate the problem (also for the register file and cache)

3
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
Rdy Operands
r1: 1, r2: 1, r3: 0
4
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r3 ← r1, r2
Rdy Operands
r1: 1, r2: 1, r3: 0
5
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r4 ← r3, r2 | r3 ← r1, r2
Rdy Operands
r1: 1, r2: 1, r3: 0
6
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r5 ← r4, r2 | r4 ← r3, r2 | r3 ← r1, r2
Rdy Operands
r1: 1, r2: 1, r3: 0
7
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r5 ← r4, r2 | r4 ← r3, r2 | r3 ← r1, r2
r6 ← r4, r2
Rdy Operands
r1: 1, r2: 1, r3: 0
8
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r5 ← r4, r2 | r4 ← r3, r2 | r3 ← r1, r2
r7 ← r6, r2 | r6 ← r4, r2
Rdy Operands
r1: 1, r2: 1, r3: 0
9
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r8 ← r5, r2 | r5 ← r4, r2 | r4 ← r3, r2 | r3 ← r1, r2
r7 ← r6, r2 | r6 ← r4, r2
Rdy Operands
r1: 1, r2: 1, r3: 0
10
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r8 ← r5, r2 | r5 ← r4, r2 | r4 ← r3, r2 | r3 ← r1, r2
r7 ← r6, r2 | r6 ← r4, r2
r9 ← r1, r2
Rdy Operands
r1: 1, r2: 1, r3: 0
11
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r8 ← r5, r2 | r5 ← r4, r2 | r4 ← r3, r2 | r3 ← r1, r2
r7 ← r6, r2 | r6 ← r4, r2
r9 ← r1, r2
Rdy Operands
r1: 1, r2: 1, r3: 0
r1 ←   r2 ←
12
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r8 ← r5, r2 | r5 ← r4, r2 | r4 ← r3, r2
r7 ← r6, r2 | r6 ← r4, r2
Rdy Operands
r1: 1, r2: 1, r3: 1
r3 ←   r9 ←
13
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r8 ← r5, r2 | r5 ← r4, r2
r7 ← r6, r2 | r6 ← r4, r2
Rdy Operands
r1: 1, r2: 1, r3: 1
r4 ←
14
Dependence-Based Microarchitecture
r3 ← r1, r2;  r4 ← r3, r2;  r5 ← r4, r2;  r6 ← r4, r2;
r7 ← r6, r2;  r8 ← r5, r2;  r9 ← r1, r2
FIFOs
r8 ← r5, r2
r7 ← r6, r2
Rdy Operands
r1: 1, r2: 1, r3: 1
r5 ←   r6 ←
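
The FIFO walkthrough on slides 3-14 can be replayed with a short sketch. The steering rule below is inferred from the animation rather than stated on the slides: an instruction goes behind the producer of a pending operand if that producer sits at a FIFO tail, and otherwise into an empty FIFO. The function names, the four-FIFO limit, the single-cycle latency, and the unlimited issue width are all assumptions made for illustration.

```python
from collections import deque


def steer_to_fifos(instrs, ready, num_fifos=4):
    """Place (dest, srcs) instructions into FIFOs in program order."""
    fifos = [deque() for _ in range(num_fifos)]  # num_fifos is an assumed limit
    for dest, srcs in instrs:
        pending = [s for s in srcs if s not in ready]
        # Prefer a FIFO whose tail instruction produces a pending operand.
        target = next((f for f in fifos if f and f[-1][0] in pending), None)
        if target is None:
            # Otherwise start a new chain in an empty FIFO.
            target = next((f for f in fifos if not f), None)
        if target is None:
            raise RuntimeError("no FIFO available (a real design would stall)")
        target.append((dest, srcs))
    return fifos


def issue_cycles(fifos, ready):
    """Issue from FIFO heads only; return the dests issued in each cycle.

    Assumes single-cycle latency and no limit on issue width.
    """
    ready = set(ready)
    cycles = []
    while any(fifos):
        # Only the head of each FIFO is checked for ready operands.
        issued = [f.popleft() for f in fifos
                  if f and all(s in ready for s in f[0][1])]
        if not issued:
            raise RuntimeError("no FIFO head is ready")
        ready.update(dest for dest, _ in issued)
        cycles.append([dest for dest, _ in issued])
    return cycles


program = [("r3", ["r1", "r2"]), ("r4", ["r3", "r2"]), ("r5", ["r4", "r2"]),
           ("r6", ["r4", "r2"]), ("r7", ["r6", "r2"]), ("r8", ["r5", "r2"]),
           ("r9", ["r1", "r2"])]
fifos = steer_to_fifos(program, ready={"r1", "r2"})
layout = [[dest for dest, _ in f] for f in fifos]    # head first
schedule = issue_cycles(fifos, ready={"r1", "r2"})
```

Under these assumptions, `layout` is `[["r3", "r4", "r5", "r8"], ["r6", "r7"], ["r9"], []]`, which matches the FIFO contents on slide 10 (the slides draw the head on the right). `schedule` begins `["r3", "r9"]`, `["r4"]`, `["r5", "r6"]`, which lines up with the instructions shown leaving the FIFOs on slides 12-14.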
15
Pros and Cons

Wakeup and select operate over a subset of issue queue entries (only FIFO heads)
Under-utilization, as FIFOs do not get filled (causes about 5% IPC loss), but it is not hard to increase their sizes
You still need an operand-rdy table

16
Clustered Microarchitectures
17
Clustered Microarchitectures

Simplifies wakeup+select and bypassing
Dependence-based, hence most communication is local
Low porting requirements on register file, issue queue
IPC loss of 6.3%, but a clock speed improvement

18
Clustered Microarchitectures

Two primary motivations:
hard to design 8-way machines in future technologies
the FP cluster is idle most of the time
Advantages:
Few entries, few ports → low delays → fast clocks, simple pipelines
Not every instruction is penalized for wire delays
Potential for large windows and high ILP
Design and verification costs do not scale up (?)

19
Dependences
r1 ← r2, r3    cl-1
r4 ← r1, r2    cl-1
r5 ← r6, r7    cl-2
r8 ← r5, r1    ?

During rename, steer dependent instructions to the same cluster
However, we do not know about converging chains (possible workarounds: traces, compilers)
If the assigned cluster is full, do we stall or go elsewhere? (not clarified in the paper)
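
A minimal sketch of dependence-based cluster steering for the four instructions above. The names `producer_clusters` and `steer_cluster` are hypothetical, and the tie-break (least-loaded cluster, lowest index on a tie) is an assumption added to make the example run; the slide leaves the converging case and the full-cluster case open.

```python
def producer_clusters(srcs, home):
    """Clusters that hold in-flight producers of the source registers."""
    return {home[s] for s in srcs if s in home}


def steer_cluster(dest, srcs, home, load):
    """Steer to a producer's cluster; fall back to the least-loaded cluster.

    The least-loaded tie-break is an assumption, not the paper's stated policy.
    """
    candidates = producer_clusters(srcs, home)
    pool = sorted(candidates) if candidates else sorted(load)
    cluster = min(pool, key=lambda c: load[c])
    home[dest] = cluster      # remember where this value will be produced
    load[cluster] += 1        # crude occupancy count
    return cluster


home, load = {}, {1: 0, 2: 0}
placed = [steer_cluster(d, s, home, load) for d, s in [
    ("r1", ["r2", "r3"]), ("r4", ["r1", "r2"]), ("r5", ["r6", "r7"])]]
# r8 <- r5, r1 has producers in both clusters: a converging chain.
converging = producer_clusters(["r5", "r1"], home)
```

With these assumptions `placed` is `[1, 1, 2]`, matching cl-1, cl-1, cl-2 on the slide, while `converging` is `{1, 2}`: dependence information alone does not pick a cluster for r8.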

20
Load Imbalance

All instructions in 1 cluster → zero communication, but zero utilization of other resources
Six ready instructions in cl-1 and two in cl-2 → more contention and wasted issue slots
The number of ready instructions in each cluster should be equal; however, instruction readiness happens long after instruction steering

21
Load Imbalance Metrics

Metrics
Instrs in each cluster
Unissued instrs that could have issued elsewhere (note the latency between steer and issue)
The second metric does not help much
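
The two metrics can be illustrated with the six-versus-two example from the Load Imbalance slide. This is a sketch only: the issue width of 4 per cluster is an assumption (an 8-way machine split in two), the function names are hypothetical, and ready-instruction counts stand in for per-cluster instruction counts in the first metric.

```python
def one_cycle(ready, width):
    """Issue up to `width` ready instructions per cluster for one cycle."""
    issued = {c: min(n, width) for c, n in ready.items()}
    stalled = {c: ready[c] - issued[c] for c in ready}   # ready but not issued
    spare = {c: width - issued[c] for c in ready}        # wasted issue slots
    return issued, stalled, spare


def count_imbalance(counts):
    """Metric 1 (sketch): spread between the fullest and emptiest cluster."""
    return max(counts.values()) - min(counts.values())


def issuable_elsewhere(stalled, spare):
    """Metric 2 (sketch, two clusters): stalled instrs another cluster had slots for."""
    total = 0
    for c, n in stalled.items():
        other_spare = sum(s for o, s in spare.items() if o != c)
        total += min(n, other_spare)
    return total


ready = {"cl-1": 6, "cl-2": 2}
issued, stalled, spare = one_cycle(ready, width=4)   # width=4 is an assumption
wasted_slots = sum(spare.values())
metric1 = count_imbalance(ready)
metric2 = issuable_elsewhere(stalled, spare)
```

Under these assumptions cl-1 issues 4 and leaves 2 ready instructions waiting, while cl-2 issues 2 and wastes 2 slots, so `metric2` is 2 and `metric1` is 4. The second metric is measured at issue time, well after the steering decision it is meant to guide.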

22
Instruction Assignment
[Figure: Reg-rename and Instr steer stage feeding two clusters, each with an IQ, a Regfile, and functional units (F); 40 regs in each cluster]
Before renaming:
r1 ← r2, r3
r4 ← r1, r2
r5 ← r6, r7
r8 ← r1, r5
After renaming:
p21 ← p2, p3
p22 ← p21, p2
p42 ← p21
p41 ← p56, p57
p43 ← p42, p41
r1 is mapped to both p21 and p42; this will influence steering and instr commit
On average, only 8 replicated regs
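
The renaming example can be reproduced with a small sketch of a rename table that keeps one mapping per cluster and inserts a copy when a source value lives only in the other cluster. `ClusteredRenamer`, its free lists, and the initial mappings for r2, r3, r6, and r7 are assumptions chosen to match the physical register numbers on the slide; register freeing at commit is not modeled.

```python
class ClusteredRenamer:
    """Hypothetical rename table: logical reg -> {cluster: physical reg}."""

    def __init__(self, initial, free):
        self.table = {r: dict(m) for r, m in initial.items()}
        self.free = {c: list(regs) for c, regs in free.items()}

    def _source(self, reg, cluster, out):
        mapping = self.table[reg]
        if cluster not in mapping:
            # Value exists only remotely: allocate a local replica, emit a copy.
            remote = next(iter(mapping.values()))
            local = self.free[cluster].pop(0)
            out.append((local, [remote]))
            mapping[cluster] = local
        return mapping[cluster]

    def rename(self, dest, srcs, cluster):
        """Return the renamed ops (copies first) as (phys_dest, phys_srcs)."""
        out = []
        phys_srcs = [self._source(s, cluster, out) for s in srcs]
        phys_dest = self.free[cluster].pop(0)
        self.table[dest] = {cluster: phys_dest}  # a new value drops old replicas
        out.append((phys_dest, phys_srcs))
        return out


renamer = ClusteredRenamer(
    initial={"r2": {1: "p2"}, "r3": {1: "p3"},
             "r6": {2: "p56"}, "r7": {2: "p57"}},
    free={1: ["p21", "p22"], 2: ["p41", "p42", "p43"]})
trace = []
for dest, srcs, cluster in [("r1", ["r2", "r3"], 1), ("r4", ["r1", "r2"], 1),
                            ("r5", ["r6", "r7"], 2), ("r8", ["r1", "r5"], 2)]:
    trace += renamer.rename(dest, srcs, cluster)
```

With the cluster choices above, `trace` contains the same five operations as the slide, including the copy `p42 ← p21`, and `renamer.table["r1"]` ends up as `{1: "p21", 2: "p42"}`, which is the replicated mapping the slide points out.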
23
Assignment by the Compiler