Low-Complexity Reorder Buffer Architecture*

Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York – PowerPoint PPT presentation


Learn more at: https://www.cs.binghamton.edu


1
Low-Complexity Reorder Buffer Architecture*
Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/lowpower
16th Annual ACM International Conference on Supercomputing (ICS'02), June 24th, 2002
*supported in part by DARPA through the PAC-C program and NSF
2
Outline

ROB complexities
Motivation for the low-complexity ROB
Low-complexity ROB design
Results
Concluding remarks

3
What This Work is All About

Complex, richly-ported ROBs are common in modern
superscalar datapaths
The number of ports grows further when results are
held within ROB slots (example: Pentium III)
ROB complexity reduction is important for
reducing power and improving performance
The ROB dissipates a non-trivial fraction of the
total chip power
ROB accesses stretch over several cycles
Goal of this work: reduce the complexity and
power dissipation of the ROB without sacrificing
performance

4
Pentium III-like Superscalar Datapath
[Figure: datapath block diagram. Blocks shown: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), LSQ, D-cache, ROB, ARF (Architectural Register File), and the result/status forwarding buses.]
5
ROB Port Requirements for a W-way CPU
Writeback: W write ports to write results
Decode/Dispatch: W write ports to set up entries
Dispatch/Issue: 2W read ports to read the source operands
Commit: W read ports for instruction commitment
6
ROB Port Requirements for a W-way CPU
Writeback: W write ports to write results
Decode/Dispatch: 1 W-wide write port to set up entries
Dispatch/Issue: 2W read ports to read the source operands
Commit: 1 W-wide read port for instruction commitment
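The port breakdown above can be tallied with a small sketch. The function name `rob_ports` and the dictionary keys are illustrative only and do not come from the presentation; the counts follow the slide (W write ports for writeback, one W-wide write port for dispatch, 2W source read ports, one W-wide read port for commit).

```python
def rob_ports(width, source_read_ports=None):
    """Tally ROB ports for a W-way machine, following the slide's breakdown.

    source_read_ports defaults to 2W; pass a smaller number (or 0) to model
    a ROB with fewer (or no) ports for reading source operands.
    """
    if source_read_ports is None:
        source_read_ports = 2 * width
    return {
        "writeback_write": width,          # W write ports to write results
        "dispatch_wide_write": 1,          # one W-wide write port to set up entries
        "source_read": source_read_ports,  # read ports for source operands
        "commit_wide_read": 1,             # one W-wide read port for commitment
    }


baseline = rob_ports(4)
no_source_reads = rob_ports(4, source_read_ports=0)
print(sum(baseline.values()), sum(no_source_reads.values()))  # 14 6
```

For a 4-way machine the source-operand read ports account for 8 of the 14 ports in this tally, which is why they are the natural target for reduction.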
7
Where are the Source Values Coming From?
[Figure: the superscalar datapath (Fetch, Decode/Dispatch, IQ, function units, LSQ, D-cache, ROB, ARF, result/status forwarding buses) with the three possible sources of operand values marked 1, 2, and 3.]
8
Where are the Source Values Coming From?
[Figure: breakdown of source operand origins across the three sources: 62%, 32%, and 6%.]
96-entry ROB, 4-way processor, SPEC2K benchmarks
9
How Efficiently are the Ports Used?
Writeback: W write ports to write results
Decode/Dispatch: W write ports to set up entries
Dispatch/Issue: 2W read ports to read the source operands (annotated in the figure: 6%)
Commit: W read ports for instruction commitment
10
Approaches to Reducing ROB Complexity

Reduce the number of read ports for reading out
the source operand values
More radical (and better): completely eliminate
the read ports for reading source operand values!

11
Reducing the Number of Read Ports
[Figure: bar chart of performance drop per SPEC2K benchmark (bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.). Average IPC drop labels: 3.5% and 1.0%.]
12
Problems with Retaining Fewer Source Read Ports
on the ROB

Need arbitration for the small number of ports
Additional logic needed to block the instructions
that could not get a port
Need a switching network to route the operands to
correct destinations
Multi-cycle access still remains in the critical
path of Dispatch/Issue logic

13
Our Solution: Elimination of Read Ports
[Figure: the superscalar datapath with operand sources 1, 2, and 3 marked.]
14
Our Solution: Elimination of Read Ports
[Figure: same datapath diagram as on the previous slide, with operand sources 1, 2, and 3 marked (animation step).]
15
Our Solution: Elimination of Read Ports
[Figure: the same datapath diagram with only operand sources 1 and 3 marked; source 2 no longer appears.]
16
Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.]
Area reduction: 71%; shorter bitlines and wordlines
17
Our Solution: Elimination of Read Ports
[Figure: the superscalar datapath diagram without the numbered operand-source markers.]
Area reduction: 45%
18
Eliminating/Reducing the Number of Read Ports:
Effects on Power Dissipation

Power is reduced because of:
shorter bitlines and wordlines
lower capacitive loading
fewer decoders
fewer drivers and sense amps

19
Completely Eliminating the Source Read Ports on
the ROB

The problem: issue of instructions that require a
value stored in the ROB will stall
Solution: forward the value to the waiting
instruction at the time of committing the value
(LATE FORWARDING)

20
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the superscalar datapath diagram (Fetch, Decode/Dispatch, IQ, function units, LSQ, D-cache, ROB, ARF, result/status forwarding buses).]
21
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: same datapath diagram as on the previous slide (animation step).]
22
Optimizing Late Forwarding

PROBLEM: If Late Forwarding is done for every
result that is committed, additional forwarding
buses are needed in order not to degrade
performance
SOLUTION: Selective Late Forwarding (SLF)
SLF requires an additional bit in the ROB
That bit is set by the dispatched instructions
that require Late Forwarding
No additional forwarding buses are needed, since
SLF traffic is very small
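The SLF mechanism can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' hardware: the names `RobEntry`, `request_late_forwarding`, and `commit` are hypothetical, and the forwarding bus is modeled simply as a flag returned at commit time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    value: Optional[int] = None  # result held in the ROB slot
    slf: bool = False            # the additional SLF bit


def request_late_forwarding(rob, slot):
    """Called by a dispatched instruction that needs the value held in `slot`."""
    rob[slot].slf = True


def commit(rob, slot):
    """Commit one entry; return (value, forward_on_bus).

    The value is driven onto the normal forwarding buses only when a
    dispatched consumer set the SLF bit, so most commits add no bus traffic.
    """
    entry = rob[slot]
    return entry.value, entry.slf


rob = [RobEntry(value=v) for v in (5, 7, 9)]
request_late_forwarding(rob, 1)   # a consumer of slot 1 could not read the value
print(commit(rob, 0))             # (5, False): committed silently
print(commit(rob, 1))             # (7, True): late-forwarded at commit
```

Only the entries whose bit was set generate forwarding traffic at commit, which is the property the slide relies on to avoid extra buses.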

23
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the superscalar datapath diagram with the result/status forwarding buses highlighted.]
Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING
24
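Performance Drop of Simplified ROB
[Figure: bar chart of performance drop per SPEC2K benchmark (bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.). Average IPC drop labels: 9.6%, 3.5%, and 1.0%; two bars exceed the plotted range and are labeled 17% and 37%.]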
25
IPC Penalty: Source Value Not Accessible within the ROB
[Figure: timeline of the lifetime of a result value. Events: result generation, forwarding, late forwarding/commitment. Intervals: value within ROB, then value within ARF.]
26
Improving IPC with No Read Ports

Cache recently generated values in a set of
RETENTION LATCHES (RL)
Retention Latches are SMALL and FAST
Only 8 to 16 latches needed in the set
Entire set has 1 or 2 read ports

27
Datapath with the Retention Latches
[Figure: the superscalar datapath diagram (Fetch, Decode/Dispatch, IQ, function units, LSQ, D-cache, ROB, ARF, result/status forwarding buses).]
28
Datapath with the Retention Latches
[Figure: the same datapath diagram with a RETENTION LATCHES block added.]
29
The Structure of the Retention Latch Set
8 or 16 latches, each holding a result value and status
L-ported CAM field (key: ROB_slot_id)
Inputs: L ROB slot addresses (L = 1 or 2)
Outputs: L recently-written results (L = 1 or 2 works great)
W write ports for writing up to W results in parallel
30
Retention Latch Management Strategies

FIFO
8-entry RL: 42% hit rate
16-entry RL: 55% hit rate
LRU
8-entry RL: 56% hit rate
16-entry RL: 62% hit rate
Random Replacement
Worse performance than FIFO
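The retention latch set and its FIFO and LRU management can be modeled in software as a small associative structure keyed by ROB slot id. The sketch below is an assumption-laden illustration (the class name `RetentionLatches` and its methods are hypothetical, and port limits are not modeled); it only shows how the two policies differ in which entry they evict.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small associative store of recent results, keyed by ROB slot id."""

    def __init__(self, size=8, policy="FIFO"):
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self.entries = OrderedDict()  # ROB slot id -> result value

    def write(self, rob_slot, value):
        """Record a newly generated result."""
        if rob_slot in self.entries:
            del self.entries[rob_slot]         # slot reused: keep the key unique
        elif len(self.entries) >= self.size:
            self.entries.popitem(last=False)   # evict oldest (FIFO) or least recently used (LRU)
        self.entries[rob_slot] = value

    def read(self, rob_slot):
        """Return the cached value, or None on a miss."""
        if rob_slot not in self.entries:
            return None
        if self.policy == "LRU":
            self.entries.move_to_end(rob_slot)  # a hit refreshes recency under LRU
        return self.entries[rob_slot]


fifo = RetentionLatches(size=2, policy="FIFO")
lru = RetentionLatches(size=2, policy="LRU")
for rl in (fifo, lru):
    rl.write(1, 10)
    rl.write(2, 20)
    rl.read(1)        # touch slot 1
    rl.write(3, 30)   # forces an eviction
print(fifo.read(1), lru.read(1))  # None 10
```

Under FIFO the touched entry is still evicted because it was written first; under LRU the read keeps it alive, which matches the intuition behind LRU's higher hit rates on the slide.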

31
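Hit Ratios to Retention Latches
[Figure: bar chart of retention latch hit ratios per SPEC2K benchmark (bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.). Average hit ratio labels: 42%, 55%, 56%, and 62%.]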
32
Accessing Retention Latch Entries

The ROB index is used as a unique key in the
Retention Latches to search for result values
Keys must stay unique in two situations:
Reuse of a ROB slot: not a problem for FIFO; for
LRU, simply flush the RL entry at commit time
Branch mispredictions

33
Handling Branch Mispredictions

Selective RL Flushing: retention latch entries
that are in the mispredicted path are flushed
Uses branch tags
Complicated implementation
Complete RL Flushing: all retention latch entries
are flushed
Very simple implementation
Performance drop is only 1.5% compared to
selective flushing
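The two flushing options can be contrasted in a short sketch. The presentation says selective flushing uses branch tags but does not describe their encoding, so the program-order sequence number used here is purely an assumption for illustration, as are the function names.

```python
def complete_flush(latches):
    """Complete RL flushing: drop every retention latch entry."""
    latches.clear()


def selective_flush(latches, branch_seq):
    """Selective RL flushing: drop only entries on the mispredicted path.

    `latches` maps ROB slot id -> (value, seq), where seq is a hypothetical
    program-order tag; entries younger than the branch are wrong-path.
    """
    wrong_path = [slot for slot, (_, seq) in latches.items() if seq > branch_seq]
    for slot in wrong_path:
        del latches[slot]


latches = {12: (7, 100), 13: (9, 104), 14: (3, 108)}
selective_flush(latches, branch_seq=102)
print(sorted(latches))   # [12]
complete_flush(latches)
print(latches)           # {}
```

Complete flushing needs no per-entry tag at all, which is the simplicity the slide trades against the small IPC loss.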

34
Misprediction Handling Performance
[Figure: bar chart of IPC for the two flushing schemes. Average IPC drop: 1.5%.]
35
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 arch. = 2, Src1 valid = ?, Src1 value = ?, Src2 arch. = 3, Src2 valid = ?, Src2 value = ?
36
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = ?, Src1 value = ?, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
37
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
ROB (ROB/Phys.: Phys. valid, Phys. value): 12: 1, 7
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = ?, Src1 value = ?, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
38
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
ROB (ROB/Phys.: Phys. valid, Phys. value): 12: 1, 7
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = 1, Src1 value = 7, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
39
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
ROB (ROB/Phys.: Phys. valid, Phys. value): 12: 0, ?
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = ?, Src1 value = ?, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
40
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
ROB (ROB/Phys.: Phys. valid, Phys. value): 12: 0, ?
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = 0, Src1 value = ?, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
41
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
ARF (Arch.: Arch. value): 3: 43
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = 1, Src1 value = 7, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
42
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
ARF (Arch.: Arch. value): 3: 43
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = 1, Src1 value = 7, Src2 reg. = 3, Src2 valid = 1, Src2 value = 43
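The traditional operand read walked through in Scenario 1 can be sketched as a lookup function. The function name and the dictionary layout are illustrative assumptions; the values (R2 renamed to ROB slot 12 holding 7, R3 in ARF register 3 holding 43) are the ones shown on the slides, and the slides' ROB=0/ARF=1 flag is modeled as a boolean.

```python
def read_source_traditional(arch_reg, rename, arf, rob):
    """Return (valid, value) for one source operand at dispatch.

    rename maps an architectural register to (index, in_arf).
    """
    index, in_arf = rename[arch_reg]
    if in_arf:
        return True, arf[index]          # committed value: read the ARF
    entry = rob[index]
    if entry["valid"]:
        return True, entry["value"]      # result already in the ROB: read it out
    return False, None                   # not produced yet: wait for forwarding


rename = {2: (12, False), 3: (3, True)}
arf = {3: 43}
rob = {12: {"valid": True, "value": 7}}
print(read_source_traditional(2, rename, arf, rob))  # (True, 7)
print(read_source_traditional(3, rename, arf, rob))  # (True, 43)
```

The second branch is the one that requires the ROB's source-operand read ports, which is the access the simplified design removes.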
43
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 arch. = 2, Src1 valid = ?, Src1 value = ?, Src2 arch. = 3, Src2 valid = ?, Src2 value = ?
44
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = ?, Src1 value = ?, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
45
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
Retention Latches (ROB/Phys.: Phys. value): 12: 7
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = ?, Src1 value = ?, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
46
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
Retention Latches (ROB/Phys.: Phys. value): 12: 7
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = 1, Src1 value = 7, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
47
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
Retention Latches: MISS
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = ?, Src1 value = ?, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
48
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
Retention Latches: MISS
ROB (ROB/Phys.: Phys. valid, Phys. value, SLF): 12: X, X, 0 (X = don't care)
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = 0, Src1 value = ?, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
49
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
Retention Latches: MISS
ROB (ROB/Phys.: Phys. valid, Phys. value, SLF): 12: X, X, 1 (X = don't care)
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = 0, Src1 value = ?, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
50
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
ARF (Arch.: Arch. value): 3: 43
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = 1, Src1 value = 7, Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
51
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Rename Table (Arch.: ROB/Phys., ROB=0/ARF=1): R2: 12, 0; R3: 3, 1
ARF (Arch.: Arch. value): 3: 43
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src1 valid = 1, Src1 value = 7, Src2 reg. = 3, Src2 valid = 1, Src2 value = 43
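The Scenario 2 walkthrough can be sketched in the same way. Again the function name and data layout are hypothetical; the behavior follows the slides: an ARF-resident operand is read from the ARF, a ROB-resident operand is looked up in the retention latches by ROB index, and on a miss the operand is marked invalid and the SLF bit of the producing ROB entry is set (the slides treat the ROB valid bit as a don't-care at this point).

```python
def read_source_simplified(arch_reg, rename, arf, latches, rob):
    """Return (valid, value) for one source operand; the ROB is never read."""
    index, in_arf = rename[arch_reg]
    if in_arf:
        return True, arf[index]       # committed value: read the ARF
    if index in latches:
        return True, latches[index]   # retention latch hit
    rob[index]["slf"] = True          # miss: request selective late forwarding
    return False, None                # operand arrives later on a forwarding bus


rename = {2: (12, False), 3: (3, True)}
arf = {3: 43}
rob = {12: {"slf": False}}

print(read_source_simplified(2, rename, arf, {12: 7}, rob))  # (True, 7), RL hit
print(read_source_simplified(3, rename, arf, {12: 7}, rob))  # (True, 43), from ARF
print(read_source_simplified(2, rename, arf, {}, rob))       # (False, None), RL miss
print(rob[12]["slf"])                                        # True
```

Note that the only write to the ROB on this path is the single SLF bit, so no source-operand read port is exercised.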
52
Experimental Setup: the AccuPower (DATE'02)
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats along with transition counts and context information. VLSI layout data yields a SPICE deck, and SPICE produces measures of energy per transition. The Energy/Power Estimator combines these into power/energy stats.]
53
Configuration of the Simulated System
Machine width: 4-way
Issue Queue: 32 entries
Reorder Buffer: 96 entries
Load/Store Queue: 32 entries
Simulated the execution of SPEC2000 benchmarks
54
Assumed Timings
Timing of the baseline model:
D1: Rename Table lookup for ROB index
D2, D3: source operand read from the ROB
Timing of the simplified ROB:
D1: Rename Table lookup for ROB index
D2: associative lookup of operand from retention latches using ROB index as a key (smaller delay: few latches)
55
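Experimental Results: Effect on Performance
[Figure: bar chart of performance drop per SPEC2K benchmark (bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.). Avg. IPC drop labels: 0.1%, -1.6%, -1.0%, and -2.3%; negative values are IPC gains.]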
56
Experimental Results: Effect on Performance
[Figure: bar chart of performance drop per SPEC2K benchmark (bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.). Avg. IPC drop labels: 3.3%, 1.7%, 2.3%, and 1.0%.]
57
Experimental Results: Effect on Power
[Figure: bar chart of power savings per SPEC2K benchmark (bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.). Avg. savings labels: 30%, 23.4%, 22.2%, 21%, and 20.2%.]
58
Summary of Results

Significantly reduced ROB complexity and power
dissipation
45% area reduction
20% to 30% power reduction across SPEC 2000
benchmarks
Actual IPC improvements
1.6% to 2.3% gain across SPEC benchmarks
IPC gains come from 1-cycle access to the RLs (vs.
the 2 cycles that would be needed for ROB access)

59
Related Work

Value-Aging Buffer (Hu and Martonosi, PACS 2000)
Forwarding Buffer and Clustered Register Cache
(Borch et al., HPCA'02)
Multiple Register Banks (Cruz et al., ISCA'00;
Balasubramonian et al., MICRO'01)
See paper for discussions

60
Conclusions

Typical source operand location statistics can be
successfully exploited to reduce ROB complexity
Significant reduction in ROB area and power: no
ROB ports needed for reading source operands
IPC gains are possible because a small, low-ported
set of Retention Latches supplies cached operand
values in a single cycle
