Title: Low-Complexity Reorder Buffer Architecture*
1Low-Complexity Reorder Buffer Architecture
Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/lowpower
16th Annual ACM International Conference on Supercomputing (ICS'02), June 24th 2002
Supported in part by DARPA through the PAC-C program and NSF
2Outline
- ROB complexities
- Motivation for the low-complexity ROB
- Low-complexity ROB design
- Results
- Concluding remarks
3What This Work is All About
- Complex, richly-ported ROBs are common in modern superscalar datapaths
- Port requirements are aggravated when results are held within ROB slots (example: Pentium III)
- ROB complexity reduction is important for reducing power and improving performance
- The ROB dissipates a non-trivial fraction of the total chip power
- ROB accesses stretch over several cycles
- Goal of this work: reduce the complexity and power dissipation of the ROB without sacrificing performance
4Pentium III-like Superscalar Datapath
[Figure: datapath block diagram. Blocks: Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, LSQ, function units FU1, FU2, ..., FUm (EX), ROB, ARF (Architectural Register File), D-cache. Labeled connections: instruction dispatch, instruction issue, result/status forwarding buses.]
5ROB Port Requirements for a W-way CPU
- Writeback: W write ports to write results
- Decode/Dispatch: W write ports to set up entries
- Dispatch/Issue: 2W read ports to read the source operands
- Commit: W read ports for instruction commitment
6ROB Port Requirements for a W-way CPU
- Writeback: W write ports to write results
- Decode/Dispatch: 1 W-wide write port to set up entries
- Dispatch/Issue: 2W read ports to read the source operands
- Commit: 1 W-wide read port for instruction commitment
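The two port breakdowns above can be tallied with a small sketch. The function name `rob_ports` and the `wide_ports` flag are illustrative assumptions, not names from the talk; the per-stage counts are taken directly from the two slides.

```python
def rob_ports(width: int, wide_ports: bool = True) -> dict:
    """Count ROB ports for a W-way machine, following the slides' breakdown.

    wide_ports=False: W separate ports for entry setup and for commit.
    wide_ports=True:  one W-wide port for entry setup and one for commit.
    """
    ports = {
        "writeback_write": width,                        # W write ports for results
        "dispatch_setup_write": 1 if wide_ports else width,
        "source_operand_read": 2 * width,                # 2W read ports for sources
        "commit_read": 1 if wide_ports else width,
    }
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    for wide in (False, True):
        print(wide, rob_ports(4, wide_ports=wide))
```

In either variant the 2W source-operand read ports are the largest single group, which is the group the rest of the talk targets.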
7Where are the Source Values Coming From?
[Figure: the datapath of slide 4, with three sources of operand values labeled 1, 2 and 3.]
8Where are the Source Values Coming From?
[Chart: breakdown of source operand values by origin: 62%, 32% and 6%.]
96-entry ROB, 4-way processor, SPEC2K benchmarks
9How Efficiently are the Ports Used?
- Writeback: W write ports to write results
- Decode/Dispatch: W write ports to set up entries
- Dispatch/Issue: 2W read ports to read the source operands
- Commit: W read ports for instruction commitment
Figure annotation: 6%
10Approaches to Reducing ROB Complexity
- Reduce the number of read ports for reading out the source operand values
- More radical (and better): completely eliminate the read ports for reading source operand values!
11Reducing the Number of Read Ports
[Chart: Performance Drop per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Average IPC Drop: 3.5%, 1.0%
12Problems with Retaining Fewer Source Read Ports on the ROB
- Need arbitration for the small number of ports
- Additional logic needed to block the instructions that could not get a port
- Need a switching network to route the operands to the correct destinations
- Multi-cycle access still remains in the critical path of the Dispatch/Issue logic
13Our Solution: Elimination of Read Ports
[Figure: the datapath of slide 4, with the three operand sources labeled 1, 2 and 3.]
14Our Solution: Elimination of Read Ports
[Figure: the same datapath, operand sources 1, 2 and 3 still shown.]
15Our Solution: Elimination of Read Ports
[Figure: the same datapath; the source labeled 2 no longer appears, only 1 and 3 remain.]
16Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.]
Area reduction: 71%
Shorter bitlines and wordlines
17Our Solution: Elimination of Read Ports
[Figure: the datapath of slide 4.]
Area reduction: 45%
18Eliminating/Reducing the Number of Read Ports
Effects on Power Dissipation
- Power is reduced because
- shorter bitlines and wordlines
- lower capacitive loading
- fewer decoders
- fewer drivers and sense amps
19Completely Eliminating the Source Read Ports on the ROB
- The problem: issue of instructions that require a value stored in the ROB will stall
- Solution: LATE FORWARDING, i.e. forward the value to the waiting instruction at the time of committing the value
20Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath of slide 4.]
21Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath of slide 4.]
22Optimizing Late Forwarding
- PROBLEM: if Late Forwarding is done for every result that is committed, additional forwarding buses are needed to avoid degrading performance
- SOLUTION: Selective Late Forwarding (SLF)
- SLF requires an additional bit in each ROB entry
- That bit is set by the dispatched instructions that require Late Forwarding
- No additional forwarding buses are needed, since SLF traffic is very small
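A minimal sketch of the SLF bit, under simplifying assumptions: the ROB is modeled as a dictionary keyed by slot index, the forwarding bus as a list of (ROB index, value) pairs, and the names `RobEntry`, `request_late_forwarding` and `commit` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RobEntry:
    dest_arch: int                # architectural destination register
    value: Optional[int] = None   # result, filled in at writeback
    slf: bool = False             # set when a dispatched consumer needs late forwarding


def request_late_forwarding(rob: Dict[int, RobEntry], index: int) -> None:
    """Called at dispatch by an instruction that could not obtain this operand."""
    rob[index].slf = True


def commit(rob: Dict[int, RobEntry], index: int,
           arf: Dict[int, int], bus: List[Tuple[int, int]]) -> None:
    """Commit one entry: update the ARF, and forward only if the SLF bit is set."""
    entry = rob.pop(index)
    arf[entry.dest_arch] = entry.value
    if entry.slf:
        bus.append((index, entry.value))  # reuse the normal forwarding bus
```

Only entries whose bit was set generate bus traffic at commit, which is why no extra forwarding buses are assumed here.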
23Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath of slide 4.]
Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING
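24Performance Drop of Simplified ROB
[Chart: Performance Drop per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Values labeled on the chart: 17%, 37%.]
Average IPC Drop: 9.6%, 3.5%, 1.0%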
25IPC Penalty: Source Value Not Accessible within the ROB
[Timeline: Lifetime of a Result Value. Along the time axis: Result Generation, Forwarding, Value within ROB, Late Forwarding/Commitment, Value within ARF.]
26Improving IPC with No Read Ports
- Cache recently generated values in a set of RETENTION LATCHES (RL)
- Retention Latches are SMALL and FAST
- Only 8 to 16 latches needed in the set
- Entire set has 1 or 2 read ports
27Datapath with the Retention Latches
[Figure: the datapath of slide 4.]
28Datapath with the Retention Latches
[Figure: the datapath of slide 4 with a RETENTION LATCHES block added.]
29The Structure of the Retention Latch Set
- 8 or 16 latches, with fields for Status and Result Values
- L-ported CAM field (key: ROB_slot_id)
- L ROB slot addresses as lookup keys (L = 1 or 2)
- L recently-written results read out (L = 1 or 2 works great)
- W write ports for writing up to W results in parallel
30Retention Latch Management Strategies
- FIFO
- 8-entry RL: 42% hit rate
- 16-entry RL: 55% hit rate
- LRU
- 8-entry RL: 56% hit rate
- 16-entry RL: 62% hit rate
- Random Replacement
- Worse performance than FIFO
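The difference between the FIFO and LRU strategies can be sketched in software as follows. This is an illustrative model only, not the hardware design: the class name `RetentionLatches` and its methods are assumptions, and the CAM lookup keyed by ROB slot is modeled with an ordered dictionary.

```python
from collections import OrderedDict
from typing import Optional


class RetentionLatches:
    """Small buffer of recent results, keyed by ROB slot id (software model)."""

    def __init__(self, size: int = 8, policy: str = "FIFO") -> None:
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self.entries: "OrderedDict[int, int]" = OrderedDict()  # oldest entry first

    def write(self, rob_slot_id: int, value: int) -> None:
        """Store a newly generated result, evicting the oldest entry if full."""
        if rob_slot_id in self.entries:
            del self.entries[rob_slot_id]
        elif len(self.entries) >= self.size:
            self.entries.popitem(last=False)
        self.entries[rob_slot_id] = value

    def lookup(self, rob_slot_id: int) -> Optional[int]:
        """Return the cached value, or None on a miss."""
        if rob_slot_id not in self.entries:
            return None
        if self.policy == "LRU":
            # A hit refreshes the entry, so it is evicted later than under FIFO.
            self.entries.move_to_end(rob_slot_id)
        return self.entries[rob_slot_id]
```

Under FIFO a hit does not change the eviction order; under LRU it does, which is the only behavioral difference in this sketch.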
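31Hit Ratios to Retention Latches
[Chart: Hit Ratios per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Average Hit Ratio: 42% (8-entry FIFO), 55% (16-entry FIFO), 56% (8-entry LRU), 62% (16-entry LRU)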
32Accessing Retention Latch Entries
- ROB index is used as a unique key in the Retention Latches to search the result values
- Need to maintain unique keys even when we have
- Reuse of a ROB slot
- Not a problem for FIFO
- For LRU, simply flush an RL entry at commit time
- Branch mispredictions
33Handling Branch Mispredictions
- Selective RL Flushing: retention latch entries that are in the mispredicted path are flushed
- Uses branch tags
- Complicated implementation
- Complete RL Flushing: all retention latch entries are flushed
- Very simple implementation
- Performance drop is only 1.5% compared to selective flushing
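The two flushing options can be contrasted with a short sketch. It assumes, for illustration only, that each retention latch entry carries the tag of the branch it depends on, and that the latch set is a dictionary from ROB slot id to a (value, branch tag) pair; the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

# rob_slot_id -> (value, branch_tag); the tag layout is an assumption of this sketch
Latches = Dict[int, Tuple[int, int]]


def flush_selective(latches: Latches, squashed_tags: Set[int]) -> Latches:
    """Keep only the entries that are not on the mispredicted path."""
    return {
        slot: (value, tag)
        for slot, (value, tag) in latches.items()
        if tag not in squashed_tags
    }


def flush_complete(latches: Latches) -> Latches:
    """Drop every entry; no per-entry tag comparison is needed."""
    return {}
```

The selective variant needs a tag comparison per entry, while the complete variant is a single reset, which matches the implementation-complexity contrast drawn above.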
34Misprediction Handling Performance
[Chart: IPC per benchmark.]
Average IPC Drop: 1.5%
35Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 arch. = 2, Src1 valid = ?, Src1 value = ?, Src2 arch. = 3, Src2 valid = ?, Src2 value = ?
36Scenario 1: Traditional Design
Rename Table (Arch. register, ROB/Phys. index, location bit with ROB = 0 and ARF = 1): Arch. 2 maps to index 12 with bit 0 (ROB); Arch. 3 maps to index 3 with bit 1 (ARF)
Simplified IDB entry: Src1 reg. = 2, Src2 reg. = 3, valid bits and values still unknown
37Scenario 1: Traditional Design
ROB lookup for Src1: entry 12 has Phys. valid = 1, Phys. value = 7
38Scenario 1: Traditional Design
Simplified IDB entry updated: Src1 valid = 1, Src1 value = 7
39Scenario 1: Traditional Design
Alternative case: ROB entry 12 has Phys. valid = 0, Phys. value = ?
40Scenario 1: Traditional Design
Simplified IDB entry updated: Src1 valid = 0, Src1 value = ?
41Scenario 1: Traditional Design
ARF lookup for Src2: Arch. 3 holds value 43 (Src1 shown with valid = 1, value = 7)
42Scenario 1: Traditional Design
Simplified IDB entry updated: Src2 valid = 1, Src2 value = 43
43Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Simplified IDB entry: ROB index = 5, Instruction = ADD, Src1 arch. = 2, Src1 valid = ?, Src1 value = ?, Src2 arch. = 3, Src2 valid = ?, Src2 value = ?
44Scenario 2: Simplified ROB with RLs
Rename Table (Arch. register, ROB/Phys. index, location bit with ROB = 0 and ARF = 1): Arch. 2 maps to index 12 with bit 0 (ROB); Arch. 3 maps to index 3 with bit 1 (ARF)
45Scenario 2: Simplified ROB with RLs
Retention Latches lookup for Src1: an entry with key 12 holds value 7 (hit)
46Scenario 2: Simplified ROB with RLs
Simplified IDB entry updated: Src1 valid = 1, Src1 value = 7
47Scenario 2: Simplified ROB with RLs
Alternative case: Retention Latches lookup for key 12 is a MISS
48Scenario 2: Simplified ROB with RLs
Simplified IDB entry updated: Src1 valid = 0, Src1 value = ?
ROB entry 12: Phys. valid = X, Phys. value = X, SLF = 0 (X = don't care)
49Scenario 2: Simplified ROB with RLs
ROB entry 12: SLF bit set to 1 (Phys. valid = X, Phys. value = X; X = don't care)
50Scenario 2: Simplified ROB with RLs
ARF lookup for Src2: Arch. 3 holds value 43 (Src1 shown with valid = 1, value = 7)
51Scenario 2: Simplified ROB with RLs
Simplified IDB entry updated: Src2 valid = 1, Src2 value = 43
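The Scenario 2 walkthrough can be condensed into a sketch of the dispatch-time operand lookup. The function `read_source` and the dictionary-based models of the rename table, ARF, retention latches and SLF bits are illustrative assumptions; the register numbers and values are the ones used on the slides.

```python
from typing import Dict, Optional, Tuple


def read_source(
    arch_reg: int,
    rename_table: Dict[int, Tuple[int, bool]],  # arch reg -> (index, value_is_in_arf)
    arf: Dict[int, int],
    retention_latches: Dict[int, int],          # ROB slot id -> cached result
    slf_bits: Dict[int, bool],                  # ROB slot id -> SLF bit
) -> Tuple[bool, Optional[int]]:
    """Return (valid, value) for one source operand of a dispatching instruction."""
    index, in_arf = rename_table[arch_reg]
    if in_arf:
        return True, arf[index]                 # committed value, read from the ARF
    if index in retention_latches:
        return True, retention_latches[index]   # RL hit: value available at once
    # RL miss: the ROB has no source read ports, so request late forwarding
    # and wait for the value to arrive on the forwarding buses.
    slf_bits[index] = True
    return False, None


if __name__ == "__main__":
    rename = {2: (12, False), 3: (3, True)}     # ADD R1, R2, R3 from the slides
    slf: Dict[int, bool] = {}
    print(read_source(2, rename, {3: 43}, {12: 7}, slf))  # RL hit case
    print(read_source(3, rename, {3: 43}, {12: 7}, slf))  # ARF case
    print(read_source(2, rename, {3: 43}, {}, slf), slf)  # RL miss case
```

As on the slides, the SLF bit is set on a miss without inspecting the ROB entry's valid bit, since the ROB cannot be read at dispatch in this design.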
52Experimental Setup: the AccuPower (DATE'02)
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed the Microarchitectural Simulator (rooted in SimpleScalar), which produces performance stats plus transition counts and context information. VLSI layout data yields a SPICE deck; SPICE produces measures of energy per transition. The Energy/Power Estimator combines the transition counts with the SPICE measures to produce power/energy stats.]
53Configuration of the Simulated System
- Machine width: 4-way
- Issue Queue: 32 entries
- Reorder Buffer: 96 entries
- Load/Store Queue: 32 entries
Simulated the execution of SPEC2000 benchmarks
54Assumed Timings
Timing of the baseline model
- D1: Rename Table lookup for ROB index
- D2, D3: source operand read from the ROB
Timing of the simplified ROB
- D1: Rename Table lookup for ROB index
- D2: associative lookup of operand from retention latches using ROB index as a key (smaller delay, few latches)
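55Experimental Results: Effect on Performance
[Chart: Performance Drop per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Avg. IPC Drop: 0.1%, -1.6%, -1.0%, -2.3% (negative values are IPC gains)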
56Experimental Results: Effect on Performance
[Chart: Performance Drop per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Avg. IPC Drop: 3.3%, 1.7%, 2.3%, 1.0%
57Experimental Results: Effect on Power
[Chart: Power Savings per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Avg. Savings: 30%, 23.4%, 22.2%, 21%, 20.2%
58Summary of Results
- Significantly reduced ROB complexity and power dissipation
- 45% area reduction
- 20% to 30% power reduction across SPEC 2000 benchmarks
- Actual IPC improvements
- 1.6% to 2.3% gain across SPEC benchmarks
- IPC gains come from 1-cycle access to the RL (vs. 2 cycles that would be needed for ROB access)
59Related Work
- Value-Aging Buffer (Hu & Martonosi, PACS 2000)
- Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA'02)
- Multiple Register Banks (Cruz et al., ISCA'00; Balasubramonian et al., MICRO'01)
- See paper for discussions
60Conclusions
- Typical source operand location statistics can be successfully exploited to reduce ROB complexity
- Significant reduction in ROB area and power: no ROB ports needed for reading source operands
- IPC gains are possible because of the use of a small, low-ported Retention Latch set to supply cached operand values in a single cycle