A Position-Insensitive Finished Store Buffer


1
A Position-Insensitive Finished Store Buffer

Erika Gunadi and Mikko H. Lipasti
Department of Electrical and Computer Engineering
University of Wisconsin-Madison

http://www.ece.wisc.edu/pharm
2
Motivation

As microprocessors get wider and deeper
More in-flight stores
Need a larger store queue
Increases access time and power consumption
Needs SQ access time < D-cache access time
Avoid replay in case of store-to-load forwarding

3
A Brief Store Queue Overview

Serve 2 main purposes
To maintain the order of in-flight stores
To forward store data to later loads
Commonly designed as a circular buffer
Allocate entry on dispatch
Deallocate entry on retirement
Equipped with forwarding logic
CAM structure for address match
Select logic to pick the youngest matching store
that is older than the load

4
Store to Load Forwarding

Each load needs to search the store queue for any
matching older stores
Forwarding logic consists of 3 components
Store Address CAM
Select Logic
Store Data RAM
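
The three forwarding components above can be sketched in software. This is a minimal illustration, not the hardware design: the `StoreEntry` fields, the `age` sequence number, and the `forward` function are hypothetical names introduced here, and a list scan stands in for the parallel CAM match.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    addr: int  # store address (what the CAM compares against)
    data: int  # store data (what the data RAM returns)
    age: int   # hypothetical sequence number; smaller means older


def forward(stores: List[StoreEntry], load_addr: int, load_age: int) -> Optional[int]:
    """Return data from the youngest matching store older than the load, or None."""
    # 1. Store Address CAM: compare the load address against every entry.
    matches = [s for s in stores if s.addr == load_addr]
    # 2. Select logic: keep only stores older than the load, pick the youngest.
    older = [s for s in matches if s.age < load_age]
    if not older:
        return None  # nothing to forward; the load reads the data cache
    winner = max(older, key=lambda s: s.age)
    # 3. Store Data RAM: read out the selected entry's data.
    return winner.data


# Example store queue contents (assumed values, for illustration only).
sq = [
    StoreEntry(addr=0x100, data=11, age=1),
    StoreEntry(addr=0x200, data=22, age=2),
    StoreEntry(addr=0x100, data=33, age=3),
    StoreEntry(addr=0x100, data=44, age=6),
]
```

A load to 0x100 with age 5 would receive the data of the age-3 store: the age-6 store matches the address but is younger than the load, so the select step filters it out.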

5
SQ Access Latency

Major components of latency: CAM and Select
CAM is scalable, Select is not

6
SQ Energy per Access

Major component of energy: CAM

7
Outline

Motivation and Background
Finished Store Buffer (FSB)
Initial Study
Details of Design
Methodology
Results
Conclusion

8
SQ Occupancy Study

Most of the time, fewer than 50% of in-flight stores are
finished and waiting to retire
The number of waiting-to-retire stores does not
scale linearly with the size of the OoO window
FSB sizes of 12, 20, 32, and 52 entries are used for
window sizes of 128, 256, 512, and 1024

9
Finished Store Buffer

The forwarding logic only cares about
waiting-to-retire stores
As shown, these are fewer than 50% of in-flight stores
ROB can be used to track store order
Finished Store Buffer
Much smaller than conventional store queue
Does not maintain positional store ordering

10
FSB Diagram
[Figure: pipeline diagram with stage labels Fetch, Dec, Rnm, Disp, Queue, Read, Exe, WB, Ret, and Sched, with the FSB and the conventional SQ marked]

Allocate FSB entry at schedule
Deallocate FSB entry at retirement
FSB is maintained using a free-list
A store is issued only if there is an available
entry
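
The allocation policy above can be sketched as a small free-list model. This is an illustrative sketch only: the class name `FinishedStoreBuffer` and the methods `try_issue_store` and `retire_store` are hypothetical, and a Python dict stands in for the hardware entry array.

```python
from typing import Dict, List, Optional, Tuple


class FinishedStoreBuffer:
    """Toy FSB model: entries come from a free list, not from a head/tail pointer."""

    def __init__(self, size: int) -> None:
        self.free_list: List[int] = list(range(size))   # indices of unused entries
        self.entries: Dict[int, Tuple[int, int]] = {}   # index -> (addr, data)

    def try_issue_store(self, addr: int, data: int) -> Optional[int]:
        """Allocate an entry at schedule time; return None if the store must wait."""
        if not self.free_list:
            return None  # no available entry, so the store is not issued
        idx = self.free_list.pop()
        self.entries[idx] = (addr, data)
        return idx

    def retire_store(self, idx: int) -> None:
        """Deallocate at retirement and return the entry to the free list."""
        del self.entries[idx]
        self.free_list.append(idx)
```

Because any free entry may be handed out, the index of an entry says nothing about the age of the store it holds, which is why the forwarding logic needs a non-positional way to compare ages.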

11
Forwarding Logic

Load checks the FSB for matching store
FSB position does not reflect relative age
Non-positional select logic
Same problem in a non-compacting scheduler
Solutions: Buyuktosunoglu (SOC 2002), Robery (US
Patent), and Sassone (ISCA 2007)
A solution similar to Buyuktosunoglu's is
used, since it requires the fewest bits
12
Youngest Select Logic
[Figure: worked example of the youngest select logic. FSB entries hold st A with colors 000, 001, 100, and 101; the load ld A carries color 101. 4-input blocks operate on the color bit vectors A2[3:0], A1[3:0], and A0[3:0] and produce a one-hot select signal.]

4-entry FSB, 3-bit color (111 = youngest,
000 = oldest)
Modifications
Add one more bit and simple reverse logic to
handle wrap-around
Restructure the algorithm hierarchically so that
checking happens in parallel
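
One way to picture color-based selection is a bit-serial maximum search over the color bits, from most significant to least significant. The sketch below is an assumption about the general style of logic, not the circuit on the slide: `youngest_select`, `colors`, and `candidates` are hypothetical names, the candidate mask (valid, address match, older than the load) is taken as given, and the extra wrap-around bit is left out.

```python
from typing import List


def youngest_select(colors: List[int], candidates: List[bool], bits: int = 3) -> List[bool]:
    """Return a select vector marking the candidate with the largest color.

    colors[i]     -- color of FSB entry i (larger means younger)
    candidates[i] -- True if entry i is valid, matches the load address,
                     and is older than the load (assumed computed elsewhere)
    """
    alive = list(candidates)
    for b in reversed(range(bits)):  # most significant color bit first
        # Entries still in the running whose color has a 1 at this bit.
        has_one = [a and bool((c >> b) & 1) for a, c in zip(alive, colors)]
        # If any surviving entry has a 1 here, entries with a 0 cannot be youngest.
        if any(has_one):
            alive = has_one
    return alive


# The same four colors placed in two different physical orders.
in_order = youngest_select([0b000, 0b001, 0b100, 0b101], [True] * 4)
shuffled = youngest_select([0b100, 0b000, 0b101, 0b001], [True] * 4)
```

With unique colors the result is one-hot (or all False when there are no candidates), and it depends only on the color values, not on which physical entry holds them.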

13
FSB Corner Cases

Deadlock avoidance
Happens when a store to issue is the oldest in
the window and the FSB is full
Reserves an entry in the FSB for the oldest store
In-order retirement
Keeps the FSB index in the ROB entry and uses it to
index into the FSB at retirement
Branch misprediction
Assigns store color to each branch
Uses it to determine which FSB entries to
invalidate
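
The three corner cases can be illustrated with a toy model. Everything here is a sketch under stated assumptions: the class `ToyFSB` and its methods are hypothetical, colors are plain integers where larger means younger, wrap-around is ignored, and a branch is assumed to record the color of the last store that precedes it.

```python
from typing import Dict, List, Optional


class ToyFSB:
    def __init__(self, size: int) -> None:
        self.free: List[int] = list(range(size))
        self.colors: Dict[int, int] = {}  # FSB index -> store color

    def issue(self, color: int, is_oldest_store: bool = False) -> Optional[int]:
        """Allocate an entry; one entry is held back for the oldest store."""
        # Deadlock avoidance: other stores must leave one entry free, so the
        # oldest store in the window can always issue and later retire.
        needed = 1 if is_oldest_store else 2
        if len(self.free) < needed:
            return None
        idx = self.free.pop()
        self.colors[idx] = color
        return idx  # assumed to be saved in the store's ROB entry

    def retire(self, idx_from_rob: int) -> None:
        """In-order retirement: the ROB supplies the FSB index directly."""
        del self.colors[idx_from_rob]
        self.free.append(idx_from_rob)

    def squash(self, branch_color: int) -> None:
        """Branch misprediction: invalidate entries younger than the branch."""
        doomed = [i for i, c in self.colors.items() if c > branch_color]
        for idx in doomed:
            del self.colors[idx]
            self.free.append(idx)
```

Storing the FSB index in the ROB entry means retirement needs no search: order comes from the ROB, and the FSB is only indexed.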

14
Methodology

SimpleScalar/Alpha 3.0 tool set
Machine configuration
12-stage pipeline, 4-wide machine
128 ROB, 96 PRF
32 LQ, 24 SQ, 32 scheduler
2 integer ALUs, 1 mult/div, 1 memory port
I-Cache: 64KB, direct-mapped, 64B lines, 2-cycle
D-Cache: 64KB, 4-way, 64B lines, 3-cycle
L2: 2MB, 8-way, 128B lines, 8-cycle
Memory: 150-cycle

15
Modeling

To estimate timing and power for the select logic
Implemented in Verilog
Synthesized using Synopsys Design Compiler and
LSI Logic's gflxp 0.11 micron CMOS standard cell
library
To estimate timing and power for RAM and CAM
structures -> CACTI

16
Access Latency Comparison

Due to fewer entries, select logic for FSB is
faster
CAM latency is similar

17
Energy per Access Comparison

Fewer entries -> less CAM power
Subarrays do not reduce energy, only latency

18
IPC Comparison (SPEC INT)

FSB: 12, 20, 32, and 52 entries for the different window sizes
FSB-min: the most aggressive limit
To avoid stalls, only needs 20% x machine width x
issue-to-retire stages
5, 10, 20, and 40 entries for the different window sizes
Both FSB and FSB-min show less than 1% average slowdown

19
IPC Comparison (SPEC FP)

Sixtrack with 1024 ROB experiences 5% slowdown
Retirement stall of unfinished stores
Slowdown is less than 1% with 2 reservation slots
In some cases, FSB slightly outperforms the
baseline IPC
Happens when the store queue size limits
instruction dispatch in the baseline

20
Prior Work

SQIP (Sha, 2005)
Removes the associative search of the SQ
Loads use store sets to predict the index of a
forwarding SQ entry
Misprediction is detected by pre-commit
re-execution and results in a pipeline flush
ULB-LSQ (Sethumadhavan, 2007)
Unordered SQ, allocated at issue time
Similar to our approach
Differs in forwarding policy and overflow
handling

21
Prior Work

Franklin, 1996: ARB in Multiscalar
Sethumadhavan, 2003; Park, 2003: Filtering mechanisms (Bloom filter and store sets) to reduce store queue accesses
Baugh, 2004: Decomposed store queue functionality; only stores in a forwarding group need to be put into the forwarding buffer
Torres, 2005: 2-level SQ; predicted forwarding stores in L1, validation is done in L2
Roth, 2005: SVW; breaks SQ functionality into RSQ and FSQ, validation is done using load re-execution
Sha, 2005; Stone, 2005: SQIP and AIMD; remove the associative search capability from the SQ
Subramanian, 2006; Sha, 2006: FnF and NoSQ; eliminate the whole SQ, load re-execution for validation
Sethumadhavan, 2007: ULB-LSQ; unordered store queue that is allocated at issue time