1
A Position-Insensitive Finished Store Buffer
  • Erika Gunadi and Mikko H. Lipasti
  • Department of Electrical and Computer Engineering
  • University of Wisconsin-Madison

http://www.ece.wisc.edu/pharm
2
Motivation
  • As microprocessors get wider and deeper
  • More in-flight stores
  • Need a larger store queue
  • A larger store queue increases access time and power consumption
  • Need SQ access time < D-cache access time
  • Avoids replay in case of store-to-load forwarding

3
A Brief Store Queue Overview
  • Serves 2 main purposes
  • To maintain the order of in-flight stores
  • To forward store data to later loads
  • Commonly designed as a circular buffer
  • Allocate entry on dispatch
  • Deallocate entry on retirement
  • Equipped with forwarding logic
  • CAM structure for address match
  • Select logic to pick the youngest older matching
    store
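
The allocate-on-dispatch, deallocate-on-retirement behavior of a circular-buffer store queue can be sketched roughly as follows. This is a minimal illustrative model, not the authors' design; the names `CircularStoreQueue`, `dispatch`, `execute`, and `retire` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SQEntry:
    addr: Optional[int] = None  # unknown until the store executes
    data: Optional[int] = None


class CircularStoreQueue:
    """Toy circular-buffer store queue: buffer position encodes store age."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.entries: List[SQEntry] = [SQEntry() for _ in range(size)]
        self.head = 0   # index of the oldest in-flight store
        self.tail = 0   # index of the next slot to allocate
        self.count = 0  # number of allocated entries

    def dispatch(self) -> Optional[int]:
        """Allocate an entry in program order; return None when full."""
        if self.count == self.size:
            return None
        idx = self.tail
        self.entries[idx] = SQEntry()
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return idx

    def execute(self, idx: int, addr: int, data: int) -> None:
        """Record the address and data once the store has executed."""
        self.entries[idx].addr = addr
        self.entries[idx].data = data

    def retire(self) -> SQEntry:
        """Deallocate the oldest store (in-order retirement)."""
        if self.count == 0:
            raise IndexError("store queue is empty")
        entry = self.entries[self.head]
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return entry
```

Because entries are allocated at dispatch, every in-flight store occupies a slot even before its address is known, which is what drives the queue size up as the window grows.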

4
Store to Load Forwarding
  • Each load needs to search the store queue for any
    matching older stores
  • Forwarding logic consists of 3 components
  • Store Address CAM
  • Select Logic
  • Store Data RAM
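
The three components listed above can be modeled as three small functions. This is only a behavioral sketch under stated assumptions: the `ages` sequence numbers are a hypothetical stand-in for the ordering a real store queue derives from entry position, and all function names are illustrative.

```python
from typing import List, Optional


def cam_match(store_addrs: List[Optional[int]], load_addr: int) -> List[bool]:
    """Store Address CAM: compare the load address against every entry."""
    return [a is not None and a == load_addr for a in store_addrs]


def select_youngest_older(
    match: List[bool], ages: List[int], load_age: int
) -> Optional[int]:
    """Select logic: among matching stores older than the load, pick the youngest."""
    best: Optional[int] = None
    for i, hit in enumerate(match):
        if hit and ages[i] < load_age:
            if best is None or ages[i] > ages[best]:
                best = i
    return best


def forward(
    store_addrs: List[Optional[int]],
    store_data: List[int],
    ages: List[int],
    load_addr: int,
    load_age: int,
) -> Optional[int]:
    """Store Data RAM read: return the forwarded value, or None if no store matches."""
    match = cam_match(store_addrs, load_addr)
    sel = select_youngest_older(match, ages, load_age)
    return None if sel is None else store_data[sel]
```

For example, with stores to `0x10` at ages 0 and 2, a load to `0x10` with age 5 receives the data of the age-2 store, while the same load with age 2 receives the data of the age-0 store.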

5
SQ Access Latency
  • Major components of latency: CAM and Select
  • CAM is scalable, Select is not

6
SQ Energy per Access
  • Major component of energy: CAM

7
Outline
  • Motivation and Background
  • Finished Store Buffer (FSB)
  • Initial Study
  • Details of Design
  • Methodology
  • Results
  • Conclusion

8
SQ Occupancy Study
  • Most of the time, < 50% of stores are finished
    and waiting to retire
  • The number of waiting-to-retire stores does not
    scale linearly with the size of the OoO window
  • 12, 20, 32, and 52 are used as the number of
    FSB entries for window sizes of 128, 256, 512,
    and 1024

9
Finished Store Buffer
  • The forwarding logic only cares about
    waiting-to-retire stores
  • As shown, fewer than 50% of in-flight stores
  • ROB can be used to track store order
  • Finished Store Buffer
  • Much smaller than conventional store queue
  • Does not maintain positional store ordering

10
FSB Diagram
[Figure: pipeline stages Fetch, Dec, Rnm, Disp, Queue, Sched, Read, Exe, WB, Ret, with the spans of the FSB and of a conventional SQ marked]
  • Allocate FSB entry at schedule
  • Deallocate FSB entry at retirement
  • FSB is maintained using a free-list
  • A store is issued only if there is an available
    entry
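
A free-list-managed buffer of this kind might look like the sketch below. It assumes a simple list as the free list and a dictionary for the valid entries; the class name `FinishedStoreBuffer` and its methods are hypothetical and not taken from the slides.

```python
from typing import Dict, List, Optional, Tuple


class FinishedStoreBuffer:
    """Toy FSB: entries come from a free list, so position carries no age information."""

    def __init__(self, size: int) -> None:
        self.free_list: List[int] = list(range(size))
        self.entries: Dict[int, Tuple[int, int]] = {}  # index -> (addr, data)

    def can_issue_store(self) -> bool:
        """A store may be issued only if a free entry exists."""
        return bool(self.free_list)

    def allocate(self, addr: int, data: int) -> Optional[int]:
        """Allocate at schedule time; return the entry index, or None if full."""
        if not self.free_list:
            return None
        idx = self.free_list.pop()
        self.entries[idx] = (addr, data)
        return idx

    def retire(self, idx: int) -> Tuple[int, int]:
        """Deallocate at retirement and return the entry to the free list."""
        entry = self.entries.pop(idx)
        self.free_list.append(idx)
        return entry
```

Since any free slot can be handed out, a freed index may be reused by a store that is younger than its neighbors, which is why the select logic cannot rely on position.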

11
Forwarding Logic
  • Load checks the FSB for a matching store
  • FSB position does not reflect relative age
  • Non-positional select logic
  • Same problem in a non-compacting scheduler
  • Solutions: Buyuktosunoglu (SOC 2002), Robery (US
    Patent), and Sassone (ISCA 2007)
  • A solution similar to Buyuktosunoglu's is used
    since it requires the fewest bits

12
Youngest Select Logic
[Figure: youngest select logic example. Four stores to address A with colors 000, 001, 100, and 101 and a load to A with color 101; three 4-input stages produce a one-hot select signal]
  • 4-entry FSB, 3-bit color (111 = youngest,
    000 = oldest)
  • Modification
  • Add one more bit and a simple reverse logic to
    handle wrap around
  • Restructure the algorithm hierarchically,
    checking happens in parallel
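
One way to picture a non-positional youngest select is a bit-by-bit elimination over the color bits, from the most significant bit down. The sketch below is an assumption about the general idea, not the circuit from the slides: it takes a precomputed candidate mask (entries that match the load address and are older than the load), ignores wrap-around, and uses the hypothetical name `youngest_select`.

```python
from typing import List


def youngest_select(colors: List[int], candidates: List[bool], bits: int = 3) -> List[bool]:
    """Return a one-hot mask selecting the candidate with the largest color.

    colors[i] is the age color of entry i (larger = younger).
    candidates[i] is True if entry i is eligible to forward.
    Wrap-around of the color counter is NOT handled in this sketch.
    """
    alive = list(candidates)
    for b in reversed(range(bits)):
        # Entries still alive that have a 1 in this bit position.
        has_one = [alive[i] and ((colors[i] >> b) & 1) == 1 for i in range(len(colors))]
        # If any survivor has a 1 here, all survivors with a 0 are older: drop them.
        if any(has_one):
            alive = has_one
    return alive
```

With unique colors the result has at most one bit set; an all-False result means no store forwards. For colors `000`, `001`, `100`, `101` and the first three entries as candidates, the entry with color `100` is selected.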

13
FSB Corner Cases
  • Deadlock avoidance
  • Happens when a store to issue is the oldest in
    the window and the FSB is full
  • Reserves an entry in the FSB for the oldest store
  • In order retirement
  • Keeps the FSB index in the ROB entry, uses it to
    index into the FSB at retirement
  • Branch misprediction
  • Assigns store color to each branch
  • Uses it to determine which FSB entries to
    invalidate
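
The reservation and invalidation policies above can be sketched together in a few lines. This is a hedged illustration only: it assumes one reserved entry, monotonically increasing integer colors without wrap-around, and a convention in which a branch records the color of the most recent older store, so entries with a larger color are squashed. All names are hypothetical.

```python
from typing import Dict, List, Optional


class ReservingFSB:
    """Toy FSB that keeps its last free entry for the oldest store in the window."""

    def __init__(self, size: int) -> None:
        self.free: List[int] = list(range(size))
        self.colors: Dict[int, int] = {}  # FSB index -> store color

    def can_issue(self, is_oldest_store: bool) -> bool:
        # Only the oldest store may take the final free entry (deadlock avoidance).
        needed = 1 if is_oldest_store else 2
        return len(self.free) >= needed

    def issue(self, color: int, is_oldest_store: bool) -> Optional[int]:
        """Allocate an entry; the returned index would be kept in the ROB entry."""
        if not self.can_issue(is_oldest_store):
            return None
        idx = self.free.pop()
        self.colors[idx] = color
        return idx

    def retire(self, idx: int) -> None:
        """In-order retirement: the ROB supplies the FSB index to free."""
        del self.colors[idx]
        self.free.append(idx)

    def squash_younger_than(self, branch_color: int) -> None:
        """Branch misprediction: invalidate entries younger than the branch."""
        for idx in [i for i, c in self.colors.items() if c > branch_color]:
            self.retire(idx)
```

In this model a younger store is refused when only one entry remains, so the oldest store can always obtain an entry, finish, and retire.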

14
Methodology
  • SimpleScalar/Alpha 3.0 tool set
  • Machine configuration
  • 12-stage pipeline, 4-wide machine
  • 128 ROB, 96 PRF
  • 32 LQ, 24 SQ, 32 scheduler
  • 2 integer ALUs, 1 mult/div, 1 memory port
  • I-Cache: 64KB, DM, 64B, 2-cycle
  • D-Cache: 64KB, 4-way, 64B, 3-cycle
  • L2: 2MB, 8-way, 128B, 8-cycle
  • Memory: 150-cycle
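
For anyone reproducing the setup, the configuration above could be captured in a plain dictionary such as the one below. The key names are hypothetical and do not correspond to actual simulator options; the values are transcribed from the slides, and the FSB sizes per window size come from the occupancy study.

```python
# Baseline machine configuration, transcribed from the slide.
# Key names are illustrative only, not real simulator flags.
MACHINE_CONFIG = {
    "pipeline_stages": 12,
    "machine_width": 4,
    "rob_entries": 128,
    "physical_registers": 96,
    "load_queue_entries": 32,
    "store_queue_entries": 24,
    "scheduler_entries": 32,
    "int_alus": 2,
    "mult_div_units": 1,
    "memory_ports": 1,
    "icache": {"size_kb": 64, "assoc": 1, "line_bytes": 64, "latency_cycles": 2},
    "dcache": {"size_kb": 64, "assoc": 4, "line_bytes": 64, "latency_cycles": 3},
    "l2": {"size_kb": 2048, "assoc": 8, "line_bytes": 128, "latency_cycles": 8},
    "memory_latency_cycles": 150,
}

# FSB entries used for each out-of-order window size (from the occupancy study).
FSB_ENTRIES_BY_WINDOW = {128: 12, 256: 20, 512: 32, 1024: 52}
```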

15
Modeling
  • To estimate timing and power for the select logic
  • Implemented in Verilog
  • Synthesized using Synopsys Design Compiler and
    LSI Logic's gflxp 0.11 micron CMOS standard cell
    library
  • To estimate timing and power for RAM and CAM
    structures -> CACTI

16
Access Latency Comparison
  • Due to fewer entries, select logic for FSB is
    faster
  • CAM latency is similar

17
Energy per Access Comparison
  • Fewer entries -> less CAM power
  • Subarrays do not reduce energy, only latency

18
IPC Comparison (SPEC INT)
  • FSB: 12, 20, 32, 52 entries for the different
    window sizes
  • FSB-min: the most aggressive limit
  • To avoid stalls, only needs 20% × machine-width ×
    issue-to-retire stages
  • 5, 10, 20, and 40 entries for the different
    window sizes
  • Both FSB and FSB-min: less than 1% average slowdown

19
IPC Comparison (SPEC FP)
  • Sixtrack with 1024 ROB experiences 5% slowdown
  • Retirement stall of unfinished stores
  • Slowdown less than 1% with 2 reservation slots
  • In some cases, FSB slightly outperforms the
    baseline IPC
  • Happens when the store queue size limits
    instructions dispatch in the baseline

20
Prior Work
  • SQIP (Sha, 2005)
  • Remove the associative search of SQ
  • Loads use store-set to predict the index of a
    forwarding SQ entry
  • Misprediction is detected by precommit
    re-execution, results in pipeline flush
  • ULB-LSQ (Sethumadhavan, 2007)
  • Unordered SQ, allocated at issue time
  • Similar to our approach
  • Differs in forwarding policy and overflow
    handling

21
Prior Work
  • Franklin, 1996: ARB in Multiscalar
  • Sethumadhavan, 2003; Park, 2003: Filtering
    mechanisms (Bloom filter and store set) to reduce
    store queue accesses
  • Baugh, 2004: Decomposed store queue
    functionality; only stores in a forwarding group
    need to be put into the forwarding buffer
  • Torres, 2005: 2-level SQ; predicted forwarding
    stores in L1, validation is done in L2
  • Roth, 2005: SVW, breaking SQ functionality into
    RSQ and FSQ; validation is done using load
    re-execution
  • Sha, 2005; Stone, 2005: SQIP and AIMD,
    removing the associative search capability from
    the SQ
  • Subramanian, 2006; Sha, 2006: FnF and NoSQ,
    eliminate the whole SQ, load re-execution for
    validation
  • Sethumadhavan, 2007: ULB-LSQ, unordered store
    queue that is allocated at issue time

22
Conclusion
  • FSB, an alternative way to build the SQ
  • Only contains finished stores
  • Much smaller
  • More scalable
  • Minimal IPC impact, < 1%
  • Lower power
  • Possible higher frequency
  • FSB-min, a more aggressive approach
  • Also has minimal IPC impact
  • Future work
  • Load Queue
  • Better deadlock handling

23
Thank you
  • Questions?