Title: Exploring Wakeup-Free Instruction Scheduling
1Exploring Wakeup-Free Instruction Scheduling
- Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin
- Microsystems Design Lab
- The Pennsylvania State University
2Outline
- Motivation
- Case study: Cyclone
- Towards high-performance wakeup-free scheduler
- A general model
- Employing pre-check scheme
- A segmented issue queue
- Conclusions and future work
3Superscalar Issue Queue
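Wakeup logic delay: $T_{wakeup} = T_{tagdrive} + T_{tagmatch} + T_{matchOR}$
$T_{tagdrive} = c_0 + (c_1 + c_2 \times IW) \times N + (c_3 + c_4 \times IW + c_5 \times IW^2) \times N^2$
$T_{tagmatch},\ T_{matchOR} = c_0 + c_1 \times IW + c_2 \times IW^2$
(IW: issue width, N: issue queue size; S. Palacharla et al., ISCA24)
[Figure: wakeup logic of an N-entry issue queue (inst0 ... instN-1). Result tags tag1 ... tagIW are compared against the left and right operand tags (opd tagL, opd tagR) of each entry; the match results are ORed to set the ready flags rdyL and rdyR.]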
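The wakeup delay model on this slide can be evaluated numerically to see how the quadratic terms in issue width (IW) and queue size (N) dominate. The sketch below is only an illustration: the coefficient values are hypothetical placeholders (the real constants are technology-dependent and are not given in these slides), and it assumes each delay component has its own set of constants.

```python
def wakeup_delay(iw, n, drive, match, match_or):
    """Total wakeup delay = T_tagdrive + T_tagmatch + T_matchOR.

    drive    : six constants (c0..c5) for the tag drive term
    match    : three constants (c0..c2) for the tag match term
    match_or : three constants (c0..c2) for the match OR term
    """
    d0, d1, d2, d3, d4, d5 = drive
    t_tagdrive = d0 + (d1 + d2 * iw) * n + (d3 + d4 * iw + d5 * iw**2) * n**2
    m0, m1, m2 = match
    t_tagmatch = m0 + m1 * iw + m2 * iw**2
    o0, o1, o2 = match_or
    t_matchor = o0 + o1 * iw + o2 * iw**2
    return t_tagdrive + t_tagmatch + t_matchor


# Hypothetical coefficients, chosen only so the example runs.
DRIVE = (10.0, 0.5, 0.05, 0.001, 0.0005, 0.0001)
MATCH = (20.0, 1.0, 0.1)
MATCH_OR = (15.0, 0.5, 0.05)

if __name__ == "__main__":
    for iw, n in [(4, 32), (4, 64), (8, 64)]:
        print(iw, n, round(wakeup_delay(iw, n, DRIVE, MATCH, MATCH_OR), 2))
```

Whatever the actual constants are, the shape of the model is what matters here: doubling either IW or N grows the tag drive term faster than linearly.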
4Superscalar Issue Queue
Selection logic delay: $T_{selection} = c_0 + c_1 \times \log_4 N$ (S. Palacharla et al., ISCA24)
[Figure: issue queue feeding a tree of selection cells; labels: root cell, from/to other subtrees.]
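The logarithmic selection delay can be read as one delay step per level of a 4-input arbiter tree. A minimal sketch, assuming made-up constants c0 and c1 and a fan-in of 4 (the helper names are hypothetical, not from the slides):

```python
import math


def selection_delay(n, c0, c1):
    """T_selection = c0 + c1 * log4(N), with N the issue queue size."""
    return c0 + c1 * math.log(n, 4)


def arbiter_tree_levels(n, fan_in=4):
    """Number of tree levels needed to arbitrate among n requesters."""
    levels, span = 0, 1
    while span < n:
        span *= fan_in
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (16, 64, 256):
        # c0=1.0 and c1=2.0 are placeholder values, not measured constants.
        print(n, arbiter_tree_levels(n), round(selection_delay(n, 1.0, 2.0), 2))
```

Quadrupling the queue size adds one tree level, which is why selection delay scales much more gently than wakeup delay.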
5Challenges in Dynamic Instruction Scheduling
- Broadcast-based dynamic scheduler
  - Higher complexity
  - Power hungry
  - A major limiter to clock frequency: increasing issue queue size, issue width, wire delay, and fewer logic levels per pipeline stage
- Complexity Effective Issue
  - Speculative wakeup (Stark et al.)
  - Dependency chain based ordering (Canal/Gonzalez ICS 00/01, Michaud/Seznec HPCA01)
  - Segmented issue queue (Raasch et al., ISCA 2002)
- Wakeup-free dynamic scheduler (Ernst et al., ISCA 2003)
  - Lower complexity
  - Lower power consumption
  - Better scalability
  - Have to trade performance loss
6Our Goals
- Explore the predictability of instruction issue latency
- Identify the performance impediments in wakeup-free architectures
- Design high-performance wakeup-free schedulers
7Cyclone: Conflict in the Main Queue
Enforce ordered placement to avoid conflict between instructions with different latencies
[Figure: conflict in the Cyclone main queue; panels: FP benchmarks, Int benchmarks; series label: Order Enforced.]
8Possible Structural Problems
- Instruction promotion/forwarding incurs conflict along the path
- Very limited instruction pool for selection
  - Only entries in column 0 in the main queue can be issued
  - Ready instructions (not in column 0) are delayed due to conflict
- Limited number of issue ports has less tolerance to mispredicted ready instructions
  - Waste issue ports
  - Prevent ready instructions from issuing
  - Compete with newly decoded instructions due to replay
9A General Model: WF-Replay
How to relax the structural constraints?
Collapsing issue queue without promotion.
Conventional random selection logic
Given much wider issue width
Instruction is removed if no replay is needed
[Figure: WF-Replay datapath. Instructions from the decoder pass through Rename / Pre-schedule (with the Timing Table) into the Wakeup-Free Issue Queue, where each entry holds a predicted latency (lat). Selection Logic sends instructions to the FUs; the register file ready bits drive the "replay?" decision; results return from the FUs.]
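The replay decision in this model can be sketched as a single check against the register ready bits at issue time. This is a minimal illustration, not the authors' implementation: the `Entry` type, the `issue_or_replay` helper, and the fixed `replay_lat` retry delay are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    srcs: tuple  # source register ids
    lat: int     # predicted cycles until the operands are ready


def issue_or_replay(entry, ready_bits, replay_lat=1):
    """Return True if the entry really issues and can leave the queue.

    If any source register is not ready, the prediction was wrong:
    the entry stays in the queue and is retried after replay_lat cycles
    (replay_lat is a hypothetical parameter, not taken from the slides).
    """
    if all(ready_bits[r] for r in entry.srcs):
        return True
    entry.lat = replay_lat
    return False


if __name__ == "__main__":
    ready = {1: True, 2: False, 3: True}
    good = Entry("add", (1, 3), 0)
    bad = Entry("mul", (1, 2), 0)
    print(issue_or_replay(good, ready), issue_or_replay(bad, ready), bad.lat)
```

Note that the mispredicted instruction has already consumed an issue port by the time the check fails, which is the cost the later pre-check scheme tries to avoid.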
10Instruction Pre-scheduling
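[Figure: two-stage pre-scheduler (Rename/PSCHED0, PSCHED1). The Register Mapping Table and Timing Table are read for instructions I0 to I3; a max unit per instruction produces its predicted latency (lat0 to lat3); a dep check block drives MUX control; a "reschedule?" signal is shown.]
Adapted from Cyclone, D. Ernst et al., ISCA03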
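The max units in the pre-scheduler suggest that an instruction's predicted issue latency is the latest expected ready time among its source operands. The sketch below illustrates that idea under stated assumptions: the timing table stores absolute cycle numbers (the hardware may instead store countdown values), execution latencies are fixed per opcode, and the function and parameter names are hypothetical.

```python
def preschedule(group, timing_table, now, exec_latency):
    """Predict the issue latency of each instruction in a decode group.

    group        : list of (dest_reg, src_regs, opcode) in program order
    timing_table : dict reg -> cycle at which the register is expected ready
    now          : current cycle
    exec_latency : dict opcode -> assumed execution latency in cycles
    """
    lats = []
    for dest, srcs, op in group:
        # Latest expected ready time over all sources, never in the past.
        ready_at = max([timing_table.get(r, now) for r in srcs] + [now])
        lats.append(ready_at - now)
        if dest is not None:
            # Updating in program order lets later instructions in the same
            # group see this result, standing in for the dependence check.
            timing_table[dest] = ready_at + exec_latency[op]
    return lats


if __name__ == "__main__":
    table = {}
    group = [
        ("r1", ("r0",), "load"),
        ("r2", ("r1", "r0"), "add"),
        ("r3", ("r2", "r1"), "add"),
        ("r4", ("r0", "r0"), "add"),
    ]
    print(preschedule(group, table, now=10, exec_latency={"load": 3, "add": 1}))
```

With a 3-cycle load and 1-cycle adds, the dependent chain gets increasing predicted latencies while the independent last instruction is predicted ready immediately.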
11Latency Triggered Selection
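[Figure: Wakeup-Free Issue Queue in which every entry carries a predicted latency (lat) field that triggers its selection request; requests feed the selection tree (root cell).]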
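Latency-triggered selection can be pictured as a per-entry countdown: an entry raises a selection request once its predicted latency reaches 0, with no tag broadcast involved. A minimal sketch, assuming position-based priority among requesters and a hypothetical `tick_and_select` helper (queue collapsing is not modeled):

```python
def tick_and_select(lats, issue_width):
    """Advance the queue by one cycle.

    lats        : remaining predicted latency per entry (None = empty slot)
    issue_width : maximum number of instructions selected per cycle
    Returns (indices selected this cycle, updated latency list).
    """
    # Entries whose countdown has expired request selection.
    requests = [i for i, lat in enumerate(lats) if lat == 0]
    # Assumed policy: grant by queue position, lowest index first.
    selected = requests[:issue_width]

    updated = []
    for i, lat in enumerate(lats):
        if lat is None or i in selected:
            updated.append(None)     # empty slot, or issued this cycle
        elif lat > 0:
            updated.append(lat - 1)  # keep counting down
        else:
            updated.append(0)        # lost arbitration, keeps requesting
    return selected, updated


if __name__ == "__main__":
    print(tick_and_select([0, 2, 0, 1, 0, None], issue_width=2))
```

In the usage example three entries request in the same cycle but only two ports exist, so the third stays at latency 0 and competes again next cycle.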
12WF-Replay IPC (F4-I8 vs F4-I4)
WF-Replay loses 9.7% performance (IPC) to Base as the issue width reduces to 4 instructions per cycle.
[Figure: IPC results; panels: Issue Width 8, Issue Width 4.]
13Competition at Issue Ports?
[Figure: competition at issue ports; panels: Issue Width 8, Issue Width 4.]
14Precheck to Avoid Competition
- Competition at issue ports may delay ready (predictive) instructions
- Delayed instructions may again compete with instructions dependent on them
- Causing more instructions to be falsely ready or delayed
- Wider issue ports can avoid unnecessary competition at the cost of higher complexity
- Solution: prevent falsely ready instructions from selection by pre-checking register ready bits
15WF-Precheck Scheduler
Precheck register ready bits when predicted latency reaches 0
Selection request is filtered by ry bit
Only issue truly ready instructions
Trade replay for pre-check
[Figure: WF-Precheck scheduler. Instructions from the decoder pass through Rename / Pre-schedule (with the Timing Table) into the Wakeup-Free Issue Queue, where each entry holds a lat field and a ry bit. The Register Ready Bit Register is updated from Mem. Selection Logic handles issuing to the FUs.]
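The pre-check step amounts to gating each selection request with a ready (ry) bit computed from the register ready bits once the predicted latency has expired. The sketch below illustrates that filter only; the dictionary layout of an entry and the `precheck_requests` name are assumptions made for the example.

```python
def precheck_requests(entries, ready_bits):
    """Return indices of entries allowed to request selection this cycle.

    entries    : list of dicts with keys 'lat' (remaining predicted latency)
                 and 'srcs' (source register ids); 'ry' is written here
    ready_bits : dict reg -> True if the register value is available
    """
    granted = []
    for i, entry in enumerate(entries):
        if entry["lat"] != 0:
            continue  # countdown not expired, no pre-check yet
        # ry is set only if every source operand is truly ready.
        entry["ry"] = all(ready_bits[r] for r in entry["srcs"])
        if entry["ry"]:
            granted.append(i)
    return granted


if __name__ == "__main__":
    ready = {"r1": True, "r2": False, "r3": True}
    queue = [
        {"lat": 0, "srcs": ("r1", "r3")},  # predicted ready and truly ready
        {"lat": 0, "srcs": ("r1", "r2")},  # predicted ready, r2 still pending
        {"lat": 2, "srcs": ("r3",)},       # still counting down
    ]
    print(precheck_requests(queue, ready))
```

Compared with the replay model, the falsely ready entry never reaches the selection logic, so it does not take an issue port away from a truly ready instruction.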
16Complexity of Pre-checking
On average, 40.2% of instructions have both source operands ready and 45.4% of instructions have one source operand ready at the pre-schedule stage.
Fewer than 2 pre-check requests are made per cycle.
17Issue Port Competition (F4-I4)
18WF-Precheck IPC (F4-I4)
19Impact of Load Related Predictions
20How about Selection Logic?
Selection logic delay: $T_{selection} = c_0 + c_1 \times \log_4 N$ (S. Palacharla et al., ISCA24)
[Figure: issue queue feeding a tree of selection cells; labels: root cell, from/to other subtrees.]
21WF-Segment Issue Queue
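[Figure: WF-Segment issue queue. Instructions from the decoder pass through Rename / Pre-scheduling (with the Time Table) and Dispatch Routing into queue segments labeled gt4, 3-4, 1-2, and 0; each entry holds a ry bit checked against the Register Ready Bits (updated from FUs and Mem.). A switchback path is shown. Selection Logic with 4 issue ports issues to the FUs.]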
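The segment labels in the figure (gt4, 3-4, 1-2, 0) suggest that dispatch routing places an instruction according to its predicted latency, so that selection only needs to look at the segment for latency 0. The sketch below encodes just that mapping. It is an interpretation of the figure labels, not a description of the actual routing hardware; the `segment_for` name is hypothetical and the switchback path is not modeled.

```python
def segment_for(lat):
    """Map a predicted latency (in cycles) to a segment label.

    The boundaries follow the labels in the figure: gt4, 3-4, 1-2, 0.
    """
    if lat < 0:
        raise ValueError("predicted latency cannot be negative")
    if lat > 4:
        return "gt4"
    if lat >= 3:
        return "3-4"
    if lat >= 1:
        return "1-2"
    return "0"


def route(predicted_lats):
    """Group instruction indices by the segment they would be routed to."""
    segments = {"gt4": [], "3-4": [], "1-2": [], "0": []}
    for idx, lat in enumerate(predicted_lats):
        segments[segment_for(lat)].append(idx)
    return segments


if __name__ == "__main__":
    print(route([0, 3, 4, 7, 1, 0]))
```

Restricting selection to one small segment shrinks the effective N in the selection delay formula, which is presumably where the clock speed benefit comes from.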
22WF-Segment Issue Queue
On average, WF-Segment trades a 3% IPC loss relative to WF-Precheck, and a 5% loss relative to the Base, for optimized selection logic.
23Conclusions
- Explore and identify the performance impediments in wakeup-free scheduling
- High-performance wakeup-free dynamic schedulers
  - WF-Replay eliminates structural constraints
  - WF-Precheck avoids unnecessary competition at issue ports
  - WF-Segment optimizes selection logic for high clock speed
24Future Work
- Routing complexity analysis in WF-Segment scheduler
- Power analysis for wakeup-free schedulers
- Sophisticated pre-scheduler
25Thank You!
26Wire Delay Challenges
- Increasing pipeline depth for high performance
- Clock period (FO4) decreases dramatically
- Cross-chip wire delay will be up to 10 cycles as
technology shrinks
M. S. Hrishikesh et al., ISCA29
Stephen W. Keckler et al., ISSCC03
27Precheck as A Single Stage
28Load/Store Dependence Predictor