1. Macro-op Scheduling: Relaxing Scheduling Loop Constraints
- Ilhyun Kim
- Mikko H. Lipasti
- PHARM Team
- University of Wisconsin-Madison
2. It's all about granularity
- Instruction-centric hardware design
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Instruction granularity may impose constraints on the hardware design space
- Relaxing the constraints at different processing granularities
[Figure: processing-granularity spectrum from finer to coarser: operand granularity (half-price architecture, ISCA '03), instruction granularity (conventional), macro-op granularity (coarser-granular architecture)]
3. Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions and future work
4. Scheduling loop constraints
- Loops in out-of-order execution: the load latency resolution loop, the scheduling loop (wakeup / select), and the exe loop (bypass)
- Scheduling atomicity (wakeup / select within a single cycle)
- Essential for back-to-back instruction execution
- Hard to pipeline in conventional designs
- Poor scalability
- Extractable ILP is a function of window size
- Complexity increases exponentially as the size grows
- Increasing pressure due to deeper pipelining and a slower memory system
5. Related Work
- Scheduling atomicity
- Speculation / pipelining
- Grandparent scheduling [Stark], select-free scheduling [Brown]
- Poor scalability
- Low-complexity scheduling logic
- FIFO-style window [Palacharla, H. Kim]
- Data-flow-based window [Canal, Michaud, Raasch]
- Judicious window scaling
- Segmented windows [Hrishikesh], WIB [Lebeck]
- Issue queue entry sharing
- AMD K7 (MOP), Intel Pentium M (uop fusion)
- → Still based on instruction-centric scheduler designs
- Making a scheduling decision at every instruction boundary
- Overcoming atomicity and scalability in isolation
6. Source of the atomicity constraint
- Minimal execution latency of instructions
- Many ALU operations have single-cycle latency
- The schedule must keep up with execution
- 1-cycle instructions need 1-cycle scheduling
- Multi-cycle operations do not need atomic scheduling
- → Relax the constraint by increasing the size of the scheduling unit
- Combine multiple instructions into one multi-cycle-latency unit
- Scheduling decisions then occur at multiple-instruction boundaries
- Attacks both the atomicity and the scalability constraints
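The latency argument above can be illustrated with a toy timing model of our own (a sketch, not the authors' simulator): a dependent instruction can issue only after the wakeup triggered by its producer's selection, which takes one full pass through the scheduling loop.

```python
# Toy model (our assumption): issue cycle of the last op in a chain of
# dependent single-cycle ALU ops, given a scheduler loop of L cycles.
def chain_issue_time(n_ops, sched_latency):
    """Cycle at which the last op of a dependent chain issues (first op at 0)."""
    # Each consumer issues sched_latency cycles after its producer, since
    # wakeup of dependents follows the producer's select by that much.
    return (n_ops - 1) * sched_latency

# Atomic (1-cycle) scheduling issues dependents back-to-back:
assert chain_issue_time(4, 1) == 3
# A pipelined 2-cycle scheduler inserts a bubble between dependents:
assert chain_issue_time(4, 2) == 6
```

This is why pipelining the scheduler alone hurts single-cycle chains, while multi-cycle operations tolerate it.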
7. Macro-op scheduling overview
[Figure: pipeline diagram, I-cache fetch → decode / rename → MOP formation → issue queue insert → wakeup / select → payload RAM → RF → EXE → MEM → WB → commit. Fetch / decode / rename and RF / EXE / MEM / WB / commit remain instruction-grained; only the scheduling stages between queue insert and dispatch become coarser, MOP-grained. MOP detection observes dependence information and wakeup order information and generates MOP pointers, which MOP formation consumes; sequencing converts MOPs back into original instructions]
8. MOP scheduling (2x) example
[Figure: a 16-instruction dependence graph scheduled two ways. Baseline atomic (select / wakeup in one cycle) scheduling: 9 cycles, 16 queue entries. With pairs fused into macro-ops (MOPs) and pipelined select then wakeup: 10 cycles, 9 queue entries]
- Pipelined instruction scheduling of multi-cycle MOPs
- Still issues the original instructions consecutively
- Larger effective instruction window
- Multiple original instructions logically share a single issue queue entry
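The trade-off in the example above can be sketched with a small timing model of our own (an illustration under our assumptions, not the authors' simulator): grouping two dependent single-cycle ops into one 2-cycle scheduling unit lets a pipelined 2-cycle scheduler still execute the originals consecutively.

```python
# Toy model (our assumption): execution start cycle of each op in a
# dependent chain, for a scheduler of the given latency and MOP size.
def chain_exec_times(n_ops, sched_latency, mop_size=1):
    """Execution start cycle of each op in a fully dependent chain."""
    times = []
    unit_issue = 0
    for i in range(n_ops):
        pos = i % mop_size            # position within the scheduling unit
        if i > 0 and pos == 0:        # a new unit issues after its producer
            unit_issue += max(sched_latency, mop_size)
        times.append(unit_issue + pos)  # tail executes right after head
    return times

# 2-cycle scheduler, no grouping: a bubble between every dependent pair
assert chain_exec_times(4, 2) == [0, 2, 4, 6]
# 2-cycle scheduler with 2x MOPs: originals still execute back-to-back
assert chain_exec_times(4, 2, mop_size=2) == [0, 1, 2, 3]
# atomic 1-cycle scheduler baseline
assert chain_exec_times(4, 1) == [0, 1, 2, 3]
```

The model also shows why a MOP behaves as a non-pipelined 2-cycle operation: the next dependent unit can issue no sooner than two cycles later, which a pipelined 2-cycle scheduler can keep up with.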
9. Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions and future work
10. Issues in grouping instructions
- Candidate instructions
- Single-cycle instructions: integer ALU, control, and store agen operations
- Multi-cycle instructions (e.g. loads) do not need single-cycle scheduling
- The number of source operands
- Grouping two dependent instructions → up to 3 source operands
- Allow up to 2 source operands (conventional) or no restriction (wired-OR)
- MOP size
- Bigger MOP sizes may be more beneficial
- 2 instructions in this study
- MOP formation scope
- Instructions are processed in order before being inserted into the issue queue
- Candidate instructions need to be captured within a reasonable scope
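The grouping rules above can be encoded as a small predicate. This is our own sketch (function names and the operand-counting convention are our assumptions): the tail's dependence on the head is satisfied inside the MOP, so only the remaining distinct registers count against the source-operand limit.

```python
# Candidate filter: only single-cycle ops may be grouped (opcode set is
# illustrative, matching the classes named on the slide).
SINGLE_CYCLE = {"add", "sub", "and", "or", "xor", "branch", "store_agen"}

def is_candidate(opcode):
    return opcode in SINGLE_CYCLE

def groupable(head_srcs, tail_srcs, head_dest, max_srcs=2):
    """True if a head/tail pair fits the external source-operand budget."""
    # The head's result is forwarded inside the MOP, so it is not an
    # external source operand of the combined unit.
    external = set(head_srcs) | (set(tail_srcs) - {head_dest})
    return len(external) <= max_srcs

# add r1 <- r2, r3 ; and r5 <- r1, r2 : externals {r2, r3}, fits 2 ports
assert groupable(["r2", "r3"], ["r1", "r2"], "r1") is True
# a third distinct source needs the wired-OR (3-operand) configuration
assert groupable(["r2", "r3"], ["r1", "r4"], "r1") is False
assert groupable(["r2", "r3"], ["r1", "r4"], "r1", max_srcs=3) is True
```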
11. Dependence edge distance (instruction count)
[Figure: distribution of dependence edge distances per benchmark, as a fraction of total instructions; per-benchmark values range from 27.8% to 56.3%]
- 73% of value-generating candidates (potential MOP heads) have dependent candidate instructions (potential MOP tails)
- An 8-instruction scope captures many dependent pairs
- Variability in distances (e.g. gap vs. vortex) → remember this
- → Our configuration: group 2 single-cycle instructions within an 8-instruction scope
12. MOP detection
- Finds groupable instruction pairs
- Dependence matrix-based detection (detailed in the paper)
- Performance is insensitive to detection latency (pointers are reused repeatedly)
- A pessimistic 100-cycle latency loses 0.22% of IPC
- Generates MOP pointers
- 4 bits per instruction, stored in the IL1
- A MOP pointer represents a groupable instruction pair
13. MOP detection: avoiding cycle conditions
- Cycle condition examples (leading to deadlocks)
- Conservative cycle detection heuristic
- Precise detection is hard (requires multiple levels of dependence tracking)
- Assume a cycle if both outgoing and incoming edges are detected
- Captures over 90% of MOP opportunities (compared to precise detection)
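The conservative heuristic can be sketched as follows (our own rendering, with illustrative names: `pair` is the candidate head/tail set, `deps` the producer/consumer edges seen in the scope). Grouping is simply vetoed whenever edges both leave and enter the candidate pair, since that is when fusing could close a dependence cycle between scheduling units.

```python
def safe_to_group(pair, deps):
    """Conservatively veto grouping if the pair has both an outgoing and
    an incoming dependence edge, rather than tracking cycles precisely.

    pair: set of instruction indices being considered for one MOP
    deps: set of (producer, consumer) instruction-index pairs
    """
    outgoing = any(p in pair and c not in pair for p, c in deps)
    incoming = any(p not in pair and c in pair for p, c in deps)
    return not (outgoing and incoming)

# A pair that only feeds later instructions is safe to group:
assert safe_to_group({0, 1}, {(0, 1), (1, 2)}) is True
# Grouping 0 and 3 when 0 -> 2 -> 3 would create a cycle between the MOP
# and instruction 2; the heuristic pessimistically rejects it:
assert safe_to_group({0, 3}, {(0, 2), (2, 3)}) is False
```

The slide's point is the trade-off: this check can reject some legal pairs, but it still captures over 90% of the opportunities found by precise detection.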
14. MOP formation
- Locates MOP pairs using MOP pointers
- MOP pointers are fetched along with the instructions
- Converts register dependences to MOP dependences
- Architected register IDs → MOP IDs
- Identical to register renaming
- Except that it assigns a single ID to two groupable instructions
- Reflects the fact that the two instructions are grouped into one scheduling unit
- The two instructions are later inserted into one issue queue entry
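The renaming-like translation can be sketched in a few lines (our own illustration; the class and method names are invented). The key behavior is that one MOP ID is allocated per scheduling unit, so both destination registers of a grouped pair map to the same ID and consumers wake up on a single tag.

```python
class MopTranslator:
    """Toy MOP dependence translation table, analogous to a rename table."""

    def __init__(self):
        self.table = {}   # architected dest reg -> MOP ID of its producer
        self.next_id = 0

    def allocate(self, dest_regs):
        """Assign one MOP ID per unit; a grouped pair passes both
        destination registers and therefore shares a single ID."""
        mop_id, self.next_id = self.next_id, self.next_id + 1
        for d in dest_regs:
            self.table[d] = mop_id
        return mop_id

    def mop_deps(self, src_regs):
        """MOP IDs (wakeup tags) a consumer must wait on."""
        return {self.table[s] for s in src_regs if s in self.table}

t = MopTranslator()
m0 = t.allocate(["r1", "r5"])   # MOP: add r1 <- ... ; and r5 <- r1, ...
m1 = t.allocate(["r6"])         # ungrouped: sub r6 <- r5, ...
assert t.mop_deps(["r5", "r6"]) == {m0, m1}
```

Register values themselves are still read by physical register IDs; only the scheduler's dependence tracking moves to MOP granularity.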
15. Scheduling MOPs
- Instructions in a MOP are scheduled as a single unit
- A MOP is a non-pipelined, 2-cycle operation from the scheduler's perspective
- Issued when all source operands are ready; incurs one tag broadcast
- Wakeup / select timings
16. Sequencing instructions
- A MOP is converted back into its two original instructions
- The dual-entry payload RAM sends out the two original instructions
- The original instructions are executed sequentially within 2 cycles
- Register values are accessed using physical register IDs
- The ROB commits the original instructions separately, in order
- MOPs do not affect precise exceptions or branch misprediction recovery
17. Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions and future work
18. Machine parameters
- SimpleScalar-Alpha-based 4-wide OoO, speculative scheduling with selective replay, 14 stages
- Ideally pipelined scheduler: conceptually equivalent to atomic scheduling plus 1 extra stage
- 128-entry ROB, unrestricted / 32-entry issue queue
- 4 ALUs, 2 memory ports, 16K IL1 (2), 16K DL1 (2), 256K L2 (8), memory (100)
- Combined branch prediction, fetch until the first taken branch
- MOP scheduling
- 2-cycle (pipelined) scheduling plus the 2X MOP technique
- 2 (conventional) or 3 (wired-OR) source operands
- MOP detection scope: 2 cycles (4-wide x 2 cycles, up to 8 insts)
- SPEC2000 INT, reduced input sets
- Reference input sets for crafty, eon, gap (up to 3B instructions)
19. Percentage of grouped instructions
[Figure: fraction of instructions grouped per benchmark, for the 2-src and 3-src configurations]
- 28-46% of total instructions are grouped
- 14-23% reduction in the instruction count seen by the scheduler
- Dependent MOP cases enable consecutive issue of dependent instructions
20. MOP scheduling performance (relaxed atomicity constraint only)
- Unrestricted IQ / 128-entry ROB
- Up to 19% IPC loss with 2-cycle scheduling
- MOP scheduling restores performance
- Enables consecutive issue of dependent instructions
- 97.2% of atomic scheduling performance on average
21. Insight into MOP scheduling
- Performance loss of 2-cycle scheduling
- Correlated with dependence edge distance
- Short dependence edges (e.g. gap)
- → The instruction window fills up with chains of dependent instructions
- → The 2-cycle scheduler cannot find enough ready instructions to issue
- MOP scheduling captures short-distance dependent instruction pairs
- They are the important ones
- Low MOP coverage due to long dependence edges does not matter
- The 2-cycle scheduler can find many instructions to issue (e.g. vortex)
- → MOP scheduling complements 2-cycle scheduling
- Overall performance is less sensitive to code layout
22. MOP scheduling performance (relaxed atomicity and scalability constraints)
- 32-entry IQ / 128-entry ROB
- Benefits from both the relaxed atomicity and the relaxed scalability constraints
- → Pipelined 2-cycle MOP scheduling performs comparably to or better than atomic scheduling
23. Conclusions and future work
- Changing the processing granularity can relax the constraints imposed by instruction-centric designs
- Constraints in the instruction scheduling loop
- Scheduling atomicity, poor scalability
- Macro-op scheduling relaxes both constraints at a coarser granularity
- Pipelined, 2-cycle macro-op scheduling can perform comparably to or even better than atomic scheduling
- Potential for narrow-bandwidth microarchitectures
- Extending the MOP idea to the whole pipeline (Disp, RF, bypass)
- e.g. achieving 4-wide machine performance using 2-wide bandwidth
24. Questions?
25. Select-free (Brown et al.) vs. MOP scheduling
- 32-entry IQ / 128-entry ROB, no extra stage for MOP formation
- 4.1% better IPC on average over select-free-scoreboard (best: 8.3%)
- Select-free cannot outperform atomic scheduling
- Select-free scheduling is speculative and requires recovery operations
- MOP scheduling is non-speculative, which brings many advantages
26. MOP detection: MOP pointer generation
- Finding dependent pairs
- Dependence matrix-based detection (detailed in the MICRO paper)
- Insensitive to detection latency (pointers are reused repeatedly)
- A pessimistic 100-cycle latency loses 0.22% of IPC
- Similar to instruction preprocessing in trace cache lines
- MOP pointers (4 bits per instruction): control bit + offset bits
- Control bit (1): captures up to 1 control discontinuity
- Offset bits (3): instruction count from head to tail
- Example MOP pointers:
  0 011  add r1 ← r2, r3
  0 000  lw r4 ← 0(r3)
  1 010  and r5 ← r4, r2
  0 000  bez r1, 0xff (taken)
  0 000  sub r6 ← r5, 1
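The 4-bit pointer layout above packs straightforwardly. This is a minimal sketch under the encoding stated on the slide (function names are ours): bit 3 is the control bit, bits 2..0 the head-to-tail offset.

```python
def encode_pointer(control, offset):
    """Pack a MOP pointer: 1 control bit, 3-bit head-to-tail offset."""
    assert 0 <= offset < 8, "3-bit offset: tail within 7 instructions"
    return ((control & 1) << 3) | offset

def decode_pointer(bits):
    """Unpack a 4-bit MOP pointer into (control, offset)."""
    return (bits >> 3) & 1, bits & 0b111

# 'add r1 <- r2, r3' whose tail is 3 insts ahead, no branch crossed: 0 011
assert encode_pointer(0, 3) == 0b0011
# 'and r5 <- r4, r2' whose tail lies past one taken branch, offset 2: 1 010
assert encode_pointer(1, 2) == 0b1010
assert decode_pointer(0b1010) == (1, 2)
```

An all-zero pointer (0 000) simply means the instruction heads no MOP, matching the `lw`, `bez`, and `sub` rows in the example.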
27. MOP formation: MOP dependence translation
- Assigns a single ID to two MOPable instructions
- Reflecting the fact that the two instructions are grouped into one unit
- The process and the required structure are identical to register renaming
- Register values are still accessed based on the original register IDs
[Figure: a register rename table (logical reg ID → physical reg ID, e.g. p3..p7) shown alongside the MOP translation table (logical reg ID → MOP ID) for instructions I1..I4; a single MOP ID is allocated to two grouped instructions]
28. Inserting MOPs into the issue queue
[Figure: the pipeline diagram from slide 7 (I-cache fetch through commit, with MOP detection, MOP formation, and wakeup / select), highlighting the issue queue insert stage]
- Inserting instructions across different groups
29. Performance considerations
- Independent MOPs
- Group independent instructions that have the same source dependences
- No direct performance benefit, but reduces queue contention
- Last-arriving operands in tail instructions
- Can unnecessarily delay head instructions
- The MOP detection logic filters out harmful groupings
- Creates an alternative pair if one exists
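The last-arriving-operand hazard above comes directly from the issue rule: a MOP issues only when all of its source operands are ready. A one-line toy model of our own makes the cost concrete (names and numbers are illustrative, not measurements from the talk).

```python
def mop_issue_cycle(head_ready, tail_other_ready):
    """A MOP issues only when every external source operand is ready,
    so a late tail operand delays the head as well."""
    return max(head_ready, tail_other_ready)

# Head operands ready at cycle 1, the tail's other operand at cycle 9:
# grouped, the head now issues at cycle 9 instead of cycle 1.
assert mop_issue_cycle(1, 9) == 9
```

This is the case the MOP detection logic tries to filter out, preferring an alternative pairing when one exists.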