1. Macro-op Scheduling: Relaxing Scheduling Loop Constraints
- Ilhyun Kim
- Mikko H. Lipasti
- PHARM Team
- University of Wisconsin-Madison
2. It's all about granularity
- Instruction-centric hardware design
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Instruction granularity may impose constraints on the hardware design space
- Relaxing the constraints at different processing granularities
[Figure: processing-granularity spectrum from finer to coarser: operand granularity (half-price architecture, ISCA '03), instruction granularity (conventional), macro-op granularity (coarser-granular architecture)]
3. Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions and future work
4. Scheduling loop constraints
- Loops in out-of-order execution: the load latency resolution loop, the scheduling loop (wakeup / select), and the exe loop (bypass)
- Scheduling atomicity (wakeup / select within a single cycle)
- Essential for back-to-back instruction execution
- Hard to pipeline in conventional designs
- Poor scalability
- Extractable ILP is a function of window size
- Complexity increases exponentially as the size grows
- Increasing pressure due to deeper pipelining and a slower memory system
5. Related Work
- Scheduling atomicity
- Speculation / pipelining
- Grandparent scheduling [Stark], select-free scheduling [Brown]
- Poor scalability
- Low-complexity scheduling logic
- FIFO-style window [Palacharla, H. Kim]
- Data-flow-based window [Canal, Michaud, Raasch]
- Judicious window scaling
- Segmented windows [Hrishikesh], WIB [Lebeck]
- Issue queue entry sharing
- AMD K7 (MOP), Intel Pentium M (uop fusion)
- → Still based on instruction-centric scheduler designs
- Making a scheduling decision at every instruction boundary
- Overcoming atomicity and scalability in isolation
6. Source of the atomicity constraint
- Minimal execution latency of instructions
- Many ALU operations have single-cycle latency
- The schedule must keep up with execution
- 1-cycle instructions need 1-cycle scheduling
- Multi-cycle operations do not need atomic scheduling
- → Relax the constraint by increasing the size of the scheduling unit
- Combine multiple instructions into one multi-cycle-latency unit
- Scheduling decisions then occur at multiple-instruction boundaries
- Attacks both the atomicity and the scalability constraints
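The latency argument above can be illustrated with a toy timing model of our own (a sketch, not the authors' simulator): a dependent instruction can issue only after the wakeup triggered by its producer's selection, which takes one full pass through the scheduling loop.

```python
# Toy model (our assumption): issue cycle of the last op in a chain of
# dependent single-cycle ALU ops, given a scheduler loop of L cycles.
def chain_issue_time(n_ops, sched_latency):
    """Cycle at which the last op of a dependent chain issues (first op at 0)."""
    # Each consumer issues sched_latency cycles after its producer, since
    # wakeup of dependents follows the producer's select by that much.
    return (n_ops - 1) * sched_latency

# Atomic (1-cycle) scheduling issues dependents back-to-back:
assert chain_issue_time(4, 1) == 3
# A pipelined 2-cycle scheduler inserts a bubble between dependents:
assert chain_issue_time(4, 2) == 6
```

This is why pipelining the scheduler alone hurts single-cycle chains, while multi-cycle operations tolerate it.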
7. Macro-op scheduling overview
[Figure: pipeline diagram, I-cache fetch → decode / rename → MOP formation → issue queue insert → wakeup / select → payload RAM → RF → EXE → MEM → WB → commit. Fetch / decode / rename and RF / EXE / MEM / WB / commit remain instruction-grained; only the scheduling stages between queue insert and dispatch become coarser, MOP-grained. MOP detection observes dependence information and wakeup order information and generates MOP pointers, which MOP formation consumes; sequencing converts MOPs back into original instructions]
8. MOP scheduling (2x) example
[Figure: a 16-instruction dependence graph scheduled two ways. Baseline atomic (select / wakeup in one cycle) scheduling: 9 cycles, 16 queue entries. With pairs fused into macro-ops (MOPs) and pipelined select then wakeup: 10 cycles, 9 queue entries]
- Pipelined instruction scheduling of multi-cycle MOPs
- Still issues the original instructions consecutively
- Larger effective instruction window
- Multiple original instructions logically share a single issue queue entry
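The trade-off in the example above can be sketched with a small timing model of our own (an illustration under our assumptions, not the authors' simulator): grouping two dependent single-cycle ops into one 2-cycle scheduling unit lets a pipelined 2-cycle scheduler still execute the originals consecutively.

```python
# Toy model (our assumption): execution start cycle of each op in a
# dependent chain, for a scheduler of the given latency and MOP size.
def chain_exec_times(n_ops, sched_latency, mop_size=1):
    """Execution start cycle of each op in a fully dependent chain."""
    times = []
    unit_issue = 0
    for i in range(n_ops):
        pos = i % mop_size            # position within the scheduling unit
        if i > 0 and pos == 0:        # a new unit issues after its producer
            unit_issue += max(sched_latency, mop_size)
        times.append(unit_issue + pos)  # tail executes right after head
    return times

# 2-cycle scheduler, no grouping: a bubble between every dependent pair
assert chain_exec_times(4, 2) == [0, 2, 4, 6]
# 2-cycle scheduler with 2x MOPs: originals still execute back-to-back
assert chain_exec_times(4, 2, mop_size=2) == [0, 1, 2, 3]
# atomic 1-cycle scheduler baseline
assert chain_exec_times(4, 1) == [0, 1, 2, 3]
```

The model also shows why a MOP behaves as a non-pipelined 2-cycle operation: the next dependent unit can issue no sooner than two cycles later, which a pipelined 2-cycle scheduler can keep up with.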
9. Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions and future work
10. Issues in grouping instructions
- Candidate instructions
- Single-cycle instructions: integer ALU, control, and store agen operations
- Multi-cycle instructions (e.g. loads) do not need single-cycle scheduling
- The number of source operands
- Grouping two dependent instructions → up to 3 source operands
- Allow up to 2 source operands (conventional) or no restriction (wired-OR)
- MOP size
- Bigger MOP sizes may be more beneficial
- 2 instructions in this study
- MOP formation scope
- Instructions are processed in order before being inserted into the issue queue
- Candidate instructions need to be captured within a reasonable scope
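The grouping rules above can be encoded as a small predicate. This is our own sketch (function names and the operand-counting convention are our assumptions): the tail's dependence on the head is satisfied inside the MOP, so only the remaining distinct registers count against the source-operand limit.

```python
# Candidate filter: only single-cycle ops may be grouped (opcode set is
# illustrative, matching the classes named on the slide).
SINGLE_CYCLE = {"add", "sub", "and", "or", "xor", "branch", "store_agen"}

def is_candidate(opcode):
    return opcode in SINGLE_CYCLE

def groupable(head_srcs, tail_srcs, head_dest, max_srcs=2):
    """True if a head/tail pair fits the external source-operand budget."""
    # The head's result is forwarded inside the MOP, so it is not an
    # external source operand of the combined unit.
    external = set(head_srcs) | (set(tail_srcs) - {head_dest})
    return len(external) <= max_srcs

# add r1 <- r2, r3 ; and r5 <- r1, r2 : externals {r2, r3}, fits 2 ports
assert groupable(["r2", "r3"], ["r1", "r2"], "r1") is True
# a third distinct source needs the wired-OR (3-operand) configuration
assert groupable(["r2", "r3"], ["r1", "r4"], "r1") is False
assert groupable(["r2", "r3"], ["r1", "r4"], "r1", max_srcs=3) is True
```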
11. Dependence edge distance (instruction count)
[Figure: distribution of dependence edge distances per benchmark, as a fraction of total instructions; per-benchmark values range from 27.8% to 56.3%]
- 73% of value-generating candidates (potential MOP heads) have dependent candidate instructions (potential MOP tails)
- An 8-instruction scope captures many dependent pairs
- Variability in distances (e.g. gap vs. vortex) → remember this
- → Our configuration: group 2 single-cycle instructions within an 8-instruction scope
12. MOP detection
- Finds groupable instruction pairs
- Dependence matrix-based detection (detailed in the paper)
- Performance is insensitive to detection latency (pointers are reused repeatedly)
- A pessimistic 100-cycle latency loses 0.22% of IPC
- Generates MOP pointers
- 4 bits per instruction, stored in the IL1
- A MOP pointer represents a groupable instruction pair
13. MOP detection: avoiding cycle conditions
- Cycle condition examples (leading to deadlocks)
- Conservative cycle detection heuristic
- Precise detection is hard (requires multiple levels of dependence tracking)
- Assume a cycle if both outgoing and incoming edges are detected
- Captures over 90% of MOP opportunities (compared to precise detection)
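The conservative heuristic can be sketched as follows (our own rendering, with illustrative names: `pair` is the candidate head/tail set, `deps` the producer/consumer edges seen in the scope). Grouping is simply vetoed whenever edges both leave and enter the candidate pair, since that is when fusing could close a dependence cycle between scheduling units.

```python
def safe_to_group(pair, deps):
    """Conservatively veto grouping if the pair has both an outgoing and
    an incoming dependence edge, rather than tracking cycles precisely.

    pair: set of instruction indices being considered for one MOP
    deps: set of (producer, consumer) instruction-index pairs
    """
    outgoing = any(p in pair and c not in pair for p, c in deps)
    incoming = any(p not in pair and c in pair for p, c in deps)
    return not (outgoing and incoming)

# A pair that only feeds later instructions is safe to group:
assert safe_to_group({0, 1}, {(0, 1), (1, 2)}) is True
# Grouping 0 and 3 when 0 -> 2 -> 3 would create a cycle between the MOP
# and instruction 2; the heuristic pessimistically rejects it:
assert safe_to_group({0, 3}, {(0, 2), (2, 3)}) is False
```

The slide's point is the trade-off: this check can reject some legal pairs, but it still captures over 90% of the opportunities found by precise detection.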
14. MOP formation
- Locates MOP pairs using MOP pointers
- MOP pointers are fetched along with the instructions
- Converts register dependences to MOP dependences
- Architected register IDs → MOP IDs
- Identical to register renaming
- Except that it assigns a single ID to two groupable instructions
- Reflects the fact that the two instructions are grouped into one scheduling unit
- The two instructions are later inserted into one issue queue entry
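The renaming-like translation can be sketched in a few lines (our own illustration; the class and method names are invented). The key behavior is that one MOP ID is allocated per scheduling unit, so both destination registers of a grouped pair map to the same ID and consumers wake up on a single tag.

```python
class MopTranslator:
    """Toy MOP dependence translation table, analogous to a rename table."""

    def __init__(self):
        self.table = {}   # architected dest reg -> MOP ID of its producer
        self.next_id = 0

    def allocate(self, dest_regs):
        """Assign one MOP ID per unit; a grouped pair passes both
        destination registers and therefore shares a single ID."""
        mop_id, self.next_id = self.next_id, self.next_id + 1
        for d in dest_regs:
            self.table[d] = mop_id
        return mop_id

    def mop_deps(self, src_regs):
        """MOP IDs (wakeup tags) a consumer must wait on."""
        return {self.table[s] for s in src_regs if s in self.table}

t = MopTranslator()
m0 = t.allocate(["r1", "r5"])   # MOP: add r1 <- ... ; and r5 <- r1, ...
m1 = t.allocate(["r6"])         # ungrouped: sub r6 <- r5, ...
assert t.mop_deps(["r5", "r6"]) == {m0, m1}
```

Register values themselves are still read by physical register IDs; only the scheduler's dependence tracking moves to MOP granularity.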
15. Scheduling MOPs
- Instructions in a MOP are scheduled as a single unit
- A MOP is a non-pipelined, 2-cycle operation from the scheduler's perspective
- Issued when all source operands are ready; incurs one tag broadcast
- Wakeup / select timings
16. Sequencing instructions
- A MOP is converted back into its two original instructions
- The dual-entry payload RAM sends out the two original instructions
- The original instructions are executed sequentially within 2 cycles
- Register values are accessed using physical register IDs
- The ROB commits the original instructions separately, in order
- MOPs do not affect precise exceptions or branch misprediction recovery
17. Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions and future work
18. Machine parameters
- SimpleScalar-Alpha-based 4-wide OoO, speculative scheduling with selective replay, 14 stages
- Ideally pipelined scheduler: conceptually equivalent to atomic scheduling plus 1 extra stage
- 128-entry ROB, unrestricted / 32-entry issue queue
- 4 ALUs, 2 memory ports, 16K IL1 (2), 16K DL1 (2), 256K L2 (8), memory (100)
- Combined branch prediction, fetch until the first taken branch
- MOP scheduling
- 2-cycle (pipelined) scheduling plus the 2X MOP technique
- 2 (conventional) or 3 (wired-OR) source operands
- MOP detection scope: 2 cycles (4-wide x 2 cycles, up to 8 insts)
- SPEC2000 INT, reduced input sets
- Reference input sets for crafty, eon, gap (up to 3B instructions)
19. Percentage of grouped instructions
[Figure: fraction of instructions grouped per benchmark, for the 2-src and 3-src configurations]
- 28-46% of total instructions are grouped
- 14-23% reduction in the instruction count seen by the scheduler
- Dependent MOP cases enable consecutive issue of dependent instructions
20. MOP scheduling performance (relaxed atomicity constraint only)
- Unrestricted IQ / 128-entry ROB
- Up to 19% IPC loss with 2-cycle scheduling
- MOP scheduling restores performance
- Enables consecutive issue of dependent instructions
- 97.2% of atomic scheduling performance on average
21. Insight into MOP scheduling
- Performance loss of 2-cycle scheduling
- Correlated with dependence edge distance
- Short dependence edges (e.g. gap)
- → The instruction window fills up with chains of dependent instructions
- → The 2-cycle scheduler cannot find enough ready instructions to issue
- MOP scheduling captures short-distance dependent instruction pairs
- They are the important ones
- Low MOP coverage due to long dependence edges does not matter
- The 2-cycle scheduler can find many instructions to issue (e.g. vortex)
- → MOP scheduling complements 2-cycle scheduling
- Overall performance is less sensitive to code layout
22. MOP scheduling performance (relaxed atomicity and scalability constraints)
- 32-entry IQ / 128-entry ROB
- Benefits from both the relaxed atomicity and the relaxed scalability constraints
- → Pipelined 2-cycle MOP scheduling performs comparably to or better than atomic scheduling
23. Conclusions and future work
- Changing the processing granularity can relax the constraints imposed by instruction-centric designs
- Constraints in the instruction scheduling loop
- Scheduling atomicity, poor scalability
- Macro-op scheduling relaxes both constraints at a coarser granularity
- Pipelined, 2-cycle macro-op scheduling can perform comparably to or even better than atomic scheduling
- Potential for narrow-bandwidth microarchitectures
- Extending the MOP idea to the whole pipeline (Disp, RF, bypass)
- e.g. achieving 4-wide machine performance using 2-wide bandwidth
24. Questions?
25. Select-free (Brown et al.) vs. MOP scheduling
- 32-entry IQ / 128-entry ROB, no extra stage for MOP formation
- 4.1% better IPC on average over select-free-scoreboard (best: 8.3%)
- Select-free cannot outperform atomic scheduling
- Select-free scheduling is speculative and requires recovery operations
- MOP scheduling is non-speculative, which brings many advantages
26. MOP detection: MOP pointer generation
- Finding dependent pairs
- Dependence matrix-based detection (detailed in the MICRO paper)
- Insensitive to detection latency (pointers are reused repeatedly)
- A pessimistic 100-cycle latency loses 0.22% of IPC
- Similar to instruction preprocessing in trace cache lines
- MOP pointers (4 bits per instruction): control bit + offset bits
- Control bit (1): captures up to 1 control discontinuity
- Offset bits (3): instruction count from head to tail
- Example MOP pointers:
  0 011  add r1 ← r2, r3
  0 000  lw r4 ← 0(r3)
  1 010  and r5 ← r4, r2
  0 000  bez r1, 0xff (taken)
  0 000  sub r6 ← r5, 1
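The 4-bit pointer layout above packs straightforwardly. This is a minimal sketch under the encoding stated on the slide (function names are ours): bit 3 is the control bit, bits 2..0 the head-to-tail offset.

```python
def encode_pointer(control, offset):
    """Pack a MOP pointer: 1 control bit, 3-bit head-to-tail offset."""
    assert 0 <= offset < 8, "3-bit offset: tail within 7 instructions"
    return ((control & 1) << 3) | offset

def decode_pointer(bits):
    """Unpack a 4-bit MOP pointer into (control, offset)."""
    return (bits >> 3) & 1, bits & 0b111

# 'add r1 <- r2, r3' whose tail is 3 insts ahead, no branch crossed: 0 011
assert encode_pointer(0, 3) == 0b0011
# 'and r5 <- r4, r2' whose tail lies past one taken branch, offset 2: 1 010
assert encode_pointer(1, 2) == 0b1010
assert decode_pointer(0b1010) == (1, 2)
```

An all-zero pointer (0 000) simply means the instruction heads no MOP, matching the `lw`, `bez`, and `sub` rows in the example.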
27. MOP formation: MOP dependence translation
- Assigns a single ID to two MOPable instructions
- Reflecting the fact that the two instructions are grouped into one unit
- The process and the required structure are identical to register renaming
- Register values are still accessed based on the original register IDs
[Figure: a register rename table (logical reg ID → physical reg ID, e.g. p3..p7) shown alongside the MOP translation table (logical reg ID → MOP ID) for instructions I1..I4; a single MOP ID is allocated to two grouped instructions]
28. Inserting MOPs into the issue queue
[Figure: the pipeline diagram from slide 7 (I-cache fetch through commit, with MOP detection, MOP formation, and wakeup / select), highlighting the issue queue insert stage]
- Inserting instructions across different groups
29. Performance considerations
- Independent MOPs
- Group independent instructions that have the same source dependences
- No direct performance benefit, but reduces queue contention
- Last-arriving operands in tail instructions
- Can unnecessarily delay head instructions
- The MOP detection logic filters out harmful groupings
- Creates an alternative pair if one exists
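The last-arriving-operand hazard above comes directly from the issue rule: a MOP issues only when all of its source operands are ready. A one-line toy model of our own makes the cost concrete (names and numbers are illustrative, not measurements from the talk).

```python
def mop_issue_cycle(head_ready, tail_other_ready):
    """A MOP issues only when every external source operand is ready,
    so a late tail operand delays the head as well."""
    return max(head_ready, tail_other_ready)

# Head operands ready at cycle 1, the tail's other operand at cycle 9:
# grouped, the head now issues at cycle 9 instead of cycle 1.
assert mop_issue_cycle(1, 9) == 9
```

This is the case the MOP detection logic tries to filter out, preferring an alternative pairing when one exists.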