1. Relaxing Microarchitectural Design Constraints at Different Processing Granularities
- Ilhyun Kim
- PHARM Team
- University of Wisconsin-Madison
- Advisor: Prof. Mikko H. Lipasti
2. Processing granularity
- The amount of work associated with a process
- e.g. bytes (work) per cache block transfer (process)
- Coarser granularity incurs fewer transfers for a given amount of data
- Finer granularity incurs less wasted work (e.g. when only 1 byte is needed); a toy calculation follows
[Diagram: granularity spectrum from finer (1 byte) to coarser]
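Since the slide reasons about this tradeoff numerically, here is a toy calculation (hypothetical numbers, not from the talk) showing how block size trades transfer count against wasted bytes:

```python
import math

def transfers_and_waste(useful_bytes: int, block_size: int) -> tuple[int, int]:
    """Return (block transfers needed, wasted bytes) for one request."""
    transfers = math.ceil(useful_bytes / block_size)
    wasted = transfers * block_size - useful_bytes
    return transfers, wasted

for block in (8, 32, 128):          # finer -> coarser granularity
    n, waste = transfers_and_waste(useful_bytes=100, block_size=block)
    print(f"block={block:3d}B  transfers={n:2d}  wasted={waste:3d}B")
# Coarser blocks cut the number of transfers (less control overhead) but
# waste more of each transfer (redundant resource), as the slide argues.
```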
3. Resource, Control and Granularity
- Processing granularity (cache block size): finer vs. coarser
- Control (the number of line transfers for the data): more when finer, fewer when coarser
- Resource (data bandwidth per transfer): efficient when finer, redundant when coarser
- → What is the optimal processing granularity?
- Tradeoffs between resource and control
- Non-linearity in the tradeoffs as granularity varies
- Determined by the goals and constraints of your design
- e.g. miss rates vs. latency vs. power
4. Granularity of instruction processing
- Conventional instruction-centric hardware design
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Instruction (or uop) is the unit of execution
- Running a program = executing a series of instructions
- Instruction-granular processing imposes instruction-granular constraints on the hardware design space
- Many hardware parameters are automatically determined by the processing granularity → not much flexibility in the design space
- e.g. the 2x read port configuration in the RF, atomicity of instruction scheduling
- Is the instruction the optimal unit of processing in the pipeline?
5. Relaxing Design Constraints at Different Granularities
- Each pipeline stage has different types of design issues
- Resource-critical (e.g. RF) or control-critical (e.g. scheduler) constraints
- Process instructions at different granularities
- Compensate for the critical design issue (resource / control)
- e.g. resource-critical structure → finer-grained processing
- e.g. control-critical structure → coarser-grained processing
Half-price Architecture (ISCA'03)
Macro-op Scheduling (MICRO'03)
[Diagram: processing granularity spectrum from finer (operand) to coarser (multiple insts), with instruction in between]
6. Outline
- Processing granularity
- Relaxing design constraints at different granularities
- Finer-grained processing
- Half-price architecture: Sequential RF access
- Coarser-grained processing
- Conclusions / Future research
7. Motivations for Finer-grained Processing
- Processors are designed to handle 0-, 1- and 2-source instructions at equal cost
- Satisfy the worst-case requirements of instructions
- No resource arbitration / pipeline stalls in handling source operands
- Simple controls over the instruction and data stream
- Handling source operands requires 2x machine BW
- e.g. 2 read ports / 1 write port per instruction
- Heavily multi-ported structures in many pipeline stages
8. Making the common case faster
- 2 source operands are common
- 18-36% of instructions have 2 source operands
- But structures for 2 source operands are not fully utilized
- Scheduler
- 4-16% of instructions need two wakeups
- Less than 3% of instructions handle 2 wakeups in the same clock cycle
- Register file
- 0.64 read ports per instruction
- Less than 4% of instructions need two register read ports
- → Why not build a pipeline optimized for 1-source instructions?
9. Half-price Architecture
- Restrict the processor's capability to handle 2 source operands
- 0- or 1-source instructions are processed without any restriction
- 2-source instructions may execute more slowly
- → Reduce the HW complexity incurred by 2 source operands
- ½ technique in the scheduler: sequential wakeup
- ½ technique in the RF: sequential register access
[Diagram: the conventional design point provisions for the worst-case instruction format (Opcode, Rdst, Rsrc1, Rsrc2 → more HW); the half-price design point targets the common formats (Opcode, Rdst/Rsrc → less HW)]
10. Two RF read port accesses
- Less than 4% of instructions need 2 read port accesses
- Many 2-source instructions read at least one value off the bypass path
- Detect back-to-back issue to determine if two values are needed from the RF
[Chart: fraction of 2-src insts requiring 2 read ports, 4-wide vs. 8-wide]
11. Sequential RF access
- Non-back-to-back issue → sequential RF access
- Sequential RF access example (dependent chain):
  ADD r1, r2, r3
  SUB r3, r4, r5
  XOR r5, 1, r6
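A minimal sketch of the issue-time decision behind this example, with hypothetical names (the talk gives no pseudocode): a 2-source instruction that is not issued back-to-back with its producers must pull both operands from the RF, so the single read port of its slot is used in two consecutive cycles and the scheduler inserts a bubble.

```python
def rf_port_schedule(num_sources: int, back_to_back: list[bool]) -> list[str]:
    """Return per-cycle actions for one issue slot with a single read port.

    back_to_back[i] is True if source i is produced by an instruction
    issued in the immediately preceding cycle (value comes off bypass).
    """
    from_rf = [i for i in range(num_sources) if not back_to_back[i]]
    if len(from_rf) <= 1:
        return ["read+exec"]                   # 0 or 1 RF read: no penalty
    # two RF reads needed: access the single port sequentially
    return ["read src0 (bubble)", "read src1 + exec"]

print(rf_port_schedule(2, [True, False]))      # -> ['read+exec']
print(rf_port_schedule(2, [False, False]))     # -> sequential access, 1 bubble
```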
12. Machine parameters
- SimpleScalar-Alpha-based, 12-stage, 4/8-wide OoO, speculative scheduling
- Alpha-style squashing scheduling recovery
- 4-wide: 64 RUUs, 32 LSQs, 2 memory ports
- 8-wide: 128 RUUs, 64 LSQs, 4 memory ports
- 64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
- Combined (bimodal + gShare) branch prediction, fetch until the first taken branch
- Sequential RF access
- ½ read-ported RF (1 read port / slot)
- Comparison cases
- Pipelined RF (1 extra RF stage)
- ½ read-ported RF + crossbar (same as sequential RF access)
13. Sequential RF access performance
[Chart: IPC slowdown, 4-wide and 8-wide]
- Seq RF access slowdown is slight: avg 1.1% / 0.7%, worst 2.2%
- ½ read ports + crossbar almost achieves base performance
- crossbar complexity, global RF port arbitration → high control overhead
- → Finer-grained processing in the RF stage can reduce hardware complexity with a minimal performance impact
14. Outline
- Processing granularity
- Relaxing design constraints at different granularities
- Finer-grained processing
- Coarser-grained processing
- Macro-op Scheduling
- Conclusions / Future research
15. Motivations for Coarser-grained Processing
- Loops in out-of-order execution
- Scheduling atomicity (wakeup / select within a single cycle)
- Essential for back-to-back instruction execution
- Hard to pipeline in conventional designs
- Poor scalability
- Extractable ILP is a function of window size
- Complexity increases exponentially as the size grows
- Increasing pressure due to deeper pipelining and slower memory systems
[Diagram: load latency resolution loop, scheduling loop (wakeup / select), execution loop (bypass)]
16. Related Work
- Scheduling atomicity
- Speculation / pipelining
- Grandparent scheduling [Stark], select-free scheduling [Brown]
- Poor scalability
- Low-complexity scheduling logic
- FIFO-style window [Palacharla, H. Kim], data-flow based window [Canal, Michaud, Raasch]
- Judicious window scaling
- Segmented windows [Hrishikesh], WIB [Lebeck]
- Issue queue entry sharing
- AMD K7 (MOP), Intel Pentium M (uop fusion)
- → These approaches overcome atomicity and scalability only in isolation
- Let's step back and see the problem from a different perspective
17. Source of the atomicity constraint
- Minimal execution latency of an instruction
- Many ALU operations have single-cycle latency
- The schedule should keep up with execution
- 1-cycle instructions need 1-cycle scheduling
- Multi-cycle operations do not need atomic scheduling
- → Relax the constraint by increasing the size of the scheduling unit
- Combine multiple instructions into a multi-cycle-latency unit
- Scheduling decisions occur at multiple-instruction boundaries
- Attack both the atomicity and scalability constraints at a coarser granularity
18. Macro-op scheduling overview
[Pipeline diagram: I-cache / Fetch → Rename → MOP formation → issue queue insert → Wakeup / Select → Payload RAM → RF → EXE → MEM → WB / Commit. MOP detection sits off the critical path, consuming dependence and wakeup order information and generating MOP pointers stored alongside the I-cache. Fetch / decode / rename and dispatch remain instruction-grained, scheduling is coarser MOP-grained, and sequencing from the payload RAM onward is instruction-grained again.]
19. MOP scheduling (2x) example
[Diagram: the same 16-instruction dependence graph scheduled two ways. Instruction-grained atomic scheduling (select / wakeup in one cycle) finishes in 9 cycles using 16 queue entries; 2-cycle MOP scheduling (pipelined select, then wakeup) finishes in 10 cycles using 9 queue entries.]
- Pipelined instruction scheduling of multi-cycle MOPs
- Still issues the original instructions consecutively
- Larger instruction window
- Multiple original instructions logically share a single issue queue entry
20. Issues in grouping instructions
- Candidate instructions
- Single-cycle instructions: integer ALU, control, store agen operations
- Multi-cycle instructions (e.g. loads) do not need single-cycle scheduling
- The number of source operands
- Grouping two dependent instructions → up to 3 source operands
- Allow up to 2 source operands (conventional) / no restriction (wired-OR)
- MOP size
- Bigger MOP sizes may be more beneficial
- 2 instructions in this study
- MOP formation scope
- Instructions are processed in order before being inserted into the issue queue
- Candidate instructions need to be captured within a reasonable scope (see the sketch below)
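A hedged sketch of the candidate filter these bullets imply (field names, op classes and the source-count accounting are illustrative, not from the paper):

```python
SINGLE_CYCLE = {"alu", "control", "store_agen"}   # candidate op classes
MAX_SRC = 2     # conventional wakeup width; 3 with wired-OR logic
SCOPE = 8       # MOP formation scope, in instructions

def can_group(head: dict, tail: dict, distance: int) -> bool:
    """head/tail are dicts like {'cls': 'alu', 'nsrc': 2} (assumed format)."""
    if head["cls"] not in SINGLE_CYCLE or tail["cls"] not in SINGLE_CYCLE:
        return False              # multi-cycle ops need no atomic scheduling
    if distance > SCOPE:
        return False              # tail falls outside the detection scope
    # the tail's dependence on the head stays inside the MOP, so it is free
    combined = head["nsrc"] + tail["nsrc"] - 1
    return combined <= MAX_SRC    # 3-src pairs need the wired-OR variant

print(can_group({"cls": "alu", "nsrc": 1},
                {"cls": "alu", "nsrc": 2}, distance=3))   # -> True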
21. Dependence edge distance (instruction count)
[Chart: per-benchmark distribution of dependence edge distances, normalized to total insts; values shown: 49.2, 50.9, 27.8, 48.7, 37.4, 56.3, 40.2, 47.5, 42.7, 47.7, 37.6, 44.7]
- 73% of value-generating candidates (potential MOP heads) have dependent candidate instructions (potential MOP tails)
- An 8-instruction scope captures many dependent pairs
- Variability in distances (e.g. gap vs. vortex)
- → Our configuration: grouping 2 single-cycle instructions within an 8-instruction scope
22. MOP detection
- Finds groupable instruction pairs
- Dependence matrix-based detection (detailed in the paper); a simplified sketch follows
- Performance is insensitive to detection latency (pointers are reused repeatedly)
- A pessimistic 100-cycle latency loses 0.22% of IPC
- Generates MOP pointers
- 4 bits per instruction, stored in the IL1
- A MOP pointer represents a groupable instruction pair
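The matrix-based mechanism is in the MICRO'03 paper; the following is only a plausible first-order sketch of detection over a decoded instruction stream, pairing a candidate head with the first dependent candidate tail inside the 8-instruction scope:

```python
def detect_mops(insts):
    """insts: list of dicts {'dst': reg|None, 'srcs': [regs], 'cand': bool}.
    Returns {head_index: tail_index}; each instruction is used at most once."""
    producer = {}          # reg -> index of its most recent producer
    grouped = set()
    pairs = {}
    for i, inst in enumerate(insts):
        if inst["cand"]:
            for r in inst["srcs"]:
                h = producer.get(r)
                if (h is not None and h not in grouped
                        and insts[h]["cand"] and i - h <= 8):
                    pairs[h] = i           # emit a MOP pointer head -> tail
                    grouped.update((h, i))
                    break
        if inst["dst"] is not None:
            producer[inst["dst"]] = i
    return pairs

insts = [
    {"dst": "r1", "srcs": ["r2", "r3"], "cand": True},   # head
    {"dst": "r4", "srcs": ["r3"],       "cand": False},  # e.g. a load
    {"dst": "r5", "srcs": ["r1"],       "cand": True},   # tail of inst 0
]
print(detect_mops(insts))    # -> {0: 2}
```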
23. MOP detection: avoiding cycle conditions
- Cycle condition examples (leading to deadlocks)
- Conservative cycle detection heuristic
- Precise detection is hard (multiple levels of dependence tracking)
- Assume a cycle if both outgoing and incoming edges are detected
- Captures over 90% of MOP opportunities (compared to precise detection)
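A minimal sketch of that heuristic for the 2-instruction MOP case (data layout assumed): reject a pair whenever the span between head and tail has both an edge leaving the MOP and an edge entering it, since that is how a dependence cycle through the MOP could form.

```python
def may_form_cycle(insts, head: int, tail: int) -> bool:
    """True if grouping insts[head] and insts[tail] might create a cycle."""
    written = {insts[head]["dst"]}
    outgoing = incoming = False
    for mid in range(head + 1, tail):            # instructions in between
        if any(s in written for s in insts[mid]["srcs"]):
            outgoing = True                      # edge MOP -> middle inst
        if insts[mid]["dst"] in insts[tail]["srcs"]:
            incoming = True                      # edge middle inst -> MOP
    return outgoing and incoming                 # conservative: assume a cycle

insts = [
    {"dst": "r1", "srcs": []},        # head
    {"dst": "r2", "srcs": ["r1"]},    # middle: depends on the head ...
    {"dst": "r3", "srcs": ["r2"]},    # tail: ... and feeds the tail
]
print(may_form_cycle(insts, 0, 2))    # -> True: grouping 0 and 2 deadlocks
```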
24. MOP formation
- Locates MOP pairs using MOP pointers
- MOP pointers are fetched along with instructions
- Converts register dependences to MOP dependences
- Architected register IDs → MOP IDs
- Identical to register renaming
- Except that it assigns a single ID to two groupable instructions
- Reflects the fact that two instructions are grouped into one scheduling unit
- The two instructions are later inserted into one issue entry
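A sketch of the renaming analogy, assuming simple dict-based tables: the only change from conventional renaming is that the tail of a pair reuses the head's ID, so the intra-MOP dependence disappears from the scheduler's view.

```python
class MopRename:
    def __init__(self):
        self.next_id = 0
        self.reg_to_mop = {}   # architected reg -> MOP ID of its producer

    def rename(self, inst, is_tail_of_group: bool, head_mop=None):
        # the tail of a MOP reuses the head's ID instead of allocating one
        mop_id = head_mop if is_tail_of_group else self._alloc()
        src_deps = {self.reg_to_mop[r] for r in inst["srcs"]
                    if r in self.reg_to_mop} - {mop_id}  # intra-MOP dep is free
        if inst["dst"] is not None:
            self.reg_to_mop[inst["dst"]] = mop_id
        return mop_id, src_deps

    def _alloc(self):
        self.next_id += 1
        return self.next_id - 1

rn = MopRename()
head_id, _ = rn.rename({"dst": "r1", "srcs": ["r2"]}, False)
tail_id, deps = rn.rename({"dst": "r3", "srcs": ["r1"]}, True, head_mop=head_id)
assert head_id == tail_id and not deps    # one shared ID, dep is internal
```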
25. Scheduling MOPs
- Instructions in a MOP are scheduled as a single unit
- A MOP is a non-pipelined, 2-cycle operation from the scheduler's perspective
- Issued when all source operands are ready; incurs one tag broadcast
- Wakeup / select timings (sketched below)
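One way to picture those timings (the exact cycle accounting here is an assumption, not lifted from the paper): a MOP is selected once, occupies its slot for two cycles, and its single tag broadcast wakes dependents just in time to follow the tail.

```python
def mop_events(issue_cycle: int) -> dict[str, int]:
    """Assumed event timing for one MOP, as the scheduler sees it."""
    return {
        "select":        issue_cycle,        # one select for the whole MOP
        "head_exec":     issue_cycle + 1,
        "tag_broadcast": issue_cycle + 1,    # single broadcast for the MOP
        "tail_exec":     issue_cycle + 2,
        "dep_can_issue": issue_cycle + 2,    # dependent follows the tail
    }

print(mop_events(0))
```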
26. Sequencing instructions
- A MOP is converted back into the two original instructions
- The dual-entry payload RAM sends the two original instructions
- The original instructions are sequentially executed within 2 cycles
- Register values are accessed using physical register IDs
- The ROB separately commits the original instructions in order
- MOPs do not affect precise exceptions or branch misprediction recovery
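A hypothetical sketch of this sequencing step: on issue, the MOP's dual-entry payload RAM line is read and its two original instructions enter execution on consecutive cycles.

```python
def sequence(payload_ram, mop_id: int):
    """Yield (cycle_offset, original instruction) for an issued MOP."""
    head, tail = payload_ram[mop_id]          # dual-entry line
    yield 0, head                             # cycle N:   head executes
    if tail is not None:                      # singleton entries hold no tail
        yield 1, tail                         # cycle N+1: tail executes

payload_ram = {7: ({"op": "add"}, {"op": "sub"})}
for dt, inst in sequence(payload_ram, 7):
    print(dt, inst["op"])
```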
27. Machine parameters
- SimpleScalar-Alpha-based, 4-wide OoO, speculative scheduling w/ selective replay, 14 stages
- Ideally pipelined scheduler
- conceptually equivalent to atomic scheduling + 1 extra stage
- 128 ROB, unrestricted / 32-entry issue queue
- 4 ALUs, 2 memory ports, 16K IL1 (2), 16K DL1 (2), 256K L2 (8), memory (100)
- Combined branch prediction, fetch until the first taken branch
- MOP scheduling
- 2-cycle (pipelined) scheduling + 2X MOP technique
- 2 (conventional) or 3 (wired-OR) source operands
- MOP detection scope: 2 cycles (4-wide × 2 cycles → up to 8 insts)
- SPEC2000 INT, reduced input sets
- Reference input sets for crafty, eon, gap (up to 3B instructions)
28. % grouped instructions
[Chart: fraction of instructions grouped, 2-src vs. 3-src configurations]
- 28-46% of total instructions are grouped
- 14-23% reduction in the instruction count in the scheduler
- MOPs cover 26-63% of value-generating 1-cycle instructions
- potentially issued as if atomic scheduling were performed
29. MOP scheduling performance (relaxed atomicity constraint only)
Unrestricted IQ / 128 ROB
- Up to 19% IPC loss with 2-cycle scheduling
- MOP scheduling restores the performance
- Enables consecutive issue of dependent instructions
- 97.2% of atomic scheduling performance on average
30. Insight into MOP scheduling
- Performance loss of 2-cycle scheduling
- Correlated with dependence edge distance
- Short dependence edges (e.g. gap)
- → the instruction window fills up with chains of dependent instructions
- → a 2-cycle scheduler cannot find enough ready instructions to issue
- MOP scheduling captures short-distance dependent instruction pairs
- They are the important ones
- Low MOP coverage due to long dependence edges does not matter
- a 2-cycle scheduler can find many instructions to issue (e.g. vortex)
- → MOP scheduling complements 2-cycle scheduling
- Overall performance is less sensitive to code layout
31. MOP scheduling performance (relaxed atomicity and scalability constraints)
32-entry IQ / 128 ROB
- Benefits from both the relaxed atomicity and the relaxed scalability constraints at a coarser processing granularity
- → Pipelined 2-cycle MOP scheduling performs comparably to or better than atomic scheduling
32. Conclusions
- Instruction-centric hardware designs impose microarchitectural design constraints
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Tradeoffs at different processing granularities
- Control and resource
- Varying the processing granularity exposes greater opportunities for high-performance, complexity-effective microarchitectures
- Finer-grained processing: Half-price architecture
- Coarser-grained processing: Macro-op scheduling
33. Future research: Revisiting ILP
Goal: keeping resources as busy as possible
- Ways to extract instruction-level parallelism
- OoO execution
- may not be scalable to future processors due to complexity
- VLIW
- Binary compatibility matters
- High overhead of dynamic binary translation
- vulnerable to unexpected dynamic events (distortion in sets of parallel insts)
- Both approaches strip horizontal slices from a program
34. Future research: Exploiting Instruction-level Serialism!
- Finding vertical slices (chains of dependent insts) is easier
- Execution is serial in nature
- Light-weight conversion in HW / run-time binary translation
- Less vulnerable to dynamic events (good for caching prescheduled groups)
- A collection of vertical slices extracts parallelism
- Let the machine find the next vertical slices to issue, at a slower rate
- Increases window size, scheduling slack and bandwidth
[Diagram: instruction-centric OoO vs. coarser-grained parallel execution vs. coarser-grained serial execution; execution BW 4, 2 issue slots, slices issued over 1 or 2 cycles]
35. Applied to MOPs (Macro-op Execution)
36. MOP execution: Performance
- Pipelined scheduling, fewer issue/payload/RF ports, simpler bypass
- Achieves wider execution bandwidth with narrower structures
37. Future research: Parallelism, Granularity and ILS
- Goal: a better implementation ISA
- Light-weight conversion from U-ISA to I-ISA
- Easy to maintain the original sequential program semantics
- Hardware complexity and power consumption
- Move the burden of timing-critical decisions offline
- Good front-end code density, fewer operations to process
- Performance
- Adaptability to run-time environments
- Achieving the maximum extractable ILP
- → Vertically-long instruction word?
- Coarser-granular instruction sets that exploit ILS
- Run-time binary translation / dynamic HW construction
- Granularity and dimension of the instruction word
- Impact on the native ILP
- Underlying HW
38. Thesis Research Contributions (infomercial)
- Speculative Decode (ISCA'02)
- Attacking the problems with value-based dynamic optimizations under speculative scheduling
- Half-price Architecture (ISCA'03)
- Operand-centric microarchitecture design
- Macro-op Scheduling (MICRO'03)
- Coarser-grained instruction scheduling
- Studies on Scheduling Replay Schemes (HPCA'04)
- Addresses deficiencies in the literature
- Scalable selective scheduling replay mechanisms
39. Questions?
40. Macro-op scheduling on x86 (swiped from Shiliang Hu's results)
- x86vm (under development)
- x86 interpreter / functional simulator based on BOCHS 2.0.2
- Cracks x86 into RISC-style ops (proprietary mapping)
- Timing simulator for the detailed microarchitecture is under construction
- x86 → micro-ops → MOPs
- Assumes dynamic binary translation
- Allows grouping of SS and SM (→ needs consideration)
- Within / across x86 instructions
- Does not allow grouping across conditional BR / indirect JMP
- Dependent MOPs only
41. Grouped RISC ops (x86)
[Chart: grouped ops vs. 2-cycle-scheduling-unfriendly ops]
- 57% of operations are grouped → 28% reduction in scheduling units
- leaving less than 5% of operations 2-cycle-scheduling-unfriendly
- Over 95% of MOPs are captured within 3 micro-ops
- 66% are consecutive operations
- 71% of MOPs are created across x86 instructions
- not a reverse process of RISC op cracking
42. MOP detection: MOP pointer generation
- MOP pointers (4 bits per instruction): a control bit and an offset
- Control bit (1)
- captures up to 1 control discontinuity
- Offset bits (3)
- instruction count from head to tail
- Example MOP pointers (control bit, offset bits, instruction):
  0 011  add r1 ← r2, r3
  0 000  lw  r4 ← 0(r3)
  1 010  and r5 ← r4, r2
  0 000  bez r1, 0xff (taken)
  0 000  sub r6 ← r5, 1
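A small encode/decode sketch of this 4-bit format (the bit packing order is an assumption). In the example above, the pointer on `and` decodes to control = 1, offset = 2: its tail `sub` is two instructions later, across the taken branch.

```python
def encode_pointer(crosses_branch: bool, offset: int) -> int:
    """Pack a MOP pointer: 1 control bit + 3-bit head-to-tail offset."""
    assert 0 <= offset <= 7            # 3 bits; 0 means 'no tail'
    return (int(crosses_branch) << 3) | offset

def decode_pointer(p: int) -> tuple[bool, int]:
    """Unpack (crosses a control discontinuity, instruction offset)."""
    return bool(p >> 3), p & 0b111

print(decode_pointer(0b1010))          # -> (True, 2): the 'and' example
```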
43. Sequential RF access
- Remove ½ of the register read ports
- Only a single read port per issue slot
- 0- or 1-source instructions are processed without any restriction
- Sequentially access a single port twice when 2 values are needed
- Back-to-back issue: reading values off the bypass
- Back-to-back issue ensures 0 or 1 register read port accesses
- Non-back-to-back issue incurs a sequential RF access
- The scheduler creates a bubble to give the instruction a time window to access the RF twice
44. ½ technique: Sequential RF access
- Scheduler changes for sequential RF access
- Sequential RF access example (dependent chain):
  ADD r1, r2, r3
  SUB r3, r4, r5
  XOR r5, 1, r6
45. Inserting MOPs into the issue queue
[Pipeline diagram: same MOP scheduling pipeline as the overview slide (I-cache / Fetch → Rename → MOP formation → issue queue insert → Wakeup / Select → Payload RAM → RF → EXE → MEM → WB / Commit, with MOP detection feeding MOP pointers)]
- Inserting instructions across different insert groups
46. [Image-only slide; no transcript]
47. Processing granularity
- Instruction-granular hardware design
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Instruction granularity may impose constraints on the hardware design space
- Relaxing the constraints at different processing granularities
Finer-granular architecture (ISCA'03)
Coarser-granular architecture (MICRO'03)
[Diagram: granularity spectrum from operand to instruction (conventional) to multiple insts]
48. It's about granularity
- Instruction-granular hardware design
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Instruction granularity may impose constraints on the hardware design space
- Relaxing the constraints at different processing granularities
Half-price architecture (ISCA'03)
Coarser-granular architecture
[Diagram: processing granularity spectrum from finer (operand) to coarser (macro-op), with instruction (conventional) in between]
49. Register file complexity
- Overdesign in the register file
- 2x read ports for two source operands
- Superscalar processors need the RF to be heavily multiported
- Area increases quadratically, latency increases linearly
- The two read ports are not fully utilized
- 0- / 1-source instructions do not require two read ports
- Many instructions frequently get values off the bypass path
- Speeding up the RF
- Reducing the number of register entries
- Hierarchical register file (Cruz, Borch, Balasubramonian, ...)
- Reducing the number of ports
- Fewer RF ports + crossbar (Balasubramonian et al., Park et al.)
- Half-price technique: Sequential RF access
50. Attacking scheduling loop constraints
- Scheduling atomicity
- Speculation / pipelining
- Grandparent scheduling, select-free scheduling
- Poor scalability
- Low-complexity scheduling logic
- FIFO-style window, data-flow based window
- Judicious window scaling
- Segmented windows, WIB
- Issue queue entry sharing
- AMD K7 (MOP), Intel Pentium M (uop fusion)
- → These approaches overcome atomicity and scalability only in isolation
- Let's step back and see the problem from a different perspective
51. Future research: Exploiting Instruction-level Serialism!
- Finding vertical slices (chains of dependent insts) is easier
- Execution is serial in nature
- Light-weight conversion in HW / run-time binary translation
- Less vulnerable to dynamic events (good for caching prescheduled groups)
- A collection of vertical slices extracts parallelism
- Let the machine find the next vertical slices to issue, at a slower rate
- Increases window size, scheduling slack and bandwidth
[Diagram: slice execution under a dynamic event (e.g. cache miss); execution BW 4, scheduling loop 3 cycles, issue BW 2 slices / cycle]