1. Relaxing Microarchitectural Design Constraints at Different Processing Granularities
- Ilhyun Kim
- PHARM Team
- University of Wisconsin-Madison
- Advisor: Prof. Mikko H. Lipasti
2. Processing granularity
- The amount of work associated with a process
- e.g. bytes (work) per cache block transfer (process)
- Coarser granularity incurs fewer transfers for a given amount of data
- Finer granularity incurs less wasted work (e.g. when only 1 byte is needed); a toy calculation follows
[Diagram: granularity spectrum from finer (1 byte) to coarser]
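Since the slide reasons about this tradeoff numerically, here is a toy calculation (hypothetical numbers, not from the talk) showing how block size trades transfer count against wasted bytes:

```python
import math

def transfers_and_waste(useful_bytes: int, block_size: int) -> tuple[int, int]:
    """Return (block transfers needed, wasted bytes) for one request."""
    transfers = math.ceil(useful_bytes / block_size)
    wasted = transfers * block_size - useful_bytes
    return transfers, wasted

for block in (8, 32, 128):          # finer -> coarser granularity
    n, waste = transfers_and_waste(useful_bytes=100, block_size=block)
    print(f"block={block:3d}B  transfers={n:2d}  wasted={waste:3d}B")
# Coarser blocks cut the number of transfers (less control overhead) but
# waste more of each transfer (redundant resource), as the slide argues.
```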
3. Resource, Control and Granularity
- Processing granularity (cache block size): finer vs. coarser
- Control (the number of line transfers for the data): more when finer, fewer when coarser
- Resource (data bandwidth per transfer): efficient when finer, redundant when coarser
- → What is the optimal processing granularity?
- Tradeoffs between resource and control
- Non-linearity in the tradeoffs as granularity varies
- Determined by the goals and constraints of your design
- e.g. miss rates vs. latency vs. power
4. Granularity of instruction processing
- Conventional instruction-centric hardware design
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Instruction (or uop) is the unit of execution
- Running a program = executing a series of instructions
- Instruction-granular processing imposes instruction-granular constraints on the hardware design space
- Many hardware parameters are automatically determined by the processing granularity → not much flexibility in the design space
- e.g. the 2x read port configuration in the RF, atomicity of instruction scheduling
- Is the instruction the optimal unit of processing in the pipeline?
5. Relaxing Design Constraints at Different Granularities
- Each pipeline stage has different types of design issues
- Resource-critical (e.g. RF) or control-critical (e.g. scheduler) constraints
- Process instructions at different granularities
- Compensate for the critical design issue (resource / control)
- e.g. resource-critical structure → finer-grained processing
- e.g. control-critical structure → coarser-grained processing
Half-price Architecture (ISCA'03)
Macro-op Scheduling (MICRO'03)
[Diagram: processing granularity spectrum from finer (operand) to coarser (multiple insts), with instruction in between]
6. Outline
- Processing granularity
- Relaxing design constraints at different granularities
- Finer-grained processing
- Half-price architecture: Sequential RF access
- Coarser-grained processing
- Conclusions / Future research
7. Motivations for Finer-grained Processing
- Processors are designed to handle 0-, 1- and 2-source instructions at equal cost
- Satisfy the worst-case requirements of instructions
- No resource arbitration / pipeline stalls in handling source operands
- Simple controls over the instruction and data stream
- Handling source operands requires 2x machine BW
- e.g. 2 read ports / 1 write port per instruction
- Heavily multi-ported structures in many pipeline stages
8. Making the common case faster
- 2 source operands are common
- 18-36% of instructions have 2 source operands
- But structures for 2 source operands are not fully utilized
- Scheduler
- 4-16% of instructions need two wakeups
- Less than 3% of instructions handle 2 wakeups in the same clock cycle
- Register file
- 0.64 read ports per instruction
- Less than 4% of instructions need two register read ports
- → Why not build a pipeline optimized for 1-source instructions?
9. Half-price Architecture
- Restrict the processor's capability to handle 2 source operands
- 0- or 1-source instructions are processed without any restriction
- 2-source instructions may execute more slowly
- → Reduce the HW complexity incurred by 2 source operands
- ½ technique in the scheduler: sequential wakeup
- ½ technique in the RF: sequential register access
[Diagram: the conventional design point provisions for the worst-case instruction format (Opcode, Rdst, Rsrc1, Rsrc2 → more HW); the half-price design point targets the common formats (Opcode, Rdst/Rsrc → less HW)]
10. Two RF read port accesses
- Less than 4% of instructions need 2 read port accesses
- Many 2-source instructions read at least one value off the bypass path
- Detect back-to-back issue to determine if two values are needed from the RF
[Chart: fraction of 2-src insts requiring 2 read ports, 4-wide vs. 8-wide]
11. Sequential RF access
- Non-back-to-back issue → sequential RF access
- Sequential RF access example (dependent chain):
  ADD r1, r2, r3
  SUB r3, r4, r5
  XOR r5, 1, r6
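A minimal sketch of the issue-time decision behind this example, with hypothetical names (the talk gives no pseudocode): a 2-source instruction that is not issued back-to-back with its producers must pull both operands from the RF, so the single read port of its slot is used in two consecutive cycles and the scheduler inserts a bubble.

```python
def rf_port_schedule(num_sources: int, back_to_back: list[bool]) -> list[str]:
    """Return per-cycle actions for one issue slot with a single read port.

    back_to_back[i] is True if source i is produced by an instruction
    issued in the immediately preceding cycle (value comes off bypass).
    """
    from_rf = [i for i in range(num_sources) if not back_to_back[i]]
    if len(from_rf) <= 1:
        return ["read+exec"]                   # 0 or 1 RF read: no penalty
    # two RF reads needed: access the single port sequentially
    return ["read src0 (bubble)", "read src1 + exec"]

print(rf_port_schedule(2, [True, False]))      # -> ['read+exec']
print(rf_port_schedule(2, [False, False]))     # -> sequential access, 1 bubble
```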
12. Machine parameters
- SimpleScalar-Alpha-based, 12-stage, 4/8-wide OoO, speculative scheduling
- Alpha-style squashing scheduling recovery
- 4-wide: 64 RUUs, 32 LSQs, 2 memory ports
- 8-wide: 128 RUUs, 64 LSQs, 4 memory ports
- 64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
- Combined (bimodal + gShare) branch prediction, fetch until the first taken branch
- Sequential RF access
- ½ read-ported RF (1 read port / slot)
- Comparison cases
- Pipelined RF (1 extra RF stage)
- ½ read-ported RF + crossbar (same as sequential RF access)
13. Sequential RF access performance
[Chart: IPC slowdown, 4-wide and 8-wide]
- Seq RF access slowdown is slight: avg 1.1% / 0.7%, worst 2.2%
- ½ read ports + crossbar almost achieves base performance
- crossbar complexity, global RF port arbitration → high control overhead
- → Finer-grained processing in the RF stage can reduce hardware complexity with a minimal performance impact
14. Outline
- Processing granularity
- Relaxing design constraints at different granularities
- Finer-grained processing
- Coarser-grained processing
- Macro-op Scheduling
- Conclusions / Future research
15. Motivations for Coarser-grained Processing
- Loops in out-of-order execution
- Scheduling atomicity (wakeup / select within a single cycle)
- Essential for back-to-back instruction execution
- Hard to pipeline in conventional designs
- Poor scalability
- Extractable ILP is a function of window size
- Complexity increases exponentially as the size grows
- Increasing pressure due to deeper pipelining and slower memory systems
[Diagram: load latency resolution loop, scheduling loop (wakeup / select), execution loop (bypass)]
16. Related Work
- Scheduling atomicity
- Speculation / pipelining
- Grandparent scheduling [Stark], select-free scheduling [Brown]
- Poor scalability
- Low-complexity scheduling logic
- FIFO-style window [Palacharla, H. Kim], data-flow based window [Canal, Michaud, Raasch]
- Judicious window scaling
- Segmented windows [Hrishikesh], WIB [Lebeck]
- Issue queue entry sharing
- AMD K7 (MOP), Intel Pentium M (uop fusion)
- → These approaches overcome atomicity and scalability only in isolation
- Let's step back and see the problem from a different perspective
17. Source of the atomicity constraint
- Minimal execution latency of an instruction
- Many ALU operations have single-cycle latency
- The schedule should keep up with execution
- 1-cycle instructions need 1-cycle scheduling
- Multi-cycle operations do not need atomic scheduling
- → Relax the constraint by increasing the size of the scheduling unit
- Combine multiple instructions into a multi-cycle-latency unit
- Scheduling decisions occur at multiple-instruction boundaries
- Attack both the atomicity and scalability constraints at a coarser granularity
18. Macro-op scheduling overview
[Pipeline diagram: I-cache / Fetch → Rename → MOP formation → issue queue insert → Wakeup / Select → Payload RAM → RF → EXE → MEM → WB / Commit. MOP detection sits off the critical path, consuming dependence and wakeup order information and generating MOP pointers stored alongside the I-cache. Fetch / decode / rename and dispatch remain instruction-grained, scheduling is coarser MOP-grained, and sequencing from the payload RAM onward is instruction-grained again.]
19. MOP scheduling (2x) example
[Diagram: the same 16-instruction dependence graph scheduled two ways. Instruction-grained atomic scheduling (select / wakeup in one cycle) finishes in 9 cycles using 16 queue entries; 2-cycle MOP scheduling (pipelined select, then wakeup) finishes in 10 cycles using 9 queue entries.]
- Pipelined instruction scheduling of multi-cycle MOPs
- Still issues the original instructions consecutively
- Larger instruction window
- Multiple original instructions logically share a single issue queue entry
20. Issues in grouping instructions
- Candidate instructions
- Single-cycle instructions: integer ALU, control, store agen operations
- Multi-cycle instructions (e.g. loads) do not need single-cycle scheduling
- The number of source operands
- Grouping two dependent instructions → up to 3 source operands
- Allow up to 2 source operands (conventional) / no restriction (wired-OR)
- MOP size
- Bigger MOP sizes may be more beneficial
- 2 instructions in this study
- MOP formation scope
- Instructions are processed in order before being inserted into the issue queue
- Candidate instructions need to be captured within a reasonable scope (see the sketch below)
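A hedged sketch of the candidate filter these bullets imply (field names, op classes and the source-count accounting are illustrative, not from the paper):

```python
SINGLE_CYCLE = {"alu", "control", "store_agen"}   # candidate op classes
MAX_SRC = 2     # conventional wakeup width; 3 with wired-OR logic
SCOPE = 8       # MOP formation scope, in instructions

def can_group(head: dict, tail: dict, distance: int) -> bool:
    """head/tail are dicts like {'cls': 'alu', 'nsrc': 2} (assumed format)."""
    if head["cls"] not in SINGLE_CYCLE or tail["cls"] not in SINGLE_CYCLE:
        return False              # multi-cycle ops need no atomic scheduling
    if distance > SCOPE:
        return False              # tail falls outside the detection scope
    # the tail's dependence on the head stays inside the MOP, so it is free
    combined = head["nsrc"] + tail["nsrc"] - 1
    return combined <= MAX_SRC    # 3-src pairs need the wired-OR variant

print(can_group({"cls": "alu", "nsrc": 1},
                {"cls": "alu", "nsrc": 2}, distance=3))   # -> True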
21. Dependence edge distance (instruction count)
[Chart: per-benchmark distribution of dependence edge distances, normalized to total insts; values shown: 49.2, 50.9, 27.8, 48.7, 37.4, 56.3, 40.2, 47.5, 42.7, 47.7, 37.6, 44.7]
- 73% of value-generating candidates (potential MOP heads) have dependent candidate instructions (potential MOP tails)
- An 8-instruction scope captures many dependent pairs
- Variability in distances (e.g. gap vs. vortex)
- → Our configuration: grouping 2 single-cycle instructions within an 8-instruction scope
22. MOP detection
- Finds groupable instruction pairs
- Dependence matrix-based detection (detailed in the paper); a simplified sketch follows
- Performance is insensitive to detection latency (pointers are reused repeatedly)
- A pessimistic 100-cycle latency loses 0.22% of IPC
- Generates MOP pointers
- 4 bits per instruction, stored in the IL1
- A MOP pointer represents a groupable instruction pair
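The matrix-based mechanism is in the MICRO'03 paper; the following is only a plausible first-order sketch of detection over a decoded instruction stream, pairing a candidate head with the first dependent candidate tail inside the 8-instruction scope:

```python
def detect_mops(insts):
    """insts: list of dicts {'dst': reg|None, 'srcs': [regs], 'cand': bool}.
    Returns {head_index: tail_index}; each instruction is used at most once."""
    producer = {}          # reg -> index of its most recent producer
    grouped = set()
    pairs = {}
    for i, inst in enumerate(insts):
        if inst["cand"]:
            for r in inst["srcs"]:
                h = producer.get(r)
                if (h is not None and h not in grouped
                        and insts[h]["cand"] and i - h <= 8):
                    pairs[h] = i           # emit a MOP pointer head -> tail
                    grouped.update((h, i))
                    break
        if inst["dst"] is not None:
            producer[inst["dst"]] = i
    return pairs

insts = [
    {"dst": "r1", "srcs": ["r2", "r3"], "cand": True},   # head
    {"dst": "r4", "srcs": ["r3"],       "cand": False},  # e.g. a load
    {"dst": "r5", "srcs": ["r1"],       "cand": True},   # tail of inst 0
]
print(detect_mops(insts))    # -> {0: 2}
```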
23. MOP detection: avoiding cycle conditions
- Cycle condition examples (leading to deadlocks)
- Conservative cycle detection heuristic
- Precise detection is hard (multiple levels of dependence tracking)
- Assume a cycle if both outgoing and incoming edges are detected
- Captures over 90% of MOP opportunities (compared to precise detection)
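A minimal sketch of that heuristic for the 2-instruction MOP case (data layout assumed): reject a pair whenever the span between head and tail has both an edge leaving the MOP and an edge entering it, since that is how a dependence cycle through the MOP could form.

```python
def may_form_cycle(insts, head: int, tail: int) -> bool:
    """True if grouping insts[head] and insts[tail] might create a cycle."""
    written = {insts[head]["dst"]}
    outgoing = incoming = False
    for mid in range(head + 1, tail):            # instructions in between
        if any(s in written for s in insts[mid]["srcs"]):
            outgoing = True                      # edge MOP -> middle inst
        if insts[mid]["dst"] in insts[tail]["srcs"]:
            incoming = True                      # edge middle inst -> MOP
    return outgoing and incoming                 # conservative: assume a cycle

insts = [
    {"dst": "r1", "srcs": []},        # head
    {"dst": "r2", "srcs": ["r1"]},    # middle: depends on the head ...
    {"dst": "r3", "srcs": ["r2"]},    # tail: ... and feeds the tail
]
print(may_form_cycle(insts, 0, 2))    # -> True: grouping 0 and 2 deadlocks
```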
24. MOP formation
- Locates MOP pairs using MOP pointers
- MOP pointers are fetched along with instructions
- Converts register dependences to MOP dependences
- Architected register IDs → MOP IDs
- Identical to register renaming
- Except that it assigns a single ID to two groupable instructions
- Reflects the fact that two instructions are grouped into one scheduling unit
- The two instructions are later inserted into one issue entry
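A sketch of the renaming analogy, assuming simple dict-based tables: the only change from conventional renaming is that the tail of a pair reuses the head's ID, so the intra-MOP dependence disappears from the scheduler's view.

```python
class MopRename:
    def __init__(self):
        self.next_id = 0
        self.reg_to_mop = {}   # architected reg -> MOP ID of its producer

    def rename(self, inst, is_tail_of_group: bool, head_mop=None):
        # the tail of a MOP reuses the head's ID instead of allocating one
        mop_id = head_mop if is_tail_of_group else self._alloc()
        src_deps = {self.reg_to_mop[r] for r in inst["srcs"]
                    if r in self.reg_to_mop} - {mop_id}  # intra-MOP dep is free
        if inst["dst"] is not None:
            self.reg_to_mop[inst["dst"]] = mop_id
        return mop_id, src_deps

    def _alloc(self):
        self.next_id += 1
        return self.next_id - 1

rn = MopRename()
head_id, _ = rn.rename({"dst": "r1", "srcs": ["r2"]}, False)
tail_id, deps = rn.rename({"dst": "r3", "srcs": ["r1"]}, True, head_mop=head_id)
assert head_id == tail_id and not deps    # one shared ID, dep is internal
```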
25. Scheduling MOPs
- Instructions in a MOP are scheduled as a single unit
- A MOP is a non-pipelined, 2-cycle operation from the scheduler's perspective
- Issued when all source operands are ready; incurs one tag broadcast
- Wakeup / select timings (sketched below)
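One way to picture those timings (the exact cycle accounting here is an assumption, not lifted from the paper): a MOP is selected once, occupies its slot for two cycles, and its single tag broadcast wakes dependents just in time to follow the tail.

```python
def mop_events(issue_cycle: int) -> dict[str, int]:
    """Assumed event timing for one MOP, as the scheduler sees it."""
    return {
        "select":        issue_cycle,        # one select for the whole MOP
        "head_exec":     issue_cycle + 1,
        "tag_broadcast": issue_cycle + 1,    # single broadcast for the MOP
        "tail_exec":     issue_cycle + 2,
        "dep_can_issue": issue_cycle + 2,    # dependent follows the tail
    }

print(mop_events(0))
```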
26. Sequencing instructions
- A MOP is converted back into the two original instructions
- The dual-entry payload RAM sends the two original instructions
- The original instructions are sequentially executed within 2 cycles
- Register values are accessed using physical register IDs
- The ROB separately commits the original instructions in order
- MOPs do not affect precise exceptions or branch misprediction recovery
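A hypothetical sketch of this sequencing step: on issue, the MOP's dual-entry payload RAM line is read and its two original instructions enter execution on consecutive cycles.

```python
def sequence(payload_ram, mop_id: int):
    """Yield (cycle_offset, original instruction) for an issued MOP."""
    head, tail = payload_ram[mop_id]          # dual-entry line
    yield 0, head                             # cycle N:   head executes
    if tail is not None:                      # singleton entries hold no tail
        yield 1, tail                         # cycle N+1: tail executes

payload_ram = {7: ({"op": "add"}, {"op": "sub"})}
for dt, inst in sequence(payload_ram, 7):
    print(dt, inst["op"])
```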
27. Machine parameters
- SimpleScalar-Alpha-based, 4-wide OoO, speculative scheduling w/ selective replay, 14 stages
- Ideally pipelined scheduler
- conceptually equivalent to atomic scheduling + 1 extra stage
- 128 ROB, unrestricted / 32-entry issue queue
- 4 ALUs, 2 memory ports, 16K IL1 (2), 16K DL1 (2), 256K L2 (8), memory (100)
- Combined branch prediction, fetch until the first taken branch
- MOP scheduling
- 2-cycle (pipelined) scheduling + 2X MOP technique
- 2 (conventional) or 3 (wired-OR) source operands
- MOP detection scope: 2 cycles (4-wide × 2 cycles → up to 8 insts)
- SPEC2000 INT, reduced input sets
- Reference input sets for crafty, eon, gap (up to 3B instructions)
28. % grouped instructions
[Chart: fraction of instructions grouped, 2-src vs. 3-src configurations]
- 28-46% of total instructions are grouped
- 14-23% reduction in the instruction count in the scheduler
- MOPs cover 26-63% of value-generating 1-cycle instructions
- potentially issued as if atomic scheduling were performed
29. MOP scheduling performance (relaxed atomicity constraint only)
Unrestricted IQ / 128 ROB
- Up to 19% IPC loss with 2-cycle scheduling
- MOP scheduling restores the performance
- Enables consecutive issue of dependent instructions
- 97.2% of atomic scheduling performance on average
30. Insight into MOP scheduling
- Performance loss of 2-cycle scheduling
- Correlated with dependence edge distance
- Short dependence edges (e.g. gap)
- → the instruction window fills up with chains of dependent instructions
- → a 2-cycle scheduler cannot find enough ready instructions to issue
- MOP scheduling captures short-distance dependent instruction pairs
- They are the important ones
- Low MOP coverage due to long dependence edges does not matter
- a 2-cycle scheduler can find many instructions to issue (e.g. vortex)
- → MOP scheduling complements 2-cycle scheduling
- Overall performance is less sensitive to code layout
31. MOP scheduling performance (relaxed atomicity and scalability constraints)
32-entry IQ / 128 ROB
- Benefits from both the relaxed atomicity and the relaxed scalability constraints at a coarser processing granularity
- → Pipelined 2-cycle MOP scheduling performs comparably to or better than atomic scheduling
32. Conclusions
- Instruction-centric hardware designs impose microarchitectural design constraints
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Tradeoffs at different processing granularities
- Control and resource
- Varying the processing granularity exposes greater opportunities for high-performance, complexity-effective microarchitectures
- Finer-grained processing: Half-price architecture
- Coarser-grained processing: Macro-op scheduling
33. Future research: Revisiting ILP
Goal: keeping resources as busy as possible
- Ways to extract instruction-level parallelism
- OoO execution
- may not be scalable to future processors due to complexity
- VLIW
- Binary compatibility matters
- High overhead of dynamic binary translation
- vulnerable to unexpected dynamic events (distortion in sets of parallel insts)
- Both approaches strip horizontal slices from a program
34. Future research: Exploiting Instruction-level Serialism!
- Finding vertical slices (chains of dependent insts) is easier
- Execution is serial in nature
- Light-weight conversion in HW / run-time binary translation
- Less vulnerable to dynamic events (good for caching prescheduled groups)
- A collection of vertical slices extracts parallelism
- Let the machine find the next vertical slices to issue, at a slower rate
- Increases window size, scheduling slack and bandwidth
[Diagram: instruction-centric OoO vs. coarser-grained parallel execution vs. coarser-grained serial execution; execution BW 4, 2 issue slots, slices issued over 1 or 2 cycles]
35. Applied to MOPs (Macro-op Execution)
36. MOP execution: Performance
- Pipelined scheduling, fewer issue/payload/RF ports, simpler bypass
- Achieves wider execution bandwidth with narrower structures
37. Future research: Parallelism, Granularity and ILS
- Goal: a better implementation ISA
- Light-weight conversion from U-ISA to I-ISA
- Easy to maintain the original sequential program semantics
- Hardware complexity and power consumption
- Move the burden of timing-critical decisions offline
- Good front-end code density, fewer operations to process
- Performance
- Adaptability to run-time environments
- Achieving the maximum extractable ILP
- → Vertically-long instruction word?
- Coarser-granular instruction sets that exploit ILS
- Run-time binary translation / dynamic HW construction
- Granularity and dimension of the instruction word
- Impact on the native ILP
- Underlying HW
38. Thesis Research Contributions (infomercial)
- Speculative Decode (ISCA'02)
- Attacking the problems with value-based dynamic optimizations under speculative scheduling
- Half-price Architecture (ISCA'03)
- Operand-centric microarchitecture design
- Macro-op Scheduling (MICRO'03)
- Coarser-grained instruction scheduling
- Studies on Scheduling Replay Schemes (HPCA'04)
- Addresses deficiencies in the literature
- Scalable selective scheduling replay mechanisms
39. Questions?
40. Macro-op scheduling on x86 (swiped from Shiliang Hu's results)
- x86vm (under development)
- x86 interpreter / functional simulator based on BOCHS 2.0.2
- Cracks x86 into RISC-style ops (proprietary mapping)
- Timing simulator for the detailed microarchitecture is under construction
- x86 → micro-ops → MOPs
- Assumes dynamic binary translation
- Allows grouping of SS and SM (→ needs consideration)
- Within / across x86 instructions
- Does not allow grouping across conditional BR / indirect JMP
- Dependent MOPs only
41. Grouped RISC ops (x86)
[Chart: grouped ops vs. 2-cycle-scheduling-unfriendly ops]
- 57% of operations are grouped → 28% reduction in scheduling units
- leaving less than 5% of operations 2-cycle-scheduling-unfriendly
- Over 95% of MOPs are captured within 3 micro-ops
- 66% are consecutive operations
- 71% of MOPs are created across x86 instructions
- not a reverse process of RISC op cracking
42. MOP detection: MOP pointer generation
- MOP pointers (4 bits per instruction): a control bit and an offset
- Control bit (1)
- captures up to 1 control discontinuity
- Offset bits (3)
- instruction count from head to tail
- Example MOP pointers (control bit, offset bits, instruction):
  0 011  add r1 ← r2, r3
  0 000  lw  r4 ← 0(r3)
  1 010  and r5 ← r4, r2
  0 000  bez r1, 0xff (taken)
  0 000  sub r6 ← r5, 1
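A small encode/decode sketch of this 4-bit format (the bit packing order is an assumption). In the example above, the pointer on `and` decodes to control = 1, offset = 2: its tail `sub` is two instructions later, across the taken branch.

```python
def encode_pointer(crosses_branch: bool, offset: int) -> int:
    """Pack a MOP pointer: 1 control bit + 3-bit head-to-tail offset."""
    assert 0 <= offset <= 7            # 3 bits; 0 means 'no tail'
    return (int(crosses_branch) << 3) | offset

def decode_pointer(p: int) -> tuple[bool, int]:
    """Unpack (crosses a control discontinuity, instruction offset)."""
    return bool(p >> 3), p & 0b111

print(decode_pointer(0b1010))          # -> (True, 2): the 'and' example
```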
43. Sequential RF access
- Remove ½ of the register read ports
- Only a single read port per issue slot
- 0- or 1-source instructions are processed without any restriction
- Sequentially access a single port twice when 2 values are needed
- Back-to-back issue: reading values off the bypass
- Back-to-back issue ensures 0 or 1 register read port accesses
- Non-back-to-back issue incurs a sequential RF access
- The scheduler creates a bubble to give the instruction a time window to access the RF twice
44. ½ technique: Sequential RF access
- Scheduler changes for sequential RF access
- Sequential RF access example (dependent chain):
  ADD r1, r2, r3
  SUB r3, r4, r5
  XOR r5, 1, r6
45. Inserting MOPs into the issue queue
[Pipeline diagram: same MOP scheduling pipeline as the overview slide (I-cache / Fetch → Rename → MOP formation → issue queue insert → Wakeup / Select → Payload RAM → RF → EXE → MEM → WB / Commit, with MOP detection feeding MOP pointers)]
- Inserting instructions across different insert groups
46. [Image-only slide; no transcript]
47. Processing granularity
- Instruction-granular hardware design
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Instruction granularity may impose constraints on the hardware design space
- Relaxing the constraints at different processing granularities
Finer-granular architecture (ISCA'03)
Coarser-granular architecture (MICRO'03)
[Diagram: granularity spectrum from operand to instruction (conventional) to multiple insts]
48. It's about granularity
- Instruction-granular hardware design
- HW structures are built to match an instruction's specifications
- Controls occur at every instruction boundary
- Instruction granularity may impose constraints on the hardware design space
- Relaxing the constraints at different processing granularities
Half-price architecture (ISCA'03)
Coarser-granular architecture
[Diagram: processing granularity spectrum from finer (operand) to coarser (macro-op), with instruction (conventional) in between]
49. Register file complexity
- Overdesign in the register file
- 2x read ports for two source operands
- Superscalar processors need the RF to be heavily multiported
- Area increases quadratically, latency increases linearly
- The two read ports are not fully utilized
- 0- / 1-source instructions do not require two read ports
- Many instructions frequently get values off the bypass path
- Speeding up the RF
- Reducing the number of register entries
- Hierarchical register file (Cruz, Borch, Balasubramonian, ...)
- Reducing the number of ports
- Fewer RF ports + crossbar (Balasubramonian et al., Park et al.)
- Half-price technique: Sequential RF access
50. Attacking scheduling loop constraints
- Scheduling atomicity
- Speculation / pipelining
- Grandparent scheduling, select-free scheduling
- Poor scalability
- Low-complexity scheduling logic
- FIFO-style window, data-flow based window
- Judicious window scaling
- Segmented windows, WIB
- Issue queue entry sharing
- AMD K7 (MOP), Intel Pentium M (uop fusion)
- → These approaches overcome atomicity and scalability only in isolation
- Let's step back and see the problem from a different perspective
51. Future research: Exploiting Instruction-level Serialism!
- Finding vertical slices (chains of dependent insts) is easier
- Execution is serial in nature
- Light-weight conversion in HW / run-time binary translation
- Less vulnerable to dynamic events (good for caching prescheduled groups)
- A collection of vertical slices extracts parallelism
- Let the machine find the next vertical slices to issue, at a slower rate
- Increases window size, scheduling slack and bandwidth
[Diagram: slice execution under a dynamic event (e.g. cache miss); execution BW 4, scheduling loop 3 cycles, issue BW 2 slices / cycle]