1
Relaxing Microarchitectural Design Constraints at
Different Processing Granularities
  • ILHYUN KIM
  • PHARM Team
  • University of Wisconsin-Madison
  • Advisor: Prof. Mikko H. Lipasti

2
Processing granularity
  • The amount of work associated with a process
  • e.g. bytes (work) per cache block transfer
    (process)
  • Coarser granularity incurs fewer transfers for
    a given amount of data
  • Finer granularity incurs less wasted work
[Diagram: granularity spectrum from finer (1 byte) to coarser]
3
Resource, Control and Granularity
  • Processing granularity (cache block size):
    finer ↔ coarser
  • Control (the number of line transfers for the
    data): more (finer) ↔ fewer (coarser)
  • Resource (data bandwidth per transfer):
    efficient (finer) ↔ redundant (coarser)
  • → What is the optimal processing granularity?
  • Tradeoffs between resource and control
  • Non-linearity in tradeoffs as granularity varies
  • Determined by the goals and constraints of your
    design
  • e.g. miss rates vs. latency vs. power
    (a minimal arithmetic sketch of this tradeoff
    follows below)
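
The tradeoff above can be made concrete with a little arithmetic. The following C sketch (not from the talk; the 40-byte request and the block sizes are illustrative assumptions) counts transfers (control) and wasted bytes (resource) for one fixed amount of useful data:

#include <stdio.h>

int main(void) {
    int useful_bytes = 40;                /* bytes the program actually uses */
    int block_sizes[] = {8, 16, 32, 64};  /* candidate cache block sizes */

    for (int i = 0; i < 4; i++) {
        int b = block_sizes[i];
        int transfers = (useful_bytes + b - 1) / b; /* control: # of transfers */
        int wasted = transfers * b - useful_bytes;  /* resource: fetched, unused */
        printf("block %2dB: %d transfers, %2d wasted bytes\n", b, transfers, wasted);
    }
    return 0;
}

Coarser 64B blocks need a single transfer but waste 24 bytes; finer 8B blocks waste nothing but need five transfers. Neither end of the spectrum wins on both axes, which is the non-linearity noted above.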

4
Granularity of instruction processing
  • Conventional instruction-centric hardware design
  • HW structures are built to match an instruction's
    specifications
  • Controls occur at every instruction boundary
  • Instruction (or uop) is the unit of execution
  • Running a program = executing a series of
    instructions
  • Instruction-granular processing imposes
    instruction-granular constraints on the hardware
    design space
  • Many hardware parameters are automatically
    determined by processing granularity → not much
    flexibility in the design space
  • e.g. the 2x read-port configuration in the RF,
    the atomicity of instruction scheduling
  • Is the instruction the optimal unit of processing
    in the pipeline?

5
Relaxing Design Constraints at different
granularities
  • Each pipeline stage has different types of design
    issues
  • Resource- (e.g. RF) or control-critical (e.g.
    Scheduler) constraints
  • Process instructions at different granularities
  • Compensate for critical design issues (resource /
    control)
  • e.g. resource-critical structure → finer-grained
    processing
  • e.g. control-critical structure → coarser-grained
    processing

[Diagram: processing-granularity spectrum, finer to coarser: operand (Half-price Architecture, ISCA'03) ← instruction → multiple insts (Macro-op Scheduling, MICRO'03)]
6
Outline
  • Processing granularity
  • Relaxing design constraints at different
    granularities
  • Finer-grained processing
  • Half-price architecture: Sequential RF access
  • Coarser-grained processing
  • Conclusions & Future research

7
Motivations for Finer-grained Processing
  • Processors are designed to handle 0-, 1- and
    2-source instructions at equal cost
  • Satisfy the worst-case requirements of
    instructions
  • No resource arbitrations / pipeline stalls in
    handling source operands
  • Simple controls over instruction and data stream
  • Handling source operands requires 2x machine BW
  • e.g. 2 read ports / 1 write port per instruction
  • Heavily multi-ported structures in many pipeline
    stages

8
Making the common case faster
  • 2 source operands are common
  • 18–36% of instructions have 2 source operands
  • But, structures for 2 source operands are not
    fully utilized
  • Scheduler
  • 4–16% of instructions need two wakeups
  • Less than 3% of instructions handle 2 wakeups in
    the same clock cycle
  • Register File
  • 0.64 read ports used per instruction on average
  • Less than 4% of instructions need two register
    read ports
  • → Why not build a pipeline optimized for 1-source
    instructions?

9
Half-price Architecture
  • Restrict the processor's capability to handle 2
    source operands
  • 0- or 1-source instructions are processed without
    any restriction
  • 2-source instructions may execute more slowly
  • → Reduce the HW complexity incurred by 2 source
    operands
  • ½ technique in the scheduler: Sequential wakeup
  • ½ technique in the RF: Sequential register access

[Diagram: the conventional design point handles the worst-case operand format (Opcode | Rdst | Rsrc1 | Rsrc2 → more HW); the half-price design point is sized for the common formats (Opcode | Rdst | Rsrc1, Opcode | Rdst/Rsrc, Opcode alone → less HW)]
10
Two RF read port accesses
  • Less than 4% of instructions need 2 read port
    accesses
  • Many 2-source instructions read at least one
    value off the bypass path
  • Detect back-to-back issue to determine if two
    values are needed from the RF

[Chart: fraction of 2-src insts that require 2 read ports, 4-wide vs. 8-wide]
11
Sequential RF access
  • Non-back-to-back issue → sequential RF access
  • Sequential RF access example (a code sketch of
    the port decision follows below)

ADD r1, r2, r3
SUB r3, r4, r5
XOR r5, 1, r6
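
A minimal C sketch of the decision this slide describes, assuming a simple cycle-count model of the scheduler (the structures and field names are mine, not the paper's): an operand whose producer issued in the immediately preceding cycle arrives on the bypass network and needs no RF port; anything else must be read from the RF, and two such reads are serialized over the slot's single port.

#include <stdio.h>

typedef struct {
    int src_valid[2];      /* does the instruction have this source operand? */
    int producer_cycle[2]; /* issue cycle of each source's producer, -1 = long ready */
} inst_t;

/* Returns how many RF read-port accesses the instruction needs;
 * 2 means it falls back to sequential RF access. */
int rf_reads_needed(const inst_t *in, int issue_cycle) {
    int reads = 0;
    for (int s = 0; s < 2; s++) {
        if (!in->src_valid[s]) continue;
        int back_to_back = (in->producer_cycle[s] == issue_cycle - 1);
        if (!back_to_back) reads++;  /* value is not on the bypass: read the RF */
    }
    return reads;
}

int main(void) {
    /* SUB r3, r4, r5 issued in cycle 5: r3 came from ADD in cycle 4
     * (bypass), r4 has been ready for a long time (one RF read). */
    inst_t sub = { {1, 1}, {4, -1} };
    printf("SUB needs %d RF read(s)\n", rf_reads_needed(&sub, 5));
    return 0;
}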
12
Machine parameters
    Simplescalar-Alpha-based, 12-stage, 4/8-wide OoO,
    speculative scheduling
  • Alpha-style squashing scheduling recovery
  • 4-wide: 64 RUU entries, 32 LSQ entries, 2 memory ports
  • 8-wide: 128 RUU entries, 64 LSQ entries, 4 memory ports
  • 64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
  • Combined (bimodal + gShare) branch prediction,
    fetch until the first taken branch
  • Sequential RF access
  • ½ read-ported RF (1 read port / slot)
  • Comparison cases
  • Pipelined RF (1 extra RF stage)
  • ½ read-ported RF (same as sequential RF access) +
    crossbar

13
Sequential RF access performance
[Charts: slowdown relative to the base machine, 4-wide and 8-wide]
  • Seq RF access slowdown is slight: avg 1.1% / 0.7%,
    worst 2.2%
  • ½ read ports + crossbar almost achieves base
    performance
  • crossbar complexity, global RF port arbitration →
    high control overhead
  • → Finer-grained processing in the RF stage can
    reduce hardware complexity with a minimal
    performance impact

14
Outline
  • Processing granularity
  • Relaxing design constraints at different
    granularities
  • Finer-grained processing
  • Coarser-grained processing
  • Macro-op Scheduling
  • Conclusions & Future research

15
Motivations for Coarser-grained Processing
  • Loops in out-of-order execution
  • Scheduling atomicity (wakeup / select within a
    single cycle)
  • Essential for back-to-back instruction execution
  • Hard to pipeline in conventional designs
  • Poor scalability
  • Extractable ILP is a function of window size
  • Complexity increases exponentially as the size
    grows
  • Increasing pressure due to deeper pipelining and
    slower memory system

[Diagram: three loops in OoO execution: the load latency resolution loop, the scheduling loop (wakeup / select), and the exe loop (bypass)]
16
Related Work
  • Scheduling atomicity
  • Speculation / pipelining
  • Grandparent scheduling [Stark], select-free
    scheduling [Brown]
  • Poor scalability
  • Low-complexity scheduling logic
  • FIFO-style window [Palacharla, H. Kim]
  • Data-flow based window [Canal, Michaud, Raasch]
  • Judicious window scaling
  • Segmented windows [Hrishikesh], WIB [Lebeck]
  • Issue queue entry sharing
  • AMD K7 (MOP), Intel Pentium M (uop fusion)
  • → These overcome atomicity and scalability in
    isolation
  • Let's step back and see the problem from a
    different perspective

17
Source of the atomicity constraint
  • Minimal execution latency of instruction
  • Many ALU operations have single-cycle latency
  • Schedule should keep up with execution
  • 1-cycle instructions need 1-cycle scheduling
  • Multi-cycle operations do not need atomic
    scheduling
  • → Relax the constraints by increasing the size of
    the scheduling unit
  • Combine multiple instructions into a multi-cycle
    latency unit
  • Scheduling decisions occur at multiple
    instruction boundaries
  • Attack both atomicity and scalability constraints
    at a coarser granularity

18
Macro-op scheduling overview
[Pipeline diagram: I-cache/Fetch → Rename → Queue/Dispatch → Scheduling (Wakeup/Select) → RF/Payload RAM → EXE → MEM → WB/Commit. MOP detection generates MOP pointers from dependence and wakeup-order information; MOP formation groups instructions at issue queue insert. Fetch/Decode/Rename and RF/EXE/MEM/WB/Commit remain instruction-grained, while the pipelined scheduler operates at the coarser MOP granularity and sequencing restores the original instructions.]
19
MOP scheduling (2x) example
[Figure: scheduling a 16-instruction dependence graph. Baseline atomic select/wakeup: 9 cycles, 16 queue entries. With 2x macro-ops and pipelined select → wakeup: 10 cycles, 9 queue entries.]
  • Pipelined instruction scheduling of multi-cycle
    MOPs
  • Still issues original instructions consecutively
  • Larger instruction window
  • Multiple original instructions logically share a
    single issue queue entry

20
Issues in grouping instructions
  • Candidate instructions
  • Single-cycle instructions: integer ALU, control,
    store agen operations
  • Multi-cycle instructions (e.g. loads) do not need
    single-cycle scheduling
  • The number of source operands
  • Grouping two dependent instructions → up to 3
    source operands
  • Allow up to 2 source operands (conventional) / no
    restriction (wired-OR)
  • MOP size
  • Bigger MOP sizes may be more beneficial
  • 2 instructions in this study
  • MOP formation scope
  • Instructions are processed in order before being
    inserted into the issue queue
  • Candidate instructions need to be captured within
    a reasonable scope

21
Dependence edge distance (instruction count)
[Chart: dependence edge distance in instruction count; per-benchmark
fractions of total insts range from 27.8 to 56.3]
  • 73% of value-generating candidates (potential MOP
    heads) have dependent candidate instructions
    (potential MOP tails)
  • An 8-instruction scope captures many dependent
    pairs
  • Variability in distances (e.g. gap vs. vortex)
  • → Our configuration: grouping 2 single-cycle
    instructions within an 8-instruction scope

22
MOP detection
  • Finds groupable instruction pairs
  • Dependence matrix-based detection (detailed in
    the paper)
  • Performance is insensitive to detection latency
    (pointers are reused repeatedly)
  • A pessimistic 100-cycle latency loses 0.22% of
    IPC
  • Generates MOP pointers
  • 4 bits per instruction, stored in the IL1
  • A MOP pointer represents a groupable instruction
    pair

23
MOP detection Avoiding cycle conditions
  • Cycle condition examples (leading to deadlocks)
  • Conservative cycle detection heuristic
  • Precise detection is hard (multiple levels of dep
    tracking)
  • Assume a cycle if both outgoing and incoming
    edges are detected
  • Captures over 90% of MOP opportunities (compared
    to precise detection); a sketch combining the
    grouping rules with this heuristic follows below
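
Putting slides 20-23 together, here is a hedged C sketch of pair detection over one 8-instruction scope. The dependence matrix dep[i][j] and the inst_t fields are assumed simulator-style structures, not the paper's exact design; the cycle test implements the conservative heuristic literally (assume a cycle whenever the head has another outgoing edge and the tail another incoming edge).

#define SCOPE 8  /* MOP formation scope: 8 instructions */

typedef struct {
    int single_cycle;  /* integer ALU / control / store-agen candidate? */
    int grouped;       /* already consumed by an earlier MOP? */
} inst_t;

/* dep[i][j] = 1 if instruction j depends on instruction i (both in scope).
 * On return, tail_of[h] holds the tail paired with head h, or -1. */
void find_mops(inst_t insts[SCOPE], int dep[SCOPE][SCOPE], int tail_of[SCOPE]) {
    for (int i = 0; i < SCOPE; i++) tail_of[i] = -1;

    for (int h = 0; h < SCOPE; h++) {
        if (!insts[h].single_cycle || insts[h].grouped) continue;
        for (int t = h + 1; t < SCOPE; t++) {  /* MOP size: 2 instructions */
            if (!dep[h][t] || !insts[t].single_cycle || insts[t].grouped)
                continue;
            /* conservative cycle heuristic: an extra outgoing edge from
             * the head plus an extra incoming edge into the tail might
             * close a dependence cycle through a third instruction */
            int other_out = 0, other_in = 0;
            for (int k = 0; k < SCOPE; k++) {
                if (k != t && dep[h][k]) other_out = 1;
                if (k != h && dep[k][t]) other_in = 1;
            }
            if (other_out && other_in) continue;  /* assume a cycle: skip */
            tail_of[h] = t;                       /* emit a MOP pointer h -> t */
            insts[h].grouped = insts[t].grouped = 1;
            break;
        }
    }
}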

24
MOP formation
  • Locates MOP pairs using MOP pointers
  • MOP pointers are fetched along with instructions
  • Converts register dependences to MOP dependences
  • Architected register IDs → MOP IDs
  • Identical to register renaming
  • Except that it assigns a single ID to two
    groupable instructions
  • Reflects the fact that two instructions are
    grouped into one scheduling unit
  • Two instructions are later inserted into one
    issue entry (see the rename-style sketch below)
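
A sketch of the rename-like pass, under the assumption of a flat map from architected registers to MOP IDs (structures and names are mine): the only difference from conventional renaming is that a tail reuses the ID just allocated to its head, so the pair is tracked, woken up, and inserted as one unit.

#define NUM_ARCH_REGS 32

static int mop_map[NUM_ARCH_REGS]; /* architected reg -> producing MOP ID */
static int next_mop_id = 0;
static int last_id = -1;           /* MOP ID assigned to the previous inst */

/* Rename one instruction's destination. A MOP tail inherits its head's
 * ID instead of allocating a fresh one. */
int rename_to_mop(int dst_reg, int is_mop_tail) {
    int id = is_mop_tail ? last_id : next_mop_id++;
    if (dst_reg >= 0)              /* branches/stores may have no dest */
        mop_map[dst_reg] = id;
    last_id = id;
    return id;
}

/* Sources are looked up the same way, so dependences now name MOP IDs
 * and the scheduler wakes up whole MOPs, not individual instructions. */
int mop_src(int src_reg) { return mop_map[src_reg]; }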

25
Scheduling MOPs
  • Instructions in a MOP are scheduled as a single
    unit
  • A MOP is a non-pipelined, 2-cycle operation from
    the scheduler's perspective
  • Issued when all source operands are ready; incurs
    one tag broadcast
  • Wakeup / select timings (sketched below)
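
The timing difference can be stated in a few lines. In this hedged sketch (enqueue_broadcast is an assumed event-queue hook, not a real API), a 1-cycle instruction broadcasts its tag so dependents can issue in the very next cycle, while a MOP behaves as a non-pipelined 2-cycle operation and delays its single broadcast by one cycle:

extern void enqueue_broadcast(int dest_tag, int cycle); /* assumed hook */

void schedule_wakeup(int dest_tag, int select_cycle, int is_mop) {
    int latency = is_mop ? 2 : 1;  /* a MOP = two chained 1-cycle ops */
    /* dependents wake up just in time to issue as the result comes due */
    enqueue_broadcast(dest_tag, select_cycle + latency - 1);
}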

26
Sequencing instructions
  • A MOP is converted back to its two original
    instructions
  • The dual-entry payload RAM sends the two original
    instructions
  • Original instructions are sequentially executed
    within 2 cycles (see the sketch below)
  • Register values are accessed using physical
    register IDs
  • The ROB separately commits original instructions
    in order
  • MOPs do not affect precise exceptions or branch
    misprediction recovery
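
A small sketch of sequencing, with dispatch() standing in for whatever sends an operation down the execution pipeline (an assumption, not the paper's interface):

typedef struct { int head_op, tail_op; } payload_entry_t;

extern void dispatch(int op, int cycle);  /* assumed downstream hook */

/* One issued MOP reads its dual payload entry and releases the two
 * original instructions in consecutive cycles; from this point on the
 * pipeline is instruction-grained again. */
void sequence_mop(const payload_entry_t *e, int issue_cycle) {
    dispatch(e->head_op, issue_cycle);      /* head executes first */
    dispatch(e->tail_op, issue_cycle + 1);  /* dependent tail one cycle later */
}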

27
Machine parameters
  • Simplescalar-Alpha-based 4-wide OoO, speculative
    scheduling w/ selective replay, 14 stages
  • Ideally pipelined scheduler
  • conceptually equivalent to atomic scheduling + 1
    extra stage
  • 128 ROB, unrestricted / 32-entry issue queue
  • 4 ALUs, 2 memory ports, 16K IL1 (2), 16K DL1 (2),
    256K L2 (8), memory (100)
  • Combined branch prediction, fetch until the first
    taken branch
  • MOP scheduling
  • 2-cycle (pipelined) scheduling + 2X MOP technique
  • 2 (conventional) or 3 (wired-OR) source operands
  • MOP detection scope: 2 cycles (4-wide × 2 cycles →
    up to 8 insts)
  • Spec2k INT, reduced input sets
  • Reference input sets for crafty, eon, gap (up to
    3B instructions)

28
% grouped instructions
[Chart: fraction of instructions grouped, 2-src vs. 3-src configurations]
  • 28–46% of total instructions are grouped
  • 14–23% reduction in the instruction count in the
    scheduler
  • MOPs cover 26–63% of value-generating 1-cycle
    instructions
  • potentially issued as if atomic scheduling were
    performed

29
MOP scheduling performance (relaxed atomicity
constraint only)
Unrestricted IQ / 128 ROB
  • Up to 19% IPC loss with 2-cycle scheduling
  • MOP scheduling restores performance
  • Enables consecutive issue of dependent
    instructions
  • 97.2% of atomic scheduling performance on average

30
Insight into MOP scheduling
  • Performance loss of 2-cycle scheduling
  • Correlated to dependence edge distance
  • Short dependence edges (e.g. gap)
  • → the instruction window fills up with chains of
    dependent instructions
  • → the 2-cycle scheduler cannot find enough ready
    instructions to issue
  • MOP scheduling captures short-distance dependent
    instruction pairs
  • They are the important ones
  • Low MOP coverage due to long dependence edges
    does not matter
  • The 2-cycle scheduler can find many instructions
    to issue (e.g. vortex)
  • → MOP scheduling complements 2-cycle scheduling
  • Overall performance is less sensitive to code
    layout

31
MOP scheduling performance (relaxed atomicity &
scalability constraints)
32 IQ / 128 ROB
  • Benefits from both relaxed atomicity and
    scalability constraints at a coarser processing
    granularity
  • → Pipelined 2-cycle MOP scheduling performs
    comparably to or better than atomic scheduling

32
Conclusions
  • Instruction-centric hardware designs impose
    microarchitectural design constraints
  • HW structures are built to match an instruction's
    specifications
  • Controls occur at every instruction boundary
  • Tradeoffs at different processing granularities
  • Control and Resource
  • Varying the processing granularity exposes greater
    opportunities for high-performance,
    complexity-effective microarchitectures
  • Finer-grained processing: Half-price architecture
  • Coarser-grained processing: Macro-op Scheduling

33
Future research: Revisiting ILP
Goal: keeping resources as busy as possible
  • Ways to extract Instruction-level parallelism
  • OoO execution
  • may not be scalable to future processors due to
    complexity
  • VLIW
  • Binary compatibility matters
  • High overhead of dynamic binary translation
  • vulnerable to unexpected dynamic events
    (distortion in sets of parallel insts)
  • Stripping horizontal slices from a program

34
Future research: Exploiting Instruction-level
Serialism!
  • Finding vertical slices (chains of dependent
    insts) is easier
  • Executions are serial in nature
  • Light-weight conversions in HW / run-time binary
    translation
  • Less vulnerable to dynamic events (good for
    caching prescheduled groups)
  • A collection of vertical slices extracts
    parallelism
  • Let the machine find the next vertical slices to
    issue, at a slower rate
  • Increases window size, scheduling slack and
    bandwidth

[Diagram: instruction-centric OoO (Exe BW 4, 1 cycle) vs. coarser-grained parallel EXE and coarser-grained serial EXE (Exe BW 4, 2 issue slots, 2 cycles)]
35
Applied to MOPs (Macro-op Execution)
  • 2-wide, 2xMOP
  • Conventional 4-wide

36
MOP execution: Performance
  • Initial Results
  • Pipelined scheduling, fewer issue/payload/RF
    ports, simpler bypass
  • Achieve wider execution bandwidth with narrower
    structures

37
Future research: Parallelism, Granularity and ILS
  • Goal: a better implementation ISA
  • Light-weight conversion from U-ISA to I-ISA
  • Easy to maintain the original sequential program
    semantics
  • Hardware complexity and power consumption
  • Move the burden of timing-critical decisions to
    offline
  • Good front-end code density, fewer operations to
    process
  • Performance
  • Adaptability to run-time environments
  • Achieving max ILP extractable
  • → Vertically-long instruction word?
  • Coarser-granular instruction sets that exploit
    ILS
  • Run-time binary translation / dynamic HW
    construction
  • Granularity and dimension of instruction word
  • Impact on the native ILP
  • Underlying HW

38
Thesis Research Contributions (infomercial)
  • Speculative Decode (ISCA02)
  • Attacking the problems with value-based dynamic
    optimizations under speculative scheduling
  • Half-price Architecture (ISCA03)
  • Operand-centric microarchitecture designs
  • Macro-op Scheduling (MICRO03)
  • Coarser-grained instruction scheduling
  • Studies on Scheduling Replay Schemes (HPCA04)
  • Addresses the deficiencies in the literature
  • Scalable selective scheduling replay mechanisms

39
Questions??
40
Macro-op scheduling on x86 (swiped from Shiliang
Hu's results)
  • x86vm (under development)
  • x86 interpreter / functional simulator based on
    BOCHS 2.0.2
  • Cracks x86 into RISC-style ops (proprietary
    mapping)
  • Timing simulator for the detailed microarchitecture
    is under construction
  • x86 → micro-ops → MOPs
  • Assumes dynamic binary translation
  • Allows grouping of SS and SM (→ needs
    consideration)
  • Within / across x86 instructions
  • Does not allow grouping across conditional BR /
    indirect JMP
  • Dependent MOPs only

41
Grouped RISC ops (x86)
[Chart: grouped ops per benchmark, with the fraction of
2-cycle-scheduling-unfriendly ops remaining]
  • 57% of operations are grouped → 28% reduction in
    scheduling units
  • leaving less than 5% of 2-cycle-scheduling-
    unfriendly operations
  • Over 95% of MOPs are captured within 3 micro-ops
  • 66% are consecutive operations
  • 71% of MOPs are created across x86 instructions
  • not a reverse process of RISC op cracking

42
MOP detection: MOP pointer generation
  • MOP pointers (4 bits per instruction): a control
    bit and a 3-bit offset
  • Control bit (1 bit)
  • captures up to 1 control discontinuity
  • Offset bits (3 bits)
  • instruction count from head to tail
    (an encode/decode sketch follows the example)

MOP pointer example (control|offset per instruction):
0|011  add r1 ← r2, r3
0|000  lw  r4 ← 0(r3)
1|010  and r5 ← r4, r2
0|000  bez r1, 0xff (taken)
0|000  sub r6 ← r5, 1
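
The 4-bit encoding is simple enough to write down. A hedged C sketch, assuming the control bit sits above the 3 offset bits (the slide does not fix the bit order):

#include <stdint.h>
#include <stdio.h>

static uint8_t encode_mop_ptr(int crosses_branch, int offset) {
    return (uint8_t)(((crosses_branch & 1) << 3) | (offset & 7));
}

static void decode_mop_ptr(uint8_t p, int *crosses_branch, int *offset) {
    *crosses_branch = (p >> 3) & 1;
    *offset         = p & 7;
}

int main(void) {
    /* the "1|010" pointer above: tail 2 insts away, across the taken branch */
    int cb, off;
    decode_mop_ptr(encode_mop_ptr(1, 2), &cb, &off);
    printf("control=%d offset=%d\n", cb, off);  /* control=1 offset=2 */
    return 0;
}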
43
Sequential RF access
  • Remove ½ of the register read ports
  • Only a single read port per issue slot
  • 0- or 1-source instructions are processed without
    any restriction
  • Sequentially access the single port twice when 2
    values are needed
  • Back-to-back issue: reading values off the
    bypass
  • Back-to-back issue ensures 0 or 1 register read
    port accesses
  • Non-back-to-back issue incurs sequential RF
    access
  • The scheduler creates a bubble to give the
    instruction a time window to access the RF twice
    (sketched below)
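
A sketch of the bubble itself, with the per-slot bookkeeping as an assumed structure: a selected instruction that needs two reads holds its slot's single port for one extra cycle, so select simply skips that slot in the next cycle.

#define ISSUE_WIDTH 4  /* assumed machine width */

static int busy_until[ISSUE_WIDTH]; /* last cycle each slot's RF port is held */

int slot_free(int slot, int cycle) {  /* consulted by the select logic */
    return cycle > busy_until[slot];
}

void issue_to_slot(int slot, int cycle, int rf_reads) {
    /* two RF reads occupy the single port for two cycles, which is the
     * one-cycle bubble the scheduler creates behind this instruction */
    busy_until[slot] = (rf_reads == 2) ? cycle + 1 : cycle;
}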

44
½ technique: Sequential RF access
  • Scheduler changes for sequential RF access
  • Sequential RF access example

ADD r1, r2, r3
SUB r3, r4, r5
XOR r5, 1, r6
45
Inserting MOPs into issue queue
[Pipeline diagram: the slide-18 overview, highlighting the issue queue insert stage between MOP formation and wakeup/select]
  • Inserting instructions across different insert
    groups

48
It's about granularity
  • Instruction-granular hardware design
  • HW structures are built to match an instruction's
    specifications
  • Controls occur at every instruction boundary
  • Instruction granularity may impose constraints on
    the hardware design space
  • Relaxing the constraints at different processing
    granularities

[Diagram: processing-granularity spectrum, finer to coarser: operand (Half-price architecture, ISCA'03) ← instruction (conventional) → macro-op (Coarser-granular architecture, MICRO'03)]
49
Register file complexity
  • Overdesign in the register file
  • 2x read ports for two source operands
  • Superscalar processors need the RF to be heavily
    multiported
  • Area increases quadratically, latency increases
    linearly
  • Two read ports are not fully utilized
  • 0- / 1-source instructions do not require two
    read ports
  • Many instructions frequently get values off the
    bypass path
  • Speeding up the RF
  • Reducing the number of register entries
  • Hierarchical register file (Cruz, Borch,
    Balasubramonian, ...)
  • Reducing the number of ports
  • Fewer RF ports + crossbar (Balasubramonian et al.,
    Park et al.)
  • Half-price technique: Sequential RF Access

50
Attacking scheduling loop constraints
  • Scheduling atomicity
  • Speculation / pipelining
  • Grandparent scheduling, Select-free scheduling
  • Poor scalability
  • Low-complexity scheduling logic
  • FIFO-style window, Data-flow based window
  • Judicious window scaling
  • Segmented windows, WIB
  • Issue queue entry sharing
  • AMD K7 (MOP), Intel Pentium M (uop fusion)
  • → These overcome atomicity and scalability in
    isolation
  • Let's step back and see the problem from a
    different perspective

51
Future research: Exploiting Instruction-level
Serialism!
  • Finding vertical slices (chains of dependent
    insts) is easier
  • Executions are serial in nature
  • Light-weight conversions in HW / run-time binary
    translation
  • Less vulnerable to dynamic events (good for
    caching prescheduled groups)
  • A collection of vertical slices extracts
    parallelism
  • Let the machine find the next vertical slices to
    issue, at a slower rate
  • Increases window size, scheduling slack and
    bandwidth

[Diagram: serial-slice execution tolerating a dynamic event (e.g. cache miss): Exe BW 4, scheduling loop 3 cycles, issue BW 2 slices / cycle]