Title: Computer Logic Design
1 COM515 Advanced Computer Architecture
Lecture 5. Dynamic Scheduling II
Prof. Taeweon Suh
Computer Science Education, Korea University
2 Modern Processors
- Branch prediction results in speculative execution
- Speculative instructions (if wrongly speculated) must not alter the architectural state
  - Architectural registers
  - Memory
- Requirement of precise exceptions/interrupts
Prof. Sean Lee's Slide
3 Modern Out-of-Order Core
[Figure: out-of-order core block diagram with RAT, Reservation Station, ROB, architectural register file, and LSQ; arrow label: "Allocate instructions"]
- Register Alias Table (RAT) renames architectural registers
- Reservation Station issues instructions to functional units
- Reorder Buffer (ROB) maintains state information (physical registers) for precise interrupts and speculative execution
- Load Store Queue (LSQ) maintains memory access ordering
Prof. Sean Lee's Slide
4 Register Renaming
[Figure: architectural registers R0 to R7]
- No false dependencies!
- Sandy Bridge: 160 physical registers (PRs) for INT, 144 PRs for FP
Adapted from Prof. G. Loh's Slides
5 Register Renaming
- Original instruction: Dest = Src1 op Src2
- Mapping mechanism
  - Src1 → TagS1
  - Src2 → TagS2
  - Dest → TagD
- Renamed instruction: TagD = TagS1 op TagS2
- Repeat for each instruction
Adapted from Prof. G. Loh's Slides
6 Register Alias Table (RAT)
- Use a lookup table for renaming
- One entry per architectural register
- Each entry maps to the most recent version of the architectural register, which could be in the
  - Physical register file
  - Architectural register file
Prof. Sean Lee's Slide
7 RAT Example
Free physical regs     Original         Renamed
T13, T14, T15, T16     R1 = R2 op R3    T13 = R2 op R3
T14, T15, T16          R5 = R4 op R1    T14 = R4 op T13
T15, T16               R1 = R1 op R5    T15 = T13 op T14
T16                    R2 = R5 / R1     T16 = T14 / T15
Adapted from Prof. G. Loh's Slides
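The RAT example above can be replayed in a few lines of Python. This is only a minimal sketch, not the slides' implementation: the function name `rename`, the `(dest, src1, src2)` tuple encoding, and the convention that an unmapped register is still read from the architectural register file are assumptions made for illustration.

```python
def rename(instrs, free_regs):
    """Rename (dest, src1, src2) tuples using a RAT and a free list.

    Assumption: a register with no RAT entry still lives in the
    architectural register file, so its own name is used as the tag.
    """
    rat = {}                 # architectural register -> physical tag
    free = list(free_regs)   # free physical registers, allocated in order
    renamed = []
    for dest, src1, src2 in instrs:
        # Sources are looked up before the destination is remapped.
        tag1 = rat.get(src1, src1)
        tag2 = rat.get(src2, src2)
        tag_d = free.pop(0)  # new physical register for the destination
        rat[dest] = tag_d
        renamed.append((tag_d, tag1, tag2))
    return renamed, rat, free


program = [("R1", "R2", "R3"), ("R5", "R4", "R1"),
           ("R1", "R1", "R5"), ("R2", "R5", "R1")]
renamed, rat, free = rename(program, ["T13", "T14", "T15", "T16"])
for row in renamed:
    print(row)
```

Each printed tuple is (TagD, TagS1, TagS2) for one instruction; the operator is left out because it plays no part in renaming.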
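8 Superscalar Rename
[Figure: four instructions renamed in one cycle through the RAT: R1 = R2 op R3; R4 = R5 op R7; R3 = R0 / R2; R5 = Ld 12(R6). RAT outputs shown: T16, T39, T14, T5 and T23, T7, T16, X]
- Don't rename immediates
- For an N-wide superscalar: 2N RAT read ports, N RAT write ports
Prof. Sean Lee's Slide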
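9 Intra-Group Dependencies
[Figure: rename group R2 = R2 op R3; R4 = R5 op R7; R3 = R0 / R2; R5 = Ld 12(R6). RAT outputs shown: T16, T39, T14, T5 and T23, T7, T16, X. New destination tags T10, T31, T19, T6 come from the free register pool]
Prof. Sean Lee's Slide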
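10 Intra-Group Dependencies
[Figure: rename group R1 = R2 op R1; R2 = R1 op R2; R1 = R2 / R1; R1 = R2 >> R1. The RAT alone returns T16, T34 / T34, T16 / T16, T34 / T16, T34 for the four source pairs; the figure then shows the correct final renamed registers]
Modified from Prof. Sean Lee's Slide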
11 Resolving Intra-Group Dependencies
[Figure: Inst 0 to Inst 3 each present Src L, Src R, and Dest to the RAT; the RAT outputs (T0L, T0R, T1L, T1R, T2L, T2R, T3L, T3R) pass through the Intra-Group Dependency Checker together with Pdst0, Pdst1, and Pdst2 from the free register pool]
Adapted from Prof. G. Loh's Slides
12 Intra-Group Dependency Checking
[Figure: dependency-check logic with inputs dst0, dst1, dst2 and Pdst0, Pdst1, Pdst2]
Adapted from Prof. G. Loh's Slides
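The dependency check can be sketched in software as follows. This is an illustrative model under stated assumptions, not the circuit from the figure: the function name `resolve_sources`, the tuple encoding, and the free-pool tags P0 to P3 are hypothetical, and the instruction group is one reading of the four-instruction example on the earlier Intra-Group Dependencies slide (RAT maps R1 to T34 and R2 to T16).

```python
def resolve_sources(group, rat_tags, pdst):
    """Pick the right tag for every source in a rename group.

    group    : list of (dest, src_left, src_right) architectural names
    rat_tags : list of (tag_left, tag_right) read from the RAT in parallel
    pdst     : new physical destination tag for each instruction
    """
    resolved = []
    for i, (_, src_l, src_r) in enumerate(group):
        tag_l, tag_r = rat_tags[i]
        # Compare against every older instruction in the group; scanning
        # oldest to youngest lets the nearest older writer win.
        for j in range(i):
            if group[j][0] == src_l:
                tag_l = pdst[j]
            if group[j][0] == src_r:
                tag_r = pdst[j]
        resolved.append((tag_l, tag_r))
    return resolved


group = [("R1", "R2", "R1"), ("R2", "R1", "R2"),
         ("R1", "R2", "R1"), ("R1", "R2", "R1")]
rat_tags = [("T16", "T34"), ("T34", "T16"), ("T16", "T34"), ("T16", "T34")]
pdst = ["P0", "P1", "P2", "P3"]  # hypothetical free-pool tags
resolved = resolve_sources(group, rat_tags, pdst)
print(resolved)
```

Only the first instruction keeps both tags from the RAT; every later source that matches an older destination in the group is overridden by that instruction's Pdst.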
13 Mapping Selection
[Figure: rename group R1 = R2 op R1; R2 = R1 op R2; R1 = R2 / R1; R1 = R2 >> R1, with one R1 mapping highlighted]
- Only this mapping for R1 should be written into the RAT
- Condition: use the mapping if the instruction is the last writer to the register
Adapted from Prof. G. Loh's Slides
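The last-writer condition can be written down directly. The sketch below is a software illustration only; `select_rat_updates` and the tags P0 to P3 are hypothetical names, and the destination list assumes a group whose four instructions write R1, R2, R1, R1 in program order.

```python
def select_rat_updates(dests, pdst):
    """Return {arch_reg: tag}, keeping only the last writer of each register."""
    updates = {}
    for i, dest in enumerate(dests):
        # An instruction is the last writer if no younger instruction
        # in the same group has the same destination.
        is_last_writer = dest not in dests[i + 1:]
        if is_last_writer:
            updates[dest] = pdst[i]
    return updates


dests = ["R1", "R2", "R1", "R1"]   # destinations in program order
pdst = ["P0", "P1", "P2", "P3"]    # hypothetical free-pool tags
updates = select_rat_updates(dests, pdst)
print(updates)
```

P0 and P2 are still allocated and used by consumers inside the group, but they never become the RAT's mapping for R1.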
14 Issue with Imprecise Interrupt
lw  r5, 8(r10)
add r10, r9, r8
add r12, r10, r7
- add instructions take one cycle
- E.g.,
  - Load (left side) induces a data page fault
  - If out-of-order completion is allowed
    - r10 and r12 will be modified
    - Wrong values will be used by the re-issued load
- Interrupt classes
  - Program interrupts (exceptions or traps)
  - External interrupts (asynchronous)
Modified from Prof. Sean Lee's Slide
15 Precise Interrupts
- To reflect a sequential architecture model → serially correct (think about a single-issue, non-pipelined processor)
- Keep precise state of an execution
  - All instructions before the interrupted instruction must be completed
  - The state should appear as if no instruction issued after the interrupted instruction
  - The interrupted PC should be presented to the interrupt handler (restartable)
- Similar to branch misprediction handling
  - Out-of-order execution makes the ordering hard
  - Undo what comes after an interrupt
Prof. Sean Lee's Slide
16 Why Support Precise Interrupts
- Need to maintain a precise state (for recovery)
- Software debugging
- I/O or timer interrupts
- Virtual memory (page fault)
- Instruction emulation
- Virtual machines
Prof. Sean Lee's Slide
17 Support Precise Interrupt
- Buffer results
- Can reconstruct the scenario (state) as sequential execution
- Restart from saved PC with saved PC state
Prof. Sean Lee's Slide
18 Reorder Buffer (ROB) [Smith & Pleszkun 85, 88]
- Architectural register file keeps the in-order state
- Reorder Buffer (ROB)
  - A circular buffer
  - Contains all in-flight instructions
  - Buffers the lookahead state
  - In-order allocation/deallocation with head/tail pointers
- When an exception occurs
  - Halt instruction issue
  - Revert to the in-order state using the RF and discard ROB results
- Also used for branch misprediction recovery
- Pentium Pro/II/III integrates the physical register file within the ROB
- Pentium 4 decouples the ROB and the physical register file
Modified from Prof. Sean Lee's Slide
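19 ROB (with physical registers)
[Figure: circular ROB; each entry holds the fields V, PC, RegDst, Data (physical register), Done?, Spec?, and Exp event. Head points to the oldest instruction; Tail points to the next instruction to be allocated]
- Sandy Bridge: 168-entry ROB
Prof. Sean Lee's Slide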
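A circular ROB with in-order allocation and in-order commit can be modeled roughly as below. This is a minimal sketch under simplifying assumptions: the class and method names are hypothetical, only the PC, RegDst, Data, and Done fields are kept, and exceptions and speculation bits are left out.

```python
class ReorderBuffer:
    """Circular buffer: allocate at the tail, commit from the head."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0    # oldest instruction
        self.tail = 0    # next entry to be allocated
        self.count = 0

    def allocate(self, pc, reg_dst):
        if self.count == len(self.entries):
            return None  # ROB full: dispatch has to stall
        idx = self.tail
        self.entries[idx] = {"pc": pc, "reg_dst": reg_dst,
                             "data": None, "done": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return idx

    def complete(self, idx, data):
        # Execution may finish out of order; the result waits in the ROB.
        self.entries[idx]["data"] = data
        self.entries[idx]["done"] = True

    def commit(self, arf):
        # Only the head may update the architectural register file.
        committed = []
        while self.count and self.entries[self.head]["done"]:
            entry = self.entries[self.head]
            arf[entry["reg_dst"]] = entry["data"]
            committed.append(entry["pc"])
            self.entries[self.head] = None
            self.head = (self.head + 1) % len(self.entries)
            self.count -= 1
        return committed


arf = {"R1": 1, "R2": 2}
rob = ReorderBuffer(4)
a = rob.allocate(0xA000, "R1")
b = rob.allocate(0xA004, "R2")
rob.complete(b, 4)           # the younger instruction finishes first
first = rob.commit(arf)      # nothing commits: the head is not done
rob.complete(a, 11)
second = rob.commit(arf)     # both commit, in program order
```

The younger result sits in the ROB until the older instruction is done, which is what keeps the architectural register file in the in-order state.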
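20 Handling Precise Interrupts
[Figure: ROB and ARF snapshot. ROB, oldest first: R1 = R1 op 10 (done, data 11); xA004: R2 = R2 op 2 (RegDst R2); xA008: FR1 = FR2 / 0.0 (RegDst FR1). ARF: R1 = 1 (overwritten with 11), R2 = 2, R3 = 3, R4 = 4, ..., R31]
Prof. Sean Lee's Slide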
21 Handling Precise Interrupts
[Figure: ROB and ARF snapshot. ROB, oldest first: xA004: R2 = R2 op 2 (RegDst R2); xA008: FR1 = FR2 / 0.0 (RegDst FR1); xA00C: R3 = R3 op 1 (RegDst R3). ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, ..., R31]
Prof. Sean Lee's Slide
22 Handling Precise Interrupts
[Figure: ROB and ARF snapshot. ROB, oldest first: xA004: R2 = R2 op 2 (RegDst R2); xA008: FR1 = FR2 / 0.0 (RegDst FR1); xA00C: R3 = R3 op 1 (RegDst R3, done, data 4); xA010: R4 = R4 op 2 (RegDst R4). ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, ..., R31]
Prof. Sean Lee's Slide
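23 Handling Precise Interrupts
[Figure: ROB and ARF snapshot. ROB, oldest first: xA004: R2 = R2 op 2 (RegDst R2, done, data 4); xA008: FR1 = FR2 / 0.0 (RegDst FR1, Exp event 0010); xA00C: R3 = R3 op 1 (RegDst R3, done, data 4); xA010: R4 = R4 op 2 (RegDst R4, done, data 8); xA014: FR4 = FR4 op 2.0 (RegDst FR4). ARF: R1 = 11, R2 = 2 (overwritten with 4), R3 = 3, R4 = 4, ..., R31]
Prof. Sean Lee's Slide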
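24 Handling Precise Interrupts
[Figure: ROB and ARF snapshot. ROB, oldest first: xA008: FR1 = FR2 / 0.0 (RegDst FR1, Exp event 0010); xA00C: R3 = R3 op 1 (RegDst R3, done, data 4); xA010: R4 = R4 op 2 (RegDst R4, done, data 8); xA014: FR4 = FR4 op 2.0 (RegDst FR4). ARF: R1 = 11, R2 = 4, R3 = 3, R4 = 4, ..., R31]
Prof. Sean Lee's Slide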
25 Handling Precise Interrupts
[Figure: ROB and ARF snapshot. ROB, oldest first: xA008: FR1 = FR2 / 0.0 (RegDst FR1, Exp event 0010); xA00C: R3 = R3 op 1 (done, data 4); xA010: R4 = R4 op 2 (done, data 8); xA014: FR4 = FR4 op 2.0. ARF: R1 = 11, R2 = 4, R3 = 3, R4 = 4, ..., R31]
- These values (the results behind the excepting instruction) were not committed into the RF
- Back up the PC and the current RF
- Depending on the exception, the process will either abort or the instruction will be resumed from this excepting instruction
Prof. Sean Lee's Slide
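The sequence of snapshots above can be condensed into a small retirement loop. The sketch below is an illustration under assumptions, not the hardware algorithm: the function name `retire`, the list-of-dicts ROB, and the field names are hypothetical, and the entry values are one reading of the figures (the excepting instruction at xA008 carries event code 0010).

```python
def retire(rob, arf):
    """Commit from the head of `rob` (oldest first) until an exception.

    rob : list of dicts in program order with keys
          pc, reg_dst, data, done, exc (exception code or None)
    Returns (excepting_pc, discarded_entries); excepting_pc is None when
    no exception has reached the head.
    """
    while rob and rob[0]["done"]:
        head = rob[0]
        if head["exc"] is not None:
            # Precise point: everything older has committed, and nothing
            # from the excepting instruction onward touches the ARF.
            discarded = list(rob)
            rob.clear()
            return head["pc"], discarded
        arf[head["reg_dst"]] = head["data"]
        rob.pop(0)
    return None, []


arf = {"R1": 11, "R2": 2, "R3": 3, "R4": 4}
rob = [
    {"pc": 0xA004, "reg_dst": "R2",  "data": 4,    "done": True,  "exc": None},
    {"pc": 0xA008, "reg_dst": "FR1", "data": None, "done": True,  "exc": 0b0010},
    {"pc": 0xA00C, "reg_dst": "R3",  "data": 4,    "done": True,  "exc": None},
    {"pc": 0xA010, "reg_dst": "R4",  "data": 8,    "done": True,  "exc": None},
    {"pc": 0xA014, "reg_dst": "FR4", "data": None, "done": False, "exc": None},
]
exc_pc, discarded = retire(rob, arf)
```

After the call, R2 holds 4 while R3 and R4 keep their old values, even though their new results were already computed; `exc_pc` is the PC handed to the handler.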
26 Handling Speculative Execution
[Figure: ROB and ARF snapshot. ROB, oldest first: R1 = R1 op 10; xB004: BEQ R1,R0,L1. ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4, ..., R31]
Prof. Sean Lee's Slide
27 Handling Speculative Execution
[Figure: ROB and ARF snapshot. ROB, oldest first: R1 = R1 op 10; xB004: BEQ R1,R0,L1; then speculative entries (Spec = 1) xC100: R2 = R3 << 2 (RegDst R2, done, data 12); xC104: R1 = R2 op R3 (RegDst R1); xC108: BEQ R3,R0,L1; xD2B0: R1 = R7 op 1 (RegDst R1, done, data 8). ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4, ..., R31]
- BEQ R1, R0, L1 is predicted TAKEN
Modified from Prof. Sean Lee's Slide
28 Handling Speculative Execution
[Figure: ROB and ARF snapshot, labeled "BEQ Misprediction". ROB, oldest first: xB004: BEQ R1,R0,L1; then speculative entries (Spec = 1) xC100: R2 = R3 << 2 (RegDst R2, done, data 12); xC104: R1 = R2 op R3 (RegDst R1); xD2AC: BEQ R3,R0,L1; xD2B0: R1 = R7 op 1 (RegDst R1, done, data 8). ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, ..., R31]
- BEQ R1, R0, L1 is resolved: actually NOT TAKEN!
Prof. Sean Lee's Slide
29 Handling Speculative Execution
[Figure: empty ROB. ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, ..., R31]
- Retire the branch and clear all entries after the mis-speculated branch
Prof. Sean Lee's Slide
30 Handling Speculative Execution
[Figure: ROB and ARF snapshot. ROB: xB008: R2 = R5 << 4 (RegDst R2). ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, ..., R31]
- Continue execution from the correct path (fall-through in this case)
Prof. Sean Lee's Slide
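The misprediction recovery shown in these snapshots amounts to dropping every ROB entry younger than the branch and refilling from the correct path. A minimal sketch, assuming a list-based ROB in program order; `squash_after` and the field names are hypothetical, and the PCs follow one reading of the figures.

```python
def squash_after(rob, branch_index):
    """Remove and return all ROB entries younger than the branch."""
    squashed = rob[branch_index + 1:]
    del rob[branch_index + 1:]   # the tail moves back to just after the branch
    return squashed


rob = [
    {"pc": 0xB004, "spec": False},  # BEQ R1,R0,L1 (predicted taken)
    {"pc": 0xC100, "spec": True},   # wrong-path instructions below
    {"pc": 0xC104, "spec": True},
    {"pc": 0xC108, "spec": True},
    {"pc": 0xD2B0, "spec": True},
]
squashed = squash_after(rob, 0)          # branch resolved: not taken
rob.append({"pc": 0xB008, "spec": False})  # fall-through path resumes
remaining_pcs = [entry["pc"] for entry in rob]
```

Because none of the squashed entries had committed, the ARF needs no repair; only the ROB contents (and the RAT, discussed next) must be restored.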
31 RAT Recovery
[Figure: instruction stream containing a branch (br), with the positions of the ARF state and the RAT state marked; after the flush the RAT is marked "?!?"]
- ARF state corresponds to the state prior to the oldest non-committed instruction
- As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
- On a branch misprediction, wrong-path instructions are flushed from the machine
- The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state
Adapted from Prof. G. Loh's Slide
32 Solution: Stall and Drain
[Figure: instruction stream containing a branch (br), with the ARF and RAT positions marked; the wrong-path RAT is crossed out]
- Correct-path instructions from fetch can't rename because the RAT is wrong
- Allow all instructions to execute and commit; the ARF then corresponds to the last committed instruction
- The ARF now corresponds to the state right before the next instruction to be renamed (foo)
- Reset the RAT so that all mappings refer to the ARF
- Resume renaming the new correct-path instructions from fetch
- Pros: very simple to implement
- Cons: performance loss due to stalls
Prof. Sean Lee's Slide
33 Another Solution: Checkpointing
[Figure: instruction stream with several branches (br), each paired with its own RAT copy; label: "Checkpoint Free Pool"]
- At each branch, make a copy of the RAT (register mapping at the time of the branch)
- On a misprediction
  1. Flush wrong-path instructions
  2. Deallocate RAT checkpoints
  3. Recover RAT from checkpoint
  4. Resume renaming
Prof. Sean Lee's Slide
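RAT checkpointing can be sketched as a stack of saved mappings. The model below is a simplification under stated assumptions: the class `RenameState`, its method names, and the branch identifiers are hypothetical, and the free pool is restored wholesale from the checkpoint, which ignores registers freed by commits between the branch and its resolution.

```python
class RenameState:
    """RAT plus free pool, with one checkpoint per in-flight branch."""

    def __init__(self, rat, free_pool):
        self.rat = dict(rat)
        self.free_pool = list(free_pool)
        self.checkpoints = []   # oldest branch first

    def rename_dest(self, arch_reg):
        tag = self.free_pool.pop(0)
        self.rat[arch_reg] = tag
        return tag

    def checkpoint(self, branch_id):
        # Copy the mapping as it stands when the branch is renamed.
        self.checkpoints.append(
            (branch_id, dict(self.rat), list(self.free_pool)))

    def recover(self, branch_id):
        for i, (bid, rat, free_pool) in enumerate(self.checkpoints):
            if bid == branch_id:
                self.rat = dict(rat)
                self.free_pool = list(free_pool)
                # This checkpoint and all younger ones are deallocated.
                del self.checkpoints[i:]
                return
        raise KeyError(branch_id)


state = RenameState({"R1": "T1", "R2": "T2"}, ["T10", "T11", "T12"])
state.checkpoint("br0")
state.rename_dest("R1")      # wrong-path rename: R1 -> T10
state.checkpoint("br1")
state.rename_dest("R2")      # wrong-path rename: R2 -> T11
state.recover("br0")         # br0 was mispredicted
```

Recovery takes a single step regardless of how many wrong-path instructions were renamed, at the cost of storing one RAT copy per in-flight branch.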
34 Modern Instruction Scheduler
- At dispatch, instructions read all available operands from the register files and store a copy in the scheduler (Tomasulo's algorithm)
- Unavailable operands will be captured from the functional unit outputs (CDB broadcast)
- When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files (wakeup and select)
[Figure: Fetch/Dispatch reads the ARF and PRF/ROB and fills the Instruction Scheduler, which issues to the Functional Units; results return through the Bypass and the physical register update]
Adapted from Prof. G. Loh's Slide
35 Instruction Scheduling: Wakeup and Select
- Wakeup logic
  - Notifies waiting instructions when the data dependencies of their input operands are resolved
  - Wakes up instructions with zero remaining input dependencies
- Select logic
  - Chooses and fires ready instructions
  - Deals with structural hazards
- Wakeup-select is likely on the critical path
  - Associative match
Prof. Sean Lee's Slide
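The wakeup and select steps can be modeled as two small functions. This is a rough sketch, not a description of any real scheduler: the function names, the dict fields, the tags T50 to T52, and the "first ready entry wins" select policy are all assumptions made for illustration.

```python
def wakeup(entries, broadcast_tags):
    """Associative match: clear every waiting source tag that was broadcast."""
    for entry in entries:
        entry["waiting"] -= set(broadcast_tags)


def select(entries, issue_width):
    """Fire up to issue_width ready entries, earliest position first."""
    ready = [e for e in entries if not e["waiting"] and not e["issued"]]
    chosen = ready[:issue_width]   # structural limit: issue_width FUs
    for entry in chosen:
        entry["issued"] = True
    return [entry["dest"] for entry in chosen]


entries = [
    {"dest": "T50", "waiting": {"T39"},        "issued": False},
    {"dest": "T51", "waiting": {"T39", "T17"}, "issued": False},
    {"dest": "T52", "waiting": set(),          "issued": False},
]
cycle1 = select(entries, 1)   # only T52 has no pending sources
wakeup(entries, ["T39"])      # T39 is broadcast on the tag bus
cycle2 = select(entries, 1)   # T50 is now ready; T51 still waits for T17
wakeup(entries, ["T17"])
cycle3 = select(entries, 1)
```

In hardware the match in `wakeup` is done by comparators in every entry at once, which is why the wakeup-select loop tends to sit on the critical path.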
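36 Scalar Scheduler (Issue Width = 1)
[Figure: reservation-station entries whose source tags (among them T14, T16, T8, T6, T17, T42, T15, and T39) are compared against a single Tag Broadcast Bus; the Select Logic sends the chosen instruction to the Execute Logic]
From Prof. G. Loh's Slide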
37 Superscalar Scheduler (Issue Width = 4)
[Figure: snapshot of the RS (only 4 entries shown); source tags (among them T8, T6, T42, T17, T15, and T39) are compared against Tag Broadcast Bus [3..0]; the Select Logic sends the chosen instructions to the Execute Logic]
Adapted from Prof. G. Loh's Slide
38 Selection Logic
- Select ready instructions to be issued
- Goal: reduce the height of the data-flow graph (DFG)
- Methods
  - Location-based (e.g., leftmost ready first)
    - Allows simple, faster hardware
  - Oldest ready first
    - Can use location-based (in-order issue) with compaction
    - Compact the issue window to the left every time instructions are issued, inserting new instructions at the right end
    - Can be slow and complex
Prof. Sean Lee's Slide
39 Simple Select Logic Implementation
[Figure: tree of arbiter cells over the Reservation Station; policy: leftmost ready first]
- The Enable signal to the root cell is high whenever the functional unit is ready to execute an instruction
- The AnyReq signal is raised if any of the input Req signals is high
Modified from Prof. Sean Lee's Slide
Palacharla Dissertation
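The Enable/AnyReq behavior can be imitated with a two-level tree of 4-input arbiter cells. The sketch below is a software analogy under assumptions: `arbiter_cell`, `select_tree`, the fan-in of 4, and the fixed two-level depth are illustrative choices, not details taken from the figure.

```python
def arbiter_cell(reqs, enable):
    """One cell: grant the leftmost request when enabled; report AnyReq."""
    grants = []
    higher_priority_req = False
    for req in reqs:
        grants.append(enable and req and not higher_priority_req)
        higher_priority_req = higher_priority_req or req
    return grants, any(reqs)


def select_tree(reqs, fu_ready, fanin=4):
    """Two-level tree: leaf AnyReq signals go up, root grants come down."""
    groups = [reqs[i:i + fanin] for i in range(0, len(reqs), fanin)]
    # Upward pass: each leaf raises AnyReq if any of its inputs requests.
    any_reqs = [any(group) for group in groups]
    # The root is enabled whenever the functional unit is ready.
    root_grants, _ = arbiter_cell(any_reqs, fu_ready)
    # Downward pass: a root grant becomes the Enable of one leaf cell.
    grants = []
    for group, enable in zip(groups, root_grants):
        grants.extend(arbiter_cell(group, enable)[0])
    return grants


reqs = [False] * 16
for ready_entry in (5, 6, 12):
    reqs[ready_entry] = True
grants = select_tree(reqs, fu_ready=True)
no_grants = select_tree(reqs, fu_ready=False)
```

With entries 5, 6, and 12 requesting, only entry 5 is granted; when the functional unit is busy, the root is disabled and nothing is granted.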
40 Simple Select Logic Implementation
[Figure: arbiter tree over the Reservation Station]
Prof. Sean Lee's Slide
Palacharla Dissertation
41 Simple Select Logic Implementation
[Figure: arbiter tree over the Reservation Station, case "Multiple Ready Instruction Request"; each cell has inputs Req0 to Req3 and Enable, and outputs Grant0 to Grant3 and AnyReq]
Prof. Sean Lee's Slide
Palacharla Dissertation
42 Simple Select Logic Implementation
[Figure: arbiter tree over the Reservation Station, case "Selective Issue for One FU"; each cell has inputs Req0 to Req3 and Enable, and outputs Grant0 to Grant3 and AnyReq]
Prof. Sean Lee's Slide
Palacharla Dissertation
43 Issues to Distinctive Functional Units
[Figure: distributed instruction windows feeding the Integer Unit and the FPU]
- Distributed instruction windows (e.g., MIPS R10000 or Alpha 21264)
- Faster to have separate instruction schedulers for different instruction types
Prof. Sean Lee's Slide
44 Dual Issues to Multiple Units (e.g., 2 Adders)
[Figure: Req0 to Req3 enter the Selection Logic for Adder0, which produces Grant0 to Grant3; a second block, the Selection Logic for Adder1, follows it]
Prof. Sean Lee's Slide
Palacharla Dissertation
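One way to read the two stacked selection blocks is as a cascade: the second block sees the same requests with the first block's grant masked out. The sketch below illustrates that reading only; `leftmost` and `dual_select` are hypothetical names, and the masking scheme is an assumption rather than something the figure states.

```python
def leftmost(reqs):
    """Index of the leftmost asserted request, or None if there is none."""
    for i, req in enumerate(reqs):
        if req:
            return i
    return None


def dual_select(reqs):
    """Pick one entry for Adder0, then one different entry for Adder1."""
    grant0 = leftmost(reqs)
    # Mask the entry already granted to Adder0 before selecting for Adder1.
    masked = [req and i != grant0 for i, req in enumerate(reqs)]
    grant1 = leftmost(masked)
    return grant0, grant1


both = dual_select([False, True, False, True])
one = dual_select([False, False, True, False])
none = dual_select([False, False, False, False])
```

The cascade makes the second selection wait for the first, so the select delay grows with the number of identical units.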
45 Memory Disambiguation
- Can we undo stores?
  - Stores cannot be committed to memory until they are marked ready to retire
  - Completed stores are queued and waiting in a store queue or store buffer
- Disambiguate (and resolve) memory dependencies dynamically
Prof. Sean Lee's Slide
46 Memory Ordering
[Figure; source: Alpha 21264 HRM]
- A load to X bypassing an older load to X violates certain memory consistency models (e.g., sequential consistency)
- Load-load order trap replays
Prof. Sean Lee's Slide
47 Load Store Queue (LSQ)
[Figure: age-ordered ROB alongside a split LSQ made of a Store Queue and a Load Queue]
- Memory instructions are allocated into the LSQ in program order
- LSQ manages memory reference ordering
- Unified LSQ vs. split LSQ
- Sandy Bridge: 64 load buffers, 36 store buffers
Prof. Sean Lee's Slide
48 Issuing a Load for Execution
[Figure: Load Queue (columns: Issued?, age, address) and Store Queue (columns: Issued?, age, address, data) holding entries for addresses A, B, and C]
- Each load checks against older stores
  - Associative search
  - The search does not scale well, which makes it a performance issue
Prof. Sean Lee's Slide
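The check of a load against older stores can be sketched as a search of the store queue by age. This is a conservative, non-speculative model under assumptions: `issue_load`, the dict fields, the three possible outcomes, and the rule that an older store with an unknown address blocks the load are illustrative choices, not the policy of any specific processor.

```python
def issue_load(load_age, load_addr, store_queue):
    """Search older stores, youngest first.

    Returns ("forward", data) if the nearest older store matches,
            ("stall", None)   if an older store address is still unknown,
            ("memory", None)  if no older store conflicts.
    """
    older = [s for s in store_queue if s["age"] < load_age]
    for store in sorted(older, key=lambda s: s["age"], reverse=True):
        if store["addr"] is None:
            return ("stall", None)       # cannot rule out a conflict yet
        if store["addr"] == load_addr:
            return ("forward", store["data"])
    return ("memory", None)


store_queue = [
    {"age": 1, "addr": "A",  "data": 0x00000001},
    {"age": 2, "addr": "B",  "data": 0x12340000},
    {"age": 4, "addr": None, "data": None},   # address not computed yet
]
hit = issue_load(3, "B", store_queue)     # forwarded from the store of age 2
miss = issue_load(3, "C", store_queue)    # no older store to C
blocked = issue_load(5, "A", store_queue) # store of age 4 is unresolved
```

In hardware the loop is an associative (CAM) search over every store queue entry at once.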
49 Issuing a Load for Execution
[Figure: Load Queue (columns: Issued?, age, address) and Store Queue (columns: Issued?, age, address, data) holding entries for addresses A, B, and C]
- Implementation dependent: comprehensive size matching can be prohibitively expensive
- Simple method: forward when a larger store (word) precedes a smaller load (half)
Prof. Sean Lee's Slide
50 Issuing a Load for Execution
[Figure: Load Queue (columns: Issued?, age, address) and Store Queue (columns: Issued?, age, address, data); an older store of age 2 has an unknown address (???) and a younger load is marked "Speculatively issue for execution"]
- Loads can be issued speculatively to shorten latency (Alpha 21264, Pentium 4 (Prescott))
- A store, when its address is ready, checks newer loads in the Load Queue
- Replay is needed if the speculation turns out to be incorrect (e.g., Alpha's store-load replay)
Modified from Prof. Sean Lee's Slide
51 Store Checks Premature Loads
[Figure: Load Queue (columns: Issued?, age, address) and Store Queue (columns: Issued?, age, address, data); the store of age 2 resolves to address K, and the already issued load of age 3 also targets K. Label: "Conflict detected! Replay the load"]
- A store, when its address is ready, checks newer loads in the Load Queue
  - Associative search
- Replay is needed if the speculation turns out to be incorrect (e.g., Alpha's store-load replay)
Prof. Sean Lee's Slide
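The store-side check can be sketched as a filter over the load queue. The function name `loads_to_replay` and the dict fields are hypothetical, and the model is simplified: it replays every younger issued load to the same address and ignores the case where a load already received its data from an even younger store.

```python
def loads_to_replay(store_age, store_addr, load_queue):
    """Return the issued loads that are younger than the store and hit its address."""
    return [load for load in load_queue
            if load["issued"]
            and load["age"] > store_age
            and load["addr"] == store_addr]


load_queue = [
    {"age": 1, "addr": "A", "issued": True},
    {"age": 3, "addr": "K", "issued": True},   # issued before the store resolved
    {"age": 4, "addr": "K", "issued": False},  # not issued yet: nothing to undo
]
# The store of age 2 has just computed its address: K.
replays = loads_to_replay(2, "K", load_queue)
replay_ages = [load["age"] for load in replays]
```

Only the load of age 3 read memory too early; it and the instructions that consumed its value have to be re-executed.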
52 Issuing a Store for Execution
[Figure: Load Queue (columns: Issued?, age, address) and Store Queue (columns: Issued?, age, address, data) holding entries for addresses A and C]
- Shown above is the basic concept
  - Implementation dependent
- Do not allow a store to bypass a load, since it has little impact on performance
- Perform an associative search
Prof. Sean Lee's Slide
53 Issuing a Store for Execution
[Figure: Load Queue (columns: Issued?, age, address) and Store Queue (columns: Issued?, age, address, data); a load of age 5 to address C has not issued, and the store of age 6 to address C is marked "cannot issue for execution"]
Prof. Sean Lee's Slide
54 Load-Load Ordering
- Needed for
  - Multiprocessor support
  - Maintaining the memory consistency model
- Load-load trap invoked
  - Trap on the later, conflicted instructions
  - Replay
[Figure: Load Queue (columns: Issued?, age, address) with issued loads of age 5 (address C) and age 6 (address A); label: "Load-load trap"]
Prof. Sean Lee's Slide
56 Issue with Imprecise Interrupt
Left side:
lw  r5, 8(r10)
add r10, r9, r8
add r12, r10, r7
Right side:
L1: add r3, r1, r2
    add r4, r1, r4
    add r2, r4, r4
[Figure labels on the right side: "End of Non-Resident Page X", "Instruction Page Fault", "Start of Resident Page X+1"]
- add instructions take one cycle
- E.g.,
  - Load (left side) induces a data page fault
  - Add (right side) induces an instruction page fault
  - If out-of-order completion is allowed
    - r10, r12 (or r2, r4) will be modified
    - Wrong values will be used by the re-issued load
- Interrupt classes
  - Program interrupts (exceptions or traps)
  - External interrupts (asynchronous)
Prof. Sean Lee's Slide