Title: Superscalar Processors
1Superscalar Processors
2Scalar to Superscalar
- Scalar Processor: one instruction passes through each pipeline stage in each cycle
- Superscalar Processor: multiple instructions at each pipeline stage in each cycle
  - Wider pipeline
- Superpipelined Processor: decompose stages into smaller stages → more stages
  - Deeper pipeline
Baer p. 75
3Superscalar
- Front end (IF and ID)
  - Must fetch and decode multiple instructions per cycle
  - An m-way superscalar brings (ideally) m instructions per cycle into the pipeline
- Back end (EX, Mem and WB)
  - Must execute and write back several instructions per cycle
Baer p. 75
4Superscalar
- In-order (or static)
  - Instructions leave the front end in program order
- Out-of-order (or dynamic)
  - Instructions leave the front end, and execute, in an order different from the program order
  - WB is called the commit stage
  - Must ensure that the program semantics are followed
  - More complex design
Baer p. 76
5Limits to Superscalar Performance
- Superscalars rely on exploiting Instruction-Level Parallelism (ILP)
  - They remove WAR and WAW dependences
  - But the amount of ILP is limited by RAW (true) dependences
[Figure: data dependence graph over statements S0-S3 with RAW, WAR and WAW edges]
Baer p. 76
6Limits to Superscalar Performance
- Superscalars rely on exploiting Instruction-Level Parallelism (ILP)
  - They remove WAR and WAW dependences
  - But the amount of ILP is limited by RAW (true) dependences
[Figure: data dependence graph with a RAW edge between S0 and S1; other labels: RA, RB, RA]
Baer p. 76
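The three dependence kinds named above can be detected mechanically from the registers each instruction reads and writes. The sketch below is a minimal illustration, not taken from the slides: the `Instr` record, the helper names, and the four-statement program are all hypothetical, and the pairwise check ignores intervening redefinitions of a register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str     # register written
    srcs: tuple   # registers read


def dependences(earlier, later):
    """Dependence kinds from `earlier` to `later` (program order)."""
    kinds = set()
    if earlier.dest in later.srcs:
        kinds.add("RAW")   # later reads what earlier writes (true dependence)
    if later.dest in earlier.srcs:
        kinds.add("WAR")   # later overwrites a register earlier still reads
    if later.dest == earlier.dest:
        kinds.add("WAW")   # both write the same register
    return kinds


def dependence_edges(prog):
    """List (earlier, later, kind) for every dependent pair."""
    edges = []
    for a, earlier in enumerate(prog):
        for later in prog[a + 1:]:
            for kind in sorted(dependences(earlier, later)):
                edges.append((earlier.name, later.name, kind))
    return edges


# Hypothetical sequence; it is NOT the graph shown on the slide.
program = [
    Instr("S0", "R1", ("R2", "R3")),
    Instr("S1", "R4", ("R1", "R5")),
    Instr("S2", "R5", ("R6", "R7")),
    Instr("S3", "R1", ("R8", "R9")),
]

for edge in dependence_edges(program):
    print(edge)
```

In this hypothetical program only the S0 to S1 edge is a RAW dependence; the WAR and WAW edges are name dependences, which is why removing them leaves the RAW edges as the limit on ILP.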
7Limits to Superscalar Performance
- Complexity of logic to remove dependencies
- Designers predicted 8-way and 16-way superscalars
- We have 6-way superscalars and m is not likely to
grow
Baer p. 76
8Limits to Superscalar Performance: Number of Forward Paths
[Figure: forwarding paths in a 1-way pipeline]
Baer p. 76
9Limits to Superscalar Performance: Number of Forward Paths
[Figure: forwarding paths in a 2-way pipeline]
An m-way superscalar requires m^2 forwarding paths.
Paths may become too long for a signal to propagate within a single clock cycle.
Baer p. 76
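The quadratic growth can be made concrete by enumerating producer/consumer pairs. This is a small sketch under a simplifying assumption that is not on the slide: one forwarding path per (producing pipe, consuming pipe) pair, ignoring multiple operand inputs and multiple pipeline stages. The function name is hypothetical.

```python
def forwarding_paths(m):
    """Enumerate (producer, consumer) pipe pairs in an m-way design."""
    return [(p, c) for p in range(m) for c in range(m)]


for m in (1, 2, 4, 6, 8):
    print(m, len(forwarding_paths(m)))
```

Under this assumption, going from 2-way to 6-way raises the count from 4 to 36 paths.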
10Limits to Clock Cycle Reduction
- Power dissipation increases with frequency
- Pipeline registers are read and written in every cycle.
- The time to access the pipeline registers imposes a lower bound on the duration of a pipeline stage
Baer p. 76
11Limits on Pipeline Length
- Speculative actions (e.g., branch prediction) are resolved later in a longer pipeline
- Recovery from misspeculation is delayed
Baer p. 76
12Why the Multicore Revolution?
Baer p. 77
13Speed Demons vs. Brainiacs
register renaming
reorder buffer
reservation stations
Baer p. 77
14Out-of-Order and Memory Hierarchy
- Question: Does out-of-order execution help hide memory latencies?
- Short answer: No.
  - Latencies of 100 cycles or more are too long: they fill up all internal queues and stall the pipelines
  - Latencies around 100 cycles are too short to justify context switching.
- Solution: hardware for several contexts to enable fast context switching → multithreading
Baer p. 78
15DEC Alpha 21164
4-way in-order RISC
[Figure: Alpha 21164 block diagram; surviving labels: virtually indexed, 32, 32 64-bit]
Baer p. 79
1621164 Instruction Pipeline
Integer pipe 1: shifter and multiplier
Integer pipe 2: branches
48-entry I-TLB
64-entry D-TLB
Baer p. 79
17Integer pipe 1: shifter and multiplier
Integer pipe 2: branches
48-entry I-TLB
64-entry D-TLB
Baer p. 80
18Example
i1: R1 ← R2 op R3    (use integer pipeline 1)
i2: R4 ← R1 op R5    (use integer pipeline 2)
i3: R7 ← R8 op R9    (requires an integer pipeline)
i4: F0 ← F2 + F4     (floating-point add)
i5, i6, i7, i8, i9, i10, i11, i12
Assume no structural or data hazards for these instructions.
Baer p. 81
19Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of stages S0-S3 and the backend at times t0 and t0+1]
Baer p. 82
20Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of stages S0-S3 and the backend at times t0+1 and t0+2]
Baer p. 82
21Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of stages S0-S3 and the backend at times t0+2 and t0+3, showing i3 through i12]
i3 cannot move to S3 because of a resource conflict (there are only two integer pipelines).
i4 does not move to S3, to preserve program order (it is blocked by i3).
Baer p. 82
22Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of stages S0-S3 and the backend at times t0+3 and t0+4, showing i1 through i12]
i2 cannot move to the backend because of its RAW dependency on i1.
Baer p. 82
23Front-end Occupancy
i1: R1 ← R2 op R3
i2: R4 ← R1 op R5
i3: R7 ← R8 op R9
i4: F0 ← F2 + F4
[Figure: occupancy of stages S0-S3 and the backend at times t0+4 and t0+5, showing i1]
Baer p. 82
24Backend
Baer p. 82
25Scoreboard Speculation
If the load hits in the L1 cache, then schedule the load (L) at t+1 and the instruction that uses its result (U) at t+3.
The scoreboard assumes that the load is a hit.
If it is a miss, abort any dependent instruction already issued.
Baer p. 82
26Can Compiler Help Performance?(Example)
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
Assume that all instructions are in the issuing slot (stage S2) at time t.
27Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t and t+1, showing i3 and i4]
Instruction i3 cannot advance to S3 because of a structural hazard: the load in i1 uses an integer pipe to compute its address.
Baer p. 82
28Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t+1, t+2 and t+3, showing i1 through i4]
i2 cannot advance because of its RAW dependency on i1.
At t+3 the load continues execution in the back end (2-cycle latency).
Baer p. 82
29Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t+3 and t+4, showing i1]
Baer p. 82
30Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t+4 and t+5, showing i2, i3 and i4]
i4 cannot advance because of its RAW dependency on i3.
Baer p. 82
31Compiler Effect
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t+5 and t+6, showing i3]
i4 advances to execution at t+6, and it will be the only integer instruction executing in that cycle.
Baer p. 82
32After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t and t+1, showing i2 through i7]
Two integer instructions advance to S3.
Baer p. 82
33After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t+1 and t+2]
Baer p. 82
34After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t+2, t+3 and t+4, showing i1 through i7]
The load in i1 still needs two cycles to execute.
Baer p. 82
35After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t+4 and t+5, showing i1]
i2 and i3 can advance to the backend together. There is no dependency between them.
Baer p. 82
36After Compiler Optimization
i1:  R1 ← Mem[R2]
i1': integer nop
i2:  R4 ← R1 op R3
i3:  R5 ← R1 op R6
i4:  R7 ← R4 op R5
[Figure: occupancy of stages S0-S3 and the backend at times t+4, t+5 and t+6]
i4 still advances to the backend at t+6, but now i5 could advance along with i4.
The textbook says that i4 would advance to the backend at t+5.
Baer p. 82
37Scoreboarding
Scoreboarding allows instructions to execute out
of order when there are sufficient resources and
no data dependences.
John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, p. A-69.
38Another scoreboarding
39Scoreboarding
- Thornton Algorithm (Scoreboarding): CDC 6600 (1964)
  - A single unit (the scoreboard) monitors the progress of the execution of instructions and the status of all registers.
- Tomasulo's Algorithm: IBM 360/91 (1967)
  - Reservation stations buffer operands and results. A Common Data Bus (CDB) distributes results directly to functional units.
Some of this material is from Prof. Vojin G. Oklobdzija's tutorial at ISSCC'97.
Baer p. 81
40CDC 6600
Not shown: branch unit that modifies the PC
Baer p. 86
41CDC 6600 Scoreboard Operation
Issue
free functional unit?
Baer p. 86
42CDC 6600 Scoreboard Operation
Dispatch
Mark execution unit busy
Baer p. 87
43CDC 6600 Scoreboard Operation
Execution
Baer p. 87
44CDC 6600 Scoreboard Operation
Write result
WAR example:
i0: DIV.D F0, F2, F4
i1: ADD.D F10, F0, F8
i2: SUB.D F8, F8, F14
The write of i2 has to stall until i1 has read F8.
Baer p. 87
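The WAR check in the write-result step can be sketched as a lookup over operands that earlier instructions have not yet read. This is an illustration under assumptions, not the CDC 6600 logic itself: `pending_reads` and `can_write_result` are hypothetical names, and the sketch tracks only which source registers each earlier, still-waiting instruction has left to read.

```python
def can_write_result(dest, pending_reads):
    """Allow the write only if no earlier instruction still has to read `dest`.

    pending_reads maps an earlier instruction's name to the set of source
    registers it has not read yet (hypothetical bookkeeping).
    """
    return all(dest not in regs for regs in pending_reads.values())


# i1 (ADD.D F10, F0, F8) waits for F0 from the divide, so it has read
# neither F0 nor F8 yet.
pending_reads = {"i1": {"F0", "F8"}}
print(can_write_result("F8", pending_reads))   # False: i2 must stall

# Once i1 has read its operands, i2 may write F8.
pending_reads["i1"] = set()
print(can_write_result("F8", pending_reads))   # True
```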
45Scoreboarding Example
i1: R4 ← R0 * R2     (uses multiplier 1)
i2: R6 ← R4 * R8     (uses multiplier 2)
i3: R8 ← R2 + R12    (uses adder)
i4: R4 ← R14 + R16   (uses adder)
Baer p. 88
46i1: R4 ← R0 * R2     (uses multiplier 1)
i2: R6 ← R4 * R8     (uses multiplier 2)
i3: R8 ← R2 + R12    (uses adder)
i4: R4 ← R14 + R16   (uses adder)
Cycle 1
i1: issued; R4, R0, R2; flags 1, 1
Unit Busy (U)?: Mult1 0, Mult2 0, Adder 0
Register Unit: R4 NIL → Mult1, R6 NIL, R8 NIL
Baer p. 88
47i1: R4 ← R0 * R2     (uses multiplier 1)
i2: R6 ← R4 * R8     (uses multiplier 2)
i3: R8 ← R2 + R12    (uses adder)
i4: R4 ← R14 + R16   (uses adder)
Cycle 2
i1: issued → dispatched; R4, R0, R2; flags 1, 1
i2: issued; R6, R4 (from Mult1), R8; flags 0, 1
Unit Busy (U)?: Mult1 0 → 1, Mult2 0, Adder 0
Register Unit: R4 Mult1, R6 NIL → Mult2, R8 NIL
Baer p. 88
48i1: R4 ← R0 * R2     (uses multiplier 1)
i2: R6 ← R4 * R8     (uses multiplier 2)
i3: R8 ← R2 + R12    (uses adder)
i4: R4 ← R14 + R16   (uses adder)
Cycle 3
i1: dispatched → execute; R4, R0, R2; flags 1, 1
i2: issued; R6, R4 (from Mult1), R8; flags 0, 1
i3: issued; R8, R2, R12; flags 1, 1
Unit Busy (U)?: Mult1 1, Mult2 0, Adder 0
Register Unit: R4 Mult1, R6 Mult2, R8 NIL → Adder
Baer p. 88
49i1: R4 ← R0 * R2     (uses multiplier 1)
i2: R6 ← R4 * R8     (uses multiplier 2)
i3: R8 ← R2 + R12    (uses adder)
i4: R4 ← R14 + R16   (uses adder)
Cycle 4
i1: execute; R4, R0, R2; flags 1, 1
i2: issued; R6, R4 (from Mult1), R8; flags 0, 1
i3: issued → dispatched; R8, R2, R12; flags 1, 1
Unit Busy (U)?: Mult1 1, Mult2 0, Adder 0 → 1
Register Unit: R4 Mult1, R6 Mult2, R8 Adder
Baer p. 88
50i1: R4 ← R0 * R2     (uses multiplier 1)
i2: R6 ← R4 * R8     (uses multiplier 2)
i3: R8 ← R2 + R12    (uses adder)
i4: R4 ← R14 + R16   (uses adder)
Cycle 5
(No change)
i1: execute; R4, R0, R2; flags 1, 1
i2: issued; R6, R4 (from Mult1), R8; flags 0, 1
i3: dispatched → execute; R8, R2, R12; flags 1, 1
Unit Busy (U)?: Mult1 1, Mult2 0, Adder 1
Register Unit: R4 Mult1, R6 Mult2, R8 Adder
Baer p. 88
51i1: R4 ← R0 * R2     (uses multiplier 1)
i2: R6 ← R4 * R8     (uses multiplier 2)
i3: R8 ← R2 + R12    (uses adder)
i4: R4 ← R14 + R16   (uses adder)
Cycle 6
i3 asks for permission to write. Permission is denied (WAR with i2).
i1: execute; R4, R0, R2; flags 1, 1
i2: issued; R6, R4 (from Mult1), R8; flags 0, 1
i3: execute; R8, R2, R12; flags 1, 1
Unit Busy (U)?: Mult1 1, Mult2 0, Adder 1
Register Unit: R4 Mult1, R6 Mult2, R8 Adder
Baer p. 88
52i1: R4 ← R0 * R2     (uses multiplier 1)
i2: R6 ← R4 * R8     (uses multiplier 2)
i3: R8 ← R2 + R12    (uses adder)
i4: R4 ← R14 + R16   (uses adder)
Cycle 8
i1 asks for permission to write. Permission is granted.
i1: execute → write; R4, R0, R2; flags 1, 1
i2: issued; R6, R4 (from Mult1), R8; flags 0, 1
i3: execute; R8, R2, R12; flags 1, 1
Unit Busy (U)?: Mult1 1, Mult2 0, Adder 1
Register Unit: R4 Mult1, R6 Mult2, R8 Adder
Baer p. 88
53i1: R4 ← R0 * R2     (uses multiplier 1)
i2: R6 ← R4 * R8     (uses multiplier 2)
i3: R8 ← R2 + R12    (uses adder)
i4: R4 ← R14 + R16   (uses adder)
Cycle 9
i2: issued → dispatched; R6, R4 (from Mult1), R8; flags 0, 1
i3: execute → write; R8, R2, R12; flags 1, 1
Unit Busy (U)?: Mult1 0, Mult2 1, Adder 1
Register Unit: R4 (blank), R6 Mult2, R8 Adder
(additional slide label: Adder)
Baer p. 88
54Register Renaming, Reorder Buffer, and
Reservation Stations
- Difference between in-order and out-of-order execution
  - When do instructions leave the front end?
    - In-order: WAR and WAW prevent dispatch
    - Out-of-order: register renaming avoids WAR and WAW
  - How are instructions processed in the back end?
    - Instructions can wait in reservation stations because of RAW dependencies or structural hazards
    - A reorder buffer imposes program-order commitment
Baer p. 89
55Register Renaming (example)
i1: R1 ← R2 / R3    (takes a long time)
i2: R4 ← R1 op R5
i3: R5 ← R6 op R7
i4: R1 ← R8 op R9
The registers that appear in the program are logical, or architectural, registers.
In-order: only i1 issues. The others are blocked by the RAW dependency.
At the last stage of the front end, all registers are mapped to physical registers.
Out-of-order: i3 and i4 can issue and finish execution while i1 executes.
Baer p. 89
56Renaming Process
Renaming Stage
Ri ← Rj op Rk   is renamed to   Ra ← Rb op Rc
Rb ← Rename(Rj)
Rc ← Rename(Rk)
Ra ← freelist(first)
Rename(Ri) ← freelist(first)
first ← next(first)
Baer p. 90
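The renaming steps can be written as a short routine. The sketch below is a minimal illustration under stated assumptions: 32 logical registers R0-R31 initially mapped to themselves, a free list starting at R32, and a hypothetical `rename` function that takes (destination, source, source) triples. Operators are omitted because only register names matter here.

```python
from collections import deque


def rename(program, num_logical=32, num_physical=40):
    """Rename (dest, src1, src2) triples of logical register names."""
    # Assumption: each logical register initially maps to itself.
    rename_map = {f"R{i}": f"R{i}" for i in range(num_logical)}
    freelist = deque(f"R{i}" for i in range(num_logical, num_physical))
    renamed = []
    for ri, rj, rk in program:
        rb = rename_map[rj]        # Rb <- Rename(Rj)
        rc = rename_map[rk]        # Rc <- Rename(Rk)
        ra = freelist.popleft()    # Ra <- freelist(first); first <- next(first)
        rename_map[ri] = ra        # Rename(Ri) <- Ra
        renamed.append((ra, rb, rc))
    return renamed, rename_map


program = [
    ("R1", "R2", "R3"),   # i1
    ("R4", "R1", "R5"),   # i2
    ("R5", "R6", "R7"),   # i3
    ("R1", "R8", "R9"),   # i4
]
renamed, rename_map = rename(program)
for instr in renamed:
    print(instr)
print(rename_map["R1"], rename_map["R4"], rename_map["R5"])
```

The sources are looked up before the destination mapping is updated, so an instruction such as R1 ← R1 op R2 reads the old mapping of R1. After renaming, i3 writes R34 and i4 writes R35, so neither has a WAR or WAW conflict with i1 or i2. The sketch never returns registers to the free list, which a real design must do.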
57Register Renaming (example)
How about i3: can it write into R5 before i1 and i2 complete?
If i1 generates an exception, what will be the value of R5 in the exception state?
i4 will finish execution before i1. Can we allow it to write its result to R1 before i1?

Original             Renamed
i1: R1 ← R2 / R3     R32 ← R2 / R3
i2: R4 ← R1 op R5    R33 ← R32 op R5
i3: R5 ← R6 op R7    R34 ← R6 op R7
i4: R1 ← R8 op R9    R35 ← R8 op R9

Ri   Rename(Ri)
R1   R1 → R32 → R35
R2   R2
R3   R3
R4   R4 → R33
R5   R5 → R34
R6   R6
R7   R7
R8   R8
R9   R9

Freelist: R32, R33, R34, R35, R36, ...
Baer p. 90
58Reorder Buffer
- Even though we allow out-of-order execution, we require in-order completion.
- A reorder buffer (ROB) ensures that the results produced by instructions are committed to the logical registers in program order.
Baer p. 91
59Reorder Buffer (cont.)
- Each entry in the ROB has the following fields:
  - flag: has the instruction completed?
  - value: value computed by the instruction
  - result register name: logical register
  - instruction type: arithmetic/load/store/branch/...
- Each instruction that has its destination register renamed is entered in the ROB
Baer p. 91
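A ROB with these fields behaves like a FIFO that releases entries only from its head. The sketch below is a minimal model under assumptions: the class and method names are hypothetical, the values are made up, and only register-writing instructions are handled (stores and branches would need extra commit logic).

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    instr: str                    # instruction label, for illustration
    reg_name: str                 # logical result register
    kind: str = "arithmetic"      # instruction type
    flag: bool = False            # has the instruction completed?
    value: Optional[int] = None   # value computed by the instruction


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()    # head (left) is the oldest instruction

    def allocate(self, instr, reg_name, kind="arithmetic"):
        entry = RobEntry(instr, reg_name, kind)
        self.entries.append(entry)
        return entry

    def commit(self, regfile):
        """Commit completed entries from the head, in program order."""
        committed = []
        while self.entries and self.entries[0].flag:
            entry = self.entries.popleft()
            regfile[entry.reg_name] = entry.value
            committed.append(entry.instr)
        return committed


rob, regfile = ReorderBuffer(), {}
e1 = rob.allocate("i1", "R1")
e2 = rob.allocate("i2", "R4")
e3 = rob.allocate("i3", "R5")
e4 = rob.allocate("i4", "R1")

# i3 and i4 finish first, but nothing commits while i1 is at the head.
e3.flag, e3.value = True, 30
e4.flag, e4.value = True, 40
first = rob.commit(regfile)

# Once i1 and i2 complete, all four commit in program order.
e1.flag, e1.value = True, 10
e2.flag, e2.value = True, 20
second = rob.commit(regfile)
print(first, second, regfile)
```

Because i4 commits after i1, the logical register R1 ends with i4's value even though i4 finished executing first.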
60ROB entry columns: Instruction | Flag | Value | Reg. Name | Type

Original             Renamed
i1: R1 ← R2 / R3     R32 ← R2 / R3
i2: R4 ← R1 op R5    R33 ← R32 op R5
i3: R5 ← R6 op R7    R34 ← R6 op R7
i4: R1 ← R8 op R9    R35 ← R8 op R9

Ri   Rename(Ri)
R1   R1 → R32 → R35
R2   R2
R3   R3
R4   R4 → R33
R5   R5 → R34
R6   R6
R7   R7
R8   R8
R9   R9

Freelist: R32, R33, R34, R35, R36, ...
Baer p. 92
61But.
- Where do instructions wait before being executed?
- How does an instruction know that it is ready to be executed?
Baer p. 93
62Reservation Stations
- After register renaming, the front end dispatches the instruction to a reservation station.
- Reservation stations can
  - be grouped into a centralized queue called an instruction window.
  - be associated with functional units according to the opcode.
Baer p. 93
63Reservation Stations (cont.)
- Each entry in the reservation station must contain
  - Operation to be performed
  - Source operands (either the value or the physical name of the register); a flag indicates which one
  - Physical name of the result register
  - ROB entry where the result will be stored.
Baer p. 93
64Scheduling
- Scheduling: selection of which instruction should execute next in a given execution unit
  - oldest instruction
  - critical instruction
Baer p. 93
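The oldest-first policy can be sketched as a selection over ready entries. This is a hedged illustration: the dictionary layout, the `age` field (a smaller number means the instruction is older in program order), and the function name are all assumptions made for the example. Critical-path selection is not modeled.

```python
def select_oldest_ready(station):
    """Pick the ready entry with the smallest age, or None if none is ready."""
    ready = [entry for entry in station if entry["ready"]]
    return min(ready, key=lambda entry: entry["age"], default=None)


# Hypothetical contents of one reservation station.
station = [
    {"instr": "i2", "age": 2, "ready": False},   # still waiting on an operand
    {"instr": "i4", "age": 4, "ready": True},
    {"instr": "i3", "age": 3, "ready": True},
]
chosen = select_oldest_ready(station)
print(chosen["instr"])
```

Here i2 is older but not ready, so i3 is selected ahead of i4.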
65Ready Bit
- A ready bit is associated with each physical register.
- When an instruction that uses a physical register Ri is dispatched
  - if Ri is ready, pass the value of Ri to the reservation station and set the flag to true (ready)
  - if Ri is not ready, pass the name of Ri to the reservation station and set the flag to false (not ready)
- When both flags are true, the instruction is ready to be issued.
Baer p. 93
66Ready Bit (cont.)
- Upon completion, an instruction broadcasts the name and content of its result physical register to all reservation stations (RS).
- Each RS that needs it will grab the content and update its flags.
Baer p. 93
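Dispatch with ready bits and the completion broadcast can be sketched together. This is a minimal model under assumptions, not a description of any particular processor: the function names, the dictionary-based entry layout, and the register values are hypothetical. Each operand is stored as a [flag, payload] pair, where the payload is the value when the flag is true and the physical register name otherwise.

```python
def dispatch(op, dest, srcs, ready, values):
    """Build a reservation-station entry from the current ready bits."""
    operands = []
    for reg in srcs:
        if ready[reg]:
            operands.append([True, values[reg]])   # pass the value
        else:
            operands.append([False, reg])          # pass the register name
    return {"op": op, "dest": dest, "operands": operands}


def is_ready(entry):
    """An entry can issue when all of its operand flags are true."""
    return all(flag for flag, _ in entry["operands"])


def broadcast(stations, reg, value, ready, values):
    """Completion: publish (reg, value) to every waiting entry."""
    ready[reg] = True
    values[reg] = value
    for entry in stations:
        for operand in entry["operands"]:
            if not operand[0] and operand[1] == reg:
                operand[0] = True
                operand[1] = value


# Hypothetical state: R32 is still being computed, R5 already holds 7.
ready = {"R32": False, "R5": True}
values = {"R5": 7}
entry = dispatch("op", "R33", ["R32", "R5"], ready, values)
stations = [entry]
before = is_ready(entry)

broadcast(stations, "R32", 5, ready, values)
after = is_ready(entry)
print(before, after, entry["operands"])
```

The entry waits with the name R32 in its first operand until the broadcast replaces the name with the value and sets the flag.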