Superscalar Processors

1
Superscalar Processors

J. Nelson Amaral

2
Scalar to Superscalar

Scalar processor: one instruction passes through
each pipeline stage in each cycle
Superscalar processor: multiple instructions at
each pipeline stage in each cycle
Wider pipeline
Superpipelined processor: decompose stages into
smaller stages → more stages
Deeper pipeline

Baer p. 75
3
Superscalar

Front end (IF and ID)
Must fetch and decode multiple instructions per
cycle
m-way superscalar brings (ideally) m
instructions per cycle into the pipeline
Back end (EX, Mem and WB)
Must execute and write back several instructions
per cycle

Baer p. 75
4
Superscalar

In-order (or static)
Instructions leave front-end in program order
Out-of-order (or dynamic)
instructions leave front-end, and execute, in a
different order than the program order
WB is called commit stage
must ensure that the program semantics is
followed
more complex design

Baer p. 76
5
Limits to Superscalar Performance

Superscalars rely on exploiting Instruction-Level
Parallelism (ILP)
They remove WAR and WAW dependences
But the amount of ILP is limited by RAW (true)
dependences

[Figure: data dependence graph over S0, S1, S2, and S3, with edges labeled RAW, WAW, and WAR.]
Baer p. 76
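The three dependence kinds named on this slide can be sketched as a small classifier. This is a minimal illustration, not the deck's own code: the `Instr` record, the `dependences` helper, and the four sample instructions are hypothetical, and the check compares only one pair of instructions without looking at writes in between.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    name: str
    dest: str      # register written
    srcs: tuple    # registers read

def dependences(earlier: Instr, later: Instr) -> set:
    """Return the kinds of data dependence that `later` has on `earlier`."""
    kinds = set()
    if earlier.dest in later.srcs:
        kinds.add("RAW")   # later reads what earlier writes (true dependence)
    if later.dest in earlier.srcs:
        kinds.add("WAR")   # later overwrites a register earlier still reads
    if later.dest == earlier.dest:
        kinds.add("WAW")   # both write the same register
    return kinds

# Hypothetical instructions, chosen only to exercise each case.
s0 = Instr("S0", "R1", ("R2", "R3"))
s1 = Instr("S1", "R4", ("R1", "R5"))
s2 = Instr("S2", "R5", ("R6", "R7"))
s3 = Instr("S3", "R1", ("R8", "R9"))

for a, b in [(s0, s1), (s1, s2), (s0, s3)]:
    print(a.name, b.name, sorted(dependences(a, b)))
```

Only the RAW result is a true dependence; the WAR and WAW results are the ones the slide says a superscalar can remove.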
6
Limits to Superscalar Performance

Superscalars rely on exploiting Instruction-Level
Parallelism (ILP)
They remove WAR and WAW dependences
But the amount of ILP is limited by RAW (true)
dependences

[Figure: data dependence graph with a RAW edge between S0 and S1; other labels shown: RA, RB, RA.]
Baer p. 76
7
Limits to Superscalar Performance

Complexity of logic to remove dependencies
Designers predicted 8-way and 16-way superscalars
We have 6-way superscalars and m is not likely to
grow

Baer p. 76
8
Limits to Superscalar Performance: Number of
Forward Paths
1-way
Baer p. 76
9
Limits to Superscalar Performance: Number of
Forward Paths
2-way
m-way requires m² paths
paths may become too long for signal propagation
within a single clock cycle
Baer p. 76
10
Limits to Clock Cycle Reduction

Power dissipation increases with frequency
Reading and writing pipeline registers in every
cycle.
Time to access pipeline register imposes a bound
on the duration of a pipeline stage

Baer p. 76
11
Limits on Pipeline Length

Speculative actions (e.g., branch prediction) are
resolved later in a longer pipeline
Recovery from misspeculation is delayed

Baer p. 76
12
Why the Multicore Revolution?
Baer p. 77
13
Speed Demons vs. Brainiacs
register renaming
reorder buffer
reservation stations
Baer p. 77
14
Out-of-Order and Memory Hierarchy

Question: Does out-of-order execution help hide
memory latencies?
Short answer: No.
Latencies of 100 cycles or more are too long:
they fill up all internal queues and stall pipelines
Latencies around 100 cycles are too short to
justify context switching.
Solution: hardware for several contexts to enable
fast context switching → multithreading

Baer p. 78
15
DEC Alpha 21164
4-way in-order RISC
[Figure: 21164 block diagram; legible labels include "virtually indexed" and "32 64-bit".]
Baer p. 79
16
21164 Instruction Pipeline
Integer pipe 1: shifter and multiplier
Integer pipe 2: branches
48-entry I-TLB
64-entry D-TLB
Baer p. 79
17
Integer pipe 1: shifter and multiplier
Integer pipe 2: branches
48-entry I-TLB
64-entry D-TLB
Baer p. 80
18
Example
i1: R1 ← R2 op R3 (use integer pipeline 1)
i2: R4 ← R1 op R5 (use integer pipeline 2)
i3: R7 ← R8 op R9 (requires an integer pipeline)
i4: F0 ← F2 + F4 (floating-point add)
i5, i6, i7, i8, i9, i10, i11, i12
Assume no structural or data hazard for these
instructions.
Baer p. 81
19
Front-end Occupancy
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 and t0 + 1, for the example i1 to i4.]
Baer p. 82
20
Front-end Occupancy
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 + 1 and t0 + 2, for the example i1 to i4.]
Baer p. 82
21
Front-end Occupancy
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 + 2 and t0 + 3; instructions i3 to i12 are shown in the front end.]
i3 cannot move to S3 because of resource conflict
(there are only two integer pipelines)
i4 does not move to S3 to preserve program order
(it is blocked by i3)
Baer p. 82
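The two blocking rules on this slide (a resource conflict stops i3, and program order then stops i4) can be sketched as an in-order issue loop. This is an illustrative model only: `issue_in_order` is a hypothetical helper, and the count of two floating-point pipelines is an assumption made for the example; the slide itself states only that there are two integer pipelines.

```python
def issue_in_order(slot, pipes):
    """Return the instructions that move from S2 to S3 this cycle.

    slot:  list of (name, kind) pairs in program order.
    pipes: dict mapping a kind to the number of pipelines of that kind.
    """
    free = dict(pipes)         # copy so the caller's dict is not mutated
    issued = []
    for name, kind in slot:
        if free.get(kind, 0) == 0:
            break              # resource conflict: this and all younger instructions wait
        free[kind] -= 1
        issued.append(name)
    return issued

slot = [("i1", "int"), ("i2", "int"), ("i3", "int"), ("i4", "fp")]
pipes = {"int": 2, "fp": 2}    # the "fp" count is an assumption for this sketch

first = issue_in_order(slot, pipes)               # i3 finds no free integer pipe
rest = issue_in_order(slot[len(first):], pipes)   # next cycle: i3 and i4 go together
print(first, rest)
```

The `break` is what makes the model in-order: i4 needs a floating-point pipe that is free, but it still waits because i3 ahead of it could not issue.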
22
Front-end Occupancy
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 + 3 and t0 + 4; instructions i1 to i12 are shown.]
i2 cannot move to the backend because of its RAW
dependency on i1.
Baer p. 82
23
Front-end Occupancy
[Figure: occupancy of front-end stages S0 to S3 and the backend at times t0 + 4 and t0 + 5; i1 is shown.]
Baer p. 82
24
Backend
Baer p. 82
25
Scoreboard Speculation
If the load hits L1-cache, then schedule L at t1
and U at t3.
Scoreboard assumes it is a hit.
If it is a miss, abort any dependent instruction
already issued.
Baer p. 82
26
Can Compiler Help Performance?(Example)
i1: R1 ← Mem[R2]
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
Assume that all instructions are in the issue slot
(stage S2) at time t.
27
Compiler Effect
[Figure: stages S0 to S3 and the backend at times t and t + 1; i3 and i4 are shown.]
Instruction i3 cannot advance to S3 because of a
structural hazard: the load in i1 uses an
integer pipe to compute its address.
Baer p. 82
28
Compiler Effect
[Figure: stages S0 to S3 and the backend at times t + 1, t + 2, and t + 3; i1 to i4 are shown.]
i2 cannot advance because of the RAW dependency
with i1
At t + 3 the load continues execution in the back
end (2-cycle latency)
Baer p. 82
29
Compiler Effect
[Figure: stages S0 to S3 and the backend at times t + 3 and t + 4; i1 is shown.]
Baer p. 82
30
Compiler Effect
[Figure: stages S0 to S3 and the backend at times t + 4 and t + 5; i2, i3, and i4 are shown.]
i4 cannot advance because of the RAW dependency
with i3
Baer p. 82
31
Compiler Effect
[Figure: stages S0 to S3 and the backend at times t + 5 and t + 6; i3 is shown.]
i4 advances to execution at t + 6 and it will be
the only integer instruction executing in that
cycle.
Baer p. 82
32
After Compiler Optimization
i1: R1 ← Mem[R2]
i1': integer nop
i2: R4 ← R1 op R3
i3: R5 ← R1 op R6
i4: R7 ← R4 op R5
[Figure: stages S0 to S3 and the backend at times t and t + 1; i2 to i7 are shown.]
Two integer Instructions advance to S3.
Baer p. 82
33
After Compiler Optimization
[Figure: stages S0 to S3 and the backend at times t + 1 and t + 2, for the optimized sequence.]
Baer p. 82
34
After Compiler Optimization
[Figure: stages S0 to S3 and the backend at times t + 2, t + 3, and t + 4, for the optimized sequence; i1 to i7 are shown.]
Load in i1 still needs two cycles to execute.
Baer p. 82
35
After Compiler Optimization
[Figure: stages S0 to S3 and the backend at times t + 4 and t + 5, for the optimized sequence; i1 is shown.]
i2 and i3 can advance to the backend together. There
is no dependency between them.
Baer p. 82
36
After Compiler Optimization
[Figure: stages S0 to S3 and the backend at times t + 4, t + 5, and t + 6, for the optimized sequence; i4, i5, i6, i7, and i12 are shown.]
i4 still advances to the backend at t + 6,
but now i5 could advance along with i4.
The textbook says that i4 would advance to the backend
at t + 5.
Baer p. 82
37
Scoreboarding
Scoreboarding allows instructions to execute out
of order when there are sufficient resources and
no data dependences.
John L. Hennessy and David A. Patterson, Computer
Architecture: A Quantitative Approach, Third
Edition, p. A-69.
38
Another scoreboarding
39
Scoreboarding

Thornton's Algorithm (Scoreboarding): CDC 6600
(1964)
A single unit (the scoreboard) monitors the
progress of the execution of instructions and the
status of all registers.
Tomasulo's Algorithm: IBM 360/91 (1967)
Reservation stations buffer operands and results.
A Common Data Bus (CDB) distributes results
directly to functional units

Some of this material is from Prof. Vojin G.
Oklobdzija's tutorial at ISSCC '97.
Baer p. 81
40
CDC 6600
Not shown: branch unit that modifies the PC
Baer p. 86
41
CDC 6600 Scoreboard Operation
Issue
free functional unit?
Baer p. 86
42
CDC 6600 Scoreboard Operation
Dispatch
Mark execution unit busy
Baer p. 87
43
CDC 6600 Scoreboard Operation
Execution
Baer p. 87
44
CDC 6600 Scoreboard Operation
Write result
WAR example:
i0: DIV.D F0, F2, F4
i1: ADD.D F10, F0, F8
i2: SUB.D F8, F8, F14
The write of i2 has to stall until i1 has read F8.
Baer p. 87
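The write-result check in the WAR example above can be sketched as follows. This is a simplified model under stated assumptions, not the CDC 6600 logic itself: the `Pending` record and `may_write` helper are hypothetical names, and the model tracks only whether an earlier instruction has read its operands yet.

```python
from dataclasses import dataclass

@dataclass
class Pending:
    name: str
    srcs: tuple                       # registers this instruction reads
    has_read_operands: bool = False   # set once it has been dispatched

def may_write(dest, earlier):
    """True if no earlier instruction still has to read `dest` (no WAR hazard)."""
    return all(e.has_read_operands or dest not in e.srcs for e in earlier)

# i1 = ADD.D F10, F0, F8 is waiting for F0 from the long-running divide i0.
i1 = Pending("i1", ("F0", "F8"))

# i2 = SUB.D F8, F8, F14 finishes early and asks to write F8.
blocked = may_write("F8", [i1])

# Once the divide completes, i1 reads F0 and F8, and i2 may write.
i1.has_read_operands = True
allowed = may_write("F8", [i1])
print(blocked, allowed)
```

Until i1 has read F8, letting i2 write would hand i1 the new value instead of the old one, which is why the scoreboard stalls the write rather than the execution.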
45
Scoreboarding Example
i1: R4 ← R0 * R2 (uses multiplier 1)
i2: R6 ← R4 * R8 (uses multiplier 2)
i3: R8 ← R2 + R12 (uses adder)
i4: R4 ← R14 + R16 (uses adder)
Baer p. 88
46
Cycle 1
i1: issued; R4, R0, R2; 1, 1
Unit Busy (U)?
Mult1 0
Mult2 0
Adder 0
Register Unit
R4 NIL → Mult1
R6 NIL
R8 NIL
Baer p. 88
47
Cycle 2
i1: issued, dispatched; R4, R0, R2; 1, 1
i2: issued; R6, R4, R8; Mult1; 0, 1
Unit Busy (U)?
Mult1 0 → 1
Mult2 0
Adder 0
Register Unit
R4 Mult1
R6 NIL → Mult2
R8 NIL
Baer p. 88
48
Cycle 3
i1: dispatched, execute; R4, R0, R2; 1, 1
i2: issued; R6, R4, R8; Mult1; 0, 1
i3: issued; R8, R2, R12; 1, 1
Unit Busy (U)?
Mult1 1
Mult2 0
Adder 0
Register Unit
R4 Mult1
R6 Mult2
R8 NIL → Adder
Baer p. 88
49
Cycle 4
i1: execute; R4, R0, R2; 1, 1
i2: issued; R6, R4, R8; Mult1; 0, 1
i3: issued, dispatched; R8, R2, R12; 1, 1
Unit Busy (U)?
Mult1 1
Mult2 0
Adder 0 → 1
Register Unit
R4 Mult1
R6 Mult2
R8 Adder
Baer p. 88
50
Cycle 5
(No change)
i1: execute; R4, R0, R2; 1, 1
i2: issued; R6, R4, R8; Mult1; 0, 1
i3: dispatched, execute; R8, R2, R12; 1, 1
Unit Busy (U)?
Mult1 1
Mult2 0
Adder 1
Register Unit
R4 Mult1
R6 Mult2
R8 Adder
Baer p. 88
51
Cycle 6
i3 asks for permission to write. Permission is
denied (WAR with i2).
i1: execute; R4, R0, R2; 1, 1
i2: issued; R6, R4, R8; Mult1; 0, 1
i3: execute; R8, R2, R12; 1, 1
Unit Busy (U)?
Mult1 1
Mult2 0
Adder 1
Register Unit
R4 Mult1
R6 Mult2
R8 Adder
Baer p. 88
52
Cycle 8
i1 asks for permission to write. Permission
is granted.
i1: execute, write; R4, R0, R2; 1, 1
i2: issued; R6, R4, R8; Mult1; 0, 1
i3: execute; R8, R2, R12; 1, 1
Unit Busy (U)?
Mult1 1
Mult2 0
Adder 1
Register Unit
R4 Mult1
R6 Mult2
R8 Adder
Baer p. 88
53
Cycle 9
i2: issued, dispatched; R6, R4, R8; Mult1; 0, 1
i3: execute, write; R8, R2, R12; 1, 1
Unit Busy (U)?
Mult1 0
Mult2 1
Adder 1
Register Unit
R4
R6 Mult2
R8 Adder
Adder
Baer p. 88
54
Register Renaming, Reorder Buffer, and
Reservation Stations

Difference between in-order and out-of-order
execution
When do instructions leave the front end?
In-order: WAR and WAW prevent dispatch
Out-of-order: register renaming avoids WAR and
WAW
How are instructions processed in the back end?
Instructions can wait in reservation stations
because of RAW dependencies or structural hazards
A reorder buffer enforces commitment in program order

Baer p. 89
55
Register Renaming (example)
i1: R1 ← R2/R3 (takes a long time)
i2: R4 ← R1 op R5
i3: R5 ← R6 op R7
i4: R1 ← R8 op R9
The registers that appear in the program are
logical or architectural registers.
In-order: only i1 issues. The others are blocked by
the RAW dependency of i2 on i1.
At the last stage of the front end all registers
are mapped to physical registers.
Out-of-order: i3 and i4 can issue and finish
execution while i1 executes.
Baer p. 89
56
Renaming Process
Renaming Stage
Ri ← Rj op Rk is renamed to Ra ← Rb op Rc, where:
Rb ← Rename(Rj)
Rc ← Rename(Rk)
Ra ← freelist(first)
Rename(Ri) ← freelist(first)
first ← next(first)
Baer p. 90
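The renaming steps on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `rename` function is a hypothetical helper, the map starts as the identity mapping, the free list starts at R32 as in the deck's example, and running out of free registers is not handled.

```python
from collections import deque

def rename(program, num_logical=10, first_free=32, num_free=8):
    """Rename a list of (dest, src1, src2) logical-register triples."""
    rename_map = {f"R{i}": f"R{i}" for i in range(num_logical)}   # assumed identity start
    freelist = deque(f"R{i}" for i in range(first_free, first_free + num_free))
    renamed = []
    for dest, src1, src2 in program:
        rb = rename_map[src1]       # Rb <- Rename(Rj)
        rc = rename_map[src2]       # Rc <- Rename(Rk)
        ra = freelist.popleft()     # Ra <- freelist(first); first <- next(first)
        rename_map[dest] = ra       # Rename(Ri) <- Ra
        renamed.append((ra, rb, rc))
    return renamed, rename_map

# The four-instruction example from the register renaming slides (operators omitted).
program = [("R1", "R2", "R3"), ("R4", "R1", "R5"),
           ("R5", "R6", "R7"), ("R1", "R8", "R9")]
renamed, final_map = rename(program)
for before, after in zip(program, renamed):
    print(before, "->", after)
```

The sources are looked up before the destination is remapped, so an instruction that reads and writes the same logical register still reads the old physical register.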
57
Register Renaming (example)
How about i3, can it write into R5 before i1 and
i2 complete?
If i1 generates an exception, what will be the
value of R5 in the exception state?
i1: R1 ← R2/R3
i2: R4 ← R1 op R5
i3: R5 ← R6 op R7
i4: R1 ← R8 op R9
After renaming:
i1: R32 ← R2/R3
i2: R33 ← R32 op R5
i3: R34 ← R6 op R7
i4: R35 ← R8 op R9
Ri Rename(Ri)
R1 R1 → R32 → R35
R2 R2
R3 R3
R4 R4 → R33
R5 R5 → R34
R6 R6
R7 R7
R8 R8
R9 R9
i4 will finish execution before i1. Can we allow
it to write the result to R1 before i1?
Freelist: R32, R33, R34, R35, R36, ...
Baer p. 90
58
Reorder Buffer

Even though we allow out-of-order execution, we
require in-order completion.
A reorder buffer (ROB) ensures that the results
produced by instructions are committed to the
logical registers in program order.

Baer p. 91
59
Reorder Buffer (cont.)

Each entry in the ROB has the following fields:
flag: has the instruction completed?
value: value computed by the instruction
result register name: logical register
instruction type: arithmetic/load/store/branch/...
Each instruction that has its destination
register renamed is entered in the ROB

Baer p. 91
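The ROB fields listed above, together with in-order commit from the head, can be sketched as follows. This is an illustrative model, not a description of any particular processor: the class and method names are hypothetical, and the register file is a plain dictionary.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    reg_name: str                 # result register name (logical register)
    kind: str                     # instruction type: arithmetic, load, store, branch
    done: bool = False            # flag: has the instruction completed?
    value: Optional[int] = None   # value computed by the instruction

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()    # kept in program order, oldest at the left

    def allocate(self, reg_name, kind):
        entry = ROBEntry(reg_name, kind)
        self.entries.append(entry)
        return entry

    def commit(self, regfile):
        """Retire completed entries from the head; stop at the first incomplete one."""
        committed = []
        while self.entries and self.entries[0].done:
            entry = self.entries.popleft()
            regfile[entry.reg_name] = entry.value
            committed.append(entry.reg_name)
        return committed

rob, regs = ReorderBuffer(), {}
e1 = rob.allocate("R1", "arithmetic")
e3 = rob.allocate("R5", "arithmetic")
e3.done, e3.value = True, 7      # the younger instruction finishes first
early = rob.commit(regs)         # nothing commits: e1 is still at the head
e1.done, e1.value = True, 3
late = rob.commit(regs)          # both commit now, in program order
print(early, late, regs)
```

Because only the head may retire, a result that is ready early stays in the ROB and the logical registers always reflect a point in program order, which is what an exception handler needs.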
60
Instruction Flag Value Reg. Name Type

[Figure: reorder buffer with the columns above, shown next to the register map Ri / Rename(Ri) and the renamed example i1 to i4, which use physical registers R32 to R35.]
Freelist: R32, R33, R34, R35, R36, ...
Baer p. 92
61
But...

Where do instructions wait before being executed?
How does an instruction know that it is ready to be
executed?

Baer p. 93
62
Reservation Stations

After register renaming, the front-end dispatches
the instruction to a reservation station.
Reservation stations can:
be grouped into a centralized queue called an
instruction window, or
be associated with functional units according to
the opcode.

Baer p. 93
63
Reservation Stations (cont.)

Each entry in the reservation station must
contain:
Operation to be performed
Source operands (either the value or the physical name of
the register), with a flag indicating which one
Physical name of the result register
ROB entry where the result will be stored

Baer p. 93
64
Scheduling

Scheduling: selection of which instruction should
execute next in a given execution unit
oldest instruction
critical instruction

Baer p. 93
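The "oldest instruction" policy can be sketched as a selection over the ready reservation stations. This is a minimal illustration: the dictionary keys `name`, `age`, and `ready` are hypothetical, with a smaller `age` meaning an older instruction. The "critical instruction" policy is not modeled here.

```python
def select_oldest_ready(stations):
    """Pick the oldest ready entry, or None if nothing is ready."""
    ready = [s for s in stations if s["ready"]]
    return min(ready, key=lambda s: s["age"]) if ready else None

stations = [
    {"name": "a", "age": 3, "ready": True},
    {"name": "b", "age": 1, "ready": False},   # oldest, but still waiting on an operand
    {"name": "c", "age": 2, "ready": True},
]
pick = select_oldest_ready(stations)
print(pick["name"])
```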
65
Ready Bit

A ready bit is associated with each physical
register.
When an instruction that uses a physical register
Ri is dispatched:
if Ri is ready, pass the value of Ri to the reservation
station and set the flag to true (ready)
if Ri is not ready, pass the name of Ri to the
reservation station and set the flag to false (not
ready)
When both flags are true, the instruction is
ready to be issued.

Baer p. 93
66
Ready Bit (cont.)

Upon completion, an instruction broadcasts the
name and content of its result physical register
to all reservation stations (RS).
Each RS that needs it grabs the content and
updates its flags.

Baer p. 93
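The ready-bit dispatch rule and the completion broadcast can be sketched together. This is an illustrative model under stated assumptions: `Operand`, `Station`, `dispatch`, and `broadcast` are hypothetical names, and the ready bits and register file are plain dictionaries.

```python
from dataclasses import dataclass

@dataclass
class Operand:
    ready: bool
    value: object = None   # holds the value when ready
    name: str = ""         # holds the physical register name when not ready

@dataclass
class Station:
    op: str
    src: list              # one Operand per source register

    def is_ready(self):
        return all(o.ready for o in self.src)

def dispatch(op, src_regs, ready_bit, regfile):
    """Fill a reservation station with values (if ready) or names (if not)."""
    operands = []
    for r in src_regs:
        if ready_bit[r]:
            operands.append(Operand(True, value=regfile[r]))
        else:
            operands.append(Operand(False, name=r))
    return Station(op, operands)

def broadcast(stations, name, value):
    """On completion, every station waiting on `name` grabs the value."""
    for s in stations:
        for o in s.src:
            if not o.ready and o.name == name:
                o.value, o.ready = value, True

ready_bit = {"R32": False, "R5": True}
regfile = {"R5": 10}
rs = dispatch("add", ["R32", "R5"], ready_bit, regfile)
waiting = rs.is_ready()         # one operand is still only a register name
broadcast([rs], "R32", 4)       # the producer of R32 completes
print(waiting, rs.is_ready(), [o.value for o in rs.src])
```

After the broadcast both flags are true, so the station holds everything it needs and can be selected for issue without reading the register file again.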
