Superscalar Organization - PowerPoint PPT Presentation


About This Presentation

Title:

Superscalar Organization


Slides: 46

Provided by: aash63


Transcript and Presenter's Notes

Title: Superscalar Organization

1
Chapter 4

Superscalar Organization

2
Limitations of Scalar Pipelines

Maximum throughput bounded by one instruction per cycle
Inefficient unification into one pipeline
ALU and MEM stages are very diverse, e.g. FP
Rigid nature of an in-order pipeline
If a leading instruction is stalled, every subsequent instruction is stalled

3
A Rigid Pipeline
4
Solving Problems of Scalar Pipelines

Maximum throughput bounded by one instruction per cycle -> parallel pipelines
Inefficient unification into a single pipeline -> diversified pipelines
Rigid nature of an in-order pipeline -> allow out-of-order execution (OOO pipelines, or dynamic pipelines)

5
Machine Parallelism

No Parallelism (Nonpipelined)
Temporal Parallelism (Pipelined)
Spatial Parallelism (Multiple units)
Combined Temporal and Spatial Parallelism

6
A Parallel Pipeline
Width = 3
7
Scalar and Parallel Pipeline

The five-stage i486 scalar pipeline
The five-stage Pentium parallel pipeline of width = 2

8
Diversified Parallel Pipeline
9
Diversified Functional Units
CDC 6600 with 10 diversified functional units
10
Motorola 88110 Superscalar uP
11
Interpipeline Stage Buffer

Single entry buffer
Multientry buffer
Multientry buffer with reordering

12
A Dynamic Pipeline
13
In-order Issue into Diversified Pipelines
RD <- Fn(RS, RT), where RD is the destination register, Fn the functional unit, and RS, RT the source registers
In-order instruction stream
The issue stage needs to check for:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard
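
The four issue-stage checks listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the logic of any particular machine: the Instr record, the issue_hazards function, and the register and unit names are all hypothetical, and the sketch assumes every in-flight instruction has not yet read its sources or written its destination.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    unit: str                # functional unit type (hypothetical names, e.g. "INT", "FADD")
    dest: Optional[str]      # destination register RD, or None if nothing is written
    srcs: Tuple[str, ...]    # source registers (RS, RT)


def issue_hazards(new, in_flight, busy_units):
    """Return the set of hazards that would block issuing `new`.

    Assumption: every instruction in `in_flight` is older than `new` and has
    neither read its sources nor written its destination yet.
    """
    hazards = set()
    if new.unit in busy_units:
        hazards.add("structural")          # required functional unit is occupied
    for old in in_flight:
        if old.dest is not None and old.dest in new.srcs:
            hazards.add("RAW")             # new reads what an older instr writes
        if old.dest is not None and old.dest == new.dest:
            hazards.add("WAW")             # both write the same register
        if new.dest is not None and new.dest in old.srcs:
            hazards.add("WAR")             # new overwrites a pending source
    return hazards


in_flight = [Instr("FMULT", "f2", ("f0", "f1"))]
print(issue_hazards(Instr("FADD", "f3", ("f2", "f4")), in_flight, set()))
```

A real issue stage would track register and unit status in a scoreboard rather than scanning a list, but the comparisons are the same.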
14
A Superscalar Pipeline
A six stage Template (TEM) superscalar pipeline
15
Superscalar Pipeline Design
Instruction Flow
Data Flow
Retire
16
Inorder Pipelines
In-order pipeline: no WAW and no WAR hazards (almost always true)
17
Limitations of Inorder Pipelines

CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e. when NxM approaches the average distance between dependent instructions
Forwarding is no longer effective -> must stall more often
Pipeline may never be full due to frequent dependency stalls

18
Out-of-order Pipelining 101
Stages: IF -> ID -> RD -> EX -> WB
EX pipes: INT; Fadd1, Fadd2; Fmult1, Fmult2, Fmult3; LD/ST

19
Instruction Fetching

Fetch should not be bottleneck
Wide I-caches
Fetch bandwidth
Two major problems
Misaligned accesses
Control flow (branch) instructions
2 solutions for misaligned accesses
Compiler
Hardware alignment network
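
The cost of a misaligned fetch group can be estimated with a small model. The sketch below assumes a fetch unit that cannot cross a physical I-cache row boundary in a single cycle and has no alignment hardware; the function names and the row sizes used are illustrative assumptions, not figures from the slides.

```python
def fetched_this_cycle(pc_index, fetch_width, row_size):
    """Instructions delivered in one cycle when fetch cannot cross a row boundary.

    pc_index    -- index of the first instruction to fetch (in instruction units)
    fetch_width -- maximum instructions the pipeline can accept per cycle
    row_size    -- instructions held in one physical I-cache row
    """
    offset = pc_index % row_size
    return min(fetch_width, row_size - offset)


def average_fetch(fetch_width, row_size):
    """Average fetch bandwidth if every starting offset is equally likely."""
    total = sum(fetched_this_cycle(o, fetch_width, row_size) for o in range(row_size))
    return total / row_size


print(fetched_this_cycle(0, 4, 4))   # aligned: full group of 4
print(fetched_this_cycle(3, 4, 4))   # misaligned: only 1 instruction left in the row
print(average_fetch(4, 4))           # 2.5
print(average_fetch(4, 8))           # 3.25
```

Under these assumptions a wider row (or a cache line spanning two rows) recovers part of the lost bandwidth, which is the motivation for both the compiler and the hardware alignment solutions.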

20
Organization of wide I-cache
(a) One cache line is equal to one physical row
(b) One cache line is equal to two physical rows
21
Misalignment of the Fetch Group
22
RS/6000 with Auto-Realignment
23
Instruction Decoding

Identify individual instructions
Determine I-types
Detect dependencies
2 major complexity factors
ISA (RISC/CISC)
Width of the parallel pipeline

24
Instruction Decoding (2)

ISA (RISC/CISC)
Uniform width
Regular encoding
Few different formats
Few addressing modes
Dependences -> comparators
Multiported reg files, multiple buses
Branches
CISC examples: Pentium, K5
How to identify the start of the next instruction?

25
Fetch/Decode Unit of Intel P6 Pipeline
Uops employ a load/store model
Decoder 0 is a generalized decoder
Decode needs multiple stages -> concept of predecoding
26
Pre-decoding Mechanism of AMD K5
Predecode logic sits between memory and the I-cache
Additional info stored in the cache: 5 bits (start/end of instruction, number of uops, location of opcodes, prefixes)
27
Predecoding

Overhead of predecoding
Increase in I-cache size
K5 increase is 50%
Increased I-cache miss penalty
Not a problem if I-cache miss rates are low
Predecoding in RISC machines too
Identify branch instructions
PowerPC: 7 bits
UltraSPARC, R10000, HP PA-8000: 4/5 bits

28
Instruction Dispatch in Superscalar Pipeline
29
Instruction Dispatching

Route each instruction to the appropriate functional unit (FU) for execution
Temporary buffering before execution
Prior to execution, an instruction must have all its operands
Reservation stations
Fetch and decode are centralized, whereas dispatch is decentralized
Centralized or distributed reservation stations
Pentium 4: centralized; PowerPC: distributed

30
Centralized Reservation Station
31
Distributed Reservation Station
32
Dispatch (contd)

Hybrid approaches
Clustered reservation stations
Not centralized, but some reservation stations feed more than one FU
Centralized: best overall utilization
Needs to be multiported
Distributed: possible idling

33
Dispatch vs Issue

Dispatching: associating instructions with FU types
Issuing: initiation of execution in FUs
Separation of dispatch and issue makes sense only for distributed reservation stations
For centralized instruction windows (reservation stations), the two terms are interchangeable

34
Instruction Execution

Heart of a CPU
Lots of FUs in current superscalars
INT (1 or more), FP, LD/ST,
Some FUs are specialized
TI SuperSPARC contains cascaded ALUs
IBM RS/6000 has multiply-add units

35
Functional Units
(a) Int Functional Unit in TI SuperSPARC
(b) FP Functional Unit in IBM RS/6000
36
Instruction Execution

LD/ST units (L/S units) (L/S pipes)
Dedicated L/S unit, or INT units used
Branch units
Dedicated or INT units used
Graphics and image processing units
Pixel processing units
Bit manipulation units
TriMedia media processor
Quad avg

37
Instruction Execution (2)

Quad average
Quad avg is used in MPEG decoding for decompressing compressed video
Computes (a+e+1)/2, (b+f+1)/2, (c+g+1)/2, (d+h+1)/2
a, b, c, d, e, f, g, h are bytes
Stored as two 32-bit quantities
Replaces numerous add and divide instructions
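
What the quad average operation computes can be shown in software. The sketch below is a plain Python emulation, assuming the four bytes a, b, c, d are packed into one 32-bit word and e, f, g, h into the other, most significant byte first; the function name quadavg and the byte ordering are assumptions made for illustration, not a description of the TriMedia encoding.

```python
def quadavg(x, y):
    """Byte-wise rounded average of two packed 32-bit words.

    Each result byte is (p + q + 1) // 2 for the corresponding bytes p of x
    and q of y. The maximum value, (255 + 255 + 1) // 2 = 255, still fits in a byte.
    """
    result = 0
    for shift in (24, 16, 8, 0):
        p = (x >> shift) & 0xFF
        q = (y >> shift) & 0xFF
        result |= ((p + q + 1) >> 1) << shift
    return result


print(hex(quadavg(0x01020304, 0x03040506)))   # 0x2030405
```

In software this takes four extractions, eight additions, four shifts and four merges, which is why a single-instruction version replaces so many add and divide instructions.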

38
Instruction Execution (3)

Best mix of FUs for a superscalar processor
Depends on application domain and instruction mix
If 40% ALU, 20% branches, 40% L/S: the 4-2-4 rule of thumb
ALU units are abundant in current processors
L/S units are more scarce
Need the D-cache to be multiported
Banked memory is easier to build than multiported memory

39
Instruction Execution (4)

Banked D-caches / multiported D-caches
If no bank conflicts, good bandwidth
Intel Pentium: 8-banked D-cache
Banking is cheaper than true multiporting or multiple copies
Total number of FUs is often more than the processor's superscalarity
Superscalarity is typically the fetch/decode/retire width
Complexity grows as n^2, where n is the number of FUs
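
Whether two simultaneous accesses collide in a banked D-cache depends on how addresses map to banks. The sketch below assumes simple low-order interleaving on 4-byte words across 8 banks; that mapping, and the names bank_of and has_bank_conflict, are assumptions for illustration and are not taken from the Pentium documentation.

```python
def bank_of(addr, n_banks=8, word_bytes=4):
    """Bank selected by an address under low-order word interleaving (assumed mapping)."""
    return (addr // word_bytes) % n_banks


def has_bank_conflict(addrs, n_banks=8, word_bytes=4):
    """True if any two addresses issued in the same cycle map to the same bank.

    Simplification: two accesses to the very same word also count as a conflict.
    """
    banks = [bank_of(a, n_banks, word_bytes) for a in addrs]
    return len(set(banks)) < len(banks)


print(has_bank_conflict([0x00, 0x04]))   # False: banks 0 and 1
print(has_bank_conflict([0x00, 0x20]))   # True: both map to bank 0
```

When there is no conflict, both accesses proceed in the same cycle, so the banked cache behaves like a dual-ported one at much lower cost.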

40
Completion and Retiring

Completion means finishing execution
Retiring means updating machine state
Retiring, in my opinion: committing results to register or memory (the book uses different terminology, pg 207)
Completion buffer / reorder buffer (ROB)
Out-of-order execution, in-order retiring
Stages between the instruction buffer and the ROB are out of order

41
A Dynamic Pipeline
42
Reorder Buffers (ROB)

Put instructions back in order
Instructions enter the ROB out of order (o-o-o)
Instructions leave the ROB in order
An instruction is architecturally complete when it leaves the ROB
Registers updated, memory updated
Sometimes memory instructions have their own ROB, or MOB (memory ordering buffer)
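
The reordering behaviour described on this slide can be modelled in a few lines. The sketch below follows the slide's view (instructions enter out of order, leave in order) and assumes each instruction carries a program-order sequence number starting at 0; the ReorderBuffer class and its method names are hypothetical. Many real designs instead allocate ROB entries in program order at dispatch and mark them finished out of order, which yields the same retirement order.

```python
import heapq


class ReorderBuffer:
    """Toy ROB: accepts finished instructions in any order, releases them in program order."""

    def __init__(self):
        self._heap = []       # min-heap of (sequence number, instruction name)
        self._next_seq = 0    # sequence number of the next instruction allowed to leave

    def enter(self, seq, name):
        """A finished instruction enters the ROB (possibly out of order)."""
        heapq.heappush(self._heap, (seq, name))

    def retire(self):
        """Release every instruction that is next in program order."""
        retired = []
        while self._heap and self._heap[0][0] == self._next_seq:
            retired.append(heapq.heappop(self._heap)[1])
            self._next_seq += 1
        return retired


rob = ReorderBuffer()
rob.enter(2, "i2")
print(rob.retire())   # []  (i0 and i1 have not finished yet)
rob.enter(0, "i0")
print(rob.retire())   # ['i0']
rob.enter(1, "i1")
print(rob.retire())   # ['i1', 'i2']
```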

43
Interrupts/Exceptions

Interrupts: external (I/O devices, OS)
Exceptions: internal (errors)
Illegal opcode, divide by 0, overflow/underflow, page faults
OS needs to intervene to handle exceptions
Two ways of interrupt/exception handling
Precise interrupts
Imprecise interrupts

44
Precise and Imprecise Interrupts

Save the state of the machine at the interrupt, restart from the point of interrupt
Stop fetching and allow the pipeline to drain
Another interrupt might occur while the pipeline is draining
Process the interrupt from the earlier instruction first
Precise interrupts: exceptions are processed in the same order as on a non-pipelined machine
Imprecise interrupts: exception processing order differs from the non-pipelined order

45
ROB for exception handling

When an exception occurs, the instruction is tagged in the ROB
The completion stage checks each instruction before it is completed
Tagged instructions are not allowed to complete
Instructions prior to the tagged instruction are allowed to complete
Machine state is checkpointed or saved
Remaining instructions in the pipeline, some of which may have finished, are discarded
Exception is processed, the checkpoint is restored, and execution resumes
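
The exception-handling steps on this slide can be sketched as a single pass over the ROB. The code below assumes the ROB is a list of entries in program order, each with a name, a finished flag, and an exception tag; the function name and the dictionary layout are hypothetical, and checkpointing and the handler itself are left out.

```python
def retire_with_exceptions(rob):
    """Walk the ROB in program order and apply the precise-exception rule.

    Returns (completed, faulting, discarded):
      completed -- instructions before the tagged one that are allowed to complete
      faulting  -- name of the tagged instruction, or None if none was reached
      discarded -- younger instructions thrown away, even if they had finished
    """
    completed = []
    for i, entry in enumerate(rob):
        if entry["exception"]:
            discarded = [e["name"] for e in rob[i + 1:]]
            return completed, entry["name"], discarded
        if not entry["finished"]:
            break                      # oldest instruction still executing: wait
        completed.append(entry["name"])
    return completed, None, []


rob = [
    {"name": "i0", "finished": True,  "exception": False},
    {"name": "i1", "finished": True,  "exception": False},
    {"name": "i2", "finished": True,  "exception": True},    # tagged in the ROB
    {"name": "i3", "finished": True,  "exception": False},   # finished, still discarded
    {"name": "i4", "finished": False, "exception": False},
]
print(retire_with_exceptions(rob))   # (['i0', 'i1'], 'i2', ['i3', 'i4'])
```

Because the tagged instruction is only acted on once everything older has completed, the machine state seen by the handler matches what a non-pipelined machine would have produced.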