Title: Superscalar Organization
1Chapter 4
2Limitations of Scalar Pipelines
- Maximum throughput bounded by one instruction per cycle
- Inefficient unification into one pipeline
- ALU, MEM stages very diverse, e.g. FP
- Rigid nature of in-order pipeline
- If a leading instruction is stalled, every subsequent instruction is stalled
3A Rigid Pipeline
4Solving Problems of Scalar Pipelines
- Maximum throughput bounded by one instruction per cycle
- -> parallel pipelines
- Inefficient unification into a single pipeline
- -> diversified pipelines
- Rigid nature of in-order pipeline
- -> allow out-of-order execution
- OOO pipelines or dynamic pipelines
5Machine Parallelism
- No Parallelism (Nonpipelined)
- Temporal Parallelism (Pipelined)
- Spatial Parallelism (Multiple units)
- Combined Temporal and Spatial Parallelism
6A Parallel Pipeline
Width = 3
7Scalar and Parallel Pipeline
- The five-stage i486 scalar pipeline
- The five-stage Pentium parallel pipeline of width = 2
8Diversified Parallel Pipeline
9Diversified Functional Units
CDC 6600 with 10 diversified functional units
10Motorola 88110 Superscalar uP
11Interpipeline Stage Buffer
- Single entry buffer
- Multientry buffer
- Multientry buffer with reordering
12A Dynamic Pipeline
13In-order Issue into Diversified Pipelines
RD <- Fn(RS, RT)
In-order instruction stream fields: Dest. Reg., Func. Unit, Source Registers
Issue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard
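The four issue-stage checks can be sketched in a few lines. This is a minimal illustration, not the logic of any real machine: the `Instr` record, the `can_issue` function, and the assumption that every in-flight instruction may still have unread sources and an unwritten destination are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    unit: str     # functional unit type, e.g. "INT" or "FADD" (hypothetical names)
    dest: str     # destination register
    srcs: tuple   # source registers

def can_issue(instr, in_flight, busy_units):
    """Return the first hazard blocking issue, or None if issue may proceed.

    Assumption: every instruction in in_flight has not yet read its
    sources or written its destination.
    """
    if instr.unit in busy_units:
        return "structural"
    for older in in_flight:
        if older.dest in instr.srcs:
            return "RAW"   # needs a value the older instruction produces
        if older.dest == instr.dest:
            return "WAW"   # both write the same register
        if instr.dest in older.srcs:
            return "WAR"   # would overwrite a value the older one still reads
    return None
```

With an in-flight `Instr("FMUL", "f2", ("f0", "f1"))`, a younger instruction reading `f2` is blocked by RAW, one writing `f2` by WAW, and one writing `f0` by WAR.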
14A Superscalar Pipeline
A six-stage template (TEM) superscalar pipeline
15Superscalar Pipeline Design
Instruction Flow
Data Flow
Retire
16Inorder Pipelines
In-order pipeline: no WAW, no WAR hazards (almost always true)
17Limitations of Inorder Pipelines
- CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e. when NxM approaches the average distance between dependent instructions
- Forwarding is no longer effective
- -> must stall more often
- Pipeline may never be full due to frequent dependency stalls!
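A toy model shows why widening an in-order machine stops paying off once the width reaches the distance between dependent instructions. Everything here is an assumption made for illustration: the function name `inorder_cpi`, a fixed dependence distance (instruction i depends on instruction i - dep_distance), and a fixed result latency with full forwarding.

```python
def inorder_cpi(num_instrs, width, dep_distance, latency=1):
    """CPI of a toy in-order machine issuing up to `width` instructions per cycle.

    Assumption: instruction i depends only on instruction i - dep_distance,
    whose result is available `latency` cycles after it issues.
    """
    issue_cycle = []
    cycle = 0
    issued_this_cycle = 0
    for i in range(num_instrs):
        earliest = 0
        if i >= dep_distance:
            earliest = issue_cycle[i - dep_distance] + latency
        if earliest > cycle:
            # Dependency stall: this instruction and all younger ones wait.
            cycle = earliest
            issued_this_cycle = 0
        if issued_this_cycle == width:
            # Issue slots exhausted for this cycle.
            cycle += 1
            issued_this_cycle = 0
        issue_cycle.append(cycle)
        issued_this_cycle += 1
    return (issue_cycle[-1] + 1) / num_instrs
```

Under these assumptions, `inorder_cpi(1000, 4, 4)` and `inorder_cpi(1000, 8, 4)` give the same CPI: the extra four issue slots are never filled because of dependency stalls.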
18Out-of-order Pipelining 101
Stages: IF, ID, RD, EX (INT; Fadd1, Fadd2; Fmult1, Fmult2, Fmult3; LD/ST), WB
19Instruction Fetching
- Fetch should not be bottleneck
- Wide I-caches
- Fetch bandwidth
- Two major problems
- Misaligned accesses
- Control flow (branch) instructions
- 2 solutions for misaligned accesses
- Compiler
- Hardware alignment network
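The cost of a misaligned fetch can be estimated with a small calculation. This sketch assumes fixed-size instructions, a fetch group that must lie within one aligned cache row, and no hardware realignment; the function name `instrs_fetched` and its default parameters are hypothetical.

```python
def instrs_fetched(pc, fetch_width=4, instr_bytes=4):
    """Useful instructions delivered when the fetch starts at address pc.

    Assumption: the fetch group cannot cross the aligned row boundary,
    so slots before pc within the row are wasted.
    """
    slot = (pc // instr_bytes) % fetch_width   # position of pc within its row
    return fetch_width - slot
```

For a 4-wide fetch, a branch target at the start of a row yields 4 instructions, while a target in the last slot yields only 1, which is the bandwidth loss the compiler or an alignment network tries to avoid.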
20Organization of wide I-cache
(a) One cache line is equal to one physical row
(b) One cache line is equal to two physical rows
21Misalignment of the Fetch Group
22RS-6000 with Auto-Realignment
23Instruction Decoding
- Identify individual instructions
- Determine I-types
- Detect dependencies
- 2 major complexity factors
- ISA (RISC/CISC)
- Width of the parallel pipeline
24Instruction Decoding (2)
- ISA (RISC/CISC)
- Uniform width
- Regular encoding
- Few different formats
- Few addressing modes
- Dependences -> comparators
- Multiported reg files, multiple buses
- Branches
- CISC examples: Pentium, K5
- How to identify the start of the next instruction
25Fetch/Decode Unit of Intel P6 Pipeline
Uops employ load/store model
Decoder 0: generalized decoder
Decode needs multiple stages -> concept of predecoding
26Pre-decoding Mechanism of AMD K-5
Predecode logic sits between memory and I-cache
Additional info stored into cache: 5 bits (start/end of instruction, number of uops, location of opcodes, prefixes)
27Predecoding
- Overhead of predecoding
- Increase in I-cache size
- K5 increase is 50%
- Increased I-cache miss penalty
- Not a problem if I-miss rates low
- Predecoding in RISC machines too
- Identify branch instructions
- PowerPC: 7 bits
- UltraSPARC, R10000, HP PA-8000: 4/5 bits
28Instruction Dispatch in Superscalar Pipeline
29Instruction Dispatching
- Route instruction to appropriate functional unit (FU) for execution
- Temporary buffering before execution
- Prior to execution, an instruction must have all its operands
- Reservation stations
- Fetch, decode centralized whereas dispatch decentralized
- Centralized/distributed reservation station
- Pentium 4: centralized; PowerPC: distributed
30Centralized Reservation Station
31Distributed Reservation Station
32Dispatch (contd)
- Hybrid approaches
- Clustered reservation stations
- Not centralized, but some reservation stations feed more than 1 FU
- Centralized: best overall utilization
- Needs to be multiported
- Distributed: possible idling
33Dispatch vs Issue
- Dispatching: associating instructions with FU types
- Issuing: initiation of execution in FUs
- Separation of dispatch/issue makes sense only for distributed reservation stations
- For centralized instruction windows (reservation stations), the two terms are interchangeable
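The dispatch/issue split for distributed reservation stations can be sketched as two separate steps. This is an illustrative model only: the class `DistributedRS`, its methods, and the one-issue-per-unit-type-per-cycle policy are assumptions, not a description of any particular processor.

```python
from collections import defaultdict

class DistributedRS:
    """Toy distributed reservation stations, one station per FU type."""

    def __init__(self, capacity_per_unit):
        self.capacity = capacity_per_unit
        self.stations = defaultdict(list)   # unit type -> waiting entries

    def dispatch(self, name, unit_type, srcs):
        """Dispatch: associate the instruction with an FU type (may fail if full)."""
        station = self.stations[unit_type]
        if len(station) >= self.capacity:
            return False
        station.append((name, tuple(srcs)))
        return True

    def issue(self, ready_regs):
        """Issue: start at most one ready instruction per unit type this cycle."""
        issued = []
        for unit_type, station in self.stations.items():
            for entry in station:
                name, srcs = entry
                if all(s in ready_regs for s in srcs):
                    station.remove(entry)
                    issued.append((unit_type, name))
                    break
        return issued
```

An instruction can sit in its station after dispatch until its operands arrive, and a younger instruction in the same station may issue first. A full station also blocks dispatch to that unit type even when other stations have free entries, which is the idling cost of the distributed scheme.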
34Instruction Execution
- Heart of a CPU
- Lots of FUs in current superscalars
- INT (1 or more), FP, LD/ST,
- Some FUs are specialized
- TI SuperSPARC contains cascaded ALUs
- IBM RS/6000 has multiply-add units
35Functional Units
(a) Int Functional Unit in TI SuperSPARC
(b) FP Functional Unit in IBM RS/6000
36Instruction Execution
- LD/ST units (L/S units) (L/S pipes)
- Dedicated L/S unit, or INT units used
- Branch units
- Dedicated or INT units used
- Graphics and image processing units
- Pixel processing units
- Bit manipulation units
- Trimedia media processor
- Quad avg
37Instruction Execution (2)
- Quad average
- Quad avg used in MPEG decoding for decompressing compressed videos
- (a+e+1)/2, (b+f+1)/2, (c+g+1)/2, (d+h+1)/2
- a, b, c, d, e, f, g, h are bytes
- Stored as 2 32-bit quantities
- Replaces numerous add and divide instructions
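The quad average operation can be modeled in software to show what the single instruction replaces. This is a sketch of the arithmetic only, assuming unsigned bytes packed four to a 32-bit word; the function name `quadavg` is used here for illustration.

```python
def quadavg(x, y):
    """Rounded average of four packed unsigned bytes in x with those in y."""
    result = 0
    for shift in (0, 8, 16, 24):
        a = (x >> shift) & 0xFF
        e = (y >> shift) & 0xFF
        # (a + e + 1) / 2 never exceeds 255, so it fits back into one byte.
        result |= ((a + e + 1) >> 1) << shift
    return result
```

For example, `quadavg(0x01020304, 0x03040506)` returns `0x02030405`. Without the dedicated unit, each word pair needs unpacking, four adds, four increments, four shifts, and repacking.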
38Instruction Execution (3)
- Best mix of FUs for a superscalar processor
- Depends on application domain and instruction mix
- If 40% ALU, 20% branches, 40% L/S: 4-2-4 rule of thumb
- ALU units abundant in current processors
- L/S units are more scarce
- Needs D-cache to be multiported
- Banked memory easier to make than multiported
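The 4-2-4 rule of thumb is just the instruction mix scaled to a number of functional units. A minimal sketch, assuming a total of 10 units and a hypothetical helper `fu_mix`:

```python
def fu_mix(mix_percent, total_units=10):
    """Scale an instruction mix (percentages) to a count of FUs per type."""
    return {kind: round(total_units * pct / 100)
            for kind, pct in mix_percent.items()}
```

`fu_mix({"ALU": 40, "branch": 20, "L/S": 40})` gives 4 ALUs, 2 branch units, and 4 L/S units. Real designs deviate from this because L/S units need extra D-cache ports.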
39Instruction Execution (4)
- Banked D-caches/Multiported D-caches
- If no bank conflicts, good bandwidth
- Intel Pentium: 8-banked D-cache
- Banking cheaper than true multiporting or multiple copies
- Total number of FUs often more than processor superscalarity
- Superscalarity typically F/D/retire width
- Complexity grows as n^2, where n is the number of FUs
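A bank conflict occurs when two accesses in the same cycle map to the same bank. The sketch below assumes simple low-order interleaving; the 4-byte interleave granularity and the function names are assumptions chosen for illustration.

```python
def bank_of(addr, num_banks=8, interleave_bytes=4):
    """Bank index under low-order interleaving (granularity is an assumption)."""
    return (addr // interleave_bytes) % num_banks

def has_bank_conflict(addrs, num_banks=8, interleave_bytes=4):
    """True if any two accesses issued in the same cycle hit the same bank."""
    banks = [bank_of(a, num_banks, interleave_bytes) for a in addrs]
    return len(set(banks)) < len(banks)
```

Accesses to `0x00` and `0x04` fall in different banks and can proceed together, while `0x00` and `0x20` collide in bank 0 and must be serialized.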
40Completion and Retiring
- Completion means finish execution
- Retiring means update machine state
- Retiring, in my opinion: committing results to register or memory (book uses different terminology, pg 207)
- Completion Buffer/Reorder Buffer (ROB)
- Out-of-order execution, in-order retiring
- Stages between instruction buffer and ROB are out of order
41 A Dynamic Pipeline
42Reorder Buffers (ROB)
- Put instructions back in order
- Instructions enter ROB out of order (o-o-o)
- Instructions leave ROB in order
- An instruction is architecturally complete when it leaves ROB
- Registers updated, memory updated
- Sometimes memory instructions have their own ROB or MOB (memory ordering buffer)
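The "enter out of order, leave in order" behavior can be sketched with a priority queue keyed by program-order sequence number. This is a software illustration only; the class `ReorderBuffer` and its method names are hypothetical, and hardware ROBs are typically circular buffers rather than heaps.

```python
import heapq

class ReorderBuffer:
    """Toy ROB: results arrive in any order and retire in program order."""

    def __init__(self):
        self._heap = []      # (sequence number, result), smallest seq on top
        self._next_seq = 0   # next instruction allowed to retire

    def finish(self, seq, result):
        """Record a finished instruction; may arrive out of order."""
        heapq.heappush(self._heap, (seq, result))

    def retire(self):
        """Retire every finished instruction that is next in program order."""
        retired = []
        while self._heap and self._heap[0][0] == self._next_seq:
            retired.append(heapq.heappop(self._heap))
            self._next_seq += 1
        return retired
```

If instructions 2 and 1 finish before instruction 0, `retire()` returns nothing until 0 finishes, then releases all three in order.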
43Interrupts/Exceptions
- Interrupts: external, I/O devices, OS
- Exceptions: internal, errors
- Illegal opcode, divide by 0, overflow/underflow, page faults
- OS needs to intervene to handle exceptions
- 2 ways of interrupt/exception handling
- Precise interrupts
- Imprecise interrupts
44Precise and Imprecise Interrupts
- Save state of machine at interrupt, restart from the point of interrupt
- Stop fetching
- Allow pipeline to drain
- Another interrupt might occur while allowing to drain
- Process interrupt from an earlier instruction first
- Precise interrupts: exceptions are processed in the same order as on a non-pipelined machine
- Imprecise interrupts: exception processing order differs from the non-pipelined order
45ROB for exception handling
- When exception occurs, instruction tagged in ROB
- Completion stage checks each instruction before it is completed
- Tagged instructions not allowed to complete
- Instructions prior to tagged instruction allowed to complete
- Machine state is checkpointed or saved
- Remaining instructions in pipeline, some of which may have finished, are discarded
- Exception processed, checkpoint restored, execution resumes
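The completion-stage check can be sketched as a single in-order walk over the ROB. This is a simplified model: the dictionary fields, the function name `complete_in_order`, and the representation of the ROB as a list in program order are assumptions for illustration.

```python
def complete_in_order(rob):
    """Walk ROB entries in program order and apply the exception check.

    rob: list of dicts with keys "name", "finished", "exception",
    ordered oldest first (an assumed representation).
    Returns (completed names, faulting name or None, discarded names).
    """
    completed = []
    for idx, entry in enumerate(rob):
        if entry["exception"]:
            # Tagged instruction may not complete; everything younger is
            # discarded, even if it already finished executing.
            discarded = [e["name"] for e in rob[idx + 1:]]
            return completed, entry["name"], discarded
        if not entry["finished"]:
            break   # oldest unfinished instruction blocks completion
        completed.append(entry["name"])
    return completed, None, []
```

With i0 finished, i1 tagged, and i2 finished, only i0 completes; i1 is reported as the faulting instruction and i2 is discarded, so the saved state matches what a non-pipelined machine would show at i1.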