Title: Computer Organization and Architecture
1. Computer Organization and Architecture
- Instruction Level Parallelism and Superscalar Processors
Chapter 14
2. What is Superscalar?
- Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
- Equally applicable to RISC and CISC
- In practice usually RISC
3. Why Superscalar?
- Most operations are on scalar quantities (see RISC notes)
- Improve these operations to get an overall improvement
4. General Superscalar Organization
5. Superpipelined
- Many pipeline stages need less than half a clock cycle
- Double internal clock speed gets two tasks per external clock cycle
- Superscalar allows parallel fetch and execute
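As a rough illustration of the difference, the sketch below uses an idealised timing model. The model is an assumption, not something stated on the slides: every stage takes one base clock cycle and there are no dependencies or stalls. The function names and the example figures (6 instructions, 4 stages, degree 2) are hypothetical.

```python
import math

def base_cycles(n: int, k: int) -> int:
    """Base cycles for n instructions on one k-stage pipeline, one issue per cycle."""
    return k + n - 1

def superscalar_cycles(n: int, k: int, m: int) -> int:
    """Base cycles when m instructions are issued together every cycle."""
    return k + math.ceil(n / m) - 1

def superpipelined_cycles(n: int, k: int, m: int) -> float:
    """Base cycles when one instruction is issued every 1/m of a base cycle."""
    return k + (n - 1) / m

n, k, m = 6, 4, 2  # hypothetical workload: 6 instructions, 4 stages, degree 2
print(base_cycles(n, k))               # 9
print(superscalar_cycles(n, k, m))     # 6
print(superpipelined_cycles(n, k, m))  # 6.5
```

Under these assumptions both approaches finish well ahead of the base pipeline, with the superpipelined machine trailing slightly because its later instructions start a fraction of a cycle after the first.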
6. Superscalar v Superpipeline
7. Limitations
- Instruction level parallelism
- Compiler based optimisation
- Hardware techniques
- Limited by
- True data dependency
- Procedural dependency
- Resource conflicts
- Output dependency
- Antidependency
8. True Data Dependency
- ADD r1, r2 (r1 := r1 + r2)
- MOVE r3, r1 (r3 := r1)
- Can fetch and decode second instruction in parallel with first
- Can NOT execute second instruction until first is finished
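The check a processor has to make here can be sketched in a few lines. The `Instr` record and the function name are hypothetical, introduced only for illustration; the two instructions are the ADD and MOVE from the slide.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    """Hypothetical instruction record: one destination register, some source registers."""
    dest: str
    srcs: Tuple[str, ...]

def has_true_dependency(first: Instr, second: Instr) -> bool:
    """True if `second` reads the register that `first` writes."""
    return first.dest in second.srcs

add = Instr(dest="r1", srcs=("r1", "r2"))  # ADD r1, r2
move = Instr(dest="r3", srcs=("r1",))      # MOVE r3, r1
print(has_true_dependency(add, move))  # True
```

Because the check returns True, MOVE has to wait for ADD to produce r1 even though both can be fetched and decoded together.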
9. Procedural Dependency
- Cannot execute instructions after a branch in parallel with instructions before a branch
- Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
- This prevents simultaneous fetches
10. Resource Conflict
- Two or more instructions requiring access to the same resource at the same time
- e.g. two arithmetic instructions
- Can duplicate resources
- e.g. have two arithmetic units
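The effect of duplicating a resource can be estimated with a one-line calculation. This is a minimal sketch assuming the instructions are independent and each occupies a unit for exactly one cycle; the function name is hypothetical.

```python
import math

def cycles_needed(num_instructions: int, num_units: int) -> int:
    """Cycles to execute independent single-cycle instructions on identical units."""
    return math.ceil(num_instructions / num_units)

print(cycles_needed(2, 1))  # 2: the two arithmetic instructions conflict and serialise
print(cycles_needed(2, 2))  # 1: a second arithmetic unit removes the conflict
```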
11. Effect of Dependencies
12. Design Issues
- Instruction level parallelism
- Instructions in a sequence are independent
- Execution can be overlapped
- Governed by data and procedural dependency
- Machine Parallelism
- Ability to take advantage of instruction level parallelism
- Governed by number of parallel pipelines
13. Instruction Issue Policy
- Order in which instructions are fetched
- Order in which instructions are executed
- Order in which instructions change registers and memory
14. In-Order Issue In-Order Completion
- Issue instructions in the order they occur
- Not very efficient
- May fetch >1 instruction
- Instructions must stall if necessary
15. In-Order Issue In-Order Completion (Diagram)
16. In-Order Issue Out-of-Order Completion
- Output dependency
- R3 := R3 + R5 (I1)
- R4 := R3 + 1 (I2)
- R3 := R5 + 1 (I3)
- I2 depends on result of I1 - data dependency
- If I3 completes before I1, R3 is left holding I1's result instead of I3's - output (write-write) dependency
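The hazard can be made concrete with a small simulation. This is a sketch under stated assumptions: the initial register values are hypothetical, both instructions read their operands before either writes back, and I2 is left out because it does not write R3.

```python
def final_r3(completion_order):
    """Return the value left in R3 when I1 and I3 write back in the given order."""
    regs = {"R3": 10, "R5": 2}  # hypothetical starting values
    # Operands are read up front, before any result is written back.
    results = {
        "I1": ("R3", regs["R3"] + regs["R5"]),  # R3 := R3 + R5
        "I3": ("R3", regs["R5"] + 1),           # R3 := R5 + 1
    }
    for name in completion_order:
        dest, value = results[name]
        regs[dest] = value  # the last write to complete wins
    return regs["R3"]

print(final_r3(["I1", "I3"]))  # 3: program order, R3 ends up as R5 + 1
print(final_r3(["I3", "I1"]))  # 12: I1 completes last and overwrites I3's result
```

Any instruction after I3 that reads R3 expects the first value, which is why the two writes must complete in program order.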
17. In-Order Issue Out-of-Order Completion (Diagram)
18. Out-of-Order Issue Out-of-Order Completion
- Decouple decode pipeline from execution pipeline
- Can continue to fetch and decode until this pipeline is full
- When a functional unit becomes available an instruction can be executed
- Since instructions have been decoded, processor can look ahead
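The look-ahead idea can be sketched as a loop over a window of decoded instructions. The sketch assumes single-cycle execution and that only true dependencies block issue (as if output and antidependencies had already been removed). The function name, the register numbers, and the four-instruction program are hypothetical.

```python
def issue_schedule(program, units=2):
    """program: list of (name, dest, srcs) in program order.
    Returns one list of issued instruction names per cycle."""
    window = list(program)  # decoded instructions waiting to issue
    schedule = []
    while window:
        issued = []
        for i, (name, dest, srcs) in enumerate(window):
            if len(issued) == units:
                break  # every functional unit is taken this cycle
            # Blocked if an earlier instruction still in the window writes one of our sources.
            blocked = any(d in srcs for (_, d, _) in window[:i])
            if not blocked:
                issued.append(name)
        window = [ins for ins in window if ins[0] not in issued]
        schedule.append(issued)
    return schedule

program = [
    ("I1", "R1", ("R2", "R3")),  # R1 := R2 + R3
    ("I2", "R4", ("R1",)),       # R4 := R1 + 1, needs I1
    ("I3", "R5", ("R6",)),       # R5 := R6 + 1, independent
    ("I4", "R7", ("R5", "R4")),  # R7 := R5 + R4, needs I2 and I3
]
print(issue_schedule(program))  # [['I1', 'I3'], ['I2'], ['I4']]
```

In the first cycle the processor skips past the blocked I2 and issues I3 alongside I1, which an in-order issue policy could not do.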
19. Out-of-Order Issue Out-of-Order Completion (Diagram)
20. Antidependency
- Read-write (write-after-read) dependency
- R3 := R3 + R5 (I1)
- R4 := R3 + 1 (I2)
- R3 := R5 + 1 (I3)
- R7 := R3 + R4 (I4)
- I3 cannot complete before I2 starts as I2 needs a value in R3 and I3 changes R3
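The three kinds of data dependency can be told apart by comparing the registers each instruction reads and writes: a true dependency when the later instruction reads what the earlier one writes, an antidependency when the later instruction writes what the earlier one reads, and an output dependency when both write the same register. The sketch below applies that to pairs from the sequence above. The `Instr` record and `classify` function are hypothetical, and the pairwise check is a simplification that ignores any writes between the two instructions.

```python
from typing import List, NamedTuple, Tuple

class Instr(NamedTuple):
    """Hypothetical instruction record: one destination register, some source registers."""
    dest: str
    srcs: Tuple[str, ...]

def classify(earlier: Instr, later: Instr) -> List[str]:
    """List the dependency kinds that `later` has on `earlier`."""
    kinds = []
    if earlier.dest in later.srcs:
        kinds.append("true")    # later reads what earlier writes
    if later.dest in earlier.srcs:
        kinds.append("anti")    # later overwrites what earlier reads
    if later.dest == earlier.dest:
        kinds.append("output")  # both write the same register
    return kinds

i1 = Instr("R3", ("R3", "R5"))  # R3 := R3 + R5
i2 = Instr("R4", ("R3",))       # R4 := R3 + 1
i3 = Instr("R3", ("R5",))       # R3 := R5 + 1

print(classify(i1, i2))  # ['true']
print(classify(i2, i3))  # ['anti']
print(classify(i1, i3))  # ['anti', 'output']
```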
21. Register Renaming
- Output and antidependencies occur because register contents may not reflect the correct ordering from the program
- May result in a pipeline stall
- Registers allocated dynamically
- i.e. registers are not specifically named
22. Register Renaming example
- R3b := R3a + R5a (I1)
- R4b := R3b + 1 (I2)
- R3c := R5a + 1 (I3)
- R7b := R3c + R4b (I4)
- Without subscript refers to logical register in instruction
- With subscript is hardware register allocated
- Note R3a, R3b, R3c are distinct hardware registers
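One way to produce this kind of renaming is to allocate a fresh hardware register on every write and point later reads at the latest allocation. The sketch below is a minimal illustration of that idea, not a description of any particular processor. The `rename` function and the letter-suffix naming are assumptions chosen to match the subscripts on the slide, and operands that do not start with "R" are treated as constants.

```python
from string import ascii_lowercase

def rename(program):
    """program: list of (dest, [sources]) using logical register names.
    Returns the same program using hardware names such as R3a, R3b."""
    version = {}  # logical register -> index of its latest hardware allocation

    def current(operand):
        if not operand.startswith("R"):
            return operand  # constant, passed through unchanged
        return operand + ascii_lowercase[version.setdefault(operand, 0)]

    renamed = []
    for dest, srcs in program:
        new_srcs = [current(s) for s in srcs]     # reads use the latest allocation
        version[dest] = version.get(dest, 0) + 1  # each write gets a fresh register
        renamed.append((current(dest), new_srcs))
    return renamed

program = [
    ("R3", ["R3", "R5"]),  # I1
    ("R4", ["R3", "1"]),   # I2
    ("R3", ["R5", "1"]),   # I3
    ("R7", ["R3", "R4"]),  # I4
]
for dest, srcs in rename(program):
    print(dest, ":=", " + ".join(srcs))
```

After renaming, I3 writes R3c while I2 still reads R3b, so the antidependency between them and the output dependency between I1 and I3 no longer constrain the order.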
23. Machine Parallelism
- Duplication of Resources
- Out of order issue
- Renaming
- Not worth duplicating functions without register renaming
- Need instruction window large enough (more than 8)
24. Branch Prediction
- 80486 fetches both next sequential instruction after branch and branch target instruction
- Gives two cycle delay if branch taken
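The cost of that two cycle delay depends on how often branches are taken. As a back-of-the-envelope sketch, assuming untaken branches cost nothing extra, the average delay per branch is the taken fraction times the penalty. The function name and the 50% taken fraction are hypothetical.

```python
def average_branch_delay(taken_fraction: float, taken_penalty: int = 2) -> float:
    """Expected delay cycles per branch when only taken branches pay the penalty."""
    return taken_fraction * taken_penalty

print(average_branch_delay(0.5))  # 1.0
```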
25. RISC - Delayed Branch
- Calculate result of branch before unusable instructions pre-fetched
- Always execute single instruction immediately following branch
- Keeps pipeline full while fetching new instruction stream
- Not as good for superscalar
- Multiple instructions need to execute in delay slot
- Instruction dependence problems
- Revert to branch prediction
26. Superscalar Execution
27. Superscalar Implementation
- Simultaneously fetch multiple instructions
- Logic to determine true dependencies involving register values
- Mechanisms to communicate these values
- Mechanisms to initiate multiple instructions in parallel
- Resources for parallel execution of multiple instructions
- Mechanisms for committing process state in correct order
28. Pentium 4
- 80486 - CISC
- Pentium - some superscalar components
- Two separate integer execution units
- Pentium Pro - full blown superscalar
- Subsequent models refine and enhance superscalar design
29. Pentium 4 Block Diagram
30. Pentium 4 Operation
- Fetch instructions from memory in order of static program
- Translate instruction into one or more fixed length RISC instructions (micro-operations)
- Execute micro-ops on superscalar pipeline
- micro-ops may be executed out of order
- Commit results of micro-ops to register set in original program flow order
- Outer CISC shell with inner RISC core
- Inner RISC core pipeline at least 20 stages
- Some micro-ops require multiple execution stages
- Longer pipeline
- c.f. five stage pipeline on x86 up to Pentium
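The combination of out-of-order execution and in-order commit can be sketched generically: a micro-op's result is held back until every earlier micro-op in program order has also finished. This is a simplified illustration, not the Pentium 4's actual mechanism. The function name and the micro-op labels u1 to u4 are hypothetical.

```python
def commit_in_order(program_order, completion_order):
    """For each completion event, list the micro-ops that can be committed at that point."""
    finished = set()
    next_to_commit = 0  # index of the oldest micro-op not yet committed
    log = []
    for uop in completion_order:
        finished.add(uop)
        committed = []
        # Commit from the oldest micro-op onward, stopping at the first unfinished one.
        while (next_to_commit < len(program_order)
               and program_order[next_to_commit] in finished):
            committed.append(program_order[next_to_commit])
            next_to_commit += 1
        log.append(committed)
    return log

program_order = ["u1", "u2", "u3", "u4"]
completion_order = ["u2", "u1", "u4", "u3"]  # hypothetical out-of-order finish
print(commit_in_order(program_order, completion_order))
# [[], ['u1', 'u2'], [], ['u3', 'u4']]
```

Even though u2 and u4 finish early, nothing reaches the register set until the micro-ops ahead of them are done, so the visible state always follows the original program order.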
31. Pentium 4 Pipeline
32. Pentium 4 Pipeline Operation (1)
33. Pentium 4 Pipeline Operation (2)
34. Pentium 4 Pipeline Operation (3)
35. Pentium 4 Pipeline Operation (4)
36. Pentium 4 Pipeline Operation (5)
37. Pentium 4 Pipeline Operation (6)
38. PowerPC
- Direct descendant of IBM 801, RT PC and RS/6000
- All are RISC
- RS/6000 first superscalar
- PowerPC 601 superscalar design similar to RS/6000
- Later versions extend superscalar concept
39. PowerPC 601 General View
40. PowerPC 601 Pipeline Structure
41. PowerPC 601 Pipeline