Title: Cosc 2150
1Cosc 2150
- Chapter 9 a
- Instruction Level Parallelism
- and Superscalar Processors
2Introduction
- Before we can look at the different methods used to increase the speed of processors
- We need to take a closer look at the fetch/execute cycle
3Micro-Operations
- A computer executes a program
- Fetch/Execute cycle
- Each cycle has a number of steps
- see pipelining
- Called micro-operations
- Each step does very little
- Atomic operation of CPU
4Constituent Elements of Program Execution
5Fetch - 4 Registers
- Memory Address Register (MAR)
- Connected to address bus
- Specifies address for read or write op
- Memory Buffer Register (MBR)
- Connected to data bus
- Holds data to write or last data read
- Program Counter (PC)
- Holds address of next instruction to be fetched
- Instruction Register (IR)
- Holds last instruction fetched
6Fetch Sequence
- Address of next instruction is in PC
- Address (MAR) is placed on address bus
- Control unit issues READ command
- Result (data from memory) appears on data bus
- Data from data bus copied into MBR
- PC incremented by 1 (in parallel with data fetch from memory)
- Data (instruction) moved from MBR to IR
- MBR is now free for further data fetches
7Fetch Sequence (symbolic)
- (tx = time unit/clock cycle)
- t1: MAR <- (PC)
- t2: MBR <- (memory)
- PC <- (PC) + 1
- t3: IR <- (MBR)
- or
- t1: MAR <- (PC)
- t2: MBR <- (memory)
- t3: PC <- (PC) + 1
- IR <- (MBR)
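The first grouping of the fetch sequence can be sketched in Python. This is a minimal illustration, assuming the CPU state is a plain dict and memory is a dict from address to instruction; the function name `fetch` and the sample memory contents are hypothetical.

```python
def fetch(cpu, memory):
    """Run the three fetch time units on a dict-based CPU state (illustrative only)."""
    # t1: MAR <- (PC)
    cpu["MAR"] = cpu["PC"]
    # t2: MBR <- (memory); PC <- (PC) + 1
    # These two micro-operations touch different registers, so they share a time unit.
    cpu["MBR"] = memory[cpu["MAR"]]
    cpu["PC"] = cpu["PC"] + 1
    # t3: IR <- (MBR)
    cpu["IR"] = cpu["MBR"]
    return cpu


# Hypothetical memory image: two instructions at addresses 100 and 101.
memory = {100: "ADD R1,X", 101: "BSA 200"}
cpu = fetch({"PC": 100, "MAR": None, "MBR": None, "IR": None}, memory)
print(cpu)
```

After the call, IR holds the instruction that was at the old PC, and PC already points at the next instruction.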
8Rules for Clock Cycle Grouping
- Proper sequence must be followed
- MAR <- (PC) must precede MBR <- (memory)
- Conflicts must be avoided
- Must not read and write the same register at the same time
- MBR <- (memory) and IR <- (MBR) must not be in the same cycle
- Also PC <- (PC) + 1 involves addition
- Use ALU
- May need additional micro-operations
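The grouping rules can be checked mechanically. The sketch below assumes each micro-operation is described as a hypothetical `(destination, sources)` pair, and that a memory read is modelled as reading MAR. It reports registers that one micro-operation writes while another micro-operation in the same cycle reads or writes them.

```python
def conflicts(group):
    """Return registers written by one micro-op and read or written by another in the same cycle."""
    clash = set()
    for i, (dest_i, _) in enumerate(group):
        for j, (dest_j, srcs_j) in enumerate(group):
            if i == j:
                continue  # a micro-op may read and write its own register (e.g. PC + 1)
            if dest_i in srcs_j or dest_i == dest_j:
                clash.add(dest_i)
    return clash


# Micro-operations of the fetch cycle as (destination, sources); the encoding is an assumption.
MAR_PC = ("MAR", ["PC"])
MBR_MEM = ("MBR", ["memory", "MAR"])
PC_INC = ("PC", ["PC"])
IR_MBR = ("IR", ["MBR"])

print(conflicts([MBR_MEM, PC_INC]))   # no shared register
print(conflicts([MBR_MEM, IR_MBR]))   # MBR is written and read in one cycle
print(conflicts([MAR_PC, MBR_MEM]))   # MAR must be loaded before the memory read
```

An empty result means the micro-operations may share a clock cycle under these two rules; it does not model the ALU being busy with the PC increment.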
9Indirect Cycle
- MAR <- (IR(address)) - address field of IR
- MBR <- (memory)
- IR(address) <- (MBR(address))
- MBR contains an address
- IR is now in same state as if direct addressing
had been used
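A small sketch of the indirect cycle, assuming the IR is modelled as a dict with hypothetical `opcode` and `address` fields and that memory simply stores the operand's address at the pointer location:

```python
def indirect(cpu, memory):
    """Replace the indirect address in IR with the address it points to."""
    # MAR <- (IR(address))
    cpu["MAR"] = cpu["IR"]["address"]
    # MBR <- (memory)
    cpu["MBR"] = memory[cpu["MAR"]]
    # IR(address) <- (MBR(address))
    cpu["IR"]["address"] = cpu["MBR"]
    return cpu


# Hypothetical layout: location 50 holds the operand's real address (300).
memory = {50: 300, 300: 7}
cpu = {"IR": {"opcode": "ADD", "address": 50}, "MAR": None, "MBR": None}
indirect(cpu, memory)
print(cpu["IR"])
```

The opcode is untouched; only the address field changes, so the execute cycle can proceed as if direct addressing had been used.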
10Interrupt Cycle
- t1: MBR <- (PC)
- t2: MAR <- save-address
- PC <- routine-address
- t3: memory <- (MBR)
- This is a minimum
- May be additional micro-ops to get addresses
- N.B. saving context is done by interrupt handler
routine, not micro-ops
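The minimum interrupt cycle can be sketched the same way. The values chosen for `SAVE_ADDRESS` and `ROUTINE_ADDRESS` are hypothetical; a real processor obtains them by its own mechanism, possibly with extra micro-operations.

```python
SAVE_ADDRESS = 0       # hypothetical location where the old PC is stored
ROUTINE_ADDRESS = 500  # hypothetical start of the interrupt handler


def interrupt(cpu, memory):
    """Save the return address and redirect the PC to the handler."""
    # t1: MBR <- (PC)
    cpu["MBR"] = cpu["PC"]
    # t2: MAR <- save-address; PC <- routine-address
    cpu["MAR"] = SAVE_ADDRESS
    cpu["PC"] = ROUTINE_ADDRESS
    # t3: memory <- (MBR)
    memory[cpu["MAR"]] = cpu["MBR"]
    return cpu


memory = {}
cpu = interrupt({"PC": 101, "MAR": None, "MBR": None}, memory)
print(cpu["PC"], memory)
```

Only the PC is saved here; any further context is left to the handler routine, as the slide notes.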
11Execute Cycle
- Different for each instruction
- In general, complete the task of the instruction
- Example
- ADD R1,X - add the contents of location X to Register 1, result in R1
- t1: MAR <- (IR(address))
- t2: MBR <- (memory)
- t3: R1 <- R1 + (MBR)
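As an illustration, the execute cycle for ADD R1,X might look like this in Python, assuming a dict-based CPU state and a hypothetical operand address X = 300:

```python
def execute_add(cpu, memory):
    """Execute ADD R1,X using the three time units from the slide."""
    # t1: MAR <- (IR(address))
    cpu["MAR"] = cpu["IR"]["address"]
    # t2: MBR <- (memory)
    cpu["MBR"] = memory[cpu["MAR"]]
    # t3: R1 <- R1 + (MBR)
    cpu["R1"] = cpu["R1"] + cpu["MBR"]
    return cpu


memory = {300: 7}  # hypothetical: location X = 300 holds 7
cpu = {"IR": {"opcode": "ADD", "address": 300}, "R1": 5, "MAR": None, "MBR": None}
execute_add(cpu, memory)
print(cpu["R1"])
```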
12Execute Cycle (BSA)
- BSA X - Branch and save address
- Address of instruction following BSA is saved in X
- Execution continues from X+1
- t1: MAR <- (IR(address))
- MBR <- (PC)
- t2: PC <- (IR(address))
- memory <- (MBR)
- t3: PC <- (PC) + 1
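A sketch of the BSA execute cycle, assuming the fetch cycle has already advanced the PC past the BSA instruction. The addresses 101 and 200 are hypothetical.

```python
def execute_bsa(cpu, memory):
    """Execute BSA X: save the return address at X, continue at X + 1."""
    # t1: MAR <- (IR(address)); MBR <- (PC)
    cpu["MAR"] = cpu["IR"]["address"]
    cpu["MBR"] = cpu["PC"]
    # t2: PC <- (IR(address)); memory <- (MBR)
    cpu["PC"] = cpu["IR"]["address"]
    memory[cpu["MAR"]] = cpu["MBR"]
    # t3: PC <- (PC) + 1
    cpu["PC"] = cpu["PC"] + 1
    return cpu


memory = {}
# PC = 101: the BSA instruction itself was fetched from address 100.
cpu = {"IR": {"opcode": "BSA", "address": 200}, "PC": 101, "MAR": None, "MBR": None}
execute_bsa(cpu, memory)
print(cpu["PC"], memory)
```

The subroutine can later return by branching to the address stored at X.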
13Instruction Cycle
- Each phase decomposed into sequence of elementary micro-operations
- E.g. fetch, indirect, and interrupt cycles
- Execute cycle
- One sequence of micro-operations for each opcode
- Need to tie sequences together
- Assume new 2-bit register
- Instruction cycle code (ICC) designates which part of cycle processor is in
- 00 Fetch
- 01 Indirect
- 10 Execute
- 11 Interrupt
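One way to picture how the ICC ties the sequences together is a small state-transition function. The transition rules below are an assumption based on the usual flow (fetch, optional indirect, execute, optional interrupt); the slide only gives the four codes.

```python
FETCH, INDIRECT, EXECUTE, INTERRUPT = 0b00, 0b01, 0b10, 0b11


def next_icc(icc, indirect_addressing=False, interrupt_pending=False):
    """Return the ICC value for the next phase (assumed transition rules)."""
    if icc == FETCH:
        return INDIRECT if indirect_addressing else EXECUTE
    if icc == INDIRECT:
        return EXECUTE
    if icc == EXECUTE:
        return INTERRUPT if interrupt_pending else FETCH
    return FETCH  # after the interrupt cycle, fetch the handler's first instruction


# Walk one instruction that uses indirect addressing and is followed by an interrupt.
icc = FETCH
trace = [icc]
for _ in range(4):
    icc = next_icc(icc, indirect_addressing=True, interrupt_pending=True)
    trace.append(icc)
print([format(code, "02b") for code in trace])
```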
14What is Superscalar?
- Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
- Equally applicable to RISC & CISC
- In practice usually RISC
15General Superscalar Organization
16Superpipelined
- Many pipeline stages need less than half a clock cycle
- Double internal clock speed gets two tasks per external clock cycle
- Superscalar allows parallel fetch and execute
17Superscalar v Superpipeline
18Limitations
- Instruction level parallelism
- Compiler based optimisation
- Hardware techniques
- Limited by
- True data dependency
- Procedural dependency
- Resource conflicts
- Output dependency
- Antidependency
19True Data Dependency
- ADD r1, r2 (r1 := r1 + r2)
- MOVE r3, r1 (r3 := r1)
- Can fetch and decode second instruction in parallel with first
- Can NOT execute second instruction until first is finished
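A true data dependency can be detected by comparing the register one instruction writes with the registers the next one reads. The sketch below assumes a hypothetical `(destination, sources)` encoding of each instruction.

```python
def true_dependency(first, second):
    """Return True if `second` reads the register that `first` writes."""
    dest, _ = first
    _, sources = second
    return dest in sources


# Instructions encoded as (destination, sources); the encoding is an assumption.
add = ("r1", ["r1", "r2"])   # ADD r1, r2   (r1 := r1 + r2)
move = ("r3", ["r1"])        # MOVE r3, r1  (r3 := r1)

print(true_dependency(add, move))
```

Here MOVE reads r1, which ADD writes, so MOVE cannot execute until ADD has produced its result.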
20Procedural Dependency
- Can not execute instructions after a branch in parallel with instructions before a branch
- Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
- This prevents simultaneous fetches
21Resource Conflict
- Two or more instructions requiring access to the same resource at the same time
- e.g. two arithmetic instructions
- Can duplicate resources
- e.g. have two arithmetic units
22Effect of Dependencies
23Design Issues
- Instruction level parallelism
- Instructions in a sequence are independent
- Execution can be overlapped
- Governed by data and procedural dependency
- Machine Parallelism
- Ability to take advantage of instruction level parallelism
- Governed by number of parallel pipelines
24Instruction Issue Policy
- Order in which instructions are fetched
- Order in which instructions are executed
- Order in which instructions change registers and
memory
25In-Order Issue In-Order Completion
- Issue instructions in the order they occur
- Not very efficient
- May fetch >1 instruction
- Instructions must stall if necessary
26In-Order Issue In-Order Completion (Diagram)
27In-Order Issue Out-of-Order Completion
- Output dependency
- R3 := R3 + R5 (I1)
- R4 := R3 + 1 (I2)
- R3 := R5 + 1 (I3)
- I2 depends on result of I1 - data dependency
- If I3 completes before I1, I1's later write leaves the wrong value in R3 - output (write-write) dependency
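The effect of an output dependency can be shown by applying the two writes to R3 in different completion orders. The initial register values are hypothetical, and the sketch assumes each result was computed from the correct operands, so only the write order differs.

```python
regs = {"R3": 10, "R5": 2}  # hypothetical starting values

# Results computed with the operands each instruction should see in program order.
results = {
    "I1": regs["R3"] + regs["R5"],  # R3 := R3 + R5
    "I3": regs["R5"] + 1,           # R3 := R5 + 1
}


def final_r3(write_order):
    """Apply the writes to R3 in the given completion order and return the final value."""
    r3 = regs["R3"]
    for name in write_order:
        r3 = results[name]
    return r3


in_order = final_r3(["I1", "I3"])      # program order: I3's value survives
out_of_order = final_r3(["I3", "I1"])  # I3 completes first: I1 overwrites it
print(in_order, out_of_order)
```

Later instructions expect the value written by I3, so the second ordering leaves R3 in the wrong state.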
28In-Order Issue Out-of-Order Completion (Diagram)
29Out-of-Order Issue Out-of-Order Completion
- Decouple decode pipeline from execution pipeline
- Can continue to fetch and decode until this pipeline is full
- When a functional unit becomes available an instruction can be executed
- Since instructions have been decoded, processor can look ahead
30Out-of-Order Issue Out-of-Order Completion (Diagram)
31Antidependency
- Write-after-read (read-write) dependency
- R3 := R3 + R5 (I1)
- R4 := R3 + 1 (I2)
- R3 := R5 + 1 (I3)
- R7 := R3 + R4 (I4)
- I3 can not complete before I2 starts as I2 needs a value in R3 and I3 changes R3
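The three register dependency types can be told apart with a pairwise check. This is a simplified sketch, assuming a hypothetical `(destination, sources)` encoding with immediate operands omitted. It compares two instructions in isolation and ignores any instruction between them that redefines a register.

```python
def classify(earlier, later):
    """List the dependencies of `later` on `earlier` (pairwise, simplified)."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    kinds = []
    if e_dest in l_srcs:
        kinds.append("true (read-after-write)")
    if e_dest == l_dest:
        kinds.append("output (write-after-write)")
    if l_dest in e_srcs:
        kinds.append("anti (write-after-read)")
    return kinds


# The four instructions from the slide; the constant 1 is not a register and is left out.
program = {
    "I1": ("R3", ["R3", "R5"]),
    "I2": ("R4", ["R3"]),
    "I3": ("R3", ["R5"]),
    "I4": ("R7", ["R3", "R4"]),
}

for a, b in [("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I3", "I4")]:
    print(a, b, classify(program[a], program[b]))
```

The I2/I3 pair is the antidependency from the slide: I3 writes R3, which I2 still has to read.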
32Register Renaming
- Output and antidependencies occur because register contents may not reflect the correct ordering from the program
- May result in a pipeline stall
- Registers allocated dynamically
- i.e. registers are not specifically named
33Register Renaming example
- R3b := R3a + R5a (I1)
- R4b := R3b + 1 (I2)
- R3c := R5a + 1 (I3)
- R7b := R3c + R4b (I4)
- Without subscript refers to logical register in instruction
- With subscript refers to the hardware register allocated
- Note R3a, R3b, R3c are different hardware registers
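The renaming in this example can be reproduced with a short sketch. It assumes a hypothetical `(destination, sources)` encoding and a letter-suffix naming scheme: every write allocates the next hardware copy of the logical register, and every read uses the most recent copy. Real hardware typically draws from a pool of free physical registers instead.

```python
def rename(program):
    """Rename registers so that each write gets a fresh hardware register."""
    version = {}  # logical register -> index of its current hardware copy (0 = 'a')

    def current(reg):
        return reg + chr(ord("a") + version.setdefault(reg, 0))

    renamed = []
    for dest, srcs in program:
        new_srcs = [current(s) for s in srcs]  # read the latest copies first
        version[dest] = version.setdefault(dest, 0) + 1  # then allocate a new copy
        renamed.append((current(dest), new_srcs))
    return renamed


# I1..I4 from the slide; the constant 1 is not a register and is left out.
program = [
    ("R3", ["R3", "R5"]),
    ("R4", ["R3"]),
    ("R3", ["R5"]),
    ("R7", ["R3", "R4"]),
]
renamed = rename(program)
for dest, srcs in renamed:
    print(dest, ":=", srcs)
```

After renaming, I3 writes R3c rather than the register I2 reads (R3b), so the antidependency and output dependency on R3 disappear and only true dependencies remain.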
34Speedups of Machine Organizations Without Procedural Dependencies
35Machine Parallelism
- Duplication of Resources
- Out of order issue
- Renaming
- Not worth duplicating functions without register renaming
- Need instruction window large enough (more than 8)
36Superscalar Implementation
- Simultaneously fetch multiple instructions
- Logic to determine true dependencies involving register values
- Mechanisms to communicate these values
- Mechanisms to initiate multiple instructions in parallel
- Resources for parallel execution of multiple instructions
- Mechanisms for committing process state in correct order
37Q & A