Title: Lecture 2: Processor and Pipelining
1Graduate Computer Architecture I
- Lecture 2 Processor and Pipelining
- Young Cho
2Instruction Set Architecture
- Set of Elementary Commands
- Good ISA
- CONVENIENT: functionality to higher levels
- EFFICIENT: functionality to higher levels
- GENERAL: Used in many different ways
- PORTABLE: Lasts through many generations
- Points of View
- Provides different HW and SW interface
3Processors Today
- General Purpose Register
- Split of CISC and RISC
- Development of RISC
- Very Complex Design
- No longer REDUCED
- Rapid Technology Advancements
4Performance Definition
- Performance is in units of things per second
- bigger is better
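- If we are primarily concerned with response time
- Performance(X) = 1 / ExecutionTime(X)
- "X is n times faster than Y" means
- n = Performance(X) / Performance(Y) = ExecutionTime(Y) / ExecutionTime(X)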
5Amdahl's Law
Best you could ever hope to do
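The slide's formula did not survive extraction. As a minimal sketch, the usual statement of Amdahl's Law is written out below in Python. The function names and the 40% / 10x figures are illustrative assumptions, not values from the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of execution time is improved."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced: float) -> float:
    """Best you could ever hope to do: the enhanced part takes no time at all."""
    if fraction_enhanced >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    # Hypothetical case: 40% of the run time is sped up 10x.
    print(f"speedup = {amdahl_speedup(0.4, 10.0):.4f}")
    print(f"limit   = {amdahl_limit(0.4):.4f}")
```

Even an infinitely fast enhancement cannot beat the limit, because the untouched 60% of the run time remains.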
6Semester Schedule Review
- Basic Architecture Organization Weeks 2-4
- Processors and Pipelining Week 2
- Memory Hierarchy and Cache Design Week 3
- Hazards and Predictions Week 4
- Quiz 1 Week 4
- Quantitative Approach Weeks 5-10
- Instruction Level Parallelism Weeks 5-6
- Vector and Multi-Processors Weeks 7-8
- Storage and I/O Week 9
- Interconnects and Clustering Week 10
- Quiz 2 Week 6
- Quiz 3 Week 9
- Advanced Topics Weeks 11-15
- Network Processors Week 11
- Reconfigurable Devices and SoC Week 12
- Low Power Hardware and Techniques Week 12
- HW and SW Co-design Week 13
- Other Topics Weeks 14-15
- Quiz 4 Week 11
- Course Web Site
- http://www.arl.wustl.edu/young/cse560m
- Xilinx Tools
- May use Urbauer Room 116 Computers
- Accounts will be available
- ISE Version 7.1 and Modelsim 6.0a
- http://direct.xilinx.com/direct/webpack/71/WebPACK_71_fcfull_i.exe
- http://direct.xilinx.com/direct/webpack/71/MXE_6.0a_Full_installer.exe
- Prerequisite Course Text
- (Optional) D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Third Edition.
- Quizzes A and B
- For your own benefit
- May need prerequisite course text but not necessary
- Look for answers on the WWW
- Project
- Groups of 2-3 by Thursday
- Weeks 1-5 Pipelined 32bit Processor
- Build on top of the basic Processor afterwards
- Lectures at Urbauer Room 116 (project check
8Fabricated IC Costs
9Traditional CISC and RISC
- Reduced Instruction Set Computer
- Smaller Design Footprint -> Reduced Cost
- Essential Set of Instructions
- Intuitively Larger Program
- Complex Instruction Set Computer
- Complex set of desired Instructions
- Pack many functions in one Instruction
- Compact Program Memory WAS Expensive
- RISC a better fit for the Changes
- Cheaper Memory
- Shorter Critical Path Fast Clock Cycles
- CISC Chips Integrated the RISC Concepts
- Better Compilers
- RISC!? of Today
- Very Complex and Large set of Instructions
- The original motivation cannot be seen
- High Performance and Throughput
A lot of H and V-Lines
10Real Performance Measurement
CPU time is the REAL measure of computer
performance. NOT Clock rate and NOT CPI
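A short sketch of why neither clock rate nor CPI alone settles the question: CPU time is the product of instruction count, CPI, and clock cycle time. The two machines below are hypothetical and chosen only to show a slower clock winning on CPU time.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count x cycles per instruction x seconds per cycle."""
    return instruction_count * cpi / clock_rate_hz


if __name__ == "__main__":
    # Hypothetical machines running the same 1e9-instruction program.
    fast_clock = cpu_time_seconds(1e9, cpi=2.0, clock_rate_hz=1.0e9)   # 1 GHz, CPI 2.0
    slow_clock = cpu_time_seconds(1e9, cpi=1.2, clock_rate_hz=0.8e9)   # 800 MHz, CPI 1.2
    print(f"1 GHz machine:   {fast_clock:.2f} s")
    print(f"800 MHz machine: {slow_clock:.2f} s")
```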
11Cycles Per Instruction
Average Cycles per Instruction
CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count
Instruction Frequency
12Calculating CPI
Run benchmark and collect workload
characterization (simulate, machine counters, or
Base Machine (Reg / Reg)
Op      Freq  Cycles  CPI(i)  (% Time)
ALU     50%   1       0.5     (33%)
Load    20%   2       0.4     (27%)
Store   10%   2       0.2     (13%)
Branch  20%   2       0.4     (27%)
Total                 1.5
Typical Mix of instruction types in program
Design guideline: Make the common case fast.
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
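The CPI arithmetic for the base machine can be checked with a few lines of Python. This is a sketch using the instruction mix from the slide; the names `MIX`, `average_cpi`, and `time_share` are illustrative.

```python
# Instruction mix from the slide: op -> (frequency, cycles per instruction)
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """Weighted average: sum over instruction classes of frequency x cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """Fraction of total execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


if __name__ == "__main__":
    print(f"CPI = {average_cpi(MIX):.2f}")
    for op, share in time_share(MIX).items():
        print(f"{op:<7} {share:.0%} of time")
```

ALU operations are half of all instructions but only a third of the time, which is why the time column, not the frequency column, shows where an improvement pays off.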
13Impact of Stalls
- Assume CPI = 1.0 ignoring branches (ideal)
- Assume solution was stalling for 3 cycles
- If 30% branch, stall 3 cycles on 30%
Op      Freq  Cycles  CPI(i)  (% Time)
Other   70%   1       0.7     (37%)
Branch  30%   4       1.2     (63%)
- => new CPI = 1.9
- The machine runs at 1/1.9 = 0.52 times the ideal speed
- Far from ideal
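The stall arithmetic above can be sketched as follows. The function name is illustrative; the 30% branch frequency and 3-cycle stall are the slide's numbers.

```python
def cpi_with_branch_stalls(base_cpi: float, branch_freq: float, stall_cycles: int) -> float:
    """Each branch pays stall_cycles extra cycles on top of the base CPI."""
    return base_cpi + branch_freq * stall_cycles


if __name__ == "__main__":
    new_cpi = cpi_with_branch_stalls(base_cpi=1.0, branch_freq=0.30, stall_cycles=3)
    relative_speed = 1.0 / new_cpi  # speed relative to the ideal CPI of 1.0
    print(f"new CPI = {new_cpi:.1f}")
    print(f"relative speed = {relative_speed:.2f}")
```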
14Instruction Set Architecture Design
- Definition
- Set of Operations
- Instruction Format
- Hardware Data Types
- Named Storage
- Addressing Modes and Sequencing
- Description in Register Transfer Language
- Intermediate Representation
- Map Instruction to RTLs
- Technology Constraint Considerations
- Architected storage mapped to actual storage
- Function units to do all the required operations
- Possible additional storage (e.g. MAddressR, MBufferR, ...)
- Interconnect to move information among regs and FUs
- Controller
- Sequences into symbolic controller state transition diagram (STD)
- Lower symbolic STD to control points
- Controller Implementation
15Typical Load/Store Processor
16Instruction Format
General instruction format
4 bits remaining 28 bits vary
according to instruction type
R-type instruction
I-type instruction
J-type instruction
17Instruction Type Datapath
R-type instructions
I-type instructions
J-type instructions
18Cloth Washing Process
30 minutes
35 minutes
25 minutes
One set of Clothes in 1 Hour 30 minutes
19Pipelining Laundry
30 minutes
35 minutes
35 minutes
35 minutes
25 minutes
53 min/set
Ideally a 3X Increase in Productivity!!!
With a large number of sets, each set takes an average of 35 minutes
Three sets of Clean Clothes in 2 hours 40 minutes
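The laundry timings can be reproduced with a small simulation. This is a sketch that assumes each stage handles one set at a time and that a finished set can wait between stages. The stage durations are the slide's 30, 35, and 25 minutes; the function names are illustrative.

```python
STAGES = [30, 35, 25]  # minutes per stage, taken from the slide


def sequential_time(stages, n_sets):
    """No overlap: every set goes through all stages before the next set starts."""
    return sum(stages) * n_sets


def pipelined_time(stages, n_sets):
    """Overlap sets; a stage starts a set once both the set and the stage are free."""
    free_at = [0] * len(stages)  # time at which each stage becomes free
    for _ in range(n_sets):
        ready = 0  # time at which the current set is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, free_at[s])
            ready = start + duration
            free_at[s] = ready
    return free_at[-1] if n_sets else 0


if __name__ == "__main__":
    for n in (1, 3, 100):
        total = pipelined_time(STAGES, n)
        print(f"{n:>3} sets: {sequential_time(STAGES, n)} min sequential, "
              f"{total} min pipelined, {total / n:.1f} min/set")
```

With three sets the pipelined total is 160 minutes (about 53 minutes per set), and as the number of sets grows the average approaches the 35 minutes of the slowest stage.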
20Introducing Problems
- Hazards prevent next instruction from executing during its designated clock cycle
- Structural hazards: HW cannot support this combination of instructions (single person to dry and iron clothes simultaneously)
- Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock needs both before putting them away)
- Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
21One Memory Port/Structural Hazards
[Figure: pipeline diagram; horizontal axis Time (clock cycles), Cycle 1 to Cycle 7; vertical axis Instr. Order, down to Instr 3]
Instruction Fetch as well as Load from Memory
22Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1
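The equation itself did not survive on this slide. The form below is a reconstruction, inferred from how the next slide applies it (pipeline depth divided by one plus the stall cycles per instruction, scaled by the clock ratio). Treat it as a sketch under the ideal-CPI-of-1 assumption stated above.

```latex
\text{Speedup}
  = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}}
    \times
    \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
```

With no stalls and equal clocks, the speedup equals the pipeline depth.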
23Memory and Pipeline
- Machine A: Dual ported memory
- Machine B: Single ported memory
- 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed
- FreqRatio = Clockunpipe / Clockpipe
- SpeedUpA = (Pipeline Depth / (1 + 0)) x FreqRatio = Pipeline Depth x FreqRatio
- SpeedUpB = (Pipeline Depth / (1 + 0.4 x 1)) x FreqRatio x 1.05 = Pipeline Depth x 0.75 x FreqRatio
- SpeedUpA / SpeedUpB = 1.33
- Machine A is 1.33 times faster
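The comparison above can be checked numerically. In this sketch the pipeline depth of 5 and FreqRatio of 1.0 are placeholder assumptions; both cancel in the ratio, so any positive values give the same answer.

```python
def pipeline_speedup(depth, stall_cpi, freq_ratio=1.0, clock_boost=1.0):
    """Speedup = depth / (1 + stall cycles per instruction) x clock ratios."""
    return depth / (1.0 + stall_cpi) * freq_ratio * clock_boost


if __name__ == "__main__":
    DEPTH = 5          # assumed; cancels out of the A/B ratio
    LOAD_FREQ = 0.40   # loads are 40% of instructions (from the slide)

    # Machine A: dual-ported memory, so loads never collide with instruction fetch.
    speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)
    # Machine B: single port, one stall cycle per load, but a 1.05x faster clock.
    speedup_b = pipeline_speedup(DEPTH, stall_cpi=LOAD_FREQ * 1, clock_boost=1.05)

    print(f"SpeedUpA / SpeedUpB = {speedup_a / speedup_b:.2f}")
```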
24Data Hazard on r1
Time (clock cycles)
25Data Hazards
- Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
- Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
I: add r1,r2,r3
J: sub r4,r1,r3
26Data Hazards
- Write After Read (WAR): InstrJ writes operand before InstrI reads it
- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in DLX 5 stage pipeline because
- All instructions take 5 stages, and
- Reads are always in stage 2, and
- Writes are always in stage 5
27Data Hazards
- Write After Write (WAW): InstrJ writes operand before InstrI writes it
- Called an "output dependence" by compiler writers
- This also results from the reuse of name "r1".
- Can't happen in DLX 5 stage pipeline because
- All instructions take 5 stages, and
- Writes are always in stage 5
- Will see WAR and WAW in complicated pipelines
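The three definitions can be expressed as a small register-name check between an earlier instruction I and a later instruction J. This is a sketch: the `Instr` record and `dependences` function are illustrative, and the WAR and WAW instruction pairs are made-up examples (only the RAW pair comes from the slides). The check reports potential hazards; whether one actually occurs depends on the pipeline, as noted above for DLX.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    text: str               # assembly text, for display only
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Classify dependences from earlier instruction i to later instruction j."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")   # j reads what i writes (true dependence)
    if j.dest in i.srcs:
        found.add("WAR")   # j overwrites what i still has to read (anti-dependence)
    if i.dest == j.dest:
        found.add("WAW")   # both write the same register (output dependence)
    return found


if __name__ == "__main__":
    add = Instr("add r1,r2,r3", "r1", ("r2", "r3"))
    sub = Instr("sub r4,r1,r3", "r4", ("r1", "r3"))
    print(dependences(add, sub))   # the RAW pair from the slide
    print(dependences(sub, add))   # same instructions in the other order: WAR
    print(dependences(add, Instr("sub r1,r4,r3", "r1", ("r4", "r3"))))  # WAW
```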
28Solution Data Forwarding
Time (clock cycles)
29HW Change for Forwarding
Data Memory
30Data Hazard Even with Forwarding
Time (clock cycles)
Instr. Order:
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
31Software Scheduling
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
- Slow code
- LW Rb,b
- LW Rc,c
- ADD Ra,Rb,Rc
- SW a,Ra
- LW Re,e
- LW Rf,f
- SUB Rd,Re,Rf
- SW d,Rd
- Fast code
- LW Rb,b
- LW Rc,c
- LW Re,e
- ADD Ra,Rb,Rc
- LW Rf,f
- SW a,Ra
- SUB Rd,Re,Rf
- SW d,Rd
Compiler optimizes for performance. Hardware
checks for safety.
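To see why the reordered sequence is faster, the sketch below counts load-use stalls in both versions. It assumes a simplified model: a 5-stage pipeline with forwarding, where an instruction stalls one cycle only if it reads the register loaded by the instruction immediately before it. The parser handles just the four opcodes used on the slide, and all helper names are illustrative.

```python
SLOW = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]

FAST = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]


def parse(line):
    """Split 'OP x,y,z' into the opcode and its operand list."""
    op, operands = line.split(None, 1)
    return op, [field.strip() for field in operands.split(",")]


def registers_read(op, operands):
    """Registers an instruction reads, for the opcodes used in this example."""
    if op == "LW":
        return set()            # the address is a memory label here
    if op == "SW":
        return {operands[1]}    # store reads the register being stored
    return set(operands[1:])    # ADD / SUB read their two source registers


def count_load_use_stalls(program):
    """Count instructions that use the result of the load directly before them."""
    stalls = 0
    previous_load_dest = None
    for line in program:
        op, operands = parse(line)
        if previous_load_dest in registers_read(op, operands):
            stalls += 1
        previous_load_dest = operands[0] if op == "LW" else None
    return stalls


if __name__ == "__main__":
    print("slow code stalls:", count_load_use_stalls(SLOW))
    print("fast code stalls:", count_load_use_stalls(FAST))
```

Under this model the slow code stalls twice (ADD after `LW Rc`, SUB after `LW Rf`), while the fast code places an independent instruction after every load and does not stall.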
32Control Hazard on Branches
What do you do with the 3 instructions in
between? How do you do it? Where is the commit?
33Branch Hazard Alternatives
- Stall until branch direction is clear
- Predict Branch Not Taken
- Execute successor instructions in sequence
- "Squash" instructions in pipeline if branch actually taken
- Advantage of late pipeline state update
- 47% DLX branches not taken on average
- PC+4 already calculated, so use it to get next instr
- Predict Branch Taken
- 53% DLX branches taken on average
- DLX still incurs 1 cycle branch penalty
- Other machines: branch target known before outcome
34Branch Hazard Alternatives
- Delayed Branch
- Define branch to take place AFTER a following instruction (Fill in Branch Delay Slot)
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken
(Branch delay of length n)
- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
35Evaluating Branch Alternatives
Scheduling scheme   Branch penalty  CPI   Speedup v. unpipelined  Speedup v. stall
Stall pipeline      3               1.42  3.5                     1.0
Predict taken       1               1.14  4.4                     1.26
Predict not taken   1               1.09  4.5                     1.29
Delayed branch      0.5             1.07  4.6                     1.31
- Conditional & Unconditional = 14%, 65% change PC
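The CPI column follows from CPI = 1 + branch frequency x average branch penalty. The sketch below recomputes it from the slide's 14% branch frequency and 65% taken fraction. The pipeline depth of 5 used for the speedup over an unpipelined machine is an assumption inferred from the table (3.5 x 1.42 is about 5), and the names are illustrative.

```python
BRANCH_FREQ = 0.14      # conditional + unconditional branches (from the slide)
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC (from the slide)
PIPELINE_DEPTH = 5      # assumption inferred from the table, not stated on the slide

# Average penalty in cycles per branch for each scheme.
SCHEMES = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": 1.0 * TAKEN_FRACTION,  # pays only when the branch is taken
    "Delayed branch": 0.5,
}


def branch_cpi(avg_penalty, branch_freq=BRANCH_FREQ):
    """Ideal CPI of 1 plus the average branch penalty per instruction."""
    return 1.0 + branch_freq * avg_penalty


if __name__ == "__main__":
    stall_cpi = branch_cpi(SCHEMES["Stall pipeline"])
    for name, penalty in SCHEMES.items():
        cpi = branch_cpi(penalty)
        print(f"{name:<18} CPI {cpi:.2f}  "
              f"v. unpipelined {PIPELINE_DEPTH / cpi:.2f}  "
              f"v. stall {stall_cpi / cpi:.2f}")
```

The recomputed CPI values match the table. The speedup columns agree to within about 0.1, since the slide's figures appear to have been truncated or derived from already-rounded values.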
- Instruction Set Architecture
- Things to Consider when designing a new ISA
- Processor
- Concept behind Pipelining
- Five Stage Pipeline RISC
- Proper Processor Performance Evaluation
- Limitations of Pipelining
- Structural, Data, and Control Hazards
- Techniques to Recover Performance
- Re-evaluating Speed-ups