Title: Lecture 2: Processor and Pipelining
1Graduate Computer Architecture I
- Lecture 2 Processor and Pipelining
- Young Cho
2Instruction Set Architecture
- Set of Elementary Commands
- Good ISA
- CONVENIENT functionality to higher levels
- EFFICIENT functionality to higher levels
- GENERAL: used in many different ways
- PORTABLE: lasts through many generations
- Points of View
- Provides different HW and SW interface
3Processors Today
- General Purpose Register
- Split of CISC and RISC
- Development of RISC
- Very Complex Design
- No longer REDUCED
- Rapid Technology Advancements
4Performance Definition
- Performance is in units of things per second
- bigger is better
- If we are primarily concerned with response time, performance = 1 / execution time
- "X is n times faster than Y" means n = Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X)
5Amdahl's Law
Speedup_overall = 1 / ( (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
Best you could ever hope to do: Speedup_maximum = 1 / (1 - Fraction_enhanced)
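A quick numeric check of the formula above (a minimal Python sketch; the example numbers, 40% of the work sped up 10x, are assumptions for illustration, not from the slide):

# Amdahl's Law: overall speedup when only a fraction of the work is enhanced.
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.4, 10))   # ~1.56 for an assumed 10x speedup on 40% of the work
print(1.0 / (1.0 - 0.4))         # ~1.67: best you could ever hope to do for that fraction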
6Semester Schedule Review
- Basic Architecture Organization Weeks 2-4
- Processors and Pipelining Week 2
- Memory Hierarchy and Cache Design Week 3
- Hazards and Predictions Week 4
- Quiz 1 Week 4
- Quantitative Approach Weeks 5-10
- Instruction Level Parallelism Weeks 5-6
- Vector and Multi-Processors Weeks 7-8
- Storage and I/O Week 9
- Interconnects and Clustering Week 10
- Quiz 2 Week 6
- Quiz 3 Week 9
- Advanced Topics Weeks 11-15
- Network Processors Week 11
- Reconfigurable Devices and SoC Week 12
- Low Power Hardware and Techniques Week 12
- HW and SW Co-design Week 13
- Other Topics Weeks 14-15
- Quiz 4 Week 11
7Administrative
- Course Web Site
- http://www.arl.wustl.edu/young/cse560m
- Xilinx Tools
- May use Urbauer Room 116 Computers
- Accounts will be available
- ISE Version 7.1 and ModelSim 6.0a
- http://direct.xilinx.com/direct/webpack/71/WebPACK_71_fcfull_i.exe
- http://direct.xilinx.com/direct/webpack/71/MXE_6.0a_Full_installer.exe
- Prerequisite Course Text
- (Optional) D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Third Edition.
- Quizzes A and B
- For your own benefit
- May need prerequisite course text, but not necessary
- Look for answers on the WWW
- Project
- Groups of 2-3 by Thursday
- Weeks 1-5: Pipelined 32-bit Processor
- Build on top of the basic processor afterwards
- Lectures at Urbauer Room 116 (project checkpoints)
8Fabricated IC Costs
9Traditional CISC and RISC
- Reduced Instruction Set Computer
- Smaller Design Footprint -> Reduced Cost
- Essential Set of Instructions
- Intuitively Larger Program
- Complex Instruction Set Computer
- Complex set of desired Instructions
- Pack many functions in one Instruction
- Compact Programs: Memory WAS Expensive
- RISC a better fit for the Changes
- Cheaper Memory
- Shorter Critical Path -> Faster Clock Cycles
- CISC Chips Integrated the RISC Concepts
- Better Compilers
- RISC!? of Today
- Very Complex and Large set of Instructions
- The original motivation cannot be seen
- High Performance and Throughput
(Figure: H-Line, V-Line, Circle, and a shape built from a lot of H- and V-Lines)
10Real Performance Measurement
CPU Time = Instruction Count x CPI x Clock Cycle Time
CPU time is the REAL measure of computer performance: NOT clock rate and NOT CPI.
11Cycles Per Instruction
Average Cycles Per Instruction:
CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count
With per-type Instruction Frequencies F(i) = IC(i) / Instruction Count:
CPI = Sum over i of ( CPI(i) x F(i) )
12Calculating CPI
Run benchmark and collect workload characterization (simulate, machine counters, or sampling)
Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total CPI = 1.5
Typical mix of instruction types in a program
Design guideline: make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
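The table above, recomputed as a minimal Python sketch (the numbers are taken from the table; the dictionary layout is just for illustration):

# Weighted CPI = sum over instruction classes of (frequency x cycles per class).
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(cpi)   # 1.5

for op, (freq, cycles) in mix.items():
    print(op, round(100 * freq * cycles / cpi))   # % of time: ALU 33, Load 27, Store 13, Branch 27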
13Impact of Stalls
- Assume CPI = 1.0 ignoring branches (ideal)
- Assume the solution was stalling for 3 cycles
- If 30% of instructions are branches, stall 3 cycles on that 30%

Op       Freq   Cycles   CPI(i)   (% Time)
Other    70%    1        0.7      (37%)
Branch   30%    4        1.2      (63%)

- New CPI = 1.9
- The machine is 1/1.9 = 0.52 times as fast as the ideal machine
- Far from ideal
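The same arithmetic for the stall example, as a small sketch:

# Branches now take 1 base cycle plus 3 stall cycles.
cpi_stall = 0.70 * 1 + 0.30 * (1 + 3)   # 0.7 + 1.2 = 1.9
print(cpi_stall)                         # 1.9
print(1.0 / cpi_stall)                   # about 0.52: roughly half the speed of the ideal CPI = 1 machine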
14Instruction Set Architecture Design
- Definition
- Set of Operations
- Instruction Format
- Hardware Data Types
- Named Storage
- Addressing Modes and Sequencing
- Description in Register Transfer Language
- Intermediate Representation
- Map Instruction to RTLs
- Technology Constraint Considerations
- Architected storage mapped to actual storage
- Function units to do all the required operations
- Possible additional storage (e.g. MAddressR, MBufferR, ...)
- Interconnect to move information among regs and FUs
- Controller
- Sequences into symbolic controller state transition diagram (STD)
- Lower symbolic STD to control points
- Controller Implementation
15Typical Load/Store Processor
16Instruction Format
General instruction format: a 4-bit opcode; the remaining 28 bits vary according to instruction type.
(Format diagrams: R-type, I-type, and J-type instructions, each with some fields unused.)
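A minimal decode sketch for this format. Only the 4-bit opcode in the top bits follows from the slide; treating everything else as one opaque 28-bit field is an assumption for illustration:

# Split a 32-bit instruction word into its 4-bit opcode and the remaining 28 bits.
def decode(word):
    opcode = (word >> 28) & 0xF          # top 4 bits select the instruction
    rest   = word & 0x0FFFFFFF           # remaining 28 bits; layout depends on R/I/J type
    return opcode, rest

print([hex(f) for f in decode(0x1234ABCD)])   # ['0x1', '0x234abcd']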
17Instruction Type Datapath
R-type instructions
I-type instructions
J-type instructions
18Cloth Washing Process
(Figure: three laundry steps of 30, 35, and 25 minutes, done for one set at a time.)
One set of clothes in 1 hour 30 minutes.
19Pipelining Laundry
(Figure: overlapped laundry timeline of 30 + 35 + 35 + 35 + 25 minutes, about 53 min/set.)
Roughly a 3X increase in productivity!
With a large number of sets, each load takes an average of 35 minutes (the slowest step).
Three sets of clean clothes in 2 hours 40 minutes.
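The laundry numbers, checked with a small sketch (stage durations are from the slides; the sequential vs. pipelined totals follow the standard pipelining argument):

stages = [30, 35, 25]            # minutes per laundry step, from the slides
n = 3                            # sets of clothes
sequential = n * sum(stages)                      # 270 minutes, one set at a time
pipelined = sum(stages) + (n - 1) * max(stages)   # 90 + 2*35 = 160 minutes
print(sequential, pipelined, pipelined / n)       # 270 160 ~53 min/set; limit is 35 min/set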
20Introducing Problems
- Hazards prevent the next instruction from executing during its designated clock cycle
- Structural hazards: HW cannot support this combination of instructions (a single person cannot dry and iron clothes simultaneously)
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline (a missing sock: you need both before putting them away)
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (e.g. branch, jump)
21One Memory Port/Structural Hazards
(Pipeline diagram over clock cycles 1-7: the load's data-memory access and Instr 3's instruction fetch need the single memory port in the same cycle, so Instr 3 stalls.)
Instruction fetch as well as load from memory go through the one memory port.
22Speed Up Equation for Pipelining
For a simple RISC pipeline with ideal CPI = 1:
Speedup = ( Pipeline Depth / (1 + Pipeline stall cycles per instruction) ) x ( Clock Cycle unpipelined / Clock Cycle pipelined )
23Memory and Pipeline
- Machine A: dual ported memory
- Machine B: single ported memory
- 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed
- FreqRatio = Clock_unpipe / Clock_pipe
- SpeedUpA = (Pipeline Depth / (1 + 0)) x FreqRatio = Pipeline Depth x FreqRatio
- SpeedUpB = (Pipeline Depth / (1 + 0.4 x 1)) x FreqRatio x 1.05 = Pipeline Depth x 0.75 x FreqRatio
- SpeedUpA / SpeedUpB = 1.33
- Machine A is 1.33 times faster
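The comparison above, checked numerically (a small sketch; pipeline depth and FreqRatio cancel out of the ratio, so they are left out):

# Relative speedup = 1 / (1 + stall CPI) x clock scaling, per the pipelining speedup equation.
def rel_speedup(stall_cpi, clock_scale=1.0):
    return 1.0 / (1.0 + stall_cpi) * clock_scale

speedup_a = rel_speedup(0.0)              # dual ported memory: no structural stalls
speedup_b = rel_speedup(0.4 * 1, 1.05)    # loads (40%) stall 1 cycle, but 1.05x faster clock
print(speedup_a / speedup_b)              # ~1.33: Machine A is still faster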
24Data Hazard on r1
25Data Hazards
- Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
- Caused by a Dependence (in compiler nomenclature). This hazard results from an actual need for communication.

I: add r1,r2,r3
J: sub r4,r1,r3
26Data Hazards
- Write After Read (WAR): InstrJ writes an operand before InstrI reads it
- Called an anti-dependence by compiler writers. This results from reuse of the name r1.
- Can't happen in DLX 5 stage pipeline because
- All instructions take 5 stages, and
- Reads are always in stage 2, and
- Writes are always in stage 5
27Data Hazards
- Write After Write (WAW): InstrJ writes an operand before InstrI writes it
- Called an output dependence by compiler writers
- This also results from the reuse of the name r1
- Can't happen in DLX 5 stage pipeline because
- All instructions take 5 stages, and
- Writes are always in stage 5
- Will see WAR and WAW in more complicated pipelines
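A small sketch that classifies the three hazard types from the registers of two instructions; the (destination, sources) tuple encoding is just for illustration:

# I is the earlier instruction, J the later one; each is (dest_register, [source_registers]).
def hazards(instr_i, instr_j):
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    found = []
    if dest_i in srcs_j: found.append("RAW")   # J reads what I writes: true dependence
    if dest_j in srcs_i: found.append("WAR")   # J writes what I reads: anti-dependence
    if dest_i == dest_j: found.append("WAW")   # both write the same name: output dependence
    return found

# The RAW example from the slides: I add r1,r2,r3 then J sub r4,r1,r3.
print(hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))   # ['RAW']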
28Solution Data Forwarding
29HW Change for Forwarding
(Datapath diagram: NextPC, the register file, and the Immediate feed the ALU through muxes; additional mux inputs forward results from the EX/MEM and MEM/WR pipeline registers; ID/EX, EX/MEM, and MEM/WR latch values between stages, with Data Memory accessed after the ALU.)
30Data Hazard Even with Forwarding
(Pipeline diagram: lw r1, 0(r2) is followed by sub r4,r1,r6, and r6,r1,r7, and or r8,r1,r9; the loaded value is not available until the end of MEM, so even with forwarding a bubble must be inserted before the dependent sub can execute.)
31Software Scheduling
Try producing fast code for
  a = b + c
  d = e - f
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety.
32Control Hazard on Branches
What do you do with the 3 instructions in between? How do you do it? Where is the commit?
33Branch Hazard Alternatives
- Stall until branch direction is clear
- Predict Branch Not Taken
- Execute successor instructions in sequence
- Squash instructions in pipeline if branch actually taken
- Advantage of late pipeline state update
- 47% of DLX branches not taken on average
- PC+4 already calculated, so use it to get the next instruction
- Predict Branch Taken
- 53% of DLX branches taken on average
- DLX still incurs a 1 cycle branch penalty
- On other machines, the branch target may be known before the outcome
34Branch Hazard Alternatives
- Delayed Branch
- Define branch to take place AFTER a following instruction (fill in the Branch Delay Slot)

  branch instruction
  sequential successor 1
  sequential successor 2
  ........
  sequential successor n
  branch target if taken
  (Branch delay of length n)

- A 1 slot delay allows proper decision and branch target address in a 5 stage pipeline
35Evaluating Branch Alternatives
Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
Assumes conditional and unconditional branches are 14% of instructions, and 65% of them change the PC.
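Where the CPI column comes from, as a small sketch: CPI = 1 + branch frequency x effective penalty, using the 14% branch frequency and 65% taken rate from the slide. The effective penalties below are interpretations of the penalty column (e.g. predict-not-taken only pays its 1 cycle on taken branches):

branch_freq, taken = 0.14, 0.65
schemes = {
    "Stall pipeline":    3,            # always stall 3 cycles
    "Predict taken":     1,            # DLX pays 1 cycle whether or not the branch is taken
    "Predict not taken": taken * 1,    # only taken branches (65%) cost the 1 cycle
    "Delayed branch":    0.5,          # average penalty with the delay slot
}
for name, penalty in schemes.items():
    print(name, round(1 + branch_freq * penalty, 2))
# Stall pipeline 1.42, Predict taken 1.14, Predict not taken 1.09, Delayed branch 1.07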
36Conclusion
- Instruction Set Architecture
- Things to Consider when designing a new ISA
- Processor
- Concept behind Pipelining
- Five Stage Pipeline RISC
- Proper Processor Performance Evaluation
- Limitations of Pipelining
- Structural, Data, and Control Hazards
- Techniques to Recover Performance
- Re-evaluating Speed-ups