Lecture 2: Processor and Pipelining

Description:

R-type instruction: opcode, rs (bits 27-25), rt (bits 24-22), rd (bits 21-19), unused, funct (bits 3-0). I-type instruction: opcode, rs (bits 27-25), rt (bits 24-22), unused, imm16 (bits 15-0) ...

Provided by: Rand254

1
Graduate Computer Architecture I
  • Lecture 2 Processor and Pipelining
  • Young Cho

2
Instruction Set Architecture
  • Set of Elementary Commands
  • Good ISA
  • CONVENIENT: functionality to higher levels
  • EFFICIENT: functionality to higher levels
  • GENERAL: used in many different ways
  • PORTABLE: lasts through many generations
  • Points of View
  • Provides different HW and SW interface

3
Processors Today
  • General Purpose Register
  • Split of CISC and RISC
  • Development of RISC
  • Very Complex Design
  • No longer REDUCED
  • Rapid Technology Advancements

4
Performance Definition
  • Performance is in units of things per second
  • bigger is better
  • If we are primarily concerned with response time
  • "X is n times faster than Y" means
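The relation this bullet leads into did not survive in the transcript. The standard response-time definition, offered here as a likely reconstruction rather than the slide's exact wording, is shown below. The symbols ExTime and Performance are assumed names for execution time and its reciprocal.

```latex
\[
n \;=\; \frac{\text{ExTime}(Y)}{\text{ExTime}(X)}
  \;=\; \frac{\text{Performance}(X)}{\text{Performance}(Y)}
\]
```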

5
Amdahl's Law
Best you could ever hope to do
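The slide's equations are missing from the transcript. As a minimal sketch of the standard form of Amdahl's Law (the function and parameter names are illustrative, not taken from the lecture), the overall speedup and the "best you could ever hope to do" limit can be computed as:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of execution time is improved."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced: float) -> float:
    """Upper bound: the enhanced fraction takes no time at all."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if fraction_enhanced == 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    # Hypothetical numbers: 40% of the time is sped up by a factor of 10.
    print(amdahl_speedup(0.4, 10.0))
    print(amdahl_limit(0.4))
```

Even an infinitely fast enhancement cannot push the overall speedup past `amdahl_limit`, which is why the unenhanced fraction dominates.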
6
Semester Schedule Review
  • Basic Architecture Organization: Weeks 2-4
  • Processors and Pipelining: Week 2
  • Memory Hierarchy and Cache Design: Week 3
  • Hazards and Predictions: Week 4
  • Quiz 1: Week 4
  • Quantitative Approach: Weeks 5-10
  • Instruction-Level Parallelism: Weeks 5-6
  • Vector and Multi-Processors: Weeks 7-8
  • Storage and I/O: Week 9
  • Interconnects and Clustering: Week 10
  • Quiz 2: Week 6
  • Quiz 3: Week 9
  • Advanced Topics: Weeks 11-15
  • Network Processors: Week 11
  • Reconfigurable Devices and SoC: Week 12
  • Low Power Hardware and Techniques: Week 12
  • HW and SW Co-design: Week 13
  • Other Topics: Weeks 14-15
  • Quiz 4: Week 11

7
Administrative
  • Course Web Site
  • http://www.arl.wustl.edu/young/cse560m
  • Xilinx Tools
  • May use Urbauer Room 116 Computers
  • Accounts will be available
  • ISE Version 7.1 and Modelsim 6.0a
  • http://direct.xilinx.com/direct/webpack/71/WebPACK_71_fcfull_i.exe
  • http://direct.xilinx.com/direct/webpack/71/MXE_6.0a_Full_installer.exe
  • Prerequisite Course Text
  • (Optional) D. Patterson and J. Hennessy, Computer
    Organization and Design: The Hardware/Software
    Interface, Third Edition.
  • Quizzes A and B
  • For your own benefit
  • May need prerequisite course text but not
    necessary
  • Look for answers on the WWW
  • Project
  • Groups of 2-3 by Thursday
  • Weeks 1-5: Pipelined 32-bit Processor
  • Build on top of the basic Processor afterwards
  • Lectures at Urbauer Room 116 (project checkpoints)

8
Fabricated IC Costs

9
Traditional CISC and RISC
  • Reduced Instruction Set Computer
  • Smaller Design Footprint → Reduced Cost
  • Essential Set of Instructions
  • Intuitively Larger Program
  • Complex Instruction Set Computer
  • Complex set of desired Instructions
  • Pack many functions in one Instruction
  • Compact Program: Memory WAS Expensive
  • RISC a better fit for the Changes
  • Cheaper Memory
  • Shorter Critical Path: Fast Clock Cycles
  • CISC Chips Integrated the RISC Concepts
  • Better Compilers
  • RISC!? of Today
  • Very Complex and Large set of Instructions
  • The original motivation cannot be seen
  • High Performance and Throughput

[Figure: drawing-primitive analogy with the labels H-Line, V-Line, Circle, and "A lot of H and V-Lines"]
10
Real Performance Measurement
CPU time is the REAL measure of computer
performance. NOT Clock rate and NOT CPI
11
Cycles Per Instruction
Average Cycles per Instruction
CPI = (CPU Time × Clock Rate) / Instruction Count
    = Cycles / Instruction Count
Instruction Frequency
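The formulas on this slide and the previous one were images and are only partly preserved. A reconstruction in the usual textbook notation follows; the symbols IC (instruction count), CPI_i, IC_i, and F_i are assumed names, not taken from the slide.

```latex
\[
\text{CPU time} \;=\; \text{IC} \times \text{CPI} \times \text{Clock cycle time}
             \;=\; \frac{\text{IC} \times \text{CPI}}{\text{Clock rate}}
\]
\[
\text{CPI} \;=\; \sum_{i} \text{CPI}_i \times F_i ,
\qquad
F_i \;=\; \frac{\text{IC}_i}{\text{IC}}
\]
```

Here F_i is the instruction frequency of class i, so the average CPI is a frequency-weighted sum of the per-class cycle counts.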
12
Calculating CPI
Run benchmark and collect workload
characterization (simulate, machine counters, or
sampling)
Base Machine (Reg / Reg)
  Op       Freq   Cycles   CPI(i)   (% Time)
  ALU      50%    1        0.5      (33%)
  Load     20%    2        0.4      (27%)
  Store    10%    2        0.2      (13%)
  Branch   20%    2        0.4      (27%)
  Total                    1.5
Typical Mix of instruction types in program
Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if
it is shown to add 1% performance improvement on
reasonable benchmarks.
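The Base Machine numbers can be checked with a short calculation. This is a minimal sketch; the dictionary layout and function names are illustrative choices, while the frequencies and cycle counts come from the slide.

```python
# Instruction mix from the Base Machine table: op -> (frequency, cycles)
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """Frequency-weighted average of cycles per instruction."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """Fraction of total execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


if __name__ == "__main__":
    print("CPI:", average_cpi(BASE_MIX))
    for op, share in time_share(BASE_MIX).items():
        print(op, round(100 * share), "% of time")
```

ALU operations are half of all instructions but only about a third of the time, which is the kind of observation the "make the common case fast" guideline relies on.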
13
Impact of Stalls
  • Assume CPI = 1.0 ignoring branches (ideal)
  • Assume solution was stalling for 3 cycles
  • If 30% branch, stall 3 cycles on 30%
  • Op       Freq   Cycles   CPI(i)   (% Time)
    Other    70%    1        0.7      (37%)
    Branch   30%    4        1.2      (63%)
  • → new CPI = 1.9
  • The machine is 1/1.9 ≈ 0.53 times as fast
  • Far from ideal
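The stall arithmetic above can be reproduced in a few lines. This is a sketch with an illustrative function name; the inputs (base CPI 1.0, 30% branches, 3 stall cycles) are the slide's.

```python
def cpi_with_stalls(base_cpi: float, stall_freq: float, stall_cycles: float) -> float:
    """Average CPI when a fraction of instructions each add extra stall cycles."""
    return base_cpi + stall_freq * stall_cycles


if __name__ == "__main__":
    new_cpi = cpi_with_stalls(base_cpi=1.0, stall_freq=0.30, stall_cycles=3)
    print("new CPI:", new_cpi)
    # Relative speed versus the ideal CPI of 1.0 at the same clock rate
    print("relative speed:", 1.0 / new_cpi)
```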

14
Instruction Set Architecture Design
  • Definition
  • Set of Operations
  • Instruction Format
  • Hardware Data Types
  • Named Storage
  • Addressing Modes and Sequencing
  • Description in Register Transfer Language
  • Intermediate Representation
  • Map Instruction to RTLs
  • Technology Constraint Considerations
  • Architected storage mapped to actual storage
  • Function units to do all the required operations
  • Possible additional storage (e.g. MAddressR,
    MBufferR, ...)
  • Interconnect to move information among regs and
    FUs
  • Controller
  • Sequences into symbolic controller state
    transition diagram (STD)
  • Lower symbolic STD to control points
  • Controller Implementation

15
Typical Load/Store Processor
16
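Instruction Format
General instruction format: 4-bit opcode; the remaining 28 bits vary
according to instruction type
R-type instruction: opcode | rs (27-25) | rt (24-22) | rd (21-19) | unused | funct (3-0)
I-type instruction: opcode | rs (27-25) | rt (24-22) | unused | imm16 (15-0)
J-type instruction: [field diagram not preserved; only "unused" labels remain]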
17
Instruction Type Datapath
R-type instructions
I-type instructions
J-type instructions
18
Clothes Washing Process
Stage times: 30 minutes, 35 minutes, 25 minutes
One set of clothes in 1 hour 30 minutes
19
Pipelining Laundry
[Timeline figure: slots of 30, 35, 35, 35, and 25 minutes for three overlapped sets]
53 min/set
3X Increase in Productivity!!!
With a large number of sets, each set takes an
average of 35 minutes
Three sets of clean clothes in 2 hours 40 minutes
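The laundry timings can be reproduced with a small model. This is a sketch under stated assumptions: each stage handles one set at a time, sets may wait between stages, and all sets are identical. The function names are illustrative.

```python
def sequential_minutes(stage_minutes, sets):
    """Total time when each set finishes every stage before the next set starts."""
    return sets * sum(stage_minutes)


def pipelined_minutes(stage_minutes, sets):
    """Total time when sets overlap: the first set passes through all stages,
    and every further set adds one slot of the slowest stage."""
    if sets <= 0:
        return 0
    return sum(stage_minutes) + (sets - 1) * max(stage_minutes)


if __name__ == "__main__":
    stages = [30, 35, 25]  # minutes per stage, from the slides
    total = pipelined_minutes(stages, 3)
    print("pipelined total:", total, "min")
    print("per set:", total / 3, "min")
    print("sequential total:", sequential_minutes(stages, 3), "min")
```

Under this model the gain for three sets is 270/160, about 1.7 times, and for many sets it approaches 90/35, about 2.6 times. The slowest stage sets the rate, so balanced stages are what bring the gain close to the number of stages.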
20
Introducing Problems
  • Hazards prevent the next instruction from executing
    during its designated clock cycle
  • Structural hazards: HW cannot support this
    combination of instructions (single person to dry
    and iron clothes simultaneously)
  • Data hazards: instruction depends on the result of a
    prior instruction still in the pipeline (missing
    sock: need both before putting them away)
  • Control hazards: caused by the delay between the
    fetching of instructions and decisions about
    changes in control flow (branches and jumps)

21
One Memory Port/Structural Hazards
[Pipeline diagram: time in clock cycles (Cycle 1 through Cycle 7) against instruction order; a stall delays Instr 3]
Instruction Fetch as well as Load from Memory
22
Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1
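The equation itself is missing from the transcript. The usual textbook form, which is consistent with the SpeedUp expressions used on the next slide, is reconstructed below as a best guess at what the slide showed.

```latex
\[
\text{Speedup} \;=\;
\frac{\text{Ideal CPI} \times \text{Pipeline depth}}
     {\text{Ideal CPI} + \text{Pipeline stall CPI}}
\times
\frac{\text{Cycle time}_{\text{unpipelined}}}{\text{Cycle time}_{\text{pipelined}}}
\]
\[
\text{With Ideal CPI} = 1: \qquad
\text{Speedup} \;=\;
\frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}}
\times
\frac{\text{Cycle time}_{\text{unpipelined}}}{\text{Cycle time}_{\text{pipelined}}}
\]
```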
23
Memory and Pipeline
  • Machine A: Dual ported memory
  • Machine B: Single ported memory
  • 1.05 times faster clock rate
  • Ideal CPI = 1 for both
  • Loads are 40% of instructions executed
  • FreqRatio = Clock_unpipe / Clock_pipe
  • SpeedUpA = (Pipeline Depth / (1 + 0)) × FreqRatio
             = Pipeline Depth × FreqRatio
  • SpeedUpB = (Pipeline Depth / (1 + 0.4 × 1)) × FreqRatio × 1.05
             = Pipeline Depth × 0.75 × FreqRatio
  • SpeedUpA / SpeedUpB = 1.33
  • Machine A is 1.33 times faster
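The comparison can be checked numerically. In this sketch the pipeline depth of 5 and the base FreqRatio of 1.0 are placeholder assumptions; both cancel out of the ratio, so any positive values give the same result.

```python
def pipeline_speedup(depth: float, stall_cpi: float, clock_ratio: float) -> float:
    """Speedup over the unpipelined machine, assuming an ideal CPI of 1."""
    return depth / (1.0 + stall_cpi) * clock_ratio


if __name__ == "__main__":
    depth = 5          # assumed; cancels in the ratio
    freq_ratio = 1.0   # assumed; cancels in the ratio

    # Machine A: dual-ported memory, no structural stalls
    speedup_a = pipeline_speedup(depth, 0.0, freq_ratio)
    # Machine B: single-ported memory, 40% loads each stall 1 cycle, 1.05x clock
    speedup_b = pipeline_speedup(depth, 0.4 * 1, freq_ratio * 1.05)

    print("A / B:", speedup_a / speedup_b)
```

The 5% clock advantage of Machine B does not make up for stalling on 40% of its instructions.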

24
Data Hazard on r1
Time (clock cycles)
25
Data Hazards
  • Read After Write (RAW): InstrJ tries to read
    operand before InstrI writes it
  • Caused by a Dependence (in compiler
    nomenclature). This hazard results from an
    actual need for communication.

I: add r1,r2,r3
J: sub r4,r1,r3
26
Data Hazards
  • Write After Read (WAR): InstrJ writes operand
    before InstrI reads it
  • Called an anti-dependence by compiler
    writers. This results from reuse of the name
    r1.
  • Can't happen in DLX 5 stage pipeline because
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5

27
Data Hazards
  • Write After Write (WAW): InstrJ writes operand
    before InstrI writes it.
  • Called an output dependence by compiler writers
  • This also results from the reuse of name r1.
  • Can't happen in DLX 5 stage pipeline because
  • All instructions take 5 stages, and
  • Writes are always in stage 5
  • Will see WAR and WAW in complicated pipelines
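The three dependence types from the last three slides can be told apart by comparing register names. The sketch below is illustrative only: the `Instr` tuple and `classify_dependences` are hypothetical names, and the WAR and WAW instruction pairs are made-up examples in the style of the RAW pair shown earlier. Whether a dependence becomes an actual hazard depends on the pipeline, as the slides note for DLX.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def classify_dependences(instr_i: Instr, instr_j: Instr) -> Set[str]:
    """Name dependences of a later instruction J on an earlier instruction I."""
    found = set()
    if instr_i.dest in instr_j.srcs:
        found.add("RAW")  # J reads what I writes (true dependence)
    if instr_j.dest in instr_i.srcs:
        found.add("WAR")  # J writes what I reads (anti-dependence)
    if instr_i.dest == instr_j.dest:
        found.add("WAW")  # both write the same register (output dependence)
    return found


if __name__ == "__main__":
    # RAW pair from the slide: I: add r1,r2,r3 / J: sub r4,r1,r3
    print(classify_dependences(Instr("add", "r1", ("r2", "r3")),
                               Instr("sub", "r4", ("r1", "r3"))))
    # Hypothetical WAR pair: I reads r1, J overwrites r1
    print(classify_dependences(Instr("sub", "r4", ("r1", "r3")),
                               Instr("add", "r1", ("r2", "r3"))))
    # Hypothetical WAW pair: both write r1
    print(classify_dependences(Instr("sub", "r1", ("r4", "r3")),
                               Instr("add", "r1", ("r2", "r3"))))
```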

28
Solution Data Forwarding
Time (clock cycles)
29
HW Change for Forwarding
[Datapath diagram: forwarding muxes fed from the ID/EX, EX/MEM, and MEM/WR pipeline registers; other labels: NextPC, Registers, Immediate, Data Memory]
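The mux control for forwarding can be sketched as a priority choice between the pipeline registers named on this slide. This is an assumption-laden illustration, not the lecture's design: the function name and the string return values are hypothetical, and r0 is assumed to be hard-wired to zero as in DLX.

```python
from typing import Optional


def select_operand_source(src_reg: str,
                          ex_mem_dest: Optional[str],
                          mem_wr_dest: Optional[str],
                          zero_reg: str = "r0") -> str:
    """Choose where one ALU operand comes from.

    ex_mem_dest / mem_wr_dest are the destination registers held in the
    EX/MEM and MEM/WR pipeline registers, or None if that instruction
    writes no register.
    """
    if src_reg == zero_reg:
        return "Registers"      # assumed hard-wired zero, never forwarded
    if ex_mem_dest == src_reg:
        return "EX/MEM"         # newest value wins
    if mem_wr_dest == src_reg:
        return "MEM/WR"
    return "Registers"


if __name__ == "__main__":
    # sub r4,r1,r3 right after add r1,r2,r3: r1 comes from EX/MEM
    print(select_operand_source("r1", ex_mem_dest="r1", mem_wr_dest=None))
    print(select_operand_source("r3", ex_mem_dest="r1", mem_wr_dest=None))
```

Checking EX/MEM before MEM/WR matters when two in-flight instructions write the same register, since the younger result is the one the program expects.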
30
Data Hazard Even with Forwarding
[Pipeline diagram: time in clock cycles against instruction order; a bubble follows the load, delaying the three instructions below it by one cycle]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
31
Software Scheduling
Try producing fast code for a = b + c; d = e - f;
assuming a, b, c, d, e, and f are in memory.
  • Slow code
  • LW Rb,b
  • LW Rc,c
  • ADD Ra,Rb,Rc
  • SW a,Ra
  • LW Re,e
  • LW Rf,f
  • SUB Rd,Re,Rf
  • SW d,Rd
  • Fast code
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd

Compiler optimizes for performance. Hardware
checks for safety.
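The difference between the two schedules can be counted with a simple load-use check. This sketch assumes a 5-stage pipeline with full forwarding, where the only remaining stall is one cycle when an instruction uses a register loaded by the instruction immediately before it. The tuple encoding and function name are illustrative.

```python
from typing import List, Optional, Tuple

# (opcode, destination register or None, source registers)
Instr = Tuple[str, Optional[str], Tuple[str, ...]]

SLOW: List[Instr] = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

FAST: List[Instr] = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]


def load_use_stalls(program: List[Instr]) -> int:
    """Count one-cycle stalls caused by using a value right after loading it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls


if __name__ == "__main__":
    print("slow code stalls:", load_use_stalls(SLOW))
    print("fast code stalls:", load_use_stalls(FAST))
```

Both sequences contain the same eight instructions; the fast version only moves independent loads into the slots that would otherwise be bubbles.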
32
Control Hazard on Branches
What do you do with the 3 instructions in
between? How do you do it? Where is the commit?
33
Branch Hazard Alternatives
  • Stall until branch direction is clear
  • Predict Branch Not Taken
  • Execute successor instructions in sequence
  • Squash instructions in pipeline if branch
    actually taken
  • Advantage of late pipeline state update
  • 47% of DLX branches not taken on average
  • PC+4 already calculated, so use it to get next
    instr
  • Predict Branch Taken
  • 53% of DLX branches taken on average
  • DLX still incurs 1 cycle branch penalty
  • Other machines: branch target known before outcome

34
Branch Hazard Alternatives
  • Delayed Branch
  • Define branch to take place AFTER a following
    instruction (Fill in Branch Delay Slot)
  • Branch delay of length n:
      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n
      branch target if taken
  • 1 slot delay allows proper decision and branch
    target address in 5 stage pipeline

35
Evaluating Branch Alternatives
  Scheduling scheme     Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
  Stall pipeline        3                1.42   3.5                      1.0
  Predict taken         1                1.14   4.4                      1.26
  Predict not taken     1                1.09   4.5                      1.29
  Delayed branch        0.5              1.07   4.6                      1.31
  • Conditional and unconditional branches = 14% of instructions; 65% of them change the PC
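The CPI column can be reproduced from the two percentages in the last bullet. This is a sketch under assumptions: CPI = 1 + branch frequency × average penalty, the predict-not-taken penalty is paid only by the 65% of branches that change the PC, and the speedup over the unpipelined machine is taken as pipeline depth (5) divided by CPI. The names are illustrative.

```python
BRANCH_FREQ = 0.14      # branches as a fraction of all instructions
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC
PIPELINE_DEPTH = 5      # assumed 5-stage pipeline


def branch_cpi(penalty: float, fraction_penalized: float = 1.0) -> float:
    """Average CPI with an ideal CPI of 1 plus branch penalty cycles."""
    return 1.0 + BRANCH_FREQ * fraction_penalized * penalty


SCHEMES = {
    "Stall pipeline": branch_cpi(3),
    "Predict taken": branch_cpi(1),
    "Predict not taken": branch_cpi(1, TAKEN_FRACTION),
    "Delayed branch": branch_cpi(0.5),
}

if __name__ == "__main__":
    for name, cpi in SCHEMES.items():
        print(name, "CPI:", round(cpi, 2),
              "speedup v. unpipelined:", PIPELINE_DEPTH / cpi)
```

The computed speedups land slightly above the table's speedup columns in some rows, which suggests the slide's figures were truncated or derived from already rounded values.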

36
Conclusion
  • Instruction Set Architecture
  • Things to Consider when designing a new ISA
  • Processor
  • Concept behind Pipelining
  • Five Stage Pipeline RISC
  • Proper Processor Performance Evaluation
  • Limitations of Pipelining
  • Structural, Data, and Control Hazards
  • Techniques to Recover Performance
  • Re-evaluating Speed-ups