ECE369 Chapter 4: Transcript and Presenter's Notes
1
ECE369 Chapter 4

2
State Elements
  • Unclocked vs. Clocked
  • Clocks used in synchronous logic
  • Clocks are needed in sequential logic to decide
    when an element that contains state should be
    updated.

3
Latches and Flip-flops
4
Latches and Flip-flops
5
Latches and Flip-flops
Latch: state changes whenever the inputs change and
the clock is asserted. Flip-flop: state changes only
on a clock edge (edge-triggered methodology)
6
SRAM
7
SRAM vs. DRAM
Which one has a better memory density?
In static RAM (SRAM), the value stored in a cell is
kept on a pair of inverting gates. In dynamic RAM
(DRAM), the value kept in a cell is stored as a
charge on a capacitor. DRAMs use only a single
transistor per bit of storage; by comparison,
SRAMs require four to six transistors per bit.
Which one is faster?
In DRAMs, the charge is stored on a capacitor, so
it cannot be kept indefinitely and must
periodically be refreshed (hence "dynamic").
Synchronous RAMs?
The key capability is the ability to transfer a
burst of data from a series of sequential addresses
within an array or row.
8
Datapath and control design
  • We will design a simplified MIPS processor
  • The instructions supported are
  • Memory-reference instructions: lw, sw
  • Arithmetic-logical instructions: add, sub, and,
    or, slt
  • Control flow instructions: beq, j
  • Generic implementation
  • Use the program counter (PC) to supply the
    instruction address
  • Get the instruction from memory
  • Read registers
  • Use the instruction to decide exactly what to do
  • All instructions use the ALU after reading the
    registers. Why? memory-reference? arithmetic?
    control flow?

9
ALU Control
  • ALU's operation is based on instruction type and
    function code
  • Example
  • add $t1, $s7, $s8
  • 000 = AND
  • 001 = OR
  • 010 = Add
  • 110 = Subtract
  • 111 = Set-on-less-than

lw $t0, 32($s2)
10
Summary of Instruction Types
11
Building blocks
Why do we need each of these?
12
Fetching instructions
13
Reading registers
14
Load/Store memory access
15
Branch target
16
Combining datapath for memory and R-type
instructions
17
Appending instruction fetch
18
Now Insert Branch
19
The simple datapath
20
Control
  • For each instruction
  • Select the registers to be read (always read two)
  • Select the 2nd ALU input
  • Select the operation to be performed by ALU
  • Select if data memory is to be read or written
  • Select what is written and where in the register
    file
  • Select what goes in PC
  • Information comes from the 32 bits of the
    instruction

21
Adding control to datapath
22
Adding control to datapath

23
ALU Control
  • ALUOp is given by the instruction type: 00 = lw, sw;
    01 = beq; 10 = arithmetic (R-type)
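As a rough illustration of how this works, here is a minimal Python sketch of the ALU control decode; the output encodings are the ones listed on the earlier ALU Control slide (000 AND, 001 OR, 010 Add, 110 Subtract, 111 Set-on-less-than), while the R-type funct values are the standard MIPS codes and are an assumption, not something spelled out on this slide.

    def alu_control(alu_op, funct):
        # ALUOp comes from the main control unit, funct from the instruction.
        if alu_op == 0b00:              # lw / sw: compute the address
            return 0b010                # add
        if alu_op == 0b01:              # beq: compare by subtracting
            return 0b110                # subtract
        # alu_op == 0b10: R-type, decode the funct field (standard MIPS codes)
        r_type = {
            0b100000: 0b010,            # add
            0b100010: 0b110,            # sub
            0b100100: 0b000,            # and
            0b100101: 0b001,            # or
            0b101010: 0b111,            # slt
        }
        return r_type[funct]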

24
Control (Reading Assignment Appendix C.2)
  • Simple combinational logic (truth tables)

25

26
Datapath in Operation for R-Type Instruction
27
Datapath in Operation for Load Instruction
28
Datapath in Operation for Branch Equal Instruction
29
Datapath with control for Jump instruction
  • J-type instructions use 6 bits for the opcode
    and 26 bits for the immediate value (called the
    target).
  • newPC <- PC[31:28] || IR[25:0] || 00
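A minimal Python sketch of that newPC computation (the bit layout is the standard MIPS one; the slide writes PC[31:28], and the sketch follows the slide, although the textbook actually uses the upper bits of PC + 4):

    def jump_target(pc, instr):
        # newPC = PC[31:28] || IR[25:0] || 00
        target = instr & 0x03FFFFFF        # low 26 bits of the instruction
        return (pc & 0xF0000000) | (target << 2)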

30
Timing: Single-cycle implementation
  • Calculate cycle time assuming negligible delays
    except
  • Memory (2ns), ALU and adders (2ns), Register file
    access (1ns)
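For example, a load word exercises the longest path under these assumptions: instruction memory (2 ns) + register read (1 ns) + ALU (2 ns) + data memory (2 ns) + register write (1 ns) = 8 ns, so the single-cycle clock period must be at least 8 ns.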

31
Why is Single Cycle not GOOD???
  • Memory - 2ns
  • ALU - 2ns Adder - 2ns
  • Reg - 1ns
  • what if we had floating point instructions to
    handle?

32
1 clock cycle: fixed vs. variable for each
instruction
  • Memory - 2ns
  • ALU - 2ns, Adder - 2ns
  • Reg - 1ns
  • Loads 24%
  • Stores 12%
  • R-type 44%
  • Branch 18%
  • Jumps 2%

33
1 clock cycle: fixed vs. variable for each
instruction
  • Memory - 2ns
  • ALU - 2ns, Adder - 2ns
  • Reg - 1ns
  • Loads 24%
  • Stores 12%
  • R-type 44%
  • Branch 18%
  • Jumps 2%

CPU time = IC x CPI x CC
CPU clock cycle (variable) = 8 x 24% + 7 x 12% + 6 x 44% + 5 x 18% + 2 x 2%
CPU clock cycle = 6.3 ns
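A minimal Python sketch of the calculation above; the per-class times (8, 7, 6, 5, 2 ns) follow from the component delays on this slide, and the mix is the one in the bullets:

    # (time in ns, fraction of instruction mix) per class
    classes = {
        "load":   (8, 0.24),   # imem + reg read + ALU + dmem + reg write
        "store":  (7, 0.12),   # imem + reg read + ALU + dmem
        "r-type": (6, 0.44),   # imem + reg read + ALU + reg write
        "branch": (5, 0.18),   # imem + reg read + ALU
        "jump":   (2, 0.02),   # imem only
    }

    fixed_clock = max(t for t, _ in classes.values())         # 8 ns (worst case)
    variable_clock = sum(t * f for t, f in classes.values())  # about 6.3 ns on average
    print(fixed_clock, round(variable_clock, 1))               # 8 6.3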
34
Single Cycle Problems
  • Wasteful of area
  • Each unit used once per clock cycle
  • Clock cycle equal to worst case scenario
  • Will reducing the delay of common case help?

35
Pipelining
36
Pipelining
  • Improve performance by increasing instruction
    throughput

Ideal speedup is the number of stages in the
pipeline. Do we achieve this?
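For example, if each of the five stages took the same 2 ns, a single instruction would still need 10 ns to complete, but once the pipeline is full one instruction finishes every 2 ns, the ideal 5x throughput gain; unbalanced stages, pipeline fill/drain time, and hazards keep real pipelines below this ideal.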
37
Pipelining
  • What makes it easy?
  • all instructions are the same length
  • just a few instruction formats
  • memory operands appear only in loads and stores
  • What makes it hard?
  • structural hazards suppose we had only one
    memory
  • control hazards need to worry about branch
    instructions
  • data hazards an instruction depends on a
    previous instruction
  • We'll build a simple pipeline and look at these
    issues
  • We'll talk about modern processors and what
    really makes it hard
  • exception handling
  • trying to improve performance with out-of-order
    execution, etc.

38
Representation
39
Hazards
40
Hazards
41
Hazards
42
Basic Idea
What do we need to add to actually split the
datapath into stages?
43
Pipelined datapath
44
Five Stages (lw)
Memory and registers: left half = write, right half
= read
45
Five Stages (lw)
46
Five Stages (lw)
47
What is wrong with this datapath?
48
Store Instruction
49
Store Instruction
50
Graphically representing pipelines
  • Can help with answering questions like
  • How many cycles does it take to execute this
    code?
  • What is the ALU doing during cycle 4?
  • Use this representation to help understand
    datapaths

51
Pipeline operation
  • In a pipeline, one operation begins in every cycle
  • Also, one operation completes in each cycle
  • Each instruction takes 5 clock cycles
  • k cycles in general, where k is pipeline depth
  • When a stage is not used, no control needs to be
    applied
  • In one clock cycle, several instructions are
    active
  • Different stages are executing different
    instructions
  • How to generate control signals for them is an
    issue

52
Pipeline control
  • We have 5 stages. What needs to be controlled in
    each stage?
  • Instruction Fetch and PC Increment
  • Instruction Decode / Register Fetch
  • Execution
  • Memory Stage
  • Write Back
  • How would control be handled in an automobile
    plant?
  • A fancy control center telling everyone what to
    do?
  • Should we use a finite state machine?
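One way to picture the answer, as a minimal Python sketch: the control signals are produced once in ID from the opcode and then simply carried down the pipeline registers, with each stage consuming its own group. The signal names and groupings below follow the textbook single-cycle design and are an assumption rather than something printed on this slide.

    # which stage consumes which control signals
    CONTROL_BY_STAGE = {
        "EX":  ["RegDst", "ALUOp", "ALUSrc"],
        "MEM": ["Branch", "MemRead", "MemWrite"],
        "WB":  ["RegWrite", "MemtoReg"],
    }

    def decode(op):
        # all signals generated in ID; don't-care values shown as 0
        table = {
            "r-type": dict(RegDst=1, ALUOp=0b10, ALUSrc=0, Branch=0,
                           MemRead=0, MemWrite=0, RegWrite=1, MemtoReg=0),
            "lw":     dict(RegDst=0, ALUOp=0b00, ALUSrc=1, Branch=0,
                           MemRead=1, MemWrite=0, RegWrite=1, MemtoReg=1),
            "sw":     dict(RegDst=0, ALUOp=0b00, ALUSrc=1, Branch=0,
                           MemRead=0, MemWrite=1, RegWrite=0, MemtoReg=0),
            "beq":    dict(RegDst=0, ALUOp=0b01, ALUSrc=0, Branch=1,
                           MemRead=0, MemWrite=0, RegWrite=0, MemtoReg=0),
        }
        return table[op]

No finite state machine is needed: control is combinational, computed alongside register fetch, and travels with the instruction through the ID/EX, EX/MEM, and MEM/WB registers.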

53
Pipeline control
54
Pipeline control
55
Datapath with control
56
Dependencies
  • Problem with starting the next instruction before
    the first is finished
  • Dependencies that go backward in time are data
    hazards

57
Forwarding
  • Use temporary results, don't wait for them to be
    written
  • register file forwarding to handle read/write to
    same register
  • ALU forwarding

58
Forwarding
sub $2,  $1,  $3
and $12, $2,  $5
or  $13, $6,  $2
add $14, $2,  $2
sw  $15, 100($2)
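For the sequence above, the forwarding unit's tests can be written out directly. A minimal Python sketch of the usual EX-hazard / MEM-hazard conditions follows; the pipeline-register field names (RegWrite, Rd, and so on) mirror the textbook and are assumptions here.

    def forward_a(id_ex_rs, ex_mem, mem_wb):
        # returns 0b10 to forward from EX/MEM, 0b01 from MEM/WB, 0b00 otherwise
        # EX hazard: e.g., the sub result feeding the and one instruction later
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
            return 0b10
        # MEM hazard: e.g., the sub result feeding the or two instructions later
        # (the earlier return already gives the EX/MEM value priority)
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
            return 0b01
        return 0b00

The same test with Rt in place of Rs produces ForwardB for the second ALU operand. The add in the example is covered by the register file writing in the first half of the cycle and reading in the second, and the sw needs no forwarding at all.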
59
Forwarding
60
Can't always forward
  • Load word can still cause a hazard
  • an instruction tries to read a register following
    a load instruction that writes to the same
    register.

61
Stalling
  • Hardware detection and no-op insertion is called
    stalling
  • Stall pipeline by keeping instruction in the same
    stage

62
Example
63
64
Stall logic
  • Stall logic
  • If (ID/EX.MemRead) // load word instruction, AND
  • If ((ID/EX.Rt == IF/ID.Rs) or (ID/EX.Rt == IF/ID.Rt))
  • Insert a no-op (no-operation)
  • Deassert all control signals
  • Stall the following instruction
  • Do not write the program counter
  • Do not write the IF/ID register

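A minimal Python sketch of the detection and the stall actions described above (field names mirror the slide's abbreviations; this is an illustration, not the exact hardware):

    def load_use_stall(id_ex, if_id):
        # the instruction in EX is a load, and the instruction in ID needs its result
        return id_ex["MemRead"] and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"])

    def hazard_unit(id_ex, if_id, decoded_control):
        if load_use_stall(id_ex, if_id):
            bubble = {name: 0 for name in decoded_control}  # deassert all control signals
            return bubble, False, False   # control, PCWrite, IF/IDWrite: hold both
        return decoded_control, True, True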
65
Pipeline with hazard detection
66
Assume that the register file is written in the first
half and read in the second half of the clock cycle.
LOAD1: load r2 <- mem(r10)
ADD:   r3 <- r3 + r2
LOAD2: load r4 <- mem(r2 + r3)
SUB:   r4 <- r5 - r3
(Pipeline diagram: each instruction flows through IF, ID,
EX, MEM, and WB, with stall cycles inserted where a
dependent instruction must wait for an earlier result.)
67
Summary
68
Forwarding Case Summary
69
Multi-cycle
70
Multi-cycle
71
Multi-cycle Pipeline
72
Branch Hazards
73
Branch hazards
  • When we decide to branch, other instructions are
    in the pipeline!
  • We are predicting branch not taken
  • need to add hardware for flushing instructions if
    we are wrong

74
Solution to control hazards
  • Branch prediction
  • We are predicting branch not taken
  • Need to add hardware for flushing instructions if
    we are wrong
  • Reduce branch penalty
  • By advancing the branch decision to ID stage
  • Compare the data read from two registers read in
    ID stage
  • Comparison for equality is a simpler design!
    (Why?)
  • Still need to flush instruction in IF stage
  • Make the hazard into a feature!
  • Delayed branch slot - Always execute instruction
    following branch

75
Branch detection in ID stage

76
Dynamic branch prediction
  • Use the lower part of the instruction address
  • Use one bit to denote branch taken or not taken
  • Disadvantage: poor performance in loops
  • Dynamic branch prediction
  • Use two bits instead of one
  • The branch must go the other way twice in a row
    before the prediction changes
  • More sophisticated
  • Count the number of times the branch is taken

2-bit branch prediction: state diagram
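A minimal Python sketch of the 2-bit scheme (an assumed but conventional formulation): each entry is a saturating counter indexed by the low bits of the branch address, so the prediction only flips after being wrong twice in a row.

    class TwoBitPredictor:
        def __init__(self, entries=1024):
            self.counters = [1] * entries     # 0-1 predict not taken, 2-3 predict taken

        def _index(self, pc):
            return (pc >> 2) % len(self.counters)  # lower part of the instruction address

        def predict(self, pc):
            return self.counters[self._index(pc)] >= 2   # True = predict taken

        def update(self, pc, taken):
            i = self._index(pc)
            self.counters[i] = min(3, self.counters[i] + 1) if taken \
                               else max(0, self.counters[i] - 1)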
77
Correlating Branches
  • Hypothesis recent branches are correlated that
    is, behavior of recently executed branches
    affects prediction of current branch
  • Idea record m most recently executed branches as
    taken or not taken, and use that pattern to
    select the proper branch history table
  • In general, an (m,n) predictor means record the last m
    branches to select between 2^m history tables, each
    with n-bit counters
  • Old 2-bit BHT is then a (0,2) predictor
  • if (aa == 2)
  •   aa = 0
  • if (bb == 2)
  •   bb = 0
  • if (aa != bb)
  •   do something
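In this code the third branch is correlated with the first two: if both are taken, aa and bb have both just been cleared to 0, so aa != bb is false and the third branch is certainly not taken. A predictor that remembers the outcomes of the preceding branches can exploit exactly this, which a per-branch 2-bit counter cannot.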

78
Correlating Branches
  • (2,2) predictor
  • Then behavior of recent branches selects between,
    say, four predictions of next branch, updating
    just that prediction

(Figure: the branch address and a 2-bit global branch
history together select among the 2-bits-per-branch
predictors to produce the prediction.)
79
Accuracy of Different Schemes
(Figure: frequency of mispredictions, 0% to 18%, for a
4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and
a 1024-entry (2,2) BHT.)
80
Branch Prediction
  • Sophisticated Techniques
  • A branch target buffer to help us look up the
    destination
  • Correlating predictors that base prediction on
    global behavior and recently executed branches
    (e.g., prediction for a specific branch
    instruction based on what happened in previous
    branches)
  • Tournament predictors that use different types of
    prediction strategies and keep track of which one
    is performing best.
  • A branch delay slot which the compiler tries to
    fill with a useful instruction (make the one
    cycle delay part of the ISA)
  • Branch prediction is especially important because
    it enables other more advanced pipelining
    techniques to be effective!
  • Modern processors predict correctly 95% of the
    time!

81
Branch Target Buffer
  • Branch Target Buffer (BTB): the address of the branch
    is the index used to get the prediction AND the
    branch target address (if taken)
  • Note: must check for a branch match now, since we
    can't use the wrong branch address
  • Return instruction addresses are predicted with a stack

(Figure: the BTB produces a taken / not-taken prediction
and the predicted PC.)
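A minimal Python sketch of a BTB lookup in the fetch stage (the organization is an assumption for illustration): the fetch PC indexes the buffer, the stored tag confirms the entry really belongs to this branch, and a hit supplies the predicted PC.

    class BranchTargetBuffer:
        def __init__(self, entries=256):
            self.table = [None] * entries     # each entry: (tag, target, predict_taken)

        def _index(self, pc):
            return (pc >> 2) % len(self.table)

        def next_pc(self, pc):
            entry = self.table[self._index(pc)]
            if entry is None or entry[0] != pc:   # must check for a branch match
                return pc + 4                     # no prediction available
            _, target, predict_taken = entry
            return target if predict_taken else pc + 4

        def update(self, pc, target, taken):
            self.table[self._index(pc)] = (pc, target, taken)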
82
Scheduling in delayed branching
83
Other issues in pipelines
  • Exceptions
  • Errors in ALU for arithmetic instructions
  • Memory non-availability
  • Exceptions lead to a jump in a program
  • However, the current PC value must be saved so
    that the program can return to it for
    recoverable errors
  • Multiple exceptions can occur in a pipeline
  • Preciseness of exception location is important in
    some cases
  • I/O exceptions are handled in the same manner

84
Exceptions
85
Improving Performance
  • Try and avoid stalls! E.g., reorder these
    instructions (a possible reordering is shown
    after this list)
  • lw $t0, 0($t1)
  • lw $t2, 4($t1)
  • sw $t2, 0($t1)
  • sw $t0, 4($t1)
  • Dynamic Pipeline Scheduling
  • Hardware chooses which instructions to execute
    next
  • Will execute instructions out of order (e.g.,
    doesn't wait for a dependency to be resolved, but
    rather keeps going!)
  • Speculates on branches and keeps the pipeline
    full (may need to rollback if prediction
    incorrect)
  • Trying to exploit instruction-level parallelism
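For the load/store sequence in the first bullet, one possible reordering (assuming forwarding and a one-cycle load-use delay) is: lw $t0, 0($t1); lw $t2, 4($t1); sw $t0, 4($t1); sw $t2, 0($t1). Swapping the two stores puts an independent instruction between lw $t2 and the sw that uses $t2, so the load-use stall disappears.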

86
Advanced Pipelining
  • Increase the depth of the pipeline
  • Start more than one instruction each cycle
    (multiple issue)
  • Loop unrolling to expose more ILP (better
    scheduling)
  • Superscalar processors
  • DEC Alpha 21264: 9-stage pipeline, 6-instruction
    issue
  • All modern processors are superscalar and issue
    multiple instructions usually with some
    limitations (e.g., different pipes)
  • VLIW (very long instruction word): static
    multiple issue (relies more on compiler
    technology)
  • This class has given you the background you need
    to learn more!

87
  • Source: for (i = 1; i <= 1000; i = i + 1)
      x[i] = x[i] + y[i]
  • Direct translation:
    Loop: LD   F0, 0(R1)     ; R1 points to x[1000]
          ADDD F4, F0, F2    ; F2 holds the scalar value
          SD   0(R1), F4
          SUBI R1, R1, 8
          BNEZ R1, Loop      ; x[0] is at address 0

88
Reducing stalls
  • Pipeline implementation (with stalls):
    Loop: LD   F0, 0(R1)
          stall
          ADDD F4, F0, F2
          stall
          stall
          SD   0(R1), F4
          SUBI R1, R1, 8
          stall
          BNEZ R1, loop
          stall
  • Scheduled to reduce stalls:
    Loop: LD   F0, 0(R1)
          stall
          ADDD F4, F0, F2
          SUBI R1, R1, 8
          BNEZ R1, loop
          SD   8(R1), F4

89
Loop Unrolling
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4       ; drop SUBI & BNEZ
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8      ; drop SUBI & BNEZ
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12    ; drop SUBI & BNEZ
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop
90
  • Loop: LD   F0, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)
          LD   F14, -24(R1)
          ADDD F4, F0, F2
          ADDD F8, F6, F2
          ADDD F12, F10, F2
          ADDD F16, F14, F2
          SD   0(R1), F4
          SD   -8(R1), F8
          SD   -16(R1), F12
          SUBI R1, R1, 32
          BNEZ R1, Loop
          SD   8(R1), F16   ; 8 - 32 = -24
  • 14 instructions (3.5 inst/iteration vs. 6)

91
Superscalar architecture -- Two instructions
executed in parallel
92
Dynamically scheduled pipeline
93
Motorola G4e
94
Intel Pentium 4
95
IBM PowerPC 970
96
Important facts to remember
  • Pipelined processors divide execution into multiple
    steps
  • However, pipeline hazards reduce performance
  • Structural, data, and control hazards
  • Data forwarding helps resolve data hazards
  • But all hazards cannot be resolved
  • Some data hazards require bubble or no-op
    insertion
  • Effects of control hazards are reduced by branch
    prediction
  • Predict always taken, delayed slots, branch
    prediction table
  • Structural hazards are resolved by duplicating
    resources
  • Time to execute n instructions depends on
  • # of stages (k)
  • # of control hazards and the penalty of each
  • # of data hazards and the penalty for each
  • Time = n + k - 1 + (load hazard penalties) +
    (branch penalties) (see the sketch after this list)
  • Load hazard penalty is 1 or 0 cycle
  • Depending on data use with forwarding
  • Branch penalty is 3, 2, 1, or zero cycles
    depending on scheme
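A rough Python sketch of the time estimate in the list above; the formula is the one on this slide, and the particular counts in the example call are made-up numbers for illustration.

    def pipeline_cycles(n, k, load_use_stalls, branches, branch_penalty):
        # Time = n + k - 1 + load hazard penalties + branch penalties
        return n + (k - 1) + load_use_stalls + branches * branch_penalty

    # e.g., 100 instructions, 5 stages, 4 load-use stalls, 10 branches at 1 cycle each
    print(pipeline_cycles(100, 5, 4, 10, 1))    # 118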

97
Design and performance issues with pipelining
  • Pipelined processors are not EASY to design
  • Technology affects the implementation
  • Instruction set design affects performance
  • e.g., beq, bne
  • More stages do not lead to higher performance!