1
CSC 4250 Computer Architectures
  • September 15, 2006
  • Appendix A. Pipelining

2
What is Pipelining?
  • Implementation technique whereby multiple
    instructions are overlapped in execution
  • Pipelining exploits parallelism among the
    instructions in a sequential instruction stream
  • Recall the formula: CPU time = IC × CPI × clock
    cycle time (see the sketch after this slide)
  • Pipelining yields a reduction in the average
    execution time per instruction, i.e., it
    decreases the CPI
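
  To make the formula concrete, here is a back-of-the-envelope sketch
  in Python (the instruction count, 1 ns cycle, and ideal CPI of 1 are
  illustrative assumptions, not figures from the slides):

    # CPU time = IC x CPI x clock cycle time
    IC = 1_000_000          # instruction count (assumed)
    cct = 1e-9              # clock cycle time in seconds (1 ns, assumed)

    cpi_unpipelined = 5.0   # every instruction occupies all 5 stages serially
    cpi_pipelined = 1.0     # ideal pipeline: one instruction completes per cycle

    t_unpipelined = IC * cpi_unpipelined * cct
    t_pipelined = IC * cpi_pipelined * cct
    print(t_unpipelined / t_pipelined)   # 5.0x speedup in the ideal case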

3
RISC Architectures
  • Reduced Instruction Set Computer
  • All operations on data apply to data in registers
  • The only operations that affect memory are loads
    and stores, which move data from memory to a
    register or from a register to memory, respectively
  • Instruction formats are few in number, with all
    instructions typically the same size

4
Three Classes of Instructions
  • We consider
  • ALU instructions
  • Load and store instructions
  • Branches (no jumps)

5
ALU Instructions
  • Take either two registers or a register and a
    sign-extended immediate, operate on them, and
    store the result into a third register
  • DADD R1,R2,R3
  • Format: Opcode | rs (R2) | rt (R3) | rd (R1) | shamt | opx
  • Reg[R1] ← Reg[R2] + Reg[R3]
  • DADDI R1,R2,3
  • Format: Opcode | rs (R2) | rt (R1) | Immediate
  • Reg[R1] ← Reg[R2] + 3
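
  As an illustration, the two formats above could be packed into
  32-bit words as follows; the field widths (6/5/5/5/5/6 bits) are an
  assumption in the MIPS style, not something the slide specifies:

    # R-type: Opcode | rs | rt | rd | shamt | opx
    def encode_r_type(opcode, rs, rt, rd, shamt, opx):
        # DADD R1,R2,R3 would use rs=2, rt=3, rd=1
        return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | opx

    # I-type: Opcode | rs | rt | 16-bit immediate
    def encode_i_type(opcode, rs, rt, imm):
        # DADDI R1,R2,3 would use rs=2, rt=1, imm=3
        return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)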

6
Load and Store Instructions
  • Take a register source (the base register) and an
    immediate field (the offset). Their sum (the
    effective address) is the memory address. The
    second register is the destination (load) or
    source (store) of the data.
  • LD R2,30(R1)
  • Format: Opcode | rs (R1) | rt (R2) | Immediate
  • Reg[R2] ← Mem[30 + Reg[R1]]
  • SD R2,30(R1)
  • Format: Opcode | rs (R1) | rt (R2) | Immediate
  • Mem[30 + Reg[R1]] ← Reg[R2]
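
  A small Python sketch of the effective-address computation for the
  two examples above; the dict-based register file and memory are only
  a toy model:

    dmem = {}                          # toy data memory
    reg = {i: 0 for i in range(32)}    # toy register file
    reg[1] = 0x1000                    # base register R1

    # LD R2,30(R1): effective address = 30 + Reg[R1]
    ea = 30 + reg[1]
    reg[2] = dmem.get(ea, 0)           # load

    # SD R2,30(R1): store Reg[R2] to the same effective address
    dmem[ea] = reg[2]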

7
Branches
  • Branches are conditional transfers of control
  • The branch destination is obtained by adding a
    sign-extended offset to the current PC
  • We consider only comparison against zero
  • BEQZ R1,name
  • BEQZ is a pseudo-instruction for BEQ with R0
  • BEQ R1,R0,name
  • Format: Opcode | rs (R1) | rt (R0) | Immediate
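
  A sketch of the branch semantics in Python; treating the immediate as
  a word offset (shifted left by 2 and added to NPC) follows the
  EX-stage description later in this deck and is an assumption here:

    def branch_target(pc, imm):
        # target = NPC + (sign-extended offset << 2), where NPC = PC + 4
        return (pc + 4) + (imm << 2)

    def beqz(reg, r1, pc, imm):
        # BEQZ R1,name is BEQ R1,R0,name; R0 always holds zero
        return branch_target(pc, imm) if reg[r1] == reg[0] else pc + 4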

8
RISC Instruction Set
  • Each instruction takes at most five clock cycles
  • Instruction fetch cycle (IF)
  • Instruction decode/register fetch cycle (ID)
  • Execution/effective address cycle (EX)
  • Memory access/branch completion (MEM)
  • Write-back cycle (WB)

9
Instruction Fetch (IF)
  • Send program counter (PC) to memory and fetch
    current instruction from memory
  • Update PC by adding 4 (why 4?).
  • Operations
  • IR ← Mem[PC]
  • NPC ← PC + 4

10
Instruction Decode/Register Fetch (ID)
  • Decode instruction
  • Read registers
  • Decoding is done in parallel with reading
    registers (fixed-field decoding)
  • Sign-extend the offset field
  • Operations
  • A ← Reg[rs]
  • B ← Reg[rt]
  • Imm ← sign-extended immediate field of IR
  • (A and B are temporary registers).

11
Execution/Effective Address (EX)
  • ALU operates on the operands prepared in ID,
    performing one of four possible functions
  • Memory ref. (add base register and offset)
  • ALUOutput ← A + Imm
  • Register-Register ALU instruction
  • ALUOutput ← A func B
  • Register-Immediate ALU instruction
  • ALUOutput ← A op Imm
  • Branch
  • ALUOutput ← NPC + (Imm << 2)
  • Cond ← (A == 0)

12
Memory Access/Branch Completion (MEM)
  • PC is updated
  • PC ← NPC
  • Access memory if needed
  • LMD (Load Memory Data register)
  • Load: LMD ← Mem[ALUOutput]
  • Store: Mem[ALUOutput] ← B
  • Branch
  • If (Cond) PC ← ALUOutput

13
Write Back (WB)
  • Register-Register ALU
  • Reg[rd] ← ALUOutput
  • Register-Immediate ALU
  • Reg[rt] ← ALUOutput
  • Load
  • Reg[rt] ← LMD
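
  Pulling slides 9 through 13 together, here is a compact, unpipelined
  Python sketch of one load instruction flowing through the five
  stages; the pre-decoded instruction tuple and dict memories are
  simplifying assumptions:

    imem = {0: ("LD", 2, 1, 30)}       # instruction at PC=0: LD R2,30(R1)
    dmem = {130: 99}                   # toy data memory
    reg = {i: 0 for i in range(32)}
    reg[1] = 100                       # base register
    pc = 0

    # IF: fetch the instruction and compute NPC
    ir = imem[pc]
    npc = pc + 4

    # ID: read registers and sign-extend the immediate
    op, rt, rs, imm = ir
    a, b = reg[rs], reg[rt]

    # EX: effective address for a memory reference
    alu_output = a + imm               # 100 + 30 = 130

    # MEM: update the PC and access data memory
    pc = npc
    lmd = dmem[alu_output]             # load memory data

    # WB: write the result back to the register file
    reg[rt] = lmd                      # Reg[R2] = 99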

14
Simple RISC Pipeline
  • Clock number
    Instr.        1   2   3   4   5   6   7   8   9
    Instr. i      IF  ID  EX  ME  WB
    Instr. i+1        IF  ID  EX  ME  WB
    Instr. i+2            IF  ID  EX  ME  WB
    Instr. i+3                IF  ID  EX  ME  WB
    Instr. i+4                    IF  ID  EX  ME  WB
  • What are the stages needed for an ALU
    instruction?
  • What are the stages needed for a Store
    instruction?
  • What are the stages needed for a Branch
    instruction?
  • Which stage is expected to take the most time?
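
  The staircase pattern above can be generated mechanically; a small,
  purely illustrative Python sketch:

    STAGES = ["IF", "ID", "EX", "ME", "WB"]

    def pipeline_diagram(n_instrs):
        # Instruction i enters IF in cycle i+1 and leaves WB in cycle i+5
        for i in range(n_instrs):
            row = ["  "] * i + STAGES
            print(f"Instr. i+{i}:  " + " ".join(row))

    pipeline_diagram(5)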

15
Figure A.2. Pipeline
16
Three Observations on Overlapping Execution
  • Use separate instruction and data memories, which
    is typically implemented with separate
    instruction and data caches. The use of separate
    caches eliminates a conflict for a single memory
    that would arise between instruction fetch and
    data memory access.

17
Three Observations on Overlapping Execution
  • The register file is used in two stages: for
    reading in ID and for writing in WB. These uses
    are distinct. Hence, we need to perform two reads
    and one write every clock cycle (why two reads?).
    To handle reads and a write to the same register
    (and for another reason that will arise later), we
    perform the register write in the first half of
    the clock cycle and the reads in the second half.

18
Three Observations on Overlapping Execution
  • To start a new instruction every clock cycle, we
    must increment and store the PC every cycle, and
    this must be done during the IF stage in
    preparation for the next instruction. Another
    problem is that a branch does not change the PC
    until the MEM stage (this problem will be handled
    soon).

19
Pipeline Registers
  • Prevent interference between two different
    instructions in adjacent stages of the pipeline.
  • Carry the data of a given instruction from one
    stage to the next.
  • Registers are edge-triggered: their values change
    on the clock edge.
  • Add pipelining overhead.
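
  One way to picture the pipeline registers of Figure A.3 is as latched
  records between stages; the field names below are assumptions based
  on the stage descriptions earlier in the deck (IF/ID and ID/EX shown,
  EX/MEM and MEM/WB follow the same pattern):

    from dataclasses import dataclass

    @dataclass
    class IF_ID:      # written at the end of IF, read by ID
        ir: int = 0
        npc: int = 0

    @dataclass
    class ID_EX:      # written at the end of ID, read by EX
        a: int = 0
        b: int = 0
        imm: int = 0
        npc: int = 0

    # On each clock edge, every pipeline register captures the values
    # produced by its upstream stage, so adjacent instructions never
    # overwrite each other's intermediate results.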

20
Figure A.3. Pipeline Registers
21
Example
  • Consider an unpipelined processor. Assume a 1 ns
    clock cycle, 4 cycles for ALU operations and
    branches, and 5 cycles for memory operations.
    Suppose the relative frequencies are 40%, 20%, and
    40%, respectively. The pipelining overhead is
    0.2 ns. What is the speedup from pipelining?

22
Answer
  • Average execution time on the unpipelined
    processor
  •   = Clock cycle × Average CPI
  •   = 1 ns × ((40% + 20%) × 4 + 40% × 5)
  •   = 4.4 ns
  • Pipelined instruction time = 1 ns + 0.2 ns
    overhead = 1.2 ns
  • Speedup from pipelining
  •   = 4.4 ns / 1.2 ns ≈ 3.7