SuperScalar MIPS Processor - PowerPoint PPT Presentation

1 / 18
About This Presentation
Title:

SuperScalar MIPS Processor

Description:

Requires dual ported write buffer and data cache ... Write buffer no longer talks to arbiter, ... Combination of Translation buffer and history table in one SRAM ... – PowerPoint PPT presentation

Number of Views:93
Avg rating:3.0/5.0
Slides: 19
Provided by: manp9
Category:

less

Transcript and Presenter's Notes

Title: SuperScalar MIPS Processor


1
Super-Scalar MIPS Processor
  • With predicted branches/jumps and write back cache

2
Outline
  • Enhanced Super Scalar datapath
  • 256-bit Memory bus and write back cache
  • Branch/Jump Prediction
  • Performance
  • Testing

3
Super-Scalar Datapath
  • Duplicated original 5-stage pipeline
  • Dual ported instruction cache
  • Dual ported register file
  • 4 read ports and 2 write ports
  • 4 times the forwarding logic
  • 4 times the stall logic

4
Enhanced Super-Scalar
  • Allow memory operations in both pipelines
  • Requires dual ported write buffer and data cache
  • Also must protect against special memory hazards

5
Upgrades to memory system
  • Allow two writes to write buffer and two reads
    from data cache
  • Rest of hierarchy stays same
  • Data cache uses dual ported SRAMs
  • Write buffer made of registers
  • Easily extensible

6
New Memory Hazards
  • Lw/sw in parallel to same address
  • Stall sw via write buffer until lw completes
  • Cache miss is possible
  • Sw/lw in parallel to same address
  • No problem because sw goes to WB, lw checks there
    first
  • Sw/sw in parallel to same address
  • Sw in pipe A is ignored unless

7
Queuing writes in MMIO
  • If sw/sw in parallel to same hex-LED
  • Accept both into queue in MMIO module
  • Pipe As sw ahead of pipe Bs
  • Queue is emptied at rate of 1 store/cycle
  • Keep track of which is most recent write
  • In case we read from LEDs
  • Each LED has own queue

8
Write back cache
  • Product of complete rewrite
  • Expanded all memory buses to 8 words
  • Allows for burst reads and writes
  • No need for asynchronous FIFO
  • Burst writes works well w/ write back
  • Write buffer no longer talks to arbiter, but only
    to cache

9
Diagram of Write back cache
writeKick
readKick
writeMiss
readMiss
Idle
writeReady
readDone
writeHit
readRetry
writeKickBack
10
Branch/Jump prediction
  • Local prediction of target based on instructions
    address
  • Same hardware for branches and jumps
  • Cache all jumps (not just jr)
  • Predictor replaces signal coming from ID in old
    5-stage pipeline (single pipe)

11
Branch Target Table
  • 256 entries, 58 bits wide
  • Valid bit 57
  • Tag 5635
  • Target 343
  • Branch Confidence 21
  • Jump Confidence 0

12
Using Branch Target Table
  • Combination of Translation buffer and history
    table in one SRAM
  • Table is always written following prediction
  • Data determined by outputs of branch hardware in
    ID

13
Overall Prediction Logic
  • Look up PC in parallel with IF
  • If found then predict based on stored target and
    confidence bits
  • Else return PC8
  • On misprediction
  • Determined by comparing values from ID with
    values last predicted
  • Update table with new confidence bits/tag

14
Prediction Logic/State
  • Standard 2 bit branch prediction
  • 2 states predict taken, 2 not taken
  • 1 bit jump prediction
  • Always taken, so bit is only confidence
  • Used only for updating table

15
Branch in pipe B?
  • Branch hardware (prediction and actual) only
    works for pipe A
  • If branch decoded in pipe B, IF is flushed and
    branch refetched in pipe A
  • Handled by hazard/issue unit
  • Cheaper hardware, but takes more cycles compared
    to moving instructions via MUX

16
Branch/Jump performance
  • Pipe A, Predicted correctly no penalty
  • Pipe A, mispredicted 2 NOPs
  • Pipe B, Predicted correctly 3 NOPs
  • Pipe B, mispredicted 5 NOPs
  • Expensive, but result is less MUXes in critical
    path

17
Overall Performance
  • Current Status
  • Passed test0/Infinite Loop, test1/basic.v
  • Passed test3 with cycle count of X190
  • Hardware Statistics
  • Number of Block RAMS 67
  • Number of Slices 51
  • Timing Statistics
  • Critical Path 125ns
  • Clock Frequency 8MHz

18
Testing
  • Integrated Testing
  • Ability to run instructions from boot ROM
  • Gradually expand set of instructions as new
    features are added/supported
  • Eventually use proper boot ROM to load
    instructions To DRAM
  • Assembly test code
  • Basic programs to test functionality
  • Simple branch, arithmetic, and memory operations
  • Complex programs to test overall operation
  • Fibonacci, Radix Sort
  • Used to dig up more corner cases
Write a Comment
User Comments (0)
About PowerShow.com