Problem set 1.5 - PowerPoint PPT Presentation

Slides: 23
Provided by: richarde67
Learn more at: https://www.cs.umb.edu

1
Problem set 1.5
  • Inst    freq   cpi
  • Alu     43%    1
  • Load    21%    2
  • Store   12%    2
  • Brn     24%    2
  • Inst miss rate of 5%
  • Data miss rate of 10%
  • Miss penalty 40 cycles
  • What is the ratio between ideal machine and
    machine with cache?

2
  • Inst    freq   cpi   I-acc   d-acc   I-m   d-m   I-pen   d-pen
  • Alu     43%    1     1       0       .05   .10   40      40
  • Load    21%    2     1       1
  • Store   12%    2     1       1
  • Brn     24%    2     1       0
  • Cpi (ideal) = .43*1 + .21*2 + .12*2 + .24*2 = 1.6
  • With cache: sum of freq * cycles
  • Alu     43%    1 + .05*40 = 3
  • Load    21%    2 + .05*40 + .1*40 = 8
  • Store   12%    2 + .05*40 + .1*40 = 8
  • Br      24%    2 + .05*40 = 4
  • Cpi = .43*3 + .21*8 + .12*8 + .24*4 = 4.9
  • Ratio = 4.9/1.6 = 3.1
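
The arithmetic on this slide can be checked with a short Python sketch. The dictionary layout and function names are illustrative assumptions, not part of the problem set; the frequencies, CPIs, miss rates, and penalty are the slide's own. Unrounded, the ideal CPI comes out at 1.57, which the slide rounds to 1.6.

```python
# Instruction mix from the slide: (frequency, base CPI, data accesses per instruction).
MIX = {
    "alu":   (0.43, 1, 0),
    "load":  (0.21, 2, 1),
    "store": (0.12, 2, 1),
    "brn":   (0.24, 2, 0),
}
INST_MISS_RATE = 0.05   # every instruction makes one instruction fetch
DATA_MISS_RATE = 0.10
MISS_PENALTY = 40       # cycles


def cpi_ideal(mix):
    """CPI with a perfect memory system: sum of freq * base CPI."""
    return sum(freq * cpi for freq, cpi, _ in mix.values())


def cpi_with_cache(mix):
    """CPI including instruction-miss and data-miss stall cycles."""
    total = 0.0
    for freq, cpi, data_acc in mix.values():
        stall = (INST_MISS_RATE + data_acc * DATA_MISS_RATE) * MISS_PENALTY
        total += freq * (cpi + stall)
    return total


ideal = cpi_ideal(MIX)
real = cpi_with_cache(MIX)
ratio = real / ideal
print(f"ideal CPI {ideal:.2f}, CPI with cache {real:.2f}, ratio {ratio:.1f}")
```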

3
Problem 1.15
  • Mflops = float operations/time = 109,970,178/94 sec
  • = 1.17 mflops
  • Normalized flops: divide/sqrt count as 4
  • 109,970,178 + 3 * (15,682,333) = 157,017,177
  • 157,017,177/94 sec = 1.67 mflops
  • (1.67 - 1.17) / 1.17 = 43% better
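
A minimal Python check of the native versus normalized rating, using the counts from the slide. The variable names are assumptions for illustration, and 15,682,333 is taken to be the divide/sqrt count, each of which gains 3 extra operations when weighted as 4.

```python
FLOPS = 109_970_178      # floating-point operations executed
DIV_SQRT = 15_682_333    # divide/sqrt operations (assumed meaning of the slide's figure)
SECONDS = 94

# Native rating: every operation counts once.
native_mflops = FLOPS / SECONDS / 1e6

# Normalized rating: each divide/sqrt counts as 4, so add 3 extra per operation.
normalized_ops = FLOPS + 3 * DIV_SQRT
normalized_mflops = normalized_ops / SECONDS / 1e6

improvement = (normalized_mflops - native_mflops) / native_mflops
print(f"{native_mflops:.2f} vs {normalized_mflops:.2f} mflops, "
      f"{improvement:.0%} better")
```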

4
Problem 2.1
  • Instructions are 16 bits + offset of 0, 8, or 16 bits
  • Alu 16 bit, load/store/brn multiple sizes
  • Loads 26%, stores 9% = 35%
  • Branch 19%
  • Alu 46%
  • Table of data offset freq, branch offset freq
  • L/s: 17% are 0, 43% are 1-8 bits, 40% are 9-16
  • Br: 0% are 0, 98% are 1-8, 2% are 9-16

5
Average length
  •             load/store     branch        alu
  • Percent0  = 17% * 35%  +   0% * 19%  +   1 * 46%  = 52%
  • Percent8  = 43% * 35%  +  98% * 19%               = 34%
  • Percent16 = 40% * 35%  +   2% * 19%               = 14%
  • Ave len = 52% * 16 + 34% * 24 + 14% * 32 = 21 bits
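
The weighted average above can be reproduced in a few lines of Python. The table layout and names are illustrative assumptions; the instruction mix and offset distributions are the ones quoted on the Problem 2.1 slides.

```python
# Instruction mix and offset-size distribution from the slides.
FREQ = {"load_store": 0.35, "branch": 0.19, "alu": 0.46}
OFFSET_DIST = {
    "load_store": {0: 0.17, 8: 0.43, 16: 0.40},
    "branch":     {0: 0.00, 8: 0.98, 16: 0.02},
    "alu":        {0: 1.00, 8: 0.00, 16: 0.00},  # alu instructions carry no offset
}
BASE_BITS = 16

# Share of all instructions needing each offset size.
share = {
    bits: sum(FREQ[cls] * OFFSET_DIST[cls][bits] for cls in FREQ)
    for bits in (0, 8, 16)
}

# Average length: base 16 bits plus the offset, weighted by share.
avg_len = sum(s * (BASE_BITS + bits) for bits, s in share.items())
print({bits: f"{s:.0%}" for bits, s in share.items()}, f"{avg_len:.0f} bits")
```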

6
Fixed 8 bit offset, additional inst for larger
size
  • Percent that fit in 8
  • = 46% + (17% + 43%) * 35% + 98% * 19% = 86%
  • Percent gt 8 = 14%
  • Ave len = 86% * 24 + 14% * 48 = 27 bits
  • Part c: no offset (alu) or offset 16
  • 46% * 16 + 54% * 32 = 25 bits
  • all alu instructions are 16 bit
  • all other instructions are 32 bits
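
A short Python sketch of the two alternative encodings on this slide. Variable names are assumptions; the percentages are the slide's. An instruction whose offset does not fit in 8 bits is charged a second 24-bit instruction (48 bits in total), as the slide does.

```python
ALU, LOAD_STORE, BRANCH = 0.46, 0.35, 0.19

# Fixed 8-bit offset: alu, l/s with offsets of 0 or 1-8 bits,
# and branches with 1-8 bit offsets all fit in one 24-bit instruction.
fit8 = ALU + (0.17 + 0.43) * LOAD_STORE + 0.98 * BRANCH
over8 = 1 - fit8
avg_fixed8 = fit8 * 24 + over8 * 48   # larger offsets need an extra instruction

# Part c: alu instructions are 16 bits, everything else is 32 bits.
avg_part_c = ALU * 16 + (1 - ALU) * 32

print(f"fit in 8: {fit8:.0%}, fixed-8 avg {avg_fixed8:.0f} bits, "
      f"part c avg {avg_part_c:.0f} bits")
```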

7
Problem 2.10
  • % of data accesses = number of data acc / number of mem ref
  • % of data reads = number of data reads / number of data accesses
  • % of mem reads = number of mem reads / number of mem accesses
  • Load = 1 data read, store = 1 data write
  • Every instruction = 1 mem read (the fetch)

8
  • Load 26%, stores 9%
  • Mem acc = instructions * (1 + .26 + .09)
  • Data acc = instructions * (.26 + .09)
  • Data reads = instructions * .26
  • Mem reads = instructions * (1 + .26)
  • data acc/mem acc = (.26 + .09) / (1 + .26 + .09) = 26%
  • data read/data acc = .26 / (.26 + .09) = 74%
  • mem read/mem acc = (1 + .26) / (1 + .26 + .09) = 93%
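
The three ratios can be checked per instruction, since the instruction count cancels. This is a minimal sketch with assumed variable names; the 26% and 9% figures are from the slide.

```python
LOADS, STORES = 0.26, 0.09   # fraction of instructions

# Accesses per instruction: one instruction fetch plus any data access.
mem_acc = 1 + LOADS + STORES
data_acc = LOADS + STORES
data_reads = LOADS
mem_reads = 1 + LOADS        # fetch plus load data

data_share = data_acc / mem_acc
data_read_share = data_reads / data_acc
mem_read_share = mem_reads / mem_acc
print(f"{data_share:.0%} {data_read_share:.0%} {mem_read_share:.0%}")
```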

9
Problem 3.5 - underpipelined
  • Dlx 4 stage pipe: merge ex, mem; lengthen clock 50%
  • How much faster is conventional dlx?
  • Ratio = ave execution time of dlx4 /
  •   ave execution time of dlx5
  • = clock5 * 1.5 * (1 + stalls per inst4) /
  •   (clock5 * (1 + stalls per inst5))

10
  • Stall cycles: alu none
  • load/store
  • branch
  • Dlx5 gcc data, stalls per inst: 4% from branches, 5% from loads
  • In dlx4, ex/mem merged so no load stalls
  • Branches are the same
  • Stalls per inst4 = .04
  • Stalls per inst5 = .04 + .05 = .09
  • Ratio = 1.5 * (1.04) / (1.09) = 1.43
  • Underpiped machine takes 1.43 times as long
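
The ratio can be sketched in Python with the DLX5 clock normalized to 1. The helper name relative_time is hypothetical; the stall rates and the 50% longer clock are the slide's.

```python
def relative_time(clock_scale, stalls_per_inst):
    """Average time per instruction, in units of the 5-stage clock period."""
    return clock_scale * (1 + stalls_per_inst)


BRANCH_STALLS = 0.04   # stalls per instruction, same on both machines
LOAD_STALLS = 0.05     # stalls per instruction, 5-stage pipeline only

dlx5 = relative_time(1.0, BRANCH_STALLS + LOAD_STALLS)
dlx4 = relative_time(1.5, BRANCH_STALLS)   # merged ex/mem: no load stalls, 50% longer clock

ratio = dlx4 / dlx5
print(f"dlx4 takes {ratio:.2f} times as long as dlx5")
```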

11
Problem 3.9
  • Conditional branches 20% (60% taken)
  • Jumps and calls 5%
  • Pipeline is 4 deep.
  • How much faster would the machine be without branch
    hazards?
  • speedup = pipeline depth / (1 + pipeline
    stalls)
  • Ideal = 4/1 = 4

12
Stalls
  • Jmp/call resolved in cycle 2
  • Clock  1   2   3   4   5   6
  • jmp    if  id  ex  wb
  • Cycle 2: fetch instruction after jmp
  • Cycle 3: fetch real next instruction
  • Stall of 1 cycle for 5% of the instructions

13
Cond br resolved in stage 3
  • Cbr    if  id  ex  wb
  • Fetch next inst cycle 2
  • Stall cycle 3 (if taken or not taken)
  • Stall cycle 4 (if taken)
  • 2 cycle stall if taken: 20% * 60% = 12%
  • 1 cycle stall if not taken: 20% * 40% = 8%
  • Stalls = 1 * 5% + 2 * 12% + 1 * 8% = .37
  • Speedup real = 4 / (1.37) = 2.92
  • Ratio = 4 / 2.92 = 1.37 (37% slower because of
    branch hazards)
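
A minimal Python sketch of the stall accounting for Problem 3.9. Names are assumptions; the frequencies and stall lengths are those on the slides, and the speedup is taken as pipeline depth / (1 + stalls per instruction).

```python
PIPELINE_DEPTH = 4
JUMP_FREQ = 0.05      # jumps and calls: 1 stall cycle each
COND_FREQ = 0.20      # conditional branches
TAKEN = 0.60          # fraction of conditional branches taken

stalls = (
    1 * JUMP_FREQ                    # jmp/call resolved in cycle 2
    + 2 * COND_FREQ * TAKEN          # taken branch: 2 stall cycles
    + 1 * COND_FREQ * (1 - TAKEN)    # untaken branch: 1 stall cycle
)

speedup_ideal = PIPELINE_DEPTH / 1
speedup_real = PIPELINE_DEPTH / (1 + stalls)
ratio = speedup_ideal / speedup_real
print(f"stalls {stalls:.2f}, real speedup {speedup_real:.2f}, ratio {ratio:.2f}")
```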

14
Problem 4.7
  • Code sequence where scoreboard stalls but
    tomasulo does not

15
[Figure: registers and data buses feeding two FP multipliers, an FP divider, an FP adder, and the integer unit; the scoreboard connects to them through control/status lines]
FIGURE 4.3 The basic structure of a DLX processor with a scoreboard.

16
[Figure: floating-point operation queue fed from the instruction unit; FP registers; load buffers (1-6) from memory and store buffers (1-3) to memory; operand buses and operation bus; reservation stations in front of the FP adders (1-3) and FP multipliers (1-2); common data bus (CDB)]
FIGURE 4.8 The basic structure of a DLX FP unit using Tomasulo's algorithm.

17
Problem 4.7
  • Two instructions read source in same cycle,
  • two instructions use same group of functional
    units
  • Fadd f0, f2, f4
  • Fmul f6, f0, f10
  • Fmul f8, f0, f10
  • Issue, read, execute, write
  • Cannot tell which mult was first

18
Problem 4.8
  • Tomasulo stalls, scoreboard does not
  • Only one result per cycle
  • Two instructions write a result in same cycle
  • Different groups of functional units
  • Issue, execute, write
  • Fmul f6, f2,f4
  • Nop
  • Nop
  • Ld f6, xxxx

19
4.8
  • Cycle  1    2    3    4    5    6
  • Mult   is   ex   ex   ex   ex   wb
  • Nop         is   ex   wb
  • Nop              is   ex   wb
  • Ld                    is   ex   wb
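
The write collision in this timing can be illustrated with a small Python sketch. It assumes one instruction issues per cycle and uses execute latencies read off the slide (4 cycles for the multiply, 1 for the others); the function name and data layout are hypothetical. It only models the issue/execute/write stages, not Tomasulo's algorithm itself.

```python
from collections import defaultdict

# (name, execute cycles), in program order; latencies read off the slide's timing.
PROGRAM = [("mult", 4), ("nop", 1), ("nop", 1), ("ld", 1)]


def writeback_cycles(program):
    """Map each write cycle to the instructions that write in it.

    Assumes instruction i issues in cycle i (1-based), executes for its
    latency, and writes in the following cycle.
    """
    by_cycle = defaultdict(list)
    for index, (name, ex_cycles) in enumerate(program):
        issue = index + 1
        by_cycle[issue + ex_cycles + 1].append(name)
    return dict(by_cycle)


# Cycles in which more than one instruction wants to write a result.
conflicts = {
    cycle: names
    for cycle, names in writeback_cycles(PROGRAM).items()
    if len(names) > 1
}
print(conflicts)
```

With only one result allowed per cycle, one of the two instructions reported for the shared cycle has to wait.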

20
Problem 5.3
21
5.3
  • Cache size
  • Time goes up dramatically for array gt 64k
  • Array access must miss so cache is 64k
  • Block size: knee in upper curve at 16 bytes
  • Miss penalty
  • Difference between two curves = 650ns
  • From program, all misses are read misses
  • x[index] = x[index] + 1

22
Problem 5.4
  • 95% of all accesses are in cache
  • Cache block 2 words
  • Processor references 10^9 words per second
  • 25% of references are writes
  • Mem system can handle 10^9 words per second
  • Bus reads/writes 1 word at a time
  • 30% of the blocks in the cache are dirty
  • What is the bandwidth required?