Problem set 1.5

About This Presentation

Slides: 23

Provided by: richarde67

Learn more at: https://www.cs.umb.edu

Transcript and Presenter's Notes

1
Problem set 1.5

Inst    freq   cpi
Alu     43%    1
Load    21%    2
Store   12%    2
Brn     24%    2
Inst miss rate of 5%
Data miss rate of 10%
Miss penalty 40 cycles
What is the CPI ratio between the ideal machine and
the machine with cache?

Inst    freq   cpi   I-acc   D-acc
Alu     43%    1     1       0
Load    21%    2     1       1
Store   12%    2     1       1
Brn     24%    2     1       0
I-miss .05, D-miss .10, I-penalty 40, D-penalty 40
Cpi (ideal) = .43 × 1 + .21 × 2 + .12 × 2 + .24 × 2 = 1.6
(sum of freq × cycles)
Alu     43%    1 + .05 × 40 = 3
Load    21%    2 + .05 × 40 + .10 × 40 = 8
Store   12%    2 + .05 × 40 + .10 × 40 = 8
Br      24%    2 + .05 × 40 = 4
Cpi = .43 × 3 + .21 × 8 + .12 × 8 + .24 × 4 = 4.9
Ratio = 4.9/1.6 = 3.1
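
The CPI arithmetic above can be checked with a short script. This is a minimal sketch, assuming every instruction makes one instruction fetch and that only loads and stores make a data access; the names MIX, ideal_cpi and cache_cpi are illustrative and do not come from the slides.

```python
# Instruction mix from the problem: (frequency, base CPI, data accesses per instruction)
MIX = {
    "alu":    (0.43, 1, 0),
    "load":   (0.21, 2, 1),
    "store":  (0.12, 2, 1),
    "branch": (0.24, 2, 0),
}
INST_MISS_RATE = 0.05   # misses per instruction fetch
DATA_MISS_RATE = 0.10   # misses per data access
MISS_PENALTY = 40       # cycles per miss


def ideal_cpi(mix):
    """CPI with a perfect memory system: sum of freq * cycles."""
    return sum(freq * cpi for freq, cpi, _ in mix.values())


def cache_cpi(mix):
    """CPI including instruction-fetch and data-access miss stalls."""
    total = 0.0
    for freq, cpi, data_accesses in mix.values():
        stall = (INST_MISS_RATE + data_accesses * DATA_MISS_RATE) * MISS_PENALTY
        total += freq * (cpi + stall)
    return total


ideal = ideal_cpi(MIX)
real = cache_cpi(MIX)
print(f"ideal CPI = {ideal:.2f}, CPI with cache = {real:.2f}, ratio = {real / ideal:.2f}")
```

Without intermediate rounding the ratio comes out near 3.11; the slide rounds the two CPIs to 1.6 and 4.9 first and reports 3.1.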

3
Problem 1.15

Mflops = float operations/time = 109,970,178/94 sec = 1.17 mflops
Normalized flops: divide/sqrt count as 4
109,970,178 + 3 × (15,682,333) = 157,017,177
157,017,177/94 sec = 1.67 mflops
(1.67 - 1.17) / 1.17 = 43% better
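
The same numbers can be reproduced in a few lines. A small sketch, assuming the 15,682,333 figure is the count of divide/sqrt operations already included once in the native total, so each one adds 3 extra normalized operations; the variable names are illustrative.

```python
FLOP_COUNT = 109_970_178      # native floating-point operations
DIV_SQRT_COUNT = 15_682_333   # divides and square roots among them
SECONDS = 94                  # run time
DIV_SQRT_WEIGHT = 4           # each divide/sqrt counts as 4 normalized ops

native_mflops = FLOP_COUNT / SECONDS / 1e6

# Each divide/sqrt is already counted once, so add (weight - 1) extra ops.
normalized_ops = FLOP_COUNT + (DIV_SQRT_WEIGHT - 1) * DIV_SQRT_COUNT
normalized_mflops = normalized_ops / SECONDS / 1e6

improvement = (normalized_mflops - native_mflops) / native_mflops

print(f"native     = {native_mflops:.2f} mflops")
print(f"normalized = {normalized_mflops:.2f} mflops")
print(f"normalized rating is {improvement:.0%} better")
```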

4
Problem 2.1

Instructions are 16 bits + offset of 0, 8, or 16 bits
Alu 16 bit, load/store/brn multiple sizes
Loads 26%, stores 9% = 35%
Branch 19%
Alu 46%
Table of data offset freq, branch offset freq
L/s 17% 0 bits, 43% 1-8 bits, 40% 9-16 bits
Br 0% 0 bits, 98% 1-8 bits, 2% 9-16 bits

5
Average length

            load/store    branch        alu
percent0  = 17% × 35   +  0% × 19    +  1 × 46  = 52
Percent8  = 43% × 35   +  98% × 19              = 34
Percent16 = 40% × 35   +  2% × 19               = 14
Ave len = 52% × 16 + 34% × 24 + 14% × 32 = 21 bits
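
The weighted average can also be computed directly from the offset distributions. A minimal sketch, assuming a 16-bit base instruction plus a 0, 8, or 16 bit offset and that alu instructions never carry an offset; FREQ, OFFSET_DIST and average_length are illustrative names.

```python
BASE_BITS = 16

# Instruction class frequencies from the problem
FREQ = {"alu": 0.46, "load_store": 0.35, "branch": 0.19}

# For each class: offset size in bits -> fraction of that class
OFFSET_DIST = {
    "alu":        {0: 1.00, 8: 0.00, 16: 0.00},
    "load_store": {0: 0.17, 8: 0.43, 16: 0.40},
    "branch":     {0: 0.00, 8: 0.98, 16: 0.02},
}


def average_length(freq, dist):
    """Average instruction length in bits with variable-size offsets."""
    return sum(
        freq[cls] * share * (BASE_BITS + offset_bits)
        for cls in freq
        for offset_bits, share in dist[cls].items()
    )


avg_bits = average_length(FREQ, OFFSET_DIST)
print(f"average length = {avg_bits:.2f} bits")
```

The unrounded result is just under 21 bits, which agrees with the slide's rounded figure.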

6
Fixed 8 bit offset, additional inst for larger size

Percent that fit in 8
46 + (17% + 43%) × 35 + 98% × 19 = 86
Percent gt 8 = 14
Ave len = 86% × 24 + 14% × 48 = 27 bits
Part c: no offset for alu, offset 16 for all others
46% × 16 + 54% × 32 = 25 bits
all alu instructions are 16 bit
all other instructions are 32 bits
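
The two fixed-format variants can be checked the same way. A sketch under the slide's assumptions: with a fixed 8-bit offset every instruction is 24 bits, and an offset that does not fit costs one additional 24-bit instruction (48 bits total); in part c alu instructions are 16 bits and everything else is 32 bits. The variable names are illustrative.

```python
FREQ_ALU, FREQ_LS, FREQ_BR = 0.46, 0.35, 0.19

# Fraction of each class whose offset fits in 8 bits (0-bit and 1-8 bit cases)
LS_FITS_8 = 0.17 + 0.43
BR_FITS_8 = 0.00 + 0.98

# Fixed 8-bit offset: alu always fits, load/store and branch only sometimes
fits = FREQ_ALU + FREQ_LS * LS_FITS_8 + FREQ_BR * BR_FITS_8
len_fixed8 = fits * 24 + (1 - fits) * 48

# Part c: alu has no offset (16 bits), all other instructions carry 16 bits (32 bits)
len_part_c = FREQ_ALU * 16 + (1 - FREQ_ALU) * 32

print(f"fit in 8 bits  = {fits:.1%}")
print(f"fixed 8-bit    = {len_fixed8:.2f} bits")
print(f"part c         = {len_part_c:.2f} bits")
```

The slide rounds the fitting fraction to 86% before multiplying, so its intermediate value differs slightly from the unrounded one here; both round to 27 bits.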

7
Problem 2.10

Percent of data access = number of data acc / number of mem ref
% of data reads = number of data reads / number of data accesses
% mem reads = number of mem reads / number of mem access
Load: 1 data read, store: 1 data write
Instruction: 1 mem read

Load 26%, stores 9%
Mem acc / instruction = (1 + .26 + .09)
Data acc / instruction = (.26 + .09)
Data reads / instruction = .26
Mem reads / instruction = (1 + .26)
data acc/mem acc = (.26 + .09) / (1 + .26 + .09) = 26%
data read/data acc = .26 / (.26 + .09) = 74%
mem read/mem acc = (1 + .26) / (1 + .26 + .09) = 93%
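
These three ratios follow from per-instruction access counts. A minimal sketch, assuming one instruction fetch (a memory read) per instruction, one data read per load and one data write per store; the names are illustrative.

```python
LOAD_FREQ = 0.26
STORE_FREQ = 0.09

# Accesses per instruction
mem_accesses = 1 + LOAD_FREQ + STORE_FREQ   # fetch + data read + data write
data_accesses = LOAD_FREQ + STORE_FREQ
data_reads = LOAD_FREQ
mem_reads = 1 + LOAD_FREQ                   # fetch + data read

data_share = data_accesses / mem_accesses
data_read_share = data_reads / data_accesses
mem_read_share = mem_reads / mem_accesses

print(f"data acc / mem acc   = {data_share:.0%}")
print(f"data read / data acc = {data_read_share:.0%}")
print(f"mem read / mem acc   = {mem_read_share:.0%}")
```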

9
Problem 3.5 - underpipelined

Dlx 4 stage pipe: merge ex, mem; lengthen clock 50%
How much faster is conventional dlx?
Ratio = ave execution time of dlx4 / ave execution time of dlx5
      = clock5 × 1.5 × (1 + stalls per inst4) /
        (clock5 × (1 + stalls per inst5))

Stall cycles: alu none
              load/store
              branch
Dlx5 gcc data: branch stalls 4%, load stalls 5% (of instructions)
in dlx4, ex/mem merged so no load stalls
Branches are the same
Stalls per inst4 = .04
Stalls per inst5 = .04 + .05 = .09
Ratio = 1.5 × (1.04) / (1.09) = 1.43
Underpiped machine takes 1.43 times as long
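
The ratio can be written as a one-line function. A sketch, assuming both machines have an ideal CPI of 1 so that time per instruction is clock period times (1 + stalls per instruction); relative_time is an illustrative name.

```python
def relative_time(clock_scale, stalls_slow, stalls_fast):
    """Execution time of the modified machine relative to the baseline.

    clock_scale: modified clock period / baseline clock period
    stalls_*:    stall cycles per instruction on each machine
    """
    return clock_scale * (1 + stalls_slow) / (1 + stalls_fast)


STALLS_DLX4 = 0.04          # branch stalls only, no load stalls
STALLS_DLX5 = 0.04 + 0.05   # branch stalls plus load stalls

ratio = relative_time(1.5, STALLS_DLX4, STALLS_DLX5)
print(f"dlx4 takes {ratio:.2f} times as long as dlx5")
```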

11
Problem 3.9

Conditional branches 20% (60% taken)
Jumps and calls 5%
Pipeline is 4 deep.
How much faster would the machine be without branch
hazards?
speedup = pipeline depth / (1 + pipeline stalls)
Ideal = 4/1 = 4

12
Stalls

Jmp/call resolved in cycle 2
Clock  1   2   3   4   5   6
jmp    if  id  ex  wb
Cycle 2: fetch instruction after jmp
Cycle 3: fetch real next instruction
Stall of 1 cycle for 5% of the instructions

13
Cond br resolved in stage 3

Cbr    if  id  ex  wb
Fetch next inst cycle 2
Stall cycle 3 (if taken or not taken)
Stall cycle 4 (if taken)
2 cycle stall if taken: 20% × 60% = 12%
1 cycle stall if not taken: 20% × 40% = 8%
Stalls = 1 × 5% + 2 × 12% + 1 × 8% = .37
Speedup real = 4/(1.37) = 2.92
Ratio = 4/2.92 = 1.37 (37% slower because of
branch hazards)
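
The stall accounting can be sketched in code. This assumes the stall counts derived on the slides: 1 cycle for every jump or call, 2 cycles for a taken conditional branch and 1 cycle for a not-taken one. The names are illustrative.

```python
PIPELINE_DEPTH = 4

COND_FREQ = 0.20    # conditional branches, fraction of instructions
TAKEN = 0.60        # fraction of conditional branches that are taken
JUMP_FREQ = 0.05    # jumps and calls, fraction of instructions

# Stall cycles per instruction: frequency of each case times its stall length
stalls = (
    JUMP_FREQ * 1
    + COND_FREQ * TAKEN * 2
    + COND_FREQ * (1 - TAKEN) * 1
)

ideal_speedup = PIPELINE_DEPTH / 1
real_speedup = PIPELINE_DEPTH / (1 + stalls)
ratio = ideal_speedup / real_speedup

print(f"stalls per instruction = {stalls:.2f}")
print(f"real speedup           = {real_speedup:.2f}")
print(f"ideal / real           = {ratio:.2f}")
```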

14
Problem 4.7

Code sequence where scoreboard stalls but
tomasulo does not

15
FIGURE 4.3 The basic structure of a DLX
processor with a scoreboard.
[Figure labels: data buses, registers, FP mult (two units), FP divide, FP add, integer unit, scoreboard, control/status]
16
FIGURE 4.8 The basic structure of a DLX FP unit
using Tomasulo's algorithm.
[Figure labels: from instruction unit, floating-point operation queue, from memory, FP registers, load buffers (6), store buffers (3), to memory, operand buses, operation bus, reservation stations, FP adders, FP multipliers, common data bus (CDB)]
17
Problem 4.7

Two instructions read source in same cycle,
two instructions use same group of functional
units
Fadd f0, f2, f4
Fmul f6, f0, f10
Fmul f8, f0, f10
Issue, read, execute, write
Cannot tell which mult was first

18
Problem 4.8

Tomasulo stalls, scoreboard does not
Only one result per cycle
Two instructions write a result in same cycle
Different groups of functional units
Issue, execute, write
Fmul f6, f2, f4
Nop
Nop
Ld f6, xxxx

19
4.8

Inst   1   2   3   4   5   6
Mult   is  ex  ex  ex  ex  wb
Nop        is  ex  wb
Nop            is  ex  wb
Ld                 is  ex  wb
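
The write conflict can be found mechanically. A minimal sketch, assuming the multiply issues in cycle 1 and executes for four cycles, each later instruction issues one cycle after the previous one and executes for one cycle, and the result is written in the cycle after execution ends. The latencies and the names INSTRUCTIONS, write_cycle and cdb_conflicts are illustrative assumptions, not values from the textbook.

```python
from collections import defaultdict

# (name, issue cycle, execute latency in cycles); latencies are assumed
INSTRUCTIONS = [
    ("mult", 1, 4),
    ("nop", 2, 1),
    ("nop", 3, 1),
    ("ld", 4, 1),
]


def write_cycle(issue, latency):
    """Cycle in which the result is written: issue, then execute, then write."""
    return issue + latency + 1


def cdb_conflicts(instructions):
    """Return {cycle: [names]} for cycles where more than one result is written."""
    by_cycle = defaultdict(list)
    for name, issue, latency in instructions:
        by_cycle[write_cycle(issue, latency)].append(name)
    return {cycle: names for cycle, names in by_cycle.items() if len(names) > 1}


conflicts = cdb_conflicts(INSTRUCTIONS)
print(conflicts)
```

Under these assumptions the multiply and the load both reach the write stage in cycle 6, which is the case where a single result bus allows only one of them to write.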

20
Problem 5.3
21
5.3

Cache size
Time goes up dramatically for array > 64K
Array access must miss, so cache is 64K
Block size: knee in upper curve at 16 bytes
Miss penalty
Difference between two curves = 650 ns
From program, all read misses
x[index] = x[index] + 1

22
Problem 5.4

95% of all accesses are in cache
Cache block 2 words
Processor references 10^9 words per second
25% of references are writes
Mem system can handle 10^9 words per second
Bus reads/writes 1 word at a time
30% of the blocks in the cache are dirty
What is the bandwidth required?
