Title: Lecture 14: Course Review
1Lecture 14 Course Review
- Kai Bu
- kaibu_at_zju.edu.cn
- http://list.zju.edu.cn/kaibu/comparch2015
2THANK YOU
3- Email, LinkedIn, Twitter, Weibo... Don't hesitate to keep in touch :)
5Lectures 02-03
- Fundamentals of Computer Design
6Classes of Parallel Architectures
- by Michael Flynn
- according to the parallelism in the instruction and data streams called for by the instructions at the most constrained component of the multiprocessor
- SISD, SIMD, MISD, MIMD
7SISD
- Single instruction stream, single data stream: uniprocessor
- Can exploit instruction-level parallelism
8SIMD
- Single instruction stream, multiple data streams
- The same instruction is executed by multiple processors using different data streams.
- Exploits data-level parallelism
- Data memory for each processor, whereas a single instruction memory and control processor.
9MISD
- Multiple instruction streams, single data stream
- No commercial multiprocessor of this type yet
10MIMD
- Multiple instruction streams, multiple data streams
- Each processor fetches its own instructions and operates on its own data.
- Exploits task-level parallelism
11Instruction Set Architecture
- ISA
- actual programmer-visible instruction set
- the boundary between software and hardware
- 7 major dimensions
12ISA Class
- Most are general-purpose register architectures with operands of either registers or memory locations
- Two popular versions
- register-memory ISA, e.g., 80x86
- many instructions can access memory
- load-store ISA, e.g., ARM, MIPS
- only load or store instructions can access memory
13ISA Memory Addressing
- Byte addressing
- Aligned address
- object width: s bytes
- address: A
- aligned if A mod s = 0
14Each misaligned object requires two memory accesses
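The alignment rule on slide 13 can be sketched in a couple of lines of Python (the function name is mine, not from the slides):

```python
def is_aligned(address: int, width_bytes: int) -> bool:
    """An object of width s bytes at address A is aligned iff A mod s == 0."""
    return address % width_bytes == 0

# A 4-byte word at address 8 is aligned; at address 6 it straddles
# two aligned 4-byte chunks, so it needs two memory accesses.
print(is_aligned(8, 4))  # True
print(is_aligned(6, 4))  # False
```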
15ISA Addressing Modes
- Specify the address of a memory object
- Register, Immediate, Displacement
16Trends in Cost
- Cost of an Integrated Circuit
- wafer for test, chopped into dies for packaging
17Trends in Cost
- Cost of an Integrated Circuit
- die yield: percentage of manufactured devices that survives the testing procedure
20Trends in Cost
- Cost of an Integrated Circuit
- N: process-complexity factor for measuring manufacturing difficulty
21Dependability
- Two measures of dependability
- Module reliability
- Module availability
22Dependability
- Two measures of dependability
- Module reliability
- continuous service accomplishment from a reference initial instant
- MTTF: mean time to failure
- MTTR: mean time to repair
- MTBF: mean time between failures
- MTBF = MTTF + MTTR
23Dependability
- Two measures of dependability
- Module reliability
- FIT: failures in time
- failures per billion hours
- MTTF of 1,000,000 hours
- = 10^9/10^6
- = 1000 FIT
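The MTTF-to-FIT conversion above is a one-liner; a small sketch (helper name is mine):

```python
def fit_from_mttf(mttf_hours: float) -> float:
    """FIT = failures per billion (10^9) device-hours = 10^9 / MTTF."""
    return 1e9 / mttf_hours

# An MTTF of 1,000,000 hours is 10^9 / 10^6 = 1000 FIT.
print(fit_from_mttf(1_000_000))  # 1000.0
```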
24Dependability
- Two measures of dependability
- Module availability
25Measuring Performance
- Execution time
- the time between the start and the completion of an event
- Throughput
- the total amount of work done in a given time
26Measuring Performance
- Computer X and Computer Y
- X is n times faster than Y
27Quantitative Principles
- Parallelism
- Locality
- temporal locality: recently accessed items are likely to be accessed in the near future
- spatial locality: items whose addresses are near one another tend to be referenced close together in time
28Quantitative Principles
29Quantitative Principles
- Amdahl's Law: two factors
- 1. Fraction_enhanced
- e.g., 20/60 if 20 seconds out of a 60-second program can be enhanced
- 2. Speedup_enhanced
- e.g., 5/2 if enhanced to 2 seconds while originally 5 seconds
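Amdahl's Law combines the two factors; a minimal sketch using the slide's numbers (function name is mine):

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: Speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 20 of 60 seconds can be enhanced (F = 1/3) and the enhanced part
# runs 5/2 = 2.5x faster (5 s down to 2 s):
print(overall_speedup(20 / 60, 5 / 2))  # 1.25
```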
31Quantitative Principles
- The Processor Performance Equation
34Lecture 04
- Instruction Set Principles
35ISA Classification
- Classification Basis
- the type of internal storage
- stack
- accumulator
- register
- ISA Classes
- stack architecture
- accumulator architecture
- general-purpose register architecture (GPR)
36ISA Classes: Stack Architecture
- implicit operands
- on the Top Of the Stack
- C = A + B
- Push A
- Push B
- Add
- Pop C
- First operand removed from stack
- Second operand replaced by the result
37ISA Classes: Accumulator Architecture
- one implicit operand: the accumulator
- one explicit operand: mem location
- C = A + B
- Load A
- Add B
- Store C
- accumulator is both
- an implicit input operand
- and a result
38ISA Classes: General-Purpose Register Arch
- Only explicit operands
- registers
- memory locations
- Operand access
- direct memory access
- loaded into temporary storage first
39ISA Classes: General-Purpose Register Arch
- Two Classes
- register-memory architecture
- any instruction can access memory
- load-store architecture
- only load and store instructions can access memory
40ISA Classes: General-Purpose Register Arch
- Two Classes
- register-memory architecture
- any instruction can access mem
- C = A + B
- Load R1, A
- Add R3, R1, B
- Store R3, C
41ISA Classes: General-Purpose Register Arch
- Two Classes
- load-store architecture
- only load and store instructions can access memory
- C = A + B
- Load R1, A
- Load R2, B
- Add R3, R1, R2
- Store R3, C
42GPR Classification
- ALU instruction has 2 or 3 operands?
- 2: 1 operand is both result and source + 1 source operand
- 3: 1 result operand + 2 source operands
- ALU instruction has 0, 1, 2, or 3 memory-address operands?
43Addressing Modes
- How instructions specify addresses of objects to access
- Types
- constant
- register
- memory location (effective address)
44Frequently used addressing modes (figure)
46Lectures 05-07
47Pipelining
- start executing one instruction before completing the previous one
48Pipelined Laundry: 3.5 Hours
- Observations
- No speedup for individual task
- e.g., A still takes 30+40+20 = 90 min
- But speedup for average task execution time
- e.g., 3.5×60/4 = 52.5 min < 30+40+20 = 90 min
(figure: tasks A-D pipelined over time; stage times 30, 40, 20 min)
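The laundry arithmetic can be reproduced with the usual pipeline timing model (fill the pipeline, then finish one task per slowest stage):

```python
stages = [30, 40, 20]   # wash, dry, fold times in minutes
n_tasks = 4             # loads A-D

unpipelined = n_tasks * sum(stages)                    # 4 x 90 = 360 min
pipelined = sum(stages) + (n_tasks - 1) * max(stages)  # 90 + 3 x 40 = 210 min = 3.5 h

print(unpipelined / n_tasks)  # 90.0 min per task: no speedup for an individual task
print(pipelined / n_tasks)    # 52.5 min per task on average
```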
49MIPS Instruction
- at most 5 clock cycles per instruction
- IF ID EX MEM WB
50MIPS Instruction
IF ID EX MEM WB
IR ← Mem[PC]; NPC ← PC + 4
51MIPS Instruction
IF ID EX MEM WB
A ← Regs[rs]; B ← Regs[rt]; Imm ← sign-extended immediate field of IR (lower 16 bits)
52MIPS Instruction
IF ID EX MEM WB
ALUOutput ← A + Imm; ALUOutput ← A func B; ALUOutput ← A op Imm; ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0)
53MIPS Instruction
IF ID EX MEM WB
LMD ← Mem[ALUOutput]; Mem[ALUOutput] ← B; if (Cond) PC ← ALUOutput
54MIPS Instruction
IF ID EX MEM WB
Regs[rd] ← ALUOutput; Regs[rt] ← ALUOutput; Regs[rt] ← LMD
55MIPS Instruction Demo
- Prof. Gurpur Prabhu, Iowa State Univ: http://www.cs.iastate.edu/prabhu/Tutorial/PIPELINE/DLXimplem.html
- Load, Store
- Register-register ALU
- Register-immediate ALU
- Branch
56-61Load (pipeline animation frames)
62-67Store (pipeline animation frames)
68-73Register-Register ALU (pipeline animation frames)
74-79Register-Immediate ALU (pipeline animation frames)
80-85Branch (pipeline animation frames)
86Structural Hazard
- Example
- 1 mem port → mem conflict
- data access vs instr fetch
(figure: Load's MEM stage conflicts with the IF of instruction i+3)
87Structural Hazard
88Data Hazard
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
(R1 written by DADD is read by the following instructions)
No hazard for OR: registers are written in the 1st half cycle and read in the 2nd half cycle
89Data Hazard
- Solution: forwarding
- directly feed back EX/MEM and MEM/WB pipeline registers' results to the ALU inputs
- if forwarding hardware detects that a previous ALU op has written the register corresponding to a source for the current ALU op,
- control logic selects the forwarded result as the ALU input.
90-92Data Hazard: Forwarding
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
(figures: R1 forwarded from the EX/MEM and then MEM/WB pipeline registers to the ALU inputs of the following instructions)
93Data Hazard: Forwarding
- Generalized forwarding
- pass a result directly to the functional unit that requires it
- forward results not only to ALU inputs but also to other types of functional units
94Data Hazard: Forwarding
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
(figure: R1 forwarded to LD and SD; R4 forwarded from LD to SD)
95Data Hazard
- Sometimes a stall is necessary
LD R1, 0(R2)
DSUB R4, R1, R5
(figure: R1 is only available in MEM/WB; forwarding cannot go backward in time, so the pipeline has to stall.)
96Branch Hazard
- Redo IF: essentially a stall
- If the branch is untaken, the stall is unnecessary.
97Branch Hazard Solutions
- 4 simple compile-time schemes: #1
- Freeze or flush the pipeline
- hold or delete any instructions after the branch till the branch destination is known
- i.e., redo IF without the first IF
98Branch Hazard Solutions
- 4 simple compile-time schemes: #2
- Predicted-untaken
- simply treat every branch as untaken
- when the branch is untaken, pipelining proceeds as if there were no hazard.
99Branch Hazard Solutions
- 4 simple compile-time schemes: #2
- Predicted-untaken
- but if the branch is taken
- turn the fetched instr into a no-op (idle)
- restart the IF at the branch target addr
100Branch Hazard Solutions
- 4 simple compile-time schemes: #3
- Predicted-taken
- simply treat every branch as taken
- does not apply to the five-stage pipeline
- applies to scenarios where the branch target address is known before the branch outcome.
101Branch Hazard Solutions
- 4 simple compile-time schemes: #4
- Delayed branch
- delay the branch execution until after the next instruction
- pipelining sequence
- branch instruction
- sequential successor
- branch target if taken
- Branch delay slot: the next instruction
102Branch Hazard Solutions
104Lectures 08-10
105Memory Hierarchy
106Cache Performance
107Cache Performance
- Memory stall cycles
- the number of cycles during which the processor is stalled waiting for a mem access
- Miss rate
- number of misses over number of accesses
- Miss penalty
- the cost per miss (number of extra clock cycles to wait)
108Block Placement
109Block Identification
- Address = block address | block offset
- Block address = tag | index
- Index: selects the set
- Tag: checked against all blocks in the set
- Block offset: the address of the desired data within the block chosen by index + tag
- Fully associative caches have no index field
110Write Strategy
- Write-through
- info is written to both the block in the cache and the block in the lower-level memory
- Write-back
- info is written only to the block in the cache
- written to the main memory only when the modified cache block is replaced
111Write Strategy
- Options on a write miss
- Write allocate
- the block is allocated on a write miss
- No-write allocate
- the write miss does not affect the cache
- the block is modified only in the lower-level memory
- until the program tries to read the block
112Write Strategy
113Write Strategy
- No-write allocate: 4 misses, 1 hit
- cache not affected; address 100 not in the cache
- read 200: miss, block replaced; then write 200 hits
- Write allocate: 2 misses, 3 hits
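Slide 113's counts are consistent with a classic five-access write/read trace (assumed below, since slide 112's trace is not shown); a toy simulation, with helper names of my choosing, reproduces them:

```python
def simulate(trace, write_allocate):
    """Count (misses, hits) for a toy fully-associative cache
    under the two write-miss policies."""
    cached, misses, hits = set(), 0, 0
    for op, block in trace:
        if block in cached:
            hits += 1
        else:
            misses += 1
            # Reads always allocate; writes allocate only under write-allocate.
            if op == "read" or write_allocate:
                cached.add(block)
    return misses, hits

# Assumed trace: write 100, write 100, read 200, write 200, write 100.
trace = [("write", 100), ("write", 100), ("read", 200),
         ("write", 200), ("write", 100)]
print(simulate(trace, write_allocate=False))  # (4, 1)
print(simulate(trace, write_allocate=True))   # (2, 3)
```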
114Avg Mem Access Time
- Average memory access time
- = Hit time + Miss rate × Miss penalty
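As a quick sketch of the formula (the numbers here are illustrative, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = Hit time + Miss rate x Miss penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g., 1-cycle hit, 4% miss rate, 200-cycle miss penalty:
print(amat(1, 0.04, 200))  # 9.0 cycles
```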
115Opt 4: Multilevel Cache
- Two-level cache
- Add another level of cache between the original cache and memory
- L1: small enough to match the clock cycle time of the fast processor
- L2: large enough to capture many accesses that would go to main memory, lessening miss penalty
116Opt 4: Multilevel Cache
- Average memory access time
- = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
- = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
- Average mem stalls per instruction
- = Misses per instruction_L1 × Hit time_L2 + Misses per instr_L2 × Miss penalty_L2
117Opt 4: Multilevel Cache
- Local miss rate
- the number of misses in a cache divided by the total number of mem accesses to this cache
- Miss rate_L1, Miss rate_L2
- Global miss rate
- the number of misses in the cache divided by the number of mem accesses generated by the processor
- Miss rate_L1, Miss rate_L1 × Miss rate_L2
118- Answer
- 1. various miss rates?
- L1: local = global
- = 40/1000 = 4%
- L2
- local: 20/40 = 50%
- global: 20/1000 = 2%
119- Answer
- 2. avg mem access time?
- average memory access time
- = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
- = 1 + 4% × (10 + 50% × 200)
- = 5.4
120- Answer
- 3. avg stall cycles per instruction?
- average stall cycles per instruction
- = Misses per instruction_L1 × Hit time_L2 + Misses per instr_L2 × Miss penalty_L2
- = (1.5×40/1000)×10 + (1.5×20/1000)×200
- = 6.6
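The three answers can be cross-checked in a few lines (1000 memory accesses with 40 L1 misses and 20 L2 misses, 1.5 memory accesses per instruction, as in the slides):

```python
l1_miss = 40 / 1000      # local = global L1 miss rate: 4%
l2_local = 20 / 40       # local L2 miss rate: 50%
l2_global = 20 / 1000    # global L2 miss rate: 2%
hit_l1, hit_l2, penalty_l2 = 1, 10, 200   # clock cycles
accesses_per_instr = 1.5

amat = hit_l1 + l1_miss * (hit_l2 + l2_local * penalty_l2)
stalls = (accesses_per_instr * l1_miss * hit_l2
          + accesses_per_instr * l2_global * penalty_l2)
print(round(amat, 1))    # 5.4
print(round(stalls, 1))  # 6.6
```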
121Virtual Memory
122Virtual Memory
- Program uses
- discontiguous memory locations
- Use secondary/non-memory storage
123Virtual Memory
- Program thinks
- contiguous memory locations
- larger physical memory
124Virtual Memory
- relocation
- allows the same program to run in any location in physical memory
125Virtual Memory
- Paged virtual memory
- page fixed-size block
- Segmented virtual memory
- segment variable-size block
126Virtual Memory
- Paged virtual memory
- address = page address | page offset
- Segmented virtual memory
- address = segment address | seg offset
127Address Translation
Steps 1-2: send the virtual address to all tags; check the type of mem access against protection info in the TLB
128Address Translation
Step 3: the matching tag sends the physical address through the multiplexor
129Address Translation
Step 4: concatenate the page offset to the physical page frame to form the final physical address
130Virtual Memory Caches
132Lectures 11-12
133Disk
- http://cf.ydcdn.net/1.0.1.19/images/computer/MAGDISK.GIF
134Disk
- http://www.cs.uic.edu/jbell/CourseNotes/OperatingSystems/images/Chapter10/10_01_DiskMechanism.jpg
135Disk Capacity
- Areal Density
- bits/inch²
- = (tracks/inch) × (bits-per-track/inch)
136Disk Arrays
- Disk arrays with redundant disks to tolerate faults
- If a single disk fails, the lost information is reconstructed from redundant information
- Striping: simply spreading data over multiple disks
- RAID: redundant array of inexpensive/independent disks
137RAID
138RAID 0
- JBOD: just a bunch of disks
- No redundancy
- No failure tolerated
- Measuring stick for other RAID levels in terms of cost, performance, and dependability
139RAID 1
- Mirroring or Shadowing
- Two copies of every piece of data
- one logical write = two physical writes
- 100% capacity/space overhead
- http://www.petemarovichimages.com/wp-content/uploads/2013/11/RAID1.jpg
140- https://www.icc-usa.com/content/raid-calculator/raid-0-1.png
141RAID 2
- http://www.acnc.com/raidedu/2
- Each bit of a data word is written to a data disk drive
- Each data word has its (Hamming Code) ECC word recorded on the ECC disks
- On read, the ECC code verifies correct data or corrects single-disk errors
142RAID 3
- http://www.acnc.com/raidedu/3
- Data striped over all data disks
- Parity of a stripe goes to the parity disk
- Requires at least 3 disks to implement
143RAID 3
- Even Parity
- the parity bit makes the # of 1s even
- p = sum(data bits) mod 2
- Recovery
- if a disk fails,
- subtract good data from good blocks
- what remains is the missing data
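The even-parity recovery step is just XOR; a small sketch with made-up 4-bit blocks:

```python
from functools import reduce

def xor_parity(blocks):
    """Even parity: XOR of all blocks, so each bit position has an even # of 1s."""
    return reduce(lambda a, b: a ^ b, blocks)

data = [0b1011, 0b0110, 0b1100]   # three data disks (illustrative values)
p = xor_parity(data)              # written to the parity disk

# Disk 1 fails: XOR the surviving data with the parity to rebuild the lost block.
rebuilt = xor_parity([data[0], data[2], p])
print(rebuilt == data[1])  # True
```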
144RAID 4
- http://www.acnc.com/raidedu/4
- Favors small accesses
- Allows each disk to perform independent reads, using each sector's own error checking
145RAID 5
- http://www.acnc.com/raidedu/5
- Distributes the parity info across all disks in the array
- Removes the bottleneck of a single parity disk found in RAID 3 and RAID 4
146RAID 6: Row-Diagonal Parity
- RAID-DP
- Recovers from two failures
- xor
- row parity: XOR of the data blocks in a row
- diagonal parity: XOR of the blocks along a diagonal
147-155Double-Failure Recovery (step-by-step animation frames)
156RAID Further Readings
- RAID Types Classifications, BytePile.com
- https://www.icc-usa.com/content/raid-calculator/raid-0-1.png
- RAID, JetStor
- http://www.acnc.com/raidedu/0
157Little's Law
- Assumptions
- multiple independent I/O requests in equilibrium
- input rate = output rate
- a steady supply of tasks independent of how long they wait for service
158Little's Law
- Mean number of tasks in system
- = Arrival rate × Mean response time
159Little's Law
- Mean number of tasks in system
- = Arrival rate × Mean response time
- applies to any system in equilibrium
- nothing inside the black box creating new tasks or destroying them
160Little's Law
- Observe a system for Time_observe minutes
- Sum the times for each task to be serviced: Time_accumulated
- Number_task: tasks completed during Time_observe
- Time_accumulated ≥ Time_observe because tasks can overlap in time
161Little's Law
162Single-Server Model
- Queue / Waiting line
- the area where the tasks accumulate, waiting to be serviced
- Server
- the device performing the requested service
163Single-Server Model
- Time_server
- average time to service a task
- average service rate = 1/Time_server
- Time_queue
- average time per task in the queue
- Time_system
- average time per task in the system, or the response time
- the sum of Time_queue and Time_server
164Single-Server Model
- Arrival rate
- average number of arriving tasks per second
- Length_server
- average number of tasks in service
- Length_queue
- average length of queue
- Length_system
- average number of tasks in system
- the sum of Length_server and Length_queue
165Server Utilization / Traffic Intensity
- Server utilization
- the mean number of tasks being serviced: arrival rate divided by the service rate
- Service rate = 1/Time_server
- Server utilization = Arrival rate × Time_server
- (Little's law again)
166Server Utilization
- Example
- an I/O system with a single disk gets on average 50 I/O requests per sec
- 10 ms on avg to service an I/O request
- server utilization
- = arrival rate × Time_server
- = 50 × 0.01 = 0.5 = 1/2
- Could handle 100 tasks/sec, but only 50 arrive
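The example maps directly onto Little's law:

```python
arrival_rate = 50      # I/O requests per second
time_server = 0.01     # 10 ms average service time

service_rate = 1 / time_server            # tasks/sec the disk could handle
utilization = arrival_rate * time_server  # Little's law: 50 x 0.01
print(service_rate)  # 100.0
print(utilization)   # 0.5
```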
167Queue Discipline
- How the queue delivers tasks to the server
- FIFO: first in, first out
- Time_queue
- = Length_queue × Time_server + Mean time to complete service of the task already in the server when a new task arrives (if the server is busy)
168Queue
- with exponential/Poisson distribution of events/requests
169Length_queue
- Example
- an I/O system with a single disk gets on average 50 I/O requests per sec
- 10 ms on avg to service an I/O request
- Length_queue
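Assuming the M/M/1 model from slide 168 (Poisson arrivals, exponential service), the mean queue length is Server utilization² / (1 − Server utilization), which for this example gives:

```python
utilization = 50 * 0.01   # from the same example: 0.5
# M/M/1 mean queue length (excluding the task in service):
length_queue = utilization ** 2 / (1 - utilization)
print(length_queue)  # 0.5 tasks waiting on average
```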
171Lecture 13
172Centralized Shared-Memory
- eight or fewer cores
173Centralized Shared-Memory
- Share a single centralized memory; all processors have equal access to it
174Centralized Shared-Memory
- All processors have uniform latency from memory
- Uniform memory access (UMA) multiprocessors
175Distributed Shared Memory
- more processors
- physically distributed memory
176Distributed Shared Memory
- Distributing mem among the nodes increases bandwidth and reduces local-mem latency
177Distributed Shared Memory
- NUMA: nonuniform memory access; access time depends on the data word's location in mem
178Distributed Shared Memory
- Disadvantages: more complex inter-processor communication; more complex software to handle distributed mem
179Cache Coherence Problem
write-through cache
180Cache Coherence Problem
- Global state: defined by main memory
- Local state: defined by the individual caches
181Cache Coherence Problem
- A memory system is coherent if any read of a data item returns the most recently written value of that data item
- Two critical aspects
- coherence: defines what values can be returned by a read
- consistency: determines when a written value will be returned by a read
182Coherence Property
- A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P,
- always returns the value written by P.
- preserves program order
183Coherence Property
- A read by a processor to location X that follows a write by another processor to X returns the written value if the read and the write are sufficiently separated in time and no other writes to X occur between the two accesses.
184Consistency
- When a written value will be seen is important
- For example, if a write of X on one processor precedes a read of X on another processor by a very small time, it may be impossible to ensure that the read returns the value of the data written,
- since the written data may not even have left the processor at that point
185Cache Coherence Protocols
- Directory based
- the sharing status of a particular block of physical memory is kept in one location, called the directory
- Snooping
- every cache that has a copy of the data from a block of physical memory tracks the sharing status of the block
186Snooping Coherence Protocol
- Write invalidation protocol
- invalidates other copies on a write
- exclusive access ensures that no other readable or writable copies of an item exist when the write occurs
187Snooping Coherence Protocol
- Write invalidation protocol
- invalidates other copies on a write
write-back cache
188Snooping Coherence Protocol
- Write update/broadcast protocol
- updates all cached copies of a data item when that item is written
- consumes more bandwidth
189Write Invalidation Protocol
- To perform an invalidate, the processor simply acquires bus access and broadcasts the address to be invalidated on the bus
- All processors continuously snoop on the bus, watching the addresses
- The processors check whether the address on the bus is in their cache
- if so, the corresponding data in the cache is invalidated.
190Coherence Miss
- True sharing miss
- the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block
- another processor then reads a modified word in that cache block
- False sharing miss
191Coherence Miss
- True sharing miss
- False sharing miss
- a single valid bit per cache block
- occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into
192Coherence Miss
- Example
- assume words x1 and x2 are in the same cache block, which is in shared state in the caches of both P1 and P2.
- identify each miss as a true sharing miss, a false sharing miss, or a hit
193Coherence Miss
- Example
- 1. true sharing miss
- since x1 was read by P2 and needs to be invalidated from P2
194Coherence Miss
- Example
- 2. false sharing miss
- since x2 was invalidated by the write of x1 in P1,
- but that value of x1 is not used in P2
195Coherence Miss
- Example
- 3. false sharing miss
- since the block is in shared state, it needs to be invalidated for the write
- but P2 read x2 rather than x1
196Coherence Miss
- Example
- 4. false sharing miss
- need to invalidate the block
- P2 wrote x1 rather than x2
197Coherence Miss
- Example
- 5. true sharing miss
- since the value being read was written by P2
- (invalid -> shared)
198Lab/Experiment
- Refer also to the archlab website
199?
200- Exam: July 5
- one A4 sheet of handwritten notes
- Good Luck :)
201A Few More Words
202- Don't Settle
- Strive for Better
203- If you can dream it,
- you can accomplish it.
LinkedIn: www.youtube.com/watch?v=U6JxljIXzGw
204Reid Hoffman, http://t.cn/zTrc5bd, The 3 Secrets of Highly Successful Graduates