Title: Lecture 14: Course Review
1Lecture 14 Course Review
- Kai Bu
- kaibu_at_zju.edu.cn
- http://list.zju.edu.cn/kaibu/comparch2015
2THANK YOU
3- Email, LinkedIn, Twitter, Weibo... Don't hesitate to keep in touch :)
5Lectures 02-03
- Fundamentals of Computer Design
6Classes of Parallel Architectures
- by Michael Flynn
- according to the parallelism in the instruction and data streams called for by the instructions at the most constrained component of the multiprocessor
- SISD, SIMD, MISD, MIMD
7SISD
- Single instruction stream, single data stream: uniprocessor
- Can exploit instruction-level parallelism
8SIMD
- Single instruction stream, multiple data streams
- The same instruction is executed by multiple processors using different data streams.
- Exploits data-level parallelism
- Data memory for each processor, whereas a single instruction memory and control processor.
9MISD
- Multiple instruction streams, single data stream
- No commercial multiprocessor of this type yet
10MIMD
- Multiple instruction streams, multiple data streams
- Each processor fetches its own instructions and operates on its own data.
- Exploits task-level parallelism
11Instruction Set Architecture
- ISA
- actual programmer-visible instruction set
- the boundary between software and hardware
- 7 major dimensions
12ISA Class
- Most are general-purpose register architectures with operands of either registers or memory locations
- Two popular versions
- register-memory ISA, e.g., 80x86
- many instructions can access memory
- load-store ISA, e.g., ARM, MIPS
- only load or store instructions can access memory
13ISA Memory Addressing
- Byte addressing
- Aligned address
- object width: s bytes
- address: A
- aligned if A mod s = 0
14Each misaligned object requires two memory accesses
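The alignment rule on slide 13 can be sketched in a couple of lines of Python (the function name is mine, not from the slides):

```python
def is_aligned(address: int, width_bytes: int) -> bool:
    """An object of width s bytes at address A is aligned iff A mod s == 0."""
    return address % width_bytes == 0

# A 4-byte word at address 8 is aligned; at address 6 it straddles
# two aligned 4-byte chunks, so it needs two memory accesses.
print(is_aligned(8, 4))  # True
print(is_aligned(6, 4))  # False
```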
15ISA Addressing Modes
- Specify the address of a memory object
- Register, Immediate, Displacement
16Trends in Cost
- Cost of an Integrated Circuit
- wafer for test, chopped into dies for packaging
17Trends in Cost
- Cost of an Integrated Circuit
- die yield: percentage of manufactured devices that survives the testing procedure
20Trends in Cost
- Cost of an Integrated Circuit
- N: process-complexity factor for measuring manufacturing difficulty
21Dependability
- Two measures of dependability
- Module reliability
- Module availability
22Dependability
- Two measures of dependability
- Module reliability
- continuous service accomplishment from a reference initial instant
- MTTF: mean time to failure
- MTTR: mean time to repair
- MTBF: mean time between failures
- MTBF = MTTF + MTTR
23Dependability
- Two measures of dependability
- Module reliability
- FIT: failures in time
- failures per billion hours
- MTTF of 1,000,000 hours
- = 10^9/10^6
- = 1000 FIT
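The MTTF-to-FIT conversion above is a one-liner; a small sketch (helper name is mine):

```python
def fit_from_mttf(mttf_hours: float) -> float:
    """FIT = failures per billion (10^9) device-hours = 10^9 / MTTF."""
    return 1e9 / mttf_hours

# An MTTF of 1,000,000 hours is 10^9 / 10^6 = 1000 FIT.
print(fit_from_mttf(1_000_000))  # 1000.0
```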
24Dependability
- Two measures of dependability
- Module availability
25Measuring Performance
- Execution time
- the time between the start and the completion of an event
- Throughput
- the total amount of work done in a given time
26Measuring Performance
- Computer X and Computer Y
- X is n times faster than Y
27Quantitative Principles
- Parallelism
- Locality
- temporal locality: recently accessed items are likely to be accessed in the near future
- spatial locality: items whose addresses are near one another tend to be referenced close together in time
28Quantitative Principles
29Quantitative Principles
- Amdahl's Law: two factors
- 1. Fraction_enhanced
- e.g., 20/60 if 20 seconds out of a 60-second program can be enhanced
- 2. Speedup_enhanced
- e.g., 5/2 if enhanced to 2 seconds while originally 5 seconds
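Amdahl's Law combines the two factors; a minimal sketch using the slide's numbers (function name is mine):

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: Speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 20 of 60 seconds can be enhanced (F = 1/3) and the enhanced part
# runs 5/2 = 2.5x faster (5 s down to 2 s):
print(overall_speedup(20 / 60, 5 / 2))  # 1.25
```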
31Quantitative Principles
- The Processor Performance Equation
34Lecture 04
- Instruction Set Principles
35ISA Classification
- Classification Basis
- the type of internal storage
- stack
- accumulator
- register
- ISA Classes
- stack architecture
- accumulator architecture
- general-purpose register architecture (GPR)
36ISA Classes: Stack Architecture
- implicit operands
- on the Top Of the Stack
- C = A + B
- Push A
- Push B
- Add
- Pop C
- First operand removed from stack
- Second operand replaced by the result
37ISA Classes: Accumulator Architecture
- one implicit operand: the accumulator
- one explicit operand: mem location
- C = A + B
- Load A
- Add B
- Store C
- accumulator is both
- an implicit input operand
- and a result
38ISA Classes: General-Purpose Register Arch
- Only explicit operands
- registers
- memory locations
- Operand access
- direct memory access
- loaded into temporary storage first
39ISA Classes: General-Purpose Register Arch
- Two Classes
- register-memory architecture
- any instruction can access memory
- load-store architecture
- only load and store instructions can access memory
40ISA Classes: General-Purpose Register Arch
- Two Classes
- register-memory architecture
- any instruction can access mem
- C = A + B
- Load R1, A
- Add R3, R1, B
- Store R3, C
41ISA Classes: General-Purpose Register Arch
- Two Classes
- load-store architecture
- only load and store instructions can access memory
- C = A + B
- Load R1, A
- Load R2, B
- Add R3, R1, R2
- Store R3, C
42GPR Classification
- ALU instruction has 2 or 3 operands?
- 2: 1 operand is both result and source + 1 source operand
- 3: 1 result operand + 2 source operands
- ALU instruction has 0, 1, 2, or 3 memory-address operands?
43Addressing Modes
- How instructions specify addresses of objects to access
- Types
- constant
- register
- memory location (effective address)
44Frequently used addressing modes (figure)
46Lectures 05-07
47Pipelining
- start executing one instruction before completing the previous one
48Pipelined Laundry: 3.5 Hours
- Observations
- No speedup for individual task
- e.g., A still takes 30+40+20 = 90 min
- But speedup for average task execution time
- e.g., 3.5×60/4 = 52.5 min < 30+40+20 = 90 min
(figure: tasks A-D pipelined over time; stage times 30, 40, 20 min)
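The laundry arithmetic can be reproduced with the usual pipeline timing model (fill the pipeline, then finish one task per slowest stage):

```python
stages = [30, 40, 20]   # wash, dry, fold times in minutes
n_tasks = 4             # loads A-D

unpipelined = n_tasks * sum(stages)                    # 4 x 90 = 360 min
pipelined = sum(stages) + (n_tasks - 1) * max(stages)  # 90 + 3 x 40 = 210 min = 3.5 h

print(unpipelined / n_tasks)  # 90.0 min per task: no speedup for an individual task
print(pipelined / n_tasks)    # 52.5 min per task on average
```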
49MIPS Instruction
- at most 5 clock cycles per instruction
- IF ID EX MEM WB
50MIPS Instruction
IF ID EX MEM WB
IR ← Mem[PC]; NPC ← PC + 4
51MIPS Instruction
IF ID EX MEM WB
A ← Regs[rs]; B ← Regs[rt]; Imm ← sign-extended immediate field of IR (lower 16 bits)
52MIPS Instruction
IF ID EX MEM WB
ALUOutput ← A + Imm; ALUOutput ← A func B; ALUOutput ← A op Imm; ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0)
53MIPS Instruction
IF ID EX MEM WB
LMD ← Mem[ALUOutput]; Mem[ALUOutput] ← B; if (Cond) PC ← ALUOutput
54MIPS Instruction
IF ID EX MEM WB
Regs[rd] ← ALUOutput; Regs[rt] ← ALUOutput; Regs[rt] ← LMD
55MIPS Instruction Demo
- Prof. Gurpur Prabhu, Iowa State Univ: http://www.cs.iastate.edu/prabhu/Tutorial/PIPELINE/DLXimplem.html
- Load, Store
- Register-register ALU
- Register-immediate ALU
- Branch
56-61Load (pipeline animation frames)
62-67Store (pipeline animation frames)
68-73Register-Register ALU (pipeline animation frames)
74-79Register-Immediate ALU (pipeline animation frames)
80-85Branch (pipeline animation frames)
86Structural Hazard
- Example
- 1 mem port → mem conflict
- data access vs instr fetch
(figure: Load's MEM stage conflicts with the IF of instruction i+3)
87Structural Hazard
88Data Hazard
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
(R1 written by DADD is read by the following instructions)
No hazard for OR: registers are written in the 1st half cycle and read in the 2nd half cycle
89Data Hazard
- Solution: forwarding
- directly feed back EX/MEM and MEM/WB pipeline registers' results to the ALU inputs
- if forwarding hardware detects that a previous ALU op has written the register corresponding to a source for the current ALU op,
- control logic selects the forwarded result as the ALU input.
90-92Data Hazard: Forwarding
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
(figures: R1 forwarded from the EX/MEM and then MEM/WB pipeline registers to the ALU inputs of the following instructions)
93Data Hazard: Forwarding
- Generalized forwarding
- pass a result directly to the functional unit that requires it
- forward results not only to ALU inputs but also to other types of functional units
94Data Hazard: Forwarding
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
(figure: R1 forwarded to LD and SD; R4 forwarded from LD to SD)
95Data Hazard
- Sometimes a stall is necessary
LD R1, 0(R2)
DSUB R4, R1, R5
(figure: R1 is only available in MEM/WB; forwarding cannot go backward in time, so the pipeline has to stall.)
96Branch Hazard
- Redo IF: essentially a stall
- If the branch is untaken, the stall is unnecessary.
97Branch Hazard Solutions
- 4 simple compile-time schemes: #1
- Freeze or flush the pipeline
- hold or delete any instructions after the branch till the branch destination is known
- i.e., redo IF without the first IF
98Branch Hazard Solutions
- 4 simple compile-time schemes: #2
- Predicted-untaken
- simply treat every branch as untaken
- when the branch is untaken, pipelining proceeds as if there were no hazard.
99Branch Hazard Solutions
- 4 simple compile-time schemes: #2
- Predicted-untaken
- but if the branch is taken
- turn the fetched instr into a no-op (idle)
- restart the IF at the branch target addr
100Branch Hazard Solutions
- 4 simple compile-time schemes: #3
- Predicted-taken
- simply treat every branch as taken
- does not apply to the five-stage pipeline
- applies to scenarios where the branch target address is known before the branch outcome.
101Branch Hazard Solutions
- 4 simple compile-time schemes: #4
- Delayed branch
- delay the branch execution until after the next instruction
- pipelining sequence
- branch instruction
- sequential successor
- branch target if taken
- Branch delay slot: the next instruction
102Branch Hazard Solutions
104Lectures 08-10
105Memory Hierarchy
106Cache Performance
107Cache Performance
- Memory stall cycles
- the number of cycles during which the processor is stalled waiting for a mem access
- Miss rate
- number of misses over number of accesses
- Miss penalty
- the cost per miss (number of extra clock cycles to wait)
108Block Placement
109Block Identification
- Address = block address | block offset
- Block address = tag | index
- Index: selects the set
- Tag: checked against all blocks in the set
- Block offset: the address of the desired data within the block chosen by index + tag
- Fully associative caches have no index field
110Write Strategy
- Write-through
- info is written to both the block in the cache and the block in the lower-level memory
- Write-back
- info is written only to the block in the cache
- written to the main memory only when the modified cache block is replaced
111Write Strategy
- Options on a write miss
- Write allocate
- the block is allocated on a write miss
- No-write allocate
- the write miss does not affect the cache
- the block is modified only in the lower-level memory
- until the program tries to read the block
112Write Strategy
113Write Strategy
- No-write allocate: 4 misses, 1 hit
- cache not affected; address 100 not in the cache
- read 200: miss, block replaced; then write 200 hits
- Write allocate: 2 misses, 3 hits
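Slide 113's counts are consistent with a classic five-access write/read trace (assumed below, since slide 112's trace is not shown); a toy simulation, with helper names of my choosing, reproduces them:

```python
def simulate(trace, write_allocate):
    """Count (misses, hits) for a toy fully-associative cache
    under the two write-miss policies."""
    cached, misses, hits = set(), 0, 0
    for op, block in trace:
        if block in cached:
            hits += 1
        else:
            misses += 1
            # Reads always allocate; writes allocate only under write-allocate.
            if op == "read" or write_allocate:
                cached.add(block)
    return misses, hits

# Assumed trace: write 100, write 100, read 200, write 200, write 100.
trace = [("write", 100), ("write", 100), ("read", 200),
         ("write", 200), ("write", 100)]
print(simulate(trace, write_allocate=False))  # (4, 1)
print(simulate(trace, write_allocate=True))   # (2, 3)
```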
114Avg Mem Access Time
- Average memory access time
- = Hit time + Miss rate × Miss penalty
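As a quick sketch of the formula (the numbers here are illustrative, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = Hit time + Miss rate x Miss penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g., 1-cycle hit, 4% miss rate, 200-cycle miss penalty:
print(amat(1, 0.04, 200))  # 9.0 cycles
```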
115Opt 4: Multilevel Cache
- Two-level cache
- Add another level of cache between the original cache and memory
- L1: small enough to match the clock cycle time of the fast processor
- L2: large enough to capture many accesses that would go to main memory, lessening miss penalty
116Opt 4: Multilevel Cache
- Average memory access time
- = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
- = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
- Average mem stalls per instruction
- = Misses per instruction_L1 × Hit time_L2 + Misses per instr_L2 × Miss penalty_L2
117Opt 4: Multilevel Cache
- Local miss rate
- the number of misses in a cache divided by the total number of mem accesses to this cache
- Miss rate_L1, Miss rate_L2
- Global miss rate
- the number of misses in the cache divided by the number of mem accesses generated by the processor
- Miss rate_L1, Miss rate_L1 × Miss rate_L2
118- Answer
- 1. various miss rates?
- L1: local = global
- = 40/1000 = 4%
- L2
- local: 20/40 = 50%
- global: 20/1000 = 2%
119- Answer
- 2. avg mem access time?
- average memory access time
- = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
- = 1 + 4% × (10 + 50% × 200)
- = 5.4
120- Answer
- 3. avg stall cycles per instruction?
- average stall cycles per instruction
- = Misses per instruction_L1 × Hit time_L2 + Misses per instr_L2 × Miss penalty_L2
- = (1.5×40/1000)×10 + (1.5×20/1000)×200
- = 6.6
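The three answers can be cross-checked in a few lines (1000 memory accesses with 40 L1 misses and 20 L2 misses, 1.5 memory accesses per instruction, as in the slides):

```python
l1_miss = 40 / 1000      # local = global L1 miss rate: 4%
l2_local = 20 / 40       # local L2 miss rate: 50%
l2_global = 20 / 1000    # global L2 miss rate: 2%
hit_l1, hit_l2, penalty_l2 = 1, 10, 200   # clock cycles
accesses_per_instr = 1.5

amat = hit_l1 + l1_miss * (hit_l2 + l2_local * penalty_l2)
stalls = (accesses_per_instr * l1_miss * hit_l2
          + accesses_per_instr * l2_global * penalty_l2)
print(round(amat, 1))    # 5.4
print(round(stalls, 1))  # 6.6
```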
121Virtual Memory
122Virtual Memory
- Program uses
- discontiguous memory locations
- Use secondary/non-memory storage
123Virtual Memory
- Program thinks
- contiguous memory locations
- larger physical memory
124Virtual Memory
- relocation
- allows the same program to run in any location in physical memory
125Virtual Memory
- Paged virtual memory
- page fixed-size block
- Segmented virtual memory
- segment variable-size block
126Virtual Memory
- Paged virtual memory
- address = page address | page offset
- Segmented virtual memory
- address = segment address | seg offset
127Address Translation
Steps 1-2: send the virtual address to all tags; check the type of mem access against protection info in the TLB
128Address Translation
Step 3: the matching tag sends the physical address through the multiplexor
129Address Translation
Step 4: concatenate the page offset to the physical page frame to form the final physical address
130Virtual Memory Caches
132Lectures 11-12
133Disk
- http://cf.ydcdn.net/1.0.1.19/images/computer/MAGDISK.GIF
134Disk
- http://www.cs.uic.edu/jbell/CourseNotes/OperatingSystems/images/Chapter10/10_01_DiskMechanism.jpg
135Disk Capacity
- Areal Density
- bits/inch²
- = (tracks/inch) × (bits-per-track/inch)
136Disk Arrays
- Disk arrays with redundant disks to tolerate faults
- If a single disk fails, the lost information is reconstructed from redundant information
- Striping: simply spreading data over multiple disks
- RAID: redundant array of inexpensive/independent disks
137RAID
138RAID 0
- JBOD: just a bunch of disks
- No redundancy
- No failure tolerated
- Measuring stick for other RAID levels in terms of cost, performance, and dependability
139RAID 1
- Mirroring or Shadowing
- Two copies of every piece of data
- one logical write = two physical writes
- 100% capacity/space overhead
- http://www.petemarovichimages.com/wp-content/uploads/2013/11/RAID1.jpg
140- https://www.icc-usa.com/content/raid-calculator/raid-0-1.png
141RAID 2
- http://www.acnc.com/raidedu/2
- Each bit of a data word is written to a data disk drive
- Each data word has its (Hamming Code) ECC word recorded on the ECC disks
- On read, the ECC code verifies correct data or corrects single-disk errors
142RAID 3
- http://www.acnc.com/raidedu/3
- Data striped over all data disks
- Parity of a stripe goes to the parity disk
- Requires at least 3 disks to implement
143RAID 3
- Even Parity
- the parity bit makes the # of 1s even
- p = sum(data bits) mod 2
- Recovery
- if a disk fails,
- subtract good data from good blocks
- what remains is the missing data
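The even-parity recovery step is just XOR; a small sketch with made-up 4-bit blocks:

```python
from functools import reduce

def xor_parity(blocks):
    """Even parity: XOR of all blocks, so each bit position has an even # of 1s."""
    return reduce(lambda a, b: a ^ b, blocks)

data = [0b1011, 0b0110, 0b1100]   # three data disks (illustrative values)
p = xor_parity(data)              # written to the parity disk

# Disk 1 fails: XOR the surviving data with the parity to rebuild the lost block.
rebuilt = xor_parity([data[0], data[2], p])
print(rebuilt == data[1])  # True
```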
144RAID 4
- http://www.acnc.com/raidedu/4
- Favors small accesses
- Allows each disk to perform independent reads, using each sector's own error checking
145RAID 5
- http://www.acnc.com/raidedu/5
- Distributes the parity info across all disks in the array
- Removes the bottleneck of a single parity disk found in RAID 3 and RAID 4
146RAID 6: Row-Diagonal Parity
- RAID-DP
- Recovers from two failures
- xor
- row parity: XOR of the data blocks in a row
- diagonal parity: XOR of the blocks along a diagonal
147-155Double-Failure Recovery (step-by-step animation frames)
156RAID Further Readings
- RAID Types Classifications, BytePile.com
- https://www.icc-usa.com/content/raid-calculator/raid-0-1.png
- RAID, JetStor
- http://www.acnc.com/raidedu/0
157Little's Law
- Assumptions
- multiple independent I/O requests in equilibrium
- input rate = output rate
- a steady supply of tasks independent of how long they wait for service
158Little's Law
- Mean number of tasks in system
- = Arrival rate × Mean response time
159Little's Law
- Mean number of tasks in system
- = Arrival rate × Mean response time
- applies to any system in equilibrium
- nothing inside the black box creating new tasks or destroying them
160Little's Law
- Observe a system for Time_observe minutes
- Sum the times for each task to be serviced: Time_accumulated
- Number_task: tasks completed during Time_observe
- Time_accumulated ≥ Time_observe because tasks can overlap in time
161Little's Law
162Single-Server Model
- Queue / Waiting line
- the area where the tasks accumulate, waiting to be serviced
- Server
- the device performing the requested service
163Single-Server Model
- Time_server
- average time to service a task
- average service rate = 1/Time_server
- Time_queue
- average time per task in the queue
- Time_system
- average time per task in the system, or the response time
- the sum of Time_queue and Time_server
164Single-Server Model
- Arrival rate
- average number of arriving tasks per second
- Length_server
- average number of tasks in service
- Length_queue
- average length of queue
- Length_system
- average number of tasks in system
- the sum of Length_server and Length_queue
165Server Utilization / Traffic Intensity
- Server utilization
- the mean number of tasks being serviced: arrival rate divided by the service rate
- Service rate = 1/Time_server
- Server utilization = Arrival rate × Time_server
- (Little's law again)
166Server Utilization
- Example
- an I/O system with a single disk gets on average 50 I/O requests per sec
- 10 ms on avg to service an I/O request
- server utilization
- = arrival rate × Time_server
- = 50 × 0.01 = 0.5 = 1/2
- Could handle 100 tasks/sec, but only 50 arrive
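The example maps directly onto Little's law:

```python
arrival_rate = 50      # I/O requests per second
time_server = 0.01     # 10 ms average service time

service_rate = 1 / time_server            # tasks/sec the disk could handle
utilization = arrival_rate * time_server  # Little's law: 50 x 0.01
print(service_rate)  # 100.0
print(utilization)   # 0.5
```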
167Queue Discipline
- How the queue delivers tasks to the server
- FIFO: first in, first out
- Time_queue
- = Length_queue × Time_server + Mean time to complete service of the task already in the server when a new task arrives (if the server is busy)
168Queue
- with exponential/Poisson distribution of events/requests
169Length_queue
- Example
- an I/O system with a single disk gets on average 50 I/O requests per sec
- 10 ms on avg to service an I/O request
- Length_queue
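Assuming the M/M/1 model from slide 168 (Poisson arrivals, exponential service), the mean queue length is Server utilization² / (1 − Server utilization), which for this example gives:

```python
utilization = 50 * 0.01   # from the same example: 0.5
# M/M/1 mean queue length (excluding the task in service):
length_queue = utilization ** 2 / (1 - utilization)
print(length_queue)  # 0.5 tasks waiting on average
```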
171Lecture 13
172Centralized Shared-Memory
- eight or fewer cores
173Centralized Shared-Memory
- Share a single centralized memory; all processors have equal access to it
174Centralized Shared-Memory
- All processors have uniform latency from memory
- Uniform memory access (UMA) multiprocessors
175Distributed Shared Memory
- more processors
- physically distributed memory
176Distributed Shared Memory
- Distributing mem among the nodes increases bandwidth and reduces local-mem latency
177Distributed Shared Memory
- NUMA: nonuniform memory access; access time depends on the data word's location in mem
178Distributed Shared Memory
- Disadvantages: more complex inter-processor communication; more complex software to handle distributed mem
179Cache Coherence Problem
write-through cache
180Cache Coherence Problem
- Global state: defined by main memory
- Local state: defined by the individual caches
181Cache Coherence Problem
- A memory system is coherent if any read of a data item returns the most recently written value of that data item
- Two critical aspects
- coherence: defines what values can be returned by a read
- consistency: determines when a written value will be returned by a read
182Coherence Property
- A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P,
- always returns the value written by P.
- preserves program order
183Coherence Property
- A read by a processor to location X that follows a write by another processor to X returns the written value if the read and the write are sufficiently separated in time and no other writes to X occur between the two accesses.
184Consistency
- When a written value will be seen is important
- For example, if a write of X on one processor precedes a read of X on another processor by a very small time, it may be impossible to ensure that the read returns the value of the data written,
- since the written data may not even have left the processor at that point
185Cache Coherence Protocols
- Directory based
- the sharing status of a particular block of physical memory is kept in one location, called the directory
- Snooping
- every cache that has a copy of the data from a block of physical memory tracks the sharing status of the block
186Snooping Coherence Protocol
- Write invalidation protocol
- invalidates other copies on a write
- exclusive access ensures that no other readable or writable copies of an item exist when the write occurs
187Snooping Coherence Protocol
- Write invalidation protocol
- invalidates other copies on a write
write-back cache
188Snooping Coherence Protocol
- Write update/broadcast protocol
- updates all cached copies of a data item when that item is written
- consumes more bandwidth
189Write Invalidation Protocol
- To perform an invalidate, the processor simply acquires bus access and broadcasts the address to be invalidated on the bus
- All processors continuously snoop on the bus, watching the addresses
- The processors check whether the address on the bus is in their cache
- if so, the corresponding data in the cache is invalidated.
190Coherence Miss
- True sharing miss
- the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block
- another processor then reads a modified word in that cache block
- False sharing miss
191Coherence Miss
- True sharing miss
- False sharing miss
- a single valid bit per cache block
- occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into
192Coherence Miss
- Example
- assume words x1 and x2 are in the same cache block, which is in shared state in the caches of both P1 and P2.
- identify each miss as a true sharing miss, a false sharing miss, or a hit
193Coherence Miss
- Example
- 1. true sharing miss
- since x1 was read by P2 and needs to be invalidated from P2
194Coherence Miss
- Example
- 2. false sharing miss
- since x2 was invalidated by the write of x1 in P1,
- but that value of x1 is not used in P2
195Coherence Miss
- Example
- 3. false sharing miss
- since the block is in shared state, it needs to be invalidated for the write
- but P2 read x2 rather than x1
196Coherence Miss
- Example
- 4. false sharing miss
- need to invalidate the block
- P2 wrote x1 rather than x2
197Coherence Miss
- Example
- 5. true sharing miss
- since the value being read was written by P2
- (invalid -> shared)
198Lab/Experiment
- Refer also to the archlab website
199?
200- Exam: July 5
- one A4 sheet of handwritten notes
- Good Luck :)
201A Few More Words
202- Don't Settle
- Strive for Better
203- If you can dream it,
- you can accomplish it.
LinkedIn: www.youtube.com/watch?v=U6JxljIXzGw
204Reid Hoffman, http://t.cn/zTrc5bd, The 3 Secrets of Highly Successful Graduates