Title: 204321 Computer Architecture
????????????????????????? (???????)
- ?????????? ??????????
- ??????????????????????????
- ??????????????????????
Improving Cache Performance
- Reduce miss rate
- Reduce miss penalty
- Reduce hit time
Misses
- Compulsory
- ???????????????????????????
- ?? miss ??????????????????????????????????????
- A.k.a. cold start misses or first reference misses.
- Capacity
- ????????? ???????????????????????????????????????? ??????????????????
- (Fully Associative, Size X Cache)
- Conflict
- ?????????????????????????????????????????????????? ???????????????????????????
- A.k.a. collision misses or interference misses.
- (N-way Associative, Size X Cache)
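The three categories can be made concrete by replaying one address trace against three models at once: an infinite cache (first references are compulsory misses), a fully associative LRU cache of the same size (further misses there are capacity misses), and the real direct-mapped cache (whatever misses remain are conflict misses). The sketch below is only an illustration under those assumptions; `classify_misses`, `MissCounts`, and the sizes are hypothetical names and values, not part of the slides.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_BLOCKS 4       /* cache capacity in blocks (illustrative value) */
#define MAX_BLOCK_ADDR 64  /* block addresses in the trace must be in [0, 64) */

typedef struct {
    int compulsory;
    int capacity;
    int conflict;
} MissCounts;

/* Classify every miss of a direct-mapped cache holding NUM_BLOCKS blocks. */
MissCounts classify_misses(const int *trace, size_t n) {
    bool seen[MAX_BLOCK_ADDR] = { false };  /* "infinite cache": ever referenced? */
    int dm[NUM_BLOCKS];                     /* direct-mapped contents */
    int fa[NUM_BLOCKS];                     /* fully associative contents */
    long fa_last_use[NUM_BLOCKS];           /* LRU timestamps for fa[] */
    MissCounts counts = { 0, 0, 0 };

    for (int i = 0; i < NUM_BLOCKS; i++) {
        dm[i] = -1;
        fa[i] = -1;
        fa_last_use[i] = -1;
    }

    for (size_t t = 0; t < n; t++) {
        int block = trace[t];

        /* Direct-mapped lookup and fill. */
        int set = block % NUM_BLOCKS;
        bool dm_hit = (dm[set] == block);
        dm[set] = block;

        /* Fully associative LRU lookup and fill. */
        bool fa_hit = false;
        int slot = 0;  /* hit position, or else the least recently used slot */
        for (int i = 0; i < NUM_BLOCKS; i++) {
            if (fa[i] == block) { fa_hit = true; slot = i; break; }
            if (fa_last_use[i] < fa_last_use[slot]) slot = i;
        }
        fa[slot] = block;
        fa_last_use[slot] = (long)t;

        if (!dm_hit) {
            if (!seen[block])  counts.compulsory++;  /* first reference */
            else if (!fa_hit)  counts.capacity++;    /* misses even when fully associative */
            else               counts.conflict++;    /* caused only by the mapping */
        }
        seen[block] = true;
    }
    return counts;
}
```

For the trace 0, 4, 0, 4, 1, 2, 3, 5, 0, blocks 0 and 4 fight over set 0 (two conflict misses), and the final access to block 0 misses even in the fully associative model (one capacity miss).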
3Cs Absolute Miss Rate (SPEC92)
[Figure: 3Cs absolute miss rate (SPEC92); region labels: Conflict, Compulsory]
2:1 Cache Rule
- Miss rate of a 1-way associative (direct-mapped) cache of size X is about equal to the miss rate of a 2-way associative cache of size X/2
[Figure: region label: Conflict]
3Cs Relative Miss Rate
[Figure: 3Cs relative miss rate; region label: Conflict]
Pop Quiz
- 3 Cs: Compulsory, Capacity, Conflict
- ????????? ????????????????? C ??????????
- Change Block Size
- Change Associativity
- Change Compiler
Reducing Misses
- Larger Block Size
- Higher Associativity
- Victim Cache
- Pseudo-Associativity
- HW Prefetching: Instruction, Data
- SW Prefetching: Data
- Compiler Optimizations
1. Larger Block Size
2. Higher Associativity
- 2:1 Cache Rule
- Miss rate of a DM cache of size N ≈ miss rate of a 2-way SA cache of size N/2
- ???????????
- ???????????????? Execution time ????????
- Pop quiz
- Clock cycle ????????????????
- Hit time for 2-way vs. 1-way
- external cache: +10%
- internal: +2%
Example: AMAT vs. Miss Rate
- AMAT = Average memory access time

Cache Size   Associativity
(KB)         1-way   2-way   4-way   8-way
1            2.33    2.15    2.07    2.01
2            1.98    1.86    1.76    1.68
4            1.72    1.67    1.61    1.53
8            1.46    1.48    1.47    1.43
16           1.29    1.32    1.32    1.32
32           1.20    1.24    1.25    1.27
64           1.14    1.20    1.21    1.23
128          1.10    1.17    1.18    1.20

- (Blue in the original slide marks entries where AMAT is not improved by more associativity)
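The table entries follow from the usual definition AMAT = hit time + miss rate x miss penalty. A minimal sketch of that formula follows; the function name `amat` and the numbers in the usage note are assumptions for illustration, not the parameters behind the table.

```c
/* AMAT = hit time + miss rate * miss penalty.
 * All times must use the same unit (for example, clock cycles). */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

With made-up numbers, a direct-mapped cache with a 1.00-cycle hit, 5% miss rate, and 20-cycle penalty gives `amat(1.00, 0.05, 20.0)`, about 2.0 cycles. A 2-way cache whose hit time stretches to 1.10 cycles while the miss rate falls to 4.5% gives `amat(1.10, 0.045, 20.0)`, again about 2.0 cycles: the lower miss rate is cancelled by the slower hit, which is the effect the marked table entries show.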
3. Victim Cache
- How can the fast hit time of a direct-mapped cache be combined with fewer conflict misses?
- Add a small associative cache that holds blocks discarded by conflicts in the direct-mapped cache
- A 4-entry victim cache removes 20% to 95% of conflicts in a 4 KB direct-mapped cache
- Used in Alpha, HP
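The behaviour can be sketched as a direct-mapped cache backed by a 4-entry fully associative buffer: a block evicted from the main cache moves into the buffer, and a later miss that finds its block there swaps the two instead of going to memory. This is a simplified model under stated assumptions (block addresses are non-negative integers, FIFO replacement in the victim buffer, hypothetical names `Cache`, `cache_init`, `cache_access`), not a description of the Alpha or HP designs.

```c
#define NUM_SETS 4        /* direct-mapped cache size in blocks (illustrative) */
#define VICTIM_ENTRIES 4  /* 4-entry victim cache, as on the slide */

typedef struct {
    int dm[NUM_SETS];            /* block held by each set, -1 = empty */
    int victim[VICTIM_ENTRIES];  /* recently evicted blocks, -1 = empty */
    int next_victim;             /* FIFO replacement pointer (an assumption) */
    int hits, victim_hits, misses;
} Cache;

void cache_init(Cache *c) {
    for (int i = 0; i < NUM_SETS; i++) c->dm[i] = -1;
    for (int i = 0; i < VICTIM_ENTRIES; i++) c->victim[i] = -1;
    c->next_victim = 0;
    c->hits = c->victim_hits = c->misses = 0;
}

void cache_access(Cache *c, int block) {
    int set = block % NUM_SETS;
    if (c->dm[set] == block) {  /* fast direct-mapped hit */
        c->hits++;
        return;
    }

    int evicted = c->dm[set];
    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (c->victim[i] == block) {
            /* Victim hit: swap the requested block with the evicted one. */
            c->victim[i] = evicted;
            c->dm[set] = block;
            c->victim_hits++;
            return;
        }
    }

    /* Miss in both: fetch from memory, keep the evicted block as a victim. */
    c->misses++;
    c->dm[set] = block;
    if (evicted != -1) {
        c->victim[c->next_victim] = evicted;
        c->next_victim = (c->next_victim + 1) % VICTIM_ENTRIES;
    }
}
```

For the ping-pong trace 0, 4, 0, 4 (both blocks map to set 0), a plain direct-mapped cache would miss four times; this model goes to memory twice and serves the other two accesses from the victim buffer.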
4. Pseudo-Associativity
- How can the fast hit time of a direct-mapped cache be combined with the lower conflict misses of a 2-way SA cache?
- Hit ??????????
- Miss ???????? pseudo cache
- ????????????? pseudo hit ????????? miss
- ??????????????pseudo miss ????????????????????
- ??????
- CPU pipeline ???????????????????????????????? 1 or 2 cycles
- ??????????? L2????? MIPS R10000, UltraSPARC
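A rough sketch of the lookup order: probe the normal direct-mapped location first (fast hit), and only on a miss probe a second location (slow "pseudo" hit). The choice of second location below, flipping the most significant index bit, is one common textbook scheme and an assumption here; the names `pseudo_lookup` and `LookupResult` are hypothetical, and the model omits the swap that real designs usually perform after a pseudo hit.

```c
#define NUM_SETS 8  /* illustrative; must be a power of two */

typedef enum { FAST_HIT, PSEUDO_HIT, MISS } LookupResult;

/* lines[i] holds the block address stored in line i, or -1 if empty. */
LookupResult pseudo_lookup(const int lines[NUM_SETS], int block) {
    int index = block % NUM_SETS;
    if (lines[index] == block)
        return FAST_HIT;                     /* same cost as direct-mapped */

    int alternate = index ^ (NUM_SETS / 2);  /* flip the top index bit */
    if (lines[alternate] == block)
        return PSEUDO_HIT;                   /* found, but one probe later */

    return MISS;                             /* go to the next level */
}
```

The two different latencies for `FAST_HIT` and `PSEUDO_HIT` are what make the scheme awkward for a CPU pipeline and more natural for an L2 cache.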
5. Hardware Prefetching
- Instruction Prefetching
- Alpha 21064 fetches 2 blocks on a miss
- The extra block is placed in a stream buffer
- On a miss, check the stream buffer
- Data Prefetching
- For a 4KB cache
- 1 stream buffer caught 25% of misses
- 4 streams caught 43%
- For 2 64KB, 4-way set associative caches
- 8 streams caught 50% to 70%
- Prefetching relies on extra memory bandwidth that can be used without penalty
6. Software Prefetching Data
- Data ????????
- Data Prefetch
- Load data into register (HP PA-RISC loads)
- Cache Prefetch
- Load into cache (MIPS IV, PowerPC, SPARC v. 9)
- Issuing prefetch instructions takes time
- Is the cost of issuing prefetches < the savings from reduced misses?
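As an illustration of a cache prefetch issued from software, the sketch below uses the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)` inside a simple summation loop. The function name `sum_with_prefetch` and the prefetch distance are assumptions chosen for the example; a sequential scan like this is often already handled well by hardware prefetchers, so the extra instructions may cost more than they save, which is exactly the trade-off the slide asks about.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* elements ahead; a tuning guess, not a recommendation */

long sum_with_prefetch(const long *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_DISTANCE < n) {
            /* Hint only: read access (0), low-to-moderate temporal locality (1). */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 1);
        }
#endif
        total += data[i];  /* the result never depends on the prefetch */
    }
    return total;
}
```

The bounds check keeps the address expression inside the array, and on compilers without the builtin the loop degrades to a plain sum.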
7. Compiler Optimizations
- Software reduced the misses of an 8KB direct-mapped cache with 4-byte blocks by 75%
- Instructions
- ?????????????????????? conflict
- Data
- Merging Arrays
- Loop Interchange
- Loop Fusion
- Blocking
Example: Merging Arrays
    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];
- Reduces conflicts between val and key; improves spatial locality
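Example: Loop Interchange
    /* Before: the inner loop strides through memory, one row per step */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After: the inner loop walks along one row */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];
- Sequential accesses instead of striding through memory every 100 words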
Example: Loop Fusion
    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After: a[i][j] and c[i][j] are reused while still in the cache */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
- 2 misses per access to a and c before fusion, 1 miss per access after
Example: Blocking
    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }
- Idea: compute on many B x B sub-matrices that fit in the cache
Example: Blocking
    /* After (x must be zeroed first, because partial sums are accumulated) */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B, N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B, N); k = k+1)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }
- B is called the Blocking Factor
- Capacity misses drop from $2N^3 + N^2$ to $2N^3/B + N^2$
- POP QUIZ: Conflict misses too?
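To check that blocking changes only the access order and not the result, the two loop nests can be wrapped in functions and compared. This is a self-contained sketch; the function names, the helper `min_int`, and the sizes N = 6 and B = 4 are assumptions chosen so that B does not divide N and the `min` bounds are exercised.

```c
#define N 6  /* matrix dimension (illustrative) */
#define B 4  /* blocking factor; deliberately not a divisor of N */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Unblocked version: x = y * z. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked version: works on B x B tiles of z and accumulates into x. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)          /* x holds partial sums, so clear it */
        for (int j = 0; j < N; j++)
            x[i][j] = 0;

    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + B, N); j++) {
                    double r = 0;
                    for (int k = kk; k < min_int(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

In practice B would be chosen so that the tiles being reused fit in the cache together; the value here is only meant to exercise the boundary handling.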
Reducing Misses
- ????????????? parameter ?????????????????????????? ????????????????
- Larger Block Size
- Higher Associativity
- Victim Cache
- Pseudo-Associativity
- HW Prefetching: Instruction, Data
- SW Prefetching: Data
- Compiler Optimizations
Improving Cache Performance
- Reduce miss rate
- Reduce miss penalty
- Reduce hit time
Reducing Miss Penalty
- Read priority over write on miss
- Subblock placement
- Early Restart and Critical Word First on miss
- Non-blocking Caches
- Second Level Cache
1. Read Priority over Write on Miss
- Write through with write buffers
- This creates RAW conflicts between buffered writes and main-memory reads on cache misses
- Simply waiting for the write buffer to empty increases the read miss penalty
- Instead, check the write buffer contents before the read; if there is no conflict, let the read proceed
- Write back?
- A read miss may replace a dirty block in the cache
- Normally: write the dirty block to memory, then do the read
- Instead: copy the dirty block to a write buffer, do the read, then do the write
- The CPU stalls less because it restarts as soon as the read completes
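The write-buffer check can be sketched as follows: a read miss first searches the pending writes, newest first, and only goes to memory when no buffered write targets the same address. This is a toy model under stated assumptions (word-addressed memory as a plain array, a fixed 4-entry buffer, hypothetical names `WriteBuffer`, `wb_push`, `wb_lookup`, `read_miss`); it shows how the RAW hazard is avoided without waiting for the buffer to drain.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4  /* write buffer depth (illustrative) */

typedef struct {
    uint32_t addr;
    uint32_t data;
} WbEntry;

typedef struct {
    WbEntry e[WB_ENTRIES];
    int count;  /* number of pending writes, oldest at index 0 */
} WriteBuffer;

/* Queue a write; returns false when the buffer is full (the CPU would stall). */
bool wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES) return false;
    wb->e[wb->count].addr = addr;
    wb->e[wb->count].data = data;
    wb->count++;
    return true;
}

/* Search pending writes, newest first, for a matching address. */
bool wb_lookup(const WriteBuffer *wb, uint32_t addr, uint32_t *data) {
    for (int i = wb->count - 1; i >= 0; i--) {
        if (wb->e[i].addr == addr) {
            *data = wb->e[i].data;
            return true;
        }
    }
    return false;
}

/* Service a read miss without draining the write buffer first. */
uint32_t read_miss(const WriteBuffer *wb, const uint32_t *memory, uint32_t addr) {
    uint32_t value;
    if (wb_lookup(wb, addr, &value))
        return value;     /* RAW conflict: take the buffered (newest) value */
    return memory[addr];  /* no conflict: the read goes ahead of the writes */
}
```

Forwarding the buffered value is one way to resolve a conflict; a simpler controller could instead stall only in the conflicting case and let all other reads bypass the buffer.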
2. Subblock Placement
- No need to load the full block on a miss
- Keep a valid bit per subblock to indicate which subblocks are valid
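A possible data layout for this idea is one tag per block plus a small bit mask with one valid bit per subblock. The sketch below assumes 4 subblocks per block and uses hypothetical names (`Line`, `subblock_hit`, `subblock_fill`); it models only the bookkeeping, not the data array.

```c
#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS 4  /* subblocks per cache block (illustrative) */

typedef struct {
    uint32_t tag;
    uint8_t valid;  /* bit i set means subblock i holds valid data */
} Line;

/* A hit needs a matching tag AND a valid bit for the requested subblock. */
bool subblock_hit(const Line *line, uint32_t tag, int sub) {
    return (line->tag == tag) && (((line->valid >> sub) & 1u) != 0);
}

/* On a miss, load just one subblock instead of the whole block. */
void subblock_fill(Line *line, uint32_t tag, int sub) {
    if (line->tag != tag) {  /* a new block replaces the old one entirely */
        line->tag = tag;
        line->valid = 0;
    }
    line->valid = (uint8_t)(line->valid | (1u << sub));
}
```

The miss penalty shrinks because only `1 / SUBBLOCKS` of the block is transferred per miss, at the cost of possibly missing again on a neighbouring subblock.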
3. Early Restart and Critical Word First
- Do not wait for the full block to be loaded before restarting the CPU
- Early restart
- As soon as the requested word arrives, send it to the CPU and let the CPU continue
- Critical Word First
- Request the missed word from memory first and send it to the CPU as soon as it arrives; the CPU continues while the rest of the block is filled
- Also called wrapped fetch and requested word first
- ??????????????????????
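The "wrapped fetch" order is easy to state in code: start at the requested word and wrap around the end of the block. A minimal sketch, with the hypothetical name `wrapped_fetch_order`:

```c
/* Fill order[] with the word indices of a block in wrapped-fetch order:
 * the requested (critical) word first, then the following words,
 * wrapping around to the start of the block. */
void wrapped_fetch_order(int requested_word, int words_per_block, int *order) {
    for (int i = 0; i < words_per_block; i++)
        order[i] = (requested_word + i) % words_per_block;
}
```

For an 8-word block with word 5 requested, the words arrive in the order 5, 6, 7, 0, 1, 2, 3, 4, so the CPU can restart after the first transfer instead of the sixth.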
4. Non-blocking Caches
- Non-blocking cache, or lockup-free cache
- Lets the cache keep supplying data (cache hits) while a miss is being serviced
- Requires out-of-order execution
- hit under miss
- Reduces the effective miss penalty by doing useful work during a miss
- hit under multiple miss, or miss under miss
- May lower the effective miss penalty further by overlapping multiple misses
- ????????????????????????
- Requires multiple memory banks
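5. Second Level Cache
- L2 Equations
- $\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}$
- $\text{Miss Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}$
- $\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times (\text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2})$
- Definitions
- Local miss rate
- Misses in this cache divided by the total number of memory accesses to this cache ($\text{Miss Rate}_{L2}$)
- Global miss rate
- Misses in this cache divided by the total number of memory accesses generated by the CPU ($\text{Miss Rate}_{L1} \times \text{Miss Rate}_{L2}$)
- Global miss rate is what matters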
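The two-level equations translate directly into code. The sketch below uses hypothetical function names and made-up parameter values; the L2 miss rate passed in is the local one, as in the equations.

```c
/* AMAT for a two-level hierarchy. miss_rate_l2 is the LOCAL L2 miss rate
 * (L2 misses / L2 accesses); all times share one unit, e.g. cycles. */
double amat_two_level(double hit_time_l1, double miss_rate_l1,
                      double hit_time_l2, double miss_rate_l2,
                      double miss_penalty_l2) {
    double miss_penalty_l1 = hit_time_l2 + miss_rate_l2 * miss_penalty_l2;
    return hit_time_l1 + miss_rate_l1 * miss_penalty_l1;
}

/* Global L2 miss rate: fraction of all CPU accesses that miss in L2. */
double global_miss_rate_l2(double miss_rate_l1, double local_miss_rate_l2) {
    return miss_rate_l1 * local_miss_rate_l2;
}
```

With illustrative values (L1 hit 1 cycle, L1 miss rate 4%, L2 hit 10 cycles, local L2 miss rate 50%, L2 miss penalty 100 cycles) the AMAT is 1 + 0.04 x (10 + 0.5 x 100) = 3.4 cycles. The local L2 miss rate of 50% looks poor, but the global miss rate is only 2%, which is why the global figure is the one that matters.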
??????????????? L2
- Reducing Miss Rate
- Larger Block Size
- Higher Associativity
- Victim Cache
- Pseudo-Associativity
- HW Prefetching: Instruction, Data
- SW Prefetching: Data
- Compiler Optimizations
POP QUIZ
- ???????????? L3 ?????????????????????????????????? ???
- ????????? L1, L2, ??? L3 ???????????????????????
- ??????????? Miss penalty ??????? L2 ??? L3
- ??????????? Miss rate ??? L2 ??? L3
- ???????? L3 ??????? ?????????
????????? L3
- Miss penalty ????????? (????????????????)
- ??? L2 ?????????????? L3 ????????????????????????? ??
Summary: Reducing Miss Penalty
- Reducing miss penalty
- Read priority over write on miss
- Subblock placement
- Early Restart and Critical Word First on miss
- Non-blocking Caches (Hit under Miss, Miss under Miss)
- Second Level Cache
Improving Cache Performance
- Reduce miss rate
- Reduce miss penalty
- Reduce hit time
Reducing Hit Time
- Small and simple caches
- Avoiding address translation
- Pipelining Writes
1. Small and Simple Caches
- Alpha 21164
- 8KB Instruction cache
- 8KB data cache
- 96KB L2 cache (shared by inst and data)
- Direct Mapped, on chip
2. Avoiding Address Translation (1)
- Send the virtual address to the cache
- Called a Virtually Addressed Cache or Virtual Cache, vs. Physical Cache
- Every process switch requires a cache flush
- Cost: flush time + compulsory misses
- Different virtual addresses may map to the same physical address (aliases or synonyms)
- I/O must interact with the cache, so it needs virtual addresses
2. Avoiding Address Translation (2)
- Solution to aliases
- HW: guarantee that every cache block has a unique physical address
- SW: guarantee that the lower n bits of aliased addresses are the same (page coloring)
- Solution to cache flush
- Add a process identifier tag that identifies the process as well as the address within the process
Virtually Addressed Caches
[Figure: three cache organizations, drawn with the labels CPU, VA, PA, VA Tags, PA Tags, TB, L2, MEM]
- Conventional Organization
- Virtually Addressed Cache: translate only on miss; synonym problem
- Overlap cache access with VA translation: requires the index to remain invariant across translation
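The "index must remain invariant across translation" condition can be checked numerically: the index and block-offset bits must all come from the page offset, which translation does not change. That holds when cache size divided by associativity is no larger than the page size. A small sketch with the hypothetical name `index_fits_in_page_offset`:

```c
#include <stdbool.h>
#include <stddef.h>

/* True when the cache index (plus block offset) lies entirely within the
 * page offset, so the cache can be indexed with untranslated address bits
 * while the TB translates the page number in parallel. */
bool index_fits_in_page_offset(size_t cache_bytes, size_t associativity,
                               size_t page_bytes) {
    return cache_bytes / associativity <= page_bytes;
}
```

For example, an 8 KB direct-mapped cache with 8 KB pages satisfies the condition, a 16 KB direct-mapped cache with 4 KB pages does not, and making that 16 KB cache 4-way set associative satisfies it again. This is one reason first-level caches tend to stay small or grow through associativity.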
3. Pipelined Writes
- Pipeline the tag check and the cache update as separate stages
- Write n performs its tag check
- Write n-1 performs its cache update at the same time
Summary: Cache Optimization
Technique                               Reduces        Complexity
Larger Block Size                       miss rate      0
Higher Associativity                    miss rate      1
Victim Caches                           miss rate      2
Pseudo-Associative Caches               miss rate      2
HW Prefetching of Instr/Data            miss rate      2
Compiler Controlled Prefetching         miss rate      3
Compiler Reduce Misses                  miss rate      0
Priority to Read Misses                 miss penalty   1
Subblock Placement                      miss penalty   1
Early Restart & Critical Word 1st       miss penalty   2
Non-Blocking Caches                     miss penalty   3
Second Level Caches                     miss penalty   2
Small & Simple Caches                   hit time       0
Avoiding Address Translation            hit time       2
Pipelining Writes                       hit time       1