Title: Storage System
1Storage System Quiz Review
- COMP381
- Tutorial 12
- 25-28 Nov, 08
2Enterprise Storage
- Sep 4th 1956, IBM 350 disk storage system
introducing magnetic disks to computers.
- 1.70m high, 50 disk platters, 24 inches in
diameter and a total capacity of 5 megabytes for
an annual lease of $38,400.
- Today, a 1 terabyte disk using 5 disk platters of
3.5 inches is available at a purchase price of $200.
- You can get 5 MB of disk storage basically free
of charge.
- A huge amount of data must be provisioned, backed
up, archived and restored if necessary.
3The types of data storage
- Online
- Direct Attached Storage(DAS)
- Network Attached Storage(NAS)
- Storage Area Network(SAN)
- Offline
- Tape
4DAS- Direct Attached Storage
- Directly attach the disks to an internal I/O bus
- Reside within the computer chassis or inside an
external peripheral chassis
5DAS- Drawbacks
- Limit the distance between the computer and the
disk storage
- Disk administration must be done for each
individual computer
- Different computers cannot share the storage
capacity
- A central data backup for all computers can be
implemented across a network only. Backups must
be able to run during a time of low network
usage.
- The various computers often use different disk
technologies. The administrators must be familiar
with all kinds of storage used.
6NAS- Network Attached Storage
- A server creating filesystems on its internal
disks
- Linux or Unix computer sharing its filesystems
via NFS
- Windows server may be sharing its NTFS
filesystems via CIFS.
- Connect via LAN
7NAS- Merits
- Dedicated file system achieves much better
performance
- The internal structures of these file systems are
not visible to storage clients
- Centralized control, easy to configure, backup,
share the storage
- Flexible Deployment (Long Distance via LAN)
8SAN- Storage Area Network
- While a NAS system in many cases will be accessed
via a regular LAN, a SAN has a separate network
used for storage data only
- Storage is not presented as a file share but as a
block device
- A computer accessing a SAN disk will see it as a
locally attached disk
- SAN technology is available with the Fibre Channel
and the iSCSI protocols.
9SAN- Merits
10Quiz Review
- Questions for both L1 and L2
- Sequence may be different
11Q1.Multiple choices
- With a VLIW design, which of the following
components can be simplified?
- A. Processor (Could be simplified)
- B. Physical memory (DRAM) (irrelevant)
- C. Cache (irrelevant)
- D. Compiler (much more complex, take HW2 Q6 as an
example)
12HW2 Q6
The latency between an integer ALU operation and
any other operation is 1 cycle.
You need to compensate for the effect of out-of-order execution!
13Q1.Multiple choices
- Which one of the following types of hazards can
be reduced by register renaming?
- A. RAW hazards
- B. WAR hazards (Or WAW by Reservation Station)
- C. Control hazards
- D. Structural hazards
- E. All of the above
14Q1.Multiple choices
- Which one of the following items does NOT
directly affect a processor's ability to exploit
ILP? Assume single-core, single-threaded, large
enough physical memory, no virtual memory swapping.
- A. The design of the branch predictor ((m,n)
predictor)
- B. The number of registers (the more the better)
- C. The number of instructions a processor can
issue every cycle (the more the better, exploit
ILP from a larger scope)
- D. The mapping (correspondence) between virtual
page numbers and physical page numbers
- E. The optimization level used for compiling the
programs (highly related)
15Q1.Multiple choices
- With loop unrolling, which of the following
components is likely to perform LESS efficiently?
- A. The processor's pipeline (could be better)
- B. Register file (irrelevant)
- C. Cache (More instructions to go through the
cache and be fetched to the processor. Miss rate
can go up.)
- D. None of the above
16Q2. True, false, and why?
- a) There are no compulsory misses with a cache of
infinite size.
- False. The first access to a block always causes
a compulsory miss, so that the block can be
brought into the cache, whatever the cache size.
17Q2. True, false, and why?
- b) Two programs with an identical instruction mix
(using the same types of instructions and having
the same number of instructions for each type)
running on the same computer must have the same
cache miss rate.
- False. The order of the instructions (and hence
of the memory accesses) also affects when cache
misses occur.
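The point of b) can be illustrated with a minimal sketch. It assumes a tiny direct-mapped cache (2 lines, 16-byte blocks) and two hypothetical addresses that map to the same line; `miss_count` and the addresses are illustrative names and values, not part of the quiz.

```python
def miss_count(addresses, num_lines=2, block_size=16):
    """Count misses in a direct-mapped cache (illustrative parameters)."""
    lines = [None] * num_lines           # block number held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // block_size       # memory block containing the address
        index = block % num_lines        # cache line that block maps to
        if lines[index] != block:        # miss: fetch the block, evict the old one
            misses += 1
            lines[index] = block
    return misses

A, B = 0, 32                             # blocks 0 and 2, both map to line 0
grouped = [A, A, B, B]                   # same four accesses ...
interleaved = [A, B, A, B]               # ... in a different order
print(miss_count(grouped), miss_count(interleaved))
```

Both sequences contain the same accesses, but the interleaved order keeps evicting the block that is needed next, so it misses more often.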
18Q2. True, false, and why?
- c) Over the last 20 years, the performance of the
processor and the DRAM has improved at the same
rate in terms of throughput and latency.
- False. Processor performance has improved faster
than DRAM performance over the past 20 years.
19Q2. True, false, and why?
- d) A loop which contains 20 instructions and
iterates for 10000 times has no temporal or
spatial locality if none of the 20 instructions
are memory access instructions (e.g., load/store
instructions).
- False. Execution of the program requires fetching
instructions from memory to the (instruction)
cache. Instruction fetching usually shows a
strong temporal or spatial locality.
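A minimal sketch of the instruction-fetch behaviour described in d), which also shows the compulsory misses from a). It assumes an unbounded instruction cache and 4 instructions per cache block; the function name and the block size are illustrative assumptions.

```python
def icache_misses(n_instr=20, iterations=10000, instr_per_block=4):
    """Fetch a loop body repeatedly through an unbounded instruction cache."""
    cached_blocks = set()                # infinite cache: nothing is ever evicted
    misses = 0
    fetches = 0
    for _ in range(iterations):
        for pc in range(n_instr):
            block = pc // instr_per_block    # neighbours share a block (spatial locality)
            fetches += 1
            if block not in cached_blocks:   # first touch of a block: compulsory miss
                misses += 1
                cached_blocks.add(block)
    return misses, fetches

misses, fetches = icache_misses()
print(misses, fetches)
```

Under these assumptions only the first touch of each block misses; every later fetch hits, because the same 20 instructions are reused in each iteration (temporal locality).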
20Q2. True, false, and why?
- e) If a processor running Tomasulo's
Algorithm has infinite reservation stations, it
would be able to eliminate all the WAR and WAW
hazards.
- True. WAR and WAW can be eliminated if there are
enough registers (reservation stations) we can
use to perform register renaming.
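The renaming idea in e) can be sketched as follows. This is a simplified model, not Tomasulo's hardware: each instruction is a hypothetical `(dest, sources)` pair, and every write gets a fresh name `P0`, `P1`, ... standing in for a reservation station tag.

```python
def rename(instrs):
    """Give every destination a fresh name; sources read the latest mapping."""
    mapping = {}                         # architectural register -> current name
    renamed = []
    for i, (dest, srcs) in enumerate(instrs):
        new_srcs = tuple(mapping.get(s, s) for s in srcs)   # RAW links are kept
        new_dest = f"P{i}"               # fresh name for each write
        mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed

program = [
    ("F0", ("F2", "F4")),    # reads F2
    ("F2", ("F6", "F8")),    # WAR on F2 with the first instruction
    ("F0", ("F2", "F10")),   # WAW on F0 with the first, RAW on F2 with the second
]
for line in rename(program):
    print(line)
```

After renaming, no two instructions write the same name (no WAW), and no instruction overwrites a name that an earlier instruction still reads (no WAR). The true RAW dependence of the third instruction on the second remains.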
21Q2. True, false, and why?
- f) The correlating branch predictors are a method
of dynamic branch prediction: they are used at
run time by the processor but cannot be used at
compile time by the compiler.
- True. The correlating branch predictors use
run-time information that is not available to
compilers.
22Q3. Code scheduling
- L.S F0, 0(R1)
- L.S F1, 0(R2)
- ADD.S F0, F0, F1
- L.S F2, 0(R3)
- L.S F3, 0(R4)
- MUL.S F2, F2, F3
- ADD.S F0, F0, F2
- S.S F0, 0(R5)
Latencies:
- Load -> FP ALU: 2 cycles
- FP multiplication -> FP ALU: 4 cycles
- FP multiplication -> Store: 4 cycles
- FP addition -> FP ALU: 2 cycles
- FP addition -> Store: 2 cycles
Schedule (cycle number, instruction):
- 1 L.S F0, 0(R1)
- 2 L.S F1, 0(R2)
- 3 Stall (Load -> FP ALU: 2 cycles)
- 4 Stall
- 5 ADD.S F0, F0, F1
- 6 L.S F2, 0(R3)
- 7 L.S F3, 0(R4)
- 8 Stall (Load -> FP ALU: 2 cycles)
- 9 Stall
- 10 MUL.S F2, F2, F3
- 11 Stall (FP multiplication -> FP ALU: 4 cycles)
- 12 Stall
- 13 Stall
- 14 Stall
- 15 ADD.S F0, F0, F2
- 16 Stall (FP addition -> Store: 2 cycles)
- 17 Stall
- 18 S.S F0, 0(R5)
- In total 18 cycles (in-order).
23Q3. Code scheduling
- L.S F0, 0(R1)
- L.S F1, 0(R2)
- ADD.S F0, F0, F1
- L.S F2, 0(R3)
- L.S F3, 0(R4)
- MUL.S F2, F2, F3
- ADD.S F0, F0, F2
- S.S F0, 0(R5)
Latencies:
- Load -> FP ALU: 2 cycles
- FP multiplication -> FP ALU: 4 cycles
- FP multiplication -> Store: 4 cycles
- FP addition -> FP ALU: 2 cycles
- FP addition -> Store: 2 cycles
No Data Hazard!
- 1 L.S F2, 0(R3)
- 2 L.S F3, 0(R4)
- 3 L.S F0, 0(R1)
- 4 L.S F1, 0(R2)
- 5 MUL.S F2, F2, F3
- 6 STALL
- 7 ADD.S F0, F0, F1
- 8 STALL
- 9 STALL
- 10 ADD.S F0, F0, F2
- 11 STALL
- 12 STALL
- 13 S.S F0, 0(R5)
- In total 13 cycles.
Switch two instructions!
out-of-order
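The two cycle counts for Q3 can be checked with a small sketch. It assumes one instruction issues per cycle in program order, and that a latency of k means k stall cycles between a producer and a dependent consumer; this is how the schedules on these slides count. The tuple encoding and the function names are illustrative.

```python
# Stall cycles between a producer and a dependent consumer (from the slide).
LATENCY = {
    ("load", "fp"): 2,
    ("mul", "fp"): 4,
    ("mul", "store"): 4,
    ("add", "fp"): 2,
    ("add", "store"): 2,
}
KIND = {"L.S": "load", "S.S": "store", "MUL.S": "mul", "ADD.S": "add"}

def total_cycles(program):
    """Issue one instruction per cycle in order, stalling on data dependences."""
    produced = {}                        # register -> (producer kind, issue cycle)
    cycle = 0
    for op, dest, srcs in program:
        consumer = "store" if op == "S.S" else "fp"
        issue = cycle + 1
        for reg in srcs:
            if reg in produced:
                kind, when = produced[reg]
                stalls = LATENCY.get((kind, consumer), 0)
                issue = max(issue, when + stalls + 1)
        cycle = issue
        if dest is not None:
            produced[dest] = (KIND[op], issue)
    return cycle

ORIGINAL = [
    ("L.S", "F0", ["R1"]), ("L.S", "F1", ["R2"]),
    ("ADD.S", "F0", ["F0", "F1"]),
    ("L.S", "F2", ["R3"]), ("L.S", "F3", ["R4"]),
    ("MUL.S", "F2", ["F2", "F3"]),
    ("ADD.S", "F0", ["F0", "F2"]),
    ("S.S", None, ["F0", "R5"]),
]
REORDERED = [
    ("L.S", "F2", ["R3"]), ("L.S", "F3", ["R4"]),
    ("L.S", "F0", ["R1"]), ("L.S", "F1", ["R2"]),
    ("MUL.S", "F2", ["F2", "F3"]),
    ("ADD.S", "F0", ["F0", "F1"]),
    ("ADD.S", "F0", ["F0", "F2"]),
    ("S.S", None, ["F0", "R5"]),
]
print(total_cycles(ORIGINAL), total_cycles(REORDERED))
```

Moving the long-latency MUL.S chain ahead of the first ADD.S lets the multiplication latency overlap with the addition, which is where the saved cycles come from.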
24Q4. Branch Prediction
Suppose we have a program with the following
sequence of C-like statements. It has three
branches as indicated by B1, B2 and B3.
if (a < b) a = 2a      // branch B1
if (c > b) c = c - b   // branch B2
if (a > c) a = a - b   // branch B3
The instruction sequence corresponding to the above
statements is shown in Fig. 1 in assembly
language. In Fig. 1, register R1 is used for the
variable a, R2 for b and R3 for c. R4 is a
register for storing temporary results. We
maintain a (m,n) predictor for each branch and
the predictor for the branch B3 is illustrated in
the following table (Fig. 2).
25Q4. Branch Prediction
- a) For the (m, n) predictor, what should be the
value of parameter m and what should be the value
of the parameter n, based on Fig. 2?
Another Example
Answer: m=2, n=2
26Q4. Branch Prediction
- b) Suppose, initially, when the program counter
(PC) points to the first instruction at label S1
(SUB R4, R1, R2) but has not executed it, the
variables are a=26, b=50 and c=46 (or, REGS[r1]=26,
REGS[r2]=50, REGS[r3]=46). At that time, the
state of the predictor of B3 is shown in Fig. 2.
After a few cycles, the program's execution
reaches the B3 branch. The PC now points to B3.
According to this predictor, what prediction will
be made for the branch instruction at B3 (TAKEN
or NOT TAKEN)? Explain the reason.
- Answer: a=26 and b=50, so B1 is NOT TAKEN. c=46
and b=50, so B2 is TAKEN. According to the
outcomes of branches B1 and B2, the predictor
indexed by NT/T is selected. Since its current
state is 01, the prediction is NOT TAKEN.
Fig. 1 (R1=a, R2=b, R3=c)
S1: SUB R4, R1, R2    R4=R1-R2
B1: BGE R4, S2        if R4>=0, then branch to S2 (B1 branch)
    ADD R1, R1, R1    R1=R1+R1
S2: SUB R4, R3, R2    R4=R3-R2
B2: BLE R4, S3        if R4<=0, then branch to S3 (B2 branch)
    SUB R3, R3, R2    R3=R3-R2
S3: SUB R4, R1, R3    R4=R1-R3
B3: BLE R4, S4        if R4<=0, then branch to S4 (B3 branch)
    SUB R1, R1, R2    R1=R1-R2
S4:
27Q4. Branch Prediction
- c) Follow the conditions of question b). When
the program has just finished the execution of
the branch B3 (i.e., PC becomes more than B3),
what will be the state of the predictors of B3?
(Hint: draw a graph or table similar to Fig. 2)
- Answer: When PC=B3, we have a=52 and c=46.
Therefore B3 is NOT TAKEN. The state of the
corresponding 2-bit predictor (the 01 entry)
will be changed from 01 to 00.
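The answers to b) and c) can be traced with a short sketch. It assumes the usual 2-bit saturating counter (00 and 01 predict NOT TAKEN, 10 and 11 predict TAKEN), which is consistent with the answers above. Fig. 2 is not reproduced in this text, so the table holds only the NT/T entry stated in the answer; all function names are illustrative.

```python
def run_branches(a, b, c):
    """Follow the Fig. 1 sequence; True means the branch is TAKEN."""
    b1 = (a - b) >= 0        # B1: BGE R4, S2 with R4 = R1 - R2
    if not b1:
        a = a + a            # ADD R1, R1, R1
    b2 = (c - b) <= 0        # B2: BLE R4, S3 with R4 = R3 - R2
    if not b2:
        c = c - b            # SUB R3, R3, R2
    b3 = (a - c) <= 0        # B3: BLE R4, S4 with R4 = R1 - R3
    if not b3:
        a = a - b            # SUB R1, R1, R2
    return b1, b2, b3

def predict(counter):
    """2-bit counter: states 2 and 3 predict TAKEN."""
    return counter >= 2

def update(counter, taken):
    """Saturating increment on TAKEN, saturating decrement on NOT TAKEN."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

b1, b2, b3 = run_branches(26, 50, 46)
table = {(False, True): 0b01}            # only the NT/T entry is known from the text
history = (b1, b2)                       # outcomes of the last two branches
prediction = predict(table[history])
table[history] = update(table[history], b3)
print(history, prediction, b3, format(table[history], "02b"))
```

With a=26, b=50, c=46 the history is NT/T, the selected counter 01 predicts NOT TAKEN, and the actual NOT TAKEN outcome moves that counter to 00.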
28Q4. Branch Prediction
- d) Suppose we can use up to 10000 bits for
dynamic branch prediction using this (m,n)
predictor scheme. How many entries can we hold in
the cache at most? Assume the number of entries
is a power of 2, and each entry corresponds to a
different instruction address. (Hint: m and n are
determined in question a)
- Answer: m=2 and n=2. So each entry needs
2^2 * 2 = 8 bits. We have 10000 bits and therefore
we can hold 10000/8 = 1250 entries. Since only
power-of-2 entry counts are supported, 1250 is
rounded down to 1024 entries.
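The arithmetic in d) can be written out as a small sketch; `max_entries` is an illustrative name, and the formula 2^m * n bits per entry is the one used in the answer above.

```python
def max_entries(budget_bits, m, n):
    """Largest power-of-2 number of (m,n) predictor entries within the budget."""
    bits_per_entry = (2 ** m) * n        # 2^m counters of n bits for each branch
    fit = budget_bits // bits_per_entry  # entries that fit, ignoring the power-of-2 rule
    if fit == 0:
        return 0
    return 1 << (fit.bit_length() - 1)   # round down to a power of 2

print(max_entries(10000, 2, 2))
```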
29Q5v1. Tomasulo's Algorithm
- a) In Cycle 12, two iterations have been
issued and the LD for the third iteration
(LD3) has also been issued. Can the MULTD
for the third iteration (Mult3) be issued
in Cycle 13? Briefly explain the reasons.
- Answer: No, because both of the multiplication
units are busy. We are facing a structural hazard
and therefore have to stall.
The instruction sequence of the loop:
Loop: LD F0, 0, R1
      MULTD F4, F0, F2
      SD F4, 0, R1
      SUBI R1, R1, 8
      BNEZ R1, Loop
(Figure: the state of Tomasulo's organization in Cycle 12)
30Q5v1. Tomasulo's Algorithm
- b) In Cycle 14, can the SD for the
third iteration (SD3) be issued? Briefly
explain the reasons.
- Answer: No, because we issue instructions
in-order. MULTD3 has not been issued and
therefore SD3 cannot be issued either.
- c) In Cycle 15, the MULTD for the second
iteration (MULT2) is completed. Who is waiting
for the results produced by MULT2? Briefly
explain based on the organization state.
- Answer: Function unit Store2 and register F4
are waiting.
31Q5v1. Tomasulo's Algorithm
- d) Briefly explain why both WAW and WAR
hazards become possible in Tomasulo's algorithm,
and how Tomasulo's algorithm solves the problem.
- Answer: They become possible because instructions
are executed out-of-order. They are solved by
register renaming.
32Q5v2. Memory hierarchy
- L1 cache (NO L2 cache)
- unified cache
- write-through with write-allocate
- hit rate is H1 and the hit time is 1 cycle
- The percentage of read is r and the write is w.
- TLB and Memory
- The hit rate of TLB is H2 and the page hit rate
is H3.
- The hit time of TLB is one cycle
- M stall cycles to complete reading/writing the
physical memory (a block or a page table entry)
- reading a block and writing a word in parallel
- Disk
- we need D cycles including the cycles needed for
memory and disk operations
- The percentage of clean pages is c and that of
dirty pages is d
33Q5v2. Memory hierarchy
Two cases: (a) memory read with TLB miss, (b) memory
write with a TLB hit.
The stall cycles per memory access for case (a):
memory read with TLB miss, L1 miss, page fault,
clean page (3 misses)
- 1: TLB lookup
- M: Get the Page Table from Memory
- M: Load to L1 from Memory
- D: Disk Read operation
The stall cycles per memory access for case (b):
memory write with TLB hit but page fault on a
dirty page (write through, no L1)
- 1: TLB lookup
- M: Load to L1 from Memory
- D: Disk Read operation
- D: Disk write back operation (dirty, modified
page, write allocate)
34Q5v2. Memory hierarchy
- Complete the Memory Access Tree