Storage System - PowerPoint PPT Presentation


Provided by: papadopoul3


Transcript and Presenter's Notes

Title: Storage System

1
Storage System Quiz Review

COMP381
Tutorial 12
25-28 Nov, 08

2
Enterprise Storage

On Sep 4th 1956, the IBM 350 disk storage system
introduced magnetic disks to computers.
It was 1.70 m high and had 50 disk platters, 24 inches in
diameter, with a total capacity of 5 megabytes, for
an annual lease of 38,400.
Today, a 1 terabyte disk using 5 disk platters of
3.5 inches is available at a purchase price of 200,
so you can get 5 MB of disk storage basically free
of charge.
A huge amount of data must be provisioned, backed
up, archived and restored if necessary.

3
The types of data storage

Online
Direct Attached Storage(DAS)
Network Attached Storage(NAS)
Storage Area Network(SAN)
Offline
Tape

4
DAS- Direct Attached Storage

Directly attach the disks to an internal I/O bus
Reside within the computer chassis or inside an
external peripheral chassis

5
DAS- Drawbacks

Limit the distance between the computer and the
disk storage
Disk administration must be done for each
individual computer
Different computers cannot share the storage
capacity
A central data backup for all computers can be
implemented across a network only. Backups must
be able to run during a time of low network
usage.
The various computers often use different disk
technologies. The administrators must be familiar
with all kinds of storage used.

6
NAS- Network Attached Storage

A server creating filesystems on its internal
disks
Linux or Unix computer sharing its filesystems
via NFS
Windows server may be sharing its NTFS
filesystems via CIFS.
Connect via LAN

7
NAS- Merits

Dedicated file system achieves much better
performance
The internal structures of these file systems are
not visible to storage clients
Centralized control, easy to configure, backup,
share the storage
Flexible Deployment (Long Distance via LAN)

8
SAN- Storage Area Network

While a NAS system in many cases will be accessed
via a regular LAN, a SAN has a separate network
used for storage data only
Storage is not presented as a file share but as a
block device
A computer accessing a SAN disk will see it as a
locally attached disk
SAN technology is available with the Fibre Channel
and the iSCSI protocols.

9
SAN- Merits

Better Performance

10
Quiz Review

Questions for both L1 and L2
Sequence may be different

11
Q1.Multiple choices

With a VLIW design, which of the following
components can be simplified?
A. Processor (Could be simplified)
B. Physical memory (DRAM) (irrelevant)
C. Cache (irrelevant)
D. Compiler (much more complex, take HW2 Q6 as an
example)

12
HW2 Q6
The latency between an integer ALU operation and
any other operation is 1 cycle.
You need to compensate for the effect of out-of-order execution!

13
Q1.Multiple choices

Which one of the following types of hazards can
be reduced by register renaming?
A. RAW hazards
B. WAR hazards (Or WAW by Reservation Station)
C. Control hazards
D. Structural hazards
E. All of the above

14
Q1.Multiple choices

Which one of the following items does NOT
directly affect a processor's ability to exploit
ILP? (Assume single-core, single-threaded, large enough
physical memory, no virtual memory swapping.)
A. The design of the branch predictor ( (m,n)
predictor)
B. The number of registers (the more the better)
C. The number of instructions a processor can
issue every cycle (the more the better, exploit
ILP from a larger scope)
D. The mapping (correspondence) between virtual
page numbers and physical page numbers
E. The optimization level used for compiling the
programs (highly related)

15
Q1.Multiple choices

With loop unrolling, which of the following
components is actually likely to perform
LESS efficiently?
A. The processor's pipeline (could be better)
B. Register file (irrelevant)
C. Cache (More instructions to go through the
cache and be fetched to the processor. Miss rate
can go up.)
D. None of the above

16
Q2. True, false, and why?

a) There are no compulsory misses with a cache of
infinite size.
False. The first access to a block always causes
a compulsory cache miss, because the block must
first be brought into the cache.
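
The point can be checked with a minimal sketch of a cache that never evicts anything. The function name and the block-number trace below are illustrative assumptions, not part of the quiz.

```python
def compulsory_misses(block_trace):
    """Count misses in a cache with unlimited capacity (nothing is ever evicted)."""
    cached = set()
    misses = 0
    for block in block_trace:
        if block not in cached:   # first touch of this block: compulsory miss
            misses += 1
            cached.add(block)
    return misses

# Hypothetical trace of block numbers: three distinct blocks are touched.
trace = [0, 1, 0, 2, 1, 0]
print(compulsory_misses(trace))  # 3, one miss per distinct block
```

Even with infinite capacity, the miss count equals the number of distinct blocks touched, so compulsory misses remain.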

17
Q2. True, false, and why?

b) Two programs with an identical instruction mix
(using the same types of instructions and having
the same number of instructions for each type)
running on the same computer must have the same
cache miss rate.
False. Different orders of instructions also
affect when cache misses occur.
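
A small sketch makes this concrete. It assumes a hypothetical direct-mapped cache with two lines and two made-up traces that contain exactly the same accesses in a different order.

```python
def direct_mapped_misses(block_trace, num_lines=2):
    """Count misses in a direct-mapped cache indexed by block number modulo num_lines."""
    lines = [None] * num_lines
    misses = 0
    for block in block_trace:
        index = block % num_lines
        if lines[index] != block:   # line is empty or holds a conflicting block
            misses += 1
            lines[index] = block
    return misses

# Same accesses (three to block 0, three to block 2), different order.
grouped = [0, 0, 0, 2, 2, 2]
interleaved = [0, 2, 0, 2, 0, 2]
print(direct_mapped_misses(grouped))      # 2
print(direct_mapped_misses(interleaved))  # 6
```

Blocks 0 and 2 map to the same line, so interleaving them evicts the other block on every access, while grouping them misses only once per block.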

18
Q2. True, false, and why?

c) Over the last 20 years, the performance of the
processor and the DRAM has improved at the same rate
in terms of throughput and latency.
False. Processor performance improved faster than
DRAM performance over the past 20 years.

19
Q2. True, false, and why?

d) A loop which contains 20 instructions and
iterates for 10000 times has no temporal or
spatial locality if none of the 20 instructions
are memory access instructions (e.g., load/store
instructions).
False. Execution of the program requires fetching
instructions from memory to the (instruction)
cache. Instruction fetching usually shows a
strong temporal or spatial locality.

20
Q2. True, false, and why?

e) If a processor running Tomasulo's
algorithm has infinite reservation stations, it
would be able to eliminate all the WAR and WAW
hazards.
True. WAR and WAW can be eliminated if there are
registers (reservation stations) we can use to
perform register renaming.

21
Q2. True, false, and why?

f) The correlating branch predictors are a method
of dynamic branch prediction: they are used at
run time by the processor but cannot be used at
compile time by the compiler.
True. The correlating branch predictors use
run-time information that is not available to
compilers

22
Q3. Code scheduling

L.S F0, 0(R1)
L.S F1, 0(R2)
ADD.S F0, F0, F1
L.S F2, 0(R3)
L.S F3, 0(R4)
MUL.S F2, F2, F3
ADD.S F0, F0, F2
S.S F0, 0(R5)

Latencies (producer → consumer):
Load → FP ALU: 2 cycles
FP multiplication → FP ALU: 4 cycles
FP multiplication → Store: 4 cycles
FP addition → FP ALU: 2 cycles
FP addition → Store: 2 cycles

1 L.S F0, 0(R1)
2 L.S F1, 0(R2)
3 Stall (Load → FP ALU: 2 cycles)
4 Stall
5 ADD.S F0, F0, F1
6 L.S F2, 0(R3)
7 L.S F3, 0(R4)
8 Stall (Load → FP ALU: 2 cycles)
9 Stall
10 MUL.S F2, F2, F3
11 Stall (FP multiplication → FP ALU: 4 cycles)
12 Stall
13 Stall
14 Stall
15 ADD.S F0, F0, F2
16 Stall (FP addition → Store: 2 cycles)
17 Stall
18 S.S F0, 0(R5)

in-order: 18 cycles in total.

23
Q3. Code scheduling

L.S F0, 0(R1)
L.S F1, 0(R2)
ADD.S F0, F0, F1
L.S F2, 0(R3)
L.S F3, 0(R4)
MUL.S F2, F2, F3
ADD.S F0, F0, F2
S.S F0, 0(R5)

Latencies (producer → consumer):
Load → FP ALU: 2 cycles
FP multiplication → FP ALU: 4 cycles
FP multiplication → Store: 4 cycles
FP addition → FP ALU: 2 cycles
FP addition → Store: 2 cycles
No Data Hazard!

1 L.S F2, 0(R3)
2 L.S F3, 0(R4)
3 L.S F0, 0(R1)
4 L.S F1, 0(R2)
5 MUL.S F2, F2, F3
6 STALL
7 ADD.S F0, F0, F1
8 STALL
9 STALL
10 ADD.S F0, F0, F2
11 STALL
12 STALL
13 S.S F0, 0(R5)
13 cycles in total.

Switch two instructions!
out-of-order
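
The cycle counts for both orderings can be reproduced with a minimal sketch. It assumes a single-issue pipeline in which a dependent instruction must wait the listed number of stall cycles after its producer, and it keys latency by producer opcode only. The tuple encoding of instructions and the function name are illustrative assumptions; integer base registers are treated as always ready.

```python
# Stall cycles required between a producer and a dependent consumer,
# keyed by the producer's opcode (values taken from the latency list of this question).
LATENCY = {"L.S": 2, "ADD.S": 2, "MUL.S": 4}

def issue_cycles(program):
    """Return the issue cycle of each instruction, issuing in the given order.

    program: list of (opcode, dest FP register or None, [source FP registers]).
    """
    produced = {}   # FP register -> (issue cycle, opcode) of its latest writer
    cycles = []
    cycle = 0
    for opcode, dest, sources in program:
        earliest = cycle + 1                      # at most one issue per cycle
        for reg in sources:
            if reg in produced:
                when, producer = produced[reg]
                earliest = max(earliest, when + LATENCY[producer] + 1)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            produced[dest] = (cycle, opcode)
    return cycles

original = [
    ("L.S", "F0", []), ("L.S", "F1", []), ("ADD.S", "F0", ["F0", "F1"]),
    ("L.S", "F2", []), ("L.S", "F3", []), ("MUL.S", "F2", ["F2", "F3"]),
    ("ADD.S", "F0", ["F0", "F2"]), ("S.S", None, ["F0"]),
]
reordered = [
    ("L.S", "F2", []), ("L.S", "F3", []), ("L.S", "F0", []), ("L.S", "F1", []),
    ("MUL.S", "F2", ["F2", "F3"]), ("ADD.S", "F0", ["F0", "F1"]),
    ("ADD.S", "F0", ["F0", "F2"]), ("S.S", None, ["F0"]),
]
print(issue_cycles(original)[-1])   # 18
print(issue_cycles(reordered)[-1])  # 13
```

Hoisting the loads and the multiplication lets the long multiply latency overlap with the first addition, which is where the five saved cycles come from.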
24
Q4. Branch Prediction
Suppose we have a program with the following
sequence of C-like statements. It has three
branches, as indicated by B1, B2 and B3.
if (a < b) a = 2*a;   (branch B1)
if (c > b) c = c - b; (branch B2)
if (a > c) a = a - b; (branch B3)
The instruction sequence corresponding to the above
statements is shown in Fig. 1 in assembly
language. In Fig. 1, register R1 is used for the
variable a, R2 for b and R3 for c. R4 is a
register for storing temporary results. We
maintain an (m,n) predictor for each branch, and
the predictor for the branch B3 is illustrated in
the following table (Fig. 2).

25
Q4. Branch Prediction

a) For the (m, n) predictor, what should be the
value of parameter m and what should be the value
of the parameter n, based on Fig. 2?

Another Example
Answer: m = 2, n = 2

26
Q4. Branch Prediction

b) Suppose, initially, when the program counter
(PC) points to the first instruction at label S1
(SUB R4, R1, R2) but has not executed it, the
variables are a = 26, b = 50 and c = 46 (or, REGS[r1] = 26,
REGS[r2] = 50, REGS[r3] = 46). At that time, the
state of the predictor of B3 is as shown in Fig. 2.
After a few cycles, the program's execution
reaches the B3 branch. The PC now points to B3.
According to this predictor, what prediction will
be made for the branch instruction at B3 (TAKEN
or NOT TAKEN)? Explain the reason.
Answer: a = 26 and b = 50, so B1 is NOT TAKEN. c = 46
and b = 50, so B2 is TAKEN. According to the
outcomes of branches B1 and B2, the predictor
indexed by NT/T is selected. Since its current
state is 01, we make the prediction
NOT TAKEN.

Fig. 1 (R1 = a, R2 = b, R3 = c):
S1: SUB R4, R1, R2   ; R4 = R1 - R2
B1: BGE R4, S2       ; if R4 ≥ 0, then branch to S2 (B1 branch)
    ADD R1, R1, R1   ; R1 = R1 + R1
S2: SUB R4, R3, R2   ; R4 = R3 - R2
B2: BLE R4, S3       ; if R4 ≤ 0, then branch to S3 (B2 branch)
    SUB R3, R3, R2   ; R3 = R3 - R2
S3: SUB R4, R1, R3   ; R4 = R1 - R3
B3: BLE R4, S4       ; if R4 ≤ 0, then branch to S4 (B3 branch)
    SUB R1, R1, R2   ; R1 = R1 - R2
S4:

27
Q4. Branch Prediction

c) Follow the conditions of question b). When
the program has just finished the execution of
the branch B3 (i.e., PC has moved past B3),
what will be the state of the predictors of B3?
(Hint: draw a graph or table similar to Fig. 2)
Answer: When PC = B3, we have a = 52 and c = 46.
Therefore B3 is NOT TAKEN. The state of the
corresponding 2-bit predictor (the 01 entry)
will be changed from 01 to 00.
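
The answers to b) and c) can be traced with a minimal sketch. It assumes the selected 2-bit predictor behaves as a saturating counter in which states 00 and 01 predict NOT TAKEN and 10 and 11 predict TAKEN. Only the entry selected by the NT/T history is modelled, since the other entries of Fig. 2 are not reproduced here. All function names are illustrative.

```python
def branch_outcomes(a, b, c):
    """Return (B1, B2, B3) as booleans (True = TAKEN) and the value of a seen at B3."""
    b1 = (a - b) >= 0          # B1: BGE R4, S2
    if not b1:
        a = a + a              # ADD R1, R1, R1
    b2 = (c - b) <= 0          # B2: BLE R4, S3
    if not b2:
        c = c - b              # SUB R3, R3, R2
    b3 = (a - c) <= 0          # B3: BLE R4, S4
    return (b1, b2, b3), a

def predict(counter):
    """Assumed 2-bit scheme: 0b00 and 0b01 predict NOT TAKEN, 0b10 and 0b11 predict TAKEN."""
    return counter >= 2

def update(counter, taken):
    """Assumed saturating update: move toward 0b11 on TAKEN, toward 0b00 on NOT TAKEN."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

outcomes, a_at_b3 = branch_outcomes(26, 50, 46)
entry = 0b01                                   # state of the NT/T entry from the answer to b)
print(outcomes)                                # (False, True, False)
print(a_at_b3)                                 # 52
print(predict(entry))                          # False, i.e. NOT TAKEN
print(format(update(entry, outcomes[2]), "02b"))  # 00
```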

28
Q4. Branch Prediction

d) Suppose we can use up to 10000 bits for
dynamic branch prediction using this (m,n)
predictor scheme. How many entries can we hold in
the cache at most? Assume the number of entries
is a power of 2, and each entry corresponds to a
different instruction address. (Hint: m and n are
determined in question a)
Answer: m = 2 and n = 2, so each entry needs
2^2 × 2 = 8 bits. We have 10000 bits and therefore
can hold 10000 / 8 = 1250 entries. Since only
power-of-2 entry counts are supported, 1250 is
rounded down to 1024 entries.
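
The arithmetic can be written out as a short sketch. The function name and return layout are illustrative, and it assumes each entry stores 2^m counters of n bits and nothing else (no tags).

```python
def max_entries(budget_bits, m, n):
    """Return (bits per entry, raw entry count, largest power-of-2 entry count)."""
    bits_per_entry = (2 ** m) * n            # 2^m counters of n bits each
    raw = budget_bits // bits_per_entry
    if raw == 0:
        return bits_per_entry, 0, 0
    power_of_two = 1 << (raw.bit_length() - 1)   # round down to a power of 2
    return bits_per_entry, raw, power_of_two

print(max_entries(10000, 2, 2))  # (8, 1250, 1024)
```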

29
Q5v1. Tomasulo's Algorithm

a) In Cycle 12, two iterations have been
issued and the LD for the third iteration
(LD3) has also been issued. Can the MULTD
for the third iteration (Mult3) be issued
in Cycle 13? Briefly explain the reasons.
Answer: No, because both of the multiplication
units are busy. We are facing a structural hazard
and therefore have to stall.

The instruction sequence of the loop:
Loop: LD F0, 0, R1
      MULTD F4, F0, F2
      SD F4, 0, R1
      SUBI R1, R1, 8
      BNEZ R1, Loop
[Figure: The state of Tomasulo's organization in Cycle 12]

30
Q5v1. Tomasulo's Algorithm

b) In Cycle 14, can the SD for the
third iteration (SD3) be issued? Briefly
explain the reasons.
Answer: No, because we issue instructions
in-order. MULTD3 has not been issued and therefore
SD3 cannot be issued either.
c) In Cycle 15, the MULTD for the second
iteration (MULT2) is completed. Who is waiting
for the results produced by MULT2? Briefly
explain based on the organization state.
Answer: Functional unit Store2 and register F4
are waiting.

31
Q5v1. Tomasulo's Algorithm

d) Briefly explain why both WAW and WAR
hazards become possible in Tomasulo's algorithm,
and how Tomasulo's algorithm solves the problem.
Answer: They become possible because instructions
are executed out-of-order. They are solved by renaming.

32
Q5v2. Memory hierarchy

L1 cache (NO L2 cache)
unified cache
write-through with write-allocate
hit rate is H1 and the hit time is 1 cycle
The percentage of read is r and the write is w.
TLB and Memory
The hit rate of TLB is H2 and the page hit rate
is H3.
The hit time of TLB is one cycle
M stall cycles to complete reading/writing the
physical memory (a block or a page table entry)
reading a block and writing a word in parallel
Disk
we need D cycles including the cycles needed for
memory and disk operations
The percentage of clean pages is c and that of
dirty pages is d

33
Q5v2. Memory hierarchy
Two cases: (a) memory read with TLB miss, (b) memory write with a
TLB hit

The stall cycles per memory access for case (a):
memory read with TLB miss, L1 miss, page
fault, clean page (3 misses)
1: TLB lookup
M: Get the page table from memory
M: Load to L1 from memory
D: Disk read operation

The stall cycles per memory access for
case (b): memory write with TLB hit but page
fault on a dirty page, write through (no L1)
1: TLB lookup
M: Load to L1 from memory
D: Disk read operation
D: Disk write-back operation (dirty, modified
page, write allocate)

34
Q5v2. Memory hierarchy