Storage System - PowerPoint PPT Presentation

About This Presentation
Title:

Storage System

Description:

M stall cycles to complete reading/writing the physical memory (a block or a page table entry) ... The stall cycles per memory access for case(a) ---memory read ... – PowerPoint PPT presentation

Slides: 35
Provided by: papadopoul3
Tags: stall | storage | system

Transcript and Presenter's Notes

Title: Storage System


1
Storage System Quiz Review
  • COMP381
  • Tutorial 12
  • 25-28 Nov, 08

2
Enterprise Storage
  • Sep 4th 1956, IBM 350 disk storage system
    introducing magnetic disks to computers.
  • 1.70 m high, 50 disk platters, 24 inches in
    diameter and a total capacity of 5 megabytes for
    an annual lease of $38,400.
  • Today, a 1 terabyte disk using 5 disk platters of
    3.5 inches is available at a purchase price of $200.
  • You can get 5 MB of disk storage basically free
    of charge.
  • A huge amount of data must be provisioned, backed
    up, archived and restored if necessary.
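
The "basically free" claim can be checked with a little arithmetic. The sketch below uses only the figures quoted on this slide; it assumes both prices are in the same currency unit and takes 1 terabyte as 10**6 MB. The constant and function names are illustrative.

```python
# Figures quoted on the slide; both prices are assumed to share one currency unit.
IBM350_CAPACITY_MB = 5
IBM350_ANNUAL_LEASE = 38_400
MODERN_CAPACITY_MB = 1_000_000   # 1 terabyte, taken here as 10**6 MB
MODERN_PRICE = 200


def cost_per_mb(price, capacity_mb):
    """Price divided by capacity, in currency units per megabyte."""
    return price / capacity_mb


old_cost = cost_per_mb(IBM350_ANNUAL_LEASE, IBM350_CAPACITY_MB)  # per year of lease
new_cost = cost_per_mb(MODERN_PRICE, MODERN_CAPACITY_MB)         # one-off purchase
cost_of_5mb_today = new_cost * IBM350_CAPACITY_MB

print(old_cost, new_cost, cost_of_5mb_today)
```

Under these assumptions, the IBM 350's entire capacity costs about a thousandth of a currency unit on the modern disk.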

3
The types of data storage
  • Online
  • Direct Attached Storage(DAS)
  • Network Attached Storage(NAS)
  • Storage Area Network(SAN)
  • Offline
  • Tape

4
DAS- Direct Attached Storage
  • Directly attach the disks to an internal I/O bus
  • Reside within the computer chassis or inside an
    external peripheral chassis

5
DAS- Drawbacks
  • Limit the distance between the computer and the
    disk storage
  • Disk administration must be done for each
    individual computer
  • Different computers cannot share the storage
    capacity
  • A central data backup for all computers can be
    implemented across a network only. Backups must
    be able to run during a time of low network
    usage.
  • The various computers often use different disk
    technologies. The administrators must be familiar
    with all kinds of storage used.

6
NAS- Network Attached Storage
  • A server creating filesystems on its internal
    disks
  • Linux or Unix computer sharing its filesystems
    via NFS
  • Windows server may be sharing its NTFS
    filesystems via CIFS.
  • Connect via LAN

7
NAS- Merits
  • Dedicated file system achieves much better
    performance
  • The internal structures of these file systems are
    not visible to storage clients
  • Centralized control, easy to configure, backup,
    share the storage
  • Flexible Deployment (Long Distance via LAN)

8
SAN- Storage Area Network
  • While a NAS system in many cases will be accessed
    via a regular LAN, a SAN has a separate network
    used for storage data only
  • Storage is not presented as a file share but as a
    block device
  • A computer accessing a SAN disk will see it as a
    locally attached disk
  • SAN technology is available with the Fibre Channel
    and the iSCSI protocols.

9
SAN- Merits
  • Better Performance

10
Quiz Review
  • Questions for both L1 and L2
  • Sequence may be different

11
Q1.Multiple choices
  • With a VLIW design, which of the following
    components can be simplified?
  • A. Processor (Could be simplified)
  • B. Physical memory (DRAM) (irrelevant)
  • C. Cache (irrelevant)
  • D. Compiler (much more complex, take HW2 Q6 as an
    example)

12
HW2 Q6
The latency between an integer ALU operation and
any other operation is 1 cycle.
You need to compensate for the effect of out-of-order execution!

13
Q1.Multiple choices
  • Which one of the following types of hazards can
    be reduced by register renaming?
  •  A. RAW hazards
  •  B. WAR hazards (Or WAW by Reservation Station)
  •  C. Control hazards
  •  D. Structural hazards
  •  E. All of the above

14
Q1.Multiple choices
  • Which one of the following items does NOT
    directly affect a processor's ability to exploit
    ILP? Assume a single-core, single-threaded
    processor with large enough physical memory and
    no virtual memory swapping.
  • A. The design of the branch predictor ( (m,n)
    predictor)
  • B. The number of registers (the more the better)
  • C. The number of instructions a processor can
    issue every cycle (the more the better, exploit
    ILP from a larger scope)
  • D. The mapping (correspondence) between virtual
    page numbers and physical page numbers
  • E. The optimization level used for compiling the
    programs (highly related)

15
Q1.Multiple choices
  • With loop unrolling, which of the following
    components is likely to perform
    LESS efficiently?
  • A. The processor's pipeline (could be better)
  • B. Register file (irrelevant)
  • C. Cache (More instructions to go through the
    cache and be fetched to the processor. Miss rate
    can go up.)
  • D. None of the above

16
Q2. True, false, and why?
  • a) There are no compulsory misses with a cache of
    infinite size.
  • False. Since the first access to a block will
    cause a compulsory cache miss so that this block
    can be brought into the cache.
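
A minimal sketch of this answer, assuming a cache modelled as an unbounded set of block addresses (the function name and the trace are illustrative). With infinite capacity there are no capacity or conflict misses, yet every first touch of a block still misses.

```python
def count_misses_infinite_cache(block_addresses):
    """Count misses for a cache that never evicts anything."""
    cached = set()
    misses = 0
    for block in block_addresses:
        if block not in cached:
            misses += 1          # first access to this block: compulsory miss
            cached.add(block)
    return misses


trace = [0, 1, 0, 2, 1, 3, 0]    # hypothetical sequence of block addresses
compulsory = count_misses_infinite_cache(trace)
print(compulsory)                # one miss per distinct block
```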

17
Q2. True, false, and why?
  • b) Two programs with an identical instruction mix
    (using the same types of instructions and having
    the same number of instructions for each type)
    running on the same computer must have the same
    cache miss rate.
  • False. A different order of instructions also
    affects when cache misses occur.
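
The effect of ordering can be seen in a small simulation. This is a sketch under assumed parameters: a direct-mapped cache with two lines and two hypothetical traces that contain exactly the same accesses in a different order.

```python
def miss_rate(trace, num_lines):
    """Miss rate of a direct-mapped cache indexed by block address modulo num_lines."""
    lines = [None] * num_lines
    misses = 0
    for block in trace:
        index = block % num_lines
        if lines[index] != block:
            misses += 1
            lines[index] = block   # replace whatever occupied the line
    return misses / len(trace)


# Same mix of accesses (three to block 0, three to block 2), different order.
grouped = [0, 0, 0, 2, 2, 2]
interleaved = [0, 2, 0, 2, 0, 2]

print(miss_rate(grouped, 2), miss_rate(interleaved, 2))
```

Blocks 0 and 2 map to the same line, so the interleaved order evicts on every access while the grouped order misses only twice.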

18
Q2. True, false, and why?
  • c) Over the last 20 years, the performance of the
    processor and the DRAM has improved at the same rate
    in terms of throughput and latency.
  • False. Processor performance improved faster over
    the past 20 years.

19
Q2. True, false, and why?
  • d) A loop which contains 20 instructions and
    iterates for 10000 times has no temporal or
    spatial locality if none of the 20 instructions
    are memory access instructions (e.g., load/store
    instructions).
  • False. Execution of the program requires fetching
    instructions from memory to the (instruction)
    cache. Instruction fetching usually shows a
    strong temporal or spatial locality.
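
A rough sketch of why the loop's instruction fetches are so cache-friendly. It assumes an instruction cache large enough to hold the whole loop and a block holding four instructions; both values are illustrative and do not come from the quiz.

```python
def icache_hit_rate(loop_len, iterations, instrs_per_block):
    """Hit rate of instruction fetches for a loop that fits entirely in the cache."""
    cached_blocks = set()
    hits = 0
    total = 0
    for _ in range(iterations):
        for pc in range(loop_len):
            block = pc // instrs_per_block   # neighbouring instructions share a block
            if block in cached_blocks:
                hits += 1
            else:
                cached_blocks.add(block)     # only the first touch of a block misses
            total += 1
    return hits / total


rate = icache_hit_rate(loop_len=20, iterations=10000, instrs_per_block=4)
print(rate)
```

Spatial locality shows up as only 5 block misses for 20 instructions, and temporal locality as hits on every later iteration.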

20
Q2. True, false, and why?
  • e) If a processor running Tomasulo's
    algorithm has infinite reservation stations, it
    would be able to eliminate all the WAR and WAW
    hazards.
  • True. WAR and WAW can be eliminated if there are
    registers (reservation stations) we can use to
    perform register renaming.
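
A minimal sketch of register renaming, assuming an unlimited supply of fresh names (standing in for infinite reservation stations). The instruction encoding, the `P0`, `P1`, ... names and the sample program are illustrative, not taken from the quiz.

```python
from itertools import count


def rename(instructions):
    """Give every written register a fresh name; sources read the latest mapping.

    instructions: list of (dest, src1, src2) architectural register names.
    """
    mapping = {}                 # architectural name -> most recent fresh name
    fresh = count()
    renamed = []
    for dest, *srcs in instructions:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)  # read before remapping dest
        mapping[dest] = f"P{next(fresh)}"
        renamed.append((mapping[dest], *new_srcs))
    return renamed


def has_war_or_waw(instructions):
    """True if a later instruction writes a register that an earlier one reads or writes."""
    for i, (dest_i, *srcs_i) in enumerate(instructions):
        for dest_j, *_ in instructions[i + 1:]:
            if dest_j == dest_i or dest_j in srcs_i:
                return True
    return False


program = [
    ("F0", "F2", "F4"),   # F0 = F2 op F4
    ("F2", "F6", "F8"),   # WAR on F2 with the first instruction
    ("F0", "F2", "F6"),   # WAW on F0 with the first instruction, RAW on F2
]
renamed = rename(program)
print(has_war_or_waw(program), has_war_or_waw(renamed))
```

The true (RAW) dependence survives renaming: the third instruction still reads the value produced by the second, now under its fresh name.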

21
Q2. True, false, and why?
  • f) The correlating branch predictors are a method
    of dynamic branch prediction; they are used at
    run time by the processor but cannot be used at
    compile time by the compiler.
  • True. The correlating branch predictors use
    run-time information that is not available to
    compilers.

22
Q3. Code scheduling
  • L.S F0, 0(R1)
  • L.S F1, 0(R2)
  • ADD.S F0, F0, F1
  • L.S F2, 0(R3)
  • L.S F3, 0(R4)
  • MUL.S F2, F2, F3
  • ADD.S F0, F0, F2
  • S.S F0, 0(R5)

Latencies (producer → consumer):
  • Load → FP ALU: 2 cycles
  • FP multiplication → FP ALU: 4 cycles
  • FP multiplication → Store: 4 cycles
  • FP addition → FP ALU: 2 cycles
  • FP addition → Store: 2 cycles

  • 1 L.S F0, 0(R1)
  • 2 L.S F1, 0(R2)
  • 3 Stall (Load → FP ALU: 2 cycles)
  • 4 Stall
  • 5 ADD.S F0, F0, F1
  • 6 L.S F2, 0(R3)
  • 7 L.S F3, 0(R4)
  • 8 Stall (Load → FP ALU: 2 cycles)
  • 9 Stall
  • 10 MUL.S F2, F2, F3
  • 11 Stall (FP multiplication → FP ALU: 4 cycles)
  • 12 Stall
  • 13 Stall
  • 14 Stall
  • 15 ADD.S F0, F0, F2
  • 16 Stall (FP addition → Store: 2 cycles)
  • 17 Stall
  • 18 S.S F0, 0(R5)

in-order
23
Q3. Code scheduling
  • L.S F0, 0(R1)
  • L.S F1, 0(R2)
  • ADD.S F0, F0, F1
  • L.S F2, 0(R3)
  • L.S F3, 0(R4)
  • MUL.S F2, F2, F3
  • ADD.S F0, F0, F2
  • S.S F0, 0(R5)

Latencies (producer → consumer):
  • Load → FP ALU: 2 cycles
  • FP multiplication → FP ALU: 4 cycles
  • FP multiplication → Store: 4 cycles
  • FP addition → FP ALU: 2 cycles
  • FP addition → Store: 2 cycles

No Data Hazard!
  • 1 L.S F2, 0(R3)
  • 2 L.S F3, 0(R4)
  • 3 L.S F0, 0(R1)
  • 4 L.S F1, 0(R2)
  • 5 MUL.S F2, F2, F3
  • 6 STALL
  • 7 ADD.S F0, F0, F1
  • 8 STALL
  • 9 STALL
  • 10 ADD.S F0, F0, F2
  • 11 STALL
  • 12 STALL
  • 13 S.S F0, 0(R5)
  • Total: 13 cycles.

Switch two instructions!
out-of-order
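
Both schedules can be checked with a small cycle counter. This is a sketch, not the quiz's official method: it assumes a single-issue pipeline that issues in the listed order, treats each latency as the number of stall cycles required between producer and consumer, and takes any latency pair not listed in Q3 as 0. The instruction encoding and the `fadd`/`fmul` names are illustrative.

```python
# Stall cycles required between a producer and a dependent consumer (from Q3).
# "FP ALU" is taken to cover both FP addition and FP multiplication.
LATENCY = {
    ("load", "fadd"): 2, ("load", "fmul"): 2,
    ("fmul", "fadd"): 4, ("fmul", "fmul"): 4, ("fmul", "store"): 4,
    ("fadd", "fadd"): 2, ("fadd", "fmul"): 2, ("fadd", "store"): 2,
}


def schedule_length(program):
    """Cycle in which the last instruction issues; program is (kind, dest, sources)."""
    last_writer = {}                     # register -> (producer kind, issue cycle)
    cycle = 0
    for kind, dest, srcs in program:
        earliest = cycle + 1             # at most one instruction per cycle
        for reg in srcs:
            if reg in last_writer:
                prod_kind, prod_cycle = last_writer[reg]
                stalls = LATENCY.get((prod_kind, kind), 0)   # unlisted pairs: assumed 0
                earliest = max(earliest, prod_cycle + stalls + 1)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (kind, cycle)
    return cycle


original = [
    ("load", "F0", ("R1",)), ("load", "F1", ("R2",)),
    ("fadd", "F0", ("F0", "F1")),
    ("load", "F2", ("R3",)), ("load", "F3", ("R4",)),
    ("fmul", "F2", ("F2", "F3")),
    ("fadd", "F0", ("F0", "F2")),
    ("store", None, ("F0", "R5")),
]
# Loads hoisted to the top and the multiplication moved ahead of the first addition.
reordered = [original[3], original[4], original[0], original[1],
             original[5], original[2], original[6], original[7]]

print(schedule_length(original), schedule_length(reordered))
```

Under these assumptions the two orders take 18 and 13 cycles, matching the cycle numbers in the two schedules.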
24
Q4. Branch Prediction
Suppose we have a program with the following
sequence of C-like statements. It has three
branches, as indicated by B1, B2 and B3.
  if (a < b) a = 2*a;      branch B1
  if (c > b) c = c - b;    branch B2
  if (a > c) a = a - b;    branch B3
The instruction sequence corresponding to the above
statements is shown in Fig. 1 in assembly
language. In Fig. 1, register R1 is used for the
variable a, R2 for b and R3 for c. R4 is a
register for storing temporary results. We
maintain an (m,n) predictor for each branch, and
the predictor for the branch B3 is illustrated in
the following table (Fig. 2).
25
Q4. Branch Prediction
  • a) For the (m, n) predictor, what should be the
    value of parameter m and what should be the value
    of the parameter n, based on Fig. 2?

Another Example
Answer: m = 2, n = 2
26
Q4. Branch Prediction
  • b) Suppose, initially, when the program counter
    (PC) points to the first instruction at label S1
    (SUB R4, R1, R2) but has not executed it, the
    variables are a = 26, b = 50 and c = 46 (or,
    REGS[r1] = 26, REGS[r2] = 50, REGS[r3] = 46). At
    that time, the state of the predictor of B3 is as
    shown in Fig. 2. After a few cycles, the program's
    execution reaches the B3 branch. The PC now points
    to B3. According to this predictor, what prediction
    will be made for the branch instruction at B3
    (TAKEN or NOT TAKEN)? Explain the reason.
  • Answer: a = 26 and b = 50, so B1 is NOT TAKEN.
    c = 46 and b = 50, so B2 is TAKEN. According to the
    outcomes of branches B1 and B2, the predictor
    indexed by NT/T is selected. Since its current
    state is 01, the prediction is
    NOT TAKEN.

Fig. 1 (R1 = a, R2 = b, R3 = c)
S1:  SUB R4, R1, R2    ; R4 = R1 - R2
B1:  BGE R4, S2        ; if R4 >= 0, then branch to S2 (B1 branch)
     ADD R1, R1, R1    ; R1 = R1 + R1
S2:  SUB R4, R3, R2    ; R4 = R3 - R2
B2:  BLE R4, S3        ; if R4 <= 0, then branch to S3 (B2 branch)
     SUB R3, R3, R2    ; R3 = R3 - R2
S3:  SUB R4, R1, R3    ; R4 = R1 - R3
B3:  BLE R4, S4        ; if R4 <= 0, then branch to S4 (B3 branch)
     SUB R1, R1, R2    ; R1 = R1 - R2
S4:
27
Q4. Branch Prediction
  • c) Follow the conditions of question b). When
    the program has just finished the execution of
    the branch B3 (i.e., the PC has moved past B3),
    what will be the state of the predictor of B3?
    (Hint: draw a graph or table similar to Fig. 2.)
  • Answer: When PC = B3, we have a = 52 and c = 46.
    Therefore B3 is NOT TAKEN. The state of the
    corresponding 2-bit predictor (the 01 entry)
    will be changed from 01 to 00.
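
The answers to b) and c) can be traced with a short simulation. This sketch assumes the usual 2-bit saturating counter convention (states 00 and 01 predict NOT TAKEN, 10 and 11 predict TAKEN), which agrees with the answers given here. Only the NT/T entry of B3's predictor is modelled, because the rest of Fig. 2 is not reproduced in this transcript; the function and variable names are illustrative.

```python
def step_b3(a, b, c, b3_counters):
    """Run the three branches once; return (predicted_taken, actual_taken, new_state)."""
    b1_taken = (a - b) >= 0          # B1: BGE R4, S2 with R4 = a - b
    if not b1_taken:
        a = a + a                    # ADD R1, R1, R1
    b2_taken = (c - b) <= 0          # B2: BLE R4, S3 with R4 = c - b
    if not b2_taken:
        c = c - b                    # SUB R3, R3, R2
    history = (b1_taken, b2_taken)   # outcomes of the two preceding branches
    counter = b3_counters[history]
    predicted_taken = counter >= 2   # assumed convention: 10 and 11 predict TAKEN
    b3_taken = (a - c) <= 0          # B3: BLE R4, S4 with R4 = a - c
    # Saturating update: move towards 11 when taken, towards 00 when not taken.
    b3_counters[history] = min(counter + 1, 3) if b3_taken else max(counter - 1, 0)
    return predicted_taken, b3_taken, b3_counters[history]


counters = {(False, True): 0b01}     # the NT/T entry, currently in state 01
result = step_b3(26, 50, 46, counters)
print(result)                        # (False, False, 0)
```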

28
Q4. Branch Prediction
  • d) Suppose we can use up to 10000 bits for
    dynamic branch prediction using this (m,n)
    predictor scheme. How many entries can we hold in
    the cache at most? Assume the number of entries
    is a power of 2, and each entry corresponds to a
    different instruction address. (Hint: m and n are
    determined in question a.)
  • Answer: m = 2 and n = 2, so each entry needs
    2^2 × 2 = 8 bits. We have 10000 bits and therefore
    can hold 10000 / 8 = 1250 entries. Since only
    power-of-2 entry counts are supported, 1250 is
    rounded down to 1024 entries.
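
The same arithmetic as a small sketch; the function name is illustrative, and each entry is assumed to hold 2^m counters of n bits.

```python
def max_entries(budget_bits, m, n):
    """Return (bits per entry, raw entry count, largest power-of-2 entry count)."""
    bits_per_entry = (2 ** m) * n        # 2^m counters of n bits each
    raw = budget_bits // bits_per_entry
    if raw < 1:
        return bits_per_entry, raw, 0
    entries = 1
    while entries * 2 <= raw:            # round down to a power of 2
        entries *= 2
    return bits_per_entry, raw, entries


print(max_entries(10000, 2, 2))          # (8, 1250, 1024)
```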

29
Q5v1. Tomasulo's Algorithm
  • a) In Cycle 12, two iterations have been
    issued and the LD for the third iteration
    (LD3) has also been issued. Can the MULTD
    for the third iteration (Mult3) be issued
    in Cycle 13? Briefly explain the reasons.
  • Answer: No, because both of the multiplication
    units are busy. We are facing a structural hazard
    and therefore have to stall.

The instruction sequence of the loop:
Loop: LD    F0, 0, R1
      MULTD F4, F0, F2
      SD    F4, 0, R1
      SUBI  R1, R1, 8
      BNEZ  R1, Loop
[Figure: the state of Tomasulo's organization in Cycle 12]
30
Q5v1. Tomasulo's Algorithm
  • b) In Cycle 14, can the SD for the
    third iteration (SD3) be issued? Briefly
    explain the reasons.
  • Answer: No, because we issue instructions
    in order. MULTD3 has not been issued and therefore
    SD3 cannot be issued either.
  • c) In Cycle 15, the MULTD for the second
    iteration (MULT2) is completed. Who is waiting
    for the results produced by MULT2? Briefly
    explain based on the organization state.
  • Answer: The functional unit Store2 and register F4
    are waiting.
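
A minimal sketch of why SD3 is stuck behind MULT3. It assumes one instruction can issue per cycle, strictly in program order, and that an instruction issues only when a station of its type is free. The station counts and the point at which a multiply station is released are hypothetical, since the Cycle 12 state figure is not reproduced in this transcript.

```python
from collections import deque


def try_issue(queue, free_stations):
    """Try to issue the instruction at the head of the queue (in-order issue).

    Returns the name of the issued instruction, or None on a structural stall.
    """
    if not queue:
        return None
    name, unit = queue[0]
    if free_stations.get(unit, 0) == 0:
        return None                  # head stalls; nothing behind it may issue
    free_stations[unit] -= 1
    queue.popleft()
    return name


# Hypothetical state: both multiply stations busy, one store buffer free.
queue = deque([("MULT3", "mult"), ("SD3", "store")])
free_stations = {"mult": 0, "store": 1}

cycle13 = try_issue(queue, free_stations)   # MULT3 stalls on the structural hazard
cycle14 = try_issue(queue, free_stations)   # SD3 is still behind MULT3
free_stations["mult"] += 1                  # a multiply station is released later
later = try_issue(queue, free_stations)
print(cycle13, cycle14, later)
```

Even though a store buffer is free in this sketch, SD3 cannot overtake MULT3 because issue is in order.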

31
Q5v1. Tomasulo's Algorithm
  • d) Briefly explain why both WAW and WAR
    hazards become possible in Tomasulo's algorithm,
    and how Tomasulo's algorithm solves the problem.
  • Answer: It is because instructions are executed
    out of order. They are solved by renaming.

32
Q5v2. Memory hierarchy
  • L1 cache (NO L2 cache)
  • unified cache
  • write-through with write-allocate
  • hit rate is H1 and the hit time is 1 cycle
  • The percentage of read is r and the write is w.
  • TLB and memory
  • The hit rate of TLB is H2 and the page hit rate
    is H3.
  • The hit time of TLB is one cycle
  • M stall cycles to complete reading/writing the
    physical memory (a block or a page table entry)
  • reading a block and writing a word in parallel
  • Disk
  • we need D cycles including the cycles needed for
    memory and disk operations
  • The percentage of clean pages is c and that of
    dirty pages is d

33
Q5v2. Memory hierarchy
Two cases: (a) memory read with a TLB miss, (b) memory write with a
TLB hit.

The stall cycles per memory access for case (a):
memory read with TLB miss, L1 miss, page
fault, clean page (3 misses)
  • 1: TLB lookup
  • M: get the page table from memory
  • M: load to L1 from memory
  • D: disk read operation

The stall cycles per memory access for
case (b): memory write with TLB hit but page
fault on a dirty page (write through, no L1)
  • 1: TLB lookup
  • M: load to L1 from memory
  • D: disk read operation
  • D: disk write-back operation (dirty, modified
    page, write allocate)
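
The components listed for the two cases can be combined in a short sketch. It assumes that the components simply add up, which the slide suggests but does not state as a formula; the function names and the numeric values for M and D are purely illustrative.

```python
def stall_case_a(M, D):
    """Read with TLB miss, L1 miss and page fault on a clean page."""
    tlb_lookup = 1
    page_table_read = M      # get the page table from memory
    disk_read = D            # bring the page in from disk
    l1_fill = M              # load the block into L1 from memory
    return tlb_lookup + page_table_read + disk_read + l1_fill


def stall_case_b(M, D):
    """Write with TLB hit but page fault on a dirty page."""
    tlb_lookup = 1
    disk_write_back = D      # the dirty page must be written back first
    disk_read = D            # then the new page is read from disk
    l1_fill = M              # write allocate: load the block into L1
    return tlb_lookup + disk_write_back + disk_read + l1_fill


M, D = 100, 1_000_000        # hypothetical cycle counts
print(stall_case_a(M, D), stall_case_b(M, D))
```

Symbolically, this sketch gives 1 + 2M + D for case (a) and 1 + M + 2D for case (b).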
34
Q5v2. Memory hierarchy
  • Complete the Memory Access Tree