Title: Storage Systems
1. Storage Systems
2. Memory Hierarchy III: I/O System
- Boring, but important
- I/O has been the orphan of computer architecture
- The most widely used performance measure, CPU time, by definition ignores I/O
- Second-class citizenship: the term "peripheral" is applied to I/O devices
- Common sense: response time is the better measure, and I/O is a big part of it!
- The customer who pays for it cares, even if the computer designer does not
[Figure: memory hierarchy — registers, instruction (I) and data (D) caches, L2 cache, L3 cache, main memory, disk (swap)]
3. I/O (Disk) Performance
- Who cares? You do
- Remember Amdahl's Law
- Want fast disk access (fast swap, fast file reads)
- I/O performance
- Bandwidth: I/Os per second (IOPS)
- Latency: response time
- Is I/O (disk) latency important? Why not just context-switch?
- Context-switching requires more memory
- Context-switching requires jobs to context-switch to
- Context-switching annoys users (productivity = f(1/response time))
4. I/O Device Characteristics
- Type
- Input: read only
- Output: write only
- Storage: both
- Partner
- Human
- Machine
- Data rate
- Peak transfer rate
5. Disk Parameters
- 1-20 platters (data on both sides)
- Magnetic iron-oxide coating
- 1 read/write head per side
- 500-2,500 tracks per platter
- 32-128 sectors per track
- Sometimes fewer on inside tracks
- 512-2,048 bytes per sector
- Usually fixed length
- Each sector: data + ECC (parity) + gap
- 4-24 GB total
- 3,000-10,000 RPM
[Figure: disk assembly — platters on a spindle, one read/write head per surface, tracks and sectors, with an R/W cache and controller]
6. Disk Performance
- Service (response) time:
- t_disk = t_seek + t_rotation + t_transfer + t_controller + t_queuing (see the sketch below)
- t_seek (seek time): move the head to the track
- t_rotation (rotational latency): wait for the sector to come around
- average t_rotation = 0.5 / RPS (RPS = RPM/60)
- t_transfer (transfer time): read the disk
- rate_transfer = bytes/sector × sectors/track × RPS
- t_transfer = bytes transferred / rate_transfer
- t_controller (controller delay): wait for the controller to do its thing
- t_queuing (queuing delay): wait for older requests to finish
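A minimal sketch (not from the slides): the service-time equation above as a small Python calculator; the parameter names simply mirror the terms of the formula.

```python
def disk_service_time(seek_ms, rpm, sectors_per_track, bytes_per_sector,
                      request_bytes, controller_ms=0.0, queuing_ms=0.0):
    """t_disk = t_seek + t_rotation + t_transfer + t_controller + t_queuing (in ms)."""
    rps = rpm / 60.0                                    # revolutions per second
    t_rotation_ms = 0.5 / rps * 1000                    # wait half a revolution on average
    rate = bytes_per_sector * sectors_per_track * rps   # sustained transfer rate, bytes/s
    t_transfer_ms = request_bytes / rate * 1000
    return seek_ms + t_rotation_ms + t_transfer_ms + controller_ms + queuing_ms
```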
7. Example: Seagate Cheetah 73LP
- Model number: ST373405LW
- Capacity: 73.4 GB
- Speed: 10,000 RPM
- Seek time: 5.1 ms avg
- Interface: Ultra160 SCSI
- Suggested resale price: $980.00
8. Disk Performance Example
- Parameters
- 3,600 RPM → 60 RPS
- Avg seek time 9 ms
- 100 sectors per track, 512 bytes per sector
- Controller + queuing delays: 1 ms
- Q: average time to read 1 sector?
- rate_transfer = 100 sectors/track × 512 B/sector × 60 RPS ≈ 24 Mb/s (≈ 3 MB/s)
- t_transfer = 512 B / 24 Mb/s ≈ 0.16 ms
- t_rotation = 0.5 / 60 RPS ≈ 8.3 ms
- t_disk = 9 ms + 8.3 ms + 0.2 ms + 1 ms ≈ 18.5 ms (see the sketch below)
- t_transfer is only a small component!
- End of story? No! t_queuing is not fixed (it gets longer with more requests)
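Plugging the slide's parameters into the same arithmetic as a standalone sketch (values rounded as on the slide):

```python
rpm, sectors_per_track, bytes_per_sector = 3600, 100, 512
seek_ms, controller_queue_ms = 9.0, 1.0

rps = rpm / 60.0                                      # 60 revolutions per second
rate = sectors_per_track * bytes_per_sector * rps     # ~3.07 MB/s (~24 Mb/s)
t_transfer = bytes_per_sector / rate * 1000           # ~0.17 ms
t_rotation = 0.5 / rps * 1000                         # ~8.3 ms
t_disk = seek_ms + t_rotation + t_transfer + controller_queue_ms
print(f"t_disk = {t_disk:.1f} ms")                    # ~18.5 ms
```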
9. Disk Performance: Queuing Theory
- I/O is a queuing system
- Equilibrium: rate_arrival = rate_departure
- Total time: t_system = t_queue + t_server
- rate_arrival × t_system = length_system (Little's Law)
- utilization_server = t_server × rate_arrival
- The important result (derivation in H&P):
- t_system = t_server / (1 - utilization_server) (see the sketch below)
- If the server is highly utilized, t_system gets VERY HIGH
- Lesson: keep utilization low (below 75%)
- Q: what is the new t_disk if the disk is 50% utilized?
- t_disk_new = t_disk_old / (1 - 0.5) = 37 ms
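A small sketch of the queuing result quoted above (the sample utilizations are assumed, not from the slides): response time explodes as utilization approaches 1.

```python
def response_time(t_server_ms, utilization):
    """t_system = t_server / (1 - utilization_server)."""
    assert 0 <= utilization < 1, "server must not be fully utilized"
    return t_server_ms / (1.0 - utilization)

for u in (0.0, 0.25, 0.5, 0.75, 0.9):
    print(f"utilization {u:.2f}: t_system = {response_time(18.5, u):.1f} ms")
# at 50% utilization the 18.5 ms disk behaves like a 37 ms disk, as on the slide
```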
10. Disk Usage Models
- Data mining, supercomputing
- Large files, sequential reads
- Raw data transfer rate is most important
- Transaction processing
- Large files, but random access, many small requests
- IOPS is most important
- Time sharing, filesystems
- Small files, sequential accesses, potential for file caching
- IOPS is most important
- Must design the disk (I/O) system based on the target workload
- Use disk benchmarks (they exist)
11. Disk Alternatives
- Solid state disk (SSD)
- DRAM + battery backup with a standard disk interface
- Fast: no seek time, no rotation time, fast transfer rate
- Expensive
- FLASH memory
- Fast: no seek time, no rotation time, fast transfer rate
- Non-volatile
- Slow bulk erase before write
- Wears out over time
- Optical disks (CDs)
- Cheap if write-once, expensive if write-multiple
- Slow
12. Extensions to Conventional Disks
- Increasing density: more sensitive heads, finer control
- Increases cost
- Fixed head: one head per track
- Seek time eliminated
- Low track density
- Parallel transfer: simultaneous read from multiple platters
- Difficulty in locking onto different tracks on multiple surfaces
- Lower-cost alternatives possible (disk arrays)
13. More Extensions to Conventional Disks
- Disk caches: disk-controller RAM buffers data
- Fast writes: RAM acts as a write buffer
- Better utilization of the host-to-device path
- High miss rate increases request latency
- Disk scheduling: schedule requests to reduce latency (see the sketch below)
- e.g., schedule the request with the shortest seek time
- e.g., elevator algorithm for seeks (head sweeps back and forth)
- Works best for unlikely cases (long queues)
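As an illustration of the elevator idea (a hypothetical helper, not part of the slides): serve all pending tracks at or above the head position in ascending order, then sweep back down.

```python
def elevator_order(head, pending_tracks):
    """Return pending track requests in one up-then-down sweep from `head`."""
    up = sorted(t for t in pending_tracks if t >= head)                    # sweep outward
    down = sorted((t for t in pending_tracks if t < head), reverse=True)   # sweep back
    return up + down

print(elevator_order(50, [95, 10, 60, 33, 70]))  # [60, 70, 95, 33, 10]
```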
14. Disk Arrays
- Collection of individual disks (D = number of disks)
- Distribute data across the disks
- Access them in parallel for higher bandwidth (IOPS)
- Issue: data distribution → load balancing
- e.g., 3 disks, 3 files (A, B, and C), each 2 sectors long
[Figure: three layouts of files A, B, C (sectors A0/A1, B0/B1, C0/C1) on 3 disks —
undistributed: disk 1 holds A0,A1; disk 2 holds B0,B1; disk 3 holds C0,C1;
coarse-grain striping: whole sectors interleaved across the disks (A0,B1 / A1,C0 / B0,C1);
fine-grain striping: every sector is split across all three disks]
15. Disk Arrays: Stripe Width
- Fine-grain striping
- D × stripe width evenly divides the smallest accessible unit of data (a sector)
- Only one request is served at a time
- Perfect load balance
- Effective transfer rate approx. D times better than a single disk
- Access time can go up, unless the disks are synchronized (disk skew)
- Coarse-grain striping
- Data-transfer parallelism for large requests
- Concurrency for small requests (several small requests served at once)
- Statistical load balance
- Must consider the workload to determine the stripe width (see the sketch below)
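A minimal sketch (illustrative only) of how a logical block maps to a disk under coarse-grain striping; stripe_unit is the knob the last bullet refers to, and stripe_unit=1 degenerates to interleaving single blocks.

```python
def locate(block, num_disks, stripe_unit):
    """Map a logical block number to (disk, offset) under coarse-grain striping."""
    stripe = block // stripe_unit                # which stripe unit the block falls in
    disk = stripe % num_disks                    # stripe units go round-robin over disks
    offset = (stripe // num_disks) * stripe_unit + block % stripe_unit
    return disk, offset

# blocks 0..7 on 4 disks, 2 blocks per stripe unit
print([locate(b, 4, 2) for b in range(8)])
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 1)]
```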
16. Disk Redundancy and RAIDs
- Disk failures are a significant fraction of all hardware failures
- Electrical failures are rare; mechanical failures are more common
- Striping increases the number of files touched by a failure
- Fix with replication and/or parity protection
- RAID: redundant array of inexpensive disks
- Arrays of cheap disks provide high performance and dependability
- If MTTF is high and MTTR is low → redundancy can increase dependability significantly
17. Reliability, Availability, and Dependability
- Reliability and availability are measures of dependability
- Reliability: a measure of continuous service accomplishment
- MTTF: mean time to failure
- Rate of failure = 1/MTTF
- Service interruption is measured as MTTR: mean time to repair
- If a collection of modules have exponentially distributed lifetimes, the overall failure rate is the sum of the failure rates of the modules
- Availability: a measure of service accomplishment
- Availability = MTTF / (MTTF + MTTR) (see the sketch below)
- MTBF: mean time between failures = MTTF + MTTR
- Widely used
- MTTF is often more appropriate
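For instance, a quick sketch of the availability formula above with assumed numbers (MTTF = 1,200,000 hours, MTTR = 24 hours):

```python
mttf_hours, mttr_hours = 1_200_000, 24           # assumed example values
availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"availability = {availability:.5%}")       # ~99.998%
```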
18. Array Reliability
Reliability of N disks = reliability of 1 disk ÷ N
  = 1,200,000 hours ÷ 100 disks = 12,000 hours
  (1 year = 365 × 24 ≈ 8,700 hours)
Disk-system MTTF drops from about 140 years to about 1.5 years!
Arrays (without redundancy) are too unreliable to be useful!
19. Redundant Arrays of Independent Disks
Files are "striped" across multiple spindles.
Redundancy yields high data availability:
- Disks will fail!
- Contents are reconstructed from data redundantly stored in the array
- Capacity penalty to store the redundant data
- Bandwidth penalty to update it
Techniques: mirroring/shadowing (high cost) and parity
20. RAID Levels
- 6 levels of RAID, depending on redundancy/concurrency (D = number of data disks, C = number of check disks)
- Level 0: non-redundant striping (data disks only, C = 0); widely used
- Level 1: full mirroring (C = D)
- Level 2: memory-style ECC (e.g., D = 8, C = 4); not used
- Level 3: bit-interleaved parity (e.g., D = 8, C = 1)
- Level 4: block-interleaved parity (e.g., D = 8, C = 1)
- Level 5: block-interleaved distributed parity (e.g., D = 8, C = 1); most widely used
- Level 6: two-dimensional error bits (e.g., D = 8, C = 2); not presently available
21. RAID 1: Disk Mirroring/Shadowing
[Figure: each disk is paired with its shadow copy in a recovery group]
Each disk is fully duplicated onto its "shadow" → high availability.
Bandwidth sacrifice on write: one logical write → two physical writes.
Reads may be optimized.
Most expensive solution: 100% capacity overhead.
Targeted for high-I/O-rate, high-availability environments.
Probability of failure (assuming a 24-hour MTTR):
  24 / (1.2 × 10^6 × 1.2 × 10^6) ≈ 6.9 × 10^-13 → roughly 170,000,000 years
22. RAID 3: Parity Disk
[Figure: a logical record (10010011 11001101 10010011 ...) striped bit/byte-wise across the data disks as physical records, with a parity disk P covering the recovery group]
Parity is computed across the recovery group to protect against hard-disk failures.
33% capacity cost for parity in this configuration; wider arrays reduce the capacity cost but decrease expected availability and increase reconstruction time.
Arms are logically synchronized and spindles rotationally synchronized, so the array behaves logically as a single high-capacity, high-transfer-rate disk.
Targeted for high-bandwidth applications: scientific computing, image processing.
23. RAID 3: Write Update
RAID-3 small write algorithm: 1 logical write → 3 reads + 2 writes
Involves all the disks: read the unchanged data blocks (D1, D2, D3), XOR them with the new data D0' to form the new parity P', then write D0' and P'.
[Figure: new data D0' XORed with D1, D2, D3 to produce P'; D0' and P' are written back]
24. RAID 4: Write Update
RAID-4/5 small write algorithm: 1 logical write → 2 reads + 2 writes (see the sketch below)
[Figure: read the old data D0 and the old parity P, XOR both with the new data D0', then write D0' and the new parity P']
- Involves just two disks
- Fewer read/write operations
- Increasing the size of the parity group increases the savings
- Bottleneck: the parity disk is updated on every write
- Spread the parity info across all disks → RAID 5
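A sketch (illustrative, not the slides' notation) of the small-write parity update described above: the new parity is the old parity XORed with the old and the new data, so only two disks are read and two are written.

```python
def raid45_small_write_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """new_parity = old_parity XOR old_data XOR new_data, byte by byte."""
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

# toy 4-byte blocks
d0_old, d0_new = b"\x12\x34\x56\x78", b"\xff\x00\xff\x00"
p_old = b"\xaa\xbb\xcc\xdd"
p_new = raid45_small_write_parity(d0_old, d0_new, p_old)

# the parity changes by exactly the change in D0
assert bytes(a ^ b for a, b in zip(p_new, p_old)) == \
       bytes(a ^ b for a, b in zip(d0_new, d0_old))
```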
25. RAID 5: High I/O Rate Parity
[Figure: stripe units D0-D23 laid out across the disk columns in order of increasing logical disk address; the parity block P of each stripe rotates to a different disk]
A logical write becomes four physical I/Os; multiple writes can occur simultaneously as long as their stripe units (and parity blocks) are not located on the same disks.
Targeted for mixed applications.
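A sketch of one common way to rotate parity across the disks (a left-symmetric-style layout; real controllers may rotate differently), just to make "parity on a different disk each stripe" concrete.

```python
def raid5_layout(num_stripes, num_disks):
    """Per-stripe placement of data blocks and the parity block P, rotating left."""
    layout, block = [], 0
    for stripe in range(num_stripes):
        parity_disk = (num_disks - 1 - stripe) % num_disks   # rotate parity each stripe
        row = []
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append("P")
            else:
                row.append(f"D{block}")
                block += 1
        layout.append(row)
    return layout

for row in raid5_layout(3, 5):
    print(row)
# ['D0', 'D1', 'D2', 'D3', 'P']
# ['D4', 'D5', 'D6', 'P', 'D7']
# ['D8', 'D9', 'P', 'D10', 'D11']
```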
26. Subsystem Organization
[Figure: host → host adapter (interface to host, DMA) → array controller (control, buffering, parity logic) with cache → single-board disk controllers (physical device control) → disks]
No application modifications; no reduction of host performance.
27. System Availability: Orthogonal RAIDs
Fault-tolerant scheme to protect against string faults as well as disk faults.
28. System-Level Availability
Fully redundant: no single point of failure.
[Figure: duplicated hosts, I/O controllers, and cache/array controllers, all connected to shared recovery groups of disks]
With duplicated paths, higher performance when there are no failures.
29. Basic Computer Structure
[Figure: CPU and memory on the memory (system) bus, connected through a bridge to the I/O bus with devices such as a NIC]
30. A Typical PC Bus Structure
31. PC Bus View
[Figure: processor/memory bus at the top, bridged to a PCI bus, which in turn connects to lower-speed I/O buses]
32. I/O and Memory Buses
Memory buses are designed for speed (usually custom designs); I/O buses are designed for compatibility (usually industry standards) and low cost.
33. Buses

                 Network                Channel              Backplane
Connects         Machines               Devices              Chips
Distance         > 1000 m               10-100 m             0.1 m
Bandwidth        10-1000 Mb/s           40-1000 Mb/s         320-2000 Mb/s
Latency          high (> 1 ms)          medium               low (nanoseconds)
Reliability      low                    medium               high
Error handling   extensive CRC          byte parity          byte parity

Networks and channels are message-based, with narrow pathways and distributed arbitration; backplanes are memory-mapped, with wide pathways and centralized arbitration.
34. What Defines a Bus?
- Transaction protocol
- Timing and signaling specification
- A bunch of wires
- Electrical specification
- Physical/mechanical characteristics: the connectors
- Glue that interfaces the components of a computer system
35. Bus Issues
- Clocking: is the bus clocked?
- Synchronous
- Clocked, little logic → fast (includes a clock in the control lines)
- All devices need to run at the same clock rate
- To avoid clock skew, buses cannot be long if they are fast
- Asynchronous
- No clock, use handshaking instead → slower
- Can accommodate a wide range of devices
- Can be lengthened without worrying about clock skew
- Switching: when is control of the bus acquired and released?
- Atomic: bus held until the request completes → slow
- Split-transaction: bus free between request and reply → fast
36. More Bus Issues
- Arbitration: deciding who gets the bus next
- Chaos is avoided by a master-slave arrangement
- Processor as the only bus master → CPU involved in every transaction
- Overlap arbitration for the next master with the current transfer
- Daisy chain: closer devices have priority → simple but slow
- Distributed: wired-OR, low-priority back-off → medium
- Other issues
- Split data/address lines, width, burst transfer
[Figure: daisy-chain arbitration — the bus arbiter's grant line passes from Device 1 (highest priority) through Device 2 to Device N (lowest priority); devices share common request and release lines; order: request, grant, release]
37. Asynchronous Handshake: Write Operation
[Timing diagram: address, data, read, request, and acknowledge lines over t0-t5; the master asserts the address and data, then moves on to the next address]
- t0: Master has obtained control and asserts address, direction (not read), and data; waits a specified amount of time for slaves to decode the target
- t1: Master asserts the request line
- t2: Slave asserts ack, indicating the data has been received
- t3: Master releases req
- t4: Slave releases ack
38. Asynchronous Handshake: Read Operation
[Timing diagram: address, data, read, req, and ack lines over t0-t5; the master asserts the address, then the slave drives the data]
- t0: Master has obtained control and asserts address and direction; waits a specified amount of time for slaves to decode the target
- t1: Master asserts the request line
- t2: Slave asserts ack, indicating it is ready to transmit data
- t3: Master releases req once the data is received
- t4: Slave releases ack
39. Example: PCI Read/Write Transactions
- All signals are sampled on the rising edge
- Centralized parallel arbitration, overlapped with the previous transaction
- All transfers are (unlimited) bursts
- The address phase starts by asserting FRAME
- In the next cycle, the initiator asserts the command and address
- Data transfers happen when
- IRDY is asserted by the master when ready to transfer data
- TRDY is asserted by the target when ready to transfer data
- A transfer occurs when both are asserted on a rising edge
- FRAME is de-asserted when the master intends to complete only one more data transfer
40. Example: PCI Read Transaction
A turn-around cycle is required on any signal driven by more than one agent.
41. Who Does I/O?
- Main CPU (programmed I/O)
- Explicitly executes all I/O operations
- High overhead, potential cache pollution
- Memory-mapped I/O
- Physical addresses are set apart (no real memory behind them!)
- When the processor sees these addresses, it steers the access to the I/O device/processor
- I/O processor (IOP, or channel processor)
- Special- or general-purpose processor dedicated to I/O operations
- Fast
- May be overkill; cache-coherence problems
- DMAC (direct memory access controller)
- Can transfer data to/from memory given a start address
- Fast, usually simple
- May still have coherence problems; must be on the memory bus
42. Programmed I/O vs. DMA
- Programmed I/O is fine for sending commands, receiving status, and communicating small amounts of data
- Inefficient for large amounts of data
- Keeps the CPU busy during the transfer
- Programmed I/O → memory operations → slow
- Direct memory access
- The device reads/writes directly from/to memory
- Memory → device transfers are typically initiated by the CPU
- Device → memory transfers can be initiated by either the device or the CPU
43. Programmed I/O vs. DMA
[Figure: three CPU-memory-interconnect configurations — programmed I/O (all data passes through the CPU), DMA, and DMA with direct device-to-memory transfers]
44. Six Steps to Perform a DMA Transfer
45. Communicating with I/O Processors
- Not an issue if the main CPU performs I/O by itself
- I/O control: how to initialize the DMAC/IOP?
- Memory-mapped, VM-protected addresses
- Privileged I/O instructions
- I/O completion: how does the CPU know the DMAC/IOP is finished?
- Polling: periodically check a status bit → slow
- Interrupt: I/O completion interrupts the CPU → fast
- Q: do the DMAC/IOP use physical or virtual addresses?
- Physical: simpler, but can only transfer 1 page at a time. Why? (see the sketch below)
- Virtual: more powerful, but the DMAC/IOP needs a TLB
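The "1 page at a time" limit exists because a buffer that is contiguous in virtual memory may be scattered across physical page frames. A small sketch (hypothetical page size and page table) of how an OS would split one logical transfer into per-page segments for a physically addressed DMA engine:

```python
PAGE_SIZE = 4096  # assumed page size

def dma_segments(vaddr, length, page_table):
    """Split a virtually contiguous buffer into (physical_address, length) chunks."""
    segments = []
    while length > 0:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        chunk = min(length, PAGE_SIZE - offset)        # stop at the page boundary
        segments.append((page_table[vpage] * PAGE_SIZE + offset, chunk))
        vaddr, length = vaddr + chunk, length - chunk
    return segments

# virtual pages 0x10 and 0x11 map to non-adjacent physical frames
for paddr, n in dma_segments(0x10F00, 0x300, {0x10: 0x80, 0x11: 0x23}):
    print(hex(paddr), hex(n))   # 0x80f00 0x100, then 0x23000 0x200
```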
46. Polling vs. Interrupts
- Polling
- Busy-wait cycle to wait for I/O from the device
- Inefficient unless the device is very fast
- Interrupts
- A CPU interrupt-request line is triggered by the I/O device
- The interrupt handler receives interrupts
- Maskable to ignore or delay some interrupts
- An interrupt vector dispatches the interrupt to the correct handler
- Based on priority
- Some are non-maskable
- The interrupt mechanism is also used for exceptions
47. Example: Interrupt vs. DMA
- 1000 transfers of 1000 bytes each (see the sketch below)
- Interrupt mechanism
- 1000 interrupts @ 2 microsec per interrupt
- 1000 interrupt services @ 98 microsec each
- → 0.1 CPU seconds
- Device transfer rate 10 MB/s → 1000 bytes take 100 microsec
- 1000 transfers × 100 microsec = 0.1 CPU seconds
- Total: 0.2 CPU seconds (≈ 50% overhead due to interrupt handling)
- DMA
- 1 DMA setup sequence @ 50 microsec
- 1 interrupt @ 2 microsec
- 1 interrupt service sequence @ 48 microsec
- Total: 0.0001 seconds of CPU time
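The same arithmetic as a quick sketch (all figures are the slide's assumed values):

```python
US = 1e-6  # microseconds to seconds

# Interrupt-driven: per-transfer interrupt + service routine, plus the copy itself
transfers = 1000
interrupt_cpu = transfers * (2 + 98) * US        # 0.1 s of interrupt handling
copy_cpu = transfers * (1000 / 10e6)             # 1000 B at 10 MB/s = 100 us each -> 0.1 s
print("interrupt-driven:", interrupt_cpu + copy_cpu, "CPU seconds")   # 0.2

# DMA: one setup sequence, one completion interrupt, one service routine
dma_cpu = (50 + 2 + 48) * US
print("DMA:", dma_cpu, "CPU seconds")            # 0.0001
```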
48. Other Issues/Trends
- Block servers versus filers: where is the file illusion maintained?
- Traditional answer: the server
- Access storage as disk blocks and maintain the metadata
- Use a file cache
- Alternative: the disk subsystem maintains the file abstraction
- The server uses a file-system protocol to communicate
- Ex: Network File System (NFS) for UNIX, Common Internet File System (CIFS) for Windows
- Referred to as network-attached storage (NAS) devices
- Switches replacing buses
- Moore's Law is driving the cost of switch components down
- Replace the bus with point-to-point links and switches
- Switched networks provide higher aggregate bandwidth
- Point-to-point links can be longer
- Ex: InfiniBand delivers 2-24 Gbps, and a link can go up to 17 m (PCI: 0.5 m)
49. I/O System Example
- Given
- 500 MIPS CPU
- 16-byte-wide, 100 ns memory system
- 10,000 instructions per I/O
- 16 KB per I/O
- 200 MB/s I/O bus, with room for 20 SCSI-2 controllers
- SCSI-2 strings: 20 MB/s with 15 disks per bus
- SCSI-2: 1 ms overhead per I/O
- 7,200 RPM (120 RPS), 8 ms avg seek, 6 MB/s transfer disks
- 200 GB total storage
- Q: choose 2 GB or 8 GB disks for maximum IOPS?
- How should disks and controllers be arranged?
A similar example is in the book on page 744.
50. I/O System Example (cont'd)
- Step 1: calculate the CPU, memory, and I/O bus peak IOPS
- CPU: 500 MIPS / (10,000 instructions/IO) = 50,000 IOPS
- Memory: (16 bytes / 100 ns) / (16 KB/IO) = 10,000 IOPS
- I/O bus: (200 MB/s) / 16 KB = 12,500 IOPS
- The memory bus is the bottleneck at 10,000 IOPS!
- Step 2: calculate disk IOPS
- t_disk = 8 ms + 0.5/120 RPS + 16 KB / (6 MB/s) ≈ 15 ms
- Disk: 1 / 15 ms = 67 IOPS
- 8 GB disks → need 25 → 25 × 67 IOPS = 1,675 IOPS
- 2 GB disks → need 100 → 100 × 67 IOPS = 6,700 IOPS
- The 100 2-GB disks, at 6,700 IOPS, are the new bottleneck!
- Answer I: 100 2-GB disks!
51. I/O System Example (cont'd)
- Step 3: calculate the SCSI-2 controller peak IOPS
- t_SCSI-2 = 1 ms + 16 KB / (20 MB/s) ≈ 1.8 ms
- SCSI-2: 1 / 1.8 ms ≈ 556 IOPS
- Step 4: how many disks per controller?
- 556 IOPS / 67 IOPS ≈ 8 disks per controller
- Step 5: how many controllers?
- 100 disks / (8 disks/controller) ≈ 13 controllers
- Answer II: 13 controllers with 8 disks each (see the sketch below)
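Putting steps 1-5 together as a quick sketch (values from the slides; rounding matches the slides):

```python
KB, MB = 1024, 1_000_000
io_size = 16 * KB

# Step 1: peak IOPS of CPU, memory system, and I/O bus
cpu_iops = 500e6 / 10_000                         # 50,000
mem_iops = (16 / 100e-9) / io_size                # ~10,000  (the bottleneck)
bus_iops = 200 * MB / io_size                     # ~12,500

# Step 2: per-disk IOPS and the disk count for 200 GB of 2 GB disks
t_disk = 8e-3 + 0.5 / 120 + io_size / (6 * MB)    # ~15 ms
disk_iops = 1 / t_disk                            # ~67
disks = 100                                       # 100 x 2 GB disks
array_iops = min(disks * disk_iops, cpu_iops, mem_iops, bus_iops)

# Steps 3-5: SCSI-2 controllers
t_ctrl = 1e-3 + io_size / (20 * MB)               # ~1.8 ms
ctrl_iops = 1 / t_ctrl                            # ~556
disks_per_ctrl = int(ctrl_iops // disk_iops)      # 8
controllers = -(-disks // disks_per_ctrl)         # ceil(100 / 8) = 13

print(f"~{array_iops:.0f} IOPS, {controllers} controllers x {disks_per_ctrl} disks")
```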
52. Summary
- Disks
- Parameters, performance, RAID
- Buses
- I/O vs. memory
- I/O system architecture
- CPU vs. DMAC vs. IOP