1
STORAGE AND I/O
  • Jehan-François Pâris
  • jfparis@uh.edu

2
Chapter Organization
  • Availability and Reliability
  • Technology review
  • Solid-state storage devices
  • I/O Operations
  • Reliable Arrays of Inexpensive Disks

3
DEPENDABILITY
4
Reliability and Availability
  • Reliability
  • Probability R(t) that system will be up at time
    t if it was up at time t = 0
  • Availability
  • Fraction of time the system is up
  • Reliability and availability do not measure the
    same thing!

5
Which matters?
  • It depends
  • Reliability for real-time systems
  • Flight control
  • Process control, ...
  • Availability for many other applications
  • DSL service
  • File server, web server, ...

6
MTTF, MTTR and MTBF
  • MTTF is mean time to failure
  • MTTR is mean time to repair
  • 1/MTTF is the failure rate λ
  • MTBF, the mean time between failures, is
  • MTBF = MTTF + MTTR

7
Reliability
  • As a first approximation, R(t) = exp(-t/MTTF)
  • Not true if failure rate varies over time

8
Availability
  • Measured by MTTF/(MTTF + MTTR) = MTTF/MTBF
  • MTTR is very important
  • A good MTTR requires that we detect failures
    quickly

9
The nine notation
  • Availability is often expressed in "nines"
  • 99 percent is two nines
  • 99.9 percent is three nines
  • Formula is -log10(1 - A)
  • Example: -log10(1 - 0.999) = -log10(10^-3) = 3
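
As a quick illustrative sketch (not from the slides, and with made-up
MTTF and MTTR figures), the availability and "nines" formulas above
can be checked in a few lines of Python:

    import math

    def availability(mttf, mttr):
        # Fraction of time up: MTTF / (MTTF + MTTR) = MTTF / MTBF
        return mttf / (mttf + mttr)

    def nines(a):
        # Number of nines: -log10(1 - A)
        return -math.log10(1.0 - a)

    # Hypothetical server: MTTF = 1000 hours, MTTR = 10 hours
    a = availability(1000.0, 10.0)
    print(a, nines(a))   # about 0.990 and 2.0, i.e. two nines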

10
Example
  • A server crashes on average once a month
  • When this happens, it takes 12 hours to reboot it
  • What is the server availability?

11
Solution
  • MTBF = 30 days
  • MTTR = 12 hours = ½ day
  • MTTF = 29 ½ days
  • Availability is 29.5/30 ≈ 98.3%

12
Keep in mind
  • A 99 percent availability is not as great as we
    might think
  • One hour down every 100 hours
  • Fifteen minutes down every 24 hours
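
(Sketch of the arithmetic behind the last point: 1% downtime over
24 hours is 0.01 × 24 × 60 ≈ 14.4 minutes, i.e. roughly fifteen
minutes per day.)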

13
Example
  • A disk drive has a MTTF of 20 years.
  • What is the probability that the data it contains
    will not be lost over a period of five years?
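
(The deck does not show a solution for this example; as a quick
sketch using the formula from slide 7, R(5) = exp(-5/20) =
exp(-0.25) ≈ 0.78, i.e. roughly a 78% chance that the data
survives the five years.)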

14
Example
  • A disk farm contains 100 disks whose MTTF is 20
    years.
  • What is the probability that no data will be
    lost over a period of five years?

15
Solution
  • The aggregate failure rate of the disk farm is
  • 100 × 1/20 = 5 failures/year
  • The mean time to failure of the farm is
  • 1/5 year
  • We apply the formula
  • R(t) = exp(-t/MTTF) = exp(-5 × 5) = exp(-25) ≈ 1.4 × 10^-11
  • Almost zero chance!

16
TECHNOLOGY OVERVIEW
17
Disk drives
  • See previous chapter
  • Recall that the disk access time is the sum of
  • The disk seek time (to get to the right track)
  • The disk rotational latency
  • The actual transfer time

18
Flash drives
  • Widely used in flash drives, most MP3 players and
    some small portable computers
  • Similar technology to EEPROM
  • Two technologies

19
What about flash?
  • Widely used in flash drives, most MP3 players and
    some small portable computers
  • Several important limitations
  • Limited write bandwidth
  • Must erase a whole block of data before
    overwriting it
  • Limited endurance
  • 10,000 to 100,000 write cycles

20
Storage Class Memories
  • Solid-state storage
  • Non-volatile
  • Much faster than conventional disks
  • Numerous proposals
  • Ferro-electric RAM (FRAM)
  • Magneto-resistive RAM (MRAM)
  • Phase-Change Memories (PCM)

21
Phase-Change Memories
No moving parts
A data cell
Crossbar organization
22
Phase-Change Memories
  • Cells contain a chalcogenide material that has
    two states
  • Amorphous with high electrical resistivity
  • Crystalline with low electrical resistivity
  • Quickly cooling material from above fusion point
    leaves it in amorphous state
  • Slowly cooling material from above
    crystallization point leaves it in crystalline
    state

23
Projections
  • Target date: 2012
  • Access time: 100 ns
  • Data Rate: 200-1000 MB/s
  • Write Endurance: 10^9 write cycles
  • Read Endurance: no upper limit
  • Capacity: 16 GB
  • Capacity growth: > 40% per year
  • MTTF: 10-50 million hours
  • Cost: < $2/GB

24
Interesting Issues (I)
  • Disks will remain much cheaper than SCM for some
    time
  • Could use SCMs as an intermediate level between
    main memory and disks

Main memory
SCM
Disk
25
A last comment
  • The technology is still experimental
  • Not sure when it will come to the market
  • It might never come to the market

26
Interesting Issues (II)
  • Rather narrow gap between SCM access times and
    main memory access times
  • Main memory and SCM will interact
  • As the L3 cache interacts with the main memory
  • Not as the main memory now interacts with the disk

27
RAID Arrays
28
Today's Motivation
  • We use RAID today for
  • Increasing disk throughput by allowing parallel
    access
  • Eliminating the need to make disk backups
  • Disks are too big to be backed up in an efficient
    fashion

29
RAID LEVEL 0
  • No replication
  • Advantages
  • Simple to implement
  • No overhead
  • Disadvantage
  • If the array has n disks, its failure rate is n
    times the failure rate of a single disk

30
RAID levels 0 and 1
Mirrors
RAID level 1
31
RAID LEVEL 1
  • Mirroring
  • Two copies of each disk block
  • Advantages
  • Simple to implement
  • Fault-tolerant
  • Disadvantage
  • Requires twice the disk capacity of normal file
    systems

32
RAID LEVEL 2
  • Instead of duplicating the data blocks, we use an
    error correction code
  • Very bad idea because disk drives either work
    correctly or do not work at all
  • Only possible errors are omission errors
  • We need an omission correction code
  • A parity bit is enough to correct a single
    omission

33
RAID levels 2 and 3
Check disks
RAID level 2
Parity disk
RAID level 3
34
RAID LEVEL 3
  • Requires N + 1 disk drives
  • N drives contain data (1/N of each data block)
  • Block bk is now partitioned into N fragments
    bk,1, bk,2, ..., bk,N
  • Parity drive contains the exclusive or of these N
    fragments
  • pk = bk,1 ⊕ bk,2 ⊕ ... ⊕ bk,N
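
As an illustrative sketch (not from the slides, with made-up
fragment contents), the parity fragment is just the byte-wise XOR
of the N data fragments:

    def parity(fragments):
        # Byte-wise XOR of N equal-length fragments (RAID level 3 parity)
        p = bytearray(len(fragments[0]))
        for frag in fragments:
            for i, byte in enumerate(frag):
                p[i] ^= byte
        return bytes(p)

    # Hypothetical 4-byte fragments of one block (N = 4)
    fragments = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40",
                 b"\xaa\xbb\xcc\xdd", b"\x00\xff\x00\xff"]
    pk = parity(fragments)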

35
How parity works
  • Truth table for XOR (same as parity)

A  B  A⊕B
0  0   0
0  1   1
1  0   1
1  1   0
36
Recovering from a disk failure
  • Small RAID level 3 array with data disks D0 and
    D1 and parity disk P can tolerate failure of
    either D0 or D1

D0  D1  P    D1⊕P (= D0)  D0⊕P (= D1)
0   0   0         0            0
0   1   1         0            1
1   0   1         1            0
1   1   0         1            1
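
A minimal sketch of the same recovery idea on whole fragments
(made-up values, not from the slides): XORing the parity with the
surviving fragment reconstructs the lost one.

    def xor_all(fragments):
        # XOR together any number of equal-length fragments
        out = bytearray(len(fragments[0]))
        for frag in fragments:
            for i, byte in enumerate(frag):
                out[i] ^= byte
        return bytes(out)

    d0, d1 = b"\x0f\xf0", b"\x33\x55"   # hypothetical data fragments
    p = xor_all([d0, d1])               # parity fragment
    assert xor_all([d1, p]) == d0       # D1 XOR P recovers a lost D0
    assert xor_all([d0, p]) == d1       # D0 XOR P recovers a lost D1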
37
How RAID level 3 works (I)
  • Assume we have N + 1 disks
  • Each block is partitioned into N equal chunks

N = 4 in the example
38
How RAID level 3 works (II)
  • XOR data chunks to compute the parity chunk
  • Each chunk is written into a separate disk

Parity
39
How RAID level 3 works (III)
  • Each read/write involves all disks in RAID array
  • Cannot do two or more reads/writes in parallel
  • Performance of array not better than that of a
    single disk

40
RAID LEVEL 4 (I)
  • Requires N + 1 disk drives
  • N drives contain data
  • Individual blocks, not chunks
  • Blocks with same disk address form a stripe

41
RAID LEVEL 4 (II)
  • Parity drive contains exclusive or of the N
    blocks in stripe
  • pk = bk ⊕ bk+1 ⊕ ... ⊕ bk+N-1
  • Parity block now reflects contents of several
    blocks!
  • Can now do parallel reads/writes

42
RAID levels 4 and 5
Bottleneck
RAID level 4
RAID level 5
43
RAID LEVEL 5
  • Single parity drive of RAID level 4 is involved
    in every write
  • Will limit parallelism
  • RAID-5 distributes the parity blocks among the
    N + 1 drives (see the sketch below)
  • Much better
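
As a purely illustrative sketch (parity placement varies between
implementations), one simple way to rotate the parity block across
the N + 1 drives is to use the stripe number modulo the number of
drives:

    # One possible RAID-5 parity rotation; real controllers may differ.
    def parity_drive(stripe, n_drives):
        return stripe % n_drives

    N_DRIVES = 5    # 4 data blocks + 1 parity block per stripe
    for stripe in range(8):
        holder = parity_drive(stripe, N_DRIVES)
        data = [d for d in range(N_DRIVES) if d != holder]
        print(f"stripe {stripe}: parity on drive {holder}, data on drives {data}")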

44
The small write problem
  • Specific to RAID 5
  • Happens when we want to update a single block
  • Block belongs to a stripe
  • How can we compute the new value of the parity
    block?

A stripe: blocks bk, bk+1, bk+2, ... and their parity block pk
45
First solution
  • Read values of N-1 other blocks in stripe
  • Recompute
  • pk = bk ⊕ bk+1 ⊕ ... ⊕ bk+N-1
  • Solution requires
  • N-1 reads
  • 2 writes (new block and new parity block)

46
Second solution
  • Assume we want to update block bm
  • Read old values of bm and parity block pk
  • Compute
  • new pk = new bm ⊕ old bm ⊕ old pk
  • Solution requires
  • 2 reads (old values of block and parity block)
  • 2 writes (new block and new parity block)
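
As an illustrative sketch (made-up one-byte blocks), the second
solution can be checked against a full recomputation of the parity:

    # Small write: new parity = new block XOR old block XOR old parity
    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    # Hypothetical stripe of three data blocks and its parity block
    blocks = [b"\x11", b"\x22", b"\x33"]
    parity = xor_bytes(xor_bytes(blocks[0], blocks[1]), blocks[2])

    # Update block 1: two reads (old block, old parity), two writes
    new_b1 = b"\x99"
    new_parity = xor_bytes(xor_bytes(new_b1, blocks[1]), parity)

    # First solution (recompute from all blocks) gives the same parity
    blocks[1] = new_b1
    assert new_parity == xor_bytes(xor_bytes(blocks[0], blocks[1]), blocks[2])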

47
RAID level 6 (I)
  • Not part of the original proposal
  • Two check disks
  • Tolerates two disk failures
  • More complex updates

48
RAID level 6 (II)
  • Has become more popular as disks become
  • Bigger
  • More vulnerable to irrecoverable read errors
  • The most frequent cause of RAID level 5 array
    failures is
  • An irrecoverable read error occurring while the
    contents of a failed disk are reconstituted

49
RAID level 6 (III)
  • Typical array size is 12 disks
  • Space overhead is 2/12 ≈ 16.7%
  • Sole real issue is cost of small writes
  • Three reads and three writes
  • Read old value of block being updated, old parity
    block P, old parity block Q
  • Write new value of block being updated, new
    parity block P, new parity block Q

50
CONCLUSION (II)
  • Low cost of disk drives made RAID level 1
    attractive for small installations
  • Otherwise pick
  • RAID level 5 for higher parallelism
  • RAID level 6 for higher protection
  • Can tolerate one disk failure and irrecoverable
    read errors

51
A review question
  • Consider an array consisting of four 750 GB disks
  • What is the storage capacity of the array if we
    organize it
  • As a RAID level 0 array?
  • As a RAID level 1 array?
  • As a RAID level 5 array?

52
The answers
  • Consider an array consisting of four 750 GB disks
  • What is the storage capacity of the array if we
    organize it
  • As a RAID level 0 array? 3 TB
  • As a RAID level 1 array? 1.5 TB
  • As a RAID level 5 array? 2.25 TB
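
(Sketch of the arithmetic behind these answers: RAID 0 keeps all
four disks for data, 4 × 750 GB = 3 TB; RAID 1 mirrors everything,
so only 2 × 750 GB = 1.5 TB is usable; RAID 5 gives up one disk's
worth of space to parity, leaving 3 × 750 GB = 2.25 TB.)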

53
CONNECTING I/O DEVICES
54
Busses
  • Connecting computer subsystems with each other
    was traditionally done through busses
  • A bus is a shared communication link connecting
    multiple devices
  • Transmit several bits at a time
  • Parallel buses

55
Busses
56
Examples
  • Processor-memory busses
  • Connect CPU with memory modules
  • Short and high-speed
  • I/O busses
  • Longer
  • Wide range of data bandwidths
  • Connect to memory through processor-memory bus or
    backplane bus

57
Standards
  • Firewire
  • For external use
  • 63 devices per channel
  • 4 signal lines
  • 400 Mb/s or 800 Mb/s
  • Up to 4.5 m

58
Standards
  • USB 2.0
  • For external use
  • 127 devices per channel
  • 2 signal lines
  • 1.5 Mb/s (Low Speed), 12 Mb/s (Full Speed) and
    480 Mb/s (Hi Speed)
  • Up to 5 m

59
Standards
  • USB 3.0
  • For external use
  • Adds a 5 Gb/s transfer rate (Super Speed)
  • Maximum distance is still 5 m

60
Standards
  • PCI Express
  • For internal use
  • 1 device per channel
  • 2 signal lines per "lane"
  • Multiples of 250 MB/s
  • 1x, 2x, 4x, 8x, 16x and 32x
  • Up to 0.5 m

61
Standards
  • Serial ATA
  • For internal use
  • Connects cheap disks to computer
  • 1 device per channel
  • 4 data lines
  • 300 MB/s
  • Up to 1 m

62
Standards
  • Serial Attached SCSI (SAS)
  • For external use
  • 4 devices per channel
  • 4 data lines
  • 300 MB/s
  • Up to 8 m

63
Synchronous busses
  • Include a clock in the control lines
  • Bus protocols expressed in actions to be taken at
    each clock pulse
  • Have very simple protocols
  • Disadvantages
  • All bus devices must run at same clock rate
  • Due to clock skew issues, cannot be both fast
    and long

64
Asynchronous busses
  • Have no clock
  • Can accommodate a wide variety of devices
  • Have no clock skew issues
  • Require a handshaking protocol before any
    transmission
  • Implemented with extra control lines

65
Advantages of busses
  • Cheap
  • One bus can link many devices
  • Flexible
  • Can add devices

66
Disadvantages of busses
  • Shared devices
  • can become bottlenecks
  • Hard to run many parallel lines at high clock
    speeds

67
New trend
  • Away from parallel shared buses
  • Towards serial point-to-point switched
    interconnections
  • Serial
  • One bit at a time
  • Point-to-point
  • Each line links a specific device to another
    specific device

68
x86 bus organization
  • Processor connects to peripherals through two
    chips (bridges)
  • North Bridge
  • South Bridge

69
x86 bus organization
North Bridge
South Bridge
70
North bridge
  • Essentially a DMA controller
  • Lets disk controller access main memory w/o any
    intervention of the CPU
  • Connects CPU to
  • Main memory
  • Optional graphics card
  • South Bridge

71
South Bridge
  • Connects North bridge to a wide variety of I/O
    busses

72
Communicating with I/O devices
  • Two solutions
  • Memory-mapped I/O
  • Special I/O instructions

73
Memory mapped I/O
  • A portion of the address space reserved for I/O
    operations
  • Writes to any of these addresses are interpreted
    as I/O commands
  • Reading from these addresses gives access to
  • Error bit
  • I/O completion bit
  • Data being read

74
Memory mapped I/O
  • User processes cannot access these addresses
  • Only the kernel
  • Prevents user processes from accessing the disk
    in an uncontrolled fashion

75
Dedicated I/O instructions
  • Privileged instructions that cannot be executed
    by user processes
  • Only the kernel
  • Prevents user processes from accessing the disk
    in an uncontrolled fashion

76
Polling
  • Simplest way for an I/O device to communicate
    with the CPU
  • CPU periodically checks the status of pending I/O
    operations
  • High CPU overhead

77
I/O completion interrupts
  • Notify the CPU that an I/O operation has
    completed
  • Allows the CPU to do something else while waiting
    for the completion of an I/O operation
  • Multiprogramming
  • I/O completion interrupts are processed by CPU
    between instructions
  • No internal instruction state to save

78
Interrupts levels
  • See previous chapter

79
Direct memory access
  • DMA
  • Lets disk controller access main memory w/o any
    intervention of the CPU

80
DMA and virtual memory
  • A single DMA transfer may cross page boundaries
    with
  • One page being in main memory
  • One missing page

81
Solutions
  • Make DMA work with virtual addresses
  • The issue is then dealt with by the virtual memory
    subsystem
  • Break DMA transfers crossing page boundaries into
    chains of transfers that do not cross page
    boundaries
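
A minimal sketch (not from the slides) of the second approach:
split a transfer described by a start address and a length into
chunks that never cross a page boundary.

    PAGE_SIZE = 4096   # assumed page size

    def split_dma(addr, length, page_size=PAGE_SIZE):
        # Yield (address, length) chunks that each stay within one page
        while length > 0:
            chunk = min(length, page_size - (addr % page_size))
            yield addr, chunk
            addr += chunk
            length -= chunk

    # A 6000-byte transfer starting 100 bytes before a page boundary
    print(list(split_dma(4096 - 100, 6000)))
    # [(3996, 100), (4096, 4096), (8192, 1804)]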

83
An Example
Break one DMA transfer that crosses a page boundary into two DMA transfers
84
DMA and cache hierarchy
  • Three approaches for handling temporary
    inconsistencies between caches and main memory

85
Solutions
  • Routing all DMA accesses through the cache
  • Bad solution
  • Have OS selectively
  • Invalidate affected cache entries when
    performing a read
  • Force an immediate flush of dirty cache entries
    when performing a write
  • Have specific hardware do the same

86
Benchmarking I/O
87
Benchmarks
  • Specific benchmarks for
  • Transaction processing
  • Emphasis on speed and graceful recovery from
    failures
  • Atomic transactions
  • All or nothing behavior

88
An important observation
  • Very difficult to operate a disk subsystem at a
    reasonable fraction of its maximum throughput
  • Unless we sequentially access very large ranges
    of data
  • 512 KB and more

89
Major fallacies
  • Since rated MTTFs of disk drives exceed one
    million hours, disks can last more than 100 years
  • MTTF expresses the failure rate during the disk's
    actual lifetime
  • Disk failure rates in the field match the MTTFs
    mentioned in the manufacturers' literature
  • They are up to ten times higher
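
(Sketch of the arithmetic behind the first fallacy: 1,000,000 hours
÷ 8,760 hours per year ≈ 114 years, far longer than any drive's
actual service life.)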

90
Major fallacies
  • Neglecting to do end-to-end checks
  • Using magnetic tapes to back up disks
  • Tape formats can quickly become obsolete
  • Disk bit densities have grown much faster than
    tape data densities.

91
Can you read these?
On an old PC? No.
92
But you can still read this