I/O, Disks, and RAID — Lecture Transcript

1
I/O, Disks, and RAID
2
Goals for Today
  • Review I/O
  • How does a computer system interact with its
    environment?
  • Disks
  • How does a computer system permanently store
    data?
  • Prelim graded!
  • Discuss and pass back today
  • RAID
  • How to make storage both efficient and reliable?

3
The Requirements of I/O
  • So far in this course
  • We have learned how to manage CPU and memory
  • What about I/O?
  • Without I/O, computers are useless (disembodied
    brains?)
  • But there are thousands of devices, each slightly
    different
  • How can we standardize the interfaces to these
    devices?
  • Devices are unreliable: media failures and
    transmission errors
  • How can we make them reliable?
  • Devices are unpredictable and/or slow
  • How can we manage them if we don't know what they
    will do or how they will perform?
  • Some operational parameters
  • Byte/Block
  • Some devices provide a single byte at a time (e.g.
    keyboard)
  • Others provide whole blocks (e.g. disks,
    networks, etc.)
  • Sequential/Random
  • Some devices must be accessed sequentially (e.g.
    tape)
  • Others can be accessed randomly (e.g. disk, CD,
    etc.)
  • Polling/Interrupts
  • Some devices require continual monitoring;
    others generate interrupts when they need service

4
Modern I/O Systems
5
Example Device-Transfer Rates (Sun Enterprise
6000)
  • Device Rates vary over many orders of magnitude
  • System better be able to handle this wide range
  • Better not have high overhead/byte for fast
    devices!
  • Better not waste time waiting for slow devices

6
The Goal of the I/O Subsystem
  • Provide Uniform Interfaces, Despite Wide Range of
    Different Devices
  • This code works on many different devices:
      FILE *fd = fopen("/dev/something", "w");
      for (int i = 0; i < 10; i++)
          fprintf(fd, "Count %d\n", i);
      fclose(fd);
  • Why? Because the code that controls devices (the
    device driver) implements a standard interface.
  • We will try to get a flavor for what is involved
    in actually controlling devices in the rest of the
    lecture
  • Can only scratch the surface!

7
Want Standard Interfaces to Devices
  • Block Devices: e.g. disk drives, tape drives,
    DVD-ROM
  • Access blocks of data
  • Commands include open(), read(), write(), seek()
  • Raw I/O or file-system access
  • Memory-mapped file access possible
  • Character Devices: e.g. keyboards, mice, serial
    ports, some USB devices
  • Single characters at a time
  • Commands include get(), put()
  • Libraries layered on top allow line editing
  • Network Devices: e.g. Ethernet, Wireless,
    Bluetooth
  • Different enough from block/character to have own
    interface
  • Unix and Windows include socket interface
  • Separates network protocol from network operation
  • Includes select() functionality
  • Usage: pipes, FIFOs, streams, queues, mailboxes

8
How Does User Deal with Timing?
  • Blocking Interface: "Wait"
  • When requesting data (e.g. read() system call), put
    the process to sleep until data is ready
  • When writing data (e.g. write() system call), put
    the process to sleep until the device is ready for data
  • Non-blocking Interface: "Don't Wait"
  • Returns quickly from read or write request with
    count of bytes successfully transferred
  • Read may return nothing, write may write nothing
  • Asynchronous Interface: "Tell Me Later"
  • When requesting data, take pointer to user's buffer,
    return immediately; later kernel fills buffer and
    notifies user
  • When sending data, take pointer to user's buffer,
    return immediately; later kernel takes data and
    notifies user
  • (A sketch of the first two styles follows below.)
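A minimal POSIX-flavored sketch of the first two styles (the
asynchronous style would use an interface such as aio_read()):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char buf[64];

        /* Blocking: read() puts the process to sleep until data arrives */
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n > 0) printf("blocking read got %zd bytes\n", n);

        /* Non-blocking: returns immediately; -1 with EAGAIN means
           "nothing ready yet", not an error */
        int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
        fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);
        n = read(STDIN_FILENO, buf, sizeof buf);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("no data ready; free to do other work\n");
        return 0;
    }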

9
Life Cycle of An I/O Request
[Figure: a request passes from the User Program through the Kernel
I/O Subsystem, the Device Driver Top Half, and the Device Driver
Bottom Half, down to the Device Hardware]
10
A Kernel I/O Structure
11
Device Drivers
  • Device Driver: device-specific code in the kernel
    that interacts directly with the device hardware
  • Supports a standard, internal interface
  • Same kernel I/O system can interact easily with
    different device drivers
  • Special device-specific configuration supported
    with the ioctl() system call
  • Device drivers typically divided into two pieces
  • Top half: accessed in the call path from system calls
  • Implements a set of standard, cross-device calls
    like open(), close(), read(), write(), ioctl(),
    strategy()
  • This is the kernel's interface to the device
    driver
  • Top half will start I/O to device, may put thread
    to sleep until finished
  • Bottom half: runs as an interrupt routine
  • Gets input or transfers next block of output
  • May wake sleeping threads if I/O now complete
  • (A minimal sketch of this split follows below.)
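A minimal sketch of that split in C; the struct and names below are
illustrative, not taken from any particular kernel:

    /* Top half: the kernel invokes every driver through the same
       table of function pointers (the standard internal interface). */
    struct driver_ops {
        int (*open)(int minor);
        int (*close)(int minor);
        int (*read)(int minor, char *buf, int n);        /* may sleep */
        int (*write)(int minor, const char *buf, int n); /* may sleep */
        int (*ioctl)(int minor, int cmd, void *arg); /* device quirks */
    };

    /* Bottom half: registered as the device's interrupt handler; it
       grabs input or starts the next queued transfer, then wakes any
       thread the top half put to sleep. */
    void my_device_interrupt(void);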

12
I/O Device Notifying the OS
  • The OS needs to know when:
  • The I/O device has completed an operation
  • The I/O operation has encountered an error
  • I/O Interrupt
  • Device generates an interrupt whenever it needs
    service
  • Pro: handles unpredictable events well
  • Con: interrupts have relatively high overhead
  • Polling
  • OS periodically checks a device-specific status
    register
  • I/O device puts completion information in the status
    register
  • Could use a timer to invoke the lower half of drivers
    occasionally
  • Pro: low overhead
  • Con: may waste many cycles on polling if I/O
    operations are infrequent or unpredictable
  • Some devices combine both polling and interrupts
  • For instance: a high-bandwidth network device
  • Interrupt for the first incoming packet
  • Poll for following packets until the hardware queue
    is empty

13
How does the processor actually talk to the
device?
  • CPU interacts with a Controller
  • Contains a set of registers that can be read and
    written
  • May contain memory for request queues or
    bit-mapped images
  • Regardless of the complexity of the connections
    and buses, the processor accesses registers in two
    ways:
  • I/O instructions: in/out instructions
  • Example from the Intel architecture: out 0x21, AL
  • Memory-mapped I/O: load/store instructions
  • Registers/memory appear in the physical address space
  • I/O accomplished with load and store instructions
    (a polling sketch follows below)
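A minimal memory-mapped polling sketch; the register addresses and
status bit are invented for illustration, and such code runs in the
kernel or on bare metal, not in a normal user process:

    #include <stdint.h>

    /* volatile forces every access to really touch the device */
    #define DEV_STATUS (*(volatile uint32_t *)0xFEC00000u) /* hypothetical */
    #define DEV_DATA   (*(volatile uint32_t *)0xFEC00004u) /* hypothetical */
    #define STATUS_READY 0x1u

    uint32_t dev_read_word(void) {
        while (!(DEV_STATUS & STATUS_READY))
            ;                        /* poll the status register */
        return DEV_DATA;             /* an ordinary load instruction */
    }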

14
Transferring Data To/From Controller
  • Programmed I/O
  • Each byte transferred via processor in/out or
    load/store
  • Pro: simple hardware, easy to program
  • Con: consumes processor cycles proportional to
    data size
  • Direct Memory Access
  • Give controller access to memory bus
  • Ask it to transfer data to/from memory directly
  • Sample interaction with DMA controller (from
    book; a sketch follows below)
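A sketch of that interaction against a hypothetical DMA controller
(the register layout is invented for illustration):

    #include <stdint.h>

    #define DMA_ADDR  (*(volatile uint32_t *)0xFED00000u) /* buffer address */
    #define DMA_COUNT (*(volatile uint32_t *)0xFED00004u) /* byte count     */
    #define DMA_CTRL  (*(volatile uint32_t *)0xFED00008u) /* control/start  */

    volatile int dma_done;               /* set by the interrupt handler */

    void dma_transfer(uint32_t phys_addr, uint32_t nbytes) {
        dma_done  = 0;
        DMA_ADDR  = phys_addr;           /* where the data should go */
        DMA_COUNT = nbytes;              /* how much to move */
        DMA_CTRL  = 0x1u;                /* start: the controller, not
                                            the CPU, now moves the bytes */
        while (!dma_done)
            ;                            /* a real kernel would schedule
                                            other work instead of spinning */
    }

    void dma_interrupt_handler(void) { dma_done = 1; }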

15
Main Components of the Intel Pentium 4 Chipset
  • Northbridge
  • Handles memory
  • Graphics
  • Southbridge: I/O
  • PCI bus
  • Disk controllers
  • USB controllers
  • Audio
  • Serial I/O
  • Interrupt controller
  • Timers

16
The Memory Hierarchy
  • Each level acts as a cache for the layer below it

[Figure: the hierarchy from top to bottom — CPU registers and L1
cache; L2 cache; primary memory; disk storage (secondary memory,
random access); tape or optical storage (tertiary memory, sequential
access)]
17
Disks
18
What does the disk look like?
19
Some parameters
  • 2-30 heads (= platters × 2)
  • diameter: 14" down to 2.5"
  • 700-20,480 tracks per surface
  • 16-1,600 sectors per track
  • sector size:
  • 64-8k bytes
  • 512 for most PCs
  • note the inter-sector gaps
  • capacity: 20 MB-300 GB
  • main adjectives: BIG, slow

20
Disk overheads
  • To read from disk, we must specify
  • cylinder #, surface #, sector #, transfer size,
    memory address
  • Transfer time includes
  • Seek: time to get to the track
  • Latency: time to get to the sector
  • Transfer time: get the bits off the disk

[Figure: disk platter showing a track and a sector, with seek time
and rotational delay annotated]
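A rough worked example with assumed numbers: seek = 9 ms; at
10,000 rpm a half rotation costs 3 ms of rotational latency on
average; transferring 4 KB at 50 MB/s adds only about 0.08 ms.
Total ≈ 12 ms per random read, dominated almost entirely by the two
mechanical delays.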
21
Modern disks
22
52 years ago
  • On 13th September 1956, the IBM 305 RAMAC became the
    first computer system to use disk storage
  • The accompanying photo: 80,000 times more data on the
    8 GB 1-inch drive in the engineer's right hand than on
    the 24-inch RAMAC platter in his left

23
Disks vs. Memory
  • In each row: disk vs. memory
  • Smallest write: sector vs. (usually) byte
  • Atomic write: sector vs. byte or word
  • Random access: ~5 ms (not on a good curve) vs. ~50 ns
    (getting faster all the time)
  • Sequential access: ~200 MB/s vs. 200-1000 MB/s
  • Cost: ~$0.002/MB vs. ~$0.10/MB
  • Crash: doesn't matter (non-volatile) vs. contents gone
    (volatile)

24
Disk Structure
  • Disk drives are addressed as 1-dimensional arrays of
    logical blocks
  • The logical block is the smallest unit of
    transfer
  • This array is mapped sequentially onto disk sectors
  • Address 0 is the 1st sector of the 1st track of the
    outermost cylinder
  • Addresses are incremented within a track, then within the
    tracks of the cylinder, then across cylinders,
    from outermost to innermost
  • Translation is theoretically possible, but
    usually difficult (an idealized sketch follows below)
  • Some sectors might be defective
  • Number of sectors per track is not a constant
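A minimal sketch of that idealized translation, assuming a fixed
number of heads and sectors per track (real drives violate this,
which is what makes the real translation difficult):

    /* logical block address -> (cylinder, head, sector) on an
       idealized disk with uniform geometry */
    void lba_to_chs(unsigned lba, unsigned heads, unsigned spt,
                    unsigned *cyl, unsigned *head, unsigned *sect) {
        *cyl  = lba / (heads * spt);
        *head = (lba / spt) % heads;
        *sect = lba % spt + 1;           /* sectors number from 1 */
    }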

25
Non-uniform sectors / track
  • Maintain the same data rate with Constant Linear
    Velocity
  • Approaches
  • Reduce bit density per track for the outer tracks
  • Have more sectors per track on the outer tracks
    (virtual geometry)

26
Disk Scheduling
  • The operating system tries to use hardware
    efficiently
  • for disk drives → fast access time and high disk
    bandwidth
  • Access time has two major components
  • Seek time: time to move the heads to the
    cylinder containing the desired sector
  • Rotational latency: additional time waiting to
    rotate the desired sector to the disk head
  • Minimize seek time
  • Seek time ∝ seek distance
  • Disk bandwidth: total number of bytes
    transferred, divided by the total time between
    the first request for service and the completion
    of the last transfer

27
Disk Scheduling (Cont.)
  • Several scheduling algorithms exist to service disk
    I/O requests
  • We illustrate them with a request queue (cylinders 0-199)
  • 98, 183, 37, 122, 14, 124, 65, 67
  • Head pointer at cylinder 53

28
FCFS
Illustration shows total head movement of 640
cylinders.
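Worked check of the 640 figure: starting at 53, the head travels
|53-98| + |98-183| + |183-37| + |37-122| + |122-14| + |14-124| +
|124-65| + |65-67| = 45 + 85 + 146 + 85 + 108 + 110 + 59 + 2 = 640
cylinders.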
29
SSTF
  • Selects request with minimum seek time from
    current head position
  • SSTF scheduling is a form of SJF scheduling
  • may cause starvation of some requests.
  • Illustration shows total head movement of 236
    cylinders.
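Worked check: the head visits 53 → 65 → 67 → 37 → 14 → 98 → 122 →
124 → 183, for 12 + 2 + 30 + 23 + 84 + 24 + 2 + 59 = 236 cylinders.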

30
SSTF (Cont.)
31
SCAN
  • The disk arm starts at one end of the disk and
    moves toward the other end, servicing requests
    along the way
  • Head movement reverses when the arm reaches the
    other end of the disk, and servicing continues
  • Sometimes called the elevator algorithm
  • Illustration shows total head movement of 236
    cylinders
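Worked check (head first moving toward cylinder 0): 53 → 37 → 14 → 0
covers 53 cylinders; reversing, 0 → 65 → 67 → 98 → 122 → 124 → 183
covers another 183, again 236 in total.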

32
SCAN (Cont.)
33
C-SCAN
  • Provides a more uniform wait time than SCAN
  • The head moves from one end of the disk to the
    other, servicing requests as it goes
  • When it reaches the other end, it immediately
    returns to the beginning of the disk
  • No requests are serviced on the return trip
  • Treats the cylinders as a circular list that
    wraps around from the last cylinder to the
    first one

34
C-SCAN (Cont.)
35
C-LOOK
  • Version of C-SCAN
  • Arm only goes as far as last request in each
    direction,
  • then reverses direction immediately,
  • without first going all the way to the end of the
    disk.

36
C-LOOK (Cont.)
37
Selecting a Good Algorithm
  • SSTF is common and has a natural appeal
  • SCAN and C-SCAN perform better under heavy load
  • Performance depends on the number and types of
    requests
  • Requests for disk service can be influenced by
    the file-allocation method
  • The disk-scheduling algorithm should be a separate OS
    module, allowing it to be replaced with a different
    algorithm if necessary
  • Either SSTF or LOOK is a reasonable default algorithm

38
Summary
  • I/O Device Types
  • Many different speeds (0.1 bytes/sec to
    GBytes/sec)
  • Different Access Patterns
  • Block Devices, Character Devices, Network Devices
  • Different Access Timing
  • Blocking, Non-blocking, Asynchronous
  • I/O Controllers: hardware that controls the actual
    device
  • Processor accesses them through I/O instructions, or
    load/store to special physical memory
  • Report their results through either interrupts or
    a status register that the processor looks at
    occasionally (polling)
  • Device Driver: device-specific code in the kernel
  • Disks
  • Latency = Seek + Rotational + Transfer
  • Also, queuing time
  • Rotational latency: on average ½ rotation
  • Improve performance (decrease queuing time) via
    scheduling

39
Announcements
  • Homework 4 available later tonight
  • It is a programming assignment, so start early
  • Prelims graded
  • Mean 67.7 (Median 67), Stddev 14.2, High 96 out
    of 100!
  • Good job!
  • Re-grade policy
  • Submit written re-grade request to Nazrul.
  • Entire prelim will be re-graded.
  • We were generous the first time
  • If still unhappy, submit another re-grade
    request.
  • Nazrul will re-grade it herself
  • If still unhappy, submit a third re-grade
    request.
  • I will re-grade. Final grade is law.

40
Grade distribution
41
Question 2
  • Algorithm
  • (1) Pick up a knife
  • (2) Pick up a fork
  • (3) Cut out a slice of pizza and eat it
  • (4) Return the knife and fork to the pile
  • Correctness Constraints
  • Wait for a knife and then a fork, in that order!
  • Key: deadlock cannot occur since the algorithm
    defines a partial order on resource acquisition;
    thus, no circular waiting exists (see the sketch
    below)
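A minimal pthreads sketch of that ordering constraint, modeling the
pile as a single knife and a single fork (names are illustrative):

    #include <pthread.h>

    pthread_mutex_t knife = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t fork_ = PTHREAD_MUTEX_INITIALIZER; /* fork() is taken */

    void eat_slice(void) {
        pthread_mutex_lock(&knife);   /* (1) always the knife first...   */
        pthread_mutex_lock(&fork_);   /* (2) ...then the fork: a partial
                                         order, so no cycle can form     */
        /* (3) cut out a slice of pizza and eat it */
        pthread_mutex_unlock(&fork_); /* (4) return both to the pile     */
        pthread_mutex_unlock(&knife);
    }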

42
Question 3
  • 32-bit virtual address and 32-bit physical
    address, 8 kB pages
  • Bits for offset? Bits for index?
  • 13 and 19, respectively (2^13 B = 8 kB of offset;
    32 - 13 = 19 bits of index)
  • Bytes required for a PTE? Bytes required for the page
    table?
  • 3 bytes and 2^19 × 3 B = 1.5 MB, respectively
    (a PTE holds a 19-bit frame number plus control bits,
    rounded up to 3 bytes; the table has 2^19 entries)
43
Question 3 continued
  • 32-bit virtual address and 24-bit physical
    address, 8 kB pages
  • Bits for offset? Bits for index?
  • 13 and 19, respectively
  • Bytes required for a PTE? Bytes required for the page
    table?
  • 2 bytes and 2^19 × 2 B = 1 MB, respectively
    (the frame number is now only 24 - 13 = 11 bits,
    so a PTE fits in 2 bytes)
44
Question 4
  • Give a brief definition of the term working
    set?
  • Virtual memory pages touched within a window of
    time (or window of page references).

45
Question 5 CPU Scheduling
  • CPU utilization w/ 10 I/O-bound processes and 1
    CPU-bound
  • I/O bound: compute for 1 ms, sleep for 10 ms
  • CPU bound: computes indefinitely
  • Context-switch overhead is 0.1 ms
  • CPU utilization w/ 1 ms quantum?
  • The scheduler incurs a 0.1 ms context-switching cost
    for every context switch, regardless of process
    type
  • CPU util = execTime / (execTime + contextSwitch)
    = 1 / (1 + 0.1) ≈ 0.9090
  • CPU utilization w/ 10 ms quantum?
  • (#I/O × exI + #CPU × exC) /
    (#I/O × (exI + cs) + #CPU × (exC + cs))
  • = (10 × 1 + 1 × 10) / (10 × (1 + 0.1) + 1 × (10 + 0.1))
  • = 20 / (11 + 10.1) = 20 / 21.1 ≈ 0.9479

46
Question 5 continued
  • What strategy can a process employ to maximize
    the amount of CPU time allocated to that process?
  • Multilevel(-feedback) queue:
  • Use a large fraction of the assigned quantum,
    then relinquish the CPU before the end of the
    quantum, thus increasing the priority associated
    with the process
  • Round robin:
  • Use the entire quantum
  • Or say no specific strategy
  • Alternatively, use more threads

47
How is the disk formatted?
  • After manufacturing, the disk has no information
  • It is a stack of platters coated with magnetizable
    metal oxide
  • Before use, each platter receives a low-level
    format
  • The format lays down a series of concentric tracks
  • Each track contains some number of sectors
  • There is a short gap between sectors
  • A preamble allows the hardware to recognize the
    start of a sector
  • It also contains the cylinder and sector numbers
  • Data is usually 512 bytes
  • An ECC field is used to detect and recover from read
    errors

48
Cylinder Skew
  • Why cylinder skew?
  • How much skew?
  • Example: if
  • 10,000 rpm
  • Drive rotates in 6 ms
  • Track has 300 sectors
  • New sector every 20 µs
  • If track-to-track seek time = 800 µs
  • 40 sectors pass during the seek
  • Cylinder skew = 40 sectors

49
Formatting and Performance
  • If 10K rpm, 300 sectors of 512 bytes per track
  • 153,600 bytes every 6 ms → 24.4 MB/sec transfer
    rate
  • If the disk controller buffer can store only one
    sector
  • For 2 consecutive reads, the 2nd sector flies past
    while the 1st is being transferred to memory
  • Idea: use single/double interleaving

50
Disk Partitioning
  • Each partition is like a separate disk
  • Sector 0 is the MBR
  • Contains boot code + the partition table
  • The partition table has the starting sector and size
    of each partition
  • High-level formatting
  • Done for each partition
  • Specifies boot block, free list, root directory,
    empty file system
  • What happens on boot?
  • BIOS loads the MBR; the boot program checks which
    partition is active
  • Reads the boot sector from that partition, which then
    loads the OS kernel, etc.
  • (a sketch of the MBR layout follows)
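A sketch of the classic MBR layout in C; the struct and field names
are ours, but the on-disk format (446 bytes of boot code, four
16-byte partition entries, 2-byte signature) is standard:

    #include <stdint.h>

    #pragma pack(push, 1)
    struct mbr_partition {              /* one of the 4 table entries */
        uint8_t  status;                /* 0x80 = active (bootable)   */
        uint8_t  chs_first[3];          /* CHS of first sector        */
        uint8_t  type;                  /* partition type code        */
        uint8_t  chs_last[3];           /* CHS of last sector         */
        uint32_t lba_start;             /* starting sector (LBA)      */
        uint32_t num_sectors;           /* partition size in sectors  */
    };
    struct mbr {                        /* sector 0: exactly 512 bytes */
        uint8_t              boot_code[446];
        struct mbr_partition part[4];
        uint16_t             signature; /* 0xAA55                     */
    };
    #pragma pack(pop)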

51
Handling Errors
  • A disk track with a bad sector
  • Solutions
  • Substitute a spare for the bad sector (sector
    sparing)
  • Shift all sectors to bypass bad one (sector
    forwarding)

52
RAID Motivation
  • Disks are improving, but not as fast as CPUs
  • 1970s: seek time 50-100 ms
  • 2000s: seek time <5 ms
  • Factor of 20 improvement in 3 decades
  • We can use multiple disks to improve
    performance
  • By striping files across multiple disks (placing
    parts of each file on a different disk), parallel
    I/O can improve access time
  • Striping reduces reliability
  • 100 disks have 1/100th the mean time between failures
    of one disk (worked example below)
  • So, we need striping for performance, but we need
    something to help with reliability / availability
  • To improve reliability, we can add redundant data
    to the disks, in addition to striping
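For example, with an assumed figure: if a single disk's mean time
between failures is 100,000 hours (over 11 years), an array of 100
such disks expects some disk to fail every 100,000 / 100 = 1,000
hours, i.e. roughly every six weeks.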

53
RAID
  • A RAID is a Redundant Array of Inexpensive Disks
  • In industry, the "I" is for Independent
  • The alternative is a SLED: Single Large Expensive
    Disk
  • Disks are small and cheap, so it's easy to put
    lots of disks (10s to 100s) in one box for
    increased storage, performance, and availability
  • The RAID box with a RAID controller looks just
    like a SLED to the computer
  • Data plus some redundant information is striped
    across the disks in some way
  • How that striping is done is key to performance
    and reliability

54
Some RAID Issues
  • Granularity
  • Fine-grained: stripe each file over all disks.
    This gives high throughput for the file, but
    limits transfer to 1 file at a time
  • Coarse-grained: stripe each file over only a few
    disks. This limits throughput for 1 file but
    allows more parallel file access
  • Redundancy
  • Uniformly distribute redundancy info on disks:
    avoids load-balancing problems
  • Concentrate redundancy info on a small number of
    disks: partition the set into data disks and
    redundant disks

55
RAID Level 0
  • Level 0 is a non-redundant disk array
  • Files are striped across disks; no redundant info
  • High read throughput
  • Best write throughput (no redundant info to
    write)
  • Any disk failure results in data loss
  • Reliability worse than SLED

[Figure: stripes 0-11 laid out across four data disks]
56
RAID Level 1
  • Mirrored Disks
  • Data is written to two places
  • On failure, just use the surviving disk
  • On read, choose the faster copy to read
  • Write performance is the same as a single drive;
    read performance is up to 2x better
  • Expensive

[Figure: stripes 0-11 on four data disks, with identical copies on
four mirror disks]
57
Parity and Hamming Code
  • What do you need to do in order to detect and
    correct a one-bit error?
  • Suppose you have a binary number, represented as
    a collection of bits <b3, b2, b1, b0>, e.g. 0110
  • Detection is easy
  • Parity
  • Count the number of bits that are on, see if it's
    odd or even
  • EVEN parity is 0 if the number of 1 bits is even
  • Parity(<b3, b2, b1, b0>) = p0 = b0 ⊕ b1 ⊕ b2 ⊕ b3
  • Parity(<b3, b2, b1, b0, p0>) = 0 if all bits are
    intact
  • Parity(0110) = 0, Parity(01100) = 0
  • Parity(11100) = 1 ⇒ ERROR!
  • Parity can detect a single error, but can't tell
    you which of the bits got flipped

58
Parity and Hamming Code
  • Detection and correction require more work
  • Hamming codes can detect double-bit errors and
    correct single-bit errors
  • (7,4) Hamming Code
  • h0 = b0 ⊕ b1 ⊕ b3
  • h1 = b0 ⊕ b2 ⊕ b3
  • h2 = b1 ⊕ b2 ⊕ b3
  • h0(<1101>) = 0
  • h1(<1101>) = 1
  • h2(<1101>) = 0
  • Hamming(<1101>) = <b3, b2, b1, h2, b0, h1, h0>
    = <1100110>
  • If a bit is flipped, e.g. <1110110>
  • Recomputing on the received data bits <1111> gives
    <h2, h1, h0> = <111>; compared to the received check
    bits <010>, the syndrome is <111> ⊕ <010> = <101> = 5.
    Error occurred in bit 5.
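A small C sketch of exactly this (7,4) code, using the slide's bit
layout <b3, b2, b1, h2, b0, h1, h0> at positions 7..1:

    #include <stdio.h>

    /* Encode data bits b3..b0 (low nibble of d) into a 7-bit codeword. */
    unsigned hamming_encode(unsigned d) {
        unsigned b0 = d & 1, b1 = (d >> 1) & 1;
        unsigned b2 = (d >> 2) & 1, b3 = (d >> 3) & 1;
        unsigned h0 = b0 ^ b1 ^ b3;            /* covers positions 1,3,5,7 */
        unsigned h1 = b0 ^ b2 ^ b3;            /* covers positions 2,3,6,7 */
        unsigned h2 = b1 ^ b2 ^ b3;            /* covers positions 4,5,6,7 */
        return b3 << 6 | b2 << 5 | b1 << 4 | h2 << 3 | b0 << 2 | h1 << 1 | h0;
    }

    /* Syndrome = position (1..7) of a single flipped bit; 0 if clean. */
    unsigned hamming_syndrome(unsigned cw) {
        unsigned p[8];
        for (int i = 1; i <= 7; i++) p[i] = (cw >> (i - 1)) & 1;
        unsigned s1 = p[1] ^ p[3] ^ p[5] ^ p[7];
        unsigned s2 = p[2] ^ p[3] ^ p[6] ^ p[7];
        unsigned s4 = p[4] ^ p[5] ^ p[6] ^ p[7];
        return s4 << 2 | s2 << 1 | s1;
    }

    int main(void) {
        unsigned cw  = hamming_encode(0xD);  /* <1101> -> <1100110>    */
        unsigned bad = cw ^ (1u << 4);       /* flip position 5 (= b1) */
        printf("syndrome = %u\n", hamming_syndrome(bad));  /* prints 5 */
        return 0;
    }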

59
RAID Level 2
  • Bit-level striping with Hamming (ECC) codes for
    error correction
  • All 7 disk arms are synchronized and move in
    unison
  • Complicated controller
  • Single access at a time
  • Tolerates only one error, but with no performance
    degradation

[Figure: bits 0-3 striped across four data disks; Hamming check bits
4-6 on three ECC disks]
60
RAID Level 3
  • Use a parity disk
  • Each bit on the parity disk is a parity function
    of the corresponding bits on all the other disks
  • A read accesses all the data disks
  • A write accesses all data disks plus the parity
    disk
  • On disk failure, read remaining disks plus parity
    disk to compute the missing data

Single parity disk can be used to detect and
correct errors
[Figure: bits 0-3 striped across four data disks, plus a single
parity disk]
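A sketch of that recovery: the parity byte is the XOR of the
corresponding bytes on all data disks, so the lost disk is the XOR of
the parity disk with all survivors (the function shape is ours):

    #include <stddef.h>
    #include <stdint.h>

    void reconstruct(uint8_t *missing, const uint8_t *parity,
                     const uint8_t *survivor[], size_t nsurvivors,
                     size_t nbytes) {
        for (size_t i = 0; i < nbytes; i++) {
            uint8_t b = parity[i];
            for (size_t d = 0; d < nsurvivors; d++)
                b ^= survivor[d][i];    /* XOR out each surviving disk */
            missing[i] = b;             /* what the failed disk held   */
        }
    }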
61
RAID Level 4
  • Combines Levels 0 and 3: block-level parity with
    stripes
  • A read accesses all the data disks
  • A write accesses all data disks plus the parity
    disk
  • Heavy load on the parity disk

[Figure: block stripes 0-11 across four data disks; parity blocks
P0-3, P4-7, and P8-11 on a dedicated parity disk]
62
RAID Level 5
  • Block-Interleaved Distributed Parity
  • Like the parity scheme, but distributes the parity
    info over all disks (as well as data over all
    disks)
  • Better read performance; good performance on large
    writes
  • Reads can outperform SLEDs and RAID-0

[Figure: block stripes 0-11 with parity blocks P0-3, P4-7, and P8-11
rotated across all five disks]
63
RAID Level 6
  • Level 5 with an extra parity block
  • Can tolerate two failures
  • What are the odds of having two concurrent
    failures?
  • May outperform Level 5 on reads, slower on writes

64
RAID 0+1 and 1+0
65
Stable Storage
  • Handling disk write errors:
  • A write lays down bad data
  • A crash during a write corrupts the original data
  • What do we want to achieve? Stable Storage
  • When a write is issued, the disk either correctly
    writes the data, or it does nothing, leaving the
    existing data intact
  • Model
  • An incorrect disk write can be detected by
    looking at the ECC
  • It is very rare that the same sector goes bad on
    multiple disks
  • CPU is fail-stop

66
Approach
  • Use 2 identical disks
  • Corresponding blocks on both drives are the same
  • 3 operations (sketched below)
  • Stable write: retry on the 1st disk until successful,
    then write to the 2nd disk
  • Stable read: read from the 1st disk; if ECC error,
    then try the 2nd
  • Crash recovery: scan corresponding blocks on both
    disks
  • If one block is bad, replace it with the good one
  • If both are good, replace the block on the 2nd with
    the one on the 1st
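A sketch of the first two operations; disk_write/disk_read are
assumed primitives returning 0 on success (a failed read meaning a
bad ECC), not real APIs:

    int disk_write(int disk, long block, const void *buf); /* assumed */
    int disk_read(int disk, long block, void *buf);        /* assumed */

    void stable_write(long block, const void *buf) {
        while (disk_write(1, block, buf) != 0)
            ;                            /* retry disk 1 until it sticks */
        while (disk_write(2, block, buf) != 0)
            ;                            /* only then write disk 2       */
    }

    int stable_read(long block, void *buf) {
        if (disk_read(1, block, buf) == 0)
            return 0;                    /* disk 1's ECC checks out      */
        return disk_read(2, block, buf); /* else fall back to disk 2     */
    }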

67
CD-ROMs
  • The spiral makes 22,188 revolutions around the disk
    (approx. 600 per mm)
  • Stretched out, it would be 5.6 km long
  • Rotation rate varies from 530 rpm to 200 rpm

68
CD-ROMs
  • Logical data layout on a CD-ROM