Chapter 2. Data Storage

Provided by: Sir106
1
Chapter 2. Data Storage
2
Outline
  • Memory hierarchy
  • Hardware Disks
  • Access Times
  • Example - Megatron 747
  • Optimizations
  • Disk failure
  • RAIDs

3
Users
DBMSs
Operating Systems
Hardware - Data Storage
4
The Memory Hierarchy
DBMS
Programs, Main-memory DBMSs
Tertiary Storage
Main memory
Cache
5
Cache
  • The cache is an integrated circuit or part of the
    processor's chip
  • Holds data or machine instructions
  • Its contents are copies of main-memory locations
  • If data being expelled from the cache has been
    modified, then the new value must be copied into
    the main memory.
  • Typical performance
  • Capacities up to a megabyte
  • Access time: 10 nanoseconds (10^-8 seconds)
  • Moving data between cache and main memory: 100
    nanoseconds (10^-7 seconds)

6
Main Memory
  • Everything that happens in the computer is
    resident in main memory
  • Capacity around 100 Mbyte to 10 Gbyte
  • Random access
  • Typical access time is 10-100 nanoseconds

7
Virtual Memory
  • Is a part of disk
  • In a 32-bit address machine
  • Virtual memory grows up to 2^32 bytes (4 Gbytes)
  • Data is moved between disk and main memory in
    entire blocks, which are also called pages in
    main memory
  • Main-memory database systems

8
Secondary Storage (1)
  • Slower, more capacious than main memory
  • Random access
  • magnetic, optical, magneto-optical disks
  • Disk reads/writes are done by moving a chunk of
    bytes called a block (or page)

9
Secondary Storage (2)
  • Accessing a block 10-30 milliseconds
  • Recently, one disk unit can store data ranging
    from 10 to 32 Gbytes
  • A machine can have several disk units

10
Tertiary Storage (1)
  • Have been developed to hold data volumes measured
    in terabytes
  • Compared with secondary storage, it offers
  • Higher read/write times
  • Larger capacities and smaller cost per byte
  • Not random access in general

11
Tertiary Storage (2)
  • Kinds of tertiary storage devices
  • Ad-hoc tape storage
  • Optical-disk juke boxes: CD-ROMs
  • Tape silo: an automated version of ad-hoc tape
    storage
  • Capacities
  • CD: 2/3 Gbytes, 2.3 Gbytes
  • Tapes: 50 Gbytes
  • Access time: about 1000 times slower than
    secondary memory

12
Volatile and Nonvolatile
  • Volatile vs. nonvolatile storage
  • Flash memory
  • A form of main memory
  • Nonvolatile
  • Becomes economical
  • RAM disk
  • A battery-backed main memory

13
Access Time vs. Capacity
14
Moore's Law
  • Gordon Moore observed that the following improve
    by a factor of 2 about every 18 months
  • The speed of processors, i.e., the number of
    instructions executed per second and the ratio of
    the speed to cost of a processor
  • The cost of main memory per bit and the number of
    bits that can be put on one chip
  • The cost of disk per bit and the number of bytes
    that a disk can hold
  • Not applicable to
  • Main memory access time, disk access time

15
Disks
16
Disks A Top View
  • Cylinder, Track, Sector, Gap
  • Gaps often represent about 10% of the total
    track
  • An entire sector cannot be used if a portion of it
    is destroyed
  • Typically a block consists of one or more
    sectors.

top view
17
The Disk Controller
  • Controls one or more disk drives
  • controlling the mechanical actuator
  • selecting a surface or a sector on that surface
  • Transferring bits via a data bus

18
Disk Storage Characteristics (as of 1999)
  • Rotation speed of the disk assembly
  • 5400 RPM (one rotation every 11 milliseconds)
  • Number of platters per unit
  • Typical disk drive: 5 platters (10 surfaces)
  • Floppy/zip disk: 1 platter (2 surfaces)
  • Number of tracks per surface
  • Have as many as 10,000 tracks
  • 3.5 inch diskette: 40 tracks
  • Number of bytes per track
  • Common disk: 10^5 or more bytes
  • 3.5 inch diskette: 150K

19
Megatron 747 Disk (1)
  • Characteristics
  • Has 4 platters (8 surfaces)
  • 8192 (2^13) tracks per surface
  • On average 256 (2^8) sectors per track
  • 512 (2^9) bytes per sector
  • Diameters of tracks
  • outermost track is 3.5 inches
  • innermost track is 1.5 inches
  • Track consists of two parts
  • gap: 10%
  • data: 90%

20
Megatron 747 Disk (2)
  • The capacity of the disk
  • 8 surfaces × 8192 tracks × 256 sectors × 512
    bytes = 8 Gbytes
  • A single track on average
  • 256 sectors × 512 bytes = 128 Kbytes = 1 Mbit
  • A cylinder is 1 Mbyte on average
  • If a block is 4096 bytes (2^12)
  • A block uses 8 sectors (= 4096 bytes / 512 bytes)
  • A track consists of 32 blocks (= 256 sectors / 8)

21
Megatron 747 Disk (3)
  • Tracks in the Megatron 747 have different
    numbers of sectors
  • outer: 320 sectors
  • middle: 250 sectors
  • inner: 192 sectors
  • The outermost track
  • 1,801,800 bits / 9.9 ≈ 182,000 bpi
  • The innermost track
  • 47,880 bits / 4.2 ≈ 114,000 bpi
  • If each track had the same number (i.e. 256) of
    sectors, then the density of bits would be much
    greater on the inner tracks than on the outer ones
  • Length of the outermost track
  • 0.9 × 3.5 × π ≈ 9.9 inches
  • 1 megabit / 9.9 ≈ 100,000 bits per inch
  • Length of the innermost track
  • 0.9 × 1.5 × π ≈ 4.2 inches
  • 1 megabit / 4.2 ≈ 250,000 bits per inch
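
The uniform-sector density figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide: the constant and function names are illustrative, and the slide rounds its results (to about 100,000 and 250,000 bits per inch).

```python
import math

TRACK_BITS = 256 * 512 * 8   # 256 sectors x 512 bytes x 8 bits = 1 megabit
DATA_FRACTION = 0.9          # 90% of each track holds data, 10% is gaps


def bits_per_inch(diameter_in, track_bits=TRACK_BITS):
    """Bit density along the data-bearing part of a track of this diameter."""
    data_length = DATA_FRACTION * math.pi * diameter_in
    return track_bits / data_length


outer = bits_per_inch(3.5)   # outermost track, 3.5 inch diameter
inner = bits_per_inch(1.5)   # innermost track, 1.5 inch diameter
print(f"outermost: {outer:,.0f} bits per inch")
print(f"innermost: {inner:,.0f} bits per inch")
```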

22
The Latency of The Disk
block x in memory
I want block X
disk access time
  • Disk access time
  • seek time
  • rotational delay
  • transfer time
  • others

23
Seek Time
  • The time to position the head assembly at the
    proper cylinder
  • 0 (zero) if the heads are already at the proper
    cylinder
  • Otherwise, the time to move to the proper cylinder

24
Rotational Latency Time
  • The time for the disk to rotate until the first of
    the sectors containing the block reaches the head
  • One rotation takes 10 ms, so the rotational latency
    is 5 ms on average.

25
Transfer Time/Other delays
  • Transfer Time
  • the time to read/write the data on the
    appropriate disk surface
  • 10 Mbytes per second
  • Other delays (here, those are neglected)
  • taken by the processor and disk controller
  • due to contention for the disk controller
  • other delays due to contention

26
Modifying Blocks
  • Not possible to modify a block on disk directly
  • Sequence of procedures
  • Read block (time rt)
  • Modify in memory (time mt)
  • Write block (time wt)
  • Verify (time vt) if appropriate
  • Total time
  • rt + mt + wt + vt

27
Example 2.3 (1)
  • Let us examine the time to read a 4096-byte block
    from the Megatron 747 disk
  • Characteristic
  • 4 platters (8 surfaces), 1 surface = 8192 tracks
  • 1 track = 256 sectors, 1 sector = 512 bytes
  • Disk rotates at 3840 RPM, one rotation = 1/64 of
    a second
  • To move the head assembly
  • 1 ms (to start and stop) + 1 ms for every 500
    cylinders
  • Heads move one track in 1.002 ms
  • To move heads from innermost to outermost track
  • 1 + (8192 / 500) = 17.4 ms

28
Example 2.3 (2)
  • Minimum time (the best case)
  • No seek time, no rotational latency, only
    transfer time
  • Note: 1 track = 256 sectors, 1 sector = 512 bytes
  • 4096 bytes / 512 bytes = 8 sectors (including 7
    gaps)
  • gaps/sectors occupy 10%/90% of the track
  • A track has 256 gaps and 256 sectors
  • 36 × 7/256 + 324 × 8/256 = 11.109 degrees
  • (11.109/360)/64 = 4.8e-4 seconds ≈ 0.5 ms

29
Example 2.3 (3)
  • Maximum time (the worst case)
  • full seek time and rotational latency, plus
    transfer time
  • full seek time: 17.4 ms
  • full rotational time: 1/64 of a second = 15.6 ms
  • transfer time: 0.5 ms
  • 17.4 + 15.6 + 0.5 = 33.5 ms

30
Example 2.3 (4)
  • Average Time
  • Transfer time: 0.5 ms
  • Average rotational time: half of the full
    rotation = 7.8 ms
  • Average seek time
  • average distance traveled: 1/3 of the disk =
    2730 cylinders
  • 1 + 2730/500 = 6.5 ms
  • 0.5 + 7.8 + 6.5 = 14.8 ms
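
The best, worst, and average cases of Example 2.3 can be reproduced as follows. This is a minimal sketch using the parameters stated on the slides; the function and constant names are made up for illustration, and the unrounded results differ slightly from the slide's rounded sums.

```python
ROTATION_MS = 1000 / 64               # 3840 RPM = 64 rotations per second
CYLINDERS = 8192
SECTORS_PER_TRACK = 256
GAP_DEGREES, DATA_DEGREES = 36, 324   # 10% and 90% of 360 degrees


def seek_ms(cylinders_moved):
    """1 ms to start and stop plus 1 ms per 500 cylinders; 0 if no move."""
    return 0.0 if cylinders_moved == 0 else 1 + cylinders_moved / 500


def transfer_ms(sectors=8):
    """Time for `sectors` sectors and the gaps between them to pass the head."""
    degrees = (GAP_DEGREES * (sectors - 1)
               + DATA_DEGREES * sectors) / SECTORS_PER_TRACK
    return degrees / 360 * ROTATION_MS


best = transfer_ms()                                    # no seek, no latency
worst = seek_ms(CYLINDERS) + ROTATION_MS + transfer_ms()
average = seek_ms(CYLINDERS / 3) + ROTATION_MS / 2 + transfer_ms()

print(f"best {best:.2f} ms, worst {worst:.1f} ms, average {average:.1f} ms")
```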

31
RAM model vs. I/O model computation
  • I/O model computation
  • Dominance of I/O cost
  • Remember, 10^5 - 10^6 in-memory operations take
    the same time as one disk I/O
  • Should minimize the number of block accesses
  • Data Structure vs. File Processing

32
Using Secondary Storage Effectively
  • In general database
  • Whole databases are much too large to fit in main
    memory
  • Key parts of databases are buffered in main
    memory
  • Disk I/Os occur frequently
  • Main memory sorts (such as Quick sort) are
    inadequate

33
Merge Sort
34
Two-Phase, Multiway Merge-Sort (1)
  • Phase 1
  • Sort main-memory-sized pieces of the data
  • Fill all available main memory with blocks
  • Sort the records in main memory
  • Write the sorted records

35
Two-Phase, Multiway Merge-Sort (2)
  • Phase 2
  • Merge all the sorted sublists into a single
    sorted list
  • Find the smallest key among the first remaining
    elements of all the lists
  • Move the smallest element to the first available
    position of the output block
  • If output block is full, write it to disk and
    reinitialize the same buffer
  • Repeat until all input blocks become exhausted.
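
The two phases can be sketched in Python as below. This is an in-memory illustration only: plain lists stand in for sorted sublists on disk, `memory_records` and `block_records` are hypothetical parameters for how many records fit in memory and in one output block, and a heap is used as one convenient way to find the smallest first-remaining element.

```python
import heapq


def two_phase_merge_sort(records, memory_records, block_records):
    # Phase 1: sort memory-sized pieces; each sorted piece stands in
    # for a sorted sublist written to disk.
    sublists = [sorted(records[i:i + memory_records])
                for i in range(0, len(records), memory_records)]

    # Phase 2: repeatedly take the smallest of the first remaining
    # elements of all sublists.
    heap = [(run[0], idx, 0) for idx, run in enumerate(sublists) if run]
    heapq.heapify(heap)
    output, out_block = [], []
    while heap:
        key, idx, pos = heapq.heappop(heap)
        out_block.append(key)
        if len(out_block) == block_records:   # output block full: "write" it
            output.extend(out_block)
            out_block = []                    # reinitialize the same buffer
        if pos + 1 < len(sublists[idx]):      # advance within that sublist
            heapq.heappush(heap, (sublists[idx][pos + 1], idx, pos + 1))
    output.extend(out_block)                  # flush the last partial block
    return output


data = [5, 3, 9, 1, 7, 2, 8, 6, 4, 0]
print(two_phase_merge_sort(data, memory_records=4, block_records=3))
```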

36
Main-memory Organization
37
Merge Sort Example (1)
  • Assumption
  • 10,000,000 tuples, 1 tuple = 100 bytes
  • So, 1 Gbyte of data
  • 50 Mbytes of memory available
  • 4096-byte blocks, so each block contains 40
    records
  • Total # of blocks: 250,000
  • # of blocks in main memory: 12,800 (= 50 × 2^20 /
    2^12)
  • Number of sublists
  • 19 sublists (12,800 blocks each) + 1 sublist (6,800
    blocks)
  • Each block read or write: 15 ms

38
Merge Sort Example (2)
  • Computation
  • First phase
  • Read each of the 250,000 blocks once
  • Write 250,000 new blocks
  • Total time
  • (250,000 × 15 ms) × 2 = 7500 seconds = 125
    minutes
  • Second phase
  • Similar with the first phase
  • Total time 125 minutes
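
The block counts and phase times of this example follow directly from the stated assumptions. A short sketch of the arithmetic (variable names are illustrative):

```python
TUPLES = 10_000_000
TUPLE_BYTES = 100
BLOCK_BYTES = 4096
MEMORY_BYTES = 50 * 2**20     # 50 Mbytes of main memory
BLOCK_IO_MS = 15              # one block read or write

records_per_block = BLOCK_BYTES // TUPLE_BYTES       # whole records per block
total_blocks = TUPLES // records_per_block
memory_blocks = MEMORY_BYTES // BLOCK_BYTES
full_sublists, leftover_blocks = divmod(total_blocks, memory_blocks)

# Each phase reads every block once and writes every block once.
phase_minutes = total_blocks * 2 * BLOCK_IO_MS / 1000 / 60
total_minutes = 2 * phase_minutes

print(records_per_block, total_blocks, memory_blocks)
print(full_sublists, leftover_blocks, phase_minutes, total_minutes)
```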

39
Improving the Access Time of Secondary Storage
  • Place blocks on the same cylinder
  • Divide the data among several small disks
  • Mirroring disks
  • Use a disk-scheduling algorithm
  • Prefetch blocks to main memory in anticipation of
    their later use

40
Organizing Data by Cylinders
  • Use several adjacent cylinders
  • Read all the blocks on a single track or on a
    cylinder consecutively
  • Neglect all but the first seek time and the first
    rotational latency

41
Example 2.9 (1)
  • Recall examples 2.3 and 2.7
  • Original data may be stored on consecutive
    cylinders
  • Total # of cylinders: 1000 (= 1 Gbyte / 1 Mbyte)
  • Main memory can hold 50 cylinders (i.e. 50M)
  • To read 50 cylinders of data into main memory
  • 6.5 ms for average seek time
  • 49 ms for 49 one-cylinder seeks (1 ms each)
  • 6.4 seconds for transfer of 12,800 blocks
  • (12,800 × 0.5 ms) / 1000 = 6.4 seconds
  • So, 6.5 + 49 + 6,400 = 6455.5 ms

42
Example 2.9 (2)
  • First phase
  • Read
  • ((6.5 ms + 49 ms + 6.4 seconds) × 20 times) =
    2.15 minutes
  • Write: the same as reading
  • Total time: 4.3 minutes
  • Second phase
  • Still takes about 125 minutes (WHY ?)
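
For comparison with the 125 minutes per phase computed earlier, the first-phase time under the cylinder-based organization can be recomputed as follows. This is a sketch of the slide's arithmetic only; the constant names are illustrative.

```python
AVG_SEEK_MS = 6.5             # one average seek to reach the first cylinder
ONE_CYLINDER_SEEK_MS = 1.0    # moving to the adjacent cylinder
BLOCK_TRANSFER_MS = 0.5
CYLINDERS_PER_FILL = 50       # main memory holds 50 cylinders
BLOCKS_PER_FILL = 12_800
FILLS = 20                    # memory is filled 20 times in phase 1

fill_ms = (AVG_SEEK_MS
           + (CYLINDERS_PER_FILL - 1) * ONE_CYLINDER_SEEK_MS
           + BLOCKS_PER_FILL * BLOCK_TRANSFER_MS)
read_minutes = fill_ms * FILLS / 1000 / 60
phase1_minutes = 2 * read_minutes      # writing costs the same as reading

print(f"{fill_ms} ms per fill, {read_minutes:.2f} min to read, "
      f"{phase1_minutes:.1f} min for phase 1")
```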

43
Using Multiple Disks in place of One
  • Use several disks with their independent heads
  • Transfer data at a higher rate
  • Roughly speaking, total time could be divided by
    the number of disks

44
Example 2.10 (1)
  • Replace one 747 by four 737s which have one
    platter and two surfaces
  • Assumption
  • Divide the given records among the four disks
  • Occupy 1000 adjacent cylinders on each disk
  • Fill ¼ of main memory from each disk
  • Recall previous examples
  • Average seek time and rotational latency: 0
  • Number of blocks in a full memory: 12,800
  • ¼ of memory size: 3,200 blocks

45
Example 2.10 (2)
  • Computation
  • First phase
  • Transfer time: 3200 × 0.5 ms = 1.6 seconds
  • Read: (6.5 ms + 49 ms + 1.6 seconds) × 20 = 33
    sec.
  • Write: similar to reading
  • Total time: about 1 minute

46
Example 2.10 (3)
  • Second phase
  • Apply delicate techniques (?) to reduce disk I/O
    time
  • Start comparisons among the 20 lists as soon as
    the first element of the block appears in main
    memory
  • Use four output buffers
  • Total time: about 1 hour (?)

47
Mirroring Disks
  • Two or more disks hold identical copies of data
  • Survive a head crash on either disk
  • If we make n copies of a disk, we can read any n
    blocks in parallel.
  • Using mirror disks does not speed up writing, but
    neither does it slow writing down (to some
    extent)

48
Scheduling Requests by the Elevator Algorithm
  • The disk controller chooses which of several
    requests to execute first, to increase throughput
  • Elevator Algorithm
  • Proceed in the same direction until the next
    cylinder with blocks to access is encountered
  • When no requests ahead in direction of travel,
    reverse direction
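
The sweep-and-reverse rule can be sketched as below. This is a simplified illustration, not the scheduler of any particular controller: it assumes all requests are already pending when the sweep starts (new arrivals are ignored), and the function name and arguments are hypothetical.

```python
def elevator_order(start, requests, direction=1):
    """Order in which pending cylinder requests are served.

    `start` is the current head cylinder; `direction` is +1 (toward
    higher cylinders) or -1 (toward lower cylinders).
    """
    pending = list(requests)
    order = []
    position = start
    while pending:
        if direction > 0:
            ahead = [c for c in pending if c >= position]
            nxt = min(ahead) if ahead else None
        else:
            ahead = [c for c in pending if c <= position]
            nxt = max(ahead) if ahead else None
        if nxt is None:
            direction = -direction     # nothing ahead: reverse direction
            continue
        order.append(nxt)              # serve the next cylinder on the way
        pending.remove(nxt)
        position = nxt
    return order


print(elevator_order(5000, [1000, 7000, 3000, 9000]))
# → [7000, 9000, 3000, 1000]
```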

49
Example 2.11
Finishing times for block accesses using the
elevator algorithm
Finishing times for block accesses using the
first-come-first-served algorithm
Arrival times for six block-access requests
50
Prefetching Data on Track- or Cylinder-sized
Chunks
  • Can we predict the order in which blocks will be
    requested from disk ?
  • For example,
  • Devote two block buffers to each list when merged
    (when there is plenty of memory)
  • When a buffer is exhausted, switch to the other
    buffer for the same list

51
Single Buffering
  • Single buffering
  • Read B1 → Buffer
  • Process Data in Buffer
  • Read B2 → Buffer
  • Process Data in Buffer ...
  • Computation
  • P = time to process one block
  • R = time to read in 1 block
  • n = # of blocks
  • Single buffer time = n(P + R)

52
Single Buffering vs. Double Buffering
  • Memory
  • Disk

53
Double Buffering
  • Computation
  • P = processing time/block
  • R = I/O time/block
  • n = # of blocks
  • Double buffering time = R + nP (when P ≥ R)
  • Single buffering time = n(R + P)
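
The two formulas can be compared with a small sketch. The slide's double-buffering formula R + nP corresponds to the case where processing a block takes at least as long as reading one; the `max(P, R)` generalization below is an assumption added for illustration, not something stated on the slide.

```python
def single_buffer_time(n, P, R):
    """Reads and processing strictly alternate: n * (P + R)."""
    return n * (P + R)


def double_buffer_time(n, P, R):
    """Reading block i+1 overlaps with processing block i.

    The first read overlaps nothing, each of the n - 1 overlapped steps
    costs max(P, R), and the last block still has to be processed.
    With P >= R this reduces to R + n * P.
    """
    return R + (n - 1) * max(P, R) + P


n, P, R = 10, 5, 3
print(single_buffer_time(n, P, R), double_buffer_time(n, P, R))
```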

54
Prefetching
  • Combine prefetching with the cylinder-based
    strategy
  • Store the sorted sublists on whole, consecutive
    cylinders
  • Read whole tracks or cylinders whenever we need
    some records from a given list

55
Example 2.14 (1)
  • Consider the second phase of the sort
  • Have in main memory two track-sized buffers
  • A track = 128 KB
  • Total space requirement: 128 KB × 20 lists × 2 = 5
    Mbytes
  • Read all the blocks on 1000 cylinders (8000
    tracks)
  • Computation
  • average seek time: 6.5 ms
  • the time for the disk to rotate once: 15.6 ms
  • total time (for reading): (6.5 + 15.6) × 8000 =
    2.95 minutes

56
Example 2.14 (2)
  • Have in main memory two cylinder-sized buffers
    per sorted sublist
  • 1 cylinder = 8 tracks = 128K × 8 = 1M
  • Use 40 buffers of a megabyte each
  • 50 megabytes of main memory available
  • Need only do a seek once per cylinder
  • Read all the blocks on 1000 cylinders (8000
    tracks)
  • Total time (for reading)
  • (6.5 + 8 × 15.6) × 1000 cylinders = 2.19 minutes
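
The two reading times of Example 2.14 can be recomputed side by side. A sketch of the slide arithmetic, with illustrative constant names:

```python
AVG_SEEK_MS = 6.5
ROTATION_MS = 15.6            # one full rotation reads one whole track
TRACKS = 8000
CYLINDERS = 1000
TRACKS_PER_CYLINDER = 8

# Track-sized buffers: one seek and one rotation per track.
track_buffer_minutes = (AVG_SEEK_MS + ROTATION_MS) * TRACKS / 1000 / 60

# Cylinder-sized buffers: one seek per cylinder, then 8 rotations.
cylinder_buffer_minutes = (
    (AVG_SEEK_MS + TRACKS_PER_CYLINDER * ROTATION_MS) * CYLINDERS / 1000 / 60
)

print(f"track buffers: {track_buffer_minutes:.2f} min")
print(f"cylinder buffers: {cylinder_buffer_minutes:.2f} min")
```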

57
Block Size Selection
  • Big block ⇒ amortize I/O cost
  • Big block ⇒ read in more useless stuff and takes
    longer to read
  • As memory prices drop, blocks get bigger

58
Disk Failures
  • Intermittent failure
  • An attempt to read or write a sector is
    unsuccessful, but with repeated tries we are able
    to read or write successfully.
  • Media decay
  • A bit or bits are permanently corrupted, and the
    sector becomes unreadable.
  • Write failure
  • We can neither write successfully nor can we
    retrieve the previously written sector.
  • Disk Crash
  • When a disk becomes unreadable permanently

59
Checksums (1)
  • Each sector has additional bits, called the
    checksum, to check reading or writing operations
  • A read returns a pair (w, s)
  • w = the data that is read
  • s = a status bit
  • A simple form of checksum: parity

60
Checksums (2)
  • Example 1 (even parity)
  • The sequence of bits in a sector: 01101000
  • The parity bit is 1
  • Data becomes 011010001
  • Example 2 (even parity)
  • The sequence of bits in a sector: 11101110
  • The parity bit is 0
  • Data becomes 111011100
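
Both parity examples can be reproduced with a short sketch. Bit strings are used for readability, and the helper names are illustrative. The last lines show why a single parity bit catches a one-bit error but misses a two-bit error.

```python
def even_parity_bit(bits: str) -> str:
    """Return '1' if `bits` has an odd number of 1s, else '0'."""
    return str(bits.count("1") % 2)


def with_parity(bits: str) -> str:
    """Append the parity bit so that the total number of 1s is even."""
    return bits + even_parity_bit(bits)


def parity_ok(stored: str) -> bool:
    """A stored sector passes the check when its number of 1s is even."""
    return stored.count("1") % 2 == 0


print(with_parity("01101000"))   # → 011010001
print(with_parity("11101110"))   # → 111011100

print(parity_ok("111010001"))    # one bit flipped: check fails
print(parity_ok("101010001"))    # two bits flipped: error goes undetected
```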

61
Checksums (3)
  • An error may go undetected if more than one bit
    of the sector is corrupted
  • If we use n independent bits as a checksum, then
    the chance of missing an error is only 1/2^n (WHY
    ?)

62
Stable Storage (1)
  • How to correct errors ?
  • Stable storage is a technique for organizing a
    disk so that media decays or failed writes do not
    result in permanent loss.
  • The general idea is that sectors are paired, and
    each pair represents one sector-contents X
  • As the left (XL) and right (XR) copies

63
Stable Storage (2)
  • Writing policy
  • Write the value of X into XL
  • if the status is good, the write is done
  • if the status is bad, repeat the write
  • If it still fails after a set number of tries,
    assume a media failure in the sector
  • Repeat the above scheme for XR
  • Reading policy (to obtain the value of X)
  • Read XL
  • if status bad is returned, repeat the read
  • if status good is returned, take that value as X
  • If XL cannot be read, repeat the above with XR
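
The writing and reading policies can be sketched as follows. Everything here is a toy model: `Sector` is a hypothetical in-memory stand-in for a disk sector with a status flag, and the retry limit `MAX_TRIES` is an assumed value, since the slides only say "a number of times".

```python
MAX_TRIES = 3   # assumed retry limit


class Sector:
    """Toy in-memory sector (hypothetical stand-in for a disk sector)."""

    def __init__(self):
        self.value, self.good = None, False

    def write(self, value):
        self.value, self.good = value, True
        return self.good              # status: True means "good"

    def read(self):
        return self.value, self.good


def stable_write(x, left, right):
    """Write X into XL, then into XR, retrying each on bad status."""
    for sector in (left, right):
        # any() stops at the first write that reports good status.
        if not any(sector.write(x) for _ in range(MAX_TRIES)):
            raise IOError("media failure in sector")


def stable_read(left, right):
    """Read XL, retrying on bad status; fall back to XR."""
    for sector in (left, right):
        for _ in range(MAX_TRIES):
            value, good = sector.read()
            if good:
                return value
    raise IOError("both copies unreadable")


xl, xr = Sector(), Sector()
stable_write("X", xl, xr)
xl.good = False                       # simulate media decay of XL
print(stable_read(xl, xr))            # value is recovered from XR
```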

64
Recovery from Disk Crashes
  • Disk crash is fatal in mission-critical
    applications
  • RAID (redundant arrays of independent disks)
  • Here, we discuss levels 1, 4, 5, and 6
  • These RAID schemes also handle failures discussed
    previously

65
The Failure Model of Disks
  • Mean time to failure represents the length of
    time by which 50% of a population of disks will
    have failed catastrophically.
  • For modern disks, it is about 10 years

Fraction surviving
Time
66
RAID Level 1
  • To protect against data loss
  • Use mirroring disks
  • The only way data can be lost is if there is a
    second disk crash while the first crash is being
    repaired.

67
How often will a data loss occur?
  • Assume
  • The process of replacing the failed disk
  • takes 3 hours = 1/8 day = 1/2920 year
  • A failure rate of 5% per year for each disk
  • Probability that the mirror disk will fail during
    copying
  • (1/20) × (1/2920) = 1/58,400
  • Mean time to a failure involving data loss
  • One of the two disks will fail once in 10 years on
    the average
  • 10 × 58,400 = 584,000 years
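
The estimate can be worked through in a few lines. This sketch assumes that with two disks, each failing at 5% per year, some disk fails about once every 10 years (1 / (2 × 0.05)); the variable names are illustrative.

```python
FAILURE_RATE_PER_YEAR = 1 / 20        # 5% per disk per year
REPAIR_YEARS = 3 / (24 * 365)         # 3 hours = 1/2920 year
DISKS = 2

# Chance that the surviving mirror also fails during the repair window.
p_second_failure = FAILURE_RATE_PER_YEAR * REPAIR_YEARS

# With two disks, a first failure is expected every 1 / (2 * 0.05) years.
years_between_first_failures = 1 / (DISKS * FAILURE_RATE_PER_YEAR)

mean_years_to_data_loss = years_between_first_failures / p_second_failure

print(f"1 in {1 / p_second_failure:,.0f} failures loses data")
print(f"mean time to data loss: {mean_years_to_data_loss:,.0f} years")
```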

68
RAID Level 4 (1)
  • Use one redundant disk no matter how many data
    disks there are
  • In the redundant disk, the ith block consists of
    parity checks for the ith blocks of all the data
    disks
  • Use the modulo-2 sum (even parity)

69
The Algebra of Modulo-2 Sums
  • The commutative law
  • x ⊕ y = y ⊕ x
  • The associative law
  • x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ z
  • The all-0 vector of the appropriate length is the
    identity for ⊕
  • x ⊕ 0 = 0 ⊕ x = x
  • ⊕ is its own inverse
  • x ⊕ x = 0
  • If x ⊕ y = z, then y = x ⊕ z

70
RAID Level 4 Reading (2)
  • Read disks normally.
  • We could read the redundant disk !
  • Example
  • read disk 2, 3, and 4, and get the contents of
    disk 1 using modulo-2 sum.

disk 2: 10101010
disk 3: 00111000
disk 4: 01100010
disk 1: 11110000
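
The modulo-2 sum is a bitwise XOR, so the example can be checked directly. A minimal sketch using the slide's values, where disk 4 is the redundant disk; `mod2_sum` is an illustrative helper name.

```python
from functools import reduce


def mod2_sum(*blocks: int) -> int:
    """Bitwise modulo-2 sum (XOR) of any number of blocks."""
    return reduce(lambda a, b: a ^ b, blocks, 0)


disk2, disk3, disk4 = 0b10101010, 0b00111000, 0b01100010

# Disk 1 is the modulo-2 sum of the other three disks.
disk1 = mod2_sum(disk2, disk3, disk4)
print(f"{disk1:08b}")   # → 11110000
```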
71
RAID Level 4 Writing (3)
  • When a block is written, we need to change the
    redundant disk
  • Naïve approach
  • N-1 reads of blocks not being rewritten
  • One write of new block
  • Rewrite new redundant disk
  • In total, N + 1 disk I/Os
  • There is a better way to do that !

72
Writing Example (4)
  • When disk 2 changes from 10101010 to 11001100
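
One common way to avoid reading all the other data disks, presumably the better way alluded to earlier, is to flip in the redundant block exactly the bit positions that changed in the data block. A sketch using the values from the reading example (disk 1 = 11110000, disk 3 = 00111000, redundant disk 4 = 01100010):

```python
old_disk2, new_disk2 = 0b10101010, 0b11001100
old_parity = 0b01100010              # redundant disk 4 before the write

change = old_disk2 ^ new_disk2       # positions where disk 2 changed
new_parity = old_parity ^ change     # flip the same positions in the parity

print(f"change     = {change:08b}")
print(f"new parity = {new_parity:08b}")

# Cross-check against recomputing the parity from all data disks.
disk1, disk3 = 0b11110000, 0b00111000
assert new_parity == disk1 ^ new_disk2 ^ disk3
```

This needs four disk I/Os regardless of the number of data disks: read the old data block and the old redundant block, then write the new versions of both.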

73
RAID Level 4 Failure Recovery (5)
  • Recomputing any missing data is simple, and does
    not depend on which disk (data or redundant) is
    failed.

74
RAID Level 5
  • We could treat each disk as the redundant disk
    for some of the blocks
  • That is, do not have to treat one disk as the
    redundant disk and the others as data disks
  • When there are n + 1 disks (disk 0 through disk n)
  • If i mod (n + 1) = j, then we can treat the ith
    cylinder of disk j as redundant

75
Example 2.21 (1)
  • Which blocks is each disk redundant for, with 4
    disks (n = 3)?
  • Disk 0
  • redundant for blocks 4, 8, 12, ...
  • Disk 1
  • redundant for blocks 1, 5, 9, ...
  • Disk 2
  • redundant for blocks 2, 6, 10, ...
  • Disk 3
  • redundant for blocks 3, 7, 11, ...
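
The assignment in this example is just a remainder computation. A minimal sketch; the function name is illustrative:

```python
def redundant_disk(block: int, n: int) -> int:
    """Disk (0..n) that is redundant for `block` when there are n + 1 disks."""
    return block % (n + 1)


# Four disks (n = 3): list the first few blocks each disk is redundant for.
for disk in range(4):
    blocks = [b for b in range(1, 13) if redundant_disk(b, 3) == disk]
    print(f"disk {disk}: {blocks}")
```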

76
Example 2.21 (2)
  • The reading and writing load for each disk is the
    same
  • If all blocks are equally likely to be written
  • each disk has a 1/4 chance of holding the block
    being written
  • If it does not hold that block
  • it has a 1/3 chance of being the redundant disk
    for that block
  • Each of the four disks is involved in ½ of the writes
  • 1/4 + 3/4 × 1/3 = 1/2

77
RAID Level 6 (1)
  • To handle any number of disk crashes, data
    or redundant
  • Here, focused on a simple example, where two
    simultaneous crashes are correctable and the
    strategy is based on a simple error-correcting
    code, Hamming code
  • Consider a system with seven disks
  • data disks disk 1-4
  • redundant disks disk 5-7

78
RAID Level 6 (2)
  • The relationship between data and redundant disks
  • Note
  • every possible column of three 0s and 1s,
    except for the all-0 column
  • the columns for the redundant disks have a single 1
  • the columns for the data disks each have at least
    two 1s

79
RAID Level 6 (3)
  • The disks with 1 in a row are treated as if they
    were the entire set of disks in a RAID level 4
    scheme.
  • The bits of disk 5
  • are the modulo-2 sum of bits of disk 1,2, and 3
  • The bits of disk 6
  • are the modulo-2 sum of bits of disk 1,2, and 4
  • The bits of disk 7
  • are the modulo-2 sum of bits of disk 1,3, and 4
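
The three parity groups can be written down as a small table and applied to sample data. The contents of data disks 1-3 are reused from the RAID level 4 example; the content of data disk 4 (01000001) is an assumed sample value chosen for illustration. The function and constant names are likewise illustrative.

```python
# Which data disks contribute to each redundant disk.
PARITY_GROUPS = {5: (1, 2, 3), 6: (1, 2, 4), 7: (1, 3, 4)}


def redundant_disks(data):
    """Compute disks 5-7 from a dict {1..4: block} of data-disk contents."""
    out = {}
    for redundant, members in PARITY_GROUPS.items():
        value = 0
        for d in members:
            value ^= data[d]          # modulo-2 sum of the group's data disks
        out[redundant] = value
    return out


# Sample contents; disk 4 is an assumed value.
data = {1: 0b11110000, 2: 0b10101010, 3: 0b00111000, 4: 0b01000001}
for disk, block in redundant_disks(data).items():
    print(f"disk {disk}: {block:08b}")
```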

80
RAID Level 6 Read/Write
  • Reading Just read data from any data disk
    normally
  • Writing
  • Need to recalculate several redundant disks

81
A Writing Example (1)
  • Writing
  • Disk 2 is changed to be 00001111
  • Corresponding redundant disks
  • disks 5 and 6
  • Using modulo-2 sums
  • take the sum of the old and new values of disk 2
  • add that sum to disk 5
  • add that sum to disk 6

82
A Writing Example (2)
83
RAID Level 6 Failure Recovery
  • Assume that disks a and b fail simultaneously
  • Find a row r in which the columns of a and b are
    different
  • For example, a has 0 in row r, b has 1 in row r
  • Compute the correct b by taking the modulo-2 sum
    of corresponding bits from all the disks other
    than b that have 1 in row r.
  • Then, compute the correct a

84
A Recovery Example
  • Pick the second row
  • Disk 2
  • modulo-2 sum of disks 1, 4, and 6
  • 00001111
  • Disk 5
  • modulo-2 sum of disks 1, 2, and 3
  • 11000111
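
The recovery procedure can be sketched for the seven-disk layout as below. The row table lists the disks with a 1 in each row of the matrix, following the parity relationships given earlier. The contents of disks 1 and 3 come from the RAID level 4 example; the contents of disks 4, 6, and 7 are assumed sample values consistent with those relationships. The function name is illustrative.

```python
from functools import reduce
from operator import xor

# Disks with a 1 in each row of the matrix (rows 1-3).
ROWS = [(1, 2, 3, 5), (1, 2, 4, 6), (1, 3, 4, 7)]


def recover_two(disks, a, b):
    """Rebuild failed disks a and b (a != b) in the dict `disks`."""
    # Find a row where the two columns differ: it contains `second`
    # but not `first`, so `second` is the only unknown in that row.
    for first, second in ((a, b), (b, a)):
        row = next((r for r in ROWS if second in r and first not in r), None)
        if row is not None:
            break
    disks[second] = reduce(xor, (disks[d] for d in row if d != second))
    # With `second` restored, any row containing `first` recovers it.
    row = next(r for r in ROWS if first in r)
    disks[first] = reduce(xor, (disks[d] for d in row if d != first))
    return disks


disks = {1: 0b11110000, 2: None, 3: 0b00111000, 4: 0b01000001,
         5: None, 6: 0b10111110, 7: 0b10001001}   # disks 2 and 5 have failed
recover_two(disks, 2, 5)
print(f"disk 2: {disks[2]:08b}")   # → 00001111
print(f"disk 5: {disks[5]:08b}")   # → 11000111
```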