Title: Chapter 2: Data Storage
1Chapter 2. Data Storage
2Outline
- Memory hierarchy
- Hardware Disks
- Access Times
- Example - Megatron 747
- Optimizations
- Disk failure
- RAIDs
3Users
DBMSs
Operating Systems
Hardware - Data Storage
4The Memory Hierarchy
DBMS
Programs, Main-memory DBMSs
Tertiary Storage
Main memory
Cache
5Cache
- The cache is an integrated circuit, or part of the processor's chip
- It holds data or machine instructions
- Its contents are copies of main-memory locations
- If data being expelled from the cache has been modified, the new value must be copied into main memory
- Typical performance:
- Capacities up to a megabyte
- Access time: 10 nanoseconds ($10^{-8}$ seconds)
- Moving data between cache and main memory: 100 nanoseconds ($10^{-7}$ seconds)
6Main Memory
- Everything that happens in the computer is resident in main memory
- Capacity: around 100 Mbytes to 10 Gbytes
- Random access
- Typical access time is 10-100 nanoseconds
7Virtual Memory
- Virtual memory is kept on disk
- On a machine with 32-bit addresses:
- Virtual memory grows up to $2^{32}$ bytes (4 Gbytes)
- Data is moved between disk and main memory in entire blocks, which are also called pages in main memory
- Main-memory database systems
8Secondary Storage (1)
- Slower, more capacious than main memory
- Random access
- magnetic, optical, magneto-optical disks
- Disk reads and writes are done by moving a chunk of bytes called a block (or page)
9Secondary Storage (2)
- Accessing a block takes 10-30 milliseconds
- Recent disk units can store from 10 to 32 Gbytes each
- A machine can have several disk units
10Tertiary Storage (1)
- Developed to hold data volumes measured in terabytes
- Compared with secondary storage, it offers:
- Higher read/write times
- Larger capacities and smaller cost per byte
- Not random access in general
11Tertiary Storage (2)
- Kinds of tertiary storage devices:
- Ad-hoc tape storage
- Optical-disk jukeboxes: CD-ROMs
- Tape silo: an automated version of ad-hoc tape storage
- Capacities:
- CD: 2/3 Gbyte, 2.3 Gbytes
- Tapes: 50 Gbytes
- Access time: about 1000 times slower than secondary storage
12Volatile and Nonvolatile
- Volatile vs. nonvolatile storage
- Flash memory
- A form of main memory
- Nonvolatile
- Becoming economical
- RAM disk
- A battery-backed main memory
13Access Time vs. Capacity
14Moore's Law
- Gordon Moore observed that the following improve by a factor of two every 18 months:
- The speed of processors, i.e., the number of instructions executed per second, and the ratio of the speed to the cost of a processor
- The cost of main memory per bit and the number of bits that can be put on one chip
- The cost of disk per bit and the number of bytes that a disk can hold
- Not applicable to:
- Main-memory access time, disk access time
15Disks
16Disks A Top View
- Cylinder, track, sector, gap
- Gaps often represent about 10% of the total track
- An entire sector cannot be used if a portion of it is destroyed
- Typically a block consists of one or more sectors
[Figure: top view of a disk surface]
17The Disk Controller
- Controls one or more disk drives
- controlling the mechanical actuator
- selecting a surface or a sector on that surface
- Transferring bits via a data bus
18Disk Storage Characteristics (as of 1999)
- Rotation speed of the disk assembly:
- 5400 RPM (one rotation every 11 milliseconds)
- Number of platters per unit:
- Typical disk drive: 5 platters (10 surfaces)
- Floppy/zip disk: 1 platter (2 surfaces)
- Number of tracks per surface:
- As many as 10,000 tracks
- 3.5-inch diskette: 40 tracks
- Number of bytes per track:
- Common disk: $10^5$ or more bytes
- 3.5-inch diskette: 150K
19Megatron 747 Disk (1)
- Characteristics
- Have 4 platters (8 surfaces)
- 8192 ($2^{13}$) tracks per surface
- On average 256 ($2^8$) sectors per track
- 512 ($2^9$) bytes per sector
- Diameters of tracks:
- Outermost track: 3.5 inches
- Innermost track: 1.5 inches
- A track consists of two parts:
- Gaps: 10%
- Data: 90%
20Megatron 747 Disk (2)
- The capacity of the disk:
- 8 surfaces $\times$ 8192 tracks $\times$ 256 sectors $\times$ 512 bytes = 8 Gbytes
- A single track, on average:
- 256 sectors $\times$ 512 bytes = 128 Kbytes = 1 Mbit
- A cylinder holds 1 Mbyte on average
- If a block is 4096 ($2^{12}$) bytes:
- A block uses 8 sectors (= 4096 bytes / 512 bytes)
- A track consists of 32 blocks (= 256 sectors / 8)
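The capacity figures above follow directly from the disk geometry. A minimal sketch that recomputes them, assuming the averages given on these slides (the variable names are illustrative only):

```python
# Megatron 747 geometry as listed on the slides (averages where noted).
SURFACES = 8
TRACKS_PER_SURFACE = 8192   # 2**13
SECTORS_PER_TRACK = 256     # 2**8, average over the disk
BYTES_PER_SECTOR = 512      # 2**9
BLOCK_BYTES = 4096          # 2**12

track_bytes = SECTORS_PER_TRACK * BYTES_PER_SECTOR      # one track
cylinder_bytes = SURFACES * track_bytes                 # one track on every surface
disk_bytes = TRACKS_PER_SURFACE * cylinder_bytes        # whole disk
sectors_per_block = BLOCK_BYTES // BYTES_PER_SECTOR
blocks_per_track = SECTORS_PER_TRACK // sectors_per_block

print(f"track:    {track_bytes // 2**10} Kbytes")
print(f"cylinder: {cylinder_bytes // 2**20} Mbyte")
print(f"disk:     {disk_bytes // 2**30} Gbytes")
print(f"block:    {sectors_per_block} sectors, {blocks_per_track} blocks per track")
```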
21Megatron 747 Disk (3)
- Tracks in the Megatron 747 have different numbers of sectors:
- Outer: 320 sectors
- Middle: 256 sectors
- Inner: 192 sectors
- The outermost track:
- 1,801,800 bits / 9.9 inches $\approx$ 182,000 bpi
- The innermost track:
- 478,800 bits / 4.2 inches $\approx$ 114,000 bpi
- If each track had the same number (i.e., 256) of sectors, then the density of bits on the inner tracks would be greater
- Length of the outermost track:
- $0.9 \times 3.5 \times \pi \approx 9.9$ inches
- 1 megabit / 9.9 inches $\approx$ 100,000 bits per inch
- Length of the innermost track:
- $0.9 \times 1.5 \times \pi \approx 4.2$ inches
- 1 megabit / 4.2 inches $\approx$ 250,000 bits per inch
22The Latency of The Disk
[Figure: a request for block X is issued; the disk access time elapses before block X is in memory]
- Disk access time
- seek time
- rotational delay
- transfer time
- others
23Seek Time
- The time to position the head assembly at the proper cylinder
- 0 (zero) if the heads are already at the proper cylinder
- Otherwise, the time to move the heads to the proper cylinder
24Rotational latency Time
- The time for the disk to rotate until the first of the sectors containing the block reaches the head
- One rotation takes 10 ms, so the rotational latency is 5 ms on average
25Transfer Time/Other delays
- Transfer time:
- The time to read or write the data on the appropriate disk surface
- Typical transfer rate: 10 Mbytes per second
- Other delays (neglected here):
- Time taken by the processor and disk controller
- Delays due to contention for the disk controller
- Other delays due to contention
26Modifying Blocks
- Not possible to modify a block on disk directly
- Sequence of procedures
- Read block (time rt)
- Modify in memory (time mt)
- Write block (time wt)
- Verify (time vt) if appropriate
- Total time
- rt + mt + wt + vt
27Example 2.3 (1)
- Let us examine the time to read a 4096-byte block from the Megatron 747 disk
- Characteristics:
- 4 platters (8 surfaces), 1 surface = 8192 tracks
- 1 track = 256 sectors, 1 sector = 512 bytes
- The disk rotates at 3840 RPM, so one rotation takes 1/64 of a second
- To move the head assembly:
- 1 ms (to start and stop) + 1 ms for every 500 cylinders
- Heads move one track in 1.002 ms
- To move the heads from the innermost to the outermost track:
- $1 + 8192/500 = 17.4$ ms
28Example 2.3 (2)
- Minimum time (the best case):
- No seek time, no rotational latency, only transfer time
- Note: 1 track = 256 sectors, 1 sector = 512 bytes
- 4096 bytes / 512 bytes = 8 sectors (including 7 gaps)
- Gaps and sectors occupy 10% and 90% of a track, i.e., 36 and 324 of its 360 degrees
- A track has 256 gaps and 256 sectors
- $36 \times 7/256 + 324 \times 8/256 = 11.109$ degrees
- $(11.109/360)/64 = 4.8 \times 10^{-4}$ seconds $\approx 0.5$ ms
29Example 2.3 (3)
- Maximum time (the worst case):
- Full seek time and rotational latency, plus transfer time
- Full seek time: 17.4 ms
- Full rotation time: 1/64 of a second = 15.6 ms
- Transfer time: 0.5 ms
- $17.4 + 15.6 + 0.5 = 33.5$ ms
30Example 2.3 (4)
- Average time:
- Transfer time: 0.5 ms
- Average rotational latency: half of a full rotation = 7.8 ms
- Average seek time:
- Average distance traveled: 1/3 of the disk = 2730 cylinders
- $1 + 2730/500 = 6.5$ ms
- $0.5 + 7.8 + 6.5 = 14.8$ ms
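The best, worst, and average cases of Example 2.3 can be recomputed in a few lines. This is a sketch under the example's own assumptions (3840 RPM, 1 ms plus 1 ms per 500 cylinders for a seek, 10% of each track taken by gaps); the function names are hypothetical.

```python
RPM = 3840
CYLINDERS = 8192
SECTORS_PER_TRACK = 256
GAP_FRACTION = 0.10                 # share of each track taken by gaps

rotation_ms = 60_000 / RPM          # one full rotation, 15.625 ms

def seek_ms(cylinders_moved):
    """0 if the heads are already in place, else 1 ms + 1 ms per 500 cylinders."""
    if cylinders_moved == 0:
        return 0.0
    return 1.0 + cylinders_moved / 500

def transfer_ms(sectors):
    """Time for `sectors` consecutive sectors and the gaps between them to pass the head."""
    gap_degrees = 360 * GAP_FRACTION / SECTORS_PER_TRACK
    sector_degrees = 360 * (1 - GAP_FRACTION) / SECTORS_PER_TRACK
    degrees = sectors * sector_degrees + (sectors - 1) * gap_degrees
    return degrees / 360 * rotation_ms

block_sectors = 4096 // 512         # a 4096-byte block spans 8 sectors

best = transfer_ms(block_sectors)
worst = seek_ms(CYLINDERS) + rotation_ms + transfer_ms(block_sectors)
average = seek_ms(CYLINDERS // 3) + rotation_ms / 2 + transfer_ms(block_sectors)

print(f"best {best:.2f} ms, worst {worst:.1f} ms, average {average:.1f} ms")
```

The unrounded results differ slightly from the slide values because the slides round each term before adding.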
31RAM model vs. I/O model computation
- I/O model of computation
- Dominance of I/O cost
- Remember: $10^5$ to $10^6$ in-memory operations take the same time as one disk I/O
- We should minimize the number of block accesses
- Data structure vs. file processing
32Using Secondary Storage Effectively
- In a typical database:
- Whole databases are much too large to fit in main memory
- Key parts of databases are buffered in main memory
- Disk I/Os occur frequently
- Main-memory sorts (such as quicksort) are inadequate
33Merge Sort
34Two-Phase, Multiway Merge-Sort (1)
- Phase 1
- Sort main-memory-sized pieces of the data
- Fill all available main memory with blocks
- Sort the records in main memory
- Write the sorted records
35Two-Phase, Multiway Merge-Sort (2)
- Phase 2
- Merge all the sorted sublists into a single sorted list
- Find the smallest key among the first remaining elements of all the lists
- Move the smallest element to the first available position of the output block
- If the output block is full, write it to disk and reinitialize the same buffer
- Repeat until all input blocks are exhausted
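The two phases can be sketched in memory as follows. This is an illustration only: Python lists stand in for disk-resident sublists, `memory_capacity` (a hypothetical parameter) is the number of records that fit in main memory, and a heap is used to find the smallest of the first remaining elements, which the slides do not prescribe.

```python
import heapq

def two_phase_merge_sort(records, memory_capacity):
    # Phase 1: sort main-memory-sized pieces and "write" each as a sorted sublist.
    sublists = []
    for start in range(0, len(records), memory_capacity):
        sublists.append(sorted(records[start:start + memory_capacity]))

    # Phase 2: repeatedly take the smallest key among the first remaining
    # elements of all sublists and append it to the output.
    heap = [(sub[0], i, 0) for i, sub in enumerate(sublists)]
    heapq.heapify(heap)
    output = []
    while heap:
        key, i, pos = heapq.heappop(heap)
        output.append(key)
        if pos + 1 < len(sublists[i]):
            heapq.heappush(heap, (sublists[i][pos + 1], i, pos + 1))
    return output

print(two_phase_merge_sort([7, 2, 9, 4, 1, 8, 3, 6, 5], memory_capacity=4))
```

A real implementation would read and write whole blocks and keep one input buffer per sublist plus an output buffer; the sketch keeps only the ordering logic.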
36Main-memory Organization
37Merge Sort Example (1)
- Assumptions:
- 10,000,000 tuples, 1 tuple = 100 bytes
- So, 1 Gbyte of data
- 50 Mbytes of memory available
- 4096-byte blocks, so each block contains 40 records
- Total number of blocks: 250,000
- Number of blocks in main memory: 12,800 (= $50 \times 2^{20} / 2^{12}$)
- Number of sublists:
- 19 sublists (12,800 blocks each) + 1 sublist (6,800 blocks)
- Each block read or write: 15 ms
38Merge Sort Example (2)
- Computation
- First phase:
- Read each of the 250,000 blocks once
- Write 250,000 new blocks
- Total time:
- (250,000 $\times$ 15 ms) $\times$ 2 = 7500 seconds = 125 minutes
- Second phase:
- Similar to the first phase
- Total time: 125 minutes
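The block counts and phase times of this example can be checked with a short calculation. The sketch below uses only the assumptions stated in the example; the constant names are illustrative.

```python
TUPLES = 10_000_000
TUPLE_BYTES = 100
BLOCK_BYTES = 4096
MEMORY_BYTES = 50 * 2**20       # 50 Mbytes of main memory
IO_MS = 15                      # one block read or write

records_per_block = BLOCK_BYTES // TUPLE_BYTES          # records never span blocks
total_blocks = TUPLES // records_per_block
memory_blocks = MEMORY_BYTES // BLOCK_BYTES
full_sublists, leftover_blocks = divmod(total_blocks, memory_blocks)

# Each phase reads every block once and writes every block once.
phase_minutes = total_blocks * 2 * IO_MS / 1000 / 60

print(records_per_block, total_blocks, memory_blocks)
print(full_sublists, "full sublists plus one of", leftover_blocks, "blocks")
print(phase_minutes, "minutes per phase")
```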
39Improving the Access Time of Secondary Storage
- Place blocks on the same cylinder
- Divide the data among several small disks
- Mirroring disks
- Use a disk-scheduling algorithm
- Prefetch blocks to main memory in anticipation of
their later use
40Organizing Data by Cylinders
- Use several adjacent cylinders
- Read all the blocks on a single track or on a cylinder consecutively
- Neglect all but the first seek time and the first rotational latency
41Example 2.9 (1)
- Recall Examples 2.3 and 2.7
- The original data may be stored on consecutive cylinders
- Total number of cylinders: 1000 (= 1 Gbyte / 1 Mbyte)
- Main memory can hold 50 cylinders (i.e., 50 Mbytes)
- To read 50 cylinders of data into main memory:
- 6.5 ms for the average seek time
- 49 ms for 49 one-cylinder seeks (1 ms each)
- 6.4 seconds for the transfer of 12,800 blocks
- (12,800 $\times$ 0.5 ms) / 1000 = 6.4 seconds
- So, $6.5 + 49 + 6{,}400 = 6455.5$ ms
42Example 2.9 (2)
- First phase:
- Read:
- (6.5 ms + 49 ms + 6.4 seconds) $\times$ 20 times = 2.15 minutes
- Write: the same as reading
- Total time: 4.3 minutes
- Second phase:
- Still takes about 125 minutes (why?)
43 Using Multiple Disks in place of One
- Use several disks with their independent heads
- Transfer data at a higher rate
- Roughly speaking, total time could be divided by
the number of disks
44Example 2.10 (1)
- Replace one 747 by four 737s, each of which has one platter and two surfaces
- Assumptions:
- Divide the given records among the four disks
- The data occupies 1000 adjacent cylinders on each disk
- Each disk fills 1/4 of main memory
- Recall the previous examples:
- Average seek time and rotational latency: 0
- Number of blocks in full memory: 12,800
- 1/4 of memory: 3,200 blocks
45Example 2.10 (2)
- Computation
- First phase:
- Transfer time: 3200 $\times$ 0.5 ms = 1.6 seconds
- Read: (6.5 ms + 49 ms + 1.6 seconds) $\times$ 20 = 33 seconds
- Write: similar to reading
- Total time: about 1 minute
46Example 2.10 (3)
- Second phase:
- Apply delicate techniques (?) to reduce disk I/O time
- Start comparisons among the 20 lists as soon as the first element of a block appears in main memory
- Use four output buffers
- Total time: about 1 hour (?)
47Mirroring Disks
- Two or more disks hold identical copies of the data
- The data survives a head crash on either disk
- If we make n copies of a disk, we can read any n blocks in parallel
- Using mirror disks does not speed up writing, but neither does it slow writing down (to some extent)
48Scheduling Requests by the Elevator Algorithm
- The disk controller chooses which of several requests to execute first, to increase throughput
- Elevator algorithm:
- Proceed in the same direction until the next cylinder with blocks to access is encountered
- When there are no requests ahead in the direction of travel, reverse direction
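The sweep-and-reverse rule can be illustrated with a simplified sketch. It assumes all requests are already pending when scheduling starts (real controllers also handle requests that arrive during the sweep), and the function name and cylinder numbers are hypothetical.

```python
def elevator_order(start, requests, direction=1):
    """Return the order in which pending cylinders are served.

    start: current head cylinder; requests: pending cylinder numbers;
    direction: +1 for toward higher cylinders, -1 for toward lower ones.
    """
    pending = sorted(requests)
    position = start
    order = []
    while pending:
        if direction > 0:
            ahead = [c for c in pending if c >= position]
            target = ahead[0] if ahead else None      # nearest request above
        else:
            ahead = [c for c in pending if c <= position]
            target = ahead[-1] if ahead else None     # nearest request below
        if target is None:
            direction = -direction                    # nothing ahead: reverse
            continue
        order.append(target)
        pending.remove(target)
        position = target
    return order

# Head at cylinder 2000, moving toward higher cylinder numbers.
print(elevator_order(2000, [1000, 6000, 500, 5000]))
```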
49Example 2.11
[Table: arrival times for six block-access requests]
[Table: finishing times for block accesses using the elevator algorithm]
[Table: finishing times for block accesses using the first-come-first-served algorithm]
50Prefetching Data in Track- or Cylinder-sized Chunks
- Can we predict the order in which blocks will be requested from disk?
- For example:
- Devote two block buffers to each list being merged (when there is plenty of memory)
- When a buffer is exhausted, switch to the other buffer for the same list
51Single Buffering
- Single buffering:
- Read B1 $\rightarrow$ buffer
- Process data in buffer
- Read B2 $\rightarrow$ buffer
- Process data in buffer ...
- Computation:
- P = time to process one block
- R = time to read in one block
- n = number of blocks
- Single-buffer time = $n(P + R)$
52Single Buffering vs. Double Buffering
53Double Buffering
- Computation:
- P = processing time per block
- R = I/O time per block
- n = number of blocks
- Double-buffering time = $R + nP$ (when $P \ge R$)
- Single-buffering time = $n(R + P)$
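The two formulas can be compared numerically. In this sketch the double-buffering function uses a slightly more general expression, which is an assumption beyond the slide: the first read cannot be overlapped, the last processing step cannot be overlapped, and every step in between costs the larger of P and R. When P is at least R it reduces to R + nP.

```python
def single_buffer_time(n, P, R):
    """Each block is read, then processed, with no overlap."""
    return n * (P + R)

def double_buffer_time(n, P, R):
    """Reading block i+1 overlaps with processing block i."""
    if n == 0:
        return 0
    return R + (n - 1) * max(P, R) + P

n, P, R = 10, 5, 3          # hypothetical values, in ms
print(single_buffer_time(n, P, R))   # n(P + R)
print(double_buffer_time(n, P, R))   # R + nP, since P >= R here
```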
54Prefetching
- Combine prefetching with the cylinder-based strategy
- Store the sorted sublists on whole, consecutive cylinders
- Read whole tracks or cylinders whenever we need some records from a given list
55Example 2.14 (1)
- Consider the second phase of the sort
- Keep in main memory two track-sized buffers per sorted sublist
- A track: 128 KB
- Total space requirement: 128 KB $\times$ 20 lists $\times$ 2 = 5 Mbytes
- Read all the blocks on 1000 cylinders (8000 tracks)
- Computation:
- Average seek time: 6.5 ms
- Time for the disk to rotate once: 15.6 ms
- Total time (for reading): $(6.5 + 15.6) \times 8000$ ms = 2.95 minutes
56Example 2.14 (2)
- Keep in main memory two cylinder-sized buffers per sorted sublist
- 1 cylinder = 8 tracks = 128 KB $\times$ 8 = 1 MB
- Use 40 buffers of a megabyte each
- 50 megabytes of main memory are available
- Need only do a seek once per cylinder
- Read all the blocks on 1000 cylinders (8000 tracks)
- Total time (for reading):
- $(6.5 + 8 \times 15.6) \times 1000$ ms = 2.19 minutes
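The two reading times of Example 2.14 differ only in how often a seek is paid. A short sketch of both calculations, using the rounded figures from the example (the names are illustrative):

```python
SEEK_MS = 6.5               # average seek time
ROTATION_MS = 15.6          # one full rotation reads one track
CYLINDERS = 1000
TRACKS_PER_CYLINDER = 8

def minutes(ms):
    return ms / 60_000

# Track-sized buffers: one seek and one rotation per track.
track_buffer_ms = (SEEK_MS + ROTATION_MS) * CYLINDERS * TRACKS_PER_CYLINDER

# Cylinder-sized buffers: one seek per cylinder, then eight rotations.
cylinder_buffer_ms = (SEEK_MS + TRACKS_PER_CYLINDER * ROTATION_MS) * CYLINDERS

print(f"track-sized buffers:    {minutes(track_buffer_ms):.2f} minutes")
print(f"cylinder-sized buffers: {minutes(cylinder_buffer_ms):.2f} minutes")
```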
57Block Size Selection
- Big block $\rightarrow$ amortizes I/O cost
- Big block $\rightarrow$ reads in more useless data and takes longer to read
- As memory prices drop, blocks get bigger
58Disk Failures
- Intermittent failure
- An attempt to read or write a sector is unsuccessful, but with repeated tries we are able to read or write successfully
- Media decay
- A bit or bits are permanently corrupted, and the sector becomes unreadable
- Write failure
- We can neither write successfully nor retrieve the previously written sector
- Disk crash
- The entire disk becomes permanently unreadable
59Checksums (1)
- Each sector has additional bits, called the checksum, to check reading or writing operations
- A read returns a pair (w, s):
- w: the data that is read
- s: a status bit
- A simple form of checksum: parity
60Checksums (2)
- Example 1 (even parity)
- The sequence of bits in a sector 01101000
- The parity bit is 1
- Data becomes 011010001
- Example 2 (even parity)
- The sequence of bits in a sector 11101110
- The parity bit is 0
- Data becomes 111011100
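Both examples can be reproduced with a few lines of code. This is a minimal sketch of even parity over bit strings; the helper names are hypothetical.

```python
def even_parity_bit(bits: str) -> str:
    """Return the bit that makes the total number of 1s even."""
    return "1" if bits.count("1") % 2 == 1 else "0"

def with_parity(bits: str) -> str:
    """Append the even-parity bit to the sector contents."""
    return bits + even_parity_bit(bits)

def status_good(stored: str) -> bool:
    """A stored word passes the check if its number of 1s is even."""
    return stored.count("1") % 2 == 0

print(with_parity("01101000"))   # Example 1
print(with_parity("11101110"))   # Example 2
```

Flipping any single bit of a stored word makes `status_good` return False, while flipping two bits goes undetected.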
61Checksums (3)
- We may fail to detect an error if more than one bit of the sector is corrupted
- If we use n independent bits as a checksum, then the chance of missing an error is only $1/2^n$ (why?)
62Stable Storage (1)
- How do we correct errors?
- Stable storage is a technique for organizing a disk so that media decay or failed writes do not result in permanent loss
- The general idea is that sectors are paired, and each pair represents one sector-contents X
- The two sectors of a pair are the left (XL) and right (XR) copies
63Stable Storage (2)
- Writing policy:
- Write the value of X into XL
- If the status is good, the write is done
- If the status is bad, repeat the write
- If it still fails after a number of tries, assume a media failure in the sector
- Repeat the above scheme for XR
- Reading policy (to obtain the value of X):
- Read XL
- If status bad is returned, repeat the read
- If status good is returned, take that value as X
- If we can't read XL, repeat the above with XR
64Recovery from Disk Crashes
- A disk crash is fatal in mission-critical applications
- RAID (redundant arrays of independent disks)
- Here, we discuss levels 1, 4, 5, and 6
- These RAID schemes also handle the failures discussed previously
65The Failure Model of Disks
- Mean time to failure is the length of time by which 50% of a population of disks will have failed catastrophically
- For modern disks, it is about 10 years
[Figure: fraction of disks surviving as a function of time]
66RAID Level 1
- To protect against data loss
- Use mirroring disks
- The only way data can be lost is if there is a
second disk crash while the first crash is being
repaired.
67How often will a data loss occur?
- Assume:
- The process of replacing the failed disk takes 3 hours = 1/8 day = 1/2920 year
- A failure rate of 5% per year for each disk
- Probability that the mirror disk will fail during copying:
- $(1/20) \times (1/2920) = 1/58{,}400$
- Mean time to a failure involving data loss:
- With a 5% failure rate per disk, one of the two disks will fail once in 10 years on average
- $10 \times 58{,}400 = 584{,}000$ years
68RAID Level 4 (1)
- Use one redundant disk, no matter how many data disks there are
- In the redundant disk, the ith block consists of parity checks for the ith blocks of all the data disks
- Use the modulo-2 sum, i.e., even parity
69The Algebra of Modulo-2 Sums
- The commutative law:
- $x \oplus y = y \oplus x$
- The associative law:
- $x \oplus (y \oplus z) = (x \oplus y) \oplus z$
- The all-0 vector of the appropriate length, O, is the identity for $\oplus$:
- $x \oplus O = O \oplus x = x$
- $\oplus$ is its own inverse:
- $x \oplus x = O$
- If $x \oplus y = z$, then $y = x \oplus z$
70RAID Level 4 Reading (2)
- Read the data disks normally
- We could also read using the redundant disk!
- Example:
- Read disks 2, 3, and 4, and get the contents of disk 1 using the modulo-2 sum
disk 2: 10101010
disk 3: 00111000
disk 4: 01100010
disk 1: 11110000
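The reading example can be verified with a bitwise exclusive-or, which is the modulo-2 sum applied position by position. A minimal sketch (the helper `mod2_sum` is a name chosen here, not from the slides):

```python
from functools import reduce

def mod2_sum(*blocks: str) -> str:
    """Bitwise modulo-2 sum (XOR) of equal-length bit strings."""
    width = len(blocks[0])
    value = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(value, f"0{width}b")

disk2, disk3, disk4 = "10101010", "00111000", "01100010"

# Disk 4 is the redundant disk, so disk 1 is the sum of the other three.
disk1 = mod2_sum(disk2, disk3, disk4)
print(disk1)
```

Because the sum is its own inverse, the same call recovers any one disk from the other three.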
71RAID Level 4 Writing (3)
- When a block is written, we need to change the redundant disk
- Naïve approach:
- N-1 reads of the blocks not being rewritten
- One write of the new block
- One rewrite of the redundant block
- In total, N+1 disk I/Os
- There is a better way to do that!
72Writing Example (4)
- When disk 2 changes from 10101010 to 11001100
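The slide does not spell out the better way. A common technique, assumed here, is to read only the old data block and the old redundant block: the modulo-2 sum of the old and new data marks the bit positions that changed, and adding it to the old redundant block flips exactly those positions. The sketch uses the disk contents from the reading example (disk 4 redundant, holding 01100010); the helper name is hypothetical.

```python
def xor_bits(a: str, b: str) -> str:
    """Modulo-2 sum of two equal-length bit strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

old_data = "10101010"        # disk 2 before the write
new_data = "11001100"        # disk 2 after the write
old_redundant = "01100010"   # disk 4 before the write

change = xor_bits(old_data, new_data)            # positions where disk 2 changed
new_redundant = xor_bits(old_redundant, change)  # flip the same positions on disk 4

print(change, new_redundant)
```

This costs four disk I/Os (two reads, two writes) regardless of the number of data disks.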
73RAID Level 4 Failure Recovery (5)
- Recomputing any missing data is simple, and does
not depend on which disk (data or redundant) is
failed.
74RAID Level 5
- We could treat each disk as the redundant disk for some of the blocks
- That is, we do not have to treat one disk as the redundant disk and the others as data disks
- When there are n+1 disks (disk 0 through disk n):
- If $i \bmod (n+1) = j$, then we can treat the ith cylinder of disk j as redundant
75Example 2.21 (1)
- How are redundant blocks assigned for 4 disks (n = 3)?
- Disk 0:
- redundant for blocks 4, 8, 12, ...
- Disk 1:
- redundant for blocks 1, 5, 9, ...
- Disk 2:
- redundant for blocks 2, 6, 10, ...
- Disk 3:
- redundant for blocks 3, 7, 11, ...
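The assignment in this example is the rule i mod (n+1) = j applied with four disks. A minimal sketch that regenerates the lists (the function name is hypothetical):

```python
def redundant_disk(block: int, disks: int = 4) -> int:
    """Disk that holds the redundant copy for the given block number."""
    return block % disks

for disk in range(4):
    blocks = [i for i in range(1, 13) if redundant_disk(i) == disk]
    print(f"disk {disk}: redundant for blocks {blocks}")
```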
76Example 2.21 (2)
- The reading and writing load for each disk is the same
- If all blocks are equally likely to be written:
- Each disk has a 1/4 chance of holding the block being written
- If not, it has a 1/3 chance of holding the redundant block for it
- Each of the four disks is involved in 1/2 of the writes:
- $1/4 + 3/4 \times 1/3 = 1/2$
77RAID Level 6 (1)
- Can handle any number of disk crashes, whether of data or redundant disks
- Here we focus on a simple example, where two simultaneous crashes are correctable and the strategy is based on a simple error-correcting code, the Hamming code
- Consider a system with seven disks:
- Data disks: disks 1-4
- Redundant disks: disks 5-7
78RAID Level 6 (2)
- The relationship between data and redundant disks
- Note:
- The columns are every possible column of three 0s and 1s, except for the all-0 column
- The columns for the redundant disks have a single 1
- The columns for the data disks each have at least two 1s
79RAID Level 6 (3)
- The disks with 1 in a row are treated as if they were the entire set of disks in a RAID level 4 scheme
- The bits of disk 5 are the modulo-2 sum of the bits of disks 1, 2, and 3
- The bits of disk 6 are the modulo-2 sum of the bits of disks 1, 2, and 4
- The bits of disk 7 are the modulo-2 sum of the bits of disks 1, 3, and 4
80RAID Level 6 Read/Write
- Reading: just read data from any data disk normally
- Writing:
- Need to recalculate several redundant disks
81A Writing Example (1)
- Writing
- Disk 2 is changed to be 00001111
- Corresponding redundant disks
- disk 5 and 6
- Using modulo-2 sum
- between old and new disk 2
- between modulo-2 sum of disk 2s and disk 5
- between modulo-2 sum of disk 2s and disk 6
82A Writing Example (2)
83RAID Level 6 Failure Recovery
- Assume that disks a and b fail simultaneously
- Find a row r in which the columns of a and b are different
- For example, a has 0 in row r and b has 1 in row r
- Compute the correct b by taking the modulo-2 sum of corresponding bits from all the disks other than b that have 1 in row r
- Then, compute the correct a
84A Recovery Example
- Pick the second row
- Disk 2
- modulo-2 sum of disks 1, 4, and 6
- 00001111
- Disk 5
- modulo-2 sum of disks 1, 2, and 3
- 11000111
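The recovery procedure can be sketched for the seven-disk layout described earlier (disk 5 covers disks 1, 2, 3; disk 6 covers 1, 2, 4; disk 7 covers 1, 3, 4). The data values below are hypothetical, since the slide's disk contents did not survive, and the function names are illustrative.

```python
# Each row lists the disks that have a 1 in that row of the matrix.
ROWS = [
    {1, 2, 3, 5},
    {1, 2, 4, 6},
    {1, 3, 4, 7},
]

def xor_all(values):
    """Modulo-2 sum of an iterable of integers treated as bit vectors."""
    result = 0
    for value in values:
        result ^= value
    return result

def add_redundancy(data):
    """Given data disks 1-4, compute redundant disks 5-7."""
    disks = dict(data)
    for row in ROWS:
        redundant = max(row)                      # disks 5, 6, 7
        disks[redundant] = xor_all(disks[d] for d in row if d != redundant)
    return disks

def recover(surviving, a, b):
    """Rebuild failed disks a and b from the surviving disks."""
    known = dict(surviving)
    for first, second in ((a, b), (b, a)):
        for row in ROWS:
            if first in row and second not in row:
                # Row contains only one failed disk: rebuild it RAID 4 style.
                known[first] = xor_all(known[d] for d in row if d != first)
                # Any row containing the other failed disk now has one unknown.
                other = next(r for r in ROWS if second in r)
                known[second] = xor_all(known[d] for d in other if d != second)
                return known
    raise ValueError("the columns of a and b do not differ in any row")

full = add_redundancy({1: 0b11110000, 2: 0b10101010, 3: 0b00111000, 4: 0b01000001})
surviving = {d: v for d, v in full.items() if d not in (2, 5)}
restored = recover(surviving, 2, 5)
print(format(restored[2], "08b"), format(restored[5], "08b"))
```

For failed disks 2 and 5 the sketch picks the second row, as the example does: disk 2 is rebuilt from disks 1, 4, and 6, and disk 5 is then rebuilt from disks 1, 2, and 3.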