Chapter 2. Data Storage


Slides: 85

Provided by: Sir106


1
Chapter 2. Data Storage
2
Outline

Memory hierarchy
Hardware Disks
Access Times
Example - Megatron 747
Optimizations
Disk failure
RAIDs

3
Users
DBMSs
Operating Systems
Hardware - Data Storage
4
The Memory Hierarchy
DBMS
Programs, Main-memory DBMSs
Tertiary Storage
Main memory
Cache
5
Cache

The cache is an integrated circuit or part of the
processor's chip
Holding data or machine instructions
Copy from main-memory
If data being expelled from the cache has been
modified, then the new value must be copied into
the main memory.
Typical performance
Capacities up to a megabyte
Access time 10 nanoseconds (10^-8 seconds)
Moving data between cache and main memory 100
nanoseconds (10^-7 seconds)

6
Main Memory

Everything that happens in the computer is
resident in main memory
Capacity around 100 Mbyte to 10 Gbyte
Random access
Typical access time is 10-100 nanoseconds

7
Virtual Memory

Is a part of disk
In a 32-bit address machine
Virtual memory grows up to 2^32 bytes (4 Gbytes)
Data is moved between disk and main memory in
entire blocks, which are also called pages in
main memory
Main-memory database systems

8
Secondary Storage (1)

Slower, more capacious than main memory
Random access
magnetic, optical, magneto-optical disks
Disk reads and writes are done by moving a chunk of
bytes called a block (or page)

9
Secondary Storage (2)

Accessing a block 10-30 milliseconds
Recently, one disk unit can store data ranging
from 10 to 32 Gbytes
A machine can have several disk units

10
Tertiary Storage (1)

Have been developed to hold data volumes measured
in terabytes
Compared with secondary storage, it offers
Higher read/write times
Larger capacities and smaller cost per byte
Not random access in general

11
Tertiary Storage (2)

Kinds of tertiary storage devices
Ad-hoc tape storage
Optical-disk jukeboxes (CD-ROMs)
Tape silo: an automated version of ad-hoc tape
storage
Capacities
CD 2/3 Gbytes, 2.3 Gbytes
Tapes 50 Gbytes
Access time about 1000 times slower than
secondary memory

12
Volatile and Nonvolatile

Volatile vs. nonvolatile storage
Flash memory
A form of main memory
Nonvolatile
Becomes economical
RAM disk
A battery-backed main memory

13
Access Time vs. Capacity
14
Moore's Law

Gordon Moore observed that the following improve
by a factor of 2 every 18 months
The speed of processors, i.e., the number of
instructions executed per second and the ratio of
the speed to cost of a processor
The cost of main memory per bit and the number of
bits that can be put on one chip
The cost of disk per bit and the number of bytes
that a disk can hold
Not applicable to
Main memory access time, disk access time

15
Disks
16
Disks A Top View

Cylinder, Track, Sector, Gap
Gaps often represent about 10% of the total
track
An entire sector cannot be used if a portion of
it is destroyed
Typically a block consists of one or more
sectors.

[Figure: top view of a disk surface]
17
The Disk Controller

Controls one or more disk drives
controlling the mechanical actuator
selecting a surface or a sector on that surface
Transferring bits via a data bus

18
Disk Storage Characteristics (as of 1999)

Rotation speed of the disk assembly
5400 RPM (one rotation every 11 milliseconds)
Number of platters per unit
Typical disk drive 5 platters (10 surfaces)
Floppy/zip disk 1 platter (2 surfaces)
Number of tracks per surface
Have as many as 10,000 tracks
3.5 inch diskette 40 tracks
Number of bytes per track
Common disk 10^5 or more bytes
3.5 inch diskette 150K

19
Megatron 747 Disk (1)

Characteristics
Have 4 platters (8 surfaces)
8192 (2^13) tracks per surface
On average 256 (2^8) sectors per track
512 (2^9) bytes per sector

Diameters of tracks
outermost track is 3.5 inches
innermost track is 1.5 inches
Track consists of two parts
gap: 10%
data: 90%

20
Megatron 747 Disk (2)

The capacity of the disk
8 surfaces × 8192 tracks × 256 sectors × 512
bytes = 8 Gbytes
A single track on average
256 sectors × 512 bytes = 128 Kbytes = 1 Mbit
A cylinder is 1 Mbyte on average
If a block is 4096 bytes (2^12)
A block uses 8 sectors (= 4096 bytes / 512 bytes)
A track consists of 32 blocks (= 256 sectors / 8)
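
The capacity figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the constant and function names are chosen here for illustration and do not come from the slides.

```python
# Geometry of the Megatron 747 as listed on the slides.
SURFACES = 8
TRACKS_PER_SURFACE = 2**13   # 8192
SECTORS_PER_TRACK = 2**8     # 256 on average
BYTES_PER_SECTOR = 2**9      # 512
BLOCK_BYTES = 2**12          # 4096

def disk_capacity_bytes():
    """Total capacity: surfaces x tracks x sectors x bytes per sector."""
    return SURFACES * TRACKS_PER_SURFACE * SECTORS_PER_TRACK * BYTES_PER_SECTOR

track_bytes = SECTORS_PER_TRACK * BYTES_PER_SECTOR        # one track
cylinder_bytes = track_bytes * SURFACES                   # one track per surface
sectors_per_block = BLOCK_BYTES // BYTES_PER_SECTOR
blocks_per_track = SECTORS_PER_TRACK // sectors_per_block

print(disk_capacity_bytes() // 2**30, "Gbytes per disk")
print(track_bytes // 2**10, "Kbytes per track")
print(cylinder_bytes // 2**20, "Mbyte per cylinder")
print(sectors_per_block, "sectors per block,", blocks_per_track, "blocks per track")
```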

21
Megatron 747 Disk (3)

Tracks in the Megatron 747 have different
numbers of sectors
outer 320 sectors
middle 250 sectors
inner 192 sectors
The outermost track
1,801,800 bits / 9.9 inches ≈ 182,000 bpi
The innermost track
478,800 bits / 4.2 inches ≈ 114,000 bpi

If each track had the same number (i.e. 256) of
sectors, then the density of bits would vary
widely and be much greater on the inner tracks
Length of the outermost track
0.9 × 3.5 × π ≈ 9.9 inches
1 megabit / 9.9 ≈ 100,000 bits per inch
Length of the innermost track
0.9 × 1.5 × π ≈ 4.2 inches
1 megabit / 4.2 ≈ 250,000 bits per inch

22
The Latency of The Disk
block x in memory
I want block X
disk access time

Disk access time =
seek time
+ rotational delay
+ transfer time
+ other delays

23
Seek Time

The time to position the head assembly at the
proper cylinder
0 (zero) if the heads are already at the proper cylinder
Otherwise, the time to move the heads to the proper cylinder

24
Rotational Latency

The time for the disk to rotate until the first
of the sectors containing the block reaches the head
One rotation takes 10 ms, so rotational latency
on average 5 ms.

25
Transfer Time/Other delays

Transfer Time
the time to read or write the data on the
appropriate disk surface
10 Mbytes per second
Other delays (here, those are neglected)
taken by the processor and disk controller
due to contention for the disk controller
other delays due to contention

26
Modifying Blocks

Not possible to modify a block on disk directly
Sequence of procedures
Read block (time rt)
Modify in memory (time mt)
Write block (time wt)
Verify (time vt) if appropriate
Total time
rt + mt + wt + vt

27
Example 2.3 (1)

Let us examine the time to read a 4096-byte block
from the Megatron 747 disk
Characteristic
4 platters (8 surfaces), 1 surface = 8192 tracks
1 track = 256 sectors, 1 sector = 512 bytes
Disk rotates at 3840 RPM, one rotation = 1/64 of
a second
To move the head assembly
1 ms (to start and stop) + 1 ms for every 500
cylinders
Heads move one track in 1.002 ms
To move heads from innermost to outermost track
1 + (8192 / 500) = 17.4 ms

28
Example 2.3 (2)

Minimum time (the best case)
No seek time, no rotational latency, only
transfer time
Note: 1 track = 256 sectors, 1 sector = 512 bytes
4096 bytes / 512 bytes = 8 sectors (including 7
gaps)
Gaps and sectors occupy 10% and 90% of the track
A track has 256 gaps and 256 sectors
36 × 7/256 + 324 × 8/256 = 11.109 degrees
(11.109/360)/64 = 4.8e-4 seconds ≈ 0.5 ms

29
Example 2.3 (3)

Maximum time (the worst case)
full seek time and rotational latency, plus
transfer time
full seek time = 17.4 ms
full rotational time = 1/64 of a second = 15.6 ms
transfer time = 0.5 ms
17.4 + 15.6 + 0.5 = 33.5 ms

30
Example 2.3 (4)

Average Time
Transfer time = 0.5 ms
Average rotational time = half of the full
rotation = 7.8 ms
Average seek time
average distance traveled = 1/3 of the disk =
2730 cylinders
1 + 2730/500 = 6.5 ms
0.5 + 7.8 + 6.5 = 14.8 ms
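
The best, worst, and average cases of Example 2.3 can be reproduced with a small calculation. This is a sketch under the example's stated assumptions (3840 RPM, 1 ms plus 1 ms per 500 cylinders for a seek, gaps taking 10% of each track); the function names are illustrative only.

```python
ROTATION_MS = 1000 / 64      # 3840 RPM: one rotation takes 1/64 second
TRACKS = 8192
SECTORS_PER_TRACK = 256

def seek_ms(cylinders):
    """1 ms to start and stop plus 1 ms per 500 cylinders; 0 if no move."""
    return 0.0 if cylinders == 0 else 1 + cylinders / 500

def transfer_ms(sectors=8):
    """Time for the sectors of one block, and the gaps between them, to pass the head."""
    gap_deg = 36 / SECTORS_PER_TRACK       # gaps cover 10% of 360 degrees
    sector_deg = 324 / SECTORS_PER_TRACK   # sectors cover the other 90%
    degrees = (sectors - 1) * gap_deg + sectors * sector_deg
    return degrees / 360 * ROTATION_MS

best = transfer_ms()                                      # no seek, no latency
worst = seek_ms(TRACKS) + ROTATION_MS + transfer_ms()     # full seek, full rotation
average = seek_ms(TRACKS // 3) + ROTATION_MS / 2 + transfer_ms()

print(f"best {best:.2f} ms, worst {worst:.1f} ms, average {average:.1f} ms")
```

The slides round each term before adding, so intermediate values differ slightly, but the totals agree to one decimal place.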

31
RAM model vs. I/O model computation

I/O model computation
Dominance of I/O cost
Remember, 10^5 to 10^6 in-memory operations take
the same time as one disk I/O
Should minimize the number of block accesses
Data Structure vs. File Processing

32
Using Secondary Storage Effectively

In general database
Whole databases are much too large to fit in main
memory
Key parts of databases are buffered in main
memory
Disk I/Os occur frequently
Main memory sorts (such as Quick sort) are
inadequate

33
Merge Sort
34
Two-Phase, Multiway Merge-Sort (1)

Phase 1
Sort main-memory-sized pieces of the data
Fill all available main memory with blocks
Sort the records in main memory
Write the sorted records

35
Two-Phase, Multiway Merge-Sort (2)

Phase 2
Merge all the sorted sublists into a single
sorted list
Find the smallest key among the first remaining
elements of all the lists
Move the smallest element to the first available
position of the output block
If output block is full, write it to disk and
reinitialize the same buffer
Repeat until all input blocks become exhausted.
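
The two phases can be sketched in a few lines. This is an illustrative in-memory simulation, not a disk-based implementation: Python lists stand in for sorted sublists on disk, `memory_capacity` (an assumed parameter, in records) stands in for available main memory, and block-level input and output buffers are omitted.

```python
import heapq

def two_phase_merge_sort(records, memory_capacity):
    """Sort `records` using memory-sized sorted sublists and one multiway merge."""
    # Phase 1: sort main-memory-sized pieces and "write" each as a sublist.
    sublists = []
    for start in range(0, len(records), memory_capacity):
        sublists.append(sorted(records[start:start + memory_capacity]))

    # Phase 2: keep the first remaining element of every sublist in a heap
    # and repeatedly move the smallest one to the output.
    heap = [(sub[0], i, 0) for i, sub in enumerate(sublists)]
    heapq.heapify(heap)
    output = []
    while heap:
        key, i, pos = heapq.heappop(heap)
        output.append(key)
        if pos + 1 < len(sublists[i]):
            heapq.heappush(heap, (sublists[i][pos + 1], i, pos + 1))
    return output

print(two_phase_merge_sort([5, 3, 8, 1, 9, 2, 7], memory_capacity=3))
```

A heap is used here to find the smallest key among the sublist heads; a linear scan over the heads would follow the slide's wording more literally and give the same result.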

36
Main-memory Organization
37
Merge Sort Example (1)

Assumption
10,000,000 tuples, 1 tuple = 100 bytes
So, 1 Gbyte of data
50 Mbytes of memory available
4096-byte blocks, so each block contains 40
records
Total # of blocks = 250,000
# of blocks in main memory = 12,800 (= 50 × 2^20 /
2^12)
Number of sublists
19 sublists (12,800 blocks) + 1 sublist (6,800
blocks)
Each block read or write = 15 ms

38
Merge Sort Example (2)

Computation
First phase
Read each of the 250,000 blocks once
Write 250,000 new blocks
Total time
(250,000 × 15 ms) × 2 = 7500 seconds = 125
minutes
Second phase
Similar to the first phase
Total time = 125 minutes
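
The block counts and phase times of this example follow directly from the assumptions on the previous slide. A short sketch of the arithmetic, with variable names chosen here for illustration:

```python
import math

TUPLES = 10_000_000
TUPLE_BYTES = 100
BLOCK_BYTES = 4096
MEMORY_BYTES = 50 * 2**20    # 50 Mbytes of main memory
BLOCK_IO_MS = 15             # one block read or write

records_per_block = BLOCK_BYTES // TUPLE_BYTES        # whole records only
total_blocks = TUPLES // records_per_block
memory_blocks = MEMORY_BYTES // BLOCK_BYTES
sublists = math.ceil(total_blocks / memory_blocks)
last_sublist_blocks = total_blocks - (sublists - 1) * memory_blocks

# Each phase reads every block once and writes every block once.
phase_minutes = total_blocks * 2 * BLOCK_IO_MS / 1000 / 60

print(records_per_block, total_blocks, memory_blocks, sublists, last_sublist_blocks)
print("each phase:", phase_minutes, "minutes; both phases:", 2 * phase_minutes)
```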

39
Improving the Access Time of Secondary Storage

Place blocks on the same cylinder
Divide the data among several small disks
Mirroring disks
Use a disk-scheduling algorithm
Prefetch blocks to main memory in anticipation of
their later use

40
Organizing Data by Cylinders

Use several adjacent cylinders
Read all the blocks on a single track or on a
cylinder consecutively
Neglect all but the first seek time and the first
rotational latency

41
Example 2.9 (1)

Recall examples 2.3 and 2.7
Original data may be stored on consecutive
cylinders
Total # of cylinders = 1000 (= 1 Gbyte / 1 Mbyte)
Main memory can hold 50 cylinders (i.e. 50M)
To read 50 cylinders of data into main memory
6.5 ms for average seek time
49 ms for 49 one-cylinder seeks (1 ms each)
6.4 seconds for transfer of 12,800 blocks
(12,800 × 0.5 ms) / 1000 = 6.4 seconds
So, 6.5 + 49 + 6,400 = 6455.5 ms

42
Example 2.9 (2)

First phase
Read
((6.5 ms + 49 ms + 6.4 seconds) × 20 times) =
2.15 minutes
Write: the same as reading
Total time = 4.3 minutes
Second phase
Still takes about 125 minutes (WHY ?)
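
The first-phase figures of Example 2.9 can be recomputed as follows. This sketch only covers the first phase, where data is read and written by whole cylinders; the constant names are illustrative.

```python
AVG_SEEK_MS = 6.5             # one average seek to reach the first cylinder
ONE_CYLINDER_SEEK_MS = 1.0    # moving to the adjacent cylinder
BLOCK_TRANSFER_MS = 0.5
CYLINDERS_PER_FILL = 50       # main memory holds 50 cylinders
BLOCKS_PER_FILL = 12_800
FILLS = 20                    # number of times main memory is filled

def fill_time_ms():
    """Time to read one memory load stored on consecutive cylinders."""
    seeks = AVG_SEEK_MS + (CYLINDERS_PER_FILL - 1) * ONE_CYLINDER_SEEK_MS
    return seeks + BLOCKS_PER_FILL * BLOCK_TRANSFER_MS

read_minutes = fill_time_ms() * FILLS / 1000 / 60
phase1_minutes = 2 * read_minutes   # writing is assumed to cost the same as reading

print(fill_time_ms(), "ms per fill")
print(round(read_minutes, 2), "minutes to read,", round(phase1_minutes, 1), "minutes in total")
```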

43
Using Multiple Disks in place of One

Use several disks with their independent heads
Transfer data at a higher rate
Roughly speaking, total time could be divided by
the number of disks

44
Example 2.10 (1)

Replace one 747 by four 737s which have one
platter and two surfaces
Assumption
Divide the given records among the four disks
Occupy 1000 adjacent cylinders on each disk
Fill ¼ of main memory from each disk
Recall previous examples
Average seek time and rotational latency 0
Number of full memory blocks 12,800
¼ memory size 3,200 blocks

45
Example 2.10 (2)

Computation
First phase
Transfer time = 3200 × 0.5 ms = 1.6 seconds
Read: (6.5 ms + 49 ms + 1.6 seconds) × 20 = 33
sec.
Write: similar to reading
Total time: about 1 minute

46
Example 2.10 (3)

Second phase
Apply delicate techniques (?) to reduce disk I/O
time
Start comparisons among the 20 lists as soon as
the first element of the block appears in main
memory
Use four output buffers
Total time about 1 hour (?)

47
Mirroring Disks

Two or more disks hold identical copies of data
Data survives a head crash on either disk
If we make n copies of a disk, we can read any n
blocks in parallel.
Using mirror disks does not speed up writing, but
neither does it slow writing down (to some
extent)

48
Scheduling Requests by the Elevator Algorithm

The disk controller chooses which of several requests
to execute first, to increase throughput
Elevator Algorithm
Proceed in the same direction until the next
cylinder with blocks to access is encountered
When no requests ahead in direction of travel,
reverse direction
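
The two rules above can be sketched as a small function that returns the order in which pending requests are served. It is a simplification: all requests are assumed to be known up front, whereas a real controller also accepts requests that arrive while the heads are moving. The function and parameter names are hypothetical.

```python
def elevator_order(head, direction, requests):
    """Return the order in which pending cylinder requests are served.

    head: current cylinder of the head assembly
    direction: +1 (toward higher cylinders) or -1 (toward lower cylinders)
    requests: cylinders with blocks to access
    """
    pending = sorted(set(requests))
    order = []
    while pending:
        if direction > 0:
            ahead = [c for c in pending if c >= head]
        else:
            ahead = [c for c in reversed(pending) if c <= head]
        if not ahead:
            direction = -direction    # no requests ahead: reverse direction
            continue
        head = ahead[0]               # next cylinder met in the direction of travel
        order.append(head)
        pending.remove(head)
    return order

print(elevator_order(head=50, direction=+1, requests=[10, 95, 60, 20, 70]))
```

With the heads at cylinder 50 and moving upward, the sketch serves 60, 70, and 95 before reversing to serve 20 and 10.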

49
Example 2.11
Finishing times for block accesses using the
elevator algorithm
Finishing times for block accesses using the
first-come-first-served algorithm
Arrival times for six block-access requests
50
Prefetching Data in Track- or Cylinder-sized
Chunks

Can we predict the order in which blocks will be
requested from disk ?
For example,
Devote two block buffers to each list when merged
(when there is plenty of memory)
When a buffer is exhausted, switch to the other
buffer for the same list

51
Single Buffering

Single buffering
Read B1 → Buffer
Process Data in Buffer
Read B2 → Buffer
Process Data in Buffer ...

Computation
P = time to process/block
R = time to read in 1 block
n = # of blocks
Single buffer time = n(P + R)

52
Single Buffering vs. Double Buffering

Memory
Disk

53
Double Buffering

Computation
P = processing time/block
R = I/O time/block
n = # of blocks
Double buffering time = R + nP (assuming P ≥ R)
Single buffering time = n(R + P)
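
The two formulas can be compared numerically. In this sketch the double-buffering function uses a slightly more general expression, R + (n-1)·max(P, R) + P, which is an assumption added here for the case where reading is slower than processing; it reduces to the slide's R + nP whenever P ≥ R.

```python
def single_buffer_time(n, P, R):
    """Each block is read, then processed, with no overlap."""
    return n * (P + R)

def double_buffer_time(n, P, R):
    """Reading block i+1 overlaps with processing block i (n >= 1).

    The first read and the last processing step cannot be overlapped;
    every step in between costs the slower of the two activities.
    """
    return R + (n - 1) * max(P, R) + P

n, P, R = 10, 5, 3   # illustrative values with P >= R
print(single_buffer_time(n, P, R), double_buffer_time(n, P, R), R + n * P)
```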

54
Prefetching

Combine prefetching with the cylinder-based
strategy
Store the sorted sublists on whole, consecutive
cylinders
Read whole tracks or cylinders whenever we need
some records from a given list

55
Example 2.14 (1)

Consider the second phase of the sort
Have in main memory two track-sized buffers
A track = 128 KB
Total space requirement = 128 KB × 20 lists × 2 =
5 Mbytes
Read all the blocks on 1000 cylinders (8000
tracks)
Computation
average seek time = 6.5 ms
the time for disk to rotate once = 15.6 ms
total time (for reading) = (6.5 + 15.6) × 8000 =
2.95 minutes

56
Example 2.14 (2)

Have in main memory two cylinder-sized buffers
per sorted sublist
1 cylinder = 8 tracks = 128K × 8 = 1M
Use 40 buffers of a megabyte each
50 megabytes available main memory
Need only do a seek once per cylinder
Read all the blocks on 1000 cylinders (8000
tracks)
Total time (for reading)
(6.5 + 8 × 15.6) × 1000 cylinders = 2.19 minutes
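
Both variants of Example 2.14 can be recomputed side by side. A minimal sketch of the arithmetic, with names chosen here for illustration:

```python
AVG_SEEK_MS = 6.5
ROTATION_MS = 15.6        # one full rotation reads one track
TRACKS = 8000
CYLINDERS = 1000
TRACKS_PER_CYLINDER = 8

# Track-sized buffers: one seek and one rotation per track.
track_buffer_minutes = (AVG_SEEK_MS + ROTATION_MS) * TRACKS / 1000 / 60

# Cylinder-sized buffers: one seek per cylinder, then 8 rotations.
cylinder_buffer_minutes = (
    (AVG_SEEK_MS + TRACKS_PER_CYLINDER * ROTATION_MS) * CYLINDERS / 1000 / 60
)

print(round(track_buffer_minutes, 2), "minutes with track-sized buffers")
print(round(cylinder_buffer_minutes, 2), "minutes with cylinder-sized buffers")
```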

57
Block Size Selection

Big block → amortizes I/O cost
Big block → reads in more useless data and takes
longer to read
As memory prices drop, blocks get bigger

58
Disk Failures

Intermittent failure
An attempt to read or write a sector is
unsuccessful, but with repeated tries we are able
to read or write successfully.
Media decay
A bit or bits are permanently corrupted, and the
sector becomes unreadable.
Write failure
We can neither write successfully nor can we
retrieve the previously written sector.
Disk Crash
When a disk becomes unreadable permanently

59
Checksums (1)

Each sector has additional bits, called the
checksum, to check reading or writing operations
A read returns a pair (w, s)
w: the data that is read
s: a status bit
A simple form of checksum: parity

60
Checksums (2)

Example 1 (even parity)
The sequence of bits in a sector 01101000
The parity bit is 1
Data becomes 011010001
Example 2 (even parity)
The sequence of bits in a sector 11101110
The parity bit is 0
Data becomes 111011100
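
The two examples can be reproduced with a short even-parity sketch. Bit sequences are represented as strings of '0' and '1' for readability; the function names are illustrative.

```python
def even_parity_bit(bits):
    """Return 1 if the number of 1s is odd, so that the total becomes even."""
    return bits.count("1") % 2

def with_parity(bits):
    """Append the even-parity bit to the data bits."""
    return bits + str(even_parity_bit(bits))

def parity_ok(stored):
    """A stored sector passes the check if it contains an even number of 1s."""
    return stored.count("1") % 2 == 0

print(with_parity("01101000"))   # example 1
print(with_parity("11101110"))   # example 2
print(parity_ok("011010001"), parity_ok("011010101"))  # intact vs. one flipped bit
```

Flipping any single bit makes the count of 1s odd, so the check fails; flipping two bits restores an even count and goes undetected.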

61
Checksums (3)

An error may go undetected if more than one bit
of the sector is corrupted
If we use n independent bits as a checksum, then
the chance of missing an error is only 1/2^n (WHY
?)

62
Stable Storage (1)

How to correct errors ?
Stable storage is a technique for organizing a
disk so that media decays or failed writes do not
result in permanent loss.
The general idea is that sectors are paired, and
each pair represents one sector-contents X
As the left (XL) and right (XR) copies

63
Stable Storage (2)

Writing policy
Write the value of X into XL
if status is good, the write of XL is done
if status is bad, repeat writing
If it fails after a number of tries, assume a
media failure in the sector
Repeat the above scheme for XR
Reading policy (to obtain the value of X)
Read XL
if status bad is returned, repeat reading
if status good is returned, take that value as X
If XL cannot be read, repeat the above with XR
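
The writing and reading policies can be sketched with a toy model. Everything here is hypothetical: `Sector` is an in-memory stand-in for a disk sector that reports a good or bad status, and `MAX_TRIES` is an assumed retry limit, since the slides only say "a number of times".

```python
MAX_TRIES = 3  # assumed retry limit, not specified on the slides

class Sector:
    """Hypothetical stand-in for a disk sector with a status result."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.value = None

    def write(self, value):
        if self.healthy:
            self.value = value
        return self.healthy            # True means status "good"

    def read(self):
        return self.value, self.healthy   # the (w, s) pair

def stable_write(xl, xr, value):
    """Write XL first, then XR, retrying each; return a success flag per copy."""
    results = []
    for sector in (xl, xr):
        ok = any(sector.write(value) for _ in range(MAX_TRIES))  # stops at first good status
        results.append(ok)
    return results

def stable_read(xl, xr):
    """Return X from XL if it can be read, otherwise from XR."""
    for sector in (xl, xr):
        for _ in range(MAX_TRIES):
            value, good = sector.read()
            if good:
                return value
    raise IOError("neither copy of X could be read")

xl, xr = Sector(healthy=False), Sector()
print(stable_write(xl, xr, "X-contents"), stable_read(xl, xr))
```

In this run the left copy fails permanently, so the value is recovered from the right copy.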

64
Recovery from Disk Crashes

Disk crash is fatal in mission-critical
applications
RAID (redundant arrays of independent disks)
Here, we discuss levels 1, 4, 5, and 6
These RAID schemes also handle failures discussed
previously

65
The Failure Model of Disks

Mean time to failure represents the length of
time by which 50% of a population of disks will
have failed catastrophically.
For modern disks, it is about 10 years

[Figure: fraction of disks surviving vs. time]
66
RAID Level 1

To protect against data loss
Use mirroring disks
The only way data can be lost is if there is a
second disk crash while the first crash is being
repaired.

67
How often will a data loss occur?

Assume
The process of replacing the failed disk
takes 3 hours = 1/8 day = 1/2920 year
A failure rate of 5% per year
Probability that the mirror disk will fail during
copying
(1/20) × (1/2920) = 1/58,400
Mean time to a failure involving data loss
With a 5% failure rate per disk, one of the two
disks will fail once in 10 years on the average
10 × 58,400 = 584,000 years

68
RAID Level 4 (1)

Use one redundant disk no matter how many data
disks there are
In the redundant disk, the ith block consists of
parity checks for the ith blocks of all the data
disks
Use the modulo-2 sum (even parity)

69
The Algebra of Modulo-2 Sums

The commutative law
x ⊕ y = y ⊕ x
The associative law
x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ z
The all-0 vector of the appropriate length is the
identity for ⊕
x ⊕ 0 = 0 ⊕ x = x
⊕ is its own inverse
x ⊕ x = 0
If x ⊕ y = z, then y = x ⊕ z

70
RAID Level 4 Reading (2)

Read disks normally.
We could read the redundant disk !
Example
read disk 2, 3, and 4, and get the contents of
disk 1 using modulo-2 sum.

disk 2: 10101010
disk 3: 00111000
disk 4: 01100010
disk 1: 11110000 (modulo-2 sum of the three)
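
The example can be checked with a bitwise XOR, which is the modulo-2 sum. A small sketch, with `mod2_sum` as an illustrative helper name:

```python
from functools import reduce

def mod2_sum(*blocks):
    """Bitwise modulo-2 sum (XOR) of equal-length bit strings."""
    width = len(blocks[0])
    total = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(total, f"0{width}b")

disk2, disk3, disk4 = "10101010", "00111000", "01100010"
disk1 = mod2_sum(disk2, disk3, disk4)
print(disk1)
```

Because the modulo-2 sum is its own inverse, the same function recovers any one block from the other three.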
71
RAID Level 4 Writing (3)

When a block is written, we need to change the
redundant disk
Naïve approach
N-1 reads of blocks not being rewritten
One write of the new block
One rewrite of the redundant block
In total, N+1 disk I/Os
There is a better way to do that !

72
Writing Example (4)

When disk 2 changes from 10101010 to 11001100
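
The better way is to update the redundant block from the old and new versions of the rewritten block only: new redundant = old redundant ⊕ old block ⊕ new block. That takes four disk I/Os (read the old block and the old redundant block, write the new block and the new redundant block) regardless of the number of disks. The sketch below assumes the block values from the reading example, with disk 4 (01100010) acting as the redundant disk; that pairing is an assumption made for illustration.

```python
def xor_bits(a, b):
    """Bitwise modulo-2 sum of two equal-length bit strings."""
    return format(int(a, 2) ^ int(b, 2), f"0{len(a)}b")

def update_redundant(old_redundant, old_block, new_block):
    """Flip exactly those redundant bits where the data block changed."""
    change = xor_bits(old_block, new_block)
    return xor_bits(old_redundant, change)

old_disk2, new_disk2 = "10101010", "11001100"
old_redundant = "01100010"   # assumed: disk 4 from the reading example
new_redundant = update_redundant(old_redundant, old_disk2, new_disk2)
print(new_redundant)
```

The result matches recomputing the modulo-2 sum of disks 1, 2 (new), and 3 from scratch.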

73
RAID Level 4 Failure Recovery (5)

Recomputing any missing data is simple, and does
not depend on which disk (data or redundant) is
failed.

74
RAID Level 5

We could treat each disk as the redundant disk
for some of the blocks
That is, do not have to treat one disk as the
redundant disk and the others as data disks
When there are n+1 disks (disk 0 to disk n)
If (i mod (n+1)) = j, then we can treat the ith
cylinder of disk j as redundant

75
Example 2.21 (1)

How are redundant blocks assigned for 4 disks (n = 3)?
Disk 0
redundant for blocks 4, 8, 12, ...
Disk 1
redundant for blocks 1, 5, 9, ...
Disk 2
redundant for blocks 2, 6, 10, ...
Disk 3
redundant for blocks 3, 7, 11, ...
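
The assignment follows from the rule i mod (n+1) = j. A short sketch that lists the redundant blocks per disk; numbering blocks from 1 is an assumption made here so that the output matches the listing above.

```python
def redundant_disk(i, num_disks):
    """Disk that holds the redundant copy for block (or cylinder) number i."""
    return i % num_disks

def redundant_blocks(disk, num_disks, limit):
    """Block numbers from 1 to limit for which `disk` is the redundant disk."""
    return [i for i in range(1, limit + 1) if redundant_disk(i, num_disks) == disk]

for disk in range(4):
    print("Disk", disk, "is redundant for blocks", redundant_blocks(disk, 4, 12))
```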

76
Example 2.21 (2)

The reading and writing load for each disk is the
same
If all blocks are equally likely to be written
each disk has a 1/4 chance of holding the block
If not
each disk has a 1/3 chance of being the redundant
disk for that block
Each of four disks is involved in ½ of the writes
1/4 + 3/4 × 1/3 = 1/2

77
RAID Level 6 (1)

To handle any number of disk crashes, data
or redundant
Here, focused on a simple example, where two
simultaneous crashes are correctable and the
strategy is based on a simple error-correcting
code, Hamming code
Consider a system with seven disks
data disks disk 1-4
redundant disks disk 5-7

78
RAID Level 6 (2)

The relationship between data and redundant disks
Note
every possible column of three 0s and 1s,
except for the all-0 column
the columns for the redundant disks have a single 1
the columns for the data disks each have at least
two 1s

79
RAID Level 6 (3)

The disks with 1 in a row are treated as if they
were the entire set of disks in a RAID level 4
scheme.
The bits of disk 5
are the modulo-2 sum of bits of disk 1,2, and 3
The bits of disk 6
are the modulo-2 sum of bits of disk 1,2, and 4
The bits of disk 7
are the modulo-2 sum of bits of disk 1,3, and 4
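
The three rules above can be written as a table that maps each redundant disk to the data disks it covers. The data values in this sketch are illustrative 8-bit blocks chosen here, not values taken from the slides.

```python
from functools import reduce

# Which data disks feed each redundant disk, as stated above.
SOURCES = {5: (1, 2, 3), 6: (1, 2, 4), 7: (1, 3, 4)}

def xor_bits(blocks):
    """Bitwise modulo-2 sum of a list of equal-length bit strings."""
    width = len(blocks[0])
    total = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(total, f"0{width}b")

def redundant_disks(data):
    """data maps data-disk numbers 1-4 to bit strings; returns disks 5-7."""
    return {r: xor_bits([data[d] for d in src]) for r, src in SOURCES.items()}

data = {1: "11110000", 2: "10101010", 3: "00111000", 4: "01000001"}  # sample values
print(redundant_disks(data))
```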

80
RAID Level 6 Read/Write

Reading Just read data from any data disk
normally
Writing
Need to recalculate several redundant disks

81
A Writing Example (1)

Writing
Disk 2 is changed to be 0000111
Corresponding redundant disks
disk 5 and 6
Using modulo-2 sum
between old and new disk 2
between modulo-2 sum of disk 2s and disk 5
between modulo-2 sum of disk 2s and disk 6

82
A Writing Example (2)
83
RAID Level 6 Failure Recovery

Assume that disks a and b fail simultaneously
Find a row r in which the columns of a and b are
different
For example, a has 0 in row r, b has 1 in row r
Compute the correct b by taking the modulo-2 sum
of corresponding bits from all the disks other
than b that have 1 in row r.
Then, compute the correct a
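
The recovery steps can be sketched for the seven-disk scheme, in which disks 5, 6, and 7 hold the modulo-2 sums of data disks (1, 2, 3), (1, 2, 4), and (1, 3, 4). The code looks for a row that involves exactly one failed disk, rebuilds that disk, and repeats. The block values are illustrative samples chosen here, and the function names are hypothetical.

```python
from functools import reduce

# Disks that have a 1 in each row of the matrix (data disks plus one redundant disk).
ROWS = [(1, 2, 3, 5), (1, 2, 4, 6), (1, 3, 4, 7)]

def xor_bits(blocks):
    """Bitwise modulo-2 sum of a list of equal-length bit strings."""
    width = len(blocks[0])
    total = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(total, f"0{width}b")

def recover(surviving, a, b):
    """Rebuild failed disks a and b from the surviving disks' contents."""
    disks = dict(surviving)
    missing = {a, b}
    while missing:
        progress = False
        for row in ROWS:
            lost = [d for d in row if d in missing]
            if len(lost) == 1:        # the row's other disks are all known
                d = lost[0]
                disks[d] = xor_bits([disks[o] for o in row if o != d])
                missing.discard(d)
                progress = True
        if not progress:
            raise ValueError("cannot recover with this scheme")
    return disks

full = {1: "11110000", 2: "10101010", 3: "00111000", 4: "01000001",
        5: "01100010", 6: "00011011", 7: "10001001"}   # sample contents
surviving = {d: v for d, v in full.items() if d not in (2, 5)}
print(recover(surviving, 2, 5) == full)
```

Here disks 2 and 5 are lost: disk 2 is rebuilt from the row containing disks 1, 2, 4, and 6, after which disk 5 follows from disks 1, 2, and 3.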

84
A Recovery Example