Title: Database Systems II: Secondary Storage
1. Database Systems II: Secondary Storage
2. The Memory Hierarchy
[Figure: the memory hierarchy, bottom to top]
- Tertiary storage: tape, network backup
- Disk, with disk cache (2-16 MB); accessed via the file system or via virtual memory (swapping); main-memory DBMSs
  - disk-to-memory bandwidth: 300 MB/s (SATA-300)
- Main memory
  - memory bandwidth: 3,200 MB/s (DDR-SDRAM @ 200 MHz) up to 6,400-12,800 MB/s (DDR2, dual channel, 800 MHz)
- CPU with L1/L2 cache (256 KB-4 MB)
  - CPU-to-cache bandwidth: 16 GB/s (64 bit @ 2 GHz)
- CPU-to-L1-cache: 5 cycles initial latency, then burst mode
- CPU-to-main-memory: 200 cycles latency
3. The Memory Hierarchy
- Cache: Data and instructions are held in the cache when needed by the CPU. The on-board (L1) cache is on the same chip as the CPU; the L2 cache is on a separate chip. Capacity about 1 MB, access time a few nanoseconds.
- Main memory: All active programs and data need to be in main memory. Capacity about 1 GB, access time 10-100 nanoseconds.
4. The Memory Hierarchy
- Secondary storage: used for permanent storage of large amounts of data, typically a magnetic disk. Capacity up to 1 TB, access time about 10 milliseconds.
- Tertiary storage: stores data collections that do not fit onto secondary storage, e.g. magnetic tapes or optical disks. Capacity about 1 PB, access time seconds to minutes.
5. The Memory Hierarchy
- Trade-off: The larger the capacity of a storage device, the slower the access (and vice versa).
- A volatile storage device forgets its contents when power is switched off; a non-volatile device remembers its contents.
- Secondary and tertiary storage are non-volatile; all others are volatile.
- A DBS needs non-volatile (secondary) storage devices to store data permanently.
6. The Memory Hierarchy
- RAM (main memory) for the subset of the database used by current transactions.
- Disk to store the current version of the entire database (secondary storage).
- Tapes for archiving older versions of the database (tertiary storage).
7. The Memory Hierarchy
- Typically, programs are executed in a virtual memory of size equal to the address space of the processor.
- Virtual memory is managed by the operating system, which keeps the most relevant part in main memory and the rest on disk.
- A DBS manages its data itself and does not rely on virtual memory.
- However, main-memory DBSs do manage their data through virtual memory.
8. Moore's Law
- Gordon Moore observed in 1965 that the density of integrated circuits (i.e., the number of transistors per unit area) increases at an exponential rate, roughly doubling every 18 months.
- Parameters that follow Moore's law:
  - the number of instructions per second that can be executed for unit cost,
  - the number of main memory bits that can be bought for unit cost,
  - the number of bytes on a disk that can be bought for unit cost.
9. Moore's Law
[Figure: number of transistors on an integrated circuit over time]
10. Moore's Law
[Figure: disk capacity over time]
11. Moore's Law
- But some other important hardware parameters do not follow Moore's law and grow much more slowly.
- These are, in particular:
  - the speed of main memory access, and
  - the speed of disk access.
- For example, disk latencies (seek times) have almost stagnated over the past 5 years.
- Thus, the relative cost of moving data from one level of the memory hierarchy to the next becomes progressively larger.
12. Disks
- Secondary storage device of choice.
- Data is stored and retrieved in units called disk blocks or pages.
- Main advantage over tapes: random access vs. sequential access.
- Unlike RAM, the time to retrieve a disk page varies depending on its location on disk.
- Therefore, the relative placement of pages on disk has a major impact on DBMS performance!
13. Disks
- A disk consists of two main moving parts: the disk assembly and the head assembly.
- The disk assembly stores information; the head assembly reads and writes information.
[Figure: disk mechanics, showing the spindle, the disk heads, and the platters]
14. Disks
- The platters rotate around the central spindle.
- The upper and lower platter surfaces are covered with magnetic material, which is used to store bits.
- The arm assembly is moved in or out to position a head on a desired track.
- All tracks under the heads at the same time form a cylinder (imaginary!).
- Only one head reads/writes at any one time.
15. Disks
[Figure: top view of a platter surface, showing tracks, sectors, and the gaps between sectors]
16. Disks
- Block size is a multiple of the (fixed) sector size.
- The time to access (read/write) a disk block (disk latency) consists of three components:
  - seek time: moving the arm to position the disk head on the right track,
  - rotational delay: waiting for the block to rotate under the head, and
  - transfer time: actually moving the data to/from the disk surface.
- Seek time and rotational delay dominate.
17. Disks
[Figure: seek time as a function of the number of cylinders traveled; the time grows from x for a minimal move (1 cylinder) to roughly 3x-5x for a sweep across all N cylinders]
18. Disks
Average seek time:

\[ \bar{S} \;=\; \frac{1}{N(N-1)} \sum_{i=1}^{N} \;\sum_{\substack{j=1 \\ j \neq i}}^{N} \mathrm{SEEKTIME}(i \to j) \]

where N is the number of cylinders. Typical average seek time: 5 ms.
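A minimal Python sketch of this double sum, using a made-up linear seek-time model; the startup and per-cylinder costs below are illustrative assumptions, not measured values:

```python
# Average seek time over all ordered cylinder pairs (i, j), i != j.
N = 1000  # number of cylinders (assumed)

def seektime(i, j):
    # Hypothetical model: 1 ms startup plus 0.01 ms per cylinder traveled.
    return 1.0 + 0.01 * abs(i - j)

total = sum(seektime(i, j)
            for i in range(1, N + 1)
            for j in range(1, N + 1) if j != i)
avg = total / (N * (N - 1))
print(f"average seek time: {avg:.2f} ms")  # about 4.3 ms for this model
```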
19. Disks
- Average rotational delay:
[Figure: platter seen from above, with the current head position and the block to be read]
- On average, the desired block is half a revolution away: R = 1/2 revolution. Typical R: 5 ms.
20. Disks
- Transfer time: transfer time = block size / transfer rate.
- Typical transfer rate: 100 MB/s; typical block size: 16 KB.
- Typical transfer time: 16 KB / (100 MB/s) ≈ 0.16 ms.
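Combining the three latency components with these typical values shows how seek time and rotational delay dominate; a minimal sketch, using only the typical numbers quoted above:

```python
# Disk latency = seek time + rotational delay + transfer time.
seek_ms = 5.0                  # typical average seek time
rotational_ms = 5.0            # half a revolution on average
block_size = 16 * 1024         # 16 KB block, in bytes
transfer_rate = 100 * 10**6    # 100 MB/s, in bytes/s

transfer_ms = block_size / transfer_rate * 1000
latency_ms = seek_ms + rotational_ms + transfer_ms
print(f"transfer: {transfer_ms:.2f} ms, total: {latency_ms:.2f} ms")
# transfer: 0.16 ms, total: 10.16 ms -> seek + rotation dominate
```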
21. Disks
- Typical average disk latency is 10 ms; maximum latency is about 20 ms.
- In 10 ms, a modern microprocessor can execute millions of instructions.
- Thus, the time for a block access by far dominates the time typically needed for processing the data in memory.
- The number of disk I/Os (block accesses) is therefore a good approximation for the cost of a database operation.
22. Accelerating Disk Access
- Organize data by cylinders to minimize seek time and rotational delay.
- "Next block" concept:
  - blocks on the same track, followed by
  - blocks on the same cylinder, followed by
  - blocks on an adjacent cylinder.
- Blocks in a file are placed sequentially on disk (by "next").
- The effective access speed can then approach the raw transfer rate.
23. Accelerating Disk Access
- Example:
- Assume 10 ms average seek time, no rotational delay, and a 40 MB/s transfer rate.
- Read a single 4 KB block:
  - Random I/O: ≈ 10 ms
  - Sequential I/O: ≈ 10 ms
- Read 4 MB in 4 KB blocks (amortized):
  - Random I/O: ≈ 10 s
  - Sequential I/O: ≈ 0.1 s
  → speedup factor of 100
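A small sketch that reproduces these numbers; treating 4 MB as 1024 blocks of 4 KB is the only added assumption:

```python
# Random vs. sequential I/O under the example's parameters.
seek_s = 0.010                     # 10 ms average seek, no rotational delay
transfer_rate = 40 * 10**6         # 40 MB/s, in bytes/s
block = 4 * 1024                   # 4 KB block
n_blocks = (4 * 1024 * 1024) // block  # 4 MB -> 1024 blocks

transfer_s = block / transfer_rate              # per-block transfer time
random_io = n_blocks * (seek_s + transfer_s)    # a seek before every block
sequential_io = seek_s + n_blocks * transfer_s  # one seek, then stream

print(f"random: {random_io:.2f} s, sequential: {sequential_io:.2f} s")
# random: ~10 s, sequential: ~0.1 s -> speedup factor of about 100
```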
24. Accelerating Disk Access
- Block size selection:
  - Bigger blocks → amortize the I/O cost.
  - Bigger blocks → read in more useless data and take longer to read.
- Good trade-off: block sizes from 4 KB to 16 KB.
- With decreasing memory costs, blocks are becoming bigger!
25. Accelerating Disk Access
- Using multiple disks:
- Replace one disk (with one independent head) by many disks (with many independent heads).
- Striping a relation R: divide its blocks over n disks in a round-robin fashion (see the sketch below).
- Assuming that the disk controller, the bus, and main memory can handle n times the transfer rate, striping a relation across n disks can yield a speedup factor of up to n.
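A minimal sketch of the round-robin mapping; numbering disks and logical blocks from 0 is an assumption for illustration:

```python
# Round-robin striping: logical block k of relation R lives on disk
# k mod n, at local position k div n.
def striped_location(block_no: int, n_disks: int) -> tuple[int, int]:
    """Return (disk number, local block number) for a logical block."""
    return block_no % n_disks, block_no // n_disks

n = 4
for k in range(8):
    disk, local = striped_location(k, n)
    print(f"logical block {k} -> disk {disk}, block {local}")
# Consecutive blocks land on different disks, so up to n blocks can be
# transferred in parallel -> speedup of up to n.
```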
26. Accelerating Disk Access
- Disk scheduling:
- For I/O requests from different processes, let the disk controller choose the processing order.
- Following the elevator algorithm, the disk controller keeps sweeping from the innermost to the outermost cylinder, stopping at cylinders for which there is an I/O request.
- It reverses the sweep direction as soon as there is no I/O request ahead in the current direction.
- This optimizes throughput and average response time.
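A minimal sketch of the sweep logic; the request queue, cylinder numbers, and starting position are made up for illustration:

```python
# Elevator (SCAN) scheduling: serve requests in the current sweep
# direction; reverse when no request lies ahead.
def elevator(requests, start, direction=1):
    pending = sorted(set(requests))
    order, pos = [], start
    while pending:
        ahead = [c for c in pending if (c - pos) * direction >= 0]
        if not ahead:              # nothing ahead: reverse the sweep
            direction = -direction
            continue
        nxt = min(ahead, key=lambda c: abs(c - pos))  # nearest cylinder ahead
        order.append(nxt)
        pending.remove(nxt)
        pos = nxt
    return order

print(elevator([98, 183, 37, 122, 14, 124, 65, 67], start=53))
# [65, 67, 98, 122, 124, 183, 37, 14]
```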
27. Accelerating Disk Access
- Double buffering:
- In some scenarios, we can predict the order in which blocks will be requested from disk by some process.
- Prefetching (double buffering) is the method of fetching the necessary blocks into the buffer in advance.
- It requires enough buffer space.
- Speedup factor of up to n, where n is the number of blocks requested by a process.
28. Accelerating Disk Access
- Single buffering:
  - (1) Read B1 → buffer
  - (2) Process data in buffer
  - (3) Read B2 → buffer
  - (4) Process data in buffer
  - ...
- Execution time: n(P + R), where
  - P = time to process one block,
  - R = time to read in one block,
  - n = number of blocks read.
29. Accelerating Disk Access
- Double buffering:
  - (1) Read B1, ..., Bn → buffer
  - (2) Process B1 in buffer
  - (3) Process B2 in buffer
  - ...
- Execution time: R + nP, as opposed to n(P + R).
- Remember that R >> P.
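The cost model in a few lines of Python; the values of R and P are illustrative assumptions, chosen so that R >> P:

```python
# Single buffering costs n * (P + R); double buffering costs R + n * P,
# since every read after the first overlaps with processing.
R = 10.0   # ms to read one block (dominated by disk latency); assumed
P = 0.5    # ms to process one block in memory; assumed, R >> P
n = 100    # blocks read

single = n * (P + R)
double = R + n * P
print(f"single buffering: {single:.0f} ms, double buffering: {double:.0f} ms")
# single buffering: 1050 ms, double buffering: 60 ms
```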
30. Disk Failures
- In an intermittent failure, a read or write operation is unsuccessful, but succeeds with repeated tries. → parity checks to detect intermittent failures
- Media decay is a permanent corruption of one or more bits which makes the corresponding sector impossible to read/write. → stable storage to recover from media decay
- A disk crash makes the entire disk permanently unreadable. → RAID to recover from disk crashes
31. Disk Failures
- Checksums:
- Add n parity bits for every m data bits.
- The number of 1s among a collection of bits and their parity bit is always even.
- The parity bit is the modulo-2 sum of its data bits.
- Example with m = 8, n = 1:
  - Block A: 01101000 1 (data has an odd # of 1s, so the parity bit is 1)
  - Block B: 11101110 0 (data has an even # of 1s, so the parity bit is 0)
- If Block A instead contains 01100000 1 (an odd # of 1s overall), the error is detected.
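A minimal Python sketch of this single-parity-bit scheme (m = 8, n = 1):

```python
# The parity bit is the modulo-2 sum of the data bits, so data plus
# parity always contains an even number of 1s.
def parity_bit(data_bits: str) -> int:
    return data_bits.count("1") % 2

def check(data_bits: str, parity: int) -> bool:
    return (data_bits.count("1") + parity) % 2 == 0

a = "01101000"                  # Block A's data bits (odd # of 1s)
p = parity_bit(a)               # -> 1
print(check(a, p))              # True: block accepted
print(check("01100000", p))     # False: single-bit corruption detected
```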
32. Disk Failures
- Checksums:
- But what if multiple bits are corrupted?
- E.g., if Block A instead contains 01000000 1 (an even # of 1s overall), the error cannot be detected.
- The probability that a single parity bit cannot detect a corrupt block is 1/2.
- This assumes that disk failures involving an odd number of bits and those involving an even number of bits are equally likely.
33. Disk Failures
- Checksums:
- More parity bits decrease the probability of an undetected failure: with n independent parity bits, this probability is only 1/2^n.
- E.g., we can have eight parity bits, one for the first bit of every byte, a second one for the second bit of every byte, and so on.
- The chance of not detecting a disk failure is then only 1/256.
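A sketch of this byte-wise scheme: XOR-ing all bytes of a block accumulates, in bit position k, the parity of the k-th bits of all bytes, which yields the eight parity bits at once:

```python
# Eight parity bits, one per bit position of a byte. A corruption goes
# undetected only if every bit position flips an even number of times,
# which happens with probability 1/2**8 = 1/256.
def bytewise_parity(block: bytes) -> int:
    p = 0
    for byte in block:
        p ^= byte        # XOR accumulates per-bit-position parity
    return p

block = bytes([0b11110000, 0b10101010, 0b00111000])
print(f"{bytewise_parity(block):08b}")  # 01100010
```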
34. Disk Failures
- Stable storage:
- Sectors are paired, and information X is written to both sectors Xl and Xr.
- Assume that both copies are written with a sufficient number of parity bits so that bad sectors can be detected.
- If a sector is bad (according to its checksum), write to the alternative sector.
- Read Xl and Xr alternately until a good value is returned.
- The probability of both Xl and Xr failing is very low.
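A minimal sketch of this policy for one sector pair; read_sector, write_sector, and checksum_ok are hypothetical stand-ins for the real disk operations and the parity check:

```python
# Stable storage over a sector pair (Xl, Xr).
def stable_write(write_sector, value):
    # Write both copies; retry until each checksum-protected write is good.
    for side in ("Xl", "Xr"):
        while not write_sector(side, value):
            pass  # bad sector: repeat / use the alternative sector

def stable_read(read_sector, checksum_ok, max_tries=100):
    # Alternate reading Xl and Xr until a good value is returned.
    for attempt in range(max_tries):
        side = "Xl" if attempt % 2 == 0 else "Xr"
        value = read_sector(side)
        if checksum_ok(value):
            return value
    raise IOError("both copies keep failing; the probability of this is very low")
```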
35. Disk Failures
- Disk arrays:
- So far, we cannot recover from disk crashes.
- To address this problem, we use Redundant Arrays of Independent Disks (RAID): arrangements of several disks that give the abstraction of a single, large disk.
- Goal: increase reliability (and performance).
- Redundant information allows reconstruction of data if a disk fails.
- Data striping improves disk performance.
36. Disk Failures
- Failure models for disks:
- What is the expected time until a disk crash?
- We assume a uniform distribution of failures over time.
- Mean time to failure: the time period by which 50% of a population of disks have failed (crashed).
- A typical mean time to failure is 10 years.
- In this case, 5% of the disks crash in the first year, 5% in the second year, ..., 5% in the tenth year, ..., and 5% in the twentieth year.
37. Disk Failures
- Failure models for disks:
- Given the mean time to failure (mtf) in years, we can derive the probability p of a particular disk failing in a given year:
  p = 1 / (2 * mtf)
- Example: mtf = 10 → p = 1/20 = 5%.
- Mean time to data loss: the time period by which 50% of a population of disks have had a crash that resulted in data loss.
- The mean time to disk failure is not necessarily the same as the mean time to data loss.
38. Disk Failures
- Failure models for disks:
- Failure rate: the percentage of disks of a population that have failed up to a certain point in time.
- Survival rate: the percentage of disks of a population that have not failed up to a certain point in time.
- While it simplifies the analysis, the assumption of a uniform distribution of failures is unrealistic.
- Disks tend to fail early (manufacturing defects that have not been detected) or late (wear and tear).
39. Disk Failures
- Failure models for disks:
[Figure: realistic survival rate as a function of time ("bathtub" curve)]
40. Disk Failures
- Mirroring:
- The data disk is copied onto a second disk, the mirror disk.
- When one of the disks crashes, we replace it by a new disk and copy the other disk onto the new one.
- Data loss can only occur if the second disk crashes while the first one is being replaced.
- This probability is negligible.
- Mirroring is referred to as RAID level 1.
41. Disk Failures
- Parity blocks:
- Mirroring doubles the number of disks needed.
- The parity block approach needs only one redundant disk for n (arbitrarily many) data disks.
- On the redundant disk, the i-th block stores parity checks for the i-th blocks of all n data disks.
- The parity block approach is called RAID level 4.
[Figure: RAID 4 layout with data disks A, B, C and parity disk P]
42. Disk Failures
- Parity blocks:
- Reading blocks works the same as without parity blocks.
- When writing a block on a data disk, we also need to update the corresponding block of the redundant disk.
- This can be done using four (three additional) disk I/Os: read the old value of the data block, read the corresponding block of the redundant (parity) disk, write the new data block, and recompute and write the new redundant block (see the sketch below).
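Why four I/Os suffice: parity is a mod-2 sum, so the new parity is old parity XOR old data XOR new data, and the other data disks need not be read. A minimal sketch; the bit patterns reuse the example on the next slides:

```python
# Four-I/O RAID 4 write, with blocks modeled as bytes for illustration.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = bytes([0b10101010])   # I/O 1: read old data block
old_parity = bytes([0b01100010])   # I/O 2: read old parity block
new_data   = bytes([0b11111111])   # I/O 3: write new data block

# I/O 4: recompute and write the new parity block.
new_parity = xor_blocks(old_parity, xor_blocks(old_data, new_data))
print(f"{new_parity[0]:08b}")  # 00110111
```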
43. Disk Failures
- Parity blocks:
- If one of the disks crashes, we bring in a new disk.
- The content of this disk can be computed, bit by bit, using the remaining n disks.
- There is no difference between the data disks and the parity disk here.
- The computation is based on the definition of parity, i.e. the total number of 1s in each bit position is even.
44. Disk Failures
Example:
- n = 3 data disks:
  - Disk 1, block 1: 11110000
  - Disk 2, block 1: 10101010
  - Disk 3, block 1: 00111000
- and one parity disk:
  - Disk 4, block 1: 01100010
- → The number of 1s in each column (across all four disks) is always even.
- → The mod-2 sum can recover any single missing row.
45. Disk Failures
Example:
- Suppose we have
  - Disk 1, block 1: 11110000
  - Disk 2, block 1: ????????
  - Disk 3, block 1: 00111000
  - Disk 4, block 1: 01100010 (parity)
- Use the mod-2 sums for block 1 over disks 1, 3, and 4 to recover block 1 of the failed disk 2:
- → Disk 2, block 1: 10101010
46. Disk Failures
- RAID level 5:
- In the RAID 4 scheme, the parity disk is the bottleneck: on average, there are n times as many writes on the parity disk as on each data disk.
- However, the failure recovery method does not distinguish between the types of the n + 1 disks.
- RAID level 5 therefore does not use a fixed parity disk, but uses block i of disk j as the redundant block if i mod (n + 1) = j (see the sketch below).
[Figure: RAID 5 layout, with the parity block rotating across the four disks]
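A sketch of the placement rule; numbering the n + 1 disks from 0 to n is an assumption for illustration:

```python
# RAID 5 parity placement: block i's parity lives on disk i mod (n + 1);
# the remaining disks hold block i's data.
n = 3  # data disks, so n + 1 = 4 disks in total

for i in range(8):
    parity_disk = i % (n + 1)
    data_disks = [j for j in range(n + 1) if j != parity_disk]
    print(f"block {i}: parity on disk {parity_disk}, data on disks {data_disks}")
# The parity load is spread evenly over all disks, removing the RAID 4
# bottleneck.
```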