1
Storing Data: Disks and Files
  • Overview
  • Types of memory
  • Magnetic disks
  • RAID
  • Striping and redundancy
  • RAID levels
  • Disk Space Management
  • Buffer Management
  • Buffer management details
  • Replacement Policies
  • Files and Indices
  • Organizing pages into files: heap files
  • Other file organizations
  • Indices
  • Page Formats
  • Fixed length records
  • Variable length records
  • Record Formats

2
Accessing Data: Overview
  • When a query is processed, data needs to be
    retrieved from storage.
  • Data is stored on devices such as disks.
  • The disk space manager (DSM) keeps track of
    available disk space.
  • The file manager asks the DSM to find or
    release disk space.
  • Disk space is tracked in pages (typically 4 or 8
    KB).
  • When a record is required, it must be fetched from
    disk to main memory.
  • The page is found by the file manager.
  • A request for the page is issued to the buffer
    manager.
  • The buffer manager fetches the page from disk to
    the buffer in main memory and informs the file
    manager of its location.
  • The above process has to be performed as
    efficiently as possible.

3
Memory
  • Memory forms a hierarchy, with the fastest and
    most expensive at the top and the slowest and
    cheapest at the bottom.
  • Primary memory (volatile)
  • Main memory.
  • Cache.
  • Secondary memory (non-volatile)
  • Magnetic disk.
  • Tertiary memory (non-volatile)
  • Tape (sequential access).
  • CD / DVD.
  • Usually used as backup.
  • If main memory is much faster, why not store a DB
    there?
  • Main memory currently costs about 100 times as
    much as disk (per MB).
  • Buying enough main memory to store a DB would be
    very expensive.
  • On a 32-bit system only 2^32 bytes can be directly
    referenced.
  • Data must be maintained between executions, which
    requires non-volatile memory.

4
Magnetic Disks
  • Support direct access.
  • Data is stored on disk blocks.
  • A contiguous sequence of bytes.
  • The unit in which data is written to and read
    from a disk.
  • Blocks are arranged in tracks on platters.
  • Tracks are concentric rings and can be recorded
    on one or both sides of a platter.
  • Platters are therefore single or double-sided.
  • The set of tracks with the same diameter is
    referred to as a cylinder.
  • A cylinder therefore contains one track per
    platter surface.
  • Each track is divided into arcs called sectors.
  • The size of a disk block is some multiple of the
    sector size.
  • A disk head array reads / writes blocks.
  • There is a disk head for each surface but the
    heads are moved as a unit.
  • To read or write a block a disk head must be over
    it.
  • Only one disk head can read or write at a time.

5
Hard Drive Structure
(Figure: platters and the disk head array)
  • The disk head array moves as a unit.
  • It can only access one track on the surface of
    one platter.
  • Data is read by moving the disk head array to the
    appropriate track and waiting for the data to
    spin underneath the head.

6
Accessing Disks
  • Direct access to any location in main memory
    takes the same time (approximately).
  • Access time for a disk block is given by
  • seek time + rotational delay + transfer time
    (a worked example appears at the end of this slide).
  • Seek time is the time taken to move the disk
    heads to the correct track.
  • Rotational delay is the waiting time for the
    desired block to rotate under the disk head.
  • Transfer time is the time to actually read or
    write the data in the block.
  • Access time is therefore affected by how data is
    stored on a disk.
  • Related data should be stored close to each
    other.
  • In order of closeness: records on the same block.
  • Adjacent blocks.
  • Same track.
  • Same cylinder, different platters.
  • Adjacent cylinders.
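  • A minimal sketch of this formula in Python, using
    illustrative drive parameters that are assumptions
    rather than values from the slides:

    # Rough model of disk access time:
    #   seek time + rotational delay + transfer time.
    # The drive parameters below are illustrative assumptions.
    AVG_SEEK_MS = 9.0          # average seek time (ms)
    RPM = 7200                 # spindle speed
    TRANSFER_MB_PER_S = 150.0  # sustained transfer rate

    def access_time_ms(block_kb: float) -> float:
        """Estimated time to read one block, in milliseconds."""
        rotational_delay_ms = 0.5 * (60_000 / RPM)   # half a revolution on average
        transfer_ms = block_kb / 1024 / TRANSFER_MB_PER_S * 1000
        return AVG_SEEK_MS + rotational_delay_ms + transfer_ms

    print(f"8 KB block: {access_time_ms(8):.2f} ms")  # dominated by seek and rotation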

7
Redundant Arrays of Independent Disks (RAID)
  • Disks are bottlenecks for processing.
  • They have much greater access times than main
    memory.
  • They have relatively high failure rates.
  • A disk array consists of several disks arranged
    to increase performance and improve reliability.
  • Performance is improved by data striping.
  • Data is divided into (equal-size) partitions
    called striping units.
  • The striping units are distributed over the disks
    using a round robin algorithm.
  • The disks can be read in parallel so D (the
    number of disks) blocks can be read at one time.
  • Reliability is improved through redundancy.
  • Disk arrays that use data striping and redundancy
    are called RAID. There are several RAID
    organizations, called levels.

8
Bit and Block Striping
  • A file is made up of a sequence of bits. Let
    these bits be numbered sequentially starting with
    1.
  • Assume that a block consists of 8 bits.
  • Assume that there is a 4-disk RAID array.
  • Block striping assigns whole blocks to the disks
    in round-robin order.
  • Bit striping spreads the bits of each block across
    the disks.
  • Both mappings are sketched below.
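  • A minimal sketch of the two mappings for this 4-disk
    example, assuming striping units are assigned round
    robin (as on the previous slide):

    D = 4            # number of disks
    BLOCK_BITS = 8   # bits per block, as in the example

    def block_striping_disk(block_num: int) -> int:
        """Block striping: whole block i goes to disk i mod D."""
        return block_num % D

    def bit_striping_disk(bit_num: int) -> int:
        """Bit striping: bit i (numbered from 1) goes to disk (i - 1) mod D."""
        return (bit_num - 1) % D

    # Block striping: blocks 0..7 land on disks 0,1,2,3,0,1,2,3.
    print([block_striping_disk(b) for b in range(8)])
    # Bit striping: the 8 bits of one block are spread over all 4 disks.
    print([bit_striping_disk(i) for i in range(1, BLOCK_BITS + 1)])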

9
RAID Redundancy
  • Increasing the number of disks increases
    performance but decreases reliability.
  • If the Mean Time to Failure (MTTF) of one disk is
    50,000 hours (about 5.7 years), then the MTTF of an
    array of 100 disks is only about 21 days
    (5.7 × 365 ≈ 2,081 days; 2,081 / 100 ≈ 21 days).
  • Storing redundant data allows that data to be
    used to reconstruct the data on the failed disk.
  • Where is the redundant data to be stored?
  • On check disks reserved for redundant data or
  • Distributed uniformly over all disks.
  • How is the redundant information computed?
  • Using a parity scheme. For each bit position on the
    data disks there is a parity bit on the check disk.
    If the sum of the bits in that position across the
    data disks is even, the parity bit is set to zero;
    if it is odd, it is set to one. The data on any one
    failed disk can then be reconstructed bit by bit
    (see the sketch at the end of this slide).
  • In RAID the disk array is partitioned into
    reliability groups. A reliability group consists
    of a set of data disks and check disks. The
    number of check disks depends on the reliability
    level chosen.
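  • A minimal sketch of the parity scheme, expressing the
    even/odd rule as XOR over corresponding bits and
    reconstructing one failed disk from the survivors:

    from functools import reduce

    def parity(blocks: list[int]) -> int:
        """XOR of corresponding bits: the parity bit is 1 exactly
        when the sum of the data bits in that position is odd."""
        return reduce(lambda a, b: a ^ b, blocks, 0)

    data = [0b10110010, 0b01101100, 0b11100001]   # three data disks, one block each
    check = parity(data)                          # block on the check disk

    # Disk 1 fails: recompute its block from the other disks plus the check disk.
    recovered = parity([data[0], data[2], check])
    assert recovered == data[1]
    print(f"recovered block: {recovered:08b}")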

10
RAID Levels
  • Level 0: Nonredundant
  • Uses data striping but does not record redundant
    data.
  • Cheap, but reliability is a problem since MTTF
    decreases with the number of disks.
  • Highest write efficiency (no redundant data has
    to be written).
  • Level 1: Mirrored
  • Maintains an identical copy of each disk.
  • Very expensive.
  • Each write involves two disks, and the two copies
    are not written simultaneously in case a system
    failure occurs during writing.
  • No striping so transfer time is comparable to
    that of a single disk (high).
  • Allows parallel reads of the blocks that
    conceptually reside on the same disk.
  • Level 0+1 (10): Mirroring and Striping
  • Combines the data striping from level 0 and the
    redundancy from level 1.

11
More RAID levels
  • Level 2: Error-correcting codes
  • Uses the Hamming code for the redundancy scheme.
  • The Hamming code allows recovery from single-disk
    failure and identifies the failed disk.
  • The number of check disks grows logarithmically
    with the number of data disks.
  • Striping is at the bit level.
  • The smallest unit of transfer is therefore D
    blocks. This level is therefore good for
    workloads with many large requests but bad for
    small requests.
  • A write of a block involves reading D blocks into
    memory, modifying them, and writing D + C blocks
    (where C is the number of check disks).
  • Level 3: Bit-interleaved parity
  • While Hamming codes can identify the failed disk,
    that task can easily be performed by the disk
    controller, so level 2 stores more redundant data
    than is necessary.
  • Level 3 uses a single check disk with parity
    information and bit level striping.
  • Performance is similar to level 2.

12
Even More RAID Levels
  • Level 4: Block-interleaved parity
  • Level 4 uses a striping unit of a block.
  • This means that read requests of the size of a
    disk block can be served entirely by the disk
    where the block resides.
  • Large read requests can still use the aggregate
    bandwidth of the D disks.
  • Writing a block involves only one data disk and
    the check disk (the new parity can be calculated
    from the existing parity and the difference
    between the new and old data blocks; see the
    sketch at the end of this slide).
  • Only one check disk is ever required.
  • Level 5: Block-interleaved distributed parity
  • Improves on level 4 by distributing the parity
    blocks uniformly over all the disks rather than
    storing them all on one check disk.
  • This allows several write requests to be processed
    in parallel (since the bottleneck of a single check
    disk has been removed).
  • Read requests have greater parallelism since all
    disks are involved (though proportionally this
    diminishes as the number of disks increases).
  • This level has the best performance of all RAID
    levels for small and large reads and large writes.
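  • A minimal sketch of the level 4/5 small-write rule:
    the new parity is the old parity XORed with the
    difference (XOR) of the old and new data blocks, so
    only one data disk and the parity block are touched:

    def updated_parity(old_parity: int, old_block: int, new_block: int) -> int:
        """New parity from the existing parity and the change in one block."""
        return old_parity ^ old_block ^ new_block

    blocks = [0b1010, 0b0110, 0b0001]             # blocks on three data disks
    parity = blocks[0] ^ blocks[1] ^ blocks[2]    # their parity block

    new_block = 0b1111                            # overwrite the block on disk 1
    parity = updated_parity(parity, blocks[1], new_block)
    blocks[1] = new_block
    assert parity == blocks[0] ^ blocks[1] ^ blocks[2]   # disks 0 and 2 never read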

13
And More RAID Levels
  • Level 6: P+Q Redundancy
  • What happens if a disk fails and another one
    fails before the first has been replaced (or
    while it is being replaced)?
  • Level 6 uses Reed-Solomon codes.
  • Reed-Solomon codes allow for recovery from two
    simultaneous disk failures.
  • Level 6 requires two check disks but the parity
    data is distributed similarly to level 5.
  • For small writes the read-modify-write process is
    significantly less efficient than in level 5 (two
    check blocks must be updated rather than one).
  • Which RAID level?
  • Level 0 improves performance at the lowest cost
    but does not improve reliability.
  • Level 0+1 is better than level 1 and has the
    best write performance.
  • Levels 2 and 4 are always inferior to 3 and 5.
  • Level 3 is good for large transfer requests of
    several contiguous blocks but bad for many small
    requests of a single disk block.
  • Level 5 is a good general-purpose solution.
  • Level 6 is appropriate if higher reliability is
    required.

14
Managing Data
  • Managing files (in secondary memory)
  • Disk space manager is responsible.
  • File storage
  • How are the records organized?
  • How are records stored on a page?
  • How are fields stored in a record?
  • Managing data in main memory
  • The buffer manager is responsible.
  • When main memory is full frames have to be
    replaced.
  • What replacement policies are there?
  • Which policy is best?

15
Disk Space Management
  • The lowest level of the DBMS architecture is the
    disk space manager (DSM) which manages disk space
    (really!).
  • It supports the allocation and deallocation of
    pages in disk.
  • A page is an abstract unit of storage.
  • Each page is mapped to a disk block.
  • This mapping allows reading and writing a page
    to be done in one disk I/O.
  • It may be useful to allocate a sequence of pages
    as a contiguous sequence of blocks to hold data
    that is often accessed sequentially.
  • The DSM hides the details of data storage so that
    higher levels can manipulate data as a collection
    of pages, rather than by referring to the
    underlying blocks.

16
Tracking Free Blocks
  • The DSM keeps track of which blocks are free and
    the mapping of pages to blocks.
  • While blocks may be initially allocated
    sequentially, continual allocation and
    deallocation will create holes.
  • Recording free blocks:
  • Use a linked list. A pointer to the first free
    block is kept in a known location.
  • Maintain a bitmap with one bit for each block,
    indicating whether the block is in use. This allows
    for fast allocation of contiguous areas on the disk
    (see the sketch at the end of this slide).
  • Using the Operating System (OS)
  • OSs manage disk space and support the abstraction
    of a file as a sequence of bytes.
  • A DSM could be built using OS files.
  • In practice this is often not done.
  • Using a particular OS makes a DBMS less portable.
  • Using the OS also imposes technical limitations
    (such as file size).
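  • A minimal sketch of a free-block bitmap, with a
    hypothetical allocate_contiguous helper to show why
    bitmaps make contiguous allocation easy:

    class FreeSpaceBitmap:
        """One bit per disk block: False = free, True = in use (simplified)."""

        def __init__(self, num_blocks: int):
            self.used = [False] * num_blocks

        def allocate_contiguous(self, n: int) -> int:
            """Claim n contiguous free blocks; return the first block number."""
            run_start, run_len = 0, 0
            for i, used in enumerate(self.used):
                if used:
                    run_start, run_len = i + 1, 0
                else:
                    run_len += 1
                    if run_len == n:
                        for j in range(run_start, run_start + n):
                            self.used[j] = True
                        return run_start
            raise RuntimeError("no contiguous run of that size")

        def free(self, start: int, n: int) -> None:
            for j in range(start, start + n):
                self.used[j] = False

    bm = FreeSpaceBitmap(16)
    first = bm.allocate_contiguous(4)   # e.g. pages for a sequentially scanned file
    bm.free(first, 4)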

17
The Buffer Manager
  • The buffer manager (BM) is responsible for
    bringing pages from disk to main memory.
  • Main memory is partitioned into a collection of
    pages called the buffer pool. These pages are
    referred to as frames.
  • If the buffer is full, old pages have to be
    replaced with new pages, hence a replacement
    policy is required.
  • For example, a DBMS may contain 1,000,000 pages
    of data but there may only be 1,000 pages of main
    memory.
  • Higher levels of the DBMS can request data, the
    BM deals with the details.
  • Note that the BM must be informed if a page is no
    longer needed so that it can be replaced.
  • The BM must also be informed if a page has been
    modified so that the change can be made to the
    copy on disk.

18
BM Statistics
  • The BM keeps track of two variables for each
    frame in the buffer:
  • pin-count: the number of times the frame's page
    has been requested but not released, i.e. the
    number of current users of the page.
  • dirty: indicates whether the page has been
    modified since it was brought into the buffer.
  • When a page is requested the BM (see the sketch at
    the end of this slide):
  • Checks whether the page is in the buffer. If it
    is, it increments pin-count. Otherwise it:
  • Chooses a frame to replace (using the replacement
    policy).
  • If the dirty bit of that frame is on, writes the
    page being replaced to disk.
  • Reads the requested page into the replacement
    frame and increments pin-count (to 1).
  • Returns the address of the frame.
  • Incrementing pin-count for a frame is called
    pinning. Decrementing it is referred to as
    unpinning.
  • A frame is only available for replacement if its
    pin-count is zero.
  • If there is no frame with pin-count of zero the
    BM must wait or abort the transaction.
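  • A minimal sketch of this pin/unpin protocol; the
    names and the disk_read, disk_write and choose_victim
    helpers are assumptions for illustration, not part of
    any particular DBMS:

    class Frame:
        def __init__(self):
            self.page_id = None
            self.pin_count = 0
            self.dirty = False
            self.referenced = False     # used by the clock policy on the next slide
            self.data = None

    class BufferManager:
        def __init__(self, num_frames, disk_read, disk_write, choose_victim):
            self.frames = [Frame() for _ in range(num_frames)]
            self.page_table = {}        # page_id -> frame holding that page
            self.disk_read, self.disk_write = disk_read, disk_write
            self.choose_victim = choose_victim

        def pin(self, page_id):
            frame = self.page_table.get(page_id)
            if frame is None:                               # page not in the buffer
                frame = self.choose_victim(self.frames)     # must have pin_count == 0
                if frame is None:
                    raise RuntimeError("all frames pinned") # wait or abort in practice
                if frame.dirty:                             # write back before reuse
                    self.disk_write(frame.page_id, frame.data)
                if frame.page_id is not None:
                    del self.page_table[frame.page_id]
                frame.page_id, frame.data = page_id, self.disk_read(page_id)
                frame.dirty = False
                self.page_table[page_id] = frame
            frame.pin_count += 1                            # pinning
            return frame

        def unpin(self, page_id, dirty):
            frame = self.page_table[page_id]
            frame.pin_count -= 1                            # unpinning
            frame.dirty = frame.dirty or dirty
            if frame.pin_count == 0:
                frame.referenced = True                     # for clock replacement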

19
Buffer Replacement Policies
  • Least recently used (LRU).
  • Use a queue (FIFO) to keep track of frames with
    pin-count equal to zero.
  • When a frame's pin-count is decremented to zero,
    add it to the queue.
  • Replace the frame at the head of the queue.
  • Clock replacement.
  • A variant of LRU (with less overhead).
  • Uses a variable, current, with value from 1 to N
    where N is the number of buffer pages.
  • Each frame has an associated referenced bit that
    is turned on when its pin-count first reaches
    zero.
  • The process works as follows (sketched at the end
    of this slide):
  • Consider the current frame for replacement.
  • If it is not a candidate (its pin-count is greater
    than zero), increment current.
  • If the current frame has referenced turned on,
    turn referenced off and increment current (its
    pin-count must be zero).
  • If the current frame has referenced off and a
    pin-count of zero, replace it.
  • If current is incremented past N, wrap it around
    to 1.
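  • A minimal sketch of the clock policy; it assumes each
    frame carries pin_count and referenced attributes (as
    in the buffer manager sketch on the previous slide):

    class ClockPolicy:
        def __init__(self, num_frames):
            self.current = 0
            self.n = num_frames

        def choose_victim(self, frames):
            for _ in range(2 * self.n):                     # at most two full sweeps
                frame = frames[self.current]
                self.current = (self.current + 1) % self.n  # advance the hand, wrapping
                if frame.pin_count > 0:
                    continue                                # pinned: not a candidate
                if frame.referenced:
                    frame.referenced = False                # second chance, move on
                    continue
                return frame                                # unpinned, referenced off
            return None                                     # every frame is pinned

  • An instance's choose_victim could be passed to the
    buffer manager sketch on the previous slide as its
    replacement policy.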

20
Buffer Replacement Discussion
  • The LRU and clock replacement schemes are fair
    schemes.
  • They are not always the best strategies for a DB
    system.
  • Consider sequentially scanning data.
  • Assume the file has slightly more pages than
    there are free frames.
  • Using LRU, each scan will result in reading every
    page of the file from disk.
  • This is referred to as sequential flooding.
  • An alternative is Most Recently Used (MRU) which
    avoids the problem noted above but is
    disadvantageous in other circumstances.
  • In practice most systems use some variant of LRU.
  • In DB2 a page can be specified as "hated", in
    which case it becomes the next candidate for
    replacement.
  • DB2 also applies MRU for some operations.

21
DBMS vs. OS Buffer Management
  • There are similarities between OS virtual memory
    and DBMS buffer management.
  • Both have the goal of accessing more data than
    will fit in main memory.
  • Both bring pages from disk to main memory as
    needed and replace unneeded pages.
  • A DBMS cannot be built using the virtual memory
    capability of the OS because
  • A DBMS can often predict the order in which pages
    will be accessed (page reference patterns) more
    accurately than a typical OS.
  • Some DBMS operations (like sequential scans or
    some implementations of relational algebra
    operators) have a pattern of page accesses.
  • These patterns allow for a better choice of pages
    to replace and allow for prefetching of pages.
  • The BM anticipates the next several page requests
    and fetches them before they are requested (given
    enough buffer space).
  • This may be done concurrently with CPU use and
    may be able to take advantage of the pages being
    stored contiguously.
  • A DBMS also needs more control over when a page
    is written to disk.

22
Files and Indices
  • We have been ignoring the representation and
    storage details of files.
  • A file (of records) may be on several pages.
  • These pages need to be organized as a file.
  • Each record is given a record id (or rid).
  • The basic file structure is a heap file.
  • Heap files store records in random order.
  • Heap files support creating and destroying files,
    inserting records, deleting records (given their
    rids), getting records (given their rids), and
    scanning all the records in a file.
  • They support retrieval of individual records by
    rid.
  • Auxiliary data structures can support retrieval
    of records matching a condition; these data
    structures are called indices.
  • To support scans the pages in a heap file must be
    tracked.
  • In addition the pages that contain free space
    should be recorded.

23
Maintaining Heap Files
  • Consider two ways to maintain information
    required for heap files.
  • Linked list of pages:
  • Keep track of which pages have free space.
  • Use a doubly linked list of pages with free space
    and
  • a doubly linked list of full pages.
  • Both lists are connected to (the same) header page.
  • But if records are of variable length, most pages
    will be on the free list, making it costly to
    find a page with enough room for an insertion.
  • Directory of pages:
  • Each directory entry identifies a page (or a
    sequence of pages).
  • The entries are kept in data page order and
    record:
  • either a bit indicating whether the page is full,
    or
  • a count showing the amount of free space.
  • In the latter case discarding a page for an
    insert because it has insufficient free space
    does not entail visiting the page.

24
Other File Organizations
  • Sequential files
  • Records are stored in sequential order based on a
    search key value.
  • It is expensive to keep files physically
    sequential.
  • Pointer chains can be used.
  • Clustering files
  • In some cases it may be advantageous to store
    more than one relation in a file.
  • Consider
  • SELECT *
  • FROM Customer C, Dependent D
  • WHERE C.sin = D.sin
  • Each record may reside on a separate block making
    this kind of query expensive.
  • Instead tuples from each relation are clustered
    (on the joining attribute).
  • This requires a variable length record
    organization.
  • Note that this kind of organization is poor for
    selections on one of the relations.

25
Introduction to Indices
  • Sometimes (often) it is necessary to find records
    with particular values.
  • A heap file allows a record to be found given its
    rid.
  • It does not help in finding the rids of records
    that meet particular criteria.
  • An index is a data structure that helps find
    records that meet selection criteria.
  • An index works for certain searches only.
  • An index on customer name does not help to find
    records related to a particular policy.
  • Each index has a search key (not to be confused
    with a primary key).
  • The search key is a field (or fields) on which
    the index is built.
  • An index is designed to speed up equality or
    range selections on the search key.
  • An index file contains entries with the rid and
    data about the search key.
  • There are several types of organizations for
    indices (called access methods). They include
  • B+ trees
  • Hash based indices

26
Page Formats: Fixed-Length Records
  • How are records organized on a page?
  • Fixed Length Records
  • If all records are of the same length new records
    can be inserted at the end of the file or where
    an old record has been deleted.
  • There are two organizations based on how records
    are deleted.
  • Packed: store records consecutively; when a
    record is deleted, move the last record on the
    page into its location.
  • This allows the ith record on the page to be
    found easily (an offset calculation).
  • All empty space appears at the bottom of the
    page.
  • This method causes problems if there are external
    references to the moved record, because the rid
    includes the slot number, which changes.
  • Unpacked: use a bit array, where each bit
    represents the space for one record.
  • Free space can be found by scanning the bit array
    for bits that are off (both organizations are
    sketched at the end of this slide).
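  • A minimal sketch of the two organizations; records
    are simplified to arbitrary Python values:

    class PackedPage:
        """Packed: records stored consecutively; deleting moves the last
        record into the hole, so the slot number of that record changes."""

        def __init__(self):
            self.records = []

        def insert(self, rec) -> int:
            self.records.append(rec)
            return len(self.records) - 1      # slot number of the new record

        def delete(self, slot) -> None:
            self.records[slot] = self.records[-1]
            self.records.pop()

    class UnpackedPage:
        """Unpacked: a bit array marks which slots currently hold records."""

        def __init__(self, num_slots):
            self.slots = [None] * num_slots
            self.in_use = [False] * num_slots

        def insert(self, rec) -> int:
            slot = self.in_use.index(False)   # scan the bit array for a free slot
            self.slots[slot], self.in_use[slot] = rec, True
            return slot

        def delete(self, slot) -> None:
            self.in_use[slot] = False         # the slot simply becomes free again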

27
Page Formats: Variable-Length Records
  • Problems with variable length records.
  • If a new record is bigger than the space left by
    an old record, it will not fit.
  • If a new record is smaller than an old record,
    space is wasted.
  • To resolve these problems we need to move records
    on the page so that all the data is contiguous.
  • Maintain a directory of slots for each page
    (a sketch follows at the end of this slide).
  • There is an entry for each record, which contains
  • the record offset (a pointer to the record) and
  • the record length.
  • When a record is moved within the page, its slot
    in the directory does not change (it is still the
    ith entry), so there is no impact on the rid.
  • A pointer to the start of free space is required.
  • Note that if a record is deleted from the middle
    of the page its directory entry must be retained
    (its offset can be set to -1 to indicate that it
    has been deleted).
  • A variation is to include the record length in
    the first bytes of the record.
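  • A minimal sketch of such a page directory (a "slotted
    page"); the sizes and names are assumptions:

    class SlottedPage:
        """Directory of (offset, length) entries plus a free-space pointer.
        Slot numbers never change, so rids remain valid when records move."""

        def __init__(self, size=4096):
            self.data = bytearray(size)
            self.slots = []          # (offset, length); offset == -1 means deleted
            self.free_start = 0      # pointer to the start of free space

        def insert(self, record: bytes) -> int:
            offset = self.free_start
            self.data[offset:offset + len(record)] = record
            self.free_start += len(record)
            self.slots.append((offset, len(record)))
            return len(self.slots) - 1                 # slot number used in the rid

        def get(self, slot: int) -> bytes:
            offset, length = self.slots[slot]
            if offset == -1:
                raise KeyError("record deleted")
            return bytes(self.data[offset:offset + length])

        def delete(self, slot: int) -> None:
            self.slots[slot] = (-1, 0)                 # entry retained, marked deleted

        def compact(self) -> None:
            """Slide live records together so that free space is contiguous."""
            new_data, pos = bytearray(len(self.data)), 0
            for i, (offset, length) in enumerate(self.slots):
                if offset != -1:
                    new_data[pos:pos + length] = self.data[offset:offset + length]
                    self.slots[i] = (pos, length)
                    pos += length
            self.data, self.free_start = new_data, pos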

28
Record Formats
  • How are fields organized within a record?
  • Fixed Length Records
  • Each field has a fixed length and the number of
    fields is also fixed.
  • Such fields can be stored consecutively.
  • Given the address of a record the address of a
    particular field can be determined.
  • Information about field length will be stored in
    the system catalogue.
  • Variable Length Records
  • Note that in the relational model each record in
    a relation contains the same number of fields.
  • Storage methods:
  • Use a delimiting character between fields.
  • Keep an array of integer offsets at the start of
    each record (including an offset to the end of
    the record). This method is usually better (a
    sketch follows at the end of this slide).
  • Fixed-length records may also be stored in these
    ways.
  • Variable length record issues.
  • Modifying a record may change its size.
  • If a record's size increases sufficiently, it may
    be necessary to move it to a new page. If so, a
    forwarding address may be required.
  • A record may grow so large that it does not fit
    on a (whole) page. If so it has to be broken
    apart and connected by pointers.
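  • A minimal sketch of the offset-array format; the
    field values and the 4-byte offset width are
    assumptions for illustration:

    import struct

    def encode_record(fields: list[bytes]) -> bytes:
        """Store an array of offsets (one per field plus the end of the
        record) ahead of the field data, so any field can be located
        with two array lookups instead of a scan."""
        header_len = 4 * (len(fields) + 1)
        offsets, pos = [], header_len
        for f in fields:
            offsets.append(pos)
            pos += len(f)
        offsets.append(pos)                      # offset of the end of the record
        return struct.pack(f"<{len(offsets)}I", *offsets) + b"".join(fields)

    def get_field(record: bytes, num_fields: int, i: int) -> bytes:
        offsets = struct.unpack_from(f"<{num_fields + 1}I", record)
        return record[offsets[i]:offsets[i + 1]]

    rec = encode_record([b"Jones", b"", b"Toronto"])   # an empty field costs no bytes
    assert get_field(rec, 3, 2) == b"Toronto"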