1
Storing Data: Disks and Files
  • Overview
  • Types of memory
  • Magnetic disks
  • RAID
  • Striping and redundancy
  • RAID levels
  • Disk Space Management
  • Buffer Management
  • Buffer management details
  • Replacement Policies
  • Files and Indices
  • Organizing pages into files: heap files
  • Other file organizations
  • Indices
  • Page Formats
  • Fixed length records
  • Variable length records
  • Record Formats

2
Accessing Data: Overview
  • When a query is processed, data needs to be
    retrieved from storage.
  • Data is stored on devices such as disks.
  • The disk space manager (DSM) keeps track of
    available disk space.
  • The file manager asks the DSM to find or
    release disk space.
  • Disk space is tracked in pages (typically 4 or 8
    KB).
  • When a record is required, it must be fetched from
    disk to main memory.
  • The page is found by the file manager.
  • A request for the page is issued to the buffer
    manager.
  • The buffer manager fetches the page from disk to
    the buffer in main memory and informs the file
    manager of its location.
  • The above process has to be performed as
    efficiently as possible.

3
Memory
  • Memory forms a hierarchy, with the fastest and
    most expensive at the top and the slowest and
    cheapest at the bottom.
  • Primary memory (volatile)
  • Main memory.
  • Cache.
  • Secondary memory (non-volatile)
  • Magnetic disk.
  • Tertiary memory (non-volatile)
  • Tape (sequential access).
  • CD / DVD.
  • Usually used as backup.
  • If main memory is much faster, why not store a DB
    there?
  • Main memory currently costs about 100 times as
    much as disk (per MB).
  • Buying enough main memory to store a DB would be
    very expensive.
  • On a 32-bit system only 2^32 bytes can be directly
    referenced.
  • Data must be maintained between executions, which
    requires non-volatile memory.

4
Magnetic Disks
  • Support direct access.
  • Data is stored on disk blocks.
  • A contiguous sequence of bytes.
  • The unit in which data is written to and read
    from a disk.
  • Blocks are arranged in tracks on platters.
  • Tracks are concentric rings and can be recorded
    on one or both sides of a platter.
  • Platters are therefore single or double-sided.
  • The set of tracks with the same diameter is
    referred to as a cylinder.
  • A cylinder therefore contains one track per
    platter surface.
  • Each track is divided into arcs called sectors.
  • The size of a disk block is some multiple of the
    sector size.
  • A disk head array reads / writes blocks.
  • There is a disk head for each surface but the
    heads are moved as a unit.
  • To read or write a block a disk head must be over
    it.
  • Only one disk head can read or write at a time.

5
Hard Drive Structure
(Figure: platters and the disk head array)
  • The disk head array moves as a unit.
  • It can only access one track on the surface of
    one platter.
  • Data is read by moving the disk head array to the
    appropriate track and waiting for the data to
    spin underneath the head.

6
Accessing Disks
  • Direct access to any location in main memory
    takes the same time (approximately).
  • Access time for a disk block is given by
  • seek time + rotational delay + transfer time
    (a worked example appears at the end of this slide).
  • Seek time is the time taken to move the disk
    heads to the correct track.
  • Rotational delay is the waiting time for the
    desired block to rotate under the disk head.
  • Transfer time is the time to actually read or
    write the data in the block.
  • Access time is therefore affected by how data is
    stored on a disk.
  • Related data should be stored close to each
    other.
  • In order of closeness: records on the same block.
  • Adjacent blocks.
  • Same track.
  • Same cylinder, different platters.
  • Adjacent cylinders.
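  • A minimal sketch of this formula in Python, using
    illustrative drive parameters that are assumptions
    rather than values from the slides:

    # Rough model of disk access time:
    #   seek time + rotational delay + transfer time.
    # The drive parameters below are illustrative assumptions.
    AVG_SEEK_MS = 9.0          # average seek time (ms)
    RPM = 7200                 # spindle speed
    TRANSFER_MB_PER_S = 150.0  # sustained transfer rate

    def access_time_ms(block_kb: float) -> float:
        """Estimated time to read one block, in milliseconds."""
        rotational_delay_ms = 0.5 * (60_000 / RPM)   # half a revolution on average
        transfer_ms = block_kb / 1024 / TRANSFER_MB_PER_S * 1000
        return AVG_SEEK_MS + rotational_delay_ms + transfer_ms

    print(f"8 KB block: {access_time_ms(8):.2f} ms")  # dominated by seek and rotation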

7
Redundant Arrays of Independent Disks (RAID)
  • Disks are bottlenecks for processing.
  • They have much greater access times than main
    memory.
  • They have relatively high failure rates.
  • A disk array consists of several disks arranged
    to increase performance and improve reliability.
  • Performance is improved by data striping.
  • Data is divided into (equal-size) partitions
    called striping units.
  • The striping units are distributed over the disks
    using a round robin algorithm.
  • The disks can be read in parallel so D (the
    number of disks) blocks can be read at one time.
  • Reliability is improved through redundancy.
  • Disk arrays that use data striping and redundancy
    are called RAID. There are several RAID
    organizations, called levels.

8
Bit and Block Striping
  • A file is made up of a sequence of bits. Let
    these bits be numbered sequentially starting with
    1.
  • Assume that a block consists of 8 bits.
  • Assume that there is a 4-disk RAID array.
  • Block striping assigns whole blocks to the disks
    in round-robin order.
  • Bit striping spreads the bits of each block across
    the disks.
  • Both mappings are sketched below.
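  • A minimal sketch of the two mappings for this 4-disk
    example, assuming striping units are assigned round
    robin (as on the previous slide):

    D = 4            # number of disks
    BLOCK_BITS = 8   # bits per block, as in the example

    def block_striping_disk(block_num: int) -> int:
        """Block striping: whole block i goes to disk i mod D."""
        return block_num % D

    def bit_striping_disk(bit_num: int) -> int:
        """Bit striping: bit i (numbered from 1) goes to disk (i - 1) mod D."""
        return (bit_num - 1) % D

    # Block striping: blocks 0..7 land on disks 0,1,2,3,0,1,2,3.
    print([block_striping_disk(b) for b in range(8)])
    # Bit striping: the 8 bits of one block are spread over all 4 disks.
    print([bit_striping_disk(i) for i in range(1, BLOCK_BITS + 1)])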

9
RAID Redundancy
  • Increasing the number of disks increases
    performance but decreases reliability.
  • If the Mean Time to Failure (MTTF) of one disk is
    50,000 hours (about 5.7 years), then the MTTF of an
    array of 100 disks is only about 21 days
    (5.7 × 365 ≈ 2,081 days; 2,081 / 100 ≈ 21 days).
  • Storing redundant data allows that data to be
    used to reconstruct the data on the failed disk.
  • Where is the redundant data to be stored?
  • On check disks reserved for redundant data or
  • Distributed uniformly over all disks.
  • How is the redundant information computed?
  • Using a parity scheme. For each bit position on the
    data disks there is a parity bit on the check disk.
    If the sum of the bits in that position across the
    data disks is even, the parity bit is set to zero;
    if it is odd, it is set to one. The data on any one
    failed disk can then be reconstructed bit by bit
    (see the sketch at the end of this slide).
  • In RAID the disk array is partitioned into
    reliability groups. A reliability group consists
    of a set of data disks and check disks. The
    number of check disks depends on the reliability
    level chosen.
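  • A minimal sketch of the parity scheme, expressing the
    even/odd rule as XOR over corresponding bits and
    reconstructing one failed disk from the survivors:

    from functools import reduce

    def parity(blocks: list[int]) -> int:
        """XOR of corresponding bits: the parity bit is 1 exactly
        when the sum of the data bits in that position is odd."""
        return reduce(lambda a, b: a ^ b, blocks, 0)

    data = [0b10110010, 0b01101100, 0b11100001]   # three data disks, one block each
    check = parity(data)                          # block on the check disk

    # Disk 1 fails: recompute its block from the other disks plus the check disk.
    recovered = parity([data[0], data[2], check])
    assert recovered == data[1]
    print(f"recovered block: {recovered:08b}")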

10
RAID Levels
  • Level 0: Nonredundant
  • Uses data striping but does not record redundant
    data.
  • Cheap, but reliability is a problem since MTTF
    decreases with the number of disks.
  • Highest write efficiency (no redundant data has
    to be written).
  • Level 1: Mirrored
  • Maintains an identical copy of each disk.
  • Very expensive.
  • Each write involves two disks, and the two copies
    are not written simultaneously in case a system
    failure occurs during writing.
  • No striping so transfer time is comparable to
    that of a single disk (high).
  • Allows parallel reads of the blocks that
    conceptually reside on the same disk.
  • Level 0+1 (10): Mirroring and Striping
  • Combines the data striping from level 0 and the
    redundancy from level 1.

11
More RAID levels
  • Level 2: Error-correcting codes
  • Uses the Hamming code for the redundancy scheme.
  • The Hamming code allows recovery from single-disk
    failure and identifies the failed disk.
  • The number of check disks grows logarithmically
    with the number of data disks.
  • Striping is at the bit level.
  • The smallest unit of transfer is therefore D
    blocks. This level is therefore good for
    workloads with many large requests but bad for
    small requests.
  • A write of a block involves reading D blocks into
    memory, modifying them, and writing D + C blocks
    (where C is the number of check disks).
  • Level 3: Bit-interleaved parity
  • While Hamming codes can identify the failed disk,
    that task can easily be performed by the disk
    controller, so level 2 stores more redundant data
    than is necessary.
  • Level 3 uses a single check disk with parity
    information and bit level striping.
  • Performance is similar to level 2.

12
Even More RAID Levels
  • Level 4: Block-interleaved parity
  • Level 4 uses a striping unit of a block.
  • This means that read requests of the size of a
    disk block can be served entirely by the disk
    where the block resides.
  • Large read requests can still use the aggregate
    bandwidth of the D disks.
  • Writing a block involves only one data disk and
    the check disk (the new parity can be calculated
    from the existing parity and the difference
    between the new and old data blocks; see the
    sketch at the end of this slide).
  • Only one check disk is ever required.
  • Level 5: Block-interleaved distributed parity
  • Improves on level 4 by distributing the parity
    blocks uniformly over all the disks rather than
    storing them all on one check disk.
  • This allows several write requests to be processed
    in parallel (since the bottleneck of a single check
    disk has been removed).
  • Read requests have greater parallelism since all
    disks are involved (though proportionally this
    diminishes as the number of disks increases).
  • This level has the best performance of all RAID
    levels for small and large reads and large writes.
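  • A minimal sketch of the level 4/5 small-write rule:
    the new parity is the old parity XORed with the
    difference (XOR) of the old and new data blocks, so
    only one data disk and the parity block are touched:

    def updated_parity(old_parity: int, old_block: int, new_block: int) -> int:
        """New parity from the existing parity and the change in one block."""
        return old_parity ^ old_block ^ new_block

    blocks = [0b1010, 0b0110, 0b0001]             # blocks on three data disks
    parity = blocks[0] ^ blocks[1] ^ blocks[2]    # their parity block

    new_block = 0b1111                            # overwrite the block on disk 1
    parity = updated_parity(parity, blocks[1], new_block)
    blocks[1] = new_block
    assert parity == blocks[0] ^ blocks[1] ^ blocks[2]   # disks 0 and 2 never read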

13
And More RAID Levels
  • Level 6: P+Q Redundancy
  • What happens if a disk fails and another one
    fails before the first has been replaced (or
    while it is being replaced)?
  • Level 6 uses Reed-Solomon codes.
  • Reed-Solomon codes allow for recovery from two
    simultaneous disk failures.
  • Level 6 requires two check disks but the parity
    data is distributed similarly to level 5.
  • For small writes the read-modify-write process is
    significantly less efficient than in level 5 (two
    check blocks must be updated rather than one).
  • Which RAID level?
  • Level 0 improves performance at the lowest cost
    but does not improve reliability.
  • Level 0+1 is better than level 1 and has the
    best write performance.
  • Levels 2 and 4 are always inferior to 3 and 5.
  • Level 3 is good for large transfer requests of
    several contiguous blocks but bad for many small
    requests of a single disk block.
  • Level 5 is a good general-purpose solution.
  • Level 6 is appropriate if higher reliability is
    required.

14
Managing Data
  • Managing files (in secondary memory)
  • Disk space manager is responsible.
  • File storage
  • How are the records organized?
  • How are records stored on a page?
  • How are fields stored in a record?
  • Managing data in main memory
  • The buffer manager is responsible.
  • When main memory is full frames have to be
    replaced.
  • What replacement policies are there?
  • Which policy is best?

15
Disk Space Management
  • The lowest level of the DBMS architecture is the
    disk space manager (DSM) which manages disk space
    (really!).
  • It supports the allocation and deallocation of
    pages in disk.
  • A page is an abstract unit of storage.
  • Each page is mapped to a disk block.
  • This mapping allows reading and writing a page
    to be done in one disk I/O.
  • It may be useful to allocate a sequence of pages
    as a contiguous sequence of blocks to hold data
    that is often accessed sequentially.
  • The DSM hides the details of data storage so that
    higher levels can manipulate data as a collection
    of pages, rather than by referring to the
    underlying blocks.

16
Tracking Free Blocks
  • The DSM keeps track of which blocks are free and
    the mapping of pages to blocks.
  • While blocks may be initially allocated
    sequentially, continual allocation and
    deallocation will create holes.
  • Recording free blocks:
  • Use a linked list. A pointer to the first free
    block is kept in a known location.
  • Maintain a bitmap with one bit for each block,
    indicating whether the block is in use. This allows
    for fast allocation of contiguous areas on the disk
    (see the sketch at the end of this slide).
  • Using the Operating System (OS)
  • OSs manage disk space and support the abstraction
    of a file as a sequence of bytes.
  • A DSM could be built using OS files.
  • In practice this is often not done.
  • Using a particular OS makes a DBMS less portable.
  • Using the OS also imposes technical limitations
    (such as file size).
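  • A minimal sketch of a free-block bitmap, with a
    hypothetical allocate_contiguous helper to show why
    bitmaps make contiguous allocation easy:

    class FreeSpaceBitmap:
        """One bit per disk block: False = free, True = in use (simplified)."""

        def __init__(self, num_blocks: int):
            self.used = [False] * num_blocks

        def allocate_contiguous(self, n: int) -> int:
            """Claim n contiguous free blocks; return the first block number."""
            run_start, run_len = 0, 0
            for i, used in enumerate(self.used):
                if used:
                    run_start, run_len = i + 1, 0
                else:
                    run_len += 1
                    if run_len == n:
                        for j in range(run_start, run_start + n):
                            self.used[j] = True
                        return run_start
            raise RuntimeError("no contiguous run of that size")

        def free(self, start: int, n: int) -> None:
            for j in range(start, start + n):
                self.used[j] = False

    bm = FreeSpaceBitmap(16)
    first = bm.allocate_contiguous(4)   # e.g. pages for a sequentially scanned file
    bm.free(first, 4)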

17
The Buffer Manager
  • The buffer manager (BM) is responsible for
    bringing pages from disk to main memory.
  • Main memory is partitioned into a collection of
    pages called the buffer pool. These pages are
    referred to as frames.
  • If the buffer is full, old pages have to be
    replaced with new pages, hence a replacement
    policy is required.
  • For example, a DBMS may contain 1,000,000 pages
    of data but there may only be 1,000 pages of main
    memory.
  • Higher levels of the DBMS can request data, the
    BM deals with the details.
  • Note that the BM must be informed if a page is no
    longer needed so that it can be replaced.
  • The BM must also be informed if a page has been
    modified so that the change can be made to the
    copy on disk.

18
BM Statistics
  • The BM keeps track of two variables for each
    frame in the buffer:
  • pin-count: the number of times the frame's page
    has been requested but not released, i.e. the
    number of current users of the page.
  • dirty: indicates whether the page has been
    modified since it was brought into the buffer.
  • When a page is requested the BM (see the sketch at
    the end of this slide):
  • Checks whether the page is in the buffer. If it
    is, it increments pin-count. Otherwise it:
  • Chooses a frame to replace (using the replacement
    policy).
  • If the dirty bit of that frame is on, writes the
    page being replaced to disk.
  • Reads the requested page into the replacement
    frame and increments pin-count (to 1).
  • Returns the address of the frame.
  • Incrementing pin-count for a frame is called
    pinning. Decrementing it is referred to as
    unpinning.
  • A frame is only available for replacement if its
    pin-count is zero.
  • If there is no frame with pin-count of zero the
    BM must wait or abort the transaction.
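  • A minimal sketch of this pin/unpin protocol; the
    names and the disk_read, disk_write and choose_victim
    helpers are assumptions for illustration, not part of
    any particular DBMS:

    class Frame:
        def __init__(self):
            self.page_id = None
            self.pin_count = 0
            self.dirty = False
            self.referenced = False     # used by the clock policy on the next slide
            self.data = None

    class BufferManager:
        def __init__(self, num_frames, disk_read, disk_write, choose_victim):
            self.frames = [Frame() for _ in range(num_frames)]
            self.page_table = {}        # page_id -> frame holding that page
            self.disk_read, self.disk_write = disk_read, disk_write
            self.choose_victim = choose_victim

        def pin(self, page_id):
            frame = self.page_table.get(page_id)
            if frame is None:                               # page not in the buffer
                frame = self.choose_victim(self.frames)     # must have pin_count == 0
                if frame is None:
                    raise RuntimeError("all frames pinned") # wait or abort in practice
                if frame.dirty:                             # write back before reuse
                    self.disk_write(frame.page_id, frame.data)
                if frame.page_id is not None:
                    del self.page_table[frame.page_id]
                frame.page_id, frame.data = page_id, self.disk_read(page_id)
                frame.dirty = False
                self.page_table[page_id] = frame
            frame.pin_count += 1                            # pinning
            return frame

        def unpin(self, page_id, dirty):
            frame = self.page_table[page_id]
            frame.pin_count -= 1                            # unpinning
            frame.dirty = frame.dirty or dirty
            if frame.pin_count == 0:
                frame.referenced = True                     # for clock replacement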

19
Buffer Replacement Policies
  • Least recently used (LRU).
  • Use a queue (FIFO) to keep track of frames with
    pin-count equal to zero.
  • When a frame's pin-count is decremented to zero,
    add it to the queue.
  • Replace the frame at the head of the queue.
  • Clock replacement.
  • A variant of LRU (with less overhead).
  • Uses a variable, current, with value from 1 to N
    where N is the number of buffer pages.
  • Each frame has an associated referenced bit that
    is turned on when its pin-count first reaches
    zero.
  • The process works as follows (sketched at the end
    of this slide):
  • Consider the current frame for replacement.
  • If it is not a candidate (its pin-count is greater
    than zero), increment current.
  • If the current frame has referenced turned on,
    turn referenced off and increment current (its
    pin-count must be zero).
  • If the current frame has referenced off and a
    pin-count of zero, replace it.
  • If current is incremented past N, wrap it around
    to 1.
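  • A minimal sketch of the clock policy; it assumes each
    frame carries pin_count and referenced attributes (as
    in the buffer manager sketch on the previous slide):

    class ClockPolicy:
        def __init__(self, num_frames):
            self.current = 0
            self.n = num_frames

        def choose_victim(self, frames):
            for _ in range(2 * self.n):                     # at most two full sweeps
                frame = frames[self.current]
                self.current = (self.current + 1) % self.n  # advance the hand, wrapping
                if frame.pin_count > 0:
                    continue                                # pinned: not a candidate
                if frame.referenced:
                    frame.referenced = False                # second chance, move on
                    continue
                return frame                                # unpinned, referenced off
            return None                                     # every frame is pinned

  • An instance's choose_victim could be passed to the
    buffer manager sketch on the previous slide as its
    replacement policy.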

20
Buffer Replacement Discussion
  • The LRU and clock replacement schemes are fair
    schemes.
  • They are not always the best strategies for a DB
    system.
  • Consider sequentially scanning data.
  • Assume the file has slightly more pages than
    there are free frames.
  • Using LRU, each scan will result in reading every
    page of the file from disk.
  • This is referred to as sequential flooding.
  • An alternative is Most Recently Used (MRU) which
    avoids the problem noted above but is
    disadvantageous in other circumstances.
  • In practice most systems use some variant of LRU.
  • In DB2 a page can be specified as "hated", in
    which case it becomes the next candidate for
    replacement.
  • DB2 also applies MRU for some operations.

21
DBMS vs. OS Buffer Management
  • There are similarities between OS virtual memory
    and DBMS buffer management.
  • Both have the goal of accessing more data than
    will fit in main memory.
  • Both bring pages from disk to main memory as
    needed and replace unneeded pages.
  • A DBMS cannot be built using the virtual memory
    capability of the OS because
  • A DBMS can often predict the order in which pages
    will be accessed (page reference patterns) more
    accurately than a typical OS.
  • Some DBMS operations (like sequential scans or
    some implementations of relational algebra
    operators) have a pattern of page accesses.
  • These patterns allow for a better choice of pages
    to replace and allow for prefetching of pages.
  • The BM anticipates the next several page requests
    and fetches them before they are requested (given
    enough buffer space).
  • This may be done concurrently with CPU use and
    may be able to take advantage of the pages being
    stored contiguously.
  • A DBMS also needs more control over when a page
    is written to disk.

22
Files and Indices
  • We have been ignoring the representation and
    storage details of files.
  • A file (of records) may be on several pages.
  • These pages need to be organized as a file.
  • Each record is given a record id (or rid).
  • The basic file structure is a heap file.
  • Heap files store records in random order.
  • Heap files support creating and destroying files,
    inserting records, deleting records (given their
    rids), getting records (given their rids), and
    scanning all the records in a file.
  • They support retrieval of individual records by
    rid.
  • Auxiliary data structures can support retrieval
    of records matching a condition; these data
    structures are called indices.
  • To support scans the pages in a heap file must be
    tracked.
  • In addition the pages that contain free space
    should be recorded.

23
Maintaining Heap Files
  • Consider two ways to maintain information
    required for heap files.
  • Linked list of pages:
  • Keep track of which pages have free space.
  • Use a doubly linked list of pages with free space
    and
  • a doubly linked list of full pages.
  • Both lists are connected to (the same) header page.
  • But if records are of variable length, most pages
    will be on the free list, making it costly to
    find a page with enough room for an insertion.
  • Directory of pages:
  • Each directory entry identifies a page (or a
    sequence of pages).
  • The entries are kept in data page order and
    record:
  • either a bit indicating whether the page is full,
    or
  • a count showing the amount of free space.
  • In the latter case discarding a page for an
    insert because it has insufficient free space
    does not entail visiting the page.

24
Other File Organizations
  • Sequential files
  • Records are stored in sequential order based on a
    search key value.
  • It is expensive to keep files physically
    sequential.
  • Pointer chains can be used.
  • Clustering files
  • In some cases it may be advantageous to store
    more than one relation in a file.
  • Consider
  • SELECT *
  • FROM Customer C, Dependent D
  • WHERE C.sin = D.sin
  • Each record may reside on a separate block making
    this kind of query expensive.
  • Instead tuples from each relation are clustered
    (on the joining attribute).
  • This requires a variable length record
    organization.
  • Note that this kind of organization is poor for
    selections on one of the relations.

25
Introduction to Indices
  • Sometimes (often) it is necessary to find records
    with particular values.
  • A heap file allows a record to be found given its
    rid.
  • It does not help in finding the rids of records
    that meet particular criteria.
  • An index is a data structure that helps find
    records that meet selection criteria.
  • An index works for certain searches only.
  • An index on customer name does not help to find
    records related to a particular policy.
  • Each index has a search key (not to be confused
    with a primary key).
  • The search key is a field (or fields) on which
    the index is built.
  • An index is designed to speed up equality or
    range selections on the search key.
  • An index file contains entries with the rid and
    data about the search key.
  • There are several types of organizations for
    indices (called access methods). They include
  • B+ trees
  • Hash based indices

26
Page Formats: Fixed-Length Records
  • How are records organized on a page?
  • Fixed Length Records
  • If all records are of the same length new records
    can be inserted at the end of the file or where
    an old record has been deleted.
  • There are two organizations based on how records
    are deleted.
  • Packed: store records consecutively; when a
    record is deleted, move the last record on the
    page into its location.
  • This allows the ith record on the page to be
    found easily (an offset calculation).
  • All empty space appears at the bottom of the
    page.
  • This method causes problems if there are external
    references to the moved record, because the rid
    includes the slot number, which changes.
  • Unpacked: use a bit array, where each bit
    represents the space for one record.
  • Free space can be found by scanning the bit array
    for bits that are off (both organizations are
    sketched at the end of this slide).
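  • A minimal sketch of the two organizations; records
    are simplified to arbitrary Python values:

    class PackedPage:
        """Packed: records stored consecutively; deleting moves the last
        record into the hole, so the slot number of that record changes."""

        def __init__(self):
            self.records = []

        def insert(self, rec) -> int:
            self.records.append(rec)
            return len(self.records) - 1      # slot number of the new record

        def delete(self, slot) -> None:
            self.records[slot] = self.records[-1]
            self.records.pop()

    class UnpackedPage:
        """Unpacked: a bit array marks which slots currently hold records."""

        def __init__(self, num_slots):
            self.slots = [None] * num_slots
            self.in_use = [False] * num_slots

        def insert(self, rec) -> int:
            slot = self.in_use.index(False)   # scan the bit array for a free slot
            self.slots[slot], self.in_use[slot] = rec, True
            return slot

        def delete(self, slot) -> None:
            self.in_use[slot] = False         # the slot simply becomes free again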

27
Page Formats: Variable-Length Records
  • Problems with variable length records.
  • If a new record is bigger than the space left by
    an old record, it will not fit.
  • If a new record is smaller than an old record,
    space is wasted.
  • To resolve these problems we need to move records
    on the page so that all the data is contiguous.
  • Maintain a directory of slots for each page
    (a sketch follows at the end of this slide).
  • There is an entry for each record, which contains
  • the record offset (a pointer to the record) and
  • the record length.
  • When a record is moved within the page, its slot
    in the directory does not change (it is still the
    ith entry), so there is no impact on the rid.
  • A pointer to the start of free space is required.
  • Note that if a record is deleted from the middle
    of the page its directory entry must be retained
    (its offset can be set to -1 to indicate that it
    has been deleted).
  • A variation is to include the record length in
    the first bytes of the record.
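  • A minimal sketch of such a page directory (a "slotted
    page"); the sizes and names are assumptions:

    class SlottedPage:
        """Directory of (offset, length) entries plus a free-space pointer.
        Slot numbers never change, so rids remain valid when records move."""

        def __init__(self, size=4096):
            self.data = bytearray(size)
            self.slots = []          # (offset, length); offset == -1 means deleted
            self.free_start = 0      # pointer to the start of free space

        def insert(self, record: bytes) -> int:
            offset = self.free_start
            self.data[offset:offset + len(record)] = record
            self.free_start += len(record)
            self.slots.append((offset, len(record)))
            return len(self.slots) - 1                 # slot number used in the rid

        def get(self, slot: int) -> bytes:
            offset, length = self.slots[slot]
            if offset == -1:
                raise KeyError("record deleted")
            return bytes(self.data[offset:offset + length])

        def delete(self, slot: int) -> None:
            self.slots[slot] = (-1, 0)                 # entry retained, marked deleted

        def compact(self) -> None:
            """Slide live records together so that free space is contiguous."""
            new_data, pos = bytearray(len(self.data)), 0
            for i, (offset, length) in enumerate(self.slots):
                if offset != -1:
                    new_data[pos:pos + length] = self.data[offset:offset + length]
                    self.slots[i] = (pos, length)
                    pos += length
            self.data, self.free_start = new_data, pos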

28
Record Formats
  • How are fields organized within a record?
  • Fixed Length Records
  • Each field has a fixed length and the number of
    fields is also fixed.
  • Such fields can be stored consecutively.
  • Given the address of a record the address of a
    particular field can be determined.
  • Information about field length will be stored in
    the system catalogue.
  • Variable Length Records
  • Note that in the relational model each record in
    a relation contains the same number of fields.
  • Storage methods:
  • Use a delimiting character between fields.
  • Keep an array of integer offsets at the start of
    each record (including an offset to the end of
    the record). This method is usually better (a
    sketch follows at the end of this slide).
  • Fixed-length records may also be stored in these
    ways.
  • Variable length record issues.
  • Modifying a record may change its size.
  • If a record's size increases sufficiently, it may
    be necessary to move it to a new page. If so, a
    forwarding address may be required.
  • A record may grow so large that it does not fit
    on a (whole) page. If so it has to be broken
    apart and connected by pointers.
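  • A minimal sketch of the offset-array format; the
    field values and the 4-byte offset width are
    assumptions for illustration:

    import struct

    def encode_record(fields: list[bytes]) -> bytes:
        """Store an array of offsets (one per field plus the end of the
        record) ahead of the field data, so any field can be located
        with two array lookups instead of a scan."""
        header_len = 4 * (len(fields) + 1)
        offsets, pos = [], header_len
        for f in fields:
            offsets.append(pos)
            pos += len(f)
        offsets.append(pos)                      # offset of the end of the record
        return struct.pack(f"<{len(offsets)}I", *offsets) + b"".join(fields)

    def get_field(record: bytes, num_fields: int, i: int) -> bytes:
        offsets = struct.unpack_from(f"<{num_fields + 1}I", record)
        return record[offsets[i]:offsets[i + 1]]

    rec = encode_record([b"Jones", b"", b"Toronto"])   # an empty field costs no bytes
    assert get_field(rec, 3, 2) == b"Toronto"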