Class 7: Data Persistence - PowerPoint PPT Presentation

Provided by: glenndo


1
Class 7: Data Persistence
2
Overview of Class
  • Review of Class Project Progress
  • Introduction to Storage Retrieval
  • Introduction to Memory Management
  • SMP, MPP, NUMA

3
Motivation for Studying Data Persistence
  • To provide an understanding of how data storage
    management or data persistence relates to
    database design
  • To provide the designer with an opportunity to
    seek improvements in database performance.

4
What is Data Persistence?
  • Data in a system can exist either as an
    instance, that is, temporarily, or in a state of
    persistence, that is, permanently stored.
  • It is this state of permanence that we examine by
    understanding how storage works and is managed.

5
Data Storage Management
  • The Storage Management layer provides the base
    layer upon which the Architectural and
    Programming layers operate.

6
Data Storage Products
  • The majority of commercially available database
    products use twelve basic techniques to implement
    the storage management layer.

7
Data Storage Techniques
  • These techniques originate from the basic
    nuts-and-bolts origins of data processing and
    will be easily recognized by individuals trained
    in traditional structured programming
    methodologies.

8
These twelve techniques are
  • Hashing
  • Table Indexes
  • Memory Resident Tables
  • Pointing
  • Joins
  • RAID
  • Blocking
  • Buffering
  • Compression
  • Clustering
  • Sequential Storage
  • Indexing

9
Architecture and Storage Differences
  • Understanding a database's architecture does not
    automatically imply an understanding of the
    database's storage management implementation.
  • Essentially, the architecture profile is
    concerned with logical design, while the storage
    profile is concerned with physical design.

10
Performance Criteria
  • Computer power for the purpose of database
    performance can be measured by three criteria as
    follows
  • Memory
  • Processing Speed
  • Input/Output Speed

11
MEMORY
  • Memory can be used for two basic purposes.
  • To store instructions
  • To store data that is in use
  • The process of tuning is one of balancing these
    two, often conflicting, roles.

12
PROCESSING SPEED
  • Processing speed is simply a measure of how many
    instructions the CPU can process each second
    assuming no other bottlenecks in the process
    chain.

13
I/O SPEED
  • I/O speed is always slower than processing speed
    for the simple fact that the laws of physics
    limit the components in a physical storage
    device.
  • I/O speed is typically 10 to 100 times slower
    than processing speed.

14
I/O Bound
  • Systems are typically I/O bound, meaning the
    speed of the I/O system typically sets the pace
    at which a system can operate.

15
Resolving the I/O Bottleneck
  • Databases consistently attempt to employ
    techniques that limit the impact of the I/O
    system on the overall system performance.

16
I/O Minimization Techniques
  • These include
  • I/O Management Techniques
  • Blocking
  • Buffering
  • Compression
  • Clustering
  • Sequential Storage

17
I/O Minimization Techniques
  • Continued
  • Data Lookup Capabilities
  • Indexing
  • Hashing
  • Table-indexes
  • Memory Resident Tables
  • Cross-table Capabilities

18
I/O Minimization Techniques
  • Continued
  • Pointing
  • Dynamic Cross File Strategies
  • Explicit Disk Manipulation and Management
  • RAID

19
I/O MINIMIZATION TECHNIQUES
  • Blocking and buffering are performed by all
    computer systems whenever I/O operations are
    involved.
  • Depending on the platform, they may be defined by
    the operating system, by the job control language
    (JCL) that runs the program, or by the program
    itself.
  • Many databases override these settings for their
    own performance improvements.

20
Blocking
  • This is the process of grouping many small
    records into one larger block. This reduces the
    number of reads the computer requires and thereby
    improves system performance.
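As a rough sketch (not from the slides), the read savings from blocking can be shown in a few lines; the record and block sizes here are assumed purely for illustration:

```python
# Sketch: grouping small records into larger blocks to cut read counts.
# RECORD_SIZE and BLOCK_SIZE are invented values for illustration.

RECORD_SIZE = 100      # bytes per record (assumed)
BLOCK_SIZE = 4096      # bytes per block (assumed)

def reads_needed(num_records, records_per_block):
    """Block reads required to scan all records (ceiling division)."""
    return -(-num_records // records_per_block)

unblocked = reads_needed(10_000, 1)                        # one record per read
blocked = reads_needed(10_000, BLOCK_SIZE // RECORD_SIZE)  # 40 records per block

print(unblocked, blocked)   # 10000 vs 250
```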

21
Blocking
  • Blocking groups are also called blocking units,
    pages, control intervals, or record groups.
  • Blocking requires a trade-off between improving
    I/O performance and forcing the reading of
    unnecessary data.

22
Buffering
  • This reduces the amount of time waiting between
    I/O operations by allocating memory space for the
    storage of data blocks anticipated for use.
  • Data buffers are measured in the same units as
    the blocks used to fill them.
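A minimal sketch of this idea, with the simulated record store and block size invented for illustration:

```python
# Sketch: a buffer that fetches whole blocks into memory and serves
# later record requests without further I/O. Illustrative only.

class BufferedReader:
    def __init__(self, records, block_size):
        self.records = records     # simulated "disk"
        self.block_size = block_size
        self.buffer = {}           # block number -> list of records
        self.io_count = 0          # physical reads performed

    def get(self, i):
        block_no = i // self.block_size
        if block_no not in self.buffer:   # buffer miss: one I/O
            start = block_no * self.block_size
            self.buffer[block_no] = self.records[start:start + self.block_size]
            self.io_count += 1
        return self.buffer[block_no][i % self.block_size]

reader = BufferedReader(list(range(1000)), block_size=100)
for i in range(1000):
    reader.get(i)
print(reader.io_count)   # 10 physical reads instead of 1000
```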

23
Blocking and Buffering
  • When evaluating a database's blocking and
    buffering criteria, a number of issues should be
    examined.

24
Blocking and Buffering Issues
  • Are block sizes adjustable or fixed?
  • The ability to modify block size relative to
    record size improves performance.

25
Blocking and Buffering Issues
  • Does the database control buffering or is it left
    to the operating system?
  • Database adjustable buffering improves
    performance options.

26
Blocking and Buffering Issues
  • Can the memory allocation for buffering be
    adjusted?
  • Optimization of memory relative to buffering
    requirements improves performance.

27
Compression
  • Compression allows more data to fit into a block,
    which in turn improves performance.
  • Data is compressed before storage takes place and
    when the data is read it is immediately
    decompressed.
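A small illustration of this compress-before-store, decompress-on-read cycle, using Python's standard zlib as a stand-in for whatever method a given product actually uses:

```python
import zlib

# Sketch: data is compressed before storage and decompressed immediately
# on read. The sample record is invented; zlib stands in for any method.

record = b"customer=ACME;region=WEST;status=ACTIVE;" * 50

stored = zlib.compress(record)       # compressed before storage
restored = zlib.decompress(stored)   # decompressed immediately on read

assert restored == record
print(len(record), len(stored))      # more records now fit per block
```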

28
Important considerations for Compression
  • What compression method is used?
  • There are different compression techniques each
    with their own advantages and disadvantages.

29
Other considerations
  • Will the cost of CPU cycles exceed the I/O
    savings?
  • The CPU work required for compression may inhibit
    any potential performance improvement.

30
Other considerations
  • Is compression optional?
  • Some databases don't give the user a choice.

31
Sequencing
  • Sequencing is the ability of the database to
    store data in the same order as its natural sort.
    This works well when a natural sort exists, but
    it is rudimentary and inflexible.

32
Clustering
  • Clustering is the process of predicting natural
    record groups likely to be processed at the same
    time.
  • Clustering can be very effective if the groupings
    match the I/O activity. If this does not happen
    the performance is likely to be degraded.

33
DATA LOOKUP CAPABILITIES
  • These techniques address methods for finding
    specific data.

34
Indexes
  • Indexes provide a way for a system to find one
    record without having to read through all of
    them.
  • An index lists all the values we may be looking
    for, in alphabetic or numeric order, along with
    the address at which the full data record can be
    found.
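The idea can be sketched directly; the records and key values below are invented for illustration:

```python
# Sketch: an index lists search values in order alongside the address of
# each full record, so one record is found without scanning them all.

records = [("di", "data-3"), ("al", "data-0"),
           ("cy", "data-2"), ("bo", "data-1")]   # "file" by address 0..3
index = sorted((key, addr) for addr, (key, _) in enumerate(records))

def lookup(key):
    for k, addr in index:          # scan the ordered index, not the file
        if k == key:
            return records[addr]   # one direct read of the data record
        if k > key:
            break                  # ordering lets the scan stop early
    return None

print(lookup("cy"))   # ('cy', 'data-2')
```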

35
Various types of indexes include
  • Binary search style indexes
  • B-tree indexes
  • Inverted list indexes
  • Memory resident tables
  • Table-indexes

36
Binary Search
  • The binary search style index is the most
    primitive type of index.
  • In this kind of index a simple lookup list is
    developed and the system sequentially searches
    through it.

37
Binary Search
  • The system knows that common data will be grouped
    together in the index; therefore, when a grouping
    such as a color value reaches its end, the system
    knows to stop searching.
  • Indexing also allows more records to be
    referenced from within a single block because the
    index takes less space.

38
Search Patterns
  • Indexes can be used differently with various
    search patterns, including
  • Simple Index Search Pattern
  • Binary Search Pattern

39
Simple Index Search Pattern
  • A simple index search pattern reads through the
    index sequentially from beginning to end.
  • Combined with strategic blocking this can
    introduce some efficiencies.
  • The average number of reads per search equals the
    length of the index divided by two.

40
Binary Search Pattern
  • A binary search pattern uses a half-life
    algorithm to narrow down to the required
    information.

41
Binary Search Pattern Algorithm
  • The first check is made in the middle of the
    index; if the desired value is higher or lower
    than the checked value, the corresponding half of
    the list is then checked in the same way.
  • This process is repeated until the value is
    found.
  • The maximum number of reads required for a search
    is equal to the number of times the index can be
    divided in half and the average will be less than
    that.
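The algorithm above can be sketched as follows; the index contents are invented, and a read counter is added to show the log2 bound the slide describes:

```python
import math

# Sketch: binary search over a sorted index. Each probe halves the
# remaining range, so reads are bounded by log2 of the index length.

def binary_search(index, target):
    lo, hi, reads = 0, len(index) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        reads += 1                 # one index "read" per probe
        if index[mid] == target:
            return mid, reads
        if index[mid] < target:
            lo = mid + 1           # target lies in the upper half
        else:
            hi = mid - 1           # target lies in the lower half
    return None, reads

index = list(range(1024))
pos, reads = binary_search(index, 700)
print(pos, reads, math.ceil(math.log2(len(index))))   # reads <= 10
```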

42
B-Tree Index
  • The B-tree index is a structure composed of
    multiple layers of simple indexes.

43
B-Tree Index
  • A B-tree structure is simply a collection of
    indexes built on top of a more basic index.
  • This layering can be implemented on more than one
    level if necessary.
  • The B-tree index is the most common indexing
    process used in database products.
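A simplified, static sketch of this layering idea (real B-trees also split and rebalance on insert, which is omitted here; the keys and fanout are invented):

```python
# Sketch: a two-level index, the layering idea behind a B-tree. A small
# top index points into segments of a larger bottom index.

bottom = [(k, k * 10) for k in range(0, 1000, 5)]   # (key, record address)
FANOUT = 20

# Top level: the highest key in each bottom segment, plus where it starts.
top = [(bottom[min(i + FANOUT, len(bottom)) - 1][0], i)
       for i in range(0, len(bottom), FANOUT)]

def lookup(key):
    for high_key, seg_start in top:    # scan the small top index first
        if key <= high_key:            # then only one bottom segment
            for k, addr in bottom[seg_start:seg_start + FANOUT]:
                if k == key:
                    return addr
            return None
    return None

print(lookup(45))   # 450
```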

44
Inverted List Index
  • In an inverted list index all record addresses
    having the same index value will be stored in the
    same record.
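A minimal sketch of the structure, with invented records:

```python
from collections import defaultdict

# Sketch: an inverted list index stores all record addresses sharing an
# index value under that value. Records here are invented.

records = [("r0", "red"), ("r1", "blue"), ("r2", "red"), ("r3", "red")]

inverted = defaultdict(list)
for addr, (_, color) in enumerate(records):
    inverted[color].append(addr)   # all addresses for a value, together

print(inverted["red"])    # [0, 2, 3]
```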

45
Inverted List Index
  • Although a very efficient index, it causes
    difficulty with record sharing and file
    maintenance.
  • Some database systems will apply algorithms that
    allow the system to choose the best index system
    for a particular task on the fly.

46
Binary Indexes
  • Binary Indexes use bit fiddling and coding to
    represent values in an index. This can save both
    I/O time and processor time.
  • This is primarily applicable where there is a
    small, defined set of possible index values.
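A small illustration of the bit-level idea, with invented records and values (one bitmap per possible value, one bit per record):

```python
# Sketch: a bitmap-style binary index. One integer bitmap per value;
# bit i set means record i has that value. Records are invented.

records = ["red", "blue", "red", "green", "red"]
values = ["red", "blue", "green"]          # small, defined value set

bitmaps = {v: 0 for v in values}
for i, v in enumerate(records):
    bitmaps[v] |= 1 << i                   # set the bit for record i

# Bitwise OR answers a combined query without touching the records.
red_or_green = bitmaps["red"] | bitmaps["green"]
matches = [i for i in range(len(records)) if red_or_green >> i & 1]
print(matches)   # [0, 2, 3, 4]
```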

47
Hashing
  • Hashing is a non-indexing technique that
    translates index values directly into a storage
    address.
  • Hashing is heavily reliant on the algorithm that
    calculates the addresses.
  • Hashing can be very fast but it causes issues
    with physical storage.
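A minimal sketch, using Python's built-in hash as a stand-in for a real address-calculation algorithm, with chained buckets for collisions (the collision handling is one source of the physical-storage issues noted above):

```python
# Sketch: hashing translates a key straight into a storage address,
# with no index lookup. Bucket count and keys are invented.

NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def address(key):
    return hash(key) % NUM_BUCKETS       # value -> bucket address directly

def store(key, record):
    buckets[address(key)].append((key, record))

def fetch(key):
    for k, rec in buckets[address(key)]: # only one bucket is read
        if k == key:
            return rec
    return None

store("acct-42", {"balance": 100})
print(fetch("acct-42"))   # {'balance': 100}
```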

48
Table Indexes
  • Table indexes are indexes that include some of
    the data fields themselves. This can be
    effective when the number of important fields is
    small.

49
Memory Resident Tables
  • Memory resident tables are another brute force
    type of technique to reduce I/O overhead.

50
Cross File Capabilities and Storage Management
  • Two approaches are evaluated
  • Light Overhead or Direct Approach
  • Heavy Overhead or Smart Database Approach

51
Light Overhead
  • The light overhead or direct approach requires
    the user to specify relationships.
  • This is also referred to as a navigational
    database.

52
Heavy Overhead
  • The heavy overhead or smart database approach
    uses catalog tables, directories and other lookup
    facilities to make relationship decisions for the
    user.

53
Pointers
  • Pointers are a technique for relating records
    from one table to another without using an index.
  • The addresses of the corresponding records are
    embedded in the original record itself.
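A toy sketch of the idea, with invented records and addresses:

```python
# Sketch: relating records by embedding the address of the related
# record in the original record, with no index involved.

orders = {100: {"item": "widget", "customer_ptr": 7}}   # embedded address
customers = {7: {"name": "ACME"}}

def customer_for(order_id):
    ptr = orders[order_id]["customer_ptr"]   # follow the pointer
    return customers[ptr]                    # one direct fetch, no lookup

print(customer_for(100))   # {'name': 'ACME'}
```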

54
File Scan Join
  • A file scan join is a join performed on the fly
    by identifying the matching fields and letting
    the database perform full file scans to identify
    record matches.
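A minimal sketch of such a join over two invented in-memory "files":

```python
# Sketch: a file scan join. Full scans of both files find rows whose
# join fields match. The two files are invented for illustration.

employees = [("alice", 10), ("bob", 20), ("carol", 10)]   # (name, dept_id)
departments = [(10, "eng"), (20, "sales")]                # (dept_id, dept)

joined = [(name, dept)
          for name, d1 in employees     # full scan of file one
          for d2, dept in departments   # full scan of file two per row
          if d1 == d2]                  # match on the join field

print(joined)   # [('alice', 'eng'), ('bob', 'sales'), ('carol', 'eng')]
```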

55
File Scan Join
  • This technique is the fallback method for most
    databases if no other method is available.
  • More commonly, indexing and hash keys are used to
    join files.

56
ORACLE Example for Illustration
  • An ORACLE database uses the following techniques
  • Index Unique: use a unique index to find values
  • Index Range Scan: scan a range of values within
    the index only
  • Intersection: return only rows that match on two
    or more tables

57
ORACLE Example
  • ORACLE techniques
  • Merge Join: get two sets of rows, both sorted,
    and combine them where they match
  • Nested Loop: identify a set of criteria in one
    table and then match them, one at a time, against
    a second table
  • Sequence: access the data from a table that is
    presorted in the right order
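The merge join described above can be sketched as a single forward pass over two sorted row sets (the rows are invented, and duplicate join keys are not handled in this simplified version):

```python
# Sketch: a merge join. Both inputs are sorted on the join key and are
# combined in one forward pass, advancing the smaller side each step.

left = [(1, "a"), (3, "b"), (5, "c")]    # (key, payload), sorted on key
right = [(1, "x"), (2, "y"), (3, "z")]   # (key, payload), sorted on key

out, i, j = [], 0, 0
while i < len(left) and j < len(right):
    if left[i][0] == right[j][0]:
        out.append((left[i][0], left[i][1], right[j][1]))
        i += 1
        j += 1
    elif left[i][0] < right[j][0]:
        i += 1                           # advance the smaller side
    else:
        j += 1

print(out)   # [(1, 'a', 'x'), (3, 'b', 'z')]
```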

58
ORACLE Example
  • ORACLE techniques
  • Table Access Full: a full table scan
  • Table Access Cluster: access a table via its
    cluster organization
  • Table Access Hash: access using the hash key
  • Table Access Row ID: find a row based on its ID
    number

59
Supported Join Operations
  • The most common ways databases support join
    operations are
  • Full Scan
  • Merge
  • Nested Loop
  • Row ID Manipulation (enhanced B-tree)
  • Star Row ID Pointers

60
Contrasting Light and Heavy Overhead Systems
  • In the heavy overhead system the database uses
    the same techniques; however, the user doesn't
    have to choose which technique to use in each
    situation.
  • The system will assess and choose the most
    appropriate technique in the given circumstances.

61
Explicit Disk Management Approaches
  • The easiest method a DBA can use to improve
    performance is to improve the physical storage of
    files.
  • This requires three approaches
  • Reorganize database storage area
  • Pull data together
  • Spread data apart

62
Disk Management Technology
  • Storage subsystems have taken a number of forms
    over the years, although a few have become
    dominant in the middle to higher end server and
    system market.
  • One particularly effective technology for
    managing disk throughput, increasing security,
    and improving performance is RAID.

63
RAID
  • RAID, Redundant Arrays of Inexpensive Disks, is a
    method of distributing a database across a number
    of physical disk systems. It consists of the
    disks, the array controller and the software.

64
RAID Levels
  • RAID is described in terms of levels from RAID 0
    to RAID 6.
  • RAID 0 disk striping, no redundancy
  • RAID 1 disk mirroring, no striping
  • RAID 3 disk striping with a dedicated parity disk
  • RAID 5 disk striping with distributed parity
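A toy sketch of the RAID 0 striping idea only (no parity or mirroring), with invented block names and disk count:

```python
# Sketch: RAID 0-style striping. Consecutive blocks are distributed
# round-robin across the disks so reads can proceed in parallel.

NUM_DISKS = 4
disks = [[] for _ in range(NUM_DISKS)]

blocks = [f"block-{i}" for i in range(8)]
for i, block in enumerate(blocks):
    disks[i % NUM_DISKS].append(block)   # round-robin across the array

print(disks[0])   # ['block-0', 'block-4']
```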

65
ENHANCED STORAGE MANAGEMENT
  • Databases can sometimes be enhanced through the
    implementation of third party programs into the
    layers of the system.

66
Storage Management Evolution
  • In the early days of databases it was common for
    systems programmers and database experts to
    rewrite portions of the database system to meet
    their specific needs.
  • It is common now for database systems to have
    pre-built exit routines that allow the system to
    call an external program or algorithm.

67
Exit Routines
  • These interventions typically occur at the
    following key points
  • Calls to external security packages
  • Calls to sort, compression and data
    transportation routines
  • Calls to online transaction monitors and
    processors
  • Calls to network management systems
  • Calls to data storage, retrieval and access
    enhancers

68
Storage Management Applications
  • Object-oriented databases apply this technology
    to a greater extent than any other model.

69
Storage Management Database Enhancement
  • The data definition language (DDL) of the
    database itself also provides means of database
    enhancement through the following means
  • Stored Procedures
  • Triggers
  • Column-based Calls

70
Storage Management Enhancement
  • In all three cases the routines being called may
    execute other database calls or invoke external
    program routines.

71
Optimizing Hardware Architectures
  • Our concern with hardware revolves around four
    basic elements.
  • Disk Storage
  • Memory
  • Processors (CPUs)
  • Disk Connection Pathways (buses)

72
SMP, MPP and NUMA Hardware Architectures
  • Manufacturers approach these issues by
    addressing
  • physical partitioning across multiple machines
  • clustering across multiple machines
  • 64-bit, SMP, MPP, and NUMA within individual
    machines.

73
Physical Partitioning
  • Physical partitioning is the division of data
    between multiple computers.
  • In some cases it works effectively, in others
    terribly.

74
Problems with Physical Partitioning
  • Difficulty of determining how to distribute data
  • Managing multiple CPU usage on the same data
  • Complexity overload
  • Traffic and transportation overhead

75
Clustering
  • Clustering involves the connection of multiple
    computers to a single storage array.
  • In the clustering model only one CPU can work on
    one query at a time.

76
64 Bit Word Size
  • Increasing the word size (64 bit) increases the
    overall speed of the system.
  • However, if the rest of the architecture doesn't
    change as well, it simply introduces additional
    bottlenecks.

77
The Multiple Processor Approach
  • Multiple processor approaches introduce methods
    for enhancing processor capability and overall
    processor speed.

78
The Multiple Processor Methods
  • Methods we will look at include
  • Massively Parallel Processor (MPP)
  • Symmetric Multiprocessor (SMP)
  • Cache-Coherent, Non Uniform Memory Access
    (cc-NUMA)

79
Massively Parallel Processor (MPP)
  • The Massively Parallel Processor (MPP) approach
    is a method for placing up to thousands of
    processors into a single machine governed by a
    set of rules.

80
Problems with MPP
  • The fault of the MPP architecture is the overhead
    required to coordinate the CPUs.
  • Usually MPP architectures only work well for
    specialized tasks.
  • MPP leads to complex, custom systems.

81
Symmetric Multiprocessor
  • The Symmetric Multiprocessor (SMP) approach is an
    enhancement of the MPP approach that focuses on
    allowing true multi-tasking.
  • An SMP system dynamically assigns CPU processor
    time.

82
Problems with SMP
  • If processor time is not evenly distributed then
    the advantages of parallelism are lost.
  • There are physical limitations to the size that
    an SMP system can grow due to overhead issues.

83
Cache-Coherent, Non Uniform Memory Access
(cc-NUMA)
  • Cache-coherent, Non-uniform Memory Access
    (cc-NUMA) architectures allow for true,
    distributed sharing of major subsystems using
    shared interconnect bus systems.
  • This occurs most notably with memory but is also
    often used with I/O subsystems.

84
Advantages of NUMA
  • Most NUMA architectures use the symmetric
    Scalable Coherent Interconnect (SCI) model from
    IEEE.
  • The system takes advantage of SMP architectures
    while improving on the bus bandwidth issues.

85
Summary Review of Disk Management
  • Clearly, the I/O bound limitations imposed on
    database implementations can defeat an otherwise
    intelligent design with reduced performance and
    limited reliability.
  • Understanding the optimization tools available to
    the designer allows you to overcome these
    limitations with the proper management approach.