DISTRIBUTED DATABASE DESIGN - PowerPoint PPT Presentation

1 / 55
About This Presentation
Title:

DISTRIBUTED DATABASE DESIGN

Description:

Dimensions of data distribution. Design process and strategies ... physically distributed companies new trends for business amalgamation. electronic commerce ... – PowerPoint PPT presentation

Number of Views:1642
Avg rating:3.0/5.0
Slides: 56
Provided by: fredloc
Category:

less

Transcript and Presenter's Notes

Title: DISTRIBUTED DATABASE DESIGN


1
COMP 332PRINCIPLES OFDATABASE DESIGN
  • DISTRIBUTED DATABASE DESIGN

2
DISTRIBUTED DATABASE DESIGN OUTLINE
  • Distributed Database Management Systems
  • Concepts
  • Benefits
  • Distributed architecture alternatives
  • Dimensions of data distribution
  • Design process and strategies
  • Data Fragmentation
  • Horizontalprimary
  • Horizontalderived
  • Vertical
  • Hybrid (mixed)
  • Data Allocation
  • Problem specification
  • Nonredundant best fit method
  • Redundant all beneficial sites method
  • Practical Considerations

3
CONCEPTS
  • Distributed Computing System
  • A number of autonomous processing elements, not
    necessarily homogeneous, that are interconnected
    by a computer network and that cooperate in
    performing their assigned tasks
  • Distributed Database
  • A collection of multiple, logically interrelated
    databases, distributed over a computer network
  • Distributed Database Management System
  • (see next slide)

4
DISTRIBUTED DATABASE MANAGEMENT SYSTEM
  • A software system that supports the transparent
    creation, access, and manipulation of
    interrelated data located at different sites of a
    computer network

5
DDBMS (contd)
  • The union of two opposing technologies

Database Technology
Network Technology
Distribute
Centralize
  • Differences in database design philosophy
  • Centralized achieve efficiency through local
    optimization by using complex physical data
    structures
  • Distributed achieve efficiency through global
    optimization of processing including cost of
    network communication and local processing

6
WHY DISTRIBUTED?
  • Goal improve the accessibility, sharability and
    performance of a DBMS while preserving the
    appearance of a centralized DBMS
  • Each site
  • has autonomous processing capability and can
    perform local applications
  • participates in the execution of at least one
    global application
  • Corresponds to the organizational structure of
    distributed enterprises
  • physically distributed companies new trends for
    business amalgamation
  • electronic commerce
  • manufacturing control systems

7
WHAT IS BEING DISTRIBUTED?
  • Processing logic
  • Inventory
  • Personnel
  • Sales
  • Function
  • Printing
  • Email
  • Data
  • Computation

Distributed Computing is not Multi-processor
system Backend processor Parallel Computing
8
COMPARE CENTRALIZED DBMS ARCHITECTURE
Centralized DBMS Architecture
9
VERSUS NETWORKED ARCHITECTURE
Networked Architecture with Centralized Database
10
VERSUS DISTRIBUTED DBMS ARCHITECTURE
Distributed DBMS Architecture
11
BENEFITS OF DDBMSs TRANSPARENCY
  • Separation of high-level semantics from low-level
    implementation issues
  • Extension of the data Independence concept in
    centralized databases
  • Basic Concepts
  • fragmentation

Shanghai Data
  • replication

Taipei Data
12
TRANSPARENCY TYPES IN DDBMSS
  • Network
  • Replication
  • Fragmentation

13
TRANSPARENCY TYPES IN DDBMSS (contd)
  • Network
  • Protect the user from the operational details of
    the network
  • Users do not have to specify where the data is
    located
  • location transparency naming transparency
  • Replication
  • Replicas (copies) are created for performance and
    reliability reasons
  • Replication causes difficult update problems
  • Users should be made unaware of the existence of
    these copies
  • Fragmentation
  • Basic fragments are parts of a relation
  • vertical subset of columns horizontal subset
    of rows
  • Fragments are also created for performance and
    reliability reasons
  • Users should be made unaware of the existence of
    fragments

14
BENEFITS OF DDBMSS RELIABILITY
  • Replicated components (data and software)
    eliminate single points of failure
  • Failure of a single site or communication link
    should not bring down entire system
  • Managing the reachable data requires
    distributed transaction support which provides
    correctness in the presence of failures

15
BENEFITS OF DDBMSs PERFORMANCE
  • Greater throughput due to
  • Data Localization
  • Reduces contention for CPU and I/O services
  • Reduces communication overhead
  • Parallelism
  • Inter-query and intra-query parallelism
  • The Multiplex approach

16
BENEFITS OF DDBMSS EXPANSION
  • Expansion by adding processing and storage power
    to the network new nodes
  • Replacing a mainframe versus adding more PCs to
    the network more cost effective

17
DDBMS ARCHITECTURAL ALTERNATIVES
  • Specifies how the components of a DBMS are
    distributed among the network sites and the form
    of the DBMS at each site
  • Autonomy degree to which each DBMS can operate
    independently (distribution of control, not data)
  • A0 tight integration (i.e., one DBMS controls
    all databases)
  • A1 semi-autonomous (i.e., each DBMS controls its
    own database)
  • A2 total isolation/fully autonomous (i.e.,
    multiple, independent DBMSs among which there is
    no communication or data sharing)
  • Distribution describes where the data is
    located physically
  • D0 non-distributed (i.e., at one site)
  • D1 client server (i.e., multiple locations, but
    dependent on each other)
  • D2 peer-to-peer (i.e., multiple locations, but
    independent of each other)
  • Heterogeneity indicates uniformity of the DBMSs
    with respect to data model, query language,
    interfaces, etc.
  • H0 homogenous (i.e., all are the same)
  • H1 heterogeneous (i.e., all are different)

18
DDBMS ARCHITECTURAL ALTERNATIVES (contd)
  • Autonomy (A)
  • A0 Tight integration
  • A1 Semi-autonomous
  • A2 Total isolation
  • Distribution (D)
  • D0 Non-distributed
  • D1 Client Server
  • D2 Peer-to-peer
  • Heterogeneity (H)
  • H0 Homogeneous
  • H1 Heterogeneous

332 18 Alternatives Some alternatives are
meaningless or not practical!
19
DDBMS ARCHITECTURAL ALTERNATIVES (contd)
DISTRIBUTION
Homogenous federated DDBMS (A1,D2,H0)
Homogenous multidatabase DDBMS (A2,D2,H0)
Homogenous DDBMS (A0,D2,H0)
Composite DBMS (A0,D0,H0)
Heterogeneous DDBMS (A0,D2,H1)
AUTONOMY
centralized
federated
autonomous
Multidatabase DBMS (A2,D0,H0)
Heterogeneousmultidatabase DDBMS (A2,D2,H1)
Heterogeneousfederated DDBMS (A1,D2,H1)
Heterogeneousfederated DBMS (A1,D0,H1)
HETEROGENEITY
Heterogeneousmultidatabase DBMS (A2,D0,H1)
20
DDBMS ARCHITECTURAL ALTERNATIVES (contd)
  • (A0, D0, H0) A collection of logically integrated
    DBMSs on the same site, also called Composite
    Systems
  • (A0, D0, H1) A system providing integrated access
    to heterogeneous DBMSs on a single machine
  • (A0, D1, H0) Client Server distribution of a
    single DBMS
  • (A0, D2, H0) Fully distributed
  • (A1, D0, H0) Semi-autonomous systems, also called
    Federated Systems. Each DBMS knows how to
    participate in the federation

21
DDBMS ARCHITECTURAL ALTERNATIVES (contd)
  • (A1, D0, H1) Heterogeneous Federated DBMS.
  • (A1, D1, H1) Heterogeneous Federated Distributed
    DBMS
  • (A2, D0, H0) Multidatabase DBMS. Complete
    homogeneity in component systems is unlikely
  • (A2, D0, H1) Heterogeneous Multidatabase DBMS.
    Similar to (A1, D0, H1), but with full autonomy
  • (A2, D1, H1), Heterogeneous Multidatabase
    Distributed DBMS
  • (A2, D2, H1)

22
MAJOR DBMS ARCHITECTURES CLIENT SERVER
  • (Ax, D1, Hy)
  • Distribute the functionality between client and
    server to better manage the complexity of the
    DBMS
  • Two-level architecture

Typical Scenario 1. Client parses a query,
decomposes it into independent site queries, and
sends it to an appropriate server. 2. Each server
processes a local query and sends the result
relation to the client. 3. The client combines
the results of all the sub-queries.
23
MAJOR DBMS ARCHITECTURES PEER-TO-PEER
  • (A0, D2, H0)
  • Global users submit requests via an external
    schema (ES) defined over a global conceptual
    schema (GCS), which is system maintained, and
    that is the union of all local conceptual schemas
    (LCSs)
  • The global system has no control over local data
    and processing

24
MAJOR DBMS ARCHITECTURES MULTI-DATABASE
  • (A2, Dx, Hy)
  • The global conceptual schema (GCS) exists as a
    union of some local conceptual schemas (LCSs) only

or does not exist at all!
25
DISTRIBUTED DATABASE DESIGN ISSUES
  • network design
  • DBMS software distribution
  • application program distribution
  • data distribution
  • level of sharing
  • access pattern behaviour
  • level of knowledge (of access pattern behaviour)

26
DIMENSIONS OF DATA DISTRIBUTION
ACCESS PATTERN BEHAVIOUR
Dynamic
Static
No sharing
Data
LEVEL OF KNOWLEDGE
Complete information
Data program
Partial information
LEVEL OF SHARING
27
DIMENSIONS OF DATA DISTRIBUTION
  • the distribution of data in distributed systems
    can be viewed along three orthogonal dimensions
  • Level of sharing
  • no sharing each application and its data execute
    at one site
  • data sharing programs replicated at each site
    data not replicated, but moved to a site as
    needed
  • dataprogram sharing both data and programs may
    be moved to a site as needed
  • Access pattern behaviour
  • static access patterns to data do not change
    over time
  • dynamic access patterns to data change over time
    how dynamic?
  • Level of knowledge (of access pattern behaviour)
  • no knowledge not a likely scenario
  • partial knowledge can predict, but users may
    deviate significantly
  • complete knowledge can predict and no major
    changes

28
DISTRIBUTED DATABASE DESIGN OBJECTIVES
  • Separation of data fragmentation from data
    allocation
  • data fragmentation a logical mapping
  • data allocation a physical mapping
  • However, to determine an optimal strategy we
    need toconsider the two problems together
  • Control of data redundancy
  • how much data duplication do we want to allow in
    the design?
  • tradeoff retrieval versus update
  • Independence from local DBMSs
  • we do not want the design to be dependent on the
    specific properties of any one DBMS
  • easier to design, modify design

29
DISTRIBUTED DATABASE DESIGN STRATEGIES
  • Top-down
  • Follows the approach discussed in class so far
  • Additional information regarding
  • distribution of accesses among sites
  • nature of access to database at each site
  • needs to be gathered during Requirements
    Analysis
  • Design local conceptual schemas by distributing
    the entities over the sites of the distributed
    system after conceptual schema design
  • Distribution activity consists of
  • data fragmentation split up schema into pieces
  • data allocation assign schema pieces to sites
  • Bottom-up
  • Necessary when databases already exist and we
    need to integrate them into one database
  • Similar to view integration, but may be
    heterogeneous

30
TOP-DOWN DATABASE DESIGN PROCESS
Phase 1Requirements Analysis
datarequirements
requirementsspecification
Phase 2Conceptual Design
conceptual externalschema design
conceptualschema
Phase 3Choice of DBMS/PL
distributed databaseschema design
conceptualdistributed schema
Phase 4Logical Design
logical externalschema design
logicalschema
Phase 5Physical Design
internal schemadesign
for each site
physicalschema
Phase 6Implementation
DDL statementsSDL statements
31
DISTRIBUTED DATABASE DESIGN ISSUES
  • Fragmentation
  • Reasons
  • Types
  • Degree
  • Correctness
  • Allocation
  • Partitioned
  • Partially Replicated
  • Fully Replicated

Database Information Communication
Information Application Information Computer
System Information
INFO required
32
WHY DATA FRAGMENTATION?
  • Relation is not an appropriate unit of
    distribution
  • Application views are usually subsets of
    relations
  • Application access is local to subsets of
    relations
  • Applications at different sites may require
    different parts of the same relation
  • Store once high remote access costs
  • Duplicate high update costs
  • Concurrency of access is more limited
  • BUT, some applications may suffer from data
    fragmentation
  • Their required data is located in two or more
    fragments
  • It may be harder to check integrity constraints
    (e.g., FDs)
  • Careful design of data fragmentation is required

33
TYPES OF FRAGMENTATION
34
HORIZONTAL DATA FRAGMENTATION
  • partitions a relation along its tuples so that
    each fragment has a subset of the tuples of the
    relation
  • Types of horizontal fragmentation
  • Primary based on a predicate Pi that selects
    tuples from a relation R
  • Derived based on the partitioning of a relation
    due to predicates defined on another relation
  • related according to foreign keys

35
HORIZONTAL DATA FRAGMENTATION PRIMARY
Each fragment, Ri, is a selection on a relation R
using a predicate Pi
R is reconstructed by taking theunion of all Ri
36
(No Transcript)
37
HORIZONTAL DATA FRAGMENTATION PRIMARY
original relation
primary fragments
38
HORIZONTAL DATA FRAGMENTATION DERIVED
existing primary fragments
original relation
39
HORIZONTAL DATA FRAGMENTATION DERIVED
derived fragments
existing primary fragments
40
HORIZONTAL FRAGMENTATION INFORMATION REQUIREMENTS
  • Database Information
  • Relations in the database and relationships
    between them

primary selection operation on owner derived
defined on member according to selection
operation on owner
  • Application Information
  • User query predicates examine most important
    applications (80/20 rule)
  • simple p1 DEPT CSEE p2 SAL gt 30000
  • conjunctive minterm predicate m1 p1? p2
    (e.g., (DEPTCSEE) AND (SALgt30000))
  • minterm selectivity number of tuples returned
    against a given minterm
  • access frequency access frequency of user
    queries and/or minterms

41
VERTICAL DATA FRAGMENTATION
  • produces fragments R1, R2, R3, ,Rn of a relation
    R
  • each fragment contains a subset of Rs attributes
    as well as the primary key of R
  • divides relations vertically by columns
    (attributes)
  • the objective is to obtain fragments so that
    applications only need to access one fragment
  • want to minimize execution time of applications
  • inherently more complicated than horizontal data
    fragmentation due to the total number of
    alternatives available

42
VERTICAL DATA FRAGMENTATION
Each fragment, Ri,is a projection ona relation R
R is reconstructedby applying eitherouter union
or fullouter join to all Ri
...
43
VERTICAL DATA FRAGMENTATION (contd)
44
VERTICAL FRAGMENTATION INFORMATION REQUIREMENTS
  • 1. Access frequenciesuse(Qi, Aj) 1 if Aj is
    referenced by Qi, 0 otherwise

2. Affinity of attributesnumber of accesses to
attributes (Ai, Aj) for each execution of
application Ql at site Sk weighted by the
application access frequency
45
VERTICAL DATA FRAGMENTATION DESIGN
  • Complexity of vertical fragmentation
  • Total number of possible vertical fragments for a
    relation with m non-primary key attributes is
    equal to the m-th Bell number, B(m)
  • For large m, B(m) mm
  • m 10 B(m) 115,000
  • m15 B(m) 109
  • m30 B(m) 1023
  • Need heuristics to obtain reasonable solutions
  • Heuristics for vertical fragmentation
  • grouping assign each attribute to one fragment
    and then join fragments until some criteria is
    satisfied
  • splitting partition a single relation based on
    access behaviour of applications to the
    attributes
  • Need to determine affinity of attributes

46
VERTICAL DATA FRAGMENTATION NOTES
  • splitting fits more naturally with a top-down
    methodology
  • the optimal solution is probably closer to the
    full relation than to a set of single-attribute
    fragments
  • splitting also generates non-overlapping
    fragments whereas grouping produces overlapping
    fragments
  • we prefer designs in which the primary key is
    part of each fragment to enable semantic
    integrity to be more easily enforced

47
HYBRID (MIXED) DATA FRAGMENTATION
Each fragment, Ri, is a combinationof a
selection using a predicate Piand projection on
a relation R
48
HYBRID (MIXED) DATA FRAGMENTATION (contd)
49
CORRECTNESS RULES FOR FRAGMENTATION
  • Completeness
  • If a relation R is decomposed into fragments R1,
    R2, , Rn, each tuple/attribute that can be found
    in R can also be found in one or more of the Ris
  • Reconstruction
  • If a relation R is decomposed into fragments R1,
    R2, , Rn, it should be possible to define a
    relational operator D such that
  • R D Ri "RiÎ FR
  • Disjointness
  • If a relation R is horizontally decomposed into
    fragments R1, R2, , Rn and data item di is in
    Rj, it is not in any other fragment (Rk k?j)

50
DATA REPLICATION
  • Replication improves the availability of the
    database, but increases the cost of updating data
    and requires more complicated concurrency control
  • No replication
  • Each fragment is assigned to only one site
  • Fully replicated
  • Allocate a full copy of the database to each site
  • Improves retrieval performance, but impairs
    update performance
  • Selective replication
  • Replicate some fragments at one or more sites
  • Choice of sites and degree of replication depend
    on performance and availability goals and on the
    types and frequencies of transactions submitted
    at each site

51
DATA ALLOCATION
  • Given a set of fragments F F1, F2, , Fn
  • a set of sites S S1, S2, , Sm
  • a set of transactions T T1, T2, , Tp
  • Find an optimal distribution of F to S
  • 1. Minimize cost
  • Cost of storing each Fi at Sj
  • Cost of querying Fi at Sj
  • Cost of updating Fi at all sites where it is
    stored
  • Cost of data communication
  • 2. Maximize performance
  • Minimize response time at each site
  • Maximize system throughput at each site
  • This problem is NP-hard!

52
DATA ALLOCATION (contd)
  • Notes
  • the placement of one fragment usually has an
    impact on the placement of the other fragments
  • should take into account relationships between
    fragments
  • access to data by applications is modeled very
    simply
  • should take into account relationships
    betweendata allocation and query processing
  • most models do not consider the cost of integrity
    enforcement
  • most models do not consider the cost of
    concurrency control

many simplifying assumptions are madeto make the
problem tractable
53
INFORMATION REQUIREMENTS DATA ALLOCATION
  • 1. Database information
  • global schema and fragmentation schema
  • fragment selectivity for query qi
  • size of a fragment
  • 2. Application information
  • A set of user transactions and their frequencies
  • Read and write accesses which site can update
    and which can query
  • Recovery estimated frequency and volume of
    backup operations
  • Integrity referential integrity, journaling
    overhead
  • 3. Network information
  • Network topology, network channel capacities and
    network control mechanism
  • communication cost between sites
  • 4. Site information
  • The site locations and their processing capacity
    (CPU I/O)
  • Sources of data (where data can be located) and
    sinks of data (where user transactions can be
    initiated and data transferred)
  • The transaction processing options and
    synchronization algorithms
  • The unit cost for data storage, and local site
    processing

Minimize C Ccommuication Cprocessing
Cstorage
54
THE NONREDUNDANT BEST FIT METHOD
Place fragment Ri at site Sj where the number of
local query and update references by all
transactions is maximized
  • determines the single site to which to allocate a
    fragment based on maximum benefit
  • most total query and update references

55
EXAMPLE SYSTEM INFORMATION
  • Global schema and fragmentation schema

Transactions and their frequencies
Security no restrictions Data sources all
sites Data sinks all sites
56
LOCAL REFERENCE COMPUTATION
57
EXAMPLE ALLOCATION DECISION
58
REDUNDANT ALL BENEFICIAL SITES METHOD
  • Allocate a fragment Fi to a site Sj if the
    benefit of doing so exceeds the cost

Cost (time to do all local updates
frequency) (time to do all remote updates
frequency) Benefit time saved to do local query
frequency
59
EXAMPLE COST COMPUTATION
60
EXAMPLE BENEFIT COMPUTATION
61
EXAMPLE ALLOCATION DECISION
62
PRACTICAL CONSIDERATIONS
  • Allocating a fragment to a site will change the
    cost-benefit outcomethis should be taken into
    account when calculating cost-benefit for other
    fragments
  • Our examples use averages for query/update times
    to simplify the calculationsin real
    environments, actual I/O times and network delay
    times may be available for individual
    transactions
  • No matter how detailed, all calculations in the
    end are just estimates
  • We just need to be in the right ballpark, not
    the exact seat!
Write a Comment
User Comments (0)
About PowerShow.com