DISTRIBUTED DATABASE DESIGN
1
COMP 332: PRINCIPLES OF DATABASE DESIGN
  • DISTRIBUTED DATABASE DESIGN

2
DISTRIBUTED DATABASE DESIGN: OUTLINE
  • Distributed Database Management Systems
  • Concepts
  • Benefits
  • Distributed architecture alternatives
  • Dimensions of data distribution
  • Design process and strategies
  • Data Fragmentation
  • Horizontal: primary
  • Horizontal: derived
  • Vertical
  • Hybrid (mixed)
  • Data Allocation
  • Problem specification
  • Nonredundant best fit method
  • Redundant all beneficial sites method
  • Practical Considerations

3
CONCEPTS
  • Distributed Computing System
  • A number of autonomous processing elements, not
    necessarily homogeneous, that are interconnected
    by a computer network and that cooperate in
    performing their assigned tasks
  • Distributed Database
  • A collection of multiple, logically interrelated
    databases, distributed over a computer network
  • Distributed Database Management System
  • (see next slide)

4
DISTRIBUTED DATABASE MANAGEMENT SYSTEM
  • A software system that supports the transparent
    creation, access, and manipulation of
    interrelated data located at different sites of a
    computer network

5
DDBMS (contd)
  • The union of two opposing technologies

[Diagram: database technology pulls toward centralization; network technology pulls toward distribution]
  • Differences in database design philosophy
  • Centralized: achieve efficiency through local optimization, using complex physical data structures
  • Distributed: achieve efficiency through global optimization of processing, including the cost of network communication and local processing

6
WHY DISTRIBUTED?
  • Goal: improve the accessibility, sharability, and performance of a DBMS while preserving the appearance of a centralized DBMS
  • Each site
  • has autonomous processing capability and can
    perform local applications
  • participates in the execution of at least one
    global application
  • Corresponds to the organizational structure of
    distributed enterprises
  • physically distributed companies; new trends toward business amalgamation
  • electronic commerce
  • manufacturing control systems

7
WHAT IS BEING DISTRIBUTED?
  • Processing logic (e.g., inventory, personnel, sales)
  • Function (e.g., printing, email)
  • Data
  • Computation

Distributed computing is not: a multi-processor system, a backend processor, or parallel computing.
8
COMPARE CENTRALIZED DBMS ARCHITECTURE
Centralized DBMS Architecture
9
VERSUS NETWORKED ARCHITECTURE
Networked Architecture with Centralized Database
10
VERSUS DISTRIBUTED DBMS ARCHITECTURE
Distributed DBMS Architecture
11
BENEFITS OF DDBMSs: TRANSPARENCY
  • Separation of high-level semantics from low-level
    implementation issues
  • Extension of the data independence concept in centralized databases
  • Basic concepts
  • fragmentation
  • replication

[Diagram: fragmentation and replication illustrated with Shanghai data and Taipei data]
12
TRANSPARENCY TYPES IN DDBMSs
  • Network
  • Replication
  • Fragmentation

13
TRANSPARENCY TYPES IN DDBMSs (contd)
  • Network
  • Protect the user from the operational details of
    the network
  • Users do not have to specify where the data is
    located
  • location transparency + naming transparency
  • Replication
  • Replicas (copies) are created for performance and
    reliability reasons
  • Replication causes difficult update problems
  • Users should be made unaware of the existence of
    these copies
  • Fragmentation
  • Basic fragments are parts of a relation
  • vertical: subset of columns; horizontal: subset of rows
  • Fragments are also created for performance and
    reliability reasons
  • Users should be made unaware of the existence of
    fragments

14
BENEFITS OF DDBMSs: RELIABILITY
  • Replicated components (data and software)
    eliminate single points of failure
  • Failure of a single site or communication link
    should not bring down entire system
  • Managing the reachable data requires distributed transaction support, which provides correctness in the presence of failures

15
BENEFITS OF DDBMSs: PERFORMANCE
  • Greater throughput due to
  • Data Localization
  • Reduces contention for CPU and I/O services
  • Reduces communication overhead
  • Parallelism
  • Inter-query and intra-query parallelism
  • The Multiplex approach

16
BENEFITS OF DDBMSs: EXPANSION
  • Expansion by adding processing and storage power to the network (new nodes)
  • Adding more PCs to the network is more cost-effective than replacing a mainframe

17
DDBMS ARCHITECTURAL ALTERNATIVES
  • Specifies how the components of a DBMS are
    distributed among the network sites and the form
    of the DBMS at each site
  • Autonomy: the degree to which each DBMS can operate independently (distribution of control, not data)
  • A0: tight integration (i.e., one DBMS controls all databases)
  • A1: semi-autonomous (i.e., each DBMS controls its own database)
  • A2: total isolation / fully autonomous (i.e., multiple independent DBMSs among which there is no communication or data sharing)
  • Distribution: describes where the data is physically located
  • D0: non-distributed (i.e., at one site)
  • D1: client/server (i.e., multiple locations, but dependent on each other)
  • D2: peer-to-peer (i.e., multiple locations, independent of each other)
  • Heterogeneity: indicates the uniformity of the DBMSs with respect to data model, query language, interfaces, etc.
  • H0: homogeneous (i.e., all the same)
  • H1: heterogeneous (i.e., they differ)

18
DDBMS ARCHITECTURAL ALTERNATIVES (contd)
  • Autonomy (A)
  • A0: Tight integration
  • A1: Semi-autonomous
  • A2: Total isolation
  • Distribution (D)
  • D0: Non-distributed
  • D1: Client/server
  • D2: Peer-to-peer
  • Heterogeneity (H)
  • H0: Homogeneous
  • H1: Heterogeneous

3 × 3 × 2 = 18 alternatives. Some alternatives are meaningless or not practical!
19
DDBMS ARCHITECTURAL ALTERNATIVES (contd)
[Diagram: the alternatives plotted on a cube with axes AUTONOMY (centralized, federated, autonomous), DISTRIBUTION, and HETEROGENEITY. Labeled points include: Composite DBMS (A0,D0,H0); Homogeneous DDBMS (A0,D2,H0); Heterogeneous DDBMS (A0,D2,H1); Heterogeneous federated DBMS (A1,D0,H1); Homogeneous federated DDBMS (A1,D2,H0); Heterogeneous federated DDBMS (A1,D2,H1); Multidatabase DBMS (A2,D0,H0); Heterogeneous multidatabase DBMS (A2,D0,H1); Homogeneous multidatabase DDBMS (A2,D2,H0); Heterogeneous multidatabase DDBMS (A2,D2,H1)]
20
DDBMS ARCHITECTURAL ALTERNATIVES (contd)
  • (A0, D0, H0): A collection of logically integrated DBMSs at the same site, also called Composite Systems
  • (A0, D0, H1): A system providing integrated access to heterogeneous DBMSs on a single machine
  • (A0, D1, H0): Client/server distribution of a single DBMS
  • (A0, D2, H0): Fully distributed
  • (A1, D0, H0): Semi-autonomous systems, also called Federated Systems; each DBMS knows how to participate in the federation

21
DDBMS ARCHITECTURAL ALTERNATIVES (contd)
  • (A1, D0, H1): Heterogeneous Federated DBMS
  • (A1, D1, H1): Heterogeneous Federated Distributed DBMS
  • (A2, D0, H0): Multidatabase DBMS; complete homogeneity in component systems is unlikely
  • (A2, D0, H1): Heterogeneous Multidatabase DBMS; similar to (A1, D0, H1), but with full autonomy
  • (A2, D1, H1): Heterogeneous Multidatabase Distributed DBMS
  • (A2, D2, H1): Heterogeneous Multidatabase DDBMS (fully autonomous, fully distributed, heterogeneous)

22
MAJOR DBMS ARCHITECTURES: CLIENT/SERVER
  • (Ax, D1, Hy)
  • Distribute the functionality between client and
    server to better manage the complexity of the
    DBMS
  • Two-level architecture

Typical scenario:
1. The client parses a query, decomposes it into independent site queries, and sends each to the appropriate server.
2. Each server processes its local query and sends the result relation to the client.
3. The client combines the results of all the sub-queries.
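
A minimal Python sketch of this scenario, assuming a toy Server class whose run_local() evaluates a predicate over that site's local data; the class, site names, and tuples are illustrative assumptions, not part of the slides.

    class Server:
        """A toy server holding one site's local relation (a list of dicts)."""
        def __init__(self, data):
            self.data = data
        def run_local(self, predicate):
            # Step 2: process the local query and return the result relation.
            return [t for t in self.data if predicate(t)]

    def run_distributed_query(site_queries, servers):
        # Step 1 happened at the client: the query was decomposed into
        # independent site queries, one per server.
        partials = [servers[s].run_local(q) for s, q in site_queries.items()]
        # Step 3: the client combines the sub-query results (here, a union).
        return [t for part in partials for t in part]

    servers = {"S1": Server([{"ENO": 1, "SAL": 20000}]),
               "S2": Server([{"ENO": 2, "SAL": 40000}])}
    q = lambda t: t["SAL"] > 10000
    print(run_distributed_query({"S1": q, "S2": q}, servers))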
23
MAJOR DBMS ARCHITECTURES: PEER-TO-PEER
  • (A0, D2, H0)
  • Global users submit requests via an external schema (ES) defined over a global conceptual schema (GCS), which is system-maintained and is the union of all local conceptual schemas (LCSs)
  • The global system has no control over local data
    and processing

24
MAJOR DBMS ARCHITECTURES: MULTI-DATABASE
  • (A2, Dx, Hy)
  • The global conceptual schema (GCS) exists only as a union of some of the local conceptual schemas (LCSs), or does not exist at all!
25
DISTRIBUTED DATABASE DESIGN ISSUES
  • network design
  • DBMS software distribution
  • application program distribution
  • data distribution
  • level of sharing
  • access pattern behaviour
  • level of knowledge (of access pattern behaviour)

26
DIMENSIONS OF DATA DISTRIBUTION
[Diagram: a cube with three axes: LEVEL OF SHARING (no sharing, data, data + program), ACCESS PATTERN BEHAVIOUR (static, dynamic), and LEVEL OF KNOWLEDGE (partial information, complete information)]
27
DIMENSIONS OF DATA DISTRIBUTION
  • The distribution of data in distributed systems can be viewed along three orthogonal dimensions
  • Level of sharing
  • no sharing: each application and its data execute at one site
  • data sharing: programs are replicated at each site; data is not replicated, but moved to a site as needed
  • data + program sharing: both data and programs may be moved to a site as needed
  • Access pattern behaviour
  • static: access patterns to data do not change over time
  • dynamic: access patterns to data change over time (how dynamic?)
  • Level of knowledge (of access pattern behaviour)
  • no knowledge: not a likely scenario
  • partial knowledge: accesses can be predicted, but users may deviate significantly
  • complete knowledge: accesses can be predicted, with no major changes

28
DISTRIBUTED DATABASE DESIGN OBJECTIVES
  • Separation of data fragmentation from data allocation
  • data fragmentation: a logical mapping
  • data allocation: a physical mapping
  • However, to determine an optimal strategy we need to consider the two problems together
  • Control of data redundancy
  • how much data duplication do we want to allow in the design?
  • tradeoff: retrieval versus update
  • Independence from local DBMSs
  • we do not want the design to depend on the specific properties of any one DBMS
  • makes the design easier to create and to modify

29
DISTRIBUTED DATABASE DESIGN STRATEGIES
  • Top-down
  • Follows the approach discussed in class so far
  • Additional information regarding
  • distribution of accesses among sites
  • nature of access to database at each site
  • needs to be gathered during Requirements
    Analysis
  • Design local conceptual schemas by distributing
    the entities over the sites of the distributed
    system after conceptual schema design
  • Distribution activity consists of
  • data fragmentation: split up the schema into pieces
  • data allocation: assign schema pieces to sites
  • Bottom-up
  • Necessary when databases already exist and we
    need to integrate them into one database
  • Similar to view integration, but may be
    heterogeneous

30
TOP-DOWN DATABASE DESIGN PROCESS
Phase 1: Requirements Analysis (data requirements) → requirements specification
Phase 2: Conceptual Design (conceptual + external schema design) → conceptual schema
Phase 3: Choice of DBMS/PL (distributed database schema design) → conceptual distributed schema
Phase 4: Logical Design (logical external schema design) → logical schema
Phase 5: Physical Design (internal schema design, for each site) → physical schema
Phase 6: Implementation → DDL statements, SDL statements
31
DISTRIBUTED DATABASE DESIGN ISSUES
  • Fragmentation
  • Reasons
  • Types
  • Degree
  • Correctness
  • Allocation
  • Partitioned
  • Partially Replicated
  • Fully Replicated

[Diagram: information required for the design: database information, communication information, application information, and computer system information]
32
WHY DATA FRAGMENTATION?
  • Relation is not an appropriate unit of
    distribution
  • Application views are usually subsets of
    relations
  • Application access is local to subsets of
    relations
  • Applications at different sites may require
    different parts of the same relation
  • Store once: high remote access costs
  • Duplicate: high update costs
  • Concurrency of access is more limited
  • BUT, some applications may suffer from data
    fragmentation
  • Their required data is located in two or more
    fragments
  • It may be harder to check integrity constraints
    (e.g., FDs)
  • Careful design of data fragmentation is required

33
TYPES OF FRAGMENTATION
34
HORIZONTAL DATA FRAGMENTATION
  • partitions a relation along its tuples so that
    each fragment has a subset of the tuples of the
    relation
  • Types of horizontal fragmentation
  • Primary: based on a predicate Pi that selects tuples from a relation R
  • Derived: based on the partitioning of a relation induced by predicates defined on another relation
  • the relations are related via foreign keys

35
HORIZONTAL DATA FRAGMENTATION: PRIMARY
Each fragment, Ri, is a selection on a relation R using a predicate Pi: Ri = σPi(R).
R is reconstructed by taking the union of all Ri: R = R1 ∪ R2 ∪ ... ∪ Rn.
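
Below is a minimal Python sketch of primary horizontal fragmentation, modeling relations as lists of dicts; the EMP relation and the salary predicates are illustrative assumptions, not from the slides.

    # Primary horizontal fragmentation: each fragment Ri = sigma_Pi(R).
    # Relations are modeled as lists of dicts (one dict per tuple).
    EMP = [
        {"ENO": 1, "ENAME": "Kim",  "SAL": 25000},
        {"ENO": 2, "ENAME": "Lee",  "SAL": 42000},
        {"ENO": 3, "ENAME": "Chan", "SAL": 31000},
    ]

    # P1 and P2 must partition R for the fragmentation to be complete
    # and disjoint (see the correctness rules later in the deck).
    P1 = lambda t: t["SAL"] <= 30000
    P2 = lambda t: t["SAL"] > 30000

    EMP1 = [t for t in EMP if P1(t)]   # fragment allocated to site 1
    EMP2 = [t for t in EMP if P2(t)]   # fragment allocated to site 2

    # Reconstruction: R = EMP1 union EMP2.
    key = lambda t: t["ENO"]
    assert sorted(EMP1 + EMP2, key=key) == sorted(EMP, key=key)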
37
HORIZONTAL DATA FRAGMENTATION: PRIMARY
original relation
primary fragments
38
HORIZONTAL DATA FRAGMENTATION: DERIVED
existing primary fragments
original relation
39
HORIZONTAL DATA FRAGMENTATION: DERIVED
derived fragments
existing primary fragments
40
HORIZONTAL FRAGMENTATION: INFORMATION REQUIREMENTS
  • Database Information
  • Relations in the database and relationships
    between them

primary: selection operation on the owner; derived: defined on the member according to a selection operation on the owner
  • Application Information
  • User query predicates: examine the most important applications (80/20 rule)
  • simple: p1: DEPT = "CSEE"; p2: SAL > 30000
  • conjunctive (minterm) predicate: m1 = p1 ∧ p2, e.g., (DEPT = "CSEE") AND (SAL > 30000)
  • minterm selectivity: the number of tuples returned for a given minterm
  • access frequency: the access frequency of user queries and/or minterms
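
A small Python sketch of how minterm predicates and their selectivities could be enumerated from the simple predicates above; the relation contents are illustrative assumptions.

    from itertools import product

    # Simple predicates from the slide (relation layout is illustrative):
    p1 = lambda t: t["DEPT"] == "CSEE"
    p2 = lambda t: t["SAL"] > 30000

    def minterms(preds):
        # Each minterm takes every simple predicate either as-is or negated,
        # so n simple predicates yield 2**n minterms.
        for signs in product([True, False], repeat=len(preds)):
            yield lambda t, s=signs: all(p(t) == keep
                                         for p, keep in zip(preds, s))

    ms = list(minterms([p1, p2]))  # m1 = p1 AND p2, m2 = p1 AND NOT p2, ...

    def selectivity(minterm, relation):
        # Minterm selectivity: number of tuples returned for the minterm.
        return sum(1 for t in relation if minterm(t))

    R = [{"DEPT": "CSEE", "SAL": 45000}, {"DEPT": "MATH", "SAL": 28000}]
    print([selectivity(m, R) for m in ms])   # [1, 0, 0, 1]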

41
VERTICAL DATA FRAGMENTATION
  • produces fragments R1, R2, R3, ..., Rn of a relation R
  • each fragment contains a subset of R's attributes as well as the primary key of R
  • divides relations vertically by columns (attributes)
  • the objective is to obtain fragments such that applications need to access only one fragment
  • we want to minimize the execution time of applications
  • inherently more complicated than horizontal data fragmentation due to the total number of alternatives available

42
VERTICAL DATA FRAGMENTATION
Each fragment, Ri, is a projection on a relation R.
R is reconstructed by applying either outer union or full outer join to all Ri.
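
A minimal Python sketch, reusing the list-of-dicts representation from earlier: each vertical fragment carries the primary key, and reconstruction is a join on that key. The EMP attributes are illustrative assumptions.

    # Vertical fragmentation: each fragment projects a subset of the
    # attributes plus the primary key, so R can be rebuilt by joining on it.
    EMP = [
        {"ENO": 1, "ENAME": "Kim", "TITLE": "Engineer", "SAL": 25000},
        {"ENO": 2, "ENAME": "Lee", "TITLE": "Analyst",  "SAL": 42000},
    ]

    def project(relation, attrs, key="ENO"):
        return [{a: t[a] for a in [key] + attrs} for t in relation]

    EMP1 = project(EMP, ["ENAME", "TITLE"])   # fragment 1
    EMP2 = project(EMP, ["SAL"])              # fragment 2

    def join_on_key(r1, r2, key="ENO"):
        # Natural join on the primary key to reconstruct R.
        index = {t[key]: t for t in r2}
        return [{**t, **index[t[key]]} for t in r1 if t[key] in index]

    assert join_on_key(EMP1, EMP2) == EMP   # dict equality ignores key order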
43
VERTICAL DATA FRAGMENTATION (contd)
44
VERTICAL FRAGMENTATION: INFORMATION REQUIREMENTS
  • 1. Attribute usage: use(Qi, Aj) = 1 if attribute Aj is referenced by query Qi, 0 otherwise
  • 2. Affinity of attributes: aff(Ai, Aj) = the number of accesses referencing both Ai and Aj for each execution of application Ql at site Sk, weighted by the application's access frequency
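
A minimal sketch of both measures, simplifying the slide's formulation by collapsing the per-site frequencies into a single total frequency per query; the usage matrix and frequencies are made-up illustrations.

    # use(Qi, Aj) = 1 if query Qi references attribute Aj, else 0.
    use = {
        "Q1": {"A1": 1, "A2": 0, "A3": 1, "A4": 0},
        "Q2": {"A1": 0, "A2": 1, "A3": 1, "A4": 0},
        "Q3": {"A1": 1, "A2": 0, "A3": 1, "A4": 1},
    }
    freq = {"Q1": 45, "Q2": 5, "Q3": 75}   # total access frequency per query

    def affinity(ai, aj):
        # Accesses that touch both Ai and Aj, weighted by query frequency.
        return sum(freq[q] for q in use if use[q][ai] and use[q][aj])

    attrs = ["A1", "A2", "A3", "A4"]
    aff = {(ai, aj): affinity(ai, aj) for ai in attrs for aj in attrs}
    print(aff[("A1", "A3")])   # 120: Q1 and Q3 reference both A1 and A3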
45
VERTICAL DATA FRAGMENTATION: DESIGN
  • Complexity of vertical fragmentation
  • The total number of possible vertical fragments of a relation with m non-primary-key attributes is the m-th Bell number, B(m)
  • For large m, B(m) approaches m^m
  • m = 10: B(m) ≈ 115,000; m = 15: B(m) ≈ 10^9; m = 30: B(m) ≈ 10^23 (these magnitudes are checked in the sketch after this list)
  • Need heuristics to obtain reasonable solutions
  • Heuristics for vertical fragmentation
  • grouping: assign each attribute to its own fragment, then join fragments until some criterion is satisfied
  • splitting: partition a single relation based on the access behaviour of applications to the attributes
  • Need to determine the affinity of attributes
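
The Bell-number magnitudes quoted above can be checked with a short sketch using the Bell triangle recurrence:

    # Bell numbers via the Bell triangle, checking the magnitudes above.
    def bell(m):
        row = [1]
        for _ in range(m - 1):
            nxt = [row[-1]]            # each row starts with the last
            for x in row:              # entry of the previous row
                nxt.append(nxt[-1] + x)
            row = nxt
        return row[-1]

    print(bell(10))   # 115975        (~115,000)
    print(bell(15))   # 1382958545    (~10^9)
    print(bell(30))   # roughly 8.5e23 (~10^23)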

46
VERTICAL DATA FRAGMENTATION: NOTES
  • splitting fits more naturally with a top-down
    methodology
  • the optimal solution is probably closer to the
    full relation than to a set of single-attribute
    fragments
  • splitting also generates non-overlapping
    fragments whereas grouping produces overlapping
    fragments
  • we prefer designs in which the primary key is
    part of each fragment to enable semantic
    integrity to be more easily enforced

47
HYBRID (MIXED) DATA FRAGMENTATION
Each fragment, Ri, is a combination of a selection using a predicate Pi and a projection on a relation R: Ri = ΠAi(σPi(R)).
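
A one-function Python sketch combining the two operations; as in the earlier sketches, the EMP relation is an illustrative assumption.

    # A hybrid fragment: a selection (predicate Pi) composed with a
    # projection, keeping the primary key so R stays reconstructable.
    def hybrid_fragment(relation, predicate, attrs, key="ENO"):
        return [{a: t[a] for a in [key] + attrs}
                for t in relation if predicate(t)]

    EMP = [{"ENO": 1, "ENAME": "Kim", "SAL": 25000},
           {"ENO": 2, "ENAME": "Lee", "SAL": 42000}]
    print(hybrid_fragment(EMP, lambda t: t["SAL"] > 30000, ["ENAME"]))
    # [{'ENO': 2, 'ENAME': 'Lee'}]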
48
HYBRID (MIXED) DATA FRAGMENTATION (contd)
49
CORRECTNESS RULES FOR FRAGMENTATION
  • Completeness
  • If a relation R is decomposed into fragments R1, R2, ..., Rn, each tuple/attribute that can be found in R can also be found in one or more of the Ri
  • Reconstruction
  • If a relation R is decomposed into fragments R1, R2, ..., Rn, it should be possible to define a relational operator ∇ such that R = ∇ Ri, ∀ Ri ∈ FR
  • Disjointness
  • If a relation R is horizontally decomposed into fragments R1, R2, ..., Rn and data item di is in Rj, then di is not in any other fragment Rk (k ≠ j)
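
The three rules translate directly into checks over the list-of-dicts representation used in the earlier sketches; union is assumed here as the reconstruction operator, which is correct for horizontal fragments only.

    # Checks of the three correctness rules for a horizontal fragmentation.
    def is_complete(relation, fragments):
        # Every tuple of R appears in at least one fragment.
        return all(any(t in f for f in fragments) for t in relation)

    def reconstructs(relation, fragments):
        # For horizontal fragments the operator is union: R = U Ri.
        key = lambda t: sorted(t.items())
        return sorted(sum(fragments, []), key=key) == sorted(relation, key=key)

    def is_disjoint(fragments):
        # No tuple may appear in more than one fragment.
        seen = []
        for f in fragments:
            for t in f:
                if t in seen:
                    return False
                seen.append(t)
        return True

    R  = [{"ENO": 1}, {"ENO": 2}]
    Fs = [[{"ENO": 1}], [{"ENO": 2}]]
    print(is_complete(R, Fs), reconstructs(R, Fs), is_disjoint(Fs))  # True True True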

50
DATA REPLICATION
  • Replication improves the availability of the
    database, but increases the cost of updating data
    and requires more complicated concurrency control
  • No replication
  • Each fragment is assigned to only one site
  • Fully replicated
  • Allocate a full copy of the database to each site
  • Improves retrieval performance, but impairs
    update performance
  • Selective replication
  • Replicate some fragments at one or more sites
  • Choice of sites and degree of replication depend
    on performance and availability goals and on the
    types and frequencies of transactions submitted
    at each site

51
DATA ALLOCATION
  • Given a set of fragments F = {F1, F2, ..., Fn}
  • a set of sites S = {S1, S2, ..., Sm}
  • a set of transactions T = {T1, T2, ..., Tp}
  • Find an optimal distribution of F to S
  • 1. Minimize cost
  • cost of storing each Fi at Sj
  • cost of querying Fi at Sj
  • cost of updating Fi at all sites where it is stored
  • cost of data communication
  • 2. Maximize performance
  • minimize response time at each site
  • maximize system throughput at each site
  • This problem is NP-hard!
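
To make the cost side concrete, here is a brute-force sketch for a toy nonredundant instance (one copy per fragment). Every cost figure is invented, and exhaustive search is only feasible because the instance is tiny; larger instances are exactly where the NP-hardness bites.

    from itertools import product

    # Toy instance: 2 fragments, 2 sites; every number is illustrative.
    fragments = ["F1", "F2"]
    sites     = ["S1", "S2"]

    storage = {("F1", "S1"): 1, ("F1", "S2"): 2,
               ("F2", "S1"): 2, ("F2", "S2"): 1}
    # query_freq[(f, s)]: how often site s queries fragment f
    query_freq = {("F1", "S1"): 10, ("F1", "S2"): 2,
                  ("F2", "S1"): 1,  ("F2", "S2"): 8}
    remote_cost = 5   # communication cost per remote access; local is free

    def cost(assignment):   # assignment: fragment -> site (nonredundant)
        c = sum(storage[f, assignment[f]] for f in fragments)
        c += sum(remote_cost * n for (f, s), n in query_freq.items()
                 if assignment[f] != s)
        return c

    # Exhaustive search over all |S|**|F| placements.
    best = min((dict(zip(fragments, p))
                for p in product(sites, repeat=len(fragments))), key=cost)
    print(best, cost(best))   # {'F1': 'S1', 'F2': 'S2'} 17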

52
DATA ALLOCATION (contd)
  • Notes
  • the placement of one fragment usually has an
    impact on the placement of the other fragments
  • should take into account relationships between
    fragments
  • access to data by applications is modeled very
    simply
  • should take into account the relationship between data allocation and query processing
  • most models do not consider the cost of integrity
    enforcement
  • most models do not consider the cost of
    concurrency control

many simplifying assumptions are made to make the problem tractable
53
INFORMATION REQUIREMENTS: DATA ALLOCATION
  • 1. Database information
  • global schema and fragmentation schema
  • fragment selectivity for query qi
  • size of a fragment
  • 2. Application information
  • A set of user transactions and their frequencies
  • Read and write accesses: which sites can update and which can query
  • Recovery: estimated frequency and volume of backup operations
  • Integrity: referential integrity, journaling overhead
  • 3. Network information
  • Network topology, network channel capacities and
    network control mechanism
  • communication cost between sites
  • 4. Site information
  • The site locations and their processing capacity (CPU, I/O)
  • Sources of data (where data can be located) and
    sinks of data (where user transactions can be
    initiated and data transferred)
  • The transaction processing options and
    synchronization algorithms
  • The unit cost for data storage, and local site
    processing

Minimize C = Ccommunication + Cprocessing + Cstorage
54
THE NONREDUNDANT BEST FIT METHOD
Place fragment Ri at site Sj where the number of
local query and update references by all
transactions is maximized
  • determines the single site to which to allocate a fragment based on maximum benefit (i.e., the most total query and update references)
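
As a sketch, the rule is a one-liner over a table of per-site reference counts; the counts below are illustrative, not those of the worked example on the following slides.

    # Nonredundant best fit: allocate the fragment to the single site
    # with the most local query + update references.
    def best_fit(refs):
        # refs: site -> total query and update references to the fragment
        return max(refs, key=refs.get)

    refs_R1 = {"S1": 110, "S2": 60, "S3": 40}
    print(best_fit(refs_R1))   # 'S1'; ties would need a tie-breaking rule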

55
EXAMPLE: SYSTEM INFORMATION
  • Global schema and fragmentation schema

Transactions and their frequencies
Security: no restrictions. Data sources: all sites. Data sinks: all sites.
56
LOCAL REFERENCE COMPUTATION
57
EXAMPLE: ALLOCATION DECISION
58
THE REDUNDANT ALL BENEFICIAL SITES METHOD
  • Allocate a fragment Fi to a site Sj if the
    benefit of doing so exceeds the cost

Cost = (time to do all local updates × frequency) + (time to do all remote updates × frequency)
Benefit = (time saved on local queries × frequency)
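
A minimal sketch of the decision rule, assuming the per-site benefit and cost totals have already been computed as above; the figures are illustrative, not those of the worked example that follows.

    # All beneficial sites: store a copy of fragment Fi at every site
    # where the benefit exceeds the cost.
    def beneficial_sites(benefit, cost):
        return [s for s in benefit if benefit[s] > cost[s]]

    # benefit[s]: local-query time saved per period if Fi is stored at s
    benefit = {"S1": 400, "S2": 120, "S3": 90}
    # cost[s]: local + remote update time per period for a copy at s
    cost = {"S1": 150, "S2": 150, "S3": 150}

    print(beneficial_sites(benefit, cost))   # ['S1']: only S1 gets a copy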
59
EXAMPLE: COST COMPUTATION
60
EXAMPLE: BENEFIT COMPUTATION
61
EXAMPLE: ALLOCATION DECISION
62
PRACTICAL CONSIDERATIONS
  • Allocating a fragment to a site changes the cost-benefit outcome; this should be taken into account when calculating the cost-benefit for other fragments
  • Our examples use averages for query/update times to simplify the calculations; in real environments, actual I/O times and network delay times may be available for individual transactions
  • No matter how detailed, all calculations in the
    end are just estimates
  • We just need to be in the right ballpark, not
    the exact seat!