Title: DISTRIBUTED DATABASE DESIGN
1 COMP 332: PRINCIPLES OF DATABASE DESIGN
- DISTRIBUTED DATABASE DESIGN
2 DISTRIBUTED DATABASE DESIGN: OUTLINE
- Distributed Database Management Systems
- Concepts
- Benefits
- Distributed architecture alternatives
- Dimensions of data distribution
- Design process and strategies
- Data Fragmentation
- Horizontal: primary
- Horizontal: derived
- Vertical
- Hybrid (mixed)
- Data Allocation
- Problem specification
- Nonredundant best fit method
- Redundant all beneficial sites method
- Practical Considerations
3 CONCEPTS
- Distributed Computing System
- A number of autonomous processing elements, not necessarily homogeneous, that are interconnected by a computer network and that cooperate in performing their assigned tasks
- Distributed Database
- A collection of multiple, logically interrelated databases, distributed over a computer network
- Distributed Database Management System
- (see next slide)
4 DISTRIBUTED DATABASE MANAGEMENT SYSTEM
- A software system that supports the transparent
creation, access, and manipulation of
interrelated data located at different sites of a
computer network
5 DDBMS (cont'd)
- The union of two opposing technologies
- database technology (centralize) and network technology (distribute)
- Differences in database design philosophy
- Centralized: achieve efficiency through local optimization, using complex physical data structures
- Distributed: achieve efficiency through global optimization of processing, including the cost of network communication and local processing
6 WHY DISTRIBUTED?
- Goal: improve the accessibility, sharability, and performance of a DBMS while preserving the appearance of a centralized DBMS
- Each site
- has autonomous processing capability and can perform local applications
- participates in the execution of at least one global application
- Corresponds to the organizational structure of distributed enterprises
- physically distributed companies, new trends for business amalgamation
- electronic commerce
- manufacturing control systems
7 WHAT IS BEING DISTRIBUTED?
- Processing logic
- Inventory
- Personnel
- Sales
- Function
- Printing
- Email
- Data
- Computation
Distributed computing is not: a multi-processor system, a backend processor, or parallel computing
8 COMPARE CENTRALIZED DBMS ARCHITECTURE
(Figure: Centralized DBMS Architecture)
9 VERSUS NETWORKED ARCHITECTURE
(Figure: Networked Architecture with Centralized Database)
10 VERSUS DISTRIBUTED DBMS ARCHITECTURE
(Figure: Distributed DBMS Architecture)
11 BENEFITS OF DDBMSs: TRANSPARENCY
- Separation of high-level semantics from low-level implementation issues
- Extension of the data independence concept in centralized databases
- Basic Concepts
- fragmentation
(Figure: data fragmented across Shanghai and Taipei sites)
12 TRANSPARENCY TYPES IN DDBMSs
- Network
- Replication
- Fragmentation
13 TRANSPARENCY TYPES IN DDBMSs (cont'd)
- Network
- Protect the user from the operational details of the network
- Users do not have to specify where the data is located
- location transparency, naming transparency
- Replication
- Replicas (copies) are created for performance and reliability reasons
- Replication causes difficult update problems
- Users should be made unaware of the existence of these copies
- Fragmentation
- Basic fragments are parts of a relation
- vertical: subset of columns; horizontal: subset of rows
- Fragments are also created for performance and reliability reasons
- Users should be made unaware of the existence of fragments
14 BENEFITS OF DDBMSs: RELIABILITY
- Replicated components (data and software) eliminate single points of failure
- Failure of a single site or communication link should not bring down the entire system
- Managing the reachable data requires distributed transaction support, which provides correctness in the presence of failures
15 BENEFITS OF DDBMSs: PERFORMANCE
- Greater throughput due to
- Data Localization
- Reduces contention for CPU and I/O services
- Reduces communication overhead
- Parallelism
- Inter-query and intra-query parallelism
- The Multiplex approach
16 BENEFITS OF DDBMSs: EXPANSION
- Expansion by adding processing and storage power to the network (new nodes)
- Adding more PCs to the network is more cost-effective than replacing a mainframe
17 DDBMS ARCHITECTURAL ALTERNATIVES
- Specifies how the components of a DBMS are distributed among the network sites and the form of the DBMS at each site
- Autonomy: the degree to which each DBMS can operate independently (distribution of control, not data)
- A0: tight integration (i.e., one DBMS controls all databases)
- A1: semi-autonomous (i.e., each DBMS controls its own database)
- A2: total isolation/fully autonomous (i.e., multiple, independent DBMSs among which there is no communication or data sharing)
- Distribution: describes where the data is physically located
- D0: non-distributed (i.e., at one site)
- D1: client-server (i.e., multiple locations, but dependent on each other)
- D2: peer-to-peer (i.e., multiple locations, but independent of each other)
- Heterogeneity: indicates the uniformity of the DBMSs with respect to data model, query language, interfaces, etc.
- H0: homogeneous (i.e., all are the same)
- H1: heterogeneous (i.e., all are different)
18 DDBMS ARCHITECTURAL ALTERNATIVES (cont'd)
- Autonomy (A)
- A0: Tight integration
- A1: Semi-autonomous
- A2: Total isolation
- Distribution (D)
- D0: Non-distributed
- D1: Client-server
- D2: Peer-to-peer
- Heterogeneity (H)
- H0: Homogeneous
- H1: Heterogeneous
3 × 3 × 2 = 18 alternatives. Some alternatives are meaningless or not practical!
19 DDBMS ARCHITECTURAL ALTERNATIVES (cont'd)
(Figure: the alternatives plotted along the AUTONOMY, DISTRIBUTION, and HETEROGENEITY axes)
- Composite DBMS (A0, D0, H0)
- Homogeneous DDBMS (A0, D2, H0)
- Heterogeneous DDBMS (A0, D2, H1)
- Heterogeneous federated DBMS (A1, D0, H1)
- Homogeneous federated DDBMS (A1, D2, H0)
- Heterogeneous federated DDBMS (A1, D2, H1)
- Multidatabase DBMS (A2, D0, H0)
- Heterogeneous multidatabase DBMS (A2, D0, H1)
- Homogeneous multidatabase DDBMS (A2, D2, H0)
- Heterogeneous multidatabase DDBMS (A2, D2, H1)
20 DDBMS ARCHITECTURAL ALTERNATIVES (cont'd)
- (A0, D0, H0): A collection of logically integrated DBMSs on the same site, also called composite systems
- (A0, D0, H1): A system providing integrated access to heterogeneous DBMSs on a single machine
- (A0, D1, H0): Client-server distribution of a single DBMS
- (A0, D2, H0): Fully distributed
- (A1, D0, H0): Semi-autonomous systems, also called federated systems; each DBMS knows how to participate in the federation
21 DDBMS ARCHITECTURAL ALTERNATIVES (cont'd)
- (A1, D0, H1): Heterogeneous federated DBMS
- (A1, D1, H1): Heterogeneous federated distributed DBMS
- (A2, D0, H0): Multidatabase DBMS; complete homogeneity in component systems is unlikely
- (A2, D0, H1): Heterogeneous multidatabase DBMS; similar to (A1, D0, H1), but with full autonomy
- (A2, D1, H1): Heterogeneous multidatabase distributed DBMS
- (A2, D2, H1)
22 MAJOR DBMS ARCHITECTURES: CLIENT-SERVER
- (Ax, D1, Hy)
- Distribute the functionality between client and server to better manage the complexity of the DBMS
- Two-level architecture
- Typical scenario:
- 1. The client parses a query, decomposes it into independent site queries, and sends each to an appropriate server.
- 2. Each server processes its local query and sends the result relation to the client.
- 3. The client combines the results of all the sub-queries.
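A minimal sketch of the three-step scenario above (the helper names and query strings are hypothetical, not part of any real API):

```python
# Hypothetical sketch of the client-server query flow described above.
def send_to_site(site, subquery):
    # In a real system this would ship the sub-query over the network and the
    # server would execute it locally; here we just simulate a result set.
    return {(site, subquery, "result row")}

def run_distributed_query(site_queries):
    """site_queries maps a site name to the sub-query it should run
    (step 1, decomposition, is assumed to have produced this mapping)."""
    partial_results = []
    for site, subquery in site_queries.items():
        # Step 2: each server processes its local sub-query.
        partial_results.append(send_to_site(site, subquery))
    # Step 3: the client combines the sub-query results (a union here).
    combined = set()
    for rows in partial_results:
        combined |= rows
    return combined

print(run_distributed_query({"Shanghai": "SELECT ... WHERE loc = 'SH'",
                             "Taipei": "SELECT ... WHERE loc = 'TP'"}))
```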
23 MAJOR DBMS ARCHITECTURES: PEER-TO-PEER
- (A0, D2, H0)
- Global users submit requests via an external schema (ES) defined over a global conceptual schema (GCS), which is system-maintained and is the union of all local conceptual schemas (LCSs)
- The global system has no control over local data
and processing
24 MAJOR DBMS ARCHITECTURES: MULTI-DATABASE
- (A2, Dx, Hy)
- The global conceptual schema (GCS) exists only as a union of some of the local conceptual schemas (LCSs), or does not exist at all!
25 DISTRIBUTED DATABASE DESIGN ISSUES
- network design
- DBMS software distribution
- application program distribution
- data distribution
- level of sharing
- access pattern behaviour
- level of knowledge (of access pattern behaviour)
26 DIMENSIONS OF DATA DISTRIBUTION
(Figure: the three dimensions shown as axes)
- Level of sharing: no sharing, data, data + program
- Access pattern behaviour: static, dynamic
- Level of knowledge: partial information, complete information
27 DIMENSIONS OF DATA DISTRIBUTION
- The distribution of data in distributed systems can be viewed along three orthogonal dimensions
- Level of sharing
- no sharing: each application and its data execute at one site
- data sharing: programs are replicated at each site; data is not replicated, but moved to a site as needed
- data + program sharing: both data and programs may be moved to a site as needed
- Access pattern behaviour
- static: access patterns to data do not change over time
- dynamic: access patterns to data change over time (how dynamic?)
- Level of knowledge (of access pattern behaviour)
- no knowledge: not a likely scenario
- partial knowledge: access patterns can be predicted, but users may deviate significantly
- complete knowledge: access patterns can be predicted, with no major changes
28 DISTRIBUTED DATABASE DESIGN OBJECTIVES
- Separation of data fragmentation from data allocation
- data fragmentation: a logical mapping
- data allocation: a physical mapping
- However, to determine an optimal strategy we need to consider the two problems together
- Control of data redundancy
- how much data duplication do we want to allow in the design?
- tradeoff: retrieval versus update
- Independence from local DBMSs
- we do not want the design to depend on the specific properties of any one DBMS
- easier to design and to modify the design
29 DISTRIBUTED DATABASE DESIGN STRATEGIES
- Top-down
- Follows the approach discussed in class so far
- Additional information regarding
- the distribution of accesses among sites
- the nature of access to the database at each site
- needs to be gathered during Requirements Analysis
- Design local conceptual schemas by distributing the entities over the sites of the distributed system after conceptual schema design
- Distribution activity consists of
- data fragmentation: split the schema into pieces
- data allocation: assign schema pieces to sites
- Bottom-up
- Necessary when databases already exist and we need to integrate them into one database
- Similar to view integration, but the databases may be heterogeneous
30 TOP-DOWN DATABASE DESIGN PROCESS
(Figure: the design phases and the artifacts produced at each step)
- Phase 1: Requirements Analysis (data requirements → requirements specification)
- Phase 2: Conceptual Design (conceptual/external schema design → conceptual schema)
- Phase 3: Choice of DBMS/PL (distributed database schema design → conceptual distributed schema)
- Phase 4: Logical Design (logical/external schema design → logical schema)
- Phase 5: Physical Design (internal schema design for each site → physical schema)
- Phase 6: Implementation (DDL statements, SDL statements)
31 DISTRIBUTED DATABASE DESIGN ISSUES
- Fragmentation
- Reasons
- Types
- Degree
- Correctness
- Allocation
- Partitioned
- Partially Replicated
- Fully Replicated
- Information required: database information, communication information, application information, computer system information
32 WHY DATA FRAGMENTATION?
- A relation is not an appropriate unit of distribution
- Application views are usually subsets of relations
- Application access is local to subsets of relations
- Applications at different sites may require different parts of the same relation
- Store once: high remote access costs
- Duplicate: high update costs
- Concurrency of access is more limited
- BUT, some applications may suffer from data fragmentation
- Their required data is located in two or more fragments
- It may be harder to check integrity constraints (e.g., FDs)
- Careful design of data fragmentation is required
33 TYPES OF FRAGMENTATION
34 HORIZONTAL DATA FRAGMENTATION
- Partitions a relation along its tuples so that each fragment has a subset of the tuples of the relation
- Types of horizontal fragmentation
- Primary: based on a predicate Pi that selects tuples from a relation R
- Derived: based on the partitioning of a relation due to predicates defined on another relation
- related according to foreign keys
35 HORIZONTAL DATA FRAGMENTATION: PRIMARY
- Each fragment, Ri, is a selection on a relation R using a predicate Pi
- R is reconstructed by taking the union of all Ri
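Written in relational algebra, the two bullets above read (σ denotes selection):

```latex
R_i = \sigma_{P_i}(R), \quad 1 \le i \le n,
\qquad\text{and}\qquad
R = R_1 \cup R_2 \cup \dots \cup R_n
```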
37 HORIZONTAL DATA FRAGMENTATION: PRIMARY
(Figure: an original relation and its primary fragments)
38 HORIZONTAL DATA FRAGMENTATION: DERIVED
(Figure: an original relation and the existing primary fragments of its owner relation)
39 HORIZONTAL DATA FRAGMENTATION: DERIVED
(Figure: derived fragments produced from the existing primary fragments)
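A minimal sketch of primary and derived fragmentation from the slides above, assuming hypothetical DEPT (owner) and EMP (member) relations represented as lists of dicts; the EMP fragments are derived by following the dno foreign key into the DEPT fragments (a semijoin):

```python
# Hypothetical owner (DEPT) and member (EMP) relations.
DEPT = [{"dno": 1, "loc": "Shanghai"}, {"dno": 2, "loc": "Taipei"}]
EMP = [{"eno": 10, "dno": 1, "sal": 45000},
       {"eno": 11, "dno": 2, "sal": 28000},
       {"eno": 12, "dno": 2, "sal": 31000}]

# Primary horizontal fragmentation: each fragment is a selection on DEPT.
predicates = {
    "DEPT_SH": lambda t: t["loc"] == "Shanghai",
    "DEPT_TP": lambda t: t["loc"] == "Taipei",
}
dept_frags = {name: [t for t in DEPT if p(t)] for name, p in predicates.items()}

# Derived horizontal fragmentation: EMP is partitioned according to the
# fragmentation of its owner DEPT, via the dno foreign key (a semijoin).
emp_frags = {
    name: [e for e in EMP if any(e["dno"] == d["dno"] for d in frag)]
    for name, frag in dept_frags.items()
}

# Reconstruction: the union of the fragments gives back the original relation.
assert sorted(sum(dept_frags.values(), []), key=lambda t: t["dno"]) == DEPT
assert sorted(sum(emp_frags.values(), []), key=lambda t: t["eno"]) == EMP
```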
40 HORIZONTAL FRAGMENTATION: INFORMATION REQUIREMENTS
- Database Information
- Relations in the database and the relationships between them
- primary: a selection operation on the owner; derived: defined on the member according to the selection operation on the owner
- Application Information
- User query predicates: examine the most important applications (80/20 rule)
- simple: p1: DEPT = "CSEE", p2: SAL > 30000
- conjunctive (minterm) predicate: m1 = p1 ∧ p2 (e.g., (DEPT = "CSEE") AND (SAL > 30000))
- minterm selectivity: the number of tuples returned for a given minterm
- access frequency: the access frequency of user queries and/or minterms
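To make the minterm idea concrete: with the two simple predicates above (assumed here to range over a hypothetical EMP relation), the candidate minterms are all conjunctions of the predicates and their negations; each minterm mi defines one candidate primary fragment, and its selectivity is the number of tuples that fragment returns.

```latex
\begin{aligned}
m_1 &= (\mathrm{DEPT} = \text{CSEE}) \wedge (\mathrm{SAL} > 30000)\\
m_2 &= (\mathrm{DEPT} = \text{CSEE}) \wedge \neg(\mathrm{SAL} > 30000)\\
m_3 &= \neg(\mathrm{DEPT} = \text{CSEE}) \wedge (\mathrm{SAL} > 30000)\\
m_4 &= \neg(\mathrm{DEPT} = \text{CSEE}) \wedge \neg(\mathrm{SAL} > 30000)
\end{aligned}
```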
41 VERTICAL DATA FRAGMENTATION
- Produces fragments R1, R2, R3, ..., Rn of a relation R
- Each fragment contains a subset of R's attributes as well as the primary key of R
- Divides relations vertically by columns (attributes)
- The objective is to obtain fragments so that applications only need to access one fragment
- We want to minimize the execution time of applications
- Inherently more complicated than horizontal data fragmentation due to the total number of alternatives available
42 VERTICAL DATA FRAGMENTATION
- Each fragment, Ri, is a projection on a relation R
- R is reconstructed by applying either outer union or full outer join to all Ri
43 VERTICAL DATA FRAGMENTATION (cont'd)
(Figure: example of vertical fragmentation)
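A minimal sketch of vertical fragmentation and reconstruction, assuming a hypothetical EMP relation and two attribute groups that both keep the primary key eno; reconstruction joins the fragments on that shared key:

```python
# Hypothetical relation EMP with primary key "eno".
EMP = [{"eno": 10, "name": "Chan", "dept": "CSEE", "sal": 45000},
       {"eno": 11, "name": "Lee", "dept": "MATH", "sal": 28000}]

def project(relation, attrs):
    """Vertical fragment: project onto the listed attributes (key included in the list)."""
    return [{a: t[a] for a in attrs} for t in relation]

# Each fragment is a projection that includes the primary key.
emp_admin = project(EMP, ["eno", "name", "dept"])   # R1
emp_pay = project(EMP, ["eno", "sal"])              # R2

def join_on_key(r1, r2, key):
    """Reconstruct the relation by joining the fragments on the shared key."""
    index = {t[key]: t for t in r2}
    return [{**t, **index[t[key]]} for t in r1 if t[key] in index]

assert join_on_key(emp_admin, emp_pay, "eno") == EMP
```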
44 VERTICAL FRAGMENTATION: INFORMATION REQUIREMENTS
- 1. Access frequencies: use(Qi, Aj) = 1 if attribute Aj is referenced by query Qi, and 0 otherwise
- 2. Affinity of attributes: the number of accesses to the attribute pair (Ai, Aj) for each execution of application Ql at site Sk, weighted by the application's access frequency
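One common way to write the affinity measure described in point 2 (a textbook-style formulation, not spelled out on the slide; acc_l(S_k) denotes the access frequency of application Q_l at site S_k):

```latex
\mathrm{aff}(A_i, A_j) \;=\;
\sum_{Q_l:\ \mathrm{use}(Q_l, A_i) = 1 \,\wedge\, \mathrm{use}(Q_l, A_j) = 1}
\;\sum_{S_k} \mathrm{acc}_l(S_k)
```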
45 VERTICAL DATA FRAGMENTATION: DESIGN
- Complexity of vertical fragmentation
- The total number of possible vertical fragmentations of a relation with m non-primary-key attributes is the m-th Bell number, B(m)
- For large m, B(m) ≈ m^m
- m = 10: B(m) ≈ 115,000
- m = 15: B(m) ≈ 10^9
- m = 30: B(m) ≈ 10^23 (a quick numeric check of these values follows this slide)
- Need heuristics to obtain reasonable solutions
- Heuristics for vertical fragmentation
- grouping: assign each attribute to one fragment and then join fragments until some criterion is satisfied
- splitting: partition a single relation based on the access behaviour of applications to the attributes
- Need to determine the affinity of attributes
46 VERTICAL DATA FRAGMENTATION: NOTES
- Splitting fits more naturally with a top-down methodology
- The optimal solution is probably closer to the full relation than to a set of single-attribute fragments
- Splitting also generates non-overlapping fragments, whereas grouping produces overlapping fragments
- We prefer designs in which the primary key is part of each fragment, to enable semantic integrity to be more easily enforced
47 HYBRID (MIXED) DATA FRAGMENTATION
- Each fragment, Ri, is a combination of a selection using a predicate Pi and a projection on a relation R
48 HYBRID (MIXED) DATA FRAGMENTATION (cont'd)
49 CORRECTNESS RULES FOR FRAGMENTATION
- Completeness
- If a relation R is decomposed into fragments R1, R2, ..., Rn, each tuple/attribute that can be found in R can also be found in one or more of the Ri
- Reconstruction
- If a relation R is decomposed into fragments R1, R2, ..., Rn, it should be possible to define a relational operator ∇ such that R = ∇ Ri, ∀ Ri ∈ FR
- Disjointness
- If a relation R is horizontally decomposed into fragments R1, R2, ..., Rn and data item di is in Rj, then it is not in any other fragment Rk (k ≠ j)
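A small operational check of the three rules for a horizontal fragmentation, reusing the list-of-tuples style of the earlier sketches (the helper name and the example data are illustrative):

```python
def check_horizontal_fragmentation(relation, fragments):
    """Check completeness, reconstruction, and disjointness for horizontal
    fragments, where the relation and each fragment are lists of tuples."""
    all_frag_tuples = [t for frag in fragments for t in frag]
    # Completeness: every tuple of R appears in at least one fragment.
    complete = all(t in all_frag_tuples for t in relation)
    # Reconstruction: the union of the fragments gives back exactly R.
    reconstructed = set(all_frag_tuples) == set(relation)
    # Disjointness: no tuple appears in more than one fragment.
    disjoint = len(all_frag_tuples) == len(set(all_frag_tuples))
    return complete, reconstructed, disjoint

# Example tuples: (eno, dept, sal).
R = [(10, "CSEE", 45000), (11, "MATH", 28000), (12, "CSEE", 31000)]
R1 = [t for t in R if t[1] == "CSEE"]    # predicate P1: DEPT = "CSEE"
R2 = [t for t in R if t[1] != "CSEE"]    # predicate P2: DEPT <> "CSEE"
print(check_horizontal_fragmentation(R, [R1, R2]))   # (True, True, True)
```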
50 DATA REPLICATION
- Replication improves the availability of the database, but increases the cost of updating data and requires more complicated concurrency control
- No replication
- Each fragment is assigned to only one site
- Fully replicated
- Allocate a full copy of the database to each site
- Improves retrieval performance, but impairs update performance
- Selective replication
- Replicate some fragments at one or more sites
- The choice of sites and the degree of replication depend on performance and availability goals and on the types and frequencies of transactions submitted at each site
51 DATA ALLOCATION
- Given a set of fragments F = {F1, F2, ..., Fn}
- a set of sites S = {S1, S2, ..., Sm}
- a set of transactions T = {T1, T2, ..., Tp}
- Find an optimal distribution of F to S
- 1. Minimize cost
- Cost of storing each Fi at Sj
- Cost of querying Fi at Sj
- Cost of updating Fi at all sites where it is stored
- Cost of data communication
- 2. Maximize performance
- Minimize response time at each site
- Maximize system throughput at each site
- This problem is NP-hard!
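One way to state the cost side formally, as a 0/1 assignment problem (an illustrative formulation, not taken from the slides): let x_ij = 1 when fragment Fi is stored at site Sj.

```latex
\min_{x}\;\sum_{i,j} x_{ij}\,\bigl(C_{\mathrm{store}}(F_i,S_j) + C_{\mathrm{query}}(F_i,S_j)
  + C_{\mathrm{update}}(F_i,S_j) + C_{\mathrm{comm}}(F_i,S_j)\bigr)
\quad\text{s.t.}\quad x_{ij}\in\{0,1\},\;\; \sum_{j} x_{ij}\ge 1 \ \ \forall i
```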
52 DATA ALLOCATION (cont'd)
- Notes
- The placement of one fragment usually has an impact on the placement of the other fragments
- should take into account relationships between fragments
- Access to data by applications is modeled very simply
- should take into account relationships between data allocation and query processing
- Most models do not consider the cost of integrity enforcement
- Most models do not consider the cost of concurrency control
- Many simplifying assumptions are made to make the problem tractable
53 INFORMATION REQUIREMENTS: DATA ALLOCATION
- 1. Database information
- global schema and fragmentation schema
- fragment selectivity for query qi
- size of a fragment
- 2. Application information
- a set of user transactions and their frequencies
- read and write accesses: which sites can update and which can query
- recovery: estimated frequency and volume of backup operations
- integrity: referential integrity, journaling overhead
- 3. Network information
- network topology, network channel capacities, and the network control mechanism
- communication cost between sites
- 4. Site information
- the site locations and their processing capacity (CPU and I/O)
- sources of data (where data can be located) and sinks of data (where user transactions can be initiated and data transferred)
- the transaction processing options and synchronization algorithms
- the unit cost for data storage and local site processing
Minimize C = Ccommunication + Cprocessing + Cstorage
54 THE NONREDUNDANT BEST FIT METHOD
- Place fragment Ri at the site Sj where the number of local query and update references by all transactions is maximized
- Determines the single site to which to allocate a fragment based on maximum benefit (most total query and update references)
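A minimal sketch of the rule, assuming a hypothetical table refs[(fragment, site)] of local query and update references already aggregated over all transactions and weighted by their frequencies:

```python
# Hypothetical local-reference counts (the numbers are made up).
refs = {
    ("F1", "S1"): 120, ("F1", "S2"): 40, ("F1", "S3"): 10,
    ("F2", "S1"): 15, ("F2", "S2"): 90, ("F2", "S3"): 95,
}

def best_fit(refs):
    """Nonredundant best fit: allocate each fragment to the single site
    with the most local query and update references."""
    best = {}
    for (frag, site), count in refs.items():
        if frag not in best or count > best[frag][1]:
            best[frag] = (site, count)
    return {frag: site for frag, (site, _) in best.items()}

print(best_fit(refs))   # {'F1': 'S1', 'F2': 'S3'}
```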
55 EXAMPLE: SYSTEM INFORMATION
- Global schema and fragmentation schema
- Transactions and their frequencies
- Security: no restrictions; data sources: all sites; data sinks: all sites
56 LOCAL REFERENCE COMPUTATION
57 EXAMPLE: ALLOCATION DECISION
58 REDUNDANT ALL BENEFICIAL SITES METHOD
- Allocate a fragment Fi to a site Sj if the benefit of doing so exceeds the cost
- Cost = (time to do all local updates × frequency) + (time to do all remote updates × frequency)
- Benefit = (time saved by doing the query locally) × frequency
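A sketch of the rule with hypothetical per-(fragment, site) figures; a copy of Fi is kept at every site Sj where the benefit exceeds the cost (all numbers are illustrative):

```python
# Hypothetical inputs: times in ms, frequencies in executions per period.
stats = {
    ("F1", "S1"): {"local_upd": 5, "remote_upd": 20, "upd_freq": 10,
                   "query_saving": 30, "query_freq": 50},
    ("F1", "S2"): {"local_upd": 5, "remote_upd": 20, "upd_freq": 10,
                   "query_saving": 30, "query_freq": 3},
}

def all_beneficial_sites(stats):
    """Redundant allocation: place a copy of Fi at Sj whenever the local
    query benefit exceeds the update cost that the extra copy causes."""
    allocation = []
    for (frag, site), s in stats.items():
        cost = s["local_upd"] * s["upd_freq"] + s["remote_upd"] * s["upd_freq"]
        benefit = s["query_saving"] * s["query_freq"]
        if benefit > cost:
            allocation.append((frag, site))
    return allocation

print(all_beneficial_sites(stats))   # [('F1', 'S1')]
```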
59 EXAMPLE: COST COMPUTATION
60 EXAMPLE: BENEFIT COMPUTATION
61 EXAMPLE: ALLOCATION DECISION
62 PRACTICAL CONSIDERATIONS
- Allocating a fragment to a site will change the cost-benefit outcome; this should be taken into account when calculating the cost-benefit for other fragments
- Our examples use averages for query/update times to simplify the calculations; in real environments, actual I/O times and network delay times may be available for individual transactions
- No matter how detailed, all calculations in the end are just estimates
- We just need to be in the right ballpark, not the exact seat!