1
Distributed Databases
  • Frank Chen
  • Elise Edson
  • Jian Gao
  • Nathan Waldman
  • Phil Wang
  • Hui Zhang

2
Why Distribute Data?
  • Centralized DBMS vs. DDBMS
  • Several factors have led to the development of
    DDBMS
  • Distributed nature of some database applications
  • Increased reliability and availability
  • Allowing data sharing while maintaining some
    measure of local control
  • Improved performance

3
The Voltron Analogy
4
Additional Functionality
  • Distribution leads to increased complexity in the
    system design and implementation
  • A DDBMS must provide functions in addition to
    those of a centralized DBMS. Some of these are:
  • To keep track of the data distribution and
    replication in the DDBMS catalog.
  • To devise execution strategies for queries and
    transactions that access data from more than one
    site.
  • To decide on which copy of a replicated data item
    to access.
  • To maintain the global conceptual schema of the
    distributed database

5
Date's 12 Rules of a DDBMS
  • 1. Local autonomy
  • 2. No reliance on a central site
  • 3. Continuous operation (Failure independence)
  • 4. Location independence
  • 5. Fragmentation independence
  • 6. Replication independence

6
Date's 12 Rules of a DDBMS
  • 7. Distributed query processing
  • 8. Distributed transaction processing
  • 9. Hardware independence
  • 10. Operating system independence
  • 11. Network independence
  • 12. Database independence

7
Today's Topics
  • Architecture Design
  • Transparency
  • Functionality

8
Architecture Design
9
Introduction
  • Distributed database
  • A logically interrelated collection of shared
    data (and a description of this data) physically
    distributed over a computer network.
  • Distributed DBMS
  • The software system that permits the management
    of the distributed database and makes the
    distribution transparent to users.

10
Overview
11
Distributed DBMS VS. Distributed Processing
(Diagram: two four-site computer networks. In distributed processing, a single database resides at one site and the other sites access it over the network; in a distributed DBMS, databases reside at several of the sites, all connected by the network.)
12
Distributed DBMS VS. Parallel DBMS
  • Parallel DBMS
  • A DBMS running across multiple processors and
    disks that is designed to execute operations in
    parallel, whenever possible, in order to improve
    performance.
  • Parallel DB Architectures
  • Shared Memory
  • Shared Disk
  • Shared Nothing

13
Architectural Dimensions
  • Three dimensions of distributed database systems
  • Autonomy of the individual sites comprising the
    system
  • Distribution of data throughout the system
  • Heterogeneity of hardware, operating system, data
    model and/or DBMS across the various sites

14
Autonomy
  • No Autonomy
  • The various sites/machines are tightly
    integrated and cannot function independently
  • Semi-Autonomous
  • The various sites/machines are able to process
    independently, but their cooperation involves
    some modifications to their otherwise independent
    processing
  • Fully Autonomous
  • The various sites/machines function in complete
    isolation, knowing nothing of and being
    unaffected by other sites processing

15
Distribution
  • No Distribution
  • All of the data is located on one site/machine
  • Semi-Distributed
  • The data is spread across some sites but not all;
    there is a differentiation between sites, with
    some being servers and others clients
  • Fully Distributed
  • There is no distinction between sites/machines,
    at least in the sense that data is spread across
    all of them

16
Heterogeneity
  • Homogeneous
  • All sites use the same DBMS product
  • Are aware of each other and agree to cooperate in
    processing user requests.
  • Each site surrenders part of its autonomy in
    terms of right to change schemas or software
  • Appears to user as a single system

17
Heterogeneity
  • Heterogeneous
  • Different sites may use different DBMS products
  • Need not be based on the same underlying data
    model
  • May be composed of relational, network,
    hierarchical, and object-oriented DBMSs.
  • Sites may not be aware of each other and may
    provide only limited facilities for cooperation
    in transaction processing

18
Interesting Architectures
  • Client/server, with multiple servers
  • No autonomy, semi-distributed, homogeneous
  • Peer to peer distributed database
  • Semi-autonomous, fully distributed, homogeneous
  • Federated (multi-) database
  • Full autonomy, fully distributed, heterogeneous

19
Architectural Dimensions
20
Reference Architecture for a DDBMS
  • DDBMS can be viewed as an extension of the
    ANSI/SPARC 3-schema architecture
  • The schema consists of
  • A set of global external schemas
  • A global conceptual schema
  • A fragmentation schema and allocation schema
  • A set of schemas for each local DBMS

21
Reference Architecture for a DDBMS
(Diagram: global external schemas GES1-GES3 are defined over the global conceptual schema, which maps through the fragmentation schema and allocation schema to each site's local mapping schema, local conceptual schema, and local internal schema over its local database.)
Legend: GES = Global External Schema, GCS = Global Conceptual Schema, FS = Fragmentation Schema, AS = Allocation Schema, LMS = Local Mapping Schema, LCS = Local Conceptual Schema, LIS = Local Internal Schema
22
Reference Architecture for a Federated MDBS
  • From an architectural viewpoint, multi-database
    distributed systems come in two basic varieties
  • Use a global conceptual schema: tightly coupled
  • Don't use a global conceptual schema: loosely
    coupled
  • The former case is a generalization, in a sense,
    of the peer-to-peer system

23
Reference Architecture for a Federated MDBS
(Diagram, tightly coupled case: global external schemas GES1-GES3 are defined over a global conceptual schema (GCS); at each site, local external schemas (LES) sit over that site's ANSI/SPARC local conceptual schema (LCS) and local internal schema (LIS).)
24
Component Architecture for a DDBMS
  • Major Components
  • Local DBMS (LDBMS) component
  • Data communications (DC) component
  • Global system catalog (GSC)
  • Distributed DBMS (DDBMS) component
  • Comparison of strategies for data allocation

25
Distributed Relational Database Design
  • Fragmentation
  • A relation may be divided into a number of
    sub-relations, called fragments, which are then
    distributed.
  • Allocation
  • Each fragment is stored at the site that gives an
    optimal distribution.
  • Replication
  • The DDBMS may maintain a copy of a fragment at
    several different sites.

26
Distributed Relational DB Design (Data Allocation)
  • Strategic Objectives
  • Centralized
  • Fragmented (or Partitioned)
  • Complete Replication
  • Selective Replication

27
Distributed Relational DB Design (Fragmentation)
  • Why Fragment?
  • Usage
  • Efficiency
  • Parallelism
  • Security
  • Primary disadvantages
  • Performance
  • Integrity
  • Rules to follow during fragmentation
  • Completeness
  • Reconstruction
  • Disjointness
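The rules above can be checked mechanically. A minimal SQL sketch, assuming a Staff relation split into two hypothetical horizontal fragments Staff_NY and Staff_LON (names and key column are illustrative):

  -- Completeness/reconstruction: the union of the fragments should contain
  -- every row of the original relation (this query should return no rows)
  SELECT staffNo FROM Staff
  EXCEPT
  (SELECT staffNo FROM Staff_NY
   UNION
   SELECT staffNo FROM Staff_LON);

  -- Disjointness: no row should belong to both horizontal fragments
  -- (this query should also return no rows)
  SELECT staffNo FROM Staff_NY
  INTERSECT
  SELECT staffNo FROM Staff_LON;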

28
Distributed Relational DB Design (Fragmentation)
  • Types of fragmentation
  • Horizontal fragmentation
  • -- Consists of a subset of the tuples of a
    relation
  • Vertical fragmentation
  • -- Consists of a subset of the attributes of a
    relation
  • Mixed fragmentation
  • -- Consists of a horizontal fragment that is
    subsequently vertically fragmented, or a vertical
    fragment that is then horizontally fragmented.
  • Derived horizontal fragmentation
  • -- A horizontal fragment that is based on the
    horizontal fragmentation of a parent relation.

29
Distributed Relational DB Design (Methodology
Summary)
  • Design global relations.
  • Examine the topology of the system.
  • Analyse the most important transactions and
    identify where horizontal or vertical
    fragmentation may be appropriate.
  • Decide the relations not to be fragmented (but to
    be replicated).
  • Examine the relations that are on the one side of
    a relationship and decide a suitable fragmentation
    schema.
  • Check for situations where either vertical or
    mixed fragmentation would be appropriate during
    the previous step.

30
Transparency
31
Transparency
  • Q: Why do we need transparency?

32
Transparency
  • Q: Why do we need transparency?
  • A: Without DDBMS transparency, applications must
    be written using special knowledge about the
    stored data

33
Transparency
  • Distribution Transparency
  • Transaction Transparency
  • Performance Transparency
  • Heterogeneity Transparency

34
Distribution Transparency
  • Idea: Users perceive the DB as a single, unified
    entity
  • Types of Distribution Transparency
  • Fragmentation
  • Location
  • Replication
  • Local-mapping
  • Naming

35
Distribution Transparency: Fragmentation
  • What is a fragment?
  • Horizontal fragments
  • Fragmented by SELECTION
  • Composed by UNION
  • Vertical fragments
  • Fragmented by PROJECTION
  • Composed by JOIN
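A minimal SQL sketch of these two composition rules, using the tblStaff table from the later examples (the fragment names and the salary predicate are illustrative assumptions):

  -- Horizontal fragments: defined by selection, recomposed by union
  CREATE VIEW tblStaff_High AS SELECT * FROM tblStaff WHERE salary > 30000;
  CREATE VIEW tblStaff_Low  AS SELECT * FROM tblStaff WHERE salary <= 30000;
  SELECT * FROM tblStaff_High UNION SELECT * FROM tblStaff_Low;   -- = tblStaff

  -- Vertical fragments: defined by projection (each keeps the key staffNo),
  -- recomposed by join
  CREATE VIEW tblStaff_Pay  AS SELECT staffNo, salary FROM tblStaff;
  CREATE VIEW tblStaff_Name AS SELECT staffNo, name   FROM tblStaff;
  SELECT p.staffNo, n.name, p.salary
  FROM   tblStaff_Pay p JOIN tblStaff_Name n ON p.staffNo = n.staffNo;  -- = tblStaff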

36
Distribution Transparency: Fragmentation
  • Mixed/Hybrid fragments
  • Horizontal → Vertical fragmentation
  • Vertical → Horizontal fragmentation
  • Derived (Horizontal) fragments
  • Fragment i is defined in terms of Fragment j
  • Fragmented by SEMIJOIN (see the SQL sketch below)
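A small SQL sketch of a derived horizontal fragment, assuming a hypothetical child relation tblAssignment(staffNo, project) and a parent fragment tblStaff_NY; the child fragment keeps exactly the rows that semijoin with the parent fragment:

  -- Derived horizontal fragment of tblAssignment, driven by tblStaff_NY
  SELECT a.*
  FROM   tblAssignment a
  WHERE  EXISTS (SELECT 1
                 FROM   tblStaff_NY s
                 WHERE  s.staffNo = a.staffNo);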

37
Fragment Types
38
Distribution Transparency: Fragmentation
  • Idea: Tables can be fragmented to different sites
    without user awareness
  • Example
  • SELECT name
  • FROM tblStaff
  • WHERE salary > 30000

39
Distribution Transparency: Fragmentation Example
User perception: a single Employee relation

  Employee
  EMP  DEPT  SALARY
  E1   DX    45K
  E2   DY    40K
  E3   DZ    50K
  E4   DY    63K
  E5   DZ    40K

Actual storage: two horizontal fragments

  New York fragment
  EMP  DEPT  SALARY
  E1   DX    45K
  E3   DZ    50K
  E5   DZ    40K

  London fragment
  EMP  DEPT  SALARY
  E2   DY    40K
  E4   DY    63K
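The two fragments above could be defined with simple selection predicates; the predicates below are an assumption consistent with the sample rows (the slide does not state them):

  SELECT * FROM Employee WHERE DEPT IN ('DX', 'DZ');  -- New York fragment
  SELECT * FROM Employee WHERE DEPT = 'DY';           -- London fragment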
40
Distribution Transparency: Location
  • Idea: Users don't know where the data is, but must
    specify fragment names
  • Example
  • SELECT name
  • FROM tblStaff21
  • WHERE staffNo IN
  • (SELECT staffNo FROM tblStaff1
  • WHERE salary > 30000)

41
Distribution Transparency: Replication
  • Idea: Objects can be copied, and copies are
    maintained automatically and without user
    knowledge
  • Location Transparency → Replication Transparency

42
Distribution Transparency: Replication Example
User perception: a single Employee relation

  Employee
  EMP  DEPT  SALARY
  E1   DX    45K
  E2   DY    40K
  E3   DZ    50K
  E4   DY    63K
  E5   DZ    40K

Actual storage: each fragment also has a replica maintained at another site

  New York fragment (and its replica)
  EMP  DEPT  SALARY
  E1   DX    45K
  E3   DZ    50K
  E5   DZ    40K

  London fragment (and its replica)
  EMP  DEPT  SALARY
  E2   DY    40K
  E4   DY    63K
43
Distribution Transparency: Local Mapping
  • Idea: The user needs to specify fragment names and
    the location of data items
  • Example
  • SELECT name
  • FROM tblStaff21 @ SITE 3
  • WHERE staffNo IN
  • (SELECT staffNo FROM tblStaff1 @ SITE 9
  • WHERE salary > 30000)

44
Distribution Transparency: Naming
  • Idea: Each attribute in the DDB must have a unique
    name; the DDBMS ensures this by either
  • Creating a central name server
  • Prefixing the attribute with a site identifier
  • Using aliases (the best solution; see the sketch
    below)
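A sketch of the difference, using Oracle-style database links and synonyms purely as an illustration (the object and site names are hypothetical):

  -- Without an alias the user embeds the site identifier in the name:
  --   SELECT name FROM tblStaff@site3 WHERE salary > 30000;
  -- With an alias (synonym), the DDBMS hides the site identifier:
  CREATE SYNONYM Staff FOR tblStaff@site3;
  SELECT name FROM Staff WHERE salary > 30000;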

45
Transaction Transparency
  • Idea: Distributed transactions look like
    single-site transactions
  • Example

46
Transaction Transparency
  • Concurrency Transparency
  • Failure Transparency

47
Transaction Transparency: Concurrency
  • Idea: The DDBMS ensures non-interference among local
    and global transactions
  • Replication complicates concurrency
  • Solution 1: Establishing atomicity
  • Solution 2: Limiting updates
  • Solution 3: Asynchronous updates

48
Transaction Transparency: What if we goof up?
  • Several problems can occur when concurrent
    transactions execute in an uncontrolled manner
  • The lost update problem
  • The temporary update problem
  • The incorrect summary problem

49
Transaction Transparency: Failure
  • Idea: The recovery procedure must handle
  • Loss of a message
  • Failure of a communication link
  • Failure of a site
  • Network partitioning
  • Must ensure all sub-transactions have completed
    successfully before committing the transaction.

50
Transaction Transparency: Commit Protocol
  • Idea: Ensure atomicity of both local and global
    transactions by employing
  • Transaction Manager (Coordinator)
  • Resource Managers (Participants)
  • Two-Phase Commit (2PC)
  • Three-Phase Commit (3PC)

51
Transaction Transparency: Two-Phase Commit (2PC)
52
Transaction Transparency: 2PC Topologies
  • Centralized 2PC
  • Linear 2PC
  • Distributed 2PC

53
Transaction Transparency: 2PC Topology - Centralized
(Diagram. Legend: C = coordinator, Pi = participant, RC = READY_COMMIT, GC = GLOBAL_COMMIT, GA = GLOBAL_ABORT)
54
Transaction Transparency: 2PC Topology - Linear
(Diagram. Legend: C = coordinator, Pi = participant, RC = READY_COMMIT, GC = GLOBAL_COMMIT, GA = GLOBAL_ABORT)
55
Transaction Transparency: 2PC Topology - Distributed
(Diagram. Legend: C = coordinator, Pi = participant)

56
Transaction Transparency: Three-Phase Commit (3PC)
57
Performance Transparency: Query Optimization
  • Idea: A DDBMS performs as if it were a
    centralized DBMS
  • Decisions
  • Which fragment to access
  • Which copy of a fragment to use, if the fragment
    is replicated
  • Which location to use
  • Costs
  • Access time (I/O), Communication time, CPU time

58
Heterogeneity Transparency
  • Idea: It shouldn't matter what DBMS is running at
    each site


(Diagram: sites 1 through N, one running Oracle and another running Sybase, all participating in one distributed database.)
59
But wait, there's a caveat
  • It's generally not a good idea to implement full
    transparency!

60
Functionality
61
Overview
  • Transaction Management
  • Concurrency Control
  • Deadlock Management
  • Database Failure Recovery
  • The X/Open Distributed Transaction Processing
    Model and Replication
  • Query Optimization
  • Mobile Databases

62
Transaction Management
  • Local Database
  • Transaction Manager
  • Scheduler
  • Recovery Manager
  • Buffer Manager
  • Database Site
  • Transaction Coordinator
  • Data Communication Component

63
Concurrency Control
  • Objectives
  • Serializability
  • Locking Protocols
  • Centralized 2PL
  • Primary Copy 2PL
  • Distributed 2PL
  • Timestamp Protocols

64
Concurrency Control
  • Objectives
  • Preserve consistency of data items
  • Enable parallel processing for efficiency
  • Minimize overhead
  • Work efficiently over networks with communication
    delays
  • Reduce constraints on structure of atomic actions

65
Concurrency Control
  • Serializability
  • Simple answer
  • one transaction has full, locked control
  • Better method?
  • Want
  • Parallel processing
  • Maximization of concurrent operations
  • How to implement in Distributed DBMS?

66
Concurrency Control
  • Methodology
  • Locking
  • Timestamping
  • How do we implement locking mechanism?

67
Concurrency Control
  • Centralized 2PL (two phase locking)
  • Single site that maintains all locking
    information
  • One lock manager
  • Advantages
  • Straightforward implementation
  • Disadvantages
  • Bottleneck
  • Lower reliability

68
Concurrency Control
  • Primary Copy 2PL
  • Many sites have lock managers
  • For replicated data, one copy is chosen as the
    primary copy, and the rest are slave copies
  • On update, transaction coordinator determines
    location of primary copy
  • Update is propagated to slave copies
  • Advantages
  • Fixes Centralized 2PL disadvantages
  • Disadvantages
  • Deadlock managing more complex
  • The site holding the primary copy can become a
    bottleneck

69
Concurrency Control
  • Distributed 2PL
  • Every site has a lock manager
  • If data is replicated, ROWA (read-one-write-all)
    control is used: a read can be served from any one
    copy, but a write must lock all copies
  • Advantages
  • Decentralized lock handling
  • Disadvantages
  • Again, deadlock managing more complex
  • High communication costs (all items locked before
    update)

70
Timestamp Protocols
  • Objective
  • Order transactions globally
  • Give older transactions priority in conflict
  • i.e. breaks ties

71
Deadlock Management
  • Centralized Deadlock Detection
  • Hierarchical Deadlock Detection
  • Distributed Deadlock Detection

72
Deadlock Management
  • Centralized Deadlock Detection
  • Single site appointed Deadlock Detection
    Coordinator (DDC)
  • Constructs aggregate wait-for graphs (WFGs) and
    breaks deadlocks by rolling back transactions
  • Advantages
  • Need only send delta of WFGs
  • Disadvantages
  • Single point of failure

73
Wait-For Graph
74
Deadlock Management
  • Hierarchical Deadlock Detection
  • Sites organized into hierarchy
  • Each node is responsible for checking for
    deadlock in child nodes.
  • Advantages
  • Less network communication
  • Disadvantages
  • Much more complex to implement

75
Deadlock Management
  • Distributed Deadlock Detection
  • Each site contains partial deadlock detection
    information
  • Advantages
  • Extremely robust
  • Disadvantages
  • High network traffic

76
Obermarck Detail
  • Transactions are ordered
  • An external (Ex) node is used to represent remote
    sites in the local WFG
  • Upon receiving a deadlock detection message
  • Create the local WFG
  • Resolve local deadlocks
  • For cycles that run from an Ex node through
    local nodes and back out to an Ex node, send the
    local portion of the graph to the other sites to
    resolve

77
Database Failure Recovery
  • Failures in a Distributed Environment
  • How Failures Affect Recovery
  • Distributed Recovery Protocols
  • Network Partitioning
  • Identifying Updates
  • Maintaining Integrity
  • Pessimistic Protocols
  • Optimistic Protocols

78
Failures in a Distributed Environment
  • Distributed DBMS-specific failures
  • Packet loss
  • Communications link failure
  • Site failure
  • Network partitioning
  • Sites split into multiple groups that can
    communicate with each other within a group, but
    not with sites outside it
  • Hard to tell whether the cause is a network
    failure or just a busy site

79
Failure and Recovery
  • Goal
  • Maintain globally atomic transactions
  • Leave operational sites unblocked
  • Upon failure
  • Abort affected transactions
  • Flag site as failed
  • Wait for site to recover
  • Initiate recovery procedure to remove partial
    transactions
  • Update its copy of the database to maintain
    consistency with the overall database.

80
Distributed Recovery Protocols
  • Assumptions
  • Every global transaction has a transaction manager
    (TM) at its site.
  • Two common protocols
  • Two-phase commit (2PC)
  • Three-phase commit (3PC)

81
Network Partitioning
  • Problem with replicated data
  • Classic example: an unsynchronized bank account
  • Constraints on the account (e.g., the balance can't
    go below 0) may be violated
  • Generally not possible to design a non-blocking,
    atomic commit protocol for arbitrary partitioned
    networks

82
Network Partitioning
  • Pessimistic protocols
  • Choose consistency of database over availability.
    Block transactions in a partitioned state if
    consistency cannot be maintained.
  • Optimistic protocols
  • Availability at the expense of consistency.
    Updates are allowed to proceed independently in
    the various partitions.

83
The X/Open Distributed Transaction Processing
Model
  • X/Open is a vendor-neutral, international
    consortium of users, software vendors, and hardware
    vendors whose mission is to cause the creation of a
    viable, global information infrastructure.
  • Goals and Achievements (DTP Standard)
  • Develop APIs for atomic, consistent, isolated,
    and durable transaction processing
  • Created idea of Transaction Manager (TM) and
    Resource Manager (RM) discussed previously

84
The X/Open Distributed Transaction Processing
Model
  • DDBMS Tasks
  • Remote invocation mechanism
  • Provided by Remote Operations Service (ROSE) and
    Remote Procedure Call (RPC)
  • Distributed transaction mechanism
  • Uses Open System Interconnection Transaction
    Processing (OSI-TP) protocol
  • Supports flat, chained, and nested transactions

85
Replication
  • Replication is the process of generating and
    reproducing multiple copies of data at one or
    more sites.
  • Data Ownership
  • Master/Slave Ownership
  • Workflow Ownership
  • Update Anywhere Ownership

86
Master-Slave Ownership
  • Asynchronously replicated data is owned by one
    site and can only be updated by that site.
  • Other sites subscribe to the data owned by the
    publisher site. (publish and subscribe metaphor)

87
Workflow Ownership
  • Similar to Master-Slave Ownership, but allows the
    right to update to move from site to site.
  • Only one site can have the right to update at any
    given moment.

88
Update Anywhere Ownership
  • P2P style functionality
  • Everyone can update
  • Provides local autonomy even if the rest of the
    network is dead.

89
Replication Mechanisms
  • Snapshots
  • Database Triggers
  • Conflict Detection and Resolution

90
Snapshot
  • Allow for asynchronous distribution of changes to
    individual relations, collections of relations,
    views, or fragments of relations according to a
    predefined schedule.
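As a concrete illustration (Oracle-style syntax, used here only as an assumed example), a snapshot can be declared as a materialized view over a remote table with a refresh schedule:

  CREATE MATERIALIZED VIEW StaffSnapshot
    REFRESH FORCE
    START WITH SYSDATE NEXT SYSDATE + 1/24   -- refresh roughly every hour
  AS SELECT staffNo, name, salary
     FROM   tblStaff@hq_site;                -- hq_site is a hypothetical database link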

91
Database Triggers
  • Shifts the responsibility to the user's code
  • Executes whenever an appropriate event occurs
  • Disadvantages
  • Increases overhead
  • Have no knowledge of transactional nature of the
    modification
  • Frequent updates stress the network and
    application
  • Cannot be scheduled
  • Refreshing related relations is complex
  • Difficult to rollback
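A minimal sketch of the idea (Oracle-style trigger syntax over a hypothetical database link backup_site; a real deployment would need error handling and conflict logic):

  CREATE TRIGGER staff_replicate
  AFTER INSERT ON tblStaff
  FOR EACH ROW
  BEGIN
    -- push the new row to the remote copy as part of the user's transaction
    INSERT INTO tblStaff@backup_site (staffNo, name, salary)
    VALUES (:NEW.staffNo, :NEW.name, :NEW.salary);
  END;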

92
Conflict Detection and Resolution
  • Detection
  • Before and after images
  • Resolution
  • Earliest and latest timestamps
  • Site priority
  • Additive and average updates
  • Minimum and maximum values
  • User-defined
  • Hold for manual resolution
  • Resolving duplicate primary key or unique key values
  • Append site name to duplicate value
  • Append sequence to duplicate value
  • Discard duplicate value
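As one example, "latest timestamp wins" resolution can be written as a MERGE statement (MERGE syntax varies by product; the table names and the last_updated column are illustrative assumptions):

  MERGE INTO Staff_local l
  USING Staff_incoming i
    ON (l.staffNo = i.staffNo)
  WHEN MATCHED AND i.last_updated > l.last_updated THEN
    UPDATE SET salary = i.salary,
               last_updated = i.last_updated
  WHEN NOT MATCHED THEN
    INSERT (staffNo, salary, last_updated)
    VALUES (i.staffNo, i.salary, i.last_updated);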

93
Current X/Open DTP and Replication Support
94
Query Optimization
  • Need for speed
  • Distributed Query Transformation
  • Technique: Primary Horizontal Fragmentation
  • Technique: Distributed Joins

95
Need for speed
  • Why worry?
  • For query optimization, you want to avoid disk
    reads, because they are slow
  • WAN, or even LAN, transfer speeds are much slower
    than disk access, so minimizing data shipped over
    the network matters even more

96
Distributed Query Transformation
  • Same idea as presented by the query optimization
    group, but extended in the following ways
  • Once you apply the heuristic processing, take
    data distribution into account
  • Replace global relations at the leaves of the
    tree with their reconstruction algorithms

97
Technique: Primary Horizontal Fragmentation
  • (R1 UNION R2) JOIN R3 = (R1 JOIN R3) UNION (R2 JOIN R3)
  • Allows the join of two relations to be
    implemented as a union of partial joins (which can
    be performed in parallel); see the SQL form below
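In SQL terms, a query against the reconstructed relation (R1 UNION R2) joined with R3 can be rewritten as two partial joins whose results are unioned; R1, R2, R3 and the columns a, b, x are illustrative:

  SELECT R1.a, R3.b
  FROM   R1 JOIN R3 ON R1.x = R3.x
  UNION
  SELECT R2.a, R3.b
  FROM   R2 JOIN R3 ON R2.x = R3.x;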

98
Technique: Distributed Joins
  • Join is very expensive
  • To optimize, replace the join by a combination of
    semijoins. This reduces the size of the operand
    relations.
  • R1 JOIN(x) R2 = (R1 SEMIJOIN(x) R2) JOIN(x) R2

99
Technique: Distributed Joins
  • Evaluation of R1 JOIN(x) R2, with R1 at Site 1 and
    R2 at Site 2
  • R' = PROJECT(x)(R2), at Site 2
  • Transfer R' to Site 1
  • R'' = R1 SEMIJOIN(x) R', at Site 1
  • Transfer R'' to Site 2
  • Result = R'' JOIN(x) R2, at Site 2
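The same evaluation written as SQL executed at each site; CREATE TABLE ... AS SELECT (supported by most products) stands in for shipping the intermediate results, and all names are illustrative:

  -- Step 1, at Site 2: project the join column of R2 and ship it to Site 1
  CREATE TABLE R2_x AS SELECT DISTINCT x FROM R2;

  -- Step 2, at Site 1: the semijoin keeps only the R1 rows that can match
  CREATE TABLE R1_reduced AS
    SELECT * FROM R1 WHERE x IN (SELECT x FROM R2_x);

  -- Step 3, at Site 2: ship R1_reduced back and complete the join
  SELECT *
  FROM   R1_reduced r1 JOIN R2 ON r1.x = R2.x;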

100
Mobile Databases
  • Problem
  • Laptops and PDAs
  • Want database functionality, but have limited
    bandwidth and storage
  • Solution
  • Mobile Databases
  • Portable and physically separate from a
    centralized database server, but capable of
    communicating with that server from remote sites

101
Mobile Databases
  • Advantages
  • Mobile!
  • Poll centralized database for up-to-date data and
    statistics
  • Can do a one-shot update of centralized
    database
  • Disadvantages
  • Typically have pre-defined SQL queries
  • May change in future devices

102
Questions?
103
References Web
  • http://www.csee.umbc.edu/kalpakis/Courses/621-fa03/Lectures/DistributedDeadlocks.pdf
  • http://www.itee.adfa.edu.au/courses/ACSC7360/
  • http://www.postgresql.org/
  • http://www.oracle.com/
  • http://www.microsoft.com/
  • http://www.mysql.com/
  • http://www.cis.usouthal.edu/share/allen_324/Presentr/CH10.ppt
  • http://www.cs.rpi.edu/noel/distr_scaleup/distributed.html
  • http://db.cs.berkeley.edu/jmh/cs262b/ddbms.html
  • http://www.doc.mmu.ac.uk/STAFF/P.Quick/distribu.ppt

104
References Text
  • Database Systems: Design, Implementation, and
    Management, 5th Edition, Rob and Coronel
  • Distributed and Parallel Database Systems, M.
    Tamer Ozsu and Patrick Valduriez
  • Database System Concepts, Silberschatz, Korth, and
    Sudarshan, McGraw-Hill

105
References Review Article
  • Mariposa: A Wide-Area Distributed Database
    System, Michael Stonebraker, Paul M. Aoki, Avi
    Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, and
    Andrew Yu