Distributed Databases DDBs - PowerPoint PPT Presentation

1 / 45
About This Presentation
Title:

Distributed Databases DDBs

Description:

in the prepare state, a transaction can go to the commit or abort state ... Each site writes changes to log file, send ready to commit ... – PowerPoint PPT presentation

Number of Views:154
Avg rating:3.0/5.0
Slides: 46
Provided by: Vrb
Category:

less

Transcript and Presenter's Notes

Title: Distributed Databases DDBs


1
Distributed Databases (DDBs)
  • Chapter 11

2
Distributed Databases
  • Distributed Systems goal
  • to offer local DB autonomy at geographically
    distributed locations
  • Multiple CPU's each has DBMS, but data
    distributed

3
Advantages of DDBs
  • Distributed nature of some DB applications (bank
    branches)
  • Increased reliability and availability if site
    failure - also replicate data at gt 1 site
  • Data sharing but also local control
  • Improved performance - smaller DBs exist at each
    site

4
Disadvantages Increased complexity
  • Additional functions needed
  • global vs. local queries
  • access remote sites, transmit queries and data
  • keep track of data and replication
  • execution strategies if data at gt 1 site
  • which copy to access
  • maintain consistency of copies
  • recover from site crashes

5
Architectures
  • Parallel Systems goal  to construct a faster
    centralized computer
  • Distributed Systems goal  to offer local DB
    autonomy at geographically distributed locations

6
Parallel DBSs
  • Shared-memory multiprocessor - get N times as
    much work with N CPU's access - MIMD, SIMD -
    equal access to same data, massively parallel
  • Parallel shared nothing - data split among CPUs,
    each has own CPU, divide work for transactions,
    communicate over high speed networks
                    LANs - homogeneous machines
                    CPU memory - called a site

7
Distributed DBSs (DDBS)
  • Distributed DB - share nothing
  • lower communication rates, WAN, heterogeneous
    machines
  • Homogeneous DDBS
  • homogeneous same DBMSs
  • Heterogeneous DDBS
  • different DBMSs - need ODBC, standard SQL

8
   Heterogeneous distributed
  • Data distributed and site has own DBMS
  • ORACLE at one site, DB2 at another, etc.
  • need ODBC, standard SQL
  • usually transaction manager responsible for
    cooperation among sites
  • need TP monitor (software) - to tie together
    heterogeneous DBs
  • uses RPCs - supplies needed data conversion and
    to access data at other sites

9
Multidatabase
  • autonomous preexisting DBs form new database
  • collection of cooperating DBSs that are
    heterogeneous
  • Each DB has its own local users, local
    transparency and DBA
  • very high degree of local autonomy
  • need additional interface to allow users to
    access global data

10
Federated DB
  • federated DB is a multidatabase
  • keeps a partial view of total schema
  • Each DB specifies export schema (view)
  • Each DB keeps track of schema can import
  • appears centralized for local autonomous users
  • appears distributed for global users

11
DDBS
  • Issues in DDBS in slides that follow

12
 To process a query
  • Must use data dictionary that includes info on
    data distribution among servers
  • Ensure atomicity
  • Parse user query
  • decomposed into independent site queries
  • each site query sent to appropriate server site
  • site processes local query, sends result to
    result site
  • result site combines results of subqueries

13
Data fragments
  • Can distribute a whole relation at a site
  • or
  • Data fragments
  • logical units of the DB assigned for storage at
    various sites
  • horizontal fragmentation - subset of tuples in
    the relation (select)
  • vertical fragmentation - keeps only certain
    attributes of relation (project) need a PK

14
Fragments contd
  • horizontal fragments
  • complete - set of fragments whose conditions
    include every tuple
  • disjoint - tuples only member of 1 fragment
            salary lt 5000 and dno4
  • Complete vertical fragment        L1 U L2 U ...
    Ln - attributes of R                         Li
    intersect Lj PK(R)

15
Replication
  • Full vs. partial replication
  • Which copy to access
  • Improves performance for global queries but
    updates a problem
  • Ensure consistency of replicated copies of data

16
Example fragmentation/replication
  • Example of fragments for company DB     site 1
    - company headquarters gets entire DB     site
    2, 3 - based on dept. no.

17
Concurrency Control and Recovery in Distributed
Databases
  • Distributed transactions inherit ACID properties
  • problems
  • failure at individual sites and communication
    links
  • distributed deadlock (usually timeout)
  • distributed commit (use 2PC)
  • concurrency control with multiple copies of data

18
Distributed Commit
  • if updating data at different sites for same
    transaction, when to commit?
  • difficult part - commit together so have
    atomicity
  • Observation - 3 states of basic transactions
                active - can enter other 2 states
                committed - cannot leave this state
                aborted - cannot leave this state

19
Commit problems
  • To commit a distributed transaction need a
    coordinator (originator) site (S1 in our example)
  • Given transaction T with with components T1 and
    T2
  • where T1 executes at site S1
  • T2 executes at site S2
  • a)  if S1 commits locally, sends commit to S2
  • BUT if S2 crashes, T2 aborts before completes
    subtask - it is not atomic
  • b)  if both T1 and T2 ready to commit S1 sends
    message to commit
  • BUT if T2 crashes during commit, not atomic if
    cant recover

20
To solve problem
  • How to solve this problem?  Need an additional
    state
  • prepare state
  • when T is done and ready to commit, it goes to
    the prepare state
  • in the prepare state, a transaction can go to
    the commit or abort state
  • (place prepare log on log buffer, forced to log
    file - can reconstruct the transaction)

21
Use Two-Phase Commit (2PC)
  • used by commercial DBS to achieve coordinated
    commit of distributed transactions
  • Need a coordinator - usually a site that is
    participating in the transaction

22
2PC
  • Phase 1
  • When each subquery concludes, signal coordinator
  • Coordinator sends messages to all sites asking
    them to prepare site transactions.
  • Each site writes changes to log file, send ready
    to commit
  • if log fails, send cannot commit, else can
    commit
  • If any site response is "unsuccessful"
    coordinator sends messages to all sites to abort
    - rollback locally

23
2PC
  • Phase 2
  • If all prepare requests successful responses,
    coordinator sends commit
  • Sites record commit in log, update DB
  • subquery concluded    prepare for commit  
    ready to commit

24
Solved by 2PC?
  • a)  if S1 commits locally, sends commit to S2
  • BUT if S2 crashes, T2 aborts and it is not
    atomic
  • With 2PC
  • if T1 completes Transaction
  • T1 is prepared
  • If T2 crashes
  • T2 is unsuccessful
  • T1 can still abort from prepared state

25
Solved by 2PC?
  • If both T1 and T2 ready to commit S1 sends
    message to commit
  • BUT if T2 crashes before commit completed, not
    atomic
  • With 2PC
  • if both T1 and T2 successfully complete
    Transaction
  • T2 and T1 are both prepared
  • If T2 crashes can recover to prepared state and
    commit

26
Any problems with 2PC?
  • Blocking seems to be the problem
  • In most practical applications, blocking caused
    by 2PC are rare
  • Therefore, most systems use 2PC
  • However, when could 2PC block?
  • Coordinator site fails

27
3PC
  • However, when could 2PC block?
  • Coordinator site fails
  • Some participant declared ready to commit
  • This is a problem if site holding a lock.
  • Must wait (blocked) until coordinator recovers
    (or elect new one)
  • Proposed 3PC for this problem
  • 3PC
  • Requires an extra round of messages
  • Adds another state pre_commit

28
3PC
  • Phase 1 the same as 2PC
  • Coordinator sends vote_req to all sites
  • Sites respond commit or abort
  • If one aborts, all abort
  • Phase 2
  • If all reply vote_commit, coordinator sends
    precommit (prepare_to_commit)
  • Sites that responded commit, wait
  • If receive precommit, send ack
  • Else abort

29
3PC
  • Phase 3
  • Coordinator collects all acks
  • Decides commit
  • Broadcasts global_commit
  • Site receiving global_commit, commits

30
3PC
  • Eliminates blocking due to coordinator failure
    when
  • Site sent vote_commit, but not received
    prepare_to_commit
  • Site received prepare_to_commit but not
    global_commit

31
Quorum based 3PC
  • So what does adding the additional pre-commit
    state do?
  • If in pre-commit state, can elect new coordinator
    if it fails
  • New coordinator collects states from sites and
    tries to resolve transaction
  • Otherwise new coordinator tries to establish
    quorum

32
Quorum based 3PC establishing quorum
  • Every site assigned a vote Vi
  • Total votes V
  • Abort and commit quorum are Va and Vc
  • VaVcgt V
  • Before commit, must obtain commit quorum
  • At least one site in pre-commit and (sites in
    wait sites in pre_commit) form quorum
  • Before abort, must obtain abort quorum
  • (Sites in wait pre-abort) for quorum

33
Distributed concurrency control using locking
  • Centralized vs. distributed solutions

34
Primary site technique
  • Primary site technique
  • all distinguished copies at this site
  • site acts as coordinator for all DB items
  • Read lock from primary site, access local copy
  • if write lock, DBMS must update all copies
  • disadv 
  • all locking requests sent to same site -
    bottleneck
  • if failure at primary site?  must abort and
    restart all Ts

35
Primary site with backup
  • Primary site with backup
  • all lock information at both sites, so failure
    is less of a problem
  • disadv 
  • slower - 2 sites to communicate with

36
Primary copy technique
  • Primary copy technique - distribute
    responsibility
  • distinguished copies of different items at
    different sites
  • (can also choose a new coordinator if a failure
    - election to choose self as new site)

37
Voting
  • Voting - no distinguished copy
  • lock request sent to all sites with copy of data
  • if granted lock by majority of copies, hold lock
    and inform sites with copies lock is granted
  • if don't get a majority, timeout and cancel
    request sent to all sites
  • disadv 
  • lots of message traffic

38
Additional Issues in DDBSs
  • Distributed transaction if data distributed
  • Transaction manager (TM) knows where data is
    stored
  • TM decides which site starts local components of
    transaction
  • Distributed query processing

39
Query Processing
  • Query parallelism
  • If Query (read-only) - no 2PC needed
  • if shared nothing and horizontally fragmented
  • decompose query into subqueries that can be
    processed in parallel
  • one site (result site) accepts all results and
    puts together
  • Must also do order by, count

40
Distributed Query Processing
  • If the data is not fragmented
  • process whatever operations you can locally
    (select, project)
  • Query can involve joins between data at multiple
    sites
  • Must transfer data over network
  • Transfer of data highest cost
  • Algorithms to minimize amount of data transferred
    (NP-hard)

41
Distributed Query Processing
  • Assume R1 X R2 -gt Site 3
  • with R1 at S1 and R2 at S2
  • Can do
  • R1 -gt S3 and R2 -gt S3, do join at S3
  • R1-gt S2, execute join at S2, result to S3
  • R2-gt S1, execute join at S1, result to S3

42
Semi-join X
  • p dname, lname Dept Xmgrssn Emp
  • Assume Dept is R1 and Emp is R2
  • 1) project join attribute of Emp at S2 resulting
    in E', send S1. E' is (p ssn Emp)
  • 2) E' X Dept at S1. Project result attributes
    and join attributes resulting in E'', send to S2
  • E'is (p mgr, dname (Dept Xmgrssn E'))
  • 3)  p dname, lname (E'' Xmgrssn Emp) and send
    result to S3

43
Distributed Query Processing
  • Henver Yao
  • Cost
  • time to compute results
  • time to transmit data
  • Local data requires only processing time
  • Assume transmission time is most costly
  • Goal is to minimize transmission time

44
Distributed Query Processing
  • Assume you have R3R1XattattR2
  • R1 is at S1, R2 at S2 and result relation R3 sent
    to result site S
  • Amount of data to transmit
  • Assume R1 is 1M bytes
  • R2 is 250K bytes
  • R3 is 50k bytes
  • Using semi-join
  • pattR1 is 100K
  • pattR2 is 50K
  • pattR1XR2 is 10K
  • pattR2XR1 is 25K

45
Distributed Query Processing
  • Show all possibilities to previous problem and
    determine best distributed query processing
    strategy
Write a Comment
User Comments (0)
About PowerShow.com