Distributed Databases DDBs - PowerPoint PPT Presentation

1
Distributed Databases (DDBs)
2
Distributed Databases
  • Distributed Systems goal: to offer local DB
    autonomy at geographically distributed locations
  • Multiple CPUs, each with its own DBMS, but data
    is distributed

3
Advantages of DDBs
  • Distributed nature of some DB applications (bank
    branches)
  • Increased reliability and availability on site
    failure - also replicate data at > 1 site
  • Data sharing but also local control
  • Improved performance - smaller DBs exist at each
    site

4
Disadvantages: Increased complexity
  • Additional functions needed
  • global vs. local queries
  • access remote sites, transmit queries and data
  • keep track of data and replication
  • execution strategies if data at > 1 site
  • which copy to access
  • maintain consistency of copies
  • recover from site crashes

5
Architectures
  • Parallel Systems goal: to construct a faster
    centralized computer
  • Distributed Systems goal: to offer local DB
    autonomy at geographically distributed locations

6
Parallel DBSs
  • Shared-memory multiprocessor
  • get N times as much work with N CPUs
  • MIMD, SIMD - equal access to same data, massively
    parallel
  • Parallel shared-nothing
  • data split among CPUs; each node has its own CPU
    and memory (called a site); nodes divide the work
    of transactions and communicate over high-speed
    networks (LANs); homogeneous machines

7
Distributed DBSs (DDBS)
  • Distributed DB - share nothing
  • lower communication rates
  • WAN
  • heterogeneous machines
  • Homogeneous DDBS
  • same DBMS at every site
  • Heterogeneous DDBS
  • different DBMSs - need ODBC, standard SQL

8
   Heterogeneous distributed DBSs (HDDBs)
  • Data distributed and each site has own DBMS
  • ORACLE at one site, DB2 at another, etc.
  • need ODBC, standard SQL
  • usually transaction manager responsible for
    cooperation among sites
  • must coordinate distributed transaction
  • need data conversion and to access data at other
    sites

9
Multidatabase
  • autonomous preexisting DBs form new database
  • collection of cooperating DBSs that are
    heterogeneous
  • Each DB has its own local users, local
    transparency and DBA
  • very high degree of local autonomy
  • need additional interface to allow users to
    access global data

10
Federated DB
  • a federated DB is a multidatabase that bridges
    heterogeneity without being fully integrated
  • keeps a partial view of total schema
  • Each DB specifies import/export schema (view)
  • appears centralized for local autonomous users
  • appears distributed for global users

11
DDBS
  • Issues in DDBS in slides that follow

12
 To process a query
  • Must use data dictionary that includes info on
    data distribution among servers
  • Parse user query
  • decomposed into independent site queries
  • each site query sent to appropriate server site
  • site processes local query, sends result to
    result site
  • result site combines results of subqueries
  • Ensure atomicity

13
Data fragments
  • Can distribute a whole relation at a site
  • or
  • Data fragments
  • logical units of the DB assigned for storage at
    various sites
  • horizontal fragmentation - subset of tuples in
    the relation (select)
  • vertical fragmentation - keeps only certain
    attributes of relation (project); each fragment
    needs the PK

14
Fragments contd
  • horizontal fragments
  • complete - set of fragments whose conditions
    include every tuple
  • disjoint - each tuple a member of only 1 fragment
            e.g., salary < 5000 and dno = 4
  • Complete vertical fragmentation: L1 ∪ L2 ∪ ... ∪
    Ln = attributes of R; Li ∩ Lj = PK(R)
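A minimal Python sketch of horizontal fragmentation and the completeness/disjointness checks (the Employee tuples and predicates are hypothetical, not from the slides):

```python
# Horizontal fragmentation sketch: each fragment is the set of tuples
# satisfying a selection predicate. A fragmentation is complete if every
# tuple lands in some fragment, disjoint if no tuple lands in more than one.

employees = [
    {"ssn": 1, "name": "Ann",  "salary": 4000, "dno": 4},
    {"ssn": 2, "name": "Bob",  "salary": 7000, "dno": 4},
    {"ssn": 3, "name": "Carl", "salary": 6000, "dno": 5},
]

predicates = [
    lambda t: t["salary"] < 5000 and t["dno"] == 4,   # fragment 1
    lambda t: t["salary"] >= 5000 and t["dno"] == 4,  # fragment 2
    lambda t: t["dno"] != 4,                          # fragment 3
]

fragments = [[t for t in employees if p(t)] for p in predicates]

def complete(tuples, frags):
    # every tuple appears in at least one fragment
    return all(any(t in f for f in frags) for t in tuples)

def disjoint(tuples, frags):
    # no tuple appears in more than one fragment
    return all(sum(t in f for f in frags) <= 1 for t in tuples)

print(complete(employees, fragments), disjoint(employees, fragments))  # True True
```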

15
Replication
  • Full vs. partial replication
  • Which copy to access
  • Improves performance for global queries but
    updates a problem
  • Ensure consistency of replicated copies of data

16
Example replication/fragmentation
  • Example of fragments for company DB     site 1
    - company headquarters gets entire DB
        site 2, 3 horizontal fragments based on
    dept. no.

17
Concurrency Control and Recovery in Distributed
Databases
  • Distributed transactions inherit ACID properties
  • problems
  • failure at individual sites and communication
    links
  • distributed deadlock (usually timeout)
  • distributed commit (use 2PC)
  • concurrency control with multiple copies of data

18
Distributed Commit
  • if updating data at different sites for same
    transaction, when to commit?
  • difficult part - commit together so have
    atomicity
  • Observation - 3 states of basic transactions
                active - can enter other 2 states
                committed - cannot leave this state
                aborted - cannot leave this state

19
States
  • (state diagram: Active → commit or abort)

20
Commit problems
  • To commit a distributed transaction need a
    coordinator (originator) site (S1 in our example)
  • Given transaction T with components T1 and T2
  • where T1 executes at site S1
  • T2 executes at site S2
  • a)  if S1 commits locally, sends commit message
    to S2
  • BUT if S2 crashes, T2 aborts before it completes
    its subtask - not atomic
  • b)  if both T1 and T2 ready to commit S1 sends
    message to commit
  • BUT if T2 crashes during commit, not atomic if
    can't recover

21
To solve problem
  • How to solve these problems? 
  • Need an additional state
  • prepare state
  • when T is done and ready to commit, it goes to
    the prepare state
  • in the prepare state, a transaction can go to
    the commit or abort state
  • (place a prepare record in the log buffer, forced
    to the log file - can reconstruct the transaction)

22
States
  • (state diagram: Active → Prepare to commit →
    commit or abort)

23
Use Two-Phase Commit (2PC)
  • Used by commercial DBS to achieve coordinated
    commit of distributed transactions
  • Need a coordinator - usually a site that is
    participating in the transaction
  • Only needed if the transaction performs updates

24
2PC
  • Phase 1
  • When each subquery concludes, signal coordinator
  • If receive done from all, coordinator sends
    prepare to commit to all sites
  • Else coordinator signals abort to all.
  • Each site writes changes and Prepare to its log
    file, sends ready to commit, and enters the
    prepared state
  • if log fails or cannot commit for other reasons,
    send cannot commit, else can commit
  • If the coordinator doesn't hear from everyone
    within a timeout interval, it assumes not OK and
    sends abort to all - sites roll back locally

25
2PC
  • Phase 2
  • If all prepare requests get successful responses,
    coordinator sends commit
  • Sites record Commit in log, update DB
  • If any site response is unsuccessful, coordinator
    sends messages to all sites to abort -
    rollback locally
  • If the coordinator doesn't hear from everyone
    within a timeout interval, it assumes not OK and
    sends abort to all - sites roll back locally
  • Message flow: subquery concluded → ready to
    commit (prepared state) → commit → record commit
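The two phases above can be sketched as a toy Python protocol (hypothetical class and method names, not a real DBMS implementation):

```python
# Toy two-phase commit: each participant votes in phase 1; the coordinator
# commits only if every vote is "ready", otherwise it tells everyone to abort.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):            # phase 1: force a prepare record, then vote
        if self.can_commit:
            self.state = "prepared"
            return "ready"
        self.state = "aborted"
        return "cannot-commit"

    def finish(self, decision):   # phase 2: act on the coordinator's decision
        if self.state != "aborted":
            self.state = decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]             # phase 1
    decision = "committed" if all(v == "ready" for v in votes) else "aborted"
    for p in participants:                                  # phase 2
        p.finish(decision)
    return decision

print(two_phase_commit([Participant("S1"), Participant("S2")]))  # committed
print(two_phase_commit([Participant("S1"),
                        Participant("S2", can_commit=False)]))   # aborted
```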

26
Solved by 2PC?
  • a)  if S1 commits locally, sends commit to S2
  • BUT if S2 crashes, T2 aborts and it is not
    atomic
  • With 2PC
  • if T1 completes its part of the transaction
  • T1 is prepared
  • If T2 crashes
  • T2 is unsuccessful
  • T1 can still abort from prepared state

27
Solved by 2PC?
  • b) If both T1 and T2 ready to commit, S1 sends
    message to commit
  • BUT if T2 crashes before commit completed, not
    atomic
  • With 2PC
  • if both T1 and T2 successfully complete their
    parts of the transaction
  • T2 and T1 are both prepared
  • If T2 crashes can recover prepared state and
    commit

28
Any problems with 2PC?
  • A blocking protocol (block until commit or
    rollback received)
  • In most practical applications, blocking caused
    by 2PC is rare
  • Therefore, most systems use 2PC
  • However, when could 2PC block?

29
2PC blocking
  • However, when could 2PC block?
  • Problem can occur when site in prepared state
  • Site voted to commit
  • Cannot commit until receive commit from
    coordinator
  • Cannot change vote and abort
  • Participant is blocked until receive message from
    coordinator
  • (or the coordinator can block while waiting for
    replies from participants - what about a timeout?)
  • Solution?

30
Three-phase commit
  • 3PC
  • Requires an extra round of messages
  • Adds another state, pre_commit, between the
    ready-to-commit (prepared) state and the commit
    state
  • When in pre-commit state
  • can still abort
  • can contact other sites

31
3PC
  • Phase 1 the same as 2PC
  • Coordinator sends vote_req to all sites
  • Sites respond commit or abort
  • If one aborts, all abort
  • Phase 2
  • If all reply vote_commit, coordinator sends
    precommit (prepare_to_commit)
  • Sites that responded commit, wait
  • If receive precommit, send ack
  • Else abort

32
3PC
  • Phase 3
  • Coordinator collects all acks
  • Decides commit
  • Broadcasts global_commit
  • Site receiving global_commit, commits
  • Message flow: subquery concluded → ready to
    commit → precommit → commit → record commit
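The three rounds above can be condensed into a toy Python sketch (function and message names are hypothetical):

```python
# Toy three-phase commit decision logic. States move
# active -> prepared -> pre_commit -> committed; a site in pre_commit knows
# every site voted commit, so it can eventually commit without blocking.

def three_phase_commit(votes):
    """votes: per-site 'vote_commit' or 'vote_abort' replies to vote_req."""
    # Phase 1: collect votes; one abort vote aborts everyone
    if any(v == "vote_abort" for v in votes):
        return "global_abort"
    # Phase 2: broadcast precommit and collect acks (assume all sites ack)
    acks = ["ack"] * len(votes)
    # Phase 3: all acks received -> broadcast global_commit
    if all(a == "ack" for a in acks):
        return "global_commit"
    return "global_abort"

print(three_phase_commit(["vote_commit", "vote_commit"]))  # global_commit
print(three_phase_commit(["vote_commit", "vote_abort"]))   # global_abort
```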

33
3PC
  • Eliminates blocking due to coordinator failure
    when
  • Site sent vote_commit, but not received
    prepare_to_commit
  • Site can abort if it does not hear from coordinator
  • Site received prepare_to_commit but not
    global_commit
  • Knows everyone voted to commit, knows it will
    eventually commit, can act independently,
    communicate with other sites

34
3PC
  • If in pre-commit state
  • sites can elect a new coordinator if the current
    one fails
  • New coordinator collects states from sites and
    tries to resolve transaction
  • Otherwise new coordinator tries to establish
    quorum

35
Quorum based 3PC establishing quorum
  • Every site assigned a vote Vi
  • Total votes V
  • Abort and commit quorums are Va and Vc
  • Va + Vc > V
  • Before commit, must obtain commit quorum
  • at least one site in pre_commit, and (sites in
    wait + sites in pre_commit) form the quorum
  • Before abort, must obtain abort quorum
  • (sites in wait + sites in pre_abort) form the
    quorum
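A minimal sketch of the commit-quorum check (vote weights and site states are hypothetical; the constraint Va + Vc > V guarantees commit and abort quorums cannot both be reached):

```python
# Quorum-based 3PC commit check: need at least one site in pre_commit, and
# the votes of sites in wait or pre_commit must reach the commit quorum Vc.

def can_commit(site_votes, site_states, Vc):
    return ("pre_commit" in site_states.values() and
            sum(site_votes[s] for s, st in site_states.items()
                if st in ("wait", "pre_commit")) >= Vc)

site_votes = {"S1": 1, "S2": 1, "S3": 1}   # V = 3
Va, Vc = 2, 2                              # Va + Vc = 4 > 3
states = {"S1": "pre_commit", "S2": "wait", "S3": "wait"}
print(can_commit(site_votes, states, Vc))  # True
```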

36
Distributed concurrency control using locking
  • Centralized vs. distributed solutions
  • Definition: distinguished copy - locks are
    associated with this copy

37
Primary site technique
  • Primary site technique
  • Distinguished copy of all data at this site
  • All locks kept at this site, all requests sent
    here
  • Site acts as coordinator for all DB items
  • If write lock, DBMS must update all copies
  • If read lock from primary site, can access local
    copy
  • disadvantages
  • all locking requests sent to same site -
    bottleneck
  • if failure at primary site, must abort and
    restart all transactions
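A toy Python sketch of a primary-site lock manager (hypothetical class; all lock tables live at one site, which is exactly the bottleneck noted above):

```python
# Primary-site locking sketch: every lock request goes through one manager.
# Shared read locks are compatible; any other combination conflicts.

class PrimarySiteLockManager:
    def __init__(self):
        self.locks = {}   # item -> (mode, set of holding transactions)

    def lock(self, txn, item, mode):
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "read" and held_mode == "read":
            holders.add(txn)          # shared read locks are compatible
            return True
        return False                  # conflict: caller must wait or abort

    def unlock(self, txn, item):
        mode, holders = self.locks[item]
        holders.discard(txn)
        if not holders:
            del self.locks[item]

lm = PrimarySiteLockManager()
print(lm.lock("T1", "x", "read"))    # True
print(lm.lock("T2", "x", "read"))    # True  (shared)
print(lm.lock("T3", "x", "write"))   # False (conflict)
```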

38
Primary site with backup
  • Primary site with backup
  • all lock information at both sites, so failure
    is less of a problem
  • disadvantage
  • slower - 2 sites to communicate with

39
Primary copy technique
  • Primary copy technique - distribute
    responsibility
  • distinguished copies of different items at
    different sites
  • if a failure occurs, a new coordinator can be
    chosen - an election lets a site choose itself as
    the new coordinator site

40
Voting
  • Voting - no distinguished copy
  • lock request sent to all sites with copy of data
  • if granted lock by majority of copies, hold lock
    and inform sites with copies lock is granted
  • if don't get a majority, timeout and cancel
    request sent to all sites
  • disadvantage
  • lots of message traffic
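The majority rule above can be sketched in a few lines of Python (site names are hypothetical):

```python
# Voting-based locking with no distinguished copy: the lock is held only if
# a majority of the sites holding a copy grant the request; otherwise the
# requester times out and sends a cancel to all copy sites.

def request_lock(copy_sites, grants):
    """grants: subset of copy_sites that granted the lock request."""
    if len(grants) > len(copy_sites) / 2:
        return "granted"     # inform all copy sites that the lock is held
    return "cancel"          # timeout: cancel request sent to all sites

print(request_lock({"S1", "S2", "S3"}, {"S1", "S2"}))  # granted
print(request_lock({"S1", "S2", "S3"}, {"S1"}))        # cancel
```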

41
Additional Issues in DDBSs
  • A transaction is distributed if its data is
    distributed
  • Transaction manager (TM) knows where data is
    stored
  • TM decides which site starts local components of
    transaction
  • Distributed query processing

42
Query Parallelism
  • Decompose query into parts that can be executed
    in parallel at several sites
  • Intra-query parallelism
  • If shared-nothing & horizontally fragmented
  • Select name, phone from account where age > 65
  • Decompose into K different queries
  • Result site accepts all and puts together (order
    by, count)
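A small Python sketch of this decomposition (fragment contents are hypothetical): each fragment site runs the same selection, and the result site unions and orders the partial answers.

```python
# Intra-query parallelism over horizontal fragments: the same subquery runs
# on each fragment, and the result site combines the partial results.

fragment_s2 = [{"name": "Ann", "phone": "111", "age": 70},
               {"name": "Bob", "phone": "222", "age": 40}]
fragment_s3 = [{"name": "Cay", "phone": "333", "age": 81}]

def subquery(fragment):
    # each site evaluates: SELECT name, phone FROM account WHERE age > 65
    return [(t["name"], t["phone"]) for t in fragment if t["age"] > 65]

partials = [subquery(f) for f in (fragment_s2, fragment_s3)]  # in parallel
result = sorted(p for part in partials for p in part)         # result site
print(result)   # [('Ann', '111'), ('Cay', '333')]
```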

43
Query Parallelism
  • What if a join?
  • Difficult problem if table fragments at different
    sites
  • Must get all values of the join attributes at one
    site
  • Then broadcast to relevant sites the values of
    the join attribute
  • If site 1 has values 1-10 and site 2 has 11-20,
    only send to those sites
  • Result tuples returned, join performed
  • Example: A ⋈ B with tuples A1-B1, A2-B2, A3-B3;
    tuples A1, A2, A3 sent to S1; S1 sends FK values
    to the B1, B2, B3 sites; those sites send back
    tuples where PK ∈ {A1, A2, A3}; join performed
    at S1

44
Distributed Query Processing
  • If Query (read-only) - no 2PC needed
  • If horizontally fragmented
  • decompose query into subqueries that can be
    processed in parallel
  • one site (result site) accepts all results and
    puts together
  • Must also do order by, count

45
Distributed Query Processing
  • If the data is not fragmented
  • process whatever operations you can locally
    (select, project)
  • Query can involve joins between data at multiple
    sites
  • Must transfer data over network
  • Transfer of data is the highest cost
  • Algorithms to minimize amount of data transferred
    (NP-hard)

46
Distributed Query Processing
  • Assume R1 ⋈ R2 → Site 3
  • with R1 at S1 and R2 at S2
  • Can do
  • R1 → S3 and R2 → S3, do join at S3
  • R1 → S2, execute join at S2, result to S3
  • R2 → S1, execute join at S1, result to S3

47
Semi-join ⋉
  • π dname, lname (Dept ⋈ mgr=ssn Emp)
  • Assume Dept is R1 and Emp is R2
  • 1) project join attribute of Emp at S2 resulting
    in E', send to S1. E' is (π ssn Emp)
  • 2) E' ⋈ Dept at S1. Project result attributes
    and join attributes resulting in E'', send to S2.
    E'' is (π mgr, dname (Dept ⋈ mgr=ssn E'))
  • 3) π dname, lname (E'' ⋈ mgr=ssn Emp) and send
    result to S3
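The three semi-join steps can be traced with tiny hypothetical Dept/Emp relations (the tuples below are illustrative only):

```python
# Semi-join sketch: ship only the join column, join, then ship back the
# reduced relation instead of the whole table.

dept = [{"dname": "HR", "mgr": 1}, {"dname": "IT", "mgr": 2}]   # at S1
emp  = [{"ssn": 1, "lname": "Ng"}, {"ssn": 3, "lname": "Wu"}]   # at S2

# 1) E' = project ssn of Emp at S2, send to S1
e1 = {t["ssn"] for t in emp}

# 2) E'' = join Dept with E' at S1, keep mgr and dname, send to S2
e2 = [{"mgr": d["mgr"], "dname": d["dname"]} for d in dept if d["mgr"] in e1]

# 3) join E'' with Emp at S2, project dname and lname, send to result site
result = [(d["dname"], e["lname"])
          for d in e2 for e in emp if d["mgr"] == e["ssn"]]
print(result)   # [('HR', 'Ng')]
```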

48
Distributed Query Processing
  • Hevner & Yao
  • Cost
  • time to compute results
  • time to transmit data
  • Local data requires only processing time
  • Assume transmission time is most costly
  • Goal is to minimize transmission time

49
Distributed Query Processing
  • Assume R3 = R1 ⋈ att=att R2
  • R1 is at S1, R2 at S2, and result relation R3
    sent to result site S
  • Amount of data to transmit
  • Assume R1 is 1M bytes
  • R2 is 250K bytes
  • R3 is 50K bytes
  • Using semi-join
  • π att R1 is 100K
  • π att R2 is 50K
  • π att (R1 ⋈ R2) is 10K
  • π att (R2 ⋈ R1) is 25K
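Using the relation sizes given above, the transmission costs of the non-semi-join strategies can be compared in a few lines (the semi-join variants are left for the exercise on the next slide):

```python
# Transmission-cost comparison for R3 = R1 join R2, with R1 at S1, R2 at S2,
# and R3 sent to result site S. Costs are total bytes shipped.

sizes = {"R1": 1_000_000, "R2": 250_000, "R3": 50_000}

strategies = {
    "ship R1 and R2 to result site S": sizes["R1"] + sizes["R2"],
    "ship R1 to S2, join there, ship R3 to S": sizes["R1"] + sizes["R3"],
    "ship R2 to S1, join there, ship R3 to S": sizes["R2"] + sizes["R3"],
}

best = min(strategies, key=strategies.get)
print(best, strategies[best])  # ship R2 to S1, join there, ship R3 to S 300000
```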

50
Distributed Query Processing
  • Show all possibilities to previous problem and
    determine best distributed query processing
    strategy

51
FDBSs - Data Integration
  • Fully-integrated, logical composite of all
    constituent databases
  • Potential problems
  • Incompatible data types or query syntax
  • Semantically equivalent but differently named
    parts of schemas
  • Use views for data integration