Title: Outline
1. Outline
- Introduction
- Background
- Distributed DBMS Architecture
- Distributed Database Design
- Distributed Query Processing
- Distributed Transaction Management
- Transaction Concepts and Models
- Distributed Concurrency Control
- Distributed Reliability
- Building Distributed Database Systems (RAID)
- Mobile Database Systems
- Privacy, Trust, and Authentication
- Peer-to-Peer Systems
2. Useful References
- S. B. Davidson, "Optimism and Consistency in Partitioned Distributed Database Systems," ACM Transactions on Database Systems, 9(3):456-481, 1984.
- S. B. Davidson, H. Garcia-Molina, and D. Skeen, "Consistency in Partitioned Networks," ACM Computing Surveys, 17(3):341-370, 1985.
- B. Bhargava, "Resilient Concurrency Control in Distributed Database Systems," IEEE Transactions on Reliability, R-31(5):437-443, 1984.
- D. S. Parker, Jr., et al., "Detection of Mutual Inconsistency in Distributed Systems," IEEE Transactions on Software Engineering, SE-9, 1983.
3. Site Failure and Recovery
- Maintain consistency of replicated copies during site failure.
- Announce failure and restart of a site.
- Identify out-of-date data items.
- Update stale data items.
4. Main Ideas and Concepts
- Read-one write-all-available (ROWAA) protocol.
- Fail locks and copier transactions.
- Session vectors.
- Control transactions.
5. Logical and Physical Copies of Data
- X: a logical data item.
- x_k: the physical copy of item X at site k.
- Strict read-one write-all (ROWA) requires reading at least one copy (at one site) and writing all copies (at all sites).
6. Session Numbers and Nominal Session Numbers
- Each operational session of a site is designated by an integer, the session number.
- A failed site has session number 0.
- as_k is the actual session number of site k.
- ns_i[k] is the nominal session number of site k as recorded at site i.
- NS_k is the nominal session number of site k.
- A nominal session vector, consisting of the nominal session numbers of all sites, is stored at each site. ns_i is the nominal session vector at site i.
7. Read-One Write-All-Available (ROWAA)
- A transaction initiated at site i reads and writes as follows (see the sketch below).
- At site k, ns_i[k] is checked against as_k. If they are not equal, the transaction is rejected.
- The transaction is not sent to a failed site, i.e., one for which ns_i[k] = 0.
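A minimal sketch of these two checks, using an illustrative in-memory Site model (the class, fields, and function names are assumptions, not part of the protocol definition):

```python
class TransactionRejected(Exception):
    pass

class Site:
    def __init__(self, site_id, n_sites):
        self.id = site_id
        self.as_k = 1                              # actual session number (0 = down)
        self.ns = {k: 1 for k in range(n_sites)}   # nominal session vector
        self.copies = {}                           # local copies of data items

    def receive(self, op, item, value, expected_session):
        # Reject the operation if the sender's view of this site is stale.
        if expected_session != self.as_k:
            raise TransactionRejected(f"stale session for site {self.id}")
        if op == "write":
            self.copies[item] = value
        return self.copies.get(item)

def rowaa_write(initiator, sites, item, value):
    """Write all available copies; skip sites the initiator believes are down."""
    for s in sites:
        if initiator.ns[s.id] == 0:                # nominally down: do not send
            continue
        s.receive("write", item, value, expected_session=initiator.ns[s.id])

def rowaa_read(initiator, sites, item):
    """Read one available copy."""
    for s in sites:
        if initiator.ns[s.id] != 0:
            return s.receive("read", item, None, expected_session=initiator.ns[s.id])
    raise TransactionRejected("no available copy")
```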
8. Control Transactions for Announcing Recovery
- Type 1: claims that a site is nominally up. Updates the session vector at all operational sites with the recovering site's new session number. The new session number is one more than the last session number (like an incarnation).
- Example:
  - as_k = 1 initially
  - as_k = 0 after the site fails
  - as_k = 2 after the site recovers
  - as_k = 0 after the site fails again
  - as_k = 3 after the site recovers a second time
9. Control Transactions for Announcing Failure
- Type 2: claims that one or more sites are down. The claim is made when a site attempts and fails to access a data item on another site.
- A control transaction of type 2 sets the value 0 for a failed site in the nominal session vectors at all operational sites. This allows operational sites to avoid sending read and write requests to failed sites.
10. Fail Locks
- A fail lock is set at an operational site on behalf of a failed site if a data item is updated.
- Fail locks can be set per site or per data item.
- Fail locks are used to identify out-of-date items (or missed updates) when a site recovers.
- All fail locks are released when all sites are up and all data copies are consistent.
11. Copier Transaction
- A copier transaction reads the current values of fail-locked items at operational sites and writes them over the out-of-date copies at the recovering site (see the sketch below).
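A minimal sketch of per-item fail locks and a copier transaction, assuming each operational site keeps a set of missed items per failed site; the Site fields and function names here are illustrative:

```python
class Site:
    def __init__(self, site_id):
        self.id = site_id
        self.copies = {}        # item -> current local value
        self.fail_locks = {}    # failed site id -> set of items it missed

def write_with_fail_locks(site, ns, item, value):
    """Apply a write at an operational site; record fail locks for down sites."""
    site.copies[item] = value
    for k, session in ns.items():
        if session == 0:        # site k is nominally down: it misses this update
            site.fail_locks.setdefault(k, set()).add(item)

def copier_transaction(operational, recovering):
    """Read current values of fail-locked items at an operational site and
    write them at the recovering site, then release those fail locks."""
    for item in operational.fail_locks.pop(recovering.id, set()):
        recovering.copies[item] = operational.copies[item]
```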
12. Site Recovery Procedure
- When a site k starts, it loads its actual session number as_k with 0, meaning that the site is ready to process control transactions but not user transactions.
- Next, the site initiates a control transaction of type 1. It reads an available copy of the nominal session vector and refreshes its own copy. This control transaction then writes a newly chosen session number into ns_i[k] at all operational sites i, including itself, but not into as_k yet.
- Using the fail locks at an operational site, the recovering site marks the data copies that have missed updates since the site failed. Note that steps 2 and 3 can be combined.
- If the control transaction in step 2 commits, the site is nominally up. The site converts its state from recovering to operational by loading the new session number into as_k. If step 2 fails due to a crash of another site, the recovering site must initiate a control transaction of type 2 to exclude the newly crashed site, and then must try steps 2 and 3 again. Note that the recovery procedure is delayed by the failure of another site, but the algorithm is robust as long as there is at least one operational site coordinating the transaction in the system. A sketch of these steps follows.
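A rough sketch of the four steps, reusing the illustrative Site fields from the sketches above (as_k, ns, copies, fail_locks) plus an assumed last_session field kept on stable storage; the retry path after a coordinator crash is omitted:

```python
def recover_site(rec, all_sites):
    # Step 1: as_k = 0 - the site accepts control transactions only.
    rec.as_k = 0

    operational = [s for s in all_sites if s is not rec and s.as_k != 0]
    donor = operational[0]                         # any available copy of ns

    # Step 2: control transaction of type 1; the new session number is one
    # more than this site's last session number.
    rec.ns = dict(donor.ns)                        # refresh own nominal vector
    new_session = rec.last_session + 1
    rec.last_session = new_session
    for s in operational + [rec]:
        s.ns[rec.id] = new_session                 # announce the new session

    # Step 3: use fail locks at an operational site to refresh stale copies
    # (the copier transaction).
    for item in donor.fail_locks.pop(rec.id, set()):
        rec.copies[item] = donor.copies[item]

    # Step 4: nominally up - convert from recovering to operational.
    rec.as_k = new_session
```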
13. Status in Site Recovery and Availability of Data Items for Transaction Processing
14. Transaction Processing When Network Partitioning Occurs
- Three alternatives after a partition:
  - A: allow each group of nodes to process new transactions
  - B: allow at most one group to process new transactions
  - C: halt all transaction processing
- Alternative A
  - Database values will diverge; the database is inconsistent when the partition is eliminated.
  - Undo some transactions
    - requires a detailed log
    - expensive
  - Integrate the inconsistent values (see the example below)
    - database item X has values v1, v2
    - new value = v1 + v2 - (value of X at the time of partition)
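One reading of the integration rule above, valid only when both partitions applied purely additive (commutative) updates; the concrete numbers are made up for illustration:

```python
# Merging two diverged copies of item X after a partition heals.
v_at_partition = 100      # value of X when the partition occurred
v1 = 130                  # copy in partition 1 (+30 of local updates)
v2 = 115                  # copy in partition 2 (+15 of local updates)

merged = v1 + v2 - v_at_partition
print(merged)             # 145: both partitions' increments are preserved
```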
15. Network Partition Alternatives
- Alternative B
  - How to guarantee that only one group processes transactions?
    - assign a number of points to each site
    - the partition with a majority of points proceeds
  - Both the partition and site-failure cases are equivalent in the sense that, in both situations, we have a group of sites which know that no site outside the group may process transactions.
  - What if no group has a majority?
    - should we allow transactions to proceed?
    - commit point?
    - delay the commit decision?
    - force the transaction to commit or cancel?
16. Planes of Serializability
17. Merging Semi-Committed Transactions
- Merging semi-committed transactions from several partitions:
  - Combine DCG1, DCG2, ..., DCGN
    - (DCG is a dynamic cyclic graph)
    - (minimize rollback if a cycle exists)
    - NP-complete (minimum feedback vertex set problem)
  - Consider each DCG as a single transaction
    - check acyclicity of this N-node graph (see the sketch below)
    - (too optimistic!)
  - Assign a weight to the transactions in each partition
    - consider the DCG with maximum weight
    - select transactions from other DCGs that do not create cycles
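A small sketch of the "treat each partition's graph as one node" idea: merge the per-partition graphs into an N-node graph and test it for cycles. The graph representation and names are illustrative, not from the slides:

```python
def has_cycle(graph):
    """graph: dict node -> set of successor nodes; detect a cycle by DFS."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GRAY
        for m in graph.get(n, ()):
            if color.get(m, WHITE) == GRAY:
                return True                   # back edge: cycle found
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

# Each partition's DCG collapsed to a single node; edges are cross-partition
# conflicts. A cycle means some transactions must be rolled back.
merged = {"DCG1": {"DCG2"}, "DCG2": {"DCG3"}, "DCG3": {"DCG1"}}
print(has_cycle(merged))      # True
```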
18. Breaking Cycles by Aborting Transactions
- Two choices:
  - Abort the transactions that create cycles
    - consider each transaction that creates a cycle one at a time
  - Abort the transactions that optimize (minimize) rollback
    - complexity O(n^3)
    - the minimization is not necessarily optimal globally
19. Commutative Actions and Semantics
- Semantics of transaction computation
- Commutative actions
  - e.g., give a 5000 bonus to every employee
- Commutativity can be predetermined or recognized dynamically
- Maintain a log (REDO/UNDO) of commutative and non-commutative actions
- Partially roll back transactions to their first non-commutative action (see the sketch below)
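A minimal sketch of partial rollback driven by such a log, assuming each logged action is tagged as commutative or not; the log format and action names are illustrative:

```python
def rollback_point(log):
    """Index of the first non-commutative action, or len(log) if every
    logged action commutes (nothing needs to be rolled back)."""
    for i, (_action, commutative) in enumerate(log):
        if not commutative:
            return i
    return len(log)

log = [("add_bonus(5000)", True),        # commutes with other bonuses
       ("add_bonus(2000)", True),
       ("set_salary(90000)", False),     # overwrites: does not commute
       ("add_bonus(1000)", True)]

idx = rollback_point(log)
keep, undo = log[:idx], log[idx:]        # undo from the first non-commutative action
print("keep:", keep)
print("undo:", undo)
```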
20. Compensating Actions
- Compensating transactions
  - Commit transactions in all partitions
  - Break cycles by removing semi-committed transactions
  - Otherwise abort transactions that are invisible to the environment (no incident edges)
  - Pay the price of committing such transactions and issue compensating transactions
- Recomputing cost
  - size of readset/writeset
  - computation complexity
21. Network Partitioning
- Simple partitioning
  - only two partitions
- Multiple partitioning
  - more than two partitions
- Formal bounds:
  - There exists no non-blocking protocol that is resilient to a network partition if messages are lost when the partition occurs.
  - There exist non-blocking protocols that are resilient to a single network partition if all undeliverable messages are returned to the sender.
  - There exists no non-blocking protocol that is resilient to multiple partitions.
22. Independent Recovery Protocols for Network Partitioning
- No general solution is possible
  - allow one group to terminate while the others are blocked
  - improves availability
- How to determine which group may proceed?
  - the group with a majority
- How does a group know if it has a majority?
  - centralized: whichever partition contains the central site should terminate the transaction
  - voting-based (quorum): different for replicated vs. non-replicated databases
23. Quorum Protocols for Non-Replicated Databases
- The network partitioning problem is handled by the commit protocol.
- Every site i is assigned a vote V_i.
- V is the total number of votes in the system.
- Abort quorum V_a, commit quorum V_c:
  - V_a + V_c > V, where 0 <= V_a, V_c <= V
- Before a transaction commits, it must obtain a commit quorum V_c.
- Before a transaction aborts, it must obtain an abort quorum V_a.
- A small sketch of these checks follows.
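A hedged sketch of the quorum constraint and the commit check; the vote assignment and helper names are illustrative:

```python
def quorums_valid(votes, v_commit, v_abort):
    """Check the constraint Va + Vc > V, with 0 <= Va, Vc <= V."""
    v_total = sum(votes.values())
    return (0 <= v_abort <= v_total and 0 <= v_commit <= v_total
            and v_abort + v_commit > v_total)

def can_commit(responding_sites, votes, v_commit):
    """A transaction may commit only if the responding sites hold a commit quorum."""
    return sum(votes[s] for s in responding_sites) >= v_commit

votes = {"s1": 1, "s2": 1, "s3": 1, "s4": 1, "s5": 1}     # V = 5
print(quorums_valid(votes, v_commit=3, v_abort=3))        # True: 3 + 3 > 5
print(can_commit({"s1", "s2", "s3"}, votes, v_commit=3))  # True
```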
24. State Transitions in Quorum Protocols
[State transition diagram for the coordinator and the participants: Prepare and Vote-commit/Vote-abort messages move the protocol from WAIT to PRE-COMMIT or PRE-ABORT via Prepare-to-commit / Prepare-to-abort; Ready-to-commit / Ready-to-abort responses then lead to Global-commit (COMMIT) or Global-abort (ABORT).]
25. Quorum Protocols for Replicated Databases
- Network partitioning is handled by the replica control protocol.
- One implementation:
  - Assign a vote V_i to each copy of a replicated data item, such that the sum of the V_i is V.
  - Each operation has to obtain a read quorum (V_r) to read and a write quorum (V_w) to write a data item.
  - Then the following rules have to be obeyed in determining the quorums (see the sketch below):
    - V_r + V_w > V: a data item is not read and written by two transactions concurrently.
    - V_w > V/2: two write operations from two transactions cannot occur concurrently on the same data item.
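A small sketch of these two rules for one replicated item, assuming one vote value per copy; the names are illustrative:

```python
def quorum_rules_ok(copy_votes, v_read, v_write):
    """Check Vr + Vw > V (no concurrent read/write of the same item)
    and Vw > V/2 (no two concurrent writes)."""
    v_total = sum(copy_votes.values())
    return v_read + v_write > v_total and v_write > v_total / 2

def can_read(reachable_copies, copy_votes, v_read):
    return sum(copy_votes[c] for c in reachable_copies) >= v_read

def can_write(reachable_copies, copy_votes, v_write):
    return sum(copy_votes[c] for c in reachable_copies) >= v_write

copy_votes = {"A": 1, "B": 1, "C": 1}                      # V = 3
print(quorum_rules_ok(copy_votes, v_read=2, v_write=2))    # True
print(can_write({"A", "B"}, copy_votes, v_write=2))        # True: majority partition
```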
26. Use for Network Partitioning
- A simple modification of the ROWA rule:
  - When the replica control protocol attempts to read or write a data item, it first checks whether a majority of the sites are in the same partition as the site the protocol is running on (by counting their votes). If so, it executes the ROWA rule within that partition.
- Assumes that failures are "clean," which means:
  - failures that change the network's topology are detected by all sites instantaneously
  - each site has a view of the network consisting of all the sites it can communicate with
27. Open Problems
- Replication protocols
- experimental validation
- replication of computation and communication
- Transaction models
- changing requirements
- cooperative sharing vs. competitive sharing
- interactive transactions
- longer duration
- complex operations on complex data
- relaxed semantics
- non-serializable correctness criteria
28. Other Issues
- Detection of mutual inconsistency in distributed systems
- Distributed system with replication for
  - reliability (availability)
  - efficient access
- Maintaining consistency of all copies
  - hard to do efficiently
- Handling discovered inconsistencies
  - not always possible
  - semantics-dependent
29. Replication and Consistency
- Tradeoffs between:
  - degree of replication of objects
  - access time of an object
  - availability of an object (during a partition)
  - synchronization of updates (overhead of consistency)
- All objects should always be available.
- All objects should always be consistent.
- Partitioning can destroy mutual consistency in the worst case.
- Basic design issue: a single failure must not affect the entire system (robust, reliable).
30. Availability and Consistency
- Previous work maintains consistency by:
  - voting (majority consent)
  - tokens (unique/resource)
  - primary site (LOCUS)
  - reliable networks (SDD-1)
- These prevent inconsistency at a cost and do not address detection or resolution issues.
- We want to provide availability and correct propagation of updates.
31. Detecting Inconsistency
- The network may continue to partition or partially merge for an unbounded time.
- Semantics are also different with replication:
  - naming, creation, deletion
  - names in one partition do not relate to entities in another partition
- Need a globally unique system name as well as user name(s).
- Must be able to use them in partitions.
32. Types of Conflicting Consistency
- A system name consists of an <Origin, Version> pair:
  - Origin: globally unique creation name
  - Version: vector of modification history
- Two types of conflicts:
  - Name: two files have the same user-name
  - Version: two incompatible versions of the same file
- Conflicting files may be identical.
- The semantics of the update determine the action.
- Detection of version conflicts:
  - Timestamps: overkill
  - Version vectors: necessary and sufficient
  - Update log: needs global synchronization
33. Version Vector
- Version vector approach: each file has a version vector of (S_i, u_i) pairs
  - S_i: a site on which the file is stored
  - u_i: the number of updates made at that site
  - Example: <A:4, B:2, C:0, D:1>
- Compatible vectors: one is at least as large as the other over all sites in the vector (see the sketch below)
  - <A:1, B:2, C:4, D:3> and <A:0, B:2, C:2, D:3> are compatible
  - <A:1, B:2, C:4, D:3> and <A:1, B:2, C:3, D:4> are not compatible
    (their element-wise merge would be <A:1, B:2, C:4, D:4>)
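A small sketch of version-vector comparison following the compatibility rule above; the dict-based representation and helper names are illustrative:

```python
def dominates(v1, v2):
    """True if v1 is at least as large as v2 at every site."""
    sites = set(v1) | set(v2)
    return all(v1.get(s, 0) >= v2.get(s, 0) for s in sites)

def compatible(v1, v2):
    return dominates(v1, v2) or dominates(v2, v1)

def merge(v1, v2):
    """Element-wise maximum, used when reconciling two copies."""
    sites = set(v1) | set(v2)
    return {s: max(v1.get(s, 0), v2.get(s, 0)) for s in sites}

a = {"A": 1, "B": 2, "C": 4, "D": 3}
b = {"A": 0, "B": 2, "C": 2, "D": 3}
c = {"A": 1, "B": 2, "C": 3, "D": 4}
print(compatible(a, b))   # True: a dominates b
print(compatible(a, c))   # False: conflict, neither dominates
print(merge(a, c))        # element-wise max of a and c
```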
34. Additional Comments
- A committed update at site S_i increments u_i by one.
- Deletion and renaming are updates.
- Resolution at site S_i increments u_i, taking the maximum over the conflicting copies, to maintain consistency later.
- Storing the file at a new site makes the vector longer by one site.
- Inconsistency is determined as early as possible.
- The scheme only works for single-file consistency, not for transactions.
35. Example of Conflicting Operation in Different Partitions
[Figure: sites A, B, C all start with version vector <A:0, B:0, C:0>. In one partition A updates the file twice, giving <A:2, B:0, C:0>, and later updates f once more, giving <A:3, B:0, C:0>; in the other partition B's version is adopted and an update at C gives <A:2, B:0, C:1>. On merge the vectors <A:3, B:0, C:0> and <A:2, B:0, C:1> CONFLICT: 3 > 2, 0 = 0, 0 < 1, so neither dominates. Notation: version vector VV_i = (S_i, v_i), where v_i counts the updates to file f at site S_i.]
36. Example of Partition and Merge
37. Create Conflict
[Figure: sites A, B, C, D all start with <A:0, B:0, C:0, D:0>. The network partitions into {A, B} and {C, D}; in {A, B} the file is updated twice at A, giving <A:2, B:0, C:0, D:0>. A further partition yields {A}, {B, C}, and {D}; A updates once more, giving <A:3, B:0, C:0, D:0>, while an update at C in {B, C} gives <A:2, B:0, C:1, D:0>. When all sites merge, the vectors <A:3, B:0, C:0, D:0> and <A:2, B:0, C:1, D:0> CONFLICT. After reconciliation at site B the vector is <A:3, B:1, C:1, D:0>.]
38.
- General resolution rules are not possible.
- External (irrevocable) actions prevent reconciliation, rollback, etc.
- Resolution should be inexpensive.
- The system must address:
  - detection of conflicts (when, how)
  - the meaning of a conflict (accesses)
  - resolution of conflicts
    - automatic
    - user-assisted
39. Conclusions
- An effective detection procedure, providing access without mutual exclusion (consent).
- Robust during partitions (no loss).
- Occasional inconsistency is tolerated for the sake of availability.
- Reconciliation semantics: recognize the dependence upon semantics.