Title: Reconciling While Tolerating Disagreement in Collaborative Data Sharing
1. Reconciling While Tolerating Disagreement in Collaborative Data Sharing
Nicholas Taylor, Zachary Ives
Department of Computer and Information Science
University of Pennsylvania
- ACM SIGMOD
- International Conference on Management of Data
- June 27, 2006
2. Data Exchange is Needed Everywhere
- Cell phone and PDA address books
- contain slightly different data
- Collaborators' citation databases
- different abbreviation styles
- different citation programs
- Biologists' databases
- collect new information from published databases
- store information from own experiments
- disagree about key data points
3. Traditional Data Integration
4. Conflicting Data is Inevitable
- Independent sources, conflicting data
- common in collaborative settings
- not well accommodated by traditional data integration
- Schema constraints often reveal conflicts
- e.g. name → rating
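To illustrate how a constraint such as name → rating can expose a conflict, here is a minimal sketch. It groups tuples by the left-hand side of the dependency and reports keys that map to more than one value. The function name `fd_violations` and the sample data are hypothetical and do not come from the talk.

```python
from collections import defaultdict

def fd_violations(tuples):
    """Return {key: sorted values} for keys that map to more than one value.

    Each tuple is (lhs, rhs); the dependency lhs -> rhs is violated when
    two tuples share lhs but differ on rhs.
    """
    seen = defaultdict(set)
    for lhs, rhs in tuples:
        seen[lhs].add(rhs)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}

# Hypothetical (name, rating) tuples contributed by two independent sources.
source_a = [("Cafe Roma", 4), ("Blue Door", 3)]
source_b = [("Cafe Roma", 2), ("Blue Door", 3)]

conflicts = fd_violations(source_a + source_b)
print(conflicts)
```

Only "Cafe Roma" is reported, because the two sources agree on "Blue Door".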
5. A Model for Data Sharing
- Global instance not possible, but conflicts are localized
- Collaborative Data Sharing System (CDSS)
- Synchronize databases by sharing transactions
- Each participant creates its own global instance by deciding which transactions to apply
- ORCHESTRA is our implementation of a CDSS
6. CDSS Overview
[Figure: architecture diagram with labels CDSS (ORCHESTRA), updates (D1), RDBMS, and Queries and Answers]
- User interacts with standard database
- CDSS coordinates with other participants
- Ensures availability of published updates
- Finds consistent set of trusted updates (reconciliation)
- This paper assumes a single schema
7. Trust Policies in a CDSS
8. Challenges of Reconciliation
- Updates in atomic transactions
- Causal dependencies (antecedents)
- Intermittent participation
- Maximal progress at each step
- Consistent, predictable behavior
- All transaction acceptances are final
- Always prefer higher priority transactions
- Frequent conflict resolution can be frustrating
- Allow user decisions to be deferred
9. Data Sharing Operations
- Operations involve only one participant
- Publishing
- Reconciliation
- Participant applies consistent subset of updates
- May get its own unique instance
[Figure: diagram with labels Publish New Updates, Reconciliation Requests, Published Updates, Local Instance, and Update Log]
10. Reconciliation in ORCHESTRA
- Group transactions with antecedents and accept highest priority chains
[Figure: worked example over relation R(X,Y) with functional dependency X → Y, showing the transactions considered in Reconciliation 1 and Reconciliation 2, each marked with a decision (Accept, Reject, or Defer)]
11. Consistent Reconciliations
- Applied transactions may not
- modify non-present values
- cause constraint violations
- have an unapplied antecedent transaction
- interact with each other
- Want to avoid transient conflicts
- Therefore, flatten chains of antecedent transactions
[Figure: flattening example involving Peer 1 and Peer 3, with the update (C,6) → (D,6) and tuples (C,5), (C,6), and (D,6)]
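One way to picture flattening is as collapsing an ordered chain of updates into its net effect, so that intermediate values never surface as transient conflicts. The sketch below is a simplified illustration, not ORCHESTRA's implementation. The `(old, new)` update encoding and the function name `flatten` are assumptions made for this example.

```python
def flatten(chain):
    """Collapse an ordered chain of updates into its net effect.

    Assumed encoding: each update is (old, new). old is None for an
    insertion, new is None for a deletion, both are set for a modification.
    """
    net = []  # [old, new] pairs, one per tuple lineage
    for old, new in chain:
        for pair in net:
            if old is not None and pair[1] == old:
                pair[1] = new  # this update continues an existing lineage
                break
        else:
            net.append([old, new])  # this update starts a new lineage
    # Drop lineages with no net effect (e.g. inserted and later deleted).
    return [(o, n) for o, n in net if o != n]

# Hypothetical chain: insert (C,5), change it to (C,6), then to (D,6).
chain = [
    (None, ("C", 5)),
    (("C", 5), ("C", 6)),
    (("C", 6), ("D", 6)),
]
flat = flatten(chain)
print(flat)
```

The flattened chain contains a single insertion of (D,6). The intermediate tuples (C,5) and (C,6) no longer appear and so cannot conflict with other updates.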
12. Reconciliation Algorithm
- Input: Flattened trusted applicable transaction chains
- Output: Set A of accepted transactions
- For each priority p from pmax to 1
- Let C be the set of chains for priority p
- If some t in C conflicts with a non-subsumed u in A, REJECT t
- If some t in C
- uses a deferred value, DEFER it
- conflicts with a non-subsumed, non-rejected u in C, DEFER t
- Otherwise, ACCEPT t by adding it to A
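The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not ORCHESTRA's code. Each chain is a dict with hypothetical fields `id`, `priority`, and `writes` (key to value). Two chains are assumed to conflict when they write different values for the same key. "Uses a deferred value" is approximated as writing a key that a deferred chain also writes. Subsumption is omitted entirely.

```python
from collections import defaultdict

def conflicts(t, u):
    """Assumed conflict test: t and u write different values for one key."""
    return any(k in u["writes"] and u["writes"][k] != v
               for k, v in t["writes"].items())

def reconcile(chains):
    """Return ({chain id: decision}, list A of accepted chains)."""
    decisions, accepted, deferred_keys = {}, [], set()
    by_priority = defaultdict(list)
    for c in chains:
        by_priority[c["priority"]].append(c)

    for p in sorted(by_priority, reverse=True):  # highest priority first
        group = by_priority[p]
        # REJECT chains that conflict with something already accepted.
        for t in group:
            if any(conflicts(t, u) for u in accepted):
                decisions[t["id"]] = "REJECT"
        live = [t for t in group if t["id"] not in decisions]
        # DEFER chains that touch a deferred key or conflict with a
        # non-rejected chain of the same priority.
        deferred = [t for t in live
                    if not deferred_keys.isdisjoint(t["writes"])
                    or any(u is not t and conflicts(t, u) for u in live)]
        for t in deferred:
            decisions[t["id"]] = "DEFER"
            deferred_keys.update(t["writes"])
        # ACCEPT everything else at this priority.
        for t in live:
            if t["id"] not in decisions:
                decisions[t["id"]] = "ACCEPT"
                accepted.append(t)
    return decisions, accepted

# Hypothetical input over R(X,Y) with X -> Y; keys are X values.
chains = [
    {"id": "c1", "priority": 2, "writes": {"A": 4}},
    {"id": "c2", "priority": 1, "writes": {"A": 3}},
    {"id": "c3", "priority": 1, "writes": {"D": 8}},
    {"id": "c4", "priority": 1, "writes": {"D": 9}},
    {"id": "c5", "priority": 1, "writes": {"C": 5}},
]
decisions, accepted = reconcile(chains)
print(decisions)
```

In this run c2 is rejected because it disagrees with the higher-priority c1. The equal-priority pair c3 and c4 is deferred for the user to resolve, and c5 is accepted.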
13. Flattening and Antecedents
[Figure: example over relation R(X,Y) with functional dependency X → Y, showing transaction chains containing updates such as (D,6) → (D,7), (B,3) → (B,4), and (C,5) → (E,5), each marked with a decision (Accept, Reject, or Defer)]
14. System Architecture
- Reconciliation algorithm at each participant
- Centralized and distributed update stores
- Hold updates
- Compute antecedent chains
[Figure: diagram with labels Publish New Updates, Reconciliation Requests, Reconciliation Algorithm, Published Updates, and RDBMS]
15. Experimental Overview
- Experimental goals
- demonstrate feasibility of CDSS concept
- explore efficiency of system
- Target domain
- bioinformatics databases, 10s to 100s of sites
- low GBs of data, MBs of updates
- periodic updates from multiple sites
- Synthetic workloads
- no real workloads with conflicts exist; stress test instead
- tuples generated using skewed distribution (hot items)
- modification if value present, otherwise insertion
16. Result Quality is Robust
- Effect of reconciliation interval on synchronicity
- synchronicity = avg. no. of values per key
- ten peers each publish 500 transactions of one update
- Infrequent reconciliation slowly changes synchronicity
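As a rough illustration of the metric, the sketch below computes the average number of values per key. It assumes this means the number of distinct values that the peers' instances hold for a key, averaged over all keys. The function name `synchronicity` and the sample instances are hypothetical.

```python
def synchronicity(instances):
    """Average number of distinct values per key across peer instances.

    Each instance is a dict mapping key -> value. A result of 1.0 means
    that all peers holding a key agree on its value.
    """
    values_per_key = {}
    for inst in instances:
        for key, value in inst.items():
            values_per_key.setdefault(key, set()).add(value)
    if not values_per_key:
        return 0.0
    return sum(len(v) for v in values_per_key.values()) / len(values_per_key)

# Hypothetical instances of two peers: they agree on A and disagree on B.
peer1 = {"A": 1, "B": 3}
peer2 = {"A": 1, "B": 4}
score = synchronicity([peer1, peer2])
print(score)
```

Under this reading, lower values indicate closer agreement among peers, with 1.0 as the best case.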
17. Fetch Times Dominate Cost
- Effect of reconciliation interval on running time
- ten peers each publish 500 single-update transactions
- Infrequent reconciliation more efficient
- Fetch times (i.e. network latency) dominate
18. Summary of Experiments
- CDSS concept is feasible
- Infrequent reconciliation has minimal effects
- Distributed implementation is practical
- Reconciliation is not an expensive operation
- See paper for system stability experiments
- Effect of increasing transaction size
- negligible on synchronicity after size two
- Effect of adding peers
- worsens synchronicity sublinearly
- increases execution time linearly
19. Related Work
- Inconsistency repair
- [Bry97], [ABC99]
- Causal ordering in distributed DBs with replication
- Optimistic Concurrency Control [KR81], Version vectors [PPR83]
- Distributed file systems
- Ivy [MMGC02], Coda [Braam98, KS95], Bayou [TTP96]
- File synchronization
- Unison [PV04], Harmony [BVP06]
- Version control (CVS, Subversion, etc.)
20. Future Work and Conclusions
- Future Work: Completing the ORCHESTRA platform
- Improved performance and reliability in distributed store
- Support for multiple schemas
- Evaluation with real users
- Conclusions
- Conflicts are inevitable and irresolvable
- Collaborative Data Sharing Systems handle conflicts using update-centric semantics for consistency
- Performance evaluations validate CDSS approach
- A fully distributed implementation is feasible