Title: Replica Control for Peer-to-Peer Storage Systems
1Replica Control for Peer-to-Peer Storage Systems
2P2P
- Peer-to-peer (P2P) has emerged as an important
paradigm for sharing resources at the edges of
the Internet.
- The most widely exploited resource is storage, as
typified by P2P music file sharing
- Napster
- Gnutella
- Following the great success of P2P file sharing,
a natural next step is to develop wide-area P2P
storage systems that aggregate storage across
the Internet.
3Replica Control Protocol
- Replication
- to maintain multiple copies of critical data to
increase availability
- Replica Control Protocol
- to guarantee a consistent view of the replicated
data
4Replica Control Methods
- Optimistic
- Proceed optimistically with computation on the
available subgroup, and restore consistency when
nodes rejoin
- Approaches
- Log, version vector, etc.
- Pessimistic
- Restrict computations under worst-case
assumptions
- Approaches
- Primary site, voting, etc.
5Write-ahead Log
- Files are actually modified in place, but before
any file block is changed, a record is written to
a log telling which node is making the change,
which file block is being changed, and what the
old and new values are
- Only after the log has been written successfully
is the change made to the file
- The write-ahead log can be used for undo
(rollback) and redo
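The rule above (log first, then modify in place) can be sketched as follows; this is a minimal illustrative model, and all class and field names are invented for this example.

```python
# Minimal write-ahead log sketch (illustrative; names are invented).
# Each record stores node, block, old and new values, so the log can
# drive both undo (restore old value) and redo (reapply new value).

class WALFile:
    def __init__(self, blocks):
        self.blocks = list(blocks)   # the in-place-modifiable "file"
        self.log = []                # write-ahead log records

    def write_block(self, node, index, new_value):
        # 1. Append the log record BEFORE touching the file.
        self.log.append({"node": node, "block": index,
                         "old": self.blocks[index], "new": new_value})
        # 2. Only after the record is written, modify the file in place.
        self.blocks[index] = new_value

    def undo_last(self):
        rec = self.log.pop()
        self.blocks[rec["block"]] = rec["old"]   # rollback

    def redo(self, rec):
        self.blocks[rec["block"]] = rec["new"]   # replay after a crash

f = WALFile(["a", "b", "c"])
f.write_block("node1", 1, "B")
assert f.blocks == ["a", "B", "c"]
f.undo_last()
assert f.blocks == ["a", "b", "c"]
```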
6Version Vector
- Version vector for file f
- An N-element vector, where N is the number of
nodes on which f is stored
- The ith element counts the updates done by node i
- A vector V dominates V' if
- Every element of V >= the corresponding element
of V'
- Two vectors conflict if neither dominates the
other
7Version Vector
- Consistency resolution
- If V dominates V', the inconsistency can be
resolved by copying the replica with vector V
over the replica with vector V'
- If V and V' conflict, an inconsistency is
detected
- Version vectors can detect only update conflicts;
they cannot detect read-write conflicts
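The dominance test and conflict detection above can be sketched in a few lines; the helper names here are invented for illustration.

```python
# Version-vector comparison sketch (illustrative helper names).

def dominates(v, w):
    """True if vector v dominates w: every element of v is >= the
    corresponding element of w."""
    return all(a >= b for a, b in zip(v, w))

def compare(v, w):
    if dominates(v, w) and dominates(w, v):
        return "equal"
    if dominates(v, w):
        return "v wins"       # resolve by copying v's replica over w's
    if dominates(w, v):
        return "w wins"
    return "conflict"         # neither dominates: update conflict

assert compare([2, 1, 3], [1, 1, 3]) == "v wins"
assert compare([2, 0, 3], [1, 1, 3]) == "conflict"
```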
8Primary Site Approach
- Data is replicated on at least k+1 nodes (for
k-resilience)
- One node acts as the primary site (PS)
- Any read request is served by the PS
- Any write request is copied to all other back-up
sites
- Any write request arriving at a back-up site is
forwarded to the PS
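The request-routing rules above can be sketched as follows for the failure-free case; this is an illustrative model only, and the class names are invented.

```python
# Primary-site replication sketch (illustrative; failure-free case).

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class PrimarySiteGroup:
    def __init__(self, k):
        # k+1 replicas in total tolerate k failures (k-resilience)
        self.primary = Replica("PS")
        self.backups = [Replica(f"backup{i}") for i in range(k)]

    def read(self, key):
        # reads are served by the primary site
        return self.primary.data.get(key)

    def write(self, key, value):
        # a write (possibly forwarded from a backup) is applied at
        # the primary and then copied to every back-up site
        self.primary.data[key] = value
        for b in self.backups:
            b.data[key] = value

g = PrimarySiteGroup(k=2)
g.write("x", 42)
assert g.read("x") == 42
assert all(b.data["x"] == 42 for b in g.backups)
```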
9PS Failure Handling
- If a back-up fails, there is no interruption in
service
- If the PS fails, there are two possibilities
- If the network is not segmented
- Choose a back-up node as the new primary
- If checkpointing has been active, it is necessary
to restart only from the previous checkpoint
- If the network is segmented
- Only the partition containing the PS can progress
- The other partitions stop updating the data
- It is necessary to distinguish between site
failures and network partitions
10Voting Approach
- V votes are distributed to n replicas with
- Vw + Vr > V
- Vw + Vw > V
- Obtain Vr or more votes to read
- Obtain Vw or more votes to write
- Quorum systems are more general than voting
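The two vote constraints above (read/write quorums intersect, and any two write quorums intersect) can be checked directly; the function name and example numbers are illustrative.

```python
# Weighted-voting sketch: check the quorum constraints
#   Vw + Vr > V  (a read quorum intersects every write quorum)
#   Vw + Vw > V  (any two write quorums intersect)

def valid_assignment(V, Vr, Vw):
    return (Vw + Vr > V) and (2 * Vw > V)

V = 5  # total votes over all replicas
assert valid_assignment(V, Vr=2, Vw=4)       # read 2, write 4: OK
assert valid_assignment(V, Vr=3, Vw=3)       # majority read and write: OK
assert not valid_assignment(V, Vr=2, Vw=2)   # two disjoint write quorums possible
```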
11Quorum Systems
- Trees
- Grid-based (array-based)
- Torus
- Hierarchical
- Multi-column
- and so on
12Quorum-Based Schemes (1/2)
- n replicas, each with a version number
- Read operation
- Read-lock and access a read quorum
- Obtain the replica with the largest version
number
- Write operation
- Write-lock and access a write quorum
- Update the replicas in the quorum with the new
version number (the largest + 1)
13Quorum-Based Schemes (2/2)
One-copy equivalence: the set of replicas must
behave as if there were only a single copy. This
is the strictest consistency criterion.
- One-copy equivalence is guaranteed if we enforce
- Write-write and write-read lock exclusion
- Intersection Property
- A non-empty intersection between any pair of
- A read quorum and a write quorum
- Two write quorums
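The read/write rules of the two slides above can be sketched with majority quorums, which satisfy the intersection property; this is an illustrative model (no locking shown) and the class name is invented.

```python
# Quorum read/write sketch with version numbers (illustrative;
# majority quorums of n replicas always pairwise intersect).

import random

class Quorums:
    def __init__(self, n):
        self.versions = [0] * n          # per-replica version number
        self.values = [None] * n
        self.q = n // 2 + 1              # majority quorum size

    def read(self):
        quorum = random.sample(range(len(self.versions)), self.q)
        # return the value of the largest-version-number replica
        best = max(quorum, key=lambda i: self.versions[i])
        return self.values[best]

    def write(self, value):
        quorum = random.sample(range(len(self.versions)), self.q)
        new_version = max(self.versions) + 1   # the largest + 1
        for i in quorum:
            self.versions[i] = new_version
            self.values[i] = value

qs = Quorums(5)
qs.write("v1")
qs.write("v2")
# any read quorum intersects the last write quorum, so the
# largest-version replica in it always holds the latest value
assert qs.read() == "v2"
```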
14Witnesses
- Witness
- a small entity that maintains enough information
to identify the replicas that contain the most
recent version of the data
- the information could be a timestamp recording
the time of the latest update
- the information could also be a version number,
an integer incremented each time the data are
updated
15Classification of P2P Storage Sys.
- Unstructured
- Replication Strategies for Highly Available
Peer-to-peer Storage
- Replication Strategies in Unstructured
Peer-to-peer Networks
- Structured
- Read only: CFS, PAST, LAR
- Read/Write (Mutable): Ivy, Oasis, Om, Eliot
- Sigma (for mutual exclusion primitive)
16Ivy
- Stores a set of logs with the aid of distributed
hash tables.
- Ivy keeps, for each participant, a log storing
all its updates, and maintains data consistency
optimistically by performing conflict resolution
among all logs (i.e., consistency is maintained
in a best-effort manner).
- The logs must be kept indefinitely, and a
participant must scan all the logs related to a
file to look up the up-to-date file data. Thus,
Ivy is only suitable for small groups of
participants.
17Ivy uses DHT
- Layered architecture (top to bottom)
- (Ivy) Distributed application: put(key, data) /
get(key) → data
- (DHash) Distributed hash table: lookup(key) →
node IP address
- (Chord) Lookup service
- DHT provides
- Simple API
- put(key, value) and get(key) → value
- Availability (replication)
- Robustness (integrity checking)
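The put/get API over a key-to-node lookup can be sketched with a toy successor-based ring; this is illustrative only (real Chord uses finger tables for O(log n) lookup, and DHash adds replication and integrity checks).

```python
# Toy DHT sketch: Chord-style successor lookup over a sorted ring.

import bisect
import hashlib

class ToyDHT:
    def __init__(self, node_ids):
        self.ring = sorted(node_ids)            # node IDs on the ring
        self.store = {n: {} for n in self.ring}

    def _key_id(self, key, space=2**16):
        # hash the key into the same ID space as the nodes
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % space

    def _successor(self, kid):
        # first node clockwise from the key's ID (wrap around)
        i = bisect.bisect_left(self.ring, kid)
        return self.ring[i % len(self.ring)]

    def put(self, key, value):
        self.store[self._successor(self._key_id(key))][key] = value

    def get(self, key):
        return self.store[self._successor(self._key_id(key))].get(key)

dht = ToyDHT([80, 9000, 30000, 50000])
dht.put("song.mp3", b"data")
assert dht.get("song.mp3") == b"data"
```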
18Solution Log Based
- Update: each participant maintains a log of
changes to the file system
- Lookup: each participant scans all logs
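The update/lookup split above can be sketched as follows. This is an illustrative simplification in the spirit of Ivy: real Ivy stores per-participant logs in DHash and orders records with version vectors rather than a single global clock.

```python
# Log-based file system sketch (illustrative; a single global clock
# replaces Ivy's version vectors for simplicity).

class LogFS:
    def __init__(self, participants):
        # one append-only log per participant
        self.logs = {p: [] for p in participants}
        self.clock = 0   # simplified global timestamp for ordering

    def update(self, participant, path, data):
        # an update only appends to the participant's own log
        self.clock += 1
        self.logs[participant].append((self.clock, path, data))

    def lookup(self, path):
        # a lookup must scan ALL logs and take the newest record
        records = [r for log in self.logs.values() for r in log
                   if r[1] == path]
        return max(records)[2] if records else None

fs = LogFS(["alice", "bob"])
fs.update("alice", "/f", "v1")
fs.update("bob", "/f", "v2")
assert fs.lookup("/f") == "v2"
```

The cost of `lookup` grows with the number and length of the logs, which illustrates why Ivy suits only small groups of participants.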
19Eliot
- Eliot relies on a reliable, fault-tolerant,
immutable P2P storage substrate (Charles) to
store data blocks, and uses an auxiliary metadata
service (MS) for storing mutable metadata.
- It supports NFS-like consistency semantics;
however, the traffic between the MS and the
client is high under such semantics.
- It also supports AFS open-close consistency
semantics; however, these semantics may cause the
problem of lost updates.
- The MS is provided by a conventional replicated
database, which may not fit dynamic P2P
environments.
20Oasis
- Oasis is based on Gifford's weighted-voting
quorum concept and allows dynamic quorum
membership.
- It spreads versioned metadata along with data
replicas over the P2P network.
- To complete an operation on a data object, a
client must first find metadata related to the
object and determine the total number of votes,
the votes required for read/write operations, the
replica list, and so on, in order to form a
quorum accordingly.
- One drawback of Oasis is that if a node happens
to use stale metadata, data consistency may be
violated.
21-24 (No Transcript)
25Om
- Om is based on the concepts of automatic replica
regeneration and replica membership
reconfiguration.
- Consistency is maintained by two quorum systems:
a read-one-write-all quorum system for accessing
replicas, and a witness-modeled quorum system for
reconfiguration.
- Om allows replica regeneration from a single
replica. However, a write in Om is always first
forwarded to the primary copy, which serializes
all writes and uses a two-phase procedure to
propagate the write to all secondary replicas.
- The drawbacks of Om are: (1) the primary replica
may become a bottleneck; (2) the overhead
incurred by the two-phase procedure may be too
high; (3) reconfiguration via the witness model
has some probability of violating consistency.
26Om and Pastry
PAST example: Object key 100
[Figure: Pastry ring with node IDs 80, 90, 98, 99,
100, 101, 103, 104, 120]
27Om and Pastry
PAST example: Object key 100; Replication
[Figure: the same Pastry ring, showing replication
of the object]
28Om and Pastry
PAST example: Object key 100; Replication; Replica
crash
[Figure: the same Pastry ring, showing a replica
crashing]
29Om and Pastry
PAST example: Object key 100; Replication; Replica
crash; Regeneration
[Figure: the same Pastry ring, showing regeneration
of the crashed replica]
30Om Normal Case Operation
Read-one / write-all approach; writes serialized
via the primary
[Figure: Pastry ring with node IDs 80, 90, 98, 99,
100, 101, 103, 104, 120; a read is served by one
replica, while a write goes through the primary]
31Om System Architecture Overview
32Witness Model
- The witness model relies on the following limited
view divergence property:
- Intuitively, the property says that two replicas
are unlikely to have completely different views
regarding the reachability of a set of
randomly-placed witnesses.
33Witness Model
- To exploit the limited view divergence property,
all replicas logically organize the witnesses
into an m x t matrix
- The number of rows, m, determines the probability
of intersection
- The number of columns, t, protects against the
failure of individual witnesses, so that each row
has at least one functioning witness with high
probability
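The role of the column count t can be made concrete with a simple reliability calculation; the independence assumption and the failure probability below are illustrative, not taken from the Om paper.

```python
# Sketch: probability that every row of an m x t witness matrix
# contains at least one functioning witness, assuming each witness
# fails independently with probability p (illustrative values).

def all_rows_alive_prob(m, t, p):
    row_dead = p ** t              # all t witnesses in one row failed
    return (1 - row_dead) ** m     # every one of the m rows survives

# adding columns to each row sharply increases survival probability
assert all_rows_alive_prob(m=4, t=3, p=0.1) > \
       all_rows_alive_prob(m=4, t=1, p=0.1)
```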
34Witness Model
35Sigma System Model
36Sigma System Model
- Replicas are always available, but their internal
states may be randomly reset (a failure-recovery
model)
- The number of clients is unpredictable. Clients
are not malicious and are fail-stop.
- Clients and replicas communicate via messages,
which may be duplicated or lost, but never
forged.
37Sigma
- The Sigma protocol collects states from all
replicas to achieve mutual exclusion.
- The basic idea of the Sigma protocol is as
follows. A node u wishing to be the winner of the
mutual exclusion sends a timestamped request to
each of the n (n = 3k + 1) replicas and waits for
replies. On receiving a request from u, a replica
v puts u's request into a local queue in
timestamp order, takes as the winner the node
whose request is at the front of the queue, and
replies with the winner's ID to u.
38Sigma
- When the number of replies received by u reaches
m (m = 2k + 1), u acts according to the following
conditions: (1) if at least m replies take u as
the winner, then u is the winner; (2) if at least
m replies take w (w ≠ u) as the winner, then w is
the winner and u just keeps waiting; (3) if no
node is regarded as the winner by at least m
replies, then u sends a YIELD message to cancel
its request temporarily and then re-inserts its
request after a random backoff.
- In this manner, one node can eventually be
elected as the winner even when the communication
delay variance is large.
- A drawback of the Sigma protocol is that a node
needs to send requests to all replicas and obtain
favorable replies from a large portion (roughly
2/3) of them to win the mutual exclusion, which
incurs large overhead. Moreover, the overhead
becomes even larger under high contention.
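The reply-counting rule of the two slides above can be sketched as a pure decision function; this is an illustrative simulation only, and real Sigma runs over a network and additionally handles replica state resets, YIELD messages, and backoff.

```python
# Sigma-style decision sketch: count which node the replicas report
# as the winner and apply the three conditions (illustrative).

from collections import Counter

def decide(replies, m):
    """Given the winner IDs reported by the replicas, return the
    winner if some node is named by at least m replies, else None
    (meaning: send YIELD and retry after a random backoff)."""
    counts = Counter(replies)
    winner, votes = counts.most_common(1)[0]
    return winner if votes >= m else None

k = 1
n, m = 3 * k + 1, 2 * k + 1          # n = 4 replicas, quorum m = 3
assert decide(["u", "u", "u", "w"], m) == "u"     # (1) u wins
assert decide(["w", "w", "w", "u"], m) == "w"     # (2) w wins, u waits
assert decide(["u", "u", "w", "w"], m) is None    # (3) contention: yield
```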
39References
- [Ivy] A. Muthitacharoen, R. Morris, T. Gil, and
B. Chen, "Ivy: A Read/Write Peer-to-peer File
System," in Proc. of the Symposium on Operating
Systems Design and Implementation (OSDI), 2002.
- [Eliot] C. Stein, M. Tucker, and M. Seltzer,
"Building a Reliable Mutable File System on
Peer-to-peer Storage," in Proc. of the 21st IEEE
Symposium on Reliable Distributed Systems
(WRP2PDS), 2002.
- [Oasis] M. Rodrig and A. LaMarca, "Decentralized
Weighted Voting for P2P Data Management," in
Proc. of the 3rd ACM International Workshop on
Data Engineering for Wireless and Mobile Access,
pp. 85-92, 2003.
- [Om] H. Yu and A. Vahdat, "Consistent and
Automatic Replica Regeneration," in Proc. of the
First Symposium on Networked Systems Design and
Implementation (NSDI '04), 2004.
- [Sigma] S. Lin, Q. Lian, M. Chen, and Z. Zhang,
"A Practical Distributed Mutual Exclusion
Protocol in Dynamic Peer-to-peer Systems," in
Proc. of the 3rd International Workshop on
Peer-to-Peer Systems (IPTPS '04), 2004.