Title: Replica Control for Peer-to-Peer Storage Systems
1Replica Control for Peer-to-Peer Storage Systems
2P2P
- Peer-to-peer (P2P) has emerged as an important
paradigm for sharing resources at the edges of
the Internet.
- The most widely exploited resource is storage, as
typified by P2P music file sharing
- Napster
- Gnutella
- Following the great success of P2P file sharing,
a natural next step is to develop wide-area P2P
storage systems that aggregate storage across
the Internet.
3Replica Control Protocol
- Replication
- to maintain multiple copies of critical data to
increase availability
- Replica Control Protocol
- to guarantee a consistent view of the replicated
data
4Replica Control Methods
- Optimistic
- Proceed optimistically with computation on the
available subgroup, and restore consistency when
nodes rejoin
- Approaches
- Log, version vector, etc.
- Pessimistic
- Restrict computations under worst-case
assumptions
- Approaches
- Primary site, voting, etc.
5Write-ahead Log
- Files are actually modified in place, but before
any file block is changed, a record is written to
a log telling which node is making the change,
which file block is being changed, and what the
old and new values are
- Only after the log has been written successfully
is the change made to the file
- The write-ahead log can be used for undo
(rollback) and redo
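The rule above (log first, then modify in place) can be sketched as follows; this is a minimal illustrative model, and all class and field names are invented for this example.

```python
# Minimal write-ahead log sketch (illustrative; names are invented).
# Each record stores node, block, old and new values, so the log can
# drive both undo (restore old value) and redo (reapply new value).

class WALFile:
    def __init__(self, blocks):
        self.blocks = list(blocks)   # the in-place-modifiable "file"
        self.log = []                # write-ahead log records

    def write_block(self, node, index, new_value):
        # 1. Append the log record BEFORE touching the file.
        self.log.append({"node": node, "block": index,
                         "old": self.blocks[index], "new": new_value})
        # 2. Only after the record is written, modify the file in place.
        self.blocks[index] = new_value

    def undo_last(self):
        rec = self.log.pop()
        self.blocks[rec["block"]] = rec["old"]   # rollback

    def redo(self, rec):
        self.blocks[rec["block"]] = rec["new"]   # replay after a crash

f = WALFile(["a", "b", "c"])
f.write_block("node1", 1, "B")
assert f.blocks == ["a", "B", "c"]
f.undo_last()
assert f.blocks == ["a", "b", "c"]
```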
6Version Vector
- Version vector for file f
- An N-element vector, where N is the number of
nodes on which f is stored
- The ith element counts the updates done by node i
- A vector V dominates V' if
- Every element of V >= the corresponding element
of V'
- Two vectors conflict if neither dominates the
other
7Version Vector
- Consistency resolution
- If V dominates V', the inconsistency can be
resolved by copying the replica with vector V
over the replica with vector V'
- If V and V' conflict, an inconsistency is
detected
- Version vectors can detect only update conflicts;
they cannot detect read-write conflicts
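The dominance test and conflict detection above can be sketched in a few lines; the helper names here are invented for illustration.

```python
# Version-vector comparison sketch (illustrative helper names).

def dominates(v, w):
    """True if vector v dominates w: every element of v is >= the
    corresponding element of w."""
    return all(a >= b for a, b in zip(v, w))

def compare(v, w):
    if dominates(v, w) and dominates(w, v):
        return "equal"
    if dominates(v, w):
        return "v wins"       # resolve by copying v's replica over w's
    if dominates(w, v):
        return "w wins"
    return "conflict"         # neither dominates: update conflict

assert compare([2, 1, 3], [1, 1, 3]) == "v wins"
assert compare([2, 0, 3], [1, 1, 3]) == "conflict"
```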
8Primary Site Approach
- Data is replicated on at least k+1 nodes (for
k-resilience)
- One node acts as the primary site (PS)
- Any read request is served by the PS
- Any write request is copied to all other back-up
sites
- Any write request arriving at a back-up site is
forwarded to the PS
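The request-routing rules above can be sketched as follows for the failure-free case; this is an illustrative model only, and the class names are invented.

```python
# Primary-site replication sketch (illustrative; failure-free case).

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class PrimarySiteGroup:
    def __init__(self, k):
        # k+1 replicas in total tolerate k failures (k-resilience)
        self.primary = Replica("PS")
        self.backups = [Replica(f"backup{i}") for i in range(k)]

    def read(self, key):
        # reads are served by the primary site
        return self.primary.data.get(key)

    def write(self, key, value):
        # a write (possibly forwarded from a backup) is applied at
        # the primary and then copied to every back-up site
        self.primary.data[key] = value
        for b in self.backups:
            b.data[key] = value

g = PrimarySiteGroup(k=2)
g.write("x", 42)
assert g.read("x") == 42
assert all(b.data["x"] == 42 for b in g.backups)
```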
9PS Failure Handling
- If a back-up fails, there is no interruption in
service
- If the PS fails, there are two possibilities
- If the network is not segmented
- Choose a back-up node as the new primary
- If checkpointing has been active, it is necessary
to restart only from the previous checkpoint
- If the network is segmented
- Only the partition containing the PS can progress
- The other partitions stop updating the data
- It is necessary to distinguish between site
failures and network partitions
10Voting Approach
- V votes are distributed to n replicas with
- Vw + Vr > V
- Vw + Vw > V
- Obtain Vr or more votes to read
- Obtain Vw or more votes to write
- Quorum systems are more general than voting
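The two vote constraints above (read/write quorums intersect, and any two write quorums intersect) can be checked directly; the function name and example numbers are illustrative.

```python
# Weighted-voting sketch: check the quorum constraints
#   Vw + Vr > V  (a read quorum intersects every write quorum)
#   Vw + Vw > V  (any two write quorums intersect)

def valid_assignment(V, Vr, Vw):
    return (Vw + Vr > V) and (2 * Vw > V)

V = 5  # total votes over all replicas
assert valid_assignment(V, Vr=2, Vw=4)       # read 2, write 4: OK
assert valid_assignment(V, Vr=3, Vw=3)       # majority read and write: OK
assert not valid_assignment(V, Vr=2, Vw=2)   # two disjoint write quorums possible
```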
11Quorum Systems
- Trees
- Grid-based (array-based)
- Torus
- Hierarchical
- Multi-column
- and so on
12Quorum-Based Schemes (1/2)
- n replicas, each with a version number
- Read operation
- Read-lock and access a read quorum
- Obtain the replica with the largest version
number
- Write operation
- Write-lock and access a write quorum
- Update the replicas in the quorum with the new
version number (the largest + 1)
13Quorum-Based Schemes (2/2)
One-copy equivalence: the set of replicas must
behave as if there were only a single copy. This
is the strictest consistency criterion.
- One-copy equivalence is guaranteed if we enforce
- Write-write and write-read lock exclusion
- Intersection Property
- A non-empty intersection between any pair of
- A read quorum and a write quorum
- Two write quorums
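The read/write rules of the two slides above can be sketched with majority quorums, which satisfy the intersection property; this is an illustrative model (no locking shown) and the class name is invented.

```python
# Quorum read/write sketch with version numbers (illustrative;
# majority quorums of n replicas always pairwise intersect).

import random

class Quorums:
    def __init__(self, n):
        self.versions = [0] * n          # per-replica version number
        self.values = [None] * n
        self.q = n // 2 + 1              # majority quorum size

    def read(self):
        quorum = random.sample(range(len(self.versions)), self.q)
        # return the value of the largest-version-number replica
        best = max(quorum, key=lambda i: self.versions[i])
        return self.values[best]

    def write(self, value):
        quorum = random.sample(range(len(self.versions)), self.q)
        new_version = max(self.versions) + 1   # the largest + 1
        for i in quorum:
            self.versions[i] = new_version
            self.values[i] = value

qs = Quorums(5)
qs.write("v1")
qs.write("v2")
# any read quorum intersects the last write quorum, so the
# largest-version replica in it always holds the latest value
assert qs.read() == "v2"
```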
14Witnesses
- Witness
- a small entity that maintains enough information
to identify the replicas that contain the most
recent version of the data
- the information could be a timestamp recording
the time of the latest update
- the information could also be a version number,
an integer incremented each time the data are
updated
15Classification of P2P Storage Sys.
- Unstructured
- Replication Strategies for Highly Available
Peer-to-peer Storage
- Replication Strategies in Unstructured
Peer-to-peer Networks
- Structured
- Read only: CFS, PAST, LAR
- Read/Write (Mutable): Ivy, Oasis, Om, Eliot
- Sigma (for mutual exclusion primitive)
16Ivy
- Stores a set of logs with the aid of distributed
hash tables.
- Ivy keeps, for each participant, a log storing
all its updates, and maintains data consistency
optimistically by performing conflict resolution
among all logs (i.e., consistency is maintained
in a best-effort manner).
- The logs must be kept indefinitely, and a
participant must scan all the logs related to a
file to look up the up-to-date file data. Thus,
Ivy is only suitable for small groups of
participants.
17Ivy uses DHT
- Layered architecture (top to bottom)
- (Ivy) Distributed application: put(key, data) /
get(key) → data
- (DHash) Distributed hash table: lookup(key) →
node IP address
- (Chord) Lookup service
- DHT provides
- Simple API
- put(key, value) and get(key) → value
- Availability (replication)
- Robustness (integrity checking)
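The put/get API over a key-to-node lookup can be sketched with a toy successor-based ring; this is illustrative only (real Chord uses finger tables for O(log n) lookup, and DHash adds replication and integrity checks).

```python
# Toy DHT sketch: Chord-style successor lookup over a sorted ring.

import bisect
import hashlib

class ToyDHT:
    def __init__(self, node_ids):
        self.ring = sorted(node_ids)            # node IDs on the ring
        self.store = {n: {} for n in self.ring}

    def _key_id(self, key, space=2**16):
        # hash the key into the same ID space as the nodes
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % space

    def _successor(self, kid):
        # first node clockwise from the key's ID (wrap around)
        i = bisect.bisect_left(self.ring, kid)
        return self.ring[i % len(self.ring)]

    def put(self, key, value):
        self.store[self._successor(self._key_id(key))][key] = value

    def get(self, key):
        return self.store[self._successor(self._key_id(key))].get(key)

dht = ToyDHT([80, 9000, 30000, 50000])
dht.put("song.mp3", b"data")
assert dht.get("song.mp3") == b"data"
```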
18Solution Log Based
- Update: each participant maintains a log of
changes to the file system
- Lookup: each participant scans all logs
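The update/lookup split above can be sketched as follows. This is an illustrative simplification in the spirit of Ivy: real Ivy stores per-participant logs in DHash and orders records with version vectors rather than a single global clock.

```python
# Log-based file system sketch (illustrative; a single global clock
# replaces Ivy's version vectors for simplicity).

class LogFS:
    def __init__(self, participants):
        # one append-only log per participant
        self.logs = {p: [] for p in participants}
        self.clock = 0   # simplified global timestamp for ordering

    def update(self, participant, path, data):
        # an update only appends to the participant's own log
        self.clock += 1
        self.logs[participant].append((self.clock, path, data))

    def lookup(self, path):
        # a lookup must scan ALL logs and take the newest record
        records = [r for log in self.logs.values() for r in log
                   if r[1] == path]
        return max(records)[2] if records else None

fs = LogFS(["alice", "bob"])
fs.update("alice", "/f", "v1")
fs.update("bob", "/f", "v2")
assert fs.lookup("/f") == "v2"
```

The cost of `lookup` grows with the number and length of the logs, which illustrates why Ivy suits only small groups of participants.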
19Eliot
- Eliot relies on a reliable, fault-tolerant,
immutable P2P storage substrate (Charles) to
store data blocks, and uses an auxiliary metadata
service (MS) for storing mutable metadata.
- It supports NFS-like consistency semantics;
however, the traffic between the MS and the
client is high under such semantics.
- It also supports AFS open-close consistency
semantics; however, these semantics may cause the
problem of lost updates.
- The MS is provided by a conventional replicated
database, which may not fit dynamic P2P
environments.
20Oasis
- Oasis is based on Gifford's weighted-voting
quorum concept and allows dynamic quorum
membership.
- It spreads versioned metadata along with data
replicas over the P2P network.
- To complete an operation on a data object, a
client must first find metadata related to the
object and determine the total number of votes,
the votes required for read/write operations, the
replica list, and so on, in order to form a
quorum accordingly.
- One drawback of Oasis is that if a node happens
to use stale metadata, data consistency may be
violated.
21-24 (No Transcript)
25Om
- Om is based on the concepts of automatic replica
regeneration and replica membership
reconfiguration.
- Consistency is maintained by two quorum systems:
a read-one-write-all quorum system for accessing
replicas, and a witness-modeled quorum system for
reconfiguration.
- Om allows replica regeneration from a single
replica. However, a write in Om is always first
forwarded to the primary copy, which serializes
all writes and uses a two-phase procedure to
propagate the write to all secondary replicas.
- The drawbacks of Om are: (1) the primary replica
may become a bottleneck; (2) the overhead
incurred by the two-phase procedure may be too
high; (3) reconfiguration via the witness model
has some probability of violating consistency.
26Om and Pastry
PAST example: Object key 100
[Figure: Pastry ring with node IDs 80, 90, 98, 99,
100, 101, 103, 104, 120]
27Om and Pastry
PAST example: Object key 100; Replication
[Figure: the same Pastry ring, showing replication
of the object]
28Om and Pastry
PAST example: Object key 100; Replication; Replica
crash
[Figure: the same Pastry ring, showing a replica
crashing]
29Om and Pastry
PAST example: Object key 100; Replication; Replica
crash; Regeneration
[Figure: the same Pastry ring, showing regeneration
of the crashed replica]
30Om Normal Case Operation
Read-one / write-all approach; writes serialized
via the primary
[Figure: Pastry ring with node IDs 80, 90, 98, 99,
100, 101, 103, 104, 120; a read is served by one
replica, while a write goes through the primary]
31Om System Architecture Overview
32Witness Model
- The witness model relies on the following limited
view divergence property:
- Intuitively, the property says that two replicas
are unlikely to have completely different views
regarding the reachability of a set of
randomly-placed witnesses.
33Witness Model
- To exploit the limited view divergence property,
all replicas logically organize the witnesses
into an m x t matrix
- The number of rows, m, determines the probability
of intersection
- The number of columns, t, protects against the
failure of individual witnesses, so that each row
has at least one functioning witness with high
probability
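The role of the column count t can be made concrete with a simple reliability calculation; the independence assumption and the failure probability below are illustrative, not taken from the Om paper.

```python
# Sketch: probability that every row of an m x t witness matrix
# contains at least one functioning witness, assuming each witness
# fails independently with probability p (illustrative values).

def all_rows_alive_prob(m, t, p):
    row_dead = p ** t              # all t witnesses in one row failed
    return (1 - row_dead) ** m     # every one of the m rows survives

# adding columns to each row sharply increases survival probability
assert all_rows_alive_prob(m=4, t=3, p=0.1) > \
       all_rows_alive_prob(m=4, t=1, p=0.1)
```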
34Witness Model
35Sigma System Model
36Sigma System Model
- Replicas are always available, but their internal
states may be randomly reset (a failure-recovery
model)
- The number of clients is unpredictable. Clients
are not malicious and are fail-stop.
- Clients and replicas communicate via messages,
which may be duplicated or lost, but never
forged.
37Sigma
- The Sigma protocol collects states from all
replicas to achieve mutual exclusion.
- The basic idea of the Sigma protocol is as
follows. A node u wishing to be the winner of the
mutual exclusion sends a timestamped request to
each of the n (n = 3k + 1) replicas and waits for
replies. On receiving a request from u, a replica
v puts u's request into a local queue in
timestamp order, takes as the winner the node
whose request is at the front of the queue, and
replies with the winner's ID to u.
38Sigma
- When the number of replies received by u reaches
m (m = 2k + 1), u acts according to the following
conditions: (1) if at least m replies take u as
the winner, then u is the winner; (2) if at least
m replies take w (w ≠ u) as the winner, then w is
the winner and u just keeps waiting; (3) if no
node is regarded as the winner by at least m
replies, then u sends a YIELD message to cancel
its request temporarily and then re-inserts its
request after a random backoff.
- In this manner, one node can eventually be
elected as the winner even when the communication
delay variance is large.
- A drawback of the Sigma protocol is that a node
needs to send requests to all replicas and obtain
favorable replies from a large portion (roughly
2/3) of them to win the mutual exclusion, which
incurs large overhead. Moreover, the overhead
becomes even larger under high contention.
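The reply-counting rule of the two slides above can be sketched as a pure decision function; this is an illustrative simulation only, and real Sigma runs over a network and additionally handles replica state resets, YIELD messages, and backoff.

```python
# Sigma-style decision sketch: count which node the replicas report
# as the winner and apply the three conditions (illustrative).

from collections import Counter

def decide(replies, m):
    """Given the winner IDs reported by the replicas, return the
    winner if some node is named by at least m replies, else None
    (meaning: send YIELD and retry after a random backoff)."""
    counts = Counter(replies)
    winner, votes = counts.most_common(1)[0]
    return winner if votes >= m else None

k = 1
n, m = 3 * k + 1, 2 * k + 1          # n = 4 replicas, quorum m = 3
assert decide(["u", "u", "u", "w"], m) == "u"     # (1) u wins
assert decide(["w", "w", "w", "u"], m) == "w"     # (2) w wins, u waits
assert decide(["u", "u", "w", "w"], m) is None    # (3) contention: yield
```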
39References
- [Ivy] A. Muthitacharoen, R. Morris, T. Gil, and
B. Chen, "Ivy: A Read/Write Peer-to-peer File
System," in Proc. of the Symposium on Operating
Systems Design and Implementation (OSDI), 2002.
- [Eliot] C. Stein, M. Tucker, and M. Seltzer,
"Building a Reliable Mutable File System on
Peer-to-peer Storage," in Proc. of the 21st IEEE
Symposium on Reliable Distributed Systems
(WRP2PDS), 2002.
- [Oasis] M. Rodrig and A. LaMarca, "Decentralized
Weighted Voting for P2P Data Management," in
Proc. of the 3rd ACM International Workshop on
Data Engineering for Wireless and Mobile Access,
pp. 85-92, 2003.
- [Om] H. Yu and A. Vahdat, "Consistent and
Automatic Replica Regeneration," in Proc. of the
First Symposium on Networked Systems Design and
Implementation (NSDI '04), 2004.
- [Sigma] S. Lin, Q. Lian, M. Chen, and Z. Zhang,
"A Practical Distributed Mutual Exclusion
Protocol in Dynamic Peer-to-peer Systems," in
Proc. of the 3rd International Workshop on
Peer-to-Peer Systems (IPTPS '04), 2004.