Title: Peer-to-Peer (P2P) Storage Systems, CSE 581, Winter 2002
1 Peer-to-Peer (P2P) Storage Systems
CSE 581, Winter 2002
- Sudarshan "Sun" Murthy
- smurthy_at_sunlet.net
2 Papers
- Rowstron A, Druschel P. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems
- Rowstron A, Druschel P. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
- Zhao, et al. Tapestry: An infrastructure for fault-resilient wide-area location and routing
- Kubiatowicz J, et al. OceanStore: An architecture for global-scale persistent storage
3 P2P Storage Systems: Basics
- Needs
- Networking, routing, storage, caching
- Roles
- Client, server, router, cache
- Desired characteristics
- Fast, tolerant, scalable, reliable, good locality
- Small world: keep the clique, and reach everything fast!
4 Pastry: Claims
- A generic P2P object location and routing scheme based on a self-organizing overlay network of nodes connected to the Internet
- Features
- Decentralized
- Fault-resilient
- Scalable
- Reliable
- Good route locality
5 Pastry: 100K-Feet View
- Nodes interact with local applications
- Each node has a unique 128-bit ID (digits in base 2^b)
- Usually a cryptographic hash of the node's IP address
- Node IDs are uniformly spread across geography, etc.
- Nodes route messages based on the key
- To a node whose ID shares more prefix digits, or to a node with a numerically closer ID
- Can route in fewer than log_{2^b}(N) hops, usually
6 Pastry: Node State
- Routing table (R)
- log_{2^b}(N) rows with (2^b - 1) entries in each row
- Row n lists nodes whose IDs share their first n digits with this node's ID
- Neighborhood set (M)
- Lists |M| nodes that are closest according to the proximity metric (application defined)
- Leaf set (L)
- Lists |L|/2 nodes with the numerically closest smaller IDs and |L|/2 nodes with the numerically closest larger IDs (all three structures are sketched below)
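To make the three structures concrete, here is a minimal Python sketch of the per-node state, assuming b = 4 (hex digits), SHA-1 as the ID hash, and |L| = |M| = 16; the names and sizes are illustrative, not taken from the Pastry implementation.

```python
# Sketch of Pastry per-node state (illustrative, not the authors' implementation).
import hashlib

B = 4                      # bits per ID digit (so digits are hex)
DIGITS = 128 // B          # 32 hex digits per 128-bit node ID

def node_id(ip_address: str) -> str:
    """Derive a 128-bit hex ID from an IP address (the paper suggests a crypto hash)."""
    return hashlib.sha1(ip_address.encode()).hexdigest()[:DIGITS]

class PastryNode:
    def __init__(self, ip_address: str):
        self.id = node_id(ip_address)
        # Routing table R: one row per ID digit, 2^b - 1 useful entries per row.
        # R[n][d] holds a node that shares the first n digits with self.id and
        # has digit d in position n (None if no such node is known).
        self.routing_table = [[None] * (2 ** B) for _ in range(DIGITS)]
        # Leaf set L (|L| = 16 here): half numerically smaller, half larger IDs.
        self.leaf_set = {"smaller": [], "larger": []}
        # Neighborhood set M (|M| = 16 here): closest nodes by the
        # application-defined proximity metric.
        self.neighborhood = []
```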
7 Pastry: Parameters
- Numeric base of IDs (b)
- |R| ≈ log_{2^b}(N) × (2^b - 1); max. hops ≈ log_{2^b}(N) (checked below)
- b = 4, N = 10^6 → |R| ≈ 75, max. hops ≈ 5
- b = 4, N = 10^9 → |R| ≈ 105, max. hops ≈ 7
- Number of entries in the leaf set (|L|)
- Entries in L are not sensitive to the key; entries in R could be
- |L| = 2^b or 2^(b+1), usually
- Routing could fail if |L|/2 nodes with adjacent IDs fail simultaneously
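A quick back-of-the-envelope check of the figures quoted above; this is a sketch of the arithmetic only, with rounding noted where it differs from the slide.

```python
# Check the table-size and hop-count figures for b = 4.
import math

def pastry_cost(b: int, n: int):
    rows = math.log(n, 2 ** b)             # ~log_{2^b}(N) rows, also the max hop count
    entries = rows * (2 ** b - 1)          # (2^b - 1) entries per row
    return rows, entries

for n in (10 ** 6, 10 ** 9):
    rows, entries = pastry_cost(4, n)
    print(f"N={n:>10}: ~{rows:.1f} hops, ~{entries:.0f} routing entries")
# N=   1000000: ~5.0 hops, ~75 routing entries
# N=1000000000: ~7.5 hops, ~112 routing entries
#   (the slide's 105 entries corresponds to rounding down to 7 rows, i.e. 7 x 15)
```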
8 Pastry: Routing Algorithm
- Check whether the key falls within the range of IDs in L
- If so, route to the leaf-set node with the numerically closest ID
- Otherwise, check R for an entry whose node ID shares a longer prefix with the key than this node does
- Route to that node, which shares the largest prefix
- The entry may be empty, or the node may be unavailable
- In that case, check L for a node with the same shared-prefix length but an ID numerically closer to the key, and route to it (a sketch of all three cases follows below)
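A compact sketch of the three cases above, assuming hex-string IDs of equal length; `all_known` stands in for the union of L, R, and M, and the helper names are illustrative rather than the paper's code.

```python
# Simplified Pastry forwarding decision (sketch).

def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def numeric(x: str) -> int:
    return int(x, 16)

def route(self_id, key, leaf_set, routing_table, all_known):
    """Return the next-hop ID for `key`, or None if this node is the destination."""
    # Case 1: key falls within the leaf set's ID range -> closest leaf wins.
    leaves = leaf_set + [self_id]
    if min(map(numeric, leaves)) <= numeric(key) <= max(map(numeric, leaves)):
        closest = min(leaves, key=lambda n: abs(numeric(n) - numeric(key)))
        return None if closest == self_id else closest
    # Case 2: use the routing table entry that extends the shared prefix by one digit.
    l = shared_prefix_len(self_id, key)
    entry = routing_table[l][int(key[l], 16)]
    if entry is not None:
        return entry
    # Case 3 (rare): any known node that shares at least as long a prefix
    # and is numerically closer to the key than this node.
    candidates = [n for n in all_known
                  if shared_prefix_len(n, key) >= l
                  and abs(numeric(n) - numeric(key)) < abs(numeric(self_id) - numeric(key))]
    return min(candidates, key=lambda n: abs(numeric(n) - numeric(key))) if candidates else None
```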
9 Pastry: Adaptation
- Arriving nodes
- New node X sends a Join message, keyed on its own ID, to a nearby node A; the message is routed to the node Z whose ID is numerically closest to X's
- X gets its initial L from Z, M from A, and the ith row of R from the ith node visited; X then sends its state to all nodes visited
- Departing/failed nodes
- Nodes periodically test connectivity to the entries in M
- Nodes repair L and M using information from other nodes
- Nodes repair R using entries at the same level from other nodes; they borrow entries from the next level if needed
10 Pastry: Locality
- Entries in R and M are chosen based on a proximity metric decided by the application
- Decisions are taken with local information only
- No guarantee that the complete path is the shortest distance
- Assumes the triangle inequality holds for distances
- Misses nearby nodes that have a different prefix
- Estimates the density of node IDs in the ID space
- Heuristically switches between modes to address these problems; details are (very) sketchy (Section 2.5)
11 Pastry: Evaluation (1)
- Number of routing hops (probability in percent)
- 2 hops (1.6%), 3 (15.6%), 4 (64.5%), 5 (17.5%) (Fig. 5)
- Effect of fewer routing entries, compared to a network with complete routing tables
- At least 30% longer, at most 40% longer (Fig. 6)
- 75 entries vs. 99,999 entries for a 100K-node network!
- Experiments with only one set of parameter values!!
- Ability to locate the closest among k nodes
- Closest: 76%, top 2: 92%, top 3: 96% (Fig. 8)
12 Pastry: Evaluation (2)
- Impact of failures and repairs on route quality
- Number of routing hops vs. node failure (Fig. 10)
- 2.73 (no failure), 2.96 (no repair), 2.74 (with repair)
- 5K-node network, 10% of nodes failing
- Poor parameters used
- Average cost of repairing failed nodes
- 57 remote procedure calls per failed node
- Seems expensive
13 PAST: Pastry Application
- Storage management system
- Archival storage and content-distribution utility
- No support for search, directory lookup, or key distribution
- Nodes and files have uniformly distributed IDs
- Replicas of files are stored at nodes whose IDs closely match the file IDs
- Files may be encrypted
- Clients retrieve files using the file ID as the key
14 PAST: Insert Operation
- Inserts a file at k nodes; returns a 160-bit file ID
- The file ID is a secure hash of the file name, the owner's public key, and some salt; the operation is aborted if an ID collision occurs
- Copies of the file are stored on the k nodes whose IDs are closest to the 128 MSBs of the file ID (sketched below)
- The required storage (k × file size) is debited against the client's storage quota
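A sketch of how a fileId might be derived and how the k storage nodes are chosen. SHA-1 and the 128-MSB rule are from the paper; the function names, the salt length, and the integer representation of node IDs are assumptions for illustration.

```python
# Sketch of PAST fileId generation and replica-node selection.
import hashlib, os

def make_file_id(file_name: str, owner_public_key: bytes, salt: bytes = None) -> bytes:
    """160-bit fileId = secure hash of file name, owner's public key, and a salt."""
    salt = salt if salt is not None else os.urandom(20)   # salt length is an assumption
    return hashlib.sha1(file_name.encode() + owner_public_key + salt).digest()

def replica_nodes(file_id: bytes, node_ids, k: int):
    """Pick the k node IDs (assumed 128-bit ints) numerically closest to the fileId's 128 MSBs."""
    target = int.from_bytes(file_id[:16], "big")           # 128 most significant bits
    return sorted(node_ids, key=lambda n: abs(n - target))[:k]
```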
15 PAST: File Certificate (FC)
- An FC is issued when the Insert operation starts
- Contains the file ID, a hash of the file contents, the replication factor k, ... (see the field sketch below)
- The FC is routed with the file contents using Pastry
- Each node verifies the FC and the file, stores a copy of the file, attaches a store receipt, and forwards the message
- The operation aborts if anything goes wrong
- ID collision, invalid FC, corrupt file contents
- Insufficient storage space
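For reference, a hypothetical container for the certificate fields listed above; the field names are illustrative, and the real FC is signed by the owner's smartcard and may carry additional fields.

```python
# Sketch of the fields a PAST file certificate carries (names are assumptions).
from dataclasses import dataclass

@dataclass
class FileCertificate:
    file_id: bytes           # 160-bit fileId
    content_hash: bytes      # hash of the file contents
    replication_factor: int  # k
    salt: bytes              # salt used when computing the fileId
    owner_signature: bytes   # signed with the owner's (smartcard) private key
```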
16 PAST: Other Operations
- Retrieve a file
- Retrieves a copy of the file with the given ID from the first node that stores it
- Reclaim storage
- Reclaims the storage allocated to the specified file
- The client issues a Reclaim Certificate (RC) to prove ownership of the file
- The RC is routed with the message to all storing nodes; they verify the RC and issue a Reclaim Receipt
- The client uses the reclaim receipts to get storage credit
- No guarantees about the state of reclaimed files
17 PAST: Storage Management
- Goals
- Balance free space among nodes as utilization increases
- Ensure that a file is stored at k nodes
- Balance the number of files stored on nodes
- Storage capacities of nodes cannot differ by more than two orders of magnitude
- Nodes with capacity out of bounds are rejected
- Large-capacity nodes can form a cluster
18 PAST: Diversions
t_pri and t_div control diversion
- Replica diversion, if there is no space at a node
- A node declines to store a file if file size / free space > a threshold (see the sketch below)
- Node A asks node B (from its L) to store the replica instead
- A stores a pointer to B for that file; A must retrieve the file from somewhere if B fails (must keep k copies)!
- Node C (from A's L) also has a pointer to B for that file; useful to reach B if A fails, but C must be on the path
- File diversion, if k nodes can't store the file
- Restart the Insert operation with a different salt (3 tries)
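A sketch of the acceptance/diversion decision described above, with t_pri and t_div as given on the slide; the node objects and their `free_space` attribute are hypothetical.

```python
# Sketch of the replica-diversion decision (illustrative).

def accepts(file_size: int, free_space: int, threshold: float) -> bool:
    """A node rejects the replica when the file would consume too large a
    fraction of its remaining free space (size / free > threshold)."""
    return free_space > 0 and file_size / free_space <= threshold

def store_replica(file_size, primary, leaf_neighbors, t_pri=0.1, t_div=0.05):
    """Try the primary node first; otherwise divert to a leaf-set neighbor,
    which is held to the stricter t_div threshold."""
    if accepts(file_size, primary.free_space, t_pri):
        return primary
    for b in leaf_neighbors:
        if accepts(file_size, b.free_space, t_div):
            return b          # the primary keeps a pointer to b for this file
    return None               # file diversion: retry the insert with a new salt
```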
19 PAST: Caching
- Caching is optional, at the discretion of nodes
- A file routed through a node during lookup/insert may be cached
- Each node visited during insert stores a copy; lookup returns the first copy found; what are we missing?
- Based on the GreedyDual-Size (GD-S) policy developed for caching in web proxies
- If the cache is full, replace the file d with the least c(d)/s(d), where c = cost and s = size; if c(d) = 1, this replaces the largest file (see the sketch below)
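A sketch of the eviction rule as stated on this slide; the full GreedyDual-Size policy also ages entries with an inflation value, which is omitted here. `cache` maps file ID to size, and the unit-cost default reproduces the "evict the largest file" behaviour.

```python
# Simplified GD-S-style eviction (sketch).

def make_room(cache: dict, needed: int, capacity: int, cost=lambda f: 1):
    """Evict the files with the lowest cost/size ratio until `needed` bytes fit.
    Returns True if the new file can now be cached."""
    used = sum(cache.values())
    while cache and used + needed > capacity:
        victim = min(cache, key=lambda f: cost(f) / cache[f])
        used -= cache.pop(victim)
    return used + needed <= capacity
```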
20 PAST: Security
- Smartcards ensure the integrity of IDs and certificates
- Store receipts ensure that k nodes cooperate
- Routing table entries are signed and can be verified by other nodes
- A malicious node can still cause problems
- Choosing the next node at random might help somewhat
21 PAST: Evaluation (1)
- Basic storage management
- Without diversions, 51.1% of insertions fail and global storage utilization is only 60.8%; we need storage management
- |L| = 32, t_pri = 0.1, t_div = 0.05 are optimal values
- Effect of storage management (Fig. 6)
- 10% replica diversion at 80% utilization, 15% at 95%
- Small-file insertions tend to fail beyond 80% utilization; 20% of file insertions tend to fail beyond 95% utilization
- Results are worse with a file-system-style workload
22 PAST: Evaluation (2)
- Effect of the cache (Fig. 8)
- Experiments use no caching, LRU, and GD-S; GD-S performs marginally better than LRU
- Hard to know whether the results are good, since there is no comparison with other systems; it only proves that caching helps
- What we did not see
- Retrieval and reclaim performance; # of hops, maybe, for insertion
- Overlay routing overhead; effort spent on caching
23 PAST: Possible Improvements
- Avoid replica diversion
- Forward on to the next node if there is no space
- May have to add a directory service to improve retrieval; a directory service could be useful anyway
- Reduce replica diversion or the # of forwards
- Add storage stats to the routing table; use them to pick the next node
- How to increase storage capacity?
- Add masters (at least at the cluster level)
- Will not be as P2P any more!?
24 Tapestry: Claims
- An overlay location and routing infrastructure for location-independent routing of messages directly to the closest copy of an object or service, using only point-to-point links and without centralized resources
- Enhances the Plaxton distributed search technique to improve availability, scalability, and adaptation
- More formally defined and better analyzed than Pastry's techniques; a benefit of using Plaxton
25 Tapestry: 100K-Feet View
- Nodes and objects have unique 160-bit IDs
- Nodes route messages using the destination ID
- To a node whose ID shares a longer suffix
- Can route in fewer than log_b(N) hops
- Objects are located by routing to a surrogate root
- Servers publish their objects to surrogate roots; how objects get to servers is not a concern
Compare with Pastry
26 Tapestry: Node State
- Neighbor map
- log_b(N) levels (rows) with b entries at each level
- Entry i at level j points to the closest node whose ID ends with digit i followed by the last (j-1) digits of the current node's ID
- Back pointer list
- IDs of nodes that refer to this node as a neighbor
- Object location pointers
- Tuples of the form <Object ID, Node ID>
- Hotspot monitor
- Tuples of the form <Object ID, Node ID, Frequency>
27 Tapestry: Parameters
- Numeric base of IDs (b)
- Entries ≈ b × log_b(N); max. hops ≈ log_b(N)
- b = 16, N = 10^6 → entries ≈ 80, max. hops ≈ 5
- b = 16, N = 10^9 → entries ≈ 120, max. hops ≈ 8
28 Tapestry: Routing Algorithm
- A message at the nth node shares at least n suffix digits with that node's ID
- Go to level (n+1) of the neighbor map
- Find the closest node whose ID shares those n suffix digits and matches the destination ID in the next suffix digit
- Route to that node (see the sketch below)
- If no such node is found, the current node must be the (or a) root node
- A message may carry a predicate for choosing the next node, in addition to just using the closest node
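A sketch of one suffix-routing step as described above, assuming hex-digit IDs and a 0-indexed `neighbor_map` (level j on the slide corresponds to index j-1 here); this is an illustration, not Tapestry's code.

```python
# One Tapestry-style suffix-routing step (sketch).

def shared_suffix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def next_hop(self_id: str, dest_id: str, neighbor_map):
    """Return the next node toward dest_id, or None if routing ends here."""
    n = shared_suffix_len(self_id, dest_id)
    if n == len(dest_id):
        return None                       # IDs match completely
    digit = int(dest_id[-1 - n], 16)      # next suffix digit to match
    # neighbor_map[n][digit]: closest node sharing n suffix digits with us
    # and having `digit` in the next position; None means we are the
    # (surrogate) root for this ID.
    return neighbor_map[n][digit]
```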
29 Tapestry: Surrogate Roots
- Uses multiple surrogate roots
- Avoids a single point of failure
- Adds a constant sequence of salts to create IDs; the resulting IDs are published, and each ID gets a potentially different surrogate root (sketched below)
- Finding surrogate roots isn't always easy
- Neighbors are used to find nodes that share at least a digit with the object ID
- This part of the paper isn't very clear; work in progress?
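A sketch of how several root IDs could be derived from one object ID with a fixed sequence of salts; the hash choice and salt values are assumptions made for illustration.

```python
# Derive multiple candidate root IDs for one object (sketch).
import hashlib

def root_ids(object_id: str, num_roots: int = 5):
    """Each derived 160-bit ID is published and routes to its own surrogate root."""
    return [hashlib.sha1(f"{object_id}:{salt}".encode()).hexdigest()
            for salt in range(num_roots)]
```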
30 Tapestry: Adaptation (1)
Adding new nodes can be expensive
- Arriving nodes
- New node X sends a message to its own ID through node A; the last node visited is the root node for X; X gets the initial ith level of its neighbor map from the ith node visited
- X sends hello to its new neighbors
- Departing/failed nodes
- Heartbeats are sent in UDP packets using back pointers
- Secondary neighbors are used when a neighbor fails
- Failed neighbors get a second-chance period; they are marked valid again if they respond within this period
31 Tapestry: Adaptation (2)
- Uses introspective optimizations
- Attempts to use statistics to adapt
- Network tuning
- Uses pings to update network latency to neighbors
- Optimizes neighbor maps if latency > threshold
- Hotspot caching
- Monitors the frequency of requests to objects
- Advises the application of the need to cache
32 Tapestry: Evaluation
- Locality
- # of overlay hops is 2 to 4 times the # of underlying network hops (Fig. 8); better when the ID base is larger (Fig. 18)
- Effect of multiple roots
- Latency decreases with more roots (Fig. 13), while bandwidth used increases (Fig. 14)
- Performance under stress
- Better throughput (Fig. 15) and average response time (Fig. 16) than centralized directory servers at higher loads
33 OceanStore: Tapestry Application
- Storage management system with support for nomadic data, constructed from a possibly untrusted infrastructure
- Proposes a business revenue model
- A goal is to support roughly 100 tera users
- Uses Tapestry for networking (a recent change)
- Promotes promiscuous caching to improve locality
- Replicas are not tied to the servers that store them (floating replicas)
Contradicts Tapestry paper
34 OceanStore: 100K-Feet View
- Objects are identified using GUIDs
- Clients access objects using the GUID as the destination ID
- Objects may be servers, routers, data, directories, ...
- Many versions of an object might be stored
- An update creates a new version of the object; the latest, updatable version is the active form, the others are archival forms; archival forms are encoded with an erasure code
- Sessions guarantee consistency, ranging from loose to ACID
- Supports read/write access control
35 OceanStore: Updates (1)
- A client initiates an update as a set of predicates combined with actions
- A replica applies the actions associated with the first true predicate (commit); the update fails if all predicates fail (abort) (sketched below)
- The update attempt is logged regardless
- Replicas are not trusted with unencrypted information
- Version and size comparisons are done on plaintext metadata; other comparisons must be done over ciphertext
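A sketch of the predicate/action structure described above: commit on the first true predicate, abort otherwise, and log the attempt either way. The state layout and the example update are hypothetical.

```python
# Predicate/action update applied at a replica (sketch).

def apply_update(replica_state: dict, update):
    """update is a list of (predicate, action) callables over the replica state."""
    for predicate, action in update:
        if predicate(replica_state):
            action(replica_state)
            log(replica_state, "commit")
            return True
    log(replica_state, "abort")
    return False

def log(replica_state, outcome):
    """The attempt is logged whether it commits or aborts."""
    replica_state.setdefault("log", []).append(outcome)

# Example: "append a block only if the object is still at version 3".
state = {"version": 3, "blocks": []}
update = [(lambda s: s.get("version") == 3,
           lambda s: (s["blocks"].append(b"new-block"), s.update(version=4)))]
apply_update(state, update)   # -> True; state["version"] is now 4
```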
36 OceanStore: Updates (2)
I just consult references ?
- Assumes a position-dependent block cipher
- Compare, replace, insert, delete, and append blocks
- Uses a fancy algorithm for searching within blocks
37 OceanStore: Consistency Guarantee
- Replica tiers are used to serialize and authorize updates
- A small primary tier of replicas cooperates in a Byzantine agreement protocol
- A larger secondary tier of replicas is organized into multicast tree(s)
- Clients send updates to the network
- All replicas apply the updates
- Updates committed by the primary tier are multicast back to the network; versions at other replicas are tentative until then
38 OceanStore: Evaluation
- A prototype was under development at the time of publication
- The web site shows a prototype is out, but no stats
- Issues
- Is there such a thing as too untrusting?
- Risks of version proliferation
- Access control needs work
- Directory service squeezed in?
39 Conclusions
- Pastry and Tapestry
- Somewhat similar in routing; Tapestry is more polished
- Tapestry stores references, Pastry stores copies
- PAST and OceanStore
- OceanStore needs caching more than PAST does; storage management in PAST is a good idea, but needs more work
- No directory services in PAST; OceanStore has some
- Third-party evaluation of these systems is needed
- Research opportunity? Object people meet systems people
40 References
- Visit this URL to see this presentation, a list of references, etc.
- http://www.cse.ogi.edu/smurthy/p2ps/index.html