Title: Peer-to-Peer (P2P) Storage Systems, CSE 581, Winter 2002
1 Peer-to-Peer (P2P) Storage Systems
CSE 581, Winter 2002
- Sudarshan "Sun" Murthy
- smurthy_at_sunlet.net
2 Papers
- Rowstron A, Druschel P. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems
- Rowstron A, Druschel P. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
- Zhao, et al. Tapestry: An infrastructure for fault-resilient wide-area location and routing
- Kubiatowicz J, et al. OceanStore: An architecture for global-scale persistent storage
3 P2P Storage Systems: Basics
- Needs
- Networking, routing, storage, caching
- Roles
- Client, server, router, cache
- Desired characteristics
- Fast, tolerant, scalable, reliable, good locality
- Small world: keep the clique, and reach everything fast!
4 Pastry: Claims
- A generic P2P object location and routing scheme based on a self-organizing overlay network of nodes connected to the Internet
- Features
- Decentralized
- Fault-resilient
- Scalable
- Reliable
- Good route locality
5 Pastry: 100K-Feet View
- Nodes interact with local applications
- Each node has a unique 128-bit ID (digits in base 2^b)
- Usually a cryptographic hash of the node's IP address
- Node IDs are uniformly spread across geography, etc.
- Nodes route messages based on the key
- To a node whose ID shares more prefix digits, or to a node with a numerically closer ID
- Can route in fewer than log_{2^b}(N) hops, usually
6 Pastry: Node State
- Routing table (R)
- log_{2^b}(N) rows with (2^b - 1) entries in each row
- Row n lists nodes whose IDs share their first n digits with this node's ID
- Neighborhood set (M)
- Lists |M| nodes that are closest according to the proximity metric (application defined)
- Leaf set (L)
- Lists |L|/2 nodes with the numerically closest smaller IDs and |L|/2 nodes with the numerically closest larger IDs (all three structures are sketched below)
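To make the three structures concrete, here is a minimal Python sketch of the per-node state, assuming b = 4 (hex digits), SHA-1 as the ID hash, and |L| = |M| = 16; the names and sizes are illustrative, not taken from the Pastry implementation.

```python
# Sketch of Pastry per-node state (illustrative, not the authors' implementation).
import hashlib

B = 4                      # bits per ID digit (so digits are hex)
DIGITS = 128 // B          # 32 hex digits per 128-bit node ID

def node_id(ip_address: str) -> str:
    """Derive a 128-bit hex ID from an IP address (the paper suggests a crypto hash)."""
    return hashlib.sha1(ip_address.encode()).hexdigest()[:DIGITS]

class PastryNode:
    def __init__(self, ip_address: str):
        self.id = node_id(ip_address)
        # Routing table R: one row per ID digit, 2^b - 1 useful entries per row.
        # R[n][d] holds a node that shares the first n digits with self.id and
        # has digit d in position n (None if no such node is known).
        self.routing_table = [[None] * (2 ** B) for _ in range(DIGITS)]
        # Leaf set L (|L| = 16 here): half numerically smaller, half larger IDs.
        self.leaf_set = {"smaller": [], "larger": []}
        # Neighborhood set M (|M| = 16 here): closest nodes by the
        # application-defined proximity metric.
        self.neighborhood = []
```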
7 Pastry: Parameters
- Numeric base of IDs (b)
- |R| ≈ log_{2^b}(N) × (2^b - 1); max. hops ≈ log_{2^b}(N) (checked below)
- b = 4, N = 10^6 → |R| ≈ 75, max. hops ≈ 5
- b = 4, N = 10^9 → |R| ≈ 105, max. hops ≈ 7
- Number of entries in the leaf set (|L|)
- Entries in L are not sensitive to the key; entries in R could be
- |L| = 2^b or 2^(b+1), usually
- Routing could fail if |L|/2 nodes with adjacent IDs fail simultaneously
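A quick back-of-the-envelope check of the figures quoted above; this is a sketch of the arithmetic only, with rounding noted where it differs from the slide.

```python
# Check the table-size and hop-count figures for b = 4.
import math

def pastry_cost(b: int, n: int):
    rows = math.log(n, 2 ** b)             # ~log_{2^b}(N) rows, also the max hop count
    entries = rows * (2 ** b - 1)          # (2^b - 1) entries per row
    return rows, entries

for n in (10 ** 6, 10 ** 9):
    rows, entries = pastry_cost(4, n)
    print(f"N={n:>10}: ~{rows:.1f} hops, ~{entries:.0f} routing entries")
# N=   1000000: ~5.0 hops, ~75 routing entries
# N=1000000000: ~7.5 hops, ~112 routing entries
#   (the slide's 105 entries corresponds to rounding down to 7 rows, i.e. 7 x 15)
```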
8 Pastry: Routing Algorithm
- Check whether the key falls within the range of IDs in L
- If so, route to the leaf-set node with the numerically closest ID
- Otherwise, check R for an entry whose node ID shares a longer prefix with the key than this node does
- Route to that node, which shares the largest prefix
- The entry may be empty, or the node may be unavailable
- In that case, check L for a node with the same shared-prefix length but an ID numerically closer to the key, and route to it (a sketch of all three cases follows below)
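A compact sketch of the three cases above, assuming hex-string IDs of equal length; `all_known` stands in for the union of L, R, and M, and the helper names are illustrative rather than the paper's code.

```python
# Simplified Pastry forwarding decision (sketch).

def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def numeric(x: str) -> int:
    return int(x, 16)

def route(self_id, key, leaf_set, routing_table, all_known):
    """Return the next-hop ID for `key`, or None if this node is the destination."""
    # Case 1: key falls within the leaf set's ID range -> closest leaf wins.
    leaves = leaf_set + [self_id]
    if min(map(numeric, leaves)) <= numeric(key) <= max(map(numeric, leaves)):
        closest = min(leaves, key=lambda n: abs(numeric(n) - numeric(key)))
        return None if closest == self_id else closest
    # Case 2: use the routing table entry that extends the shared prefix by one digit.
    l = shared_prefix_len(self_id, key)
    entry = routing_table[l][int(key[l], 16)]
    if entry is not None:
        return entry
    # Case 3 (rare): any known node that shares at least as long a prefix
    # and is numerically closer to the key than this node.
    candidates = [n for n in all_known
                  if shared_prefix_len(n, key) >= l
                  and abs(numeric(n) - numeric(key)) < abs(numeric(self_id) - numeric(key))]
    return min(candidates, key=lambda n: abs(numeric(n) - numeric(key))) if candidates else None
```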
9 Pastry: Adaptation
- Arriving nodes
- New node X sends a Join message, keyed on its own ID, to a nearby node A; the message is routed to the node Z whose ID is numerically closest to X's
- X gets its initial L from Z, M from A, and the ith row of R from the ith node visited; X then sends its state to all nodes visited
- Departing/failed nodes
- Nodes periodically test connectivity to the entries in M
- Nodes repair L and M using information from other nodes
- Nodes repair R using entries at the same level from other nodes; they borrow entries from the next level if needed
10 Pastry: Locality
- Entries in R and M are chosen based on a proximity metric decided by the application
- Decisions are taken with local information only
- No guarantee that the complete path is the shortest distance
- Assumes the triangle inequality holds for distances
- Misses nearby nodes that have a different prefix
- Estimates the density of node IDs in the ID space
- Heuristically switches between modes to address these problems; details are (very) sketchy (Section 2.5)
11 Pastry: Evaluation (1)
- Number of routing hops (probability in percent)
- 2 hops (1.6%), 3 (15.6%), 4 (64.5%), 5 (17.5%) (Fig. 5)
- Effect of fewer routing entries, compared to a network with complete routing tables
- At least 30% longer, at most 40% longer (Fig. 6)
- 75 entries vs. 99,999 entries for a 100K-node network!
- Experiments with only one set of parameter values!!
- Ability to locate the closest among k nodes
- Closest: 76%, top 2: 92%, top 3: 96% (Fig. 8)
12 Pastry: Evaluation (2)
- Impact of failures and repairs on route quality
- Number of routing hops vs. node failure (Fig. 10)
- 2.73 (no failure), 2.96 (no repair), 2.74 (with repair)
- 5K-node network, 10% of nodes failing
- Poor parameters used
- Average cost of repairing failed nodes
- 57 remote procedure calls per failed node
- Seems expensive
13 PAST: Pastry Application
- Storage management system
- Archival storage and content-distribution utility
- No support for search, directory lookup, or key distribution
- Nodes and files have uniformly distributed IDs
- Replicas of files are stored at nodes whose IDs closely match the file IDs
- Files may be encrypted
- Clients retrieve files using the file ID as the key
14 PAST: Insert Operation
- Inserts a file at k nodes; returns a 160-bit file ID
- The file ID is a secure hash of the file name, the owner's public key, and some salt; the operation is aborted if an ID collision occurs
- Copies of the file are stored on the k nodes whose IDs are closest to the 128 MSBs of the file ID (sketched below)
- The required storage (k × file size) is debited against the client's storage quota
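A sketch of how a fileId might be derived and how the k storage nodes are chosen. SHA-1 and the 128-MSB rule are from the paper; the function names, the salt length, and the integer representation of node IDs are assumptions for illustration.

```python
# Sketch of PAST fileId generation and replica-node selection.
import hashlib, os

def make_file_id(file_name: str, owner_public_key: bytes, salt: bytes = None) -> bytes:
    """160-bit fileId = secure hash of file name, owner's public key, and a salt."""
    salt = salt if salt is not None else os.urandom(20)   # salt length is an assumption
    return hashlib.sha1(file_name.encode() + owner_public_key + salt).digest()

def replica_nodes(file_id: bytes, node_ids, k: int):
    """Pick the k node IDs (assumed 128-bit ints) numerically closest to the fileId's 128 MSBs."""
    target = int.from_bytes(file_id[:16], "big")           # 128 most significant bits
    return sorted(node_ids, key=lambda n: abs(n - target))[:k]
```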
15 PAST: File Certificate (FC)
- An FC is issued when the Insert operation starts
- Contains the file ID, a hash of the file contents, the replication factor k, ... (see the field sketch below)
- The FC is routed with the file contents using Pastry
- Each node verifies the FC and the file, stores a copy of the file, attaches a store receipt, and forwards the message
- The operation aborts if anything goes wrong
- ID collision, invalid FC, corrupt file contents
- Insufficient storage space
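For reference, a hypothetical container for the certificate fields listed above; the field names are illustrative, and the real FC is signed by the owner's smartcard and may carry additional fields.

```python
# Sketch of the fields a PAST file certificate carries (names are assumptions).
from dataclasses import dataclass

@dataclass
class FileCertificate:
    file_id: bytes           # 160-bit fileId
    content_hash: bytes      # hash of the file contents
    replication_factor: int  # k
    salt: bytes              # salt used when computing the fileId
    owner_signature: bytes   # signed with the owner's (smartcard) private key
```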
16 PAST: Other Operations
- Retrieve a file
- Retrieves a copy of the file with the given ID from the first node that stores it
- Reclaim storage
- Reclaims the storage allocated to the specified file
- The client issues a Reclaim Certificate (RC) to prove ownership of the file
- The RC is routed with the message to all storing nodes; they verify the RC and issue a Reclaim Receipt
- The client uses the reclaim receipts to get storage credit
- No guarantees about the state of reclaimed files
17 PAST: Storage Management
- Goals
- Balance free space among nodes as utilization increases
- Ensure that a file is stored at k nodes
- Balance the number of files stored on nodes
- Storage capacities of nodes cannot differ by more than two orders of magnitude
- Nodes with capacity out of bounds are rejected
- Large-capacity nodes can form a cluster
18 PAST: Diversions
t_pri and t_div control diversion
- Replica diversion, if there is no space at a node
- A node declines to store a file if file size / free space > a threshold (see the sketch below)
- Node A asks node B (from its L) to store the replica instead
- A stores a pointer to B for that file; A must retrieve the file from somewhere if B fails (must keep k copies)!
- Node C (from A's L) also has a pointer to B for that file; useful to reach B if A fails, but C must be on the path
- File diversion, if k nodes can't store the file
- Restart the Insert operation with a different salt (3 tries)
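A sketch of the acceptance/diversion decision described above, with t_pri and t_div as given on the slide; the node objects and their `free_space` attribute are hypothetical.

```python
# Sketch of the replica-diversion decision (illustrative).

def accepts(file_size: int, free_space: int, threshold: float) -> bool:
    """A node rejects the replica when the file would consume too large a
    fraction of its remaining free space (size / free > threshold)."""
    return free_space > 0 and file_size / free_space <= threshold

def store_replica(file_size, primary, leaf_neighbors, t_pri=0.1, t_div=0.05):
    """Try the primary node first; otherwise divert to a leaf-set neighbor,
    which is held to the stricter t_div threshold."""
    if accepts(file_size, primary.free_space, t_pri):
        return primary
    for b in leaf_neighbors:
        if accepts(file_size, b.free_space, t_div):
            return b          # the primary keeps a pointer to b for this file
    return None               # file diversion: retry the insert with a new salt
```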
19 PAST: Caching
- Caching is optional, at the discretion of nodes
- A file routed through a node during lookup/insert may be cached
- Each node visited during insert stores a copy; lookup returns the first copy found; what are we missing?
- Based on the GreedyDual-Size (GD-S) policy developed for caching in web proxies
- If the cache is full, replace the file d with the least c(d)/s(d), where c = cost and s = size; if c(d) = 1, this replaces the largest file (see the sketch below)
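A sketch of the eviction rule as stated on this slide; the full GreedyDual-Size policy also ages entries with an inflation value, which is omitted here. `cache` maps file ID to size, and the unit-cost default reproduces the "evict the largest file" behaviour.

```python
# Simplified GD-S-style eviction (sketch).

def make_room(cache: dict, needed: int, capacity: int, cost=lambda f: 1):
    """Evict the files with the lowest cost/size ratio until `needed` bytes fit.
    Returns True if the new file can now be cached."""
    used = sum(cache.values())
    while cache and used + needed > capacity:
        victim = min(cache, key=lambda f: cost(f) / cache[f])
        used -= cache.pop(victim)
    return used + needed <= capacity
```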
20 PAST: Security
- Smartcards ensure the integrity of IDs and certificates
- Store receipts ensure that k nodes cooperate
- Routing table entries are signed and can be verified by other nodes
- A malicious node can still cause problems
- Choosing the next node at random might help somewhat
21 PAST: Evaluation (1)
- Basic storage management
- Without diversions, 51.1% of insertions fail and global storage utilization is only 60.8%; we need storage management
- |L| = 32, t_pri = 0.1, t_div = 0.05 are optimal values
- Effect of storage management (Fig. 6)
- 10% replica diversion at 80% utilization, 15% at 95%
- Small-file insertions tend to fail beyond 80% utilization; 20% of file insertions tend to fail beyond 95% utilization
- Results are worse with a file-system-style workload
22 PAST: Evaluation (2)
- Effect of the cache (Fig. 8)
- Experiments use no caching, LRU, and GD-S; GD-S performs marginally better than LRU
- Hard to know whether the results are good, since there is no comparison with other systems; it only proves that caching helps
- What we did not see
- Retrieval and reclaim performance; # of hops, maybe, for insertion
- Overlay routing overhead; effort spent on caching
23 PAST: Possible Improvements
- Avoid replica diversion
- Forward on to the next node if there is no space
- May have to add a directory service to improve retrieval; a directory service could be useful anyway
- Reduce replica diversion or the # of forwards
- Add storage stats to the routing table; use them to pick the next node
- How to increase storage capacity?
- Add masters (at least at the cluster level)
- Will not be as P2P any more!?
24 Tapestry: Claims
- An overlay location and routing infrastructure for location-independent routing of messages directly to the closest copy of an object or service, using only point-to-point links and without centralized resources
- Enhances the Plaxton distributed search technique to improve availability, scalability, and adaptation
- More formally defined and better analyzed than Pastry's techniques; a benefit of using Plaxton
25 Tapestry: 100K-Feet View
- Nodes and objects have unique 160-bit IDs
- Nodes route messages using the destination ID
- To a node whose ID shares a longer suffix
- Can route in fewer than log_b(N) hops
- Objects are located by routing to a surrogate root
- Servers publish their objects to surrogate roots; how objects get to servers is not a concern
Compare with Pastry
26 Tapestry: Node State
- Neighbor map
- log_b(N) levels (rows) with b entries at each level
- Entry i at level j points to the closest node whose ID ends with digit i followed by the last (j-1) digits of the current node's ID
- Back pointer list
- IDs of nodes that refer to this node as a neighbor
- Object location pointers
- Tuples of the form <Object ID, Node ID>
- Hotspot monitor
- Tuples of the form <Object ID, Node ID, Frequency>
27 Tapestry: Parameters
- Numeric base of IDs (b)
- Entries ≈ b × log_b(N); max. hops ≈ log_b(N)
- b = 16, N = 10^6 → entries ≈ 80, max. hops ≈ 5
- b = 16, N = 10^9 → entries ≈ 120, max. hops ≈ 8
28 Tapestry: Routing Algorithm
- A message at the nth node shares at least n suffix digits with that node's ID
- Go to level (n+1) of the neighbor map
- Find the closest node whose ID shares those n suffix digits and matches the destination ID in the next suffix digit
- Route to that node (see the sketch below)
- If no such node is found, the current node must be the (or a) root node
- A message may carry a predicate for choosing the next node, in addition to just using the closest node
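A sketch of one suffix-routing step as described above, assuming hex-digit IDs and a 0-indexed `neighbor_map` (level j on the slide corresponds to index j-1 here); this is an illustration, not Tapestry's code.

```python
# One Tapestry-style suffix-routing step (sketch).

def shared_suffix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def next_hop(self_id: str, dest_id: str, neighbor_map):
    """Return the next node toward dest_id, or None if routing ends here."""
    n = shared_suffix_len(self_id, dest_id)
    if n == len(dest_id):
        return None                       # IDs match completely
    digit = int(dest_id[-1 - n], 16)      # next suffix digit to match
    # neighbor_map[n][digit]: closest node sharing n suffix digits with us
    # and having `digit` in the next position; None means we are the
    # (surrogate) root for this ID.
    return neighbor_map[n][digit]
```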
29 Tapestry: Surrogate Roots
- Uses multiple surrogate roots
- Avoids a single point of failure
- Adds a constant sequence of salts to create IDs; the resulting IDs are published, and each ID gets a potentially different surrogate root (sketched below)
- Finding surrogate roots isn't always easy
- Neighbors are used to find nodes that share at least a digit with the object ID
- This part of the paper isn't very clear; work in progress?
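A sketch of how several root IDs could be derived from one object ID with a fixed sequence of salts; the hash choice and salt values are assumptions made for illustration.

```python
# Derive multiple candidate root IDs for one object (sketch).
import hashlib

def root_ids(object_id: str, num_roots: int = 5):
    """Each derived 160-bit ID is published and routes to its own surrogate root."""
    return [hashlib.sha1(f"{object_id}:{salt}".encode()).hexdigest()
            for salt in range(num_roots)]
```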
30 Tapestry: Adaptation (1)
Adding new nodes can be expensive
- Arriving nodes
- New node X sends a message to its own ID through node A; the last node visited is the root node for X; X gets the initial ith level of its neighbor map from the ith node visited
- X sends hello to its new neighbors
- Departing/failed nodes
- Heartbeats are sent in UDP packets using back pointers
- Secondary neighbors are used when a neighbor fails
- Failed neighbors get a second-chance period; they are marked valid again if they respond within this period
31 Tapestry: Adaptation (2)
- Uses introspective optimizations
- Attempts to use statistics to adapt
- Network tuning
- Uses pings to update network latency to neighbors
- Optimizes neighbor maps if latency > threshold
- Hotspot caching
- Monitors the frequency of requests to objects
- Advises the application of the need to cache
32 Tapestry: Evaluation
- Locality
- # of overlay hops is 2 to 4 times the # of underlying network hops (Fig. 8); better when the ID base is larger (Fig. 18)
- Effect of multiple roots
- Latency decreases with more roots (Fig. 13), while bandwidth used increases (Fig. 14)
- Performance under stress
- Better throughput (Fig. 15) and average response time (Fig. 16) than centralized directory servers at higher loads
33 OceanStore: Tapestry Application
- Storage management system with support for nomadic data, constructed from a possibly untrusted infrastructure
- Proposes a business revenue model
- A goal is to support roughly 100 tera users
- Uses Tapestry for networking (a recent change)
- Promotes promiscuous caching to improve locality
- Replicas are not tied to the servers that store them (floating replicas)
Contradicts Tapestry paper
34 OceanStore: 100K-Feet View
- Objects are identified using GUIDs
- Clients access objects using the GUID as the destination ID
- Objects may be servers, routers, data, directories, ...
- Many versions of an object might be stored
- An update creates a new version of the object; the latest, updatable version is the active form, the others are archival forms; archival forms are encoded with an erasure code
- Sessions guarantee consistency, ranging from loose to ACID
- Supports read/write access control
35 OceanStore: Updates (1)
- A client initiates an update as a set of predicates combined with actions
- A replica applies the actions associated with the first true predicate (commit); the update fails if all predicates fail (abort) (sketched below)
- The update attempt is logged regardless
- Replicas are not trusted with unencrypted information
- Version and size comparisons are done on plaintext metadata; other comparisons must be done over ciphertext
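A sketch of the predicate/action structure described above: commit on the first true predicate, abort otherwise, and log the attempt either way. The state layout and the example update are hypothetical.

```python
# Predicate/action update applied at a replica (sketch).

def apply_update(replica_state: dict, update):
    """update is a list of (predicate, action) callables over the replica state."""
    for predicate, action in update:
        if predicate(replica_state):
            action(replica_state)
            log(replica_state, "commit")
            return True
    log(replica_state, "abort")
    return False

def log(replica_state, outcome):
    """The attempt is logged whether it commits or aborts."""
    replica_state.setdefault("log", []).append(outcome)

# Example: "append a block only if the object is still at version 3".
state = {"version": 3, "blocks": []}
update = [(lambda s: s.get("version") == 3,
           lambda s: (s["blocks"].append(b"new-block"), s.update(version=4)))]
apply_update(state, update)   # -> True; state["version"] is now 4
```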
36 OceanStore: Updates (2)
I just consult references ?
- Assumes a position-dependent block cipher
- Compare, replace, insert, delete, and append blocks
- Uses a fancy algorithm for searching within blocks
37 OceanStore: Consistency Guarantee
- Replica tiers are used to serialize and authorize updates
- A small primary tier of replicas cooperates in a Byzantine agreement protocol
- A larger secondary tier of replicas is organized into multicast tree(s)
- Clients send updates to the network
- All replicas apply the updates
- Updates committed by the primary tier are multicast back to the network; versions at other replicas are tentative until then
38 OceanStore: Evaluation
- A prototype was under development at the time of publication
- The web site shows a prototype is out, but no stats
- Issues
- Is there such a thing as too untrusting?
- Risks of version proliferation
- Access control needs work
- Directory service squeezed in?
39 Conclusions
- Pastry and Tapestry
- Somewhat similar in routing; Tapestry is more polished
- Tapestry stores references, Pastry stores copies
- PAST and OceanStore
- OceanStore needs caching more than PAST does; storage management in PAST is a good idea, but needs more work
- No directory services in PAST; OceanStore has some
- Third-party evaluation of these systems is needed
- Research opportunity? Object people meet systems people
40 References
- Visit this URL to see this presentation, a list of references, etc.
- http://www.cse.ogi.edu/smurthy/p2ps/index.html