1
Peer-to-Peer (P2P) Storage SystemsCSE 581 Winter
2002
  • Sudarshan Sun Murthy
  • smurthy_at_sunlet.net

2
Papers
  • Rowstron A, Druschel P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems
  • Rowstron A, Druschel P. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
  • Zhao, et al. Tapestry: An infrastructure for fault-tolerant wide-area location and routing
  • Kubiatowicz J, et al. OceanStore: An architecture for global-scale persistent storage

3
P2P Storage Systems Basics
  • Needs
  • Networking, routing, storage, caching
  • Roles
  • Client, server, router, cache
  • Desired characteristics
  • Fast, tolerant, scalable, reliable, good locality
  • Small world: keep the clique, and reach everything fast!

4
Pastry Claims
  • A generic P2P object location and routing scheme
    based on a self-organizing overlay network of
    nodes connected to the Internet.
  • Features
  • Decentralized
  • Fault-resilient
  • Scalable
  • Reliable
  • Good route locality

5
Pastry 100K Feet View
  • Nodes interact with local applications
  • Each node has a unique 128-bit ID (written as digits in base 2^b)
  • Cryptographic hash of the IP address, usually
  • Node IDs are uniformly distributed, so adjacent IDs are spread across geography, ownership, etc.
  • Nodes route messages based on the key
  • To a node whose ID shares more prefix digits
  • To a node with numerically closer ID
  • Can route in fewer than log_{2^b} N hops, usually

6
Pastry Node State
  • Routing table (R)
  • log_{2^b} N rows with (2^b - 1) entries in each row
  • Row n lists nodes whose IDs share their first n digits with this node's ID
  • Neighborhood set (M)
  • Lists the |M| nodes closest to this node according to the proximity metric (application-defined)
  • Leaf set (L)
  • Lists the |L|/2 nodes with numerically closest smaller IDs and the |L|/2 nodes with numerically closest larger IDs
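
A minimal sketch of this per-node state in Python; the class and field names below are hypothetical (not from the paper):

  # Sketch of Pastry node state; field names are illustrative only.
  class PastryNode:
      def __init__(self, node_id, b=4):
          self.node_id = node_id      # 128-bit ID, interpreted as digits in base 2^b
          self.b = b
          # Routing table R: ~log_{2^b}(N) rows; each row has 2^b slots, of which
          # the slot for this node's own digit stays unused (2^b - 1 usable entries).
          self.routing_table = [[None] * (2 ** b) for _ in range(128 // b)]
          # Leaf set L: |L|/2 numerically smaller and |L|/2 larger node IDs.
          self.leaf_set = []
          # Neighborhood set M: the |M| closest nodes by the proximity metric.
          self.neighborhood = []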

7
Pastry Parameters
  • Numeric base of IDs (b)
  • |R| ≈ log_{2^b} N rows × (2^b - 1) entries; max. hops ≈ log_{2^b} N (checked in the sketch after this list)
  • b = 4, N = 10^6 → |R| ≈ 75, max. hops ≈ 5
  • b = 4, N = 10^9 → |R| ≈ 105, max. hops ≈ 7
  • Number of entries in Leaf set (L)
  • Entries in L are not sensitive to the key; entries in R could be
  • |L| = 2^b or 2^(b+1), usually
  • Routing could fail if |L|/2 nodes with adjacent IDs fail simultaneously
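
A quick arithmetic check of these figures (a Python sketch; rounding log_{2^b} N to the nearest integer reproduces the numbers on the slide):

  import math

  def pastry_sizes(b, n):
      rows = round(math.log(n, 2 ** b))     # ~log_{2^b}(N) rows in R
      return rows * (2 ** b - 1), rows      # (entries in R, expected max. hops)

  print(pastry_sizes(4, 10 ** 6))   # (75, 5)
  print(pastry_sizes(4, 10 ** 9))   # (105, 7)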

8
Pastry Routing Algorithm
  • Check if the key falls within the range of IDs covered by L
  • If so, route to the leaf-set node with the numerically closest ID
  • Otherwise, check R for a node whose ID shares a longer prefix with the key (longer than the prefix this node shares)
  • Route to the node that shares the longest prefix
  • That entry may be empty, or the node may be unreachable
  • In that case, check L, M, and R for a node that shares at least the same prefix length
  • Route to such a node with an ID numerically closer to the key
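
The same decision procedure as a Python sketch; the helpers shared_prefix_len, digit, and flatten and the node fields are hypothetical (following the state sketch on the Node State slide), not the paper's pseudocode:

  def route(node, key):
      # 1. Key within the leaf set's range: deliver to the numerically closest ID.
      if node.leaf_set and min(node.leaf_set) <= key <= max(node.leaf_set):
          return min(node.leaf_set + [node.node_id], key=lambda i: abs(i - key))
      # 2. Routing table: forward to a node sharing a longer prefix with the key.
      l = shared_prefix_len(node.node_id, key, node.b)
      candidate = node.routing_table[l][digit(key, l, node.b)]
      if candidate is not None:
          return candidate
      # 3. Rare fallback: any known node with at least as long a shared prefix
      #    that is numerically closer to the key than this node.
      known = node.leaf_set + node.neighborhood + flatten(node.routing_table)
      closer = [n for n in known if n is not None
                and shared_prefix_len(n, key, node.b) >= l
                and abs(n - key) < abs(node.node_id - key)]
      return min(closer, key=lambda i: abs(i - key)) if closer else node.node_id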

9
Pastry Adaptation
  • Arriving nodes
  • New node X asks a nearby node A to route a Join message with X's ID as the key; the message reaches node Z, whose ID is numerically closest to X's
  • X gets its initial L from Z, M from A, and the ith row of R from the ith node visited; X then sends its resulting state to all nodes visited (see the sketch after this list)
  • Departing/failed nodes
  • Nodes test connectivity of entries in M
    periodically
  • Nodes repair L and M using information from other nodes
  • Nodes repair R using entries at the same level from other nodes; they borrow entries from the next level if needed
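
A rough Python sketch of how the arriving node assembles its initial state from the join path (names follow the earlier sketches and are hypothetical; message formats are omitted):

  def assemble_join_state(x, path):
      """path = nodes visited by the Join message, from A (first) to Z (last)."""
      a, z = path[0], path[-1]
      x.neighborhood = list(a.neighborhood)   # M comes from the nearby node A
      x.leaf_set = list(z.leaf_set)           # L comes from Z, numerically closest to X
      for i, hop in enumerate(path):          # row i of R comes from the ith node visited
          x.routing_table[i] = list(hop.routing_table[i])
      # X then sends its resulting state to all nodes it has learned about,
      # so they can update their own tables.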

10
Pastry Locality
  • Entries in R and M are based on a proximity
    metric decided by the application
  • Decision taken with local information only
  • No guarantee that the complete path is the shortest
  • Assumes the triangle inequality holds for distances
  • Misses nearby nodes with different prefix
  • Estimates density of node IDs in the ID space
  • Heuristically switches between modes to address these problems; the details are (very) sketchy (Section 2.5)

11
Pastry Evaluation (1)
  • Number of routing hops (percentage probability)
  • 2 hops (1.6%), 3 (15.6%), 4 (64.5%), 5 (17.5%) (Fig. 5)
  • Effect of fewer routing entries compared to
    network with complete routing tables
  • At least 30% longer, at most 40% longer (Fig. 6)
  • 75 entries vs. 99,999 entries for a 100K-node network!
  • Experiments with only one set of parameter
    values!!
  • Ability to locate closest among k nodes
  • Closest 76%, top 2 92%, top 3 96% (Fig. 8)

12
Pastry Evaluation (2)
  • Impact of failures and repairs on route quality
  • Number of routing hops vs. node failure (Fig. 10)
  • 2.73 (no failure), 2.96 (no repair), 2.74 (with
    repair)
  • 5K-node network, 10% of nodes failing
  • Poor parameters used
  • Average cost of repairing failed nodes
  • 57 remote procedure calls per failed node
  • Seems expensive

13
PAST: A Pastry Application
  • Storage management system
  • Archival storage and content distribution utility
  • No support for search, directory lookup, key
    distribution
  • Nodes and files have uniformly distributed IDs
  • Replicas of files are stored at nodes whose IDs
    match file IDs closely
  • Files may be encrypted
  • Clients retrieve files using file ID as key

14
PAST Insert Operation
  • Inserts a file at k nodes; returns a 160-bit file ID
  • The file ID is a secure hash of the file name, the owner's public key, and some salt; the operation is aborted if an ID collision occurs (see the sketch after this list)
  • Copies of the file are stored on the k nodes whose IDs are closest to the 128 MSBs of the file ID
  • The required storage (k × file size) is debited against the client's storage quota
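
A sketch of deriving such a file ID in Python; the slide specifies a secure hash (SHA-1) over the file name, the owner's public key, and a salt, but the exact byte layout here is an assumption:

  import hashlib, os

  def make_file_id(file_name, owner_public_key, salt=None):
      """Return a 160-bit PAST-style file ID (as an int) and the salt used."""
      salt = salt if salt is not None else os.urandom(20)
      h = hashlib.sha1()
      h.update(file_name.encode("utf-8"))
      h.update(owner_public_key)      # bytes of the owner's public key
      h.update(salt)
      return int.from_bytes(h.digest(), "big"), salt

  # The k storage nodes are those whose node IDs are numerically closest
  # to the 128 most significant bits of this 160-bit file ID.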

15
PAST File Certificate (FC)
  • An FC is issued when the Insert operation starts
  • Has File ID, hash of file content, replication
    factor k, ...
  • FC is routed with file contents using Pastry
  • Each node verifies FC and file, stores a copy of
    file, attaches a Store Receipt, and forwards the
    message
  • Operation aborts if anything goes wrong
  • ID collision, invalid FC, corrupt file contents
  • Insufficient storage space

16
PAST Other Operations
  • Retrieve a file
  • Retrieves a copy of the file with given ID from
    the first node that stores it
  • Reclaim storage
  • Reclaims storage allocated to specified file
  • Client issues a Reclaim Certificate (RC) to prove
    ownership of the file
  • The RC is routed with the message to all storing nodes; they verify the RC and issue a Reclaim Receipt
  • Client uses reclaim receipts to get storage
    credit
  • No guarantees about the state of reclaimed files

17
PAST Storage Management
  • Goals
  • Balance free space among nodes as utilization
    increases
  • Ensure that a file is stored at k nodes
  • Balance number of files stored on nodes
  • Storage capacities of nodes cannot differ by more
    than two orders of magnitude
  • Nodes with capacity out of bounds are rejected
  • Large capacity nodes can form a cluster

18
PAST Diversions
t_pri and t_div control diversion
  • Replica diversion, if no space at node
  • If file size / free space > a threshold, the node will not store the file (see the sketch after this list)
  • Node A asks node B (from its L) to store replica
  • A stores a pointer to B for that file; A must retrieve the file from somewhere if B fails (there must be k copies)!
  • Node C (from A's L) also keeps a pointer to B for that file; useful for reaching B if A fails, but C must be on the lookup path
  • File diversion, if k nodes can't store the file
  • Restart Insert operation with a different salt (3
    tries)
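
The acceptance test behind replica diversion, as a one-function Python sketch (t_pri applies to primary replicas, the stricter t_div to diverted ones; all surrounding bookkeeping is omitted):

  def accepts_replica(file_size, free_space, threshold):
      """A node keeps a replica only if the file is small relative to its
      remaining free space; otherwise the replica is diverted."""
      return free_space > 0 and (file_size / free_space) <= threshold

  # Example: accepts_replica(size, free, t_pri) for a primary replica,
  # accepts_replica(size, free, t_div) for a diverted one.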

19
PAST Caching
  • Caching is optional, at discretion of nodes
  • A file routed through a node during lookup/insert may be cached
  • Each node visited during insert stores a copy; lookup returns the first copy found. What are we missing?
  • Based on Greedy Dual-Size policy developed for
    caching in web proxies
  • If the cache is full, evict the file d with the smallest c(d)/s(d), where c = cost and s = size; with c(d) = 1, this evicts the largest file
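
A minimal Python sketch of that eviction rule (the cost function c is left as a parameter; the inflation term of full GreedyDual-Size is omitted for brevity):

  def evict_candidate(cache, cost):
      """cache maps file_id -> size; return the file with the smallest cost/size."""
      return min(cache, key=lambda d: cost(d) / cache[d])

  # With cost(d) == 1 for every file, this simply picks the largest cached file.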

20
PAST Security
  • Smartcards ensure integrity of IDs and
    certificates
  • Store receipts ensure k nodes cooperate
  • Routing table entries are signed, and they can be
    verified by other nodes
  • A malicious node can cause problems
  • Choosing next node at random might help somewhat

21
PAST Evaluation (1)
  • Basic storage management
  • With no diversions, 51.1% of insertions fail and global storage utilization is only 60.8%; we need storage management
  • |L| = 32, t_pri = 0.1, t_div = 0.05 are the optimal values
  • Effect of storage management (Fig. 6)
  • 10% replica diversion at 80% utilization; 15% at 95%
  • Small-file insertions tend to fail beyond 80% utilization; 20% of file insertions tend to fail beyond 95% utilization
  • Results are worse with file-system style workload

22
PAST Evaluation (2)
  • Effect of cache (Fig. 8)
  • Experiments use no caching, LRU, and GD-S; GD-S performs marginally better than LRU
  • Hard to know if the results are good since there is no comparison with other systems; it only proves that caching helps
  • What we did not see
  • Retrieval and reclaim performance; # of hops, maybe also for insertion
  • Overlay routing overhead; effort spent on caching

23
PAST Possible Improvements
  • Avoid replica diversion
  • Forward on to the next node if no space
  • May have to add a directory service to improve retrieval; a directory service could be useful anyway
  • Reduce replica diversion or the # of forwards
  • Add storage stats to the routing table; use them to pick the next node
  • How to increase storage capacity?
  • Add masters (at least at cluster level)
  • Will not be as P2P any more!?

24
Tapestry Claims
  • An overlay location and routing infrastructure for location-independent routing of messages directly to the closest copy of an object or service, using only point-to-point links and no centralized resources
  • Enhances Plaxton distributed search technique to
    improve availability, scalability, and adaptation
  • More formally defined and better analyzed than Pastry's techniques; a benefit of building on Plaxton

25
Tapestry 100K Feet View
  • Nodes and objects have unique 160-bit IDs
  • Nodes route messages using destination ID
  • To a node whose ID shares longer suffix
  • Can route in fewer than log_b N hops
  • Objects are located by routing to a surrogate
    root
  • Servers publish their objects to surrogate roots; how objects get to servers is not Tapestry's concern

Compare with Pastry
26
Tapestry Node State
  • Neighbor map
  • log_b N levels (rows) with b entries at each level
  • Entry i at level j is the closest node whose ID ends with digit i followed by the last (j - 1) digits of the current node's ID
  • Back pointer list
  • IDs of nodes that refer to this node as a neighbor
  • Object location pointers
  • Tuples of the form <Object ID, Node ID>
  • Hotspot monitor
  • Tuples of the form <Object ID, Node ID, Frequency>
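
As with Pastry, a minimal Python sketch of this per-node state (field names are hypothetical):

  class TapestryNode:
      def __init__(self, node_id, b=16, levels=40):
          self.node_id = node_id
          # Neighbor map: ~log_b(N) levels of b entries; entry i at level j is the
          # closest node whose ID ends with digit i followed by the last (j - 1)
          # digits of this node's ID.
          self.neighbor_map = [[None] * b for _ in range(levels)]
          self.back_pointers = []     # nodes that list this node as a neighbor
          self.object_pointers = {}   # object ID -> node ID of a server storing it
          self.hotspot_monitor = {}   # object ID -> (server node ID, request frequency)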

27
Tapestry Parameters
  • Numeric base of IDs (b)
  • Entries ≈ b × log_b N; max. hops ≈ log_b N
  • b = 16, N = 10^6 → entries ≈ 80, max. hops ≈ 5
  • b = 16, N = 10^9 → entries ≈ 120, max. hops ≈ 8

28
Tapestry Routing Algorithm
  • The destination ID of a message at the nth node shares at least n suffix digits with that node's ID
  • Go to level (n + 1) of the neighbor map
  • Find the closest node there whose ID also matches the next digit of the destination ID (sharing n + 1 suffix digits with it)
  • Route to the node determined
  • If no such node is found, then the current node must be the (or a) root node
  • Message may contain a predicate to find the next
    node, in addition to just using the closest node
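
A Python sketch of the suffix-matching hop selection; suffix_digit is a hypothetical helper returning the (n + 1)th digit from the end of an ID (compare with the Pastry routing sketch earlier):

  def next_hop(node, dest_id, n):
      """At the nth node the message already matches n suffix digits of dest_id;
      consult level n + 1 of the neighbor map for a node matching one more digit."""
      wanted = suffix_digit(dest_id, n)           # (n + 1)th digit from the end
      candidate = node.neighbor_map[n][wanted]    # row index n holds level n + 1
      if candidate is None:
          return None     # no better match: this node is the (surrogate) root
      return candidate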

29
Tapestry Surrogate Roots
  • Uses multiple surrogate roots
  • Avoid single point of failure
  • A constant sequence of salts is added to the object ID to create several IDs; the resulting IDs are published, and each ID gets a potentially different surrogate root
  • Finding surrogate roots isn't always easy
  • Neighbors are used to find nodes that share at
    least a digit with the object ID
  • This part of the paper isn't very clear; work in progress?

30
Tapestry Adaptation (1)
Adding new nodes can be expensive
  • Arriving nodes
  • New node X sends a message addressed to its own ID through node A; the last node visited is the root node for X. X gets its initial ith level of the neighbor map from the ith node visited
  • X sends hello to its new neighbors
  • Departing/failed nodes
  • Send heartbeats on UDP packets using back
    pointers
  • Secondary neighbors are used when a neighbor
    fails
  • Failed neighbors get a second-chance period; they are marked valid again if they respond within this period

31
Tapestry Adaptation (2)
  • Uses Introspective Optimizations
  • Attempts to use statistics to adapt
  • Network tuning
  • Use pings to update network latency to neighbors
  • Optimize neighbor maps if latency > threshold
  • Hotspot caching
  • Monitor frequency of requests to objects
  • Advise application of need to cache

32
Tapestry Evaluation
  • Locality
  • # of Tapestry hops is 2 to 4 times the # of underlying network hops (Fig. 8); better when the ID base is larger (Fig. 18)
  • Effect of multiple roots
  • Latency decreases with more roots (Fig. 13), while bandwidth used increases (Fig. 14)
  • Performance under stress
  • Better throughput (Fig. 15) and average response
    time (Fig. 16) than centralized directory servers
    at higher loads

33
OceanStore: A Tapestry Application
  • Storage management system with support for
    nomadic data, and constructed from a possibly
    untrusted infrastructure
  • Proposes a business revenue model
  • A goal is to support roughly 100 tera users
  • Uses Tapestry for networking (a recent change)
  • Promotes promiscuous caching to improve locality
  • Replicas are stored independently of any particular server (floating replicas)

Contradicts Tapestry paper
34
OceanStore 100K Feet View
  • Objects are identified using GUIDs
  • Clients access objects using GUIDs as destination
    ID
  • Objects may be servers, routers, data,
    directories,
  • Many versions of objects might be stored
  • An update creates a new version of the object; the latest, updatable version is the active form, the others are archival forms; archival forms are encoded with an erasure code
  • Session guarantees control consistency, ranging from loose to ACID
  • Supports read/write access control

35
OceanStore Updates (1)
  • The client submits an update as a set of predicates, each combined with actions (see the sketch after this list)
  • A replica applies the actions associated with the first true predicate (commit); the update fails if all predicates fail (abort)
  • The update attempt is logged regardless
  • Replicas are not trusted with unencrypted info.
  • Version and size comparisons are done on plaintext metadata; other operations must work over ciphertext
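
A Python sketch of the predicate/action shape of an update (types and names are hypothetical; OceanStore's actual formats are not given on the slide):

  def apply_update(replica, update, log):
      """update is an ordered list of (predicate, actions) pairs; run the actions
      of the first true predicate (commit), abort if none hold, log either way."""
      for predicate, actions in update:
          if predicate(replica):
              for action in actions:
                  action(replica)
              log.append(("commit", update))
              return True
      log.append(("abort", update))
      return False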

36
OceanStore Updates (2)
I just consult the references!
  • Assumes a position-dependent block cipher
  • compare, replace, insert, delete, append blocks
  • Uses a fancy algorithm for searching within blocks

37
OceanStore Consistency Guarantee
  • Replica tiers are used to serialize and authorize updates
  • A small primary tier of replicas cooperates via a Byzantine agreement protocol
  • A larger secondary tier of replicas is organized into multicast tree(s)
  • Client sends updates to the network
  • All replicas apply the updates
  • Updates committed by the primary tier are multicast back to the network; versions at other replicas are tentative until then

38
OceanStore Evaluation
  • Prototype under development at time of paper
    publication
  • Web site shows a prototype is out, but no stats
  • Issues
  • Is there such a thing as too untrusting?
  • Risks of version proliferation
  • Access control needs work
  • Directory service squeezed in?

39
Conclusions
  • Pastry and Tapestry
  • Somewhat similar in routing; Tapestry is more polished
  • Tapestry stores references, Pastry stores copies
  • PAST and OceanStore
  • OceanStore needs caching more than PAST does; storage management in PAST is a good idea but needs more work
  • No directory services in PAST; OceanStore has some
  • Third-party evaluation of the systems is needed
  • Research opportunity? Object people meet Systems
    people

40
References
  • Visit this URL to see this presentation, list of
    references, etc.
  • http://www.cse.ogi.edu/smurthy/p2ps/index.html