Title: Peer-to-Peer Networks
1Peer-to-Peer Networks
- Outline
- Overview
- Pastry
- OpenDHT
- Contributions from Peter Druschel and Sean Rhea
2What is Peer-to-Peer About?
- Distribution
- Decentralized control
- Self-organization
- Symmetric communication
- P2P networks do two things:
- Map objects onto nodes
- Route requests to the node responsible for a given object
3Pastry
- Self-organizing overlay network
- Consistent hashing
- Lookup/insert object in $< \log_{16} N$ routing steps (expected)
- O(log N) per-node state
- Network locality heuristics
4Object Distribution
- Consistent hashing [Karger et al. '97]
- 128-bit circular id space, from 0 to $2^{128} - 1$
- nodeIds (uniform random)
- objIds (uniform random)
- Invariant: node with numerically closest nodeId maintains object
[Figure: circular id space from 0 to $2^{128} - 1$, with nodeIds and an objId marked on the ring]
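The invariant above can be sketched in a few lines. This is a minimal illustration, not Pastry's code: the names `make_id`, `ring_distance` and `responsible_node` are hypothetical, ids are derived from a truncated SHA-256 digest purely to get 128 uniformly distributed bits, and "numerically closest" is assumed to wrap around the circular id space.

```python
import hashlib

ID_BITS = 128
ID_SPACE = 1 << ID_BITS  # ids live in [0, 2^128 - 1]


def make_id(name: str) -> int:
    """Derive a 128-bit id from a name (assumption: truncated SHA-256)."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:16], "big")


def ring_distance(a: int, b: int) -> int:
    """Shortest distance between two ids on the circular id space."""
    d = abs(a - b)
    return min(d, ID_SPACE - d)


def responsible_node(obj_id: int, node_ids: list) -> int:
    """Return the nodeId numerically closest to obj_id (ties go to the smaller id)."""
    return min(node_ids, key=lambda n: (ring_distance(n, obj_id), n))
```

Because the space is circular, an object with an id near 0 can be owned by a node whose id is near $2^{128} - 1$.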
5Object Insertion/Lookup
Msg with key X is routed to live node with nodeId closest to X.
Problem: complete routing table not feasible.
[Figure: circular id space from 0 to $2^{128} - 1$; Route(X) delivers the message to the node closest to X]
6Routing Distributed Hash Table
- Properties
- $\log_{16} N$ steps
- O(log N) state
[Figure: ring with nodes d471f1, d467c4, d462ba, d4213f, d13da3 and 65a1fc; Route(d46a1c) crosses the ring toward the key d46a1c]
7Leaf Sets
- Each node maintains IP addresses of the nodes with the L numerically closest larger and smaller nodeIds, respectively.
- routing efficiency/robustness
- fault detection (keep-alive)
- application-specific local coordination
8Routing Procedure
if (D is within range of our leaf set)
    forward to numerically closest member
else
    let l = length of shared prefix
    let d = value of l-th digit in D's address
    if (RouteTab[l,d] exists)
        forward to RouteTab[l,d]
    else
        forward to a known node that
        (a) shares at least as long a prefix
        (b) is numerically closer than this node
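The procedure above can be made concrete with a small sketch. It is an illustration under stated assumptions rather than Pastry's implementation: ids are shortened to 6 hex digits like the slide's example ids (Pastry uses 128-bit ids), the routing table is a plain dict keyed by (row, digit), and the names `next_hop`, `smaller`, `larger` and `known` are hypothetical.

```python
B = 4                          # bits per digit, so digits are hexadecimal
DIGITS = 6                     # shortened ids for illustration; 128 / B = 32 in Pastry
ID_SPACE = 1 << (B * DIGITS)


def digit(x, i):
    """i-th digit of x, counting from the most significant digit (i = 0)."""
    return (x >> (B * (DIGITS - 1 - i))) & ((1 << B) - 1)


def shared_prefix_len(a, b):
    """Number of leading digits that a and b have in common."""
    n = 0
    while n < DIGITS and digit(a, n) == digit(b, n):
        n += 1
    return n


def ring_distance(a, b):
    d = abs(a - b)
    return min(d, ID_SPACE - d)


def next_hop(self_id, key, smaller, larger, route_tab, known):
    """Return the id to forward to, or self_id if this node is the destination."""
    # Case 1: the key lies on the arc covered by the leaf set.
    lo = max(smaller, key=lambda n: (self_id - n) % ID_SPACE, default=self_id)
    hi = max(larger, key=lambda n: (n - self_id) % ID_SPACE, default=self_id)
    if (key - lo) % ID_SPACE <= (hi - lo) % ID_SPACE:
        return min([self_id, *smaller, *larger], key=lambda n: ring_distance(n, key))
    # Case 2: use the routing table entry for row l, column d.
    l = shared_prefix_len(self_id, key)
    d = digit(key, l)
    entry = route_tab.get((l, d))
    if entry is not None:
        return entry
    # Case 3: any known node with an equally long prefix that is numerically closer.
    for n in known:
        if (shared_prefix_len(n, key) >= l
                and ring_distance(n, key) < ring_distance(self_id, key)):
            return n
    return self_id
```

For example, a node 65a1fc routing the key d46a1c shares no prefix with it, so it consults row 0, column d of its table; a node d462ba whose leaf set spans d4213f to d471f1 hands the message directly to the closest leaf.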
9Routing
- Integrity of overlay
- guaranteed unless L/2 simultaneous failures of nodes with adjacent nodeIds
- Number of routing hops
- No failures: $< \log_{16} N$ expected, $128/b + 1$ max
- During failure recovery:
- O(N) worst case, average case much better
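To get a feel for the hop counts above, the two bounds can be evaluated directly. This is a small worked calculation, assuming b = 4 bits per digit (which gives the base 16 used on the slides); the function names are hypothetical.

```python
import math


def expected_hops_bound(n_nodes: int, b: int = 4) -> float:
    """Bound on the expected hop count without failures: log base 2^b of N."""
    return math.log(n_nodes, 2 ** b)


def max_hops(b: int = 4, id_bits: int = 128) -> int:
    """Maximum hop count without failures: one hop per digit, plus one."""
    return id_bits // b + 1
```

With one million nodes, `expected_hops_bound(10**6)` is just under 5, while `max_hops()` is 33.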
10Node Addition
New node: d46a1c
[Figure: the join message for new node d46a1c is routed with Route(d46a1c) across a ring with nodes d471f1, d467c4, d462ba, d4213f, d13da3 and 65a1fc]
11Node Departure (Failure)
- Leaf set members exchange keep-alive messages
- Leaf set repair (eager): request set from farthest live node in set
- Routing table repair (lazy): get table from peers in the same row, then higher rows
12PAST: Cooperative, archival file storage and distribution
- Layered on top of Pastry
- Strong persistence
- High availability
- Scalability
- Reduced cost (no backup)
- Efficient use of pooled resources
13PAST API
- Insert - store replica of a file at k diverse storage nodes
- Lookup - retrieve file from a nearby live storage node that holds a copy
- Reclaim - free storage associated with a file
- Files are immutable
14PAST File storage
[Figure: Insert fileId is routed across the id ring toward fileId]
15PAST File storage
Storage Invariant: File replicas are stored on k nodes with nodeIds closest to fileId (k is bounded by the leaf set size)
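The storage invariant amounts to picking the k nodeIds nearest the fileId. The sketch below illustrates that selection only; it is not PAST's code, the names `replica_nodes` and `leaf_set_size` are hypothetical, and it assumes a global view of all nodeIds, which a real node does not have (it would use its leaf set instead).

```python
ID_SPACE = 1 << 128


def ring_distance(a: int, b: int) -> int:
    """Shortest distance between two ids on the circular id space."""
    d = abs(a - b)
    return min(d, ID_SPACE - d)


def replica_nodes(file_id: int, node_ids: list, k: int, leaf_set_size: int) -> list:
    """Return the k nodeIds closest to file_id, nearest first."""
    if k > leaf_set_size:
        raise ValueError("k is bounded by the leaf set size")
    return sorted(node_ids, key=lambda n: ring_distance(n, file_id))[:k]
```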
16PAST File Retrieval
Lookup: file located in $\log_{16} N$ steps (expected); usually locates replica nearest client C
[Figure: client C issues Lookup for fileId; k replicas are stored near fileId on the ring]
17DHT Deployment Today
Every application deploys its own DHT (DHT as a library)
[Figure: three layers. Top: applications PAST (MSR/Rice), i3 (UCB), Overnet (open), CFS (MIT), OStore (UCB), PIER (UCB), pSearch (HP), Coral (NYU). Middle: one DHT per application (Chord, Pastry, Tapestry, Bamboo, CAN, Kademlia, Chord, Kademlia). Bottom: IP connectivity]
18DHT Deployment Tomorrow?
OpenDHT: one DHT, shared across applications (DHT as a service)
[Figure: three layers. Top: applications PAST (MSR/Rice), i3 (UCB), Overnet (open), CFS (MIT), OStore (UCB), PIER (UCB), pSearch (HP), Coral (NYU), shown with their DHTs (Chord, Pastry, Tapestry, Bamboo, CAN, Kademlia, Chord, Kademlia). Middle: a single shared layer labeled DHT indirection. Bottom: IP connectivity]
19Two Ways To Use a DHT
- The Library Model
- DHT code is linked into application binary
- Pros: flexibility, high performance
- The Service Model
- DHT accessed as a service over RPC
- Pros: easier deployment, less maintenance
20The OpenDHT Service
- 200-300 Bamboo [USENIX'04] nodes on PlanetLab
- All in one slice, all managed by us
- Clients can be arbitrary Internet hosts
- Access DHT using RPC over TCP
- Interface is simple put/get
- put(key, value) stores value under key
- get(key) returns all the values stored under key
- Running on PlanetLab since April 2004
- Building a community of users
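The put/get semantics above, where a key can hold several values, can be illustrated with an in-memory stand-in. This is a toy sketch, not the OpenDHT client or its RPC interface; the class name `ToyDHT` is hypothetical and nothing here goes over the network.

```python
from collections import defaultdict


class ToyDHT:
    """In-memory stand-in showing multi-value put/get semantics only."""

    def __init__(self):
        self._store = defaultdict(list)

    def put(self, key: bytes, value: bytes) -> None:
        # A second put under the same key adds a value; it does not overwrite.
        self._store[key].append(value)

    def get(self, key: bytes) -> list:
        # Returns all values stored under key (empty list if there are none).
        return list(self._store.get(key, []))
```

Two clients that put under the same key both see their values returned by a later get, which is why the question of who may remove a value matters in a shared deployment.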
21OpenDHT Applications
22An Example Application: The CD Database
[Figure: the client computes a disc fingerprint and receives album and track titles]
23An Example Application: The CD Database
[Figure: on "No Such Fingerprint", the user types in album and track titles]
24A DHT-Based FreeDB Cache
- FreeDB is a volunteer service
- Has suffered outages as long as 48 hours
- Service costs borne largely by volunteer mirrors
- Idea: build a cache of FreeDB with a DHT
- Add to availability of main service
- Goal: explore how easy this is to do
25Cache Illustration
[Figure: cache illustration; labels: New Albums, Disc Fingerprint, Disc Info]
26Is Providing DHT Service Hard?
- Is it any different than just running Bamboo?
- Yes, sharing makes the problem harder
- OpenDHT is shared in two senses
- Across applications → need a flexible interface
- Across clients → need resource allocation
27Sharing Between Applications
- Must balance generality and ease-of-use
- Many apps want only simple put/get
- Others want lookup, anycast, multicast, etc.
- OpenDHT allows only put/get
- But use client-side library, ReDiR, to build others
- Supports lookup, anycast, multicast, range search
- Only constant latency increase on average
- (Different approach used by DimChord [KR04])
28Sharing Between Clients
- Must authenticate puts/gets/removes
- If two clients put with same key, who wins?
- Who can remove an existing put?
- Must protect system's resources
- Or malicious clients can deny service to others
- The remainder of this talk
29Fair Storage Allocation
- Our solution: give each client a fair share
- Will define fairness in a few slides
- Limits strength of malicious clients
- Only as powerful as they are numerous
- Protect storage on each DHT node separately
- Global fairness is hard
- Key choice imbalance is a burden on DHT
- Reward clients that balance their key choices
30Two Main Challenges
- Making sure disk is available for new puts
- As load changes over time, need to adapt
- Without some free disk, our hands are tied
- Allocating free disk fairly across clients
- Adapt techniques from fair queuing
31Making Sure Disk is Available
- Can't store values indefinitely
- Otherwise all storage will eventually fill
- Add time-to-live (TTL) to puts
- put(key, value) → put(key, value, ttl)
- (Different approach used by Palimpsest [RH03])
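A minimal sketch of what adding a TTL to puts means for storage: each value carries an expiry time and stops being returned once that time passes. The class name `TTLStore` and the injectable `clock` parameter are assumptions made for illustration and testing; this is not OpenDHT's storage layer.

```python
import time


class TTLStore:
    """Toy store where every put expires after ttl seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> list of (value, expiry time)

    def put(self, key, value, ttl):
        self._store.setdefault(key, []).append((value, self._clock() + ttl))

    def get(self, key):
        now = self._clock()
        # Keep only values whose expiry lies in the future.
        live = [(v, e) for v, e in self._store.get(key, []) if e > now]
        if live:
            self._store[key] = live
        else:
            self._store.pop(key, None)  # reclaim the space of expired keys
        return [v for v, _ in live]
```

Because every value eventually expires, disk space is always reclaimed in the long run, without any client having to issue a remove.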
32Making Sure Disk is Available
- TTLs prevent long-term starvation
- Eventually all puts will expire
- Can still get short-term starvation
33Making Sure Disk is Available
- Stronger condition
- Be able to accept $r_{\min}$ bytes/sec of new data at all times
35Fair Storage Allocation
Store and send accept message to client
36Defining Most Under-Represented
- Not just sharing disk, but disk over time
- 1-byte put for 100 s same as 100-byte put for 1 s
- So units are bytes × seconds, call them commitments
- Equalize total commitments granted?
- No: leads to starvation
- A fills disk, B starts putting, A starves up to max TTL
38Defining Most Under-Represented
- Instead, equalize rate of commitments granted
- Service granted to one client depends only on others putting at same time
- Mechanism inspired by Start-time Fair Queuing
- Have virtual time, $v(t)$
- Each put gets a start time $S(p_c^i)$ and finish time $F(p_c^i)$
- $F(p_c^i) = S(p_c^i) + \mathrm{size}(p_c^i) \times \mathrm{ttl}(p_c^i)$
- $S(p_c^i) = \max(v(A(p_c^i)) - \alpha, F(p_c^{i-1}))$
- $v(t)$ = maximum start time of all accepted puts
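The tagging rules above can be sketched as a small per-node allocator. This is an illustration under several assumptions, not OpenDHT's code: the class and method names are hypothetical, the constant subtracted from virtual time is written `alpha`, a client's first put uses 0 in place of the previous finish time, and queued puts are accepted in order of lowest start time, following Start-time Fair Queuing.

```python
class FairAllocator:
    """Toy start/finish tagging for puts, in units of bytes x seconds."""

    def __init__(self, alpha: float = 0.0):
        self.alpha = alpha
        self.v = 0.0           # virtual time: max start time of accepted puts
        self.last_finish = {}  # client -> finish time of that client's previous put

    def tag(self, client, size, ttl):
        """Compute (start, finish) for a put arriving now."""
        start = max(self.v - self.alpha, self.last_finish.get(client, 0.0))
        finish = start + size * ttl  # commitment = size x ttl
        self.last_finish[client] = finish
        return start, finish

    def accept(self, start):
        """Record that a put with this start time was accepted."""
        self.v = max(self.v, start)


def pick_next(queued):
    """Choose the queued (start, client) pair with the lowest start time."""
    return min(queued)
```

A 1-byte put for 100 s and a 100-byte put for 1 s advance a client's finish time by the same amount. A client that arrives late starts near the current virtual time, so it receives no credit for the period when it was idle.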
39Fairness with Different Arrival Times
40Fairness With Different Sizes and TTLs
41Performance
- Only 28 of 7 million values lost in 3 months
- Where "lost" means unavailable for a full hour
- On Feb. 7, 2005, lost 60/190 nodes in 15 minutes to a PlanetLab kernel bug; only one value was lost
42Performance
- Median get latency 250 ms
- Median RTT between hosts 140 ms
- But 95th percentile get latency is atrocious
- And even median spikes up from time to time
43The Problem: Slow Nodes
- Some PlanetLab nodes are just really slow
- But set of slow nodes changes over time
- Can't cherry-pick a set of fast nodes
- Seems to be the case on RON as well
- May even be true for managed clusters (MapReduce)
- Modified OpenDHT to be robust to such slowness
- Combination of delay-aware routing and redundancy
- Median now 66 ms, 99th percentile is 320 ms
- using 2X redundancy