Title: Peer-to-Peer Networks
1. Peer-to-Peer Networks
- Distributed Algorithms for P2P
- Distributed Hash Tables
P. Felber, Pascal.Felber@eurecom.fr, http://www.eurecom.fr/felber/
2. Agenda
- What are DHTs? Why are they useful?
- What makes a good DHT design
- Case studies
- Chord
- Pastry (locality)
- TOPLUS (topology-awareness)
- What are the open problems?
3. What is P2P?
- A distributed system architecture
- No centralized control
- Typically many nodes, but unreliable and heterogeneous
- Nodes are symmetric in function
- Take advantage of distributed, shared resources (bandwidth, CPU, storage) on peer nodes
- Fault-tolerant, self-organizing
- Operate in a dynamic environment; frequent joins and leaves are the norm
4. P2P Challenge: Locating Content
- Simple strategy: expanding-ring search until the content is found
- If r of N nodes have a copy, the expected search cost is at least N / r, i.e., O(N)
- Need many copies to keep overhead small
Who has this paper?
5. Directed Searches
- Idea
- Assign particular nodes to hold particular content (or know where it is)
- When a node wants this content, go to the node that is supposed to hold it (or know where it is)
- Challenges
- Avoid bottlenecks: distribute the responsibilities evenly among the existing nodes
- Adaptation to nodes joining or leaving (or failing)
- Give responsibilities to joining nodes
- Redistribute responsibilities from leaving nodes
6. Idea: Hash Tables
- A hash table associates data with keys
- A key is hashed to find a bucket in the hash table
- Each bucket is expected to hold #items/#buckets items
- In a Distributed Hash Table (DHT), nodes are the hash buckets
- A key is hashed to find the responsible peer node
- Data and load are balanced across nodes
7. DHTs: Problems
- Problem 1 (dynamicity): adding or removing nodes
- With hash mod N, virtually every key will change its location!
- h(k) mod m ≠ h(k) mod (m+1) ≠ h(k) mod (m-1)
- Solution: use consistent hashing
- Define a fixed hash space
- All hash values fall within that space and do not depend on the number of peers (hash buckets)
- Each key goes to the peer closest to its ID in the hash space (according to some proximity metric), as sketched below
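A minimal consistent-hashing sketch in Python (not from the original slides; SHA-1, the 32-bit ring size, and the use of the successor rule as the proximity metric are illustrative assumptions):

    import hashlib
    from bisect import bisect_left

    M = 32                                   # fixed hash space of size 2^M

    def h(value: str) -> int:
        """Hash into the fixed space, independent of the number of peers."""
        return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** M)

    class ConsistentHashRing:
        def __init__(self, nodes):
            # Each node owns the point h(node) on the ring.
            self.points = sorted((h(n), n) for n in nodes)

        def responsible_node(self, key: str) -> str:
            """The key goes to the first node at or after h(key), wrapping around."""
            ids = [p for p, _ in self.points]
            i = bisect_left(ids, h(key)) % len(self.points)
            return self.points[i][1]

    ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    print(ring.responsible_node("some-file"))
    # Adding or removing one node only moves the keys adjacent to it on the ring.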
8. DHTs: Problems (cont'd)
- Problem 2 (size): all nodes must be known to insert or lookup data
- Works with small and static server populations
- Solution: each peer knows of only a few neighbors
- Messages are routed through neighbors via multiple hops (overlay routing)
9. What Makes a Good DHT Design?
- For each object, the node(s) responsible for that object should be reachable via a short path (small diameter)
- The different DHTs differ fundamentally only in their routing approach
- The number of neighbors of each node should remain reasonable (small degree)
- DHT routing mechanisms should be decentralized (no single point of failure or bottleneck)
- Should gracefully handle nodes joining and leaving
- Repartition the affected keys over existing nodes
- Reorganize the neighbor sets
- Bootstrap mechanisms to connect new nodes into the DHT
- To achieve good performance, the DHT must provide low stretch
- Minimize the ratio of DHT routing latency to unicast (IP) latency
10. DHT Interface
- Minimal interface (data-centric)
- Lookup(key) → IP address
- Supports a wide range of applications, because it imposes few restrictions
- Keys have no semantic meaning
- Value is application dependent
- DHTs do not store the data
- Data storage can be built on top of DHTs (see the sketch below)
- Lookup(key) → data
- Insert(key, data)
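A minimal sketch of this layering in Python (the class and helper names are hypothetical; the deck specifies only the operations, not an API, and the transport calls are stubs):

    from typing import Protocol

    class DHT(Protocol):
        def lookup(self, key: bytes) -> str:
            """lookup(key) -> IP address of the node responsible for key."""
            ...

    def send_store(node_ip: str, key: bytes, data: bytes) -> None:
        """Stub transport call; a real system would use TCP/IP here."""
        print(f"store {key!r} at {node_ip}")

    def send_fetch(node_ip: str, key: bytes) -> bytes:
        """Stub transport call."""
        return b"..."

    class Storage:
        """Data storage built on top of the DHT; the DHT itself stores no data."""
        def __init__(self, dht: DHT):
            self.dht = dht

        def insert(self, key: bytes, data: bytes) -> None:
            send_store(self.dht.lookup(key), key, data)

        def lookup(self, key: bytes) -> bytes:
            return send_fetch(self.dht.lookup(key), key)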
11. DHTs in Context
Layered view (example: the CFS file system on top of the Chord DHT):
  User Application
    | load_file / store_file
  File System (CFS): retrieve and store files, map files to blocks
    | load_block / store_block
  Reliable Block Storage (DHash): storage, replication, caching
    | lookup
  DHT (Chord): lookup, routing
    | send / receive
  Transport (TCP/IP): communication
12. DHTs Support Many Applications
- File sharing: CFS, OceanStore, PAST, ...
- Web cache: Squirrel, ...
- Censor-resistant stores: Eternity, FreeNet, ...
- Application-layer multicast: Narada, ...
- Event notification: Scribe
- Naming systems: ChordDNS, INS, ...
- Query and indexing: Kademlia, ...
- Communication primitives: I3, ...
- Backup store: HiveNet
- Web archive: Herodotus
13. DHT Case Studies
- Case Studies
- Chord
- Pastry
- TOPLUS
- Questions
- How is the hash space divided evenly among nodes?
- How do we locate a node?
- How do we maintain routing tables?
- How do we cope with (rapid) changes in membership?
14. Chord (MIT)
- Circular m-bit ID space for both keys and nodes
- Node ID: SHA-1(IP address)
- Key ID: SHA-1(key)
- A key is mapped to the first node whose ID is equal to or follows the key ID (see the sketch below)
- Each node is responsible for O(K/N) keys
- O(K/N) keys move when a node joins or leaves
[Figure: Chord ring with m = 6, IDs ranging from 0 to 2^m - 1]
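A minimal sketch of this mapping in Python, using m = 6 and the node IDs from the figure (the global view of all node IDs is for illustration only; a real Chord node finds the successor by routing, as shown on the next slides):

    import hashlib

    M = 6                                    # ID space of size 2^M, as in the figure

    def chord_id(value: str) -> int:
        """SHA-1 hash truncated to the m-bit ring."""
        return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** M)

    def successor(key_id: int, node_ids: list) -> int:
        """First node whose ID is equal to or follows the key ID (with wraparound)."""
        candidates = [n for n in sorted(node_ids) if n >= key_id]
        return candidates[0] if candidates else min(node_ids)

    nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # node IDs from the figure
    print(successor(54, nodes))                       # key K54 is stored at N56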
15. Chord: State and Lookup (1)
- Basic Chord: each node knows only 2 other nodes on the ring
- Successor
- Predecessor (for ring management)
- Lookup is achieved by forwarding requests around the ring through successor pointers
- Requires O(N) hops
[Figure: Chord ring (m = 6) with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56; key K54 is reached by following successor pointers]
16. Chord: State and Lookup (2)
- Finger table: each node knows m other nodes on the ring
- Successors: finger i of n points to the node at n + 2^i (or its successor)
- Predecessor (for ring management)
- O(log N) state per node
- Lookup is achieved by following the closest preceding finger, then the successor (see the sketch below)
- O(log N) hops
[Figure: Chord ring (m = 6) with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56 and key K54; finger table of N8: N8+1 → N14, N8+2 → N14, N8+4 → N14, N8+8 → N21, N8+16 → N32, N8+32 → N42]
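A sketch of finger-based lookup in Python (node objects stand in for remote peers, so the remote calls of real Chord are simulated by direct references; the interval helper treats intervals as circular):

    M = 6

    def in_interval(x: int, a: int, b: int) -> bool:
        """True if x lies in the circular open interval (a, b) on the 2^M ring."""
        a, b = a % 2 ** M, b % 2 ** M
        return (a < x < b) if a < b else (x > a or x < b)

    class Node:
        def __init__(self, node_id: int):
            self.id = node_id
            self.successor = self          # set during join/stabilization
            self.fingers = []              # finger[i] ~ successor(id + 2^i)

        def closest_preceding_finger(self, key_id: int) -> "Node":
            for f in reversed(self.fingers):           # O(log N) state
                if in_interval(f.id, self.id, key_id):
                    return f
            return self

        def find_successor(self, key_id: int) -> "Node":
            n = self
            # Follow fingers until key_id falls between n and n.successor.
            while not in_interval(key_id, n.id, n.successor.id + 1):
                if n.successor is n:                   # single-node ring
                    return n
                nxt = n.closest_preceding_finger(key_id)
                n = nxt if nxt is not n else n.successor
            return n.successor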
17. Chord: Ring Management
- For correctness, Chord needs to maintain the following invariants
- For every key k, succ(k) is responsible for k
- Successor pointers are correctly maintained
- Finger tables are not necessary for correctness
- One can always default to successor-based lookup
- Finger tables can be updated lazily
18. Joining the Ring
- Three-step process
- Initialize all fingers of new node
- Update fingers of existing nodes
- Transfer keys from successor to new node
19. Joining the Ring: Step 1
- Initialize the new node's finger table
- Locate any node n already in the ring
- Ask n to look up the peers at j + 2^0, j + 2^1, j + 2^2
- Use the results to populate the finger table of j
20. Joining the Ring: Step 2
- Updating fingers of existing nodes
- New node j calls an update function on the existing nodes that must point to j
- These are the nodes in the ranges [pred(j) - 2^i + 1, j - 2^i], for each finger index i
- O(log N) nodes need to be updated
[Figure: node N28 joins the Chord ring (m = 6); the finger N8+16, which previously pointed to N32, is updated to point to N28]
21. Joining the Ring: Step 3
- Transfer key responsibility
- Connect to successor
- Copy keys from successor to new node
- Update successor pointer and remove keys
- Only keys in the range are transferred
22. Stabilization
- Case 1: finger tables are reasonably fresh
- Case 2: successor pointers are correct, but fingers are not
- Case 3: successor pointers are inaccurate or key migration is incomplete (MUST BE AVOIDED!)
- The stabilization algorithm periodically verifies and refreshes node pointers (including fingers)
- Basic principle (at node n), as sketched below:
- x = n.succ.pred
- if x ∈ (n, n.succ) then n.succ = x
- notify n.succ
- Eventually stabilizes the system when no node joins or fails
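The same principle as runnable Python (object references stand in for remote peers; the circular-interval helper is repeated here so the sketch is self-contained):

    M = 6

    def in_interval(x: int, a: int, b: int) -> bool:
        """True if x lies in the circular open interval (a, b) on the 2^M ring."""
        a, b = a % 2 ** M, b % 2 ** M
        return (a < x < b) if a < b else (x > a or x < b)

    class Node:
        def __init__(self, node_id: int):
            self.id = node_id
            self.successor = self
            self.predecessor = None

        def stabilize(self):
            """Run periodically: adopt any node that slipped in between us and
            our successor, then tell the successor about us."""
            x = self.successor.predecessor
            if x is not None and in_interval(x.id, self.id, self.successor.id):
                self.successor = x
            self.successor.notify(self)

        def notify(self, n: "Node"):
            """n thinks it might be our predecessor."""
            if self.predecessor is None or in_interval(n.id, self.predecessor.id, self.id):
                self.predecessor = n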
23. Dealing With Failures
- Failure of nodes might cause incorrect lookups
- N8 does not know its correct successor, so a lookup of K19 fails
- Solution: successor list
- Each node n knows its r immediate successors
- After a failure, n knows the first live successor and updates its successor list
- Correct successors guarantee correct lookups
[Figure: Chord ring (m = 6) with nodes N1, N8, N14, N18, N21, N32, N38, N42, N48, N51, N56; lookup(K19) issued at N8 fails after a node failure, even though N18 is responsible for K19]
24. Dealing With Failures (cont'd)
- Successor lists guarantee correct lookup with some probability
- Can choose r to make the probability of lookup failure arbitrarily small
- Assume half of the nodes fail and that failures are independent
- P(n's successor list all dead) = 0.5^r
- P(n does not break the Chord ring) = 1 - 0.5^r
- P(no broken nodes) = (1 - 0.5^r)^N
- r = 2 log2(N) makes this probability 1 - 1/N (see the worked instance below)
- With high probability (1 - 1/N), the ring is not broken
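Worked instance of the slide's formulas in LaTeX, plugging in r = 2 log2 N:

    P(\text{$n$'s successor list all dead}) = 0.5^{\,r} = 0.5^{\,2\log_2 N} = N^{-2}
    \qquad
    P(\text{no broken node}) = \bigl(1 - N^{-2}\bigr)^{N} \approx 1 - \tfrac{1}{N}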
25. Evolution of P2P Systems
- Nodes leave frequently, so surviving nodes must be notified of arrivals to stay connected after their original neighbors fail
- Take time t with N nodes
- Doubling time: time from t until N new nodes join
- Halving time: time from t until N nodes leave
- Half-life: minimum of halving and doubling time
- Theorem: there exists a sequence of joins and leaves such that any node that has received fewer than k notifications per half-life will be disconnected with probability at least (1 - 1/(e-1))^k ≈ 0.418^k
26. Chord and Network Topology
Nodes that are numerically close are not topologically close (1M nodes ≈ 10 hops)
27. Pastry (MSR)
- Circular m-bit ID space for both keys and nodes
- Addresses in base 2^b with m/b digits
- Node ID: SHA-1(IP address)
- Key ID: SHA-1(key)
- A key is mapped to the node whose ID is numerically closest to the key ID
[Figure: Pastry ring with m = 8 and b = 2, IDs ranging from 0 to 2^m - 1]
28. Pastry: Lookup
- Prefix routing from A to B
- At the h-th hop, arrive at a node that shares a prefix of at least h digits with B
- Example: 5324 routes to 0629 via 5324 → 0748 → 0605 → 0620 → 0629
- If there is no such node, forward the message to a neighbor numerically closer to the destination (successor): 5324 → 0748 → 0605 → 0609 → 0620 → 0629
- O(log_{2^b} N) hops
29. Pastry: State and Lookup
- For each prefix, a node knows some other node (if any) with the same prefix and a different next digit
- For instance, N0201:
- (empty prefix): N1???, N2???, N3???
- N0: N00??, N01??, N03??
- N02: N021?, N022?, N023?
- N020: N0200, N0202, N0203
- When there are multiple candidate nodes, choose the topologically closest
- This maintains good locality properties (more on that later); a next-hop sketch follows the figure below
[Figure: routing table of N0201 on a Pastry ring (m = 8, b = 2) with nodes N0002, N0122, N0212, N0221, N0233, N0322, N1113, N2001, N2120, N2222, N3001, N3033, N3200; key K2120 is routed to N2120]
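A sketch of Pastry's next-hop choice in Python (b = 2 and IDs are base-4 digit strings; the routing table and leaf set are plain Python structures, and ID-space wraparound is ignored for brevity, so this is an illustrative simplification rather than the full algorithm):

    def shared_prefix_len(a: str, b: str) -> int:
        """Number of leading digits shared by two IDs."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def numeric(node_id: str) -> int:
        return int(node_id, 4)                     # IDs are base-4 digit strings

    def next_hop(local_id: str, key: str, routing_table: dict, leaf_set: list):
        """routing_table maps (row, next_digit) -> node ID; leaf_set is a list of IDs."""
        # 1. If the key falls within the leaf set, go straight to the closest node.
        ids = leaf_set + [local_id]
        if min(map(numeric, ids)) <= numeric(key) <= max(map(numeric, ids)):
            return min(ids, key=lambda n: abs(numeric(n) - numeric(key)))
        # 2. Otherwise use the routing table entry that extends the shared prefix.
        h = shared_prefix_len(local_id, key)
        candidate = routing_table.get((h, key[h]))
        if candidate is not None:
            return candidate
        # 3. Fall back to any known node that is numerically closer to the key
        #    and shares at least as long a prefix with it.
        known = leaf_set + list(routing_table.values())
        closer = [n for n in known
                  if shared_prefix_len(n, key) >= h
                  and abs(numeric(n) - numeric(key)) < abs(numeric(local_id) - numeric(key))]
        return min(closer, key=lambda n: abs(numeric(n) - numeric(key)), default=None)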
30. A Pastry Routing Table
b = 2, so node IDs are written in base 4; m = 16, so IDs have m/b = 8 digits. Node ID: 10233102

Leaf set: the nodes numerically closest to the local node (MUST BE UP TO DATE)
  SMALLER: 10233033  10233021  10233001  10233000
  LARGER:  10233120  10233122  10233230  10233232

Routing table: m/b rows with 2^b - 1 entries per row. Entries in row n share their first n digits with the local node (common prefix + next digit + rest); the entry in column d has d as its next digit; the column corresponding to the local node's own digit at that position is shown as (digit); entries with no suitable node ID are left empty.
  row 0: 02212102  (1)       22301203  31203203
  row 1: (0)       11301233  12230203  13021022
  row 2: 10031203  10132102  (2)       10323302
  row 3: 10200230  10211302  10222302  (3)
  row 4: 10230322  10231000  10232121  (3)
  row 5: 10233001  (1)       10233232  (empty)
  row 6: (0)       (empty)   10233120  (empty)
  row 7: (empty)   (empty)   (2)       (empty)

Neighborhood set: the nodes closest to the local node according to the proximity metric
  13021022  10200230  11301233  31301233
  02212102  22301203  31203203  33213321
31. Pastry and Network Topology
Expected node distance increases with the row number in the routing table: smaller and smaller numerical jumps, bigger and bigger topological jumps
32. Joining
[Figure: node X with ID 0629 joins; construction of 0629's routing table]
33. Locality
- The joining phase preserves the locality property
- First, A must be near X
- Entries in row zero of A's routing table are close to A, and A is close to X, so X's row zero can be taken from A (X0 = A0)
- The distance from B to the nodes in B's row one is much larger than the distance from A to B (B is in A's row zero), so B's row one can be a reasonable choice for X's row one, C's row two for X's row two, etc.
- To avoid cascading errors, X requests the state from each of the nodes in its routing table and updates its own table with any closer node
- This scheme works pretty well in practice
- It minimizes the distance of the next routing step, with no sense of global direction
- Stretch around 2-3
34. Node Departure
- A node is considered failed when its immediate neighbors in the node ID space can no longer communicate with it
- To replace a failed node in the leaf set, a node contacts the live node with the largest index on the side of the failed node and asks for its leaf set
- To repair a failed routing table entry R_d^l, a node first contacts the node referred to by another entry R_i^l (i ≠ d) of the same row and asks for that node's entry for R_d^l
- If a member of the neighborhood set M is not responding, the node asks the other members for their M sets, checks the distance of each newly discovered node, and updates its own M set
35. CAN (Berkeley)
- Cartesian space (d-dimensional)
- The space wraps around: a d-torus
- Incrementally split the space between nodes that join
- The node (cell) responsible for key k is determined by hashing k once for each dimension
[Figure: 2-dimensional CAN space (d = 2) split into cells]
36. CAN: State and Lookup
- A node A maintains state only for its immediate neighbors (N, S, E, W)
- 2d neighbors per node
- Messages are routed to the neighbor that minimizes the Cartesian distance to the destination (see the sketch below)
- More dimensions mean faster routing but also more state
- (d N^(1/d)) / 4 hops on average
- Multiple choices: we can route around failures
[Figure: 2-dimensional CAN; node A forwards towards destination B via its neighbors N, S, E, W]
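A sketch of CAN's greedy forwarding rule in Python (coordinates are points on the unit d-torus and the neighbor set is a plain list, which simplifies the real zone bookkeeping):

    import math

    def torus_distance(p, q):
        """Cartesian distance on the unit d-torus (each coordinate wraps around)."""
        return math.sqrt(sum(min(abs(a - b), 1 - abs(a - b)) ** 2 for a, b in zip(p, q)))

    def next_hop(current, neighbors, target):
        """Forward to the neighbor closest to the target, if it is strictly
        closer than the current node; otherwise the current zone owns the key."""
        best = min(neighbors, key=lambda n: torus_distance(n, target), default=None)
        if best is None or torus_distance(best, target) >= torus_distance(current, target):
            return None
        return best

    # Example in d = 2: route from (0.1, 0.1) towards the key point (0.8, 0.7)
    print(next_hop((0.1, 0.1), [(0.3, 0.1), (0.1, 0.3), (0.9, 0.1)], (0.8, 0.7)))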
37. CAN: Landmark Routing
- CAN nodes do not have a pre-defined ID
- Nodes can be placed according to locality
- Use a well-known set of m landmark machines (e.g., root DNS servers)
- Each CAN node measures its RTT to each landmark
- Orders the landmarks by increasing RTT: m! possible orderings
- CAN construction
- Place nodes with the same ordering close together in the CAN
- To do so, partition the space into m! zones: m zones on x, m-1 on y, etc.
- A node interprets its ordering as the coordinates of its zone (see the sketch below)
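A sketch of the ordering-based placement in Python (the landmark names and RTT values are made up; itertools enumerates the m! possible orderings):

    from itertools import permutations

    landmarks = ["l1", "l2", "l3"]                    # m = 3 well-known machines
    all_orderings = list(permutations(landmarks))     # m! = 6 possible orderings / zones

    def zone_index(rtts: dict) -> int:
        """Order landmarks by increasing RTT and map the ordering to a zone number."""
        ordering = tuple(sorted(landmarks, key=lambda l: rtts[l]))
        return all_orderings.index(ordering)

    print(zone_index({"l1": 40.0, "l2": 10.0, "l3": 25.0}))   # ordering (l2, l3, l1) -> zone 3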
38. CAN and Network Topology
[Figure: CAN space split into m! = 6 zones labelled by landmark orderings (ABC, ACB, BAC, BCA, CAB, CBA) for m = 3 landmarks A, B, C]
Use m landmarks to split the space into m! zones
Nodes get a random zone within their ordering's zone
Topologically-close nodes tend to be in the same zone
39. Topology-Awareness
- Problem
- P2P lookup services generally do not take topology into account
- In Chord/CAN/Pastry, neighbors are often not locally nearby
- Goals
- Provide small stretch: route packets to their destination along a path that mimics the router-level shortest-path distance
- Stretch = DHT routing / IP routing
- Our solution
- TOPLUS (TOPology-centric Look-Up Service)
- An extremist design for topology-aware DHTs
40. TOPLUS Architecture
Group nodes into nested groups using IP prefixes: AS, ISP, LAN (an IP prefix is a contiguous address range of the form w.x.y.z/n)
Use the IPv4 address range (32 bits) for node IDs and key IDs
Assumption: nodes with the same IP prefix are topologically close
[Figure: the IP address space partitioned into nested groups]
41. Node State
Each node n is part of a series of telescoping sets H_i with sibling sets S_i
Node n must know all up nodes in its inner group
Node n must know one delegate node in each tier-i set S ∈ S_i
[Figure: node state across the nested groups of the IP address space]
42. Routing with the XOR Metric
- To look up key k, node n forwards the request to the node in its routing table whose ID j is closest to k according to the XOR metric (see the sketch below)
- For j = j31 j30 ... j0 and k = k31 k30 ... k0, d(j, k) = sum over i of |j_i - k_i| * 2^i (i.e., j XOR k read as an integer)
- Refinement of longest-prefix match
- Note that the closest ID is unique: d(j, k) = d(j', k) implies j = j'
- Example (8 bits)
- k = 10010110
- j = 10110110: d(j, k) = 2^5 = 32
- j' = 10001001: d(j', k) = 2^4 + 2^3 + 2^2 + 2^1 + 2^0 = 31
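The same computation in Python (the 8-bit example above, plus a helper for picking the routing-table entry closest to k; the integer literals are just the example IDs):

    def xor_distance(j: int, k: int) -> int:
        """d(j, k) = j XOR k, read as an integer."""
        return j ^ k

    k = 0b10010110
    j1 = 0b10110110
    j2 = 0b10001001
    print(xor_distance(j1, k))   # 32  (= 2^5)
    print(xor_distance(j2, k))   # 31  (= 2^4 + 2^3 + 2^2 + 2^1 + 2^0)

    def closest_entry(routing_table: list, key: int) -> int:
        """Forward to the known ID closest to the key under the XOR metric."""
        return min(routing_table, key=lambda node_id: xor_distance(node_id, key))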
43. Prefix Routing: Lookup
Compute the 32-bit key k (using a hash function)
Perform longest-prefix match against the entries in the routing table, using the XOR metric
Route the message to the node in the inner group with the closest ID (according to the XOR metric)
[Figure: lookup of key k routed down the nested IP-prefix groups]
44. TOPLUS and Network Topology
Smaller and smaller numerical and topological
jumps
Always move closer to the destination
45. Group Maintenance
- To join the system, a node n finds its closest node n'
- n copies the routing and inner-group tables of n'
- n modifies its routing table to satisfy a diversity property
- Requires that the delegate nodes of n and n' are distinct with high probability
- Allows us to find a replacement delegate in case of failure
- Upon failure, update the inner-group tables
- Lazy update of routing tables
- Membership tracking within groups (local, small)
46. On-Demand Caching
Cache data in a group (ISP, campus) with prefix w.x.y.z/r
To look up k, create k' = k with its first r bits replaced by w.x.y.z/r; the node responsible for k' holds k in its cache (see the sketch below)
Extends naturally to multiple levels (cache hierarchy)
[Figure: lookup of the rewritten key k' within the caching group]
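A sketch of the key rewrite in Python (the group prefix 128.178.0.0/16 and the key value are arbitrary examples; keys are 32-bit integers as in TOPLUS):

    def cached_key(k: int, group_prefix: int, r: int) -> int:
        """Replace the first r bits of the 32-bit key k with the group's /r prefix."""
        mask = (1 << (32 - r)) - 1                   # keep the 32 - r low-order bits of k
        return (group_prefix & ~mask & 0xFFFFFFFF) | (k & mask)

    prefix = (128 << 24) | (178 << 16)               # 128.178.0.0/16
    k = 0xDEADBEEF
    k_prime = cached_key(k, prefix, 16)
    print(hex(k_prime))                              # 0x80b2beef: this node caches k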
47. Measuring TOPLUS Stretch
- Obtained prefix information from
- BGP tables from the Oregon and Michigan universities
- Routing registries from Castify and RIPE
- Sample of 1,000 different IP addresses
- Point-to-point IP measurements using King
- TOPLUS distance: weighted average over all possible paths between source and destination
- Weights: probability of a delegate being in each group
- TOPLUS stretch = TOPLUS distance / IP distance
48. Results
- Original tree
- 250,562 distinct IP prefixes
- Up to 11 levels of nesting
- Mean stretch: 1.17
- 16-bit regrouping (prefixes longer than 16 bits aggregated to /16)
- Aggregate small tier-1 groups
- Mean stretch: 1.19
- 8-bit regrouping (prefixes longer than 16 bits aggregated to /8)
- Mean stretch: 1.28
- Original + 1: add one level with 256 8-bit prefixes
- Mean stretch: 1.9
- Artificial, 3-tier tree
- Mean stretch: 2.32
49. TOPLUS Summary
- Problems
- Non-uniform ID space (requires a bias in the hash to balance load)
- Correlated node failures
- Advantages
- Small stretch
- IP longest-prefix matching allows fast forwarding
- On-demand P2P caching is straightforward to implement
- Can easily be deployed in a static environment (e.g., a multi-site corporate network)
- Can be used as a benchmark to measure the speed of other P2P services
50. Other Issues: Hierarchical DHTs
- The Internet is organized as a hierarchy
- Should DHT designs be flat?
- Hierarchical DHTs: multiple overlays managed by possibly different DHTs (Chord, CAN, etc.)
- First, locate the group responsible for the key in the top-level DHT
- Then, find the peer in the next-level overlay, etc.
- By designating the most reliable peers as super-nodes (part of multiple overlays), the number of hops can be significantly decreased
- How can we deploy and maintain such architectures?
51. Hierarchical DHTs: Example
[Figure: a top-level Chord overlay of super-nodes s1, s2, s3, s4, each fronting a lower-level group (e.g., a CAN group and a Chord group)]
52. Other Issues: DHT Querying
- DHTs allow us to locate data very quickly...
- Lookup(Beatles/Help) → IP address
- ...but this only works for perfect matches
- Users tend to submit broad queries
- Lookup(Beatles/) → IP address
- Queries may be inaccurate
- Lookup(Beattles/Help) → IP address
- Idea: index data using partial queries as keys
- Another approach: fuzzy matching (UCSB)
53. Some Other Issues
- Better handling of failures
- In particular, Byzantine failures: a single corrupted node may compromise the system
- Reasoning about the dynamics of the system
- A large system may never achieve a quiescent ideal state
- Dealing with untrusted participants
- Data authentication, integrity of routing tables, anonymity and censorship resistance, reputation
- Traffic-awareness, load balancing
54. Conclusion
- The DHT is a simple, yet powerful abstraction
- Building block of many distributed services (file systems, application-layer multicast, distributed caches, etc.)
- Many DHT designs, with various pros and cons
- Balance between state (degree), speed of lookup (diameter), and ease of management
- The system must support rapid changes in membership
- Dealing with joins/leaves/failures is not trivial
- The dynamics of P2P networks are difficult to analyze
- Many open issues worth exploring