Title: Distributed k-ary System: Algorithms for Distributed Hash Tables
1 Distributed k-ary System: Algorithms for Distributed Hash Tables
- Ali Ghodsi
- aligh_at_kth.se
- http://www.sics.se/ali/thesis/
PhD Defense, 7th December 2006, KTH/Royal Institute of Technology
3 Presentation Overview
- Gentle introduction to DHTs
- Contributions
- The future
4 What's a Distributed Hash Table (DHT)?
- An ordinary hash table, which is distributed
- Every node provides a lookup operation
- Provide the value associated with a key
- Nodes keep routing pointers
- If item not found, route to another node
5 So what?
- Characteristic properties
- Scalability
- Number of nodes can be huge
- Number of items can be huge
- Self-manage in presence of joins/leaves/failures
- Routing information
- Data items
Time to find data is logarithmic. Size of routing tables is logarithmic. Example: log2(1,000,000) ≈ 20. EFFICIENT!
Store a number of items proportional to the number of nodes. Typically, with D items and n nodes: store D/n items per node, and move D/n items when nodes join/leave/fail. EFFICIENT!
- Self-management of routing info
- Ensure routing information is up-to-date
- Self-management of items
- Ensure that data is always replicated and
available
6 Presentation Overview
- What's been the general motivation for DHTs?
7 Traditional Motivation (1/2)
- Peer-to-peer file sharing is very popular
- Napster
- Completely centralized
- Central server knows who has what
- Judicial problems
- Gnutella
- Completely decentralized
- Ask everyone you know to find data
- Very inefficient
(Figures: central index (Napster) vs. decentralized index (Gnutella).)
8 Traditional Motivation (2/2)
- Grand vision of DHTs
- Provide efficient file sharing
- Quote from Chord: "In particular, Chord can help avoid single points of failure or control that systems like Napster possess, and the lack of scalability that systems like Gnutella display because of their widespread use of broadcasts." (Stoica et al. 2001)
- Hidden assumptions
- Millions of unreliable nodes
- User can switch off computer any time (leave/failure)
- Extreme dynamism (nodes joining/leaving/failing)
- Heterogeneity of computers and latencies
- Untrusted nodes
9 Our philosophy
- DHT is a useful data structure
- Assumptions might not be true
- Moderate amount of dynamism
- Leave not same thing as failure
- Dedicated servers
- Nodes can be trusted
- Less heterogeneity
- Our goal is to achieve more given stronger
assumptions
10 Presentation Overview
11 How to construct a DHT (Chord)?
- Use a logical name space, called the identifier space, consisting of identifiers 0, 1, 2, ..., N-1
- Identifier space is a logical ring modulo N
- Every node picks a random identifier
- Example
- Space N = 16: {0, ..., 15}
- Five nodes: a, b, c, d, e
- a picks 6
- b picks 5
- c picks 0
- d picks 11
- e picks 2
12 Definition of Successor
- The successor of an identifier is the
- first node met going in clockwise direction
- starting at the identifier
- Example
- succ(12) = 14
- succ(15) = 2
- succ(6) = 6
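As a minimal illustration (not taken from the thesis), the successor function over a set of node identifiers can be sketched as follows; the node set is hypothetical but chosen to match the slide's examples.

```python
# A minimal sketch (not from the thesis) of the successor function, using the
# hypothetical node set {2, 5, 6, 11, 14} on a ring of size N = 16.
N = 16
nodes = sorted([2, 5, 6, 11, 14])

def succ(ident):
    """First node met going in clockwise direction, starting at ident."""
    for n in nodes:
        if n >= ident:
            return n
    return nodes[0]          # wrap around past N-1 to the smallest node

assert succ(12) == 14
assert succ(15) == 2
assert succ(6) == 6
```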
13 Where to store data (Chord)?
- Use globally known hash function, H
- Each item <key, value> gets identifier H(key)
- Store each item at its successor
- Node n = succ(H(k)) is responsible for item k
- Example
- H(Marina) = 12
- H(Peter) = 2
- H(Seif) = 9
- H(Stefan) = 14
Store a number of items proportional to the number of nodes. Typically, with D items and n nodes: store D/n items per node, and move D/n items when nodes join/leave/fail. EFFICIENT!
14 Where to point (Chord)?
- Each node points to its successor
- The successor of a node n is succ(n+1)
- Known as a node's succ pointer
- Each node points to its predecessor
- First node met in anti-clockwise direction starting at n-1
- Known as a node's pred pointer
- Example
- 0's successor is succ(1) = 2
- 2's successor is succ(3) = 5
- 5's successor is succ(6) = 6
- 6's successor is succ(7) = 11
- 11's successor is succ(12) = 0
15 DHT Lookup
- To lookup a key k
- Calculate H(k)
- Follow succ pointers until item k is found
- Example
- Lookup Seif at node 2
- H(Seif) = 9
- Traverse nodes 2, 5, 6, 11 (BINGO)
- Return Stockholm to initiator
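A small sketch of this linear lookup, assuming a hypothetical ring {0, 2, 5, 6, 11} where each node only knows its succ pointer; the values follow the slide's example.

```python
# A minimal sketch (not the thesis code) of the linear lookup on slide 15,
# assuming the hypothetical ring {0, 2, 5, 6, 11} with N = 16.
N = 16
succ_ptr = {0: 2, 2: 5, 5: 6, 6: 11, 11: 0}     # each node's succ pointer

def in_interval(x, a, b):
    """x in (a, b] on the identifier ring modulo N (assumes a != b)."""
    return 0 < (x - a) % N <= (b - a) % N

def lookup(start_node, key_id):
    """Follow succ pointers until the node responsible for key_id is found."""
    n = start_node
    while not in_interval(key_id, n, succ_ptr[n]):
        n = succ_ptr[n]
    return succ_ptr[n]                           # succ(n) is responsible

# Slide 15: lookup of Seif (H(Seif) = 9) started at node 2 traverses 2, 5, 6
# and ends at node 11, which returns the value Stockholm.
assert lookup(2, 9) == 11
```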
16 Speeding up lookups
- If only the pointer to succ(n+1) is used
- Worst case lookup time is N, for N nodes
- Improving lookup time
- Point to succ(n+1)
- Point to succ(n+2)
- Point to succ(n+4)
- Point to succ(n+8)
- ...
- Point to succ(n+2^M)
- Distance to the destination is always halved
Time to find data is logarithmic. Size of routing tables is logarithmic. Example: log2(1,000,000) ≈ 20. EFFICIENT!
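A sketch of such a pointer table for the example ring used above; the helper names are illustrative, not the thesis code.

```python
# Sketch (hypothetical ring) of the pointers on slide 16: node n keeps a
# pointer to succ(n + 2^i) for each i, so every hop can halve the remaining
# distance to the destination.
import math

N = 16
nodes = sorted([0, 2, 5, 6, 11])

def succ(i):
    return next((n for n in nodes if n >= i), nodes[0])

def fingers(n):
    """Pointers to succ(n + 2^i) for i = 0 .. log2(N) - 1."""
    return [succ((n + 2 ** i) % N) for i in range(int(math.log2(N)))]

print(fingers(0))   # [2, 2, 5, 11] on this example ring
```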
17 Dealing with failures
- Each node keeps a successor-list
- Pointers to the f closest successors
- succ(n+1)
- succ(succ(n+1)+1)
- succ(succ(succ(n+1)+1)+1)
- ...
- If successor fails
- Replace with closest alive successor
- If predecessor fails
- Set pred to nil
18 Handling Dynamism
- Periodic stabilization is used to make pointers eventually correct
- Try pointing succ to the closest alive successor
- Try pointing pred to the closest alive predecessor
19 Presentation Overview
- Gentle introduction to DHTs
- Contributions
- The future
20 Outline
21 Problems with periodic stabilization
- Joins and leaves can result in inconsistent lookup results
- At node 12, lookup(14) = 14
- At node 10, lookup(14) = 15
22 Problems with periodic stabilization
- Leaves can result in routing failures
23 Problems with periodic stabilization
- Too many leaves destroy the system
- Requires leaves/failures per round < successor-list length
24 Outline
25 Atomic Ring Maintenance
- Differentiate leaves from failures
- Leave is a synchronized departure
- Failure is a crash-stop
- Initially assume no failures
- Build a ring initially
26 Atomic Ring Maintenance
- Separate parts of the problem
- Concurrency control
- Serialize neighboring joins/leaves
- Lookup consistency
27 Naïve Approach
- Each node i hosts a lock called Li
- For p to join or leave
- First acquire Lp.pred
- Second acquire Lp
- Third acquire Lp.succ
- Thereafter update relevant pointers
- Can lead to deadlocks
28 Our Approach to Concurrency Control
- Each node i hosts a lock called Li
- For p to join or leave
- First acquire Lp
- Thereafter acquire Lp.succ
- Thereafter update relevant pointers
- Each lock has a lock queue
- Nodes waiting to acquire the lock
29 Safety
- Non-interference theorem
- When node p acquires both locks
- Node p's successor cannot leave
- Node p's predecessor cannot leave
- Other joins cannot affect relevant pointers
30 Dining Philosophers
- Problem similar to the Dining Philosophers problem
- Five philosophers around a table
- One fork between each philosopher (5)
- Philosophers eat and think
- To eat
- grab left fork
- then grab right fork
31 Deadlocks
- Can result in a deadlock
- If all nodes acquire their first lock
- Every node waiting indefinitely for second lock
- Solution from Dining philosophers
- Introduce asymmetry
- One node acquires locks in reverse order
- Node with highest identifier reverses
- If n > n.succ, then n has the highest identity
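A minimal sketch of this acquisition order, using ordinary threading locks as stand-ins for the distributed locks L_i (the real algorithm also maintains lock queues and forwards requests, which this sketch omits).

```python
# Sketch of the lock-acquisition order, with plain threading locks standing in
# for the distributed locks L_i and their lock queues (illustrative only).
import threading

class RingNode:
    def __init__(self, ident):
        self.ident = ident
        self.lock = threading.Lock()   # the local lock L_i
        self.succ = None

def acquire_for_join_or_leave(n):
    """Acquire L_n, then L_n.succ; the node with the highest identifier
    acquires them in reverse order to break the circular wait."""
    first, second = n, n.succ
    if n.ident > n.succ.ident:         # n is the highest node on the ring
        first, second = n.succ, n
    first.lock.acquire()
    second.lock.acquire()
    return first, second

def release(first, second):
    second.lock.release()
    first.lock.release()
```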
32 Pitfalls
- Join adds node/philosopher
- Solution: some requests in the lock queue are forwarded to the new node
33 Pitfalls
- Leave removes a node/philosopher
- Problem
- If the leaving node hands its lock queue to its successor, nodes can get a worse position in the queue (starvation)
- Use forwarding to avoid starvation
- The lock queue is empty after the local leave request
34 Correctness
- Liveness Theorem
- Algorithm is starvation free
- Also free from deadlocks and livelocks
- Every joining/leaving node will eventually succeed in getting both locks
35 Performance drawbacks
- If many neighboring nodes are leaving
- All grab local lock
- Sequential progress
- Solution
- Randomized locking
- Release locks and retry
- Liveness with high probability
36 Lookup consistency: leaves
- So far dealt with concurrent joins/leaves
- Now look at concurrent joins/leaves/lookups
- Lookup consistency (informally)
- At any time, only one node is responsible for any key
- Joins/leaves should not affect the functionality of lookups
37 Lookup consistency
- Goal is to make joins and leaves appear as if they happened instantaneously
- Every leave has a leave point
- A point in global time where the whole system behaves as if the node instantaneously left
- Implemented with a LeaveForward flag
- The leaving node forwards messages to its successor if LeaveForward is true
38 Leave Algorithm
(Message sequence diagram of the leave algorithm; the leave point is marked.)
39 Lookup consistency joins
- Every join has a join point
- A point in global time where the whole system behaves as if the node instantaneously joined
- Implemented with a JoinForward flag
- The successor of a joining node forwards messages to the new node if JoinForward is true
40 Join Algorithm
(Message sequence diagram between node p, joining node q, and node r: at the join point r sets JoinForward := true, oldpred := pred, pred := q (later JoinForward := false); p sets succ := q; q sets pred := p and succ := r. Messages exchanged: <UpdatePred, pred = q>, <JoinPoint, pred = p>, <UpdateSucc, succ = q>, <StopForwarding>, <Finish>.)
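A hedged sketch of the forwarding idea (illustrative names, not the thesis pseudocode): while JoinForward is set at the successor r, messages for identifiers the joining node q has taken over are forwarded to q, so lookups behave as if q joined instantaneously.

```python
# Hedged sketch of the JoinForward idea: q's successor r keeps forwarding
# messages for the identifiers q took over, (oldpred, q], until forwarding
# is switched off.
N = 16

def in_interval(x, a, b):
    """x in (a, b] on the identifier ring modulo N."""
    return 0 < (x - a) % N <= (b - a) % N

class Node:
    def __init__(self, ident):
        self.ident = ident
        self.pred = None            # after the join point at r: pred = q
        self.old_pred = None        # r's predecessor before q joined
        self.join_forward = False
        self.store = {}

    def deliver(self, key_id, value):
        if self.join_forward and in_interval(key_id, self.old_pred.ident,
                                             self.pred.ident):
            self.pred.deliver(key_id, value)   # forward to the joining node q
        else:
            self.store[key_id] = value         # handle locally

# q = 8 joins between p = 5 and r = 11; at the join point r starts forwarding.
p, q, r = Node(5), Node(8), Node(11)
q.pred, r.pred, r.old_pred, r.join_forward = p, q, p, True
r.deliver(7, "x")                  # 7 in (5, 8]: forwarded to and stored at q
r.deliver(10, "y")                 # 10 in (8, 11]: stored at r itself
assert 7 in q.store and 10 in r.store
```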
41 Outline
42 Dealing with Failures
- We prove it is impossible to provide lookup consistency on the Internet
- Assumptions
- Availability (always eventually answer)
- Lookup consistency
- Partition tolerance
- Failure detectors can behave as if the network partitioned
43 Dealing with Failures
- We provide a fault-tolerant atomic ring
- Locks are leased
- Guarantees locks are always released
- Periodic stabilization ensures
- Eventually correct ring
- Eventual lookup consistency
44 Contributions
- Lookup consistency in presence of joins/leaves
- System not affected by joins/leaves
- Inserts do not disappear
- No routing failures when nodes leave
- Number of leaves not bounded
45 Related Work
- Li, Misra, and Plaxton (04, 06) have a similar solution
- Advantages
- Assertional reasoning
- Almost machine verifiable proofs
- Disadvantages
- Starvation possible
- Not used for lookup consistency
- Failure-free environment assumed
46 Related Work
- Lynch, Malkhi, Ratajczak (02), a position paper with pseudo code in the appendix
- Advantages
- First to propose atomic lookup consistency
- Disadvantages
- No proofs
- Message might be sent to a node that left
- Does not work for both joins and leaves together
- Failures not dealt with
47 Outline
- Additional Pointers on the Ring
48 Routing
- Generalization of Chord to provide arbitrary arity
- Provides log_k(n) hops per lookup
- k being a configurable parameter
- n being the number of nodes
- Instead of only log2(n)
49 Achieving log_k(n) lookup
- Each node has log_k(N) levels, N = k^L
- Each level contains k intervals
- Example: k = 4, N = 64 (4^3), node 0
52 Arity important
- Maximum number of hops can be configured
- Example: a 2-hop system
53 Placing pointers
- Each node has (k-1)·log_k(N) pointers
- Node p's pointers point at the start of each of its intervals
- Node 0's pointers
- f(1) = 1
- f(2) = 2
- f(3) = 3
- f(4) = 4
- f(5) = 8
- f(6) = 12
- f(7) = 16
- f(8) = 32
- f(9) = 48
54 Greedy Routing
- lookup(i) algorithm
- Use the pointer closest to i, without overshooting i
- If no such pointer exists, succ is responsible for i
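A sketch, under the slide's parameters k = 4 and N = 64, of where a node's (k-1)·log_k(N) pointers are placed and how the greedy step picks the next hop; the names are illustrative, and in the real system each target identifier is resolved to its successor.

```python
# Sketch of k-ary pointer placement and greedy routing (hypothetical helpers).
N, k, L = 64, 4, 3                       # N = k^L, as on slide 49

def pointer_targets(p):
    """Identifiers node p points at: p + j * N/k^l (mod N) for every level
    l = 1..L and j = 1..k-1, i.e. (k-1)*log_k(N) pointers in total."""
    targets = set()
    for l in range(1, L + 1):
        step = N // k ** l
        for j in range(1, k):
            targets.add((p + j * step) % N)
    return sorted(targets)

print(pointer_targets(0))                # [1, 2, 3, 4, 8, 12, 16, 32, 48]

def greedy_next_hop(p, i, targets):
    """Pointer closest to i without overshooting it; if no such pointer
    exists, succ is responsible for i (cf. slide 54)."""
    candidates = [t for t in targets if 0 < (t - p) % N <= (i - p) % N]
    return max(candidates, key=lambda t: (t - p) % N) if candidates else None
```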
55 Routing with Atomic Ring Maintenance
- Invariant of lookup
- The last hop is always the predecessor of the responsible node
- Last step in lookup
- If JoinForward is true, forward to pred
- If LeaveForward is true, forward to succ
56 Avoiding Routing Failures
- If nodes leave, routing failures can occur
- Accounting algorithm
- Simple Algorithm
- No routing failures of ordinary messages
- Fault-free Algorithm
- No routing failures
- Many cases and interleavings
- Concurrent joins and leaves, pointers in both directions
57 General Routing
- Three lookup styles
- Recursive
- Iterative
- Transitive
58 Reliable Routing
- Reliable lookup for each style
- If the initiator doesn't crash, the responsible node is reached
- No redundant delivery of messages
- General strategy
- Repeat operation until success
- Filter duplicates using unique identifiers
- Iterative lookup
- Reliability easy to achieve
- Recursive lookup
- Several algorithms possible
- Transitive lookup
- Efficient reliability hard to achieve
59 Outline
- One-to-many Communication
60 Group Communication on an Overlay
- Use existing routing pointers
- Group communication
- DHT only provides key lookup
- Complex queries by searching the overlay
- Limited horizon broadcast
- Iterative deepening
- More efficient than Gnutella-like systems
- No unintended graph partitioning
- Cheaper topology maintenance [castro04]
61 Group Communication on an Overlay
- DHT builds a graph
- Why not use general graph algorithms?
- Can use the specific structure of DHTs
- More efficient
- Avoids redundant messages
62 Broadcast Algorithms
- Correctness conditions
- Termination
- Algorithm should eventually terminate
- Coverage
- All nodes should receive the broadcast message
- Non-redundancy
- Each node receives the message at most once
- Initially assume no failures
63 Naïve Broadcast
- Naive broadcast algorithm
- Send message to succ until the initiator is reached or overshot
64 Naïve Broadcast
- Naive broadcast algorithm
- Send message to succ until the initiator is reached or overshot
- Improvement
- Initiator delegates half the space to a neighbor
- Idea applied recursively
- log(n) time and n messages
65 Simple Broadcast in the Overlay
- Dissertation assumes general DHT model
- event n.SimpleBcast(m, limit)   // initially limit := n
-   for i := M downto 1 do
-     if u(i) ∈ (n, limit) then
-       sendto u(i) : SimpleBcast(m, limit)
-       limit := u(i)
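A runnable sketch of this algorithm on a small simulated ring (helper names are hypothetical; message sends are simulated as direct calls).

```python
# Runnable sketch of SimpleBcast on a small simulated ring.
N = 16
nodes = sorted([0, 2, 5, 6, 11, 14])

def succ(i):
    return next((n for n in nodes if n >= i), nodes[0])

def pointers(n):
    """u(1..M): here succ(n + 2^i), the binary (k = 2) instance with M = 4."""
    return [succ((n + 2 ** i) % N) for i in range(4)]

def in_interval(x, a, b):
    """x in (a, b) on the ring; a == b denotes the whole ring except a."""
    return x != a if a == b else 0 < (x - a) % N < (b - a) % N

received = []

def simple_bcast(n, m, limit):
    received.append(n)                       # deliver m at node n
    u = pointers(n)
    for i in range(len(u) - 1, -1, -1):      # for i := M downto 1
        if in_interval(u[i], n, limit):
            simple_bcast(u[i], m, limit)     # "sendto u(i) : SimpleBcast"
            limit = u[i]

simple_bcast(0, "hello", 0)                  # initially limit := n
assert sorted(set(received)) == nodes        # coverage
assert len(received) == len(nodes)           # non-redundancy
```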
66 Advanced Broadcast
- Old algorithm on k-ary trees
67 Getting responses
- Getting a reply
- Nodes send directly back to initiator
- Not scalable
- Simple Broadcast with Feedback
- Collect responses back to initiator
- Broadcast induces a tree; feedback flows in the reverse direction
- Similar to the simple broadcast algorithm
- Keeps track of parent (par)
- Keeps track of children (Ack)
- Accumulate feedback from children, send to parent
- Atomic ring maintenance
- Acquire local lock to ensure nodes do not leave
68 Outline
- Advanced One-to-many Communication
69 Motivation for Bulk Operation
- Building MyriadStore in 2005
- Distributed backup using the DKS DHT
- Restoring a 4 MB file
- Each block (4 KB) indexed in DHT
- Requires 1000 items in DHT
- Expensive
- One node making 1000 lookups
- Marshaling/unmarshaling 1000 requests
70 Bulk Operation
- Define a bulk set I
- A set of identifiers
- bulk_operation(m, I)
- Send message m to every node i ∈ I
- Similar correctness to broadcast
- Coverage: all nodes with an identifier in I
- Termination
- Non-redundancy
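A hedged sketch of the plain bulk operation, delegating to each pointer only the part of the bulk set I that falls in that pointer's interval; it reuses the ring model (N, nodes, pointers, in_interval) from the SimpleBcast sketch above, and the details may differ from the thesis pseudocode.

```python
# Hedged sketch of bulk_operation: split the bulk set I along the same
# intervals that SimpleBcast delegates to each pointer.

def in_co_interval(x, a, b):
    """x in [a, b) on the ring; a == b denotes the whole ring."""
    return True if a == b else (x - a) % N < (b - a) % N

delivered = []

def bulk(n, m, I, limit):
    if n in I:
        delivered.append((n, m))                 # deliver m locally (coverage)
    u = pointers(n)
    for i in range(len(u) - 1, -1, -1):
        if in_interval(u[i], n, limit):
            part = {x for x in I if in_co_interval(x, u[i], limit)}
            if part:
                bulk(u[i], m, part, limit)       # delegate only that part of I
            limit = u[i]

bulk(0, "msg", {5, 14}, 0)                       # initiator 0, bulk set I = {5, 14}
assert sorted(n for n, _ in delivered) == [5, 14]
```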
71 Bulk Owner Operation with Feedback
- Define a bulk set I
- A set of identifiers
- bulk_own(m, I)
- Send m to every node responsible for an identifier i ∈ I
- Example
- Bulk set I = {4}
- Node 4 might not exist
- Some node is responsible for identifier 4
72 Bulk Operation with Feedback
- Define a bulk set I
- A set of identifiers
- bulk_feed(m, I)
- Send message m to every node i ∈ I
- Accumulate responses back to the initiator
- bulk_own_feed(m, I)
- Send message m to every node responsible for an i ∈ I
- Accumulate responses back to the initiator
73 Bulk Properties (1/2)
- No redundant messages
- Maximum log(n) messages per node
74 Bulk Properties (2/2)
- Two extreme cases
- Case 1
- Bulk set is all identifiers
- Identical to simple broadcast
- Message complexity is n
- Time complexity is log(n)
- Case 2
- Bulk set is a singleton with one identifier
- Identical to ordinary lookup
- Message complexity is log(n)
- Time complexity is in log(n)
75 Pseudo Reliable Broadcast
- Pseudo-reliable broadcast to deal with crash failures
- Coverage property
- If the initiator is correct, every node gets the message
- Similar to broadcast with feedback
- Use failure detectors on children
- If a child with responsibility to cover interval I fails
- Use bulk to retry covering interval I
- Filter redundant messages using unique identifiers
- Eventually perfect failure detector for termination
- Inaccuracy results in redundant messages
76 Applications of bulk operation
- Bulk operation
- Topology maintenance: update nodes in the bulk set
- Pseudo-reliable broadcast: re-covering intervals
- Bulk owner
- Multiple inserts into a DHT
- Bulk owner with feedback
- Multiple lookups in a DHT
- Range queries
77 Outline
78 Successor-list replication
- Successor-list replication
- Replicate a node's items on its f successors
- DKS, Chord, Pastry, Koorde, etc.
- Was abandoned in favor of symmetric replication because...
79 Motivation: successor-lists
- If a node joins or leaves
- f replicas need to be updated
(Figure: color represents a data item; replication degree 3, every color replicated three times.)
80 Motivation: successor-lists
- If a node joins or leaves
- f replicas need to be updated
(Figure: color represents a data item; when a node leaves, yellow, green, red, and blue need to be redistributed.)
81 Multiple hashing
- Rehashing
- Store each item <k, v> at
- succ( H(k) )
- succ( H(H(k)) )
- succ( H(H(H(k))) )
-
- Multiple hash functions
- Store each item <k, v> at
- succ( H1(k) )
- succ( H2(k) )
- succ( H3(k) )
-
- Advocated by CAN and Tapestry
82 Motivation: multiple hashing
- Example
- Item <Seif, Stockholm>
- H(Seif) = 7
- succ(7) = 9
- Node 9 crashes
- Node 12 should get the item from a replica
- Needs hash inverse H^-1(7) = Seif (impossible)
- Items dispersed all over the nodes (inefficient)
83 Symmetric Replication
- Basic idea
- Replicate identifiers, not nodes
- Associate each identifier i with f other identifiers
- Identifier space partitioned into m equivalence classes
- Cardinality of each class is f, m = N/f
- Each node replicates the equivalence classes of all identifiers it is responsible for
84 Symmetric Replication
- Replication degree f = 4, Space = {0, ..., 15}
- Congruence classes modulo 4
- 0, 4, 8, 12
- 1, 5, 9, 13
- 2, 6, 10, 14
- 3, 7, 11, 15
(Ring figure: the identifier space is partitioned among five nodes with primary data ranges {15, 0}, {1, 2, 3}, {4, 5}, {6, 7, 8, 9, 10}, and {11, 12, 13, 14}.)
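A minimal sketch of the identifier association (hypothetical helper): identifier i is replicated under the f identifiers spaced N/f apart, which for N = 16 and f = 4 gives exactly the congruence classes listed above.

```python
# Minimal sketch of symmetric replication's identifier association.
N, f = 16, 4

def replica_ids(i):
    """The f identifiers associated with i: (i + j * N/f) mod N."""
    return [(i + j * (N // f)) % N for j in range(f)]

assert replica_ids(0) == [0, 4, 8, 12]       # the congruence class of 0
assert replica_ids(3) == [3, 7, 11, 15]
```

An item with key identifier i is then stored at succ(r) for each r in its class, so it can be reached through any of the f replicas.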
85 Ordinary Chord
- Replication degree f = 4, Space = {0, ..., 15}
- Congruence classes modulo 4
- 0, 4, 8, 12
- 1, 5, 9, 13
- 2, 6, 10, 14
- 3, 7, 11, 15
(Ring figure: data placement under symmetric replication on the Chord ring; each node stores four data ranges spaced N/f = 4 apart, e.g. the node responsible for {15, 0} also stores {3, 4}, {7, 8}, and {11, 12}.)
86 Cheap join/leave
- Replication degree f = 4, Space = {0, ..., 15}
- Congruence classes modulo 4
- 0, 4, 8, 12
- 1, 5, 9, 13
- 2, 6, 10, 14
- 3, 7, 11, 15
(Ring figure: when a node joins or leaves, only one node's four ranges are affected; e.g. the joining node takes over Data {15, 0}, {3, 4}, {7, 8}, and {11, 12} from its successor in a single bulk transfer.)
87 Contributions
- Message complexity for join/leave is O(1)
- Bit complexity remains unchanged
- Handling failures more complex
- Bulk operation to fetch data
- On average log(n) complexity
- Can do parallel lookups
- Decreasing latencies
- Increasing robustness
- Distributed voting
- Erasure codes
88 Presentation Overview
89 Summary (1/3)
- Atomic ring maintenance
- Lookup consistency for joins/leaves
- No routing failures as nodes join/leave
- No bound on the number of leaves
- Eventual consistency with failures
- Additional routing pointers
- k-ary lookup
- Reliable lookup
- No routing failures with additional pointers
90 Summary (2/3)
- Efficient Broadcast
- log(n) time and n message complexity
- Used in overlay multicast
- Bulk operations
- Efficient parallel lookups
- Efficient range queries
91 Summary (3/3)
- Symmetric Replication
- Simple, O(1) message complexity for joins/leaves
- O(log f) for failures
- Enables parallel lookups
- Decreasing latencies
- Increasing robustness
- Distributed voting
92 Presentation Overview
- Gentle introduction to DHTs
- Contributions
- The future
93 Future Work (1/2)
- Periodic stabilization
- Prove it is self-stabilizing
94 Future Work (2/2)
- Replication Consistency
- Atomic consistency is impossible in asynchronous systems
- Assume partial synchrony
- Weaker consistency models?
- Using virtual synchrony
95 Speculative long-term agenda
- Overlay today provides
- Dynamic membership
- Identities (max/min avail)
- Only know subset of nodes
- Shared memory registers
- Revisit distributed computing
- Assuming an overlay as basic primitive
- Leader election
- Consensus
- Shared memory consistency (started)
- Transactions
- Wave algorithms (started)
- Implement middleware providing these
96 Acknowledgments
- Seif Haridi
- Luc Onana Alima
- Cosmin Arad
- Per Brand
- Sameh El-Ansary
- Roland Yap
97 THANK YOU
99 Handling joins
- When n joins
- Find n's successor with lookup(n)
- Set succ to n's successor
- Stabilization fixes the rest
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v ∈ (n, succ)
- set succ := v
- send a notify(n) to succ
- When receiving notify(p) at n
- if pred = nil or p ∈ (pred, n)
- set pred := p
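A runnable sketch of these stabilization rules on a simulated ring (hypothetical Node class; the real protocol exchanges asynchronous messages and also keeps successor-lists for failure handling).

```python
# Runnable sketch of periodic stabilization on a simulated ring.
N = 16

def in_between(x, a, b):
    """x in (a, b) on the identifier ring modulo N (assumes a != b)."""
    return 0 < (x - a) % N < (b - a) % N

class Node:
    def __init__(self, ident):
        self.ident, self.succ, self.pred = ident, self, None

    def stabilize(self):                              # periodically at n
        v = self.succ.pred
        if v is not None and in_between(v.ident, self.ident, self.succ.ident):
            self.succ = v                             # set succ := v
        self.succ.notify(self)                        # send notify(n) to succ

    def notify(self, p):                              # on receiving notify(p) at n
        if self.pred is None or in_between(p.ident, self.pred.ident, self.ident):
            self.pred = p                             # set pred := p

# Node 8 joins a ring {0, 5, 11}: it only sets succ := lookup(8) = 11,
# and stabilization repairs the surrounding pointers.
a, b, c, d = Node(0), Node(5), Node(11), Node(8)
a.succ, b.succ, c.succ = b, c, a
a.pred, b.pred, c.pred = c, a, b
d.succ = c
for _ in range(3):
    for n in (a, b, c, d):
        n.stabilize()
assert b.succ is d and d.succ is c and c.pred is d and d.pred is b
```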
100 Handling leaves
- When n leaves
- Just disappear (like a failure)
- When pred detected failed
- Set pred to nil
- When succ detected failed
- Set succ to closest alive in successor list
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v ∈ (n, succ)
- set succ := v
- send a notify(n) to succ
- When receiving notify(p) at n
- if pred = nil or p ∈ (pred, n)
- set pred := p