Title: Distributed k-ary System: Algorithms for Distributed Hash Tables
1 Distributed k-ary System: Algorithms for Distributed Hash Tables
- Ali Ghodsi
- aligh_at_kth.se
- http://www.sics.se/ali/thesis/
PhD Defense, 7th December 2006, KTH/Royal Institute of Technology
3 Presentation Overview
- Gentle introduction to DHTs
- Contributions
- The future
4 What's a Distributed Hash Table (DHT)?
- An ordinary hash table, which is distributed
- Every node provides a lookup operation
- Provide the value associated with a key
- Nodes keep routing pointers
- If item not found, route to another node
5 So what?
- Characteristic properties
- Scalability
- Number of nodes can be huge
- Number of items can be huge
- Self-manage in presence of joins/leaves/failures
- Routing information
- Data items
Time to find data is logarithmic. Size of routing tables is logarithmic. Example: log2(1,000,000) ≈ 20. EFFICIENT!
Store a number of items proportional to the number of nodes. Typically, with D items and n nodes: store D/n items per node, and move D/n items when nodes join/leave/fail. EFFICIENT!
- Self-management of routing info
- Ensure routing information is up-to-date
- Self-management of items
- Ensure that data is always replicated and
available
6 Presentation Overview
- What's been the general motivation for DHTs?
7 Traditional Motivation (1/2)
- Peer-to-peer file sharing is very popular
- Napster
- Completely centralized
- Central server knows who has what
- Judicial problems
- Gnutella
- Completely decentralized
- Ask everyone you know to find data
- Very inefficient
(Figures: central index (Napster) vs. decentralized index (Gnutella).)
8 Traditional Motivation (2/2)
- Grand vision of DHTs
- Provide efficient file sharing
- Quote from Chord: "In particular, Chord can help avoid single points of failure or control that systems like Napster possess, and the lack of scalability that systems like Gnutella display because of their widespread use of broadcasts." (Stoica et al. 2001)
- Hidden assumptions
- Millions of unreliable nodes
- User can switch off computer any time (leave/failure)
- Extreme dynamism (nodes joining/leaving/failing)
- Heterogeneity of computers and latencies
- Untrusted nodes
9 Our philosophy
- DHT is a useful data structure
- Assumptions might not be true
- Moderate amount of dynamism
- Leave not same thing as failure
- Dedicated servers
- Nodes can be trusted
- Less heterogeneity
- Our goal is to achieve more given stronger
assumptions
10 Presentation Overview
11 How to construct a DHT (Chord)?
- Use a logical name space, called the identifier space, consisting of identifiers 0, 1, 2, ..., N-1
- Identifier space is a logical ring modulo N
- Every node picks a random identifier
- Example
- Space N = 16: {0, ..., 15}
- Five nodes: a, b, c, d, e
- a picks 6
- b picks 5
- c picks 0
- d picks 11
- e picks 2
12 Definition of Successor
- The successor of an identifier is the
- first node met going in clockwise direction
- starting at the identifier
- Example
- succ(12) = 14
- succ(15) = 2
- succ(6) = 6
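As a minimal illustration (not taken from the thesis), the successor function over a set of node identifiers can be sketched as follows; the node set is hypothetical but chosen to match the slide's examples.

```python
# A minimal sketch (not from the thesis) of the successor function, using the
# hypothetical node set {2, 5, 6, 11, 14} on a ring of size N = 16.
N = 16
nodes = sorted([2, 5, 6, 11, 14])

def succ(ident):
    """First node met going in clockwise direction, starting at ident."""
    for n in nodes:
        if n >= ident:
            return n
    return nodes[0]          # wrap around past N-1 to the smallest node

assert succ(12) == 14
assert succ(15) == 2
assert succ(6) == 6
```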
13 Where to store data (Chord)?
- Use globally known hash function, H
- Each item <key, value> gets identifier H(key)
- Store each item at its successor
- Node n = succ(H(k)) is responsible for item k
- Example
- H(Marina) = 12
- H(Peter) = 2
- H(Seif) = 9
- H(Stefan) = 14
Store a number of items proportional to the number of nodes. Typically, with D items and n nodes: store D/n items per node, and move D/n items when nodes join/leave/fail. EFFICIENT!
14 Where to point (Chord)?
- Each node points to its successor
- The successor of a node n is succ(n+1)
- Known as a node's succ pointer
- Each node points to its predecessor
- First node met in anti-clockwise direction starting at n-1
- Known as a node's pred pointer
- Example
- 0's successor is succ(1) = 2
- 2's successor is succ(3) = 5
- 5's successor is succ(6) = 6
- 6's successor is succ(7) = 11
- 11's successor is succ(12) = 0
15 DHT Lookup
- To lookup a key k
- Calculate H(k)
- Follow succ pointers until item k is found
- Example
- Lookup Seif at node 2
- H(Seif) = 9
- Traverse nodes 2, 5, 6, 11 (BINGO)
- Return Stockholm to initiator
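A small sketch of this linear lookup, assuming a hypothetical ring {0, 2, 5, 6, 11} where each node only knows its succ pointer; the values follow the slide's example.

```python
# A minimal sketch (not the thesis code) of the linear lookup on slide 15,
# assuming the hypothetical ring {0, 2, 5, 6, 11} with N = 16.
N = 16
succ_ptr = {0: 2, 2: 5, 5: 6, 6: 11, 11: 0}     # each node's succ pointer

def in_interval(x, a, b):
    """x in (a, b] on the identifier ring modulo N (assumes a != b)."""
    return 0 < (x - a) % N <= (b - a) % N

def lookup(start_node, key_id):
    """Follow succ pointers until the node responsible for key_id is found."""
    n = start_node
    while not in_interval(key_id, n, succ_ptr[n]):
        n = succ_ptr[n]
    return succ_ptr[n]                           # succ(n) is responsible

# Slide 15: lookup of Seif (H(Seif) = 9) started at node 2 traverses 2, 5, 6
# and ends at node 11, which returns the value Stockholm.
assert lookup(2, 9) == 11
```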
16 Speeding up lookups
- If only the pointer to succ(n+1) is used
- Worst case lookup time is N, for N nodes
- Improving lookup time
- Point to succ(n+1)
- Point to succ(n+2)
- Point to succ(n+4)
- Point to succ(n+8)
- ...
- Point to succ(n+2^M)
- Distance to the destination is always halved
Time to find data is logarithmic. Size of routing tables is logarithmic. Example: log2(1,000,000) ≈ 20. EFFICIENT!
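A sketch of such a pointer table for the example ring used above; the helper names are illustrative, not the thesis code.

```python
# Sketch (hypothetical ring) of the pointers on slide 16: node n keeps a
# pointer to succ(n + 2^i) for each i, so every hop can halve the remaining
# distance to the destination.
import math

N = 16
nodes = sorted([0, 2, 5, 6, 11])

def succ(i):
    return next((n for n in nodes if n >= i), nodes[0])

def fingers(n):
    """Pointers to succ(n + 2^i) for i = 0 .. log2(N) - 1."""
    return [succ((n + 2 ** i) % N) for i in range(int(math.log2(N)))]

print(fingers(0))   # [2, 2, 5, 11] on this example ring
```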
17 Dealing with failures
- Each node keeps a successor-list
- Pointers to the f closest successors
- succ(n+1)
- succ(succ(n+1)+1)
- succ(succ(succ(n+1)+1)+1)
- ...
- If successor fails
- Replace with closest alive successor
- If predecessor fails
- Set pred to nil
18 Handling Dynamism
- Periodic stabilization is used to make pointers eventually correct
- Try pointing succ to the closest alive successor
- Try pointing pred to the closest alive predecessor
19 Presentation Overview
- Gentle introduction to DHTs
- Contributions
- The future
20 Outline
21 Problems with periodic stabilization
- Joins and leaves can result in inconsistent lookup results
- At node 12, lookup(14) = 14
- At node 10, lookup(14) = 15
22 Problems with periodic stabilization
- Leaves can result in routing failures
23 Problems with periodic stabilization
- Too many leaves destroy the system
- Requires leaves/failures per round < successor-list length
24 Outline
25 Atomic Ring Maintenance
- Differentiate leaves from failures
- Leave is a synchronized departure
- Failure is a crash-stop
- Initially assume no failures
- Build a ring initially
26 Atomic Ring Maintenance
- Separate parts of the problem
- Concurrency control
- Serialize neighboring joins/leaves
- Lookup consistency
27 Naïve Approach
- Each node i hosts a lock called Li
- For p to join or leave
- First acquire Lp.pred
- Second acquire Lp
- Third acquire Lp.succ
- Thereafter update relevant pointers
- Can lead to deadlocks
28 Our Approach to Concurrency Control
- Each node i hosts a lock called Li
- For p to join or leave
- First acquire Lp
- Thereafter acquire Lp.succ
- Thereafter update relevant pointers
- Each lock has a lock queue
- Nodes waiting to acquire the lock
29 Safety
- Non-interference theorem
- When node p acquires both locks
- Node p's successor cannot leave
- Node p's predecessor cannot leave
- Other joins cannot affect relevant pointers
30 Dining Philosophers
- Problem similar to the Dining Philosophers problem
- Five philosophers around a table
- One fork between each philosopher (5)
- Philosophers eat and think
- To eat
- grab left fork
- then grab right fork
31 Deadlocks
- Can result in a deadlock
- If all nodes acquire their first lock
- Every node waiting indefinitely for second lock
- Solution from Dining philosophers
- Introduce asymmetry
- One node acquires locks in reverse order
- Node with highest identifier reverses
- If n > n.succ, then n has the highest identity
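A minimal sketch of this acquisition order, using ordinary threading locks as stand-ins for the distributed locks L_i (the real algorithm also maintains lock queues and forwards requests, which this sketch omits).

```python
# Sketch of the lock-acquisition order, with plain threading locks standing in
# for the distributed locks L_i and their lock queues (illustrative only).
import threading

class RingNode:
    def __init__(self, ident):
        self.ident = ident
        self.lock = threading.Lock()   # the local lock L_i
        self.succ = None

def acquire_for_join_or_leave(n):
    """Acquire L_n, then L_n.succ; the node with the highest identifier
    acquires them in reverse order to break the circular wait."""
    first, second = n, n.succ
    if n.ident > n.succ.ident:         # n is the highest node on the ring
        first, second = n.succ, n
    first.lock.acquire()
    second.lock.acquire()
    return first, second

def release(first, second):
    second.lock.release()
    first.lock.release()
```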
32 Pitfalls
- Join adds node/philosopher
- Solution: some requests in the lock queue are forwarded to the new node
33 Pitfalls
- Leave removes a node/philosopher
- Problem
- If the leaving node hands its lock queue to its successor, nodes can get a worse position in the queue (starvation)
- Use forwarding to avoid starvation
- The lock queue is empty after the local leave request
34 Correctness
- Liveness Theorem
- Algorithm is starvation free
- Also free from deadlocks and livelocks
- Every joining/leaving node will eventually succeed in getting both locks
35 Performance drawbacks
- If many neighboring nodes are leaving
- All grab local lock
- Sequential progress
- Solution
- Randomized locking
- Release locks and retry
- Liveness with high probability
36 Lookup consistency: leaves
- So far dealt with concurrent joins/leaves
- Now look at concurrent joins/leaves/lookups
- Lookup consistency (informally)
- At any time, only one node is responsible for any key
- Joins/leaves should not affect the functionality of lookups
37 Lookup consistency
- Goal is to make joins and leaves appear as if they happened instantaneously
- Every leave has a leave point
- A point in global time where the whole system behaves as if the node instantaneously left
- Implemented with a LeaveForward flag
- The leaving node forwards messages to its successor if LeaveForward is true
38 Leave Algorithm
(Message sequence diagram of the leave algorithm; the leave point is marked.)
39 Lookup consistency joins
- Every join has a join point
- A point in global time where the whole system behaves as if the node instantaneously joined
- Implemented with a JoinForward flag
- The successor of a joining node forwards messages to the new node if JoinForward is true
40 Join Algorithm
(Message sequence diagram between node p, joining node q, and node r: at the join point r sets JoinForward := true, oldpred := pred, pred := q (later JoinForward := false); p sets succ := q; q sets pred := p and succ := r. Messages exchanged: <UpdatePred, pred = q>, <JoinPoint, pred = p>, <UpdateSucc, succ = q>, <StopForwarding>, <Finish>.)
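A hedged sketch of the forwarding idea (illustrative names, not the thesis pseudocode): while JoinForward is set at the successor r, messages for identifiers the joining node q has taken over are forwarded to q, so lookups behave as if q joined instantaneously.

```python
# Hedged sketch of the JoinForward idea: q's successor r keeps forwarding
# messages for the identifiers q took over, (oldpred, q], until forwarding
# is switched off.
N = 16

def in_interval(x, a, b):
    """x in (a, b] on the identifier ring modulo N."""
    return 0 < (x - a) % N <= (b - a) % N

class Node:
    def __init__(self, ident):
        self.ident = ident
        self.pred = None            # after the join point at r: pred = q
        self.old_pred = None        # r's predecessor before q joined
        self.join_forward = False
        self.store = {}

    def deliver(self, key_id, value):
        if self.join_forward and in_interval(key_id, self.old_pred.ident,
                                             self.pred.ident):
            self.pred.deliver(key_id, value)   # forward to the joining node q
        else:
            self.store[key_id] = value         # handle locally

# q = 8 joins between p = 5 and r = 11; at the join point r starts forwarding.
p, q, r = Node(5), Node(8), Node(11)
q.pred, r.pred, r.old_pred, r.join_forward = p, q, p, True
r.deliver(7, "x")                  # 7 in (5, 8]: forwarded to and stored at q
r.deliver(10, "y")                 # 10 in (8, 11]: stored at r itself
assert 7 in q.store and 10 in r.store
```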
41 Outline
42 Dealing with Failures
- We prove it is impossible to provide lookup consistency on the Internet
- Assumptions
- Availability (always eventually answer)
- Lookup consistency
- Partition tolerance
- Failure detectors can behave as if the network partitioned
43 Dealing with Failures
- We provide a fault-tolerant atomic ring
- Locks are leased
- Guarantees locks are always released
- Periodic stabilization ensures
- Eventually correct ring
- Eventual lookup consistency
44 Contributions
- Lookup consistency in presence of joins/leaves
- System not affected by joins/leaves
- Inserts do not disappear
- No routing failures when nodes leave
- Number of leaves not bounded
45 Related Work
- Li, Misra, and Plaxton (04, 06) have a similar solution
- Advantages
- Assertional reasoning
- Almost machine verifiable proofs
- Disadvantages
- Starvation possible
- Not used for lookup consistency
- Failure-free environment assumed
46 Related Work
- Lynch, Malkhi, Ratajczak (02), a position paper with pseudo code in the appendix
- Advantages
- First to propose atomic lookup consistency
- Disadvantages
- No proofs
- Message might be sent to a node that left
- Does not work for both joins and leaves together
- Failures not dealt with
47 Outline
- Additional Pointers on the Ring
48 Routing
- Generalization of Chord to provide arbitrary arity
- Provides log_k(n) hops per lookup
- k being a configurable parameter
- n being the number of nodes
- Instead of only log2(n)
49 Achieving log_k(n) lookup
- Each node has log_k(N) levels, N = k^L
- Each level contains k intervals
- Example: k = 4, N = 64 (4^3), node 0
52 Arity important
- Maximum number of hops can be configured
- Example: a 2-hop system
53 Placing pointers
- Each node has (k-1)·log_k(N) pointers
- Node p's pointers point at the start of each of its intervals
- Node 0's pointers
- f(1) = 1
- f(2) = 2
- f(3) = 3
- f(4) = 4
- f(5) = 8
- f(6) = 12
- f(7) = 16
- f(8) = 32
- f(9) = 48
54 Greedy Routing
- lookup(i) algorithm
- Use the pointer closest to i, without overshooting i
- If no such pointer exists, succ is responsible for i
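A sketch, under the slide's parameters k = 4 and N = 64, of where a node's (k-1)·log_k(N) pointers are placed and how the greedy step picks the next hop; the names are illustrative, and in the real system each target identifier is resolved to its successor.

```python
# Sketch of k-ary pointer placement and greedy routing (hypothetical helpers).
N, k, L = 64, 4, 3                       # N = k^L, as on slide 49

def pointer_targets(p):
    """Identifiers node p points at: p + j * N/k^l (mod N) for every level
    l = 1..L and j = 1..k-1, i.e. (k-1)*log_k(N) pointers in total."""
    targets = set()
    for l in range(1, L + 1):
        step = N // k ** l
        for j in range(1, k):
            targets.add((p + j * step) % N)
    return sorted(targets)

print(pointer_targets(0))                # [1, 2, 3, 4, 8, 12, 16, 32, 48]

def greedy_next_hop(p, i, targets):
    """Pointer closest to i without overshooting it; if no such pointer
    exists, succ is responsible for i (cf. slide 54)."""
    candidates = [t for t in targets if 0 < (t - p) % N <= (i - p) % N]
    return max(candidates, key=lambda t: (t - p) % N) if candidates else None
```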
55 Routing with Atomic Ring Maintenance
- Invariant of lookup
- The last hop is always the predecessor of the responsible node
- Last step in lookup
- If JoinForward is true, forward to pred
- If LeaveForward is true, forward to succ
56 Avoiding Routing Failures
- If nodes leave, routing failures can occur
- Accounting algorithm
- Simple Algorithm
- No routing failures of ordinary messages
- Fault-free Algorithm
- No routing failures
- Many cases and interleavings
- Concurrent joins and leaves, pointers in both directions
57 General Routing
- Three lookup styles
- Recursive
- Iterative
- Transitive
58 Reliable Routing
- Reliable lookup for each style
- If the initiator doesn't crash, the responsible node is reached
- No redundant delivery of messages
- General strategy
- Repeat operation until success
- Filter duplicates using unique identifiers
- Iterative lookup
- Reliability easy to achieve
- Recursive lookup
- Several algorithms possible
- Transitive lookup
- Efficient reliability hard to achieve
59 Outline
- One-to-many Communication
60 Group Communication on an Overlay
- Use existing routing pointers
- Group communication
- DHT only provides key lookup
- Complex queries by searching the overlay
- Limited horizon broadcast
- Iterative deepening
- More efficient than Gnutella-like systems
- No unintended graph partitioning
- Cheaper topology maintenance [castro04]
61 Group Communication on an Overlay
- DHT builds a graph
- Why not use general graph algorithms?
- Can use the specific structure of DHTs
- More efficient
- Avoids redundant messages
62 Broadcast Algorithms
- Correctness conditions
- Termination
- Algorithm should eventually terminate
- Coverage
- All nodes should receive the broadcast message
- Non-redundancy
- Each node receives the message at most once
- Initially assume no failures
63 Naïve Broadcast
- Naive broadcast algorithm
- Send message to succ until the initiator is reached or overshot
64 Naïve Broadcast
- Naive broadcast algorithm
- Send message to succ until the initiator is reached or overshot
- Improvement
- Initiator delegates half the space to a neighbor
- Idea applied recursively
- log(n) time and n messages
65 Simple Broadcast in the Overlay
- Dissertation assumes general DHT model
- event n.SimpleBcast(m, limit)   // initially limit := n
-   for i := M downto 1 do
-     if u(i) ∈ (n, limit) then
-       sendto u(i) : SimpleBcast(m, limit)
-       limit := u(i)
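A runnable sketch of this algorithm on a small simulated ring (helper names are hypothetical; message sends are simulated as direct calls).

```python
# Runnable sketch of SimpleBcast on a small simulated ring.
N = 16
nodes = sorted([0, 2, 5, 6, 11, 14])

def succ(i):
    return next((n for n in nodes if n >= i), nodes[0])

def pointers(n):
    """u(1..M): here succ(n + 2^i), the binary (k = 2) instance with M = 4."""
    return [succ((n + 2 ** i) % N) for i in range(4)]

def in_interval(x, a, b):
    """x in (a, b) on the ring; a == b denotes the whole ring except a."""
    return x != a if a == b else 0 < (x - a) % N < (b - a) % N

received = []

def simple_bcast(n, m, limit):
    received.append(n)                       # deliver m at node n
    u = pointers(n)
    for i in range(len(u) - 1, -1, -1):      # for i := M downto 1
        if in_interval(u[i], n, limit):
            simple_bcast(u[i], m, limit)     # "sendto u(i) : SimpleBcast"
            limit = u[i]

simple_bcast(0, "hello", 0)                  # initially limit := n
assert sorted(set(received)) == nodes        # coverage
assert len(received) == len(nodes)           # non-redundancy
```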
66 Advanced Broadcast
- Old algorithm on k-ary trees
67 Getting responses
- Getting a reply
- Nodes send directly back to initiator
- Not scalable
- Simple Broadcast with Feedback
- Collect responses back to initiator
- Broadcast induces a tree; feedback flows in the reverse direction
- Similar to the simple broadcast algorithm
- Keeps track of parent (par)
- Keeps track of children (Ack)
- Accumulate feedback from children, send to parent
- Atomic ring maintenance
- Acquire local lock to ensure nodes do not leave
68 Outline
- Advanced One-to-many Communication
69 Motivation for Bulk Operation
- Building MyriadStore in 2005
- Distributed backup using the DKS DHT
- Restoring a 4 MB file
- Each block (4 KB) indexed in DHT
- Requires 1000 items in DHT
- Expensive
- One node making 1000 lookups
- Marshaling/unmarshaling 1000 requests
70 Bulk Operation
- Define a bulk set I
- A set of identifiers
- bulk_operation(m, I)
- Send message m to every node i ∈ I
- Similar correctness to broadcast
- Coverage: all nodes with an identifier in I
- Termination
- Non-redundancy
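A hedged sketch of the plain bulk operation, delegating to each pointer only the part of the bulk set I that falls in that pointer's interval; it reuses the ring model (N, nodes, pointers, in_interval) from the SimpleBcast sketch above, and the details may differ from the thesis pseudocode.

```python
# Hedged sketch of bulk_operation: split the bulk set I along the same
# intervals that SimpleBcast delegates to each pointer.

def in_co_interval(x, a, b):
    """x in [a, b) on the ring; a == b denotes the whole ring."""
    return True if a == b else (x - a) % N < (b - a) % N

delivered = []

def bulk(n, m, I, limit):
    if n in I:
        delivered.append((n, m))                 # deliver m locally (coverage)
    u = pointers(n)
    for i in range(len(u) - 1, -1, -1):
        if in_interval(u[i], n, limit):
            part = {x for x in I if in_co_interval(x, u[i], limit)}
            if part:
                bulk(u[i], m, part, limit)       # delegate only that part of I
            limit = u[i]

bulk(0, "msg", {5, 14}, 0)                       # initiator 0, bulk set I = {5, 14}
assert sorted(n for n, _ in delivered) == [5, 14]
```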
71 Bulk Owner Operation with Feedback
- Define a bulk set I
- A set of identifiers
- bulk_own(m, I)
- Send m to every node responsible for an identifier i ∈ I
- Example
- Bulk set I = {4}
- Node 4 might not exist
- Some node is responsible for identifier 4
72 Bulk Operation with Feedback
- Define a bulk set I
- A set of identifiers
- bulk_feed(m, I)
- Send message m to every node i ∈ I
- Accumulate responses back to the initiator
- bulk_own_feed(m, I)
- Send message m to every node responsible for an i ∈ I
- Accumulate responses back to the initiator
73 Bulk Properties (1/2)
- No redundant messages
- Maximum log(n) messages per node
74 Bulk Properties (2/2)
- Two extreme cases
- Case 1
- Bulk set is all identifiers
- Identical to simple broadcast
- Message complexity is n
- Time complexity is log(n)
- Case 2
- Bulk set is a singleton with one identifier
- Identical to ordinary lookup
- Message complexity is log(n)
- Time complexity is in log(n)
75 Pseudo Reliable Broadcast
- Pseudo-reliable broadcast to deal with crash failures
- Coverage property
- If the initiator is correct, every node gets the message
- Similar to broadcast with feedback
- Use failure detectors on children
- If a child with responsibility to cover interval I fails
- Use bulk to retry covering interval I
- Filter redundant messages using unique identifiers
- Eventually perfect failure detector for termination
- Inaccuracy results in redundant messages
76 Applications of bulk operation
- Bulk operation
- Topology maintenance: update nodes in the bulk set
- Pseudo-reliable broadcast: re-covering intervals
- Bulk owner
- Multiple inserts into a DHT
- Bulk owner with feedback
- Multiple lookups in a DHT
- Range queries
77 Outline
78 Successor-list replication
- Successor-list replication
- Replicate a node's items on its f successors
- DKS, Chord, Pastry, Koorde, etc.
- Was abandoned in favor of symmetric replication because...
79 Motivation: successor-lists
- If a node joins or leaves
- f replicas need to be updated
(Figure: color represents a data item; replication degree 3, every color replicated three times.)
80 Motivation: successor-lists
- If a node joins or leaves
- f replicas need to be updated
(Figure: color represents a data item; when a node leaves, yellow, green, red, and blue need to be redistributed.)
81 Multiple hashing
- Rehashing
- Store each item <k, v> at
- succ( H(k) )
- succ( H(H(k)) )
- succ( H(H(H(k))) )
-
- Multiple hash functions
- Store each item <k, v> at
- succ( H1(k) )
- succ( H2(k) )
- succ( H3(k) )
-
- Advocated by CAN and Tapestry
82 Motivation: multiple hashing
- Example
- Item <Seif, Stockholm>
- H(Seif) = 7
- succ(7) = 9
- Node 9 crashes
- Node 12 should get the item from a replica
- Needs hash inverse H^-1(7) = Seif (impossible)
- Items dispersed all over the nodes (inefficient)
83 Symmetric Replication
- Basic idea
- Replicate identifiers, not nodes
- Associate each identifier i with f other identifiers
- Identifier space partitioned into m equivalence classes
- Cardinality of each class is f, m = N/f
- Each node replicates the equivalence classes of all identifiers it is responsible for
84 Symmetric Replication
- Replication degree f = 4, Space = {0, ..., 15}
- Congruence classes modulo 4
- 0, 4, 8, 12
- 1, 5, 9, 13
- 2, 6, 10, 14
- 3, 7, 11, 15
(Ring figure: the identifier space is partitioned among five nodes with primary data ranges {15, 0}, {1, 2, 3}, {4, 5}, {6, 7, 8, 9, 10}, and {11, 12, 13, 14}.)
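A minimal sketch of the identifier association (hypothetical helper): identifier i is replicated under the f identifiers spaced N/f apart, which for N = 16 and f = 4 gives exactly the congruence classes listed above.

```python
# Minimal sketch of symmetric replication's identifier association.
N, f = 16, 4

def replica_ids(i):
    """The f identifiers associated with i: (i + j * N/f) mod N."""
    return [(i + j * (N // f)) % N for j in range(f)]

assert replica_ids(0) == [0, 4, 8, 12]       # the congruence class of 0
assert replica_ids(3) == [3, 7, 11, 15]
```

An item with key identifier i is then stored at succ(r) for each r in its class, so it can be reached through any of the f replicas.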
85 Ordinary Chord
- Replication degree f = 4, Space = {0, ..., 15}
- Congruence classes modulo 4
- 0, 4, 8, 12
- 1, 5, 9, 13
- 2, 6, 10, 14
- 3, 7, 11, 15
(Ring figure: data placement under symmetric replication on the Chord ring; each node stores four data ranges spaced N/f = 4 apart, e.g. the node responsible for {15, 0} also stores {3, 4}, {7, 8}, and {11, 12}.)
86 Cheap join/leave
- Replication degree f = 4, Space = {0, ..., 15}
- Congruence classes modulo 4
- 0, 4, 8, 12
- 1, 5, 9, 13
- 2, 6, 10, 14
- 3, 7, 11, 15
(Ring figure: when a node joins or leaves, only one node's four ranges are affected; e.g. the joining node takes over Data {15, 0}, {3, 4}, {7, 8}, and {11, 12} from its successor in a single bulk transfer.)
87 Contributions
- Message complexity for join/leave is O(1)
- Bit complexity remains unchanged
- Handling failures more complex
- Bulk operation to fetch data
- On average log(n) complexity
- Can do parallel lookups
- Decreasing latencies
- Increasing robustness
- Distributed voting
- Erasure codes
88 Presentation Overview
89 Summary (1/3)
- Atomic ring maintenance
- Lookup consistency for joins/leaves
- No routing failures as nodes join/leave
- No bound on the number of leaves
- Eventual consistency with failures
- Additional routing pointers
- k-ary lookup
- Reliable lookup
- No routing failures with additional pointers
90 Summary (2/3)
- Efficient Broadcast
- log(n) time and n message complexity
- Used in overlay multicast
- Bulk operations
- Efficient parallel lookups
- Efficient range queries
91 Summary (3/3)
- Symmetric Replication
- Simple, O(1) message complexity for joins/leaves
- O(log f) for failures
- Enables parallel lookups
- Decreasing latencies
- Increasing robustness
- Distributed voting
92 Presentation Overview
- Gentle introduction to DHTs
- Contributions
- The future
93 Future Work (1/2)
- Periodic stabilization
- Prove it is self-stabilizing
94 Future Work (2/2)
- Replication Consistency
- Atomic consistency is impossible in asynchronous systems
- Assume partial synchrony
- Weaker consistency models?
- Using virtual synchrony
95 Speculative long-term agenda
- Overlay today provides
- Dynamic membership
- Identities (max/min avail)
- Only know subset of nodes
- Shared memory registers
- Revisit distributed computing
- Assuming an overlay as basic primitive
- Leader election
- Consensus
- Shared memory consistency (started)
- Transactions
- Wave algorithms (started)
- Implement middleware providing these
96 Acknowledgments
- Seif Haridi
- Luc Onana Alima
- Cosmin Arad
- Per Brand
- Sameh El-Ansary
- Roland Yap
97 THANK YOU
99 Handling joins
- When n joins
- Find n's successor with lookup(n)
- Set succ to n's successor
- Stabilization fixes the rest
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v ∈ (n, succ)
- set succ := v
- send a notify(n) to succ
- When receiving notify(p) at n
- if pred = nil or p ∈ (pred, n)
- set pred := p
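A runnable sketch of these stabilization rules on a simulated ring (hypothetical Node class; the real protocol exchanges asynchronous messages and also keeps successor-lists for failure handling).

```python
# Runnable sketch of periodic stabilization on a simulated ring.
N = 16

def in_between(x, a, b):
    """x in (a, b) on the identifier ring modulo N (assumes a != b)."""
    return 0 < (x - a) % N < (b - a) % N

class Node:
    def __init__(self, ident):
        self.ident, self.succ, self.pred = ident, self, None

    def stabilize(self):                              # periodically at n
        v = self.succ.pred
        if v is not None and in_between(v.ident, self.ident, self.succ.ident):
            self.succ = v                             # set succ := v
        self.succ.notify(self)                        # send notify(n) to succ

    def notify(self, p):                              # on receiving notify(p) at n
        if self.pred is None or in_between(p.ident, self.pred.ident, self.ident):
            self.pred = p                             # set pred := p

# Node 8 joins a ring {0, 5, 11}: it only sets succ := lookup(8) = 11,
# and stabilization repairs the surrounding pointers.
a, b, c, d = Node(0), Node(5), Node(11), Node(8)
a.succ, b.succ, c.succ = b, c, a
a.pred, b.pred, c.pred = c, a, b
d.succ = c
for _ in range(3):
    for n in (a, b, c, d):
        n.stabilize()
assert b.succ is d and d.succ is c and c.pred is d and d.pred is b
```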
100 Handling leaves
- When n leaves
- Just disappear (like a failure)
- When pred detected failed
- Set pred to nil
- When succ detected failed
- Set succ to closest alive in successor list
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v ∈ (n, succ)
- set succ := v
- send a notify(n) to succ
- When receiving notify(p) at n
- if pred = nil or p ∈ (pred, n)
- set pred := p