Title: Introduction to Structured Overlay Networks
1 Introduction to Structured Overlay Networks
11/14/2009
2 Presentation Overview
- Gentle introduction to Structured Overlay Networks (SONs) and Distributed Hash Tables (DHTs)
- General use of SONs and DHTs
- Chord algorithms and others
3 What's a Distributed Hash Table (DHT)?
- An ordinary hash table, which is distributed
- Every node provides a lookup operation
- Provide the value associated with a key
- Nodes keep routing pointers
- If item not found, route to another node
4 So what?
- Time to find data is logarithmic
- Size of routing tables is logarithmic
- Example: log2(1,000,000) ≈ 20
- EFFICIENT!
- Store number of items proportional to number of nodes
- Typically, with D items and n nodes
- Store D/n items per node
- Move D/n items when nodes join/leave/fail
- EFFICIENT!
- Self-management of routing info
- Ensure routing information is up-to-date
- Self-management of items
- Ensure that data is always replicated and available
- Characteristic properties
- Scalability
- Number of nodes can be huge
- Number of items can be huge
- Self-management in the presence of joins/leaves/failures
- Routing information
- Data items
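The two efficiency figures on this slide (logarithmic lookup cost and D/n items per node) can be checked with a few lines of arithmetic. This is only a sketch: the item count `NUM_ITEMS` is an assumed value chosen for illustration, not a number from the slides.

```python
import math

# Assumed sizes for illustration only; NUM_ITEMS is not from the slides.
NUM_NODES = 1_000_000
NUM_ITEMS = 50_000_000


def expected_hops(num_nodes: int) -> float:
    """Logarithmic lookup cost: log2 of the number of nodes."""
    return math.log2(num_nodes)


def items_per_node(num_items: int, num_nodes: int) -> float:
    """Average load per node, D/n, assuming keys are spread uniformly."""
    return num_items / num_nodes


print(round(expected_hops(NUM_NODES), 2))
print(items_per_node(NUM_ITEMS, NUM_NODES))
```

The same D/n figure is also the expected number of items that change hands when a single node joins, leaves or fails.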
5 Traditional Motivation (1/2)
- Peer-to-Peer file sharing very popular
- Napster
- Completely centralized
- Central server knows who has what
- Legal problems
- Gnutella
- Completely decentralized
- Ask everyone you know to find data
- Very inefficient
[Figure: central index (Napster) vs. decentralized index (Gnutella)]
6 Traditional Motivation (2/2)
- Grand vision of DHTs
- Provide efficient file sharing
- Quote from Chord: "In particular, Chord can help avoid single points of failure or control that systems like Napster possess, and the lack of scalability that systems like Gnutella display because of their widespread use of broadcasts." [Stoica et al. 2001]
- Hidden assumptions
- Millions of unreliable nodes
- User can switch off computer any time (leave = failure)
- Extreme dynamism (nodes joining/leaving/failing)
- Heterogeneity of computers and latencies
- Untrusted nodes
7 Motivation: DHT overlay as communication infrastructure
- Internet communication
- IP/port, TCP and UDP
- Not suited for 21st century computing
- Firewalls
- NATs
- Changing IP addresses
8 Name-based communication
- DHTs can overcome these problems
- How?
- Use the DHT
- Map names to locations
- Bypass firewalls and NATs by routing through neighbors
9 Name-based communication
- What about group communication?
- IP Multicast is not widely enabled on the Internet
- Use the overlay to broadcast to all nodes
- Create multiple groups, broadcast within each
10 What's it good for?
- Let's look at 10 applications built using such systems
11 Distributed Authorization
- Defense project at SICS, Swedish Institute of Computer Science
- Store certificates in the directory
- No central server
- Survives even if nodes are attacked
12 Distributed Backup
- Setup
- Clients install the backup tool
- Decide on amount of space to share
- Choose files for backup
- Regular backup
- Data is encrypted
- Stored in the directory
13 Distributed File System
- Similar to AFS and NFS
- Files stored in directory
- What is new?
- Application logic self-managed
- Add/remove servers on the fly
- Automatically handles failures
- Automatically load-balances
- No manual configuration needed
14 P2P Cache
- A distributed cache
- Every node in an organization runs a client
- Want to browse a web page?
- If it exists locally -> download it from a peer
- Otherwise, fetch and cache
- No central proxy needed
15 P2P Web Servers
- Distributed Web Server
- Pages stored in the directory
- What is new?
- Application logic self-managed
- Automatically load-balances
- Add/remove servers on the fly
- Automatically handles failures
16 P2P SIP
- Session Initiation Protocol
- Used to initiate calls on the Internet
- Is being standardized
- Use the directory to find end-hosts
- Improving Skype
17 Host Identity Payload (HIP)
- Uses the directory to provide seamless mobility
- Unlike Mobile IP
- No home agent needed
- Self-managing
18 PIER (databases)
- A relational view of the directory
- Use SQL to fetch data
- Standard operations (projection, selection, equi-join)
19 Summary
- DHT is a useful data structure
- Assumptions mentioned might not be true
- Moderate amount of dynamism
- A leave is not the same thing as a failure
- Dedicated servers
- Nodes can be trusted
- Less heterogeneity
20 Chord as Example of DHT
21 How to construct a DHT (Chord)?
- Use a logical name space, called the identifier space, consisting of identifiers {0, 1, 2, ..., N-1}
- Identifier space is a logical ring modulo N
- Every node picks a random identifier through hash function H
- Example
- Space N = 16: {0, ..., 15}
- Five nodes a, b, c, d, e
- a picks 6
- b picks 5
- c picks 0
- d picks 11
- e picks 2
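As a rough illustration of how a node could obtain its identifier, the sketch below hashes a node name into the space {0, ..., N-1}. The use of SHA-1 and the helper name `chord_id` are assumptions made for this example; the slide only says that a hash function H is used. The identifiers 6, 5, 0, 11 and 2 are simply copied from the slide, since they cannot be derived from the letters a to e without knowing H.

```python
import hashlib


def chord_id(name: str, space_size: int = 16) -> int:
    """Map a name into the identifier space {0, ..., space_size - 1}.

    SHA-1 is an assumed choice of H; any well-mixing hash would do.
    """
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % space_size


# Identifiers as given on the slide (not computed by chord_id).
SLIDE_IDS = {"a": 6, "b": 5, "c": 0, "d": 11, "e": 2}

# Sorting the identifiers gives the clockwise order of nodes on the ring.
ring = sorted(SLIDE_IDS.values())
print(ring)
```

With only 16 identifiers, two names can easily hash to the same value; real deployments use a much larger space so that collisions are negligible.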
22 Definition of Successor
- The successor of an identifier is the
- first node met going in clockwise direction
- starting at the identifier
- Example
- succ(12) = 14
- succ(15) = 2
- succ(6) = 6
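The successor definition can be sketched as a search in a sorted list of node identifiers. The node set {2, 6, 14} used here is an assumption: the figure for this slide did not survive, and this set is merely one that is consistent with the three examples above.

```python
from bisect import bisect_left


def succ(identifier: int, nodes, space_size: int = 16) -> int:
    """First node met going clockwise, starting at the identifier itself."""
    ring = sorted(nodes)
    i = bisect_left(ring, identifier % space_size)
    # Running off the end of the list means wrapping around the ring.
    return ring[i % len(ring)]


EXAMPLE_NODES = [2, 6, 14]  # assumed; chosen to match the slide's examples
for ident in (12, 15, 6):
    print(ident, succ(ident, EXAMPLE_NODES))
```

Because the search starts at the identifier itself, an identifier that is also a node is its own successor, as in succ(6) = 6.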
23 Where to store data (Chord)?
- Use globally known hash function, H
- Each item <key, value> gets identifier H(key) = k
- Store each item at its successor
- Node n = succ(k) is responsible for item k
- Example
- H("Marina") = 12
- H("Peter") = 2
- H("Seif") = 9
- H("Stefan") = 14
- Store number of items proportional to number of nodes
- Typically, with D items and n nodes
- Store D/n items per node
- Move D/n items when nodes join/leave/fail
- EFFICIENT!
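The placement rule can be sketched by assigning each example key to the successor of its hash value. The node identifiers are the five from the earlier construction example (0, 2, 5, 6, 11) and the hash values are those listed on this slide; the table `H` stands in for the real hash function, which the slides do not specify.

```python
from bisect import bisect_left
from collections import defaultdict

NODES = [0, 2, 5, 6, 11]  # node identifiers from the construction example
H = {"Marina": 12, "Peter": 2, "Seif": 9, "Stefan": 14}  # values from the slide


def succ(identifier: int) -> int:
    """First node clockwise from the identifier (NODES must stay sorted)."""
    i = bisect_left(NODES, identifier)
    return NODES[i % len(NODES)]


# Each item is stored at the successor of its identifier.
placement = defaultdict(list)
for key, k in H.items():
    placement[succ(k)].append(key)

for node in sorted(placement):
    print(node, placement[node])
```

Identifiers 12 and 14 lie past the last node (11), so both wrap around to node 0.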
24 Where to point (Chord)?
- Each node points to its successor
- The successor of a node n is succ(n+1)
- Known as a node's succ pointer
- Each node points to its predecessor
- First node met in anti-clockwise direction starting at n-1
- Known as a node's pred pointer
- Example
- 0's successor is succ(1) = 2
- 2's successor is succ(3) = 5
- 5's successor is succ(6) = 6
- 6's successor is succ(7) = 11
- 11's successor is succ(12) = 0
25 DHT Lookup
- To look up a key k
- Calculate H(k)
- Follow succ pointers until item k is found
- Example
- Look up "Seif" at node 2
- H("Seif") = 9
- Traverse nodes
- 2, 5, 6, 11 (BINGO)
- Return "Stockholm" to initiator
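The naive lookup in this example, which only follows succ pointers, can be sketched as follows. The ring is the five-node example (0, 2, 5, 6, 11); the dictionaries `SUCC` and `STORE` are illustrative stand-ins for per-node state, not structures named in the slides.

```python
NODES = [0, 2, 5, 6, 11]
# succ pointer of every node: the next node clockwise on the ring.
SUCC = {n: NODES[(i + 1) % len(NODES)] for i, n in enumerate(NODES)}
# Item with identifier H("Seif") = 9 is stored at succ(9) = 11.
STORE = {11: {9: "Stockholm"}}


def lookup(start: int, k: int):
    """Follow succ pointers from `start` until a node holding item k is found."""
    node, path = start, [start]
    while k not in STORE.get(node, {}):
        node = SUCC[node]
        if node == start:  # went all the way around without finding k
            raise KeyError(k)
        path.append(node)
    return STORE[node][k], path


value, path = lookup(2, 9)
print(value, path)
```

The number of nodes visited grows linearly with the ring size, which is what the finger table later improves on.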
26 DHT Lookup
- (a, b] is the segment of the ring moving clockwise from, but not including, a until and including b
- n.foo(.) denotes an RPC of foo(.) to node n
- n.bar denotes an RPC to fetch the value of the variable bar in node n
- We call the process of finding the successor of an id a LOOKUP
// ask node n to find the successor of id
procedure n.findSuccessor(id)
  if predecessor ≠ nil and id ∈ (predecessor, n] then return n
  else if id ∈ (n, successor] then
    return successor
  else // forward the query around the circle
    return successor.findSuccessor(id)
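A minimal Python sketch of this pseudocode is given below, with RPCs replaced by ordinary method calls on in-memory objects. The helpers `in_half_open` and `build_ring` are hypothetical names introduced for the example; `build_ring` wires up a ring that is already correct, so no stabilization is involved.

```python
def in_half_open(x: int, a: int, b: int) -> bool:
    """True if x lies in the ring interval (a, b], moving clockwise from a."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # interval wraps past 0; a == b covers the whole ring


class Node:
    def __init__(self, ident: int):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def find_successor(self, ident: int) -> "Node":
        if self.predecessor is not None and in_half_open(
            ident, self.predecessor.id, self.id
        ):
            return self
        if in_half_open(ident, self.id, self.successor.id):
            return self.successor
        # Forward the query around the circle.
        return self.successor.find_successor(ident)


def build_ring(ids):
    """Create nodes with correct succ/pred pointers (illustrative helper)."""
    nodes = [Node(i) for i in sorted(ids)]
    for i, node in enumerate(nodes):
        node.successor = nodes[(i + 1) % len(nodes)]
        node.predecessor = nodes[i - 1]
    return {n.id: n for n in nodes}


ring = build_ring([0, 2, 5, 6, 11])
print(ring[2].find_successor(9).id)
```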
27 DHT Lookup and Update
// put and get first look up the node responsible for id
procedure n.put(id, value)
  s := findSuccessor(id)
  s.store(id, value)
procedure n.get(id)
  s := findSuccessor(id)
  return s.retrieve(id)
- PUT and GET are nothing but lookups!!
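The sketch below layers put and get on top of a successor-only lookup, again using in-memory objects instead of RPCs. The `items` dictionary plays the role of the store/retrieve operations, and `build_ring` is a hypothetical helper for setting up a correct ring.

```python
def in_half_open(x: int, a: int, b: int) -> bool:
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


class Node:
    def __init__(self, ident: int):
        self.id = ident
        self.successor = self
        self.items = {}  # local storage for the items this node is responsible for

    def find_successor(self, ident: int) -> "Node":
        if in_half_open(ident, self.id, self.successor.id):
            return self.successor
        return self.successor.find_successor(ident)

    def put(self, ident: int, value) -> None:
        self.find_successor(ident).items[ident] = value  # s.store(id, value)

    def get(self, ident: int):
        return self.find_successor(ident).items.get(ident)  # s.retrieve(id)


def build_ring(ids):
    nodes = [Node(i) for i in sorted(ids)]
    for i, node in enumerate(nodes):
        node.successor = nodes[(i + 1) % len(nodes)]
    return {n.id: n for n in nodes}


ring = build_ring([0, 2, 5, 6, 11])
ring[2].put(9, "Stockholm")   # issued at node 2, stored at node 11
print(ring[5].get(9))         # any node can retrieve it
```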
28 Speeding up lookups
- If only pointer to succ(n+1) is used
- Worst case lookup time is N, for N nodes
- Improving lookup time (finger/routing table)
- Point to succ(n+1)
- Point to succ(n+2)
- Point to succ(n+4)
- Point to succ(n+8)
- ...
- Point to succ(n+2^(M-1))
- Distance to the destination is always halved
- Time to find data is logarithmic
- Size of routing tables is logarithmic
- Example: log2(1,000,000) ≈ 20
- EFFICIENT!
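The finger table described above can be computed directly from a global view of the ring, which is a simplification used only for illustration (a real node learns these entries through lookups). The sketch assumes the five-node example ring with M = 4, so N = 16.

```python
from bisect import bisect_left


def succ(identifier: int, ring) -> int:
    """First node clockwise from the identifier; `ring` is a sorted list."""
    i = bisect_left(ring, identifier)
    return ring[i % len(ring)]


def finger_table(n: int, ring, m: int):
    """Entries succ(n + 2^(i-1)) for i = 1..m, with addition modulo 2^m."""
    size = 2 ** m
    return [succ((n + 2 ** (i - 1)) % size, ring) for i in range(1, m + 1)]


RING = [0, 2, 5, 6, 11]
print(finger_table(2, RING, 4))
print(finger_table(11, RING, 4))
```

For node 2 the finger starts are 3, 4, 6 and 10; for node 11 they are 12, 13, 15 and 3, the last one wrapping around the ring.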
29 Chord Routing (1/7)
[Figure: ring with identifiers 0 to 15, routing Get(15)]
- Routing table size M, where N = 2^M
- Every node n knows successor(n + 2^(i-1)), for i = 1..M
- Routing entries = log2(N)
- log2(N) hops from any node to any other node
30 Chord Routing (2/7)
[Figure: ring with identifiers 0 to 15, next step of routing Get(15); routing-table facts as on slide 29]
31 Chord Routing (3/7)
[Figure: ring with identifiers 0 to 15, next step of routing Get(15); routing-table facts as on slide 29]
32 Chord Routing (4/7)
[Figure: ring with identifiers 0 to 15, routing Get(15)]
- From node 1, only 2 hops to node 0, where item 15 is stored
- For an id space of 16 ids, the maximum is log2(16) = 4 hops between any two nodes
- In fact, if nodes are uniformly distributed, the maximum is log2(# of nodes), i.e. log2(8) = 3 hops between any two nodes
- The average complexity is ½ log2(# of nodes)
33 Chord Routing (5/7): Pseudo code findSuccessor(.)
// ask node n to find the successor of id
procedure n.findSuccessor(id)
  if predecessor ≠ nil and id ∈ (predecessor, n] then return n
  if id ∈ (n, successor] then
    return successor
  else
    n' := closestPrecedingNode(id)
    return n'.findSuccessor(id)
// search locally for the highest predecessor of id
procedure closestPrecedingNode(id)
  for i = m downto 1 do
    if finger[i] ∈ (n, id) then return finger[i]
  end
  return n
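The finger-based lookup can be sketched in Python as below. This is an illustrative in-memory version: RPCs become method calls, `build_ring` (a hypothetical helper) fills in correct fingers from a global view, and a `hops` counter is added so that the routing cost can be observed. The predecessor shortcut from the pseudocode is omitted for brevity.

```python
from bisect import bisect_left


def in_half_open(x, a, b):
    """x in (a, b] on the ring."""
    return a < x <= b if a < b else (x > a or x <= b)


def in_open(x, a, b):
    """x in (a, b) on the ring."""
    return a < x < b if a < b else (x > a or x < b)


class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.finger = []  # finger[i-1] holds succ(n + 2^(i-1))

    def closest_preceding_node(self, ident):
        for f in reversed(self.finger):  # highest finger first
            if in_open(f.id, self.id, ident):
                return f
        return self

    def find_successor(self, ident, hops=0):
        if in_half_open(ident, self.id, self.successor.id):
            return self.successor, hops
        nxt = self.closest_preceding_node(ident)
        return nxt.find_successor(ident, hops + 1)


def build_ring(ids, m):
    ring = sorted(ids)
    nodes = {i: Node(i) for i in ring}
    for n in nodes.values():
        starts = [(n.id + 2 ** (i - 1)) % 2 ** m for i in range(1, m + 1)]
        n.finger = [nodes[ring[bisect_left(ring, s) % len(ring)]] for s in starts]
        n.successor = n.finger[0]
    return nodes


nodes = build_ring([0, 2, 5, 6, 11], m=4)
owner, hops = nodes[2].find_successor(9)
print(owner.id, hops)
```

On this ring the lookup for 9 from node 2 is forwarded once (to node 6), compared with the three forwards needed when only succ pointers are followed.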
34 Chord Discussion
- We are basically done
- But...
- What about joins and failures/leaves?
- Nodes come and go as they wish
- What about data?
- Should I lose my doc because some kid decided to shut down his machine and he happened to store my file?
- What about storing addresses of files instead of files?
- What did we gain compared to Gnutella? Increased guarantees and determinism?
- So actually we just started...
35 Agenda
- Handling successor pointers
- Joins, Leaves
- Scalability
- Routing table: reducing the cost from O(N) to O(log N)
- Failures (for all the above)
36 Handling Successors: Ring maintenance
- Everything depends on successor pointers, so we had better have them right all the time!!
- In Chord, in addition to the successor pointer, every node has a predecessor pointer as well, for ring maintenance
37 Handling Dynamism
- Periodic stabilization is used to make pointers eventually correct
- Try pointing succ to closest alive successor
- Try pointing pred to closest alive predecessor
- When receiving notify(p) at n
- if pred = nil or p is in (pred, n]
- set pred := p
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v is in (n, succ]
- set succ := v
- send a notify(n) to succ
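The two stabilization rules can be simulated with a small Python sketch. The scenario is an assumed one chosen for illustration: a correct two-node ring (11 and 15) that node 13 joins by pointing its succ at 15. Messages are modelled as direct method calls, and "periodically" becomes an explicit loop over the nodes.

```python
def in_half_open(x, a, b):
    """x in (a, b] on the ring."""
    return a < x <= b if a < b else (x > a or x <= b)


class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = self
        self.pred = None

    def notify(self, p):
        # When receiving notify(p): adopt p if it is a closer predecessor.
        if self.pred is None or in_half_open(p.id, self.pred.id, self.id):
            self.pred = p

    def stabilize(self):
        # Periodic step: check whether succ has a closer predecessor than us.
        v = self.succ.pred
        if v is not None and in_half_open(v.id, self.id, self.succ.id):
            self.succ = v
        self.succ.notify(self)


# Correct two-node ring 11 <-> 15.
n11, n15 = Node(11), Node(15)
n11.succ, n11.pred = n15, n15
n15.succ, n15.pred = n11, n11

# Node 13 joins: its successor (found by lookup) is 15; pred is still nil.
n13 = Node(13)
n13.succ = n15

for _ in range(2):  # two rounds are enough in this small example
    for node in (n13, n11, n15):
        node.stabilize()

print(n11.succ.id, n13.succ.id, n15.succ.id)
```

After the rounds, the pointers form the ring 11 -> 13 -> 15 -> 11 with matching predecessors, without node 13 ever contacting node 11 directly.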
38 Handling joins
- When n joins
- Find n's successor with lookup(n)
- Set succ to n's successor
- Stabilization fixes the rest
[Figure: ring segment with nodes 11, 13 and 15]
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v is in (n, succ]
- set succ := v
- send a notify(n) to succ
- When receiving notify(p) at n
- if pred = nil or p is in (pred, n]
- set pred := p
S. Haridi, ID2210, Lecture 02
39 Handling Successors - Chord Algorithm
[Figure: Chord successor-handling algorithm; only the label "nil" survives in the extracted text]
40 Handling Joins/Leaves for Fingers: Finger Stabilization (1/5)
- Periodically refresh finger table entries, and store the index of the next finger to fix
- This is also the initialization procedure for the finger table
- Local variable next, initially 0
procedure n.fixFingers()
  next := next + 1
  if next > m then next := 1
  finger[next] := findSuccessor(n ⊕ 2^(next-1))   // ⊕ is addition modulo N
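A sketch of fixFingers in Python follows. To keep it short, the lookup is answered from a global sorted list `RING`, which is an assumption standing in for the findSuccessor RPC; everything else follows the pseudocode, including the 1-based finger index.

```python
from bisect import bisect_left

M = 4                      # identifier space of size 2^M = 16
RING = [0, 2, 5, 6, 11]    # global view, used only to answer lookups here


def find_successor(ident: int) -> int:
    """Stand-in for the findSuccessor RPC (assumed oracle over RING)."""
    return RING[bisect_left(RING, ident) % len(RING)]


class Node:
    def __init__(self, ident: int):
        self.id = ident
        self.next = 0
        self.finger = [None] * (M + 1)  # 1-based; slot 0 is unused

    def fix_fingers(self) -> None:
        """Refresh one finger per call, cycling through indices 1..M."""
        self.next += 1
        if self.next > M:
            self.next = 1
        start = (self.id + 2 ** (self.next - 1)) % 2 ** M
        self.finger[self.next] = find_successor(start)


n = Node(2)
for _ in range(M):   # M calls initialize the whole table
    n.fix_fingers()
print(n.finger[1:])
```

Because only one entry is refreshed per call, a finger can stay stale for up to M periods after the ring changes.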
41 Example: finger stabilization (2/5)
- Current situation: succ(N48) is N60
- Succ(N21.Finger[j].start) = Succ(53) = N21.Finger[j].node = N60
[Figure: ring with N21, N26, N32, N48, N53 and N60, marking N21.Finger[j].start and N21.Finger[j].node]
42 Example: finger stabilization (3/5)
- New node N56 joins and stabilizes its successor pointer
- Finger j of node N21 is wrong
- N21 eventually tries to fix finger j by looking up 53, which stops at N48; however, nothing changes
[Figure: ring with N21, N26, N32, N48, N53, N56 and N60, marking N21.Finger[j].start and N21.Finger[j].node]
43 Example: finger stabilization (4/5)
- N48 will eventually stabilize its successor
- This means the ring is correct now.
[Figure: ring with N21, N26, N32, N48, N53, N56 and N60, marking N21.Finger[j].start and N21.Finger[j].node]
44 Example: finger stabilization (5/5)
- When N21 tries to fix Finger j again, this time the response from N48 will be correct and N21 corrects the finger
[Figure: ring with N21, N26, N32, N48, N53, N56 and N60, marking N21.Finger[j].start and N21.Finger[j].node]
45 Agenda
- Handling successor pointers
- Joins, Leaves
- Scalability
- Routing table: reducing the cost from O(N) to O(log N)
- Failures (for all the above)
- Handling data
- Joins, Leaves
46 Handling Failures: Replication of Successors
- Evidently, the failure of one successor pointer means total collapse
- Solution: a node has a successor list of size r containing its immediate r successors
- How big should r be? log(N) or a large constant should be ok
- Enhance periodic stabilization to handle failures
47 Dealing with failures
- Each node keeps a successor-list
- Pointer to r closest successors
- succ(n+1)
- succ(succ(n+1)+1)
- succ(succ(succ(n+1)+1)+1)
- ...
- If successor fails
- Replace with closest alive successor
- If predecessor fails
- Set pred to nil
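The successor-list idea can be sketched as follows. The boolean `alive` flag is an assumption that stands in for a failure detector, and `build_ring` is a hypothetical helper that creates a correct ring with successor lists of length r. The node identifiers are borrowed from the finger-stabilization example.

```python
class Node:
    def __init__(self, ident: int):
        self.id = ident
        self.alive = True           # stand-in for a failure detector's verdict
        self.successor_list = []    # the r closest successors, nearest first
        self.pred = None

    def first_alive_successor(self):
        """If the immediate successor failed, fall back to the next entry."""
        for s in self.successor_list:
            if s.alive:
                return s
        return None  # all r successors failed at once

    def check_predecessor(self) -> None:
        """If the predecessor failed, set pred to nil."""
        if self.pred is not None and not self.pred.alive:
            self.pred = None


def build_ring(ids, r: int):
    nodes = [Node(i) for i in sorted(ids)]
    for i, n in enumerate(nodes):
        n.successor_list = [nodes[(i + j) % len(nodes)] for j in range(1, r + 1)]
        n.pred = nodes[i - 1]
    return {n.id: n for n in nodes}


ring = build_ring([21, 26, 32, 48, 60], r=2)
ring[26].alive = False                       # N26 fails
print(ring[21].first_alive_successor().id)   # N21 falls back to N32
ring[32].check_predecessor()                 # N32 clears its dead predecessor
```

The ring only breaks if all r entries of some successor list fail before stabilization repairs them, which is why r is chosen as log(N) or a large constant.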
48 Handling leaves
- When n leaves
- Just disappear (like failure)
- When pred is detected as failed
- Set pred to nil
- When succ is detected as failed
- Set succ to closest alive node in successor list
[Figure: ring segment with nodes 11, 13 and 15]
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v is in (n, succ]
- set succ := v
- send a notify(n) to succ
- When receiving notify(p) at n
- if pred = nil or p is in (pred, n]
- set pred := p
49 Handling Failures - Ring (1/5)
- Maintaining the ring
- Each node maintains a successor list of length r
- If a node's immediate successor fails, it uses the second entry in its successor list
- updateSuccessorList copies a successor list from s, removing the last entry and prepending s
- Join a Chord ring containing node n'
procedure n.join(n')
  predecessor := nil
  s := n'.findSuccessor(n)
  updateSuccessorList(s.successorList)
50 Handling Failures - Ring (2/5)
- Check whether predecessor has failed (failure detector)
procedure n.checkPredecessor()
  if predecessor has failed then
    predecessor := nil
51 Handling Failures - Ring (3/5)
procedure n.stabilize()
  s := first alive node in successorList
  x := s.predecessor
  if x ≠ nil and x ∈ (n, s) then s := x end
  updateSuccessorList(s.successorList)
  s.notify(n)
procedure n.notify(n')
  if predecessor = nil or n' ∈ (predecessor, n) then
    predecessor := n'
52 Failure - Ring (4/5): Example - Node failure (N26)
[Figure: nodes N21, N26 and N32 with pointers suc(N21,1), suc(N21,2), suc(N26,1) and pred(N32)]
- After N21 performed stabilize(), before N21.notify(N32)
[Figure: nodes N21, N26 and N32 with pointers suc(N21,1) and pred(N32)]
53 Failure - Ring (5/5): Example - Node failure (N26)
- After N21 performed stabilize(), before N21.notify(N32)
- N21.notify(N32) has no effect
[Figure: nodes N21, N26 and N32 with pointers suc(N21,1) and pred(N32)]
- After N32.checkPredecessor()
[Figure: nodes N21, N26 and N32 with pointer suc(N21,1); pred(N32) is no longer shown]
- Next N21.stabilize() fixes N32's predecessor
54 Failure - Lookups (1/5)
// ask node n to find the successor of id
procedure n.findSuccessor(id)
  if id ∈ (n, successor] then
    return successor
  else
    n' := closestPrecedingNode(id)
    try
      return n'.findSuccessor(id)
    catch failure of n' then
      mark n' in finger[.] as failed
      return n.findSuccessor(id)
// search locally for the highest predecessor of id
procedure closestPrecedingNode(id)
  for i = m downto 1 do
    if finger[i].node is alive and finger[i] ∈ (n, id) then return finger[i]
  end
  return n
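A possible Python rendering of the failure-aware lookup is sketched below. Several things are assumptions made for the example: the `alive` flag and the `NodeFailure` exception model a failed RPC, the `failed` set records fingers marked as failed, and `build_ring` builds correct fingers from a global view. Repairing a dead successor pointer is not shown here, since that is the job of the successor list, so the sketch raises `LookupError` when no live finger is left to try.

```python
from bisect import bisect_left


class NodeFailure(Exception):
    """Raised when a call reaches a node that has failed (models an RPC error)."""


def in_half_open(x, a, b):
    return a < x <= b if a < b else (x > a or x <= b)


def in_open(x, a, b):
    return a < x < b if a < b else (x > a or x < b)


class Node:
    def __init__(self, ident):
        self.id, self.alive = ident, True
        self.successor, self.finger = self, []
        self.failed = set()  # ids of fingers this node has marked as failed

    def closest_preceding_node(self, ident):
        for f in reversed(self.finger):
            if f.id not in self.failed and in_open(f.id, self.id, ident):
                return f
        return self

    def find_successor(self, ident):
        if not self.alive:
            raise NodeFailure(self.id)
        if in_half_open(ident, self.id, self.successor.id):
            return self.successor
        nxt = self.closest_preceding_node(ident)
        if nxt is self:
            raise LookupError("no live finger precedes the id")
        try:
            return nxt.find_successor(ident)
        except NodeFailure:
            self.failed.add(nxt.id)             # mark the finger as failed
            return self.find_successor(ident)   # retry with the next best finger


def build_ring(ids, m):
    ring = sorted(ids)
    nodes = {i: Node(i) for i in ring}
    for n in nodes.values():
        starts = [(n.id + 2 ** (i - 1)) % 2 ** m for i in range(1, m + 1)]
        n.finger = [nodes[ring[bisect_left(ring, s) % len(ring)]] for s in starts]
        n.successor = n.finger[0]
    return nodes


nodes = build_ring([0, 2, 5, 6, 11], m=4)
nodes[11].alive = False                 # node 11 fails
owner = nodes[2].find_successor(1)      # first try via 11 fails, retry via 6
print(owner.id, nodes[2].failed)
```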
55 Variations of Chord
56 DKS Routing
- Generalization of Chord to provide arbitrary arity
- Provide log_k(n) hops per lookup
- k being a configurable parameter
- n being the number of nodes
- Instead of only log2(n)
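57 Achieving log_k(n) lookup
- Each node has log_k(N) = L levels, N = k^L
- Each level contains k intervals
- Example: k = 4, N = 64 (4^3), node 0
[Figure: identifier ring seen from node 0, with interval boundaries at 0, 4, 8, 12, 16, 32 and 48]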
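The level and interval structure of the k-ary scheme can be worked out with a short sketch. The function names are hypothetical, and the interval layout assumed here (level l splits a region of size N/k^(l-1), starting at the node, into k equal parts) is an interpretation that reproduces the boundaries in the example for k = 4, N = 64 and node 0.

```python
def levels(space_size: int, k: int) -> int:
    """Number of levels L such that space_size = k^L."""
    count, size = 0, 1
    while size < space_size:
        size *= k
        count += 1
    if size != space_size:
        raise ValueError("space_size must be a power of k")
    return count


def interval_starts(node: int, space_size: int, k: int, level: int):
    """Start identifiers of the k intervals at the given level (1-based)."""
    width = space_size // k ** level
    return [(node + j * width) % space_size for j in range(k)]


L = levels(64, 4)
for level in range(1, L + 1):
    print(level, interval_starts(0, 64, 4, level))
```

With k = 2 the same functions give Chord's binary structure, for example levels(16, 2) = 4.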
58 Achieving log_k(n) lookup
[Figure: next animation step of the ring seen from node 0, with interval boundaries at 0, 4, 8, 12, 16, 32 and 48; bullets as on slide 57]
59 Achieving log_k(n) lookup
[Figure: next animation step of the ring seen from node 0, with interval boundaries at 0, 4, 8, 12, 16, 32 and 48; bullets as on slide 57]
60 Arity is Important
- Maximum number of hops can be configured
- Example, a 2-hop system
61 Chord#
- The routing table has exponentially increasing pointers on the ring (node space) and NOT the identifier space (skip-list-like structure)
62 Routing Table of Chord#
- Building the routing table
- log2(N) pointers
- Exponentially spaced pointers
[Figure: routing-table pointers on the ring, labeled Chord#]
63 Chord# vs. Chord
- Good for load balancing