Title: CS 352 Peer 2 Peer Networking
1. CS 352: Peer-to-Peer Networking
- Credit: slides from J. Pang, B. Richardson, I. Stoica, M. Cuenca
2. Peer to Peer
- Outline
- Overview
- Systems
- Napster
- Gnutella
- Freenet
- BitTorrent
- Chord
- PlanetP
3. Why Study P2P?
- Huge fraction of traffic on networks today
- 50%!
- Exciting new applications
- Next level of resource sharing
- Vs. timesharing, client-server, P2P
- E.g. Access 10s-100s of TB at low cost.
4. Users and Usage
- 60M users of file-sharing in the US
- 8.5M logged in at a given time, on average
- 814M units of music sold in the US last year
- 140M digital tracks sold by music companies
- As of Nov., 35% of all Internet traffic was for BitTorrent, a single file-sharing system
- Major legal battles underway between the recording industry and file-sharing companies
5. Share of Internet Traffic
6. Number of Users
- Others include BitTorrent, eDonkey, iMesh, Overnet, Gnutella
- BitTorrent (and others) gaining share from FastTrack (Kazaa)
7. What is P2P?
- Use resources of end-hosts to accomplish a shared task
- Typically: share files
- Play games
- Search for patterns in data (SETI@home)
8. What's New?
- Taking advantage of resources at the edge of the network
- Fundamental shift in computing capability
- Increase in absolute bandwidth over the WAN
- What is the LAN/WAN ratio?
- Deploying server resources is still expensive
- Human or hardware cost?
- Where does P2P fit in?
9. P2P Systems
- Napster
- Launched P2P
- Centralized index
- Gnutella
- Focus is simple sharing
- Using simple flooding
- Kazaa
- More intelligent query routing
- BitTorrent
- Focus on download speed and fairness in sharing
10. More P2P Systems
- Freenet
- Focus privacy and anonymity
- Builds internal routing tables
- Chord
- Focus on building a distributed hash table (DHT)
- Finger tables
- PlanetP
- Focus on search and retrieval
- Creates global index on each node via controlled,
randomized flooding
11. Key Issues for P2P Systems
- Join/leave
- How do nodes join/leave? Who is allowed?
- Search and retrieval
- How to find content?
- How are metadata indexes built, stored, and distributed?
- Content distribution
- Where is content stored? How is it downloaded and retrieved?
12. Four Key Primitives
- Join
- How to enter/leave the P2P system?
- Publish
- How to advertise a file?
- Search
- How to find a file?
- Fetch
- How to download a file?
13. Publish and Search
- Basic strategies
- Centralized (Napster)
- Flood the query (Gnutella)
- Flood the index (PlanetP)
- Route the query (Chord)
- Different tradeoffs depending on application
- Robustness, scalability, legal issues
14. Napster History
- In 1999, S. Fanning launches Napster
- Peaked at 1.5 million simultaneous users
- Jul 2001, Napster shuts down
15. Napster Overview
- Centralized database
- Join: on startup, client contacts the central server
- Publish: client reports its list of files to the central server
- Search: query the server; it returns a peer that stores the requested file
- Fetch: get the file directly from that peer
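The centralized design above can be sketched in a few lines. This is a minimal, hypothetical simplification (class and method names are ours, not Napster's protocol): the server keeps a dictionary from filename to the peers that announced it, so search is a single lookup.

```python
# Sketch of a Napster-style central index (hypothetical names):
# the server maps each filename to the set of peers that published it.

class CentralIndex:
    def __init__(self):
        self.index = {}  # filename -> set of peer addresses

    def publish(self, peer, files):
        # Publish: a peer reports its file list to the server on join.
        for f in files:
            self.index.setdefault(f, set()).add(peer)

    def search(self, filename):
        # Search: O(1) dictionary lookup on the server.
        return sorted(self.index.get(filename, set()))

server = CentralIndex()
server.publish("123.2.21.23", ["X", "Y", "Z"])
server.publish("123.2.0.18", ["A"])
print(server.search("A"))  # ['123.2.0.18'] -- the fetch then goes directly to the peer
```

Note how all state (O(N) peers) and all query processing live on the one server, which is exactly the single point of failure discussed below.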
16. Napster Publish
[Diagram: peer 123.2.21.23 tells the server "I have X, Y, and Z!"; the server records insert(X, 123.2.21.23)]
17. Napster Search
[Diagram: a client asks "Where is file A?"; the server replies search(A) -> 123.2.0.18, and the client fetches from that peer]
18. Napster Discussion
- Pros
- Simple
- Search scope is O(1)
- Controllable (pro or con?)
- Cons
- Server maintains O(N) State
- Server does all processing
- Single point of failure
19. Gnutella History
- In 2000, J. Frankel and T. Pepper of Nullsoft released Gnutella
- Soon, many other clients: BearShare, Morpheus, LimeWire, etc.
- In 2001, many protocol enhancements, including ultrapeers
20. Gnutella Overview
- Query flooding
- Join: on startup, client contacts a few other nodes; these become its neighbors
- Publish: no need
- Search: ask neighbors, who ask their neighbors, and so on... when/if found, reply to the sender
- Fetch: get the file directly from a peer
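The "ask neighbors, who ask their neighbors" pattern is a bounded flood. A minimal sketch (all names and the graph/file layout are hypothetical): each query carries a TTL and a set of already-visited nodes for duplicate suppression, and any node holding the file replies.

```python
# Sketch of Gnutella-style query flooding with a hop limit (TTL).

def flood_search(graph, files, start, wanted, ttl=4):
    """Return the set of nodes storing `wanted` within `ttl` hops of `start`."""
    seen = {start}      # duplicate suppression: never re-ask a node
    frontier = [start]
    hits = set()
    for _ in range(ttl):
        nxt = []
        for node in frontier:
            for nb in graph.get(node, []):
                if nb in seen:
                    continue
                seen.add(nb)
                if wanted in files.get(nb, set()):
                    hits.add(nb)   # a reply would be routed back to the sender
                nxt.append(nb)
        frontier = nxt
    return hits

graph = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": [], "n4": []}
files = {"n4": {"A"}}
print(flood_search(graph, files, "n1", "A"))  # {'n4'}
```

The TTL is what keeps the flood from reaching all N nodes, but it also means a query can miss a file that exists, which is why Gnutella gives no search guarantees.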
21. Gnutella Search
[Diagram: the query "Where is file A?" floods from neighbor to neighbor until a node holding A replies]
22. Gnutella Discussion
- Pros
- Fully de-centralized
- Search cost distributed
- Cons
- Search scope is O(N)
- Search time is O(???)
- Nodes leave often, network unstable
23. Aside: Search Time?
24. Aside: All Peers Equal?
25. Aside: Network Resilience
[Figure: a partial topology stays connected when a random 30% of nodes die, but fragments when a targeted 4% die; from Saroiu et al., MMCN 2002]
26. KaZaA History
- In 2001, KaZaA was created by the Dutch company Kazaa BV
- Single network, called FastTrack, used by other clients as well: Morpheus, giFT, etc.
- Eventually the protocol was changed so other clients could no longer talk to it
- Most popular file-sharing network today, with ~10 million users (number varies)
27. KaZaA Overview
- Smart query flooding
- Join: on startup, client contacts a supernode... may at some point become one itself
- Publish: send list of files to the supernode
- Search: send query to the supernode; supernodes flood the query amongst themselves
- Fetch: get the file directly from peer(s); can fetch simultaneously from multiple peers
28. KaZaA Network Design
[Diagram: ordinary peers connect to supernodes; supernodes connect to one another]
29. KaZaA File Insert
[Diagram: peer 123.2.21.23 tells its supernode "I have X!"; the supernode records insert(X, 123.2.21.23)]
30. KaZaA File Search
[Diagram: the query "Where is file A?" goes to the supernode and floods among supernodes]
31. KaZaA Fetching
- More than one node may have the requested file...
- How to tell?
- Must be able to distinguish identical files
- Not necessarily the same filename
- The same filename is not necessarily the same file...
- Use a hash of the file
- KaZaA uses UUHash: fast, but not secure
- Alternatives: MD5, SHA-1
- How to fetch?
- Get bytes 0..1000 from A, 1001..2000 from B
- Alternative: erasure codes
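Both ideas above are easy to sketch (helper names are ours): identify a file by a cryptographic hash of its contents rather than its name, and split the download into contiguous byte ranges served by different peers.

```python
# Sketch: content hashing plus byte-range assignment for multi-peer fetch.

import hashlib

def file_id(data: bytes) -> str:
    # SHA-1 as a stand-in for a secure hash. (KaZaA's UUHash is faster
    # because it hashes only part of the file -- which is why it is insecure.)
    return hashlib.sha1(data).hexdigest()

def byte_ranges(size: int, peers: list):
    # Assign roughly equal contiguous ranges: bytes 0..k from A, k+1..2k from B, ...
    chunk = -(-size // len(peers))  # ceiling division
    return [(p, lo, min(lo + chunk, size) - 1)
            for p, lo in zip(peers, range(0, size, chunk))]

assert file_id(b"same bytes") == file_id(b"same bytes")      # same content, same id
assert file_id(b"same bytes") != file_id(b"other contents")  # regardless of filename
print(byte_ranges(2001, ["A", "B"]))  # [('A', 0, 1000), ('B', 1001, 2000)]
```

Hashing contents rather than names is what lets a client verify that two differently named copies are the same file, and detect a corrupted range before stitching the pieces together.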
32. KaZaA Discussion
- Pros
- Tries to take into account node heterogeneity
- Bandwidth
- Host Computational Resources
- Host Availability (?)
- Rumored to take into account network locality
- Cons
- Mechanisms easy to circumvent
- Still no real guarantees on search scope or
search time
33. BitTorrent History
- In 2002, B. Cohen debuted BitTorrent
- Key motivation
- Popularity exhibits temporal locality (flash crowds)
- E.g., the Slashdot effect, CNN on 9/11, a new movie/game release
- Focused on efficient fetching, not searching
- Distribute the same file to all peers
- Single publisher, multiple downloaders
- Has some real publishers
- Blizzard Entertainment used it to distribute the beta of their new game
34. BitTorrent Overview
- Swarming
- Join: contact a centralized tracker server, get a list of peers
- Publish: run a tracker server
- Search: out-of-band; e.g., use Google to find a tracker for the file you want
- Fetch: download chunks of the file from your peers; upload chunks you have to them
35. BitTorrent Publish/Join
[Diagram: peers contact the tracker to join the swarm]
36. BitTorrent Fetch
[Diagram: peers exchange chunks of the file with one another]
37. BitTorrent Sharing Strategy
- Employ a tit-for-tat sharing strategy
- "I'll share with you if you share with me"
- Be optimistic: occasionally let freeloaders download
- Otherwise no one would ever start!
- Also lets you discover better peers to download from when they reciprocate
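The strategy above amounts to a peer-selection rule. A simplified sketch (function and parameter names are ours; real BitTorrent recomputes this every few seconds with anti-snubbing rules we omit): unchoke the k peers that upload fastest to you, plus one random "optimistic unchoke" so newcomers and freeloaders get a chance to start.

```python
# Sketch of tit-for-tat unchoking with one optimistic slot.

import random

def choose_unchoked(upload_rates, k=3, rng=random.Random(0)):
    """upload_rates maps peer -> bytes/s that peer recently uploaded to us."""
    by_rate = sorted(upload_rates, key=upload_rates.get, reverse=True)
    unchoked = set(by_rate[:k])        # tit-for-tat: reciprocate the best uploaders
    rest = by_rate[k:]
    if rest:
        unchoked.add(rng.choice(rest)) # optimistic unchoke: discover/bootstrap peers
    return unchoked

rates = {"p1": 900, "p2": 800, "p3": 10, "p4": 0, "p5": 0}
print(sorted(choose_unchoked(rates)))  # p1, p2, p3 plus one of the zero-rate peers
```

The optimistic slot is what resolves the bootstrapping paradox in the bullets above: a brand-new peer has nothing to offer yet, but still occasionally gets served.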
38. BitTorrent Summary
- Pros
- Works reasonably well in practice
- Gives peers an incentive to share resources; avoids freeloaders
- Cons
- Central tracker server needed to bootstrap the swarm (is this really necessary?)
39. Freenet History
- In 1999, I. Clarke started the Freenet project
- Basic idea
- Employ Internet-like routing on the overlay network to publish and locate files
- Additional goals
- Provide anonymity and security
- Make censorship difficult
40. Freenet Overview
- Routed queries
- Join: on startup, client contacts a few other nodes it knows about; gets a unique node id
- Publish: route the file contents toward the file id. The file is stored at the node with the id closest to the file id
- Search: route a query for the file id toward the closest node id
- Fetch: when the query reaches a node containing the file id, it returns the file to the sender
41. Freenet Routing Tables
- Each routing table entry: (id, next_hop, file)
- id: file identifier (e.g., hash of the file)
- next_hop: another node that stores the file id
- file: the file identified by id, if stored on the local node
- Forwarding of a query for file id:
- If file id is stored locally, then stop
- Forward the data back to the upstream requestor
- If not, search for the closest id in the table and forward the message to the corresponding next_hop
- If the data is not found, failure is reported back
- The requestor then tries the next closest match in its routing table
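The forwarding rule above, including the backtracking on failure, can be sketched as a closest-id depth-first search (the data layout and names are hypothetical; real Freenet also caches files and rewrites sources along the path):

```python
# Sketch of Freenet-style closest-id routing with backtracking.

def freenet_route(tables, store, start, file_id, max_hops=10):
    """Return the node where file_id was found, or None on failure."""
    def search(node, hops, visited):
        if hops > max_hops or node in visited:
            return None                    # failure reported back upstream
        visited = visited | {node}
        if file_id in store.get(node, set()):
            return node                    # found: data flows back to the requestor
        # Try entries nearest to file_id first; on failure, try the next closest.
        for tid, nxt in sorted(tables.get(node, []),
                               key=lambda e: abs(e[0] - file_id)):
            found = search(nxt, hops + 1, visited)
            if found is not None:
                return found
        return None
    return search(start, 0, frozenset())

tables = {"n1": [(4, "n2"), (12, "n3")], "n2": [(9, "n4")], "n3": [], "n4": []}
store = {"n4": {10}}
print(freenet_route(tables, store, "n1", 10))  # 'n4'
```

Because publications for similar ids follow the same greedy rule, similar files cluster on the same nodes, which is the routing property discussed on the next slides.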
42. Freenet Routing
[Diagram: query(10) is routed among nodes n1..n6, each holding a routing table of (id, next_hop, file) entries; at each hop the query follows the entry whose id is closest to 10]
43. Freenet Routing Properties
- Close file ids tend to be stored on the same node
- Why? Publications of similar file ids route toward the same place
- The network tends to be a small world
- A small number of nodes have a large number of neighbors (i.e., six degrees of separation)
- Consequence
- Most queries traverse only a small number of hops to find the file
44. Freenet Anonymity and Security
- Anonymity
- Randomly modify the source of a packet as it traverses the network
- Can use mix-nets or onion routing
- Security and censorship resistance
- No constraints on how to choose ids for files: easy to make two files collide, creating denial of service (censorship)
- Solution: an id type that requires a private-key signature, verified when updating the file
- Cache files on the reverse path of queries/publications: an attempt to replace a file with bogus data will just cause the file to be replicated more!
45. Freenet Discussion
- Pros
- Intelligent routing makes queries relatively short
- Search scope is small (only nodes along the search path are involved); no flooding
- Anonymity properties may give you plausible deniability
- Cons
- Still no provable guarantees!
- Anonymity features make it hard to measure and debug
46. DHT History
- In 2000-2001, academic researchers said "we want to play too!"
- Motivation
- Frustrated by the popularity of all these "half-baked" P2P apps :)
- We can do better! (so we said)
- Guaranteed lookup success for files in the system
- Provable bounds on search time
- Provable scalability to millions of nodes
- Hot topic in networking ever since
47. DHT Overview
- Abstraction: a distributed hash table (DHT) data structure
- put(id, item)
- item = get(id)
- Implementation: nodes in the system form a distributed data structure
- Can be a ring, tree, hypercube, skip list, butterfly network, ...
48. DHT Overview (2)
- Structured overlay routing
- Join: on startup, contact a bootstrap node and integrate yourself into the distributed data structure; get a node id
- Publish: route the publication for a file id toward a close node id along the data structure
- Search: route a query for the file id toward a close node id. The data structure guarantees that the query will meet the publication
- Fetch: two options
- The publication contains the actual file: fetch from where the query stops
- The publication says "I have file X": the query tells you that 128.2.1.3 has X; use IP routing to get X from 128.2.1.3
49. DHT Example: Chord
- Associate to each node and file a unique id in a one-dimensional space (a ring)
- E.g., pick from the range 0..2^m - 1
- Usually the hash of the file or of the node's IP address
- Properties
- Routing table size is O(log N), where N is the total number of nodes
- Guarantees that a file is found in O(log N) hops
- From MIT, 2001
50. Chord
- Associate to each node and item a unique id in a one-dimensional space
- Goals
- Scale to hundreds of thousands of nodes
- Handle rapid arrival and failure of nodes
- Properties
- Routing table size is O(log N), where N is the total number of nodes
- Guarantees that a file is found in O(log N) steps
51. Data Structure
- Assume the identifier space is 0..2^m - 1
- Each node maintains
- A finger table
- Entry i in the finger table of node n is the first node that succeeds or equals n + 2^i
- A predecessor pointer
- An item identified by id is stored on the successor node of id
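The definitions above translate directly into code. A minimal sketch (a centralized simulation, not the distributed protocol; names are ours): the successor of an id is the first live node clockwise from it, and finger entry i of node n is the successor of (n + 2^i) mod 2^m.

```python
# Sketch: computing Chord successors and finger tables on a small ring.

def successor(nodes, ident, m):
    """First live node clockwise from ident on the 2**m ring."""
    ring = 2 ** m
    return min(nodes, key=lambda n: (n - ident) % ring)

def finger_table(nodes, n, m):
    """Entry i: (i, start id (n + 2^i) mod 2^m, the node that succeeds it)."""
    return [(i, (n + 2 ** i) % 2 ** m, successor(nodes, n + 2 ** i, m))
            for i in range(m)]

nodes = [0, 1, 2, 6]          # live node ids on a 3-bit ring (m = 3)
print(finger_table(nodes, 1, 3))  # [(0, 2, 2), (1, 3, 6), (2, 5, 6)]
```

The same `successor` function also decides placement: an item with a given id is stored at `successor(nodes, id, m)`.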
52. Hashing Keys to Nodes
53. Basic Lookup
54. Lookup Algorithm
- Lookup(my-id, key-id)
- n = my successor
- if my-id < n < key-id (on the ring)
- call Lookup(key-id) on node n // go to the next hop
- else
- return my successor // found the correct node
- Correctness depends only on successors
- O(N) lookup time, but we can do better
55. Shortcutting to O(log N) Time
56. Shortcutting
57. Basic Chord Algorithm
- Lookup(my-id, key-id)
- look in the local finger table for the highest node n such that my-id < n < key-id (on the ring)
- if n exists
- call Lookup(key-id) on node n // go to the next hop
- else
- return my successor // found the correct node
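The pseudocode above can be run as a small simulation (our encoding, not a network protocol): `fingers` maps each node to its finger-table successors and `succ` to its immediate successor, and all ring comparisons are done with clockwise distances modulo 2^m.

```python
# Sketch: finger-table lookup on a 3-bit Chord ring.

def chord_lookup(start, key, fingers, succ, m=3):
    """Follow fingers toward key; return (owner node, forwarding hops)."""
    ring = 2 ** m
    node, hops = start, 0
    while True:
        # Highest finger lying strictly between node and key on the ring.
        cands = [f for f in set(fingers[node])
                 if 0 < (f - node) % ring < (key - node) % ring]
        if not cands:
            return succ[node], hops          # the key's owner is our successor
        node = max(cands, key=lambda f: (f - node) % ring)
        hops += 1

# Ring with nodes 0, 1, 2, 6 (finger successors and immediate successors):
fingers = {0: [1, 2, 6], 1: [2, 6, 6], 2: [6, 6, 6], 6: [0, 0, 2]}
succ = {0: 1, 1: 2, 2: 6, 6: 0}
print(chord_lookup(1, 7, fingers, succ))  # (0, 1): key 7 lives on node 0
```

Because each finger roughly halves the remaining clockwise distance to the key, the number of forwarding hops stays O(log N).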
58. Chord Example
- Assume an identifier space 0..7 (m = 3)
- Node 1 joins; all entries in its finger table are initialized to itself

  Node 1: i=0: id 2 -> succ 1; i=1: id 3 -> succ 1; i=2: id 5 -> succ 1

59. Chord Example
- Node 2 joins; node 1's first finger entry now points to node 2

  Node 1: i=0: id 2 -> succ 2; i=1: id 3 -> succ 1; i=2: id 5 -> succ 1
  Node 2: i=0: id 3 -> succ 1; i=1: id 4 -> succ 1; i=2: id 6 -> succ 1

60. Chord Example
- Nodes 0 and 6 join; all finger tables now reflect nodes {0, 1, 2, 6}

  Node 0: i=0: id 1 -> succ 1; i=1: id 2 -> succ 2; i=2: id 4 -> succ 6
  Node 1: i=0: id 2 -> succ 2; i=1: id 3 -> succ 6; i=2: id 5 -> succ 6
  Node 2: i=0: id 3 -> succ 6; i=1: id 4 -> succ 6; i=2: id 6 -> succ 6
  Node 6: i=0: id 7 -> succ 0; i=1: id 0 -> succ 0; i=2: id 2 -> succ 2

61. Chord Examples
- Nodes with ids 1, 2, 0, 6; items with ids 7 and 2
- Each item is stored at the successor of its id: item 7 at node 0, item 2 at node 2
- Finger tables as on the previous slide
62. Query
- Upon receiving a query for item id, a node:
- Checks whether it stores the item locally
- If not, forwards the query to the largest node in its successor (finger) table that does not exceed id

[Diagram: query(7) issued at node 1 is forwarded via finger id 5 to node 6, whose finger id 7 points to node 0; node 0 stores item 7]
63. Node Joining
- Node n joins the system:
- n picks a random identifier, id
- n performs n' = lookup(id) (via any known node)
- n.successor = n'
64. State Maintenance: Stabilization Protocol
- Periodically, node n:
- Asks its successor n' for n'.predecessor, call it n''
- If n'' is between n and n':
- n.successor = n''
- Notify the successor that n is its predecessor
- When node n' receives a notification message from n:
- If n is between n'.predecessor and n', then:
- n'.predecessor = n
- Improving robustness
- Each node maintains a successor list (usually of size 2 log N)
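One round of the protocol above can be simulated in a few lines (an in-memory sketch with our own field names, not the RPC-based original): node n asks its successor for its predecessor, adopts that predecessor as a closer successor if it lies between them, and then notifies.

```python
# Sketch: one stabilization round on a ring held in two dicts.

def stabilize(n, succ, pred, m=3):
    ring = 2 ** m
    def between(x, lo, hi):   # x strictly inside (lo, hi) going clockwise
        return x is not None and 0 < (x - lo) % ring < (hi - lo) % ring
    s = succ[n]
    p = pred.get(s)
    if between(p, n, s):
        succ[n] = p           # a closer successor has appeared between us
        s = p
    # Notify: s considers n as a candidate predecessor.
    if pred.get(s) is None or between(n, pred[s], s):
        pred[s] = n

# Node 2 has just joined between 1 and 6; node 1 still points at 6.
succ = {1: 6, 2: 6, 6: 1}
pred = {6: 2, 2: None, 1: None}
stabilize(1, succ, pred)
print(succ[1], pred[2])  # 2 1 -- node 1 adopts 2, and 2 learns its predecessor
```

Run periodically at every node, rounds like this repair successor/predecessor pointers after joins and failures, which is all that lookup correctness depends on.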
65. PlanetP
- Flooding the index
- Join: on startup, client contacts a node it knows about and starts gossiping its node id
- Publish: flood the local index via gossip (random exchange)
- Search: search the local index, contact relevant peers with the query
- Fetch: downloads controlled by a content-ranking algorithm
66. PlanetP Introduction
- 1st generation of P2P applications based on ad-hoc solutions
- File sharing (Kazaa, Gnutella, etc.), spare-cycle usage (SETI@home)
- More recently, many projects are focusing on building infrastructure for large-scale key-based object location (DHTs)
- Chord, Tapestry, and others
- Used to build global file systems (Farsite, OceanStore)
- What about content-based location?
67. Goals and Challenges
- Provide content addressing and ranking in P2P
- Similar to Google and other search engines
- Ranking is critical for navigating terabytes of data
- Challenges
- Resources are divided among a large set of heterogeneous peers
- No central management and administration
- Uncontrolled peer behavior
- Gathering accurate global information is too expensive
68. The PlanetP Infrastructure
- Compact global index of shared information
- Supports resource discovery and location
- Extremely compact, to minimize the global storage requirement
- Kept loosely synchronized and globally replicated
- Epidemic-based communication layer
- Provides efficient and reliable communication despite unpredictable peer behavior
- Supports peer discovery (membership), group communication, and update propagation
- Distributed information-ranking algorithm
- Locates highly relevant information in large shared document collections
- Based on TFxIDF, a state-of-the-art ranking technique
- Adapted to work with only partial information
69. Using PlanetP
- Services provided by PlanetP
- Content addressing and ranking
- Resource discovery for adaptive applications
- Group membership management
- Close collaboration
- Publish/Subscribe information propagation
- Decoupled communication and timely propagation
- Group communication
- Simplify development of distributed apps.
70. Global Information Index
- Each node maintains an index of its own content
- Summarizes the set of terms in its index using a Bloom filter
- The global index is the set of all summaries
- Term-to-peer mappings
- List of online peers
- Summaries are propagated and kept synchronized using gossiping
[Diagram: gossiping]
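A Bloom filter is what makes the per-node summaries so compact. The sketch below uses made-up sizes and a simple SHA-1-based hashing scheme (not PlanetP's actual parameters): membership tests can yield false positives but never false negatives, so a peer whose filter lacks a query term can be safely skipped.

```python
# Sketch: a term summary as a Bloom filter over a fixed-size bit array.

import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k=4):
        self.m, self.k = m_bits, k
        self.bits = 0  # the whole filter is one big integer bitmap

    def _positions(self, term):
        # k hash positions, derived by salting the term with the index.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{term}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        return all(self.bits >> pos & 1 for pos in self._positions(term))

# Each peer gossips its filter; the set of all filters is the global index.
summary = BloomFilter()
for term in ["peer", "gossip", "index"]:
    summary.add(term)
print("gossip" in summary)  # True -- no false negatives, only rare false positives
```

The space win is the point: a fixed 1 Kbit filter here summarizes an arbitrary term set, which is how the global index on the later slides stays around 1% of the collection size.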
71. Epidemic Communication in P2P
- Nodes push and pull randomly from each other
- Unstructured communication: resilient to failures
- Predictable convergence time
- Novel combination of previously known techniques
- Rumoring, anti-entropy, and partial anti-entropy
- Introduces partial anti-entropy to reduce the variance in propagation time for dynamic communities
- Batches updates into communication rounds for efficiency
- Dynamic slow-down in the absence of updates to save bandwidth
72. Content Search in PlanetP
[Diagram]
73. Results Ranking
- The vector-space model
- Documents and queries are represented as k-dimensional vectors
- Each dimension represents the relevance, or weight, of the word for the document
- The angle between a query and a document indicates their similarity
- Does not require links between documents
- Weight assignment (TFxIDF)
- Use Term Frequency (TF) to weight terms for documents
- Use Inverse Document Frequency (IDF) to weight terms for the query
- Intuition
- TF indicates how relevant a document is to a particular concept
- IDF gives more weight to terms that are good discriminators between documents
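The vector-space model above fits in a few lines. This is a toy sketch (tiny corpus, natural-log IDF with a +1 smoothing term, which is one common variant, not necessarily PlanetP's exact weighting): documents and the query become weighted term vectors, and similarity is the cosine of the angle between them.

```python
# Sketch: TFxIDF vectors and cosine similarity over a toy corpus.

import math
from collections import Counter

def tfidf_vectors(docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))          # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}       # +1 keeps shared terms nonzero
    vecs = [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]
    return vecs, idf

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

docs = [["p2p", "routing", "chord"], ["p2p", "gossip"], ["cooking", "pasta"]]
vecs, idf = tfidf_vectors(docs)
query = {t: idf.get(t, 0.0) for t in ["p2p", "gossip"]}
print([round(cosine(query, v), 2) for v in vecs])  # the gossip doc scores highest
```

Note the two roles from the slide: TF weights terms within a document, while IDF boosts rare, discriminating terms like "gossip" over common ones like "p2p".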
74. Using TFxIDF in P2P
- Unfortunately, IDF is not suited to P2P
- Requires term-to-document mappings
- Requires a frequency count for every term in the shared collection
- Instead, use a two-phase approximation algorithm
- Replace IDF with IPF (Inverse Peer Frequency)
- IPF(t) = f(number of peers / number of peers with documents containing term t)
- Individuals can compute a consistent global ranking of peers and documents without knowing the global frequency count of terms
- Node ranking function
75. Pruning Searches
- Centralized search engines have an index for the entire collection
- Can rank the entire set of documents for each query
- In a P2P community, we do not want to contact peers that have only marginally relevant documents
- Use an adaptive heuristic to limit forwarding of the query in the 2nd phase to only a subset of the most highly ranked peers
76. Evaluation
- Answer the following questions
- What is the efficacy of our distributed ranking algorithm?
- What is the storage cost for the globally replicated index?
- How well does gossiping work in P2P communities?
- Evaluation methodology
- Use a running prototype to validate and collect micro-benchmarks (tested with up to 200 nodes)
- Use simulation to predict performance in big communities
- We model peer behavior based on previous work and our own measurements from a local P2P community of 4000 users
- Will show a sampling of results from the paper
77. Ranking Evaluation I
- We use the AP89 collection from TREC
- 84,678 documents; 129,603 words; 97 queries; 266 MB
- Each collection comes with a set of queries and relevance judgments
- We measure recall (R) and precision (P)
78. Ranking Evaluation II
- The results intersection is 70% at low recall and reaches 100% as recall increases
- To get 10 documents, PlanetP contacted 20 peers out of 160 candidates
79. Global Index Space Efficiency
- TREC collection (pure text)
- Simulate a community of 5000 nodes
- Distribute documents uniformly
- 944,651 documents taking up 3 GB
- 36 MB of RAM are needed to store the global index
- This is 1% of the total collection size
- MP3 collection (audio tags)
- Using the previous result, but based on Gnutella measurements
- 3,000,000 MP3 files taking up 14 TB
- 36 MB of RAM are needed to store the global index
- This is 0.0002% of the total collection size
80. Data Propagation
[Figures: arrival and departure experiment (LAN); propagation speed experiment (DSL)]
81. PlanetP Summary
- Explored the design of infrastructural support for a rich set of P2P applications
- Membership, content addressing, and ranking
- Scales well to thousands of peers
- Extremely tolerant of unpredictable, dynamic peer behavior
- Gossiping with partial anti-entropy is reliable
- Information always propagates everywhere
- Propagation time has small variance
- Distributed approximation of TFxIDF
- Within 11% of a centralized implementation
- Never collects all needed information in one place
- The global index is on average only 1% of the data collection
- Synchronization of the global index requires only 50 B/s