CS 352 Peer 2 Peer Networking

1
CS 352-Peer 2 Peer Networking
  • Credit slides from J. Pang, B. Richardson, I.
    Stoica, M. Cuenca

2
Peer to Peer
  • Outline
  • Overview
  • Systems
  • Napster
  • Gnutella
  • Freenet
  • BitTorrent
  • Chord
  • PlanetP

3
Why Study P2P
  • Huge fraction of traffic on networks today
  • 50%!
  • Exciting new applications
  • Next level of resource sharing
  • Vs. timesharing, client-server, P2P
  • E.g. Access 10s-100s of TB at low cost.

4
Users and Usage
  • 60M users of file-sharing in US
  • 8.5M logged in at a given time on average
  • 814M units of music sold in US last year
  • 140M digital tracks sold by music companies
  • As of Nov, 35% of all Internet traffic was for
    BitTorrent, a single file-sharing system
  • Major legal battles underway between recording
    industry and file-sharing companies

5
Share of Internet Traffic
6
Number of Users
Others include BitTorrent, eDonkey, iMesh, Overnet,
Gnutella. BitTorrent (and others) gaining share from
FastTrack (Kazaa).
7
What is P2P?
  • Use resources of end-hosts to accomplish a shared
    task
  • Typically share files
  • Play game
  • Search for patterns in data (SETI@Home)

8
What's new?
  • Taking advantage of resources at the edge of the
    network
  • Fundamental shift in computing capability
  • Increase in absolute bandwidth over WAN
  • What is LAN/WAN ratio?
  • Deploying server resources still expensive
  • Human or hardware cost?
  • Where does P2P fit in?

9
P2P systems
  • Napster
  • Launched P2P
  • Centralized index
  • Gnutella
  • Focus is simple sharing
  • Using simple flooding
  • Kazaa
  • More intelligent query routing
  • BitTorrent
  • Focus on Download speed, fairness in sharing

10
More P2P systems
  • Freenet
  • Focus on privacy and anonymity
  • Builds internal routing tables
  • Chord
  • Focus on building a distributed hash table (DHT)
  • Finger tables
  • PlanetP
  • Focus on search and retrieval
  • Creates global index on each node via controlled,
    randomized flooding

11
Key issues for P2P systems
  • Join/leave
  • How do nodes join/leave? Who is allowed?
  • Search and retrieval
  • How to find content?
  • How are metadata indexes built, stored,
    distributed?
  • Content Distribution
  • Where is content stored? How is it downloaded and
    retrieved?

12
4 Key Primitives
  • Join
  • How to enter/leave the P2P system?
  • Publish
  • How to advertise a file?
  • Search
  • How to find a file?
  • Fetch
  • How to download a file?

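To make the rest of the outline concrete, here is a minimal sketch of these
four primitives as a Python interface; the class and method names are
illustrative only, and each system below (Napster, Gnutella, Chord, ...)
fills them in differently.

    from abc import ABC, abstractmethod

    class P2PNode(ABC):
        # Hypothetical interface for the four primitives; not from any real client.
        @abstractmethod
        def join(self, bootstrap_addr: str) -> None:
            """Enter the P2P system via a known peer or server."""

        @abstractmethod
        def publish(self, filename: str, data: bytes) -> None:
            """Advertise a file so other peers can find it."""

        @abstractmethod
        def search(self, filename: str) -> list:
            """Return addresses of peers believed to hold the file."""

        @abstractmethod
        def fetch(self, peer_addr: str, filename: str) -> bytes:
            """Download the file directly from a peer."""
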
13
Publish and Search
  • Basic strategies
  • Centralized (Napster)
  • Flood the query (Gnutella)
  • Flood the index (PlanetP)
  • Route the query (Chord)
  • Different tradeoffs depending on application
  • Robustness, scalability, legal issues

14
Napster History
  • In 1999, S. Fanning launches Napster
  • Peaked at 1.5 million simultaneous users
  • Jul 2001, Napster shuts down

15
Napster Overview
  • Centralized Database
  • Join: on startup, client contacts central server
  • Publish: reports list of files to central server
  • Search: query the server; it returns someone that
    stores the requested file
  • Fetch: get the file directly from peer

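A minimal sketch of the centralized-index idea (illustrative Python, not
Napster's actual protocol or message formats):

    class CentralIndex:
        # Toy Napster-style central server: it only stores metadata;
        # files themselves are fetched peer-to-peer.
        def __init__(self):
            self.index = {}                        # filename -> set of peer addresses

        def publish(self, peer_addr, filenames):   # peer reports its files on join
            for name in filenames:
                self.index.setdefault(name, set()).add(peer_addr)

        def search(self, filename):                # server answers every query
            return self.index.get(filename, set())

        def unpublish(self, peer_addr):            # forget peers that leave
            for peers in self.index.values():
                peers.discard(peer_addr)

    # server.publish("123.2.21.23", ["X", "Y", "Z"])
    # server.search("X") -> {"123.2.21.23"}; the fetch then happens directly between peers.
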
16
Napster Publish
[Diagram: peer 123.2.21.23 tells the central server "I have X, Y, and Z!";
the server records insert(X, 123.2.21.23), ...]
17
Napster Search
[Diagram: a client asks the central server "Where is file A?"; the server
replies search(A) -> 123.2.0.18, and the client fetches A directly from
123.2.0.18]
18
Napster Discussion
  • Pros
  • Simple
  • Search scope is O(1)
  • Controllable (pro or con?)
  • Cons
  • Server maintains O(N) State
  • Server does all processing
  • Single point of failure

19
Gnutella History
  • In 2000, J. Frankel and T. Pepper from Nullsoft
    released Gnutella
  • Soon many other clients: Bearshare, Morpheus,
    LimeWire, etc.
  • In 2001, many protocol enhancements including
    ultrapeers

20
Gnutella Overview
  • Query Flooding
  • Join: on startup, client contacts a few other
    nodes; these become its neighbors
  • Publish: no need
  • Search: ask neighbors, who ask their neighbors,
    and so on... when/if found, reply to sender.
  • Fetch: get the file directly from peer

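A rough sketch of query flooding with a TTL and duplicate suppression; this
is an assumed simplification, not Gnutella's actual message format:

    class GnutellaNode:
        def __init__(self, address, local_files=()):
            self.address = address
            self.local_files = set(local_files)
            self.neighbors = []       # a few other nodes contacted at join time
            self.seen = set()         # query ids this node has already handled
            self.hits = []            # replies for queries this node originated

        def query(self, qid, filename, ttl, origin):
            if qid in self.seen:      # drop duplicate copies of the flooded query
                return
            self.seen.add(qid)
            if filename in self.local_files:
                origin.hits.append(self.address)   # when/if found, reply to sender
            if ttl > 0:               # keep flooding to neighbors until TTL expires
                for n in self.neighbors:
                    n.query(qid, filename, ttl - 1, origin)
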
21
Gnutella Search
Where is file A?
22
Gnutella Discussion
  • Pros
  • Fully de-centralized
  • Search cost distributed
  • Cons
  • Search scope is O(N)
  • Search time is O(???)
  • Nodes leave often, network unstable

23
Aside: Search Time?
24
Aside: All Peers Equal?
25
Aside: Network Resilience
Partial Topology
Random: 30% die
Targeted: 4% die
from Saroiu et al., MMCN 2002
26
KaZaA History
  • In 2001, KaZaA created by Dutch company Kazaa BV
  • Single network called FastTrack used by other
    clients as well: Morpheus, giFT, etc.
  • Eventually protocol changed so other clients
    could no longer talk to it
  • Most popular file sharing network today with 10
    million users (number varies)

27
KaZaA Overview
  • Smart Query Flooding
  • Join: on startup, client contacts a supernode
    ... may at some point become one itself
  • Publish: send list of files to supernode
  • Search: send query to supernode; supernodes flood
    the query amongst themselves.
  • Fetch: get the file directly from peer(s); can
    fetch simultaneously from multiple peers

28
KaZaA Network Design
29
KaZaA File Insert
[Diagram: peer 123.2.21.23 tells its supernode "I have X!"; the supernode
records insert(X, 123.2.21.23), ...]
30
KaZaA File Search
Where is file A?
31
KaZaA Fetching
  • More than one node may have the requested file...
  • How to tell?
  • Must be able to distinguish identical files
  • Not necessarily same filename
  • Same filename not necessarily same file...
  • Use hash of file
  • KaZaA uses UUHash: fast, but not secure
  • Alternatives: MD5, SHA-1
  • How to fetch?
  • Get bytes 0..1000 from A, 1001..2000 from B
  • Alternative: Erasure Codes

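A short sketch of the two ideas above: identify a file by a content hash
(SHA-1, one of the alternatives named on the slide; KaZaA itself used the
weaker UUHash) and split the download into byte ranges across the peers
that hold the same hash. The helper names are made up.

    import hashlib

    def file_id(data: bytes) -> str:
        # Same bytes -> same id, regardless of filename.
        return hashlib.sha1(data).hexdigest()

    def plan_ranges(size: int, peers: list, chunk: int = 1000):
        # Assign byte ranges round-robin over peers holding the same hash,
        # e.g. bytes 0..999 from peer A, 1000..1999 from peer B, ...
        assert peers, "need at least one peer"
        ranges = []
        for i, start in enumerate(range(0, size, chunk)):
            end = min(start + chunk, size) - 1
            ranges.append((peers[i % len(peers)], start, end))
        return ranges
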
32
KaZaA Discussion
  • Pros
  • Tries to take into account node heterogeneity
  • Bandwidth
  • Host Computational Resources
  • Host Availability (?)
  • Rumored to take into account network locality
  • Cons
  • Mechanisms easy to circumvent
  • Still no real guarantees on search scope or
    search time

33
BitTorrent History
  • In 2002, B. Cohen debuted BitTorrent
  • Key Motivation
  • Popularity exhibits temporal locality (Flash
    Crowds)
  • E.g., Slashdot effect, CNN on 9/11, new
    movie/game release
  • Focused on Efficient Fetching, not Searching
  • Distribute the same file to all peers
  • Single publisher, multiple downloaders
  • Has some real publishers
  • Blizzard Entertainment using it to distribute the
    beta of their new game

34
BitTorrent Overview
  • Swarming
  • Join: contact centralized tracker server, get a
    list of peers.
  • Publish: Run a tracker server.
  • Search: Out-of-band. E.g., use Google to find a
    tracker for the file you want.
  • Fetch: Download chunks of the file from your
    peers. Upload chunks you have to them.

35
BitTorrent Publish/Join
Tracker
36
BitTorrent Fetch
37
BitTorrent Sharing Strategy
  • Employ Tit-for-tat sharing strategy
  • I'll share with you if you share with me
  • Be optimistic: occasionally let freeloaders
    download
  • Otherwise no one would ever start!
  • Also allows you to discover better peers to
    download from when they reciprocate

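A hedged sketch of the tit-for-tat idea: unchoke the peers that have
uploaded to you fastest, plus one random "optimistic" unchoke so that
newcomers (and the swarm itself) can get started. The slot count and
function names are illustrative, not BitTorrent's exact parameters.

    import random

    def choose_unchoked(upload_rate_to_me: dict, regular_slots: int = 3):
        # Tit-for-tat: prefer peers that have recently uploaded to us.
        ranked = sorted(upload_rate_to_me, key=upload_rate_to_me.get, reverse=True)
        unchoked = set(ranked[:regular_slots])
        # Optimistic unchoke: occasionally give a random other peer a slot,
        # letting it prove itself (and letting some freeloaders download).
        others = [p for p in upload_rate_to_me if p not in unchoked]
        if others:
            unchoked.add(random.choice(others))
        return unchoked
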
38
BitTorrent Summary
  • Pros
  • Works reasonably well in practice
  • Gives peers incentive to share resources; avoids
    freeloaders
  • Cons
  • Central tracker server needed to bootstrap swarm
    (is this really necessary?)

39
Freenet History
  • In 1999, I. Clarke started the Freenet project
  • Basic Idea
  • Employ Internet-like routing on the overlay
    network to publish and locate files
  • Additional goals
  • Provide anonymity and security
  • Make censorship difficult

40
Freenet Overview
  • Routed Queries
  • Join: on startup, client contacts a few other
    nodes it knows about; gets a unique node id
  • Publish: route file contents toward the file id.
    File is stored at node with id closest to file id
  • Search: route query for file id toward the
    closest node id
  • Fetch: when query reaches a node containing file
    id, it returns the file to the sender

41
Freenet Routing Tables
  • id: file identifier (e.g., hash of file)
  • next_hop: another node that stores the file id
  • file: file identified by id, if stored on the
    local node
  • Forwarding of query for file id
  • If file id stored locally, then stop
  • Forward data back to upstream requestor
  • If not, search for the closest id in the table,
    and forward the message to the corresponding
    next_hop
  • If data is not found, failure is reported back
  • Requestor then tries next closest match in
    routing table

Routing table columns: id | next_hop | file

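A sketch of the forwarding rule described above, assuming numeric ids and
node objects with a `files` dict and a `table` of id -> next_hop entries;
real Freenet also caches returned files along the path and bounds the
search with a hops-to-live counter.

    def freenet_lookup(node, file_id, visited=None):
        if visited is None:
            visited = set()
        if file_id in node.files:              # stored locally: stop, return data upstream
            return node.files[file_id]
        visited.add(node)
        # Otherwise try routing-table entries in order of closeness to file_id;
        # on failure the requestor falls through to the next closest match.
        for rid, next_hop in sorted(node.table.items(),
                                    key=lambda entry: abs(entry[0] - file_id)):
            if next_hop in visited:
                continue
            result = freenet_lookup(next_hop, file_id, visited)
            if result is not None:
                return result                  # data flows back along the query path
        return None                            # failure reported back to the requestor
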
42
Freenet Routing
[Diagram: query(10) routed across nodes n1-n6; each node holds a routing
table of (id, next_hop, file) entries such as (4, n1, f4), (12, n2, f12),
(9, n3, f9), (14, n5, f14), (10, n5, f10); the query hops toward the node
whose stored id is closest to 10]
43
Freenet Routing Properties
  • Close file ids tend to be stored on the same
    node
  • Why? Publications of similar file ids route
    toward the same place
  • Network tends to be a small world
  • Small number of nodes have a large number of
    neighbors (i.e., six degrees of separation)
  • Consequence
  • Most queries only traverse a small number of hops
    to find the file

44
Freenet Anonymity and Security
  • Anonymity
  • Randomly modify source of packet as it traverses
    the network
  • Can use mix-nets or onion-routing
  • Security / Censorship resistance
  • No constraints on how to choose ids for files;
    easy to have two files collide, creating denial
    of service (censorship)
  • Solution: have an id type that requires a private
    key signature that is verified when updating the
    file
  • Cache file on the reverse path of
    queries/publications; an attempt to replace a file
    with bogus data will just cause the file to be
    replicated more!

45
Freenet Discussion
  • Pros
  • Intelligent routing makes queries relatively
    short
  • Search scope is small (only nodes along the search
    path are involved); no flooding
  • Anonymity properties may give you plausible
    deniability
  • Cons
  • Still no provable guarantees!
  • Anonymity features make it hard to measure, debug

46
DHT History
  • In 2000-2001, academic researchers said "we want
    to play too!"
  • Motivation
  • Frustrated by popularity of all these
    half-baked P2P apps :)
  • We can do better! (so we said)
  • Guaranteed lookup success for files in system
  • Provable bounds on search time
  • Provable scalability to millions of nodes
  • Hot Topic in networking ever since

47
DHT Overview
  • Abstraction: a distributed hash table (DHT)
    data structure
  • put(id, item)
  • item = get(id)
  • Implementation: nodes in system form a
    distributed data structure
  • Can be Ring, Tree, Hypercube, Skip List,
    Butterfly Network, ...

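A tiny sketch of the abstraction, simulated in one process: keys hash into
an identifier space and each key goes to the node "responsible" for its id
(here a successor-style rule stands in for the structured routing described
next). Everything here is illustrative.

    import hashlib

    ID_SPACE = 2**8    # assumed small id space for the sketch

    def key_id(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % ID_SPACE

    class TinyDHT:
        def __init__(self, node_ids):
            self.nodes = {nid: {} for nid in sorted(node_ids)}   # node id -> local store

        def _responsible(self, kid):
            # First node id >= the key id, wrapping around the ring.
            for nid in self.nodes:
                if nid >= kid:
                    return nid
            return next(iter(self.nodes))

        def put(self, key, item):
            self.nodes[self._responsible(key_id(key))][key] = item

        def get(self, key):
            return self.nodes[self._responsible(key_id(key))].get(key)
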
48
DHT Overview (2)
  • Structured Overlay Routing
  • Join: On startup, contact a bootstrap node and
    integrate yourself into the distributed data
    structure; get a node id
  • Publish: Route publication for file id toward a
    close node id along the data structure
  • Search: Route a query for file id toward a close
    node id. Data structure guarantees that query
    will meet the publication.
  • Fetch: Two options
  • Publication contains actual file: fetch from
    where query stops
  • Publication says "I have file X": query tells
    you 128.2.1.3 has X; use IP routing to get X from
    128.2.1.3

49
DHT Example - Chord
  • Associate to each node and file a unique id in a
    uni-dimensional space (a Ring)
  • E.g., pick from the range 0...2^m
  • Usually the hash of the file or IP address
  • Properties
  • Routing table size is O(log N), where N is the
    total number of nodes
  • Guarantees that a file is found in O(log N) hops

from MIT in 2001
50
Chord
  • Associate to each node and item a unique id in a
    uni-dimensional space
  • Goals
  • Scales to hundreds of thousands of nodes
  • Handles rapid arrival and failure of nodes
  • Properties
  • Routing table size O(log(N)) , where N is the
    total number of nodes
  • Guarantees that a file is found in O(log(N)) steps

51
Data Structure
  • Assume identifier space is 0..2^m
  • Each node maintains
  • Finger table
  • Entry i in the finger table of n is the first
    node that succeeds or equals n + 2^i
  • Predecessor node
  • An item identified by id is stored on the
    successor node of id

52
Hashing Keys to Nodes
53
Basic Lookup
54
Lookup Algorithm
  • Lookup(my-id, key-id)
  • n = my successor
  • if my-id < n < key-id
  • call Lookup(id) on node n // goto next hop
  • else
  • return my successor // found the correct node
  • Correctness depends only on successors
  • O(N) lookup time, but we can do better

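A sketch of this successor-only lookup, with ring intervals taken modulo
the identifier space (m = 8 identifiers here, matching the 0..7 example
used on the following slides); the names are illustrative.

    from dataclasses import dataclass

    @dataclass
    class ChordNode:
        id: int
        successor: "ChordNode" = None

    def in_interval(x, a, b, m=8):
        # True if id x lies in the ring interval (a, b], ids taken mod m.
        return x != a and (x - a) % m <= (b - a) % m

    def lookup(node, key_id):
        # Follow successor pointers until key_id falls in (node, node.successor].
        while not in_interval(key_id, node.id, node.successor.id):
            node = node.successor          # O(N) hops in the worst case
        return node.successor              # the node responsible for key_id

    # Example ring with node ids 0, 1, 3, 6 (as in the slides that follow):
    # ns = [ChordNode(i) for i in (0, 1, 3, 6)]
    # for a, b in zip(ns, ns[1:] + ns[:1]): a.successor = b
    # lookup(ns[1], 7).id  ->  0    (item 7 is stored on node 0)
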
55
Shortcutting to Log(N) time
56
Shortcutting
57
Basic Chord algorithm
  • Lookup(my-id, key-id)
  • look in local finger table for the
    highest node n such that my-id < n < key-id
  • if n exists
  • call Lookup(id) on node n // goto next hop
  • else
  • return my successor // found the correct node

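Continuing the previous sketch, roughly the same routine using a finger
table (a `fingers` list on each node, ordered by finger index): jump to the
furthest finger that still lies strictly between the current node and the
key, which is what gives the O(log N) hop count.

    def finger_lookup(node, key_id, m=8):
        # Done if key_id falls between us and our immediate successor.
        if in_interval(key_id, node.id, node.successor.id, m):
            return node.successor
        # Otherwise take the highest finger n' with my-id < n' < key-id.
        for finger in reversed(node.fingers):            # furthest fingers first
            if finger.id != key_id and in_interval(finger.id, node.id, key_id, m):
                return finger_lookup(finger, key_id, m)
        return node.successor                            # fall back on the successor
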
58
Chord Example
  • Assume an identifier space 0..8
  • Node n1(1) joins; all entries in its finger table
    are initialized to itself

[Ring diagram, identifiers 0-7. Succ. table of n1 (columns i, id+2^i, succ):
(0, 2, 1), (1, 3, 1), (2, 5, 1)]
59
Chord Example
  • Node n2(3) joins

[Ring diagram. Succ. table of n1: (0, 2, 2), (1, 3, 1), (2, 5, 1).
Succ. table of n2: (0, 3, 1), (1, 4, 1), (2, 6, 1)]
60
Chord Example
  • Nodes n3(0), n4(6) join

[Ring diagram. Succ. tables, one per node:
(0, 1, 1), (1, 2, 2), (2, 4, 6);
(0, 2, 2), (1, 3, 6), (2, 5, 6);
(0, 7, 0), (1, 0, 0), (2, 2, 2);
(0, 3, 6), (1, 4, 6), (2, 6, 6)]
61
Chord Examples
  • Nodes n1(1), n2(3), n3(0), n4(6)
  • Items f1(7), f2(2)

[Ring diagram with the items placed on their successor nodes and the same
per-node succ. tables:
(0, 1, 1), (1, 2, 2), (2, 4, 6);
(0, 2, 2), (1, 3, 6), (2, 5, 6);
(0, 7, 0), (1, 0, 0), (2, 2, 2);
(0, 3, 6), (1, 4, 6), (2, 6, 6)]
62
Query
  • Upon receiving a query for item id, a node
  • Checks whether it stores the item locally
  • If not, forwards the query to the largest node in
    its successor table that does not exceed id

[Ring diagram: query(7) is forwarded along successor-table entries until
it reaches the node that stores item 7; succ. tables as on the previous
slide]
63
Node Joining
  • Node n joins the system
  • n picks a random identifier, id
  • n performs n' = lookup(id)
  • n->successor = n'

64
State Maintenance Stabilization Protocol
  • Periodically node n
  • Asks its successor n' about its predecessor n''
  • If n'' is between n and n'
  • n->successor = n''
  • notify n->successor that n is its predecessor
  • When node n' receives a notification message from
    n
  • If n is between n'->predecessor and n', then
  • n'->predecessor = n
  • Improve robustness
  • Each node maintains a successor list (usually of
    size 2 log N)

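A compact sketch of stabilize/notify for nodes that also keep a predecessor
pointer; the interval helper is the same (a, b] ring test as in the earlier
Chord sketch, and failure handling plus the successor list are omitted.

    def between(x, a, b, m=8):
        # True if id x lies on the ring after a and up to b (ids mod m).
        return x != a and (x - a) % m <= (b - a) % m

    def stabilize(n):
        # Ask our successor about its predecessor; adopt it if it sits between us.
        x = n.successor.predecessor
        if x is not None and x is not n.successor and between(x.id, n.id, n.successor.id):
            n.successor = x
        notify(n.successor, n)          # tell our successor n may be its predecessor

    def notify(s, n):
        # s adopts n as its predecessor if n falls between the old predecessor and s.
        if s.predecessor is None or between(n.id, s.predecessor.id, s.id):
            s.predecessor = n
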
65
PlanetP
  • Flooding the index
  • Join: on startup, client contacts a node it knows
    about; starts gossiping its node id
  • Publish: flood local index via gossip (random
    exchange)
  • Search: search local index, contact relevant
    peers with query
  • Fetch: downloads controlled by content ranking
    algorithm

66
PlanetP Introduction
  • 1st generation of P2P applications based on
    ad-hoc solutions
  • File sharing (Kazaa, Gnutella, etc.), spare-cycle
    usage (SETI@Home)
  • More recently, many projects are focusing on
    building infrastructure for large-scale key-based
    object location (DHTs)
  • Chord, Tapestry and others
  • Used to build global file systems (Farsite,
    Oceanstore)
  • What about content-based location?

67
Goals and Challenges
  • Provide content addressing and ranking in P2P
  • Similar to Google/ search engines
  • Ranking critical to navigate terabytes of data
  • Challenges
  • Resources are divided among large set of
    heterogeneous peers
  • No central management and administration
  • Uncontrolled peer behavior
  • Gathering accurate global information is too
    expensive

68
The PlanetP Infrastructure
  • Compact global index of shared information
  • Supports resource discovery and location
  • Extremely compact to minimize global storage
    requirement
  • Kept loosely synchronized and globally replicated
  • Epidemic based communication layer
  • Provides efficient and reliable communication
    despite unpredictable peer behaviors
  • Supports peer discovery (membership), group
    communication, and update propagation
  • Distributed information ranking algorithm
  • Locate highly relevant information in large
    shared document collections
  • Based on TFxIDF, a state-of-the-art ranking
    technique
  • Adapted to work with only partial information

69
Using PlanetP
  • Services provided by PlanetP
  • Content addressing and ranking
  • Resource discovery for adaptive applications
  • Group membership management
  • Close collaboration
  • Publish/Subscribe information propagation
  • Decoupled communication and timely propagation
  • Group communication
  • Simplify development of distributed apps.

70
Global Information Index
  • Each node maintains an index of its content
  • Summarize the set of terms in its index using a
    Bloom filter
  • The global index is the set of all summaries
  • Term to peer mappings
  • List of online peers
  • Summaries are propagated and kept synchronized
    using gossiping

Gossiping
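A minimal Bloom-filter summary in the spirit of the slide; the filter size
and hash construction here are arbitrary choices, not PlanetP's.

    import hashlib

    class BloomSummary:
        # Compact, lossy summary of the set of terms in one peer's local index.
        def __init__(self, size_bits=8192, num_hashes=4):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, term):
            for i in range(self.k):
                digest = hashlib.sha1(f"{i}:{term}".encode()).digest()
                yield int.from_bytes(digest[:4], "big") % self.size

        def add(self, term):
            for p in self._positions(term):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, term):
            # False positives are possible, false negatives are not -- good
            # enough to decide which peers are worth contacting for a term.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(term))
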
71
Epidemic Comm. in P2P
  • Nodes push and pull randomly from each other
  • Unstructured communication -> resilient to
    failures
  • Predictable convergence time
  • Novel combination of previously known techniques
  • Rumoring, anti-entropy, and partial anti-entropy
  • Introduce partial anti-entropy to reduce variance
    in propagation time for dynamic communities
  • Batch updates into communication rounds for
    efficiency
  • Dynamic slow-down in absence of updates to save
    bandwidth

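A toy push-pull round, assuming each peer object carries a `summaries` set
(e.g. of (peer id, Bloom filter) pairs); rumoring and the anti-entropy
variants named above are richer than this, but the random pairwise exchange
is what drives convergence.

    import random

    def gossip_round(peers):
        # Each peer exchanges and merges summaries with one random partner;
        # an update reaches the whole community in roughly O(log N) rounds.
        for p in peers:
            partner = random.choice([q for q in peers if q is not p])
            merged = p.summaries | partner.summaries
            p.summaries, partner.summaries = merged, set(merged)
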
72
Content Search in PlanetP
73
Results Ranking
  • The Vector Space model
  • Documents and queries are represented as
    k-dimensional vectors
  • Each dimension represents the relevance or weight
    of the word for the document
  • The angle between a query and a document
    indicates its similarity
  • Does not require links between documents
  • Weight assignment (TFxIDF)
  • Use Term Frequency (TF) to weight terms for
    documents
  • Use Inverse Document Frequency (IDF) to weight
    terms for query
  • Intuition
  • TF indicates how relevant a document is to a
    particular concept
  • IDF gives more weight to terms that are good
    discriminators between documents

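In symbols (one standard vector-space/TFxIDF formulation; the notation is
not taken from the slide), with N documents in the collection and n_t of
them containing term t:

    \mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}
        {\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert},
    \qquad
    w_{t,d} = \mathrm{TF}_{t,d} \times \mathrm{IDF}_t,
    \qquad
    \mathrm{IDF}_t = \log \frac{N}{n_t}
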
74
Using TFxIDF in P2P
  • Unfortunately IDF is not suited for P2P
  • Requires term-to-document mappings
  • Requires a frequency count for every term in the
    shared collection
  • Instead, use a two-phase approximation algorithm
  • Replace IDF with IPF (Inverse Peer Frequency)
  • IPF(t) = f(no. of peers / no. of peers with
    documents containing term t)
  • Individuals can compute a consistent global
    ranking of peers and documents without knowing
    the global frequency count of terms
  • Node ranking function

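A sketch of phase one of the approximation, assuming IPF(t) = log(1 + N/N_t)
as the function f above (a reasonable choice; the exact form used by PlanetP
may differ) and reusing the BloomSummary sketch from the global-index slide:

    import math

    def ipf(term, summaries):
        # Inverse Peer Frequency: computable from the replicated Bloom-filter
        # index alone, with no global term-to-document counts.
        n_peers = len(summaries)
        n_with = sum(1 for s in summaries.values() if s.might_contain(term))
        return math.log(1 + n_peers / max(n_with, 1))

    def rank_peers(query_terms, summaries):
        # Score each peer by the IPF weight of the query terms its summary
        # claims to contain; only the top-ranked peers get the full query.
        weights = {t: ipf(t, summaries) for t in query_terms}
        scores = {peer: sum(w for t, w in weights.items() if s.might_contain(t))
                  for peer, s in summaries.items()}
        return sorted(scores, key=scores.get, reverse=True)
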
75
Pruning Searches
  • Centralized search engines have index for entire
    collection
  • Can rank entire set of documents for each query
  • In a P2P community, we do not want to contact
    peers that have only marginally relevant
    documents
  • Use adaptive heuristic to limit forwarding of
    query in 2nd-phase to only a subset of most
    highly ranked peers

76
Evaluation
  • Answer the following questions
  • What is the efficacy of our distributed ranking
    algorithm?
  • What is the storage cost for the globally
    replicated index?
  • How well does gossiping work in P2P communities?
  • Evaluation methodology
  • Use a running prototype to validate and collect
    micro benchmarks (tested with up to 200 nodes)
  • Use simulation to predict performance on big
    communities
  • We model peer behavior based on previous work and
    our own measurements from a local P2P community
    of 4000 users
  • Will show sampling of results from paper

77
Ranking Evaluation I
  • We use the AP89 collection from TREC
  • 84678 documents, 129603 words, 97 queries, 266MB
  • Each collection comes with a set of queries and
    relevance judgments
  • We measure recall (R) and precision (P)

78
Ranking Evaluation II
  • Results intersection is 70% at low recall and
    gets to 100% as recall increases
  • To get 10 documents, PlanetP contacted 20 peers
    out of 160 candidates

79
Global Index Space Efficiency
  • TREC collection (pure text)
  • Simulate a community of 5000 nodes
  • Distribute documents uniformly
  • 944,651 documents taking up 3GB
  • 36MB of RAM are needed to store the global index
  • This is 1% of the total collection size
  • MP3 collection (audio tags)
  • Using previous result but based on Gnutella
    measurements
  • 3,000,000 MP3 files taking up 14TB
  • 36MB of RAM are needed to store the global index
  • This is 0.0002% of the total collection size

80
Data Propagation
Arrival and departure experiment (LAN)
Propagation speed experiment (DSL)
81
PlanetP Summary
  • Explored the design of infrastructural support
    for a rich set of P2P applications
  • Membership, content addressing and ranking
  • Scales well to thousands of peers
  • Extremely tolerant to unpredictable dynamic peer
    behaviors
  • Gossiping with partial anti-entropy is reliable
  • Information always propagates everywhere
  • Propagation time has small variance
  • Distributed approximation of TFxIDF
  • Within 11% of centralized implementation
  • Never collect all needed information in one place
  • Global index on average is only 1% of data
    collection
  • Synchronization of global index only requires 50
    B/sec