Title: Mapping the Gnutella Network


1
Mapping the Gnutella Network
  • Presented By
  • Tony Young
  • M.Math Candidate
  • October 7th, 2004

2
Outline
  • Introduction
  • Gnutella in Depth
  • The Crawler
  • Analysis of Network
  • Summary and Improvements
  • Paper Review

4
Introduction
  • Peer-to-peer systems have recently exploded onto
    the internet scene
  • Two main contributing factors:
  • Low cost and high availability of resources
    (computing and storage)
  • Increased network connectivity (proliferation of
    always-on connections)

5
Introduction
  • Peer systems build a virtual topology (overlay)
    with its own routing mechanisms
  • The topology of the overlay and its routing
    protocols directly affect:
  • Performance: number of physical hops needed to
    send a message through the virtual overlay
  • Reliability: will a message actually reach the
    other end?
  • Scalability: can nodes be added while keeping
    performance good?
  • Anonymity: can the identity of nodes in the
    network be protected?

6
Introduction
  • Gnutella is studied in depth and analysis is
    performed to determine how the overlay affects
    the four characteristics previously mentioned
  • Started by capturing the network topology and
    behaviour
  • Performed a macroscopic analysis of the network
    to evaluate costs and benefits
  • Investigated possible improvements

7
Introduction
  • Two questions drive the analysis:
  • What is the connectivity structure of Gnutella?
  • How well does the Gnutella overlay map to the
    actual network topology?

8
Introduction
  • Connectivity Structure
  • Many networks, including natural networks, have
    a few well-connected nodes and many poorly
    connected nodes
  • I.e. power law networks
  • We will see that Gnutella is not a pure power law
    network, but still has good fault tolerance and
    is less vulnerable to DoS attacks

9
Introduction
  • Overlay Topology
  • Important for ISPs: overlays that don't map
    closely to the physical topology add stress on
    the infrastructure and cost ISPs more money
  • Scalability is directly linked to efficient use
    of network resources

10
Outline
  • Introduction
  • Gnutella in Depth
  • The Crawler
  • Analysis of Network
  • Summary and Improvements

11
Gnutella in Depth
  • Gnutella is an open protocol
  • It is decentralized and unstructured
  • Allows group membership and searching of
    available files for download
  • Gnutella should operate in a dynamic environment
    where hosts can join/leave at any time
  • Gnutella should experience good performance and
    scalability
  • External attacks should not cause data loss or
    performance degradation
  • Users seeking or providing unpopular material
    should stay anonymous

12
Gnutella in Depth
  • Gnutella nodes are called servents
    (SERVer-cliENTS)
  • Provide a client-side interface to allow
    searching of file base
  • Provide server-side storage, routing and response
    to network messages and requests

13
Gnutella in Depth
  • To connect, a node contacts an always-on host
    (e.g. gnutella.com) and sends a PING
  • That host replies with a PONG and forwards the
    PING to other nodes in the network, which reply
    with PONG messages and forward the PING in turn
  • The PING stops after TTL hops

14
Gnutella in Depth
  • To find files, users submit QUERY messages to
    other nodes
  • Messages are broadcast to all neighbours who
    forward them on to other neighbours, etc. for TTL
    hops
  • QUERY RESPONSE messages are returned to the
    querying node

15
Gnutella in Depth
  • To download a file, nodes send GET and PUSH
    messages to individual hosts holding a file
  • I.e. transfer requests and transfers are routed
    directly between communicating hosts, and not
    back-propagated

16
Gnutella in Depth
  • Messaging protocol has three important features:
  • TTL and hops-passed fields are attached to each
    message
  • Randomly generated message ID is attached to each
    message
  • Each node keeps track of recently routed messages
    to prevent re-broadcasting and to implement
    back-propagation
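
The three features above can be sketched as a per-message routing decision. This is a minimal illustration, not the Gnutella wire format: the `Message` and `Servent` classes, their field names, and the rule "drop when the TTL would reach zero" are assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    kind: str   # e.g. "PING" or "QUERY"
    ttl: int    # hops the message may still travel
    hops: int = 0  # hops travelled so far
    # Random ID; hypothetical stand-in for the protocol's message ID.
    msg_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Servent:
    def __init__(self, name):
        self.name = name
        # msg_id -> neighbour the message first arrived from.
        # This table serves both duplicate suppression and back-propagation.
        self.seen = {}

    def receive(self, msg, from_neighbour):
        """Return the copy to forward, or None if the message is dropped."""
        if msg.msg_id in self.seen:
            return None  # already routed: do not re-broadcast
        self.seen[msg.msg_id] = from_neighbour
        if msg.ttl <= 1:
            return None  # TTL exhausted at this servent
        return Message(msg.kind, msg.ttl - 1, msg.hops + 1, msg.msg_id)
```

A reply that carries the same `msg_id` can be sent back to `seen[msg_id]`, which is how the reverse path is recovered hop by hop.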

17
Gnutella in Depth
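  • PING message announces the sending host and
    probes for other hosts; in the Gnutella 0.4
    specification it carries no payload of its own
  • PONG message contains the address and port of
    the host that received the PING, plus its number
    of files and the size of its data store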

18
Gnutella in Depth
  • PING messages propagate until TTL has expired
  • Hop count is incremented at each servent
    receiving the PING
  • Message propagates until hop count = TTL
  • PONG messages are back-propagated (i.e. sent on
    the reverse path that the original message
    followed) to the host initiating the PING
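
The flooding and back-propagation described above can be simulated on a small in-memory graph. This is a sketch under simplifying assumptions: every hop takes the same time (so a breadth-first search models the flood), and the function name `ping_flood` and the adjacency-dict graph are hypothetical.

```python
from collections import deque


def ping_flood(graph, origin, ttl):
    """Flood a PING from `origin`.

    Returns {responder: reverse path its PONG follows back to origin}.
    """
    came_from = {origin: None}  # node -> neighbour the PING first came from
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if hops[node] == ttl:
            continue  # hop count has reached the TTL: stop forwarding
        for neighbour in graph[node]:
            if neighbour not in came_from:  # duplicates are dropped
                came_from[neighbour] = node
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)

    pongs = {}
    for node in came_from:
        if node == origin:
            continue
        path = [node]
        while path[-1] != origin:
            path.append(came_from[path[-1]])  # walk the reverse path
        pongs[node] = path
    return pongs


graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(ping_flood(graph, "A", ttl=2))
```

With a TTL of 2 the PING from A never reaches E, so no PONG from E appears in the result.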

19
Gnutella in Depth
  • QUERY messages are sent the same way as a PING
    message
  • Nodes check the search string requested against
    the names of their locally stored files
  • QUERY RESPONSE messages are back-propagated to
    the querying node and include information
    necessary to download the file
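
A servent's local check of a QUERY can be sketched as below. The matching rule (every search word must appear in the file name, ignoring case) is an assumption for illustration; the slides do not say how servents match. The function name, the returned dictionary layout, and the example address and port are likewise hypothetical.

```python
def build_query_hit(search, local_files, host, port):
    """Return a QUERY RESPONSE-like dict, or None when nothing matches."""
    words = search.lower().split()
    if not words:
        return None  # an empty search string matches nothing here
    matches = [name for name in local_files
               if all(word in name.lower() for word in words)]
    if not matches:
        return None  # nothing to report: no response is sent back
    # Enough information for the querying node to contact this host directly.
    return {"host": host, "port": port, "files": matches}


files = ["Linux_Kernel_Guide.pdf", "holiday.jpg"]
print(build_query_hit("linux guide", files, "10.0.0.5", 6346))
```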

20
Outline
  • Introduction
  • Gnutella in Depth
  • The Crawler
  • Analysis of Network
  • Summary and Improvements
  • Paper Review

21
The Crawler
  • In order to conduct the network tests, a crawler
    was developed to gather information about the
    virtual topology
  • Crawler starts with a list of active nodes and
    sends a PING message to each of them
  • PONG messages are received and the IP, port,
    number of stored files and size of archive are
    stored in a table
  • The PING propagates to other nodes and their
    PONGs back-propagate to the crawler

22
The Crawler
  • A sequential version of the crawler was initially
    developed
  • I.e. send a PING with an empirically determined
    optimal TTL to a set of nodes, then resend to the
    nodes where the PING stops, etc.
  • Proved to be very slow: 50 hours to collect data
    from a 4 000 node network
  • Slowness means two things:
  • Not scalable: will get slower as more nodes are
    added
  • Does not give an accurate network snapshot: the
    network changes drastically over 50 hours!

23
The Crawler
  • A distributed crawler was developed next
  • Client-Server architecture
  • Server maintains node list and creates a network
    graph
  • Clients receive a list of nodes to contact and
    discover the neighbours of those nodes
  • Decided to use only 50 clients at once
  • Reduces invasiveness of search and consumption of
    network resources
  • Reduced crawling time to a couple of hours for a
    large initial list and a network of 30 000 nodes

24
The Crawler
  • Network membership is defined as follows
  • A node is a member of the network if the crawler
    is able to connect to it
  • A node might be excluded from network membership
    if it was reported as active by a server or other
    node, but the crawler could not contact it
  • This might happen if nodes go offline before the
    crawler can contact them
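
The client-server crawl and the membership rule above can be sketched with a thread pool standing in for the crawler clients. This is an illustration only: `probe` is a hypothetical callable that returns a node's neighbours, or `None` when the node cannot be contacted, and the batching strategy is an assumption, not the authors' design.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(seed_nodes, probe, max_clients=50):
    """Return {node: set of neighbours} for every node that could be contacted."""
    topology = {}
    frontier = list(dict.fromkeys(seed_nodes))  # de-duplicate, keep order
    seen = set(frontier)
    with ThreadPoolExecutor(max_workers=max_clients) as pool:
        while frontier:
            # "Clients" probe the current batch of nodes in parallel.
            results = list(pool.map(probe, frontier))
            next_frontier = []
            for node, neighbours in zip(frontier, results):
                if neighbours is None:
                    continue  # reported but unreachable: not a member
                topology[node] = set(neighbours)
                for neighbour in neighbours:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        next_frontier.append(neighbour)
            frontier = next_frontier
    return topology


# Toy network: "d" is reported by "c" but has gone offline.
network = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": None}
print(crawl(["a"], network.get))
```

Node "d" appears in the neighbour set of "c" but gets no entry of its own, which mirrors the membership definition on this slide.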

25
Outline
  • Introduction
  • Gnutella in Depth
  • The Crawler
  • Analysis of Network
  • Growth Trends
  • Traffic Estimates
  • Connectivity and Reliability
  • Overlay vs. Topology
  • Summary and Improvements
  • Paper Review

26
Analysis of Network
  • Data was collected over a 6 month period
  • Data shows:
  • Overhead traffic is decreasing
  • Traffic volume is a significant barrier to growth

27
Growth Trends
  • Size of network is growing rapidly
  • Largest connected component in November 2000 had
    2 063 nodes
  • Largest connected component in May 2001 had
    48 195 nodes!
  • The largest connected component has grown
    roughly 23-fold (48 195 / 2 063)!

28
Growth Trends
  • (Figure slide; no text in the transcript)

29
Growth Trends
  • Despite the explosive growth, most nodes are not
    connected long
  • Successive crawls of the network found:
  • 40% of nodes leave the network in less than 4
    hours
  • 25% of nodes are alive for more than 24 hours
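
Turnover figures like these can be derived by comparing the node sets of two successive crawls. The sketch below assumes each crawl yields a set of node identifiers; the function name is hypothetical, and a node that leaves and rejoins between the two crawls is counted as still present.

```python
def fraction_departed(earlier_crawl, later_crawl):
    """Fraction of nodes seen in the earlier crawl that are absent later."""
    earlier = set(earlier_crawl)
    if not earlier:
        raise ValueError("earlier crawl is empty")
    return len(earlier - set(later_crawl)) / len(earlier)


crawl_t0 = {"a", "b", "c", "d", "e"}
crawl_t4h = {"a", "b", "c", "x"}  # "d" and "e" left, "x" joined
print(fraction_departed(crawl_t0, crawl_t4h))
```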

30
Traffic Estimates
  • A modified version of the crawler recorded
    traffic generated across one randomly chosen link
  • 36% of total traffic (in bytes) is user-generated
    QUERY messages
  • 55% is group membership (PING/PONG) messages
  • 9% is non-standard or malformed messages
  • N.B. File transfer traffic is excluded

31
Traffic Estimates
  • After June 2001 (when a new Gnutella
    implementation was released):
  • 92% of total traffic (in bytes) was QUERY
    messages
  • 8% was group membership (PING/PONG) messages
  • N.B. File transfer traffic is excluded

32
Traffic Estimates
  • 95% of all nodes are reachable within 7 hops
  • Thus, each message typically uses a TTL of 7
  • Most links are expected to support similar
    amounts of traffic for these reasons
  • As verified empirically, the total Gnutella
    generated traffic is proportional to the number
    of connections in the network
  • However, the average number of connections per
    node stays relatively constant as the network
    grows
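
A reachability figure of this kind can be checked against a crawled topology with a hop-limited breadth-first search. This sketch assumes the graph is an adjacency dict that lists every node as a key; the function name is hypothetical.

```python
from collections import deque


def fraction_within(graph, source, max_hops=7):
    """Fraction of the other nodes reachable from `source` in <= max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    others = len(graph) - 1
    return (len(dist) - 1) / others if others else 1.0
```

Averaging this value over many source nodes gives a network-wide estimate to compare with the chosen TTL.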

33
Traffic Estimates
  • (Figure slide; no text in the transcript)

34
Traffic Estimates
  • (Figure slide; no text in the transcript)

35
Traffic Estimates
  • The total traffic estimate for the Gnutella
    network is about 1 Gbps
  • I.e. 170 000 connections for a 50 000 node
    network times 6 kbps per connection
  • This is approximately 330 TB/month!
  • Excluding file transfers!
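
The arithmetic behind these estimates can be reproduced directly from the figures on the slide. The 30-day month and decimal units (1 kbps = 1 000 bit/s, 1 TB = 10^12 bytes) are assumptions.

```python
connections = 170_000        # connections in a 50 000 node network
kbps_per_connection = 6      # measured traffic per connection

total_bps = connections * kbps_per_connection * 1_000
seconds_per_month = 30 * 24 * 3600  # assumes a 30-day month
terabytes_per_month = total_bps / 8 * seconds_per_month / 1e12

print(f"{total_bps / 1e9:.2f} Gbps")            # total rate
print(f"{terabytes_per_month:.0f} TB/month")    # monthly volume
```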

36
Traffic Estimates
  • This total is 1.7% of the total internet traffic
    in US backbones in December 2000
  • This volume of traffic is believed to be an
    obstacle to further growth
  • The underlying network topology must be used more
    efficiently to allow scaling and wider deployment

37
Connectivity and Reliability
  • Note: nodes decide locally
  • How many connections to support
  • When to add or drop a connection
  • Recent research shows that many natural systems
    organize themselves into power law networks
  • I.e. networks where a few nodes are well
    connected and most nodes have very few connections

38
Connectivity and Reliability
  • Power law networks
  • Number of nodes with L links (connections) is
    proportional to L^-k, where k is system-dependent
  • Resilient to losing many poorly connected nodes
  • Falls apart quickly if only a few well connected
    nodes are lost
  • Extremely robust to random failures, but
    vulnerable to targeted attacks
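
The contrast between random failures and targeted attacks can be explored by removing nodes from a topology and measuring the largest connected component that remains. This is a sketch of that experiment, not the analysis from the paper; the function names and the adjacency-dict representation are assumptions.

```python
def largest_component(graph, removed=frozenset()):
    """Size of the largest connected component once `removed` nodes are gone."""
    seen = set(removed)
    best = 0
    for start in graph:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def after_targeted_attack(graph, count):
    """Remove the `count` best-connected nodes, then measure what is left."""
    hubs = sorted(graph, key=lambda n: len(graph[n]), reverse=True)[:count]
    return largest_component(graph, set(hubs))


# A star: one hub with five leaves, an extreme case of a hub-dependent network.
star = {"hub": ["a", "b", "c", "d", "e"],
        "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"], "e": ["hub"]}
print(largest_component(star, {"a"}))   # losing a leaf barely matters
print(after_targeted_attack(star, 1))   # losing the hub shatters the network
```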

39
Connectivity and Reliability
  • Power law networks appear as a straight line on a
    log-log plot
  • Data for December 2000 shows that early Gnutella
    networks were power law
  • Data for March 2001 shows that later Gnutella
    networks are a mixture
  • Below 10 links, the number of nodes is roughly
    constant for each link count
  • Above 10 links, nodes follow a power law structure
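
One rough way to test for a power law is to fit a straight line to the degree histogram in log-log space and read the exponent k from its slope. The sketch below uses ordinary least squares, which is a crude estimator and not necessarily what the authors used; the function names are hypothetical.

```python
import math
from collections import Counter


def degree_histogram(graph):
    """Map number of links -> number of nodes with that many links."""
    return Counter(len(neighbours) for neighbours in graph.values())


def fit_power_law_exponent(histogram):
    """Estimate k in count ~ L**-k from the slope of the log-log points."""
    points = [(math.log(links), math.log(count))
              for links, count in histogram.items()
              if links > 0 and count > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # slope is -k


# Synthetic histogram that follows count = 1024 * L**-2 exactly.
print(fit_power_law_exponent({1: 1024, 2: 256, 4: 64, 8: 16}))
```

Fitting only the points above 10 links, and separately those below, would show the mixed behaviour described for the March 2001 data.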

40
Connectivity and Reliability
  • (Figure slide; no text in the transcript)

41
Connectivity and Reliability
  • (Figure slide; no text in the transcript)

42
Connectivity and Reliability
  • Why did the distribution change?
  • Two possible reasons:
  • About 20% of Gnutella users have modem
    connections - DSL and up can support more
    connections
  • Gnutella users run as many connections as their
    network can support - perception is that more
    connections mean better query results

43
Connectivity and Reliability
  • Does the change in distribution affect
    reliability? Yes!
  • Preserves resilience to random failures
  • Makes network less dependent on well connected
    nodes and hence less prone to DoS attacks

44
Overlay vs. Topology
  • Peer systems change the way bandwidth is used on
    the internet
  • Servers are at the edge of the network now, and
    peers are constantly downloading
  • Most ISPs use flat-rate billing
  • Peer systems may break this model!

45
Overlay vs. Topology
  • Due to the amount of traffic peer systems
    generate, efficient use of resources is important
  • The greater the mismatch between the overlay and
    the physical network topology, the more messages
    need to be transmitted to route information from
    A to B
  • This means more stress on the network resources

46
Overlay vs. Topology
  • Communication from A to all other nodes requires
    one message over the D - E link

47
Overlay vs. Topology
  • Communication from A to all other nodes requires
    six messages over the D - E link

48
Overlay vs. Topology
  • How well does Gnutella map to the topology?
  • Assume that domain names roughly reflect the
    hierarchy of the internet
  • Check how well generated traffic maps to the
    cluster of domain names found by the crawler

49
Overlay vs. Topology
  • After analysis of 10 overlays, it was found that
    Gnutella nodes often connect to peers outside of
    their respective domains
  • Thus, it appears that Gnutella does not make
    efficient use of the underlying topology
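
A simple version of this check counts how many overlay links stay inside one domain. The sketch below treats the last two labels of a host name as its domain, which is a crude assumption (it mishandles suffixes such as co.uk), and it is not the clustering method from the paper. The host names are made up.

```python
def domain_of(hostname, levels=2):
    """Last `levels` labels of a host name, e.g. 'pc1.cs.example.edu' -> 'example.edu'."""
    return ".".join(hostname.lower().split(".")[-levels:])


def intra_domain_fraction(links, levels=2):
    """Fraction of overlay links whose two endpoints share a domain."""
    links = list(links)
    if not links:
        raise ValueError("no links to analyse")
    same = sum(1 for a, b in links
               if domain_of(a, levels) == domain_of(b, levels))
    return same / len(links)


overlay_links = [
    ("pc1.cs.example.edu", "pc2.example.edu"),
    ("pc1.cs.example.edu", "dsl-12.example.net"),
    ("a.example.net", "b.example.org"),
    ("x.example.com", "y.example.com"),
]
print(intra_domain_fraction(overlay_links))
```

A low fraction means most overlay links cross domain boundaries, which is the mismatch reported on this slide.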

50
Outline
  • Introduction
  • Gnutella in Depth
  • The Crawler
  • Analysis of Network
  • Summary and Improvements
  • Paper Review

51
Summary and Improvements
  • Gnutella has a multimodal connectivity
    distribution that is partially constant and
    partially power law
  • Network is resilient to random failures
  • Network is harder to attack by malicious parties,
    but not immune to DoS attacks
  • Gnutella makes little effort to ward off
    attackers
  • E.g. topology, connectivity and traffic
    information is easy to obtain and can be used to
    plan attacks

52
Summary and Improvements
  • Gnutella's traffic volume is a significant
    fraction of all internet traffic
  • Makes the future growth of the network reliant on
    efficient use of the topology
  • Gnutella's overlay does not match the network
    topology very well
  • This substantially increases the number of
    messages and the amount of network traffic
    generated

53
Summary and Improvements
  • Necessary improvements
  • Make efforts to hide overlay and connectivity
    information (encryption?)
  • Match overlay more closely with topology
  • Limits to growth must be addressed first, and
    quickly, given the rate at which Gnutella is
    growing

54
Summary and Improvements
  • Suggested Improvements
  • Exploit locality of files and query distribution
    (e.g. caching and localized queries)
  • Replace query flooding strategy with something
    more efficient (e.g. superpeer routing and group
    communication)

55
Outline
  • Introduction
  • Gnutella in Depth
  • The Crawler
  • Analysis of Network
  • Summary and Improvements
  • Paper Review

56
Paper Review
  • Organization
  • Some discussions of the Gnutella architecture and
    protocols were scattered throughout the paper
  • Should have combined everything into a more
    logical order inside the protocol section
  • Writing Style
  • Generally very good. Some missing words and poor
    grammar

57
Paper Review
  • Novel Ideas
  • Presented a qualitative and quantitative analysis
    of the Gnutella network, and some important
    points for P2P as a whole
  • Content
  • Some backing information was missing
  • Some claims were made without supporting
    evidence, or just referring the reader to another
    paper

58
Questions?