Title: Measurements of Peer-to-Peer Systems
1Measurements of Peer-to-Peer Systems
- Pradnya Karbhari
- Nov 25th, 2003
- CS 8803 Network Measurements Seminar
2Introduction to Peer-to-Peer (P2P) systems
- End-systems (or peers) can act as both clients
and servers of data, hence the system is
scalable and reliable
- Peer participation is voluntary and membership is
dynamic, hence the topology keeps changing
- Most popularly used for file sharing, hence
peer-to-peer systems have become synonymous with
peer-to-peer file-sharing networks
3Classification of P2P systems
- P2P computation (e.g. SETI@home)
- P2P communication (instant messaging)
- P2P file-sharing networks
- Centralized (e.g. Napster)
- Decentralized
- Structured (e.g. Chord, CAN, Pastry, Tapestry)
- Unstructured (e.g. Gnutella, Kazaa, Freenet,
eDonkey, eMule, Direct Connect)
4Popularity of unstructured decentralized P2P
networks
- Gnutella host count, maintained by Limewire
(http://www.limewire.com)
- good scope for measurement studies because
- deployed and widely used
- use a lot of bandwidth during data transfer,
hence a concern for network operators
- quite a few measurement studies have been done on
these systems, some of which we will discuss in
this seminar
5Outline
- Characterization of users of P2P systems
- Saroiu et al., A Measurement Study of
Peer-to-Peer File Sharing Systems, MMCN 2002.
- Effect of P2P traffic on the underlying network
- Sen et al., Analyzing peer-to-peer traffic
across large networks, IMW 2002.
- Peer-to-Peer Topologies
- Ripeanu et al., Mapping the Gnutella Network:
Properties of Large-Scale Peer-to-Peer Systems
and Implications for System Design, IEEE
Internet Computing, 2002.
- Searching on the P2P network
- Sripanidkulchai, The popularity of Gnutella
queries and its implications on scalability,
2001.
- Deciphering proprietary P2P systems (like Kazaa)
- Leibowitz et al., Deconstructing the Kazaa
Network, WIAPP 2003.
6Gnutella protocol overview
- Connecting to the Gnutella network
- bootstrap using the GWebCache system and a locally
cached hostlist
- Ping/Pong messages are exchanged with potential
neighbors
- Searching on the network
- Query messages are flooded on the network
- QueryHit messages are received (back-propagated
along the Query path) from peers having the requested
content
- Downloading the content
- peers download files directly from peers having
the requested content
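The search step above can be illustrated with a small simulation. This is a minimal sketch, not the Gnutella wire protocol: the overlay, the peer names and the function `flood_query` are hypothetical, and the default TTL of 7 is an assumption chosen for illustration. It floods a query hop by hop, drops duplicates, and returns the reverse path that a QueryHit would follow back to the origin.

```python
from collections import deque

def flood_query(graph, origin, holders, ttl=7):
    """Flood a query from origin; return {holder: reverse path for its QueryHit}."""
    parent = {origin: None}            # first neighbor each peer heard the query from
    queue = deque([(origin, ttl)])
    hits = {}
    while queue:
        peer, remaining = queue.popleft()
        if peer in holders and peer != origin:
            # The QueryHit is back-propagated along the path the Query took.
            path, node = [], peer
            while node is not None:
                path.append(node)
                node = parent[node]
            hits[peer] = path
        if remaining == 0:
            continue                   # TTL exhausted: do not forward further
        for neighbor in graph[peer]:
            if neighbor not in parent:  # peers drop queries they have already seen
                parent[neighbor] = peer
                queue.append((neighbor, remaining - 1))
    return hits

# Hypothetical five-peer overlay; only peer E holds the requested content.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
hits = flood_query(overlay, "A", holders={"E"}, ttl=3)
print(hits)
```

With a TTL of 2 the query never reaches E, which shows how the TTL bounds the search horizon.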
7Characterization of Users of P2P systems
- S. Saroiu, P. Gummadi and S. Gribble, A
Measurement Study of Peer-to-Peer File Sharing
Systems, MMCN 2002.
- first paper to characterize p2p file sharing
systems
- Goal: to analyze the following user
characteristics
- latency
- lifetime of peers
- bottleneck bandwidth
- number of files shared and downloaded
- degree of cooperation
- methodology: active crawling
- systems studied: Napster and Gnutella
- data collection: May 2001
8Measurement Methodology
- active crawling of the Napster and Gnutella
systems
- Napster: issued queries for popular content, and
then queried the central server for peer information
- Gnutella: used ping/pong messages in the protocol to
get metadata about peers, and then their
neighbors and so on
- parallel measurement for
- peer lifetime: periodic probing of peers obtained
from crawlers
- offline if no response to TCP SYN
- inactive if response to TCP SYN is a TCP RST
- active if it accepts the incoming TCP connection on
that port
- latency: RTT measurements from one host
- bottleneck link bandwidth: active probing using
Sprobe, a tool they developed based on the
packet-pair dispersion technique
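The three peer states and the packet-pair idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' crawler and not Sprobe: `classify_peer` and `packet_pair_bandwidth_bps` are hypothetical names, the timeout value is arbitrary, and the bandwidth estimate uses the textbook packet-pair relation (packet size divided by the spacing the bottleneck link imposes on two back-to-back packets).

```python
import socket

def classify_peer(host, port, timeout=3.0):
    """Classify a peer from a single TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "active"       # handshake completed: the application is listening
    except ConnectionRefusedError:
        return "inactive"         # TCP RST: host is up, application is not running
    except OSError:
        return "offline"          # timeout or unreachable: no answer to the SYN

def packet_pair_bandwidth_bps(packet_bytes, dispersion_seconds):
    """Bottleneck bandwidth estimate from the spacing of two back-to-back packets."""
    if dispersion_seconds <= 0:
        raise ValueError("dispersion must be positive")
    return packet_bytes * 8 / dispersion_seconds

# Two 1500-byte packets arriving 12 ms apart suggest a bottleneck of roughly 1 Mbps.
estimate = packet_pair_bandwidth_bps(1500, 0.012)
print(f"{estimate / 1e6:.2f} Mbps")
```

A real study would repeat both probes many times per peer, since a single connection attempt or a single packet pair is easily distorted by loss and cross traffic.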
9Host Lifetime analysis
- 20% of peers in Napster and Gnutella have an IP-level
uptime of 93% or more
- Napster peers have higher application uptimes
than Gnutella peers
- the best 20% of Napster peers have an uptime of 83%
or more and the best 20% of Gnutella peers have an
uptime of 45% or more
- median session duration is 60 minutes for Napster
and Gnutella
10Latency analysis (Gnutella)
- 20% of peers have a latency of at most 70ms and 20%
have a latency of at least 280ms
- correlation between downstream bottleneck
bandwidth and latency: two clusters, for modems
(20-60Kbps, 100-1000ms) and broadband (1Mbps,
60-300ms)
11Bottleneck Bandwidth Analysis (Gnutella)
- 92% of Gnutella peers have a downstream bottleneck
bandwidth of at least 100Kbps
- 22% of peers have an upstream bottleneck bandwidth of
100Kbps or less
- these peers are unsuitable to serve content
12Downloads, Uploads and Shared Files
- relative number of downloads and uploads varies
significantly across bandwidth classes
- clear client/server behavior of different classes
13Shared files v/s Shared Data (Napster and
Gnutella)
- Strong correlation between number of files shared
and amount of shared data in MB
- slope of both lines is 3.7MB, the size of a
typical MP3 audio file
14Degree of Cooperation (Napster)
- 30% of the peers report their bandwidth as 64Kbps or
less, but actually have significantly higher
bandwidths
- 10% of the peers reporting higher bandwidths
(3Mbps or higher) actually have significantly
lower bandwidth
15Effect of P2P traffic on underlying network
- S. Sen and J. Wang, Analyzing peer-to-peer
traffic across large networks, IMW 2002.
- Goal: to characterize p2p traffic at three
aggregation levels: IP, prefix and AS
- host distribution and host connectivity
- traffic volume and mean bandwidth usage
- traffic patterns over time
- connection duration and on-time
- methodology: passive measurements at routers
(port based)
- systems studied: FastTrack (Kazaa), Gnutella,
Direct Connect
- analysis of flow-level data collected from
multiple border routers across a large tier-1
ISP's backbone
16Measurement Methodology
- flow records from multiple border routers
matching ports
- 6346/6347: Gnutella
- 1214: FastTrack (Kazaa)
- 411/412: Direct Connect
- processed data to eliminate
- private IP addresses
- invalid AS numbers
- final data set contained 800 million flow records
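Port-based classification with address filtering can be sketched as below. This is a minimal illustration, not the authors' pipeline: the flow-record field names and the function `classify_flow` are hypothetical, the port table uses the well-known default ports of the three systems, and the invalid-AS filter is omitted. Note that Python's `is_private` covers more than the RFC 1918 ranges (for example loopback and link-local addresses), which is an assumption about what the cleaning step should drop.

```python
import ipaddress

# Default ports of the three systems studied.
PORT_TO_SYSTEM = {
    1214: "FastTrack",
    6346: "Gnutella", 6347: "Gnutella",
    411: "DirectConnect", 412: "DirectConnect",
}

def classify_flow(flow):
    """Return the P2P system for a flow record, or None if the record is dropped."""
    for key in ("src_ip", "dst_ip"):
        if ipaddress.ip_address(flow[key]).is_private:
            return None                      # eliminate private addresses
    for key in ("src_port", "dst_port"):
        system = PORT_TO_SYSTEM.get(flow[key])
        if system is not None:
            return system                    # either endpoint on a known port
    return None                              # not recognizable as P2P by port

# Hypothetical flow records.
flows = [
    {"src_ip": "12.0.0.1", "dst_ip": "64.10.20.30", "src_port": 40000, "dst_port": 1214},
    {"src_ip": "10.1.2.3", "dst_ip": "64.10.20.30", "src_port": 6346, "dst_port": 40001},
    {"src_ip": "12.0.0.1", "dst_ip": "64.10.20.30", "src_port": 40002, "dst_port": 80},
]
labels = [classify_flow(f) for f in flows]
print(labels)
```

Port matching misses peers that run on non-default ports, so counts obtained this way are a lower bound on the actual P2P traffic.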
17Datasets used for analysis
- FastTrack is most popular in terms of number of
hosts participating and average traffic volume
per day
- rapid growth of P2P traffic is mainly caused by the
increasing number of hosts in the system
- Direct Connect systems have higher traffic volume
per IP address
18Host distribution analysis
- # of IP addresses in FastTrack ranges from 0.5 to
2 million
- ratio of # of IP addresses in
FastTrack:Gnutella:DirectConnect is 150:30:1
- Density of a prefix is the number of unique
active IP addresses belonging to it
- Density of an AS is the number of unique prefixes
belonging to it
- FastTrack hosts are distributed more densely than
Gnutella and Direct Connect hosts (64:16:4)
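The two density definitions above can be computed as in the following sketch. The function names, the prefix list and the prefix-to-AS mapping are hypothetical (the AS numbers come from the range reserved for documentation), and assigning each address to its longest matching prefix is an assumption about how overlapping prefixes would be handled.

```python
import ipaddress
from collections import defaultdict

def prefix_density(active_ips, prefixes):
    """Number of unique active IP addresses per prefix (longest-prefix match)."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    members = defaultdict(set)
    for ip_text in active_ips:
        ip = ipaddress.ip_address(ip_text)
        matching = [n for n in networks if ip in n]
        if matching:
            best = max(matching, key=lambda n: n.prefixlen)
            members[str(best)].add(ip)       # a set removes repeated sightings
    return {str(n): len(members[str(n)]) for n in networks}

def as_density(prefix_to_as):
    """Number of unique prefixes belonging to each AS."""
    per_as = defaultdict(set)
    for prefix, asn in prefix_to_as.items():
        per_as[asn].add(prefix)
    return {asn: len(prefixes) for asn, prefixes in per_as.items()}

# Hypothetical observations: one address is seen twice.
ips = ["12.1.1.1", "12.1.1.2", "12.1.1.1", "12.1.2.9", "64.5.5.5"]
prefixes = ["12.1.0.0/16", "12.1.1.0/24", "64.5.0.0/16"]
densities = prefix_density(ips, prefixes)
as_counts = as_density({"12.1.0.0/16": 64500, "12.1.1.0/24": 64500, "64.5.0.0/16": 64501})
print(densities, as_counts)
```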
19Host connectivity analysis (FastTrack)
- 48% of individual IPs communicate with at most
one IP and 89% with at most 10 IPs
- 75% of prefixes and ASes communicate with at
least 2 prefixes or ASes
- very few hosts have very high connectivity and
most hosts have very low connectivity
20Traffic volume analysis
- CDF of traffic volume per IP/prefix/AS for
FastTrack (one day)
- distribution of P2P upstream traffic volume
across three months
21Mean bandwidth usage (FastTrack and Direct
Connect)
- FastTrack: 33% of IP addresses have a mean downstream
b/w of 56Kbps or less; 50% have a mean upstream b/w of
56Kbps or less
- Direct Connect: 20% of IP addresses have a mean
downstream b/w of 56Kbps or less; 33% have a mean
upstream b/w of 56Kbps or less
22Traffic patterns over time (FastTrack)
- traffic volume transferred every hour among
FastTrack hosts
- number of unique IP addresses, prefixes, ASes
active every hour
- number of active unique IP addresses in each bin
of various sizes
- system is very dynamic: hosts join and leave
frequently
23Connection duration and On-time (FastTrack)
- 50% of the IPs are online for less than one
minute/day
- 60% of IPs, 40% of prefixes and 30% of ASes stay for
less than 10 mins/day
- 65% of the IPs join only once
- AS and prefix level: not very transient
24Peer-to-Peer Topologies
- M. Ripeanu, I. Foster and A. Iamnitchi, Mapping
the Gnutella Network: Properties of Large-Scale
Peer-to-Peer Systems and Implications for System
Design, IEEE Internet Computing Journal, 2002.
- Goal: to discover and analyze the Gnutella
overlay topology and evaluate generated traffic
- methodology: active crawling
- datasets: Nov 2000, March 2001 and May 2001
25Gnutella Network Growth
- number of nodes in the largest connected
component in the Gnutella network
- significantly larger network found during
Memorial Day and Thanksgiving
- 50 times increase within 6 months
26Distribution of node-to-node shortest paths
- more than 95% of node pairs are at most 7 hops away
- longest node-to-node path is 12 hops
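A shortest-path distribution like the one on this slide can be computed from a crawled topology with breadth-first search. The sketch below is an illustration only: the four-node overlay and the function names are hypothetical, and it counts each unordered pair of connected nodes once.

```python
from collections import Counter, deque

def hop_counts_from(graph, source):
    """Breadth-first search: hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def path_length_distribution(graph):
    """Count unordered node pairs by shortest-path length."""
    counts = Counter()
    for source in graph:
        for target, hops in hop_counts_from(graph, source).items():
            if source < target:              # count each pair once
                counts[hops] += 1
    return counts

def fraction_within(counts, max_hops):
    """Fraction of connected pairs that are at most max_hops apart."""
    total = sum(counts.values())
    return sum(c for hops, c in counts.items() if hops <= max_hops) / total

# Hypothetical overlay: a chain A - B - C - D.
overlay = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
counts = path_length_distribution(overlay)
print(dict(counts), max(counts), fraction_within(counts, 2))
```

Running one BFS per node costs O(V * (V + E)), which is workable for a crawl of tens of thousands of nodes but would need sampling for much larger graphs.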
27Average node connectivity
- average number of connections per node remains
constant at 3.4
28Node connectivity distribution
- Nov 2000: Gnutella nodes organize themselves in a
power law
- March 2001: connectivity does not look like a
power law for all nodes; the power law distribution
is preserved for nodes with more than 10 links,
while for nodes with fewer than 10 links the
distribution is almost constant
29Searching on the P2P network
- K. Sripanidkulchai, The popularity of Gnutella
queries and its implications on scalability,
2001, http://www-2.cs.cmu.edu/kunwadee/research/p2p/gnutella.html
- methodology: passive measurements at one or two
peers, made part of the Gnutella network, to log
queries and query messages routed through them
- data sets: Dec 2000, Jan 2001
30Top 20 most popular query types
- 17% of queries contained non-ASCII strings and were
filtered out
- most queries are for artists, adult content and file
extensions (audio)
- some queries are for books, software etc.
31Query popularity distribution
- two distinct distributions of document
popularity, with a break at query rank 100
- most popular documents are equally popular
- less popular documents follow a Zipf-like
distribution, with alpha between 0.63 and 1.24
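For a Zipf-like distribution the frequency of the item at rank r is proportional to r^(-alpha), so alpha is the negated slope of the line on a log-log plot of frequency against rank. The sketch below estimates it by ordinary least squares. It is an illustration on synthetic data, not the paper's fitting procedure: `zipf_alpha` and its `first_rank` parameter are hypothetical, with `first_rank` allowing the tail beyond rank 100 to be fitted using its original ranks.

```python
import math

def zipf_alpha(frequencies, first_rank=1):
    """Least-squares slope of log(frequency) against log(rank), negated."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(first_rank + i) for i in range(len(ranked))]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic query counts that follow frequency = 1000 / rank exactly (alpha = 1).
counts = [1000 / rank for rank in range(1, 201)]
alpha_all = zipf_alpha(counts)
alpha_tail = zipf_alpha(counts[100:], first_rank=101)   # ranks 101..200 only
print(round(alpha_all, 3), round(alpha_tail, 3))
```

On measured data the head (where popularity is roughly flat) and the tail should be fitted separately, since a single line through both would understate alpha for the tail.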
32Deciphering proprietary P2P systems
- N. Leibowitz, M. Ripeanu and A. Wierzbicki,
Deconstructing the Kazaa Network, WIAPP 2003.
- methodology: passive content-based data
collection at a caching server installed at the
border of a large ISP
- L4 switch inspects the first few packets of each TCP
connection to detect Kazaa download traffic
- redirects Kazaa download traffic through the caching
server
- focus on download traffic only, not control
traffic (since it is encrypted)
33Characteristics of Collected Traces
- 38% of all download sessions do not use the standard
Kazaa port (1214)
34File download distribution by bytes
- CDF of byte popularity distribution for the 10% and 1%
most popular files
- 0.8% of all files account for 80% of the
generated traffic
- 0.1% of the most bandwidth-hungry files (top 1%
of all files) generate 50% of the traffic
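Concentration figures of this kind come from ranking files by the bytes they generate and accumulating from the top. A minimal sketch, assuming a hypothetical list of per-file byte totals (the numbers below are made up, not trace data):

```python
import math

def traffic_share_of_top(bytes_per_file, top_fraction):
    """Fraction of all bytes generated by the top_fraction heaviest files."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(bytes_per_file, reverse=True)
    k = max(1, math.ceil(top_fraction * len(ranked)))   # at least one file
    return sum(ranked[:k]) / sum(ranked)

# Made-up per-file byte totals for ten files (arbitrary units, total 1000).
volumes = [500, 200, 100, 50, 50, 40, 30, 20, 5, 5]
share_top_10pct = traffic_share_of_top(volumes, 0.1)
share_top_50pct = traffic_share_of_top(volumes, 0.5)
print(share_top_10pct, share_top_50pct)
```

Evaluating the function over a range of fractions yields the points of the byte-popularity CDF.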
35File size distribution
- note the log-scale on X-axis
- 3 distinct modes
- 100KB for pictures
- 2-5MB for music files
- 700MB for movies
36Quantity and Rate of Distinct Files
- new files seen at different time scales: every
day, hour, minute
- 150,000 distinct files during a 17-day period
- daily graph: the number of new files seen continued
to decrease, but no steady state value (rate of
injection of files into the network) was reached
- hourly graph: time-of-day effect
- per-minute graph: 50 new files seen every minute
on average
37Rate of change of popularity of files
- percentage of files that make it to the list of the N
most popular files: (a) in consecutive intervals
and (b) after T intervals, compared with the first
list
- measurement interval is 24 hours
- 15% of the highly popular files remain popular
throughout the experiment, and the rest are
popular only for short time intervals
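The two comparisons, (a) against the previous interval and (b) against the first interval, can be sketched as follows. The download logs, file ids and function names are hypothetical; each inner list stands for the downloads observed in one 24-hour interval.

```python
from collections import Counter

def top_n(downloads, n):
    """Set of the n most frequently downloaded file ids in one interval."""
    return {file_id for file_id, _ in Counter(downloads).most_common(n)}

def overlap_with_previous(intervals, n):
    """(a) Fraction of each top-n list already present in the previous list."""
    tops = [top_n(d, n) for d in intervals]
    return [len(a & b) / n for a, b in zip(tops, tops[1:])]

def overlap_with_first(intervals, n):
    """(b) Fraction of each later top-n list that was in the first list."""
    tops = [top_n(d, n) for d in intervals]
    return [len(tops[0] & t) / n for t in tops[1:]]

# Three made-up daily logs of downloaded file ids.
days = [
    ["a", "a", "a", "b", "b", "c"],
    ["a", "a", "d", "d", "d", "b"],
    ["d", "d", "d", "e", "e", "a"],
]
consecutive = overlap_with_previous(days, 2)
versus_first = overlap_with_first(days, 2)
print(consecutive, versus_first)
```

In this toy trace half of each top-2 list carries over from one day to the next, while nothing from the first day's list survives to the third day.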
38Open Questions
- Mapping a global snapshot of the entire Gnutella
topology
- Bootstrapping of peers in unstructured
peer-to-peer systems (work in progress)
- More efficient searching on P2P networks: efforts
in this direction include random walks,
bloom-filter based techniques etc.
- End-point privacy/anonymity is absent in most of
these peer-to-peer networks
39References
- Papers covered in the seminar
- S. Saroiu, P. Gummadi and S. Gribble, A
Measurement Study of Peer-to-Peer File Sharing
Systems, MMCN 2002.
- S. Sen and J. Wang, Analyzing peer-to-peer
traffic across large networks, IMW 2002.
- M. Ripeanu, I. Foster and A. Iamnitchi, Mapping the
Gnutella Network: Properties of Large-Scale
Peer-to-Peer Systems and Implications for System
Design, IEEE Internet Computing, 2002.
- K. Sripanidkulchai, The popularity of Gnutella
queries and its implications on scalability,
2001.
- N. Leibowitz, M. Ripeanu and A. Wierzbicki,
Deconstructing the Kazaa Network, WIAPP 2003.
- Papers not covered in the seminar
- J. Chu, K. Labonte and B. Levine, Availability
and Locality Measurements of Peer-to-Peer File
Systems, SPIE, July 2002.
- F. Bustamante and Y. Qiao, Friendships that
last: Peer lifespan and its role in P2P
protocols, WCW 2003.
- R. Bhagwan, S. Savage and G. Voelker,
Understanding Availability, IPTPS 2003.
- Saroiu et al., An Analysis of Internet Content
Delivery Systems, OSDI 2002.
- Markatos et al., Tracing a large-scale
Peer-to-Peer System: An hour in the life of
Gnutella, CCGrid 2002.