Title: Characterizing Files in the Modern Gnutella Network: A Measurement Study
1Characterizing Files in the Modern Gnutella Network: A Measurement Study
- Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
- University of Oregon
SPIE Multimedia Computing and Networking 2006
(MMCN06), 18-19th January 2006 San Jose,
California, USA
2Outline
- Measurement study of modern Gnutella system
- Conduct static, topological and dynamic analysis
- Help to improve design and evaluations of P2P
file-sharing applications
3Previous studies
- Focus on a small population
- Are more than three years old
- Do not examine the dynamics of file characteristics over time or the correlation between the overlay topology and file distribution
4Why Gnutella
- One of the top three P2P systems (eDonkey2K, FastTrack, Gnutella)
- Gnutella has a Browse-Host extension to extract the list of shared files from peers
- One of the most studied P2P systems, so results can be compared and contrasted with previous studies
5Original Gnutella
- A new node joins the system (Node A)
- Node A connects to some node (Node B) found through a pre-existing list, a particular website, IRC, etc.
- Node B sends its list of working nodes to Node A
- Node A connects to the provided nodes until a certain threshold is reached
- During a search, Node A sends requests to its connected nodes, which in turn forward the requests
6Original Gnutella
- Nodes reply to the request directly or indirectly, depending on whether a firewall is present
- Node A downloads file pieces from one or more nodes that replied positively
- Unlike Napster, Gnutella is decentralized and uses flood-based searches
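A minimal sketch of the flood-based search described above, assuming the overlay is an in-memory adjacency map and that forwarding is bounded by a hop limit (`ttl`). The names `flood_search`, `overlay` and `files`, and the sample data, are illustrative and not taken from any Gnutella client.

```python
from collections import deque

def flood_search(overlay, files, start, wanted, ttl=3):
    """Return the nodes holding `wanted` that a hop-limited flood from `start` reaches."""
    seen = {start}                    # nodes that already received the request
    hits = set()
    queue = deque([(start, ttl)])     # (node, remaining hops)
    while queue:
        node, remaining = queue.popleft()
        if node != start and wanted in files.get(node, set()):
            hits.add(node)            # this node would reply to the requester
        if remaining == 0:
            continue                  # hop limit reached, do not forward further
        for neighbor in overlay.get(node, ()):
            if neighbor not in seen:  # each node handles a request only once
                seen.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return hits

# Hypothetical five-node overlay: A-B, A-C, B-D, D-E
overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B", "E"], "E": ["D"]}
files = {"C": {"song.mp3"}, "D": {"song.mp3"}, "E": {"song.mp3"}}
```

With `ttl=2` the request from A reaches C and D but not E, which shows how the hop limit bounds both the cost and the coverage of a flood.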
7Modern Gnutella
- In contrast to the original unstructured overlay topology, most modern Gnutella clients adopt a two-tier overlay structure
- Ultrapeers and leaf peers (majority)
- Legacy peers (do not implement the ultrapeer feature)
8Measurement methodology
- Problems of general crawlers
- Slow, give a distorted view, inflate the population
- Previous studies
- Partial snapshot, periodic probe of a fixed group
- Significance is doubtful
- Goal of this work
- Capture entire population (?)
- Short period
9Measurement methodology
- Topology crawl
- List of neighboring nodes
- Content crawl
- List of available files of each node
- Need more
10Cruiser
- Parallel P2P crawler
- Orders of magnitude faster than previous crawlers (?)
- Master-slave architecture
- Each slave crawls hundreds of peers and the master coordinates multiple slaves
- Increases the degree of concurrency
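A rough sketch of the master-slave idea, not Cruiser's actual implementation: a master loop hands newly discovered peers to a pool of workers. Here threads stand in for the slave machines and a dictionary lookup stands in for contacting a peer; `crawl_peer`, `parallel_crawl` and the `overlay` map are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_peer(overlay, peer):
    """Worker task: 'contact' one peer and return its neighbor list (simulated lookup)."""
    return peer, list(overlay.get(peer, ()))

def parallel_crawl(overlay, seeds, workers=4):
    """Master loop: crawl the overlay wave by wave, each wave processed concurrently."""
    discovered = set(seeds)
    edges = {}
    frontier = list(seeds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            results = pool.map(lambda p: crawl_peer(overlay, p), frontier)
            next_frontier = []
            for peer, neighbors in results:
                edges[peer] = neighbors
                for n in neighbors:
                    if n not in discovered:   # the master de-duplicates peers
                        discovered.add(n)
                        next_frontier.append(n)
            frontier = next_frontier
    return edges

# Hypothetical two-tier overlay: two ultrapeers, each with one leaf
overlay = {"u1": ["u2", "l1"], "u2": ["u1", "l2"], "l1": ["u1"], "l2": ["u2"]}
snapshot = parallel_crawl(overlay, ["u1"])
```

Raising `workers` increases the degree of concurrency, which is the lever the slide points to for shortening the crawl.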
11Cruiser
- Using 6 off-the-shelf 1GHz GNU/Linux boxes, crawl takes 15min 5.5hr 15min 6 hours
- Each content crawl produces a 10GB log file containing file names and content hashes
12Dataset
- Three measurement periods; within each period, snapshots are taken every day
- 6/8/2005-6/18/2005, 8/23/2005-9/9/2005 and 10/11/2005-10/21/2005
- Examine both short and long timescales
13Dataset
14Sources of unreachable nodes
- Firewall
- Severe network congestion
- Peer departed
- Peer does not support the Browse Host protocol
- Ultrapeers depart
- Leaf peers depart or are behind a firewall
- Contact 20% of peers (half a million)
15Problems
- Low-bandwidth TCP connection
- Some crawls do not complete before the timeout threshold because the data is sent at an extremely low rate
- File identity
- File name is not a reliable file identifier, so this work uses the content hash
- Post-processing
- More than 100 million distinct files
- Randomly divide into 7 segments, trim files with fewer than 10 copies in a segment, then combine the trimmed segments back into one
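The post-processing step can be sketched as follows, assuming the crawl log is a sequence of (peer, content hash) records and that it is the records which are split at random. The slide does not say exactly what is divided, so this is one reading, and `trim_rare_files` is a hypothetical helper.

```python
import random
from collections import Counter

def trim_rare_files(records, segments=7, min_copies=10, seed=0):
    """Count copies per content hash, dropping hashes that are rare within a segment.

    records: iterable of (peer_id, content_hash) pairs.
    Returns a Counter mapping each surviving content hash to its retained copies.
    """
    rng = random.Random(seed)
    per_segment = [Counter() for _ in range(segments)]
    for _peer, content_hash in records:
        # Assumption: each record is assigned to a segment uniformly at random.
        per_segment[rng.randrange(segments)][content_hash] += 1
    merged = Counter()
    for counts in per_segment:
        for content_hash, copies in counts.items():
            if copies >= min_copies:          # trim files with too few copies
                merged[content_hash] += copies
    return merged

# Hypothetical log: one widely replicated file and one file with a single copy
records = [(f"peer{i}", "hash-popular") for i in range(1000)] + [("peer-x", "hash-rare")]
counts = trim_rare_files(records)
```

Each segment can be processed within a smaller memory footprint than the full log. The trade-off in this reading is that a file just above the threshold overall may lose the copies that land in segments where it falls below it.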
16Static analysis
- Ratio of free riders
- Degree of resource sharing among cooperative peers
- File popularity distribution
- File type analysis
17Ratio of free riders
- Free riders drop, ratio of ultrapeers is lower,
long-lived peers slightly higher, files not
strongly correlate
18Degree of resources sharing among cooperative
peers
- Distribution of peers sharing x files: power-law distribution
19Degree of resources sharing among cooperative
peers
- Distribution of contributed disk space: power-law distribution
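One quick way to check for power-law behaviour is to fit a straight line to the histogram on log-log axes. The sketch below uses plain least squares, which is a crude estimator and not necessarily the method used in the study; `loglog_slope` and the synthetic data are illustrative.

```python
import math
from collections import Counter

def loglog_slope(values):
    """Least-squares slope of log(number of peers) against log(value)."""
    histogram = Counter(v for v in values if v > 0)   # value -> number of peers
    xs = [math.log(v) for v in histogram]
    ys = [math.log(c) for c in histogram.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic data: the number of peers sharing x files is 64 * x**-2
shared_files = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope = loglog_slope(shared_files)
```

For the synthetic data the fitted slope is -2, the exponent used to build it. Real measurements are noisier, especially in the tail, so the slope should be read as a rough indication only.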
20Degree of resources sharing among cooperative
peers
- Correlation is not as strong as in previous studies
- Discernible line with a slope of 3.7MB/file, which is the typical size of an MP3 audio file
21File popularity distribution
22File type analysis
23File type analysis
         Previous studies        Current study
         % files   % bytes       % files   % bytes
Music    67.2      79.2          67        40
Video    2.1       19.1          6         52.5
24Topological analysis
- Per-file perspective (figures a, b)
- Per-peer perspective (figure c)
25Topological analysis
- Churn (dynamics of peer participation) is the dominant factor
- Depart
- Join
- Leaf peers become ultrapeers
- Rapid change in the overlay topology prevents the formation of topological clustering
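To make the notion of topological clustering concrete, the sketch below measures how similar the file sets of two peers are as a function of their hop distance. The Jaccard index is an assumption chosen for illustration, not necessarily the metric used in the study, and all names and sample data are hypothetical.

```python
from collections import deque

def peers_at_distance(overlay, start, hops):
    """Peers whose shortest-path distance from `start` is exactly `hops`."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue                          # no need to explore beyond `hops`
        for neighbor in overlay.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {p for p, d in distance.items() if d == hops}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_overlap(overlay, files, hops):
    """Average file-set similarity over all ordered peer pairs `hops` apart."""
    scores = [
        jaccard(files.get(p, set()), files.get(q, set()))
        for p in overlay
        for q in peers_at_distance(overlay, p, hops)
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical three-peer chain A - B - C
overlay = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
files = {"A": {"f1", "f2"}, "B": {"f2", "f3"}, "C": {"f1", "f2"}}
```

If files were clustered topologically, `mean_overlap` would be noticeably higher at one hop than at two or three. Similar values at every distance are consistent with files being spread without regard to the topology.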
26Dynamics analysis
- Variations in shared files by individual peers
- Variations in popularity of individual files
- Trends in popularity variations
27Variations in shared files by individual peers
28Variations in popularity of individual files
- Focus on top 100 and top 1000 files
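One simple way to quantify such variation, given here as a sketch rather than the study's method, is the fraction of one snapshot's top-N files that remain in the top N of a later snapshot. Snapshots are assumed to map content hashes to copy counts; `top_n`, `top_n_retained` and the sample counts are hypothetical.

```python
from collections import Counter

def top_n(snapshot, n):
    """Return the n most replicated content hashes in a snapshot."""
    return [h for h, _ in Counter(snapshot).most_common(n)]

def top_n_retained(earlier, later, n):
    """Fraction of the earlier top-n files that are still in the later top-n."""
    before = set(top_n(earlier, n))
    after = set(top_n(later, n))
    return len(before & after) / len(before) if before else 0.0

# Hypothetical copy counts on two different days
day1 = {"a": 50, "b": 40, "c": 30, "d": 5}
day2 = {"a": 45, "b": 10, "c": 35, "d": 38}
retained = top_n_retained(day1, day2, 3)
```

In the example, two of the three files from the first day's top 3 remain in the second day's top 3. Computing the same fraction for N = 100 and N = 1000 over increasing gaps between snapshots gives a picture of how quickly the popular set turns over.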
29Trends in popularity variations
- Track top 10 files across several days (fig a, b)
- Over several months (fig c)
30Conclusion
- Use a parallel crawler to obtain snapshots of peer connectivity and available files
- Conduct three types of analysis
- Understand the distribution, correlation and dynamics of available files
31Summary of findings
- Free riding drops significantly
- The number of shared files and the storage space contributed by individual peers follow a power-law distribution: most peers contribute little disk space (<100MB) while a small number of peers contribute very large space (50-100GB)
- Popularity of individual files follows a Zipf distribution: a small number of files are extremely popular but the majority of files are very unpopular
32Summary of findings
- The most popular file type is MP3 (2/3 of all files, 1/3 of all bytes)
- The popularity of video files and the space they occupy have tripled over the past few years
- Video files number <1/10 of audio files but occupy 25% more bytes
- 93% of bytes or 73% of files are multimedia files
33Summary of findings
- Files are randomly distributed: there is no strong correlation between the available files at peers that are one, two or three hops apart in the overlay topology
- Files shared by individual peers change slowly over a timescale of days; more popular files experience larger variations in popularity