Characterizing Files in the Modern Gnutella Network: A Measurement Study

Shanyu Zhao, Daniel Stutzbach, Reza Rejaie, University of Oregon. SPIE Multimedia Computing and ...

1
Characterizing Files in the Modern Gnutella
Network: A Measurement Study
  • Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
  • University of Oregon

SPIE Multimedia Computing and Networking 2006
(MMCN06), 18-19 January 2006, San Jose,
California, USA
2
Outline
  • Measurement study of the modern Gnutella system
  • Conducts static, topological and dynamic analyses
  • Helps improve the design and evaluation of P2P
    file-sharing applications

3
Previous studies
  • Focused on a small population
  • Are more than three years old
  • Did not examine the dynamics of file
    characteristics over time or the correlation
    between the overlay topology and file distribution

4
Why Gnutella
  • Among the top three P2P systems (eDonkey2K,
    FastTrack, Gnutella)
  • Gnutella has a Browse-Host extension for
    extracting the list of shared files from peers
  • One of the most studied P2P systems, so results can
    be compared and contrasted with previous studies

5
Original Gnutella
  • A new node joins the system (Node A)
  • Node A connects to some node (Node B) found through
    a pre-existing list, a particular website, IRC,
    etc.
  • Node B sends its list of working nodes to Node A
  • Node A connects to the provided nodes until a
    certain threshold is reached
  • During a search, Node A sends requests to its
    connected nodes, which in turn forward the requests

6
Original Gnutella
  • Nodes reply to the request directly or indirectly,
    depending on whether a firewall is present
  • Node A downloads file pieces from one or more
    nodes that answered positively
  • Unlike Napster, Gnutella is decentralized and uses
    flood-based searches
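
The flood-based search described on these two slides can be sketched roughly as follows. This is a minimal illustration, not the Gnutella wire protocol: the in-memory `overlay` and `files` dicts, the function name `flood_search`, and the hop limit `ttl` are all assumptions introduced here for the example.

```python
from collections import deque

def flood_search(overlay, files, start, wanted, ttl):
    """Flood a query from `start`; return the peers that hold `wanted`.

    overlay: dict mapping peer -> list of neighboring peers (hypothetical)
    files:   dict mapping peer -> set of file names it shares (hypothetical)
    ttl:     assumed maximum number of hops the query may travel
    """
    seen = {start}                # peers that already handled this query
    queue = deque([(start, 0)])   # (peer, hops travelled so far)
    hits = set()
    while queue:
        peer, hops = queue.popleft()
        if peer != start and wanted in files.get(peer, set()):
            hits.add(peer)        # this peer would send a reply
        if hops == ttl:
            continue              # hop limit reached: do not forward
        for neighbor in overlay.get(peer, []):
            if neighbor not in seen:   # ignore duplicate copies of the query
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return hits

# Toy overlay, made up for illustration only.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
files = {"B": {"song.mp3"}, "D": {"song.mp3"}, "E": {"song.mp3"}}
one_hop = flood_search(overlay, files, "A", "song.mp3", ttl=1)
three_hops = flood_search(overlay, files, "A", "song.mp3", ttl=3)
```

With the toy overlay, a larger hop limit reaches more of the peers holding the file, at the cost of contacting more nodes.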

7
Modern Gnutella
  • In contrast to the original unstructured overlay
    topology, most modern Gnutella clients adopt a
    two-tier overlay structure
  • Ultrapeers and leaf peers (the majority)
  • Legacy peers (do not implement the ultrapeer
    feature)

8
Measurement methodology
  • Problems of general crawlers: slow, distorted
    snapshots, inflated population
  • Previous studies: partial snapshots or periodic
    probes of a fixed group; their significance is
    doubtful
  • Goal of this work: capture the entire population
    (?) within a short period

9
Measurement methodology
  • Topology crawl: list of neighboring nodes
  • Content crawl: list of available files at each
    node
  • Need more

10
Cruiser
  • Parallel P2P crawler
  • Orders of magnitude faster than previous crawlers
    (?)
  • Master-slave architecture
  • Each slave crawls hundreds of peers, and the master
    coordinates multiple slaves
  • Increases the degree of concurrency
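
The master-slave idea can be illustrated with a small sketch. It is not Cruiser's implementation: worker threads stand in for slave processes on separate machines, the overlay is an in-memory dict instead of live peers, and the names `crawl_peer` and `parallel_crawl` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_peer(overlay, peer):
    """Slave task: contact one peer and return its neighbor list.

    A real crawler would open a network connection here; this sketch
    only reads from an in-memory dict (an assumption for illustration).
    """
    return peer, overlay.get(peer, [])

def parallel_crawl(overlay, seeds, workers=4):
    """Master: hand newly discovered peers to the workers, round by round."""
    discovered = set(seeds)
    frontier = list(seeds)
    edges = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Crawl every peer in the current frontier concurrently.
            results = list(pool.map(lambda p: crawl_peer(overlay, p), frontier))
            frontier = []
            for peer, neighbors in results:
                edges[peer] = neighbors
                for n in neighbors:
                    if n not in discovered:
                        discovered.add(n)
                        frontier.append(n)
    return edges

# Toy overlay, made up for illustration only.
toy_overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
snapshot = parallel_crawl(toy_overlay, ["A"])
```

The shorter the crawl, the less the overlay changes while it is being captured, which is why concurrency matters for snapshot accuracy.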

11
Cruiser
  • Using 6 off-the-shelf 1 GHz GNU/Linux boxes, a
    crawl takes 15 min + 5.5 hr (15 min to 6 hours)
  • Each content crawl produces a 10 GB log file
    containing file names and content hashes

12
Dataset
  • Three measurement periods; within each period,
    snapshots are taken every day
  • 6/8/2005-6/18/2005, 8/23/2005-9/9/2005 and
    10/11/2005-10/21/2005
  • Examines both short and long timescales

13
Dataset
14
Sources of unreachable nodes
  • Firewall
  • Severe network congestion
  • Peer departed
  • No support for the Browse Host protocol
  • Ultrapeers: departure
  • Leaf peers: departure and firewalls
  • Contacts 20% of peers (half a million)

15
Problems
  • Low-bandwidth TCP connections: some crawls do not
    complete before the timeout threshold because data
    is sent at an extremely low rate
  • File identity: the file name is not a reliable
    file identifier, so this work uses the content
    hash
  • Post-processing: more than 100 million distinct
    files, so divide them randomly into 7 segments,
    trim files with fewer than 10 copies in a segment,
    and combine the trimmed segments back into one
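
One possible reading of the post-processing step is sketched below. The slide does not say what exactly is divided, so this sketch assumes that individual file observations (one content hash per observed copy) are assigned to segments at random; the function name `trim_and_combine` and its parameters are hypothetical.

```python
import random
from collections import Counter

def trim_and_combine(hashes, segments=7, min_copies=10, seed=0):
    """Approximate per-file copy counts while dropping rare files.

    hashes: iterable of content hashes, one entry per observed file copy.
    Returns a Counter holding only the copies that survived trimming.
    """
    rng = random.Random(seed)
    per_segment = [Counter() for _ in range(segments)]
    # Step 1: divide the observations randomly into segments.
    for h in hashes:
        per_segment[rng.randrange(segments)][h] += 1
    # Steps 2 and 3: trim rare files inside each segment, then combine.
    combined = Counter()
    for counts in per_segment:
        for h, c in counts.items():
            if c >= min_copies:
                combined[h] += c
    return combined

# Synthetic observations, made up for illustration only.
observations = ["rare"] * 9 + ["common"] * 5000
popular = trim_and_combine(observations)
```

Under this reading the combined counts are a lower bound on the true counts, since copies that fall into a segment where the file is rare are discarded. The benefit is that each segment can be counted within a smaller memory footprint.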

16
Static analysis
  • Ratio of free riders
  • Degree of resource sharing among cooperative
    peers
  • File popularity distribution
  • File type analysis

17
Ratio of free riders
  • The share of free riders has dropped; it is lower
    among ultrapeers and slightly higher among
    long-lived peers; files are not strongly
    correlated

18
Degree of resource sharing among cooperative
peers
  • The distribution of peers sharing x files follows
    a power law

19
Degree of resource sharing among cooperative
peers
  • The distribution of contributed disk space
    follows a power law

20
Degree of resource sharing among cooperative
peers
  • Correlation is not as strong as in previous
    studies
  • Discernible line with a slope of 3.7 MB/file,
    which is the typical size of an MP3 audio file

21
File popularity distribution
22
File type analysis
23
File type analysis
             Previous studies      Current study
             files     bytes       files     bytes
    Music    67.2%     79.2%       67%       40%
    Video    2.1%      19.1%       6%        52.5%
24
Topological analysis
  • Per-file perspective (figures a, b)
  • Per-peer perspective (figure c)

25
Topological analysis
  • Churn (the dynamics of peer participation) is the
    dominant factor: peers depart, peers join, and
    leaf peers become ultrapeers
  • Rapid change in the overlay topology prevents the
    formation of topological clustering

26
Dynamics analysis
  • Variations in shared files by individual peers
  • Variations in popularity of individual files
  • Trends in popularity variations

27
Variations in shared files by individual peers
28
Variations in popularity of individual files
  • Focus on top 100 and top 1000 files

29
Trends in popularity variations
  • Track the top 10 files across several days
    (figures a, b)
  • Over several months (figure c)

30
Conclusion
  • Used parallel crawls to obtain snapshots of peer
    connectivity and available files
  • Conducted three types of analysis
  • Characterized the distribution, correlation and
    dynamics of available files

31
Summary of findings
  • Free riding has dropped significantly
  • The number of shared files and the storage space
    contributed by individual peers follow a power-law
    distribution: most peers contribute little disk
    space (<100 MB) while a small number of peers
    contribute very large space (50-100 GB)
  • The popularity of individual files follows a Zipf
    distribution: a small number of files are
    extremely popular but the majority of files are
    very unpopular
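
A common informal check for a Zipf-like popularity distribution is to rank files by copy count and fit a line to log(count) against log(rank). The sketch below illustrates that check only; it is not the analysis from the study, the helper names are hypothetical, and a least-squares fit on a log-log plot is a rough indicator rather than a rigorous test.

```python
import math
from collections import Counter

def rank_frequency(hashes):
    """Return copy counts sorted from the most to the least popular file."""
    return sorted(Counter(hashes).values(), reverse=True)

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), rank from 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic data, made up for illustration: file r has about 1000/r copies.
synthetic = [f"hash{r}" for r in range(1, 51) for _ in range(1000 // r)]
counts = rank_frequency(synthetic)
slope = loglog_slope(counts)
```

A slope close to -1 on such a plot corresponds to the classic Zipf shape, where a handful of top-ranked files account for a large share of all copies.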

32
Summary of findings
  • The most popular file type is MP3 (2/3 of all
    files, 1/3 of all bytes)
  • The popularity of video files and the space they
    occupy have tripled over the past few years
  • Video files number less than 1/10 of audio files
    but occupy 25% more bytes
  • 93% of bytes or 73% of files are multimedia files

33
Summary of findings
  • Files are randomly distributed: there is no
    strong correlation between the files available at
    peers that are one, two or three hops apart in the
    overlay topology
  • The files shared by individual peers change
    slowly over a timescale of days; more popular
    files experience larger variations in popularity
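
One way to probe the first finding is to compare the file sets of peers that are a given number of hops apart. The sketch below uses Jaccard similarity as the overlap measure; that choice, the helper names, and the toy data are assumptions made for illustration, since the slides do not state which metric the study used.

```python
from collections import deque

def peers_at_distance(overlay, start, hops):
    """Return the peers whose shortest-path distance from `start` is `hops`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        if dist[peer] == hops:
            continue              # no need to explore beyond the target distance
        for n in overlay.get(peer, []):
            if n not in dist:
                dist[n] = dist[peer] + 1
                queue.append(n)
    return {p for p, d in dist.items() if d == hops}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_similarity(overlay, files, hops):
    """Average Jaccard similarity over all ordered peer pairs `hops` apart."""
    scores = [
        jaccard(files.get(p, set()), files.get(q, set()))
        for p in overlay
        for q in peers_at_distance(overlay, p, hops)
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Toy chain A-B-C-D with made-up file sets.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
shared = {"A": {"x", "y"}, "B": {"x", "y"}, "C": {"y", "z"}, "D": {"w"}}
by_hops = {h: mean_similarity(chain, shared, h) for h in (1, 2, 3)}
```

In the toy chain the similarity falls as the hop count grows, which is what topological clustering would look like. A flat profile across hop counts would instead match the reported random distribution of files.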