Title: Crawling Gnutella Network
1. Crawling Gnutella Network
By Samer Al-Kiswany
2. Roadmap
- Introduction
- Gnutella network structure
- Gnutella protocol overview
- Gnutella crawling protocol
- Crawling topology information
- Crawling node content
- Demo
3. Introduction
The Gnutella network is a decentralized peer-to-peer
system for file sharing.
- Originally created by Justin Frankel of Nullsoft
- Large scale
  - today up to 4M nodes, 1000 TB of data, 100M files
- Fast growth in its early stages
  - grew more than 50-fold during the first half of 2001
  - (and 50-fold again from 2001 to 2006)
- Self-organizing network
- Open architecture; simple and flexible protocol
5. Gnutella Network Structure
Gnutella Protocol 0.6
Two-tier architecture of ultrapeers and leaves
[Figure: two-tier topology, labeled "Ultrapeers" and "Leaves"]
7. Basic Primitives for File Sharing
- Join: How do I begin participating?
- Publish: How do I advertise my file(s)?
- Search: How do I find a file?
- Fetch: How do I retrieve a file?
8. Gnutella Protocol Overview
- Join: on startup, the client contacts one or more ultrapeer nodes
- Publish: not needed
- Search:
  - Ask the ultrapeer node
  - The ultrapeer propagates the query to other ultrapeers and returns the answers
- Fetch: get the file directly from the peer (HTTP)
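The search step can be illustrated with a toy model. The sketch below is not the Gnutella wire protocol: it assumes an in-memory graph of ultrapeers, and it assumes each ultrapeer checks its own files and those of its attached leaves. The names `search`, `ultrapeers`, `leaves_of` and `shared` are hypothetical, and real clients also bound propagation (for example with a TTL), which is omitted here.

```python
from collections import deque


def search(ultrapeers, leaves_of, shared, start, keyword):
    """Propagate a query breadth-first across ultrapeers and collect hits.

    ultrapeers: dict mapping an ultrapeer to its neighboring ultrapeers
    leaves_of:  dict mapping an ultrapeer to its attached leaves
    shared:     dict mapping any node to the file names it shares
    """
    hits = []
    seen = {start}           # ultrapeers that already received the query
    queue = deque([start])
    while queue:
        up = queue.popleft()
        # Toy assumption: the ultrapeer checks itself and its leaves.
        for node in [up] + leaves_of.get(up, []):
            for name in shared.get(node, []):
                if keyword.lower() in name.lower():
                    hits.append((node, name))
        # Forward the query to ultrapeers that have not seen it yet.
        for neighbor in ultrapeers.get(up, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return hits
```

The `seen` set keeps the query from circulating forever when ultrapeers are connected in a loop.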
10. Crawling a Gnutella Node
- By crawling, we are interested in two main pieces of information:
  - With whom is the node connected? (topology information)
    - Gnutella protocol term: Crawling / Communicating Network Topology Information
  - What files is the node sharing with others?
    - Gnutella protocol term: Browsing Host
11. Crawling Topology Information
Gnutella protocol 0.6 supports crawling of network topology
information!
- Topology information
  - Ultrapeers
  - Leaves
12. Crawling Topology Information
Crawler to node:
    GNUTELLA CONNECT/0.6
    User-Agent: LimeWire (crawl)
    X-Ultrapeer: False
    Query-Routing: 0.1
    Crawler: 0.1
Node to crawler:
    GNUTELLA/0.6 200 OK
    User-Agent: BearShare
    Leaves: 127.0.0.1:6346,127.0.0.2:6346
    Peers: 127.0.0.4:6346,127.0.0.5:6346
Crawler to node:
    GNUTELLA/0.6 200 OK
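The exchange above could be produced and consumed roughly as follows. This is a minimal sketch, not the reference crawler: the function names are hypothetical, and it assumes headers are CRLF-terminated lines followed by a blank line, as in the Gnutella 0.6 handshake. Opening the socket and sending the final "200 OK" are left out.

```python
def build_crawl_request(user_agent="LimeWire (crawl)"):
    """Return the crawler's handshake request as bytes."""
    lines = [
        "GNUTELLA CONNECT/0.6",
        f"User-Agent: {user_agent}",
        "X-Ultrapeer: False",
        "Query-Routing: 0.1",
        "Crawler: 0.1",
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_endpoints(value):
    """Turn 'ip:port,ip:port' into a list of (ip, port) tuples."""
    endpoints = []
    for item in value.split(","):
        host, sep, port = item.strip().rpartition(":")
        if sep and host and port.isdigit():   # skip malformed entries
            endpoints.append((host, int(port)))
    return endpoints


def parse_crawl_response(raw):
    """Extract the status, agent, leaves and peers from the node's reply."""
    text = raw.decode("ascii", errors="replace")
    lines = text.split("\r\n\r\n", 1)[0].split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return {
        "ok": lines[0].startswith("GNUTELLA/0.6 200"),
        "user_agent": headers.get("user-agent", ""),
        "leaves": parse_endpoints(headers.get("leaves", "")),
        "peers": parse_endpoints(headers.get("peers", "")),
    }
```

Header names are lowercased before lookup because implementations may differ in capitalization.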
13. Browsing Node Content
[Figure: Gnutella network]
14. Browsing Node Content
Crawler to node:
    GET / HTTP/1.1
    Host: Crawler_IP:PORT
    User-Agent: UBCECE
    Accept: application/x-gnutella-packets
    Connection: close
Node to crawler:
    HTTP/1.1 200 OK
    Server: LimeWire/x.y
    Content-Type: application/x-gnutella-packets
    Connection: close

    List of files (Query Hit message)
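A browse request like the one above can be assembled and its reply checked with a few lines of code. The sketch below is illustrative only: `build_browse_request` and `inspect_browse_response` are hypothetical names, and the socket handling is omitted. The Content-Type check matters because a node may answer in a format other than the one requested.

```python
def build_browse_request(host, port, user_agent="UBCECE"):
    """Return the HTTP browse request as bytes."""
    lines = [
        "GET / HTTP/1.1",
        f"Host: {host}:{port}",
        f"User-Agent: {user_agent}",
        "Accept: application/x-gnutella-packets",
        "Connection: close",
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def inspect_browse_response(raw):
    """Split an HTTP reply into status, headers of interest, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split()
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    # Ignore any parameters after ';' in the Content-Type value.
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    return {
        "ok": len(parts) >= 2 and parts[1] == "200",
        "server": headers.get("server", ""),
        "is_gnutella_packets": content_type == "application/x-gnutella-packets",
        "body": body,   # raw Query Hit bytes when is_gnutella_packets is True
    }
```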
15. Query Hit Parsing
Query Hit message layout:
1. Gnutella message header. Important field: message length.
   The Gnutella message may contain more than one Query Hit response.
2. Query Hit header. Important field: number of files.
   A-F. List of shared files; each entry includes the file name and size.
3. Other Gnutella protocol fields.
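The parsing steps can be sketched in code. The byte offsets below are not given on the slide; they follow the standard Gnutella layout as an assumption: a 23-byte message header (16-byte GUID, type, TTL, hops, 4-byte little-endian payload length), then a Query Hit payload that starts with the number of files, port, IP address and speed, followed by one record per file. The function names are hypothetical, and truncated input raises an exception rather than being repaired.

```python
import socket
import struct

HEADER_LEN = 23     # GUID(16) + type(1) + TTL(1) + hops(1) + length(4)
QUERY_HIT = 0x81    # payload type of a Query Hit message


def parse_query_hit_payload(payload):
    """Parse one Query Hit payload into the host address and its file list."""
    count = payload[0]                          # number of files
    (port,) = struct.unpack_from("<H", payload, 1)
    ip = socket.inet_ntoa(payload[3:7])
    offset = 11                                 # skip the 4-byte speed field
    files = []
    for _ in range(count):
        index, size = struct.unpack_from("<II", payload, offset)
        offset += 8
        end = payload.index(b"\x00", offset)    # file name ends at a NUL
        name = payload[offset:end].decode("utf-8", errors="replace")
        offset = payload.index(b"\x00", end + 1) + 1   # skip extension block
        files.append({"index": index, "size": size, "name": name})
    return {"ip": ip, "port": port, "files": files}


def parse_query_hits(data):
    """Walk a byte stream and parse every Query Hit message found in it."""
    hits = []
    pos = 0
    while pos + HEADER_LEN <= len(data):
        msg_type = data[pos + 16]
        (length,) = struct.unpack_from("<I", data, pos + 19)
        payload = data[pos + HEADER_LEN:pos + HEADER_LEN + length]
        pos += HEADER_LEN + length              # jump to the next message
        if msg_type == QUERY_HIT and len(payload) >= 11:
            hits.append(parse_query_hit_payload(payload))
    return hits
```

The message length in the header is what lets the loop step from one message to the next, which is why the slide singles it out.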
16. Limitations: Does this always work?
- Topology crawling
  - Topology information crawling is not supported by some
    Gnutella protocol v0.4 implementations
- Host browsing
  - Some Gnutella node implementations (BearShare, for instance)
    return the list of files in HTML and do not respond with a
    Query Hit message
18. Single Gnutella-Node Crawler
A proof-of-concept implementation of a single
Gnutella-node crawler.
- The main class that implements the crawling
  protocol is the Crawler class:
  - crawlpeers(ip_address, port)
  - parsePeers(byte[])
  - listFiles(ip_address, port)
  - parseFilesList(byte[])
  - processQueryHit(byte[])
Available through the following link:
http://www.ece.ubc.ca/samera/TA/411/index.html
19. Demo!
Crawling reala.ece.ubc.ca 5627
20. Project Phase II
- Implement a single-node Gnutella network crawler
- Report:
  - The active leaf nodes
  - Information regarding the agent (i.e., the
    implementation: LimeWire, BearShare, etc.)
  - The domain name corresponding to the node IP address
  - A list of all the files shared (excluding
    BearShare servents)
Avoid cycles!
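One way to read "avoid cycles" is to remember every node already visited while following leaf and peer links. The sketch below assumes that reading. `crawl_network` and `domain_name` are hypothetical helpers, and `get_neighbors` stands in for whatever function returns the (ip, port) pairs reported by a node. The reverse lookup uses the standard `socket.gethostbyaddr` call.

```python
import socket
from collections import deque


def crawl_network(seed, get_neighbors, max_nodes=1000):
    """Visit nodes breadth-first, never visiting the same node twice."""
    visited = {seed}            # nodes already queued; this prevents cycles
    order = []
    queue = deque([seed])
    while queue and len(order) < max_nodes:
        node = queue.popleft()
        order.append(node)
        try:
            neighbors = get_neighbors(node)
        except OSError:
            continue            # unreachable node: keep it, but do not expand
        for neighbor in neighbors:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


def domain_name(ip):
    """Return the host name for an IP address, or None if it has none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:             # covers socket.herror and socket.gaierror
        return None
```

The `max_nodes` cap is an added safety limit, not part of the assignment; it keeps a test run from wandering over the whole network.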
21. References
- Single Gnutella-Node Crawler: http://www.ece.ubc.ca/samera/TA/411/index.html
- Gnutella crawling protocol: http://www.ece.ubc.ca/samera/TA/411/index.html
- Other references:
  - http://gnutella-specs.rakjar.de/index.php/Main_Page
  - www.limewire.com
22. Thank you
www.ece.ubc.ca/samera