Title:
1. Information Retrieval in Peer-to-Peer Systems
Dept. of Computer Science & Engineering @ University of California - Riverside
- Demetrios Zeinalipour-Yazti
M.Sc. Thesis Defense: Monday, May 5, 2003, Surge 349, 12:00-1:00 PM
Thesis Committee: Dr. Dimitrios Gunopulos (Chairperson), Dr. Vana Kalogeraki, Dr. Chinya V. Ravishankar
http://www.cs.ucr.edu/csyiazti/msc.html
2. Presentation Outline
- Introduction & Motivation
- Search Techniques for P2P systems
- The Intelligent Search Mechanism
- PeerWare Simulation Infrastructure
- Experimental Evaluation
- Conclusions & Future Work
3. Introduction to Peer-to-Peer
- Peer-to-Peer computing definition:
- Sharing of computer resources and information through direct exchange.
- Clients (downloaders) are also servers.
- Clients may join or leave the network at any time => highly fault-tolerant, but at a cost!
- Searches are done within the virtual network, while actual downloads are done offline (with HTTP).
4. Introduction to Peer-to-Peer
- Peer-to-Peer (P2P) systems are becoming increasingly popular.
- P2P file-sharing systems such as Gnutella, Napster and Freenet realized a distributed infrastructure for sharing files.
- Traditionally, files were shared using the client-server model (e.g. HTTP), which does not scale because the services are centralized.
- P2P systems offer new advantages in simplicity of use, robustness, self-organization and scalability.
5. Information Retrieval in P2P
- Problem:
- How can we efficiently retrieve information in P2P systems where each node shares a collection of documents?
- Documents consist of keywords.
- This resembles classical Information Retrieval, but the resources are now distributed.
- Primary data structures such as global inverted indexes can't be maintained efficiently.
6. Solutions for P2P Information Retrieval
- 1) Centralized Approaches
- Centralized indexes.
- e.g. Napster, SETI@home
- 2) Purely Distributed Approaches
- Each node has only local knowledge.
- I.R. is done using brute-force mechanisms.
- e.g. Gnutella, FastTrack (Kazaa)
- 3) Hybrid Approaches
- One or more peers have partial indexes of the contents of others.
- e.g. Limewire's Ultrapeers
[Figure: message flow in three panels. Centralized Index: 1) Upload Index, 2) Query/QueryHit, 3) Download (offline). Second panel: 1) Connect, 2) Query/QueryHit, 3) Download (offline). Third panel: 1) Connect, 2) Intelligent Query/QueryHit, 3) Download (offline).]
7. Motivation
- On 1st June we crawled the Gnutella P2P network for 5 hours with 17 workstations.
- We analyzed 15,153,524 query messages.
- Observation: high locality of specific queries.
- Can we exploit this property for more efficient searches?
9. Search Techniques for P2P systems
- Breadth-First Search (Gnutella)
- Idea: each Query message is propagated along all outgoing links of a peer, using a TTL (time-to-live).
- The TTL is decremented on each forward until it becomes 0.
- This is the I.R. technique used in P2P systems such as Gnutella.
- Highlights
- The physical network is brought to its knees.
- Long delays for search results.
[Figure: BFS in P2P network N. Labels: peer q, peer A, peer d; 1) QUERY, 2) QUERYHIT.]
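The flooding idea on this slide can be sketched in a few lines. This is a minimal illustration, not the Gnutella implementation: the adjacency-list `graph`, the function name `bfs_flood` and the message counting are assumptions made for the example, and (unlike a real client) a peer here also sends the query back over the link it arrived on.

```python
from collections import deque

def bfs_flood(graph, source, ttl):
    """Flood a query from `source`; return (reached_peers, messages_sent)."""
    seen = {source}                      # peers that already handled this query
    frontier = deque([(source, ttl)])
    messages = 0
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue                     # TTL exhausted: do not forward further
        for neighbor in graph[peer]:
            messages += 1                # every forward costs one message
            if neighbor not in seen:     # duplicates are received but dropped
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return seen, messages

# Hypothetical 4-peer overlay used only for illustration.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
reached, sent = bfs_flood(graph, "A", ttl=2)
print(sorted(reached), sent)  # ['A', 'B', 'C', 'D'] 6
```

Even in this tiny overlay, 6 messages are spent to reach 3 other peers; the duplicate deliveries are the overhead the later techniques try to cut.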
10. Search Techniques for P2P systems
- Modified Random BFS
- V. Kalogeraki, D. Gunopulos, D. Zeinalipour-Yazti. CIKM 2002
- Idea: each Query message is forwarded to only a fraction of the outgoing links (e.g. ½ of them).
- The TTL is again decremented on each forward until it becomes 0.
- Highlights
- Fewer messages, but possibly fewer results.
- The algorithm is probabilistic.
- Some segments may become unreachable.
[Figure: Random BFS in P2P network N. Labels: peers A, B, C and d; 1) QUERY, 2) QUERYHIT; one segment marked unreachable.]
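A minimal sketch of the random-subset forwarding described above. The function name `random_bfs`, the `fraction` parameter and the complete 6-peer test overlay are assumptions for illustration; the slide only states that a fraction (e.g. half) of the outgoing links is used.

```python
import random
from collections import deque

def random_bfs(graph, source, ttl, fraction=0.5, rng=None):
    """Forward each query to a random fraction of a peer's neighbors."""
    rng = rng or random.Random()
    seen = {source}
    frontier = deque([(source, ttl)])
    messages = 0
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue
        neighbors = graph[peer]
        # At least one link is used whenever the peer has any neighbor.
        k = max(1, int(len(neighbors) * fraction)) if neighbors else 0
        for neighbor in rng.sample(neighbors, k):
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return seen, messages

# Hypothetical overlay: 6 peers, every peer linked to every other peer.
graph = {n: [m for m in range(6) if m != n] for n in range(6)}
reached, sent = random_bfs(graph, 0, ttl=2, rng=random.Random(7))
print(len(reached), sent)
```

With degree 5 and `fraction=0.5`, each forwarding peer uses 2 links, so a TTL of 2 costs 6 messages here, whereas flooding the same overlay would cost 30; the price is that some peers may never see the query.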
11. Search Techniques for P2P systems
- Searching Using Random Walkers
- Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. ICS 2002
- Idea: each Query message is forwarded to 1 neighbor.
- With k walkers, after T steps we reach roughly the same number of nodes as 1 walker after kT steps. (They use 16-64 walkers.)
- Highlights
- Network traffic is reduced (relative to BFS) by 2 orders of magnitude.
- Increases the user-perceived delay (from 2-6 hops to 4-15 hops).
- The algorithm is probabilistic, and the likelihood of locating an object depends on the network topology.
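The k-walker idea can be sketched as below. This is an assumption-laden illustration: `random_walk_search`, the `has_object` predicate and the step limit `max_steps` are names chosen for the example, and the termination checks used by the original authors are not modelled.

```python
import random

def random_walk_search(graph, source, has_object, walkers=16, max_steps=64, rng=None):
    """Advance `walkers` independent random walks one hop per step.

    Returns (peer_holding_object_or_None, steps_taken, messages_sent).
    Assumes every peer has at least one neighbor.
    """
    rng = rng or random.Random()
    positions = [source] * walkers
    messages = 0
    for step in range(1, max_steps + 1):
        for i, peer in enumerate(positions):
            nxt = rng.choice(graph[peer])   # each walker picks ONE neighbor
            messages += 1
            if has_object(nxt):
                return nxt, step, messages
            positions[i] = nxt
    return None, max_steps, messages

# Hypothetical two-peer overlay: q's only neighbor d holds the object.
graph = {"q": ["d"], "d": ["q"]}
found = random_walk_search(graph, "q", lambda p: p == "d")
missed = random_walk_search(graph, "q", lambda p: False, walkers=4, max_steps=5)
print(found)   # ('d', 1, 1)
print(missed)  # (None, 5, 20)
```

The message count grows as walkers times steps rather than exponentially with the TTL, which is where the traffic reduction comes from.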
12. Search Techniques for P2P systems
- 4. Using Randomized Gossiping to Replicate Global State
- F. M. Cuenca-Acuna, Thu D. Nguyen. HPDC-12
- Idea: PlanetP uses Bloom filters to propagate summary indexes of the contents of a peer.
- Bloom filters are used for membership queries.
- Highlights
- Not scalable (the technique works well for <10,000 nodes).
- No data replication required.
- False positives are a function of m, n, k and can be kept small.
[Figure: an 8-bit Bloom filter with 4 hash functions. The elements of D = {d1, d2, ..., dn} are hashed by h1(d), h2(d), h3(d), h4(d) into positions 000-111 of an m-bit array; the membership query "d1?" checks the same positions.]
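A Bloom filter with m bits and k hash functions can be sketched as follows. This is a generic textbook construction, not PlanetP's code: deriving the k hash functions from salted SHA-256 digests, and the names `BloomFilter`, `might_contain` and `false_positive_rate`, are assumptions made for the example. The rate function uses the standard approximation (1 - e^(-kn/m))^k for n inserted items.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Simulate k hash functions by salting one digest with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

def false_positive_rate(m, n, k):
    """Standard approximation of the false-positive probability."""
    return (1.0 - math.exp(-k * n / m)) ** k

summary = BloomFilter(m=64, k=4)
for word in ["italy", "earthquake", "disaster"]:
    summary.add(word)
print(summary.might_contain("italy"))  # True
print(false_positive_rate(64, 3, 4))
```

A peer can gossip `summary.bits` instead of its full word list; other peers then test query terms against it and contact only peers whose filter says "possibly present".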
13. Search Techniques for P2P systems
- 5. Searching using Local Indices
- Arturo Crespo and Hector Garcia-Molina, ICDCS 2002.
- Idea: create indices containing statistics that reveal the direction towards the documents.
- Types of proposed indices
- Compound Routing Index (CRI): metric = number of documents.
- Hop-Count Routing Index (HRI): maintains a CRI for up to k hops.
- Exponentially Aggregated Index (ERI): applies a cost formula to the HRI to shrink the HRI's size.
- Highlights
- Not scalable; routing updates are expensive, but still better than replicating data indexes.
- Assumes a static environment, but no data replication is required.
14. Search Techniques for P2P systems
- 6. Directed BFS and the >RES Heuristic (1/2)
- Beverly Yang and Hector Garcia-Molina, ICDCS 2002.
- Proposed techniques
- Directed BFS, based on aggregate statistics (e.g. number of results a peer returned, shortest queue, most data forwarded).
- Iterative Deepening, until Z results are returned.
- Local Indexes: each node maintains the actual index over the data of peers r hops away.
- Their experiments deploy the Directed BFS techniques by attaching nodes to the Gnutella network.
- The >RES heuristic is shown to work well.
15. Search Techniques for P2P systems
- Directed BFS and the >RES Heuristic (2/2)
- The >RES heuristic is optimized to find Z documents efficiently, for some user-defined Z.
- >RES works well because
- It captures stable/large network segments.
- It favors potentially less overloaded peers.
- >RES is a quantitative approach.
- Drawback: >RES doesn't route queries to the most relevant content.
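A rough sketch of a >RES-style neighbor choice: forward to the neighbors that returned the most results recently. The class name `ResHeuristic`, the sliding `window` of 10 queries and the tie-breaking by input order are assumptions for illustration, not details taken from the slides.

```python
from collections import defaultdict, deque

class ResHeuristic:
    """Rank neighbors by the number of results they returned recently."""

    def __init__(self, window=10):
        # Per neighbor: result counts of its last `window` answered queries.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, neighbor, num_results):
        self.history[neighbor].append(num_results)

    def pick(self, neighbors, k):
        # Purely quantitative: the query text itself is never inspected.
        ranked = sorted(neighbors, key=lambda n: sum(self.history[n]), reverse=True)
        return ranked[:k]

res = ResHeuristic()
res.record("A", 5)
res.record("B", 1)
res.record("C", 9)
print(res.pick(["A", "B", "C"], k=2))  # ['C', 'A']
```

Because `pick` ignores the content of the query, a neighbor that returned many results about an unrelated topic still ranks first, which is the drawback noted on the slide.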
16. Search Techniques for P2P systems
- 7. Depth-First Search and Freenet
- I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong, LNCS 2009
- Idea: objects are hashed, and the hash of a query is routed by key closeness in a DFS manner.
- Highlights
- Uses caching of key/object pairs for future requests.
- Data replication along the QueryHit path provides availability.
- Anonymity of searcher and publisher.
- Drawbacks: i) Searches ONLY based on object identifier.
- ii) The user-perceived delay is high.
[Figure: Freenet lookup of fileA. Labels: peer q, peers S, B, C and R; 1) QUERY h(A) ("Search A"), 2) result; original fileA and replicated fileA.]
17. Search Techniques for P2P systems
- 8. Consistent Hashing and Chord
- Ion Stoica et al. SIGCOMM 2001
- Idea: objects and nodes are hashed to m-bit identifiers and organized in a virtual ring. Object lookup is achieved in O(log N).
- Highlights
- Consistent hashing achieves (i) good load balancing of keys and (ii) little object/key movement when nodes join or leave.
- Drawbacks: i) Searches ONLY based on object identifier.
- ii) Data movement may be a big overhead.
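The ring placement rule behind consistent hashing can be sketched as below. This shows only the key-to-successor mapping on a sorted list of node identifiers; it does not implement Chord's finger tables or its O(log N) routing. The names `chord_id` and `successor`, the 16-bit identifier space and the sample ring are assumptions for the example.

```python
import hashlib
from bisect import bisect_left, insort

M = 16  # identifier bits; an arbitrary choice for this example

def chord_id(name, m=M):
    """Hash a node address or object name to an m-bit identifier."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(node_ids, key_id):
    """First node clockwise from key_id; node_ids must be sorted."""
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]   # wrap around the ring

ring = [10, 200, 5000]                   # hypothetical node identifiers
keys = [11, 150, 300, 6000]
before = {k: successor(ring, k) for k in keys}

insort(ring, 1000)                       # a new node joins the ring
after = {k: successor(ring, k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(before)  # {11: 200, 150: 200, 300: 5000, 6000: 10}
print(moved)   # [300]
```

Only the key that falls between the new node and its predecessor changes owner, which is the "little key movement" property listed under the highlights.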
19. Intelligent Search Mechanism (ISM)
- Introduction
- Idea: each Query message is forwarded intelligently, based on what queries a peer answered in the past.
- Components of ISM (for each node u)
- Profile Mechanism, kept for each neighbor in N(u).
- Peer Ranking Mechanism, for ranking peers locally and sending a search query only to the ones most likely to answer.
- Similarity Function, for finding similar search queries.
- Search Mechanism, for propagating queries based on local indexes.
[Figure: ISM overview. Labels: peer A with profiles, peer d; 1) QUERY, 2) QUERYHIT.]
20. Intelligent Search Mechanism (ISM)
- Components of ISM
- a) Profile mechanism.
- Maintains a list of past queries routed through that host.
- Every time a QueryHit is received, the table is updated.
- The profile manager uses a Least Recently Used policy to keep the most recent queries in the repository.
- Profiles are kept for neighbors only, so the cost of maintaining them is O(T·d), where T is the limiting factor (size) per profile and d is the degree of the node.
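The profile bookkeeping can be sketched with one bounded LRU table per neighbor, which gives the T·d storage bound directly. The class and method names (`NeighborProfile`, `ProfileManager`, `on_query_hit`) and the choice to store a hit count per query are assumptions for illustration; PeerWare's actual data layout is not shown on the slides.

```python
from collections import OrderedDict

class NeighborProfile:
    """At most `capacity` (T) recent queries answered via one neighbor."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queries = OrderedDict()        # query text -> number of hits

    def record_hit(self, query, hits=1):
        self.queries[query] = self.queries.get(query, 0) + hits
        self.queries.move_to_end(query)     # mark as most recently used
        if len(self.queries) > self.capacity:
            self.queries.popitem(last=False)  # evict least recently used

class ProfileManager:
    """One profile per neighbor, so storage is bounded by T * d entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.profiles = {}

    def on_query_hit(self, neighbor, query, hits=1):
        profile = self.profiles.setdefault(neighbor, NeighborProfile(self.capacity))
        profile.record_hit(query, hits)

manager = ProfileManager(capacity=2)
for q in ["super bowl", "elections bush", "super bowl", "italy earthquake"]:
    manager.on_query_hit("P1", q)
print(list(manager.profiles["P1"].queries))  # ['super bowl', 'italy earthquake']
```

The repeated "super bowl" hit refreshes that entry, so "elections bush" is the one evicted when the third distinct query arrives.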
21. Intelligent Search Mechanism (ISM)
- Components of ISM
- b) The RelevanceRank peer ranking metric.
- Before forwarding a Query message, a peer performs an on-the-fly ranking of its peers to determine the best paths.
- We use the Aggregate Weighted Similarity of peer Pi to a query q, computed by a peer Pl.
[Equation: aggregate weighted similarity of Pi to q, as computed by Pl.]
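The ranking step can be sketched as follows. The exact formula is not legible in the slide text, so this sketch assumes the aggregate is a sum, over the past queries answered by a neighbor, of the query similarity raised to a tunable exponent `alpha`. The names `relevance_rank`, `word_overlap` and `alpha`, the sample profiles, and the simple word-overlap stand-in for the similarity function are all illustrative assumptions.

```python
def word_overlap(a, b):
    """Stand-in similarity: shared words divided by all distinct words."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def relevance_rank(profile, query, similarity, alpha=1.0):
    """Assumed form: sum of similarity(past_query, query) ** alpha."""
    return sum(similarity(past, query) ** alpha for past in profile)

# Hypothetical profiles kept by peer Pl for its neighbors P1 and P2.
profiles = {
    "P1": ["italy earthquake", "super bowl"],
    "P2": ["elections bush", "elections clinton"],
}
query = "italy disaster"
ranking = sorted(
    profiles,
    key=lambda p: relevance_rank(profiles[p], query, word_overlap),
    reverse=True,
)
print(ranking)  # ['P1', 'P2']
```

A larger `alpha` gives more weight to the most similar past queries; with `alpha=1.0` every past query contributes in proportion to its similarity.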
22. Intelligent Search Mechanism (ISM)
- Components of ISM
- c) Similarity function: the cosine similarity.
- Assume that L is the set of all words (in the Profile Manager),
- e.g. L = {elections, bush, clinton, super, bowl, san, diego, ..., italy, earthquake, disaster}.
- We define an |L|-dimensional space where each query is a vector.
- If q = "italy disaster" => q (vector of q) = [0,0,0,...,1,0,1].
- Recall that we have a vector for each query qi stored in the Profile Manager (i.e. the vector of qi).
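The vector-space step above can be made concrete with a small sketch. The vocabulary below is the example word list from the slide with the elided middle left out; `to_vector` and `cosine_similarity` are illustrative names, and binary term weights are an assumption.

```python
import math

def to_vector(query, vocabulary):
    """Binary vector: 1 where the vocabulary term occurs in the query."""
    words = set(query.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

VOCAB = ["elections", "bush", "clinton", "super", "bowl",
         "san", "diego", "italy", "earthquake", "disaster"]

q = to_vector("italy disaster", VOCAB)
q1 = to_vector("italy earthquake", VOCAB)   # a past query from a profile
q2 = to_vector("super bowl", VOCAB)

print(q)                          # [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
print(cosine_similarity(q, q1))   # one shared term out of two each, about 0.5
print(cosine_similarity(q, q2))   # 0.0
```

Queries sharing one of two terms score about 0.5, while queries with no common term score 0, so only related past queries contribute to a neighbor's rank.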
23. Intelligent Search Mechanism (ISM)
- Components of ISM
- d) Search Mechanism
- Utilizes the Peer Ranking Mechanism to forward queries to nodes that potentially contain the information we are looking for.
[Figure: a peer consults its profiles to choose where to forward 1) QUERY on the way to peer d.]
24. Intelligent Search Mechanism (ISM)
- Breaking cycles with Random Perturbation
- Suppose that nodes answer only conjunctions of the query terms,
- that query q has no answer from A, B, C or D,
- and that one of them answered a query similar to q in the past.
- => Query q fails to explore the segment through E.
- Random Perturbation adds one additional message to a randomly chosen neighbor.
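A minimal sketch of the forwarding choice with random perturbation: take the best-ranked neighbors and add one neighbor drawn at random from the rest. The function name `choose_targets`, the score dictionary and the parameter `m` are assumptions for illustration.

```python
import random

def choose_targets(scores, m, rng=None):
    """Return the m highest-ranked neighbors plus one random other neighbor.

    scores: dict mapping neighbor -> rank value computed for the current query.
    """
    rng = rng or random.Random()
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, rest = ranked[:m], ranked[m:]
    if rest:
        # Random perturbation: lets the query escape a cycle of peers
        # that look promising but cannot actually answer it.
        best.append(rng.choice(rest))
    return best

# Hypothetical ranks at one peer; E has never answered anything similar.
scores = {"A": 0.9, "B": 0.5, "C": 0.0, "D": 0.0, "E": 0.0}
targets = choose_targets(scores, m=2, rng=random.Random(1))
print(targets)
```

Without the extra random pick, a neighbor like E with an empty profile would never be tried, and the segment behind it would stay unexplored.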
26. PeerWare Simulation Infrastructure
- Introduction
- PeerWare is our distributed middleware infrastructure that allows us to benchmark various query routing algorithms.
- It is deployed on a network of 50 workstations.
- It uses public/private keys and SSH to connect to the networked hosts.
- It is implemented in Java and consists of approximately 10,000 lines of code.
27. PeerWare Simulation Infrastructure
- Why real middleware and not simulations?
- Many properties, such as network failures and dropped queries, may reveal interesting and unknown patterns.
- In a real middleware we are able to measure the actual time to satisfy queries.
- Finally, there are no assumptions (network delays etc.) of the kind typical in simulation environments.
- The Anthill Project (Univ. of Bologna) uses a similar approach to investigate properties of the Freenet algorithm.
28. PeerWare Simulation Infrastructure
- PeerWare Components
- dataGen: the Dataset Generator
- graphGen: the Network Graph Generator
- dataPeer: the Data Node
- searchPeer: the Search Node
- Other Administrative Components
- netLaucher: shell script that launches the network
- netStats: shell script that provides statistics
- graphPlot: shell script that plots graphs based on generated results
29. PeerWare Simulation Infrastructure
- 1) dataGen Component
- dataGen is the Dataset Generator, which generates documents about specific topics
- (each peer can have some specialized knowledge).
- It uses the Reuters News Agency dataset (22,531 documents).
- It groups documents by various properties:
- Date, Topics, Places, People, Orgs, Companies.
- In our experiments we use the Places attribute and generate document sets for 104 countries.
30. PeerWare Simulation Infrastructure
- 2) graphGen Component
- graphGen is the topology generator.
- Currently it generates random topologies given parameters such as degree, IPs and ports.
- It uses graphViz to produce visualizations of the generated topologies.
31. PeerWare Simulation Infrastructure
- 3) dataPeer Component
- dataPeer is a P2P client that maintains an XML repository of documents.
- It uses the PDOM-XQL engine to query its documents.
- It pre-establishes persistent TCP connections to other peers.
32. PeerWare Simulation Infrastructure
- 4) searchPeer Component
- searchPeer is a P2P client that connects to a PeerWare network and performs unstructured queries.
- Keywords are sampled from within the dataset.
- It logs statistics such as query response time, the number of nodes that answered a query, etc.
34. Experimental Evaluation
- Introduction
- We create a distributed Newspaper application.
- We use a random network of 104 peers.
- Each peer has documents for 1 country.
- The average degree of a node is 7 (≈ log2 100; connected graph).
- We perform two series of experiments:
- 10x10 sequential queries with a delay of 4 sec.
- 400 random queries with a delay of 4 sec.
- We compare Doc. Ratio (Recall Rate) vs. number of messages for:
- BFS (Gnutella message flooding): forward to all degree neighbors.
- Random BFS: randomly forward to degree/2 nodes.
- Intelligent Search Mechanism: forward to the M = (degree/2) - 1 nodes with the highest RelevanceRank, plus 1 random node.
- >RES Heuristic: forward to the degree/2 nodes that returned the most results (>RES).
35. Experimental Evaluation
- Reducing Query Messages (10x10 Experiment)
- Recall Rate vs. number of messages with TTL=4
- BFS uses 1050 messages with a recall rate of 100%.
- RBFS uses 220 (20%) messages with a recall rate of 50%.
- >RES uses 400 (38%) messages with a recall rate of 70%.
- ISM uses 400 (38%) messages with a recall rate of 90%.
- ISM improves over time since the peer profiles gain more knowledge.
- ISM and >RES start out slowly since they use RBFS until they populate their routing structures.
36. Experimental Evaluation
- Digging Deeper by Increasing the TTL (10x10)
- Recall Rate vs. number of messages with TTL=5
- BFS again uses 1050 messages with a recall rate of 100%.
- RBFS uses 450 (43%) messages with a recall rate of 82%.
- >RES uses 570 (54%) messages with a recall rate of 90%.
- ISM uses 570 (54%) messages with a recall rate of 99%.
37. Experimental Evaluation
- Reducing Query Response Time (QRT) (10x10 Experiment)
- BFS's QRT is on the order of 6 seconds.
- RBFS, ISM and >RES use
- 30-60% of BFS's QRT for TTL=4
- 60-80% of BFS's QRT for TTL=5
- BFS's unnecessary messages increase the user-perceived delay.
- The Query Response Time is reported as a percentage of BFS.
38. Experimental Evaluation
- The Discarded Message Problem (DMP)
- A query q is identified by a GUID.
- To avoid cycles, a node never forwards a query it has already forwarded.
- DMP occurs if a node has forwarded q with TTL1 and then receives q again with TTL2, where TTL2 > TTL1.
- In our experiments approximately 30% of queries were affected by the DMP.
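The DMP follows directly from GUID-based duplicate suppression, as this small sketch shows. The `Peer` class and `should_forward` method are hypothetical names, and the model keeps only a set of seen GUIDs, which is an assumption about how the cycle check is stored.

```python
class Peer:
    """Forwards each query GUID at most once, whatever its TTL."""

    def __init__(self):
        self.seen_guids = set()

    def should_forward(self, guid, ttl):
        if guid in self.seen_guids:
            return False          # duplicate: discarded to avoid cycles
        self.seen_guids.add(guid)
        return ttl > 0

peer = Peer()
first = peer.should_forward("q-42", ttl=1)    # copy that took a long path
second = peer.should_forward("q-42", ttl=3)   # later copy with more hops left
print(first, second)  # True False
```

The second copy could have travelled two hops further than the first, but it is discarded, so peers beyond this node that were within the query's original horizon are never reached.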
39. Experimental Evaluation
- Improving Recall Rate over Time (400 Experiment)
- The 10x10 queries experiment suited ISM well.
- In this experiment we perform 400 random queries.
- BFS's overwhelming message load creates two major outbreaks.
- ISM improves over time, achieving
- a 96% recall rate while again using 38% of the messages.
41. Conclusions
- Efficient Information Retrieval in P2P networks is not feasible with the current search algorithms.
- We propose an Intelligent Search Mechanism that uses local knowledge to improve Information Retrieval in P2P.
- We implement PeerWare and evaluate the performance of various search techniques.
- The ISM achieves in some cases a 100% recall rate while using only 57% of the BFS messaging.
42. Future Work
- Probe different network topologies, such as ASMap with power laws.
- Deploy larger PeerWare networks with more queries.
- Probe different peer-profile maintenance policies.
- Use stemming/stop words to answer more accurately.
- Compare the performance of our method with newly proposed techniques (random gossiping, random walkers, etc.).
- 60% of Gnutella belongs to 20 ISPs. How can we exploit that to provide more efficient query routing schemes?
43. Information Retrieval in Peer-to-Peer Systems
Thank You!