
Slides: 44

Provided by: demetriosz

Learn more at: http://alumni.cs.ucr.edu

1
Information Retrieval in Peer-to-Peer Systems
Dept. of Computer Science & Engineering
University of California - Riverside

Demetrios Zeinalipour-Yazti

M.Sc. Thesis Defense: Monday, May 5, 2003, Surge 349, 12:00-1:00 PM
Thesis Committee: Dr. Dimitrios Gunopulos (Chairperson), Dr. Vana Kalogeraki, Dr. Chinya V. Ravishankar
http://www.cs.ucr.edu/csyiazti/msc.html
2
Presentation Outline

Introduction & Motivation.
Search Techniques for P2P systems
The Intelligent Search Mechanism
PeerWare Simulation Infrastructure
Experimental Evaluation.
Conclusions & Future Work.

3
Introduction to Peer-to-Peer

Peer-to-Peer Computing definition:
"Sharing of computer resources and information
through direct exchange"

Clients (downloaders) are also servers.
Clients may join or leave the network at any time
=> highly fault-tolerant, but at a cost!
Searches are done within the virtual network,
while actual downloads are done offline (with
HTTP).

4
Introduction to Peer-to-Peer

Peer-to-Peer (P2P) systems are becoming
increasingly popular.
P2P file-sharing systems, such as Gnutella,
Napster and Freenet, realized a distributed
infrastructure for sharing files.
Traditionally, files were shared using the
Client-Server model (e.g. HTTP). Such services are
not scalable, since they are centralized.
P2P systems offer new advantages in simplicity of use,
robustness, self-organization and scalability.

5
Information Retrieval in P2P

Problem
How can we efficiently retrieve information in P2P
systems where each node shares a collection of
documents?

Documents consist of keywords.
This resembles classical Information Retrieval, but the resources are
now distributed.
Primary data structures such as Global Inverted
Indexes can't be maintained efficiently.

6
Solutions for P2P Information Retrieval

1) Centralized Approaches
Centralized Indexes
e.g. Napster, SETI@home
2) Purely Distributed Approaches
Each node has only local knowledge.
IR is done using brute-force mechanisms.
e.g. Gnutella, FastTrack (Kazaa)
3) Hybrid Approaches
One or more peers have partial indexes of the
contents of others.
e.g. Limewire's Ultrapeers

[Figure: three architecture diagrams, in order. Centralized Index: 1) Upload Index, 2) Query/QueryHit, 3) Download (offline). Second diagram: 1) Connect, 2) Query/QueryHit, 3) Download (offline). Third diagram: 1) Connect, 2) Intelligent Query/QueryHit, 3) Download (offline).]
7
Motivation

On 1st June we crawled the Gnutella P2P Network
for 5 hours with 17 workstations.
We analyzed 15,153,524 query messages.
Observation: High locality of specific queries.
We try to exploit this property for more
efficient searches.

9
Search Techniques for P2P systems

1. Breadth-First Search (Gnutella)
Idea: Each Query Message is propagated along all
outgoing links of a peer, using a TTL
(time-to-live).
The TTL is decremented on each forward until it
becomes 0.
This is the technique used for IR in P2P systems such as
Gnutella.
Highlights
The physical network is brought to its knees.
Long delays for search results.
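
The flooding idea above can be sketched in a few lines. This is a minimal simulation under stated assumptions, not the Gnutella implementation: the overlay is a plain adjacency dict, `bfs_flood` is a hypothetical name, and a peer sends on every link (including back to the sender) while duplicates are dropped on arrival.

```python
from collections import deque

def bfs_flood(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent)."""
    seen = {source}                      # peers that already handled the query
    frontier = deque([(source, ttl)])
    messages = 0
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue                     # TTL exhausted: stop forwarding here
        for neighbor in graph[peer]:
            messages += 1                # one message per outgoing link
            if neighbor not in seen:     # duplicate copies are dropped
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return seen, messages

# Hypothetical 4-peer ring overlay.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
reached, sent = bfs_flood(ring, source=0, ttl=2)
```

On this ring the query reaches all 4 peers but costs 6 messages, which illustrates why flooding is expensive even on tiny overlays.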

[Figure: peer q sends a QUERY (1) through peer A into P2P Network N; peer d returns a QUERYHIT (2).]
10
Search Techniques for P2P systems

2. Modified Random BFS
(V. Kalogeraki, D. Gunopulos, D.
Zeinalipour-Yazti, CIKM 2002)
Idea: Each Query Message is forwarded to only a
fraction of the outgoing links (e.g. ½ of them).
The TTL is again decremented on each forward until it
becomes 0.
Highlights
Fewer messages, but possibly fewer results.
This algorithm is probabilistic.
Some segments may become
unreachable.
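
A rough sketch of the random variant follows, assuming the same adjacency-dict overlay as a stand-in for the real network. The function name `random_bfs`, the `fraction` parameter and the "at least one link" rule are illustrative assumptions, not details taken from the CIKM 2002 paper.

```python
import random
from collections import deque

def random_bfs(graph, source, ttl, fraction=0.5, rng=None):
    """Forward each query to a random fraction of the outgoing links."""
    rng = rng or random.Random()
    seen = {source}
    frontier = deque([(source, ttl)])
    messages = 0
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue
        neighbors = graph[peer]
        # Assumption: always use at least one link so the query keeps moving.
        k = min(len(neighbors), max(1, int(len(neighbors) * fraction)))
        for neighbor in rng.sample(neighbors, k):
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return seen, messages

# Hypothetical fully connected overlay of 6 peers (5 neighbors each).
clique = {p: [q for q in range(6) if q != p] for p in range(6)}
reached, sent = random_bfs(clique, source=0, ttl=1, rng=random.Random(7))
```

With `fraction=0.5` the source uses only 2 of its 5 links, so some peers are simply never asked; that is the "unreachable segment" effect the slide mentions.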

[Figure: Random BFS in P2P Network N with peers A, B, C and d: a QUERY (1), a QUERYHIT (2), and a segment marked unreachable.]
11
Search Techniques for P2P systems

3. Searching Using Random Walkers
(Q. Lv, P. Cao, E. Cohen, K. Li, and S.
Shenker, ICS 2002)
Idea: Each Query Message is forwarded to 1
neighbor.
With k walkers, after T steps we reach roughly the same number of
nodes as 1 walker after kT steps. (They use 16-64
walkers.)
Highlights
Network traffic is reduced by 2 orders of
magnitude compared to BFS.
Increases the user-perceived delay (from 2-6 hops
to 4-15 hops).
This algorithm is probabilistic, and the
likelihood of locating the objects depends on the
network topology.
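
A minimal sketch of the walker idea, under assumptions: walkers are simulated one after another rather than in parallel, `has_object` is a hypothetical predicate standing in for a local lookup, and the defaults for `walkers` and `max_steps` are illustrative only.

```python
import random

def random_walk_search(graph, source, has_object, walkers=16, max_steps=8, rng=None):
    """Send `walkers` random walkers from `source`, each for up to `max_steps` hops.

    Returns (peer holding the object or None, messages sent).
    """
    rng = rng or random.Random()
    messages = 0
    for _ in range(walkers):
        peer = source
        for _ in range(max_steps):
            peer = rng.choice(graph[peer])   # forward to exactly one neighbor
            messages += 1
            if has_object(peer):
                return peer, messages
    return None, messages

# Hypothetical two-peer overlay: the only possible step from 0 leads to 1.
pair = {0: [1], 1: [0]}
holder, sent = random_walk_search(pair, 0, lambda p: p == 1)
```

The message cost is bounded by `walkers * max_steps`, independent of node degree, which is where the traffic savings over flooding come from.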

12
Search Techniques for P2P systems

4. Using Randomized Gossiping to Replicate Global
State (F. M. Cuenca-Acuna, T. D. Nguyen, HPDC-12)
Idea: PlanetP uses Bloom Filters to propagate
summary indexes of the contents of a Peer.
Bloom Filters are used for Membership Queries.
Highlights
Not scalable (the technique works well
for < 10,000 nodes).
No Data Replication required.
False positives are a function of m, n, k (filter size in bits, number of stored items, number of hash functions)
and can be kept small.

[Figure: An 8-bit Bloom filter with 4 hash functions. Each document d1, d2, ..., dn of the set D is hashed by h1(d1) ... h4(d1) onto the bit positions 000-111; a membership query "d1?" checks the same positions.]
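
To make the membership-query idea concrete, here is a small Bloom filter sketch. It is not PlanetP's code: the class name, the default sizes, and the use of salted SHA-256 digests as the k hash functions are all assumptions chosen for illustration.

```python
import hashlib

class BloomFilter:
    """Bit array of m bits with k hash functions (salted SHA-256, an assumption)."""

    def __init__(self, m_bits=64, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [False] * m_bits

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

summary = BloomFilter()
summary.add("italy")
maybe_has_italy = "italy" in summary
```

A peer would gossip `summary.bits` instead of its full index. An added keyword is always reported as present; an absent keyword is occasionally reported as present too, at a rate that grows with the number of stored items n and shrinks with the filter size m.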
13
Search Techniques for P2P systems

5. Searching using Local Indices (Arturo Crespo
and Hector Garcia-Molina, ICDCS 2002)
Idea: Create indices which contain statistics
that reveal the direction towards the
documents.
Types of Proposed Indices
Compound Routing Index (CRI): metric = number of
documents.
Hop-Count Routing Index (HRI): maintains a CRI for
k hops.
Exponentially Aggregated Index (ERI): applies a
cost formula to the HRI to shrink the HRI's size.
Highlights
Not scalable; expensive routing updates, but
better than replicating data indexes.
Assumes a static environment, but no Data
Replication is required.

14
Search Techniques for P2P systems

6. Directed BFS and the >RES Heuristic 1/2
(Beverly Yang and Hector Garcia-Molina, ICDCS
2002)
Proposed Techniques
Directed BFS: based on aggregate statistics (e.g.
number of results a peer returned, shortest queue,
forwarded the most data).
Iterative Deepening: until Z results are returned.
Local Indexes: each node maintains the actual
index over the data of peers r hops away.
Their experiments deploy the Directed BFS
techniques by attaching nodes to the Gnutella
Network.
The >RES Heuristic is shown to work well.

15
Search Techniques for P2P systems

6. Directed BFS and the >RES Heuristic 2/2
The >RES Heuristic is optimized to find Z
documents efficiently, for some user-defined Z.
>RES works well because
It captures stable/large network segments.
Potentially less overloaded peers.
>RES is a quantitative approach.
Drawback: >RES doesn't route queries to the most
relevant content.
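
One way to picture the >RES heuristic is as a per-neighbor counter of returned results. The sketch below is an assumption-laden illustration: the class name, the sliding `window` of recent observations, and the tie-breaking by input order are hypothetical choices, not taken from the ICDCS 2002 paper.

```python
from collections import defaultdict, deque

class ResultsHeuristic:
    """Rank neighbors by how many results they returned recently."""

    def __init__(self, window=10):
        # Assumption: only the last `window` observations per neighbor count.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, neighbor, num_results):
        self.history[neighbor].append(num_results)

    def best_neighbors(self, neighbors, k):
        # Purely quantitative: the query text plays no role in the ranking.
        return sorted(neighbors, key=lambda n: sum(self.history[n]), reverse=True)[:k]

res = ResultsHeuristic()
res.record("A", 5)
res.record("B", 1)
res.record("A", 2)
targets = res.best_neighbors(["B", "C", "A"], k=2)
```

Here `targets` is `["A", "B"]`: A returned 7 results in total, B returned 1, and C none. The ranking never looks at what was asked, which is the drawback noted on the slide.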

16
Search Techniques for P2P systems

7. Depth-First-Search and Freenet
(I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong,
LNCS 2009)
Idea: Objects are hashed, and the hash of a
query is routed based on key closeness in a DFS
manner.
Highlights
Uses caching of key/object for future requests.
Data Replication along the QueryHit path provides
Availability.
Anonymity of Searcher and Publisher.
Drawbacks: i) Searches ONLY based on Object
Identifier.
ii) The user-perceived delay is high.

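[Figure: Freenet search for file A. Peer q issues a QUERY for h(A) (1); the result is returned along the query path (2), leaving a replicated copy of fileA in addition to the original fileA.]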
17
Search Techniques for P2P systems

8. Consistent Hashing and Chord
(Ion Stoica et al., SIGCOMM 2001)
Idea: Objects and Nodes are hashed to m-bit
identifiers and organized in a virtual ring.
Object lookup is achieved in O(log N).
Highlights
Consistent Hashing achieves (i) good load
balancing of keys, (ii) little object/key movement
when nodes join or leave.
Drawbacks: i) Searches ONLY based on Object
Identifier.
ii) Data Movement may be a big overhead.
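
The ring placement can be sketched as follows. This covers only the consistent-hashing part (which node owns a key), not Chord's finger tables or its O(log N) routing; `M_BITS`, `ring_id` and `Ring` are hypothetical names, and a 32-bit identifier space is an assumption made to keep the numbers small.

```python
import hashlib
from bisect import bisect_left

M_BITS = 32  # assumed identifier length for this sketch

def ring_id(name):
    """Hash a node name or object key onto the m-bit identifier ring."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M_BITS)

class Ring:
    def __init__(self, nodes):
        self.ids = sorted((ring_id(n), n) for n in nodes)

    def successor(self, key):
        """Return the first node clockwise from the key's identifier."""
        idx = bisect_left(self.ids, (ring_id(key), ""))
        return self.ids[idx % len(self.ids)][1]   # wrap around the ring

nodes = [f"peer{i}" for i in range(8)]
before = Ring(nodes)
after = Ring(nodes + ["peer8"])
keys = [f"doc{i}" for i in range(200)]
moved = [k for k in keys if before.successor(k) != after.successor(k)]
```

Every key in `moved` now belongs to the newly joined `peer8`; no key moves between the old peers. That is the "little object/key movement" property listed above.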

18
Presentation Outline

Introduction Motivation.
Search Techniques for P2P systems
The Intelligent Search Mechanism
PeerWare Simulation Infrastructure
Experimental Evaluation.
Conclusions Future Work.

19
Intelligent Search Mechanism ISM

Introduction
Idea: Each Query Message is forwarded
intelligently, based on what queries a peer
answered in the past.
Components of ISM (for each node u)
Profile Mechanism: a profile for each neighbor in N(u).
Peer Ranking Mechanism: ranks peers locally
and sends a search query only to the ones that
are most likely to answer.
Similarity Function: finds similar search
queries.
Search Mechanism: propagates queries based
on local indexes.

[Figure: peer A consults its profiles to decide where to forward a QUERY (1); peer d returns a QUERYHIT (2).]
20
Intelligent Search Mechanism ISM

Components of ISM
a) Profile mechanism.
Maintains a list of past queries routed through
that host.
Every time a QueryHit is received, the table is
updated.
The profile manager uses a Least Recently Used
policy to keep the most recent queries in the repository.
Profiles are kept for neighbors only, so the cost
of maintaining them is O(Td), where T is the
size limit per profile and d is the degree of the
node.

[Figure: profile table of size Td]
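
A per-neighbor profile with LRU eviction might look like the sketch below. The class and method names are hypothetical, and storing the number of results next to each query is an assumption about what the table holds.

```python
from collections import OrderedDict

class NeighborProfile:
    """Keeps the T most recently seen (query -> result count) pairs for one neighbor."""

    def __init__(self, capacity):
        self.capacity = capacity            # T, the size limit per profile
        self.entries = OrderedDict()        # oldest entry first

    def record_query_hit(self, query, num_results):
        self.entries[query] = num_results
        self.entries.move_to_end(query)     # mark as most recently used
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used

profile = NeighborProfile(capacity=2)
for query in ["a", "b", "a", "c"]:
    profile.record_query_hit(query, num_results=1)
remaining = list(profile.entries)
```

After the four QueryHits, `remaining` is `["a", "c"]`: query "b" was the least recently used and was evicted. A node of degree d keeps one such profile per neighbor, which gives the O(Td) bound.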

21
Intelligent Search Mechanism ISM

Components of ISM
b) The RelevanceRank Peer Ranking Metric.
Before forwarding a Query Message, a peer performs
an on-the-fly ranking of its peers to determine
the best paths.
We use the Aggregate Weighted Similarity of peer
Pi to a query q, computed by a peer Pl.
[Equation: not preserved in this transcript]
22
Intelligent Search Mechanism ISM

Components of ISM
c) Similarity Function: the cosine similarity.
Assume that L is the set of all words (in the Profile
Manager),
e.g. L = {elections, bush, clinton, super, bowl,
san, diego, ..., italy, earthquake, disaster}
We define an |L|-dimensional space where each
query is a vector.
If q = "italy disaster" => q (vector of q) =
[0, 0, 0, ..., 1, 0, 1]
Recall that we have a vector for each qi stored
in the Profile Manager (i.e. qi).
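
The similarity and ranking steps can be sketched together. The cosine over binary keyword vectors is standard; the aggregate in `relevance_rank` (similarity raised to a power `alpha`, weighted by the number of results) is only one plausible reading of "Aggregate Weighted Similarity" and is an assumption, since the slide's formula is not preserved here.

```python
import math

def cosine_similarity(q1, q2):
    """Cosine of two queries viewed as binary keyword vectors."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a or not b:
        return 0.0
    # For 0/1 vectors the dot product is the number of shared keywords.
    return len(a & b) / math.sqrt(len(a) * len(b))

def relevance_rank(profile, query, alpha=1.0):
    """Assumed aggregate: sum of similarity**alpha * results over past queries."""
    score = 0.0
    for past_query, num_results in profile.items():
        sim = cosine_similarity(past_query, query)
        if sim > 0:
            score += (sim ** alpha) * num_results
    return score

# Hypothetical profile of one neighbor: past query -> number of results returned.
profile = {"italy earthquake": 4, "super bowl": 9}
score = relevance_rank(profile, "italy disaster")
```

For the query "italy disaster" the neighbor scores 2.0: the past query "italy earthquake" shares one of two keywords (similarity 0.5) and returned 4 results, while "super bowl" shares none and contributes nothing despite its 9 results.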

23
Intelligent Search Mechanism ISM

Components of ISM
d) Search Mechanism
Utilizes the Peer Ranking Mechanism to forward
Queries to the nodes that are most likely to contain
the information we are looking for.

[Figure: a peer uses its profiles to choose which neighbors receive the QUERY (1) on the way to peer d.]
24
Intelligent Search Mechanism ISM

Breaking cycles with Random Perturbation
Suppose that nodes answer only conjunctions of
the query terms.
Suppose that query q has no answer from A, B, C
or D,
and that one of them answered a similar q in
the past.
=> Query q fails to explore the segment through E.
Random Perturbation adds one additional random
message.
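
The forwarding decision with random perturbation can be sketched as "top-k by rank, plus one random extra". The function name `choose_targets` and the `scores` dict (neighbor name mapped to its rank for the current query) are hypothetical, introduced only for illustration.

```python
import random

def choose_targets(scores, k, rng=None):
    """Pick the k best-ranked neighbors plus one random neighbor from the rest."""
    rng = rng or random.Random()
    ranked = sorted(scores, key=scores.get, reverse=True)
    targets = ranked[:k]
    rest = ranked[k:]
    if rest:
        # Random perturbation: one extra message to an otherwise ignored neighbor,
        # so that segments behind low-ranked peers still get explored sometimes.
        targets.append(rng.choice(rest))
    return targets

# Hypothetical ranks: E never answered anything similar, so its score is 0.
scores = {"A": 3.0, "B": 0.5, "C": 0.0, "E": 0.0}
targets = choose_targets(scores, k=2, rng=random.Random(0))
```

The first two targets are always A and B; the third is C or E at random, which gives the segment through E a chance of being explored on each query.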

26
PeerWare Simulation Infrastructure

Introduction
PeerWare is our distributed middleware
infrastructure that allows us to benchmark
various Query Routing Algorithms.
It is deployed on a network of 50 workstations.
It uses Public/Private Keys and SSH to connect to
the networked hosts.
It is implemented in Java and consists of
approximately 10,000 lines of code.

27
PeerWare Simulation Infrastructure

Why real middleware and not simulations?
Many properties, such as network failures and dropped
queries, may reveal interesting and unknown
patterns.
In a real middleware we are able to measure the
actual time to satisfy queries.
Finally, there are no assumptions (network delays
etc.), which are typical in simulation
environments.
The Anthill Project (Univ. of Bologna) uses a
similar approach to investigate properties of the
Freenet algorithm.

28
PeerWare Simulation Infrastructure

PeerWare Components
dataGen: The Dataset Generator
graphGen: The Network Graph Generator
dataPeer: The Data Node
searchPeer: The Search Node
Other Administrative Components
netLaucher: Shell script that launches the Network
netStats: Shell script that provides statistics
graphPlot: Shell script that plots Graphs based
on generated results.

29
PeerWare Simulation Infrastructure

1) dataGen Component
dataGen is the Dataset Generator, which generates
documents about specific topics
(each peer can have some specialized knowledge).
It uses the REUTERS News Agency dataset (22,531
documents).
It groups documents by various properties:
Date, Topics, Places, People, Orgs, Companies.
In our experiments we use the Places attribute
and obtain 104 countries.

30
PeerWare Simulation Infrastructure

2) graphGen Component
graphGen is the topology generator.
Currently it generates Random Topologies given
parameters such as degree, IPs, ports.
It uses graphViz to generate visualizations of the
generated topologies.

31
PeerWare Simulation Infrastructure

3) dataPeer Component
dataPeer is a P2P client that maintains an XML
repository of documents.
It uses the PDOM-XQL engine to query its
documents.
It pre-establishes persistent TCP connections
to other peers.

32
PeerWare Simulation Infrastructure

4) searchPeer Component
searchPeer is a P2P client that connects to a
PeerWare Network and performs unstructured
queries.
Keywords are sampled from within the dataset.
It logs statistics such as query response time,
which nodes answered a query, etc.

33
Presentation Outline

Introduction Motivation.
Search Techniques for P2P systems
The Intelligent Search Mechanism
PeerWare Simulation Infrastructure
Experimental Evaluation.
Conclusions Future Work.

34
Experimental Evaluation

Introduction
We create a distributed Newspaper application.
We use a Random Network of 104 peers.
Each peer has documents for 1 country.
The average degree of a node is 7 ≈ log2(100)
(connected graph).
We perform two series of experiments:
10x10 sequential queries with a delay of 4 sec.
400 random queries with a delay of 4 sec.
We compare Doc. Ratio (Recall Rate) vs. Num. of
messages for:
BFS (Gnutella Message Flooding) (forward to
degree nodes).
Random BFS (randomly forward to degree/2 nodes).
Intelligent Search Mechanism (forward to the
M = (degree/2) - 1 highest RelevanceRank nodes + 1
random).
>RES Heuristic (forward to the degree/2 nodes that
returned the most results, >RES).

35
Experimental Evaluation

Reducing Query Messages (10x10 Experiment)
Recall Rate vs. Num. of messages with TTL = 4
BFS uses 1050 messages w/ recall rate 100%
RBFS uses 220 (20%) msgs w/ recall rate 50%
>RES uses 400 (38%) msgs w/ recall rate 70%
ISM uses 400 (38%) msgs w/ recall rate 90%
ISM improves over time, since the Peer Profiles gain
more knowledge.
ISM and >RES start out slowly, since they use RBFS
until they populate their routing structures.

36
Experimental Evaluation

Digging Deeper by Increasing the TTL (10x10)
Recall Rate vs. Num. of messages with TTL = 5
BFS again uses 1050 messages w/ recall rate 100%
RBFS uses 450 (43%) msgs w/ recall rate 82%
>RES uses 570 (54%) msgs w/ recall rate 90%
ISM uses 570 (54%) msgs w/ recall rate 99%

37
Experimental Evaluation

Reducing Query Response Time (QRT) (10x10
Experiment)
BFS's QRT is on the order of 6 seconds.
RBFS, ISM and >RES use
30-60% of BFS's QRT for TTL = 4
60-80% of BFS's QRT for TTL = 5
BFS's unnecessary messages increase the
user-perceived delay.
[Figure: Query Response Time as a percentage of BFS]

38
Experimental Evaluation

The Discarded Message Problem (DMP)
A query q is identified by a GUID.
To avoid cycles, a node never forwards a query it
has already forwarded.
DMP occurs if a node has forwarded q with TTL1
and then receives q again with TTL2, where
TTL2 > TTL1.
In our experiments approximately 30% of queries
were affected by the DMP.
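
The problem is easy to reproduce with a toy duplicate filter. The `Peer` class and `on_query` method below are hypothetical; the only behavior taken from the slide is that a query is identified by its GUID and is never forwarded twice.

```python
class Peer:
    """Drops any query whose GUID it has already seen, regardless of its TTL."""

    def __init__(self):
        self.seen = set()

    def on_query(self, guid, ttl):
        """Return how many more hops the query may travel from here (0 = none)."""
        if guid in self.seen:
            return 0                     # duplicate GUID: discarded
        self.seen.add(guid)
        return max(ttl - 1, 0)

peer = Peer()
first = peer.on_query("q-guid", ttl=2)   # arrives over a long path, low TTL
second = peer.on_query("q-guid", ttl=4)  # same query over a shorter path

fresh = Peer()
ideal = fresh.on_query("q-guid", ttl=4)  # what the short path alone would give
```

The first copy is forwarded for 1 more hop, and the second copy is discarded even though it could have travelled 3 more hops. Peers reachable only through that longer budget never see the query, which lowers recall.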

39
Experimental Evaluation

Improving Recall Rate over Time (400 Experiment)
The 10x10 Queries Experiment suited ISM well.
In this experiment we perform 400 random queries.
BFS's overwhelming message load creates two major
outbreaks.
ISM improves over time, achieving a
96% Recall Rate, again using 38% of the messages.

41
Conclusions

Efficient Information Retrieval in P2P networks
is not feasible with the current Search
Algorithms.
We propose an Intelligent Search Mechanism that
uses local knowledge to improve Information
Retrieval in P2P.
We implement PeerWare and evaluate the
performance of various Search Techniques.
The ISM achieves in some cases a 100% recall rate
while using only 57% of the BFS messages.

42
Future Work

Probe different Network Topologies, such as ASMap
with PowerLaws.
Deploy larger PeerWares with more queries.
Probe different Peer-Profile maintenance
policies.
Use Stemming/Stop Words to answer more
accurately.
Compare the performance of our method with newly
proposed techniques (random gossiping, random
walkers, etc.).
60% of Gnutella belongs to 20 ISPs. How can we
exploit that to provide more efficient query
routing schemes?

43
Thank You!
