Improve search in unstructured P2P overlay


Slides: 74

Provided by: OPA72


1

Improve search in unstructured P2P overlay

2
Peer-to-peer Networks

Peers are connected by an overlay network.
Users cooperate to share files (e.g., music,
videos, etc.)

3
(Search in) Basic P2P Architectures

Centralized: central directory server (Napster).
Decentralized: search is performed by probing
peers.
Structured (DHTs) (CAN, Chord, ...): location is
coupled with topology; search is routed by the
query.
Only exact-match queries, tightly controlled
overlay.
Unstructured (Gnutella): search is blind;
probed peers are unrelated to the query.

4
Topics

Search strategies
Beverly Yang and Hector Garcia-Molina, Improving
Search in Peer-to-Peer Networks, ICDCS 2002
Arturo Crespo, Hector Garcia-Molina, Routing
Indices For Peer-to-Peer Systems, ICDCS 2002
Shortcuts
Kunwadee Sripanidkulchai, Bruce Maggs and Hui
Zhang, Efficient Content Location Using
Interest-based Locality in Peer-to-Peer Systems,
INFOCOM 2003.
Replication
Edith Cohen and Scott Shenker, Replication
Strategies in Unstructured Peer-to-Peer
Networks, SIGCOMM 2002.

5
Improving Search in Peer-to-Peer Networks

ICDCS 2002
Beverly Yang
Hector Garcia-Molina

6
Motivation

The purpose of a data-sharing P2P system is to
accept queries from users, and locate and return
data (or pointers to the data).
Metrics
Cost
Average aggregate bandwidth
Average aggregate processing cost
Quality of results
Number of results
Satisfaction: a query is satisfied if Z (a value
specified by the user) or more results are returned.
Time to satisfaction

7
Current Techniques

Gnutella
BFS with depth limit D.
Wastes bandwidth and processing resources.
Freenet
DFS with depth limit D.
Poor response time.

8
Broadcast policies

Iterative deepening
Directed BFS
Local Indices

9
Iterative Deepening

In systems where satisfaction is the metric of
choice, iterative deepening is a good technique.
Under policy P = {a, b, c} with waiting time W:
A source node S first initiates a BFS of depth a.
The query is processed and then becomes frozen at
all nodes that are a hops from the source.
S waits for a time period W.

10
Iterative Deepening

If the query is not satisfied, S starts the next
iteration, initiating a BFS of depth b.
S sends a Resend message with a TTL of a.
A node that receives a Resend message simply
forwards it, unless the node is at depth a, in which case
it drops the Resend message and unfreezes the
corresponding query by forwarding the query
message with a TTL of b-a to all its neighbors.
A node need only freeze a query for slightly more
than W time units before deleting it.
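
The two Iterative Deepening slides can be sketched as a small simulation. This is only a minimal sketch under stated assumptions: the names bfs_within and iterative_deepening, the adjacency-dict graph, and the has_result callback are hypothetical, the waiting period W is not modelled, and each round simply re-runs the BFS from the source instead of unfreezing the query at the frontier with a Resend message.

```python
from collections import deque


def bfs_within(graph, source, max_depth):
    """Return {node: depth} for every node within max_depth hops of source."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # the query is frozen here and is not forwarded further
        for neighbor in graph[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth


def iterative_deepening(graph, source, has_result, policy, z):
    """Search at each depth in `policy` until at least z results are found.

    Returns (results, depth_used). Simplification: every round restarts
    from the source, whereas the real protocol resumes at the frozen frontier.
    """
    results = []
    for limit in policy:
        reached = bfs_within(graph, source, limit)
        results = sorted(n for n in reached if n != source and has_result(n))
        if len(results) >= z:
            return results, limit  # satisfied, no deeper iteration needed
    return results, policy[-1]


# Hypothetical overlay: S is the source, C and E hold matching files.
graph = {
    "S": ["A", "B"], "A": ["S", "C"], "B": ["S", "D"],
    "C": ["A", "E"], "D": ["B"], "E": ["C"],
}
holders = {"C", "E"}
print(iterative_deepening(graph, "S", holders.__contains__, [1, 2, 3], z=1))
```

With z = 1 the search stops at the second depth in the policy, so node E is never probed.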

11
Directed BFS

If minimizing response time is important to an
application, iterative deepening may not be
appropriate
A source sends query messages to just a subset of
its neighbors
A node maintains simple statistics on its
neighbors
Number of results received from each neighbor
Latency of connection

12
Directed BFS (cont)

Candidate nodes (heuristics for choosing neighbors)
The neighbor that returned the highest number of results
The neighbor that returns response messages that
have taken the lowest average number of hops
The neighbor with the highest message count
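
The neighbor-selection heuristics for Directed BFS can be illustrated with a small sketch. The NeighborStats fields, the pick_neighbors function, and the heuristic names are assumptions made for this example; the slides only say that a node keeps simple per-neighbor statistics.

```python
from dataclasses import dataclass


@dataclass
class NeighborStats:
    results_returned: int = 0    # results received through this neighbor
    hops_total: int = 0          # summed hop counts of its response messages
    responses: int = 0           # number of response messages seen
    messages_forwarded: int = 0  # message count for this neighbor

    @property
    def avg_hops(self):
        # Neighbors with no responses yet sort last under the "hops" heuristic.
        return self.hops_total / self.responses if self.responses else float("inf")


def pick_neighbors(stats, k, heuristic="results"):
    """Return the k neighbors that rank best under the chosen heuristic."""
    sort_keys = {
        "results": lambda n: -stats[n].results_returned,
        "hops": lambda n: stats[n].avg_hops,
        "messages": lambda n: -stats[n].messages_forwarded,
    }
    return sorted(stats, key=sort_keys[heuristic])[:k]


# Hypothetical statistics for three neighbors.
stats = {
    "A": NeighborStats(results_returned=10, hops_total=30, responses=10, messages_forwarded=5),
    "B": NeighborStats(results_returned=4, hops_total=4, responses=4, messages_forwarded=50),
    "C": NeighborStats(results_returned=7, hops_total=21, responses=7, messages_forwarded=20),
}
print(pick_neighbors(stats, 1, "results"), pick_neighbors(stats, 1, "hops"))
```

The query is then forwarded only to the selected neighbors, which is what keeps the cost below that of a full BFS.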

13
Local Indices

Each node n maintains an index over the data of
all nodes within r hops radius.
All nodes at depths not listed in the policy
simply forward the query.
Example policy P = {1, 5}

14
Experimental result
15
Routing Indices For Peer-to-Peer Systems
Arturo Crespo, Hector Garcia-Molina
Stanford University
{crespo,hector}@db.stanford.edu
16
Motivation

A key part of a P2P system is document discovery
The goal is to help users find documents with
content of interest across potential P2P sources
efficiently
The mechanisms for searching can be classified in
three categories
Mechanisms without an index
Mechanisms with specialized index nodes
(centralized search)
Mechanisms with indices at each node (distributed
search)

17
Motivation (cont.)

Gnutella uses a mechanism where nodes do not have
an index
Queries are propagated from node to node until
matching documents are found
Although this approach is simple and robust, it
has the disadvantage of the enormous cost of
flooding the network every time a query is
generated
Centralized-search systems, such as Napster, use specialized nodes
that maintain an index of the documents available
in the P2P system
The user queries an index node to identify nodes
having documents with the requested content
A centralized system is vulnerable to attack, and
it is difficult to keep the indices up-to-date

18
Motivation (cont.)

A distributed-index mechanism
Routing Indices (RIs)
Give a direction towards the document, rather
than its actual location
By storing routes rather than destinations, the index size is proportional
to the number of neighbors

19
Peer-to-peer Systems

A P2P system is formed by a large number of nodes
that can join or leave the system at any time
Each node has a local document database that can
be accessed through a local index
The local index receives content queries and
returns pointers to the documents with the
requested content

20
Query Processing in a Distributed Search P2P
System

In a distributed-search P2P system, users submit
queries to any node along with a stop condition
A node receiving a query first evaluates the
query against its own database and returns to the
user pointers to any results
If the stop condition has not been reached, the
node selects one or more of its neighbors and
forwards the query to them
Queries can be forwarded to the best neighbors in
parallel or sequentially
A parallel approach yields better response time,
but generates higher traffic and may waste
resources

21
Routing indices

The objective of a Routing Index (RI) is to allow
a node to select the best neighbors to send a
query to
An RI is a data structure that, given a query,
returns a list of neighbors, ranked according to
their goodness for the query
Each node has a local index for quickly finding
local documents when a query is received. Nodes
also have a compound RI (CRI) containing
the number of documents along each path
the number of documents on each topic

22
Routing indices (cont.)

Thus, the number of results in a path can be
estimated as
$NumberOfDocuments \times \prod_i \frac{CRI(s_i)}{NumberOfDocuments}$
where CRI(s_i) is the value for the cell at the column
for topic s_i and at the row for a neighbor
Goodness in the example: B = 6, C = 0, D = 75
Note that these numbers are just estimates and
they are subject to overcounts and/or undercounts
A limitation of using CRIs is that they do not
take into account the difference in cost due to
the number of hops necessary to reach a document
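
A compound-RI estimator can be sketched as follows, assuming the estimate is the number of documents along a path multiplied by the fraction of those documents on each query topic. The function names, the topic labels "DB" and "L", and the per-neighbor counts are hypothetical; the counts were chosen only so that the sketch reproduces the goodness values 6, 0 and 75 quoted on the slide.

```python
def cri_goodness(row, query_topics):
    """Estimate how many matching documents lie along one neighbor's path.

    row maps "docs" to the total number of documents on the path and each
    topic name to the number of documents on that topic.
    """
    total = row["docs"]
    if total == 0:
        return 0.0
    estimate = float(total)
    for topic in query_topics:
        estimate *= row.get(topic, 0) / total  # fraction of documents on this topic
    return estimate


def rank_neighbors(cri, query_topics):
    """Return neighbors ordered from best to worst estimated goodness."""
    return sorted(cri, key=lambda n: cri_goodness(cri[n], query_topics), reverse=True)


# Hypothetical CRI rows, one per neighbor.
cri = {
    "B": {"docs": 100, "DB": 20, "L": 30},
    "C": {"docs": 1000, "DB": 0, "L": 50},
    "D": {"docs": 200, "DB": 100, "L": 150},
}
for name in rank_neighbors(cri, ["DB", "L"]):
    print(name, round(cri_goodness(cri[name], ["DB", "L"]), 2))
```

Because the topics are treated as independent, the product can overcount or undercount, which is the caveat the slide raises.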

23
Using Routing Indices
24
Using Routing Indices (cont.)

The storage space required by an RI in a node is
modest as we are only storing index information
for each neighbor
t is the counter size in bytes, c is the number
of categories, N the number of nodes, and b the
branching factor
A centralized index would require t × (c + 1) × N
bytes
The total for the entire distributed system is
t × (c + 1) × b × N bytes
Although the RIs require more storage space overall than a
centralized index, the cost of the storage space
is shared among the network nodes

25
Creating Routing Indices
26
Maintaining Routing Indices

Maintaining RIs is identical to the process used
for creating them
For efficiency, we may delay exporting an update
for a short time so we can batch several updates,
thus, trading RI freshness for a reduced update
cost
We can also choose not to send minor updates, at the
cost of reduced RI accuracy

27
Hop-count Routing Indices
28
Hop-count Routing Indices (cont.)

The estimator of a hop-count RI needs a cost
model to compute the goodness of a neighbor
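We assume that document results are uniformly
distributed across the network and that the
network is a regular tree with fanout F
We define the goodness (goodness_hc) of Neighbor
i with respect to query Q for a hop-count RI as
$goodness_{hc}(Neighbor_i, Q) = \sum_{j=1}^{h} \frac{goodness(N_i[j], Q)}{F^{j-1}}$
where h is the horizon and $N_i[j]$ is the index entry for documents j hops away through Neighbor i
If we assume F = 3, the goodness of X for a query
about DB documents would be 13 + 10/3 = 16.33 and
for Y would be 0 + 31/3 = 10.33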
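
The hop-count cost model can be checked with a few lines of code. This sketch assumes the estimator divides the goodness of the documents j hops away by F^(j-1), which is the reading that matches the slide's numbers for X and Y; the function name and the list layout are hypothetical.

```python
def hop_count_goodness(per_hop_goodness, fanout):
    """Discount the goodness at each hop by the fanout of the assumed regular tree.

    per_hop_goodness[0] is the estimate for documents one hop away,
    per_hop_goodness[1] for documents two hops away, and so on.
    """
    return sum(g / fanout ** j for j, g in enumerate(per_hop_goodness))


# Values from the slide: X offers 13 DB documents at one hop and 10 at two hops,
# Y offers 0 at one hop and 31 at two hops; the fanout F is 3.
goodness_x = hop_count_goodness([13, 10], fanout=3)
goodness_y = hop_count_goodness([0, 31], fanout=3)
print(round(goodness_x, 2), round(goodness_y, 2))
```

X ranks above Y even though Y leads to more documents in total, because Y's documents are all one hop further away.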

29
Exponentially aggregated RI

Each entry of the ERI for node N contains a value
computed as
$\sum_{j=1}^{th} \frac{goodness(N[j], T)}{F^{j-1}}$
where th is the height and F the fanout of the assumed
regular tree, goodness() is the compound RI
estimator, N[j] is the summary of the local
indices of the neighbors of N that are j hops away, and T is the topic of
interest of the entry
Problems?!

30
Exponentially aggregated RI (cont.)
31
Cycles in the P2P Network

There are three general approaches for dealing
with cycles
No-op solution: no changes are made to the
algorithms
Cycle avoidance solution: nodes are not allowed
to create an update connection
to other nodes if such a connection would create a
cycle
This is hard to enforce in the absence of global information
Cycle detection and recovery: this solution
detects cycles some time after they are formed
and then takes recovery actions to
eliminate the effect of the cycles

32
Experimental Results

Modeling search mechanisms in a P2P system
We consider three kinds of network topologies:
a tree, because it does not have cycles
a tree with added cycles, where we start with a tree and add extra edges at
random
a power-law graph, which is considered a good model for
P2P systems and allows us to test our algorithms
against a realistic topology
We model the location of document results using
two distributions: uniform and an 80/20 biased
distribution
The 80/20 distribution assigns 80% of the document
results uniformly to 20% of the nodes
In this paper we focus on the network and we use
the number of messages generated by each
algorithm as a measure of cost

33
Experimental Results (cont.)
34
Experimental Results (cont.)

In particular, CRI uses all nodes in the network,
HRI uses nodes within a predefined horizon, and
ERI uses nodes until the exponentially decayed
value of an index entry reaches a minimum value
In the case of the No-RI approach, an 80/20
document distribution penalizes performance as
the search mechanism needs to visit a number of
nodes until it finds a content-loaded node

35
Experimental Results (cont.)

RIs perform better in a power-law network than in
a tree network (Query)
In a power-law network a few nodes have a
significantly higher connectivity than the rest
Power-law distributions generate network
topologies where the average path length between
two nodes is lower than in tree topologies

36
Experimental Results (cont.)

The tradeoff between query and update costs for
RIs
The cost of CRI is much higher when compared with
HRI and ERI
ERI only propagates the update to a subset of the
network

37
Conclusions

Greater efficiency is achieved by placing Routing
Indices in each node. Three possible RIs:
compound RIs, hop-count RIs, and exponential RIs
In the experiments, ERIs and HRIs offer significant
improvements over not using an RI, while
keeping update costs low

38
Efficient Content Location Using Interest-based
Locality in Peer-to-Peer Systems
39
Background

Each peer is connected randomly, and searching is
done by flooding.
Allows keyword search

Example of searching for an MP3 file in a Gnutella
network. The query is flooded across the network.
40
Background

DHT (Chord)
Given a key, Chord maps the key to a node.
Each node needs to maintain O(log N) information.
Each query uses O(log N) messages.
Key search means searching by exact name

41
Interest-based Locality

Peers with similar interests share similar
content

42
Architecture

Shortcuts are modular.
Shortcuts are performance enhancement hints.

43
Creation of shortcuts

The peer uses the underlying topology (e.g.
Gnutella) for the first few searches.
One of the returned peers is selected at random
and added to the shortcut list.
Shortcuts are ordered by a metric, e.g.
success rate or path latency.
Subsequent queries go through the shortcut list
first.
If that fails, the lookup goes through the underlying topology.
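
The shortcut-creation steps on this slide can be sketched as a small class. Everything here is an illustrative assumption rather than the authors' implementation: the ShortcutList and locate names, the capacity of 10, ranking by success rate only, and the flood callback that stands in for a search through the underlying overlay.

```python
import random


class ShortcutList:
    """Shortcuts ranked by success rate (one of the metrics the slide mentions)."""

    def __init__(self, capacity=10):  # the capacity is an assumed parameter
        self.capacity = capacity
        self.stats = {}  # peer -> [successes, attempts]

    def success_rate(self, peer):
        ok, tried = self.stats[peer]
        return ok / tried if tried else 0.0

    def ranked(self):
        return sorted(self.stats, key=self.success_rate, reverse=True)

    def add(self, peer):
        if peer in self.stats:
            return
        if len(self.stats) >= self.capacity:
            del self.stats[self.ranked()[-1]]  # evict the lowest-ranked shortcut
        self.stats[peer] = [0, 0]


def locate(item, shortcuts, peer_content, flood, rng):
    """Try shortcuts first, then fall back to the underlying topology."""
    for peer in shortcuts.ranked():
        shortcuts.stats[peer][1] += 1
        if item in peer_content.get(peer, ()):
            shortcuts.stats[peer][0] += 1
            return peer, "shortcut"
    holders = flood(item)  # e.g. Gnutella flooding
    if not holders:
        return None, "miss"
    chosen = rng.choice(sorted(holders))  # one returned peer, picked at random
    shortcuts.add(chosen)
    return chosen, "flood"


# Hypothetical content held by two peers.
peer_content = {"p1": {"a.mp3", "b.mp3"}, "p2": {"c.mp3"}}


def flood(item):
    return [p for p, files in peer_content.items() if item in files]


shortcuts = ShortcutList()
rng = random.Random(0)
print(locate("a.mp3", shortcuts, peer_content, flood, rng))
print(locate("b.mp3", shortcuts, peer_content, flood, rng))
```

The first lookup floods and learns p1 as a shortcut; the second lookup is answered by that shortcut without flooding.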

44
Performance Evaluation

Performance metric
success rate
load characteristics (query packets that each peer
processes in the system)
query scope (the fraction of peers involved in each query)
minimum reply path length
additional state kept in each node

45
Methodology query workload

Create traffic trace from the real application
traffic
Boeing firewall proxies
Microsoft firewall proxies
Passively collect the web traffic between CMU and
the Internet
Passively collect typical P2P traffic (KaZaA,
Gnutella)
Use exact matching rather than keyword matching
in the simulation.
"song.mp3" and "my artist song.mp3" will be
treated as different.

46
Methodology Underlying peers topology

Based on the Gnutella connectivity graph in 2001,
with 95% of nodes about 7 hops away.
Searching TTL is set to 7.
For each kind of traffic (Boeing, Microsoft,
etc.), run 8 simulations, each 1 hour long.

47
Simulation Results success rate
48
Simulation Results load and path length
-- Query load for Boeing and Microsoft Traffic
-- Average path length of the traces
49
Increase Number of Shortcuts
Enhancement of Interest-based Locality
50
Using Shortcuts' Shortcuts
Enhancement of Interest-based Locality

Idea

Add the shortcuts' shortcuts
Performance gain of 7% on average
51
Interest-based Structures

When viewed as an undirected graph
In the first 10 minutes, there are many connected
components, each with only a few peers.
At the end of the simulation, there are few connected
components, each with several hundred
peers. Each component is well connected.
The clustering coefficient is about 0.6 to 0.7,
which is higher than that of the Web graph.

52
Sensitivity of Shortcuts

Run Interest based shortcuts over DHT (Chord)
instead of Gnutella.

Query load is reduced by a factor of 2 to 4. Query
scope is reduced from 7/N to 1.5/N
53
Conclusion

Interest-based shortcuts are modular
performance-enhancement hints layered over an existing P2P
topology.
Shortcuts are shown to enhance search
efficiency.
Shortcuts form clusters within a P2P topology,
and the clusters are well connected.

54
Replication Strategies in Unstructured
Peer-to-Peer Networks

Edith Cohen
AT&T Labs-Research

Scott Shenker
ICIR
55
(replication in) P2P architectures

No proactive replication (Gnutella)
Hosts store and serve only what they requested
A copy can be found only by probing a host with a
copy

56
Question: how can replication be used to improve
search efficiency in unstructured networks with a
proactive replication mechanism?
57
Search and replication model

Unstructured networks with replication of keys or
copies. Peers probed (in the search and
replication process) are unrelated to the query/item.

Search: probe hosts, uniformly at random, until
the query is satisfied (or the maximum search size is
exceeded).

Replication: each host can store up to r copies
of items.

Goal: minimize the average search size (number of
probes until the query is satisfied).
58
Search size

What is the search size of a query ?
Soluble queries: number of probes until an answer is
found.
We look at the Expected Search Size (ESS) of
each item. The ESS is inversely proportional to
the fraction of peers with a copy of the item.

59
Search Example

2 probes

4 probes
60
Expected Search Size (ESS)

m items with relative query rates
$q_1 > q_2 > q_3 > \dots > q_m$, $\sum_i q_i = 1$

n nodes, capacity r each, $R = n r$
$r_i$ = number of copies of the i-th item
Allocation: $p_1 (= r_1/R), p_2, p_3, \dots, p_m$, $\sum_i p_i = 1$
The i-th item is allocated a $p_i$ fraction of
storage.

Search size for the i-th item is a geometric r.v. with
mean $A_i = 1/(r p_i)$.
ESS is $\sum_i q_i A_i = (\sum_i q_i / p_i)/r$
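
The ESS definition translates directly into code. This is a minimal sketch: the function names are hypothetical, q holds the query rates, p the allocation fractions, r the per-node capacity, and the numbers in the example are made up for illustration.

```python
def mean_search_size(p_i, r):
    """Mean of the geometric search size for an item holding a p_i share of storage."""
    return 1.0 / (r * p_i)


def expected_search_size(q, p, r):
    """ESS = sum_i q_i * A_i with A_i = 1 / (r * p_i)."""
    if len(q) != len(p):
        raise ValueError("q and p must have one entry per item")
    return sum(q_i * mean_search_size(p_i, r) for q_i, p_i in zip(q, p))


# Hypothetical example: three items, capacity r = 2 copies per node.
q = [0.5, 0.3, 0.2]    # relative query rates, summing to 1
p = [0.5, 0.25, 0.25]  # allocation fractions, summing to 1
print(expected_search_size(q, p, r=2))
```

Here the most popular item is found in one probe on average, while the other two need two probes each, which gives an ESS of 1.5.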

61
Uniform and Proportional Replication

Two natural strategies
Uniform allocation: $p_i = 1/m$
Simple, resources are divided equally
Proportional allocation: $p_i = q_i$
Fair, resources per item proportional to demand
Reflects current P2P practices

62
Basic Questions

How do Uniform and Proportional allocations
perform/compare ?
Which strategy minimizes the Expected Search Size
(ESS) ?
Is there a simple protocol that achieves optimal
replication in decentralized unstructured
networks ?

63
ESS under Uniform and Proportional Allocations
(soluble queries)

Lemma: The ESS under either Uniform or
Proportional allocation is m/r
Independent of query rates (!!!)
Same ESS for Proportional and Uniform (!!!)

Proof

Proportional: ESS is $(\sum_i q_i / p_i)/r = (\sum_i q_i / q_i)/r = m/r$
Uniform: $p_i = (R/m)/R = 1/m$, so ESS is $(\sum_i q_i / p_i)/r = (\sum_i m q_i)/r = (m/r) \sum_i q_i = m/r$
64
Space of Possible Allocations

Definition: Allocation $p_1, p_2, p_3, \dots, p_m$ is
in-between Uniform and Proportional if
for $1 \le i < m$, $q_{i+1}/q_i < p_{i+1}/p_i < 1$
Theorem 1: All (strictly) in-between strategies
are (strictly) better than Uniform and
Proportional

Theorem 2: p is worse than Uniform/Proportional if
for all i, $p_{i+1}/p_i > 1$ (more popular gets
less) OR for all i, $q_{i+1}/q_i > p_{i+1}/p_i$ (less
popular gets less than fair share)
65
So, what is the best strategy for soluble queries
?
66
Square-Root Allocation

$p_i$ is proportional to $\sqrt{q_i}$

Lies in-between Uniform and Proportional
Theorem: Square-Root allocation minimizes the ESS
(on soluble queries)
Minimize $\sum_i q_i / p_i$ subject to $\sum_i p_i = 1$
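
A quick numeric check of the three allocations, as a sketch under the model stated earlier (the ESS is the sum of q_i / p_i divided by the capacity r). The function names and the query rates are hypothetical. For the square-root allocation the ESS works out to the square of the sum of the square roots of the q_i, divided by r.

```python
import math


def ess(q, p, r):
    """Expected search size: (sum_i q_i / p_i) / r."""
    return sum(q_i / p_i for q_i, p_i in zip(q, p)) / r


def uniform(q):
    return [1.0 / len(q)] * len(q)


def proportional(q):
    return list(q)


def square_root(q):
    roots = [math.sqrt(q_i) for q_i in q]
    total = sum(roots)
    return [x / total for x in roots]


# Hypothetical query rates for three items, capacity r = 1.
q = [0.6, 0.3, 0.1]
r = 1
for name, alloc in [("uniform", uniform), ("proportional", proportional),
                    ("square-root", square_root)]:
    print(name, round(ess(q, alloc(q), r), 4))
```

Uniform and Proportional both give m/r = 3 here, while Square-Root gives a strictly smaller value.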

67
How much can we gain by using SR ?
Zipf-like query rates
68
Replication Algorithms

Uniform and Proportional are easy -
Uniform: when an item is created, replicate its key
in a fixed number of hosts.
Proportional: for each query, replicate the key
in a fixed number of hosts.

Desired properties of the algorithm

Fully distributed, where peers communicate through
random probes; minimal bookkeeping; and no more
communication than what is needed for search.
Converge to/obtain SR allocation when query rates
remain steady.

69
Model for Copy Creation/Deletion

Creation: after a successful search, C(s) new
copies are created at random hosts.
Deletion: independent of the identity of the
item; copy survival chances are non-decreasing
with creation time (i.e., FIFO at each node).

70
Creation/Deletion Process
Corollary
then

71
SR Replication Algorithms

Path replication: the number of new copies C(s) is
proportional to the size of the search.
Probe memory: each peer records the number and
combined search size of probes it sees for each
item. C(s) is determined by collecting this info
from a number of peers proportional to the search size.
Extra communication (proportional to that needed
for search).

72
Path Replication

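Number of new copies produced per query, $\langle C_i \rangle$, is
proportional to the search size, $1/p_i$
Creation rate is proportional to $q_i \langle C_i \rangle$
In steady state the creation rate is proportional to the
allocation $p_i$, thus $p_i \propto q_i \langle C_i \rangle \propto q_i / p_i$, so $p_i \propto \sqrt{q_i}$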
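
The steady-state argument for path replication can be illustrated with a mean-field iteration. This is a sketch under assumptions that go beyond the slide: in each step a fixed fraction (the hypothetical turnover parameter) of all stored copies is deleted regardless of item, and the same amount is recreated with item i receiving a share proportional to q_i / p_i, that is, its query rate times its search size.

```python
import math


def path_replication_steady_state(q, turnover=0.2, steps=500):
    """Iterate the allocation p until creation and deletion balance.

    Deletion is item-independent, so it removes a `turnover` share of every
    item's copies. Creation adds back the same total, split across items in
    proportion to q_i / p_i.
    """
    m = len(q)
    p = [1.0 / m] * m  # start from the uniform allocation
    for _ in range(steps):
        created = [q_i / p_i for q_i, p_i in zip(q, p)]
        total = sum(created)
        p = [(1 - turnover) * p_i + turnover * c_i / total
             for p_i, c_i in zip(p, created)]
    return p


# Hypothetical query rates for three items.
q = [0.5, 0.3, 0.2]
p = path_replication_steady_state(q)
roots = [math.sqrt(q_i) for q_i in q]
target = [x / sum(roots) for x in roots]
print([round(x, 4) for x in p])
print([round(x, 4) for x in target])
```

The iteration settles on the allocation proportional to the square root of the query rates, which is the fixed point where p_i is proportional to q_i / p_i.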

73
Summary

Random search/replication model: probes to
random hosts
Soluble queries
Proportional and Uniform allocations are two
extremes with same average performance
Square-Root allocation minimizes Average Search
Size
OPT (all queries) lies between SR and Uniform
SR/OPT allocation can be realized by simple
algorithms.