1
Range and kNN Searching in P2P
  • Manesh Subhash
  • Ni Yuan
  • Sun Chong

2
Outline
  • Range query searching in P2P
  • One-dimensional range queries
  • Multi-dimensional range queries
  • Comparison of range query searching in P2P
  • kNN searching in P2P
  • Scalable nearest neighbor searching
  • PIERSearch
  • Conclusion

3
Motivation
  • Most P2P systems support only simple lookup
    queries
  • DHT-based approaches such as Chord and CAN are
    not suitable for range queries
  • More complicated queries such as range queries
    and kNN searches are needed

4
P-Tree [APJ04]
  • The B-tree is widely used for efficiently
    evaluating range queries in centralized databases
  • A distributed B-tree is not directly applicable
    in a P2P environment
  • Fully independent B-trees
  • Semi-independent B-trees, i.e. the P-tree

5
Fully independent B-tree
[Figure: fully independent B-trees, each peer builds its own complete B-tree over the full key set 4, 8, 12, 20, 24, 25, 26, 35]
6
Semi-independent B-tree
[Figure: semi-independent B-trees (the P-tree), peers P1 (4), P2 (8), P3 (12), P4 (20), P5 (24), P6 (25), P7 (26), P8 (35) arranged on a ring, each maintaining only part of its own B-tree]
7
Coverage and Separation
[Figure: node ranges illustrating the coverage and separation properties of the P-tree; annotated cases include overlap and anti-coverage]
8
Properties of P-tree
  • Each peer stores O(log_d N) P-tree nodes
  • Total storage per peer is O(d · log_d N)
  • Requires no global coordination among peers
  • The search cost for a range query that returns m
    results is O(m + log_d N)

9
Search Algorithm: at p1, 21 < value < 29
[Figure: the ring of peers P1 (4) through P8 (35) with their P-tree levels l0 and l1; the range query 21 < value < 29 issued at P1 is routed along the tree levels to the peers holding the values 24, 25, 26]
10
Multi-dimension range query
  • Routing in a one-dimensional routing space
  • ZNet: Z-ordering + skip graphs [STZ04]
  • Hilbert space-filling curve + Chord [SP03]
  • SCRAP [GYG04]
  • Routing in a multi-dimensional routing space
  • MURK [GYG04]

11
Desiderata
  • Locality: data elements that are nearby in the
    data space should be stored on the same node or
    on nearby nodes
  • Load balance: the amount of data stored by each
    node should be roughly the same
  • Efficient routing: the number of messages
    exchanged between nodes to route a query should
    be small

12
Hilbert SFC + Chord
  • SFC: maps a d-dimensional cube -> a line
  • the line passes once through each point in the
    volume of the cube

[Figure: a 2D Hilbert space-filling curve; the 2-bit cell labels 00, 01, 10, 11 are refined into the 4-bit labels 0000-1111 visited in order along the curve]
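A minimal sketch of how such a curve can be computed locally, using the classic iterative 2-D Hilbert mapping from a grid cell (x, y) to its position d along the curve; the function name and the restriction to two dimensions are illustrative, not part of the SP03 system.

  def hilbert_xy_to_d(n, x, y):
      """Index of cell (x, y) along the Hilbert curve over an n x n grid (n a power of two)."""
      d = 0
      s = n // 2
      while s > 0:
          rx = 1 if (x & s) else 0
          ry = 1 if (y & s) else 0
          d += s * s * ((3 * rx) ^ ry)
          # Rotate/flip the quadrant so the curve stays continuous.
          if ry == 0:
              if rx == 1:
                  x, y = n - 1 - x, n - 1 - y
              x, y = y, x
          s //= 2
      return d

  # Cells that are adjacent along the curve receive consecutive 1-D indices.
  print([hilbert_xy_to_d(4, x, y) for x, y in [(0, 0), (1, 0), (1, 1), (0, 1)]])  # [0, 1, 2, 3]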
13
Hilbert SFC + Chord
  • mapping the 1-dimensional index space onto the
    Chord overlay network topology

[Figure: Chord ring with nodes 0, 4, 8, 11, 14; data elements with keys 5, 6, 7, 8 are stored at node 8, their successor]
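A minimal sketch of the successor rule that places a 1-D index value on the ring, using the node IDs from the figure; the helper below is illustrative, not Chord's actual interface.

  import bisect

  def successor(node_ids, key, id_space=16):
      """Return the node responsible for key: the first node ID >= key, wrapping around the ring."""
      ring = sorted(node_ids)
      i = bisect.bisect_left(ring, key % id_space)
      return ring[i % len(ring)]

  nodes = [0, 4, 8, 11, 14]
  print([successor(nodes, k) for k in (5, 6, 7, 8)])  # keys 5-8 are all stored at node 8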
14
Query Processing
  • translate the keyword query into the relevant
    clusters of the SFC-based index space
  • query the appropriate nodes in the overlay
    network for the data elements

[Figure: a query region in the 2D keyword space is mapped to clusters of SFC cells, which in turn map onto the Chord nodes (0, 4, 8, 11, 14) responsible for them]
15
Query Optimization (010, )
[Figure: the query is refined digit by digit down a tree over the SFC index space; the resulting index clusters include (000100), (000111, 001000), (001011), (011000, 011001), (011101, 011110)]
16
Query Optimization (cont.) (010,)
[Figure: the same refinement embedded in the 2D grid; the matching SFC clusters are 000100, 000111, 001000, 001011, 011000, 011001, 011101, 011110]
17
Query Optimization (cont.)
  • Pruning nodes from the tree
[Figure: sub-trees of the refinement tree that cannot contain matching clusters are pruned before the remaining clusters are queried]
18
SCRAP [GYG04]
  • Use Z-ordering or a Hilbert space-filling curve to
    map multi-dimensional data down to a single
    dimension (a minimal Z-ordering sketch follows below)
  • Range-partition the one-dimensional data across
    the S available nodes
  • Use a skip graph to route the queries
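A minimal sketch of Z-ordering (Morton coding), one of the space-filling curves SCRAP can use: the bits of the coordinates are interleaved into a single key, which is then range-partitioned across the nodes. The function below assumes non-negative integer coordinates and is illustrative only.

  def z_order(coords, bits=16):
      """Interleave the bits of each coordinate into one 1-D Morton key."""
      key = 0
      for b in range(bits):                  # for each bit position...
          for i, c in enumerate(coords):     # ...take one bit from every dimension
              key |= ((c >> b) & 1) << (b * len(coords) + i)
      return key

  # Nearby points tend to get nearby (though not always contiguous) 1-D keys.
  print(z_order([3, 5]), z_order([4, 5]), z_order([3, 6]))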

19
MURK: Multi-dimensional Rectangulation with
KD-trees
  • Basic concept
  • Partition the high-dimensional data space into
    rectangles, each managed by one node.
  • Partitioning is based on a KD-tree: the space is
    split cyclically along the dimensions, and each
    leaf of the KD-tree corresponds to one rectangle.

20
Partitioning
  • Each node joins, split the space along one
    dimension into two parts of equal load, keeping
    load balance.
  • Each node manage data in one rectangle, thus
    keeping data locality.
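A minimal sketch of such an equal-load split, assuming the node simply holds its data points as tuples: the rectangle is cut along the cyclically chosen dimension at the median of the stored data, so each half keeps roughly half the load. The helper is illustrative, not MURK's actual code.

  def split_rectangle(points, depth, dims):
      """Split the data of one rectangle into two halves of (roughly) equal load."""
      axis = depth % dims                       # dimensions are chosen cyclically
      pts = sorted(points, key=lambda p: p[axis])
      cut = pts[len(pts) // 2][axis]            # split at the load median
      left = [p for p in pts if p[axis] < cut]
      right = [p for p in pts if p[axis] >= cut]
      return cut, left, right

  points = [(1, 9), (2, 3), (5, 7), (8, 1), (9, 6), (4, 4)]
  print(split_rectangle(points, depth=0, dims=2))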

21
Comparison with CAN
  • The partition based on KD-tree is similar as that
    in CAN. Both hash data into multi-dimensional
    space and try to keep load balancing
  • The major difference is that a new node splits
    the exiting node data space equally in CAN,
    rather than splitting load equality.

22
Routing in MURK
  • Routing is to create a link between all the
    neighboring nodes along the relevant nodes.
  • Based on the greedy routing over the grid
    links, the distance between two node is the
    minimum Manhattan distance.
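A minimal sketch of the greedy routing rule: at each hop, forward the query to the grid-linked neighbor whose rectangle centre is closest to the query point in Manhattan distance. The neighbors and centre maps are illustrative assumptions, not MURK's actual interfaces.

  def manhattan(a, b):
      return sum(abs(x - y) for x, y in zip(a, b))

  def greedy_route(start, query_point, neighbors, centre):
      """Follow grid links greedily until no neighbor is closer to the query point."""
      node = start
      while True:
          best = min(neighbors[node], key=lambda n: manhattan(centre[n], query_point))
          if manhattan(centre[best], query_point) >= manhattan(centre[node], query_point):
              return node        # local minimum: this node's rectangle covers the query
          node = best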

23
Optimization for the routing
  • Grid links are not efficient for the routing.
  • Maintain skip pointers for each node to speed up
    the routing. Two methods to chose the skip
    pointers
  • Random. Chose randomly a node from node set.
  • Space-filling skip graph. Make the skip pointers
    at exponentially increasing distance.

24
Discussion
  • Non-uniformity for the routing neighbors.
    Resulted from load balancing for the node.
  • The dynamic data distribution would result in the
    unbalance for the node data.

25
Performance
26
Performance (cont.)
27
Conclusion
  • For locality, MURK far outperforms SCRAP. For
    routing cost, SCRAP is efficient enough, and skip
    pointers are effective, especially the
    space-filling-curve skip pointers.
  • SCRAP, using a space-filling curve with range
    partitioning, is efficient in low dimensions. MURK
    with a space-filling skip graph performs much
    better, especially in high dimensions.

28
pSearch
  • Motivation
  • Numerous documents are spread over the Internet.
  • How can we efficiently find the most closely
    related documents without returning too many of
    little interest?
  • Problem: semantically related documents are
    randomly distributed across nodes.
  • Exhaustive search brings overhead.
  • No deterministic guarantees.

29
P2P IR techniques
  • Unstructured P2P search
  • A centralized index has a bottleneck problem.
  • Flooding-based techniques incur too much
    overhead.
  • Heuristic-based algorithms may miss some important
    documents.
  • Structured P2P search
  • DHT-based CAN and Chord are suitable for keyword
    matching.
  • Traditional IR techniques
  • Advanced IR ranking algorithms can be adopted
    in P2P search.
  • Two IR techniques:
  • Vector space model (VSM).
  • Latent semantic indexing (LSI).

30
pSearch
  • An IR system built on P2P networks:
  • efficient and scalable like a DHT,
  • accurate like advanced IR algorithms.
  • Maps the semantic space onto nodes and conducts
    nearest neighbor search:
  • uses VSM and LSI to generate the semantic space,
  • uses CAN to organize the nodes.

31
VSM & LSI
  • VSM
  • Documents and queries are expressed as term
    vectors.
  • Weight of a term: term frequency × inverse
    document frequency.
  • Rank based on the similarity of document and
    query: cos(X, Y), where X and Y are the two term
    vectors (a minimal sketch follows below).
  • LSI
  • Based on singular value decomposition; transforms
    term vectors from the high-dimensional term space
    into low-dimensional (l) semantic vectors.
  • The statistically derived concept space avoids
    synonymy and noise in documents.
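A minimal sketch of the VSM ranking described above, with TF-IDF term weights and cosine similarity between sparse term vectors; the helper names are illustrative.

  import math
  from collections import Counter

  def tf_idf(doc_terms, doc_freq, num_docs):
      """Term-frequency x inverse-document-frequency weights for one document."""
      tf = Counter(doc_terms)
      return {t: tf[t] * math.log(num_docs / (1 + doc_freq.get(t, 0))) for t in tf}

  def cosine(x, y):
      """Cosine similarity between two sparse term-weight vectors."""
      dot = sum(w * y.get(t, 0.0) for t, w in x.items())
      nx = math.sqrt(sum(w * w for w in x.values()))
      ny = math.sqrt(sum(w * w for w in y.values()))
      return dot / (nx * ny) if nx and ny else 0.0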

32
pSearch system
[Figure: pSearch system overview, a DOC and a QUERY are mapped to semantic vectors and routed into the CAN-based semantic overlay]
33
Advantage of pSearch
  • Exhaustive search within a bounded area, which
    can be ideally accurate.
  • Communication overhead is limited to transferring
    the query and references to the top documents,
    independent of the corpus size.
  • A good approximation of the global statistics is
    sufficient for pSearch.

34
Challenges
  • Dimensionality mismatch between CAN and LSI.
  • Uneven distribution of indices.
  • Large search region.

35
Dimensionality mismatch
  • There are not enough nodes (N) in the CAN to
    partition all the dimensions (L) of the LSI
    semantic space.
  • N nodes in CAN can partition only about log(N) low
    dimensions (the effective dimensionality), leaving
    the others un-partitioned.

36
Rolling index
  • Motivation
  • A small subset of the dimensions contributes most
    of the similarity;
  • the low dimensions are the most important.
  • Partition more dimensions of the semantic space
    by rotating the semantic vectors.
  • For a semantic vector V = (v0, v1, ..., vl), each
    rotation shifts the vector by m dimensions.
    Rotated space i uses the vector of the i-th
    rotation:
  • Vi = (v_{i*m}, ..., v_l, v_0, v_1, ..., v_{i*m-1})
  • m ≈ 2.3 ln(n).
  • Use the rotated vectors to route the query and
    guide the search (a rotation sketch follows below).
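A minimal sketch of the rotation itself, assuming plain Python lists for the semantic vectors; pSearch then publishes and searches the vector in each of the p rotated spaces.

  def rolling_index_vectors(v, m, p):
      """Return the p rotations of semantic vector v, each shifted by m more dimensions.

      Rotation i moves dimensions i*m, i*m+1, ... to the front, so a CAN that only
      partitions the first few dimensions indexes a different slice of the semantic
      space in each rotated space."""
      rotations = []
      for i in range(p):
          k = (i * m) % len(v)
          rotations.append(v[k:] + v[:k])   # V_i = (v_{i*m}, ..., v_l, v_0, ..., v_{i*m-1})
      return rotations

  print(rolling_index_vectors([0.9, 0.5, 0.3, 0.2, 0.1, 0.05], m=2, p=3))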

37
Rolling index
  • Uses more storage (p times as much) to keep the
    search within a local region of each rotated space.
  • Selective rotation is expected to be an efficient
    way to process the important dimensions.

38
Balance index distribution
  • Content-aware node bootstrapping:
  • a joining node randomly selects one of the
    documents it will publish,
  • routes to the node responsible for that document's
    semantic vector,
  • and takes over part of that node's load.
  • Indices are thus spread over more nodes; even
    though the choice is random, the load stays
    balanced for a large corpus.

39
Reducing search space
  • Curse of dimensionality:
  • high-dimensional data is sparsely populated;
  • in high dimensions, the distance to the nearest
    neighbor becomes large.
  • Based on data locality, use the indices stored on
    nodes and recently processed queries to guide a
    new search.

40
Content-directed search
[Figure: content-directed search, the nodes around the query node q are visited in an order guided by the indices they store and by recently processed queries]
41
Performance
42
Conclusion
  • pSearch is a P2P IR system that organizes content
    around its semantics and achieves good accuracy
    with respect to system size, corpus size, and the
    number of returned documents.
  • The rolling index resolves the dimensionality
    mismatch and limits the space overhead and the
    number of visited nodes.
  • Content-aware node bootstrapping balances node
    load and achieves index and query locality.
  • Content-directed search reduces the number of
    nodes searched.

43
kNN searching in P2P Networks
  • Manesh Subhash
  • Ni Yuan
  • Sun Chong

44
Outline
  • Introduction to searching in P2P
  • Nearest neighbor queries
  • Presentation of the ideas in the papers
  • 1. A Scalable Nearest Neighbor Search in P2P
    Systems
  • 2. Enhancing P2P File-Sharing with an
    Internet-Scale Query Processor

45
Introduction to searching in P2P
  • Exact-match queries
  • single-key retrieval
  • Linear Hashing
  • CAN, Chord, Pastry, Tapestry
  • Similarity-based queries
  • metric space based
  • What do we search for?
  • Rare items, popular items, or both.

46
Nearest neighbor queries
  • The notion of a metric space:
  • how similar are two objects from a given set of
    objects?
  • Extensible to exact, range, and nearest neighbor
    queries.
  • Computationally expensive.
  • The distance function satisfies positiveness,
    reflexivity, symmetry, and the triangle inequality.

47
Nearest neighbor queries (Cont)
  • A metric space is a pair (D, d):
  • D: the domain of objects,
  • d: the distance function.
  • Similarity queries (see the sketch below):
  • Range: for a finite set F ⊆ D, a range query
    retrieves all objects in F whose distance to the
    query object q is at most the query radius r.
  • Nearest neighbor:
  • returns the object closest to q; the k nearest
    objects for kNN, with k ≤ |F|.
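A minimal sketch of these two query types over a finite set F, using brute force and the Euclidean distance as one possible metric; a real P2P index of course avoids scanning all of F.

  import math

  def range_query(F, q, r, d=math.dist):
      """All objects in F within distance r of the query object q."""
      return [o for o in F if d(o, q) <= r]

  def knn_query(F, q, k, d=math.dist):
      """The k objects in F closest to q."""
      return sorted(F, key=lambda o: d(o, q))[:k]

  F = [(0, 0), (1, 2), (3, 3), (5, 1)]
  print(range_query(F, q=(1, 1), r=1.5))   # [(0, 0), (1, 2)]
  print(knn_query(F, q=(1, 1), k=2))       # [(1, 2), (0, 0)]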

48
Scalable NN search
  • Uses the GHT structure:
  • a distributed metric index
  • that supports range and k-NN queries.
  • The GHT architecture is composed of nodes (peers)
    that can insert, store, and retrieve objects using
    similarity queries.
  • Assumptions: message passing, unique network
    identifiers, local buckets to store data, and each
    object stored in only one bucket.

49
Example of the GHT Network
50
Scalable NN search (3)
  • Address Search Tree (AST):
  • a binary search tree;
  • inner nodes hold routing information: two pivots
    and pointers to the left and right sub-trees;
  • leaf nodes are pointers to data:
  • local data is stored in buckets and accessed using
    a BID (bucket identifier),
  • non-local data is identified using an NNID
    (network node identifier).
  • (Every AST leaf is one of these two pointer types;
    see the sketch below.)
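A minimal sketch of the exact-match/insert descent through such a tree, assuming a generalized-hyperplane rule at each inner node (follow the side whose pivot is closer to the query); the class names stand in for the BID/NNID pointers and are illustrative, not the paper's API.

  from dataclasses import dataclass
  from typing import Union

  @dataclass
  class BucketLeaf:          # local bucket (BID)
      bucket_id: int

  @dataclass
  class RemoteLeaf:          # pointer to another peer (NNID)
      node_id: int

  @dataclass
  class InnerNode:           # routing information: two pivots and two sub-trees
      pivot_left: tuple
      pivot_right: tuple
      left: "ASTNode"
      right: "ASTNode"

  ASTNode = Union[BucketLeaf, RemoteLeaf, InnerNode]

  def descend(node, q, d):
      """Radius-0 traversal: follow the sub-tree whose pivot is closer to q."""
      while isinstance(node, InnerNode):
          node = node.left if d(q, node.pivot_left) <= d(q, node.pivot_right) else node.right
      return node            # either a local bucket or a remote peer to forward to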

51
Scalable NN search (4)
  • Searching the AST
  • The BPATH:
  • a representation of a traversal of the tree as a
    string of n binary elements {0,1}: p = (b1, b2, ..., bn).
  • The traversing operator, given a query q and a
    radius, returns a BPATH.
  • It examines every inner node using the two pivot
    values and decides which sub-tree(s) to follow.
  • A radius of zero is used for exact matches and
    during inserts.

52
Scalable NN search (5)
  • k-NN searching in GHT (see the sketch below):
  • Range searching alone is not suitable without
    intrinsic knowledge of the data and the metric
    space used.
  • Begin the search at the bucket with a high
    probability of containing k objects.
  • If k objects are found, use the k-th object to
    define a similarity search with a radius equal to
    the k-th distance from q;
  • sort the result and pick the first k.
  • If fewer than k objects are found, we cannot
    determine an upper bound on the search radius for
    the k-th neighbor:
  • two variations on choosing the range radius.
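A minimal sketch of that strategy, assuming two hypothetical helpers: first_bucket(q) returns the candidates stored in the most promising bucket, and range_query(q, r) runs the distributed range search.

  def knn_via_range(q, k, d, first_bucket, range_query):
      """Answer a kNN query by issuing range queries with a growing radius."""
      candidates = sorted(first_bucket(q), key=lambda o: d(q, o))
      if len(candidates) >= k:
          r = d(q, candidates[k - 1])      # optimistic bound: the current k-th distance
      else:
          # Fewer than k candidates: no upper bound is known, so start from the
          # farthest candidate seen so far and grow the radius iteratively.
          r = d(q, candidates[-1]) if candidates else 1.0
      result = sorted(range_query(q, r), key=lambda o: d(q, o))
      while len(result) < k:
          r *= 2                           # expand the radius and retry
          result = sorted(range_query(q, r), key=lambda o: d(q, o))
      return result[:k]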

53
Scalable NN search (6)
  • Finding the k objects using range searches:
  • Optimistic:
  • minimizes distance computation costs and bucket
    accesses;
  • uses as the bounding distance that of the last
    candidate available at the first accessed bucket;
  • iteratively expands the radius if fewer than k are
    found.
  • Pessimistic:
  • the probability of needing another iteration is
    minimized;
  • uses the distance between the pivot values at a
    level of the AST as the range radius, starting
    from the parent of the leaf, and executes the
    range query;
  • if fewer than k are found, it moves up to the next
    level.

54
Scalable NN search (7)
  • Performance evaluation
  • With increasing k:
  • the number of parallel distance computations
    remains stable;
  • the number of bucket accesses and the number of
    messages increase rapidly.
  • Effect of a growing dataset:
  • the maximum hop count increases slowly;
  • nearly constant parallel distance computation
    costs.
  • Comparison with range queries:
  • slightly slower because of the overhead of
    locating the first bucket.

55
Scalable NN search (8)
Performance of the scheme on the TXT dataset.
56
Scalable NN search (9)
  • Conclusion
  • A first effort at distributed index structures
    supporting k-NN searching.
  • GHT is a scalable solution.
  • Scope for future work includes handling updates
    of the dataset
  • and other metric-space partitioning schemes.

57
Enhanced P2P - PIERSearch (1)
  • An Internet-scale query processor.
  • Queried data has a Zipfian distribution:
  • popular data in the head,
  • a long tail of rare items.
  • PIERSearch is DHT based.
  • It is a hybrid system: Gnutella is used for
    popular items, PIERSearch for rare items.
  • Integrated with the PIER system.

58
PIERSearch (2)
  • Gnutella query processing:
  • flooding based,
  • simple and effective for popular files,
  • optimized using
  • ultrapeers, nodes that perform query processing on
    behalf of the leaf nodes, and
  • dynamic querying (larger TTLs).
  • The team studied the characteristics of the
    Gnutella network.

59
PIERSearch (3)
  • Effectiveness of Gnutella:
  • Query recall: the percentage of the results
    available in the network that are returned.
  • Query distinct recall: the percentage of distinct
    results; nullifies the effect of replicas.
  • Experiments show that Gnutella is efficient for
    highly replicated content and for queries with
    large result sets,
  • but ineffective for rare content.
  • Increasing the TTL does not reduce latency but can
    improve recall.

60
PIERSearch (4)
  • Searching using PIERSearch:
  • keyword based;
  • the publisher maintains an inverted file indexed
    using the DHT;
  • it generates two tuples for each item
    (a publishing sketch follows below):
  • Item(fileId, filename, fileSize, ipAddress, port)
  • Inverted(keyword, fileId)
  • Uses the underlying PIER system,
  • a DHT-based Internet-scale relational query
    processor.
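A minimal sketch of publishing those two tuple types into a DHT through a generic put(key, value) call; dht_key and tokenize are illustrative helpers, not PIER's actual API.

  import hashlib

  def dht_key(s):
      """Hash a string into the DHT's identifier space."""
      return int(hashlib.sha1(s.encode()).hexdigest(), 16)

  def tokenize(filename):
      return [t for t in filename.lower().replace(".", " ").replace("_", " ").split() if t]

  def publish(dht_put, file_id, filename, file_size, ip, port):
      # One Item tuple, keyed by the file identifier.
      dht_put(dht_key(file_id), ("Item", file_id, filename, file_size, ip, port))
      # One Inverted tuple per keyword, so a keyword lookup can find the fileIds.
      for kw in tokenize(filename):
          dht_put(dht_key(kw), ("Inverted", kw, file_id))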

61
PIERSearch (5)
  • Hybrid system
  • Identification of rare items (see the sketch below):
  • Query result size:
  • results smaller than a fixed threshold are
    considered rare.
  • Term frequency:
  • items with at least one term below a threshold are
    considered rare.
  • Term-pair frequency:
  • less prone to skew when filenames contain popular
    words.
  • Sampling:
  • sample neighboring nodes and compute a lower-bound
    estimate of the number of replicas.
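A minimal sketch of the first two heuristics (result size and term frequency); the thresholds and the term_count lookup are illustrative assumptions, not values from the paper.

  def rare_by_result_size(num_results, threshold=20):
      """A query is treated as rare if it returns fewer results than the threshold."""
      return num_results < threshold

  def rare_by_term_frequency(query_terms, term_count, threshold=100):
      """Rare if at least one query term occurs fewer times than the threshold."""
      return any(term_count.get(t, 0) < threshold for t in query_terms)

  term_count = {"ubuntu": 50_000, "obscure": 12, "mixtape": 75}
  print(rare_by_term_frequency(["obscure", "mixtape"], term_count))  # True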

62
PIERSearch (6)
  • Performance summary

63
PIERSearch (7)
  • Conclusion
  • Gnutella is highly effective for querying popular
    content, but ineffective for rare items.
  • Building a partial index over the least replicated
    content can improve query recall.

64
References
  • [APJ04] A. Crainiceanu, P. Linga, J. Gehrke and
    J. Shanmugasundaram. Querying Peer-to-Peer
    Networks Using P-Trees. In WebDB, 2004.
  • [GYG04] P. Ganesan, B. Yang and H.
    Garcia-Molina. One Torus to Rule Them All:
    Multi-dimensional Queries in P2P Systems. In
    WebDB, 2004.
  • [SP03] C. Schmidt and M. Parashar. Flexible
    Information Discovery in Decentralized
    Distributed Systems. In HPDC, 2003.
  • [STZ04] Y. Shu, K-L. Tan and A. Zhou. Adapting
    the Content Native Space for Load Balanced
    Indexing. In Databases, Information Systems and
    Peer-to-Peer Computing, 2004.

65
References (cont.)
  • [LHH04] B. Loo, J. Hellerstein, R. Huebsch, S.
    Shenker and I. Stoica. Enhancing P2P File-Sharing
    with an Internet-Scale Query Processor. In VLDB,
    2004.
  • [TXD03] C. Tang, Z. Xu and S. Dwarkadas.
    Peer-to-Peer Information Retrieval Using
    Self-Organizing Semantic Overlay Networks. In
    SIGCOMM, 2003.
  • [ZBG04] P. Zezula, M. Batko and C. Gennaro. A
    Scalable Nearest Neighbor Search in P2P Systems.
    In Databases, Information Systems and Peer-to-Peer
    Computing, 2004.

66
  • Thank you!