P2P Information Retrieval Using Semantic Overlay Networks

1
P2P Information RetrievalUsing Semantic Overlay
Networks
  • Authors: Tang (Univ. of Rochester), Xu (HP Labs),
    Dwarkadas (Univ. of Rochester)

2
Problems with Current P2P
  • Content can only be searched by keyword.
  • Documents are not semantically organized.
  • Ways to search:
  • Centralized search
  • Decentralized search, which floods nodes

3
How to Solve?
  • pSearch
  • Decentralized, non-flooding algorithm
  • VSM (Vector Space Model)
  • LSI (Latent Semantic Indexing)
  • CAN Network

4
Vector Space Modeling
  • Represents documents and search terms as vectors.
  • When searching, documents are ranked according
    to the similarity between the document vector and
    the query vector (sketched below).
  • Given vectors X and Y of dimension k:
  • cos(X, Y) = ( SUM_{i=0..k} X_i * Y_i ) / ( |X| * |Y| )
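A minimal sketch of the cosine ranking, assuming numpy and made-up term-frequency vectors:

import numpy as np

def cosine(x, y):
    """Cosine similarity between a document vector and a query vector."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical term-frequency vectors over a 4-term vocabulary.
doc = np.array([2.0, 0.0, 1.0, 3.0])
query = np.array([1.0, 0.0, 0.0, 1.0])
print(cosine(doc, query))  # documents are ranked by this score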

5
VSM Cont.
  • Represents a corpus as a t x d matrix A
  • t is the number of search terms in the corpus
  • d is the number of documents in the corpus
  • a_rj is the importance of term r in document j

6
Latent Semantic Indexing
  • Uses statistically derived conceptual indices
  • Singular Value Decomposition (SVD)
  • Transforms the high-dimensional vectors from the VSM
    into lower-dimensional semantic vectors

7
What SVD does
  • Takes the matrix A from the VSM and decomposes it
    into the product of three matrices:
  • A = U E V^T
  • U = [u1, ..., ur], a t x r matrix
  • E = diag(e1, ..., er), an r x r diagonal matrix
  • the ej are A's singular values
  • V = [v1, ..., vr], a d x r matrix

8
LSI Cont
  • What LSI does with SVD
  • Approximates the matrix A of rank r by a matrix A_l
    of lower rank l
  • A_l = U_l E_l V_l^T
  • Omits all but the l largest singular values
  • The rows of V_l E_l (the columns of E_l V_l^T) are
    the semantic vectors of the documents
  • Given U_l, E_l, V_l, semantic vectors of queries,
    documents, or terms that were not originally in
    the space can be folded into the space (see the
    numpy sketch below)
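A minimal numpy sketch of the truncation and fold-in described above; the toy matrix, the choice l = 2, and the classical fold-in form q' = q^T U_l E_l^-1 are illustrative assumptions, not the authors' exact setup:

import numpy as np

# Toy t x d term-document matrix (t = 5 terms, d = 4 documents).
A = np.array([[1., 0., 0., 1.],
              [0., 1., 1., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])

U, e, Vt = np.linalg.svd(A, full_matrices=False)   # A = U E V^T
l = 2                                              # keep only the l largest singular values
Ul, El, Vlt = U[:, :l], np.diag(e[:l]), Vt[:l, :]

doc_semantic_vectors = (El @ Vlt).T                # one l-dimensional vector per document
query_tf = np.array([1., 0., 1., 0., 0.])          # a new query in term space
query_semantic_vector = query_tf @ Ul @ np.linalg.inv(El)   # folded into the semantic space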

9
LSI Cont
  • By choosing an appropriate l, noise from the smaller
    singular values is eliminated
  • Studies suggest l = 50-350 works best

10
Content Addressable Networks
  • What a CAN provides
  • A fault-tolerant distributed hash table
  • Partitions a d-dimensional Cartesian space into
    zones and assigns each zone to a node
  • Locating a document is reduced to routing to the
    node that contains that document (a greedy routing
    sketch follows)
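A rough sketch of greedy CAN routing, under the simplifying assumption that zones are represented by their centre points and supplied through hypothetical helper callables:

import math

def route(key, start, neighbours_of, centre_of):
    """Greedily forward toward `key`: at each step move to the neighbour whose
    zone centre is closest to the key, stopping when no neighbour improves."""
    current = start
    while True:
        best = min(neighbours_of(current),
                   key=lambda n: math.dist(centre_of(n), key),
                   default=None)
        if best is None or math.dist(centre_of(best), key) >= math.dist(centre_of(current), key):
            return current   # current node's zone is (locally) closest to the key
        current = best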

11
Finding something in a CAN
  • D joins and takes over part of C's zone
  • D is looking for a document with key (0.4, 0.1)
  • The lookup is routed to E and then to A

12
pSearch
  • pLSI: LSI modified for pSearch
  • A large number of machines are organized into a
    semantic overlay.
  • This forms the pSearch engine
  • The engine is a subset of nodes with good
    connectivity

13
pLSI Algorithm
  • Sets the dimensionality of the CAN to be the
    dimensionality of the semantic space
  • The index for a document is stored in the CAN using
    the document's semantic vector as the key.
  • How it works:
  • An engine node receives document A, creates the
    semantic vector Va via LSI, and uses Va as the key
    to store A's index
  • On receiving a query, an engine node transforms the
    query into a vector Vq and routes it to the node
    whose zone contains Vq
  • When that node is reached, the query is flooded to
    the nodes within radius r
  • All receiving nodes run the query against their
    local indices via LSI (a toy end-to-end sketch
    follows)
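A toy, self-contained sketch of the publish/search flow; CAN routing is replaced here by a nearest-centre lookup and the radius-r flooding is omitted, so this is illustrative rather than the paper's actual protocol:

import numpy as np

# Toy "CAN": each engine node is represented only by the centre of its zone.
node_centres = {"n1": np.array([0.2, 0.8]),
                "n2": np.array([0.9, 0.1]),
                "n3": np.array([0.5, 0.5])}
stored = {n: [] for n in node_centres}

def owner(vec):
    """Stand-in for CAN routing: the node with the nearest centre owns the key."""
    return min(node_centres, key=lambda n: np.linalg.norm(node_centres[n] - vec))

def publish(doc_id, v_a):
    stored[owner(v_a)].append((doc_id, v_a))   # the semantic vector Va is the key

def search(v_q):
    return stored[owner(v_q)]                  # flooding within radius r omitted

publish("docA", np.array([0.25, 0.75]))
print(search(np.array([0.3, 0.7])))            # finds docA on the same node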

14
pLSI Cont
  • Relies on the inverse document frequency statistics
    and the basis of the semantic space
  • This allows nodes to compute semantic vectors of
    queries and documents independently
  • Challenges
  • Dimensionality mismatch
  • There are not enough nodes to partition the CAN
    when its dimensionality is set by LSI (l = 50-350)

15
pLSI Cont
  • Uneven distribution of indices
  • Vectors are normalized and reside on the surface
    of the unit sphere S in the semantic space
  • The similarity cos θ between A and B corresponds
    directly to their distance p on the circle, since
    on a unit sphere cos θ = cos p
  • Large search space
  • Limiting the search space in a high-dimensional
    space is difficult

16
Solving Dimensionality
  • For an l-dimensional CAN with n nodes, each node
    maintains 2l neighbors only if l < log2(n)
  • Partitioning along x dimensions produces 2^x zones,
    so partitioning all l dimensions would result in
    more zones than nodes
  • If l > log2(n) and zones are partitioned evenly,
    only about log2(n) dimensions are partitioned and
    only about log2(n) neighbors are required (a quick
    numeric check follows)
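A quick numeric check with hypothetical values of n and l (chosen only for illustration):

import math

n = 10_000   # hypothetical number of engine nodes
l = 128      # hypothetical semantic-space / CAN dimensionality

partitioned_dims = math.floor(math.log2(n))   # about 13
print(partitioned_dims, "of", l, "dimensions are ever partitioned")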

17
Solving Dimensionality Cont
  • This results in only the low dimensions of the
    semantic vectors being partitioned
  • Example
  • Suppose we have Va = (-0.1, 0.55, 0.57, -0.6) and
    Vq = (0.55, -0.1, 0.6, -0.57)
  • They have a similarity of 0.574, with the majority
    contributed by the last two elements of the vectors
    (checked below). If we have only four nodes, the
    4-dimensional CAN may be split only on the first
    two elements of the vector.
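The per-dimension contributions to the 0.574 similarity can be checked directly:

import numpy as np

Va = np.array([-0.1, 0.55, 0.57, -0.6])
Vq = np.array([0.55, -0.1, 0.6, -0.57])

per_dim = Va * Vq
print(per_dim)        # [-0.055, -0.055, 0.342, 0.342]
print(per_dim.sum())  # ~0.574, dominated by the last two dimensions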

18
Solving Dimensionality Cont
  • Solves the problem of only the low dimensions being
    partitioned
  • Rolling semantic indexes
  • Given V = (v0, ..., vl) we rotate m dimensions each
    time to produce a new vector
  • Support vector:
  • Vi = (v(i*m), ..., vl, v0, ..., v(i*m - 1)), with
    m = 2.3 ln(n)
  • Given a document A with vector Va we store its
    index at p places in the CAN, using the rotated
    vectors Vai where i = 0 ... p-1
  • The pLSI algorithm uses the correspondingly rotated
    query vectors Vqi to search and return relevant
    documents (see the sketch below)
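A minimal sketch of the rotation and the p storage keys; taking p = ceil(l/m), so that every dimension lands in the low positions of some rotation, is an assumption made here for illustration:

import math

def rotated(v, i, m):
    """Rotate the semantic vector by i*m dimensions (rolling index, space i)."""
    shift = (i * m) % len(v)
    return v[shift:] + v[:shift]

def storage_keys(v_a, n_nodes):
    """Return the rotated vectors Va0 ... Va(p-1) used as CAN keys for one index."""
    m = max(1, round(2.3 * math.log(n_nodes)))
    p = math.ceil(len(v_a) / m)
    return [rotated(v_a, i, m) for i in range(p)]

# Toy 10-dimensional semantic vector and 20 nodes (illustrative only).
print(storage_keys([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 20))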

19
Rolling Index Example
  • m = 2

20
Balancing Indexes
  • Content-aware bootstrapping
  • Forces the distribution of nodes to follow the
    distribution of indices
  • At partitioning time, the node randomly picks a
    document, computes its semantic vector, and rotates
    it by i (0 < i < p); the space is then partitioned
    across the lowest un-partitioned dimension and half
    of that zone is handed over

21
Balancing Indexes Cont
  • Advantages provided
  • A more balanced index
  • More nodes are used in the regions of the semantic
    space that hold more documents
  • Index locality
  • Assuming the documents published by a node have
    similar semantics, on space i the indices of a
    node's documents are likely to be published on the
    node itself or its neighbors.
  • Query locality
  • Assuming the documents published by a node are good
    indications of the user's queries, nodes near the
    user are more likely to hold documents relevant to
    the query.

22
(No Transcript)
23
Reducing Search Space
  • Content Directed Search
  • Search Algorithm

24
Content Directed Search
  • Uses the indices and recent queries stored on nodes
    to guide the search
  • Example

25
1-25 are the node ids, q is the query, and a-f are
document vectors. Starting at node 13, the candidate
set is N = {8, 12, 14, 18}; 14 is searched next because
its samples are most relevant, giving
N = {8, 12, 18, 9, 15, 19}; 9 is searched next, giving
N = {8, 12, 18, 15, 19, 4, 10}; 4 is searched next.
26
Algorithm Description
  • X_Z or X_{Z,Y} means X is a parameter of node Z,
    with Y as an additional attribute
  • D_Z is the set of semantic-vector indices stored at
    node Z
  • Q_Z is the set of semantic vectors of queries
    recently processed at Z
  • U_Z^i summarizes the indices stored on Z and the
    queries recently processed by Z:
  • U_Z^i = H / |H|, where
    H = sum(d for d in D_Z^i) + sum(q for q in Q_Z^i)
  • Z requests each of its neighbors P to send k_c
    document vectors similar to U_Z^i and k_r random
    samples (a numpy sketch of U_Z^i follows)
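A small numpy sketch of the summary vector under the reading above (toy vectors only):

import numpy as np

def summary_vector(stored_indices, recent_queries):
    """U = H / |H|, where H sums the stored index vectors and recent query vectors."""
    H = np.sum(np.vstack(stored_indices + recent_queries), axis=0)
    return H / np.linalg.norm(H)

docs = [np.array([0.1, 0.9]), np.array([0.2, 0.8])]
queries = [np.array([0.3, 0.7])]
print(summary_vector(docs, queries))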

27
Algorithm Cont
  • When node Z is visited for query q, Z uses the
    following to estimate how promising each neighbor P
    is for the rotated query Vqi:
  • e_i(P, Vqi) = max{ cos(d, Vqi) : d in S_{Z,P}^i }
  • This directs the search in each rotated space
  • Nodes with high estimates are visited first, and the
    search stops when T visited nodes have yielded no
    result (a scoring sketch follows)
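A sketch of the neighbour scoring, assuming samples_by_neighbour holds the sample vectors Z keeps for each neighbour P (names are hypothetical):

import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_neighbours(samples_by_neighbour, v_q):
    """Score each neighbour P by max over its samples d of cos(d, v_q),
    then visit the highest-scoring neighbours first."""
    scores = {P: max(cos(d, v_q) for d in samples)
              for P, samples in samples_by_neighbour.items()}
    return sorted(scores, key=scores.get, reverse=True)

samples = {"P1": [np.array([0.9, 0.1])],
           "P2": [np.array([0.2, 0.98]), np.array([0.5, 0.5])]}
print(rank_neighbours(samples, np.array([0.7, 0.7])))   # P2 first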

28
Experiment
  • Used the Text Retrieval Conference (TREC) corpus

29
Varying System Size
30
Varying System and Corpus
31
Content vs Query
32
With Replication