P2P Information Retrieval Using Semantic Overlay Networks

1
P2P Information RetrievalUsing Semantic Overlay
Networks
  • Authors: Tang (Univ. of Rochester), Xu (HP Labs),
    Dwarkadas (Univ. of Rochester)

2
Problems with Current P2P
  • Content can only be searched by keyword.
  • Documents are not semantically organized.
  • Ways to search:
  • Centralized search
  • Decentralized search, which floods nodes

3
How to Solve?
  • pSearch
  • Decentralized, non-flooding algorithm
  • VSM (Vector Space Model)
  • LSI (Latent Semantic Indexing)
  • CAN Network

4
Vector Space Modeling
  • Represents documents and search terms as vectors.
  • When searching, documents are ranked according
    to the similarity between the document vector and
    the query vector (sketched below).
  • Given vectors X and Y of dimension k:
  • cos(X, Y) = ( SUM_{i=0..k} X_i * Y_i ) / ( |X| * |Y| )
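A minimal sketch of the cosine ranking, assuming numpy and made-up term-frequency vectors:

import numpy as np

def cosine(x, y):
    """Cosine similarity between a document vector and a query vector."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical term-frequency vectors over a 4-term vocabulary.
doc = np.array([2.0, 0.0, 1.0, 3.0])
query = np.array([1.0, 0.0, 0.0, 1.0])
print(cosine(doc, query))  # documents are ranked by this score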

5
VSM Cont.
  • Represents a corpus as a t x d matrix A
  • t is the number of search terms in the corpus
  • d is the number of documents in the corpus
  • a_rj is the importance of term r in document j

6
Latent Semantic Indexing
  • Uses statistically derived conceptual indices
  • Singular Value Decomposition (SVD)
  • Transforms the high-dimensional vectors from the VSM
    into lower-dimensional semantic vectors

7
What SVD does
  • Takes the matrix A from the VSM and decomposes it
    into the product of three matrices:
  • A = U E V^T
  • U = [u1, ..., ur], a t x r matrix
  • E = diag(e1, ..., er), an r x r diagonal matrix
  • the ej are A's singular values
  • V = [v1, ..., vr], a d x r matrix

8
LSI Cont
  • What LSI does with SVD
  • Approximates the matrix A of rank r by a matrix A_l
    of lower rank l
  • A_l = U_l E_l V_l^T
  • Omits all but the l largest singular values
  • The rows of V_l E_l (the columns of E_l V_l^T) are
    the semantic vectors of the documents
  • Given U_l, E_l, V_l, semantic vectors of queries,
    documents, or terms that were not originally in
    the space can be folded into the space (see the
    numpy sketch below)
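A minimal numpy sketch of the truncation and fold-in described above; the toy matrix, the choice l = 2, and the classical fold-in form q' = q^T U_l E_l^-1 are illustrative assumptions, not the authors' exact setup:

import numpy as np

# Toy t x d term-document matrix (t = 5 terms, d = 4 documents).
A = np.array([[1., 0., 0., 1.],
              [0., 1., 1., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])

U, e, Vt = np.linalg.svd(A, full_matrices=False)   # A = U E V^T
l = 2                                              # keep only the l largest singular values
Ul, El, Vlt = U[:, :l], np.diag(e[:l]), Vt[:l, :]

doc_semantic_vectors = (El @ Vlt).T                # one l-dimensional vector per document
query_tf = np.array([1., 0., 1., 0., 0.])          # a new query in term space
query_semantic_vector = query_tf @ Ul @ np.linalg.inv(El)   # folded into the semantic space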

9
LSI Cont
  • By choosing an appropriate l, noise from the smaller
    singular values is eliminated
  • Studies suggest l = 50-350 works best

10
Content Addressable Networks
  • What a CAN provides
  • A fault-tolerant distributed hash table
  • Partitions a d-dimensional Cartesian space into
    zones and assigns each zone to a node
  • Locating a document is reduced to routing to the
    node that contains that document (a greedy routing
    sketch follows)
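A rough sketch of greedy CAN routing, under the simplifying assumption that zones are represented by their centre points and supplied through hypothetical helper callables:

import math

def route(key, start, neighbours_of, centre_of):
    """Greedily forward toward `key`: at each step move to the neighbour whose
    zone centre is closest to the key, stopping when no neighbour improves."""
    current = start
    while True:
        best = min(neighbours_of(current),
                   key=lambda n: math.dist(centre_of(n), key),
                   default=None)
        if best is None or math.dist(centre_of(best), key) >= math.dist(centre_of(current), key):
            return current   # current node's zone is (locally) closest to the key
        current = best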

11
Finding something in a CAN
  • D joins and takes over part of C's zone
  • D is looking for a document with key (0.4, 0.1)
  • The lookup is routed to E and then to A

12
pSearch
  • pLSI: LSI modified for pSearch
  • A large number of machines are organized into a
    semantic overlay.
  • This forms the pSearch engine
  • The engine is a subset of nodes with good
    connectivity

13
pLSI Algorithm
  • Sets the dimensionality of the CAN to be the
    dimensionality of the semantic space
  • The index for a document is stored in the CAN using
    the document's semantic vector as the key.
  • How it works:
  • An engine node receives document A, creates the
    semantic vector Va via LSI, and uses Va as the key
    to store A's index
  • On receiving a query, an engine node transforms the
    query into a vector Vq and routes it to the node
    whose zone contains Vq
  • When that node is reached, the query is flooded to
    the nodes within radius r
  • All receiving nodes run the query against their
    local indices via LSI (a toy end-to-end sketch
    follows)
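A toy, self-contained sketch of the publish/search flow; CAN routing is replaced here by a nearest-centre lookup and the radius-r flooding is omitted, so this is illustrative rather than the paper's actual protocol:

import numpy as np

# Toy "CAN": each engine node is represented only by the centre of its zone.
node_centres = {"n1": np.array([0.2, 0.8]),
                "n2": np.array([0.9, 0.1]),
                "n3": np.array([0.5, 0.5])}
stored = {n: [] for n in node_centres}

def owner(vec):
    """Stand-in for CAN routing: the node with the nearest centre owns the key."""
    return min(node_centres, key=lambda n: np.linalg.norm(node_centres[n] - vec))

def publish(doc_id, v_a):
    stored[owner(v_a)].append((doc_id, v_a))   # the semantic vector Va is the key

def search(v_q):
    return stored[owner(v_q)]                  # flooding within radius r omitted

publish("docA", np.array([0.25, 0.75]))
print(search(np.array([0.3, 0.7])))            # finds docA on the same node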

14
pLSI Cont
  • Relies on the inverse document frequency statistics
    and the basis of the semantic space
  • This allows nodes to compute semantic vectors of
    queries and documents independently
  • Challenges
  • Dimensionality mismatch
  • There are not enough nodes to partition the CAN
    when its dimensionality is set by LSI (l = 50-350)

15
pLSI Cont
  • Uneven distribution of indices
  • Vectors are normalized and reside on the surface
    of the unit sphere S in the semantic space
  • The similarity cos θ between A and B corresponds
    directly to their distance p on the circle, since
    on a unit sphere cos θ = cos p
  • Large search space
  • Limiting the search space in a high-dimensional
    space is difficult

16
Solving Dimensionality
  • For an l-dimensional CAN with n nodes, each node
    maintains 2l neighbors only if l < log2(n)
  • Partitioning along x dimensions produces 2^x zones,
    so partitioning all l dimensions would result in
    more zones than nodes
  • If l > log2(n) and zones are partitioned evenly,
    only about log2(n) dimensions are partitioned and
    only about log2(n) neighbors are required (a quick
    numeric check follows)
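A quick numeric check with hypothetical values of n and l (chosen only for illustration):

import math

n = 10_000   # hypothetical number of engine nodes
l = 128      # hypothetical semantic-space / CAN dimensionality

partitioned_dims = math.floor(math.log2(n))   # about 13
print(partitioned_dims, "of", l, "dimensions are ever partitioned")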

17
Solving Dimensionality Cont
  • This results in only the low dimensions of the
    semantic vectors being partitioned
  • Example
  • Suppose we have Va = (-0.1, 0.55, 0.57, -0.6) and
    Vq = (0.55, -0.1, 0.6, -0.57)
  • They have a similarity of 0.574, with the majority
    contributed by the last two elements of the vectors
    (checked below). If we have only four nodes, the
    4-dimensional CAN may be split only on the first
    two elements of the vector.
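The per-dimension contributions to the 0.574 similarity can be checked directly:

import numpy as np

Va = np.array([-0.1, 0.55, 0.57, -0.6])
Vq = np.array([0.55, -0.1, 0.6, -0.57])

per_dim = Va * Vq
print(per_dim)        # [-0.055, -0.055, 0.342, 0.342]
print(per_dim.sum())  # ~0.574, dominated by the last two dimensions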

18
Solving Dimensionality Cont
  • Solves the problem of only the low dimensions being
    partitioned
  • Rolling semantic indexes
  • Given V = (v0, ..., vl) we rotate m dimensions each
    time to produce a new vector
  • Support vector:
  • Vi = (v(i*m), ..., vl, v0, ..., v(i*m - 1)), with
    m = 2.3 ln(n)
  • Given a document A with vector Va we store its
    index at p places in the CAN, using the rotated
    vectors Vai where i = 0 ... p-1
  • The pLSI algorithm uses the correspondingly rotated
    query vectors Vqi to search and return relevant
    documents (see the sketch below)
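A minimal sketch of the rotation and the p storage keys; taking p = ceil(l/m), so that every dimension lands in the low positions of some rotation, is an assumption made here for illustration:

import math

def rotated(v, i, m):
    """Rotate the semantic vector by i*m dimensions (rolling index, space i)."""
    shift = (i * m) % len(v)
    return v[shift:] + v[:shift]

def storage_keys(v_a, n_nodes):
    """Return the rotated vectors Va0 ... Va(p-1) used as CAN keys for one index."""
    m = max(1, round(2.3 * math.log(n_nodes)))
    p = math.ceil(len(v_a) / m)
    return [rotated(v_a, i, m) for i in range(p)]

# Toy 10-dimensional semantic vector and 20 nodes (illustrative only).
print(storage_keys([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 20))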

19
Rolling Index Example
  • m = 2

20
Balancing Indexes
  • Content-aware bootstrapping
  • Forces the distribution of nodes to follow the
    distribution of indices
  • At partitioning time, the node randomly picks a
    document, computes its semantic vector, and rotates
    it by i (0 < i < p); the space is then partitioned
    across the lowest un-partitioned dimension and half
    of that zone is handed over

21
Balancing Indexes Cont
  • Advantages provided
  • A more balanced index
  • More nodes are used in the regions of the semantic
    space that hold more documents
  • Index locality
  • Assuming the documents published by a node have
    similar semantics, on space i the indices of a
    node's documents are likely to be published on the
    node itself or its neighbors.
  • Query locality
  • Assuming the documents published by a node are good
    indications of the user's queries, nodes near the
    user are more likely to hold documents relevant to
    the query.

22
(No Transcript)
23
Reducing Search Space
  • Content Directed Search
  • Search Algorithm

24
Content Directed Search
  • Uses the indices and recent queries stored on nodes
    to guide the search
  • Example

25
1-25 are the node ids, q is the query, and a-f are
document vectors. Starting at node 13, the candidate
set is N = {8, 12, 14, 18}; 14 is searched next because
its samples are most relevant, giving
N = {8, 12, 18, 9, 15, 19}; 9 is searched next, giving
N = {8, 12, 18, 15, 19, 4, 10}; 4 is searched next.
26
Algorithm Description
  • X_Z or X_{Z,Y} means X is a parameter of node Z,
    with Y as an additional attribute
  • D_Z is the set of semantic-vector indices stored at
    node Z
  • Q_Z is the set of semantic vectors of queries
    recently processed at Z
  • U_Z^i summarizes the indices stored on Z and the
    queries recently processed by Z:
  • U_Z^i = H / |H|, where
    H = sum(d for d in D_Z^i) + sum(q for q in Q_Z^i)
  • Z requests each of its neighbors P to send k_c
    document vectors similar to U_Z^i and k_r random
    samples (a numpy sketch of U_Z^i follows)
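A small numpy sketch of the summary vector under the reading above (toy vectors only):

import numpy as np

def summary_vector(stored_indices, recent_queries):
    """U = H / |H|, where H sums the stored index vectors and recent query vectors."""
    H = np.sum(np.vstack(stored_indices + recent_queries), axis=0)
    return H / np.linalg.norm(H)

docs = [np.array([0.1, 0.9]), np.array([0.2, 0.8])]
queries = [np.array([0.3, 0.7])]
print(summary_vector(docs, queries))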

27
Algorithm Cont
  • When node Z is visited for query q, Z uses the
    following to estimate how promising each neighbor P
    is for the rotated query Vqi:
  • e_i(P, Vqi) = max{ cos(d, Vqi) : d in S_{Z,P}^i }
  • This directs the search in each rotated space
  • Nodes with high estimates are visited first, and the
    search stops when T visited nodes have yielded no
    result (a scoring sketch follows)
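A sketch of the neighbour scoring, assuming samples_by_neighbour holds the sample vectors Z keeps for each neighbour P (names are hypothetical):

import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_neighbours(samples_by_neighbour, v_q):
    """Score each neighbour P by max over its samples d of cos(d, v_q),
    then visit the highest-scoring neighbours first."""
    scores = {P: max(cos(d, v_q) for d in samples)
              for P, samples in samples_by_neighbour.items()}
    return sorted(scores, key=scores.get, reverse=True)

samples = {"P1": [np.array([0.9, 0.1])],
           "P2": [np.array([0.2, 0.98]), np.array([0.5, 0.5])]}
print(rank_neighbours(samples, np.array([0.7, 0.7])))   # P2 first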

28
Experiment
  • Used the Text Retrieval Conference (TREC) corpus

29
Varying System Size
30
Varying System and Corpus
31
Content vs Query
32
With Replication