Title: P2P Information Retrieval Using Semantic Overlay Networks
1. P2P Information Retrieval Using Semantic Overlay Networks
- Authors: Tang (Univ. of Rochester), Xu (HP Labs), Dwarkadas (Univ. of Rochester)
2. Problems with Current P2P
- Content can only be searched by keyword.
- Documents are not semantically organized.
- Ways to search:
  - Centralized search relies on a single index server.
  - Decentralized search floods nodes with query messages.
3. How to Solve?
- pSearch: a decentralized, non-flooding search algorithm built on:
  - VSM (Vector Space Model)
  - LSI (Latent Semantic Indexing)
  - a CAN (Content-Addressable Network)
4. Vector Space Model (VSM)
- Represents documents and search terms as vectors.
- When searching, documents are ranked according to the similarity of each document vector to the query vector.
- Given vectors X and Y of dimensionality k, similarity is the cosine of the angle between them:
  cos(X, Y) = ( Σ_{i=0}^{k} X_i · Y_i ) / ( ‖X‖ · ‖Y‖ )
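As a concrete illustration (not from the slides), a minimal cosine-similarity ranking in Python; the function name and sample vectors are invented:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Rank documents against a query by similarity, highest first.
query = [0.3, 0.0, 0.9]
docs = {"doc1": [0.2, 0.1, 0.8], "doc2": [0.9, 0.1, 0.0]}
ranked = sorted(docs, key=lambda d: cosine_similarity(docs[d], query), reverse=True)
print(ranked)  # ['doc1', 'doc2']: doc1 points in nearly the same direction as the query
```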
5. VSM Cont.
- Represents a corpus as a t x d matrix A:
  - t is the number of search terms in the corpus
  - d is the number of documents in the corpus
  - a_rj is the importance of term r in document j
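A sketch of building such a matrix in Python, using raw term-frequency weights (the slides do not specify the weighting scheme; tf-idf is common in practice):

```python
import numpy as np

# Toy corpus: three documents, one line each.
corpus = ["peer to peer search", "semantic overlay network", "peer overlay"]
terms = sorted({w for doc in corpus for w in doc.split()})

# A is t x d: rows are terms, columns are documents;
# A[r, j] is the raw frequency of term r in document j.
A = np.zeros((len(terms), len(corpus)))
for j, doc in enumerate(corpus):
    for w in doc.split():
        A[terms.index(w), j] += 1
print(terms)
print(A)
```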
6. Latent Semantic Indexing
- Uses statistically derived conceptual indices.
- Based on Singular Value Decomposition (SVD).
- Transforms the high-dimensional vectors from VSM into lower-dimensional semantic vectors.
7. What SVD Does
- Takes the matrix A from the VSM and decomposes it into the product of three matrices:
  A = U E V^T
  - U = [u_1 ... u_r], a t x r matrix
  - E = diag(e_1, ..., e_r), an r x r diagonal matrix
  - the e_j are A's singular values
  - V = [v_1 ... v_r], a d x r matrix
8. LSI Cont.
- What LSI does with SVD:
  - Approximates the rank-r matrix A by a matrix A_l of lower rank l:
    A_l = U_l E_l V_l^T
  - Omits all but the l largest singular values.
  - The columns of E_l V_l^T (equivalently, the rows of V_l E_l) are the semantic vectors of the documents.
  - Given U_l, E_l, and V_l, semantic vectors of queries, documents, or terms that were not originally in the space can be folded into the space.
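A minimal NumPy sketch of this truncation, assuming the standard LSI fold-in projection q -> U_l^T q (variable names are mine):

```python
import numpy as np

A = np.random.rand(6, 4)   # toy t x d term-document matrix (t=6 terms, d=4 docs)
l = 2                      # target rank

U, e, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(e) @ Vt
U_l, E_l, Vt_l = U[:, :l], np.diag(e[:l]), Vt[:l, :]

# Semantic vectors of the documents: the columns of E_l @ Vt_l (one per document).
doc_semantic = E_l @ Vt_l              # l x d

# Fold a new t-dimensional vector q (a query, document, or term) into the space.
q = np.random.rand(6)
q_semantic = U_l.T @ q                 # l-dimensional semantic vector
print(doc_semantic.shape, q_semantic.shape)   # (2, 4) (2,)
```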
9. LSI Cont.
- By choosing an appropriate l, the noise carried by the smaller singular values is eliminated.
- Studies suggest l between 50 and 350 works best.
10. Content-Addressable Networks
- Properties of a CAN:
  - A fault-tolerant distributed hash table.
  - Partitions a d-dimensional Cartesian space into zones and assigns each zone to a node.
  - Locating a document reduces to routing to the node whose zone contains that document's key.
11. Finding Something in a CAN
- D joins the network and takes over part of C's zone.
- D is looking for a document with key (0.4, 0.1).
- The lookup is routed greedily toward the key: to E, then to A, whose zone contains it.
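A self-contained sketch of CAN-style greedy routing in Python; the zone layout and neighbor lists are invented, so the printed path differs from the slide's figure:

```python
def route(nodes, start, key):
    """Greedily forward toward the node whose zone contains `key`.

    nodes maps a node id to (zone, neighbors); a zone is
    ((x_lo, x_hi), (y_lo, y_hi)) and neighbors is a list of node ids.
    """
    def contains(zone, p):
        return all(lo <= c < hi for (lo, hi), c in zip(zone, p))

    def dist2(zone, p):   # squared distance from the zone's center to point p
        return sum(((lo + hi) / 2 - c) ** 2 for (lo, hi), c in zip(zone, p))

    current, path = start, [start]
    while not contains(nodes[current][0], key):
        current = min(nodes[current][1], key=lambda n: dist2(nodes[n][0], key))
        path.append(current)
    return path

# Four nodes tiling the unit square; D looks up key (0.4, 0.1).
nodes = {
    "A": (((0.0, 0.5), (0.0, 0.5)), ["E", "C"]),
    "E": (((0.0, 0.5), (0.5, 1.0)), ["A", "D"]),
    "D": (((0.5, 1.0), (0.5, 1.0)), ["E", "C"]),
    "C": (((0.5, 1.0), (0.0, 0.5)), ["A", "D"]),
}
print(route(nodes, "D", (0.4, 0.1)))  # ['D', 'C', 'A'] with this toy layout
```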
12. pSearch
- pLSI: LSI modified for pSearch.
- A large number of machines are organized into a semantic overlay; this forms the pSearch engine.
- The engine is a subset of nodes with good connectivity.
13. pLSI Algorithm
- Sets the dimensionality of the CAN to be the dimensionality of the semantic space.
- The index for a document is stored in the CAN using the document's semantic vector as the key.
- How it works (see the sketch after this list):
  - An engine node receives document A, creates the semantic vector Va via LSI, and uses Va as the key for storing A's index.
  - On receiving a query, an engine node transforms the query into a vector Vq and routes it to the node whose zone contains Vq.
  - When that node is reached, the query is flooded to all nodes within radius r.
  - All receiving nodes run the query locally via LSI.
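A self-contained toy simulation of this publish/search flow; nearest-center regions stand in for real CAN zones, and all names are mine:

```python
import math

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

# Toy engine: each node "owns" the region of semantic space nearest its center.
nodes = {name: {"center": c, "indices": []}
         for name, c in [("n0", (0.9, 0.1)), ("n1", (0.1, 0.9)), ("n2", (0.7, 0.7))]}

def route(v):   # stand-in for CAN routing: find the node owning vector v
    return min(nodes, key=lambda n: math.dist(nodes[n]["center"], v))

def publish(doc_id, v_a):           # store the index keyed by the semantic vector
    nodes[route(v_a)]["indices"].append((doc_id, v_a))

def search(v_q, r=0.5):
    target = nodes[route(v_q)]["center"]
    hits = []                       # flood: visit every node within radius r of target
    for node in nodes.values():
        if math.dist(node["center"], target) <= r:
            hits += [(cos_sim(v, v_q), d) for d, v in node["indices"]]
    return sorted(hits, reverse=True)   # rank results by cosine similarity

publish("docA", (0.8, 0.2))
publish("docB", (0.2, 0.9))
print(search((0.85, 0.15)))   # docA is found and ranked first
```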
14. pLSI Cont.
- Relies on globally shared inverse document frequency statistics and a shared basis for the semantic space.
- This allows nodes to compute semantic vectors of queries and documents independently.
- Challenges:
  - Dimensionality mismatch: there are not enough nodes to partition the CAN when its dimensionality is the 50-350 set by LSI.
15. pLSI Cont.
- Uneven distribution of indices:
  - Vectors are normalized and reside on the surface of the unit sphere S in the semantic space.
  - The similarity cos θ between two vectors A and B corresponds directly to their distance ρ on the sphere, since on a unit sphere θ = ρ and hence cos θ = cos ρ.
- Large search space:
  - Limiting the search space in a high-dimensional space is difficult.
16. Solving Dimensionality
- For an l-dimensional CAN network with n nodes, each node maintains 2l neighbors only if l < log2(n):
  - partitioning along x dimensions produces 2^x zones,
  - so partitioning along all l dimensions would result in more zones than nodes.
- If l > log2(n) and zones are partitioned evenly, then only log2(n) neighbors are required.
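A quick numeric check of this bound (my own illustration, not from the slides):

```python
import math

n = 10_000   # nodes in the engine
l = 300      # LSI dimensionality, within the suggested 50-350 range

partitionable = math.floor(math.log2(n))
print(partitionable)   # 13 -- only ~13 of the 300 dimensions can be partitioned
```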
17. Solving Dimensionality Cont.
- The consequence is that only the low dimensions of the semantic vectors are partitioned.
- Example (the arithmetic is checked in the snippet below):
  - Suppose we have Va = (-0.1, 0.55, 0.57, -0.6) and Vq = (0.55, -0.1, 0.6, -0.57).
  - They have a similarity of 0.574, the majority of which is contributed by the last two elements of the vectors. Yet with only four nodes, the 4-dimensional CAN may be split only on the first two elements.
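Verifying the slide's arithmetic:

```python
va = (-0.1, 0.55, 0.57, -0.6)
vq = (0.55, -0.1, 0.6, -0.57)

dot = sum(a * q for a, q in zip(va, vq))
print(round(dot, 3))                           # 0.574 (both vectors are ~unit length)
print(round(va[2] * vq[2] + va[3] * vq[3], 3)) # 0.684 comes from the last two elements
print(round(va[0] * vq[0] + va[1] * vq[1], 3)) # -0.11 from the first two
```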
18. Solving Dimensionality Cont.
- Solution to partitioning only the low dimensions: rolling semantic indexes.
  - Given V = (v_0, ..., v_{l-1}), rotate m dimensions each time to produce a new vector:
    V_i = (v_{i·m}, ..., v_{l-1}, v_0, ..., v_{i·m-1}), with m = 2.3 ln(n)
  - Given a document A with vector Va, we store its index at p places in the CAN, keyed by the rotations Va_i for i = 0, ..., p-1.
  - The pLSI algorithm uses the corresponding rotated query vectors Vq_i to search and return the relevant documents.
19. Rolling Index Example (figure)
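The original slide presents this as a figure; in its place, a small sketch of the rotation itself (my own illustration):

```python
import math

def rotations(v, n, p):
    """Return p rotated versions of semantic vector v for an n-node engine.

    Each rotation shifts the vector by m = 2.3 ln(n) dimensions, so a different
    slice of v lands in the low (partitioned) dimensions of the CAN.
    """
    m = max(1, round(2.3 * math.log(n)))
    l = len(v)
    return [tuple(v[(i * m + j) % l] for j in range(l)) for i in range(p)]

v = tuple(x / 10 for x in range(8))   # toy 8-dimensional semantic vector
for v_i in rotations(v, n=16, p=3):   # m = round(2.3 * ln 16) = 6
    print(v_i)
```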
20. Balancing Indexes
- Content-aware bootstrapping (sketched below):
  - Forces the distribution of nodes to follow the distribution of indices.
  - At partitioning time, the joining node randomly picks a document, computes its semantic vector, and rotates it by i (0 ≤ i < p); the space is then partitioned across the lowest un-partitioned dimension, and the owner hands over half of that zone.
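A rough sketch of this join step, with 1-D intervals standing in for full CAN zones; the function and data names are mine, and this is only an approximation of the procedure the slide describes:

```python
import random

def bootstrap_join(zones, published_vectors, p, m):
    """Join the CAN near the semantic vector of a randomly chosen document.

    zones is a list of (lo, hi) intervals modeling one partitioned dimension;
    splitting where indices are dense makes node density follow index density.
    """
    v = random.choice(published_vectors)     # a document this node publishes
    i = random.randrange(p)                  # random rotation, 0 <= i < p
    key = v[(i * m) % len(v)]                # coordinate in rotation i's space
    for idx, (lo, hi) in enumerate(zones):
        if lo <= key < hi:
            mid = (lo + hi) / 2              # split along the lowest un-partitioned
            zones[idx] = (lo, mid)           # dimension; the owner keeps one half...
            zones.append((mid, hi))          # ...and hands the other to the joiner
            return zones

zones = [(0.0, 1.0)]
vectors = [(0.2, 0.8, 0.3), (0.25, 0.7, 0.35)]
print(bootstrap_join(zones, vectors, p=2, m=1))
```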
21. Balancing Indexes Cont.
- Advantages provided:
  - A more balanced index.
  - A larger number of nodes serves the regions of semantic space with a greater number of documents.
  - Index locality: assuming the documents published by a node have similar semantics, on space i the indices of a node's documents are likely to be published on the node itself or on its neighbors.
  - Query locality: assuming the documents published by a node are good indications of its user's queries, more nodes near the user are likely to have documents relevant to the query.
23. Reducing Search Space
- Content-directed search
- The search algorithm
24. Content-Directed Search
- Use the indices and recent queries stored on nodes to guide the search.
- Example:
25. (Example figure) 1-25 are the node ids, q is the query, and a-f are document vectors.
- Starting at N13, the candidate set is {N8, N12, N14, N18}; N14 is searched next because its samples are most relevant.
- The set becomes {N8, N12, N18, N9, N15, N19}; N9 is searched next.
- The set becomes {N8, N12, N18, N15, N19, N4, N10}; N4 is searched next.
26. Algorithm Description
- Notation: X_Z (or X_{Z,Y}) means X is a parameter of node Z, with Y as an additional attribute.
- D_Z is the set of semantic-vector indices stored at node Z.
- Q_Z is the set of semantic vectors of queries recently processed at Z.
- U_Z^i summarizes the indices stored on Z and the queries recently processed by Z:
  U_Z^i = H / ‖H‖, where H = Σ_{d ∈ D_Z^i} d + Σ_{q ∈ Q_Z^i} q
- Z requests each of its neighbors P to send k_c document vectors similar to U_Z^i and k_r random samples.
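A sketch of computing this summary vector with NumPy (the array names are mine):

```python
import numpy as np

def summary_vector(indices, recent_queries):
    """U_Z^i: the normalized sum of Z's stored indices and recent queries."""
    h = np.sum(indices, axis=0) + np.sum(recent_queries, axis=0)
    return h / np.linalg.norm(h)

D_Z = np.array([[0.1, 0.9], [0.2, 0.8]])   # semantic vectors of stored indices
Q_Z = np.array([[0.15, 0.85]])             # semantic vectors of recent queries
print(summary_vector(D_Z, Q_Z))
```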
27. Algorithm Cont.
- When node Z processes a search for query q, Z uses the samples S_{Z,P}^i to estimate how well each neighbor P matches the rotated query vector Vq^i:
  e^i(P, Vq^i) = max{ cos(d, Vq^i) : d ∈ S_{Z,P}^i }
- This directs the search in each rotated space.
- Nodes with high estimates are visited first; the search stops when T visited nodes have yielded no result.
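A minimal sketch of this neighbor scoring and the resulting best-first visit order (sample data and names are mine):

```python
import math

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def neighbor_score(samples, v_q):
    """e^i(P, Vq^i): the best cosine similarity among P's sampled vectors."""
    return max(cos_sim(d, v_q) for d in samples)

v_q = (0.2, 0.98)
samples = {                 # S_{Z,P}^i: vectors sampled from each neighbor P
    "P1": [(0.9, 0.1), (0.8, 0.3)],
    "P2": [(0.1, 0.95), (0.3, 0.9)],
}
order = sorted(samples, key=lambda p: neighbor_score(samples[p], v_q), reverse=True)
print(order)   # ['P2', 'P1']: P2's samples are closest to the query
```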
28. Experiment
- Used the Text REtrieval Conference (TREC) corpus.
29. Varying System Size (results figure)
30. Varying System and Corpus (results figure)
31. Content vs. Query (results figure)
32. With Replication (results figure)