Title: Distributed Information Retrieval
1. Distributed Information Retrieval
- Server Ranking for Distributed Text Retrieval Systems on the Internet - B. Yuwono and D. Lee
- Siemens TREC-4 Report: Further Experiments with Database Merging - E. Voorhees
Brian Shaw, CS 5604
2. Issue: Merging for Effective Results
- Multiple brokers (which take search queries) and multiple collection servers
- The broker must select appropriate collection servers and merge their results
3. Server Ranking: Overview
- Problem: the cost (including the user's time and processing power) of broadcasting every query to all servers
- Solution: the broker ranks collection servers by a goodness score, broadcasts the query to at most s (sigma) collection servers (a preset number or a scoring threshold), and merges the results (see the sketch after this slide)
1- Server Ranking for Distributed Text Retrieval on the Internet
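The broker flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the goodness and search callables, the canned scores, and the final sort-by-local-score merge are all placeholders (the paper's own scoring and merging appear on the next slides).

```python
from typing import Callable, List, Tuple

def broker_search(query: List[str],
                  servers: List[str],
                  goodness: Callable[[List[str], str], float],
                  search: Callable[[str, List[str]], List[Tuple[str, float]]],
                  s: int = 3) -> List[Tuple[str, float]]:
    """Rank collection servers, query only the top-s, and pool their results."""
    # Rank collection servers by their goodness score for this query.
    ranked = sorted(servers, key=lambda srv: goodness(query, srv), reverse=True)
    selected = ranked[:s]                      # preset cutoff (could also be a score threshold)
    merged: List[Tuple[str, float]] = []
    for srv in selected:
        merged.extend(search(srv, query))      # broadcast only to the selected servers
    # Placeholder merge: sort the pooled hits by their local scores.
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

# Toy usage with canned scores and results.
scores = {"A": 2.0, "B": 0.5, "C": 1.2}
results = {"A": [("a1", 0.9), ("a2", 0.4)], "B": [("b1", 0.8)], "C": [("c1", 0.7)]}
top = broker_search(["query", "terms"], ["A", "B", "C"],
                    goodness=lambda q, srv: scores[srv],
                    search=lambda srv, q: results[srv],
                    s=2)
print(top)   # hits from A and C only, since B is not selected
```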
4. Server Ranking: Server Selection
- Relies solely on document frequency (DF) data; all collection servers must report DF changes to the broker
- The Cue Validity Variance (CVV) goodness score is based on an estimate of how well term j distinguishes one collection server from another; it is not an indication of the quantity or quality of relevant documents (a sketch of the scoring follows this slide)
1- Server Ranking for Distributed Text Retrieval on the Internet
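Below is a rough sketch of a CVV-style goodness score: the cue validity of each query term per collection, its variance across collections as the term weight, and a DF-weighted sum per server. The exact normalization in Yuwono and Lee's paper may differ, and the df and sizes structures are assumed inputs.

```python
from statistics import pvariance
from typing import Dict, List

def cvv_goodness(query: List[str],
                 df: Dict[str, Dict[str, int]],   # df[server][term] = document frequency
                 sizes: Dict[str, int]) -> Dict[str, float]:
    """Score each collection server for a query using cue validity variance."""
    servers = list(df)
    goodness = {srv: 0.0 for srv in servers}
    for term in query:
        # Cue validity of the term for each collection: its density inside that
        # collection relative to its density everywhere else.
        cv = {}
        for srv in servers:
            inside = df[srv].get(term, 0) / sizes[srv]
            other_df = sum(df[o].get(term, 0) for o in servers if o != srv)
            other_size = sum(sizes[o] for o in servers if o != srv)
            outside = other_df / other_size if other_size else 0.0
            cv[srv] = inside / (inside + outside) if (inside + outside) else 0.0
        # CVV: how much the cue validity varies across collections; terms that
        # separate collections well receive a large weight.
        cvv = pvariance(cv.values())
        for srv in servers:
            goodness[srv] += cvv * df[srv].get(term, 0)
    return goodness

# Toy usage: collection A is rich in "ir" documents, B is not.
df = {"A": {"ir": 40, "java": 2}, "B": {"ir": 5, "java": 50}}
sizes = {"A": 1000, "B": 1200}
print(cvv_goodness(["ir"], df, sizes))   # A scores higher for the query "ir"
```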
5. Server Ranking: Merging
- Assumption 1: the best document in collection i is as relevant as the best document in collection k
- Consequence: a collection server containing only a few, but highly relevant, documents still contributes to the final list.
- Assumption 2: the distance between two consecutive document ranks is inversely proportional to the collection's goodness score
- Consequence: relative goodness scores are roughly proportional to the number of documents each collection contributes to the final list.
- The final ranking is therefore a combination of the goodness score and the local rankings (a sketch follows this slide).
1- Server Ranking for Distributed Text Retrieval on the Internet
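A minimal sketch of a merge that satisfies both assumptions: every collection's top document gets the same starting score, and the gap between consecutive local ranks shrinks as goodness grows, so stronger collections push more documents toward the top. The step_scale constant and the exact scoring formula are illustrative assumptions, not the paper's.

```python
from typing import Dict, List, Tuple

def merge_by_goodness(local_results: Dict[str, List[str]],
                      goodness: Dict[str, float],
                      step_scale: float = 0.01) -> List[Tuple[str, float]]:
    """Merge locally ranked result lists: rank 1 scores the same everywhere,
    and the gap between consecutive ranks is inversely proportional to the
    collection's goodness score."""
    merged = []
    for srv, docs in local_results.items():
        step = step_scale / goodness[srv]           # inversely proportional to goodness
        for rank, doc in enumerate(docs, start=1):  # docs are in local rank order
            global_score = 1.0 - (rank - 1) * step  # rank 1 scores 1.0 in every collection
            merged.append((doc, global_score))
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

# Toy usage: the higher-goodness server contributes more documents near the top.
print(merge_by_goodness({"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
                        {"A": 4.0, "B": 1.0}))
```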
6. Experiments: Overview
- Problem: the broker has no access to meta-data from isolated collection servers
- Solution: choose collection server(s) based on the results of previous training queries
2- Further Experiments with Database Merging
7. Experiments: Server Selection, Two Approaches
- Query Clustering (QC): cluster training queries (based on the number of documents they retrieve in common), compute each cluster's centroid vector, compare the query vector to the centroid vectors, and assign a weight to each collection (see the sketch after this slide)
- Modeling Relevant Document Distributions (MRDD): find the M most similar training queries and assign weights to collections based on the relevant document distributions of their training runs
2- Further Experiments with Database Merging
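A rough sketch of the QC-style selection step, assuming training has already produced, for each query cluster, a centroid vector and a per-collection weight. Cosine similarity and the single-nearest-centroid lookup are simplifying assumptions; Voorhees' actual procedure may combine clusters differently.

```python
import math
from typing import Dict, List, Tuple

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def qc_collection_weights(query_vec: Dict[str, float],
                          clusters: List[Tuple[Dict[str, float], Dict[str, float]]]
                          ) -> Dict[str, float]:
    """Pick the training-query cluster whose centroid is most similar to the
    query and return that cluster's per-collection weights."""
    _, best_weights = max(clusters, key=lambda c: cosine(query_vec, c[0]))
    return best_weights

# Toy usage: two clusters learned from training queries (weights are made up).
clusters = [({"ir": 1.0, "ranking": 0.5}, {"A": 0.7, "B": 0.3}),
            ({"java": 1.0, "compiler": 0.8}, {"A": 0.1, "B": 0.9})]
print(qc_collection_weights({"ir": 1.0}, clusters))   # -> {'A': 0.7, 'B': 0.3}
```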
8. Experiments: Merging
- The number of documents retrieved from each server is determined by its weight
- The final ranking is a random process: roll a C-faced die, biased by the number of documents still to be picked from each of the C collections, to decide which collection contributes the next document (a sketch follows this slide)
2- Further Experiments with Database Merging
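A sketch of that biased-die interleaving, assuming each collection's quota of documents has already been derived from its weight. The quotas argument and the fixed random seed are illustrative.

```python
import random
from typing import Dict, List

def biased_die_merge(local_results: Dict[str, List[str]],
                     quotas: Dict[str, int],
                     seed: int = 0) -> List[str]:
    """Interleave locally ranked lists by repeatedly rolling a die whose faces
    are the collections, weighted by how many documents each still owes."""
    rng = random.Random(seed)
    remaining = dict(quotas)                 # documents still to be picked per collection
    cursors = {srv: 0 for srv in local_results}
    merged: List[str] = []
    while any(remaining.values()):
        servers = [srv for srv, left in remaining.items() if left > 0]
        weights = [remaining[srv] for srv in servers]
        srv = rng.choices(servers, weights=weights, k=1)[0]   # the biased die roll
        merged.append(local_results[srv][cursors[srv]])       # next-best local document
        cursors[srv] += 1
        remaining[srv] -= 1
    return merged

# Toy usage: collection A owes 3 documents, B owes 1.
print(biased_die_merge({"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]},
                       {"A": 3, "B": 1}))
```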
9. Comparison
                               1 - Server Ranking               2 - Experiments
  Broker's knowledge           Shared document frequency data   Training query results
  Collection server selection  CVV goodness scoring             Comparison to training queries
  Merging                      Goodness score and local rank    Random
10. Conclusions
- The server ranking method proposed by Yuwono and Lee is an effective way to minimize operating costs (such as time) in an environment where brokers and collection servers can share document frequency data.
- The isolated merging strategies proposed by Voorhees are an effective way to choose collection servers when no meta-information is shared between the broker and the collection servers.