A Logistic Regression Approach to Distributed IR
Ray R. Larson, School of Information Management and Systems, University of California, Berkeley
ray_at_sherlock.berkeley.edu
The Problem
Distributed IR Tasks
Our Approach Using Z39.50
MetaSearch
- A new approach to building metasearch based on Z39.50.
- Instead of using broadcast search we are using two Z39.50 services:
  -- Identification of database metadata using Z39.50 Explain
  -- Extraction of distributed indexes using Z39.50 SCAN
- Creation of Collection Documents using index contents.
Evaluation Questions
-- How efficiently can we build distributed indexes?
-- How effectively can we choose databases using the index?
-- How effective is merging search results from multiple sources?
-- Do hierarchies of servers (general/meta-topical/individual) work?
- Resource Description: how to collect metadata about digital libraries and their collections or databases.
- Resource Selection: how to select relevant digital library collections or databases from a large number of databases.
- Distributed Search: how to perform parallel or sequential searching over the selected digital library databases.
- Data Fusion: how to merge query results from different digital libraries with their different search engines, differing record structures, etc.
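The data fusion task above can be illustrated with a simple score-normalization merge (a minimal sketch for illustration only, not the method used in this work; the function names are ours):

```python
def minmax_normalize(results):
    """Rescale one engine's scores into [0, 1] so engines with different
    scoring schemes (tf-idf weights, probabilities, etc.) are comparable."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in results}

def fuse(result_lists):
    """Merge ranked lists from several engines, summing normalized scores
    for documents returned by more than one source (CombSUM-style)."""
    merged = {}
    for results in result_lists:
        for doc, score in minmax_normalize(results).items():
            merged[doc] = merged.get(doc, 0.0) + score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse([
    [("d1", 12.0), ("d2", 7.5), ("d3", 3.0)],  # engine A: raw similarity scores
    [("d2", 0.9), ("d4", 0.4)],                # engine B: probabilities
])
# d2 rises to the top because two independent sources returned it.
```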
- Hundreds or thousands of servers with databases ranging widely in content, topic, and format.
- Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results.
- How to select the best ones to search?
  -- What to search first
  -- Which to search next
- Topical/domain constraints on the search selections.
- Variable contents of databases (metadata only, full text).
Probabilistic Retrieval Using Logistic Regression
Distributed Retrieval Testing and Results
- Tested using the collection representatives as harvested over the network and the TIPSTER relevance judgements.
- Testing by comparing our approach to known algorithms for ranking collections.
- Results (preliminary) were measured against reported results for the Ideal and CORI algorithms and against the optimal Relevance-Based Ranking (MAX).
- Recall analog: how many of the relevant documents occurred in the top n databases, averaged across queries.
We attempt to estimate the probability of
relevance for a given collection with respect to
a query using the Logistic Regression method
developed at Berkeley (W. Cooper, F. Gey, D.
Dabney, A. Chen), with a new algorithm for weight
calculation at retrieval time. We use logistic
regression over a sample set of documents to
determine the values of the coefficients. At
retrieval time the probability of relevance for a
particular query Q and a collection C is
estimated by the fitted model.
CORI Ranking
Comparative Evaluation
Effectiveness Measures
The probabilities are actually calculated as log
odds and then converted to probabilities. The ci
coefficients were estimated separately for three
query types (during retrieval, the length of the
query was used to differentiate these).
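In generic logistic-regression terms, the log-odds form and the conversion mentioned above are as follows (a sketch of the standard form; the specific collection features x_i and coefficients used by the Berkeley method are not reproduced here):

\[
\log O(R \mid Q, C) = c_0 + \sum_{i=1}^{n} c_i\, x_i(Q, C),
\qquad
P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}
\]

Collections are then ranked for a query by their estimated P(R | Q, C).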
Results and Discussion
The figures to the right summarize our results
from the preliminary evaluation. The X axis is
the number of collections in the ranking and the
Y axis is a Recall analog that measures the
proportion of the total possible relevant
documents that have been accumulated in the top N
databases, averaged across all of the
queries. The MAX line is the optimal result,
where the collections are ranked in order
of the number of relevant documents they contain.
Ideal(0) is an implementation of the GlOSS
"Ideal" algorithm, and CORI is an implementation
of Callan's inference net approach. The Prob line
is the logistic regression method (described to
the left). For Title queries the described method
performs slightly better than the CORI algorithm
for up to about 100 collections, where CORI
exceeds it. For Long queries our method is
virtually identical to CORI, and CORI performs
better for Very Long queries. Both CORI and the
logistic regression method outperform the
Ideal(0) implementation.
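The Recall analog on the Y axis can be written out explicitly for a single query (a sketch following the usual collection-ranking merit measure; variable names are ours, and the per-query values are averaged across all queries in the evaluation):

```python
def recall_analog(ranking, rel_counts, n):
    """R-hat at n: the fraction of all relevant documents for one query
    that are contained in the top-n collections of the ranking.
    rel_counts maps collection -> number of relevant docs it holds."""
    total = sum(rel_counts.values())
    if total == 0:
        return 0.0
    got = sum(rel_counts.get(c, 0) for c in ranking[:n])
    return got / total

# Toy query: three collections hold 5, 3, and 2 relevant documents.
rel = {"c1": 5, "c2": 3, "c3": 2}
r_at_2 = recall_analog(["c2", "c1", "c3"], rel, 2)  # (3 + 5) / 10
```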
Test Database Characteristics
We used collections formed by dividing the
documents on TIPSTER disks 1, 2, and 3 into 236
sets based on source and month (using the same
contents as in the evaluations by Powell, French,
and Callan). The query set used was TREC queries
51-150. Collection relevance information was
based on whether any documents in the collection
were relevant according to the relevance
judgements for TREC queries 51-150. The relevance
information was used both for estimating the
logistic regression coefficients (using a sample
of the data) and for the evaluation (with the full
data).
Application and Further Research
The method described here is being applied to two
distributed systems of servers in the UK. The
first, the Distributed Archives Hub, will be made
up of individual servers containing archival
descriptions in the EAD (Encoded Archival
Description) DTD. The second, MerseyLibraries.org,
is a consortium of university and public libraries
in the Merseyside area. In both cases the method
described here is being used to build a central
index to provide efficient distributed search
over the various servers. The basic model is
shown below: individual database servers will be
harvested to create (potentially) a hierarchy of
servers used to intelligently route queries to
the databases most likely to contain relevant
materials. We are also continuing to refine
both the harvesting method and the collection
ranking algorithm. We believe that additional
collection and collection-document statistics
may provide a better ranking of results and thus
more effective routing of queries.
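The routing model described above can be sketched as ranking the harvested collections by estimated probability of relevance and forwarding the query only to the top-ranked servers (a minimal illustration; server names are hypothetical and the regression scoring is stubbed as precomputed log odds):

```python
import math

def probability_of_relevance(log_odds):
    """Convert a log-odds score from the fitted regression into a probability."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

def route_query(query, collection_log_odds, k=3):
    """Pick the k collections most likely to contain relevant material.
    collection_log_odds maps collection -> log odds for this query; in the
    real system these come from the regression over the harvested
    collection documents (stubbed here with fixed values)."""
    ranked = sorted(collection_log_odds,
                    key=collection_log_odds.get, reverse=True)
    return ranked[:k]

servers = {"archives-hub": 1.2, "mersey-libs": -0.3, "other": 0.1}
targets = route_query("medieval charters", servers, k=2)
# Only the two most promising servers receive the query,
# avoiding a full broadcast search.
```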
TREC Disk  Source        Size (MB)  Size (docs)
1          WSJ (86-89)   270        98,732
1          AP (89)       259        84,678
1          ZIFF          245        75,180
1          FR (89)       262        25,960
2          WSJ (90-92)   247        74,520
2          AP (88)       241        79,919
2          ZIFF          178        56,920
2          FR (88)       211        19,860
3          AP (90)       242        78,321
3          SJMN (91)     290        90,257
3          PAT           245        6,711
Totals                   2,690      691,058
TREC Disk  Source        Num DB       Total per Disk
1          WSJ (86-89)   29
1          AP (89)       12           Disk 1: 67
1          ZIFF          14
1          FR (89)       12
2          WSJ (90-92)   22
2          AP (88)       11           Disk 2: 54
2          ZIFF          11 (1 dup)
2          FR (88)       10
3          AP (90)       12
3          SJMN (91)     12           Disk 3: 116
3          PAT           92
Totals                   237 - 1 dup = 236
Data Harvesting and Collection Document Creation
- For all servers (could be a topical subset):
  -- Get Explain information to find which indexes are supported and the collection statistics.
- For each index:
  -- Use SCAN to extract terms and frequency information.
  -- Add term frequency, source index, and database metadata to the metasearch XML Collection Document.
- Index collection documents for retrieval by the above algorithm.
- Planned Extensions:
  -- Post-process indexes (especially Geo Names, etc.) for special types of data, e.g. create geographical coverage indexes.
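The harvesting steps above can be sketched as follows. This is a minimal illustration: the Z39.50 SCAN call is stubbed with canned data rather than a real protocol client, and the XML element names are ours, not the actual Collection Document schema.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def scan_index(server, index):
    """Stub for a Z39.50 SCAN request: returns (term, frequency) pairs.
    A real implementation would page through the server's term list."""
    return [("logistic", 42), ("regression", 37)]  # canned example data

def build_collection_document(server, indexes):
    """Assemble a metasearch XML Collection Document for one server from
    the terms and frequencies extracted via SCAN on each supported index
    (the list of supported indexes would come from Z39.50 Explain)."""
    doc = Element("collectionDocument", server=server)
    for index in indexes:
        idx_el = SubElement(doc, "index", name=index)
        for term, freq in scan_index(server, index):
            SubElement(idx_el, "term", freq=str(freq)).text = term
    return doc

xml = tostring(build_collection_document("z3950.example.org", ["title"]))
```

The resulting documents are then indexed like ordinary documents, so that collection selection reduces to a retrieval run over them.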
Acknowledgements
This research was sponsored at U.C. Berkeley and
the University of Liverpool by the National
Science Foundation and the Joint Information
Systems Committee (UK) under the International
Digital Libraries Program, award
IIS-99755164. James French and Allison Powell
kindly provided the CORI and Ideal(0) results
used in the evaluation.