Title: Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
1. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
- Panagiotis G. Ipeirotis
- Luis Gravano
Computer Science Department, Columbia University
2. Distributed Search? Why? Surface Web vs. Hidden Web
- Surface Web
- Link structure
- Crawlable
- Documents indexed by search engines
- Hidden Web
- No link structure
- Documents hidden in databases
- Documents not indexed by search engines
- Need to query each collection individually
3. Hidden Web Examples
- PubMed search: diabetes
  - 178,975 matches
  - PubMed is at http://www.ncbi.nlm.nih.gov/PubMed
- Google search: diabetes site:www.ncbi.nlm.nih.gov
  - Only 119 matches
4. Distributed Search Challenges
- Select good databases for query
- Evaluate query at these databases
- Merge results from databases
[Diagram: a Metasearcher mediating between the user and Hidden-Web databases such as PubMed, the Library of Congress, and ESPN]
5. Database Selection Problems
- How to extract content summaries?
- How to use the extracted content summaries?

[Diagram: the Metasearcher routes the query "cancer" using per-database content summaries]
  Web Database 1: basketball 4, cancer 4,532, cpu 23
  Web Database 2: basketball 4, cancer 60,298, cpu 0
  Web Database 3: basketball 6,340, cancer 2, cpu 0
6. Extracting Content Summaries from Web Databases
- No direct access to remote documents other than by querying
- Resort to query-based document sampling:
- Send queries to database
- Retrieve document sample
- Use sample to create approximate content summary
7. Random Query-Based Sampling
- Pick a word and send it as a query to the database
- Retrieve the top-k documents returned (e.g., k = 4)
- Repeat until enough documents (e.g., 300) are retrieved
(Callan et al., SIGMOD'99; TOIS 2001)
Word         Frequency in Sample
cancer       150 (out of 300)
aids         114 (out of 300)
heart         98 (out of 300)
basketball     2 (out of 300)

Use word frequencies in the sample to create the content summary
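The sampling loop above can be sketched over a toy in-memory stand-in for a database (the `docs` dict and the word pool are hypothetical; a real Hidden-Web database is reachable only through its query interface):

```python
import random
from collections import Counter

def random_sampling(docs, dictionary, k=4, sample_size=10, seed=0):
    """Toy sketch of random query-based sampling.

    `docs` is a hypothetical stand-in for a Hidden-Web database:
    a dict mapping doc_id -> set of words.
    """
    rng = random.Random(seed)
    sample = set()
    attempts = 0
    while len(sample) < sample_size and attempts < 100 * sample_size:
        word = rng.choice(dictionary)                       # random query word
        matches = [d for d, ws in docs.items() if word in ws]
        sample.update(matches[:k])                          # keep top-k docs
        attempts += 1
    # Document frequencies within the sample form the approximate summary
    summary = Counter()
    for d in sample:
        summary.update(docs[d])
    return summary, len(sample)
```

Note that the summary holds frequencies *in the sample* only, which is exactly the limitation the next slide discusses.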
8. Random Sampling: Problems
- No actual word frequencies computed for content summaries, only a ranking of words
- Many words missing from content summaries (many rare words)
- Many queries return very few or no matches
- Many words appear in only one or two documents
9. Our Technique: Focused Probing
- Train document classifiers
- Find representative words for each category
- Use classifier rules to derive a topically-focused sample from the database
- Estimate actual document frequencies for all discovered words
10. Focused Probing: Training
- Start with a predefined topic hierarchy and preclassified documents
- Train document classifiers for each node
- Extract rules from classifiers:
  - ibm AND computers → Computers
  - lung AND cancer → Health
  - angina → Heart
  - hepatitis AND liver → Hepatitis

[Diagram: topic hierarchy with Root and a Health subtree] (SIGMOD 2001)
11. Focused Probing: Sampling
- Transform each rule into a query
- For each query
- Send to database
- Record number of matches
- Retrieve top-k matching documents
- At the end of each round:
- Analyze matches for each category
- Choose category to focus on
Sampling proceeds in rounds. In each round, the rules associated with each node are turned into queries for the database.
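A minimal sketch of these rounds, with hypothetical `docs` and `rule_tree` structures standing in for the database and the trained classifier rules:

```python
from collections import Counter

def focused_probing_sample(docs, rule_tree, k=4):
    """Toy sketch of the Focused Probing sampling rounds.

    `docs` maps doc_id -> set of words; `rule_tree` maps a node name to
    (rules, children), where each rule is (category, set-of-words) and is
    turned into a conjunctive query. All structures are hypothetical.
    """
    sample = set()
    matches_seen = {}
    node = "Root"
    while True:
        rules, children = rule_tree[node]
        counts = Counter()
        for category, words in rules:                    # each rule -> query
            hits = [d for d, ws in docs.items() if words <= ws]
            counts[category] += len(hits)                # record no. of matches
            sample.update(hits[:k])                      # retrieve top-k docs
        matches_seen[node] = counts
        if not children:                                 # reached a leaf
            break
        node = max(children, key=lambda c: counts[c])    # focus on best child
    return sample, matches_seen
```

The key design point the slide makes is visible here: the category chosen for the next round is the one whose rules matched the most documents, so sampling drills into the database's dense topic areas.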
12. Sample Frequencies and Actual Frequencies
- liver appears in 200 out of 300 documents in the sample
- kidney appears in 100 out of 300 documents in the sample
- hepatitis appears in 30 out of 300 documents in the sample
What are the document frequencies in the actual database?
- The query "liver" returned 140,000 matches
- The query "hepatitis" returned 20,000 matches
- "kidney" was not a query probe
We can exploit the number of matches returned by one-word queries
13. Adjusting Document Frequencies
- We know the ranking r of words according to document frequency in the sample
- We know the absolute document frequency f of some words from one-word queries
- Mandelbrot's formula empirically connects word frequency f and rank r: f = P (r + p)^(-B)
- We use curve-fitting to estimate the absolute frequency of all words in the sample

[Plot: word frequency f vs. rank r, with the fitted curve through the known points]
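A minimal sketch of the curve-fitting step. It simplifies Mandelbrot's formula f = P(r + p)^(-B) to the pure Zipf case p = 0, so the fit reduces to ordinary least squares in log-log space; the ranks used in the test below are illustrative:

```python
import math

def fit_power_law(known):
    """Fit f(r) = P * r**(-B) to (rank, absolute frequency) pairs learned
    from one-word query probes. A simplified stand-in for the Mandelbrot
    fit f(r) = P * (r + p)**(-B), with the shift p fixed at 0: take logs
    and run ordinary least squares in log-log space.
    """
    xs = [math.log(r) for r, _ in known]
    ys = [math.log(f) for _, f in known]
    n = len(known)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope   # P, B

def estimate_frequency(rank, P, B):
    """Estimated absolute document frequency of the word at `rank`."""
    return P * rank ** (-B)
```

With the slide's numbers (liver: 140,000 matches; hepatitis: 20,000 matches) anchoring the curve, the fit interpolates an absolute frequency for words like kidney that were never sent as probes.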
14. Actual PubMed Content Summary
PubMed content summary
  Number of documents: 3,868,552
  Category: Health, Diseases
  cancer      1,398,178
  aids          106,512
  heart         281,506
  hepatitis      23,481
  basketball        907
  cpu               487
- Extracted automatically
- 27,500 words in extracted content summary
- Fewer than 200 queries sent
- At most 4 documents retrieved per query
The extracted content summary accurately represents the size, contents, and classification of the database
15. Focused Probing: Contributions
- Focuses database sampling on dense topic areas
- Estimates absolute document frequencies of words
- Classifies databases along the way
- Classification useful for database selection
16. Database Selection Problems
- How to extract content summaries?
- How to use the extracted content summaries?

[Diagram, repeated from slide 5: the Metasearcher routes the query "cancer" using per-database content summaries]
17. Database Selection and Extracted Content Summaries
- Database selection algorithms assume complete content summaries
- Content summaries extracted by (small-scale) sampling are inherently incomplete (Zipf's law)
- Queries with undiscovered words are problematic

Database classification helps: similar topics → similar content summaries, so extracted content summaries complement each other
18. Content Summaries for Categories: Example
- CancerLit contains "metastasis", not found during sampling
- CancerBacup contains "diabetes", not found during sampling
- The Cancer category content summary contains both
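The category summary can be sketched as a plain merge of member-database summaries (the database names come from the slide; the counts are invented, and summing is just one simple merge policy):

```python
from collections import Counter

# Hypothetical per-database content summaries (word -> document frequency);
# the database names come from the slide, the counts are invented.
cancerlit = Counter({"cancer": 60298, "metastasis": 1842})
cancerbacup = Counter({"cancer": 4532, "diabetes": 210})

# The Cancer category summary merges its member databases, so it covers
# words that sampling missed in any single database.
cancer_category = cancerlit + cancerbacup
```

The merged summary contains both "metastasis" and "diabetes", even though each word was discovered in only one database's sample.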
19. Hierarchical DB Selection: Outline
- Create aggregated content summaries for categories
- Hierarchically direct queries using categories
- Category content summaries are more complete than database content summaries

Various traversal techniques are possible
20. Hierarchical DB Selection: Example
- To select D databases:
  - Use a flat DB selection algorithm to score categories
  - Proceed to the category with the highest score
  - Repeat until the category is a leaf, or the category has fewer than D databases
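The descent above can be sketched as follows, with hypothetical node/database dictionaries and a deliberately simple flat scorer (sum of query-word document frequencies):

```python
def hierarchical_select(node, query_words, D):
    """Sketch of hierarchical database selection over hypothetical
    structures: each node is {"dbs": [...], "children": {name: node},
    "summary": {word: doc_frequency}}; each db is {"name", "summary"}.
    The flat scorer here is simply the sum of query-word frequencies.
    """
    def score(summary):
        return sum(summary.get(w, 0) for w in query_words)

    while node["children"]:
        # flat selection over the child categories' aggregated summaries
        best = max(node["children"].values(),
                   key=lambda c: score(c["summary"]))
        if len(best["dbs"]) < D:          # stop: too few databases below
            break
        node = best                        # descend into the best category
    # flat selection over the databases of the chosen category
    ranked = sorted(node["dbs"], key=lambda db: score(db["summary"]),
                    reverse=True)
    return [db["name"] for db in ranked[:D]]
```

Any flat database selection algorithm could replace the toy scorer; the hierarchy only decides which pool of databases the flat algorithm finally ranks.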
21. Experiments: Content Summary Extraction
- Focused Probing compared to Random Sampling:
  - Better vocabulary coverage
  - Better word ranking
  - More efficient for the same sample size
  - More effective for the same sample size

More results in the paper: 4 types of classifiers (SVM, RIPPER, C4.5, Bayes), frequency estimation, different data sets
22. Experiments: Database Selection
- Data set and workload:
  - 50 real Web databases
  - 50 TREC Web Track queries
- Metric: Precision @ 15
  - For each query, pick 3 databases
  - Retrieve 5 documents from each database
  - Return 15 documents to the user
  - Mark relevant and irrelevant documents

[Diagram: a query routed by database selection to a few of many databases (e.g., the Library of Congress)]

Good database selection algorithms choose databases with relevant documents
23. Experiments: Precision of Database Selection Algorithms
- Hierarchical database selection improves precision drastically
  - Category content summaries are more complete
  - Topic-based database clustering helps

More results in the paper: different flat selection algorithms, more content summary extraction algorithms

Best result for centralized search: 0.35 (not an option for the Hidden Web!)
24. Contributions
- Technique for extracting content summaries from completely autonomous Hidden-Web databases
- Technique for estimating frequencies: possible to distinguish large from small databases
- Hierarchical database selection exploits classification, drastically improving the precision of distributed search

Content summary extraction implemented and available for download at http://sdarts.cs.columbia.edu
25. Future Work
- Different techniques for merging content summaries for category content summary creation
- Effect of frequency estimation on database selection
- Different hierarchy-traversal algorithms for hierarchical database selection