Title: Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
1. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
- Panagiotis G. Ipeirotis
- Luis Gravano
Computer Science Department, Columbia University
2. Distributed Search? Why? Surface Web vs. Hidden Web
- Surface Web
- Link structure
- Crawlable
- Documents indexed by search engines
- Hidden Web
- No link structure
- Documents hidden in databases
- Documents not indexed by search engines
- Need to query each collection individually
3. Hidden Web Examples
- PubMed search: diabetes
  - 178,975 matches
  - PubMed is at http://www.ncbi.nlm.nih.gov/PubMed
- Google search: diabetes site:www.ncbi.nlm.nih.gov
  - Only 119 matches
4. Distributed Search Challenges
- Select good databases for query
- Evaluate query at these databases
- Merge results from databases
[Diagram: a Metasearcher mediating between the user and Hidden-Web databases such as PubMed, the Library of Congress, and ESPN]
5. Database Selection Problems
- How to extract content summaries?
- How to use the extracted content summaries?

[Diagram: the Metasearcher routes the query "cancer" using per-database content summaries]
  Web Database 1: basketball 4, cancer 4,532, cpu 23
  Web Database 2: basketball 4, cancer 60,298, cpu 0
  Web Database 3: basketball 6,340, cancer 2, cpu 0
6. Extracting Content Summaries from Web Databases
- No direct access to remote documents other than by querying
- Resort to query-based document sampling:
- Send queries to database
- Retrieve document sample
- Use sample to create approximate content summary
7. Random Query-Based Sampling
- Pick a word and send it as a query to the database
- Retrieve the top-k documents returned (e.g., k = 4)
- Repeat until enough documents (e.g., 300) are retrieved
(Callan et al., SIGMOD'99; TOIS 2001)
Word         Frequency in Sample
cancer       150 (out of 300)
aids         114 (out of 300)
heart         98 (out of 300)
basketball     2 (out of 300)

Use word frequencies in the sample to create the content summary
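The sampling loop above can be sketched over a toy in-memory stand-in for a database (the `docs` dict and the word pool are hypothetical; a real Hidden-Web database is reachable only through its query interface):

```python
import random
from collections import Counter

def random_sampling(docs, dictionary, k=4, sample_size=10, seed=0):
    """Toy sketch of random query-based sampling.

    `docs` is a hypothetical stand-in for a Hidden-Web database:
    a dict mapping doc_id -> set of words.
    """
    rng = random.Random(seed)
    sample = set()
    attempts = 0
    while len(sample) < sample_size and attempts < 100 * sample_size:
        word = rng.choice(dictionary)                       # random query word
        matches = [d for d, ws in docs.items() if word in ws]
        sample.update(matches[:k])                          # keep top-k docs
        attempts += 1
    # Document frequencies within the sample form the approximate summary
    summary = Counter()
    for d in sample:
        summary.update(docs[d])
    return summary, len(sample)
```

Note that the summary holds frequencies *in the sample* only, which is exactly the limitation the next slide discusses.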
8. Random Sampling: Problems
- No actual word frequencies computed for content summaries, only a ranking of words
- Many words missing from content summaries (many rare words)
- Many queries return very few or no matches
- Many words appear in only one or two documents
9. Our Technique: Focused Probing
- Train document classifiers
- Find representative words for each category
- Use classifier rules to derive a topically-focused sample from the database
- Estimate actual document frequencies for all discovered words
10. Focused Probing: Training
- Start with a predefined topic hierarchy and preclassified documents
- Train document classifiers for each node
- Extract rules from classifiers:
  - ibm AND computers → Computers
  - lung AND cancer → Health
  - angina → Heart
  - hepatitis AND liver → Hepatitis

[Diagram: topic hierarchy with Root and a Health subtree] (SIGMOD 2001)
11. Focused Probing: Sampling
- Transform each rule into a query
- For each query
- Send to database
- Record number of matches
- Retrieve top-k matching documents
- At the end of each round:
- Analyze matches for each category
- Choose category to focus on
Sampling proceeds in rounds. In each round, the rules associated with each node are turned into queries for the database.
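A minimal sketch of these rounds, with hypothetical `docs` and `rule_tree` structures standing in for the database and the trained classifier rules:

```python
from collections import Counter

def focused_probing_sample(docs, rule_tree, k=4):
    """Toy sketch of the Focused Probing sampling rounds.

    `docs` maps doc_id -> set of words; `rule_tree` maps a node name to
    (rules, children), where each rule is (category, set-of-words) and is
    turned into a conjunctive query. All structures are hypothetical.
    """
    sample = set()
    matches_seen = {}
    node = "Root"
    while True:
        rules, children = rule_tree[node]
        counts = Counter()
        for category, words in rules:                    # each rule -> query
            hits = [d for d, ws in docs.items() if words <= ws]
            counts[category] += len(hits)                # record no. of matches
            sample.update(hits[:k])                      # retrieve top-k docs
        matches_seen[node] = counts
        if not children:                                 # reached a leaf
            break
        node = max(children, key=lambda c: counts[c])    # focus on best child
    return sample, matches_seen
```

The key design point the slide makes is visible here: the category chosen for the next round is the one whose rules matched the most documents, so sampling drills into the database's dense topic areas.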
12. Sample Frequencies and Actual Frequencies
- liver appears in 200 out of 300 documents in the sample
- kidney appears in 100 out of 300 documents in the sample
- hepatitis appears in 30 out of 300 documents in the sample
What are the document frequencies in the actual database?
- The query "liver" returned 140,000 matches
- The query "hepatitis" returned 20,000 matches
- "kidney" was not a query probe
We can exploit the number of matches returned by one-word queries
13. Adjusting Document Frequencies
- We know the ranking r of words according to document frequency in the sample
- We know the absolute document frequency f of some words from one-word queries
- Mandelbrot's formula empirically connects word frequency f and rank r: f = P (r + p)^(-B)
- We use curve-fitting to estimate the absolute frequency of all words in the sample

[Plot: word frequency f vs. rank r, with the fitted curve through the known points]
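A minimal sketch of the curve-fitting step. It simplifies Mandelbrot's formula f = P(r + p)^(-B) to the pure Zipf case p = 0, so the fit reduces to ordinary least squares in log-log space; the ranks used in the test below are illustrative:

```python
import math

def fit_power_law(known):
    """Fit f(r) = P * r**(-B) to (rank, absolute frequency) pairs learned
    from one-word query probes. A simplified stand-in for the Mandelbrot
    fit f(r) = P * (r + p)**(-B), with the shift p fixed at 0: take logs
    and run ordinary least squares in log-log space.
    """
    xs = [math.log(r) for r, _ in known]
    ys = [math.log(f) for _, f in known]
    n = len(known)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope   # P, B

def estimate_frequency(rank, P, B):
    """Estimated absolute document frequency of the word at `rank`."""
    return P * rank ** (-B)
```

With the slide's numbers (liver: 140,000 matches; hepatitis: 20,000 matches) anchoring the curve, the fit interpolates an absolute frequency for words like kidney that were never sent as probes.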
14. Actual PubMed Content Summary
PubMed content summary
  Number of documents: 3,868,552
  Category: Health, Diseases
  cancer      1,398,178
  aids          106,512
  heart         281,506
  hepatitis      23,481
  basketball        907
  cpu               487
- Extracted automatically
- 27,500 words in extracted content summary
- Fewer than 200 queries sent
- At most 4 documents retrieved per query
The extracted content summary accurately represents the size, contents, and classification of the database
15. Focused Probing: Contributions
- Focuses database sampling on dense topic areas
- Estimates absolute document frequencies of words
- Classifies databases along the way
- Classification useful for database selection
16. Database Selection Problems
- How to extract content summaries?
- How to use the extracted content summaries?

[Diagram, repeated from slide 5: the Metasearcher routes the query "cancer" using per-database content summaries]
17. Database Selection and Extracted Content Summaries
- Database selection algorithms assume complete content summaries
- Content summaries extracted by (small-scale) sampling are inherently incomplete (Zipf's law)
- Queries with undiscovered words are problematic

Database classification helps: similar topics → similar content summaries, so extracted content summaries complement each other
18. Content Summaries for Categories: Example
- CancerLit contains "metastasis", not found during sampling
- CancerBacup contains "diabetes", not found during sampling
- The Cancer category content summary contains both
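The category summary can be sketched as a plain merge of member-database summaries (the database names come from the slide; the counts are invented, and summing is just one simple merge policy):

```python
from collections import Counter

# Hypothetical per-database content summaries (word -> document frequency);
# the database names come from the slide, the counts are invented.
cancerlit = Counter({"cancer": 60298, "metastasis": 1842})
cancerbacup = Counter({"cancer": 4532, "diabetes": 210})

# The Cancer category summary merges its member databases, so it covers
# words that sampling missed in any single database.
cancer_category = cancerlit + cancerbacup
```

The merged summary contains both "metastasis" and "diabetes", even though each word was discovered in only one database's sample.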
19. Hierarchical DB Selection: Outline
- Create aggregated content summaries for categories
- Hierarchically direct queries using categories
- Category content summaries are more complete than database content summaries

Various traversal techniques are possible
20. Hierarchical DB Selection: Example
- To select D databases:
  - Use a flat DB selection algorithm to score categories
  - Proceed to the category with the highest score
  - Repeat until the category is a leaf, or the category has fewer than D databases
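The descent above can be sketched as follows, with hypothetical node/database dictionaries and a deliberately simple flat scorer (sum of query-word document frequencies):

```python
def hierarchical_select(node, query_words, D):
    """Sketch of hierarchical database selection over hypothetical
    structures: each node is {"dbs": [...], "children": {name: node},
    "summary": {word: doc_frequency}}; each db is {"name", "summary"}.
    The flat scorer here is simply the sum of query-word frequencies.
    """
    def score(summary):
        return sum(summary.get(w, 0) for w in query_words)

    while node["children"]:
        # flat selection over the child categories' aggregated summaries
        best = max(node["children"].values(),
                   key=lambda c: score(c["summary"]))
        if len(best["dbs"]) < D:          # stop: too few databases below
            break
        node = best                        # descend into the best category
    # flat selection over the databases of the chosen category
    ranked = sorted(node["dbs"], key=lambda db: score(db["summary"]),
                    reverse=True)
    return [db["name"] for db in ranked[:D]]
```

Any flat database selection algorithm could replace the toy scorer; the hierarchy only decides which pool of databases the flat algorithm finally ranks.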
21. Experiments: Content Summary Extraction
- Focused Probing compared to Random Sampling:
  - Better vocabulary coverage
  - Better word ranking
  - More efficient for the same sample size
  - More effective for the same sample size

More results in the paper: 4 types of classifiers (SVM, RIPPER, C4.5, Bayes), frequency estimation, different data sets
22. Experiments: Database Selection
- Data set and workload:
  - 50 real Web databases
  - 50 TREC Web Track queries
- Metric: Precision @ 15
  - For each query, pick 3 databases
  - Retrieve 5 documents from each database
  - Return 15 documents to the user
  - Mark relevant and irrelevant documents

[Diagram: a query routed by database selection to a few of many databases (e.g., the Library of Congress)]

Good database selection algorithms choose databases with relevant documents
23. Experiments: Precision of Database Selection Algorithms
- Hierarchical database selection improves precision drastically
  - Category content summaries are more complete
  - Topic-based database clustering helps

More results in the paper: different flat selection algorithms, more content summary extraction algorithms

Best result for centralized search: 0.35 (not an option for the Hidden Web!)
24. Contributions
- Technique for extracting content summaries from completely autonomous Hidden-Web databases
- Technique for estimating frequencies: possible to distinguish large from small databases
- Hierarchical database selection exploits classification, drastically improving the precision of distributed search

Content summary extraction implemented and available for download at http://sdarts.cs.columbia.edu
25. Future Work
- Different techniques for merging content summaries for category content summary creation
- Effect of frequency estimation on database selection
- Different hierarchy-traversal algorithms for hierarchical database selection