Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection

1
Distributed Search over the Hidden Web:
Hierarchical Database Sampling and Selection
  • Panagiotis G. Ipeirotis
  • Luis Gravano

Computer Science Department Columbia University
2
Distributed Search? Why? Surface Web vs. Hidden Web
  • Surface Web
  • Link structure
  • Crawlable
  • Documents indexed by search engines
  • Hidden Web
  • No link structure
  • Documents hidden in databases
  • Documents not indexed by search engines
  • Need to query each collection individually

3
Hidden Web Examples
  • PubMed search: diabetes
  • 178,975 matches
  • PubMed is at http://www.ncbi.nlm.nih.gov/PubMed
  • Google search: diabetes site:www.ncbi.nlm.nih.gov
  • 119 matches

4
Distributed Search Challenges
  • Select good databases for query
  • Evaluate query at these databases
  • Merge results from databases

[Figure: a metasearcher over Hidden Web databases (PubMed, Library of Congress, ESPN)]
5
Database Selection Problems
  • How to extract content summaries?
  • How to use the extracted content summaries?

[Figure: the metasearcher routes the query "cancer" using the content summary of each Web database]
  • Web Database 1: basketball 4, cancer 4,532, cpu 23
  • Web Database 2: basketball 4, cancer 60,298, cpu 0
  • Web Database 3: basketball 6,340, cancer 2, cpu 0
6
Extracting Content Summaries from Web Databases
  • No direct access to remote documents other than
    by querying
  • Resort to query-based document sampling
  • Send queries to database
  • Retrieve document sample
  • Use sample to create approximate content summary

7
Random Query-Based Sampling
  • Pick a word and send it as a query to database
  • Retrieve top-k documents returned (e.g., k = 4)
  • Repeat until enough (e.g., 300) documents are
    retrieved

Callan et al., SIGMOD '99, TOIS 2001
Word frequency in sample:
  • cancer: 150 (out of 300)
  • aids: 114 (out of 300)
  • heart: 98 (out of 300)
  • basketball: 2 (out of 300)
Use word frequencies in sample to create content summary
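The sampling loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the Callan et al. implementation: `search` is a hypothetical callable standing in for the database's query interface, and growing the query vocabulary from the retrieved documents is an assumption made for the sketch.

```python
import random
from collections import Counter


def query_based_sample(search, seed_words, k=4, target=300, max_queries=1000, rng=None):
    """Collect a document sample by sending one-word queries.

    `search(word, k)` is a hypothetical callable (an assumption of this sketch)
    that returns up to k (doc_id, text) pairs for a one-word query.
    """
    rng = rng or random.Random(0)
    vocabulary = list(seed_words)
    sample = {}
    for _ in range(max_queries):
        if len(sample) >= target or not vocabulary:
            break
        word = rng.choice(vocabulary)  # pick a word and send it as a query
        for doc_id, text in search(word, k):
            if doc_id not in sample:
                sample[doc_id] = text
                # Assumption: later query words are drawn from retrieved documents.
                vocabulary.extend(set(text.lower().split()))
    return sample


def sample_document_frequencies(sample):
    """Count, for each word, how many sampled documents contain it."""
    frequencies = Counter()
    for text in sample.values():
        frequencies.update(set(text.lower().split()))
    return frequencies
```

The resulting counts are frequencies within the sample only, which is why the content summary they produce is approximate.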
8
Random Sampling Problems
  • No actual word frequencies computed for content
    summaries, only a ranking of words
  • Many words missing from content summaries (many
    rare words)
  • Many queries return very few or no matches

Many words appear in only one or two documents
9
Our Technique: Focused Probing
  • Train document classifiers
  • Find representative words for each category
  • Use classifier rules to derive a
    topically-focused sample from database
  • Estimate actual document frequencies for all
    discovered words

10
Focused Probing: Training
  • Start with a predefined topic hierarchy and
    preclassified documents
  • Train document classifiers for each node
  • Extract rules from classifiers
  • ibm AND computers → Computers
  • lung AND cancer → Health
  • angina → Heart
  • hepatitis AND liver → Hepatitis

[Figure: topic hierarchy; nodes shown include Root and Health]
SIGMOD 2001
11
Focused Probing: Sampling
  • Transform each rule into a query
  • For each query
  • Send to database
  • Record number of matches
  • Retrieve top-k matching documents
  • At the end of round
  • Analyze matches for each category
  • Choose category to focus on

Sampling proceeds in rounds. In each round, the
rules associated with each node are turned into
queries for the database.
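One probing round could look roughly like the sketch below. It is only an illustration under stated assumptions: `count_matches` is a hypothetical callable that returns the number of matches the database reports for a query, and picking the category with the largest total match count is a simplification of "analyze matches for each category", not necessarily the selection criterion used in the paper.

```python
def probe_round(rules, count_matches):
    """Run one round of focused probing over the child categories of a node.

    rules: dict mapping category name -> list of rules, where each rule is a
           tuple of words that must all appear (a conjunction).
    count_matches: hypothetical callable, query string -> reported match count.
    """
    matches_per_category = {}
    for category, category_rules in rules.items():
        total = 0
        for words in category_rules:
            query = " AND ".join(words)  # transform the rule into a query
            total += count_matches(query)  # record the number of matches
        matches_per_category[category] = total
    # Assumption: focus on the category with the most matches.
    focus = max(matches_per_category, key=matches_per_category.get)
    return focus, matches_per_category
```

The next round would then use the rules attached to the children of the chosen category.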
12
Sample Frequencies and Actual Frequencies
  • liver appears in 200 out of 300 documents in
    sample
  • kidney appears in 100 out of 300 documents in
    sample
  • hepatitis appears in 30 out of 300 documents in
    sample

Document frequencies in actual database?
  • Query liver returned 140,000 matches
  • Query hepatitis returned 20,000 matches
  • kidney was not a query probe

Can exploit number of matches from one-word
queries
13
Adjusting Document Frequencies
  • We know ranking r of words according to document
    frequency in sample
  • We know absolute document frequency f of some
    words from one-word queries
  • Mandelbrot's formula empirically connects word
    frequency f and ranking r
  • We use curve-fitting to estimate the absolute
    frequency of all words in sample

[Figure: word frequency f plotted against rank r]
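The curve-fitting step can be illustrated with a small sketch. It assumes a simplified power-law form, f = P · r^B, fitted by least squares in log-log space; any additional parameters of Mandelbrot's formula are dropped here as a simplifying assumption. The example ranks (liver 1, kidney 2, hepatitis 3) are also an assumption that treats the three words from the previous slide as the top-ranked words in the sample.

```python
import math


def fit_power_law(known):
    """Fit ln f = log_p + b * ln r to (rank, frequency) pairs by least squares.

    Needs at least two pairs with distinct ranks.
    """
    xs = [math.log(rank) for rank, _ in known]
    ys = [math.log(freq) for _, freq in known]
    n = len(known)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_p = mean_y - b * mean_x
    return log_p, b


def estimate_frequency(rank, log_p, b):
    """Estimate the absolute document frequency of the word at a given rank."""
    return math.exp(log_p + b * math.log(rank))


# Match counts from one-word queries: liver (rank 1) and hepatitis (rank 3).
log_p, b = fit_power_law([(1, 140_000), (3, 20_000)])
# kidney (rank 2) was never sent as a query, so its frequency is estimated.
kidney_estimate = estimate_frequency(2, log_p, b)
```

With only two known points the fit passes through both exactly; with more one-word query probes the regression averages over them.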
14
Actual PubMed Content Summary
PubMed content summary:
  • Number of documents: 3,868,552
  • Category: Health, Diseases
  • cancer: 1,398,178
  • aids: 106,512
  • heart: 281,506
  • hepatitis: 23,481
  • basketball: 907
  • cpu: 487
  • Extracted automatically
  • 27,500 words in extracted content summary
  • Fewer than 200 queries sent
  • At most 4 documents retrieved per query

The extracted content summary accurately
represents size, contents, and classification of
the database
15
Focused Probing: Contributions
  • Focuses database sampling on dense topic areas
  • Estimates absolute document frequencies of words
  • Classifies databases along the way
  • Classification useful for database selection

16
Database Selection Problems
  • How to extract content summaries?
  • How to use the extracted content summaries?

[Figure: the metasearcher routes the query "cancer" using the content summary of each Web database]
  • Web Database 1: basketball 4, cancer 4,532, cpu 23
  • Web Database 2: basketball 4, cancer 60,298, cpu 0
  • Web Database 3: basketball 6,340, cancer 2, cpu 0
17
Database Selection and Extracted Content Summaries
  • Database selection algorithms assume complete
    content summaries
  • Content summaries extracted by (small-scale)
    sampling are inherently incomplete (Zipf's law)
  • Queries with undiscovered words are problematic

Database classification helps: similar topics →
similar content summaries. Extracted content
summaries complement each other.
18
Content Summaries for Categories: Example
  • Cancerlit contains metastasis, not found during
    sampling
  • CancerBacup contains diabetes, not found during
    sampling
  • Cancer category content summary contains both
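A category content summary can be sketched as a merge of the summaries of the databases classified under that category. Summing document frequencies is just one possible merging rule and is an assumption here, and the counts in the example are invented purely for illustration.

```python
from collections import Counter


def category_summary(database_summaries):
    """Merge per-database summaries (word -> document frequency) by summing.

    Summation is an assumed merging rule chosen for this sketch.
    """
    merged = Counter()
    for summary in database_summaries:
        merged.update(summary)
    return merged


# Hypothetical extracted summaries: each one misses a word the database contains.
cancerlit = {"cancer": 900, "diabetes": 40}      # "metastasis" not sampled
cancerbacup = {"cancer": 300, "metastasis": 25}  # "diabetes" not sampled
cancer_category = category_summary([cancerlit, cancerbacup])
```

Here `cancer_category` contains both "metastasis" and "diabetes", even though each individual summary lacks one of them.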

19
Hierarchical DB Selection: Outline
  • Create aggregated content summaries for
    categories
  • Hierarchically direct queries using categories
  • Category content summaries are more complete
    than database content summaries

Various traversal techniques possible
20
Hierarchical DB Selection: Example
  • To select D databases
  • Use a flat DB selection algorithm to score
    categories
  • Proceed to category with highest score
  • Repeat until category is a leaf, or category has
    fewer than D databases
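The traversal above can be sketched as follows. Several details are assumptions made for illustration: the hierarchy is a nested dict, `flat_score` is a stand-in scorer (sum of document frequencies of the query words) rather than any particular flat selection algorithm, and the descent stops before entering a category with fewer than D databases so that D databases can still be returned.

```python
def databases_under(node):
    """Collect all (name, summary) pairs at or below a category node."""
    found = list(node.get("databases", []))
    for child in node.get("children", []):
        found.extend(databases_under(child))
    return found


def flat_score(summary, query_words):
    """Stand-in flat scorer: total document frequency of the query words."""
    return sum(summary.get(word, 0) for word in query_words)


def hierarchical_select(root, query_words, d):
    """Descend to the best-scoring category, then pick the top d databases."""
    node = root
    while node.get("children"):
        best = max(node["children"],
                   key=lambda child: flat_score(child["summary"], query_words))
        if len(databases_under(best)) < d:
            break  # too few databases below; stay at the current category
        node = best
    ranked = sorted(databases_under(node),
                    key=lambda db: flat_score(db[1], query_words),
                    reverse=True)
    return [name for name, _ in ranked[:d]]
```

The category `summary` entries are expected to be aggregated content summaries of the databases below each category.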

21
Experiments: Content Summary Extraction
  • Focused Probing compared to Random Sampling
  • Better vocabulary coverage
  • Better word ranking
  • More efficient for same sample size
  • More effective for same sample size

More results in the paper: 4 types of classifiers
(SVM, Ripper, C4.5, Bayes), frequency estimation,
different data sets
22
Experiments: Database Selection
  • Data set and workload
  • 50 real Web databases
  • 50 TREC Web Track queries
  • Metric: Precision @ 15
  • For each query pick 3 databases
  • Retrieve 5 documents from each database
  • Return 15 documents to user
  • Mark relevant and irrelevant documents
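The metric can be written down in a few lines. This sketch assumes the relevance judgments arrive as a list of booleans in the order the 15 merged documents are shown to the user; the function name is illustrative.

```python
def precision_at_k(relevance_judgments, k=15):
    """Fraction of the top-k returned documents that were marked relevant.

    relevance_judgments: list of booleans, one per returned document, in
    rank order. If fewer than k documents are returned, the missing slots
    count as irrelevant.
    """
    top = relevance_judgments[:k]
    return sum(1 for is_relevant in top if is_relevant) / k
```

For example, 3 databases contributing 5 documents each give 15 judgments per query, and the score is the share of those marked relevant.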

[Figure: a query passes through database selection, which chooses among the available databases]
Good database selection algorithms choose
databases with relevant documents
23
Experiments: Precision of Database Selection Algorithms
  • Hierarchical database selection improves
    precision drastically
  • Category content summaries more complete
  • Topic-based database clustering helps

More results in the paper! (different flat
selection algorithms, more content summary
extraction algorithms)
Best result for centralized search: 0.35. Not an
option for Hidden Web!
24
Contributions
  • Technique for extracting content summaries from
    completely autonomous Hidden-Web databases
  • Technique for estimating frequencies: possible to
    distinguish large from small databases
  • Hierarchical database selection exploits
    classification, drastically improving the
    precision of distributed search

Content summary extraction implemented and
available for download at http://sdarts.cs.columbia.edu
25
Future Work
  • Different techniques for merging content
    summaries for category content summary creation
  • Effect of frequency estimation on database
    selection
  • Different hierarchy traversing algorithms for
    hierarchical database selection