Title: Building High-Quality Online Database Directories
Luciano Barbosa, Eun Yong Kang, Juliana Freire
University of Utah
[Figure: Form-Focused Crawler on Web Graph. Example link features: URL term "travel"; anchor text "rental car".]
Introduction
As the volume of information in the hidden Web
grows, there is an increased need for techniques
and tools that allow users and applications to
uncover and leverage this information. But
before one can access the information, a key
issue that needs to be addressed is how to find
relevant data sources. In this poster we propose
a scalable solution to the problem of
automatically constructing topic-specific
directories of online databases. Given a domain,
our goal is to automatically locate the entry
points for online databases in this domain. The
availability of such a directory is a requirement
for many applications that exploit hidden-Web
content, such as hidden-Web crawlers [2],
meta-searchers [5], and Web information
integration systems [3]. Our solution has two main
components: a focused crawling strategy that
efficiently scours the Web to locate forms that
are the entry points to online databases, and a
novel form-filtering process that reliably
identifies relevant forms in the heterogeneous
set retrieved by the crawler.
[Figure labels: anchor text "reservation"; Start Page; Web page; Target Page (searchable form)]
[Figure: Form Classification based on Form Structure. Searchable: does not have a textarea. Non-searchable: contains a textarea.]
[Figure: Form Classification based on Form Contents. Rental Car Domain; Airfare Domain.]
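The structural cue named in the figure (presence of a textarea) can be illustrated with a short sketch. This is not the GFC itself, which the poster describes as a learned ID3 classifier. The feature names, the helper functions, and the single hand-written rule below are assumptions for illustration only.

```python
from collections import Counter
from html.parser import HTMLParser


class FormStructure(HTMLParser):
    """Count form elements found inside <form> tags.

    For simplicity the counts are aggregated over every form in the input.
    """

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag == "input":
            # An <input> without a type attribute defaults to a text field.
            kind = (dict(attrs).get("type") or "text").lower()
            self.counts["input:" + kind] += 1
        elif self.in_form and tag in ("textarea", "select"):
            self.counts[tag] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def structural_features(html):
    """Return a Counter of structural features for the forms in `html`."""
    parser = FormStructure()
    parser.feed(html)
    return parser.counts


def looks_searchable(features):
    """Hypothetical one-rule stand-in: a textarea suggests a non-searchable form."""
    return features["textarea"] == 0


search_form = '<form><input type="text" name="q"><input type="submit"></form>'
contact_form = '<form><input name="email"><textarea name="msg"></textarea></form>'
print(looks_searchable(structural_features(search_form)))
print(looks_searchable(structural_features(contact_form)))
```

A learned classifier would take a feature vector like this as input and induce its own rules rather than relying on one fixed test.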
Discussion
Our experiments show that FFC+GFC+TSFC is
effective for collecting high-quality,
topic-specific forms. The decomposition of the
feature space allows the use of learning
techniques that are more appropriate for each
partition: ID3 for the GFC and SVM for the TSFC.
Our experiments showed that this yields
substantially higher recall and precision than a
single learning method applied over the complete
feature space. In addition, the results indicate
that the FFC obtains much larger harvest rates
than a crawler that focuses only on a topic.
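The measures mentioned above can be made concrete with a small sketch. Precision and recall follow their standard definitions. The poster does not define harvest rate, so the definition used here (relevant forms found per page crawled) is an assumption, and all identifiers and numbers are made up for illustration.

```python
def precision_recall(predicted, relevant):
    """Precision and recall of a predicted set of forms against the relevant set."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def harvest_rate(relevant_found, pages_crawled):
    """Assumed definition: relevant forms found divided by pages crawled."""
    return relevant_found / pages_crawled if pages_crawled else 0.0


# Made-up example: four forms kept by a filter, three forms truly relevant.
kept = {"form1", "form2", "form3", "form4"}
truth = {"form1", "form2", "form5"}
print(precision_recall(kept, truth))
print(harvest_rate(relevant_found=50, pages_crawled=1000))
```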
Methods
[Figure annotation: forms from the two domains have different contents.]
Experimental Evaluation
1. System Components
Focused Crawler (FC): collects Web documents within a specific topic (Naive Bayes)
Form-Focused Crawler (FFC): learns patterns of hyperlink paths that lead to pages that contain forms (Naive Bayes)
Generic Form Classifier (GFC): identifies searchable and non-searchable forms based on structural information in the form (ID3)
Topic-Specific Form Classifier (TSFC): identifies forms that belong to a given database domain based on the contents of the form (SVM)
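To illustrate how a Naive Bayes model could score links for the FFC, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing over link tokens. The token scheme ("url:" and "anchor:" prefixes), the two class labels, and the tiny training set are all hypothetical. The poster does not specify the actual features or classes used.

```python
import math
from collections import Counter


class LinkNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over link tokens."""

    def __init__(self):
        self.class_counts = Counter()   # number of training links per class
        self.token_counts = {}          # class label -> Counter of tokens
        self.vocab = set()

    def fit(self, examples):
        for tokens, label in examples:
            self.class_counts[label] += 1
            self.token_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical training links: tokens from the URL and the anchor text.
training = [
    (["url:travel", "anchor:rental", "anchor:car"], "leads_to_form"),
    (["url:travel", "anchor:reservation"], "leads_to_form"),
    (["anchor:about", "anchor:us"], "other"),
    (["url:blog", "anchor:news"], "other"),
]
model = LinkNaiveBayes().fit(training)
print(model.predict(["anchor:reservation"]))
```

A crawler could use the class scores to decide which links to follow first, in place of a topic-only relevance score.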
We evaluate our solution over eight distinct
database domains: auto, airfare, book, job,
hotel, movie, music, and rental car. For locating
the forms, we use two different crawlers:
Baseline (FC): a variation of the best-first crawler [4].
FFC: the form-focused crawler.
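As background for the baseline, a best-first crawler keeps a frontier ordered by a relevance score and always visits the highest-scoring link next. The sketch below shows that general idea only. The `get_links` and `score` callables and the toy link graph are hypothetical, and the specific variation used in the experiments is not described on the poster.

```python
import heapq


def best_first_crawl(seed, get_links, score, budget):
    """Visit up to `budget` pages, always expanding the highest-scoring link.

    `get_links(url)` returns outgoing links; `score(url)` returns a relevance
    estimate. Both are supplied by the caller (hypothetical stand-ins here).
    """
    frontier = [(-score(seed), seed)]  # negate scores: heapq is a min-heap
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Toy link graph and made-up scores.
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
scores = {"seed": 0.5, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}
print(best_first_crawl("seed", lambda u: graph.get(u, []), scores.get, budget=4))
# prints ['seed', 'b', 'd', 'a']
```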
1. Performance of GFC+TSFC in Job domain
2. Taxonomy of Web Forms In Our Solution Space
2. Crawler Efficiency
3. Process for Building a Topic-Specific Online
Database Directory.
[Figure labels: Form Searching: Focused Crawler (topic-specific pages), Form-Focused Crawler (form-related pages). Form Filtering: Generic Form Classifier (searchable forms), Topic-Specific Form Classifier (relevant forms).]
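The form-filtering stage can be sketched as a two-step cascade in which each step looks at a different part of the feature space: structure first, then textual content. The `Form` record, the two predicate functions, and the keyword list below are hypothetical stand-ins for the learned GFC and TSFC, not the classifiers used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    url: str
    has_textarea: bool  # structural feature (used by the first step only)
    text: str           # textual content (used by the second step only)


def filter_forms(forms, is_searchable, is_on_topic):
    """Apply the structural filter, then the content filter to its survivors."""
    searchable = [f for f in forms if is_searchable(f)]
    relevant = [f for f in searchable if is_on_topic(f)]
    return searchable, relevant


def toy_gfc(form):
    """Stand-in for a structure-based classifier."""
    return not form.has_textarea


def toy_tsfc(form, keywords=("rental", "pickup")):
    """Stand-in for a content-based classifier in a rental car domain."""
    words = form.text.lower().split()
    return any(k in words for k in keywords)


forms = [
    Form("site-a", False, "Pickup location rental car"),
    Form("site-b", False, "Departure city airfare"),
    Form("site-c", True, "Your message"),
]
searchable, relevant = filter_forms(forms, toy_gfc, toy_tsfc)
print([f.url for f in searchable], [f.url for f in relevant])
```

Keeping the two predicates separate means each can be trained with a different learning method on its own features, which is the decomposition the poster argues for.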
[Figure labels: Feature space: Web page textual content; information around hyperlink; form textual content; form structure]
Acknowledgments
This work is supported by the National Science
Foundation under grants IIS-0534628, EIA-0323604,
CNS-0514485, IIS-0513692, and CNS-0528201.