Title: Crawling the Hidden Web
1. Crawling the Hidden Web
- by Michael Weinberg
- mwmw_at_cs.huji.ac.il
Internet DB Seminar, The Hebrew University of Jerusalem, School of Computer Science and Engineering, December 2001
2. Agenda
- Hidden Web - what is it all about?
- Generic model for a hidden Web crawler
- HiWE (Hidden Web Exposer)
- LITE - Layout-based Information Extraction Technique
- Results from experiments conducted to test these techniques
3. Web Crawlers
- Automatically traverse the Web graph, building a local repository of the portion of the Web that they visit
- Traditionally, crawlers have only targeted a portion of the Web called the publicly indexable Web (PIW)
- PIW: the set of pages reachable purely by following hypertext links, ignoring search forms and pages that require authentication
4. The Hidden Web
- Recent studies show that a significant fraction of Web content in fact lies outside the PIW
- Large portions of the Web are hidden behind search forms in searchable databases
- HTML pages are dynamically generated in response to queries submitted via the search forms
- Also referred to as the Deep Web
5. The Hidden Web's Growth
- The Hidden Web continues to grow, as organizations with large amounts of high-quality information place their content online, providing web-accessible search facilities over existing databases
- For example:
- Census Bureau
- Patents and Trademarks Office
- News media companies
- InvisibleWeb.com lists over 10,000 such databases
6. Surface Web
7. Deep Web
8. Deep Web Content Distribution
9. Deep Web Stats
- The Deep Web is 500 times larger than the PIW!
- Contains 7,500 terabytes of information (March 2000)
- More than 200,000 Deep Web sites exist
- Sixty of the largest Deep Web sites collectively contain about 750 terabytes of information
- 95% of the Deep Web is publicly accessible (no fees)
- Google indexes about 16% of the PIW, so we search about 0.03% of the pages available today
10. The Problem
- The Hidden Web contains large amounts of high-quality information
- The information is buried in dynamically generated sites
- Search engines that use traditional crawlers never find this information
11. The Solution
- Build a hidden Web crawler
- It can crawl and extract content from hidden databases
- It enables indexing, analysis, and mining of hidden Web content
- The content extracted by such crawlers can be used to categorize and classify the hidden databases
12. Challenges
- There are significant technical challenges in designing a hidden Web crawler
- It must interact with forms that were designed primarily for human consumption
- It must provide input in the form of search queries
- How do we equip the crawler with input values for use in constructing search queries?
- To address these challenges, we adopt a task-specific, human-assisted approach
13. Task-Specificity
- Extract content based on the requirements of a particular application or task
- For example, consider a market analyst interested in press releases, articles, etc. pertaining to the semiconductor industry, and dated sometime in the last ten years
14. Human Assistance
- Human assistance is critical to ensure that the crawler issues queries that are relevant to the particular task
- For instance, in the semiconductor example, the market analyst may provide the crawler with lists of companies or products that are of interest
- The crawler can then gather additional potential company and product names as it processes pages
15. Two Steps
- There are two steps in achieving our goal:
- Resource discovery: identify sites and databases that are likely to be relevant to the task
- Content extraction: actually visit the identified sites to submit queries and extract the hidden pages
- In this presentation we do not directly address the resource discovery problem
16. Hidden Web Crawlers
17. User-Form Interaction
[Figure: a user interacts with a hidden database through its Web query front-end: (1) download form, (2) view form, (3) fill out form, (4) submit form, (5) download response, (6) view result; the form page comes from the front-end, the response page from the hidden database]
18. Operational Model
- Our model of a hidden Web crawler consists of four components:
- Internal form representation
- Task-specific database
- Matching function
- Response analysis
- Form page: the page containing the search form
- Response page: the page received in response to a form submission
19. Generic Operational Model
[Figure: the hidden Web crawler downloads a form page and builds an Internal Form Representation via form analysis; matching the representation against the task-specific database yields a set of value-assignments; each assignment is submitted to the Web query front-end, the response page is downloaded, and response analysis feeds back into the match step]
20. Internal Form Representation
- A form F = ({E1, E2, ..., En}, S, M) consists of a set of n form elements, plus:
- S: submission information associated with the form
- the submission URL
- internal identifiers for each form element
- M: meta-information about the form
- the web site hosting the form
- the set of pages pointing to this form page
- other text on the page besides the form
21. Task-specific Database
- The crawler is equipped with a task-specific database D
- D contains the information necessary to formulate queries relevant to the particular task
- In the market analyst example, D could contain lists of semiconductor company and product names
- The actual format and organization of D are specific to a particular crawler implementation
- HiWE uses a set of labeled fuzzy sets
22. Matching Function
- Matching algorithm:
- Input: the internal form representation and the current contents of the database D
- Output: a set of value assignments; an assignment [E1 ← v1, ..., En ← vn] associates value vi with element Ei
23. Response Analysis
- The module that stores the response page in the repository
- Attempts to distinguish between pages containing search results and pages containing error messages
- This feedback is used to tune the matching function
24. Traditional Performance Metrics
- Traditional crawler performance metrics:
- Crawling speed
- Scalability
- Page importance
- Freshness
- These metrics are relevant to hidden Web crawlers, but do not capture the fundamental challenges in dealing with the Hidden Web
25. New Performance Metrics
- Coverage metric: relevant pages extracted / relevant pages present in the targeted hidden databases
- Problem: it is difficult to estimate how much of the hidden content is relevant to the task
26. New Performance Metrics
- Strict submission efficiency: SE_strict = N_success / N_total
- N_total: the total number of forms that the crawler submits
- N_success: the number of submissions that result in a response page with one or more search results
- Problem: the crawler is penalized if the database didn't contain any relevant search results
27. New Performance Metrics
- Lenient submission efficiency: SE_lenient = N_valid / N_total, where N_valid is the number of semantically correct form submissions
- Penalizes the crawler only if a form submission is semantically incorrect
- Problem: difficult to evaluate, since manual inspection is needed to decide whether a form submission is semantically correct
28. Design Issues
- What information about each form element should the crawler collect?
- What meta-information is likely to be useful?
- How should the task-specific database be organized, updated, and accessed?
- What Match function is likely to maximize submission efficiency?
- How can the response analysis module be used to tune the Match function?
29. HiWE - Hidden Web Exposer
30. Basic Idea
- Extract descriptive information (a label) for each element of a form
- The task-specific database is organized in terms of categories, each of which is also associated with labels
- The matching function attempts to match form labels to database categories to compute a set of candidate value assignments
31. HiWE Architecture
[Figure: the Crawl Manager drives crawling over the WWW using the URL List (URL 1 ... URL N); the Parser extracts links from fetched pages; the Form Analyzer, Form Processor, and Response Analyzer handle form submission, responses, and feedback; the LVS Manager maintains the LVS table of (Label, Value-Set) pairs, fed by custom data sources and by feedback from the Response Analyzer]
32. HiWE's Main Modules
- URL List: contains all the URLs the crawler has discovered so far
- Crawl Manager: controls the entire crawling process
- Parser: extracts hypertext links from the crawled pages and adds them to the URL list
- Form Analyzer, Form Processor, Response Analyzer: together implement the form processing and submission operations
33. HiWE's Main Modules
- LVS Manager: manages additions and accesses to the LVS table
- LVS table: HiWE's implementation of the task-specific database
34. HiWE's Form Representation
- Form F = ({E1, ..., En}, S, M)
- The third component M is the empty set, since the current implementation of HiWE does not collect any meta-information about the form
- For each element Ei, HiWE collects a domain Dom(Ei) and a label label(Ei)
35. HiWE's Form Representation
- Domain of an element: the set of values which can be associated with the corresponding form element
- May be a finite set (e.g., the domain of a selection list)
- May be an infinite set (e.g., the domain of a text box)
- Label of an element: the descriptive information associated with the element, if any
- Most forms include some descriptive text to help users understand the semantics of the element
36. Form Representation - Figure
- Element E1: Label(E1) = "Document Type", Dom(E1) = {Articles, Press Releases, Reports}
- Element E2: Label(E2) = "Company Name", Dom(E2) = {s | s is a text string}
- Element E3: Label(E3) = "Sector", Dom(E3) = {Entertainment, Automobile, Information Technology, Construction}
37. HiWE's Task-specific Database
- Task-specific information is organized in terms of a finite set of concepts or categories
- Each concept has one or more labels and an associated set of values
- For example, the label "Company Name" could be associated with the set of values {IBM, Microsoft, HP, ...}
38. HiWE's Task-specific Database
- The concepts are organized in a table called the Label Value Set (LVS) table
- Each entry in the LVS table is of the form (L, V)
- L: a label
- V: a fuzzy set of values
- The fuzzy set V has an associated membership function M_V that assigns weights, in the range [0, 1], to each member of the set
- M_V(v) is a measure of the crawler's confidence that the assignment of v to an element labeled L is semantically meaningful
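One way to picture an LVS entry (a sketch, not HiWE's implementation; the dictionary layout and the sample entries are assumptions): each label maps to a fuzzy value set represented as a value-to-weight dictionary, with weights in [0, 1].

```python
# LVS table: label L -> fuzzy value set V, represented as {value: weight}
lvs_table = {
    "Company Name": {"IBM": 1.0, "Microsoft": 1.0, "HP": 0.8},
    "State": {"California": 1.0, "Nevada": 0.9},
}

def membership(label, value):
    """M_V(v): the crawler's confidence that assigning value v to an
    element labeled L is semantically meaningful (0.0 if unknown)."""
    return lvs_table.get(label, {}).get(value, 0.0)
```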
39. HiWE's Matching Function
- For elements with a finite domain, the set of possible values is fixed and can be exhaustively enumerated
- In this example, the crawler can first retrieve all relevant articles, then all relevant press releases, and finally all relevant reports
- Element E1: Label(E1) = "Document Type", Dom(E1) = {Articles, Press Releases, Reports}
40. HiWE's Matching Function
- For elements with an infinite domain, HiWE textually matches the labels of these elements with labels in the LVS table
- For example, if a textbox element has the label "Enter State", which best matches an LVS entry with the label "State", the values associated with that LVS entry (e.g., "California") can be used to fill the textbox
- How do we match form labels with LVS labels?
41. Label Matching
- Two steps in matching form labels with LVS labels:
- 1. Normalization: includes conversion to a common case and standard style
- 2. Use of an approximate string-matching algorithm to compute minimum edit distances
- HiWE employs D. Lopresti and A. Tomkins' string-matching algorithm, which takes word reordering into account
42. Label Matching
- Let LabelMatch(Ei) denote the LVS entry with the minimum edit distance to label(Ei)
- Threshold σ: if all LVS entries are more than σ edit operations away from label(Ei), then LabelMatch(Ei) = nil
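The two steps can be sketched as follows. One substitution to note: HiWE uses the Lopresti-Tomkins block edit model, which tolerates word reordering; this sketch uses plain Levenshtein distance instead, and the threshold values used below are illustrative.

```python
def normalize(label):
    # Step 1: conversion to a common case and standard style
    return " ".join(label.lower().split())

def edit_distance(a, b):
    # Plain Levenshtein distance (HiWE itself uses a block edit model
    # that also accounts for word reordering)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def label_match(element_label, lvs_labels, sigma):
    """Return the LVS label with minimum edit distance to the form
    label, or None if every entry is more than sigma operations away."""
    target = normalize(element_label)
    best = min(lvs_labels, key=lambda l: edit_distance(target, normalize(l)))
    if edit_distance(target, normalize(best)) > sigma:
        return None
    return best
```

For example, "Enter State" matches the LVS label "State" once both are normalized, while an unrelated label falls past the threshold and yields nil.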
43. Label Matching
- For each element Ei, compute a pair (Li, Vi)
- If Ei has an infinite domain and (L, V) is the closest matching LVS entry, then Li = L and Vi = V
- If Ei has a finite domain, then Vi = Dom(Ei), with all weights equal to 1
- The set of value assignments is computed as the product V1 × V2 × ... × Vn of all the Vi's
- Too many assignments?
44. Ranking Value Assignments
- HiWE employs an aggregation function to compute a rank for each value assignment
- It uses a configurable parameter: the minimum acceptable value-assignment rank (ρ_min)
- The intent is to improve submission efficiency by only using high-quality value assignments
- We will show three possible aggregation functions
45. Fuzzy Conjunction
- The rank of a value assignment is the minimum of the weights of all the constituent values
- Very conservative in assigning ranks: assigns a high rank only if each individual weight is high
46. Average
- The rank of a value assignment is the average of the weights of the constituent values
- Less conservative than fuzzy conjunction
47. Probabilistic
- This ranking function treats weights as probabilities
- M_V(v) is the likelihood that the choice of v is useful, and 1 - M_V(v) is the likelihood that it is not
- The likelihood of a value assignment being useful is 1 - Π_i (1 - M_{Vi}(vi))
- Assigns a low rank only if all the individual weights are very low
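The three aggregation functions follow directly from their definitions; a sketch (the function names are mine), with a two-value assignment showing how conservative each function is:

```python
def rank_fuzzy(weights):
    # Fuzzy conjunction: the minimum of the constituent weights
    return min(weights)

def rank_average(weights):
    # The average of the constituent weights
    return sum(weights) / len(weights)

def rank_probabilistic(weights):
    # 1 - product of (1 - w): low only when every weight is low
    p = 1.0
    for w in weights:
        p *= (1.0 - w)
    return 1.0 - p

weights = [0.9, 0.2]
# Fuzzy conjunction is the most conservative, probabilistic the least:
# rank_fuzzy -> 0.2, rank_average -> 0.55, rank_probabilistic -> 0.92
```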
48. Populating the LVS Table
- HiWE supports a variety of mechanisms for adding entries to the LVS table:
- Explicit initialization
- Built-in entries
- Wrapped data sources
- Crawling experience
49. Explicit Initialization
- Supply labels and associated value sets at startup time
- Useful to equip the crawler with values for the labels it is most likely to encounter
- In the semiconductor example, we supply HiWE with a list of relevant company names and associate the list with the labels "Company" and "Company Name"
50. Built-in Entries
- HiWE has built-in entries for commonly used concepts:
- Dates and times
- Names of months
- Days of the week
51. Wrapped Data Sources
- The LVS Manager can query data sources through a well-defined interface
- The data source must be wrapped by a program that supports two kinds of queries:
- Given a set of labels, return a value set
- Given a set of values, return other values that belong to the same value set
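The two-query wrapper interface might look like this (a sketch; the class name, method names, and the in-memory table standing in for a real data source are all hypothetical):

```python
class DataSourceWrapper:
    """Wraps a data source behind the two queries the LVS Manager needs."""

    def __init__(self, table):
        # table: label -> set of values (stand-in for a real data source)
        self.table = table

    def values_for_labels(self, labels):
        # Query 1: given a set of labels, return a value set
        result = set()
        for label in labels:
            result |= self.table.get(label, set())
        return result

    def expand_values(self, values):
        # Query 2: given a set of values, return the other values
        # that belong to the same value set(s)
        result = set()
        for value_set in self.table.values():
            if value_set & values:
                result |= value_set
        return result - values

source = DataSourceWrapper({"State": {"California", "Nevada", "Utah"}})
```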
52. HiWE Architecture
[Figure: the HiWE architecture, repeated from slide 31]
53. Crawling Experience
- Finite-domain form elements are a useful source of labels and associated value sets
- HiWE adds this information to the LVS table
- This is effective when a similar label is associated with a finite-domain element in one form and with an infinite-domain element in another
54. Computing Weights
- A new value added to the LVS table must be assigned a suitable weight
- Explicit-initialization and built-in values have fixed weights
- Values obtained from external data sources, or through the crawler's own activity, are assigned weights that vary with time
55. Initial Weights
- For external data sources, initial weights are computed by the respective wrappers
- For values directly gathered by the crawler:
- For a finite-domain element E, M_{Dom(E)}(x) = 1 iff x ∈ Dom(E)
- Three cases arise when incorporating Dom(E) into the LVS table
56. Updating the LVS - Case 1
- The crawler successfully extracts label(E) and computes LabelMatch(E) = (L, V)
- Replace the (L, V) entry by the entry (L, V ∪ Dom(E)), with updated weights
- Intuitively, Dom(E) provides new elements to the value set and boosts the weights of existing elements
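A sketch of the Case 1 update under an assumed weight rule (the exact update formula is not reproduced here): values from Dom(E) arrive with weight 1, and the merged weight is the average of the old weight and the incoming one, so existing values are boosted and new values enter at 0.5.

```python
def update_lvs_case1(lvs_table, label, dom):
    """Merge a finite domain Dom(E) into the matched LVS entry (L, V).
    Assumed rule: incoming values carry weight 1; the merged weight is
    the average of the old weight and the incoming weight."""
    V = lvs_table[label]
    for v in dom:
        old = V.get(v, 0.0)          # 0.0 if v is new to the value set
        V[v] = (old + 1.0) / 2.0     # new values get 0.5; existing ones are boosted

lvs = {"State": {"California": 0.8}}
update_lvs_case1(lvs, "State", ["California", "Nevada"])
# California: (0.8 + 1) / 2 = 0.9; Nevada enters with (0 + 1) / 2 = 0.5
```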
57. Updating the LVS - Case 2
- The crawler successfully extracts label(E), but LabelMatch(E) = nil
- A new entry (label(E), Dom(E)) is created in the LVS table
58. Updating the LVS - Case 3
- The crawler cannot extract label(E)
- For each LVS entry (L, V), compute a similarity score between Dom(E) and V
- Identify the entry (L, V) with the maximum score
- Replace that entry with a new entry (L, V ∪ Dom(E))
- The confidence in the new values depends on that maximum score
59. Configuring HiWE
- Initialization of the crawling activity includes:
- The set of sites to crawl
- Explicit-initialization entries for the LVS table
- The set of data sources
- The label-matching threshold σ
- The minimum acceptable value-assignment rank ρ_min
- The value-assignment aggregation function
60. Introducing LITE
- LITE: Layout-based Information Extraction Technique
- The physical layout of a page is also used to aid in extraction
- For example, a piece of text that is physically adjacent to a form element is very likely a description of that element
- Unfortunately, this semantic association is not always reflected in the underlying HTML of the Web page
61. Layout-based Information Extraction Technique
62. The Challenge
- Accurate extraction of the labels and domains of form elements
- Elements that are visually close on the screen may be arbitrarily separated in the actual HTML text
- Even where HTML provides a facility for expressing semantic relationships, it is not used in the majority of pages
- Accurate page layout is a complex process
- Yet even a crude approximate layout of portions of a page can yield very useful semantic information
63. Form Analysis in HiWE
- The LITE-based heuristic:
- Prune the form page, isolating the elements which directly influence the layout
- Approximately lay out the pruned page using a custom layout engine
- Identify the pieces of text that are physically closest to the form element (these are the candidates)
- Rank each candidate using a variety of measures
- Choose the highest-ranked candidate as the label
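The proximity step of the heuristic can be sketched over approximate layout coordinates (a sketch only: the coordinates, the Euclidean distance measure, and the sample text pieces are illustrative, and HiWE additionally ranks candidates by other measures before choosing):

```python
import math

def closest_text(element_pos, text_pieces):
    """Pick the piece of text whose layout position is nearest to the
    form element; element_pos and each text position are (x, y) pairs
    produced by an approximate layout of the pruned page."""
    def distance(item):
        (x, y), _text = item
        ex, ey = element_pos
        return math.hypot(x - ex, y - ey)
    return min(text_pieces, key=distance)[1]

# Text pieces laid out around a textbox positioned at (120, 40):
pieces = [((100, 40), "Company Name"),
          ((100, 200), "Search tips"),
          ((400, 40), "(c) 2001")]
label_candidate = closest_text((120, 40), pieces)   # -> "Company Name"
```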
64. Pruning Before Partial Layout
65. LITE - Figure
- Key idea in LITE: the physical page layout embeds significant semantic information
[Figure: a DOM parser builds a DOM representation of the form page; pruning through the DOM API yields a pruned page, the list of elements, and the submission info; partial layout of the pruned page produces the labels and domain values of the internal form representation]
66. Experiments
- A number of experiments were conducted to study the performance of HiWE
- We will see how performance depends on:
- Minimum form size
- Crawler input to the LVS table
- Different ranking functions
67. Parameter Values for Task 1
- Task 1: news articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years
68. Variation of Performance with Minimum Form Size
69. Effect of Crawler Input to the LVS Table
70. Different Ranking Functions
- When using fuzzy conjunction and average, the crawler's submission efficiency is mostly above 80%
- The probabilistic function performs poorly
- Average submits more forms than fuzzy conjunction (it is less conservative)
71. Label Extraction
- The LITE-based heuristic achieved an overall accuracy of 93%
- The test set was manually analyzed
72. Conclusion
- Addressed the problem of extending current-day crawlers to build repositories that include pages from the Hidden Web
- Presented a simple operational model of a hidden Web crawler
- Described the implementation of a prototype crawler, HiWE
- Introduced a technique for layout-based information extraction
73. Bibliography
- S. Raghavan and H. Garcia-Molina. Crawling the Hidden Web. Stanford University, 2001
- BrightPlanet.com white papers
- D. Lopresti and A. Tomkins. Block edit models for approximate string matching