Title: Crawling the Hidden Web
1. Crawling the Hidden Web
- by Michael Weinberg
- mwmw_at_cs.huji.ac.il
Internet DB Seminar, The Hebrew University of Jerusalem, School of Computer Science and Engineering, December 2001
2. Agenda
- Hidden Web - what is it all about?
- Generic model for a hidden Web crawler
- HiWE (Hidden Web Exposer)
- LITE - Layout-based Information Extraction Technique
- Results from experiments conducted to test these techniques
3. Web Crawlers
- Automatically traverse the Web graph, building a local repository of the portion of the Web that they visit
- Traditionally, crawlers have only targeted a portion of the Web called the publicly indexable Web (PIW)
- PIW: the set of pages reachable purely by following hypertext links, ignoring search forms and pages that require authentication
4. The Hidden Web
- Recent studies show that a significant fraction of Web content in fact lies outside the PIW
- Large portions of the Web are hidden behind search forms in searchable databases
- HTML pages are dynamically generated in response to queries submitted via the search forms
- Also referred to as the Deep Web
5. The Hidden Web's Growth
- The Hidden Web continues to grow, as organizations with large amounts of high-quality information place their content online, providing web-accessible search facilities over existing databases
- For example:
- Census Bureau
- Patents and Trademarks Office
- News media companies
- InvisibleWeb.com lists over 10,000 such databases
6. Surface Web
7. Deep Web
8. Deep Web Content Distribution
9. Deep Web Stats
- The Deep Web is 500 times larger than the PIW!
- Contains 7,500 terabytes of information (March 2000)
- More than 200,000 Deep Web sites exist
- Sixty of the largest Deep Web sites collectively contain about 750 terabytes of information
- 95% of the Deep Web is publicly accessible (no fees)
- Google indexes about 16% of the PIW, so we search about 0.03% of the pages available today
10. The Problem
- The Hidden Web contains large amounts of high-quality information
- The information is buried in dynamically generated sites
- Search engines that use traditional crawlers never find this information
11. The Solution
- Build a hidden Web crawler
- It can crawl and extract content from hidden databases
- It enables indexing, analysis, and mining of hidden Web content
- The content extracted by such crawlers can be used to categorize and classify the hidden databases
12. Challenges
- There are significant technical challenges in designing a hidden Web crawler
- It must interact with forms that were designed primarily for human consumption
- It must provide input in the form of search queries
- How do we equip the crawler with input values for use in constructing search queries?
- To address these challenges, we adopt a task-specific, human-assisted approach
13. Task-Specificity
- Extract content based on the requirements of a particular application or task
- For example, consider a market analyst interested in press releases, articles, etc. pertaining to the semiconductor industry, and dated sometime in the last ten years
14. Human Assistance
- Human assistance is critical to ensure that the crawler issues queries that are relevant to the particular task
- For instance, in the semiconductor example, the market analyst may provide the crawler with lists of companies or products that are of interest
- The crawler can then gather additional potential company and product names as it processes pages
15. Two Steps
- There are two steps in achieving our goal:
- Resource discovery: identify sites and databases that are likely to be relevant to the task
- Content extraction: actually visit the identified sites to submit queries and extract the hidden pages
- In this presentation we do not directly address the resource discovery problem
16. Hidden Web Crawlers
17. User-Form Interaction
[Figure: a user interacts with a hidden database through its Web query front-end: (1) download form, (2) view form, (3) fill out form, (4) submit form, (5) download response, (6) view result; the form page comes from the front-end, the response page from the hidden database]
18. Operational Model
- Our model of a hidden Web crawler consists of four components:
- Internal form representation
- Task-specific database
- Matching function
- Response analysis
- Form page: the page containing the search form
- Response page: the page received in response to a form submission
19. Generic Operational Model
[Figure: the hidden Web crawler downloads a form page and builds an Internal Form Representation via form analysis; matching the representation against the task-specific database yields a set of value-assignments; each assignment is submitted to the Web query front-end, the response page is downloaded, and response analysis feeds back into the match step]
20. Internal Form Representation
- A form F = ({E1, E2, ..., En}, S, M) consists of a set of n form elements, plus:
- S: submission information associated with the form
- the submission URL
- internal identifiers for each form element
- M: meta-information about the form
- the web site hosting the form
- the set of pages pointing to this form page
- other text on the page besides the form
21. Task-specific Database
- The crawler is equipped with a task-specific database D
- D contains the information necessary to formulate queries relevant to the particular task
- In the market analyst example, D could contain lists of semiconductor company and product names
- The actual format and organization of D are specific to a particular crawler implementation
- HiWE uses a set of labeled fuzzy sets
22. Matching Function
- Matching algorithm:
- Input: the internal form representation and the current contents of the database D
- Output: a set of value assignments; an assignment [E1 ← v1, ..., En ← vn] associates value vi with element Ei
23. Response Analysis
- The module that stores the response page in the repository
- Attempts to distinguish between pages containing search results and pages containing error messages
- This feedback is used to tune the matching function
24. Traditional Performance Metrics
- Traditional crawler performance metrics:
- Crawling speed
- Scalability
- Page importance
- Freshness
- These metrics are relevant to hidden Web crawlers, but do not capture the fundamental challenges in dealing with the Hidden Web
25. New Performance Metrics
- Coverage metric: relevant pages extracted / relevant pages present in the targeted hidden databases
- Problem: it is difficult to estimate how much of the hidden content is relevant to the task
26. New Performance Metrics
- Strict submission efficiency: SE_strict = N_success / N_total
- N_total: the total number of forms that the crawler submits
- N_success: the number of submissions that result in a response page with one or more search results
- Problem: the crawler is penalized if the database didn't contain any relevant search results
27. New Performance Metrics
- Lenient submission efficiency: SE_lenient = N_valid / N_total, where N_valid is the number of semantically correct form submissions
- Penalizes the crawler only if a form submission is semantically incorrect
- Problem: difficult to evaluate, since manual inspection is needed to decide whether a form submission is semantically correct
28. Design Issues
- What information about each form element should the crawler collect?
- What meta-information is likely to be useful?
- How should the task-specific database be organized, updated, and accessed?
- What Match function is likely to maximize submission efficiency?
- How can the response analysis module be used to tune the Match function?
29. HiWE - Hidden Web Exposer
30. Basic Idea
- Extract descriptive information (a label) for each element of a form
- The task-specific database is organized in terms of categories, each of which is also associated with labels
- The matching function attempts to match form labels to database categories to compute a set of candidate value assignments
31. HiWE Architecture
[Figure: the Crawl Manager drives crawling over the WWW using the URL List (URL 1 ... URL N); the Parser extracts links from fetched pages; the Form Analyzer, Form Processor, and Response Analyzer handle form submission, responses, and feedback; the LVS Manager maintains the LVS table of (Label, Value-Set) pairs, fed by custom data sources and by feedback from the Response Analyzer]
32. HiWE's Main Modules
- URL List: contains all the URLs the crawler has discovered so far
- Crawl Manager: controls the entire crawling process
- Parser: extracts hypertext links from the crawled pages and adds them to the URL list
- Form Analyzer, Form Processor, Response Analyzer: together implement the form processing and submission operations
33. HiWE's Main Modules
- LVS Manager: manages additions and accesses to the LVS table
- LVS table: HiWE's implementation of the task-specific database
34. HiWE's Form Representation
- Form F = ({E1, ..., En}, S, M)
- The third component M is the empty set, since the current implementation of HiWE does not collect any meta-information about the form
- For each element Ei, HiWE collects a domain Dom(Ei) and a label label(Ei)
35. HiWE's Form Representation
- Domain of an element: the set of values which can be associated with the corresponding form element
- May be a finite set (e.g., the domain of a selection list)
- May be an infinite set (e.g., the domain of a text box)
- Label of an element: the descriptive information associated with the element, if any
- Most forms include some descriptive text to help users understand the semantics of the element
36. Form Representation - Figure
- Element E1: Label(E1) = "Document Type", Dom(E1) = {Articles, Press Releases, Reports}
- Element E2: Label(E2) = "Company Name", Dom(E2) = {s | s is a text string}
- Element E3: Label(E3) = "Sector", Dom(E3) = {Entertainment, Automobile, Information Technology, Construction}
37. HiWE's Task-specific Database
- Task-specific information is organized in terms of a finite set of concepts or categories
- Each concept has one or more labels and an associated set of values
- For example, the label "Company Name" could be associated with the set of values {IBM, Microsoft, HP, ...}
38. HiWE's Task-specific Database
- The concepts are organized in a table called the Label Value Set (LVS) table
- Each entry in the LVS table is of the form (L, V)
- L: a label
- V: a fuzzy set of values
- The fuzzy set V has an associated membership function M_V that assigns weights, in the range [0, 1], to each member of the set
- M_V(v) is a measure of the crawler's confidence that the assignment of v to an element labeled L is semantically meaningful
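One way to picture an LVS entry (a sketch, not HiWE's implementation; the dictionary layout and the sample entries are assumptions): each label maps to a fuzzy value set represented as a value-to-weight dictionary, with weights in [0, 1].

```python
# LVS table: label L -> fuzzy value set V, represented as {value: weight}
lvs_table = {
    "Company Name": {"IBM": 1.0, "Microsoft": 1.0, "HP": 0.8},
    "State": {"California": 1.0, "Nevada": 0.9},
}

def membership(label, value):
    """M_V(v): the crawler's confidence that assigning value v to an
    element labeled L is semantically meaningful (0.0 if unknown)."""
    return lvs_table.get(label, {}).get(value, 0.0)
```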
39. HiWE's Matching Function
- For elements with a finite domain, the set of possible values is fixed and can be exhaustively enumerated
- In this example, the crawler can first retrieve all relevant articles, then all relevant press releases, and finally all relevant reports
- Element E1: Label(E1) = "Document Type", Dom(E1) = {Articles, Press Releases, Reports}
40. HiWE's Matching Function
- For elements with an infinite domain, HiWE textually matches the labels of these elements with labels in the LVS table
- For example, if a textbox element has the label "Enter State", which best matches an LVS entry with the label "State", the values associated with that LVS entry (e.g., "California") can be used to fill the textbox
- How do we match form labels with LVS labels?
41. Label Matching
- Two steps in matching form labels with LVS labels:
- 1. Normalization: includes conversion to a common case and standard style
- 2. Use of an approximate string-matching algorithm to compute minimum edit distances
- HiWE employs D. Lopresti and A. Tomkins' string-matching algorithm, which takes word reordering into account
42. Label Matching
- Let LabelMatch(Ei) denote the LVS entry with the minimum edit distance to label(Ei)
- Threshold σ: if all LVS entries are more than σ edit operations away from label(Ei), then LabelMatch(Ei) = nil
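The two steps can be sketched as follows. One substitution to note: HiWE uses the Lopresti-Tomkins block edit model, which tolerates word reordering; this sketch uses plain Levenshtein distance instead, and the threshold values used below are illustrative.

```python
def normalize(label):
    # Step 1: conversion to a common case and standard style
    return " ".join(label.lower().split())

def edit_distance(a, b):
    # Plain Levenshtein distance (HiWE itself uses a block edit model
    # that also accounts for word reordering)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def label_match(element_label, lvs_labels, sigma):
    """Return the LVS label with minimum edit distance to the form
    label, or None if every entry is more than sigma operations away."""
    target = normalize(element_label)
    best = min(lvs_labels, key=lambda l: edit_distance(target, normalize(l)))
    if edit_distance(target, normalize(best)) > sigma:
        return None
    return best
```

For example, "Enter State" matches the LVS label "State" once both are normalized, while an unrelated label falls past the threshold and yields nil.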
43. Label Matching
- For each element Ei, compute a pair (Li, Vi)
- If Ei has an infinite domain and (L, V) is the closest matching LVS entry, then Li = L and Vi = V
- If Ei has a finite domain, then Vi = Dom(Ei), with all weights equal to 1
- The set of value assignments is computed as the product V1 × V2 × ... × Vn of all the Vi's
- Too many assignments?
44. Ranking Value Assignments
- HiWE employs an aggregation function to compute a rank for each value assignment
- It uses a configurable parameter: the minimum acceptable value-assignment rank (ρ_min)
- The intent is to improve submission efficiency by only using high-quality value assignments
- We will show three possible aggregation functions
45. Fuzzy Conjunction
- The rank of a value assignment is the minimum of the weights of all the constituent values
- Very conservative in assigning ranks: assigns a high rank only if each individual weight is high
46. Average
- The rank of a value assignment is the average of the weights of the constituent values
- Less conservative than fuzzy conjunction
47. Probabilistic
- This ranking function treats weights as probabilities
- M_V(v) is the likelihood that the choice of v is useful, and 1 - M_V(v) is the likelihood that it is not
- The likelihood of a value assignment being useful is 1 - Π_i (1 - M_{Vi}(vi))
- Assigns a low rank only if all the individual weights are very low
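The three aggregation functions follow directly from their definitions; a sketch (the function names are mine), with a two-value assignment showing how conservative each function is:

```python
def rank_fuzzy(weights):
    # Fuzzy conjunction: the minimum of the constituent weights
    return min(weights)

def rank_average(weights):
    # The average of the constituent weights
    return sum(weights) / len(weights)

def rank_probabilistic(weights):
    # 1 - product of (1 - w): low only when every weight is low
    p = 1.0
    for w in weights:
        p *= (1.0 - w)
    return 1.0 - p

weights = [0.9, 0.2]
# Fuzzy conjunction is the most conservative, probabilistic the least:
# rank_fuzzy -> 0.2, rank_average -> 0.55, rank_probabilistic -> 0.92
```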
48. Populating the LVS Table
- HiWE supports a variety of mechanisms for adding entries to the LVS table:
- Explicit initialization
- Built-in entries
- Wrapped data sources
- Crawling experience
49. Explicit Initialization
- Supply labels and associated value sets at startup time
- Useful to equip the crawler with values for the labels it is most likely to encounter
- In the semiconductor example, we supply HiWE with a list of relevant company names and associate the list with the labels "Company" and "Company Name"
50. Built-in Entries
- HiWE has built-in entries for commonly used concepts:
- Dates and times
- Names of months
- Days of the week
51. Wrapped Data Sources
- The LVS Manager can query data sources through a well-defined interface
- The data source must be wrapped by a program that supports two kinds of queries:
- Given a set of labels, return a value set
- Given a set of values, return other values that belong to the same value set
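The two-query wrapper interface might look like this (a sketch; the class name, method names, and the in-memory table standing in for a real data source are all hypothetical):

```python
class DataSourceWrapper:
    """Wraps a data source behind the two queries the LVS Manager needs."""

    def __init__(self, table):
        # table: label -> set of values (stand-in for a real data source)
        self.table = table

    def values_for_labels(self, labels):
        # Query 1: given a set of labels, return a value set
        result = set()
        for label in labels:
            result |= self.table.get(label, set())
        return result

    def expand_values(self, values):
        # Query 2: given a set of values, return the other values
        # that belong to the same value set(s)
        result = set()
        for value_set in self.table.values():
            if value_set & values:
                result |= value_set
        return result - values

source = DataSourceWrapper({"State": {"California", "Nevada", "Utah"}})
```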
52. HiWE Architecture
[Figure: the HiWE architecture, repeated from slide 31]
53. Crawling Experience
- Finite-domain form elements are a useful source of labels and associated value sets
- HiWE adds this information to the LVS table
- This is effective when a similar label is associated with a finite-domain element in one form and with an infinite-domain element in another
54. Computing Weights
- A new value added to the LVS table must be assigned a suitable weight
- Explicit-initialization and built-in values have fixed weights
- Values obtained from external data sources, or through the crawler's own activity, are assigned weights that vary with time
55. Initial Weights
- For external data sources, initial weights are computed by the respective wrappers
- For values directly gathered by the crawler:
- For a finite-domain element E, M_{Dom(E)}(x) = 1 iff x ∈ Dom(E)
- Three cases arise when incorporating Dom(E) into the LVS table
56. Updating the LVS - Case 1
- The crawler successfully extracts label(E) and computes LabelMatch(E) = (L, V)
- Replace the (L, V) entry by the entry (L, V ∪ Dom(E)), with updated weights
- Intuitively, Dom(E) provides new elements to the value set and boosts the weights of existing elements
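A sketch of the Case 1 update under an assumed weight rule (the exact update formula is not reproduced here): values from Dom(E) arrive with weight 1, and the merged weight is the average of the old weight and the incoming one, so existing values are boosted and new values enter at 0.5.

```python
def update_lvs_case1(lvs_table, label, dom):
    """Merge a finite domain Dom(E) into the matched LVS entry (L, V).
    Assumed rule: incoming values carry weight 1; the merged weight is
    the average of the old weight and the incoming weight."""
    V = lvs_table[label]
    for v in dom:
        old = V.get(v, 0.0)          # 0.0 if v is new to the value set
        V[v] = (old + 1.0) / 2.0     # new values get 0.5; existing ones are boosted

lvs = {"State": {"California": 0.8}}
update_lvs_case1(lvs, "State", ["California", "Nevada"])
# California: (0.8 + 1) / 2 = 0.9; Nevada enters with (0 + 1) / 2 = 0.5
```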
57. Updating the LVS - Case 2
- The crawler successfully extracts label(E), but LabelMatch(E) = nil
- A new entry (label(E), Dom(E)) is created in the LVS table
58. Updating the LVS - Case 3
- The crawler cannot extract label(E)
- For each LVS entry (L, V), compute a similarity score between Dom(E) and V
- Identify the entry (L, V) with the maximum score
- Replace that entry with a new entry (L, V ∪ Dom(E))
- The confidence in the new values depends on that maximum score
59. Configuring HiWE
- Initialization of the crawling activity includes:
- The set of sites to crawl
- Explicit-initialization entries for the LVS table
- The set of data sources
- The label-matching threshold σ
- The minimum acceptable value-assignment rank ρ_min
- The value-assignment aggregation function
60. Introducing LITE
- LITE: Layout-based Information Extraction Technique
- The physical layout of a page is also used to aid in extraction
- For example, a piece of text that is physically adjacent to a form element is very likely a description of that element
- Unfortunately, this semantic association is not always reflected in the underlying HTML of the Web page
61. Layout-based Information Extraction Technique
62. The Challenge
- Accurate extraction of the labels and domains of form elements
- Elements that are visually close on the screen may be arbitrarily separated in the actual HTML text
- Even where HTML provides a facility for expressing semantic relationships, it is not used in the majority of pages
- Accurate page layout is a complex process
- Yet even a crude approximate layout of portions of a page can yield very useful semantic information
63. Form Analysis in HiWE
- The LITE-based heuristic:
- Prune the form page, isolating the elements which directly influence the layout
- Approximately lay out the pruned page using a custom layout engine
- Identify the pieces of text that are physically closest to the form element (these are the candidates)
- Rank each candidate using a variety of measures
- Choose the highest-ranked candidate as the label
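The proximity step of the heuristic can be sketched over approximate layout coordinates (a sketch only: the coordinates, the Euclidean distance measure, and the sample text pieces are illustrative, and HiWE additionally ranks candidates by other measures before choosing):

```python
import math

def closest_text(element_pos, text_pieces):
    """Pick the piece of text whose layout position is nearest to the
    form element; element_pos and each text position are (x, y) pairs
    produced by an approximate layout of the pruned page."""
    def distance(item):
        (x, y), _text = item
        ex, ey = element_pos
        return math.hypot(x - ex, y - ey)
    return min(text_pieces, key=distance)[1]

# Text pieces laid out around a textbox positioned at (120, 40):
pieces = [((100, 40), "Company Name"),
          ((100, 200), "Search tips"),
          ((400, 40), "(c) 2001")]
label_candidate = closest_text((120, 40), pieces)   # -> "Company Name"
```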
64. Pruning Before Partial Layout
65. LITE - Figure
- Key idea in LITE: the physical page layout embeds significant semantic information
[Figure: a DOM parser builds a DOM representation of the form page; pruning through the DOM API yields a pruned page, the list of elements, and the submission info; partial layout of the pruned page produces the labels and domain values of the internal form representation]
66. Experiments
- A number of experiments were conducted to study the performance of HiWE
- We will see how performance depends on:
- Minimum form size
- Crawler input to the LVS table
- Different ranking functions
67. Parameter Values for Task 1
- Task 1: news articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years
68. Variation of Performance with Minimum Form Size
69. Effect of Crawler Input to the LVS Table
70. Different Ranking Functions
- When using fuzzy conjunction and average, the crawler's submission efficiency is mostly above 80%
- The probabilistic function performs poorly
- Average submits more forms than fuzzy conjunction (it is less conservative)
71. Label Extraction
- The LITE-based heuristic achieved an overall accuracy of 93%
- The test set was manually analyzed
72. Conclusion
- Addressed the problem of extending current-day crawlers to build repositories that include pages from the Hidden Web
- Presented a simple operational model of a hidden Web crawler
- Described the implementation of a prototype crawler, HiWE
- Introduced a technique for layout-based information extraction
73. Bibliography
- S. Raghavan and H. Garcia-Molina. Crawling the Hidden Web. Stanford University, 2001
- BrightPlanet.com white papers
- D. Lopresti and A. Tomkins. Block edit models for approximate string matching