Transcript and Presenter's Notes

Title: Crawling the Hidden Web


1
Crawling the Hidden Web
  • Authors
  • Sriram Raghavan
  • Hector Garcia-Molina

Presented by Philip Garcia
2
Defined
  • Traditionally, web crawlers have only been able
    to crawl the publicly indexable web (PIW).
  • They have not been able to crawl behind search
    forms or into searchable structured databases
    (dynamically generated content).
  • It has been estimated that the hidden web is 500
    times the size of the PIW.

3
HiWE (Hidden Web Explorer) Goals
  • The crawler should be able to both crawl and
    extract content from these hidden databases.
  • The crawler should be able to fill out forms that
    were designed to be used by humans.
  • It should generate many result pages with high
    efficiency.

4
Task Specific Crawling
  • Rather than design a hidden web crawler for the
    entire web, the authors decided to build a
    task-specific crawler that requires user
    intervention.
  • It requires the user to mark relevant sites
    (databases) in the resource discovery stage.
  • It requires a user-created database (more on this
    later).

5
Crawler Form Interaction
6
Analysis of a Form
  • Each form on a page is modeled as follows (see
    the sketch below)
  • F = ({E1, E2, ..., En}, S, M)
  • Each Ei represents a form element
  • Each element Ei has an associated domain, Dom(Ei)
  • S represents the submission information
  • M represents the page's meta-information
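A minimal Python sketch of this form model; the class and field names are illustrative, not taken from the HiWE implementation:

```python
from dataclasses import dataclass, field

@dataclass
class FormElement:
    """One form element Ei: its label and the finite domain Dom(Ei) of values it accepts."""
    label: str
    domain: list = field(default_factory=list)   # empty if the element is free-text

@dataclass
class FormModel:
    """F = ({E1, ..., En}, S, M) as described on this slide."""
    elements: list          # the form elements E1 ... En
    submission_info: dict   # S: action URL, HTTP method, encoding, ...
    meta_info: dict         # M: URL and title of the page hosting the form, ...
```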

7
Analysis of a Form (contd)
8
Task Specific Databases
  • The HiWE system uses an LVS (Label Value Set)
    table.
  • For each label (i.e., form entry), the table
    stores a set of values, each with a weight
    indicating how relevant that value is for the
    label.
  • Each weight is a fuzzy value between 0 and 1.
  • E.g., Company Name: AMD (1.0), Fairchild (0.92),
    Motorola (0.95), Yahoo (0.1)
  • Values and weights can be updated based on
    results (more on this later); see the sketch
    below.
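A minimal sketch of an LVS table as a Python dictionary mapping a label to a fuzzy value set; the weights mirror the slide's example, and the helper function is illustrative only:

```python
lvs_table = {
    "Company Name": {
        "AMD": 1.0,
        "Fairchild": 0.92,
        "Motorola": 0.95,
        "Yahoo": 0.1,   # low weight: a weak match for this task
    },
}

def values_for(lvs, label, min_weight=0.0):
    """Return (value, weight) candidates for a label, best-weighted first."""
    entries = lvs.get(label, {})
    return sorted(
        ((v, w) for v, w in entries.items() if w >= min_weight),
        key=lambda vw: vw[1],
        reverse=True,
    )
```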

9
Label Matching
  • The system must first match each label on the
    form with the appropriate label in D.
  • HiWE uses LITE (Layout-based Information
    Extraction Technique) to extract information from
    both the form pages and the response pages.

10
LITE operation
  • Prunes away unnecessary portions of the page.
  • Approximately lays out the page.
  • Identifies the candidate text that is physically
    closest to the form element (see the sketch
    below).
  • Ranks each candidate.
  • Chooses the highest-ranked candidate and performs
    any necessary post-processing.
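A rough sketch of the proximity step, assuming each candidate text piece and the form element carry (x, y) coordinates from the approximate layout; the function name and scoring rule are illustrative, not the authors' code:

```python
import math

def pick_label(element_pos, candidates):
    """candidates: list of (text, (x, y)); return the text rendered closest to the element."""
    def distance(pos):
        return math.hypot(pos[0] - element_pos[0], pos[1] - element_pos[1])
    best_text, _ = min(candidates, key=lambda c: distance(c[1]))
    # post-processing: strip decorations such as trailing colons or asterisks
    return best_text.strip().rstrip(":*").strip()

# "Company Name:" rendered just left of the input box beats the distant "Search" button text.
print(pick_label((120, 40), [("Company Name:", (20, 40)), ("Search", (300, 80))]))
```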

11
Matching Function
  • The crawler must fill in each element of the form
    . . . but how?
  • It uses a task-specific database D containing
    relevant domain information.
  • E.g., company names, states, etc.
  • The crawler computes Match(({E1, ..., En}, S, M), D)
  • This produces a value assignment E1 → v1, ...,
    En → vn (see the sketch below)
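A minimal sketch of such a matching step, reusing the FormModel and LVS sketches above and assuming some match_label() helper that maps a form label to the closest label in D; all names are illustrative, and the real crawler enumerates many value assignments rather than just the single best one:

```python
def match(form, lvs, match_label):
    """Return one assignment {form label: chosen value}, using the best-weighted legal value."""
    assignment = {}
    for element in form.elements:
        db_label = match_label(element.label, lvs)    # e.g. LITE-extracted label + string matching
        if db_label is None:
            continue                                  # no usable label in D for this element
        candidates = values_for(lvs, db_label)        # weighted candidates, best first
        if element.domain:                            # finite-domain element: keep only legal values
            candidates = [(v, w) for v, w in candidates if v in element.domain]
        if candidates:
            assignment[element.label] = candidates[0][0]
    return assignment
```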

12
HiWE Architecture
13
Ranking Values
  • HiWE uses 3 ranking functions for determining
    which value assignments to use (see the sketch
    after this list).
  • Fuzzy conjunction (ρ_fuz)
  • Average (ρ_avg)
  • Probabilistic (ρ_prob)
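A sketch of the three ranking functions, scoring a candidate assignment from the LVS weights w1..wn of its chosen values; the formulas below are one reading of the paper (minimum, average, and probabilistic combination), not verbatim code:

```python
from functools import reduce

def rho_fuz(weights):
    # fuzzy conjunction: an assignment is only as good as its weakest value
    return min(weights)

def rho_avg(weights):
    # average weight of the chosen values
    return sum(weights) / len(weights)

def rho_prob(weights):
    # combine weights as if they were independent probabilities of being "right"
    return 1.0 - reduce(lambda acc, w: acc * (1.0 - w), weights, 1.0)

print(rho_fuz([1.0, 0.92]), rho_avg([1.0, 0.92]), rho_prob([1.0, 0.92]))
```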

14
Populating the LVS Table
  • Explicit initialization
  • Built-in entries
  • Wrapped data sources (e.g., obtaining entries
    from Yahoo or the Open Directory)
  • Crawling experience (obtaining entries from form
    elements with finite domains whose values are not
    already in the LVS table); see the sketch below.
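A small sketch of the "crawling experience" source, reusing the lvs_table sketch above and assuming we can read the option values of a finite-domain element (e.g. a select list) encountered during the crawl; the default-weight update rule here is an assumption for illustration, not the paper's exact scheme:

```python
def add_crawled_values(lvs, label, finite_domain, default_weight=0.5):
    """Add values seen in a finite-domain form element that are not already in the LVS table."""
    entries = lvs.setdefault(label, {})
    for value in finite_domain:
        entries.setdefault(value, default_weight)   # existing entries keep their weights

# e.g. a crawled select list for "Company Name" contributes new values at the default weight
add_crawled_values(lvs_table, "Company Name", ["AMD", "Intel", "National Semiconductor"])
```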

15
Analysis
  • To evaluate the performance of their system, the
    authors introduced a new metric, submission
    efficiency (worked example below).
  • SE_strict = N_success / N_total, where N_success
    is the number of submissions that return a valid
    result page
  • SE_lenient = N_valid / N_total, where N_valid is
    the number of semantically valid submissions,
    even if the site returns no results
  • They used these metrics because performance
    should be independent of the contents of D and of
    each site's form search functionality.
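As a worked example with hypothetical numbers: if the crawler makes 1,000 form submissions, of which 850 return result pages and 930 are semantically valid, then SE_strict = 850/1000 = 0.85 and SE_lenient = 930/1000 = 0.93.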

16
Results
Variation of performance using different minimum
form sizes
17
Performance Using Different Ranking Functions
  • ρ_fuz produces the highest overall efficiency
  • ρ_avg generates more successful results, but is
    less efficient
  • Useful if our goal is to get the most content
    with a reasonable efficiency
  • ρ_prob generates poor results.

18
Effect of Crawler Input to LVS Table
19
Conclusions
  • HiWE is a useful system for crawling forms on the
    web and obtaining information.
  • Downsides
  • Requires significant user intervention
  • Only works for topic-specific domains (although
    this is also a plus)
  • Does not address forms with optional entries in
    them
  • Does not examine the usefulness of the crawls.
  • Does not compare with other results (note: this
    is the first automated approach, so comparison
    would be difficult)

20
References
  • Sriram Raghavan and Hector Garcia-Molina.
    Crawling the Hidden Web. Technical Report
    2000-36, Computer Science Department, Stanford
    University, December 2000. Available at
    http://dbpubs.stanford.edu/pub/2000-36.
  • Sriram Raghavan and Hector Garcia-Molina.
    Crawling the Hidden Web. In Proceedings of VLDB
    2001, Rome, Italy, September 2001.
  • Authors' original presentation:
    http://forum.stanford.edu/events/archive/affiliates-week/Abstracts/sdw-02-raghavan.html

21
Questions