Transcript and Presenter's Notes

Title: Crawling the Hidden Web


1
Crawling the Hidden Web
  • Authors
  • Sriram Raghavan
  • Hector Garcia-Molina

Presented by Philip Garcia
2
Defined
  • Traditionally, web crawlers have only been able
    to crawl the publicly indexable web (PIW).
  • They have not been able to crawl behind search
    forms or into searchable structured databases
    (dynamically generated content).
  • It has been estimated that the hidden web is 500
    times the size of the PIW.

3
HiWE (Hidden Web Explorer) Goals
  • The crawler should be able to both crawl and
    extract content from these hidden databases.
  • The crawler should be able to fill out forms that
    were designed to be used by humans.
  • It should generate many result pages with high
    efficiency.

4
Task Specific Crawling
  • Rather than design a hidden web crawler for the
    entire web, the authors decided to build a
    task-specific crawler that requires user
    intervention.
  • It requires the user to mark relevant sites
    (databases) in the resource discovery stage.
  • It requires a user-created database (more on this
    later).

5
Crawler Form Interaction
6
Analysis of a Form
  • Each form on a page is modeled as follows (see
    the sketch below)
  • F = ({E1, E2, ..., En}, S, M)
  • Each Ei represents a form element
  • Each element Ei has an associated domain, Dom(Ei)
  • S represents the submission information
  • M represents the page's meta-information
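A minimal Python sketch of this form model; the class and field names are illustrative, not taken from the HiWE implementation:

```python
from dataclasses import dataclass, field

@dataclass
class FormElement:
    """One form element Ei: its label and the finite domain Dom(Ei) of values it accepts."""
    label: str
    domain: list = field(default_factory=list)   # empty if the element is free-text

@dataclass
class FormModel:
    """F = ({E1, ..., En}, S, M) as described on this slide."""
    elements: list          # the form elements E1 ... En
    submission_info: dict   # S: action URL, HTTP method, encoding, ...
    meta_info: dict         # M: URL and title of the page hosting the form, ...
```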

7
Analysis of a Form (contd)
8
Task Specific Databases
  • The HiWE system uses an LVS (Label Value Set)
    table.
  • For each label (i.e., form entry), the table
    stores a set of values, each with a weight
    indicating how relevant that value is for the
    label.
  • Each weight is a fuzzy value between 0 and 1.
  • E.g., Company Name: AMD (1.0), Fairchild (0.92),
    Motorola (0.95), Yahoo (0.1)
  • Values and weights can be updated based on
    results (more on this later); see the sketch
    below.
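A minimal sketch of an LVS table as a Python dictionary mapping a label to a fuzzy value set; the weights mirror the slide's example, and the helper function is illustrative only:

```python
lvs_table = {
    "Company Name": {
        "AMD": 1.0,
        "Fairchild": 0.92,
        "Motorola": 0.95,
        "Yahoo": 0.1,   # low weight: a weak match for this task
    },
}

def values_for(lvs, label, min_weight=0.0):
    """Return (value, weight) candidates for a label, best-weighted first."""
    entries = lvs.get(label, {})
    return sorted(
        ((v, w) for v, w in entries.items() if w >= min_weight),
        key=lambda vw: vw[1],
        reverse=True,
    )
```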

9
Label Matching
  • The system must first match each label on the
    form with the appropriate label in D.
  • HiWE uses LITE (Layout-based Information
    Extraction Technique) to extract information from
    both the form pages and the response pages.

10
LITE operation
  • Prunes away unnecessary portions of the page.
  • Approximately lays out the page.
  • Identifies the candidate text that is physically
    closest to the form element (see the sketch
    below).
  • Ranks each candidate.
  • Chooses the highest-ranked candidate and performs
    any necessary post-processing.
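A rough sketch of the proximity step, assuming each candidate text piece and the form element carry (x, y) coordinates from the approximate layout; the function name and scoring rule are illustrative, not the authors' code:

```python
import math

def pick_label(element_pos, candidates):
    """candidates: list of (text, (x, y)); return the text rendered closest to the element."""
    def distance(pos):
        return math.hypot(pos[0] - element_pos[0], pos[1] - element_pos[1])
    best_text, _ = min(candidates, key=lambda c: distance(c[1]))
    # post-processing: strip decorations such as trailing colons or asterisks
    return best_text.strip().rstrip(":*").strip()

# "Company Name:" rendered just left of the input box beats the distant "Search" button text.
print(pick_label((120, 40), [("Company Name:", (20, 40)), ("Search", (300, 80))]))
```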

11
Matching Function
  • The crawler must fill in each element of the form
    . . . but how?
  • It uses a task-specific database D containing
    relevant domain information.
  • E.g., company names, states, etc.
  • The crawler computes Match(({E1, ..., En}, S, M), D)
  • This produces a value assignment E1 → v1, ...,
    En → vn (see the sketch below)
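A minimal sketch of such a matching step, reusing the FormModel and LVS sketches above and assuming some match_label() helper that maps a form label to the closest label in D; all names are illustrative, and the real crawler enumerates many value assignments rather than just the single best one:

```python
def match(form, lvs, match_label):
    """Return one assignment {form label: chosen value}, using the best-weighted legal value."""
    assignment = {}
    for element in form.elements:
        db_label = match_label(element.label, lvs)    # e.g. LITE-extracted label + string matching
        if db_label is None:
            continue                                  # no usable label in D for this element
        candidates = values_for(lvs, db_label)        # weighted candidates, best first
        if element.domain:                            # finite-domain element: keep only legal values
            candidates = [(v, w) for v, w in candidates if v in element.domain]
        if candidates:
            assignment[element.label] = candidates[0][0]
    return assignment
```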

12
HiWE Architecture
13
Ranking Values
  • HiWE uses 3 ranking functions for determining
    which value assignments to use (see the sketch
    after this list).
  • Fuzzy conjunction (ρ_fuz)
  • Average (ρ_avg)
  • Probabilistic (ρ_prob)
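A sketch of the three ranking functions, scoring a candidate assignment from the LVS weights w1..wn of its chosen values; the formulas below are one reading of the paper (minimum, average, and probabilistic combination), not verbatim code:

```python
from functools import reduce

def rho_fuz(weights):
    # fuzzy conjunction: an assignment is only as good as its weakest value
    return min(weights)

def rho_avg(weights):
    # average weight of the chosen values
    return sum(weights) / len(weights)

def rho_prob(weights):
    # combine weights as if they were independent probabilities of being "right"
    return 1.0 - reduce(lambda acc, w: acc * (1.0 - w), weights, 1.0)

print(rho_fuz([1.0, 0.92]), rho_avg([1.0, 0.92]), rho_prob([1.0, 0.92]))
```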

14
Populating the LVS Table
  • Explicit initialization
  • Built-in entries
  • Wrapped data sources (e.g., obtaining entries
    from Yahoo or the Open Directory)
  • Crawling experience (obtaining entries from form
    elements with finite domains whose values are not
    already in the LVS table); see the sketch below.
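A small sketch of the "crawling experience" source, reusing the lvs_table sketch above and assuming we can read the option values of a finite-domain element (e.g. a select list) encountered during the crawl; the default-weight update rule here is an assumption for illustration, not the paper's exact scheme:

```python
def add_crawled_values(lvs, label, finite_domain, default_weight=0.5):
    """Add values seen in a finite-domain form element that are not already in the LVS table."""
    entries = lvs.setdefault(label, {})
    for value in finite_domain:
        entries.setdefault(value, default_weight)   # existing entries keep their weights

# e.g. a crawled select list for "Company Name" contributes new values at the default weight
add_crawled_values(lvs_table, "Company Name", ["AMD", "Intel", "National Semiconductor"])
```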

15
Analysis
  • To evaluate the performance of their system, the
    authors introduced a new metric, submission
    efficiency (worked example below).
  • SE_strict = N_success / N_total, where N_success
    is the number of submissions that return a valid
    result page
  • SE_lenient = N_valid / N_total, where N_valid is
    the number of semantically valid submissions,
    even if the site returns no results
  • They used these metrics because performance
    should be independent of the contents of D and of
    each site's form search functionality.
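As a worked example with hypothetical numbers: if the crawler makes 1,000 form submissions, of which 850 return result pages and 930 are semantically valid, then SE_strict = 850/1000 = 0.85 and SE_lenient = 930/1000 = 0.93.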

16
Results
Variation of performance using different minimum
form sizes
17
Performance Using Different Ranking Functions
  • ρ_fuz produces the highest overall efficiency
  • ρ_avg generates more successful results, but is
    less efficient
  • Useful if our goal is to get the most content
    with a reasonable efficiency
  • ρ_prob generates poor results.

18
Effect of Crawler Input to LVS Table
19
Conclusions
  • HiWE is a useful system for crawling forms on the
    web and obtaining information.
  • Downsides
  • Requires significant user intervention
  • Only works for topic-specific domains (although
    this is also a plus)
  • Does not address forms with optional entries in
    them
  • Does not examine the usefulness of the crawls.
  • Does not compare with other results (note: this
    is the first automated approach, so comparison
    would be difficult)

20
References
  • Sriram Raghavan and Hector Garcia-Molina.
    Crawling the Hidden Web. Technical Report
    2000-36, Computer Science Department, Stanford
    University, December 2000. Available at
    http://dbpubs.stanford.edu/pub/2000-36.
  • Sriram Raghavan and Hector Garcia-Molina.
    Crawling the Hidden Web. In Proceedings of VLDB
    2001, Rome, Italy, September 2001.
  • Authors' original presentation:
    http://forum.stanford.edu/events/archive/affiliates-week/Abstracts/sdw-02-raghavan.html

21
Questions