1
Data-rich Section Extraction from HTML pages
  • Introducing the DSE algorithm
  • Original paper by
  • Jiying Wang and Fred H. Lochovsky
  • Department of Computer Science
  • University of Science and Technology
  • Hong Kong
  • Presentation by Max Arends

2
DSE Algorithm
  • The problem
  • Given a web page, find its data-rich section
    without any further input
  • What makes this difficult?
  • Decoration and advertisements
  • Human-oriented HTML pages are difficult for
    computer programs to parse

3
  • Topic distillation
  • Tries to distill a small number of high-quality
    pages that are most representative of the topic.
  • The basic idea is that the number of links pointing
    to a page offers an assessment of its popularity
    and quality.
  • Web information extraction
  • Tries to extract data items from web pages,
    which are usually semi-structured, and return them
    as structured data
  • The DSE algorithm improves both!

4
  • Overview
  • HITS algorithm
  • One of the best-known topic distillation
    algorithms.
  • Given a set of web pages about one specific
    topic, the HITS algorithm calculates an
    authority score for each page (an indication of
    its relevance, derived from the link structure)
  • Basically, it looks at how many links point to
    a page (as in Google's link-based ranking)
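
The HITS idea above can be sketched in a few lines of Python. This is a minimal illustration of the standard hub/authority iteration, not code from the paper; the function names `hits` and `normalize` and the dictionary-based link graph are assumptions made for the example.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an all-zero vector)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """links maps each page to the pages it links to (hypothetical input format)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score of a page: sum of the authority scores of pages it links to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub
```

A page with many inlinks from good hubs ends up with a high authority score; a page with no inlinks gets an authority score of zero.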

5
  • The DSE algorithm (Data-rich Section Extraction)
  • Basic idea
  • Pages on the same site are similar or identical
    in layout (same CMS, same style)
  • Basic method
  • Use structural information to identify the
    basic layout.
  • Find neighboring pages on the same site and
    compare them.

6
  • What is the data-rich section of an HTML page?
  • Both sites share a similar layout
  • The key content is in the lower right section

7
  • 3 phases
  • 1. Discover a set of sample pages that
    are similar to the target page
  • 2. Parse these HTML pages and convert them into
    tag trees
  • 3. Compare the target page tree with the sample
    page tree to identify their common parts. The
    difference is the data-rich section

8
  • Phase 1: Discovering sample URLs
  • US(i, j), the URL similarity, estimates the
    similarity of two pages i and j
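
The slide does not give the formula for US(i, j), so the following is only a hypothetical stand-in to show what a URL-based similarity could look like. It assumes that pages on the same host sharing a long common directory prefix are likely generated from the same template; the function name `url_similarity` and the scoring rule are assumptions, not the paper's definition.

```python
from urllib.parse import urlsplit


def url_similarity(url_a, url_b):
    """Hypothetical score in [0, 1]: shared leading path segments on the same host."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if a.netloc.lower() != b.netloc.lower():
        return 0.0  # different sites are treated as dissimilar
    seg_a = [s for s in a.path.split("/") if s]
    seg_b = [s for s in b.path.split("/") if s]
    common = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        common += 1
    longest = max(len(seg_a), len(seg_b))
    # Two root URLs on the same host count as fully similar.
    return 1.0 if longest == 0 else common / longest
```

Candidate pages could then be ranked by this score and the highest-scoring ones kept as sample pages.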

9
  • Phase 2: Tree creation
  • The target page and the sample page are
    parsed.
  • The HTML page's layout is converted into a
    tree-like structure (DOM)
  • Unimportant tags are ignored
  • FONT, SMALL, H1, H6
  • Unimportant attributes (like BACKGROUND) are
    ignored to avoid unnecessary computations and
    comparisons
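
A minimal sketch of the tree creation step, using Python's standard `html.parser`. The ignored tags and attributes are the ones named on the slide; the `Node` and `TagTreeBuilder` classes, the `build_tree` helper, and the list of void tags are assumptions made for this illustration.

```python
from html.parser import HTMLParser

IGNORED_TAGS = {"font", "small", "h1", "h6"}   # tags listed on the slide
IGNORED_ATTRS = {"background"}                 # attribute listed on the slide
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never have an end tag


class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []
        self.text = ""


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            return  # skip the tag; its text falls through to the parent node
        kept = [(k, v) for k, v in attrs if k not in IGNORED_ATTRS]
        node = Node(tag, kept)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS or tag in VOID_TAGS:
            return
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].text += text


def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Because ignored tags are skipped rather than dropped with their contents, the text inside a FONT element still ends up in the enclosing node.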

10
  • Phase 3: Tree matching
  • Given two DOM trees (one representing the target
    page and one the sample page), the similar
    structures have to be matched
  • The two trees are traversed in depth-first
    order and compared node by node
  • The parts of the tree that don't match are the
    data-rich sections
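
The matching step can be sketched as a recursive depth-first comparison. This is a simplified illustration, not the paper's matching procedure: it assumes each node is a `(tag, text, children)` tuple and that children are aligned by position, which a real implementation would have to relax. The function name `find_differences` is hypothetical.

```python
def find_differences(target, sample):
    """Return the subtrees of target that have no matching counterpart in sample."""
    if sample is None:
        return [target]  # target has an extra subtree
    t_tag, t_text, t_children = target
    s_tag, s_text, s_children = sample
    if t_tag != s_tag:
        return [target]  # structure differs at this node
    if not t_children and not s_children:
        # Leaf nodes: identical text is shared template, different text is data.
        return [] if t_text == s_text else [target]
    diffs = []
    for i, child in enumerate(t_children):
        counterpart = s_children[i] if i < len(s_children) else None
        diffs.extend(find_differences(child, counterpart))
    return diffs
```

For two pages that share a menu but carry different stories, only the story subtree of the target page is reported.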

12
  • Applying DSE to HITS
  • 28 queries are used
  • Each query is sent to the Google search
    engine, requesting the first 200 results
  • The result pages are added to the root set
  • Each of the 200 results is sent to Google again to
    retrieve at most 100 inlinks pointing to the
    result page; these are also added to the root set.
  • The root set ranges from 975 to 6,776 nodes
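
The root-set construction described above could be sketched as follows. The `search` and `inlinks` callables are hypothetical placeholders for whatever search interface is used (the slide only says Google was queried); they are assumed to return lists of URLs.

```python
def build_root_set(query, search, inlinks, max_results=200, max_inlinks=100):
    """Collect the top results for a query plus a bounded number of inlinks each.

    search(query) and inlinks(url) are caller-supplied functions (assumptions
    for this sketch) that return lists of URLs.
    """
    root = set()
    for url in search(query)[:max_results]:
        root.add(url)
        root.update(inlinks(url)[:max_inlinks])
    return root
```

Since many results share inlinks, the set is usually much smaller than the upper bound of 200 + 200 × 100 URLs, which is consistent with the reported range.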
