Data-rich Section Extraction from HTML pages

About This Presentation

Slides: 15

Provided by: dbaiTu

Transcript and Presenter's Notes

Title: Data-rich Section Extraction from HTML pages

1
Data-rich Section Extraction from HTML pages

2
Data-rich Section Extraction from HTML pages
DSE Algorithm

3

Topic distillation
tries to distill a small number of high-quality
pages that are most representative of a topic.
The basic idea is that the number of links pointing
to a page offers an assessment of its popularity
and quality.
Web information extraction
tries to extract data items from web pages,
which are usually semi-structured, and return them
in a structured format.
The DSE algorithm improves both!

4

Overview
HITS Algorithm
One of the best-known topic distillation
algorithms.
Given a set of web pages about one specific
topic, the HITS algorithm calculates the
authority score of each page (an indicator of
relevance derived from links).
Basically, it looks at how many links point to
that page (similar to Google).
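
To make the scoring idea concrete, here is a minimal sketch of the standard HITS iteration in Python. The slide only mentions the authority score; the companion hub score comes from the general HITS formulation. The function names `hits` and `normalize`, the dict-based link graph, and the fixed iteration count are assumptions made for illustration.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an all-zero vector)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to.

    Returns (authority, hub) dicts. The iteration count is an assumed
    stand-in for a proper convergence test.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub
```

For a toy graph such as `{"a": ["c"], "b": ["c"], "c": ["d"]}`, page `c` ends up with the highest authority because two pages point to it.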

7

3 Phases
1. Discover a set of sample pages that are
similar to the target page.
2. Parse these HTML pages and convert them into
tag trees.
3. Compare the target page tree with the sample
page tree to identify their common parts. The
difference is the data-rich section.

9

Phase 2: Tree creation
The target page and the sample page are parsed.
Each HTML page's layout is converted into a
tree-like structure (DOM).
Unimportant tags are ignored:
FONT, SMALL, H1, H6
Unimportant attributes (like BACKGROUND) are also
ignored, to avoid unnecessary computations and
comparisons.
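
The tree-creation step could look roughly like the following sketch, built on Python's standard `html.parser` module. The ignored tags and the ignored attribute are taken from the slide; the `Node` and `TagTreeBuilder` classes, the `build_tree` helper, and the list of void tags are hypothetical choices made for this illustration, not the presenter's implementation.

```python
from html.parser import HTMLParser

IGNORED_TAGS = {"font", "small", "h1", "h6"}  # tags named on the slide
IGNORED_ATTRS = {"background"}                # attribute named on the slide
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # assumed: never closed


class Node:
    def __init__(self, tag, attrs=(), text=""):
        self.tag = tag
        self.attrs = dict(attrs)
        self.text = text
        self.children = []


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            return  # its children attach to the nearest kept ancestor
        kept = [(k, v) for k, v in attrs if k not in IGNORED_ATTRS]
        node = Node(tag, kept)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS or tag in VOID_TAGS:
            return
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text=text))


def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Dropping an ignored tag while keeping its children means that purely presentational wrappers do not create spurious differences between the two trees.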

10

Phase 3: Tree matching
Given two DOM trees (one representing the target
page and one the sample page), their similar
structures have to be matched.
The two trees are traversed in depth-first order
and compared node by node.
The parts of the tree that don't match are the
data-rich sections.
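
A simplified sketch of such a depth-first comparison is shown below. It assumes each tree is a `(label, children)` tuple, where the label is a tag name or the text of a leaf, and it pairs children greedily by label. Both the representation and the greedy pairing are assumptions for illustration; the slides do not spell out the exact matching rule.

```python
def data_rich_sections(target, sample):
    """Return the subtrees of `target` that have no counterpart in `sample`.

    Both arguments are (label, children) tuples. Children are paired
    greedily by label, in order, which is a simplification.
    """
    t_label, t_children = target
    s_label, s_children = sample
    if t_label != s_label:
        return [target]  # the whole subtree differs

    unmatched = []
    remaining = list(s_children)
    for child in t_children:
        # First not-yet-used sample child carrying the same label, if any.
        match = next((s for s in remaining if s[0] == child[0]), None)
        if match is None:
            unmatched.append(child)
        else:
            remaining.remove(match)
            # Depth-first: descend into the matched pair before moving on.
            unmatched.extend(data_rich_sections(child, match))
    return unmatched


target = ("body", [
    ("div", [("a", [("Home", [])])]),          # navigation shared by both pages
    ("table", [("tr", [("Story text", [])])]),
])
sample = ("body", [
    ("div", [("a", [("Home", [])])]),
    ("table", [("tr", [("Other story", [])])]),
])
sections = data_rich_sections(target, sample)
```

Here the shared navigation block matches in both trees, so only the leaf holding the page-specific text is reported.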

12

Applying DSE to HITS
28 queries are used.
Each query is sent to the Google search engine,
requesting the first 200 results.
The result pages are added to the root set.
Each of the 200 results is then sent to Google
again to retrieve at most 100 inlinks pointing to
the result page; these are also added to the
root set.
The root set ranges from 975 to 6,776 nodes.
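
The root-set construction described above can be sketched as follows. The `search` and `inlinks` parameters are hypothetical callables standing in for whatever search-engine access was used; no real Google API is implied. The limits of 200 results and 100 inlinks come from the slide.

```python
def build_root_set(query, search, inlinks, max_results=200, max_inlinks=100):
    """Collect result pages plus the pages linking to them.

    `search(query)` and `inlinks(url)` are assumed to return iterables
    of URLs; both are placeholders supplied by the caller.
    """
    root_set = set()
    for url in list(search(query))[:max_results]:
        root_set.add(url)
        # Add at most `max_inlinks` pages that point to this result.
        root_set.update(list(inlinks(url))[:max_inlinks])
    return root_set
```

Because the root set is a set, a page reached both as a search result and as an inlink is counted once, which is one reason the final size varies from query to query.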