Title: Data-rich Section Extraction from HTML pages
1Data-rich Section Extraction from HTML pages
- Introducing the DSE-Algorithm
- Original paper by
- Jiying Wang and Fred H. Lochovsky
- Department of Computer Science
- University of Science and Technology
- Hong Kong
- Presentation by Max Arends
2Data-rich Section Extraction from HTML pages
DSE Algorithm
- The problem
- Given a web page, find the data-rich section of the page without any additional input
- What makes this difficult?
- Decoration and advertisements
- Human-oriented HTML pages are difficult for computer programs to parse
3Data-rich Section Extraction from HTML pages
DSE Algorithm
- Topic distillation
- Tries to distill a small number of high-quality pages that are most representative of the topic.
- The basic idea is that the number of links pointing to a page offers an assessment of its popularity and quality.
- Web Information Extraction
- Tries to extract data items from web pages, which are usually semi-structured, and return them in a structured format.
- The DSE Algorithm improves both!
4Data-rich Section Extraction from HTML pages
DSE Algorithm
- Overview
- HITS Algorithm
- One of the most well-known topic distillation algorithms.
- Given a set of web pages about one specific topic, the HITS algorithm calculates an authority score for each page (an indicator of relevance derived from links)
- Basically, it looks at how many links point to a page (as Google does)
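The link-counting idea behind HITS can be illustrated with a small sketch. It assumes the standard formulation of HITS, in which each page has both an authority score and a hub score that reinforce each other; the slide only mentions the authority score. The function name `hits`, the `links` dictionary format and the fixed iteration count are illustrative choices, not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def hits(links, iterations=50):
    """Compute (authority, hub) scores.

    links: dict mapping a page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Reverse index: which pages point to a given page.
    inlinks = defaultdict(list)
    for src, targets in links.items():
        for dst in targets:
            inlinks[dst].append(src)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of the hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in inlinks[p]) for p in pages}
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score of p: sum of the authority scores of pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub


if __name__ == "__main__":
    auth, hub = hits({"a": ["c"], "b": ["c"], "c": ["d"]})
    print(max(auth, key=auth.get))
```

In this toy graph the page with the most inlinks from good hubs ends up with the highest authority score, which is the behaviour the slide describes.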
5Data-rich Section Extraction from HTML pages
DSE Algorithm
- The DSE Algorithm (Data-rich Section Extraction)
- Basic Idea
- Pages on the same site are similar or identical in layout (same CMS, same style)
- Basic method
- Use structural information to identify the basic layout.
- Find neighboring pages on the same site and compare them.
6Data-rich Section Extraction from HTML pages
DSE Algorithm
- What is the data-rich section of an HTML page?
- Both pages share a similar layout
- The key content is in the lower right section
7Data-rich Section Extraction from HTML pages
DSE Algorithm
- 3 Phases
- 1. Discover a set of sample pages that are similar to the target page
- 2. Parse these HTML pages and convert them into tag trees
- 3. Compare the target page tree with the sample page tree to identify their common parts. The difference is the data-rich section
8Data-rich Section Extraction from HTML pages
DSE Algorithm
- Phase 1: Discovering sample URLs
- The URL similarity US(i, j) estimates the similarity of two pages i and j
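The slide does not give the formula for US(i, j), so the following is only a hypothetical stand-in that shows how sample pages could be ranked by URL similarity. It assumes that pages on the same host whose paths share leading segments are likely to share a layout. The names `url_similarity` and `pick_samples` and the scoring rule are illustrative and are not the definition from the paper.

```python
from urllib.parse import urlsplit


def url_similarity(url_i, url_j):
    """Hypothetical stand-in for US(i, j); NOT the formula from the paper."""
    a, b = urlsplit(url_i), urlsplit(url_j)
    if a.netloc.lower() != b.netloc.lower():
        return 0.0  # different sites are treated as dissimilar
    seg_a = [s for s in a.path.split("/") if s]
    seg_b = [s for s in b.path.split("/") if s]
    # Count the leading path segments the two URLs have in common.
    shared = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        shared += 1
    longest = max(len(seg_a), len(seg_b))
    # The +1 accounts for the matching host, so same-site pages score > 0.
    return (shared + 1) / (longest + 1)


def pick_samples(target, candidates, k=3):
    """Return the k candidate URLs most similar to the target URL."""
    ranked = sorted(candidates,
                    key=lambda u: url_similarity(target, u),
                    reverse=True)
    return [u for u in ranked if u != target][:k]


if __name__ == "__main__":
    target = "http://example.com/news/a.html"
    print(pick_samples(target, [
        "http://example.com/news/b.html",
        "http://example.com/about.html",
        "http://other.example.org/news/a.html",
    ], k=1))
```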
9Data-rich Section Extraction from HTML pages
DSE Algorithm
- Phase 2: Tree creation
- The target page and the sample page are parsed.
- The HTML page's layout is converted into a tree-like structure (DOM)
- Unimportant tags are ignored
- FONT, SMALL, H1 to H6
- Unimportant attributes (like BACKGROUND) are ignored, to avoid unnecessary computations and comparisons
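Tree creation can be sketched with Python's standard `html.parser` module. The ignore lists below are assumptions built from the examples on the slide (the heading tags are read here as H1 through H6), and the `Node` and `TreeBuilder` classes are illustrative helpers rather than the data structures used in the paper. An ignored tag is dropped, but its content is kept and attached to the enclosing node.

```python
from html.parser import HTMLParser

# Assumption: illustrative ignore lists based on the slide's examples.
IGNORED_TAGS = {"font", "small", "h1", "h2", "h3", "h4", "h5", "h6"}
IGNORED_ATTRS = {"background"}
# Elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, attrs=None, text=""):
        self.tag = tag
        self.attrs = attrs or {}
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            return  # skip the tag; its content attaches to the parent
        kept = {k: v for k, v in attrs if k not in IGNORED_ATTRS}
        node = Node(tag, kept)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS or tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text=text))


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling `build_tree` on the target page and on each sample page yields the trees that the next phase compares.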
10Data-rich Section Extraction from HTML pages
DSE Algorithm
- Phase 3: Tree matching
- Given two DOM trees (one representing the target page and one the sample page), the similar structures have to be matched
- The two trees are traversed in depth-first order and compared node by node
- The parts of the tree that don't match are the data-rich sections
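The matching step can be sketched as a recursive depth-first comparison. This is a simplified illustration, not the matching procedure from the paper: it assumes trees are given as `(label, children)` tuples and it pairs children strictly by position, which a real implementation would probably need to relax. The function name `unmatched` is hypothetical.

```python
def unmatched(target, sample):
    """Return the subtrees of `target` that have no counterpart in `sample`.

    Each tree is a (label, children) tuple; children is a list of trees.
    Children are paired by position (a simplifying assumption).
    """
    t_label, t_children = target
    s_label, s_children = sample
    if t_label != s_label:
        return [target]  # the whole subtree differs
    result = []
    for i, child in enumerate(t_children):
        if i < len(s_children):
            # Depth-first: descend into the paired child before moving on.
            result.extend(unmatched(child, s_children[i]))
        else:
            result.append(child)  # no counterpart in the sample tree
    return result


if __name__ == "__main__":
    target = ("body", [("table", [
        ("td", [("#text:Menu", [])]),
        ("td", [("#text:Article A", [])]),
    ])])
    sample = ("body", [("table", [
        ("td", [("#text:Menu", [])]),
        ("td", [("#text:Article B", [])]),
    ])])
    print(unmatched(target, sample))
```

In this example the shared menu cell matches and is discarded, while the cell whose content differs between the two pages is reported as the candidate data-rich part.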
12Data-rich Section Extraction from HTML pages
DSE Algorithm
- Applying DSE to HITS
- 28 queries are used
- Each query is sent to the Google search engine, requesting the first 200 results
- The result pages are added to the root set
- Each of the 200 results is sent to Google again to retrieve at most 100 inlinks pointing to the result page; these are also added to the root set.
- The root set ranges from 975 to 6,776 nodes
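The root-set construction described above can be sketched as follows. The callables `search` and `inlinks` are hypothetical placeholders supplied by the caller (the slide only says the queries were sent to Google); each is assumed to return a list of URLs. Only the limits of 200 results and 100 inlinks come from the slide.

```python
def build_root_set(query, search, inlinks, max_results=200, max_inlinks=100):
    """Collect result pages and their inlinks into one root set.

    search(query) -> list of result URLs   (hypothetical, caller-supplied)
    inlinks(url)  -> list of URLs linking to url (hypothetical)
    """
    root_set = set()
    for url in search(query)[:max_results]:
        root_set.add(url)
        # Add at most `max_inlinks` pages that point to this result.
        root_set.update(inlinks(url)[:max_inlinks])
    return root_set
```

Because the root set is a set, pages that appear both as a result and as an inlink are counted once, which is one reason the final size varies from query to query.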