Title: Extracting Structured Data from Web Pages
1Extracting Structured Data from Web Pages
- By Arsun ARTEL, Özgün ÖZISIKYILMAZ
- 05.11.2003
- Instructor Prof. Taflan Gündem
2Presentation Outline
- Model Problem Formulation
- Experimental Results
- Conclusion
4Motivation
- There are many web sites that contain a large
collection of structured pages.
- Extracting structured data from these pages is
useful, since it enables us to pose complex
queries over the data.
- This paper focuses on the problem of
automatically extracting structured data from a
collection of pages.
6Example Pages
- In the real world there are many examples of
structured web pages.
- Amazon web site, eBay web site, etc.
- Two examples from www.amazon.com
- My System
- An Eternal Golden Braid
7Example Pages (My System 21st Century Edition)
8Example Pages (An Eternal Golden Braid)
10Underlying Problems
- Complex Schema: The schema of the information
encoded in the web pages could be very complex,
with arbitrary levels of nesting. For instance, each
book page can contain a set of authors, with each
author having a set of addresses, and so on.
- Template vs. Data: Syntactically, there is
nothing that distinguishes the text that is part
of the template from the text that is part of the
data.
11How is a page created with template?
12Basic Type, Tuples and Sets
- Basic Type: b, the basic unit of text
- Tuple: ordered list of types, <T1, T2, ..., Tn>
- Set: {T1}
<C Programming Language, {<Brian, Kernighan>,
<Dennis, Ritchie>}, 30.00>
13Schema and Instance
- <C Programming Language, {<Brian, Kernighan>,
<Dennis, Ritchie>}, 30.00>
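The three type constructors and the book instance can be written down as a small data model. This is only an illustrative sketch: the class names `Basic`, `TupleType`, `SetType` and the helper `conforms` are hypothetical names chosen here, not part of EXALG, and the set braces around the author tuples are assumed from the set constructor on the previous slide.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Basic:
    """Basic type: a single unit of text."""


@dataclass(frozen=True)
class TupleType:
    """Ordered list of types <T1, ..., Tn>."""
    items: Tuple["Type", ...]


@dataclass(frozen=True)
class SetType:
    """Set whose members all share one element type."""
    element: "Type"


Type = Union[Basic, TupleType, SetType]


def conforms(value, schema) -> bool:
    """Check whether a Python value is an instance of the given schema."""
    if isinstance(schema, Basic):
        return isinstance(value, str)
    if isinstance(schema, TupleType):
        return (isinstance(value, tuple)
                and len(value) == len(schema.items)
                and all(conforms(v, t) for v, t in zip(value, schema.items)))
    if isinstance(schema, SetType):
        return (isinstance(value, (set, frozenset))
                and all(conforms(v, schema.element) for v in value))
    return False


# Schema of a book page: <title, {<first name, last name>}, price>
BOOK_SCHEMA = TupleType((Basic(), SetType(TupleType((Basic(), Basic()))), Basic()))

# Instance from the slide, with the authors modelled as a set of tuples.
BOOK = ("C Programming Language",
        frozenset({("Brian", "Kernighan"), ("Dennis", "Ritchie")}),
        "30.00")

print(conforms(BOOK, BOOK_SCHEMA))
```

A value that has the wrong shape, such as a two-element tuple, does not conform to `BOOK_SCHEMA`.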
14Template Definition
- Own example
- Schema S = <b, b, b>
- Template TS = <A B E C D>
- A = Title, B = Presented by, C =
Cost, D = , E = and
- Instance of TS
- Title Extracting Structured Data Presented by
Arsun and Özgün Cost 1hr
15Template
17General Description of EXALG
18Multiple Pages
19Correct Solution for those pages
20Some Terminology (1)
- The occurrence-vector of a token t is defined as
the vector <f1, f2, ..., fn>, where fi is the number of
occurrences of t in the ith page.
- An equivalence class is a maximal set of tokens
having the same occurrence-vector.
- A token is said to have a unique role if all the
occurrences of the token in the pages are
generated by a single template-token.
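The two definitions above can be sketched in a few lines. The pages below are invented toy data, each page is assumed to be already tokenized into a list of strings, and the function names are hypothetical.

```python
from collections import defaultdict


def occurrence_vectors(pages):
    """Map each token t to <f1, ..., fn>, where fi counts t in page i."""
    tokens = {t for page in pages for t in page}
    return {t: tuple(page.count(t) for page in pages) for t in tokens}


def equivalence_classes(pages):
    """Group all tokens that share the same occurrence-vector."""
    groups = defaultdict(set)
    for token, vector in occurrence_vectors(pages).items():
        groups[vector].add(token)
    return dict(groups)


# Two toy pages generated from the same (assumed) template.
pages = [
    ["<html>", "Book", "Emma", "Author", "Austen", "</html>"],
    ["<html>", "Book", "Ulysses", "Author", "Joyce", "</html>"],
]

for vector, tokens in equivalence_classes(pages).items():
    print(vector, sorted(tokens))
```

Here the template tokens end up together in the class with vector (1, 1), while the data tokens of each page fall into the classes (1, 0) and (0, 1).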
21Some Terminology (2)
22Some Terminology (3)
- For real pages, an equivalence class of large
size and support is usually valid, where the support
of a token is defined as the number of pages in
which the token occurs.
- Example of an invalid equivalence class:
- {Data, Mining, Jeff, 2, Jane, 6} has occurrence
vector <0, 1, 0, 0>
23Some Terminology (4)
- The equivalence classes with large size and
support are called LFEQs (for Large and Frequent
EQuivalence classes). LFEQs are rarely formed by
chance.
- Thresholds for size and support are set by the user
(SizeThres, SupThres).
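A minimal sketch of the LFEQ filter, assuming equivalence classes are given as a mapping from occurrence-vector to token set. Whether the thresholds are inclusive is not stated on the slide, so `>=` is an assumption here, as are the function name and the toy classes (apart from the invalid class taken from the earlier terminology slide).

```python
def find_lfeqs(classes, size_thres, sup_thres):
    """Keep the classes whose size and support reach the user thresholds."""
    lfeqs = {}
    for vector, tokens in classes.items():
        # Support: number of pages in which the tokens of the class occur.
        support = sum(1 for f in vector if f > 0)
        if len(tokens) >= size_thres and support >= sup_thres:
            lfeqs[vector] = tokens
    return lfeqs


# Classes over four pages, keyed by occurrence-vector.
classes = {
    (1, 1, 1, 1): {"<html>", "<body>", "Title", "</body>", "</html>"},
    (0, 1, 0, 0): {"Data", "Mining", "Jeff", "2", "Jane", "6"},
    (2, 1, 3, 1): {"<li>", "</li>"},
}

print(find_lfeqs(classes, size_thres=3, sup_thres=3))
```

The class {Data, Mining, Jeff, 2, Jane, 6} is large but occurs in only one page, so it is rejected on support; the `<li>` class is frequent but too small.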
24Some Terminology(5)
- Valid equivalence class properties: Ordering and
Nesting
- Back to own example
- Template TS = <A B E C D>
- A = Title, B = Presented by, C =
Cost, D = , E = and
- Ordered: A > B > C > D
- Nesting: B > E > C
25Important Observations
- In practice, two page-tokens with different
occurrence-paths have different roles
(html-parser)
- Two page-tokens having the same occurrence-path, but
with different neighbours, also have different
roles
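The first observation can be illustrated with Python's standard `html.parser` module. This sketch covers only occurrence-paths, not neighbours; it assumes well-nested markup without void tags such as `<br>`, and the class name `PathTokenizer` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class PathTokenizer(HTMLParser):
    """Collect (word, occurrence-path) pairs for the text of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (word, path) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():
            self.tokens.append((word, path))


def tokens_with_paths(html):
    parser = PathTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = "<html><body><h1>Reviews</h1><p>Reviews are mixed</p></body></html>"
for word, path in tokens_with_paths(page):
    print(word, path)
```

The word "Reviews" appears twice, once under `html/body/h1` and once under `html/body/p`, so the two occurrences would be treated as tokens with different roles.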
26Explanation of observations
27Modules and their operations
28Constructing Template (1)
- The extraction algorithm determines the positions
between consecutive tokens of an equivalence
class that are non-empty.
- A position between two consecutive tokens is
empty if the two tokens always occur
contiguously, and non-empty otherwise.
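The empty / non-empty test can be sketched as follows, under the simplifying assumption that every token of the equivalence class occurs exactly once per page and that the class is given in page order. The function name and the toy pages are hypothetical.

```python
def classify_positions(pages, eq_tokens):
    """Label each position between consecutive class tokens.

    A position is "empty" if the two tokens are adjacent in every page,
    and "non-empty" otherwise.
    """
    result = []
    for left, right in zip(eq_tokens, eq_tokens[1:]):
        contiguous = all(
            page.index(right) == page.index(left) + 1 for page in pages
        )
        result.append((left, right, "empty" if contiguous else "non-empty"))
    return result


pages = [
    ["<html>", "<body>", "Title:", "Emma", "</body>", "</html>"],
    ["<html>", "<body>", "Title:", "Ulysses", "</body>", "</html>"],
]
eq_tokens = ["<html>", "<body>", "Title:", "</body>", "</html>"]

for left, right, label in classify_positions(pages, eq_tokens):
    print(left, right, label)
```

Only the position between `Title:` and `</body>` is non-empty, which is where the page-specific data sits.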
29Constructing Template (2)
- The tokens connected by empty positions belong to
the template.
- In the non-empty positions, there are either
basic types (strings extracted from the database) or
a more complex type.
- This unknown type can be determined by inspecting
the input pages.
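For the simplest case, where every non-empty position holds a basic type, extraction reduces to reading the text between consecutive template tokens. The sketch below assumes exactly that case (no nested sets or tuples) and that each template token occurs once per page; `extract_fields` and the sample page are hypothetical.

```python
def extract_fields(page, template_tokens):
    """Return the strings found between consecutive template tokens."""
    fields = []
    for left, right in zip(template_tokens, template_tokens[1:]):
        start = page.index(left) + 1
        end = page.index(right, start)
        if end > start:  # non-empty position: treat its content as one string
            fields.append(" ".join(page[start:end]))
    return fields


page = ["<html>", "Title:", "My", "System", "Cost:", "30.00", "</html>"]
template_tokens = ["<html>", "Title:", "Cost:", "</html>"]

print(extract_fields(page, template_tokens))
```

A position that holds a more complex type would need the same analysis applied recursively to the tokens inside it, which this sketch does not attempt.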
30Constructing Template(3)
32Experimental Results (1)
- This project is compared mainly with
RoadRunner; however, RoadRunner makes simplifying
assumptions.
- The first 6 web pages are obtained from the
RoadRunner site.
- The last three web pages have a more complex
structure.
33Experimental Results(2)
35Concluding Remarks
- EXALG first discovers the unknown template that
generated the pages and then uses the discovered
template to extract the data from the input
pages.
- Besides getting very good results, EXALG does not
fail completely: it still extracts data even when
some of its assumptions are not met
by the input collection.
- No human intervention: the template and the data
are obtained automatically.
36Future Work
- Automatically locate collections of pages that
are structured
- Check whether it is feasible to generate a
large database from these pages
37Questions & Answers