Title: Conceptual-Model-Based Web Data Extraction by Example
1Conceptual-Model-Based Web Data Extraction by Example
Yuanqiu (Joe) Zhou
Data Extraction Group
Brigham Young University
Sponsored by NSF
2Motivation
- Data-rich Websites in abundance
- Conceptual-Model-Based Methodology is resilient
- By Example approach is user-friendly
3By Example Approach
- Web users specify desired information by creating a form
- Users collect sample pages on the Web
- An ontology generator learns the task by analyzing the form and the sample pages
- Interactions may be needed to improve or complete the ontology
4Architecture
Data Frame Libraries
Sample Pages
Ontology Generator
User Created Form
GUI
Extraction Engine
Target Pages
Populated Database
5Sample Web Page
User Created Form
- Brand: Canon
- Model: PowerShot G2
6Extraction Ontology
- Relationship Set and Constraints
- Extraction Patterns
- Keywords
- Context Expressions
7Relationship Set and Constraints
- DigitalCamera - object
- DigitalCamera 0:1 has Brand 1
- DigitalCamera 0:1 has Model 1
- DigitalCamera 0:1 has CCDResolution 1
- DigitalCamera 0:1 has ImageResolution 1
- DigitalCamera 0:1 has OpticalZoom 1
- DigitalCamera 0:1 has DigitalZoom 1
- Primary Object Name: DigitalCamera
- Other Object Names: Brand, Model, CCDResolution, ImageResolution, OpticalZoom, DigitalZoom
- Participation Constraints: the 0:1 and 1 annotations on each relationship set
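The declarations above can be pictured as a small data structure. The following is a minimal sketch in Python, not the ontology language used on the slides. The names `RelationshipSet`, `min_count`, `max_count`, and `violations` are hypothetical, and the sketch assumes that 0:1 means a DigitalCamera has at most one value for each of the other objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipSet:
    """One 'DigitalCamera 0:1 has X' declaration (illustrative names)."""
    primary: str    # primary object name, e.g. "DigitalCamera"
    other: str      # other object name, e.g. "Brand"
    min_count: int  # lower participation bound on the primary side
    max_count: int  # upper participation bound on the primary side


OTHER_OBJECTS = ["Brand", "Model", "CCDResolution",
                 "ImageResolution", "OpticalZoom", "DigitalZoom"]

# Every relationship set on the slide uses the same 0:1 constraint.
RELATIONSHIPS = [RelationshipSet("DigitalCamera", name, 0, 1)
                 for name in OTHER_OBJECTS]


def violations(record):
    """Return the object names whose value count breaks the 0:1 bound.

    `record` maps an object name to the list of values extracted for it.
    """
    bad = []
    for rel in RELATIONSHIPS:
        count = len(record.get(rel.other, []))
        if not (rel.min_count <= count <= rel.max_count):
            bad.append(rel.other)
    return bad


print(violations({"Brand": ["Canon"], "Model": ["PowerShot G2"]}))
print(violations({"Brand": ["Canon", "Nikon"]}))
```

The first call reports no violations; the second flags Brand, because two brands were found for a single camera.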
11Extraction Patterns
From Data Frame Libraries
- Data Frame Libraries
  - Lexicons
  - Synonym Dictionary
  - Regular Expressions
- Extraction Pattern
  - Lexicons for Brand and Model
  - Regular Expressions for numbers and image resolution
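A rough illustration of how a sample value from the user's form might be paired with an entry in a data frame library. This is a sketch under stated assumptions, not the generator's actual algorithm: the dictionaries `LEXICONS` and `REGEX_FRAMES` and the function `candidate_frames` are hypothetical, the Brand entries are the ones listed in the Brand rule later in the deck, and no Model lexicon is included because the slides do not show one.

```python
import re

# Hypothetical miniature data frame library.
LEXICONS = {
    "Brand": ["Nikon", "Canon", "Olympus", "Minolta", "Sony"],
}
REGEX_FRAMES = {
    "DecimalNumber": r"\b\d(\.\d{1,2})?\b",
    "ImageResolution": r"\b\d{4}(\s)?(x)(\s)?\d{4}\b",
}


def candidate_frames(sample_value):
    """Return the names of library entries that recognize the whole value."""
    lowered = sample_value.lower()
    # Lexicon entries match by case-insensitive word lookup.
    hits = [name for name, words in LEXICONS.items()
            if lowered in (w.lower() for w in words)]
    # Regular-expression entries must match the entire sample value.
    hits += [name for name, pattern in REGEX_FRAMES.items()
             if re.fullmatch(pattern, sample_value, flags=re.IGNORECASE)]
    return hits


for value in ["Canon", "4.0", "2272 x 1704", "PowerShot G2"]:
    print(value, "->", candidate_frames(value))
```

A value that no entry recognizes, such as "PowerShot G2" here, returns an empty list; that is the kind of case where user interaction may be needed to complete the ontology.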
12Extraction Patterns
Data Frame Libraries
- Features a high-quality 4.0 Megapixel Resolution CCD
- The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD
- 3 effective megapixel
CCDResolution matches 20 constant
  extract "\b\d(\.\d{1,2})?\b"
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b"
13Keywords
- Features a high-quality 4.0 Megapixel Resolution CCD
- The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD
- 3 effective megapixel
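Each sentence above places a number next to a keyword such as Megapixel or CCD. The sketch below, in Python, shows one way keywords might separate the right number from other digits on a page. The nearest-keyword rule, the case-insensitive keyword matching, and the function name `ccd_resolution` are assumptions; the slides do not say how the extraction engine weighs keywords. The patterns are the CCDResolution extract and keyword expressions, with the `{1,2}` quantifier written out.

```python
import re

EXTRACT = re.compile(r"\b\d(\.\d{1,2})?\b")
KEYWORDS = [re.compile(p, re.IGNORECASE)
            for p in (r"\bMegapixel\b", r"\bCCD\b", r"\bResolution\b")]


def ccd_resolution(text):
    """Return the extract match closest to any keyword match, or None."""
    keyword_spans = [m.span() for k in KEYWORDS for m in k.finditer(text)]
    candidates = list(EXTRACT.finditer(text))
    if not keyword_spans or not candidates:
        return None

    def gap(match):
        # Characters between the candidate and its nearest keyword.
        return min(max(ks - match.end(), match.start() - ke, 0)
                   for ks, ke in keyword_spans)

    return min(candidates, key=gap).group()


samples = [
    "Features a high-quality 4.0 Megapixel Resolution CCD",
    "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD",
    "3 effective megapixel",
    "3.5x optical zoom (2.5x digital), 4.0 Megapixel CCD",
]
for s in samples:
    print(ccd_resolution(s))
```

In the last sample the extract pattern alone also matches the leading digits of "3.5x" and "2.5x"; the keyword distance is what selects "4.0". The model number 995 is never a candidate, because the word boundaries in the pattern reject a digit embedded in a longer number.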
15Keywords
- Features a high-quality 4.0 Megapixel Resolution CCD
- The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD
- 3 effective megapixel
CCDResolution matches 20 constant
  extract "\b\d(\.\d{1,2})?\b"
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b"
16Context Expressions
- 3.5x optical zoom (2.5x digital)
- a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom
- optical 3X /digital 6X zoom
OpticalZoom matches 10 constant
  extract "\b\d(\.\d)?"
  context "\b\d(\.\d)?(x)\b"
  keyword "\boptical\b"
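The OpticalZoom rule can be tried against the three sample phrases with a short Python sketch. It assumes the context expression first locates a number followed by "x", the extract expression then pulls the number out of that context, and the keyword picks the context nearest to "optical". Case-insensitive matching (needed for "3X" and "Optical") and the function name `optical_zoom` are also assumptions, not details given on the slide.

```python
import re

EXTRACT = re.compile(r"\b\d(\.\d)?")
CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", re.IGNORECASE)
KEYWORD = re.compile(r"\boptical\b", re.IGNORECASE)


def optical_zoom(text):
    """Return the zoom factor from the context nearest to 'optical'."""
    keyword_spans = [m.span() for m in KEYWORD.finditer(text)]
    contexts = list(CONTEXT.finditer(text))
    if not keyword_spans or not contexts:
        return None

    def gap(match):
        # Characters between the context match and its nearest keyword.
        return min(max(ks - match.end(), match.start() - ke, 0)
                   for ks, ke in keyword_spans)

    best = min(contexts, key=gap)
    # Apply the extract expression inside the chosen context only.
    return EXTRACT.search(best.group()).group()


for phrase in [
    "3.5x optical zoom (2.5x digital)",
    "a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom",
    "optical 3X /digital 6X zoom",
]:
    print(optical_zoom(phrase))
```

In each phrase the digital zoom factor also satisfies the context expression, so the keyword is what separates the optical value from the digital one.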
17Extraction Ontology
DigitalCamera - object
DigitalCamera 0:1 has Brand 1
Brand matches 10 constant
  extract "\bNikon\b",
  extract "\bCanon\b",
  extract "\bOlympus\b",
  extract "\bMinolta\b",
  extract "\bSony\b"
end
DigitalCamera 0:1 has CCDResolution 1
CCDResolution matches 20 constant
  extract "\b\d(\.\d{1,2})?\b"
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b"
end
DigitalCamera 0:1 has ImageResolution 1
ImageResolution matches 20 constant
  extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b",
  extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"
  keyword "\bResolution\b", "\bImage\b"
end
DigitalCamera 0:1 has OpticalZoom 1
OpticalZoom matches 10 constant
  extract "\b\d"
  context "\b\d(x)\b"
  keyword "\boptical\b"
end
19Extraction Ontology
DigitalCamera - object
DigitalCamera 0:1 has Brand 1
Brand matches 10 constant
  extract "\bNikon\b",
  extract "\bCanon\b",
  extract "\bOlympus\b",
  extract "\bMinolta\b",
  extract "\bSony\b"
end
DigitalCamera 0:1 has CCDResolution 1
CCDResolution matches 20 constant
  extract "\b\d(\.\d{1,2})?\b"
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b"
end
DigitalCamera 0:1 has ImageResolution 1
ImageResolution matches 20 constant
  extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b",
  extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"
  keyword "\bResolution\b", "\bImage\b"
end
DigitalCamera 0:1 has OpticalZoom 1
OpticalZoom matches 10 constant
  extract "\b\d(\.\d)?"
  context "\b\d(\.\d)?(x)\b"
  keyword "\boptical\b"
end
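To see the pieces of the ontology working together, here is a minimal Python sketch of an engine that applies the Brand, CCDResolution, ImageResolution, and OpticalZoom rules to a piece of text. It is an illustration only: the `ONTOLOGY` dictionary layout, the function `extract_record`, case-insensitive matching, the nearest-keyword rule, and the choice to require a keyword whenever a rule lists any are all assumptions. The OpticalZoom extract uses the optional-fraction form from the Context Expressions slide. The input joins two sample sentences from earlier slides.

```python
import re

ONTOLOGY = {
    "Brand": {"extract": [r"\bNikon\b", r"\bCanon\b", r"\bOlympus\b",
                          r"\bMinolta\b", r"\bSony\b"]},
    "CCDResolution": {"extract": [r"\b\d(\.\d{1,2})?\b"],
                      "keyword": [r"\bMegapixel\b", r"\bCCD\b",
                                  r"\bResolution\b"]},
    "ImageResolution": {"extract": [r"\b\d{4}(\s)?(x)(\s)?\d{4}\b"],
                        "keyword": [r"\bResolution\b", r"\bImage\b"]},
    "OpticalZoom": {"extract": [r"\b\d(\.\d)?"],
                    "context": r"\b\d(\.\d)?(x)\b",
                    "keyword": [r"\boptical\b"]},
}


def _gap(span, keyword_spans):
    """Characters between a candidate span and its nearest keyword span."""
    start, end = span
    return min(max(ks - end, start - ke, 0) for ks, ke in keyword_spans)


def extract_record(text):
    record = {}
    for name, rule in ONTOLOGY.items():
        # Search inside context matches when a context is given,
        # otherwise search the whole text.
        if "context" in rule:
            regions = [m.span()
                       for m in re.finditer(rule["context"], text, re.I)]
        else:
            regions = [(0, len(text))]
        candidates = []
        for start, end in regions:
            for pattern in rule["extract"]:
                for m in re.finditer(pattern, text[start:end], re.I):
                    candidates.append((start + m.start(), start + m.end()))
        if not candidates:
            continue
        keyword_spans = [m.span() for k in rule.get("keyword", [])
                         for m in re.finditer(k, text, re.I)]
        if rule.get("keyword") and not keyword_spans:
            continue  # assumed: a listed keyword must appear somewhere
        if keyword_spans:
            best = min(candidates, key=lambda c: _gap(c, keyword_spans))
        else:
            best = candidates[0]
        record[name] = text[best[0]:best[1]]
    return record


page = ("The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD. "
        "a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom")
print(extract_record(page))
```

ImageResolution is absent from the result because the text contains no match for its extract pattern, which is consistent with the 0:1 participation constraint allowing a missing value.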
21Results (Same Site)
22Results (Different Site)
23Summary and Future Work
- The example indicates that the approach is feasible
- Some open questions need to be explored