Title: Data-Extraction Ontology Generation by Example
1. Data-Extraction Ontology Generation by Example
Yuanqiu (Joe) Zhou
Data Extraction Group
Brigham Young University
Sponsored by NSF
2. Motivation
- Semi-structured Web data need to be extracted for further manipulation.
- In contrast to other wrapper-generation techniques, the BYU ontology-based data-extraction technique is resilient.
- A by-example approach helps ordinary users generate ontologies easily.
3. Web-based System GUI
4. Architecture
[Architecture diagram; component labels: Data Frame Library, Sample Pages, Ontology Generator, User Defined Form, System GUI, Extraction Engine, Test Pages, Populated Database]
5. Extraction Ontology
- Object and Relationship Sets and Constraints
- Extraction Patterns
- Keywords
- Context Expressions
6. Ontology Generation: Object and Relationship Sets and Constraints
7. Ontology Generation: Object and Relationship Sets and Constraints
[Diagram with elements labeled A, B, B1, and B2]
8. User Created Form
Object and Relationship Sets and Constraints
DigitalCamera [-> object]
DigitalCamera [0:1] Brand [1:*]
DigitalCamera [0:1] Model [1:*]
DigitalCamera [0:1] CCDResolution [1:*]
DigitalCamera [0:1] ImageResolution [1:*]
DigitalCamera [0:1] Zoom [1:*]
Zoom [0:1] DigitalZoom [1:*]
Zoom [0:1] OpticalZoom [1:*]
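The step from a user-created form to these declarations can be sketched in Python. This is only an illustration, not the tool's actual implementation: the nested-dict form representation and the name `form_to_declarations` are hypothetical. It assumes every form field is single-valued and that participation constraints are written as [0:1] and [1:*].

```python
def form_to_declarations(root, form):
    """Flatten a nested form into object/relationship-set declarations."""
    lines = [f"{root} [-> object]"]  # the form's root is the object of interest

    def walk(parent, fields):
        for name, children in fields.items():
            # Assumption: each field is single-valued for its parent.
            lines.append(f"{parent} [0:1] {name} [1:*]")
            if children:
                walk(name, children)  # nested form fields become nested sets

    walk(root, form)
    return lines


# Hypothetical encoding of the digital-camera form: field -> nested fields.
camera_form = {
    "Brand": {},
    "Model": {},
    "CCDResolution": {},
    "ImageResolution": {},
    "Zoom": {"DigitalZoom": {}, "OpticalZoom": {}},
}

declarations = form_to_declarations("DigitalCamera", camera_form)
print("\n".join(declarations))
```

Nesting in the form maps directly to relationships between the parent and child object sets, which is why Zoom appears both as a child of DigitalCamera and as a parent of DigitalZoom and OpticalZoom.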
9. Ontology Generation: Extraction Patterns
- Data Frame Library
  - Lexicons
  - Synonym dictionaries or thesauri
  - Regular expressions
- Matching extraction patterns
  - Only one (bingo!)
  - More than one (use extraction pattern filters)
  - No matching extraction pattern (create one)
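The three matching cases can be sketched as a lookup against a small pattern library. The library contents, the pattern names, and the functions `matching_patterns` and `next_step` are all hypothetical; the slides do not specify how the filters or the pattern-creation step work, so the sketch only reports which case applies.

```python
import re

# Hypothetical mini data-frame library: pattern name -> regular expression.
LIBRARY = {
    "DecimalNumber": r"\d+\.\d+",
    "Number": r"\d+(\.\d+)?",
    "Year": r"(19|20)\d{2}",
}


def matching_patterns(samples, library=LIBRARY):
    """Return names of library patterns that fully match every sample value."""
    return [
        name
        for name, regex in library.items()
        if all(re.fullmatch(regex, s) for s in samples)
    ]


def next_step(samples, library=LIBRARY):
    """Map the number of matching patterns to one of the three cases."""
    matches = matching_patterns(samples, library)
    if len(matches) == 1:
        return "use", matches      # only one match: take it
    if len(matches) > 1:
        return "filter", matches   # several matches: filters must choose
    return "create", matches       # no match: a new pattern is needed


print(next_step(["4"]))
print(next_step(["3.34", "4.0"]))
print(next_step(["Nikon"]))
```

Requiring a full match on every sample value is a design choice of this sketch; it keeps overly loose patterns from being selected on the strength of a partial hit.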
10. Ontology Generation: Keywords
- Features a high-quality 4.0 Megapixel Resolution CCD
- The new Nikon Coolpix 995 boasts of a 3.34 Megapixel CCD
- 3 effective megapixel
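The slides do not say how keywords are derived from such samples. One plausible heuristic, sketched here under that assumption, is to collect the words that appear near the highlighted value in more than one sample. The window size, the stopword list, and the name `candidate_keywords` are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "of", "the"}  # hypothetical minimal stopword list
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z][A-Za-z-]*")


def candidate_keywords(samples, window=3, min_samples=2):
    """Return words found near the highlighted value in >= min_samples samples."""
    counts = Counter()
    for text, value in samples:
        tokens = TOKEN.findall(text)
        i = tokens.index(value)  # position of the highlighted value
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Count each word once per sample, ignoring numbers and stopwords.
        words = {t.lower() for t in nearby if t[0].isalpha()}
        counts.update(words - STOPWORDS)
    return [w for w, n in counts.most_common() if n >= min_samples]


# Sample texts from the slide, paired with the value a user would highlight.
SAMPLES = [
    ("Features a high-quality 4.0 Megapixel Resolution CCD", "4.0"),
    ("The new Nikon Coolpix 995 boasts of a 3.34 Megapixel CCD", "3.34"),
    ("3 effective megapixel", "3"),
]

keywords = candidate_keywords(SAMPLES)
print(keywords)
```

Counting per sample rather than per occurrence keeps a word repeated within one page from outweighing a word shared across pages.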
11. Ontology Generation: Context Expressions
- 3.5x optical zoom (2.5x digital)
- a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom
- optical 3X /digital 6X zoom
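These samples show why a bare number pattern is not enough: each line contains both an optical and a digital zoom value. A context expression ties the value to its surrounding text. The regular expression below is an illustrative sketch written for these three samples, not the expression used in the ontology, and the name `optical_zoom` is hypothetical.

```python
import re

# Illustrative context expression: the value must sit directly beside "optical".
OPTICAL_ZOOM = re.compile(
    r"(?:\boptical\s+(?P<after>\d(?:\.\d)?)x\b"    # e.g. "optical 3X"
    r"|\b(?P<before>\d(?:\.\d)?)x\s+optical\b)",   # e.g. "3.5x optical"
    re.IGNORECASE,
)


def optical_zoom(text):
    """Return the optical zoom factor as a string, or None if none is found."""
    m = OPTICAL_ZOOM.search(text)
    if m is None:
        return None
    return m.group("after") or m.group("before")


for sample in [
    "3.5x optical zoom (2.5x digital)",
    "a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom",
    "optical 3X /digital 6X zoom",
]:
    print(sample, "->", optical_zoom(sample))
```

Only the number is captured; the surrounding words serve as context and are discarded, which mirrors the split between what is extracted and what merely locates it.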
12. Extraction Ontology
DigitalCamera [-> object]
DigitalCamera [0:1] Brand [1:*]
DigitalCamera [0:1] ImageResolution [1:*]
DigitalCamera [0:1] Zoom [1:*]
DigitalCamera [0:1] CCDResolution [1:*]
Zoom [0:1] OpticalZoom [1:*]

Brand matches [10]
  constant
    extract "\bNikon\b",
    extract "\bCanon\b",
    extract "\bOlympus\b",
    extract "\bMinolta\b",
    extract "\bSony\b"
end

CCDResolution matches [20]
  constant
    extract "\b\d(\.\d{1,2})?\b"
  keyword "\bMegapixel\b", "\bCCD\b", "\bCCD Resolution\b"
end

OpticalZoom matches [10]
  constant
    extract "\b\d(\.\d)"
    context "\b\d(\.\d)?(x)\b"
  keyword "\boptical\b"
end
13. Measurements
- How much of the ontology was generated with respect to how much could have been generated?
- How many generated components should not have been generated?
- How do the precision and recall of data extracted with a system-generated ontology compare with those of an expert-generated ontology?
- How many sample pages are necessary for acceptable system performance?
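The precision and recall ratios in these questions follow the standard definitions: precision is the share of extracted items that are correct, and recall is the share of expected items that were extracted. A minimal sketch follows; the function name `precision_recall` and the sample field/value pairs are made up for illustration and are not results from the system.

```python
def precision_recall(extracted, expected):
    """Compute (precision, recall) for extracted items against expected ones."""
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Made-up (field, value) pairs for a single page, for illustration only.
expected = {
    ("Brand", "Nikon"),
    ("CCDResolution", "3.34"),
    ("OpticalZoom", "4"),
    ("Model", "Coolpix 995"),
}
extracted = {
    ("Brand", "Nikon"),
    ("CCDResolution", "3.34"),
    ("OpticalZoom", "6"),  # wrong value: counts against precision
}

precision, recall = precision_recall(extracted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same function can score ontology generation itself if the items compared are ontology components instead of extracted values, which corresponds to the first two questions.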
14. Contributions
- Proposes a by-example approach to semi-automatically generate data-extraction ontologies
- Constructs a Web-based tool to generate data-extraction ontologies