Title: Generating Data-Extraction Ontologies By Example
1Generating Data-Extraction Ontologies By Example
- Joe Zhou
- Data Extraction Group
- Brigham Young University
2Background
- The World Wide Web contains a huge amount of useful
information.
- Web data extraction is necessary for querying the
data of interest.
- Most wrappers generate extraction patterns
based on delimiters or HTML tags, so they are
source-dependent.
- The BYU ontology-based technique is resilient.
3Problem and Solution
- The BYU Onto approach requires that ontology experts
generate data-extraction ontologies for the
domains of interest to ordinary users.
- A principal effort of our research is to automate
the ontology-generation process as much as possible.
- We developed a system, OntoByE (Ontology By
Example), to generate data-extraction ontologies
semi-automatically.
4Extraction Ontology
- Object sets, Relationship sets and Constraints
- Data frames for Lexical Object Sets
5Extraction Ontology
Object sets, Relationship sets and Constraints
Data frame for Digital Zoom
6OntoByE System Overview and Architecture
7OntoByE User Interface
8Form Editor Basic Form Elements
9Form Editor Nesting Forms
10Form Editor Creating Forms for Digital Camera
Application
11Training Web Document Preparation
12Ontology Generator Workflow
Data Frames
Object Sets, Relationship Sets and Constraints
Extraction Ontology
13Ontology Generator Form Analyzer
Sample Form
Object and Relationship Sets and Constraints
- BaseForm 01 A 1
- BaseForm 03 B 1
- BaseForm 0 C 1
- BaseForm 03 D1 1 D2 1 D3 1
- BaseForm 0 E1 1 E2 1 E3 1
14Ontology Generator Form Analyzer
- Object and Relationship Sets and Constraints
- Digital Camera application Forms
15Ontology Generator Context Phrase Locator
16Ontology Generator Data Frame Matcher
- Data Frame Matching Heuristics
- Number of matched data
- Data Frame Ranking Heuristics
- Number of matched data
- Keywords and/or Contexts Matching
- Order of Specialization/Generalization
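The heuristics listed above can be illustrated with a minimal sketch. It is not the OntoByE implementation: the `DataFrame` class, the `generality` field, and the `rank_data_frames` function are hypothetical names introduced here, and the tie-breaking order (matched values first, then keyword hits, then most specialized frame first) is an assumption about how the three ranking criteria might be combined.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrame:
    """Hypothetical stand-in for a data frame in the library."""
    name: str
    value_pattern: str                      # regular expression for values
    keywords: list = field(default_factory=list)
    generality: int = 0                     # assumed: 0 = most specialized


def rank_data_frames(frames, marked_values, context_text=""):
    """Return names of frames that match at least one marked value,
    best candidate first."""
    context = context_text.lower()
    scored = []
    for frame in frames:
        # Matching heuristic: number of marked values the frame recognizes.
        matched = sum(
            1 for v in marked_values if re.fullmatch(frame.value_pattern, v)
        )
        if matched == 0:
            continue
        # Ranking heuristic: keywords of the frame found in the context.
        keyword_hits = sum(1 for k in frame.keywords if k.lower() in context)
        # Negated counts so that larger counts sort first; a lower
        # generality (more specialized frame) wins remaining ties.
        scored.append((-matched, -keyword_hits, frame.generality, frame.name))
    scored.sort()
    return [name for *_, name in scored]


library = [
    DataFrame("Integer", r"[+-]?[0-9]+", generality=2),
    DataFrame("SmallPositiveInteger", r"[1-9][0-9]?", generality=1),
    DataFrame("SingleDigit", r"[0-9]", generality=0),
]
```

With marked values such as "3", "4", and "5", all three frames match every value, so the most specialized frame is suggested first; adding a value such as "12" moves SingleDigit to the end because it matches fewer values.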
17Ontology Generator Keyword and Context
Recognizer
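One way to picture keyword recognition from user-marked data is sketched below. This is an illustrative assumption, not the component's actual algorithm: the function name `candidate_keywords`, the `window` and `min_support` parameters, and the idea of counting words that recur just before marked values are all hypothetical.

```python
import re
from collections import Counter


def candidate_keywords(samples, window=3, min_support=2):
    """Suggest keywords from (page_text, marked_value) pairs.

    A word is suggested when it appears among the `window` words
    immediately before the marked value in at least `min_support` samples.
    """
    counts = Counter()
    for text, value in samples:
        position = text.find(value)
        if position < 0:
            continue  # the marked value does not occur in this text
        words_before = re.findall(r"[A-Za-z]+", text[:position].lower())
        # Count each word at most once per sample.
        counts.update(set(words_before[-window:]))
    return sorted(w for w, c in counts.items() if c >= min_support)


samples = [
    ("Digital zoom: 4x", "4x"),
    ("Price $299. Digital Zoom 2x", "2x"),
    ("Zoom (digital) up to 3x", "3x"),
]
print(candidate_keywords(samples))
```

A user could then accept or reject the suggested words before they are stored as keywords in a new data frame.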
18Ontology Generator Data Frame Editor
19Extraction Ontology
20Experimental Preparation
- Selected two domains of interest
- Digital Camera Application and Apartment Rental
Application
- Constructed an initial data frame library
- Integer (any integer value), SmallPositiveInteger
(from 1 to 99), SingleDigit (from 0 to 9),
RealNumber (any real value), SmallPositiveReal
(from 0.01 to 99.99), Date, Email, PhoneNumber,
and Price
- Created application-dependent forms for each
application
- Collected 5 sample pages from different web sites
for each domain
- Marked desired data on sample pages
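The numeric entries of the initial data frame library can be sketched as a pattern plus an optional value range. This is a hedged illustration only: the `NumericFrame` class and `matching_frames` function are hypothetical, the regular expressions are assumptions (for example, RealNumber is assumed here to require a decimal point), and Date, Email, PhoneNumber, and Price are omitted because the slides do not give their definitions.

```python
import re
from dataclasses import dataclass
from typing import Optional

INTEGER = r"[+-]?[0-9]+"           # assumed pattern for integer values
REAL = r"[+-]?[0-9]+\.[0-9]+"      # assumed pattern: decimal point required


@dataclass(frozen=True)
class NumericFrame:
    """Hypothetical numeric data frame: a pattern and an optional range."""
    name: str
    pattern: str
    low: Optional[float] = None
    high: Optional[float] = None

    def accepts(self, text: str) -> bool:
        if not re.fullmatch(self.pattern, text):
            return False
        value = float(text)
        if self.low is not None and value < self.low:
            return False
        if self.high is not None and value > self.high:
            return False
        return True


# Ranges follow the slide; everything else is an assumption.
NUMERIC_LIBRARY = [
    NumericFrame("Integer", INTEGER),
    NumericFrame("SmallPositiveInteger", INTEGER, 1, 99),
    NumericFrame("SingleDigit", r"[0-9]", 0, 9),
    NumericFrame("RealNumber", REAL),
    NumericFrame("SmallPositiveReal", REAL, 0.01, 99.99),
]


def matching_frames(text: str) -> list:
    """Names of all library frames that accept the given string."""
    return [f.name for f in NUMERIC_LIBRARY if f.accepts(text)]
```

Keeping the range check separate from the regular expression avoids encoding bounds such as 0.01 to 99.99 in the pattern itself.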
21Experimental Results Digital Camera Application
22Experimental Results Digital Camera Application
23Experimental Results Apartment Rental
Application
24Experimental Results Apartment Rental
Application
25Experimental Results Apartment Rental
Application
26Experimental Observations Strengths of OntoByE
- OntoByE provides a friendly and intuitive
interface to help ordinary users describe data of
interest without exposing them to abstract
ontology concepts.
- With a small initial data frame library and a
small set of sample pages, OntoByE works well to
search for and suggest appropriate existing data
frames for object sets with application-independent
values.
- OntoByE successfully recognizes possible keywords
and contexts for user-marked data from sample
pages and helps users create new data frames
with the keywords and contexts.
27Experimental Observations Limitations of OntoByE
- The performance of searching for or constructing
data frames by OntoByE is limited by the scope
and the quality of prior knowledge.
- The accuracy and completeness of keyword and
context expression construction are limited by
the number and representativeness of user samples.
- Constructing value expressions for
application-dependent data frames requires that
users know how to write regular expressions.
28Conclusion
- We implemented a user-friendly interface for
ordinary users to take advantage of our
ontology-based web data-extraction approach.
- We developed a framework for interacting with
ordinary users to generate ontologies by example.
- Our experiments demonstrate that OntoByE works
well to generate ontologies with the assistance of
limited prior knowledge. As prior knowledge
expands over time, we expect OntoByE to
achieve better performance.
29Future Work
- Have OntoByE learn to build application-dependent
lexicons for users' applications
- Improve the sub-components of the back-end
ontology generator, e.g., the Context Phrase Locator