Title: Brigham Young University
1Source Discovery and Schema Mapping for Data
Integration
- Brigham Young University
- Li Xu
2Data Integration
Find houses with four bedrooms priced under
200,000
global schema
Mediator
source schema 2
source schema 3
source schema 1
wrappers
homes.com
realestate.com
homeseekers.com
3Problems
- How to Recognize Applicable Information Sources
for an Application? - How to Specify Mapping between the Source Schemas
and the Global Schema? - How to Reformulate User Queries?
- How to Merge Data from Heterogeneous Sources?
4Recognizing Ontology-ApplicableHTML Documents
5Application Ontology
How to specify an application?
6Applicable HTML Documents
- Multiple-Record Documents
- Single-Record Documents
- HTML Forms
How to distinguish an applicable HTML document?
7Multiple-Record Docs
8Single-Record Doc.
9HTML Forms
10Recognition Heuristics
- h1 Densities
- h2 Expected Values
- h3 Grouping
How to measure the applicability of an HTML
document for an application?
11h1 Densities
12 h2 Expected Values
ltYear0.98, Make0.93, Model0.91, Mileage0.45,
Price0.80, Feature2.10, PhoneNr1.15gt
13h3 Grouping (of 1-Max Object Sets)
14Classification Problem
- Subtasks
- Multiple Records
- Singleton Record
- Application Form
- Learning Algorithm Decision Tree C4.5
- (h10, h11, , h2, h3, Positive)
- (h10, h11, , h2, h3, Negative)
How to construct recognition rules for an
application?
15Experiments Car Ads and Obituaries
- Training Sets
- Car Ads (Yes No)
- 143 363
- 614 636
- 50 69
- Obituaries (Yes No)
- 68 135
- 50 69
- 62 135
- Test Sets
- Car Ads (40 40)
- Precision 95
- Recall 98
- F-measure 96
- Obituaries (40 40)
- Precision 95
- Recall 95
- F-measure 95
16Link Analysis
17Form Filling
18Form Filling (Cont.)
19Incorrect Positive ResponseMotorcycle
Year Make Price Mileage PhoneNr Feature
20HistoricalFigure
Deceased Name Death Date Birth Date Age Relationsh
ip Relative Name
21AutomatingSchema Mapping for Data Integration
22Schema Mapping
Color
Year
Year
Feature
Make
Make Model
Body Type
Cost
Model
Car
Style
Phone
Cost
Miles
Mileage
Source
23Schema Mappingfor Populated Schemas
- Central Idea Exploit All Data Metadata
- Matching Possibilities (Facets)
- Attribute Names
- Data-Value Characteristics
- Expected Data Values
- Data-Dictionary Information
- Structural Properties
24The Approach
- Input
- Two Graphs, S and T
- Data Instances for S and T
- Lightweight Domain Ontology
- Output
- A Source-to-Target Mapping between S and T
- Should enable translating data instances from S
to T. - Direct and Many Indirect Matches
- (t, s)
- (t, s lt ?)
- Framework
- Individual Facet Matching
- Combination of Individual Matchers
25Attribute Names
- Target and Source Attributes
- T A
- S B
- WordNet
- C4.5 Decision Tree feature selection, trained on
schemas in DB books - f0 same word
- f1 synonym
- f2 sum of distances to a common hypernym root
- f3 number of different common hypernym roots
- f4 sum of the number of senses of A and B
26WordNet Rule
27Data-Value Characteristics
- C4.5 Decision Tree
- Features
- Numeric data
- (Mean, variation, standard deviation, )
- Alphanumeric data
- (String length, numeric ratio, space ratio)
28Expected Data Values
- Concepts and Relationships
- Data Recognizers
- CarMake
- ford
- honda
-
- CarModel
- accord
- mustang
- taurus
-
Make Model
Brand Model
Ford Mustang Ford Taurus Ford F150
Legend Mustang A4
Acura Audi BMW
CarMake . CarModel
CarMake
CarModel
Target
Source
29Structure Matching
MLS
MLS
Bedrooms
Name
Basic_features
location
House
Agent
beds
SQFT
Fax
location_ description
agent
Golf course
Water front
Address
Address
name
fax
phone
Street
City
State
Target
Source
30Structure Matching (Cont.)
MLS
MLS
Bedrooms
Name
Basic_features
location
House
Agent
beds
SQFT
Fax
location_ description
agent
Golf course
Water front
Address
Address
name
fax
phone
Street
City
State
Target
Source
31Structure Matching (Cont.)
MLS
MLS
Bedrooms
Name
Basic_features
location
House
Agent
beds
SQFT
Fax
location_ description
agent
Golf course
Water front
Address
Address
name
fax
phone
Street
City
State
Target
Source
32Structure Matching (Cont.)
MLS
MLS
Bedrooms
Name
Basic_features
location
House
Agent
beds
SQFT
Fax
location_ description
agent
Golf course
Water front
Address
Address
name
fax
phone
Street
City
State
Source
Target
33Structure Matching (Cont.)
MLS
MLS
Bedrooms
Name
Basic_features
location
House
Agent
beds
SQFT
Fax
location_ description
agent
Golf course
Water front
Address
Address
name
fax
phone
Street
City
State
Source
Target
34Structure Matching (Cont.)
MLS
MLS
Bedrooms
Name
Basic_features
location
House
Agent
beds
SQFT
Fax
location_ description
agent
Golf course
Water front
Address
Address
name
fax
phone
Street
City
State
Source
Target
35House, MLS vs. MLS
MLS
MLS
Bedrooms
Basic_features
location
House
beds
SQFT
location_ description
Golf course
Water front
Address
Street
City
State
Source
Target
36House, MLS vs. MLS
MLS
MLS
Bedrooms
Basic_features
location
House
beds
SQFT
location_ description
Golf course
Water front
Address
Street
City
State
Source
Target
37House, MLS vs. MLS
MLS
MLS
Bedrooms
Basic_features
House
SQFT
House
location_ description
beds
Golf course
Water front
location
Address
Address1
Street
City
State
Source
Target
38House, MLS vs. MLS
Basic_features
MLS
MLS
Bedrooms
SQFT
location_ description
House
beds
House
location
Water front
Golf course
Water front
Golf course
Address1
Address
Street
City
State
Street1
City1
State1
Target
Source
39Agent vs. agent
Name
agent
Agent
address
Fax
name
fax
phone
Address
Street
City
State
Source
Target
40Agent vs. agent
agent
Name
name
Agent
phone
Fax
fax
address
Address2
Address
Street
City
State
Street2,
City2
State2
Source
Target
41Inter-Relationship Set
MLS
MLS
Bedrooms
Name
House
Agent
House
Fax
Golf course
Water front
Address
agent
Street
City
State
Source
Target
42Example Source-To-Target Mapping
House
MLS
name
beds
agent
City
Golf course
Street
State
fax
Water front
Address
Address1
Address2
43Target-based Integration and Query System (TIQS)
- Definition I (T, Si, Mi)
- Phases
- Design (Source-to-Target Mappings Mi)
- Query Processing (Rule Unfolding)
44Query Reformulation
- Query
- House-Bedrooms(x, 4) - House-Bedrooms(x, 4),
- House-Golf_course(x, Yes),
- House-Water_front(x, Yes)
45Query Reformulation
- Query
- House-Bedrooms(x, 4) - House-Bedrooms(x, 4),
- House-Golf_Course(x, Yes),
- House-Water_Front(x, Yes)
46TIQS (Cont.)
- User Queries
- Logic Rules
- Maximal and Sound Query Answers
- Advantages
- Rule Unfolding
- Scalability
47Experimental Results
Application (Number of Schemes) Precision () Recall () F () Number Matches Number Correct Number Incorrect
Faculty Member (5) 100 100 100 540 540 0
Course Schedule (5) 99 93 96 490 454 6
Real Estate (5) 90 94 92 876 820 92
Indirect Matches (precision 87, recall 94,
F-measure 90)
Data borrowed from Univ. of Washington DDH,
SIGMOD01
- Rough Comparison with U of W Results
- Course Schedule Accuracy 71
- Real Estate (2 tests) Accuracy 75
- Faculty Member Accuracy, 92
48Conclusion
- A Robust and Flexible Approach to Check
Applicability of HTML documents - A Composite Approach to Automate Schema Mapping
- Direct Matches
- Indirect Matches
- An Approach that Combines Advantages of Basic
Approaches to Data Integration
49Future Work
- Test More Applications and Data to Evaluate the
Approaches - Extend Training Classifiers for Applicability
Checking - Further Automating Schema Mapping
- Automate Ontology Mapping on the Semantic Web
- Automate Mapping between XML Documents
50Thanks ! Questions?