AnHai Doan, Pedro Domingos, Alon Halevy

Slides: 24

Provided by: zam34

Learn more at: https://pages.cs.wisc.edu


1
Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach
The LSD Project

AnHai Doan, Pedro Domingos, Alon Halevy
University of Washington

2
Data Integration
Find houses with four bathrooms priced under 500,000
[Diagram: a mediated schema connected to source schemas 1, 2, and 3; the sources shown are homes.com, realestate.com, and homeseekers.com]
3
Semantic Mappings between Schemas

Mediated and source schemas = XML DTDs

[Diagram: two house schema trees, with edges between their elements labeled "1-1 mapping" and "non 1-1 mapping"]
house
  address
  num-baths
  contact-info
    agent-name
    agent-phone
house
  location
  contact
    name
    phone
  full-baths
  half-baths
4
Current State of Affairs

Finding semantic mappings is now the bottleneck!
largely done by hand
labor intensive and error prone
Will only be exacerbated
data sharing and XML become pervasive
proliferation of DTDs
translation of legacy data
reconciling ontologies on the semantic web
Need (semi-)automatic approaches to scale up!

5
The LSD (Learning Source Descriptions) Approach

Suppose user wants to integrate 100 data sources
1. User
manually creates mappings for a few sources, say 3
shows LSD these mappings
2. LSD learns from the mappings
3. LSD proposes mappings for remaining 97 sources

6
Example
Mediated schema
address, price, agent-phone, description
Schema of realestate.com
location, listed-price, phone, comments
Data listings from realestate.com
listed-price: 250,000; 110,000; ...
location: Miami, FL; Boston, MA; ...
phone: (305) 729 0831; (617) 253 1429; ...
comments: Fantastic house; Great location; ...
Learned hypotheses
If "phone" occurs in the name => agent-phone
If "fantastic" and "great" occur frequently in data values => description
Data listings from homes.com
price: 550,000; 320,000; ...
contact-phone: (278) 345 7215; (617) 335 2315; ...
extra-info: Beautiful yard; Great beach; ...
7
Our Contributions

1. Use of multi-strategy learning
well-suited to exploit multiple types of knowledge
highly modular and extensible
2. Extend learning to incorporate constraints
handle a wide range of domain and user-specified constraints
3. Develop XML learner
exploit hierarchical nature of XML

8
Multi-Strategy Learning

Use a set of base learners
each exploits well certain types of information
Match schema elements of a new source
apply the base learners
combine their predictions using a meta-learner
Meta-learner
uses training sources to measure base learner accuracy
weighs each learner based on its accuracy
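
The combination step on this slide can be sketched as follows. This is a minimal illustration, not LSD's actual implementation: the function name `combine_predictions`, the learner weights, the per-learner scores, and the final normalization are all assumptions made for the example.

```python
from collections import defaultdict

def combine_predictions(predictions, weights):
    """Weighted linear combination of base-learner predictions.

    predictions: {learner_name: {label: confidence}}
    weights:     {learner_name: weight} (hypothetical values; in the
                 presentation they are derived from learner accuracy)
    """
    combined = defaultdict(float)
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            combined[label] += weights[learner] * confidence
    total = sum(combined.values())
    if total == 0:
        return dict(combined)
    # Normalizing to sum to 1 is an assumption of this sketch.
    return {label: score / total for label, score in combined.items()}

# Hypothetical predictions from two base learners for one source element.
predictions = {
    "name_learner": {"address": 0.6, "description": 0.4},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}
weights = {"name_learner": 0.5, "naive_bayes": 0.5}
combined = combine_predictions(predictions, weights)
```

With equal weights the result is simply the mean of the two predictions; a more accurate learner would receive a larger weight and pull the result toward its own scores.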

9
Base Learners

Input
schema information: name, proximity, structure, ...
data information: value, format, ...
Output
prediction weighted by confidence score
Examples
Name learner
agent-name => (name,0.7), (phone,0.3)
Naive Bayes learner
"Kent, WA" => (address,0.8), (name,0.2)
"Great location" => (description,0.9), (address,0.1)
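
A base learner of the Naive Bayes kind can be sketched as below. This is a generic multinomial Naive Bayes over whitespace tokens with add-one smoothing, assumed here for illustration; the class name, tokenization, and training values are hypothetical and are not taken from LSD's code.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesLearner:
    """Toy Naive Bayes text classifier returning (label, confidence) scores."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> number of examples
        self.vocabulary = set()

    def train(self, examples):
        for text, label in examples:
            tokens = text.lower().split()
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text):
        if not self.label_counts:
            return {}
        tokens = text.lower().split()
        total_examples = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for label, count in self.label_counts.items():
            label_total = sum(self.token_counts[label].values())
            score = math.log(count / total_examples)
            for token in tokens:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                seen = self.token_counts[label][token]
                score += math.log((seen + 1) / (label_total + vocab_size))
            log_scores[label] = score
        # Convert log scores into confidences that sum to 1.
        top = max(log_scores.values())
        unnormalized = {l: math.exp(s - top) for l, s in log_scores.items()}
        norm = sum(unnormalized.values())
        return {l: v / norm for l, v in unnormalized.items()}

# Illustrative training values in the style of real-estate listings.
learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"),
    ("Boston, MA", "address"),
    ("Fantastic house", "description"),
    ("Great location", "description"),
])
prediction = learner.predict("Great house")
```

The output has the same shape as the predictions on the slide: a set of mediated-schema labels, each with a confidence score.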

10
Training the Learners
Mediated schema
address, price, agent-phone, description
Schema of realestate.com
location, listed-price, phone, comments
Data listings from realestate.com
<location> Miami, FL </> <listed-price> 250,000 </> <phone> (305) 729 0831 </> <comments> Fantastic house </>
<location> Boston, MA </> <listed-price> 110,000 </> <phone> (617) 253 1429 </> <comments> Great location </>
Name Learner training examples
(location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
Naive Bayes Learner training examples
("Miami, FL", address), ("250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...
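
A rough sketch of how the two kinds of training examples on this slide could be derived from a user-supplied mapping and the source's listings. The function name `build_training_examples` and the dictionary-based data layout are assumptions for illustration only.

```python
def build_training_examples(mapping, listings):
    """Derive training examples from one manually mapped source.

    mapping:  {source_tag: mediated_label}, supplied by the user
    listings: list of {source_tag: value} records from the source
    Returns (name_examples, data_examples).
    """
    # The Name Learner sees tag names paired with mediated labels.
    name_examples = list(mapping.items())
    # The Naive Bayes Learner sees data values paired with mediated labels.
    data_examples = [
        (value, mapping[tag])
        for listing in listings
        for tag, value in listing.items()
        if tag in mapping
    ]
    return name_examples, data_examples

mapping = {
    "location": "address",
    "listed-price": "price",
    "phone": "agent-phone",
    "comments": "description",
}
listings = [
    {"location": "Miami, FL", "listed-price": "250,000",
     "phone": "(305) 729 0831", "comments": "Fantastic house"},
    {"location": "Boston, MA", "listed-price": "110,000",
     "phone": "(617) 253 1429", "comments": "Great location"},
]
name_examples, data_examples = build_training_examples(mapping, listings)
```

Each learner gets the view of the source it is suited to: schema names for the Name Learner, data values for Naive Bayes.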
11
Applying the Learners
Mediated schema
address, price, agent-phone, description
Schema of homes.com
area, day-phone, extra-info
area
<area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
Name Learner + Naive Bayes -> Meta-Learner, one prediction per listing:
(address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3)
Combined prediction for area: (address,0.7), (description,0.3)
day-phone
<day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
Name Learner + Naive Bayes -> Meta-Learner
Combined prediction for day-phone: (agent-phone,0.9), (description,0.1)
extra-info
<extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
Combined prediction for extra-info: (address,0.6), (description,0.4)
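
The slide's numbers for the area element are consistent with averaging the per-listing predictions into one element-level prediction. The sketch below assumes simple averaging; the presentation does not spell out the exact rule, and the function name `average_predictions` is hypothetical.

```python
def average_predictions(instance_predictions):
    """Average per-listing predictions into one element-level prediction.

    instance_predictions: list of {label: confidence}, one per data listing.
    Averaging is an assumption of this sketch, not a documented LSD rule.
    """
    if not instance_predictions:
        return {}
    totals = {}
    for prediction in instance_predictions:
        for label, confidence in prediction.items():
            totals[label] = totals.get(label, 0.0) + confidence
    count = len(instance_predictions)
    return {label: total / count for label, total in totals.items()}

# Per-listing meta-learner predictions for the three <area> values.
area_predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_summary = average_predictions(area_predictions)
```

Under this assumption the three listings yield roughly 0.7 for address and 0.3 for description, matching the combined prediction shown for area.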
12
Domain Constraints

Impose semantic regularities on sources
verified using schema or data
Examples
a = address and b = address => a = b
a = house-id => a is a key
a = agent-info and b = agent-name => b is nested in a
Can be specified up front
when creating mediated schema
independent of any actual source schema

13
The Constraint Handler
Domain constraint: a = address and b = address => a = b
Predictions from Meta-Learner
area: (address,0.7), (description,0.3)
contact-phone: (agent-phone,0.9), (description,0.1)
extra-info: (address,0.6), (description,0.4)
Candidate mapping combinations and their scores
area: description, contact-phone: description, extra-info: description; 0.3 x 0.1 x 0.4 = 0.012
area: address, contact-phone: agent-phone, extra-info: address; 0.7 x 0.9 x 0.6 = 0.378
area: address, contact-phone: agent-phone, extra-info: description; 0.7 x 0.9 x 0.4 = 0.252

Can specify arbitrary constraints
User feedback = domain constraint
ad-id = house-id
Extended to handle domain heuristics
a = agent-phone and b = agent-name => a and b are usually close to each other
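
The scoring on this slide can be reproduced with a small sketch: score each mapping combination as the product of its confidences, discard combinations that violate a constraint, and keep the best survivor. Exhaustive enumeration and the names `best_mapping` and `at_most_one_address` are assumptions of this example; the slides do not state how LSD searches the space of combinations.

```python
from itertools import product

def best_mapping(predictions, constraints):
    """Return the highest-scoring mapping combination that satisfies all constraints.

    predictions: {source_element: {label: confidence}}
    constraints: list of functions taking a candidate {element: label}
                 and returning True when the candidate is acceptable
    """
    elements = list(predictions)
    best, best_score = None, 0.0
    # Enumerate every assignment of one candidate label per element.
    for labels in product(*(predictions[e] for e in elements)):
        candidate = dict(zip(elements, labels))
        if not all(check(candidate) for check in constraints):
            continue
        score = 1.0
        for element, label in candidate.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

def at_most_one_address(candidate):
    # Encodes the constraint that only one source element may map to address.
    return list(candidate.values()).count("address") <= 1

predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}
unconstrained, unconstrained_score = best_mapping(predictions, [])
constrained, constrained_score = best_mapping(predictions, [at_most_one_address])
```

Without constraints the top combination maps both area and extra-info to address (score 0.378); with the constraint applied, that combination is rejected and the 0.252 combination, which maps extra-info to description, wins.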

14
Putting It All Together: the LSD System
[Diagram: the LSD architecture, split into a Training Phase and a Matching Phase. Labeled components: mediated schema, source schemas, data listings, training data for base learners, base learners L1, L2, ..., Lk, domain constraints, user feedback, constraint handler, mapping combination]

Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
Meta-learner
uses stacking [Ting&Witten99, Wolpert92]
returns linear weighted combination of base learners' predictions

15
Empirical Evaluation

Four domains
Real Estate I & II, Course Offerings, Faculty Listings
For each domain
create mediated DTD and domain constraints
choose five sources
extract and convert data listings into XML
mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48

Ten runs for each experiment - in each run
manually provide 1-1 mappings for 3 sources
ask LSD to propose mappings for remaining 2 sources
accuracy = % of 1-1 mappings correctly identified
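
The accuracy measure can be sketched as below. The choice of denominator (all source elements that have a correct 1-1 mapping) is an assumption of this example, as are the function name `matching_accuracy` and the sample mappings.

```python
def matching_accuracy(proposed, correct):
    """Percentage of 1-1 mappings that were correctly identified.

    proposed: {source_element: mediated_label} as proposed by the system
    correct:  {source_element: mediated_label} as determined manually
    """
    if not correct:
        raise ValueError("correct mapping must not be empty")
    hits = sum(
        1 for element, label in correct.items()
        if proposed.get(element) == label
    )
    return 100.0 * hits / len(correct)

# Hypothetical mappings for one test source.
correct = {"area": "address", "day-phone": "agent-phone", "extra-info": "description"}
proposed = {"area": "address", "day-phone": "agent-phone", "extra-info": "address"}
accuracy = matching_accuracy(proposed, correct)
```

Here two of the three elements are matched correctly, so the accuracy is about 66.7%.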

16
High Matching Accuracy
Average matching accuracy (%)
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
Meta-learner: + 5 - 22%
Constraint handler: + 7 - 13%
XML learner: + 0.8 - 6%
17
Performance Sensitivity
[Plot: average matching accuracy (%) vs. number of data listings per source]
18
Contribution of Schema vs. Data
[Plot: average matching accuracy (%)]

More experiments in the paper!

19
Related Work

Rule-based approaches
TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
utilize only schema information
Learner-based approaches
SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
employ a single learner, limited applicability
Others
DELTA [Clifton et al. 97], CLIO [Miller et al. 00], [Yan et al. 01]
Multi-strategy learning in other domains
series of workshops [91, 93, 96, 98, 00]
[Freitag98], Proverb [Keim et al. 99]

20
Summary

LSD project
applies machine learning to schema matching
Main ideas and contributions
use of multi-strategy learning
extend learning to handle domain and user-specified constraints
develop XML learner
System design: a contribution to generic schema matching
highly modular and extensible
handle multiple types of knowledge
continuously improve over time

21
Ongoing and Future Work

Improve accuracy
address current system limitations
Extend LSD to more complex mappings
Apply LSD to other application contexts
data translation
data warehousing
e-commerce
information extraction
semantic web
www.cs.washington.edu/homes/anhai/lsd.html

22
Contribution of Each Component
[Bar chart: average matching accuracy (%) for five configurations: without Name Learner, without Naive Bayes, without Whirl Learner, without Constraint Handler, and the complete LSD system]
23
Exploiting Hierarchical Structure

Existing learners flatten out all structures
Developed XML learner
similar to the Naive Bayes learner
input instance = bag of tokens
differs in one crucial aspect
consider not only text tokens, but also structure tokens

<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>
<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
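
The difference between a flat bag of tokens and one that also carries structure tokens can be sketched as follows. The token format (`<tag>` strings for nested elements) and the function name `bag_of_tokens` are assumptions for illustration; the presentation does not specify how the XML learner encodes structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def bag_of_tokens(xml_text, include_structure=True):
    """Turn one XML instance into a bag of tokens.

    With include_structure=True, each nested element contributes a
    structure token such as "<name>" in addition to its text tokens.
    The root tag is skipped, since it is the element being classified.
    """
    root = ET.fromstring(xml_text)
    tokens = Counter()
    for element in root.iter():
        if include_structure and element is not root:
            tokens[f"<{element.tag}>"] += 1
        if element.text and element.text.strip():
            tokens.update(element.text.split())
        if element.tail and element.tail.strip():
            tokens.update(element.tail.split())
    return tokens

contact_xml = (
    "<contact><name>Gail Murphy</name>"
    "<firm>MAX Realtors</firm></contact>"
)
flat_bag = bag_of_tokens(contact_xml, include_structure=False)
structured_bag = bag_of_tokens(contact_xml)
```

The flat bag contains only "Gail", "Murphy", "MAX", and "Realtors", all of which also appear in the description text above; the structured bag adds "<name>" and "<firm>", which a free-text description would not produce.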
