Title: Learning Source Mappings
1. Learning Source Mappings
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Database Information Systems
- October 27, 2008
LSD Slides courtesy AnHai Doan
2. Administrivia
- Midterm due Thursday
- 5-10 pages (single-spaced, 10-12 pt)
3. Semantic Mappings between Schemas
- Mediated & source schemas = XML DTDs
[Figure: two schema trees connected by a 1-1 mapping and a non 1-1 mapping]
Mediated schema: house (address, num-baths, contact-info (agent-name, agent-phone))
Source schema: house (location, contact (name, phone), full-baths, half-baths)
4. The LSD (Learning Source Descriptions) Approach
- Suppose user wants to integrate 100 data sources
- 1. User
- manually creates mappings for a few sources, say 3
- shows LSD these mappings
- 2. LSD learns from the mappings
- Multi-strategy learning incorporates many types of info in a general way
- Knowledge of constraints further helps
- 3. LSD proposes mappings for remaining 97 sources
5. Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data from realestate.com:
- listed-price: 250,000; 110,000; ...
- location: Miami, FL; Boston, MA; ...
- phone: (305) 729 0831; (617) 253 1429; ...
- comments: Fantastic house; Great location; ...
Learned hypotheses:
- If "phone" occurs in the name => agent-phone
- If "fantastic" and "great" occur frequently in data values => description
Data from homes.com:
- price: 550,000; 320,000; ...
- contact-phone: (278) 345 7215; (617) 335 2315; ...
- extra-info: Beautiful yard; Great beach; ...
6. LSD's Multi-Strategy Learning
- Use a set of base learners
- each exploits well certain types of information
- Match schema elements of a new source
- apply the base learners
- combine their predictions using a meta-learner
- Meta-learner
- uses training sources to measure base learner accuracy
- weighs each learner based on its accuracy
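The weighting idea can be illustrated with a minimal sketch. It assumes each base learner returns a dictionary of label confidences and that a learner's weight is simply its accuracy on labeled training examples. That is a simplification chosen for illustration, not necessarily how LSD computes its weights. The names `learner_accuracy` and `combine` are hypothetical.

```python
from collections import defaultdict


def learner_accuracy(predict, labeled_examples):
    """Fraction of labeled examples whose top-scoring label is correct."""
    correct = 0
    for value, label in labeled_examples:
        dist = predict(value)
        if max(dist, key=dist.get) == label:
            correct += 1
    return correct / len(labeled_examples)


def combine(predictions, weights):
    """Weighted sum of each learner's confidences, normalized to sum to 1."""
    scores = defaultdict(float)
    for learner, dist in predictions.items():
        for label, confidence in dist.items():
            scores[label] += weights[learner] * confidence
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical predictions from two base learners for one source element.
predictions = {
    "name": {"address": 0.5, "description": 0.5},
    "bayes": {"address": 0.9, "description": 0.1},
}
weights = {"name": 0.5, "bayes": 1.0}  # assumed accuracies, for illustration only
combined = combine(predictions, weights)
```

Here the more accurate learner dominates, so `combined` favors `address`.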
7. Base Learners
- Input
- schema information: name, proximity, structure, ...
- data information: value, format, ...
- Output
- prediction weighted by confidence score
- Examples
- Name learner
- agent-name => (name,0.7), (phone,0.3)
- Naive Bayes learner
- "Kent, WA" => (address,0.8), (name,0.2)
- "Great location" => (description,0.9), (address,0.1)
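To make the input/output contract of a base learner concrete, here is a small sketch of a Naive Bayes learner over the tokens of data values. The class name, tokenizer, and Laplace smoothing are assumptions made for illustration. The slides do not give LSD's actual implementation.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesLearner:
    """Toy multinomial Naive Bayes over word tokens (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> number of training values
        self.vocab = set()

    @staticmethod
    def tokenize(value):
        return re.findall(r"[a-z]+|\d+", value.lower())

    def train(self, examples):
        for value, label in examples:
            tokens = self.tokenize(value)
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, value):
        """Return {label: confidence}, with confidences summing to 1."""
        tokens = self.tokenize(value)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            # Laplace (add-one) smoothing so unseen tokens do not zero out a label.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for token in tokens:
                score += math.log((self.token_counts[label][token] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp_scores = {l: math.exp(s - top) for l, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {l: e / norm for l, e in exp_scores.items()}


learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"),
    ("Boston, MA", "address"),
    ("Fantastic house", "description"),
    ("Great location", "description"),
])
prediction = learner.predict("Great beach")
```

With this tiny training set, `prediction` ranks `description` above `address` because "great" was seen only in description values.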
8. Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data listings from realestate.com:
- Miami, FL; 250,000; (305) 729 0831; Fantastic house
- Boston, MA; 110,000; (617) 253 1429; Great location
Name Learner training examples:
(location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
Naive Bayes Learner training examples:
("Miami, FL", address), ("250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...
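The way a user-supplied mapping turns into training data for both learners can be sketched as follows. The function name `build_training_examples` and the dictionary representation of listings are assumptions made for illustration.

```python
def build_training_examples(mapping, listings):
    """Derive training pairs from one manually mapped source.

    mapping:  source element name -> mediated-schema element
    listings: list of dicts, source element name -> data value
    """
    # The name learner sees (source name, mediated name) pairs.
    name_examples = list(mapping.items())
    # The Naive Bayes learner sees (data value, mediated name) pairs.
    value_examples = [
        (value, mapping[element])
        for listing in listings
        for element, value in listing.items()
        if element in mapping
    ]
    return name_examples, value_examples


mapping = {
    "location": "address",
    "listed-price": "price",
    "phone": "agent-phone",
    "comments": "description",
}
listings = [
    {"location": "Miami, FL", "listed-price": "250,000",
     "phone": "(305) 729 0831", "comments": "Fantastic house"},
    {"location": "Boston, MA", "listed-price": "110,000",
     "phone": "(617) 253 1429", "comments": "Great location"},
]
name_examples, value_examples = build_training_examples(mapping, listings)
```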
9. Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
area: Seattle, WA; Kent, WA; Austin, TX
- Name Learner and Naive Bayes feed the Meta-Learner, giving one prediction per value:
- (address,0.8), (description,0.2)
- (address,0.6), (description,0.4)
- (address,0.7), (description,0.3)
- Combined prediction for area: (address,0.7), (description,0.3)
day-phone: (278) 345 7215; (617) 335 2315; (512) 427 1115
- Name Learner and Naive Bayes feed the Meta-Learner
- Combined prediction: (agent-phone,0.9), (description,0.1)
extra-info: Beautiful yard; Great beach; Close to Seattle
- Combined prediction: (address,0.6), (description,0.4)
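The slide does not state how the per-value predictions for a source element are merged into a single prediction, but a plain average reproduces its numbers for area. The sketch below assumes that averaging rule. `column_prediction` is a hypothetical name.

```python
from collections import defaultdict


def column_prediction(instance_predictions):
    """Average the per-value label confidences of one source element."""
    totals = defaultdict(float)
    for dist in instance_predictions:
        for label, confidence in dist.items():
            totals[label] += confidence
    count = len(instance_predictions)
    return {label: total / count for label, total in totals.items()}


# Meta-learner output for the three values of "area" shown on the slide.
area_predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area = column_prediction(area_predictions)
```

Up to floating-point rounding, `area` is (address, 0.7), (description, 0.3), matching the combined prediction on the slide.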
10. Domain Constraints
- Impose semantic regularities on sources
- verified using schema or data
- Examples
- a = address & b = address => a = b
- a = house-id => a is a key
- a = agent-info & b = agent-name => b is nested in a
- Can be specified up front
- when creating mediated schema
- independent of any actual source schema
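One way to picture such constraints is as predicates over a candidate mapping, some checked against the schema alone and some against the data. The sketch below covers the first two examples. The function names and the representation (mapping as source element to mediated element, data as source element to list of values) are assumptions for illustration. A nesting constraint would additionally need the source schema tree.

```python
def at_most_one(label):
    """Schema-level check: at most one source element maps to `label`."""
    def check(mapping, data):
        return list(mapping.values()).count(label) <= 1
    return check


def is_key(label):
    """Data-level check: an element mapped to `label` has no duplicate values."""
    def check(mapping, data):
        for element, target in mapping.items():
            if target == label:
                values = data.get(element, [])
                if len(values) != len(set(values)):
                    return False
        return True
    return check


def satisfies_all(mapping, data, constraints):
    return all(check(mapping, data) for check in constraints)


constraints = [at_most_one("address"), is_key("house-id")]

# Hypothetical candidate mappings and data values.
ok = satisfies_all({"area": "address", "ad-id": "house-id"},
                   {"ad-id": [101, 102, 103]}, constraints)
two_addresses = satisfies_all({"area": "address", "extra-info": "address"},
                              {}, constraints)
duplicate_ids = satisfies_all({"ad-id": "house-id"},
                              {"ad-id": [101, 102, 102]}, constraints)
```

Only the first candidate passes. The second maps two elements to address, and the third maps a column with repeated values to a key.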
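11. The Constraint Handler
Domain constraints: a = address & b = address => a = b
Predictions from Meta-Learner:
- area: (address,0.7), (description,0.3)
- contact-phone: (agent-phone,0.9), (description,0.1)
- extra-info: (address,0.6), (description,0.4)
Candidate mappings, scored by the product of their confidences:
- area = description, contact-phone = description, extra-info = description: 0.3 * 0.1 * 0.4 = 0.012
- area = address, contact-phone = agent-phone, extra-info = address: 0.7 * 0.9 * 0.6 = 0.378 (violates the constraint: two elements map to address)
- area = address, contact-phone = agent-phone, extra-info = description: 0.7 * 0.9 * 0.4 = 0.252 (highest-scoring mapping that satisfies the constraint)
- Can specify arbitrary constraints
- User feedback = domain constraint
- ad-id = house-id
- Extended to handle domain heuristics
- a = agent-phone & b = agent-name => a & b are usually close to each other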
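The arithmetic on this slide can be reproduced with a small sketch that enumerates every candidate mapping, discards those that violate a constraint, and keeps the one with the highest product of confidences. Exhaustive enumeration is an assumption made for illustration. The slides do not say how LSD searches this space, and `best_mapping` is a hypothetical name.

```python
from itertools import product


def best_mapping(predictions, is_valid):
    """Return the valid mapping with the highest product of confidences."""
    elements = list(predictions)
    best, best_score = None, 0.0
    # Each candidate picks one label per source element.
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not is_valid(mapping):
            continue
        score = 1.0
        for element, label in mapping.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


def at_most_one_address(mapping):
    return list(mapping.values()).count("address") <= 1


predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}
mapping, score = best_mapping(predictions, at_most_one_address)
```

The 0.378 candidate is rejected because it maps two elements to address, so the result is area = address, contact-phone = agent-phone, extra-info = description with a score of about 0.252.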
12. Putting It All Together: LSD System
[Figure: LSD system architecture, divided into a Training Phase and a Matching Phase. Components shown: Mediated schema, Source schemas, Data listings, Training data for base learners, base learners L1, L2, ..., Lk, Mapping Combination, Constraint Handler, Domain Constraints, User Feedback]
- Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
- Meta-learner
- uses stacking [Ting & Witten 99, Wolpert 92]
- returns linear weighted combination of base learners' predictions
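As a rough illustration of a linear weighted combination learned by stacking, the sketch below fits least-squares weights for two base learners on one label. It assumes the confidence scores come from held-out training examples and that the target is 1 when the label is correct and 0 otherwise. Restricting to two learners and solving the normal equations in closed form are simplifications for illustration, not LSD's actual procedure, and `fit_two_weights` is a hypothetical name.

```python
def fit_two_weights(scores_a, scores_b, targets):
    """Least-squares weights (w_a, w_b) minimizing sum((w_a*a + w_b*b - y)^2)."""
    saa = sum(a * a for a in scores_a)
    sbb = sum(b * b for b in scores_b)
    sab = sum(a * b for a, b in zip(scores_a, scores_b))
    say = sum(a * y for a, y in zip(scores_a, targets))
    sby = sum(b * y for b, y in zip(scores_b, targets))
    # Solve the 2x2 normal equations with Cramer's rule.
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("base learner scores are linearly dependent")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


# Hypothetical held-out confidences for the label "address".
name_scores = [0.6, 0.5, 0.4, 0.5]    # name learner: barely separates the classes
bayes_scores = [0.9, 0.2, 0.8, 0.1]   # Naive Bayes: separates them well
is_address = [1, 0, 1, 0]             # 1 if the element really maps to address
w_name, w_bayes = fit_two_weights(name_scores, bayes_scores, is_address)
```

On this made-up data the fit gives the Naive Bayes learner the larger weight, since its scores track the true labels more closely.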
13. Empirical Evaluation
- Four domains
- Real Estate I & II, Course Offerings, Faculty Listings
- For each domain
- create mediated DTD & domain constraints
- choose five sources
- extract & convert data listings into XML
- mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
- Ten runs for each experiment; in each run
- manually provide 1-1 mappings for 3 sources
- ask LSD to propose mappings for remaining 2 sources
- accuracy = % of 1-1 mappings correctly identified
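The accuracy measure can be written down directly. This sketch assumes both the proposed and the correct mappings are dictionaries from source element to mediated element. `matching_accuracy` is a hypothetical name, and the example mappings are made up for illustration.

```python
def matching_accuracy(proposed, gold):
    """Percentage of the correct 1-1 mappings that were proposed."""
    if not gold:
        raise ValueError("gold mapping is empty")
    correct = sum(1 for element, label in gold.items()
                  if proposed.get(element) == label)
    return 100.0 * correct / len(gold)


# Made-up example: two of three elements matched correctly.
gold = {"area": "address", "day-phone": "agent-phone", "extra-info": "description"}
proposed = {"area": "address", "day-phone": "agent-phone", "extra-info": "address"}
accuracy = matching_accuracy(proposed, gold)
```

Averaging this value over the ten runs of an experiment would give one accuracy figure per domain.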
14. LSD Matching Accuracy
[Chart: average matching accuracy (%)]
- LSD's accuracy: 71 - 92%
- Best single base learner: 42 - 72%
- Meta-learner: + 5 - 22%
- Constraint handler: + 7 - 13%
- XML learner: + 0.8 - 6%
15. LSD Summary
- Applies machine learning to schema matching
- use of multi-strategy learning
- Domain & user-specified constraints
- Probably the most flexible means of doing schema matching today in a semi-automated way
- Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings
16. Since LSD
- A lot more work on the following
- Alternative schemes for putting together info from base learners
- Hierarchical learners
- Compare two trees: parent nodes are likely to be the same if child nodes are similar; child nodes are likely to be the same if parent nodes are similar
- Using mass collaboration: humans do the work
- And a lot of work on entity resolution or record matching
- Uses similar ideas to try to determine when two records are referring to the same entity
17. Jumping Up a Level
- We've now seen how heterogeneous data makes a huge difference
- In the need for relating different kinds of attributes
- Mapping languages
- Mapping tools
- Query reformulation
- and in query processing
- Adaptive query processing
- Next time we'll go even further, and start to consider search, focusing on Google