Learning Source Mappings - PowerPoint PPT Presentation

About This Presentation

Title:

Learning Source Mappings

Provided by: zack4

Learn more at: https://www.cis.upenn.edu

Transcript and Presenter's Notes

1
Learning Source Mappings

Zachary G. Ives
University of Pennsylvania
CIS 650 Database Information Systems
October 27, 2008

LSD Slides courtesy AnHai Doan
2
Administrivia

Midterm due Thursday
5-10 pages (single-spaced, 10-12 pt)

3
Semantic Mappings between Schemas

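Mediated & source schemas = XML DTDs

[Figure: two XML DTD trees, with element correspondences labeled "1-1 mapping" and "non 1-1 mapping".
  First schema: house (address, num-baths, contact-info (agent-name, agent-phone))
  Second schema: house (location, contact (name, phone), full-baths, half-baths)]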
4
The LSD (Learning Source Descriptions) Approach

Suppose a user wants to integrate 100 data sources
1. User
  manually creates mappings for a few sources, say 3
  shows LSD these mappings
2. LSD learns from the mappings
  Multi-strategy learning incorporates many types of info in a general way
  Knowledge of constraints further helps
3. LSD proposes mappings for remaining 97 sources

5
Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data listings
  listed-price: 250,000; 110,000; ...
  location: Miami, FL; Boston, MA; ...
  phone: (305) 729 0831; (617) 253 1429; ...
  comments: Fantastic house; Great location; ...
Learned hypotheses
  If "phone" occurs in the name => agent-phone
  If "fantastic" & "great" occur frequently in data values => description
homes.com data listings
  price: 550,000; 320,000; ...
  contact-phone: (278) 345 7215; (617) 335 2315; ...
  extra-info: Beautiful yard; Great beach; ...
6
LSD's Multi-Strategy Learning

Use a set of base learners
  each exploits well certain types of information
Match schema elements of a new source
  apply the base learners
  combine their predictions using a meta-learner
Meta-learner
  uses training sources to measure base learner accuracy
  weighs each learner based on its accuracy
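
The weighting idea above can be sketched in a few lines. This is only an illustration under stated assumptions: each base learner is modeled as a function from a schema element name to a dict of label confidences, a learner's weight is taken to be its plain accuracy on the training examples, and the two toy learners (`name_learner`, `always_description`) are hypothetical stand-ins, not LSD's actual learners.

```python
from collections import defaultdict

def learner_accuracy(learner, training_examples):
    """Fraction of training examples whose top-scoring label is correct."""
    correct = 0
    for element, true_label in training_examples:
        scores = learner(element)
        if max(scores, key=scores.get) == true_label:
            correct += 1
    return correct / len(training_examples)

def combine(learners, weights, element):
    """Weighted sum of the base learners' confidences, renormalized to sum to 1."""
    combined = defaultdict(float)
    for learner, weight in zip(learners, weights):
        for label, score in learner(element).items():
            combined[label] += weight * score
    total = sum(combined.values())
    return {label: score / total for label, score in combined.items()}

# Two toy base learners (hypothetical, for illustration only).
def name_learner(element):
    if "phone" in element:
        return {"agent-phone": 0.9, "description": 0.1}
    return {"description": 0.6, "address": 0.4}

def always_description(element):
    return {"description": 1.0}

# Training examples: (source element name, correct mediated-schema label).
training = [("phone", "agent-phone"), ("comments", "description")]
learners = [name_learner, always_description]
weights = [learner_accuracy(learner, training) for learner in learners]
prediction = combine(learners, weights, "contact-phone")
```

Here the more accurate learner gets the larger weight, so its vote for agent-phone dominates the combined prediction for contact-phone.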

7
Base Learners

Input
  schema information: name, proximity, structure, ...
  data information: value, format, ...
Output
  prediction weighted by confidence score
Examples
  Name learner
    agent-name => (name,0.7), (phone,0.3)
  Naive Bayes learner
    "Kent, WA" => (address,0.8), (name,0.2)
    "Great location" => (description,0.9), (address,0.1)
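
As a rough sketch of how a Naive Bayes base learner could turn a data value into confidence-weighted predictions: the class below treats each value as a bag of lowercase tokens and uses add-one smoothing. The tokenizer, the smoothing choice, and the class name `NaiveBayesLearner` are assumptions made for illustration; the slides do not specify these details.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesLearner:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()             # label -> number of examples
        self.vocabulary = set()

    def train(self, examples):
        """examples: iterable of (data value, mediated-schema label) pairs."""
        for value, label in examples:
            self.label_counts[label] += 1
            for token in tokenize(value):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)

    def predict(self, value):
        """Return {label: confidence}, with confidences summing to 1."""
        tokens = tokenize(value)
        total_examples = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            log_p = math.log(count / total_examples)
            label_total = sum(self.token_counts[label].values())
            for token in tokens:
                # Add-one smoothing so unseen tokens do not zero out a label.
                numerator = self.token_counts[label][token] + 1
                log_p += math.log(numerator / (label_total + len(self.vocabulary)))
            log_scores[label] = log_p
        # Normalize in log space for numerical stability.
        top = max(log_scores.values())
        exp_scores = {lbl: math.exp(s - top) for lbl, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {lbl: s / norm for lbl, s in exp_scores.items()}

learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"),
    ("Boston, MA", "address"),
    ("Fantastic house", "description"),
    ("Great location", "description"),
])
scores = learner.predict("Miami, FL")
```

With only four training values the confidences are coarse, but the output has the same shape as the examples above: a list of candidate labels, each with a confidence score.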

8
Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data listings
  Miami, FL; 250,000; (305) 729 0831; Fantastic house
  Boston, MA; 110,000; (617) 253 1429; Great location
Name Learner training examples
  (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
Naive Bayes Learner training examples
  ("Miami, FL", address), ("250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...
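
The training examples on this slide can be derived mechanically from a manually mapped source. A minimal sketch, assuming the mapping is given as a dict from source element to mediated-schema element and each listing is a dict of values; the function name `build_training_examples` is hypothetical.

```python
def build_training_examples(mapping, listings):
    """Turn one manually mapped source into training pairs for two base learners.

    mapping:  {source element name: mediated-schema element}
    listings: list of {source element name: data value} records
    """
    # The name learner trains on (source element name, label) pairs.
    name_examples = [(source_name, label) for source_name, label in mapping.items()]
    # The Naive Bayes learner trains on (data value, label) pairs.
    value_examples = [
        (value, mapping[source_name])
        for record in listings
        for source_name, value in record.items()
        if source_name in mapping
    ]
    return name_examples, value_examples

realestate_mapping = {
    "location": "address",
    "listed-price": "price",
    "phone": "agent-phone",
    "comments": "description",
}
realestate_listings = [
    {"location": "Miami, FL", "listed-price": "250,000",
     "phone": "(305) 729 0831", "comments": "Fantastic house"},
    {"location": "Boston, MA", "listed-price": "110,000",
     "phone": "(617) 253 1429", "comments": "Great location"},
]
name_examples, value_examples = build_training_examples(
    realestate_mapping, realestate_listings)
```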
9
Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
area: Seattle, WA; Kent, WA; Austin, TX
  Name Learner + Naive Bayes => Meta-Learner, one prediction per value
    (address,0.8), (description,0.2)
    (address,0.6), (description,0.4)
    (address,0.7), (description,0.3)
  combined prediction for area: (address,0.7), (description,0.3)
day-phone: (278) 345 7215; (617) 335 2315; (512) 427 1115
  Name Learner + Naive Bayes => Meta-Learner: (agent-phone,0.9), (description,0.1)
extra-info: Beautiful yard; Great beach; Close to Seattle
  => (address,0.6), (description,0.4)
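
One way to read the numbers on this slide is that the three per-value predictions for area are averaged into a single prediction for the element. The sketch below assumes exactly that; the averaging rule and the pairing of values with predictions are inferred from the slide's order and numbers, not stated on it, and `average_predictions` is a hypothetical name.

```python
from collections import defaultdict

def average_predictions(instance_predictions):
    """Average per-value label confidences into one prediction per element."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}

# Meta-learner output for each data value of the "area" element.
area_predictions = [
    {"address": 0.8, "description": 0.2},  # "Seattle, WA" (assumed pairing)
    {"address": 0.6, "description": 0.4},  # "Kent, WA"
    {"address": 0.7, "description": 0.3},  # "Austin, TX"
]
area = average_predictions(area_predictions)
```

Under this assumption the result agrees with the (address,0.7), (description,0.3) prediction shown for area.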
10
Domain Constraints

Impose semantic regularities on sources
  verified using schema or data
Examples
  a = address & b = address => a = b
  a = house-id => a is a key
  a = agent-info & b = agent-name => b is nested in a
Can be specified up front
  when creating mediated schema
  independent of any actual source schema

11
The Constraint Handler
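Domain Constraints: a = address & b = address => a = b
Predictions from Meta-Learner
  area: (address,0.7), (description,0.3)
  contact-phone: (agent-phone,0.9), (description,0.1)
  extra-info: (address,0.6), (description,0.4)
Candidate mapping combinations, scored by the product of confidences
  area: description, contact-phone: description, extra-info: description
    0.3 × 0.1 × 0.4 = 0.012
  area: address, contact-phone: agent-phone, extra-info: address
    0.7 × 0.9 × 0.6 = 0.378 (violates the constraint: two elements map to address)
  area: address, contact-phone: agent-phone, extra-info: description
    0.7 × 0.9 × 0.4 = 0.252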

Can specify arbitrary constraints
User feedback = domain constraint
  ad-id = house-id
Extended to handle domain heuristics
  a = agent-phone & b = agent-name => a & b are usually close to each other
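
The constraint handler's job on this slide can be sketched as a search over mapping combinations: score each combination by the product of its confidences and keep the best one that satisfies every constraint. The exhaustive enumeration below is an assumption chosen for clarity and only works for tiny inputs; the slides do not say how LSD searches the space. The names `best_mapping` and `at_most_one_address` are hypothetical.

```python
from itertools import product

def best_mapping(predictions, constraints):
    """Return the highest-scoring label assignment that satisfies all constraints.

    predictions: {source element: {label: confidence}}
    constraints: list of functions taking an assignment dict and returning bool
    """
    elements = list(predictions)
    best, best_score = None, 0.0
    # Enumerate every combination of one candidate label per element.
    for labels in product(*(predictions[e] for e in elements)):
        assignment = dict(zip(elements, labels))
        if not all(check(assignment) for check in constraints):
            continue
        score = 1.0
        for element, label in assignment.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score

def at_most_one_address(assignment):
    """Encodes: a = address & b = address => a = b."""
    return list(assignment.values()).count("address") <= 1

predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}
mapping, score = best_mapping(predictions, [at_most_one_address])
unconstrained_mapping, unconstrained_score = best_mapping(predictions, [])
```

Without the constraint, the top combination maps both area and extra-info to address; with it, that combination is rejected and the next-best one is chosen.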

12
Putting It All Together: LSD System
[Figure: LSD system architecture with a Training Phase and a Matching Phase. Components shown: mediated schema, source schemas, data listings, training data for base learners, base learners L1, L2, ..., Lk, Mapping Combination, Constraint Handler, Domain Constraints, User Feedback.]

Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
Meta-learner
  uses stacking [Ting & Witten 99, Wolpert 92]
  returns linear weighted combination of base learners' predictions
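
To make "linear weighted combination" concrete, here is a minimal sketch of fitting the weights for one label and two base learners by least squares. Using least squares, restricting to two learners, and the toy training numbers are all assumptions for illustration; the slides only say that stacking is used. The function name `fit_two_weights` is hypothetical.

```python
def fit_two_weights(scores_1, scores_2, targets):
    """Least-squares weights (w1, w2) so that w1*s1 + w2*s2 approximates targets.

    Solves the 2x2 normal equations directly with Cramer's rule.
    """
    a11 = sum(s * s for s in scores_1)
    a12 = sum(s * t for s, t in zip(scores_1, scores_2))
    a22 = sum(s * s for s in scores_2)
    b1 = sum(s * y for s, y in zip(scores_1, targets))
    b2 = sum(s * y for s, y in zip(scores_2, targets))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("base learner scores are collinear")
    w1 = (b1 * a22 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2

# Confidence each learner gave to the label "address" on four training values,
# and whether "address" was the correct label (1) or not (0). Toy numbers.
name_scores = [1.0, 0.0, 1.0, 0.0]
bayes_scores = [0.5, 0.5, 0.2, 0.8]
is_address = [1, 0, 1, 0]
w_name, w_bayes = fit_two_weights(name_scores, bayes_scores, is_address)

# Combined confidence for "address" on a new element, given both learners' scores.
combined = w_name * 0.7 + w_bayes * 0.6
```

In this toy data the first learner is always right about address, so the fit puts essentially all the weight on it. In stacking, the scores used for fitting would normally come from predictions on training data the base learners did not see.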

13
Empirical Evaluation

Four domains
  Real Estate I & II, Course Offerings, Faculty Listings
For each domain
  create mediated DTD & domain constraints
  choose five sources
  extract & convert data listings into XML
  mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
Ten runs for each experiment - in each run
  manually provide 1-1 mappings for 3 sources
  ask LSD to propose mappings for remaining 2 sources
  accuracy = % of 1-1 mappings correctly identified
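
The accuracy measure can be written down as a small function. This sketch assumes the denominator is the number of source elements that have a correct 1-1 mapping; the example mappings are toy data, not results from the evaluation, and `matching_accuracy` is a hypothetical name.

```python
def matching_accuracy(proposed, correct):
    """Percentage of elements with a correct 1-1 mapping whose proposed
    label equals the correct one."""
    if not correct:
        raise ValueError("no reference mappings to score against")
    hits = sum(1 for element, label in correct.items()
               if proposed.get(element) == label)
    return 100.0 * hits / len(correct)

# Toy example: two of three proposed mappings agree with the reference.
correct = {"area": "address", "day-phone": "agent-phone",
           "extra-info": "description"}
proposed = {"area": "address", "day-phone": "agent-phone",
            "extra-info": "address"}
accuracy = matching_accuracy(proposed, correct)
```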

14
LSD Matching Accuracy
Average Matching Accuracy (%)
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
15
LSD Summary

Applies machine learning to schema matching
  use of multi-strategy learning
  Domain & user-specified constraints
Probably the most flexible means of doing schema matching today in a semi-automated way
Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings

16
Since LSD

A lot more work on the following
  Alternative schemes for putting together info from base learners
  Hierarchical learners
    Compare two trees: parent nodes are likely to be the same if child nodes are similar; child nodes are likely to be the same if parent nodes are similar
  Using mass collaboration: humans do the work
And a lot of work on entity resolution or record matching
  Uses similar ideas to try to determine when two records are referring to the same entity

17
Jumping Up a Level