Alon Halevy


1
Learning to Map Between Schemas and Ontologies

Alon Halevy
University of Washington
Joint work with Anhai Doan and Pedro Domingos

2
Agenda

Ontology mapping is a key problem in many applications:
Data integration
Semantic web
Knowledge management
E-commerce
LSD:
A solution that uses multi-strategy learning.
We've started with schema matching (i.e., very simple ontologies).
Currently extending to more expressive ontologies.
Experiments show the approach is very promising!

3
The Structure Mapping Problem

Types of structures
Database schemas, XML DTDs, ontologies, ...
Input
Two (or more) structures, S1 and S2
Data instances for S1 and S2
Background knowledge
Output
A mapping between S1 and S2
Should enable translating between data instances.
Semantics of mapping?

4
Semantic Mappings between Schemas

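Source schemas = XML DTDs

[Diagram: two source schemas with semantic mappings drawn between their elements]
Schema 1: house (address, num-baths, contact-info (agent-name, agent-phone))
Schema 2: house (location, contact (name, phone), full-baths, half-baths)
Arrows mark 1-1 mappings and non 1-1 mappings between the two schemas.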
5
Motivation

Database schema integration
A problem as old as databases themselves.
database merging, data warehouses, data migration
Data integration / information gathering agents
On the WWW, in enterprises, large science
projects
Model management
Model matching: a key operator in an algebra where models and mappings are first-class objects.
See [Bernstein et al., 2000] for more.
The Semantic Web
Ontology mapping.
System interoperability
E-services, application integration, B2B applications, ...

6
Desiderata from Proposed Solutions

Accuracy, efficiency, ease of use.
Realistic expectations
Unlikely to be fully automated. Need user in the
loop.
Some notion of semantics for mappings.
Extensibility
Solution should exploit additional background
knowledge.
Memory, knowledge reuse
System should exploit previous manual or
automatically generated matchings.
Key idea behind LSD.

7
LSD Overview

L(earning) S(ource) D(escriptions)
Problem: generating semantic mappings between a mediated schema and a large set of data source schemas.
Key idea: generate the first mappings manually, and learn from them to generate the rest.
Technique: multi-strategy learning (extensible!)
Step 1 [SIGMOD 2001]: 1-1 mappings between XML DTDs.
Current focus:
Complex mappings
Ontology mapping.

8
Outline

Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.

9
Data Integration
[Diagram: data integration architecture]
User query: "Find houses with four bathrooms priced under 500,000"
The query is posed over the mediated schema. Query reformulation and optimization translate it to source schemas 1, 2, and 3, which are reached through wrappers over homes.com, realestate.com, and homeseekers.com.
Applications: WWW, enterprises, science projects
Techniques: virtual data integration, warehousing, custom code.
10
Semantic Mappings between Schemas

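Source schemas = XML DTDs

[Diagram: two source schemas with semantic mappings drawn between their elements]
Schema 1: house (address, num-baths, contact-info (agent-name, agent-phone))
Schema 2: house (location, contact (name, phone), full-baths, half-baths)
Arrows mark 1-1 mappings and non 1-1 mappings between the two schemas.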
11
Semantics (preliminary)

Semantics of mappings has received no attention.
Semantics of 1-1 mappings:
Given
$R(A_1, \ldots, A_n)$ and $S(B_1, \ldots, B_m)$
1-1 mappings $(A_i, B_j)$
Then, we postulate the existence of a relation $W$ such that
$\Pi_{C_1, \ldots, C_k}(W) = \Pi_{A_1, \ldots, A_k}(R)$,
$\Pi_{C_1, \ldots, C_k}(W) = \Pi_{B_1, \ldots, B_k}(S)$,
and $W$ also includes the unmatched attributes of $R$ and $S$.
In English: $R$ and $S$ are projections on some universal relation $W$, and the mappings specify the projection variables and correspondences.

12
Why Matching is Difficult

Aims to identify the same real-world entity
using names, structures, types, data values, etc.
Schemas represent the same entity differently
different names => same entity
area & address => location
same names => different entities
area => location or square-feet
Schema & data never fully capture semantics!
not adequately documented, not sufficiently expressive
Intended semantics is typically subjective!
IBM Almaden Lab = IBM?
Cannot be fully automated. Often hard for humans.
Committees are required!

13
Current State of Affairs

Finding semantic mappings is now the bottleneck!
largely done by hand
labor intensive & error prone
GTE: 4 hours/element for 27,000 elements [Li & Clifton 00]
Will only be exacerbated
data sharing & XML become pervasive
proliferation of DTDs
translation of legacy data
reconciling ontologies on semantic web
Need semi-automatic approaches to scale up!

14
Outline

Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.

15
The LSD Approach

User manually maps a few data sources to the
mediated schema.
LSD learns from the mappings, and proposes
mappings for the rest of the sources.
Several types of knowledge are used in learning:
Schema elements, e.g., attribute names
Data elements: ranges, formats, word frequencies, value frequencies, length of texts.
Proximity of attributes
Functional dependencies, number of attribute occurrences.
One learner does not fit all. Use multiple learners and combine them with a meta-learner.

16
Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data from realestate.com:
listed-price: 250,000; 110,000; ...
location: Miami, FL; Boston, MA; ...
phone: (305) 729 0831; (617) 253 1429; ...
comments: Fantastic house; Great location; ...
Learned hypotheses:
If "phone" occurs in the name => agent-phone
If "fantastic" & "great" occur frequently in data values => description
Data from homes.com:
price: 550,000; 320,000; ...
contact-phone: (278) 345 7215; (617) 335 2315; ...
extra-info: Beautiful yard; Great beach; ...
17
Multi-Strategy Learning

Use a set of base learners:
Name learner, Naive Bayes, Whirl, XML learner
And a set of recognizers:
County name, zip code, phone numbers.
Each base learner produces a prediction weighted by a confidence score.
Combine base learners with a meta-learner, using stacking.

18
Base Learners

Name Learner

Training examples: (contact-info, office-address), (contact, agent-phone), (phone, agent-phone), (listed-price, price)
New element: (contact-phone, ?)
contact-phone => (agent-phone, 0.7), (office-address, 0.3)

Naive Bayes Learner [Domingos & Pazzani 97]
"Kent, WA" => (address, 0.8), (name, 0.2)
Whirl Learner [Cohen & Hirsh 98]
XML Learner
exploits hierarchical structure of XML data
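
A minimal sketch of how a name-based learner could turn the training pairs above into a ranked prediction for `contact-phone`. The slide does not say which matching function the Name Learner uses, so a simple token-overlap (Jaccard) nearest-neighbour score is assumed here; `tokens`, `jaccard`, `predict`, and the normalization step are hypothetical names and choices. The scores it produces are illustrative and are not expected to reproduce the 0.7 / 0.3 figures on the slide.

```python
import re
from collections import defaultdict

# Training pairs (source element name, mediated-schema label) from the slide.
TRAINING = [
    ("contact-info", "office-address"),
    ("contact", "agent-phone"),
    ("phone", "agent-phone"),
    ("listed-price", "price"),
]


def tokens(name):
    """Split an element name such as 'contact-phone' into lowercase tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}


def jaccard(a, b):
    """Token-overlap similarity in [0, 1] (an assumed stand-in measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(name, training=TRAINING):
    """Return [(label, confidence)], best first, with confidences summing to 1."""
    best = defaultdict(float)
    query = tokens(name)
    for source_name, label in training:
        # Keep the best-matching training name for each label.
        best[label] = max(best[label], jaccard(query, tokens(source_name)))
    total = sum(best.values())
    if total == 0:
        return []
    ranked = [(label, s / total) for label, s in best.items() if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(predict("contact-phone"))
```

Under these assumptions `agent-phone` ranks first for `contact-phone`, because both `contact` and `phone` appear in training names labeled `agent-phone`.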

19
Training the Base Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data listings from realestate.com:
<location> Miami, FL </> <listed-price> 250,000 </> <phone> (305) 729 0831 </> <comments> Fantastic house </>
<location> Boston, MA </> <listed-price> 110,000 </> <phone> (617) 253 1429 </> <comments> Great location </>
Name Learner training examples:
(location, address), (listed-price, price), (phone, agent-phone), ...
Naive Bayes Learner training examples:
("Miami, FL", address), ("250,000", price), ("(305) 729 0831", agent-phone), ...
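
The training step above can be sketched as follows: given a listing and the manual 1-1 mappings, emit (element name, label) pairs for a name-based learner and (data value, label) pairs for a value-based learner. This is only an illustration. The `<listing>` wrapper element, the well-formed closing tags, the function name `training_examples`, and the `comments -> description` mapping (suggested by the earlier example slide but not listed on this one) are assumptions.

```python
import xml.etree.ElementTree as ET

# One realestate.com listing, rewritten as well-formed XML.
# The <listing> wrapper is an assumption made so the text parses.
LISTING = """
<listing>
  <location>Miami, FL</location>
  <listed-price>250,000</listed-price>
  <phone>(305) 729 0831</phone>
  <comments>Fantastic house</comments>
</listing>
"""

# Manual 1-1 mappings: source element -> mediated-schema element.
MAPPING = {
    "location": "address",
    "listed-price": "price",
    "phone": "agent-phone",
    "comments": "description",  # assumed; the slide elides this pair
}


def training_examples(listing_xml, mapping):
    """Return (name_examples, value_examples) built from one listing."""
    root = ET.fromstring(listing_xml)
    name_examples, value_examples = [], []
    for elem in root:
        label = mapping.get(elem.tag)
        if label is None:
            continue  # unmapped element: no training signal
        name_examples.append((elem.tag, label))
        value_examples.append(((elem.text or "").strip(), label))
    return name_examples, value_examples


if __name__ == "__main__":
    names, values = training_examples(LISTING, MAPPING)
    print(names)
    print(values)
```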
20
Entity Recognizers

Use pre-programmed knowledge to identify specific types of entities
date, time, city, zip code, name, etc.
house-area (30 X 70, 500 sq. ft.)
county-name recognizer
Recognizers often have nice characteristics
easy to construct
many off-the-shelf research & commercial products
applicable across many domains
help with special cases that are hard to learn
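
A minimal sketch of what such recognizers might look like, assuming regular expressions for US zip codes and for phone numbers in the "(305) 729 0831" style used in these slides, plus a lookup table for county names. The label strings, the `recognize` function, and the three-entry county list are illustrative assumptions, not the recognizers used in LSD.

```python
import re

# Assumed formats: 5-digit or ZIP+4 codes, and "(ddd) ddd dddd" phone numbers.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")
PHONE_RE = re.compile(r"^\(\d{3}\)\s*\d{3}[\s-]?\d{4}$")

# Tiny illustrative lookup table; a real recognizer would load a full gazetteer.
COUNTIES = {"king", "pierce", "snohomish"}


def recognize(value):
    """Return (entity_type, confidence) for a recognized value, else None."""
    value = value.strip()
    if ZIP_RE.match(value):
        return ("zip-code", 1.0)
    if PHONE_RE.match(value):
        return ("phone", 1.0)
    if value.casefold() in COUNTIES:
        return ("county-name", 1.0)
    return None


if __name__ == "__main__":
    for sample in ["(305) 729 0831", "98195", "King", "Fantastic house"]:
        print(sample, "->", recognize(sample))
```

Returning `None` for unrecognized values lets a recognizer abstain, so it only influences the elements it was built for.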

21
Meta-Learner: Stacking

Training of the meta-learner produces a weight for every pair of
(base-learner, mediated-schema element)
weight(Name-Learner, address) = 0.1
weight(Naive-Bayes, address) = 0.9
Combining predictions: the meta-learner computes the weighted sum of base-learner confidence scores

Instance: <area>Seattle, WA</>
Name Learner: (address, 0.6)
Naive Bayes: (address, 0.8)
Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
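
The weighted sum on this slide can be written out directly. The sketch below uses the two weights and two confidence scores from the slide; the dictionary layout and the function name `combine` are assumptions made for illustration.

```python
# Weights learned per (base learner, mediated-schema element), from the slide.
WEIGHTS = {
    ("name-learner", "address"): 0.1,
    ("naive-bayes", "address"): 0.9,
}


def combine(label, confidences, weights=WEIGHTS):
    """Weighted sum of base-learner confidence scores for one label."""
    return sum(
        weights.get((learner, label), 0.0) * confidence
        for learner, confidence in confidences.items()
    )


if __name__ == "__main__":
    # <area>Seattle, WA</>: Name Learner says 0.6, Naive Bayes says 0.8.
    score = combine("address", {"name-learner": 0.6, "naive-bayes": 0.8})
    print(round(score, 2))  # 0.6*0.1 + 0.8*0.9
```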
22
Training the Meta-Learner

For address

Extracted XML instances        Name Learner   Naive Bayes   True predictions
<location> Miami, FL</>        0.5            0.8           1
<listed-price> 250,000</>      0.4            0.3           0
<area> Seattle, WA </>         0.3            0.9           1
<house-addr>Kent, WA</>        0.6            0.8           1
<num-baths>3</>                0.3            0.3           0
...                            ...            ...           ...
Least-Squares Linear Regression
Weight(Name-Learner, address) = 0.1
Weight(Naive-Bayes, address) = 0.9
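
A minimal sketch of the regression step, assuming a two-variable least-squares fit without an intercept, solved through the normal equations. Each row holds (Name Learner score, Naive Bayes score, true prediction). The slide shows only the first few training rows, so fitting them alone is not expected to reproduce the 0.1 and 0.9 weights. The function name `fit_weights` and the no-intercept form are assumptions.

```python
def fit_weights(rows):
    """Least-squares fit of y ~ w_name * a + w_bayes * b, with no intercept.

    rows: iterable of (a, b, y), where a and b are the two base-learner
    confidence scores and y is 1 if the instance truly matches, else 0.
    """
    rows = list(rows)
    saa = sum(a * a for a, _, _ in rows)
    sbb = sum(b * b for _, b, _ in rows)
    sab = sum(a * b for a, b, _ in rows)
    say = sum(a * y for a, _, y in rows)
    sby = sum(b * y for _, b, y in rows)
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("scores are collinear; the weights are not unique")
    # Solve the 2x2 normal equations by Cramer's rule.
    w_name = (say * sbb - sby * sab) / det
    w_bayes = (sby * saa - say * sab) / det
    return w_name, w_bayes


# The rows visible on the slide (the full training set is elided there).
SLIDE_ROWS = [
    (0.5, 0.8, 1),
    (0.4, 0.3, 0),
    (0.3, 0.9, 1),
    (0.6, 0.8, 1),
    (0.3, 0.3, 0),
]

if __name__ == "__main__":
    print(fit_weights(SLIDE_ROWS))
```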
23
Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
area
Instances: <area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
Name Learner + Naive Bayes -> Meta-Learner, per instance: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3)
Combined prediction: (address,0.7), (description,0.3)
day-phone
Instances: <day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
Name Learner + Naive Bayes -> Meta-Learner, combined prediction: (agent-phone,0.9), (description,0.1)
extra-info
Instances: <extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
Name Learner + Naive Bayes -> Meta-Learner, combined prediction: (description,0.8), (address,0.2)
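
The per-instance predictions for `area` on this slide are consistent with a simple average: (0.8 + 0.6 + 0.7) / 3 = 0.7 for address and (0.2 + 0.4 + 0.3) / 3 = 0.3 for description. The sketch below assumes that averaging is how instance-level predictions are merged into one prediction per source element; the slide itself does not name the rule, and `combine_instances` is a hypothetical helper.

```python
from collections import defaultdict


def combine_instances(instance_predictions):
    """Average per-instance label confidences into one prediction per element."""
    if not instance_predictions:
        return {}
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, confidence in prediction.items():
            totals[label] += confidence
    count = len(instance_predictions)
    return {label: total / count for label, total in totals.items()}


# Meta-learner output for the three <area> instances on the slide.
AREA = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]

if __name__ == "__main__":
    combined = combine_instances(AREA)
    print({label: round(c, 2) for label, c in combined.items()})
```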
24
The Constraint Handler

Extends learning to incorporate constraints
hard constraints
a = address & b = address => a = b
a = house-id => a is a key
a = agent-info & b = agent-name => b is nested in a
soft constraints
a = agent-phone & b = agent-name => a & b are usually close to each other
user feedback = hard or soft constraints
Details in [Doan et al., SIGMOD 2001]

25
The Current LSD System
[Diagram: LSD architecture, split into a Training Phase and a Matching Phase]
Inputs: mediated schema, source schemas, data listings, domain constraints, user feedback
Components: Base-Learner 1, ..., Base-Learner k, Meta-Learner, Constraint Handler
Output: mappings
26
Outline

Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.

27
Empirical Evaluation

Four domains
Real Estate I & II, Course Offerings, Faculty Listings
For each domain
create mediated DTD & domain constraints
choose five sources
extract & convert data listings into XML (faithful to schema!)
mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48

Ten runs for each experiment - in each run:
manually provide 1-1 mappings for 3 sources
ask LSD to propose mappings for remaining 2 sources
accuracy = % of 1-1 mappings correctly identified

28
Matching Accuracy
[Chart: average matching accuracy (%)]
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
Meta-learner: + 5 - 22%
Constraint handler: + 7 - 13%
XML learner: + 0.8 - 6%
29
Sensitivity to Amount of Available Data
[Plot: average matching accuracy (%) vs. number of data listings per source (Real Estate I)]
30
Contribution of Schema vs. Data
[Chart: average matching accuracy (%) for three configurations]

LSD with only schema info.
LSD with only data info.
Complete LSD

More experiments in the paper [Doan et al. 01]

31
Reasons for Incorrect Matching

Unfamiliarity
suburb
solution: add a suburb-name recognizer
Insufficient information
correctly identified general type, failed to pinpoint exact type
<agent-name>Richard Smith</><phone> (206) 234 5412 </>
solution: add a proximity learner
Subjectivity
house-style = description?

32
Outline

Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.

33
Moving Up the Expressiveness Ladder

Schemas are very simple ontologies.
More expressive power = more domain constraints.
Mappings become more complex, but constraints provide more to learn from.
Non 1-1 mappings
F1(A1,...,Am) = F2(B1,...,Bm)
Ontologies (of various flavors)
Class hierarchy (i.e., containment on unary relations)
Relationships between objects
Constraints on relationships

34
Finding Non 1-1 Mappings (Current work)

Given two schemas, find
1-many mappings: address = concat(city, state)
many-1: half-baths + full-baths = num-baths
many-many: concat(addr-line1, addr-line2) = concat(street, city, state)
1-many mappings
expressed as query
value correspondence expression: room-rate = rate * (1 + tax-rate)
relationship: state of tax-rate = state of hotel that has rate
special case: 1-many mappings between two relational tables

Mediated schema: address, description, num-baths
Source schema: city, state, comments, half-baths, full-baths
35
Brute-Force Solution

Define a set of operators
concat, +, -, *, /, etc.
For each set of mediated-schema columns
enumerate all possible mappings
evaluate & return best mapping

[Diagram: source-schema columns -> candidate mappings m1, m2, ..., mk -> compute similarity to mediated-schema columns using all base learners -> best mapping m1]
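
A minimal sketch of the brute-force idea, restricted to the `concat` operator. It enumerates every ordered concatenation of up to two source columns and keeps the one most similar to the mediated-schema column. The slides score candidates with the base learners; here a crude format-signature similarity (`shape`) stands in for them, and the `", "` separator, the function names, and the city and state values are all assumptions made for illustration.

```python
import re
from itertools import permutations


def shape(value):
    """Reduce a value to a format signature, e.g. 'Miami, FL' -> 'Aa, AA'."""
    def repl(match):
        token = match.group(0)
        if token.isdigit():
            return "9"
        return "AA" if token.isupper() else "Aa"
    return re.sub(r"[A-Za-z]+|\d+", repl, value)


def similarity(col_a, col_b):
    """Jaccard overlap of the two columns' format signatures (assumed measure)."""
    a, b = {shape(v) for v in col_a}, {shape(v) for v in col_b}
    return len(a & b) / len(a | b) if a | b else 0.0


def candidates(source, max_arity=2):
    """Yield (expression, column) for each ordered concat of source columns."""
    for k in range(1, max_arity + 1):
        for names in permutations(source, k):
            rows = zip(*(source[n] for n in names))
            column = [", ".join(parts) for parts in rows]
            expr = names[0] if k == 1 else "concat(" + ", ".join(names) + ")"
            yield expr, column


def best_mapping(mediated_column, source):
    """Evaluate all candidates and return the best expression."""
    best = max(candidates(source), key=lambda c: similarity(c[1], mediated_column))
    return best[0]


# Made-up sample data; only the column names come from the slides.
SOURCE = {
    "city": ["Seattle", "Kent"],
    "state": ["WA", "WA"],
    "comments": ["Great beach", "Beautiful yard"],
}
ADDRESS = ["Miami, FL", "Boston, MA"]

if __name__ == "__main__":
    print(best_mapping(ADDRESS, SOURCE))
```

The number of candidates grows factorially with the number of columns and the arity, which is what motivates the search-based alternative.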
36
Search-Based Solution

States = columns
goal state = mediated-schema column
initial states = all source-schema columns
use 1-1 matching to reduce the set of initial states
Operators: concat, +, -, *, /, etc.
Column-similarity
use all base learners & recognizers

37
Multi-Strategy Search

Use a set of expert modules: L1, L2, ..., Ln
Each module
applies to only certain types of mediated-schema column
searches a small subspace
uses a cheap similarity measure to compare columns
Example
L1: text; concat; TF/IDF
L2: numeric; +, -, *, /; [Ho et al. 2000]
L3: address; concat; Naive Bayes
Search techniques
beam search as default
specialized, do not have to materialize columns
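
A minimal sketch of beam search for a numeric expert module, assuming `+` is the only operator and that columns with closer means are more similar. Both are simplifications: the slide cites [Ho et al. 2000] for the numeric similarity measure, which is not reproduced here, and this sketch does materialize candidate columns. The function names and all data values are illustrative assumptions.

```python
from statistics import mean


def similarity(column, target):
    """Toy measure (an assumption): closer means give a higher score."""
    return -abs(mean(column) - mean(target))


def beam_search(target, source, beam_width=2, max_depth=3):
    """Search over sums of source columns; return (score, column names)."""
    def scored(names):
        column = [sum(row) for row in zip(*(source[n] for n in names))]
        return (similarity(column, target), names)

    # Initial states: every single source column, pruned to the beam width.
    beam = sorted((scored((n,)) for n in source), reverse=True)[:beam_width]
    best = beam[0]
    for _ in range(max_depth - 1):
        # Extend each state in the beam by one unused column. Sums are
        # commutative, so sort the names to avoid duplicate states.
        expanded = {
            tuple(sorted(names + (n,)))
            for _, names in beam
            for n in source
            if n not in names
        }
        if not expanded:
            break
        beam = sorted((scored(names) for names in expanded), reverse=True)[:beam_width]
        best = max(best, beam[0])
    return best


# Made-up sample data; only the column names come from the slides.
SOURCE = {
    "bedrooms": [3, 4],
    "full-baths": [2, 2],
    "half-baths": [1, 0],
}
NUM_BATHS = [3, 2]  # values of the mediated-schema column, from another source

if __name__ == "__main__":
    score, names = beam_search(NUM_BATHS, SOURCE)
    print(" + ".join(names), score)
```

With a beam width of 2, `half-baths` is pruned as an initial state but is still reached by extending `full-baths`, which shows why the beam keeps more than one partial candidate.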

38
Multi-Strategy Search (cont'd)

Apply all applicable expert modules

L1: m11, m12, m13, ..., m1x
L2: m21, m22, m23, ..., m2y
L3: m31, m32, m33, ..., m3z

Combine modules' predictions & select the best one

[Diagram: m11, m12, m21, m22, m31, m32 -> compute similarity using all base learners -> m11]
39
Related Work
Recognizers + Schema + 1-1 Matching: TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
Single Learner + 1-1 Matching: SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95]
Hybrid + 1-1 Matching: DELTA [Clifton et al. 97]
Multi-Strategy Learning (Learners + Recognizers, Schema + Data, 1-1 + non 1-1 Matching): LSD [Doan et al. 2000, 2001]
Schema + Data, 1-1 + non 1-1 Matching, Sophisticated Data-Driven User Interaction: CLIO [Miller et al. 00, Yan et al. 01]
?
40
Summary

LSD
uses multi-strategy learning to
semi-automatically generate semantic mappings.
LSD is extensible and incorporates domain and
user knowledge, and previous techniques.
Experimental results show the approach is very
promising.
Future work and issues to ponder
Accommodating more expressive languages: ontologies
Reuse of learned concepts from related domains.
Semantics?
Data management is a fertile area for Machine
Learning research!

41
Backup Slides
42
Mapping Maintenance
[Diagram: mappings m1, m2, m3 between Source-schema S and Mediated-schema M]

Ten months later ...
are the mappings still correct?

[Diagram repeated: mappings m1, m2, m3 between Source-schema S and Mediated-schema M]
43
Information Extraction from Text

Extract data fragments from text documents
date, location, & victim's name from a news article
Intensive research on free-text documents
Many documents do have substantial structure
XML pages, name cards, tables, lists
Each such document = a data source
structure forms a schema
only one data value per schema element
a real data source has many data values per schema element
Ongoing research in the IE community

44
Contribution of Each Component
[Chart: average matching accuracy (%) for five configurations: without Name Learner; without Naive Bayes; without Whirl Learner; without Constraint Handler; the complete LSD system]
45
Exploiting Hierarchical Structure

Existing learners flatten out all structures
Developed XML learner
similar to the Naive Bayes learner
input instance = bag of tokens
differs in one crucial aspect
considers not only text tokens, but also structure tokens

<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>
<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
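
A minimal sketch of the difference between a flattened bag of text tokens and one that also carries structure tokens. Here each element tag contributes one structure token such as `<name>`. This tokenization and the function name `bag_of_tokens` are assumptions for illustration; the slides do not spell out the exact structure tokens the XML learner uses.

```python
import re
import xml.etree.ElementTree as ET


def bag_of_tokens(xml_text, with_structure=True):
    """Return lowercase text tokens, plus one token per element tag if requested."""
    root = ET.fromstring(xml_text)
    bag = []
    for elem in root.iter():  # document order
        if with_structure:
            bag.append("<" + elem.tag + ">")  # structure token
        bag.extend(re.findall(r"[a-z0-9]+", (elem.text or "").lower()))
    # Text that follows a child's closing tag (its "tail") belongs to the parent.
    for elem in root.iter():
        bag.extend(re.findall(r"[a-z0-9]+", (elem.tail or "").lower()))
    return bag


CONTACT = "<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>"

if __name__ == "__main__":
    print(bag_of_tokens(CONTACT, with_structure=False))
    print(bag_of_tokens(CONTACT))
```

Without structure tokens, the contact instance reduces to words that also appear in the description text, so the two are hard to tell apart. The `<name>` and `<firm>` tokens give the classifier evidence that plain text lacks.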
46
Domain Constraints

Impose semantic regularities on sources
verified using schema or data
Examples
a = address & b = address => a = b
a = house-id => a is a key
a = agent-info & b = agent-name => b is nested in a
Can be specified up front
when creating mediated schema
independent of any actual source schema

47
The Constraint Handler
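Domain constraint: a = address & b = address => a = b
Predictions from Meta-Learner:
area: (address, 0.7), (description, 0.3)
contact-phone: (agent-phone, 0.9), (description, 0.1)
extra-info: (address, 0.6), (description, 0.4)
Candidate mapping combinations and their scores:
least confident label for each element: 0.3 * 0.1 * 0.4 = 0.012
area = address, contact-phone = agent-phone, extra-info = address: 0.7 * 0.9 * 0.6 = 0.378 (two elements map to address, violating the constraint)
area = address, contact-phone = agent-phone, extra-info = description: 0.7 * 0.9 * 0.4 = 0.252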
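
The arithmetic on this slide can be reproduced with a small sketch: score each combination of labels by the product of its confidences, discard combinations that violate the domain constraint, and keep the best survivor. Exhaustive enumeration is an assumption made for brevity; the slides refer to the SIGMOD 2001 paper for how the constraint handler actually searches. The function names are hypothetical.

```python
from itertools import product

# Meta-learner predictions from the slide.
PREDICTIONS = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(assignment):
    """Hard constraint: a = address & b = address => a = b."""
    return list(assignment.values()).count("address") <= 1


def best_assignment(predictions, constraints):
    """Return the highest-scoring assignment that satisfies every constraint."""
    elements = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[e] for e in elements)):
        assignment = dict(zip(elements, labels))
        if not all(check(assignment) for check in constraints):
            continue
        score = 1.0
        for element, label in assignment.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


if __name__ == "__main__":
    print(best_assignment(PREDICTIONS, []))                     # unconstrained
    print(best_assignment(PREDICTIONS, [at_most_one_address]))  # constrained
```

Without the constraint the 0.378 combination wins. With it, the handler falls back to the 0.252 combination, in which `extra-info` maps to `description`.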