Title: Alon Halevy
1Learning to Map Between Schemas & Ontologies
- Alon Halevy
- University of Washington
- Joint work with Anhai Doan and Pedro Domingos
2Agenda
- Ontology mapping is a key problem in many applications:
- Data integration
- Semantic web
- Knowledge management
- E-commerce
- LSD
- Solution that uses multi-strategy learning.
- We've started with schema matching (i.e., very simple ontologies)
- Currently extending to more expressive ontologies.
- Experiments show the approach is very promising!
3The Structure Mapping Problem
- Types of structures
- Database schemas, XML DTDs, ontologies, ...
- Input
- Two (or more) structures, S1 and S2
- Data instances for S1 and S2
- Background knowledge
- Output
- A mapping between S1 and S2
- Should enable translating between data instances.
- Semantics of mapping?
4Semantic Mappings between Schemas
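[Figure: two house schemas. Schema 1: house (address, num-baths, contact-info (agent-name, agent-phone)). Schema 2: house (location, full-baths, half-baths, contact (name, phone)). Arrows between the two mark 1-1 mappings and a non 1-1 mapping.]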
5Motivation
- Database schema integration
- A problem as old as databases themselves.
- database merging, data warehouses, data migration
- Data integration / information gathering agents
- On the WWW, in enterprises, large science projects
- Model management
- Model matching: key operator in an algebra where models and mappings are first-class objects.
- See Bernstein et al., 2000 for more.
- The Semantic Web
- Ontology mapping.
- System interoperability
- E-services, application integration, B2B applications, ...
6Desiderata from Proposed Solutions
- Accuracy, efficiency, ease of use.
- Realistic expectations
- Unlikely to be fully automated. Need user in the loop.
- Some notion of semantics for mappings.
- Extensibility
- Solution should exploit additional background knowledge.
- Memory, knowledge reuse
- System should exploit previous manual or automatically generated matchings.
- Key idea behind LSD.
7LSD Overview
- L(earning) S(ource) D(escriptions)
- Problem: generating semantic mappings between mediated schema and a large set of data source schemas.
- Key idea: generate the first mappings manually, and learn from them to generate the rest.
- Technique: multi-strategy learning (extensible!)
- Step 1
- [SIGMOD, 2001]: 1-1 mappings between XML DTDs.
- Current focus
- Complex mappings
- Ontology mapping.
8Outline
- Overview of structure mapping
- Data integration and source mappings
- LSD architecture and details
- Experimental results
- Current work.
9Data Integration
Query: Find houses with four bathrooms priced under 500,000
[Figure: the query is posed against the mediated schema; query reformulation and optimization sits between the mediated schema and source schemas 1, 2, and 3, which are reached through wrappers over homes.com, realestate.com, and homeseekers.com.]
- Applications: WWW, enterprises, science projects
- Techniques: virtual data integration, warehousing, custom code.
10Semantic Mappings between Schemas
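[Figure: two house schemas. Schema 1: house (address, num-baths, contact-info (agent-name, agent-phone)). Schema 2: house (location, full-baths, half-baths, contact (name, phone)). Arrows between the two mark 1-1 mappings and a non 1-1 mapping.]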
11Semantics (preliminary)
- Semantics of mappings has received no attention.
- Semantics of 1-1 mappings
- Given:
- $R(A_1,\dots,A_n)$ and $S(B_1,\dots,B_m)$
- 1-1 mappings $(A_i,B_j)$
- Then, we postulate the existence of a relation W, s.t.
- $\pi_{C_1,\dots,C_k}(W) = \pi_{A_1,\dots,A_k}(R)$,
- $\pi_{C_1,\dots,C_k}(W) = \pi_{B_1,\dots,B_k}(S)$,
- W also includes the unmatched attributes of R and S.
- In English: R and S are projections on some universal relation W, and the mappings specify the projection variables and correspondences.
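The projection reading of 1-1 mappings can be checked on toy data. The following is a minimal sketch, assuming the garbled operator on the slide is an equality between projections; the relation contents and the W attribute names (`addr`, `phone`, `price`, `baths`) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]

def project(rows: List[Row], attrs: List[str]) -> Set[Tuple[str, ...]]:
    """Relational projection: keep only `attrs`, dropping duplicate tuples."""
    return {tuple(row[a] for a in attrs) for row in rows}

# Hypothetical universal relation W.
W = [
    {"addr": "Miami, FL", "phone": "(305) 729 0831", "price": "250,000", "baths": "3"},
    {"addr": "Boston, MA", "phone": "(617) 253 1429", "price": "110,000", "baths": "2"},
]

# R and S each expose part of W under their own attribute names.
# "listed-price" and "num-baths" are the unmatched attributes.
R = [{"location": r["addr"], "phone": r["phone"], "listed-price": r["price"]} for r in W]
S = [{"address": r["addr"], "agent-phone": r["phone"], "num-baths": r["baths"]} for r in W]

# Each 1-1 mapping (Ai, Bj) is paired with the W attribute Ci it stands for.
mappings = [("location", "address", "addr"), ("phone", "agent-phone", "phone")]
a_attrs = [a for a, _, _ in mappings]
b_attrs = [b for _, b, _ in mappings]
c_attrs = [c for _, _, c in mappings]

# The mapping is consistent if all three projections coincide.
holds = project(W, c_attrs) == project(R, a_attrs) == project(S, b_attrs)
print(holds)
```

The unmatched attributes stay in W but play no part in the check, which mirrors the remark that W also includes them.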
12Why Matching is Difficult
- Aims to identify same real-world entity
- using names, structures, types, data values, etc.
- Schemas represent same entity differently
- different names => same entity
- area, address => location
- same names => different entities
- area => location or square-feet
- Schema & data never fully capture semantics!
- not adequately documented, not sufficiently expressive
- Intended semantics is typically subjective!
- IBM Almaden Lab = IBM?
- Cannot be fully automated. Often hard for humans. Committees are required!
13Current State of Affairs
- Finding semantic mappings is now the bottleneck!
- largely done by hand
- labor intensive & error prone
- GTE: 4 hours/element for 27,000 elements [Li & Clifton 00]
- Will only be exacerbated
- data sharing & XML become pervasive
- proliferation of DTDs
- translation of legacy data
- reconciling ontologies on semantic web
- Need semi-automatic approaches to scale up!
14Outline
- Overview of structure mapping
- Data integration and source mappings
- LSD architecture and details
- Experimental results
- Current work.
15The LSD Approach
- User manually maps a few data sources to the mediated schema.
- LSD learns from the mappings, and proposes mappings for the rest of the sources.
- Several types of knowledge are used in learning:
- Schema elements, e.g., attribute names
- Data elements: ranges, formats, word frequencies, value frequencies, length of texts.
- Proximity of attributes
- Functional dependencies, number of attribute occurrences.
- One learner does not fit all. Use multiple learners and combine with meta-learner.
16Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data listings:
- location: Miami, FL; Boston, MA; ...
- listed-price: 250,000; 110,000; ...
- phone: (305) 729 0831; (617) 253 1429; ...
- comments: Fantastic house; Great location; ...
Learned hypotheses:
- If "phone" occurs in the name => agent-phone
- If "fantastic" & "great" occur frequently in data values => description
homes.com data listings:
- price: 550,000; 320,000; ...
- contact-phone: (278) 345 7215; (617) 335 2315; ...
- extra-info: Beautiful yard; Great beach; ...
17Multi-Strategy Learning
- Use a set of base learners
- Name learner, Naïve Bayes, Whirl, XML learner
- And a set of recognizers
- County name, zip code, phone numbers.
- Each base learner produces a prediction weighted by confidence score.
- Combine base learners with a meta-learner, using stacking.
18Base Learners
- Name Learner
- training pairs: (contact-info,office-address), (contact,agent-phone), (phone,agent-phone), (listed-price,price)
- new element: (contact-phone, ?)
- contact-phone => (agent-phone,0.7), (office-address,0.3)
- Naive Bayes Learner [Domingos & Pazzani 97]
- "Kent, WA" => (address,0.8), (name,0.2)
- Whirl Learner [Cohen & Hirsh 98]
- XML Learner
- exploits hierarchical structure of XML data
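To make the name-based prediction concrete, here is a minimal sketch of a name learner. It is not LSD's actual learner: it uses token-overlap (Jaccard) similarity as a simple stand-in, and the function names are hypothetical. The training pairs and the query `contact-phone` are the ones on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def tokens(name: str) -> Set[str]:
    """Split an element name such as 'contact-phone' into lowercase tokens."""
    return set(name.lower().replace("_", "-").split("-"))

def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def name_learner(training: List[Tuple[str, str]], query: str) -> Dict[str, float]:
    """Score each mediated-schema label by the similarity of the query name
    to the source names already mapped to that label, then normalize."""
    scores: Dict[str, float] = defaultdict(float)
    for source_name, label in training:
        scores[label] += jaccard(tokens(source_name), tokens(query))
    total = sum(scores.values())
    if total == 0:
        return {}
    return {label: s / total for label, s in scores.items() if s > 0}

training = [
    ("contact-info", "office-address"),
    ("contact", "agent-phone"),
    ("phone", "agent-phone"),
    ("listed-price", "price"),
]
predictions = name_learner(training, "contact-phone")
print(predictions)
```

On this toy input the sketch ranks agent-phone above office-address, in the same order as the slide's example; the exact confidence values depend on the similarity measure chosen.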
19Training the Base Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data listings:
<location> Miami, FL </> <listed-price> 250,000 </> <phone> (305) 729 0831 </> <comments> Fantastic house </>
<location> Boston, MA </> <listed-price> 110,000 </> <phone> (617) 253 1429 </> <comments> Great location </>
Name Learner training examples: (location, address), (listed-price, price), (phone, agent-phone), ...
Naive Bayes Learner training examples: ("Miami, FL", address), ("250,000", price), ("(305) 729 0831", agent-phone), ...
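Training a data-based learner from (value, label) pairs can be sketched as follows. This is a generic multinomial Naive Bayes with Laplace smoothing, offered as an illustration and not as LSD's exact implementation; the class name, the tokenizer, and the query string are assumptions. The training values are taken from the realestate.com listings on the slide.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def tokenize(text: str) -> List[str]:
    """Lowercase word and digit-run tokens."""
    return re.findall(r"[a-z]+|\d+", text.lower())

class NaiveBayesLearner:
    def __init__(self) -> None:
        self.token_counts: Dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocab = set()

    def train(self, examples: List[Tuple[str, str]]) -> None:
        for value, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(value):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, value: str) -> Dict[str, float]:
        """Return normalized confidence scores for every known label."""
        toks = tokenize(value)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            lp = math.log(n / total)
            for tok in toks:
                # Laplace smoothing keeps unseen tokens from zeroing the score.
                lp += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = lp
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"), ("Boston, MA", "address"),
    ("250,000", "price"), ("110,000", "price"),
    ("(305) 729 0831", "agent-phone"), ("(617) 253 1429", "agent-phone"),
    ("Fantastic house", "description"), ("Great location", "description"),
])
scores = learner.predict("Great house")  # hypothetical query value
print(max(scores, key=scores.get))
```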
20Entity Recognizers
- Use pre-programmed knowledge to identify specific types of entities
- date, time, city, zip code, name, etc.
- house-area (30 X 70, 500 sq. ft.)
- county-name recognizer
- Recognizers often have nice characteristics
- easy to construct
- many off-the-shelf research & commercial products
- applicable across many domains
- help with special cases that are hard to learn
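A recognizer can be as simple as a pattern applied to a column of values. The sketch below is an assumption-laden illustration: the regular expressions cover only US-style phone numbers and zip codes in the formats seen on these slides, and `make_recognizer` and the label names are hypothetical.

```python
import re
from typing import Callable, Dict, List, Pattern

# Patterns are illustrative; real recognizers would be far more tolerant.
PHONE: Pattern[str] = re.compile(r"^\(\d{3}\)\s*\d{3}[\s-]\d{4}$")
ZIP: Pattern[str] = re.compile(r"^\d{5}(-\d{4})?$")

def make_recognizer(pattern: Pattern[str], label: str) -> Callable[[List[str]], Dict[str, float]]:
    """Build a recognizer whose confidence is the fraction of values matching."""
    def recognize(values: List[str]) -> Dict[str, float]:
        if not values:
            return {label: 0.0}
        hits = sum(1 for v in values if pattern.match(v.strip()))
        return {label: hits / len(values)}
    return recognize

phone_recognizer = make_recognizer(PHONE, "phone-number")
zip_recognizer = make_recognizer(ZIP, "zip-code")

phone_result = phone_recognizer(["(305) 729 0831", "(617) 253 1429"])
city_result = phone_recognizer(["Miami, FL", "Boston, MA"])
zip_result = zip_recognizer(["98195", "02139-4307", "Kent, WA"])
print(phone_result, city_result, zip_result)
```

Because a recognizer returns a label with a confidence score, it can be combined with the base learners in the same way as any other predictor.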
21Meta-Learner: Stacking
- Training of meta-learner produces a weight for every pair of
- (base-learner, mediated-schema element)
- weight(Name-Learner,address) = 0.1
- weight(Naive-Bayes,address) = 0.9
- Combining predictions: meta-learner
- computes weighted sum of base-learner confidence scores
Example for <area> Seattle, WA </>:
- Name Learner: (address,0.6)
- Naive Bayes: (address,0.8)
- Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
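The weighted-sum combination on this slide can be written out directly. This is a minimal sketch using the slide's weights and confidence scores for the label address; the function and dictionary names are hypothetical.

```python
from typing import Dict, Tuple

# Weights per (base learner, mediated-schema element), as on the slide.
weights: Dict[Tuple[str, str], float] = {
    ("Name-Learner", "address"): 0.1,
    ("Naive-Bayes", "address"): 0.9,
}

# Base-learner confidence scores for the instance <area> Seattle, WA </>.
base_predictions: Dict[str, Dict[str, float]] = {
    "Name-Learner": {"address": 0.6},
    "Naive-Bayes": {"address": 0.8},
}

def combine(label: str) -> float:
    """Weighted sum of base-learner confidence scores for one label."""
    return sum(
        weights.get((learner, label), 0.0) * scores.get(label, 0.0)
        for learner, scores in base_predictions.items()
    )

address_score = combine("address")
print(round(address_score, 2))  # 0.6*0.1 + 0.8*0.9
```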
22Training the Meta-Learner
Extracted XML instances: <location> Miami, FL </> <listed-price> 250,000 </> <area> Seattle, WA </> <house-addr> Kent, WA </> <num-baths> 3 </> ...
Instance | Name Learner | Naive Bayes | True prediction (address)
<location> Miami, FL </> | 0.5 | 0.8 | 1
<listed-price> 250,000 </> | 0.4 | 0.3 | 0
<area> Seattle, WA </> | 0.3 | 0.9 | 1
<house-addr> Kent, WA </> | 0.6 | 0.8 | 1
<num-baths> 3 </> | 0.3 | 0.3 | 0
... | ... | ... | ...
Least-squares linear regression:
- Weight(Name-Learner,address) = 0.1
- Weight(Naive-Bayes,address) = 0.9
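The regression step can be sketched in pure Python for the two-learner case. The rows below are the five (Name Learner score, Naive Bayes score, true label) triples as we read them off this slide's table; that grouping is our reading of the extracted numbers. The slide's weights of 0.1 and 0.9 appear to be illustrative, so this toy fit should not be expected to reproduce them. The function name is hypothetical and the model has no intercept.

```python
from typing import List, Tuple

ScoreRow = Tuple[float, float, float]  # (name-learner score, naive-bayes score, true label)

def fit_two_weights(rows: List[ScoreRow]) -> Tuple[float, float]:
    """Least-squares fit of y ~ w1*x1 + w2*x2 via the 2x2 normal equations."""
    a = sum(x1 * x1 for x1, _, _ in rows)
    b = sum(x1 * x2 for x1, x2, _ in rows)
    c = sum(x2 * x2 for _, x2, _ in rows)
    p = sum(x1 * y for x1, _, y in rows)
    q = sum(x2 * y for _, x2, y in rows)
    det = a * c - b * b
    if abs(det) < 1e-12:
        raise ValueError("learner scores are collinear; weights are not unique")
    return (c * p - b * q) / det, (a * q - b * p) / det

rows: List[ScoreRow] = [
    (0.5, 0.8, 1.0),
    (0.4, 0.3, 0.0),
    (0.3, 0.9, 1.0),
    (0.6, 0.8, 1.0),
    (0.3, 0.3, 0.0),
]
w_name, w_bayes = fit_two_weights(rows)
print(w_name, w_bayes)
```

With more than two base learners the same idea applies, but a general linear solver would replace the closed-form 2x2 inverse.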
23Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
Each instance goes through the Name Learner and Naive Bayes, then the Meta-Learner:
- <area> Seattle, WA </> => (address,0.8), (description,0.2)
- <area> Kent, WA </> => (address,0.6), (description,0.4)
- <area> Austin, TX </> => (address,0.7), (description,0.3)
- area, combined over its instances => (address,0.7), (description,0.3)
- <day-phone> (278) 345 7215 </> <day-phone> (617) 335 2315 </> <day-phone> (512) 427 1115 </> => (agent-phone,0.9), (description,0.1)
- <extra-info> Beautiful yard </> <extra-info> Great beach </> <extra-info> Close to Seattle </> => (description,0.8), (address,0.2)
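One way to go from per-instance predictions to a single prediction for a source element is to average the scores over its instances. The slide does not state the combination rule, so averaging is an assumption here; it does agree with the numbers shown for area. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

def combine_instances(instance_predictions: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the per-instance confidence scores label by label."""
    totals: Dict[str, float] = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}

# Meta-learner output for the three <area> instances on the slide.
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = combine_instances(area_instances)
print({label: round(score, 2) for label, score in area_prediction.items()})
```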
24The Constraint Handler
- Extends learning to incorporate constraints
- hard constraints
- a = address & b = address => a = b
- a = house-id => a is a key
- a = agent-info & b = agent-name => b is nested in a
- soft constraints
- a = agent-phone & b = agent-name => a & b are usually close to each other
- user feedback = hard or soft constraints
- Details in [Doan et al., SIGMOD 2001]
25The Current LSD System
[Figure: the current LSD system, with a training phase and a matching phase. Components shown: mediated schema, source schemas, data listings, domain constraints, user feedback, Base-Learner 1 ... Base-Learner k, Meta-Learner, Constraint Handler, and the output mappings.]
26Outline
- Overview of structure mapping
- Data integration and source mappings
- LSD architecture and details
- Experimental results
- Current work.
27Empirical Evaluation
- Four domains
- Real Estate I & II, Course Offerings, Faculty Listings
- For each domain
- create mediated DTD & domain constraints
- choose five sources
- extract & convert data listings into XML (faithful to schema!)
- mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
- Ten runs for each experiment - in each run:
- manually provide 1-1 mappings for 3 sources
- ask LSD to propose mappings for remaining 2 sources
- accuracy = % of 1-1 mappings correctly identified
28Matching Accuracy
[Figure: average matching accuracy (%)]
- LSD's accuracy: 71 - 92%
- Best single base learner: 42 - 72%
- Meta-learner: +5 - 22%
- Constraint handler: +7 - 13%
- XML learner: +0.8 - 6%
29Sensitivity to Amount of Available Data
[Figure: average matching accuracy (%) vs. number of data listings per source (Real Estate I)]
30Contribution of Schema vs. Data
[Figure: average matching accuracy (%) for three configurations:]
- LSD with only schema info.
- LSD with only data info.
- Complete LSD
- More experiments in the paper [Doan et al. 01]
31Reasons for Incorrect Matching
- Unfamiliarity
- suburb
- solution: add a suburb-name recognizer
- Insufficient information
- correctly identified general type, failed to pinpoint exact type
- <agent-name> Richard Smith </> <phone> (206) 234 5412 </>
- solution: add a proximity learner
- Subjectivity
- house-style = description?
32Outline
- Overview of structure mapping
- Data integration and source mappings
- LSD architecture and details
- Experimental results
- Current work.
33Moving Up the Expressiveness Ladder
- Schemas are very simple ontologies.
- More expressive power = More domain constraints.
- Mappings become more complex, but constraints provide more to learn from.
- Non 1-1 mappings
- F1(A1,...,Am) = F2(B1,...,Bm)
- Ontologies (of various flavors)
- Class hierarchy (i.e., containment on unary relations)
- Relationships between objects
- Constraints on relationships
34Finding Non 1-1 Mappings (Current work)
- Given two schemas, find
- 1-many mappings: address = concat(city,state)
- many-1: half-baths + full-baths = num-baths
- many-many: concat(addr-line1,addr-line2) = concat(street,city,state)
- 1-many mappings
- expressed as query
- value correspondence expression: room-rate = rate * (1 + tax-rate)
- relationship: state of tax-rate = state of hotel that has rate
- special case: 1-many mappings between two relational tables
Mediated schema: address, description, num-baths
Source schema: city, state, comments, half-baths, full-baths
35 Brute-Force Solution
- Define a set of operators
- concat, +, -, *, /, etc.
- For each set of mediated-schema columns
- enumerate all possible mappings
- evaluate & return best mapping
[Figure: candidate mappings m1, m2, ..., mk over the source-schema columns are compared with the mediated-schema columns; similarity is computed using all base learners, and the best mapping (m1) is returned.]
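The brute-force idea can be sketched for a single operator, concat. Everything here is a toy under stated assumptions: the column data is made up, only concatenations of up to two columns are enumerated, and a crude "value shape" comparison stands in for the base learners that the slide uses to compute similarity.

```python
import re
from itertools import permutations
from typing import Dict, List, Tuple

def shape(value: str) -> str:
    """Abstract a value: digit runs -> 9, all-caps words -> A, other words -> a."""
    def token_shape(match):
        tok = match.group(0)
        if tok.isdigit():
            return "9"
        return "A" if tok.isupper() else "a"
    return re.sub(r"[A-Za-z]+|\d+", token_shape, value)

def similarity(candidate: List[str], target: List[str]) -> float:
    """Fraction of candidate values whose shape also occurs in the target column."""
    target_shapes = {shape(v) for v in target}
    return sum(shape(v) in target_shapes for v in candidate) / len(candidate)

def brute_force(source: Dict[str, List[str]], target: List[str],
                max_arity: int = 2) -> Tuple[Tuple[str, ...], float]:
    """Enumerate every ordered concat of up to `max_arity` source columns
    and return the best-scoring one."""
    best, best_score = (), -1.0
    for k in range(1, max_arity + 1):
        for cols in permutations(source, k):
            values = [", ".join(parts) for parts in zip(*(source[c] for c in cols))]
            score = similarity(values, target)
            if score > best_score:
                best, best_score = cols, score
    return best, best_score

# Hypothetical data: mediated-schema address values and a source's columns.
address_examples = ["Miami, FL", "Boston, MA"]
source_columns = {
    "city": ["Kent", "Austin"],
    "state": ["WA", "TX"],
    "comments": ["Great beach", "Beautiful yard"],
}
best_mapping, best_score = brute_force(source_columns, address_examples)
print(best_mapping, best_score)
```

The number of candidates grows factorially with the arity, which is the cost that a search-based approach tries to avoid.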
36 Search-Based Solution
- States = columns
- goal state: mediated-schema column
- initial states: all source-schema columns
- use 1-1 matching to reduce the set of initial states
- Operators: concat, +, -, *, /, etc.
- Column-similarity
- use all base learners & recognizers
37Multi-Strategy Search
- Use a set of expert modules L1, L2, ..., Ln
- Each module
- applies to only certain types of mediated-schema column
- searches a small subspace
- uses a cheap similarity measure to compare columns
- Example
- L1: text; concat; TF/IDF
- L2: numeric; +, -, *, /; [Ho et al. 2000]
- L3: address; concat; Naive Bayes
- Search techniques
- beam search as default
- specialized, do not have to materialize columns
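A beam search over concat can be sketched as below, as one possible shape for a text-oriented expert module. The similarity measure (difflib ratio over abstracted value shapes), the column data, and the beam width are all assumptions chosen to keep the example small; they are not the measures named on the slide.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

def shape(value: str) -> str:
    """Abstract a value: digit runs -> 9, all-caps words -> A, other words -> a."""
    def token_shape(match):
        tok = match.group(0)
        return "9" if tok.isdigit() else ("A" if tok.isupper() else "a")
    return re.sub(r"[A-Za-z]+|\d+", token_shape, value)

def similarity(candidate: List[str], target_shapes: set) -> float:
    """Cheap graded similarity: mean best shape-match ratio per candidate value."""
    total = 0.0
    for value in candidate:
        total += max(SequenceMatcher(None, shape(value), s).ratio() for s in target_shapes)
    return total / len(candidate)

def concat(columns: Dict[str, List[str]], names: Tuple[str, ...]) -> List[str]:
    return [", ".join(parts) for parts in zip(*(columns[n] for n in names))]

def beam_search(columns: Dict[str, List[str]], target: List[str],
                beam_width: int = 3, max_depth: int = 2) -> Tuple[float, Tuple[str, ...]]:
    """Grow concat expressions one column at a time, keeping the top states."""
    target_shapes = {shape(v) for v in target}

    def score(names: Tuple[str, ...]) -> float:
        return similarity(concat(columns, names), target_shapes)

    # Initial states: every single source column.
    beam = sorted(((score((n,)), (n,)) for n in columns), reverse=True)[:beam_width]
    best = beam[0]
    for _ in range(max_depth - 1):
        expanded = [
            (score(names + (n,)), names + (n,))
            for _, names in beam for n in columns if n not in names
        ]
        if not expanded:
            break
        beam = sorted(expanded, reverse=True)[:beam_width]
        best = max(best, beam[0])
    return best

columns = {
    "city": ["Kent", "Austin"],
    "state": ["WA", "TX"],
    "comments": ["Great beach", "Beautiful yard"],
    "price": ["250,000", "110,000"],
}
best = beam_search(columns, ["Miami, FL", "Boston, MA"])
print(best)
```

Only the states kept in the beam are expanded, so low-scoring starting columns are never combined with anything.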
38Multi-Strategy Search (cont'd)
- Apply all applicable expert modules
- L1: m11, m12, m13, ..., m1x
- L2: m21, m22, m23, ..., m2y
- L3: m31, m32, m33, ..., m3z
- Combine modules' predictions & select the best one
[Figure: the top candidates from each module (m11, m12, m21, m22, m31, m32) are compared using all base learners to compute similarity, and the best one (m11) is selected.]
39Related Work
- Recognizers + Schema, 1-1 Matching: TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
- Single Learner, 1-1 Matching: SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95], DELTA [Clifton et al. 97]
- Hybrid 1-1 Matching: DELTA [Clifton et al. 97]
- Multi-Strategy Learning; Learners + Recognizers; Schema + Data; 1-1 + non 1-1 Matching: LSD [Doan et al. 2000, 2001]
- Schema + Data; 1-1 + non 1-1 Matching; Sophisticated Data-Driven User Interaction: CLIO [Miller et al. 00, Yan et al. 01]
- ?
40Summary
- LSD
- uses multi-strategy learning to semi-automatically generate semantic mappings.
- LSD is extensible and incorporates domain and user knowledge, and previous techniques.
- Experimental results show the approach is very promising.
- Future work and issues to ponder
- Accommodating more expressive languages: ontologies
- Reuse of learned concepts from related domains.
- Semantics?
- Data management is a fertile area for Machine Learning research!
41Backup Slides
42Mapping Maintenance
[Figure: mappings m1, m2, m3 between source schema S and mediated schema M]
- Ten months later ...
- are the mappings still correct?
[Figure: the same schemas S and M and mappings m1, m2, m3, shown again]
43Information Extraction from Text
- Extract data fragments from text documents
- date, location, victim's name from a news article
- Intensive research on free-text documents
- Many documents do have substantial structure
- XML pages, name card, tables, list
- Each such document = a data source
- structure forms a schema
- only one data value per schema element
- real data source has many data values per schema element
- Ongoing research in the IE community
44Contribution of Each Component
[Figure: average matching accuracy (%) for LSD without the Name Learner, without Naive Bayes, without the Whirl Learner, without the Constraint Handler, and for the complete LSD system]
45Exploiting Hierarchical Structure
- Existing learners flatten out all structures
- Developed XML learner
- similar to the Naive Bayes learner
- input instance bag of tokens
- differs in one crucial aspect
- consider not only text tokens, but also structure tokens
<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>
<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
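The difference between text tokens and structure tokens can be shown on the two instances from this slide. This is a simplified sketch: it treats each nested tag as one structure token and ignores the root tag, which is our assumption, not a description of the actual XML learner. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List

def words(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())

def bag_of_tokens(xml_text: str, include_structure: bool = True) -> Counter:
    """Bag of text tokens, optionally extended with one token per nested tag.
    The root tag is skipped because it is the element being classified."""
    root = ET.fromstring(xml_text)
    bag: Counter = Counter()
    for elem in root.iter():
        if elem is not root:
            if include_structure:
                bag[f"<{elem.tag}>"] += 1
            if elem.tail:
                bag.update(words(elem.tail))
        if elem.text:
            bag.update(words(elem.text))
    return bag

contact = "<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>"
description = ("<description> Victorian house with a view. Name your price! "
               "To see it, contact Gail Murphy at MAX Realtors. </description>")

flat = bag_of_tokens(contact, include_structure=False)
structured = bag_of_tokens(contact)
description_bag = bag_of_tokens(description)

# With text tokens only, every contact token also occurs in the description.
print(set(flat) <= set(description_bag))
# Structure tokens such as <name> and <firm> occur only in the contact instance.
print(sorted(t for t in structured if t.startswith("<")))
```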
46Domain Constraints
- Impose semantic regularities on sources
- verified using schema or data
- Examples
- a = address & b = address => a = b
- a = house-id => a is a key
- a = agent-info & b = agent-name => b is nested in a
- Can be specified up front
- when creating mediated schema
- independent of any actual source schema
47The Constraint Handler
Domain constraint: a = address & b = address => a = b
Predictions from Meta-Learner:
- area: (address,0.7), (description,0.3)
- contact-phone: (agent-phone,0.9), (description,0.1)
- extra-info: (address,0.6), (description,0.4)
Candidate mapping combinations and their scores:
- area => description, contact-phone => description, extra-info => description: 0.3 * 0.1 * 0.4 = 0.012
- area => address, contact-phone => agent-phone, extra-info => address: 0.7 * 0.9 * 0.6 = 0.378
- area => address, contact-phone => agent-phone, extra-info => description: 0.7 * 0.9 * 0.4 = 0.252
- Can specify arbitrary constraints
- User feedback = domain constraint
- ad-id = house-id
- Extended to handle domain heuristics
- a = agent-phone & b = agent-name => a & b are usually close to each other
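The worked numbers on this slide amount to scoring every combination of labels and discarding those that violate a constraint. The sketch below does this by exhaustive enumeration, which is an assumption made for clarity and says nothing about how LSD actually searches; the function names are hypothetical, and the predictions are the ones listed on the slide.

```python
from itertools import product
from math import prod
from typing import Dict, Optional, Tuple

# Meta-learner predictions from the slide.
predictions: Dict[str, Dict[str, float]] = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}

def satisfies_constraints(mapping: Dict[str, str]) -> bool:
    """Hard constraint: at most one source element may map to address."""
    return list(mapping.values()).count("address") <= 1

def best_mapping(preds: Dict[str, Dict[str, float]]) -> Tuple[Optional[Dict[str, str]], float]:
    """Score each label combination by the product of its confidences and
    keep the best one that satisfies the constraints."""
    sources = list(preds)
    best, best_score = None, 0.0
    for labels in product(*(preds[s] for s in sources)):
        mapping = dict(zip(sources, labels))
        if not satisfies_constraints(mapping):
            continue
        score = prod(preds[s][label] for s, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

mapping, score = best_mapping(predictions)
print(mapping, round(score, 3))
```

The unconstrained best combination (score 0.378) maps two elements to address, so the constraint pushes the choice to the 0.252 combination.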