AnHai Doan


Slides: 49


1
Learning to Map between Structured
Representations of Data

AnHai Doan
Database & Data Mining Group
University of Washington, Seattle
Spring 2002

2
Data Integration Challenge
Find houses with 2 bedrooms priced under 200K
New faculty member
homes.com
realestate.com
homeseekers.com
3
Architecture of Data Integration System
Find houses with 2 bedrooms priced under 200K
mediated schema
source schema 2
source schema 3
source schema 1
homes.com
realestate.com
homeseekers.com
4
Semantic Mappings between Schemas
Mediated-schema
price agent-name address
1-1 mapping
complex mapping
homes.com
listed-price  contact-name  city     state
320K          Jane Brown    Seattle  WA
240K          Mike Smith    Miami    FL
5
Schema Matching is Ubiquitous!

Fundamental problem in numerous applications
Databases
data integration
data translation
schema/view integration
data warehousing
semantic query processing
model management
peer data management
AI
knowledge bases, ontology merging, information
gathering agents, ...
Web
e-commerce
marking up data using ontologies (Semantic Web)

6
Why Schema Matching is Difficult

Schema & data never fully capture semantics!
not adequately documented
Must rely on clues in schema & data
using names, structures, types, data values, etc.
Such clues can be unreliable
same names => different entities: area => location or square-feet
different names => same entity: area & address => location
Intended semantics can be subjective
house-style = house-description?
Cannot be fully automated, needs user feedback!

7
Current State of Affairs

Finding semantic mappings is now a key bottleneck!
largely done by hand
labor intensive & error prone
data integration at GTE [Li & Clifton, 2000]
40 databases, 27000 elements, estimated time 12 years
Will only be exacerbated
data sharing becomes pervasive
translation of legacy data
Need semi-automatic approaches to scale up!
Many current research projects
Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, ...
AI: Stanford, Karlsruhe University, NEC Japan, ...

8
Goals and Contributions

Vision for schema-matching tools
learn from previous matching activities
exploit multiple types of information
incorporate domain integrity constraints
handle user feedback
My contributions: solution for semi-automatic schema matching
can match relational schemas, DTDs, ontologies, ...
discovers both 1-1 & complex mappings
highly modular & extensible
achieves high matching accuracy (66 -- 97%) on real-world data

9
Road Map

Introduction
Schema matching [SIGMOD-01]
1-1 mappings for data integration
LSD (Learning Source Description) system
learns from previous matching activities
employs multi-strategy learning
exploits domain constraints & user feedback
Creating complex mappings [Tech. Report-02]
Ontology matching [WWW-02]
Conclusions

10
Schema Matching for Data Integration: the LSD Approach

Suppose user wants to integrate 100 data
sources
1. User
manually creates mappings for a few sources, say
3
shows LSD these mappings
2. LSD learns from the mappings
3. LSD predicts mappings for remaining 97 sources

11
Learning from the Manual Mappings
Mediated schema
price  agent-name  agent-phone  office-phone  description
Schema of realestate.com
listed-price  contact-name  contact-phone  office  comments
realestate.com
listed-price  contact-name  contact-phone   office          comments
250K          James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
320K          Mike Doan     (617) 253 1429  (617) 112 2315  Great location
If "office" occurs in name => office-phone
If "fantastic" & "great" occur frequently in data instances => description
homes.com
sold-at  contact-agent   extra-info
350K     (206) 634 9435  Beautiful yard
230K     (617) 335 4243  Close to Seattle
12
Must Exploit Multiple Types of Information!
13
Multi-Strategy Learning

Use a set of base learners
each exploits well certain types of information
To match a schema element of a new source
apply base learners
combine their predictions using a meta-learner
Meta-learner
uses training sources to measure base learner
accuracy
weighs each learner based on its accuracy

14
Base Learners

Name Learner
training: (location, address), (contact name, name)
matching: agent-name => (name,0.7), (phone,0.3)
Naive Bayes Learner
training: ("Seattle, WA", address), ("250K", price)
matching: "Kent, WA" => (address,0.8), (name,0.2)

labels weighted by confidence score
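A minimal sketch of how a Naive Bayes base learner of this kind could turn a data instance into labels weighted by confidence. The class name `NaiveBayesLearner`, the tokenizer, the add-one smoothing and the four training pairs are assumptions made for illustration; the slides do not specify LSD's actual implementation, and the resulting confidences will not match the numbers on the slide.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesLearner:
    """Multinomial Naive Bayes over tokens, with add-one smoothing (an assumption)."""

    def __init__(self):
        self.label_counts = Counter()             # training instances per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        """Return {label: confidence}; confidences sum to 1."""
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


learner = NaiveBayesLearner()
learner.train([
    ("Seattle, WA", "address"),
    ("Miami, FL", "address"),
    ("250K", "price"),
    ("320K", "price"),
])
print(learner.predict("Kent, WA"))  # "address" gets the higher confidence
```

The shared token "wa" is what tips the unseen instance toward address, which is the kind of word-frequency evidence this learner is described as exploiting.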
15
The LSD Architecture
Training Phase
Mediated schema, Source schemas
Training data for base learners
Base-Learner1 .... Base-Learnerk => Hypothesis1 .... Hypothesisk
Meta-Learner => Weights for Base Learners
Matching Phase
Base-Learner1 .... Base-Learnerk, Meta-Learner
Predictions for instances
Prediction Combiner
Predictions for elements
Constraint Handler (with Domain constraints)
Mappings
16
Training the Base Learners
Mediated schema
address  price  agent-name  agent-phone  office-phone  description
realestate.com
location    price  contact-name  contact-phone   office          comments
Miami, FL   250K   James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
Boston, MA  320K   Mike Doan     (617) 253 1429  (617) 112 2315  Great location
17
Meta-Learner: Stacking [Wolpert 92, Ting & Witten 99]

Training
uses training data to learn weights
one for each (base-learner, mediated-schema element) pair
weight (Name-Learner,address) = 0.2
weight (Naive-Bayes,address) = 0.8
Matching: combine predictions of base learners
computes weighted average of base-learner confidence scores

area: Seattle, WA; Kent, WA; Bend, OR
Name Learner: (address,0.4)
Naive Bayes: (address,0.9)
Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
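The combination step on this slide can be sketched in a few lines. The function name `meta_combine` and the dictionary layout are assumptions for illustration; the weights and confidence scores are the ones shown on the slide.

```python
def meta_combine(base_predictions, weights, label):
    """Weighted sum of base-learner confidences for one mediated-schema label.

    base_predictions: {learner_name: confidence for `label`}
    weights: {(learner_name, label): weight learned during training}
    """
    return sum(
        weights[(learner, label)] * confidence
        for learner, confidence in base_predictions.items()
    )


weights = {("Name-Learner", "address"): 0.2, ("Naive-Bayes", "address"): 0.8}
base_predictions = {"Name-Learner": 0.4, "Naive-Bayes": 0.9}

score = meta_combine(base_predictions, weights, "address")
print(round(score, 2))  # 0.8
```

Because the two weights sum to 1, the weighted sum is also a weighted average, which is how the slide describes it.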
18
The LSD Architecture
19
Applying the Learners
homes.com schema
area  sold-at  contact-agent  extra-info
area instances: Seattle, WA; Kent, WA; Bend, OR
Name Learner + Naive Bayes => Meta-Learner, one prediction per instance
(address,0.8), (description,0.2)
(address,0.6), (description,0.4)
(address,0.7), (description,0.3)
Prediction-Combiner
area => (address,0.7), (description,0.3)
homes.com
sold-at => (price,0.9), (agent-phone,0.1)
contact-agent => (agent-phone,0.9), (description,0.1)
extra-info => (address,0.6), (description,0.4)
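The element-level prediction for area on this slide equals the per-label mean of the three instance-level predictions. The sketch below assumes simple averaging as the combination rule, which is consistent with the slide's numbers but is not stated on it; the function name is hypothetical.

```python
from collections import defaultdict


def combine_instance_predictions(instance_predictions):
    """Average per-instance {label: confidence} dicts into one element-level dict."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, confidence in prediction.items():
            totals[label] += confidence
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}


# Meta-learner output for the three instances of the source element "area".
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]

combined = combine_instance_predictions(area_instances)
print({label: round(c, 2) for label, c in combined.items()})
# {'address': 0.7, 'description': 0.3}
```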
20
Domain Constraints

Encode user knowledge about domain
Specified by examining mediated schema
Examples
at most one source-schema element can match
address
if a source-schema element matches house-id then
it is a key
avg-value(price) > avg-value(num-baths)
Given a mapping combination
can verify if it satisfies a given constraint

area => address, sold-at => price, contact-agent => agent-phone, extra-info => address
21
The Constraint Handler
Predictions from Prediction Combiner
area: (address,0.7), (description,0.3)
sold-at: (price,0.9), (agent-phone,0.1)
contact-agent: (agent-phone,0.9), (description,0.1)
extra-info: (address,0.6), (description,0.4)
Domain Constraints: At most one element matches address
Mapping combinations
0.3 * 0.1 * 0.1 * 0.4 = 0.0012
0.7 * 0.9 * 0.9 * 0.4 = 0.2268
area => address, sold-at => price, contact-agent => agent-phone, extra-info => description
0.7 * 0.9 * 0.9 * 0.6 = 0.3402
area => address, sold-at => price, contact-agent => agent-phone, extra-info => address

Searches space of mapping combinations
efficiently
Can handle arbitrary constraints
Also used to incorporate user feedback
sold-at does not match price
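The selection shown on this slide can be reproduced with a brute-force sketch: score every mapping combination by the product of its confidences and keep the best one that satisfies all constraints. Exhaustive enumeration here is a stand-in for the efficient search the slide mentions, and the function and constraint names are assumptions for illustration.

```python
from itertools import product
from math import prod

# Element-level predictions from the Prediction Combiner (values from the slide).
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    """Domain constraint: at most one source element matches address."""
    return sum(1 for label in mapping.values() if label == "address") <= 1


def sold_at_is_not_price(mapping):
    """User feedback expressed as a constraint."""
    return mapping["sold-at"] != "price"


def best_mapping(predictions, constraints):
    """Return (mapping, score) for the highest-scoring combination that is allowed."""
    elements = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


mapping, score = best_mapping(predictions, [at_most_one_address])
print(mapping["extra-info"], round(score, 4))  # description 0.2268
```

Passing `[at_most_one_address, sold_at_is_not_price]` instead shows how user feedback prunes the search in the same way a domain constraint does.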

22
The Current LSD System

Can also handle data in XML format
matches XML DTDs
Base learners
Naive Bayes [Duda & Hart-93, Domingos & Pazzani-97]
exploits frequencies of words & symbols
WHIRL Nearest-Neighbor Classifier [Cohen & Hirsh KDD-98]
employs information-retrieval similarity metric
Name Learner [SIGMOD-01]
matches elements based on their names
County-Name Recognizer [SIGMOD-01]
stores all U.S. county names
XML Learner [SIGMOD-01]
exploits hierarchical structure of XML data

23
Empirical Evaluation

Four domains
Real Estate I & II, Course Offerings, Faculty Listings
For each domain
created mediated schema & domain constraints
chose five sources
extracted & converted data into XML
mediated schemas: 14 - 66 elements, source schemas: 13 - 48

Ten runs for each domain, in each run
manually provided 1-1 mappings for 3 sources
asked LSD to propose mappings for remaining 2 sources
accuracy = % of 1-1 mappings correctly identified

24
High Matching Accuracy
Average Matching Accuracy (%)
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
Meta-learner: + 5 - 22%
Constraint handler: + 7 - 13%
XML learner: + 0.8 - 6%
25
Contribution of Schema vs. Data
Average matching accuracy (%)

LSD with only schema info.
LSD with only data info.
Complete LSD

More experiments in [Doan et al. SIGMOD-01]
26
LSD Summary

LSD
learns from previous matching activities
exploits multiple types of information
by employing multi-strategy learning
incorporates domain constraints & user feedback
achieves high matching accuracy
LSD focuses on 1-1 mappings
Next challenge: discover more complex mappings!
COMAP (Complex Mapping) system

27
The COMAP Approach
Mediated-schema
price  num-baths  address
homes.com
listed-price  agent-id  full-baths  half-baths  city  zipcode

For each mediated-schema element
searches space of all mappings
finds a small set of likely mapping candidates
uses LSD to evaluate them
To search efficiently
employs a specialized searcher for each element
type
Text Searcher, Numeric Searcher, Category
Searcher, ...

28
The COMAP Architecture [Doan et al., 02]
Source schema data
Mediated schema
Searcherk
Searcher2
Searcher1
Mapping candidates
Base-Learner1 .... Base-Learnerk
Meta-Learner
Prediction Combiner
Domain constraints
Constraint Handler
LSD
Mappings
29
An Example Text Searcher

Beam search in space of all concatenation
mappings
Example: find mapping candidates for address

Mediated-schema
price  num-baths  address
homes.com
listed-price  agent-id  full-baths  half-baths  city     zipcode
320K          532a      2           1           Seattle  98105
240K          115c      1           1           Miami    23591
concat(agent-id,zipcode): 532a 98105, 115c 23591
concat(city,zipcode): Seattle 98105, Miami 23591
concat(agent-id,city): 532a Seattle, 115c Miami

Best mapping candidates for address
(agent-id,0.7), (concat(agent-id,city),0.75),
(concat(city,zipcode),0.9)
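A minimal sketch of beam search over concatenation mappings, using the agent-id, city and zipcode columns from this slide. The scoring function is a hypothetical stand-in: it compares the token "shape" of each concatenated value (letters, digits, other) with the shape of known address values, whereas the slides say COMAP uses LSD to evaluate candidates. All function names and the example shape `"A9"` are assumptions.

```python
from difflib import SequenceMatcher


def shape(value):
    """Map each whitespace token to A (letters), 9 (digits) or X (anything else)."""
    return "".join(
        "A" if tok.isalpha() else "9" if tok.isdigit() else "X"
        for tok in value.split()
    )


def score(values, example_shapes):
    """Hypothetical scorer: mean best shape similarity to known target values."""
    total = 0.0
    for value in values:
        total += max(
            SequenceMatcher(None, shape(value), ex).ratio() for ex in example_shapes
        )
    return total / len(values)


def concat_values(columns, names):
    """Row-wise concatenation of the named columns, separated by spaces."""
    return [" ".join(parts) for parts in zip(*(columns[n] for n in names))]


def beam_search(columns, example_shapes, beam_width=2, max_columns=2):
    """Return every candidate kept in a beam as (score, column_names), best first."""
    beam = [(name,) for name in columns]
    kept_overall = []
    for _ in range(max_columns):
        scored = sorted(
            ((score(concat_values(columns, c), example_shapes), c) for c in beam),
            reverse=True,
        )
        kept = scored[:beam_width]
        kept_overall.extend(kept)
        # Extend each surviving candidate by one column it does not use yet.
        beam = [c + (n,) for _, c in kept for n in columns if n not in c]
    return sorted(kept_overall, reverse=True)


columns = {
    "agent-id": ["532a", "115c"],
    "city": ["Seattle", "Miami"],
    "zipcode": ["98105", "23591"],
}
# Assumed shape of known addresses: a word followed by a number.
candidates = beam_search(columns, example_shapes=["A9"])
print(candidates[0])  # (1.0, ('city', 'zipcode'))
```

With a beam width of 2, agent-id is pruned after the first round, so concat(agent-id,city) is never scored. That is the trade-off beam search makes to avoid enumerating every concatenation.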

30
Empirical Evaluation

Current COMAP system
eight searchers
Three real-world domains
in real estate & product inventory
mediated schema: 6 -- 26 elements, source schema: 16 -- 31
Accuracy: 62 -- 97%
Sample discovered mappings
agent-name = concat(first-name,last-name)
area = building-area / 43560
discount-cost = (unit-price * quantity) * (1 - discount)
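For concreteness, the three sample mappings can be written as small functions. This sketch assumes that concat joins values with a single space, that building-area is in square feet and area in acres (one acre is 43,560 square feet), and that the third mapping is a product of the two parenthesized terms; none of these details are spelled out on the slide.

```python
def agent_name(first_name, last_name):
    """agent-name = concat(first-name, last-name), joined by a space (assumption)."""
    return f"{first_name} {last_name}"


def area(building_area):
    """area = building-area / 43560 (square feet to acres, by assumption)."""
    return building_area / 43560


def discount_cost(unit_price, quantity, discount):
    """discount-cost = (unit-price * quantity) * (1 - discount)."""
    return (unit_price * quantity) * (1 - discount)


print(agent_name("Mike", "Doan"))    # Mike Doan
print(area(87120))                   # 2.0
print(discount_cost(10.0, 3, 0.5))   # 15.0
```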

31
Road Map

Introduction
Schema matching
LSD system
Creating complex mappings
COMAP system
Ontology matching
GLUE system
Conclusions

32
Ontology Matching

Increasingly critical for
knowledge bases, Semantic Web
An ontology
concepts organized into a taxonomy tree
each concept has
a set of attributes
a set of instances
relations among concepts
Matching
concepts
attributes
relations

CS Dept. US
Entity
Undergrad Courses
Grad Courses
People
Staff
Faculty
Assistant Professor
Associate Professor
Professor
name: Mike Burns, degree: Ph.D.
33
Matching Taxonomies of Concepts
CS Dept. Australia
Entity
Courses
Staff
Technical Staff
Academic Staff
Senior Lecturer
Lecturer
Professor
34
Constraints in Taxonomy Matching

Domain-dependent
at most one node matches department-chair
a node that matches professor cannot be a child of a node that matches assistant-professor
Domain-independent
two nodes match if parents & children match
if all children of X match Y, then X also matches Y
Variations have been exploited in many restricted settings
[Melnik & Garcia-Molina, ICDE-02], [Milo & Zohar, VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]
Challenge: find a general & efficient approach

35
Solution: Relaxation Labeling

Relaxation labeling [Hummel & Zucker, 83]
applied to graph labeling in vision, NLP, hypertext classification
finds best label assignment, given a set of constraints
starts with initial label assignment
iteratively improves labels, using constraints
Standard relax. labeling not applicable
extended it in many ways [Doan et al., WWW-02]
Experiments
three real-world domains in course catalog & company listings
30 -- 300 nodes / taxonomy
accuracy 66 -- 97% vs. 52 -- 83% of best base learner
relaxation labeling very fast (under a few seconds)
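A toy sketch of the classical relaxation-labeling update, to make "starts with initial label assignment, iteratively improves labels" concrete. It implements the standard probabilistic update (each label's probability is rescaled by the support it receives from neighboring nodes, then renormalized), not the extended version the slide refers to. The two nodes, the two candidate labels and the compatibility function are assumptions chosen only for illustration, and every node is assumed to have at least one neighbor.

```python
def relaxation_labeling(probs, neighbors, compat, iterations=10):
    """Iteratively update label probabilities from neighbor support.

    probs: {node: {label: probability}} initial assignment
    neighbors: {node: [neighbor nodes]} (each list must be non-empty)
    compat: function(label, neighbor_label) -> value in [-1, 1]
    """
    labels = list(next(iter(probs.values())))
    for _ in range(iterations):
        updated = {}
        for node, p in probs.items():
            support = {}
            for label in labels:
                q = sum(
                    compat(label, other) * probs[nb][other]
                    for nb in neighbors[node]
                    for other in labels
                )
                support[label] = q / len(neighbors[node])
            unnormalized = {l: p[l] * (1 + support[l]) for l in labels}
            z = sum(unnormalized.values())
            updated[node] = {l: v / z for l, v in unnormalized.items()}
        probs = updated  # synchronous update: all nodes use the previous round
    return probs


# Toy setting: two sibling nodes whose labels are assumed to prefer agreement.
initial = {
    "Senior Lecturer": {"Faculty": 0.9, "Staff": 0.1},
    "Lecturer": {"Faculty": 0.5, "Staff": 0.5},
}
neighbors = {"Senior Lecturer": ["Lecturer"], "Lecturer": ["Senior Lecturer"]}


def same_label(a, b):
    return 1.0 if a == b else -1.0


result = relaxation_labeling(initial, neighbors, same_label)
print(max(result["Lecturer"], key=result["Lecturer"].get))  # Faculty
```

The initially undecided node is pulled toward the label of its confident neighbor, which is the mechanism by which neighborhood constraints refine an initial assignment.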

36
Related Work
Hand-crafted rules; exploit schema; 1-1 mapping
TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01], PROMPT [Noy et al. 00]
Single learner; exploit data; 1-1 mapping
SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95], DELTA [Clifton et al. 97]
Rules; exploit data; 1-1 & complex mapping
CLIO [Miller et al. 00], [Yan et al. 01]
Learners + rules, use multi-strategy learning; exploit schema & data; 1-1 & complex mapping; exploit domain constraints
LSD [Doan et al., SIGMOD-01], COMAP [Doan et al. 2002, submitted], GLUE [Doan et al., WWW-02]
37
Future Work

Learning source descriptions
formal semantics for mapping
query capabilities, source schema, scope,
reliability of data, ...
Dealing with changes in source description
Matching objects across sources
More sophisticated user feedback
Focus on distributed information management
systems
data integration, web-service integration, peer
data management
goal: significantly reduce complexity of construction & maintenance

38
Conclusions

Efficiently creating semantic mappings is
critical
Developed solution for semi-automatic schema
matching
learns from previous matching activities
can match relational schemas, DTDs, ontologies,
...
discovers both 1-1 & complex mappings
highly modular & extensible
achieves high matching accuracy
Made contributions to machine learning
developed novel method to classify XML data
extended relaxation labeling

39
Backup Slides
40
Training the Meta-Learner

For address

Extracted XML Instances      Name Learner  Naive Bayes  True Predictions
<location> Miami, FL </>     0.5           0.8          1
<listed-price> 250,000 </>   0.4           0.3          0
<area> Seattle, WA </>       0.3           0.9          1
<house-addr> Kent, WA </>    0.6           0.8          1
<num-baths> 3 </>            0.3           0.3          0
...                          ...           ...          ...
Least-Squares Linear Regression
Weight(Name-Learner,address) = 0.1
Weight(Naive-Bayes,address) = 0.9
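A sketch of the regression step, assuming the numbers on this slide read row-wise as (Name Learner score, Naive Bayes score, true label) for the five instances shown. It fits two weights with no intercept by solving the 2x2 normal equations directly; the function name is hypothetical. An unconstrained fit on these five rows alone does not reproduce the 0.1 and 0.9 shown on the slide, which are presumably illustrative or derived from more data, but it does give Naive Bayes the larger weight.

```python
def fit_two_weights(rows):
    """Least-squares weights (w_a, w_b) minimizing sum((w_a*a + w_b*b - y)^2).

    rows: iterable of (a, b, y) triples. Solves the normal equations in closed form.
    """
    rows = list(rows)
    saa = sum(a * a for a, b, y in rows)
    sbb = sum(b * b for a, b, y in rows)
    sab = sum(a * b for a, b, y in rows)
    say = sum(a * y for a, b, y in rows)
    sby = sum(b * y for a, b, y in rows)
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("learner scores are collinear; weights are not unique")
    w_a = (say * sbb - sab * sby) / det
    w_b = (saa * sby - sab * say) / det
    return w_a, w_b


# (Name Learner, Naive Bayes, 1 if the instance really is an address else 0)
training_rows = [
    (0.5, 0.8, 1),  # location
    (0.4, 0.3, 0),  # listed-price
    (0.3, 0.9, 1),  # area
    (0.6, 0.8, 1),  # house-addr
    (0.3, 0.3, 0),  # num-baths
]

w_name, w_bayes = fit_two_weights(training_rows)
print(w_name < w_bayes)  # True: Naive Bayes is the more reliable learner for address
```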
41
Sensitivity to Amount of Available Data
Average matching accuracy (%)
Number of data listings per source (Real Estate I)
42
Contribution of Each Component
Average Matching Accuracy (%)
Without Name Learner
Without Naive Bayes
Without Whirl Learner
Without Constraint Handler
The complete LSD system
43
Exploiting Hierarchical Structure

Existing learners flatten out all structures
Developed XML learner
similar to the Naive Bayes learner
input instance = bag of tokens
differs in one crucial aspect
consider not only text tokens, but also structure tokens

<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>
<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
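A sketch of why structure tokens help with the two instances above. Flattened to text, the contact instance contains only words that also occur in the description instance; adding the names of nested tags as extra tokens separates them. The token format (`<name>` for a nested element) and the function name are assumptions for illustration, and the XML learner's actual token scheme may differ.

```python
import re
import xml.etree.ElementTree as ET


def bag_of_tokens(xml_text, include_structure=True):
    """Return text tokens plus, optionally, one structure token per nested element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for node in root.iter():
        # The root tag is the element being classified, so it is not a feature.
        if include_structure and node is not root:
            tokens.append(f"<{node.tag}>")
        for chunk in (node.text, node.tail):
            if chunk:
                tokens.extend(re.findall(r"[a-z0-9]+", chunk.lower()))
    return tokens


contact = (
    "<contact> <name> Gail Murphy </name> "
    "<firm> MAX Realtors </firm> </contact>"
)
description = (
    "<description> Victorian house with a view. Name your price! "
    "To see it, contact Gail Murphy at MAX Realtors. </description>"
)

flat_contact = set(bag_of_tokens(contact, include_structure=False))
flat_description = set(bag_of_tokens(description, include_structure=False))
print(flat_contact <= flat_description)  # True: text alone cannot tell them apart

print(bag_of_tokens(contact))
# ['<name>', 'gail', 'murphy', '<firm>', 'max', 'realtors']
```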
44
Reasons for Incorrect Matchings

Unfamiliarity
suburb
solution add a suburb-name recognizer
Insufficient information
correctly identified general type, failed to
pinpoint exact type
agent-name     phone
Richard Smith  (206) 234 5412
solution: add a proximity learner
Subjectivity
house-style = description?
Victorian   Beautiful neo-gothic house
Mexican     Great location

45
Evaluate Mapping Candidates

For address, Text Searcher returns
(agent-id,0.7)
(concat(agent-id,city),0.8)
(concat(city,zipcode),0.75)
Employ multi-strategy learning to evaluate
mappings
Example: (concat(agent-id,city),0.8)
Naive Bayes Learner: 0.8
Name Learner: "address" vs. "agent id city": 0.3
Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
Meta-Learner returns
(agent-id,0.59)
(concat(agent-id,city),0.65)
(concat(city,zipcode),0.70)

46
Relaxation Labeling

Applied to similar problems in
vision, NLP, hypertext classification

People
Dept U.S.
Dept Australia
Courses
Courses
Courses
Courses
People
Staff
Staff
Faculty
Tech. Staff
Acad. Staff
Staff
Faculty
47
Relaxation Labeling for Taxonomy Matching