AnHai Doan

University of Washington, Seattle. Spring 2002.

Learn more at: http://pages.cs.wisc.edu

less

Transcript and Presenter's Notes

Title: AnHai Doan


1
Learning to Map between Structured
Representations of Data
  • AnHai Doan
  • Database Data Mining Group
  • University of Washington, Seattle
  • Spring 2002

2
Data Integration Challenge
Find houses with 2 bedrooms priced under 200K
New faculty member
homes.com
realestate.com
homeseekers.com
3
Architecture of Data Integration System
Find houses with 2 bedrooms priced under 200K
mediated schema
source schema 2
source schema 3
source schema 1
homes.com
realestate.com
homeseekers.com
4
Semantic Mappings between Schemas
Mediated schema
price | agent-name | address
homes.com
listed-price | contact-name | city | state
320K | Jane Brown | Seattle | WA
240K | Mike Smith | Miami | FL
Mapping types shown: 1-1 mapping, complex mapping
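A minimal sketch of the two mapping types on this slide. The transcript does not say which elements are paired, so the pairings below (listed-price to price, contact-name to agent-name, city plus state to address) are an assumption read off the column layout, and `map_listing` is a hypothetical helper name.

```python
def map_listing(source_row):
    """Translate one homes.com row into the mediated schema (assumed pairings)."""
    return {
        # 1-1 mappings: one source element maps to one mediated element.
        "price": source_row["listed-price"],
        "agent-name": source_row["contact-name"],
        # Complex mapping: the mediated element combines several source elements.
        "address": f'{source_row["city"]}, {source_row["state"]}',
    }


row = {"listed-price": "320K", "contact-name": "Jane Brown",
       "city": "Seattle", "state": "WA"}
mapped = map_listing(row)
print(mapped)
```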
5
Schema Matching is Ubiquitous!
  • Fundamental problem in numerous applications
  • Databases
  • data integration
  • data translation
  • schema/view integration
  • data warehousing
  • semantic query processing
  • model management
  • peer data management
  • AI
  • knowledge bases, ontology merging, information
    gathering agents, ...
  • Web
  • e-commerce
  • marking up data using ontologies (Semantic Web)

6
Why Schema Matching is Difficult
  • Schema & data never fully capture semantics!
  • not adequately documented
  • Must rely on clues in schema & data
  • using names, structures, types, data values, etc.
  • Such clues can be unreliable
  • same names => different entities: area =>
    location or square-feet
  • different names => same entity: area &
    address => location
  • Intended semantics can be subjective
  • house-style = house-description?
  • Cannot be fully automated, needs user feedback!

7
Current State of Affairs
  • Finding semantic mappings is now a key
    bottleneck!
  • largely done by hand
  • labor intensive & error prone
  • data integration at GTE [Li & Clifton, 2000]
  • 40 databases, 27000 elements, estimated time: 12
    years
  • Will only be exacerbated
  • data sharing becomes pervasive
  • translation of legacy data
  • Need semi-automatic approaches to scale up!
  • Many current research projects
  • Databases: IBM Almaden, Microsoft Research, BYU,
    George Mason, U of Leipzig,
    ...
  • AI: Stanford, Karlsruhe University, NEC Japan,
    ...

8
Goals and Contributions
  • Vision for schema-matching tools
  • learn from previous matching activities
  • exploit multiple types of information
  • incorporate domain integrity constraints
  • handle user feedback
  • My contributions: solution for semi-automatic
    schema matching
  • can match relational schemas, DTDs, ontologies,
    ...
  • discovers both 1-1 & complex mappings
  • highly modular & extensible
  • achieves high matching accuracy (66 -- 97%) on
    real-world data

9
Road Map
  • Introduction
  • Schema matching [SIGMOD-01]
  • 1-1 mappings for data integration
  • LSD (Learning Source Description) system
  • learns from previous matching activities
  • employs multi-strategy learning
  • exploits domain constraints & user feedback
  • Creating complex mappings [Tech. Report-02]
  • Ontology matching [WWW-02]
  • Conclusions

10
Schema Matching for Data Integration: the LSD
Approach
  • Suppose user wants to integrate 100 data
    sources
  • 1. User
  • manually creates mappings for a few sources, say
    3
  • shows LSD these mappings
  • 2. LSD learns from the mappings
  • 3. LSD predicts mappings for remaining 97 sources

11
Learning from the Manual Mappings
Mediated schema
price | agent-name | agent-phone | office-phone | description
Schema of realestate.com
listed-price | contact-name | contact-phone | office | comments
realestate.com data
250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
Learned hypotheses
If "office" occurs in name => office-phone
If "fantastic" & "great" occur frequently in data instances => description
homes.com
sold-at | contact-agent | extra-info
350K | (206) 634 9435 | Beautiful yard
230K | (617) 335 4243 | Close to Seattle
12
Must Exploit Multiple Types of Information!
13
Multi-Strategy Learning
  • Use a set of base learners
  • each exploits well certain types of information
  • To match a schema element of a new source
  • apply base learners
  • combine their predictions using a meta-learner
  • Meta-learner
  • uses training sources to measure base learner
    accuracy
  • weighs each learner based on its accuracy

14
Base Learners
  • Training
  • Matching: object X => labels weighted by
    confidence score
  • Name Learner
  • training: (location, address),
    (contact name, name)
  • matching: agent-name => (name,0.7), (phone,0.3)
  • Naive Bayes Learner
  • training: ("Seattle, WA", address),
    ("250K", price)
  • matching: "Kent, WA" => (address,0.8), (name,0.2)
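A minimal sketch of a Naive Bayes base learner with the train/match interface described on this slide. The tokenizer, the Laplace smoothing, and the class and method names are assumptions for illustration, not LSD's actual implementation. The two extra training pairs ("Miami, FL" and "320K") come from the realestate.com example data used elsewhere in the talk.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Split into lowercase alphabetic words and digit runs (an assumed tokenizer).
    return re.findall(r"[A-Za-z]+|\d+", text.lower())


class NaiveBayesLearner:
    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (data instance, mediated-schema label)."""
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def match(self, text):
        """Return labels weighted by confidence score, best first."""
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                # Laplace smoothing keeps unseen tokens from zeroing the score.
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return sorted(((label, v / norm) for label, v in exp.items()),
                      key=lambda pair: pair[1], reverse=True)


learner = NaiveBayesLearner()
learner.train([("Seattle, WA", "address"), ("Miami, FL", "address"),
               ("250K", "price"), ("320K", "price")])
ranked = learner.match("Kent, WA")
print(ranked)
```

The confidence scores this toy model produces depend on its tiny training set and will not reproduce the slide's (address,0.8), (name,0.2).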
15
The LSD Architecture
Training Phase
Mediated schema, Source schemas
-> Training data for base learners
-> Base-Learner1 .... Base-Learnerk -> Hypothesis1 .... Hypothesisk
-> Meta-Learner -> Weights for Base Learners
Matching Phase
Base-Learner1 .... Base-Learnerk, Meta-Learner -> Predictions for instances
-> Prediction Combiner -> Predictions for elements
-> Constraint Handler (with Domain constraints) -> Mappings
16
Training the Base Learners
Mediated schema
address | price | agent-name | agent-phone | office-phone | description
realestate.com
location | price | contact-name | contact-phone | office | comments
Miami, FL | 250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
Boston, MA | 320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
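A sketch of how training examples for the base learners might be derived from one manually mapped source. The column-to-column mapping below is an assumption based on the slide's column alignment, and `build_training_data` is a hypothetical helper. Name-based learners get (element name, label) pairs; instance-based learners get (data value, label) pairs.

```python
SOURCE_COLUMNS = ["location", "price", "contact-name",
                  "contact-phone", "office", "comments"]
ROWS = [
    ["Miami, FL", "250K", "James Smith", "(305) 729 0831",
     "(305) 616 1822", "Fantastic house"],
    ["Boston, MA", "320K", "Mike Doan", "(617) 253 1429",
     "(617) 112 2315", "Great location"],
]
# Manual 1-1 mappings supplied by the user (assumed from column alignment).
MAPPING = {
    "location": "address", "price": "price", "contact-name": "agent-name",
    "contact-phone": "agent-phone", "office": "office-phone",
    "comments": "description",
}


def build_training_data(columns, rows, mapping):
    # Examples for a name-based learner: (source element name, label).
    name_examples = [(col, mapping[col]) for col in columns]
    # Examples for an instance-based learner: (data value, label).
    instance_examples = [(value, mapping[col])
                         for row in rows
                         for col, value in zip(columns, row)]
    return name_examples, instance_examples


name_examples, instance_examples = build_training_data(
    SOURCE_COLUMNS, ROWS, MAPPING)
print(name_examples[0], instance_examples[0])
```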
17
Meta-Learner: Stacking [Wolpert 92, Ting & Witten 99]
  • Training
  • uses training data to learn weights
  • one for each (base-learner, mediated-schema
    element) pair
  • weight(Name-Learner,address) = 0.2
  • weight(Naive-Bayes,address) = 0.8
  • Matching: combine predictions of base learners
  • computes weighted average of base-learner
    confidence scores

Example: element area, instances Seattle, WA; Kent, WA; Bend, OR
Name Learner: (address,0.4)
Naive Bayes: (address,0.9)
Meta-Learner: (address, 0.4 * 0.2 + 0.9 * 0.8 = 0.8)
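The combination step on this slide can be sketched as a weighted sum of base-learner confidence scores, using the slide's weights and scores. The function and variable names are hypothetical. Because the two weights sum to 1 here, the weighted sum is also a weighted average.

```python
# One weight per (base learner, mediated-schema element) pair, as on the slide.
WEIGHTS = {
    ("Name-Learner", "address"): 0.2,
    ("Naive-Bayes", "address"): 0.8,
}


def meta_learner_score(label, base_scores, weights):
    """Combine base-learner confidence scores for one label."""
    return sum(weights[(learner, label)] * score
               for learner, score in base_scores.items())


combined = meta_learner_score(
    "address", {"Name-Learner": 0.4, "Naive-Bayes": 0.9}, WEIGHTS)
print(round(combined, 2))  # 0.4 * 0.2 + 0.9 * 0.8 = 0.8
```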
18
The LSD Architecture
19
Applying the Learners
homes.com schema
area | sold-at | contact-agent | extra-info
Element area, instances: Seattle, WA; Kent, WA; Bend, OR
Each instance: Name Learner, Naive Bayes -> Meta-Learner
Per-instance predictions:
(address,0.8), (description,0.2)
(address,0.6), (description,0.4)
(address,0.7), (description,0.3)
Prediction-Combiner
area: (address,0.7), (description,0.3)
Predictions for the other homes.com elements
sold-at: (price,0.9), (agent-phone,0.1)
contact-agent: (agent-phone,0.9), (description,0.1)
extra-info: (address,0.6), (description,0.4)
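A sketch of the prediction combiner for the area element. The transcript does not state the combination rule, so plain averaging is an assumption, chosen because it reproduces the slide's numbers: the three per-instance predictions average to (address,0.7), (description,0.3).

```python
from collections import defaultdict


def combine_instance_predictions(instance_predictions):
    """Average per-instance label scores into one prediction per element."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    count = len(instance_predictions)
    return {label: total / count for label, total in totals.items()}


area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = combine_instance_predictions(area_instances)
print(area_prediction)
```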
20
Domain Constraints
  • Encode user knowledge about domain
  • Specified by examining mediated schema
  • Examples
  • at most one source-schema element can match
    address
  • if a source-schema element matches house-id then
    it is a key
  • avg-value(price) > avg-value(num-baths)
  • Given a mapping combination
  • can verify if it satisfies a given constraint

Example mapping combination
area: address
sold-at: price
contact-agent: agent-phone
extra-info: address
21
The Constraint Handler
Predictions from Prediction Combiner
area: (address,0.7), (description,0.3)
sold-at: (price,0.9), (agent-phone,0.1)
contact-agent: (agent-phone,0.9), (description,0.1)
extra-info: (address,0.6), (description,0.4)
Domain Constraints: At most one element matches address
Scores of mapping combinations
0.3 * 0.1 * 0.1 * 0.4 = 0.0012
0.7 * 0.9 * 0.9 * 0.4 = 0.2268
area: address, sold-at: price, contact-agent: agent-phone, extra-info: description
0.7 * 0.9 * 0.9 * 0.6 = 0.3402
area: address, sold-at: price, contact-agent: agent-phone, extra-info: address
  • Searches space of mapping combinations
    efficiently
  • Can handle arbitrary constraints
  • Also used to incorporate user feedback
  • sold-at does not match price
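A sketch of the constraint handler's job on this example: score each mapping combination by the product of its confidence scores, and keep the best one that satisfies every constraint. Exhaustive enumeration stands in for the efficient search the slide mentions and would not scale to real schemas. The function names are hypothetical.

```python
from itertools import product

PREDICTIONS = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    # Domain constraint from the slide: at most one element matches address.
    return sum(1 for label in mapping.values() if label == "address") <= 1


def best_mapping(predictions, constraints):
    """Return the highest-scoring mapping combination that passes all checks."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for element, label in mapping.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


unconstrained, top_score = best_mapping(PREDICTIONS, [])
mapping, score = best_mapping(PREDICTIONS, [at_most_one_address])
print(mapping, round(score, 4))
```

User feedback such as "sold-at does not match price" can be passed the same way, as one more check: `lambda m: m["sold-at"] != "price"`.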

22
The Current LSD System
  • Can also handle data in XML format
  • matches XML DTDs
  • Base learners
  • Naive Bayes [Duda & Hart-93, Domingos & Pazzani-97]
  • exploits frequencies of words & symbols
  • WHIRL Nearest-Neighbor Classifier [Cohen & Hirsh,
    KDD-98]
  • employs information-retrieval similarity metric
  • Name Learner [SIGMOD-01]
  • matches elements based on their names
  • County-Name Recognizer [SIGMOD-01]
  • stores all U.S. county names
  • XML Learner [SIGMOD-01]
  • exploits hierarchical structure of XML data

23
Empirical Evaluation
  • Four domains
  • Real Estate I & II, Course Offerings, Faculty
    Listings
  • For each domain
  • created mediated schema & domain constraints
  • chose five sources
  • extracted & converted data into XML
  • mediated schemas: 14 - 66 elements, source
    schemas: 13 - 48
  • Ten runs for each domain, in each run
  • manually provided 1-1 mappings for 3 sources
  • asked LSD to propose mappings for remaining 2
    sources
  • accuracy: % of 1-1 mappings correctly identified

24
High Matching Accuracy
Average Matching Accuracy (%)
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
Meta-learner: + 5 - 22%
Constraint handler: + 7 - 13%
XML learner: + 0.8 - 6%
25
Contribution of Schema vs. Data
Average matching accuracy (%)

  • LSD with only schema info.
  • LSD with only data info.
  • Complete LSD

More experiments in [Doan et al., SIGMOD-01]
26
LSD Summary
  • LSD
  • learns from previous matching activities
  • exploits multiple types of information
  • by employing multi-strategy learning
  • incorporates domain constraints & user feedback
  • achieves high matching accuracy
  • LSD focuses on 1-1 mappings
  • Next challenge: discover more complex mappings!
  • COMAP (Complex Mapping) system

27
The COMAP Approach
Mediated-schema
price num-baths address
homes.com
listed-price agent-id full-baths
half-baths city zipcode
  • For each mediated-schema element
  • searches space of all mappings
  • finds a small set of likely mapping candidates
  • uses LSD to evaluate them
  • To search efficiently
  • employs a specialized searcher for each element
    type
  • Text Searcher, Numeric Searcher, Category
    Searcher, ...

28
The COMAP Architecture [Doan et al., 02]
Source schema & data, Mediated schema
-> Searcher1, Searcher2, ..., Searcherk
-> Mapping candidates
-> LSD: Base-Learner1 .... Base-Learnerk, Meta-Learner,
   Prediction Combiner, Constraint Handler (with Domain constraints)
-> Mappings
29
An Example Text Searcher
  • Beam search in space of all concatenation
    mappings
  • Example find mapping candidates for address

Mediated schema
price | num-baths | address
homes.com
listed-price | agent-id | full-baths | half-baths | city | zipcode
320K | 532a | 2 | 1 | Seattle | 98105
240K | 115c | 1 | 1 | Miami | 23591
Candidate concatenations and their values
concat(agent-id,zipcode): 532a 98105, 115c 23591
concat(city,zipcode): Seattle 98105, Miami 23591
concat(agent-id,city): 532a Seattle, 115c Miami
  • Best mapping candidates for address
  • (agent-id,0.7), (concat(agent-id,city),0.75),
    (concat(city,zipcode),0.9)
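A sketch of beam search over concatenation mappings for address, using the homes.com rows above. The scoring function is a toy stand-in invented for this example (it rewards values that look like a word followed by a five-digit number), so its scores differ from the slide's. The slides say COMAP uses LSD to evaluate candidates. All names are hypothetical.

```python
import re

COLUMNS = ["listed-price", "agent-id", "full-baths", "half-baths",
           "city", "zipcode"]
ROWS = [
    ["320K", "532a", "2", "1", "Seattle", "98105"],
    ["240K", "115c", "1", "1", "Miami", "23591"],
]


def concat_values(candidate):
    """Values produced by concatenating the candidate's columns, per row."""
    indexes = [COLUMNS.index(name) for name in candidate]
    return [" ".join(row[i] for i in indexes) for row in ROWS]


def toy_address_score(candidate):
    # Toy scorer (assumption): full credit for "<word> <5 digits>",
    # partial credit for containing a word or a 5-digit number.
    total = 0.0
    for value in concat_values(candidate):
        if re.fullmatch(r"[A-Za-z]{3,} \d{5}", value):
            total += 1.0
        else:
            total += 0.4 * bool(re.search(r"[A-Za-z]{3,}", value))
            total += 0.4 * bool(re.search(r"\d{5}", value))
    return total / len(ROWS)


def beam_search(columns, score, beam_width=2, max_columns=2):
    """Grow concatenations one column at a time, keeping the best few."""
    beam = sorted([(c,) for c in columns], key=score, reverse=True)[:beam_width]
    seen = list(beam)
    for _ in range(max_columns - 1):
        expanded = [cand + (c,) for cand in beam
                    for c in columns if c not in cand]
        beam = sorted(expanded, key=score, reverse=True)[:beam_width]
        seen.extend(beam)
    return sorted(seen, key=score, reverse=True)


candidates = beam_search(COLUMNS, toy_address_score)
print(candidates[0])
```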

30
Empirical Evaluation
  • Current COMAP system
  • eight searchers
  • Three real-world domains
  • in real estate & product inventory
  • mediated schema: 6 -- 26 elements, source schema:
    16 -- 31
  • Accuracy: 62 -- 97%
  • Sample discovered mappings
  • agent-name = concat(first-name,last-name)
  • area = building-area / 43560
  • discount-cost = (unit-price * quantity) * (1 -
    discount)
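The three sample mappings can be written out as small functions. This is an illustrative sketch: the space separator in the concatenation is an assumption, and reading 43560 as square feet per acre is an interpretation the slide does not state.

```python
SQUARE_FEET_PER_ACRE = 43560  # interpretation of the slide's constant


def agent_name(first_name, last_name):
    # concat(first-name, last-name); the separator is assumed.
    return f"{first_name} {last_name}"


def area(building_area):
    # area = building-area / 43560
    return building_area / SQUARE_FEET_PER_ACRE


def discount_cost(unit_price, quantity, discount):
    # discount-cost = (unit-price * quantity) * (1 - discount)
    return (unit_price * quantity) * (1 - discount)


print(agent_name("Mike", "Doan"), area(87120), discount_cost(10.0, 3, 0.5))
```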

31
Road Map
  • Introduction
  • Schema matching
  • LSD system
  • Creating complex mappings
  • COMAP system
  • Ontology matching
  • GLUE system
  • Conclusions

32
Ontology Matching
  • Increasingly critical for
  • knowledge bases, Semantic Web
  • An ontology
  • concepts organized into a taxonomy tree
  • each concept has
  • a set of attributes
  • a set of instances
  • relations among concepts
  • Matching
  • concepts
  • attributes
  • relations

CS Dept. US
Entity
Undergrad Courses
Grad Courses
People
Staff
Faculty
Assistant Professor
Associate Professor
Professor
name: Mike Burns, degree: Ph.D.
33
Matching Taxonomies of Concepts
CS Dept. Australia
Entity
Courses
Staff
Technical Staff
Academic Staff
Senior Lecturer
Lecturer
Professor
34
Constraints in Taxonomy Matching
  • Domain-dependent
  • at most one node matches department-chair
  • a node that matches professor can not be a child
    of a node that matches assistant-professor
  • Domain-independent
  • two nodes match if parents & children match
  • if all children of X match Y, then X also
    matches Y
  • Variations have been exploited in many restricted
    settings [Melnik & Garcia-Molina, ICDE-02],
    [Milo & Zohar, VLDB-98], [Noy et al., IJCAI-01],
    [Madhavan et al., VLDB-01]
  • Challenge: find a general & efficient approach

35
Solution: Relaxation Labeling
  • Relaxation labeling [Hummel & Zucker, 83]
  • applied to graph labeling in vision, NLP,
    hypertext classification
  • finds best label assignment, given a set of
    constraints
  • starts with initial label assignment
  • iteratively improves labels, using constraints
  • Standard relax. labeling not applicable
  • extended it in many ways [Doan et al., WWW-02]
  • Experiments
  • three real-world domains in course catalog &
    company listings
  • 30 -- 300 nodes / taxonomy
  • accuracy 66 -- 97% vs. 52 -- 83% of best base
    learner
  • relaxation labeling very fast (under a few
    seconds)

36
Related Work
Hand-crafted rules; exploit schema; 1-1 mapping
TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99],
[Palopoli et al. 98], CUPID [Madhavan et al. 01], PROMPT [Noy et al. 00]
Single learner; exploit data; 1-1 mapping
SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95],
DELTA [Clifton et al. 97]
Rules; exploit data; 1-1 & complex mapping
CLIO [Miller et al., 00], [Yan et al. 01]
Learners & rules, use multi-strategy learning; exploit schema & data;
1-1 & complex mapping; exploit domain constraints
LSD [Doan et al., SIGMOD-01], COMAP [Doan et al. 2002, submitted],
GLUE [Doan et al., WWW-02]
37
Future Work
  • Learning source descriptions
  • formal semantics for mapping
  • query capabilities, source schema, scope,
    reliability of data, ...
  • Dealing with changes in source description
  • Matching objects across sources
  • More sophisticated user feedback
  • Focus on distributed information management
    systems
  • data integration, web-service integration, peer
    data management
  • goal: significantly reduce complexity of
    construction & maintenance

38
Conclusions
  • Efficiently creating semantic mappings is
    critical
  • Developed solution for semi-automatic schema
    matching
  • learns from previous matching activities
  • can match relational schemas, DTDs, ontologies,
    ...
  • discovers both 1-1 & complex mappings
  • highly modular & extensible
  • achieves high matching accuracy
  • Made contributions to machine learning
  • developed novel method to classify XML data
  • extended relaxation labeling

39
Backup Slides
40
Training the Meta-Learner
  • For address

Extracted XML Instances | Name Learner | Naive Bayes | True Predictions
<location> Miami, FL </> | 0.5 | 0.8 | 1
<listed-price> 250,000 </> | 0.4 | 0.3 | 0
<area> Seattle, WA </> | 0.3 | 0.9 | 1
<house-addr> Kent, WA </> | 0.6 | 0.8 | 1
<num-baths> 3 </> | 0.3 | 0.3 | 0
... | ... | ... | ...
Least-Squares Linear Regression
Weight(Name-Learner,address) = 0.1
Weight(Naive-Bayes,address) = 0.9
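A sketch of the least-squares step for address, fitting true prediction ~ w1 * (Name Learner score) + w2 * (Naive Bayes score) with no intercept. The five rows are read off the slide's flattened table in assumed row order (Name Learner, Naive Bayes, true prediction). The slide's table is truncated ("..."), so weights fitted from these five rows alone should not be expected to match the slide's 0.1 and 0.9. The function name is hypothetical.

```python
# (Name Learner score, Naive Bayes score, true prediction for address)
TRAINING_ROWS = [
    (0.5, 0.8, 1),  # location: Miami, FL
    (0.4, 0.3, 0),  # listed-price: 250,000
    (0.3, 0.9, 1),  # area: Seattle, WA
    (0.6, 0.8, 1),  # house-addr: Kent, WA
    (0.3, 0.3, 0),  # num-baths: 3
]


def fit_two_weights(rows):
    """Solve the 2x2 normal equations for y ~ w1*x1 + w2*x2 (no intercept)."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    t1 = sum(x1 * y for x1, _, y in rows)
    t2 = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("base-learner scores are collinear")
    # Cramer's rule on [[s11, s12], [s12, s22]] @ [w1, w2] = [t1, t2].
    w1 = (t1 * s22 - s12 * t2) / det
    w2 = (s11 * t2 - s12 * t1) / det
    return w1, w2


w_name, w_bayes = fit_two_weights(TRAINING_ROWS)
print(w_name, w_bayes)
```

On these rows the fit still puts the larger weight on Naive Bayes, which is the learner whose scores track the true address labels more closely.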
41
Sensitivity to Amount of Available Data
Average matching accuracy (%)
Number of data listings per source (Real Estate I)
42
Contribution of Each Component
Average Matching Accuracy (%)
Without Name Learner
Without Naive Bayes
Without Whirl Learner
Without Constraint Handler
The complete LSD system
43
Exploiting Hierarchical Structure
  • Existing learners flatten out all structures
  • Developed XML learner
  • similar to the Naive Bayes learner
  • input instance = bag of tokens
  • differs in one crucial aspect
  • consider not only text tokens, but also structure
    tokens

<contact> <name> Gail Murphy </name> <firm>
MAX Realtors </firm> </contact>
<description> Victorian house with a view.
Name your price! To see it, contact Gail
Murphy at MAX Realtors. </description>
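A sketch of the idea behind the XML learner: build a bag of tokens that holds structure tokens as well as text tokens. The exact form of the structure tokens used here (a tag token such as `<name>` and a parent-child token such as `contact->name`) is an assumption for illustration. With text tokens alone, every token of the contact instance also appears in the description instance. The structure tokens are what set them apart.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def bag_of_tokens(xml_text, with_structure=True):
    """Bag of text tokens, optionally extended with structure tokens."""
    root = ET.fromstring(xml_text)
    bag = Counter()

    def visit(node, parent_tag):
        if with_structure and parent_tag is not None:
            bag[f"<{node.tag}>"] += 1               # tag token (assumed form)
            bag[f"{parent_tag}->{node.tag}"] += 1   # parent-child token (assumed form)
        for tok in re.findall(r"\w+", (node.text or "").lower()):
            bag[tok] += 1
        for child in node:
            visit(child, node.tag)

    # The root tag is the element being classified, so it is not a token.
    visit(root, None)
    return bag


contact = ("<contact> <name> Gail Murphy </name> "
           "<firm> MAX Realtors </firm> </contact>")
description = ("<description> Victorian house with a view. Name your price! "
               "To see it, contact Gail Murphy at MAX Realtors. </description>")

flat_contact = bag_of_tokens(contact, with_structure=False)
structured_contact = bag_of_tokens(contact)
description_bag = bag_of_tokens(description)
print(sorted(structured_contact))
```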
44
Reasons for Incorrect Matchings
  • Unfamiliarity
  • suburb
  • solution add a suburb-name recognizer
  • Insufficient information
  • correctly identified general type, failed to
    pinpoint exact type
  • agent-name, phone: Richard Smith,
    (206) 234 5412
  • solution: add a proximity learner
  • Subjectivity
  • house-style = description?
  • Victorian: Beautiful neo-gothic house;
    Mexican: Great location

45
Evaluate Mapping Candidates
  • For address, Text Searcher returns
  • (agent-id,0.7)
  • (concat(agent-id,city),0.8)
  • (concat(city,zipcode),0.75)
  • Employ multi-strategy learning to evaluate
    mappings
  • Example (concat(agent-id,city),0.8)
  • Naive Bayes Learner 0.8
  • Name Learner: "address" vs. "agent id city": 0.3
  • Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
  • Meta-Learner returns
  • (agent-id,0.59)
  • (concat(agent-id,city),0.65)
  • (concat(city,zipcode),0.70)

46
Relaxation Labeling
  • Applied to similar problems in
  • vision, NLP, hypertext classification

People
Dept U.S.
Dept Australia
Courses
Courses
Courses
Courses
People
Staff
Staff
Faculty
Tech. Staff
Acad. Staff
Staff
Faculty
47
Relaxation Labeling for Taxonomy Matching
  • Must define
  • neighborhood of a node
  • k features of neighborhood
  • how to combine influence of features
  • Algorithm
  • init: for each pair <N,L>, compute
  • loop: for each pair <N,L>, re-compute

Acad. Staff Faculty Tech. Staff Staff
Staff People
Neighborhood configuration
48
Relaxation Labeling for Taxonomy Matching
  • Huge number of neighborhood configurations!
  • typically neighborhood = immediate nodes
  • here neighborhood can be entire graph: 100 nodes,
    10 labels => 10^100 configurations
  • Solution
  • label abstraction & dynamic programming
  • guarantee quadratic time for a broad range of
    domain constraints
  • Empirical evaluation
  • GLUE system [Doan et al., WWW-02]
  • three real-world domains
  • 30 -- 300 nodes / taxonomy
  • high accuracy 66 -- 97% vs. 52 -- 83% of best
    base learner
  • relaxation labeling very fast, finished in
    several seconds