Title: New England Database Society (NEDS)
1- New England Database Society (NEDS)
- Friday, April 23, 2004
- Volen 101, Brandeis University
Sponsored by Sun Microsystems
2 Learning to Reconcile Semantic Heterogeneity
- Alon Halevy
- University of Washington, Seattle
- NEDS, April 23, 2004
3 Large-Scale Data Sharing
- Large-scale data sharing is pervasive
- Big science (bio-medicine, astrophysics, ...)
- Government agencies
- Large corporations
- The web (over 100,000 searchable data sources)
- Enterprise Information Integration industry
- The vision
- Content authoring by anyone, anywhere
- Powerful database-style querying
- Use relevant data from anywhere to answer the query
- The Semantic Web
- Fundamental problem: reconciling different models of the world.
4 Large-Scale Scientific Data Sharing
[Figure: data-sharing participants: Swiss-Prot, HUGO, OMIM, UW, UW Microbiology, UW Genome Sciences, Harvard Genetics, GeneClinics]
5 Data Integration
www.biomediator.org [Tarczy-Hornoch, Mork]
[Figure: mediated schema with elements Entity, Sequenceable Entity, Structured Vocabulary, Gene, Phenotype, Experiment, Nucleotide Sequence, Microarray Experiment, Protein; data sources OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO]
Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
6 Peer Data Management Systems
Piazza [Tatarinov, H., Ives, Suciu, Mork]
- Mappings specified locally
- Map to most convenient nodes
- Queries answered by traversing semantic paths.
[Figure: network of peers: CiteSeer, Stanford, UW, DBLP, Brown, M.I.T., Brandeis]
7 Data Sharing Architectures
- Data integration
- PDMS
- Message passing
- Web services
- Data warehousing
8 Semantic Mappings
- Formalism for mappings
- Reformulation algorithms
Mediated Schema
9 Semantic Mappings Example
- Differences in
- Names in schema
- Attribute grouping
- Coverage of databases
- Granularity and format of attributes
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
Inventory Database A
Inventory Database B
10 Why is Schema Matching so Hard?
- Because the schemas never fully capture their intended meaning
- Schema elements are just symbols.
- We need to leverage any additional information we may have.
- Theorem: Schema matching is AI-Complete.
- Hence, a human will always be in the loop.
- Goal is to improve designers' productivity.
- Solution must be extensible.
11 Dimensions of the Problem (1)
Matching vs. Mapping
[Figure: the Inventory Database A and Inventory Database B schemas from slide 9 (Books, Authors, BooksAndMusic, BookCategories, CDCategories, CDs, Artists)]
- Schema Matching: discovering correspondences between similar elements
- Schema Mapping: BooksAndMusic(x:Title, ...) = Books(x:Title, ...) ∪ CDs(x:Album, ...)
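The mapping on this slide can be sketched in Python, reading it as a union view (an interpretation of the slide's notation). Only the title column is mapped, because the remaining attributes are elided on the slide. The sample rows and the function name are hypothetical.

```python
# Hypothetical sample rows; the talk gives no instance data for these relations.
books = [{"Title": "Database Systems", "ISBN": "111"}]
cds = [{"Album": "Kind of Blue", "ASIN": "B000"}]


def books_and_music_titles(books, cds):
    """Expose Books.Title and CDs.Album together as BooksAndMusic.Title."""
    titles = {row["Title"] for row in books} | {row["Album"] for row in cds}
    return sorted(titles)


print(books_and_music_titles(books, cds))
```

A matching only says that Title and Album correspond to BooksAndMusic.Title; the mapping is the executable expression that actually produces the tuples.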
12 Dimensions of the Problem (2)
- Schema level vs. instance level
- "Alon Halevy", "A. Halevy", "Alon Y. Levy": same guy!
- Can't always separate the two levels.
- Crucial for Personal Info Management (See Semex)
- What are we mapping?
- Schemas
- Web service descriptions
- Business logic and processes
- Ontologies
13 Important Special Cases
- Mapping to a common mediated schema?
- Or mapping two arbitrary schemas?
- One schema may be a new version of the other.
- The two schemas may be evolutions of the same original schema.
- Web forms.
- Horizontal integration: many sources talking about the same stuff.
- Vertical integration: sources covering different parts of the domain, with only a little overlap.
14 Problem Definition
- Given
- S1 and S2: a pair of schemas/DTDs/ontologies,
- Possibly, accompanying data instances
- Additional domain knowledge
- Find
- A match between S1 and S2
- A set of correspondences between the terms.
15 Outline
- Motivation and problem definition
- Learning to match to a mediated schema
- Matching arbitrary schemas using a corpus
- Matching web services.
16 Typical Matching Heuristics [See Rahm & Bernstein, VLDBJ 2001, for a survey]
- Build a model for every element from multiple sources of evidence in the schemas
- Schema element names
- BooksAndCDs/Categories ~ BookCategories/Category
- Descriptions and documentation
- ItemID: unique identifier for a book or a CD
- ISBN: unique identifier for any book
- Data types, data instances
- DateTime ≠ Integer,
- addresses have similar formats
- Schema structure
- All books have similar attributes
In isolation, techniques are incomplete or brittle. Need principled combination. [See the Coma System]
Models consider only the two schemas.
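As an illustration of the "schema element names" heuristic, here is a minimal sketch of a name matcher. It splits names into word tokens, crudely singularizes them, and scores the overlap with Jaccard similarity. This is a stand-in chosen for brevity, not the method of any system cited in the talk; the tokenizer, the singularization rule, and the function names are all assumptions.

```python
import re


def tokens(name):
    """Split a path-like element name into lowercase, crudely singularized words."""
    words = re.findall(r"[A-Z][a-z]+|[A-Z]+s?(?![a-z])|[a-z]+", name)
    return {singular(w.lower()) for w in words}


def singular(word):
    """Very rough singularization, good enough for this illustration only."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and len(word) > 2:
        return word[:-1]
    return word


def name_similarity(a, b):
    """Jaccard similarity between the token sets of two element names."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


print(name_similarity("BooksAndCDs/Categories", "BookCategories/Category"))
```

A matcher like this is exactly the kind of technique the slide calls brittle in isolation: it scores ItemID against ISBN as 0 even though both identify books.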
17 Matching to a Mediated Schema [Doan et al., SIGMOD 2001, MLJ 2003]
Query: Find houses with four bathrooms priced under 500,000
[Figure: the query is posed over the mediated schema; query reformulation and optimization translate it to source schema 1, source schema 2, and source schema 3 over the sources homes.com, realestate.com, and homeseekers.com]
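18 Finding Semantic Mappings
[Figure: two house schema trees. One has address, num-baths, and contact-info (agent-name, agent-phone); the other has location, contact (name, phone), full-baths, and half-baths. Edges between them are labeled as 1-1 mappings and non 1-1 mappings.]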
19 Learning from Previous Matching
- Every matching task is a learning opportunity.
- Several types of knowledge are used in learning
- Schema elements, e.g., attribute names
- Data elements: ranges, formats, word frequencies, value frequencies, length of texts.
- Proximity of attributes
- Functional dependencies, number of attribute occurrences.
20 Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com
listed-price: 250,000; 110,000; ...
location: Miami, FL; Boston, MA; ...
phone: (305) 729 0831; (617) 253 1429; ...
comments: Fantastic house; Great location; ...
Learned hypotheses
If "phone" occurs in the name => agent-phone
If "fantastic" and "great" occur frequently in data values => description
homes.com
price: 550,000; 320,000; ...
contact-phone: (278) 345 7215; (617) 335 2315; ...
extra-info: Beautiful yard; Great beach; ...
21 Learning to Match Schemas
[Figure: Multi-strategy Learning System, with a Training Phase and a Matching Phase. Components shown: Mediated schema, Source schemas, Data listings, Domain Constraints, User Feedback, Base-Learner 1 ... Base-Learner k, Meta-Learner, Constraint Handler, Mappings]
22 Multi-Strategy Learning
- Use a set of base learners
- Name learner, Naïve Bayes, Whirl, XML learner
- And a set of recognizers
- County name, zip code, phone numbers.
- Each base learner produces a prediction weighted by a confidence score.
- Combine base learners with a meta-learner, using stacking.
23 Base Learners
- Training examples (source element, mediated-schema element): (contact-info, office-address), (contact, agent-phone), (phone, agent-phone), (listed-price, price)
- New element: (contact-phone, ?)
- contact-phone => (agent-phone, 0.7), (office-address, 0.3)
- Naive Bayes Learner [Domingos & Pazzani 97]
- "Kent, WA" => (address, 0.8), (name, 0.2)
- Whirl Learner [Cohen & Hirsh 98]
- XML Learner
- exploits hierarchical structure of XML data
24 Meta-Learner: Stacking
- Training of meta-learner produces a weight for every pair of
- (base-learner, mediated-schema element)
- weight(Name-Learner, address) = 0.1
- weight(Naive-Bayes, address) = 0.9
- Combining predictions of meta-learner
- computes weighted sum of base-learner confidence scores
<area>Seattle, WA</>
Name Learner: (address, 0.6)
Naive Bayes: (address, 0.8)
Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
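The weighted sum on this slide can be reproduced with a few lines of Python. This is a minimal sketch that reuses the slide's weights and confidence scores; the data layout and the function name `combine` are assumptions made for illustration, and training the weights (the stacking step itself) is not shown.

```python
# Weights learned by the meta-learner, keyed by (base learner, mediated-schema element).
weights = {("Name-Learner", "address"): 0.1, ("Naive-Bayes", "address"): 0.9}

# Confidence scores produced by each base learner for one <area> instance.
predictions = {"Name-Learner": {"address": 0.6}, "Naive-Bayes": {"address": 0.8}}


def combine(predictions, weights):
    """Weighted sum of base-learner confidences per mediated-schema element."""
    combined = {}
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            weight = weights.get((learner, label), 0.0)  # unseen pairs contribute nothing
            combined[label] = combined.get(label, 0.0) + weight * confidence
    return combined


print(round(combine(predictions, weights)["address"], 2))
```

Keeping one weight per (learner, element) pair lets the meta-learner trust a learner for some elements and not for others, here Naive Bayes far more than the Name Learner for address.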
25 Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
area: <area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
Name Learner and Naive Bayes, combined by the Meta-Learner, one prediction per instance: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3)
Combined prediction for area: (address,0.7), (description,0.3)
day-phone: <day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
Prediction for day-phone: (agent-phone,0.9), (description,0.1)
extra-info: <extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
Prediction for extra-info: (description,0.8), (address,0.2)
26 Empirical Evaluation
- Four domains
- Real Estate I & II, Course Offerings, Faculty Listings
- For each domain
- create mediated DTD and domain constraints
- choose five sources
- mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
- Ten runs for each experiment - in each run
- manually provide 1-1 mappings for 3 sources
- ask LSD to propose mappings for remaining 2 sources
- accuracy: % of 1-1 mappings correctly identified
27 Matching Accuracy
[Chart: Average Matching Accuracy (%)]
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
Meta-learner: +5 - 22%
Constraint handler: +7 - 13%
XML learner: +0.8 - 6%
28 Outline
- Motivation and problem definition
- Learning to match to a mediated schema
- Matching arbitrary schemas using a corpus
- Matching web services.
29 Corpus-Based Schema Matching [Madhavan, Doan, Bernstein, Halevy]
- Can we use previous experience to match two new schemas?
- Learn about a domain, rather than a mediated schema?
Classifier for every corpus element
Learn general purpose knowledge
Reuse extracted knowledge to match new schemas
30 Exploiting The Corpus
- Given an element s ∈ S and t ∈ T, how do we determine if s and t are similar?
- The PIVOT Method
- Elements are similar if they are similar to the same corpus concepts
- The AUGMENT Method
- Enrich the knowledge about an element by exploiting similar elements in the corpus.
31 Pivot: measuring (dis)agreement
- Compute interpretations w.r.t. corpus
- Interpretation I(s) of element s ∈ Schema S: one entry Pk = Probability(s ~ ck) per concept ck in the corpus
- Similarity(I(s), I(t)) for s in S and t in T
- Interpretation captures how similar an element is to each corpus concept
- Compared using cosine distance.
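The comparison step can be sketched as follows, treating each interpretation as a vector with one entry per corpus concept. The example vectors are hypothetical, and the sketch computes cosine similarity, with cosine distance taken as one minus that value; how the talk's system estimates the probabilities is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two interpretation vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # an all-zero interpretation matches nothing


# Hypothetical interpretations over three corpus concepts c1, c2, c3.
i_s = [0.9, 0.1, 0.0]  # element s resembles c1 strongly
i_t = [0.8, 0.2, 0.1]  # element t also resembles c1
i_u = [0.0, 0.1, 0.9]  # element u resembles c3 instead

print(cosine_similarity(i_s, i_t) > cosine_similarity(i_s, i_u))
print(1.0 - cosine_similarity(i_s, i_t))  # cosine distance between s and t
```

Two elements score as similar here without ever being compared directly, which is the point of using the corpus as a pivot.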
32 Augmenting element models
[Figure: for an element s of schema S, search the corpus of known schemas and mappings for similar concepts (e, f), then build the augmented element model Ms from s, e, and f. An element model covers Name, Instances, and Type.]
- Search similar corpus concepts
- Pick the most similar ones from the interpretation
- Build augmented models
- Robust since more training data to learn from
- Compare elements using the augmented models
33 Experimental Results
- Five domains
- Auto and real estate: webforms
- Invsmall and inventory: relational schemas
- Nameaddr: real xml schemas
- Performance measure
- F-Measure
- Precision and recall are measured in terms of the matches predicted.
- Results averaged over hundreds of schema matching tasks!
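For reference, precision, recall, and F-measure over predicted matches can be computed as below. This sketch assumes the balanced F1 form, the harmonic mean of precision and recall, since the slide does not say which variant was used; the match pairs are hypothetical.

```python
def precision_recall_f(predicted, correct):
    """Precision, recall, and balanced F-measure over sets of predicted matches."""
    predicted, correct = set(predicted), set(correct)
    hits = len(predicted & correct)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(correct) if correct else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical matches between elements of two schemas.
predicted = {("location", "address"), ("phone", "agent-phone"), ("comments", "price")}
correct = {("location", "address"), ("phone", "agent-phone"),
           ("comments", "description"), ("listed-price", "price")}

print(precision_recall_f(predicted, correct))
```

Here two of three predictions are right (precision 2/3) and two of four true matches are found (recall 1/2), giving an F-measure of 4/7.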
34 Comparison over domains
Corpus based techniques perform better in all the domains
35 Tough schema pairs
- Significant improvement in difficult to match schema pairs
36 Mixed corpus
Corpus with schemas from different domains can also be useful
37 Other Corpus Based Tools
- A corpus of schemas can be the basis for many useful tools
- Mirror the success of corpora in IR and NLP?
- Auto-complete
- I start creating a schema (or show sample data), and the tool suggests a completion.
- Formulating queries on new databases
- I ask a query using my terminology, and it gets reformulated appropriately.
38 Outline
- Motivation and problem definition
- Learning to match to a mediated schema
- Matching arbitrary schemas using a corpus
- Matching web services.
39 Searching for Web Services [Dong, Madhavan, Nemes, Halevy, Zhang]
- Over 1000 web services already on WWW.
- Keyword search is not sufficient.
- Search involves drill-down; don't want to repeat it. Hence:
- Find similar operations
- Find operations that compose with this one.
40 1) Operations With Similar Functionality
- Op1: GetTemperature
- Input: Zip, Authorization
- Output: Return
- Op2: WeatherFetcher
- Input: PostCode
- Output: TemperatureF, WindChill, Humidity
Similar Operations
41 2) Operations with Similar Inputs/Outputs
- Op1: GetTemperature
- Input: Zip, Authorization
- Output: Return
- Op2: WeatherFetcher
- Input: PostCode
- Output: TemperatureF, WindChill, Humidity
- Op3: LocalTimeByZipcode
- Input: Zipcode
- Output: LocalTimeByZipCodeResult
- Op4: ZipCodeToCityState
- Input: ZipCode
- Output: City, State
Similar Inputs
42 3) Composable Operations
- Op1: GetTemperature
- Input: Zip, Authorization
- Output: Return
- Op2: WeatherFetcher
- Input: PostCode
- Output: TemperatureF, WindChill, Humidity
- Op3: LocalTimeByZipcode
- Input: Zipcode
- Output: LocalTimeByZipCodeResult
- Op4: ZipCodeToCityState
- Input: ZipCode
- Output: City, State
- Op5: CityStateToZipCode
- Input: City, State
- Output: ZipCode
Input of Op2 is similar to Output of Op5 => Composition
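The composition check on this slide can be sketched in Python using the five operations listed. The concept table that treats Zip, PostCode, and ZipCode as the same concept is hand-written and hypothetical. The rule that a consumer's inputs must all be covered by a producer's outputs is an assumption made for illustration, not necessarily the criterion used in the talk's system.

```python
# Hypothetical hand-written concept table: parameter name -> concept.
CONCEPTS = {"zip": "zipcode", "postcode": "zipcode", "zipcode": "zipcode"}

# Operations from the slide: name -> (inputs, outputs).
OPERATIONS = {
    "Op1": (["Zip", "Authorization"], ["Return"]),
    "Op2": (["PostCode"], ["TemperatureF", "WindChill", "Humidity"]),
    "Op3": (["Zipcode"], ["LocalTimeByZipCodeResult"]),
    "Op4": (["ZipCode"], ["City", "State"]),
    "Op5": (["City", "State"], ["ZipCode"]),
}


def concepts(params):
    """Map parameter names to concepts, falling back to the lowercased name."""
    return {CONCEPTS.get(p.lower(), p.lower()) for p in params}


def composable_pairs(operations):
    """Pairs (producer, consumer) where the producer's outputs cover the consumer's inputs."""
    pairs = []
    for producer, (_, outputs) in operations.items():
        for consumer, (inputs, _) in operations.items():
            if producer != consumer and concepts(inputs) <= concepts(outputs):
                pairs.append((producer, consumer))
    return pairs


print(composable_pairs(OPERATIONS))
```

Op5 composes with Op2 only because the concept table equates PostCode with ZipCode; without that table a plain name comparison would miss it.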
43 Why is this Hard?
- Little to go on
- Input/output parameters (they don't mean much)
- Method name
- Text descriptions of operation or web service (typically bad)
- Difference from schema matching
- Web service not a coherent schema
- Different level of granularity.
44 Main Ideas
- Measure similarity of each of the components of the WS-operation: I, O, description, WS description.
- Cluster parameter names into concepts.
- Heuristic: Parameters occurring together tend to express the same concepts
- When comparing inputs/outputs, compare parameters and concepts separately, and combine the results.
45 Precision and Recall Results
46 Woogle
- A collection of 790 web services: 431 active web services, 1262 operations
- Function
- Web service similarity search
- Keyword search on web service descriptions
- Keyword search on inputs/outputs
- Web service category browse
- Web service on-site try
- Web service status report
- http://www.cs.washington.edu/woogle
47 Conclusion
- Semantic reconciliation is crucial for data sharing.
- Learning from experience: an important ingredient.
- See Transformic Inc.
- Current challenges: large schemas, GUIs, dealing with other meta-data issues.