New England Database Society (NEDS)

Learn more at: https://www.cs.washington.edu

1

New England Database Society (NEDS)
Friday, April 23, 2004
Volen 101, Brandeis University

Sponsored by Sun Microsystems
2
Learning to Reconcile Semantic Heterogeneity

Alon Halevy
University of Washington, Seattle
NEDS, April 23, 2004

3
Large-Scale Data Sharing

Large-scale data sharing is pervasive
Big science (bio-medicine, astrophysics, ...)
Government agencies
Large corporations
The web (over 100,000 searchable data sources)
Enterprise Information Integration industry
The vision
Content authoring by anyone, anywhere
Powerful database-style querying
Use relevant data from anywhere to answer the
query
The Semantic Web
Fundamental problem: reconciling different models of the world.

4
Large-Scale Scientific Data Sharing
Swiss-Prot
HUGO
OMIM
UW
UW Microbiology
UW Genome Sciences
Harvard Genetics
GeneClinics
5
Data Integration
Entity
www.biomediator.org [Tarczy-Hornoch, Mork]
Sequenceable Entity
Structured Vocabulary
Gene
Phenotype
Experiment
Nucleotide Sequence
Microarray Experiment
Protein
OMIM
Swiss-Prot
HUGO
GO
GeneClinics
Entrez
LocusLink
GEO
Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
6
Peer Data Management Systems
Piazza [Tatarinov, H., Ives, Suciu, Mork]

Mappings specified locally
Map to most convenient nodes
Queries answered by traversing semantic paths.

CiteSeer
Stanford
UW
DBLP
Brown
M.I.T
Brandeis
7
Data Sharing Architectures

Data integration
PDMS
Message passing
Web services
Data warehousing

8
Semantic Mappings

Formalism for mappings
Reformulation algorithms

Mediated Schema

How will we create them?

9
Semantic Mappings Example

Differences in
Names in schema
Attribute grouping
Coverage of databases
Granularity and format of attributes

Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
Inventory Database A
Inventory Database B
10
Why is Schema Matching so Hard?

Because the schemas never fully capture their
intended meaning
Schema elements are just symbols.
We need to leverage any additional information we
may have.
Theorem: Schema matching is AI-complete.
Hence, a human will always be in the loop.
Goal is to improve the designer's productivity.
Solution must be extensible.

11
Dimensions of the Problem (1)
Matching vs. Mapping
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
Inventory Database A
Inventory Database B

Schema Matching: discovering correspondences between similar elements
Schema Mapping: BooksAndMusic(x:Title, ...) = Books(x:Title, ...) ∪ CDs(x:Album, ...)
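
The difference between a match and a mapping can be made concrete with a small sketch. It assumes, purely for illustration, that each relation is a list of dicts and that only the title attribute is mapped; the sample rows and the function name `books_and_music_titles` are hypothetical.

```python
# Hypothetical sample rows: attribute names come from the slide, values are made up.
books = [{"Title": "Dune", "ISBN": "111"}]
cds = [{"Album": "Kind of Blue", "ASIN": "B01"}]


def books_and_music_titles(books, cds):
    """Title column of BooksAndMusic, read as Books titles union CDs albums."""
    return {row["Title"] for row in books} | {row["Album"] for row in cds}


titles = books_and_music_titles(books, cds)
```

A match would only say that Title and Album both correspond to BooksAndMusic.Title; the mapping additionally says how to compute the target column (here, by a union).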

12
Dimensions of the Problem (2)

Schema level vs. instance level
Alon Halevy, A. Halevy, Alon Y. Levy: same guy!
Can't always separate the two levels.
Crucial for Personal Info Management (See Semex)
What are we mapping?
Schemas
Web service descriptions
Business logic and processes
Ontologies

13
Important Special Cases

Mapping to a common mediated schema?
Or mapping two arbitrary schemas?
One schema may be a new version of the other.
The two schemas may be evolutions of the same
original schema.
Web forms.
Horizontal integration: many sources talking about the same stuff.
Vertical integration: sources covering different parts of the domain, with only little overlap.

14
Problem Definition

Given:
S1 and S2: a pair of schemas/DTDs/ontologies
Possibly, accompanying data instances
Additional domain knowledge
Find:
A match between S1 and S2:
A set of correspondences between the terms.

15
Outline

Motivation and problem definition
Learning to match to a mediated schema
Matching arbitrary schemas using a corpus
Matching web services.

16
Typical Matching Heuristics [See Rahm & Bernstein, VLDBJ 2001, for a survey]

Build a model for every element from multiple sources of evidence in the schemas
Schema element names
BooksAndCDs/Categories ~ BookCategories/Category
Descriptions and documentation
ItemID: unique identifier for a book or a CD
ISBN: unique identifier for any book
Data types, data instances
DateTime ≠ Integer,
addresses have similar formats
Schema structure
All books have similar attributes

In isolation, techniques are incomplete or brittle. Need principled combination. [See the COMA system]
Models consider only the two schemas.
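
As an illustration of the name-based heuristic only, here is a minimal sketch that splits element names into tokens and scores the overlap. The tokenization rule and the use of Jaccard overlap are assumptions made for this example, not the method of any system cited on the slide.

```python
import re


def tokens(name):
    """Split a schema element name on case changes and punctuation, lowercased."""
    parts = re.findall(r"[A-Z][a-z]+|[a-z]+|[A-Z]+|\d+", name)
    return {p.lower() for p in parts}


def name_similarity(a, b):
    """Jaccard overlap of the two token sets (0.0 when either name has no tokens)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


score_categories = name_similarity("BookCategories", "Categories")
score_ids = name_similarity("ISBN", "ASIN")
```

Here "BookCategories" and "Categories" share one of two tokens, while "ISBN" and "ASIN" share none even though both are identifiers, which shows why names alone are brittle evidence.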
17
Matching to a Mediated Schema [Doan et al., SIGMOD 2001, MLJ 2003]
Find houses with four bathrooms priced under
500,000
mediated schema
Query reformulation and optimization.
source schema 2
source schema 3
source schema 1
homes.com
realestate.com
homeseekers.com
18
Finding Semantic Mappings

Source schemas: XML DTDs

Schema 1: house(address, num-baths, contact-info(agent-name, agent-phone))
Schema 2: house(location, contact(name, phone), full-baths, half-baths)
1-1 mapping
non 1-1 mapping
19
Learning from Previous Matching

Every matching task is a learning opportunity.
Several types of knowledge are used in learning
Schema elements, e.g., attribute names
Data elements: ranges, formats, word frequencies, value frequencies, length of texts.
Proximity of attributes
Functional dependencies, number of attribute
occurrences.

20
Matching Real-Estate Sources
Mediated schema
address price agent-phone description
Schema of realestate.com
location listed-price phone comments
realestate.com
listed-price: 250,000; 110,000; ...
location: Miami, FL; Boston, MA; ...
phone: (305) 729 0831; (617) 253 1429; ...
comments: Fantastic house; Great location; ...
Learned hypotheses
If "phone" occurs in the name => agent-phone
If "fantastic" & "great" occur frequently in data values => description
homes.com
price: 550,000; 320,000; ...
contact-phone: (278) 345 7215; (617) 335 2315; ...
extra-info: Beautiful yard; Great beach; ...
21
Learning to Match Schemas
[Figure: Multi-strategy Learning System, with a Training Phase and a Matching Phase]
Inputs: Mediated schema, Source schemas, Data listings
Learners: Base-Learner 1 ... Base-Learner k, Meta-Learner
Constraint Handler: Domain Constraints, User Feedback
Output: Mappings
22
Multi-Strategy Learning

Use a set of base learners
Name learner, Naïve Bayes, Whirl, XML learner
And a set of recognizers
County name, zip code, phone numbers.
Each base learner produces a prediction weighted
by confidence score.
Combine base learners with a meta-learner, using
stacking.

23
Base Learners

Name Learner

Training examples: (contact-info, office-address), (contact, agent-phone), (phone, agent-phone), (listed-price, price)
New element: (contact-phone, ?)
contact-phone => (agent-phone, 0.7), (office-address, 0.3)

Naive Bayes Learner [Domingos & Pazzani 97]
"Kent, WA" => (address, 0.8), (name, 0.2)
Whirl Learner [Cohen & Hirsh 98]
XML Learner
exploits hierarchical structure of XML data

24
Meta-Learner Stacking

Training of meta-learner produces a weight for every pair of
(base-learner, mediated-schema element)
weight(Name-Learner, address) = 0.1
weight(Naive-Bayes, address) = 0.9
Combining predictions: the meta-learner computes a weighted sum of base-learner confidence scores

Instance: <area>Seattle, WA</area>
Name Learner: (address, 0.6)
Naive Bayes: (address, 0.8)
Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
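
The arithmetic on this slide can be reproduced with a short sketch. The dictionary layout and the function name `combine` are assumptions for illustration; only the weights and confidence scores come from the slide.

```python
def combine(predictions, weights, label):
    """Weighted sum of base-learner confidence scores for one mediated-schema element."""
    return sum(
        weights[(learner, label)] * scores.get(label, 0.0)
        for learner, scores in predictions.items()
    )


# Weights per (base-learner, mediated-schema element), as on the slide.
weights = {("Name-Learner", "address"): 0.1, ("Naive-Bayes", "address"): 0.9}

# Base-learner predictions for the instance <area>Seattle, WA</area>.
predictions = {
    "Name-Learner": {"address": 0.6},
    "Naive-Bayes": {"address": 0.8},
}

score = combine(predictions, weights, "address")
```

With these inputs the combined score is approximately 0.78, matching the slide: the learner that was more reliable for address during training dominates the result.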
25
Applying the Learners
Mediated schema
address price agent-phone description
Schema of homes.com
area day-phone extra-info

<area>Seattle, WA</area> <area>Kent, WA</area> <area>Austin, TX</area>
Name Learner + Naive Bayes => Meta-Learner, per instance:
(address,0.8), (description,0.2)
(address,0.6), (description,0.4)
(address,0.7), (description,0.3)
Combined prediction for area: (address,0.7), (description,0.3)

<day-phone>(278) 345 7215</day-phone> <day-phone>(617) 335 2315</day-phone> <day-phone>(512) 427 1115</day-phone>
Name Learner + Naive Bayes => Meta-Learner: (agent-phone,0.9), (description,0.1)

<extra-info>Beautiful yard</extra-info> <extra-info>Great beach</extra-info> <extra-info>Close to Seattle</extra-info>
Name Learner + Naive Bayes => Meta-Learner: (description,0.8), (address,0.2)
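
The slide does not say how the per-instance predictions are merged into one prediction per source element. The sketch below assumes a plain average, which happens to be consistent with the numbers shown for area; the function name `average_predictions` is hypothetical.

```python
from collections import defaultdict


def average_predictions(instance_predictions):
    """Average the label scores over a non-empty list of per-instance predictions."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}


# Meta-learner output for the three <area> instances on the slide.
area_predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]

column_prediction = average_predictions(area_predictions)
best_label = max(column_prediction, key=column_prediction.get)
```

Under this assumption, area is assigned to address with a score of about 0.7, and description receives about 0.3.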
26
Empirical Evaluation

Four domains
Real Estate I & II, Course Offerings, Faculty Listings
For each domain
create mediated DTD & domain constraints
choose five sources
mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48 elements

Ten runs for each experiment - in each run
manually provide 1-1 mappings for 3 sources
ask LSD to propose mappings for remaining 2
sources
accuracy: % of 1-1 mappings correctly identified

27
Matching Accuracy
Average Matching Accuracy (%)
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: 5 - 22%
+ Constraint handler: 7 - 13%
+ XML learner: 0.8 - 6%
28
Outline

Motivation and problem definition
Learning to match to a mediated schema
Matching arbitrary schemas using a corpus
Matching web services.

29
Corpus-Based Schema Matching [Madhavan, Doan, Bernstein, Halevy]

Can we use previous experience to match two new
schemas?
Learn about a domain, rather than a mediated
schema?

Classifier for every corpus element
Learn general purpose knowledge
Reuse extracted knowledge to match new schemas
30
Exploiting The Corpus

Given an element s ∈ S and t ∈ T, how do we determine if s and t are similar?
The PIVOT Method
Elements are similar if they are similar to the
same corpus concepts
The AUGMENT Method
Enrich the knowledge about an element by
exploiting similar elements in the corpus.

31
Pivot: measuring (dis)agreement
Compute interpretations w.r.t. corpus
Interpretation I(s) of element s ∈ schema S: one entry P_k per concept c_k in the corpus
P_k = Probability(s ≈ c_k)
[Figure: elements s ∈ S and t ∈ T with their interpretations I(s) and I(t)]
Similarity(I(s), I(t))

Interpretation captures how similar an element is to each corpus concept
Compared using cosine distance.
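
A minimal sketch of the comparison step, assuming each interpretation is a plain list of probabilities over the same ordered set of corpus concepts. The vectors below are made-up values, and the code computes cosine similarity (higher means closer) rather than a distance.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical interpretation vectors over three corpus concepts.
I_s = [0.9, 0.1, 0.0]
I_t = [0.8, 0.2, 0.0]  # agrees with s on the first concept
I_u = [0.0, 0.1, 0.9]  # mostly about a different concept

sim_st = cosine_similarity(I_s, I_t)
sim_su = cosine_similarity(I_s, I_u)
```

Elements s and t never need to resemble each other directly; they score as similar because they resemble the same corpus concepts.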

32
Augmenting element models
[Figure: element s of schema S is used to search the corpus of known schemas and mappings for similar concepts e and f; the element model Ms (name, instances, type) is then built from s, e, and f]

Search similar corpus concepts
Pick the most similar ones from the interpretation
Build augmented models
Robust since more training data to learn from
Compare elements using the augmented models

33
Experimental Results

Five domains
Auto and real estate: web forms
Invsmall and inventory: relational schemas
Nameaddr: real XML schemas
Performance measure
F-Measure
Precision and recall are measured in terms of the
matches predicted.
Results averaged over hundreds of schema matching
tasks!
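
For reference, a sketch of how such a measure could be computed over predicted matches. It assumes the balanced F-measure (harmonic mean of precision and recall), which the slide does not spell out; the element pairs are made up for the example.

```python
def precision_recall_f(predicted, correct):
    """Precision, recall, and balanced F-measure over sets of predicted matches."""
    predicted, correct = set(predicted), set(correct)
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical matcher output and reference matches.
predicted = {("location", "address"), ("phone", "agent-phone"), ("comments", "price")}
correct = {
    ("location", "address"),
    ("phone", "agent-phone"),
    ("comments", "description"),
    ("listed-price", "price"),
}

precision, recall, f_measure = precision_recall_f(predicted, correct)
```

Two of the three predictions are right (precision 2/3) and two of the four reference matches are found (recall 1/2), giving an F-measure of 4/7.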

34
Comparison over domains
Corpus based techniques perform better in all the
domains
35
Tough schema pairs

Significant improvement in difficult to match
schema pairs

36
Mixed corpus
Corpus with schemas from different domains can
also be useful
37
Other Corpus Based Tools

A corpus of schemas can be the basis for many
useful tools
Mirror the success of corpora in IR and NLP?
Auto-complete
I start creating a schema (or show sample data),
and the tool suggests a completion.
Formulating queries on new databases
I ask a query using my terminology, and it gets
reformulated appropriately.

38
Outline

Motivation and problem definition
Learning to match to a mediated schema
Matching arbitrary schemas using a corpus
Matching web services.

39
Searching for Web Services [Dong, Madhavan, Nemes, Halevy, Zhang]

Over 1000 web services already on WWW.
Keyword search is not sufficient.
Search involves drill-down; don't want to repeat it. Hence:
Find similar operations
Find operations that compose with this one.

40
1) Operations With Similar Functionality

Op1: GetTemperature
Input: Zip, Authorization
Output: Return
Op2: WeatherFetcher
Input: PostCode
Output: TemperatureF, WindChill, Humidity

Similar Operations
41
2) Operations with Similar Inputs/Outputs

Op1: GetTemperature
Input: Zip, Authorization
Output: Return
Op2: WeatherFetcher
Input: PostCode
Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
Input: Zipcode
Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
Input: ZipCode
Output: City, State

Similar Inputs
42
3) Composable Operations

Op1: GetTemperature
Input: Zip, Authorization
Output: Return
Op2: WeatherFetcher
Input: PostCode
Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
Input: Zipcode
Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
Input: ZipCode
Output: City, State
Op5: CityStateToZipCode
Input: City, State
Output: ZipCode

Input of Op2 is similar to Output of Op5 => Composition
43
Why is this Hard?

Little to go on
Input/output parameters (they don't mean much)
Method name
Text descriptions of operation or web service
(typically bad)
Difference from schema matching
Web service not a coherent schema
Different level of granularity.

44
Main Ideas

Measure similarity of each of the components of the WS-operation: I, O, description, WS description.
Cluster parameter names into concepts.
Heuristic: parameters occurring together tend to express the same concepts
When comparing inputs/outputs, compare parameters
and concepts separately, and combine the results.
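
The last idea can be sketched as follows. The concept clusters are hard-coded here as a hypothetical `CONCEPT_OF` table (in practice they would come from the clustering step), and the equal weighting of the two scores is an assumption, not something the slide states.

```python
# Hypothetical concept clusters; a real system would learn these from co-occurrence.
CONCEPT_OF = {
    "zip": "postal-code",
    "zipcode": "postal-code",
    "postcode": "postal-code",
    "city": "place",
    "state": "place",
}


def jaccard(a, b):
    """Overlap of two collections treated as sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def param_similarity(params_a, params_b, concept_weight=0.5):
    """Compare parameter names and their concepts separately, then combine."""
    names_a = [p.lower() for p in params_a]
    names_b = [p.lower() for p in params_b]
    concepts_a = [CONCEPT_OF.get(n, n) for n in names_a]
    concepts_b = [CONCEPT_OF.get(n, n) for n in names_b]
    name_score = jaccard(names_a, names_b)
    concept_score = jaccard(concepts_a, concepts_b)
    return (1 - concept_weight) * name_score + concept_weight * concept_score


op2_input = ["PostCode"]       # WeatherFetcher
op5_output = ["ZipCode"]       # CityStateToZipCode
op4_output = ["City", "State"]  # ZipCodeToCityState

sim_op5_to_op2 = param_similarity(op5_output, op2_input)
sim_op4_to_op2 = param_similarity(op4_output, op2_input)
```

The names ZipCode and PostCode share nothing, so only the concept-level comparison reveals that the output of Op5 can feed the input of Op2, while the output of Op4 cannot.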

45
Precision and Recall Results
46
Woogle