New England Database Society (NEDS)

Transcript and Presenter's Notes

1
  • New England Database Society (NEDS)
  • Friday, April 23, 2004
  • Volen 101, Brandeis University

Sponsored by Sun Microsystems
2
Learning to Reconcile Semantic Heterogeneity
  • Alon Halevy
  • University of Washington, Seattle
  • NEDS, April 23, 2004

3
Large-Scale Data Sharing
  • Large-scale data sharing is pervasive
  • Big science (bio-medicine, astrophysics, ...)
  • Government agencies
  • Large corporations
  • The web (over 100,000 searchable data sources)
  • Enterprise Information Integration industry
  • The vision
  • Content authoring by anyone, anywhere
  • Powerful database-style querying
  • Use relevant data from anywhere to answer the
    query
  • The Semantic Web
  • Fundamental problem: reconciling different models
    of the world.

4
Large-Scale Scientific Data Sharing
[Diagram: scientific data sources and groups sharing data, including Swiss-Prot, HUGO, OMIM, GeneClinics, UW Microbiology, UW Genome Sciences, and Harvard Genetics.]
5
Data Integration
www.biomediator.org [Tarczy-Hornoch, Mork]
[Diagram: a mediated schema of concepts (Entity, Sequenceable Entity, Structured Vocabulary, Gene, Phenotype, Experiment, Nucleotide Sequence, Microarray Experiment, Protein) mapped to sources OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO.]
Query: For the microarray experiment I just ran, what are the related nucleotide sequences, and for what protein do they code?
6
Peer Data Management Systems
Piazza [Tatarinov, Halevy, Ives, Suciu, Mork]
  • Mappings specified locally
  • Map to most convenient nodes
  • Queries answered by traversing semantic paths.

[Diagram: peers (CiteSeer, DBLP, Stanford, UW, Brown, M.I.T., Brandeis) connected by pairwise semantic mappings.]
7
Data Sharing Architectures
  • Data integration
  • PDMS
  • Message passing
  • Web services
  • Data warehousing

8
Semantic Mappings
  • Formalism for mappings
  • Reformulation algorithms

Mediated Schema
  • How will we create them?



9
Semantic Mappings Example
  • Differences in
  • Names in schema
  • Attribute grouping
  • Coverage of databases
  • Granularity and format of attributes

Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
  BookCategories(ISBN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
10
Why is Schema Matching so Hard?
  • Because the schemas never fully capture their
    intended meaning
  • Schema elements are just symbols.
  • We need to leverage any additional information we
    may have.
  • Theorem: Schema matching is AI-complete.
  • Hence, a human will always be in the loop.
  • The goal is to improve designers' productivity.
  • Solutions must be extensible.

11
Dimensions of the Problem (1)
Matching vs. Mapping
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
  BookCategories(ISBN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  • Schema Matching: discovering correspondences
    between similar elements
  • Schema Mapping: an expression relating the schemas,
    e.g., BooksAndMusic(x.Title, ...) as the union of
    Books(x.Title, ...) and CDs(x.Album, ...)
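To make the distinction concrete, here is a minimal sketch using the inventory schemas above (Python; the correspondence pairs and the union view are illustrative assumptions, not the talk's exact mapping expression):

```python
# A *match*: correspondences between similar elements of the two inventories.
correspondences = [
    ("BooksAndMusic.Title", "Books.Title"),
    ("BooksAndMusic.Title", "CDs.Album"),
    ("BooksAndMusic.SuggestedPrice", "Books.Price"),
    ("BooksAndMusic.SuggestedPrice", "CDs.Price"),
]

# A *mapping*: an expression that relates instances of the two schemas,
# here sketched as a union view over database B (hypothetical SQL).
mapping_view = """
CREATE VIEW BooksAndMusic AS
  SELECT Title AS Title, Price AS SuggestedPrice FROM Books
  UNION ALL
  SELECT Album AS Title, Price AS SuggestedPrice FROM CDs
"""
print(len(correspondences), "correspondences; mapping defined as a view")
```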

12
Dimensions of the Problem (2)
  • Schema level vs. instance level
  • Alon Halevy, A. Halevy, Alon Y. Levy: same guy!
  • Can't always separate the two levels.
  • Crucial for Personal Info Management (see Semex)
  • What are we mapping?
  • Schemas
  • Web service descriptions
  • Business logic and processes
  • Ontologies

13
Important Special Cases
  • Mapping to a common mediated schema?
  • Or mapping two arbitrary schemas?
  • One schema may be a new version of the other.
  • The two schemas may be evolutions of the same
    original schema.
  • Web forms.
  • Horizontal integration: many sources talking
    about the same stuff.
  • Vertical integration: sources covering different
    parts of the domain, with only little overlap.

14
Problem Definition
  • Given
  • S1 and S2: a pair of schemas/DTDs/ontologies
  • Possibly, accompanying data instances
  • Additional domain knowledge
  • Find
  • A match between S1 and S2: a set of correspondences
    between their terms.

15
Outline
  • Motivation and problem definition
  • Learning to match to a mediated schema
  • Matching arbitrary schemas using a corpus
  • Matching web services.

16
Typical Matching Heuristics (see Rahm & Bernstein, VLDBJ 2001, for a survey)
  • Build a model for every element from multiple
    sources of evidence in the schemas
  • Schema element names
  • BooksAndCDs/Categories ~ BookCategories/Category
  • Descriptions and documentation
  • ItemID: unique identifier for a book or a CD
  • ISBN: unique identifier for any book
  • Data types, data instances
  • DateTime ≠ Integer,
  • addresses have similar formats
  • Schema structure
  • All books have similar attributes

In isolation, these techniques are incomplete or
brittle; a principled combination is needed (see the
COMA system).
These models consider only the two schemas.
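As a rough illustration of combining several weak signals, here is a minimal Python sketch (not the COMA system's actual algorithm; the element representation, weights, and similarity measures are assumptions made only for illustration):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    # Name-based evidence: normalized string similarity between element names.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def type_compatibility(t1: str, t2: str) -> float:
    # Type-based evidence: 1.0 for identical types, 0.5 if both numeric, else 0.
    numeric = {"int", "float", "decimal"}
    if t1 == t2:
        return 1.0
    if t1 in numeric and t2 in numeric:
        return 0.5
    return 0.0

def combined_score(e1: dict, e2: dict, w_name: float = 0.6, w_type: float = 0.4) -> float:
    # Each signal alone is incomplete or brittle; a weighted combination is more robust.
    return (w_name * name_similarity(e1["name"], e2["name"])
            + w_type * type_compatibility(e1["type"], e2["type"]))

print(combined_score({"name": "Categories", "type": "string"},
                     {"name": "Category", "type": "string"}))
```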
17
Matching to a Mediated Schema (Doan et al., SIGMOD 2001; MLJ 2003)
Find houses with four bathrooms priced under $500,000
[Diagram: a query over the mediated schema is reformulated and optimized into queries over source schemas 1-3, which wrap realestate.com, homeseekers.com, and homes.com.]
18
Finding Semantic Mappings
  • Source schemas: XML DTDs

[Diagram: the mediated DTD house(address, num-baths, contact-info(agent-name, agent-phone)) is mapped to a source DTD house(location, full-baths, half-baths, contact(name, phone)). address to location is a 1-1 mapping; num-baths to full-baths and half-baths is a non 1-1 mapping.]
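The non 1-1 mapping above can be expressed as a small derivation over source elements. A minimal Python sketch, assuming the attribute names from the figure and assuming num-baths is simply the sum of the two source counts:

```python
def num_baths(source_record: dict) -> int:
    # Non 1-1 mapping: the mediated num-baths is derived from two source
    # elements rather than corresponding to a single one (sum is assumed).
    return source_record["full-baths"] + source_record["half-baths"]

print(num_baths({"full-baths": 2, "half-baths": 1}))  # 3
```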
19
Learning from Previous Matching
  • Every matching task is a learning opportunity.
  • Several types of knowledge are used in learning
  • Schema elements, e.g., attribute names
  • Data elements: ranges, formats, word frequencies,
    value frequencies, length of texts.
  • Proximity of attributes
  • Functional dependencies, number of attribute
    occurrences.

20
Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

Learned hypotheses:
  • If 'phone' occurs in the name => agent-phone
  • If 'fantastic' and 'great' occur frequently in data values => description

realestate.com (training data):
  listed-price: 250,000; 110,000; ...
  location: Miami, FL; Boston, MA; ...
  phone: (305) 729 0831; (617) 253 1429; ...
  comments: Fantastic house; Great location; ...

homes.com (unseen source):
  price: 550,000; 320,000; ...
  contact-phone: (278) 345 7215; (617) 335 2315; ...
  extra-info: Beautiful yard; Great beach; ...
21
Learning to Match Schemas
[Diagram: the multi-strategy learning system (LSD). Training phase: the mediated schema, source schemas, data listings, and domain constraints are used to train base-learner 1 ... base-learner k and a meta-learner. Matching phase: the learners' predictions are combined and passed through a constraint handler, with user feedback, to produce mappings.]
22
Multi-Strategy Learning
  • Use a set of base learners
  • Name learner, Naïve Bayes, Whirl, XML learner
  • And a set of recognizers
  • County name, zip code, phone numbers.
  • Each base learner produces a prediction weighted
    by confidence score.
  • Combine base learners with a meta-learner, using
    stacking.

23
Base Learners
  • Name Learner

    Training examples: (contact-info, office-address), (contact, agent-phone),
    (phone, agent-phone), (listed-price, price), ...
    Prediction for (contact-phone, ?): (agent-phone, 0.7), (office-address, 0.3)
  • Naive Bayes Learner [Domingos & Pazzani '97]
  • 'Kent, WA' => (address, 0.8), (name, 0.2)
  • Whirl Learner [Cohen & Hirsh '98]
  • XML Learner
  • exploits hierarchical structure of XML data
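A toy Python sketch of a base learner and a recognizer (the 'phone' rule mirrors the contact-phone example above; everything else is a hypothetical stand-in for the trained learners):

```python
def name_learner(element_name: str):
    # Toy name-based base learner: returns (label, confidence) predictions.
    # In LSD the learner is trained from (element, label) pairs; this rule is
    # hard-coded only to mirror the slide's contact-phone example.
    if "phone" in element_name:
        return [("agent-phone", 0.7), ("office-address", 0.3)]
    if "address" in element_name or "info" in element_name:
        return [("office-address", 0.8), ("agent-phone", 0.2)]
    return [("price", 0.5), ("office-address", 0.5)]

def zip_recognizer(value: str) -> bool:
    # Toy recognizer: fires when a data value looks like a 5-digit zip code.
    return value.isdigit() and len(value) == 5

print(name_learner("contact-phone"))  # [('agent-phone', 0.7), ('office-address', 0.3)]
print(zip_recognizer("98195"))        # True
```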

24
Meta-Learner Stacking
  • Training of meta-learner produces a weight for
    every pair of
  • (base-learner, mediated-schema element)
  • weight(Name-Learner, address) = 0.1
  • weight(Naive-Bayes, address) = 0.9
  • When combining predictions, the meta-learner
    computes a weighted sum of the base-learner
    confidence scores

Example for <area>Seattle, WA</area>:
  Name Learner: (address, 0.6)    Naive Bayes: (address, 0.8)
  Meta-Learner: (address, 0.6 × 0.1 + 0.8 × 0.9 = 0.78)
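A minimal Python sketch of the stacking combination; the weights and confidences reproduce the worked example above:

```python
def meta_predict(predictions: dict, weights: dict) -> float:
    # Weighted sum of base-learner confidence scores for one mediated-schema
    # element, using the per-learner weights produced by meta-learner training.
    return sum(weights[learner] * conf for learner, conf in predictions.items())

weights = {"Name-Learner": 0.1, "Naive-Bayes": 0.9}       # from training
predictions = {"Name-Learner": 0.6, "Naive-Bayes": 0.8}   # for <area>Seattle, WA</area>
print(meta_predict(predictions, weights))                 # 0.78
```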
25
Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info

<area>Seattle, WA</area>, <area>Kent, WA</area>, <area>Austin, TX</area>
  Name Learner: (address, 0.8), (description, 0.2)
  Naive Bayes: (address, 0.6), (description, 0.4)
  Meta-Learner: (address, 0.7), (description, 0.3)

<day-phone>(278) 345 7215</day-phone>, <day-phone>(617) 335 2315</day-phone>, <day-phone>(512) 427 1115</day-phone>
  Meta-Learner: (agent-phone, 0.9), (description, 0.1)

<extra-info>Beautiful yard</extra-info>, <extra-info>Great beach</extra-info>, <extra-info>Close to Seattle</extra-info>
  Meta-Learner: (description, 0.8), (address, 0.2)
26
Empirical Evaluation
  • Four domains
  • Real Estate I & II, Course Offerings, Faculty
    Listings
  • For each domain
  • create mediated DTD and domain constraints
  • choose five sources
  • mediated DTDs: 14-66 elements; source DTDs: 13-48
    elements
  • Ten runs for each experiment; in each run
  • manually provide 1-1 mappings for 3 sources
  • ask LSD to propose mappings for the remaining 2
    sources
  • accuracy = percentage of 1-1 mappings correctly identified

27
Matching Accuracy
Average matching accuracy (%)
  • LSD's accuracy: 71-92%
  • Best single base learner: 42-72%
  • Meta-learner: 5-22%
  • Constraint handler: 7-13%
  • XML learner: 0.8-6%
28
Outline
  • Motivation and problem definition
  • Learning to match to a mediated schema
  • Matching arbitrary schemas using a corpus
  • Matching web services.

29
Corpus-Based Schema Matching (Madhavan, Doan, Bernstein, Halevy)
  • Can we use previous experience to match two new
    schemas?
  • Learn about a domain, rather than a mediated
    schema?

Classifier for every corpus element
Learn general purpose knowledge
Reuse extracted knowledge to match new schemas
30
Exploiting The Corpus
  • Given elements s ∈ S and t ∈ T, how do we
    determine if s and t are similar?
  • The PIVOT Method
  • Elements are similar if they are similar to the
    same corpus concepts
  • The AUGMENT Method
  • Enrich the knowledge about an element by
    exploiting similar elements in the corpus.

31
Pivot: measuring (dis)agreement
[Diagram: compute interpretations w.r.t. the corpus. For an element s of schema S, the interpretation I(s) = (p1, ..., pk), where pk = Probability(s belongs to corpus concept ck). Elements s (of S) and t (of T) are compared via Similarity(I(s), I(t)).]
  • Interpretation captures how similar an element is
    to each corpus concept
  • Compared using cosine distance.
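A minimal Python sketch of the PIVOT idea: build an interpretation vector of per-concept probabilities for each element and compare the vectors with cosine similarity. The hard-coded "classifiers" below are placeholders for the corpus-trained models:

```python
import math

def cosine(u, v):
    # Cosine similarity between two interpretation vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def interpretation(element, corpus_classifiers):
    # I(s): for each corpus concept, the probability that element s belongs to it.
    return [classify(element) for classify in corpus_classifiers]

# Placeholder per-concept classifiers (price, address, phone); in the real
# system these are learned from the corpus, not keyword rules.
corpus_classifiers = [
    lambda e: 0.9 if "price" in e["name"] else 0.1,
    lambda e: 0.8 if "location" in e["name"] or "addr" in e["name"] else 0.1,
    lambda e: 0.9 if "phone" in e["name"] else 0.1,
]

s = {"name": "listed-price"}   # element of schema S
t = {"name": "price"}          # element of schema T
print(cosine(interpretation(s, corpus_classifiers),
             interpretation(t, corpus_classifiers)))
```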

32
Augmenting element models
[Diagram: for an element s of schema S, search the corpus of known schemas and mappings for similar concepts e and f, then build an augmented element model Ms from s, e, and f, covering the element's name, instances, and type.]
  • Search similar corpus concepts
  • Pick the most similar ones from the
    interpretation
  • Build augmented models
  • Robust since more training data to learn from
  • Compare elements using the augmented models
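A minimal Python sketch of the AUGMENT idea under assumed data layouts (elements and corpus concepts as dicts of a name plus instances; the token-overlap similarity is only a stand-in):

```python
def augment_model(element, corpus, similarity, top_k=2):
    # Enrich an element's model with instances from its most similar corpus
    # concepts, so the later comparison has more training data to work with.
    ranked = sorted(corpus, key=lambda concept: similarity(element, concept), reverse=True)
    enriched = list(element["instances"])
    for concept in ranked[:top_k]:
        enriched.extend(concept["instances"])
    return {"name": element["name"], "instances": enriched}

# Toy usage: similarity measured by shared tokens in the names.
sim = lambda e, c: len(set(e["name"].split("-")) & set(c["name"].split("-")))
corpus = [{"name": "agent-phone", "instances": ["(617) 555 0100"]},
          {"name": "office-address", "instances": ["12 Main St"]}]
element = {"name": "contact-phone", "instances": ["(278) 345 7215"]}
print(augment_model(element, corpus, sim))
```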

33
Experimental Results
  • Five domains
  • Auto and Real Estate: web forms
  • Invsmall and Inventory: relational schemas
  • Nameaddr: real XML schemas
  • Performance measure
  • F-Measure
  • Precision and recall are measured in terms of the
    matches predicted.
  • Results averaged over hundreds of schema matching
    tasks!
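For reference, the F-measure reported here is the usual harmonic mean of precision and recall; a one-line Python helper:

```python
def f_measure(precision: float, recall: float) -> float:
    # Standard F1: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f_measure(0.8, 0.7))  # ~0.747
```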

34
Comparison over domains
Corpus-based techniques perform better in all the
domains.
35
Tough schema pairs
  • Significant improvement on difficult-to-match
    schema pairs

36
Mixed corpus
A corpus with schemas from different domains can
also be useful.
37
Other Corpus Based Tools
  • A corpus of schemas can be the basis for many
    useful tools
  • Mirror the success of corpora in IR and NLP?
  • Auto-complete: I start creating a schema (or show
    sample data), and the tool suggests a completion.
  • Formulating queries on new databases: I ask a query
    using my terminology, and it gets reformulated
    appropriately.

38
Outline
  • Motivation and problem definition
  • Learning to match to a mediated schema
  • Matching arbitrary schemas using a corpus
  • Matching web services.

39
Searching for Web Services (Dong, Madhavan, Nemes, Halevy, Zhang)
  • Over 1,000 web services already on the WWW.
  • Keyword search is not sufficient.
  • Search involves drill-down; we don't want to repeat
    it. Hence:
  • Find similar operations
  • Find operations that compose with this one.

40
1) Operations With Similar Functionality
  • Op1: GetTemperature
  • Input: Zip, Authorization
  • Output: Return
  • Op2: WeatherFetcher
  • Input: PostCode
  • Output: TemperatureF, WindChill, Humidity

Similar Operations
41
2) Operations with Similar Inputs/Outputs
  • Op1: GetTemperature
  • Input: Zip, Authorization
  • Output: Return
  • Op2: WeatherFetcher
  • Input: PostCode
  • Output: TemperatureF, WindChill, Humidity
  • Op3: LocalTimeByZipcode
  • Input: Zipcode
  • Output: LocalTimeByZipCodeResult
  • Op4: ZipCodeToCityState
  • Input: ZipCode
  • Output: City, State

Similar Inputs
42
3) Composable Operations
  • Op1: GetTemperature
  • Input: Zip, Authorization
  • Output: Return
  • Op2: WeatherFetcher
  • Input: PostCode
  • Output: TemperatureF, WindChill, Humidity
  • Op3: LocalTimeByZipcode
  • Input: Zipcode
  • Output: LocalTimeByZipCodeResult
  • Op4: ZipCodeToCityState
  • Input: ZipCode
  • Output: City, State
  • Op5: CityStateToZipCode
  • Input: City, State
  • Output: ZipCode

Input of Op2 is similar to Output of Op5 =>
Composition
43
Why is this Hard?
  • Little to go on
  • Input/output parameters (they don't mean much)
  • Method name
  • Text descriptions of the operation or web service
    (typically bad)
  • Differences from schema matching
  • A web service is not a coherent schema
  • Different level of granularity.

44
Main Ideas
  • Measure the similarity of each component of the
    WS operation: inputs, outputs, operation description,
    and web-service description.
  • Cluster parameter names into concepts.
  • Heuristic: parameters occurring together tend to
    express the same concept
  • When comparing inputs/outputs, compare parameters
    and concepts separately, and combine the results
    (sketched below).
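A toy Python sketch of these ideas: cluster parameter names by co-occurrence, then compare two input/output lists at both the parameter level and the concept level and combine the scores. The greedy clustering rule, Jaccard measure, weights, and parameter names are illustrative assumptions, not Woogle's actual similarity functions:

```python
def cooccurrence_clusters(parameter_lists):
    # Heuristic from the slide: parameters that occur together in the same
    # input/output list are grouped into one concept (greedy merging).
    clusters = []
    for params in parameter_lists:
        for cluster in clusters:
            if cluster & set(params):
                cluster |= set(params)
                break
        else:
            clusters.append(set(params))
    return clusters

def set_similarity(a, b):
    # Jaccard similarity between two sets of names (a stand-in measure).
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def io_similarity(params1, params2, clusters, w_param=0.5, w_concept=0.5):
    # Compare at the parameter level and the concept level, then combine.
    to_concepts = lambda ps: {i for i, c in enumerate(clusters) for p in ps if p in c}
    return (w_param * set_similarity(params1, params2)
            + w_concept * set_similarity(to_concepts(params1), to_concepts(params2)))

lists = [["TemperatureF", "WindChill", "Humidity"], ["Temperature", "Humidity"], ["City", "State"]]
clusters = cooccurrence_clusters(lists)
# Low raw parameter overlap, but both lists map to the same "weather" concept.
print(io_similarity(["Temperature", "Humidity"],
                    ["TemperatureF", "WindChill", "Humidity"], clusters))  # 0.625
```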

45
Precision and Recall Results
46
Woogle
  • A collection of 790 web services: 431 active web
    services, 1,262 operations
  • Function
  • Web service similarity search
  • Keyword search on web service descriptions
  • Keyword search on inputs/outputs
  • Web service category browse
  • Web service on-site try
  • Web service status report
  • http://www.cs.washington.edu/woogle

47
Conclusion
  • Semantic reconciliation is crucial for data
    sharing.
  • Learning from experience is an important
    ingredient.
  • See Transformic Inc.
  • Current challenges: large schemas, GUIs, dealing
    with other meta-data issues.