CSE 636 Data Integration - PowerPoint PPT Presentation

About This Presentation
Title:

CSE 636 Data Integration

Description:

Cupid. Individual matchers. Schema-based. Content-based. Graph matching. Linguistic ... Generic Schema Matching with Cupid. VLDB, 2001. H. H. Do, E. Rahm: ... – PowerPoint PPT presentation

Number of Views:107
Avg rating:3.0/5.0
Slides: 28
Provided by: michailpe
Learn more at: https://cse.buffalo.edu
Category:
Tags: cse | cupid | data | integration

less

Transcript and Presenter's Notes

Title: CSE 636 Data Integration


1
CSE 636Data Integration
  • Schema Matching
  • Cupid

2
Virtual Integration Architecture
Design-Time
Run-Time
?
Schema Matching
Query Reformulation
Query
Result
End User
Mediation Language
Optimization Execution
Mediator
Global Schema
Web Services
XML
Data Source
Data Source
Local Schema
Local Schema
3
Schema Heterogeneity
  • Independently created schemas
  • might be modeling similar information
  • in slightly different ways

DB1
DB3
DB2
ugrad
student
course
ugradID
studentID
courseID
name
title
name
type
enrollment
student
courseID
course
studentID
ugradID
courseID
name
grade
title ?
type
type
evaluation
letter
4
Schema Heterogeneity
DB1
DB3
DB2
ugrad
student
course
ugradID
studentID
courseID
name
title
name
type
enrollment
student
courseID
course
studentID
ugradID
courseID
name
grade
title ?
type
type
evaluation
letter
  • Similar entities represented
  • Dissimilar structures (inverted nesting)
  • Different element names for similar data values
  • Similar element names for different data values

5
Schema Matching vs. Schema Mapping
  • GAV and LAV are schema mapping languages
  • Mappings
  • set of queries
  • associations semantics
  • Match
  • set of associations only
  • Schema Matching
  • Identifying associations
  • First step towards constructing mappings

6
Schema Matching vs. Schema Mapping
Semantics
  • for s1 in DB3/student
  • where s1/type UGRAD
  • return ltDB1gt
  • ltugradgt
  • ltugradIDgts1/studentIDlt/ugradIDgt
  • ltnamegts1/namelt/namegt
  • lt/ugradgt
  • lt/DB1gt
  • LAV Mapping DB1 ? Q(DB3)

Associations
7
The Problem of Schema Matching
  • Input
  • Schemas S1 and S2
  • Possibly data instances for S1 and S2
  • Background knowledge
  • thesauri
  • validated matches
  • standard schemas
  • reference instances
  • ontologies
  • constraints (keys, data types etc)
  • Output
  • Associations between S1 and S2
  • Goal
  • Schema matching tools with significant automated
    support

8
Schema Matching
How is the match result expressed?
DB3
DB2
student
course
studentID
courseID
name
title
type
student
course
studentID
courseID
name
title ?
type
type
evaluation
letter
  • Pairs of paths
  • Lists of paths
  • Schema names

9
Schema Matching
What do we match?
  • Depends on the queries we want to ask
  • Elements in isolation (leaves in particular)
  • Substructures
  • Whole schemas

10
Motivation
  • Important component in many applications
  • Data Integration
  • Data Migration
  • E-Commerce
  • Model ManagementBernstein, Halevy, Pottinger
    00
  • Algebra for manipulating models and mappings
  • Match, Merge, Compose

11
Problems
  • Minimize user involvement (semi-automatic)
  • Data model independent matching (generic)
  • Schema matching is a hard problem
  • Naming and structural differences in schemas
  • Similar, but non-identical concepts modeled
  • Multiple data models SQL DDL, XML, ODMG

12
Schema Matching Approaches
How to match?
  • Taxonomy based survey Rahm and Bernstein, VLDB
    J, 2001

13
Cupid
Combined matchers
Composite
Hybrid
automatic composition
manual composition
  • Madhavan, Bernstein and Rahm, VLDB, 2001

14
Cupid Example
PO
PurchaseOrder
POLines
Items
POShipTo
DeliverTo
Item
Item
Name
Name
City
Street
Line
ItemNumber
UoM
UnitofMeasure
Qty
Quantity
15
Cupid Architecture
LSIM
SSIM WSIM
16
Linguistic Matching
  • Heuristic name matching
  • Tokenization of names
  • POOrderNum ? PO, Order, Num
  • Expansion of short-forms, acronyms
  • PO ? Purchase, Order Num ? Number
  • Clustering of schema elements based on keywords
    and data-types
  • Street, City, POAddress ? Address
  • Thesaurus of synonyms, hypernyms, acronyms
  • Linguistic Similarity coefficient (LSIM) ? 0,1

17
Structure Matching
PO
PurchaseOrder
POLines
Items
Item
Item
Name
Address
Name
City
Street
Line
ItemNumber
City
Street
UoM
UnitofMeasure
Qty
Quantity
18
Structure MatchingMutually Reinforcing Similarity
PO
PurchaseOrder
WSIM gt thhigh
POLines
Items
WSIM gt thhigh
Item
Item
Line
ItemNum
UoM
UnitofMeasure
Qty
Quantity
19
Structure MatchingContext Dependent
Disambiguation
PO
PurchaseOrder
InvoiceTo
POBillTo
DeliverTo
POShipTo
Address
Street
City
Street
City
SSIM
City
Street
20
Intuition
  • Atomic elements are similar
  • Linguistically and data-type similar
  • Their ancestors are similar
  • Compound elements (non-leaf) are similar if
  • Linguistically similar
  • Subtrees rooted at the elements are similar
  • Mutually recursive
  • Leaves determine internal node similarity
  • Similarity of internal nodes leads to increase in
    leaf similarity

21
Structure Match Details
  • Subtrees are similar if
  • Immediate children are similar
  • Leaf sets are similar
  • Subtree Similarity (nodes s and t)
  • Fraction of leaves in subtree s that can be
    mapped to a leaf in the other subtree t and
    vice-versa
  • Less sensitive to variation in intermediate
    structure
  • Pruning the number of comparisons
  • Elements must have comparable number of leaves

22
Referential Integrity
Customer
Purchase Order
Customer ID
Order ID
Address
Customer ID
Name
Product Name
Schema A
Schema B
  • Join nodes added to the schema tree for each
    referential integrity constraint
  • Views can be similarly used

23
Cupid Architecture
LSIM
SSIM WSIM
Structural (SSIM), Weighted (WSIM) Similarity
Linguistic Similarity (LSIM)
InvoiceTo BillTo 0.7
UoM UnitMeasure 0.9
City City 1.0
InvoiceTo BillTo 0.8 0.7
UoM UnitMeasure 0.7 0.8
InvoiceTo/City BillTo/City 0.8 0.9
24
Mapping Generation
  • Individual mapping elements computed from WSIM
    values
  • Consider only mapping pairs that have WSIM
    greater than threshold
  • For each element of target find most similar
    source element
  • Not accepted mappings with high similarity are
    returned in order to help user modify map

25
Cupid Architecture
LSIM
SSIM WSIM
Generate Mapping
Output Mapping
26
Work Needed
  • A more robust solution
  • Auto-tuning parameters
  • Thesaurus Generation and Evolution
  • Schema matching component architecture
  • Easily extensible by adding multiple techniques
  • Data Instances for matching
  • Look at COMA ProtoPlasm systems

27
References
  1. J. Madhavan, P. A. Bernstein, E. RahmGeneric
    Schema Matching with CupidVLDB, 2001
  2. H. H. Do, E. RahmCOMA - A System for Flexible
    Combination of Schema Matching ApproachesVLDB,
    2002
  3. P. A. Bernstein, S. Melnik, M. Petropoulos, C.
    QuixIndustrial-Strength Schema MatchingSIGMOD
    Record 33(4), 2004
Write a Comment
User Comments (0)
About PowerShow.com