Data Integration - PowerPoint PPT Presentation

1 / 28
About This Presentation
Title:

Data Integration

Description:

The LSD (Learning Source Descriptions) System. Suppose user wants to integrate 100 data sources ... LSD provides attribute correspondences, but not complete mappings ... – PowerPoint PPT presentation

Number of Views:121
Avg rating:3.0/5.0
Slides: 29
Provided by: zack4
Category:
Tags: data | integration

less

Transcript and Presenter's Notes

Title: Data Integration


1
Data Integration
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 550 Database Information Systems
  • November 14, 2007

2
Challenges of Mapping Schemas
  • In a perfect world, it would be easy to match up
    items from one schema with another
  • Every table would have a similar table in the
    other schema
  • Every attribute would have an identical attribute
    in the other schema
  • Every value would clearly map to a value in the
    other schema
  • Real world as with human languages, things
    dont map clearly!
  • May have different numbers of tables different
    decompositions
  • Metadata in one relation may be data in another
  • Values may not exactly correspond
  • It may be unclear whether a value is the same

3
Different Aspects to Mapping
  • Schema matching / ontology alignment
  • How do we find correspondences between
    attributes?
  • Entity matching / deduplication / record linking
    / etc.
  • How do we know when two records refer to the
    same thing?
  • Mapping definition
  • How do we specify the constraints or
    transformations that let us reason about when to
    create an entry in one schema, given an entry in
    another schema?

Lets see one influential approach to schema
matching
4
The LSD (Learning Source Descriptions) System
  • Suppose user wants to integrate 100 data sources
  • User
  • manually creates mappings for a few sources, say
    3
  • shows LSD these mappings
  • LSD learns from the mappings
  • Multi-strategy learning incorporates many types
    of info in a general way
  • Knowledge of constraints further helps
  • LSD proposes mappings for remaining 97 sources

5
Example
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Learned hypotheses
Schema of realestate.com
If phone occurs in the name agent-phone
listed-price 250,000 110,000 ...
location Miami, FL Boston, MA ...
phone (305) 729 0831 (617) 253 1429 ...
comments Fantastic house Great location ...
realestate.com
If fantastic great occur frequently in
data values description
homes.com
price 550,000 320,000 ...
contact-phone (278) 345 7215 (617) 335 2315 ...
extra-info Beautiful yard Great beach ...
6
LSDs Multi-Strategy Learning
  • Use a set of base learners
  • Each exploits well certain types of information
  • Name learner looks at words in the attribute
    names
  • Naïve Bayes learner looks at patterns in the data
    values
  • Etc.
  • Match schema elements of a new source
  • Apply the base learners
  • Each returns a score
  • For different attributes one learner is more
    useful than another
  • Combine their predictions using a meta-learner
  • Meta-learner
  • Uses training sources to measure base learner
    accuracy
  • Weighs each learner based on its accuracy

7
Training the Learners
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Schema of realestate.com
Name Learner
(location, address) (listed-price, price) (phone,
agent-phone) (comments, description) ...
Miami, FL
250,000 (305) 729 0831
Fantastic house
realestate.com
Naive Bayes Learner
Boston, MA
110,000 (617) 253 1429
Great location
(Miami, FL, address) ( 250,000,
price) ((305) 729 0831, agent-phone) (Fantastic
house, description) ...
8
Applying the Learners
Mediated schema
Schema of homes.com
address price agent-phone
description
area day-phone extra-info
Name Learner Naive Bayes
Seattle, WA Kent,
WA Austin, TX
(address,0.8), (description,0.2) (address,0.6),
(description,0.4) (address,0.7), (description,0.3)
Meta-Learner
Name Learner Naive Bayes
Meta-Learner
(address,0.7), (description,0.3)
(278) 345 7215 (617) 335
2315 (512) 427 1115
(agent-phone,0.9), (description,0.1)
(address,0.6), (description,0.4)
Beautiful yard Great
beach Close to Seattle
9
Putting It All Together LSD System
Matching Phase
Training Phase
Mediated schema
Source schemas
Domain Constraints
Data listings
Training data for base learners
User Feedback
Constraint Handler
L1
L2
Lk
Mapping Combination
10
Mappings between Schemas
  • LSD provides attribute correspondences, but not
    complete mappings
  • Mappings generally are posed as views define
    relations in one schema (typically either the
    mediated schema or the source schema), given data
    in the other schema
  • This allows us to restructure or recompose
    decompose our data in a new way
  • We can also define mappings between values in a
    view
  • We use an intermediate table defining
    correspondences a concordance table
  • It can be filled in using some type of code, and
    corrected by hand

11
A Few Mapping Examples
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
  • MotionPicture(ID, Title, Year)Participant(ID,
    Name, Role)

PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
T2
Need a concordance table from CustIDs to PennIDs
12
Two Important Approaches
  • TSIMMIS Garcia-Molina97 Stanford
  • Focus semistructured data (OEM), OQL-based
    language (Lorel)
  • Creates a mediated schema as a view over the
    sources
  • Spawned a UCSD project called MIX, which led to a
    company now owned by BEA Systems
  • Other important systems of this vein Kleisli/K2
    _at_ Penn
  • Information Manifold Levy96 ATT Research
  • Focus local-as-view mappings, relational model
  • Sources defined as views over mediated schema
  • Requires a special
  • Led to peer-to-peer integration approaches
    (Piazza, etc.)
  • Focus Web-based queriable sources

13
TSIMMIS
  • One of the first systems to support
    semi-structured data, which predated XML by
    several years OEM
  • An instance of a global-as-view mediation
    system
  • We define our global schema as views over the
    sources

14
XML vs. Object Exchange Model
Bernstein
Newcomer Principles of
TP Chamberlinthor DB2 UDB
O1 book O2 author Bernstein O3
author Newcomer O4 title Principles of
TP O5 book O6 author Chamberlin
O7 title DB2 UDB
15
Queries in TSIMMIS
  • Specified in OQL-style language called Lorel
  • OQL was an object-oriented query language that
    looks like SQL
  • Lorel is, in many ways, a predecessor to XQuery
  • Based on path expressions over OEM structures
  • select book where book.title DB2 UDB and
    book.author Chamberlin
  • This is basically like XQuery, which well use in
    place of Lorel and the MSL template language.
    Previous query restated
  • for b in AllData()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

16
Query Answering in TSIMMIS
  • Basically, its view unfolding, i.e., composing a
    query with a view
  • The query is the one being asked
  • The views are the MSL templates for the wrappers
  • Some of the views may actually require
    parameters, e.g., an author name, before theyll
    return answers
  • Common for web forms (see Amazon, Google, )
  • XQuery functions (XQuerys version of views)
    support parameters as well, so well see these in
    action

17
A Wrapper Definition in MSL
  • Wrappers have templates and binding patterns (X)
    in MSL
  • B - B // select
    from book where author X //
  • This reformats a SQL query over Book(author,
    year, title)
  • In XQuery, this might look like
  • declare function GetBook(x AS xsdstring) as
    book for b in sql(Amazon.DB,
    select from book where author x
    )return b/titlex
    book


book
author
title


The union of GetBooks results is unioned with
others to form the view Mediator()
18
How to Answer the Query
  • Given our query
  • for b in Mediator()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b
  • Find all wrapper definitions that
  • Contain output enough structure to match the
    conditions of the query
  • Or have already tested the conditions for us!

19
Query Composition with Views
  • We find all views that define book with author
    and title, and we compose the query with each
  • declare function GetBook(x AS xsdstring) as
    book for b in sql(Amazon.DB, select
    from book where author x )return
    b/title x
  • for b in Mediator()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

book
author
title


20
Matching View Output to Our Querys Conditions
  • Determine that b/book/author/text() ?? x by
    matching the pattern on the functions output
  • declare function GetBook(x AS xsdstring) as
    book for b in sql(Amazon.DB,
    select from book where author x
    )return b/title
    x
  • let x Chamberlinfor b in
    GetBook(x)/bookwhere b/title/text() DB2
    UDB return b

book
author
title


21
The Final Step Unfolding
  • let x Chamberlinfor b in ( for b in
    sql(Amazon.com, select from book where
    author x ) return b/title
    x
  • )/bookwhere b/title/text() DB2 UDB
    return b
  • How do we simplify further to get to here?
  • for b in sql(Amazon.com, select from
    book where authorChamberlin)where
    b/title/text() DB2 UDB return b

22
Virtues of TSIMMIS
  • Early adopter of semistructured data, greatly
    predating XML
  • Can support data from many different kinds of
    sources
  • Obviously, doesnt fully solve heterogeneity
    problem
  • Presents a mediated schema that is the union of
    multiple views
  • Query answering based on view unfolding
  • Easily composed in a hierarchy of mediators

23
Limitations of TSIMMIS Approach
  • Some data sources may contain data with certain
    ranges or properties
  • Books by Aho, Students at UPenn,
  • If we ask a query for students at Columbia, dont
    want to bother querying students at Penn
  • How do we express these?
  • Mediated schema is basically the union of the
    various MSL templates as they change, so may
    the mediated schema

24
An Alternate ApproachThe Information Manifold
(Levy et al.)
  • When you integrate something, you have some
    conceptual model of the integrated domain
  • Define that as a basic frame of reference,
    everything else as a view over it
  • Local as View
  • May have overlapping/incomplete sources
  • Define each source as the subset of a query over
    the mediated schema
  • We can use selection or join predicates to
    specify that a source contains a range of values
  • ComputerBooks() ? Books(Title, , Subj), Subj
    Computers

25
The Local-as-View Model
  • The basic model is the following
  • Local sources are views over the mediated
    schema
  • Sources have the data mediated schema is
    virtual
  • Sources may not have all the data from the domain
    open-world assumption
  • The system must use the sources (views) to answer
    queries over the mediated schema

26
Query Answering
  • Assumption conjunctive queries, set semantics
  • Suppose we have a mediated schema author(aID,
    isbn, year), book(isbn, title, publisher)
  • Suppose we have the query
  • q(a, t) - author(a, i, _), book(i, t, p), t
    DB2 UDB
  • and sources
  • s1(a,t) ? author(a, i, _), book(i, t, p), t
    123
  • s5(a, t, p) ? author(a, i, _), book(i,t), p
    SAMS
  • We want to compose the query with the source
    mappings but theyre in the wrong direction!
  • Yet everything in s1, s5 is an answer to the
    query!

27
Answering Queries Using Views
  • Numerous recently-developed algorithms for these
  • Inverse rules Duschka et al.
  • Bucket algorithm Levy et al.
  • MiniCon Pottinger Halevy
  • Also related chase and backchase Popa,
    Tannen, Deutsch
  • Requires conjunctive queries

28
Summary of Data Integration
  • Local-as-view integration has replaced
    global-as-view as the standard
  • More robust way of defining mediated schemas and
    sources
  • Mediated schema is clearly defined, less likely
    to change
  • Sources can be more accurately described
  • Methods exist for query reformulation, including
    inverse rules
  • Integration requires standardization on a single
    schema
  • Can be hard to get consensus
  • Today we have peer-to-peer data integration,
    e.g., Piazza Halevy et al., Orchestra Ives et
    al., Hyperion Miller et al.
  • Some other aspects of integration were addressed
    in related papers
  • Overlap between sources coverage of data at
    sources
  • Semi-automated creation of mappings and wrappers
  • Data integration capabilities in commercial
    products BEAs Liquid Data, IBMs DB2
    Information Integrator, numerous packages from
    middleware companies
Write a Comment
User Comments (0)
About PowerShow.com