Data Integration - PowerPoint PPT Presentation

1 / 32
About This Presentation
Title:

Data Integration

Description:

area day-phone extra-info. 13. Putting It All Together: LSD ... It can be filled in using some type of code, and corrected by hand. 15. A Few Mapping Examples ... – PowerPoint PPT presentation

Number of Views:78
Avg rating:3.0/5.0
Slides: 33
Provided by: zack4
Category:

less

Transcript and Presenter's Notes

Title: Data Integration


1
Data Integration
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 550 Database Information Systems
  • November 19, 2009

LSD Slides courtesy AnHai Doan
2
A Problem
  • Weve seen that even with normalization and the
    same needs, different people will arrive at
    different schemas
  • In fact, most people also have different needs!
  • Often people build databases in isolation, then
    want to share their data
  • Different systems within an enterprise
  • Different information brokers on the Web
  • Scientific collaborators
  • Researchers who want to publish their data for
    others to use
  • This is the goal of data integration tie
    together different sources, controlled by many
    people, under a common schema

3
Building a Data Integration System
  • Create a middleware mediator or data
    integration system over the sources
  • Can be warehoused (a data warehouse) or virtual
  • Presents a uniform query interface and schema
  • Abstracts away multitude of sources consults
    them for relevant data
  • Unifies different source data formats (and
    possibly schemas)
  • Sources are generally autonomous, not designed to
    be integrated
  • Sources may be local DBs or remote web
    sources/services
  • Sources may require certain input to return
    output (e.g., web forms) binding patterns
    describe these

4
Typical Data Integration Components
Query
Results
Data Integration System / Mediator
Mediated Schema
Source Catalog
Mappings in Catalog
Wrapper
Wrapper
Wrapper
Source Relations
5
Typical Data Integration Architecture
Source Descrs.
Query
Reformulator
Source Catalog
Query over sources
QueryProcessor
Results
Queries bindings
Data in mediated format
Wrapper
Wrapper
Wrapper
6
Challenges of Mapping Schemas
  • In a perfect world, it would be easy to match up
    items from one schema with another
  • Every table would have a similar table in the
    other schema
  • Every attribute would have an identical attribute
    in the other schema
  • Every value would clearly map to a value in the
    other schema
  • Real world as with human languages, things
    dont map clearly!
  • May have different numbers of tables different
    decompositions
  • Metadata in one relation may be data in another
  • Values may not exactly correspond
  • It may be unclear whether a value is the same

7
Different Aspects to Mapping
  • Schema matching / ontology alignment
  • How do we find correspondences between
    attributes?
  • Entity matching / deduplication / record linking
    / etc.
  • How do we know when two records refer to the
    same thing?
  • Mapping definition
  • How do we specify the constraints or
    transformations that let us reason about when to
    create an entry in one schema, given an entry in
    another schema?

Lets see one influential approach to schema
matching
8
Standard Schema Matcher Architecture(Established
by LSD System)
  • Suppose user wants to integrate 100 data sources
  • User
  • manually creates mappings for a few sources, say
    3
  • shows schema matcher these mappings
  • Schema matcher learns from the mappings
  • Multi-strategy learning incorporates many types
    of info in a general way
  • Knowledge of constraints further helps
  • Matcher proposes mappings for remaining 97 sources

9
Example
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Learned hypotheses
Schema of realestate.com
If phone occurs in the name gt agent-phone
listed-price 250,000 110,000 ...
location Miami, FL Boston, MA ...
phone (305) 729 0831 (617) 253 1429 ...
comments Fantastic house Great location ...
realestate.com
If fantastic great occur frequently in
data values gt description
homes.com
price 550,000 320,000 ...
contact-phone (278) 345 7215 (617) 335 2315 ...
extra-info Beautiful yard Great beach ...
10
Learning from Multiple Sources
  • Use a set of base matchers
  • Each exploits well certain types of information
  • Name learner looks at words in the attribute
    names
  • Naïve Bayes learner looks at patterns in the data
    values
  • Etc.
  • Match schema elements of a new source
  • Apply the base learners
  • Each returns a score
  • For different attributes one learner is more
    useful than another
  • Combine their predictions using a combiner /
    meta-learner
  • Combiner / meta-learner
  • Uses training sources to measure base learner
    accuracy
  • Weighs each learner based on its accuracy

11
Training the Learners
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Schema of realestate.com
Name Learner
(location, address) (listed-price, price) (phone,
agent-phone) (comments, description) ...
ltlocationgt Miami, FL lt/gt ltlisted-pricegt
250,000lt/gt ltphonegt (305) 729 0831lt/gt
ltcommentsgt Fantastic house lt/gt
realestate.com
Naive Bayes Learner
ltlocationgt Boston, MA lt/gt ltlisted-pricegt
110,000lt/gt ltphonegt (617) 253 1429lt/gt
ltcommentsgt Great location lt/gt
(Miami, FL, address) ( 250,000,
price) ((305) 729 0831, agent-phone) (Fantastic
house, description) ...
12
Applying the Learners
Mediated schema
Schema of homes.com
address price agent-phone
description
area day-phone extra-info
Name Learner Naive Bayes
ltareagtSeattle, WAlt/gt ltareagtKent,
WAlt/gt ltareagtAustin, TXlt/gt
(address,0.8), (description,0.2) (address,0.6),
(description,0.4) (address,0.7), (description,0.3)
Meta-Learner
Name Learner Naive Bayes
Meta-Learner
(address,0.7), (description,0.3)
ltday-phonegt(278) 345 7215lt/gt ltday-phonegt(617) 335
2315lt/gt ltday-phonegt(512) 427 1115lt/gt
(agent-phone,0.9), (description,0.1)
(address,0.6), (description,0.4)
ltextra-infogtBeautiful yardlt/gt ltextra-infogtGreat
beachlt/gt ltextra-infogtClose to Seattlelt/gt
13
Putting It All Together LSD Schema Matching
System
Matching Phase
Training Phase
Mediated schema
Source schemas
Domain Constraints
Data listings
Training data for base learners
User Feedback
Constraint Handler
L1
L2
Lk
Mapping Combination
14
Mappings between Schemas
  • LSD provides attribute correspondences, but not
    complete mappings
  • Many similar systems COMA, COMA, Falcon-AO,
  • Mappings generally are posed as views define
    relations in one schema (typically either the
    mediated schema or the source schema), given data
    in the other schema
  • This allows us to restructure or recompose
    decompose our data in a new way
  • We can also define mappings between values in a
    view
  • We use an intermediate table defining
    correspondences a concordance table
  • It can be filled in using some type of code, and
    corrected by hand

15
A Few Mapping Examples
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
  • MotionPicture(ID, Title, Year)Participant(ID,
    Name, Role)

PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
T2
Need a concordance table from CustIDs to PennIDs
16
Two Important Approaches
  • TSIMMIS Garcia-Molina97 Stanford
  • Focus semistructured data (OEM), OQL-based
    language (Lorel)
  • Creates a mediated schema as a view over the
    sources
  • Spawned a UCSD project called MIX, which led to a
    company now owned by BEA Systems
  • Other important systems of this vein Kleisli/K2
    _at_ Penn
  • Information Manifold Levy96 ATT Research
  • Focus local-as-view mappings, relational model
  • Sources defined as views over mediated schema
  • Requires a special
  • Led to peer-to-peer integration approaches
    (Piazza, etc.)
  • Focus Web-based queriable sources

17
TSIMMIS
  • One of the first systems to support
    semi-structured data, which predated XML by
    several years OEM
  • An instance of a global-as-view mediation
    system
  • We define our global schema as views over the
    sources
  • Well use XQuery XML to illustrate the
    principles

18
Some Simple Data
ltbookgt ltauthorgtBernsteinlt/authorgt
ltauthorgtNewcomerlt/authorgt lttitlegtPrinciples of
TPlt/titlegtlt/bookgt ltbookgt ltauthorgtChamberlinlt/au
thorgt lttitlegtDB2 UDBlt/titlegtlt/bookgt
19
Queries in TSIMMIS
  • Specified in OQL-style language called Lorel
  • OQL was an object-oriented query language that
    looks like SQL
  • Lorel is, in many ways, a predecessor to XQuery
  • Example in XQuery
  • for b in AllData()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

20
Query Answering in TSIMMIS
  • Basically, its view unfolding, i.e., composing a
    query with a view
  • The query is the one being asked
  • The views are the MSL templates for the wrappers
  • Some of the views may actually require
    parameters, e.g., an author name, before theyll
    return answers
  • Common for web forms (see Amazon, Google, )
  • XQuery functions (XQuerys version of views)
    support parameters as well, so well see these in
    action

21
A Wrapper Definition in MSL
  • Wrappers have templates and binding patterns (X)
    in MSL
  • B - B ltbook ltauthor Xgtgt // select
    from book where author X //
  • This reformats a SQL query over Book(author,
    year, title)
  • In XQuery, this might look like
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgtb/titleltauthorgtxlt/authorgtlt/bookgt


book
author
title


The union of GetBooks results is unioned with
others to form the view Mediator()
22
How to Answer the Query
  • Given our query
  • for b in Mediator()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b
  • Find all wrapper definitions that
  • Contain output enough structure to match the
    conditions of the query
  • Or have already tested the conditions for us!

23
Query Composition with Views
  • We find all views that define book with author
    and title, and we compose the query with each
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgt b/title ltauthorgtxlt/authorgtlt/bookgt
  • for b in Mediator()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

book
author
title


24
Matching View Output to Our Querys Conditions
  • Determine that b/book/author/text() ?? x by
    matching the pattern on the functions output
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgt b/title ltauthorgtxlt/author
    gtlt/bookgt
  • let x Chamberlinfor b in
    GetBook(x)/bookwhere b/title/text() DB2
    UDB return b

book
author
title


25
The Final Step Unfolding
  • let x Chamberlinfor b in ( for b in
    sql(Amazon.com, select from book where
    author x ) return ltbookgt b/title
    ltauthorgtxlt/authorgtlt/bookgt
  • )/bookwhere b/title/text() DB2 UDB
    return b
  • How do we simplify further to get to here?
  • for b in sql(Amazon.com, select from
    book where authorChamberlin)where
    b/title/text() DB2 UDB return b

26
Virtues of TSIMMIS
  • Early adopter of semistructured data, greatly
    predating XML
  • Can support data from many different kinds of
    sources
  • Obviously, doesnt fully solve heterogeneity
    problem
  • Presents a mediated schema that is the union of
    multiple views
  • Query answering based on view unfolding
  • Easily composed in a hierarchy of mediators

27
Limitations of TSIMMIS Approach
  • Some data sources may contain data with certain
    ranges or properties
  • Books by Aho, Students at UPenn,
  • If we ask a query for students at Columbia, dont
    want to bother querying students at Penn
  • How do we express these?
  • Mediated schema is basically the union of the
    various MSL templates as they change, so may
    the mediated schema

28
An Alternate ApproachThe Information Manifold
(Levy et al.)
  • When you integrate something, you have some
    conceptual model of the integrated domain
  • Define that as a basic frame of reference,
    everything else as a view over it
  • Local as View
  • May have overlapping/incomplete sources
  • Define each source as the subset of a query over
    the mediated schema
  • We can use selection or join predicates to
    specify that a source contains a range of values
  • ComputerBooks() ? Books(Title, , Subj), Subj
    Computers

29
The Local-as-View Model
  • The basic model is the following
  • Local sources are views over the mediated
    schema
  • Sources have the data mediated schema is
    virtual
  • Sources may not have all the data from the domain
    open-world assumption
  • The system must use the sources (views) to answer
    queries over the mediated schema

30
Query Answering
  • Assumption conjunctive queries, set semantics
  • Suppose we have a mediated schema author(aID,
    isbn, year), book(isbn, title, publisher)
  • Suppose we have the query
  • q(a, t) - author(a, i, _), book(i, t, p), t
    DB2 UDB
  • and sources
  • s1(a,t) ? author(a, i, _), book(i, t, p), t
    123
  • s5(a, t, p) ? author(a, i, _), book(i,t), p
    SAMS
  • We want to compose the query with the source
    mappings but theyre in the wrong direction!
  • Yet everything in s1, s5 is an answer to the
    query!

31
Answering Queries Using Views
  • Numerous recently-developed algorithms for these
  • Inverse rules Duschka et al.
  • Bucket algorithm Levy et al.
  • MiniCon Pottinger Halevy
  • Also related chase and backchase Popa,
    Tannen, Deutsch
  • Requires conjunctive queries

32
Summary of Data Integration
  • Local-as-view integration has replaced
    global-as-view as the standard
  • More robust way of defining mediated schemas and
    sources
  • Mediated schema is clearly defined, less likely
    to change
  • Sources can be more accurately described
  • Methods exist for query reformulation, including
    inverse rules
  • Integration requires standardization on a single
    schema
  • Can be hard to get consensus
  • Today we have peer-to-peer data integration,
    e.g., Piazza Halevy et al., Orchestra Ives et
    al., Hyperion Miller et al.
  • Data integration capabilities in commercial
    products Oracle Fusion, IBMs WebSphere
    Integrator, numerous packages from middleware
    companies
Write a Comment
User Comments (0)
About PowerShow.com