Title: Data Integration
1Data Integration
- Zachary G. Ives
- University of Pennsylvania
- CIS 550 Database Information Systems
- November 19, 2009
LSD Slides courtesy AnHai Doan
2A Problem
- Weve seen that even with normalization and the
same needs, different people will arrive at
different schemas - In fact, most people also have different needs!
- Often people build databases in isolation, then
want to share their data - Different systems within an enterprise
- Different information brokers on the Web
- Scientific collaborators
- Researchers who want to publish their data for
others to use - This is the goal of data integration tie
together different sources, controlled by many
people, under a common schema
3Building a Data Integration System
- Create a middleware mediator or data
integration system over the sources - Can be warehoused (a data warehouse) or virtual
- Presents a uniform query interface and schema
- Abstracts away multitude of sources consults
them for relevant data - Unifies different source data formats (and
possibly schemas) - Sources are generally autonomous, not designed to
be integrated - Sources may be local DBs or remote web
sources/services - Sources may require certain input to return
output (e.g., web forms) binding patterns
describe these
4Typical Data Integration Components
Query
Results
Data Integration System / Mediator
Mediated Schema
Source Catalog
Mappings in Catalog
Wrapper
Wrapper
Wrapper
Source Relations
5Typical Data Integration Architecture
Source Descrs.
Query
Reformulator
Source Catalog
Query over sources
QueryProcessor
Results
Queries bindings
Data in mediated format
Wrapper
Wrapper
Wrapper
6Challenges of Mapping Schemas
- In a perfect world, it would be easy to match up
items from one schema with another - Every table would have a similar table in the
other schema - Every attribute would have an identical attribute
in the other schema - Every value would clearly map to a value in the
other schema - Real world as with human languages, things
dont map clearly! - May have different numbers of tables different
decompositions - Metadata in one relation may be data in another
- Values may not exactly correspond
- It may be unclear whether a value is the same
7Different Aspects to Mapping
- Schema matching / ontology alignment
- How do we find correspondences between
attributes? - Entity matching / deduplication / record linking
/ etc. - How do we know when two records refer to the
same thing? - Mapping definition
- How do we specify the constraints or
transformations that let us reason about when to
create an entry in one schema, given an entry in
another schema?
Lets see one influential approach to schema
matching
8Standard Schema Matcher Architecture(Established
by LSD System)
- Suppose user wants to integrate 100 data sources
- User
- manually creates mappings for a few sources, say
3 - shows schema matcher these mappings
- Schema matcher learns from the mappings
- Multi-strategy learning incorporates many types
of info in a general way - Knowledge of constraints further helps
- Matcher proposes mappings for remaining 97 sources
9Example
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Learned hypotheses
Schema of realestate.com
If phone occurs in the name gt agent-phone
listed-price 250,000 110,000 ...
location Miami, FL Boston, MA ...
phone (305) 729 0831 (617) 253 1429 ...
comments Fantastic house Great location ...
realestate.com
If fantastic great occur frequently in
data values gt description
homes.com
price 550,000 320,000 ...
contact-phone (278) 345 7215 (617) 335 2315 ...
extra-info Beautiful yard Great beach ...
10Learning from Multiple Sources
- Use a set of base matchers
- Each exploits well certain types of information
- Name learner looks at words in the attribute
names - Naïve Bayes learner looks at patterns in the data
values - Etc.
- Match schema elements of a new source
- Apply the base learners
- Each returns a score
- For different attributes one learner is more
useful than another - Combine their predictions using a combiner /
meta-learner - Combiner / meta-learner
- Uses training sources to measure base learner
accuracy - Weighs each learner based on its accuracy
11Training the Learners
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Schema of realestate.com
Name Learner
(location, address) (listed-price, price) (phone,
agent-phone) (comments, description) ...
ltlocationgt Miami, FL lt/gt ltlisted-pricegt
250,000lt/gt ltphonegt (305) 729 0831lt/gt
ltcommentsgt Fantastic house lt/gt
realestate.com
Naive Bayes Learner
ltlocationgt Boston, MA lt/gt ltlisted-pricegt
110,000lt/gt ltphonegt (617) 253 1429lt/gt
ltcommentsgt Great location lt/gt
(Miami, FL, address) ( 250,000,
price) ((305) 729 0831, agent-phone) (Fantastic
house, description) ...
12Applying the Learners
Mediated schema
Schema of homes.com
address price agent-phone
description
area day-phone extra-info
Name Learner Naive Bayes
ltareagtSeattle, WAlt/gt ltareagtKent,
WAlt/gt ltareagtAustin, TXlt/gt
(address,0.8), (description,0.2) (address,0.6),
(description,0.4) (address,0.7), (description,0.3)
Meta-Learner
Name Learner Naive Bayes
Meta-Learner
(address,0.7), (description,0.3)
ltday-phonegt(278) 345 7215lt/gt ltday-phonegt(617) 335
2315lt/gt ltday-phonegt(512) 427 1115lt/gt
(agent-phone,0.9), (description,0.1)
(address,0.6), (description,0.4)
ltextra-infogtBeautiful yardlt/gt ltextra-infogtGreat
beachlt/gt ltextra-infogtClose to Seattlelt/gt
13Putting It All Together LSD Schema Matching
System
Matching Phase
Training Phase
Mediated schema
Source schemas
Domain Constraints
Data listings
Training data for base learners
User Feedback
Constraint Handler
L1
L2
Lk
Mapping Combination
14Mappings between Schemas
- LSD provides attribute correspondences, but not
complete mappings - Many similar systems COMA, COMA, Falcon-AO,
- Mappings generally are posed as views define
relations in one schema (typically either the
mediated schema or the source schema), given data
in the other schema - This allows us to restructure or recompose
decompose our data in a new way - We can also define mappings between values in a
view - We use an intermediate table defining
correspondences a concordance table - It can be filled in using some type of code, and
corrected by hand
15A Few Mapping Examples
- Movie(Title, Year, Director, Editor, Star1,
Star2) - Movie(Title, Year, Director, Editor, Star1,
Star2)
- PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
- MotionPicture(ID, Title, Year)Participant(ID,
Name, Role)
PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
T2
Need a concordance table from CustIDs to PennIDs
16Two Important Approaches
- TSIMMIS Garcia-Molina97 Stanford
- Focus semistructured data (OEM), OQL-based
language (Lorel) - Creates a mediated schema as a view over the
sources - Spawned a UCSD project called MIX, which led to a
company now owned by BEA Systems - Other important systems of this vein Kleisli/K2
_at_ Penn - Information Manifold Levy96 ATT Research
- Focus local-as-view mappings, relational model
- Sources defined as views over mediated schema
- Requires a special
- Led to peer-to-peer integration approaches
(Piazza, etc.) - Focus Web-based queriable sources
17TSIMMIS
- One of the first systems to support
semi-structured data, which predated XML by
several years OEM - An instance of a global-as-view mediation
system - We define our global schema as views over the
sources - Well use XQuery XML to illustrate the
principles
18Some Simple Data
ltbookgt ltauthorgtBernsteinlt/authorgt
ltauthorgtNewcomerlt/authorgt lttitlegtPrinciples of
TPlt/titlegtlt/bookgt ltbookgt ltauthorgtChamberlinlt/au
thorgt lttitlegtDB2 UDBlt/titlegtlt/bookgt
19Queries in TSIMMIS
- Specified in OQL-style language called Lorel
- OQL was an object-oriented query language that
looks like SQL - Lorel is, in many ways, a predecessor to XQuery
- Example in XQuery
- for b in AllData()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b
20Query Answering in TSIMMIS
- Basically, its view unfolding, i.e., composing a
query with a view - The query is the one being asked
- The views are the MSL templates for the wrappers
- Some of the views may actually require
parameters, e.g., an author name, before theyll
return answers - Common for web forms (see Amazon, Google, )
- XQuery functions (XQuerys version of views)
support parameters as well, so well see these in
action
21A Wrapper Definition in MSL
- Wrappers have templates and binding patterns (X)
in MSL - B - B ltbook ltauthor Xgtgt // select
from book where author X // - This reformats a SQL query over Book(author,
year, title) - In XQuery, this might look like
- define function GetBook(x AS xsdstring) as book
for b in sql(Amazon.DB, select
from book where author x )return
ltbookgtb/titleltauthorgtxlt/authorgtlt/bookgt
book
author
title
The union of GetBooks results is unioned with
others to form the view Mediator()
22How to Answer the Query
- Given our query
- for b in Mediator()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b - Find all wrapper definitions that
- Contain output enough structure to match the
conditions of the query - Or have already tested the conditions for us!
23Query Composition with Views
- We find all views that define book with author
and title, and we compose the query with each - define function GetBook(x AS xsdstring) as book
for b in sql(Amazon.DB, select
from book where author x )return
ltbookgt b/title ltauthorgtxlt/authorgtlt/bookgt -
- for b in Mediator()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b
book
author
title
24Matching View Output to Our Querys Conditions
- Determine that b/book/author/text() ?? x by
matching the pattern on the functions output - define function GetBook(x AS xsdstring) as book
for b in sql(Amazon.DB, select
from book where author x )return
ltbookgt b/title ltauthorgtxlt/author
gtlt/bookgt -
- let x Chamberlinfor b in
GetBook(x)/bookwhere b/title/text() DB2
UDB return b
book
author
title
25The Final Step Unfolding
- let x Chamberlinfor b in ( for b in
sql(Amazon.com, select from book where
author x ) return ltbookgt b/title
ltauthorgtxlt/authorgtlt/bookgt - )/bookwhere b/title/text() DB2 UDB
return b - How do we simplify further to get to here?
- for b in sql(Amazon.com, select from
book where authorChamberlin)where
b/title/text() DB2 UDB return b
26Virtues of TSIMMIS
- Early adopter of semistructured data, greatly
predating XML - Can support data from many different kinds of
sources - Obviously, doesnt fully solve heterogeneity
problem - Presents a mediated schema that is the union of
multiple views - Query answering based on view unfolding
- Easily composed in a hierarchy of mediators
27Limitations of TSIMMIS Approach
- Some data sources may contain data with certain
ranges or properties - Books by Aho, Students at UPenn,
- If we ask a query for students at Columbia, dont
want to bother querying students at Penn - How do we express these?
- Mediated schema is basically the union of the
various MSL templates as they change, so may
the mediated schema
28An Alternate ApproachThe Information Manifold
(Levy et al.)
- When you integrate something, you have some
conceptual model of the integrated domain - Define that as a basic frame of reference,
everything else as a view over it - Local as View
- May have overlapping/incomplete sources
- Define each source as the subset of a query over
the mediated schema - We can use selection or join predicates to
specify that a source contains a range of values - ComputerBooks() ? Books(Title, , Subj), Subj
Computers
29The Local-as-View Model
- The basic model is the following
- Local sources are views over the mediated
schema - Sources have the data mediated schema is
virtual - Sources may not have all the data from the domain
open-world assumption - The system must use the sources (views) to answer
queries over the mediated schema
30Query Answering
- Assumption conjunctive queries, set semantics
- Suppose we have a mediated schema author(aID,
isbn, year), book(isbn, title, publisher) - Suppose we have the query
- q(a, t) - author(a, i, _), book(i, t, p), t
DB2 UDB - and sources
- s1(a,t) ? author(a, i, _), book(i, t, p), t
123 -
- s5(a, t, p) ? author(a, i, _), book(i,t), p
SAMS - We want to compose the query with the source
mappings but theyre in the wrong direction! - Yet everything in s1, s5 is an answer to the
query!
31Answering Queries Using Views
- Numerous recently-developed algorithms for these
- Inverse rules Duschka et al.
- Bucket algorithm Levy et al.
- MiniCon Pottinger Halevy
- Also related chase and backchase Popa,
Tannen, Deutsch - Requires conjunctive queries
32Summary of Data Integration
- Local-as-view integration has replaced
global-as-view as the standard - More robust way of defining mediated schemas and
sources - Mediated schema is clearly defined, less likely
to change - Sources can be more accurately described
- Methods exist for query reformulation, including
inverse rules - Integration requires standardization on a single
schema - Can be hard to get consensus
- Today we have peer-to-peer data integration,
e.g., Piazza Halevy et al., Orchestra Ives et
al., Hyperion Miller et al. - Data integration capabilities in commercial
products Oracle Fusion, IBMs WebSphere
Integrator, numerous packages from middleware
companies