Title: Data Integration
1Data Integration
- Zachary G. Ives
- University of Pennsylvania
- CIS 550 Database Information Systems
- November 14, 2007
2Challenges of Mapping Schemas
- In a perfect world, it would be easy to match up
items from one schema with another - Every table would have a similar table in the
other schema - Every attribute would have an identical attribute
in the other schema - Every value would clearly map to a value in the
other schema - Real world as with human languages, things
dont map clearly! - May have different numbers of tables different
decompositions - Metadata in one relation may be data in another
- Values may not exactly correspond
- It may be unclear whether a value is the same
3Different Aspects to Mapping
- Schema matching / ontology alignment
- How do we find correspondences between
attributes? - Entity matching / deduplication / record linking
/ etc. - How do we know when two records refer to the
same thing? - Mapping definition
- How do we specify the constraints or
transformations that let us reason about when to
create an entry in one schema, given an entry in
another schema?
Lets see one influential approach to schema
matching
4The LSD (Learning Source Descriptions) System
- Suppose user wants to integrate 100 data sources
- User
- manually creates mappings for a few sources, say
3 - shows LSD these mappings
- LSD learns from the mappings
- Multi-strategy learning incorporates many types
of info in a general way - Knowledge of constraints further helps
- LSD proposes mappings for remaining 97 sources
5Example
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Learned hypotheses
Schema of realestate.com
If phone occurs in the name agent-phone
listed-price 250,000 110,000 ...
location Miami, FL Boston, MA ...
phone (305) 729 0831 (617) 253 1429 ...
comments Fantastic house Great location ...
realestate.com
If fantastic great occur frequently in
data values description
homes.com
price 550,000 320,000 ...
contact-phone (278) 345 7215 (617) 335 2315 ...
extra-info Beautiful yard Great beach ...
6LSDs Multi-Strategy Learning
- Use a set of base learners
- Each exploits well certain types of information
- Name learner looks at words in the attribute
names - Naïve Bayes learner looks at patterns in the data
values - Etc.
- Match schema elements of a new source
- Apply the base learners
- Each returns a score
- For different attributes one learner is more
useful than another - Combine their predictions using a meta-learner
- Meta-learner
- Uses training sources to measure base learner
accuracy - Weighs each learner based on its accuracy
7Training the Learners
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Schema of realestate.com
Name Learner
(location, address) (listed-price, price) (phone,
agent-phone) (comments, description) ...
Miami, FL
250,000 (305) 729 0831
Fantastic house
realestate.com
Naive Bayes Learner
Boston, MA
110,000 (617) 253 1429
Great location
(Miami, FL, address) ( 250,000,
price) ((305) 729 0831, agent-phone) (Fantastic
house, description) ...
8Applying the Learners
Mediated schema
Schema of homes.com
address price agent-phone
description
area day-phone extra-info
Name Learner Naive Bayes
Seattle, WA Kent,
WA Austin, TX
(address,0.8), (description,0.2) (address,0.6),
(description,0.4) (address,0.7), (description,0.3)
Meta-Learner
Name Learner Naive Bayes
Meta-Learner
(address,0.7), (description,0.3)
(278) 345 7215 (617) 335
2315 (512) 427 1115
(agent-phone,0.9), (description,0.1)
(address,0.6), (description,0.4)
Beautiful yard Great
beach Close to Seattle
9Putting It All Together LSD System
Matching Phase
Training Phase
Mediated schema
Source schemas
Domain Constraints
Data listings
Training data for base learners
User Feedback
Constraint Handler
L1
L2
Lk
Mapping Combination
10Mappings between Schemas
- LSD provides attribute correspondences, but not
complete mappings - Mappings generally are posed as views define
relations in one schema (typically either the
mediated schema or the source schema), given data
in the other schema - This allows us to restructure or recompose
decompose our data in a new way - We can also define mappings between values in a
view - We use an intermediate table defining
correspondences a concordance table - It can be filled in using some type of code, and
corrected by hand
11A Few Mapping Examples
- Movie(Title, Year, Director, Editor, Star1,
Star2) - Movie(Title, Year, Director, Editor, Star1,
Star2)
- PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
- MotionPicture(ID, Title, Year)Participant(ID,
Name, Role)
PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
T2
Need a concordance table from CustIDs to PennIDs
12Two Important Approaches
- TSIMMIS Garcia-Molina97 Stanford
- Focus semistructured data (OEM), OQL-based
language (Lorel) - Creates a mediated schema as a view over the
sources - Spawned a UCSD project called MIX, which led to a
company now owned by BEA Systems - Other important systems of this vein Kleisli/K2
_at_ Penn - Information Manifold Levy96 ATT Research
- Focus local-as-view mappings, relational model
- Sources defined as views over mediated schema
- Requires a special
- Led to peer-to-peer integration approaches
(Piazza, etc.) - Focus Web-based queriable sources
13TSIMMIS
- One of the first systems to support
semi-structured data, which predated XML by
several years OEM - An instance of a global-as-view mediation
system - We define our global schema as views over the
sources
14XML vs. Object Exchange Model
Bernstein
Newcomer Principles of
TP Chamberlinthor DB2 UDB
O1 book O2 author Bernstein O3
author Newcomer O4 title Principles of
TP O5 book O6 author Chamberlin
O7 title DB2 UDB
15Queries in TSIMMIS
- Specified in OQL-style language called Lorel
- OQL was an object-oriented query language that
looks like SQL - Lorel is, in many ways, a predecessor to XQuery
- Based on path expressions over OEM structures
- select book where book.title DB2 UDB and
book.author Chamberlin - This is basically like XQuery, which well use in
place of Lorel and the MSL template language.
Previous query restated - for b in AllData()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b
16Query Answering in TSIMMIS
- Basically, its view unfolding, i.e., composing a
query with a view - The query is the one being asked
- The views are the MSL templates for the wrappers
- Some of the views may actually require
parameters, e.g., an author name, before theyll
return answers - Common for web forms (see Amazon, Google, )
- XQuery functions (XQuerys version of views)
support parameters as well, so well see these in
action
17A Wrapper Definition in MSL
- Wrappers have templates and binding patterns (X)
in MSL - B - B // select
from book where author X // - This reformats a SQL query over Book(author,
year, title) - In XQuery, this might look like
- declare function GetBook(x AS xsdstring) as
book for b in sql(Amazon.DB,
select from book where author x
)return b/titlex
book
book
author
title
The union of GetBooks results is unioned with
others to form the view Mediator()
18How to Answer the Query
- Given our query
- for b in Mediator()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b - Find all wrapper definitions that
- Contain output enough structure to match the
conditions of the query - Or have already tested the conditions for us!
19Query Composition with Views
- We find all views that define book with author
and title, and we compose the query with each - declare function GetBook(x AS xsdstring) as
book for b in sql(Amazon.DB, select
from book where author x )return
b/title x -
- for b in Mediator()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b
book
author
title
20Matching View Output to Our Querys Conditions
- Determine that b/book/author/text() ?? x by
matching the pattern on the functions output - declare function GetBook(x AS xsdstring) as
book for b in sql(Amazon.DB,
select from book where author x
)return b/title
x -
- let x Chamberlinfor b in
GetBook(x)/bookwhere b/title/text() DB2
UDB return b
book
author
title
21The Final Step Unfolding
- let x Chamberlinfor b in ( for b in
sql(Amazon.com, select from book where
author x ) return b/title
x - )/bookwhere b/title/text() DB2 UDB
return b - How do we simplify further to get to here?
- for b in sql(Amazon.com, select from
book where authorChamberlin)where
b/title/text() DB2 UDB return b
22Virtues of TSIMMIS
- Early adopter of semistructured data, greatly
predating XML - Can support data from many different kinds of
sources - Obviously, doesnt fully solve heterogeneity
problem - Presents a mediated schema that is the union of
multiple views - Query answering based on view unfolding
- Easily composed in a hierarchy of mediators
23Limitations of TSIMMIS Approach
- Some data sources may contain data with certain
ranges or properties - Books by Aho, Students at UPenn,
- If we ask a query for students at Columbia, dont
want to bother querying students at Penn - How do we express these?
- Mediated schema is basically the union of the
various MSL templates as they change, so may
the mediated schema
24An Alternate ApproachThe Information Manifold
(Levy et al.)
- When you integrate something, you have some
conceptual model of the integrated domain - Define that as a basic frame of reference,
everything else as a view over it - Local as View
- May have overlapping/incomplete sources
- Define each source as the subset of a query over
the mediated schema - We can use selection or join predicates to
specify that a source contains a range of values - ComputerBooks() ? Books(Title, , Subj), Subj
Computers
25The Local-as-View Model
- The basic model is the following
- Local sources are views over the mediated
schema - Sources have the data mediated schema is
virtual - Sources may not have all the data from the domain
open-world assumption - The system must use the sources (views) to answer
queries over the mediated schema
26Query Answering
- Assumption conjunctive queries, set semantics
- Suppose we have a mediated schema author(aID,
isbn, year), book(isbn, title, publisher) - Suppose we have the query
- q(a, t) - author(a, i, _), book(i, t, p), t
DB2 UDB - and sources
- s1(a,t) ? author(a, i, _), book(i, t, p), t
123 -
- s5(a, t, p) ? author(a, i, _), book(i,t), p
SAMS - We want to compose the query with the source
mappings but theyre in the wrong direction! - Yet everything in s1, s5 is an answer to the
query!
27Answering Queries Using Views
- Numerous recently-developed algorithms for these
- Inverse rules Duschka et al.
- Bucket algorithm Levy et al.
- MiniCon Pottinger Halevy
- Also related chase and backchase Popa,
Tannen, Deutsch - Requires conjunctive queries
28Summary of Data Integration
- Local-as-view integration has replaced
global-as-view as the standard - More robust way of defining mediated schemas and
sources - Mediated schema is clearly defined, less likely
to change - Sources can be more accurately described
- Methods exist for query reformulation, including
inverse rules - Integration requires standardization on a single
schema - Can be hard to get consensus
- Today we have peer-to-peer data integration,
e.g., Piazza Halevy et al., Orchestra Ives et
al., Hyperion Miller et al. - Some other aspects of integration were addressed
in related papers - Overlap between sources coverage of data at
sources - Semi-automated creation of mappings and wrappers
- Data integration capabilities in commercial
products BEAs Liquid Data, IBMs DB2
Information Integrator, numerous packages from
middleware companies