Title: Bridging Different Data Representations
1Bridging Different Data Representations
- Zachary G. Ives
- University of Pennsylvania
- CIS 550 Database Information Systems
- October 28, 2003
Some slide content may be courtesy of Susan
Davidson, Dan Suciu, Raghu Ramakrishnan
2Administrivia
- HW4 and midterms returned today
- You all did great!
- Median on midterm was 75 of 80 (mean was 73.4)
- Remember to turn in your project plan on
Thursday! - Should have a plan for how to break down the
project tasks among your group - Should have some milestones that get you towards
a completed project - Ill ask for a status report in a couple of weeks
3A Problem
- Weve seen that even with normalization and the
same needs, different people will arrive at
different schemas - In fact, most people also have different needs!
- Often people build databases in isolation, then
want to share their data - Different systems within an enterprise
- Different information brokers on the Web
- Scientific collaborators
- Researchers who want to publish their data for
others to use - This is the goal of data integration tie
together different sources, controlled by many
people, under a common schema
4Building a Data Integration System
- Create a middleware mediator or data
integration system over the sources - Can be warehoused (a data warehouse) or virtual
- Presents a uniform query interface and schema
- Abstracts away multitude of sources consults
them for relevant data - Unifies different source data formats (and
possibly schemas) - Sources are generally autonomous, not designed to
be integrated - Sources may be local DBs or remote web
sources/services - Sources may require certain input to return
output (e.g., web forms) binding patterns
describe these
5Typical Data Integration Components
Query
Results
Data Integration System / Mediator
Mediated Schema
Source Catalog
Mappings in Catalog
Wrapper
Wrapper
Wrapper
Source Relations
6Typical Data Integration Architecture
Source Descrs.
Query
Reformulator
Source Catalog
Query over sources
QueryProcessor
Results
Queries bindings
Data in mediated format
Wrapper
Wrapper
Wrapper
7Challenges of Mapping Schemas
- In a perfect world, it would be easy to match up
items from one schema with another - Every table would have a similar table in the
other schema - Every attribute would have an identical attribute
in the other schema - Every value would clearly map to a value in the
other schema - Real world as with human languages, things
dont map clearly! - May have different numbers of tables different
decompositions - Metadata in one relation may be data in another
- Values may not exactly correspond
- It may be unclear whether a value is the same
8A Few Simple Examples
- Movie(Title, Year, Director, Editor, Star1,
Star2) - Movie(Title, Year, Director, Editor, Star1,
Star2)
- PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
- MotionPicture(ID, Title, Year)Participant(ID,
Name, Role)
9How Do We Relate Schemas?
- General approach is to use a view to define
relations in one schema, given data in the other
schema - This allows us to restructure or recompose
decompose our data in a new way - We can also define mappings between values in a
view - We use an intermediate table defining
correspondences a concordance table - It can be filled in using some type of code, and
corrected by hand
10Mapping Our Examples
- Movie(Title, Year, Director, Editor, Star1,
Star2) - Movie(Title, Year, Director, Editor, Star1,
Star2)
- PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
- MotionPicture(ID, Title, Year)Participant(ID,
Name, Role)
PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
T2
???
11Two Important Approaches
- TSIMMIS Garcia-Molina97 Stanford
- Focus semistructured data (OEM), OQL-based
language (Lorel) - Creates a mediated schema as a view over the
sources - Spawned a UCSD project called MIX, which led to a
company now owned by BEA Systems - Information Manifold Levy96 ATT Research
- Focus local-as-view mappings, relational model
- Sources defined as views over mediated schema
- Spawned Tukwila at Washington, and eventually a
company as well
12TSIMMIS and Information Manifold
- Focus Web-based queryable sources
- CGI forms, online databases, maybe a few RDBMSs
- Each needs to be mapped into the system not as
easy as web search but the benefits are
significant vs. query engines - A few parenthetical notes
- Part of a slew of works on wrappers, source
profiling, etc. - The creation of mappings can be partly automated
systems such as LSD, Cupid, Clio, do this - Today most people look at integrating large
enterprises (thats where the is!) Nimble,
BEA Liquid Data, Enosys, IBM
13TSIMMIS
- The Stanford-IBM Manager of Multiple Information
Sources or, a Yiddish stew - An instance of a global-as-view mediation
system - One of the first systems to support
semi-structured data, which predated XML by
several years
14Semi-structured Data OEM
- Observation given a particular schema, its
attributes may be unavailable from certain
sources inherent irregularity - Proposal Object Exchange Model, OEM
- OID ltlabel, type, valuegt
- How does it relate to XML?
- What problems does OEM solve, and not solve, in
a heterogeneous system?
15OEM Example
Show this XML fragment in OEM ltbookgt
ltauthorgtBernsteinlt/authorgt ltauthorgtNewcomerlt/aut
horgt lttitlegtPrinciples of TPlt/titlegtlt/bookgt ltbo
okgt ltauthorgtChamberlinlt/authorgt lttitlegtDB2
UDBlt/titlegtlt/bookgt
16Queries in TSIMMIS
- Specified in OQL-style language called Lorel
- OQL was an object-oriented query language
- Lorel is, in many ways, a predecessor to XQuery
- Based on path expressions over OEM structures
- select bookwhere book.author DB2 UDB and
book.title Chamberlin - This is basically like XQuery, which well use in
place of Lorel and the MSL template language.
Previous query restated - for b in document(my-source)/bookwhere
b/title/text DB2 UDB and b/author/text()
Chamberlinreturn b
17Query Answering in TSIMMIS
- Basically, its view unfolding, i.e., composing a
query with a view - The query is the one being asked
- The views are the MSL templates for the wrappers
- Some of the views may actually require
parameters, e.g., an author name, before theyll
return answers - Common for web forms (see Amazon, Google, )
- XQuery functions (XQuerys version of views)
support parameters as well, so well see these in
action
18A Wrapper Definition in MSL
- Wrappers have templates and binding patterns (X)
in MSL - B - B ltbook ltauthor Xgtgt // select
from book where author X // - This reformats a SQL query over Book(author,
year, title) - In XQuery, this might look like
- define function GetBook(X AS xsdstring) as book
for x in sql(select from book where
author x )return ltbookgtxltauthorgtxlt/au
thorgtlt/bookgt
19How to Answer the Query
- Given our query
- for b in document(my-source)/bookwhere
b/title/text() DB2 UDB and b/author/text()
Chamberlinreturn b - We want to find all wrapper definitions that
- Either contain output enough information that we
can evaluate all of our conditions over the
output - Or have already tested the conditions for us!
20Query Composition with Views
- We find all views that define book with author
and title, and we compose the query with each - define function GetBook(x AS xsdstring) as book
for b in sql(select from book where
author x )return ltbookgtbltauthorgtxlt/au
thorgtlt/bookgt -
- for b in document(my-source)/bookwhere
b/title/text() DB2 UDB and b/author/text()
Chamberlinreturn b
21Matching View Output to Our Querys Conditions
- Determine that b/book/author/text() ?? x by
matching the pattern on the functions output - define function GetBook(x AS xsdstring) as book
for b in sql(select from book where
author x )return ltbookgtbltauthorgtxlt/au
thorgtlt/bookgt -
- where x Chamberlinfor b in
GetBook(x)/bookwhere b/title/text() DB2
UDB return b
22The Final Step Unfolding
- where x Chamberlinfor b in for b in
sql(select from book where author x
)return ltbookgtbltauthorgtxlt/authorgtlt/bookgt
/bookwhere b/title/text() DB2 UDB return
b
23What Is the Answer?
- Given schema book(author, year, title) and
datalog rules defining an instance - book(Chamberlin, 1992, DB2 UDB)
- book(Chamberlin, 1995, DB2/CS)
24TSIMMIS
- Early adopter of semistructured data
- Can support irregular structure and missing
attributes - Can support data from many different sources
- Doesnt fully solve heterogeneity problem,
though! - Simple algorithms for view unfolding
- Easily can be composed in a hierarchy of mediators
25Limitations of TSIMMIS Approach
- Some data sources may contain data with certain
ranges or properties - Books by Aho, Students at UPenn,
- How do we express these? (Important for
performance!) - Mediated schema is basically the union of the
various MSL templates as they change, so may
the mediated schema
26The Information Manifold
- Defines the mediated schema independently of the
sources! - Local-as-view instead of global-as-view
- Guarantees soundness and completeness of answers
- Allows us to specify information about data
sources - Focuses on relations (with OO extensions),
datalog
27Observations of Levy et al.
- When you integrate something, you have some
conceptual model of the integrated domain - Define that as a basic frame of reference
- May have overlapping/incomplete sources
- Define each source as the subset of a query over
the mediated schema - We can use selection or join predicates to
specify that a source contains a range of values - ComputerBooks() ? Books(Title, , Subj), Subj
Computers
28The Local-as-View Model
- If we look at the Information Manifold model
- Local sources are views over the mediated
schema - Sources have the data mediated schema is
virtual - Sources may not have all the data from the domain
open-world assumption - The system must use the sources (views) to answer
queries over the mediated schema - Thursday well see what answering queries using
views is all about