Title: CSE 636 Data Integration
1CSE 636Data Integration
2Data Warehouse Architecture
?
OLAP / Decision Support Data Cubes / Data Mining
?
Users
Applications
Relational Database (Warehouse)
ETL Tools (Extract-Transform-Load)
Data Cleaning
Data Source
Data Source
Data Source
3Virtual Integration Architecture
- Leave the data in the sources
- When a query comes in
- Determine the relevant sources to the query
- Break down the query into sub-queries for the
sources - Get the answers from the sources, filter them if
needed and combine them appropriately - Data is fresh
- Otherwise known as
- On Demand Integration
4Virtual Integration Architecture
Design-Time
?
?
End Users
Applications
- Sources can be
- Relational DBs
- Excel Files
- Web Sites
- Web Services
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
5Schema Mappings
- Differences in
- Names in schema
- Attribute grouping
- Coverage of databases
- Granularity and format of attributes
Inventory Database B
Authors ISBN FirstName LastName
Books Title ISBN Price DiscountPrice Edition
Inventory Database A
BookCategories ISBN Category
BooksAndMusic Title Author Publisher ItemID ItemTy
pe SuggestedPrice Categories Keywords
CDCategories ASIN Category
CDs Album ASIN Price DiscountPrice Studio
Artists ASIN ArtistName GroupName
6Issues for Schema Mappings
Design-Time
?
?
End Users
Applications
- What formalisms to express them?
- How to create them?
- Can we discover them somehow?
- How do we use them?
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
7Virtual Integration Architecture
Run-Time
Query
Result
Mediator
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
8Issues for Query Processing
Reformulation
Query
- User queries refer to the global schema
- Data is stored in the sources in a local schema
- Rewriting algorithms
Mediator
Reformulation
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
9Issues for Query Processing
Reformulation
Global Schema
SELECT ISBN, Price FROM Books WHERE Title on
the road
Books Title ISBN Price DiscountPrice Edition
Local Schema A
SELECT ItemID, SuggestedPrice FROM
BooksAndMusic WHERE Title on the road AND
ItemType Books
BooksAndMusic Title Author Publisher ItemID ItemTy
pe SuggestedPrice Categories Keywords
10Issues for Query Processing
Query Translation
Query
- Different query languages
Mediator
Reformulation
Global Schema
Optimization
Execution
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
11Issues for Query Processing
Query Translation
Global Schema
SELECT ISBN, Price FROM Books WHERE Title on
the road
Books Title ISBN Price DiscountPrice Edition
Local Source A
http//www.amazon.com/homepage.html?ItemTypeBooks
Titleontheroad
12Issues for Query Processing
Data Translation
Query
Mediator
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
13Issues for Query Processing
Data Translation
Global Schema
Title ISBN Price
On the Road 123 10.86
Books Title ISBN Price DiscountPrice Edition
lttablegt lttrgt lttdgt lta
href/details?isbn123gt ltbgtOn the
Roadlt/bgt lt/agt -- by Jack
Kerouac Paperback ltbrgt lta
href/details?isbn123gt Buy new
lt/agt ltb classpricegt10.86lt/bgt
lt/tdgt lt/trgt lt/tablegt
Local Result A
14Issues for Query Processing
Query Execution
Query
- Access as many data sources as needed
- Duplicate/redundant and irrelevant data
- Limited query capabilities
Mediator
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
15Issues for Query Processing
Limited Query Capabilities
SELECT ISBN, Price, DiscountPrice FROM
Books WHERE Title on the road
Global Schema
Books Title ISBN Price DiscountPrice Edition
ISBN Price DiscountPrice
123 10.86 8.86
E
Local Schema A
Local Schema B
BooksAndMusic Title Author ItemID ItemType Suggest
edPrice
DiscountBooks Title Edition ISBN GreatPrice
ItemID SuggestedPrice
123 10.86
ItemID SuggestedPrice
123 10.86
GreatPrice
8.86
B
D
SELECT GreatPrice FROM DiscountBooks WHERE ISBN
?
SELECT ItemID, SuggestedPrice FROM
BooksAndMusic WHERE Title ?
SELECT GreatPrice FROM DiscountBooks WHERE ISBN
123
C
16Issues for Query Processing
Query Answering
Query
Result
- Combine the results and further process them if
needed - Mainly union and merge
- Inconsistencies
Mediator
Global Schema
Data Source
Data Source
Data Source
Local Schema
Local Schema
Local Schema
17Issues for Query Processing
Query Answering (Union)
ISBN Price
123 10.86
456 8.86
ItemID SuggestedPrice
123 10.86
ISBN GreatPrice
456 8.86
18Issues for Query Processing
Query Answering (Merge)
Primary Key
ISBN Title Edition Price
123 On the Road 2nd 8.86
ISBN Title Edition Price
123 On the Road 2nd 8.86
ItemID Title
123 On the Road
ISBN Edition Price
123 2nd 8.86
Primary Key
Primary Key
19Issues for Query Processing
Query Answering (Inconsistencies)
Primary Key
ISBN Title Edition Price
123 On the Road 8.86
ISBN Title Edition Price
123 On the Road ??? 8.86
ItemID Title Edition
123 On the Road 1st
ISBN Edition Price
123 2nd 8.86
Primary Key
Primary Key
20Community-Based Integration
?
Web Domain
- Fairly-dynamic environment
- New sources register over time and new
applications queries are formulated - Allow developers to easily build applications
based on the community schema - ?
- So that each others needs
- are accommodated
- ?
- Allow source owners to easily and independently
register their source
Web Forms Reports
End Users
?
Application Domain
?
?
Developers
New Application
Application
?
Community Domain
Mediator
Community Owner
Community Schema
Source Domain
?
Web Service
Web Service
Source Owners
Data Source
New Source
Source Schema
21Peer-Based Integration
Query
Peer 4
Query
Peer 5
Peer 2
Peer 1
Peer 3
22Peer-Based Integration
- No need for a central mediated schema
- Peers serve as mediators for other peers
- A peer can be both a server and a client
- Semantic relationships are specified
locally(between small sets of peers) - Queries are posed using the peers schema
- Answers come from anywhere in the system
- This is not P2P file sharing.
- Data has rich semantics
23References
- Information integration
- Maurizio Lenzerini
- Eighteenth International Joint Conference on
Artificial Intelligence, IJCAI 2003 - Invited Tutorial
- Data Integration a Status Report
- Alon Halevy
- German Database Conference (BTW), 2003
- Invited Talk