Title: Architecture
1From Enterprise Information Integration to
Community-Based Mediation
A presentation by
Yannis Papakonstantinou
on joint works with
Alin Deutsch, Yannis Katsis, Michalis Petropoulos
CSE Department
2Data Integration Requirements Desiderata (high
level)
- Provide application with integrated database
- single point of (query/update) access to the
data - Provide distribution and heterogeneity
transparency - heterogenous formats, heterogenous interfaces,
different rates of change (static versus
dynamic), autonomous sources - Decouple application logic from integration
- Easily add/change sources
- Customize the delivery of content
3Most-Generic Integration System Architecture
Client Application
Client Application
Client Application
Pull
Push
Integration Software
. . .
Information Source
Information Source
Information Source
4SIGMOD Communitys Architecture forUnified
Access to Data Services
(Web) Client Application
(Web) Client Application
(Web) Client Application
Integrated (XML) Global View / Ontology Services
Cache Replication
Mediator
Local Common Model (XML) View Services
Local Common Model (XML) View Services
Wrapper
Wrapper
Information Service Source
Information Service Source
5Approaches towards View-BasedData Integration
Integration Specification Method
Info Model Query Language
GLAVGAVLAV
Local As View (LAV)
XML (XQuery)
Global As View (GAV)
Object-Oriented
Relational (SQL)
Storage Method
Warehousing (materialized views)
On-Demand (virtual views)
6Enterprise Information IntegrationReaches
Maturity
Enterprise
- Materialized View (Warehousing) approach
well-adopted since mid/late 90s - GAV function role played by Extract-Transform-Load
tools - Human Intervention Occasionally Needed in
Cleaning Up - Concordance tables for Object Identification
- Virtual View (Mediation) approach at early
adoption - many years of research
- Distributed dbs, federated dbs, mediators
- moving well into mainstream
- BEA AquaLogic (XML, Virtual, GAV view)
- IBM DB2
7Current Enterprise InformationIntegration
Deployments
Enterprise
- Small Domain
- Mostly Vertical Partition of Sources
- Primarily Application-Driven View or Identity
View - Integration Administrator/Developer in charge
Integrated Global View V(M, S, E)
Integration Admin
GAV View V
View Builder(design time)
Mediator QueryProcessor (run time)
Schemas
Data
Local View M
Local View S
Local View E
Marketing
Sales
Service
8Opportunities and Needs Presented by Motivated
Communities
Communities
- Emerging Myriads of Internet Communities of
- Myriads of sources and clients
- Source owners motivated to participate
- EII does not address needs
- Expensive
- Bottleneck of Single Integration Admin
- Make building corresponding portals similar to
starting and participating in newsgroups - Appropriate tools needed to enable source owner
and client participation
9A Community-Based Information Modeling
Architecture
Client Application 1
Client Application m
Application View V1a (G)
Application View Vma (G)
GAV V1a
GAV Vma
Integrated View Owners Domain
Data Services
Integrated XML View G
Source Owners Domain
Source Owners Domain
GLAV V1
GLAV Vn
Local XML View S1
Local XML View Sn
Information Source n
Information Source 1
Data Services 1
Data Services n
10Visual Tools Matter! (example from the Enosys
Query Builder)
TARGET SCHEMA (XML VIEW)
1
OPEN VIEW SOURCE SCHEMAS IN XML
2
DRAG DROP TO CREATE TARGET XML VIEW
AUTOMATICALLY GENERATED MAPS
113
RUN TEST XQUERY
XML RESULT
XQUERY BASED ON DESIGN SPECS
12Architecture for Large-Scale Data Integration
System and Design Tools
How can the user query and Browse the integrated
data? QURSED
Web Domain
Web Forms Reports
?
Developer
What queries can my app issue? What integrated
view services can I build? CLIDE
Application Domain
?
?
Application
Application
?
Integration Domain
Cache (Metadata)
Mediator
Global View Schema
Web Services
Integration Engineer
Source Domain
How do I export my database services
functionality? RIDE-Services
?
Web Service
Web Service
Web Service
Source Owner
How do I export my data? RIDE
Data Source
Data Source
Source Schema
Source Schema
13 Dual Interactive Registration Problems
?
?
Register Source Given Global Schema,
Constraints Queries
Register Client Given Sources
New App
New Query
Guide the client in query/form writing
Apps
Queries
?
?
?
?
Global View
Global View
Guide the source owner in registering a new
source and services
Source Services
New Source and Services
14Source Data Registration
Server Side
- How do my source attributes map to global
attributes - mappers automatic matchers
- How do my data relate to queries other sources
- Inconsistencies?
- What takes to contribute to queries?
- How much should I clean up?
- Multiple ways of dealing with redundancy
Apps
Queries
?
?
Global View
?
New Source
15How to achieve this Goal
Before
Now
Apps
Queries
Apps
Queries
?
?
?
?
?
?
Look at all sources queries
Follow the suggestions of the interface
?
Decide how to register your source
Global View
Global View
?
?
Source Registration Tool
?
New Source
New Source
16Our Goal in Source Registration
Guide the source owner visually through the
registration of the source
so as to avoid/warn about (potential)
inconsistencies and contribute information to the
answer of the queries
while exposing the minimum information possible
and/or minimizing effort
17The Problem
17
18The Contribution Problem
Client Queries
- What is the contribution of source S to the
result of the query Q?
?
?
?
Q
Mediator
(Global DB)
S
Sources
(Actual Local DBs)
18
19The Problem
Client Queries
- What is the contribution of source S to the
result of the query Q?
?
Q
Q cars
Mediator
(Global DB)
Q cars JOIN reviews
cars
reviews
S
Sources
(Actual Local DBs)
19
20Relational Schemas Local and Global
?
S1
S2
G
auto
make
car
Id
Carmake
Model
Origin
Model
Carmake
Carmake
Sales
Doors
Baseprice
detail
brand
Id
Engine
Carmake
Baseprice
Origin
Source 1Business Magazine
Source 2Car Magazine
Global Car Portal
20
21Source Registration using GLAV Mappings
- Source Registration Correspondence between
a source schema and
the global schema -
- Set of Mapping Constraints of the form
- (U ? V)
- Open World
- Global and Local As View (GLAV)
?
CQ over global schema
CQover source schema
21
22Target Constraints
- Constraints on the global schema
-
- Set of Constraints of the form
- (U ? V)
- Also Expresses Dependencies (PKs, Ref Integrity,
)
?
CQ over global schema
CQover global schema
22
23Visual Representation of Mappings (1)
- Visual Representation (IBM Clio)
?
G
S1
car
make
C
Model
Carmake
O
Carmake
Origin
Doors
Sales
Baseprice
brand
Carmake
Origin
Business Magazine Provides Carmake and Origin
23
24Visual Representation of Mappings (2)
- Visual Representation (IBM Clio)
?
G
S2
car
auto
Model
Id
Carmake
Model
Doors
Carmake
Baseprice
detail
brand
Id
Carmake
Engine
Origin
Baseprice
Car Magazine Provides Model, Carmake and
Baseprice
24
25Example of Target Constraint
- (Model, Carmake) is a PK of car
?
G
car
?
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
U1(M, C, D1, B1, D2, B2) - car(M, C, D1, B1),
car(M, C,
D2, B2) V1(M, C, D, B, D, B) - car(M, C,
D, B)
(U1 ? V1)
25
26Query Semantics
- Queries in UCQ
- Set of Possible Global Instances
- Set of global instances that satisfy all
constraints - Query Answers Set of Certain Answers
- The tuples appearing in the answer to Q for any
possible global instance
?
?
Q
Possible globalinstances
CertainAnswersto Q
Answer to Qfor any of thepossible
globalinstances
26
27Source Instances Contribution
-
- For given instances of the sources
- Contribution to Q of Source Instance
-
- The tuples in answer of Q not provided by the
other sources
Answer to Q
Answer to Q
27
28Source Registrations Contribution
- Source Registration Source Mappings
- Degrees of Source Registrations Contribution
- ? Self Sufficient
- ? Now Complementary
- ? Later Complementary
- ? Unusable
More contribution
Less contribution
28
29Self Sufficient Registration Example
Example
Baseprices of Models
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
29
30Self Sufficient Registration Definition
? Self Sufficient
-
? Source instance s.t. The source has a non
empty contribution in the absence of the other
sources
Answer to Q
Answer to Q
? ?
X
X
X
X
30
31Now Complementary Registration Example
Example
Baseprices of Modelsby German manufacturers
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
31
32Now Complementary Registration Definition
? Now Complementary
-
Not Self Sufficient ? Source instances
s.t. The source has a non empty contribution
in combination with the other existing sources
Answer to Q
Answer to Q
? ?
32
33Later Complementary Registration Example
Example
Baseprices of Modelsby German manufacturers
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
33
34Later Complementary Registration Definition
? Later Complementary
-
Not Self Sufficient Not Now Complementary ?
Potential future sources Source
instances s.t. The source has a non empty
contribution in combinationwith the future
sources
Answer to Q
Answer to Q
? ?
34
35Unusable Registration Example
Example
Origin of Carmakes
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
35
36Unusable Registration Definition
? Unusable
-
Not Self Sufficient Not Now Complementary Not
Later Complementary ? The source has a empty
contribution regardless of what sources enter the
system
Answer to Q
Answer to Q
?
36
37Subtleties for Unusable Registrations
Example
Baseprices and Doorsof Models
?
G
car
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
37
38In presence of PK Unusable Example becomes Later
Complementary
Example
Baseprices and Doorsof Models
?
G
car
?
Model
Carmake
Doors
Baseprice
brand
Carmake
Origin
38
39Decidability Results
Overview What is decidable
Target constraints
Degrees
39
40Issues
?
Vs
Unique client query
Multiple client queries
Contribute to - all queries?
- one query? - specific
queries? - some queries
based on some ranking?
?
Vs
Data independence
Data dependence
e.g.
M1 cars, refPricesM2 reviewsQ cars JOIN
reviews JOIN refPrices
DB1 cars, refPrices (Audis)DB2 reviews (Hondas)
(M2, Q) now-complementarybut Certain Answers for
Instances DB1, DB2
?
41Putting it all together
Architecture
Architecture
?
Query Answering / Mappings / Schemas
Query
Q
Contribution
4 categories Self Sufficient / Now Complementary
/ Later Complementary / Unusable
Global Schema
S
Goal
Mappings
M1
Mn1
Mn
Guide the source owner visually through the
registration of the source
LocalSchemas
so as to raise contribution to the answer of the
queries
S1
Sn
Sn1
while exposing the minimum info possible and/or
minimizing effort
Registeredsources
Newsource
42Example 1
Without primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car
car
car
model
id
model
cmodel
drive
drive
ad
review
review
vin
model
model
carId
quality
quality
price
usedAd
usedAd
vin
vin
model
model
price
price
refPrice
refPrice
model
model
condition
condition
price
price
Unusable
BLUE Map at least one of the groups
43Example 1
Without primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car
car
car
model
id
model
cmodel
drive
drive
ad
review
review
vin
model
model
carId
quality
quality
price
usedAd
usedAd
vin
vin
cmo
model
model
price
price
refPrice
refPrice
model
model
condition
condition
price
price
Unusable
44Example 1
Without primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car
car
car
model
id
model
cmodel
drive
drive
ad
review
review
vin
model
model
carId
quality
quality
price
usedAd
usedAd
vin
vin
cmo
model
model
price
price
price
refPrice
refPrice
model
model
condition
condition
price
price
LaterComplementary
45Example 2
With primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car
car
car
model
id
model
cmodel
drive
drive
ad
review
review
vin
model
model
carId
quality
quality
price
usedAd
usedAd
vin
vin
model
model
price
price
refPrice
refPrice
model
model
condition
condition
price
price
Unusable
46Example 2
With primary keys in the target
Global Schema
Query
Local Schemas
Community
AutoTrader
AppQuery
car
car
car
model
id
model
cmodel
drive
drive
ad
review
review
vin
model
model
carId
quality
quality
price
usedAd
usedAd
vin
vin
vin
model
model
price
price
price
refPrice
refPrice
model
model
condition
condition
price
price
LaterComplementary
47Lessons learned
Target constraints make a difference
?
To merge data with that of other sources (become
complementary)
Pick a relation and provide
In absence of primary keys
all its attributes asked by the query
In presence of primary keys
its primary key and one of its attributes asked
by the query
The number of choices increases in presence of
primary keys
?
Foreign keys on the target affect the suggestions
48Large-Scale Data Integration Systems
How can the user query and Browse the integrated
data? QURSED
Web Domain
Web Forms Reports
?
Developer
What queries can the mediator answer for
me? CLIDE
Application Domain
?
?
Application
Application
?
Integration Domain
Mediator
Global View Schema
Integration Engineer
Source Domain
How do I export my database services
functionality? RIDE-Services
?
Web Service
Web Service
Web Service
Source Owner
How do I export my data? RIDE
Data Source
Data Source
Source Schema
Source Schema
49Running Example
Parameterized Views
Dell
Cisco
- Schema
- Computers(cid, cpu, ram, price)
- NetCards(cid, rate, standard, interface)
- Views
- V1 ComByCpu(cpu) ? (Computer)
- SELECT DISTINCT Com1.
- FROM Computers Com1
- WHERE Com1.cpucpu
- V2 ComNetByCpuRate(cpu, rate) ?
- (Computer, NetCard)
- SELECT DISTINCT Com1., Net1.
- FROM Computers Com1, Network Net1
- WHERE Com1.cidNet1.cid
- AND Com1.cpucpu
- AND Net1.raterate
Schema Routers(rate, standard, price,
type) Views V3 RouByTypeW() ?
(Router) SELECT DISTINCT Rou1. FROM Routers
Rou1 WHERE Rou1.type'Wired' V4 RouByTypeWL() ?
(Router) SELECT DISTINCT Rou1. FROM Routers
Rou1 WHERE Rou1.type'Wireless'
50Running Example
Global Schema
?
?
Developer
Application
- Global schema puts togetherthe Dell and Cisco
schemas - Resembles the schema of CNET.com portal
- Column Associations
- (Computers.cid, NetCards.cid)
- (NetCards.rate, Routers.rate)
- (NetCards.standard, Routers.standard)
Mediator
Global Schema
V1
V3
V2
V4
Dell
Cisco
51Sophisticated Mediators Make Feasibility Hard to
Predict
- Feasible Queries FQ
- Equivalent CQ query rewritings using the views
- Might involve more than one views
- Order might matter
Feasible
Query Get all P4 Computers, together with
their NetCards and their compatible Wireless
Routers
Query Get all Computers
Infeasible
E
B
D
Mediator
A
C
RouByTypeWL()
ComNetByCpuRate(P4, 10)
ComNetByCpuRate(P4, 54)
V4
V2
52Problem
- Large number of sources
- Large number of views
- Mediator capabilities
- Developer formulates an application query
- Is an application query feasible?
- If not, how do I know which ones are feasible?
- Previous options
- The developer had to browse the view definitions
and somehow formulate a feasible query - Or formulate queries until a feasible one is
found(trial-and-error) - No system-provided guidance
53The CLIDE Solution
CLIDE
?
?
Developer
Application
- A query formulation interface, which
interactively guides the user toward feasible
queries by employing a coloring scheme
Mediator
Global Schema
V1
V3
V2
V4
Dell
Cisco
54QBE-Like Interfaces
Microsoft SQL-Server
55CLIDE Interface
- Table, selection, projection and join actions
- Color-based suggestions
- Feasibility Flag
56CLIDE Interface
Snapshot 1
- Yellow ? required action
- All feasible queries require this action
- White ? optional action
- Feasible queries can be formulatedw/ or w/o
these actions
57CLIDE Interface
Snapshot 2
- Blue ? required choice of action
- At least one feasible (next) query cannot be
formulated unless this action is performed
C
Mediator
A
ComByCpu(P4)
B
V1
58CLIDE Interface
Snapshot 3
- Join Lines
- Only yellow and blue are displayed
- Must appear in Column Associations
59CLIDE Interface
Snapshot 4
60CLIDE Interface
Snapshot 5
- ? any other constant
- Red ? prohibited action
- Does not appear in any feasible query
- Lead to Dead End state
61CLIDE Interface
Snapshot 6
F
Mediator
A
D
RouByTypeWL()
ComNetByCpuRate(P4, rate)
B
E
V4
V2
62CLIDE Facts
- Rapid Convergence
- At every step, yellow and blue actions lead to a
feasible query in a minimum number of steps - Completeness of Suggestions
- Every feasible query can be formulated by
performing yellow and blue actions at every step - Minimality of Suggestions
- At every step, only a minimal number of actions
are suggested, i.e., the ones that are needed to
preserve completeness
63Interaction Graph
Join Action
Table Action
Selection Action
Com1.cidNet1.cid
Com1.cpuP4
Com1
Com1.ram
Rou1
Com1.price
Net1
- Nodes are queries
- One for each q?CQ
- Edges are actions
- Table, selection, projection and join actions
- Green nodes are feasible queries
- Infinitely big structure
- All CQ queries
- All possible combinations of actions formulating
them
64Interaction Graph Colorable Actions
Com1.cid
Current Node
Com1.cpu
Com1.cid
- Colorable actions AC labeloutgoing edges of the
current node
Com1.cpu
Com1.ram
Com1.price
Net1
Rou1
Com2
65Interaction Graph Colors
- Yellow action ?
- Every path from current node n to a feasible node
contains ? - Blue action ?
- At least one feasible query cannot be formulated
unless this action is performed (minimality) - Red action ?
- No path to a feasible node contains ?
Current Node
Com1.cid
Com1.cpu
Com1.cid
Com1.cpu
Net1
Com1.cidNet1.cid
Net1.rate54Mbps
Com1.cpu
Com1.ram
Com1.cidNet1.cid
Net1.rate54Mbps
Com1.price
Com1.cpu
Com1.cpu
Rou1
Net1.rateRou1.rate
Com1.cidNet1.cid
Net1
Com2
Rou1
Com2
Com2.cidNet1.cid
Com2.cpuP4
Net1.rate54Mbps
66Color Determined By a Finite Set of Feasible
Queries
Challenge Infinitely Many Feasible Queries
?
Radius
Closest Feasible Queries FQC
Infinitely many feasible queries
n
- Start by considering the closest feasible queries
FQC - FQC is sufficient to color actions in AC
- Theorem Set of Closest Feasible Queries is
Finite - How far can closest feasible queries FQC be?
- Based on Maximally Contained Queries FQMC?
67Color Algorithm
Maximally Contained Queries FQMC
Query Q1 Get all Computers
Maximally Contained Query
Query Q2 Get all Computers with a given cpu
- Assuming fixed SELECT clause (projection list)
- Covered extensively in literature
- MiniCon, Bucket, InverseRules
- FQMC is finite
68Color Algorithm
Maximally Contained Queries FQMC
Maximally Contained Queries FQMC
pL Radius
Closest Feasible Queries FQC
n
- Compute maximally contained queries FQMC
- The radius pL is the longest path to a node n
such that q(n) in FQMC - All FQC queries are reachable via a path of
length p ? pL
69Color Algorithm
More on Closest Feasible Queries
Maximally Contained Feasible Queries FQMC
Closest Feasible Queries FQC
n
More feasible nodes
- Theorem All queries in FQMC are in FQC
- But not all queries in FQC are in FQMC
70Color Algorithm
More on Closest Feasible Queries
Maximally Contained Feasible Queries FQMC
Closest Feasible Queries FQC
n
- Naïve Approach
- Start from n and explore paths up to length pL
71Color Algorithm
Collapse Aliases
Maximally Contained Feasible Queries FQMC
Closest Feasible Queries FQC
n
- Collapse Aliases to compute FQC \ FQMC
- Check satisfiability
72Color Algorithm
- Coloring Non-Projection Actions
- No interaction graph materialization
- Use of containment mapping from current query to
the closest feasible ones - An action ? is colored
- Yellow, if ? is mapped into all queries in FQC
- Red, if ? is not mapped into any query in FQC
- Blue, if ? is mapped into at least one query qF
in FQC, no other action in AP is mapped into qF,
and ? is neither yellow nor red - Coloring Projection Actions
- Never colored yellow
- Can be colored blue only if
- the current query is feasible
- it is not colored red
- Which ones are red?
- Bring all projection atoms from views such that
feasibility is preserved - If action ? is not mapped into any query in FQC,
then ? is red
73CLIDE Implementation
?
Other Back-End
Parameterized Views Back-End
Developer
Optimal Maximally Contained Queries
Action
Closest Feasible Queries
Maximally Contained Queries
Current Query
Colored Actions
MiniCon
Containment Test
Collapse Aliases
Color Actions
Front-End
Schemas
Column Associations
Views
- MiniCon
- Outputs redundant and non-minimal queries
- Affects CLIDEs rapid convergence and minimality
properties - Containment Test
- Well-known NP-complete problem
- Polynomial when query is acyclic
- Collapse Aliases / Color Actions
- Reuse containment mappings created by MiniCon
74CLIDE Performance
Chains of Stars
A-span 7 B-span 4 Selections 4,6,8,10
A