Title: Implementing Mapping Composition
1Implementing Mapping Composition
- Todd J. Green
- University of Pennsylania
- with Philip A. Bernstein (Microsoft Research),
- Sergey Melnik (Microsoft Research),
- Alan Nash (UC San Diego)
- VLDB 2006 Seoul, Korea
- Work partially supported by NSF grants
IIS0513778 and IIS0415810
2Schema mappings
- Mapping a correspondence between instances of
different schemas
Names SID, Name
Students Name, Address
m
Addresses SID, Address
S1
S2
Students ? ?Name,Address (Names ? Addresses)
3Applications of mappings
- Names ? Names?
- sCountry KR(Addresses) ? ?SID,Address(Local)KR
- sCountry ? KR(Addresses) ? Foreign
Students ? ?Name,Address,Country(Names ?
Addresses)
...
m12
m23
S3
S2
S1
4Applications of mappings
- Data integration, data exchange
Sn
Addresses SID, Address, Country
Names SID, Name
...
m1
mn
Students ? ?Name,Address (Names ?
Addresses)
Names? ? Names Local ? ?SID,Address(?Country
KR(Addresses)) Foreign ? ?Country ? KR(Addresses)
S1
Sn-1
Foreign SID, Address, Country
Students Name, Address, Country
Names? SID, Name
Local SID, Address
...
5Requirements for constraints
- First attribute in R is a key for R
- ?2,4(R ?13 R) µ ?2,2(R)
- View V equals R joined with S
- V µ R ? S, V R ? S
- Second attribute of R is a foreign key in S
- ?2(R) µ ?1(S)
- ?2,4(S ?13 S) µ ?2,2(S)
- Data integration, data exchange GLAV
- R ? S µ T ? U
6Mapping composition
S1
S3
7Composition is hard
- Hard part write composition in the same language
as the input mappings. Depending on language - Not always possible
- Not even decidable whether possible
- Strategy 1 use powerful (second-order) mapping
language closed under composition FKPT04 - Not supported by DBMS today
- Expensive to check
- Source-target restriction
- Strategy 2 settle for partial solutions NBM05
- Containment mappings ? easier integration with
DBMS - The strategy we adopt in this work
8Our contributions
- New algorithm for composition problem
- Incorporates view unfolding and left-composition
(new technique) - Makes best effort in failure cases
- Algebraic rather than logic-based mappings
- Use of monotonicity to handle more operators
- Modular and extensible factoring of algorithm
- First implementation of composition
- Experimental evaluation
9Formal definition of composition
- Mapping set of pairs of instances of db schemas
- The composition m12 m23 is the mapping
- hA,Ci (9B)(hA,Bi 2 m12 and hB,Ci 2 m23)
- where A,B,C are instances of S1,S2,S3
- Composition problem find constraints in same
language as input mappings giving the composition
of the input mappings - Example
S1 R, S2 S,T, S3 U,V,W R ? S?T, S ?
?(U), T V W
) R ? ?(U)?(V - W)
10Best-effort composition problem
- Composition not always possible
- Best-effort composition problem compute set of
constraints equivalent to input constraints, but
with as many symbols from S2 eliminated as
possible - R ? U, R ? V,
- ?1,4(?23(U?U)) ? U, ?1,4(?23(V?V)) ? V,
- U ? T, V ? T
- Can eliminate U (cross out left column) or V
(right column), but not both NBM05
11Composition algorithm overview
- For each relation R in S2
- Try to eliminate R via (1) view unfolding
- Replace by pairs of ?, ?
- For each relation R in S2 not yet eliminated
- Try to eliminate R via (2) left compose
- Else, try to eliminate R via (3) right compose
- Output
- New constraints and list of relations
successfully eliminated
12(1) View unfolding
- Idea exploit equality constraints (if we have
any) - Standard technique substitute view definition
for occurrences of view relation in mappings - T V W, R ? S ?T, T ? X ? ?(U)
- ? R ? S ?(V W), (V W) ? X ? ?(U)
- Body must not mention view relation itself
- Doesnt matter what else is in body
- Can substitute everywhere
13(2) Left compose
- View unfolding for containment constraints
- ?(V) ? R U, R ? S ? T
- ? ?(V) ? (S ? T) U
- Needs monotonicity of expressions in R.
- E1 ? E2(R), R ? E3 E1 ? E2(E3)
- if E2(R) is monotone in R (and R not in E3)
- Partial check for monotonicity
- Is S (T R) monotone in R?
14Normalization for left compose
- Need one constraint of form R ? E1
- Use identities to normalize, e.g.
- R ? E1 and R ? E2 iff R ? E1 ? E2
- E1 ? E2 ? E3 iff E1 ? E3 and E2 ? E3
- ?(E1) ? E2 iff E1 ? E2 ? Dr
- More identities in paper
- After left compose, try to eliminate D
15(3) Right compose
- Dual to left compose, from NBM05
- Example
- S ?T ? R, R U ???(V)
- ? (S ?T) U ? ?(V)
- Monotonicity check needed here too
- Normalization may introduce Skolem functions
- E1 ? ?(E2) iff f(E1) ? E2
- Must eliminate Skolem functions after composition
- Lots of effort coding this step!
16User-defined operators
- User specifies
- Monotonicity of operator in its arguments
- If E1 monotone in R and E2 antimonotone in R or
independent of R, then E1 E2 monotone in R - if E1 monotone in R or independent of R and E2
antimonotone in R, then E1 E2 monotone in R - Identities for normalization
- E1 E2 ? E3 iff E1 ? E2 ? E3
- User-defined operators and standard relational
operators treated uniformly
17Implementation
- 12K lines of C code, command-line tool
- Test case 13 PODS05 example 2
- SCHEMA
- R(2), S(2), T(2)
- CONSTRAINTS
- R lt S,
- P_0,2 J_0,11,2 (S S) lt R,
- S lt T
- ELIMINATE
- S
- Output
- P_0,2 J_0,11,2(R R) lt R,
- R lt T
18Experimental evaluation
- First attempt at a composition benchmark
- Schema editing and schema reconciliation
scenarios - Add a column to R to produce S ?(R) S
- Measure
- of symbols eliminated
- Running time
- As a function of
- Editing primitives allowed, length of edit
sequence, presence/absence of keys, starting
schema size, - Synthetic data
19Summary of results
- Algorithm often effective in eliminating most or
even all relation symbols from S2 - Running time in subsecond range even for large
problems containing hundreds of constraints - Certain schema editing primitives problematic
- Key constraints did not reduce effectiveness,
although did increase running time (and output
size)
20Schema editing
- Random starting schema (30 relations of 2-10
attributes) - 100 random edits
- 100 different runs, sorted by execution time
21Schema reconciliation (1)
- Random schema (30 relations of 2-10 attributes),
random edits - Point represents median time of reconciliation
step of 500 runs
22Schema reconciliation (2)
- Random schema (variable relations of 2-10
attributes) - 100 random edits
- 100 different runs, sorted by execution time
23Related work
- MH03 J. Madhavan, A. Y. Halevy. Composing
mappings among data sources. VLDB, 2003. - FKPT04 R. Fagin, Ph. G. Kolaitis, L. Popa, W.C.
Tan. Composing schema mappings second-order
dependencies to the rescue. PODS, 2004. - NBM05 A. Nash, P. A. Bernstein, S. Melnik.
Composition of mappings given by embedded
dependencies. PODS, 2005.
24Conclusion and future work
- We motivated and described the mapping
composition problem - We presented an implementation of a practical new
algorithm for the composition problem - We also presented an experimental evaluation
- To do theoretical analysis of impact of
user-defined operators - To do output constraints from algorithm can be a
mess! How to clean up?