Title: Data Exchange with
1- Data Exchange with
- Data-Metadata Translations
Wang-Chiew Tan UC Santa Cruz
Mauricio A. Hernández IBM Almaden
Research Center
VLDB 2008
August 24 -- Auckland, New Zealand
2Data-Metadata Translations
- Data exchange scenarios may involve metadata
transformations. - E.g., Pivot/Unpivot in spreadsheets.
example from Miller98
- Mapping systems support Data-to-Data
transformations with fixed schemas. - Goal Extend mapping systems to support
Data-Metadata Translations.
2
3Mapping Systems
GUI
Source schema S
Target schema T
IBM Clio HepTox MS ADO.net Altova
MapForce StylusStudio BEA Aqualogic
Declarative (internal) representation
Executable code (XSLT, XQuery, Java)
I
J
Data exchange
4Outline
1. Data and Metadata translations
2. Generation Algorithms
Data-to-Data
Mapping Generation
Metadata-to-Data
Query Generation
Data-to-Metadata
Graphic Design
3. Results Discussion
Experiments
Related Work
Conclusion
Graphic Design
5Data-to-Data
- Mapping Generation Algorithm PVMHF 2002
- Input Source and Target schemas, and
correspondences. - Output declarative schema mapping
- For example
for s in Source.Sales exists t in
Target.CountrySales, c in
t.Sales where t.country s.country and
c.style s.style and c.shipdate
s.shipdate and c.units s.units
Source Rcd Sales SetOf Rcd
country region style
shipdate units price
Target Rcd CountrySales SetOf Rcd
country Sales SetOf Rcd
style shipdate
units id
6Mappings
for s in Source.Sales exists t in
Target.CountrySales, c in t.Sales where
t.country s.country and c.style s.style
and c.shipdate s.shipdate and
c.units s.units
- Query Generation into multiple query languages
- Input a data to data schema mapping
- Output a query script (XQuery, XSLT, SQL, etc.)
for x0 in doc/Source/Sales return (
ltCountrySalesgt ltcountrygt x0/country/text()
lt/countrygt
7State-of-the-art Metadata-to-Data
How can we transform the following source data
into the corresponding target?
Target.Sales month country units Jan
USA 120 Jan UK 223 Jan
Italy 89 Feb USA 83 Feb UK
168 Feb Italy 56
Source.Sales month USA UK Italy
Jan 120 223 89 Feb 83 168 56
Schema mapping m1
Source Rcd Sales SetOf Rcd month
USA UK Italy
Target Rcd Sales SetOf Rcd month
country units
USA
7
8State-of-the-art Metadata-to-Data
How can we transform the following source data
into the corresponding target?
Target.Sales month country units Jan
USA 120 Jan UK 223 Jan
Italy 89 Feb USA 83 Feb UK
168 Feb Italy 56
Source.Sales month USA UK Italy
Jan 120 223 89 Feb 83 168 56
m1 for s in Source.Sales exists t
in Target.Sales where t.month s.month
and t.country USA and
t.units s.USA
Schema mapping m2
Source Rcd Sales SetOf Rcd month
USA UK Italy
Target Rcd Sales SetOf Rcd month
country units
UK
8
9State-of-the-art Metadata-to-Data
How can we transform the following source data
into the corresponding target?
Target.Sales month country units Jan
USA 120 Jan UK 223 Jan
Italy 89 Feb USA 83 Feb UK
168 Feb Italy 56
Source.Sales month USA UK Italy
Jan 120 223 89 Feb 83 168 56
m1 for s in Source.Sales exists t
in Target.Sales where t.month s.month
and t.country USA and
t.units s.USA
Schema mapping m3
Source Rcd Sales SetOf Rcd month
USA UK Italy
Target Rcd Sales SetOf Rcd month
country units
Italy
9
10Metadata-to-Data Our solution
MetadatA-Data (MAD) mapping
Source Rcd Sales SetOf Rcd month
USA UK Italy
Target Rcd Sales SetOf Rcd month
country units
Target.Sales Jan USA 120 Jan UK
223 Jan Italy 89 Feb USA 83 Feb
UK 168 Feb Italy 56
Source.Sales Jan 120 223 89 Feb 83
168 56
??countries?? label value
Select the elements to group
Placeholder
Copy elements values
Copy elements labels
Set of labels (strings)
for s in Source.Sales, c in
USA, UK, Italy exists t in
Target.Sales where t.month s.month and
t.country c and t.units s.(c)
Is a label value
Dynamic selection of the source element
11Data-to-Metadata
Now we want to support the opposite operation
example from Miller98
The target schema depends on the source data We
define a target template Nested Dynamic Output
Schemas (ndos)
Target Rcd Stockquotes SetOf Rcd
time ??symbols??
label value
Source Rcd StockTicker SetOf Rcd
time symbol price
Dynamic element
Run-time The dynamic element defines the target
instance and the target
schema.
12Data-to-Metadata Heterogeneous records
Consider this mapping and this source instance
Source Instance
There are two possible interpretations for the
target ndos
StockTicker (time 0900, Symbol MSFT, Price
27.20 ) StockTicker (time 0900, Symbol IBM,
Price 120.00 ) StockTicker (time 0905, Symbol
MSFT, Price 27.30 )
First alternative Heterogeneous target records
Computed Target Schema
Computed Target Instance
Target Rcd Stockquotes SetOf Rcd
time symbols Choice MSFT
IBM
Stockquotes (time 0900, MSFT 27.20 )
Stockquotes (time 0900, IBM 120.00 )
Stockquotes (time 0905, MSFT 27.30 )
13Data-to-Metadata Homogenous records
Consider this mapping and this source instance
Source Instance
There are two possible interpretations for the
target ndos
StockTicker (time 0900, Symbol MSFT Price
27.20 ) StockTicker (time 0900, Symbol IBM
Price 120.00 ) StockTicker (time 0905, Symbol
MSFT Price 27.30 )
Second alternative Homogeneous target records
Computed Target Schema
Computed Target Instance
Target Rcd Stockquotes SetOf Rcd
time MSFT IBM
Stockquotes (time 0900, MSFT 27.20, IBM null
) Stockquotes (time 0900, MSFT null , IBM
120.00 ) Stockquotes (time 0905, MSFT 27.30,
IBM null )
14Data-to-Metadata Homogenous records
Natural solution for semi-structured data models
(XSD, DTD, JSON)
Stockquotes(time 0900, MSFT 27.20 )
Stockquotes(time 0900, IBM 120.00 )
Stockquotes(time 0905, MSFT 27.30 )
Homogeneity Constraint For every pair of tuples
t1 and t2, if a is a label in t1, then a is a
label in t2 for t1 in Target.Stockquotes,
t2 in Target.Stockquotes, a in dom
(t1) exists a in dom (t2) where a a
Stockquotes(time 0900, MSFT 27.20, IBM null
) Stockquotes(time 0900, MSFT null , IBM
120.00) Stockquotes(time 0905, MSFT 27.30,
IBM null )
Natural solution for the Relational data model
15MAD Mapping Generation
ltByShipDateCountrygt lt12-07gt ltUSAgt
ltstylegtTeelt/stylegtltunitsgt11lt
/unitsgtltpricegt1200lt/pricegt lt/USAgtltUSAgt
ltstylegtElec.lt/stylegtltunitsgt12lt/unitsgtltpricegt3600
lt/pricegt lt/USAgt lt/12-07gt lt01-08gt
ltUSAgt ltstylegtTeelt/stylegtltunitsgt10lt/unitsgtltpr
icegt1600lt/pricegt lt/USAgt lt/01-08gt lt02-08gt
ltUKgt ltstylegtTeelt/stylegtltunitsgt12lt/unitsgtlt
pricegt2000lt/pricegt lt/UKgt
lt/02-08gt lt/ByShipDataCountrygt
Source.Sales country region style shipdate
units price USA East Tee 12-07
11 1200 USA East Elec. 12-07
12 3600 USA West Tee 01-08 10
1600 UK West Tee 02-08 12
2000
16MAD Mapping Generation
- Three Steps
- Modify schemas with dynamic placeholders
- Compile mappings
- Simplify mapping
This is what we get from Clio PVMHF 02
for s in Source.Sales exists t in
Target.ByShipdateCountry, y in
??dates??, u in case t of y,
z in ??countries??, v in
u.(z) where y s.shipdate and z
s.country and v.style s.style and
v.units s.units and v.price s.price
and u.(z) SKs.shipdate,s.country
for s in Source.Sales exists t in
Target.ByShipdateCountry, u in case
t of s.shipdate, v in
u.(s.country) where v.style s.style
and v.units s.units and v.price s.price
and u.(s.country) SKs.shipdate,s.countr
y
17Query Generation two-phase algorithm
PVMHF 02
Phase 2 Q2 assembles the target instance J from
the relational views
T2
T1
T
T3
T4
conforms-to
r
r
J
r
r
Q2
18New Query Generation
Phase 2 Q2 assembles the target instance J from
the relational views Q3 computes the target
schema T Q4 is the optional post - processing
Phase 1 Q1 shreds the source instance I over
relational views of the target ndos
T2
T1
ndos
T3
conforms-to
T2
T1
S1
T
S
T3
T4
conforms-to
conforms-to
Q3
r
I
r
J
r
r
Q2
Q4
19MAD Clio vs. Commercial Tools
Commercial Tool
20MAD Clio vs. Commercial Tools
MAD Clio
Optimized query
48 source labels (10 MB) naïve 183 s, dynamic 14
s, optimized 10 s
21MAD Clio Performance
22Some Related Work
- Lots of related work in the relational setting
- FIRA/FISQL Wyss,Robertson 2005 has an excellent
survey. - SchemaSQL Lakshmanan,Sadri,Subramanian 1996,
FIRA/FISQL Wyss,Robertson 2005 - Extensions to SQL to handle metadata as data
- Only relational dynamic output schemas
- Language and semantics, NO transformations from
GUI - In XML settings
- HepTox BCHLP 2005, commercial mapping tools
Altova MapForce, MS ADO.net, StylusStudio, BEA
(Oracle) Aqualogic - No dynamic elements in the target
23MAD Clio
Data exchange with data-metadata support Data to
Data is a special case
- New construct to iterate over elements labels
placeholder - Target schema can be incomplete nested dynamic
output schema (ndos)
GUI
Source schema S
Target schema T
- New constructs for the mapping language
- New mapping query generation algorithms
- Including a query to generate the target schema.
Declarative (internal) representation
Executable code (XSLT, XQuery, Java)
24Data Exchange with Data-Metadata Translations
25Metadata-to-Metadata
... ltproperties nameprice langen-us date
01-01-2008 ... gt
ltpvalgt48.15lt/pvalgt lt/propertiesgt ...
... ltprice value48.15 langen-us
date01-01-2008 ... /gt ...
Source Rcd properties SetOf Rcd
_at_name _at_lang _at_date
_at_format pval
Target Rcd label1
value1 SetOf Rcd _at_value
label2 value2
ltltnamesgtgt
ltltattrsgtgt label value
ltltelemsgtgt
for x1 in Source.properties, x2 in
_at_lang, _at_date, , _at_format exists y1 in
Target.(x1._at_name), where y1._at_value x1.pval
and y1.(x2) x1.(x2)
Metadata to Metadata placeholder to dynamic
element