Title: Clio: Translating Web Data
1Clio Translating Web Data
- University of Toronto
- Renée J. Miller
- Periklis Andritsos, Ariel Fuxman
- Tasos Kemesetsidis, Yannis Velegrakis
- IBM Almaden
- Laura Haas
- Ron Fagin, Mauricio Hernández
- Howard Ho, Phokion Kolaitis, Felix Naumann
- Lucian Popa, Ling-Ling Yan
2Schema Mapping Data Translation
- Wants data from S
- Understands T
- May not understand S
- XML Schema
- DTD
- Relational
Source schema S
Target schema T
conforms to
data
User interactively chooses among (or corrects)
translation queries
3Mapping is not Schema Integration
Integrated schema
Source schema S1
Source schema Sn
Design Problem integrated schema designed to
match sources
4Overview
- Goal interoperability between independent data
sources - Schema Mapping
- Data Translation
- Challenges
- Schemas can be arbitrarily different
- Still, data must not lose its meaning during
translation - Maximum advantage of semantics embedded in
schemas data - Used in compilation
- Facilitate user specification of any additional
semantics - As by-product user learns if semantics
incorrect/incomplete - Performed manually complex user queries,
programs, etc. - Output correct data translation program
5Illustration Clio Schema Mapping
- Support Nested Structures
- Element correspondences
- Human friendly
- Automatic discovery
- Preserve data meaning
- Discover data associations
- Use constraints schema
6 Generate Queries (XQuery)
- ltstatisticsDBgt
- FOR x0 IN /expenseDB/grant, x1 IN
/expenseDB/project, x2 IN /expenseDB/company - WHERE
- x2/cid/text() x0/cid/text()
- x0/project/text()
x1/name/text() - RETURN
- ltcityStatisticsgt
- FOR x0L1 IN /expenseDB/grant,
x1L1 IN /expenseDB/project, x2L1 IN
/expenseDB/company - WHERE
- x2L1/cid/text()
x0L1/cid/text() - x0L1/project/text()
x1L1/name/text() - x2/city/text()
x2L1/city/text() - RETURN
- ltorganizationgt
- ltcidgt
x0L1/cid/text() lt/cidgt - ltcnamegt
x2L1/name/text() lt/cnamegt -
- FOR x0L2 IN
/expenseDB/grant, x1L2 IN /expenseDB/project,
x2L2 IN /expenseDB/company - WHERE
7Outline
- Schema Mapping (VLDB00, VLDB02)
- Interpret high-level mapping (correspondences)
- Relies on semantics of schemas
- Generate low-level mapping (logical mapping)
- Data Translation (VLDB02)
- Query generation
- Retrieve from source (join/unnest)
- Insert into target (partition/nest)
- Accommodate differences in data content
- Skolemization used to create missing data
- Understand Mapping (SIGMOD 01)
- Explain mappings, resolve ambiguities using data
8Creating Correspondence
- Graphical User Interface
- DBA interactively specifies
- Automatic Discovery
- Attribute (Element) Classifier
- Extensible to
- Other Schema Matchers
- VLDB J. 01 Survey
- Correspondence based on syntactic information
9Interpreting Arrows
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project
Good enough for one arrow !
?cid expenseDB.companies ? ?cid
statDB.cityStat.orgs
Still works for two arrows!
?cid,name expenseDB.companies ? ?cid,name
statDB.cityStat.orgs
?gid expenseDB.grants ? ?gid
statDB.cityStat.orgs.fundings
How about now ?
10Data Associations
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project
- We must recognize that grants are associated to
companies
- Association (in the source) grants ? companies
- Association (in the target) statDB ? orgs ?
fundings ? financials
11Logical Associations
statDB Set of Rcd cityStat Rcd
city orgs Set of Rcd org
Rcd cid name
fundings Set of Rcd
funding Rcd gid proj
aid financials Set of Rcd
financial Rcd aid
date amount
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount sponsor
project
12Multiple Logical Associations
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project sponsor
- Grants may be associated with companies in
multiple ways - Association 1 grants ? companies join on cid
cid - Association 2 grants ? companies join on
sponsor cid - We do not make the one flavor assumption (URA)
13Schema Mapping
- Enumerate ALL logical associations consistent
with schema semantics - Constraints
- Nesting (schema structure)
- Data (statistical models of data)
- Interpret correspondences (arrows) over pair
source target association
14Features of Algorithm
- Generate associations
- Modified dependency chase algorithm
- Nested schemas
- Broad class of (nested) referential constraints
- Including (nested) foreign keys
- Consider both source and target constraints
- Algorithm is complete generate all associations
(in the UR sense) for acyclic dependencies - Logical mapping
- Set of (alternative) inter-schema dependencies
15Mapping Algorithm
Attribute-Attribute correspondences
Source schema S
Target schema T
1 Generate logical mappings
2 Select subset mappings
16Outline
- Schema Mapping
- Interpret high-level mapping
- Relies on semantics of schemas
- Generate logical (low-level) mapping
- Data Translation
- Query generation
- Retrieve from source (join/unnest)
- Insert into target (partition/nest)
- Accommodate differences in data content
- Skolemization used to create missing data
17From Dependencies to Queries
- Logical mappings inter-schema dependencies
- Convenient intermediate representation (in Clio)
- can be reasoned about (query optimization)
- can even be shown (graphically) to a user
- but
- not all target attributes determined source
- flat representation of how schemas correspond
- we need to materialize a nested target
18Query Generation Issues
- We translate logical mappings into queries
- Two parts, conceptually
- Unnest (join) the source association
- Nest (partition) according to the shape of the
target association - Skolemization algorithm the heart of query
generation - Achieves a good nesting (grouping)
- Generates new values (ids)
- Not to many ids, but not too few either !
19Query Generation
statDB Set of Rcd cityStat Rcd
city orgs Set of Rcd org
Rcd cid name
fundings Set of Rcd
funding Rcd gid proj
aid financials Set of Rcd
financial Rcd aid
date amount
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount sponsor
project
Sk2
M2
Sk1name
Sk3name
- Correspondences map only into some of the atomic
attributes - We use Skolem functions to control the creation
of the other elements - sets (this controls how we group elements in the
target) - atomic values (this enforces the integrity of the
target)
20Mapping Algorithm
Source schema S
Target schema T
?1
?2
Target Associations
Source Associations
BUT we dont have It we need to create it!!
21Correctness Guarantees
- Information preservation
- For every logical mapping M ?1 ? ?2
generate query QM - For any source instance Is ?2
(QM(?1(Is))) ?1(Is) - Map no less and no more data than was intended
to be mapped - In fact, ?2 and QM are inverse to each other
(view inversion) - Lossless data transmission
- Target integrity constraints satisfied!!
- Target nested referential constraints ?t
- QM(?1(Is)) It where It ? ?t
22When Can This Tool Help?
- In general, whenever
- schemas are complex (nesting constraints) or
different - it matters whether we lose data or not
- it is not obvious to a human user that there are
many forms of data that can coexist within a
schema - Specific applications
- data migration from one schema to another (schema
redesign) - data transmission (between independently created
schemas) - data integration and warehousing
23Project Status So Far
- Can map between any combination of
XML schemas/DTDs/Relational schemas - The generated queries are XQuery Skolem fcts
- (or XSLT or SQL)
- Tested Clio with many complex real data
- LifeScience schemas (private and public GeneX,
SwissProt) - Amalgam integration benchmark, DBLP, Mondial,
several others, - Demos SIGMOD01 (Relational), ICDE02 (XML)
- Papers VLDB00, SIGMOD01, VLDB02
No commercial tool correctly generates XML
transformations that are lossless !
24Conclusion
- Schema mapping framework with solid semantic
foundation - Covers relational and core of XML Schema
- Maps source associations to target associations
- We build data transformations (XQuery with Skolem
functions) rather than just queries - Separated into 2 phases
- Logical mapping (arrows -gt dependencies)
- Data translation (dependencies -gt queries)
- It works (and we have users!)
25Answering Queries Using Views
q
Target schema T
Source schema S
?t
?st
virtual !
I
J
- The target is not materialized it is just a
querying interface - Queries are posed on the target schema data is
in the source. - Problem how to answer the query in the best
possible way - Classical setting ?st have a special form
- GAV Q(S) ? Ti where Ti is a relation in T, Q
is a query on S - LAV Si ? Q(T) where Si is a relation in S , Q
is a query on T - Schema Mapping Q1(S) ? Q2(T) and constraints on
S and T (?s , ?t ) - GAV/LAV (mostly) assumes conjunctive queries on
relations - (mostly) assumes no target constraints target
is a view