Title: XML Constraints: Specification, Analysis, and Applications
1 XML Constraints: Specification, Analysis, and Applications
- Wenfei Fan
- School of Informatics, University of Edinburgh
- Network Data and Services Research, Bell Laboratories
2 Outline
- XML specifications: types and integrity constraints
- Specification of XML constraints
- keys, foreign keys, FDs
- absolute vs. relative constraints
- Analysis of XML constraints
- Consistency analysis
- Implication analysis
- Applications of XML constraints, and research issues
- Relational storage of XML data via constraint propagation
- Schema-directed XML integration
- Normal forms, query optimization, updates, data cleaning, ...
3 Introduction to XML specification
- XML specification
- types
- integrity constraints
- the need for XML constraints
4 XML data: an example
- Rooted, node-labeled tree
- elements: db, province, capital, city
- subtree/sub-document: elements/subelements, e.g., the capital child of province
- @attributes: @name, @inProvince, carrying text
- text nodes, e.g., Hasselt
5 XML specification: DTD (type)
- Production: constrains the subelement list of each element
- <!ELEMENT db (province+, capital+)>
- <!ELEMENT province (city, capital)>
- Attributes: uniquely identified by name for each element, unordered
- province: @name, capital: @inProvince
6 XML specification: integrity constraints
- Keys and foreign keys (vs. relational constraints)
- key: the value of @name uniquely identifies a province
- province.@name → province
- capital.@inProvince → capital
- FK: @inProvince of a capital references @name of a province
- capital.@inProvince ⊆ province.@name
7 XML specification
- A type (DTD) D
- A set of integrity constraints, Σ
- Example
- DTD D: structure of the document, vs. types in a programming language
- <!ELEMENT db (province+, capital+)>
- <!ELEMENT province (city, capital)>
- province: @name, capital: @inProvince
- Constraints Σ: defined in terms of data values across elements
- province.@name → province
- capital.@inProvince → capital
- capital.@inProvince ⊆ province.@name
8 Why XML constraints?
- Supported by the W3C XML standard and XML Schema
- In databases (supported by the SQL standard), constraints are
- an essential part of the semantics of data,
- fundamental to conceptual design,
- useful for choosing efficient storage and access methods,
- central to update anomaly prevention
- In the XML setting, constraints have proved useful in
- database storage of XML data (via constraint propagation),
- schema-directed database publishing/integration in XML,
- XML query optimization and formulation,
- design theory for XML specifications: normal forms,
- data cleaning
9 Data exchange on the Web: XML publishing
- All members of a community (or industry) agree on a schema and exchange data w.r.t. the schema: e-commerce, health-care, ...
- Schema-directed XML publishing/integration
- mapping data from traditional databases to XML
- satisfying the predefined DTD and constraints
[Figure: databases DB1 and DB2, a query Q defining an XML view, and XML exchanged over the Web]
10 Data exchange on the Web: XML shredding
- XML shredding
- mapping XML data to relations
- relational design: normalization via constraint propagation from XML to relations
- optimal relational storage of XML data
- semantic connection: query/update optimization
[Figure: XML from the Web is shredded into databases DB1 and DB2; XML keys are propagated to relational FDs]
11 XML constraints
- Specification of XML constraints
- keys, foreign keys, FDs
- absolute vs. relative constraints
12 Absolute constraints
- Absolute keys and foreign keys are to hold on the entire document.
- province.@name → province
- capital.@inProvince → capital
- capital.@inProvince ⊆ province.@name
- Extensions of relational counterparts
13 Absolute keys and foreign keys [PODS00, 01]
- key: τ[X] → τ. An XML document satisfies the key iff
- ∀ x, y ∈ ext(τ) (⋀_{l ∈ X} (x.l = y.l) → x = y)
- foreign key (FK): a combination of an inclusion constraint τ1[X] ⊆ τ2[Y] and a key τ2[Y] → τ2.
- A document satisfies the FK iff it satisfies the key and
- ∀ x ∈ ext(τ1) ∃ y ∈ ext(τ2) (x[X] = y[Y])
- τ, τ1, τ2: element types; X, Y: sets (lists) of attributes
- ext(τ): the set of τ elements in an XML document.
- Equality issue
- (string) value equality when comparing attributes
- node identity when comparing XML elements
- Unary keys and foreign keys: defined in terms of a single attribute.
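The two satisfaction conditions above can be checked directly on a small document. The following is a minimal sketch, assuming attributes are compared by string value and that a missing attribute is read as None; the helper names and the sample document are illustrative and do not come from the slides.

```python
import xml.etree.ElementTree as ET

def ext(root, tag):
    """ext(tau): every element of type `tag` in the document."""
    return list(root.iter(tag))

def key_value(elem, attrs):
    """The tuple of attribute values x[X]; a missing attribute yields None."""
    return tuple(elem.get(a) for a in attrs)

def satisfies_key(root, tag, attrs):
    """tau[X] -> tau: no two distinct tau elements agree on all attributes in X."""
    seen = set()
    for elem in ext(root, tag):
        value = key_value(elem, attrs)
        if value in seen:
            return False
        seen.add(value)
    return True

def satisfies_foreign_key(root, src, src_attrs, dst, dst_attrs):
    """tau1[X] included in tau2[Y], together with the key tau2[Y] -> tau2."""
    if not satisfies_key(root, dst, dst_attrs):
        return False
    targets = {key_value(e, dst_attrs) for e in ext(root, dst)}
    return all(key_value(e, src_attrs) in targets for e in ext(root, src))

# Hypothetical document: two provinces, each with its own capital,
# plus one capital listed directly under the root.
DOC = """
<db>
  <province name="Limburg"><capital inProvince="Limburg"/></province>
  <province name="Antwerp"><capital inProvince="Antwerp"/></province>
  <capital inProvince="Limburg"/>
</db>
"""
root = ET.fromstring(DOC)
province_key = satisfies_key(root, "province", ["name"])
capital_key = satisfies_key(root, "capital", ["inProvince"])
capital_fk = satisfies_foreign_key(root, "capital", ["inProvince"], "province", ["name"])
print(province_key, capital_key, capital_fk)
```

In this sample the province key and the foreign key hold, while the capital key fails because two capital elements carry the same @inProvince value.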
14 Relative constraints [WWW01, PODS02]
- An XML tree specifies countries, provinces, province capitals.
- What is a key for a province?
- What does @inProvince of a capital reference?
[Figure: XML tree rooted at db with two country subtrees (@name Belgium and @name Holland); each country contains province and capital elements; node labels include province @name Limburg, capital @name Hasselt and Maastricht, and @inProvince Limburg]
15 Examples of relative constraints
- Relative constraints: on a subdocument rooted at a country
- key: country (province.@name → province)
- country (capital.@inProvince → capital)
- FK: country (capital.@inProvince ⊆ province.@name)
- Absolute: on the entire document, country.@name → country
[Figure: XML tree rooted at db with two country subtrees (@name Belgium and @name Holland); node labels include province @name Limburg, capital @name Hasselt and Maastricht, and @inProvince Limburg]
16 Relative keys and foreign keys
- key: τ(τ1[X] → τ1). A document satisfies the key iff
- ∀ c ∈ ext(τ) ∀ y, z ∈ ext(τ1) ((y ≺ c) ∧ (z ≺ c) ∧ ⋀_{l ∈ X} (y.l = z.l) → y = z)
- foreign key (FK): τ(τ1[X] ⊆ τ2[Y]) and a key τ(τ2[Y] → τ2).
- A document satisfies the FK iff it satisfies the key and
- ∀ c ∈ ext(τ) ∀ y ∈ ext(τ1) ((y ≺ c) → ∃ z ∈ ext(τ2) ((z ≺ c) ∧ y[X] = z[Y]))
- where
- (y ≺ c): y is a descendant of c (y is in the subtree rooted at c)
- τ: context type
- ext(τ): the set of τ elements in an XML document.
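The relative semantics differs from the absolute one only in that the check is repeated inside each context subtree. The sketch below illustrates this under the same simplifying assumptions as before (string-valued attributes, None for a missing attribute); the function names and the sample document are hypothetical. Using the root type db as the context recovers the absolute reading.

```python
import xml.etree.ElementTree as ET

def descendants(c, tag):
    """Elements of type `tag` that are proper descendants of c."""
    return [e for e in c.iter(tag) if e is not c]

def values(elem, attrs):
    return tuple(elem.get(a) for a in attrs)

def satisfies_relative_key(root, context, tag, attrs):
    """context(tag[attrs] -> tag): the key holds inside each context subtree."""
    for c in root.iter(context):
        seen = set()
        for elem in descendants(c, tag):
            v = values(elem, attrs)
            if v in seen:
                return False
            seen.add(v)
    return True

def satisfies_relative_fk(root, context, src, src_attrs, dst, dst_attrs):
    """context(src[X] included in dst[Y]) plus the relative key on dst."""
    if not satisfies_relative_key(root, context, dst, dst_attrs):
        return False
    for c in root.iter(context):
        targets = {values(e, dst_attrs) for e in descendants(c, dst)}
        if any(values(e, src_attrs) not in targets for e in descendants(c, src)):
            return False
    return True

# Hypothetical document: both countries have a province named Limburg.
DOC = """
<db>
  <country name="Belgium">
    <province name="Limburg"/>
    <capital name="Hasselt" inProvince="Limburg"/>
  </country>
  <country name="Holland">
    <province name="Limburg"/>
    <capital name="Maastricht" inProvince="Limburg"/>
  </country>
</db>
"""
root = ET.fromstring(DOC)
relative_ok = satisfies_relative_key(root, "country", "province", ["name"])
absolute_ok = satisfies_relative_key(root, "db", "province", ["name"])
fk_ok = satisfies_relative_fk(root, "country", "capital", ["inProvince"],
                              "province", ["name"])
print(relative_ok, absolute_ok, fk_ok)
```

Here @name identifies a province within its country but not across the whole document, which is the situation the relative key is meant to capture.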
17 Relative vs. absolute
- Absolute constraints are a special case of relative ones
- country.@name → country is equivalent to db (country.@name → country)
- absolute: a fixed context type, the root type r
- Absolute constraints are scoped within the entire document, whereas relative ones are scoped within the context of a subdocument.
- country (province.@name → province)
- country (capital.@inProvince → capital)
- country (capital.@inProvince ⊆ province.@name)
- country.@name → country
- Together they specify constraints on the entire document
- Beyond relational constraints: important for hierarchically structured data such as XML, scientific databases, biomedical data, ...
18 Define keys with path expressions
- XML data is hierarchically structured!
- name as a key for employees of companies only: the target set is identified with a path expression //company//employee
- XML data is semistructured: it may not have a DTD/schema!
- key paths may be missing or have multiple occurrences
- key specification should be independent of types
[Figure: example tree with name, @id, firstName, and lastName nodes]
19 Absolute path constraints [WWW01]
- Absolute key: (Q, {P1, ..., Pk})
- Path expressions Q, Pi: XPath, regular path expressions, ...
- target path Q: identifies a target set [[Q]] of nodes on which the key is defined (vs. relation)
- a set of key paths {P1, ..., Pk}: provides an identification for nodes in [[Q]] (vs. key attributes)
- semantics: for any two nodes in [[Q]], if they have all the key paths and agree on them by value equality (existential), then they must be the same node (value equality and node identity)
- Examples
- (//company//employees, {name, phone}) -- composite key
- (//company//employees, {//@id}) -- multiple keys
- (//., {@id}) -- capturing ID attributes in DTDs
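The existential semantics of a path key can be sketched as follows. This is a simplified illustration: it compares nodes reached by a key path through their text content (or attribute value) rather than by full subtree equality, and it relies on the limited XPath subset of Python's xml.etree.ElementTree, where a target path is written relative to the root (for example `.//company//employee`). The function names and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def path_values(node, path):
    """Values reached from `node` via a key path.
    '@a' reads attribute a; any other path is passed to findall()."""
    if path.startswith("@"):
        v = node.get(path[1:])
        return set() if v is None else {v}
    return {(e.text or "").strip() for e in node.findall(path)}

def satisfies_path_key(root, target, key_paths):
    """(Q, {P1..Pk}): two distinct target nodes must not share a value
    on every key path (existential agreement)."""
    nodes = root.findall(target)
    for i, n1 in enumerate(nodes):
        for n2 in nodes[i + 1:]:
            if n1 is n2:          # findall may yield a node more than once
                continue
            if all(path_values(n1, p) & path_values(n2, p) for p in key_paths):
                return False
    return True

# Hypothetical document: names repeat, (name, phone) pairs do not.
DOC = """
<db>
  <company>
    <employee id="e1"><name>Ann</name><phone>555-0100</phone></employee>
    <employee id="e2"><name>Ann</name><phone>555-0101</phone></employee>
  </company>
  <person><name>Ann</name></person>
</db>
"""
root = ET.fromstring(DOC)
composite_ok = satisfies_path_key(root, ".//company//employee", ["name", "phone"])
name_only_ok = satisfies_path_key(root, ".//company//employee", ["name"])
id_ok = satisfies_path_key(root, ".//company//employee", ["@id"])
print(composite_ok, name_only_ok, id_ok)
```

A node that lacks one of the key paths contributes an empty value set, so it can never agree with another node on that path; this mirrors the "if they have all the key paths" clause. With an empty set of key paths, any two distinct target nodes violate the key, so the target set may contain at most one node.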
20 Relative path constraints [WWW01]
- Relative key: (Q, K)
- path Q identifies a set [[Q]] of nodes, called the context path
- K = (Q', {P1, ..., Pk}) is a key on sub-documents rooted at nodes in [[Q]] (relative to Q).
- Example: (//country, (province, {@capital}))
- (//country, {@name}) -- absolute key
- Absolute keys are a special case of relative keys: (Q, K) when Q is the empty path
- Similarly for foreign keys
- Specification of XML constraints is more involved than its relational counterparts
21 Keys and foreign keys in XML Schema
- key: (Q, {P1, ..., Pk})
- Path expressions Q, Pi: fragments of XPath
- Uniqueness and existence: for each node x in [[Q]] and each i in [1, k], there exists a unique node yi reached via Pi, and yi is either a text node or an attribute
- Foreign keys: (Q, {P1, ..., Pk}) ⊆ (S, {S1, ..., Sk})
- (S, {S1, ..., Sk}) is a key
- Uniqueness and existence: both Pi and Si
- The uniqueness and existence condition complicates the consistency and implication analyses
- Absolute constraint
22 Other constraints for XML
- Functional dependencies: {P1, ..., Pk} → {S1, ..., Sk}
- Generalizations of relational FDs for deriving an extension of relational-schema normal forms
- Absolute constraints (Arenas and Libkin, PODS02)
- XIGs: ∀ x1 ... xn (B(x1, ..., xn) → ⋁_{i ∈ [1, l]} ∃ y1 ... yk Ci(x1, ..., xn, y1, ..., yk))
- Generalization of relational embedded constraints
- B, Ci: conjunctions of simple XPath expressions
- Subsuming relative keys and foreign keys (Deutsch and Tannen, KRDB01)
23 Constraint analysis
- Analysis of XML constraints
- Consistency analysis
- Implication analysis
- Absolute, relative, path-expression constraints
24 Consistency of XML specifications
- Given: D, a DTD, and Σ, a set of integrity constraints over D
- Consistency: Is there an XML document that both conforms to D and satisfies Σ?
- One wants to know whether XML specifications make sense!
- Run-time check: attempts to validate documents with (D, Σ).
- This would not tell us whether repeated failures are due to a bad specification or problems with the documents
- Hence static analysis is desirable
25 An inconsistent specification
- The specification with D and Σ is inconsistent!
- DTD D
- <!ELEMENT db (province+, capital+)>
- <!ELEMENT province (city, capital)>
- province: @name, capital: @inProvince
- Constraints Σ
- province.@name → province
- capital.@inProvince → capital
- capital.@inProvince ⊆ province.@name
- In contrast, one can specify keys and foreign keys in SQL without worrying about their consistency with the schema.
26 Cardinality constraints by keys, foreign keys
- Constraints Σ
- province.@name → province
- capital.@inProvince → capital
- capital.@inProvince ⊆ province.@name
- Notation
- ext(τ): the set of τ elements in an XML document
- ext(τ.l): the set of l attribute values of all τ elements
- Cardinality constraints imposed by Σ
- |ext(province.@name)| = |ext(province)|
- |ext(capital.@inProvince)| = |ext(capital)|
- |ext(capital.@inProvince)| ≤ |ext(province.@name)|
- Hence |ext(capital)| ≤ |ext(province)|
27 Cardinality constraints imposed by DTDs
- DTD D
- <!ELEMENT db (province+, capital+)>
- <!ELEMENT province (city, capital)>
- Variables
- Xprovince: the number of province elements under the root
- Xcapital: the number of capital subelements of the root
- Ycapital: the number of capital subelements of provinces
- Cardinality constraints imposed by D
- Xprovince ≥ 1, Xcapital ≥ 1
- |ext(province)| = Xprovince, Xprovince = Ycapital
- |ext(capital)| = Xcapital + Ycapital
- Hence |ext(capital)| > |ext(province)|
28 The interaction
- Contradiction
- From the constraints Σ: |ext(capital)| ≤ |ext(province)|
- From the DTD D: |ext(capital)| > |ext(province)|
- Thus there exists NO XML document that both conforms to D and satisfies Σ.
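The contradiction can be written out as two chains, using only the quantities defined on the two preceding slides (the keys turn attribute-value counts into element counts; the DTD fixes one capital per province and at least one capital under the root):

```latex
\begin{align*}
\text{from } \Sigma:\quad
  & |ext(\mathit{capital})|
    = |ext(\mathit{capital}.@\mathit{inProvince})|
    \le |ext(\mathit{province}.@\mathit{name})|
    = |ext(\mathit{province})| \\
\text{from } D:\quad
  & |ext(\mathit{capital})|
    = X_{\mathit{capital}} + Y_{\mathit{capital}}
    = X_{\mathit{capital}} + X_{\mathit{province}}
    \ge 1 + |ext(\mathit{province})|
    > |ext(\mathit{province})|
\end{align*}
```

The first line bounds the number of capitals from above by the number of provinces, the second bounds it strictly from below by the same number, so no document can meet both.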
29 Consistency analysis [PODS01, 02]
- Trivial for relational databases: given any schema and keys, foreign keys, one can always find a nonempty instance of the schema satisfying the constraints.
- Hard for XML: XML specifications may not be consistent!
- Both DTDs and constraints impose cardinality constraints
- The interaction between these two classes of cardinality constraints is rather complicated.
30 Consistency analysis of XML constraints
- Theorem: The consistency problem is
- undecidable for multi-attribute absolute keys and foreign keys
- NP-complete for unary absolute keys and foreign keys, even for primary keys (primary: at most one key for each element type)
- in NEXPTIME for primary multi-attribute absolute keys and unary foreign keys
- in NEXPTIME and PSPACE-hard for unary absolute regular keys and foreign keys (target path β/τ, where β is a regular path expression and τ an element type; key paths: attributes)
- undecidable for relative keys and foreign keys, even when all the constraints are unary and primary.
- As opposed to the trivial analysis of the relational counterpart.
31 Some tractable cases
- Restrictions on constraints.
- Theorem: For multi-attribute relative keys only, the consistency problem is in linear time for arbitrary DTDs.
- Recall relative keys: country (province.@name → province)
- In contrast, due to the existence and uniqueness condition:
- Theorem: It is intractable for unary keys alone in XML Schema.
- Restrictions on DTDs
- Theorem: When the DTD is fixed, the consistency problem is in PTIME for absolute unary keys and foreign keys.
- In practice, a DTD is designed at one time, but constraints are written in stages: constraints are incrementally added.
32 Implication analysis [PODS00, 01, 02, DBPL01]
- Given: D, a DTD
- Σ, a set of constraints expressed in C
- φ, a property (a constraint of C)
- Implication (C): Is it the case that for any XML document, if it conforms to D and satisfies Σ, then it must satisfy φ?
- C: a constraint language
- The need for studying implication
- data integration: constraint checking at virtual views
- optimization of XML queries and XML relational storage
- design theory for XML specifications: normalization
33 Some complexity results for implication analysis
- Theorem: The implication problem is
- undecidable for multi-attribute absolute keys and foreign keys, and for unary relative keys and foreign keys
- PSPACE-hard for unary regular absolute keys and foreign keys
- coNP-complete for unary absolute keys and foreign keys
- coNP-hard for XML-Schema unary keys
- in linear time for absolute multi-attribute keys
- in PTIME for arbitrary absolute keys and foreign keys when the DTD is fixed, and
- in PTIME for relative path keys in the absence of DTDs
- The analysis of XML constraints is far more intricate than its relational counterpart
34 Applications
- Application of XML constraints, and open problems
- Constraint propagation
- Schema-directed XML integration
- Normal form
- Query rewriting/optimization
- Update processing
- Data cleaning
- . . .
35 XML shredding: relational storage of XML data
- XML shredding
- mapping XML data to relations
- relational design: normalization
- optimal relational storage of XML data
- semantic connection: query/update optimization
[Figure: XML from the Web is shredded into databases DB1 and DB2; XML keys are propagated to relational FDs]
36 Example: XML constraints
- (//book, {isbn}) -- isbn is an (absolute) key of book
- (//book, (chapter, {number})) -- number is a key of chapter relative to book
- (//book, (title, {})) -- each book has a unique title
37 Mapping from XML to a predefined relation
- Predefined RDB: chapter(bookTitle, chapterNum, chapterTitle)
- Mapping: for each book, extract its title, and the numbers and titles of all its chapters
- Predefined relational key: (bookTitle, chapterNum)
- Can the XML data be mapped to the RDB without violating the key?
38 A safe mapping
- Now change the relational schema to
- RDB: chapter(isbn, chapterNum, chapterTitle)
- The relation can be populated without any violation. Why?
- The relational key (isbn, chapterNum) for chapter is implied (entailed) by the keys on the original XML data
- (//book, {isbn}), (//book, (chapter, {number})), (//book, (title, {}))
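The difference between the two relational schemas can be seen on a small instance. The sketch below assumes a document layout in which isbn and number are attributes and title is a child element; that layout, the sample data, and the function names are all hypothetical. The document satisfies the three XML keys, yet two books share a title.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: distinct isbn values, identical book titles.
DOC = """
<db>
  <book isbn="111"><title>XML</title>
    <chapter number="1"><title>Introduction</title></chapter>
  </book>
  <book isbn="222"><title>XML</title>
    <chapter number="1"><title>Basics</title></chapter>
  </book>
</db>
"""

def shred(root, identify_book_by):
    """Map each chapter to a row (book identifier, chapterNum, chapterTitle)."""
    rows = []
    for book in root.iter("book"):
        if identify_book_by == "isbn":
            book_id = book.get("isbn")
        else:
            book_id = book.findtext("title")
        for chapter in book.findall("chapter"):
            rows.append((book_id, chapter.get("number"), chapter.findtext("title")))
    return rows

def key_holds(rows, key_positions):
    """A relational key holds if no two distinct rows agree on the key columns."""
    seen = {}
    for row in rows:
        k = tuple(row[i] for i in key_positions)
        if k in seen and seen[k] != row:
            return False
        seen[k] = row
    return True

root = ET.fromstring(DOC)
by_title = shred(root, "title")   # chapter(bookTitle, chapterNum, chapterTitle)
by_isbn = shred(root, "isbn")     # chapter(isbn, chapterNum, chapterTitle)
title_key_ok = key_holds(by_title, (0, 1))
isbn_key_ok = key_holds(by_isbn, (0, 1))
print(title_key_ok, isbn_key_ok)
```

The key (bookTitle, chapterNum) is violated because nothing in the XML keys makes a title identify a book, whereas (isbn, chapterNum) holds on every document satisfying the XML keys.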
39 Constraint propagation [ICDE03]
- Input
- a set K of XML keys (context and target paths: a fragment of XPath; key paths: attributes)
- a predefined relational schema S,
- a mapping f from XML to S (XPath, projection, join, union)
- and a relational functional dependency FD over S
- Output: is the FD propagated from K via f? I.e., does FD hold over the DB f(T) for any XML document T that satisfies K?
- Theorem: The constraint propagation problem is in PTIME.
- Checking the consistency of a predefined relational schema for storing XML data
- XML schema/DTD is not required: K is the only semantics
40 Deriving relational schema for storing XML
- One wants to find a good relational schema to store XML data
- chapter(isbn, bookTitle, author, chapterNum, chapterTitle)
- What is a good schema? In normal form: BCNF, 3NF, ...
- Prevent update anomaly (the relational theory)
- Efficient storage, query optimization
- But how to find a normalized design?
41 Constraint propagation and normalization
- From the given XML keys
- (//book, {isbn}), (//book, (chapter, {number})), (//book, (title, {}))
- one can derive functional dependencies
- isbn → bookTitle; isbn, chapterNum → chapterTitle
- Normalize the relation by using these functional dependencies
- chapter(isbn, bookTitle, author, chapterNum, chapterTitle) is decomposed into
- book(isbn, bookTitle),
- chapter(isbn, chapterNum, chapterTitle),
- author(isbn, author)
- The new schema is in BCNF!
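The BCNF claim can be checked mechanically with attribute closures. This is a sketch of the textbook relational test, not of the propagation algorithm itself: it takes the two derived FDs as given and checks, for every attribute subset X of a relation R, that the closure of X restricted to R is either X or all of R. The brute-force enumeration is exponential and only meant for small schemas.

```python
from itertools import combinations

# The two FDs derived on the slide, as (left-hand side, right-hand side).
FDS = [
    ({"isbn"}, {"bookTitle"}),
    ({"isbn", "chapterNum"}, {"chapterTitle"}),
]

def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(relation, fds):
    """R is in BCNF iff every X within R determines nothing new in R
    or determines all of R."""
    relation = set(relation)
    for size in range(1, len(relation) + 1):
        for subset in combinations(sorted(relation), size):
            x = set(subset)
            x_plus = closure(x, fds) & relation
            if x_plus != x and x_plus != relation:
                return False
    return True

universal = {"isbn", "bookTitle", "author", "chapterNum", "chapterTitle"}
decomposition = [
    {"isbn", "bookTitle"},                   # book
    {"isbn", "chapterNum", "chapterTitle"},  # chapter
    {"isbn", "author"},                      # author
]
universal_ok = is_bcnf(universal, FDS)
decomposition_ok = all(is_bcnf(r, FDS) for r in decomposition)
print(universal_ok, decomposition_ok)
```

The universal relation fails the test because isbn determines bookTitle without being a key, while each of the three decomposed relations passes.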
42 Computing minimum cover of propagated FDs
- Input: a set K of XML keys, and a mapping f from XML to a universal schema U
- Output: a minimum cover F of all the functional dependencies (FDs) propagated from the XML keys K via f
- F is a cover (a set of FDs): any FD propagated from K via f is implied by F
- F is minimum: F contains no redundant FDs, i.e., no FD in F is entailed by the other FDs in F.
- Theorem: There is a PTIME algorithm for computing a minimum cover of propagated FDs.
- Normalize relational schema for storing/querying XML data!
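To make the notion of a minimum cover concrete, the sketch below applies the standard relational minimal-cover procedure to a set of FDs that is assumed to have been propagated already. It is not the PTIME propagation algorithm referred to above, which works from the XML keys and the mapping; the input FDs and function names are illustrative.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def minimum_cover(fds):
    # Step 1: split right-hand sides into single attributes.
    cover = [(frozenset(lhs), frozenset([a]))
             for lhs, rhs in fds for a in sorted(rhs)]
    cover = list(dict.fromkeys(cover))
    # Step 2: drop extraneous attributes from left-hand sides.
    reduced = []
    for lhs, rhs in cover:
        lhs = set(lhs)
        for a in sorted(lhs):
            if len(lhs) > 1 and rhs <= closure(lhs - {a}, cover):
                lhs.remove(a)
        reduced.append((frozenset(lhs), rhs))
    cover = list(dict.fromkeys(reduced))
    # Step 3: drop FDs entailed by the remaining ones.
    for fd in list(cover):
        rest = [g for g in cover if g != fd]
        if fd[1] <= closure(fd[0], rest):
            cover = rest
    return cover

# Hypothetical input: the second FD redundantly repeats bookTitle.
PROPAGATED = [
    ({"isbn"}, {"bookTitle"}),
    ({"isbn", "chapterNum"}, {"chapterTitle", "bookTitle"}),
]
cover = minimum_cover(PROPAGATED)
for lhs, rhs in cover:
    print(sorted(lhs), "->", sorted(rhs))
```

On this input the procedure keeps isbn → bookTitle and isbn, chapterNum → chapterTitle, and discards the redundant dependency of bookTitle on (isbn, chapterNum).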
43 Research issues
- For general constraints/mapping languages: undecidable
- if the mapping language is relationally complete (selection, projection, join, union, difference), even for XML keys alone
- if both XML keys and foreign keys are considered, even for the identity transformation
- Open
- To identify (a) practical mapping languages and (b) practical XML constraints that allow efficient constraint propagation
- Constraint propagation from relations to XML
- Information preserving (lossless) data exchange
- Query/update rewriting/optimization
- Overcoming incompleteness of source data (foreign keys)
44 XML publishing/integration
- All members of a community (or industry) agree on a schema and exchange data w.r.t. the schema: e-commerce, health-care, ...
- Schema-directed XML publishing/integration
- mapping data from traditional databases to XML
- satisfying the predefined DTD and constraints
[Figure: databases DB1 and DB2, a query Q defining an XML view, and XML exchanged over the Web]
45 Schema-directed integration [SIGMOD03]
[Figure: multiple, distributed DB sources are integrated into XML governed by a DTD and constraints]
- Schema-directed: XML view conforming to a schema (D, Σ)
- D: a DTD
- Σ: a set of XML constraints (relative keys, foreign keys)
- Attribute Integration Grammar (AIG)
- DTD-directed view definition: recursive, nondeterministic
- Inherited and synthesized attributes
- Constraint compilation: automatically captures integrity constraints and DTD in a uniform framework
46 XML normal forms
- Extensions of (nested) relational normal forms, via XML FDs
- M. Arenas and L. Libkin. A Normal Form for XML Documents. PODS 02. XNFs, decomposition algorithms, complexity
- M. Vincent, J. Liu and C. Liu. Strong functional dependencies and their application to normal forms in XML. TODS 29(3), 2004
- X. Wu, T.W. Ling, S. Lee, M. Lee, G. Dobbie. NF-SS: A Normal Form for Semistructured Schema. ER (Workshops) 2001
- Research issues
- Implication analysis: more intriguing than relational FDs
- Relative functional dependencies: hierarchical nature of XML
- Right normal form: XML data is typically stored in RDBMS
- redundancy often helps, e.g., performance and reliability
- XML data is often static: update anomalies?
47 Run-time analysis: incremental constraint checking
- Input: XML tree T, constraints Σ, update ΔT, where T satisfies Σ
- Question: does T, updated with ΔT, satisfy Σ?
- ΔX. Code generator: incremental checking. Lucent applications
- M. Benedikt, G. Brun, J. Gibson, R. Kuss and A. Ng. Automated update management for XML integrity constraints. PLANX02
- Application of incremental techniques for attribute grammar
- M. Abrao, B. Bouchou, M. Alves, D. Laurent, M. Musicante. Incremental Constraint Checking for XML Documents. XSym04
- Research issues
- Complexity of incremental constraint checking
- XML editors: broken link detection and repair
- Incremental checking techniques for XML data stored in RDBMS
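The idea behind incremental checking can be illustrated for a single absolute key: keep an index of the key values already present in T, and validate each update against the index instead of re-scanning the tree. This is a minimal sketch under that assumption only; it is not the technique of either cited system, elements are modelled as plain attribute dictionaries, and the class name is hypothetical.

```python
class IncrementalKeyChecker:
    """Index of key values for one element type, so that an insertion
    or deletion is checked without revisiting the whole tree."""

    def __init__(self, attrs, elements=()):
        self.attrs = tuple(attrs)
        self.values = set()
        for elem in elements:
            if not self.insert(elem):
                raise ValueError("initial tree violates the key")

    def _value(self, elem):
        # elem is a dict of attribute name -> string value
        return tuple(elem[a] for a in self.attrs)

    def insert(self, elem):
        """Return True and record the element if the key still holds."""
        value = self._value(elem)
        if value in self.values:
            return False
        self.values.add(value)
        return True

    def delete(self, elem):
        """Deleting an element can never violate a key."""
        self.values.discard(self._value(elem))

# Hypothetical usage for the key province.@name -> province.
checker = IncrementalKeyChecker(["name"], [{"name": "Limburg"}])
rejected = checker.insert({"name": "Limburg"})   # duplicate key value
accepted = checker.insert({"name": "Antwerp"})   # new key value
checker.delete({"name": "Limburg"})
reinserted = checker.insert({"name": "Limburg"})
print(rejected, accepted, reinserted)
```

Each update costs one hash lookup rather than a pass over T; foreign keys would additionally need reference counts on the target values, which this sketch leaves out.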
48 Query rewriting and optimization
- Query translation from XQuery to SQL: XML data stored in RDBMS
- encode XIGs and XQuery in relational queries and constraints
- extensions of chase and backchase
- A. Deutsch and V. Tannen
- Reformulation of XML Queries and Constraints. ICDT03
- MARS: A System for Publishing XML from Mixed and Redundant Storage. VLDB03
- R. Krishnamurthy, R. Kaushik, J. Naughton. Efficient XML-to-SQL Query Translation: Where to Add the Intelligence? VLDB 2004
- Research issues
- Rewriting queries over (recursive security) views of XML data
- Query optimization for (compressed) XML data in native store
49 Data cleaning
- Input: XML tree T, constraints Σ, DTD D
- Question: if T does not satisfy D and Σ, find a repair T' such that (a) T' satisfies D and Σ, and (b) the distance between T and T' is minimal (update operations: insert, delete, modify)
- G. Flesca, F. Furfaro, S. Greco, E. Zumpano. Repairs and Consistent Answers for XML Data with Functional Dependencies. XSym03
- Research issues
- Effective techniques for repairing integrated XML data: conflicts and inconsistencies may emerge as violations of constraints.
- Various constraint languages, XML schema
- Automated tools for repairing Web pages: broken links
50 Summary
- Specification of XML constraints
- absolute vs. relative, path constraints: XML data is hierarchical and semi-structured
- mild extensions of relational constraints are not sufficient
- Consistency and implication analysis of XML constraints
- DTDs interact with XML constraints
- far more intricate than their relational counterparts
- Applications of XML constraints
- XML storage, query, update, integration, cleaning
- many practical issues remain to be explored