Title: On Verifying Consistency of XML Specifications
- Wenfei Fan
- Internet Management Research Dept., Bell Labs
- Dept. of CIS, Temple University
Overview
- XML Specifications
- types: DTDs (Document Type Definitions)
- integrity constraints: keys and foreign keys
- Interaction between DTDs and constraints
- Consistency analysis of XML specifications
- absolute constraints
- relative constraints
- regular constraints
- Implication analysis of XML constraints
- Joint work with L. Libkin and M. Arenas, Univ. of Toronto (PODS'01, PODS'02, JACM)
1. XML specifications: introduction
XML data: an example
- Rooted, node-labeled tree
- elements: db, province, capital, city, ...
- subtrees/sub-documents: subelements, e.g., the capital child of province
- @attributes: @name, @inProvince, carrying text
- text nodes, e.g., Hasselt
XML specifications with DTDs
- Production: constrains the subelement list of each element
- <!ELEMENT db (province+, capital+)>
- <!ELEMENT province (city, capital)>
- Attributes: uniquely identified by name for each element, unordered
- province: @name; capital: @inProvince
Document Type Definition: a formalism
- DTD: D = (E, A, P, R, r)
- E: a set of element types, e.g., db, province, capital, city
- A: a set of attributes, e.g., @name, @inProvince
- P: element type definitions in terms of regular expressions, e.g., db → province+, capital+
- R: attribute definitions, e.g., province.@name, capital.@inProvince
- r: the element type of the root, e.g., db
- ECFG: nonterminals (E, A), productions (P, R), start symbol (r)
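The five components of D can be mirrored in a small data structure. The following is a minimal sketch, not part of the original slides: the class name `DTD`, the method `conforms`, and the encoding of P as Python regular expressions over space-terminated child tags are all assumptions made for illustration. The `+` on `province` and `capital` is likewise an assumption, read off from the cardinality argument later in the talk. Text nodes are ignored, and attribute names are written without the leading `@`.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class DTD:
    elements: frozenset    # E: element types
    attributes: frozenset  # A: attribute names (without the leading @)
    productions: dict      # P: element type -> regex over child tags, each tag followed by a space
    attr_defs: dict        # R: element type -> set of attribute names
    root: str              # r: element type of the root

    def conforms(self, node, is_root=True):
        """Return True if the subtree at `node` matches P and R (text content is ignored)."""
        if is_root and node.tag != self.root:
            return False
        if node.tag not in self.elements:
            return False
        # Encode the child list as a word, e.g. "province capital ", and match it against P.
        children = "".join(child.tag + " " for child in node)
        if re.fullmatch(self.productions.get(node.tag, ""), children) is None:
            return False
        # Exactly the attributes declared in R must be present.
        if set(node.attrib) != self.attr_defs.get(node.tag, set()):
            return False
        return all(self.conforms(child, is_root=False) for child in node)


# Hypothetical instance for the db/province/capital/city example.
D = DTD(
    elements=frozenset({"db", "province", "capital", "city"}),
    attributes=frozenset({"name", "inProvince"}),
    productions={"db": "(province )+(capital )+", "province": "city capital "},
    attr_defs={"province": {"name"}, "capital": {"inProvince"}},
    root="db",
)

good = ET.fromstring(
    '<db><province name="Limburg"><city/><capital inProvince="Limburg"/></province>'
    '<capital inProvince="Limburg"/></db>'
)
bad = ET.fromstring('<db><province name="Limburg"><city/></province></db>')

print(D.conforms(good))  # True
print(D.conforms(bad))   # False
```

Encoding each child list as a word over element types keeps the check close to the ECFG reading of a DTD: a node conforms when the word formed by its children belongs to the regular language of its production.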
XML specifications with constraints
- Keys and foreign keys (vs. relational constraints)
- key: the value of a @name uniquely identifies a province
- province.@name → province
- capital.@inProvince → capital
- FK: @inProvince of a capital references @name of a province
- capital.@inProvince ⊆ province.@name
Why keys and foreign keys?
- Supported by the XML standard, XML Schema, XML Data
- In databases (supported by the SQL standard):
- essential part of the semantics of data,
- fundamental to conceptual design,
- useful for choosing efficient storage and access methods,
- central to update anomaly prevention, ...
- In the XML setting, they have proved useful in
- database storage of XML data (query and update),
- database publishing in XML,
- data integration,
- XML query optimization and formulation,
- design theory for XML specifications, ...
XML specification
- A DTD D
- A set Σ of keys and foreign keys
- Example
- DTD D: structure of the document
- <!ELEMENT db (province+, capital+)>
- <!ELEMENT province (city, capital)>
- province.@name, capital.@inProvince
- Constraints Σ: fundamental semantics of the data
- province.@name → province
- capital.@inProvince → capital
- capital.@inProvince ⊆ province.@name
2. Interaction between DTDs and constraints
Consistency of XML specifications
- Given D: a DTD
- Σ: a set of keys and foreign keys over D
- Consistency: Is there an XML document that both conforms to D and satisfies Σ?
- One wants to know whether XML specifications make sense!
- Run-time check: attempt to validate documents against (D, Σ).
- This would not tell us whether repeated failures are due to a bad specification or to problems with the documents
- ⇒ static analysis is a better approach
An inconsistent specification
- The specification with D and Σ is inconsistent!
- DTD D
- <!ELEMENT db (province+, capital+)>
- <!ELEMENT province (city, capital)>
- province.@name, capital.@inProvince
- Constraints Σ
- province.@name → province
- capital.@inProvince → capital
- capital.@inProvince ⊆ province.@name
- In contrast, one can specify keys and foreign keys in SQL without worrying about their consistency with the schema.
Cardinality constraints by keys, foreign keys
- Constraints Σ
- province.@name → province
- capital.@inProvince → capital
- capital.@inProvince ⊆ province.@name
- Notation
- ext(τ): the set of τ elements in an XML document
- ext(τ.l): the set of l attribute values of all τ elements
- ⇒ |ext(province.@name)| = |ext(province)|
- |ext(capital.@inProvince)| = |ext(capital)|
- |ext(capital.@inProvince)| ≤ |ext(province.@name)|
- ⇒ |ext(capital)| ≤ |ext(province)|
Cardinality constraints imposed by DTDs
- DTD D: <!ELEMENT db (province+, capital+)>
- <!ELEMENT province (city, capital)>
- Variables
- X_province: the number of province elements under the root
- X_capital: the number of capital subelements of the root
- Y_capital: the number of capital subelements of provinces
- ⇒ X_province ≥ 1, X_capital ≥ 1
- |ext(province)| = X_province, X_province = Y_capital
- |ext(capital)| = X_capital + Y_capital
- ⇒ |ext(capital)| > |ext(province)|
The interaction
- Contradiction
- From the constraints Σ: |ext(capital)| ≤ |ext(province)|
- From the DTD D: |ext(capital)| > |ext(province)|
- Thus there exists NO XML document that both conforms to D and satisfies Σ.
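The contradiction can be replayed numerically. The sketch below is an illustration only: the function name `consistent_counts` and the parameter `min_root_capitals` are hypothetical. It enumerates small values of the three counting variables (province elements under the root, capital elements under the root, capital elements inside provinces) and keeps those that satisfy both the DTD-induced and the constraint-induced cardinality conditions. A bounded search is not a proof; the proof is the inequality argument above.

```python
from itertools import product


def consistent_counts(bound, min_root_capitals=1):
    """List (x_province, x_capital, y_capital) up to `bound` satisfying both constraint sets."""
    solutions = []
    for x_province, x_capital, y_capital in product(range(bound + 1), repeat=3):
        # Cardinality constraints imposed by the DTD.
        dtd_ok = (
            x_province >= 1
            and x_capital >= min_root_capitals
            and y_capital == x_province  # exactly one capital inside each province
        )
        n_province = x_province
        n_capital = x_capital + y_capital
        # Cardinality constraint imposed by the keys and the foreign key.
        sigma_ok = n_capital <= n_province
        if dtd_ok and sigma_ok:
            solutions.append((x_province, x_capital, y_capital))
    return solutions


print(consistent_counts(10))                      # []
print(consistent_counts(2, min_root_capitals=0))  # [(1, 0, 1), (2, 0, 2)]
```

The second call relaxes the DTD so that the root need not have a capital child (a hypothetical variant, not from the slides); solutions then exist, which shows that the inconsistency comes from the interaction of the two constraint sets rather than from either one alone.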
3. Consistency analysis of XML specifications
The consistency problem
- Given D: a DTD
- Σ: a set of keys and foreign keys expressed in C
- Consistency(C): Is there an XML document that both conforms to D and satisfies Σ?
- C: a constraint language, ranging over
- absolute constraints
- relative constraints
- regular constraints
- These constraint languages are important for
hierarchically structured data, including but not
limited to XML.
Absolute keys and foreign keys
- key: τ[X] → τ. A document satisfies the key iff
- ∀ x, y ∈ ext(τ) ( ⋀_{l ∈ X} (x.l = y.l) → x = y )
- foreign key (FK): a combination of an inclusion constraint τ1[X] ⊆ τ2[Y] and a key τ2[Y] → τ2.
- A document satisfies the FK iff it satisfies the key and
- ∀ x ∈ ext(τ1) ∃ y ∈ ext(τ2) ( x[X] = y[Y] )
- where τ, τ1, τ2: element types; X, Y: sets (lists) of attributes
- ext(τ): the set of τ elements in an XML document.
- Equality issue
- value equality when comparing attributes
- node identity when comparing XML elements
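The two satisfaction conditions translate directly into checks over a parsed document. This is a minimal sketch under stated assumptions: the helper names `ext`, `satisfies_key`, and `satisfies_fk` are chosen here for illustration, attributes are compared by string value, elements are distinguished by node identity, and a missing attribute is read as `None` (a DTD-conforming document would always carry the declared attributes).

```python
import xml.etree.ElementTree as ET


def ext(root, tau):
    """ext(tau): all elements of type tau anywhere in the document."""
    return list(root.iter(tau))


def satisfies_key(root, tau, X):
    """tau[X] -> tau: no two distinct tau nodes agree on every attribute in X."""
    seen = set()
    for node in ext(root, tau):
        value = tuple(node.get(l) for l in X)
        if value in seen:
            return False
        seen.add(value)
    return True


def satisfies_fk(root, tau1, X, tau2, Y):
    """tau1[X] included in tau2[Y], together with the key tau2[Y] -> tau2."""
    if not satisfies_key(root, tau2, Y):
        return False
    targets = {tuple(n.get(l) for l in Y) for n in ext(root, tau2)}
    return all(tuple(n.get(l) for l in X) in targets for n in ext(root, tau1))


# Hypothetical document with one province and two capitals (one nested, one under the root).
doc = ET.fromstring(
    '<db><province name="Limburg"><city/><capital inProvince="Limburg"/></province>'
    '<capital inProvince="Limburg"/></db>'
)

print(satisfies_key(doc, "province", ["name"]))                           # True
print(satisfies_key(doc, "capital", ["inProvince"]))                      # False
print(satisfies_fk(doc, "capital", ["inProvince"], "province", ["name"]))  # True
```

In this sample the foreign key holds but the key on capital fails, because both capital elements carry the same @inProvince value.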
More on absolute constraints
- Absolute constraints are to hold on the entire document.
- Unary constraints: keys and foreign keys defined in terms of a single attribute.
- Example of unary constraints
- province.@name → province
- capital.@inProvince → capital
- capital.@inProvince ⊆ province.@name
Consistency analysis
- Trivial for relational databases: given any schema and keys, foreign keys, one can always find a nonempty instance of the schema satisfying the constraints.
- Hard for XML: XML specifications may not be consistent!
- Both DTDs and constraints impose cardinality constraints
- The interaction between these two classes of cardinality constraints is rather complicated.
Consistency analysis of absolute constraints
- Theorem: The consistency problem is undecidable for multi-attribute keys and foreign keys.
- Theorem: It becomes NP-complete for unary constraints.
- Primary key restriction: at most one key for each element type.
- Theorem: It remains intractable for unary constraints under the primary key restriction.
- Theorem: For primary multi-attribute keys and unary foreign keys, the consistency problem is decidable in NEXPTIME.
- As opposed to the trivial analysis of the relational counterpart.
Proof ideas
- Multi-attribute constraints: reduction from the implication problem for functional and inclusion dependencies in RDBs.
- Unary keys and foreign keys:
- a nontrivial encoding of DTDs and unary constraints in terms of linear integer constraints (O(n² log n)-time)
- polynomially equivalent to LIP, linear integer programming
- Multi-attribute primary keys and unary foreign keys:
- polynomially equivalent to the Prequadratic Diophantine Problem (PDE): satisfiability of linear integer constraints and prequadratic constraints of the form x ≤ y · z
- the precise complexity of PDE, a restriction of Hilbert's 10th problem, is open (nontrivial).
Introduction to relative constraints
- An XML tree specifies countries, provinces, province capitals.
- What is a key for a province?
- What does @inProvince of a capital reference?
[Figure: XML tree rooted at db with two country subtrees (@name Belgium and Holland); each country contains province elements with @name and capital elements with @name and @inProvince. Attribute values shown include Limburg, Hasselt, and Maastricht.]
Examples of relative constraints
- Relative: constraints on a subdocument rooted at a country
- key: country (province.@name → province)
- country (capital.@inProvince → capital)
- FK: country (capital.@inProvince ⊆ province.@name)
- Absolute: on the entire document, country.@name → country
[Figure: XML tree rooted at db with two country subtrees (@name Belgium and Holland); each country contains province elements with @name and capital elements with @name and @inProvince. Attribute values shown include Limburg, Hasselt, and Maastricht.]
Relative keys and foreign keys
- key: τ(τ1[X] → τ1). A document satisfies the key iff
- ∀ c ∈ ext(τ) ∀ y, z ∈ ext(τ1) ( (y ≺ c) ∧ (z ≺ c) ∧ ⋀_{l ∈ X} (y.l = z.l) → y = z )
- foreign key (FK): τ(τ1[X] ⊆ τ2[Y]) and a key τ(τ2[Y] → τ2).
- A document satisfies the FK iff it satisfies the key and
- ∀ c ∈ ext(τ) ∀ y ∈ ext(τ1) ( (y ≺ c) → ∃ z ∈ ext(τ2) ( (z ≺ c) ∧ y[X] = z[Y] ) )
- where
- (y ≺ c): y is a descendant of c (y is in the subtree rooted at c)
- τ: context type
- ext(τ): the set of τ elements in an XML document.
Relative vs. Absolute
- Absolute constraints are a special case of relative ones
- country.@name → country ⇔
- db (country.@name → country)
- absolute: a fixed context type r
- Absolute constraints are scoped within the entire document, whereas relative ones are scoped within the context of a subdocument.
- country (province.@name → province)
- country (capital.@inProvince → capital)
- country (capital.@inProvince ⊆ province.@name)
- country.@name → country
- Together they specify constraints on the entire document
- Important for hierarchically structured data: XML, scientific databases, biomedical data, ...
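The difference in scope can be seen on a small document. The sketch below is illustrative only: `satisfies_relative_key` is a hypothetical helper, and the sample document (capital nested inside province) is an assumed layout rather than the DTD used in the talk. An absolute key is checked by passing the root type db as the context, following the special-case observation above.

```python
import xml.etree.ElementTree as ET


def satisfies_relative_key(root, context, tau1, X):
    """context(tau1[X] -> tau1): inside each context node, X identifies its tau1 descendants."""
    for c in root.iter(context):
        seen = set()
        for y in c.iter(tau1):
            if y is c:
                continue  # only proper descendants of the context node count
            value = tuple(y.get(l) for l in X)
            if value in seen:
                return False
            seen.add(value)
    return True


# Two countries, each with a province named Limburg (assumed layout).
doc = ET.fromstring(
    '<db>'
    '<country name="Belgium"><province name="Limburg">'
    '<capital name="Hasselt" inProvince="Limburg"/></province></country>'
    '<country name="Holland"><province name="Limburg">'
    '<capital name="Maastricht" inProvince="Limburg"/></province></country>'
    '</db>'
)

print(satisfies_relative_key(doc, "country", "province", ["name"]))  # True
print(satisfies_relative_key(doc, "db", "province", ["name"]))       # False
print(satisfies_relative_key(doc, "db", "country", ["name"]))        # True
```

The province name is a key only relative to a country: across the whole document the two Limburg provinces collide, while country names remain unique at the document level.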
Consistency analysis of relative constraints
- Theorem: The consistency problem is undecidable for relative keys and foreign keys, even when all the constraints are unary and are under the primary key restriction.
- As opposed to the NP complexity of its absolute counterpart.
- Proof idea: reduction from Hilbert's 10th problem.
- Diophantine equation problem:
- P_1(x_1, ..., x_k) = Q_1(x_1, ..., x_k) + c_1
- . . .
- P_n(x_1, ..., x_k) = Q_n(x_1, ..., x_k) + c_n
Introduction to regular constraints
- XML data is hierarchically structured
- define @eid as a key of employees of companies and schools
- define @taughtBy as a foreign key of students referencing @eid of school employees.
Examples of regular constraints
- Key: (university._* ∪ company._*).employee.@eid → (university._* ∪ company._*).employee
- FK: _*.student.@taughtBy ⊆ university._*.employee.@eid
- _: wildcard that matches any label
- _*: the Kleene closure of _
Regular path expression
- Vertical regular expressions
- β ::= ε | τ | _ | β.β | β ∪ β | β*
- ε: empty word; τ: element type; _: wildcard
- ., ∪, *: concatenation, disjunction, Kleene star
- Example: (university._* ∪ company._*).employee
- university._*.employee
- nodes(β.τ): the set of τ elements in an XML document that are reachable from the root by following β
Regular expression constraints
- key: β.τ[X] → β.τ. A document satisfies the key iff
- ∀ x, y ∈ nodes(β.τ) ( ⋀_{l ∈ X} (x.l = y.l) → x = y )
- foreign key: β1.τ1[X] ⊆ β2.τ2[Y], and a key β2.τ2[Y] → β2.τ2
- A document satisfies the FK iff it satisfies the key and
- ∀ x ∈ nodes(β1.τ1) ∃ y ∈ nodes(β2.τ2) ( x[X] = y[Y] )
- where nodes(β.τ): the set of τ elements reachable from the root by following β.
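One way to evaluate nodes(β.τ) on a concrete document is to translate the path expression into an ordinary regular expression over root-to-node label paths. The sketch below is an assumption-laden illustration, not the encoding used in the complexity proofs: `to_regex` and `nodes` are hypothetical helpers, the ASCII character `|` stands in for ∪, paths are taken to start at the children of the root, and each label on a path is followed by `/`.

```python
import re
import xml.etree.ElementTree as ET

# One token: an element type, or one of the operators _ . | * ( )
TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9-]*|[_.|*()])")


def to_regex(beta):
    """Translate a path expression into a Python regex over paths such as 'university/dept/'."""
    pieces = {"_": "(?:[^/]+/)", ".": "", "|": "|", "*": "*", "(": "(?:", ")": ")"}
    out, pos, beta = [], 0, beta.strip()
    while pos < len(beta):
        m = TOKEN.match(beta, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tok = m.group(1)
        # Element types become one path step; operators map to their regex counterparts.
        out.append(pieces.get(tok, "(?:" + re.escape(tok) + "/)"))
        pos = m.end()
    return "".join(out)


def nodes(root, path_expr):
    """Elements whose label path from the root's children matches path_expr."""
    pattern = re.compile(to_regex(path_expr))
    result = []

    def walk(node, prefix):
        for child in node:
            path = prefix + child.tag + "/"
            if pattern.fullmatch(path):
                result.append(child)
            walk(child, path)

    walk(root, "")
    return result


# Hypothetical document: employees under a university, a company, and a club.
doc = ET.fromstring(
    '<db>'
    '<university><dept><employee eid="e1"/></dept><student taughtBy="e1"/></university>'
    '<company><employee eid="e1"/></company>'
    '<club><employee eid="e9"/></club>'
    '</db>'
)

eids = [n.get("eid") for n in nodes(doc, "(university._* | company._*).employee")]
print(eids)                            # ['e1', 'e1']
print(len(set(eids)) == len(eids))     # False
print(len(nodes(doc, "_*.employee")))  # 3
```

Here the regular key on @eid fails because a university employee and a company employee share the value e1, while the club employee is outside the scope of the path expression. The expression `_*.employee` selects every employee element in the document.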
Regular: an extension of absolute constraints
- Example
- Key: (university._* ∪ company._*).employee.@eid → (university._* ∪ company._*).employee
- FK: _*.student.@taughtBy ⊆ university._*.employee.@eid
- Observation: nodes(_*.τ) = ext(τ)
- Recall absolute constraints
- key: τ[X] → τ ⇔ _*.τ[X] → _*.τ
- foreign key: τ1[X] ⊆ τ2[Y], τ2[Y] → τ2 ⇔ _*.τ1[X] ⊆ _*.τ2[Y], _*.τ2[Y] → _*.τ2
Consistency analysis of regular constraints
- Corollary: The consistency problem is undecidable for multi-attribute regular keys and foreign keys.
- Theorem: It is decidable in NEXPTIME and is PSPACE-hard for unary regular constraints.
- NEXPTIME: an involved encoding in terms of LIP
- regular expressions in a DTD interact with (vertical) regular path expressions: reduce the DTD to a simple normal form
- regular path expressions interact with each other: introduce exponentially many variables for all Boolean combinations
- encoding reachability (nodes(β.τ)) of a path expression: tag variables with states of finite state automata
Some tractable cases
- Restrictions on constraints: unary, primary.
- Theorem: For multi-attribute relative keys only, the consistency problem is in linear time for arbitrary DTDs.
- Recall relative keys: country (province.@name → province)
- Restrictions on DTDs
- Theorem: When the DTD is fixed, the consistency problem is in PTIME for absolute unary constraints.
- In practice, a DTD is designed at one time, but constraints are written in stages: constraints are incrementally added.
Other restricted cases
- Theorem: In the absence of recursion and Kleene star in the DTD involved, the consistency problem remains
- undecidable for multi-attribute absolute constraints,
- intractable for unary absolute constraints,
- PSPACE-hard for unary regular constraints.
- Recall: absolute ⊆ regular; unary: single-attribute
- Other severely restricted cases
- nonrecursive DTD of a bounded depth, a set of absolute unary constraints of a bounded size: NLOGSPACE
- nonrecursive DTD, unary relative constraints that can be partitioned into sets local to each other with respect to the DTD (without interaction): PSPACE-complete
4. Implication analysis of XML constraints
Implication of XML constraints
- Given D: a DTD
- Σ: a set of keys and foreign keys expressed in C
- φ: a property (a key or foreign key of C)
- Implication(C): Is it the case that for any XML document, if it conforms to D and satisfies Σ, then it must satisfy φ?
- C: a constraint language
- The need for studying implication
- data integration: constraint checking at virtual views
- optimization of XML queries and XML relational storage
- design theory for XML specifications: normalization
Some complexity results for implication
- Proposition: For any class of XML constraints, if its consistency problem is K-hard, then its implication problem is coK-hard, where K is some complexity class that contains DLOGSPACE.
- Corollary: The implication problem is
- undecidable for multi-attribute absolute constraints and for unary relative constraints
- PSPACE-hard for unary regular constraints
- coNP-hard for unary absolute constraints.
- Recall relative: country (province.@name → province)
- regular: _*.student.@taughtBy ⊆ _*.professor.@id
- absolute: country.@name → country
Upper bounds
- Theorem: The implication problem is coNP-complete for unary absolute constraints.
- Proof idea: a nontrivial encoding in terms of LIP and the Set Intersection Pattern Problem
- Theorem: The implication problem is decidable in
- linear time for absolute multi-attribute keys, and
- in PTIME for arbitrary absolute constraints when the DTD is fixed.
Summary
- XML specification: DTD and constraints (keys, foreign keys)
- Consistency and implication analysis of XML constraints
- DTDs interact with XML constraints
- The analysis is far more intricate than its relational counterpart
- The negative results carry over to XML Schema, XML Data and other more expressive specifications