Title: Specifying XML Data with Integrity Constraints
1 Specifying XML Data with Integrity Constraints
- Wenfei Fan
- Bell Labs and Temple University
2 Outline
- XML, Web data and database techniques
- XML specifications: types and constraints
- XML constraints: absolute/relative keys and foreign keys
- Analysis of XML constraints: consistency and implication
- Research issues
- Area references
  - "A Web odyssey: From Codd to XML", V. Vianu, PODS 2001. http://www.cis.upenn.edu/wfan/PODS2001/proceedings.html
  - "Constraints for semistructured data and XML", P. Buneman, W. Fan, J. Simeon and S. Weinstein, SIGMOD Record 30(1), March 2001. http://www.cis.temple.edu/fan/papers/xml/survey.ps.gz
3 Part 1. XML: a brief introduction
4 What is wrong with HTML?
- HTML (HyperText Markup Language) is good for presentation, but does not help information extraction by programs.
    <h3> George Bush </h3>
    <b> Taking Eng 055 </b> <br>
    <em> GPA 1.5 </em> <br>
    <h3> Eng 055 </h3>
    <b> Spelling </b>
- HTML tags:
  - predefined and fixed
  - describing display format rather than the structure of data
5 eXtensible Markup Language
- XML tags:
  - user-defined, arbitrarily nested
  - describing the structure of the data rather than display
    <student id="123">
      <name>
        <firstName> George </firstName> <lastName> Bush </lastName>
      </name>
      <taking> Eng 055 </taking>
      <GPA> 1.5 </GPA>
    </student>
    <course cno="Eng 055">
      <title> Spelling </title>
    </course>
6 XML basics
- Element: the segment between a start tag and a corresponding end tag, e.g., student, name.
- Subelement: the relation between an element and its component elements, e.g., name to student.
- Attribute: marked text within a start tag, e.g., id.
- Text: the single basic type (PCDATA), e.g., Bush.
- XML elements are ordered, whereas attributes are not.
    <student id="123">
      <name>
        <firstName> George </firstName> <lastName> Bush </lastName>
      </name>
      <taking> Eng 055 </taking> <GPA> 1.5 </GPA>
    </student>
7 The XML tree model
- An XML document is typically modeled as a node-labeled tree.
- Element node: internal, with a name (tag) and children (subelements and attributes), e.g., student, name.
- Attribute node: leaf with a name (tag) and text, e.g., @id.
- Text node: leaf with text (string) but without a name.
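The three node kinds can be made concrete with a small sketch. It parses the student element from the earlier slides with Python's standard xml.etree.ElementTree and lists each node of the tree; the helper name walk, the constant DOC and the row layout are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

DOC = """<student id="123">
  <name><firstName>George</firstName><lastName>Bush</lastName></name>
  <taking>Eng 055</taking>
  <GPA>1.5</GPA>
</student>"""

def walk(elem, depth=0):
    """Return (depth, kind, name, text) rows for the node-labeled tree."""
    rows = [(depth, "element", elem.tag, None)]
    # Attribute nodes: leaves with a name and a text value (unordered).
    for name, value in sorted(elem.attrib.items()):
        rows.append((depth + 1, "attribute", "@" + name, value))
    # Text node: a leaf with a string but no name.
    if elem.text and elem.text.strip():
        rows.append((depth + 1, "text", None, elem.text.strip()))
    # Subelements, visited in document order.
    for child in elem:
        rows.extend(walk(child, depth + 1))
    return rows

rows = walk(ET.fromstring(DOC))
for depth, kind, name, text in rows:
    print("  " * depth + kind, name or "", text or "")
```

Whitespace-only text between tags is skipped here, which is a simplification of this sketch rather than part of the model on the slide.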
8 XML and Web data
- Web data is semistructured: schemaless, irregular
- Traditional database systems can't model Web data
- XML model: a special case of the semistructured data model
  - flexible: model Web data (with references: foreign keys)
  - powerful: represent data from databases
9 XML in data exchange
- XML: the primary standard for data exchange on the Web
  - across formats/platforms/enterprises
  - generated and consumed by applications
  - healthcare industry, e-commerce, digital library
- [Figure: data exchange via XML; labels: Web, XML, OODB, RDB, Unix, MS]
10 XML in data integration
- mediator/wrapper vs. virtual view of a database
- data warehouse vs. materialized view of a database
- Web databases, e-commerce
- [Figure: clients, a mediator (XML), wrappers, and the sources file, Web and DB]
11 XML in e-commerce
- A site for a car dealer provides a uniform query interface for price, rating, review and competitors' price/availability.
- Integrating local data, national archive for safety records, review data, competitors' sites
- e-commerce: query interface (XML), integration system (XML), database system, workflow management
- [Figure: clients, a query interface and warehouse (XML), integrators, and the sources local DB, national records, review and competitor]
12 Database techniques for managing XML data
- specifying XML: types and constraints
- querying XML: XSL, XQL, XML-QL, Lorel, UnQL
- updating XML: constraints and concurrency control
- integrating XML: database transformations and integration
- storing XML: efficient storage and access methods, indexing
- These are crucial for Web applications:
  - e-commerce, digital library, data exchange, Web databases, Web site management, ...
- XML players: W3C, Microsoft, HP, Oracle, Adobe, ...
13 Part 2. XML specification: types and constraints
14 A relational schema (SQL)
- Type and constraints
    create table students
      ( id char(9),
        name char(20),
        primary key (id) )
    create table courses
      ( cno char(9),
        title char(20),
        primary key (cno) )
    create table enroll
      ( id char(9),
        cno char(9),
        primary key (id, cno),
        foreign key (id) references students,
        foreign key (cno) references courses )
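To see what the keys and foreign keys in this schema rule out, here is a minimal sketch that loads it into an in-memory SQLite database through Python's sqlite3 module. The sample rows, the helper rejected and the explicit column names after references are assumptions added for the example; SQLite only enforces foreign keys once the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
create table students (id char(9), name char(20), primary key (id));
create table courses  (cno char(9), title char(20), primary key (cno));
create table enroll   (id char(9), cno char(9),
                       primary key (id, cno),
                       foreign key (id)  references students(id),
                       foreign key (cno) references courses(cno));
""")

# Sample data (hypothetical rows, modeled on the slides' running example).
conn.execute("insert into students values ('123', 'George Bush')")
conn.execute("insert into courses values ('Eng 055', 'Spelling')")
conn.execute("insert into enroll values ('123', 'Eng 055')")

def rejected(sql):
    """Return True if the statement violates a key or a foreign key."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

dup_key = rejected("insert into students values ('123', 'Someone Else')")
dangling = rejected("insert into enroll values ('123', 'Math 101')")
print(dup_key, dangling)
```

Both statements are rejected: the first repeats a primary key value, the second refers to a course that does not exist.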
15 An object-oriented schema (ODMG)
- Types and constraints
    class student (key id, extent students)
      attribute string id
      attribute string name
      relationship set<course> taking inverse course::takenBy
    class course (key cno, extent courses)
      attribute string cno
      attribute string title
      relationship set<student> takenBy inverse student::taking
- The distinction between types and constraints is dictated by what programming languages treat as types
16 XML specification: types
- DTD (Document Type Definition)
    <!ELEMENT db (student, course)>
    <!ELEMENT student (name, taking)>
    <!ELEMENT course (title)>
    <!ELEMENT taking (title)>
  - attributes:
    - student: @id
    - course: @cno
    - taking: @cno
- XML Schema
- Xduce, XML Algebra, XML Data, ...
17 XML specification: constraints
- Keys: locating a specific object, an invariant connection from an object in the real world to its representation
    student.@id → student
    course.@cno → course
- Foreign keys: referencing an object from another object
    taking.@cno ⊆ course.@cno, course.@cno → course
- Key specifications:
  - the XML standard (DTD), XML Schema, XML Data, ...
18 Constraints are important for XML
- XML is semistructured and may not come with a DTD/type
- constraints are a fundamental part of the semantics
- constraints have proved useful in
  - semantic specifications: obvious
  - query optimization: chasing algorithm
  - database conversion to an XML encoding: a must
  - data integration: information preservation
  - update anomaly prevention: classical
  - normal forms for XML specifications: BCNF, 3NF
  - efficient storage/access: indexing
  - ...
19 Part 3. XML constraints: keys and foreign keys
20 The limitations of the XML standard
- ID and IDREF attributes in DTD
    <!ATTLIST student id ID #REQUIRED>
    <!ATTLIST course cno ID #REQUIRED>
    <!ATTLIST taking cno IDREF #IMPLIED>
- Scoping:
  - ID: unique within the entire document (like oids)
  - IDREF: untyped, one has no control over what it points to
- unary and primary
- defined in a type
- A mixture of relational keys and object identities (oids)
21 The limitations of XML Schema
- Keys: defined with a list of XPath expressions
  - (student, firstName, lastName)
  - (student, lastName, firstName)
  - (student, lastName, lastName, firstName)
- Equivalence/containment of XPath expressions is unresolved
  - No efficient way to tell whether two keys are equivalent
- The notion of value equality is too restricted (text only)
- The notion of relative keys is not addressed
- Mild generalizations of relational keys fail to capture some fundamental semantics associated with the hierarchical structure of XML data
22 To overcome the limitations [WWW10]
- Absolute key: (Q, P1, ..., Pk)
  - target path Q: identifies a target set [[Q]] of nodes on which the key is defined (vs. relation)
  - a set of key paths P1, ..., Pk: provides an identification for nodes in [[Q]] (vs. key attributes)
  - semantics: for any two nodes in [[Q]], if they have all the key paths and agree on them up to value equality, then they must be the same node (value equality and node identity)
- Examples:
    (_*.student, @id)
    (_*.student, _*.name)
    (_*.enroll, @id, @cno)
    (_*, @id)
23 Value equality on trees
- Two nodes are value equal iff
  - either they are text nodes (PCDATA) with the same value
  - or they are attributes with the same tag and the same value
  - or they are elements having the same tag and their children are pairwise value equal
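The recursive definition can be written down almost literally. The sketch below works on ElementTree elements, so the attribute and text cases are folded into the element case; the function name value_equal is hypothetical, and trimming whitespace and ignoring text that follows a child element are assumptions of this sketch, not part of the definition on the slide.

```python
import xml.etree.ElementTree as ET

def value_equal(a, b):
    """Value equality on element nodes, following the three cases above."""
    if a.tag != b.tag:
        return False
    # Attributes: same names with the same values (attributes are unordered).
    if a.attrib != b.attrib:
        return False
    # Text content (PCDATA), compared after trimming whitespace (an assumption).
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    # Subelements: same number, pairwise value equal in document order.
    kids_a, kids_b = list(a), list(b)
    return len(kids_a) == len(kids_b) and all(
        value_equal(x, y) for x, y in zip(kids_a, kids_b))

n1 = ET.fromstring("<name><firstName>George</firstName><lastName>Bush</lastName></name>")
n2 = ET.fromstring("<name><firstName>George</firstName><lastName>Bush</lastName></name>")
n3 = ET.fromstring("<name><lastName>Bush</lastName><firstName>George</firstName></name>")
print(value_equal(n1, n2), n1 is n2, value_equal(n1, n3))
```

n1 and n2 are value equal but are not the same node, which is exactly the distinction a key relies on; n3 differs because subelements are ordered.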
24 Capturing the semistructured nature
- independent of types
- no structural requirement: tolerating missing/multiple paths
- (person, name), (person, name, @phone)
25 Path expressions
- A simple yet powerful regular path language
    q ::= ε | l | q.q | _*
  - ε: empty path
  - l: tag
  - q.q: concatenation
  - _*: combination of wildcard and the Kleene closure
- Theorem. The containment and equivalence problems for these path expressions are finitely axiomatizable and decidable in quadratic time.
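As a rough illustration of the path language, the following sketch tests whether a concrete sequence of labels belongs to the language of a path expression by translating the expression into a regular expression. It does not implement the containment algorithm of the theorem. The string syntax (steps separated by '.', the empty string for the empty path) and the names compile_path and matches are assumptions, as is the requirement that labels contain neither '.' nor '/'.

```python
import re

def compile_path(expr):
    """Translate a path expression into a regex over '/'-joined label paths.

    expr uses '.' for concatenation, '_*' for any sequence of labels,
    and the empty string for the empty path (epsilon)."""
    parts = []
    for step in (expr.split(".") if expr else []):
        parts.append(r"(?:/[^/]+)*" if step == "_*" else "/" + re.escape(step))
    return re.compile("".join(parts))

def matches(expr, labels):
    """True if the label sequence belongs to the language of expr."""
    path = "".join("/" + label for label in labels)
    return compile_path(expr).fullmatch(path) is not None

print(matches("_*.student", ["db", "student"]))          # wildcard absorbs 'db'
print(matches("_*.student", ["db", "student", "name"]))  # must end at 'student'
print(matches("book.chapter", ["book", "chapter"]))
```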
26 A key constraint language K
- Relative key: (Q, K)
  - path Q identifies a set [[Q]] of nodes, called the context
  - K = (Q', P1, ..., Pk) is a key on sub-documents rooted at nodes in [[Q]] (relative to Q).
- Example.
    (book, (chapter, number))
    (book.chapter, (section, number))
    (book, title) -- absolute key
- Analogous to keys for weak entities in a relational database:
  - the key of the parent entity
  - an identification relative to the parent entity
27 Absolute vs. relative keys
- Absolute keys as a special case of relative keys:
  - (Q, K) when Q is the empty path
- Absolute keys are scoped within the context of the entire document, while relative keys are scoped within the context of a sub-document
- Important for hierarchically structured data: XML, scientific databases, ...
  - absolute: (book, title)
  - relative: (book, (chapter, number))
  - relative: (book.chapter, (section, number))
- XML keys are more complex than relational keys!
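The difference in scope can be checked on a small document. The sketch below is a simplified checker under stated assumptions: context and target paths are plain '/'-separated label paths without wildcards ('.' stands for the empty path), key paths are either '@attr' or a child label with text-only content, and two nodes agree on a key path when they share at least one value. The names key_values and satisfies and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def key_values(node, key_path):
    """Values reached from node via one key path: '@attr' or a child label."""
    if key_path.startswith("@"):
        v = node.get(key_path[1:])
        return set() if v is None else {v}
    return {(c.text or "").strip() for c in node.findall(key_path)}

def satisfies(root, context, target, key_paths):
    """Check the relative key (context, (target, key_paths)).

    Two distinct target nodes under the same context node violate the key
    when, for every key path, they share at least one value."""
    for ctx in root.findall(context):
        nodes = ctx.findall(target)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                if all(key_values(a, p) & key_values(b, p) for p in key_paths):
                    return False
    return True

DOC = """<db>
  <book><title>XML</title>
    <chapter><number>1</number></chapter>
    <chapter><number>2</number></chapter>
  </book>
  <book><title>SQL</title>
    <chapter><number>1</number></chapter>
  </book>
</db>"""
root = ET.fromstring(DOC)
relative_ok = satisfies(root, "book", "chapter", ["number"])    # scoped per book
absolute_ok = satisfies(root, ".", "book/chapter", ["number"])  # scoped per document
title_ok = satisfies(root, ".", "book", ["title"])
print(relative_ok, absolute_ok, title_ok)
```

Chapter numbers identify a chapter only within its book, so the relative key holds while the same key read as an absolute key fails. A node that lacks a key path never causes a violation in this sketch, matching the "tolerating missing paths" reading of the earlier slide.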
28 Part 4. XML constraint analysis
29 Consistency of an XML specification
- Given:
  - D: a DTD
  - Σ: a set of keys and foreign keys
- Consistency: is there an XML document that both conforms to D and satisfies Σ?
- Example.
  - DTD D:
      <!ELEMENT foo (X, X)>
      <!ELEMENT X EMPTY>
  - constraints Σ: (X, ε)
- One wants to know whether an XML specification makes sense!
30 Implication of XML constraints
- Given:
  - D: a DTD
  - Σ: a set of keys and foreign keys
  - φ: a property (a key or foreign key)
- Implication: is it the case that for any XML document, if it conforms to D and satisfies Σ, then it must satisfy φ?
- The need for studying implication:
  - data integration: constraints cannot be checked directly at the mediator level
  - design theory for XML specifications: along the same lines as database normalization
  - query optimization (chase), ...
31 Consistency analysis
- Trivial for relational databases: given any schema and keys, foreign keys, one can always find a nonempty instance of the schema satisfying the constraints.
- Hard for XML: XML specifications with DTD and keys, foreign keys may not be consistent!
- DTDs interact with constraints in an intricate way.
32 The interaction between DTDs and constraints
- DTD D:
    <!ELEMENT foo (X, X)>
    <!ELEMENT X EMPTY>
- key φ: (X, ε)
- (1) conforms to D: two X nodes under the root
- (2) satisfies φ: no two X nodes under the root can have the same value
- There is no XML tree both conforming to D and satisfying φ
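The conflict can be replayed on the one tree shape that the DTD allows. This is only a sketch of the argument, not a consistency checker: the helper same_value is a hypothetical stand-in for value equality restricted to empty elements, and the conformance test is hard-coded for this particular DTD.

```python
import xml.etree.ElementTree as ET

# The only tree shape allowed by foo -> (X, X) with X empty.
root = ET.fromstring("<foo><X/><X/></foo>")
xs = root.findall("X")

def same_value(a, b):
    """Value equality for empty elements: same tag, attributes and text."""
    return (a.tag, a.attrib, a.text) == (b.tag, b.attrib, b.text)

# (1) Conformance: exactly two X children under the root.
conforms = [c.tag for c in root] == ["X", "X"]
# (2) Key (X, epsilon): value-equal X nodes must be the same node.
key_holds = all(a is b or not same_value(a, b) for a in xs for b in xs)
print(conforms, key_holds)
```

The tree conforms, yet its two X nodes are distinct and value equal, so the key fails; since every conforming tree has this shape, no document satisfies both.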
33 Consistency of DTDs
- There is need for consistency analysis even in the absence of constraints
- Example. DTD:
    <!ELEMENT foo (foo)>
- Every foo element must contain another foo element, so no finite tree conforms: there exists no XML document that conforms to the DTD!
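One way to detect such DTDs is to compute which element types can derive a finite tree, much like finding productive nonterminals in a grammar. The sketch below assumes a simplified DTD representation that is not from the slides: each element type maps to a list of alternatives, and each alternative lists only the child types it requires (optional and starred parts are dropped). The function name consistent is hypothetical, and this naive fixpoint loop is chosen for brevity rather than efficiency.

```python
def consistent(dtd, root):
    """True if some finite tree rooted at `root` conforms to the simplified DTD.

    dtd maps an element type to a list of alternatives; each alternative is
    the list of child types it requires (optional and starred parts omitted)."""
    productive = set()
    changed = True
    while changed:
        changed = False
        for elem, alternatives in dtd.items():
            # An element type is productive once one alternative needs only
            # productive children (an empty alternative needs none).
            if elem not in productive and any(
                    all(child in productive for child in alt)
                    for alt in alternatives):
                productive.add(elem)
                changed = True
    return root in productive

print(consistent({"foo": [["foo"]]}, "foo"))                  # foo -> (foo)
print(consistent({"foo": [["X", "X"]], "X": [[]]}, "foo"))    # foo -> (X, X), X empty
print(consistent({"foo": [["foo"], []]}, "foo"))              # foo -> (foo | empty)
```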
34 A simple constraint language, C
- absolute key: τ[X] → τ. A document satisfies the key iff
    ∀ x, y ∈ ext(τ) (x[X] =_v y[X] → x = y)
- absolute foreign key: an inclusion constraint τ1[X] ⊆ τ2[Y] and a key τ2[Y] → τ2. A document satisfies the foreign key iff it satisfies the key and
    ∀ x ∈ ext(τ1) ∃ y ∈ ext(τ2) (x[X] =_v y[Y])
- where
  - τ, τ1, τ2: element types
  - X, Y: sets (sequences) of attributes
  - ext(τ): the set of all τ elements in the document
  - =_v: value equal.
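The two satisfaction conditions translate directly into checks over ext(τ). The sketch below is an illustration under assumptions: the function names ext, proj, key_holds and foreign_key_holds and the sample document are hypothetical, attribute values are compared as strings, and a missing attribute is treated as the value None.

```python
import xml.etree.ElementTree as ET

def ext(root, tau):
    """ext(tau): all elements of type tau in the document."""
    return list(root.iter(tau))

def proj(elem, attrs):
    """x[X]: the tuple of attribute values of elem on the attribute list X."""
    return tuple(elem.get(a) for a in attrs)

def key_holds(root, tau, X):
    """tau[X] -> tau: no two distinct tau elements agree on X."""
    seen = [proj(e, X) for e in ext(root, tau)]
    return len(seen) == len(set(seen))

def foreign_key_holds(root, tau1, X, tau2, Y):
    """tau1[X] included in tau2[Y], together with the key tau2[Y] -> tau2."""
    targets = {proj(e, Y) for e in ext(root, tau2)}
    return key_holds(root, tau2, Y) and all(
        proj(e, X) in targets for e in ext(root, tau1))

DOC = """<db>
  <student id="123"><taking cno="Eng 055"/></student>
  <student id="456"><taking cno="Math 101"/></student>
  <course cno="Eng 055"/>
</db>"""
root = ET.fromstring(DOC)
student_key = key_holds(root, "student", ["id"])
course_key = key_holds(root, "course", ["cno"])
taking_fk = foreign_key_holds(root, "taking", ["cno"], "course", ["cno"])
print(student_key, course_key, taking_fk)
```

Both keys hold, but the foreign key fails because one taking element refers to a course number that no course element carries.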
35 C constraints vs. K constraints
- absolute key τ[X] → τ in C is equivalent to an absolute key in K: (_*.τ, X)
  - absolute keys are a special case of K constraints
- absolute foreign key τ1[X] ⊆ τ2[Y] and τ2[Y] → τ2 of C is not expressible in K
36 Unary constraints
- Keys and foreign keys defined in terms of a single attribute.
- Example.
    student.@id → student
    course.@cno → course
    taking.@cno ⊆ course.@cno
37 Analysis of C constraints [PODS01]
- Theorem. In the presence of DTDs, the following problems are undecidable for keys and foreign keys of C:
  - the consistency problem
  - the implication problem.
- As opposed to the trivial consistency analysis in relational databases.
- These negative results carry over to
  - other schema languages: XML Schema, XML Data, XDuce, ...
  - other constraint languages: XML Schema, XML Data, ...
38 Analysis of unary constraints
- Theorem. In the presence of DTDs, for unary constraints of C:
  - the consistency problem is NP-complete
  - the implication problem is coNP-complete.
- In relational databases, implication of unary keys and foreign keys is decidable in linear time.
- Primary key restriction: at most one key for each element type.
- Theorem. In the presence of DTDs, the consistency and implication problems remain intractable for unary keys and foreign keys of C even under the primary key restriction.
- Keys specified with ID attributes are primary and unary!
39 A simple language for relative constraints, R
- relative key: (Q, τ[X] → τ). A document satisfies the key iff
    ∀ x ∈ [[Q]] ∀ y, z ∈ ext(x.τ) (y[X] =_v z[X] → y = z)
- relative foreign key: (Q1, τ1[X]) ⊆ (Q2, τ2[Y]) and a key (Q2, τ2[Y] → τ2). A document satisfies the foreign key iff it satisfies the key and
    ∀ x ∈ [[Q1]] ∃ y ∈ [[Q2]] (ext(x.τ1)[X] ⊆_v ext(y.τ2)[Y])
- where
  - Q, Q1, Q2: path expressions
  - τ, τ1, τ2: element types; X, Y: attributes
  - ext(x.τ): the set of τ sub-elements of x
  - ⊆_v: set inclusion defined in terms of value equality
40 R constraints
- key (Q, τ[X] → τ) of R is equivalent to
  - (Q, (τ, X))
  - relative keys are a special case of K constraints
- foreign key (Q1, τ1[X]) ⊆ (Q2, τ2[Y]) and (Q2, τ2[Y] → τ2) of R is not expressible in K
- Example.
    (CS.student, (taking.@cno → taking))
    (_*, (course.@cno → course))
    (CS.student, taking.@cno) ⊆ (CS, course.@cno)
    (CS, course.@cno) ⊆ (CS.student, taking.@cno)
41 Analysis of relative constraints
- Theorem. In the presence of DTDs, the following problems are undecidable even for unary relative constraints of R:
  - the consistency problem
  - the implication problem.
- The analysis of XML constraints is far more intriguing than its database counterparts!
42 Tractable special cases
- Theorem. In the absence of constraints, the consistency problem for arbitrary DTDs is decidable in linear time.
- Theorem. When the DTD is fixed, the consistency and implication problems for unary constraints of C are in PTIME.
- Theorem. When only keys of C are considered, the consistency and implication problems are decidable in linear time in the presence of DTDs.
43 Constraint analysis in the absence of DTDs
- Regardless of DTDs:
  - Consistency: given any set of keys and foreign keys, can they be satisfied by an XML document?
  - Implication: given a set Σ of keys and foreign keys, does it follow that all documents satisfying Σ must also satisfy another key or foreign key?
- The need for investigating these issues:
  - many XML documents do not come with a DTD
  - one is interested in implication that generally holds for all kinds of documents, regardless of their DTDs.
44 Analysis of C constraints [PODS00]
- Without DTDs, the consistency problem becomes trivial: any keys and foreign keys of C are satisfiable.
- Theorem. In the absence of DTDs, the implication problem for C constraints remains undecidable.
- Theorem. In the absence of DTDs, the implication problem is decidable in PSPACE for keys and foreign keys of C under the primary key restriction.
- Theorem. In the absence of DTDs, the implication problem is decidable in linear time for unary keys and foreign keys of C.
- These results also hold when inverse constraints are allowed.
45 Analysis of K constraints [DBPL01]
- Without DTDs, the consistency problem for K also becomes trivial: any keys of K are satisfiable.
- Theorem. In the absence of DTDs, the implication problem for keys of K is finitely axiomatizable and is decidable in PTIME.
- Theorem. In the absence of DTDs, the implication problem for absolute keys of K is finitely axiomatizable and is decidable in O(n^3) time.
- The absence of DTDs simplifies the constraint analysis but does not make it trivial!
46 Part 5. Current research issues
47 Summary
- Constraints are extremely important for XML data management
- XML constraints and their analysis are more intricate than their database counterparts
- Further work is needed for a better understanding of
  - XML constraints
  - consistency and implication of XML constraints
48 Open problems
- Practical, tractable classes of XML constraints
- Normal forms for XML specifications: is (D, Σ) good?
- XML query optimization: chasing for XML constraints
- Constraint propagation: given certain database constraints, what is the XML constraint that must hold on the XML view of the database?
- Relative information capacity: is it the case that if an XML document conforms to (D1, Σ1), then it must also conform to (D2, Σ2)?
- ...