Title: Keys for XML
1Keys for XML
Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, Wang-Chiew Tan
University of Pennsylvania, Temple University, Universidade Federal do Parana, Brazil
Jonathan Mamou
2Keys in DB design
- Essential part of DB design
- Invariant connection between the tuple and the real-world entity
- Important in update
- Guarantee that an update will affect precisely one tuple
3Keys in XML
- XML documents are to do, at least, double duty as databases
- Examination of existing DTDs reveals a number of cases in which some element or attribute is specified as a unique identifier in comments
- Various key specifications in XML Standard, XML Data, XML Schema
4Components XML vs. relational DB
  <db>
    <student>
      <name> Smith </name>
      <course> Math </course>
      <grade> B </grade>
    </student>
    <student>
      <name> Jones </name>
      <course> Math </course>
      <grade> A </grade>
    </student>
    <student>
      <name> Smith </name>
      <course> CS </course>
      <grade> A- </grade>
    </student>
  </db>
5Components XML vs. relational DB (contd)
- DB
- If 2 tuples agree on their name and course attributes they agree everywhere
- XML
- If 2 elements agree on the name and course subelements then they are the same element
- Node identification?
- Equality?
6Nodes - Value Equality
- name is a key for person nodes
- name may have a complex structure: first-name, last-name
[Figure: tree rooted at db; node labels include company, government, university, dept, employee, name, @id, firstName, lastName; sample values "Bill Clinton", "Bill", "Clinton"]
7Hierarchical structure
- Hierarchically structured databases, e.g. scientific data formats
- Top-level key to identify components of a document
- Secondary key to identify sub-components
- Book/chapter/section
- Bible/book/chapter/verse
8Absolute and relative keys
- In an XML document, how to identify
- A book?
- a chapter?
- a section?
[Figure: tree rooted at db; node labels include book, title, chapter, section, number, text; sample titles "XML" and "SGML"; sample numbers 1, 5, 6, 10]
9XML standard - ID attribute
  <!ATTLIST book title ID #REQUIRED>
  <!ATTLIST chapter number ID #REQUIRED>
  <!ATTLIST section number ID #REQUIRED>
- Internal pointers rather than keys
- Scoping: an ID attribute is unique within the entire document rather than among a designated set of elements
- Can't express relative keys, e.g., for chapters/sections
- Limited to attributes rather than elements
- Unary: at most one key can be defined, in terms of a single attribute
- Value equality on text (string)
- Defined as an attribute type: keys must come with a DTD
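The scoping limitation can be illustrated with a small check. This is only a sketch: the sample document and the helper name `duplicate_ids` are hypothetical, and since Python's `xml.etree.ElementTree` does not validate DTDs, the document-wide uniqueness of ID values is simulated by hand.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: two books, each with a chapter numbered "1".
DOC = """<db>
  <book title="XML">  <chapter number="1"/> <chapter number="2"/> </book>
  <book title="SGML"> <chapter number="1"/> </book>
</db>"""

def duplicate_ids(root, id_attrs):
    """Return the values that occur more than once among ID-typed attributes.

    id_attrs maps an element name to the attribute declared as ID for it.
    ID values share one document-wide namespace, so they are counted together.
    """
    counts = Counter(
        el.get(id_attrs[el.tag])
        for el in root.iter()
        if el.tag in id_attrs and el.get(id_attrs[el.tag]) is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)

root = ET.fromstring(DOC)
clashes = duplicate_ids(root, {"book": "title", "chapter": "number"})
print(clashes)
```

Chapter number "1" is reported as a clash even though the two chapters belong to different books, which is exactly what a relative key would allow.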
10XML Data
- Introduces a notion of keys explicitly
  <elementType id="booktable">
    <element id="titleID" type="title"/>
    <element type="author"/>
    <element type="pages"/>
    <key id="bookkey">
      <keyPart href="titleID"/>
    </key>
  </elementType>
- BUT
- Can only be defined for element types rather than for certain collections of elements, e.g. book, articles, ...
11XPath
- Possible to specify interesting fragments of a document
- Syntax similar to navigating directories in a file system
- // arbitrary path
- . empty path
- / document root; also the path concatenator
- * any single node name
12XPath example
- Select BBB elements which have any attribute
  <AAA>
    <BBB id="b1"/>
    <BBB id="b2"/>
    <BBB name="bbb"/>
    <BBB/>
  </AAA>
- //BBB[@*]
13Xpath example (contd)
  <AAA>
    <BBB> </BBB>
    <XXX>
      <DDD>
        <FFF>
          <GGG> </GGG>
        </FFF>
      </DDD>
    </XXX>
    <CCC> </CCC>
  </AAA>
- //GGG/ancestor::*
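The two XPath examples can be approximated in Python as a rough sketch. The standard `xml.etree.ElementTree` module implements only a subset of XPath (no attribute wildcard predicate and no ancestor axis), so both selections are emulated by hand here; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Example 1: BBB elements that carry at least one attribute.
doc1 = ET.fromstring('<AAA><BBB id="b1"/><BBB id="b2"/><BBB name="bbb"/><BBB/></AAA>')
with_attr = [el for el in doc1.iter("BBB") if el.attrib]

# Example 2: all ancestors of GGG. ElementTree keeps no parent pointers,
# so a child -> parent map is built first.
doc2 = ET.fromstring("<AAA><BBB/><XXX><DDD><FFF><GGG/></FFF></DDD></XXX><CCC/></AAA>")
parent = {child: p for p in doc2.iter() for child in p}
ancestors = []
node = doc2.find(".//GGG")
while node in parent:
    node = parent[node]
    ancestors.append(node.tag)

print(len(with_attr), ancestors)
```

The second example moves upwards in the tree, which is one of the XPath features the key language proposed later deliberately leaves out.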
14XML-Schema
  <element name="book">
    <complexType>
      <sequence>
        <element name="title" type="string"/>
        <element name="chapters" maxOccurs="unbounded">
          <complexType> ... </complexType>
        </element>
      </sequence>
    </complexType>
    <key name="k">
      <selector xpath="."/>
      <field xpath="title"/>
    </key>
  </element>
15XML Schema (contd)
- Allows keys to be specified in terms of XPath expressions
- BUT
- XPath is a relatively complex language (move down, sideways, upwards; predicates and functions can be embedded)
- Equivalence/containment of XPath expressions is unresolved, so there is no efficient way to tell whether two keys are equivalent
- Value equality restricted to text
- Relative keys not addressed
- Structural requirement: key paths must exist and be unique
16A new key constraint language for XML
- Powerful enough to express absolute and relative keys
- Simple enough to be reasoned about efficiently
- Equivalence/containment
- consistency (satisfiability)
- implication (keys derived from others)
- Capturing the semistructured nature of XML data
- independent of any types/schema
- no structural requirements: tolerating missing/multiple key paths
17Outline
- Node addresses: testing whether 2 nodes are the same node
- Value equality: testing whether 2 nodes have the same value
- Path expression language
- Absolute key
- Key Inference
- Relative key
- Strong key
- Some issues
18Tree representation
- DOM (Document Object Model)
- Document is a hierarchical structure of nodes
- Element nodes
- Attribute nodes
- Text nodes
19Tree representation (contd)
  <db>
    <composer>
      <name> J.S. Bach </name>
      <born> 1685 </born>
      <work num="BWV82">
        <title> Ich habe genug </title>
      </work>
      <work num="BWV552">
      </work>
    </composer>
    <composer period="baroque">
      <name> G.F. Handel </name>
      <work num="HWV19">
        <title> Art Thou Troubled? </title>
      </work>
    </composer>
  </db>
20Tree representation (contd)
21Tree representation (contd)
- Attribute node: name, text; terminal
- Text node: text; terminal
- Element node
- name, may have children
- Text and element children held in an array
- Index in the array determined by the order of the subelement in the document
- Attribute children held in a dictionary
- Name of the attribute used as the index
- Edge labels uniquely identify children
22Node Address
- A path of edge labels from the root uniquely identifies a node: <l1, ..., ln>
- e.g. <1,2,1>, <1,3,@num>
- An attribute node can only occur at the end of a node address
- Order of attributes is unimportant
- Order of subelements specified by their indexes
- Address of a subnode relative to a node
- Any subnode of a node with address <a> will have a node address of the form <a,b>, where <b> is the address of the subnode relative to <a>
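The tree model and node addresses can be sketched in Python as follows. This is an illustrative encoding, not the authors' implementation: element nodes become dictionaries with an ordered `children` array and an `attrs` dictionary, text and attribute values are plain strings, and whitespace-only text is dropped. The names `build` and `resolve` are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
  <composer>
    <name>J.S. Bach</name>
    <born>1685</born>
    <work num="BWV82"><title>Ich habe genug</title></work>
    <work num="BWV552"/>
  </composer>
  <composer period="baroque">
    <name>G.F. Handel</name>
    <work num="HWV19"><title>Art Thou Troubled?</title></work>
  </composer>
</db>"""

def build(el):
    """Convert an ElementTree element into the tree model described above.

    Element node: {"name", "attrs": dict, "children": list}; text node: str.
    Whitespace-only text is dropped (an assumption of this sketch).
    """
    children = []
    if el.text and el.text.strip():
        children.append(el.text.strip())
    for child in el:
        children.append(build(child))
        if child.tail and child.tail.strip():
            children.append(child.tail.strip())
    return {"name": el.tag, "attrs": dict(el.attrib), "children": children}

def resolve(node, address):
    """Follow a node address: 1-based ints index the array, '@x' the dictionary."""
    for label in address:
        if isinstance(label, str):            # attribute edge, e.g. "@num"
            node = node["attrs"][label[1:]]
        else:                                 # text or element child
            node = node["children"][label - 1]
    return node

root = build(ET.fromstring(DOC))
print(resolve(root, (1, 2, 1)))        # text node under the first composer's born
print(resolve(root, (1, 3, "@num")))   # num attribute of the first work
```

Because an attribute resolves to a plain string, an address cannot continue past it, which matches the rule that an attribute node only occurs at the end of an address.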
23Value Equality
- Value of a node
- 1. A set S of relative addresses of its subnodes
- 2. A partial function from S to names
- 3. A partial function from S to texts
- 2 nodes are value-equal if they agree on 1, 2, 3
- Notation: a =v b
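One way to read this definition is as a recursive structural comparison. The sketch below assumes a hypothetical encoding (element nodes as dictionaries, text and attribute values as strings) and a helper `elem` introduced only for the example; agreeing on the set of relative addresses, the names, and the texts then amounts to the two trees having the same shape and labels.

```python
def elem(name, *children, **attrs):
    """Build an element node; children are element nodes or text strings."""
    return {"name": name, "attrs": attrs, "children": list(children)}

def value_equal(a, b):
    """Value equality (=v) on the tree model above."""
    if isinstance(a, str) or isinstance(b, str):
        return a == b                                   # text or attribute value
    return (
        a["name"] == b["name"]                          # same names
        and a["attrs"] == b["attrs"]                    # dictionary: order-free
        and len(a["children"]) == len(b["children"])    # same relative addresses
        and all(value_equal(x, y) for x, y in zip(a["children"], b["children"]))
    )

# Two person nodes with the same name subtree but different phone attributes.
p1 = elem("person",
          elem("name", elem("firstName", "George"), elem("lastName", "Bush")),
          phone="234-5678")
p2 = elem("person",
          elem("name", elem("firstName", "George"), elem("lastName", "Bush")),
          phone="123-4567")

names_equal = value_equal(p1["children"][0], p2["children"][0])
persons_equal = value_equal(p1, p2)
print(names_equal, persons_equal)
```

The two name nodes are distinct nodes (different addresses) yet value-equal, while the enclosing person nodes differ in value because of the phone attribute.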
24Value Equality (example)
- S = { ., <1>, <2>, <1,1>, <2,1> }
[Figure: tree rooted at db with person children; node labels include @phone, name, firstName, lastName; edge labels 1 and 2; sample values 234-5678, 123-4567, George, Bush]
25Path expressions
- How to identify nodes in a tree?
- Expression involving node names (tags, attributes) that describes a set of paths in the document tree
- XPath (XML-Schema)
- Regular expressions (semistructured data)
26Regular Path Expressions
In the normal syntax of regular expressions:
- db.emps.emp
- db.(depts.dept.mgr | emps.emp)
- db._.name
[Figure: example tree; sample leaf values Mary, Bill, John]
27Language for path expression
- 2 necessary properties
- Concatenation operation: not uniformly presented in XPath
- Concatenate a/b with /c/d: a/b//c/d
- A path should only move down the tree
- Navigation axes in XPath
28Language for path expression
- Empty path: e (.)
- Node name (tag/attribute name)
- Wild card: _, any single node name (*)
- Arbitrary path: _* (//)
- Concatenation of paths P, Q is P.Q (/)
- Notation
- n[[P]]: set of nodes (node addresses) reached by starting at node n and following a path that conforms to P
- [[P]] = root[[P]]
29Examples
- Simple path
- <2,2>[[title]] = { <2,2,1> }
- [[composer.work]] = { <1,3>, <1,4>, <2,2> }
- Complex path
- <2,2>[[_*]] = { <2,2>, <2,2,1>, <2,2,1,1>, <2,2,@num> }
- [[composer._]] = { <1,1>, <1,2>, <1,3>, <1,4>, <2,1>, <2,2> }
- [[_*.num]] = { <1,3,@num>, <1,4,@num>, <2,2,@num> }
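Most of these examples can be reproduced with a small evaluator for the path language. This is a sketch under stated assumptions: paths are dot-separated strings, `_` matches any child, `_*` matches the node itself and all its descendants, whitespace-only text is ignored, and the function names (`children`, `locate`, `step`, `evaluate`) are hypothetical. The sample document is the composer database.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
  <composer>
    <name>J.S. Bach</name><born>1685</born>
    <work num="BWV82"><title>Ich habe genug</title></work>
    <work num="BWV552"/>
  </composer>
  <composer period="baroque">
    <name>G.F. Handel</name>
    <work num="HWV19"><title>Art Thou Troubled?</title></work>
  </composer>
</db>"""

def children(addr, el):
    """Yield (address, name, element-or-None) for every child of an element node."""
    for attr in el.attrib:
        yield addr + ("@" + attr,), attr, None          # attribute node
    index = 0
    if el.text and el.text.strip():
        index += 1
        yield addr + (index,), None, None               # text node has no name
    for child in el:
        index += 1
        yield addr + (index,), child.tag, child
        if child.tail and child.tail.strip():
            index += 1
            yield addr + (index,), None, None

def locate(root, address):
    """Return the element node at an address made of integer labels."""
    node = root
    for label in address:
        node = next(c for a, _, c in children((), node) if a == (label,))
    return node

def step(nodes, label):
    """Apply one path step to a list of (address, element-or-None) pairs."""
    out = []
    if label == "_*":                                   # self plus all descendants
        frontier = list(nodes)
        while frontier:
            addr, el = frontier.pop()
            out.append((addr, el))
            if el is not None:
                frontier.extend((a, c) for a, _, c in children(addr, el))
        return out
    for addr, el in nodes:
        if el is not None:
            out.extend((a, c) for a, name, c in children(addr, el)
                       if label == "_" or name == label)
    return out

def evaluate(root, path, start=()):
    """Return the set of addresses in start[[path]]; '' is the empty path."""
    nodes = [(start, locate(root, start))]
    for label in (path.split(".") if path else []):
        nodes = step(nodes, label)
    return {addr for addr, _ in nodes}

db = ET.fromstring(DOC)
print(evaluate(db, "composer.work"))
print(evaluate(db, "_*.num"))
print(evaluate(db, "_*", start=(2, 2)))
```

Every step moves down the tree only, so concatenating two paths is simply a matter of joining their step lists.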
30Absolute key
31Key specification
- Necessary to specify
- Set on which we are defining the key (relation)
- Attributes (set of column names)
- Pair (Q, {P1, ..., Pn})
- Target path Q: path expression identifying the target set on which the key constraint is to hold
- Key paths {P1, ..., Pn}: set of simple path expressions
32Key specification (contd)
- Target path Q
- Key paths {P1, ..., Pn}
- For any node n in [[Q]], there is a set of nodes n[[Pi]] found by following Pi from n (may be empty)
- Examples
- (person.employees, {name.firstname, name.lastname})
- (composer, {name})
- (composer, {born})
33Formal Definition
- A node n satisfies a key specification (Q, {P1, ..., Pk}) iff for any n1, n2 in n[[Q]],
- if for all i, 1 <= i <= k, there exist z1 in n1[[Pi]] and z2 in n2[[Pi]] such that z1 =v z2
- then n1 = n2.
- Value equality: z1 =v z2
- Node equality: 2 nodes are equal if they have the same node address, n1 = n2
- The values associated with key paths uniquely identify a node in the target set
- Not part of the schema, data
34Remarks
- For any n1, n2 in [[Q]], if Pi is missing at either n1 or n2 then n1[[Pi]] and n2[[Pi]] are by definition disjoint
- Multiple nodes
  <db>
    <A> <B> 1 </B> </A>
    <A> <B> 1 </B> <B> 2 </B> </A>
  </db>
- Key (A, {B}) with respect to the root.
- The document does not satisfy the key.
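The definition and the example above can be checked with a short sketch. It makes simplifying assumptions that are not part of the slides: paths are written in ElementTree syntax (`a/b` for a.b), they reach element nodes only, and value equality is tested by comparing a canonical form computed by a hypothetical helper `canon`.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

def canon(el):
    """Canonical form of an element's value; equal forms mean value-equal nodes."""
    return (
        el.tag,
        tuple(sorted(el.attrib.items())),                # attribute order is irrelevant
        (el.text or "").strip(),
        tuple((canon(c), (c.tail or "").strip()) for c in el),
    )

def satisfies(node, target, key_paths):
    """Check the key (target, key_paths) at node.

    Two distinct target nodes violate the key when, for every key path,
    they reach at least one pair of value-equal nodes.
    """
    for n1, n2 in combinations(node.findall(target), 2):
        share_all = all(
            {canon(z) for z in n1.findall(p)} & {canon(z) for z in n2.findall(p)}
            for p in key_paths
        )
        if share_all:
            return False
    return True

doc = ET.fromstring("<db><A><B>1</B></A><A><B>1</B><B>2</B></A></db>")
ok = ET.fromstring("<db><A><B>1</B></A><A><B>2</B></A><A/></db>")

print(satisfies(doc, "A", ["B"]))   # the two A nodes share the B value 1
print(satisfies(ok, "A", ["B"]))    # a missing B never causes a clash
print(satisfies(doc, "A", []))      # empty key: at most one A is allowed
```

With an empty set of key paths the condition holds vacuously for any pair, so the key is satisfied only when the target set has at most one node.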
35Example of keys
- (_*.person, {id})
- Any 2 person elements are disjoint on their id fields
- (person, {e})
- Any 2 person nodes immediately under the root have different values
- (employee, {})
- Empty key. There is at most one employee under the root
- (_*, {id})
- Any 2 nodes are disjoint on their id fields up to value-equality
- Semantics of ID attribute in the XML standard
36XML vs. relational
- XML: paths that define keys
- Need not exist (null-valued keys)
- Do not have to be unique
- Key paths specify a set of addresses within a document
- Relational DB
- Key values cannot be null, must exist
- Have to be unique
- 1NF requires each component of every tuple to be an atomic value, not a set
37Remarks
- Equivalence of 2 path expressions is decidable
- Given a definition of equality on trees, do we need to have more than one key path in a key specification?
- All key attributes must be represented as subnodes of some node
- Constrain this node to contain only those subnodes
- Too restrictive, unnecessary interference between key specifications and data models
- Allow a (possibly empty) set of nodes at the end of each key path
- How to require each of the key paths to exist and to be unique?
38Remarks (contd)
- Language of path expressions
- Need something more powerful to express Q
- (person.(mother | father)*, {id})
- A person element followed by zero or more father or mother elements
- Provisional language of path expressions
- Changing it does not change the theory
39Key inference
- In relational DB
- Infer some keys from the presence of others
- If (Q, S) is a key and S ⊆ S', then so is (Q, S')
- Counterpart of relational inference rule
- If (Q.Q', {P}) is a key, then so is (Q, {Q'.P})
- Tree-like structure: if a node is identified in a tree then its ancestors are also determined, i.e. if a key path P uniquely identifies a node n in [[Q.Q']] then Q'.P is a key path for the ancestor of n in [[Q]].
40Key Inference (contd)
- If (Q, S) is a key and Q' ⊆ Q, then (Q', S) is also a key
- Any key of the set [[Q]] is also a key for any subset of [[Q]]
- For any finite set S of keys, there exists a (finite) XML document satisfying S
- Key paths may be missing, e.g. (_*, {id})
- If the key path were required to exist at all nodes specified by the target path, the XML document would have to be infinite to satisfy the key
- Only holds in the absence of DTDs
41Key Inference
- Key K = (X, {})
- DTD D: <!ELEMENT foo (X, X)>
- foo foo
- No XML document both conforms to D and satisfies K
- DTDs interact with XML key constraints
42Relative Key
43Relative key - Motivation
- Motivated by scientific data formats: hierarchical structure, large set of entries at the top level
- Protein sequence database Swiss-Prot
- Accession number (key) for each entry
- Within each entry, sequence of citations each identified by a number 1, 2, 3, ...
- Linguistic database: recordings of speech
- Data sets held in files
- Metadata provided by directory structure
- /timit/train/dr1/fcjf0/sa1.wav
- TIMIT corpus, training set, dialect region 1, female speaker, speaker-ID "cjf0", sentence text "sa1", speech waveform file
44An absolute key for books
- An absolute key to identify a book: (book, {title})
- target path book: starting from the root and identifying a collection of books
- key path title: its value uniquely identifies a book
- absolute: defined on the entire document
45Relative key - definition
- Like the key of a weak entity set in DB
- Studios(name, address)
- Crews(number)
- A document satisfies a relative key specification (Q, (Q', S)) iff for all nodes n in [[Q]], n satisfies the key (Q', S).
- Absolute keys are a special case of relative keys
- (Q, S) is equivalent to (e, (Q, S))
46A relative key for chapters
- A relative key: (book, (chapter, {number}))
- A chapter number uniquely identifies a chapter within a book!
- Context path: book
- target path chapter: starting at a book
- key path: number
- relative: defined on sub-documents, relative to the context
47Absolute/Relative Key
- What is the difference between
- Absolute key (book.chapter, {number})
- Relative key (book, (chapter, {number}))
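The difference shows up on a document in which two books each have a chapter numbered 1. The sketch below is illustrative only: the sample document is made up, paths use ElementTree syntax (`a/b` for a.b, `.` for the empty path e), and `canon`, `satisfies`, and `satisfies_relative` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

DOC = """<db>
  <book><title>XML</title>
    <chapter><number>1</number></chapter>
    <chapter><number>2</number></chapter>
  </book>
  <book><title>SGML</title>
    <chapter><number>1</number></chapter>
  </book>
</db>"""

def canon(el):
    """Canonical form of an element's value, used to test value equality."""
    return (el.tag, tuple(sorted(el.attrib.items())), (el.text or "").strip(),
            tuple((canon(c), (c.tail or "").strip()) for c in el))

def satisfies(node, target, key_paths):
    """Key (target, key_paths) holds at node: no two target nodes share all key values."""
    for n1, n2 in combinations(node.findall(target), 2):
        if all({canon(z) for z in n1.findall(p)} & {canon(z) for z in n2.findall(p)}
               for p in key_paths):
            return False
    return True

def satisfies_relative(root, context, target, key_paths):
    """Relative key (context, (target, key_paths)): the key holds at every context node."""
    return all(satisfies(n, target, key_paths) for n in root.findall(context))

db = ET.fromstring(DOC)
absolute_ok = satisfies_relative(db, ".", "book/chapter", ["number"])
relative_ok = satisfies_relative(db, "book", "chapter", ["number"])
print(absolute_ok, relative_ok)
```

The absolute key compares all chapters in the document with each other, so the two chapters numbered 1 clash; the relative key only compares chapters within the same book.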
48A relative key for sections
- Key: (book.chapter, (section, {number}))
- A section number uniquely identifies a section within a particular chapter of a particular book!
- Relative to the chapter containing the section, and to the book containing the chapter
49Transitivity of relative keys
- A relative key such as (bible.book.chapter, (verse, {number}))
- does not uniquely identify a particular verse in the bible
- Book name, chapter number, verse number -> verse
50immediately precedes relation
- (Q1, (Q1', S1)) immediately precedes (Q2, (Q2', S2)) if Q2 = Q1.Q1'
- (bible, (book, {name}))
- immediately precedes
- (bible.book, (chapter, {number}))
- Any absolute key immediately precedes itself
51precede relation
- Precedes is the transitive closure of the immediately precedes relation
- Qn = Q1.Q1'. ... .Qn-1'
- (bible, (book, {name})),
- (bible.book, (chapter, {number})),
- (bible.book.chapter, (verse, {number}))
52Transitivity of relative keys
- A set S of relative keys is transitive if for any relative key K1 = (Q1, (Q1', S1)) in S there is a key K2 = (e, (Q2', S2)) in S which precedes K1
- Any transitive set of relative keys must contain some absolute key
53Transitivity of relative keys - example
- TRANSITIVE SET
- (e, (bible.book, {name}))
- (bible.book, (chapter, {number}))
- (bible.book.chapter, (verse, {number}))
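The precedes relation and the transitivity test can be sketched as follows. The sketch assumes that all context and target paths are simple dot-separated paths, so that path equality reduces to tuple equality (the general case needs equivalence of path expressions). The class `RelKey` and the function names are hypothetical.

```python
from typing import NamedTuple, Tuple

class RelKey(NamedTuple):
    context: Tuple[str, ...]   # Q, as a tuple of node names; () is the empty path e
    target: Tuple[str, ...]    # Q'
    key_paths: frozenset       # S

def key(context, target, *key_paths):
    """Build a RelKey from dot-separated path strings ('' is the empty path)."""
    split = lambda s: tuple(s.split(".")) if s else ()
    return RelKey(split(context), split(target), frozenset(key_paths))

def immediately_precedes(k1, k2):
    """(Q1, (Q1', S1)) immediately precedes (Q2, (Q2', S2)) if Q2 = Q1.Q1'."""
    return k2.context == k1.context + k1.target

def precedes(k1, k2, keys):
    """Transitive closure of immediately_precedes, restricted to the given keys."""
    seen, frontier = set(), [k1]
    while frontier:
        k = frontier.pop()
        for nxt in keys:
            if nxt not in seen and immediately_precedes(k, nxt):
                if nxt == k2:
                    return True
                seen.add(nxt)
                frontier.append(nxt)
    return False

def is_transitive(keys):
    """Every key must be preceded by some absolute key in the set.

    Following the convention stated earlier, an absolute key counts as
    preceding itself.
    """
    absolute = [k for k in keys if k.context == ()]
    return all(k in absolute or any(precedes(a, k, keys) for a in absolute)
               for k in keys)

bible = [
    key("", "bible.book", "name"),
    key("bible.book", "chapter", "number"),
    key("bible.book.chapter", "verse", "number"),
]
print(is_transitive(bible))        # chain starts at an absolute key
print(is_transitive(bible[1:]))    # no absolute key in the set
```

Dropping the absolute key breaks the chain, which illustrates why a transitive set must contain at least one absolute key.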
54Insertion-friendly relative keys
- Transitive key specification
- (e, (university, {name}))
- (university, (dept.employee, {emp-id}))
- Identify an employee: university name + emp-id
- Add an employee: specify a dept for the employee
- No way to identify a dept
- Many ways to add an employee!!!
55Insertion-friendly relative keys (contd)
- Insert an element in the keyed part of the document unambiguously by specifying where to insert the element using keys.
- A set S of relative keys is insertion-friendly if it is transitive and whenever (Q1, (Q1'.n, S1)) ∈ S, there is a relative key (Q2, (Q2', S2)) ∈ S where |Q2'| > 0 and Q1.Q1' = Q2.Q2'.
- n is a node name
- Every element with a prefix along the path Q1.Q1' can be identified through some keys
56Insertion-friendly relative keys (contd)
- (e, (university, {name}))
- (university, (dept, {dept-name}))
- (university, (dept.employee, {emp-id}))
- n = employee
57Insertion-friendly relative keys (contd)
- (e, (university, {name}))
- (university, (dept, {dept-name}))
- (university, (dept.employee, {emp-id}))
- Nothing about the dept is necessary to identify employees!!!
- Anomaly that occurs in relational databases that are not in second normal form
- Employees should not be children of department nodes, but only of university nodes
- Linkage between employees and departments should be expressed through a foreign key
58Notation for relative key
- If a system of relative keys is transitive, it forms a hierarchical structure, so a compressed syntax can be created for such systems
- Basic syntactic form
- Q1{P1, ..., Pk1}.Q2{P1, ..., Pk2}. ... .Qn{P1, ..., Pkn}
59Notation for relative key (contd)
- bible{}.book{name}.chapter{number}.verse{number}
- (e, (bible, {}))
- (bible, (book, {name}))
- (bible.book, (chapter, {number}))
- (bible.book.chapter, (verse, {number}))
- company{name}.employee{id}, .department{name}
- company{name}.employee{id}
- company{name}.department{name}
60Notation for relative key
- Compact and understandable
- Ensures the internal consistency of the document
- Tells others how to cite a component of our document
- Our documents have a structured core
61Strong keys
62Stronger definitions of keys
- Requirements imposed by a key in relational DB
- Uniqueness of a key
- Existence of key
- Key paths exist and are unique (for 1 <= i <= n, n[[Pi]] contains exactly one node)
- name is unique at <1>
- work and num are not unique at this node
63Stronger definitions of keys (contd)
- A node n satisfies a strong key specification (Q, {P1, ..., Pk}) if
- For all n' in n[[Q]] and for all Pi, Pi exists and is unique at n'.
- For any n1, n2 in n[[Q]], if for all i, n1[[Pi]] =v n2[[Pi]] then n1 = n2
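A strong key adds an existence and uniqueness requirement on top of the earlier definition. The sketch below checks both conditions on the composer document; as before, it assumes ElementTree path syntax, key paths that reach element nodes, and a hypothetical canonical-form helper `canon` for value equality.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
  <composer>
    <name>J.S. Bach</name><born>1685</born>
    <work num="BWV82"><title>Ich habe genug</title></work>
    <work num="BWV552"/>
  </composer>
  <composer period="baroque">
    <name>G.F. Handel</name>
    <work num="HWV19"><title>Art Thou Troubled?</title></work>
  </composer>
</db>"""

def canon(el):
    """Canonical form of an element's value, used to test value equality."""
    return (el.tag, tuple(sorted(el.attrib.items())), (el.text or "").strip(),
            tuple((canon(c), (c.tail or "").strip()) for c in el))

def satisfies_strong(node, target, key_paths):
    """Strong key: each key path reaches exactly one node at every target node,
    and no two target nodes agree on all key values."""
    seen = set()
    for n in node.findall(target):
        hits = [n.findall(p) for p in key_paths]
        if any(len(h) != 1 for h in hits):
            return False                  # a key path is missing or not unique
        value = tuple(canon(h[0]) for h in hits)
        if value in seen:
            return False                  # two target nodes share the key value
        seen.add(value)
    return True

db = ET.fromstring(DOC)
print(satisfies_strong(db, "composer", ["name"]))   # name exists once per composer
print(satisfies_strong(db, "composer", ["born"]))   # born is missing for Handel
print(satisfies_strong(db, "composer", ["work"]))   # work is not unique for Bach
```

Under the earlier, weaker definition (composer, {born}) holds on this document, because a missing key path never produces a clash; the strong version rejects it.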
64Stronger definitions of keys (contd)
- (_*.person, {id})
- Any 2 person elements have unique ids and differ on those elements
- (person, {e})
- Unchanged
- (employees, {})
- Unchanged
65Stronger definitions of keys (contd)
- (_*, {k})
- Every element has a key k, including elements whose name is k
- Finite satisfiability?
- Imposes an infinite chain of k nodes
- No finite document satisfies it
- Because of the requirement of existence of key paths
- Structural constraint
66Relative Strong Key
- A document satisfies a strong relative key specification (Q, (Q', S)) iff for all nodes n in [[Q]], n satisfies the strong key (Q', S)
67Unconstrained XML Node names as key values
68Node names as key values
- Key specification must cover the practical cases without using definitions that are too complex to allow any kind of reasoning about keys
- Issue in unconstrained XML: interchanging structure (the names) with data (their values)
69unconstrained XML
  <db>
    <parts>
      <widget> <id> 123 </id> <w> 1.5 </w> </widget>
      <widget> <id> 234 </id> <w> 2.5 </w> </widget>
      <gadget> <id> 123 </id> <w> 3.2 </w> </gadget>
    </parts>
  </db>

  <db>
    <parts>
      <part>
        <type> widget </type>
        <id> 123 </id>
        <w> 1.5 </w>
      </part>
      <part>
        <type> widget </type>
        <id> 234 </id>
        <w> 2.5 </w>
      </part>
      <part>
        <type> gadget </type>
        <id> 123 </id>
        <w> 3.2 </w>
      </part>
    </parts>
  </db>
70Node names as key values (contd)
- Unconstrained XML
- Type of a part is expressed in the tag
- Key constraint: parts.widget{id}, .gadget{id}
- Alternative XML representation
- type expressed as an attribute or subelement of a part element
- Key constraint: parts.part{type, id}
71Introducing a new part type
- Introduce a thingy
- unconstrained
- Change key specification
- parts.widget{id}, .gadget{id}, .thingy{id}
- Alternative
- No change: parts.part{type, id}
- Ability to interchange structure and data is supposed to be one of the strong points of semistructured data and XML
72Solution
- Adding a virtual subelement node-name to each named node, whose value consists of the node name
- Key: parts._{node-name, id}
- Does not alter any of the properties expected to hold for keys
- Accounts for any practical use of tag names in keys
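The effect of the virtual node-name subelement can be sketched on the unconstrained parts document. The code below is an illustration under simplifying assumptions: each part has at most one id child, parts without an id are skipped, and the function name `clashes` is hypothetical. The tag of each part plays the role of its node-name value.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<db><parts>
  <widget><id>123</id><w>1.5</w></widget>
  <widget><id>234</id><w>2.5</w></widget>
  <gadget><id>123</id><w>3.2</w></gadget>
</parts></db>"""

def clashes(parts, use_node_name=True):
    """Return key values shared by more than one child of parts.

    With use_node_name the key value is (tag, id), mirroring a key over the
    virtual node-name subelement and id; otherwise id alone is used.
    """
    counts = Counter(
        (part.tag, part.findtext("id").strip()) if use_node_name
        else part.findtext("id").strip()
        for part in parts
        if part.find("id") is not None
    )
    return [value for value, count in counts.items() if count > 1]

parts = ET.fromstring(DOC).find("parts")
print(clashes(parts))                        # (node-name, id) identifies each part
print(clashes(parts, use_node_name=False))   # id alone does not

# Introducing a new part type requires no change to the key specification.
thingy = ET.SubElement(parts, "thingy")
ET.SubElement(thingy, "id").text = "123"
print(clashes(parts))
```

The same check keeps working after the thingy is added, whereas a key that lists widget and gadget by name would have to be extended.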
73Conclusion
- A new key constraint language for XML
- independent of any schema specifications for XML
- powerful enough to express absolute and relative keys
- simple enough to be reasoned about efficiently
- In contrast to their relational counterparts
- XML keys are more complex
- the analyses of XML keys are far more intricate
74References
- Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, and Wang-Chiew Tan. Keys for XML. WWW10 (2001). http://db.cis.upenn.edu/DL/xmlkeys.ps
- Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, and Wang-Chiew Tan. Reasoning about keys for XML. University of Pennsylvania, Technical Report MS-CIS-00-26, 2000. http://db.cis.upenn.edu/DL/absolute-full.ps