1
Keys for XML
Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, Wang-Chiew Tan
University of Pennsylvania, Temple University, Universidade Federal do Parana, Brazil
Jonathan Mamou
2
Keys in DB design
  • Essential part of DB design
  • Invariant connection between the tuple and the
    real-world entity
  • Important in update
  • Guarantee that an update will affect precisely
    one tuple

3
Keys in XML
  • XML documents are expected to do at least double duty
    as databases
  • Examination of existing DTDs reveals a number of
    cases in which some element or attribute is
    specified as a unique identifier in comments
  • Various key specifications exist: the XML standard,
    XML-Data, XML Schema

4
Components: XML vs. relational DB
  <db>
    <student>
      <name> Smith </name>
      <course> Math </course>
      <grade> B </grade>
    </student>
    <student>
      <name> Jones </name>
      <course> Math </course>
      <grade> A </grade>
    </student>
    <student>
      <name> Smith </name>
      <course> CS </course>
      <grade> A- </grade>
    </student>
  </db>

5
Components: XML vs. relational DB (contd)
  • DB
    • If two tuples agree on their name and course
      attributes, they agree everywhere
  • XML
    • If two elements agree on their name and course
      subelements, then they are the same element
    • Node identification?
    • Equality?
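
As a rough illustration of the relational-style reading, the sketch below checks whether any two student elements of the sample document agree on a chosen set of subelements. It compares stripped text only, which sidesteps the node-identity and equality questions raised above. The helper name `duplicate_keys` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<db>
  <student><name> Smith </name><course> Math </course><grade> B </grade></student>
  <student><name> Jones </name><course> Math </course><grade> A </grade></student>
  <student><name> Smith </name><course> CS </course><grade> A- </grade></student>
</db>"""

def duplicate_keys(root, target, fields):
    """Return the field-value tuples shared by more than one `target` child of root."""
    counts = Counter(
        tuple((el.findtext(f) or "").strip() for f in fields)
        for el in root.findall(target)
    )
    return [key for key, count in counts.items() if count > 1]

root = ET.fromstring(DOC)
print(duplicate_keys(root, "student", ["name", "course"]))  # no duplicates: key holds
print(duplicate_keys(root, "student", ["name"]))            # Smith occurs twice
```

With name and course together no tuple repeats, while name alone is not a key because two students are called Smith.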

6
Nodes - Value Equality
  • name is a key for person nodes
  • name may have a complex structure: firstName,
    lastName

[Figure: db tree with company, government and university nodes containing dept and employee nodes; each employee has an @id and a name, where a name is either the text "Bill Clinton" or a firstName "Bill" and a lastName "Clinton"]
7
Hierarchical structure
  • Hierarchically structured databases, e.g.
    scientific data formats
  • Top-level key to identify components of a
    document
  • Secondary key to identify sub-components
  • Book/chapter/section
  • Bible/book/chapter/verse

8
Absolute and relative keys
  • In an XML document, how do we identify
  • a book?
  • a chapter?
  • a section?

[Figure: db tree with book nodes; each book has a title (e.g. "XML", "SGML") and chapter nodes; each chapter has a number and section nodes; each section has a number and text]
9
XML standard - ID attribute
  <!ATTLIST book title ID #REQUIRED>
  <!ATTLIST chapter number ID #REQUIRED>
  <!ATTLIST section number ID #REQUIRED>
  • Internal pointers rather than keys
  • Scoping: an ID attribute is unique within the entire
    document rather than among a designated set of
    elements
  • Can't express relative keys, e.g., for
    chapters/sections
  • Limited to attributes rather than elements
  • Unary: at most one key can be defined, in terms
    of a single attribute
  • Value equality on text (strings) only
  • Defined as an attribute type: keys must come with
    a DTD

10
XML Data
  • Introduces a notion of keys explicitly
  <elementType id="booktable">
    <element id="titleID" type="#title"/>
    <element type="#author"/>
    <element type="#pages"/>
    <key id="bookkey">
      <keyPart href="#titleID"/>
    </key>
  </elementType>
  • BUT
  • Keys can only be defined for element types, not
    for certain collections of elements, e.g. books,
    articles, ...
11
XPath
  • Possible to specify interesting fragments of a
    document
  • Syntax similar to navigating directories in a
    file system
  • //  arbitrary path
  • .   empty path
  • /   document root; also the path concatenator
  • *   any single node name

12
XPath example
  • Select BBB elements which have any attribute
  <AAA>
    <BBB id="b1"/>
    <BBB id="b2"/>
    <BBB name="bbb"/>
    <BBB/>
  </AAA>
  • //BBB[@*]
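
The same selection can be sketched with Python's standard `xml.etree.ElementTree`. The sketch filters on the attribute dictionary in plain Python instead of relying on an XPath predicate, so it only approximates the XPath expression above.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<AAA><BBB id="b1"/><BBB id="b2"/><BBB name="bbb"/><BBB/></AAA>'
)

# Every BBB element anywhere in the document that carries at least one attribute.
selected = [b for b in root.iter("BBB") if b.attrib]
print([b.attrib for b in selected])
```

Three of the four BBB elements are selected; the last one has no attributes.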

13
XPath example (contd)
  <AAA>
    <BBB> </BBB>
    <XXX>
      <DDD>
        <FFF>
          <GGG> </GGG>
        </FFF>
      </DDD>
    </XXX>
    <CCC> </CCC>
  </AAA>
  • //GGG/ancestor::* (all ancestors of GGG elements)
14
XML-Schema
  <element name="book">
    <complexType>
      <sequence>
        <element name="title" type="string"/>
        <element name="chapters" maxOccurs="unbounded">
          <complexType> ... </complexType>
        </element>
      </sequence>
    </complexType>
    <key name="k">
      <selector xpath="."/>
      <field xpath="title"/>
    </key>
  </element>

15
XML Schema (contd)
  • Allows keys to be specified in terms of XPath
    expressions
  • BUT
  • XPath is a relatively complex language (it can move
    down, sideways and upwards; predicates and functions
    can be embedded)
  • Equivalence/containment of XPath expressions is
    unresolved, so there is no efficient way to tell
    whether two keys are equivalent
  • Value equality is restricted to text
  • Relative keys are not addressed
  • Structural requirement: key paths must exist and
    be unique

16
A new key constraint language for XML
  • Powerful enough to express absolute and relative
    keys
  • Simple enough to be reasoned about efficiently
    • equivalence/containment
    • consistency (satisfiability)
    • implication (keys derived from others)
  • Captures the semistructured nature of XML data
    • independent of any types/schema
    • no structural requirements: tolerates
      missing/multiple key paths

17
Outline
  • Node addresses: testing whether two nodes are the
    same node
  • Value equality: testing whether two nodes have the
    same value
  • Path expression language
  • Absolute key
  • Key Inference
  • Relative key
  • Strong key
  • Some issues

18
Tree representation
  • DOM (Document Object Model)
  • Document is a hierarchical structure of nodes
  • Element nodes
  • Attribute nodes
  • Text nodes

19
Tree representation (contd)
  <db>
    <composer>
      <name> J.S. Bach </name>
      <born> 1685 </born>
      <work num="BWV82">
        <title> Ich habe genug </title>
      </work>
      <work num="BWV552">
      </work>
    </composer>
    <composer period="baroque">
      <name> G.F. Handel </name>
      <work num="HWV19">
        <title> Art Thou Troubled? </title>
      </work>
    </composer>
  </db>

20
Tree representation (contd)
21
Tree representation (contd)
  • Attribute node: name and text, terminal
  • Text node: text, terminal
  • Element node: name, may have children
    • Text and element children are held in an array
      • The index in the array is determined by the
        order of the subelement in the document
    • Attribute children are held in a dictionary
      • The name of the attribute is used as the index
  • Edge labels uniquely identify children

22
Node Address
  • A path of edge labels from the root uniquely
    identifies a node: <l1, ..., ln>
  • Examples: <1,2,1>, <1,3,@num>
  • An attribute node can only occur at the end of a
    node address
  • Order of attributes is unimportant
  • Order of subelements is specified by their indexes
  • Address of a subnode relative to a node
  • Any subnode of a node with address <a> has a node
    address of the form <a,b>, where <b> is the
    address of the subnode relative to <a>
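
A minimal sketch of this addressing scheme, assuming Python's `xml.etree.ElementTree` as the parser. Whitespace-only text is ignored, and the function name `addresses` is hypothetical. Addresses are tuples of edge labels: 1-based positions for text and element children, `@name` for attributes.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
  <composer>
    <name>J.S. Bach</name><born>1685</born>
    <work num="BWV82"><title>Ich habe genug</title></work>
    <work num="BWV552"/>
  </composer>
  <composer period="baroque">
    <name>G.F. Handel</name>
    <work num="HWV19"><title>Art Thou Troubled?</title></work>
  </composer>
</db>"""

def addresses(el, prefix=()):
    """Yield (address, label) for el and every subnode, in document order."""
    yield prefix, el.tag
    for name, text in el.attrib.items():      # attribute edges are labelled @name
        yield prefix + ("@" + name,), text
    index = 0
    if (el.text or "").strip():               # leading text child, if any
        index += 1
        yield prefix + (index,), el.text.strip()
    for child in el:
        index += 1
        yield from addresses(child, prefix + (index,))
        if (child.tail or "").strip():        # text that follows a subelement
            index += 1
            yield prefix + (index,), child.tail.strip()

root = ET.fromstring(DOC)
addr = dict(addresses(root))
print(addr[(1, 2, 1)])        # text under the first composer's born element
print(addr[(1, 3, "@num")])   # num attribute of the first composer's first work
```

On the composer document, `<1,2,1>` is the text 1685 and `<1,3,@num>` is the attribute value BWV82.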

23
Value Equality
  • The value of a node consists of
    1. a set S of relative addresses of its subnodes
    2. a partial function from S to names
    3. a partial function from S to texts
  • Two nodes are value-equal if they agree on 1, 2, 3
  • Notation: a =v b
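
One way to approximate value equality in code is to reduce each node to a canonical nested tuple and compare the tuples. This is a sketch under simplifying assumptions: it uses `xml.etree.ElementTree`, ignores surrounding whitespace and mixed content, and the names `value` and `value_equal` are hypothetical.

```python
import xml.etree.ElementTree as ET

def value(node):
    """Canonical value: name, sorted attributes, stripped text, ordered child values."""
    return (
        node.tag,
        tuple(sorted(node.attrib.items())),
        (node.text or "").strip(),
        tuple(value(child) for child in node),
    )

def value_equal(a, b):
    """Two nodes are value-equal when their canonical values coincide."""
    return value(a) == value(b)

db = ET.fromstring("""<db>
  <person phone="234-5678">
    <name><firstName>George</firstName><lastName>Bush</lastName></name>
  </person>
  <person phone="123-4567">
    <name><firstName>George</firstName><lastName>Bush</lastName></name>
  </person>
</db>""")

p1, p2 = db.findall("person")
n1, n2 = p1.find("name"), p2.find("name")
print(value_equal(n1, n2))  # the two name subtrees have the same shape and text
print(value_equal(p1, p2))  # the persons differ on their phone attribute
```

The two name nodes are distinct nodes with different addresses, yet they are value-equal; the enclosing person nodes are not.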

24
Value Equality (example)
  • S = {., <1>, <2>, <1,1>, <2,1>} (relative addresses
    within a name node)

[Figure: db tree with person nodes; each person has a @phone attribute (234-5678, 123-4567) and a name with children firstName (edge 1) and lastName (edge 2), holding the texts "George" and "Bush"]
25
Path expressions
  • How to identify nodes in a tree?
  • Expression involving node names (tags and
    attributes) that describes a set of paths in the
    document tree
  • XPath (XML-Schema)
  • Regular expressions (semistructured data)

26
Regular Path Expressions
  • In the normal syntax of regular expressions
  • db.emps.emp
  • db.(depts.dept.mgr | emps.emp)
  • db._*.name

[Figure: example tree with name values Mary, Bill, John]
27
Language for path expression
  • Two necessary properties
  • A concatenation operation; XPath has no uniform
    presentation of it
    • Concatenate a/b with /c/d: a/b//c/d
  • A path should only move down the tree
    • Unlike the navigation axes in XPath

28
Language for path expression
  • Empty path: ε (XPath: .)
  • Node name (tag/attribute name)
  • Wildcard: _ , any single node name (XPath: *)
  • Arbitrary path: _* (XPath: //)
  • Concatenation of paths P, Q is P.Q (XPath: /)
  • Notation
  • n[[P]]: the set of nodes (node addresses) reached by
    starting at node n and following a path that
    conforms to P
  • [[P]] = root[[P]]

29
Examples
  • Simple paths
  • <2,2>[[title]] = {<2,2,1>}
  • [[composer.work]] = {<1,3>, <1,4>, <2,2>}
  • Complex paths
  • <2,2>[[_*]] = {<2,2>, <2,2,1>, <2,2,1,1>,
    <2,2,@num>}
  • [[composer._]] = {<1,1>, <1,2>, <1,3>, <1,4>,
    <2,1>, <2,2>}
  • [[_*.num]] = {<1,3,@num>, <1,4,@num>,
    <2,2,@num>}
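
A small evaluator for this path language can be sketched as follows. It is a simplification: it works on `xml.etree.ElementTree` elements only, so attribute and text nodes are not reached, and paths are written as dot-separated strings with `_` and `_*`. The function name `follow` is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
  <composer>
    <name>J.S. Bach</name><born>1685</born>
    <work num="BWV82"><title>Ich habe genug</title></work>
    <work num="BWV552"/>
  </composer>
  <composer period="baroque">
    <name>G.F. Handel</name>
    <work num="HWV19"><title>Art Thou Troubled?</title></work>
  </composer>
</db>"""

def follow(node, path):
    """Return the elements reached from `node` along `path`; '' is the empty path."""
    nodes = [node]
    for label in filter(None, path.split(".")):
        if label == "_*":   # arbitrary path: each node itself plus all its descendants
            found = [d for n in nodes for d in n.iter()]
        else:               # a node name, or "_" for any single node name
            found = [c for n in nodes for c in n if label in ("_", c.tag)]
        seen, nodes = set(), []
        for n in found:     # drop duplicates, keeping first-visit order
            if id(n) not in seen:
                seen.add(id(n))
                nodes.append(n)
    return nodes

root = ET.fromstring(DOC)
print([w.get("num") for w in follow(root, "composer.work")])
print([t.text for t in follow(root, "_*.title")])
print(len(follow(root, "composer._")))
```

On the composer document, `composer.work` reaches the three work elements and `composer._` reaches six element children, matching the element addresses listed in the examples.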

30
Absolute key
31
Key specification
  • Necessary to specify
    • the set on which we are defining the key (relation)
    • the attributes (set of column names)
  • A key is a pair (Q, {P1, ..., Pn})
  • Target path Q: a path expression identifying the
    target set on which the key constraint is to hold
  • Key paths {P1, ..., Pn}: a set of simple path
    expressions

32
Key specification (contd)
  • Target path Q
  • Key paths {P1, ..., Pn}
  • For any node n in [[Q]], there is a set of nodes
    n[[Pi]] found by following Pi from n (may be empty)
  • Examples
  • (person.employees, {name.firstname,
    name.lastname})
  • (composer, {name})
  • (composer, {born})

33
Formal Definition
  • A node n satisfies a key specification (Q, {P1, ...,
    Pk}) iff for any n1, n2 in n[[Q]],
  • if for all i, 1 ≤ i ≤ k, there exist z1 in
    n1[[Pi]] and z2 in n2[[Pi]] such that z1 =v z2,
  • then n1 = n2.
  • Value equality: z1 =v z2
  • Node equality: two nodes are equal if they have
    the same node address, n1 = n2
  • The values associated with key paths uniquely
    identify a node in the target set
  • Keys are a property of the data, not part of the
    schema

34
Remarks
  • For any n1, n2 in [[Q]], if Pi is missing at either
    n1 or n2, then n1[[Pi]] and n2[[Pi]] are by definition
    disjoint
  • Multiple nodes
  <db>
    <A> <B> 1 </B> </A>
    <A> <B> 1 </B> <B> 2 </B> </A>
  </db>
  • Key (A, {B}) with respect to the root
  • The document does not satisfy the key: the two A
    nodes share the B value 1
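
The definition and the multiple-node remark can be checked with a short sketch. It assumes `xml.etree.ElementTree`, simple key and target paths made of child element names only, and value equality approximated by a canonical tuple. The names `follow`, `value` and `satisfies` are hypothetical.

```python
import xml.etree.ElementTree as ET

def follow(node, path):
    """Elements reached from `node` along a dot-separated path of child names."""
    nodes = [node]
    for name in filter(None, path.split(".")):
        nodes = [c for n in nodes for c in n if c.tag == name]
    return nodes

def value(node):
    """Canonical value used to approximate value equality."""
    return (node.tag, tuple(sorted(node.attrib.items())),
            (node.text or "").strip(), tuple(value(c) for c in node))

def satisfies(n, target, key_paths):
    """True iff node n satisfies the key (target, key_paths)."""
    targets = follow(n, target)
    for i, n1 in enumerate(targets):
        for n2 in targets[i + 1:]:
            # n1 and n2 clash if, on every key path, they share some value
            if all({value(z) for z in follow(n1, p)} & {value(z) for z in follow(n2, p)}
                   for p in key_paths):
                return False
    return True

shared = ET.fromstring("<db><A><B>1</B></A><A><B>1</B><B>2</B></A></db>")
missing = ET.fromstring("<db><A><B>1</B></A><A><B>2</B></A><A/></db>")
print(satisfies(shared, "A", ["B"]))   # the two A nodes share the B value 1
print(satisfies(missing, "A", ["B"]))  # a missing key path never causes a clash
```

With an empty list of key paths, any two distinct target nodes clash, so the key holds only if there is at most one target node.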

35
Example of keys
  • (_*.person, {id})
  • Any two person elements are disjoint on their id
    fields
  • (person, {ε})
  • Any two person nodes immediately under the root
    have different values
  • (employee, {})
  • Empty key: there is at most one employee under
    the root
  • (_*, {id})
  • Any two nodes are disjoint on their id fields, up to
    value equality
  • Semantics of the ID attribute in the XML standard

36
XML vs. relational
  • XML: paths that define keys
    • need not exist (null-valued keys)
    • do not have to be unique
    • specify a set of addresses within a document
  • Relational DB: key values
    • cannot be null; they must exist
    • have to be unique
    • 1NF requires each component of every tuple to be
      an atomic value, not a set

37
Remarks
  • Equivalence of 2 path expressions is decidable
  • Given a definition of equality on trees, do we
    need more than one key path in a key
    specification?
    • All key attributes would have to be represented
      as subnodes of some node
    • That node would have to be constrained to contain
      only those subnodes
    • Too restrictive: unnecessary interference between
      key specifications and data models
  • Allow a (possibly empty) set of nodes at the end
    of each key path
  • How to require each of the key paths to exist and
    to be unique?

38
Remarks (contd)
  • Language of path expressions
  • Need something more powerful to express Q
  • (person.(mother | father)*, {id})
  • A person element followed by zero or more father
    or mother elements
  • The language of path expressions is provisional
  • A richer path language does not get in the way of
    the theory
39
Key inference
  • In relational DB
  • Infer some keys from the presence of others
  • If (Q, S) is a key and S ⊆ S', then so is (Q, S')
  • Counterpart of the relational inference rule
  • If (Q.Q', {P}) is a key, then so is (Q, {Q'.P})
  • Tree-like structure: if a node is identified in
    a tree, then its ancestors are also determined, i.e.
    if a key path P uniquely identifies a node n in
    [[Q.Q']], then Q'.P is a key path for the ancestor
    of n in [[Q]].

40
Key Inference (contd)
  • If (Q, S) is a key and Q' ⊆ Q, then (Q', S) is
    also a key
  • Any key of the set [[Q]] is also a key for any
    subset of [[Q]]
  • For any finite set Σ of keys, there exists a
    (finite) XML document satisfying Σ
  • Key paths may be missing, e.g. (_*, {id})
  • If key paths were required to exist at all nodes
    specified by the target path, the XML document
    would have to be infinite to satisfy the key
  • Only holds in the absence of DTDs

41
Key Inference
  • Key K = (X, {})
  • DTD D: <!ELEMENT foo (X, X)>
  • D requires a foo element to have two X children,
    while K allows at most one X under the root
  • No XML document both conforms to D and
    satisfies K
  • DTDs interact with XML key constraints

42
Relative Key
43
Relative key - Motivation
  • Motivated by scientific data formats: hierarchical
    structure, large set of entries at the top level
  • Protein sequence database Swiss-Prot
    • Accession number (key) for each entry
    • Within each entry, a sequence of citations, each
      identified by a number 1, 2, 3, ...
  • Linguistic database: recordings of speech
    • Data sets held in files
    • Metadata provided by the directory structure
    • /timit/train/dr1/fcjf0/sa1.wav
    • TIMIT corpus, training set, dialect region 1,
      female speaker, speaker ID "cjf0", sentence text
      "sa1", speech waveform file

44
An absolute key for books
  • An absolute key to identify a book: (book,
    {title})
  • Target path book: starts from the root and
    identifies a collection of books
  • Key path title: its value uniquely identifies a
    book
  • Absolute: defined on the entire document

45
Relative key - definition
  • Like the key of a weak entity set in DB
  • Studios(name, address)
  • Crews(number)
  • A document satisfies a relative key specification
    (Q, (Q', S)) iff for all nodes n in [[Q]], n
    satisfies the key (Q', S).
  • Absolute keys are a special case of relative keys
  • (Q, S) is equivalent to (ε, (Q, S))

46
A relative key for chapters
  • A relative key: (book, (chapter, {number}))
  • A chapter number uniquely identifies a chapter
    within a book!
  • Context path: book
  • Target path: chapter, starting at a book
  • Key path: number
  • Relative: defined on sub-documents, relative to
    the context

47
Absolute/Relative Key
  • What is the difference between
  • Absolute key (book.chapter, {number})
  • Relative key (book, (chapter, {number}))
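
One way to see the difference is to evaluate both keys on a small document in which chapter numbers restart in every book. The sketch below is illustrative only: the sample document and the names `follow`, `value`, `satisfies` and `satisfies_relative` are assumptions, and paths are limited to child element names.

```python
import xml.etree.ElementTree as ET

def follow(node, path):
    """Elements reached from `node` along a dot-separated path of child names."""
    nodes = [node]
    for name in filter(None, path.split(".")):
        nodes = [c for n in nodes for c in n if c.tag == name]
    return nodes

def value(node):
    """Canonical value used to approximate value equality."""
    return (node.tag, tuple(sorted(node.attrib.items())),
            (node.text or "").strip(), tuple(value(c) for c in node))

def satisfies(n, target, key_paths):
    """True iff node n satisfies the key (target, key_paths)."""
    targets = follow(n, target)
    return not any(
        all({value(z) for z in follow(a, p)} & {value(z) for z in follow(b, p)}
            for p in key_paths)
        for i, a in enumerate(targets) for b in targets[i + 1:]
    )

def satisfies_relative(root, context, target, key_paths):
    """True iff every node reached by `context` satisfies (target, key_paths)."""
    return all(satisfies(n, target, key_paths) for n in follow(root, context))

db = ET.fromstring("""<db>
  <book><title>XML</title>
    <chapter><number>1</number></chapter>
    <chapter><number>2</number></chapter>
  </book>
  <book><title>SGML</title>
    <chapter><number>1</number></chapter>
  </book>
</db>""")

print(satisfies(db, "book.chapter", ["number"]))                 # absolute key
print(satisfies_relative(db, "book", "chapter", ["number"]))     # relative key
```

The absolute key fails because chapter 1 occurs in both books, whereas the relative key holds because numbers are unique within each book.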

48
A relative key for sections
  • Key: (book.chapter, (section, {number}))
  • A section number uniquely identifies a section
    within a particular chapter of a particular book!
  • Relative to the chapter containing the section,
    and to the book containing the chapter
49
Transitivity of relative keys
  • A relative key such as (bible.book.chapter, (verse,
    {number})) does not uniquely identify a particular
    verse in the bible
  • Book name, chapter number, verse number → verse

50
immediately precedes relation
  • (Q1, (Q1', S1)) immediately precedes (Q2,
    (Q2', S2)) if Q2 = Q1.Q1'
  • (bible, (book, {name}))
  • immediately precedes
  • (bible.book, (chapter, {number}))
  • Any absolute key immediately precedes itself

51
precede relation
  • Precedes is the transitive closure of the
    immediately precedes relation
  • Qn = Q1.Q1'. ... .Qn-1'
  • (bible, (book, {name})),
  • (bible.book, (chapter, {number})),
  • (bible.book.chapter, (verse, {number}))

52
Transitivity of relative keys
  • A set Σ of relative keys is transitive if for any
    relative key K1 = (Q1, (Q1', S1)) in Σ there is a
    key K2 = (ε, (Q2', S2)) in Σ which precedes K1
  • Any transitive set of relative keys must contain
    some absolute key

53
Transitivity of relative keys - example
  • Transitive set
  • (ε, (bible.book, {name}))
  • (bible.book, (chapter, {number}))
  • (bible.book.chapter, (verse, {number}))
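
Transitivity of a set of relative keys can be tested mechanically. The sketch below rests on several assumptions: keys are `(context, target, key_paths)` tuples of dot-separated strings with `''` for ε, paths are compared as plain strings (adequate only for simple paths without wildcards), and chains of immediately-precedes steps are taken within the given set. The function names are hypothetical.

```python
def concat(p, q):
    """Concatenate two dot-separated paths; '' stands for the empty path."""
    return ".".join(part for part in (p, q) if part)

def is_transitive(keys):
    """True iff every key is reachable from an absolute key via 'immediately precedes'."""
    keys = set(keys)
    reached = {k for k in keys if k[0] == ""}   # absolute keys precede themselves
    while True:
        # k is immediately preceded by r when k's context equals r's context.target
        new = {k for k in keys - reached
               if any(concat(r[0], r[1]) == k[0] for r in reached)}
        if not new:
            return reached == keys
        reached |= new

bible = [
    ("", "bible.book", ("name",)),
    ("bible.book", "chapter", ("number",)),
    ("bible.book.chapter", "verse", ("number",)),
]
print(is_transitive(bible))       # every key is anchored by the absolute key
print(is_transitive(bible[2:]))   # the verse key alone has no absolute key before it
```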

54
Insertion-friendly relative keys
  • Transitive key specification
  • (ε, (university, {name}))
  • (university, (dept.employee, {emp-id}))
  • To identify an employee: university name + emp-id
  • To add an employee: must specify a dept for the
    employee
  • No way to identify a dept
  • Many ways to add an employee!!!

55
Insertion-friendly relative keys (contd)
  • Insert an element in the keyed part of the
    document unambiguously by specifying where to
    insert the element using keys.
  • A set Σ of relative keys is insertion-friendly if
    it is transitive and whenever (Q1, (Q1'.n, S1)) ∈
    Σ, there is a relative key (Q2, (Q2', S2)) ∈ Σ
    where |Q2'| > 0 and Q1.Q1' = Q2.Q2'.
  • n is a node name
  • Every element with a prefix along the path Q1.Q1'
    can be identified through some keys

56
Insertion-friendly relative keys (contd)
  • (ε, (university, {name}))
  • (university, (dept, {dept-name}))
  • (university, (dept.employee, {emp-id}))
  • n = employee

57
Insertion-friendly relative keys (contd)
  • (ε, (university, {name}))
  • (university, (dept, {dept-name}))
  • (university, (dept.employee, {emp-id}))
  • Nothing about the dept is needed to identify
    employees!!!
  • The same anomaly occurs in relational databases
    that are not in second normal form
  • Employees should not be children of department
    nodes, but only of university nodes
  • The linkage between employees and departments
    should be expressed through a foreign key

58
Notation for relative key
  • If a system of relative keys is transitive, it
    forms a hierarchical structure, so we can create a
    compressed syntax for such systems
  • Basic syntactic form
  • Q1{P1, ..., Pk1}.Q2{P1, ..., Pk2}.
    ... .Qn{P1, ..., Pkn}

59
Notation for relative key (contd)
  • bible{}.book{name}.chapter{number}.verse{number}
  • (ε, (bible, {}))
  • (bible, (book, {name}))
  • (bible.book, (chapter, {number}))
  • (bible.book.chapter, (verse, {number}))
  • company{name}.employee{id}, .department{name}
  • company{name}.employee{id}
  • company{name}.department{name}

60
Notation for relative key
  • Compact and understandable
  • Ensures the internal consistency of the document
  • Tells others how to cite a component of our
    document
  • Our document has a structured core

61
Strong keys
62
Stronger definitions of keys
  • Requirements imposed by a key in relational DB
    • Uniqueness of the key
    • Existence of the key
  • Key paths exist and are unique (for 1 ≤ i ≤ n,
    n[[Pi]] contains exactly one node)
  • name is unique at <1>
  • work and num are not unique at this node

63
Stronger definitions of keys (contd)
  • A node n satisfies a strong key specification
    (Q, {P1, ..., Pk}) if
  • for all n' in n[[Q]] and for all Pi, Pi exists and
    is unique at n', and
  • for any n1, n2 in n[[Q]], if for all i, n1[[Pi]] =v
    n2[[Pi]], then n1 = n2
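
The two conditions of a strong key can be sketched directly. As before, this is an approximation on `xml.etree.ElementTree` elements with child-name paths only, and `follow`, `value` and `satisfies_strong` are hypothetical names.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
  <composer>
    <name>J.S. Bach</name><born>1685</born>
    <work num="BWV82"><title>Ich habe genug</title></work>
    <work num="BWV552"/>
  </composer>
  <composer period="baroque">
    <name>G.F. Handel</name>
    <work num="HWV19"><title>Art Thou Troubled?</title></work>
  </composer>
</db>"""

def follow(node, path):
    """Elements reached from `node` along a dot-separated path of child names."""
    nodes = [node]
    for name in filter(None, path.split(".")):
        nodes = [c for n in nodes for c in n if c.tag == name]
    return nodes

def value(node):
    """Canonical value used to approximate value equality."""
    return (node.tag, tuple(sorted(node.attrib.items())),
            (node.text or "").strip(), tuple(value(c) for c in node))

def satisfies_strong(n, target, key_paths):
    """True iff node n satisfies the strong key (target, key_paths)."""
    targets = follow(n, target)
    # 1. every key path must reach exactly one node at every target node
    if any(len(follow(t, p)) != 1 for t in targets for p in key_paths):
        return False
    # 2. no two distinct target nodes agree on all key-path values
    keys = [tuple(value(follow(t, p)[0]) for p in key_paths) for t in targets]
    return len(set(keys)) == len(keys)

root = ET.fromstring(DOC)
print(satisfies_strong(root, "composer", ["name"]))  # name exists once per composer
print(satisfies_strong(root, "composer", ["born"]))  # born is missing for Handel
print(satisfies_strong(root, "composer", ["work"]))  # Bach has two work children
```

Only name qualifies as a strong key for composers in this document; born fails the existence requirement and work fails uniqueness.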

64
Stronger definitions of keys (contd)
  • (_*.person, {id})
  • Any two person elements have unique ids and differ
    on those elements
  • (person, {ε})
  • Unchanged
  • (employees, {})
  • Unchanged

65
Stronger definitions of keys (contd)
  • (_*, {k})
  • Every element has a key k, including elements
    whose name is k
  • Finite satisfiability?
  • Impose an infinite chain of k nodes
  • No finite document satisfies it
  • Because of the requirement of existence of key
    paths
  • Structural constraint

66
Relative Strong Key
  • A document satisfies a strong relative key
    specification (Q, (Q', S)) iff for all nodes n in
    [[Q]], n satisfies the strong key (Q', S)

67
Unconstrained XML: node names as key values
68
Node names as key values
  • Key specification must cover the practical cases
    without using definitions that are too complex to
    allow any kind of reasoning about keys
  • Issue in unconstrained XML: interchanging
    structure (the names) with data (their values)

69
Unconstrained XML
  • Type expressed in the tag
  <db>
    <parts>
      <widget>
        <id> 123 </id> <w> 1.5 </w> </widget>
      <widget>
        <id> 234 </id> <w> 2.5 </w> </widget>
      <gadget>
        <id> 123 </id> <w> 3.2 </w> </gadget>
    </parts>
  </db>
  • Type expressed as a subelement
  <db>
    <parts>
      <part>
        <type> widget </type>
        <id> 123 </id>
        <w> 1.5 </w> </part>
      <part>
        <type> widget </type>
        <id> 234 </id>
        <w> 2.5 </w> </part>
      <part>
        <type> gadget </type>
        <id> 123 </id>
        <w> 3.2 </w> </part>
    </parts>
  </db>

70
Node names as key values (contd)
  • Unconstrained XML
    • The type of a part is expressed in the tag
    • Key constraint: parts.widget{id}, .gadget{id}
  • Alternative XML representation
    • The type is expressed as an attribute or
      subelement of a part element
    • Key constraint: parts.part{type, id}

71
Introducing a new part type
  • Introduce a thingy
  • Unconstrained
    • Change the key specification:
      parts.widget{id}, .gadget{id}, .thingy{id}
  • Alternative
    • No change: parts.part{type, id}
  • The ability to interchange structure and data is
    supposed to be one of the strong points of
    semistructured data and XML

72
Solution
  • Adding a virtual subelement node-name to each
    named node, whose value consists of the node name
  • Key: parts._{node-name, id}
  • Does not alter any of the properties expected to
    hold for keys
  • Account for any practical use of tag names in keys
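
The effect of the virtual node-name subelement can be sketched on the unconstrained parts document. The sketch compares stripped text rather than full value equality, and the helper names `field` and `clashes` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<db><parts>
  <widget><id> 123 </id><w> 1.5 </w></widget>
  <widget><id> 234 </id><w> 2.5 </w></widget>
  <gadget><id> 123 </id><w> 3.2 </w></gadget>
</parts></db>"""

def field(el, name):
    """Value of a key field; 'node-name' is the virtual subelement holding the tag."""
    return el.tag if name == "node-name" else (el.findtext(name) or "").strip()

def clashes(root, fields):
    """Field-value tuples shared by more than one child of a parts element."""
    counts = Counter(
        tuple(field(el, f) for f in fields) for el in root.findall("parts/*")
    )
    return [key for key, count in counts.items() if count > 1]

root = ET.fromstring(DOC)
print(clashes(root, ["id"]))               # id 123 is used by a widget and a gadget
print(clashes(root, ["node-name", "id"]))  # adding the node name removes the clash
```

Adding a new part type such as thingy requires no change to the key, since the tag name is simply another key value.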

73
Conclusion
  • A new key constraint language for XML
  • independent of any schema specifications for XML
  • powerful enough to express absolute and relative
    keys
  • simple enough to be reasoned about efficiently
  • In contrast to their relational counterparts
  • XML keys are more complex
  • the analyses of XML keys are far more intricate

74
References
  • Peter Buneman, Susan Davidson, Wenfei Fan, Carmem
    Hara, and Wang-Chiew Tan. Keys for XML. WWW10
    (2001). http://db.cis.upenn.edu/DL/xmlkeys.ps
  • Peter Buneman, Susan Davidson, Wenfei Fan, Carmem
    Hara, and Wang-Chiew Tan. Reasoning about keys
    for XML. University of Pennsylvania, Technical
    Report MS-CIS-00-26, 2000.
    http://db.cis.upenn.edu/DL/absolute-full.ps