Constraints for XML - PowerPoint PPT Presentation

Provided by: CIS471

1
Constraints for XML
  • Susan B. Davidson
  • University of Pennsylvania
  • Wenfei Fan
  • Bell Labs and Temple University

2
Outline
  • XML, Web data and database techniques
  • XML specifications: types and constraints
  • XML constraints: absolute/relative keys and
    foreign keys
  • Analysis of XML constraints: consistency and
    implication
  • Constraints in practice
  • Area references:
  • "A Web odyssey: From Codd to XML", V. Vianu, PODS
    2001. http://www.cis.upenn.edu/wfan/PODS2001/proceedings.html
  • "Constraints for semistructured data and XML", P.
    Buneman, W. Fan, J. Simeon and S. Weinstein,
    SIGMOD Record 30(1), March 2001.
    http://www.cis.temple.edu/fan/papers/xml/survey.ps.gz

3
  • Part 1. XML: a brief introduction

4
What is wrong with HTML?
  • HTML (HyperText Markup Language) is good for
    presentation, but does not help information
    extraction by programs.
  • <h3> George Bush </h3>
  • <b> Taking Eng 055 </b> <br>
  • <em> GPA 1.5 </em> <br>
  • <h3> Eng 055 </h3>
  • <b> Spelling </b>
  • HTML tags:
  • predefined and fixed
  • describing display format rather than the
    structure of data

5
eXtensible Markup Language
  • XML tags:
  • user-defined, arbitrarily nested
  • describing the structure of the data rather than
    display
  • <student id="123">
  • <name>
  • <firstName> George </firstName> <lastName>
    Bush </lastName>
  • </name>
  • <taking> Eng 055 </taking>
  • <GPA> 1.5 </GPA>
  • </student>
  • <course cno="Eng 055">
  • <title> Spelling </title>
  • </course>

6
XML basics
  • Element: the segment between a start tag and a
    corresponding end tag, e.g., student, name.
  • Subelement: the relation between an element and its
    component elements, e.g., name to student.
  • Attribute: marked text within a start tag, e.g.,
    id.
  • Text: the single basic type (PCDATA), e.g.,
    Bush.
  • XML elements are ordered, whereas attributes are
    not.
  • <student id="123">
  • <name>
  • <firstName> George </firstName> <lastName>
    Bush </lastName>
  • </name>
  • <taking> Eng 055 </taking> <GPA>
    1.5 </GPA>
  • </student>
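
The four notions above can be seen by parsing the sample document. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name describe is made up for illustration and is not part of the slides.

```python
import xml.etree.ElementTree as ET

STUDENT_XML = """
<student id="123">
  <name>
    <firstName>George</firstName> <lastName>Bush</lastName>
  </name>
  <taking>Eng 055</taking> <GPA>1.5</GPA>
</student>
"""

def describe(xml_text):
    """Return (root attributes, ordered child tags, text of each leaf element)."""
    root = ET.fromstring(xml_text)
    attributes = dict(root.attrib)               # attributes: unordered name/value pairs
    child_tags = [child.tag for child in root]   # subelements: document order is kept
    leaf_text = {el.tag: el.text.strip()         # text (PCDATA) carried by leaf elements
                 for el in root.iter() if len(el) == 0 and el.text}
    return attributes, child_tags, leaf_text

attributes, child_tags, leaf_text = describe(STUDENT_XML)
print(attributes)
print(child_tags)
print(leaf_text)
```

The list of child tags preserves document order, while the attribute dictionary carries no ordering guarantee that an application should rely on.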

7
Representing relational databases
  • A relational database for school, with three
    relations: student, course and enroll

8
XML representation
  • <school>
  • <student id="001">
  • <name> Joe </name> <gpa> 3.0 </gpa>
  • </student>
  • <student id="002">
  • <name> Mary </name> <gpa> 4.0
    </gpa>
  • </student>
  • <course cno="331">
  • <title> DB </title> <credit> 3.0
    </credit>
  • </course>
  • <course cno="350">
  • <title> Web </title> <credit> 3.0
    </credit>
  • </course>

9
XML representation
  • <enroll>
  • <id> 001 </id> <cno> 331 </cno>
  • </enroll>
  • <enroll>
  • <id> 001 </id> <cno> 350 </cno>
  • </enroll>
  • <enroll>
  • <id> 002 </id> <cno> 331 </cno>
  • </enroll>
  • </school>
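
The encoding shown on these two slides can be produced mechanically from the three relations. Below is a minimal sketch with the standard library; the function name relations_to_xml and the tuple layouts are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def relations_to_xml(students, courses, enroll):
    """Encode three relations as one <school> element, one child per tuple."""
    school = ET.Element("school")
    for sid, name, gpa in students:
        s = ET.SubElement(school, "student", id=sid)   # key column becomes an attribute
        ET.SubElement(s, "name").text = name
        ET.SubElement(s, "gpa").text = gpa
    for cno, title, credit in courses:
        c = ET.SubElement(school, "course", cno=cno)
        ET.SubElement(c, "title").text = title
        ET.SubElement(c, "credit").text = credit
    for sid, cno in enroll:
        e = ET.SubElement(school, "enroll")            # enroll tuples use subelements only
        ET.SubElement(e, "id").text = sid
        ET.SubElement(e, "cno").text = cno
    return school

school = relations_to_xml(
    students=[("001", "Joe", "3.0"), ("002", "Mary", "4.0")],
    courses=[("331", "DB", "3.0"), ("350", "Web", "3.0")],
    enroll=[("001", "331"), ("001", "350"), ("002", "331")],
)
print(ET.tostring(school, encoding="unicode"))
```

Nothing in the resulting document records that enroll.id refers to a student or that id is a key, which is the gap the later parts of the talk address.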

10
Representing object-oriented databases
  • An object-oriented database for school:
  • student: {s1, s2}
  • value(s1) = (id: 001, name: Joe, gpa: 3.0,
    taking: {c1, c2})
  • value(s2) = (id: 002, name: Mary, gpa: 4.0,
    taking: {c1})
  • course: {c1, c2}
  • value(c1) = (cno: 331, title: DB, credit: 3.0,
    taken_by: {s1, s2})
  • value(c2) = (cno: 350, title: Web, credit: 3.0,
    taken_by: {s1})

11
XML representation
  • <school>
  • <student id="s1">
  • <id> 001 </id>
  • <name> Joe </name>
  • <gpa> 3.0 </gpa>
  • <taking idrefs="c1 c2" />
  • </student>
  • <student id="s2">
  • <id> 002 </id>
  • <name> Mary </name>
  • <gpa> 4.0 </gpa>
  • <taking idrefs="c1" />
  • </student>

12
XML representation
  • <course id="c1">
  • <cno> 331 </cno>
  • <title> DB </title>
  • <credit> 3.0 </credit>
  • <taken_by idrefs="s1 s2" />
  • </course>
  • <course id="c2">
  • <cno> 350 </cno>
  • <title> Web </title>
  • <credit> 3.0 </credit>
  • <taken_by idrefs="s1" />
  • </course>
  • </school>

13
The XML tree model
  • An XML document is typically modeled as a
    node-labeled tree.
  • Element node: internal, with a name (tag) and
    children (subelements and attributes), e.g.,
    student, name.
  • Attribute node: leaf with a name (tag) and text,
    e.g., @id.
  • Text node: leaf with text (string) but without a
    name.

14
XML and Web data
  • Web data is semistructured: schemaless, irregular
  • Traditional database systems can't model Web data
  • XML model: a special case of the semistructured
    data model
  • flexible: models Web data, with references
    (foreign keys)
  • powerful: represents data from databases

15
XML in data exchange
  • XML: the primary standard for data exchange on
    the Web
  • across formats/platforms/enterprises
  • generated and consumed by applications
  • healthcare industry, e-commerce, digital library, ...

[Figure: XML exchanged over the Web between an OODB (Unix) and an RDB (MS)]
16
XML in data integration
  • mediator/wrapper vs. virtual view of a database
  • data warehouse vs. materialized view of a
    database
  • Web databases, e-commerce

[Figure: clients query a mediator (XML) that sits on top of wrappers over a file, the Web and a DB]
17
XML in e-commerce
  • A site for a car dealer provides a uniform query
    interface for price, rating, review and
    competitors' price/availability.
  • Integrating local data, national archive for
    safety records, review data, competitors' sites
  • e-commerce: query interface (XML), integration
    system (XML), database system, workflow management

[Figure: clients use a query interface and warehouse (XML) fed by integrators over the local DB, national records, review data and competitor sites]
18
Database techniques for managing XML data
  • specifying XML: types and constraints
  • querying XML: XSL, XQL, XML-QL, Lorel, UnQL
  • updating XML: constraints and concurrency control
  • integrating XML: database transformations and
    integration
  • storing XML: efficient storage and access
    methods, indexing
  • These are crucial for Web applications:
  • e-commerce, digital library, data exchange, Web
    databases,
  • Web site management, ...
  • XML players: W3C, Microsoft, HP, Oracle, Adobe,
    ...

19
  • Part 2. XML specification: types and constraints

20
A relational schema (SQL)
  • Types and constraints
  • create table students
  • ( id char(9),
  • name char(20),
  • primary key (id))
  • create table courses
  • ( cno char(9),
  • title char(20),
  • primary key (cno))
  • create table enroll
  • ( id char(9),
  • cno char(9),
  • primary key (id, cno),
  • foreign key (id) references students,
  • foreign key (cno) references courses)

21
An object-oriented schema (ODMG)
  • Types and constraints
  • class student
  • (key id, extent students)
  • attribute string id
  • attribute string name
  • relationship set<course> taking
    inverse course::takenBy
  • class course
  • (key cno, extent courses)
  • attribute string cno
  • attribute string title
  • relationship set<student> takenBy
    inverse student::taking
  • The distinction between types and constraints is
    dictated by what programming languages treat as
    types

22
XML specification: DTD
  • DTD (Document Type Definition)
  • Type:
  • <!ELEMENT db (student*, course*)>
  • <!ELEMENT student (name, taking*)>
  • <!ELEMENT course (title, taken_by*)>
  • <!ELEMENT taking EMPTY>
  • <!ELEMENT taken_by EMPTY>
  • Constraints: ID and IDREF attributes in DTD
  • <!ATTLIST student id ID #REQUIRED>
  • <!ATTLIST course cno ID #REQUIRED>
  • <!ATTLIST taking cno IDREF #IMPLIED>
  • <!ATTLIST taken_by id IDREF #IMPLIED>
  • Others: XML Schema, XML-Data, XDR, SOX,
    Schematron, DSD, ...

23
Capturing oids with IDs
  • Recall our XML encoding of our OODB
  • student: {s1, s2}
  • course: {c1, c2}
  • <school>
  • <student id="s1">
  • <id> 001 </id> <name> Joe </name>
  • <gpa> 3.0 </gpa> <taking
    idrefs="c1 c2" />
  • . . .
  • <course id="c2">
  • <cno> 350 </cno> <title> Web </title>
  • <credit> 3.0 </credit> <taken_by
    idrefs="s1" />
  • </course>
  • </school>

24
A DTD for the OODB
  • Types:
  • <!ELEMENT db (student*, course*)>
  • <!ELEMENT student (id, name, gpa,
    taking)>
  • <!ELEMENT course (cno, title, credit,
    taken_by)>
  • <!ELEMENT taking EMPTY>
  • Constraints:
  • <!ATTLIST student id ID #REQUIRED>
  • <!ATTLIST course id ID #REQUIRED>
  • <!ATTLIST taking idrefs IDREFS #IMPLIED>
  • <!ATTLIST taken_by idrefs IDREFS #IMPLIED>
  • ID vs. object-identifier (oid)

25
  • Part 3. XML constraints: keys and foreign keys

26
Keys and foreign keys for XML
  • Keys: locating a specific object, an invariant
    connection from an object in the real world to
    its representation
  • student.@id → student
  • course.@cno → course
  • Foreign keys: referencing an object from another
    object
  • taking.@cno ⊆ course.@cno, course.@cno →
    course
  • taken_by.@id ⊆ student.@id, student.@id →
    student
  • Central issues: value equality, typing, scoping,
    absolute/relative, ...
  • Key specifications:
  • the XML standard (DTD), XML Schema, XML Data,
    ...

27
Specification of student in XML-Schema
  • <element name="student">
  • <complexType>
  • <sequence>
  • <element name="name" type="string"/>
  • <element name="taking" minOccurs="0"
    maxOccurs="unbounded">
  • <complexType>
  • <attribute name="cno" type="string"/>
  • </complexType>
  • </element>
  • </sequence>
  • <attribute name="id" type="string"/>
  • </complexType>

28
Keys and foreign keys in student
  • <key name="k1">
  • <selector xpath="."/>
  • <field xpath="@id"/>
  • </key>
  • <keyref name="fk1" refer="k2">
  • <selector xpath="taking"/>
  • <field xpath="@cno"/>
  • </keyref>
  • </element>

29
Specification of course in XML-Schema
  • <element name="course">
  • <complexType>
  • <sequence>
  • <element name="title" type="string"/>
  • <element name="taken_by" minOccurs="0"
    maxOccurs="unbounded">
  • <complexType>
  • <attribute name="id" type="string"/>
  • </complexType>
  • </element>
  • </sequence>
  • <attribute name="cno" type="string"/>
  • </complexType>

30
Keys and foreign keys in course
  • <key name="k2">
  • <selector xpath="."/>
  • <field xpath="@cno"/>
  • </key>
  • <keyref name="fk2" refer="k1">
  • <selector xpath="taken_by"/>
  • <field xpath="@id"/>
  • </keyref>
  • </element>

31
Keys in XML-Data
  • <elementType id="student">
  • <element id="p1" type="id"/>
  • <element type="name"/>
  • <element type="taking"
    occurs="ONEORMORE"/>
  • <key id="k1"><keyPart href="p1"/></key>
  • </elementType>
  • <elementType id="course">
  • <element id="p2" type="cno"/>
  • <element type="title"/>
  • <element type="taken_by"
    occurs="ONEORMORE"/>
  • <key id="k2"><keyPart href="p2"/></key>
  • </elementType>

32
Foreign keys in XML-Data
  • <elementType id="taking">
  • <element type="cno"/>
  • <domain type="student"/>
  • <foreignKey range="course" key="k2"/>
  • </elementType>
  • <elementType id="taken_by">
  • <element type="id"/>
  • <domain type="course"/>
  • <foreignKey range="student" key="k1"/>
  • </elementType>

33
Constraints are important for XML
  • XML is semistructured and may not come with a
    DTD/type
  • constraints are a fundamental part of the
    semantics
  • constraints have proved useful in:
  • semantic specifications: obvious
  • query optimization: chasing algorithm
  • database conversion to an XML encoding: a must
  • data integration: information preservation
  • update anomaly prevention: classical
  • normal forms for XML specifications: BCNF,
    3NF
  • efficient storage/access: indexing
  • ...

34
The limitations of the XML standard
  • ID and IDREF attributes in DTD
  • <!ATTLIST student id ID #REQUIRED>
  • <!ATTLIST course cno ID #REQUIRED>
  • <!ATTLIST taking cno IDREF #IMPLIED>
  • <!ATTLIST taken_by id IDREF #IMPLIED>
  • Scoping:
  • ID: unique within the entire document (like oids)
  • IDREF: untyped; one has no control over what it
    points to
  • unary and primary
  • defined in a type
  • A mixture of relational keys and object
    identities (oids)
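
The scoping limitation can be made concrete with a hand-written check. The sketch below is an assumption-laden illustration, not a DTD validator: the two dictionaries simply restate which attributes the ATTLIST lines above declare as ID and IDREF, and the function name id_problems is hypothetical.

```python
import xml.etree.ElementTree as ET

ID_ATTRS = {"student": "id", "course": "cno"}       # attributes declared with type ID
IDREF_ATTRS = {"taking": "cno", "taken_by": "id"}   # attributes declared with type IDREF

def id_problems(root):
    """Report duplicate IDs and dangling IDREFs, with document-wide scope."""
    seen, problems = {}, []
    for el in root.iter():
        attr = ID_ATTRS.get(el.tag)
        value = el.get(attr) if attr else None
        if value is not None:
            if value in seen:                        # one ID namespace for all element types
                problems.append("duplicate ID " + value)
            seen[value] = el.tag
    for el in root.iter():
        attr = IDREF_ATTRS.get(el.tag)
        value = el.get(attr) if attr else None
        if value is not None and value not in seen:  # only existence is checked, not the target type
            problems.append("dangling IDREF " + value)
    return problems

doc = ET.fromstring(
    '<db><student id="s1"><name>Joe</name><taking cno="s1"/></student>'
    '<course cno="c1"><title>DB</title></course></db>'
)
print(id_problems(doc))
```

The sample document passes even though the taking element refers to a student rather than a course, which is exactly the "untyped" behaviour the slide criticises.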

35
The limitations of XML Schema
  • Keys defined with a list of XPath expressions
  • (student, firstName, lastName)
  • (student, lastName, firstName)
  • (student, lastName, lastName,
    firstName)
  • Equivalence/containment of XPath expressions is
    unresolved
  • No efficient way to tell whether two keys are
    equivalent
  • The notion of value equality is too restricted
    (text only)
  • The notion of relative keys is not addressed
  • Mild generalizations of relational keys fail to
    capture some fundamental semantics associated
    with the hierarchical structure of XML data

36
To overcome the limitations [WWW10]
  • Absolute key: (Q, {P1, ..., Pk})
  • target path Q: to identify a target set [[Q]] of
    nodes on which the key is defined (vs. relation)
  • a set of key paths {P1, ..., Pk}: to provide
    an identification for nodes in [[Q]] (vs. key
    attributes)
  • semantics: for any two nodes in [[Q]], if they
    have all the key paths and agree on them up to
    value equality, then they must be the same node
    (value equality and node identity)
  • (_*.student, {@id})
  • (_*.student, {_*.name})
  • (_*.enroll, {@id, @cno})
  • (_*, {@id})

37
Value equality on trees
  • Two nodes are value equal iff
  • either they are text nodes (PCDATA) with the same
    value
  • or they are attributes with the same tag and the
    same value
  • or they are elements having the same tag and
    their children are pairwise value equal
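
The recursive definition above translates almost directly into code. The following is a minimal sketch over ElementTree elements; it assumes that text is compared after stripping whitespace and that mixed content (text between subelements) can be ignored, neither of which is stated on the slide.

```python
import xml.etree.ElementTree as ET

def value_equal(a, b):
    """Value equality on element trees, following the three cases of the definition."""
    if a.tag != b.tag:
        return False
    if a.attrib != b.attrib:                       # attributes: same names and values, order-free
        return False
    if (a.text or "").strip() != (b.text or "").strip():   # text nodes: same string value
        return False
    if len(a) != len(b):
        return False
    return all(value_equal(x, y) for x, y in zip(a, b))    # children: pairwise, in order

a = ET.fromstring('<name x="1" y="2"><first>George</first><last>Bush</last></name>')
b = ET.fromstring('<name y="2" x="1"><first>George</first><last>Bush</last></name>')
c = ET.fromstring('<name x="1" y="2"><last>Bush</last><first>George</first></name>')
print(value_equal(a, b), value_equal(a, c))
```

Two distinct nodes can be value equal, so value equality is weaker than node identity; a key is what forces the two notions to coincide on a target set.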

38
Capturing the semistructured nature
  • independent of types
  • no structural requirement: tolerating
    missing/multiple paths
  • (person, {name}), (person, {name, @phone})

39
Path expressions
  • A simple yet powerful regular path language
  • q ::= ε | l | q.q | _*
  • ε: empty path
  • l: tag
  • q.q: concatenation
  • _*: combination of wildcard and the Kleene
    closure
  • Theorem. The containment and equivalence problems
    for these path expressions are finitely
    axiomatizable and decidable in quadratic time.
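
To make the path language concrete, here is a small sketch that tests whether a concrete sequence of tags is matched by a path expression. It only decides membership of one label path; it is not the quadratic containment algorithm of the theorem. The string syntax (steps separated by ".", wildcard written "_*") and the function names are assumptions for illustration.

```python
def parse(path):
    """Split a dotted path such as "_*.student.name" into steps; "" is the empty path."""
    return [step for step in path.split(".") if step]

def matches(expr, tags):
    """True if the list of concrete tags is in the language of the parsed expression."""
    if not expr:
        return not tags                      # empty path matches only the empty sequence
    head, rest = expr[0], expr[1:]
    if head == "_*":
        # "_*" may absorb any number of tags, including none
        return any(matches(rest, tags[i:]) for i in range(len(tags) + 1))
    return bool(tags) and tags[0] == head and matches(rest, tags[1:])

print(matches(parse("_*.student"), ["school", "student"]))
print(matches(parse("book.chapter"), ["book"]))
```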

40
Relative constraints
  • How to identify in a document
  • a book?
  • a chapter?
  • a section?

41
A key constraint language K
  • Relative key: (Q, K)
  • path Q identifies a set [[Q]] of nodes, called
    the context
  • K = (Q', {P1, ..., Pk}) is a key on
    sub-documents rooted at nodes in [[Q]] (relative
    to Q).
  • Example. (book, (chapter, {number}))
  • (book.chapter, (section, {number}))
  • (book, {title}) -- absolute key
  • Analogous to keys for weak entities in a
    relational database
  • the key of the parent entity
  • an identification relative to the parent entity

42
Examples of K constraints
  • absolute: (book, {title})
  • relative: (book, (chapter, {number}))
  • relative: (book.chapter, (section, {number}))

43
Absolute vs. relative keys
  • Absolute keys are a special case of relative keys:
  • (Q, K) when Q is the empty path
  • Absolute keys are scoped within the context of
    the entire document, while relative keys are
    scoped within the context of a sub-document
  • Important for hierarchically structured data:
    XML, scientific databases, ...
  • absolute: (book, {title})
  • relative: (book, (chapter, {number}))
  • relative: (book.chapter, (section, {number}))
  • XML keys are more complex than relational keys!
44
Inverse constraints
  • Recall inverse constraints in OODB
  • class student
  • (key id, extent students)
  • attribute string id
  • attribute string name
  • relationship set<course> taking
    inverse course::takenBy
  • class course
  • (key cno, extent courses)
  • attribute string cno
  • attribute string title
  • relationship set<student> takenBy
    inverse student::taking
  • Inverse constraints:
  • if student s is taking course c, then c must be
    taken by s
  • if course c is taken by student s, then s must be
    taking c.

45
Inverse constraints for XML [PODS00]
  • <!ELEMENT student (name, taking*)>
  • <!ELEMENT course (title, taken_by*)>
  • <!ATTLIST student id ID #REQUIRED>
  • <!ATTLIST course cno ID #REQUIRED>
  • <!ATTLIST taking cno IDREF #IMPLIED>
  • <!ATTLIST taken_by id IDREF #IMPLIED>
  • Inverse constraints:
  • student(id).taking(cno) ⇌ course(cno).taken_by(id)
  • for any student s and any course c,
  • if c.cno ∈ s.taking.cno, then s.id ∈ c.taken_by.id
  • if s.id ∈ c.taken_by.id, then c.cno ∈ s.taking.cno
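
The two implications say that the taking pairs and the taken_by pairs describe the same relation. A minimal sketch of that check over a parsed document follows; the function name and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

def satisfies_inverse(root):
    """True if (student id, course cno) pairs agree in both directions."""
    taking = {(s.get("id"), t.get("cno"))
              for s in root.iter("student") for t in s.findall("taking")}
    taken_by = {(t.get("id"), c.get("cno"))
                for c in root.iter("course") for t in c.findall("taken_by")}
    return taking == taken_by

good = ET.fromstring(
    '<db><student id="s1"><taking cno="c1"/></student>'
    '<course cno="c1"><taken_by id="s1"/></course></db>'
)
bad = ET.fromstring(
    '<db><student id="s1"><taking cno="c1"/></student>'
    '<course cno="c1"/></db>'
)
print(satisfies_inverse(good), satisfies_inverse(bad))
```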

46
Other constraints [PODS00]
  • Path inclusion constraints:
  • student.taking.cno ⊆ course.cno
  • course.taken_by.id ⊆ student.id
  • Path functional constraints:
  • <!ELEMENT professor (name, research,
    course)>
  • <!ELEMENT course (cno, title,
    credit)>
  • professor.research → professor.course.cno
  • value equality on both sides

47
  • Part 4. XML constraint analysis

48
Consistency of an XML specification
  • Given D: a DTD
  • Σ: a set of keys and foreign keys
  • Consistency: is there an XML document that both
    conforms to D and satisfies Σ?
  • Example.
  • DTD D: <!ELEMENT foo (X, X)>
  • <!ELEMENT X EMPTY>
  • constraints Σ: (X, {ε})
  • One wants to know whether an XML specification
    makes sense!

49
Implication of XML constraints
  • Given D: a DTD
  • Σ: a set of keys and foreign keys
  • φ: a property (a key or foreign key)
  • Implication: is it the case that for any XML
    document, if it conforms to D and satisfies Σ,
    then it must satisfy φ?
  • The need for studying implication:
  • data integration: constraints cannot be checked
    directly at the mediator level
  • design theory for XML specifications: along the
    same lines as database normalization
  • query optimization (chase), . . .

50
Consistency analysis
  • Trivial for relational databases: given any
    schema and keys, foreign keys, one can always
    find a nonempty instance of the schema satisfying
    the constraints.
  • Hard for XML: XML specifications with DTD and
    keys, foreign keys may not be consistent!
  • DTDs interact with constraints in an intricate
    way.

51
The interaction between DTDs and constraints
  • DTD D: <!ELEMENT foo (X, X)>
  • <!ELEMENT X EMPTY>
  • key φ: (X, {ε})
  • (1) conforms to D: two X nodes under the root
  • (2) satisfies φ: no two X nodes under the root
    can have the same value
  • There is no XML tree both conforming to D and
    satisfying φ

52
Consistency of DTDs
  • There is need for consistency analysis even in
    the absence of constraints
  • Example. DTD:
  • <!ELEMENT foo (foo)>
  • There exists no XML document that conforms to the
    DTD!
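
The foo example fails because no finite tree can ever bottom out. One way to test this is the same fixpoint used to find productive nonterminals of a grammar. The sketch below assumes a simplified DTD representation invented for illustration: each element type maps to a list of alternatives, and each alternative lists only the child types that are required (optional and starred children are dropped beforehand). This simple loop is quadratic rather than linear.

```python
def consistent(dtd, root):
    """True if some finite document with the given root type conforms to the DTD."""
    productive = set()       # element types known to have at least one finite tree
    changed = True
    while changed:
        changed = False
        for etype, alternatives in dtd.items():
            if etype in productive:
                continue
            # an element type is productive if one alternative needs only productive children
            if any(all(child in productive for child in alt) for alt in alternatives):
                productive.add(etype)
                changed = True
    return root in productive

print(consistent({"foo": [["foo"]]}, "foo"))                 # <!ELEMENT foo (foo)>
print(consistent({"foo": [["X", "X"]], "X": [[]]}, "foo"))   # foo (X, X), X EMPTY
```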

53
A simple constraint language, C
  • absolute key: τ[X] → τ. A document satisfies
    the key iff
  • ∀ x, y ∈ ext(τ) (x[X] =v y[X] → x = y)
  • absolute foreign key: an inclusion constraint
    τ1[X] ⊆ τ2[Y] and a key τ2[Y] → τ2. A document
    satisfies the foreign key iff it satisfies the
    key and
  • ∀ x ∈ ext(τ1) ∃ y ∈ ext(τ2) (x[X] =v y[Y])
  • where
  • τ, τ1, τ2: element types
  • X, Y: sets (sequences) of attributes
  • ext(τ): the set of all τ elements in the
    document
  • =v: value equal.
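
Both satisfaction conditions can be checked directly on a parsed document. The sketch below assumes that every τ element carries all attributes in X, so that x[X] is a tuple of attribute strings and value equality reduces to string equality; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def ext(root, tau):
    """All elements of type tau in the document."""
    return list(root.iter(tau))

def satisfies_key(root, tau, X):
    """tau[X] -> tau: no two distinct tau elements agree on all attributes in X."""
    values = [tuple(x.get(a) for a in X) for x in ext(root, tau)]
    return len(values) == len(set(values))

def satisfies_foreign_key(root, tau1, X, tau2, Y):
    """tau1[X] included in tau2[Y], together with the key tau2[Y] -> tau2."""
    if not satisfies_key(root, tau2, Y):
        return False
    targets = {tuple(y.get(a) for a in Y) for y in ext(root, tau2)}
    return all(tuple(x.get(a) for a in X) in targets for x in ext(root, tau1))

doc = ET.fromstring(
    '<school>'
    '<student id="001"><taking cno="331"/><taking cno="350"/></student>'
    '<student id="002"><taking cno="331"/></student>'
    '<course cno="331"/><course cno="350"/>'
    '</school>'
)
print(satisfies_key(doc, "student", ["id"]))
print(satisfies_foreign_key(doc, "taking", ["cno"], "course", ["cno"]))
```

Checking one document is easy; the hardness results that follow concern whether any conforming document exists at all, and what a set of constraints implies.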

54
Examples of C constraints
  • Specifying keys and foreign keys in terms of
    element types, rather than paths (in the flavor
    of XML-Data).
  • student.@id → student
  • course.@cno → course
  • taking.@cno ⊆ course.@cno
  • person[@firstName, @lastName] → person
  • C constraints vs. K constraints:
  • absolute key τ[X] → τ in C is equivalent to an
    absolute key in K: (_*.τ, X)
  • absolute keys are a special case of K constraints
  • absolute foreign key τ1[X] ⊆ τ2[Y] and τ2[Y] →
    τ2 of C is not expressible in K

55
Unary constraints
  • Keys and foreign keys defined in terms of
    single-attribute.
  • Example.
  • student.@id → student
  • course.@cno → course
  • taking.@cno ⊆ course.@cno

56
Analysis of C constraints [PODS01]
  • Theorem. In the presence of DTDs, the following
    problems are undecidable for keys and foreign
    keys of C:
  • the consistency problem
  • the implication problem.
  • As opposed to the trivial consistency analysis in
    relational databases.
  • These negative results carry over to:
  • other schema languages: XML Schema, XML Data,
    XDuce, ...
  • other constraint languages: XML Schema, XML
    Data, ...

57
Analysis of unary constraints
  • Theorem. In the presence of DTDs, for unary
    constraints of C:
  • the consistency problem is NP-complete
  • the implication problem is coNP-complete.
  • In relational databases, implication of unary
    keys and foreign keys is decidable in linear
    time.
  • Primary key restriction: at most one key for each
    element type.
  • Theorem. In the presence of DTDs, the consistency
    and implication problems remain intractable for
    unary keys and foreign keys of C even under the
    primary key restriction.
  • Keys specified with ID attributes are primary and
    unary!

58
A simple language for relative constraints, R
  • relative key: (Q, τ[X] → τ). A document
    satisfies the key iff
  • ∀ x ∈ [[Q]] ∀ y, z ∈ ext(x.τ) (y[X] =v z[X]
    → y = z)
  • relative foreign key: (Q1, τ1[X]) ⊆ (Q2, τ2[Y])
    and a key (Q2, τ2[Y] → τ2). A document
    satisfies the foreign key iff it satisfies the
    key and
  • ∀ x ∈ [[Q1]] ∃ y ∈ [[Q2]] (ext(x.τ1)[X] ⊆v
    ext(y.τ2)[Y])
  • where
  • Q, Q1, Q2: path expressions
  • τ, τ1, τ2: element types; X, Y: attributes
  • ext(x.τ): the set of τ sub-elements of x
  • ⊆v: set inclusion defined in terms of value
    equality

59
Examples of R constraints
  • Specifying relative constraints in terms of
    element types
  • (CS.student, (taking.@cno → taking))
  • (_*, (course.@cno → course))
  • (CS.student, taking.@cno) ⊆ (CS,
    course.@cno)
  • (CS, course.@cno) ⊆ (CS.student,
    taking.@cno)
  • R constraints vs. K constraints:
  • key (Q, τ[X] → τ) of R is equivalent to
  • (Q, (τ, X))
  • relative keys are a special case of K constraints
  • foreign key (Q1, τ1[X]) ⊆ (Q2, τ2[Y]) and (Q2,
    τ2[Y] → τ2) of R is not expressible in K

60
Analysis of relative constraints
  • Theorem. In the presence of DTDs, the following
    problems are undecidable even for unary relative
    constraints of R:
  • the consistency problem
  • the implication problem.
  • The analysis of XML constraints is far more
    intriguing than its database counterparts!

61
Tractable special cases
  • Theorem. In the absence of constraints, the
    consistency problem for arbitrary DTDs is
    decidable in linear time.
  • Theorem. When the DTD is fixed, the consistency and
    implication problems for unary constraints of C
    are in PTIME.
  • Theorem. When only keys of C are considered, the
    consistency and implication problems are
    decidable in linear time in the presence of DTDs.

62
Constraint analysis in the absence of DTDs
  • Regardless of DTDs
  • Consistency: given any set of keys and foreign
    keys, can they be satisfied by an XML document?
  • Implication: given a set Σ of keys and foreign
    keys, does it follow that all documents
    satisfying Σ must also satisfy another key or
    foreign key?
  • The need for investigating these issues:
  • many XML documents do not come with a DTD
  • one is interested in implication that generally
    holds for all kinds of documents, regardless of
    their DTDs.

63
Analysis of C constraints [PODS00]
  • Without DTDs, the consistency problem becomes
    trivial: any keys and foreign keys of C are
    satisfiable.
  • Theorem. In the absence of DTDs, the implication
    problem for C constraints remains undecidable.
  • Theorem. In the absence of DTDs, the implication
    problem is decidable in PSPACE for keys and
    foreign keys of C under the primary key
    restriction.
  • Theorem. In the absence of DTDs, the implication
    problem is decidable in linear time for unary
    keys and foreign keys of C.
  • These results also hold when inverse constraints
    are allowed.

64
Analysis of K constraints [DBPL01]
  • Without DTDs, the consistency problem for K also
    becomes trivial: any keys of K are satisfiable.
  • Theorem. In the absence of DTDs, the implication
    problem for keys of K is finitely axiomatizable
    and is decidable in PTIME.
  • Theorem. In the absence of DTDs, the implication
    problem for absolute keys of K is finitely
    axiomatizable and is decidable in O(n^3) time.
  • The absence of DTDs simplifies the constraint
    analysis but does not make it trivial!

65
Inference rules for K constraint implication
  • superkey: if (Q, (Q', S)) then (Q, (Q', S ∪
    {P}))
  • where P is any path
  • Example: (_*, (person, {id})) ⇒ (_*, (person,
    {id, name}))
  • containment-reduce: if (Q, (Q', S ∪ {P1, P2}))
    and P1 ⊆ P2, then (Q, (Q', S ∪ {P1}))
  • Example: (_*, (person, {id, _*.id})) ⇒ (_*,
    (person, {id}))
  • context-target: if (Q, (Q1.Q2, S)), then
    (Q.Q1, (Q2, S))
  • Example: (_*, (university.employee, {id})) ⇒
  • (_*.university, (employee, {id}))
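
Two of the rules are purely syntactic rewrites of a key and can be written as small functions. This is a sketch under an assumed representation, not the axiomatization itself: a key (Q, (Q', S)) is a triple of a context path, a target path and a frozenset of key paths, with each path a tuple of steps.

```python
def superkey(key, extra_path):
    """(Q, (Q', S)) implies (Q, (Q', S with one more key path))."""
    q, q_target, s = key
    return (q, q_target, s | {extra_path})

def context_target(key, i):
    """(Q, (Q1.Q2, S)) implies (Q.Q1, (Q2, S)); i is where the target path is split."""
    q, q_target, s = key
    return (q + q_target[:i], q_target[i:], s)

person_key = (("_*",), ("person",), frozenset({("id",)}))
employee_key = (("_*",), ("university", "employee"), frozenset({("id",)}))

print(superkey(person_key, ("name",)))
print(context_target(employee_key, 1))
```

The containment-reduce rule is omitted here because it needs a path containment test, which this sketch does not implement.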

66
  • Part 5. Constraints in Practice

67
Updates in XML [Tatarinov et al., SIGMOD01; Zhang &
Shasha, SIAM J. Comput. 18(5); Chawathe, SIGMOD97]
  • Updates for XML are based on its ordered tree
    model: Insert, Delete, Rename, InsertBefore/InsertAfter,
    Replace.

68
Using Keys to Update: Transitive Keys [WWW01]
  • To update a unique node, we must be able to
    identify it uniquely.
  • Example 1: (ε, (bible.book, {name})), (bible.book,
    (chapter, {number}))
  • Example 2: (ε, (bible.book, {name})),
    (bible.book.chapter, (verse, {number}))
  • In the first example, the second (relative) key
    is given a context by the first. This is not
    the case in the second example.
  • (Q1, (Q1', S1)) immediately precedes (Q2, (Q2',
    S2)) if Q2 = Q1.Q1'. Precedes is the transitive
    closure of immediately precedes.
  • A set Σ of relative keys is transitive if for any
    relative key (Q1, (Q1', S1)) ∈ Σ there is a key
    (ε, (Q2', S2)) which precedes (Q1, (Q1', S1)).
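
The definition of a transitive set can be checked mechanically. The sketch below assumes keys are triples (context path, target path, key paths) with paths as tuples of tags and no wildcards, and it treats a key whose context is the empty path as needing no predecessor; both choices are simplifications made for illustration.

```python
def immediately_precedes(k1, k2):
    """(Q1, (Q1', S1)) immediately precedes (Q2, (Q2', S2)) if Q2 = Q1.Q1'."""
    (q1, q1_target, _), (q2, _, _) = k1, k2
    return len(q1_target) > 0 and q1 + q1_target == q2

def anchored(key, keys):
    """True if a chain of 'immediately precedes' links the key back to an absolute key."""
    if key[0] == ():                      # empty context path: an absolute key
        return True
    return any(immediately_precedes(k, key) and anchored(k, keys) for k in keys)

def is_transitive(keys):
    return all(anchored(k, keys) for k in keys)

book_key = ((), ("bible", "book"), frozenset({"name"}))
chapter_key = (("bible", "book"), ("chapter",), frozenset({"number"}))
verse_key = (("bible", "book", "chapter"), ("verse",), frozenset({"number"}))

print(is_transitive([book_key, chapter_key]))   # Example 1
print(is_transitive([book_key, verse_key]))     # Example 2
```

In Example 2 nothing identifies a chapter within a book, so a verse number alone cannot be resolved to a unique node.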
69
Checking Key Constraints
  • How efficiently can we check that a document
    satisfies a key specification of absolute and
    relative keys (Q, (Q', K))? It turns out there
    is an incremental technique which runs in linear
    time in the size of the document, and uses
    efficient indexing and SAX.
  • The index is a hierarchical hash table, which is
    composed of levels:
  • Key specification level
  • Context node level
  • Key path level
  • Key value level
  • Nodes are partitioned by key path and key value.
  • The index is incremental, and updates can be
    performed in linear time in the size of the
    update.
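
The levels of the index can be imitated with nested dictionaries. The sketch below is not the incremental SAX-based technique the slide refers to; it is a simplified in-memory check for one key, assuming wildcard-free dotted paths interpreted relative to the root element, and key values taken as attribute strings or stripped element text rather than full value equality on trees. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def key_value(node, key_paths):
    """Tuple of key values for a node, or None if some key path is missing."""
    parts = []
    for p in key_paths:
        if p.startswith("@"):
            v = node.get(p[1:])
        else:
            child = node.find(p.replace(".", "/"))
            v = child.text.strip() if child is not None and child.text else None
        if v is None:
            return None                  # nodes lacking a key path are not constrained
        parts.append(v)
    return tuple(parts)

def check_key(root, context_path, target_path, key_paths):
    """Return the key values that occur twice within one context node."""
    contexts = [root] if context_path == "" else root.findall(context_path.replace(".", "/"))
    index = {}                           # context node level -> key value level -> node
    violations = []
    for ctx in contexts:
        level = index.setdefault(id(ctx), {})
        for node in ctx.findall(target_path.replace(".", "/")):
            v = key_value(node, key_paths)
            if v is None:
                continue
            if v in level:
                violations.append(v)
            else:
                level[v] = node
    return violations

good = ET.fromstring(
    '<library>'
    '<book><title>XML</title><chapter><number>1</number></chapter>'
    '<chapter><number>2</number></chapter></book>'
    '<book><title>SQL</title><chapter><number>1</number></chapter></book>'
    '</library>'
)
print(check_key(good, "", "book", ["title"]))          # absolute key on books
print(check_key(good, "book", "chapter", ["number"]))  # key relative to each book
```

Chapter number 1 appears in both books without violating the relative key, because each book has its own entry at the context node level.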

70
Structure of Index
Where the Key Value Sharing Class (KVSC)
represents a set of nodes (oids) that share the
key value.
71
Example
72
Example, cont
  • Suppose we have the following key specification:
  • KS1: (ε, (book, {ISBN}))
  • KS2: (book, (author, {fn, ln}))
  • KS3: (ε, (author, {@ID}))

73
XML keys and relational storage
  • The previous approach does not consider how the
    XML document is being stored
  • Text file?
  • Relational storage?
  • Object system?
  • If a relational store is used, can we use the
    native key or constraint checking to check XML
    keys?

74
XML Relational Storage Strategies: quick review
  • Edge approach: create a single relational table
    called the edge table (Florescu & Kossmann, Data
    Eng. Bull. 22(3))
  • (sourceID, tag, ordinal, targetID, data)
  • Basic inlining: each table corresponds to an
    element; using the DTD, place within an element
    table as many single-valued attributes as
    possible (Wisconsin, VLDB99)

<?xml?>
<!ELEMENT Dept (Student*)>
<!ATTLIST Dept dept_id ID #REQUIRED>
<!ELEMENT Student (Name, Enroll*)>
<!ATTLIST Student student_id ID #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Enroll (#PCDATA)>

Dept(parentID, ID, dept_id)
Student(parentID, ID, student_id, Name)
Enroll(parentID, ID, TEXT)
75
Mapping XML keys to relational constraints
  • How do XML keys translate to relational keys or
    functional dependencies?
  • Edge model: separates all edges out, cannot use
    key constraints.
  • Inlining allows more.
  • (ε, (Student, {@student_id})), (ε, (Dept,
    {@dept_id}))
  • Check that @student_id is a key in the Student
    relation and @dept_id is a key in the Dept
    relation
  • (Dept, (Student, Name))
  • Check that (parentID, Name) is a key for the
    Student relation.
  • However, these are special cases in which the Q
    path is simple and consists of a single label!
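
For these simple cases the relational engine can do the checking. The sketch below uses Python's sqlite3 module with the inlined tables from the previous slide; the column types and the choice of UNIQUE constraints are assumptions about how one might declare the keys, not something the slides prescribe.

```python
import sqlite3

def build_store():
    """In-memory store for the inlined Dept and Student tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Dept (
            parentID INTEGER, ID INTEGER PRIMARY KEY,
            dept_id TEXT UNIQUE                 -- absolute key on @dept_id
        );
        CREATE TABLE Student (
            parentID INTEGER, ID INTEGER PRIMARY KEY,
            student_id TEXT UNIQUE,             -- absolute key on @student_id
            Name TEXT,
            UNIQUE (parentID, Name)             -- relative key (Dept, (Student, Name))
        );
    """)
    return conn

def try_insert_student(conn, parent, node_id, student_id, name):
    """Insert one Student row; return False if a declared key rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO Student VALUES (?, ?, ?, ?)",
                         (parent, node_id, student_id, name))
        return True
    except sqlite3.IntegrityError:
        return False

conn = build_store()
print(try_insert_student(conn, 1, 10, "s1", "Joe"))   # accepted
print(try_insert_student(conn, 1, 11, "s2", "Joe"))   # same Dept, same Name: rejected
```

The pair (parentID, Name) works because parentID stands for the Dept context node; with a longer context path the context would no longer be a single column of the same table.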

76
Constraints-Preserving Transformations [Lee & Chu,
ER00]
  • DTDs encapsulate certain types of constraints:
  • Domain: <!ATTLIST author gender (male|female)>
  • Cardinality: <!ELEMENT article (title, author,
    ref, price?)>
  • Inclusion: <!ATTLIST contact aid IDREF #REQUIRED>
  • Hybrid inlining can be modified to preserve these
    constraints, and to generate SQL constraint
    statements: create domain, NOT NULL,
    UNIQUE, id and foreign key.
  • The key is assumed to be the attribute of type
    ID, whenever it exists.
  • Can our extended notion of keys be captured as
    well to influence the transformation? This is an
    area of future research.

77
XML Relational Storage Strategies, cont.
  • There are many more storage strategies:
  • Shared inlining (Wisconsin): inlines element
    tags that are single valued and are not
    subelements of more than one element type.
  • Hybrid inlining (Wisconsin): inlines all element
    tags
  • Both of these approaches may pull subelements and
    attributes that are needed in the key to separate
    relations, making it complex to check XML keys.
  • These storage strategies go automatically from
    the DTD of the document to a relational schema.
    What if we want more control?

78
Mapping Constraints Through Views
  • Describing a transformation (e.g. basic, shared
    or hybrid inlining) can be done using a basic
    set of primitives (language). This also allows
    other possibilities in how the data will be
    stored.
  • We are then faced with the general problem of
    mapping constraints through a view definition.
  • Mapping constraints through a view definition is
    understood in the context of relational and
    object-oriented databases:
  • Klug, TODS 5(3), 1980: mapping functional
    dependencies and join dependencies over
    relational views
  • Beeri & Vardi, SIAM J. Comput. 13(1), 1984:
    algebraic dependencies over relational views
  • Popa, ICDT99: mapping constraints over
    object-oriented views
  • For XML this is an area of current research.

79
Other uses of constraints: Query Optimization
  • Initial work on query optimization for XML
    focused on indices (Stanford, AT&T, Wisconsin,
    etc.):
  • Value, label, and edge indices; Dataguides (Lore)
  • Template index (Milo & Suciu, ICDT99)
  • Work on query optimization using statistics and
    a cost model has been done for the Lore system
    (e.g. McHugh, VLDB99)
  • Other work has focused on pushing XML queries
    into relational databases (e.g. Silkroute WWW9,
    Manolescu VLDB01, Shanmugasundaram VLDB2001)
  • What about constraints? They have been used in
    relational databases, and more recently in
    object-oriented databases with a constraint
    language that can capture keys, foreign keys,
    inclusion constraints and indices (Popa,
    VLDB99, ICDT99). This is an active area of
    research (Deutsch, UPenn).

80
Other uses of constraints: Normalization
  • Consider the following transitive set of keys:
  • (ε, (university, {name}))
  • (university, (dept, {dept-name}))
  • (university, (dept.employee, {emp-id}))
  • Note that employee is nested under dept.
    However, to insert an employee nothing about the
    dept is necessary to identify the employee! This
    is reminiscent of non-second normal form
    relations. We would like to say that employees
    should be directly nested under university, and
    that the linkage between employee and dept be
    expressed by a foreign key.
  • This is also an area which needs further research.

81
XML Keys: Practical Observations
  • In bioinformatics, the popular sequence
    databases tend to have natural keys. For
    example, SwissProt in EMBL format has a natural
    translation to XML, and keys can be formulated.

STANDARD PRT 924 AA.
AC P15711
DT 01-APR-1990 (REL. 14, CREATED)
DT 01-APR-1990 (REL. 14, LAST SEQUENCE UPDATE)
DT 01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)
DE 104 KD MICRONEME-RHOPTRY ANTIGEN.
OS THEILERIA PARVA.
RN 1
RC STRAIN=MUGUGA
RX MEDLINE 90158697.
RA IAMS K.P., YOUNG J.R., NENE V.
RL MOL. BIOCHEM. PARASITOL. 39:47-60(1990).
DR EMBL M29954 G161866 -.
DR PIR A44945 A44945.
KW ANTIGEN SPOROZOITE.
FT DOMAIN 1 19 HYDROPHOBIC.
FT DOMAIN 905 924 HYDROPHOBIC.
82
SwissProt Entry in XML
<Entry mtype="PRT" seqlen="924">
<PrimAC>P15711</PrimAC>
<Mod date="01-APR-1990" Rel="14" type="CREATED"/>
<Mod date="01-APR-1990" Rel="14" type="LAST SEQ UPD"/>
<Mod date="01-AUG-1992" Rel="23" type="LAST ANNOT UPD"/>
<Descr>104 KD MICRONEME-RHOPTRY ANTIGEN</Descr>
<Species>THEILERIA PARVA</Species>
<Ref num="1">
<STRAIN>MUGUGA</STRAIN>
<MedlineID>90158697</MedlineID>
<Author>IAMS K.P.</Author>
<Author>YOUNG J.R.</Author>
<Author>NENE V</Author>
<Cite>MOL. BIOCHEM. PARASITOL. 39:47-60(1990)</Cite>
</Ref>
<EMBL prim_id="M29954" sec_id="G161866" status="-"/>
<PIR prim_id="A44945" sec_id="A44945"></PIR>
<Keyword>ANTIGEN</Keyword>
<Keyword>SPOROZOITE</Keyword>
<Features>
<DOMAIN from="1" to="19">
<Descr>HYDROPHOBIC</Descr>
</DOMAIN>
<DOMAIN from="905" to="924">
<Descr>HYDROPHOBIC</Descr>
</DOMAIN>
</Features>
</Entry>
83
Practical Observations, cont
  • Many DTDs are now being formulated for data
    exchange within bioinformatics. In particular,
    gene expression data uses MAGE, representing the
    merge of MAML (MicroArray Markup Language) and
    GEML (Gene Expression Markup Language). They
    have also switched to modeling the concepts in
    UML, from which there is a natural translation to
    DTDs.
  • Within these representations, attributes are
    often used to hold key information; IDs are
    occasionally used with special prefixes to
    capture their element type.

84
Conclusions and Future Work
  • Constraints are extremely important for XML data
    management
  • XML constraints and their analysis are more
    intricate than their database counterparts
  • Further work is needed for a better understanding
    of
  • XML constraints
  • consistency and implication of XML constraints

85
Open problems
  • Practical, tractable classes of XML constraints
  • Normal forms for XML specifications: is (D, Σ)
    good?
  • XML query optimization: chasing for XML
    constraints
  • Constraint propagation: given certain database
    constraints, what is the XML constraint that must
    hold on the XML view of the database?
  • Constraint implementation: given an XML
    constraint, what impact does this have on the
    storage representation? Can the constraint be
    checked by the underlying storage system (e.g.
    relational)?
  • Relative information capacity: is it the case
    that if an XML document conforms to (D1, Σ1),
    then it must also conform to (D2, Σ2)?
  • . . .