Specifying XML Data with Integrity Constraints - PowerPoint PPT Presentation

1
Specifying XML Data with Integrity Constraints
  • Wenfei Fan
  • Bell Labs and Temple University

2
Outline
  • XML, Web data and database techniques
  • XML specifications: types and constraints
  • XML constraints: absolute/relative keys and
    foreign keys
  • Analysis of XML constraints: consistency and
    implication
  • Research issues
  • Area references:
  • "A Web odyssey: From Codd to XML", V. Vianu, PODS
    2001. http://www.cis.upenn.edu/wfan/PODS2001/proceedings.html
  • "Constraints for semistructured data and XML", P.
    Buneman, W. Fan, J. Simeon and S. Weinstein,
    SIGMOD Record 30(1), March 2001.
    http://www.cis.temple.edu/fan/papers/xml/survey.ps.gz

3
  • Part 1. XML: a brief introduction

4
What is wrong with HTML?
  • HTML (HyperText Markup Language) is good for
    presentation, but does not help information
    extraction by programs.
    <h3> George Bush </h3>
    <b> Taking Eng 055 </b> <br>
    <em> GPA 1.5 </em> <br>
    <h3> Eng 055 </h3>
    <b> Spelling </b>
  • HTML tags:
    • predefined and fixed
    • describing display format rather than the
      structure of data

5
eXtensible Markup Language
  • XML tags:
    • user-defined, arbitrarily nested
    • describing the structure of the data rather than
      display
    <student id="123">
      <name>
        <firstName> George </firstName> <lastName> Bush </lastName>
      </name>
      <taking> Eng 055 </taking>
      <GPA> 1.5 </GPA>
    </student>
    <course cno="Eng 055">
      <title> Spelling </title>
    </course>

6
XML basics
  • Element: the segment between a start tag and a
    corresponding end tag, e.g., student, name.
  • Subelement: the relation between an element and its
    component elements, e.g., name to student.
  • Attribute: marked text within a start tag, e.g.,
    id.
  • Text: the single basic type (PCDATA), e.g.,
    Bush.
  • XML elements are ordered, whereas attributes are
    not.
    <student id="123">
      <name>
        <firstName> George </firstName> <lastName> Bush </lastName>
      </name>
      <taking> Eng 055 </taking> <GPA> 1.5 </GPA>
    </student>

7
The XML tree model
  • An XML document is typically modeled as a
    node-labeled tree.
  • Element node: internal, with a name (tag) and
    children (subelements and attributes), e.g.,
    student, name.
  • Attribute node: leaf with a name (tag) and text,
    e.g., @id.
  • Text node: leaf with text (string) but without a
    name.
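
The three node kinds can be seen by walking a parsed document. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name describe and the "@" prefix for attributes are choices made here for illustration, and whitespace-only text is assumed not to count as a text node.

```python
import xml.etree.ElementTree as ET

DOC = """
<student id="123">
  <name><firstName>George</firstName><lastName>Bush</lastName></name>
  <taking>Eng 055</taking>
  <GPA>1.5</GPA>
</student>
"""

def describe(elem, depth=0):
    """Return one line per node: element, attribute (@name), or text."""
    pad = "  " * depth
    lines = [f"{pad}element {elem.tag}"]
    # Attribute nodes: leaves with a name and a text value.
    for name, value in elem.attrib.items():
        lines.append(f"{pad}  attribute @{name} = {value!r}")
    # Text node: a leaf with a string but no name (whitespace-only text is skipped).
    text = (elem.text or "").strip()
    if text:
        lines.append(f"{pad}  text {text!r}")
    # Element children, in document order.
    for child in elem:
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(DOC.strip())
tree_lines = describe(root)
print("\n".join(tree_lines))
```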

8
XML and Web data
  • Web data is semistructured: schemaless, irregular
  • Traditional database systems can't model Web data
  • XML model: a special case of the semistructured
    data model
    • flexible: can model Web data (with references:
      foreign keys)
    • powerful: can represent data from databases

9
XML in data exchange
  • XML: the primary standard for data exchange on
    the Web
    • across formats/platforms/enterprises
    • generated and consumed by applications
    • healthcare industry, e-commerce, digital library, ...

[Figure: XML exchanged over the Web; labels: OODB, Unix, RDB, MS]

10
XML in data integration
  • mediator/wrapper vs. virtual view of a database
  • data warehouse vs. materialized view of a
    database
  • Web databases, e-commerce

[Figure: clients; mediator (XML); wrappers; sources: file, Web, DB]

11
XML in e-commerce
  • A site for a car dealer provides a uniform query
    interface for price, rating, review and
    competitors' price/availability.
  • Integrating local data, national archive for
    safety records, review data, competitors' sites
  • e-commerce: query interface (XML), integration
    system (XML), database system, workflow management

[Figure: clients; query interface and warehouse (XML); integrators; sources: local DB, national records, review, competitor]

12
Database techniques for managing XML data
  • specifying XML: types and constraints
  • querying XML: XSL, XQL, XML-QL, Lorel, UnQL
  • updating XML: constraints and concurrency control
  • integrating XML: database transformations and
    integration
  • storing XML: efficient storage and access
    methods, indexing
  • These are crucial for Web applications:
    • e-commerce, digital library, data exchange, Web
      databases, ...
    • Web site management, ...
  • XML players: W3C, Microsoft, HP, Oracle, Adobe,
    ...

13
  • Part 2. XML specification: types and constraints

14
A relational schema (SQL)
  • Types and constraints:
    create table students
      ( id    char(9),
        name  char(20),
        primary key (id) );
    create table courses
      ( cno   char(9),
        title char(20),
        primary key (cno) );
    create table enroll
      ( id    char(9),
        cno   char(9),
        primary key (id, cno),
        foreign key (id) references students,
        foreign key (cno) references courses );
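
To see these constraints rejecting bad data, the schema can be loaded into an in-memory SQLite database. This is only a sketch: SQLite is a stand-in chosen here (the slides name no particular system), the helper try_insert and the sample rows are made up for illustration, and SQLite checks foreign keys only after the PRAGMA shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
create table students (id char(9), name char(20), primary key (id));
create table courses  (cno char(9), title char(20), primary key (cno));
create table enroll (
  id char(9), cno char(9),
  primary key (id, cno),
  foreign key (id) references students,
  foreign key (cno) references courses);
""")

def try_insert(sql, params):
    """Run one insert and report whether the constraints accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

ok_student = try_insert("insert into students values (?, ?)", ("123", "George Bush"))
dup_student = try_insert("insert into students values (?, ?)", ("123", "Someone Else"))
ok_course = try_insert("insert into courses values (?, ?)", ("Eng 055", "Spelling"))
ok_enroll = try_insert("insert into enroll values (?, ?)", ("123", "Eng 055"))
dangling = try_insert("insert into enroll values (?, ?)", ("123", "Math 101"))
print(ok_student, dup_student, ok_course, ok_enroll, dangling)
```

The second student row violates the primary key and the last enroll row violates the foreign key on cno, so both are refused.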

15
An object-oriented schema (ODMG)
  • Types and constraints:
    class student
      ( key id,
        extent students )
    { attribute string id;
      attribute string name;
      relationship set<course> taking
        inverse course::takenBy; };
    class course
      ( key cno,
        extent courses )
    { attribute string cno;
      attribute string title;
      relationship set<student> takenBy
        inverse student::taking; };
  • The distinction between types and constraints is
    dictated by what programming languages treat as
    types

16
XML specification types
  • DTD (Document Type Definition):
    <!ELEMENT db (student, course)>
    <!ELEMENT student (name, taking)>
    <!ELEMENT course (title)>
    <!ELEMENT taking (title)>
    attributes:
      student: @id
      course: @cno
      taking: @cno
  • XML Schema
  • XDuce, XML Algebra, XML Data, ...

17
XML specification constraints
  • Keys: locating a specific object, an invariant
    connection from an object in the real world to
    its representation
      student.@id → student
      course.@cno → course
  • Foreign keys: referencing an object from another
    object
      taking.@cno ⊆ course.@cno,  course.@cno → course
  • Key specifications:
    the XML standard (DTD), XML Schema, XML Data,
    ...

18
Constraints are important for XML
  • XML is semistructured and may not come with a
    DTD/type
  • constraints are a fundamental part of the
    semantics
  • constraints have proved useful in:
    • semantic specifications: obvious
    • query optimization: chasing algorithm
    • database conversion to an XML encoding: a must
    • data integration: information preservation
    • update anomaly prevention: classical
    • normal forms for XML specifications: BCNF,
      3NF
    • efficient storage/access: indexing
    • ...

19
  • Part 3. XML constraints: keys and foreign keys

20
The limitations of the XML standard
  • ID and IDREF attributes in DTD:
    <!ATTLIST student id  ID    #REQUIRED>
    <!ATTLIST course  cno ID    #REQUIRED>
    <!ATTLIST taking  cno IDREF #IMPLIED>
  • Scoping:
    • ID: unique within the entire document (like oids)
    • IDREF: untyped; one has no control over what it
      points to
  • unary and primary
  • defined in a type
  • A mixture of relational keys and object
    identities (oids)
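
The two scoping problems can be reproduced with a small hand-written checker. This is a sketch under assumptions: ElementTree does not validate against a DTD, so the sets ID_ATTRS and IDREF_ATTRS below are hypothetical stand-ins for the ATTLIST declarations, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Stand-ins for the ATTLIST declarations: (element type, attribute name).
ID_ATTRS = {("student", "id"), ("course", "cno")}
IDREF_ATTRS = {("taking", "cno")}

DOC = """
<db>
  <student id="s1"><taking cno="s1"/></student>
  <course cno="c1"/>
  <course cno="s1"/>
</db>
"""

def check_ids(root):
    """Return (duplicates, dangling, owner): ID values declared twice, IDREF
    values matching no ID, and the element type that first declared each ID."""
    owner, duplicates = {}, []
    for elem in root.iter():
        for tag, attr in ID_ATTRS:
            if elem.tag == tag and attr in elem.attrib:
                value = elem.attrib[attr]
                if value in owner:          # IDs share one document-wide scope
                    duplicates.append(value)
                owner.setdefault(value, elem.tag)
    dangling = [
        elem.attrib[attr]
        for elem in root.iter()
        for tag, attr in IDREF_ATTRS
        if elem.tag == tag and attr in elem.attrib and elem.attrib[attr] not in owner
    ]
    return duplicates, dangling, owner

duplicates, dangling, owner = check_ids(ET.fromstring(DOC.strip()))
print(duplicates, dangling, owner["s1"])
```

Here a course and a student collide on the value s1 although they are different kinds of objects, and the taking reference is accepted even though it resolves to a student rather than a course.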

21
The limitations of XML Schema
  • Keys: defined with a list of XPath expressions
      (student, firstName, lastName)
      (student, lastName, firstName)
      (student, lastName, lastName, firstName)
  • Equivalence/containment of XPath expressions is
    unresolved
  • No efficient way to tell whether two keys are
    equivalent
  • The notion of value equality is too restricted
    (text only)
  • The notion of relative keys is not addressed
  • Mild generalizations of relational keys fail to
    capture some fundamental semantics associated
    with the hierarchical structure of XML data

22
To overcome the limitations (WWW10)
  • Absolute key: (Q, {P1, ..., Pk})
    • target path Q: identifies a target set [[Q]] of
      nodes on which the key is defined (vs. relation)
    • a set of key paths {P1, ..., Pk}: provides
      an identification for nodes in [[Q]] (vs. key
      attributes)
    • semantics: for any two nodes in [[Q]], if they
      have all the key paths and agree on them up to
      value equality, then they must be the same node
      (value equality and node identity)
  • Examples:
      (_*.student, {@id})
      (_*.student, {_*.name})
      (_*.enroll, {@id, @cno})
      (_*, {@id})
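
A rough checker for absolute keys of this shape can be sketched as follows. It makes simplifying assumptions that are not in the slides: the target path is restricted to "all elements with a given tag, at any depth", each key path is either an attribute (written "@name") or a single child tag, and each key path reaches at most one node. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

def canonical(elem):
    """Nested tuple that is equal for two elements iff they are value equal
    (whitespace-only text is ignored in this sketch)."""
    return (elem.tag, tuple(sorted(elem.attrib.items())),
            (elem.text or "").strip(),
            tuple((canonical(c), (c.tail or "").strip()) for c in elem))

def key_value(node, path):
    """Value reached by one key path; None when the path is missing."""
    if path.startswith("@"):
        return node.attrib.get(path[1:])
    child = node.find(path)
    return None if child is None else canonical(child)

def violations(root, target_tag, key_paths):
    """Pairs of distinct target nodes that have all key paths and agree on them."""
    seen, clashes = {}, []
    for node in root.iter(target_tag):      # every target_tag element, at any depth
        values = tuple(key_value(node, p) for p in key_paths)
        if None in values:                  # a key path is missing: no obligation
            continue
        if values in seen:
            clashes.append((seen[values], node))
        else:
            seen[values] = node
    return clashes

root = ET.fromstring("""<db>
  <student id="1"><name>Ann</name></student>
  <student id="2"><name>Ann</name></student>
  <student><name>Bob</name></student>
</db>""")
id_clashes = violations(root, "student", ["@id"])
name_clashes = violations(root, "student", ["name"])
print(len(id_clashes), len(name_clashes))
```

The third student has no id attribute and is simply skipped by the key on id, which matches the "if they have all the key paths" wording of the semantics.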

23
Value equality on trees
  • Two nodes are value equal iff
  • either they are text nodes (PCDATA) with the same
    value
  • or they are attributes with the same tag and the
    same value
  • or they are elements having the same tag and
    their children are pairwise value equal
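
The recursive definition translates almost directly into code. The sketch below works on ElementTree elements and assumes that whitespace-only text does not count as a text node; value_equal is a name chosen here.

```python
import xml.etree.ElementTree as ET

def value_equal(a, b):
    """Value equality on element nodes, following the three cases above."""
    if a.tag != b.tag:
        return False
    # Attributes are unordered: compare them as name -> value mappings.
    if a.attrib != b.attrib:
        return False
    # Text before the first child must agree.
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    # Children are ordered: compare them pairwise, including text after each child.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if (x.tail or "").strip() != (y.tail or "").strip():
            return False
        if not value_equal(x, y):
            return False
    return True

n1 = ET.fromstring("<name><firstName>George</firstName><lastName>Bush</lastName></name>")
n2 = ET.fromstring("<name>\n <firstName>George</firstName>\n <lastName>Bush</lastName>\n</name>")
n3 = ET.fromstring("<name><lastName>Bush</lastName><firstName>George</firstName></name>")
same = value_equal(n1, n2)       # distinct nodes, equal values
reordered = value_equal(n1, n3)  # element order matters
print(same, reordered)
```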

...
24
Capturing the semistructured nature
  • independent of types
  • no structural requirement: tolerating
    missing/multiple paths
      (person, {name})    (person, {name, @phone})

25
Path expressions
  • A simple yet powerful regular path language:
      q ::= ε | l | q.q | _*
    • ε: empty path
    • l: tag
    • q.q: concatenation
    • _*: combination of wildcard and the Kleene
      closure
  • Theorem. The containment and equivalence problems
    for these path expressions are finitely
    axiomatizable and decidable in quadratic time.
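
Because the language is regular, testing whether one concrete tag sequence belongs to a path expression reduces to regular-expression matching. The sketch below shows only this membership test, not the containment procedure of the theorem. It assumes the wildcard closure is written `_*`, the empty path is the empty string, and tags contain neither "." nor "/"; the function names are illustrative.

```python
import re

def compile_path(expr):
    """Translate a path such as 'book._*.number' into a regex over
    '/'-terminated tag sequences. The empty string stands for the empty path."""
    parts = [p for p in expr.split(".") if p]
    pattern = "".join(
        "(?:[^/]+/)*" if p == "_*" else re.escape(p) + "/"
        for p in parts
    )
    return re.compile(pattern)

def matches(expr, tags):
    """True iff the tag sequence `tags` is in the language of `expr`."""
    encoded = "".join(t + "/" for t in tags)
    return compile_path(expr).fullmatch(encoded) is not None

print(matches("_*.student", ["db", "student"]))
print(matches("book.chapter", ["book", "section"]))
```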

26
A key constraint language K
  • Relative key: (Q, K)
    • path Q identifies a set [[Q]] of nodes, called
      the context
    • K = (Q', {P1, ..., Pk}) is a key on
      sub-documents rooted at nodes in [[Q]] (relative
      to Q).
  • Example.
      (book, (chapter, {number}))
      (book.chapter, (section, {number}))
      (book, {title}) -- absolute key
  • Analogous to keys for weak entities in a
    relational database:
    • the key of the parent entity
    • an identification relative to the parent entity
27
Absolute vs. relative keys
  • Absolute keys are a special case of relative keys:
    (Q, K) when Q is the empty path
  • Absolute keys are scoped within the context of
    the entire document, while relative keys are
    scoped within the context of a sub-document
  • Important for hierarchically structured data:
    XML, scientific databases, ...
    • absolute: (book, {title})
    • relative: (book, (chapter, {number}))
    • relative: (book.chapter, (section, {number}))
  • XML keys are more complex than relational keys!
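
The book/chapter/section example can be checked with a few lines of code. This is a sketch with assumptions: context paths are given in ElementTree's "/" syntax relative to the root (with "." for the root itself), the key path is a single child element compared by its text, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

def relative_key_violations(root, context_path, target_tag, key_tag):
    """Inside each context node, no two target children may share the text
    of their key_tag child. Returns (context node, repeated key) pairs."""
    clashes = []
    for ctx in root.findall(context_path):
        seen = set()
        for target in ctx.findall(target_tag):
            key = target.findtext(key_tag)
            if key is None:          # key path missing: no obligation
                continue
            if key in seen:
                clashes.append((ctx, key))
            seen.add(key)
    return clashes

root = ET.fromstring("""<db>
  <book><title>XML</title>
    <chapter><number>1</number>
      <section><number>1</number></section>
      <section><number>2</number></section>
    </chapter>
    <chapter><number>2</number>
      <section><number>1</number></section>
    </chapter>
  </book>
  <book><title>SQL</title>
    <chapter><number>1</number></chapter>
  </book>
</db>""")

chapters_ok = relative_key_violations(root, "book", "chapter", "number") == []
sections_ok = relative_key_violations(root, "book/chapter", "section", "number") == []
titles_ok = relative_key_violations(root, ".", "book", "title") == []

# Read as an absolute key over all chapters, number is NOT a key:
all_numbers = [c.findtext("number") for c in root.iter("chapter")]
absolute_ok = len(set(all_numbers)) == len(all_numbers)
print(chapters_ok, sections_ok, titles_ok, absolute_ok)
```

Chapter numbers repeat across books, so number identifies a chapter only relative to its book, which is the weak-entity reading.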

28
  • Part 4. XML constraint analysis

29
Consistency of an XML specification
  • Given: D, a DTD
           Σ, a set of keys and foreign keys
  • Consistency: is there an XML document that both
    conforms to D and satisfies Σ?
  • Example.
      DTD D:          <!ELEMENT foo (X, X)>
                      <!ELEMENT X EMPTY>
      constraints Σ:  (X, {ε})
  • One wants to know whether an XML specification
    makes sense!

30
Implication of XML constraints
  • Given: D, a DTD
           Σ, a set of keys and foreign keys
           φ, a property (a key or foreign key)
  • Implication: is it the case that for any XML
    document, if it conforms to D and satisfies Σ,
    then it must satisfy φ?
  • The need for studying implication:
    • data integration: constraints cannot be checked
      directly at the mediator level
    • design theory for XML specifications: along the
      same lines as database normalization
    • query optimization (chase), ...

31
Consistency analysis
  • Trivial for relational databases: given any
    schema and keys, foreign keys, one can always
    find a nonempty instance of the schema satisfying
    the constraints.
  • Hard for XML: XML specifications with DTD and
    keys, foreign keys may not be consistent!
  • DTDs interact with constraints in an intricate
    way.

32
The interaction between DTDs and constraints
  • DTD D:  <!ELEMENT foo (X, X)>
            <!ELEMENT X EMPTY>
  • key φ:  (X, {ε})
  • (1) conforms to D: two X nodes under the root
  • (2) satisfies φ: no two X nodes under the root
    can have the same value
  • There is no XML tree both conforming to D and
    satisfying φ: the two X nodes are both empty,
    hence value equal
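
The clash can be made concrete. The DTD admits only one document shape, a foo root with two empty X children, so it suffices to test the key on that single document. This sketch reads the key on X as "no two X nodes under the root are value equal"; self_value is a helper name chosen here.

```python
import xml.etree.ElementTree as ET

# The only document shape the DTD allows: a foo root with two empty X children.
doc = ET.fromstring("<foo><X/><X/></foo>")

def self_value(elem):
    """Canonical value of a node, so that equal tuples mean value-equal nodes."""
    return (elem.tag, tuple(sorted(elem.attrib.items())),
            (elem.text or "").strip(),
            tuple(self_value(c) for c in elem))

xs = doc.findall("X")
values = [self_value(x) for x in xs]
# The key holds iff distinct X nodes always have distinct values.
key_holds = len(set(values)) == len(values)
print(len(xs), key_holds)
```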

33
Consistency of DTDs
  • There is need for consistency analysis even in
    the absence of constraints
  • Example. DTD:
      <!ELEMENT foo (foo)>
  • There exists no XML document that conforms to the
    DTD: every foo element must contain another foo
    element, so no finite tree satisfies it!
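
Deciding whether a DTD admits any finite document resembles the emptiness test for context-free grammars: an element type is usable if its content model can be satisfied using only usable types. The sketch below uses a made-up tuple encoding of content models and a simple fixpoint loop; it illustrates the idea only and is not tuned to run in linear time.

```python
def can_terminate(model, productive):
    """Can this content model be satisfied using only productive element types?"""
    kind = model[0]
    if kind in ("empty", "text"):
        return True
    if kind == "elem":                       # a single child of the named type
        return model[1] in productive
    if kind == "seq":                        # (m1, m2, ...)
        return all(can_terminate(m, productive) for m in model[1:])
    if kind == "alt":                        # (m1 | m2 | ...)
        return any(can_terminate(m, productive) for m in model[1:])
    if kind == "star":                       # m*: zero repetitions always work
        return True
    raise ValueError(f"unknown content model: {kind}")

def consistent(dtd, root):
    """True iff some finite tree rooted at `root` conforms to `dtd`."""
    productive = set()
    changed = True
    while changed:                           # iterate until no new type is added
        changed = False
        for name, model in dtd.items():
            if name not in productive and can_terminate(model, productive):
                productive.add(name)
                changed = True
    return root in productive

bad = {"foo": ("elem", "foo")}                           # <!ELEMENT foo (foo)>
fixed = {"foo": ("alt", ("elem", "foo"), ("empty",))}    # hypothetical repair
print(consistent(bad, "foo"), consistent(fixed, "foo"))
```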

34
A simple constraint language, C
  • absolute key: τ[X] → τ. A document satisfies
    the key iff
      ∀ x, y ∈ ext(τ) (x[X] =_v y[X] → x = y)
  • absolute foreign key: an inclusion constraint
    τ1[X] ⊆ τ2[Y] and a key τ2[Y] → τ2. A document
    satisfies the foreign key iff it satisfies the
    key and
      ∀ x ∈ ext(τ1) ∃ y ∈ ext(τ2) (x[X] =_v y[Y])
  • where
    • τ, τ1, τ2: element types
    • X, Y: sets (sequences) of attributes
    • ext(τ): the set of all τ elements in the
      document
    • =_v: value equality.
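
Both satisfaction conditions are easy to check on a given document. The following sketch mirrors the definitions with ElementTree; the names ext, proj, satisfies_key and satisfies_foreign_key are chosen here, attribute values are compared as strings, and a missing attribute is represented as None (an assumption, since a DTD would normally require the attributes).

```python
import xml.etree.ElementTree as ET

def ext(root, tau):
    """All elements of type tau in the document."""
    return list(root.iter(tau))

def proj(elem, attrs):
    """Tuple of the values of the given attributes on elem (None if missing)."""
    return tuple(elem.attrib.get(a) for a in attrs)

def satisfies_key(root, tau, X):
    """No two distinct tau elements agree on all attributes in X."""
    values = [proj(x, X) for x in ext(root, tau)]
    return len(set(values)) == len(values)

def satisfies_foreign_key(root, tau1, X, tau2, Y):
    """Y is a key of tau2, and every tau1 element's X values occur as the
    Y values of some tau2 element."""
    if not satisfies_key(root, tau2, Y):
        return False
    targets = {proj(y, Y) for y in ext(root, tau2)}
    return all(proj(x, X) in targets for x in ext(root, tau1))

root = ET.fromstring("""<db>
  <student id="1"><taking cno="c1"/></student>
  <student id="2"><taking cno="c2"/></student>
  <course cno="c1"/>
</db>""")

student_key = satisfies_key(root, "student", ["id"])
fk_before = satisfies_foreign_key(root, "taking", ["cno"], "course", ["cno"])
ET.SubElement(root, "course", cno="c2")   # add the missing course
fk_after = satisfies_foreign_key(root, "taking", ["cno"], "course", ["cno"])
print(student_key, fk_before, fk_after)
```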

35
C constraints vs. K constraints
  • absolute key τ[X] → τ in C is equivalent to an
    absolute key in K: (_*.τ, X)
  • absolute keys are a special case of K constraints
  • absolute foreign key τ1[X] ⊆ τ2[Y] and τ2[Y] →
    τ2 of C is not expressible in K

36
Unary constraints
  • Keys and foreign keys defined in terms of a
    single attribute.
  • Example.
      student.@id → student
      course.@cno → course
      taking.@cno ⊆ course.@cno

37
Analysis of C constraints (PODS01)
  • Theorem. In the presence of DTDs, the following
    problems are undecidable for keys and foreign
    keys of C:
    • the consistency problem
    • the implication problem.
  • As opposed to the trivial consistency analysis in
    relational databases.
  • These negative results carry over to
    • other schema languages: XML Schema, XML Data,
      XDuce, ...
    • other constraint languages: XML Schema, XML
      Data, ...

38
Analysis of unary constraints
  • Theorem. In the presence of DTDs, for unary
    constraints of C:
    • the consistency problem is NP-complete
    • the implication problem is coNP-complete.
  • In relational databases, implication of unary
    keys and foreign keys is decidable in linear
    time.
  • Primary key restriction: at most one key for each
    element type.
  • Theorem. In the presence of DTDs, the consistency
    and implication problems remain intractable for
    unary keys and foreign keys of C even under the
    primary key restriction.
  • Keys specified with ID attributes are primary and
    unary!

39
A simple language for relative constraints, R
  • relative key: (Q, τ[X] → τ). A document
    satisfies the key iff
      ∀ x ∈ [[Q]] ∀ y, z ∈ ext(x.τ) (y[X] =_v z[X] → y = z)
  • relative foreign key: (Q1, τ1[X]) ⊆ (Q2, τ2[Y])
    and a key (Q2, τ2[Y] → τ2). A document
    satisfies the foreign key iff it satisfies the
    key and
      ∀ x ∈ [[Q1]] ∃ y ∈ [[Q2]] (ext(x.τ1)[X] ⊆_v ext(y.τ2)[Y])
  • where
    • Q, Q1, Q2: path expressions; [[Q]] is the set of
      nodes reached by Q
    • τ, τ1, τ2: element types; X, Y: attributes
    • ext(x.τ): the set of τ sub-elements of x
    • ⊆_v: set inclusion defined in terms of value
      equality
40
R constraints
  • key (Q, τ[X] → τ) of R is equivalent to
    (Q, (τ, X))
  • relative keys are a special case of K constraints
  • foreign key (Q1, τ1[X]) ⊆ (Q2, τ2[Y]) and (Q2,
    τ2[Y] → τ2) of R is not expressible in K
  • Example.
      (CS.student, (taking.@cno → taking))
      (_*, (course.@cno → course))
      (CS.student, taking.@cno) ⊆ (CS, course.@cno)
      (CS, course.@cno) ⊆ (CS.student, taking.@cno)

41
Analysis of relative constraints
  • Theorem. In the presence of DTDs, the following
    problems are undecidable even for unary relative
    constraints of R:
    • the consistency problem
    • the implication problem.
  • The analysis of XML constraints is far more
    intriguing than its database counterparts!

42
Tractable special cases
  • Theorem. In the absence of constraints, the
    consistency problem for arbitrary DTDs is
    decidable in linear time.
  • Theorem. When the DTD is fixed, the consistency and
    implication problems for unary constraints of C
    are in PTIME.
  • Theorem. When only keys of C are considered, the
    consistency and implication problems are
    decidable in linear time in the presence of DTDs.

43
Constraint analysis in the absence of DTDs
  • Regardless of DTDs:
    • Consistency: given any set of keys and foreign
      keys, can they be satisfied by an XML document?
    • Implication: given a set Σ of keys and foreign
      keys, does it follow that all documents
      satisfying Σ must also satisfy another key or
      foreign key?
  • The need for investigating these issues:
    • many XML documents do not come with a DTD
    • one is interested in implication that generally
      holds for all kinds of documents, regardless of
      their DTDs.

44
Analysis of C constraints (PODS00)
  • Without DTDs, the consistency problem becomes
    trivial: any keys and foreign keys of C are
    satisfiable.
  • Theorem. In the absence of DTDs, the implication
    problem for C constraints remains undecidable.
  • Theorem. In the absence of DTDs, the implication
    problem is decidable in PSPACE for keys and
    foreign keys of C under the primary key
    restriction.
  • Theorem. In the absence of DTDs, the implication
    problem is decidable in linear time for unary
    keys and foreign keys of C.
  • These results also hold when inverse constraints
    are allowed.

45
Analysis of K constraints (DBPL01)
  • Without DTDs, the consistency problem for K also
    becomes trivial: any keys of K are satisfiable.
  • Theorem. In the absence of DTDs, the implication
    problem for keys of K is finitely axiomatizable
    and is decidable in PTIME.
  • Theorem. In the absence of DTDs, the implication
    problem for absolute keys of K is finitely
    axiomatizable and is decidable in O(n^3) time.
  • The absence of DTDs simplifies the constraint
    analysis but does not make it trivial!

46
  • Part 5. Current research issues

47
Summary
  • Constraints are extremely important for XML data
    management
  • XML constraints and their analysis are more
    intricate than their database counterparts
  • Further work is needed for a better understanding
    of
  • XML constraints
  • consistency and implication of XML constraints

48
Open problems
  • Practical, tractable classes of XML constraints
  • Normal forms for XML specifications: is (D, Σ)
    good?
  • XML query optimization: chasing for XML
    constraints
  • Constraint propagation: given certain database
    constraints, what is the XML constraint that must
    hold on the XML view of the database?
  • Relative information capacity: is it the case
    that if an XML document conforms to (D1, Σ1),
    then it must also conform to (D2, Σ2)?
  • . . .