XML Data Management 10. Deterministic DTDs and Schemas - PowerPoint PPT Presentation

Title: XML Data Management 10. Deterministic DTDs and Schemas

1
XML Data Management 10. Deterministic DTDs and Schemas

Werner Nutt

2
How Expressive can a Schema Be?
This schema is a frequent example in teaching
material on XML Schema
<xsd:complexType name="oneB">
  <xsd:choice>
    <xsd:element name="B" type="xsd:string"/>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="oneB"/>
    </xsd:sequence>
    <xsd:sequence>
      <xsd:element name="A" type="oneB"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
  </xsd:choice>
</xsd:complexType>
<xsd:element name="A" type="oneB"/>
<xsd:complexType name="onlyAs">
  <xsd:choice>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
    <xsd:element name="A" type="xsd:string"/>
  </xsd:choice>
</xsd:complexType>
What would documents look like that satisfy this schema?
Arbitrarily deep binary trees of A elements that contain a single B element
How would one check validity? What would be the cost? What are the pros and cons of allowing such schemas?
3
Let's see what SAXON says
4
Here is the Full Error Message from Eclipse

cos-element-consistent: Error for type 'oneB'. Multiple elements with name 'A', with different types, appear in the model group.
cos-element-consistent: Error for type 'onlyAs'. Multiple elements with name 'A', with different types, appear in the model group.
cos-nonambig: A and A (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.
cos-nonambig: A and A (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

The cos-element-consistent errors say that, in a given context, elements with the same name must have the same content. Easy to check!
The cos-nonambig errors are more subtle ...
5
The Country Example in XML Schema
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.example.org/country"
    xmlns="http://www.example.org/country"
    elementFormDefault="qualified">
  <xsd:element name="country">
    <xsd:complexType>
      <xsd:choice>
        <xsd:element name="king" type="xsd:string"></xsd:element>
        <xsd:element name="queen" type="xsd:string"></xsd:element>
        <xsd:sequence>
          <xsd:element name="king" type="xsd:string"></xsd:element>
          <xsd:element name="queen" type="xsd:string"></xsd:element>
        </xsd:sequence>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
As DTD: <!ELEMENT country (king | queen | (king,queen))>
6
This schema is rejected as well

cos-nonambig: king and king (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

Let's check what this means!
7
What the W3C Standard Explains

Schema Component Constraint: Unique Particle Attribution
A content model must be formed such that during validation of an element information item sequence, the particle contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.
http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig

8
Questions and Ideas

Questions
How can one make the standard formal?
How can a validator implement the standard?
Ideas
Content models are specified by regular expressions
A regular expression E can be translated into a finite state automaton A (Glushkov automaton) that checks which strings satisfy E
⇒ Construct A from E and check whether A is deterministic

9
Formalization

Alphabet Σ (i.e., set of symbols): the element names occurring in the content model
Regular expressions over Σ are generated with the rule
e, f → a | (e·f) | (e | f) | (e)* | (e)+
where e, f are expressions and a ∈ Σ
Language L(e) of an expression e (inductively defined)
Exercise: Which of the following are in the language defined by a*·(b | c)·a* ?
aba
abca
aab
aaacaaa

In the following, we denote concatenation by a dot, no longer by a comma.

10
Regular Expressions and DTDs

These are formalizations of DTDs and validation
A DTD is a pair (d, s) where
s ∈ Σ is the start symbol
d maps every Σ-symbol to a regular expression over Σ
A document tree t satisfies (d, s) (t is valid wrt (d, s)) iff
the root of t is labeled s
for every node n in t, with symbol a, the string formed by the names of the children of n satisfies d(a)
⇒ Validation is checking whether a string satisfies a regexp
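
The definition above can be sketched in Python. This is only an illustration under assumptions that are not in the slides: a tree is a pair (name, list of children), the DTD is a hypothetical dictionary `dtd` of Python regular expressions, and every child name is followed by a space so that names cannot run together.

```python
import re

# Hypothetical DTD (d, s) for the country example; leaf elements have
# the empty content model.
dtd = {
    "country": r"(king |queen |king queen )",
    "king": r"",
    "queen": r"",
}
start = "country"

def children_ok(tree, d):
    """Check d(a) against the child names of every node, top down."""
    name, children = tree
    word = "".join(child[0] + " " for child in children)
    if name not in d or re.fullmatch(d[name], word) is None:
        return False
    return all(children_ok(child, d) for child in children)

def valid(tree, d, s):
    """A tree is valid iff its root is labeled s and all nodes satisfy d."""
    return tree[0] == s and children_ok(tree, d)

print(valid(("country", [("king", []), ("queen", [])]), dtd, start))
print(valid(("country", [("queen", []), ("king", [])]), dtd, start))
```

Python's `re` engine backtracks, so this sketch accepts content models that are not deterministic; it shows the definition of validity, not the restriction discussed on the following slides.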

11
Markings

Distinguish between the different occurrences of a symbol in a regexp by using numbers: markings of regexps
Examples
a1*·(b2 | c3)·a4* is a marking of a*·(b | c)·a*
king1 | queen2 | king3·queen4 is a marking of king | queen | king·queen
Definition
A marking e′ of a regular expression e is an assignment of numbers to every symbol in e.

12
Unmarked Version

Consider a regular expression e and a marking e′ of e
Definition
For w ∈ L(e′), we denote by w# the corresponding unmarked string in L(e).
Example
If w = b2a1a3, then w# = baa
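
Marking and unmarking are easy to mechanize. The following Python sketch assumes, purely for illustration, that an expression is written as a string whose symbols are runs of letters, and that a marked word is a list of strings such as "b2"; the function names `mark` and `unmark` are hypothetical.

```python
import re

def mark(expr):
    """Number every symbol occurrence from left to right."""
    counter = 0

    def number(match):
        nonlocal counter
        counter += 1
        return f"{match.group(0)}{counter}"

    return re.sub(r"[A-Za-z]+", number, expr)

def unmark(word):
    """Drop the trailing number from every marked symbol of a word."""
    return [re.sub(r"\d+$", "", symbol) for symbol in word]

print(mark("king|queen|king·queen"))
print(unmark(["b2", "a1", "a3"]))
```

Numbering from left to right is just one possible marking; the definition allows any assignment of numbers that tells the occurrences apart.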

13
Unique Particle Attribution Formalization

Brüggemann-Klein/Wood 1998

Definition: A regular expression r is deterministic iff there are no strings uxv, uyw ∈ L(r′) with
|x| = |y| = 1,
x ≠ y (x and y are different marked symbols),
x# = y# (their unmarking is the same).
Example: (a | b)* a is not deterministic because there are
the marking (a1 | b2)* a3
the strings b2 a1 a3 (u = b2, x = a1, v = a3) and b2 a3 (u = b2, y = a3, w is the empty word)
How can we check whether a regular expression is deterministic?
14
Finite State Automata
Regular languages can also be defined using automata
A finite state automaton (FSA) consists of
a set of states Q
an alphabet Σ (i.e., a set of symbols)
a transition function δ, which maps every pair (q, a) to a set of states
an initial state q0
a set of accepting states F
A word a1…an is in the language defined by an automaton if there is a path from q0 to a state in F with edges labeled a1, …, an
The automaton is deterministic if every pair (q, a) is mapped to only a single state
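
A small Python sketch of these definitions, assuming the transition function is stored as a dictionary from (state, symbol) pairs to sets of states. The two-state automaton `delta` below is a hypothetical example for the language of (a | b)* a, not an automaton taken from the slides.

```python
def accepts(delta, q0, finals, word):
    """Follow all paths labeled with word; accept if one ends in finals."""
    current = {q0}
    for symbol in word:
        current = {target
                   for state in current
                   for target in delta.get((state, symbol), set())}
    return bool(current & finals)

def is_deterministic(delta):
    """Every pair (q, a) may lead to at most one state."""
    return all(len(targets) <= 1 for targets in delta.values())

# State 0 loops on a and b; reading a may also move to accepting state 1.
delta = {(0, "a"): {0, 1}, (0, "b"): {0}}

print(accepts(delta, 0, {1}, "ba"))
print(accepts(delta, 0, {1}, "ab"))
print(is_deterministic(delta))
```

Tracking a set of current states is what makes the same function work for deterministic and non-deterministic automata alike.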

15
Which Language Does this FSA Define?
16
Non-Deterministic Automata

An automaton is non-deterministic if there is a
state q and a letter a such that there are at
least two transitions from q via edges labeled
with a
What words are in the language of a
non-deterministic automaton?
We now create a Glushkov automaton from a
regular expression

17
Creating a Glushkov Automaton from a Regular
Expression
Step 1: Create a marking of the expression
a*·(b | c)·a*
18
Creating a Glushkov Automaton from a Regular
Expression
Step 2: Create a state q0 and create a state for each subscripted letter
a1*·(b1 | c1)·a2*
Step 3: Choose as accepting states all subscripted letters with which it is possible to end a word
How do we find these states?
19
Creating a Glushkov Automaton from a Regular
Expression
Step 4: Create a transition from a state li to a state kj if there is a word in which kj follows li. Label the transition with k
a1*·(b1 | c1)·a2*
How do we find these transitions?
20
Exercises

What are the Glushkov automata of
a·b·(a·b)*
(a | b)*·a·(a | b)
(a | b)*·a ?

21
Recognizing Deterministic Regular Expressions

Theorem (Book et al 1971, Brüggemann-Klein, Wood,
1998)
A regular expression is deterministic
(one-unambiguous) iff its Glushkov automaton is
deterministic.

22
Construction of the Glushkov Automaton

For an arbitrary alphabet Σ and a language L ⊆ Σ* we define two sets
first(L) = { a ∈ Σ | ∃ u ∈ Σ*. a·u ∈ L }
last(L) = { a ∈ Σ | ∃ u ∈ Σ*. u·a ∈ L }
and the function
follow(L, a) = { b ∈ Σ | ∃ u, v ∈ Σ*. u·a·b·v ∈ L }.
Consider an expression e and its marking e′
We can construct the Glushkov automaton for e if we know
the sets first(L(e′)), last(L(e′)),
the function follow(L(e′), ·),
and if we know whether ε ∈ L(e′) (ε is the empty word).
Why?
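
The three definitions can be tried out directly on a finite set of words. The Python sketch below assumes words are tuples of marked symbols; `sample` is a hand-picked finite subset of L((a1 | b2)* a3), so the results only describe that sample, not the full infinite language.

```python
def first(words):
    """Symbols that start some word."""
    return {w[0] for w in words if w}

def last(words):
    """Symbols that end some word."""
    return {w[-1] for w in words if w}

def follow(words, a):
    """Symbols that occur directly after a in some word."""
    return {w[i + 1]
            for w in words
            for i in range(len(w) - 1)
            if w[i] == a}

def has_empty_word(words):
    return () in words

sample = {("a3",), ("a1", "a3"), ("b2", "a3"), ("b2", "a1", "a3")}

print(first(sample))
print(last(sample))
print(follow(sample, "b2"))
```

In the automaton, first gives the transitions leaving q0, follow gives all other transitions, and last (together with the empty-word test for q0) gives the accepting states.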
23
Construction of the Glushkov Automaton

Where do we get this info?
If e′ = a1, then
first(L(e′)) = { a1 }
last(L(e′)) = { a1 }
follow(L(e′), li) = ∅ for every li (nothing can follow a symbol)
Also, ε ∉ L(e′)
If e′ = (f | g), then
first(L(e′)) = first(L(f)) ∪ first(L(g))
last(L(e′)) = last(L(f)) ∪ last(L(g))
follow(L(e′), li) = follow(L(f), li) if li occurs in f
                  = follow(L(g), li) if li occurs in g
Also, ε ∈ L(e′) if ε ∈ L(f) or ε ∈ L(g)

For e′ = f*, f+, f·g: exercise!
24
Construction of the Glushkov Automaton

If e′ = (f·g), then
first(L(e′)) = first(L(f)) ∪ first(L(g)) if ε ∈ L(f)
             = first(L(f)) otherwise
last(L(e′)) = last(L(f)) ∪ last(L(g)) if ε ∈ L(g)
            = last(L(g)) otherwise
follow(L(e′), li) = follow(L(f), li) if li in f but not li ∈ last(L(f))
                  = follow(L(f), li) ∪ first(L(g)) if li ∈ last(L(f))
                  = follow(L(g), li) if li in g
Also, ε ∈ L(e′) if ε ∈ L(f) and ε ∈ L(g)

25
Construction of the Glushkov Automaton

If e′ = (f)*, then
first(L(e′)) = first(L(f))
last(L(e′)) = last(L(f))
follow(L(e′), li) = follow(L(f), li) if li in f but not li ∈ last(L(f))
                  = follow(L(f), li) ∪ first(L(f)) if li ∈ last(L(f))
Also, ε ∈ L(e′) always
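
The inductive rules for single symbols, choice, concatenation and star can be put together in a short Python sketch. It rests on assumptions that are not in the slides: expressions are nested tuples with the hypothetical tags "sym", "alt", "cat" and "star", and the positions 1, 2, 3, ... handed out from left to right play the role of the marking. The determinism test then checks that neither first nor any follow set contains two positions with the same unmarked name.

```python
from itertools import count

def glushkov(expr):
    """Return (nullable, first, last, follow, names) for an expression."""
    names, follow, counter = {}, {}, count(1)

    def visit(e):
        kind = e[0]
        if kind == "sym":
            p = next(counter)          # fresh position = marked symbol
            names[p] = e[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind == "alt":
            n1, f1, l1 = visit(e[1])
            n2, f2, l2 = visit(e[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "cat":
            n1, f1, l1 = visit(e[1])
            n2, f2, l2 = visit(e[2])
            for p in l1:               # last of f is followed by first of g
                follow[p] |= f2
            return (n1 and n2,
                    f1 | f2 if n1 else f1,
                    l1 | l2 if n2 else l2)
        if kind == "star":
            _, f1, l1 = visit(e[1])
            for p in l1:               # last of f loops back to first of f
                follow[p] |= f1
            return True, f1, l1
        raise ValueError(f"unknown node {kind!r}")

    nullable, first, last = visit(expr)
    return nullable, first, last, follow, names

def is_deterministic(expr):
    """True iff no state has two outgoing edges with the same label."""
    _, first, _, follow, names = glushkov(expr)
    for targets in [first, *follow.values()]:
        labels = [names[p] for p in targets]
        if len(labels) != len(set(labels)):
            return False
    return True

a, b = ("sym", "a"), ("sym", "b")
print(is_deterministic(("cat", ("star", ("alt", a, b)), a)))   # (a | b)* a
print(is_deterministic(("cat", ("star", b), a)))               # b* a
```

The sketch recomputes nothing cleverly and makes no attempt to reach the complexity bounds discussed next; it only mirrors the rules one case at a time.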

26
Recognizing Deterministic Regular Expressions

Observation
For each operator, first, last, and follow can be computed in quadratic time.
⇒ This yields an O(n³) algorithm.
Theorem (Brüggemann-Klein, Wood, 1998)
There is an O(n²) algorithm to check whether a regexp is deterministic.

27
More Results

Theorems (Brüggemann-Klein, Wood, 1998)
Not every regular language can be denoted by a deterministic regular expression.
E.g., (a | b)* a (a | b)
Deterministic regular languages are not closed under union, concatenation, or Kleene-star.
I.e., there is no easy syntactic characterization
If it exists, an equivalent deterministic regular expression can be constructed in exponential time.
It is possible to help users, but that is costly
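
For the country example an equivalent deterministic expression does exist. A common rewrite (an assumption here, it is not given in the slides) is (king, queen?) | queen. The Python sketch below compares the two content models on all child sequences up to a bounded length; this is a sanity check on short words, not a proof of equivalence.

```python
import re
from itertools import product

# Each element name is followed by a space so that names cannot run together.
ambiguous = re.compile(r"(king |queen |king queen )")
rewritten = re.compile(r"(king (queen )?|queen )")

def same_language(r1, r2, alphabet, max_len):
    """Compare two patterns on every word of at most max_len symbols."""
    for n in range(max_len + 1):
        for names in product(alphabet, repeat=n):
            word = "".join(name + " " for name in names)
            if bool(r1.fullmatch(word)) != bool(r2.fullmatch(word)):
                return False
    return True

print(same_language(ambiguous, rewritten, ["king", "queen"], 4))
```

In the rewritten model a validator that sees king knows at once which particle it belongs to, because king occurs only once in the expression.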

28
Theory for XML Schema

XML schema allows schemas where
the same element appears with different types
However,
it is illegal to have two elements of the same name, but different types, in one content model.
Also, content models must be deterministic.
Consequence
Documents can be validated in a deterministic
top-down pass

29
References