Title: More XML, Chapter 6: DTD
2. DTD: document type definition
- A DTD is defined using EBNF (extended BNF) and can be used to specify the allowable elements and attributes for an XML document.
- There is currently a move away from DTDs toward Schema. Schema documents have XML (not BNF) syntax.
- Some parsers can check an XML document against its DTD and determine whether it is valid. These are called validating parsers. A document that is syntactically correct but does not conform to its DTD is well-formed but not valid. Non-validating parsers cannot check documents against their DTD and can thus only determine whether the document is well-formed.
3. Document type declaration
- <!DOCTYPE ...> in an XML document prolog is used to specify a DTD appearing within or outside the document. These are referred to as the internal and external subsets.
    <!DOCTYPE thingy [
       <!ELEMENT thingy (#PCDATA)>
    ]>
- This declares a DTD called thingy with one element in the internal subset.
- #PCDATA refers to parseable character data, meaning that the reserved characters <, > and & within the PCDATA will be treated as markup. The parentheses contain the content specification for the element.
4. MS XML validator
- We can check an XML document for adherence to an external DTD using the MS XML validator. Here is the XML:
    <?xml version = "1.0"?>
    <!-- Fig. 6.1: intro.xml -->
    <!-- Using an external subset -->
    <!DOCTYPE myMessage SYSTEM "intro.dtd">
    <myMessage>
       <message>Welcome to XML!</message>
    </myMessage>
- And here is the DTD:
    <!-- Fig. 6.2: intro.dtd -->
    <!-- External declarations -->
    <!ELEMENT myMessage ( message )>
    <!ELEMENT message ( #PCDATA )>
5. The MS validating parser can validate against a schema or a DTD
6. Invalid XML
- On the next slide we use the MS XML validator to check an XML document (shown below) that is like intro.xml but is missing the message element.
    <?xml version = "1.0"?>
    <!-- Fig. 6.3: intro-invalid.xml -->
    <!-- Simple introduction to XML markup -->
    <!DOCTYPE myMessage SYSTEM "intro.dtd">
    <!-- Root element missing child element message -->
    <myMessage>
    </myMessage>
7. If the XML document does not match the DTD/schema
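The check performed by the validator on these slides can be approximated in Java. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK; the declarations of intro.dtd are inlined as an internal subset so that the example needs no files, and the class name ValidityCheck and its method are hypothetical, not part of the course code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class for illustration only.
final class ValidityCheck {
    // Assumption: intro.dtd is inlined as an internal subset.
    static final String DTD =
          "<!DOCTYPE myMessage [\n"
        + "  <!ELEMENT myMessage ( message )>\n"
        + "  <!ELEMENT message ( #PCDATA )>\n"
        + "]>\n";
    static final String VALID =
        DTD + "<myMessage><message>Welcome to XML!</message></myMessage>";
    static final String INVALID = DTD + "<myMessage></myMessage>";

    /** Returns the validation errors reported for a well-formed document. */
    static List<String> validationErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // request a validating parser
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Reached only if the text is not well-formed or no parser is available.
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("intro.xml errors:         " + validationErrors(VALID));
        System.out.println("intro-invalid.xml errors: " + validationErrors(INVALID));
    }
}
```

Both documents are well-formed, so neither triggers a fatal error; only the second one produces validation errors, which is the distinction between well-formed and valid.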
8. Sequences, pipes and occurrences
- The comma can be used to indicate a sequence in which elements must appear.
    <!ELEMENT class (prof, student)>
- This indicates the order and number of elements making up a class: one prof and one student, in that order. Content may specify any number of elements.
    <!ELEMENT sidedish (coleslaw | chips)>
- The pipe indicates that just one of the choices must be selected.
- +, * and ? indicate the frequency of element occurrences: + means 1 or more occurrences, * means 0 or more occurrences, and ? means 0 or 1 occurrence.
    <!ELEMENT class (prof, student+)>
- This might be appropriate for a class DTD, meaning just one professor and one or more students.
9. Example
    <!ELEMENT donuts (jelly?, lemon*, ((crème | sugar)+ | glazed))>
- This specifies that donuts consists of 0 or 1 jelly, 0 or more lemon, and then either 1 or more of crème or sugar, or a glazed. A legal markup for this would be:
    <donuts>
       <jelly>grape</jelly>
       <lemon>sour</lemon>
       <lemon>real sour</lemon>
       <glazed>chocolate</glazed>
    </donuts>
10. The DTD and XML
- Pastry.dtd
    <!ELEMENT jelly (#PCDATA)>
    <!ELEMENT glazed (#PCDATA)>
    <!ELEMENT lemon (#PCDATA)>
    <!ELEMENT creme (#PCDATA)>
    <!ELEMENT sugar (#PCDATA)>
    <!ELEMENT donuts (jelly?, lemon*, ((creme | sugar)+ | glazed))>
- Pastry.xml
    <?xml version = "1.0"?>
    <!-- pastry.xml -->
    <!-- Using an external subset -->
    <!DOCTYPE donuts SYSTEM "pastry.dtd">
    <donuts>
       <jelly>grape</jelly>
       <lemon>sour</lemon>
       <lemon>real sour</lemon>
       <glazed>chocolate</glazed>
    </donuts>
11. In the validator, files are in the myexamples directory
12. Pastry.xml in the XML validator
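To see how the sequence, choice and occurrence indicators in the donuts content model interact, several candidate documents can be checked programmatically. This is a sketch under stated assumptions: it uses the JAXP parser bundled with the JDK, inlines the pastry.dtd declarations as an internal subset, and the class DonutsModel is a hypothetical name introduced here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class for illustration only.
final class DonutsModel {
    // Assumption: the pastry.dtd declarations are inlined as an internal subset.
    static final String DTD =
          "<!DOCTYPE donuts [\n"
        + "  <!ELEMENT jelly (#PCDATA)>\n"
        + "  <!ELEMENT glazed (#PCDATA)>\n"
        + "  <!ELEMENT lemon (#PCDATA)>\n"
        + "  <!ELEMENT creme (#PCDATA)>\n"
        + "  <!ELEMENT sugar (#PCDATA)>\n"
        + "  <!ELEMENT donuts (jelly?, lemon*, ((creme | sugar)+ | glazed))>\n"
        + "]>\n";

    /** True if the given children of <donuts> satisfy the content model. */
    static boolean isValid(String children) {
        final boolean[] ok = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { ok[0] = false; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            String xml = DTD + "<donuts>" + children + "</donuts>";
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not well-formed, or no parser
        }
        return ok[0];
    }

    public static void main(String[] args) {
        String[] samples = {
            "<jelly>grape</jelly><lemon>sour</lemon><glazed>chocolate</glazed>",
            "<creme>a</creme><sugar>b</sugar>",          // (creme | sugar)+ repeated
            "<lemon>sour</lemon>",                        // final group missing
            "<jelly>a</jelly><jelly>b</jelly><glazed>c</glazed>" // jelly? allows at most one
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "  " + s);
        }
    }
}
```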
13. Content specification
- An element may contain one or more child elements as content.
- Content specification types describe non-element content.
- These consist of ANY, EMPTY and mixed content.
- Empty elements do not contain character data or child elements. An empty element specification like
    <!ELEMENT nest EMPTY>
  could be marked up as <nest/>. Recall that the shorthand /> may be used in place of a closing tag for an empty element.
- The + and ? occurrence indicators cannot be used with mixed-content elements containing only #PCDATA. If mixed content may contain #PCDATA, then #PCDATA must be listed first.
- An element of type ANY may contain any content, including PCDATA or combinations of elements and PCDATA. It may also be empty.
14. Mixed content
    <!ELEMENT mymessage (#PCDATA | message)*>
- This declares mymessage to have mixed content. #PCDATA must be listed first in mixed content. The * means mymessage may contain nothing or any number of occurrences of PCDATA and message elements. This would be legal markup:
    <mymessage>here is an example of the dtd above
       <message>this is a message</message>
       <message>and another</message>
    </mymessage>
15. Internal DTD
- An XML document is standalone if it does not reference an external subset.
    <?xml version = "1.0" standalone = "yes"?>
    <!-- Fig. 6.5: mixed.xml -->
    <!-- Mixed content type elements -->
    <!DOCTYPE format [
       <!ELEMENT format ( #PCDATA | bold | italic )*>
       <!ELEMENT bold ( #PCDATA )>
       <!ELEMENT italic ( #PCDATA )>
    ]>
    <format>
       This is a simple formatted sentence.
       <bold>I have tried bold.</bold>
       <italic>I have tried italic.</italic>
       Now what?
    </format>
16. In the MS XML validator
17. Element group
    <!ELEMENT courselist (department, (coursenumber, coursedescription)*)>
- Above, a courselist contains a single department followed by any number of coursenumber, coursedescription pairs.
- What does the following mean?
    <!ELEMENT course (coursenumber, (sectionnumber, instructor, roomnumber))>
18. Attribute specification
- An attribute specification declares an attribute list for an element via an ATTLIST declaration:
    <!ELEMENT x EMPTY>
    <!ATTLIST x y CDATA #REQUIRED>
- Here, y is a required attribute of element x. y may contain any character data (except <, >, &, ' and ").
- CDATA in an attribute declaration has a different meaning than a CDATA section in an XML document, where ]]> (the end tag) may not appear.
19. Using attributes
    <?xml version = "1.0"?>
    <!-- Fig. 6.7: intro2.xml -->
    <!-- Declaring attributes -->
    <!DOCTYPE myMessage [
       <!ELEMENT myMessage ( message )>
       <!ELEMENT message ( #PCDATA )>
       <!ATTLIST message id CDATA #REQUIRED>
    ]>
    <myMessage>
       <message id = "445">
          Welcome to XML!
       </message>
    </myMessage>
20. Document with attributes in the MS validator
21. Attribute defaults
- Page authors can specify default values for attributes.
- The keywords are #IMPLIED, #REQUIRED and #FIXED.
- An implied attribute, if missing, can be replaced by any value the application using the document wishes.
- A required attribute must appear or the document is not valid.
- A fixed attribute must have the specific value provided.
- <message>number</message> does not conform to
    <!ATTLIST message number CDATA #REQUIRED>
- <!ATTLIST address zip CDATA #FIXED "13820"> specifies that zip can only have the value 13820, and an application processing an XML document with an address element missing attribute zip would be passed this default zip value.
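The behaviour of #FIXED and #REQUIRED can be observed from Java. This is a minimal sketch, assuming the JAXP DOM parser bundled with the JDK; the class AttributeDefaults and the street attribute are hypothetical additions made for illustration, while the zip declaration follows the slide.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class for illustration only.
final class AttributeDefaults {
    // zip is fixed as on the slide; street is an invented required attribute.
    static final String DTD =
          "<!DOCTYPE address [\n"
        + "  <!ELEMENT address EMPTY>\n"
        + "  <!ATTLIST address zip    CDATA #FIXED \"13820\"\n"
        + "                    street CDATA #REQUIRED>\n"
        + "]>\n";

    private static Document parse(String element, final List<String> errors) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            return builder.parse(new InputSource(new StringReader(DTD + element)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not well-formed, or no parser
        }
    }

    /** True if the address element conforms to the DTD above. */
    static boolean isValid(String element) {
        List<String> errors = new ArrayList<>();
        parse(element, errors);
        return errors.isEmpty();
    }

    /** The zip value the application sees, whether written or defaulted. */
    static String zip(String element) {
        return parse(element, new ArrayList<String>())
                .getDocumentElement().getAttribute("zip");
    }

    public static void main(String[] args) {
        String noZip = "<address street=\"Main\"/>";
        System.out.println("zip passed to the application: " + zip(noZip));
        System.out.println("wrong fixed value valid?  "
                + isValid("<address zip=\"99999\" street=\"Main\"/>"));
        System.out.println("missing required valid?   "
                + isValid("<address zip=\"13820\"/>"));
    }
}
```

The first call shows the parser supplying the fixed value even though the document omits zip; the other two show that a different zip value, or a missing required attribute, makes the document invalid.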
22. Attributes
- Attribute types may be CDATA (strings), tokenized or enumerated.
- Strings have no constraints beyond prohibiting <, >, &, ' and ". Entity references must be used for these.
- Tokenization imposes constraints on attribute values, such as which characters are permitted in an attribute name.
- An enumerated attribute has a restricted value range: it can only take on one of the values listed in the attribute declaration.
23. Tokenized attributes
- 4 tokenized types exist:
  - ID
  - IDREF
  - ENTITY
  - NMTOKEN
- ID uniquely identifies an element.
- IDREF attributes point to elements with an ID attribute.
- A validating parser verifies that each ID attribute type referenced by an IDREF is in the document.
- Using the same value for multiple ID attributes is an error.
- Declaring attributes of type ID to be #FIXED is an error.
24. Using ID and IDREF attributes
    <?xml version = "1.0"?>
    <!-- IDExample.xml: Example for ID and IDREF values of attributes -->
    <!DOCTYPE bookstore [
       <!ELEMENT bookstore ( shipping+, book+ )>
       <!ELEMENT shipping ( duration )>
       <!ATTLIST shipping shipID ID #REQUIRED>
       <!ELEMENT book ( #PCDATA )>
       <!ATTLIST book shippedBy IDREF #IMPLIED>
       <!ELEMENT duration ( #PCDATA )>
    ]>
    <bookstore>
       <shipping shipID = "s1">
          <duration>2 to 4 days</duration>
       </shipping>
       <shipping shipID = "s2">
          <duration>1 day</duration>
       </shipping>
       <book shippedBy = "s2">
          Java How to Program 3rd edition.
       </book>
    </bookstore>
25. In MS Validator
- Use URL http://employees.oneonta.edu/higgindm/internet%20programming/validate_js.htm
- with file examples\ch06\IDExample.xml
26. ID example
27. ID example: internal subset
    <?xml version = "1.0"?>
    <!-- Fig. 6.8: IDExample.xml -->
    <!-- Example for ID and IDREF values of attributes -->
    <!DOCTYPE bookstore [
       <!ELEMENT bookstore ( shipping+, book+ )>
       <!ELEMENT shipping ( duration )>
       <!ATTLIST shipping shipID ID #REQUIRED>
       <!ELEMENT book ( #PCDATA )>
       <!ATTLIST book shippedBy IDREF #IMPLIED>
       <!ELEMENT duration ( #PCDATA )>
    ]>
28. IDExample.xml continued
    <bookstore>
       <shipping shipID = "s1">
          <duration>2 to 4 days</duration>
       </shipping>
       <shipping shipID = "s2">
          <duration>1 day</duration>
       </shipping>
       <book shippedBy = "s2">
          Java How to Program 3rd edition.
       </book>
       <book shippedBy = "s2">
          C How to Program 3rd edition.
       </book>
       <book shippedBy = "s1">
          C++ How to Program 3rd edition.
       </book>
    </bookstore>
29. Remarks
- It is an error if the value of an attribute of type ID does not begin with a letter, underscore or colon.
- Declaring more than one attribute of type ID for an element is an error.
- Referencing a value that is not defined as an ID is an error.
30. IDExample2.xml (note the s3 shippedBy value)
    <bookstore>
       <shipping shipID = "s1">
          <duration>2 to 4 days</duration>
       </shipping>
       <shipping shipID = "s2">
          <duration>1 day</duration>
       </shipping>
       <book shippedBy = "s2">
          Java How to Program 3rd edition.
       </book>
       <book shippedBy = "s2">
          C How to Program 3rd edition.
       </book>
       <book shippedBy = "s3">
          C++ How to Program 3rd edition.
       </book>
    </bookstore>
31. IDExample2.xml in the validator
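The ID and IDREF rules from the remarks slide can also be exercised in Java. The following is a sketch, assuming the JAXP parser bundled with the JDK and the bookstore declarations inlined as an internal subset; IdRefCheck, SHIPPING and isValid are hypothetical names introduced here, and the book text is shortened.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class for illustration only.
final class IdRefCheck {
    static final String DTD =
          "<!DOCTYPE bookstore [\n"
        + "  <!ELEMENT bookstore ( shipping+, book+ )>\n"
        + "  <!ELEMENT shipping ( duration )>\n"
        + "  <!ATTLIST shipping shipID ID #REQUIRED>\n"
        + "  <!ELEMENT book ( #PCDATA )>\n"
        + "  <!ATTLIST book shippedBy IDREF #IMPLIED>\n"
        + "  <!ELEMENT duration ( #PCDATA )>\n"
        + "]>\n";

    // Two shipping elements defining the IDs s1 and s2, as in IDExample.xml.
    static final String SHIPPING =
          "<shipping shipID=\"s1\"><duration>2 to 4 days</duration></shipping>"
        + "<shipping shipID=\"s2\"><duration>1 day</duration></shipping>";

    /** True if a bookstore with the given content is valid against the DTD. */
    static boolean isValid(String content) {
        final boolean[] ok = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { ok[0] = false; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            String xml = DTD + "<bookstore>" + content + "</bookstore>";
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not well-formed, or no parser
        }
        return ok[0];
    }

    public static void main(String[] args) {
        // IDREF pointing at a declared ID.
        System.out.println(isValid(SHIPPING + "<book shippedBy=\"s2\">Java</book>"));
        // IDREF pointing at s3, which no element defines (the IDExample2.xml case).
        System.out.println(isValid(SHIPPING + "<book shippedBy=\"s3\">C++</book>"));
        // The same ID values used twice.
        System.out.println(isValid(SHIPPING + SHIPPING + "<book shippedBy=\"s1\">C</book>"));
    }
}
```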
32. Entities
- As we saw in chapter 5, entity references in an XML document are replaced by the entity values found in the DTD.
- We saw this for lang.xml and lang.dtd, where the assoc and text entities were replaced with Arabic script.
- Here is another example. Entity city is replaced.
33. entityexample.xml
    <?xml version = "1.0"?>
    <!-- Fig. 6.10: entityExample.xml -->
    <!-- ENTITY and ENTITY attribute types -->
    <!DOCTYPE database [
       <!NOTATION html SYSTEM "iexplorer">
       <!ENTITY city SYSTEM "tour.html" NDATA html>
       <!ELEMENT database ( company )>
       <!ELEMENT company ( name )>
       <!ATTLIST company tour ENTITY #REQUIRED>
       <!ELEMENT name ( #PCDATA )>
    ]>
    <database>
       <company tour = "city">
          <name>Deitel &amp; Associates, Inc.</name>
       </company>
    </database>
34. entityexample.xml
35. entityexample.xml
- Here the <!NOTATION ...> declaration indicates that an application may wish to run IE and load tour.html to handle unparsed entities.
- The <!ENTITY ...> declaration that follows it declares an entity named city, which refers to the external document tour.html.
- NDATA in that declaration indicates that the content of this entity is not XML and supplies the name of the notation (html) for this entity.
36. ENTITIES
- The ENTITIES keyword can be used in a DTD to indicate that an attribute has multiple entities for its value.
    <!ATTLIST directory file ENTITIES #REQUIRED>
- This specifies that file must contain one or more entity names. Conforming markup is
    <directory file = "animations graphics tables">
- animations, graphics and tables are entities declared in a DTD.
- The NMTOKEN type is more restrictive: its value may contain only letters, digits, periods, underscores, hyphens and colons.
    <!ATTLIST mathdept phonenum NMTOKEN #REQUIRED>
  might have conforming markup
    <mathdept phonenum = "607-436-3708">
- <mathdept phonenum = "607 436 3708"> does not conform because spaces are not allowed.
- The NMTOKENS attribute type would allow multiple string tokens separated by blanks.
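The difference between NMTOKEN and NMTOKENS can be checked with a validating parser. This is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class NmtokenCheck and the extensions attribute are hypothetical, added only to show an NMTOKENS value next to the phonenum example from the slide.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class for illustration only.
final class NmtokenCheck {
    // phonenum follows the slide; extensions is an invented NMTOKENS attribute.
    static final String DTD =
          "<!DOCTYPE mathdept [\n"
        + "  <!ELEMENT mathdept EMPTY>\n"
        + "  <!ATTLIST mathdept phonenum   NMTOKEN  #REQUIRED\n"
        + "                     extensions NMTOKENS #IMPLIED>\n"
        + "]>\n";

    /** True if <mathdept ATTRIBUTES/> is valid against the DTD above. */
    static boolean isValid(String attributes) {
        final boolean[] ok = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { ok[0] = false; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            String xml = DTD + "<mathdept " + attributes + "/>";
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not well-formed, or no parser
        }
        return ok[0];
    }

    public static void main(String[] args) {
        System.out.println(isValid("phonenum=\"607-436-3708\""));   // single name token
        System.out.println(isValid("phonenum=\"607 436 3708\""));   // spaces inside NMTOKEN
        System.out.println(isValid("phonenum=\"607-436-3708\" extensions=\"3708 3709\""));
    }
}
```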
37. Enumerated attribute types
- An enumerated attribute type declares a list of possible values. Attributes must be assigned a value from this list in order to conform to the DTD. Enumerated values are separated with the pipe (|).
- <!ATTLIST person gender (M | F) "F"> allows a person to have gender M or F, with default F.
- <!ATTLIST person gender (M | F) #IMPLIED> does not supply a default and would permit an application to process a person with no gender in whatever way it liked.
38. Enumerated attribute types
- NOTATION is also an enumerated attribute type.
    <!ATTLIST CSCI116 language NOTATION (Java | C) "C">
- This specifies that language must be assigned a value, Java or C, with C as the default. The notation for C might be specified as
    <!NOTATION C SYSTEM "http://....html">
39. conditional.xml
- Conditional sections provide the flexibility of including or excluding declarations.
- These enable us to check XML documents against different sets of DTD requirements.
- The keywords INCLUDE and IGNORE specify included and excluded declarations.
    <![INCLUDE[
       <!ELEMENT name (#PCDATA)>
    ]]>
- This directs the parser to include the declaration of element name.
- Conditionals may also be used with entities.
40. Conditional.dtd
    <!-- conditional.dtd -->
    <!-- DTD for conditional section example -->
    <!ENTITY % reject "IGNORE">
    <!ENTITY % accept "INCLUDE">
    <![ %accept; [
       <!ELEMENT message ( approved, signature )>
    ]]>
    <![ %reject; [
       <!ELEMENT message ( approved, reason, signature )>
    ]]>
    <!ELEMENT approved EMPTY>
    <!ATTLIST approved flag ( true | false ) "false">
    <!ELEMENT reason ( #PCDATA )>
    <!ELEMENT signature ( #PCDATA )>
41. Conditional.xml
    <?xml version = "1.0" standalone = "no"?>
    <!-- conditional.xml -->
    <!-- Using conditional sections -->
    <!DOCTYPE message SYSTEM "conditional.dtd">
    <message>
       <approved flag = "true"/>
       <signature>Chairman</signature>
    </message>
42. Discussion
- The entities reject and accept have the values IGNORE and INCLUDE, respectively.
- The percent symbol (%) indicates that they are parameter entities, which may only be used inside the DTD in which they are declared. Used inside declarations like this, they may appear only in the external subset.
- Thus the author may create entities specific to the DTD rather than to the XML document.
43. conditional.xml
46. Whitespace
- Whitespace is preserved or normalized depending on the context in which it appears.
- A text example (whitespace.xml) uses a Java program (Tree.java from chapter 9) to demonstrate when whitespace is preserved or normalized.
- The file can be found at classdir\examples\ch09\tree.java
47. Running Tree.java on whitespace.xml (Java source in notes)
    C:\Java\j2sdk1.4.1_01\bin>java Tree yes whitespace.xml
    URL: file:C:/Java/j2sdk1.4.1_01/bin/whitespace.xml
    [ document root ]
    +-[ element : whitespace ]
       +-[ ignorable ]
       +-[ ignorable ]
       +-[ ignorable ]
       +-[ element : hasCDATA ]
          +-[ attribute : cdata ] " simple cdata "
       +-[ ignorable ]
       +-[ ignorable ]
       +-[ ignorable ]
       +-[ element : hasID ]
          +-[ attribute : id ] "i20"
       +-[ ignorable ]
       +-[ ignorable ]
       +-[ ignorable ]
       +-[ element : hasNMTOKEN ]
          +-[ attribute : nmtoken ] "hello"
48. Java tree output continued
       +-[ element : hasEnumeration ]
          +-[ attribute : enumeration ] "true"
       +-[ ignorable ]
       +-[ ignorable ]
       +-[ ignorable ]
       +-[ element : hasMixed ]
          +-[ text ] "
    "
          +-[ text ] " This is text."
          +-[ text ] "
    "
          +-[ text ] " "
          +-[ element : hasCDATA ]
             +-[ attribute : cdata ] " simple cdata"
          +-[ text ] "
    "
          +-[ text ] " This is some additional text."
          +-[ text ] "
    "
49. whitespace.xml: DTD and content
    <?xml version = "1.0"?>
    <!-- whitespace.xml -->
    <!-- Demonstrating whitespace parsing -->
    <!DOCTYPE whitespace [
       <!ELEMENT whitespace ( hasCDATA,
          hasID, hasNMTOKEN, hasEnumeration, hasMixed )>
       <!ELEMENT hasCDATA EMPTY>
       <!ATTLIST hasCDATA cdata CDATA #REQUIRED>
       <!ELEMENT hasID EMPTY>
       <!ATTLIST hasID id ID #REQUIRED>
       <!ELEMENT hasNMTOKEN EMPTY>
       <!ATTLIST hasNMTOKEN nmtoken NMTOKEN #REQUIRED>
       <!ELEMENT hasEnumeration EMPTY>
50. whitespace.xml continued
    <whitespace>
       <hasCDATA cdata = " simple cdata "/>
       <hasID id = " i20"/>
       <hasNMTOKEN nmtoken = " hello"/>
       <hasEnumeration enumeration = " true"/>
       <hasMixed>
          This is text.
          <hasCDATA cdata = " simple cdata"/>
          This is some additional text.
       </hasMixed>
    </whitespace>
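The attribute part of this demonstration can be reproduced without Tree.java. The following is a minimal sketch, assuming the JAXP DOM parser bundled with the JDK; it uses a cut-down version of whitespace.xml with only the CDATA and ID attributes, and the class AttributeWhitespace is a hypothetical name. The extra spaces inside the attribute values are chosen here to make the effect visible.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class AttributeWhitespace {
    // Cut-down variant of whitespace.xml: one CDATA attribute, one ID attribute.
    static final String XML =
          "<!DOCTYPE whitespace [\n"
        + "  <!ELEMENT whitespace ( hasCDATA, hasID )>\n"
        + "  <!ELEMENT hasCDATA EMPTY>\n"
        + "  <!ATTLIST hasCDATA cdata CDATA #REQUIRED>\n"
        + "  <!ELEMENT hasID EMPTY>\n"
        + "  <!ATTLIST hasID id ID #REQUIRED>\n"
        + "]>\n"
        + "<whitespace>\n"
        + "  <hasCDATA cdata=\"   simple   cdata   \"/>\n"
        + "  <hasID id=\"   i20\"/>\n"
        + "</whitespace>";

    /** Returns the value the parser reports for the named attribute. */
    static String attribute(String elementName, String attributeName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // the DTD drives the normalization
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            Element node = (Element) doc.getElementsByTagName(elementName).item(0);
            return node.getAttribute(attributeName);
        } catch (Exception e) {
            throw new IllegalStateException(e); // not well-formed, or no parser
        }
    }

    public static void main(String[] args) {
        // CDATA values keep their spaces; tokenized values (here ID) are trimmed.
        System.out.println("cdata = [" + attribute("hasCDATA", "cdata") + "]");
        System.out.println("id    = [" + attribute("hasID", "id") + "]");
    }
}
```

This mirrors the Tree output, where the cdata value keeps its surrounding blanks while id, nmtoken and enumeration are reported without the leading space.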
51. Tree.java slide 1
    import java.io.*;
    import org.xml.sax.*;  // for HandlerBase class
    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;

    public class Tree extends HandlerBase {
       private int indent = 0;  // indentation counter

       // returns the spaces needed for indenting
       private String spacer( int count )
       {
          String temp = "";

          for ( int i = 0; i < count; i++ )
             temp += "   ";

          return temp;
       }

       // method called before parsing
       // it provides the document location
       public void setDocumentLocator( Locator loc )
       {
          System.out.println( "URL: " + loc.getSystemId() );
       }
52. Tree.java slide 2
       // method called at the beginning of a document
       public void startDocument() throws SAXException
       {
          System.out.println( "[ document root ]" );
       }

       // method called at the end of the document
       public void endDocument() throws SAXException
       {
          System.out.println( "[ document end ]" );
       }

       // method called at the start tag of an element
       public void startElement( String name,
          AttributeList attributes ) throws SAXException
       {
          System.out.println( spacer( indent++ ) +
                              "+-[ element : " + name + " ]" );

          if ( attributes != null )
             for ( int i = 0; i < attributes.getLength(); i++ )
                System.out.println( spacer( indent ) +
                   "+-[ attribute : " + attributes.getName( i ) +
                   " ] \"" + attributes.getValue( i ) + "\"" );
       }
53. Tree.java slide 3
       // method called at the end tag of an element
       public void endElement( String name ) throws SAXException
       {
          indent--;
       }

       // method called when a processing instruction is found
       public void processingInstruction( String target,
          String value ) throws SAXException
       {
          System.out.println( spacer( indent ) +
             "+-[ proc-inst : " + target + " ] \"" + value + "\"" );
       }

       // method called when characters are found
       public void characters( char buffer[], int offset,
          int length ) throws SAXException
       {
          if ( length > 0 ) {
             String temp = new String( buffer, offset, length );

             System.out.println( spacer( indent ) +
                                 "+-[ text ] \"" + temp + "\"" );
          }
       }

       // method called when ignorable whitespace is found
       public void ignorableWhitespace( char buffer[],
          int offset, int length )
       {
          // body not shown on the slide; inferred from the program output
          if ( length > 0 )
             System.out.println( spacer( indent ) + "+-[ ignorable ]" );
       }
54. Tree.java slide 4
       // method called on a non-fatal (validation) error
       public void error( SAXParseException spe )
          throws SAXParseException
       {
          // treat non-fatal errors as fatal errors
          throw spe;
       }

       // method called on a parsing warning
       public void warning( SAXParseException spe )
          throws SAXParseException
       {
          System.err.println( "Warning: " + spe.getMessage() );
       }
55. Tree.java slide 5
       // main method
       public static void main( String args[] )
       {
          boolean validate = false;

          if ( args.length != 2 ) {
             System.err.println( "Usage: java Tree [validate] " +
                                 "[filename]\n" );
             System.err.println( "Options:" );
             System.err.println( "    validate [yes|no] : " +
                                 "DTD validation" );
             System.exit( 1 );
          }

          if ( args[ 0 ].equals( "yes" ) )
             validate = true;

          SAXParserFactory saxFactory =
             SAXParserFactory.newInstance();

          saxFactory.setValidating( validate );

          try {
             SAXParser saxParser = saxFactory.newSAXParser();
             saxParser.parse( new File( args[ 1 ] ), new Tree() );
          }
          catch ( SAXParseException spe ) {
             System.err.println( "Parse Error: " + spe.getMessage() );
          }
          // the remaining handlers are not shown on the slide;
          // they are added here so that the class compiles
          catch ( SAXException se ) {
             se.printStackTrace();
          }
          catch ( ParserConfigurationException pce ) {
             pce.printStackTrace();
          }
          catch ( IOException ioe ) {
             ioe.printStackTrace();
          }
       }
    }
56. Day planner example continued
57. planner.xml
    <?xml version = "1.0"?>
    <!-- planner.xml: Day Planner XML document -->
    <!DOCTYPE planner SYSTEM "planner.dtd">
    <planner>
       <year value = "2000">
          <date month = "7" day = "15">
             <note time = "1430">Doctor's appointment</note>
             <note time = "1620">Physics class at BH291C</note>
          </date>
          <date month = "7" day = "4">
             <note>Independence Day</note>
          </date>
          <date month = "7" day = "20">
             <note time = "0900">General Meeting in room 32-A</note>
          </date>
          <date month = "7" day = "20">
             <note time = "1900">Party at Joe's</note>
          </date>
          <date month = "7" day = "20">
58. planner.dtd
    <!-- DTD for day planner -->
    <!ELEMENT planner ( year* )>
    <!ELEMENT year ( date* )>
    <!ATTLIST year value CDATA #REQUIRED>
    <!ELEMENT date ( note* )>
    <!ATTLIST date month CDATA #REQUIRED>
    <!ATTLIST date day CDATA #REQUIRED>
    <!ELEMENT note ( #PCDATA )>
    <!ATTLIST note time CDATA #IMPLIED>
59. HW for this section
- Make a DTD and a conforming XML file. Make your example non-trivial (that is, elements should have attributes, etc.), but feel free to copy and modify examples given in class or in your text. Check your work in the MS Validator.
- You may also need to download the Xerces parser (you'll need it at some point this semester) and install it as per the documentation that accompanies it.
- Save tree.java to your Java directory. Make sure it compiles and runs. See step 4 below.
- For step 3, you will need to download JAXP from http://java.sun.com/xml/download.html