Title: XML Primer
1XML Primer
2XML
- Extensible Markup Language
- Introduced in 1998
- A way to mark up a document to expose the
structure of the document to machine processing
3What Was Wrong With HTML?
- Intended to expose documents to browsers for display
- Browsers need to display documents that aren't properly marked up
- Documents don't have to conform
- Markup tags are predefined and can't be created by a user
4What's Good About XML?
- Standard format
- If a document doesn't conform, it isn't XML
- User-defined tags (invent your own language)
- XML is evolving
- Many technologies have sprung up around it
- DTDs, Schema, Namespaces, Encryption, Signature, XPointer, XLink, XPath, XSLT, DOM, SAX, RDF, SOAP, JAXB, JAXP, JAXM, JAXR, WSDL, UDDI, BPEL, and more
5What's Needed For Web Services?
- The rules for creating XML documents
- XML Schema: a way to describe the structure of a document
- XML Namespaces: definitions of mechanisms for combining documents from different sources
- XML Processing: how to parse and manipulate a document from Java
6What Does XML Look Like?
- Optional Prolog
- Root Element
- Elements
- Attributes
7Prolog
- Identifies the document as XML
- Includes comments about the document
- Includes meta-information about the content
8Processing Instruction (PI)
- <? ... ?>
- <?PITarget ... ?>
- PITarget: a meaningful keyword
- <?xml version="1.0" encoding="UTF-8"?>
- UTF-8: a Unicode encoding built from 8-bit units. Good for English over the internet. Preserves 7-bit ASCII
9Comments
- <!-- This is a comment -->
- Can span multiple lines
- Can't be nested
10Elements
- An element is the pairing of a start tag with an end tag
- <name>David Woolbright</name>
- Every start tag must have a matching end tag
- Everything between the tags is the content of the element
- Tags are user-defined
11Tags
- Begin with a letter or underscore
- May then contain 0-9, A-Z, a-z, _, - and .
- XML is case sensitive
12Content Types of Elements
- Element only
- <a><name>David</name></a>
- Mixed
- <a><b>xxx</b> yyy </a>
- Empty
- <age></age>
- or
- <age/>
13XML Uses Proper Nesting
- Tags can contain tags, but tags cannot overlap
- <a> <b> </b> </a> Yes
- <a> <b> </a> </b> No
14Documents Have a Single Root Element
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Sample document -->
    <name>
      <first>David</first>
      <mi>E</mi>
      <last>Woolbright</last>
    </name>
15XML Rules Produce Hierarchies
name
    first
    mi
    last
16XML Terminology
- Parent element
- Child element
- Sibling element
- Ancestor
- Descendant
[Figure: tree diagram with nodes A, B, C, D, E]
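The terms above can be tried out on the name document from the earlier slide. This is a minimal sketch using the JDK's DOM classes; the class name FamilyTree and the method relativesOf are made up for illustration, and the document is written without whitespace between tags so that every child node is an element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reports the parent and siblings of an element.
final class FamilyTree {

    /** Describes the parent and siblings of the first element named tag. */
    static String relativesOf(String xml, String tag) {
        Document doc = parse(xml);
        Node node = doc.getElementsByTagName(tag).item(0);
        if (node == null) {
            throw new IllegalArgumentException("No element named " + tag);
        }
        return "parent=" + nameOf(node.getParentNode())
                + " previous=" + nameOf(node.getPreviousSibling())
                + " next=" + nameOf(node.getNextSibling());
    }

    private static String nameOf(Node node) {
        return node == null ? "none" : node.getNodeName();
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<name><first>David</first><mi>E</mi><last>Woolbright</last></name>";
        // <mi> is a child of <name> and a sibling of <first> and <last>.
        System.out.println(relativesOf(xml, "mi"));
    }
}
```

Run on the sample, the mi element reports name as its parent, first as its previous sibling and last as its next sibling.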
17Attributes
- An attribute is a name-value pair
- Tags can contain 0 or more attributes
- Attribute syntax: name="value"
- <po id="1276" custid="83730"> </po>
- Some attributes are reserved
- xml:lang="en"
18Semantics
- XML applications can attach any semantics they choose to XML markup
- Attributes can be used to refer to other parts of a document in order to prevent duplication of information
19Element vs Attributes
- <work number="653.323.3938"/>
- Or
    <work>
      <area>653</area>
      <exchange>323</exchange>
      <number>3938</number>
    </work>
20Character Data and Entities
- All character data must comply with the document's encoding
- Other characters must be escaped
- Start with &#, finish with ;
- Example: &#x80; (hexadecimal) is the same character as &#128; (decimal)
- Predefined entities: &lt; &gt; &quot; &apos; &amp;
21CDATA
- Multi-character escape construct
- Syntax
- <![CDATA[any sequence of chars]]>
- Example
    <MYXMLDATA>
      <![CDATA[<A><B></B></A>]]>
    </MYXMLDATA>
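Entity escapes and CDATA sections are two spellings of the same character data. The sketch below, which assumes the JDK's DOM parser and uses the made-up names EscapeDemo and textOf, parses both forms and returns the text an application would see.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: shows what escaped content looks like after parsing.
final class EscapeDemo {

    /** Returns the character data of the root element with escapes resolved. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent() concatenates text nodes and CDATA sections.
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<MYXMLDATA><![CDATA[<A><B></B></A>]]></MYXMLDATA>";
        String withEntities =
                "<MYXMLDATA>&lt;A&gt;&lt;B&gt;&lt;/B&gt;&lt;/A&gt;</MYXMLDATA>";
        // Both lines print the same literal markup text.
        System.out.println(textOf(withCdata));
        System.out.println(textOf(withEntities));
    }
}
```

CDATA is the more readable choice when a block contains many markup characters; entity references are handier for one or two.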
22XML Namespaces
- XML documents can be composed to create new documents
- Name conflicts can occur when documents are combined
- Conflicts are resolved by qualification
- Qualified Name = Namespace ID + Local Name
23URIs U Know?
- XML namespaces use Uniform Resource Identifiers (URIs) for namespace identifiers
- URIs can be locators, names or both
- URL: http://www.colstate.edu
- URN: URIs that are globally unique and persistent
- UUID: Universally Unique Identifiers, 128-bit ids that are globally unique (Ethernet address + high-precision timestamp + increment counter). Used as unique ids in UDDI
24Namespace Syntax
- URIs are long and may contain characters not allowed in XML element names
- Syntax involves two steps
- Namespace ID is associated with a prefix
- Qualified names are obtained as a combination of the prefix, a colon character, and the local element name
25Namespace Example
    <msg:message from="xxx"
        xmlns:msg="http://www.xcomme.com/ns/message"
        xmlns:po="http://www.skatestown.com/ns/po">
      <msg:text>Hi there</msg:text>
      <po:text>Hello all</po:text>
    </msg:message>
26Default Namespaces
- Namespaces increase document size and reduce readability
- Default namespaces can be specified
- Elements in the default space don't need a prefix
    <message from="xxx"
        xmlns="http://www.xcomme.com/ns/message"
        xmlns:po="http://www.skatestown.com/ns/po">
      <text>Hi there</text>
      <po:text>Hello all</po:text>
    </message>
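A namespace-aware parser resolves each prefix, or the default namespace, to its URI, so two elements that share the local name text stay distinct. The sketch below assumes the JDK's DOM parser; NamespaceDemo and expandedChildNames are illustrative names, and the "{uri}local" notation is just a convenient way to print a qualified name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: lists child elements as {namespace URI}local-name.
final class NamespaceDemo {

    static List<String> expandedChildNames(String xml) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
        Document doc;
        try {
            doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> names = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add("{" + child.getNamespaceURI() + "}" + child.getLocalName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<message from='xxx'"
                + " xmlns='http://www.xcomme.com/ns/message'"
                + " xmlns:po='http://www.skatestown.com/ns/po'>"
                + "<text>Hi there</text><po:text>Hello all</po:text></message>";
        // The unprefixed <text> lands in the default namespace,
        // <po:text> in the namespace bound to the po prefix.
        for (String name : expandedChildNames(xml)) {
            System.out.println(name);
        }
    }
}
```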
27Namespace-Prefixed Attributes
- Attributes can have namespaces
- <po:item sku="318-BP" xmlns:po="http://www.xxx.yyy">
- <po:order po:priority="high">
28Dereferencing URI
- In many cases the URI is a URL
- What happens if the URL resource is unavailable?
- It doesn't matter
- URI is for identification purposes only
29XML Schemas
- Document Type Definition (DTD): a set of rules for describing the structure of an XML document
- DTDs help attach meaning to a document
- DTDs don't address namespace integration or flexible content models
- DTDs aren't written in XML
30Well-formed XML
- If a document conforms to the rules of XML syntax (nested tags, one root tag, ...), it is well-formed
- XML processing software can read well-formed documents without problems
- XML parsers generate immediate non-recoverable errors when they detect the document isn't well-formed
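The non-recoverable error can be observed directly. In this sketch, which assumes the JDK's SAX parser, a fatal error surfaces as a SAXException; the class name WellFormedCheck and the method isWellFormed are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content and rethrows fatal errors.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser stopped at the first fatal error
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parser could not be run", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a> <b> </b> </a>")); // properly nested
        System.out.println(isWellFormed("<a> <b> </a> </b>")); // overlapping tags
        System.out.println(isWellFormed("<a/><b/>"));          // two root elements
    }
}
```

Only the first document passes; the other two break the nesting rule and the single-root rule.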
31Valid XML
- A document is valid if it conforms to the rules of a DTD or Schema
- The logic for making sure the document is valid lies inside the parser, relieving the application of this burden
32Schema Benefits
- Schemas enable the following
- Identification of the elements the document can contain
- Identification of the order and relation of the elements
- Identification of the attributes of every element and whether they are optional
- Identification of the datatype of attribute content
33A Simple Schema (W3Schools)
    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.w3schools.com"
        xmlns="http://www.w3schools.com"
        elementFormDefault="qualified">
      <xs:element name="note">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="to" type="xs:string"/>
            <xs:element name="from" type="xs:string"/>
            <xs:element name="heading" type="xs:string"/>
            <xs:element name="body" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
34XML Referencing the Schema
    <?xml version="1.0"?>
    <note xmlns="http://www.w3schools.com"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3schools.com note.xsd">
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>
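Validation against the note schema can be sketched with the JDK's javax.xml.validation package. The class name NoteValidator is hypothetical, and the schema is handed to the validator in code, so the sample documents leave out the xsi:schemaLocation hint and no note.xsd file is needed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper: validates a document against the note schema.
final class NoteValidator {

    static final String NOTE_SCHEMA =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
            + " targetNamespace='http://www.w3schools.com'"
            + " xmlns='http://www.w3schools.com'"
            + " elementFormDefault='qualified'>"
            + "<xs:element name='note'><xs:complexType><xs:sequence>"
            + "<xs:element name='to' type='xs:string'/>"
            + "<xs:element name='from' type='xs:string'/>"
            + "<xs:element name='heading' type='xs:string'/>"
            + "<xs:element name='body' type='xs:string'/>"
            + "</xs:sequence></xs:complexType></xs:element>"
            + "</xs:schema>";

    /** Returns true if the document is well-formed and valid per NOTE_SCHEMA. */
    static boolean isValid(String documentXml) {
        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(NOTE_SCHEMA)));
            Validator validator = schema.newValidator();
            try {
                validator.validate(new StreamSource(new StringReader(documentXml)));
                return true;
            } catch (SAXException invalid) {
                return false; // first validity (or well-formedness) error
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("Schema could not be used", e);
        }
    }

    public static void main(String[] args) {
        String good = "<note xmlns='http://www.w3schools.com'><to>Tove</to>"
                + "<from>Jani</from><heading>Reminder</heading>"
                + "<body>Don't forget me this weekend!</body></note>";
        // Same elements, but <from> precedes <to>, which xs:sequence forbids.
        String wrongOrder = "<note xmlns='http://www.w3schools.com'><from>Jani</from>"
                + "<to>Tove</to><heading>Reminder</heading><body>Hi</body></note>";
        System.out.println(isValid(good));
        System.out.println(isValid(wrongOrder));
    }
}
```

The second document is well-formed but not valid, which is exactly the distinction the two preceding slides draw.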
35The Schema Root
    <?xml version="1.0"?>
    <xs:schema>
      ...
      ...
    </xs:schema>
36Schemas Define Elements
- A simple element is one that only contains text
- Syntax for defining a simple element
- <xs:element name="xxx" type="yyy"/>
- XML Schema has built-in data types
37XML Schema Built-in Data Types
- xs:string
- xs:decimal
- xs:integer
- xs:boolean
- xs:date
- xs:time
38Some Schema Element Definitions
- <xs:element name="firstname" type="xs:string"/>
- The document could contain
- <firstname>David</firstname>
- <xs:element name="age" type="xs:integer" default="0"/>
- The document could contain
- <age>89</age>
39Attributes
- Simple elements can't have attributes
- Elements with attributes are complex
- Attributes can have a default value or a fixed,
specified value
40Some Schema Attribute Definitions
- <xs:attribute name="firstname" type="xs:string"/>
- The owning element could carry
- firstname="David"
- <xs:attribute name="age" type="xs:integer" default="0"/>
- The owning element could carry
- age="89"
41Default and Fixed Values
- <xs:attribute name="firstname" type="xs:string" default="Joe"/>
- <xs:attribute name="firstname" type="xs:string" fixed="Joe"/>
42Optional and Required Attributes
- All attributes are optional by default
- Specify use for required attributes
- <xs:attribute name="firstname" type="xs:string" use="required"/>
43Restrictions on Content
- When an attribute has a defined data type, the content of the XML document must conform to the type, otherwise the document won't validate
- Other restrictions, called facets, can be added to elements and attributes
44Restriction Types
- length, minLength, maxLength: the exact, minimum, and maximum character length of the value
- pattern: a regular expression for the value
- enumeration: a list of possible values
- whiteSpace: rules for handling whitespace
- minExclusive, minInclusive, maxExclusive, maxInclusive: the range of numeric values allowed
- totalDigits: the number of digits in a numeric value
- fractionDigits: the number of digits after the decimal point
45Facets Restricting Range
    <xs:element name="age">
      <xs:simpleType>
        <xs:restriction base="xs:integer">
          <xs:minInclusive value="0"/>
          <xs:maxInclusive value="120"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>
46Facets Restricting Values
    <xs:element name="car">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="Audi"/>
          <xs:enumeration value="Golf"/>
          <xs:enumeration value="BMW"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>
47Reformulated
    <xs:element name="car" type="cartype"/>
    <xs:simpleType name="cartype">
      <xs:restriction base="xs:string">
        <xs:enumeration value="Audi"/>
        <xs:enumeration value="Golf"/>
        <xs:enumeration value="BMW"/>
      </xs:restriction>
    </xs:simpleType>
48Using Patterns
    <xs:simpleType name="skuType">
      <xs:restriction base="xs:string">
        <xs:pattern value="\d{3}-[A-Z]{2}"/>
      </xs:restriction>
    </xs:simpleType>
- Three digits, followed by a dash, followed by two uppercase letters
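The same pattern can be tried out with java.util.regex. This is only an approximation of what a schema validator does: both match the whole value, but the two regular expression dialects differ in details (for example, \d in XML Schema also accepts non-ASCII digits). The class name SkuPattern is made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks a value against the skuType pattern.
final class SkuPattern {

    // Three digits, a dash, two uppercase letters.
    private static final Pattern SKU = Pattern.compile("\\d{3}-[A-Z]{2}");

    static boolean isValidSku(String value) {
        // matches() requires the entire value to fit the pattern,
        // as a schema pattern facet does.
        return SKU.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidSku("318-BP"));  // the sku used on an earlier slide
        System.out.println(isValidSku("318-bp"));  // lowercase letters are rejected
        System.out.println(isValidSku("318-BPX")); // trailing characters are rejected
    }
}
```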
49Complex Types
- Complex types address elements that can have nested children, sequencing, multiplicity of child elements
- Syntax
    <xsd:complexType name="typeName">
      <xsd:someTopLevelModelGroup>
        <!-- Sequencing, multiplicity constraints, ... -->
      </xsd:someTopLevelModelGroup>
      <!-- Attribute declarations -->
    </xsd:complexType>
50ComplexType
    <xsd:complexType name="poType">
      <xsd:sequence>
        <xsd:element name="billto" type="addressType"/>
        <xsd:element name="shipto" type="addressType"/>
        <xsd:element name="order">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="item"
                  type="itemType"
                  maxOccurs="unbounded"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
      <xsd:attribute name="submitted" use="required" type="xsd:date"/>
      ...
    </xsd:complexType>
51Global and Local Elements and Attributes
- An element or attribute defined in a complex type is local to that definition
- An element or attribute defined in the top level (xsd:schema) is global
- Global elements can be document roots
- Global attributes can be used on any element in the document that allows them
52Basic Schema Reusability
- Element References
- Elements have a name and a type
53Processing XML
- Parsing is a process that involves breaking the text of an XML document into pieces (start tag, end tag, text, PIs, ...)
- We can call the pieces tokens
- Many parsers already exist for reading valid XML
54Types of Parsers
- Pull Parser: The application asks the parser to give it the next token. It "pulls" the token from the parser
- Push Parser: The parser sends notifications to the application about the types of tokens it encounters during parsing. Simple API for XML (SAX) defines an event-driven API of this kind
55Types of Parsers
- One-step Parser: The parser reads the whole document and generates a parse tree. XML DOM (Document Object Model) describes these types of trees
- Hybrid Parser: Combines the other three techniques to produce a specialized parser. For example, a one-step approach is combined with pull parsing
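Pull parsing can be sketched with StAX (javax.xml.stream), which ships with the JDK. The slides do not name StAX; it is used here only as a readily available pull API, and the class name PullDemo and the token labels are made up for illustration. The application drives the loop and asks for each token with next().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls tokens from a document one at a time.
final class PullDemo {

    static List<String> tokens(String xml) {
        List<String> tokens = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Report each run of text as a single CHARACTERS token.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next(); // the application pulls the next token
                if (event == XMLStreamConstants.START_ELEMENT) {
                    tokens.add("start:" + reader.getLocalName());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    tokens.add("end:" + reader.getLocalName());
                } else if (event == XMLStreamConstants.CHARACTERS
                        && !reader.isWhiteSpace()) {
                    tokens.add("text:" + reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokens("<name><first>David</first></name>")) {
            System.out.println(token);
        }
    }
}
```

A push parser inverts this loop: the parser owns it and calls back into the application for each token.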
56Parsing in Java
- Java API for XML Processing (JAXP) exists to instantiate XML parsers using either SAX or DOM
- JDOM is the Java community's attempt to develop an API that fits Java computational patterns better than SAX or DOM. JDOM isn't complete at this point
57Processing Architecture
[Figure: Character Stream, Parser, Serializer, XML Doc, Standard XML APIs, Application]
58Data-Oriented XML Processing
- Parsing or generating XML is syntax-oriented
- Application may want a higher view of the data
using an operation-centric approach
[Figure: layers labeled Syntax-oriented APIs, Data Abstraction Layer, Application Logic]
59Invoice Checker
    package com.skatestown.invoice;

    import java.io.InputStream;

    /**
     * SkatesTown Invoice Checker
     */
    public interface InvoiceChecker {
        void checkInvoice(InputStream invoiceXML) throws Exception;
    }
60CheckInvoice()
- Obtain an XML parser
- Parse the XML from the input stream
- Initialize a running total
- Find all order items, calculate subtotals, add to running total
- Add tax to the total
- Add shipping and handling
- Compare running total to invoice total
- If they are different, throw an exception
- Otherwise, return
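The steps above can be sketched with DOM. The invoice layout is an assumption, since the slides do not show it: item elements with quantity and unitPrice attributes, plus tax, shippingAndHandling and totalCost elements. The names DomInvoiceChecker, InvoiceMismatchException and totalsMatch are also made up, and the interface is repeated without its package so the block compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Mirrors the interface from the slide (package declaration omitted here).
interface InvoiceChecker {
    void checkInvoice(InputStream invoiceXML) throws Exception;
}

// Hypothetical DOM-based implementation over an assumed invoice layout.
final class DomInvoiceChecker implements InvoiceChecker {

    /** Thrown when the computed total and the stated total differ. */
    static final class InvoiceMismatchException extends Exception {
        InvoiceMismatchException(String message) {
            super(message);
        }
    }

    @Override
    public void checkInvoice(InputStream invoiceXML) throws Exception {
        // Obtain an XML parser and parse the XML from the input stream.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(invoiceXML);
        Element invoice = doc.getDocumentElement();

        // Find all order items, calculate subtotals, add to the running total.
        BigDecimal runningTotal = BigDecimal.ZERO;
        NodeList items = invoice.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            BigDecimal quantity = new BigDecimal(item.getAttribute("quantity"));
            BigDecimal unitPrice = new BigDecimal(item.getAttribute("unitPrice"));
            runningTotal = runningTotal.add(quantity.multiply(unitPrice));
        }

        // Add tax, then shipping and handling.
        runningTotal = runningTotal
                .add(amount(invoice, "tax"))
                .add(amount(invoice, "shippingAndHandling"));

        // Compare with the invoice total; compareTo ignores trailing zeros.
        BigDecimal invoiceTotal = amount(invoice, "totalCost");
        if (runningTotal.compareTo(invoiceTotal) != 0) {
            throw new InvoiceMismatchException("Invoice states " + invoiceTotal
                    + " but items add up to " + runningTotal);
        }
    }

    private static BigDecimal amount(Element parent, String tag) {
        String text = parent.getElementsByTagName(tag).item(0).getTextContent();
        return new BigDecimal(text.trim());
    }

    /** Convenience wrapper for strings: true if the totals agree. */
    static boolean totalsMatch(String invoiceXml) {
        InputStream in = new ByteArrayInputStream(
                invoiceXml.getBytes(StandardCharsets.UTF_8));
        try {
            new DomInvoiceChecker().checkInvoice(in);
            return true;
        } catch (InvoiceMismatchException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read invoice", e);
        }
    }
}
```

BigDecimal is used instead of double so that currency amounts add up exactly.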
61Data-Centric Approach
- Working with XML is reduced to mapping XML to and from application data
- Converting data to XML is called marshalling
- Converting data from XML is called unmarshalling
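Marshalling and unmarshalling can be written by hand for a small type. The sketch below maps a hypothetical Name class to and from the name document used earlier, using only DOM and the JDK's identity Transformer; NameBinding and its method names are illustrative, and a binding tool would normally generate this kind of code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical hand-written binding between a Name object and <name> XML.
final class NameBinding {

    /** Plain application data, with no XML types in sight. */
    static final class Name {
        final String first;
        final String mi;
        final String last;

        Name(String first, String mi, String last) {
            this.first = first;
            this.mi = mi;
            this.last = last;
        }
    }

    /** Marshalling: application data to XML text. */
    static String marshal(Name name) {
        try {
            Document doc = newBuilder().newDocument();
            Element root = doc.createElement("name");
            doc.appendChild(root);
            appendChild(doc, root, "first", name.first);
            appendChild(doc, root, "mi", name.mi);
            appendChild(doc, root, "last", name.last);

            // An identity transform serializes the tree and escapes text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not serialize name", e);
        }
    }

    /** Unmarshalling: XML text to application data. */
    static Name unmarshal(String xml) {
        try {
            Element root = newBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return new Name(childText(root, "first"),
                    childText(root, "mi"),
                    childText(root, "last"));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse name", e);
        }
    }

    private static DocumentBuilder newBuilder() throws ParserConfigurationException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    private static void appendChild(Document doc, Element parent, String tag, String text) {
        Element child = doc.createElement(tag);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    private static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }
}
```

A round trip such as unmarshal(marshal(new Name("David", "E", "Woolbright"))) should give back the same three strings, and the application logic never touches tags or escaping.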
62Schema Compilers
- Schema compilers are tools that analyze XML schema and code-generate marshalling and unmarshalling modules
- Binding customization can help the schema compiler bind the XML data to specific data structures
- The Java community has developed tools and an API for mapping schema to Java data types: Java Architecture for XML Binding (JAXB)
63SAX Parsing Architecture
[Figure: SAXParserFactory, SAXParser, SAX Reader, parse(), XML input, and the Content, Error, DTD and Entity handlers]
64SAX Callback Interfaces
- void startDocument( )
- void endDocument( )
- void startElement(String namespaceURI, String localName, String qName, Attributes atts)
- void characters(char[] ch, int start, int length)
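The callbacks above can be seen in action with a small handler. This sketch extends the JDK's DefaultHandler, which supplies empty versions of every callback; the class name SaxEventLog and its accessors are made up for illustration. Text is collected in a buffer because a parser may deliver one run of text in several characters() calls.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the callbacks a SAX parser pushes to it.
final class SaxEventLog extends DefaultHandler {

    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String namespaceURI, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) belongs to this call.
        text.append(ch, start, length);
    }

    List<String> events() {
        return events;
    }

    String text() {
        return text.toString();
    }

    /** Parses the document and returns the filled-in log. */
    static SaxEventLog parse(String xml) {
        SaxEventLog log = new SaxEventLog();
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), log);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        SaxEventLog log = parse("<name><first>David</first></name>");
        System.out.println(log.events());
        System.out.println(log.text());
    }
}
```

The parser owns the loop here and the handler only reacts, which is what makes SAX a push API.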