Title: Schemas for XML
1 Schemas for XML
- By Norman Walsh
- From http://www.xml.com/pub/1999/07/schemas/index.html
2 Introduction
- Schemas will have a broad impact on the future of XML for two reasons:
  - first, because they will define what it means for an XML document to be valid, and
  - second, because they are a radical departure from Document Type Definitions (DTDs), the existing schema mechanism inherited from SGML.
3 What is a Schema?
- A schema is a model for describing the structure of information.
- The term is borrowed from the database world, where it describes the structure of data in relational tables.
- In the context of XML, a schema describes a model for a whole class of documents.
- The model describes the possible arrangement of tags and text in a valid document.
- A schema might also be viewed as an agreement on a common vocabulary for a particular application that involves exchanging documents.
4 What is a Schema?
- Is this a valid postal address?
    <address>
      <name>Namron H. Slaw</name>
      <street>256 Eight Bit Lane</street>
      <city>East Yahoo</city>
      <state>MA</state>
      <zip>12481-6326</zip>
    </address>
- Mentally, you compare the address with a schema that you have in your head for addresses.
5 What is a Schema?
- In schemas, models are described in terms of constraints.
- A constraint defines what can appear in any given context.
- There are two kinds of constraints that you can give:
  - content model constraints describe the order and sequence of elements, and
  - datatype constraints describe valid units of data.
6 What is a Schema?
- For example, a schema might describe a valid <address> with the content model constraint that
  - it consists of a <name> element, followed by
  - one or more <street> elements, followed by
  - exactly one <city>, <state>, and <zip> element.
- The content of a <zip> might have a further datatype constraint that it consist of either a sequence of exactly five digits or a sequence of five digits, followed by a hyphen, followed by a sequence of exactly four digits. No other text is a valid ZIP code.
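The ZIP datatype constraint just described maps naturally onto a regular expression. The following is a minimal Python sketch, not part of the original slides; the names `ZIP_PATTERN` and `is_valid_zip` are hypothetical.

```python
import re

# Five digits, optionally followed by a hyphen and exactly four digits.
# [0-9] is used instead of \d so that only ASCII digits are accepted.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def is_valid_zip(text: str) -> bool:
    """Return True if text is a five-digit or five-plus-four ZIP code."""
    # fullmatch requires the whole string to match, so no other text is allowed.
    return ZIP_PATTERN.fullmatch(text) is not None
```

With this sketch, `is_valid_zip("12481-6326")` is true and `is_valid_zip("blue")` is false.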
7 What is a Schema?
- The purpose of a schema is to allow machine validation of document structure.
- The following is not valid according to the informal schema:
    <address>
      <name>Namron H. Slaw</name>
      <street>256 Eight Bit Lane</street>
      <city>East Yahoo</city>
      <state>MA</state>
      <state>CT</state>
      <zip>blue</zip>
    </address>
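To illustrate what machine validation of this address could look like, here is a rough Python sketch using the standard `xml.etree.ElementTree` module. It hand-codes the informal schema (one name, one or more streets, then exactly one city, state, and zip, plus the ZIP datatype rule) and is not a real schema processor. The names `address_errors`, `CONTENT_MODEL`, `ZIP_PATTERN`, and `BAD` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content model: the child tags, joined by spaces, must match this pattern.
CONTENT_MODEL = re.compile(r"name( street)+ city state zip")
# Datatype constraint: five digits, optionally a hyphen and four digits.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def address_errors(xml_text: str) -> list:
    """Return a list of problems found in an <address> document."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "address":
        return ["root element is not <address>"]
    errors = []
    child_tags = " ".join(child.tag for child in root)
    if not CONTENT_MODEL.fullmatch(child_tags):
        errors.append("content model violated: " + child_tags)
    for zip_el in root.iter("zip"):
        if not ZIP_PATTERN.fullmatch((zip_el.text or "").strip()):
            errors.append("invalid ZIP code: " + repr(zip_el.text))
    return errors


# The invalid address from the slide: two states and a non-numeric ZIP.
BAD = """<address>
  <name>Namron H. Slaw</name>
  <street>256 Eight Bit Lane</street>
  <city>East Yahoo</city>
  <state>MA</state>
  <state>CT</state>
  <zip>blue</zip>
</address>"""
```

Calling `address_errors(BAD)` reports two problems, one for the repeated state (a content model violation) and one for the ZIP value (a datatype violation).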
8 What is a Schema?
- The ability to test the validity of documents is going to be an important aspect of large web applications that send and receive information to and from many sources.
- If you're receiving XML transactions over the web, you don't want to process the content into your database if it doesn't conform to the proper schema.
- The earlier and more easily you can catch this sort of error, the better off you will be.
9 Limitations of DTD
- XML inherited DTDs from SGML.
- DTDs can be used to define content models (the valid order and nesting of elements) and, to a limited extent, the datatypes of attributes, but they have a number of obvious limitations:
  - different (non-XML) syntax
  - no support for namespaces
  - extremely limited datatyping
  - a complex and fragile extension mechanism based on little more than string substitution
10 Limitations of DTD
- The worst thing about the DTD extension mechanism (parameter entities) is that it does not really make relationships explicit.
- Two elements defined to have the same content model are not the same thing in any explicit way.
- Likewise, a group of attributes defined as a parameter entity and reused is not logically a group; the attributes are just "coincidentally" a group.
11 Limitations of DTD
- XML Schemas overcome these limitations and are much more expressive than DTDs.
- The additional expressiveness will allow web applications to exchange XML data much more robustly without relying on ad hoc validation tools.
12 Limitations of DTD
- In the short term, DTDs still have a number of advantages:
  - Widespread tools support. All SGML tools and many XML tools can process DTDs.
  - Widespread deployment. A large number of document types are already defined using DTDs: HTML, XHTML, DocBook, TEI, J2008, CALS, etc.
  - Widespread expertise and many years of practical application.
13 Features of Schema
- Richer datatypes
  - booleans, numbers, dates and times, URIs, integers, decimal numbers, real numbers, intervals of time, etc.
  - In addition to these simple, predefined types, there will be facilities for creating other types and aggregate types.
- User-defined types
  - define your own named datatype.
  - For example, you might define a "PostalAddress" datatype and then define two elements, "ShippingAddress" and "BillingAddress", to be of that type.
14 Features of Schema
- Attribute grouping
  - It's not uncommon to have several attributes that "go together". Attribute grouping allows the schema author to make this relationship explicit.
  - In DTDs, the grouping can be achieved with a parameter entity, simplifying the process of authoring a DTD, but the information is not passed on to the processor.
- Refinable archetypes, or "inheritance"
  - DTD content models are closed.
  - Open content models are at the other extreme.
  - Refinable content models are in the middle.
15 Features of Schema
- Namespace support
  - Since the introduction of Namespaces in XML, validation has become much more difficult.
  - In fact, until the XML Schema work is completed, it just is not practical to validate documents that use namespaces.
16 Validity
- Reasons why you need to validate documents:
  - You're doing electronic commerce and you want to know that the purchase order you just received is exactly what you expect.
  - (B2B) If you receive a record from your partner's database via XML, you want to be sure that it's valid before you hand it off to the conversion tool that will insert it into your database.
  - The XML document you're constructing is going to control some overnight batch process and you want to make sure that the instructions you're sending are ones the processor is going to understand.
  - You have 1,000 XML documents that you want to publish on a CD-ROM. You want to be confident that your stylesheet will present each of them correctly without proofing each and every one by hand.
17 Validity
- Using a schema and a validating parser offers one standard way to test your documents.
- Valid documents can still be semantically wrong: you can submit a purchase order that asks for a hundred boxes of staples when you meant to ask for ten. Even so, checking validity catches a lot of "obvious" errors.
18 Validity
- Every document falls into one of four categories:
  - If it is not well-formed, it is not XML.
  - If an XML document does not identify a schema to which it claims to conform, then it is simply well-formed.
  - If a schema is associated with a document, and the document does not fit within the model described by that schema, it is well-formed but not valid.
  - If a schema is associated with a document, and the document does not violate any of the constraints of that schema, it is well-formed and valid.
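The four categories can be sketched as a small decision function. This is an illustrative Python sketch only: `classify` is a hypothetical name, and `schema_check` is a hypothetical stand-in for a validating parser, supplied by the caller.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional


def classify(xml_text: str,
             schema_check: Optional[Callable[[ET.Element], bool]] = None) -> str:
    """Classify a document into one of the four categories above.

    schema_check receives the root element and returns True if the
    document satisfies every constraint of its schema. Pass None when
    the document does not identify a schema.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not XML"  # not well-formed
    if schema_check is None:
        return "well-formed"
    if schema_check(root):
        return "well-formed and valid"
    return "well-formed but not valid"
```

For example, `classify("<a><b></a>")` returns `"not XML"` because the tags are mismatched, while `classify("<a/>")` returns `"well-formed"` because no schema check was supplied.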
19 Content Model Validity
- Content model validity tests whether the order and nesting of tags is correct.
- In XML Schema syntax, the content model of an address could be described like this:
    <elementType name="address">
      <sequence>
        <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
        <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="country" minOccur="0" maxOccur="1"/>
      </sequence>
    </elementType>
20 Datatype Validity
- Datatype validity is the ability to test whether specific units of information are of the correct type and fall within the specified legal values.
- For example, if I am writing a schema for catalog order forms, I should be able to express the constraint that the quantity ordered is greater than zero.
- The ability to express datatype validity in a schema is one of the really new features of XML Schema.
- Although database schemas have always had this ability, XML DTDs do not.
- DTDs have extremely limited datatyping.
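As a rough illustration of the quantity constraint mentioned above, the following Python sketch checks both the type (an integer) and the legal range (greater than zero). The function name `is_valid_quantity` is hypothetical, and allowing an optional leading sign is an assumption about what counts as an integer.

```python
import re


def is_valid_quantity(text: str) -> bool:
    """Return True if text is a base-10 integer greater than zero."""
    stripped = text.strip()
    # Type check: an optional sign followed by ASCII digits only.
    if not re.fullmatch(r"[+-]?[0-9]+", stripped):
        return False
    # Range check: the value must be strictly greater than zero.
    return int(stripped) > 0
```

Here `"3"` passes both checks, `"0"` and `"-2"` fail the range check, and `"2.5"` or `"ten"` fail the type check.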
21 Syntax
- At bottom, a schema describes the content of elements and attributes.
- Example: The name Element Type
    <elementType name="name">
      <mixed/>
    </elementType>
22 Syntax
- Example: A ZIP Code Datatype and the ZIP Element Type
    <datatype name="zipCode">
      <basetype name="string"/>
      <lexicalRepresentation>
        <lexical>99999</lexical>
        <lexical>99999-9999</lexical>
      </lexicalRepresentation>
    </datatype>
  (5 digits, or 5 digits - 4 digits)
    <elementType name="zip">
      <datatypeRef name="zipCode"/>
    </elementType>
23 Syntax
- Example: An Address in Schema Notation
    <elementType name="address">
      <sequence>
        <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
        <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
        <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
      </sequence>
    </elementType>
- In DTD notation:
    <!ELEMENT address (company?, name, street, city, state, zip)>
24 Syntax
- Example: An Address in DTD Notation
    <!ELEMENT address (company?, name, street, city, state, zip)>
- Example: An Address with Parameter Entities
    <!ENTITY % address "company?, name, street, city, state, zip">
    <!ELEMENT billing.address (%address;)>
    <!ELEMENT shipping.address (%address;)>
25 Syntax
- Example: An Address Archetype in Schema
    <archetype name="address" model="refinable">
      <sequence>
        <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
        <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
        <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
        <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
      </sequence>
    </archetype>
    <elementType name="billing.address">
      <archetypeRef name="address"/>
    </elementType>
    <elementType name="shipping.address">
      <archetypeRef name="address"/>
    </elementType>
26 Syntax
- Significant advantages of an archetype (from the previous example):
  - The archetype is refinable. This means that I can derive new, related address types from it. I could create, for example, a return address that included everything in an address but added an element to hold the RMA (return merchandise authorization) number.
  - The relationship that a billing.address is an address and a shipping.address is an address is explicit.
- In DTD notation, the relationship is only implicit:
    <!ELEMENT billing.address (company?, name, street, city, state, zip)>
    <!ELEMENT shipping.address (company?, name, street, city, state, zip)>
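The difference between an explicit and an implicit relationship can be illustrated with a small Python analogy. This is only a sketch of the idea of refinement, not schema syntax: the classes `Particle` and `Archetype`, the method names, and the element name `rma` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Particle:
    """One child element reference with its occurrence bounds."""
    name: str
    min_occur: int = 1
    max_occur: int = 1


@dataclass(frozen=True)
class Archetype:
    """A named, refinable content model that remembers what it was derived from."""
    name: str
    particles: Tuple[Particle, ...]
    base: Optional["Archetype"] = None

    def refine(self, name: str, *extra: Particle) -> "Archetype":
        """Derive a new archetype that appends extra particles to this one."""
        return Archetype(name, self.particles + tuple(extra), base=self)

    def is_a(self, other: "Archetype") -> bool:
        """Return True if self is other or was derived from it."""
        node: Optional[Archetype] = self
        while node is not None:
            if node is other:
                return True
            node = node.base
        return False


address = Archetype("address", (
    Particle("company", 0, 1), Particle("name"), Particle("street", 1, 2),
    Particle("city"), Particle("state"), Particle("zip"),
))
# A return address is everything in an address plus an RMA number.
return_address = address.refine("return.address", Particle("rma"))
```

Because the derivation is recorded, `return_address.is_a(address)` is true. Two DTD declarations that merely repeat the same content model carry no such link.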
27 Example: A Purchase Order
    <!DOCTYPE purchase.order SYSTEM "po.dtd">
    <purchase.order>
      <date>16 June 1967</date>
      <billing.address>
        <name>Namron H. Slaw</name>
        <street>256 Eight Bit Lane</street>
        <city>East Yahoo</city>
        <state>MA</state>
        <zip>12481-6326</zip>
      </billing.address>
      <items>
        <item>
          <quantity>3</quantity>
          <product.number>248</product.number>
          <description>Decorative Widget, Red, Large</description>
          <unitcost>19.95</unitcost>
        </item>
        <item>
          <quantity>1</quantity>
          <product.number>1632</product.number>
          <description>Packed electron storage container, AA, 4-pack</description>
          <unitcost>4.95</unitcost>
        </item>
      </items>
    </purchase.order>
28
    <!DOCTYPE schema SYSTEM "o/reference/w3c/schema/structures.dtd">
    <schema>
      <archetype name="address" model="refinable">
        <sequence>
          <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
          <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
          <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
          <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
          <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
          <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
        </sequence>
      </archetype>
      <elementType name="billing.address">
        <archetypeRef name="address"/>
      </elementType>
      <elementType name="shipping.address">
        <archetypeRef name="address"/>
      </elementType>
29
      <elementType name="items">
        <elementTypeRef name="item" minOccur="1"/>
      </elementType>
      <elementType name="item">
        <sequence>
          <elementTypeRef name="quantity" minOccur="1" maxOccur="1"/>
          <elementTypeRef name="product.number" minOccur="1" maxOccur="1"/>
          <elementTypeRef name="description" minOccur="1" maxOccur="1"/>
          <elementTypeRef name="unitcost" minOccur="1" maxOccur="1"/>
        </sequence>
      </elementType>
      <elementType name="purchase.order">
        <sequence>
          <elementTypeRef name="date" minOccur="1" maxOccur="1"/>
          <elementTypeRef name="billing.address" minOccur="1" maxOccur="1"/>
          <elementTypeRef name="shipping.address" minOccur="0" maxOccur="1"/>
          <elementTypeRef name="items" minOccur="1" maxOccur="1"/>
        </sequence>
      </elementType>
30
      <elementType name="company">
        <mixed/>
      </elementType>
      <elementType name="name">
        <mixed/>
      </elementType>
      <elementType name="street">
        <mixed/>
      </elementType>
      <elementType name="city">
        <mixed/>
      </elementType>
      <elementType name="state">
        <mixed/>
      </elementType>
      <datatype name="zipCode">
        <basetype name="string"/>
        <lexicalRepresentation>
          <lexical>99999</lexical>
          <lexical>99999-9999</lexical>
        </lexicalRepresentation>
      </datatype>
      <elementType name="zip">
        <datatypeRef name="zipCode"/>
      </elementType>
      <elementType name="product.number">
        <mixed/>
      </elementType>
      <elementType name="description">
        <mixed/>
      </elementType>
31
      <datatype name="quantityType">
        <basetype name="integer"/>
        <minExclusive>0</minExclusive>
      </datatype>
      <elementType name="quantity">
        <datatypeRef name="quantityType"/>
      </elementType>
      <datatype name="currency">
        <basetype name="decimal"/>
        <precision>8</precision>
        <scale>2</scale>
      </datatype>
      <elementType name="unitcost">
        <datatypeRef name="currency"/>
      </elementType>
      <elementType name="date">
        <datatypeRef name="dateTime"/>
      </elementType>
    </schema>
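The currency datatype in the schema above restricts a decimal with a precision of 8 and a scale of 2. A rough Python sketch of such a check follows. It assumes that precision means the maximum total number of digits and scale the maximum number of fractional digits; that reading is an interpretation, not something the slides define. The name `is_valid_currency` is hypothetical.

```python
import re

# An optional sign, an integer part, and an optional fractional part.
_DECIMAL = re.compile(r"[+-]?([0-9]+)(?:\.([0-9]+))?")


def is_valid_currency(text: str, precision: int = 8, scale: int = 2) -> bool:
    """Return True if text is a decimal within the assumed digit limits."""
    match = _DECIMAL.fullmatch(text.strip())
    if match is None:
        return False
    # Leading zeros in the integer part do not count as significant digits.
    integer_part = match.group(1).lstrip("0")
    fraction_part = match.group(2) or ""
    return (len(fraction_part) <= scale
            and len(integer_part) + len(fraction_part) <= precision)
```

Under these assumptions, the unit costs in the purchase order (19.95 and 4.95) are accepted, while a value with three fractional digits such as 19.955 is rejected.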
32 Conclusion
- Schemas greatly improve on DTDs.
- Certain kinds of applications can be made more interoperable by XML Schema, for example:
  - exchanging information between databases, and e-commerce
- DTDs are well understood, and they do offer a good way to describe the structure of a document for interchange.
- It will take some time before XML Schemas are as well understood.