Title: Extensible Markup Language XML
1Extensible Markup Language (XML)
2XML
- Extensible Markup Language has become the
universal standard for representing data
- XML started out as a standard data exchange
format for the Web
- Yet, it has quickly become the fundamental
instrument in the development of Web-based online
information services and electronic commerce
applications
- Almost all recent electronic commerce standards
are based on XML
3XML
- A subset of SGML (Standard Generalized Markup
Language), defined by the World Wide Web
Consortium (http://www.w3.org)
- It is a fee-free open standard.
- HTML enables a universal method of displaying
data; XML provides a universal method of
describing data
- Provides the ability to describe data in an open
text-based format and deliver it using the standard
HTTP protocol
4XML
- At present, many applications on the Web use XML
for hosting large amounts of structured and
semi-structured data
- Representation of information in XML documents
has been increasing at an astonishing pace
- According to Meta Group, by 2003, about 65% of
corporate data will be stored in an XML format
5XML The Unifying Technology
XML Messaging
Internet
6Maturity of Web Infrastructure
Technology
Standard
Innovation
Browse the Web
Program the Web
7XML helps address the challenge
- The data is self-describing
- e.g. the meaning of the data is included:
identifiers surround every bit of data,
indicating what it means
- Far more flexible method of representing
transmitted information
- e.g. batched orders sent together can have
different fields and format without breaking apps
on each end
- Open, standard technologies for moving,
processing and validating the data
- e.g. the XML parser can automatically parse,
validate, and feed the information to an
application, instead of every application having
to include this functionality
8XML An Example
Data stream in a typical interface
Electronic Commerce, 100, Turban, 25,
Addison-Wesley
Same data stream in XML
<TITLE>Electronic Commerce</TITLE> <QUANTITY>100</QUANTITY>
<AUTHOR>Turban</AUTHOR> <PRICE>25</PRICE>
<PUBLISHER>Addison-Wesley</PUBLISHER>
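A minimal Python sketch of the difference between the two data streams, using the standard library's ElementTree parser. Only QUANTITY and PUBLISHER are legible as tag names in the slide; BOOK, TITLE, AUTHOR and PRICE are assumed names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Positional record: the receiver must already know what each field means.
flat = "Electronic Commerce, 100, Turban, 25, Addison-Wesley"
fields = [f.strip() for f in flat.split(",")]

# Self-describing record. BOOK, TITLE, AUTHOR and PRICE are assumed names.
xml_record = (
    "<BOOK>"
    "<TITLE>Electronic Commerce</TITLE>"
    "<QUANTITY>100</QUANTITY>"
    "<AUTHOR>Turban</AUTHOR>"
    "<PRICE>25</PRICE>"
    "<PUBLISHER>Addison-Wesley</PUBLISHER>"
    "</BOOK>"
)
book = ET.fromstring(xml_record)
record = {child.tag: child.text for child in book}

print(fields[1])           # meaning depends on the position in the stream
print(record["QUANTITY"])  # meaning is carried by the tag name
```

If the sender later adds or reorders fields, the lookup by tag name keeps working, while the positional index may silently point at a different value.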
9Markup (or Tagging)
- XML uses textual markups to define data
- An XML document is composed of a collection of
tagged elements, each containing a start tag
(<tag_name>), an end tag (</tag_name>), and the
content between the two tags
- Example
- <PONumber>1234ABCD</PONumber>
10Tagging Data in XML
- <PONumber>1234ABCD</PONumber>
- Considering the content only, it is not possible
to understand what 1234ABCD stands for
- The tag name PONumber intuitively tells that the
content is a purchase order number
- Similarly, an XML element might be tagged as
name, gender, birth date, salary, price, etc.
- XML is extensible in the sense that users can
create their own vocabularies: the tag names are
neither predefined nor limited
11Adding Structure to data
- Tagged elements may be nested to any depth to
provide structured data, or may be repeated to
represent a list of values
- A valid XML document contains a single root
element, which constitutes the top-level of
nesting
- In other words, a valid XML document represents a
tree of elements
12Giving Meaning and Structure to Data
Start Tag
Start Tag
An Element
Another Element
An Attribute
Data
End Tag
13Giving Structure to Data
PurchaseOrderRequest
PurchaseOrderDate
LineItem
PONumber
ItemEAN_Identification
QuantityOrdered
UnitPrice
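The element names on this slide can be assembled into a nested tree. The sketch below builds one with Python's ElementTree. The grouping of ItemEAN_Identification, QuantityOrdered and UnitPrice under LineItem is an assumption inferred from the slide layout, and all values except the PO number are made up for illustration.

```python
import xml.etree.ElementTree as ET

def build_order(po_number, date, ean, quantity, unit_price):
    """Build a PurchaseOrderRequest tree with one nested LineItem."""
    root = ET.Element("PurchaseOrderRequest")
    ET.SubElement(root, "PONumber").text = po_number
    ET.SubElement(root, "PurchaseOrderDate").text = date
    item = ET.SubElement(root, "LineItem")  # nested one level deeper
    ET.SubElement(item, "ItemEAN_Identification").text = ean
    ET.SubElement(item, "QuantityOrdered").text = str(quantity)
    ET.SubElement(item, "UnitPrice").text = str(unit_price)
    return root

# Hypothetical sample values.
order = build_order("1234ABCD", "2009-01-01", "4012345678901", 100, "25.00")
print(ET.tostring(order, encoding="unicode"))
```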
14Well-formed and valid XML documents
- There are two levels of correctness of an XML
document
- Well-formed. A well-formed document conforms to
all of XML's syntax rules. For example, if an
element has an opening tag with no closing tag
and is not self-closing, it is not well-formed.
- Valid. A valid document additionally conforms to
some semantic rules. These rules are either
user-defined, or included as an XML schema or
DTD.
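The first level of correctness can be checked with any conforming parser. A small sketch using Python's standard ElementTree, which checks well-formedness only and does not validate against a DTD or schema; the helper name is_well_formed is introduced here for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b>x</b></a>"))  # every tag closed, one root
print(is_well_formed("<a><b>x</a>"))      # <b> is never closed
print(is_well_formed("<a/><b/>"))         # two root elements
print(is_well_formed("<Step>x</step>"))   # names are case-sensitive
```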
15Well-formed documents XML syntax
- The only indispensable syntactical requirement is
that the document has exactly one root element
(alternatively called the document element).
- The root element can be preceded by an optional
XML declaration, which states
- the version of XML
- the character encoding and external dependencies.
- <?xml version="1.0" encoding="UTF-8"?>
- The specification requires that processors of XML
support the pan-Unicode character encodings UTF-8
and UTF-16
16Well-formed documents XML syntax
- XML comments start with <!-- and end with -->.
- <!-- This is a comment. -->
- The text enclosed by the root tags may contain an
arbitrary number of XML elements. The basic
syntax for one element is
- <name attribute="value">content</name>
- Here, content is some text which may again
contain XML elements.
17Another example
18Well-formed documents XML syntax
- Attribute values must always be quoted, using
single or double quotes (' or ")
- Each attribute name should appear only once in
any element.
- Proper nesting: elements may never overlap
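- Not well-formed: Normal <em>emphasized <strong>strong
emphasized</em> strong</strong>
- Well-formed: Normal <em>emphasized <strong>strong
emphasized</strong></em> <strong>strong</strong>
- Empty-element tag: an element with no content has
three equivalent forms: <foo></foo>, <foo />, <foo/>
- An empty element may carry attributes:
<foo author="John" genre="science-fiction"
date="2009-Jan-01" />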
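These syntax rules can be observed with a parser. The sketch below, assuming Python's standard ElementTree, reports the parser error for overlapping elements, an unquoted attribute value and a repeated attribute name, then confirms that the three empty-element forms parse to the same result. The element name foo is a placeholder.

```python
import xml.etree.ElementTree as ET

def parse_error(text):
    """Return the parser's error message, or None if the text is well-formed."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as exc:
        return str(exc)

overlapping = "<p>Normal <em>emphasized <strong>strong emphasized</em> strong</strong></p>"
nested = "<p>Normal <em>emphasized <strong>strong emphasized</strong></em> <strong>strong</strong></p>"
unquoted = "<foo author=John/>"
duplicate = '<foo author="John" author="Jane"/>'

for sample in (overlapping, nested, unquoted, duplicate):
    print(parse_error(sample))

# The three empty-element forms are indistinguishable after parsing.
empty_forms = [ET.fromstring(s) for s in ("<foo></foo>", "<foo />", "<foo/>")]
same = all(e.tag == "foo" and e.text is None and len(e) == 0 for e in empty_forms)

info = ET.fromstring('<foo author="John" genre="science-fiction" date="2009-Jan-01" />')
print(same, info.attrib)
```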
19Entity references
- An entity in XML is a named body of data, usually
text, such as an unusual character.
- An entity reference is a placeholder that
represents that entity
- It consists of the entity's name preceded by an
ampersand ("&") and followed by a semicolon
(";").
- XML has five predeclared entities
- &amp; (&) ampersand
- &lt; (<) less than
- &gt; (>) greater than
- &apos; (') apostrophe
- &quot; (") quotation mark
- More entities can be declared in the document's
Document Type Definition (DTD). (will see)
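A short sketch of the predeclared entities in practice, assuming Python's standard library: xml.sax.saxutils.escape replaces the characters that would otherwise be read as markup, and the parser turns the references back into characters. The sample text is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "AT&T says 3 < 5"
escaped = escape(raw)  # replaces &, < and > with entity references
print(escaped)

# The parser expands the references back to the original characters.
note = ET.fromstring("<note>" + escaped + "</note>")
print(note.text)

# All five predeclared entities are understood without any DTD.
five = ET.fromstring("<t>&amp;&lt;&gt;&apos;&quot;</t>")
print(five.text)
```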
20Well-formed documents
- The document complies with its declared character
encoding.
- The encoding may be declared either externally
("Content-Type" header of HTTP) or internally.
- Element names are case-sensitive.
- ...
- Choosing meaningful names implies the semantics
of elements and attributes to a human reader
21Valid documents XML semantics
- By leaving the names, allowable hierarchy, and
meanings of the elements and attributes open and
definable by a customizable schema or DTD, XML
provides a syntactic foundation for the creation
of purpose specific, XML-based markup languages.
- The schema merely supplements the syntax rules
with a set of constraints.
- Schemas typically restrict element and attribute
names and their allowable containment hierarchies
- For example, an element named 'birthday' contains 3
elements: 'year', 'month' and 'day'. Each contains
only character data.
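To make the idea of schema constraints concrete, the sketch below hand-codes the 'birthday' rule in Python. This is not a schema language or a real validator, only an illustration of the kind of check a schema expresses declaratively. The function name and the assumption that the three children must appear in the order year, month, day are introduced here.

```python
import xml.etree.ElementTree as ET

EXPECTED_CHILDREN = ["year", "month", "day"]  # assumed order

def check_birthday(text):
    """Return a list of constraint violations (an empty list means OK)."""
    elem = ET.fromstring(text)  # raises ParseError if not well-formed
    problems = []
    if elem.tag != "birthday":
        problems.append("root element must be 'birthday'")
    names = [child.tag for child in elem]
    if names != EXPECTED_CHILDREN:
        problems.append(f"expected children {EXPECTED_CHILDREN}, got {names}")
    for child in elem:
        if len(child) != 0:
            problems.append(f"'{child.tag}' must contain only character data")
    return problems

good = "<birthday><year>1990</year><month>5</month><day>17</day></birthday>"
bad = "<birthday><year>1990</year><month>5</month></birthday>"
print(check_birthday(good))
print(check_birthday(bad))
```

Both sample documents are well-formed; only the first one satisfies the constraints, which is the distinction between well-formed and valid.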
22Valid documents XML semantics
- An XML document that complies with a particular
schema/DTD, in addition to being well-formed, is
said to be valid.
- An XML schema is expressed in terms of constraints
on the structure and content of documents
- Before SGML and XML, software designers had to
define special file formats and special-purpose
parsers and writers.
- XML's regular structure and strict parsing rules
allow software designers to leave parsing to
standard tools
- Well-tested tools exist to validate an XML
document "against" a schema
23Document Type Definition (DTD)
- The principal purpose of the DTD is to declare
the hierarchy of document elements
- A document type definition defines
- The name of the elements,
- The content model of each element,
- How often and in which order elements may
appear,
- If the end-tags can be shortcut,
- The possible presence of attributes and their
default values,
- The names of the entities
24An Example DTD
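- <!ELEMENT PurchaseOrderRequest (PONumber, PurchaseOrderDate, LineItem)>
- <!ELEMENT LineItem (ItemEAN_Identification, QuantityOrdered, UnitPrice)>
- <!ELEMENT PONumber (#PCDATA)>
- <!ELEMENT PurchaseOrderDate (#PCDATA)>
- <!ELEMENT ItemEAN_Identification (#PCDATA)>
- <!-- other elements are skipped --> ...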
25DTDs
- A DTD specifies the structure of an XML element
by specifying the names of its sub-elements and
attributes
- Sub-element structure is specified using the
operators
- * set with zero or more elements
- + set with one or more elements
- ? optional
- | or
- All values are assumed to be string values,
unless the type is ANY in which case the value
can be an arbitrary XML fragment
26DTDs
- There is a special attribute id which can occur
once for each element
- EMPTY: the element has no content
- Empty elements usually have attributes that give
them useful properties
- There is no concept of a root of a document: an
XML document conforming to a DTD can be rooted at
any element specified in the DTD
27Element Identity, Ids, and ID References
- To support element sharing, XML reserves an
attribute of type ID, which allows a unique key
to be associated with an element
- An attribute of type IDREF allows an element to
refer to another element with the designated key
and IDREFS may refer to multiple elements
-
-
- John
Smith
-
- ...
-
- ....
- 1995
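A sketch of how an application might follow ID references, using Python's ElementTree. The element and attribute names (library, person, article, id, authors) and the second person are hypothetical. Without a DTD the parser does not know which attributes have type ID or IDREFS, so the code treats "id" and "authors" that way by convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: 'id' plays the role of an ID attribute,
# 'authors' the role of an IDREFS attribute (space-separated keys).
doc = ET.fromstring("""
<library>
  <person id="p1"><name>John Smith</name></person>
  <person id="p2"><name>Jane Doe</name></person>
  <article authors="p1 p2"><year>1995</year></article>
</library>
""")

# Index every element that carries a key.
by_id = {e.get("id"): e for e in doc.iter() if e.get("id") is not None}

# Resolve each reference to the element it designates.
article = doc.find("article")
authors = [by_id[ref].findtext("name") for ref in article.get("authors").split()]
print(authors, article.findtext("year"))
```

Both person elements exist once in the document and can be shared by any number of articles that refer to them.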
28Entities
- Entities represent the physical structure of an
XML document
- Two types of entities
- General entities apply within the top level
element and in attribute values
- Parameter entities apply within the internal and
external DTD subsets
- <!ENTITY recipient "METU">
- <!ENTITY contractor "EC">
- <!ENTITY payment "1 EURO">
- Entity reference in a document
- This contract is between &recipient;
and &contractor; and the award is
&payment;.
- Entity reference expanded
- This contract is between METU and EC
and the award is 1 EURO.
- By changing the entity declarations you can
create any contract.
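The contract example can be run through a parser to watch the expansion happen. A minimal sketch with Python's ElementTree, assuming the declarations sit in an internal DTD subset; the root element name contract is made up, and the entity names follow the slide with the spelling "recipient".

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE contract [
  <!ENTITY recipient "METU">
  <!ENTITY contractor "EC">
  <!ENTITY payment "1 EURO">
]>
<contract>This contract is between &recipient; and &contractor; and the award is &payment;.</contract>"""

# The parser replaces each reference with the declared replacement text.
contract = ET.fromstring(document)
print(contract.text)
```

Editing only the three declarations yields a different contract from the same body text.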
29General Entities
- General entity declaration
- <!ENTITY xml "Extensible Markup Language">
- Entity reference in a document
- The &xml; is derived from ISO 8879, an
International Standard. <... label="&xml;"/>
- Entity reference expanded
- The Extensible Markup Language is derived
from ISO 8879, an International Standard.
<... label="Extensible Markup Language"/>
30Parameter Entities
- A parameter entity is for use only in DTDs
- Parameter entities carry information for use in
the markup declaration, often a set of common
attributes shared by several elements or a link
to an outside DTD.
- Parameter entities whose references are purely
within the DTD are known as internal entities,
whereas references that draw information from
outside files are external entities
- Parameter entities use a percent sign (%) both in
their declaration and in their references to
distinguish themselves from general entities
31Parameter Entities
- Parameter entity declaration
-
- Parameter entity reference in DTD
-
- Parameter entity reference expanded
32DTDs
- The oldest schema format for XML
- Disadvantages
- It has no support for newer features of XML, most
importantly namespaces.
- It lacks expressiveness. Certain formal aspects
of an XML document cannot be captured in a DTD.
- It uses a custom non-XML syntax, inherited from
SGML, to describe the schema.
- Still used in many applications because it is
considered the easiest to read and write.
33Valid documents XML semantics
- Other schema languages
- XML Schema (XSD) (will see)
- RELAX NG (specified by OASIS, now an ISO standard
as part of DSDL)
- ISO DSDL (Document Schema Description Languages)
- Schematron
34XML Namespaces
- Namespaces are a simple and straightforward way
to distinguish names used in XML documents, no
matter where they come from
- The only reason namespaces exist, is to give
elements and attributes programmer-friendly names
that will be unique across the whole Internet
35Example
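<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:head><h:title>Book Review</h:title></h:head>
  <h:body>
    <xdc:bookreview>
      <xdc:title>XML: A Primer</xdc:title>
      <h:table>
        <h:tr align="center">
          <h:td>Author</h:td><h:td>Price</h:td>
          <h:td>Pages</h:td><h:td>Date</h:td></h:tr>
        <h:tr align="left">
          <h:td><xdc:author>Simon St. Laurent</xdc:author></h:td>
          <h:td><xdc:price>31.98</xdc:price></h:td>
          <h:td><xdc:pages>352</xdc:pages></h:td>
          <h:td><xdc:date>1998/01</xdc:date></h:td>
        </h:tr>
      </h:table>
    </xdc:bookreview>
  </h:body>
</h:html>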
36XML Namespaces
- The prefixes are linked to the full names using
the attributes on the top element whose names
begin xmlns
- The prefixes are just shorthand placeholders for
the full names
- Those full names are URIs, i.e. Web addresses
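A sketch of how a namespace-aware parser treats prefixes, using Python's ElementTree on a shortened version of the book review example. The xdc namespace URI (http://www.xml.com/books) is an assumption; the h URI is the one shown on the example slide. The local prefixes html and book in the lookup table are deliberately different from those in the document.

```python
import xml.etree.ElementTree as ET

doc = """<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:head><h:title>Book Review</h:title></h:head>
  <h:body>
    <xdc:bookreview>
      <xdc:title>XML: A Primer</xdc:title>
      <xdc:price>31.98</xdc:price>
    </xdc:bookreview>
  </h:body>
</h:html>"""

root = ET.fromstring(doc)
print(root.tag)  # the prefix is replaced by the full name in braces

# Lookups are by full name; the prefixes used here need not match the document's.
ns = {"html": "http://www.w3.org/HTML/1998/html4",
      "book": "http://www.xml.com/books"}
page_title = root.findtext("html:head/html:title", namespaces=ns)
book_title = root.findtext(".//book:title", namespaces=ns)
print(page_title, "|", book_title)
```

The two title elements never collide because their full names differ, even though the local name is the same.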
37Extensibility in XML
- Anyone can invent new tags and attach a meaning
to those tags
- But if every user creates their own XML definition
for describing their data, it is not possible to
achieve interoperability
- For example, one may prefer to use the tag name
POR, while another prefers using the tag name
PurchaseOrderReq
- In other words, a tagged document is not very
useful without some kind of agreement on the tags
among inter-operating applications
38Extensibility in XML
- Anyone can invent new tags and attach a meaning
to those tags
- For example
- This device
- This device
- But if every user creates their own XML definition
for describing their data, it is not possible to
achieve interoperability
39Agreement on tags is necessary
- In other words, a tagged document is not very
useful without some kind of agreement on the tags
among inter-operating applications
Mobile Device
Hand Held Device
40Many Efforts for Standardized Tags
- HL7 for healthcare
- RosettaNet for supply chain integration in
Information Technology and Electronic Components
domain
- GS1 again in supply chain
- ebXML for eBusiness
- Common Business Library (CBL) for electronic
catalogs, purchase orders, etc.
41XML Parsers
- A parser takes an XML document and makes its
structure and content available to an application
through an API
- There are two main Application Programming
Interfaces (APIs) through which parsers expose
a document to applications
- Document Object Model (DOM) and
- Simple API for XML (SAX)
- Today, many parsers are both DOM and SAX compliant
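The two API styles can be compared side by side. The sketch below uses the DOM and SAX implementations in Python's standard library (xml.dom.minidom and xml.sax) on a made-up document: DOM loads the whole tree into memory and lets the application navigate it, while SAX pushes events to a handler as the parser reads.

```python
import xml.dom.minidom
import xml.sax

DOC = "<order><item>pen</item><item>ink</item></order>"  # hypothetical sample

# DOM: the complete document becomes an in-memory tree.
dom = xml.dom.minidom.parseString(DOC)
dom_items = [node.firstChild.data for node in dom.getElementsByTagName("item")]

# SAX: the parser calls back on each start tag, text chunk and end tag.
class ItemCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
            self._buffer = []

    def characters(self, content):
        if self._in_item:
            self._buffer.append(content)  # text may arrive in several chunks

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buffer))
            self._in_item = False

handler = ItemCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(dom_items, handler.items)
```

DOM is convenient when the application needs random access to the tree; SAX keeps memory use low for large documents because nothing is retained unless the handler stores it.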
42XML DOM Parser
A parser validates and makes the data
contained in an XML document available
to the application
43XSLT Processor
- Converts an XML document to another form
- An XSL style sheet is a set of transformation
instructions for converting a source XML document
to a target document
45Why XML?
46XML vs EDI
47XML vs EDI
48XML vs EDI
49Critique of XML Advantages
- It is text-based.
- It supports Unicode, allowing almost any
information in any written human language to be
communicated.
- It can represent the most general computer
science data structures: records, lists and
trees.
- Its self-documenting format describes structure
and field names as well as specific values.
- The strict syntax and parsing requirements make
the necessary parsing algorithms extremely
simple, efficient, and consistent.
- XML is heavily used as a format for document
storage and processing, both online and offline.
- It is based on international standards.
50Critique of XML Advantages
- It allows validation using schema languages such
as XSD and Schematron, which makes effective
unit-testing, firewalls, acceptance testing,
contractual specification and software
construction easier.
- The hierarchical structure is suitable for most
(but not all) types of documents.
- It manifests as plain text files, which are less
restrictive than other proprietary document
formats.
- It is platform-independent
- Forward and backward compatibility are relatively
easy to maintain
- Its predecessor, SGML, has been in use since
1986, so there is extensive experience and
software available.
- An element fragment of a well-formed XML document
is also a well-formed XML document.
51Critique of XML Disadvantages
- XML syntax is redundant or large relative to
binary representations of similar data.
- The redundancy may affect application efficiency
through higher storage, transmission and
processing costs.
- XML syntax is verbose relative to other
alternative 'text-based' data transmission
formats.
- No intrinsic data type support: XML provides no
specific notion of "integer", "string",
"boolean", "date", and so on
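The last point is easy to see in code. In this Python sketch every value comes out of the parser as a string, and the application has to decide how to interpret it. The sample values are made up, and the choice of int and Decimal is this example's own convention, not something the XML states.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

item = ET.fromstring(
    "<LineItem>"
    "<QuantityOrdered>100</QuantityOrdered>"
    "<UnitPrice>25.50</UnitPrice>"
    "</LineItem>"
)

raw_quantity = item.findtext("QuantityOrdered")
print(type(raw_quantity).__name__)  # element content is always text

# Conversion is the application's job (or a schema language's).
quantity = int(raw_quantity)
unit_price = Decimal(item.findtext("UnitPrice"))
total = quantity * unit_price
print(total)
```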
52Critique of XML Disadvantages
- The hierarchical model for representation is
limited in comparison to the relational model or
an object oriented graph.
- Expressing overlapping (non-hierarchical) node
relationships requires extra effort.
- XML namespaces are problematic to use and
namespace support can be difficult to correctly
implement in an XML parser.
- XML is commonly depicted as "self-documenting"
but this depiction ignores critical ambiguities.
53Some well-known XML based languages and
applications
- RSS: Rich Site Summary
- Ajax
- SOAP: Simple Object Access Protocol
- WSDL: Web Services Description Language
- SVG: Scalable Vector Graphics
- Office applications: OASIS (OpenDocument),
OpenOffice, Microsoft Office
- HL7 Clinical Document Architecture (CDA)
- ...
54HL7 Clinical Document Architecture (CDA)
- A specification for document exchange using
- XML,
- the HL7 Reference Information Model (RIM)
- Version 3 methodology
- and vocabulary (SNOMED, ICD, local, ...)
- CDA Header
- Metadata required for document discovery,
management, retrieval
- CDA Body
- Clinical report
- Discharge Summary
- Referral
55Clinical Document Architecture
56HL7 CDA
- Level One
- The unconstrained CDA Specification
- Only the header is well structured
- Level Two
- Section Level Templates are applied with coded
terms
- Level Three
- Entry Level Templates are applied
- Machine Processable!
57HL7 CDA Example