Title: XML
2Outline (ambitious)
- Background: documents (SGML/HTML) and databases (structured and semistructured data)
- XML Basics and Document Type Descriptors
- XML query languages: XML-QL and XSL
- XML additions: XLink, XPointer, RDF, SOX, XML-Data
- Document Object Model (XML APIs)
3Some Useful Articles
- XML, Java, and the future of the web
- http://webreview.com/wr/pub/97/12/19/xml/index.html
- XML and the Second-Generation Web
- http://www.sciam.com/1999/0599issue/0599bosak.html
- Articles/standards for XML, XSL, XML-QL
- http://www.w3c.org/
- http://www.w3.org/TR/REC-xml
4Part I Background
- What's the difference between the world of documents and information retrieval and the world of databases and query interfaces?
5Documents vs Databases
- Document world
- plenty of small documents
- usually static
- implicit structure
- section, paragraph, toc,
- tagging
- human friendly
- content
- form/layout, annotation
- Paradigms
- Save as, wysiwyg
- meta-data
- author name, date, subject
- Database world
- a few large databases
- usually dynamic
- explicit structure (schema)
- records
- machine friendly
- content
- schema, data, methods
- Paradigms
- Atomicity, Consistency, Isolation, Durability
- meta-data
- schema description
6What to do with them
- Documents
- editing
- printing
- spell-checking
- counting words
- retrieving (IR)
- searching
- Database
- updating
- cleaning
- querying
- composing/transforming
7HTML
- Publishing hypertext on the World Wide Web
- Designed to describe how a Web browser should arrange text, images and push-buttons on a page.
- Easy to learn, but does not convey structure.
- Fixed tag set.
[Figure: an HTML fragment containing the text "Welcome to the XML course" and "Introduction", annotated with the labels: opening tag, closing tag, bachelor tag, attribute name, attribute value, text (PCDATA).]
8The Structure of XML
- XML consists of tags and text
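- Tags come in pairs: an opening tag and a matching closing tag
- They must be properly nested
- inner pair closed before the outer pair closes --- good
- inner pair closed after the outer pair closes --- bad
- (You can't do ... ... ... in HTML)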
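Proper nesting can be checked mechanically, because an XML parser rejects interleaved tags. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the tag names `date`, `day` and `month` are illustrative assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical tag names, chosen only for illustration.
good = "<date><day>12</day><month>May</month></date>"  # inner pairs close first
bad = "<date><day>12</date></day>"                     # tags interleave

print(is_well_formed(good))
print(is_well_formed(bad))
```

The first call reports a well-formed document; the second fails because the closing tags arrive in the wrong order.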
9XML text
- XML has only one basic type -- text.
- It is bounded by tags e.g.
- The Big Sleep
- 1935 --- 1935 is still text
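- XML text is called PCDATA (for parsed character data). Its characters are Unicode, and a document is usually encoded as UTF-8 or UTF-16.
- A character can also be written as a numeric reference, e.g. &#x05DE; for the Hebrew letter Mem
- Later we shall see how new types are specified by XML-Data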
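To see that XML content is always text, one can parse a small fragment and inspect what comes back. This is only a sketch: the element names `film`, `title`, `year` and `letter` are assumptions, and the numeric character reference `&#x05DE;` (U+05DE, Hebrew letter Mem) is used as the example of a character written by code point.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical element names; the values come from the slide.
film = ET.fromstring("<film><title>The Big Sleep</title><year>1935</year></film>")
year = film.find("year").text
print(type(year).__name__, year)   # the parser hands back a string, not a number

# A numeric character reference is resolved to the character it names.
letter = ET.fromstring("<letter>&#x05DE;</letter>").text
print(unicodedata.name(letter))
```

Any conversion of "1935" to a number is left to the application, which is why typing layers on top of XML were proposed.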
10XML structure
- Nesting tags can be used to express various
structures. E.g. A tuple (record)
Malcolm Atchison
(215) 898 4321
mp_at_dcs.gla.ac.sc
11XML structure (cont.)
- We can represent a list by using the same
- tag repeatedly
...
... ...
...
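Records and lists can both be built from nothing but nested elements. A small sketch, assuming the tag names `person`, `name`, `tel` and `email` (the slide text does not preserve the original tags); the values are the ones shown on the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; the values are the ones shown on the slides.
person = ET.Element("person")
ET.SubElement(person, "name").text = "Malcolm Atchison"
for number in ["(215) 898 4321", "(215) 898 4321"]:
    ET.SubElement(person, "tel").text = number   # same tag repeated = a list
ET.SubElement(person, "email").text = "mp_at_dcs.gla.ac.sc"

# Reading the list back: findall returns the repeated children in document order.
tels = [t.text for t in person.findall("tel")]
print(tels)
print(ET.tostring(person, encoding="unicode"))
```

The record is the `person` element as a whole; the list is simply the run of `tel` children inside it.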
12Terminology
- The segment of an XML document between an opening
and a corresponding closing tag is called an
element.
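[Figure: the person record (Malcolm Atchison; (215) 898 4321; (215) 898 4321; mp_at_dcs.gla.ac.sc) with three marked spans: one labelled "element", one labelled "element, a sub-element of" its enclosing element, and one labelled "not an element" because it does not run from an opening tag to its corresponding closing tag.]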
13XML is tree-like
[Figure: the person record drawn as a tree whose leaves are "Malcolm Atchison", "(215) 898 4321", "(215) 898 4321" and "mp_at_dcs.gla.ac.sc".]
Semistructured data models typically put the
labels on the edges
14Mixed Content
- An element may contain a mixture of sub-elements and PCDATA:
- British Airways
- World's favorite airline
- Data of this form is not typically generated from
databases. It is needed for consistency with HTML
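Mixed content shows up in an API as text that sits before, inside and after sub-elements. The sketch below uses `xml.etree.ElementTree`, which stores those pieces in `text` and `tail`; the tags `motto` and `em` are assumptions, since the slide's original markup is not preserved.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tags around the slide's text are assumptions.
motto = ET.fromstring("<motto>World's <em>favorite</em> airline</motto>")

print(repr(motto.text))        # text before the first sub-element
em = motto.find("em")
print(repr(em.text))           # text inside the sub-element
print(repr(em.tail))           # text after the sub-element, still inside <motto>

# itertext() walks the PCDATA pieces in document order.
print("".join(motto.itertext()))
```

A record-oriented export from a database would never produce the stray `tail` text, which is one way to tell document-style XML from data-style XML.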
15A Complete XML Document
-
-
- Malcolm Atchison
- (215) 898 4321
- mp_at_dcs.gla.ac.sc
-
16Two ways of representing a DB
projects(title, budget, managedBy)
employees(name, ssn, age)
17Project and Employee relations in XML
Projects and employees are intermixed
- Pattern recognition, 10000, Joe
- Joe, 344556, 34
- Sandra, 2234, 35
- Auto guided vehicle, 70000, Sandra
18Project and Employee relations in XML (contd)
Employees follow projects
- Pattern recognition, 10000, Joe
- Auto guided vehicles, 70000, Sandra
- Joe, 344556, 34
- Sandra, 2234, 35
19Project and Employee relations in XML (contd)
Or without separator tags
- Pattern recognition, 10000, Joe
- Auto guided vehicles, 70000, Sandra
- Joe, 344556, 34
- Sandra, 2234, 35
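The same two relations can be laid out as XML in more than one way. The sketch below builds two of the layouts: records grouped under wrapper elements, and records placed directly under the root. Every tag name here (`db`, `projects`, `project`, `employees`, `employee` and the field tags) is an assumption; only the row values come from the slides.

```python
import xml.etree.ElementTree as ET

# Rows from the slides; every tag name below (db, project, ...) is an assumption.
projects = [("Pattern recognition", "10000", "Joe"),
            ("Auto guided vehicles", "70000", "Sandra")]
employees = [("Joe", "344556", "34"), ("Sandra", "2234", "35")]

def row(parent, tag, fields, values):
    """Append one record element with one child element per field."""
    record = ET.SubElement(parent, tag)
    for field, value in zip(fields, values):
        ET.SubElement(record, field).text = value
    return record

P_FIELDS = ("title", "budget", "managedBy")
E_FIELDS = ("name", "ssn", "age")

# Layout 1: employees follow projects, each group under its own wrapper.
grouped = ET.Element("db")
p_wrap = ET.SubElement(grouped, "projects")
e_wrap = ET.SubElement(grouped, "employees")
for p in projects:
    row(p_wrap, "project", P_FIELDS, p)
for e in employees:
    row(e_wrap, "employee", E_FIELDS, e)

# Layout 2: project and employee records directly under the root.
# (The grouping wrappers are gone; records could appear in any order.)
flat = ET.Element("db")
for p in projects:
    row(flat, "project", P_FIELDS, p)
for e in employees:
    row(flat, "employee", E_FIELDS, e)

print([child.tag for child in grouped])
print([child.tag for child in flat])
```

Neither layout is more correct than the other, which is part of why a separate description of the intended structure is useful.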
20Attributes
- An (opening) tag may contain attributes. These are typically used to describe the content of an element:
- cheese
- fromage
- branza
- A food made
-
- Order of attributes in an element does not matter
- XML elements are ordered
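The difference between unordered attributes and ordered elements is easy to observe with a parser. A brief sketch, assuming the tag names `entry` and `word` and the attribute names `language` and `register` (all hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names; only their order differs between a and b.
a = ET.fromstring('<word language="fr" register="formal">fromage</word>')
b = ET.fromstring('<word register="formal" language="fr">fromage</word>')
print(a.attrib == b.attrib)          # attributes form an unordered name/value map

# Child elements, by contrast, are a sequence: swapping them changes the document.
x = ET.fromstring("<entry><word>cheese</word><word>fromage</word></entry>")
y = ET.fromstring("<entry><word>fromage</word><word>cheese</word></entry>")
print([w.text for w in x] == [w.text for w in y])
```

The first comparison holds and the second does not, matching the two rules above.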
21Attributes (contd)
- Another common use for attributes is to express dimension or type:
- 2400
- 96
-
- M05-.C_at_02!G96YE
-
-
- A document that obeys the nested tags rule and does not repeat an attribute within a tag is said to be well-formed.
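The rule against repeating an attribute within a tag is enforced by the parser itself. A minimal sketch, assuming an element named `height` with a `dimension` attribute (hypothetical names and values):

```python
import xml.etree.ElementTree as ET

def parse_error(text):
    """Return the parser's error message, or None if the text is well-formed."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

# Hypothetical element and attribute names.
print(parse_error('<height dimension="cm">2400</height>'))
print(parse_error('<height dimension="cm" dimension="in">2400</height>'))
```

No DTD is involved here: well-formedness is checked on every document, with or without a declared structure.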
22When to use attributes
- It's not always clear when to use attributes
F. MacNiel
fmacn_at_dcs.barra.ac.sc
...
123 45 6789
F. MacNiel
fmacn_at_dcs.barra.ac.sc
...
23 XML Misc.
- Apart from elements and attributes, XML allows processing instructions and comments. A processing instruction is a statement of the form <?target data?>
- A comment takes the following form: enclose comments between <!-- and -->
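Processing instructions and comments survive parsing as their own node kinds in a DOM. The sketch below uses Python's `xml.dom.minidom`; the `xml-stylesheet` target, its data and the `book` document are made up for illustration.

```python
from xml.dom import minidom

# The stylesheet target and its data are made up for illustration.
doc = minidom.parseString(
    '<?xml-stylesheet href="book.css" type="text/css"?>'
    "<book><!-- a comment --><title>XML</title></book>"
)

pi = doc.firstChild                      # processing instruction before the root
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE, pi.target, pi.data)

comment = doc.documentElement.firstChild
print(comment.nodeType == comment.COMMENT_NODE, repr(comment.data))
```

An application is free to ignore both node kinds; they carry instructions and notes rather than data.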
24Part III Document Type Descriptors
- Imposing structure on XML documents
25Document Type Descriptors
- Document Type Descriptors (DTDs) impose structure on an XML document.
- There is some relationship between a DTD and a schema, but it is not close -- hence the need for additional typing systems.
- The DTD is a syntactic specification.
26Example The Address Book
-
- MacNiel, John --- exactly one name
- Dr. John MacNiel --- at most one greeting
- 1234 Huron Street
- Rome, OH 98765 --- as many address lines as needed (in order)
- (321) 786 2543
- (321) 786 2543
- (321) 786 2543 --- mixed telephones and faxes
- jm_at_abc.com --- as many emails as needed
27Specifying the structure
- name to specify a name element
- greet? to specify an optional (0 or 1) greet element
- name,greet? to specify a name followed by an optional greet
28Specifying the structure (cont)
- addr* to specify 0 or more address lines
- tel | fax a tel or a fax element
- (tel | fax)* 0 or more repeats of tel or fax
- email* 0 or more email elements
29Specifying the structure (cont)
- So the whole structure of a person entry is specified by
- name, greet?, addr*, (tel | fax)*, email*
- This is known as a regular expression. Why is it important?
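One practical answer is that a content model can be checked with ordinary regular-expression machinery. The sketch below assumes the person model is `name, greet?, addr*, (tel | fax)*, email*` (with `*` for zero or more and `|` for alternatives) and encodes the sequence of child tags as a string so that Python's `re` module can match it. The `person` wrapper element and the helper `conforms` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content model for a person entry: name, greet?, addr*, (tel | fax)*, email*
# Each child tag is encoded as "tag " so ordinary regex operators apply.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")

def conforms(person: ET.Element) -> bool:
    """Check the sequence of child tags against the content model."""
    child_tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_tags) is not None

# Hypothetical person entries using the tag names from the slide.
ok = ET.fromstring(
    "<person><name>MacNiel, John</name><greet>Dr. John MacNiel</greet>"
    "<addr>1234 Huron Street</addr><addr>Rome, OH 98765</addr>"
    "<tel>(321) 786 2543</tel><fax>(321) 786 2543</fax>"
    "<tel>(321) 786 2543</tel><email>jm_at_abc.com</email></person>"
)
no_name = ET.fromstring("<person><greet>Dr. John MacNiel</greet></person>")
wrong_order = ET.fromstring("<person><name>J</name><email>e</email><addr>a</addr></person>")

print(conforms(ok), conforms(no_name), conforms(wrong_order))
```

Only the first entry matches; the other two fail because the name is missing or because an address line follows an email.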
30Regular Expressions
- Each regular expression determines a corresponding finite state automaton. Let's start with a simpler example:
- name, addr*, email
- This suggests a simple parsing program
[Figure: finite state automaton with transitions labelled name, addr and email.]
31Another example
- name,address*,(tel | fax)*,email*
[Figure: finite state automaton with transitions labelled name, address, tel, fax and email.]
Adding in the optional greet further complicates things
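The "simple parsing program" suggested by an automaton can be written as a transition table. This sketch assumes the simpler model is `name, addr*, email`, that is, one name, any number of address lines, then one email; the state names `start`, `named` and `done` are invented for the example.

```python
# Transition table for the content model  name, addr*, email
# States: "start" -> (name) -> "named" -> (addr)* -> "named" -> (email) -> "done"
TRANSITIONS = {
    ("start", "name"): "named",
    ("named", "addr"): "named",   # loop: any number of address lines
    ("named", "email"): "done",
}
ACCEPTING = {"done"}

def accepts(tags):
    """Run the automaton over a sequence of child tag names."""
    state = "start"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:          # no transition: the sequence is rejected
            return False
    return state in ACCEPTING

print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "email"]))
print(accepts(["name", "addr"]))           # stops before the required email
print(accepts(["addr", "name", "email"]))  # wrong order
```

Each child element is examined once and no backtracking is needed, which is what makes validation against such models cheap.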
32 A DTD for the address book
-
-
- (name, greet?, address*, (fax | tel)*, email*)
-
-
-
-
-
-
33Our relational DB revisited
projects(title, budget, managedBy)
employees(name, ssn, age)
34Two DTDs for the relational DB
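- First DTD: the root is (projects,employees); projects is (project*); employees is (employee*); project is (title, budget, managedBy); employee is (name, ssn, age) ...
- Second DTD: the root is (project | employee)*; project is (title, budget, managedBy); employee is (name, ssn, age) ...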
35Some things are hard to specify
- Each employee element is to contain name, age and ssn elements in some order.
- ( (name, age, ssn) | (age, ssn, name) |
- (ssn, name, age) | ...
- )
- Suppose there were many more fields!
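The cost of spelling out "all fields, in any order" can be made concrete by generating the alternation. A short sketch; the helper `any_order_model` and the field names `f0`, `f1`, ... are hypothetical.

```python
from itertools import permutations

def any_order_model(fields):
    """Spell out 'all fields, in any order' as a DTD-style alternation."""
    groups = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "(" + " | ".join(groups) + ")"

model = any_order_model(["name", "age", "ssn"])
print(model)

# The number of alternatives grows as n! with the number of fields.
for n in (3, 5, 8):
    fields = [f"f{i}" for i in range(n)]
    print(n, sum(1 for _ in permutations(fields)))
```

Three fields need 6 alternatives, five need 120, and eight already need 40320, so the approach does not scale.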
36Summary of XML regular expressions
- A The tag A occurs
- e1,e2 The expression e1 followed by e2
- e* 0 or more occurrences of e
- e? Optional -- 0 or 1 occurrences
- e+ 1 or more occurrences
- e1 | e2 either e1 or e2
- (e) grouping
37Specifying attributes in the DTD
-
- dimension CDATA #REQUIRED
- accuracy CDATA #IMPLIED
- The dimension attribute is required; the accuracy attribute is optional.
- CDATA is the type of the attribute -- it means string.
38The DTD Language
- Default modifiers in DTD attributes
39The DTD Language
- Datatypes in DTD attributes
40Consistency of ID and IDREF attribute values
- If an attribute is declared as ID
- the associated values must all be distinct (no confusion)
- ID is a poor cousin of a key in relational databases.
- If an attribute is declared as IDREF
- the associated value must exist as the value of some ID attribute (no dangling pointers)
- IDREF is a poor cousin of a foreign key in relational databases.
- Similarly for all the values of an IDREFS attribute
- An attribute of type IDREFS represents a space-separated list of references to valid IDs.
- ID and IDREF attributes are not typed
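The two consistency rules (distinct IDs, no dangling references) amount to a two-pass check over the document. The sketch below is not a DTD validator: it hard-codes which attributes are ID, IDREF and IDREFS, using the names `id`, `mother`, `father` and `children`, and the `family` data is invented, with one deliberately misspelled reference.

```python
import xml.etree.ElementTree as ET

# Which attributes play which role; in a real DTD this comes from ATTLIST declarations.
ID_ATTR = "id"
IDREF_ATTRS = ("mother", "father")
IDREFS_ATTRS = ("children",)

def check_references(root):
    """Return a list of ID/IDREF violations found in the tree."""
    problems, seen = [], set()
    for el in root.iter():
        value = el.get(ID_ATTR)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value!r}")
            seen.add(value)
    for el in root.iter():
        refs = [el.get(a) for a in IDREF_ATTRS if el.get(a) is not None]
        for a in IDREFS_ATTRS:
            refs.extend(el.get(a, "").split())   # space-separated list of IDs
        problems.extend(f"dangling reference {r!r}" for r in refs if r not in seen)
    return problems

# Hypothetical family data in the spirit of the slides.
family = ET.fromstring(
    '<family>'
    '<person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>'
    '<person id="john" children="jane jack"><name>John Doe</name></person>'
    '<person id="mary" children="jane jack"><name>Mary Doe</name></person>'
    '<person id="jack" mother="mary" father="jon"><name>Jack Doe</name></person>'
    '</family>'
)
print(check_references(family))
```

Note what the check cannot express: nothing stops a `father` reference from pointing at an element that is not a person, because the references are untyped.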
41Specifying ID and IDREF attributes
-
-
-
-
- id ID #REQUIRED
- mother IDREF #IMPLIED
- father IDREF #IMPLIED
- children IDREFS #IMPLIED
42Some conforming data
-
- father="john"
- Jane Doe
-
-
- John Doe
-
-
- Mary Doe
-
- father="john"
- Jack Doe
-
43An alternative specification
44The revised data
45Types of Attributes
- Enumerated - List of values (you must use one of the items)
- ALIGN (LEFT | CENTER | RIGHT) "LEFT"
- Programming XML in Java ALIGN="CENTER" Programming XML in Java
46Types of Attributes
- NMTOKEN - Each character of an NMTOKEN value must be a letter, digit, '.', '-', '_', or ':'.
- It may not include whitespace.
- student_name
- student_no NMTOKEN #REQUIRED
- Jo Smith
47Types of Attributes
- Entities
- XML's way of referring to a data item.
- Text or Binary data.
- General Entity
- Use in the content of XML document
- References start with '&' and end with ';'
- Parameter Entity
- Use in a DTD
- References start with '%' and end with ';'
- Internal Entity - Defined in XML Document
- External Entity - Defined in a external source
file, URI.
48Types of Attributes
- Internal General Entities
- Example 1: DATE (PCDATA)
-
- TODAY
- Example 2
-
49Types of Attributes
- External General Entities
- FPI URI
- Example
-
- TODAY
50Types of Attributes
- Predefined General Entity References
- amp  - The & character
- apos - The ' character
- gt   - The > character
- lt   - The < character
- quot - The " character
-
- lkhanat_newutdallas.edu
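The five predefined references (`&amp;`, `&apos;`, `&gt;`, `&lt;`, `&quot;`) are what allow markup characters to appear in content. A sketch using `xml.sax.saxutils` from the Python standard library; the sample string `raw` and the wrapper tag `t` are made up.

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

raw = 'if a < b && b > c then say "hi" to O\'Neil'

# escape() handles &, < and > by default; quotes are added via the mapping.
encoded = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(encoded)

# A parser resolves all five predefined references without any DTD.
parsed = ET.fromstring("<t>" + encoded + "</t>").text
print(parsed == raw)

# unescape() reverses the default three; the mapping restores the quotes.
print(unescape(encoded, {"&quot;": '"', "&apos;": "'"}) == raw)
```

Escaping quotes only matters inside attribute values, but doing it everywhere is harmless.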
51Types of Attributes
- Parameter Entities
- Internal
- External: name PUBLIC FPI URI
- Example
-
-
-
-
-
-
52A useful abbreviation
- When an element has empty content we can use
- <tag blahblahbla/> for <tag blahblahbla></tag>
- For example
-
-
- Jane Doe
-
-
-
- ...
-
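That the abbreviated and the long form of an empty element are the same thing can be confirmed by parsing both. A minimal sketch; the `person` element and its `id` and `mother` attributes are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names.
short = ET.fromstring('<person id="jane" mother="mary"/>')
full = ET.fromstring('<person id="jane" mother="mary"></person>')

# Both spellings yield the same element: same tag, same attributes, no content.
print(short.tag == full.tag, short.attrib == full.attrib)
print(short.text, full.text, len(short), len(full))

# ElementTree serializes an empty element with the abbreviated form by default.
print(ET.tostring(full, encoding="unicode"))
```

The abbreviation is most useful for elements whose information lives entirely in attributes, as with ID and IDREF data.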
53Schema.dtd
54Schema.dtd (contd)
55The DTD Language
- Example Sales Order Document
-
- An order document comprises several sales orders. Each individual order has a number and contains the customer information, the date when the order was received, and the items ordered. Each customer has a number, a name, street, city, state, and ZIP code. Each item has an item number, parts information and a quantity. The parts information contains a number, a description of the product and its unit price.
- The numbers should be treated as attributes.
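One possible reading of this description can be sketched by building such a document programmatically. All tag and attribute names below are assumptions (the description fixes the structure, not the names), the identifying numbers are invented, and the remaining values are taken from the sample order in these slides.

```python
import xml.etree.ElementTree as ET

# Tag and attribute names are assumptions; the slide only describes the structure.
# The numeric identifiers are invented; the other values come from the sample order.
orders = ET.Element("orders")
order = ET.SubElement(orders, "order", number="1")

customer = ET.SubElement(order, "customer", number="1")
for tag, value in [("name", "ABC Industries"), ("street", "123 Main St."),
                   ("city", "Chicago"), ("state", "IL"), ("zip", "60609")]:
    ET.SubElement(customer, tag).text = value

ET.SubElement(order, "date").text = "10222000"

item = ET.SubElement(order, "item", number="1")
part = ET.SubElement(item, "part", number="1")
ET.SubElement(part, "description").text = "Turkey wrench"
ET.SubElement(part, "price").text = "9.95"
ET.SubElement(item, "quantity").text = "10"

ET.indent(orders)                       # pretty-print; requires Python 3.9+
print(ET.tostring(orders, encoding="unicode"))
```

Putting the four numbers in attributes and everything else in elements follows the instruction that the numbers be treated as attributes.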
56The DTD Language
- Example Sales Order Document DTD
)
57The DTD Language
- Example Sales Order XML Document
- ABC Industries
- 123 Main St.
- Chicago
- IL
- 60609
- 10222000
- Turkey wrench
- 9.95
- 10
58Connecting the document with its DTD
- In line
-
-
- ...
- Another file
-
- A URL
-
- "http://www.schemaauthority.com/schema.dtd"
59Well-formed and Valid Documents
- Well-formed: applies to any document (with or without a DTD); proper nesting of tags and unique attributes
- Valid: specifies that the document conforms to the DTD; conforms to the regular expression grammar, types of attributes correct, and constraints on references satisfied
60DTDs vs. Schemas (or Types)
- By database (or programming language) standards DTDs are rather weak specifications.
- Only one base type -- PCDATA
- No useful abstractions, e.g., sets
- IDREFs are untyped. You point to something, but you don't know what!
- No constraints, e.g., child is inverse of parent
- No methods
- Tag definitions are global
- Some of the XML extensions impose something like a schema or type on an XML document. We'll see these later
61Lots of possibilities for schemas
- XML Schema (under W3C's spotlight)
- XDR (Microsoft's BizTalk)
- SOX (Schema for Object-Oriented XML)
- Schematron
- DSD (AT&T Labs and BRICS)
- and more.
62Some tools
- XML Authority http://www.extensibility.com/tibco/solutions/xml_authority/index.htm
- XML Spy http://www.xmlspy.com/download.html
63Summary
- XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema)
- DTDs provide some useful syntactic constraints on documents. As schemas they are weak
- How to store large XML documents?
- How to query them?
- How to map between XML and other representations?