Title: Processing of structured documents
1 Processing of structured documents
2 XSL Formatting model
- An XSL stylesheet processor accepts a document in XML and an XSL
  stylesheet and produces the presentation of that XML source content
  that was intended by the designer of that stylesheet
- two parts
  - tree transformation: constructing a result tree from the XML source
  - formatting: interpreting the result tree to produce formatted results
    suitable for presentation on a display, on paper, in speech, or onto
    other media
3 XSL formatting process
- XSLT -> element and attribute tree
- objectify -> formatting objects tree
- refinement
- area tree
- rendering
4 Tree transformation
- Allows the structure of the result tree to be significantly different
  from the structure of the source tree
  - one could add a table of contents (a filtered selection of an
    original source document)
  - one could rearrange source data into a sorted tabular presentation
- in constructing the result tree, the tree transformation process also
  adds the information necessary to format the result tree
5 Formatting
- Formatting is enabled by including formatting semantics in the result
  tree (cf. CSS semantics)
- formatting semantics are expressed in terms of classes of formatting
  objects
  - the nodes of the result tree are formatting objects
- the classes of formatting objects denote
  - typographic abstractions such as page, paragraph, table
- finer control: formatting properties
  - indenting control, word- and letter-spacing, widow, orphan, and
    hyphenation control
6 In tree transformation
- Tree transformation constructs the result tree
  - in XSL, the result tree is called the element and attribute tree
  - in the tree, the objects are primarily in the formatting objects
    namespace (fo)
- a formatting object is represented as an XML element, with the
  formatting properties represented by a set of XML attribute-value pairs
  - the content of the formatting object is the content of the XML element
- transformations are defined in the XSLT spec
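
The representation described above can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. This is only an illustration, not part of any XSL processor: the helper name make_block and the chosen property values are assumptions made for the example. The namespace URI is the one used for XSL formatting objects.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the XSL formatting objects vocabulary.
FO_NS = "http://www.w3.org/1999/XSL/Format"
# Ask ElementTree to serialize this namespace with the conventional "fo" prefix.
ET.register_namespace("fo", FO_NS)


def make_block(text, **properties):
    """Hypothetical helper: build one fo:block element.

    Keyword names use '_' where the XSL property name uses '-'.
    """
    attrs = {name.replace("_", "-"): value for name, value in properties.items()}
    block = ET.Element(f"{{{FO_NS}}}block", attrs)
    block.text = text  # the content of the formatting object
    return block


# Formatting properties become XML attribute-value pairs on the element.
block = make_block("Hello, XSL", font_size="12pt", space_before="6pt", color="red")
serialized = ET.tostring(block, encoding="unicode")
print(serialized)
```

In a real pipeline these elements would be produced by an XSLT stylesheet rather than built by hand; the sketch only shows what a node of the element and attribute tree looks like.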
7 Formatting objects
- Formatting interprets the result tree in its formatting object tree
  form to produce the presentation
- Semantically, each formatting object represents a specification for a
  part of the pagination, layout and styling information that will be
  applied to the content of that formatting object as a result of
  formatting the whole result tree
  - e.g. the block formatting object represents the breaking of the
    content of a paragraph into lines
  - formatting of a paragraph also depends on the layout structure
    (i.e. aspects not defined in the block fo)
8 Formatting objects
- Block-level objects, inline-level objects
  - refer to the types of areas that are generated
  - areas refer to their default placement method
- Inline areas are collected into lines
  - stacking direction: inline-progression-direction
    (in Western writing systems left-to-right)
- Lines are block-level
  - stacking direction: block-progression-direction
    (in Western writing systems top-to-bottom)
9 Formatting properties
- The formatting properties associated with an instance of a formatting
  object control the formatting of that object
- CSS properties are included in the formatting properties
- some of the properties, e.g., color, directly specify the formatted
  result
- other properties, e.g., space-before, only constrain the set of
  possible formatted results without specifying any particular formatted
  result
10 Refinement
- Refinement is a computational process which finalizes the
  specification of properties based on the attribute values in the XML
  result tree
- refinement involves
  - propagating the inherited values of properties
  - evaluating expressions in property value specifications into actual
    values
  - converting relative numerics to absolute numerics
  - constructing some composite properties from more than one attribute
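
Two of the refinement steps listed above (propagating inherited values and converting relative numerics to absolute ones) can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the procedure of the XSL specification: properties are plain dictionaries, only font-size is converted, the set of inherited properties is reduced to two, and the function names to_points and refine are invented for the example.

```python
# Toy subset of properties treated as inherited in this sketch.
# (color and font-size are inherited; space-before is not.)
INHERITED = {"color", "font-size"}


def to_points(value, parent_font_size_pt):
    """Convert '12pt', '1.5em' or '150%' into an absolute size in points.

    For font-size, 'em' and '%' are taken relative to the parent's font size.
    """
    if value.endswith("pt"):
        return float(value[:-2])
    if value.endswith("em"):
        return float(value[:-2]) * parent_font_size_pt
    if value.endswith("%"):
        return float(value[:-1]) / 100.0 * parent_font_size_pt
    raise ValueError(f"unsupported length: {value}")


def refine(specified, parent_refined):
    """Return the refined properties of one formatting object.

    Assumes parent_refined already holds 'font-size' as a number of points.
    """
    # Step 1: start from the values inherited from the parent.
    refined = {k: v for k, v in parent_refined.items() if k in INHERITED}
    # Step 2: apply the specified values, making font-size absolute.
    for name, value in specified.items():
        if name == "font-size":
            refined[name] = to_points(value, parent_refined["font-size"])
        else:
            refined[name] = value
    return refined


root = {"font-size": 10.0, "color": "black"}
child = refine({"font-size": "1.5em", "space-before": "6pt"}, root)
grandchild = refine({"color": "red"}, child)
print(child)
print(grandchild)
```

Here the child ends up with an absolute font size of 15 points, and the grandchild inherits that size but not space-before, which is not an inherited property.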
11 Area tree
- Formatting consists of the generation of a tree of geometric areas,
  called the area tree
- the areas are positioned on a sequence of one or more pages
- each area has
  - a position on the page
  - a specification of what to display in that area
  - optionally a background, padding and borders
- areas may be nested
  - a character within a line, within a block, within a page
12 Area tree
- As a general rule, the order of the area tree parallels the order of
  the formatting object tree
  - if one formatting object precedes another in the depth-first
    traversal of the formatting object tree, with neither containing the
    other, then all the areas generated by the first will precede all the
    areas generated by the second (in the depth-first traversal of the
    area tree), unless otherwise specified
  - typical exceptions: side floats, footnotes
13 Refinement
- Some of the refinement operations (particularly evaluating
  expressions) depend on knowledge of the area tree
- thus refinement is not necessarily a straightforward, sequential
  procedure
  - may involve look-ahead, backtracking, or control-splicing with other
    processes in the formatter
- constraints may conflict
  - it is implementation-defined which constraints should be relaxed and
    in what order to satisfy the others
14 Rendering
- Rendering takes the area tree (the abstract model of the presentation
  in terms of pages and their collections of areas) and causes a
  presentation to appear on the relevant medium
  - e.g., a browser window on a computer display screen, or sheets of
    paper
15 Alternatives for XML formatting
- XSLT transformation produces HTML
- CSS stylesheet attached to XML document
- XSLT transformation makes structural changes and attaches a CSS
  stylesheet to the result
- XSLT transformation produces formatting objects
  - e.g. FOP can make a conversion to PDF
  - XSmiles editor (HUT)
- See
  - https://www.cs.helsinki.fi/I/hahonen/rado/style_ex.html
    (link from the course material page)
16 XML Namespaces
- How do XML namespaces help to modularize and reuse existing
  definitions?
- Documents (or their structure definitions, processing applications,
  etc.) are not always created from scratch; more and more often,
  existing definitions are reused and combined
  - extremely important especially in e-commerce and other data
    interchange
  - agreement on common vocabularies
17 Author A writes a document
<?xml version="1.0"?>
<references>
  <name>Macmillan</name>
  <link href="http://www.mcp.com"/>
  <name>ABC News</name>
  <link href="http://www.abcnews.com"/>
</references>
18 Author B adds some ratings
<?xml version="1.0"?>
<references>
  <name>Macmillan</name>
  <link href="http://www.mcp.com"/>
  <rating>5 stars</rating>
  <name>ABC News</name>
  <link href="http://www.abcnews.com"/>
  <rating>3 stars</rating>
</references>
19 Author C also wants to add some ratings...
<?xml version="1.0"?>
<references>
  <name>Macmillan</name>
  <link href="http://www.mcp.com"/>
  <rating>G</rating>
  <name>ABC News</name>
  <link href="http://www.abcnews.com"/>
  <rating>PG</rating>
</references>
20 Author D would like to combine the documents...
<?xml version="1.0"?>
<references>
  <name>Macmillan</name>
  <link href="http://www.mcp.com"/>
  <rating>5 stars</rating>
  <rating>G</rating>
  <name>ABC News</name>
  <link href="http://www.abcnews.com"/>
  <rating>3 stars</rating>
  <rating>PG</rating>
</references>
21 Which rating? -> different names
<?xml version="1.0"?>
<references>
  <name>Macmillan</name>
  <link href="http://www.mcp.com"/>
  <qa-rating>5 stars</qa-rating>
  <pa-rating>G</pa-rating>
  <name>ABC News</name>
  <link href="http://www.abcnews.com"/>
  <qa-rating>3 stars</qa-rating>
  <pa-rating>PG</pa-rating>
</references>
22 Namespaces give a disciplined method for naming
<?xml version="1.0"?>
<references
    xmlns:qa="http://joker.com/2000/star-rating"
    xmlns:pa="http://penguin.xmli.com/2000/review"
    xmlns="http://pineapplesoft.com/1999/ref">
  <name>Macmillan</name>
  <link href="http://www.mcp.com"/>
  <qa:rating>5 stars</qa:rating>
  <pa:rating>G</pa:rating>
  ...
</references>
23 Namespaces
- xmlns:qa="http://joker.com/2000/star-rating"
  - qa: the prefix
  - http://joker.com/2000/star-rating: the namespace
    - a unique name (the URI guarantees this); no need to retrieve
      anything from the address
- xmlns="http://pineapplesoft.com/1999/ref"
  - the default namespace
  - elements without prefixes belong to this namespace:
    references, name, link
24 Namespaces
- qa:rating
  - a qualified name (QName)
- scoping
  - The namespace declaration is valid for the element where it is
    declared and for all the elements within its content
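
How a namespace-aware parser sees prefixes, the default namespace and qualified names can be checked with Python's standard xml.etree.ElementTree module. The sketch below parses the example document with the two rating vocabularies. The variable names and the prefixes in the lookup table (stars, review) are assumptions chosen for the illustration; they deliberately differ from the prefixes in the document to show that only the namespace URI matters.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<references xmlns:qa="http://joker.com/2000/star-rating"
            xmlns:pa="http://penguin.xmli.com/2000/review"
            xmlns="http://pineapplesoft.com/1999/ref">
  <name>Macmillan</name>
  <link href="http://www.mcp.com"/>
  <qa:rating>5 stars</qa:rating>
  <pa:rating>G</pa:rating>
</references>"""

root = ET.fromstring(DOC)

# ElementTree replaces every prefix by its namespace URI: "{uri}local-name".
# Unprefixed elements get the default namespace.
tags = [child.tag for child in root]
for tag in tags:
    print(tag)

# Lookup by namespace. The prefixes used here are local to this mapping
# and do not have to match the ones used in the document.
ns = {
    "stars": "http://joker.com/2000/star-rating",
    "review": "http://penguin.xmli.com/2000/review",
}
star_rating = root.find("stars:rating", ns).text
parental_rating = root.find("review:rating", ns).text
print(star_rating, parental_rating)
```

The two rating elements share the local name rating but are distinct for the parser, because their qualified names expand to different namespace URIs.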
25 Scoping
<?xml version="1.0"?>
<ref:references xmlns:ref="http://pineapplesoft.com/1999/ref">
  <ref:name>Macmillan</ref:name>
  <ref:link href="http://www.mcp.com"/>
  <pa:rating xmlns:pa="http://penguin.xmli.com/2000/review">G</pa:rating>
  <ref:name>ABC News</ref:name>
  <ref:link href="http://www.abcnews.com"/>
  <qa:rating xmlns:qa="http://joker.com/2000/star-rating">3 stars</qa:rating>
</ref:references>
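
The scoping rule can be observed with a namespace-aware parser: a prefix may be used inside the element that declares it and in that element's content, but not in a sibling. The following sketch uses Python's standard xml.etree.ElementTree. The two small documents and the element names r and item are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The prefix "pa" is declared on the root, so it is in scope for all content.
IN_SCOPE = (
    '<r xmlns:pa="http://penguin.xmli.com/2000/review">'
    "<item><pa:rating>G</pa:rating></item>"
    "</r>"
)

# Here "pa" is declared only on <item>; the following sibling is outside
# that element, so the prefix is not bound there.
OUT_OF_SCOPE = (
    "<r>"
    '<item xmlns:pa="http://penguin.xmli.com/2000/review"/>'
    "<pa:rating>G</pa:rating>"
    "</r>"
)


def parses(text):
    """Return True if the namespace-aware parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses(IN_SCOPE))
print(parses(OUT_OF_SCOPE))
```

The second document is rejected because the parser meets a prefix that has no declaration in scope.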
26 Namespaces and DTD
- XML 1.0 DTDs are not namespace-aware
- all the elements and attributes that are in some namespace have to be
  declared using the corresponding prefix
- for elements with prefix pre
  - an attribute xmlns:pre has to be declared
27 Namespaces and DTD
<?xml version="1.0"?>
<!DOCTYPE ref:references [
<!ELEMENT ref:references (ref:name, ref:link, (pa:rating | qa:rating))>
<!ATTLIST ref:references xmlns:ref CDATA #REQUIRED>
<!ELEMENT ref:name (#PCDATA)>
<!ELEMENT ref:link EMPTY>
<!ATTLIST ref:link href CDATA #REQUIRED>
<!ELEMENT pa:rating (#PCDATA)>
<!ATTLIST pa:rating xmlns:pa CDATA #REQUIRED>
<!ELEMENT qa:rating (#PCDATA)>
<!ATTLIST qa:rating xmlns:qa CDATA #REQUIRED>
]>
28 DTD: external and internal subsets
- the external and internal subsets together make up the DTD; the
  internal subset has higher precedence
- syntax
    <!DOCTYPE root-type-name SYSTEM "ex.dtd" [
      <!-- the external subset is in the file ex.dtd -->
      <!-- internal subset may come here -->
    ]>
- the internal subset may declare new elements (with attributes) or new
  attributes for existing elements
- namespaces facilitate the control of name conflicts
29 Namespaces and XML Schema
- An XML Schema document contains declarations of the namespaces that
  are used in the document
  - e.g. xmlns:xsd="http://www.w3.org/2001/XMLSchema" for the elements
    with special XML Schema semantics
- Target namespace: the definitions included in this schema define the
  vocabulary of this namespace
  - targetNamespace="uri:mywork"
30 Namespaces and XML Schema
- In XML Schema, schema components from different target namespaces can
  be used together
  -> enables the schema validation of instance content defined across
     multiple namespaces
31 Importing schema declarations
- Every top-level schema component is associated with a target namespace
  (or, explicitly, with none, if the target namespace is not defined)
- a component may refer to another component that is in a different
  namespace, using an import element
32 Import
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:html="http://www.w3.org/1999/xhtml"
        targetNamespace="uri:mywork"
        xmlns:my="uri:mywork">
  <import namespace="http://www.w3.org/1999/xhtml"/>
  <complexType name="myType">
    <sequence>
      <element ref="html:p" minOccurs="0"/>
    </sequence>
  </complexType>
  <element name="myElt" type="my:myType"/>
</schema>
33 Type libraries
- As XML schemas become more widespread, schema authors will want to
  create simple and complex types that can be shared and used as the
  basic building blocks for new schemas
- XML Schema already provides types that play this role: the simple
  types
- other examples: currency, units of measurement, business addresses
34 Example: currencies
<schema targetNamespace="http://www.example.com/Currency"
        xmlns:c="http://www.example.com/Currency"
        xmlns="http://www.w3.org/2000/08/XMLSchema">
  <complexType name="Currency">
    <simpleContent>
      <extension base="decimal">
        <attribute name="name">
          <simpleType>
            <restriction base="string">
              <enumeration value="AED"/>
              <enumeration value="AFA"/>
              <enumeration value="ALL"/>
              ...
35 Extending content models
- Mixed content models
  - an element can contain, in addition to subelements, also arbitrary
    character data
- import
  - an element can contain elements whose types are imported from
    external namespaces
  - e.g. this element may contain an HTML p element here
- more flexible way
  - any element, any attribute
36 Example
<purchaseReport xmlns="http://www.example.com/Report">
  <regions> <!-- part sales by regions --> </regions>
  <parts> <!-- part descriptions --> </parts>
  <htmlExample>
    <table xmlns="http://www.w3.org/1999/xhtml"
           border="0" width="100%">
      <tr>
        <th align="left">Zip Code</th>
        <th align="left">Part Number</th>
        <th align="left">Quantity</th>
      </tr>
      <tr><td>95819</td><td> </td><td> </td></tr>
      <tr><td> </td><td>872-AAA</td><td>1</td></tr>
      ...
37 Including an HTML table
- To permit the appearance of HTML in the instance document, we modify
  the report schema by declaring the content of the element htmlExample
  with the any element
- in general, an any element specifies that any well-formed XML is
  permissible in a type's content model
- in the example, we require the XML to belong to the namespace
  http://www.w3.org/1999/xhtml
  -> the XML should be XHTML
38 Schema declaration with any
<element name="purchaseReport">
  <complexType>
    <sequence>
      <element name="regions" type="r:RegionsType"/>
      <element name="parts" type="r:PartsType"/>
      <element name="htmlExample">
        <complexType>
          <sequence>
            <any namespace="http://www.w3.org/1999/xhtml"
                 minOccurs="1" maxOccurs="unbounded"
                 processContents="skip"/>
          </sequence>
          ...
39 Schema validation
- The attribute processContents
  - skip: no validation
  - strict: an XML processor is obliged to obtain the schema associated
    with the required namespace and validate the HTML appearing within
    the htmlExample element
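
What the any wildcard with processContents="skip" still constrains can be approximated in a few lines of Python. This is a toy check, not a schema validator (Python's standard library has none): it only tests that htmlExample has at least one child element and that every child belongs to the XHTML namespace, leaving the content of those children unexamined. The function name wildcard_ok and the sample documents are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"


def wildcard_ok(html_example, namespace=XHTML, min_occurs=1):
    """Toy stand-in for <any namespace=... minOccurs=... processContents="skip"/>.

    Checks only the direct children; their own content is skipped.
    """
    children = list(html_example)
    if len(children) < min_occurs:
        return False
    prefix = "{" + namespace + "}"
    return all(child.tag.startswith(prefix) for child in children)


# The table is in the XHTML namespace: accepted.
good = ET.fromstring(
    '<htmlExample xmlns="http://www.example.com/Report">'
    '<table xmlns="http://www.w3.org/1999/xhtml"><tr><td>95819</td></tr></table>'
    "</htmlExample>"
)

# The table inherits the report namespace instead: rejected.
bad = ET.fromstring(
    '<htmlExample xmlns="http://www.example.com/Report"><table/></htmlExample>'
)

# No child at all violates minOccurs="1": rejected.
empty = ET.fromstring('<htmlExample xmlns="http://www.example.com/Report"/>')

print(wildcard_ok(good), wildcard_ok(bad), wildcard_ok(empty))
```

With processContents="strict", a real schema processor would additionally have to validate the table against the schema for the XHTML namespace, which this sketch does not attempt.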
40 anyAttribute
<element name="htmlExample">
  <complexType>
    <sequence>
      <any namespace="http://www.w3.org/1999/xhtml"
           minOccurs="1" maxOccurs="unbounded"
           processContents="skip"/>
    </sequence>
    <anyAttribute namespace="http://www.w3.org/1999/xhtml"/>
  </complexType>
</element>