Title: Processing of structured documents
- Spring 2001
- Helena Ahonen-Myka
Course organization
- 581290-5 laudatur course, 3 cu
- lectures (in Finnish)
- 27.2.-5.4. Tue 12-14, Thu 10-12
- exceptions: no lectures on 6.3. and 8.3.
- exercise sessions
- 6.3.-5.4. Tue 10-12 A318 (in English?), Thu 12-14 C454 (in Finnish; 22.3. at 8-10)
- course assistant: Olli Lahti
- not obligatory
Project work
- an XML application that is constructed during the course
- a framework is given in the first lecture
- in connection with the exercises, more requirements are given
- a report has to be returned by 12.4.
Requirements
- Exam (Wed 11.4. at 16-20): 45 points
- Project: 15 points
- Exercises: 5 extra points
- Maximum points: 60
Outline (preliminary)
- 1. Introduction
- 2. Descriptions of structure
- context-free grammars
- XML DTD, XML Schema
- 3. Programming interfaces
- SAX, DOM
- 4. Querying structured documents
- XML Query
Outline...
- 5. Transforming structured documents
- XSL (XSLT, formatting objects)
- presentation issues
- 6. Document architectures
- 7. Metadata: RDF
- 8. Compressing XML data
- 9. ...
1. Introduction
Structured documents
- Document?
- A structured representation of (textual) information on some medium
- normally for a human reader
- messages, manuals, memos, books
- also to/from/between applications
- source code, program-generated mail, EDI (electronic data interchange)
- static vs. dynamic
Presentation and structure
- Presentation informs the human reader about the meaning of text and the role of its parts
- markup: indicating the presentation or the meaning of different parts of text
- originally hand-written annotations for the typesetter
- nowadays primarily codes embedded in digital documents
Markup
- Procedural markup
- formatting commands (start boldface, produce an empty line, indent 5mm)
- Descriptive markup
- indicating the logical structure of text using chosen names
Structured documents?
- Generally speaking, any text is structured (punctuation, words, sentences)
- but especially descriptively marked-up documents
- especially if they adhere to a rigorous specification of structure
Document
<memo importance="high" date="19990323">
  <from>Paul V. Biron</from>
  <to>Ashok Malhotra</to>
  <subject>Latest draft</subject>
  <body>
    We need to discuss the latest draft <emph>immediately</emph>.
    Either email me at <email>mailto:paul.v.biron@kp.org</email>
    or call <phone>555-9876</phone>
  </body>
</memo>
Data
<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>
<body>
  <p><b>Order date</b> 19990121</p>
  <p><b>Shipping date</b> 19990125</p>
  <p><b>Address</b></p>
  <table>
    <tr><th>name<th>street<th>city<th>state<th>zip
    <tr><td>Ashok Malhotra <td>123 IBM Ave. <td>Hawthorne <td>NY <td>10532-0000
  </table>
  <p>Phone 555-1234</p>
  <p>Fax 555-4321</p>
</body>
Theses of structured documenting
- Separation of structure and presentation
- markup of structure and other (meta) information should be done
- at creation time
- for future needs
- rigor of markup
- automation of processing
Advantages of structure
- Better control over documents
- guidance of writing, validation of structure
- higher-precision retrieval (conditions for parts)
- reuse of information
- automated processing
- control of uniform style
Advantages of structure
- Transport of documents between different environments and applications
- archival of documents
- storing in databases
- multiuse of documents
- different layout styles
- paper, online, CD-ROM, PDA
- different versions
Disadvantages of structure
- Start-up costs
- design of document structures
- conversion of legacy (non-structured) documents
- implementation/adaptation of tools, procedures and policies
- attitudes of authors
- from a producer of a final publication to an information-feeding clerk?
2. Project work
- The goal: everyone builds a (non-trivial) XML application that can be used during the course to practice different concepts and methods
- Example: I would need a system to track the work of my Master's thesis students
A wish list
- I want to store information about my students, e.g., name, contact information, scheduled meetings and deadlines, comments, problems, deals, links to the drafts and the homepages of the students, etc.
- As a primary interface I'd like to have a web page (with forms)
A wish list: functions
- I want to add information using the HTML form on the web page (easily!)
- I want to have a listing on the web page of 1) all the students, 2) information about one student
- I also need other listings (e.g. simple ASCII) for reporting the state of my students (or just a list of my current students)
And now you...
- Design an application that is somehow similar to mine
- set of persons (or other objects) with information (e.g. your customer contacts)
- some parts free text
- several different ways to use the data, e.g. several listings (both content and presentation)
Requirements
- More requirements follow later...
- return a report by 12.4.
- The report should include
- (short) requirements analysis
- descriptions of the structure (DTD, Schema)
- other designs, architecture, ...
- Some kind of a working prototype
- not necessarily the whole system
3. Structure descriptions
- Regular expressions, context-free grammars
- XML Document type definitions
- XML Schema
Regular expressions
- A way to describe a set of strings over an alphabet (of chars, events, elements)
- many uses
- text searching (e.g. emacs, grep, perl)
- in grammatical formalisms (e.g. XML DTDs)
- relevant for document structures: what kind of structural content is allowed for different document components
Regular expressions
- A regular expression over an alphabet Σ is either
- ∅ (the empty set)
- ε (epsilon; sometimes written lambda, λ)
- a, where a ∈ Σ
- R | S (choice; sometimes written R ∪ S)
- R S (catenation) or
- R* (Kleene closure)
- where R and S are regular expressions
Regular expressions
- Regular expression E denotes a language (a set of strings) L(E)
- L(∅) = ∅ (empty set)
- L(ε) = {ε} (singleton set of the empty string)
- L(a) = {a} (singleton set of a ∈ Σ)
- L(R | S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)}
- L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)}
- L(R*) = L(R)* = {x1...xn | xk ∈ L(R), k = 1,...,n; n ≥ 0}
Example
- top-level structure of a document
- Σ = {title, auth, date, sect}
- title followed by an optional list of authors, followed by an optional date, followed by one or more sections
- title auth* (date | ε) sect sect*
- common abbreviations
- E? = (E | ε), E+ = E E*
- -> title auth* date? sect+
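The abbreviated expression title auth* date? sect+ can be tried out with an ordinary regex engine. The following is a minimal sketch, assuming each element name is encoded as a space-terminated token so that a sequence of names becomes one string; that encoding and the helper name `matches_top_level` are illustrative choices, not part of the course material.

```python
import re

# title auth* date? sect+, with every symbol written as "name " (name plus a space)
TOP_LEVEL = re.compile(r"(?:title )(?:auth )*(?:date )?(?:sect )+")

def matches_top_level(children):
    """Return True if the sequence of element names belongs to the language."""
    encoded = "".join(name + " " for name in children)
    return TOP_LEVEL.fullmatch(encoded) is not None

print(matches_top_level(["title", "auth", "auth", "date", "sect"]))  # True
print(matches_top_level(["title", "sect", "sect"]))                  # True
print(matches_top_level(["title", "date"]))                          # False: no section
```

The whole sequence has to match (`fullmatch`), since membership in L(E) concerns the complete string, not a substring.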
Context-free grammars
- Used widely for syntax specification (programming languages)
- G = (V, Σ, P, S)
- V: the alphabet of the grammar G; V = Σ ∪ N
- Σ: the set of terminal symbols; N = V - Σ: the set of nonterminal symbols
- P: set of productions
- S ∈ N: the start symbol
Productions and derivations
- Productions: A -> α, where A ∈ N, α ∈ V*
- e.g. A -> aBa (1)
- Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
- γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> ω is a production of the grammar
- e.g. AA => AaBa (assuming prod. 1 above)
Language generated by a context-free grammar
- γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ
- The language generated by a CFG G:
- L(G) = {w ∈ Σ* | S =>* w}
- L(G) is a set of strings; to model structural elements, we consider parse trees
Parse trees of a CFG
- Aka syntax trees or derivation trees
- nodes labelled by symbols of V (or by ε)
- internal nodes by nonterminals, root by start symbol
- leaves using terminal symbols (or ε)
- parent with label A can have children labelled by X1,...,Xk only if A -> X1...Xk is a production
CFGs for document structures
- Nonterminals represent document structures
- e.g. Ref -> AuthorList Title PublData; AuthorList -> Author AuthorList; AuthorList -> ε
- problem
- obscures the relation of elements (the last Author several hierarchical levels away from Ref)
- -> solution: extended CFGs
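The problem can be made concrete by building the parse tree that the recursive AuthorList productions force. This is a small sketch under assumptions: trees are represented as nested tuples, and the function names (`author_list`, `ref`, `depth_of_last_author`) are hypothetical helpers introduced for illustration.

```python
# Productions: Ref -> AuthorList Title PublData
#              AuthorList -> Author AuthorList | epsilon

def author_list(n):
    """Parse tree (nested tuples) deriving n Authors with the recursive rule."""
    if n == 0:
        return ("AuthorList",)                      # AuthorList -> epsilon
    return ("AuthorList", ("Author",), author_list(n - 1))

def ref(n):
    """Parse tree for a Ref with n authors."""
    return ("Ref", author_list(n), ("Title",), ("PublData",))

def depth_of_last_author(tree, depth=0):
    """Distance from the root to the deepest Author node, or None if there is none."""
    best = depth if tree[0] == "Author" else None
    for child in tree[1:]:
        d = depth_of_last_author(child, depth + 1)
        if d is not None and (best is None or d > best):
            best = d
    return best

print(depth_of_last_author(ref(1)))  # 2
print(depth_of_last_author(ref(3)))  # 4
```

With this grammar the n-th author sits n + 1 levels below Ref, although logically every author is a direct part of the reference.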
Extended CFGs (ECFGs)
- Like CFGs, but right-hand sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData
- Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
- γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> E is a production such that ω ∈ L(E)
- e.g. Ref => Author Author Author Title PublData
Language generated by an ECFG
- Defined similarly to CFGs
- Theorem: Languages generated by extended and ordinary CFGs are the same
Parse trees of an ECFG
- Similar to parse trees of an ordinary CFG, except that
- parent with label A can have children labelled by X1,...,Xk when A -> E is a production such that X1...Xk ∈ L(E)
- -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
What is XML?
- W3C Recommendation, Feb 1998
- metalanguage that can be used to define markup languages
- gives syntax for defining extended context-free grammars
- XML documents that adhere to the ECFG are strings in the language
- document types (grammars) vs. document instances (strings in the language)
XML encoding of structure
- XML document: essentially a parenthesized linear encoding of a parse tree
- corresponds to a preorder walk
- start of inner node (element) A denoted by a start tag <A>, end denoted by end tag </A>
- leaves are strings (or empty elements)
- certain extensions (especially attributes)
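The encoding can be sketched as a recursive preorder walk. The tree representation below (a string for a text leaf, a tuple `(label, child, ...)` for an element) and the function name `encode` are assumptions made for this illustration; attributes are left out.

```python
from xml.sax.saxutils import escape

def encode(node):
    """Serialize a parse tree as XML by a preorder walk.

    A node is either a string (text leaf) or a tuple (label, child, ...).
    """
    if isinstance(node, str):
        return escape(node)               # text leaf; escapes &, < and >
    label, children = node[0], node[1:]
    if not children:
        return f"<{label}/>"              # empty element
    inner = "".join(encode(child) for child in children)
    return f"<{label}>{inner}</{label}>"  # start tag, children, end tag

tree = ("memo", ("from", "Paul V. Biron"), ("to", "Ashok Malhotra"), ("br",))
print(encode(tree))
# <memo><from>Paul V. Biron</from><to>Ashok Malhotra</to><br/></memo>
```

Each start tag is written when the walk enters a node and the end tag when it leaves, which is why the tags nest like parentheses.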
Terminal symbols in practice
- Leaves of parse trees are labelled by single characters (symbols of Σ)
- too granular in practice; instead terminal symbols which stand for all values of a type
- e.g. #PCDATA in XML for variable-length content of data characters
- richer data types in proposed XML schema formalisms
XML logical structure
- Elements
- correspond to internal nodes of the parse tree
- unique root element -> document is a single parse tree
- indicated by matching (case-sensitive!) tags <ElementTypeName>...</ElementTypeName>
- can contain text and/or subelements
- can be empty
- <elem-type></elem-type>
- <br />
Logical structure
- Attributes
- name-value pairs attached to elements
- metadata, usually not treated as content
- e.g. <div class="preface" date="990126">
- also
- <!-- comments -->
- <?note this text would be passed to the application as a processing instruction named note?>
Document type declaration
- Provides a grammar (document type definition, DTD) for a class of documents
- syntax
- <!DOCTYPE root-type-name SYSTEM "ex.dtd" <!-- external subset in file ex.dtd -->
  [ <!-- internal subset may come here --> ]>
- external and internal subset make up the DTD; internal has higher precedence
XML declaration
- <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
Defining the structure: DTD
- document type definition (DTD)
- content model for each element
- describes how the elements are formed from the other elements and text
- defines which attributes an element may/must have; default values
- content models are regular expressions
Markup declarations
- Element type declarations (similar to productions of ECFGs)
- attribute-list declarations (for declared element types)
- entity declarations
- notation declarations
Element type declarations
- The general form is
- <!ELEMENT elem-type-name (E)>
- where E is a content model: a regular expression over element names
Regular expression syntax
- + : 1 or more
- * : 0 or more
- ? : 0 or 1
- | : choice (one has to be chosen)
- () : grouping
- , : order
Examples of definitions
- <!ELEMENT name (fname, lname)>
- <!ELEMENT address (name, street, ((city, state, zipcode) | (zipcode, city)))>
- <!ELEMENT contact (address, phone, email?)>
- <!ELEMENT contact2 (address | phone | email)>
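Because a content model is a regular expression over element names, it can be translated mechanically into a regex and used to check a sequence of child elements. The sketch below makes several assumptions: it handles element content only (no #PCDATA, EMPTY or ANY), it encodes each name as a space-terminated token, and the names `content_model_to_regex` and `children_allowed` are hypothetical. A real validating parser does considerably more.

```python
import re

# A token is whitespace, a comma, or an element name; everything else
# ("(", ")", "|", "*", "+", "?") already means the same thing in a Python regex.
TOKEN = re.compile(r"\s+|,|[A-Za-z_][\w.\-]*")

def _translate(match):
    token = match.group()
    if token == "," or token.isspace():
        return ""                                # sequence is plain concatenation
    return "(?:" + re.escape(token) + " )"       # element name -> "name " token

def content_model_to_regex(model):
    """Translate a DTD content model such as "(a, (b | c)*, d?)" to a regex."""
    return re.compile(TOKEN.sub(_translate, model))

def children_allowed(model, children):
    """Check a list of child element names against a content model."""
    encoded = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None

ADDRESS = "(name, street, ((city, state, zipcode) | (zipcode, city)))"
print(children_allowed(ADDRESS, ["name", "street", "zipcode", "city"]))  # True
print(children_allowed(ADDRESS, ["name", "street", "city", "state"]))    # False
```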
DTD for the Invoice example
<!DOCTYPE invoice [
<!ELEMENT invoice (orderDate, shipDate, billingAddress, voice, fax?)>
<!ELEMENT orderDate (#PCDATA)>
<!ELEMENT shipDate (#PCDATA)>
<!ELEMENT billingAddress (name, street, city, state, zip)>
<!ELEMENT voice (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>
]>
Attribute-list declarations
- Name, data type and possible default value for each attribute for a given element type
- Example:
  <!ATTLIST FIG
    id    ID          #IMPLIED
    descr CDATA       #REQUIRED
    class (a | b | c) "a">
- semantics mainly up to the application
Mixed, empty and arbitrary content
- Mixed content
- <!ELEMENT P (#PCDATA | I | IMG)*>
- may contain text (#PCDATA) and elements
- Empty content
- <!ELEMENT IMG EMPTY>
- Arbitrary content
- <!ELEMENT X ANY>
- = <!ELEMENT X (#PCDATA | choice-of-all-declared-element-types)*>
Entities
- Character entities, e.g. &lt;
- amp, lt, gt, apos, quot are built-in
- general entities are shorthand notations: <!ENTITY HY "University of Helsinki">
- physical storage units comprising a document
- parsed entities
- <!ENTITY chapt1 SYSTEM "chapter1.xml">
- elements in entities must nest properly
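Entity expansion can be observed with a parser. A minimal sketch using Python's standard `xml.etree.ElementTree`, assuming the HY entity is declared in the internal subset of a small made-up `note` document; external entities such as chapt1 are not covered here.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE note [
  <!ENTITY HY "University of Helsinki">
]>
<note>Course at &HY;, dept. of CS &amp; more</note>"""

root = ET.fromstring(doc)
# The parser replaces &HY; by its replacement text and &amp; by "&".
print(root.text)  # Course at University of Helsinki, dept. of CS & more
```

The application only sees the expanded text; the entity references themselves are not passed on.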
Unparsed entities
- External (binary) files
- declarations
- <!NOTATION TIFF SYSTEM "bin/xv">
- <!ENTITY fig123 SYSTEM "figs/f123.tif" NDATA TIFF>
- <!ATTLIST IMG file ENTITY #REQUIRED>
- usage
- <IMG file="fig123"/>
Parameter entities
- A way to parameterize and modularize DTDs
- <!ENTITY % stattr "status (draft | ready) 'draft'">
- <!ATTLIST chap %stattr;>
- <!ATTLIST sect %stattr;>
Note
- elements cannot overlap
- container elements must have end tags
- empty elements: <br />
- all names are case-sensitive
- attribute values must be delimited by quotation marks
XML processing model
- A processor (parser)
- reads XML documents
- passes data to an application
- XML Specification tells how to read, what to pass
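The division of labour between processor and application can be illustrated with an event-based parser. The sketch below uses Python's standard `xml.sax` module; the handler class `EventLog` and the sample memo are assumptions for illustration, and the XML specification itself does not prescribe this particular interface.

```python
import xml.sax

class EventLog(xml.sax.ContentHandler):
    """Application side: record what the parser passes on."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        if content.strip():                      # skip whitespace-only chunks
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventLog()
xml.sax.parseString(b'<memo importance="high"><to>Ashok Malhotra</to></memo>', handler)
for event in handler.events:
    print(event)
```

The parser reads the document and reports elements, attributes and character data; what is done with them is entirely up to the application.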
XML Information set
- An XML document's information set consists of a number of information items
- an information item is an abstract representation of some part of an XML document
- each information item has a set of associated properties
XML Information set
- Tree structure provided by the processor (no special interface is specified)
- e.g. entities expanded to their replacement text, attributes with their default values
- properties: e.g. for each element, its child elements and attributes
Namespaces
- An XML document may contain multiple markup vocabularies
- reuse of existing markup, e.g. including HTML markup in some document type
- An XML namespace is a collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names
Namespace prefix declaration and use
- <x xmlns:edi="http://ecommerce.org/schema">
- <edi:price units="Euro">32.18</edi:price>
- <lineItem edi:taxClass="exempt">Baby food</lineItem>
- </x>
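How a namespace-aware processor sees such a document can be checked with Python's standard `xml.etree.ElementTree`, which reports qualified names in the form `{URI}local-name`. This is only a sketch; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """<x xmlns:edi="http://ecommerce.org/schema">
  <edi:price units="Euro">32.18</edi:price>
  <lineItem edi:taxClass="exempt">Baby food</lineItem>
</x>"""

EDI = "http://ecommerce.org/schema"
root = ET.fromstring(doc)

# The prefix "edi" is only shorthand; the URI is what identifies the name.
price = root.find(f"{{{EDI}}}price")
print(price.tag)    # {http://ecommerce.org/schema}price
print(price.text)   # 32.18

# lineItem itself is in no namespace, but one of its attributes is.
line_item = root.find("lineItem")
print(line_item.get(f"{{{EDI}}}taxClass"))  # exempt
```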
XML Schema
- DTDs have drawbacks
- They can only define the element structure and attributes
- They cannot define any database-like constraints for elements
- Value (min, max, etc.)
- Type (integer, string, etc.)
- DTDs are not written in XML and cannot thus be processed with the same tools as XML documents, XSL(T), etc.
- XML Schema
- Is written in XML
- Avoids most of the DTD drawbacks
XML Schema
- XML Schema Part 1: Structures
- Element structure definition as with DTD: elements, attributes, also enhanced ways to control structures
- XML Schema Part 2: Datatypes
- Primitive datatypes (string, boolean, float, etc.)
- Derived datatypes from primitive datatypes (time, recurringDate)
- Constraining facets for each datatype (minLength, maxLength, pattern, precision, etc.)
- Information about Schemas
- http://www.w3c.org/XML/Schema/
Complex and simple types
- complex types allow elements in their content and may have attributes
- simple types cannot have element content and cannot have attributes
Reminder: DTD declarations
- <!ELEMENT name (fname, lname)>
- <!ELEMENT address (name, street, ((city, state, zipcode) | (zipcode, city)))>
- <!ELEMENT contact (address, phone, email?)>
- <!ELEMENT fname (#PCDATA)>
Example: USAddress type
<xsd:complexType name="USAddress">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city" type="xsd:string"/>
    <xsd:element name="state" type="xsd:string"/>
    <xsd:element name="zip" type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN" use="fixed" value="US"/>
</xsd:complexType>
Example: PurchaseOrderType
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
Notes
- element declarations for shipTo and billTo associate different element names with the same complex type
- attribute declarations must reference simple types
- element comment declared elsewhere in the schema (here reference only)
... continues
- element is optional, if minOccurs = 0
- maximum number of times an element may appear: maxOccurs
- attributes may appear once or not at all
- the use attribute is used in an attribute declaration to indicate whether the attribute is required or optional, and if optional, whether the value is fixed or whether there is a default
More examples
<items>
  <item partNum="872-AA">
    <productName>Lawnmower</productName>
    <quantity>1</quantity>
    <price>148.95</price>
    <comment>Confirm this is electric</comment>
  </item>
  <item partNum="926-AA">
    <productName>Baby Monitor</productName>
    <quantity>1</quantity>
    <price>39.98</price>
    <shipDate>1999-05-21</shipDate>
  </item>
</items>
<xsd:complexType name="Items">
  <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
    <xsd:complexType>
      <xsd:element name="quantity">
        <xsd:simpleType base="xsd:positiveInteger">
          <xsd:maxExclusive value="100"/>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="price" type="xsd:decimal"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
      <xsd:attribute name="partNum" type="Sku"/>
    </xsd:complexType>
  </xsd:element>
</xsd:complexType>
<xsd:simpleType name="Sku">
  <xsd:pattern value="\d{3}-[A-Z]{2}"/>
</xsd:simpleType>
Patterns
<xsd:simpleType name="Sku">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
</xsd:simpleType>
- three digits followed by a hyphen followed by two upper-case ASCII letters
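The same pattern (three digits, a hyphen, two upper-case ASCII letters) can be tried out with Python's `re` module. This is a sketch, not a schema validator: the helper name `is_sku` is hypothetical, and `fullmatch` is used because an XML Schema pattern has to match the whole value.

```python
import re

SKU = re.compile(r"\d{3}-[A-Z]{2}")

def is_sku(value):
    """True if the whole value has the form ddd-LL, e.g. 872-AA."""
    return SKU.fullmatch(value) is not None

print(is_sku("872-AA"))   # True
print(is_sku("872-aa"))   # False: lower-case letters
print(is_sku("1872-AA"))  # False: four digits
```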
Building content models
- <xsd:sequence>: fixed order
- <xsd:choice>: (1) choice of alternatives
- <xsd:group>: grouping (also named)
- <xsd:all>: no order specified
Well-formed XML documents
- documents that adhere to the formal requirements (syntax) of the XML specification
- if a document is not well-formed, it is not an XML document (and the XML tools do not have to process it)
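Well-formedness is easy to test in practice, because a conforming parser must reject anything that is not well-formed. A minimal sketch with Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` is an illustrative choice, and note that this parser is non-validating, so it says nothing about validity against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<memo><to>Ashok</to></memo>"))   # True
print(is_well_formed("<memo><to>Ashok</memo></to>"))   # False: elements overlap
print(is_well_formed("<Memo></memo>"))                 # False: names are case-sensitive
print(is_well_formed("<div class=preface></div>"))     # False: unquoted attribute value
```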
Valid documents
- a document is a valid XML document if it is well-formed and adheres to the structure defined in the given DTD
- an XML processor can be validating or non-validating
- sometimes validity is important, sometimes not