Title: Introduction to XML
1. Introduction to XML
2. Traditional DB Applications
- Typically business oriented
- Large amount of data
- Data is well-structured, normalized, with a predefined schema
- Large number of concurrent users (transactions)
- Simple data, simple queries, and simple updates
- Typically update intensive
- Small transactions
- High performance, high availability, scalability
- Data integrity and security are of major importance
- Good administrative support, nice GUIs
3. Document Applications
- Human friendly: what-you-see-is-what-you-get paradigm
- Focus on presentation
- Information is divided into multiple small documents
- Mostly static
- Implicit structure: section, subsection, paragraph, etc.
- Meta-data: title, author, date, indexing keywords, etc.
- Content structure: form/layout, inter-relationships, references
- Tagging: e.g., <p> for a new paragraph
- Operations: retrieving, editing, spell-checking, printing, etc.
- Information retrieval: keyword queries
  - most successful in web search engines (e.g., Google)
4. Internet Applications
- Internet applications
  - use heterogeneous, complex, hierarchical, fast-evolving, unstructured/semistructured data
  - access mostly read-only data
  - need 100% availability
  - manage millions of users world-wide
  - have high-performance requirements
  - are concerned with security (encryption)
  - like to customize data in a personalized manner
  - expect to gain users' trust for business-to-consumer transactions
- Internet users choose speed and availability over correctness
5. Electronic Commerce
- Currently, mostly business-to-business (B2B) rather than business-to-consumer (B2C) interactions
- Focus on selling and buying
- Order management
- Product catalogs
- Product configuration
- Sales and marketing
- Education and training
- Web services
- Communities
6. Other Web Applications
- Web services
  - Many standards: SOAP, WSDL, UDDI
- Web integration
  - Heterogeneous data sources and types
  - Thousands of web-accessible data sources
  - Dynamic data
  - Data warehouses
- Web publishing
  - Access different types of content from browsers (PDF, HTML, XML)
  - Structured, dynamic, customized/personalized content
  - Integration with application
  - Accessible via major gateways and search engines
- Application integration
  - Transformation between different data formats (e.g., XML, HTML)
  - Integration of multiple applications
7. Current Internet Application Architectures
- Architecture
  - Server tier: relational databases and gateways to diverse data sources, such as files, OLE/DB, etc. Use of enterprise servers
  - Middle tier: provides data integration, distribution, query, etc. Consists of a web server and an application server
  - Client tier: mostly a web browser; may use CGI scripts or Java
- Characteristics
  - Customization is achieved at the server site (customer data in a database) with some data at the client site (cookies)
  - Load balancing is typically hardware based (multiple servers, DNS routers)
8. HTML
    <html>
      <head><title>My Web Page</title></head>
      <body>
        <h1>Introduction</h1>
        Look at <a href="http://lambda.uta.edu/index.html">this document</a>
        <img src="image.jpg" width="100" height="50">
      </body>
    </html>
- It is very simple: human readable, can be edited by any editor
- It reflects document presentation, not the semantics or structure of data
- Universal: portable to any platform
- HTML pages are connected through hypertext links
- HTML pages can be located using web search engines
[Figure callouts on the example: hypertext link, opening tag, closing tag, attribute name, attribute value]
9. XML
- XML (eXtensible Markup Language) is a textual language for representing and exchanging data on the web
- It is designed to improve the functionality of the Web by providing more flexible and adaptable information identification
- Based on SGML
- It was developed around 1996
- It is called extensible because
  - it is not a fixed format like HTML (a single, predefined markup language)
  - it is actually a metalanguage (a language for describing other languages), which lets you design your own customized markup languages for limitless different types of documents
10. XML (cont.)
- XML can be untyped (semistructured), but there are now standards for schema conformance
  - DTD
  - XML Schema
- Without a schema, an XML document is well-formed if it satisfies simple syntactic constraints
  - proper nesting of start and end tags
- With a schema, an XML document is valid if its structure conforms to a DTD or an XML Schema
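The well-formedness test can be sketched in a few lines of Python. This is only an illustration, assuming the standard-library xml.etree.ElementTree parser; the helper name is_well_formed is hypothetical. The parser checks well-formedness only and does not validate against a DTD or an XML Schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML (well-formedness only, no validation)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<person><name>Ramez Elmasri</name></person>"
bad = "<person><name>Ramez Elmasri</person></name>"  # start and end tags overlap

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Checking validity would additionally require a validating parser and a schema, which this sketch does not attempt.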
11. Example
    <people>
      <person>
        <name> Leonidas Fegaras </name>
        <tel> (817) 272-3629 </tel>
        <email> fegaras@cse.uta.edu </email>
      </person>
      <person>
        <name> Ramez Elmasri </name>
        <tel> (817) 272-2348 </tel>
        <email> elmasri@cse.uta.edu </email>
      </person>
    </people>
12. Why Is XML So Popular?
- It looks like HTML
  - simple, human-readable, easy to learn, universal
- Flexible and extensible, since you can represent any kind of data
  - unlike HTML
  - HTML describes the presentation while XML describes the content
- Precise
  - well-formed: properly nested XML tags
  - valid: its structure may conform to a DTD or an XML Schema
- Supported by the W3C
  - trusted and adopted by industry
- Many standards around XML: schemas, query languages, etc.
13. What Does XML Have to Do with Databases?
- XML is an important standardization for data representation and exchange, but we still need
  - to store and query large repositories of XML documents
  - data models and schema representations
  - query languages, data indexing, query optimizers
  - updates, view maintenance
  - concurrency, distribution, security, etc.
- Example application
  - an XML data repository distributed in a peer-to-peer network
  - answer queries, such as
    - find all books whose author is Smith and whose title contains the word Web
  - much like a web search engine, but for XML, and for more precise querying
14. XML Syntax
- XML consists of tags and text
- XML documents conform to the following grammar
    XMLdocument ::= Pi* Element Pi*
    Element     ::= Stag (char | Pi | Element)* Etag
    Stag        ::= '<' Name Attributes '>'
    Etag        ::= '</' Name '>'
    Pi          ::= '<?' char* '?>'
    Attributes  ::= ( Name '=' String )*
    String      ::= '"' char* '"'
- Tags come in pairs, e.g., <date>8/25/2004</date>, and must be properly nested
  - <person> <name> ... </name> ... </person> --- valid nesting
  - <person> <name> ... </person> ... </name> --- invalid nesting
- Text is bounded by tags. PCDATA: parsed character data, e.g.,
    <title> The Big Sleep </title>
    <year> 1935 </year>
15. XML Elements
- An element is a segment of an XML document between an opening tag and the matching closing tag
    <person>
      <name> Ramez Elmasri </name>
      <tel> (817) 272-2348 </tel>
      <email> elmasri@cse.uta.edu </email>
    </person>
- An element may contain a mixture of sub-elements and PCDATA
    <title>An <em>element</em> is a segment</title>
- As an abbreviation for an element with empty content, we can use
    <tagname ... />
  instead of
    <tagname ...></tagname>
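Both points (the empty-element abbreviation and mixed content) can be observed with a parser. The sketch below assumes Python's standard xml.etree.ElementTree; the attribute a="1" is a made-up placeholder for the elided attributes. In that library, the text before the first sub-element is stored in .text and the text following a sub-element in its .tail.

```python
import xml.etree.ElementTree as ET

# The abbreviated and the expanded empty element parse to the same structure.
short = ET.fromstring('<tagname a="1"/>')
expanded = ET.fromstring('<tagname a="1"></tagname>')
same = (short.tag == expanded.tag
        and short.attrib == expanded.attrib
        and len(short) == len(expanded) == 0
        and (short.text or "") == (expanded.text or ""))
print(same)  # True

# Mixed content: PCDATA interleaved with a sub-element.
title = ET.fromstring("<title>An <em>element</em> is a segment</title>")
em = title[0]
print(repr(title.text), em.tag, repr(em.text), repr(em.tail))
# 'An ' em 'element' ' is a segment'
```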
16. Representing Data Using XML
- Nesting tags can be used to express various structures, such as a record
    <person>
      <name> Ramez Elmasri </name>
      <tel> (817) 272-2348 </tel>
      <email> elmasri@cse.uta.edu </email>
    </person>
- We can represent a list by using the same tag repeatedly
    <addresses>
      <person> ... </person>
      <person> ... </person>
      <person> ... </person>
      ...
    </addresses>
17. XML Structure
- XML
    <person>
      <name> Ramez Elmasri </name>
      <tel> (817) 272-2348 </tel>
      <email> elmasri@cse.uta.edu </email>
    </person>
- is Lisp-like
    (person (name Ramez Elmasri)
            (tel (817) 272-2348)
            (email elmasri@cse.uta.edu))
- and tree-like
    person
      name
        Ramez Elmasri
      tel
        (817) 272-2348
      email
        elmasri@cse.uta.edu
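The correspondence between the XML text, the Lisp-like form, and the tree can be made concrete with a short recursive walk. This is a minimal sketch assuming Python's xml.etree.ElementTree; the function name to_sexpr is hypothetical, and attributes are ignored for brevity.

```python
import xml.etree.ElementTree as ET

def to_sexpr(elem) -> str:
    """Render an element tree as a Lisp-like expression (attributes ignored)."""
    parts = [elem.tag]
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        parts.append(to_sexpr(child))          # recurse into each sub-element
        if child.tail and child.tail.strip():  # text that follows the child
            parts.append(child.tail.strip())
    return "(" + " ".join(parts) + ")"

doc = """<person>
  <name> Ramez Elmasri </name>
  <tel> (817) 272-2348 </tel>
  <email> elmasri@cse.uta.edu </email>
</person>"""

print(to_sexpr(ET.fromstring(doc)))
# (person (name Ramez Elmasri) (tel (817) 272-2348) (email elmasri@cse.uta.edu))
```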
18. Attributes
- An opening tag may contain attributes
  - typically used to describe the content of an element
    <author ssn="2787901">
      <name>Ramez Elmasri</name>
      <email> elmasri@cse.uta.edu </email>
    </author>
- It's not always clear when to use attributes
    <author>
      <ssn>2787901</ssn>
      <name>Ramez Elmasri</name>
      <email> elmasri@cse.uta.edu </email>
    </author>
- ID attributes are special: they must be unique within the document
- An IDref attribute must refer to an existing ID in the same document
19. Referencing Elements Using IDs/IDrefs
    <family>
      <person id="jane" mother="mary" father="john">
        <name> Jane Doe </name>
      </person>
      <person id="john" children="jane jack">
        <name> John Doe </name> <mother/>
      </person>
      <person id="mary" children="jane jack">
        <name> Mary Doe </name>
      </person>
      <person id="jack" mother="mary" father="john">
        <name> Jack Doe </name>
      </person>
    </family>
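One way to see what ID and IDref attributes buy us is to resolve the references by hand. The sketch below is an illustration only, assuming Python's xml.etree.ElementTree (which does not itself know which attributes are IDs); the document is a condensed copy of the family example, and the names by_id and deref are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<family>
  <person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person id="john" children="jane jack"><name>John Doe</name></person>
  <person id="mary" children="jane jack"><name>Mary Doe</name></person>
  <person id="jack" mother="mary" father="john"><name>Jack Doe</name></person>
</family>"""
family = ET.fromstring(doc)

# Build the ID index, enforcing that every ID is unique within the document.
by_id = {}
for person in family.iter("person"):
    pid = person.get("id")
    if pid in by_id:
        raise ValueError(f"duplicate ID: {pid}")
    by_id[pid] = person

def deref(person, attr):
    """Follow an IDref/IDrefs attribute and return the names of the targets."""
    refs = person.get(attr, "").split()   # IDrefs are whitespace separated
    return [by_id[ref].findtext("name") for ref in refs]  # KeyError: dangling ref

print(deref(by_id["jane"], "mother"))    # ['Mary Doe']
print(deref(by_id["john"], "children"))  # ['Jane Doe', 'Jack Doe']
```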
20. A Complete Example
    <?xml version="1.0"?>
    <!DOCTYPE bib SYSTEM "bib.dtd">
    <bib>
      <vendor id="id0_1">
        <name>Amazon</name>
        <email>webmaster@amazon.com</email>
        <phone>1-800-555-9999</phone>
        <book>
          <title>Unix Network Programming</title>
          <publisher>Addison Wesley</publisher>
          <year>1995</year>
          <author>
            <firstname>Richard</firstname>
            <lastname>Stevens</lastname>
          </author>
          <price>38.68</price>
        </book>
        <book>
          <title>An Introduction to Object-Oriented Design</title>
21. OODB Schema
    class Movie
      ( extent Movies, key title )
    {
      attribute string title;
      attribute string director;
      relationship set<Actor> casts
        inverse Actor::acted_In;
      attribute int budget;
    };
    class Actor
      ( extent Actors, key name )
    {
      attribute string name;
      relationship set<Movie> acted_In
        inverse Movie::casts;
      attribute int age;
      attribute set<string> directed;
    };
22. In XML
    <db>
      <movie id="m1">
        <title>Waking Ned Divine</title>
        <director>Kirk Jones III</director>
        <cast idrefs="a1 a3"></cast>
        <budget>100,000</budget>
      </movie>
      <movie id="m2">
        <title>Dragonheart</title>
        <director>Rob Cohen</director>
        <cast idrefs="a2 a9 a21"></cast>
        <budget>110,000</budget>
      </movie>
      <movie id="m3">
        <title>Moondance</title>
        <director>Dagmar Hirtz</director>
        <cast idrefs="a1 a8"></cast>
        <budget>90,000</budget>
      </movie>
      <actor id="a1">
        <name>David Kelly</name>
        <acted_In idrefs="m1 m3 m78"></acted_In>
      </actor>
      <actor id="a2">
        <name>Sean Connery</name>
        <acted_In idrefs="m2 m9 m11"></acted_In>
        <age>68</age>
      </actor>
      <actor id="a3">
        <name>Ian Bannen</name>
        <acted_In idrefs="m1 m35"></acted_In>
      </actor>
    </db>
23. DTD: Document Type Definition
- A DTD imposes a structure on an XML document
- Not quite a typing system
  - it is purely syntactic
  - now largely superseded by XML Schema
- Uses regular expressions to specify structure
  - firstname --- an element with tag name firstname
  - book* --- zero or more books
  - year? --- an optional year
  - firstname,lastname --- a firstname followed by a lastname
  - book | journal --- either a book or a journal
24. Example of XML Data
    <bib>
      <vendor id="id0_1">
        <name>Amazon</name>
        <email>webmaster@amazon.com</email>
        <phone>1-800-555-9999</phone>
        <book>
          <title>Unix Network Programming</title>
          <publisher>Addison Wesley</publisher>
          <year>1995</year>
          <author>
            <firstname>Richard</firstname>
            <lastname>Stevens</lastname>
          </author>
          <price>38.68</price>
        </book>
        ...
      </vendor>
    </bib>
25. DTD Example
    <?xml encoding="ISO-8859-1"?>
    <!ELEMENT bib (vendor*)>
    <!ELEMENT vendor (name, email, book*)>
    <!ATTLIST vendor id ID #REQUIRED>
    <!ELEMENT book (title, publisher?, year?, author+, price)>
    <!ELEMENT author (firstname?, lastname)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT email (#PCDATA)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT publisher (#PCDATA)>
    <!ELEMENT year (#PCDATA)>
    <!ELEMENT firstname (#PCDATA)>
    <!ELEMENT lastname (#PCDATA)>
    <!ELEMENT price (#PCDATA)>
26. Summary of the DTD Syntax
- A tagged element in a DTD is defined by
    <!ELEMENT name e>
  where e is a DTD expression
- If e, e1, e2 are DTD expressions, then so are
  - EMPTY --- empty content
  - #PCDATA --- any text
  - A --- an element with tag name A
  - e1,e2 --- e1 followed by e2
  - e1 | e2 --- either e1 or e2
  - e* --- zero or more occurrences of e
  - e+ --- one or more occurrences of e
  - e? --- optional e (zero or one occurrences)
  - (e)
- Note: tagged elements are global
  - each must be defined exactly once in a DTD
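Because DTD expressions are regular expressions over element names, an element-only content model can be checked with an ordinary regex engine. The sketch below is an illustration under stated assumptions, not how a validating parser is implemented: the helpers model_to_regex and conforms are hypothetical, #PCDATA and EMPTY are not handled, and the book model used here assumes the occurrence indicators shown (author+ in particular).

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"\s*([A-Za-z_][\w.-]*|[(),|*+?])")

def model_to_regex(model: str) -> re.Pattern:
    """Translate an element-only DTD content model into a regex that matches
    the sequence of child tag names, each name followed by one space."""
    out = []
    for tok in TOKEN.findall(model):
        if tok == ",":
            continue                      # e1,e2 is plain concatenation
        elif tok == "(":
            out.append("(?:")             # grouping only, no capture
        elif tok in ")|*+?":
            out.append(tok)               # same meaning in both notations
        else:
            out.append(f"(?:{re.escape(tok)} )")  # an element name
    return re.compile("".join(out))

def conforms(elem, model: str) -> bool:
    """Check the children of `elem` (one level only) against the model."""
    children = "".join(child.tag + " " for child in elem)
    return model_to_regex(model).fullmatch(children) is not None

BOOK = "(title, publisher?, year?, author+, price)"   # assumed content model
ok = ET.fromstring("<book><title/><year/><author/><author/><price/></book>")
bad = ET.fromstring("<book><title/><price/></book>")  # no author at all

print(conforms(ok, BOOK), conforms(bad, BOOK))  # True False
```

A real validator also checks attributes, text content, and every nested element against its own declaration.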
27. DTD Syntax (cont.)
- Attribute specification
    <!ATTLIST name (attribute-name type accuracy?)+ >
- type is
  - ID --- must be unique within the document
  - IDREF --- a reference to an existing ID
  - IDREFS --- multiple IDREFs
  - CDATA --- any string
- accuracy is #REQUIRED, #IMPLIED, #FIXED 'value', value 'v1 ... vn'
- ID, IDREF, and IDREFS attributes are not typed!
- Example
    <!ELEMENT person (#PCDATA)>
    <!ATTLIST person
        id ID #REQUIRED
        children IDREFS #IMPLIED >
  - the id attribute is required while the children attribute is optional
28. Connecting an XML Document to a DTD
- In-line the DTD into the XML file
    <?xml version="1.0"?>
    <!DOCTYPE db [
      <!ELEMENT person ...>
      ...
    ]>
    <db>
      <person> ... </person>
      ...
    </db>
- Better: put the DTD in a separate file and reference it by URL
    <!DOCTYPE db SYSTEM "http://lambda.uta.edu/person.dtd">
- Documents are validated against their DTD before they are used
[Figure: DTD and XML data]
29. Recursive DTDs
- We want to capture a person with a mother and a father
- First attempt
    <!ELEMENT person (name, address, person, person)>
  where the first person is the mother while the second is the father
  - no finite document can satisfy this, since every person requires two more persons
- Second attempt
    <!ELEMENT person (name, address, person?, person?)>
  - finite, but when only one parent is present we cannot tell whether it is the mother or the father
- Third attempt
    <!ELEMENT person (name, address)>
    <!ATTLIST person
        id ID #REQUIRED
        mother IDREF #IMPLIED
        father IDREF #IMPLIED>
30. Back to the OODB Schema
    class Movie
      ( extent Movies, key title )
    {
      attribute string title;
      attribute string director;
      relationship set<Actor> casts
        inverse Actor::acted_In;
      attribute int budget;
    };
    class Actor
      ( extent Actors, key name )
    {
      attribute string name;
      relationship set<Movie> acted_In
        inverse Movie::casts;
      attribute int age;
      attribute set<string> directed;
    };
31. DTD
    <!ELEMENT db (movie+, actor+)>
    <!ELEMENT movie (title, director, cast, budget)>
    <!ATTLIST movie id ID #REQUIRED>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT director (#PCDATA)>
    <!ELEMENT cast EMPTY>
    <!ATTLIST cast idrefs IDREFS #REQUIRED>
    <!ELEMENT budget (#PCDATA)>
    <!ELEMENT actor (name, acted_In, age?, directed*)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT acted_In EMPTY>
    <!ATTLIST acted_In idrefs IDREFS #REQUIRED>
    <!ELEMENT age (#PCDATA)>
    <!ELEMENT directed (#PCDATA)>
32. XML Namespaces
- When merging multiple docs together, name collisions may occur
- A namespace is a mechanism for uniquely naming tagnames and attribute names to avoid name conflicts
- Tag/attribute names are now qualified names (QNames)
    (namespace ':')? localname
  - example: bib:author
- A document may use multiple namespaces
- A DTD has its own namespace in which all names are unique
- A namespace in an XML doc is defined as an attribute
    xmlns:bib="http://lambda.uta.edu/biblio.dtd"
  where bib is the namespace prefix and the URL identifies the namespace (here, the location of the DTD)
- The default namespace is defined as
    xmlns="URL"
  - If not defined, it is the global namespace
33. Example
    <item xmlns="http://www.acme.com/jpsupplies"
          xmlns:toy="http://www.acme.com/jptoys">
      <name>backpack</name>
      <feature>
        <toy:item>
          <toy:name>cyberpet</toy:name>
        </toy:item>
      </feature>
    </item>
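To see how a parser resolves these qualified names, the example can be loaded with Python's xml.etree.ElementTree, which rewrites every tag into the form {namespace-URL}localname. This is only a sketch; the prefixes s and t in the lookup table are local choices made here and need not match the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

doc = """<item xmlns="http://www.acme.com/jpsupplies"
      xmlns:toy="http://www.acme.com/jptoys">
  <name>backpack</name>
  <feature>
    <toy:item><toy:name>cyberpet</toy:name></toy:item>
  </feature>
</item>"""
root = ET.fromstring(doc)

# The default namespace applies to unprefixed element names.
print(root.tag)  # {http://www.acme.com/jpsupplies}item

# Queries must name the namespace; the prefix itself is arbitrary.
ns = {"s": "http://www.acme.com/jpsupplies", "t": "http://www.acme.com/jptoys"}
print(root.findtext("s:name", namespaces=ns))     # backpack
print(root.findtext(".//t:name", namespaces=ns))  # cyberpet
```

The two name elements and the two item elements no longer collide, because each belongs to a different namespace.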
34. Query Languages for XML
- Need a language for XML data for
  - extracting fragments (querying)
  - restructuring (data transformation)
  - integrating (e.g., combining multiple XML documents)
  - browsing
  - presentation (e.g., from XML to HTML)
- We will first learn XPath
  - used in extracting fragments from a single document
  - many XML query languages are based on XPath
- We will briefly discuss XSLT
  - for extracting, restructuring, and presentation over a single document
- We will focus later on XQuery
  - a full-fledged query language
  - much like OQL
35. XPath
- Describes a single navigation path in an XML document
- Selects a sequence of nodes reachable by the path
  - the order of nodes is the document order
- Main construct: axis navigation
- Consists of one or more navigation steps separated by /
- A navigation step is a triplet
    axis :: node-test [list-of-predicates]
- Each navigation path is evaluated relative to a context node
- Examples
    /child::bib/descendant::author
    /descendant::book[child::author = "Smith"]/child::title
- Most people use shorthands
    /bib//author
    //book[author="Smith"]/title
36. Axis Navigation
- In the beginning, the context node is the document root
- Dot (.) identifies the context node
- Some navigation steps
  - / --- the root node
  - // --- the root node and its descendants
  - ./author --- all the children of the context node with tagname author; the context node of the next step is each of these children
  - .// --- the context node and all its descendants; the context node of the next step is each of these nodes
  - @mother --- the attribute value of the attribute name mother of the context node
  - ./* --- all the children of the context node
  - .. --- parent of the context node
  - text() --- all the text children of the context node
- Shortcut: you can remove ./
37. Example
    <a>
      <b>
        <c></c>
        <b></b>
      </b>
      <d>
        <c></c>
      </d>
      <b>
        <d></d>
      </b>
    </a>
Tree of the document, with the node numbers used in the figure:
    a (1)
      b (2)
        c (5)
        b (6)
      d (3)
        c (7)
      b (4)
        d (8)
Query results (node numbers):
    /./a or /a        --> 1
    /./a/./b or /a/b  --> 2,4
    /a/c              --> (no nodes)
    /a/*/c            --> 5,7
    //b               --> 2,6,4
    //b/c             --> 5
    /a//c             --> 5,7
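These results can be reproduced with a small script. This is a sketch under stated assumptions: Python's xml.etree.ElementTree implements only a subset of XPath and evaluates paths relative to an element, so with the a element as the context node the path /a/b is written b and //b is written .//b. The helper run and the breadth-first numbering are made up here to match the node numbers on the slide.

```python
import xml.etree.ElementTree as ET
from collections import deque

doc = "<a><b><c/><b/></b><d><c/></d><b><d/></b></a>"
root = ET.fromstring(doc)          # the <a> element, node 1

# Number the nodes level by level (breadth-first), as on the slide.
number = {}
queue = deque([root])
while queue:
    node = queue.popleft()
    number[node] = len(number) + 1
    queue.extend(node)             # iterating an element yields its children

def run(path):
    """Evaluate `path` with <a> as the context node; return node numbers."""
    return [number[n] for n in root.findall(path)]

print(run("b"))       # /a/b   -> [2, 4]
print(run("c"))       # /a/c   -> []
print(run("*/c"))     # /a/*/c -> [5, 7]
print(run(".//c"))    # /a//c  -> [5, 7]
print(run(".//b"))    # //b    -> [2, 6, 4]
print(run(".//b/c"))  # //b/c  -> [5]
```

Results come back in document order, which is why //b yields 2,6,4 rather than 2,4,6.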
38. Predicates
- Many variations
  - [10] --- the tenth child node of the context node
  - [last()] --- the last child node of the context node
  - [author] --- true if the context node has at least one child tagged author
  - [author/name] --- true if the XPath ./author/name is nonempty
  - [author/name="Smith"] --- true if the author name is Smith
- Examples
    /bib/book[@price < 100]/title
    /bib/book[author/text()]
    author[name/firstname="John" and name/lastname="Smith"]/title
    /bib/book/author[name/firstname][address[//zip][city]]/name/lastname
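A few of these predicate forms can be tried out directly. The sketch below assumes Python's xml.etree.ElementTree, whose XPath subset supports [tag], [tag='text'], [@attr], and [last()] but not comparisons such as @price < 100, so that filter is done in Python instead. The sample data is hypothetical, except for the first book, which comes from the earlier bib example.

```python
import xml.etree.ElementTree as ET

doc = """<bib>
  <book price="38.68"><title>Unix Network Programming</title><author>Stevens</author></book>
  <book price="120.00"><title>Hypothetical Expensive Book</title><author>Smith</author></book>
  <book><title>Hypothetical Web Book</title><author>Smith</author><author>Jones</author></book>
</bib>"""
bib = ET.fromstring(doc)           # context node: the <bib> element

# book[author='Smith']/title: books having an author child with text Smith
by_smith = [t.text for t in bib.findall("book[author='Smith']/title")]
print(by_smith)  # ['Hypothetical Expensive Book', 'Hypothetical Web Book']

# book[@price]: books that carry a price attribute
priced = bib.findall("book[@price]")
print(len(priced))  # 2

# book[last()]: the last book child of the context node
last_title = bib.find("book[last()]/title").text
print(last_title)  # Hypothetical Web Book

# /bib/book[@price < 100]/title, with the comparison done in Python
cheap = [b.findtext("title") for b in priced if float(b.get("price")) < 100]
print(cheap)  # ['Unix Network Programming']
```

A full XPath engine would evaluate the numeric comparison inside the path itself.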