Title: Querying XML
1 Querying XML
- Zachary G. Ives
- University of Pennsylvania
- CIS 455 / 555 Internet and Web Systems
- February 3, 2009
2 Today
- DTDs
- XPath and XSLT
- Reminders
  - Assignment 1 milestone 1 due tonight
  - Assignment 1 milestone 2 due Feb 17
3 Integrating XML: What If We Have Multiple Sources with the Same Tags?
- Namespaces allow us to specify a context for different tags
- Two parts:
  - Binding of namespace to URI
  - Qualified names
- <tag xmlns="http://www.fictitious.com/mypath"
       xmlns:myns="http://www.fictitious.com/mypath">
-   <thistag>is in namespace myns (via the default namespace binding)</thistag>
-   <myns:thistag>is the same</myns:thistag>
-   <otherns:thistag>is a different tag</otherns:thistag>
- </tag>
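A minimal sketch of how a namespace-aware parser resolves qualified names, using Python's standard xml.etree.ElementTree. The document and both URIs are illustrative assumptions (the second URI and the binding for otherns are not from the slide); the point is only that a prefix is shorthand for a URI, so two tags are "the same" exactly when their URI and local name agree.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: both URIs are placeholders chosen for illustration.
DOC = """
<tag xmlns="http://www.fictitious.com/mypath"
     xmlns:myns="http://www.fictitious.com/mypath"
     xmlns:otherns="http://www.fictitious.com/otherpath">
  <thistag>default namespace</thistag>
  <myns:thistag>same namespace, explicit prefix</myns:thistag>
  <otherns:thistag>different namespace</otherns:thistag>
</tag>
"""

root = ET.fromstring(DOC)

# ElementTree expands every qualified name to "{namespace-URI}local-name",
# so the prefix itself disappears and only the bound URI matters.
expanded_names = [child.tag for child in root]

for name in expanded_names:
    print(name)
```

The first two children end up with identical expanded names even though one is written without a prefix; the third differs only because its prefix is bound to a different URI.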
4 XML Isn't Enough on Its Own
- It's too unconstrained for many cases!
- How will we know when we're getting garbage?
- How will we query?
- How will we understand what we got?
5 Document Type Definitions (DTDs)
- DTD is an EBNF grammar defining XML structure
  - XML document specifies an associated DTD, plus the root element
  - DTD specifies children of the root (and so on)
- DTD defines special significance for attributes
  - IDs: special attributes that are analogous to keys for elements
  - IDREFs: references to IDs
  - IDREFS: space-delimited list of IDREFs
6 An Example DTD
- Example DTD:
- <!ELEMENT dblp ((mastersthesis | article)*)>
- <!ELEMENT mastersthesis (author, title, year, school, committeemember*)>
- <!ATTLIST mastersthesis mdate CDATA #REQUIRED
                           key ID #REQUIRED
                           advisor CDATA #IMPLIED>
- <!ELEMENT author (#PCDATA)>
- ...
- Example use of DTD in XML file:
- <?xml version="1.0" encoding="ISO-8859-1" ?>
- <!DOCTYPE dblp SYSTEM "my.dtd">
- <dblp>
7 Representing Graphs in XML
- <?xml version="1.0" encoding="ISO-8859-1" ?>
- <!DOCTYPE graph SYSTEM "special.dtd">
- <graph>
-   <author id="author1">
-     <name>John Smith</name>
-   </author>
-   <article>
-     <author ref="author1" /> <title>Paper1</title>
-   </article>
-   <article>
-     <author ref="author1" /> <title>Paper2</title>
-   </article>
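A small sketch of how an application might follow these references, in Python with the standard xml.etree.ElementTree. ElementTree does not read the DTD, so treating id as the key and ref as the reference is an assumption made here by attribute name; the closing graph tag is added so the fragment parses.

```python
import xml.etree.ElementTree as ET

DOC = """
<graph>
  <author id="author1"><name>John Smith</name></author>
  <article><author ref="author1"/><title>Paper1</title></article>
  <article><author ref="author1"/><title>Paper2</title></article>
</graph>
"""

root = ET.fromstring(DOC)

# Index every element that carries an id attribute (playing the role of an ID).
by_id = {elem.get("id"): elem for elem in root.iter() if elem.get("id") is not None}

# Follow each ref attribute (playing the role of an IDREF) to the author it names.
papers_by_author = {}
for article in root.findall("article"):
    for ref_elem in article.findall("author[@ref]"):
        author = by_id[ref_elem.get("ref")]
        name = author.findtext("name")
        papers_by_author.setdefault(name, []).append(article.findtext("title"))

print(papers_by_author)
```

Both articles resolve to the same author element, which is what makes the document a graph rather than a tree.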
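8 Graph Data Model
[Figure: the graph document drawn as a tree. Root has children ?xml, !DOCTYPE, and graph. graph has one author element (attribute id = author1, child name with text John Smith) and two article elements, each with a title (Paper1, Paper2) and an author child whose ref attribute holds the value author1.]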
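9 Graph Data Model
[Figure: the same nodes as in the tree view of the graph document, except that the two ref attributes no longer show the literal value author1; presumably each is drawn as an edge to the author element whose id is author1.]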
10 DTDs Are Very Limited
- DTDs capture grammatical structure, but have some drawbacks:
  - Not themselves in XML: inconvenient to build tools for them
  - Don't capture types of scalars
  - Global ID/reference space is inconvenient
  - No way of defining OO-like inheritance
11 XML Schema: DTDs Rethought
- Features:
  - XML syntax
  - Better way of defining keys using XPaths
  - Type subclassing
  - And, of course, built-in datatypes
12 Basic Constructs of Schema
- Separation of elements (and attributes) from types
- complexType is a structured type
  - It can have sequences or choices
- element and attribute have name and type
  - Elements may also have minOccurs and maxOccurs
- Subtyping, most commonly using:
  - <complexContent> <extension base="prevType"> ... </...>
13 Simple Schema Example
- <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
-   <xsd:element name="mastersthesis" type="ThesisType"/>
-   <xsd:complexType name="ThesisType">
-     <xsd:sequence>
-       <xsd:element name="author" type="xsd:string"/>
-       <xsd:element name="title" type="xsd:string"/>
-       <xsd:element name="year" type="xsd:integer"/>
-       <xsd:element name="school" type="xsd:string"/>
-       <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
-     </xsd:sequence>
-     <xsd:attribute name="mdate" type="xsd:date"/>
-     <xsd:attribute name="key" type="xsd:string"/>
-     <xsd:attribute name="advisor" type="xsd:string"/>
-   </xsd:complexType>
- (Attribute declarations must follow the content model inside a complexType)
14 Embedding XML Schema
- <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="s1.xsd">
    <grade>a</grade>
  </root>
- <s1:root xmlns:s1="http://www.schemaValid.com/s1ns"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.schemaValid.com/s1ns s1ns.xsd">
    <s1:grade>a</s1:grade>
  </s1:root>
- But the XML parser is actually free to ignore this: the schema is typically specified from outside the document
15 Designing an XML Schema/DTD
- Often we are given a DTD/Schema; if not, we need to design one
- We orient the XML tree around the central objects in a particular application
16 Manipulating XML
- Sometimes:
  - Need to restructure an XML document
  - Or simply need to retrieve certain parts that satisfy a constraint, e.g.:
    - All books
    - All books by author XYZ
17 Document Object Model (DOM) vs. Queries
- Build a DOM tree (as we saw earlier) and access via Java (etc.) DOMNode object
  - DOM objects have methods like getFirstChild(), getNextSibling()
  - Common way of traversing the tree
  - Can also modify the DOM tree (alter the XML) via insertBefore(), etc.
- Alternate approach: a query language
  - Define some sort of a template describing traversals from the root of the directed graph
  - In XML, the basis of this template is called an XPath
  - Can also declare some constraints on the values you want
  - The XPath returns a node set of matches
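To make the contrast concrete, here is a hedged sketch in Python rather than Java: xml.dom.minidom exposes firstChild and nextSibling as properties (the counterparts of getFirstChild() and getNextSibling()), and xml.etree.ElementTree accepts a small XPath subset. The two-article document and both function names are made up for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical input with no whitespace between elements, to keep the walk simple.
DOC = ("<dblp><article><title>Paper1</title></article>"
       "<article><title>Paper2</title></article></dblp>")


def titles_via_dom(xml_text):
    """DOM style: walk the tree node by node with firstChild / nextSibling."""
    document = minidom.parseString(xml_text)
    titles = []
    article = document.documentElement.firstChild
    while article is not None:
        child = article.firstChild
        while child is not None:
            if child.nodeType == child.ELEMENT_NODE and child.tagName == "title":
                titles.append(child.firstChild.data)
            child = child.nextSibling
        article = article.nextSibling
    return titles


def titles_via_path(xml_text):
    """Query style: describe the path once and let the library find the matches."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.findall("./article/title")]


print(titles_via_dom(DOC))
print(titles_via_path(DOC))
```

Both functions return the same titles; the path version states what is wanted, while the DOM version spells out how to walk there.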
18 XPaths
- In its simplest form, an XPath is like a path in a file system:
  - /mypath/subpath//morepath
- The XPath returns a node set representing the XML nodes (and their subtrees) at the end of the path
- XPaths can have node tests at the end, returning only particular node types, e.g., text(), processing-instruction(), comment(), element(), attribute()
- XPath is fundamentally an ordered language: it can query in order-aware fashion, and it returns nodes in order
19 Sample XML
- <?xml version="1.0" encoding="ISO-8859-1" ?>
- <dblp>
-   <mastersthesis mdate="2002-01-03" key="ms/Brown92">
-     <author>Kurt P. Brown</author>
-     <title>PRPL: A Database Workload Specification Language</title>
-     <year>1992</year>
-     <school>Univ. of Wisconsin-Madison</school>
-   </mastersthesis>
-   <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
-     <editor>Paul R. McJones</editor>
-     <title>The 1995 SQL Reunion</title>
-     <journal>Digital System Research Center Report</journal>
-     <volume>SRC1997-018</volume>
-     <year>1997</year>
-     <ee>db/labs/dec/SRC1997-018.html</ee>
-     <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
-   </article>
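20 XML Data Model Visualized
[Figure: the sample XML drawn as a tree, with a legend distinguishing root, processing-instruction (p-i), element, attribute, and text nodes. Root has children ?xml and dblp. dblp has children mastersthesis (attributes mdate, key; elements author, title, year, school) and article (attributes mdate, key; elements editor, title, year, journal, volume, ee, ee). Text and attribute values appear as leaves.]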
21 Some Example XPath Queries
- /dblp/mastersthesis/title
- /dblp//editor
- //title
- //title/text()
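The four queries above can be tried with Python's standard xml.etree.ElementTree, with two caveats that are properties of that library rather than of XPath: it implements only a subset of XPath, it does not accept absolute paths on an element, and it has no text() node test. The sketch below therefore rewrites each query relative to the dblp element and reads .text instead of using text(). The document is a shortened copy of the sample XML.

```python
import xml.etree.ElementTree as ET

DOC = """
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
</dblp>
"""

root = ET.fromstring(DOC)  # root is the <dblp> element

# /dblp/mastersthesis/title
thesis_titles = root.findall("./mastersthesis/title")

# /dblp//editor
editors = root.findall(".//editor")

# //title
all_titles = root.findall(".//title")

# //title/text()  (no text() test in ElementTree, so read the .text attribute)
all_title_text = [title.text for title in all_titles]

print(all_title_text)
```

Matches come back in document order, so the thesis title precedes the article title.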
22 Context Nodes and Relative Paths
- XPath has a notion of a context node: it's analogous to a current directory
  - . represents this context node
  - .. represents the parent node
  - We can express relative paths
  - subpath/sub-subpath/../.. gets us back to the context node
- By default, the document root is the context node
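A short sketch of context nodes in Python's xml.etree.ElementTree, where the element that findall is called on plays the role of the context node. The tiny document is an assumption for illustration, and ElementTree only supports ".." as long as the path does not climb above the element the search started from.

```python
import xml.etree.ElementTree as ET

DOC = """
<dblp>
  <mastersthesis><title>Thesis title</title></mastersthesis>
  <article><title>Article title</title></article>
</dblp>
"""

root = ET.fromstring(DOC)
thesis = root.find("mastersthesis")
article = root.find("article")

# The same relative path gives different answers from different context nodes.
from_thesis = [t.text for t in thesis.findall("title")]
from_article = [t.text for t in article.findall("title")]

# "." selects the context node itself.
self_nodes = root.findall(".")

# Two steps down followed by two steps up lands back on the context node.
round_trip = root.findall("mastersthesis/title/../..")

print(from_thesis, from_article)
```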
23 Predicates: Filtering Operations
- A predicate allows us to filter the node set based on selection-like conditions over sub-XPaths:
  - /dblp/article[title = "Paper1"]
- which is equivalent to:
  - /dblp/article[./title/text() = "Paper1"]
- because of type coercion. What does this do?
  - /dblp/article[@key = "123" and ./title/text() = "Paper1" and ./author//element()]
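A hedged sketch of these predicates using Python's xml.etree.ElementTree. That library supports [tag='text'] and [@attr='value'] but not the "and" operator or element(), so the conjunction is approximated with chained predicates and the last condition (the author has at least one descendant element) is checked in Python. The three articles are invented test data.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: only the first article satisfies all three conditions.
DOC = """
<dblp>
  <article key="123"><title>Paper1</title><author><name>A</name></author></article>
  <article key="456"><title>Paper1</title><author>B</author></article>
  <article key="123"><title>Paper2</title><author><name>C</name></author></article>
</dblp>
"""

root = ET.fromstring(DOC)


def keys(elements):
    """Return the key attribute of each element, for easy inspection."""
    return [element.get("key") for element in elements]


# /dblp/article[title = "Paper1"]
by_title = root.findall("./article[title='Paper1']")

# @key = "123" and title = "Paper1", written as two chained predicates
by_key_and_title = root.findall("./article[@key='123'][title='Paper1']")

# ... and ./author//element(): keep articles whose author has a descendant element
with_nested_author = [
    article for article in by_key_and_title
    if article.find("./author//*") is not None
]

print(keys(by_title), keys(by_key_and_title), len(with_nested_author))
```

So the final query on the slide selects articles with the given key and title whose author element contains some nested element.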
24 Axes: More Complex Traversals
- Thus far, we've seen XPath expressions that go down the tree (and up one step)
  - But we might want to go up, left, right, etc.
  - These are expressed with so-called axes:
    - self::path-step
    - child::path-step, parent::path-step
    - descendant::path-step, ancestor::path-step
    - descendant-or-self::path-step, ancestor-or-self::path-step
    - preceding-sibling::path-step, following-sibling::path-step
    - preceding::path-step, following::path-step
- The previous XPaths we saw were in "abbreviated form"
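Python's xml.etree.ElementTree does not implement the ancestor or sibling axes, so the sketch below emulates three of them by hand to show what they select. The helper names and the one-article document are assumptions for illustration; a full XPath engine would evaluate the axis expressions directly.

```python
import xml.etree.ElementTree as ET

DOC = ("<dblp><article><editor>E</editor><title>T</title>"
       "<year>1997</year></article></dblp>")
root = ET.fromstring(DOC)

# ElementTree nodes have no parent pointer, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Emulates ancestor::*, listing the nearest ancestor first."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result


def following_siblings(node):
    """Emulates following-sibling::* in document order."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[siblings.index(node) + 1:]


def preceding_siblings(node):
    """Emulates preceding-sibling::*, listed here in document order."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[:siblings.index(node)]


title = root.find("./article/title")
print([n.tag for n in ancestors(title)])
print([n.tag for n in preceding_siblings(title)])
print([n.tag for n in following_siblings(title)])
```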
25 Users of XPath
- XML Schema uses simple XPaths in defining keys and uniqueness constraints
- XLink and XPointer, hyperlinks for XML
- XSLT: useful for converting from XML to other representations (e.g., HTML, PDF, SVG)
- XQuery: useful for restructuring an XML document or combining multiple documents
  - Might well turn into the "glue" between Web Services, etc.
26 XSLT: Transforming an XML Document
- XSLT: Extensible Stylesheet Language Transformations
  - Companion to XSL-FO, formatting for XML
- A language for substituting structured fragments for XML content
  - Transforms single document → single document
  - Useful for XML → XML conversions, XML → HTML
- Runs on server side (Apache Cocoon) or client side (modern browsers)
27 A Functional Language for XML
- XSLT is based on a series of templates that match different parts of an XML document
  - There's a policy for what rule or template is applied if more than one matches (it's not what you'd think!)
  - XSLT templates can invoke other templates
  - XSLT templates can be nonterminating (beware!)
- XSLT templates are based on XPath matches, and we can also apply other templates (potentially to selected XPaths)
  - Within each template, directly describe what should be output
28 An XSLT Template
- An XML document itself
- XML tags create output OR are XSL operations
  - All XSL tags are prefixed with xsl namespace
  - All non-XSL tags are part of the XML output
- Common XSL operations:
  - template with a match XPath
  - Recursive call to apply-templates, which may also select where it should be applied
- Attach to XML document with a processing-instruction:
  - <?xml version="1.0" ?>
  - <?xml-stylesheet type="text/xsl" href="http://www.com/my.xsl" ?>
29 An Example XSLT Stylesheet
- <xsl:stylesheet version="1.1">
-   <xsl:template match="/dblp">
-     <html><head>This is DBLP</head>
-       <body>
-         <xsl:apply-templates />
-       </body>
-     </html>
-   </xsl:template>
-   <xsl:template match="inproceedings">
-     <h2><xsl:apply-templates select="title" /></h2>
-     <p><xsl:apply-templates select="author"/></p>
-   </xsl:template>
-   ...
- </xsl:stylesheet>
30 XSLT Processing Model
- List of source nodes → result tree fragment(s)
- Start with root
- Find all template rules with matching patterns from root
  - Find "best" match according to some heuristics
  - Set the current node list to be the set of things it matches
- Iterate over each node in the current node list
  - Apply the operations of the template
  - "Append" the results of the matching template rule to the result tree structure
  - Repeat recursively if specified to by apply-templates
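The recursion in this processing model can be imitated in a few lines of Python. This is a toy illustration, not an XSLT processor: templates are plain functions looked up by tag name (no XPath patterns, no priorities), output is a string rather than a result tree, and the fallback rule is a simplified stand-in for XSLT's built-in rules. All names and the input document are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = ("<dblp><inproceedings><author>A. Author</author>"
       "<title>Some Title</title></inproceedings></dblp>")


def apply_templates(nodes, templates):
    """Process each node in the current node list and append the results."""
    output = []
    for node in nodes:
        template = templates.get(node.tag, builtin_template)
        output.append(template(node, templates))
    return "".join(output)


def builtin_template(node, templates):
    """Simplified default rule: emit the node's text, then recurse into children."""
    return (node.text or "") + apply_templates(list(node), templates)


def dblp_template(node, templates):
    """Plays the role of the template with match="/dblp"."""
    body = apply_templates(list(node), templates)
    return "<html><head>This is DBLP</head><body>" + body + "</body></html>"


def inproceedings_template(node, templates):
    """Plays the role of the template with match="inproceedings"."""
    title = apply_templates(node.findall("title"), templates)
    author = apply_templates(node.findall("author"), templates)
    return "<h2>" + title + "</h2><p>" + author + "</p>"


TEMPLATES = {"dblp": dblp_template, "inproceedings": inproceedings_template}

result = apply_templates([ET.fromstring(DOC)], TEMPLATES)
print(result)
```

Each template decides which nodes become the next current node list, which is the role that the select attribute of apply-templates plays in a real stylesheet.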
31 What If There's More than One Match?
- Eliminate rules of lower precedence due to importing
- Break a rule into any | branches and consider separately
- Choose rule with highest computed or specified priority
- Simple rules for computing priority based on "precision":
  - QName preceded by child/attribute axis specifier: priority 0
  - NCName:* preceded by child/attribute axis specifier: priority -0.25
  - NodeTest preceded by child/attribute axis specifier: priority -0.5
  - else priority 0.5
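The priority rules above can be approximated with a few regular expressions. The sketch below is a simplification under stated assumptions: names are matched with a rough ASCII-oriented approximation of NCName, and the split on "|" is naive, so it would mis-handle a "|" inside a predicate or string literal. The function names are hypothetical.

```python
import re

# Rough approximation of an NCName; real XML names allow more characters.
NCNAME = r"[A-Za-z_][\w.-]*"
# Optional child or attribute axis specifier, in full or abbreviated form.
AXIS = r"(?:child::|attribute::|@)?"


def default_priority(alternative):
    """Default priority of a single pattern alternative (no '|' inside)."""
    pattern = alternative.strip()
    # A QName behind an optional child/attribute axis specifier.
    if re.fullmatch(AXIS + rf"(?:{NCNAME}:)?{NCNAME}", pattern):
        return 0.0
    # processing-instruction('name') is scored like a QName.
    if re.fullmatch(AXIS + r"processing-instruction\((?:'[^']*'|\"[^\"]*\")\)", pattern):
        return 0.0
    # prefix:* matches any name in one namespace.
    if re.fullmatch(AXIS + rf"{NCNAME}:\*", pattern):
        return -0.25
    # Any other bare node test.
    if re.fullmatch(AXIS + r"(?:\*|node\(\)|text\(\)|comment\(\)|processing-instruction\(\))", pattern):
        return -0.5
    # Everything more specific: paths, predicates, and so on.
    return 0.5


def rule_priorities(pattern):
    """Split a match pattern on '|' and score each branch separately."""
    return {alt.strip(): default_priority(alt) for alt in pattern.split("|")}


print(rule_priorities("title | text() | /dblp"))
```

A more specific pattern gets a higher number, so a template matching dblp/article would win over one matching just article when both apply.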
32 Other Common Operations
- Iteration:
  - <xsl:for-each select="path"> ... </xsl:for-each>
- Conditionals:
  - <xsl:if test="./text() &lt; 'abc'"> ... </xsl:if>
- Copying current node and children to the result set:
  - <xsl:copy> <xsl:apply-templates /> </xsl:copy>
33 Creating Output Nodes
- Return text/attribute data (this is a default rule):
  - <xsl:template match="text()|@*"> <xsl:value-of select="."/> </xsl:template>
- Create an element from text (attribute is similar):
  - <xsl:element name="{text()}"> <xsl:apply-templates /> </xsl:element>
- Copy nodes matching a path:
  - <xsl:copy-of select="*"/>
34 Embedding Stylesheets
- You can import or include one stylesheet from another:
  - <xsl:import href="http://www.com/my.xsl"/>
  - <xsl:include href="http://www.com/my.xsl"/>
- Include: the rules get the same precedence as those in the including stylesheet
- Import: the rules are given lower precedence
35 XSLT Summary
- A very powerful, template-based transformation language for XML document → other structured document
  - Commonly used to convert XML → PDF, SVG, GraphViz DOT format, HTML, WML, ...
  - Primarily useful for presentation of XML or for very simple conversions
- But sometimes we need more complex operations when converting data from one source to another
  - Joins: combining and correlating information from multiple sources
  - Aggregation: computing averages, counts, etc.
36 Why XSLT Isn't Enough
- XSLT is focused on reformatting documents
  - Stylesheets are focused around one XML file
  - XML file must reference the stylesheet
- What if we want to:
  - Manage and combine collections of XML documents?
  - Make Web service requests for XML?
  - "Glue together" different Web service requests?
  - Query for keywords within documents, with ranked answers?
- This is where XQuery plays a role