XML, Schemas, and Queries - PowerPoint PPT Presentation

XML, Schemas, and Queries

Description:

XML, Schemas, and Queries. Zachary G. Ives, University of Pennsylvania. CIS 455 / 555 Internet and Web Systems.

1
XML, Schemas, and Queries
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 455 / 555 Internet and Web Systems
  • December 8, 2015

2
Readings & Reminders
  • Reminder: Homework 1 Milestone 2 due 2/15 @
    11:59PM
  • XML, DTD, Schema
  • XPath
  • XSLT
  • For next week: Altinel & Franklin paper on XFilter

3
Kinds of Content
  • Keyword search and inverted indices are great for
    locating text documents
  • But what if we want to index and/or share other
    kinds of content?
  • Spreadsheets
  • Maps
  • Purchase records
  • Objects
  • etc.
  • Let's talk about structured data representation
    and transport, then later indexing and retrieval

4
Sending Data
  • How do we send data within a program?
  • What is the implicit model?
  • How does this change when we need to make the
    data persistent?
  • What happens when we are coupling systems?
  • How do we send data between programs on the same
    machine?
  • Between different machines?

5
Marshalling
  • Converting from an in-memory data structure to
    something that can be sent elsewhere
  • Pointers -> something else
  • Specific byte orderings
  • Metadata
  • Note that the same logical data gets a different
    physical encoding
  • A specific case of Codd's idea of
    logical-physical separation
  • Data model vs. data
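
A minimal sketch of the "same logical data, different physical encoding" point, using Python's standard `struct` module. The record and its field names (`track`, `year`) are made up for illustration; only the byte-order behaviour is the point.

```python
import struct

# Hypothetical in-memory record; the field names are assumptions for this sketch.
record = {"track": 7, "year": 1992}

# Marshal the same two 16-bit integers with two different byte orderings.
little = struct.pack("<HH", record["track"], record["year"])  # little-endian
big = struct.pack(">HH", record["track"], record["year"])     # big-endian

# Unmarshalling only works if the receiver assumes the sender's byte order.
track, year = struct.unpack(">HH", big)
misread = struct.unpack("<HH", big)  # wrong byte order gives different numbers
```

The logical values are identical in both cases; only the bytes on the wire differ, which is why sender and receiver have to agree on the encoding (or carry metadata describing it).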

6
Communication and Streams
  • When storing data to disk, we have a combination
    of sequential and random access
  • When sending data on the wire, data is only
    sequential
  • Stream-based communication based on packets
  • What are the implications here?
  • Pipelining, incremental evaluation, ...

7
Why Data Interchange Is Hard
  • Need to be able to understand
  • Data encoding (physical data model)
  • May have syntactic heterogeneity
  • Endian-ness, marshalling issues
  • Impedance mismatches
  • Data representation (logical data model)
  • May have semantic heterogeneity
  • Imprecise and ambiguous values/descriptions

8
Examples
  • MP3 ID3 format: record at end of file

offset  length  description
0       3       "TAG" identifier string.
3       30      Song title string.
33      30      Artist string.
63      30      Album string.
93      4       Year string.
97      28      Comment string.
125     1       Zero byte separator.
126     1       Track byte.
127     1       Genre byte.
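
As a rough sketch of what reading this record involves, the table above maps directly onto a `struct` format string. The function name `parse_id3v1`, the Latin-1 decoding, and the sample values are assumptions for illustration, not part of any official library.

```python
import struct

# One field per table row: 3 + 30 + 30 + 30 + 4 + 28 + 1 + 1 + 1 = 128 bytes.
ID3_RECORD = struct.Struct("<3s30s30s30s4s28sBBB")

def parse_id3v1(tag: bytes) -> dict:
    """Decode a 128-byte ID3 record laid out as in the table above."""
    if len(tag) != ID3_RECORD.size:
        raise ValueError("expected exactly 128 bytes")
    ident, title, artist, album, year, comment, _zero, track, genre = ID3_RECORD.unpack(tag)
    if ident != b"TAG":
        raise ValueError("missing TAG identifier")

    def text(raw: bytes) -> str:
        # Fixed-width strings are padded; cut at the first NUL and trim spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()

    return {"title": text(title), "artist": text(artist), "album": text(album),
            "year": text(year), "comment": text(comment),
            "track": track, "genre": genre}

# Build a made-up record and parse it back.
sample = ID3_RECORD.pack(b"TAG", b"Example Title", b"Example Artist",
                         b"Example Album", b"1992", b"", 0, 3, 17)
info = parse_id3v1(sample)
```

For a real file, the record would be the last 128 bytes. Note that nothing in the bytes themselves says which field is which: the reader needs the manual.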
9
Examples
  • JPEG JFIF header:
  • Start of Image (SOI) marker -- two bytes (FFD8)
  • JFIF marker (FFE0)
  • length -- two bytes
  • identifier -- five bytes: 4A, 46, 49, 46, 00
    (the ASCII code equivalent of a zero-terminated
    "JFIF" string)
  • version -- two bytes: often 01, 02
  • the most significant byte is used for major
    revisions
  • the least significant byte for minor revisions
  • units -- one byte: units for the X and Y
    densities
  • 0 => no units, X and Y specify the pixel aspect
    ratio
  • 1 => X and Y are dots per inch
  • 2 => X and Y are dots per cm
  • Xdensity -- two bytes
  • Ydensity -- two bytes
  • Xthumbnail -- one byte: 0 = no thumbnail
  • Ythumbnail -- one byte: 0 = no thumbnail
  • (RGB)n -- 3n bytes: packed (24-bit) RGB values
    for the thumbnail pixels, n = Xthumbnail *
    Ythumbnail
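
The header fields listed above can be decoded the same way. This is a sketch, assuming the fixed fields appear back to back at the start of the file and that multi-byte values are big-endian; `parse_jfif_header` and the sample values are hypothetical.

```python
import struct

# SOI, JFIF marker, length, identifier, version (major, minor), units,
# Xdensity, Ydensity, Xthumbnail, Ythumbnail: 20 bytes, big-endian.
JFIF_HEADER = struct.Struct(">HHH5sBBBHHBB")

def parse_jfif_header(data: bytes) -> dict:
    """Decode the fixed part of a JFIF header from the start of `data`."""
    (soi, marker, length, ident, major, minor,
     units, xdensity, ydensity, xthumb, ythumb) = JFIF_HEADER.unpack_from(data)
    if soi != 0xFFD8 or marker != 0xFFE0 or ident != b"JFIF\x00":
        raise ValueError("not a JFIF header")
    return {"version": (major, minor), "units": units,
            "density": (xdensity, ydensity),
            # n = Xthumbnail * Ythumbnail pixels, 3 bytes each
            "thumbnail_bytes": 3 * xthumb * ythumb}

# Made-up header: version 1.02, dots per inch, 72x72, no thumbnail.
sample = JFIF_HEADER.pack(0xFFD8, 0xFFE0, 16, b"JFIF\x00", 1, 2, 1, 72, 72, 0, 0)
header = parse_jfif_header(sample)
```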

10
Finding File Formats
  • http://www.wikipedia.org/
  • http://www.wotsit.org/
  • etc.

11
The Problem
  • You need to look into a manual to find file
    formats
  • (At best, e.g., MS .DOC file format)
  • The Web is about making data exchange easier.
    Maybe we can do better!
  • The mother of all file formats

12
Desiderata for Data Interchange
  • Ability to represent many kinds of information
  • Different data structures
  • Hardware-independent encoding
  • Endian-ness, UTF vs. ASCII vs. EBCDIC
  • Standard tools and interfaces
  • Ability to define shape of expected data
  • With forwards- and backwards-compatibility!
  • That's XML

13
Consumers of XML
  • A myriad of tools and interfaces, including:
  • DOM: document object model
  • Standard OO representation of an XML tree
  • SAX: simple API for XML
  • An event-driven parser interface for XML
  • startElement, endElement, etc.
  • Ant: Java-based make tool with XML makefile
  • XPath, XQuery, XSL, XSLT
  • Web service standards
  • Anything AJAX (mash-ups)
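
To make the event-driven SAX style concrete, here is a small sketch using Python's standard `xml.sax` module. The handler class `TitleCollector` and the one-article document are assumptions for illustration.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element as parser events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

DOC = b"<dblp><article><title>The 1995 SQL Reunion</title><year>1997</year></article></dblp>"
handler = TitleCollector()
xml.sax.parseString(DOC, handler)
```

The handler never sees a tree: it reacts to a sequential stream of events, which suits data arriving over the wire. A DOM parser would instead build the whole tree in memory first.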

14
XML as a Data Model
  • XML information set includes 7 types of nodes
  • Document (root)
  • Element
  • Attribute
  • Processing instruction
  • Text (content)
  • Namespace
  • Comment
  • XML data model includes this, plus typing info,
    plus order info and a few other things

15
Example XML Document
  • <?xml version="1.0" encoding="ISO-8859-1" ?>
  • <dblp>
  • <mastersthesis mdate="2002-01-03" key="ms/Brown92">
  •   <author>Kurt P. Brown</author>
  •   <title>PRPL: A Database Workload Specification Language</title>
  •   <year>1992</year>
  •   <school>Univ. of Wisconsin-Madison</school>
  •   </mastersthesis>
  • <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
  •   <editor>Paul R. McJones</editor>
  •   <title>The 1995 SQL Reunion</title>
  •   <journal>Digital System Research Center Report</journal>
  •   <volume>SRC1997-018</volume>
  •   <year>1997</year>
  •   <ee>db/labs/dec/SRC1997-018.html</ee>
  •   <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  •   </article>

(Slide callouts: processing instruction, open-tag, element, attribute, close-tag)

16
XML Data Model Visualized (Document Object Model)
[Figure: tree view of the example document. The root has a ?xml processing-instruction child and a dblp element. dblp has mastersthesis and article children, each with mdate and key attributes. mastersthesis contains author, title, year, and school; article contains editor, title, journal, volume, year, and two ee elements. The leaves are text nodes. Legend: root, p-i, element, attribute, text.]

17
A Few Common Uses of XML
  • Serves as an extensible HTML
  • Allows custom tags (e.g., used by MS Word,
    OpenOffice)
  • Supplement it with stylesheets (XSL) to define
    formatting
  • Provides an exchange format for data (still need
    to agree on terminology)
  • Tables, objects, etc.
  • Format for marshalling and unmarshalling data in
    Web Services

18
XML as a Super-HTML (MS Word)
  • <h1 class="Section1"><a name="_top" />CIS 550: Database and Information Systems</h1>
  • <h2 class="Section1">Fall 2003</h2>
  • <p class="MsoNormal">
  • <place>311 Towne</place>, Tuesday/Thursday
  • <time Hour="13" Minute="30">1:30PM - 3:00PM</time>
  • </p>

19
XML Easily Encodes Relations
Student-course-grade
id   course    grade
1    330-f03   B
23   455-s04   A
  • <student-course-grade>
  • <tuple> <sid>1</sid><course>330-f03</course><grade>B</grade></tuple>
  • <tuple> <sid>23</sid><course>455-s04</course><grade>A</grade></tuple>
  • </student-course-grade>
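
The mapping from rows to tuple elements is mechanical. A minimal sketch with Python's standard `xml.etree.ElementTree`, where the helper names `relation_to_xml` and `xml_to_relation` are made up for this example:

```python
import xml.etree.ElementTree as ET

columns = ("sid", "course", "grade")
rows = [("1", "330-f03", "B"), ("23", "455-s04", "A")]

def relation_to_xml(name, columns, rows):
    """One <tuple> element per row, one child element per column."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for column, value in zip(columns, row):
            ET.SubElement(tup, column).text = value
    return root

def xml_to_relation(root):
    """Inverse mapping: read each <tuple> back into a Python tuple of strings."""
    return [tuple(field.text for field in tup) for tup in root.findall("tuple")]

root = relation_to_xml("student-course-grade", columns, rows)
xml_text = ET.tostring(root, encoding="unicode")
round_trip = xml_to_relation(ET.fromstring(xml_text))
```

Everything comes back as a string: without a schema, XML carries no type information for the values.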

20
It Also Encodes Objects (with Pointers Represented as IDs)
  • <projects>
  • <project class="cse455" >
  • <type>Programming</type><memberList>
  • <teamMember>Joan</teamMember>
  • <teamMember>Jill</teamMember>
  • </memberList><codeURL>www.</codeURL><incorporatesProjectFrom class="cse330" />
  • </project>

21
XML and Code
  • Web Services (.NET, Java web service toolkits)
    are using XML to pass parameters and make
    function calls: marshalling as part of remote
    procedure calls
  • SOAP + WSDL
  • Why?
  • Easy to be forwards-compatible
  • Easy to read over and validate (?)
  • Generally firewall-compatible
  • Drawbacks? XML is a verbose and inefficient
    encoding!
  • But if the calls are only sending a few 100s of
    bytes, who cares?

22
XML: When Tags Are Used by Different Sources
  • Namespaces allow us to specify a context for
    different tags
  • Two parts:
  • Binding of namespace to URI
  • Qualified names
  • <tag xmlns:myns="http://www.fictitious.com/mypath" xmlns="http://www.default/mypath">
  • <thistag>is in default namespace</thistag>
  • <myns:thistag>this a different tag</myns:thistag></tag>
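
A short sketch of how a namespace-aware parser sees this document, using Python's standard `xml.etree.ElementTree`. The short prefixes `d` and `m` in the lookup map are arbitrary choices for this example; only the URIs matter.

```python
import xml.etree.ElementTree as ET

DOC = """<tag xmlns:myns="http://www.fictitious.com/mypath"
     xmlns="http://www.default/mypath">
  <thistag>is in default namespace</thistag>
  <myns:thistag>this a different tag</myns:thistag>
</tag>"""

root = ET.fromstring(DOC)

# ElementTree expands every qualified name to "{namespace-URI}local-name".
qualified_names = [child.tag for child in root]

# Lookups bind their own prefixes to the URIs; the document's prefixes are irrelevant.
ns = {"d": "http://www.default/mypath", "m": "http://www.fictitious.com/mypath"}
default_text = root.find("d:thistag", ns).text
myns_text = root.find("m:thistag", ns).text
```

The two `thistag` elements have the same local name but different expanded names, so they are different tags as far as any consumer is concerned.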

23
XML Isn't Enough on Its Own
  • It's too unconstrained for many cases!
  • How will we know when we're getting garbage?
  • How will we query?
  • How will we understand what we got?

24
Document Type Definitions (DTDs)
  • DTD is an EBNF grammar defining XML structure
  • XML document specifies an associated DTD, plus
    the root element
  • DTD specifies children of the root (and so on)
  • DTD defines special significance for attributes
  • IDs: special attributes that are analogous to
    keys for elements
  • IDREFs: references to IDs
  • IDREFS: space-delimited list of IDREFs

25
An Example DTD
  • Example DTD:
  • <!ELEMENT dblp ((mastersthesis | article)*)>
  • <!ELEMENT mastersthesis (author,title,year,school,committeemember*)>
  • <!ATTLIST mastersthesis mdate CDATA #REQUIRED
    key ID #REQUIRED
    advisor CDATA #IMPLIED>
  • <!ELEMENT author (#PCDATA)>
  • Example use of DTD in XML file:
  • <?xml version="1.0" encoding="ISO-8859-1" ?>
  • <!DOCTYPE dblp SYSTEM "my.dtd">
  • <dblp>

26
DTDs Are Very Limited
  • DTDs capture grammatical structure, but have some
    drawbacks
  • Only string scalar types
  • Global ID/reference space is inconvenient
  • No way of defining OO-like inheritance

27
XML Schema: DTDs Rethought
  • Features
  • XML syntax
  • Better way of defining keys using XPaths
  • Type subclassing
  • And, of course, built-in datatypes

28
Basic Constructs of Schema
  • Separation of elements (and attributes) from
    types
  • complexType is a structured type
  • It can have sequences or choices
  • element and attribute have name and type
  • Elements may also have minOccurs and maxOccurs
  • Subtyping, most commonly using
  • <complexContent> <extension base="prevType">
    ... </...>

29
Simple Schema Example
  • <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  • <xsd:element name="mastersthesis" type="ThesisType"/>
  • <xsd:complexType name="ThesisType">
  • <xsd:sequence>
  • <xsd:element name="author" type="xsd:string"/>
  • <xsd:element name="title" type="xsd:string"/>
  • <xsd:element name="year" type="xsd:integer"/>
  • <xsd:element name="school" type="xsd:string"/>
  • <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
  • </xsd:sequence>
  • <xsd:attribute name="mdate" type="xsd:date"/>
  • <xsd:attribute name="key" type="xsd:string"/>
  • <xsd:attribute name="advisor" type="xsd:string"/>
  • </xsd:complexType>

30
Embedding XML Schema
  • <root xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="s1.xsd"> <grade>a</grade> </root>
  • <s1:root xmlns:s1="http://www.schemaValid.com/s1ns" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:schemaLocation="http://www.schemaValid.com/s1ns s1ns.xsd"> <s1:grade>a</s1:grade> </s1:root>
  • But the XML parser is actually free to ignore
    this: the schema is typically specified from
    outside the document

31
Designing an XML Schema/DTD
  • Often we are given a DTD/Schema; if not, we need
    to design one
  • We orient the XML tree around the central
    objects in a particular application

32
Manipulating XML
  • Sometimes
  • Need to restructure an XML document
  • Or simply need to retrieve certain parts that
    satisfy a constraint, e.g.
  • All books
  • All books by author XYZ

33
Document Object Model (DOM) vs. Queries
  • Build a DOM tree (as we saw earlier) and access
    via Java (etc.) DOMNode object
  • DOM objects have methods like getFirstChild(),
    getNextSibling()
  • Common way of traversing the tree
  • Can also modify the DOM tree (alter the XML)
    via insertBefore(), etc.
  • Alternate approach: a query language
  • Define some sort of a template describing
    traversals from the root of the directed graph
  • In XML, the basis of this template is called an
    XPath
  • Can also declare some constraints on the values
    you want
  • The XPath returns a node set of matches
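
A minimal sketch of the DOM style of traversal, using Python's standard `xml.dom.minidom`. The slide refers to Java-style methods; in Python's DOM binding the same links are exposed as the attributes `firstChild` and `nextSibling`. The `note` element and the helper name are made up for illustration.

```python
from xml.dom import minidom

# No whitespace between tags, so every child node here is an element.
DOC = ("<dblp><article><editor>Paul R. McJones</editor>"
       "<title>The 1995 SQL Reunion</title><year>1997</year></article></dblp>")

def child_element_names(node):
    """Walk the children with the first-child / next-sibling links."""
    names = []
    child = node.firstChild
    while child is not None:
        if child.nodeType == child.ELEMENT_NODE:
            names.append(child.tagName)
        child = child.nextSibling
    return names

dom = minidom.parseString(DOC)
article = dom.documentElement.firstChild
names_before = child_element_names(article)

# Modify the tree: insert a hypothetical <note> element before the first child.
note = dom.createElement("note")
article.insertBefore(note, article.firstChild)
names_after = child_element_names(article)
```

This is the imperative alternative to a query: the program spells out each step of the traversal itself.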

34
XPaths
  • In its simplest form, an XPath is like a path in
    a file system
  • /mypath/subpath//morepath
  • The XPath returns a node set representing the XML
    nodes (and their subtrees) at the end of the path
  • XPaths can have node tests at the end, returning
    only particular node types, e.g., text(),
    processing-instruction(), comment(), element(),
    attribute()
  • XPath is fundamentally an ordered language: it
    can query in order-aware fashion, and it returns
    nodes in order

35
Sample XML
  • <?xml version="1.0" encoding="ISO-8859-1" ?>
  • <dblp>
  • <mastersthesis mdate="2002-01-03" key="ms/Brown92">
  •   <author>Kurt P. Brown</author>
  •   <title>PRPL: A Database Workload Specification Language</title>
  •   <year>1992</year>
  •   <school>Univ. of Wisconsin-Madison</school>
  •   </mastersthesis>
  • <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
  •   <editor>Paul R. McJones</editor>
  •   <title>The 1995 SQL Reunion</title>
  •   <journal>Digital System Research Center Report</journal>
  •   <volume>SRC1997-018</volume>
  •   <year>1997</year>
  •   <ee>db/labs/dec/SRC1997-018.html</ee>
  •   <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  •   </article>

36
XML Data Model Visualized
[Figure: tree view of the sample document. The root has a ?xml processing-instruction child and a dblp element. dblp has mastersthesis and article children, each with mdate and key attributes. mastersthesis contains author, title, year, and school; article contains editor, title, journal, volume, year, and two ee elements. The leaves are text nodes. Legend: root, p-i, element, attribute, text.]

37
Some Example XPath Queries
  • /dblp/mastersthesis/title
  • /dblp//editor
  • //title
  • //title/text()
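
These queries can be tried with Python's standard `xml.etree.ElementTree`, with two caveats: it implements only a subset of XPath, and paths are evaluated relative to the document element rather than the document root. The sketch below rewrites each query accordingly and reads `.text` where the slide uses `text()`. The document is an abridged copy of the sample XML.

```python
import xml.etree.ElementTree as ET

DOC = """<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
</dblp>"""

dblp = ET.fromstring(DOC)  # the <dblp> element is our starting context

thesis_titles = dblp.findall("./mastersthesis/title")  # /dblp/mastersthesis/title
editors = dblp.findall(".//editor")                    # /dblp//editor
all_titles = dblp.findall(".//title")                  # //title
title_texts = [t.text for t in all_titles]             # //title/text()
```

Results come back in document order, matching the point that XPath is an ordered language.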

38
Context Nodes and Relative Paths
  • XPath has a notion of a context node: it's
    analogous to a current directory
  • . represents this context node
  • .. represents the parent node
  • We can express relative paths
  • subpath/sub-subpath/../.. gets us back to the
    context node
  • By default, the document root is the context node

39
Predicates: Filtering Operations
  • A predicate allows us to filter the node set
    based on selection-like conditions over
    sub-XPaths
  • /dblp/article[title = "Paper1"]
  • which is equivalent to
  • /dblp/article[./title/text() = "Paper1"]
  • because of type coercion. What does this do?
  • /dblp/article[@key = "123" and ./title/text() = "Paper1" and ./author//element()]
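
A sketch of predicate filtering with Python's standard `xml.etree.ElementTree`. Its XPath subset has no `and`, so the conjunction is approximated by chaining predicates, and the last condition (an author with at least one element beneath it) is checked in Python. The three-article document is made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<dblp>
  <article key="123"><title>Paper1</title><author><name>A. Author</name></author></article>
  <article key="456"><title>Paper1</title><author>B. Author</author></article>
  <article key="123"><title>Paper2</title><author>C. Author</author></article>
</dblp>"""

dblp = ET.fromstring(DOC)

# article[title = "Paper1"]: compare the text of the title child.
by_title = dblp.findall("./article[title='Paper1']")

# @key = "123" and title = "Paper1", written as two chained predicates.
by_key_and_title = dblp.findall("./article[@key='123'][title='Paper1']")

# ./author//element(): keep articles with an author that has a child element.
with_structured_author = [
    article for article in by_key_and_title
    if any(len(author) > 0 for author in article.findall("author"))
]
```

Here the first filter keeps two articles, the second narrows that to one, and that one also passes the third test because its author contains a `name` element.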

40
Axes: More Complex Traversals
  • Thus far, we've seen XPath expressions that go
    down the tree (and up one step)
  • But we might want to go up, left, right, etc.
  • These are expressed with so-called axes:
  • self::path-step
  • child::path-step  parent::path-step
  • descendant::path-step  ancestor::path-step
  • descendant-or-self::path-step  ancestor-or-self::path-step
  • preceding-sibling::path-step  following-sibling::path-step
  • preceding::path-step  following::path-step
  • The previous XPaths we saw were in abbreviated
    form

41
Users of XPath
  • XML Schema uses simple XPaths in defining keys
    and uniqueness constraints
  • XLink and XPointer, hyperlinks for XML
  • XSLT: useful for converting from XML to other
    representations (e.g., HTML, PDF, SVG)
  • XQuery: useful for restructuring an XML document
    or combining multiple documents
  • Might well turn into the glue between Web
    Services, etc.

42
A Functional Language for XML
  • XSLT is based on a series of templates that match
    different parts of an XML document
  • There's a policy for what rule or template is
    applied if more than one matches (it's not what
    you'd think!)
  • XSLT templates can invoke other templates
  • XSLT templates can be nonterminating (beware!)
  • XSLT templates are based on XPath matches, and
    we can also apply other templates (potentially to
    selected XPaths)
  • Within each template, directly describe what
    should be output

43
An XSLT Template
  • An XML document itself
  • XML tags create output OR are XSL operations
  • All XSL tags are prefixed with the "xsl" namespace
  • All non-XSL tags are part of the XML output
  • Common XSL operations
  • template with a match XPath
  • Recursive call to apply-templates, which may also
    select where it should be applied
  • Attach to XML document with a
    processing-instruction:
  • <?xml version="1.0" ?><?xml-stylesheet type="text/xsl" href="http://www.com/my.xsl" ?>

44
An Example XSLT Stylesheet
  • <xsl:stylesheet version="1.1" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  • <xsl:template match="/dblp">
  • <html><head>This is DBLP</head>
  • <body>
  • <xsl:apply-templates />
  • </body>
  • </html>
  • </xsl:template>
  • <xsl:template match="inproceedings">
  • <h2><xsl:apply-templates select="title" /></h2>
  • <p><xsl:apply-templates select="author"/></p>
  • </xsl:template>
  • </xsl:stylesheet>

45
XSLT Processing Model
  • List of source nodes → result tree fragment(s)
  • Start with root
  • Find all template rules with matching patterns
    from root
  • Find best match according to some heuristics
  • Set the current node list to be the set of things
    it matches
  • Iterate over each node in the current node list
  • Apply the operations of the template
  • Append the results of the matching template
    rule to the result tree structure
  • Repeat recursively if specified to by
    apply-templates
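
The recursion in this processing model can be imitated in a few lines of Python. This is a toy sketch of the control flow only, not an XSLT engine: templates are plain functions keyed by tag name, there is no pattern language or priority computation, and the rule names and sample document are made up.

```python
import xml.etree.ElementTree as ET

def builtin_rule(node, apply):
    # Stand-in for the default rules: emit the element's own text,
    # then keep processing its children.
    return (node.text or "").strip() + apply(list(node))

def dblp_rule(node, apply):
    # Like a template containing <xsl:apply-templates/>: recurse into all children.
    return "<html><body>" + apply(list(node)) + "</body></html>"

def inproceedings_rule(node, apply):
    # Like apply-templates with a select: only the chosen children are processed.
    return ("<h2>" + apply(node.findall("title")) + "</h2>"
            + "<p>" + apply(node.findall("author")) + "</p>")

TEMPLATES = {"dblp": dblp_rule, "inproceedings": inproceedings_rule}

def apply_templates(nodes):
    """Process the current node list in order and concatenate the results."""
    output = []
    for node in nodes:
        rule = TEMPLATES.get(node.tag, builtin_rule)   # find the matching template
        output.append(rule(node, apply_templates))     # append its result
    return "".join(output)

DOC = ("<dblp><inproceedings><author>A. Author</author>"
       "<title>Some Paper</title></inproceedings></dblp>")
result = apply_templates([ET.fromstring(DOC)])
```

The output order follows the template, not the source document: the title is emitted before the author even though it comes second in the input.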

46
What If There's More than One Match?
  • Eliminate rules of lower precedence due to
    importing
  • Break a rule into any | branches and consider
    separately
  • Choose rule with highest computed or specified
    priority
  • Simple rules for computing priority based on
    precision:
  • QName preceded by XPath child/axis specifier:
    priority 0
  • NCName:* preceded by child/axis specifier:
    priority -0.25
  • NodeTest preceded by child/axis specifier:
    priority -0.5
  • else: priority 0.5

47
Other Common Operations
  • Iteration
  • <xsl:for-each select="path">...</xsl:for-each>
  • Conditionals
  • <xsl:if test="./text() &lt; 'abc'">...</xsl:if>
  • Copying current node and children to the result
    set
  • <xsl:copy> <xsl:apply-templates /></xsl:copy>

48
Creating Output Nodes
  • Return text/attribute data (this is a default
    rule)
  • <xsl:template match="text()|@*"> <xsl:value-of select="."/></xsl:template>
  • Create an element from text (attribute is
    similar)
  • <xsl:element name="{text()}"> <xsl:apply-templates /></xsl:element>
  • Copy nodes matching a path
  • <xsl:copy-of select="*"/>

49
Embedding Stylesheets
  • You can import or include one stylesheet from
    another
  • <xsl:import href="http://www.com/my.xsl"/>
  • <xsl:include href="http://www.com/my.xsl"/>
  • Include: the rules get the same precedence as in
    the including template
  • Import: the rules are given lower precedence

50
XSLT Summary
  • A very powerful, template-based transformation
    language for XML document → other structured
    document
  • Commonly used to convert XML → PDF, SVG, GraphViz
    DOT format, HTML, WML, ...
  • Primarily useful for presentation of XML or for
    very simple conversions
  • But sometimes we need more complex operations
    when converting data from one source to another
  • Joins: combining and correlating information
    from multiple sources
  • Aggregation: computing averages, counts, etc.

51
XSLT and Alternatives
  • XSLT is focused on reformatting documents
  • Stylesheets are focused around one XML file
  • XML file must reference the stylesheet
  • What if we want to:
  • Manage and combine collections of XML documents?
  • Make Web service requests for XML?
  • Glue together different Web service requests?
  • Query for keywords within documents, with ranked
    answers?
  • This is where XQuery plays a role: see CIS 330 /
    550 for details