Introduction to XML - PowerPoint PPT Presentation

Slides: 39
Provided by: lambd
Learn more at: https://lambda.uta.edu

1
Introduction to XML
  • Leonidas Fegaras

2
Traditional DB Applications
  • Typically business oriented
  • Large amount of data
  • Data is well-structured, normalized, with
    predefined schema
  • Large number of concurrent users (transactions)
  • Simple data, simple queries, and simple updates
  • Typically update intensive
  • Small transactions
  • High performance, high availability, scalability
  • Data integrity and security are of major
    importance
  • Good administrative support, nice GUIs

3
Document Applications
  • Human friendly: what-you-see-is-what-you-get
    paradigm
  • Focus on presentation
  • Information is divided into multiple small
    documents
  • Mostly static
  • Implicit structure: section, subsection,
    paragraph, etc.
  • Meta-data: title, author, date, indexing
    keywords, etc.
  • Content structure: form/layout,
    inter-relationships, references
  • Tagging: e.g., <p> for a new paragraph
  • Operations: retrieving, editing, spell-checking,
    printing, etc.
  • Information retrieval: keyword queries
  • most successful in web search engines (e.g., Google)

4
Internet Applications
  • Internet applications
  • use heterogeneous, complex, hierarchical,
    fast-evolving, unstructured/semistructured data
  • access mostly read-only data
  • need 100% availability
  • manage millions of users world-wide
  • have high-performance requirements
  • are concerned with security (encryption)
  • like to customize data in a personalized manner
  • expect to gain users' trust for
    business-to-consumer transactions
  • Internet users choose speed and availability over
    correctness

5
Electronic Commerce
  • Currently, mostly business-to-business (B2B)
    rather than business-to-consumer (B2C)
    interactions
  • Focus on selling and buying
  • Order management
  • Product catalogs
  • Product configuration
  • Sales and marketing
  • Education and training
  • Web services
  • Communities

6
Other Web Applications
  • Web services
  • Many standards: SOAP, WSDL, UDDI
  • Web integration
  • Heterogeneous data sources and types
  • Thousands of web-accessible data sources
  • Dynamic data
  • Data warehouses
  • Web publishing
  • Access different types of content from browsers
    (PDF, HTML, XML)
  • Structured, dynamic, customized/personalized
    content
  • Integration with application
  • Accessible via major gateways and search engines
  • Application integration
  • Transformation between different data formats
    (eg, XML, HTML)
  • Integration of multiple applications

7
Current Internet Application Architectures
  • Architecture
  • Server-Tier: relational databases and gateways to
    diverse data sources, such as files, OLE/DB, etc.
    Use of enterprise servers
  • Middle-Tier: provides data integration,
    distribution, query, etc. Consists of a web
    server and an application server
  • Client-Tier: mostly a web browser; may use CGI
    scripts or Java
  • Characteristics
  • Customization is achieved at the server site
    (customer data in a database) with some data at
    the client site (cookies)
  • Load balancing is typically hardware based
    (multiple servers, DNS routers)

8
HTML
  • <html>
  • <head><title>My Web Page</title></head>
  • <body>
  • <h1>Introduction</h1>
  • Look at <a href="http://lambda.uta.edu/index.html">this document</a>
  • <img src="image.jpg" width="100" height="50">
  • </body>
  • </html>
  • It is very simple: human readable, can be edited
    by any editor
  • It reflects document presentation, not the
    semantics or structure of data
  • Universal: portable to any platform
  • HTML pages are connected through hypertext links
  • HTML pages can be located using web search engines

[Slide callouts on the HTML example: hypertext link, opening tag, closing tag, attribute name, attribute value]

9
XML
  • XML (eXtensible Markup Language) is a textual
    language for representing and exchanging data on
    the web
  • It is designed to improve the functionality of
    the Web by providing more flexible and adaptable
    information identification
  • Based on SGML
  • It was developed around 1996
  • It is called extensible because
  • it is not a fixed format like HTML (a single,
    predefined markup language)
  • it is actually a metalanguage (a language for
    describing other languages) which lets you design
    your own customized markup languages for
    limitless different types of documents

10
XML (cont.)
  • XML can be untyped (semistructured), but there
    are standards now for schema conformance
  • DTD
  • XML Schema
  • Without schema, an XML document is well-formed if
    it satisfies simple syntactic constraints
  • proper nesting of start and end tags
  • With a schema, an XML document is valid if its
    structure conforms to a DTD or an XML Schema

11
Example
  • <people>
  •   <person>
  •     <name> Leonidas Fegaras </name>
  •     <tel> (817) 272-3629 </tel>
  •     <email> fegaras@cse.uta.edu </email>
  •   </person>
  •   <person>
  •     <name> Ramez Elmasri </name>
  •     <tel> (817) 272-2348 </tel>
  •     <email> elmasri@cse.uta.edu </email>
  •   </person>
  • </people>

12
Why is XML so Popular?
  • It looks like HTML
  • simple, human-readable, easy to learn, universal
  • Flexible and extensible, since you can represent
    any kind of data
  • unlike HTML
  • HTML describes the presentation while XML
    describes the content
  • Precise
  • well-formed: properly nested XML tags
  • valid: its structure may conform to a DTD or an
    XML Schema
  • Supported by the W3C
  • trusted and adopted by industry
  • Many standards around XML: schemas, query
    languages, etc.

13
What Does XML Have to Do with Databases?
  • XML is an important standard for data
    representation and exchange, but we still need
  • to store and query large repositories of XML
    documents
  • data models and schema representations
  • query languages, data indexing, query optimizers
  • updates, view maintenance
  • concurrency, distribution, security, etc
  • Example application
  • an XML data repository distributed in a
    peer-to-peer network
  • answer queries, such as
  • find all books whose author is "Smith" and whose
    title contains the word "Web"
  • much like a web search engine, but for XML, ...
    and for more precise querying

14
XML Syntax
  • XML consists of tags and text
  • XML documents conform to the following grammar
  • XMLdocument ::= Pi* Element Pi*
  • Element ::= Stag (char | Pi | Element)* Etag
  • Stag ::= '<' Name Attributes '>'
  • Etag ::= '</' Name '>'
  • Pi ::= '<?' char* '?>'
  • Attributes ::= ( Name '=' String )*
  • String ::= '"' char* '"'
  • Tags come in pairs, <date>8/25/2004</date>, and
    must be properly nested
  • <person> <name> ... </name> ... </person> ---
    valid nesting
  • <person> <name> ... </person> ... </name> ---
    invalid nesting
  • Text is bounded by tags. PCDATA: parsed character
    data, e.g.,
  • <title> The Big Sleep </title>
  • <year> 1935 </year>
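
The nesting rule above can be checked mechanically. The following is a minimal sketch in Python, assuming the standard library parser xml.etree.ElementTree is an acceptable stand-in for "any XML parser"; the helper name is_well_formed is hypothetical and not part of the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for syntax errors such as mismatched or unclosed tags.
        return False

# The two nesting examples from the slide.
good = "<person> <name> ... </name> ... </person>"
bad = "<person> <name> ... </person> ... </name>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

This only tests well-formedness; it says nothing about validity against a DTD or an XML Schema.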

15
XML Elements
  • An element is a segment of an XML document
    between an opening and the matching closing tags
  • <person>
  •   <name> Ramez Elmasri </name>
  •   <tel> (817) 272-2348 </tel>
  •   <email> elmasri@cse.uta.edu </email>
  • </person>
  • An element may contain a mixture of sub-elements
    and PCDATA
  • <title>An <em>element</em> is a segment</title>
  • As an abbreviation for an element with empty
    content, we can use
  • <tagname ... />
  • instead of
  • <tagname ...></tagname>

16
Representing Data Using XML
  • Nesting tags can be used to express various
    structures, such as a record
  • <person>
  •   <name> Ramez Elmasri </name>
  •   <tel> (817) 272-2348 </tel>
  •   <email> elmasri@cse.uta.edu </email>
  • </person>
  • We can represent a list by using the same tag
    repeatedly
  • <addresses>
  •   <person> ... </person>
  •   <person> ... </person>
  •   <person> ... </person>
  •   ...
  • </addresses>
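
As a rough illustration of records and lists, the sketch below builds an addresses document from a list of Python dictionaries using the standard library module xml.etree.ElementTree. The variable names and the choice of fields are assumptions made for the example; the two records reuse data that appears elsewhere in the slides.

```python
import xml.etree.ElementTree as ET

# Each dictionary is one record; the list becomes repeated <person> tags.
people = [
    {"name": "Leonidas Fegaras", "tel": "(817) 272-3629"},
    {"name": "Ramez Elmasri", "tel": "(817) 272-2348"},
]

addresses = ET.Element("addresses")
for record in people:
    person = ET.SubElement(addresses, "person")
    for field, value in record.items():
        # One sub-element per record field, holding the value as text.
        ET.SubElement(person, field).text = value

xml_text = ET.tostring(addresses, encoding="unicode")
print(xml_text)
```

The record fields become nested tags and the list becomes a run of sibling elements with the same tag name, which is the encoding the slide describes.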

17
XML structure
  • XML
  • <person>
  •   <name> Ramez Elmasri </name>
  •   <tel> (817) 272-2348 </tel>
  •   <email> elmasri@cse.uta.edu </email>
  • </person>
  • is Lisp-like
  • (person (name Ramez Elmasri)
  •         (tel (817) 272-2348)
  •         (email elmasri@cse.uta.edu))
  • and tree-like

[Figure: a tree with root person and children name, tel, and email, whose leaves are the text values Ramez Elmasri, (817) 272-2348, and elmasri@cse.uta.edu]
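
The correspondence between the XML record, the Lisp-like form, and the tree can be sketched with a short recursive function. This is an illustrative sketch only: the function name to_sexpr is hypothetical, and it ignores attributes and any text that follows a child element (mixed content).

```python
import xml.etree.ElementTree as ET

def to_sexpr(elem):
    """Render an element as (tag text child1 child2 ...)."""
    parts = [elem.tag]
    text = (elem.text or "").strip()
    if text:
        parts.append(text)
    # Recurse over the children, i.e. walk the tree depth-first.
    for child in elem:
        parts.append(to_sexpr(child))
    return "(" + " ".join(parts) + ")"

doc = """<person>
  <name> Ramez Elmasri </name>
  <tel> (817) 272-2348 </tel>
  <email> elmasri@cse.uta.edu </email>
</person>"""

print(to_sexpr(ET.fromstring(doc)))
# (person (name Ramez Elmasri) (tel (817) 272-2348) (email elmasri@cse.uta.edu))
```
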
18
Attributes
  • An opening tag may contain attributes
  • typically used to describe the content of an
    element
  • <author ssn="2787901">
  •   <name>Ramez Elmasri</name>
  •   <email> elmasri@cse.uta.edu </email>
  • </author>
  • It's not always clear when to use attributes
  • <author>
  •   <ssn>2787901</ssn>
  •   <name>Ramez Elmasri</name>
  •   <email> elmasri@cse.uta.edu </email>
  • </author>
  • ID attributes are special: they must be unique within
    the document
  • An IDREF attribute must refer to an existing ID
    in the same document

19
Referencing Elements Using IDs/IDrefs
  • ltfamilygt
  •   <person id="jane" mother="mary"
        father="john">
  •     <name> Jane Doe </name>
  •   </person>
  •   <person id="john" children="jane jack">
  •     <name> John Doe </name> <mother/>
  •   </person>
  •   <person id="mary" children="jane jack">
  •     <name> Mary Doe </name>
  •   </person>
  •   <person id="jack" mother="mary"
        father="john">
  •     <name> Jack Doe </name>
  •   </person>
  • </family>
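
The ID and IDREF constraints can be checked by hand when no validating parser is available. The sketch below uses Python's xml.etree.ElementTree, which does not read DTDs, so the names of the ID attribute (id) and of the reference attributes (mother, father, children) are hard-coded assumptions. The family document is a slightly simplified copy of the slide's example, and dangling_refs and name_of are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

doc = """<family>
  <person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person id="john" children="jane jack"><name>John Doe</name></person>
  <person id="mary" children="jane jack"><name>Mary Doe</name></person>
  <person id="jack" mother="mary" father="john"><name>Jack Doe</name></person>
</family>"""
root = ET.fromstring(doc)

# ID rule: every id value must be unique within the document.
by_id = {}
for person in root.iter("person"):
    pid = person.get("id")
    if pid in by_id:
        raise ValueError("duplicate ID: " + pid)
    by_id[pid] = person

def dangling_refs(ref_attrs=("mother", "father", "children")):
    """IDREF rule: list (person, attribute, ref) triples with no matching ID."""
    missing = []
    for person in root.iter("person"):
        for attr in ref_attrs:
            # An IDREFS value is a space-separated list of references.
            for ref in person.get(attr, "").split():
                if ref not in by_id:
                    missing.append((person.get("id"), attr, ref))
    return missing

def name_of(pid):
    """Follow a reference and return the referenced person's name."""
    return by_id[pid].findtext("name").strip()

print(dangling_refs())  # []
print([name_of(c) for c in by_id["john"].get("children").split()])
# ['Jane Doe', 'Jack Doe']
```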

20
A Complete Example
  • <?xml version="1.0"?>
  • <!DOCTYPE bib SYSTEM "bib.dtd">
  • <bib>
  • <vendor id="id0_1">
  •   <name>Amazon</name>
  •   <email>webmaster@amazon.com</email>
  •   <phone>1-800-555-9999</phone>
  •   <book>
  •     <title>Unix Network Programming</title>
  •     <publisher>Addison Wesley</publisher>
  •     <year>1995</year>
  •     <author>
  •       <firstname>Richard</firstname>
  •       <lastname>Stevens</lastname>
  •     </author>
  •     <price>38.68</price>
  •   </book>
  •   <book>
  •     <title>An Introduction to
        Object-Oriented Design</title>

21
OODB Schema
  • class Movie
  • ( extent Movies, key title )
  • attribute string title
  • attribute string director
  • relationship set<Actor> casts
  • inverse Actor::acted_In
  • attribute int budget

class Actor
( extent Actors, key name )
attribute string name
relationship set<Movie> acted_In
inverse Movie::casts
attribute int age
attribute set<string> directed

22
In XML
<db>
  <movie id="m1">
    <title>Waking Ned Divine</title>
    <director>Kirk Jones III</director>
    <cast idrefs="a1 a3"></cast>
    <budget>100,000</budget>
  </movie>
  <movie id="m2">
    <title>Dragonheart</title>
    <director>Rob Cohen</director>
    <cast idrefs="a2 a9 a21"></cast>
    <budget>110,000</budget>
  </movie>
  <movie id="m3">
    <title>Moondance</title>
    <director>Dagmar Hirtz</director>
    <cast idrefs="a1 a8"></cast>
    <budget>90,000</budget>
  </movie>
  <actor id="a1">
    <name>David Kelly</name>
    <acted_In idrefs="m1 m3 m78"> </acted_In>
  </actor>
  <actor id="a2">
    <name>Sean Connery</name>
    <acted_In idrefs="m2 m9 m11"> </acted_In>
    <age>68</age>
  </actor>
  <actor id="a3">
    <name>Ian Bannen</name>
    <acted_In idrefs="m1 m35"> </acted_In>
  </actor>
</db>

23
DTD: Document Type Definition
  • A DTD imposes a structure on an XML document
  • Not quite a typing system
  • it is purely syntactic
  • now largely superseded by XML Schema
  • Uses regular expressions to specify structure
  • firstname             an element with tag name firstname
  • book*                 zero or more books
  • year?                 an optional year
  • firstname,lastname    a firstname followed by
    a lastname
  • book | journal        either a book or a journal

24
Example of XML Data
  • <bib>
  • <vendor id="id0_1">
  •   <name>Amazon</name>
  •   <email>webmaster@amazon.com</email>
  •   <phone>1-800-555-9999</phone>
  •   <book>
  •     <title>Unix Network Programming</title>
  •     <publisher>Addison Wesley</publisher>
  •     <year>1995</year>
  •     <author>
  •       <firstname>Richard</firstname>
  •       <lastname>Stevens</lastname>
  •     </author>
  •     <price>38.68</price>
  •   </book>
  •   ...
  • </vendor>
  • </bib>

25
DTD Example
  • <?xml encoding="ISO-8859-1"?>
  • <!ELEMENT bib (vendor*)>
  • <!ELEMENT vendor (name, email, book*)>
  • <!ATTLIST vendor id ID #REQUIRED>
  • <!ELEMENT book (title, publisher?, year?,
    author+, price)>
  • <!ELEMENT author (firstname?, lastname)>
  • <!ELEMENT name (#PCDATA)>
  • <!ELEMENT email (#PCDATA)>
  • <!ELEMENT title (#PCDATA)>
  • <!ELEMENT publisher (#PCDATA)>
  • <!ELEMENT year (#PCDATA)>
  • <!ELEMENT firstname (#PCDATA)>
  • <!ELEMENT lastname (#PCDATA)>
  • <!ELEMENT price (#PCDATA)>
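
Since a DTD content model is a regular expression over child tag names, the idea can be illustrated with an ordinary regex. The sketch below is not a DTD validator. It assumes the book content model is (title, publisher?, year?, author+, price), hand-translates it into a Python pattern, and encodes each child as its tag name followed by a space; the names BOOK_MODEL and children_match are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the assumed content model
# (title, publisher?, year?, author+, price).
# Each child is encoded as "tag " so that tag names cannot run together.
BOOK_MODEL = re.compile(r"title (publisher )?(year )?(author )+price ")

def children_match(elem, model):
    """True if the sequence of child tags matches the content model."""
    sequence = "".join(child.tag + " " for child in elem)
    return model.fullmatch(sequence) is not None

ok = ET.fromstring(
    "<book><title>T</title><year>1995</year>"
    "<author>A</author><price>1</price></book>"
)
no_author = ET.fromstring("<book><title>T</title><price>1</price></book>")

print(children_match(ok, BOOK_MODEL))         # True
print(children_match(no_author, BOOK_MODEL))  # False
```

A real validating parser performs this kind of check for every element, using the declarations in the DTD rather than a hand-written pattern.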

26
Summary of the DTD Syntax
  • A tagged element in a DTD is defined by
  • <!ELEMENT name e>
  • where e is a DTD expression
  • If e, e1, e2 are DTD expressions, then so are
  • EMPTY       empty content
  • #PCDATA     any text
  • A           an element with tag name A
  • e1,e2       e1 followed by e2
  • e1 | e2     either e1 or e2
  • e*          zero or more occurrences of e
  • e+          one or more occurrences of e
  • e?          optional e (zero or one occurrences)
  • (e)
  • Note: tagged elements are global
  • each must be defined once in a DTD

27
DTD Syntax (cont.)
  • Attribute specification
  • <!ATTLIST name (attribute-name type accuracy?)+ >
  • type is
  • ID        must be unique within the document
  • IDREF     a reference to an existing ID
  • IDREFS    multiple IDREFs
  • CDATA     any string
  • accuracy is #REQUIRED, #IMPLIED, #FIXED 'value',
    value 'v1 ... vn'
  • ID, IDREF, and IDREFS attributes are not typed!
  • Example
  • <!ELEMENT person (#PCDATA)>
  • <!ATTLIST person
  •     id ID #REQUIRED
  •     children IDREFS #IMPLIED >
  • the id attribute is required while the children
    attribute is optional

28
Connecting an XML document to a DTD
  • In-line the DTD into the XML file
  • <?xml version="1.0"?>
  • <!DOCTYPE db [
  •   <!ELEMENT person ...>
  •   ...
  • ]>
  • <db>
  •   <person> ... </person>
  •   ...
  • </db>
  • Better: put the DTD in a separate file and
    reference it by URL
  • <!DOCTYPE db SYSTEM "http://lambda.uta.edu/person.dtd">
  • Documents are validated against their DTD before
    they are used

[Slide callouts on the in-line example: the DOCTYPE block is the DTD; the db element is the XML data]

29
Recursive DTDs
  • We want to capture a person with a mother and a
    father
  • First attempt
  • <!ELEMENT person (name, address, person, person)>
  • where the first person is the mother while the
    second is the father
  • (no finite document can satisfy this: every person
    must contain two more persons)
  • Second attempt
  • <!ELEMENT person (name, address, person?,
    person?)>
  • (finite, but each parent is nested inside, and
    repeated for, every child)
  • Third attempt
  • <!ELEMENT person (name, address)>
  • <!ATTLIST person
  •     id ID #REQUIRED
  •     mother IDREF #IMPLIED
  •     father IDREF #IMPLIED>

30
Back to the OODB Schema
  • class Movie
  • ( extent Movies, key title )
  • attribute string title
  • attribute string director
  • relationship set<Actor> casts
  • inverse Actor::acted_In
  • attribute int budget

class Actor
( extent Actors, key name )
attribute string name
relationship set<Movie> acted_In
inverse Movie::casts
attribute int age
attribute set<string> directed

31
DTD
  • <!ELEMENT db (movie+, actor+)>
  • <!ELEMENT movie (title, director, cast, budget)>
  • <!ATTLIST movie id ID #REQUIRED>
  • <!ELEMENT title (#PCDATA)>
  • <!ELEMENT director (#PCDATA)>
  • <!ELEMENT cast EMPTY>
  • <!ATTLIST cast idrefs IDREFS #REQUIRED>
  • <!ELEMENT budget (#PCDATA)>
  • <!ELEMENT actor (name, acted_In, age?, directed*)>
  • <!ELEMENT name (#PCDATA)>
  • <!ELEMENT acted_In EMPTY>
  • <!ATTLIST acted_In idrefs IDREFS #REQUIRED>
  • <!ELEMENT age (#PCDATA)>
  • <!ELEMENT directed (#PCDATA)>

32
XML Namespaces
  • When merging multiple docs together, name
    collisions may occur
  • A namespace is a mechanism for uniquely naming
    tagnames and attribute names to avoid name
    conflicts
  • Tag/attribute names are now qualified names
    (QNames)
  • (namespace ':')? localname
  • example: bib:author
  • A document may use multiple namespaces
  • A DTD has its own namespace in which all names
    are unique
  • A namespace in an XML doc is defined as an
    attribute
  • xmlns:bib="http://lambda.uta.edu/biblio.dtd"
  • where bib is the namespace prefix and the URL
    identifies the namespace (here, the location of the DTD)
  • The default namespace is defined as
  • xmlns="URL"
  • If not defined, it is the global namespace

33
Example
  • <item xmlns="http://www.acme.com/jp#supplies"
  •       xmlns:toy="http://www.acme.com/jp#toys">
  •   <name>backpack</name>
  •   <feature>
  •     <toy:item>
  •       <toy:name>cyberpet</toy:name>
  •     </toy:item>
  •   </feature>
  • </item>
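
To see how a parser resolves these qualified names, here is a small sketch using Python's xml.etree.ElementTree, which expands every name to the form {namespace-URI}localname. The two namespace URIs, including the # characters, are an assumption about what the slide originally showed, and the prefix s in the lookup table is a hypothetical local alias for the default namespace.

```python
import xml.etree.ElementTree as ET

doc = """<item xmlns="http://www.acme.com/jp#supplies"
      xmlns:toy="http://www.acme.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
</item>"""
root = ET.fromstring(doc)

# Unprefixed names fall in the default namespace.
print(root.tag)  # {http://www.acme.com/jp#supplies}item

# Prefixes used in queries are local aliases; only the URIs matter.
ns = {
    "s": "http://www.acme.com/jp#supplies",
    "toy": "http://www.acme.com/jp#toys",
}
print(root.findtext("s:name", namespaces=ns))                       # backpack
print(root.findtext("s:feature/toy:item/toy:name", namespaces=ns))  # cyberpet
```

The outer item and the inner toy:item share a local name but expand to different qualified names, which is how the collision is avoided.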

34
Query Languages for XML
  • Need a language for XML data for
  • extracting fragments (querying)
  • restructuring (data transformation)
  • integrating (eg, combining multiple XML
    documents)
  • browsing
  • presentation (eg, from XML to HTML)
  • We will first learn XPath
  • used in extracting fragments from a single
    document
  • many XML query languages are based on XPath
  • We will briefly discuss XSLT
  • for extracting, restructuring, and presentation
    over a single document
  • We will focus later on XQuery
  • a full-fledged query language
  • much like OQL

35
XPath
  • Describes a single navigation path in an XML
    document
  • Selects a sequence of nodes reachable by the path
  • the order of nodes is the document order
  • Main construct: axis navigation
  • Consists of one or more navigation steps
    separated by /
  • A navigation step is a triplet
  • axis::node-test[list-of-predicates]
  • Each navigation path is evaluated relative to a
    context node
  • Examples
  • /child::bib/descendant::author
  • /descendant::book[child::author="Smith"]
    /child::title
  • Most people use shorthands
  • /bib//author
  • //book[author="Smith"]/title

36
Axis Navigation
  • In the beginning, the context node is the
    document root
  • Dot (.) identifies the context node
  • Some navigation steps
  • /          the root node
  • //         the root node and its descendants
  • ./author   all the children of the context node
    with tagname author; the context node of the
    next step is each of these children
  • .//        the context node and all its descendants; the
    context node of the next step is each of these
    nodes
  • @mother    the value of the attribute named
    mother of the context node
  • ./*        all the children of the context node
  • ..         parent of the context node
  • text()     all the text children of the context node
  • Shortcut: you can omit ./

37
Example
  • <a>
  •   <b>
  •     <c></c>
  •     <b></b>
  •   </b>
  •   <d>
  •     <c></c>
  •   </d>
  •   <b>
  •     <d></d>
  •   </b>
  • </a>

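[Figure: the document as a tree, with nodes numbered level by level. a is node 1; its children are b (2), d (3), and b (4); the children of b (2) are c (5) and b (6); the child of d (3) is c (7); the child of b (4) is d (8)]

/./a or /a         --> 1
/./a/./b or /a/b   --> 2,4
/a/c               --> (no nodes)
/a//c              --> 5,7
//b                --> 2,6,4
//b/c              --> 5
/a/*/c             --> 5,7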
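
These results can be reproduced with the limited XPath subset in Python's xml.etree.ElementTree. In this sketch the n attributes are an assumption added to carry the node numbers from the figure, and paths are written relative to the root element a, because ElementTree has no separate document-root node; /a/b therefore becomes ./b. The helper name select is hypothetical.

```python
import xml.etree.ElementTree as ET

# The n attributes mirror the node numbers in the slide's figure;
# they are not part of the original document.
doc = """
<a n="1">
  <b n="2"><c n="5"/><b n="6"/></b>
  <d n="3"><c n="7"/></d>
  <b n="4"><d n="8"/></b>
</a>
"""
root = ET.fromstring(doc)

def select(path):
    """Evaluate a path relative to <a> and return the node numbers."""
    return [int(e.get("n")) for e in root.findall(path)]

print(select("./b"))     # /a/b   -> [2, 4]
print(select("./c"))     # /a/c   -> []
print(select(".//c"))    # /a//c  -> [5, 7]
print(select(".//b"))    # //b    -> [2, 6, 4]
print(select(".//b/c"))  # //b/c  -> [5]
print(select("./*/c"))   # /a/*/c -> [5, 7]
```

Note that //b returns 2,6,4 rather than 2,4,6 because results come back in document order, not in order of node number.
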
38
Predicates
  • Many variations
  • [10]                      the tenth child node of the context node
  • [last()]                  the last child node of the context node
  • [author]                  true, if the context node has at least
    one child tagged author
  • [author/name]             true, if the XPath ./author/name is
    nonempty
  • [author[name="Smith"]]    true, if the author name is
    Smith
  • Examples
  • /bib/book[@price < 100]/title
  • /bib/book[author/text()]
  • author[name/firstname="John" and
    name/lastname="Smith"]/title
  • /bib/book/author[name/firstname][address[//zip][city]]/name/lastname
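
A few of these predicate forms can be tried with Python's xml.etree.ElementTree, which supports positional predicates, [last()], child-existence tests, and child-text equality, but not comparisons such as price < 100. The bibliography below is hypothetical sample data (only the first title and price come from the slides), price is modeled as an attribute to match the @price example, and the comparison is done in ordinary Python as a workaround.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the first book is taken from the slides.
doc = """<bib>
  <book price="38.68"><title>Unix Network Programming</title><author>Stevens</author></book>
  <book price="120.00"><title>Hypothetical Expensive Book</title><author>Smith</author></book>
  <book><title>Anonymous Notes</title></book>
</bib>"""
bib = ET.fromstring(doc)

def titles(books):
    return [b.findtext("title") for b in books]

print(titles(bib.findall("book[1]")))               # first book child
print(titles(bib.findall("book[last()]")))          # last book child
print(titles(bib.findall("book[author]")))          # books with an author child
print(titles(bib.findall("book[author='Smith']")))  # author text equals Smith

# /bib/book[@price < 100]/title: ElementTree cannot compare numbers
# inside a predicate, so select on attribute presence and filter here.
cheap = [b for b in bib.findall("book[@price]") if float(b.get("price")) < 100]
print(titles(cheap))  # ['Unix Network Programming']
```

A full XPath engine would evaluate the numeric comparison and nested predicates directly; this sketch only shows what each predicate selects.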