XML Basics - PowerPoint PPT Presentation

Slides: 98
Provided by: max
Transcript and Presenter's Notes

1
XML Basics
  • Overview

2
What is XML?
  • Extensible Markup Language
  • A syntax for documents
  • A Meta-Markup Language
  • A Structural and Semantic language, not a
    formatting language
  • Not just for Web pages

3
XML is a Meta Markup Language
  • Not like HTML, troff, LaTeX
  • Make up the tags you need as you need them
  • The tags you create can be documented in a
    Document Type Definition (DTD)
  • A meta syntax for domain-specific markup
    languages like MusicML, MathML, and CML

4
XML describes structure and semantics, not
formatting
  • XML documents form a tree
  • Element and attribute names reflect the kind of
    content they hold
  • Formatting can be added with a style sheet

5
A Song Description in HTML
  • <dt>Hot Cop
  • <dd> by Jacques Morali, Henri Belolo, and Victor
    Willis
  • <ul>
  • <li>Producer: Jacques Morali
  • <li>Publisher: PolyGram Records
  • <li>Length: 6:20
  • <li>Written: 1978
  • <li>Artist: Village People
  • </ul>

6
A Song Description in XML
  • <SONG>
  • <TITLE>Hot Cop</TITLE>
  • <COMPOSER>Jacques Morali</COMPOSER>
  • <COMPOSER>Henri Belolo</COMPOSER>
  • <COMPOSER>Victor Willis</COMPOSER>
  • <PRODUCER>Jacques Morali</PRODUCER>
  • <PUBLISHER>PolyGram Records</PUBLISHER>
  • <LENGTH>6:20</LENGTH>
  • <YEAR>1978</YEAR>
  • <ARTIST>Village People</ARTIST>
  • </SONG>
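
Because the element names say what each piece of data means, a program can pull out fields by name rather than by position or formatting. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `describe_song` and the choice of fields are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

SONG_XML = """<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>"""


def describe_song(xml_text):
    """Return a few song fields, looked up by element name."""
    song = ET.fromstring(xml_text)  # the root of the tree is SONG
    return {
        "title": song.findtext("TITLE"),
        # A song may have several COMPOSER children; collect them all.
        "composers": [c.text for c in song.findall("COMPOSER")],
        "year": int(song.findtext("YEAR")),
    }


if __name__ == "__main__":
    print(describe_song(SONG_XML))
```

The HTML version on the previous slide could not be queried this way, since `<li>` says nothing about whether an item is a producer or a publisher.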

7
Style Sheets provide formatting
  • SONG { display: block }
  • TITLE { display: block; font-family: Helvetica,
    serif; font-size: 20pt; font-weight: bold }
  • COMPOSER { display: block;
    font-family: Times, Times New Roman, serif;
    font-size: 14pt; font-style: italic }
  • ARTIST { display: block;
    font-family: Times, Times New Roman, serif;
    font-size: 14pt; font-weight: bold;
    font-style: italic }
  • PUBLISHER { display: block; font-size: 14pt;
    font-family: Times, Times New Roman, serif }
  • LENGTH { display: block;
    font-family: Times, Times New Roman, serif;
    font-size: 14pt }
  • YEAR { display: block;
    font-family: Times, Times New Roman, serif;
    font-size: 14pt }

8
Attaching style sheets to documents
  • Processing Instruction
  • <?xml-stylesheet type="text/css"
    href="song.css"?>
  • Converter Program

9
What is XML used for?
  • Domain-Specific Markup Languages
  • Self-Describing Data
  • Interchange of Data Among Applications
  • Structured and Integrated Data

10
Domain-Specific Markup Languages
  • Non proprietary format
  • Don't pay for what you don't use

11
Self-Describing Data
  • Much data is lost due to format problems
  • XML is very simple
  • XML is self-describing
  • XML is well documented

12
  • <PERSON ID="p1100" SEX="M">
  • <NAME>
  • <GIVEN>Judson</GIVEN>
  • <SURNAME>McDaniel</SURNAME>
  • </NAME>
  • <BIRTH>
  • <DATE>21 Feb 1834</DATE>
  • </BIRTH>
  • <DEATH>
  • <DATE>9 Dec 1905</DATE>
  • </DEATH>
  • </PERSON>
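
Self-describing data can be read without a separate format manual: attributes and nested element names are enough to find each value. This sketch (the helper name `summarize_person` and the output wording are assumptions made for illustration) reads the record above with `xml.etree.ElementTree` path lookups.

```python
import xml.etree.ElementTree as ET

PERSON_XML = """<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH><DATE>21 Feb 1834</DATE></BIRTH>
  <DEATH><DATE>9 Dec 1905</DATE></DEATH>
</PERSON>"""


def summarize_person(xml_text):
    """Build a one-line summary from the PERSON record."""
    person = ET.fromstring(xml_text)
    person_id = person.get("ID")                 # attribute lookup
    given = person.findtext("NAME/GIVEN")        # nested element lookup
    surname = person.findtext("NAME/SURNAME")
    born = person.findtext("BIRTH/DATE")
    died = person.findtext("DEATH/DATE")
    return f"{given} {surname} ({person_id}): born {born}, died {died}"


if __name__ == "__main__":
    print(summarize_person(PERSON_XML))
```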

13
Interchange of Data Among Applications
  • E-commerce
  • Syndication

14
Structured and Integrated Data
  • Can specify relationships between elements
  • Can assemble data from multiple sources

15
XML Applications
  • A specific markup language that uses the XML
    meta-syntax is called an XML application
  • Different XML applications have their own more
    constricted syntaxes and vocabularies within the
    broader XML syntax
  • Further syntax can be layered on top of this
    e.g. data typing through DCDs or other schemas

16
Example XML Applications
  • Web Pages
  • Mathematical Equations
  • Music Notation
  • Vector Graphics
  • Metadata
  • and more

17
Mathematical Markup Language
18
Channel Definition Format
<?xml version="1.0"?>
<CHANNEL HREF="http://metalab.unc.edu/xml/index.html">
  <TITLE>Cafe con Leche</TITLE>
  <ITEM HREF="http://metalab.unc.edu/xml/books.html">
    <TITLE>Books about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://metalab.unc.edu/xml/tradeshows.html">
    <TITLE>Trade shows and conferences about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://metalab.unc.edu/xml/lists.htm">
    <TITLE>Mailing Lists dedicated to XML</TITLE>
  </ITEM>
</CHANNEL>
19
Classic Literature
  • The Complete Plays of Shakespeare
  • The Bible
  • The Koran
  • The Book of Mormon

20
Vector Graphics
  • Vector Markup Language (VML)
  • Internet Explorer 5.0
  • Microsoft Office 2000
  • Scalable Vector Graphics (SVG)

21
The Resource Description Framework (RDF)
  • Meta-data
  • Dublin Core
  • Better Web searching

22
An Example of RDF
  • <rdf:RDF
  • xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  • xmlns:dc="http://purl.org/DC/">
  • <rdf:Description about="http://metalab.unc.edu/xml/">
  • <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
  • <dc:TITLE>Cafe con Leche</dc:TITLE>
  • </rdf:Description>
  • </rdf:RDF>
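
The `rdf:` and `dc:` prefixes are shorthand for the namespace URIs declared on the root element. As a rough sketch of how a program might read such metadata, the code below maps the prefixes back to their URIs when searching with `xml.etree.ElementTree`; the function name `read_metadata` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/DC/"

RDF_XML = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description about="http://metalab.unc.edu/xml/">
    <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
    <dc:TITLE>Cafe con Leche</dc:TITLE>
  </rdf:Description>
</rdf:RDF>"""


def read_metadata(xml_text):
    """Return the resource URL with its Dublin Core creator and title."""
    root = ET.fromstring(xml_text)
    # The prefixes used in the search paths are resolved through this map,
    # so they need not match the prefixes used in the document itself.
    ns = {"rdf": RDF_NS, "dc": DC_NS}
    description = root.find("rdf:Description", ns)
    return {
        "about": description.get("about"),
        "creator": description.findtext("dc:CREATOR", namespaces=ns),
        "title": description.findtext("dc:TITLE", namespaces=ns),
    }


if __name__ == "__main__":
    print(read_metadata(RDF_XML))
```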

23
XML for XML
  • XSL: The Extensible Stylesheet Language
  • DCD: The Document Content Description Schema
    Language
  • XLL: The Extensible Linking Language

24
XSL: The Extensible Stylesheet Language
  • XSL Transformations
  • XSL Formatting Objects

25
DCD: The Document Content Description Schema
Language
  • Data Typing in XML is Weak
  • <MONTH>9</MONTH>
  • <DCD>
  • <ElementDef Type="MONTH"
  • Model="Data" Datatype="i1"
  • Min="1" Max="12" />
  • </DCD>
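
Without a schema, every element's content is just text, so the application has to enforce types itself. The sketch below shows the kind of check a DCD declaration with `Min="1" Max="12"` would make unnecessary; the function name `month_is_valid` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def month_is_valid(xml_text, minimum=1, maximum=12):
    """Check that a MONTH element holds an integer in [minimum, maximum]."""
    text = ET.fromstring(xml_text).text or ""
    try:
        value = int(text.strip())
    except ValueError:
        return False  # not a number at all, e.g. "September"
    return minimum <= value <= maximum


if __name__ == "__main__":
    for sample in ("<MONTH>9</MONTH>", "<MONTH>13</MONTH>",
                   "<MONTH>September</MONTH>"):
        print(sample, month_is_valid(sample))
```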

26
XLL: The Extensible Linking Language
  • Any element can be a link
  • Links can be bi-directional
  • Links can be separated from the documents they
    connect

<footnote xlink:form="simple" href="footnote7.xml">7</footnote>
27
File Formats, In-house applications, and other
behind the scenes uses
  • Microsoft Office 2000
  • Federal Express Web API
  • Netscape What's Related

28
Hello XML
<?xml version="1.0" standalone="yes"?>
<FOO>
Hello XML!
</FOO>
  • Plain ASCII or UTF-8 text
  • .xml is standard file extension
  • Any standard text editor will work

29
The XML Declaration
<?xml version="1.0" standalone="yes"?>
  • version attribute
  • required
  • always has the value 1.0
  • standalone attribute
  • yes
  • no
  • encoding attribute
  • UTF-8
  • ISO-8859-1
  • etc.

30
The FOO element
<FOO> Hello XML! </FOO>
  • Start tag: <FOO>
  • Contents: "Hello XML!"
  • End tag: </FOO>

31
greeting.xml
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>
32
Style sheets
  • Separate from the XML document
  • Different Languages
  • Cascading Style Sheets Level 1 (CSS1)
  • Internet Explorer 5.0
  • Mozilla 5.0
  • Cascading Style Sheets Level 2 (CSS2)
  • Internet Explorer 5 (partial)
  • Mozilla 5.0 (partial)
  • Extensible Stylesheet Language (XSL)
  • Internet Explorer 5.0 (older draft, buggy)
  • LotusXSL, XT, Other non-browser converters
  • Document Style Semantics and Specification
    Language (DSSSL)
  • Jade

33
xml-stylesheet
  • Style sheets are attached via an xml-stylesheet
    processing instruction in the prolog
  • <?xml version="1.0" standalone="yes"?>
  • <?xml-stylesheet type="text/css"
    href="greeting.css"?>
  • <GREETING>Hello XML!</GREETING>
  • type attribute has the value text/css or text/xsl
  • href attribute is a URL to the stylesheet,
    possibly relative
  • Can also use non-browser converters like XT,
    LotusXSL, and Jade
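
A non-browser tool has to find the `xml-stylesheet` processing instruction itself. The following sketch uses `xml.dom.minidom` to locate it in the prolog and a small regular expression to split its `type` and `href` pseudo-attributes; the function name `find_stylesheets` and the regex are assumptions for illustration, and the regex only handles double-quoted values.

```python
import re
from xml.dom import minidom

GREETING_XML = (
    '<?xml version="1.0" standalone="yes"?>'
    '<?xml-stylesheet type="text/css" href="greeting.css"?>'
    "<GREETING>Hello XML!</GREETING>"
)

# Pseudo-attributes look like attributes but are plain text inside the
# processing instruction, so they are split by hand here.
PSEUDO_ATTRIBUTE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def find_stylesheets(xml_text):
    """Return one dict of pseudo-attributes per xml-stylesheet instruction."""
    document = minidom.parseString(xml_text)
    sheets = []
    for node in document.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "xml-stylesheet":
            sheets.append(dict(PSEUDO_ATTRIBUTE.findall(node.data)))
    return sheets


if __name__ == "__main__":
    print(find_stylesheets(GREETING_XML))
```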

34
greeting.css
  • GREETING { display: block;
  • font-size: 24pt;
  • font-weight: bold }

35
A larger example: Baseball statistics
  • Examine the data
  • Design a vocabulary for the data
  • Write a style sheet

36
Sample statistics
  • http://cbs.sportsline.com/u/baseball/mlb/stats.htm

37
Organizing the Data
  • XML documents are trees.
  • XML elements contain other elements as well as
    text
  • Within these limits there's more than one way to
    organize the data
  • Hierarchically
  • Relationally
  • Objects

38
What is the Root Element?
  • The League?
  • The Season?
  • A custom Document element?

39
The Root Element
  • Choose SEASON for the root element
  • Everything else will be a descendant of SEASON
  • This is not the only possible choice

<?xml version="1.0"?>
<SEASON>
</SEASON>
40
What are the Immediate Children of The root?
  • Leagues?
  • Teams?
  • Players?
  • Games?

41
Child Elements
  • <?xml version="1.0"?>
  • <SEASON>
  • <YEAR> 1998 </YEAR>
  • </SEASON>

42
White space in XML is not especially significant
  • <?xml version="1.0"?>
  • <SEASON><YEAR>1998</YEAR></SEASON>
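
In practice a parser still reports the white space it finds; it is the application that decides to ignore it. This sketch (the constants and the function name `year_of` are illustrative assumptions) shows the spread-out and compact forms yielding the same value once the text is trimmed.

```python
import xml.etree.ElementTree as ET

SPACED = """<?xml version="1.0"?>
<SEASON>
  <YEAR>
    1998
  </YEAR>
</SEASON>"""

COMPACT = '<?xml version="1.0"?><SEASON><YEAR>1998</YEAR></SEASON>'


def year_of(xml_text):
    """Return the YEAR text with surrounding white space removed."""
    return ET.fromstring(xml_text).findtext("YEAR").strip()


if __name__ == "__main__":
    # The raw text differs between the two documents; the trimmed text does not.
    print(repr(ET.fromstring(SPACED).findtext("YEAR")))
    print(year_of(SPACED) == year_of(COMPACT))
```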

43
Leagues
  • Major league baseball is divided into two leagues
  • Each league has
  • a name
  • three divisions

44
Divisions
  • Each division has
  • name
  • 4-6 teams

45
Teams
  • Each team has
  • Name
  • City
  • Players

46
Player Data
  • Each player has
  • First name
  • Last name
  • Position
  • Statistics
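
The hierarchy described on the last few slides (season, leagues, divisions, teams) can be sketched by building the tree programmatically. The element names follow the ones used later in this deck; the function name `build_season` and the shape of its input dictionary are assumptions, the sample data is a single illustrative team rather than the full 1998 statistics, and players are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_season(year, leagues):
    """Build SEASON > LEAGUE > DIVISION > TEAM from nested Python data.

    `leagues` maps league name -> {division name -> [(city, team name)]}.
    """
    season = ET.Element("SEASON")
    ET.SubElement(season, "YEAR").text = str(year)
    for league_name, divisions in leagues.items():
        league = ET.SubElement(season, "LEAGUE")
        ET.SubElement(league, "LEAGUE_NAME").text = league_name
        for division_name, teams in divisions.items():
            division = ET.SubElement(league, "DIVISION")
            ET.SubElement(division, "DIVISION_NAME").text = division_name
            for city, name in teams:
                team = ET.SubElement(division, "TEAM")
                ET.SubElement(team, "TEAM_CITY").text = city
                ET.SubElement(team, "TEAM_NAME").text = name
    return season


if __name__ == "__main__":
    sample = {"National League": {"East": [("Atlanta", "Braves")]}}
    print(ET.tostring(build_season(1998, sample), encoding="unicode"))
```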

47
Player Batting Statistics
  • SB: Stolen Bases
  • CS: Caught Stealing
  • SH: Sacrifice Hits
  • SF: Sacrifice Flies
  • Err: Errors
  • PB: Pitcher Balked
  • BB: Base on Balls (Walks)
  • SO: Strike Outs
  • HBP: Hit By Pitch
  • G: Games Played
  • GS: Games Started
  • AB: At Bats
  • R: Runs
  • H: Hits
  • 2B: Doubles
  • 3B: Triples
  • HR: Home Runs
  • RBI: Runs Batted In

48
What does a player look like
  • Long names vs. short names

49
The Complete 1998 Major League
  • Long version

50
A Style Sheet
  • 1998shortstats.xml
  • baseballstats.css
  • <?xml-stylesheet type="text/css"
    href="baseballstats.css"?>
  • styled1998shortstats.xml

51
Cascading Style Sheets
  • Partially supported by Mozilla and IE 5.0
  • Full W3C Recommendation

52
The Default Rule
  • Not every element needs a rule
  • The root element should be at least
    display: block
  • SEASON { font-size: 14pt;
  • background-color: white;
  • color: black;
  • display: block }

53
A style rule for the YEAR element
  • Make it look like a title
  • YEAR { display: block;
  • font-size: 32pt;
  • font-weight: bold;
  • text-align: center }

54
Style Rules for Division and League Names
  • LEAGUE_NAME { display: block;
    text-align: center; font-size: 28pt;
    font-weight: bold }
  • DIVISION_NAME { display: block;
    text-align: center; font-size: 24pt;
    font-weight: bold }

55
Alternate Style Rules for Division and League
Names
  • LEAGUE_NAME, DIVISION_NAME { display: block;
    text-align: center; font-weight: bold }
  • LEAGUE_NAME { font-size: 28pt }
  • DIVISION_NAME { font-size: 24pt }

56
Style Rules for Teams
  • Team name and Team city must be one title
  • Must be inline elements
  • Previous and following must be block elements
  • TEAM_CITY { font-size: 20pt; font-weight: bold;
    font-style: italic }
  • TEAM_NAME { font-size: 20pt; font-weight: bold;
    font-style: italic }
  • TEAM, PLAYER { display: block }

57
Style Rules for Players
TEAM { display: table }
TEAM_CITY { display: table-caption }
TEAM_NAME { display: table-caption }
PLAYER { display: table-row }
SURNAME, GIVEN_NAME, POSITION,
GAMES, GAMES_STARTED, AT_BATS, RUNS, HITS,
DOUBLES, TRIPLES, HOME_RUNS, RBI,
STEALS, CAUGHT_STEALING, SACRIFICE_HITS,
SACRIFICE_FLIES, ERRORS, WALKS, STRUCK_OUT,
HIT_BY_PITCH { display: table-cell }
58
Finished Style Sheet
  • SEASON { font-size: 14pt; background-color: white;
    color: black; display: block }
  • YEAR { display: block; font-size: 32pt;
    font-weight: bold; text-align: center }
  • LEAGUE_NAME { display: block; text-align: center;
    font-size: 28pt; font-weight: bold }
  • DIVISION_NAME { display: block; text-align:
    center; font-size: 24pt; font-weight: bold }
  • TEAM_CITY { font-size: 20pt; font-weight: bold;
    font-style: italic }
  • TEAM_NAME { font-size: 20pt;
    font-weight: bold; font-style: italic }
  • TEAM { display: block }
  • PLAYER { display: block }

59
Possible Extensions
  • There should be captions like "RBI" or "At Bats".
  • Derived numbers like batting averages are not
    included.
  • The titles are short. E.g. "1998" instead of
    "1998 Major League Baseball".
  • The document is so long it's hard to read.
    Something similar to IE5's collapsible outline
    view would be nice.
  • Pitcher stats should be separated from batter
    stats.

60
Possible Solutions
  • CSS Level 2
  • XSL
  • XSL + JavaScript

61
Well-formedness Rules
  • Open and close all tags
  • Empty tags end with />
  • There is a unique root element
  • Elements may not overlap
  • Attribute values are quoted
  • < and & are only used to start tags and entities
  • Only the five predefined entity references are
    used
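
Any conforming XML parser rejects a document that breaks these rules, so the simplest well-formedness check is to try parsing it. A minimal sketch with `xml.etree.ElementTree` follows; the function name `is_well_formed` is an assumption, and the check says nothing about validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "<GREETING>Hello XML!</GREETING>",               # fine
        "<BR>",                                          # tag never closed
        "<A/><B/>",                                      # two root elements
        "<B><I>overlap</B></I>",                         # overlapping elements
        "<A HREF=http://metalab.unc.edu/xml/>link</A>",  # unquoted attribute
        "<H1>O'Reilly & Associates</H1>",                # bare ampersand
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```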

62
Open and close all tags
63
Empty tags end with />
  • <BR/>, <HR/>, and <IMG/> instead of <BR>, <HR>,
    and <IMG>
  • Web browsers deal inconsistently with these
  • Can use <BR></BR> <HR></HR> <IMG></IMG> instead

64
There is a unique root element
  • One element completely contains all other
    elements of the document
  • This is HTML in HTML files
  • XML Declaration is not an element

<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>
65
Elements may not overlap
  • If an element contains a start tag for an
    element, it must also contain the corresponding
    end tag
  • Empty elements may appear anywhere
  • Every non root element has a parent element

66
Attribute values are quoted
  • Good:
  • <A HREF="http://metalab.unc.edu/xml/">
  • Bad:
  • <A HREF=http://metalab.unc.edu/xml/>

67
< and & are only used to start tags and entities
  • Good: <H1>O'Reilly &amp; Associates</H1>
  • Bad: <H1>O'Reilly & Associates</H1>
  • Good:
  • <CODE>for (int i = 0; i &lt; args.length; i++)
    </CODE>
  • Bad:
  • <CODE>for (int i = 0; i < args.length; i++)
    </CODE>

68
Only the five predefined entity references are
used
  • Good:
  • &amp;
  • &lt;
  • &gt;
  • &quot;
  • &apos;
  • Bad:
  • &copy;
  • &reg;
  • &tm;
  • &alpha;
  • &eacute;
  • &nbsp;
  • etc.
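
When generating XML from a program, it is safer to let a library insert the predefined entity references than to type them by hand. The sketch below uses `xml.sax.saxutils.escape`, which replaces `&`, `<`, and `>`; the helper name `wrap_text` is a hypothetical choice. It also shows that an HTML-style reference such as `&copy;` is rejected when no DTD declares it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_text(tag, text):
    """Wrap text in an element, escaping markup characters on the way."""
    return f"<{tag}>{escape(text)}</{tag}>"


if __name__ == "__main__":
    heading = wrap_text("H1", "O'Reilly & Associates")
    print(heading)
    # Parsing turns the entity reference back into the original character.
    print(ET.fromstring(heading).text)
    try:
        ET.fromstring("<P>&copy; 1999</P>")
    except ET.ParseError as error:
        print("rejected:", error)
```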

69
DTDs and Validity
  • A Document Type Definition describes the elements
    and attributes that may appear in a document
  • Validation compares a particular document against
    a DTD
  • Well-formedness is a prerequisite for validity

70
What is a DTD?
  • a list of the elements, tags, attributes, and
    entities contained in a document, and their
    relationship to each other
  • internal vs. external DTDs

71
The importance of validation
  • Ensures that data is correct before feeding it
    into a program
  • Ensure that a format is followed
  • Establish what must be supported
  • Not all documents need to be valid; sometimes
    well-formed is enough

72
A DTD for greeting.xml
  • greeting.xml
  • <?xml version="1.0"?>
  • <GREETING>
  • Hello XML!
  • </GREETING>
  • greeting.dtd
  • <!ELEMENT GREETING (#PCDATA)>

73
Document Type Declarations
  • <?xml version="1.0"?>
  • <!DOCTYPE GREETING SYSTEM "greeting.dtd">
  • <GREETING>
  • Hello XML!
  • </GREETING>
  • specifies the root element
  • gives a URL for the DTD

74
Invalid Documents
  • Valid:
  • <GREETING>
  • various random text but no markup
  • </GREETING>
  • Invalid: anything else, including
  • <GREETING>
  • <sometag>various random text</sometag>
  • <someEmptyTag/>
  • </GREETING>
  • or
  • <GREETING>
  • <GREETING>various random text</GREETING>
  • </GREETING>
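
Python's standard library parsers check well-formedness but do not validate against a DTD, so the following is only a hand-written sketch of the single rule `<!ELEMENT GREETING (#PCDATA)>`: the root must be GREETING and must contain text only. The function name `is_valid_greeting` is an assumption; a real validating parser would read the rule from greeting.dtd instead.

```python
import xml.etree.ElementTree as ET


def is_valid_greeting(xml_text):
    """Hand-coded check for <!ELEMENT GREETING (#PCDATA)>."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed, so it cannot be valid
    # len(root) counts child elements; (#PCDATA) allows none.
    return root.tag == "GREETING" and len(root) == 0


if __name__ == "__main__":
    samples = [
        "<GREETING>various random text but no markup</GREETING>",
        "<GREETING><sometag>various random text</sometag>"
        "<someEmptyTag/></GREETING>",
        "<GREETING><GREETING>various random text</GREETING></GREETING>",
    ]
    for sample in samples:
        print(is_valid_greeting(sample), sample)
```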

75
Validating Tools
  • Command line programs like XJParse
  • Online validators
  • http://www.stg.brown.edu/service/xmlvalid/
  • http://www.cogsci.ed.ac.uk/%7Erichard/xml-check.html
  • Browsers

76
Element Declarations
  • Each tag must be declared in a <!ELEMENT>
    declaration.
  • A <!ELEMENT> declaration gives the name and
    content model of the element
  • The content model uses a simple regular
    expression-like grammar to precisely specify what
    is and isn't allowed in an element

77
Content Specifications
  • ANY
  • #PCDATA
  • Sequences
  • Choices
  • Mixed Content
  • Modifiers
  • Empty

78
ANY
  • <!ELEMENT SEASON ANY>
  • A SEASON can contain any child element and/or raw
    text (parsed character data)

79
#PCDATA
  • <!ELEMENT YEAR (#PCDATA)>
  • Parsed Character Data, i.e. raw text, no markup

80
#PCDATA
  • Invalid:
  • <YEAR>
  • <MONTH>January</MONTH>
  • <MONTH>February</MONTH>
  • <MONTH>March</MONTH>
  • <MONTH>April</MONTH>
  • <MONTH>May</MONTH>
  • <MONTH>June</MONTH>
  • <MONTH>July</MONTH>
  • <MONTH>August</MONTH>
  • <MONTH>September</MONTH>
  • <MONTH>October</MONTH>
  • <MONTH>November</MONTH>
  • <MONTH>December</MONTH>
  • </YEAR>
  • Valid:
  • <YEAR>1999</YEAR>
  • <YEAR>99</YEAR>
  • <YEAR>1999 C.E.</YEAR>
  • <YEAR>
  • The year of our Lord one thousand, nine hundred,
    and ninety-nine
  • </YEAR>

81
Child Elements
  • To declare that a LEAGUE element must have a
    LEAGUE_NAME child
  • <!ELEMENT LEAGUE (LEAGUE_NAME)>
  • <!ELEMENT LEAGUE_NAME (#PCDATA)>

82
Sequences
  • Separate multiple required child elements with
    commas e.g.
  • <!ELEMENT SEASON (YEAR, LEAGUE, LEAGUE)>
  • <!ELEMENT LEAGUE (LEAGUE_NAME, DIVISION,
    DIVISION, DIVISION)>

83
One or More Children
  • <!ELEMENT DIVISION_NAME (#PCDATA)>
  • <!ELEMENT DIVISION (DIVISION_NAME, TEAM+)>

84
Zero or More Children
  • <!ELEMENT TEAM (TEAM_CITY, TEAM_NAME, PLAYER*)>
  • <!ELEMENT TEAM_CITY (#PCDATA)>
  • <!ELEMENT TEAM_NAME (#PCDATA)>

85
Zero or One Children ?
  • <!ELEMENT PLAYER (GIVEN_NAME, SURNAME, POSITION,
    GAMES, GAMES_STARTED, AT_BATS?, RUNS?, HITS?,
    DOUBLES?, TRIPLES?, HOME_RUNS?, RBI?, STEALS?,
    CAUGHT_STEALING?, SACRIFICE_HITS?,
    SACRIFICE_FLIES?, ERRORS?, WALKS?, STRUCK_OUT?,
    HIT_BY_PITCH?, WINS?, LOSSES?, SAVES?,
    COMPLETE_GAMES?, SHUT_OUTS?, ERA?, INNINGS?,
    EARNED_RUNS?, HIT_BATTER?, WILD_PITCHES?,
    BALK?, WALKED_BATTER?, STRUCK_OUT_BATTER?)
  • >

86
Finished DTD
87
Choices
  • <!ELEMENT PAYMENT (CASH | CREDIT_CARD)>
  • <!ELEMENT PAYMENT (CASH | CREDIT_CARD | CHECK)>

88
Grouping With Parentheses
  • Parentheses combine several elements into a
    single element.
  • Parenthesized element can be nested inside other
    parentheses in place of a single element.
  • The parenthesized element can be suffixed with a
    plus sign, an asterisk, or a question mark.
  • <!ELEMENT dl (dt, dd)*>
  • <!ELEMENT ARTICLE (TITLE, (P | PHOTO | GRAPH |
    SIDEBAR | PULLQUOTE | SUBHEAD)*, BYLINE?)>
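
Content models built from sequences (`,`), choices (`|`), grouping, and the `?`, `*`, `+` suffixes behave much like regular expressions over the list of child element names. The sketch below makes that analogy concrete by translating a model into a Python regex. The function names `content_model_to_regex` and `children_match` are assumptions, the translation ignores `#PCDATA`, `ANY`, and `EMPTY`, and it is not how a validating parser is actually implemented.

```python
import re

# A token is an element name, a comma, or a run of white space.
# Everything else (parentheses, |, ?, *, +) already means the same in a regex.
_TOKEN = re.compile(r"[A-Za-z_][\w.-]*|,|\s+")


def content_model_to_regex(model):
    """Translate a DTD content model of child elements into a regex."""
    def convert(match):
        token = match.group(0)
        if token == "," or token.isspace():
            return ""  # a sequence is plain concatenation in a regex
        # Each child is encoded as its name followed by one space.
        return f"(?:{re.escape(token)} )"
    return re.compile(_TOKEN.sub(convert, model))


def children_match(model, child_names):
    """Check a list of child element names against a content model."""
    encoded = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(encoded) is not None


if __name__ == "__main__":
    print(children_match("(YEAR, LEAGUE, LEAGUE)", ["YEAR", "LEAGUE", "LEAGUE"]))
    print(children_match("(DIVISION_NAME, TEAM+)", ["DIVISION_NAME"]))
    print(children_match("(CASH | CREDIT_CARD)", ["CASH"]))
```

Appending a space to every name keeps LEAGUE from matching the start of LEAGUE_NAME.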

89
Mixed Content
  • Both #PCDATA and child elements in a choice
  • <!ELEMENT TEAM (#PCDATA | TEAM_CITY | TEAM_NAME |
    PLAYER)*>
  • #PCDATA must come first
  • #PCDATA cannot be used in a sequence

90
Empty elements
  • <!ELEMENT BR EMPTY>
  • <!ELEMENT IMG EMPTY>
  • <!ELEMENT HR EMPTY>

91
Internal DTDs
  • <?xml version="1.0"?>
  • <!DOCTYPE GREETING [
  • <!ELEMENT GREETING (#PCDATA)>
  • ]>
  • <GREETING>
  • Hello XML!
  • </GREETING>

92
Internal DTD Subsets
  • <?xml version="1.0"?>
  • <!DOCTYPE GREETING SYSTEM "greeting.dtd" [
  • <!ELEMENT GREETING (#PCDATA)>
  • ]>
  • <GREETING>
  • Hello XML!
  • </GREETING>
  • Internal declarations override external
    declarations

93
Programming with XML
  • Java works best
  • C, Perl, Python etc. can also be used
  • Unicode support is the biggest issue

94
SAX, the Simple API for XML
  • Event based
  • Programs can plug in different parsers

95
The Document Object Model (DOM)
96
To Learn More Books
  • XML: Extensible Markup Language
  • IDG Books 1998
  • ISBN 0-76453-199-9
  • The XML Bible
  • IDG Books 1999
  • ISBN 0-76453-236-7

97
Questions?