Title: XML
1. CIT 383 Administrative Scripting
2. Topics
- What is XML?
- XML Structure
- REXML
3. eXtensible Markup Language
- Extensible descriptive markup language framework
- Began as a subset of Standard Generalized Markup Language (SGML).
- Designed to ensure that data remains available after the programs that originally created or read it become obsolete or unusable.
    <?xml version="1.0" encoding="UTF-8"?>
    <inventory>
      <book isbn="0976694042">
        <author>Chris Pine</author>
        <title>Learn to Program</title>
      </book>
    </inventory>
4. Descriptive vs Presentational
- Presentational languages describe how documents should look.
- <b>text</b> turns on boldface for text.
- What if you want to change book titles from bold to italics?
- Search-and-replace won't work if items other than books are bold.
- Descriptive languages focus on the meaning.
- <title>XML and You</title>
- Stylesheets describe how to present logical items.
- Can also be used just for data storage and interchange.
- A/K/A logical or structural markup languages.
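The bold-to-italics question above can be sketched in Ruby. This is only an illustration, not a real stylesheet language: the `render` helper and the `bold`/`italic` style tables are hypothetical names introduced here, and the sample document is a cut-down version of the inventory example. Because the markup says what each item is, changing how titles look means changing one table entry rather than searching the data.

```ruby
require 'rexml/document'

# Descriptive markup: the data says what each item is, not how it looks.
xml = '<inventory><book><title>Learn to Program</title>' \
      '<author>Chris Pine</author></book></inventory>'

# Hypothetical helper: apply a presentation template per element name.
# Elements without an entry in the style table are rendered as plain text.
def render(xml, styles)
  doc = REXML::Document.new(xml)
  lines = []
  doc.elements.each('inventory/book/*') do |e|
    template = styles.fetch(e.name, '%s')
    lines << format(template, e.text)
  end
  lines
end

# Two interchangeable "stylesheets" (assumed for illustration).
bold   = { 'title' => '<b>%s</b>' }
italic = { 'title' => '<i>%s</i>' }

puts render(xml, bold)
puts render(xml, italic)
```

Only titles change between the two runs; the author line is unaffected because the style is keyed on the element's meaning.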
5. XML-based Languages
- Ant
- Atom
- CML
- MathML
- MML
- MusicXML
- ODF
- OPML
- RDF
- SAML
- SOAP
- SVG
- VoiceXML
- WML
- XHTML
- XUL
6. Evolution of XML
- 1986 SGML standard published as ISO 8879
- 1987 Unicode proposal published
- 1991 First volume of Unicode standard
- 1996 XML work started
- 1998 XML 1.0 released as a W3C standard
- 2001 XML Schema language
- 2004 XML 1.1 released (not widely used)
- 2007 Unicode 5.0 published
7. XML Tree Structure
    <todo>
      <title>Monday's List</title>
      <item>Study for midterm</item>
      <item>
        <priority>10</priority>
        Scripting Class
      </item>
      <item>Bathe cat</item>
    </todo>
8. Elements and Attributes
- An element consists of tags and contents.
- <title>Learn to Program</title>
- Begin and end tags are mandatory; an empty element may use a single self-closing tag.
- <isbn number="0976694042" />
- Attributes
- number="0976694042"
- Elements may have zero or more attributes.
- Attribute values must always be quoted.
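As a minimal sketch of how elements and attributes look from Ruby, the following uses REXML (covered later in these slides) to read the inventory example. The XML is inlined as a string for convenience; that choice and the variable names are assumptions made for illustration.

```ruby
require 'rexml/document'

xml = <<~XML
  <inventory>
    <book isbn="0976694042">
      <author>Chris Pine</author>
      <title>Learn to Program</title>
    </book>
  </inventory>
XML

doc  = REXML::Document.new(xml)
book = doc.root.elements['book']        # first <book> child of the root element

isbn  = book.attributes['isbn']         # attribute value, always a string
title = book.elements['title'].text     # text content of the <title> element

puts "#{title} (ISBN #{isbn})"
```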
9. Text
- XML declaration specifies character encoding
- <?xml version="1.0" encoding="UTF-8"?>
- Encodings
- Unicode: universal character set; UTF-8, UTF-32
- ISO-8859: 8-bit encodings; 8859-1 covers Western Europe
- Entities
- &#nnnn; encodes the specified Unicode character
- &name; are named character entities, such as
- &lt; is <
- &gt; is >
- &amp; is &
- currency symbols, fractions, Greek letters, math symbols, etc.
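A short sketch of how the entities above behave in practice, assuming REXML as the parser: entity and character references are decoded when text is read, and special characters are escaped again when the element is written out. The `<note>` element is a made-up example.

```ruby
require 'rexml/document'

# Reading: references in the source are decoded in the element's text.
doc = REXML::Document.new('<note>1 &lt; 2 &amp; 3 &gt; 2 &#169;</note>')
decoded = doc.root.text
puts decoded

# Writing: special characters in text are escaped on output.
el = REXML::Element.new('note')
el.text = 'AT&T < IBM'
encoded = el.to_s
puts encoded
```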
10. XML Syntax Rules
- There is one and only one root tag.
- Begin tags must be matched by an end tag.
- XML tags must be properly nested.
- XML tags are case sensitive.
- All attribute values must be quoted.
- Whitespace within tags is part of text.
- Newlines are always stored as LF.
- HTML-style comments: <!-- comment -->
11. Correctness
- Well-formed
- Conforms to XML syntax rules.
- A conforming parser will not parse documents that are not well-formed.
- Valid
- Conforms to XML semantics rules as defined in
- Document Type Definition (DTD)
- XML Schema
- A validating parser will not parse invalid
documents.
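The well-formedness point can be illustrated with a small check. This is a sketch assuming REXML, which reports syntax violations by raising `REXML::ParseException`; the `well_formed?` helper is a hypothetical name. It tests well-formedness only and says nothing about validity against a DTD or schema.

```ruby
require 'rexml/document'

# Hypothetical helper: true if the parser accepts the document.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

good         = '<todo><item>Bathe cat</item></todo>'
bad_nesting  = '<todo><item>Bathe cat</todo>'          # <item> is never closed
bad_case     = '<todo><Item>Bathe cat</item></todo>'   # tags are case sensitive

puts well_formed?(good)
puts well_formed?(bad_nesting)
puts well_formed?(bad_case)
```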
12. XML Schema Languages
    <?xml version="1.0" encoding="utf-8" ?>
    <xs:schema elementFormDefault="qualified"
               xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="Address">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Recipient" type="xs:string" />
            <xs:element name="House" type="xs:string" />
            <xs:element name="Street" type="xs:string" />
            <xs:element name="Town" type="xs:string" />
            <xs:element minOccurs="0" name="County" type="xs:string" />
            <xs:element name="PostCode" type="xs:string" />
            <xs:element name="Country">
              <xs:simpleType>
                <xs:restriction base="xs:string">
                  <xs:enumeration value="FR" />
                  <xs:enumeration value="DE" />
                  <xs:enumeration value="ES" />
                  <xs:enumeration value="UK" />
                  <xs:enumeration value="US" />
                </xs:restriction>
              </xs:simpleType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
- Document Type Definitions (DTD)
- Inherited from SGML.
- Do not support all XML features (for example, namespaces).
- XML Schema
- Most commonly used.
- Schemas are themselves XML documents.
- A/K/A WXS, XSD.
- RELAX NG
- REgular LAnguage for XML Next Generation.
- Available in XML and non-XML (compact) syntaxes.
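To make the Address schema's rules concrete, here is a hand-written Ruby check that mirrors them: the child elements in sequence, County optional, and Country restricted to the enumerated values. This is not schema validation and REXML is not being used as a validating parser here; the `address_errors` helper and its messages are hypothetical, and a real project would more likely hand the XSD to a validating library.

```ruby
require 'rexml/document'

# Element order required by the schema's xs:sequence; County has minOccurs="0".
SEQUENCE  = %w[Recipient House Street Town County PostCode Country].freeze
COUNTRIES = %w[FR DE ES UK US].freeze

# Hypothetical helper: returns a list of rule violations (empty if none).
def address_errors(xml)
  root = REXML::Document.new(xml).root
  errors = []
  errors << 'root element must be Address' unless root.name == 'Address'

  names = root.elements.to_a.map(&:name)
  unless [SEQUENCE, SEQUENCE - ['County']].include?(names)
    errors << 'child elements missing, unexpected, or out of order'
  end

  country = root.elements['Country']
  if country && !COUNTRIES.include?(country.text)
    errors << "Country must be one of #{COUNTRIES.join(', ')}"
  end
  errors
end

valid = '<Address><Recipient>A. Student</Recipient><House>1</House>' \
        '<Street>Main St</Street><Town>Highland Heights</Town>' \
        '<PostCode>41099</PostCode><Country>US</Country></Address>'
invalid = valid.sub('<Country>US</Country>', '<Country>CA</Country>')

puts address_errors(valid).inspect
puts address_errors(invalid).inspect
```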
13. Ruby XML Parsers
- REXML: Ruby Electric XML
- Ships as standard with the Ruby language.
- Slow on large documents.
- libxml-ruby
- Ruby bindings for Gnome libxml2 XML toolkit.
- Very fast (30X as fast as REXML).
- Hpricot
- Parses XML as well as HTML.
- Fast (3-4X as fast as REXML).
- Does not check for well-formedness or validity.
14. Types of Parsing
- Tree Parsing (DOM-like)
- Good for small documents.
- Loads entire document into memory.
- Simple API
- Stream Parsing (SAX-like)
- Good for large documents.
- User defines callback methods, passes to API.
- Parser runs callback methods on pattern match.
15. Tree Parsing
- Loads entire XML doc into memory.
    require 'rexml/document'
    include REXML
    input = File.new('data.xml')
    doc = Document.new(input)
    root = doc.root
- Search document as a tree using XPath
    doc.elements.each('ch/section') do |e|
      puts e.attributes['title']
    end
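The tree-parsing steps above can be tried without a data file. In this sketch the document is an inline string, and its `<ch>`/`<section>` content is made up to match the `ch/section` path used on the slide.

```ruby
require 'rexml/document'

xml = <<~XML
  <ch>
    <section title="What is XML?"/>
    <section title="XML Structure"/>
    <section title="REXML"/>
  </ch>
XML

# Document.new accepts a String as well as an open File.
doc = REXML::Document.new(xml)

# Walk every <section> under <ch> and collect its title attribute.
titles = []
doc.elements.each('ch/section') do |e|
  titles << e.attributes['title']
end

puts titles
```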
16. Stream Parsing
- Define listener class.
    class MyListener
      include REXML::StreamListener
      def tag_start(*args)
        puts "start #{args.map { |x| x.inspect }.join(', ')}"
      end
    end
- Invoke parser
    require 'rexml/document'
    require 'rexml/streamlistener'
    include REXML
    listen = MyListener.new
    source = File.new('data.xml')
    Document.parse_stream(source, listen)
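A self-contained variant of the listener approach, as a sketch: instead of printing, the callbacks record events so the order in which the parser fires them is visible. The `EventRecorder` class and the sample XML string are assumptions for illustration; `tag_start`, `text`, and `tag_end` are callbacks defined by `REXML::StreamListener`.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Records each callback the stream parser fires, in order.
class EventRecorder
  include REXML::StreamListener

  attr_reader :events

  def initialize
    @events = []
  end

  # Called for each start tag; attrs is a Hash of attribute name => value.
  def tag_start(name, attrs)
    @events << [:start, name, attrs]
  end

  # Called for character data; whitespace-only chunks are skipped here.
  def text(data)
    stripped = data.strip
    @events << [:text, stripped] unless stripped.empty?
  end

  # Called for each end tag.
  def tag_end(name)
    @events << [:end, name]
  end
end

recorder = EventRecorder.new
source = '<todo><item priority="10">Scripting Class</item></todo>'
REXML::Document.parse_stream(source, recorder)

recorder.events.each { |event| puts event.inspect }
```

Unlike tree parsing, nothing is kept in memory except what the listener chooses to store, which is why this style suits large documents.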
17. XPath Searches
- doc.search("p")
- Find all paragraph tags in document.
- doc.search("/html/body//p")
- Find all paragraph tags within the body tag.
- doc.search("//a[@src]")
- Find all anchor tags with a src attribute.
- doc.search("//a[@src='google.com']")
- Find all a tags with a src attribute of google.com.
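The searches above use Hpricot's `search` method. As a sketch that runs with only the standard REXML library, the same XPath expressions can be passed to `REXML::XPath.match`; this is a swapped-in technique, not the Hpricot API, and the small well-formed HTML fragment is made up for the example.

```ruby
require 'rexml/document'

html = <<~HTML
  <html><body>
    <p>First</p>
    <div><p>Second</p></div>
    <a src="google.com">Google</a>
    <a src="example.org">Example</a>
    <a>No attribute</a>
  </body></html>
HTML

doc = REXML::Document.new(html)

# All <p> elements anywhere inside <body>.
paragraphs = REXML::XPath.match(doc, '/html/body//p')

# All <a> elements that have a src attribute.
with_src = REXML::XPath.match(doc, '//a[@src]')

# All <a> elements whose src attribute equals google.com.
google = REXML::XPath.match(doc, "//a[@src='google.com']")

puts paragraphs.size
puts with_src.size
puts google.first.text
```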