Processing XML with Java - PowerPoint PPT Presentation

Slides: 67
Provided by: csHu

1
Processing XML with Java
  • Representation and Management of Data on the
    Internet

2
XML
  • XML is eXtensible Markup Language
  • It is a metalanguage
  • A language used to describe other languages using
    markup tags that describe properties of the
    data
  • Designed to be structured
  • Strict rules about how data can be formatted
  • Designed to be extensible
  • Can define own terms and markup

3
XML Family
  • XML is an official recommendation of the W3C
  • Aims to accomplish what HTML cannot and be
    simpler to use and implement than SGML

[Figure: HTML, XML, SGML]
4
The Essence of XML
  • Syntax: The permitted arrangement or structure of
    letters and words in a language as defined by a
    grammar (XML)
  • Semantics: The meaning of letters or words in a
    language
  • XML uses Syntax to add Semantics to the documents

5
Using XML
  • In XML there is a separation of the content from
    the display
  • XML can be used for
  • Data representation
  • Data exchange

6
Databases and XML
  • Database content can be presented in XML
  • XML processor can access DBMS or file system and
    convert data to XML
  • Web server can serve content as either XML or HTML

7
HTML vs. XML
[Figure: HTML compared with XML]
8
HTML vs. XML
[Figure: HTML compared with XML]
9
Some Things in Common
  • Comments are allowed - <!-- -->
  • Special characters must be escaped (e.g., &gt;
    for >)

10
Processing XML: The Idea
11
Sample Document
    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>

12
DOM Parser
  • DOM: Document Object Model
  • Parser creates a tree object out of the document
  • User accesses data by traversing the tree
  • The API allows for constructing, accessing and
    manipulating the structure and content of XML
    documents
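
The tree-based access described above can be sketched as follows. This is a minimal illustration, assuming the JDK's built-in JAXP parser (DocumentBuilderFactory) rather than any particular DOM implementation; the class name DomTraversalSketch and the method tickers are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: parses a document into a tree, then reads from it.
class DomTraversalSketch {
    static final String SAMPLE =
        "<transaction><account>89-344</account>"
        + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
        + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
        + "</transaction>";

    // Builds the whole tree in memory, then collects "exchange:symbol" pairs.
    static String tickers(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        NodeList list = doc.getElementsByTagName("ticker");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < list.getLength(); i++) {
            Element ticker = (Element) list.item(i);
            if (i > 0) out.append(',');
            out.append(ticker.getAttribute("exch")).append(':')
               .append(ticker.getTextContent());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(tickers(SAMPLE)); // prints NASDAQ:WEBM,NYSE:GE
    }
}
```

Because the tree stays in memory, the same Document could be queried again without re-parsing.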

13
Document as Tree
Methods like getRoot, getChildren, getAttributes, etc.
transaction
    account
        89-344
    buy
        shares = 100
        ticker
            exch = NASDAQ
            WEBM
    sell
        shares = 30
        ticker
            exch = NYSE
            GE
14
Advantages and Disadvantages
  • Advantages
  • Natural and relatively easy to use
  • Can repeatedly traverse tree
  • Disadvantages
  • High memory requirements: the whole document is
    kept in memory
  • Must parse the whole document before use

15
SAX Parser
  • SAX: Simple API for XML
  • Parser creates events while reading the document
  • Parser calls methods (that you write) to deal
    with the events
  • Similar to an I/O stream: goes in one direction

16
Document as Events
    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>
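
To make the "document as events" view concrete, here is a small sketch that records the events a SAX parser reports for a fragment of the sample document. It assumes the JDK's SAXParserFactory; the class name SaxEventTrace and the "start:/text:/end:" labels are invented for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: turns a document into a flat list of event names.
class SaxEventTrace {
    // Returns one token per reported event, in the order the parser fires them.
    static String trace(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                out.append("start:").append(qName).append(' ');
            }
            @Override
            public void characters(char[] ch, int start, int len) {
                String text = new String(ch, start, len).trim();
                if (!text.isEmpty()) out.append("text:").append(text).append(' ');
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("end:").append(qName).append(' ');
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(trace(
            "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"));
    }
}
```

Nothing is retained after each callback returns, which is why this style needs little memory but cannot reread earlier parts of the document.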

17
Advantages and Disadvantages
  • Advantages
  • Requires little memory
  • Fast
  • Disadvantages
  • Cannot reread
  • Less natural for object oriented programmers
    (perhaps)

18
Which should we use? DOM vs. SAX
  • If your document is very large and you only need
    a few elements - use SAX
  • If you need to manipulate (i.e., change) the XML
    - use DOM
  • If you need to access the XML many times - use
    DOM

19
XML Parsers
20
XML Parsers
  • There are several different ways to categorise
    parsers
  • Validating versus non-validating parsers
  • DOM parsers versus SAX parsers
  • Parsers written in a particular language (Java,
    C, Perl, etc.)

21
Validating Parsers
  • A validating parser makes sure that the document
    conforms to the specified DTD
  • This is time consuming, so a non-validating
    parser is faster
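
The difference between the two kinds of parser can be sketched with a document that is well-formed but violates its own DTD. The example assumes the JDK's SAXParserFactory and its setValidating switch; the class name ValidationSketch and the sample DTD are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts validity errors with validation on or off.
class ValidationSketch {
    // The DTD says <note> must contain exactly one <to>, but it contains <from>.
    static final String INVALID =
        "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>"
        + "<!ELEMENT from (#PCDATA)>]>"
        + "<note><from>me</from></note>";

    static int countValidityErrors(String xml, boolean validating) {
        int[] errors = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors[0]++; // validity errors are recoverable
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return errors[0];
    }

    public static void main(String[] args) {
        System.out.println("validating:     " + countValidityErrors(INVALID, true));
        System.out.println("non-validating: " + countValidityErrors(INVALID, false));
    }
}
```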

22
Using an XML Parser
  • Three basic steps
  • Create a parser object
  • Pass the XML document to the parser
  • Process the results
  • Generally, writing out XML is not in the scope of
    parsers (though some may implement proprietary
    mechanisms)

23
SAX: Simple API for XML
24
The SAX Parser
  • SAX parser is an event-driven API
  • An XML document is sent to the SAX parser
  • The XML file is read sequentially
  • The parser notifies the handler class when events happen,
    including errors
  • The events are handled by the API methods that the
    programmer implemented

25
  • Handles document events: start tag, end tag, etc.
  • Used to create a SAX parser
  • Handles parser errors
  • Handles DTDs and entities
26
Problem
  • The SAX interface is an accepted standard
  • There are many implementations
  • We would like to be able to change the implementation
    used without changing any code in the program
  • How is this done?

27
Factory Design Pattern
  • Have a Factory class that creates the actual
    Parsers.
  • The Factory checks the value of a system property
    that states which implementation should be used
  • In order to change the implementation, simply
    change the system property
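
The idea can be sketched independently of SAX. In the toy example below every name (ToyParser, FastToyParser, StrictToyParser, ToyParserFactory and the property "toy.parser.impl") is hypothetical; it only illustrates how a factory can pick an implementation class from a system property.

```java
// Hypothetical factory: the caller never names a concrete implementation.
class ToyParserFactory {
    static final String PROPERTY = "toy.parser.impl"; // made-up property name

    static ToyParser create() {
        // Fall back to a default implementation when the property is not set.
        String className =
            System.getProperty(PROPERTY, FastToyParser.class.getName());
        try {
            return (ToyParser) Class.forName(className)
                                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(create().name());
        // Switching implementations requires no change to the calling code.
        System.setProperty(PROPERTY, StrictToyParser.class.getName());
        System.out.println(create().name());
    }
}

interface ToyParser {
    String name();
}

class FastToyParser implements ToyParser {
    public String name() { return "fast"; }
}

class StrictToyParser implements ToyParser {
    public String name() { return "strict"; }
}
```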

28
Creating a SAX Parser
  • Import the following packages
  • org.xml.sax.*
  • org.xml.sax.helpers.*
  • Set the following system property
  • System.setProperty("org.xml.sax.driver",
    "org.apache.xerces.parsers.SAXParser");
  • Create the instance from the Factory
  • XMLReader reader = XMLReaderFactory.createXMLReader();

29
Receiving Parsing Information
  • A SAX Parser calls methods such as
    startDocument, startElement, etc., as it runs
  • In order to react to such events we must
  • implement the ContentHandler interface
  • set the parser's content handler with an instance
    of our class

30
ContentHandler
  • // Methods (partial list)
  • public void startDocument()
  • public void endDocument()
  • public void characters(char[] ch, int start, int length)
  • public void startElement(String namespaceURI, String localName,
    String qName, Attributes atts)
  • public void endElement(String namespaceURI, String localName,
    String qName)

31
Namespaces and Element Names
    <?xml version='1.0' encoding='utf-8'?>
    <forsale date="12/2/03"
             xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
      <book>
        <title> <xhtml:em> DBI </xhtml:em>
          The Course I Wish I never Took
        </title>
        <comment> My <xhtml:b> favorite </xhtml:b> book!
        </comment>
      </book>
    </forsale>

32
Namespaces and Element Names
For the <book> element: namespaceURI = "", localName = book, qName = book

    <?xml version='1.0' encoding='utf-8'?>
    <forsale date="12/2/03"
             xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
      <book>
        <title> <xhtml:em> DBI </xhtml:em>
          The Course I Wish I never Took
        </title>
        <comment> My <xhtml:b> favorite </xhtml:b> book!
        </comment>
      </book>
    </forsale>

For the <xhtml:em> element: namespaceURI = urn:http://www.w3.org/1999/xhtml,
localName = em, qName = xhtml:em
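
The three names can be observed directly with a namespace-aware parser. The sketch below assumes the JDK's SAXParserFactory with setNamespaceAware(true); the class name NamespaceNames and the bracketed output format are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: reports [namespaceURI|localName|qName] per element.
class NamespaceNames {
    static String describe(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for namespaceURI and localName
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        out.append('[').append(uri).append('|').append(local)
                           .append('|').append(qName).append(']');
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(
            "<forsale xmlns:xhtml=\"urn:http://www.w3.org/1999/xhtml\">"
            + "<book><xhtml:em>DBI</xhtml:em></book></forsale>"));
    }
}
```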
33
Receiving Parsing Information (cont.)
  • An easy way to implement the ContentHandler
    interface is to extend DefaultHandler, which
    implements this interface (and a few others) in
    an empty fashion
  • To actually parse a document, create an
    InputSource from the document and supply the
    input source to the parse method of the XMLReader

34
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class InfoWithSax extends DefaultHandler {

    public static void main(String[] args) {
        // Select the SAX implementation that the factory should create
        System.setProperty("org.xml.sax.driver",
                           "org.apache.xerces.parsers.SAXParser");
        try {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setContentHandler(new InfoWithSax());
            reader.parse(new InputSource(new FileReader(args[0])));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
35
    public void startDocument() throws SAXException {
        System.out.println("START DOCUMENT");
    }

    public void endDocument() throws SAXException {
        System.out.println("END DOCUMENT");
    }

    int depth = 0;
    String indent = "  ";

    // Prints one event, indented by the current element depth
    private void println(String header, String value) {
        for (int i = 0; i < depth; i++)
            System.out.print(indent);
        System.out.println(header + ": " + value);
    }
36
    public void characters(char[] buf, int offset, int len)
            throws SAXException {
        String s = (new String(buf, offset, len)).trim();
        if (!"".equals(s))
            println("CHARACTERS", s);
    }

    public void endElement(String namespaceURI, String localName,
                           String name) throws SAXException {
        depth--;
        String elementName = name;
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("END ELEMENT", elementName);
    }
37
    public void startElement(String namespaceURI, String localName,
                             String name, Attributes attrs)
            throws SAXException {
        String elementName = name;
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("START ELEMENT", elementName);
        if (attrs != null && attrs.getLength() > 0) {
            for (int i = 0; i < attrs.getLength(); i++)
                println("ATTRIBUTE",
                        attrs.getLocalName(i) + "=" + attrs.getValue(i));
        }
        depth++;
    }
}   // end of class InfoWithSax
38
Bachelor Tags
  • What do you think happens when the parser parses
    a bachelor tag?
  • <rating stars="five" />

39
Attributes Interface
  • Elements may have attributes
  • There is no distinction between attributes that
    are defined explicitly and those that are
    specified in the DTD (with a default value)

40
Attributes Interface (cont.)
  • int getLength()
  • String getQName(int i)
  • String getType(int i)
  • String getValue(int i)
  • String getType(String qname)
  • String getValue(String qname)
  • etc.
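
As a sketch of how these methods are typically used, the example below walks the Attributes of the empty element <rating stars="five" /> and also records the end-element event. It assumes the JDK's built-in SAX parser; the class name AttributesSketch, the "scale" attribute and its DTD default are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: lists name=value(type) for every attribute.
class AttributesSketch {
    // "scale" is never written in the element; it only has a default in the DTD.
    static final String XML =
        "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE rating [<!ELEMENT rating EMPTY>"
        + "<!ATTLIST rating stars CDATA #REQUIRED scale NMTOKEN \"ten\">]>"
        + "<rating stars=\"five\" />";

    static String describe(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        for (int i = 0; i < atts.getLength(); i++) {
                            out.append(atts.getQName(i)).append('=')
                               .append(atts.getValue(i)).append('(')
                               .append(atts.getType(i)).append(") ");
                        }
                    }
                    @Override
                    public void endElement(String uri, String local, String qName) {
                        out.append("end:").append(qName);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(XML));
    }
}
```

The empty element still produces both a start and an end event, and the defaulted attribute is reported through the same interface as the explicit one.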

41
Attributes Types
  • The following are possible types for attributes
  • "CDATA",
  • "ID",
  • "IDREF", "IDREFS",
  • "NMTOKEN", "NMTOKENS",
  • "ENTITY", "ENTITIES",
  • "NOTATION"

42
Setting Features
  • It is possible to set the features of a parser
    using the setFeature method.
  • Examples
  • reader.setFeature("http://xml.org/sax/features/namespaces", true)
  • reader.setFeature("http://xml.org/sax/features/validation", false)
  • For a full list, see http://www.saxproject.org/?selected=get-set
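
A minimal sketch of setting and reading back the two features named above, assuming an XMLReader obtained from the JDK's SAXParserFactory rather than from XMLReaderFactory; the class name FeatureSketch is made up.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical example class: sets two standard SAX features and reads them back.
class FeatureSketch {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    static String featureStates() {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true);   // report namespace URIs and local names
            reader.setFeature(VALIDATION, false);  // do not validate against a DTD
            return reader.getFeature(NAMESPACES) + "," + reader.getFeature(VALIDATION);
        } catch (Exception e) {
            throw new IllegalStateException("feature not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(featureStates());
    }
}
```

Features must be set before parse is called; a parser may reject a feature it does not recognize or support.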

43
ErrorHandler Interface
  • We implement ErrorHandler to receive error events
    (similar to implementing ContentHandler)
  • DefaultHandler implements ErrorHandler in an
    empty fashion, so we can extend it (as before)
  • An ErrorHandler is registered with
  • reader.setErrorHandler(handler)
  • Three methods
  • void error(SAXParseException ex)
  • void fatalError(SAXParseException ex)
  • void warning(SAXParseException ex)

44
Extending the InfoWithSax Program
    public void warning(SAXParseException err) throws SAXException {
        System.out.println("Warning in line " + err.getLineNumber() +
                           " and column " + err.getColumnNumber());
    }

    public void error(SAXParseException err) throws SAXException {
        System.out.println("Oy vaavoi, an error!");
    }

    public void fatalError(SAXParseException err) throws SAXException {
        System.out.println("OY VAAVOI, a fatal error!");
    }
Will these methods be called in the case of a
problem?
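
One way to explore this question is the sketch below, which parses a malformed document with and without calling setErrorHandler. It assumes the JDK's built-in SAX parser; the class name ErrorHandlerSketch and its parse helper are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: shows that error callbacks need a registered handler.
class ErrorHandlerSketch {
    // Returns what the handler's fatalError method recorded, if it was called.
    static String parse(String xml, boolean registerHandler) {
        StringBuilder seen = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                seen.append("fatalError at line ").append(e.getLineNumber());
                throw e; // a fatal error ends the parse
            }
        };
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            if (registerHandler) reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // The document is not well-formed; parsing stops in both cases.
        } catch (Exception e) {
            throw new IllegalStateException("unexpected failure", e);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        String malformed = "<a><b></a>";
        System.out.println("registered:   [" + parse(malformed, true) + "]");
        System.out.println("unregistered: [" + parse(malformed, false) + "]");
    }
}
```

Setting only the content handler is not enough: the error methods are invoked only on the object passed to setErrorHandler.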
45
Lexical Events
  • Lexical events have to do with the way that a
    document was written and not with its content
  • Examples
  • A comment is a lexical event (<!-- comment -->)
  • The use of an entity is a lexical event (&gt;)
  • These can be dealt with by implementing the
    LexicalHandler interface, and set on a parser by
  • reader.setProperty("http://xml.org/sax/properties/lexical-handler",
    mylexicalhandler)

46
LexicalHandler
  • // Methods (partial list)
  • public void startEntity(String name)
  • public void endEntity(String name)
  • public void comment(char[] ch, int start, int length)
  • public void startCDATA()
  • public void endCDATA()
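
A minimal sketch of registering a lexical handler and collecting comments. It assumes the JDK's built-in SAX parser and uses org.xml.sax.ext.DefaultHandler2, which provides empty implementations of the LexicalHandler methods; the class name LexicalSketch is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical example class: collects the text of every comment in a document.
class LexicalSketch {
    static String comments(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                out.append(new String(ch, start, length).trim()).append(';');
            }
        };
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Lexical events are delivered only to a handler set via this property.
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(comments("<a><!-- first --><b/><!-- second --></a>"));
    }
}
```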

47
DOM: Document Object Model
48
Creating a DOM Tree
  • How can we create a DOM Tree independently of the
    implementation chosen?
  • Creating a DOM Tree using the Apache Xerces
    package
  • Import org.apache.xerces.parsers.DOMParser
  • Import org.w3c.dom.*
  • Use the following lines of code
  • DOMParser dom = new DOMParser();
    dom.parse(fileName);
  • Document doc = dom.getDocument();
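
One possible answer to the question of implementation independence is the JAXP factory API that ships with the JDK. This is a swapped-in technique, not the Xerces-specific code above, and the class name DomFactorySketch is made up for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class: builds a DOM tree without naming a parser class.
class DomFactorySketch {
    static String rootName(String xml) {
        try {
            // The factory decides which implementation to instantiate.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Shows which implementation the factory picked on this JVM.
        System.out.println(DocumentBuilderFactory.newInstance().getClass().getName());
        System.out.println(rootName("<transaction><account>89-344</account></transaction>"));
    }
}
```

The code refers only to javax.xml.parsers and org.w3c.dom types, so the underlying parser can be replaced through configuration instead of source changes.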

49
Using a DOM Tree
50
Nodes in a DOM Tree
Figure as appears in The XML Companion - Neil Bradley
[Figure: DOM node interfaces; every interface extends Node]
Node
    Document
    DocumentFragment
    DocumentType
    Element
    Attr
    CharacterData
        Text
            CDATASection
        Comment
    Entity
    EntityReference
    Notation
    ProcessingInstruction
51
DOM Tree
Document
52
Normalizing a Tree
  • Normalizing a DOM Tree has two effects
  • Combine adjacent textual nodes
  • Eliminate empty textual nodes
  • To normalize, apply the normalize() method to the
    document element
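
Both effects can be seen in a small sketch that builds a tree by hand. It assumes the JDK's built-in DOM implementation; the class name NormalizeSketch and the sample text are invented for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class: counts child nodes before and after normalize().
class NormalizeSketch {
    static String normalizeDemo() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        Element root = doc.createElement("title");
        doc.appendChild(root);
        // Three adjacent text nodes, one of them empty.
        root.appendChild(doc.createTextNode("Processing "));
        root.appendChild(doc.createTextNode(""));
        root.appendChild(doc.createTextNode("XML"));
        int before = root.getChildNodes().getLength();

        root.normalize(); // merges adjacent text nodes and drops empty ones

        int after = root.getChildNodes().getLength();
        return before + "->" + after + ":" + root.getFirstChild().getNodeValue();
    }

    public static void main(String[] args) {
        System.out.println(normalizeDemo()); // prints 3->1:Processing XML
    }
}
```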

53
Node Methods
  • Three categories of methods
  • Node characteristics: name, type, value
  • Contextual location and access to relatives:
    parents, siblings, children, ancestors,
    descendants
  • Node modification: edit, delete, re-arrange child
    nodes

54
Node Methods (2)
  • short getNodeType()
  • String getNodeName()
  • String getNodeValue() throws DOMException
  • void setNodeValue(String value)
    throws DOMException
  • boolean hasChildNodes()
  • NamedNodeMap getAttributes()
  • Document getOwnerDocument()

55
Node Types - getNodeType()
ELEMENT_NODE = 1
ATTRIBUTE_NODE = 2
TEXT_NODE = 3
CDATA_SECTION_NODE = 4
ENTITY_REFERENCE_NODE = 5
ENTITY_NODE = 6
PROCESSING_INSTRUCTION_NODE = 7
COMMENT_NODE = 8
DOCUMENT_NODE = 9
DOCUMENT_TYPE_NODE = 10
DOCUMENT_FRAGMENT_NODE = 11
NOTATION_NODE = 12

if (myNode.getNodeType() == Node.ELEMENT_NODE) {
    // process node
}
57
Node Navigation
  • Every node has a specific location in tree
  • Node interface specifies methods to find
    surrounding nodes
  • Node getFirstChild()
  • Node getLastChild()
  • Node getNextSibling()
  • Node getPreviousSibling()
  • Node getParentNode()
  • NodeList getChildNodes()
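
The navigation methods above combine naturally into a sibling loop. The sketch below lists the element children of the root while skipping whitespace text nodes; it assumes the JDK's built-in JAXP parser, and the class name NavigationSketch is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example class: walks the children of the root element.
class NavigationSketch {
    static String elementChildren(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        Element root = doc.getDocumentElement();
        StringBuilder names = new StringBuilder();
        // First child, then next sibling, until there are no more siblings.
        for (Node child = root.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) { // skip text nodes
                if (names.length() > 0) names.append(',');
                names.append(child.getNodeName());
            }
        }
        return names + " (parent of last child: "
                + root.getLastChild().getParentNode().getNodeName() + ")";
    }

    public static void main(String[] args) {
        System.out.println(elementChildren(
            "<transaction>\n  <account/>\n  <buy/>\n  <sell/>\n</transaction>"));
    }
}
```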

58
Node Navigation (2)
Figure as from The XML Companion - Neil Bradley
[Figure: a node and its relatives, reached with getParentNode(),
getPreviousSibling(), getNextSibling(), getFirstChild(), getLastChild()
and getChildNodes()]
59
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;

public class InfoWithDom {
    public static void main(String[] args) {
        try {
            DOMParser dom = new DOMParser();
            dom.parse(args[0]);
            Document doc = dom.getDocument();
            new InfoWithDom().echo(doc);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

60
    private int depth = 0;
    private final String indent = "  ";
    // Indexed by the value returned from Node.getNodeType()
    private String[] NODE_TYPES = {
        "", "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA",
        "ENTITY_REF", "ENTITY", "PROCESSING_INST", "COMMENT",
        "DOCUMENT", "DOCUMENT_TYPE", "DOCUMENT_FRAG", "NOTATION" };

    private void outputIndentation() {
        for (int i = 0; i < depth; i++)
            System.out.print(indent);
    }
61
    private void printlnCommon(Node n) {
        System.out.print(NODE_TYPES[n.getNodeType()] + ":");
        System.out.print(" nodeName=" + n.getNodeName());
        String val;
        if ((val = n.getNamespaceURI()) != null)
            System.out.print(" uri=" + val);
        if ((val = n.getPrefix()) != null)
            System.out.print(" pre=" + val);
        if ((val = n.getLocalName()) != null)
            System.out.print(" local=" + val);
        if ((val = n.getNodeValue()) != null && !val.trim().equals(""))
            System.out.print(" nodeValue=" + val);
        System.out.println();
    }
62
    private void echo(Node n) {
        outputIndentation();
        printlnCommon(n);
        if (n.getNodeType() == Node.ELEMENT_NODE) {
            // Attributes are printed two levels deeper than their element
            NamedNodeMap atts = n.getAttributes();
            depth += 2;
            for (int i = 0; i < atts.getLength(); i++)
                echo(atts.item(i));
            depth -= 2;
        }
        depth++;
        for (Node child = n.getFirstChild(); child != null;
             child = child.getNextSibling())
            echo(child);
        depth--;
    }
}   // end of class InfoWithDom
63
Node Manipulation
  • Children of a node in a DOM tree can be
    manipulated - added, edited, deleted, moved,
    copied, etc.

Node removeChild(Node oldChild) throws DOMException
Node insertBefore(Node newChild, Node refChild) throws DOMException
Node appendChild(Node newChild) throws DOMException
Node replaceChild(Node newChild, Node oldChild) throws DOMException
Node cloneNode(boolean deep)
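
A short sketch of these calls applied in sequence. It assumes the JDK's built-in JAXP parser; the class name ManipulationSketch and the element names "account" and "hold" added to the tree are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example class: edits the children of the root element.
class ManipulationSketch {
    static String edit(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        Element root = doc.getDocumentElement();
        Node buy = root.getFirstChild();
        Node sell = root.getLastChild();

        root.insertBefore(doc.createElement("account"), buy); // account, buy, sell
        root.appendChild(buy.cloneNode(true));                // account, buy, sell, buy
        root.removeChild(sell);                               // account, buy, buy
        root.replaceChild(doc.createElement("hold"), buy);    // account, hold, buy

        StringBuilder names = new StringBuilder();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (names.length() > 0) names.append(',');
            names.append(n.getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) {
        // prints account,hold,buy
        System.out.println(edit("<transaction><buy/><sell/></transaction>"));
    }
}
```

New nodes are created through the owning Document, and a deep clone copies the node together with its subtree.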
64
Node Manipulation (2)
Figure as appears in The XML Companion - Neil
Bradley
65
Other Interfaces
  • We have discussed methods of the Node interface
  • Each of the "specific types of nodes" has
    additional methods
  • See API for details

66
Note about DOM Objects
  • A DOM object is, in effect, compiled XML
  • Sending and receiving DOM objects instead of XML
    source can save time and effort
  • Saves having to parse XML files into DOM at
    sender and receiver
  • But, a DOM object may be larger than the XML source