Processing XML with Java

Slides: 67

Provided by: csHu

1
Processing XML with Java

Representation and Management of Data on the
Internet

2
XML

XML is the eXtensible Markup Language
It is a metalanguage
A language used to describe other languages, using
'markup' tags that describe properties of the
data
Designed to be structured
Strict rules about how data can be formatted
Designed to be extensible
You can define your own terms and markup

3
XML Family

XML is an official recommendation of the W3C
Aims to accomplish what HTML cannot and be
simpler to use and implement than SGML

[Figure: HTML, XML and SGML]

4
The Essence of XML

Syntax: the permitted arrangement or structure of
letters and words in a language, as defined by a
grammar (XML)
Semantics: the meaning of letters or words in a
language
XML uses syntax to add semantics to documents

5
Using XML

In XML there is a separation of the content from
the display
XML can be used for
Data representation
Data exchange

6
Databases and XML

Database content can be presented in XML
XML processor can access DBMS or file system and
convert data to XML
Web server can serve content as either XML or HTML

7
HTML vs. XML
[Figure: HTML and XML compared; figure content not captured in the transcript]
8
HTML vs. XML
[Figure: HTML and XML compared; figure content not captured in the transcript]
9
Some Things in Common

Comments are allowed: <!-- -->
Special characters must be escaped (e.g., &gt;
for >)

10
Processing XML: The Idea
11
Sample Document

<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>

12
DOM Parser

DOM: Document Object Model
Parser creates a tree object out of the document
User accesses data by traversing the tree
The API allows for constructing, accessing and
manipulating the structure and content of XML
documents

13
Document as Tree
Methods like getRoot, getChildren, getAttributes, etc.
transaction
  account: 89-344
  buy (shares = 100)
    ticker (exch = NASDAQ): WEBM
  sell (shares = 30)
    ticker (exch = NYSE): GE
14
Advantages and Disadvantages

Advantages
Natural and relatively easy to use
Can repeatedly traverse tree
Disadvantages
High memory requirements: the whole document is
kept in memory
Must parse the whole document before use

15
SAX Parser

SAX: Simple API for XML
Parser generates events while reading the document
Parser calls methods (that you write) to deal
with the events
Similar to an I/O stream: it goes in one direction

16
Document as Events

<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>
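
The event view of this document can be made concrete with a handler that logs each callback. This is a minimal sketch, assuming the JAXP parser bundled with the JDK (javax.xml.parsers) rather than any particular SAX driver; the class name SaxEventsSketch and the log format are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the SAX events for a document in arrival order.
final class SaxEventsSketch {
    static final String SAMPLE =
        "<transaction><account>89-344</account>"
        + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
        + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
        + "</transaction>";

    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text run in several calls.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        events(SAMPLE).forEach(System.out::println);
    }
}
```

Each element produces a start event and a matching end event, in document order, and nothing is kept once the callback returns.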

17
Advantages and Disadvantages

Advantages
Requires little memory
Fast
Disadvantages
Cannot reread
Less natural for object oriented programmers
(perhaps)

18
Which should we use? DOM vs. SAX

If your document is very large and you only need
a few elements - use SAX
If you need to manipulate (i.e., change) the XML
- use DOM
If you need to access the XML many times - use
DOM

19
XML Parsers
20
XML Parsers

There are several different ways to categorise
parsers
Validating versus non-validating parsers
DOM parsers versus SAX parsers
Parsers written in a particular language (Java,
C, Perl, etc.)

21
Validating Parsers

A validating parser makes sure that the document
conforms to the specified DTD
This is time consuming, so a non-validating
parser is faster
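
As a rough illustration of validation, the sketch below switches on DTD validation and collects the validity errors that the parser reports. It assumes the JDK's built-in JAXP parser and a DTD embedded in the document itself; the class name ValidationSketch and the one-element DTD are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: returns the validity errors found in a document.
final class ValidationSketch {
    // Illustrative DTD: an account element that may contain only text.
    static final String DTD = "<!DOCTYPE account [<!ELEMENT account (#PCDATA)>]>";

    static List<String> validationErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // ask for a validating parser
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Validity violations are reported as recoverable errors.
                    errors.add(e.getMessage());
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validationErrors(DTD + "<account>89-344</account>"));
        System.out.println(validationErrors(DTD + "<account><ticker>GE</ticker></account>"));
    }
}
```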

22
Using an XML Parser

Three basic steps
Create a parser object
Pass the XML document to the parser
Process the results
Generally, writing out XML is not in the scope of
parsers (though some may implement proprietary
mechanisms)

23
SAX: Simple API for XML
24
The SAX Parser

SAX parser is an event-driven API
An XML document is sent to the SAX parser
The XML file is read sequentially
The parser notifies the class when events happen,
including errors
The events are handled by the API methods that
the programmer has implemented

25
Handles document events: start tag, end tag, etc.
Used to create a SAX Parser
Handles Parser Errors
Handles DTDs and Entities
26
Problem

The SAX interface is an accepted standard
There are many implementations
We would like to be able to change the implementation
used without changing any code in the program
How is this done?

27
Factory Design Pattern

Have a Factory class that creates the actual
Parsers.
The Factory checks the value of a system property
that states which implementation should be used
In order to change the implementation, simply
change the system property

28
Creating a SAX Parser

Import the following packages
org.xml.sax.*
org.xml.sax.helpers.*
Set the following system property
System.setProperty("org.xml.sax.driver",
"org.apache.xerces.parsers.SAXParser");
Create the instance from the Factory
XMLReader reader = XMLReaderFactory.createXMLReader();
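
The same factory idea is also available through JAXP, which ships with the JDK. The sketch below is an alternative to the XMLReaderFactory route above, not the slide's own code: SAXParserFactory.newInstance() picks an implementation (it can be overridden with the javax.xml.parsers.SAXParserFactory system property), so the calling code never names a concrete parser class. The class name ReaderFactorySketch is illustrative.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper: obtains an XMLReader without naming an implementation.
final class ReaderFactorySketch {
    static XMLReader newReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // report namespace URIs and local names
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("no usable SAX implementation", e);
        }
    }

    public static void main(String[] args) {
        // Prints whichever implementation class the factory selected.
        System.out.println(newReader().getClass().getName());
    }
}
```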

29
Receiving Parsing Information

A SAX Parser calls methods such as
startDocument, startElement, etc., as it runs
In order to react to such events we must
implement the ContentHandler interface
set the parser's content handler with an instance
of our class

30
ContentHandler

// Methods (partial list)
public void startDocument()
public void endDocument()
public void characters(char[] ch, int start, int
length)
public void startElement(String namespaceURI,
String localName, String qName,
Attributes atts)
public void endElement(String namespaceURI,
String localName, String qName)

31
Namespaces and Element Names

<?xml version='1.0' encoding='utf-8'?>
<forsale date="12/2/03"
         xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
  <book>
    <title> <xhtml:em> DBI </xhtml:em>
      The Course I Wish I never Took
    </title>
    <comment> My <xhtml:b> favorite </xhtml:b>
      book!
    </comment>
  </book>
</forsale>

32
Namespaces and Element Names
namespaceURI = "", localName = "book", qName = "book"

<?xml version='1.0' encoding='utf-8'?>
<forsale date="12/2/03"
         xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
  <book>
    <title> <xhtml:em> DBI </xhtml:em>
      The Course I Wish I never Took
    </title>
    <comment> My <xhtml:b> favorite </xhtml:b>
      book!
    </comment>
  </book>
</forsale>

namespaceURI = "urn:http://www.w3.org/1999/xhtml", localName = "em", qName = "xhtml:em"
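
To see these three values for every element, a handler can simply record the arguments of startElement. This is a sketch assuming a namespace-aware JAXP parser from the JDK; the class name NamespaceNamesSketch and the output format are illustrative. Whether qName is filled in when namespace processing is on can depend on the parser's settings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: lists the name parts reported for each start tag.
final class NamespaceNamesSketch {
    static List<String> names(String xml) {
        List<String> out = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for URI and local name
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    out.add("uri=" + uri + " local=" + local + " qName=" + qName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<book xmlns:xhtml=\"urn:http://www.w3.org/1999/xhtml\">"
                   + "<xhtml:em>DBI</xhtml:em></book>";
        names(xml).forEach(System.out::println);
    }
}
```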
33
Receiving Parsing Information (cont.)

An easy way to implement the ContentHandler
interface is to extend DefaultHandler, which
implements this interface (and a few others) in
an empty fashion
To actually parse a document, create an
InputSource from the document and supply the
input source to the parse method of the XMLReader

34
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class InfoWithSax extends DefaultHandler {

    public static void main(String[] args) {
        // Select the SAX implementation that the factory should create
        System.setProperty("org.xml.sax.driver",
                           "org.apache.xerces.parsers.SAXParser");
        try {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setContentHandler(new InfoWithSax());
            reader.parse(new InputSource(new FileReader(args[0])));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
35
    public void startDocument() throws SAXException {
        System.out.println("START DOCUMENT");
    }

    public void endDocument() throws SAXException {
        System.out.println("END DOCUMENT");
    }

    private int depth = 0;
    private String indent = "  ";

    // Prints one event line, indented by the current element depth
    private void println(String header, String value) {
        for (int i = 0; i < depth; i++)
            System.out.print(indent);
        System.out.println(header + ": " + value);
    }
36
    public void characters(char[] buf, int offset, int len)
            throws SAXException {
        String s = (new String(buf, offset, len)).trim();
        if (!"".equals(s))
            println("CHARACTERS", s);
    }

    public void endElement(String namespaceURI, String localName,
                           String name) throws SAXException {
        depth--;
        String elementName = name;
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("END ELEMENT", elementName);
    }
37
    public void startElement(String namespaceURI, String localName,
                             String name, Attributes attrs)
            throws SAXException {
        String elementName = name;
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("START ELEMENT", elementName);
        if (attrs != null && attrs.getLength() > 0)
            for (int i = 0; i < attrs.getLength(); i++)
                println("ATTRIBUTE",
                        attrs.getLocalName(i) + "=" + attrs.getValue(i));
        depth++;
    }
}
38
Bachelor Tags

What do you think happens when the parser parses
a bachelor tag?
<rating stars="five" />
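
One way to answer the question is to log the callbacks for a document that consists only of a bachelor tag. The sketch below assumes the JDK's built-in JAXP parser; the class name EmptyElementSketch is illustrative. SAX reports an empty element as a start event immediately followed by an end event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records start and end events only.
final class EmptyElementSketch {
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    log.add("start " + qName);
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    log.add("end " + qName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(events("<rating stars=\"five\" />"));
    }
}
```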

39
Attributes Interface

Elements may have attributes
There is no distinction between attributes that
are given explicitly in the document and those that
are supplied by the DTD (with a default value)

40
Attributes Interface (cont.)

int getLength()
String getQName(int i)
String getType(int i)
String getValue(int i)
String getType(String qname)
String getValue(String qname)
etc.
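
The Attributes methods can be tried out on an element that has one explicit attribute and one attribute defaulted by the DTD. This is a sketch, assuming the JDK's JAXP parser and a small DTD invented for the example (the exch default of "NYSE" is not from the slides); the class name AttributesSketch is illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: describes every attribute the parser reports.
final class AttributesSketch {
    // Illustrative document: shares is written explicitly, exch comes from the DTD.
    static final String SAMPLE =
        "<!DOCTYPE buy [<!ELEMENT buy EMPTY>"
        + "<!ATTLIST buy shares CDATA #REQUIRED exch CDATA \"NYSE\">]>"
        + "<buy shares=\"100\"/>";

    static List<String> describe(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    for (int i = 0; i < atts.getLength(); i++) {
                        out.add(atts.getQName(i) + " " + atts.getType(i) + " " + atts.getValue(i));
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return out;
    }

    public static void main(String[] args) {
        describe(SAMPLE).forEach(System.out::println);
    }
}
```

Both attributes arrive through the same interface, and nothing in these calls marks exch as a defaulted value.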

41
Attribute Types

The following are the possible types for attributes:
"CDATA",
"ID",
"IDREF", "IDREFS",
"NMTOKEN", "NMTOKENS",
"ENTITY", "ENTITIES",
"NOTATION"

42
Setting Features

It is possible to set the features of a parser
using the setFeature method.
Examples
reader.setFeature("http://xml.org/sax/features/namespaces", true);
reader.setFeature("http://xml.org/sax/features/validation", false);
For a full list, see http://www.saxproject.org/?selected=get-set
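
A feature that has been set can be read back with getFeature, which is a handy way to check what a given parser actually accepted. The sketch below assumes the JDK's JAXP parser; the class name FeatureSketch is illustrative, and a parser is free to reject a feature by throwing a SAXException.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper: sets two standard SAX features and reads them back.
final class FeatureSketch {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    static boolean[] configure() {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true);
            reader.setFeature(VALIDATION, false);
            return new boolean[] { reader.getFeature(NAMESPACES), reader.getFeature(VALIDATION) };
        } catch (ParserConfigurationException | SAXException e) {
            // Unknown or unsupported features are reported as SAXException subclasses.
            throw new IllegalStateException("feature not accepted", e);
        }
    }

    public static void main(String[] args) {
        boolean[] state = configure();
        System.out.println("namespaces=" + state[0] + " validation=" + state[1]);
    }
}
```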

43
ErrorHandler Interface

We implement ErrorHandler to receive error events
(similar to implementing ContentHandler)
DefaultHandler implements ErrorHandler in an
empty fashion, so we can extend it (as before)
An ErrorHandler is registered with
reader.setErrorHandler(handler)
Three methods
void error(SAXParseException ex)
void fatalError(SAXParseException ex)
void warning(SAXParseException ex)

44
Extending the InfoWithSax Program
public void warning(SAXParseException err) throws SAXException {
    System.out.println("Warning in line " + err.getLineNumber()
                       + " and column " + err.getColumnNumber());
}

public void error(SAXParseException err) throws SAXException {
    System.out.println("Oy vaavoi, an error!");
}

public void fatalError(SAXParseException err) throws SAXException {
    System.out.println("OY VAAVOI, a fatal error!");
}
Will these methods be called in the case of a
problem?
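
The question can be explored by parsing a malformed document twice, once with the handler registered through setErrorHandler and once without. This is a sketch assuming the JDK's JAXP parser; the class name ErrorHandlerSketch and the recorded strings are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records which error callbacks were invoked.
final class ErrorHandlerSketch {
    static List<String> parseAndCollect(String xml, boolean register) {
        List<String> calls = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void warning(SAXParseException e) { calls.add("warning"); }
            @Override public void error(SAXParseException e) { calls.add("error"); }
            @Override public void fatalError(SAXParseException e) {
                calls.add("fatalError at line " + e.getLineNumber());
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            if (register) {
                reader.setErrorHandler(handler); // without this, the callbacks are never used
            }
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            calls.add("exception"); // parsing stopped with a SAXException
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(parseAndCollect("<a><b></a>", true));
        System.out.println(parseAndCollect("<a><b></a>", false));
    }
}
```

Extending DefaultHandler is not enough on its own: the object has to be registered as the error handler as well as the content handler.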
45
Lexical Events

Lexical events have to do with the way that a
document was written and not with its content
Examples
A comment is a lexical event (<!-- comment -->)
The use of an entity is a lexical event (&gt;)
These can be dealt with by implementing the
LexicalHandler interface, and set on a parser by
reader.setProperty("http://xml.org/sax/properties/lexical-handler",
mylexicalhandler);

46
LexicalHandler

// Methods (partial list)
public void startEntity(String name)
public void endEntity(String name)
public void comment(char[] ch, int start,
int length)
public void startCDATA()
public void endCDATA()
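
A short sketch of a lexical handler in use: it collects the comments in a document. It assumes the JDK's JAXP parser, which accepts the lexical-handler property, and extends org.xml.sax.ext.DefaultHandler2, a convenience class that implements LexicalHandler with empty methods. The class name LexicalSketch is illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: returns the text of every comment in a document.
final class LexicalSketch {
    static List<String> comments(String xml) {
        List<String> found = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                found.add(new String(ch, start, length).trim());
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Lexical handlers are registered as a property, not with a setter.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(comments("<a><!-- hello --><b/></a>"));
    }
}
```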

47
DOM: Document Object Model
48
Creating a DOM Tree

How can we create a DOM Tree independently of the
implementation chosen?
Creating a DOM Tree using the Apache Xerces
package
Import org.apache.xerces.parsers.DOMParser
Import org.w3c.dom.*
Use the following lines of code
DOMParser dom = new DOMParser();
dom.parse(fileName);
Document doc = dom.getDocument();
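
One possible answer to the implementation-independence question is JAXP's DocumentBuilderFactory, which is part of the JDK and hides the concrete parser class. The sketch below is an alternative to the Xerces-specific lines above, not the slide's own code; the class name DomFactorySketch is illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: builds a DOM tree without naming a parser implementation.
final class DomFactorySketch {
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not build DOM tree", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<transaction><account>89-344</account></transaction>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```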

49
Using a DOM Tree
50
Nodes in a DOM Tree
Figure as appears in The XML Companion - Neil
Bradley
[Figure: the Node interface and its subinterfaces Document, DocumentFragment, DocumentType, Element, Attr, CharacterData (with Text, CDATASection and Comment beneath it), Entity, EntityReference, Notation and ProcessingInstruction]
51
DOM Tree
[Figure: a DOM tree with the Document node at its root]
52
Normalizing a Tree

Normalizing a DOM Tree has two effects
Combine adjacent textual nodes
Eliminate empty textual nodes
To normalize, apply the normalize() method to the
document element
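
Both effects of normalizing can be observed on a tree built by hand. This is a sketch using the DOM implementation that comes with the JDK; the class name NormalizeSketch and the element content are illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: counts child nodes before and after normalize().
final class NormalizeSketch {
    static String summary() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element account = doc.createElement("account");
            doc.appendChild(account);
            // Three adjacent text nodes, one of them empty.
            account.appendChild(doc.createTextNode("89"));
            account.appendChild(doc.createTextNode(""));
            account.appendChild(doc.createTextNode("-344"));
            int before = account.getChildNodes().getLength();
            doc.getDocumentElement().normalize();
            int after = account.getChildNodes().getLength();
            return before + "->" + after + ":" + account.getTextContent();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```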

53
Node Methods

Three categories of methods
Node characteristics: name, type, value
Contextual location and access to relatives:
parents, siblings, children, ancestors,
descendants
Node modification: edit, delete, re-arrange child
nodes

54
Node Methods (2)

short getNodeType()
String getNodeName()
String getNodeValue() throws DOMException
void setNodeValue(String value)
throws DOMException
boolean hasChildNodes()
NamedNodeMap getAttributes()
Document getOwnerDocument()

55
Node Types - getNodeType()
ELEMENT_NODE = 1
ATTRIBUTE_NODE = 2
TEXT_NODE = 3
CDATA_SECTION_NODE = 4
ENTITY_REFERENCE_NODE = 5
ENTITY_NODE = 6
PROCESSING_INSTRUCTION_NODE = 7
COMMENT_NODE = 8
DOCUMENT_NODE = 9
DOCUMENT_TYPE_NODE = 10
DOCUMENT_FRAGMENT_NODE = 11
NOTATION_NODE = 12

if (myNode.getNodeType() == Node.ELEMENT_NODE) {
    // process node
}
57
Node Navigation

Every node has a specific location in tree
Node interface specifies methods to find
surrounding nodes
Node getFirstChild()
Node getLastChild()
Node getNextSibling()
Node getPreviousSibling()
Node getParentNode()
NodeList getChildNodes()
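
These navigation methods combine naturally into a loop over a node's children. The sketch below walks the children with getFirstChild and getNextSibling and keeps only the elements, skipping the whitespace text nodes that sit between tags. It assumes the JDK's JAXP DOM parser; the class name NavigationSketch is illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the names of a node's element children.
final class NavigationSketch {
    static final String SAMPLE =
        "<transaction>\n  <account>89-344</account>\n"
        + "  <buy shares=\"100\"/>\n  <sell shares=\"30\"/>\n</transaction>";

    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) { // ignore text, comments, etc.
                names.add(c.getNodeName());
            }
        }
        return names;
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not build DOM tree", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(childElementNames(parse(SAMPLE).getDocumentElement()));
    }
}
```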

58
Node Navigation (2)
Figure as from The XML Companion - Neil Bradley
[Figure: a node and its neighbours, labelled with getParentNode(), getPreviousSibling(), getNextSibling(), getFirstChild(), getLastChild() and getChildNodes()]
59
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;

public class InfoWithDom {

    public static void main(String[] args) {
        try {
            DOMParser dom = new DOMParser();
            dom.parse(args[0]);
            Document doc = dom.getDocument();
            new InfoWithDom().echo(doc);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

60
    private int depth = 0;
    private final String indent = " ";
    // Indexed by the value returned from Node.getNodeType()
    private String[] NODE_TYPES = {
        "", "ELEMENT", "ATTRIBUTE", "TEXT",
        "CDATA", "ENTITY_REF", "ENTITY",
        "PROCESSING_INST", "COMMENT", "DOCUMENT",
        "DOCUMENT_TYPE", "DOCUMENT_FRAG",
        "NOTATION" };

    private void outputIndentation() {
        for (int i = 0; i < depth; i++)
            System.out.print(indent);
    }
61
    private void printlnCommon(Node n) {
        System.out.print(NODE_TYPES[n.getNodeType()] + ":");
        System.out.print(" nodeName=" + n.getNodeName());
        String val;
        if ((val = n.getNamespaceURI()) != null)
            System.out.print(" uri=" + val);
        if ((val = n.getPrefix()) != null)
            System.out.print(" pre=" + val);
        if ((val = n.getLocalName()) != null)
            System.out.print(" local=" + val);
        if ((val = n.getNodeValue()) != null && !val.trim().equals(""))
            System.out.print(" nodeValue=" + val);
        System.out.println();
    }
62
    private void echo(Node n) {
        outputIndentation();
        printlnCommon(n);
        if (n.getNodeType() == Node.ELEMENT_NODE) {
            // Attributes are printed two levels deeper than their element
            NamedNodeMap atts = n.getAttributes();
            depth += 2;
            for (int i = 0; i < atts.getLength(); i++)
                echo(atts.item(i));
            depth -= 2;
        }
        depth++;
        for (Node child = n.getFirstChild(); child != null;
             child = child.getNextSibling())
            echo(child);
        depth--;
    }
}
[Figures: example input and example output; not captured in the transcript]
63
Node Manipulation

Children of a node in a DOM tree can be
manipulated - added, edited, deleted, moved,
copied, etc.

Node removeChild(Node oldChild) throws DOMException
Node insertBefore(Node newChild, Node refChild) throws DOMException
Node appendChild(Node newChild) throws DOMException
Node replaceChild(Node newChild, Node oldChild) throws DOMException
Node cloneNode(boolean deep)
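
A brief sketch of these methods applied in sequence to a small tree. It uses the DOM implementation that comes with the JDK; the class name ManipulationSketch and the hold element are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: edits the children of a transaction element.
final class ManipulationSketch {
    static String rearranged() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element tx = doc.createElement("transaction");
            doc.appendChild(tx);
            Element account = doc.createElement("account");
            Element buy = doc.createElement("buy");
            Element sell = doc.createElement("sell");
            Element hold = doc.createElement("hold");

            tx.appendChild(sell);          // children: sell
            tx.insertBefore(buy, sell);    // children: buy, sell
            tx.insertBefore(account, buy); // children: account, buy, sell
            tx.replaceChild(hold, sell);   // children: account, buy, hold
            tx.removeChild(buy);           // children: account, hold

            Node copy = tx.cloneNode(true); // deep copy includes the children
            List<String> names = new ArrayList<>();
            for (Node c = copy.getFirstChild(); c != null; c = c.getNextSibling()) {
                names.add(c.getNodeName());
            }
            return String.join(",", names);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rearranged());
    }
}
```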
64
Node Manipulation (2)
Figure as appears in The XML Companion - Neil
Bradley
65
Other Interfaces