Title: Processing XML with Java
1Processing XML with Java
- Representation and Management of Data on the
Internet
2XML
- XML is eXtensible Markup Language
- It is a metalanguage
- A language used to describe other languages using markup tags that describe properties of the data
- Designed to be structured
- Strict rules about how data can be formatted
- Designed to be extensible
- Can define own terms and markup
3XML Family
- XML is an official recommendation of the W3C
- Aims to accomplish what HTML cannot and be
simpler to use and implement than SGML
[Figure: HTML, XML and SGML]
4The Essence of XML
- Syntax: The permitted arrangement or structure of letters and words in a language as defined by a grammar (XML)
- Semantics: The meaning of letters or words in a language
- XML uses syntax to add semantics to the documents
5Using XML
- In XML there is a separation of the content from the display
- XML can be used for
- Data representation
- Data exchange
6Databases and XML
- Database content can be presented in XML
- XML processor can access DBMS or file system and convert data to XML
- Web server can serve content as either XML or HTML
7HTML vs. XML
[Table: HTML and XML compared side by side]
8HTML vs. XML
[Table: HTML and XML compared side by side, continued]
9Some Things in Common
- Comments are allowed: <!-- -->
- Special characters must be escaped (e.g., &gt; for >)
10Processing XML: The Idea
11Sample Document
<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>
12DOM Parser
- DOM: Document Object Model
- Parser creates a tree object out of the document
- User accesses data by traversing the tree
- The API allows for constructing, accessing and
manipulating the structure and content of XML
documents
13Document as Tree
Methods like getRoot, getChildren, getAttributes, etc.
transaction
  account
    89-344
  buy (shares = 100)
    ticker (exch = NASDAQ)
      WEBM
  sell (shares = 30)
    ticker (exch = NYSE)
      GE
14Advantages and Disadvantages
- Advantages
- Natural and relatively easy to use
- Can repeatedly traverse tree
- Disadvantages
- High memory requirements: the whole document is kept in memory
- Must parse the whole document before use
15SAX Parser
- SAX: Simple API for XML
- Parser creates events while reading through the document
- Parser calls methods (that you write) to deal with the events
- Similar to an I/O stream: it goes in one direction
16Document as Events
<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>
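The event view of this document can be made concrete with a short sketch. It is an illustration rather than part of the original slides: it assumes the JAXP parser bundled with the JDK (SAXParserFactory) instead of a separately installed Xerces, and the class name SaxEventTrace is made up for the example. It records one event per start tag and end tag, in the order the parser reports them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class SaxEventTrace {
    // The sample transaction document, written on one line.
    static final String SAMPLE =
        "<transaction><account>89-344</account>"
      + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
      + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
      + "</transaction>";

    /** Returns the start/end element events in the order the parser reports them. */
    static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                events.add("start " + qName);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("end " + qName);
            }
        };
        // Assumption: the JDK's default (non-namespace-aware) JAXP parser.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : trace(SAMPLE)) {
            System.out.println(event);
        }
    }
}
```

Each element contributes exactly one start event and one end event, and nested elements are reported between the start and end of their parent.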
17Advantages and Disadvantages
- Advantages
- Requires little memory
- Fast
- Disadvantages
- Cannot reread
- Less natural for object oriented programmers
(perhaps)
18Which should we use? DOM vs. SAX
- If your document is very large and you only need a few elements - use SAX
- If you need to manipulate (i.e., change) the XML - use DOM
- If you need to access the XML many times - use DOM
19XML Parsers
20XML Parsers
- There are several different ways to categorise parsers
- Validating versus non-validating parsers
- DOM parsers versus SAX parsers
- Parsers written in a particular language (Java,
C, Perl, etc.)
21Validating Parsers
- A validating parser makes sure that the document conforms to the specified DTD
- This is time consuming, so a non-validating parser is faster
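The difference between the two kinds of parser can be seen by parsing the same document twice. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name ValidationDemo and the tiny note DTD are invented for the example. Validity problems are reported to the error method of the handler, so the sketch simply counts those calls.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class ValidationDemo {
    // Made-up internal DTD: a <note> must contain exactly one <to> element.
    static final String DTD =
        "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";

    /** Parses xml and returns how many validity errors the parser reported. */
    static int countValidityErrors(String xml, boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating); // turn DTD validation on or off
        final int[] errors = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++; // validity errors are recoverable, so parsing continues
            }
        };
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        String invalid = DTD + "<note><from>x</from></note>";
        System.out.println("validating:     " + countValidityErrors(invalid, true));
        System.out.println("non-validating: " + countValidityErrors(invalid, false));
    }
}
```

The document is well formed in both runs; only the validating run checks it against the DTD and reports that it does not conform.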
22Using an XML Parser
- Three basic steps
- Create a parser object
- Pass the XML document to the parser
- Process the results
- Generally, writing out XML is not in the scope of
parsers (though some may implement proprietary
mechanisms)
23SAX: Simple API for XML
24The SAX Parser
- SAX parser is an event-driven API
- An XML document is sent to the SAX parser
- The XML file is read sequentially
- The parser notifies the class when events happen, including errors
- The events are handled by the API methods that the programmer implemented
25
- ContentHandler: handles document events (start tag, end tag, etc.)
- XMLReaderFactory: used to create a SAX Parser
- ErrorHandler: handles Parser Errors
- DTDHandler and EntityResolver: handle DTDs and Entities
26Problem
- The SAX interface is an accepted standard
- There are many implementations
- We would like to be able to change the implementation used without changing any code in the program
- How is this done?
27Factory Design Pattern
- Have a Factory class that creates the actual Parsers.
- The Factory checks the value of a system property that states which implementation should be used
- In order to change the implementation, simply change the system property
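The pattern itself fits in a few lines. The sketch below is a toy model, not the code of any real SAX factory: the Parser interface, the two implementations and the property name demo.parser.impl are all hypothetical. The factory reads a system property and instantiates whichever class it names, so the calling code never mentions a concrete class.

```java
// Hypothetical demo class; every name in it is invented for illustration.
class FactoryPatternSketch {
    /** Stand-in for a standard interface such as XMLReader. */
    interface Parser {
        String name();
    }

    static class FastParser implements Parser {
        public String name() { return "fast"; }
    }

    static class StrictParser implements Parser {
        public String name() { return "strict"; }
    }

    /** Made-up property name that selects the implementation. */
    static final String PROPERTY = "demo.parser.impl";

    /** Creates the implementation named by the system property (FastParser by default). */
    static Parser createParser() throws ReflectiveOperationException {
        String className = System.getProperty(PROPERTY, FastParser.class.getName());
        return (Parser) Class.forName(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(createParser().name());
        // Switching implementation requires no change to the calling code.
        System.setProperty(PROPERTY, StrictParser.class.getName());
        System.out.println(createParser().name());
    }
}
```

In practice the property would usually be set on the command line (java -Ddemo.parser.impl=...) rather than in the program.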
28Creating a SAX Parser
- Import the following packages
- org.xml.sax.*
- org.xml.sax.helpers.*
- Set the following system property
- System.setProperty("org.xml.sax.driver", "org.apache.xerces.parsers.SAXParser");
- Create the instance from the Factory
- XMLReader reader = XMLReaderFactory.createXMLReader();
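On recent JDKs, XMLReaderFactory is marked as deprecated, and an XMLReader can instead be obtained through JAXP. The sketch below shows that alternative route; it assumes the parser bundled with the JDK, and the class name CreateReaderDemo is invented for the example.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical demo class; shows a JAXP alternative to XMLReaderFactory.
class CreateReaderDemo {
    /** Obtains an XMLReader without naming a concrete parser class. */
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP parsers are not namespace aware by default
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // Prints the class of whichever implementation the factory selected.
        System.out.println(newReader().getClass().getName());
    }
}
```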
29Receiving Parsing Information
- A SAX Parser calls methods such as startDocument, startElement, etc., as it runs
- In order to react to such events we must
- implement the ContentHandler interface
- set the parser's content handler with an instance of our class
30ContentHandler
- // Methods (partial list)
- public void startDocument()
- public void endDocument()
- public void characters(char[] ch, int start, int length)
- public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
- public void endElement(String namespaceURI, String localName, String qName)
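As an illustration of how these callbacks cooperate, the sketch below collects the text of every ticker element. It is an example under stated assumptions, not the course's code: it uses the JDK's JAXP parser, and the class name TickerCollector is made up. Because a parser is allowed to deliver the text of one element in several characters calls, the text is accumulated in a buffer and only read in endElement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class TickerCollector extends DefaultHandler {
    private final List<String> tickers = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called more than once per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("ticker".equals(qName)) {
            tickers.add(text.toString().trim());
        }
    }

    /** Parses xml and returns the text content of each ticker element. */
    static List<String> collect(String xml) throws Exception {
        TickerCollector handler = new TickerCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.tickers;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect(
            "<t><buy><ticker>WEBM</ticker></buy><sell><ticker>GE</ticker></sell></t>"));
    }
}
```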
31Namespaces and Element Names
<?xml version='1.0' encoding='utf-8'?>
<forsale date="12/2/03"
         xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
  <book>
    <title> <xhtml:em> DBI </xhtml:em>
      The Course I Wish I never Took
    </title>
    <comment> My <xhtml:b> favorite </xhtml:b> book!
    </comment>
  </book>
</forsale>
32Namespaces and Element Names
- For the element book: namespaceURI = "", localName = book, qName = book
- For the element xhtml:em: namespaceURI = urn:http://www.w3.org/1999/xhtml, localName = em, qName = xhtml:em
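The three names can be checked by printing what startElement receives. This is a small sketch, assuming the JDK's JAXP parser with namespace awareness switched on; the class name NamespaceNamesDemo and the shortened document are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class NamespaceNamesDemo {
    // Shortened version of the forsale document.
    static final String XML =
        "<forsale xmlns:xhtml=\"urn:http://www.w3.org/1999/xhtml\">"
      + "<book><xhtml:em>DBI</xhtml:em></book></forsale>";

    /** Returns one line per element with the three names passed to startElement. */
    static List<String> describe(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for namespaceURI and localName
        final List<String> lines = new ArrayList<>();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    lines.add("uri=" + uri + " local=" + localName + " qName=" + qName);
                }
            });
        return lines;
    }

    public static void main(String[] args) throws Exception {
        for (String line : describe(XML)) {
            System.out.println(line);
        }
    }
}
```

Elements without a prefix report an empty namespace URI here because the document declares no default namespace.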
33Receiving Parsing Information (cont.)
- An easy way to implement the ContentHandler interface is to extend DefaultHandler, which implements this interface (and a few others) in an empty fashion
- To actually parse a document, create an InputSource from the document and supply the input source to the parse method of the XMLReader
34
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class InfoWithSax extends DefaultHandler {

    public static void main(String[] args) {
        System.setProperty("org.xml.sax.driver",
                           "org.apache.xerces.parsers.SAXParser");
        try {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setContentHandler(new InfoWithSax());
            reader.parse(new InputSource(new FileReader(args[0])));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

35
    public void startDocument() throws SAXException {
        System.out.println("START DOCUMENT");
    }

    public void endDocument() throws SAXException {
        System.out.println("END DOCUMENT");
    }

    int depth;              // current nesting level
    String indent = "  ";   // printed once per nesting level

    private void println(String header, String value) {
        for (int i = 0; i < depth; i++)
            System.out.print(indent);
        System.out.println(header + ": " + value);
    }

36
    public void characters(char[] buf, int offset, int len)
            throws SAXException {
        String s = (new String(buf, offset, len)).trim();
        if (!"".equals(s)) println("CHARACTERS", s);
    }

    public void endElement(String namespaceURI, String localName,
                           String name) throws SAXException {
        depth--;
        String elementName = name;
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("END ELEMENT", elementName);
    }

37
    public void startElement(String namespaceURI, String localName,
                             String name, Attributes attrs)
            throws SAXException {
        String elementName = name;
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("START ELEMENT", elementName);
        if (attrs != null && attrs.getLength() > 0)
            for (int i = 0; i < attrs.getLength(); i++)
                println("ATTRIBUTE",
                        attrs.getLocalName(i) + "=" + attrs.getValue(i));
        depth++;
    }
}
38Bachelor Tags
- What do you think happens when the parser parses a bachelor tag?
- <rating stars="five" />
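One way to find out is to record the events for such a tag. The sketch below assumes the JDK's JAXP parser, and the class name EmptyElementDemo is made up. It shows that a bachelor (empty-element) tag produces a start event immediately followed by an end event, the same as an explicit pair of tags with nothing between them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class EmptyElementDemo {
    /** Returns the element events reported for the given XML text. */
    static List<String> events(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    events.add("start " + qName);
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    events.add("end " + qName);
                }
            });
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(events("<rating stars=\"five\" />"));
        System.out.println(events("<rating stars=\"five\"></rating>"));
    }
}
```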
39Attributes Interface
- Elements may have attributes
- There is no distinction between attributes that are defined explicitly and those that are specified in the DTD (with a default value)
40Attributes Interface (cont.)
- int getLength()
- String getQName(int i)
- String getType(int i)
- String getValue(int i)
- String getType(String qname)
- String getValue(String qname)
- etc.
41Attributes Types
- The following are possible types for attributes
- "CDATA",
- "ID",
- "IDREF", "IDREFS",
- "NMTOKEN", "NMTOKENS",
- "ENTITY", "ENTITIES",
- "NOTATION"
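The Attributes methods can be tried out on the sample transaction elements. The following sketch assumes the JDK's JAXP parser and a document without a DTD, in which case every attribute is reported with type "CDATA"; the class name AttributeDump is invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class AttributeDump {
    /** Returns "element/attribute=value (type)" for every attribute in the document. */
    static List<String> dump(String xml) throws Exception {
        final List<String> lines = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    for (int i = 0; i < atts.getLength(); i++) {
                        lines.add(qName + "/" + atts.getQName(i) + "="
                            + atts.getValue(i) + " (" + atts.getType(i) + ")");
                    }
                }
            });
        return lines;
    }

    public static void main(String[] args) throws Exception {
        for (String line : dump(
                "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>")) {
            System.out.println(line);
        }
    }
}
```

Types other than CDATA only appear when the attribute is declared with that type in a DTD.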
42Setting Features
- It is possible to set the features of a parser using the setFeature method.
- Examples
- reader.setFeature("http://xml.org/sax/features/namespaces", true);
- reader.setFeature("http://xml.org/sax/features/validation", false);
- For a full list, see http://www.saxproject.org/?selected=get-set
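A feature that has been set can be read back with getFeature, which makes the configuration easy to check. This is a minimal sketch, assuming the XMLReader supplied by the JDK's JAXP parser; the class name FeatureDemo is made up for the example.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical demo class; not part of the original slides.
class FeatureDemo {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /** Returns a reader with namespace processing on and validation off. */
    static XMLReader configuredReader() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setFeature(NAMESPACES, true);
        reader.setFeature(VALIDATION, false);
        return reader;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = configuredReader();
        System.out.println("namespaces: " + reader.getFeature(NAMESPACES));
        System.out.println("validation: " + reader.getFeature(VALIDATION));
    }
}
```

Both methods throw SAXNotRecognizedException when the parser does not know the feature name, so a typo in the URI is reported rather than silently ignored.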
43ErrorHandler Interface
- We implement ErrorHandler to receive error events (similar to implementing ContentHandler)
- DefaultHandler implements ErrorHandler in an empty fashion, so we can extend it (as before)
- An ErrorHandler is registered with
- reader.setErrorHandler(handler);
- Three methods
- void error(SAXParseException ex)
- void fatalError(SAXParseException ex)
- void warning(SAXParseException ex)
44Extending the InfoWithSax Program
    public void warning(SAXParseException err) throws SAXException {
        System.out.println("Warning in line " + err.getLineNumber()
                           + " and column " + err.getColumnNumber());
    }

    public void error(SAXParseException err) throws SAXException {
        System.out.println("Oy vaavoi, an error!");
    }

    public void fatalError(SAXParseException err) throws SAXException {
        System.out.println("OY VAAVOI, a fatal error!");
    }
Will these methods be called in the case of a problem?
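The question can be explored with a small experiment: the methods are only called if the handler has been registered with setErrorHandler. The sketch below assumes the JDK's JAXP parser; the class name ErrorHandlerDemo and the malformed test document are invented. It parses a document that is not well formed, once with and once without registering the handler, and counts the callbacks.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class ErrorHandlerDemo extends DefaultHandler {
    int warnings, errors, fatalErrors;

    @Override public void warning(SAXParseException e) { warnings++; }
    @Override public void error(SAXParseException e) { errors++; }
    @Override public void fatalError(SAXParseException e) { fatalErrors++; }

    /** Parses xml; the handler only receives error events if register is true. */
    static ErrorHandlerDemo parse(String xml, boolean register) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ErrorHandlerDemo handler = new ErrorHandlerDemo();
        reader.setContentHandler(handler);
        if (register) {
            reader.setErrorHandler(handler);
        }
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The parser stops after a problem with well-formedness.
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String malformed = "<a><b></a>"; // <b> is never closed
        System.out.println("registered:     " + parse(malformed, true).fatalErrors);
        System.out.println("not registered: " + parse(malformed, false).fatalErrors);
    }
}
```

Note also that error is used for validity problems, so it is only expected to fire when validation is switched on.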
45Lexical Events
- Lexical events have to do with the way that a document was written and not with its content
- Examples
- A comment is a lexical event (<!-- comment -->)
- The use of an entity is a lexical event (&gt;)
- These can be dealt with by implementing the LexicalHandler interface, and set on a parser by
- reader.setProperty("http://xml.org/sax/properties/lexical-handler", mylexicalhandler);
46LexicalHandler
- // Methods (partial list)
- public void startEntity(String name)
- public void endEntity(String name)
- public void comment(char[] ch, int start, int length)
- public void startCDATA()
- public void endCDATA()
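A short sketch of a lexical handler that collects comments follows. It assumes the JDK's JAXP parser and extends org.xml.sax.ext.DefaultHandler2, a convenience class that implements LexicalHandler with empty methods; the class name CommentCollector is invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; not part of the original slides.
class CommentCollector extends DefaultHandler2 {
    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length).trim());
    }

    /** Parses xml and returns the text of every comment, in document order. */
    static List<String> collect(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CommentCollector handler = new CommentCollector();
        reader.setContentHandler(handler);
        // Lexical handlers are registered as a property, not with a setter.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<a><!-- first --><b/><!--second--></a>"));
    }
}
```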
47DOM: Document Object Model
48Creating a DOM Tree
- How can we create a DOM Tree independently of the implementation chosen?
- Creating a DOM Tree using the Apache Xerces package
- Import org.apache.xerces.parsers.DOMParser
- Import org.w3c.dom.*
- Use the following lines of code
- DOMParser dom = new DOMParser();
- dom.parse(fileName);
- Document doc = dom.getDocument();
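One possible answer to the first question is JAXP's DocumentBuilderFactory, which plays the same role for DOM that the factory plays for SAX. The sketch below is an illustration under that assumption, not the Xerces-specific code from the slide; the class name DomWithJaxp is made up, and the document is read from a string rather than a file to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; shows a JAXP route to a DOM tree.
class DomWithJaxp {
    /** Builds a DOM tree without naming a concrete parser class. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<transaction><account>89-344</account></transaction>");
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getElementsByTagName("account").item(0).getTextContent());
    }
}
```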
49Using a DOM Tree
50Nodes in a DOM Tree
Figure as appears in The XML Companion - Neil Bradley
Node
  Document
  DocumentFragment
  DocumentType
  Element
  Attr
  CharacterData
    Text
      CDATASection
    Comment
  Entity
  EntityReference
  Notation
  ProcessingInstruction
51DOM Tree
Document
52Normalizing a Tree
- Normalizing a DOM Tree has two effects
- Combine adjacent textual nodes
- Eliminate empty textual nodes
- To normalize, apply the normalize() method to the
document element
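Both effects can be observed on a tree built by hand. This is a minimal sketch, assuming the DOM implementation bundled with the JDK; the class name NormalizeDemo and the title element are invented for the example. It creates an element with three text children, one of them empty, and normalizes it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; not part of the original slides.
class NormalizeDemo {
    /** Builds a document element whose text is split over several nodes. */
    static Element fragmentedElement() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element title = doc.createElement("title");
        doc.appendChild(title);
        title.appendChild(doc.createTextNode("Processing "));
        title.appendChild(doc.createTextNode("")); // empty text node
        title.appendChild(doc.createTextNode("XML"));
        return title;
    }

    public static void main(String[] args) throws Exception {
        Element title = fragmentedElement();
        System.out.println("before: " + title.getChildNodes().getLength());
        title.normalize(); // merges adjacent text nodes, drops empty ones
        System.out.println("after:  " + title.getChildNodes().getLength());
        System.out.println(title.getFirstChild().getNodeValue());
    }
}
```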
53Node Methods
- Three categories of methods
- Node characteristics: name, type, value
- Contextual location and access to relatives: parents, siblings, children, ancestors, descendants
- Node modification: edit, delete, re-arrange child nodes
54Node Methods (2)
- short getNodeType()
- String getNodeName()
- String getNodeValue() throws DOMException
- void setNodeValue(String value) throws DOMException
- boolean hasChildNodes()
- NamedNodeMap getAttributes()
- Document getOwnerDocument()
55Node Types - getNodeType()
ELEMENT_NODE                 1
ATTRIBUTE_NODE               2
TEXT_NODE                    3
CDATA_SECTION_NODE           4
ENTITY_REFERENCE_NODE        5
ENTITY_NODE                  6
PROCESSING_INSTRUCTION_NODE  7
COMMENT_NODE                 8
DOCUMENT_NODE                9
DOCUMENT_TYPE_NODE           10
DOCUMENT_FRAGMENT_NODE       11
NOTATION_NODE                12

if (myNode.getNodeType() == Node.ELEMENT_NODE) {
    // process node
}
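The type constants are typically used to decide how each node should be handled. The sketch below counts the children of the root by type; it assumes the JDK's default DocumentBuilder (which keeps comments in the tree), and the class name NodeTypeCount and the test document are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original slides.
class NodeTypeCount {
    /** Returns {elements, texts, comments} among the children of the root. */
    static int[] countRootChildren(String xml) throws Exception {
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
        int[] counts = new int[3];
        NodeList kids = root.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            switch (kids.item(i).getNodeType()) {
                case Node.ELEMENT_NODE: counts[0]++; break;
                case Node.TEXT_NODE:    counts[1]++; break;
                case Node.COMMENT_NODE: counts[2]++; break;
                default: break; // other node types are ignored here
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        int[] c = countRootChildren("<a>text<b/><!-- note --><c/></a>");
        System.out.println("elements=" + c[0] + " texts=" + c[1] + " comments=" + c[2]);
    }
}
```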
57Node Navigation
- Every node has a specific location in tree
- Node interface specifies methods to find surrounding nodes
- Node getFirstChild()
- Node getLastChild()
- Node getNextSibling()
- Node getPreviousSibling()
- Node getParentNode()
- NodeList getChildNodes()
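A common use of these methods is the sibling walk: start at the first child and follow getNextSibling until it returns null. The following is a small sketch, assuming the JDK's default DocumentBuilder; the class name SiblingWalk is made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original slides.
class SiblingWalk {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the names of the element children of parent, in document order. */
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Node root = parse("<transaction><account/><buy/><sell/></transaction>")
            .getDocumentElement();
        System.out.println(childElementNames(root));
        System.out.println(root.getLastChild().getPreviousSibling().getNodeName());
    }
}
```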
58Node Navigation (2)
Figure as from The XML Companion - Neil Bradley
[Figure: a node and the relatives reached by getParentNode(), getPreviousSibling(), getNextSibling(), getFirstChild(), getLastChild() and getChildNodes()]
59
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;

public class InfoWithDom {

    public static void main(String[] args) {
        try {
            DOMParser dom = new DOMParser();
            dom.parse(args[0]);
            Document doc = dom.getDocument();
            new InfoWithDom().echo(doc);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

60
    private int depth = 0;
    private final String indent = " ";
    // Indexed by the value returned from getNodeType()
    private String[] NODE_TYPES = {
        "", "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA", "ENTITY_REF", "ENTITY",
        "PROCESSING_INST", "COMMENT", "DOCUMENT", "DOCUMENT_TYPE",
        "DOCUMENT_FRAG", "NOTATION" };

    private void outputIndentation() {
        for (int i = 0; i < depth; i++)
            System.out.print(indent);
    }

61
    private void printlnCommon(Node n) {
        System.out.print(NODE_TYPES[n.getNodeType()] + ":");
        System.out.print(" nodeName=" + n.getNodeName());
        String val;
        if ((val = n.getNamespaceURI()) != null)
            System.out.print(" uri=" + val);
        if ((val = n.getPrefix()) != null)
            System.out.print(" pre=" + val);
        if ((val = n.getLocalName()) != null)
            System.out.print(" local=" + val);
        if ((val = n.getNodeValue()) != null && !val.trim().equals(""))
            System.out.print(" nodeValue=" + val);
        System.out.println();
    }

62
    private void echo(Node n) {
        outputIndentation();
        printlnCommon(n);
        if (n.getNodeType() == Node.ELEMENT_NODE) {
            // Attributes are printed two levels deeper than their element
            NamedNodeMap atts = n.getAttributes();
            depth += 2;
            for (int i = 0; i < atts.getLength(); i++)
                echo(atts.item(i));
            depth -= 2;
        }
        depth++;
        for (Node child = n.getFirstChild(); child != null;
             child = child.getNextSibling())
            echo(child);
        depth--;
    }
}
Example Input
Example Output
63Node Manipulation
- Children of a node in a DOM tree can be manipulated: added, edited, deleted, moved, copied, etc.
Node removeChild(Node oldChild) throws DOMException
Node insertBefore(Node newChild, Node refChild) throws DOMException
Node appendChild(Node newChild) throws DOMException
Node replaceChild(Node newChild, Node oldChild) throws DOMException
Node cloneNode(boolean deep)
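The following sketch applies each of these methods once to a small tree built in memory. It assumes the DOM implementation bundled with the JDK; the class name NodeEditDemo and the element names (including hold) are chosen only for the example. The comments show the children of the root after each step.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class; not part of the original slides.
class NodeEditDemo {
    /** Builds a small tree and edits it; returns the root element. */
    static Element edit() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element root = doc.createElement("transaction");
        doc.appendChild(root);

        Element sell = doc.createElement("sell");
        Element buy = doc.createElement("buy");
        Element account = doc.createElement("account");
        Element hold = doc.createElement("hold");

        root.appendChild(sell);             // sell
        root.insertBefore(buy, sell);       // buy sell
        root.insertBefore(account, buy);    // account buy sell
        Node copy = buy.cloneNode(true);    // deep copy, not yet in the tree
        root.appendChild(copy);             // account buy sell buy
        root.replaceChild(hold, sell);      // account buy hold buy
        root.removeChild(copy);             // account buy hold
        return root;
    }

    /** Returns the names of the children of n, separated by spaces. */
    static String childNames(Node n) {
        StringBuilder sb = new StringBuilder();
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(c.getNodeName());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childNames(edit()));
    }
}
```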
64Node Manipulation (2)
Figure as appears in The XML Companion - Neil
Bradley
65Other Interfaces
- We have discussed methods of the Node interface
- Each of the "specific types of nodes" has additional methods
- See API for details
66Note about DOM Objects
- DOM object ≈ compiled XML
- Can save time and effort if we send and receive DOM objects instead of XML source
- Saves having to parse XML files into DOM at sender and receiver
- But, a DOM object may be larger than the XML source