Title: Processing XML with Java
1Processing XML with Java
- Representation and Management of Data on the Internet
2XML: eXtensible Markup Language
- XML is a metalanguage
- A language used to describe other languages, using markup tags that describe properties of the data
- Designed to be structured
- Strict rules about how data can be formatted
- Designed to be extensible
- Can define own terms and markup
- When will we use XML?
3XML Family
- XML is an official recommendation of the W3C
- Aims to accomplish what HTML cannot and be
simpler to use and implement than SGML
[Figure: diagram relating HTML, XML and SGML]
4The Essence of XML
- Syntax: the permitted arrangement or structure of letters and words in a language, as defined by a grammar (XML)
- Semantics: the meaning of letters or words in a language
- XML uses syntax to add semantics to documents
5Using XML
- In XML, the content is separated from the display
- XML can be used for
- Data representation
- Data exchange
6HTML vs. XML
[Figure: HTML compared with XML]
7HTML vs. XML
[Figure: HTML compared with XML]
HTML was designed to ease the work of authors. XML was designed to ease the work of ????
8Parsing XML The Idea
- Two approaches for parsing XML
- the SAX approach and
- the DOM approach
9Parsing XML
- What is a parser?
- Software for analyzing language: a computer program that breaks natural-language or programming-language statements or instructions into smaller, more easily interpreted units that the computer can understand. The parser determines how a sentence can be constructed from the grammar of the language, producing a parse tree of the statement as its output.
- How should an XML parser work?
10Sample Document
    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>
11DOM Parser
- DOM: Document Object Model
- Parser creates a tree object out of the document
- User accesses data by traversing the tree
- The API allows for constructing, accessing and
manipulating the structure and content of XML
documents
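The DOM approach can be sketched on the sample transaction document as follows. This is a minimal sketch, assuming the parser bundled with the JDK (obtained through javax.xml.parsers) rather than any particular implementation; the class name DomTickerDemo and the method describeTickers are illustrative names, not part of the course material.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTickerDemo {
    // The sample transaction document, inlined so the sketch is self-contained.
    static final String XML =
        "<transaction><account>89-344</account>"
      + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
      + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
      + "</transaction>";

    // Builds the whole tree in memory, then walks it to describe each ticker.
    static String describeTickers(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList tickers = doc.getElementsByTagName("ticker");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < tickers.getLength(); i++) {
                Element ticker = (Element) tickers.item(i);
                Element action = (Element) ticker.getParentNode(); // <buy> or <sell>
                out.append(action.getTagName()).append(' ')
                   .append(action.getAttribute("shares")).append(' ')
                   .append(ticker.getTextContent()).append('@')
                   .append(ticker.getAttribute("exch")).append('\n');
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(describeTickers(XML));
    }
}
```

Because the tree stays in memory, the code can move from a ticker up to its parent and back down as often as it likes.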
12Document as Tree
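Methods like getRoot, getChildren, getAttributes, etc.
[Figure: the sample document as a tree. The root transaction has three children: account (text 89-344), buy (attribute shares = 100) and sell (attribute shares = 30). buy and sell each have a ticker child: under buy, ticker has attribute exch = NASDAQ and text WEBM; under sell, ticker has attribute exch = NYSE and text GE.]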
13Advantages and Disadvantages
- Advantages
- Natural and relatively easy to use
- Can repeatedly traverse tree
- Disadvantages
- High memory requirements: the whole document is kept in memory
- Must parse the whole document and construct many objects before use
14SAX Parser
- SAX: Simple API for XML
- Parser creates events while reading through the document
- Parser calls methods (that you write) to deal with the events
- Similar to an I/O stream: goes in one direction
15Document as Events
    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>
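To make the event view concrete, here is a minimal sketch that records the events a SAX parser reports for a fragment of the sample document. It assumes the JDK's bundled parser via SAXParserFactory; SaxEventLog and eventsOf are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventLog extends DefaultHandler {
    static final String XML =
        "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>";

    // Events in the order the parser reports them.
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("text " + text);
    }

    // Runs the parser over the document once, front to back.
    static List<String> eventsOf(String xml) {
        try {
            SaxEventLog log = new SaxEventLog();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), log);
            return log.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        eventsOf(XML).forEach(System.out::println);
    }
}
```

Nothing is kept after an event has been delivered unless the handler stores it itself, which is where the low memory use comes from.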
16Advantages and Disadvantages
- Advantages
- Requires little memory
- Fast
- Disadvantages
- Cannot read backwards
- Does not support transformation of the document, such as cut and paste of fragments
- Difficult to program
17Programming using SAX is Difficult
- In some cases, programming with SAX is difficult
- How can we find, using a SAX parser, an element e1 with ancestor e2?
- How can we find, using a SAX parser, elements e1 that have a descendant element e2?
- What about cases that are even more complex?
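One possible answer to the first question is for the handler to keep its own stack of open elements, since SAX itself remembers nothing. The sketch below is only one way to do it, assuming the JDK's bundled parser; AncestorFinder and count are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AncestorFinder extends DefaultHandler {
    static final String XML =
        "<transaction><account>89-344</account>"
      + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
      + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
      + "</transaction>";

    private final String target;    // e1
    private final String ancestor;  // e2
    private final Deque<String> open = new ArrayDeque<>(); // elements not yet closed
    private int matches = 0;

    AncestorFinder(String target, String ancestor) {
        this.target = target;
        this.ancestor = ancestor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Everything on the stack is an ancestor of the element that just started.
        if (qName.equals(target) && open.contains(ancestor)) matches++;
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    // Counts elements named target that have an ancestor named ancestor.
    static int count(String xml, String target, String ancestor) {
        try {
            AncestorFinder finder = new AncestorFinder(target, ancestor);
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), finder);
            return finder.matches;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(XML, "ticker", "buy"));
    }
}
```

The second question is harder: whether e1 has a descendant e2 is only known by the time e1 ends, so the handler has to postpone its decision until endElement.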
18Which should we use? DOM vs. SAX
- If your document is very large and you only need a few elements, use SAX
- If you need to manipulate (i.e., change) the XML, use DOM
- If you need to access the XML many times, use DOM (assuming the file is not too large)
19XML Parsers
20XML Parsers
- There are several different ways to categorise parsers
- Validating versus non-validating parsers
- DOM parsers versus SAX parsers
- Parsers written in a particular language (Java, C, Perl, etc.)
21Validating Parsers
- A validating parser makes sure that the document conforms to the specified DTD
- This is time consuming, so a non-validating parser is faster
22Using an XML Parser
- Three basic steps
- Create a parser object
- Pass the XML document to the parser
- Process the results
- Generally, writing out XML is not in the scope of
parsers (though some may implement proprietary
mechanisms)
23SAX: Simple API for XML
24SAX Parsers
[Figure: instructions given to the SAX Parser]
- When you see the start of the document, do ...
- When you see the start of an element, do ...
- When you see the end of an element, do ...
25The SAX Parser
- The SAX parser exposes an event-driven API
- An XML document is sent to the SAX parser
- The XML file is read sequentially
- The parser notifies the class when events happen, including errors
- The events are handled by the API methods that the programmer implemented
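26[Figure: SAX components and their roles]
- ContentHandler: handles document events (start tag, end tag, etc.)
- XMLReaderFactory: used to create a SAX Parser
- ErrorHandler: handles Parser Errors
- DTDHandler, EntityResolver: handle DTDs and Entities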
27Problem
- The SAX interface is an accepted standard
- There are many implementations
- We would like to be able to change the implementation used without changing any code in the program
- How is this done?
28Factory Design Pattern
- Have a Factory class that creates the actual Parsers
- The Factory checks the value of a system property that states which implementation should be used
- In order to change the implementation, simply change the system property
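The idea can be sketched independently of SAX. In the illustration below every name (TinyParser, FastParser, StrictParser, TinyParserFactory and the property demo.parser.impl) is made up for the example; it is not how any real SAX factory is written, only the same pattern in miniature.

```java
class TinyParserFactory {
    // Hypothetical property naming the implementation class to use.
    static final String PROPERTY = "demo.parser.impl";

    // Creates whichever implementation the system property names.
    static TinyParser create() {
        String className = System.getProperty(PROPERTY, FastParser.class.getName());
        try {
            return (TinyParser) Class.forName(className)
                                     .getDeclaredConstructor()
                                     .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        // Switching implementation means changing the property, not the calling code.
        System.setProperty(PROPERTY, StrictParser.class.getName());
        System.out.println(create().name());
    }
}

// The interface the rest of the program is written against.
interface TinyParser {
    String name();
}

class FastParser implements TinyParser {
    public String name() { return "fast"; }
}

class StrictParser implements TinyParser {
    public String name() { return "strict"; }
}
```

Code that calls TinyParserFactory.create() never mentions a concrete class, which is what lets the implementation be swapped from outside.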
29Creating a SAX Parser
- Import the following packages
- org.xml.sax.*
- org.xml.sax.helpers.*
- Set the following system property
- System.setProperty("org.xml.sax.driver", "org.apache.xerces.parsers.SAXParser");
- Create the instance from the Factory
- XMLReader reader = XMLReaderFactory.createXMLReader();
30
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public static void main(String[] args) {
        try {
            // Name the implementation class explicitly
            XMLReader parser = XMLReaderFactory.createXMLReader(
                "org.apache.xerces.parsers.SAXParser");
            ContentHandler handler = new SomethingThatExtendsDefaultHandler();
            parser.setContentHandler(handler);
            parser.parse(args[0]);
        } catch (Exception e) {
            System.err.println(e);
        }
    } // end
31Receiving Parsing Information
- A SAX Parser calls methods such as startDocument, startElement, etc., as it runs
- In order to react to such events we must
- implement the ContentHandler interface
- set the parser's content handler with an instance of our ContentHandler implementation
32ContentHandler
    // Methods (partial list)
    public void startDocument()
    public void endDocument()
    public void characters(char[] ch, int start, int length)
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts)
    public void endElement(String namespaceURI, String localName,
                           String qName)
What to implement in a ContentHandler
33Namespaces and Element Names
    <?xml version='1.0' encoding='utf-8'?>
    <forsale date="12/2/03"
             xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
      <book>
        <title> <xhtml:em> DBI </xhtml:em>
          The Course I Wish I never Took
        </title>
        <comment> My <xhtml:b> favorite </xhtml:b> book!
        </comment>
      </book>
    </forsale>
34Namespaces and Element Names
(Same document as on the previous slide, annotated.)
For the <book> element: namespaceURI = "", localName = book, qName = book
For the <xhtml:em> element: namespaceURI = urn:http://www.w3.org/1999/xhtml, localName = em, qName = xhtml:em
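The three names can be checked with a small handler. This is a sketch under assumptions: it uses the JDK's bundled parser through SAXParserFactory, which has to be made namespace-aware explicitly, and a shortened version of the forsale document; NamespaceNamesDemo and namesOf are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceNamesDemo extends DefaultHandler {
    // A shortened version of the forsale document.
    static final String XML =
        "<forsale xmlns:xhtml=\"urn:http://www.w3.org/1999/xhtml\">"
      + "<book><xhtml:em>DBI</xhtml:em></book></forsale>";

    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        seen.add("uri=" + namespaceURI + " local=" + localName + " qName=" + qName);
    }

    // Returns one line per start tag, in document order.
    static List<String> namesOf(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories are not namespace-aware by default
            NamespaceNamesDemo handler = new NamespaceNamesDemo();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.seen;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        namesOf(XML).forEach(System.out::println);
    }
}
```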
35Receiving Parsing Information (cont.)
- An easy way to implement the ContentHandler interface is to extend DefaultHandler, which implements this interface (and a few others) in an empty fashion
- To actually parse a document, create an InputSource from the document and supply the input source to the parse method of the XMLReader
36
    import java.io.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class InfoWithSax extends DefaultHandler {

        public static void main(String[] args) {
            System.setProperty("org.xml.sax.driver",
                               "org.apache.xerces.parsers.SAXParser");
            try {
                XMLReader reader = XMLReaderFactory.createXMLReader();
                reader.setContentHandler(new InfoWithSax());
                reader.parse(new InputSource(new FileReader(args[0])));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
37
        public void startDocument() throws SAXException {
            System.out.println("START DOCUMENT");
        }

        public void endDocument() throws SAXException {
            System.out.println("END DOCUMENT");
        }

        // Initial values were lost from the slide; these are assumed.
        int depth = 0;
        String indent = "  ";

        // Prints one line, indented by the current element depth.
        private void println(String header, String value) {
            for (int i = 0; i < depth; i++) System.out.print(indent);
            System.out.println(header + ": " + value);
        }
38
        public void characters(char[] buf, int offset, int len)
                throws SAXException {
            String s = (new String(buf, offset, len)).trim();
            if (!"".equals(s)) println("CHARACTERS", s);
        }

        public void endElement(String namespaceURI, String localName,
                               String name) throws SAXException {
            depth--;
            String elementName = name;
            // Prefer the namespace-qualified name when one is available
            if (!"".equals(namespaceURI) && !"".equals(localName))
                elementName = namespaceURI + ":" + localName;
            println("END ELEMENT", elementName);
        }
39
        public void startElement(String namespaceURI, String localName,
                                 String name, Attributes attrs)
                throws SAXException {
            String elementName = name;
            if (!"".equals(namespaceURI) && !"".equals(localName))
                elementName = namespaceURI + ":" + localName;
            println("START ELEMENT", elementName);
            if (attrs != null && attrs.getLength() > 0)
                for (int i = 0; i < attrs.getLength(); i++)
                    println("ATTRIBUTE",
                            attrs.getLocalName(i) + "=" + attrs.getValue(i));
            depth++;
        }
    }
40Bachelor Tags
- What do you think happens when the parser parses a bachelor tag?
- <rating stars="five" />
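The question can be tried out directly. The sketch below, assuming the JDK's bundled parser, records what the handler is told for the bachelor tag above; EmptyElementDemo and eventsOf are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EmptyElementDemo extends DefaultHandler {
    static final String XML = "<rating stars=\"five\" />";

    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start " + qName + " stars=" + atts.getValue("stars"));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    static List<String> eventsOf(String xml) {
        try {
            EmptyElementDemo handler = new EmptyElementDemo();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        eventsOf(XML).forEach(System.out::println);
    }
}
```

The handler receives a start event immediately followed by an end event, exactly as it would for <rating stars="five"></rating>.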
41Attributes Interface
- Elements may have attributes
- There is no distinction between attributes that are written explicitly in the element and those that are supplied by the DTD (with a default value)
42Attributes Interface (cont.)
- int getLength()
- String getQName(int i)
- String getType(int i)
- String getValue(int i)
- String getType(String qname)
- String getValue(String qname)
- etc.
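A short sketch of the Attributes interface in use, assuming the JDK's bundled parser. The document and its internal DTD are invented for the illustration: stars is written in the tag, while scale comes only from a DTD default. AttributesDemo and attributesOf are illustrative names.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributesDemo extends DefaultHandler {
    // "scale" is never written in the tag; it is supplied by the DTD default.
    static final String XML =
        "<!DOCTYPE rating [<!ATTLIST rating scale CDATA \"10\">]>"
      + "<rating stars=\"five\"/>";

    // Attribute name -> "type:value", sorted by name for a stable printout.
    final Map<String, String> found = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            found.put(atts.getQName(i), atts.getType(i) + ":" + atts.getValue(i));
        }
    }

    static Map<String, String> attributesOf(String xml) {
        try {
            AttributesDemo handler = new AttributesDemo();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.found;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributesOf(XML));
    }
}
```

Both attributes arrive through the same Attributes object, so the handler cannot tell from these calls which one was defaulted.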
43Attributes Types
- The following are possible types for attributes
- "CDATA",
- "ID",
- "IDREF", "IDREFS",
- "NMTOKEN", "NMTOKENS",
- "ENTITY", "ENTITIES",
- "NOTATION"
44Setting Features
- It is possible to set the features of a parser using the setFeature method.
- Examples
- reader.setFeature("http://xml.org/sax/features/namespaces", true);
- reader.setFeature("http://xml.org/sax/features/validation", false);
- For a full list, see http://www.saxproject.org/?selected=get-set
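The two calls above can be tried as follows. This is a minimal sketch, assuming the XMLReader is obtained from the JDK's SAXParserFactory instead of XMLReaderFactory; FeatureDemo, configuredReader and isOn are illustrative names.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class FeatureDemo {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    // Returns a reader with namespace processing on and validation off.
    static XMLReader configuredReader() {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true);
            reader.setFeature(VALIDATION, false);
            return reader;
        } catch (Exception e) {
            throw new IllegalStateException("could not configure the reader", e);
        }
    }

    // Reads a feature back; unknown feature names raise a SAXException.
    static boolean isOn(XMLReader reader, String feature) {
        try {
            return reader.getFeature(feature);
        } catch (Exception e) {
            throw new IllegalStateException("feature not available: " + feature, e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = configuredReader();
        System.out.println("namespaces: " + isOn(reader, NAMESPACES));
        System.out.println("validation: " + isOn(reader, VALIDATION));
    }
}
```

Features should be set before parse is called; a parser may refuse to change some of them while a parse is in progress.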
45ErrorHandler Interface
- We implement ErrorHandler to receive error events (similar to implementing ContentHandler)
- DefaultHandler implements ErrorHandler with default methods (warning and error do nothing; fatalError throws the exception), so we can extend it (as before)
- An ErrorHandler is registered with
- reader.setErrorHandler(handler);
- Three methods
- void error(SAXParseException ex)
- void fatalError(SAXParseException ex)
- void warning(SAXParseException ex)
46Extending the InfoWithSax Program
    public void warning(SAXParseException err) throws SAXException {
        System.out.println("Warning in line " + err.getLineNumber()
                           + " and column " + err.getColumnNumber());
    }

    public void error(SAXParseException err) throws SAXException {
        System.out.println("Oy vaavoi, an error!");
    }

    public void fatalError(SAXParseException err) throws SAXException {
        System.out.println("OY VAAVOI, a fatal error!");
    }
47Which to Call
- Which callback should be used to report the violation of a validity constraint?
- Which callback should be used to report the violation of a well-formedness constraint?
- Candidates: warning, error, fatal error
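The two questions can be answered experimentally. In the sketch below, which assumes the JDK's bundled parser, one document violates its (invented) DTD and the other is not well formed; the handler records which callbacks fire. ErrorCallbackDemo and callbacksFor are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ErrorCallbackDemo extends DefaultHandler {
    // Well formed, but <a> is declared EMPTY and <b> is not declared at all.
    static final String INVALID = "<!DOCTYPE a [<!ELEMENT a EMPTY>]><a><b/></a>";
    // Not well formed: <b> is never closed.
    static final String MALFORMED = "<a><b></a>";

    final List<String> calls = new ArrayList<>();

    @Override public void warning(SAXParseException e) { calls.add("warning"); }
    @Override public void error(SAXParseException e) { calls.add("error"); }
    @Override public void fatalError(SAXParseException e) { calls.add("fatalError"); }

    // Parses xml and returns the names of the error callbacks that were invoked.
    static List<String> callbacksFor(String xml, boolean validating) {
        ErrorCallbackDemo handler = new ErrorCallbackDemo();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // After a fatal error the parser stops and throws;
            // the callback has already been recorded by then.
        } catch (Exception e) {
            throw new IllegalStateException("unexpected failure", e);
        }
        return handler.calls;
    }

    public static void main(String[] args) {
        System.out.println("invalid:   " + callbacksFor(INVALID, true));
        System.out.println("malformed: " + callbacksFor(MALFORMED, false));
    }
}
```

Validity violations arrive through error and parsing continues; well-formedness violations arrive through fatalError, after which the parser gives up.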
48Lexical Events
- Lexical events have to do with the way that a document was written and not with its content
- Examples
- A comment is a lexical event (<!-- comment -->)
- The use of an entity is a lexical event (&gt;)
- These can be dealt with by implementing the LexicalHandler interface, and set on a parser by
- reader.setProperty("http://xml.org/sax/properties/lexical-handler", mylexicalhandler);
49LexicalHandler
    // Methods (partial list)
    public void startEntity(String name)
    public void endEntity(String name)
    public void comment(char[] ch, int start, int length)
    public void startCDATA()
    public void endCDATA()
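A minimal sketch of a lexical handler at work, assuming the JDK's bundled parser. It extends org.xml.sax.ext.DefaultHandler2, which provides empty implementations of the LexicalHandler methods; LexicalEventsDemo and eventsOf are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class LexicalEventsDemo extends DefaultHandler2 {
    static final String XML = "<a><!-- a comment --><![CDATA[1 < 2]]></a>";

    final List<String> events = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        events.add("comment:" + new String(ch, start, length).trim());
    }

    @Override public void startCDATA() { events.add("startCDATA"); }
    @Override public void endCDATA() { events.add("endCDATA"); }

    static List<String> eventsOf(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            LexicalEventsDemo handler = new LexicalEventsDemo();
            reader.setContentHandler(handler);
            // Lexical handlers are registered as a property, not with a setter.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        eventsOf(XML).forEach(System.out::println);
    }
}
```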
50DOM: Document Object Model
53Creating a DOM Tree
- How can we create a DOM Tree independently of the implementation chosen?
- Creating a DOM Tree using the Apache Xerces package
- Import org.apache.xerces.parsers.DOMParser
- Import org.w3c.dom.*
- Use the following lines of code
- DOMParser dom = new DOMParser();
- dom.parse(fileName);
- Document doc = dom.getDocument();
54Using a DOM Tree
55Nodes in a DOM Tree
Figure as appears in The XML Companion - Neil Bradley
[Figure: the DOM interface hierarchy. Node is the base interface of Document, DocumentFragment, DocumentType, Element, Attr, CharacterData, Entity, EntityReference, Notation and ProcessingInstruction; Text and Comment extend CharacterData; CDATASection extends Text.]
56DOM Tree
[Figure: a DOM tree rooted at a Document node]
57Normalizing a Tree
- Normalizing a DOM Tree has two effects
- Combine adjacent textual nodes
- Eliminate empty textual nodes
- To normalize, apply the normalize() method to the
document element
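Both effects of normalizing can be seen on a tree built by hand. This is a sketch assuming the JDK's bundled DOM implementation; NormalizeDemo and beforeAndAfter are illustrative names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeDemo {
    // Builds <note> with three text children, one of them empty, then normalizes.
    static String beforeAndAfter() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .newDocument();
            Element root = doc.createElement("note");
            doc.appendChild(root);
            root.appendChild(doc.createTextNode("foo"));
            root.appendChild(doc.createTextNode(""));    // empty text node
            root.appendChild(doc.createTextNode("bar")); // adjacent text node

            int before = root.getChildNodes().getLength();
            root.normalize();
            int after = root.getChildNodes().getLength();
            return before + " -> " + after + ": " + root.getFirstChild().getNodeValue();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(beforeAndAfter());
    }
}
```

Adjacent text nodes of this kind typically appear after the tree has been edited, which is when normalizing is worth doing.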
58Node Methods
- Three categories of methods
- Node characteristics: name, type, value
- Contextual location and access to relatives: parents, siblings, children, ancestors, descendants
- Node modification: edit, delete, re-arrange child nodes
59Node Methods (2)
- short getNodeType()
- String getNodeName()
- String getNodeValue() throws DOMException
- void setNodeValue(String value) throws DOMException
- boolean hasChildNodes()
- NamedNodeMap getAttributes()
- Document getOwnerDocument()
60Node Types - getNodeType()
    ELEMENT_NODE = 1
    ATTRIBUTE_NODE = 2
    TEXT_NODE = 3
    CDATA_SECTION_NODE = 4
    ENTITY_REFERENCE_NODE = 5
    ENTITY_NODE = 6
    PROCESSING_INSTRUCTION_NODE = 7
    COMMENT_NODE = 8
    DOCUMENT_NODE = 9
    DOCUMENT_TYPE_NODE = 10
    DOCUMENT_FRAGMENT_NODE = 11
    NOTATION_NODE = 12

    if (myNode.getNodeType() == Node.ELEMENT_NODE) {
        // process node
    }
62Node Navigation
- Every node has a specific location in the tree
- The Node interface specifies methods to find surrounding nodes
- Node getFirstChild()
- Node getLastChild()
- Node getNextSibling()
- Node getPreviousSibling()
- Node getParentNode()
- NodeList getChildNodes()
63Node Navigation (2)
Figure as from The XML Companion - Neil Bradley
[Figure: a node and its relatives, reached with getParentNode(), getPreviousSibling(), getNextSibling(), getFirstChild(), getLastChild() and getChildNodes()]
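The navigation methods can be exercised on the sample transaction document. This is a sketch assuming the JDK's bundled parser; the document is inlined without whitespace so that no whitespace-only text nodes sit between the elements. NavigationDemo and walk are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NavigationDemo {
    static final String XML =
        "<transaction><account>89-344</account>"
      + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
      + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
      + "</transaction>";

    // Visits the root's children through the navigation methods and names them.
    static String walk(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Node root = doc.getDocumentElement();   // <transaction>
            Node first = root.getFirstChild();      // <account>
            Node second = first.getNextSibling();   // <buy>
            Node last = root.getLastChild();        // <sell>
            return String.join(" ",
                first.getNodeName(),
                second.getNodeName(),
                last.getPreviousSibling().getNodeName(), // back to <buy>
                last.getNodeName(),
                last.getParentNode().getNodeName(),      // back to <transaction>
                String.valueOf(root.getChildNodes().getLength()));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(walk(XML));
    }
}
```

With an indented document, getFirstChild() on the root would usually return a whitespace text node first, so real code checks getNodeType() while walking.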
64
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.*;

    public class InfoWithDom {

        public static void main(String[] args) {
            try {
                DOMParser dom = new DOMParser();
                dom.parse(args[0]);
                Document doc = dom.getDocument();
                new InfoWithDom().echo(doc);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
65
        private int depth = 0;
        private final String indent = "  ";
        // Indexed by the value returned from getNodeType()
        private String[] NODE_TYPES = {
            "", "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA", "ENTITY_REF",
            "ENTITY", "PROCESSING_INST", "COMMENT", "DOCUMENT",
            "DOCUMENT_TYPE", "DOCUMENT_FRAG", "NOTATION" };

        private void outputIndentation() {
            for (int i = 0; i < depth; i++) System.out.print(indent);
        }
66
        // Prints the properties every node type has in common.
        private void printlnCommon(Node n) {
            System.out.print(NODE_TYPES[n.getNodeType()] + ":");
            System.out.print(" nodeName=" + n.getNodeName());
            String val;
            if ((val = n.getNamespaceURI()) != null)
                System.out.print(" uri=" + val);
            if ((val = n.getPrefix()) != null)
                System.out.print(" pre=" + val);
            if ((val = n.getLocalName()) != null)
                System.out.print(" local=" + val);
            if ((val = n.getNodeValue()) != null && !val.trim().equals(""))
                System.out.print(" nodeValue=" + val);
            System.out.println();
        }
67
        // Prints a node, then its attributes (if any), then its children.
        private void echo(Node n) {
            outputIndentation();
            printlnCommon(n);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                NamedNodeMap atts = n.getAttributes();
                depth += 2;   // attributes are indented two extra levels
                for (int i = 0; i < atts.getLength(); i++)
                    echo(atts.item(i));
                depth -= 2;
            }
            depth++;
            for (Node child = n.getFirstChild(); child != null;
                 child = child.getNextSibling())
                echo(child);
            depth--;
        }
    }
68Node Manipulation
- Children of a node in a DOM tree can be manipulated: added, edited, deleted, moved, copied, etc.
    Node removeChild(Node oldChild) throws DOMException
    Node insertBefore(Node newChild, Node refChild) throws DOMException
    Node appendChild(Node newChild) throws DOMException
    Node replaceChild(Node newChild, Node oldChild) throws DOMException
    Node cloneNode(boolean deep)
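The five methods can be seen together in a short sketch, assuming the JDK's bundled DOM implementation and a made-up <list> document; ManipulationDemo, edit and childNames are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ManipulationDemo {
    static final String XML = "<list><a/><b/><c/></list>";

    // Applies each manipulation method once and returns the resulting child names.
    static String edit(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element list = doc.getDocumentElement();
            Node a = list.getFirstChild();
            Node b = a.getNextSibling();
            Node c = b.getNextSibling();

            list.removeChild(b);                          // a, c
            list.insertBefore(doc.createElement("d"), c); // a, d, c
            list.appendChild(a.cloneNode(true));          // a, d, c, a
            list.replaceChild(doc.createElement("e"), c); // a, d, e, a
            return childNames(list);
        } catch (Exception e) {
            throw new IllegalStateException("could not edit the document", e);
        }
    }

    // Joins the names of a node's children with spaces.
    static String childNames(Node parent) {
        StringBuilder names = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (names.length() > 0) names.append(' ');
            names.append(n.getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(edit(XML));
    }
}
```

New nodes are created through the owning Document (createElement here); a node taken from a different document would have to be imported first.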
69Node Manipulation (2)
Figure as appears in The XML Companion - Neil
Bradley
70Other Interfaces
- We have discussed methods of the Node interface
- Each of the "specific types of nodes" has additional methods
- See the API for details
71Note about DOM Objects
- DOM object ≈ compiled XML
- Can save time and effort if we send and receive DOM objects instead of XML source
- Saves having to parse XML files into DOM at sender and receiver
- But, a DOM object may be larger than the XML source
72JAXP
- Java API for XML Processing (originally named Java API for XML Parsing)
- Includes the DOM and SAX APIs as part of the Java API
- See the package
- javax.xml.parsers
73Getting a SAX Parser with JAXP
    import java.io.File;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    DefaultHandler handler = new MyDefaultHandlerImpl();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    try {
        SAXParser saxParser = factory.newSAXParser();
        // The handler receives the parsing events
        saxParser.parse(new File("example.xml"), handler);
    } catch (Exception e) { /* do something */ }
74Getting a DOM Parser with JAXP
    import java.io.File;
    import java.io.IOException;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.w3c.dom.*;

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    try {
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new File("example.xml"));
    } catch (SAXException se) { /* do something */ }
    catch (ParserConfigurationException | IOException e) { /* do something */ }