Title: Parsing XML into programming languages
1Parsing XML into programming languages
- JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB
2Parsing XML
- Goal: read XML files into data structures in programming languages
- Possible strategies
- Parse by hand with some reusable libraries
- Parse into generic tree structure
- Parse as sequence of events
- Automagically parse to language-specific objects
3Parsing by-hand
- Advantages
- Complete control
- Good for simple needs: build on top of a regex package
- Disadvantages
- Must write the initial code yourself, even if it becomes generalized
- Pretty tedious and error prone
- Gets very hard when using a schema or DTD to validate
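To make the trade-off concrete, here is a minimal sketch of by-hand parsing built on the regex package. The class name HandParser and the sample XML are made up for illustration; the pattern only handles elements with no attributes and no child elements, which is exactly the kind of limitation that makes this approach error prone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library
final class HandParser {
    /** Returns the text of every <tag>...</tag> that has no attributes or child elements. */
    static List<String> textOf(String xml, String tag) {
        String t = Pattern.quote(tag);
        Pattern p = Pattern.compile("<" + t + ">([^<]*)</" + t + ">");
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(xml);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<students><name>Ann</name><name>Bob</name></students>";
        System.out.println(textOf(xml, "name")); // prints [Ann, Bob]
        // The same pattern silently misses an element that carries an attribute
        System.out.println(textOf("<name id=\"1\">Cy</name>", "name")); // prints []
    }
}
```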
4Parsing into generic tree structure
- Advantages
- Industry-wide, language-neutral standard exists, called DOM (Document Object Model)
- Learning DOM for one language makes it easy to learn for any other
- As of JAXP 1.2, support for XML Schema
- Have to write much less code to get XML into something you want to manipulate in your program
- Disadvantages
- Non-intuitive API, doesn't take full advantage of Java
- Still quite a bit of work
5What is JAXP?
- JAXP: Java API for XML Processing
- In the Java language, the definitions of these standard APIs (together with the XSLT API) comprise a set of interfaces known as JAXP
- Java also provides standard implementations together with a vendor pluggability layer
- Some of these come standard with the J2SDK, others are only available with the Web Services Developer Pack
- We will study these shortly
6Another alternative
- JDOM: native Java published API for representing XML as a tree
- Like DOM but much more Java-specific and object oriented
- However, not supported by other languages
- Also, no support for schema
- Dom4j: another alternative
7JAXB
- JAXB: Java Architecture for XML Binding
- Defines an API for automagically representing XML schema as collections of Java classes
- Most convenient for application programming
- Will cover next class
8DOM
9About DOM
- Stands for Document Object Model
- A World Wide Web Consortium (W3C) standard
- The standard is constantly adding new features: Level 3 Core was just released this month
- We'll cover most of the basics. There's always more, and it's always changing.
10DOM abstraction layer in Java -- architecture
Emphasis is on allowing vendors to supply their own DOM implementation without requiring changes to source code
(Diagram labels: the factory returns a specific parser implementation; parsing produces an org.w3c.dom.Document)
11Sample Code
The factory instance determines the parser implementation. It can be changed with a runtime system property. The JDK has a default; Xerces is much better.

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    /* set some factory options here */
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(xmlFile);

From the factory one obtains an instance of the parser.
xmlFile can be a java.io.File, an InputStream, etc.
For reference, the classes used are javax.xml.parsers.DocumentBuilderFactory, javax.xml.parsers.DocumentBuilder, and org.w3c.dom.Document. Notice that the Document class comes from the W3C-specified bindings.
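The factory, builder, parse sequence from this slide can be tried end to end with a sketch like the following. The class name ParseToDom and the inline sample document are assumptions for illustration; the document is parsed from a string so the example needs no file on disk.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class
final class ParseToDom {
    /** Parses XML text into a DOM Document using whatever factory JAXP selects. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // one example of a factory option
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<course><title>XML</title></course>");
        System.out.println(doc.getDocumentElement().getTagName()); // prints course
    }
}
```

Because the code only names the JAXP and W3C interfaces, switching the underlying parser should not require touching it.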
12Validation
- Note that by default the parser will not validate against a schema or DTD
- As of JAXP 1.2, Java provides a default parser that can handle most schema features
- See next slide for details on how to set this up
13Important Schema validation
    String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";

Next, you need to configure DocumentBuilderFactory to generate a namespace-aware, validating parser that uses XML Schema:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setValidating(true);
    try {
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
    } catch (IllegalArgumentException x) {
        // Happens if the parser does not support JAXP 1.2
        ...
    }
14Associating document with schema
- An XML file can be associated with a schema in two ways
- Directly in the XML file, in the regular way
- Programmatically from Java
- The latter is done as
- factory.setAttribute(JAXP_SCHEMA_SOURCE, new File(schemaSource));
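Putting the two validation slides together, a sketch of schema validation with a DOM parser might look like this. The class name, the one-element schema, and the temp-file handling are all made up for illustration. The value used for JAXP_SCHEMA_SOURCE is not given on the slides; it is assumed here to be the standard JAXP 1.2 property name. An ErrorHandler is installed because validation problems are reported to it rather than thrown.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical demo class
final class DomSchemaCheck {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE = // assumed value, not shown on the slides
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Made-up schema: a single <id> element holding an integer
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='id' type='xs:int'/>"
      + "</xs:schema>";

    /** Returns the validation messages reported while parsing xml (empty if valid). */
    static List<String> validate(String xml) throws Exception {
        File schemaFile = File.createTempFile("demo", ".xsd");
        schemaFile.deleteOnExit();
        Files.write(schemaFile.toPath(), XSD.getBytes(StandardCharsets.UTF_8));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        factory.setAttribute(JAXP_SCHEMA_SOURCE, schemaFile);

        List<String> problems = new ArrayList<>();
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { problems.add(e.getMessage()); }
            @Override public void error(SAXParseException e) { problems.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<id>42</id>"));
        System.out.println(validate("<id>forty-two</id>"));
    }
}
```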
15A few notes
- Factory allows ease of switching parser implementations
- Java provides a simple DOM implementation, but it is much better to use a vendor-supplied one when doing serious work
- Xerces, part of the Apache project, is installed on the cluster as an Eclipse plugin. We'll use it next week.
- Note that some properties are not supported by all parser implementations.
16Document object
- Once a Document object is obtained, there is a rich API to manipulate it.
- First call is usually
- Element root = doc.getDocumentElement();
- This gets the root element of the Document as an instance of the Element interface
- Note that Element extends Node and has methods getNodeType(), getNodeName(), getNodeValue(), and getChildNodes()
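As a small illustration of starting from the root element, the sketch below lists the element children of the root. The class name and sample document are made up; the type check is there because whitespace between tags also appears as child nodes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class
final class RootChildren {
    /** Returns the tag names of the element children of the document's root. */
    static List<String> childElementNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        NodeList children = root.getChildNodes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Whitespace between tags shows up as Text nodes, so filter by type
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<course>\n  <title>XML</title>\n  <room>101</room>\n</course>";
        System.out.println(childElementNames(xml)); // prints [title, room]
    }
}
```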
17Types of Nodes
- Note that there are many types of Nodes (i.e., subinterfaces of Node):
- Attr, CDATASection, Comment, Document, DocumentFragment, DocumentType, Element, Entity, EntityReference, Notation, ProcessingInstruction, Text
- Each of these has a special and non-obvious associated type, value, and name.
- Standards are language-neutral and are specified in the chart on the following slide
- Important: keep this chart nearby when using DOM
18Node types chart (nodeName, nodeValue, attributes, nodeType)
Node                   nodeName                   nodeValue                            attributes    nodeType
Attr                   attribute name             value of attribute                   null          2
CDATASection           #cdata-section             CDATA content                        null          4
Comment                #comment                   comment content                      null          8
Document               #document                  null                                 null          9
DocumentFragment       #document-fragment         null                                 null          11
DocumentType           doc type name              null                                 null          10
Element                tag name                   null                                 NamedNodeMap  1
Entity                 entity name                null                                 null          6
EntityReference        name of entity referenced  null                                 null          5
Notation               notation name              null                                 null          12
ProcessingInstruction  target                     entire content excluding the target  null          7
Text                   #text                      actual text                          null          3
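A quick way to check a few rows of the chart is to print the type, name, and value of real nodes. The sketch below (class name and sample document are made up) does this for a Document, an Element, an Attr, a Text, and a Comment node; note that the fixed names are reported with a leading "#", as in #text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class
final class NodeChart {
    /** Formats a node as "nodeType nodeName nodeValue". */
    static String describe(Node n) {
        return n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue();
    }

    /** Describes the document, its root, the root's attributes, and the root's children. */
    static List<String> describeAll(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> rows = new ArrayList<>();
        rows.add(describe(doc));
        Element root = doc.getDocumentElement();
        rows.add(describe(root));
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            rows.add(describe(attrs.item(i)));
        }
        NodeList kids = root.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            rows.add(describe(kids.item(i)));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        for (String row : describeAll("<note lang=\"en\">hi<!--bye--></note>")) {
            System.out.println(row);
        }
    }
}
```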
19Transforming XML
20The JAXP Transformation Packages
- JAXP Transformation APIs
- javax.xml.transform
- This package defines the factory class you use to get a Transformer object. You then configure the transformer with input (Source) and output (Result) objects, and invoke its transform() method to make the transformation happen. The source and result objects are created using classes from one of the other three packages.
- javax.xml.transform.dom
- Defines the DOMSource and DOMResult classes that let you use a DOM as an input to or output from a transformation.
- javax.xml.transform.sax
- Defines the SAXSource and SAXResult classes that let you use a SAX event generator as input to a transformation, or deliver SAX events as output to a SAX event processor.
- javax.xml.transform.stream
- Defines the StreamSource and StreamResult classes that let you use an I/O stream as an input to or output from a transformation.
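The Source and Result wiring can be seen in its simplest form with the stream classes alone. This is a sketch, with a made-up class name and input: a Transformer created without a stylesheet just copies its source to its result.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class
final class StreamCopy {
    /** Copies XML text from a StreamSource to a StreamResult and returns the output. */
    static String copy(String xml) throws Exception {
        // No stylesheet is given, so this Transformer performs an identity copy
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(copy("<a><b>text</b></a>"));
    }
}
```

Swapping StreamSource for a DOMSource or SAXSource changes only the constructor call; the transform() step stays the same.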
21Transformer Architecture
22Writing DOM to XML
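import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriteDOM {
    public static void main(String[] argv) throws Exception {
        // Parse the file named on the command line into a DOM tree
        File f = new File(argv[0]);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(f);

        // A Transformer with no stylesheet copies the DOM source to the stream result
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}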
23Creating a DOM from scratch
- Sometimes you may want to create a DOM tree directly in memory. This is done with
- DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
- DocumentBuilder builder = factory.newDocumentBuilder();
- document = builder.newDocument();
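Once newDocument() has returned an empty Document, nodes are created through the Document and then attached to the tree. A minimal sketch, with made-up class, element, and attribute names:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class
final class BuildDom {
    /** Builds <students><student id="1">Ann</student></students> in memory. */
    static Document build() throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = document.createElement("students");
        document.appendChild(root); // the first element appended becomes the root
        Element student = document.createElement("student");
        student.setAttribute("id", "1");
        student.appendChild(document.createTextNode("Ann"));
        root.appendChild(student);
        return document;
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        System.out.println(doc.getDocumentElement().getTagName());                      // prints students
        System.out.println(doc.getDocumentElement().getFirstChild().getTextContent()); // prints Ann
    }
}
```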
24Manipulating Nodes
- Once the root node is obtained, typical tree methods exist to manipulate other elements
- boolean node.hasChildNodes()
- NodeList node.getChildNodes()
- Node node.getNextSibling()
- Node node.getParentNode()
- String node.getNodeValue()
- String node.getNodeName()
- String node.getTextContent()
- void node.setNodeValue(String nodeValue)
- Node node.insertBefore(Node newChild, Node refChild)
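Here is a short sketch of a few of these tree methods working together: a new element is inserted in front of the root's first child, and the children are then walked with getNextSibling(). The class name and sample input are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class
final class NodeEdits {
    /** Inserts a new element as the first child of the root; returns the child names in order. */
    static List<String> insertAtFront(String xml, String newTag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        // If root has no children, ref is null and insertBefore appends at the end
        Node ref = root.getFirstChild();
        root.insertBefore(doc.createElement(newTag), ref);

        List<String> names = new ArrayList<>();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            names.add(n.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(insertAtFront("<r><b/><c/></r>", "a")); // prints [a, b, c]
    }
}
```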
25SAX
- Simple API for XML
26About SAX
- SAX in Java is hosted on SourceForge
- SAX is not a W3C standard
- Originated purely in Java
- Other languages have chosen to implement it in their own ways based on this prototype
27SAX vs.
- Please don't compare unrelated things
- SAX is an alternative to DOM, but realize that DOM is often built on top of SAX
- SAX and DOM do not compete with JAXP
- They do both compete with JAXB implementations
28How a SAX parser works
- A SAX parser scans an XML stream on the fly and responds to certain parsing events as it encounters them.
- This is very different from digesting an entire XML document into memory.
- Much faster, requires less memory.
- However, you need to reparse if you need to revisit data.
29Obtaining a SAX parser
- Important classes
- javax.xml.parsers.SAXParserFactory
- javax.xml.parsers.SAXParser
- javax.xml.parsers.ParserConfigurationException
- // get the parser
- SAXParserFactory factory = SAXParserFactory.newInstance();
- SAXParser saxParser = factory.newSAXParser();
- // parse the document
- saxParser.parse(new File(argv[0]), handler);
30DefaultHandler
- Note that an event handler has to be passed to the SAX parser.
- This must implement the interface
- org.xml.sax.ContentHandler
- Easier to extend the adapter
- org.xml.sax.helpers.DefaultHandler
31Overriding Handler methods
- Most important methods to override
- void startDocument()
- Called once when document parsing begins
- void endDocument()
- Called once when parsing ends
- void startElement(...)
- Called each time an element begin tag is encountered
- void endElement(...)
- Called each time an element end tag is encountered
- void characters(...)
- Called, possibly several times, between startElement and endElement calls to deliver the accumulated character data
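The order in which these callbacks fire is easiest to see by recording them. The following sketch (class name and sample input are made up) extends DefaultHandler and logs each document and element event as it arrives.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo handler
final class EventTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument"); }

    @Override public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String sName, String qName, Attributes attrs) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String sName, String qName) {
        events.add("end:" + qName);
    }

    /** Parses xml and returns the events in the order the parser reported them. */
    static List<String> trace(String xml) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        EventTrace handler = new EventTrace();
        saxParser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(trace("<a><b/></a>"));
        // prints [startDocument, start:a, start:b, end:b, end:a, endDocument]
    }
}
```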
32startElement
- public void startElement(
      String namespaceURI,  // namespace URI, if a namespace is associated
      String sName,         // non-qualified (simple) name
      String qName,         // qualified name
      Attributes attrs)     // list of attributes
- Attribute info is obtained by querying the Attributes object.
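Querying the Attributes object might look like the sketch below, which records every attribute seen on every element. The class name, the "element@attribute" key format, and the sample input are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo handler
final class AttrDump extends DefaultHandler {
    final Map<String, String> seen = new LinkedHashMap<>();

    @Override
    public void startElement(String namespaceURI, String sName, String qName, Attributes attrs) {
        // Attributes can be read by index (as here) or looked up by name with getValue(String)
        for (int i = 0; i < attrs.getLength(); i++) {
            seen.put(qName + "@" + attrs.getQName(i), attrs.getValue(i));
        }
    }

    /** Returns a map from "element@attribute" to the attribute value. */
    static Map<String, String> attributesOf(String xml) throws Exception {
        AttrDump handler = new AttrDump();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(attributesOf("<student id='7' year='2'>Ann</student>"));
    }
}
```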
33Characters
- public void characters(
      char[] buf,   // buffer of chars accumulated
      int offset,   // index of the first char in buf
      int len)      // number of chars
- Note, characters may be called more than once between begin tag / end tag
- Also, mixed-content elements require careful handling
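Because characters() can fire several times for one element, a common approach is to append into a buffer and only read it at endElement. The sketch below (made-up class name and input) does that for simple leaf elements; it deliberately ignores the mixed-content case, which would need a stack of buffers.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo handler
final class TextCollector extends DefaultHandler {
    private final StringBuilder buf = new StringBuilder();
    final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String sName, String qName, Attributes attrs) {
        buf.setLength(0); // a new element starts with an empty buffer
    }

    @Override
    public void characters(char[] ch, int offset, int len) {
        buf.append(ch, offset, len); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String sName, String qName) {
        String text = buf.toString().trim();
        if (!text.isEmpty()) {
            texts.add(text);
        }
        buf.setLength(0);
    }

    /** Returns the non-blank text of each leaf element, in document order. */
    static List<String> collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<r><n>Ann</n><n>Bob &amp; Co</n></r>")); // prints [Ann, Bob & Co]
    }
}
```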
34Entity references
- Recall that entity references are special character sequences for referring to characters that have special meaning in XML syntax
- &lt; is <
- &gt; is >
- In SAX these are automatically converted and passed to the characters stream unless they are part of a CDATA section
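The difference between ordinary text and a CDATA section can be checked with a handler that just concatenates everything passed to characters(). The class name and sample inputs below are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class
final class EntityDemo {
    /** Returns all character data the SAX parser reports for the given document. */
    static String text(String xml) throws Exception {
        StringBuilder sb = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int offset, int len) {
                        sb.append(ch, offset, len);
                    }
                });
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(text("<a>1 &lt; 2</a>"));              // prints 1 < 2
        System.out.println(text("<a><![CDATA[1 &lt; 2]]></a>")); // prints 1 &lt; 2
    }
}
```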
35Choosing a Parser
- Choosing your Parser Implementation
- If no other factory class is specified, the default SAXParserFactory class is used. To use a different manufacturer's parser, you can change the value of the system property that points to it. You can do that from the command line, like this:
- java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...
- The factory name you specify must be a fully qualified class name (all package prefixes included). For more information, see the documentation of the newInstance() method of the SAXParserFactory class.
36Validating SAX Parsers
    String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";

Next, you need to configure SAXParserFactory to generate a namespace-aware, validating parser that uses XML Schema. With SAX, the schema language is set as a property on the SAXParser obtained from the factory:

    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setValidating(true);
    SAXParser saxParser = factory.newSAXParser();
    try {
        saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
    } catch (SAXNotRecognizedException x) {
        // Happens if the parser does not support JAXP 1.2
        ...
    }
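A complete SAX validation run could be sketched as follows. In the JAXP API as understood here, the schema properties are set on the SAXParser with setProperty(), and validation errors arrive through the handler's error() callback. The class name, the one-element schema, the temp file, and the schemaSource property value are assumptions for illustration.

```java
import java.io.File;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class
final class SaxSchemaCheck {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE = // assumed value, not shown on the slides
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Made-up schema: a single <id> element holding an integer
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='id' type='xs:int'/>"
      + "</xs:schema>";

    /** Returns the validation errors reported while parsing xml (empty if valid). */
    static List<String> validate(String xml) throws Exception {
        File schemaFile = File.createTempFile("demo", ".xsd");
        schemaFile.deleteOnExit();
        Files.write(schemaFile.toPath(), XSD.getBytes(StandardCharsets.UTF_8));

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        SAXParser saxParser = factory.newSAXParser();
        // The language must be set before the source
        saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        saxParser.setProperty(JAXP_SCHEMA_SOURCE, schemaFile);

        List<String> problems = new ArrayList<>();
        saxParser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                problems.add(e.getMessage()); // DefaultHandler would otherwise ignore it
            }
        });
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<id>42</id>"));
        System.out.println(validate("<id>forty-two</id>"));
    }
}
```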
37Transforming arbitrary data structures using SAX
and Transformer
38Goal
- Now that we know SAX and a little about transformations, there are some cool things we can do.
- One immediate thing is to create XML files from plain text files with the help of a faux SAX parser
- Turns out to be more robust than doing it by hand
39Transformers
- Recall that transformers easily let us go between any source and result by arbitrary wirings of
- StreamSource / StreamResult
- SAXSource / SAXResult
- DOMSource / DOMResult
- We used this to write a DOM tree to an XML file
- Now we will use a SAXSource together with a
StreamResult to convert our text file
40Strategy
- We construct our own SAXParser, i.e. a class that implements the XMLReader interface
- This class must have a parse method (among others)
- We use parse to read our input file and fire the appropriate SAX events.
41What?
- What are we really doing here?
- We're having the SAXParser pretend as though it has encountered certain SAX XML events when it reads the text file.
- Exactly where we pretend these things occur is where the appropriate XML will get written by the transformer
42Main snippet
public static void main(String[] argv) throws Exception {
    // Create SAX parser
    StudentReader parser = new StudentReader();

    // Create transformer
    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer();

    // Use text file as Transformer source (read through the buffered reader)
    FileReader fr = new FileReader("students.txt");
    BufferedReader br = new BufferedReader(fr);
    InputSource inputSource = new InputSource(br);
    SAXSource source = new SAXSource(parser, inputSource);

    // Write the resulting XML text to System.out
    StreamResult result = new StreamResult(System.out);
    transformer.transform(source, result);
}
43XMLReader implementation
- To have a valid SAXSource we need a class that implements the XMLReader interface
- public void parse(InputSource input)
- public void setContentHandler(ContentHandler handler)
- public ContentHandler getContentHandler()
- ...
- Shown are the important methods for a simple app
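The slides do not show the body of StudentReader, so the following is only a guess at its shape: a sketch that assumes the text file holds one student name per line and that the output elements are called students and student. The parse method fires the SAX events by hand, and the remaining XMLReader methods are left as no-ops.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical implementation: one student name per input line (assumed file format)
final class StudentReader implements XMLReader {
    private ContentHandler handler;

    @Override
    public void parse(InputSource input) throws IOException, SAXException {
        if (handler == null) throw new SAXException("no ContentHandler set");
        if (input.getCharacterStream() == null) throw new SAXException("character stream required");
        BufferedReader br = new BufferedReader(input.getCharacterStream());
        AttributesImpl noAttrs = new AttributesImpl();
        // Pretend we saw the XML that corresponds to the text we read
        handler.startDocument();
        handler.startElement("", "students", "students", noAttrs);
        String line;
        while ((line = br.readLine()) != null) {
            String name = line.trim();
            if (name.isEmpty()) continue;
            handler.startElement("", "student", "student", noAttrs);
            handler.characters(name.toCharArray(), 0, name.length());
            handler.endElement("", "student", "student");
        }
        handler.endElement("", "students", "students");
        handler.endDocument();
    }

    @Override
    public void parse(String systemId) throws SAXException {
        throw new SAXException("only InputSource with a character stream is supported");
    }

    @Override public void setContentHandler(ContentHandler h) { handler = h; }
    @Override public ContentHandler getContentHandler() { return handler; }

    // The remaining XMLReader methods are no-ops in this sketch
    @Override public boolean getFeature(String name) { return false; }
    @Override public void setFeature(String name, boolean value) { }
    @Override public Object getProperty(String name) { return null; }
    @Override public void setProperty(String name, Object value) { }
    @Override public void setEntityResolver(EntityResolver resolver) { }
    @Override public EntityResolver getEntityResolver() { return null; }
    @Override public void setDTDHandler(DTDHandler h) { }
    @Override public DTDHandler getDTDHandler() { return null; }
    @Override public void setErrorHandler(ErrorHandler h) { }
    @Override public ErrorHandler getErrorHandler() { return null; }

    /** Converts plain text (one name per line) to XML via SAXSource and StreamResult. */
    static String toXml(String text) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        SAXSource source = new SAXSource(new StudentReader(), new InputSource(new StringReader(text)));
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("Ann\nBob\n"));
    }
}
```

Letting the transformer write the tags means escaping and well-formedness are handled by the serializer, which is the robustness gain over printing angle brackets by hand.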
44Extra Credit?
- Volunteer to present this next class?
45End