Title: A Table-Driven Streaming XML Parsing Methodology for High-Performance Web Services
1A Table-Driven Streaming XML Parsing Methodology
for High-Performance Web Services
- Wei Zhang
- Robert van Engelen
2Outline
- XML Performance
- Related Work: Schema-Specific XML Parsing (SSP)
- Table-Driven Streaming XML Parsing (TDX)
- Experiment Results
- Conclusion
3XML Performance
- XML messaging is at the heart of Web Services
- XML is widely seen as underperforming
- Increasingly, XML is being used in processes that demand high performance
- Validation is even worse
- Validation is typically applied during debugging and testing, and is often disabled in production systems
4Why are traditional XML parsers slow?
5Traditional parsers performance issues(1)
- Three stages of XML processing
- Well-formedness parsing
- Validation
- Application data handling
[Figure: processing layers, top to bottom: application, Validation, Parsing, XML]
6Traditional parsers performance issues(2)
- Frequent access to schema
- Comparison done on String (typically inefficient)
- Work duplicated between validator and deserializer
- Repeated data format validation and conversion (e.g. string/integer)
- Data copying
7Related Work Schema-Specific XML Parsing (1)
- Idea
- Constructing a parser that is hard-coded to process XML by exploiting schema information
- Merging well-formedness parsing and validation
[Figure: processing stack showing application and XML]
8Related Work Schema-Specific XML Parsing(2)
- Merging parsing and validation by
- Constructing PDA (Chiu 03)
- No namespace support
- Converting from NFA to DFA may result in exponentially growing space requirement
- Constructing DFA (van Engelen 04)
- Cannot process cyclic XML schema
- gSOAP toolkit (van Engelen 04)
- Based on recursive-descent parsing
- Not suitable for generic XML parsing without
application data (de)serialization
9Table-Driven Streaming XML Parsing Methodology
(TDX)
An integrated approach to XML parsing, validation, deserialization, and even
application-specific events for high-performance Web services
10Table-Driven Streaming XML Parsing Methodology (1)
- An LL(1) grammar can be generated from the schema
- XML well-formedness can be verified through grammar productions
- XML structure can be verified through grammar productions
- e.g. occurrence, enumeration simpleType
- CDATA value validation can be accomplished by semantic actions
- Application-specific events can also be encoded as semantic actions
11An Illustrating Example(1)
Schema (abbreviated syntax)
<element name="example" type="example_type"/>
12An Illustrating Example(1)
Schema (abbreviated syntax)
<element name="example" type="example_type"/>
Grammar
(1) s -> <example> t </example>
(2) t -> t1 t2 t3
(3) t1 -> <id> CDATA </id>
//isIdType()
(4) t2 -> <value> CDATA </value>
//isValueType()
(5) t3 -> <state> v </state>
(6) v -> ON EVENT //doStateOn()
(7) v -> OFF
13An Illustrating Example(2)
Top-down parsing tree
s
  <example> t </example>
    t1 t2 t3
      <id> CDATA </id>  (invoke isIdType())
      <value> CDATA </value>  (invoke isValueType())
      <state> v </state>
        ON EVENT  (invoke doStateOn())
14 TDX Architecture(1)
15 TDX Architecture(2)
16 TDX Architecture(3)
17 TDX Architecture(4)
18TDX Modularity
- TDX parsing engine is schema-independent
- Hot swap modules for SSP
19TDX Construction Toolkit(1)
- Two code generators: WSDL2TDX and LL2Table
- Given a schema or WSDL specification, the toolkit automatically generates tables for the parsing engine
20TDX Construction Toolkit(2)
- Why two generators?
- Application-specific events cannot be generated automatically
- Allows insertion of application-specific events
21TDX Scanner/Tokenizer
- The TDX scanner is also the runtime tokenizer
- Why tokenization?
- Comparison done on tokens (more efficient)
- Defined by component tags
- Element names, attribute names
- Classified as starting tags, ending tags
- Enumeration values
- CDATA, EVENT
- Normalized namespace binding
- <namespace, tag_name>
22Scanner/Tokenizer example
<book xmlns="x.org" xmlns:y="y.org">
  <title> XML Bible </title>
  <author>
    <name> Bob </name>
    <y:title> professor </y:title>
  </author>
</book>
Part of tokens
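The tokenization idea from the scanner slides can be sketched in Python. This is an illustration only, not the TDX scanner: the function names `tokenize` and `intern_tokens`, the `START`/`END` labels, and the regular expressions are assumptions made for this sketch, and it ignores ordinary attributes, self-closing tags, comments, and entities. It resolves each tag to a normalized <namespace, tag_name> pair and then replaces every distinct token with a small integer, so later comparisons are on integers rather than strings.

```python
import re

# Simplified patterns (assumptions for this sketch, not full XML syntax).
TAG_OR_TEXT = re.compile(r"<(/?)([\w.-]+:)?([\w.-]+)([^>]*)>|([^<]+)")
XMLNS = re.compile(r'xmlns(?::([\w.-]+))?="([^"]*)"')


def tokenize(xml):
    """Yield (kind, namespace, local_name) for tags and ("CDATA", text)."""
    scopes = [{}]  # stack of prefix -> namespace URI bindings
    for m in TAG_OR_TEXT.finditer(xml):
        closing, prefix, local, attrs, text = m.groups()
        if text is not None:
            if text.strip():
                yield ("CDATA", text.strip())
            continue
        prefix = prefix[:-1] if prefix else ""  # "" is the default namespace
        if closing:
            namespace = scopes[-1].get(prefix, "")
            scopes.pop()
            yield ("END", namespace, local)
        else:
            bindings = dict(scopes[-1])
            for declared_prefix, uri in XMLNS.findall(attrs):
                bindings[declared_prefix] = uri
            scopes.append(bindings)
            yield ("START", bindings.get(prefix, ""), local)


def intern_tokens(tokens):
    """Map each distinct token to a small integer; all CDATA share one id."""
    table, ids = {}, []
    for tok in tokens:
        key = ("CDATA",) if tok[0] == "CDATA" else tok
        ids.append(table.setdefault(key, len(table)))
    return ids, table


XML = ('<book xmlns="x.org" xmlns:y="y.org"> <title> XML Bible </title> '
       '<author> <name> Bob </name> <y:title> professor </y:title> '
       '</author> </book>')
tokens = list(tokenize(XML))
ids, table = intern_tokens(tokens)
```

In this sketch `<title>` and `<y:title>` receive different integer ids because they resolve to different namespaces, which is the point of normalizing the namespace binding before comparison.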
23Mapping Rules
- Define mapping from XML schema to LL(1) grammars
- Preserves structural constraints
- Many types of validation constraints are incorporated in the resulting grammar productions
- e.g., occurrence constraints
- Some type-checking constraints are incorporated as grammar productions
- e.g., enumeration simpleType
24Sample Mapping Rules
25Mapping Example
<complexType name="example">
  <sequence>
    <element name="id" type="id_type" minOccurs="0"/>
    <element name="value" type="value_type" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
</complexType>
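One plausible way to turn the occurrence constraints in this schema into grammar productions is sketched below in Python. The actual TDX mapping rules are not reproduced in this transcript, so everything here is an assumption: the functions `map_element` and `map_sequence`, the token names `b_id`, `e_id`, `CD`, and the use of an empty tuple for the empty production. The sketch covers only minOccurs of 0 or 1 and maxOccurs of 1 or unbounded.

```python
EPS = ()  # empty right-hand side (epsilon production)


def map_element(nt, name, min_occurs=1, max_occurs=1):
    """Return {nonterminal: [rhs, ...]} for one element particle."""
    body = (f"b_{name}", "CD", f"e_{name}")  # start tag, CDATA, end tag
    if max_occurs == "unbounded":
        # Right recursion keeps the grammar free of left recursion.
        tail = nt if min_occurs == 0 else nt + "'"
        prods = {tail: [body + (tail,), EPS]}
        if min_occurs == 1:
            prods[nt] = [body + (tail,)]
        return prods
    alternatives = [body]
    if min_occurs == 0:
        alternatives.append(EPS)  # optional element
    return {nt: alternatives}


def map_sequence(nt, particles):
    """Map a <sequence> of (name, minOccurs, maxOccurs) particles."""
    children = tuple(f"{nt}{i}" for i in range(1, len(particles) + 1))
    prods = {nt: [children]}
    for child, (name, lo, hi) in zip(children, particles):
        prods.update(map_element(child, name, lo, hi))
    return prods


grammar = map_sequence("t", [("id", 0, 1), ("value", 0, "unbounded")])
for lhs, alternatives in grammar.items():
    for rhs in alternatives:
        print(lhs, "->", " ".join(rhs) or "(empty)")
```

Under these assumptions the optional `id` element becomes a production with an empty alternative, and the repeatable `value` element becomes a right-recursive production with an empty alternative.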
26TDX Table Generation Example
Grammar
(1) s -> bE t eE
(2) t -> t1 t2 t3
(3) t1 -> bI CD eI
//isIdType()
(4) t2 -> bV CD eV
//isValueType()
(5) t3 -> bS v eS
//doStateOn()
(6) v -> cON EV
(7) v -> cOFF
27TDX Table Generation Example(2)
LL(1) Parse Table
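As a rough illustration of how a table could be derived from the seven productions of the table generation example, the Python sketch below computes FIRST sets and fills a dictionary keyed by (nonterminal, lookahead token). It is not the LL2Table generator: the names `GRAMMAR`, `first`, and `build_table` are hypothetical, and the sketch assumes a grammar with no empty productions and no left recursion, which holds for this example but not for grammars in general.

```python
# Productions from the table generation example, numbered from 1.
GRAMMAR = [
    ("s", ("bE", "t", "eE")),
    ("t", ("t1", "t2", "t3")),
    ("t1", ("bI", "CD", "eI")),
    ("t2", ("bV", "CD", "eV")),
    ("t3", ("bS", "v", "eS")),
    ("v", ("cON", "EV")),
    ("v", ("cOFF",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def first(symbol):
    """FIRST set of a symbol, assuming no epsilon and no left recursion."""
    if symbol not in NONTERMINALS:
        return {symbol}
    result = set()
    for lhs, rhs in GRAMMAR:
        if lhs == symbol:
            result |= first(rhs[0])
    return result


def build_table():
    """Return {(nonterminal, lookahead): production number}."""
    table = {}
    for number, (lhs, rhs) in enumerate(GRAMMAR, start=1):
        for terminal in first(rhs[0]):
            if (lhs, terminal) in table:
                raise ValueError(f"not LL(1): conflict at {(lhs, terminal)}")
            table[(lhs, terminal)] = number
    return table


TABLE = build_table()
for (nonterminal, lookahead), number in sorted(TABLE.items()):
    print(f"{nonterminal:3} {lookahead:5} -> production {number}")
```

Each cell holds at most one production number, so the engine can choose a production from one lookahead token without backtracking.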
28TDX Parsing Engine Example
[Figure: TDX parsing engine with its parsing table, token input, and stack]
Tokens: bE
Stack: s
29Parsing Example (contd)
Tokens: bE
Stack: bE t eE
30Parsing Example (contd)
Tokens: bE bI
Stack: t eE
31Parsing Example (contd)
Tokens: bE bI
Stack: t1 t2 t3 eE
32Parsing Example (contd)
Tokens: bE bI
Stack: bI CD eI t2 t3 eE
33Parsing Example (contd)
Tokens: bE bI CD
Stack: CD eI t2 t3 eE
invoke isIdType()
34Parsing Example (contd)
Tokens: bE bI CD
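The stack walk-through above follows the standard table-driven LL(1) loop, which can be sketched in Python as follows. This is a minimal illustration, not the TDX engine. Several details are assumptions: semantic actions are modeled as stack markers prefixed with `@`, the `EV` symbol of production (6) is read as such a marker rather than an input token, and the validators (`isIdType` accepts digits, `isValueType` accepts anything `float` can parse) are hypothetical stand-ins for the real type checks.

```python
PRODUCTIONS = {
    1: ("bE", "t", "eE"),
    2: ("t1", "t2", "t3"),
    3: ("bI", "CD", "@isIdType", "eI"),
    4: ("bV", "CD", "@isValueType", "eV"),
    5: ("bS", "v", "eS"),
    6: ("cON", "@doStateOn"),  # EV on the slide, modeled as an action marker
    7: ("cOFF",),
}
TABLE = {
    ("s", "bE"): 1, ("t", "bI"): 2, ("t1", "bI"): 3, ("t2", "bV"): 4,
    ("t3", "bS"): 5, ("v", "cON"): 6, ("v", "cOFF"): 7,
}
NONTERMINALS = {"s", "t", "t1", "t2", "t3", "v"}
END = "$"  # end-of-input marker


def parse(tokens, actions):
    """Parse (token, text) pairs; raise ValueError on invalid input."""
    stack = ["s"]
    stream = iter(tokens)
    look, text = next(stream, (END, None))
    last_text = None
    while stack:
        top = stack.pop()
        if top.startswith("@"):  # semantic action or application event
            actions[top[1:]](last_text)
        elif top in NONTERMINALS:  # expand using the parse table
            number = TABLE.get((top, look))
            if number is None:
                raise ValueError(f"unexpected {look} while expanding {top}")
            stack.extend(reversed(PRODUCTIONS[number]))
        elif top == look:  # match a terminal and advance
            last_text = text
            look, text = next(stream, (END, None))
        else:
            raise ValueError(f"expected {top}, got {look}")
    if look != END:
        raise ValueError("trailing input")


def is_id_type(text):
    if not text.isdigit():
        raise ValueError(f"invalid id: {text!r}")


events = []
ACTIONS = {
    "isIdType": is_id_type,
    "isValueType": float,  # raises ValueError on non-numeric text
    "doStateOn": lambda _text: events.append("ON"),
}

tokens = [("bE", None), ("bI", None), ("CD", "42"), ("eI", None),
          ("bV", None), ("CD", "3.14"), ("eV", None),
          ("bS", None), ("cON", None), ("eS", None), ("eE", None)]
parse(tokens, ACTIONS)
```

The loop itself contains no schema knowledge; only `PRODUCTIONS`, `TABLE`, and `ACTIONS` change from one schema to another, which matches the modularity claim made for the engine.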
35Experiment Results
- Test environment
- 2.4 GHz P4, 512 MB RAM, Red Hat Linux 3.2.2-5, GNU compiler g++ 3.2.3 with option -O2
- Memory-resident XML message
- Elapsed real time measured using gettimeofday() over 100 runs
- Compared with
- DFA-based Parser
- gSOAP 2.7
- eXpat 1.2
- Xerces 2.7.0
36Experiment Results(contd)
- XML Schema for echoString (abbreviated syntax)
<schema>
  <element name="echoString">
    <complexType>
      <sequence>
        <element name="input" type="xsd:string" maxOccurs="unbounded"/>
      </sequence>
    </complexType>
  </element>
</schema>
37Parsing Performance (1)
XML document size 1024B
38Parsing Performance (2)
39Conclusions
- TDX is fast
- Integrated approach across layers
- Avoid schema access at runtime
- Comparison done on tokens
- Avoid data copying
- Avoid format conversions
- Minimized function calls
- Optimization based on schema structure
40Conclusions (contd)
- XML can be parsed, validated, and deserialized efficiently for high-performance Web services using a table-driven methodology
- Can be up to several times faster than industry-strength high-performance validating XML parsers
- A table-driven methodology can offer a high level of modularity
- Provides a mechanism for integrating application-specific events, such as SOAP deserializers