Title: Emory University
1 XML Evolution: Two-phase XML Processing Model Using XML Prefiltering Techniques
Demonstrated at VLDB 2006.
- Emory University
- November 17, 2006
Chia-Hsin Huang, IIS, Academia Sinica
Tyng-Ruey Chuang, IIS, Academia Sinica
James J. Lu, MathCS, Emory
Hahn-Ming Lee, CSIE, NTUST
2 Agenda
- Issues of Conventional XML Processing Models (DOM and SAX)
- Motivation
- Two-phase XML Processing Models and Prefiltering Techniques
- Experiments and Analysis
- GIS Applications (Optional)
- Conclusion and Future Work
- Related Work (Optional)
3 Issues of Conventional XML Processing Model: DOM (1/4)
XPath expression: /html//p/text
child axis
descendant axis
Source: http://www.cee.hw.ac.uk/alison/netapp/dom/sld006.htm
4 Issues of Conventional XML Processing Model: DOM (2/4)
- Pros
- Provides flexible tree traversal
- Suitable for supporting XPath axes
- Random access to the document
- Cons
- Needs a lot of resources to build a DOM tree
- CPU time
- Memory space
- (Size of the XML doc.) : (Size of the DOM tree) = 1 : 5
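The trade-off above can be sketched with Python's standard `xml.dom.minidom`. This is only an illustration, not code from the talk: the sample document and the function name `html_descendant_p_text` are assumptions. It shows that the whole tree is built before any axis can be followed, after which child and descendant steps are easy.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document, not taken from the slides.
DOC = "<html><body><div><p>first</p></div><p>second</p></body></html>"

def html_descendant_p_text(xml_text):
    """Collect the text of every <p> below the <html> root, in document order."""
    dom = parseString(xml_text)            # the complete tree is built in memory here
    root = dom.documentElement             # child axis from the document node
    if root.tagName != "html":
        return []
    texts = []
    for p in root.getElementsByTagName("p"):   # descendant axis
        texts.append("".join(
            node.data for node in p.childNodes
            if node.nodeType == node.TEXT_NODE))
    return texts

if __name__ == "__main__":
    print(html_descendant_p_text(DOC))     # ['first', 'second']
```

Every node of the document is materialized even though only the two `<p>` elements are needed, which is the cost the Cons list refers to.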
5 Issues of Conventional XML Processing Model: SAX (3/4)
XPath expression: //entry[@id="a2"]
Backward-reference XPE: //foo[text()="baz"]/ancestor::entry
Source: http://www.informatik.hu-berlin.de/obecker/Lehre/SS2002/XML/images/sax.t.gif
6 Issues of Conventional XML Processing Model: SAX (4/4)
- Pros (compared with the DOM model)
- Consumes far fewer resources
- A constant and small amount of memory
- Supports stream processing
- Cons
- Must parse the whole document
- No backtracking mechanism (forward-only parsing)
- Lack of interactive mechanisms
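A minimal SAX sketch in Python (standard `xml.sax`) illustrates both sides of this list. The sample document and the handler name `EntryCounter` are assumptions for illustration. The handler keeps only two counters, so memory stays constant, but every start tag in the document is still delivered and there is no way to step backwards.

```python
import xml.sax

# Hypothetical sample document, not taken from the slides.
DOC = '<log><entry id="a1"/><entry id="a2"><foo>baz</foo></entry></log>'

class EntryCounter(xml.sax.ContentHandler):
    """Counts <entry> elements with a given id while streaming."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.matches = 0    # elements that satisfy the test
        self.events = 0     # every start tag the parser delivered

    def startElement(self, name, attrs):
        self.events += 1
        if name == "entry" and attrs.get("id") == self.wanted_id:
            self.matches += 1

def count_entries(xml_text, wanted_id):
    handler = EntryCounter(wanted_id)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matches, handler.events

if __name__ == "__main__":
    print(count_entries(DOC, "a2"))   # (1, 4)
```

One match costs four start-tag events here; on a large document the ratio of delivered events to useful ones is what the following slides try to reduce.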
7 Problems in Standard DOM and SAX Processing Models
- Both DOM and SAX processing models waste a large
amount of computational resources by processing
uninteresting fragments.
8 Motivation
- If programs require only small parts of the document, why do we need to process the entire document in order to find those fragments?
- Is there any way to prevent DOM or SAX parsers from processing the entire document, and how?
- Can we improve the standard DOM/SAX processing models without modifying them (or with only minor modifications)?
- What is the benefit?
- What cost will we pay?
9 XML Prefiltering Technique
Our Solution
XPath expression (issued by users or apps)
Prefiltering techniques (a tiny search engine)
Candidate-set XML document
XML parsers (DOM/SAX)
XML document
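The flow in this diagram can be sketched as two function calls. The code below is a stand-in, not the authors' prefilter: `prefilter` and `two_phase_query` are hypothetical names, the sample document is invented, and phase 1 here still parses the whole input with ElementTree, whereas the prefilter described in the talk works from an index. The point is only the shape of the pipeline: phase 1 emits a smaller candidate-set document that keeps the ancestors of the candidates, and phase 2 runs the unchanged query on it with an ordinary parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document and query, not taken from the slides.
DOC = ("<site><regions>"
       "<asia><item><name>n1</name></item></asia>"
       "<europe><item><name>n2</name></item></europe>"
       "</regions><people><person>p</person></people></site>")
PATH = "./regions/asia/item/name"

def prefilter(xml_text, keep_tag):
    """Stand-in for phase 1: keep subtrees rooted at keep_tag plus their ancestors."""
    root = ET.fromstring(xml_text)

    def prune(node):
        if node.tag == keep_tag:
            return True                      # keep the whole candidate subtree
        kept = False
        for child in list(node):
            if prune(child):
                kept = True
            else:
                node.remove(child)           # drop an uninteresting fragment
        return kept

    prune(root)
    return ET.tostring(root, encoding="unicode")

def two_phase_query(xml_text, path, keep_tag):
    candidate = prefilter(xml_text, keep_tag)          # phase 1: prefiltering
    root = ET.fromstring(candidate)                    # phase 2: ordinary parser
    return candidate, [e.text for e in root.findall(path)]

if __name__ == "__main__":
    candidate, names = two_phase_query(DOC, PATH, "asia")
    print(candidate)
    print(names)
```

The query result on the candidate-set document is the same as on the full document, which is the 100% recall requirement in miniature.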
10 Two-phase XML Processing Model: Enhanced User Applications
11 Two-phase XML Processing Model: Enhanced XPath Processors
12 Two-phase XML Processing Model: Enhanced User Applications
13 Two-phase XML Processing Model: Enhanced Stream-based XPath Processors
14 A Source Code Fragment of an XPath Processor with the XML Prefilter
15 An Example
- XPath expression: //A//E
- Answers: sub-trees rooted by (E8,E15) and (E9,E14)
XPath Processors
XML Prefilter
The candidate-set XML Document
The XML Document
Exact Answers
16 Prefiltering Technique: Requirements and Limitations
- Requirements
- 100% recall rate (correctness)
- Transparency (easy to use)
- Non-intrusive (easy to integrate into XML processors)
- Lightweight (XML DBs are expensive)
- Efficient
- Limitations
- Needs a user query (we do not take multiple queries at a time because the candidate-set XML doc. may still be very large)
- Needs to preprocess the XML document (best suited to large, infrequently updated documents)
17 System Architecture of the Prefiltering Technique (DOM)
18 System Architecture of the Prefiltering Technique (SAX)
19 Prefiltering Technique: Indexer
- Position List: (start tag position, end tag position). We use the preorder number to express the tag offset.
- Random access to the document by moving the file pointer to tag positions
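A position list of this kind can be sketched with Python's `xml.parsers.expat`. This is an assumption-laden illustration: it records byte offsets via `CurrentByteIndex` rather than the preorder numbers the slide mentions, and the function name `build_position_index` and the sample bytes are invented.

```python
import xml.parsers.expat

# Hypothetical sample document, not taken from the slides.
DOC = b"<A><B>x</B><B>y</B></A>"

def build_position_index(xml_bytes):
    """Return {tag: [(start_tag_offset, end_tag_offset), ...]} as byte offsets."""
    parser = xml.parsers.expat.ParserCreate()
    index, stack = {}, []

    def on_start(name, attrs):
        # CurrentByteIndex points at the '<' of the start tag being reported.
        stack.append((name, parser.CurrentByteIndex))

    def on_end(name):
        tag, start_pos = stack.pop()
        # Here CurrentByteIndex points at the '<' of the matching end tag.
        index.setdefault(tag, []).append((start_pos, parser.CurrentByteIndex))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(xml_bytes, True)
    return index

if __name__ == "__main__":
    index = build_position_index(DOC)
    start, end = index["B"][1]
    # Random access: slice (or seek) straight to the second <B> element.
    print(DOC[start:end])
```

With offsets stored per tag name, a later pass can jump to any element, or past it, without parsing what lies in between.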
20 Prefiltering Technique: Query Simplifier (QS)
- Goal: reduce the cost of query evaluation
- Simplification Rules
- SR1: omitting internal steps (b)
- SR2: omitting branch steps (c1 and c2)
- SR3: omitting wildcard steps (d)
- SR4: replacing the parent/child axes with the ancestor/descendant axes (e)
- Always apply SR4, because our query evaluation algorithm can determine ancestor/descendant relationships more efficiently than parent/child relationships
21 Prefiltering Technique: Query Simplifier (QS)
- SR1: omitting internal steps (b)
- SR2: omitting branch steps (c1 and c2)
- SR3: omitting wildcard steps (d)
- SR4: replacing the parent/child axes with the ancestor/descendant axes (e)
22 Query Simplifier SR5: Omitting Uninformative Steps
Skipped intermediate nodes
XPE: /A/B/C/E/F
simXPE: //C//F (returns the same results). The prefilter runs more efficiently.
Intermediate nodes
A, B, and E are uninformative steps
Matched nodes
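The simplification from /A/B/C/E/F to //C//F can be sketched for plain name-test paths. This is a toy under stated assumptions: the function `simplify` is hypothetical, the caller supplies the set of informative steps, predicates and branch steps (SR2) are not handled, and the last step is always kept so the query target survives.

```python
import re

def simplify(xpe, informative):
    """Simplify a predicate-free XPath made of name tests.

    Drops wildcard steps (SR3), drops steps not listed as informative
    (SR1/SR5), and joins what is left with the descendant axis (SR4).
    """
    steps = [s for s in re.split(r"/+", xpe) if s]
    kept = []
    for i, step in enumerate(steps):
        is_last = i == len(steps) - 1
        if step == "*" and not is_last:
            continue                      # SR3: omit wildcard steps
        if step in informative or is_last:
            kept.append(step)             # informative step, or the query target
    return "".join("//" + s for s in kept)

if __name__ == "__main__":
    print(simplify("/A/B/C/E/F", {"C", "F"}))   # //C//F
    print(simplify("/A/*/C", {"A"}))            # //A//C
```

The simplified expression may match more nodes than the original in general, which is acceptable for a prefilter because the XML processor in phase 2 still evaluates the original query.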
23 Prefiltering Technique: Fast Lightweight Steps-Axes Analyzer (FLISA)
- Determines the candidate fragments in the XML document by evaluating the simplified XPath expression
The equations for evaluating u/axis::v
The answer of //A//E is (E8,E15). Note that the subtree rooted by (E9, E14) will be removed, since it lies inside (E8,E15).
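The equations themselves did not survive on this slide, so the following is only a sketch of the usual interval-containment test for the descendant axis, applied to (start, end) position pairs. The function names and the position of A are assumptions; only the two E pairs come from the slide.

```python
def descendants(ancestors, nodes):
    """Keep the (start, end) pairs in `nodes` that lie strictly inside some ancestor."""
    return [(s, e) for (s, e) in nodes
            if any(a_s < s and e < a_e for (a_s, a_e) in ancestors)]

def outermost(fragments):
    """Drop fragments nested inside an already kept fragment.

    Element intervals are either disjoint or properly nested, so after sorting
    by start position it is enough to compare with the last kept interval.
    """
    result = []
    for s, e in sorted(fragments):
        if result and result[-1][0] < s and e < result[-1][1]:
            continue
        result.append((s, e))
    return result

# Position of A and the third E are assumed for illustration.
A_POSITIONS = [(2, 17)]
E_POSITIONS = [(8, 15), (9, 14), (18, 19)]

if __name__ == "__main__":
    matched = descendants(A_POSITIONS, E_POSITIONS)
    print(matched)              # [(8, 15), (9, 14)]
    print(outermost(matched))   # [(8, 15)]
```

Keeping only the outermost fragment loses nothing: the nested E is still inside the fragment handed to the XML processor.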
24 Prefiltering Technique: Fragment Gatherer (FG)
- Generates a candidate-set XML document
- Generates fragments (simply the outputs of FLISA)
- Generates path information (only deals with the descendant axis)
- Parses the XML document from the root
- When a start tag is recognized, uses its position to look up the corresponding end-tag position in the inverted index table
- Checks whether the parsed node N contains any candidate fragment as its descendant or N itself is a candidate fragment
- If yes, outputs N.
- If not, moves the file pointer directly to its end-tag position (skipping the fragment)
- Note that we currently have no efficient way to generate the path information if the user's XPath expression contains the preceding, following, or sibling axes.
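The output-or-skip decision above can be simulated on a list of (name, start, end) triples in document order. This is a sketch, not the authors' implementation: `gather`, the node list, and the positions are assumptions, and skipping is emulated with a `skip_until` marker instead of a real file-pointer move. Nodes inside a candidate fragment are emitted along with the candidate itself.

```python
def gather(nodes, candidates):
    """Return the nodes that belong in the candidate-set document.

    nodes: (name, start, end) triples in document order.
    candidates: (start, end) pairs produced by the query evaluation step.
    """
    output = []
    skip_until = -1                      # end position of the fragment being skipped
    for name, start, end in nodes:
        if start < skip_until:
            continue                     # lies inside a skipped fragment
        inside = any(cs <= start and end <= ce for cs, ce in candidates)
        contains = any(start < cs and ce < end for cs, ce in candidates)
        if inside or contains:
            output.append((name, start, end))   # candidate content or path information
        else:
            skip_until = end             # emulate moving the file pointer to the end tag
    return output

# Hypothetical document outline; only (8, 15) and (9, 14) echo the slides.
NODES = [("R", 1, 20), ("X", 2, 5), ("Y", 3, 4),
         ("A", 6, 17), ("E", 8, 15), ("E", 9, 14), ("Z", 18, 19)]
CANDIDATES = [(8, 15)]

if __name__ == "__main__":
    print(gather(NODES, CANDIDATES))
```

In this run X and Z are skipped as whole fragments, and Y is never examined because it sits inside X.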
25 Prefiltering Technique: Micro XML Streaming Parser (MXSP)
- Transforms the candidate fragments into SAX events
- The procedure is similar to that of the Fragment Gatherer
- Provides interactive mechanisms through the following additional flow-control operators:
- close-the-current-fragment (CCF)
- jump-to-the-next-fragment (JNF)
- terminate-the-parsing-process (TPP)
- parse-next-node (PNN)
- reparse-previous-fragment (RPF)
- reparse-current-fragment (RCF)
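How an application might steer such a parser can be sketched with a callback that returns an operator after each node. Everything here is hypothetical: the slides give only the operator names, so the semantics below are a plausible reading of three of them (PNN, JNF, TPP), and the remaining operators are left out.

```python
from enum import Enum

class Op(Enum):
    PNN = "parse-next-node"
    JNF = "jump-to-the-next-fragment"
    TPP = "terminate-the-parsing-process"

def run(fragments, handler):
    """Deliver nodes fragment by fragment and obey the operator the handler returns.

    Returns how many nodes were actually delivered to the handler.
    """
    delivered = 0
    for fragment in fragments:
        for node in fragment:
            delivered += 1
            op = handler(node)
            if op is Op.JNF:
                break                    # skip the rest of this fragment
            if op is Op.TPP:
                return delivered         # stop parsing altogether
    return delivered

# Hypothetical fragments: the first node of each stands for a bounding-box test.
FRAGMENTS = [["bbox-miss", "c1", "c2"],
             ["bbox-hit", "c3", "c4"],
             ["bbox-miss", "c5"]]

def skip_misses(node):
    return Op.JNF if node == "bbox-miss" else Op.PNN

if __name__ == "__main__":
    print(run(FRAGMENTS, skip_misses))   # 5 of the 8 nodes are delivered
```

This is the interaction plain SAX lacks: the application can tell the parser, mid-stream, that the rest of a fragment is not worth parsing.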
26 Experiments and Analysis
27 Experiment and Analysis: Testing of Attribute-Testing Nodes
Path of the query: /site/regions/namerica/item[@id="item20748"]/name
Dataset: XMark Benchmark. Source: http://www-rocq.inria.fr/gemo/Gemo/Projects/SUMMARY/DTD-xmark.jpg
28 Querying Large XML Docs
Query: /site/regions/item[@id="item1"]/name (matching one node)
N/A means that the method ran out of memory and did not finish.
29 Querying Large XML Docs
Query: /child::site/child::regions/child::asia (matching 4.5% of the nodes of the source document)
N/A means that the method ran out of memory and did not finish.
30 Chinese Treebank
- Semantically annotated corpus
- Help parse and study Chinese sentences
- Applications
- Machine translation processing
- Building example-based parsers
- Comparing and integrating grammars
- Developing and enlarging Treebank
- ...
- About 20,000 sentences in the CKIP Treebank V1.0
- VP(HeadVK1??goalNP(HeadNdabe???))
(http://godel.iis.sinica.edu.tw/CKIP/trees1000.txt)
31 Experiment and Analysis: Sample Queries (1/2)
32 Experiment and Analysis: Sample Queries (2/2)
33 Experiment and Analysis: Treebank Search Engine
Over-simplifying a query
- StreamPCRI is a stream-based structural pattern matching algorithm.
Setup: an Intel Pentium-4 PC running at 2.53 GHz with 1 GB DDR-RAM. All programs were coded in ActivePerl-5.6.1.629 with the XML-SAX module (v0.12) and XML-SAX-Expat (v0.37) [Huang et al., 2005].
34 Experiment and Analysis: Testing Flow-Control Operators
- Dataset: GML document (162 MB)
- The XPath expression was to find all buildings within a range of 20,000 square meters, from (305500, 2767060) to (305600, 2767100).
35 Bounded Box and Query
The Bounded Box (BBox) of the Geo-obj.
- Matching Process
- Check BBox
- Check boundary
Query 1 (mismatch)
Query 3 (unmatch)
Query 2 (match)
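The two-step matching process can be sketched as a cheap bounding-box rejection followed by a finer test. The query rectangle below reuses the coordinates from the flow-control experiment; the footprints and function names are invented. The finer test only checks whether any vertex of the footprint falls inside the query rectangle, which is a simplification standing in for a true boundary check.

```python
def bbox_of(points):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a point list."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def boxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def matches(points, query):
    # Step 1: check BBox. Most non-matching objects are rejected here.
    if not boxes_intersect(bbox_of(points), query):
        return False
    # Step 2: check boundary (simplified to a vertex-in-rectangle test).
    return any(query[0] <= x <= query[2] and query[1] <= y <= query[3]
               for x, y in points)

QUERY = (305500, 2767060, 305600, 2767100)

# Hypothetical footprints.
FAR = [(100, 100), (110, 100), (110, 110)]                           # BBox misses
NEAR_MISS = [(305400, 2767000), (305520, 2767000), (305400, 2767200)]  # BBox hits only
BUILDING = [(305550, 2767070), (305560, 2767070),
            (305560, 2767080), (305550, 2767080)]                    # real match

if __name__ == "__main__":
    for shape in (FAR, NEAR_MISS, BUILDING):
        print(matches(shape, QUERY))
```

The three footprints correspond to the three situations on the slide: rejected by the BBox, accepted by the BBox but rejected by the finer test, and matched.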
36 Skipping the Parsing of Uninteresting Fragments Using the JNF Flow-Control Operator (in MXSP)
Source XML Document
Candidate Frag. 2 (Matched)
Candidate Frag. 1 (Matched)
Unmatched
jump
jump-to-the-next-fragment (JNF)
Candidate Frag. 3
Candidate Frag. n
37 Experiment and Analysis: Testing of Flow-Control Operators
- Lowers the cost, parses fewer nodes, and performs less disk I/O
- However, consumes a lot of memory
38 GIS Applications (presented at ACM-GIS '06)
39 Snapshots of the GML-based Web GIS
Query by BBoxes
Query by Layers
Query by ID
Scalable Vector Graphics (SVG) Map Navigator
(powered by www.carto.net)
40 A GML Fragment
Geospatial Data (Coordinates)
XML/GML Tags
41 System Architecture of the GML-based Web GIS
- GeoXQuery: a GML query engine [Boucelma and Colonna, 2004]
- Extends the Saxon Java XQuery processor by calling the spatial function libraries of JTS (Java Topology Suite).
- GeoSAX: a GML streaming parser
- Extends Sun's SAX parser to support the spatial functions.
42 Problems in the GML Solution
GML
Web Server CGI
WebBrowser (SVG Nav.)
BIG
XQuery Expressions.
Query (BBox, Layers, Obj ID)
SVG Elements
SVG Elements
GeoXQuery or GeoSAX
- If the GML documents are large:
- GeoXQuery may not work (the DOM data model consumes a huge amount of main memory).
- GeoSAX needs a stream-based query algorithm.
43 Integrating with an XML Pre-filter
- Using an XML pre-filtering technique [Huang et al., 2006] to cut off uninteresting XML/GML fragments by approximately executing the user query.
- However, the prefilter does not support prefiltering geospatial data.
- I.e., it cannot handle the BBox query constraint.
44 Bounding-Box Indexing Plug-in Module (BIPM) for the XML Pre-filter
- The Bounding-box Indexing Plug-in Module (BIPM) was developed to give the XML pre-filtering technique geospatial filtering functionality.
- BIPM indexes the boundary of each geographical feature in the documents and provides an intersection operation to query the indexed features.
45 Indexing Bounded Boxes
Indexing the Bounded Boxes (BBox) for all
Geo-objects.
46 Prefiltering with the Bounding-Box Indexing Plug-in Module
//Rivers//FootPrint
XML Prefilter
Intersection
BIPM
BBox(xx,yy,xx,yy)
Final Pre-filtering Results
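The combination shown in this diagram can be sketched as a set intersection: the structural prefilter yields the features that satisfy the path, the bounding-box index yields the features whose box overlaps the query box, and only features in both sets survive. The feature ids, boxes, and function names are assumptions, and the index is a plain dictionary scanned linearly; the slides do not say which index structure BIPM uses.

```python
def boxes_intersect(a, b):
    """Boxes are (min_x, min_y, max_x, max_y)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def bipm_query(bbox_index, query_box):
    """Ids of all indexed features whose bounding box overlaps the query box."""
    return {fid for fid, box in bbox_index.items()
            if boxes_intersect(box, query_box)}

def final_candidates(structural_ids, bbox_index, query_box):
    """Intersect the structural prefilter result with the spatial one."""
    return structural_ids & bipm_query(bbox_index, query_box)

# Hypothetical index: feature id -> bounding box.
BBOX_INDEX = {
    "river-1": (0, 0, 10, 10),
    "river-2": (50, 50, 60, 60),
    "road-1": (5, 5, 8, 8),
}
# Hypothetical result of prefiltering //Rivers//FootPrint.
RIVER_FOOTPRINTS = {"river-1", "river-2"}

if __name__ == "__main__":
    print(final_candidates(RIVER_FOOTPRINTS, BBOX_INDEX, (4, 4, 12, 12)))
```

Here road-1 passes the spatial test but fails the structural one, and river-2 does the opposite, so only river-1 reaches the GML processor.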
47 Environment and Datasets
- Two datasets
- 1.1 GB GML document (the Taipei city)
- 152 MB GML document (the Xinyi area)
- Six GML processors
- GeoXQuery
- GPXQuery with BIPM
- GPXQuery without BIPM
- GeoSAX
- GPSAX with BIPM
- GPSAX without BIPM
- Setup
- an Intel Pentium-4 PC running at 2.53 GHz with 1 GB DDR-RAM,
- a 120 GB EIDE hard disk,
- MS Windows 2000 Server,
- Java 2 (Standard Edition V.1.4.2).
48 Query Constraints
49 Datasets
Large dataset: Taipei, 1.1 GB
Small dataset: Xinyi, 152 MB
V2
V4
50 Querying by a Feature ID: XQuery-based Processors
The query returns a geo-feature.
N/A means that the processor ran out of memory and did not finish.
The pre-filtering technique lowers resource
consumption.
51 Querying by a Layer and a BBox: XQuery-based Processors
The query returns the Energy Supply Utility layer
in V4.
The query returns the Energy Supply Utility in V2.
The pre-filtering technique lowers resource
consumption.
52 Querying by a BBox: XQuery-based Processors
The query returns geo-features in V4.
The query returns geo-features in V2.
BIPM can efficiently filter out uninteresting
geographic features.
53 Querying by a Feature ID: SAX-based Processors
The query returns a geo-feature.
The pre-filtering technique lowers the run time
but increases memory consumption.
54 Querying by a Layer and a BBox: SAX-based Processors
The query returns the Energy Supply Utility layer
in V4.
The query returns the Energy Supply Utility in V2.
55 Querying by a BBox: SAX-based Processors
The query returns geo-features in V4.
The Cost of pre-filtering GML docs.
The query returns geo-features in V2.
56 Conclusion
- If programs require only small parts of the document, why do we need to process the entire document in order to find those fragments? We do not; it is unnecessary.
- Is there any way to prevent DOM or SAX parsers from processing the entire document, and how? Yes: prefilter the XML documents.
- Can we improve the standard DOM/SAX processing models without modifying them (or with only minor modifications)? Yes: one instruction is enough (using the two-phase processing model).
- What is the benefit? More efficient XML document processing.
- What cost will we pay? Memory, storage, and the cost of indexing.
57 Future Work
- Lowering memory consumption
- Developing index management subsystems
- Investigating more efficient ways to prune the XML doc. and generate path information for the candidate-set document
- Integrating the prefiltering technique into DOM- and stream-based XPath processors and XQuery processors (already done, see http://www.iis.sinica.edu.tw/jashing/prefiltering/)
58 Thank you for your attention. Questions and Comments
All the software packages of the XML Prefilter are available at http://www.iis.sinica.edu.tw/jashing/prefiltering/
59 XML Processing Enhancements
XML Applications
- Unchangeable?
- or a few modifications!
Requirements?
XML Standards
60 Issues in Existing XML Processing Enhancements
- Consume a large amount of disk/memory space and CPU time (Cost )
- Large-scale (Cost )
- Integrate with relational database (Cost )
- Complicated index/query algorithms (Cost )
- Intrusive (considerable modifications) (Cost )
- Non-transparent (apps. need to be aware of the mechanics) (Cost )
61 Experiment and Analysis: Datasets
Note: The CKIP Chinese Treebank corpus and the GML file are encoded in UTF-8. Although we initialized the Expat SAX parser and the primitive SAX parser with the parameter (ProtocolEncoding => UTF-8), it still did not work and showed the warning message "Wide character in print".
62 Experiment and Analysis: Treebank Search Engine
Element freq. of the Chinese Treebank
Removing half of the steps with consideration of element frequencies
63 Related Work
- Lazy XML processing [Noga et al., 2002] and Apache Xerces's lazy processing [The Apache Xerces2 parser 2.8.1 Release]
- These approaches avoid parsing an entire document into memory by incrementally building a DOM tree as different parts of the document are requested by the user.
64 Related Work
- Projecting XML documents [Marian and Simeon, 2003]
- Prunes the uninteresting fragments of the target XML document by considering the user's XPath expression when loading the document.
65 Related Work
- Type-based XML projection [Benzaken et al., 2006]
- Prunes an XML document more precisely in the presence of the document type definition (DTD) or the schema of the document.
Projector
66 Related Work
- Accelerating queries by pruning XML documents [Bressan et al., 2005]
67 Issues of Manipulating GML Docs.
- GML provides a rich vocabulary and a flexible document structure for expressing complicated geospatial and non-geospatial data.
- Although GML is a kind of XML, the existing XML processors (DOM, SAX, XPath, and XQuery) are not suitable for processing GML.
DOM, SAX, XPath, XQuery
?
XML
GML
68 Solutions
- GIS databases
- Open source software: PostgreSQL/PostGIS.
- Many people choose this way.
- Extending the existing XML processors
- This is the approach discussed here.
DOM, SAX, XPath, XQuery
GeoSAX, GeoXQuery (GeoXPath)
XML
GML
69 Contributions
- Proposing two efficient GML-native processors.
- Enabling the GML processors to query large GML docs.
- Building a GML-based Web GIS using the GML processors.
Bounding-box Indexing Plug-in Module
Indexing
XML Pre-filtering Technique
Spatial Extension
GML Query Engines
XQuery
SAX
XML/GML
Data storage
Streaming
DOM
70 XQuery Expression with Geospatial Extension Libraries
Geospatial extension for XQuery
declare namespace my = "http://www.sinica.edu.tw/";
declare namespace gml = "java:GML.XQGeoExtensions";
declare namespace svg = "java:GML.XQSVGExtensions";
declare function my:get_geo1() as element() {
  for $var1 in doc("lanyu.xml")//Rivers//FootPrint[@id = "21001000000-11"]
  return <result1>{$var1}</result1>
};
declare function my:get_geo2() as element() {
  for $var1 in doc("lanyu.xml")//Roadways//FootPrint[@id = "4230904000-31"]
  return <result1>{$var1}</result1>
};
svg:GML2SVG(gml:Buffer(
  gml:Intersection(my:get_geo1(), my:get_geo2()), 50))
Geospatial Operations
Calculating the buffer of the intersection of a road and a river
71 Query Results
(a) A road.
(b) A river.
(c) The results of buffering the intersection of
the road and the river
(d) Combine and recolor (a), (b), and (c) in a
SVG map.
72 GML-native Processors
Bounding-box Indexing Plug-in Module
GPXQuery
GPSAX
XML Pre-filtering Technique
Spatial Extension
GeoSAX
GeoXQuery
XQuery
SAX
XML/GML
Streaming
DOM