Emory University

1
XML Evolution: Two-phase XML Processing Model Using XML Prefiltering Techniques
Demonstrated at VLDB 2006.

Emory University
November 17, 2006

Chia-Hsin Huang, IIS, Academia Sinica
Tyng-Ruey Chuang, IIS, Academia Sinica
James J. Lu, MathCS, Emory
Hahn-Ming Lee, CSIE, NTUST
2
Agenda

Issues of Conventional XML Processing Models (DOM and SAX)
Motivation
Two-phase XML Processing Models and Prefiltering Techniques
Experiments and Analysis
GIS Applications (Optional)
Conclusion and Future Work
Related Work (Optional)

3
Issues of Conventional XML Processing Model: DOM (1/4)
XPath expression: /html//p/text
child axis
descendant axis
Source: http://www.cee.hw.ac.uk/alison/netapp/dom/sld006.htm
4
Issues of Conventional XML Processing Model: DOM (2/4)

Pros
Provides flexible tree traversal
Suitable for supporting XPath axes
Random access to the document
Cons
Needs a lot of resources to build a DOM tree
CPU time
Memory space
(Size of the XML doc.) : (Size of the DOM tree) = 1 : 5

5
Issues of Conventional XML Processing Model: SAX (3/4)
XPath expression: //entry[@id="a2"]
Backward-reference XPE: //foo[text()="baz"]/ancestor::entry
Source: http://www.informatik.hu-berlin.de/obecker/Lehre/SS2002/XML/images/sax.t.gif
6
Issues of Conventional XML Processing Model: SAX (4/4)

Pros (compared with the DOM model)
Consumes far fewer resources
A constant and small amount of memory
Supports stream processing
Cons
Must parse the whole document
No backtracking mechanism (forward-only parsing)
Lacks interactive mechanisms
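
To make the DOM/SAX contrast concrete, here is a minimal sketch using Python's standard library rather than the parsers used in the talk. The sample document and the handler class name are assumptions made for illustration. Both approaches extract the text of every p element, in the spirit of /html//p/text, but DOM builds the full tree first while SAX receives every event in one forward pass.

```python
import xml.sax
from xml.dom import minidom

# Illustrative document (an assumption, not taken from the slides).
DOC = "<html><body><p>one</p><div><p>two</p></div></body></html>"

# DOM: the whole tree is built in memory before any query can run.
dom = minidom.parseString(DOC)
dom_texts = [p.firstChild.data for p in dom.getElementsByTagName("p")]


class ParagraphText(xml.sax.ContentHandler):
    """SAX: collect the text of <p> elements from a forward-only event stream."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.texts = []
        self.events = 0  # start/end events delivered, interesting or not

    def startElement(self, name, attrs):
        self.events += 1
        if name == "p":
            self.in_p = True
            self.texts.append("")

    def endElement(self, name):
        self.events += 1
        if name == "p":
            self.in_p = False

    def characters(self, content):
        if self.in_p:
            self.texts[-1] += content


handler = ParagraphText()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_texts = handler.texts
```

Both lists hold the same two strings, yet the SAX handler was called for all ten start/end events even though only four of them belong to p elements. That is the waste the following slides set out to reduce.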

7
Problems in Standard DOM and SAX Processing Models

Both DOM and SAX processing models waste a large
amount of computational resources by processing
uninteresting fragments.

8
Motivation

If programs require only small parts of the document, why do we need to process the entire document in order to find those fragments?
Is there any way to prevent DOM or SAX parsers from processing the entire document? And how?
Can we improve the standard DOM/SAX processing models without modifying them (or with only minor modifications)?
What is the benefit?
What cost will we pay?

9
XML Prefiltering Technique
Our Solution
XPath expression (issued by users' apps)
Prefiltering techniques (a tiny search engine)
Candidate-set XML document
XML parsers (DOM/SAX)
XML document
10
Two-phase XML Processing Model Enhanced User
Applications
11
Two-phase XML Processing Model Enhanced XPath
Processors
12
Two-phase XML Processing Model Enhanced User
Applications
13
Two-phase XML Processing Model Enhanced
Stream-based XPath Processors
14
A Source Code Fragment of an XPath Processor with
the XML Prefilter
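
The source code shown on this slide did not survive in the transcript. As a rough illustration of the two-phase idea only, the sketch below puts a prefilter call in front of an unmodified XPath evaluation. The function name `prefilter`, its signature, and the sample document are hypothetical and are not the authors' API. This stand-in also parses the whole document itself, whereas the technique in the talk uses a prebuilt index to avoid that.

```python
import xml.etree.ElementTree as ET


def prefilter(xml_text, last_step):
    """Hypothetical stand-in for phase 1.

    Keeps every `last_step` element together with its ancestors and drops
    everything else. This over-approximates a query such as //A//E, so no
    true answer is lost (100% recall), but false positives may remain.
    """
    root = ET.fromstring(xml_text)

    def prune(node):
        if node.tag == last_step:
            return True  # candidate fragment: keep the whole subtree
        keep = False
        for child in list(node):
            if prune(child):
                keep = True
            else:
                node.remove(child)  # uninteresting fragment
        return keep

    prune(root)
    return ET.tostring(root, encoding="unicode")


# Illustrative document (an assumption, not taken from the slides).
DOC = ("<root><X><E>skip</E></X>"
       "<A><B><E>hit</E></B></A>"
       "<Y>noise</Y></root>")

# Phase 1: the one extra call in front of the existing processor.
candidate_doc = prefilter(DOC, "E")

# Phase 2: the unmodified processor computes exact answers on the smaller input.
answers = [e.text for e in ET.fromstring(candidate_doc).findall(".//A//E")]

# Reference result on the full document, for comparison.
baseline = [e.text for e in ET.fromstring(DOC).findall(".//A//E")]
```

The candidate-set document still contains the E under X, which is not a real answer; the second phase discards it. The prefilter only has to guarantee that nothing relevant is thrown away.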

15
An Example

XPath expression: //A//E
Answers: subtrees rooted at (E8, E15) and (E9, E14)

XPath Processors
XML Prefilter
The candidate-set XML Document
The XML Document
Exact Answers
16
Prefiltering Technique: Requirements and Limitations

Requirements
100% recall rate (correctness)
Transparency (easy to use)
Non-intrusive (easy to integrate into XML processors)
Lightweight (XML DBs are expensive)
Efficient
Limitations
Needs a user query (we do not take multiple queries at a time because the candidate-set XML document may still be very large)
Needs to preprocess the XML document (best suited to large, infrequently updated documents)

17
System Architecture of the Prefiltering
Technique (DOM)
18
System Architecture of the Prefiltering
Technique (SAX)
19
Prefiltering Technique: Indexer

Position list: (start-tag position, end-tag position). We use the preorder number to express the tag offset.
Random access to the document by moving the file pointer to tag positions
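
A position list of this kind can be sketched as follows. This is an assumption-laden illustration: positions are taken to be a running tag counter (one reading of pairs such as (E8, E15) on the example slide), whereas a real index would also need file offsets to support seeking. The class name, function name, and sample document are hypothetical.

```python
import xml.sax
from collections import defaultdict


class PositionIndexer(xml.sax.ContentHandler):
    """Record (start-tag position, end-tag position) for every element."""

    def __init__(self):
        super().__init__()
        self.counter = 0                  # running number of tags seen so far
        self.open = []                    # stack of (tag name, start position)
        self.index = defaultdict(list)    # tag name -> list of (start, end)

    def startElement(self, name, attrs):
        self.counter += 1
        self.open.append((name, self.counter))

    def endElement(self, name):
        self.counter += 1
        tag, start = self.open.pop()
        self.index[tag].append((start, self.counter))


def build_index(xml_text):
    """Return {tag: sorted position list} for the given document."""
    handler = PositionIndexer()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return {tag: sorted(positions) for tag, positions in handler.index.items()}


# Illustrative document (an assumption, not the one from the slides).
DOC = "<A><B><E><E><C/></E></E></B><E/></A>"
index = build_index(DOC)
```

With this numbering, an element is a descendant of another exactly when its interval lies strictly inside the other's interval, which is what makes ancestor/descendant tests cheap.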

20
Prefiltering Technique: Query Simplifier (QS)

Goal: reduce the cost of query evaluation
Simplification rules
SR1: omitting internal steps (b)
SR2: omitting branch steps (c1 and c2)
SR3: omitting wildcard steps (d)
SR4: replacing the parent/child axes with the ancestor/descendant axes (e)
Always apply SR4, because our query evaluation algorithm can determine ancestor/descendant relationships more efficiently than parent/child relationships.

21
Prefiltering Technique Query Simplifier (QS)

SR1 omitting internal steps (b)
SR2 omitting branch steps (c1 and c2)
SR3 omitting wildcard steps (d) and
SR4 replacing the parent/child axes with the
ancestor/descendant axes (e).

22
Query Simplifier SR5: Omitting Uninformative Steps
Skipped intermediate nodes
XPE: /A/B/C/E/F
simXPE: //C//F (returns the same results). The prefilter runs more efficiently!
Intermediate nodes
A, B, and E are Uninformative Steps
Matched nodes
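
A few of the simplification rules can be sketched for plain paths without predicates. The sketch covers SR3 (drop wildcard steps), SR4 (use the descendant axis everywhere) and SR5 (drop steps the caller marks as uninformative). How the real simplifier decides which steps are uninformative is not described in the transcript, so here the set is simply passed in; the function name and signature are hypothetical.

```python
import re


def simplify(xpath, uninformative=frozenset()):
    """Simplify a predicate-free path such as /A/B/C/E/F.

    The last step is always kept because it names the nodes to return.
    """
    steps = [s for s in re.split(r"/+", xpath) if s]
    if not steps:
        raise ValueError("empty path")
    *inner, last = steps
    # SR3: drop wildcard steps; SR5: drop steps marked uninformative.
    kept = [s for s in inner if s != "*" and s not in uninformative]
    # SR4: connect every remaining step with the descendant axis.
    return "".join("//" + s for s in kept + [last])


slide_example = simplify("/A/B/C/E/F", uninformative={"A", "B", "E"})
wildcard_example = simplify("/A/*/C")
```

The simplified query can match more nodes than the original, never fewer, so the exact XPath processor in the second phase still has everything it needs.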
23
Prefiltering Technique: Fast Lightweight Steps-Axes Analyzer (FLISA)

Determines the candidate fragments in the XML document by evaluating the simplified XPath expression

The equations for evaluating u/axis::v
The answer to //A//E is (E8, E15). Note that the subtree rooted at (E9, E14) is removed because it is already contained in (E8, E15).
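
The equations themselves are not preserved in the transcript. One plausible way to evaluate a descendant step over position lists is plain interval containment, sketched below. The function names are hypothetical, the E positions are the ones quoted on the slide, and the A position is invented for illustration.

```python
def descendants(u_list, v_list):
    """Return the v intervals that lie strictly inside some u interval (u//v)."""
    return [v for v in v_list
            if any(u[0] < v[0] and v[1] < u[1] for u in u_list)]


def outermost(intervals):
    """Drop intervals nested inside an already kept interval.

    Element intervals are well nested, so after sorting by start position a
    nested interval is always contained in the most recently kept one.
    """
    kept = []
    for start, end in sorted(intervals):
        if kept and kept[-1][0] < start and end < kept[-1][1]:
            continue
        kept.append((start, end))
    return kept


A_POS = [(1, 16)]            # hypothetical position of the A element
E_POS = [(8, 15), (9, 14)]   # E positions quoted on the slide

matches = descendants(A_POS, E_POS)   # both E elements are below A
answer = outermost(matches)           # only the enclosing fragment is kept
```

Keeping only the outermost interval loses nothing, because the fragment (8, 15) is copied whole and therefore already carries (9, 14) inside it.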
24
Prefiltering Technique Fragment Gatherer (FG)

Generate a candidate-set XML document
Generate fragments (simple outputs of FLISA)
Generate path information (only deal with the
descendant axis)
Parse XML document from the root
When a start-tag is recognized, use its position
to look up the corresponding end-tag position in
the inverted index table
Check whether the parsed node N contains any
candidate fragment as its descendant or N itself
is a candidate fragment
If yes, output N.
If not, directly move the file pointer to its
end-tag position (skip the frag.)
Note that we currently have no efficient way to generate the path information if the user's XPath expression contains the preceding, following, or sibling axes.
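
The gathering procedure can be sketched over an in-memory token list. This is a simplified illustration under several assumptions: tags have no attributes and are never self-closing, positions are a running tag counter, and the `jump` table built by the tokenizer stands in for the inverted index that maps a start-tag position to its end-tag position. A real implementation would seek in the file instead of indexing a list. All names are hypothetical.

```python
import re

TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")


def tokenize(xml_text):
    """Return (tokens, jump).

    tokens: list of (kind, text, position); jump: token index of each
    start tag -> token index of its matching end tag.
    """
    tokens, jump, stack, pos = [], {}, [], 0
    for m in TOKEN.finditer(xml_text):
        closing, name, text = m.groups()
        if name is None:
            tokens.append(("text", text, None))
            continue
        pos += 1
        if closing:
            jump[stack.pop()] = len(tokens)
            tokens.append(("end", m.group(0), pos))
        else:
            stack.append(len(tokens))
            tokens.append(("start", m.group(0), pos))
    return tokens, jump


def gather(xml_text, candidates):
    """Build the candidate-set document; return it with the tokens visited."""
    tokens, jump = tokenize(xml_text)
    out, visited, i = [], 0, 0
    inside_until = -1  # token index of the end tag of the fragment being copied
    while i < len(tokens):
        kind, text, pos = tokens[i]
        visited += 1
        if i <= inside_until:
            out.append(text)                  # inside a candidate: copy verbatim
        elif kind == "start":
            end_pos = tokens[jump[i]][2]
            if (pos, end_pos) in candidates:
                inside_until = jump[i]        # N itself is a candidate fragment
                out.append(text)
            elif any(pos < s and e < end_pos for s, e in candidates):
                out.append(text)              # N contains a candidate: path info
            else:
                i = jump[i]                   # skip straight to N's end tag
        elif kind == "end":
            out.append(text)                  # end tag of a kept ancestor
        i += 1
    return "".join(out), visited


# Illustrative document; the only candidate is the E element at (5, 6).
DOC = "<A><B>x</B><D><E>y</E></D><B>z</B></A>"
result, visited = gather(DOC, {(5, 6)})
total_tokens = len(tokenize(DOC)[0])
```

Only nine of the thirteen tokens are visited in this small case; on a large document where most subtrees are irrelevant, the skipped share would be far larger.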

25
Prefiltering Technique Micro XML Streaming
Parser (MXSP)

Transforms the candidate fragments into
SAX-events
The procedure is similar to that of Fragment
Gatherer
Provides interactive mechanisms by using the
following additional flow-control operators
close-the-current-fragment (CCF)
jump-to-the-next-fragment (JNF)
terminate-the-parsing-process (TPP)
parse-next-node (PNN)
reparse-previous-fragment (RPF)
reparse-current-fragment (RCF)
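
How an application might drive such operators can be sketched with a callback that returns an operator after each event. Only three of the listed operators are modeled (PNN, JNF, TPP), and the event shapes, function name, and return convention are assumptions rather than the MXSP interface.

```python
PNN = "parse-next-node"
JNF = "jump-to-the-next-fragment"
TPP = "terminate-the-parsing-process"


def stream(fragments, handler):
    """Deliver events fragment by fragment, obeying the operator returned
    by the handler. Returns the number of events actually delivered."""
    delivered = 0
    for fragment in fragments:
        for event in fragment:
            delivered += 1
            op = handler(event)
            if op == JNF:
                break              # skip the rest of this fragment
            if op == TPP:
                return delivered   # stop parsing altogether
    return delivered


# Hypothetical candidate fragments, each a short list of (kind, value) events.
FRAGMENTS = [
    [("id", "b1"), ("coords", "c1"), ("name", "n1")],
    [("id", "b2"), ("coords", "c2"), ("name", "n2")],
    [("id", "b3"), ("coords", "c3"), ("name", "n3")],
]

found = []


def want_b2(event):
    """Look for the coordinates of the fragment whose id is b2."""
    kind, value = event
    if kind == "id" and value != "b2":
        return JNF                 # wrong fragment: jump to the next one
    if kind == "coords":
        found.append(value)
        return TPP                 # answer found: no need to parse further
    return PNN


delivered = stream(FRAGMENTS, want_b2)
```

A plain SAX parser would have delivered all nine events; here the handler sees three, because it can tell the parser to jump or stop.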

26
Experiments and Analysis
27
Experiment and Analysis: Testing of Attributes-Testing Nodes
Path of the query: /site/regions/namerica/item[@id="item20748"]/name
Dataset: XMark Benchmark. Source: http://www-rocq.inria.fr/gemo/Gemo/Projects/SUMMARY/DTD-xmark.jpg
28
Querying Large XML Docs
Query: /site/regions/item[@id="item1"]/name (matching one node)
N/A means that the method ran out of memory and did not finish.
29
Querying Large XML Docs
Query: /child::site/child::regions/child::asia (matching 4.5% of the nodes of the source document)
N/A means that the method ran out of memory and did not finish.
30
Chinese Treebank

Semantically annotated corpus
Help parse and study Chinese sentences
Applications
Machine translation processing
Building example-based parsers
Comparing and integrating grammars
Developing and enlarging Treebank
...
About 20,000 sentences in the CKIP Treebank V1.0
VP(HeadVK1??goalNP(HeadNdabe???))

(http://godel.iis.sinica.edu.tw/CKIP/trees1000.txt)
31
Experiment and Analysis Sample Queries 1/2
32
Experiment and Analysis Sample Queries 2/2
33
Experiment and Analysis: Treebank Search Engine
Over-simplifying a query

StreamPCRI is a stream-based structural pattern matching algorithm.

Setup: an Intel Pentium-4 PC running at 2.53 GHz with 1 GB of DDR-RAM. All programs were coded in ActivePerl 5.6.1.629 using the XML-SAX module (v0.12) and XML-SAX-Expat (v0.37) [Huang et al., 2005].
34
Experiment and Analysis Testing Flow-Control
Operators

Dataset GML Document (162MB)
The XPath expression was to find all buildings
within a range of 20,000 square meters, from
(305500, 2767060) to (305600, 2767100).

35
Bounded Box and Query
The Bounded Box (BBox) of the Geo-obj.
Query 1 (mismatch)
Query 3 (unmatch)

Matching Process
Check BBox
Check boundary

Query 2 (match)
36
Skipping Parsing Uninteresting Fragments using
JNF Flow-Control Operators (in MXSP)
Source XML Document
Candidate Frag. 2 (Matched)
Candidate Frag. 1(Matched)
Unmatched
jump
jump-to-the-next-fragment (JNF)
Candidate Frag. 3
Candidate Frag. n
37
Experiment and Analysis Testing of
Flow-Control Operators

Lowers the cost, parses fewer nodes, and performs less disk I/O
However, consumes a lot of memory

38
GIS Applications (presented at ACM-GIS06)
39
Snapshots of the GML-based Web GIS
Query by BBoxes
Query by Layers
Query by ID
Scalable Vector Graphics (SVG) Map Navigator
(powered by www.carto.net)
40
A GML Fragment
Geospatial Data (Coordinates)
XML/GML Tags
41
System Architecture of the GML-based Web GIS

GeoXQuery: a GML query engine [Boucelma and Colonna, 2004]
Extends the Saxon Java XQuery processor by calling the spatial function libraries of JTS (Java Topology Suite).
GeoSAX: a GML streaming parser
Extends Sun's SAX parser to support the spatial functions.

42
Problems in the GML Solution

GML
Web Server CGI
WebBrowser (SVG Nav.)
BIG
XQuery Expressions.
Query (BBox, Layers, Obj ID)
SVG Elements
SVG Elements
GeoXQuery or GeoSAX

If the GML documents are large:
GeoXQuery may not work (the DOM data model consumes a huge amount of main memory).
GeoSAX needs a stream-based query algorithm.

43
Integrating with an XML Pre-filter

Use an XML pre-filter technique [Huang et al., 2006] to cut off uninteresting XML/GML fragments by approximately executing the user query.
However, the prefilter does not support prefiltering geospatial data, i.e., it cannot handle the BBox query constraint.

44
Bounding-Box Indexing Plug-in Module (BIPM) for
the XML Pre-filter

Bounding-box Indexing Plug-in Module (BIPM) is
developed for the XML pre-filtering technique to
perform geospatial filtering functionality.
BIPM can index the boundary of each geographical
feature in the documents and provides an
intersection operation to query indexed features.
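
The intersection operation over indexed bounding boxes can be sketched as below. The index layout, the names, and the feature boxes are hypothetical; only the query window, (305500, 2767060) to (305600, 2767100), is taken from the flow-control experiment earlier in the talk. A linear scan is used for brevity, where a real plug-in would presumably use a spatial index structure.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersects(a, b):
    """True if two axis-aligned boxes overlap or touch."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x and
            a.min_y <= b.max_y and b.min_y <= a.max_y)


def query(index, window):
    """Return the ids of indexed features whose box intersects the window."""
    return sorted(fid for fid, box in index.items() if intersects(box, window))


# Hypothetical bounding-box index: feature id -> box of that feature.
INDEX = {
    "f1": BBox(305550, 2767080, 305650, 2767120),   # overlaps the window
    "f2": BBox(305700, 2767000, 305800, 2767050),   # east of the window
    "f3": BBox(305000, 2767000, 305100, 2767050),   # west of the window
}

WINDOW = BBox(305500, 2767060, 305600, 2767100)
hits = query(INDEX, WINDOW)
```

A box test is only a coarse filter: a feature whose box intersects the window may still lie outside it, so the exact boundary check is left to the GML processor in the second phase.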

45
Indexing Bounded Boxes
Indexing the Bounded Boxes (BBox) for all
Geo-objects.
46
Prefiltering with the Bounding-Box Indexing
Plug-in Module
//Rivers//FootPrint
XML Prefilter
Intersection
BIPM
BBox(xx,yy,xx,yy)
Final Pre-filtering Results
47
Environment and Datasets

Two datasets
1.1 GB GML document (the Taipei city)
152 MB GML document (the Xinyi area)

Six GML processors
GeoXQuery
GPXQuery with BIPM
GPXQuery without BIPM
GeoSAX
GPSAX with BIPM
GPSAX without BIPM

Setup
an Intel Pentium-4 PC running at 2.53 GHz with 1
GB DDR-RAM,
a 120 GB EIDE hard disk,
the MS Windows 2000 server.
Java 2 (Standard Edition V.1.4.2).

48
Query Constraints
49
Datasets
Large dataset: Taipei, 1.1 GB
Small dataset: Xinyi, 152 MB
V2
V4
50
Querying by a Feature ID: XQuery-based Processors
The query returns a geo-feature.
N/A means that the processor ran out of memory and did not finish.
The pre-filtering technique lowers resource
consumption.
51
Querying by a Layer and a BBox: XQuery-based Processors
The query returns the Energy Supply Utility layer
in V4.
The query returns the Energy Supply Utility in V2.
The pre-filtering technique lowers resource
consumption.
52
Querying by a BBox: XQuery-based Processors
The query returns geo-features in V4.
The query returns geo-features in V2.
BIPM can efficiently filter out uninteresting
geographic features.
53
Querying by a Feature ID SAX-based Processors
The query returns a geo-feature.
The pre-filtering technique lowers the run time
but increases memory consumption.
54
Querying by a Layer and a BBox SAX-based
Processors
The query returns the Energy Supply Utility layer
in V4.
The query returns the Energy Supply Utility in V2.
55
Querying by a BBox SAX-based Processors
The query returns geo-features in V4.
The Cost of pre-filtering GML docs.
The query returns geo-features in V2.
56
Conclusion

If programs require only small parts of the document, why do we need to process the entire document in order to find those fragments? No, it is unnecessary.
Is there any possible way to prevent DOM or SAX parsers from processing the entire document? And how? Yes: prefilter the XML documents.
Can we improve the standard DOM/SAX processing models without modifying them (or with only minor modifications)? One instruction is enough (using the two-phase processing model).
What is the benefit? More efficient XML document processing.
What cost will we pay? Memory, storage, and the cost of indexing.

57
Future Work

Lowering memory consumption
Developing index management subsystems
Investigating more efficient ways to prune the XML document and generate path information for the candidate-set document
Integrating the prefiltering technique into DOM- and stream-based XPath processors and XQuery processors (already done, see http://www.iis.sinica.edu.tw/jashing/prefiltering/)

58
Thank you for your attention
Questions and Comments
All the software packages of the XML Prefilter are available at http://www.iis.sinica.edu.tw/jashing/prefiltering/
59
XML Processing Enhancements
XML Applications

Unchangeable?
or a few modifications!

Requirements?
XML Standards

Unchangeable!?

60
Issues in Existing XML Processing Enhancements

Consume large amount of disk/memory space and CPU
time (Cost )
Large-scale (Cost )
Integrate with relational database (Cost )
Complicated index/query algorithms (Cost )
Intrusive (considerable modifications) (Cost
)
Non-transparent (apps. need to be aware of the
mechanics) (Cost )

61
Experiment and Analysis - Datasets
Note: The CKIP Chinese Treebank corpus and the GML file are encoded in UTF-8. Although we initialized the Expat SAX parser and the primitive SAX parser with the parameter (ProtocolEncoding => UTF-8), it still did not work and showed the warning message "Wide character in print".
62
Experiment Analysis Treebank Search Engine
Element freq. of the Chinese Treebank
Removing half of steps with consideration to
element frequencies
63
Related Work

Lazy XML processing [Noga et al., 2002] and the Apache Xerces lazy processing [The Apache Xerces2 parser 2.8.1 Release] avoid parsing an entire document into memory by incrementally building a DOM tree as different parts of the document are requested by the user.

64
Related Work

Projecting XML documents [Marian and Simeon, 2003] prunes the uninteresting fragments of the target XML document by considering the user's XPath expression when loading the document.

65
Related Work

Type-based XML projection [Benzaken et al., 2006] prunes an XML document more precisely in the presence of the document type definition (DTD) or the schema of the document.

Projector
66
Related Work

Accelerating queries by pruning XML documents [Bressan et al., 2005]

67
Issues of Manipulating GML Docs.

GML provides a rich vocabulary and a flexible document structure to express complicated geospatial and non-geospatial data.
Although GML is a kind of XML, the existing XML processors (DOM, SAX, XPath, and XQuery) are not suitable for processing GML.

DOM, SAX, XPath, XQuery
?
XML
GML
68
Solutions

GIS databases
Open-source software: PostgreSQL/PostGIS.
Many people choose this approach.
Extending the existing XML processors
This is the approach discussed here.

DOM, SAX, XPath, XQuery
GeoSAX, GeoXQuery (GeoXPath)
XML
GML
69
Contributions

Proposing two efficient GML-native processors.
Enabling the GML processors to query large GML
docs.
Building a GML-based Web GIS using the GML
processors.

Bounding-box Indexing Plug-in Module
Indexing
XML Pre-filtering Technique
Spatial Extension
GML Query Engines
XQuery
SAX
XML/GML
Data storage
Streaming
DOM
70
XQuery Expression with Geospatial Extension
Libraries
Geospatial extension for XQuery

declare namespace my = "http://www.sinica.edu.tw/";
declare namespace gml = "java:GML.XQGeoExtensions";
declare namespace svg = "java:GML.XQSVGExtensions";
(: Select one river footprint by its id. :)
declare function my:get_geo1() as element() {
  for $var1 in doc("lanyu.xml")//Rivers//FootPrint[@id = "21001000000-11"]
  return <result1>{$var1}</result1>
};
(: Select one roadway footprint by its id. :)
declare function my:get_geo2() as element() {
  for $var1 in doc("lanyu.xml")//Roadways//FootPrint[@id = "4230904000-31"]
  return <result1>{$var1}</result1>
};
(: Buffer the intersection of the two geometries with distance 50, then render it as SVG. :)
svg:GML2SVG(gml:Buffer(gml:Intersection(my:get_geo1(), my:get_geo2()), 50))

Geospatial Operations
Calculating the buffer of the intersection of a
road and a river
71
Query Results
(a) A road.
(b) A river.
(c) The results of buffering the intersection of
the road and the river
(d) Combine and recolor (a), (b), and (c) in a
SVG map.
72
GML-native Processors
Bounding-box Indexing Plug-in Module
GPXQuery
GPSAX
XML Pre-filtering Technique
Spatial Extension
GeoSAX
GeoXQuery
XQuery
SAX
XML/GML
Streaming
DOM
