Filtering XML Documents with XPath presentation

About This Presentation

Slides: 50

Provided by: nph3

Transcript and Presenter's Notes

1
Filtering XML Documents with XPath
By Nick Phan, CS 240B, Spring 2008
2
Information Dissemination

The large volume of data available necessitates
the use of selective approaches to disseminate
information in order to not overwhelm end users.
Typical Execution Model
Continuously collect new data items from
underlying data sources
Filter them against user profiles
Deliver relevant data to interested users

3
Current Systems

Current Selective Dissemination of Information
(SDI) applications use simple keyword matching or
bag-of-words Information Retrieval (IR)
techniques.
These techniques often suffer from a limited
ability to express user interests.

4
XML-based SDI Architecture

XML has emerged as a standard information
exchange mechanism on the Internet.
XML allows structural information to be encoded
into the document
This structural information can be exploited to
create more focused and accurate results

5
XML-based SDI Architecture
6
XFilter

An XML-based document filtering system that
provides efficient matching of XML documents to
large numbers of user profiles
Represents user interests with XPath
Uses a sophisticated index structure and a
modified Finite State Machine (FSM) approach to
quickly locate and examine relevant profiles

7
XPath Basics

A language for addressing parts of an XML
document
It treats an XML document as a tree of nodes
XPath expressions are patterns that can be
matched to nodes in the XML tree
Paths can be specified as absolute paths (from
the root of the document tree) or as relative
paths (from the context node)

8
XPath Basics

The hierarchical relationships between the nodes
are specified in the query using parent-child
operators (/) or ancestor-descendant operators
(//)
Example: /catalog/product//msrp
Addresses all msrp elements that are descendants
of product elements that are direct children
of the catalog element (which is the root)
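
As a rough illustration, the example path can be tried with Python's standard xml.etree.ElementTree module, which supports a limited XPath subset. The catalog document below is made up for illustration. ElementTree evaluates paths relative to the element they are called on, so the absolute path /catalog/product//msrp is written as ./product//msrp from the catalog root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the presentation).
DOC = """
<catalog>
  <product><price><msrp>19.99</msrp></price></product>
  <product><msrp>5.00</msrp></product>
  <service><msrp>1.00</msrp></service>
</catalog>
"""

root = ET.fromstring(DOC)  # root is the <catalog> element

# /catalog/product//msrp, expressed relative to the catalog root:
# msrp elements at any depth below a product child of catalog.
msrp_values = [e.text for e in root.findall("./product//msrp")]
print(msrp_values)
```

The msrp under service is not selected, because its parent is not a product element.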

9
XPath Basics

XPath also has a wildcard operator (*) which
matches any element name at a location step in a
query
Each location step can also include one or more
filters
A filter is a predicate that is applied to the
element(s) addressed at that location step
Filter expressions are enclosed by [ and ]
symbols
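
A small sketch of wildcards and filters, again using Python's xml.etree.ElementTree (which implements only part of XPath). The document and the attribute names are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the presentation).
DOC = """
<catalog>
  <product id="p1"><msrp>19.99</msrp></product>
  <service id="s1"><msrp>1.00</msrp></service>
  <product id="p2"><name>widget</name></product>
</catalog>
"""
root = ET.fromstring(DOC)

# Wildcard: /catalog/*/msrp matches msrp under any child of catalog.
any_child = [e.text for e in root.findall("./*/msrp")]

# Attribute filter: /catalog/product[@id='p2'].
by_attribute = [e.get("id") for e in root.findall("./product[@id='p2']")]

# Structural filter: products that have an msrp child.
with_msrp = [e.get("id") for e in root.findall("./product[msrp]")]
```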

10
XFilter and XPath

XPath is used to select entire documents rather
than parts of a document
If the XPath expression matches at least one
element of a document then we say that document
satisfies the expression
Although other XML query languages would work,
XPath was chosen because of its simplicity and
its recommendation by W3C

11
How is XFilter Different?

IR-based SDI systems only involve text documents,
whereas XFilter can work for any application
domain where data is tagged using XML
XFilter takes advantage of embedded schema
information, thus providing more precise
filtering
Most previous IR work has focused on accuracy
rather than efficiency; XFilter, by contrast, is
designed to scale to large numbers of profiles

12
XFilter Implementation
13
XFilter Implementation

Major components
Event-based parser for incoming XML documents
XPath parser for user profiles
Filter engine
Dissemination component which sends the filtered
data to the appropriate users
The heart of the system is the filter engine
which uses an index structure and a modified FSM
approach to quickly locate and check relevant
profiles

14
XFilter Filter Engine

The Filter Engine component of XFilter contains
an inverted index called the Query Index
The Query Index is used to match documents to
individual XPath queries
The Filter Engine allows user profiles to be
expressed as a Boolean combination of XPath
expressions

15
XPath Challenges

Filtering XML documents using a
structure-oriented path language such as XPath
introduces several new problems
Checking the order of elements in the profiles
Handling wildcards and descendant operators
Evaluating filters that are applied to element
nodes
To handle these problems, XFilter converts each
XPath query to a Finite State Machine

16
Generating Path Nodes

Each XPath query is decomposed into a set of path
nodes
The path nodes represent the element nodes in
the query and serve as the states for the FSM
Path nodes are not generated for wildcard (*)
nodes

17
Path Node Structure

QueryId: A unique identifier for the path
expression to which the path node belongs
Position: A sequence number that determines the
location of this path node in the order of the
path nodes for the query
RelativePos: An integer that describes the
distance in document levels between this path
node and the previous path node
A node that is separated from the previous one by
a descendant operator is flagged with a special
value of -1

18
Path Node Structure

Level: An integer representing the level in the
XML document at which this path node should be
checked
Filters: If a node contains one or more filters,
these are stored as expression trees pointed to
by the path node
NextPathNodeSet: Each path node also contains
pointer(s) to the next path node(s) of the query
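
The path node fields can be sketched as a small Python dataclass together with a decomposition routine. This is a minimal sketch under stated assumptions, not the XFilter implementation: filters and NextPathNodeSet are omitted, the name field and the decompose function are hypothetical, queries are assumed to start with / or //, and the initial Level convention (distance from the root for an absolute first node, -1 after a descendant operator, 0 when the value is only known at run time) is an assumption of this example.

```python
import re
from dataclasses import dataclass

@dataclass
class PathNode:
    query_id: str
    position: int      # order of this node within its query (1-based)
    relative_pos: int  # levels from the previous path node; -1 after "//"
    level: int         # level to check; -1 = any level; 0 = set at run time
    name: str          # element name (assumed field, used as the index key)

def decompose(query_id, xpath):
    """Turn a path such as /a/b//c into its list of path nodes."""
    nodes, dist, descendant = [], 0, False
    for sep, name in re.findall(r"(//|/)([^/]+)", xpath):
        dist += 1                              # one more level since last node
        descendant = descendant or sep == "//"
        if name == "*":
            continue                           # no path node for wildcards
        if descendant:
            rel, level = -1, -1
        elif not nodes:
            rel, level = 0, dist               # absolute distance from root
        else:
            rel, level = dist, 0
        nodes.append(PathNode(query_id, len(nodes) + 1, rel, level, name))
        dist, descendant = 0, False
    return nodes

q1 = decompose("Q1", "/a/b//c")
q2 = decompose("Q2", "//b/*/c/d")
```

In Q2 the wildcard produces no path node of its own; it only increases the RelativePos of the following node c to 2.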

19
Path Node Structure Example
20
Query Index

The Query Index is organized as a hash table
based on the element names that appear in the
XPath expression
Associated with each unique element name are two
lists
Candidate List
Wait List
Since each query can only be in a single state of
its FSM at a time, each query has a single path
node that represents its current state. This path
node is referred to as the current node

21
Query Index

The current node of each query is placed on the
Candidate List of the index entry for its
respective element name
All of the path nodes representing future states
are stored in the Wait Lists of their respective
element names
A state transition is defined by promoting a path
node from the Wait List to the Candidate List
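
The organization of the Query Index might be sketched as follows. This is an illustrative simplification: path nodes are reduced to (query id, position) pairs, and the function names build_query_index and promote are hypothetical.

```python
from collections import defaultdict

def build_query_index(queries):
    """queries maps a query id to its ordered list of element names."""
    index = defaultdict(lambda: {"candidate": [], "wait": []})
    for qid, names in queries.items():
        for position, name in enumerate(names, start=1):
            # The first path node is the initial current node of the query;
            # all later path nodes are future states.
            which = "candidate" if position == 1 else "wait"
            index[name][which].append((qid, position))
    return index

def promote(index, qid, position, name):
    """State transition: copy a path node from Wait List to Candidate List."""
    entry = (qid, position)
    assert entry in index[name]["wait"]
    index[name]["candidate"].append(entry)

# Two made-up queries, roughly /a/b/c and /b/c.
index = build_query_index({"Q1": ["a", "b", "c"], "Q2": ["b", "c"]})
```

Calling promote(index, "Q1", 2, "b") would model Q1 moving from its first state to its second.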

22
Query Index Example
23
XML Parsing and Filtering

The XML Parser is based on the SAX interface
which is a standard interface for event-based XML
parsing
The SAX event-based interface reports parsing
events directly to the application through
callbacks and does not build an internal tree
XFilter handles the following events that occur
during the parsing of an XML document
Start Element
End Element
Element Characters
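
To make the callback style concrete, here is a small sketch using Python's standard xml.sax module. The EventLogger class and the sample document are made up for illustration; the handler tracks the current level itself, since SAX does not report it.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records the three kinds of parsing events along with the level."""

    def __init__(self):
        super().__init__()
        self.level = 0
        self.events = []

    def startElement(self, name, attrs):
        self.level += 1
        self.events.append(("start", name, self.level, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name, self.level))
        self.level -= 1

    def characters(self, content):
        if content.strip():  # ignore whitespace-only text
            self.events.append(("chars", content.strip(), self.level))

logger = EventLogger()
xml.sax.parseString(b'<a><b id="1">hi</b></a>', logger)
```

No tree is built: the handler only sees one event at a time, which is why the filter engine has to keep its own state between callbacks.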

24
Start Element Handler

When an element tag is encountered by the parser,
it calls the handler, passing the name, the level
and any XML attributes and values
The handler looks up the element name in the
Query Index and examines all of the nodes in the
Candidate List for that entry
For each node it performs two checks
Level Check
Attribute Filter Check
If both checks succeed and there are no other
filters to be checked, the node passes.
If this is the final path node of the query (i.e.
it is the final state) then the document is
deemed to match the query
Otherwise, if it is not the final state, the
query is moved to the next state
This is done by copying the next node for the
query from its Wait List to its corresponding
Candidate List

25
Other Element Handlers

End Element Handler: When an end element tag is
encountered, the corresponding path node is
deleted from the Candidate List in order to
restore that list to its previous state
Element Characters Handler: Works similarly to
the Start Element Handler, except that it
performs a content filter check rather than an
attribute filter check
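
The start and end element handlers can be put together in a simplified sketch on top of Python's xml.sax. It is written under several assumptions and is not the XFilter implementation: filters are left out (only the level check is done), queries must start with / or //, the Wait Lists are represented implicitly by each node's next pointer, and the names PathNode, compile_query, FilterEngine and matching_queries are hypothetical. The end element handler here remembers which copies each open element promoted and removes exactly those.

```python
import re
import xml.sax
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(eq=False)
class PathNode:
    query_id: str
    name: str
    relative_pos: int                  # -1 after "//"
    level: int                         # -1 = any level
    next: Optional["PathNode"] = None  # None marks the final state

def compile_query(query_id, xpath):
    """Build the chain of path nodes and return the first one."""
    nodes, dist, desc = [], 0, False
    for sep, name in re.findall(r"(//|/)([^/]+)", xpath):
        dist, desc = dist + 1, desc or sep == "//"
        if name == "*":
            continue                   # wildcards only add distance
        rel = -1 if desc else (dist if nodes else 0)
        level = -1 if desc else (0 if nodes else dist)
        nodes.append(PathNode(query_id, name, rel, level))
        dist, desc = 0, False
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    return nodes[0]

class FilterEngine(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        self.candidates = {}           # element name -> Candidate List
        self.depth = 0
        self.promoted = []             # per open element: copies it promoted
        self.matched = set()
        for qid, xpath in queries.items():
            first = compile_query(qid, xpath)
            self.candidates.setdefault(first.name, []).append(first)

    def startElement(self, name, attrs):
        self.depth += 1
        added = []
        # Iterate over a snapshot so nodes promoted now are not re-examined.
        for node in list(self.candidates.get(name, [])):
            if node.level not in (-1, self.depth):   # level check
                continue
            if node.next is None:                    # final state reached
                self.matched.add(node.query_id)
                continue
            nxt = node.next
            level = -1 if nxt.relative_pos == -1 else self.depth + nxt.relative_pos
            copy = replace(nxt, level=level)         # copy into Candidate List
            self.candidates.setdefault(copy.name, []).append(copy)
            added.append(copy)
        self.promoted.append(added)

    def endElement(self, name):
        # Restore the Candidate Lists to their state before this element.
        for copy in self.promoted.pop():
            lst = self.candidates[copy.name]
            lst[:] = [c for c in lst if c is not copy]
        self.depth -= 1

def matching_queries(xml_bytes, queries):
    engine = FilterEngine(queries)
    xml.sax.parseString(xml_bytes, engine)
    return engine.matched
```

For example, matching_queries(b"<r><a><x/></a><b/></r>", {"Q": "/r/a/b"}) finds no match: the copy of b promoted when a opened is removed again when a closes, so the later sibling b is not mistaken for a child of a.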

26
Enhanced Filtering Algorithms

List Balancing: Attempts to balance the initial
lengths of the Candidate Lists.
When adding a new query to the index, the element
node of that query whose entry in the index has
the shortest Candidate List is chosen as the
"pivot" node for that query. Thus it is the
first node checked for the query.
Prefiltering:
When a new document arrives, an occurrence table
is constructed containing an entry for each
element name that appears in the document
Once the table is constructed, the queries
referenced by the table are checked to see if all
of the element names they contain are in the
document
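
Both ideas can be sketched in a few lines of Python. This is only an illustration under assumptions: the occurrence table is reduced to a set of element names, each query is represented by the list of element names it contains, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def occurrence_table(xml_text):
    """Set of element names that occur anywhere in the document."""
    return {elem.tag for elem in ET.fromstring(xml_text).iter()}

def prefilter(xml_text, query_names):
    """Keep only queries whose element names all occur in the document."""
    present = occurrence_table(xml_text)
    return [qid for qid, names in query_names.items() if set(names) <= present]

def choose_pivot(names, candidate_lengths):
    """List balancing: pick the element whose Candidate List is shortest."""
    return min(names, key=lambda n: candidate_lengths.get(n, 0))

survivors = prefilter("<a><b/><c/></a>", {"Q1": ["a", "b"], "Q2": ["a", "d"]})
pivot = choose_pivot(["a", "b", "c"], {"a": 5, "b": 1, "c": 3})
```

Here Q2 is discarded before any path matching is attempted, because the document contains no d element.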

27
Performance Analysis
[Chart] Uniform Distribution: Varying Number of
Profiles (1K to 100K), Maximum Depth = 5
[Chart] Skewed Distribution: Varying Number of
Profiles (1K to 100K), Maximum Depth = 5
28
XTrie

Another XML-based document filtering system that
provides efficient matching of XML documents to
large numbers of user profiles
Like XFilter, XTrie uses XPath
XTrie aims to provide improved performance along
with support for more complex XPath expressions

29
XTrie Contributions

XTrie is designed to support effective filtering
based on complex XPath expressions
The XTrie structure and algorithms are designed
to support both ordered and unordered matching of
XML data
By indexing on a sequence of element names (i.e.
substrings) organized in a trie structure and
using sophisticated matching algorithms, XTrie is
able to reduce the number of unnecessary index
probes and redundant matchings

30
XPE-Trees

An XPE-tree is an ordered rooted tree, where each
node is labeled with an element name and the
ordering of the child nodes for each parent node
is based on their order of appearance in the XPE
Relative level is denoted by [k, ∞] if the label
is prefixed with //; otherwise it is defined as
[k, k]

[Figure: example XPE-tree rooted at //a, with each
node labeled by its element name and relative
level]
31
Unordered v. Ordered Matching

Unordered Matching
Checks to see that the labels of the individual
elements match
Enforces the positional constraints specified in
the XPE
Ordered Matching
Takes into account explicit order matching
defined in XPath expression
For simplicity, only unordered matching is
covered

32
Unordered Matching Example
[Figure: an XML Document Tree shown next to the
XPE-Tree, illustrating an unordered matching]
33
Substring Decompositions

A substring is defined as a sequence of element
names of nodes along a path such that each node
is prefixed only with /
In other words, a substring is an ordered list of
nodes in which each node is a direct child of the
one before it
A substring decomposition of an XPE is defined as
a set of substrings that cover all of the nodes
in the XPE and all of the possible paths
A minimal substring decomposition is defined as
the substring decomposition in which each of the
substrings has maximal length
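
For the special case of a single-path XPE (no branches or filters), a maximal-length decomposition can be sketched by starting a new substring at every // operator and at every wildcard. The function name and the sample expression are assumptions for illustration; branching XPEs would need a real parser and are not handled here.

```python
import re

def minimal_decomposition(xpe):
    """Split a single-path XPE into maximal runs of /-connected names."""
    substrings, current = [], []
    for sep, name in re.findall(r"(//|/)([^/]+)", xpe):
        if sep == "//" or name == "*":
            # A descendant operator or a wildcard ends the current run.
            if current:
                substrings.append(current)
            current = []
        if name != "*":
            current.append(name)
    if current:
        substrings.append(current)
    return substrings

parts = minimal_decomposition("/a/b//c/d/*/e")
```

For this made-up expression the result is the three substrings ab, cd and e.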

34
Substring Decomposition Example
XPE: /a/b[c/d//e][g//e/f]//*/*/e/f
[Figure: the XPE-tree drawn twice, once for each
decomposition below]
Minimal Substring Decomposition: abcd, e, abg,
ef, ef
Another Substring Decomposition: ab, abcd, e,
abg, ef, ef
35
Minimal Substring Decompositions

Two important performance advantages
Since longer substrings have a lower probability
of being matched in the input XML document, the
maximal-length substrings generally result in
fewer index probes
Since there are fewer XPEs associated with a
longer substring, the cost of each index probe is
generally lower

36
Substring Trees
[Figure: an XPE-tree and its substring tree, whose
nodes are the substrings ab, abcd, abg, e, ef, ef]
Substrings: ab, abcd, e, abg, ef, ef
37
Matching with Substring Trees

A substring matches a node in an XML document if
its last element matches that node
Since XML documents are parsed using a SAX parser
(which performs a pre-order traversal),
substrings should also be pre-ordered
Matching Types
Partial Matching: a matching for all consecutive
substrings from the first to a given substring
Complete Matching: a partial matching for the
final substring
Sub-tree Matching: a partial matching found for
all descendants of a given substring
Redundant Matching: a sub-tree matching already
found at some earlier node in the XML document

38
Matching with Substring Trees
XPE //a//b/c/d
[Figure: the Substring Tree for this XPE shown
next to an XML Tree]
39
XTrie Indexing Scheme

The first step to building the XTrie index is to
take a set of XPEs and generate their simple
decompositions
A simple decomposition is a minimal decomposition
with substrings added for each branching node
The XTrie index consists of two data structures
A substring table where each row represents a
single substring
A trie where edges are labeled with element names

40
XTrie Substring Table

ParentRow: the row number of the tuple in the
substring table corresponding to its parent
(ParentRow = 0 if it is the root substring)
RelLevel: the relative level of the substring
Rank: the rank of the substring
NumChild: the total number of child substrings
Next: a pointer for a singly linked list that
contains the row numbers of the next tuples in
the substring table

41
XTrie Trie

The trie T is a rooted tree constructed from the
set of distinct substrings S, where each edge in
T is labeled with some element name.
Each node N in T is associated with a label,
denoted by label(N), which is the string formed
by concatenating the edge labels along the path
from the root node of T.
The construction of T ensures that
For each s ∈ S, there is a unique node N in T such
that label(N) = s
For each leaf node N in T, label(N) ∈ S
Basically this ensures that the trie contains all
of the substrings and that they are not duplicated
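
The two construction properties can be checked with a small Python sketch of a trie over element names. The class and function names are hypothetical, the substring table and the pointers into it are left out, and the substrings are represented as tuples of element names.

```python
class TrieNode:
    def __init__(self, label=()):
        self.label = label         # element names on the path from the root
        self.children = {}         # edge label (element name) -> child node
        self.is_substring = False  # True if label(N) is in S

def build_trie(substrings):
    root = TrieNode()
    for s in substrings:
        node = root
        for name in s:
            if name not in node.children:
                node.children[name] = TrieNode(node.label + (name,))
            node = node.children[name]
        node.is_substring = True
    return root

def walk(node):
    """Yield every node of the trie, root first."""
    yield node
    for child in node.children.values():
        yield from walk(child)

S = {("a", "b"), ("a", "b", "c", "d"), ("e",), ("a", "b", "g"), ("e", "f")}
root = build_trie(S)
nodes = list(walk(root))
leaves = [n for n in nodes if not n.children]
```

Shared prefixes such as ab are stored once: the node for ab is both a substring node and an internal node on the way to abcd and abg.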

42
XTrie Trie

Substring pointer, denoted by α(N), points to a
row in the substring table using the following
rule
If label(N) ∈ S, then α(N) points to the first row
of the linked list associated with the substring,
otherwise α(N) = 0
Max-suffix pointer, denoted by β(N), points to
some internal node in T to ensure correctness
β(N) = N' if label(N') is the longest proper
suffix of label(N) among all internal nodes in T,
otherwise if N' does not exist, then β(N) points
to the root

43
XTrie Index Example
XPE1: //a/a/b/c/*/a/b
XPE2: /a/b[c/e]/*/b/c/d
XPE3: /a/b[c/*/d]//b/c
XPE4: //c/b//c/d/*/*/d
[Figure: the Trie T and the Substring Table built
for the four XPEs above]
44
XTrie Matching Algorithm

The Trie is used to detect the occurrence of
matching substrings as the input document is
parsed
For each matching substring s detected, iterate
through all the instances of s in the indexed
XPEs (by traversing the appropriate linked list
of rows in the substring table associated with s)
to check if the matched substring s corresponds
to any non-redundant matching

45
XTrie Matching Algorithm

The matching algorithm maintains two runtime
arrays, B and C
B records the rank of the next child subtree of s
that we need to match for this non-redundant
occurrence of s
C is a bit array that is used to ensure that
sibling substrings match along distinct branches
for an ordered matching
An XPE p matches the XML document if
B[rs, l] = m + 1 for some level l, where
rs is the root substring in the substring-tree
for p
m is the number of child substrings of rs

46
XTrie Optimizations

Lazy XTrie
Aims to reduce the number of index probes by
postponing the probing of the substring table
until the substring appears as a leaf substring
in some XPE
XTrie for Single-Path XPEs
Removes the complexity needed for dealing with
branching XPEs
Although single-path XPEs work in the normal
implementation, a special case is considered
since single-path XPEs are very common in real
world applications

47
XTrie Performance
Comparison between XTrie and XFilter
48
Conclusion

XML-based SDI applications can filter more
precisely than traditional IR approaches since
they make use of the structural information of
XML documents
XFilter provides efficient filtering of XML
documents by encoding user profiles in XPath and
then transforming those XPath queries into an
FSM-based index
XTrie provides even more efficient filtering of
XML documents by decomposing XPath expressions
into substrings which are then used to build a
trie-based index structure

49
References

M. Altinel and M. J. Franklin. Efficient
Filtering of XML Documents for Selective
Dissemination of Information. In Proc. of VLDB,
2000.
C.-Y. Chan, P. Felber, M. Garofalakis, and R.
Rastogi. Efficient Filtering of XML Documents
with XPath Expressions. In Proc. of ICDE, 2002.
