Title: Filtering XML Documents with XPath
1Filtering XML Documents with XPath
By Nick Phan CS 240B Spring 2008
2Information Dissemination
- The large volume of available data necessitates selective approaches to disseminating information so that end users are not overwhelmed.
- Typical execution model
  - Continuously collect new data items from underlying data sources
  - Filter them against user profiles
  - Deliver relevant data to interested users
3Current Systems
- Current Selective Dissemination of Information (SDI) applications use simple keyword matching or bag-of-words Information Retrieval (IR) techniques.
- These techniques often suffer from a limited ability to express user interests.
4XML-based SDI Architecture
- XML has emerged as a standard information exchange mechanism on the Internet.
- XML allows structural information to be encoded into the document
- This structural information can be exploited to create more focused and accurate results
5XML-based SDI Architecture
6XFilter
- An XML-based document filtering system that provides efficient matching of XML documents to large numbers of user profiles
- Represents user interests with XPath
- Uses a sophisticated index structure and a modified Finite State Machine (FSM) approach to quickly locate and examine relevant profiles
7XPath Basics
- A language for addressing parts of an XML document
- It treats an XML document as a tree of nodes
- XPath expressions are patterns that can be matched to nodes in the XML tree
- Paths can be specified as absolute paths (from the root of the document tree) or as relative paths (from the context node)
8XPath Basics
- The hierarchical relationships between nodes are specified in the query using the parent-child operator (/) or the ancestor-descendant operator (//)
- Example: /catalog/product//msrp
  - Addresses all msrp elements that are descendants of product elements that are direct children of the catalog element (which is the root)
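As a rough illustration of the example path, the sketch below uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath. The sample catalog document and the helper name msrp_values are invented for this illustration. Because ElementTree evaluates paths relative to an element, the absolute path /catalog/product//msrp is written here as ./product//msrp, evaluated from the catalog root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = """
<catalog>
  <product><pricing><msrp>10</msrp></pricing></product>
  <product><msrp>20</msrp></product>
  <accessory><msrp>30</msrp></accessory>
</catalog>
"""

def msrp_values(xml_text):
    """Return the text of every msrp element below a product child of the root."""
    root = ET.fromstring(xml_text)  # root is the <catalog> element
    # "./product" = direct children of the root, "//msrp" = any descendant.
    return [elem.text for elem in root.findall("./product//msrp")]

if __name__ == "__main__":
    print(msrp_values(SAMPLE))
```

The msrp under accessory is not selected, because accessory is not a product child of the catalog element.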
9XPath Basics
- XPath also has a wildcard operator (*), which matches any element name at a location step in a query
- Each location step can also include one or more filters
- A filter is a predicate that is applied to the element(s) addressed at that location step
- Filter expressions are enclosed by [ and ] symbols
10XFilter XPath
- XPath is used to select entire documents rather than parts of a document
- If the XPath expression matches at least one element of a document, then we say that the document satisfies the expression
- Although other XML query languages would work, XPath was chosen because of its simplicity and its recommendation by the W3C
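The "document satisfies the expression" rule can be sketched in a few lines. This is only an illustration using Python's xml.etree.ElementTree (a subset of XPath, with paths relative to the root element); the function names satisfies and filter_documents are hypothetical, and this is not how XFilter itself evaluates profiles.

```python
import xml.etree.ElementTree as ET

def satisfies(xml_text, path):
    """True if `path` matches at least one element of the document.

    `path` uses ElementTree's XPath subset and is relative to the root.
    """
    return ET.fromstring(xml_text).find(path) is not None

def filter_documents(documents, path):
    """Keep whole documents that satisfy the expression, not fragments."""
    return [doc for doc in documents if satisfies(doc, path)]
```

For example, filtering with "./product//msrp" keeps a catalog document that contains an msrp below a product and drops one that does not.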
11How is XFilter Different?
- IR-based SDI systems only involve text documents, whereas XFilter can work for any application domain where data is tagged using XML
- XFilter takes advantage of embedded schema information, thus providing more precise filtering
- Most previous IR work has focused on accuracy rather than efficiency, whereas XFilter is designed to scale to large numbers of user profiles
12XFilter Implementation
13XFilter Implementation
- Major components
  - Event-based parser for incoming XML documents
  - XPath parser for user profiles
  - Filter engine
  - Dissemination component, which sends the filtered data to the appropriate users
- The heart of the system is the filter engine, which uses an index structure and a modified FSM approach to quickly locate and check relevant profiles
14XFilter Filter Engine
- The Filter Engine component of XFilter contains an inverted index called the Query Index
- The Query Index is used to match documents to individual XPath queries
- The Filter Engine allows user profiles to be expressed as a Boolean combination of XPath expressions
15XPath Challenges
- Filtering XML documents using a structure-oriented path language such as XPath introduces several new problems
  - Checking the order of elements in the profiles
  - Handling wildcards and descendant operators
  - Evaluating filters that are applied to element nodes
- To handle these problems, XFilter converts each XPath query to a Finite State Machine
16Generating Path Nodes
- Each XPath query is decomposed into a set of path nodes
- The path nodes represent the element nodes in the query and serve as the states for the FSM
- Path nodes are not generated for wildcard (*) nodes
17Path Node Structure
- QueryId: A unique identifier for the path expression to which the path node belongs
- Position: A sequence number that determines the location of this path node in the order of the path nodes for the query
- RelativePos: An integer that describes the distance in document levels between this path node and the previous path node
  - A node that is separated from the previous one by a descendant operator is flagged with a special value of -1
18Path Node Structure
- Level: An integer representing the level in the XML document at which this path node should be checked
- Filters: If a node contains one or more filters, these are stored as expression trees pointed to by the path node
- NextPathNodeSet: Each path node also contains pointer(s) to the next path node(s) of the query
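The path node fields can be sketched as a small Python dataclass plus a decomposition routine for simple paths without filters. The slides only state that wildcards produce no path node and that a descendant operator yields RelativePos -1; the remaining initialization rules used below (RelativePos 0 and Level "1 + distance from root" for the first node of an absolute path, RelativePos "1 + skipped wildcards" and Level 0 otherwise, Level -1 after a descendant operator) are assumptions made for this sketch. The names PathNode and decompose are hypothetical, and Filters and NextPathNodeSet are left out.

```python
import re
from dataclasses import dataclass

@dataclass
class PathNode:
    query_id: int
    element: str
    position: int      # 1-based order of the node within its query
    relative_pos: int  # -1 after a descendant operator (//)
    level: int         # -1 = unrestricted, 0 = fixed later at runtime (assumed)

def decompose(query_id, xpath):
    """Split a filter-free path such as /a/*/b//c into path nodes."""
    nodes = []
    wildcards = 0        # wildcard steps seen since the last path node
    descendant = False   # a // was seen since the last path node
    for sep, name in re.findall(r"(//|/)([^/]+)", xpath):
        descendant = descendant or sep == "//"
        if name == "*":
            wildcards += 1          # no path node is generated for a wildcard
            continue
        if descendant:
            rel, level = -1, -1
        elif not nodes:             # first node of an absolute path
            rel, level = 0, 1 + wildcards
        else:
            rel, level = 1 + wildcards, 0
        nodes.append(PathNode(query_id, name, len(nodes) + 1, rel, level))
        wildcards, descendant = 0, False
    return nodes
```

Under these assumptions, decompose(1, "/a/b//c") gives a (RelativePos 0, Level 1), b (1, 0) and c (-1, -1).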
19Path Node Structure Example
20Query Index
- The Query Index is organized as a hash table based on the element names that appear in the XPath expressions
- Associated with each unique element name are two lists
  - Candidate List
  - Wait List
- Since each query can only be in a single state of its FSM at a time, each query has a single path node that represents its current state.
  - Referred to as the current node
21Query Index
- The current node of each query is placed on the Candidate List of the index entry for its respective element name
- All of the path nodes representing future states are stored in the Wait Lists of their respective element names
- A state transition is defined by promoting a path node from the Wait List to the Candidate List
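A minimal sketch of the Query Index layout follows, assuming each query is given simply as an ordered list of element names (wildcards, levels and filters are ignored here). The names build_query_index and promote, and the (query id, position) tuples standing in for path nodes, are hypothetical.

```python
from collections import defaultdict

def build_query_index(queries):
    """Hash table: element name -> {"candidate": [...], "wait": [...]}.

    `queries` maps a query id to its ordered list of element names.
    Entries are (query_id, position) pairs standing in for path nodes.
    """
    index = defaultdict(lambda: {"candidate": [], "wait": []})
    for qid, names in queries.items():
        for position, name in enumerate(names, start=1):
            # The first path node is the initial current node of the query.
            which = "candidate" if position == 1 else "wait"
            index[name][which].append((qid, position))
    return index

def promote(index, name, qid, position):
    """State transition: copy a path node from the Wait List to the Candidate List."""
    entry = (qid, position)
    if entry not in index[name]["wait"]:
        raise KeyError(f"{entry} is not on the wait list of {name!r}")
    index[name]["candidate"].append(entry)
```

With queries Q1 = a, b, c and Q2 = b, d, the entry for b starts with Q2's first node on its Candidate List and Q1's second node on its Wait List; promoting Q1's node puts both on the Candidate List.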
22Query Index Example
23XML Parsing and Filtering
- The XML parser is based on the SAX interface, which is a standard interface for event-based XML parsing
- The SAX event-based interface reports parsing events directly to the application through callbacks and does not build an internal tree
- XFilter handles the following events that occur during the parsing of an XML document
  - Start Element
  - End Element
  - Element Characters
24Start Element Handler
- When the parser encounters an element start tag, it calls the handler, passing the element name, its level, and any XML attributes and values
- The handler looks up the element name in the Query Index and examines all of the nodes in the Candidate List for that entry
- For each node it performs two checks
  - Level check
  - Attribute filter check
- If both checks succeed and there are no other filters to be checked, the node passes.
  - If this is the final path node of the query (i.e., the final state), the document is deemed to match the query
  - Otherwise, the query is moved to the next state
  - This is done by copying the next node for the query from its Wait List to its corresponding Candidate List
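The start and end element handling can be sketched with Python's standard xml.sax module. This is a simplified illustration rather than XFilter's actual implementation: filters are not checked, each query's ordered node list stands in for the Wait Lists, and the rule that a promoted node's level becomes "current depth + RelativePos" (or stays unrestricted after //) is an assumption. The names XFilterSketch and matching_queries are hypothetical; each query is a list of (element name, RelativePos, Level) tuples.

```python
import xml.sax
from collections import defaultdict

class XFilterSketch(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        self.queries = queries
        self.candidates = defaultdict(list)  # name -> [(qid, pos, level)]
        for qid, nodes in queries.items():
            name, _, level = nodes[0]
            self.candidates[name].append((qid, 0, level))
        self.depth = 0
        self.undo = []       # per open element: nodes promoted at its start tag
        self.matched = set()

    def startElement(self, name, attrs):
        self.depth += 1
        promoted = []
        for qid, pos, level in list(self.candidates[name]):
            if level != -1 and level != self.depth:
                continue                      # level check failed
            nodes = self.queries[qid]
            if pos == len(nodes) - 1:
                self.matched.add(qid)         # final state reached
                continue
            next_name, rel, _ = nodes[pos + 1]
            next_level = -1 if rel == -1 else self.depth + rel
            entry = (qid, pos + 1, next_level)
            self.candidates[next_name].append(entry)
            promoted.append((next_name, entry))
        self.undo.append(promoted)

    def endElement(self, name):
        # Restore the Candidate Lists to their state before this element.
        for next_name, entry in self.undo.pop():
            self.candidates[next_name].remove(entry)
        self.depth -= 1

def matching_queries(xml_bytes, queries):
    handler = XFilterSketch(queries)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matched
```

For instance, a query for /a/b//c given as [("a", 0, 1), ("b", 1, 0), ("c", -1, -1)] matches a document where c appears anywhere below a b that is a direct child of the root a, but not one where b sits one level deeper.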
25Other Element Handlers
- End Element Handler: When an end element tag is encountered, the corresponding path node is deleted from the Candidate List in order to restore that list to its previous state
- Element Characters Handler: Works similarly to the Start Element Handler, except that it performs a content filter check rather than an attribute filter check
26Enhanced Filtering Algorithms
- List Balancing: Attempts to balance the initial lengths of the Candidate Lists.
  - When adding a new query to the index, the element node of that query whose entry in the index has the shortest Candidate List is chosen as the pivot node for that query. Thus it is the first node checked for the query.
- Prefiltering
  - When a new document arrives, an occurrence table is constructed containing an entry for each element name that appears in the document
  - Once the table is constructed, the queries referenced by the table are checked to see if all of the element names they contain are in the document
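The prefiltering step can be approximated as a set-containment test. The sketch below is a simplification: the occurrence table is reduced to a plain set of element names, built with xml.etree.ElementTree rather than during SAX parsing, and each query is represented only by the element names it mentions. The function names occurrence_table and prefilter are hypothetical.

```python
import xml.etree.ElementTree as ET

def occurrence_table(xml_text):
    """Set of element names that occur anywhere in the document."""
    return {elem.tag for elem in ET.fromstring(xml_text).iter()}

def prefilter(xml_text, queries):
    """Return ids of queries whose element names all occur in the document.

    `queries` maps a query id to the element names used by that query.
    Surviving queries still need the full structural check afterwards.
    """
    present = occurrence_table(xml_text)
    return [qid for qid, names in queries.items() if set(names) <= present]
```

A query that mentions an element name absent from the document can never match, so it is dropped before any FSM work is done.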
27Performance Analysis
- Uniform distribution: varying number of profiles (1K to 100K), maximum depth 5
- Skewed distribution: varying number of profiles (1K to 100K), maximum depth 5
28XTrie
- Another XML-based document filtering system that provides efficient matching of XML documents to large numbers of user profiles
- Like XFilter, XTrie uses XPath
- XTrie aims to provide improved performance along with support for more complex XPath expressions
29XTrie Contributions
- XTrie is designed to support effective filtering based on complex XPath expressions (XPEs)
- The XTrie structure and algorithms are designed to support both ordered and unordered matching of XML data
- By indexing on sequences of element names (i.e., substrings) organized in a trie structure and using sophisticated matching algorithms, XTrie is able to reduce the number of unnecessary index probes and redundant matchings
30XPE-Trees
- An XPE-tree is an ordered rooted tree, where each node is labeled with an element name and the child nodes of each parent node are ordered by their order of appearance in the XPE
- The relative level of a node is denoted by [k, ∞] if its label is prefixed with //; otherwise it is defined as [k, k]
[Figure: XPE-tree for the XPE //a.//b/c/d/f, with node labels and relative levels //a [1,∞]; //b [1,∞]; /f [1,1]; //c [1,1]; /d [2,2]]
31Unordered v. Ordered Matching
- Unordered Matching
  - Checks to see that the labels of the individual elements match
  - Enforces the positional constraints specified in the XPE
- Ordered Matching
  - Takes into account the explicit order matching defined in the XPath expression
- For simplicity, only unordered matching is covered
32Unordered Matching Example
[Figure: an XML document tree shown next to an XPE-tree with nodes //a [1,∞], //b [1,∞], /f [1,1], //c [1,1], /d [2,2]]
33Substring Decompositions
- A substring is defined as a sequence of element names of nodes along a path such that each node is prefixed only with /
- In other words, a substring is an ordered list of nodes in which each node is a direct child of the one before it
- A substring decomposition of an XPE is defined as a set of substrings that covers all of the nodes in the XPE and all of the possible paths
- A minimal substring decomposition is defined as the substring decomposition in which each of the substrings has maximal length
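For a single-path XPE (no branching predicates), maximal-length substrings can be obtained by cutting the path at every // operator and every wildcard step. The sketch below handles only that restricted case; the function name minimal_substrings is hypothetical, and branching XPEs would need the XPE-tree rather than a flat scan.

```python
import re

def minimal_substrings(xpe):
    """Maximal runs of element names joined only by / in a single-path XPE.

    Each substring is returned as a list of element names.
    """
    substrings, current = [], []
    for sep, name in re.findall(r"(//|/)([^/]+)", xpe):
        # A descendant operator or a wildcard step ends the current run.
        if (sep == "//" or name == "*") and current:
            substrings.append(current)
            current = []
        if name != "*":
            current.append(name)
    if current:
        substrings.append(current)
    return substrings
```

For example, //c/b//c/d/*/*/d is cut into the substrings cb, cd and d.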
34Substring Decomposition Example
XPE: /a/b[c/d//e][g//e/f]//*/*/e/f
[Figure: two copies of the XPE-tree for this expression, one per substring decomposition]
- Minimal substring decomposition: abcd, e, abg, ef, ef
- Decomposition that also includes the substring ab: ab, abcd, e, abg, ef, ef
35Minimal Substring Decompositions
- Two important performance advantages
  - Since longer substrings have a lower probability of being matched in the input XML document, the maximal-length substrings generally result in fewer index probes
  - Since there are fewer XPEs associated with a longer substring, the cost of each index probe is generally lower
36Substring Trees
[Figure: substring tree with nodes ab, abcd, abg, ef, e, ef drawn over the corresponding XPE-tree]
Substrings: ab, abcd, e, abg, ef, ef
37Matching with Substring Trees
- A substring matches a node in an XML document if its last element matches that node
- Since XML documents are parsed using a SAX parser (which performs a pre-order traversal), substrings should also be pre-ordered
- Matching types
  - Partial matching: a matching for all consecutive substrings from the first to a given substring
  - Complete matching: a partial matching for the final substring
  - Sub-tree matching: a partial matching found for all descendants of a given substring
  - Redundant matching: a sub-tree matching that was already found at some earlier node in the XML document
38Matching with Substring Trees
XPE: //a//b/c/d
[Figure: the substring tree for this XPE shown next to an XML document tree]
39XTrie Indexing Scheme
- The first step in building the XTrie index is to take a set of XPEs and generate their simple decompositions
  - A simple decomposition is a minimal decomposition with substrings added for each branching node
- The index consists of two data structures
  - A substring table, where each row represents a single substring
  - A trie, where edges are labeled with element names
40XTrie Substring Table
- ParentRow refers to the row number of the tuple in the substring table corresponding to its parent (ParentRow = 0 if it is the root substring)
- RelLevel is the relative level of the substring
- Rank is the rank of the substring
- NumChild is the total number of child substrings
- Next is a pointer for a singly linked list that contains the row numbers of the next tuples in the substring table
41XTrie Trie
- The trie T is a rooted tree constructed from the set of distinct substrings S, where each edge in T is labeled with some element name.
- Each node N in T is associated with a label, denoted by label(N), which is the string formed by concatenating the edge labels along the path from the root node of T to N.
- The construction of T ensures that
  - For each s ∈ S, there is a unique node N in T such that label(N) = s
  - For each leaf node N in T, label(N) ∈ S
- Basically, this ensures that the trie contains all of the substrings and that they are not duplicated
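The two construction properties can be checked on a small nested-dictionary trie. This is only a sketch of the trie shape: the substring and max-suffix pointers are not modeled, and the names build_trie and labels are hypothetical. Substrings are given as tuples of element names.

```python
def build_trie(substrings):
    """Nested dicts: each key is an edge label (an element name)."""
    root = {}
    for substring in substrings:
        node = root
        for name in substring:
            node = node.setdefault(name, {})  # reuse shared prefixes
    return root

def labels(node, prefix=()):
    """Yield (label(N), is_leaf) for every non-root node N.

    label(N) is the tuple of edge labels on the path from the root to N.
    """
    for name, child in node.items():
        label = prefix + (name,)
        yield label, not child
        yield from labels(child, label)
```

Building the trie for the substrings ab, abcd, e, abg and ef yields the node labels a, ab, abc, abcd, abg, e and ef: every substring appears exactly once as a label, and every leaf label is one of the substrings.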
42XTrie Trie
- Substring pointer, denoted by α(N), points to a row in the substring table using the following rule
  - If label(N) ∈ S, then α(N) points to the first row of the linked list associated with the substring; otherwise α(N) = 0
- Max-suffix pointer, denoted by β(N), points to some internal node in T to ensure correctness
  - β(N) = N′ if label(N′) is the longest proper suffix of label(N) among all internal nodes in T; if no such N′ exists, β(N) points to the root
43XTrie Index Example
XPE1: //a/a/b/c/*/a/b
XPE2: /a/b[c/e]/*/b/c/d
XPE3: /a/b[c/*/d]//b/c
XPE4: //c/b//c/d/*/*/d
[Figure: trie T and substring table built for XPE1 to XPE4]
44XTrie Matching Algorithm
- The trie is used to detect the occurrence of matching substrings as the input document is parsed
- For each matching substring s detected, iterate through all the instances of s in the indexed XPEs (by traversing the appropriate linked list of rows in the substring table associated with s) to check if the matched substring s corresponds to any non-redundant matching
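The first step, detecting substring occurrences while the document is parsed, can be illustrated with a SAX handler that keeps the current root-to-node path. Note that this sketch swaps the trie for a naive suffix comparison against every substring, so it shows what is detected rather than how XTrie detects it efficiently, and it does not perform the substring table checks. The names SubstringDetector and detect are hypothetical.

```python
import xml.sax

class SubstringDetector(xml.sax.ContentHandler):
    def __init__(self, substrings):
        super().__init__()
        self.substrings = [tuple(s) for s in substrings]
        self.path = []   # element names from the root to the current node
        self.hits = []   # (substring as a string, document level) per match

    def startElement(self, name, attrs):
        self.path.append(name)
        for s in self.substrings:
            # s matches here if it equals the last len(s) names on the path.
            if tuple(self.path[-len(s):]) == s:
                self.hits.append(("".join(s), len(self.path)))

    def endElement(self, name):
        self.path.pop()

def detect(xml_bytes, substrings):
    handler = SubstringDetector(substrings)
    xml.sax.parseString(xml_bytes, handler)
    return handler.hits
```

Hits are reported in pre-order, matching the order in which a SAX parser visits the document's elements.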
45XTrie Matching Algorithm
- The matching algorithm maintains two runtime arrays, B and C
  - B records the rank of the next child subtree of s that needs to be matched for this non-redundant occurrence of s
  - C is a bit array that is used to ensure that sibling substrings match along distinct branches for an ordered matching
- An XPE p matches the XML document if B[rs, l] = m + 1 for some level l, where
  - rs is the root substring in the substring tree for p
  - m is the number of child substrings of rs
46XTrie Optimizations
- Lazy XTrie
  - Aims to reduce the number of index probes by postponing the probing of the substring table until the substring appears as a leaf substring in some XPE
- XTrie for Single-Path XPEs
  - Removes the complexity needed for dealing with branching XPEs
  - Although single-path XPEs work in the normal implementation, a special case is considered since single-path XPEs are very common in real-world applications
47XTrie Performance
Comparison between XTrie and XFilter
48Conclusion
- XML-based SDI applications can filter more precisely than traditional IR approaches because they make use of the structural information in XML documents
- XFilter provides efficient filtering of XML documents by encoding user profiles in XPath and then transforming those XPath queries into an FSM-based index
- XTrie provides even more efficient filtering of XML documents by decomposing XPath expressions into substrings, which are then used to build a trie-based index structure
49References
- M. Altinel and M. J. Franklin. Efficient Filtering of XML Documents for Selective Dissemination of Information. In Proc. of VLDB, 2000.
- C.-Y. Chan, P. Felber, M. Garofalakis, and R. Rastogi. Efficient Filtering of XML Documents with XPath Expressions. In Proc. of ICDE, 2002.