Filtering XML Documents with XPath

1
Filtering XML Documents with XPath
By Nick Phan, CS 240B, Spring 2008
2
Information Dissemination
  • The large volume of data available necessitates
    the use of selective approaches to disseminate
    information in order to not overwhelm end users.
  • Typical Execution Model:
      • Continuously collect new data items from
        underlying data sources
      • Filter them against user profiles
      • Deliver relevant data to interested users

3
Current Systems
  • Current Selective Dissemination of Information
    (SDI) applications use simple keyword matching or
    bag-of-words Information Retrieval (IR)
    techniques.
  • These techniques often suffer from a limited
    ability to express user interests.

4
XML-based SDI Architecture
  • XML has emerged as a standard information
    exchange mechanism on the Internet.
  • XML allows structural information to be encoded
    into the document
  • This structural information can be exploited to
    create more focused and accurate results

5
XML-based SDI Architecture
6
XFilter
  • An XML-based document filtering system that
    provides efficient matching of XML documents to
    large numbers of user profiles
  • Represents user interests with XPath
  • Uses a sophisticated index structure and a
    modified Finite State Machine (FSM) approach to
    quickly locate and examine relevant profiles

7
XPath Basics
  • A language for addressing parts of an XML
    document
  • It treats an XML document as a tree of nodes
  • XPath expressions are patterns that can be
    matched to nodes in the XML tree
  • Paths can be specified as absolute paths (from
    the root of the document tree) or as relative
    paths (from the context node)

8
XPath Basics
  • The hierarchical relationships between nodes
    are specified in the query using parent-child
    (/) or ancestor-descendant (//) operators
  • Example: /catalog/product//msrp
  • Addresses all msrp elements that are descendants
    of product elements that are direct children
    of the catalog element (which is the root)
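
The example path can be tried with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is only an illustrative sketch: the sample catalog document is made up, and because ElementTree evaluates paths relative to an element, /catalog/product//msrp is written as ./product//msrp and evaluated at the catalog root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
CATALOG = """
<catalog>
  <product id="p1">
    <pricing><msrp>19.99</msrp></pricing>
  </product>
  <product id="p2">
    <msrp>5.00</msrp>
  </product>
  <accessory><msrp>1.00</msrp></accessory>
</catalog>
"""

root = ET.fromstring(CATALOG)  # root is the <catalog> element

# /catalog/product//msrp, expressed relative to the root element:
# product must be a direct child, msrp may be at any depth below it.
prices = [e.text for e in root.findall("./product//msrp")]
print(prices)  # ['19.99', '5.00']
```

The msrp under accessory is not selected, because accessory is not a product element.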

9
XPath Basics
  • XPath also has a wildcard operator (*) which
    matches any element name at a location step in a
    query
  • Each location step can also include one or more
    filters
  • A filter is a predicate that is applied to the
    element(s) addressed at that location step
  • Filter expressions are enclosed by [ and ]
    symbols
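
A small sketch of the wildcard (*) and of a filter written in square brackets, again using the XPath subset of Python's xml.etree.ElementTree. The document and the type attribute are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the "type" attribute is an assumption.
root = ET.fromstring(
    "<catalog>"
    "<product type='book'><msrp>10</msrp></product>"
    "<product type='toy'><msrp>20</msrp></product>"
    "<bundle type='book'><msrp>30</msrp></bundle>"
    "</catalog>"
)

# Wildcard: * matches any element name at that location step.
any_child_msrp = [e.text for e in root.findall("./*/msrp")]

# Filter: the predicate in [ ] is applied to the elements addressed
# at that location step (here, the product elements).
book_msrp = [e.text for e in root.findall("./product[@type='book']/msrp")]

print(any_child_msrp)  # ['10', '20', '30']
print(book_msrp)       # ['10']
```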

10
XFilter XPath
  • XPath is used to select entire documents rather
    than parts of a document
  • If the XPath expression matches at least one
    element of a document then we say that document
    satisfies the expression
  • Although other XML query languages would work,
    XPath was chosen because of its simplicity and
    its status as a W3C Recommendation
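
The document-level semantics can be sketched as a Boolean test: a document satisfies an expression as soon as one element matches. The helper name satisfies and the sample document are hypothetical, and the sketch uses ElementTree's relative-path subset of XPath rather than XFilter's own machinery.

```python
import xml.etree.ElementTree as ET

def satisfies(xml_text: str, path: str) -> bool:
    """True if at least one element of the document matches the path."""
    root = ET.fromstring(xml_text)
    # find() returns the first match or None; compare against None
    # explicitly because an Element without children is falsy.
    return root.find(path) is not None

doc = "<catalog><product><msrp>1</msrp></product></catalog>"
print(satisfies(doc, "./product//msrp"))  # True
print(satisfies(doc, "./service"))        # False
```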

11
How is XFilter Different?
  • IR-based SDI systems only handle text documents,
    whereas XFilter can work for any application
    domain where data is tagged using XML
  • XFilter takes advantage of embedded schema
    information, thus providing more precise
    filtering
  • Most previous IR work has focused on accuracy
    rather than efficiency, whereas XFilter is
    designed to scale to large numbers of profiles

12
XFilter Implementation
13
XFilter Implementation
  • Major components:
      • Event-based parser for incoming XML documents
      • XPath parser for user profiles
      • Filter engine
      • Dissemination component which sends the
        filtered data to the appropriate users
  • The heart of the system is the filter engine,
    which uses an index structure and a modified FSM
    approach to quickly locate and check relevant
    profiles

14
XFilter Filter Engine
  • The Filter Engine component of XFilter contains
    an inverted index called the Query Index
  • The Query Index is used to match documents to
    individual XPath queries
  • The Filter Engine allows user profiles to be
    expressed as a Boolean combination of XPath
    expressions

15
XPath Challenges
  • Filtering XML documents using a
    structure-oriented path language such as XPath
    introduces several new problems:
      • Checking the order of elements in the profiles
      • Handling wildcards and descendant operators
      • Evaluating filters that are applied to element
        nodes
  • To handle these problems, XFilter converts each
    XPath query to a Finite State Machine
16
Generating Path Nodes
  • Each XPath query is decomposed into a set of path
    nodes
  • The path nodes represent the element nodes in
    the query and serve as the states for the FSM
  • Path nodes are not generated for wildcard (*)
    nodes

17
Path Node Structure
  • QueryId: A unique identifier for the path
    expression to which the path node belongs
  • Position: A sequence number that determines the
    location of this path node in the order of the
    path nodes for the query
  • RelativePos: An integer that describes the
    distance in document levels between this path
    node and the previous path node
      • A node that is separated from the previous one
        by a descendant operator is flagged with a
        special value of -1

18
Path Node Structure
  • Level: An integer representing the level in the
    XML document at which this path node should be
    checked
  • Filters: If a node contains one or more filters,
    these are stored as expression trees pointed to
    by the path node
  • NextPathNodeSet: Each path node also contains
    pointer(s) to the next path node(s) of the query
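
The path node fields can be sketched as a small Python data class plus a decomposition function. This is a simplified illustration, not XFilter's actual code: it handles only filter-free paths, and the numeric conventions for the first node (RelativePos 0, Level equal to 1 plus its distance from the root) and for unset levels (0, meaning "decided at runtime") are assumptions of the sketch. The names PathNode and decompose are hypothetical.

```python
from dataclasses import dataclass, field
import re

@dataclass
class PathNode:
    query_id: str
    name: str          # element name this state waits for
    position: int      # 1-based order among the query's path nodes
    relative_pos: int  # levels from the previous path node; -1 after //
    level: int         # level to check; -1 = any, 0 = set at runtime (assumed)
    filters: list = field(default_factory=list)  # left empty in this sketch

def decompose(query_id: str, xpath: str) -> list[PathNode]:
    """Split a filter-free absolute path into path nodes."""
    steps = re.findall(r"(//|/)([^/]+)", xpath)  # [(separator, name), ...]
    nodes, wildcards, descendant = [], 0, False
    for sep, name in steps:
        descendant = descendant or sep == "//"
        if name == "*":          # wildcards produce no path node
            wildcards += 1
            continue
        if descendant:           # separated by a descendant operator
            rel, lvl = -1, -1
        elif not nodes:          # first node at a fixed distance from the root
            rel, lvl = 0, 1 + wildcards
        else:                    # 1 level plus any skipped wildcard levels
            rel, lvl = 1 + wildcards, 0
        nodes.append(PathNode(query_id, name, len(nodes) + 1, rel, lvl))
        wildcards, descendant = 0, False
    return nodes

for node in decompose("Q1", "/a/b//c"):
    print(node.name, node.position, node.relative_pos, node.level)
```

For /a/b//c this yields three path nodes; the node for c carries -1 in both RelativePos and Level because it follows a descendant operator.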

19
Path Node Structure Example
20
Query Index
  • The Query Index is organized as a hash table
    based on the element names that appear in the
    XPath expressions
  • Associated with each unique element name are two
    lists:
      • Candidate List
      • Wait List
  • Since each query can only be in a single state of
    its FSM at a time, each query has a single path
    node that represents its current state, referred
    to as the current node

21
Query Index
  • The current node of each query is placed on the
    Candidate List of the index entry for its
    respective element name
  • All of the path nodes representing future states
    are stored in the Wait Lists of their respective
    element names
  • A state transition is defined by promoting a path
    node from the Wait List to the Candidate List

22
Query Index Example
23
XML Parsing & Filtering
  • The XML Parser is based on the SAX interface
    which is a standard interface for event-based XML
    parsing
  • The SAX event-based interface reports parsing
    events directly to the application through
    callbacks and does not build an internal tree
  • XFilter handles the following events that occur
    during the parsing of an XML document:
      • Start Element
      • End Element
      • Element Characters
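
The three event types can be observed with Python's standard xml.sax module. The sketch below only records the callbacks together with the current document level; the EventLogger class is a hypothetical name, and tracking the level in the handler is an assumption about how a filter would obtain it.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records SAX callbacks; no document tree is ever built."""

    def __init__(self):
        super().__init__()
        self.level = 0
        self.events = []

    def startElement(self, name, attrs):
        self.level += 1
        self.events.append(("start", name, self.level, dict(attrs)))

    def endElement(self, name):
        self.events.append(("end", name, self.level))
        self.level -= 1

    def characters(self, content):
        if content.strip():  # ignore whitespace-only text
            self.events.append(("chars", content.strip(), self.level))

logger = EventLogger()
xml.sax.parseString(b"<a><b id='1'>hi</b></a>", logger)
for event in logger.events:
    print(event)
```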

24
Start Element Handler
  • When an element tag is encountered by the parser,
    it calls the handler, passing the name, the level
    and any XML attributes and values
  • The handler looks up the element name in the
    Query Index and examines all of the nodes in the
    Candidate List for that entry
  • For each node it performs two checks:
      • Level Check
      • Attribute Filter Check
  • If both checks succeed and there are no other
    filters to be checked, the node passes:
      • If this is the final path node of the query
        (i.e. it is the final state) then the document
        is deemed to match the query
      • Otherwise, the query is moved to the next
        state
      • This is done by copying the next node for the
        query from its Wait List to its corresponding
        Candidate List

25
Other Element Handlers
  • End Element Handler: When an end element tag is
    encountered, the corresponding path node is
    deleted from the Candidate List in order to
    restore that list to its previous state
  • Element Characters Handler: Works similarly to
    the Start Element Handler except it performs a
    content filter check rather than an attribute
    filter check
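
The start and end element handlers can be sketched as a very small filter built on xml.sax. This is a simplified illustration under several assumptions, not XFilter itself: queries are filter-free paths, so only the level check is performed; the Wait Lists are implicit in each query's node list; a descendant operator directly followed by a wildcard is not handled precisely; and the names path_nodes, MiniXFilter and filter_document are hypothetical.

```python
import xml.sax
from collections import defaultdict

def path_nodes(xpath):
    """Return [(name, relative_pos)] for a filter-free path; -1 marks //."""
    nodes, gap, desc = [], 0, False
    for part in xpath.replace("//", "/~").split("/")[1:]:
        desc = desc or part.startswith("~")
        name = part.lstrip("~")
        if name == "*":          # wildcard: no path node, one more level
            gap += 1
            continue
        nodes.append((name, -1 if desc else gap + 1))
        gap, desc = 0, False
    return nodes

class MiniXFilter(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        self.queries = {qid: path_nodes(xp) for qid, xp in queries.items()}
        self.candidates = defaultdict(list)  # name -> [(qid, pos, level)]
        self.undo = []      # per open element: entries it promoted
        self.level = 0
        self.matched = set()
        for qid in self.queries:             # initial current nodes
            self._promote(qid, 0, parent_level=0)

    def _promote(self, qid, pos, parent_level):
        name, rel = self.queries[qid][pos]
        entry = (qid, pos, -1 if rel == -1 else parent_level + rel)
        self.candidates[name].append(entry)
        return name, entry

    def startElement(self, name, attrs):
        self.level += 1
        promoted = []
        for qid, pos, level in list(self.candidates[name]):  # snapshot
            if level not in (-1, self.level):                # level check
                continue
            if pos + 1 == len(self.queries[qid]):            # final state
                self.matched.add(qid)
            else:                                            # next state
                promoted.append(self._promote(qid, pos + 1, self.level))
        self.undo.append(promoted)

    def endElement(self, name):
        for cand_name, entry in self.undo.pop():  # restore Candidate Lists
            self.candidates[cand_name].remove(entry)
        self.level -= 1

def filter_document(xml_text, queries):
    engine = MiniXFilter(queries)
    xml.sax.parseString(xml_text.encode("utf-8"), engine)
    return engine.matched

queries = {"Q1": "/a/b//c", "Q2": "//b/*/c", "Q3": "/a/c"}
print(sorted(filter_document("<a><b><x><c/></x></b></a>", queries)))
# ['Q1', 'Q2']
```

Because promoted entries are removed again when the promoting element closes, a node waiting behind a descendant operator can only be matched by elements nested inside that element.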

26
Enhanced Filtering Algorithms
  • List Balancing: Attempts to balance the initial
    lengths of the Candidate Lists.
      • When adding a new query to the index, the
        element node of that query whose entry in the
        index has the shortest Candidate List is chosen
        as the pivot node for that query. Thus it is
        the first node checked for the query.
  • Prefiltering:
      • When a new document arrives, an occurrence
        table is constructed containing an entry for
        each element name that appears in the document
      • Once the table is constructed, the queries
        referenced by the table are checked to see if
        all of the element names they contain are in
        the document
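
Prefiltering can be sketched as a set-containment test. The sketch makes simplifying assumptions: the occurrence table is a plain set of element names built with ElementTree rather than during SAX parsing, every query is checked rather than only those referenced by the table, and queries are filter-free paths. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def element_names(xpath):
    """Element names used in a filter-free path (wildcards ignored)."""
    return {step for step in re.split(r"/+", xpath) if step and step != "*"}

def prefilter(xml_text, queries):
    """Keep only queries whose element names all occur in the document."""
    occurrence = {elem.tag for elem in ET.fromstring(xml_text).iter()}
    return {qid: xp for qid, xp in queries.items()
            if element_names(xp) <= occurrence}

queries = {"Q1": "/a/b//c", "Q2": "/a/*/d"}
print(prefilter("<a><b><c/></b></a>", queries))  # {'Q1': '/a/b//c'}
```

Queries that survive prefiltering still need the full structural check; the test only rules out queries that cannot possibly match.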

27
Performance Analysis
[Figure: Uniform distribution, varying the number of profiles (1K to 100K), maximum depth 5]
[Figure: Skewed distribution, varying the number of profiles (1K to 100K), maximum depth 5]
28
XTrie
  • Another XML-based document filtering system that
    provides efficient matching of XML documents to
    large numbers of user profiles
  • Like XFilter, XTrie uses XPath
  • XTrie aims to provide improved performance along
    with support for more complex XPath expressions

29
XTrie Contributions
  • XTrie is designed to support effective filtering
    based on complex XPath expressions
  • The XTrie structure and algorithms are designed
    to support both ordered and unordered matching of
    XML data
  • By indexing on a sequence of element names (i.e.
    substrings) organized in a trie structure and
    using sophisticated matching algorithms, XTrie is
    able to reduce the number of unnecessary index
    probes and redundant matchings

30
XPE-Trees
  • An XPE-tree is an ordered rooted tree, where each
    node is labeled with an element name and the
    ordering of the child nodes for each parent node
    is based on their order of appearance in the XPE
  • The relative level of a node is denoted by
    [k, ∞] if its label is prefixed with //;
    otherwise it is defined as [k, k]

[Figure: XPE-tree for the XPE //a.//b/c/d/f, with nodes //a [1,∞], //b [1,∞], /f [1,1], //c [1,1], /d [2,2]]
31
Unordered v. Ordered Matching
  • Unordered Matching:
      • Checks that the labels of the individual
        elements match
      • Enforces the positional constraints specified
        in the XPE
  • Ordered Matching:
      • Takes into account the explicit ordering
        defined in the XPath expression
  • For simplicity, only unordered matching is
    covered

32
Unordered Matching Example
[Figure: an XML Document Tree shown beside the XPE-Tree it is matched against]
33
Substring Decompositions
  • A substring is defined as a sequence of element
    names of nodes along a path such that each node
    is prefixed only with /
  • In other words, a substring is an ordered list of
    nodes in which each node is a direct child of the
    one before it
  • A substring decomposition of an XPE is defined as
    a set of substrings that cover all of the nodes
    in the XPE and all of the possible paths
  • A minimal substring decomposition is defined as
    the substring decomposition in which each
    substring has maximal length
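
For a single-path XPE without filters, the minimal decomposition can be sketched by cutting the path wherever consecutive names are not joined by a plain /. The function name is hypothetical, the example XPE is made up, and treating a wildcard step as a cut point is an assumption of this sketch; branching XPEs are not handled.

```python
import re

def minimal_substrings(xpe):
    """Maximal runs of element names joined only by '/' in a single-path XPE."""
    substrings, current = [], []
    for sep, name in re.findall(r"(//|/)([^/]+)", xpe):
        if sep == "//" or name == "*":  # the current run ends here
            if current:
                substrings.append(current)
            current = []
        if name != "*":                 # wildcards are not part of any substring
            current.append(name)
    if current:
        substrings.append(current)
    return substrings

print(minimal_substrings("/a/b/c//d/e/*/f"))
# [['a', 'b', 'c'], ['d', 'e'], ['f']]
```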

34
Substring Decomposition Example
XPE: /a/bc/d//eg//e/f////e/f
[Figure: two copies of the XPE-tree for this XPE, each annotated with a substring decomposition]
Minimal substring decomposition: abcd, e, abg, ef, ef
Another substring decomposition: ab, abcd, e, abg, ef, ef
35
Minimal Substring Decompositions
  • Two important performance advantages:
      • Since longer substrings have a lower
        probability of being matched in the input XML
        document, the maximal-length substrings
        generally result in fewer index probes
      • Since there are fewer XPEs associated with a
        longer substring, the cost of each index probe
        is generally lower

36
Substring Trees
[Figure: the XPE-tree beside its substring-tree, whose nodes are ab, abcd, e, abg, ef, ef]
Substrings: ab, abcd, e, abg, ef, ef
37
Matching with Substring Trees
  • A substring matches a node in an XML document if
    its last element matches that node
  • Since XML documents are parsed using a SAX parser
    (which performs a pre-order traversal),
    substrings should also be pre-ordered
  • Matching Types:
      • Partial Matching: a matching for all
        consecutive substrings from the first to a
        given substring
      • Complete Matching: a partial matching for the
        final substring
      • Sub-tree Matching: a partial matching found
        for all descendants of a given substring
      • Redundant Matching: a sub-tree matching
        already found at some earlier node in the XML
        document
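
The basic notion of a substring matching a node can be illustrated with a naive check during SAX parsing: at each start element, a substring matches if it equals the tail of the current root-to-node path. This only illustrates the definition; it compares every substring at every node instead of using the trie, and it does not compute partial, sub-tree or redundant matchings. The class name and the sample input are hypothetical.

```python
import xml.sax

class SubstringMatcher(xml.sax.ContentHandler):
    """Reports, in pre-order, which substrings end at each element."""

    def __init__(self, substrings):
        super().__init__()
        self.substrings = [tuple(s) for s in substrings]
        self.path = []   # names of the currently open elements
        self.hits = []   # (substring, level) pairs

    def startElement(self, name, attrs):
        self.path.append(name)
        for s in self.substrings:
            # The last name must be this element, and the preceding names
            # must be its nearest ancestors, in order.
            if tuple(self.path[-len(s):]) == s:
                self.hits.append(("".join(s), len(self.path)))

    def endElement(self, name):
        self.path.pop()

matcher = SubstringMatcher([["a", "b"], ["b"], ["c", "d"]])
xml.sax.parseString(b"<a><b><c><d/></c></b></a>", matcher)
print(matcher.hits)  # [('ab', 2), ('b', 2), ('cd', 4)]
```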

38
Matching with Substring Trees
XPE: //a//b/c/d
[Figure: the Substring Tree for this XPE beside an example XML Tree]
39
XTrie Indexing Scheme
  • The first step in building the XTrie index is to
    take a set of XPEs and generate their simple
    decompositions
  • A simple decomposition is a minimal decomposition
    with substrings added for each branching node
  • The index consists of two data structures:
      • A substring table where each row represents a
        single substring
      • A trie whose edges are labeled with element
        names

40
XTrie Substring Table
  • ParentRow refers to the row number of the tuple
    in the substring table corresponding to its
    parent (ParentRow = 0 if it is the root substring)
  • RelLevel is the relative level of the substring
  • Rank is the rank of the substring among its
    sibling substrings
  • NumChild is the total number of child substrings
  • Next is a pointer to the next row for the same
    substring, forming a singly linked list of row
    numbers in the substring table

41
XTrie Trie
  • The trie T is a rooted tree constructed from the
    set of distinct substrings S, where each edge in
    T is labeled with some element name.
  • Each node N in T is associated with a label,
    denoted by label(N), which is the string formed
    by concatenating the edge labels along the path
    from the root node of T to N.
  • The construction of T ensures that:
      • For each s ∈ S, there is a unique node N in T
        such that label(N) = s
      • For each leaf node N in T, label(N) ∈ S
  • Basically this ensures that the trie contains all
    of the substrings and that they are not duplicated
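
A minimal sketch of building such a trie from a set of substrings, where each substring is a tuple of element names. The class and function names are hypothetical, and the substring set reuses the ab, abcd, e, abg, ef decomposition from the earlier slides purely as sample input.

```python
class TrieNode:
    def __init__(self, label=()):
        self.label = label         # edge labels from the root, as a tuple
        self.children = {}         # element name -> TrieNode
        self.is_substring = False  # True when label(N) is in S

def build_trie(substrings):
    root = TrieNode()
    for s in substrings:
        node = root
        for name in s:
            if name not in node.children:
                node.children[name] = TrieNode(node.label + (name,))
            node = node.children[name]
        node.is_substring = True   # exactly one node carries label s
    return root

def all_nodes(root):
    """Yield every node of the trie, starting with the root."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children.values())

S = {("a", "b"), ("a", "b", "c", "d"), ("e",), ("a", "b", "g"), ("e", "f")}
root = build_trie(S)

marked = {n.label for n in all_nodes(root) if n.is_substring}
leaves = {n.label for n in all_nodes(root) if not n.children}
print(marked == S)   # True: every substring labels exactly one node
print(leaves <= S)   # True: every leaf is labeled with a substring
```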

42
XTrie Trie
  • Substring pointer, denoted by α(N), points to a
    row in the substring table using the following
    rule:
      • If label(N) ∈ S, then α(N) points to the first
        row of the linked list associated with the
        substring; otherwise α(N) = 0
  • Max-suffix pointer, denoted by β(N), points to
    some internal node in T to ensure correctness
      • β(N) = N' if label(N') is the longest proper
        suffix of label(N) among all internal nodes in
        T; if no such N' exists, β(N) points to the
        root
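
The max-suffix pointer rule can be sketched by brute force: for each trie node, try its proper suffixes from longest to shortest and stop at the first one that labels an internal node. The sketch follows the wording of the slide and reads "internal node" as "node with at least one child", which is an assumption; the trie is represented as a plain dictionary, and the function names and sample substrings are hypothetical. The pointers play a role similar to failure links in Aho-Corasick string matching.

```python
def build_trie(substrings):
    """Return {label: set of child names}; labels are tuples of element names."""
    children = {(): set()}
    for s in substrings:
        for i, name in enumerate(s):
            children[tuple(s[:i])].add(name)
            children.setdefault(tuple(s[:i + 1]), set())
    return children

def max_suffix_pointers(children):
    """Map each non-root label to its max-suffix target (() is the root)."""
    pointers = {}
    for label in children:
        if not label:
            continue                  # the root itself gets no pointer
        pointers[label] = ()          # default: point to the root
        for start in range(1, len(label)):
            suffix = label[start:]    # proper suffixes, longest first
            if children.get(suffix):  # in the trie and has a child
                pointers[label] = suffix
                break
    return pointers

trie = build_trie([("a", "a", "b", "c"), ("a", "b"), ("b", "c", "d")])
pointers = max_suffix_pointers(trie)
print(pointers[("a", "a", "b", "c")])  # ('b', 'c')
print(pointers[("a", "a", "b")])       # ('b',)
```

In the second case the longer suffix ab is skipped because its node is a leaf, so the pointer falls through to b.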

43
XTrie Index Example
XPE1: //a/a/b/c//a/b
XPE2: /a/bc/e//b/c/d
XPE3: /a/bc//d//b/c
XPE4: //c/b//c/d///d
[Figure: the Trie T and the Substring Table built from these four XPEs]
44
XTrie Matching Algorithm
  • The Trie is used to detect the occurrence of
    matching substrings as the input document is
    parsed
  • For each matching substring s detected, iterate
    through all the instances of s in the indexed
    XPEs (by traversing the appropriate linked list
    of rows in the substring table associated with s)
    to check if the matched substring s corresponds
    to any non-redundant matching

45
XTrie Matching Algorithm
  • The matching algorithm maintains two runtime
    arrays, B and C
  • B records the rank of the next child subtree of s
    that we need to match for this non-redundant
    occurrence of s
  • C is a bit array that is used to ensure that
    sibling substrings match along distinct branches
    for an ordered matching
  • An XPE p matches the XML document if
    B[rs, l] = m + 1 for some level l, where
      • rs is the root substring in the substring-tree
        for p
      • m is the number of child substrings of rs

46
XTrie Optimizations
  • Lazy XTrie:
      • Aims to reduce the number of index probes by
        postponing the probing of the substring table
        until the substring appears as a leaf substring
        in some XPE
  • XTrie for Single-Path XPEs:
      • Removes the complexity needed for dealing with
        branching XPEs
      • Although single-path XPEs work in the normal
        implementation, a special case is considered
        since single-path XPEs are very common in
        real-world applications

47
XTrie Performance
[Figure: performance comparison between XTrie and XFilter]
48
Conclusion
  • XML-based SDI applications can filter more
    precisely than traditional IR approaches since
    they make use of the structural information of
    XML documents
  • XFilter provides efficient filtering of XML
    documents by encoding user profiles in XPath and
    then transforming those XPath queries into an
    FSM-based index
  • XTrie provides even more efficient filtering of
    XML documents by decomposing XPath expressions
    into substrings, which are then used to build a
    trie-based index structure

49
References
  • M. Altinel and M. J. Franklin. Efficient
    Filtering of XML Documents for Selective
    Dissemination of Information. In Proc. of VLDB,
    2000.
  • C.-Y. Chan, P. Felber, M. Garofalakis, and R.
    Rastogi. Efficient Filtering of XML Documents
    with XPath Expressions. In Proc. of ICDE, 2002.