Title: XML Indexing Techniques
1XML Indexing Techniques
- Requirements
- Dataguide and Variation
- Index Fabric
- Adaptive Path Index
- Node Numbering scheme
- Compact Structural Summary
- Conclusion
2Requirements
- XML queries involve navigating data using regular path expressions (e.g., XPath)
- /Livre//Auteur[@specialite="informatique"]
- Accessing all elements with the same name string.
- Ancestor-descendant relationship between elements.
- Content-based access on values included in text.
3Index Types
- Structural index
- Accessing all elements of given name
- Ancestor-descendant and parent-child relationship between elements
- Content index
- Accessing elements containing given keywords
- Supporting most text search functionalities
4Classical Content Index
- Classically based on inverted lists
- For each term, gives the doc.ID and location
- Several variations allow different search types
- Offset, Relative, Proximity
- Generally stored in a B-Tree to optimize search for a given word
- Size is an important issue
- Memory and Disk
- (word, location)
- Fixed entry (word repeated)
- (word, frequency, (locations))
- Variable length entry
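The variable-length entry form (word, frequency, (locations)) can be sketched as follows. This is a minimal illustration, not a production index: a Python dict stands in for the B-Tree, locations are 1-based word offsets, and the two sample documents and the function names are made up for the example.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [word offsets]} (one variable-length entry per word)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
            index[word][doc_id].append(offset)
    return index

def frequency(index, word):
    """Total number of occurrences of `word` across all documents."""
    return sum(len(offsets) for offsets in index.get(word, {}).values())

def proximity(index, w1, w2, max_gap):
    """Doc ids in which w1 and w2 occur within `max_gap` words of each other."""
    hits = set()
    for doc_id, offs1 in index.get(w1, {}).items():
        offs2 = index.get(w2, {}).get(doc_id, [])
        if any(abs(a - b) <= max_gap for a in offs1 for b in offs2):
            hits.add(doc_id)
    return hits

# Hypothetical sample documents.
docs = {
    1: "XML indexing uses inverted lists",
    2: "inverted lists map a word to documents and XML nodes",
}
index = build_inverted_index(docs)
```

Keeping the offsets in the entry is what makes the proximity search possible; a fixed (word, location) layout would repeat the word once per occurrence instead.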
5Problem with XML
- Support of element addressing
- Doc.ID should include NodeId (XPath) + offset
- Index size becomes very large
- XPath expressions are long
- Support of typed data
- Integer, float, simple types of XML schema
- Requires classical indexes for certain elements
- Query processing
- Structural joins
- Text search
- Exact search
- Support of updates
- Incremental updates would be a plus
6Evaluation Criteria
- Identifiers
- Per node or per document
- Descendant/Ancestor Search
- By join algo.
- By graph traversal
- By OID comparison
- Keyword Search
- By element scan
- By B-tree traversal
- Update
- Incremental
- Index size
- Entry number
- Entry size
72-Dataguide and Variation
- Goldman & Widom, VLDB 97
- Dynamic schemas
- helps in query formulation
- Concise and accurate structural summaries
- Every path in the database has one and only one
corresponding path in the DataGuide with the same
sequence of labels
- A legal label path
- Restaurant/Name
- Target set
- for e = Restaurant/Entree, Ts(e) = {6, 10, 11}.
- DocId can be added to identifiers
8Dataguide Principle
- To achieve conciseness
- a DataGuide describes every unique label path of a source exactly once.
- To ensure accuracy
- a DataGuide encodes no label path that does not appear in the source.
- And for convenience
- a DataGuide should itself be an object (OEM or XML).
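For tree-shaped data, the conciseness and accuracy requirements can be sketched by grouping nodes on their root label path: each distinct path appears once, and only paths present in the source appear. This is an illustration restricted to trees (graph-shaped sources need a more elaborate construction). The node ids below are hypothetical, chosen only so that Restaurant/Entree gets the target set {6, 10, 11}.

```python
from collections import defaultdict

def build_dataguide(nodes):
    """nodes: {node_id: (label, parent_id)}; the root has parent_id None and no label.
    Returns {label_path: target set (sorted list of node ids)}."""
    def label_path(nid):
        labels = []
        while nodes[nid][1] is not None:      # stop at the unlabeled root
            label, parent = nodes[nid]
            labels.append(label)
            nid = parent
        return "/".join(reversed(labels))

    guide = defaultdict(list)
    for nid in sorted(nodes):
        path = label_path(nid)
        if path:                              # skip the root itself
            guide[path].append(nid)
    return dict(guide)

# Hypothetical source: two restaurants with names and entrees.
nodes = {
    0: (None, None),
    1: ("Restaurant", 0), 2: ("Restaurant", 0),
    3: ("Name", 1), 4: ("Name", 2),
    6: ("Entree", 1), 10: ("Entree", 2), 11: ("Entree", 2),
}
guide = build_dataguide(nodes)
```

Here `guide["Restaurant/Entree"]` plays the role of the target set Ts(e): the label path is stored once, however many nodes it reaches.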
9Dataguide Evaluation
- Identifier
- One per node
- Descendant/Ancestor Search
- By graph traversal
- Keyword Search
- By element scan
- Update
- Insertion is incremental
- Deletion is complex
- Index size
- Entry number: linear for a tree, can be exponential in the number of DB nodes
- Entry size: number of elements for a path
10T-Index
- Milo & Suciu, LNCS 1997
- T-index stands for Template-index
- A path template t has the form
- T1 x1 T2 x2 ... Tn xn
- where each Ti is either a regular path expression or one of the following two placeholders: P (any Path) and F (any Formula)
- //restaurant/ x P y /Address/City z F u
- A query path q is obtained from t by instantiating
- P by any path, F by any formula
11Principle
- T-index indexes all sequences of objects connected by a sequence of path expressions defined by a template.
- Particular cases
- 1-index: indexes the template "any path P"
- Indexes all objects reachable through an arbitrary path expression P from a root
- two nodes are equivalent (same entry) if the set of paths into them from the root is the same.
- 1-index is a non-deterministic version of the strong DataGuide
- 2-index: indexes the template P x P
- all pairs of objects connected by an arbitrary path expression P
12Building a T-index
- Group objects into equivalence classes containing objects that are indistinguishable w.r.t. a class of paths defined by a path template
- Finer equivalence classes are more efficient to construct using bisimulation
- Construct a non-deterministic automaton
- states represent the equivalence classes
- transitions correspond to edges between objects in those classes.
- T-index can be used to answer queries of more general forms than the template
133-Adaptive Path Index (APEX)
- Adaptive Path Index for XML, Chung et al., SIGMOD 2002
- Summarize paths that appear frequently in query workload
- Maintain all paths of length 1
- Efficient for partial match paths
- Incremental update of index
14APEX details
- Each node has an identifier (nid)
- Required paths for indexing (labels + some composed paths)
- APEX = graph (structural summary) + hash tree (incoming required paths to nodes of graph)
- Hash tree is used to find nodes of graph for a given label path, also for incremental update
- Determine frequently used paths from query workload using sequential pattern mining
15APEX Example
[Figure: an XML data structure and the corresponding APEX hash tree and graph]
16APEX Evaluation
- Identifiers
- One per node
- Descendant/Ancestor Search
- Hash tree access if required, or graph traversal or join
- Keyword Search
- Not supported
- Update
- Insertion is incremental
- Index size (two structures)
- Entry number: linear in number of nodes
- Entry size: number of elements for a path
174-Index Fabric
- Cooper et al., "A Fast Index for Semistructured Data", VLDB 2001
- Extension of DataGuide for text search
- Keeps all label paths starting from the root
- Encode each label path with its data value as a string
- Use an efficient index for strings to store it (Patricia trie)
- Perform queries on keywords for elements as string search
- Does not keep information on non-terminal nodes
18Patricia Trie
- A Patricia trie is a simple form of compressed trie which merges single-child nodes with their parents
- More efficient for long keys (the non-common suffix is kept in one node)
Trie: a tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
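The compression idea can be sketched with a character-level compressed trie: an edge carries a whole substring, and an edge is split only where two keys diverge. This is an illustrative simplification (a classical Patricia trie branches on bits); the class and function names, and the sample keys, are assumptions for the example.

```python
import os

class Node:
    def __init__(self):
        self.children = {}    # first character -> (edge label, child Node)
        self.is_key = False   # True if a stored key ends at this node

def insert(root, key):
    node = root
    while True:
        if not key:
            node.is_key = True
            return
        entry = node.children.get(key[0])
        if entry is None:                     # no shared prefix: one new edge
            leaf = Node()
            leaf.is_key = True
            node.children[key[0]] = (key, leaf)
            return
        label, child = entry
        common = len(os.path.commonprefix([label, key]))
        if common < len(label):               # keys diverge inside the edge: split it
            mid = Node()
            mid.children[label[common]] = (label[common:], child)
            node.children[key[0]] = (label[:common], mid)
            child = mid
        node, key = child, key[common:]

def contains(root, key):
    node = root
    while key:
        entry = node.children.get(key[0])
        if entry is None or not key.startswith(entry[0]):
            return False
        node, key = entry[1], key[len(entry[0]):]
    return node.is_key

def count_nodes(node):
    return 1 + sum(count_nodes(child) for _, child in node.children.values())

keys = ["invoice/buyer/name", "invoice/buyer/address", "invoice/seller/name"]
root = Node()
for k in keys:
    insert(root, k)
```

The three keys need only six nodes here, whereas an uncompressed trie would spend one node per character.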
19Example
- Doc 1: <invoice>
- <buyer>
- <name>ABC Corp</name>
- <address>1 Industrial Way</address>
- </buyer>
- <seller>
- <name>Acme Inc</name>
- <address>2 Acme Rd.</address>
- </seller>
- <item count="3">saw</item>
- <item count="2">drill</item>
- </invoice>
- Doc 2: <invoice>
- <buyer>
- <name>Oracle Inc</name>
- <phone>555-1212</phone>
- </buyer>
- <seller>
- <name>IBM Corp</name>
- </seller>
- <item>
- <count>4</count>
- <name>nail</name>
- </item>
- </invoice>
20Patricia Trie
21Search on Paths
- Example of queries
- /invoice/buyer/name/ABC Corp
- /invoice/buyer//ABC Corp
- A key lookup operator searches for the path key corresponding to the path expression.
- If the path expands to an infinite number of tags
- start by using a prefix key lookup operator,
- then navigate through children to check the rest
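The two query styles can be sketched over string keys built from the example invoices. This is a simplified stand-in: a sorted Python list with binary search replaces the Patricia trie, tag names are spelled out in the keys, and the function names are hypothetical.

```python
from bisect import bisect_left

# Root-to-leaf label paths with their data value (a subset of Doc 1 and Doc 2).
keys = sorted([
    "invoice/buyer/name/ABC Corp",
    "invoice/buyer/address/1 Industrial Way",
    "invoice/seller/name/Acme Inc",
    "invoice/seller/address/2 Acme Rd.",
    "invoice/buyer/name/Oracle Inc",
    "invoice/buyer/phone/555-1212",
    "invoice/seller/name/IBM Corp",
])

def key_lookup(key):
    """Exact match, e.g. /invoice/buyer/name/ABC Corp."""
    i = bisect_left(keys, key)
    return i < len(keys) and keys[i] == key

def prefix_lookup(prefix):
    """All keys starting with `prefix` (they are contiguous in sorted order)."""
    i = bisect_left(keys, prefix)
    found = []
    while i < len(keys) and keys[i].startswith(prefix):
        found.append(keys[i])
        i += 1
    return found

def descendant_lookup(prefix, value):
    """prefix//value: prefix lookup first, then check the rest of each key."""
    return [k for k in prefix_lookup(prefix) if k.endswith("/" + value)]
```

For /invoice/buyer//ABC Corp, `descendant_lookup("invoice/buyer/", "ABC Corp")` first narrows the search with the prefix and then inspects only the matching keys.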
22Fabric Evaluation
- Identifiers
- One per document
- Descendant/Ancestor Search
- As string search, does not keep order of elements
- Keyword Search
- By Patricia trie leaves if expanded, value index otherwise
- Update
- Insertion is incremental
- Deletion is complex
- Index size (index stored with document)
- Entry number: linear for a tree
- Entry size: number of elements for a path
235-Node Numbering Scheme
- Used for indexing elements
- Node Identifier (NID) → element
- The NID aims at replacing structural joins by simple function computation
- check parent/ancestor relationships
- is_parent(NID1,NID2), is_ancestor(NID1,NID2)
- determine parent/children
- get_parent(NID1), get_children(NID1)
24Virtual nodes (1)
- Lee & Yoo, Digital Libraries 99
- Document structure mapped on a k-ary tree
- Node identifier assigned according to the level-order tree traversal
- parent(i) = ⌊(i-2)/k⌋ + 1
- child(i,j) = k(i-1) + j + 1
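The two formulas can be checked with a short sketch, assuming the root has identifier 1 and children are numbered j = 1..k. The `is_ancestor` helper is an addition for illustration: it simply applies `parent` repeatedly.

```python
def parent(i, k):
    """Identifier of the parent of node i in a complete k-ary tree (root = 1)."""
    if i <= 1:
        raise ValueError("the root has no parent")
    return (i - 2) // k + 1

def child(i, j, k):
    """Identifier of the j-th child (1 <= j <= k) of node i."""
    return k * (i - 1) + j + 1

def is_ancestor(a, d, k):
    """True if a is a proper ancestor of d; parents always have smaller ids."""
    while d > a:
        d = parent(d, k)
        if d == a:
            return True
    return False
```

With k = 3, node 2 has children 5, 6, 7 and `parent(7, 3)` returns 2, so no stored pointer is needed to move between levels. Many of the identifiers belong to virtual nodes that do not exist in the document.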
25Virtual nodes (2)
- NID can be used to address elements in an index of elements
- Only certain nodes (e.g., leaves) have to be indexed, as parent nodes can be determined by computation
- Problems
- arity of tree may be variable and large
- determination of real existence of parent/child
- update when arity increases?
26XML trees node pre/post numbering
- Dietz 82
- Identification of nodes
- Identifier = (preorder rank, postorder rank)
- X ancestor of Y <=>
- pre(X) < pre(Y) and
- post(X) > post(Y)
- Example
- 1 < 5 and 7 > 3 => (1,7) ancestor of (5,3)
[Figure: example tree whose nodes carry the identifiers (1,7), (6,6), (2,4), (7,5), (3,1), (5,3), (4,2)]
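The numbering and the ancestor test can be sketched as below. The tree shape is an assumption inferred from the (pre, post) pairs of the example: a root with two children, the first having three leaf children and the second one. Node names are hypothetical.

```python
def number_tree(children, root):
    """Return {node: (preorder rank, postorder rank)}, both 1-based."""
    pre, post = {}, {}

    def visit(n):
        pre[n] = len(pre) + 1            # rank on first visit
        for c in children.get(n, []):
            visit(c)
        post[n] = len(post) + 1          # rank once the subtree is done

    visit(root)
    return {n: (pre[n], post[n]) for n in pre}

def is_ancestor(x, y):
    """x and y are (pre, post) identifiers."""
    return x[0] < y[0] and x[1] > y[1]

# Assumed shape matching the identifiers of the example.
children = {"r": ["a", "b"], "a": ["c", "d", "e"], "b": ["f"]}
ids = number_tree(children, "r")
```

The ancestor test then needs only two integer comparisons, for example `is_ancestor(ids["r"], ids["e"])` compares (1,7) with (5,3).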
27Interval encoding
- Li & Moon, VLDB 2001
- Identify each node by a pair of numbers <order, size> as follows
- For a tree node y of parent x
- order(x) < order(y)
- order(y) + size(y) <= order(x) + size(x)
- For two sibling nodes x and y, if x is the predecessor of y in preorder traversal then
- order(x) + size(x) < order(y)
[Figure: example tree with <order, size> identifiers (1,100), (41,10), (10,30), (45,5), (25,5), (11,5), (17,5)]
Size keeps space for updates
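The containment test and the role of the reserved space can be sketched as follows, using the identifiers of the example. The `allocate_child` helper is a hypothetical illustration of how a new last child could be numbered inside the slack left by its parent; it is not taken from the cited paper.

```python
def is_ancestor(x, y):
    """x and y are (order, size) pairs; x contains y's interval."""
    (ox, sx), (oy, sy) = x, y
    return ox < oy and oy + sy <= ox + sx

def allocate_child(parent, siblings, size):
    """Return an (order, size) for a new last child of `parent`,
    or None if the reserved space is exhausted (renumbering needed)."""
    # Start just after the last sibling, or just after the parent's own order.
    order = max((o + s for o, s in siblings), default=parent[0]) + 1
    if order + size <= parent[0] + parent[1]:
        return (order, size)
    return None

root = (1, 100)
section = (10, 30)
kids = [(11, 5), (17, 5), (25, 5)]
new_kid = allocate_child(section, kids, 5)
```

The new child receives (31, 5) without touching any existing identifier, which is the point of leaving gaps in the sizes.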
28Relative Region Coordinates (1)
- Kha & Yoshikawa, IEEE Data Engineering 2001
- An RRC of a node n of an XML tree is a pair (sp-sn, sp-en) of addresses in the region of its parent, i.e., relative to the parent start
[Figure: a Child region, with start s and end e, nested inside its Parent region]
29Relative Region Coordinates (2)
- Absolute region coordinate (ARC)
- Relative to root begin (from byte Nth to Mth)
- Allow to extract the XML data
- Can be derived from RRCs of parents and self
- Begin = Σ(parents ∪ self) s + (k-1)
- End = Σ(parents) s + e(self) + (k-1)
- Advantages
- Updates are kept local to a region
- To access parent-child efficiently
- A B-tree like structure is maintained (à la Natix).
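The derivation of absolute coordinates from relative ones can be sketched under a simplifying assumption: offsets are 0-based and measured from the parent's start, so the (k-1) correction term of the formulas is not needed and is left out. The region values are hypothetical.

```python
def absolute_region(path):
    """path: list of (start, end) RRCs from the root down to the node.
    Assumes 0-based offsets relative to the parent's start.
    Returns the absolute (begin, end) of the last node."""
    *ancestors, (s, e) = path
    base = sum(start for start, _ in ancestors)
    return (base + s, base + e)

# Hypothetical regions: root, one child, one grandchild.
before = [(0, 99), (10, 49), (5, 14)]

# Five bytes are inserted in the root before the child: the root grows,
# the child's RRC shifts, but the grandchild's RRC is unchanged.
after = [(0, 104), (15, 54), (5, 14)]
```

`absolute_region(before)` gives (15, 24) and `absolute_region(after)` gives (20, 29): the absolute position moved, yet nothing stored below the child had to be rewritten.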
30Xyleme
- Generate a form of dataguide per cluster
- Generalized DTD
- Manage a label and value index (full index)
- Keep document ID and element ID
- Two forms of element ID
- Bit-structured scheme: structure position
- Prefix-postfix scheme: left-deep traversal
- Stores XML DOM trees in pages
- NATIX (Mannheim Univ.) technology
31Xyleme
326-Compact Structural Summary
- Bremer & Gertz, Tech Report 2003
- Compact addressing of words in XML doc.
- Encode XPath as reference to a path in a document
guide (path set, DTD or schema)
33Managing a Compact Index
- Naïve XML Indexing
- (Word,docId,(XPath))
- Example
- book/chapter[2]/resume/section[3]
- article/author/name
- Difficulties
- Index size!
- Processing time!
- Intersection of lists
- Problem
- How to memorize the location of a word inside an element?
- Solution (Bremer & Gertz 02)
- Encode the XPath as a reference to a path in a document guide (path sequence or schema)
34XPath Encoding
- XPath encoded as a path ID (PID) of structure (N, (p1, p2, ...))
- N being a node identifier in the guide
- (p1, p2, ...) being indices for repetitive ancestors from root to N
PID (V, (1, 3)) = /db/article[1]/text/sect[3]
35PID Ordering and Encoding
- PID order
- (V,(1)) < (V,(1,2)) < (V,(1,3)).
- Pre-order relationship
- X Parent Y
- => PID(X) < PID(Y)
- Compact PID encoding
- Path number
- Integer (short)
- Repetitive node
- log2(n) bits
- Compact PID Encoding (V, (1, 3)) = /db/article[1]/text/sect[3]
2 children: 1 bit
1 child: 0 bit
3 children: 2 bits
Total: 3 bits
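The bit count can be reproduced with a short sketch. It assumes that the three fan-outs listed above apply, in order, to the article, text, and sect steps of the path, and that each position is stored on ceil(log2(n)) bits; the function names are hypothetical.

```python
def bits_for(fanout):
    """Bits needed to number `fanout` same-named siblings: ceil(log2(fanout))."""
    return (fanout - 1).bit_length()

def encode_positions(positions, fanouts):
    """Pack 1-based sibling positions into a single bit string."""
    out = ""
    for pos, n in zip(positions, fanouts):
        width = bits_for(n)
        if width:                               # a single child costs no bits
            out += format(pos - 1, "b").zfill(width)
    return out

# /db/article[1]/text/sect[3] with assumed fan-outs 2, 1 and 3.
code = encode_positions([1, 1, 3], [2, 1, 3])
```

The result is a 3-bit string: one bit selects the article, none is needed for the only text child, and two bits select the third sect.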
36Index Implementation
<livre>
<titre>Les Misérables, Tome 1 Fantine</titre>
<auteur>Victor Hugo</auteur>
<histoire> 1815. Alors que tous les aubergistes de la ville l'ont chassé, le bagnard Jean Valjean est hébergé par Mgr Myriel (que les pauvres ont baptisé, d'après l'un de ses prénoms, Mgr Bienvenu). L'évêque de la ville de Digne, l'accueille avec bienveillance, le fait manger à sa table et lui offre un bon lit. ... </histoire>
</livre>
- Entry
- (Word (stem), Address)
- Address is
- PID + (offset in element)
- Example
- City → (V(1,3), (9, 36))
Word → PID + offset (word position inside the element)
Valjean → (PID, 15)
Ville → (PID, 9, 36)
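The offsets in these entries are word positions inside the element, which a short sketch can reproduce from the sample text. The tokenizer (letters, digits, and apostrophes; lowercasing instead of real stemming) and the PID value are assumptions made for the illustration.

```python
import re
from collections import defaultdict

# Beginning of the <histoire> element of the example.
HISTOIRE = (
    "1815. Alors que tous les aubergistes de la ville l'ont chassé, "
    "le bagnard Jean Valjean est hébergé par Mgr Myriel (que les pauvres "
    "ont baptisé, d'après l'un de ses prénoms, Mgr Bienvenu). "
    "L'évêque de la ville de Digne, l'accueille avec bienveillance"
)

def index_element(pid, text):
    """Return {word: [(pid, word offset), ...]} with 1-based offsets."""
    index = defaultdict(list)
    for offset, word in enumerate(re.findall(r"[\w']+", text), start=1):
        index[word.lower()].append((pid, offset))
    return index

PID = ("V", (1, 3))          # placeholder path ID for the element
index = index_element(PID, HISTOIRE)

valjean_offsets = [off for _, off in index["valjean"]]
ville_offsets = [off for _, off in index["ville"]]
```

Under this tokenization, `valjean_offsets` is [15] and `ville_offsets` is [9, 36], matching the entries shown above.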
37XQuery Text Evaluator
- Normalize the query through thesaurus
- Translation
- Synonyms
- Conceptualization
- Access to the text index
- Intersection, union, difference of PIDs
- Access to the relevant elements from PIDs
- Verification of relevance
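The normalization and set-operation steps can be sketched as below. The posting lists, the PIDs, and the one-entry thesaurus are hypothetical; fetching the elements and verifying relevance are left out.

```python
# Hypothetical text index: word -> set of PIDs of elements containing it.
index = {
    "valjean": {("V", (1, 3))},
    "ville": {("V", (1, 3)), ("V", (2, 1))},
    "digne": {("V", (2, 1))},
}

# Hypothetical thesaurus used for translation and synonyms.
THESAURUS = {"city": "ville"}

def normalize(word):
    """Lowercase the query word and map it through the thesaurus."""
    w = word.lower()
    return THESAURUS.get(w, w)

def pids(word):
    """PIDs of the elements containing the normalized word."""
    return index.get(normalize(word), set())

both = pids("Valjean") & pids("City")      # intersection: AND
either = pids("Valjean") | pids("Digne")   # union: OR
only = pids("ville") - pids("Valjean")     # difference: AND NOT
```

Because a PID identifies both the path and the position of the element, these set operations select elements, not just documents.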
387-Conclusion
- Various indexing techniques for XML
- Main dimensions of variations
- Structural summary
- Dataguide, Schema guide, Generalized DTD
- Identification of nodes (NID)
- Should keep parent-child relationship
- Should be stable to updates
- Index of keywords
- Should be compact
- Should give NID and offset of instances
39Classification
[Figure: classification of XML indexing methods. Categories: Numbering Scheme, Text Search, Graph Traversal. Methods: RRC, Hierarchy, T-Index, Pre/Post Order, Fabric, Dataguide, APEX, Interval Encoding]
40Index for XQuery Text
- Facilitate the retrieval of
- Non-stop words
- Suffixes, prefixes
- Location of words in elements
- Relevant nodes for a search
- Entries should focus on elements
- Word → (docId, NID)
41Treeguide patterns
[Figure: two treeguide patterns, (a) and (b), each built from the labels Book, Author, Category, @speciality, Company, Address, City]