Title: Cooperative XML (CoXML) Query Answering
1Cooperative XML (CoXML) Query Answering
2Motivation
- XML has become the standard format for information representation and data exchange
- An explosive increase in the amount of XML data available on the web, e.g.,
  - Bills at the Library of Congress
  - IEEE Computer Society's publications
  - SwissProt protein sequence databases
  - XMark online auction data
  - ...
- Effective XML search methods are needed!
3Challenges
- XML schema is usually very complex
  - E.g., the schema for the IEEE Computer Society publication dataset contains about 170 distinct tags and more than 1000 distinct paths
- It is often unrealistic for users to fully understand a schema before asking queries
- Exact query answering is inadequate and approximate query answering is more appropriate!
4Approach CoXML
5Roadmap
- Introduction
- Background
- CoXML
- Related Work
- Conclusion
6XML Data Model
- XML data is often modeled as an ordered labeled tree
  - Tree nodes: elements
  - Tree edges: element-nesting relationships
7XML Query Model
- XML queries are often modeled as trees
- Structure conditions: a set of query nodes connected by
  - Parent-to-child (/): directly connected
  - Ancestor-to-descendant (//): connected (either directly or indirectly)
- Content conditions
  - Either value predicates or keyword constraints on query nodes
- Example
8XML Query Answer
- An answer for a query is a set of nodes in a data tree that satisfies both structure and content conditions
- Example
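The query and answer models above can be illustrated with a minimal sketch. The class names (`DataNode`, `QueryNode`) and helper functions are hypothetical, chosen only for this illustration, and keyword constraints are approximated by a plain substring test on the text under a node. For brevity the sketch returns only the data nodes that match the query root, rather than the full set of matched nodes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataNode:
    label: str
    text: str = ""
    children: list = field(default_factory=list)   # list of DataNode


@dataclass
class QueryNode:
    label: str
    keyword: Optional[str] = None                   # content condition
    children: list = field(default_factory=list)   # list of (axis, QueryNode)


def descendants(node):
    """Yield every proper descendant of a data node."""
    for child in node.children:
        yield child
        yield from descendants(child)


def subtree_text(node):
    """Concatenate the text of a data node and all of its descendants."""
    return " ".join([node.text] + [subtree_text(c) for c in node.children])


def matches(q, d):
    """True if the query subtree rooted at q can be embedded at data node d."""
    if q.label != d.label:
        return False
    if q.keyword is not None and q.keyword not in subtree_text(d):
        return False
    for axis, q_child in q.children:
        # "/" looks at direct children only; "//" looks at all descendants.
        pool = d.children if axis == "/" else descendants(d)
        if not any(matches(q_child, c) for c in pool):
            return False
    return True


def answers(query, root):
    """Data nodes at which the whole query tree matches."""
    return [d for d in [root, *descendants(root)] if matches(query, d)]


doc = DataNode("article", children=[
    DataNode("title", "search engine basics"),
    DataNode("year", "2003"),
    DataNode("body", children=[DataNode("section", "spam detection methods")]),
])
query = QueryNode("article", children=[
    ("/", QueryNode("title", keyword="search engine")),
    ("//", QueryNode("section", keyword="spam detection")),
])
strict = QueryNode("article", children=[("/", QueryNode("section"))])

print([n.label for n in answers(query, doc)])
print(answers(strict, doc))
```

The `strict` query asks for a section directly under article, which the sample document does not have, so exact matching returns nothing for it.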
9XML Query Relaxation Types
- Value relaxation: enlarging a value condition's search scope
- Node relabel: changing the label of a node to a similar or more general label based on domain knowledge
1 Tree Pattern Relaxation (S. Amer-Yahia et al., 2000)
10XML Query Relaxation Types
- Edge generalization: relaxing a / edge to a // edge
- Node deletion: dropping a node from a query tree
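As a rough illustration of these two structure relaxation types, the sketch below applies each to a small query tree. The `QNode` class and function names are hypothetical. The sketch also assumes that when an internal node is deleted, its children are re-attached to its parent through // edges; the slides do not spell out this detail.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class QNode:
    label: str
    children: list = field(default_factory=list)   # list of [axis, QNode]


def generalize_edge(root, parent_label, child_label):
    """Return a copy in which the parent->child edge is relaxed to '//'."""
    tree = copy.deepcopy(root)
    stack = [tree]
    while stack:
        node = stack.pop()
        for edge in node.children:
            if node.label == parent_label and edge[1].label == child_label:
                edge[0] = "//"
            stack.append(edge[1])
    return tree


def delete_node(root, label):
    """Return a copy without the (non-root) node carrying the given label."""
    tree = copy.deepcopy(root)
    stack = [tree]
    while stack:
        node = stack.pop()
        kept = []
        for axis, child in node.children:
            if child.label == label:
                # Assumption: grandchildren are promoted via '//' edges.
                kept.extend(["//", gc] for _, gc in child.children)
            else:
                kept.append([axis, child])
        node.children = kept
        stack.extend(child for _, child in node.children)
    return tree


def show(node):
    """Render a query tree as a bracketed path expression."""
    return node.label + "".join(f"[{axis}{show(c)}]" for axis, c in node.children)


q = QNode("article", [
    ["/", QNode("title")],
    ["/", QNode("body", [["/", QNode("section")]])],
])

print(show(q))
print(show(generalize_edge(q, "body", "section")))
print(show(delete_node(q, "body")))
```

Both functions work on a deep copy, so the original query tree stays available for applying other relaxation operations.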
11XML Relaxation Properties
- Definition
  - Relaxation operation: an application of a relaxation type to a specific query node or edge
- Lemma
  - Given a query tree with n applicable relaxation operations, there are potentially up to 2^n relaxed trees
  - Possible combinations
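The counting argument behind the lemma can be checked with a short sketch: each relaxation operation is either applied or not, so n operations give 2^n subsets (including the empty subset, which corresponds to the original query). The operation names below are placeholders, and the sketch ignores the fact that some combinations may be inapplicable or yield identical trees, which is why the bound is only an upper bound.

```python
from itertools import combinations


def relaxation_combinations(operations):
    """Yield every subset of the given relaxation operations."""
    ops = list(operations)
    for k in range(len(ops) + 1):
        for subset in combinations(ops, k):
            yield frozenset(subset)


ops = ["gen(article,title)", "gen(article,body)", "del(body)"]
combos = list(relaxation_combinations(ops))
print(len(combos), 2 ** len(ops))
```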
12Roadmap
- Introduction
- Background
- CoXML
- Related Work
- Conclusion
13Challenges
- Query relaxation is often user-specific
  - Different users may have different approximate matching specifications for a given query tree
  - How to provide user-specific approximate query answering?
- A query with n relaxation operations has potentially up to 2^n relaxed queries
  - How to systematically relax a query?
- Query relaxation generates a set of approximate answers
  - How to effectively rank the returned approximate answers?
14CoXML System Overview
15Roadmap
- Introduction
- Background
- CoXML
- Relaxation Language
- Relaxation Indexes
- Ranking
- Evaluation
- Testbed
- Related Work
- Conclusion
16Relaxation Language
- Motivation
  - Enabling users to specify approximate conditions in queries and to control the approximate matching process
- RLXQuery: relaxation-enabled XQuery
  - Extends the standard XML query language (XQuery) with relaxation constructs and controls, such as
    - approximate conditions
    - !: non-relaxable conditions
    - REJECT: unacceptable relaxations
    - AT-LEAST: minimum number of answers to be returned
    - RELAX-ORDER: relaxation order among multiple conditions
    - USE: allowable relaxation types
17RLXQuery Example
FOR $a in doc("bib.xml")//article
WHERE $a/year = 2003 V-COND-LABEL t1 and
  ($a[about(./!title, "search engine")]/body/section)[about(., "spam detection")] S-COND-LABEL t2
RETURN $a
RELAX-ORDER (t1, t2)
USE (edge generalization, node deletion)
AT-LEAST 20
[Figure: query tree rooted at article, with children !title (keyword: search engine), body/section (keyword: spam detection), and year (value: 2003)]
18Roadmap
- Introduction
- Background
- CoXML
- Relaxation Language
- Relaxation Indexes
- Ranking
- Evaluation
- Testbed
- Related Work
- Conclusion
19Relaxation Index
- Naïve approach
  - Generate all possible relaxed queries, then iteratively select the best relaxed query to derive approximate answers
  - Exhaustive, but not scalable
- Observation
  - Many queries share the same (or similar) tree structures
- Our approach: relaxation index
  - Consider the structure of a query tree T as a template
  - Build indexes on the relaxed trees of T
  - Use the index to guide the relaxations of any query with the same (or similar) tree structure as that of T
20Relaxation Index - XTAH
- XTAH
  - A hierarchical, multi-level, labeled cluster of relaxed trees
- Building an XTAH
  - Given a query structure template T, generate all possible relaxed trees
  - Each relaxed tree uses a unique set of relaxation operations
  - Cluster relaxed trees into groups based on relaxation operations and distances, similar to suffix-tree clustering
21XTAH Example
A sample XTAH for the template structure T
gen(e_{u,v}): relaxing the edge between u and v
del(u): deleting the node u
22XTAH Properties
- Each group consists of a set of relaxed trees obtained by using similar relaxation operations
  - Efficient location of relaxed trees based on relaxation operations
- The higher the level of a group, the less relaxed the trees in the group
  - Relaxing queries at different granularities by traversing up and down the XTAH
23XTAH-Guided Query Relaxation
- Problem
  - Given a query with relaxation specifications (constructs and controls), how to search an XTAH for relaxed queries that satisfy the specification?
- Approach
  - First, prune XTAH groups containing trees that use unacceptable relaxations as specified in the query
    - This step can be efficiently achieved by utilizing internal node labels
  - Then, iteratively search the XTAH for the best relaxed query
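The prune-then-search idea can be sketched as follows. This is only an illustration under simplifying assumptions, not the actual XTAH implementation: the `Group` class is hypothetical, each internal group is labeled with the relaxation operations shared by all trees beneath it, each leaf stands for one relaxed tree with a made-up cost, and "best" is taken to mean lowest cost.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    ops: frozenset                                  # operations shared by all trees below
    children: list = field(default_factory=list)   # sub-groups; empty for a leaf
    cost: float = 0.0                               # only meaningful for leaves


def candidates(root, rejected):
    """Return (cost, operations) for the surviving leaves, cheapest first."""
    rejected = frozenset(rejected)
    found = []
    stack = [root]
    while stack:
        group = stack.pop()
        if group.ops & rejected:
            continue   # every tree below uses a rejected operation: prune
        if group.children:
            stack.extend(group.children)
        else:
            found.append((group.cost, sorted(group.ops)))
    return sorted(found)


def leaf(cost, *ops):
    return Group(frozenset(ops), cost=cost)


xtah = Group(frozenset(), children=[
    Group(frozenset({"gen(a,b)"}), children=[
        leaf(0.2, "gen(a,b)"),
        leaf(0.7, "gen(a,b)", "del(c)"),
    ]),
    Group(frozenset({"del(b)"}), children=[
        leaf(0.5, "del(b)"),
        leaf(0.6, "del(b)", "gen(a,c)"),
    ]),
])

for cost, ops in candidates(xtah, rejected={"del(b)"}):
    print(cost, ops)
```

Because the internal label of the second group already contains the rejected operation, the whole group is skipped without inspecting its leaves.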
24Query Relaxation Process Example
A sample XTAH for the template structure T
25XTAH-Guided Query Relaxation
- Problem
  - Given a query and an XTAH, how to efficiently locate the best relaxation candidate at the leaf level?
- Approach: M-tree
  - Assign representatives to internal groups
  - Representatives summarize distance properties of the trees within groups
  - Use representatives to guide the search path to the best relaxation candidate
2 M-tree: An Efficient Access Method for Similarity Search in Metric Spaces (P. Ciaccia et al., VLDB '97)
26Roadmap
- Introduction
- Background
- CoXML
- Relaxation Language
- Relaxation Indexes
- Ranking
- Evaluation
- Testbed
- Related Work
- Conclusion
27Ranking
- Ranking criteria
  - Based on both content and structure similarities between a query and an answer, i.e., a set of data nodes
- Approach
  - Content similarity: extended vector space model
  - Structure similarity: tree editing distance with a model for assigning operation costs
  - Overall relevancy: a ranking model combining both content and structure similarities
28Content Similarity
- Traditional IR ranking: content similarity between a query and a document
  - Vector Space Model: Term Frequency, Inverse Document Frequency
- XML content ranking: content similarity between a query and an answer (i.e., a set of data nodes)
  - Extended Vector Space Model: Weighted Term Frequency, Inverse Element Frequency
29Weighted Term Frequency
- Terms under different paths of a node are weighted differently
- Example
- The weighted term frequency for a term t in a node v is
[Figure: example Query and Data trees]
30Inverse Element Frequency
- The more XML elements contain a term, the less disambiguating power the term has
  - E.g., the term "spam" is less disambiguating than the term "detection"
- The inverse element frequency for a query term t is
31Extended Vector Space Model
- The content similarity between an answer A and a query Q is
n: # of nodes in Q
{u_1, ..., u_n}: the set of query nodes in Q
{v_1, ..., v_n}: the set of data nodes in A, where v_i matches u_i (1 ≤ i ≤ n)
|u_i.cont|: the number of terms in the content conditions on the node u_i
t_ij: a term in the content condition on the query node u_i
32Structure Distance Function
- Both XML data and queries are modeled as trees
- Similarities between trees are often computed by editing distances,
  - i.e., the cost of the cheapest sequence of editing operations that transforms one tree into the other tree
- The structure distance between an answer A and a query Q can be measured as the total cost of relaxation operations used to derive A
33Relaxation Operation Cost
- Naïve approach
  - Assign uniform cost to all relaxation operations
  - Simple but ineffective
- Our approach
  - Assign an operation cost based on the similarity between the two nodes being approximated by the operation
  - The closer the two nodes, the less the operation costs
r_i: a relaxation operation
u, v: the two nodes that are being approximated by r_i
34Nodes Approximated By Relaxation Operations
[Figure: example trees T1, T2, T3, T4]
35structure distance
content similarity
overall relevancy
36Overall Relevancy Function
- The overall relevancy of an answer A to a query Q, sim(A, Q), is a function of cont_sim(A, Q) and struct_dist(A, Q)
- Properties
  - sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q) = 0
  - sim(A, Q) increases as cont_sim(A, Q) increases
  - sim(A, Q) decreases as struct_dist(A, Q) increases
- Implementation
α is a small constant between 0 and 1
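The implementation formula itself did not survive in this transcript. As a hedged illustration, one simple function that satisfies all three stated properties multiplies the content similarity by a constant in (0, 1) raised to the structure distance. The function name, the parameter name `alpha`, and its default value are assumptions made for this sketch.

```python
def overall_relevancy(cont_sim, struct_dist, alpha=0.5):
    """One candidate for sim(A, Q): alpha ** struct_dist * cont_sim.

    Assumes 0 < alpha < 1, cont_sim >= 0 and struct_dist >= 0.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if struct_dist < 0:
        raise ValueError("struct_dist must be non-negative")
    return (alpha ** struct_dist) * cont_sim


# Exact structural match: relevancy equals the content similarity.
print(overall_relevancy(0.8, 0))
# More relaxation (larger distance) lowers the relevancy.
print(overall_relevancy(0.8, 1), overall_relevancy(0.8, 2))
```

With this form, an answer obtained without any relaxation is ranked purely by content, and each additional unit of structure distance discounts the score by the same factor.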
37Roadmap
- Introduction
- Background
- CoXML
- Relaxation Indexes
- Relaxation Language
- Ranking
- Evaluation
- Testbed
- Related Work
- Conclusion
38Evaluation Studies
- INEX (Initiative for the Evaluation of XML Retrieval)
  - Similar to TREC for text retrieval
- Document collections
  - Scientific articles from IEEE Computer Society, 1995-2002
  - About 500 MB
  - Each article consists of 1500 XML nodes on average
- Queries
  - Strict content and structure (SCAS)
  - Vague content and structure (VCAS)
- Gold standard
  - Relevance assessments provided by INEX
39Evaluation of Content Similarity
- Datasets: INEX 03 test collection
- Query sets: 30 SCAS queries
- Comparisons: 38 submissions in INEX 03
40Evaluation of the Cost Model
- Dataset: INEX 05 test collection
- Query set: 22 simple VCAS queries
- Evaluation metric: normalized extended cumulative gain (nxCG)
  - The official evaluation metric used in INEX 05
  - Given a number i (i ≥ 1), nxCG_at_i, similar to precision_at_i, measures the relative gain users have accumulated up to rank i
  - E.g., nxCG_at_10, nxCG_at_25, nxCG_at_50, ...
- Cost models
  - UCost: uniform cost for each relaxation operation (baseline)
  - SCost: our proposed cost model
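To make nxCG_at_i concrete, the sketch below computes it as the gain accumulated by a ranked run up to rank i, divided by the gain an ideal ranking would have accumulated by the same rank. The function name and the gain values are illustrative assumptions; how INEX 05 derives per-element gains from its relevance assessments is not modeled here.

```python
def nxcg_at(i, run_gains, ideal_gains):
    """Normalized extended cumulative gain at rank i.

    run_gains:   gain of each returned answer, in ranked order.
    ideal_gains: gains of all relevant answers (any order).
    """
    if i < 1:
        raise ValueError("i must be at least 1")
    accumulated = sum(run_gains[:i])
    ideal = sum(sorted(ideal_gains, reverse=True)[:i])
    return accumulated / ideal if ideal > 0 else 0.0


run = [1.0, 0.0, 0.5, 0.0]     # hypothetical gains of a system's top-4 answers
ideal = [1.0, 1.0, 0.5]        # hypothetical gains of all relevant answers

print(nxcg_at(1, run, ideal))
print(nxcg_at(2, run, ideal))
print(nxcg_at(3, run, ideal))
```

A value of 1 at rank i means the run has collected as much gain as the ideal ranking could have by that rank.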
41Retrieval performance improvements with semantic cost model
- Query set: all content-and-structure queries in INEX 05
[Table: nxCG_at_10 for each (α, cost model) pair]
Assigning different costs to relaxation operations based on the similarities of the nodes being operated on improves retrieval performance! nxCG_at_25 and nxCG_at_50 yield similar results.
42Evaluation of the Cost Model
Each cell: nxCG_at_10 for a given pair (α, cost model) (% of improvement over the baseline)
Utilizing node similarities to distinguish costs of different operations improves retrieval performance! Similar results are observed using nxCG_at_25 and nxCG_at_50.
43Expressiveness of the Relaxation Language
- INEX 05 Topic 267
<inex_topic topic_id="267" query_type="CAS">
<castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
<description>Articles containing "digital libraries" in their title.</description>
<narrative>I'm interested in articles discussing Digital Libraries as their main subject. Therefore I require that the title of any relevant article mentions "digital library" explicitly. Documents that mention digital libraries only under the bibliography are not relevant, as well as documents that do not have the phrase "digital library" in their title.</narrative>
</inex_topic>
- Expressing Topic 267 using RLXQuery
FOR $a in doc("inex.xml")//article
LET $b := $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b
44Effectiveness of the Relaxation Control
- Expressing Topic 267 with RLXQuery
FOR $a in doc("inex.xml")//article
LET $b := $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b
- Results
Relaxation control enables the system to provide answers with greater relevancy!
45Evaluation of the Ranking Function
- Dataset: INEX 05 test collection
- Query set: 4 official VCAS queries with available relevance assessments
- Comparison: top-1 submission in INEX 05
- Results
The systematic relaxation approach enables our system to derive more approximate answers!
Our ranking function, based on both content and structure relevancy, outperforms other ranking functions using content similarities only!
46Roadmap
- Introduction
- Background
- CoXML
- Relaxation Indexes XTAH
- Relaxation Language RLXQuery
- Ranking
- Evaluation
- Testbed
- Related Work
- Conclusion
47CoXML Testbed
RLXQuery
Relaxation Controller
Approximate Answers
RLXQuery Preprocessor
RLXQuery Parser
Relaxation Manager
Database Manager
Ranking Module
Relaxation Index Builder
XML Database Engine
XML Documents
Team Members: Prof. Chu, S. Liu, T. Lee, E. Sung, C. Cardenas, A. Putnam, J. Chen, R. Shahinian
48Relaxation Examples using the Testbed
49Relaxation Examples using the Testbed
50Roadmap
- Introduction
- Background
- CoXML
- Related Work
- Conclusion
51Related Work Query Relaxation
- Relaxation based on schema conversions (LC01, LMC01, LMC03)
  - No structure relaxation
- Native XML relaxation
  - Propose structure relaxation types, e.g., KS01, ACS02
    - We use the relaxation types introduced in ACS02
  - Investigate efficient algorithms for deriving top-K answers based on the relaxation types supported, e.g., Sch02, ACS02, ALP04, AKM05
    - No relaxation control
52Related Work XML Ranking
- Content ranking
  - Most extend ranking models for text retrieval to the XML scenario, e.g., HyRex, XXL, JuruXML, XSearch
  - We utilize structure to distinguish terms of different weights occurring in different parts of a document
- Structure ranking
  - Based on tree editing distance algorithms without considering operation cost (NJ02)
  - Based on the occurrence frequency of the query trees, paths, or predicates in data (MAK05, AKM05)
  - Our structure ranking is similar to editing distance, but we consider operation cost
53Conclusion
- Cooperative XML (CoXML) query answering
  - RLXQuery enables users to effectively express approximate query conditions and to control the approximate matching process
  - XTAH provides systematic query relaxation guidance
  - Both content and structure similarity metrics for evaluating the relevancy of approximate answers
  - Evaluation studies with the INEX test collections demonstrate the effectiveness of our methodology