Cooperative XML (CoXML) Query Answering


Learn more at: https://www.cs.ucla.edu


1
Cooperative XML (CoXML) Query Answering
2
Motivation

XML has become the standard format for
information representation and data exchange
An explosive increase in the amount of XML data
available on the web, e.g.,
Bills at the Library of Congress
IEEE Computer Society's publications
SwissProt protein sequence databases
XMark online auction data
...
Effective XML search methods are needed!

3
Challenges

XML schema is usually very complex
E.g., the schema for the IEEE Computer Society
publication dataset contains about 170 distinct
tags and more than 1000 distinct paths
It is often unrealistic for users to fully
understand a schema before asking queries
Exact query answering is inadequate and
approximate query answering is more appropriate!

4
Approach: CoXML
5
Roadmap

Introduction
Background
CoXML
Related Work
Conclusion

6
XML Data Model

XML data is often modeled as an ordered labeled
tree
Tree nodes: elements
Tree edges: element-nesting relationships
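The tree model above can be sketched in a few lines of Python. The `Element` class and the sample article are hypothetical names and data chosen for illustration; they are not part of CoXML.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Element:
    """One XML element: a labeled node whose children keep document order."""
    label: str
    text: str = ""
    children: List["Element"] = field(default_factory=list)

    def descendants(self) -> Iterator["Element"]:
        """Yield every proper descendant in document (pre-order) order."""
        for child in self.children:
            yield child
            yield from child.descendants()


# Hypothetical bibliography entry used only for illustration.
article = Element("article", children=[
    Element("title", "search engine"),
    Element("year", "2003"),
    Element("body", children=[Element("section", "spam detection")]),
])

labels = [e.label for e in article.descendants()]
print(labels)
```

Edges are implicit in the `children` lists, and list order stands in for the sibling order of an ordered tree.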

7
XML Query Model

XML queries are often modeled as trees
Structure conditions: a set of query nodes
connected by
Parent-to-child (/): directly connected
Ancestor-to-descendant (//): connected (either
directly or indirectly)
Content conditions:
either value predicates or keyword constraints on
query nodes
Example

8
XML Query Answer

An answer for a query is a set of nodes in a data
tree that satisfies both structure and content
conditions
Example
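As a rough illustration of the answer definition above, the sketch below checks whether a query tree can be embedded at a data node. `QNode` and `satisfies` are hypothetical names, keyword constraints are simplified to substring tests, and the data tree uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class QNode:
    label: str
    axis: str = "/"      # edge from the parent query node: "/" or "//"
    keyword: str = ""    # optional keyword constraint (simplified to a substring test)
    children: List["QNode"] = field(default_factory=list)


def satisfies(q: QNode, d: ET.Element) -> bool:
    """True if data node d matches q and q's whole subtree can be embedded below d."""
    if d.tag != q.label:
        return False
    if q.keyword and q.keyword not in "".join(d.itertext()):
        return False
    for qc in q.children:
        # "/" looks at direct children only; "//" looks at all proper descendants.
        candidates = list(d) if qc.axis == "/" else list(d.iter())[1:]
        if not any(satisfies(qc, c) for c in candidates):
            return False
    return True


doc = ET.fromstring(
    "<article><title>search engine</title><year>2003</year>"
    "<body><section>spam detection</section></body></article>"
)
title = QNode("title", keyword="search engine")
strict = QNode("article", children=[title, QNode("section", "/", "spam detection")])
loose = QNode("article", children=[title, QNode("section", "//", "spam detection")])

print(satisfies(strict, doc), satisfies(loose, doc))
```

The strict query fails because `section` is not a direct child of `article`; the version with a `//` edge succeeds.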

9
XML Query Relaxation Types

Value relaxation: enlarging a value condition's
search scope
Node relabel: changing the label of a node to a
similar or more general label based on domain
knowledge

[1] Tree Pattern Relaxation (S. Amer-Yahia et
al., 2000)
10
XML Query Relaxation Types

Edge generalization: relaxing a / edge to a
// edge
Node deletion: dropping a node from a query tree
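The two structural relaxations above can be sketched as functions that return a relaxed copy of a query tree. The names are hypothetical, nodes are identified by label (assumed unique), and reattaching the children of a deleted node to its parent through `//` edges is an assumption of this sketch.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class QNode:
    label: str
    axis: str = "/"                      # edge from the parent: "/" or "//"
    children: Tuple["QNode", ...] = ()


def generalize_edge(q: QNode, label: str) -> QNode:
    """Copy of q in which the edge entering the node `label` becomes "//"."""
    kids = tuple(generalize_edge(c, label) for c in q.children)
    axis = "//" if q.label == label else q.axis
    return replace(q, axis=axis, children=kids)


def delete_node(q: QNode, label: str) -> QNode:
    """Copy of q without the non-root node `label`; its children move up via "//"."""
    kids = []
    for c in q.children:
        c = delete_node(c, label)
        if c.label == label:
            kids.extend(replace(g, axis="//") for g in c.children)
        else:
            kids.append(c)
    return replace(q, children=tuple(kids))


def show(q: QNode) -> str:
    """Compact bracket rendering of a query tree."""
    return q.label + "".join(f"[{c.axis}{show(c)}]" for c in q.children)


query = QNode("article", children=(
    QNode("title"),
    QNode("body", children=(QNode("section"),)),
))
print(show(generalize_edge(query, "section")))
print(show(delete_node(query, "body")))
```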

11
XML Relaxation Properties

Definition
Relaxation operation: an application of a
relaxation type to a specific query node or edge
Lemma
Given a query tree with n applicable relaxation
operations, there are potentially up to 2^n
relaxed trees
Possible combinations
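The bound in the lemma is the number of subsets of the n operations. A minimal sketch that enumerates them, using made-up operation names:

```python
from itertools import combinations


def relaxation_sets(ops):
    """Yield every subset of the applicable relaxation operations."""
    for k in range(len(ops) + 1):
        for subset in combinations(ops, k):
            yield frozenset(subset)


# Hypothetical operations on a small query tree.
ops = ["gen(article/title)", "gen(article/body)", "del(title)"]
all_sets = list(relaxation_sets(ops))
print(len(all_sets))
```

The empty subset corresponds to the original, unrelaxed query. The bound is "up to" 2^n because some combinations may be invalid or may produce the same relaxed tree.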

12
Roadmap

Introduction
Background
CoXML
Related Work
Conclusion

13
Challenges

Query relaxation is often user-specific
Different users may have different approximate
matching specifications for a given query tree
How to provide user-specific approximate query
answering?
A query with n relaxation operations has
potentially up to 2^n relaxed queries
How to systematically relax a query?
Query relaxation generates a set of approximate
answers
How to effectively rank the returned approximate
answers?

14
CoXML System Overview
15
Roadmap

Introduction
Background
CoXML
Relaxation Language
Relaxation Indexes
Ranking
Evaluation
Testbed
Related Work
Conclusion

16
Relaxation Language

Motivation
Enabling users to specify approximate conditions
in queries and to control the approximate
matching process
RLXQuery - relaxation-enabled XQuery
Extends the standard XML query language (XQuery)
with relaxation constructs and controls, such as
~: approximate conditions
!: non-relaxable conditions
REJECT: unacceptable relaxations
AT-LEAST: minimum number of answers to be returned
RELAX-ORDER: relaxation order among multiple
conditions
USE: allowable relaxation types

17
RLXQuery Example
FOR $a in doc("bib.xml")//article
WHERE $a/year = ~2003 V-COND-LABEL t1 and
  ~($a[about(./!title, "search engine")]/body/section)[about(., "spam detection")] S-COND-LABEL t2
RETURN $a
RELAX-ORDER (t1, t2)
USE (edge generalization, node deletion)
AT-LEAST 20
Query tree (figure): article with children !title ("search engine"), year (2003), and body/section ("spam detection")
18
Roadmap

Introduction
Background
CoXML
Relaxation Language
Relaxation Indexes
Ranking
Evaluation
Testbed
Related Work
Conclusion

19
Relaxation Index

Naïve approach
Generate all possible relaxed queries, then
iteratively select the best relaxed query to
derive approximate answers
Exhaustive, but not scalable
Observation
Many queries share the same (or similar) tree
structures
Our approach: relaxation index
Consider the structure of a query tree T as a
template
Build indexes on the relaxed trees of T
Use the index to guide the relaxations of any
query with the same (or similar) tree structure
as that of T

20
Relaxation Index - XTAH

XTAH
A hierarchical multi-level labeled cluster of
relaxed trees
Building an XTAH
Given a query structure template T, generate all
possible relaxed trees
Each relaxed tree uses a unique set of
relaxation operations
Cluster relaxed trees into groups based on
relaxation operations and distances, in a manner
similar to suffix-tree clustering
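A much-simplified sketch of the idea follows. It nests relaxed trees (identified by their operation sets) under shared operation prefixes, so each group's label is the set of operations common to everything beneath it. This prefix nesting is an illustrative stand-in, not the clustering algorithm used to build an actual XTAH, and all names are hypothetical.

```python
from itertools import combinations


def build_hierarchy(ops):
    """Nest every non-empty operation set under its prefixes (canonical order).

    Each dict key is one operation; the path from the root is the group label.
    """
    root = {}
    for k in range(1, len(ops) + 1):
        for subset in combinations(ops, k):  # combinations keeps the input order
            node = root
            for op in subset:
                node = node.setdefault(op, {})
    return root


def groups_at_depth(node, depth, prefix=()):
    """List the labels of all groups that sit `depth` levels below `node`."""
    if depth == 0:
        return [prefix]
    out = []
    for op, child in node.items():
        out.extend(groups_at_depth(child, depth - 1, prefix + (op,)))
    return out


ops = ["gen(e1)", "gen(e2)", "del(v)"]
hierarchy = build_hierarchy(ops)
level1 = groups_at_depth(hierarchy, 1)
level2 = groups_at_depth(hierarchy, 2)
print(level1)
print(level2)
```

Groups closer to the root use fewer operations, so moving down the hierarchy yields progressively more relaxed trees.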

21
XTAH Example
A sample XTAH for the template structure T
gen(e_{u,v}): relaxing the edge between u and v
del(u): deleting the node u
22
XTAH Properties

Each group consists of a set of relaxed trees
obtained by using similar relaxation operations
Efficient location of relaxed trees based on
relaxation operations
The higher the level of a group, the less relaxed
the trees in the group
Relaxing queries at different granularities by
traversing up and down the XTAH

23
XTAH-Guided Query Relaxation

Problem
Given a query with relaxation specifications
(constructs and controls), how to search an XTAH
for relaxed queries that satisfy the
specification?
Approach
First, prune XTAH groups containing trees that
use unacceptable relaxations as specified in the
query
This step can be efficiently achieved by
utilizing internal node labels
Then, iteratively search the XTAH for the best
relaxed query
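The two steps above can be sketched on a small hand-written hierarchy. The group layout, operation names, and costs are invented for illustration. The pruning step relies on one assumption: a group's label lists operations shared by every tree beneath it, so a rejected operation in a label rules out the whole subtree.

```python
def group(*ops, children=()):
    """Build one group labeled with the operations shared by its subtree."""
    return {"label": frozenset(ops), "children": list(children)}


def prune(g, rejected):
    """Drop every group whose label uses a rejected operation, with its subtree."""
    if g["label"] & rejected:
        return None
    kept = [p for p in (prune(c, rejected) for c in g["children"]) if p is not None]
    return {"label": g["label"], "children": kept}


def candidates(g):
    """Yield the operation set of every group below g."""
    for c in g["children"]:
        yield c["label"]
        yield from candidates(c)


# Hypothetical hierarchy and operation costs.
hierarchy = group(children=[
    group("gen(e1)", children=[group("gen(e1)", "del(v)")]),
    group("del(v)"),
    group("gen(e2)"),
])
cost = {"gen(e1)": 0.2, "gen(e2)": 0.5, "del(v)": 0.9}

pruned = prune(hierarchy, rejected=frozenset({"del(v)"}))
ranked = sorted(candidates(pruned), key=lambda ops: sum(cost[o] for o in ops))
print(ranked)
```

After pruning, the cheapest remaining candidate is tried first; an iterative search would continue down the ranked list until enough answers are found.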

24
Query Relaxation Process Example
A sample XTAH for the template structure T
25
XTAH-Guided Query Relaxation

Problem
Given a query and an XTAH, how to efficiently
locate the best relaxation candidate at the leaf
level?
Approach: M-tree
Assign representatives to internal groups
Representatives summarize distance properties of
the trees within groups
Use representatives to guide the search path to
the best relaxation candidate

[2] M-tree: An Efficient Access Method for
Similarity Search in Metric Spaces (P. Ciaccia et
al., VLDB 97)
26
Roadmap

Introduction
Background
CoXML
Relaxation Language
Relaxation Indexes
Ranking
Evaluation
Testbed
Related Work
Conclusion

27
Ranking

Ranking criteria
Based on both content and structure similarities
between a query and an answer, i.e., a set of
data nodes
Approach
Content similarity: extended vector space model
Structure similarity: tree editing distance with
a model for assigning operation cost
Overall relevancy: a ranking model combining both
content and structure similarities

28
Content Similarity
Traditional IR ranking: Vector Space Model
Term Frequency
Inverse Document Frequency
Content similarity between a query and a document
XML content ranking: Extended Vector Space Model
Weighted Term Frequency
Inverse Element Frequency
Content similarity between a query and an answer
(i.e., a set of data nodes)
Weighted Term Frequency

Terms under different paths of a node are
weighted differently
Example
The weighted term frequency for a term t in a
node v is

30
Inverse Element Frequency

The more XML elements contain a term, the less
disambiguating power the term has
E.g., the term "spam" is less disambiguating than
the term "detection"
The inverse element frequency for a query term t
is

31
Extended Vector Space Model

The content similarity between an answer A and a
query Q is

n: the number of nodes in Q
{u_1, ..., u_n}: the set of query nodes in Q
{v_1, ..., v_n}: the set of data nodes in A,
where v_i matches u_i (1 ≤ i ≤ n)
|u_i.cont|: the number of terms in the content
conditions on the node u_i
t_ij: a term in the content condition on the
query node u_i
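To make the ingredients named above concrete, here is a hedged sketch. All three formulas are assumptions of this example (path-weighted term counts, an IDF-style logarithm, and an unnormalized sum of products); they show the general shape of such a model and are not the formulas used in CoXML. All names and data are hypothetical.

```python
import math


def weighted_tf(term, occurrences, path_weight):
    """Sum path-weighted counts of `term`; occurrences is a list of (path, text)."""
    return sum(
        path_weight.get(path, 1.0) * text.split().count(term)
        for path, text in occurrences
    )


def ief(term, elements):
    """IDF-style inverse element frequency over a list of element texts."""
    containing = sum(1 for text in elements if term in text.split())
    return math.log(1 + len(elements) / containing) if containing else 0.0


def cont_sim(query_terms, matched_nodes, elements, path_weight):
    """query_terms[i]: terms on query node u_i; matched_nodes[i]: text under v_i."""
    return sum(
        weighted_tf(t, occ, path_weight) * ief(t, elements)
        for terms, occ in zip(query_terms, matched_nodes)
        for t in terms
    )


# Hypothetical collection and path weights.
elements = ["spam detection", "spam filter", "search engine"]
weights = {"title": 2.0, "body/section": 1.0}
occ = [("title", "spam detection"), ("body/section", "spam spam")]

score = cont_sim([["spam"]], [occ], elements, weights)
print(weighted_tf("spam", occ, weights), score)
```

In this toy collection "spam" appears in more elements than "detection", so its inverse element frequency is lower.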
32
Structure Distance Function

Both XML data and queries are modeled as trees
Similarities between trees are often computed by
editing distances,
i.e., the cost of the cheapest sequence of
editing operations that transforms one tree into
the other tree
The structure distance between an answer A and a
query Q can be measured as the total cost of
relaxation operations used to derive A

33
Relaxation Operation Cost

Naïve approach
Assign uniform cost to all relaxation operations
Simple but ineffective
Our approach
Assign an operation cost based on the similarity
between the two nodes being approximated by the
operation
The closer the two nodes, the lower the operation
cost

r_i: a relaxation operation
u, v: the two nodes being approximated by r_i
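A minimal sketch of a similarity-based cost, assuming (for illustration only) that node similarity lies in [0, 1] and that cost is its complement. The structure distance is then the total cost of the operations applied. The function names and the cost formula are assumptions of this sketch.

```python
def operation_cost(similarity: float) -> float:
    """Cost of one relaxation operation: closer nodes give a cheaper operation."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity


def struct_dist(similarities) -> float:
    """Total cost of the relaxation operations used to derive an answer."""
    return sum(operation_cost(s) for s in similarities)


# Two hypothetical operations whose node pairs have similarity 0.75 and 0.5.
distance = struct_dist([0.75, 0.5])
print(distance)
```

A uniform-cost baseline would instead charge the same amount for every operation, regardless of how similar the two nodes are.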
34
Nodes Approximated By Relaxation Operations
Figure: example trees T1, T2, T3, and T4
35
structure distance
content similarity
overall relevancy
36
Overall Relevancy Function

The overall relevancy of an answer A to a query
Q, sim(A, Q), is a function of cont_sim(A, Q) and
struct_dist(A, Q)
Properties
sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q) = 0
sim(A, Q) increases as cont_sim(A, Q) increases
sim(A, Q) decreases as struct_dist(A, Q) increases
Implementation

α is a small constant between 0 and 1
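One function with the three properties listed above discounts content similarity exponentially by structure distance. This form is given as an assumption for illustration, with a constant 0 < α < 1, and is not necessarily the exact formula used in CoXML:

```latex
\[
\mathrm{sim}(A, Q) \;=\; \alpha^{\,\mathrm{struct\_dist}(A, Q)} \cdot \mathrm{cont\_sim}(A, Q),
\qquad 0 < \alpha < 1
\]
```

When struct_dist(A, Q) = 0 the discount factor is 1, so sim(A, Q) = cont_sim(A, Q). The value grows with cont_sim(A, Q) and, because α < 1, shrinks as struct_dist(A, Q) grows.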
37
Roadmap

Introduction
Background
CoXML
Relaxation Indexes
Relaxation Language
Ranking
Evaluation
Testbed
Related Work
Conclusion

38
Evaluation Studies

INEX (Initiative for the Evaluation of XML Retrieval)
Similar to TREC for text retrieval
Document collections
Scientific articles from IEEE Computer Society,
1995-2002
About 500 MB
Each article consists of 1500 XML nodes on
average
Queries
Strict content and structure (SCAS)
Vague content and structure (VCAS)
Gold standard
Relevance assessment provided by INEX

39
Evaluation of Content Similarity

Dataset: INEX 03 test collection
Query set: 30 SCAS queries
Comparisons: 38 submissions in INEX 03

40
Evaluation of the Cost Model

Dataset: INEX 05 test collection
Query set: 22 simple VCAS queries
Evaluation metric: normalized extended cumulative
gain (nxCG)
The official evaluation metric used in INEX 05
Given a number i (i ≥ 1), nxCG_at_i, similar to
precision_at_i, measures the relative gain users
have accumulated up to rank i
E.g., nxCG_at_10, nxCG_at_25, nxCG_at_50, ...
Cost Models
UCost: uniform cost for each relaxation operation
(baseline)
SCost: our proposed cost model
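The nxCG_at_i metric described above can be sketched as cumulated gain divided by the ideal cumulated gain at the same rank. The function name and the gain values are hypothetical, and the official INEX definition includes details (such as how gains are derived from assessments) that this sketch omits.

```python
def nxcg_at(i, gains, ideal_gains):
    """Gain accumulated in the top-i results, relative to an ideal ranking.

    gains: gain of each returned result, in rank order.
    ideal_gains: gains of all relevant items, in any order.
    """
    if i < 1:
        raise ValueError("i must be at least 1")
    achieved = sum(gains[:i])
    ideal = sum(sorted(ideal_gains, reverse=True)[:i])
    return achieved / ideal if ideal else 0.0


# Hypothetical run: four returned results and four relevant items.
gains = [1.0, 0.0, 0.5, 0.0]
ideal_gains = [1.0, 1.0, 0.5, 0.5]
print(nxcg_at(2, gains, ideal_gains), nxcg_at(4, gains, ideal_gains))
```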

41
Retrieval performance improvements with semantic
cost model

Query set: all content-and-structure queries in
INEX 05

Chart: nxCG_at_10 for each pair (α, cost model)

Assigning relaxation operations different costs
based on the similarity of the nodes they operate
on improves retrieval performance!
nxCG_at_25 and nxCG_at_50 yield similar results
42
Evaluation of the Cost Model

Result

Each cell: nxCG_at_10 for a given pair (α, cost
model) (% of improvement over the baseline)
Utilizing node similarities to distinguish costs
of different operations improves retrieval
performance!
Similar results are observed using
nxCG_at_25 and nxCG_at_50
43
Expressiveness of the Relaxation Language

INEX 05 Topic 267

<inex_topic topic_id="267" query_type="CAS">
<castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
<description>Articles containing "digital libraries" in their title.</description>
<narrative>I'm interested in articles discussing Digital
Libraries as their main subject. Therefore I
require that the title of any relevant article
mentions "digital library" explicitly. Documents
that mention digital libraries only under the
bibliography are not relevant, as well as
documents that do not have the phrase "digital
library" in their title.</narrative>
</inex_topic>

Expressing Topic 267 using RLXQuery

FOR $a in doc("inex.xml")//article
LET $b := $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b
44
Effectiveness of the Relaxation Control

Expressing Topic 267 with RLXQuery
Results

FOR $a in doc("inex.xml")//article
LET $b := $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b
Relaxation control enables the system to provide
answers with greater relevancy!
45
Evaluation of the Ranking Function

Dataset: INEX 05 test collection
Query set: 4 official VCAS queries with available
relevance assessments
Comparison: top-1 submission in INEX 05
Results

The systematic relaxation approach enables our
system to derive more approximate answers!
Our ranking function, based on both content and
structure relevancy, outperforms other ranking
functions using content similarities only!
46
Roadmap

Introduction
Background
CoXML
Relaxation Indexes: XTAH
Relaxation Language: RLXQuery
Ranking
Evaluation
Testbed
Related Work
Conclusion

47
CoXML Testbed
Architecture diagram components: RLXQuery,
RLXQuery Preprocessor, RLXQuery Parser,
Relaxation Controller, Relaxation Manager,
Database Manager, Ranking Module,
Relaxation Index Builder, XML Database Engine,
XML Documents, Approximate Answers
Team Members: Prof. Chu, S. Liu, T. Lee, E. Sung,
C. Cardenas, A. Putnam, J. Chen, R. Shahinian
48
Relaxation Examples using the Testbed
49
Relaxation Examples using the Testbed
50
Roadmap

Introduction
Background
CoXML
Related Work
Conclusion

51
Related Work: Query Relaxation

Relaxation based on schema conversions (LC01,
LMC01, LMC03)
No structure relaxation
Native XML relaxation
Proposes structure relaxation types, e.g., (KS01,
ACS02)
We use the relaxation types introduced in (ACS02)
Investigates efficient algorithms for deriving
top-K answers based on the relaxation types
supported, e.g., (Sch02, ACS02, ALP04, AKM05)
No relaxation control

52
Related Work: XML Ranking

Content ranking
Most approaches extend ranking models for text
retrieval to the XML scenario, e.g., HyRex, XXL,
JuruXML, XSearch
We utilize structure to distinguish terms of
different weights occurring in different parts of
a document
Structure ranking
Based on tree editing distance algorithms without
considering operation cost (NJ02)
Based on the occurrence frequency of the query
trees, paths, or predicates in data (MAK05,
AKM05)
Our structure ranking is similar to editing
distance, but we consider operation cost

53
Conclusion

Cooperative XML (CoXML) query answering
RLXQuery enables users to effectively express
approximate query conditions and to control the
approximate matching process
XTAH provides systematic query relaxation
guidance
Both content and structure similarity metrics for
evaluating the relevancy of approximate answers
Evaluation studies with the INEX test collections
demonstrate the effectiveness of our methodology