Cooperative XML (CoXML) Query Answering

Provided by: csU5
Learn more at: https://www.cs.ucla.edu

1
Cooperative XML (CoXML) Query Answering
2
Motivation
  • XML has become the standard format for
    information representation and data exchange
  • An explosive increase in the amount of XML data
    available on the web, e.g.,
  • Bills at the Library of Congress
  • IEEE Computer Society's publications
  • SwissProt protein sequence databases
  • XMark online auction data
  • ...
  • Effective XML search methods are needed!

3
Challenges
  • XML schema is usually very complex
  • E.g., the schema for the IEEE Computer Society
    publication dataset contains about 170 distinct
    tags and more than 1000 distinct paths
  • It is often unrealistic for users to fully
    understand a schema before asking queries
  • Exact query answering is inadequate and
    approximate query answering is more appropriate!

4
Approach: CoXML
5
Roadmap
  • Introduction
  • Background
  • CoXML
  • Related Work
  • Conclusion

6
XML Data Model
  • XML data is often modeled as an ordered labeled
    tree
  • Tree nodes: elements
  • Tree edges: element-nesting relationships
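A minimal sketch of this data model in Python, using the standard library's xml.etree.ElementTree. The sample document and the helper name show are illustrative assumptions, not part of the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
doc = ET.fromstring(
    "<article><title>search engine</title><year>2003</year>"
    "<body><section>spam detection</section></body></article>"
)

def show(node, depth=0):
    """Return one indented label per element, in document order."""
    lines = ["  " * depth + node.tag]
    for child in node:  # iterating an element yields its children in order
        lines.extend(show(child, depth + 1))
    return lines

print("\n".join(show(doc)))
```

Each element is a tree node, and each level of indentation corresponds to one element-nesting edge.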

7
XML Query Model
  • XML queries are often modeled as trees
  • Structure conditions: a set of query nodes
    connected by
  • Parent-to-child (/) edges: directly connected
  • Ancestor-to-descendant (//) edges: connected
    (either directly or indirectly)
  • Content conditions
  • Either value predicates or keyword constraints on
    query nodes
  • Example

8
XML Query Answer
  • An answer for a query is a set of nodes in a data
    tree that satisfies both structure and content
    conditions
  • Example
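As a rough illustration of exact query answering, the sketch below matches a small query tree against a data tree. The QNode class, the substring-based keyword test, and the sample document are assumptions made for this example; they are not the CoXML implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import xml.etree.ElementTree as ET

@dataclass
class QNode:
    label: str
    keyword: Optional[str] = None   # content condition (simple substring test here)
    axis: str = "/"                 # edge from the parent: "/" or "//"
    children: List["QNode"] = field(default_factory=list)

def matches(q, e):
    """True if element e satisfies query node q and q's whole subtree."""
    if e.tag != q.label:
        return False
    if q.keyword is not None and q.keyword not in "".join(e.itertext()):
        return False
    for qc in q.children:
        # "/" looks at direct children only; "//" looks at all descendants.
        candidates = list(e) if qc.axis == "/" else list(e.iter())[1:]
        if not any(matches(qc, c) for c in candidates):
            return False
    return True

doc = ET.fromstring(
    "<article><title>search engine</title><year>2003</year>"
    "<body><section>spam detection</section></body></article>"
)
strict = QNode("article", children=[
    QNode("title", "search engine"),
    QNode("section", "spam detection", axis="/"),
])
relaxed = QNode("article", children=[
    QNode("title", "search engine"),
    QNode("section", "spam detection", axis="//"),
])
print(matches(strict, doc), matches(relaxed, doc))
```

The strict query requires section to be a direct child of article, so it finds no answer in this document, while the variant with a // edge does.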

9
XML Query Relaxation Types
  • Value relaxation: enlarging a value condition's
    search scope
  • Node relabel: changing the label of a node to a
    similar or more general label based on domain
    knowledge

[1] Tree Pattern Relaxation (S. Amer-Yahia et
al., 2000)
10
XML Query Relaxation Types
  • Edge generalization: relaxing a / edge to a
    // edge
  • Node deletion: dropping a node from a query tree
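The two structural relaxation types can be sketched on a toy query representation. Here a query is a dict mapping each non-root node to (parent, axis). The function names are hypothetical, and reattaching a deleted node's children to its parent with // edges is an assumption of this sketch.

```python
def generalize_edge(edges, u, v):
    """Edge generalization: relax the edge between u and v to '//'."""
    parent, _axis = edges[v]
    if parent != u:
        raise ValueError(f"{u} is not the parent of {v}")
    relaxed = dict(edges)
    relaxed[v] = (u, "//")
    return relaxed

def delete_node(edges, u):
    """Node deletion: drop u; its children hang off u's parent via '//'."""
    parent, _axis = edges[u]
    relaxed = {}
    for child, (p, axis) in edges.items():
        if child == u:
            continue
        relaxed[child] = (parent, "//") if p == u else (p, axis)
    return relaxed

query = {
    "title": ("article", "/"),
    "body": ("article", "/"),
    "section": ("body", "/"),
}
print(generalize_edge(query, "body", "section"))
print(delete_node(query, "body"))
```

Both functions return a new query and leave the original untouched, so several relaxed variants can be derived from one template.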

11
XML Relaxation Properties
  • Definition
  • Relaxation operation: an application of a
    relaxation type to a specific query node or edge
  • Lemma
  • Given a query tree with n applicable relaxation
    operations, there are potentially up to 2^n
    relaxed trees
  • Possible combinations
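The 2^n bound follows from choosing, for each of the n operations, whether to apply it. A small sketch, with hypothetical operation names, that enumerates every combination:

```python
from itertools import combinations

def operation_subsets(ops):
    """Yield every subset of the applicable relaxation operations."""
    for k in range(len(ops) + 1):
        for subset in combinations(ops, k):
            yield frozenset(subset)

# Hypothetical operation names for a three-operation query.
ops = ["gen(article,title)", "gen(article,body)", "del(title)"]
subsets = list(operation_subsets(ops))
print(len(subsets), 2 ** len(ops))
```

The empty subset corresponds to the original, unrelaxed query. The count is an upper bound because some combinations may be invalid or yield the same tree.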

12
Roadmap
  • Introduction
  • Background
  • CoXML
  • Related Work
  • Conclusion

13
Challenges
  • Query relaxation is often user-specific
  • Different users may have different approximate
    matching specifications for a given query tree
  • How to provide user-specific approximate query
    answering?
  • A query with n relaxation operations has
    potentially up to 2^n relaxed queries
  • How to systematically relax a query?
  • Query relaxation generates a set of approximate
    answers
  • How to effectively rank the returned approximate
    answers?

14
CoXML System Overview
15
Roadmap
  • Introduction
  • Background
  • CoXML
  • Relaxation Language
  • Relaxation Indexes
  • Ranking
  • Evaluation
  • Testbed
  • Related Work
  • Conclusion

16
Relaxation Language
  • Motivation
  • Enabling users to specify approximate conditions
    in queries and to control the approximate
    matching process
  • RLXQuery: relaxation-enabled XQuery
  • Extends the standard XML query language (XQuery)
    with relaxation constructs and controls, such as
  • approximate conditions
  • !: non-relaxable conditions
  • REJECT: unacceptable relaxations
  • AT-LEAST: minimum number of answers to be returned
  • RELAX-ORDER: relaxation order among multiple
    conditions
  • USE: allowable relaxation types

17
RLXQuery Example
FOR $a in doc("bib.xml")//article
WHERE $a/year = 2003 V-COND-LABEL t1 and
  ($a[about(./!title, "search engine")]/body/section)[about(., "spam detection")] S-COND-LABEL t2
RETURN $a
RELAX-ORDER (t1, t2)
USE (edge generalization, node deletion)
AT-LEAST 20
[Figure: query tree with nodes article, !title ("search engine"), year (2003), body, section ("spam detection")]
18
Roadmap
  • Introduction
  • Background
  • CoXML
  • Relaxation Language
  • Relaxation Indexes
  • Ranking
  • Evaluation
  • Testbed
  • Related Work
  • Conclusion

19
Relaxation Index
  • Naïve approach
  • Generate all possible relaxed queries, then
    iteratively select the best relaxed query to
    derive approximate answers
  • Exhaustive, but not scalable
  • Observation
  • Many queries share the same (or similar) tree
    structures
  • Our approach: relaxation index
  • Consider the structure of a query tree T as a
    template
  • Build indexes on the relaxed trees of T
  • Use the index to guide the relaxations of any
    query with the same (or similar) tree structure
    as that of T

20
Relaxation Index - XTAH
  • XTAH
  • A hierarchical multi-level labeled cluster of
    relaxed trees
  • Building an XTAH
  • Given a query structure template T, generate all
    possible relaxed trees
  • Each relaxed tree uses a unique set of
    relaxation operations
  • Cluster relaxed trees into groups based on
    relaxation operations and distances, in a manner
    similar to suffix-tree clustering
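A much-simplified sketch of the grouping idea: each relaxed tree is identified by its set of operations, and first-level groups collect the trees that share an operation, much as suffix-tree clustering forms base clusters from shared phrases. The operation names and the single-level grouping are assumptions; the actual XTAH is a multi-level hierarchy that also uses distances.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical operations for a template with nodes a, b, c.
ops = ["gen(a,b)", "gen(a,c)", "del(c)"]

# Each relaxed tree is identified by the non-empty set of operations it uses.
relaxed_trees = [
    frozenset(s) for k in range(1, len(ops) + 1) for s in combinations(ops, k)
]

def first_level_groups(trees):
    """One group per operation, holding every relaxed tree that uses it."""
    groups = defaultdict(list)
    for tree in trees:
        for op in tree:
            groups[op].append(tree)
    # Inside a group, list the least relaxed trees (fewest operations) first.
    return {op: sorted(members, key=len) for op, members in groups.items()}

groups = first_level_groups(relaxed_trees)
for op, members in groups.items():
    print(op, [sorted(m) for m in members])
```

Labeling each group with its shared operation makes it possible to locate, or rule out, all trees that use that operation in one step.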

21
XTAH Example
[Figure: a sample XTAH for the template structure T]
gen(e_{u,v}): relaxing the edge between u and v
del(u): deleting the node u
22
XTAH Properties
  • Each group consists of a set of relaxed trees
    obtained by using similar relaxation operations
  • Efficient location of relaxed trees based on
    relaxation operations
  • The higher a group's level, the less relaxed the
    trees in the group
  • Relaxing queries at different granularities by
    traversing up and down the XTAH

23
XTAH-Guided Query Relaxation
  • Problem
  • Given a query with relaxation specifications
    (constructs and controls), how to search an XTAH
    for relaxed queries that satisfy the
    specification?
  • Approach
  • First, prune XTAH groups containing trees that
    use unacceptable relaxations as specified in the
    query
  • This step can be efficiently achieved by
    utilizing internal node labels
  • Then, iteratively search the XTAH for the best
    relaxed query
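The pruning step can be sketched as follows, assuming each group's label lists the operations shared by every tree beneath it. The Group class and the operation names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional

@dataclass
class Group:
    label: FrozenSet[str]                 # operations shared by all trees below
    children: List["Group"] = field(default_factory=list)

def prune(group: Group, rejected: FrozenSet[str]) -> Optional[Group]:
    """Copy the hierarchy, dropping groups whose label uses a rejected operation."""
    if group.label & rejected:
        return None  # every tree below uses the rejected operation
    kept = [p for p in (prune(c, rejected) for c in group.children) if p is not None]
    return Group(group.label, kept)

root = Group(frozenset(), [
    Group(frozenset({"gen(a,b)"}), [
        Group(frozenset({"gen(a,b)", "del(c)"})),
    ]),
    Group(frozenset({"del(c)"})),
])
pruned = prune(root, frozenset({"del(c)"}))
print([sorted(g.label) for g in pruned.children])
```

Because a label applies to a whole subtree, a rejected operation in an internal label removes that subtree without visiting its leaves.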

24
Query Relaxation Process Example
A sample XTAH for the template structure T
25
XTAH-Guided Query Relaxation
  • Problem
  • Given a query and an XTAH, how to efficiently
    locate the best relaxation candidate at the leaf
    level?
  • Approach: M-tree
  • Assign representatives to internal groups
  • Representatives summarize distance properties of
    the trees within groups
  • Use representatives to guide the search path to
    the best relaxation candidate

[2] M-tree: An Efficient Access Method for
Similarity Search in Metric Spaces (P. Ciaccia et
al., VLDB '97)
26
Roadmap
  • Introduction
  • Background
  • CoXML
  • Relaxation Language
  • Relaxation Indexes
  • Ranking
  • Evaluation
  • Testbed
  • Related Work
  • Conclusion

27
Ranking
  • Ranking criteria
  • Based on both content and structure similarities
    between a query and an answer, i.e., a set of
    data nodes
  • Approach
  • Content similarity: extended vector space model
  • Structure similarity: tree editing distance with
    a model for assigning operation costs
  • Overall relevancy: a ranking model combining both
    content and structure similarities

28
Content Similarity
  • Traditional IR ranking: Vector Space Model
  • Content similarity between a query and a
    document
  • Term Frequency
  • Inverse Document Frequency
  • XML content ranking: Extended Vector Space Model
  • Content similarity between a query and an
    answer (i.e., a set of data nodes)
  • Weighted Term Frequency
  • Inverse Element Frequency
29
Weighted Term Frequency
  • Terms under different paths of a node are
    weighted differently
  • Example
  • The weighted term frequency for a term t in a
    node v is

[Figure: example query tree and data tree]
30
Inverse Element Frequency
  • The more XML elements that contain a term, the
    less disambiguating power the term has
  • E.g., the term "spam" is less disambiguating than
    the term "detection"
  • The inverse element frequency for a query term t
    is
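The formula itself is not given in the text above. As an illustration only, the sketch below assumes the usual logarithmic form borrowed from inverse document frequency, log(N / n_t), where N is the number of elements and n_t the number of elements containing term t. The function name and the sample element contents are hypothetical.

```python
import math

def inverse_element_frequency(term, elements):
    """Assumed form: log(N / n_t) over a list of element text contents."""
    n_t = sum(1 for text in elements if term in text.split())
    if n_t == 0:
        return 0.0  # the term occurs nowhere; treat it as carrying no weight
    return math.log(len(elements) / n_t)

# Hypothetical element contents.
elements = ["spam detection", "spam filter", "spam mail", "edge detection"]
print(inverse_element_frequency("spam", elements))
print(inverse_element_frequency("detection", elements))
```

In this toy collection "spam" occurs in more elements than "detection", so it receives the lower weight, in line with the example on the slide.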

31
Extended Vector Space Model
  • The content similarity between an answer A and a
    query Q is

n: the number of nodes in Q
{u1, ..., un}: the set of query nodes in Q
{v1, ..., vn}: the set of data nodes in A, where vi matches ui (1 ≤ i ≤ n)
|ui.cont|: the number of terms in the content conditions on the node ui
tij: a term in the content condition on the query node ui
32
Structure Distance Function
  • Both XML data and queries are modeled as trees
  • Similarities between trees are often computed by
    editing distances,
  • i.e., the cost of the cheapest sequence of
    editing operations that transform one tree into
    the other tree
  • The structure distance between an answer A and a
    query Q can be measured as the total cost of
    relaxation operations used to derive A

33
Relaxation Operation Cost
  • Naïve approach
  • Assign uniform cost to all relaxation operations
  • Simple but ineffective
  • Our approach
  • Assign an operation cost based on the similarity
    between the two nodes being approximated by the
    operation
  • The closer the two nodes, the less the operation
    costs

ri: a relaxation operation
u, v: the two nodes that are being approximated by ri
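A sketch of a similarity-based cost model, with the structure distance computed as the total cost of the operations used. The linear form cost = 1 - similarity is an assumption chosen for illustration; the slides state only that closer nodes yield cheaper operations.

```python
def operation_cost(similarity):
    """Assumed cost: falls from 1 to 0 as node similarity rises from 0 to 1."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity

def structure_distance(similarities):
    """Total cost of the relaxation operations used to derive an answer.

    `similarities` holds, for each operation, the similarity between the
    two nodes that the operation approximates.
    """
    return sum(operation_cost(s) for s in similarities)

print(structure_distance([0.5, 0.75]))
```

An answer derived without any relaxation has distance 0, and replacing a node with a close substitute costs less than replacing it with a distant one.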
34
Nodes Approximated By Relaxation Operations
[Figure: trees T1, T2, T3, and T4]
35
structure distance
content similarity
overall relevancy
36
Overall Relevancy Function
  • The overall relevancy of an answer A to a query
    Q, sim(A, Q), is a function of cont_sim(A, Q) and
    struct_dist(A, Q)
  • Properties
  • sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q)
    = 0
  • sim(A, Q) increases as cont_sim(A, Q) increases
  • sim(A, Q) decreases as struct_dist(A, Q) increases
  • Implementation

α is a small constant between 0 and 1
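The implementation formula is not given in the text above. One simple function that satisfies all three properties discounts the content similarity exponentially by the structure distance, using a constant between 0 and 1 (named alpha here). This form is an assumption made for illustration.

```python
def overall_relevancy(cont_sim, struct_dist, alpha=0.5):
    """Assumed form: alpha ** struct_dist * cont_sim, with 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if struct_dist < 0:
        raise ValueError("struct_dist must be non-negative")
    return (alpha ** struct_dist) * cont_sim

print(overall_relevancy(0.8, 0))   # exact structural match: equals cont_sim
print(overall_relevancy(0.8, 1))   # one unit of structure distance halves it
```

With struct_dist = 0 the factor alpha ** 0 is 1, so the score equals the content similarity. A smaller alpha penalizes structural relaxation more heavily.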
37
Roadmap
  • Introduction
  • Background
  • CoXML
  • Relaxation Indexes
  • Relaxation Language
  • Ranking
  • Evaluation
  • Testbed
  • Related Work
  • Conclusion

38
Evaluation Studies
  • INEX (Initiative for the Evaluation of XML
    Retrieval)
  • Similar to TREC for text retrieval
  • Document collection
  • Scientific articles from the IEEE Computer
    Society, 1995-2002
  • About 500 MB
  • Each article consists of about 1,500 XML nodes
    on average
  • Queries
  • Strict content and structure (SCAS)
  • Vague content and structure (VCAS)
  • Gold standard
  • Relevance assessments provided by INEX

39
Evaluation of Content Similarity
  • Dataset: INEX 03 test collection
  • Query set: 30 SCAS queries
  • Comparison: 38 submissions in INEX 03

40
Evaluation of the Cost Model
  • Dataset: INEX 05 test collection
  • Query set: 22 simple VCAS queries
  • Evaluation metric: normalized extended cumulative
    gain (nxCG)
  • The official evaluation metric used in INEX 05
  • Given a rank i (i ≥ 1), nxCG_at_i, similar to
    precision_at_i, measures the relative gain users
    have accumulated up to rank i
  • E.g., nxCG_at_10, nxCG_at_25, nxCG_at_50
  • Cost models
  • UCost: uniform cost for each relaxation operation
    (baseline)
  • SCost: our proposed cost model
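A simplified sketch of nxCG at rank i: the gain accumulated over the first i returned results, divided by the gain an ideal ranking would accumulate over its first i results. The function name is hypothetical, and the sketch ignores the overlap handling and quantization details of the official INEX metric.

```python
def nxcg_at(i, gains, ideal_gains):
    """Cumulative gain of the top-i results relative to the ideal ranking.

    gains:       gain value of each returned result, in rank order
    ideal_gains: gain values of all assessed results (any order)
    """
    if i < 1:
        raise ValueError("i must be at least 1")
    ideal = sum(sorted(ideal_gains, reverse=True)[:i])
    if ideal == 0:
        return 0.0  # nothing relevant exists, so no gain can be accumulated
    return sum(gains[:i]) / ideal

run = [1.0, 0.0, 0.5]      # hypothetical gains of a system's ranked results
assessed = [1.0, 0.5, 0.0]  # hypothetical assessed gains
print(nxcg_at(2, run, assessed))
print(nxcg_at(3, run, assessed))
```

A value of 1 at rank i means the system accumulated as much gain in its first i results as the ideal ranking would.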

41
Retrieval performance improvements with semantic
cost model
  • Query set: all content-and-structure queries in
    INEX 05

[Table: nxCG_at_10 for each (α, cost model) pair]

Assigning relaxation operations different costs
based on the similarity of the nodes being
operated on improves retrieval performance! nxCG_at_25
and nxCG_at_50 yield similar results.
42
Evaluation of the Cost Model
  • Result

Each cell: nxCG_at_10 for a given pair (α, cost
model), with the % of improvement over the baseline
Utilizing node similarities to distinguish the costs
of different operations improves retrieval
performance! Similar results are observed using
nxCG_at_25 and nxCG_at_50.
43
Expressiveness of the Relaxation Language
  • INEX 05 Topic 267
  • Expressing Topic 267 using RLXQuery

<inex_topic topic_id="267" query_type="CAS">
<castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
<description>Articles containing "digital libraries" in their title.</description>
<narrative>I'm interested in articles discussing Digital
Libraries as their main subject. Therefore I
require that the title of any relevant article
mentions "digital library" explicitly. Documents
that mention digital libraries only under the
bibliography are not relevant, as well as
documents that do not have the phrase "digital
library" in their title.</narrative>
</inex_topic>
FOR $a in doc("inex.xml")//article
LET $b := $a//fm//!atl
REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b
44
Effectiveness of the Relaxation Control
  • Expressing Topic 267 with RLXQuery
  • Results

FOR $a in doc("inex.xml")//article
LET $b := $a//fm//!atl
REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b
Relaxation control enables the system to provide
answers with greater relevancy!
45
Evaluation of the Ranking Function
  • Dataset: INEX 05 test collection
  • Query set: 4 official VCAS queries with available
    relevance assessments
  • Comparison: top-1 submission in INEX 05
  • Results

The systematic relaxation approach enables our
system to derive more approximate answers! Our
ranking function, based on both content and
structure relevancy, outperforms other ranking
functions using content similarities only!
46
Roadmap
  • Introduction
  • Background
  • CoXML
  • Relaxation Indexes: XTAH
  • Relaxation Language: RLXQuery
  • Ranking
  • Evaluation
  • Testbed
  • Related Work
  • Conclusion

47
CoXML Testbed
[Figure: CoXML testbed architecture. Labels: RLXQuery, Approximate Answers, RLXQuery Preprocessor, RLXQuery Parser, Relaxation Controller, Relaxation Manager, Relaxation Index Builder, Database Manager, Ranking Module, XML Database Engine, XML Documents]
Team members: Prof. Chu, S. Liu, T. Lee, E. Sung,
C. Cardenas, A. Putnam, J. Chen, R. Shahinian
48
Relaxation Examples using the Testbed
49
Relaxation Examples using the Testbed
50
Roadmap
  • Introduction
  • Background
  • CoXML
  • Related Work
  • Conclusion

51
Related Work: Query Relaxation
  • Relaxation based on schema conversions (LC01,
    LMC01, LMC03)
  • No structure relaxation
  • Native XML relaxation
  • Proposing structure relaxation types, e.g., (KS01,
    ACS02)
  • We use the relaxation types introduced in (ACS02)
  • Investigating efficient algorithms for deriving
    top-K answers based on the relaxation types
    supported, e.g., (Sch02, ACS02, ALP04, AKM05)
  • No relaxation control

52
Related Work: XML Ranking
  • Content ranking
  • Most extend ranking models for text retrieval to
    the XML scenario, e.g., HyRex, XXL, JuruXML,
    XSearch
  • We utilize structure to distinguish terms of
    different weights occurring in different parts of
    a document
  • Structure ranking
  • Based on tree editing distance algorithms without
    considering operation cost (NJ02)
  • Based on the occurrence frequency of the query
    trees, paths, or predicates in data (MAK05,
    AKM05)
  • Our structure ranking is similar to editing
    distance, but we consider operation cost

53
Conclusion
  • Cooperative XML (CoXML) query answering
  • RLXQuery enables users to effectively express
    approximate query conditions and to control the
    approximate matching process
  • XTAH provides systematic query relaxation
    guidance
  • Content and structure similarity metrics
    evaluate the relevancy of approximate answers
  • Evaluation studies with the INEX test collections
    demonstrate the effectiveness of our methodology