XSEarch: A Semantic Search Engine for XML

1
XSEarch: A Semantic Search Engine for XML

Sara Cohen
Jonathan Mamou
Yaron Kanza
Yehoshua Sagiv
Presented at VLDB 2003, Germany

2
XSEarch: an XML Search Engine

Our Goal
Find the relevant XML fragments,
given tag names and keywords

3
Excerpt from the XML Version of DBLP

<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey From Codd to XML</title>
  </inproceedings>
</proceedings>

4
A Search Example

Find papers by Vianu on the topic of
logical databases

How can we find such papers?
5
Attempt 1: Standard Search Engine
Each document in the corpus is treated as an
integral unit. A document containing some of the
three query terms is considered a result.
6
The document is not relevant to the query. This
does not work!!!

<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey From Codd to XML</title>
  </inproceedings>
</proceedings>

7
Attempt 2: XML Query Language

FOR $i IN document('bib.xml')//inproceedings
WHERE $i/author contains 'Vianu'
AND $i/title contains 'Logical'
AND $i/title contains 'Databases'
RETURN <result>
         <author> $i/author </author>
         <title> $i/title </title>
       </result>

This does work, BUT

Complicated syntax
Extensive knowledge of the document structure
required to write the query
No mechanism for ranking results
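
The contrast between the two attempts can be sketched in Python with the standard xml.etree.ElementTree module. This is only an illustration on the DBLP excerpt shown earlier: the function names document_matches and structured_matches are hypothetical, and the structured filter mimics the query above rather than running a real XQuery engine.

```python
import xml.etree.ElementTree as ET

# The DBLP excerpt from the slides, used as a one-document corpus.
DBLP_EXCERPT = """
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey From Codd to XML</title>
  </inproceedings>
</proceedings>
"""


def contains(text, keyword):
    """Case-insensitive substring test."""
    return keyword.lower() in text.lower()


def document_matches(xml_text, keywords):
    """Attempt 1: the whole document is one unit; any keyword hit counts."""
    text = " ".join(ET.fromstring(xml_text.strip()).itertext())
    return any(contains(text, k) for k in keywords)


def structured_matches(xml_text, author_kw, title_kws):
    """Attempt 2: author and title must belong to the same inproceedings."""
    results = []
    for paper in ET.fromstring(xml_text.strip()).iter("inproceedings"):
        for author in paper.findall("author"):
            for title in paper.findall("title"):
                a, t = author.text or "", title.text or ""
                if contains(a, author_kw) and all(contains(t, k) for k in title_kws):
                    results.append((a, t))
    return results


# The document-level match fires even though no paper by Vianu is about
# logical databases; the structured match correctly finds nothing.
print(document_matches(DBLP_EXCERPT, ["Vianu", "logical", "databases"]))
print(structured_matches(DBLP_EXCERPT, "Vianu", ["Logical", "Databases"]))
```

The structured version gives the right answer only because the code hard-wires the inproceedings/author/title layout, which is the "extensive knowledge of the document structure" drawback listed above.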

8
Our Requirements from the Search Tool

A simple syntax that can be used by naive users
Search results should include XML fragments and
not necessarily full documents
The XML fragments in an answer should be
semantically related
For example, a paper and an author should be in
an answer only if the paper was written by this
author
Search results should be ranked
Search results should be returned in reasonable
time

9
Overall Architecture
[Figure: architecture diagram. Components: User Interface, Query Processor, Ranker, Indexer, Index Repository (Indices), XML Files]
10
Query Syntax and Semantics
[Figure: architecture diagram repeated (User Interface, Query Processor, Ranker, Indexer, Index Repository, XML Files)]
11
XSEarch Query Syntax

A query is a list of query terms
A query term can be a
Keyword, e.g., database
Tag, e.g., inproceedings:
Tag-keyword combination, e.g., author:Vianu
Optionally preceded by a + (the term is then required)
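
As a rough illustration of this syntax, the sketch below splits a query string into terms. It assumes that a colon separates tag from keyword (tag: for a bare tag, tag:keyword for a combination, no colon for a plain keyword) and that a leading + marks a required term; the names QueryTerm and parse_query are hypothetical and not part of XSEarch.

```python
from typing import List, NamedTuple, Optional


class QueryTerm(NamedTuple):
    label: Optional[str]    # tag name, or None for a plain keyword
    keyword: Optional[str]  # keyword, or None for a bare tag
    required: bool          # True when the term was prefixed with "+"


def parse_query(query: str) -> List[QueryTerm]:
    """Split a whitespace-separated query into (label, keyword, required) terms."""
    terms = []
    for token in query.split():
        required = token.startswith("+")
        if required:
            token = token[1:]
        label, sep, keyword = token.partition(":")
        if not sep:
            # No colon at all: the whole token is a plain keyword.
            label, keyword = "", label
        terms.append(QueryTerm(label or None, keyword or None, required))
    return terms


for term in parse_query("logical database inproceedings: +author:Vianu"):
    print(term)
```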

12
Example

Find papers by Vianu on the topic of logical
databases

logical database inproceedings: author:Vianu
Note that the different document fragments
matching these query terms must be semantically
related
13
Semantic Relation: The Intuition
14
XSEarch
author:Vianu title:

<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey From Codd to XML</title>
  </inproceedings>
</proceedings>

<author>Victor Vianu</author>
<title>A Web Odyssey From Codd to XML</title>
Good Result! title and author elements ARE
semantically related
15
XSEarch
author:Vianu title:

<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey From Codd to XML</title>
  </inproceedings>
</proceedings>

<title>Querying Logical Databases</title>
<author>Victor Vianu</author>
Bad Result! title and author elements ARE NOT
semantically related
16
Semantic Relation: Formalization
17
Data Model: Document Tree
[Figure: document tree. The root proceedings has two inproceedings children, each with an author and a title child. The leaves are Moshe Y. Vardi / Querying Logical Databases and Victor Vianu / A Web Odyssey From Codd to XML. Tags are colored in green, data in red.]
GOAL: Find pairs of semantically related titles
and authors.
18
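Relationship Trees
The relationship tree of n1, n2, ..., nk is the
subtree rooted at their lowest common ancestor that
consists of the paths from it down to n1, n2, ..., nk
[Figure: lowest common ancestor of n1, n2, ..., nk with paths leading down to n1, n2, ..., nk]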
19
Our Semantic Relation Interconnection

n1,..., nk are strongly interconnected if the
relationship tree of n1,..., nk does not contain
two nodes with the same label
n1,..., nk are interconnected if either
they are strongly interconnected, or
the only nodes with the same label in the
relationship tree of n1,..., nk, are among
n1,..., nk
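
The two definitions can be checked directly from parent pointers. The sketch below is a straightforward reading of the definitions, not the XSEarch implementation; the Node class and the function names are hypothetical, and it assumes the relationship tree is made of the paths from the lowest common ancestor down to the given nodes.

```python
from collections import defaultdict


class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


def ancestors_or_self(node):
    """Path from the node up to the root (node first, root last)."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def relationship_tree(nodes):
    """Nodes on the paths from the lowest common ancestor to each given node."""
    paths = [ancestors_or_self(n) for n in nodes]
    common = set(map(id, paths[0]))
    for path in paths[1:]:
        common &= set(map(id, path))
    # Walking upward, the first node shared by all paths is the LCA.
    lca = next(a for a in paths[0] if id(a) in common)
    tree = {}
    for path in paths:
        for a in path:
            tree[id(a)] = a
            if a is lca:
                break
    return list(tree.values())


def label_groups(nodes):
    """Group the relationship-tree nodes by label."""
    groups = defaultdict(list)
    for a in relationship_tree(nodes):
        groups[a.label].append(a)
    return groups.values()


def strongly_interconnected(nodes):
    """No two nodes of the relationship tree share a label."""
    return all(len(g) == 1 for g in label_groups(nodes))


def interconnected(nodes):
    """Any nodes sharing a label must all be among the given nodes."""
    given = set(map(id, nodes))
    return all(len(g) == 1 or all(id(a) in given for a in g)
               for g in label_groups(nodes))


# DBLP-like tree: two papers under one proceedings element.
root = Node("proceedings")
paper1, paper2 = Node("inproceedings", root), Node("inproceedings", root)
title1 = Node("title", paper1)
vianu, title2 = Node("author", paper2), Node("title", paper2)
print(interconnected([vianu, title2]))  # same paper
print(interconnected([vianu, title1]))  # different papers
```

Each check walks the paths to the root, so it costs time proportional to the document depth per call.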

20
Example (1)
Lowest common ancestor of circled nodes
Relationship tree
Circled nodes belong to different inproceedings
entities. They are NEITHER strongly interconnected
NOR interconnected!
21
Example (2)
Lowest common ancestor of circled nodes
Relationship tree
Circled nodes belong to the same inproceedings
entity. They ARE strongly interconnected, thus,
interconnected!
22
Example (3)
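[Figure: document tree with the lowest common ancestor of the circled nodes and their relationship tree marked. The root proceedings has two inproceedings children: one with title Querying Logical Databases and author Moshe Y. Vardi, the other with title Queries and Computation on the Web and authors Victor Vianu and Serge Abiteboul. The two author nodes of the second paper are circled.]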
We can see the advantage of using interconnection
rather than strong interconnection. These two
author nodes ARE semantically related.
Circled nodes belong to the same inproceedings
entity, but are labeled with the same tag. They
ARE interconnected, BUT NOT strongly
interconnected!
23
Interconnection

Based on the theoretical results of "Generating
Relations from XML Documents", S. Cohen, Y. Kanza,
Y. Sagiv, ICDT 2003.
Three types of interconnection
We have implemented two types of interconnection
XSEarch can easily accommodate different types of
interconnection, or other semantic relations
between nodes

24
Checking Whether Two Nodes Are Interconnected

During query processing, we need to check
whether pairs of nodes are interconnected

Given a document T, it is possible to check
whether nodes n and n' are interconnected in
O(|T|) time
Too expensive to do it during query processing!

25
Interconnection Index

Is built offline
Allows for checking interconnection between two
nodes, during query processing, in O(1) time
We have two implementations
as a hash table
as a symmetric matrix
The Indexer is responsible for building the
Interconnection Index
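
A minimal sketch of the two storage options, assuming nodes are identified by integers 0..n-1. The class names MatrixIndex and HashIndex are hypothetical; the slides do not show the actual data layout, so the lower-triangle packing below is just one common way to store a symmetric matrix.

```python
class MatrixIndex:
    """Symmetric boolean matrix stored as its lower triangle (one byte per pair)."""

    def __init__(self, n):
        self.cells = bytearray(n * (n + 1) // 2)

    @staticmethod
    def _pos(i, j):
        # Order the pair so that (i, j) and (j, i) map to the same cell.
        if i < j:
            i, j = j, i
        return i * (i + 1) // 2 + j

    def set(self, i, j, value=True):
        self.cells[self._pos(i, j)] = 1 if value else 0

    def get(self, i, j):
        return bool(self.cells[self._pos(i, j)])


class HashIndex:
    """Hash set holding only the interconnected pairs."""

    def __init__(self):
        self.pairs = set()

    def set(self, i, j, value=True):
        key = (min(i, j), max(i, j))
        if value:
            self.pairs.add(key)
        else:
            self.pairs.discard(key)

    def get(self, i, j):
        return (min(i, j), max(i, j)) in self.pairs


for index in (MatrixIndex(4), HashIndex()):
    index.set(3, 1)
    print(type(index).__name__, index.get(1, 3), index.get(0, 2))
```

Both give constant-time lookups. The matrix pays a fixed quadratic amount of memory, while the hash set grows only with the number of interconnected pairs but adds hashing overhead on every access.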

26
Indexer
27
Building the Interconnection Index: Naïve Approach

For each pair of nodes, check whether this pair
is interconnected
There are O(|T|^2) pairs
Checking interconnection is in O(|T|) time
As a result, checking for interconnection of all
pairs of nodes in T is in O(|T|^3) time
=> Too expensive, even if it is done offline!!!

28
Building the Interconnection Index: Dynamic
Programming Approach

Idea: Checking whether two nodes are
interconnected can be done by checking
interconnection between their parents/children
There are two characterizations of node
interconnection
For pairs where one node is an ancestor of the other
For pairs where neither node is an ancestor of the other

29
Interconnection Characterization: n is an
ancestor of n'

n and n' are interconnected
if and only if
the parent of n' is strongly
interconnected with n, and
the child of n on the path to n'
is strongly interconnected with n'

[Figure: path from n down to n', passing through the child of n and the parent of n']
30
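Interconnection Characterization: n is not an
ancestor of n' (and n' is not an ancestor of n)

n and n' are interconnected
if and only if
the parent of n is strongly
interconnected with n', and
the parent of n' is strongly
interconnected with n

[Figure: nodes n and n' in different branches, with the parent of n and the parent of n' marked]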
31
Building the Interconnection Index Using Dynamic
Programming

Theorem: Let T be a document. Then it is possible
to determine interconnection of all pairs of
nodes in T in O(|T|^2) time
Proof hint:
Number the nodes of T by a depth-first
traversal of T
Compute the index using dynamic programming,
based on the characterizations
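
One possible way to turn the two characterizations into code is sketched below. It uses memoized recursion instead of an explicit table-filling order, and it assumes nodes are numbered in depth-first preorder with parent[0] set to None. The recurrence used for strong interconnection (both sub-paths obtained by dropping one endpoint are free of repeated labels, and the two endpoints differ in label) is an assumption of this sketch rather than something stated in the slides, and build_interconnection_index is a hypothetical name.

```python
from functools import lru_cache


def build_interconnection_index(labels, parent):
    """Return a symmetric matrix index[i][j] of interconnection flags.

    labels[i] and parent[i] describe node i; nodes are numbered in
    depth-first preorder, so parent[i] < i and parent[0] is None.
    """
    n = len(labels)
    # last[i] is the largest preorder number inside the subtree of i.
    last = list(range(n))
    for j in range(n - 1, 0, -1):
        last[parent[j]] = max(last[parent[j]], last[j])

    def is_ancestor(i, j):
        return i < j <= last[i]

    @lru_cache(maxsize=None)
    def child_toward(i, j):
        # Child of i on the path from i down to its descendant j.
        return j if parent[j] == i else child_toward(i, parent[j])

    def reduced_pairs_ok(i, j):
        # Shared condition of both characterizations (requires i < j).
        if is_ancestor(i, j):
            return strong(i, parent[j]) and strong(child_toward(i, j), j)
        return strong(parent[i], j) and strong(i, parent[j])

    @lru_cache(maxsize=None)
    def strong(i, j):
        if i == j:
            return True
        if i > j:
            return strong(j, i)
        # Assumed recurrence: endpoints differ and both reduced trees are fine.
        return labels[i] != labels[j] and reduced_pairs_ok(i, j)

    def interconnected(i, j):
        if i == j:
            return True
        return reduced_pairs_ok(min(i, j), max(i, j))

    return [[interconnected(i, j) for j in range(n)] for i in range(n)]


# Preorder numbering of a DBLP-like tree with two papers.
labels = ["proceedings", "inproceedings", "author", "title",
          "inproceedings", "author", "author", "title"]
parent = [None, 0, 1, 1, 0, 4, 4, 4]
index = build_interconnection_index(labels, parent)
print(index[5][7], index[5][6], index[3][5])
```

Every pair is evaluated once and each evaluation does a constant number of cached lookups, which is where the quadratic bound comes from. A production version would likely fill the table iteratively to avoid recursion depth limits on deep documents.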

32
Query Processing

Document fragments are extracted using the
interconnection index and other indices
Extracted fragments are returned ranked by the
estimated relevance

33
Ranker
34
Ranking Factors

Several factors increase the rank of a result
Similarity between query and result
Weight of labels appearing in the result
Characteristics of result tree

35
Query and Result Similarity

TFILF
Extension of TFIDF, which is classical in IR
Term Frequency: number of occurrences of a query
term in a fragment
Inverse Leaf Frequency: based on the inverse of the
leaf frequency, i.e., of the number of leaves
containing a query term divided by the number of
leaves in the corpus

36
TFILF

Term frequency of keyword k in a leaf node n_l
Inverse leaf frequency of keyword k

TFILF is the product of tf and ilf
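
The formulas themselves did not survive in this transcript. The sketch below therefore uses assumed forms modeled on standard TF-IDF: term frequency normalized by the most frequent word in the leaf, and inverse leaf frequency as log(1 + total leaves / leaves containing the keyword). These choices, and the function names, are assumptions made for illustration; only the final product of tf and ilf is taken from the slide.

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """Occurrences of keyword in the leaf, normalized by the most frequent word."""
    counts = Counter(leaf_words)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword, leaves):
    """Assumed form: log(1 + number of leaves / number of leaves containing keyword)."""
    containing = sum(1 for words in leaves if keyword in words)
    if containing == 0:
        return 0.0
    return math.log(1 + len(leaves) / containing)


def tfilf(keyword, leaf_words, leaves):
    """Product of term frequency and inverse leaf frequency."""
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


# Each leaf is represented as the list of words it contains.
leaves = [["xml", "xml", "search"], ["database"], ["xml"]]
for word in ("xml", "search"):
    print(word, round(tfilf(word, leaves[0], leaves), 3))
```

Under these assumed formulas a keyword that appears in few leaves gets a larger ilf, so rare keywords contribute more to the score than common ones.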
37
Weight of Labels

Some labels are considered more important than
others
Text under an element labeled with title is more
important than text under element labeled with
section
Label weights can be
system generated
user defined

38
Relationship between Nodes

Size of the relationship tree: a small fragment
indicates that its nodes are closer, and thus,
probably, more related

article: title:XML
39
Relationship between Nodes

An ancestor-descendant relationship between a pair
of nodes in a fragment indicates a strong
relation between these nodes

section: title:XML
40
Experimental Results
41
Hardware and Software Used

Language: Java
Processor: 1.6 GHz Pentium 4
RAM: 2 GB (limited to 1.46 GB by JVM)
OS: Windows XP

42
Choosing the Implementation for the
Interconnection Index

We have experimented with the two implementations of
the interconnection index
1. IIH: the index is a hash table
2. IIM: the index is a symmetric matrix
We compare the two implementations on
Cost of building the index
Cost of query processing, i.e., using the index

43
Time For Building Indices
XML corpus   Size (KB)   Number of nodes   IIM time (ms)   IIH time (ms)
Dream              146             3,360              29              36
Hamlet             281             6,635             114             185
Sigmod             704            21,246           1,552           1,729
Mondial          1,198            49,422           6,231           7,837

Both implementations are reasonable

IIM is better than IIH, because of the additional
overhead of hashing

44
On the Fly Indexing (OFI)

Fully building the indices as a preprocessing step
before querying is expensive in memory for huge
corpora!
It is also expensive in time because of the additional
overhead of using virtual memory
Instead, compute the interconnection index
incrementally, on the fly, during query processing,
for each pair that must be checked
By how much will query processing be slowed down?
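
The on-the-fly idea amounts to a lazily filled cache in front of the pairwise check. The sketch below assumes some function check(i, j) that decides interconnection for one pair (for instance, a direct test on the document tree); OnTheFlyIndex and its method names are hypothetical.

```python
class OnTheFlyIndex:
    """Interconnection index that is filled lazily during query processing."""

    def __init__(self, check):
        self._check = check  # function (i, j) -> bool, called once per pair
        self._cache = {}

    def interconnected(self, i, j):
        # Normalize the pair so (i, j) and (j, i) share one cache entry.
        key = (i, j) if i <= j else (j, i)
        if key not in self._cache:
            self._cache[key] = self._check(*key)
        return self._cache[key]

    def pairs_checked(self):
        """Number of distinct pairs computed so far."""
        return len(self._cache)


def toy_check(i, j):
    # Stand-in for a real interconnection test on the document tree.
    return (i + j) % 2 == 0


index = OnTheFlyIndex(toy_check)
index.interconnected(3, 1)
index.interconnected(1, 3)  # answered from the cache
print(index.pairs_checked())
```

Only the pairs that queries actually touch are ever stored, which is why memory use depends on query locality rather than on the quadratic number of node pairs.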

45
Time For Building Indices: Comparing IIH, IIM, OFI
XML corpus   Size (KB)   Number of nodes   IIM time (ms)   IIH time (ms)   OFI time (ms)
Dream              146             3,360              29              36             0.6
Hamlet             281             6,635             114             185             1.1
Sigmod             704            21,246           1,552           1,729             2.2
Mondial          1,198            49,422           6,231           7,837            10.0
For these corpora, OFI time is at most 10 ms.
This is just the time to build all the indices
other than the interconnection index.
46
Query Execution Time

We generated 1000 random queries for the Sigmod
Record corpus
Each query had
At most 3 optional search terms
At most 3 required search terms
We checked time with IIH, IIM and OFI

47
IIH/IIM Query Processing Time

Note: logarithmic scale
Both approaches lead to similar results
Average run time for queries: 35 ms

48
OFI Query Processing Time

After processing the 1000 queries, only 0.75% of all
pairs of nodes were checked for interconnection.
Space is saved in main memory

Slowdown in response time is not too large! Locality
property: queries tend to be similar in the parts
of the document that they may access

More than 50% of the queries were processed in under
10 ms

49
How Good are the Results?

We measured recall and precision for the query
Find papers written by Buneman that contain the
keyword database in the title
We tried two different queries that reflect
different amounts of user knowledge
Kw: Buneman database (classical search engine
query)
Tag-kw: author:Buneman title:database
Corpora: Sigmod, DBLP

50
Precision and Recall

We computed the "correct answers" using XQuery
Recall
=> Perfect recall, i.e., XSEarch returns all the
correct answers
Precision at n
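
For reference, the two measures can be computed as in the sketch below. It assumes results are a ranked list of identifiers and the correct answers are a set; when fewer than n results are returned, this version divides by the number actually returned, which is one of several conventions. The function names and the sample identifiers are illustrative only.

```python
def precision_at(ranked, relevant, n):
    """Fraction of the top-n ranked results that are correct answers."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for r in top if r in relevant) / len(top)


def recall(ranked, relevant):
    """Fraction of the correct answers that appear anywhere in the results."""
    if not relevant:
        return 1.0
    return len(set(ranked) & set(relevant)) / len(relevant)


# Hypothetical ranked answer list and set of correct answers.
ranked = ["p1", "p7", "p3", "p9", "p4"]
relevant = {"p1", "p3", "p4", "p8"}
print(precision_at(ranked, relevant, 5), recall(ranked, relevant))
```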

51
Precision at 5, 10 and 20
Sigmod: perfect precision. DBLP: 0.8/0.9 for the query
containing only keywords
Combining tags and keywords leads to perfect
precision
52
Conclusions

Paradigm for querying XML combining IR and
database techniques
Returns semantically related fragments, ranked by
estimated relevance
Combining tags and keywords in the query leads to
good results

53
Conclusions

Efficient index structures
IIM/IIH for small documents
OFI for big documents
Efficient evaluation algorithms
Dynamic programming algorithm for computing interconnection
Extensible implementation
The system can easily accommodate different types
of semantic relations between nodes, other than
interconnection