Title: XSEarch: A Semantic Search Engine for XML
1XSEarch: A Semantic Search Engine for XML
- Sara Cohen
- Jonathan Mamou
- Yaron Kanza
- Yehoshua Sagiv
- Presented at VLDB 2003, Germany
2XSEarch: an XML Search Engine
- Our goal: find the relevant XML fragments, given tag names and keywords
3Excerpt from the XML Version of DBLP
- <proceedings>
- <inproceedings>
- <author>Moshe Y. Vardi</author>
- <title>Querying Logical Databases</title>
- </inproceedings>
- <inproceedings>
- <author>Victor Vianu</author>
- <title>A Web Odyssey: From Codd to XML</title>
- </inproceedings>
- </proceedings>
4A Search Example
- Find papers by Vianu on the topic of logical databases
How can we find such papers?
5Attempt 1: Standard Search Engine
Each document in the corpus is treated as an
integral unit. A document containing some of the
three query terms is considered as a result.
6The document is not relevant to the query. This
does not work!!!
- <proceedings>
- <inproceedings>
- <author>Moshe Y. Vardi</author>
- <title>Querying Logical Databases</title>
- </inproceedings>
- <inproceedings>
- <author>Victor Vianu</author>
- <title>A Web Odyssey: From Codd to XML</title>
- </inproceedings>
- </proceedings>
7Attempt 2: XML Query Language
- FOR $i IN document("bib.xml")//inproceedings
- WHERE $i/author contains 'Vianu'
- AND $i/title contains 'Logical'
- AND $i/title contains 'Databases'
- RETURN <result>
- <author> $i/author </author>
- <title> $i/title </title>
- </result>
This does work, BUT:
- Complicated syntax
- Extensive knowledge of the document structure is required to write the query
- No mechanism for ranking results
8Our Requirements from the Search Tool
- A simple syntax that can be used by naive users
- Search results should include XML fragments and not necessarily full documents
- The XML fragments in an answer should be semantically related
- For example, a paper and an author should be in an answer only if the paper was written by this author
- Search results should be ranked
- Search results should be returned in reasonable time
9Overall Architecture
[Figure: architecture diagram. Components: User Interface, Query Processor, Ranker, Indexer, Index Repository, Indices, XML Files.]
10Query Syntax and Semantics
11XSEarch Query Syntax
- A query is a list of query terms
- A query term can be a
- Keyword, e.g., database
- Tag, e.g., inproceedings:
- Tag-keyword combination, e.g., author:Vianu
- Optionally preceded by a + (marking the term as required)
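The query syntax above can be illustrated with a small parser. This is only a sketch under stated assumptions: a colon separates tag from keyword, a trailing colon marks a tag-only term, a bare word is a keyword, and a leading + marks a required term. The names QueryTerm and parse_query are hypothetical and are not taken from XSEarch itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QueryTerm:
    label: Optional[str]    # tag name, or None for a keyword-only term
    keyword: Optional[str]  # keyword, or None for a tag-only term
    required: bool          # True when the term was written with a leading "+"


def parse_query(text: str) -> List[QueryTerm]:
    """Split a query string into terms (assumed syntax, see the lead-in)."""
    terms = []
    for token in text.split():
        required = token.startswith("+")
        body = token[1:] if required else token
        label, colon, keyword = body.partition(":")
        if not colon:
            # A bare word is treated as a keyword-only term.
            label, keyword = "", body
        if not label and not keyword:
            raise ValueError(f"empty query term: {token!r}")
        terms.append(QueryTerm(label or None, keyword or None, required))
    return terms


terms = parse_query("logical database inproceedings: +author:Vianu")
for term in terms:
    print(term)
```

Keeping the three term shapes in one record makes later matching uniform: a term with only a label matches any element with that tag, and a term with only a keyword matches any element containing it.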
12Example
- Find papers by Vianu on the topic of logical databases
logical database inproceedings: author:Vianu
Note that the different document fragments
matching these query terms must be semantically
related
13Semantic Relation: The Intuition
14XSEarch
author:Vianu title:
- <proceedings>
- <inproceedings>
- <author>Moshe Y. Vardi</author>
- <title>Querying Logical Databases</title>
- </inproceedings>
- <inproceedings>
- <author>Victor Vianu</author>
- <title>A Web Odyssey: From Codd to XML</title>
- </inproceedings>
- </proceedings>
<author>Victor Vianu</author>
<title>A Web Odyssey: From Codd to XML</title>
Good result! The title and author elements ARE semantically related
15XSEarch
author:Vianu title:
- <proceedings>
- <inproceedings>
- <author>Moshe Y. Vardi</author>
- <title>Querying Logical Databases</title>
- </inproceedings>
- <inproceedings>
- <author>Victor Vianu</author>
- <title>A Web Odyssey: From Codd to XML</title>
- </inproceedings>
- </proceedings>
<title>Querying Logical Databases</title>
<author>Victor Vianu</author>
Bad result! The title and author elements ARE NOT semantically related
16Semantic Relation: Formalization
17Data Model: Document Tree
[Figure: document tree of the DBLP excerpt. The root proceedings has two inproceedings children, each with a title child and an author child. Tags are colored in green and data in red. One inproceedings holds Querying Logical Databases and Moshe Y. Vardi, the other holds A Web Odyssey: From Codd to XML and Victor Vianu.]
Goal: find pairs of semantically related titles and authors.
18Relationship Trees
[Figure: relationship tree of n1, n2, ..., nk, rooted at the lowest common ancestor of n1, n2, ..., nk.]
19Our Semantic Relation: Interconnection
- n1, ..., nk are strongly interconnected if the relationship tree of n1, ..., nk does not contain two nodes with the same label
- n1, ..., nk are interconnected if either
- they are strongly interconnected, or
- the only nodes with the same label in the relationship tree of n1, ..., nk are among n1, ..., nk
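The two definitions can be made concrete with a direct check. The sketch below assumes that the relationship tree of n1, ..., nk consists of the paths from their lowest common ancestor down to each ni, and that only element labels matter. The Node class and all function names are illustrative, not part of XSEarch.

```python
class Node:
    """Element node with a label, a parent pointer and children."""

    def __init__(self, label, *children):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def ancestors_or_self(node):
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path  # node first, root last


def relationship_tree(nodes):
    """Nodes on the paths from the lowest common ancestor to each given node."""
    paths = [ancestors_or_self(n) for n in nodes]
    common = {id(n) for n in paths[0]}
    for path in paths[1:]:
        common &= {id(n) for n in path}
    lca = next(n for n in paths[0] if id(n) in common)
    tree = {}
    for path in paths:
        for n in path:
            tree[id(n)] = n
            if n is lca:
                break
    return list(tree.values())


def _groups_by_label(nodes):
    groups = {}
    for n in relationship_tree(nodes):
        groups.setdefault(n.label, []).append(n)
    return groups.values()


def strongly_interconnected(nodes):
    # No label may occur twice in the relationship tree.
    return all(len(group) == 1 for group in _groups_by_label(nodes))


def interconnected(nodes):
    # A repeated label is tolerated only when every node carrying it is one of the given nodes.
    chosen = {id(n) for n in nodes}
    return all(len(group) == 1 or all(id(n) in chosen for n in group)
               for group in _groups_by_label(nodes))


# Example document: two papers, the second one with two authors.
vardi, title1 = Node("author"), Node("title")
vianu, abiteboul, title2 = Node("author"), Node("author"), Node("title")
root = Node("proceedings",
            Node("inproceedings", title1, vardi),
            Node("inproceedings", title2, vianu, abiteboul))

print(interconnected([vianu, title2]), interconnected([vianu, title1]))
print(interconnected([vianu, abiteboul]), strongly_interconnected([vianu, abiteboul]))
```

Each call walks the paths to the root, so a single check costs time proportional to the size of the relationship tree.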
20Example (1)
[Figure: document tree showing the circled nodes, their lowest common ancestor, and the relationship tree.]
Circled nodes belong to different inproceedings entities. They are NEITHER strongly interconnected NOR interconnected!
21Example (2)
[Figure: document tree showing the circled nodes, their lowest common ancestor, and the relationship tree.]
Circled nodes belong to the same inproceedings entity. They ARE strongly interconnected, and thus interconnected!
22Example (3)
[Figure: document tree showing the circled nodes, their lowest common ancestor, and the relationship tree. The root proceedings has two inproceedings children: one with title Querying Logical Databases and author Moshe Y. Vardi, the other with title Queries and Computation on the Web and authors Victor Vianu and Serge Abiteboul.]
Circled nodes belong to the same inproceedings entity, but are labeled with the same tag. They ARE interconnected, BUT NOT strongly interconnected!
We can see the advantage of using interconnection rather than strong interconnection. These two author nodes ARE semantically related.
23Interconnection
- Based on theoretical results of "Generating relations from XML documents", S. Cohen, Y. Kanza, Y. Sagiv, ICDT 2003
- Three types of interconnection
- We have implemented two types of interconnection
- XSEarch can easily accommodate different types of interconnection, or other semantic relations between nodes
24Checking Whether Two Nodes Are Interconnected
- During query processing, we need to check whether pairs of nodes are interconnected
- Given a document T, it is possible to check whether nodes n and n' are interconnected in O(|T|) time
- Too expensive to do it during query processing!
25Interconnection Index
- Is built offline
- Allows for checking interconnection between two nodes, during query processing, in O(1) time
- We have two implementations
- as a hash table
- as a symmetric matrix
- The Indexer is responsible for building the Interconnection Index
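The slides do not describe the internal layout of the two implementations, so the following is only one plausible sketch. It assumes nodes are identified by consecutive integers, stores the symmetric matrix as its lower triangle, and stores the hash table as a set of ordered pairs. MatrixIndex and HashIndex are hypothetical names.

```python
class MatrixIndex:
    """Symmetric boolean matrix over node numbers, stored as its lower triangle."""

    def __init__(self, size):
        self.size = size
        self.cells = bytearray(size * (size + 1) // 2)

    @staticmethod
    def _slot(i, j):
        if i < j:
            i, j = j, i  # symmetry: (i, j) and (j, i) share one cell
        return i * (i + 1) // 2 + j

    def set(self, i, j, value=True):
        self.cells[self._slot(i, j)] = 1 if value else 0

    def get(self, i, j):
        return self.cells[self._slot(i, j)] == 1


class HashIndex:
    """Hash-based variant: only interconnected pairs are stored."""

    def __init__(self):
        self.pairs = set()

    @staticmethod
    def _key(i, j):
        return (i, j) if i <= j else (j, i)

    def set(self, i, j, value=True):
        if value:
            self.pairs.add(self._key(i, j))
        else:
            self.pairs.discard(self._key(i, j))

    def get(self, i, j):
        return self._key(i, j) in self.pairs


matrix, table = MatrixIndex(4), HashIndex()
for index in (matrix, table):
    index.set(1, 3)
    print(index.get(3, 1), index.get(0, 3))
```

Both variants answer a lookup in constant time (expected constant time for the hash table). The matrix pays a fixed quadratic space cost, while the hash table grows with the number of stored pairs.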
26Indexer
27Building the Interconnection Index: Naïve Approach
- For each pair of nodes, check whether this pair is interconnected
- There are O(|T|^2) pairs
- Checking interconnection is in O(|T|) time
- As a result, checking for interconnection of all pairs of nodes in T is in O(|T|^3) time
- => Too expensive, even if it is done offline!!!
28Building the Interconnection Index: Dynamic Programming Approach
- Idea: checking whether two nodes are interconnected can be done by checking interconnection between their parents/children
- There are two characterizations of node interconnection
- For ancestor-descendant nodes
- For non ancestor-descendant nodes
29Interconnection Characterization: n is an ancestor of n'
- n and n' are interconnected
- if and only if
- the parent of n' is strongly interconnected with n, and
- the child of n on the path to n' is strongly interconnected with n'
[Figure: path from n down to n', with the child of n and the parent of n' marked.]
30Interconnection Characterization: n is not an ancestor of n'
- n and n' are interconnected
- if and only if
- the parent of n is strongly interconnected with n', and
- the parent of n' is strongly interconnected with n
[Figure: n and n' on different branches, with the parent of n and the parent of n' marked.]
31Building the Interconnection Index: Using Dynamic Programming
- Theorem: Let T be a document. Then it is possible to determine interconnection of all pairs of nodes in T in O(|T|^2) time
- Proof hint
- Number the nodes in T by a depth-first traversal of T
- Compute the index using dynamic programming, based on the characterizations
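The proof hint can be sketched as follows. This is an illustrative reading of the two characterizations, not the authors' code: nodes are numbered in depth-first preorder, and two nodes are taken to be strongly interconnected exactly when they are interconnected and carry different labels (a node is strongly interconnected with itself). Node and build_interconnection_table are assumed names.

```python
class Node:
    def __init__(self, label, *children):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def preorder(root):
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(node.children))
    return order


def build_interconnection_table(root):
    nodes = preorder(root)  # depth-first numbering: parents precede children
    number = {id(n): i for i, n in enumerate(nodes)}
    parent = [number[id(n.parent)] if n.parent is not None else -1 for n in nodes]
    label = [n.label for n in nodes]
    size = len(nodes)

    # last[i] is the largest number inside the subtree of node i.
    last = list(range(size))
    for i in range(size - 1, 0, -1):
        last[parent[i]] = max(last[parent[i]], last[i])

    inter = [[False] * size for _ in range(size)]

    def strong(a, b):
        return a == b or (inter[a][b] and label[a] != label[b])

    # Ancestor-descendant pairs: walk from each node j up to the root.
    for j in range(size):
        inter[j][j] = True
        below, a = j, parent[j]  # below is the child of a on the path to j
        while a != -1:
            value = strong(a, parent[j]) and strong(below, j)
            inter[a][j] = inter[j][a] = value
            below, a = a, parent[a]

    # Remaining pairs, row by row, so both sub-results are already filled in.
    for i in range(size):
        for j in range(i + 1, size):
            if j <= last[i]:
                continue  # i is an ancestor of j, handled above
            value = strong(parent[i], j) and strong(parent[j], i)
            inter[i][j] = inter[j][i] = value
    return nodes, inter


vardi, title1 = Node("author"), Node("title")
vianu, abiteboul, title2 = Node("author"), Node("author"), Node("title")
root = Node("proceedings",
            Node("inproceedings", title1, vardi),
            Node("inproceedings", title2, vianu, abiteboul))
nodes, inter = build_interconnection_table(root)
pos = {id(n): i for i, n in enumerate(nodes)}
print(inter[pos[id(vianu)]][pos[id(title2)]], inter[pos[id(vianu)]][pos[id(title1)]])
```

Every table cell is filled once from at most two earlier cells, and the upward walks visit at most |T| ancestors per node, which gives the quadratic bound stated in the theorem.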
32Query Processing
- Document fragments are extracted using the interconnection index and other indices
- Extracted fragments are returned ranked by the estimated relevance
33Ranker
34Ranking Factors
- Several factors increase the rank of a result
- Similarity between query and result
- Weight of labels appearing in the result
- Characteristics of result tree
35Query and Result Similarity
- TFILF
- Extension of TFIDF, classical in IR
- Term Frequency: number of occurrences of a query term in a fragment
- Inverse Leaf Frequency: inverse of the fraction of leaves in the corpus that contain a query term
36TFILF
- Term frequency (tf) of keyword k in a leaf node nl
- Inverse leaf frequency (ilf) of keyword k
TFILF is the product of tf and ilf
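The formulas on this slide did not survive extraction, so the sketch below uses normalizations borrowed from common TF-IDF practice and should be read as an assumption: tf divides the keyword count by the count of the most frequent word in the leaf, and ilf is log(1 + N / N_k), where N is the number of leaves and N_k the number of leaves containing the keyword. Function names and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """Keyword count in one leaf, normalized by the most frequent word in that leaf."""
    counts = Counter(leaf_words)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword, leaves):
    """Assumed inverse leaf frequency: log(1 + N / N_k)."""
    containing = sum(1 for words in leaves if keyword in words)
    if containing == 0:
        return 0.0
    return math.log(1 + len(leaves) / containing)


def tfilf(keyword, leaf_words, leaves):
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


# Toy corpus: each leaf is the list of words in one text node.
leaves = [
    ["querying", "logical", "databases"],
    ["a", "web", "odyssey", "from", "codd", "to", "xml"],
    ["victor", "vianu"],
    ["moshe", "y", "vardi"],
]
print(tfilf("xml", leaves[1], leaves))
print(tfilf("xml", leaves[0], leaves))
```

A keyword that is rare across leaves gets a larger ilf, so a leaf that mentions it scores higher than a leaf matching only common words.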
37Weight of Labels
- Some labels are considered more important than others
- Text under an element labeled with title is more important than text under an element labeled with section
- Label weights can be
- system generated
- user defined
38Relationship between Nodes
- Size of the relationship tree: a small fragment indicates that its nodes are closer, and thus, probably, more related
article titleXML
39Relationship between Nodes
- An ancestor-descendant relationship between a pair of nodes in a fragment indicates a strong relation between these nodes
section titleXML
40Experimental Results
41Hardware and Software Used
- Language: Java
- Processor: 1.6 GHz Pentium 4
- RAM: 2 GB (limited to 1.46 GB by JVM)
- OS: Windows XP
42Choosing the Implementation for the Interconnection Index
- We have experimented with the two implementations of the interconnection index
- 1. IIH: the index is a hash table
- 2. IIM: the index is a symmetric matrix
- We compare the two implementations
- Cost of building the index
- Cost of query processing, i.e., using the index
43Time For Building Indices
XML corpus   Size (KB)   Number of nodes   IIH time (ms)   IIM time (ms)
Dream        146         3,360             36              29
Hamlet       281         6,635             185             114
Sigmod       704         21,246            1,729           1,552
Mondial      1,198       49,422            7,837           6,231
- Both implementations are reasonable
- IIM is better than IIH, because of the additional
overhead of hashing
44On the Fly Indexing (OFI)
- Fully building the indices as a preprocessing step before querying is expensive in memory for huge corpuses!
- Also expensive in time because of the additional overhead of using virtual memory
- Instead, compute the interconnection index incrementally on-the-fly during query processing for each pair that must be checked
- By how much will query processing be slowed down?
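The on-the-fly idea can be sketched as a cache that is filled only for the pairs a query actually asks about. This is a simplified illustration: it memoizes a direct path-based check for each requested pair and assumes both nodes belong to the same document. OnTheFlyIndex, check_interconnected and path_between are hypothetical names.

```python
from collections import Counter


class Node:
    def __init__(self, label, *children):
        self.label = label
        self.parent = None
        for child in children:
            child.parent = self


def path_between(a, b):
    """Nodes on the tree path from a to b, i.e. the relationship tree of the pair."""
    up_from_a, node = [], a
    while node is not None:
        up_from_a.append(node)
        node = node.parent
    on_a_side = {id(n) for n in up_from_a}
    up_from_b, node = [], b
    while id(node) not in on_a_side:
        up_from_b.append(node)
        node = node.parent
    cut = next(i for i, n in enumerate(up_from_a) if n is node)  # node is the LCA
    return up_from_a[:cut + 1] + up_from_b


def check_interconnected(a, b):
    counts = Counter(n.label for n in path_between(a, b))
    for label, count in counts.items():
        # A repeated label is allowed only when it is carried by exactly a and b.
        if count > 1 and not (count == 2 and a is not b
                              and a.label == label and b.label == label):
            return False
    return True


class OnTheFlyIndex:
    """Computes and caches interconnection only for pairs that are requested."""

    def __init__(self):
        self._cache = {}

    def interconnected(self, a, b):
        key = frozenset((id(a), id(b)))
        if key not in self._cache:
            self._cache[key] = check_interconnected(a, b)
        return self._cache[key]

    def pairs_checked(self):
        return len(self._cache)


vardi, title1 = Node("author"), Node("title")
vianu, abiteboul, title2 = Node("author"), Node("author"), Node("title")
root = Node("proceedings",
            Node("inproceedings", title1, vardi),
            Node("inproceedings", title2, vianu, abiteboul))
index = OnTheFlyIndex()
print(index.interconnected(vianu, title2), index.interconnected(title2, vianu))
print(index.pairs_checked())
```

Memory grows with the number of distinct pairs requested rather than with the square of the document size, at the price of a slower first lookup for each pair.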
45Time For Building Indices: Comparing IIH, IIM, OFI
XML corpus   Size (KB)   Number of nodes   IIM time (ms)   IIH time (ms)   OFI time (ms)
Dream        146         3,360             29              36              0.6
Hamlet       281         6,635             114             185             1.1
Sigmod       704         21,246            1,552           1,729           2.2
Mondial      1,198       49,422            6,231           7,837           10.0
For these corpuses, OFI time is at most 10 ms.
This is actually the time to build all the indices
other than the interconnection index.
46Query Execution Time
- We generated 1000 random queries for the Sigmod Record corpus
- Each query had
- At most 3 optional search terms
- At most 3 required search terms
- We checked time with IIH, IIM and OFI
47IIH/IIM Query Processing Time
- Note: logarithmic scale
- Both approaches lead to similar results
- Average run time for queries: 35 ms
48OFI Query Processing Time
- After processing the 1000 queries, 0.75% of all pairs of nodes were checked for interconnection
- Space saved in main memory
Slowdown in response time not too large! Locality property: queries tend to be similar in the parts of the document that they may access
- More than 50% of the queries processed in under 10 ms
49How Good are the Results?
- We measured recall and precision for the query
- Find papers written by Buneman that contain the keyword database in the title
- We tried two different queries that reflect different amounts of user knowledge
- Kw: Buneman database (classical search engine query)
- Tag-kw: author:Buneman title:database
- Corpus: Sigmod, DBLP
50Precision and Recall
- We computed the "correct answers" using XQuery
- Recall
- => Perfect recall, i.e., XSEarch returns all the correct answers
- Precision at n
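As a reminder of how these measures are usually computed, here is a minimal sketch. It uses the common definition of precision at n (correct results among the top n, divided by n). The result identifiers are made-up illustration data, not measurements from the experiments.

```python
def precision_at(n, ranked, correct):
    """Fraction of the top-n ranked results that are correct answers."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for result in ranked[:n] if result in correct)
    return hits / n


def recall(ranked, correct):
    """Fraction of the correct answers that appear anywhere in the result list."""
    if not correct:
        return 1.0
    return len(set(ranked) & set(correct)) / len(correct)


# Hypothetical ranked output and hypothetical set of correct answers.
ranked = ["p1", "p7", "p2", "p9"]
correct = {"p1", "p2", "p3"}

print(precision_at(2, ranked, correct))
print(precision_at(4, ranked, correct))
print(recall(ranked, correct))
```

Precision at n rewards putting correct answers early in the ranking, which is why it is reported separately for several cut-offs.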
51Precision at 5, 10 and 20
Sigmod: perfect precision. DBLP: 0.8/0.9 for the query containing only keywords.
Combining tags and keywords leads to perfect precision.
52Conclusions
- Paradigm for querying XML combining IR and database techniques
- Returns semantically related fragments, ranked by estimated relevance
- Combining tags and keywords in the query leads to good results
53Conclusions
- Efficient index structures
- IIM/IIH for small documents
- OFI for big documents
- Efficient evaluation algorithms
- Dynamic programming algorithm for computing interconnection
- Extensible implementation
- The system can easily accommodate different types
of semantic relations between nodes, other than
interconnection