Title: Adaptive XML Search

1
Adaptive XML Search

Dr Wilfred Ng
Department of Computer Science
The Hong Kong University of Science and Technology

2
Outline

Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work

3
Motivation
4
Why do we need an XML search engine?

Different nature of HTML and XML data
HTML data
Hyperlink-intensive
Declarative languages
Tags have no semantic meaning
XML data
Self-describing tags
Extra structural information
XML search engines retrieve more accurate fragments

5
Why do we need an XML search engine?

Web searching
Document paradigm
Matching keywords vs. documents
Return links to the whole document (web page)
XML searching
Query keywords may be tags or data values
Structure of XML documents is diverse, e.g. DBLP and Shakespeare
Do not return the whole document, which may be 100 MB or larger
Return fragments

6
DBLP

<dblp>
  <incollection mdate="2002-01-03" key="books/acm/kim95/AnnevelinkACFHK95">
    <author>Jurgen Annevelink</author>
    <title>Object SQL - A Language for the Design and Implementation of Object Databases.</title>
    <pages>42-68</pages>
    <year>1995</year>
    <booktitle>Modern Database Systems</booktitle>
    <url>db/books/collections/kim95.html</url>
  </incollection>
  ...

7
Shakespeare

<SPEECH>
  <SPEAKER>OCTAVIUS CAESAR</SPEAKER>
  <LINE>No, my most wronged sister; Cleopatra</LINE>
  <LINE>Hath nodded him to her. He hath given his empire</LINE>
  <LINE>Up to a whore; who now are levying</LINE>
  <LINE>The kings o' the earth for war; he hath assembled</LINE>
  <LINE>Bocchus, the king of Libya; Archelaus,</LINE>
  <LINE>Of Cappadocia; Philadelphos, king</LINE>
  <LINE>Of Paphlagonia; the Thracian king, Adallas;</LINE>
  <LINE>King Malchus of Arabia; King of Pont;</LINE>
  <LINE>Herod of Jewry; Mithridates, king</LINE>
  <LINE>Of Comagene; Polemon and Amyntas,</LINE>
  <LINE>The kings of Mede and Lycaonia,</LINE>
  <LINE>With a more larger list of sceptres.</LINE>
</SPEECH>

8
Research Ideas

In the Information Retrieval community, many ranking techniques have been developed
Weighted keywords
Vector space
Searching and ranking XML as plain text using IR techniques is possible, but
Too simple
Does not take advantage of XML data
Better accuracy can be achieved by using features of XML data
Structures
Tag semantics

9
Outline

Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work

10
Key-Tag Search
11
Key-Tag Query vs. XQuery

Keywords in Web search engines vs. SQL
The goals of key-tag query and XQuery are different
Key-Tag Query
Simple
Easy to understand
Flexible

XQuery: for $x in doc("some.xml") where $x/author[. ftcontains "Mary"] return $x/title
Too complicated for ordinary users!
Key-Tag Query: <author>Mary</author>
Will users input such complex XQuery in search engines?
12
Key-Tag Search Query
For example, <author>Mary</author>
Tag: author
Key: Mary
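A key-tag query is itself a small piece of well-formed XML, so it can be split into its tag and key with a standard parser. The sketch below is only an illustration: the helper names, the sample document, and the case-insensitive substring match are assumptions, not the matching rule used by the system in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, in the spirit of the DBLP slide.
DOC = """<dblp>
  <article><author>Mary Smith</author><title>XML Ranking</title></article>
  <article><author>John Doe</author><title>Mary's Index</title></article>
</dblp>"""


def parse_key_tag(query):
    """Split a key-tag query such as <author>Mary</author> into (tag, key)."""
    elem = ET.fromstring(query)
    return elem.tag, (elem.text or "").strip()


def find_matches(root, tag, key):
    """Return elements named `tag` whose text contains `key` (assumed rule)."""
    return [e for e in root.iter(tag) if key.lower() in (e.text or "").lower()]


tag, key = parse_key_tag("<author>Mary</author>")
matches = find_matches(ET.fromstring(DOC), tag, key)
for m in matches:
    print(m.tag, m.text)
```

Only the first article matches here: the second mentions Mary in its title, not in an author element, which is the distinction a plain keyword search cannot make.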

13
Key-Tag Query Semantics

A fragment is considered a result candidate if at least one key-tag is found in it.
If F1 and F2 both contain the same instance of a key-tag and F1 is a subtree of F2, F1 is chosen as the only answer.
For example, a query <b>b</b>

14
Outline

Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work

15
Multi-Ranker Model
16
Introduction to MRM

Handle diversified XML documents and user preferences

17
Multi-Ranker Model
[Figure: the three levels of the Multi-Ranker Model]
Adaptive Ranking Level (AR): User Profiles, RSSF
Weights: w11, w12, w13, w14
Standard Ranking Level (XR)
Feature Ranking Level: Keyword, Access, Path, Element, Order, Category, Sibling, Children, Distance, Distance-, Tag, Attribute
18
Adaptive Ranking Level (AR)

AR maintains a feature vector, ?, which adapts to the four XRs; the vector is weighted and trained by RSSF
? = (?STR, ?DAT, ?DFT, ?CUS, ?STR, ?DAT, ?DFT, ?CUS)
The adaptive ranking of fragments is calculated by W · ?, where W is generated by RSSF, which is introduced later.
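The adaptive score can be read as a weighted sum of the ranker-level values. The sketch below assumes an eight-component vector, matching the eight entries listed above, and takes the score as the inner product with W; all numbers and names are hypothetical, and the meaning of each component is not recoverable from the transcript.

```python
def adaptive_score(weights, features):
    """Inner product of the weight vector W and a fragment's feature vector."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical weights, standing in for the W produced by RSSF.
W = [0.4, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2, 0.1]

# Hypothetical feature vectors for two fragments.
fragments = {
    "A": [1, 0, 1, 0, 1, 0, 1, 0],
    "B": [0, 1, 0, 1, 0, 1, 0, 1],
}

# Rank fragments by descending adaptive score.
ranking = sorted(fragments, key=lambda name: adaptive_score(W, fragments[name]), reverse=True)
print(ranking)
```

Because the ranking depends only on W, retraining W changes the order of results without touching the rankers underneath.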

19
Standard Ranking Level (XR)

Four XRs
Structure ranker (STR): focuses on ranking XML fragments based on their structure
Data ranker (DAT): ignores the structure and ranks the XML fragments by their textual data
System default ranker (DFT): a balance of the structure and data rankers
Customized ranker (CUS): the system administrator selects low-level features for tuning; in our experiment, the low-level features are picked randomly

20
Feature Ranking Level
For example, Q = <author>Mary</author>, <title>XML</title>

Similarity Features
Keyword
Access
Path
Element
Order
Category

Order in Q: author > title
Sibling order in F: author > title, author > year, title > year, first > last
Ancestor order similarity = 0
Sibling order similarity = 1/4
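One way to arrive at the 1/4 above is to collect every ordered pair of sibling tags in the fragment and count how many of them the query order matches. The fragment below is a hypothetical reconstruction that fits the sibling orders listed on the slide, and the ratio is an assumed definition rather than the formula from the talk.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical fragment F consistent with the sibling orders on the slide.
FRAGMENT = (
    "<dblp><article>"
    "<author><first>Mary</first><last>Lee</last></author>"
    "<title>XML</title><year>2007</year>"
    "</article></dblp>"
)


def sibling_orders(root):
    """Collect (earlier, later) tag pairs for children of the same parent."""
    pairs = set()
    for parent in root.iter():
        tags = [child.tag for child in parent]
        pairs.update(combinations(tags, 2))
    return pairs


def sibling_order_similarity(query_orders, fragment_orders):
    """Fraction of the fragment's sibling orders that also occur in the query."""
    if not fragment_orders:
        return 0.0
    return len(query_orders & fragment_orders) / len(fragment_orders)


orders = sibling_orders(ET.fromstring(FRAGMENT))
similarity = sibling_order_similarity({("author", "title")}, orders)
print(sorted(orders), similarity)
```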
21
Feature Ranking Level
For example, Q = <author>Mary</author>, <title>XML</title>

Granularity Features
Sibling
Children
Distance
Distance-
Tag
Attribute
Involves statistical data in the database

Number of fragments whose roots are dblp
Number of tags whose parent is dblp
The length of the path from the root to the farthest leaf: dblp/article/author/first, length = 4
The length of the path from the root to the nearest leaf: dblp/article/title, length = 3
Number of tags in F = 7
Number of attributes in F = 0
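The fragment-local counts above can be computed with a single tree walk. The sketch below uses a hypothetical fragment chosen to reproduce the slide's numbers (7 tags, 0 attributes, leaf paths of length 4 and 3); the function and key names are illustrative, and the database-wide statistics (the first two items) are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment F chosen to match the numbers on the slide.
FRAGMENT = (
    "<dblp><article>"
    "<author><first>Mary</first><last>Lee</last></author>"
    "<title>XML</title><year>2007</year>"
    "</article></dblp>"
)


def leaf_depths(elem, depth=1):
    """Return the path length (counted in tags) from the root to every leaf."""
    children = list(elem)
    if not children:
        return [depth]
    depths = []
    for child in children:
        depths.extend(leaf_depths(child, depth + 1))
    return depths


def granularity_features(root):
    """Count tags and attributes and measure the farthest and nearest leaf."""
    elements = list(root.iter())
    depths = leaf_depths(root)
    return {
        "tags": len(elements),
        "attributes": sum(len(e.attrib) for e in elements),
        "farthest_leaf": max(depths),
        "nearest_leaf": min(depths),
    }


features = granularity_features(ET.fromstring(FRAGMENT))
print(features)
```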
22
Highlights of MRM

Highly Flexible
Adding or removing features or XRs is straightforward
Only requires updating the feature vector, ?
Ranking Level Independence
Analogous to data independence in relational model

23
Outline

Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work

24
Features of RSSF

Input: a set of labeled fragments
Output: a trained ranker
Naïve Bayes is a successful algorithm for learning to classify text documents
Requires a small amount of training data, with both positive and negative samples
In our setting, we only have labeled and unlabeled data, so we extend Naïve Bayes with a spying technique to obtain the negative training samples
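The spying idea can be sketched as follows: hide a few labeled positives ("spies") among the unlabeled data, train Naïve Bayes with the unlabeled set treated as negative, and use the spies' posteriors as a cut-off for picking likely negatives. This is a minimal sketch of the general spy technique under simple assumptions (bag-of-words documents, multinomial Naïve Bayes with Laplace smoothing, minimum spy posterior as threshold); the data and function names are hypothetical and it is not the RSSF implementation.

```python
import math
from collections import Counter


def train_nb(pos_docs, neg_docs):
    """Train a two-class multinomial Naive Bayes; return P(pos | doc)."""
    vocab = {w for d in pos_docs + neg_docs for w in d}
    total_docs = len(pos_docs) + len(neg_docs)
    model = {}
    for label, docs in (("pos", pos_docs), ("neg", neg_docs)):
        counts = Counter(w for d in docs for w in d)
        denom = sum(counts.values()) + len(vocab)  # Laplace smoothing
        log_prior = math.log(len(docs) / total_docs)
        log_like = {w: math.log((counts[w] + 1) / denom) for w in vocab}
        model[label] = (log_prior, log_like)

    def posterior_pos(doc):
        scores = {}
        for label, (log_prior, log_like) in model.items():
            scores[label] = log_prior + sum(log_like[w] for w in doc if w in log_like)
        top = max(scores.values())
        exp = {k: math.exp(v - top) for k, v in scores.items()}
        return exp["pos"] / (exp["pos"] + exp["neg"])

    return posterior_pos


def spy_negatives(positives, unlabeled, spy_indices):
    """Return indices of unlabeled docs that score below every spy."""
    spies = [positives[i] for i in spy_indices]
    rest = [d for i, d in enumerate(positives) if i not in spy_indices]
    posterior_pos = train_nb(rest, unlabeled + spies)
    threshold = min(posterior_pos(s) for s in spies)
    return [i for i, d in enumerate(unlabeled) if posterior_pos(d) < threshold]


# Hypothetical toy data: each document is a list of tokens.
P = [["xml", "search"], ["xml", "rank"], ["xml", "tag"], ["xml", "query"]]
U = [["xml", "search"], ["cook", "recipe"], ["cook", "food"]]
negatives = spy_negatives(P, U, spy_indices=[0])
print(negatives)
```

The unlabeled document that looks like the spy is left unclassified, while the two unrelated ones fall below the spy's posterior and become estimated negatives.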

25
The RSSF
26
Ranking SVM Techniques

Find a vector that makes the inequality hold: F1 < F2 < F3
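The search for such a vector can be illustrated with pairwise preferences: for every pair where one fragment should precede the other, the weight vector must score the difference of their feature vectors positively. The sketch below swaps in a simple perceptron-style update with a margin for the actual SVM optimisation, and assumes F1 is ranked ahead of F2, which is ranked ahead of F3; the feature values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_pairwise_ranker(preferences, dim, epochs=200, lr=0.1, margin=1.0):
    """Learn w so that dot(w, better - worse) >= margin for each preference.

    Perceptron-style stand-in for Ranking SVM, not the SVM solver itself.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        updated = False
        for better, worse in preferences:
            diff = [b - c for b, c in zip(better, worse)]
            if dot(w, diff) < margin:
                w = [wi + lr * di for wi, di in zip(w, diff)]
                updated = True
        if not updated:
            break
    return w


# Hypothetical feature vectors for three fragments.
F1, F2, F3 = [0.9, 0.2], [0.5, 0.4], [0.1, 0.3]
preferences = [(F1, F2), (F2, F3), (F1, F3)]

w = train_pairwise_ranker(preferences, dim=2)
scores = [dot(w, f) for f in (F1, F2, F3)]
print(w, scores)
```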

27
Voting Spy Naïve Bayes
28
Voting Spy Naïve Bayes
[Figure labels: P1, P2, P3; Training Naïve Bayes; Training Completed; Estimated Negative. Legend: Positive, Unclassified, Negative]
29
Voting Spy Naïve Bayes
The Final Estimated Negative is
[Figure labels: F11, F11, F11, F12, F12, F13, F14. Legend: Positive, Unclassified, Negative]
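The voting step can be sketched as a simple tally: each training round proposes a set of estimated negatives, and a fragment is kept as a final negative only if enough rounds agree. The rounds below are hypothetical, arranged so that F11 receives three votes, F12 two, and F13 and F14 one each; reading the repeated labels in the figure as votes is an assumption.

```python
from collections import Counter


def vote_negatives(rounds, threshold):
    """Keep fragments proposed as negative in at least `threshold` rounds."""
    votes = Counter(frag for proposed in rounds for frag in proposed)
    return sorted(frag for frag, count in votes.items() if count >= threshold)


# Hypothetical per-round estimated negatives.
rounds = [
    {"F11", "F12", "F13"},
    {"F11", "F12", "F14"},
    {"F11"},
]

final_negatives = vote_negatives(rounds, threshold=2)
print(final_negatives)
```

Raising the threshold makes the negative set smaller and more reliable; lowering it admits fragments that only one round flagged.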
30
Outline

Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work

31
Effect of Varying Voting Threshold
[Figure: X = voting threshold; Y = relative average rank of labeled fragments = new average rank / original average rank]
32
Effectiveness of Low-Level Features on XR

In this experiment, we remove individual low-level features from the STR and DAT rankers and measure the new precision
The queries we use can be found in the appendix of the proposal

33
Processing Time
34
Comparison with TopX

TopX is a search engine for XML data that is available online
State-of-the-art XML search engine
We measure the MAP and precision_at_k
MAP: mean average precision. Compute the average precision over 100 recall points for each query, then take the average over all queries.
precision_at_k: top-k precision. Number of relevant results in the top k / k
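Both measures are short to compute from a ranked result list and a set of relevant items. The sketch below assumes interpolated precision at 100 evenly spaced recall levels, which is one common reading of "100 recall points"; the function names and the toy data are hypothetical, and the evaluation in the talk may differ in detail.

```python
def precision_at_k(ranked, relevant, k):
    """Number of relevant results in the top k, divided by k."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked, relevant, points=100):
    """Mean interpolated precision over `points` recall levels (assumed)."""
    hits = 0
    curve = []  # (relevant items found so far, precision at this rank)
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        curve.append((hits, hits / rank))
    total = 0.0
    for level in range(1, points + 1):
        # Recall >= level / points, compared with integers to avoid rounding.
        reachable = [p for h, p in curve if h * points >= level * len(relevant)]
        total += max(reachable, default=0.0)
    return total / points


def mean_average_precision(runs):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical results for two queries: (ranked list, relevant set).
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"x"}),
]
print(precision_at_k(*runs[0], k=2), mean_average_precision(runs))
```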
35
Outline

Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work

36
Further remarks

Searching and ranking XML data are important, since existing Web search engines cannot handle them well
We present an effective approach to adaptive XML searching and ranking that extends traditional IR techniques by considering different features of XML data

37
Ongoing Work INEX 2007

The Initiative for the Evaluation of XML Retrieval (INEX)
A community that aims to provide large test collections and scoring methods for researchers to evaluate their retrieval systems
It has been getting attention recently
We participated in INEX in 2006 and 2007
The INEX 2007 collection is a Wikipedia XML corpus with a set of 659388 XML documents
We are running experiments using their data and queries

38
Ongoing Work INEX 2007
39
Ongoing Work Merging

Displaying a list of fragments one by one to the user may not be adequate in an XML setting.
Fragments may be scattered across the list
Duplicated fragments in different structures
Refine a search query to obtain more and better results
Idea: make use of the schema information (DTD), treat the fragments as entities, and merge them in a concise way

40
My Publications

Ho-Lam LAU and Wilfred NG. A Multi-Ranker Model for Adaptive XML Searching. Accepted and to appear in VLDB Journal, (2007).
Ho-Lam LAU and Wilfred NG. Towards an Adaptive Information Merging Using Selected XML Fragments. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 1013-1019, (2007).
James CHENG and Wilfred NG. A Development of Hash-Lookup Trees to Support Querying Streaming XML. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 768-780, (2007).
Wilfred NG and James CHENG. An Efficient Index Lattice for XML Query Evaluation. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 753-767, (2007).
Wilfred NG and Ho-Lam LAU. A Co-Training Framework for Searching XML Documents. Information Systems, 32(3), pp. 477-503, (2007).
Yin YANG, Wilfred NG, Ho-Lam LAU and James CHENG. An Efficient Approach to Support Querying Secure Outsourced XML Information. Conference on Advanced Information Systems Engineering, CAiSE 2006, Lecture Notes in Computer Science Vol. 4007, Luxembourg, pp. 157-171, (2006).
Wilfred NG and Ho-Lam LAU. Effective Approaches for Watermarking XML Data. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 68-80, (2005).
Ho-Lam LAU and Wilfred NG. A Unifying Framework for Merging and Evaluating XML Information. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 81-94, (2005).