Title: Adaptive XML Search
1Adaptive XML Search
- Dr Wilfred Ng
- Department of Computer Science
- The Hong Kong University of Science and
Technology
2Outline
- Motivation
- Key-Tag Search
- Multi-Ranker Model
- Ranking Support Vector Machine in voting SpyNB Framework (RSSF)
- Experiments
- Conclusions and Ongoing Work
3Motivation
4Why do we need an XML Search Engine?
- Different nature of HTML and XML data
- HTML data
- Hyperlink-intensive
- Declarative languages
- Tags have no semantic meaning
- XML data
- Self-describing tags
- Extra structural information
- XML search engines retrieve more accurate
fragments
5Why do we need an XML Search Engine?
- Web searching
- Document paradigm
- Matching keywords vs. documents
- Returns links to the whole document (web page)
- XML searching
- Query keywords may be tags or data values
- The structure of XML documents is diverse, e.g. DBLP and Shakespeare
- Does not return the whole document (100 MB or larger)
- Returns fragments
6DBLP
<dblp>
  <incollection mdate="2002-01-03" key="books/acm/kim95/AnnevelinkACFHK95">
    <author>Jurgen Annevelink</author>
    <title>Object SQL - A Language for the Design and Implementation of Object Databases.</title>
    <pages>42-68</pages>
    <year>1995</year>
    <booktitle>Modern Database Systems</booktitle>
    <url>db/books/collections/kim95.html</url>
  </incollection>
  ...
7Shakespeare
<SPEECH>
  <SPEAKER>OCTAVIUS CAESAR</SPEAKER>
  <LINE>No, my most wronged sister; Cleopatra</LINE>
  <LINE>Hath nodded him to her. He hath given his empire</LINE>
  <LINE>Up to a whore; who now are levying</LINE>
  <LINE>The kings o' the earth for war; he hath assembled</LINE>
  <LINE>Bocchus, the king of Libya; Archelaus,</LINE>
  <LINE>Of Cappadocia; Philadelphos, king</LINE>
  <LINE>Of Paphlagonia; the Thracian king, Adallas;</LINE>
  <LINE>King Malchus of Arabia; King of Pont;</LINE>
  <LINE>Herod of Jewry; Mithridates, king</LINE>
  <LINE>Of Comagene; Polemon and Amyntas,</LINE>
  <LINE>The kings of Mede and Lycaonia,</LINE>
  <LINE>With a more larger list of sceptres.</LINE>
</SPEECH>
8Research Ideas
- In the Information Retrieval community, many ranking techniques have been developed
- Weighted keywords
- Vector space
- Searching and ranking XML as plain text using IR techniques is possible, but
- Too simple
- Does not exploit the advantages of XML data
- Better accuracy can be achieved using features of XML data
- Structures
- Tag semantics
10Key-Tag Search
11Key-Tag Query vs. XQuery
- Keywords in Web search engines vs. SQL
- The goals of key-tag queries and XQuery are different
- Key-Tag Query
- Simple
- Easy to understand
- Flexible
XQuery: for $x in doc("some.xml") where $x/author[. ftcontains "Mary"] return $x/title
Too complicated for ordinary users!!
Key-Tag Query: <author>Mary</author>
Will users input such complex XQuery in search engines?
12Key-Tag Search Query
- For example, <author>Mary</author>
- Tag: author
- Key: Mary
13Key-Tag Query Semantics
- A fragment is considered a result candidate if at least one key-tag is found in it.
- If F1 and F2 both contain the same instance of a key-tag and F1 is a subtree of F2, F1 is chosen as the only answer.
- For example, a query <b>b</b>
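The two rules above can be sketched in Python. This is only a sketch under stated assumptions, not the system's implementation: it assumes candidate fragments are the subtrees rooted at elements whose tag is in a given set (`fragment_tags`, a hypothetical parameter), and that a key-tag instance is an element whose tag matches and whose text contains the key. The function name and the sample document are also made up for illustration.

```python
import xml.etree.ElementTree as ET

def smallest_fragments(root, tag, key, fragment_tags):
    """Return, for each key-tag instance, the innermost candidate fragment."""
    answers = []

    def visit(node, enclosing):
        # Track the innermost enclosing element that counts as a fragment root.
        if node.tag in fragment_tags:
            enclosing = node
        is_instance = node.tag == tag and key in (node.text or "")
        if is_instance and enclosing is not None and enclosing not in answers:
            answers.append(enclosing)
        for child in node:
            visit(child, enclosing)

    visit(root, None)
    return answers

# Hypothetical document: both <dblp> and the first <article> contain the
# instance <author>Mary</author>, and the article is a subtree of dblp.
doc = ET.fromstring(
    "<dblp>"
    "<article><author>Mary</author><title>XML</title></article>"
    "<article><author>John</author><title>SQL</title></article>"
    "</dblp>"
)
hits = smallest_fragments(doc, "author", "Mary", {"dblp", "article"})
print([ET.tostring(h, encoding="unicode") for h in hits])
```

Only the first article is returned: the enclosing dblp fragment contains the same key-tag instance but is the larger of the two, so the subtree wins.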
15Multi-Ranker Model
16Introduction to MRM
- Handles diverse XML documents and user preferences
17Multi-Ranker Model
Figure: the three levels of the MRM
- Adaptive Ranking Level (AR): User Profiles, RSSF, weights w11, w12, w13, w14
- Standard Ranking Level (XR)
- Feature Ranking Level: Keyword, Access, Path, Element, Order, Category, Sibling, Children, Distance, Distance-, Tag, Attribute
18Adaptive Ranking Level (AR)
- AR maintains a feature vector, ?, which adapts to the four XRs; the vector is weighted and trained by RSSF
- ? = (?STR, ?DAT, ?DFT, ?CUS, ?STR, ?DAT, ?DFT, ?CUS)
- The adaptive ranking of fragments is calculated by
- W ?,
- where W is generated by RSSF, which is introduced later.
19Standard Ranking Level (XR)
- Four XRs
- Structure ranker (STR): focuses on ranking XML fragments based on their structure
- Data ranker (DAT): ignores the structure and ranks XML fragments by their textual data
- System default ranker (DFT): a balance of the structure and data rankers
- Customized ranker (CUS): the system administrator selects low-level features for tuning; in our experiment, the low-level features are randomly picked
20Feature Ranking Level
For example, Q = <author>Mary</author>, <title>XML</title>
- Similarity Features
- Keyword
- Access
- Path
- Element
- Order
- Category
Order in Q: author > title
Sibling order in F: author > title, author > year, title > year, first > last
Ancestor order similarity: 0
Sibling order similarity: 1/4
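One reading of the sibling order similarity of 1/4 is that one of the four sibling order pairs in F also appears in Q. The Python sketch below reproduces that arithmetic under this assumption; the function name `order_similarity` and the pair representation are hypothetical, and the actual feature definition may differ.

```python
from fractions import Fraction

def order_similarity(query_pairs, fragment_pairs):
    """Fraction of the fragment's order pairs that also appear in the query."""
    if not fragment_pairs:
        return Fraction(0)
    matched = sum(1 for pair in fragment_pairs if pair in query_pairs)
    return Fraction(matched, len(fragment_pairs))

# Order pairs taken from the slide: (a, b) means a comes before b.
query_order = {("author", "title")}
sibling_order = [
    ("author", "title"),
    ("author", "year"),
    ("title", "year"),
    ("first", "last"),
]
sibling_similarity = order_similarity(query_order, sibling_order)
print(sibling_similarity)  # 1/4
```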
21Feature Ranking Level
For example, Q = <author>Mary</author>, <title>XML</title>
- Granularity Features
- Sibling
- Children
- Distance
- Distance-
- Tag
- Attribute
- Involves statistical data in the database
Number of fragments whose roots are dblp
Number of tags whose parent is dblp
Length of the path from the root to the farthest leaf: dblp/article/author/first, length 4
Length of the path from the root to the nearest leaf: dblp/article/title, length 3
Number of tags in F: 7
Number of attributes in F: 0
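Some of these counts can be reproduced on a small fragment. The sketch below assumes F has the shape suggested by the paths and sibling orders on the slides (dblp/article with author/first, author/last, title and year); the text values and the helper name `leaf_path_lengths` are made up for illustration. Path length is counted in nodes, so dblp/article/title has length 3.

```python
import xml.etree.ElementTree as ET

def leaf_path_lengths(node, depth=1):
    """Return the length, in nodes, of every root-to-leaf path below node."""
    children = list(node)
    if not children:
        return [depth]
    lengths = []
    for child in children:
        lengths.extend(leaf_path_lengths(child, depth + 1))
    return lengths

# Hypothetical fragment F; only its shape follows the slide.
fragment = ET.fromstring(
    "<dblp><article>"
    "<author><first>Mary</first><last>Lee</last></author>"
    "<title>XML</title><year>2007</year>"
    "</article></dblp>"
)

elements = list(fragment.iter())
tag_count = len(elements)                                # number of tags in F
attribute_count = sum(len(e.attrib) for e in elements)   # number of attributes in F
lengths = leaf_path_lengths(fragment)
farthest, nearest = max(lengths), min(lengths)
print(tag_count, attribute_count, farthest, nearest)  # 7 0 4 3
```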
22Highlights of MRM
- Highly Flexible
- Adding or removing features or XRs is straightforward
- Only requires updating the feature vector, ?
- Ranking Level Independence
- Analogous to data independence in the relational model
24Features of RSSF
- Input: a set of labeled fragments
- Output: a trained ranker
- Naïve Bayes is a successful algorithm for learning to classify text documents
- Requires a small amount of training data, both positive and negative samples
- In our setting, we only have labeled and unlabeled data, so we extend Naïve Bayes with the spying technique to obtain the negative training samples
25The RSSF
26Ranking SVM Techniques
- Find a vector that makes the inequality F1 < F2 < F3 hold
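To illustrate what finding such a vector means, the sketch below uses a simple pairwise perceptron as a stand-in. It is not a ranking SVM (there is no margin maximization), and it rests on assumptions: "F1 < F2" is read as "F1 is ranked ahead of F2", a fragment's score is the dot product of the vector with its features, and the feature values are invented.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def learn_ranking_vector(ordered_features, epochs=100):
    """ordered_features lists feature vectors from most to least preferred."""
    w = [0.0] * len(ordered_features[0])
    pairs = list(zip(ordered_features, ordered_features[1:]))
    for _ in range(epochs):
        updated = False
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            # The pair is violated if 'better' does not score strictly higher.
            if dot(w, diff) <= 0:
                w = [wi + di for wi, di in zip(w, diff)]
                updated = True
        if not updated:
            break  # every pairwise constraint holds
    return w

# Hypothetical feature vectors for F1, F2, F3 (most preferred first).
features = [(3.0, 1.0), (2.0, 2.0), (1.0, 1.0)]
w = learn_ranking_vector(features)
scores = [dot(w, f) for f in features]
print(w, scores)
```

A ranking SVM solves the same pairwise constraints but additionally maximizes the margin between the scores of each preferred pair.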
27Voting Spy Naïve Bayes
28Voting Spy Naïve Bayes
Figure labels: P1, P2, P3; Training Naïve Bayes; Training Completed; Estimated Negative. Legend: Positive, Unclassified, Negative.
29Voting Spy Naïve Bayes
The Final Estimated Negative is
Figure labels: F11, F11, F11, F12, F12, F13, F14. Legend: Positive, Unclassified, Negative.
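The voting step can be sketched as a tally across several estimated negative sets. Everything specific here is an assumption: the grouping of fragments into three runs, the threshold of 2, and the function name `vote_negatives` are illustrative only. The fragment labels are borrowed from the figure, where F11 appears three times, F12 twice, and F13 and F14 once each.

```python
from collections import Counter

def vote_negatives(estimated_negative_sets, threshold):
    """Keep fragments estimated as negative in at least `threshold` runs."""
    votes = Counter()
    for negatives in estimated_negative_sets:
        votes.update(set(negatives))  # one vote per fragment per run
    return sorted(f for f, n in votes.items() if n >= threshold)

# Hypothetical outcome of three spy Naive Bayes runs.
runs = [
    ["F11", "F12", "F13"],
    ["F11", "F12"],
    ["F11", "F14"],
]
final_negatives = vote_negatives(runs, threshold=2)
print(final_negatives)  # ['F11', 'F12']
```

Raising the threshold makes the final negative set smaller and more conservative; lowering it admits fragments that only a single run flagged.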
31Effect of Varying Voting Threshold
X: voting threshold
Y: relative average rank of labeled fragments = new average rank / original average rank
32Effectiveness of Low-Level Features on XR
- In this experiment, we remove individual low-level features from the STR and DAT rankers and measure the new precision
- The queries we use can be found in the appendix of the proposal
33Processing Time
34Comparison with TopX
- TopX is a search engine for XML data that is available online
- State-of-the-art XML search engine
- We measure the MAP and precision_at_k
- MAP: mean average precision
- precision_at_k: top-k precision
MAP: average precision over 100 recall points for each query, then averaged over all queries.
precision_at_k: (number of relevant results in the top k) / k
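Both measures can be sketched in a few lines of Python. The slide only says that average precision is taken over 100 recall points; the sketch assumes interpolated precision (the best precision at any rank whose recall reaches the given level), which is a common convention but not confirmed by the source. Function names and the relevance lists are hypothetical.

```python
def precision_at_k(relevance, k):
    """relevance is a ranked list of 0/1 judgments; returns top-k precision."""
    return sum(relevance[:k]) / k

def average_precision(relevance, total_relevant, points=100):
    """Mean interpolated precision over evenly spaced recall levels."""
    hits, curve = 0, []
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        curve.append((hits / total_relevant, hits / rank))  # (recall, precision)
    total = 0.0
    for i in range(1, points + 1):
        level = i / points
        reachable = [p for r, p in curve if r >= level]
        total += max(reachable, default=0.0)  # 0 if the level is never reached
    return total / points

def mean_average_precision(runs):
    """runs is a list of (relevance list, number of relevant results) per query."""
    return sum(average_precision(rel, n) for rel, n in runs) / len(runs)

# Two hypothetical queries: a perfect ranking, and one relevant result at rank 2.
queries = [([1, 1], 2), ([0, 1], 1)]
map_score = mean_average_precision(queries)
print(precision_at_k([1, 0, 1, 0], 2), map_score)
```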
36Further remarks
- Searching and ranking XML data are important, since existing Web search engines cannot handle them well
- We present an effective approach to adaptive XML searching and ranking that extends traditional IR techniques by considering different features of XML data
37Ongoing Work INEX 2007
- The Initiative for the Evaluation of XML Retrieval (INEX)
- A community which aims to provide large test data and scoring methods for researchers to evaluate their retrieval systems
- It has been getting attention recently
- We participated in INEX in 2006 and 2007
- The INEX 2007 collection is a Wikipedia XML corpus with a set of 659,388 XML documents
- We are running experiments using their data and queries
38Ongoing Work INEX 2007
39Ongoing Work Merging
- Displaying a list of fragments one by one to the user may not be adequate in the XML setting.
- Fragments may be scattered across the list
- Duplicated fragments in different structures
- Refine a search query to obtain more and better results
- Ideas: make use of the schema information (DTD), consider the fragments as entities, and merge them in a concise way
40My Publications
- Ho-Lam LAU and Wilfred NG. A Multi-Ranker Model for Adaptive XML Searching. Accepted, to appear in VLDB Journal (2007).
- Ho-Lam LAU and Wilfred NG. Towards an Adaptive Information Merging Using Selected XML Fragments. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 1013-1019 (2007).
- James CHENG and Wilfred NG. A Development of Hash-Lookup Trees to Support Querying Streaming XML. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 768-780 (2007).
- Wilfred NG and James CHENG. An Efficient Index Lattice for XML Query Evaluation. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 753-767 (2007).
- Wilfred NG and Ho-Lam LAU. A Co-Training Framework for Searching XML Documents. Information Systems, 32(3), pp. 477-503 (2007).
- Yin YANG, Wilfred NG, Ho-Lam LAU and James CHENG. An Efficient Approach to Support Querying Secure Outsourced XML Information. Conference on Advanced Information Systems Engineering, CAiSE 2006, Lecture Notes in Computer Science Vol. 4007, Luxembourg, pp. 157-171 (2006).
- Wilfred NG and Ho-Lam LAU. Effective Approaches for Watermarking XML Data. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 68-80 (2005).
- Ho-Lam LAU and Wilfred NG. A Unifying Framework for Merging and Evaluating XML Information. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 81-94 (2005).