Title: Searching Web Better
Slide 1: Searching Web Better
- Dr Wilfred Ng
- Department of Computer Science
- The Hong Kong University of Science and Technology
Slide 2: Outline
- Introduction
- Main Techniques (RSCF)
- Clickthrough Data
- Ranking Support Vector Machine Algorithm
- Ranking SVM in Co-training Framework
- The RSCF-based Metasearch Engine
- Search Engine Components
- Feature Extraction
- Experiments
- Current Development
Slide 3: Search Engine Adaptation
[Figure: queries from different domains (Social Science, Computer Science, Finance, Product, CS terms, News) directed to search engines such as Google, MSNsearch, Wisenut and Overture]
- Adapt the search engine by learning from implicit feedback: clickthrough data
Slide 4: Clickthrough Data
- Clickthrough data: data indicating which links in the returned ranking results have been clicked by the users
- Formally, a triplet (q, r, c)
  - q: the input query
  - r: the ranking result presented to the user
  - c: the set of links the user clicked on
- Benefits
  - Can be obtained in a timely fashion
  - No intervention in the user's search activity
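The triplet can be sketched directly in code; all concrete query and link values below are hypothetical:

```python
# A clickthrough triplet (q, r, c): query, ranking shown, clicked links.
# The concrete values are hypothetical examples.
q = "support vector machine"
r = ["l1", "l2", "l3", "l4", "l5"]   # ranking presented to the user
c = {"l1", "l3"}                     # links the user clicked on

# Recover the clicked links in the order they were presented.
clicked = [link for link in r if link in c]
print(clicked)  # ['l1', 'l3']
```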
Slide 5: An Example of Clickthrough Data
[Figure: the user's input query and the returned list of links; the links clicked by the user are highlighted]
Slide 6: Target Ranking (Preference Pairs Set)
Slide 7: An Example of Clickthrough Data
[Figure: the user's input query and the returned list of links, partitioned into a labelled data set and an unlabelled data set; the links clicked by the user are highlighted]
Slide 8: Target Ranking (Preference Pairs Set)
- Labelled data set: l1, l2, ..., l10
- Unlabelled data set: l11, l12, ...
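The preference pairs are derived from the clicks; under Joachims' interpretation, a clicked link is preferred over every unclicked link ranked above it. A minimal sketch, with hypothetical clicks:

```python
def preference_pairs(ranking, clicked):
    """Joachims' rule: a clicked link is preferred over every
    unclicked link that was ranked above it."""
    pairs = []
    for i, link in enumerate(ranking):
        if link in clicked:
            for earlier in ranking[:i]:
                if earlier not in clicked:
                    pairs.append((link, earlier))  # link is preferred
    return pairs

# Hypothetical example: l3 and l5 clicked out of five returned links.
pairs = preference_pairs(["l1", "l2", "l3", "l4", "l5"], {"l3", "l5"})
print(pairs)
# [('l3', 'l1'), ('l3', 'l2'), ('l5', 'l1'), ('l5', 'l2'), ('l5', 'l4')]
```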
Slide 9: The Ranking SVM Algorithm
[Figure: three links l1, l2, l3, each described by a feature vector, projected onto candidate weight vectors; the weight vector acts as the ranker, the target ranking starts with l1, and the margin is the distance between the two closest projected links]
- Con: it needs a large set of labelled data
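The idea can be sketched as pairwise hinge-loss minimization: learn a weight vector w so that w projects the preferred link of every preference pair above the other by a margin. A toy subgradient-descent sketch; the features and hyperparameters are hypothetical, not the paper's settings (the real Ranking SVM solves a quadratic program):

```python
# Minimal Ranking SVM sketch: subgradient descent on
# 0.5*||w||^2 + C * sum(hinge(1 - w . (x_better - x_worse))).
def train_ranking_svm(pref_pairs, c=1.0, lr=0.01, epochs=200):
    """pref_pairs: list of (better, worse) feature vectors."""
    dim = len(pref_pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pref_pairs:
            diff = [b - a for b, a in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for k in range(dim):
                # Regularizer gradient plus hinge subgradient if violated.
                grad = w[k] - (c * diff[k] if margin < 1 else 0.0)
                w[k] -= lr * grad
    return w

# Hypothetical toy data: the first feature is informative, the second noise.
pairs = [([1.0, 0.2], [0.0, 0.9]), ([0.8, 0.5], [0.1, 0.4])]
w = train_ranking_svm(pairs)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
assert score([1.0, 0.2]) > score([0.0, 0.9])  # preferred link ranks higher
```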
Slide 10: The Ranking SVM in Co-training Framework
- Divide the feature vector into two subvectors
- Two rankers are built over these two feature subvectors
- Each ranker chooses several unlabelled preference pairs and adds them to the labelled data set
- Rebuild each ranker from the augmented labelled data set
[Figure: the labelled preference feedback pairs P_l are used to train rankers a_A and a_B; each ranker selects confident pairs from the unlabelled preference pairs P_u and feeds them back as augmented pairs]
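The four steps above can be sketched as a loop; `train_ranker`, `confidence` and the data values below are hypothetical stand-ins, not the actual RSCF implementation:

```python
# Co-training sketch: two rankers over two feature subvectors teach
# each other by labelling the pairs they are most confident about.
def co_train(labelled, unlabelled, split, train_ranker, confidence,
             n_iter=5, n_add=1):
    """split(pair) -> (subvector_A, subvector_B)."""
    for _ in range(n_iter):
        rankers = [train_ranker([split(p)[v] for p in labelled])
                   for v in (0, 1)]
        for v, ranker in enumerate(rankers):
            # Each ranker moves its most confident unlabelled pairs
            # into the labelled set.
            unlabelled.sort(key=lambda p: confidence(ranker, split(p)[v]),
                            reverse=True)
            labelled += unlabelled[:n_add]
            del unlabelled[:n_add]
    # Final ranker is trained on the recombined full feature vector.
    return train_ranker([split(p)[0] + split(p)[1] for p in labelled])

# Trivial stand-ins: "pairs" are numbers, the "ranker" is the data size,
# and "confidence" is the value itself.
train = lambda data: len(data)
conf = lambda ranker, subvec: subvec
final = co_train([1, 2], [5, 4, 3], lambda p: (p, p), train, conf,
                 n_iter=1, n_add=1)
print(final)  # 4
```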
Slide 11: Some Issues
- Guideline for partitioning the feature vector
  - After the partition, each subvector must be sufficient for the later ranking
- Number of rankers
  - Depends on the number of features
- When to terminate the procedure?
  - The prediction difference indicates the ranking difference between the two rankers
  - After termination, train a final ranker on the augmented labelled data set
Slide 12: Metasearch Engine
- Receives a query from the user
- Sends the query to multiple search engines
- Combines the retrieved results from the underlying search engines
- Presents a unified ranking result to the user
[Figure: the user's query flows through the metasearch engine to Search Engines 1..n; Retrieved Results 1..n are merged into a unified ranking result]
Slide 13: Search Engine Components
- MSNsearch
  - Powered by Inktomi, relatively mature
- Google
  - One of the most powerful search engines nowadays
- Wisenut
  - A new but growing search engine
- Overture
  - Ranks links based on the prices paid by the sponsors of the links
Slide 14: Feature Extraction
- Ranking features (12 binary features)
  - Rank(E, T), where E ∈ {M, W, O} and T ∈ {1, 3, 5, 10}
  - (M: MSNsearch, W: Wisenut, O: Overture)
  - Indicate the ranking of the links in each underlying search engine
- Similarity features (4 features)
  - Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a)
  - URL, Title, Abstract Cover, Abstract Group
  - Indicate the similarity between the query and the link
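The 12 binary ranking features can be sketched as follows; the dict-based input format is an assumption for illustration:

```python
# Rank(E, T): for each engine E in {M, W, O} and threshold T in
# {1, 3, 5, 10}, the feature is 1 iff the link is ranked within
# the top T results of E.
def ranking_features(ranks):
    """ranks: dict mapping engine -> rank of the link in that engine
    (engine absent from the dict means the link was not returned)."""
    feats = []
    for engine in ("M", "W", "O"):
        r = ranks.get(engine)
        for t in (1, 3, 5, 10):
            feats.append(1 if r is not None and r <= t else 0)
    return feats

# Hypothetical link: rank 2 in MSNsearch, rank 7 in Wisenut, absent
# from Overture.
f = ranking_features({"M": 2, "W": 7})
print(f)  # [0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
```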
Slide 15: Experiments
- Experiment data: within a single domain (Computer Science)
- Objectives
  - Offline experiments: compared with RSVM
  - Online experiments: compared with Google
Slide 16: Prediction Error
- Prediction error: the difference between the ranker's ranking and the target ranking
- Example
  - Target ranking: l1 < l2 < l3
  - Projected ranking: l2 < l1 < l3
  - Prediction error: 33% (one of the three preference pairs is discordant)
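The prediction error can be sketched as the fraction of discordant preference pairs between the target ranking and the projected ranking:

```python
# Prediction error: fraction of target preference pairs on which the
# ranker's ordering disagrees with the target ordering.
def prediction_error(target, predicted):
    """target, predicted: lists of the same links, best first."""
    pos = {link: i for i, link in enumerate(predicted)}
    pairs = [(a, b) for i, a in enumerate(target) for b in target[i + 1:]]
    wrong = sum(1 for a, b in pairs if pos[a] > pos[b])
    return wrong / len(pairs)

# Slide example: target l1 < l2 < l3, projected l2 < l1 < l3.
err = prediction_error(["l1", "l2", "l3"], ["l2", "l1", "l3"])
print(round(err * 100))  # 33
```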
Slide 17: Offline Experiment (Compared with RSVM)
[Plots for 10, 30 and 60 queries, comparing the ranker trained by the RSVM algorithm on the whole feature vector with the rankers trained by the RSCF algorithm on each of the two feature subvectors]
- The prediction error rises again after a few iterations: a suitable number of iterations in the RSCF algorithm is about four to five
Slide 18: Offline Experiment (Compared with RSVM)
[Plot: overall comparison of the ranker trained by the RSVM algorithm against the final ranker trained by the RSCF algorithm]
Slide 19: Online Experiment (Compared with Google)
- Experiment data: CS terms
  - e.g. radix sort, TREC collection, ...
- Experiment setup
  - Combine the results returned by RSCF and those returned by Google into one shuffled list
  - Present it to the users in a unified way
  - Record the users' clicks
Slide 20: Experimental Analysis
Slide 21: Experimental Analysis
Slide 22: Experimental Analysis
Slide 23: Conclusion on RSCF
- Search engine adaptation
- The RSCF algorithm
  - Trains on clickthrough data
  - Applies RSVM in the co-training framework
- The RSCF-based metasearch engine
  - Offline experiments: better than RSVM
  - Online experiments: better than Google
Slide 24: Current Development
- Feature extraction and division
- Application in different domains
- Search engine personalization
- SpyNoby Project: a personalized search engine with clickthrough analysis
Slide 25: Modified Target Ranking for Metasearch Engines
- If l1 and l7 are from the same underlying search engine, the preference pairs set arising from l1 should be modified
- Advantages
  - Alleviates the penalty on high-ranked links
  - Gives more credit to the ranking ability of the underlying search engines
Slide 26: Modified Target Ranking
- Labelled data set: l1, l2, ..., l10
- Unlabelled data set: l11, l12, ...
Slide 27: RSCF-based Metasearch Engine - MEA
[Figure: the user's query q is forwarded by MEA to its underlying search engines, and a unified ranking result is returned to the user]
Slide 28: RSCF-based Metasearch Engine - MEB
[Figure: the user's query q is forwarded by MEB to its underlying search engines, and a unified ranking result is returned to the user]
Slide 29: Generating Clickthrough Data
- Probability of being clicked on: Pr(k) = (1/k^α) / H_{n,α}
  - k: the rank of the link in the metasearch engine
  - n: the number of all the links in the metasearch engine
  - α: the skewness parameter in Zipf's law
  - H_{n,α} = Σ_{i=1..n} 1/i^α: the generalized harmonic number
- Judge each link's relevance manually
  - If the link is irrelevant → it is not clicked on
  - If the link is relevant → it is clicked on with probability Pr(k)
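Under this Zipf-style model, click generation can be sketched as follows; the function names are hypothetical:

```python
import random

# Click model sketch: a relevant link at rank k is clicked with
# probability Pr(k) = (1/k**alpha) / H, where H is the generalized
# harmonic number sum(1/i**alpha for i in 1..n).
def click_probability(k, n, alpha=1.0):
    h = sum(1.0 / i ** alpha for i in range(1, n + 1))
    return (1.0 / k ** alpha) / h

def simulate_click(k, n, relevant, alpha=1.0, rng=random):
    if not relevant:                  # irrelevant links are never clicked
        return False
    return rng.random() < click_probability(k, n, alpha)

probs = [click_probability(k, 10, 1.0) for k in range(1, 11)]
assert abs(sum(probs) - 1.0) < 1e-9   # probabilities sum to 1 over ranks
assert probs[0] > probs[1] > probs[9]  # higher-ranked links more likely
```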
Slide 30: Feature Extraction
- Ranking features (binary features)
  - Rank(E, T): whether the link is ranked within S_T in E
  - where E ∈ {G, M, W, O} and T ∈ {1, 3, 5, 10, 15, 20, 25, 30}
  - S_1 = {1}, S_3 = {2, 3}, S_5 = {4, 5}, S_10 = {6, 7, 8, 9, 10}, ...
  - (G: Google, M: MSNsearch, W: Wisenut, O: Overture)
  - Indicate the ranking of the links in each underlying search engine
- Similarity features (4 features)
  - Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a)
  - Measure the similarity between the query and the link
Slide 31: Experiments
- Experiment data: three different domains
  - CS terms
  - News
  - E-shopping
- Objectives
  - Prediction error: better than RSVM
  - Top-k precision: adaptation ability
Slide 32: Top-k Precision
- Advantages
  - Precision is easier to obtain than recall
  - Users care only about the top-k links (k ≤ 10)
- Evaluation data: 30 queries in each domain
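Top-k precision can be sketched as follows; the relevance judgments in the example are hypothetical:

```python
# Top-k precision: the fraction of the top k returned links that are
# relevant (relevance judged manually in the experiments).
def top_k_precision(ranking, relevant, k=10):
    top = ranking[:k]
    return sum(1 for link in top if link in relevant) / len(top)

# Hypothetical judgment: 3 of the top 5 links are relevant.
p = top_k_precision(["l1", "l2", "l3", "l4", "l5"], {"l1", "l3", "l4"}, k=5)
print(p)  # 0.6
```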
Slide 33: Comparison of Top-k Precision
[Plots of top-k precision for the News, CS terms and E-shopping domains]
Slide 34: Statistical Analysis
- Hypothesis testing (two-sample hypothesis testing about means)
  - Used to analyze whether there is a statistically significant difference between the means of two samples
Slide 35: Comparison Results
- MEA can produce better search quality than Google
  - Google does not excel in every query category
- MEA and MEB are able to adapt so as to bring out the strengths of each underlying search engine
  - MEA and MEB are better than, or comparable to, all their underlying search engine components in every query category
- The RSCF-based metasearch engine
  - Comparison of prediction error: better than RSVM
  - Comparison of top-k precision: adaptation ability
Slide 36: Spy Naïve Bayes - Motivation
- The problem with Joachims' method
  - Strong assumptions
  - Excessively penalizes high-ranked links: l1, l2, l3 are apt to appear on the right of preference pairs, while l7, l10 appear on the left
- New interpretation of clickthrough data
  - Clicked: positive (P)
  - Unclicked: unlabeled (U), containing both positive and negative samples
- Goal: identify Reliable Negatives (RN) from U
Slide 37: Spy Naïve Bayes - Ideas
- Standard naïve Bayes: classify positive and negative samples
- One-step Spy Naïve Bayes: spying out RN from U
  - Put a small number of positive samples into U to act as spies (to scout the behavior of real positive samples in U)
  - Take U as negative samples to train a naïve Bayes classifier
  - Samples with lower probabilities of being positive are assigned to RN
- Voting procedure: makes spying more robust
  - Run one-step SpyNB n times and get n sets RN_i
  - A sample that appears in at least m (m ≤ n) of the RN_i will appear in the final RN
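The one-step procedure can be sketched as follows; `train_nb` stands in for a real naïve Bayes trainer, and the toy scorer in the demo is purely hypothetical:

```python
import random

# One-step Spy Naive Bayes sketch: spies (known positives hidden in U)
# reveal how true positives score, and unlabelled samples scoring below
# every spy are taken as reliable negatives.
def one_step_spynb(positives, unlabelled, train_nb, spy_ratio=0.2,
                   rng=random):
    """train_nb(p, u) -> scoring function giving each sample a
    probability-like degree of being positive."""
    n_spies = max(1, int(len(positives) * spy_ratio))
    spies = rng.sample(positives, n_spies)
    p = [x for x in positives if x not in spies]
    u = unlabelled + spies                     # spies hide inside U
    score = train_nb(p, u)                     # P positive, U negative
    threshold = min(score(s) for s in spies)   # worst-scoring spy
    return [x for x in unlabelled if score(x) < threshold]

# Toy demo: samples are numbers, higher means "more positive"; the
# hypothetical train_nb just scores a sample by its value.
rng = random.Random(0)
rn = one_step_spynb([8, 9, 10], [1, 2, 9], lambda p, u: (lambda x: x),
                    rng=rng)
assert 1 in rn and 2 in rn   # clearly negative samples end up in RN
```

The voting procedure then repeats this with different random spy sets and keeps only the samples that fall into at least m of the resulting RN_i.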
Slide 38: http://dleecpu1.cs.ust.hk:8080/SpyNoby/
Slide 39: My Publications
- Wilfred NG. Book Review: An Introduction to Search Engines and Web Navigation. Information Processing & Management, 43(1), pp. 290-292, (2007).
- Wilfred NG, Lin DENG and Dik-Lun LEE. Spying Out Real User Preferences in Web Searching. Accepted and to appear in ACM Transactions on Internet Technology, (2006).
- Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE. Web Dynamics and their Ramifications for the Development of Web Search Engines. Accepted and to appear in Computer Networks Journal - Special Issue on Web Dynamics, (2005).
- Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A Web Usage Manipulation Language for Querying Web Log Data. International Conference on Conceptual Modeling (ER 2004), Lecture Notes in Computer Science Vol. 3288, Shanghai, China, pp. 567-581, (2004).
- Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred NG and Dik-Lun LEE. Spying Out Real User Preferences for Metasearch Engine Personalization. ACM Proceedings of the WEBKDD Workshop on Web Mining and Web Usage Analysis 2004, Seattle, USA, (2004).
- Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and Dik-Lun LEE. Applying Co-training to Clickthrough Data for Search Engine Adaptation. 9th International Conference on Database Systems for Advanced Applications (DASFAA 2004), Lecture Notes in Computer Science Vol. 2973, Jeju Island, Korea, pp. 519-532, (2004).
- Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred NG, Wei WANG and Baile SHI. Refining Web Authoritative Resource by Frequent Structures. IEEE Proceedings of the International Database Engineering and Applications Symposium (IDEAS 2003), Hong Kong, pp. 236-241, (2003).
- Wilfred NG. Capturing the Semantics of Web Log Data by Navigation Matrices. A book chapter in "Semantic Issues in E-Commerce Systems", edited by R. Meersman, K. Aberer and T. Dillon, Kluwer Academic Publishers, pp. 155-170, (2003).