Title: Searching Web Better
Slide 1: Searching Web Better
- Dr Wilfred Ng
- Department of Computer Science
- The Hong Kong University of Science and Technology
Slide 2: Outline
- Introduction
- Main Techniques (RSCF)
- Clickthrough Data
- Ranking Support Vector Machine Algorithm
- Ranking SVM in Co-training Framework
- The RSCF-based Metasearch Engine
- Search Engine Components
- Feature Extraction
- Experiments
- Current Development
Slide 3: Search Engine Adaptation
[Figure: queries from different domains (Social Science, Computer Science, Finance, Product, CS terms, News) directed to search engines such as Google, MSNsearch, Wisenut and Overture]
- Adapt the search engine by learning from implicit feedback: clickthrough data
Slide 4: Clickthrough Data
- Clickthrough data: data indicating which links in the returned ranking results have been clicked by the users
- Formally, a triplet (q, r, c)
  - q: the input query
  - r: the ranking result presented to the user
  - c: the set of links the user clicked on
- Benefits
  - Can be obtained in a timely fashion
  - No intervention in the user's search activity
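The triplet can be sketched directly in code; all concrete query and link values below are hypothetical:

```python
# A clickthrough triplet (q, r, c): query, ranking shown, clicked links.
# The concrete values are hypothetical examples.
q = "support vector machine"
r = ["l1", "l2", "l3", "l4", "l5"]   # ranking presented to the user
c = {"l1", "l3"}                     # links the user clicked on

# Recover the clicked links in the order they were presented.
clicked = [link for link in r if link in c]
print(clicked)  # ['l1', 'l3']
```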
Slide 5: An Example of Clickthrough Data
[Figure: the user's input query and the returned list of links; the links clicked by the user are highlighted]
Slide 6: Target Ranking (Preference Pairs Set)
Slide 7: An Example of Clickthrough Data
[Figure: the user's input query and the returned list of links, partitioned into a labelled data set and an unlabelled data set; the links clicked by the user are highlighted]
Slide 8: Target Ranking (Preference Pairs Set)
- Labelled data set: l1, l2, ..., l10
- Unlabelled data set: l11, l12, ...
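The preference pairs are derived from the clicks; under Joachims' interpretation, a clicked link is preferred over every unclicked link ranked above it. A minimal sketch, with hypothetical clicks:

```python
def preference_pairs(ranking, clicked):
    """Joachims' rule: a clicked link is preferred over every
    unclicked link that was ranked above it."""
    pairs = []
    for i, link in enumerate(ranking):
        if link in clicked:
            for earlier in ranking[:i]:
                if earlier not in clicked:
                    pairs.append((link, earlier))  # link is preferred
    return pairs

# Hypothetical example: l3 and l5 clicked out of five returned links.
pairs = preference_pairs(["l1", "l2", "l3", "l4", "l5"], {"l3", "l5"})
print(pairs)
# [('l3', 'l1'), ('l3', 'l2'), ('l5', 'l1'), ('l5', 'l2'), ('l5', 'l4')]
```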
Slide 9: The Ranking SVM Algorithm
[Figure: three links l1, l2, l3, each described by a feature vector, projected onto candidate weight vectors; the weight vector acts as the ranker, the target ranking starts with l1, and the margin is the distance between the two closest projected links]
- Con: it needs a large set of labelled data
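The idea can be sketched as pairwise hinge-loss minimization: learn a weight vector w so that w projects the preferred link of every preference pair above the other by a margin. A toy subgradient-descent sketch; the features and hyperparameters are hypothetical, not the paper's settings (the real Ranking SVM solves a quadratic program):

```python
# Minimal Ranking SVM sketch: subgradient descent on
# 0.5*||w||^2 + C * sum(hinge(1 - w . (x_better - x_worse))).
def train_ranking_svm(pref_pairs, c=1.0, lr=0.01, epochs=200):
    """pref_pairs: list of (better, worse) feature vectors."""
    dim = len(pref_pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pref_pairs:
            diff = [b - a for b, a in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for k in range(dim):
                # Regularizer gradient plus hinge subgradient if violated.
                grad = w[k] - (c * diff[k] if margin < 1 else 0.0)
                w[k] -= lr * grad
    return w

# Hypothetical toy data: the first feature is informative, the second noise.
pairs = [([1.0, 0.2], [0.0, 0.9]), ([0.8, 0.5], [0.1, 0.4])]
w = train_ranking_svm(pairs)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
assert score([1.0, 0.2]) > score([0.0, 0.9])  # preferred link ranks higher
```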
Slide 10: The Ranking SVM in Co-training Framework
- Divide the feature vector into two subvectors
- Two rankers are built over these two feature subvectors
- Each ranker chooses several unlabelled preference pairs and adds them to the labelled data set
- Rebuild each ranker from the augmented labelled data set
[Figure: the labelled preference feedback pairs P_l are used to train rankers a_A and a_B; each ranker selects confident pairs from the unlabelled preference pairs P_u and feeds them back as augmented pairs]
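The four steps above can be sketched as a loop; `train_ranker`, `confidence` and the data values below are hypothetical stand-ins, not the actual RSCF implementation:

```python
# Co-training sketch: two rankers over two feature subvectors teach
# each other by labelling the pairs they are most confident about.
def co_train(labelled, unlabelled, split, train_ranker, confidence,
             n_iter=5, n_add=1):
    """split(pair) -> (subvector_A, subvector_B)."""
    for _ in range(n_iter):
        rankers = [train_ranker([split(p)[v] for p in labelled])
                   for v in (0, 1)]
        for v, ranker in enumerate(rankers):
            # Each ranker moves its most confident unlabelled pairs
            # into the labelled set.
            unlabelled.sort(key=lambda p: confidence(ranker, split(p)[v]),
                            reverse=True)
            labelled += unlabelled[:n_add]
            del unlabelled[:n_add]
    # Final ranker is trained on the recombined full feature vector.
    return train_ranker([split(p)[0] + split(p)[1] for p in labelled])

# Trivial stand-ins: "pairs" are numbers, the "ranker" is the data size,
# and "confidence" is the value itself.
train = lambda data: len(data)
conf = lambda ranker, subvec: subvec
final = co_train([1, 2], [5, 4, 3], lambda p: (p, p), train, conf,
                 n_iter=1, n_add=1)
print(final)  # 4
```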
Slide 11: Some Issues
- Guideline for partitioning the feature vector
  - After the partition, each subvector must be sufficient for the later ranking
- Number of rankers
  - Depends on the number of features
- When to terminate the procedure?
  - The prediction difference indicates the ranking difference between the two rankers
  - After termination, train a final ranker on the augmented labelled data set
Slide 12: Metasearch Engine
- Receives a query from the user
- Sends the query to multiple search engines
- Combines the retrieved results from the underlying search engines
- Presents a unified ranking result to the user
[Figure: the user's query flows through the metasearch engine to Search Engines 1..n; Retrieved Results 1..n are merged into a unified ranking result]
Slide 13: Search Engine Components
- MSNsearch
  - Powered by Inktomi, relatively mature
- Google
  - One of the most powerful search engines nowadays
- Wisenut
  - A new but growing search engine
- Overture
  - Ranks links based on the prices paid by the sponsors of the links
Slide 14: Feature Extraction
- Ranking features (12 binary features)
  - Rank(E, T), where E ∈ {M, W, O} and T ∈ {1, 3, 5, 10}
  - (M: MSNsearch, W: Wisenut, O: Overture)
  - Indicate the ranking of the links in each underlying search engine
- Similarity features (4 features)
  - Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a)
  - URL, Title, Abstract Cover, Abstract Group
  - Indicate the similarity between the query and the link
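The 12 binary ranking features can be sketched as follows; the dict-based input format is an assumption for illustration:

```python
# Rank(E, T): for each engine E in {M, W, O} and threshold T in
# {1, 3, 5, 10}, the feature is 1 iff the link is ranked within
# the top T results of E.
def ranking_features(ranks):
    """ranks: dict mapping engine -> rank of the link in that engine
    (engine absent from the dict means the link was not returned)."""
    feats = []
    for engine in ("M", "W", "O"):
        r = ranks.get(engine)
        for t in (1, 3, 5, 10):
            feats.append(1 if r is not None and r <= t else 0)
    return feats

# Hypothetical link: rank 2 in MSNsearch, rank 7 in Wisenut, absent
# from Overture.
f = ranking_features({"M": 2, "W": 7})
print(f)  # [0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
```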
Slide 15: Experiments
- Experiment data: within a single domain (Computer Science)
- Objectives
  - Offline experiments: compared with RSVM
  - Online experiments: compared with Google
Slide 16: Prediction Error
- Prediction error: the difference between the ranker's ranking and the target ranking
- Example
  - Target ranking: l1 < l2 < l3
  - Projected ranking: l2 < l1 < l3
  - Prediction error: 33% (one of the three preference pairs is discordant)
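The prediction error can be sketched as the fraction of discordant preference pairs between the target ranking and the projected ranking:

```python
# Prediction error: fraction of target preference pairs on which the
# ranker's ordering disagrees with the target ordering.
def prediction_error(target, predicted):
    """target, predicted: lists of the same links, best first."""
    pos = {link: i for i, link in enumerate(predicted)}
    pairs = [(a, b) for i, a in enumerate(target) for b in target[i + 1:]]
    wrong = sum(1 for a, b in pairs if pos[a] > pos[b])
    return wrong / len(pairs)

# Slide example: target l1 < l2 < l3, projected l2 < l1 < l3.
err = prediction_error(["l1", "l2", "l3"], ["l2", "l1", "l3"])
print(round(err * 100))  # 33
```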
Slide 17: Offline Experiment (Compared with RSVM)
[Plots for 10, 30 and 60 queries, comparing the ranker trained by the RSVM algorithm on the whole feature vector with the rankers trained by the RSCF algorithm on each of the two feature subvectors]
- The prediction error rises again after a few iterations: a suitable number of iterations in the RSCF algorithm is about four to five
Slide 18: Offline Experiment (Compared with RSVM)
[Plot: overall comparison of the ranker trained by the RSVM algorithm against the final ranker trained by the RSCF algorithm]
Slide 19: Online Experiment (Compared with Google)
- Experiment data: CS terms
  - e.g. radix sort, TREC collection, ...
- Experiment setup
  - Combine the results returned by RSCF and those returned by Google into one shuffled list
  - Present it to the users in a unified way
  - Record the users' clicks
Slide 20: Experimental Analysis
Slide 21: Experimental Analysis
Slide 22: Experimental Analysis
Slide 23: Conclusion on RSCF
- Search engine adaptation
- The RSCF algorithm
  - Trains on clickthrough data
  - Applies RSVM in the co-training framework
- The RSCF-based metasearch engine
  - Offline experiments: better than RSVM
  - Online experiments: better than Google
Slide 24: Current Development
- Feature extraction and division
- Application in different domains
- Search engine personalization
- SpyNoby Project: a personalized search engine with clickthrough analysis
Slide 25: Modified Target Ranking for Metasearch Engines
- If l1 and l7 are from the same underlying search engine, the preference pairs set arising from l1 should be modified
- Advantages
  - Alleviates the penalty on high-ranked links
  - Gives more credit to the ranking ability of the underlying search engines
Slide 26: Modified Target Ranking
- Labelled data set: l1, l2, ..., l10
- Unlabelled data set: l11, l12, ...
Slide 27: RSCF-based Metasearch Engine - MEA
[Figure: the user's query q is forwarded by MEA to its underlying search engines, and a unified ranking result is returned to the user]
Slide 28: RSCF-based Metasearch Engine - MEB
[Figure: the user's query q is forwarded by MEB to its underlying search engines, and a unified ranking result is returned to the user]
Slide 29: Generating Clickthrough Data
- Probability of being clicked on: Pr(k) = (1/k^α) / H_{n,α}
  - k: the rank of the link in the metasearch engine
  - n: the number of all the links in the metasearch engine
  - α: the skewness parameter in Zipf's law
  - H_{n,α} = Σ_{i=1..n} 1/i^α: the generalized harmonic number
- Judge each link's relevance manually
  - If the link is irrelevant → it is not clicked on
  - If the link is relevant → it is clicked on with probability Pr(k)
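Under this Zipf-style model, click generation can be sketched as follows; the function names are hypothetical:

```python
import random

# Click model sketch: a relevant link at rank k is clicked with
# probability Pr(k) = (1/k**alpha) / H, where H is the generalized
# harmonic number sum(1/i**alpha for i in 1..n).
def click_probability(k, n, alpha=1.0):
    h = sum(1.0 / i ** alpha for i in range(1, n + 1))
    return (1.0 / k ** alpha) / h

def simulate_click(k, n, relevant, alpha=1.0, rng=random):
    if not relevant:                  # irrelevant links are never clicked
        return False
    return rng.random() < click_probability(k, n, alpha)

probs = [click_probability(k, 10, 1.0) for k in range(1, 11)]
assert abs(sum(probs) - 1.0) < 1e-9   # probabilities sum to 1 over ranks
assert probs[0] > probs[1] > probs[9]  # higher-ranked links more likely
```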
Slide 30: Feature Extraction
- Ranking features (binary features)
  - Rank(E, T): whether the link is ranked within S_T in E
  - where E ∈ {G, M, W, O} and T ∈ {1, 3, 5, 10, 15, 20, 25, 30}
  - S_1 = {1}, S_3 = {2, 3}, S_5 = {4, 5}, S_10 = {6, 7, 8, 9, 10}, ...
  - (G: Google, M: MSNsearch, W: Wisenut, O: Overture)
  - Indicate the ranking of the links in each underlying search engine
- Similarity features (4 features)
  - Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a)
  - Measure the similarity between the query and the link
Slide 31: Experiments
- Experiment data: three different domains
  - CS terms
  - News
  - E-shopping
- Objectives
  - Prediction error: better than RSVM
  - Top-k precision: adaptation ability
Slide 32: Top-k Precision
- Advantages
  - Precision is easier to obtain than recall
  - Users care only about the top-k links (k ≤ 10)
- Evaluation data: 30 queries in each domain
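Top-k precision can be sketched as follows; the relevance judgments in the example are hypothetical:

```python
# Top-k precision: the fraction of the top k returned links that are
# relevant (relevance judged manually in the experiments).
def top_k_precision(ranking, relevant, k=10):
    top = ranking[:k]
    return sum(1 for link in top if link in relevant) / len(top)

# Hypothetical judgment: 3 of the top 5 links are relevant.
p = top_k_precision(["l1", "l2", "l3", "l4", "l5"], {"l1", "l3", "l4"}, k=5)
print(p)  # 0.6
```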
Slide 33: Comparison of Top-k Precision
[Plots of top-k precision for the News, CS terms and E-shopping domains]
Slide 34: Statistical Analysis
- Hypothesis testing (two-sample hypothesis testing about means)
  - Used to analyze whether there is a statistically significant difference between the means of two samples
Slide 35: Comparison Results
- MEA can produce better search quality than Google
  - Google does not excel in every query category
- MEA and MEB are able to adapt so as to bring out the strengths of each underlying search engine
  - MEA and MEB are better than, or comparable to, all their underlying search engine components in every query category
- The RSCF-based metasearch engine
  - Comparison of prediction error: better than RSVM
  - Comparison of top-k precision: adaptation ability
Slide 36: Spy Naïve Bayes - Motivation
- The problem with Joachims' method
  - Strong assumptions
  - Excessively penalizes high-ranked links: l1, l2, l3 are apt to appear on the right of preference pairs, while l7, l10 appear on the left
- New interpretation of clickthrough data
  - Clicked: positive (P)
  - Unclicked: unlabeled (U), containing both positive and negative samples
- Goal: identify Reliable Negatives (RN) from U
Slide 37: Spy Naïve Bayes - Ideas
- Standard naïve Bayes: classify positive and negative samples
- One-step Spy Naïve Bayes: spying out RN from U
  - Put a small number of positive samples into U to act as spies (to scout the behavior of real positive samples in U)
  - Take U as negative samples to train a naïve Bayes classifier
  - Samples with lower probabilities of being positive are assigned to RN
- Voting procedure: makes spying more robust
  - Run one-step SpyNB n times and get n sets RN_i
  - A sample that appears in at least m (m ≤ n) of the RN_i will appear in the final RN
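The one-step procedure can be sketched as follows; `train_nb` stands in for a real naïve Bayes trainer, and the toy scorer in the demo is purely hypothetical:

```python
import random

# One-step Spy Naive Bayes sketch: spies (known positives hidden in U)
# reveal how true positives score, and unlabelled samples scoring below
# every spy are taken as reliable negatives.
def one_step_spynb(positives, unlabelled, train_nb, spy_ratio=0.2,
                   rng=random):
    """train_nb(p, u) -> scoring function giving each sample a
    probability-like degree of being positive."""
    n_spies = max(1, int(len(positives) * spy_ratio))
    spies = rng.sample(positives, n_spies)
    p = [x for x in positives if x not in spies]
    u = unlabelled + spies                     # spies hide inside U
    score = train_nb(p, u)                     # P positive, U negative
    threshold = min(score(s) for s in spies)   # worst-scoring spy
    return [x for x in unlabelled if score(x) < threshold]

# Toy demo: samples are numbers, higher means "more positive"; the
# hypothetical train_nb just scores a sample by its value.
rng = random.Random(0)
rn = one_step_spynb([8, 9, 10], [1, 2, 9], lambda p, u: (lambda x: x),
                    rng=rng)
assert 1 in rn and 2 in rn   # clearly negative samples end up in RN
```

The voting procedure then repeats this with different random spy sets and keeps only the samples that fall into at least m of the resulting RN_i.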
Slide 38: http://dleecpu1.cs.ust.hk:8080/SpyNoby/
Slide 39: My Publications
- Wilfred NG. Book Review: An Introduction to Search Engines and Web Navigation. Information Processing & Management, 43(1), pp. 290-292, (2007).
- Wilfred NG, Lin DENG and Dik-Lun LEE. Spying Out Real User Preferences in Web Searching. Accepted and to appear in ACM Transactions on Internet Technology, (2006).
- Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE. Web Dynamics and their Ramifications for the Development of Web Search Engines. Accepted and to appear in Computer Networks Journal - Special Issue on Web Dynamics, (2005).
- Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A Web Usage Manipulation Language for Querying Web Log Data. International Conference on Conceptual Modeling (ER 2004), Lecture Notes in Computer Science Vol. 3288, Shanghai, China, pp. 567-581, (2004).
- Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred NG and Dik-Lun LEE. Spying Out Real User Preferences for Metasearch Engine Personalization. ACM Proceedings of the WEBKDD Workshop on Web Mining and Web Usage Analysis 2004, Seattle, USA, (2004).
- Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and Dik-Lun LEE. Applying Co-training to Clickthrough Data for Search Engine Adaptation. 9th International Conference on Database Systems for Advanced Applications (DASFAA 2004), Lecture Notes in Computer Science Vol. 2973, Jeju Island, Korea, pp. 519-532, (2004).
- Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred NG, Wei WANG and Baile SHI. Refining Web Authoritative Resource by Frequent Structures. IEEE Proceedings of the International Database Engineering and Applications Symposium (IDEAS 2003), Hong Kong, pp. 236-241, (2003).
- Wilfred NG. Capturing the Semantics of Web Log Data by Navigation Matrices. A book chapter in "Semantic Issues in E-Commerce Systems", edited by R. Meersman, K. Aberer and T. Dillon, Kluwer Academic Publishers, pp. 155-170, (2003).