Title: MetaSearch Engine
1Lecture 9 Rank Aggregation in MetaSearch
- MetaSearch Engine
- Social Choice Rules
- Rank Aggregation
2Choices of Search Engines
- Many search engines exist to compete for users
- The results are not necessarily the same
- Different users prefer different search engines
- Search results may, in the future, be biased
towards paid advertisements.
3MetaSearch Engine
- Metasearch engines are designed to increase
coverage of the web by forwarding users' queries to
multiple search engines.
- Users' requests are sent to multiple search
engines such as AlltheWeb, Google, and MSN.
- The results from the individual search
engines are then combined into a single result set to
present to users.
4Different Forms of MetaSearch
- Submit different representations of the same
query to the same search engine, then combine the
results.
- Submit the same query to several search engines
that adopt different information retrieval models,
then combine the results.
5Issues
- How to combine the results retrieved by the different
source search engines is crucial to the success
of a metasearch engine.
- This is the problem that social choice theory
has been trying to answer.
6Search Engine Watch
- Interesting metasearch engines are listed at
- http://www.searchenginewatch.com/links/article.php/2156241
7Social Choice Theory
- Studies protocols that help a group of people
make collective decisions, such as voting.
8A Fundamental problem
- Given a collection of agents (voters)
- with preferences over different alternatives
(allocations, outcomes),
- how should society evaluate these alternatives
and make a decision for all
- that may follow the will of some voters but
go against that of others?
9Applications
- Voters elect a president from several candidates.
- National polls on the economic or political policy
of the government
- The procedure or rule of an election
- The ranking of a metasearch engine, obtained from those
of the source search engines
10Group Decisions
- How do we make decisions?
- Flip a coin?
- Dictatorship?
- Democracy (Majority rule)?
11Group Decision Rules
- Majority rule
- Condorcet paradox (voting cycle)
- Borda rule
12Mathematical model
- A set of voters $V = \{v_1, v_2, v_3, \ldots, v_n\}$
- A set of alternatives or outcomes
$S = \{s_1, s_2, s_3, \ldots, s_m\}$, with $|S| = m$, and
- A set of preference relations $P = \{R_1, R_2, R_3, \ldots, R_n\}$,
called a preference profile,
- where the preference relation $R_i$ of each voter $i$ is a
permutation (ordering) of the elements in $S$.
13Example 1 Majority Rule
- 3 rational people have rational preferences over
2 alternatives x, y
- Preferences by person:
          Person 1   Person 2   Person 3
  1st        X          Y          X
  2nd        Y          X          Y
- i.e. Person 1: X > Y, Person 2: Y > X, Person 3: X > Y
- How to aggregate their preferences? How to choose?
14- Using majority rule:
- More than ½ of the people (two out of three)
prefer x to y.
- So the group prefers x to y.
15Example 2 Condorcet Paradox
- 3 rational people have rational preferences over
3 alternatives x, y, z
- Preferences by person:
          Person 1   Person 2   Person 3
  1st        X          Y          Z
  2nd        Y          Z          X
  3rd        Z          X          Y
- i.e. Person 1: X > Y > Z, Person 2: Y > Z > X, Person 3: Z > X > Y
16Binary/paired Comparison With Majority rule
- Preferences by person:
          Person 1   Person 2   Person 3
  1st        X          Y          Z
  2nd        Y          Z          X
  3rd        Z          X          Y
- For (x,y): Person 1: X > Y, Person 2: Y > X, Person 3: X > Y,
so the majority gives X > Y.
- Similarly, for (Y,Z) we get Y > Z; for (Z,X) we
get Z > X.
- Then X > Y > Z > X (cycling): intransitive, hence not
rational.
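The paired comparison above can be sketched in a few lines of Python. This is only an illustration: the function names `pairwise_votes` and `majority_relation` are made up for this sketch, and every voter is assumed to rank all alternatives.

```python
from itertools import combinations

def pairwise_votes(profile):
    """Count, for each ordered pair (a, b), the voters who rank a above b."""
    votes = {}
    for ranking in profile:
        pos = {alt: i for i, alt in enumerate(ranking)}
        for a, b in combinations(sorted(pos), 2):
            pair = (a, b) if pos[a] < pos[b] else (b, a)
            votes[pair] = votes.get(pair, 0) + 1
    return votes

def majority_relation(profile):
    """Pairs (a, b) such that a strict majority of voters prefers a to b."""
    n = len(profile)
    return {pair for pair, v in pairwise_votes(profile).items() if 2 * v > n}

# Example 2: the three cyclic preference orders.
profile = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]
print(sorted(majority_relation(profile)))
# [('X', 'Y'), ('Y', 'Z'), ('Z', 'X')]
```

The result contains X > Y, Y > Z and Z > X at the same time, which is exactly the voting cycle.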
17- Condorcet noted in the 18th century that
there may be no alternative that wins a majority against all
other alternatives.
- Pairwise majority is therefore not satisfactory in all
cases.
18Example 3 Borda Rule
- For each voter,
- associate the number 1 with the most preferred
alternative,
- 2 with the second, and so on.
- Assign to each alternative the number equal to
the sum of the numbers the individual voters
assigned to that alternative.
- The alternative with the smallest sum is ranked first.
19- Preferences by person (Borda number in parentheses):
          Person 1   Person 2   Person 3
  1st       X(1)       Y(1)       X(1)
  2nd       Y(2)       X(2)       W(2)
  3rd       Z(3)       W(3)       Z(3)
  4th       W(4)       Z(4)       Y(4)
- Sums: X(4), Y(7), Z(10), W(9), which sorted gives X, Y, W, Z.
- Then we get the choice X > Y > W > Z.
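As a rough sketch, the Borda rule used in this example can be written as follows. The helper names `borda_scores` and `borda_ranking` are illustrative, full rankings are assumed, and ties are broken alphabetically (an arbitrary choice made only for this sketch).

```python
def borda_scores(profile):
    """Sum of the rank numbers (1 = most preferred) given to each alternative."""
    scores = {}
    for ranking in profile:
        for position, alt in enumerate(ranking, start=1):
            scores[alt] = scores.get(alt, 0) + position
    return scores

def borda_ranking(profile):
    """Alternatives sorted by increasing Borda sum (smaller is better)."""
    scores = borda_scores(profile)
    return sorted(scores, key=lambda alt: (scores[alt], alt))

profile = [["X", "Y", "Z", "W"], ["Y", "X", "W", "Z"], ["X", "W", "Z", "Y"]]
print(borda_scores(profile))   # {'X': 4, 'Y': 7, 'Z': 10, 'W': 9}
print(borda_ranking(profile))  # ['X', 'Y', 'W', 'Z']
```

Applied to the three cyclic orders of Example 2, `borda_scores` gives every alternative a sum of 6.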
20- For the above example, if we use binary/paired
comparison with majority rule, we get
- X > Y in 2 out of 3, Y > W in 2 out of 3,
- W > Z in 2 out of 3, X > W in 3 out of 3,
- X > Z in 3 out of 3, Y > Z in 2 out of 3.
- So we arrive at the same choice:
- X > Y > W > Z
21- For the previous example (Example 2), where we had trouble with
majority rule via binary/paired comparison, we
get a tie between all three alternatives with the
Borda rule.
- All three alternatives get a sum of 6.
22- Some variations
- 1. With relevance scores available
- Allot each input system p points to be
distributed according to the relevance scores of the
documents.
- 2. Weighted Borda rule
- The voters may not contribute equally to
the final result. We may give more weight to
good-quality input systems.
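One possible reading of the weighted Borda rule is sketched below. It is an assumption of this sketch, not something the slides specify, that each voter's rank numbers are multiplied by that voter's weight before summing; `weighted_borda` and the weights are hypothetical.

```python
def weighted_borda(profile, weights):
    """Borda sums in which voter i's rank numbers are scaled by weights[i].

    Assumption: smaller sums are better, so a larger weight makes
    that voter's ordering count more in the final result.
    """
    scores = {}
    for ranking, weight in zip(profile, weights):
        for position, alt in enumerate(ranking, start=1):
            scores[alt] = scores.get(alt, 0.0) + weight * position
    ranking = sorted(scores, key=lambda alt: (scores[alt], alt))
    return ranking, scores

# The cyclic profile ties under plain Borda; trusting voter 1 twice as much breaks the tie.
profile = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]
ranking, scores = weighted_borda(profile, [2, 1, 1])
print(ranking)  # ['X', 'Y', 'Z']
print(scores)   # {'X': 7.0, 'Y': 8.0, 'Z': 9.0}
```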
23- Condorcet winner algorithm
- It also comes from social choice theory. The
Condorcet algorithm says that any candidate that
can beat all other candidates in a head-to-head
contest (pairwise comparison) should win the
election.
24- Step 1: Construct the Condorcet graph.
- For each candidate pair (x,y), there is an
edge from x to y if x would receive at least as
many votes as y in a head-to-head contest.
- In the Condorcet graph, there is at least one
directed edge between every pair of candidates
(we call such a graph semi-complete).
- The graph may contain cycles. This is
due to the voting paradox of Condorcet voting.
25- Step 2: Form a new acyclic graph from the old
cyclic one by contracting all of the nodes in a
cycle into one. The result is the strongly connected
component graph (SCCG).
- A directed graph is strongly connected if for
any two nodes u and v, there are paths from u to
v and from v to u.
- Definition of strongly connected component (SCC):
- A strongly connected subgraph, S, of a
directed graph, D, such that no vertex or subset
of vertices of D can be added to S such that the
new subgraph is still strongly connected.
26- The graph is totally orderable at the level of
the SCCs, and each SCC is a pocket of cycles
within which all candidates are tied. (Why?)
- Step 3: A Condorcet-consistent Hamiltonian path
is any Hamiltonian path through the Condorcet graph.
- Definition (Hamiltonian path): a path between two
vertices of a graph that visits each vertex
exactly once.
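Steps 1 and 2 can be illustrated with a small sketch. It assumes every voter ranks all candidates, and it uses Kosaraju's algorithm to find the SCCs, which is a choice made for this sketch and not necessarily the procedure used in the cited paper. The function names are illustrative.

```python
from itertools import permutations

def condorcet_graph(profile):
    """Edge x -> y if x gets at least as many head-to-head votes as y."""
    alts = sorted(profile[0])
    pos = [{a: i for i, a in enumerate(r)} for r in profile]
    graph = {a: set() for a in alts}
    for x, y in permutations(alts, 2):
        x_votes = sum(1 for p in pos if p[x] < p[y])
        if 2 * x_votes >= len(profile):
            graph[x].add(y)
    return graph

def strongly_connected_components(graph):
    """Kosaraju's algorithm; SCCs come out in topological order of the SCCG."""
    order, seen = [], set()

    def visit(u):  # first pass: record finishing order on the original graph
        seen.add(u)
        for v in sorted(graph[u]):
            if v not in seen:
                visit(v)
        order.append(u)

    for u in sorted(graph):
        if u not in seen:
            visit(u)

    reverse = {u: set() for u in graph}
    for u in graph:
        for v in graph[u]:
            reverse[v].add(u)

    components, assigned = [], set()
    for u in reversed(order):  # second pass: search the reversed graph
        if u in assigned:
            continue
        stack, comp = [u], []
        assigned.add(u)
        while stack:
            w = stack.pop()
            comp.append(w)
            for v in reverse[w]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        components.append(sorted(comp))
    return components

borda_example = [["X", "Y", "Z", "W"], ["Y", "X", "W", "Z"], ["X", "W", "Z", "Y"]]
print(strongly_connected_components(condorcet_graph(borda_example)))
# [['X'], ['Y'], ['W'], ['Z']]
cyclic = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]
print(strongly_connected_components(condorcet_graph(cyclic)))
# [['X', 'Y', 'Z']]
```

Listing the SCCs in this order and arranging the candidates inside each SCC arbitrarily gives one ordering consistent with the SCC-level order; in the cyclic profile all three candidates fall into a single SCC and are tied.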
27- Theorem 1. Suppose x and y are nodes in a graph
g, and that X and Y are nodes of the associated
SCCG G such that $x \in X$ and $y \in Y$. If there
exists a path from X to Y in G, then every
Condorcet path of g has x before y.
- Refer to Javed A. Aslam, Mark Montague 2001
for the proof.
28Rank Aggregation in MetaSearch
- Here we discuss two cases that use
algorithms rooted in social choice theory for
metasearch rank aggregation.
- Data fusion track in TREC
- Javed A. Aslam, Mark Montague 2001, Models for
Metasearch, in SIGIR 2001
- Rank aggregation for web search engines
- Cynthia Dwork, Ravi Kumar, Moni Naor,
D. Sivakumar 2001, Rank Aggregation Methods for the Web, in WWW10
29Data fusion track in TREC
- TREC (Text Retrieval Conference, see
http://trec.nist.gov/) maintains about 6 GB of
SGML-tagged text, queries and their respective answers
for evaluation purposes.
- The TREC organizers distribute data sets in
advance and 50 new queries each year.
- The competing teams then submit the ranked lists of
documents that their systems returned in response to
each query, and these retrieval systems are
evaluated.
30- These ranked lists are available for metasearch
researchers to download and use.
- For each query, every retrieval system
returns its top 1000 documents, and relevance scores are
available.
- Given these results retrieved by many
different retrieval systems, how do we aggregate
them for better performance?
31Previous algorithms
- Min, Max and Average Models
- Fox and Shaw, 1995
- Linear Combination Model
- Bartell 1995
- Logistic Regression Model
32Example
- Min, Max and Average model
- The final score of each document d is based on
the scores given to d by each of the input systems
(voters).
  Algorithm   Final score
  CombMin     minimum of individual relevance scores
  CombMed     median of individual relevance scores
  CombMax     maximum of individual relevance scores
  CombSum     sum of individual relevance scores
  CombANZ     CombSum / number of non-zero relevance scores
  CombMNZ     CombSum × number of non-zero relevance scores
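The six combination rules can be computed for a single document as in the sketch below. The function name `comb_scores` is illustrative, and it is an assumption of this sketch that a system which did not retrieve the document contributes a score of 0.

```python
from statistics import median

def comb_scores(scores):
    """Fused scores for one document, given one relevance score per input system.

    Assumption: a score of 0 means the system did not retrieve the document.
    """
    nonzero = sum(1 for s in scores if s != 0)
    total = sum(scores)
    return {
        "CombMin": min(scores),
        "CombMed": median(scores),
        "CombMax": max(scores),
        "CombSum": total,
        "CombANZ": total / nonzero if nonzero else 0.0,
        "CombMNZ": total * nonzero,
    }

# One document scored by four input systems; the second system did not retrieve it.
result = comb_scores([0.5, 0.0, 0.25, 0.25])
print(result["CombSum"], result["CombMNZ"])  # 1.0 3.0
```

CombMNZ rewards documents retrieved by many systems, while CombANZ averages over only the systems that retrieved the document.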
33- Linear Combination Model (LC model)
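- The final score of document d is simply a
linear combination (each input system weighted differently) of
the normalized relevance scores given to the
document: $s(d) = \sum_i a_i \, s_i(d)$
- $a_i$: weight of input system i
- $s_i(d)$: relevance score given to d by input system i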
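A minimal sketch of the LC model follows. The slides do not say how scores are normalized, so min-max normalization per input system is an assumption here, as are the function names and the example weights.

```python
def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def linear_combination(system_scores, weights):
    """Fused score of each document: sum over systems of weight * normalized score."""
    fused = {}
    for scores, weight in zip(system_scores, weights):
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weight * s
    return fused

# Two hypothetical input systems; the second did not retrieve d3.
systems = [{"d1": 10, "d2": 5, "d3": 0}, {"d1": 1, "d2": 3}]
fused = linear_combination(systems, [0.75, 0.25])
print(fused)  # {'d1': 0.75, 'd2': 0.625, 'd3': 0.0}
print(sorted(fused, key=fused.get, reverse=True))  # ['d1', 'd2', 'd3']
```

In practice the weights would have to be learned from training data or set from prior knowledge about system quality.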
34Experiment result on TREC Model
- The performance of rank aggregation is evaluated
by average precision over the queries.
- Score-based Borda-fuse (LC model) is usually the
best method among the several Borda variant
algorithms.
- It is better than the best input system on most of
the data collections, such as TREC3 and TREC5.
35Experiment result II
- The performance of rank aggregation is evaluated
by average precision over the queries.
- Condorcet-fusion is the only algorithm that,
without training data, ever matches the
performance of the best input system on TREC 9.
- Condorcet-fusion seems particularly sensitive to
dependence among the input systems. If the input
systems (voters) are too similar, the performance
decreases.
36Rank aggregation methods for web
- New challenges: different from the case of TREC
data fusion,
- the coverage of the various search engines is
different.
- Thus some highly relevant web pages may not be
ranked by some search engines.
- Therefore, each voter ranks only a partial candidate
list.
37Preliminaries
- Given a universe U, an ordered list $\tau$ with
respect to U is an ordering of a subset $S \subseteq U$,
i.e., $\tau = [x_1 \geq x_2 \geq \cdots \geq x_d]$, with each
$x_i \in S$, and $\geq$ is some ordering
relation on S.
- If $\tau$ contains
all the elements in U, then it is said to be a
full list,
- otherwise it is called a partial list.
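38- Distance measures between two full lists with
respect to a set S
- The Kendall tau distance
- It counts the number of pairwise disagreements
between two lists.
- The distance is given by
$K(\sigma, \tau) = |\{(i, j) : i < j,\ \sigma(i) < \sigma(j) \text{ but } \tau(i) > \tau(j)\}|$,
where $\sigma(i)$ denotes the position of element i in $\sigma$.
- Normalize it by dividing by the maximum possible
value, $\binom{|S|}{2}$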
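The Kendall tau distance between two full lists can be computed directly from its definition, as in this sketch. The function names are illustrative, and both lists are assumed to contain the same elements (at least two of them for the normalized version).

```python
from itertools import combinations

def kendall_tau(sigma, tau):
    """Number of element pairs that the two lists order differently."""
    pos_s = {x: i for i, x in enumerate(sigma)}
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(
        1
        for a, b in combinations(sigma, 2)
        if (pos_s[a] - pos_s[b]) * (pos_t[a] - pos_t[b]) < 0
    )

def normalized_kendall_tau(sigma, tau):
    """Kendall tau distance divided by the number of pairs, n(n-1)/2."""
    n = len(sigma)
    return kendall_tau(sigma, tau) / (n * (n - 1) / 2)

print(kendall_tau(["a", "b", "c", "d"], ["b", "a", "d", "c"]))             # 2
print(normalized_kendall_tau(["a", "b", "c", "d"], ["d", "c", "b", "a"]))  # 1.0
```

This direct version examines every pair, so it takes quadratic time in the list length.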
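39- Spearman footrule distance
- Given two full lists $\sigma$ and $\tau$, with $\sigma(i)$ and $\tau(i)$
the positions of element i, the distance
is given by
$F(\sigma, \tau) = \sum_{i \in S} |\sigma(i) - \tau(i)|$
- Normalize it by dividing by the maximum value, $\lfloor |S|^2 / 2 \rfloor$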
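A short sketch of the Spearman footrule distance, assuming both lists are full lists over the same elements (at least two of them for the normalized version). The function names are illustrative.

```python
def footrule(sigma, tau):
    """Sum over all elements of the absolute difference of their positions."""
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(abs(i - pos_t[x]) for i, x in enumerate(sigma))

def normalized_footrule(sigma, tau):
    """Footrule distance divided by its maximum, floor(n^2 / 2)."""
    n = len(sigma)
    return footrule(sigma, tau) / (n * n // 2)

print(footrule(["a", "b", "c", "d"], ["b", "a", "d", "c"]))             # 4
print(normalized_footrule(["a", "b", "c", "d"], ["d", "c", "b", "a"]))  # 1.0
```

Unlike the pairwise count, this needs only one pass over the list.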
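40- Distance measures for more than 2 lists
- Given several full lists $\sigma, \tau_1, \ldots, \tau_k$,
for instance, the normalized footrule distance of
$\sigma$ to $\tau_1, \ldots, \tau_k$ is given by
$F(\sigma, \tau_1, \ldots, \tau_k) = \frac{1}{k} \sum_{i=1}^{k} F(\sigma, \tau_i)$
- If $\tau_1, \ldots, \tau_k$ are partial lists, let U
denote the union of the elements in $\tau_1, \ldots, \tau_k$
and let $\sigma$ be a full list with respect to U.
Considering the distance between $\tau_i$ and the
projection of $\sigma$ with respect to $\tau_i$, we have
the induced footrule distance
$F(\sigma, \tau_1, \ldots, \tau_k) = \frac{1}{k} \sum_{i=1}^{k} F(\sigma|_{\tau_i}, \tau_i)$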
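The induced footrule distance for partial lists can be sketched as follows. The names `project` and `induced_footrule` are illustrative, and the handling of lists with fewer than two elements (they contribute 0) is an assumption of this sketch.

```python
def footrule(sigma, tau):
    """Sum of position differences between two lists over the same elements."""
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(abs(i - pos_t[x]) for i, x in enumerate(sigma))

def project(sigma, tau):
    """Keep only the elements of sigma that occur in tau, in sigma's order."""
    members = set(tau)
    return [x for x in sigma if x in members]

def induced_footrule(sigma, partial_lists):
    """Average normalized footrule distance between each list and sigma's projection onto it."""
    total = 0.0
    for tau in partial_lists:
        max_value = len(tau) ** 2 // 2
        if max_value:  # lists with fewer than 2 elements contribute 0
            total += footrule(project(sigma, tau), tau) / max_value
    return total / len(partial_lists)

sigma = ["a", "b", "c", "d"]
lists = [["b", "a"], ["a", "c", "d"]]
print(induced_footrule(sigma, lists))  # 0.5
```

Here the first list disagrees completely with the projection of `sigma` (distance 1) and the second agrees completely (distance 0), so the average is 0.5.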
41Optimal rank aggregation
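- The question is:
- Given (full or partial) lists $\tau_1, \ldots, \tau_k$,
find a $\sigma$ such that $\sigma$ is a full list with
respect to the union of the elements of $\tau_1, \ldots, \tau_k$, and
- $\sigma$ minimizes $d(\sigma, \tau_1, \ldots, \tau_k)$ for the chosen distance $d$.
- The aggregation obtained by optimizing the Kendall
distance is called Kemeny optimal aggregation.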
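For very small inputs, a Kemeny optimal aggregation can be found by trying every permutation, as in the sketch below. This brute-force search is only an illustration of the definition; it is not a practical algorithm, and the function names are made up for this sketch.

```python
from itertools import combinations, permutations

def induced_kendall(sigma, tau):
    """Pairs of tau that sigma orders the other way round (tau may be partial)."""
    pos = {x: i for i, x in enumerate(sigma)}
    return sum(1 for a, b in combinations(tau, 2) if pos[a] > pos[b])

def kemeny_cost(sigma, lists):
    """Total Kendall distance from sigma to all input lists."""
    return sum(induced_kendall(sigma, tau) for tau in lists)

def kemeny_optimal(lists):
    """Exhaustive search over all orderings of the union of the input lists."""
    universe = sorted(set().union(*lists))
    return min(permutations(universe), key=lambda s: kemeny_cost(s, lists))

lists = [["X", "Y", "Z", "W"], ["Y", "X", "W", "Z"], ["X", "W", "Z", "Y"]]
best = kemeny_optimal(lists)
print(best, kemeny_cost(best, lists))  # ('X', 'Y', 'W', 'Z') 4
```

The search visits all n! orderings, so it is usable only for a handful of candidates.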
42- Computing the Kemeny optimal
aggregation is NP-hard, even for $k = 4$ input lists.
- (Please refer to Cynthia Dwork, Ravi Kumar,
Moni Naor, D. Sivakumar 2001 for the detailed proof.)
- We can use the Spearman footrule distance to
approximate the Kendall distance.
43LCS approach (My own method)
- Given m lists
- $l_{1,1}, l_{1,2}, \ldots, l_{1,n_1}$
- $l_{2,1}, l_{2,2}, \ldots, l_{2,n_2}$
- $l_{3,1}, l_{3,2}, \ldots, l_{3,n_3}$
- ...
- $l_{m,1}, l_{m,2}, \ldots, l_{m,n_m}$,
- find a longest common subsequence of these
lists.
44LCS approach (My own method)
- LCS is NP-hard for m sequences if some elements
appear twice in a sequence.
- For the lists obtained from search engines, each
document appears at most once.
- There exists an efficient algorithm to solve the
problem for this special case.
- Assume $n_i = n_j$ for $i, j = 1, 2, \ldots, m$.
45Efficient algorithm for LCS of m sequences
- Fix the order of the first sequence as
- $1, 2, \ldots, n_1$.
- Define d(i) to be the length of the LCS of
the elements $1, 2, \ldots, i$ that contains i in
the LCS.
46Computation of d(i,1) and d(i,2)
- $d(i) = \max_k \{d(k) + 1\}$ over all k such that k comes
before i in all the m lists (if no such k
exists, d(i) = 1).
- The length of the LCS is $\max_i d(i)$ for $i = 1, 2, \ldots, n_1$.
- A backtracking process can give the LCS.
47An Example
- l1 = 1,2,3,4,5,6,7,8,9,10
- l2 = 2,1,3,4,5,6,7,9,8,10
- l3 = 2,3,5,4,1,6,7,8,9,10
- l4 = 2,3,5,4,6,1,7,8,9,10
- d(1) = 1, d(2) = 1, d(3) = d(2) + 1 = 2.
- d(4) = d(3) + 1 = 3, d(5) = d(3) + 1 = 3.
- d(6) = d(5) + 1 = 4, d(7) = d(6) + 1 = 5.
- d(8) = d(7) + 1 = 6.
- d(9) = d(7) + 1 = 6.
- d(10) = d(9) + 1 = 7.
- The final length is 7. The LCS is 2,3,4,6,7,8,10.
- 2,3,4,6,7,9,10 is an LCS, too.
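The recurrence for d(i) and the backtracking step can be sketched as below, assuming every list contains the same elements exactly once. The function name is illustrative, and l4 is taken here as 2,3,5,4,6,1,7,8,9,10 so that each element appears once.

```python
def lcs_of_permutations(lists):
    """Longest common subsequence of lists that are permutations of one another.

    Returns (lcs, d), where d[x] is the length of the longest common
    subsequence that ends with element x.
    """
    first = lists[0]
    positions = [{x: i for i, x in enumerate(l)} for l in lists]

    def before(a, b):
        # True if a comes before b in every list.
        return all(p[a] < p[b] for p in positions)

    d, prev = {}, {}
    for i, x in enumerate(first):
        d[x], prev[x] = 1, None
        for k in first[:i]:
            if before(k, x) and d[k] + 1 > d[x]:
                d[x], prev[x] = d[k] + 1, k

    # Backtrack from an element with the largest d value.
    end = max(first, key=lambda x: d[x])
    lcs = []
    while end is not None:
        lcs.append(end)
        end = prev[end]
    return lcs[::-1], d

lists = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [2, 1, 3, 4, 5, 6, 7, 9, 8, 10],
    [2, 3, 5, 4, 1, 6, 7, 8, 9, 10],
    [2, 3, 5, 4, 6, 1, 7, 8, 9, 10],
]
lcs, d = lcs_of_permutations(lists)
print(lcs)  # [2, 3, 4, 6, 7, 8, 10]
```

Each of the n elements is compared with its predecessors in all m lists, giving roughly O(m n^2) time, which is why the no-repetition case is tractable.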
48When the $n_i$'s are different
- We delete those elements that are absent
from some sequence.
- Example: l1 = 1, 2, 3, 4, 5, 6
- l2 = 2, 1, 5, 4, 6
- l3 = 2, 3, 4, 5, 6
- l4 = 1, 4, 3, 5, 6
- Since 1 is not in l3, 2 is not in l4 and 3 is
not in l2,
- we can compute the LCS for
- l1 = 4, 5, 6
- l2 = 5, 4, 6
- l3 = 4, 5, 6
- l4 = 4, 5, 6
- The final result is 4, 6.
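The deletion step can be sketched as a simple filter followed by the same dynamic program. Both function names are illustrative, and the compact LCS routine is repeated here only so that the example runs on its own.

```python
def restrict_to_common(lists):
    """Drop every element that is missing from at least one list."""
    common = set(lists[0]).intersection(*lists[1:])
    return [[x for x in l if x in common] for l in lists]

def lcs_of_permutations(lists):
    """LCS of lists that contain the same elements, each exactly once."""
    first = lists[0]
    positions = [{x: i for i, x in enumerate(l)} for l in lists]
    d, prev = {}, {}
    for i, x in enumerate(first):
        d[x], prev[x] = 1, None
        for k in first[:i]:
            if all(p[k] < p[x] for p in positions) and d[k] + 1 > d[x]:
                d[x], prev[x] = d[k] + 1, k
    end = max(first, key=lambda x: d[x]) if first else None
    lcs = []
    while end is not None:
        lcs.append(end)
        end = prev[end]
    return lcs[::-1]

lists = [
    [1, 2, 3, 4, 5, 6],
    [2, 1, 5, 4, 6],
    [2, 3, 4, 5, 6],
    [1, 4, 3, 5, 6],
]
reduced = restrict_to_common(lists)
print(reduced)                       # [[4, 5, 6], [5, 4, 6], [4, 5, 6], [4, 5, 6]]
print(lcs_of_permutations(reduced))  # [4, 6]
```

Note that 5, 6 is a common subsequence of the same length; which one is returned depends on the tie-breaking in the loop.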