MetaSearch Engine - PowerPoint PPT Presentation

About This Presentation
Title:

MetaSearch Engine

Description:

Title: String Matching Allowing Errors Author: Dept of Computer Science Last modified by: cswangl Created Date: 3/17/2004 2:48:04 AM Document presentation format – PowerPoint PPT presentation

Slides: 49
Provided by: DeptofCom6

Transcript and Presenter's Notes

Title: MetaSearch Engine


1
Lecture 9 Rank Aggregation in MetaSearch
  • MetaSearch Engine
  • Social Choice Rules
  • Rank Aggregation

2
Choices of Search Engines
  • Many search engines exist to compete for users
  • The results are not necessarily the same
  • Different users prefer different search engines
  • Search results may, in the future, be biased
    towards paid advertisements.

3
MetaSearch Engine
  • Metasearch engines are designed to increase the
    coverage of the web by forwarding users' queries
    to multiple search engines.
  • Users' requests are sent to multiple search
    engines such as AlltheWeb, Google, MSN.
  • Then the results from the individual search
    engines are combined into a single result set to
    present to users.

4
Different Forms of MetaSearch
  • Submit different representations of the same
    query to the same search engine, then combine the
    results.
  • Submit the same query to several search engines
    that adopt different information retrieval
    models, then combine the results.

5
Issues
  • How to combine the results retrieved by different
    source search engines is crucial for the success
    of a metasearch engine.
  • And this is the problem that social choice theory
    has been trying to answer.

6
Search Engine Watch
  • Interesting metasearch engines are listed at
  • http://www.searchenginewatch.com/links/article.php/2156241

7
Social Choice Theory
  • Studies protocols that help a group of people
    make collective decisions, such as voting.

8
A Fundamental problem
  • Given a collection of agents (voters)
  • with preferences over different alternatives
    (allocations, outcomes),
  • how should society evaluate these alternatives
    and make a decision for all
  • that may be for the will of some voters but
    against that of others.

9
Applications
  • Voters elect a president from several candidates.
  • National polls for economic or political policy
    of the government
  • The procedure or rule of election
  • The ranking of a metasearch engine obtained from
    those of the search engines

10
Group Decisions
  • How do we make decisions?
  • Flip a coin?
  • Dictatorship?
  • Democracy (Majority rule)?

11
Group Decision Rules
  • Majority rule
  • Condorcet paradox (voting cycle)
  • Borda rule

12
Mathematical model
  • A set of voters $V = \{v_1, v_2, v_3, \dots, v_n\}$
  • A set of alternatives or outcomes
    $S = \{s_1, s_2, s_3, \dots, s_m\}$, with $|S| = m$, and
  • A set of preference relations
    $P = \{R_1, R_2, R_3, \dots, R_n\}$,
    called a preference profile,
  • the preference relation $R_i$ for each voter i is a
    permutation (order) of elements in S.

13
Example 1 Majority Rule
  • 3 rational people have rational preferences over
    2 alternatives x,y
  • Preferences by person:
              Person 1   Person 2   Person 3
      1st        X          Y          X
      2nd        Y          X          Y
  • i.e. Person 1: X > Y, Person 2: Y > X,
    Person 3: X > Y
  • How to aggregate their preferences? How to choose?

14
  • Using majority rule:
  • Since more than half of the people (two out of
    three) prefer x to y,
  • the group prefers x to y.

15
Example 2 Condorcet Paradox
  • 3 rational people have rational preferences over
    3 alternatives x,y,z
  • Preferences by person:
              Person 1   Person 2   Person 3
      1st        X          Y          Z
      2nd        Y          Z          X
      3rd        Z          X          Y
  • i.e. Person 1: X > Y > Z, Person 2: Y > Z > X,
    Person 3: Z > X > Y

16
Binary/paired Comparison With Majority rule
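  • Preferences by person:
              Person 1   Person 2   Person 3
      1st        X          Y          Z
      2nd        Y          Z          X
      3rd        Z          X          Y
  • For (x,y): Person 1: X > Y, Person 2: Y > X,
    Person 3: X > Y, so the majority gives X > Y.
  • Similarly, for (Y,Z) we get Y > Z, and for (Z,X)
    we get Z > X.
  • Then X > Y > Z > X (cycling): the group preference
    is intransitive, hence not rational.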
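The paired comparison above can be sketched in Python. This is a minimal illustration, assuming every person ranks every alternative; the function name `pairwise_majority` is hypothetical and not part of the lecture.

```python
from itertools import combinations

def pairwise_majority(profile):
    """Return the set of (a, b) pairs where a strict majority ranks a above b."""
    alternatives = profile[0]
    wins = set()
    for a, b in combinations(alternatives, 2):
        # Count the voters who place a before b in their ranking.
        a_votes = sum(1 for ranking in profile
                      if ranking.index(a) < ranking.index(b))
        b_votes = len(profile) - a_votes
        if a_votes > b_votes:
            wins.add((a, b))
        elif b_votes > a_votes:
            wins.add((b, a))
    return wins

# Preference profile of Example 2: one ranking per person, best first.
profile = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]
wins = pairwise_majority(profile)
print(sorted(wins))  # X beats Y, Y beats Z, Z beats X: a cycle
```

Each alternative is beaten by some other alternative, so no ranking is consistent with all three majority decisions.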

17
  • It was noted by Condorcet in the 18th century
    that, in such a cycle, no alternative can win a
    majority against all other alternatives.
  • Pairwise majority is not satisfactory in all
    cases.

18
Example 3 Borda Rule
  • For each voter,
  • associate the number 1 with the most preferred
    alternative,
  • 2 with the second and so on.
  • Assign to each alternative the number equal to
    the sum of the numbers the individual voters
    assigned to the alternative.
  • The alternative with the smallest sum is ranked
    first.

19
  • Preferences by person (Borda number in
    parentheses):
              Person 1   Person 2   Person 3
      1st       X(1)       Y(1)       X(1)
      2nd       Y(2)       X(2)       W(2)
      3rd       Z(3)       W(3)       Z(3)
      4th       W(4)       Z(4)       Y(4)
  • Sums: X(4), Y(7), Z(10), W(9)
  • Then we get the choice X > Y > W > Z
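The Borda sums in this example can be reproduced with a short Python sketch. It assumes every voter ranks all alternatives; the function name `borda_ranking` is hypothetical, and ties are left in first-seen order, which is an arbitrary choice.

```python
def borda_ranking(profile):
    """Sum each alternative's positions (1 = most preferred).

    Smaller sums rank higher; ties keep first-seen order.
    """
    totals = {}
    for ranking in profile:
        for position, alternative in enumerate(ranking, start=1):
            totals[alternative] = totals.get(alternative, 0) + position
    order = sorted(totals, key=lambda a: totals[a])
    return totals, order

# The three persons' rankings from the example, best first.
profile = [["X", "Y", "Z", "W"], ["Y", "X", "W", "Z"], ["X", "W", "Z", "Y"]]
totals, order = borda_ranking(profile)
print(totals)  # X: 4, Y: 7, Z: 10, W: 9
print(order)   # X, Y, W, Z
```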

20
  • For the above example, if we use binary/paired
    comparison with majority rule, we get
  • X > Y in 2 out of 3, Y > W in 2 out of 3,
  • W > Z in 2 out of 3, X > W in 3 out of 3,
  • X > Z in 3 out of 3, Y > Z in 2 out of 3
  • Then we obtain the same choice
  • X > Y > W > Z

21
  • For Example 2 (the Condorcet paradox), where
    majority rule via binary/paired comparison ran
    into trouble, the Borda rule gives a tie among
    all three alternatives.
  • All three alternatives get a sum of 6.

22
  • Some variations
  • 1. With relevance scores available:
    allot each input system p points to be
    distributed among the documents according to
    their relevance scores.
  • 2. Weighted Borda rule:
    voters may not contribute equally to the final
    result. We may give more weight to good-quality
    input systems.

23
  • Condorcet winner algorithm
  • It also comes from social choice theory. The
    Condorcet algorithm says that any candidate that
    can beat all other candidates in a head-to-head
    contest (pair-wise comparison) should win the
    election.

24
  • Step 1, Construct the Condorcet graph.
  • For each candidate pair (x,y), there exists an
    edge from x to y if x would receive at least as
    many votes as y in a head-to-head contest.
  • In the Condorcet graph, there is at least one
    directed edge between every pair of candidates
    (we call such a graph semi-complete).
  • The graph may contain cycles. This is due to
    the voting paradox of Condorcet voting.
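Step 1 can be sketched in Python as follows. This is only an illustration, assuming each voter supplies a full ranking of the same candidates; the function name `condorcet_graph` and the edge-set representation are assumptions, not taken from the lecture.

```python
from itertools import permutations

def condorcet_graph(profile):
    """Edge (x, y) is present iff x gets at least as many
    head-to-head votes as y."""
    candidates = profile[0]
    edges = set()
    for x, y in permutations(candidates, 2):
        x_votes = sum(1 for r in profile if r.index(x) < r.index(y))
        y_votes = len(profile) - x_votes
        if x_votes >= y_votes:
            edges.add((x, y))
    return edges

# Three voters with cyclic preferences: the graph contains a cycle.
cycle_profile = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]
cycle_edges = condorcet_graph(cycle_profile)
print(sorted(cycle_edges))
```

With an even number of voters a pair can be tied, in which case both directed edges are added; this is why the graph is semi-complete rather than a tournament.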

25
  • Step 2, We form a new acyclic graph from the old
    cyclic one by contracting all of the nodes in a
    cycle into one. The result is the strongly
    connected component graph (SCCG).
  • A directed graph is strongly connected if for
    any two nodes u and v, there are paths from u to
    v and from v to u.
  • Definition of strongly connected component (SCC):
    a strongly connected subgraph, S, of a
    directed graph, D, such that no vertex or subset
    of vertices of D can be added to S such that the
    new subgraph is still strongly connected.
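A minimal Python sketch of finding the SCCs needed for Step 2. Tarjan's algorithm is used here as one standard choice, which the lecture does not name; the function name and the four-candidate graph are illustrative assumptions.

```python
def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm. Returns the SCCs in reverse topological
    order of the contracted (SCC) graph."""
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in nodes:
        if v not in index:
            visit(v)
    return sccs

# Hypothetical graph: A beats everyone; B, C, D form a cycle.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "C"), ("C", "D"), ("D", "B")}
# Reversing gives the SCCs in topological order (winners first).
ordered = [sorted(c) for c in reversed(strongly_connected_components(nodes, edges))]
print(ordered)
```

The recursion depth grows with the number of candidates, so a very large graph would need an iterative variant.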

26
  • The graph is totally orderable at the level of
    the SCCs, and each SCC is a pocket of cycles
    within which the candidates are tied. (Why?)
  • Step 3, The Condorcet-consistent Hamiltonian path
    is any Hamiltonian path through the Condorcet
    graph.
  • Definition (Hamiltonian path): a path between two
    vertices of a graph that visits each vertex
    exactly once.
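One way to obtain such a Hamiltonian path is by repeated insertion, sketched below in Python. The sketch assumes the graph is semi-complete (at least one directed edge between every pair of candidates); the function name `condorcet_path` and the example graph are hypothetical.

```python
def condorcet_path(candidates, edges):
    """Build a Hamiltonian path in a semi-complete directed graph.

    Each new node v is inserted before the first path node it has an
    edge to. Every earlier path node u then lacks the edge v -> u, so
    by semi-completeness the edge u -> v exists and the path stays valid.
    """
    path = []
    for v in candidates:
        for i, u in enumerate(path):
            if (v, u) in edges:
                path.insert(i, v)
                break
        else:
            path.append(v)  # v has no edge to any path node
    return path

# Hypothetical graph: A beats everyone; B, C, D form a cycle.
edges = {("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "C"), ("C", "D"), ("D", "B")}
path = condorcet_path(["B", "C", "D", "A"], edges)
print(path)
```

Different input orders can give different paths, but only the order of candidates inside a cycle changes.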

27
  • Theorem 1. Suppose x and y are nodes in a graph
    g, and that X and Y are nodes of the associated
    SCCG G such that $x \in X$ and $y \in Y$. If there
    exists a path from X to Y in G, then every
    Condorcet path of g has x before y.
  • Refer to [Javed A. Aslam, Mark Montague 2001]
    for the proof.

28
Rank Aggregation in MetaSearch
  • Here we discuss two cases that use algorithms
    rooted in social choice theory for metasearch
    rank aggregation.
  • Data fusion track in TREC
  • Javed A. Aslam, Mark Montague 2001, Models for
    Metasearch, in SIGIR 2001
  • Rank aggregation for web search engines
  • Cynthia Dwork, Ravi Kumar, Moni Naor,
    D. Sivakumar 2001, Rank Aggregation Methods for
    the Web, in WWW10

29
Data fusion track in TREC
  • TREC (Text Retrieval Conference, see
    http://trec.nist.gov/) maintains about 6 GB of
    SGML-tagged text, queries and respective answers
    for evaluation purposes.
  • The TREC organizers distribute data sets in
    advance and 50 new queries each year.
  • The competing teams then submit the ranked lists
    of documents that their systems returned in
    response to each query. These retrieval systems
    are then evaluated.

30
  • These ranked lists are available for metasearch
    researchers to download and use.
  • For each query, every retrieval system returns
    its top 1000 documents, and relevance scores are
    available.
  • Given these results retrieved by many different
    retrieval systems, how can we aggregate them for
    better performance?

31
Previous algorithms
  • Min, Max and Average Models
    (Fox and Shaw, 1995)
  • Linear Combination Model
    (Bartell, 1995)
  • Logistic Regression Model

32
Example
  • Min, Max and Average model
  • The final score of each document d is based on
    the scores given to d by each input system
    (voter).
  • Algorithm   Final score
    CombMin     minimum of individual relevance scores
    CombMed     median of individual relevance scores
    CombMax     maximum of individual relevance scores
    CombSum     sum of individual relevance scores
    CombANZ     CombSum / number of non-zero relevance scores
    CombMNZ     CombSum × number of non-zero relevance scores
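The six combination rules can be written down directly in Python. This sketch assumes the scores for one document are given as a list with one entry per input system, and that a score of 0 means the system did not retrieve the document; the function name `comb_scores` is hypothetical.

```python
from statistics import median

def comb_scores(scores):
    """Fuse one document's relevance scores (one per input system)."""
    nonzero = sum(1 for s in scores if s != 0)
    total = sum(scores)
    return {
        "CombMin": min(scores),
        "CombMed": median(scores),
        "CombMax": max(scores),
        "CombSum": total,
        # Average over the systems that actually scored the document.
        "CombANZ": total / nonzero if nonzero else 0.0,
        # Reward documents that were scored by many systems.
        "CombMNZ": total * nonzero,
    }

# Hypothetical scores from four input systems for one document.
fused = comb_scores([0.5, 0.0, 0.25, 0.75])
print(fused)
```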

33
  • Linear Combination Model (LC model)
  • The final score of document d is a linear
    combination of the normalized relevance scores
    given to d, with each input system weighted
    differently:
    $s(d) = \sum_i a_i \, s_i(d)$
  • $a_i$: weight of input system i
  • $s_i(d)$: normalized relevance score given to d
    by input system i
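A small Python sketch of the linear combination. The slides do not say how scores are normalized, so min-max normalization is assumed here; the function names, weights and document ids are also illustrative assumptions.

```python
def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def lc_fuse(systems, weights):
    """Final score of d = sum over systems i of weight_i * score_i(d)."""
    fused = {}
    for scores, weight in zip(systems, weights):
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weight * s
    return fused

# Two hypothetical input systems; the second did not retrieve d3.
systems = [{"d1": 10, "d2": 5, "d3": 0}, {"d1": 1, "d2": 3}]
fused = lc_fuse(systems, weights=[0.75, 0.25])
print(sorted(fused, key=fused.get, reverse=True))
```

A document missing from a system simply receives no contribution from it, which is one of several possible conventions.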

34
Experiment result on TREC Model
  • The performance of rank aggregation is evaluated
    by average precision over the queries.
  • Score-based Borda-fuse (LC model) is usually the
    best method among the several Borda variants.
  • It is better than the best input system on most
    of the data collections, such as TREC 3 and
    TREC 5.

35
Experiment result II
  • The performance of rank aggregation is evaluated
    by average precision over the queries.
  • Condorcet-fusion is the only algorithm that,
    without training data, ever matches the
    performance of the best input system on TREC 9.
  • Condorcet-fusion seems particularly sensitive to
    the dependence among input systems. If the input
    systems (voters) are too similar, the performance
    decreases.

36
Rank aggregation methods for web
  • New challenges, different from the case of TREC
    data fusion:
  • The coverage of the various search engines
    differs.
  • Thus some highly relevant web pages may not be
    ranked by some search engines.
  • Therefore, each voter ranks only a partial
    candidate list.

37
Preliminaries
  • Given a universe U, an ordered list $\tau$ with
    respect to U is an ordering of a subset
    $S \subseteq U$, i.e.,
    $\tau = [x_1 \geq x_2 \geq \dots \geq x_d]$, with each
    $x_i \in S$, and $\geq$ is some ordering relation
    on S.
  • If $\tau$ contains all the elements in U, then it
    is said to be a full list,
  • otherwise it is called a partial list.

38
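  • Distance measures between two full lists $\sigma$
    and $\tau$ with respect to a set S, where
    $\sigma(i)$ denotes the position of element i in
    $\sigma$
  • The Kendall tau distance
  • It counts the number of pairwise disagreements
    between two lists.
  • The distance is given by
    $K(\sigma, \tau) = |\{\{i, j\} \subseteq S : (\sigma(i) - \sigma(j))(\tau(i) - \tau(j)) < 0\}|$
  • Normalize it by dividing by the maximum possible
    value $\binom{|S|}{2}$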
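The Kendall tau distance can be computed by checking every pair, as in this Python sketch. It assumes both lists are full lists over the same set with at least two elements; the function names are hypothetical.

```python
from itertools import combinations

def kendall_tau(sigma, tau):
    """Number of element pairs that the two lists order differently."""
    pos_s = {x: i for i, x in enumerate(sigma)}
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(
        1
        for a, b in combinations(sigma, 2)
        if (pos_s[a] - pos_s[b]) * (pos_t[a] - pos_t[b]) < 0
    )

def normalized_kendall_tau(sigma, tau):
    """Divide by the number of pairs, n * (n - 1) / 2."""
    n = len(sigma)
    return kendall_tau(sigma, tau) / (n * (n - 1) / 2)

print(kendall_tau(["a", "b", "c"], ["b", "a", "c"]))          # one swapped pair
print(normalized_kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))     # reversed list
```

This pairwise check takes quadratic time in the list length, which is fine for the top results of a few search engines.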

39
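  • Spearman footrule distance
  • Given two full lists $\sigma$ and $\tau$, the
    distance is given by
    $F(\sigma, \tau) = \sum_{i \in S} |\sigma(i) - \tau(i)|$,
    where $\sigma(i)$ is the position of element i in
    $\sigma$
  • Normalize it by dividing by the maximum value
    $\lfloor |S|^2 / 2 \rfloor$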
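A Python sketch of the Spearman footrule distance, assuming both lists are full lists over the same set with at least two elements. The normalizer used is floor(n²/2), the largest value the footrule distance can take on n elements; the function names are hypothetical.

```python
def footrule(sigma, tau):
    """Sum over elements of the absolute difference in position."""
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(abs(i - pos_t[x]) for i, x in enumerate(sigma))

def normalized_footrule(sigma, tau):
    """Divide by floor(n^2 / 2), reached when one list reverses the other."""
    n = len(sigma)
    return footrule(sigma, tau) / (n * n // 2)

print(footrule([1, 2, 3, 4], [4, 3, 2, 1]))
print(normalized_footrule(["a", "b", "c"], ["b", "a", "c"]))
```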

40
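  • Distance measures for more than 2 lists
  • Given several full lists
    $\sigma, \tau_1, \dots, \tau_k$, for instance, the
    normalized footrule distance of $\sigma$ to
    $\tau_1, \dots, \tau_k$ is given by
    $F(\sigma, \tau_1, \dots, \tau_k) = \frac{1}{k} \sum_{i=1}^{k} F(\sigma, \tau_i)$
  • If $\tau_1, \dots, \tau_k$ are partial lists, let U
    denote the union of elements in
    $\tau_1, \dots, \tau_k$ and let $\sigma$ be a full
    list with respect to U. Considering the distance
    between $\tau_i$ and the projection
    $\sigma_{|\tau_i}$ of $\sigma$ with respect to
    $\tau_i$, we have the induced footrule distance
    $F(\sigma, \tau_1, \dots, \tau_k) = \frac{1}{k} \sum_{i=1}^{k} F(\sigma_{|\tau_i}, \tau_i)$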
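The induced footrule distance for partial lists can be sketched in Python as follows. The sketch assumes each partial list contains only elements of the full list, and it treats a one-element list as contributing distance 0; both function names are hypothetical.

```python
def footrule(sigma, tau):
    """Spearman footrule distance between two lists over the same elements."""
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(abs(i - pos_t[x]) for i, x in enumerate(sigma))

def induced_footrule(sigma, partial_lists):
    """Average normalized footrule distance between each partial list
    and the projection of sigma onto that list's elements."""
    total = 0.0
    for tau in partial_lists:
        n = len(tau)
        if n < 2:
            continue  # a single-element list cannot disagree with sigma
        members = set(tau)
        projection = [x for x in sigma if x in members]
        total += footrule(projection, tau) / (n * n // 2)
    return total / len(partial_lists)

# sigma ranks all of U = {a, b, c, d}; each engine ranks only part of U.
sigma = ["a", "b", "c", "d"]
distance = induced_footrule(sigma, [["a", "b", "c"], ["d", "a"]])
print(distance)
```

Here the first list agrees with the projection (distance 0) and the second is fully reversed (distance 1), so the average is 0.5.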

41
Optimal rank aggregation
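  • The question is
  • Given (full or partial) lists
    $\tau_1, \dots, \tau_k$, find a $\sigma$ such that
    $\sigma$ is a full list with respect to the union
    of the elements of $\tau_1, \dots, \tau_k$
  • and $\sigma$ minimizes
    $d(\sigma, \tau_1, \dots, \tau_k)$ for the chosen
    distance measure d
  • The aggregation obtained by optimizing the Kendall
    distance is called Kemeny optimal aggregation.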
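For very small inputs, a Kemeny optimal aggregation can be found by trying every permutation, as in the Python sketch below. It is only an illustration: the running time grows factorially with the number of elements, and the objective used is the un-normalized sum of Kendall distances to the projections, which is an assumption. The function names are hypothetical.

```python
from itertools import combinations, permutations

def kendall_tau(sigma, tau):
    """Number of pairs ordered differently by two lists over the same elements."""
    pos_s = {x: i for i, x in enumerate(sigma)}
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(
        1
        for a, b in combinations(sigma, 2)
        if (pos_s[a] - pos_s[b]) * (pos_t[a] - pos_t[b]) < 0
    )

def kemeny_optimal(lists):
    """Exhaustive search over all full lists on the union of elements."""
    universe = sorted(set().union(*lists))

    def cost(sigma):
        total = 0
        for tau in lists:
            members = set(tau)
            projection = [x for x in sigma if x in members]
            total += kendall_tau(projection, tau)
        return total

    return min(permutations(universe), key=cost)

lists = [["X", "Y", "Z", "W"], ["Y", "X", "W", "Z"], ["X", "W", "Z", "Y"]]
best = kemeny_optimal(lists)
print(best)
```

When several permutations share the minimum cost, `min` returns the first one in lexicographic order of the sorted universe.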

42
  • When k ≥ 4 (k is the number of input lists),
    computing the Kemeny optimal aggregation is
    NP-hard.
  • (Please refer to [Cynthia Dwork, Ravi Kumar,
    Moni Naor, D. Sivakumar 2001] for a detailed
    proof.)
  • We can use the Spearman footrule distance to
    approximate the Kendall distance.

43
LCS approach (My own method)
  • Given m lists
  • $l_{1,1}, l_{1,2}, \dots, l_{1,n_1}$
  • $l_{2,1}, l_{2,2}, \dots, l_{2,n_2}$
  • $l_{3,1}, l_{3,2}, \dots, l_{3,n_3}$
  • ...
  • $l_{m,1}, l_{m,2}, \dots, l_{m,n_m}$,
  • Find a longest common subsequence (LCS) for these
    lists.

44
LCS approach (My own method)
  • Finding an LCS of m sequences is NP-hard if
    elements may appear more than once in a sequence.
  • For the lists obtained by search engines, each
    document appears at most once.
  • There exists an efficient algorithm to solve the
    problem for this special case.
  • Assume $n_i = n_j$ for $i, j = 1, 2, \dots, m$.

45
Efficient algorithm for LCS of m sequences
  • Fix the order of the first sequence as
    $1, 2, \dots, n_1$.
  • Define d(i) to be the length of the LCS of
    the elements $1, 2, \dots, i$ that contains i.

46
Computation of d(i)
  • $d(i) = \max_k \{ d(k) + 1 \}$ over all k such that
    k is before i in all the m lists (if no such k
    exists, d(i) = 1).
  • The length of the LCS is max d(i) for
    $i = 1, 2, \dots, n_1$.
  • A backtracking process can give the LCS.

47
An Example
  • l1 = 1,2,3,4,5,6,7,8,9,10
  • l2 = 2,1,3,4,5,6,7,9,8,10
  • l3 = 2,3,5,4,1,6,7,8,9,10
  • l4 = 2,3,5,4,6,1,7,8,9,10
  • d(1) = 1, d(2) = 1, d(3) = d(2) + 1 = 2.
  • d(4) = d(3) + 1 = 3, d(5) = d(3) + 1 = 3.
  • d(6) = d(5) + 1 = 4, d(7) = d(6) + 1 = 5.
  • d(8) = d(7) + 1 = 6.
  • d(9) = d(7) + 1 = 6.
  • d(10) = d(9) + 1 = 7.
  • The final length is 7. The LCS is
    2,3,4,6,7,8,10
  • 2,3,4,6,7,9,10 is an LCS, too.
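The recurrence for d(i) with backtracking can be sketched in Python. The sketch assumes all lists are permutations of the same elements, each appearing exactly once; for that reason l4 is written here with the element 7 appearing only once, after 1. The function name is hypothetical.

```python
def lcs_of_permutations(lists):
    """Longest common subsequence of lists that are permutations of
    the same elements, following the order of the first list."""
    first = lists[0]
    n = len(first)
    if n == 0:
        return []
    positions = [{x: i for i, x in enumerate(lst)} for lst in lists]
    d = [1] * n        # d[i]: length of the best LCS ending at first[i]
    prev = [None] * n  # predecessor index for backtracking
    for i in range(n):
        for k in range(i):
            before_everywhere = all(
                pos[first[k]] < pos[first[i]] for pos in positions
            )
            if before_everywhere and d[k] + 1 > d[i]:
                d[i] = d[k] + 1
                prev[i] = k
    end = max(range(n), key=lambda i: d[i])
    result = []
    while end is not None:  # backtrack from the best end point
        result.append(first[end])
        end = prev[end]
    return result[::-1]

lists = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [2, 1, 3, 4, 5, 6, 7, 9, 8, 10],
    [2, 3, 5, 4, 1, 6, 7, 8, 9, 10],
    [2, 3, 5, 4, 6, 1, 7, 8, 9, 10],  # assumed: 7 appears once
]
lcs = lcs_of_permutations(lists)
print(len(lcs), lcs)
```

With m lists of n elements this takes on the order of m·n² comparisons. Several subsequences can share the maximum length, and the sketch returns only one of them.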

48
When the $n_i$ are different
  • We delete those elements that are absent
    in some sequence.
  • Example: l1 = 1, 2, 3, 4, 5, 6
  • l2 = 2, 1, 5, 4, 6
  • l3 = 2, 3, 4, 5, 6
  • l4 = 1, 4, 3, 5, 6
  • Since 1 is not in l3, 2 is not in l4 and 3 is
    not in l2,
  • we can compute the LCS for
  • l1 = 4, 5, 6
  • l2 = 5, 4, 6
  • l3 = 4, 5, 6
  • l4 = 4, 5, 6
  • The final result is 4, 6.
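The deletion step can be sketched in Python: keep only the elements that occur in every list, preserving each list's order. The function name `restrict_to_common` is hypothetical.

```python
def restrict_to_common(lists):
    """Drop every element that is missing from at least one list."""
    common = set(lists[0]).intersection(*lists[1:])
    return [[x for x in lst if x in common] for lst in lists]

lists = [
    [1, 2, 3, 4, 5, 6],
    [2, 1, 5, 4, 6],
    [2, 3, 4, 5, 6],
    [1, 4, 3, 5, 6],
]
reduced = restrict_to_common(lists)
print(reduced)
```

The reduced lists are permutations of the same elements, so the LCS computation for equal-length lists applies to them directly.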