User-Centric Web Crawling* - PowerPoint PPT Presentation

About This Presentation
Title:

User-Centric Web Crawling*

Description:

Search engines show entrenched (already-popular) pages at the top ... Give each page an equal chance to become popular. Incentive for search engines to be fair? ... – PowerPoint PPT presentation

Provided by: imageforma

Transcript and Presenter's Notes


1
Shuffling a Stacked Deck: The Case for Partially
Randomized Ranking of Search Engine Results
Sandeep Pandey (1), Sourashis Roy (2), Christopher
Olston (1), Junghoo Cho (2), Soumen Chakrabarti (3)
(1) Carnegie Mellon, (2) UCLA, (3) IIT Bombay
2
Popularity as a Surrogate for Quality
  • Search engines want to measure the quality of
    pages
  • Quality is hard to define and measure
  • Various popularity measures are used in ranking
  • e.g., in-links, PageRank, user traffic

3
Relationship Between Popularity and Quality
[Figure: the set of users, the users aware of page p, and the users who like page p]
  • Popularity depends on the number of users who
    like a page
  • relies on both quality and awareness of the page
  • Popularity is different from quality
  • But strongly correlated when awareness is large

4
Problem
  • Popularity/quality correlation is weak for young
    pages
  • Even if of high quality, a page may not (yet) be popular
    due to lack of user awareness
  • Plus, the process of gaining popularity is inhibited by
    the entrenchment effect
  • Cho et al., WWW 04; Chakrabarti et al.,
    SODA 05
  • Mowshowitz et al., Communication 02
  • and many others

5
Entrenchment Effect
  • Search engines show entrenched (already-popular)
    pages at the top
  • Users discover pages via search engines and tend to
    focus on the top results

[Figure label: new unpopular pages]
6
Outline
  • Problem introduction
  • Key idea: mitigate entrenchment by introducing
    randomness into ranking
  • Randomized Rank Promotion Scheme
  • Model of ranking and popularity evolution
  • Evaluation
  • Summary

7
Alternative Approaches to Counteract the
Entrenchment Effect
  • Weight links to young pages more
  • Baeza-Yates et al., SPIRE 02
  • Proposed an age-based variant of PageRank
  • Extrapolate quality based on increase in
    popularity
  • Cho et al., SIGMOD 05
  • Proposed an estimate of quality based on the
    derivative of popularity

8
Our Approach: Randomized Rank Promotion
  • Select random (young) pages to promote to good
    rank positions
  • Rank position to promote to is chosen at random

9
Our Approach: Randomized Rank Promotion
  • Consequence: users visit promoted pages, which improves
    the ability to estimate quality via popularity
  • Compared with previous approaches:
  • Does not rely on temporal measurements (+)
  • Sub-optimal (-)

10
Exploration/Exploitation Tradeoff
  • Exploration/Exploitation tradeoff
  • exploit known high-quality pages by assigning
    good rank positions
  • explore quality of new pages by promoting them in
    rank
  • Existing search engines only exploit (to our
    knowledge)

11
Possible Objectives for Rank Promotion
  • Fairness
  • Give each page an equal chance to become popular
  • Incentive for search engines to be fair?
  • Quality (our choice)
  • Maximize quality of search results seen by users
    (in aggregate)
  • Quality of page p: extent to which users like p
  • Q(p) ∈ [0,1]

12
Quality-Per-Click Metric (QPC)
  • V(p,t): number of visits made to page p at time
    t through the search engine
  • QPC: average quality of pages viewed by users,
    amortized over time
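
The transcript does not preserve the QPC formula itself. As a minimal sketch, assuming QPC is the visit-weighted average of Q(p) over a finite observation window, it could be computed as below. The function name and the data layout (one dictionary of visit counts per time step) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def quality_per_click(
    visits: List[Dict[str, float]],
    quality: Dict[str, float],
) -> float:
    """Visit-weighted average quality over a finite window (an assumed form).

    visits[t][p] plays the role of V(p, t); quality[p] plays the role of
    Q(p) and is expected to lie in [0, 1].
    """
    total_quality = 0.0
    total_visits = 0.0
    for visits_at_t in visits:
        for page, count in visits_at_t.items():
            total_quality += count * quality[page]
            total_visits += count
    if total_visits == 0:
        return 0.0  # no clicks observed, so there is nothing to average
    return total_quality / total_visits


# Two time steps, two pages: page "a" has quality 1.0, page "b" has 0.5.
example_visits = [{"a": 3, "b": 1}, {"a": 1, "b": 3}]
example_quality = {"a": 1.0, "b": 0.5}
print(quality_per_click(example_visits, example_quality))  # prints 0.75
```

Under this reading, a ranking scheme scores well when most clicks land on high-quality pages, regardless of how popular those pages currently are.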

13
Outline
  • Problem introduction
  • Key idea: mitigate entrenchment by introducing
    randomness into ranking
  • Randomized Rank Promotion Scheme
  • Model of ranking and popularity evolution
  • Evaluation
  • Summary

14
Desiderata for Randomized Rank Promotion
  • We want the ability to:
  • Control exploration/exploitation tradeoff
  • Select certain pages as candidates for
    promotion
  • Protect certain pages from demotion

15
Randomized Rank Promotion Scheme
[Diagram: the promotion pool Wm is placed in random order to form list Lm; the remainder W-Wm is ordered by popularity to form list Ld]
16
Randomized Rank Promotion Scheme
[Diagram: worked example with k = 3 and r = 0.5, merging the promotion list Lm and the remainder list Ld into a single result list with rank positions 1 to 6]
17
Parameters
  • Promotion pool (Wm)
  • Uniform rank promotion: give an equal chance to
    each page
  • Selective rank promotion: exclusively target
    zero-awareness pages
  • Start rank (k)
  • the rank to start randomization from
  • Degree of randomization (r)
  • controls the tradeoff between exploration and
    exploitation
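
The three parameters above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes the merge keeps the first k-1 positions of Ld untouched and then fills each later position from Lm with probability r and from Ld otherwise, falling back to whichever list still has pages. The slides do not spell out the merge rule, and the function and argument names are hypothetical.

```python
import random
from typing import AbstractSet, List, Optional, Sequence


def randomized_rank_promotion(
    ranked_by_popularity: Sequence[str],
    promotion_pool: AbstractSet[str],
    k: int,
    r: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Merge a shuffled promotion list (Lm) into a popularity-ordered list (Ld)."""
    if k < 1:
        raise ValueError("start rank k must be at least 1")
    if not 0.0 <= r <= 1.0:
        raise ValueError("degree of randomization r must lie in [0, 1]")
    if rng is None:
        rng = random.Random()

    # Lm: pages in the promotion pool Wm, in random order.
    lm = [p for p in ranked_by_popularity if p in promotion_pool]
    rng.shuffle(lm)
    # Ld: the remainder W-Wm, kept in popularity order.
    ld = [p for p in ranked_by_popularity if p not in promotion_pool]

    # Positions 1..k-1 are protected: they always come from Ld.
    result = ld[: k - 1]
    tail = ld[k - 1 :]

    i = j = 0
    while i < len(lm) or j < len(tail):
        # Assumed rule: biased coin per position, unless one list is exhausted.
        take_promoted = j >= len(tail) or (i < len(lm) and rng.random() < r)
        if take_promoted:
            result.append(lm[i])
            i += 1
        else:
            result.append(tail[j])
            j += 1
    return result


pages = ["a", "b", "c", "d", "e"]  # already ordered by popularity
print(randomized_rank_promotion(pages, {"d"}, k=3, r=1.0))
# prints ['a', 'b', 'd', 'c', 'e']
```

With r = 0 the sketch reduces to plain popularity ordering followed by the promotion pool, and with r = 1 every pooled page is placed starting at rank k. Uniform promotion would pass all pages as the pool; selective promotion would pass only zero-awareness pages.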

18
Tuning the Parameters
  • Objective: maximize quality-per-click (QPC)
  • Two ways to tune
  • Real-world experiment
  • Analytical modeling

19
Outline
  • Problem introduction
  • Key idea: mitigate entrenchment by introducing
    randomness into ranking
  • Randomized Rank Promotion Scheme
  • Model of ranking and popularity evolution
  • Evaluation
  • Summary

20
Popularity Evolution Cycle
[Cycle diagram: popularity P(p,t) → rank R(p,t) → visit rate V(p,t) → awareness A(p,t) → popularity P(p,t)]
21
Popularity Evolution Cycle
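[Cycle diagram with the mapping on each edge: popularity P(p,t) → rank R(p,t) via FPR(P(p,t)); rank R(p,t) → visit rate V(p,t) via FRV(R(p,t)); visit rate V(p,t) → awareness A(p,t) via FVA(V(p,t)); awareness A(p,t) → popularity P(p,t) via FAP(A(p,t))]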
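
One pass around the popularity evolution cycle can be sketched as follows. The transcript names the four mappings but does not give their functional forms, so every form here is an assumption made for illustration: popularity is taken as awareness times quality, rank is the position in popularity order, visits decay as a power law in rank, and each visit is assumed to come from a uniformly random user. The function name and the default constants are hypothetical.

```python
from typing import Dict


def evolve_one_step(
    awareness: Dict[str, float],
    quality: Dict[str, float],
    total_visits: float = 100.0,
    num_users: float = 1000.0,
    rank_exponent: float = 1.5,
) -> Dict[str, float]:
    """Return awareness after one step of the cycle (all forms assumed)."""
    # FAP (assumed): popularity = awareness * quality.
    popularity = {p: awareness[p] * quality[p] for p in awareness}

    # FPR: rank 1 is the most popular page; ties broken by name for determinism.
    ordered = sorted(popularity, key=lambda p: (-popularity[p], p))
    rank = {p: position + 1 for position, p in enumerate(ordered)}

    # FRV (assumed): visits fall off as a power law in rank.
    weight = {p: rank[p] ** -rank_exponent for p in rank}
    norm = sum(weight.values())
    visits = {p: total_visits * weight[p] / norm for p in rank}

    # FVA (assumed): each visit is by a uniformly random user, so the
    # unaware fraction shrinks by (1 - 1/num_users) per visit.
    return {
        p: 1.0 - (1.0 - awareness[p]) * (1.0 - 1.0 / num_users) ** visits[p]
        for p in awareness
    }


start = {"old": 0.9, "new": 0.0}
q = {"old": 0.4, "new": 0.9}
after = evolve_one_step(start, q)
print(after["old"] > start["old"], after["new"] > start["new"])
```

In this toy setup the older page keeps rank 1 and receives most visits even though the new page has higher quality, which is the entrenchment effect the cycle is meant to capture.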
22
Deriving Popularity Evolution Curve
  • Next step: derive a formula for the popularity
    evolution curve
  • Assumptions:
  • Number of pages is constant
  • Pages are created and retired according to a
    Poisson process with rate parameter
  • Quality distribution of pages is stationary

23
Deriving Popularity Evolution Curve
Detail: doing the steady-state analysis, we get
[equation not preserved in the transcript]
24
Use Popularity Evolution Model to Tune Parameters
  • Model of the popularity evolution process (see paper)
  • Complex dynamic process
  • To study it, we combine approximate analysis with
    simulation
  • Next step: use the model to tune the rank promotion
    scheme
  • Parameters: k, r, and Wm
  • Objective: maximize QPC

25
Tuning Promotion Pool (Wm)
[Plot legend: no promotion, uniform promotion, selective promotion; settings k = 1 and r = 0.2]
26
Tuning k and r
k: start rank; r: degree of randomization
27
Tuning k and r
  • Maximize QPC (quality-per-click)
  • Avoid excessive junk
  • Preserve the top (rank 1) result for navigational
    searches

28
Model of the Web
  • Web: a collection of multiple disjoint
    topic-specific communities (e.g., Linux,
    Squash, etc.)
  • A community is made up of a set of pages,
    interested users, and related queries

29
Robustness Across Different Web Communities
30
Summary
  • Entrenchment effect hurts search result quality
  • Solution: randomized rank promotion
  • Model of Web evolution and QPC metric
  • Used to tune and evaluate randomized rank promotion
  • Results
  • New high-quality pages become popular much faster
  • Aggregate search result quality significantly
    improved

31
THE END
  • Paper available at
  • www.cs.cmu.edu/spandey