Title: User-Centric Web Crawling*
1Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results
Sandeep Pandey (1), Sourashis Roy (2), Christopher Olston (1), Junghoo Cho (2), Soumen Chakrabarti (3)
(1) Carnegie Mellon, (2) UCLA, (3) IIT Bombay
2Popularity as a Surrogate for Quality
- Search engines want to measure the quality of pages
- Quality is hard to define and measure
- Various popularity measures are used in ranking
- e.g., in-links, PageRank, user traffic
3Relationship Between Popularity and Quality
[Figure: nested sets of users: all users, users aware of page p, users who like page p]
- Popularity depends on the number of users who like a page
- Relies on both quality and awareness of the page
- Popularity is different from quality
- But strongly correlated when awareness is large
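The relationship on this slide can be illustrated with a minimal sketch. It assumes, beyond what the slide states, that awareness is the fraction of users aware of a page and that quality is the fraction of aware users who like it, so popularity is their product. The function name `popularity` is hypothetical.

```python
def popularity(awareness: float, quality: float) -> float:
    """Fraction of all users who like a page.

    Assumption: `awareness` is the fraction of users aware of the page and
    `quality` is the fraction of aware users who like it, both in [0, 1].
    """
    if not (0.0 <= awareness <= 1.0 and 0.0 <= quality <= 1.0):
        raise ValueError("awareness and quality must lie in [0, 1]")
    return awareness * quality


# A young high-quality page versus an established mediocre one.
young = popularity(awareness=0.01, quality=0.9)
old = popularity(awareness=0.8, quality=0.4)
```

Under this assumption, a page with full awareness has popularity equal to its quality, while the young page above scores far below the older one despite its higher quality.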
4Problem
- Popularity/quality correlation weak for young pages
- Even if of high quality, may not (yet) be popular due to lack of user awareness
- Plus, process of gaining popularity inhibited by entrenchment effect
- Cho et al. WWW04, Chakrabarti et al. SODA05
- Mowshowitz et al. Communication02
- and many others
5Entrenchment Effect
- Search engines show entrenched (already-popular) pages at the top
- Users discover pages via search engines and tend to focus on top results
[Figure label: new unpopular pages]
6Outline
- Problem introduction
- Key idea: mitigate entrenchment by introducing randomness into ranking
- Randomized Rank Promotion Scheme
- Model of ranking and popularity evolution
- Evaluation
- Summary
7Alternative Approaches to Counteract Entrenchment Effect
- Weight links to young pages more
- Baeza-Yates et al. SPIRE 02
- Proposed an age-based variant of PageRank
- Extrapolate quality based on increase in popularity
- Cho et al. SIGMOD 05
- Proposed an estimate of quality based on the derivative of popularity
8Our Approach: Randomized Rank Promotion
- Select random (young) pages to promote to good rank positions
- Rank position to promote to is chosen at random
9Our Approach: Randomized Rank Promotion
- Consequence: users visit promoted pages, which improves the ability to estimate quality via popularity
- Compared with previous approaches:
- Does not rely on temporal measurements (+)
- Sub-optimal (-)
10Exploration/Exploitation Tradeoff
- Exploration/exploitation tradeoff
- Exploit known high-quality pages by assigning good rank positions
- Explore quality of new pages by promoting them in rank
- Existing search engines only exploit (to our knowledge)
11Possible Objectives for Rank Promotion
- Fairness
- Give each page an equal chance to become popular
- Incentive for search engines to be fair?
- Quality (our choice)
- Maximize quality of search results seen by users (in aggregate)
- Quality of page p: extent to which users like p
- Q(p) ∈ [0,1]
12Quality-Per-Click Metric (QPC)
- V(p,t): number of visits made to page p at time t through search engine
- QPC: average quality of pages viewed by users, amortized over time
13Outline
- Problem introduction
- Key idea: mitigate entrenchment by introducing randomness into ranking
- Randomized Rank Promotion Scheme
- Model of ranking and popularity evolution
- Evaluation
- Summary
14Desiderata for Randomized Rank Promotion
- Want ability to
- Control exploration/exploitation tradeoff
- Select certain pages as candidates for promotion
- Protect certain pages from demotion
15Randomized Rank Promotion Scheme
[Figure: construction of the two lists]
- Promotion pool Wm -> random ordering -> Lm
- Remainder W-Wm -> order by popularity -> Ld
16Randomized Rank Promotion Scheme
[Figure: example with k = 3 and r = 0.5, merging the promotion list Lm and the remainder list Ld into a single ranked list of positions 1 to 6]
17Parameters
- Promotion pool (Wm)
- Uniform rank promotion: give an equal chance to each page
- Selective rank promotion: exclusively target zero-awareness pages
- Start rank (k)
- Rank to start randomization from
- Degree of randomization (r)
- Controls the tradeoff between exploration and exploitation
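One way the three parameters could fit together is sketched below. This is an illustration, not necessarily the authors' exact procedure: it assumes the top k-1 positions come straight from the popularity-ordered list Ld, and that each later position is filled from the randomly ordered list Lm with probability r and from Ld otherwise, falling back to whichever list still has pages. The names `selective_pool` and `promote` are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Set


def selective_pool(pages: Sequence[str], awareness: Dict[str, float]) -> Set[str]:
    """Selective promotion: the pool Wm holds only zero-awareness pages."""
    return {p for p in pages if awareness[p] == 0}


def promote(
    pages: Sequence[str],
    popularity: Dict[str, float],
    pool: Set[str],
    k: int,
    r: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Merge a random list Lm and a popularity-ordered list Ld into one ranking."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    rng = rng or random.Random()

    # Lm: pages in the promotion pool, in random order.
    lm = [p for p in pages if p in pool]
    rng.shuffle(lm)
    # Ld: the remaining pages, most popular first.
    ld = sorted((p for p in pages if p not in pool),
                key=lambda p: popularity[p], reverse=True)

    # Positions 1 .. k-1 are protected and filled from Ld (assumption).
    merged = ld[:k - 1]
    ld = ld[k - 1:]

    i = j = 0
    while i < len(lm) or j < len(ld):
        # Take from Lm if Ld is exhausted, or with probability r while Lm lasts.
        if j >= len(ld) or (i < len(lm) and rng.random() < r):
            merged.append(lm[i])
            i += 1
        else:
            merged.append(ld[j])
            j += 1
    return merged
```

With r = 0 the promoted pages only appear after every other page, and with r = 1 they occupy the positions immediately after the k-1 protected ones, so r interpolates between pure exploitation and aggressive exploration.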
18Tuning the Parameters
- Objective: maximize quality-per-click (QPC)
- Two ways to tune
- Real-world experiment
- Analytical modeling
19Outline
- Problem introduction
- Key idea: mitigate entrenchment by introducing randomness into ranking
- Randomized Rank Promotion Scheme
- Model of ranking and popularity evolution
- Evaluation
- Summary
20Popularity Evolution Cycle
[Figure: cycle Popularity P(p,t) -> Rank R(p,t) -> Visit rate V(p,t) -> Awareness A(p,t) -> Popularity P(p,t)]
21Popularity Evolution Cycle
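[Figure: the same cycle, with each arrow labeled by its mapping function]
- FPR(P(p,t)): popularity to rank
- FRV(R(p,t)): rank to visit rate
- FVA(V(p,t)): visit rate to awareness
- FAP(A(p,t)): awareness to popularity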
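The cycle can be iterated numerically. The sketch below is a toy model under several assumptions that the slides do not state: popularity is awareness times quality (FAP), rank is position in descending popularity order (FPR), visits fall off with rank as a power law with exponent 1.5 (FRV), and every visit comes from a uniformly random user (FVA). The page set is fixed, with no creation or retirement. The name `simulate_cycle` and its parameters are hypothetical.

```python
from typing import List, Sequence


def simulate_cycle(
    quality: Sequence[float],
    awareness: Sequence[float],
    users: int = 1000,
    visits_per_step: float = 100.0,
    steps: int = 50,
) -> List[float]:
    """Iterate popularity -> rank -> visit rate -> awareness for fixed pages."""
    aware = list(awareness)
    n = len(quality)
    # F_RV (assumed): share of visits at each rank position, power-law decay.
    weights = [(rank + 1) ** -1.5 for rank in range(n)]
    total_weight = sum(weights)

    for _ in range(steps):
        # F_AP (assumed): popularity = awareness * quality.
        popularity = [a * q for a, q in zip(aware, quality)]
        # F_PR: pages ordered by popularity, most popular first.
        order = sorted(range(n), key=lambda i: popularity[i], reverse=True)
        for rank, page in enumerate(order):
            visits = visits_per_step * weights[rank] / total_weight
            # F_VA (assumed): each visit is by a uniformly random user, so the
            # unaware fraction shrinks by a factor (1 - 1/users) per visit.
            aware[page] = 1.0 - (1.0 - aware[page]) * (1.0 - 1.0 / users) ** visits
    return aware


# A new high-quality page competing with an established mediocre one.
final_awareness = simulate_cycle(quality=[0.9, 0.4], awareness=[0.0, 0.5])
```

Because rank feeds visits and visits feed awareness, the page that starts with higher popularity collects most of the early visits, which is the entrenchment effect this model is meant to capture.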
22Deriving Popularity Evolution Curve
- Next step: derive formula for popularity evolution curve
- Assumptions
- Number of pages constant
- Pages are created and retired according to a Poisson process with a given rate parameter
- Quality distribution of pages is stationary
23Deriving Popularity Evolution Curve
Steady-state analysis yields the popularity evolution curve.
[Equation: popularity evolution formula; see paper]
24Use Popularity Evolution Model to Tune Parameters
- Model of popularity evolution process (see paper)
- Complex dynamic process
- To study, we combine approximate analysis with simulation
- Next step: use model to tune rank promotion scheme
- Parameters: k, r and Wm
- Objective: maximize QPC
25Tuning Promotion Pool (Wm)
[Figure: comparison of no promotion, uniform promotion, and selective promotion, with k = 1 and r = 0.2]
26Tuning k and r
k: start rank; r: degree of randomization
27Tuning k and r
- Maximize QPC (quality-per-click)
- Avoid excessive junk
- Preserve the top (rank 1) result for navigational searches
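The criteria above can be read as a constrained search over k and r. The sketch below is one such reading, not the authors' tuning procedure: preserving the top result is interpreted as requiring k >= 2, and avoiding excessive junk as an upper bound on r. The callable `estimate_qpc` is hypothetical and stands in for a simulation or a real-world experiment that returns QPC for a given setting.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def tune_k_r(
    estimate_qpc: Callable[[int, float], float],
    ks: Iterable[int],
    rs: Iterable[float],
    min_k: int = 2,
    max_r: float = 1.0,
) -> Tuple[int, float]:
    """Grid search for the (k, r) pair with the highest estimated QPC.

    min_k = 2 keeps rank 1 out of reach of promotion (assumed reading of
    "preserve the top result"); max_r caps the amount of randomization.
    """
    candidates = [(k, r) for k, r in product(ks, rs) if k >= min_k and r <= max_r]
    if not candidates:
        raise ValueError("no (k, r) pair satisfies the constraints")
    return max(candidates, key=lambda kr: estimate_qpc(kr[0], kr[1]))


def toy_qpc(k: int, r: float) -> float:
    """Made-up objective for demonstration only; it peaks at r = 0.1."""
    return -(r - 0.1) ** 2 - 0.01 * k


best_k, best_r = tune_k_r(toy_qpc, ks=[1, 2, 3], rs=[0.0, 0.1, 0.2])
```

The toy objective carries no empirical meaning. In practice each call to `estimate_qpc` would be expensive, which is one reason to prefer a coarse grid.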
28Model of the Web
- Web: collection of multiple disjoint topic-specific communities (e.g., Linux, Squash, etc.)
- A community is made up of a set of pages, interested users and related queries
29Robustness Across Different Web Communities
30Summary
- Entrenchment effect hurts search result quality
- Solution: randomized rank promotion
- Model of Web evolution and QPC metric
- Used to tune and evaluate randomized rank promotion
- Results
- New high-quality pages become popular much faster
- Aggregate search result quality significantly improved
31THE END
- Paper available at www.cs.cmu.edu/spandey