User-Centric Web Crawling - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
User-Centric Web Crawling
  • Sandeep Pandey and Christopher Olston
  • Carnegie Mellon University

2
Web Crawling
  • One important application (and our focus): search
  • Topic-specific search engines
  • General-purpose ones

[Diagram: the crawler populates a repository, from which an index is built; users issue search queries against the index]
3
Out-of-date Repository
  • The Web is always changing [Arasu et al., TOIT '01]
  • 23% of Web pages change daily
  • 40% of commercial Web pages change daily
  • Many problems may arise due to an out-of-date
    repository
  • Hurts both precision and recall

4
Web Crawling Optimization Problem
  • Not enough resources to (re)download every web
    document every day/hour
  • Must pick and choose → an optimization problem
  • Others' objective functions: avg. freshness, age
  • Our goal: focus directly on impact on users

[Diagram repeated: crawler → repository → index → user search queries]
5
Web Search User Interface
  1. User enters keywords
  2. Search engine returns ranked list of results
  3. User visits subset of results

6
Objective: Maximize Repository Quality
(as perceived by users)
  • Suppose a user issues search query q
  • Quality(q) = Σ_documents D (likelihood of viewing D)
    x (relevance of D to q)
  • Given a workload W of user queries
  • Average quality = (1/K) x Σ_queries q ∈ W (freq_q
    x Quality(q))
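A minimal sketch of this metric in Python; the workload, view-probability, relevance, and ranking inputs are illustrative stand-ins for the quantities defined above, and taking K as the total query frequency is an assumption:

    # Average quality = (1/K) * sum over queries q of freq_q * Quality(q),
    # where Quality(q) = sum over docs D of ViewProb(Rank(D, q)) * relevance(D, q).
    def avg_quality(workload, view_prob, relevance, ranking):
        """workload:  dict mapping query -> frequency (K = total frequency)
        view_prob:  function rank -> likelihood of viewing
        relevance:  function (doc, query) -> relevance score in [0, 1]
        ranking:    function query -> list of docs in ranked order"""
        k = sum(workload.values())
        total = 0.0
        for q, freq in workload.items():
            quality_q = sum(view_prob(rank) * relevance(d, q)
                            for rank, d in enumerate(ranking(q), start=1))
            total += freq * quality_q
        return total / k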

7
Viewing Likelihood
  • Depends primarily on rank in list [Joachims,
    KDD '02]
  • From AltaVista data [Lempel et al., WWW '03]

[Plot: probability of viewing vs. rank (1-150); ViewProbability(r) ∝ r^(-1.5)]
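The power-law fit above can be made concrete; normalizing over the top 150 ranks (the range shown on the plot's axis) is an illustrative choice:

    # View probability decays with rank roughly as r^(-1.5) (AltaVista fit).
    def view_probability(rank, max_rank=150):
        if not 1 <= rank <= max_rank:
            return 0.0
        norm = sum(r ** -1.5 for r in range(1, max_rank + 1))
        return (rank ** -1.5) / norm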
8
Relevance Scoring Function
  • Search engine's internal notion of how well a
    document matches a query
  • Each (document, query) pair → numerical score ∈ [0, 1]
  • Combination of many factors, including
  • Vector-space similarity (e.g., TF.IDF cosine
    metric)
  • Link-based factors (e.g., PageRank)
  • Anchortext of referring pages
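As a sketch of the vector-space factor alone (link-based and anchortext factors omitted), a bare-bones TF.IDF cosine score could look like the following; the particular IDF variant is an assumption, not necessarily the one used in the paper:

    import math
    from collections import Counter

    def tfidf_cosine(query_terms, doc_terms, doc_freq, num_docs):
        """Cosine similarity between TF.IDF vectors of a query and a document.
        doc_freq maps each term to the number of documents containing it."""
        def vector(terms):
            tf = Counter(terms)
            return {t: tf[t] * math.log(num_docs / (1 + doc_freq.get(t, 0)))
                    for t in tf}
        q, d = vector(query_terms), vector(doc_terms)
        dot = sum(q[t] * d.get(t, 0.0) for t in q)
        norm = (math.sqrt(sum(v * v for v in q.values())) *
                math.sqrt(sum(v * v for v in d.values())))
        return dot / norm if norm else 0.0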

9
(Caveat)
  • We use the scoring function for absolute relevance
  • Normally only used for relative ranking
  • Need to craft scoring function carefully

10
Measuring Quality
  • Avg. Quality = Σ_q ( freq_q x Σ_D (likelihood of
    viewing D) x (relevance of D to q) )

11
Lessons from Quality Metric
  • Avg. Quality = Σ_q ( freq_q x Σ_D ViewProb( Rank(D, q) )
    x (relevance of D to q) )
  • ViewProb(r) is monotonically nonincreasing
  • Quality is maximized when the ranking function orders
    documents in descending order of relevance
  • An out-of-date repository
    scrambles the ranking →
    lowers quality
  • Let ΔQ_D = loss in quality due to inaccurate
    information about D
  • Alternatively, the improvement in quality if we
    (re)download D
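A simplified reading of ΔQ_D as code, holding every other document's rank fixed and accounting only for D's own relevance change at its current ranks; the full definition also captures rank changes of other documents, so this is only an illustrative approximation:

    def delta_q(doc, workload, view_prob, relevance_stale, relevance_fresh, rank_of):
        """Approximate quality gain from re-downloading `doc`:
        sum over queries of freq_q * ViewProb(rank) * (fresh - stale relevance)."""
        return sum(freq * view_prob(rank_of(doc, q)) *
                   (relevance_fresh(doc, q) - relevance_stale(doc, q))
                   for q, freq in workload.items())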

12
ΔQ_D = Improvement in Quality
[Diagram: re-downloading D replaces the stale repository
copy with the fresh Web copy, raising repository quality
by ΔQ_D]
13
Download Prioritization
Idea: Given ΔQ_D for each doc., prioritize
(re)downloading accordingly
Q: How to measure ΔQ_D?
  • Two difficulties:
  • Live copy unavailable
  • Given both the live and repository copies of D,
    measuring ΔQ_D may require computing ranks of all
    documents for all queries

Approach: (1) Estimate ΔQ_D for past
versions; (2) forecast current ΔQ_D
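A minimal sketch of the prioritization step, assuming a forecasted ΔQ_D value is already available per document (a max-heap via negated keys; all names are illustrative):

    import heapq

    def build_download_queue(forecast_dq):
        """forecast_dq: dict doc_id -> forecasted quality gain from re-downloading."""
        heap = [(-dq, doc) for doc, dq in forecast_dq.items()]
        heapq.heapify(heap)
        return heap

    def next_to_download(heap):
        """Pop the document with the largest forecasted quality gain."""
        neg_dq, doc = heapq.heappop(heap)
        return doc, -neg_dq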
14
Overhead of Estimating ΔQ_D
Estimate while updating inverted index
15
Forecast Future ΔQ_D
  • Avg. weekly ΔQ_D
[Plot: avg. weekly ΔQ_D in the first 24 weeks vs. the
second 24 weeks, with Top 50 / Top 80 / Top 90 marked]
Data: 48 weekly snapshots of 15 web sites sampled
from OpenDirectory topics; Queries: AltaVista
query log
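The slides do not spell out the forecasting model; as one plausible sketch, an exponentially weighted average of past weekly ΔQ_D measurements could serve as the forecast (the EWMA form and alpha value are assumptions):

    def forecast_delta_q(weekly_dq, alpha=0.5):
        """Exponentially weighted moving average of past weekly quality gains."""
        forecast = 0.0
        for dq in weekly_dq:
            forecast = alpha * dq + (1 - alpha) * forecast
        return forecast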
16
Summary
  • Estimate ΔQ_D at index time
  • Forecast future ΔQ_D
  • Prioritize downloading according to forecasted ΔQ_D

17
Overall Effectiveness
  • Staleness: fraction of out-of-date documents
    [Cho et al., 2000]
  • Embarrassment: probability that a user visits an
    irrelevant result [Wolf et al., 2002]
  • Used shingling to filter out trivial
    changes
  • Scoring function: PageRank
    (similar results for TF.IDF)

[Chart: resource requirement to reach a given quality
(fraction of ideal) for Min. Staleness, Min. Embarrassment,
and User-Centric crawling]
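Shingling, used above to filter out trivial changes, compares overlapping k-word windows of the stale and fresh text; a small sketch, with the shingle size and similarity threshold as illustrative choices:

    def shingles(text, k=4):
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def change_is_trivial(old_text, new_text, k=4, threshold=0.9):
        """Treat a change as trivial if the Jaccard overlap of shingle sets is high."""
        a, b = shingles(old_text, k), shingles(new_text, k)
        jaccard = len(a & b) / len(a | b) if (a | b) else 1.0
        return jaccard >= threshold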
18
Reasons for Improvement
  • Does not rely on size of text change to estimate
    importance

Example: a page tagged as important by the shingling measure,
although it did not match many queries in the workload
(boston.com)
19
Reasons for Improvement
  • Accounts for false negatives
  • Does not always ignore frequently-updated pages

User-centric crawling repeatedly re-downloads
this page
(washingtonpost.com)
20
Related Work (1/2)
  • General-purpose Web crawling
  • Min. Staleness [Cho & Garcia-Molina, SIGMOD '00]
  • Maximize average freshness or age for fixed set
    of docs.
  • Min. Embarrassment [Wolf et al., WWW '02]
  • Maximize weighted avg. freshness for fixed set of
    docs.
  • Document weights determined by prob. of
    embarrassment
  • [Edwards et al., WWW '01]
  • Maximize average freshness for a growing set of
    docs.
  • How to balance new downloads vs. redownloading
    old docs.

21
Related Work (2/2)
  • Focused/topic-specific crawling
  • Chakrabarti, many others
  • Select subset of pages that match user interests
  • Our work: given a set of pages, decide when to
    (re)download each based on predicted content
    shifts and user interests

22
Summary
  • Crawling as an optimization problem
  • Objective: maximize quality as perceived by users
  • Approach:
  • Measure ΔQ_D using query workload and usage logs
  • Prioritize downloading based on forecasted ΔQ_D
  • Various reasons for improvement
  • Accounts for false positives and negatives
  • Does not rely on size of text change to estimate
    importance
  • Does not always ignore frequently updated pages

23
THE END
  • Paper available at
  • www.cs.cmu.edu/olston

24
Most Closely Related Work
  • [Wolf et al., WWW '02]
  • Maximize weighted avg. freshness for fixed set of
    docs.
  • Document weights determined by prob. of
    embarrassment
  • User-Centric Crawling
  • Which queries are affected by a change, and by how
    much?
  • Change A significantly alters relevance to
    several common queries
  • Change B only affects relevance to infrequent
    queries, and not by much
  • Metric penalizes false negatives
  • A doc. ranked 1000th for a popular query that should be
    ranked 2nd
  • Small embarrassment, but a big loss in quality
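Under the r^(-1.5) view-probability model from the earlier slide, the gap is easy to quantify: a document shown at rank 1000 when it should be at rank 2 loses roughly four orders of magnitude of visibility, so the quality loss is large even though users rarely see the misplaced result (hence little embarrassment):

    # Relative visibility at rank 2 vs. rank 1000 under ViewProb(r) ∝ r^(-1.5)
    ratio = (2 ** -1.5) / (1000 ** -1.5)
    print(round(ratio))   # ~11180: the doc is viewed ~11,000x less often than it should be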

25
Inverted Index
Doc1: Seminar Cancer Symptoms

Word        Posting list: DocID (freq)
Cancer      Doc7 (2), Doc9 (1), Doc1 (1)
Seminar     Doc5 (1), Doc1 (1), Doc6 (1)
Symptoms    Doc1 (1), Doc8 (2), Doc4 (3)
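The same structure in a few lines of Python (an illustrative sketch, not the authors' implementation):

    from collections import Counter, defaultdict

    def build_inverted_index(docs):
        """docs: dict doc_id -> text. Returns word -> {doc_id: term frequency}."""
        index = defaultdict(dict)
        for doc_id, text in docs.items():
            for word, freq in Counter(text.lower().split()).items():
                index[word][doc_id] = freq
        return index

    index = build_inverted_index({"Doc1": "Seminar Cancer Symptoms"})
    # index["cancer"] == {"Doc1": 1}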
26
Updating Inverted Index
Stale Doc1: Seminar Cancer Symptoms
Live Doc1: Cancer management how to detect breast cancer

Cancer      Doc7 (2), Doc9 (1), Doc1 (1) → Doc1 (2)
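A sketch of the corresponding posting-list update when a fresh copy of a document arrives (building on the index sketch above):

    def update_document(index, doc_id, old_text, new_text):
        """Remove doc_id's postings for the stale copy, then insert
        term frequencies from the live copy."""
        for word in set(old_text.lower().split()):
            index[word].pop(doc_id, None)
        for word, freq in Counter(new_text.lower().split()).items():
            index[word][doc_id] = freq

    update_document(index, "Doc1",
                    "Seminar Cancer Symptoms",
                    "Cancer management how to detect breast cancer")
    # index["cancer"]["Doc1"] is now 2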
27
Measure ΔQ_D While Updating Index
  • Compute previous and new scores of the downloaded
    document while updating postings
  • Maintain an approximate mapping between score and
    rank for each query term (20 bytes per mapping in
    our exps.)
  • Compute previous and new ranks (approximately)
    using the computed scores and the score-to-rank
    mapping
  • Measure ΔQ_D using previous and new ranks
    (by applying an approximate function derived
    in the paper)
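A hedged sketch of this step: given the document's old and new scores for a query term and a descending score list standing in for the approximate score-to-rank mapping, estimate the old and new ranks and convert the rank change into a quality change. Reusing the r^(-1.5) view-probability shape here is an assumption standing in for the approximate function derived in the paper:

    import bisect

    def approx_rank(score, scores_desc):
        """Approximate rank of `score` within a descending list of scores."""
        ascending = scores_desc[::-1]
        return len(ascending) - bisect.bisect_left(ascending, score) + 1

    def delta_q_for_term(old_score, new_score, scores_desc, relevance, freq):
        """Quality change for one query: freq * relevance * (ViewProb(new) - ViewProb(old))."""
        view_prob = lambda r: r ** -1.5
        old_rank = approx_rank(old_score, scores_desc)
        new_rank = approx_rank(new_score, scores_desc)
        return freq * relevance * (view_prob(new_rank) - view_prob(old_rank))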

28
Out-of-date Repository
[Diagram: the fresh Web copy of D diverges from the stale
repository copy of D]