Title: User-Centric Web Crawling
Slide 1: User-Centric Web Crawling
- Sandeep Pandey, Christopher Olston
- Carnegie Mellon University
Slide 2: Web Crawling
- One important application (our focus): search
  - Topic-specific search engines
  - General-purpose ones
[Diagram: a crawler downloads web documents into a repository; an index is built over the repository; users issue search queries against the index]
Slide 3: Out-of-date Repository
- Web is always changing [Arasu et al., TOIT 2001]
  - 23% of Web pages change daily
  - 40% of commercial Web pages change daily
- Many problems may arise due to an out-of-date repository
  - Hurts both precision and recall
Slide 4: Web Crawling Optimization Problem
- Not enough resources to (re)download every web document every day/hour
  - Must pick and choose → an optimization problem
- Others' objective functions: avg. freshness, age
- Our goal: focus directly on impact on users
[Diagram: crawler → repository → index → search queries → user, as before]
Slide 5: Web Search User Interface
- User enters keywords
- Search engine returns a ranked list of result documents
- User visits a subset of the results
Slide 6: Objective: Maximize Repository Quality (as perceived by users)
- Suppose a user issues search query q:
  Quality(q) = Σ_{documents D} (likelihood of viewing D) × (relevance of D to q)
- Given a workload W of user queries:
  Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality(q))
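The quality metric above can be sketched directly in code. This is a minimal illustration, assuming the view likelihoods, relevance scores, and query frequencies are given as plain dictionaries (hypothetical structures, not the paper's implementation):

```python
def quality(q, docs, view_prob, relevance):
    """Quality(q) = sum over documents D of P(view D) * relevance(D, q)."""
    return sum(view_prob[(d, q)] * relevance[(d, q)] for d in docs)

def average_quality(workload_freq, docs, view_prob, relevance):
    """Average quality = (1/K) * sum over queries q in W of freq_q * Quality(q).
    Here K normalizes by total query frequency in the workload."""
    K = sum(workload_freq.values())
    return sum(f * quality(q, docs, view_prob, relevance)
               for q, f in workload_freq.items()) / K
```

For example, two documents with view probabilities 0.5 and 0.25 and relevances 1.0 and 0.4 give Quality(q) = 0.5 × 1.0 + 0.25 × 0.4 = 0.6 for that query.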
Slide 7: Viewing Likelihood
- Depends primarily on rank in the result list [Joachims, KDD 2002]
- From AltaVista data [Lempel et al., WWW 2003]:
  ViewProbability(r) ∝ r^(-1.5)
[Plot: probability of viewing vs. rank (0-150); view probability decays from near 1 at rank 1 following the power law above]
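The fitted power law is one line of code. A minimal sketch (unnormalized, since only relative magnitudes across ranks matter for prioritization):

```python
def view_probability(rank, exponent=1.5):
    """Power-law viewing model fitted to AltaVista data:
    ViewProbability(r) is proportional to r^(-1.5).
    Returns an unnormalized score; lower ranks (closer to 1) score higher."""
    return rank ** -exponent
```

Note how steeply this decays: rank 4 already receives only 1/8 of the weight of rank 1.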
Slide 8: Relevance Scoring Function
- Search engine's internal notion of how well a document matches a query
- Each (document, query) pair → numerical score ∈ [0, 1]
- Combination of many factors, including:
  - Vector-space similarity (e.g., TF.IDF cosine metric)
  - Link-based factors (e.g., PageRank)
  - Anchor text of referring pages
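As one example of such a factor, the TF.IDF cosine metric can be sketched as follows. This is an illustrative toy version (the `df` document-frequency table and term lists are assumed inputs, not the paper's exact scoring function):

```python
import math
from collections import Counter

def tfidf_cosine(query_terms, doc_terms, df, n_docs):
    """Cosine similarity between the TF.IDF vectors of a query and a document.
    df: per-term document frequency; n_docs: collection size."""
    def vec(terms):
        tf = Counter(terms)
        # weight = term frequency * inverse document frequency
        return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df.get(t)}
    q, d = vec(query_terms), vec(doc_terms)
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0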
Slide 9: (Caveat)
- We use the scoring function for absolute relevance
  - Normally it is only used for relative ranking
  - Need to craft the scoring function carefully
Slide 10: Measuring Quality
- Avg. Quality = Σ_q (freq_q × Σ_D (likelihood of viewing D) × (relevance of D to q))
Slide 11: Lessons from the Quality Metric
- Avg. Quality = Σ_q (freq_q × Σ_D ViewProb(Rank(D, q)) × (relevance of D to q))
- ViewProb(r) is monotonically nonincreasing
  - Quality is maximized when the ranking function orders documents in descending order of relevance
  - An out-of-date repository scrambles the ranking → lowers quality
- Let ΔQ_D = loss in quality due to inaccurate information about D
  - Alternatively, the improvement in quality if we (re)download D
Slide 12: ΔQ_D = Improvement in Quality
[Diagram: re-downloading D replaces the stale repository copy with the fresh Web copy; repository quality increases by ΔQ_D]
13Download Prioritization
Idea Given ?QD for each doc., prioritize
(re)downloading accordingly
Q How to measure ?QD?
- Two difficulties
- Live copy unavailable
- Given both the live and repository copies of D,
measuring ?QD may require computing ranks of all
documents for all queries
Approach (1) Estimate ?QD for past
versions, (2) Forecast current ?QD
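Once a forecasted ΔQ_D is available per document, the prioritization step itself is a max-ordering. A minimal sketch using a heap (the `forecasted_dq` mapping is an assumed input):

```python
import heapq

def prioritize(forecasted_dq):
    """Yield (doc, dq) pairs in descending order of forecasted quality
    improvement, so the crawler re-downloads high-impact documents first."""
    # negate values because heapq is a min-heap
    heap = [(-dq, doc) for doc, dq in forecasted_dq.items()]
    heapq.heapify(heap)
    while heap:
        neg_dq, doc = heapq.heappop(heap)
        yield doc, -neg_dq
```

A heap fits a crawler's setting better than a one-shot sort, since new forecasts can be pushed as documents are re-estimated.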
Slide 14: Overhead of Estimating ΔQ_D
- Estimate ΔQ_D while updating the inverted index
Slide 15: Forecast Future ΔQ_D
- Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
- Queries: AltaVista query log
[Plot: overlap between the Top 50 / Top 80 / Top 90 ΔQ_D documents of the first 24 weeks and of the second 24 weeks]
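The deck does not spell out the forecasting rule, so as a stand-in, one simple forecaster over past per-cycle ΔQ_D estimates is an exponentially weighted moving average (this is an assumption for illustration, not the paper's derived forecaster):

```python
def forecast_dq(history, alpha=0.5):
    """Forecast the next ΔQ_D from past per-cycle estimates using an
    exponentially weighted moving average: recent cycles weigh more.
    history: iterable of past ΔQ_D estimates, oldest first."""
    forecast = 0.0
    for dq in history:
        forecast = alpha * dq + (1 - alpha) * forecast
    return forecast
```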
Slide 16: Summary
- Estimate ΔQ_D at index time
- Forecast future ΔQ_D
- Prioritize downloading according to forecasted ΔQ_D
Slide 17: Overall Effectiveness
- Staleness: fraction of out-of-date documents [Cho et al. 2000]
- Embarrassment: probability that a user visits an irrelevant result [Wolf et al. 2002]
- Used shingling to filter out trivial changes
- Scoring function: PageRank (similar results for TF.IDF)
[Plot: quality (fraction of ideal) vs. resource requirement for the Min. Staleness, Min. Embarrassment, and User-Centric policies]
Slide 18: Reasons for Improvement
- Does not rely on the size of a text change to estimate importance
- Example (boston.com): tagged as important by the shingling measure, although it did not match many queries in the workload
Slide 19: Reasons for Improvement
- Accounts for false negatives
- Does not always ignore frequently-updated pages
- Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
Slide 20: Related Work (1/2)
- General-purpose Web crawling
  - Min. Staleness [Cho, Garcia-Molina, SIGMOD 2000]: maximize average freshness or age for a fixed set of docs.
  - Min. Embarrassment [Wolf et al., WWW 2002]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by probability of embarrassment
  - [Edwards et al., WWW 2001]: maximize average freshness for a growing set of docs.; how to balance new downloads vs. redownloading old docs.
Slide 21: Related Work (2/2)
- Focused/topic-specific crawling [Chakrabarti, many others]
  - Select a subset of pages that match user interests
- Our work: given a set of pages, decide when to (re)download each based on predicted content shifts and user interests
Slide 22: Summary
- Crawling is an optimization problem
- Objective: maximize quality as perceived by users
- Approach:
  - Measure ΔQ_D using the query workload and usage logs
  - Prioritize downloading based on forecasted ΔQ_D
- Various reasons for improvement
  - Accounts for false positives and negatives
  - Does not rely on size of text change to estimate importance
  - Does not always ignore frequently-updated pages
Slide 23: The End
- Paper available at www.cs.cmu.edu/olston
Slide 24: Most Closely Related Work
- [Wolf et al., WWW 2002]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by probability of embarrassment
- User-Centric Crawling: which queries are affected by a change, and by how much?
  - Change A significantly alters relevance to several common queries
  - Change B only affects relevance to infrequent queries, and not by much
- Our metric penalizes false negatives
  - Example: a doc. ranked 1000th for a popular query that should be ranked 2nd → small embarrassment, but a big loss in quality
Slide 25: Inverted Index
- Doc1: "Seminar Cancer Symptoms"
- Word     | Posting list: DocID (freq)
  Cancer   | Doc7 (2), Doc9 (1), Doc1 (1)
  Seminar  | Doc5 (1), Doc1 (1), Doc6 (1)
  Symptoms | Doc1 (1), Doc8 (2), Doc4 (3)
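The structure above can be sketched in a few lines. This is a toy construction (whitespace tokenization, no stemming or stop words), not a production indexer:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text.
    Returns a dict of word -> posting list of (doc_id, freq) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for word in text.lower().split():
            counts[word] += 1
        for word, freq in counts.items():
            index[word].append((doc_id, freq))
    return index
```

Building it over Doc1 = "Seminar Cancer Symptoms" yields a posting (Doc1, 1) under each of the three words, matching the table above.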
Slide 26: Updating the Inverted Index
- Stale Doc1: "Seminar Cancer Symptoms"
- Live Doc1: "Cancer management: how to detect breast cancer"
- Posting list for "Cancer": Doc7 (2), Doc9 (1), Doc1 (1) → Doc1's entry becomes Doc1 (2)
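The update step amounts to replacing the re-downloaded document's postings and recording its old and new term frequencies (the inputs to re-scoring). A minimal sketch against the toy index format above (hypothetical helper, not the paper's implementation):

```python
def update_postings(index, doc_id, new_text):
    """Replace doc_id's postings with entries from its freshly downloaded
    text. Returns (old_freqs, new_freqs): per-word term frequencies before
    and after the update."""
    old = {w: f for w, plist in index.items() for d, f in plist if d == doc_id}
    # drop the stale postings for this document
    for w in list(index):
        index[w] = [(d, f) for d, f in index[w] if d != doc_id]
        if not index[w]:
            del index[w]
    # insert fresh postings from the live copy
    new = {}
    for word in new_text.lower().split():
        new[word] = new.get(word, 0) + 1
    for word, freq in new.items():
        index.setdefault(word, []).append((doc_id, freq))
    return old, new
```

On the slide's example, Doc1's "cancer" frequency goes from 1 to 2, and its "seminar" and "symptoms" postings disappear.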
Slide 27: Measure ΔQ_D While Updating the Index
- Compute the previous and new scores of the downloaded document while updating its postings
- Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our experiments)
- Compute the previous and new ranks (approximately) using the computed scores and the score-to-rank mapping
- Measure ΔQ_D from the previous and new ranks (by applying an approximation function derived in the paper)
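The last step can be sketched per query once approximate ranks are in hand. The exact ΔQ_D function is derived in the paper; the version below is only a rough stand-in that weights the shift in view probability (using the r^(-1.5) model from earlier) by the query's workload frequency:

```python
def view_prob(rank, exponent=1.5):
    """Power-law viewing model: proportional to rank^(-1.5)."""
    return rank ** -exponent

def estimate_dq(old_rank, new_rank, query_freq):
    """Rough per-query stand-in for the paper's ΔQ_D estimate: how much
    the document's shift between its stale and fresh (approximate) ranks
    changes expected viewing, weighted by query frequency."""
    return query_freq * abs(view_prob(old_rank) - view_prob(new_rank))
```

Summing this over the affected queries gives one document's estimated ΔQ_D; a jump from rank 100 to rank 2 for a frequent query dominates a jump from 100 to 50.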
Slide 28: Out-of-date Repository
[Diagram: Web copy of D (fresh) vs. repository copy of D (stale)]