Title: User-Centric Web Crawling
Slide 1: User-Centric Web Crawling
- Sandeep Pandey, Christopher Olston
- Carnegie Mellon University
Slide 2: Web Crawling
- One important application (our focus): search
  - Topic-specific search engines
  - General-purpose ones
[Diagram: a crawler downloads web documents into a repository; an index is built over the repository; users issue search queries against the index]
Slide 3: Out-of-date Repository
- Web is always changing [Arasu et al., TOIT 2001]
  - 23% of Web pages change daily
  - 40% of commercial Web pages change daily
- Many problems may arise due to an out-of-date repository
  - Hurts both precision and recall
Slide 4: Web Crawling Optimization Problem
- Not enough resources to (re)download every web document every day/hour
  - Must pick and choose → an optimization problem
- Others' objective functions: avg. freshness, age
- Our goal: focus directly on impact on users
[Diagram: crawler → repository → index → search queries → user, as before]
Slide 5: Web Search User Interface
- User enters keywords
- Search engine returns a ranked list of result documents
- User visits a subset of the results
Slide 6: Objective: Maximize Repository Quality (as perceived by users)
- Suppose a user issues search query q:
  Quality(q) = Σ_{documents D} (likelihood of viewing D) × (relevance of D to q)
- Given a workload W of user queries:
  Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality(q))
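The quality metric above can be sketched directly in code. This is a minimal illustration, assuming the view likelihoods, relevance scores, and query frequencies are given as plain dictionaries (hypothetical structures, not the paper's implementation):

```python
def quality(q, docs, view_prob, relevance):
    """Quality(q) = sum over documents D of P(view D) * relevance(D, q)."""
    return sum(view_prob[(d, q)] * relevance[(d, q)] for d in docs)

def average_quality(workload_freq, docs, view_prob, relevance):
    """Average quality = (1/K) * sum over queries q in W of freq_q * Quality(q).
    Here K normalizes by total query frequency in the workload."""
    K = sum(workload_freq.values())
    return sum(f * quality(q, docs, view_prob, relevance)
               for q, f in workload_freq.items()) / K
```

For example, two documents with view probabilities 0.5 and 0.25 and relevances 1.0 and 0.4 give Quality(q) = 0.5 × 1.0 + 0.25 × 0.4 = 0.6 for that query.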
Slide 7: Viewing Likelihood
- Depends primarily on rank in the result list [Joachims, KDD 2002]
- From AltaVista data [Lempel et al., WWW 2003]:
  ViewProbability(r) ∝ r^(-1.5)
[Plot: probability of viewing vs. rank (0-150); view probability decays from near 1 at rank 1 following the power law above]
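The fitted power law is one line of code. A minimal sketch (unnormalized, since only relative magnitudes across ranks matter for prioritization):

```python
def view_probability(rank, exponent=1.5):
    """Power-law viewing model fitted to AltaVista data:
    ViewProbability(r) is proportional to r^(-1.5).
    Returns an unnormalized score; lower ranks (closer to 1) score higher."""
    return rank ** -exponent
```

Note how steeply this decays: rank 4 already receives only 1/8 of the weight of rank 1.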
Slide 8: Relevance Scoring Function
- Search engine's internal notion of how well a document matches a query
- Each (document, query) pair → numerical score ∈ [0, 1]
- Combination of many factors, including:
  - Vector-space similarity (e.g., TF.IDF cosine metric)
  - Link-based factors (e.g., PageRank)
  - Anchor text of referring pages
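As one example of such a factor, the TF.IDF cosine metric can be sketched as follows. This is an illustrative toy version (the `df` document-frequency table and term lists are assumed inputs, not the paper's exact scoring function):

```python
import math
from collections import Counter

def tfidf_cosine(query_terms, doc_terms, df, n_docs):
    """Cosine similarity between the TF.IDF vectors of a query and a document.
    df: per-term document frequency; n_docs: collection size."""
    def vec(terms):
        tf = Counter(terms)
        # weight = term frequency * inverse document frequency
        return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df.get(t)}
    q, d = vec(query_terms), vec(doc_terms)
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0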
Slide 9: (Caveat)
- We use the scoring function for absolute relevance
  - Normally it is only used for relative ranking
  - Need to craft the scoring function carefully
Slide 10: Measuring Quality
- Avg. Quality = Σ_q (freq_q × Σ_D (likelihood of viewing D) × (relevance of D to q))
Slide 11: Lessons from the Quality Metric
- Avg. Quality = Σ_q (freq_q × Σ_D ViewProb(Rank(D, q)) × (relevance of D to q))
- ViewProb(r) is monotonically nonincreasing
  - Quality is maximized when the ranking function orders documents in descending order of relevance
  - An out-of-date repository scrambles the ranking → lowers quality
- Let ΔQ_D = loss in quality due to inaccurate information about D
  - Alternatively, the improvement in quality if we (re)download D
Slide 12: ΔQ_D = Improvement in Quality
[Diagram: re-downloading D replaces the stale repository copy with the fresh Web copy; repository quality increases by ΔQ_D]
13Download Prioritization
Idea Given ?QD for each doc., prioritize
(re)downloading accordingly
Q How to measure ?QD?
- Two difficulties
- Live copy unavailable
- Given both the live and repository copies of D,
measuring ?QD may require computing ranks of all
documents for all queries
Approach (1) Estimate ?QD for past
versions, (2) Forecast current ?QD
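Once a forecasted ΔQ_D is available per document, the prioritization step itself is a max-ordering. A minimal sketch using a heap (the `forecasted_dq` mapping is an assumed input):

```python
import heapq

def prioritize(forecasted_dq):
    """Yield (doc, dq) pairs in descending order of forecasted quality
    improvement, so the crawler re-downloads high-impact documents first."""
    # negate values because heapq is a min-heap
    heap = [(-dq, doc) for doc, dq in forecasted_dq.items()]
    heapq.heapify(heap)
    while heap:
        neg_dq, doc = heapq.heappop(heap)
        yield doc, -neg_dq
```

A heap fits a crawler's setting better than a one-shot sort, since new forecasts can be pushed as documents are re-estimated.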
Slide 14: Overhead of Estimating ΔQ_D
- Estimate ΔQ_D while updating the inverted index
Slide 15: Forecast Future ΔQ_D
- Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
- Queries: AltaVista query log
[Plot: overlap between the Top 50 / Top 80 / Top 90 ΔQ_D documents of the first 24 weeks and of the second 24 weeks]
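The deck does not spell out the forecasting rule, so as a stand-in, one simple forecaster over past per-cycle ΔQ_D estimates is an exponentially weighted moving average (this is an assumption for illustration, not the paper's derived forecaster):

```python
def forecast_dq(history, alpha=0.5):
    """Forecast the next ΔQ_D from past per-cycle estimates using an
    exponentially weighted moving average: recent cycles weigh more.
    history: iterable of past ΔQ_D estimates, oldest first."""
    forecast = 0.0
    for dq in history:
        forecast = alpha * dq + (1 - alpha) * forecast
    return forecast
```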
Slide 16: Summary
- Estimate ΔQ_D at index time
- Forecast future ΔQ_D
- Prioritize downloading according to forecasted ΔQ_D
Slide 17: Overall Effectiveness
- Staleness: fraction of out-of-date documents [Cho et al. 2000]
- Embarrassment: probability that a user visits an irrelevant result [Wolf et al. 2002]
- Used shingling to filter out trivial changes
- Scoring function: PageRank (similar results for TF.IDF)
[Plot: quality (fraction of ideal) vs. resource requirement for the Min. Staleness, Min. Embarrassment, and User-Centric policies]
Slide 18: Reasons for Improvement
- Does not rely on the size of a text change to estimate importance
- Example (boston.com): tagged as important by the shingling measure, although it did not match many queries in the workload
Slide 19: Reasons for Improvement
- Accounts for false negatives
- Does not always ignore frequently-updated pages
- Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
Slide 20: Related Work (1/2)
- General-purpose Web crawling
  - Min. Staleness [Cho, Garcia-Molina, SIGMOD 2000]: maximize average freshness or age for a fixed set of docs.
  - Min. Embarrassment [Wolf et al., WWW 2002]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by probability of embarrassment
  - [Edwards et al., WWW 2001]: maximize average freshness for a growing set of docs.; how to balance new downloads vs. redownloading old docs.
Slide 21: Related Work (2/2)
- Focused/topic-specific crawling [Chakrabarti, many others]
  - Select a subset of pages that match user interests
- Our work: given a set of pages, decide when to (re)download each based on predicted content shifts and user interests
Slide 22: Summary
- Crawling is an optimization problem
- Objective: maximize quality as perceived by users
- Approach:
  - Measure ΔQ_D using the query workload and usage logs
  - Prioritize downloading based on forecasted ΔQ_D
- Various reasons for improvement
  - Accounts for false positives and negatives
  - Does not rely on size of text change to estimate importance
  - Does not always ignore frequently-updated pages
Slide 23: The End
- Paper available at www.cs.cmu.edu/olston
Slide 24: Most Closely Related Work
- [Wolf et al., WWW 2002]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by probability of embarrassment
- User-Centric Crawling: which queries are affected by a change, and by how much?
  - Change A significantly alters relevance to several common queries
  - Change B only affects relevance to infrequent queries, and not by much
- Our metric penalizes false negatives
  - Example: a doc. ranked 1000th for a popular query that should be ranked 2nd → small embarrassment, but a big loss in quality
Slide 25: Inverted Index
- Doc1: "Seminar Cancer Symptoms"
- Word     | Posting list: DocID (freq)
  Cancer   | Doc7 (2), Doc9 (1), Doc1 (1)
  Seminar  | Doc5 (1), Doc1 (1), Doc6 (1)
  Symptoms | Doc1 (1), Doc8 (2), Doc4 (3)
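The structure above can be sketched in a few lines. This is a toy construction (whitespace tokenization, no stemming or stop words), not a production indexer:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text.
    Returns a dict of word -> posting list of (doc_id, freq) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for word in text.lower().split():
            counts[word] += 1
        for word, freq in counts.items():
            index[word].append((doc_id, freq))
    return index
```

Building it over Doc1 = "Seminar Cancer Symptoms" yields a posting (Doc1, 1) under each of the three words, matching the table above.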
Slide 26: Updating the Inverted Index
- Stale Doc1: "Seminar Cancer Symptoms"
- Live Doc1: "Cancer management: how to detect breast cancer"
- Posting list for "Cancer": Doc7 (2), Doc9 (1), Doc1 (1) → Doc1's entry becomes Doc1 (2)
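The update step amounts to replacing the re-downloaded document's postings and recording its old and new term frequencies (the inputs to re-scoring). A minimal sketch against the toy index format above (hypothetical helper, not the paper's implementation):

```python
def update_postings(index, doc_id, new_text):
    """Replace doc_id's postings with entries from its freshly downloaded
    text. Returns (old_freqs, new_freqs): per-word term frequencies before
    and after the update."""
    old = {w: f for w, plist in index.items() for d, f in plist if d == doc_id}
    # drop the stale postings for this document
    for w in list(index):
        index[w] = [(d, f) for d, f in index[w] if d != doc_id]
        if not index[w]:
            del index[w]
    # insert fresh postings from the live copy
    new = {}
    for word in new_text.lower().split():
        new[word] = new.get(word, 0) + 1
    for word, freq in new.items():
        index.setdefault(word, []).append((doc_id, freq))
    return old, new
```

On the slide's example, Doc1's "cancer" frequency goes from 1 to 2, and its "seminar" and "symptoms" postings disappear.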
Slide 27: Measure ΔQ_D While Updating the Index
- Compute the previous and new scores of the downloaded document while updating its postings
- Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our experiments)
- Compute the previous and new ranks (approximately) using the computed scores and the score-to-rank mapping
- Measure ΔQ_D from the previous and new ranks (by applying an approximation function derived in the paper)
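The last step can be sketched per query once approximate ranks are in hand. The exact ΔQ_D function is derived in the paper; the version below is only a rough stand-in that weights the shift in view probability (using the r^(-1.5) model from earlier) by the query's workload frequency:

```python
def view_prob(rank, exponent=1.5):
    """Power-law viewing model: proportional to rank^(-1.5)."""
    return rank ** -exponent

def estimate_dq(old_rank, new_rank, query_freq):
    """Rough per-query stand-in for the paper's ΔQ_D estimate: how much
    the document's shift between its stale and fresh (approximate) ranks
    changes expected viewing, weighted by query frequency."""
    return query_freq * abs(view_prob(old_rank) - view_prob(new_rank))
```

Summing this over the affected queries gives one document's estimated ΔQ_D; a jump from rank 100 to rank 2 for a frequent query dominates a jump from 100 to 50.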
Slide 28: Out-of-date Repository
[Diagram: Web copy of D (fresh) vs. repository copy of D (stale)]