Title: Internet and Data Management
1Internet and Data Management
Junghoo "John" Cho, UCLA Computer Science
2Information Galore
Biblio server
Legacy database
Plain text files
3Challenges: Too much information?
- Discovery
- Management
- Overload
- Access
4Approaches
- Central caching and Indexing
- Google, Excite, AltaVista
- Dynamic integration
- MySimon, BizRate
5Central Caching and Indexing
Central Index
6Challenges
- Page selection and download
- What page to download?
- Page and index update
- How to update pages?
- Page ranking
- What page is important or relevant?
- Scalability
7Dynamic Integration
[Figure: a mediator on top of wrappers, one wrapper per source (Source 1, Source 2, ..., Source n)]
8Challenges
- Heterogeneous sources
- Different data models: relational, object-oriented
- Different schemas and representations
- "Keanu Reeves" or "Reeves, K.", etc.
- Limited query capabilities
- Mediator caching
9Outline of This Talk
- How can we keep pages fresh?
- How does the Web change?
- What do we mean by fresh pages?
- How should we refresh pages?
10Web Evolution Experiment
- How often does a Web page change?
- How long does a page stay on the Web?
- How long does it take for 50% of the Web to change?
- How do we model Web changes?
11Experimental Setup
- February 17 to June 24, 1999
- 270 sites visited (with permission)
- identified 400 sites with highest PageRank
- contacted administrators
- 720,000 pages collected
- 3,000 pages from each site daily
- start at root, visit breadth first (get new and old pages)
- ran only 9pm to 6am, 10 seconds between site requests
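The breadth-first visit order described above can be sketched as follows. This is a minimal illustration, not the crawler used in the experiment: `breadth_first_pages`, the `links` dictionary (an in-memory stand-in for fetching and parsing real pages), and the toy `site` are all hypothetical, and the 9pm to 6am window and the 10-second delay between requests are left out.

```python
from collections import deque


def breadth_first_pages(root, links, max_pages=3000):
    """Return up to max_pages URLs reachable from root, in breadth-first order.

    `links` is a hypothetical link graph (URL -> list of linked URLs) that
    stands in for downloading a page and extracting its links.
    """
    seen = {root}
    queue = deque([root])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:  # enqueue each page at most once
                seen.add(target)
                queue.append(target)
    return order


# Toy site, assumed for illustration only.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": [],
}
print(breadth_first_pages("/", site, max_pages=4))
```

Capping the visit at a fixed number of pages per site means pages close to the root are always collected, while deeper pages are collected only if the budget allows.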
12Average Change Interval
[Figure: fraction of pages vs. average change interval]
13Change Interval By Domain
[Figure: fraction of pages vs. average change interval, by domain]
14Modeling Web Evolution
- Poisson process with rate $\lambda$
- $T$ is the time to the next event
- $f_T(t) = \lambda e^{-\lambda t}$ for $t > 0$
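A small sketch of what the Poisson model implies for a single page: the time to the next change is exponentially distributed, so the chance that the page has changed within t is 1 - e^(-lambda t). The function names and the example rate (one change every 10 days on average) are assumptions made for illustration.

```python
import math
import random


def change_interval_pdf(lam, t):
    """Density of the time T to the next change: lam * exp(-lam * t) for t > 0."""
    return lam * math.exp(-lam * t) if t > 0 else 0.0


def prob_changed_within(lam, t):
    """P(T <= t) = 1 - exp(-lam * t): chance the page changes within time t."""
    return 1.0 - math.exp(-lam * t) if t > 0 else 0.0


lam = 1 / 10  # hypothetical page: one change every 10 days on average
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.expovariate(lam) for _ in range(100_000)]

print(sum(samples) / len(samples))   # sample mean, close to 1 / lam = 10 days
print(prob_changed_within(lam, 10))  # equals 1 - 1/e
print(math.log(2) / lam)             # median time to the next change
```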
15Change Interval of Pages
[Figure: for pages that change every 10 days on average, fraction of changes with a given interval vs. interval in days, compared with the Poisson model]
16Change Metrics
- Freshness
- Freshness of element $e_i$ at time $t$ is
  $F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$, and $0$ otherwise
17Change Metrics
- Age
- Age of element $e_i$ at time $t$ is
  $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, and $t - (\text{modification time of } e_i)$ otherwise
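The freshness and age metrics can be sketched for a single element as below. The helper names, the `last_sync` and `changes` arguments, and the example timeline are hypothetical. The sketch also assumes that "modification time" means the earliest change at the source that the local copy has not yet picked up.

```python
def freshness(last_sync, changes, t):
    """1 if the copy synced at last_sync is still up-to-date at time t, else 0."""
    return 0 if any(last_sync < c <= t for c in changes) else 1


def age(last_sync, changes, t):
    """0 if up-to-date; otherwise time elapsed since the first unseen change."""
    unseen = [c for c in changes if last_sync < c <= t]
    return t - min(unseen) if unseen else 0


# Hypothetical timeline: copy refreshed at t = 2, source changes at t = 3 and t = 5.
changes = [3.0, 5.0]
for t in (2.5, 4.0, 6.0):
    print(t, freshness(2.0, changes, t), age(2.0, changes, t))
```

Freshness drops to 0 at the first missed change and stays there until the next refresh, while age keeps growing over the same period, so age penalizes copies that have been stale for a long time.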
18Change Metrics
[Figure: F(ei) (between 0 and 1) and A(ei) plotted over time, with update and refresh events marked]
19Trick Question
- Two page database
- e1 changes daily
- e2 changes once a week
- Can visit one page per week
- How should we visit pages?
- e1 e2 e1 e2 e1 e2 e1 e2 ... (uniform)
- e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... (proportional)
- e1 e1 e1 e1 e1 e1 ...
- e2 e2 e2 e2 e2 e2 ...
- ?
[Figure: pages e1 and e2 on the Web and their copies in the database]
20Proportional Often Not Good!
- Visit fast changing e1
- → get 1/2 day of freshness
- Visit slow changing e2
- → get 1/2 week of freshness
- Visiting e2 is a better deal!
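The comparison can be checked numerically under the Poisson model from the earlier slide. This is a sketch under stated assumptions rather than the calculation behind the slide: it assumes each page is refreshed at regular intervals, in which case a page with change rate lambda refreshed f times per unit time has time-averaged freshness (1 - e^(-lambda/f)) / (lambda/f). The policy table and all names are hypothetical. Rates are per week (e1: 7, e2: 1), with a total budget of one visit per week.

```python
import math


def expected_freshness(lam, f):
    """Time-averaged freshness of a page with Poisson change rate lam
    that is refreshed at regular intervals, f times per unit time."""
    if f <= 0:
        return 0.0  # never refreshed: stale in the long run
    r = lam / f
    if r == 0:
        return 1.0  # page never changes
    return (1.0 - math.exp(-r)) / r


LAM = {"e1": 7.0, "e2": 1.0}  # changes per week

# Refresh frequencies (visits per week) for each policy; each row sums to 1.
policies = {
    "uniform": {"e1": 0.5, "e2": 0.5},
    "proportional": {"e1": 7 / 8, "e2": 1 / 8},
    "only e1": {"e1": 1.0, "e2": 0.0},
    "only e2": {"e1": 0.0, "e2": 1.0},
}

scores = {
    name: sum(expected_freshness(LAM[e], f[e]) for e in LAM) / len(LAM)
    for name, f in policies.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {score:.3f}")
```

Under these assumptions, spending the whole budget on the slowly changing e2 gives the highest average freshness, and the proportional policy does worse than the uniform one.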
21Optimal Refresh Frequency
- Problem
- Given $\lambda_1, \lambda_2, \dots, \lambda_N$ and $f$,
- find $f_1, f_2, \dots, f_N$ that maximize the freshness of the database
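For two pages, the optimization can be illustrated with a brute-force grid search. This is a sketch, not the method from the talk. It assumes the Poisson model with regular refresh intervals, so a page with change rate lambda refreshed f times per unit time has time-averaged freshness (1 - e^(-lambda/f)) / (lambda/f). It also assumes the refresh frequencies must add up to a fixed budget `total_f`. All names are hypothetical.

```python
import math


def expected_freshness(lam, f):
    """Time-averaged freshness under Poisson changes and regular refreshes."""
    if f <= 0:
        return 0.0  # never refreshed
    r = lam / f
    if r == 0:
        return 1.0  # page never changes
    return (1.0 - math.exp(-r)) / r


def best_split(lam1, lam2, total_f, steps=1000):
    """Grid-search the split of total_f between two pages that maximizes
    their average freshness. Returns (score, f1, f2)."""
    best = None
    for k in range(steps + 1):
        f1 = total_f * k / steps
        f2 = total_f - f1
        score = (expected_freshness(lam1, f1) + expected_freshness(lam2, f2)) / 2
        if best is None or score > best[0]:
            best = (score, f1, f2)
    return best


# e1 changes 7 times/week, e2 once a week, one visit per week in total.
print(best_split(7.0, 1.0, 1.0))
# Two pages with the same change rate.
print(best_split(1.0, 1.0, 1.0))
```

With these numbers the search puts the entire budget on the slowly changing page, while two pages with equal change rates split the budget evenly.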
22Optimal Refresh Frequency
- Shape of curve is the same in all cases
- Holds for any change frequency distribution
23Optimal Refresh for Age
- Shape of curve is the same in all cases
- Holds for any change frequency distribution
24Comparing Policies
Based on statistics from the experiment and a revisit frequency of once a month
25Not Every Page is Equal!
→ Some pages are more important
e1: accessed by users 10 times/day
e2: accessed by users 20 times/day
$F(S) = 1 \cdot F(e_1) + 2 \cdot F(e_2)$
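A short sketch of the weighted freshness idea, reading the slide's formula as an unnormalized weighted sum with weights 1 and 2 (the operators were lost in the source, so this reading is an assumption). The function and dictionary names are hypothetical.

```python
def weighted_freshness(fresh, weights):
    """Sum of w_i * F(e_i) over the elements of the database."""
    return sum(weights[e] * fresh[e] for e in fresh)


# e2 is accessed twice as often as e1, so it gets twice the weight.
weights = {"e1": 1, "e2": 2}

print(weighted_freshness({"e1": 1, "e2": 0}, weights))  # only e1 is fresh
print(weighted_freshness({"e1": 0, "e2": 1}, weights))  # only e2 is fresh
```

Keeping e2 up-to-date contributes twice as much to the database's freshness as keeping e1 up-to-date, which shifts the refresh budget toward e2.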
26Weighted Freshness
[Figure: refresh frequency f vs. change frequency λ, for weights w = 2 and w = 1]
27Change Frequency Estimation
- How to estimate change frequency?
- Naïve estimator: $X/T$
- $X$: number of detected changes
- $T$: monitoring period
- 2 changes in 10 days: 0.2 times/day
- Incomplete change history
28Improved Estimator
- Based on the Poisson model
- $X$: number of detected changes
- $N$: number of accesses
- $f$: access frequency
- 3 changes in 10 days: 0.36 times/day
- → Accounts for "missed" changes
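The estimator's formula did not survive in this transcript. One estimator that follows from the Poisson model and reproduces the 0.36 figure is sketched below, as an assumption rather than the talk's exact formula. If a page changes at rate lambda and is accessed f times per unit time, the probability of detecting no change at a given access is e^(-lambda/f). Equating this to the observed fraction (N - X)/N gives the estimate -f ln(1 - X/N). The function names are hypothetical, and the example assumes one access per day.

```python
import math


def naive_estimate(x, t):
    """Naive estimator: detected changes divided by the monitoring period."""
    return x / t


def improved_estimate(x, n, f):
    """Assumed Poisson-based estimator: -f * ln(1 - x/n).

    x: number of accesses at which a change was detected
    n: total number of accesses
    f: access frequency (accesses per unit time)
    """
    if x >= n:
        raise ValueError("estimate is undefined when every access detects a change")
    return -f * math.log(1 - x / n)


print(naive_estimate(2, 10))                     # 2 changes in 10 days
print(round(improved_estimate(3, 10, 1.0), 2))   # 3 changes in 10 daily accesses
print(naive_estimate(3, 10))                     # naive value for the same data
```

The improved estimate is always at least as large as the naive one, because an access can only reveal that the page changed, not how many times it changed since the previous access.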
29Improvement Significant?
- Application to a Web crawler
- Visit pages once every week for 5 weeks
- Estimate change frequency
- Adjust revisit frequency based on the estimate
- Uniform: do not adjust
- Naïve: based on the naïve estimator
- Ours: based on our improved estimator
30Improvement from Our Estimator
(9,200,000 visits in total)
31WebArchive Project
- Can we store the history of the Web?
- Web is ephemeral
- Study of the Web evolution
- Challenges
- Update?
- Compression?
- New storage?
- Indexing?
32Conclusion
- Exciting area and many challenges ahead!
- Thank you for your attention
- For more information, visit
- http://www.cs.ucla.edu/cho/