Title: Adaptive On-Line Page Importance Computation
1Adaptive On-Line Page Importance Computation
- Serge Abiteboul
- INRIA
- Domaine de Voluceau
2Content
- Brief Introduction
- Main Idea
- Problem Presentation
- Static Graphs OPIC
- Adaptive OPIC
- Implementation and Experiments
- Conclusion
3How do web search engines work?
- Crawling the web pages
- Parsing the web pages
- Indexing the web pages
- Search page
- Search word parsing
- Sending the result
4Crawling web pages
- How to get those web pages
- Changing web pages
- How to update web pages
- How often to update them
- Different pages, different rank
6The web as a graph
7A graph as a matrix
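The matrix on this slide did not survive extraction. As a minimal sketch, assuming the usual normalized link matrix in which page i splits its weight evenly among its out[i] children, the graph can be turned into a matrix as follows. The function name `link_matrix` and the three-page example web are hypothetical.

```python
def link_matrix(out_links, n):
    """Build L as a list of rows, with L[i][j] = 1/out[i] if page i links to page j.

    out_links maps a page index to the list of pages it links to.
    Pages without out-links get an all-zero row.
    """
    L = [[0.0] * n for _ in range(n)]
    for i, children in out_links.items():
        for j in children:
            L[i][j] = 1.0 / len(children)
    return L


# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
example_links = {0: [1, 2], 1: [2], 2: [0]}
example_matrix = link_matrix(example_links, 3)
```

Each row of a page with at least one out-link sums to 1, which is what lets a page's weight be redistributed without being created or destroyed.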
8Importance
Importance of page i
10Cost of computing page rank
- Huge history crawling web pages
- Huge Cash Vector
- Huge History Vector
- Temp Vector
- Variable length of
STORAGE
11Cost of computing page rank
- CPU
- Memory
- Disk access
- Crawling web page
- Communication
SYSTEM
13Inductive Equation
Diverge
Converge to zero
14Inductive Equation
Several solutions
Convergence problem
15Inductive Equation
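The equations on these slides were lost in extraction. The sketch below assumes the inductive equation is the usual iteration x_{k+1}[j] = sum over i of L[i][j] * x_k[i], and illustrates three behaviours on tiny hypothetical graphs: convergence on a strongly connected, aperiodic graph, oscillation on a two-page cycle, and collapse to zero when a page has no out-links. The names `step`, `iterate`, `good`, `cycle` and `sink` are illustrative only.

```python
def step(L, x):
    """One iteration: new_x[j] = sum_i L[i][j] * x[i]."""
    n = len(x)
    return [sum(L[i][j] * x[i] for i in range(n)) for j in range(n)]


def iterate(L, x, k):
    """Apply step() k times and return the resulting vector."""
    for _ in range(k):
        x = step(L, x)
    return x


# 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0: strongly connected and aperiodic.
good = [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
# 0 -> 1, 1 -> 0: the weight bounces back and forth forever.
cycle = [[0.0, 1.0], [1.0, 0.0]]
# 0 -> 1 only: page 1 has no out-links, so weight leaks out of the system.
sink = [[0.0, 1.0], [0.0, 0.0]]

converged = iterate(good, [1 / 3, 1 / 3, 1 / 3], 100)
```

On `good` the iteration settles near (0.4, 0.2, 0.4). On `cycle`, starting from (1, 0), the vector alternates between (1, 0) and (0, 1) and never settles. On `sink` all weight is gone after two steps.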
16Static Graphs OPIC
First Cash
Credit of history page
Temp Vector
17Static Graphs OPIC
for each i let C[i] := 1/n
for each i let H[i] := 0
let G := 0
do forever
begin
  choose some node i
    (each node is selected infinitely often)
  H[i] += C[i]
    (single disk access per page)
  for each child j of i, do C[j] += C[i]/out[i]
    (distribution of cash depends on L)
  G += C[i]
  C[i] := 0
end
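The loop above can be sketched in Python as follows. This is a minimal in-memory illustration, not the Xyleme implementation: it assumes every page has at least one child, picks pages uniformly at random so that each page is selected infinitely often in the limit, and estimates importance as H[i] / G. The function name `opic` and its parameters are hypothetical.

```python
import random


def opic(out_links, steps, seed=0):
    """Run OPIC on a static graph and return the estimate H[i] / G per page.

    out_links maps each page to a non-empty list of its children.
    steps must be at least 1 so that G is non-zero.
    """
    rng = random.Random(seed)
    pages = sorted(out_links)
    n = len(pages)
    cash = {i: 1.0 / n for i in pages}   # C[i]: cash currently held by page i
    history = {i: 0.0 for i in pages}    # H[i]: cash page i has already spent
    total = 0.0                          # G: sum of all recorded history
    for _ in range(steps):
        i = rng.choice(pages)            # assumed selection policy: uniform random
        amount = cash[i]
        cash[i] = 0.0                    # emptied first, so a self-link keeps its share
        history[i] += amount
        total += amount
        share = amount / len(out_links[i])
        for j in out_links[i]:
            cash[j] += share             # distribute the cash evenly to the children
    return {i: history[i] / total for i in pages}


estimate = opic({0: [1, 2], 1: [2], 2: [0]}, steps=20000)
```

Only the cash and history of the visited page and the cash of its children are touched at each step, and no link matrix is stored, which is the point of the on-line formulation.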
18Static Graphs OPIC
19Lemma 2.2
20Lemma 2.3
21Lemma 2.4
If all pages are read infinitely often,
the total history |H_t| goes to infinity.
22Lemma 2.5
24Advantages of Adaptive OPIC
- Less storage resources than standard algorithms
- Less CPU, memory, and disk access
- Easy to implement
25Page crawling strategies
Error factor
- Random average 1/n
- Greedy 2/n
- Cycle
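The slide only names the three strategies. A minimal sketch, assuming Random means a uniform random choice, Greedy means the page currently holding the most cash, and Cycle means a fixed repeating order; all function names here are hypothetical.

```python
import itertools
import random


def random_strategy(pages, seed=0):
    """Return a chooser that picks the next page uniformly at random."""
    rng = random.Random(seed)

    def choose(cash):
        return rng.choice(pages)

    return choose


def greedy_strategy(pages):
    """Return a chooser that picks the page currently holding the most cash."""

    def choose(cash):
        return max(pages, key=lambda page: cash[page])

    return choose


def cycle_strategy(pages):
    """Return a chooser that walks the pages in a fixed order, over and over."""
    order = itertools.cycle(pages)

    def choose(cash):
        return next(order)

    return choose
```

Each chooser takes the current cash table and returns the next page to read, so any of them could replace the "choose some node i" step of the OPIC loop. Greedy reads only pages that hold cash, so the sketch assumes the graph lets cash reach every page.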
26Window selection
Adaptive OPIC selects a fixed window T
27A changing graph
- The Web changes continuously, and so does the importance of pages.
- Considering only the recent part of the cash history for each page
- The time window corresponding to the recent history may be defined as:
- A fixed number of measures for each page
- A fixed period of time for each page
- A single value that interpolates the history for a specific period of time
- When the number of nodes changes, there are some difficulties.
- More precisely, the page importance of previously existing pages decreases automatically.
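The first window policy, a fixed number of measures per page, can be sketched with a bounded queue. This is only an illustration: the class name `MeasureWindow` and the choice to store (cash, global time) pairs are assumptions, and the slide does not say how the stored measures are turned into an importance estimate.

```python
from collections import deque


class MeasureWindow:
    """Keep only the k most recent (cash, global_time) measures of one page."""

    def __init__(self, k):
        # A deque with maxlen drops the oldest measure automatically.
        self.measures = deque(maxlen=k)

    def record(self, cash, global_time):
        """Store the cash collected at one visit and the global time of that visit."""
        self.measures.append((cash, global_time))

    def recent_cash(self):
        """Total cash over the measures still inside the window."""
        return sum(cash for cash, _ in self.measures)
```

Because old measures fall out of the queue, a page whose incoming links disappear sees its recent cash shrink over the next few visits instead of being averaged against its entire past.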
28Interpolation
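The formula on this slide was lost in extraction. The sketch below shows one plausible interpolation rule, under the assumption that each page stores a single history value covering the last T units of global time: if the page was visited less than T ago, the old history is scaled down by the fraction of the window that is still recent and the new cash is added; otherwise the new cash alone is rescaled to a window of length T. The function and parameter names are hypothetical.

```python
def interpolate_history(h, cash, elapsed, window):
    """Return the updated windowed history of one page.

    h       -- history currently stored for the page
    cash    -- cash collected by the page since its previous visit
    elapsed -- global time elapsed since the previous visit (must be > 0)
    window  -- window length T, in the same units as elapsed
    """
    if elapsed < window:
        # Part of the old history is still recent: keep that fraction, add the new cash.
        return h * (window - elapsed) / window + cash
    # The previous visit is older than the window: rescale the new cash to length T.
    return cash * window / elapsed
```

For example, with a window of 10, a page with history 1.0 that collects 0.5 cash over 2 time units ends up at 1.0 * 8/10 + 0.5 = 1.3. Only one value and one timestamp per page need to be stored, which keeps the storage cost independent of how often a page is read.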
30Adaptive OPIC implementation
- It does not impose any constraints on the order of pages to visit
- The crawling strategy in Xyleme is close to Greedy since it is tailored to optimize our knowledge of the Web
- Considering only the recent part of the cash history for each page
31Experiments on synthetic data
- Convergence on important pages
32Experiments on synthetic data
- Impact of the window policy
33Experiments on Web data
- Experiments were conducted using the crawlers of Xyleme (e.g. 8 PCs with 1.5 GB of memory)
- Crawling strategy is close to Greedy
- History is managed using the Interpolation policy
- Experiments lasted for several months; we discovered close to one billion URLs and read 400 million of them
- Importance of read pages seems correct (with limited human checking).
- We could also give importance estimates for pages that were never read
- The size of the window was first too small, then we set it to 3 months
35New directions: sites vs. pages
- Limitation of page importance
- Google page importance works well when links have strong semantics
- More and more web pages are automatically generated and most links have little semantics
- More limitations
- Refresh at the page level presents drawbacks
- So we also use link topology between sites and not only between pages
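Moving from page-level to site-level topology can be sketched by collapsing each URL to its site and counting links between distinct sites. Treating the URL host as the site and dropping links inside a site are assumptions made for this illustration; the slides do not say how sites are delimited. The function name `site_links` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_links(page_links):
    """Aggregate page-to-page links into weighted site-to-site links.

    page_links is an iterable of (source_url, target_url) pairs.
    Returns a Counter mapping (source_site, target_site) to a link count.
    """
    counts = Counter()
    for src, dst in page_links:
        src_site = urlsplit(src).netloc   # assumption: the host identifies the site
        dst_site = urlsplit(dst).netloc
        if src_site != dst_site:          # assumption: links inside a site are ignored
            counts[(src_site, dst_site)] += 1
    return counts
```

The resulting site graph is much smaller than the page graph, and the same cash-distribution loop could in principle be run over it.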