Adaptive On-Line Page Importance Computation

1
Adaptive On-Line Page Importance Computation

Serge Abiteboul
INRIA
Domaine de Voluceau

2
Content

Brief Introduction
Main Idea
Problem Presentation
Static Graphs: OPIC
Adaptive OPIC
Implementation and Experiment
Conclusion

3
How does a web search engine work?

Crawling the web pages
Parsing the web pages
Indexing the web pages
Search page
Search word parsing
Sending the result

4
Crawling web pages

How to get those web pages
Changing web pages
How to update web pages
How often to update them
Different pages, different rank

5
Content

Brief Introduction
Main Idea
Problem Presentation
Static Graphs: OPIC
Adaptive OPIC
Implementation and Experiment
Conclusion

6
The web as a graph
7
A graph as a matrix
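The figure for this slide did not survive in the transcript. As a minimal sketch, assuming the common convention that page i splits its weight equally among its outgoing links, the link matrix can be built as follows. The function name `link_matrix` and the three-page example graph are hypothetical.

```python
def link_matrix(out_links, n):
    """Build an n x n matrix where entry [i][j] is 1/out(i) if page i links to j.

    out_links maps a page index to the list of pages it points to
    (an assumed representation, not taken from the slides).
    """
    matrix = [[0.0] * n for _ in range(n)]
    for i, children in out_links.items():
        for j in children:
            matrix[i][j] = 1.0 / len(children)
    return matrix


# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
web = {0: [1, 2], 1: [2], 2: [0]}
L = link_matrix(web, 3)
```

Each row of `L` sums to 1 as long as every page has at least one outgoing link.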
8
Importance
Importance of page i
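The formula on this slide was an image and is not in the transcript. One standard way to read "importance of page i" is as the fixpoint of repeatedly passing each page's importance along its links. The sketch below computes that fixpoint by plain iteration. It is an off-line baseline for comparison, not the on-line algorithm of the talk, and the function name and example graph are hypothetical.

```python
def importance(out_links, n, iterations=100):
    """Iterate x <- L^T x, starting from the uniform vector.

    Assumes every page has at least one outgoing link and that the
    graph is strongly connected and aperiodic, so the iteration converges.
    """
    x = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [0.0] * n
        for i, children in out_links.items():
            share = x[i] / len(children)  # page i splits its importance evenly
            for j in children:
                nxt[j] += share
        x = nxt
    return x


# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
web = {0: [1, 2], 1: [2], 2: [0]}
imp = importance(web, 3)
```

For this small graph the importance values settle near 0.4, 0.2 and 0.4.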
9
Content

Brief Introduction
Main Idea
Problem Presentation
Static Graphs: OPIC
Adaptive OPIC
Implementation and Experiment
Conclusion

10
Cost of computing page rank: storage

Huge history crawling web pages
Huge Cash Vector
Huge History Vector
Temp Vector
Variable length of
11
Cost of computing page rank: system

CPU
Memory
Disk access
Crawling web page
Communication
12
Content

Brief Introduction
Main Idea
Problem Presentation
Static Graphs: OPIC
Adaptive OPIC
Implementation and Experiment
Conclusion

13
Inductive Equation
Diverge
Converge to zero
14
Inductive Equation
Several solutions
Convergence problem
15
Inductive Equation
16
Static Graphs OPIC
First Cash
Credit of history page
Temp Vector
17
Static Graphs OPIC
for each i let C[i] := 1/n
for each i let H[i] := 0
let G := 0
do forever
begin
  choose some node i          // each node is selected infinitely often
  H[i] += C[i]                // single disk access per page
  for each child j of i, do
    C[j] += C[i]/out[i]       // distribution of cash depends on L
  G += C[i]
  C[i] := 0
end
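The OPIC loop on this slide can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: pages are chosen uniformly at random, every page has at least one outgoing link, and importance is estimated as (H[i] + C[i]) / (G + 1). The function name `opic` and the example graph are hypothetical.

```python
import random


def opic(out_links, n, steps, seed=0):
    """Run the on-line cash/history loop for a fixed number of steps."""
    rng = random.Random(seed)       # seeded so the run is repeatable
    cash = [1.0 / n] * n            # C: every page starts with 1/n cash
    history = [0.0] * n             # H: cash each page has spent so far
    total = 0.0                     # G: total cash spent by all pages
    for _ in range(steps):
        i = rng.randrange(n)        # assumed policy: uniform random choice
        spent = cash[i]
        history[i] += spent         # record the cash in the page's history
        cash[i] = 0.0               # the page gives away all of its cash
        share = spent / len(out_links[i])
        for j in out_links[i]:
            cash[j] += share        # children split the cash equally
        total += spent
    # Total cash stays at 1, so these estimates always sum to 1.
    return [(history[i] + cash[i]) / (total + 1.0) for i in range(n)]


# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
web = {0: [1, 2], 1: [2], 2: [0]}
estimate = opic(web, 3, steps=20000)
```

Zeroing the page's cash before distributing it keeps the total cash constant even if a page links to itself.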
18
Static Graphs OPIC
19
Lemma 2.2
20
Lemma 2.3
21
Lemma 2.4
If all pages are read infinitely often, the total history |H| goes to infinity.
22
Lemma 2.5
23
Content

Brief Introduction
Main Idea
Problem Presentation
Static Graphs: OPIC
Adaptive OPIC
Implementation and Experiment
Conclusion

24
Advantages of Adaptive OPIC

Fewer storage resources than standard algorithms
Less CPU, memory, and disk access
Easy to implement

25
Page crawling strategies
Error factor

Random average 1/n
Greedy 2/n
Cycle
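The three strategies named on this slide differ only in how the next page is picked. A minimal sketch, assuming Random means a uniform pick, Greedy means the page currently holding the most cash, and Cycle means a fixed round-robin order. All function names and the shared `(cash, step, rng)` signature are hypothetical.

```python
import random


def choose_random(cash, step, rng):
    """Random: every page is equally likely to be read next."""
    return rng.randrange(len(cash))


def choose_greedy(cash, step, rng):
    """Greedy: read the page that currently holds the most cash."""
    return max(range(len(cash)), key=lambda i: cash[i])


def choose_cycle(cash, step, rng):
    """Cycle: read pages in a fixed order, wrapping around."""
    return step % len(cash)


cash = [0.1, 0.7, 0.2]
rng = random.Random(0)
picks = {
    "random": choose_random(cash, 4, rng),
    "greedy": choose_greedy(cash, 4, rng),
    "cycle": choose_cycle(cash, 4, rng),
}
```

Keeping one signature for all three lets a crawler loop swap strategies without other changes.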

26
Window selection
Adaptive OPIC selects a fixed window T
27
A changing graph

The Web changes continuously, and so does the importance of pages.
Consider only the recent part of the cash history for each page.
The time window corresponding to the recent history may be defined as:
A fixed number of measures for each page
A fixed period of time for each page
A single value that interpolates the history for a specific period of time
When the number of nodes changes, there are some difficulties.
More precisely, the page importance of previously existing pages decreases automatically.
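The first window definition on this slide, a fixed number of measures per page, can be illustrated with a bounded queue. This is a hypothetical sketch: the class name `MeasureWindow` and the choice to store (cash, time) pairs are assumptions, not details from the talk.

```python
from collections import deque


class MeasureWindow:
    """Keep only the last `size` cash measures recorded for one page."""

    def __init__(self, size):
        # A deque with maxlen drops the oldest entry when it is full.
        self.measures = deque(maxlen=size)

    def record(self, cash, time):
        """Store the cash collected at one visit and when it happened."""
        self.measures.append((cash, time))

    def recent_history(self):
        """Sum of the cash over the measures still inside the window."""
        return sum(cash for cash, _ in self.measures)


window = MeasureWindow(size=3)
for visit_time, cash in enumerate([0.5, 0.1, 0.2, 0.4]):
    window.record(cash, visit_time)
recent = window.recent_history()  # the first measure (0.5) has been dropped
```

The cost is one small queue per page, which is the storage overhead the interpolated single-value policy avoids.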

28
Interpolation

If (G - G[i]) < T
Otherwise
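The update formulas for the two cases on this slide are missing from the transcript. The sketch below fills them in as an assumption: when the page was last read less than T ago, the old history is scaled down in proportion to the elapsed time and the new cash is added; otherwise the history is rebuilt from the new cash alone, scaled to the window. The function name and parameter names are hypothetical.

```python
def interpolate_history(history, cash, g_now, g_last, window):
    """Return the new single-value history for a page being read now.

    history: interpolated history stored at the previous visit (H[i])
    cash:    cash collected since the previous visit (C[i])
    g_now:   current value of the global counter G
    g_last:  value of G at the previous visit (G[i])
    window:  window length T, in the same units as G

    The two formulas below are an assumed reconstruction.
    """
    elapsed = g_now - g_last
    if elapsed < window:
        # Keep the part of the old history that still falls in the window.
        return history * (window - elapsed) / window + cash
    # The old history is entirely outside the window: rescale the new cash.
    return cash * window / elapsed


recent_visit = interpolate_history(0.6, 0.2, g_now=10.0, g_last=8.0, window=4.0)
stale_visit = interpolate_history(0.6, 0.2, g_now=20.0, g_last=8.0, window=4.0)
```

With this rule each page stores only three numbers (cash, history, and time of last visit) however often it is read.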

29
Content

Brief Introduction
Main Idea
Problem Presentation
Static Graphs: OPIC
Adaptive OPIC
Implementation and Experiment
Conclusion

30
Adaptive OPIC implementation

It does not impose any constraints on the order of pages to visit
The crawling strategy in Xyleme is close to Greedy, since it is tailored to optimize our knowledge of the Web
Only the recent part of the cash history is considered for each page

31
Experiments on synthetic data

Convergence on important pages

32
Experiments on synthetic data

Impact of the window policy

33
Experiments on Web data

Experiments were conducted using the crawlers of Xyleme (e.g. 8 PCs with 1.5 GB of memory)
Crawling strategy is close to Greedy
History is managed using the Interpolation policy
The experiments lasted for several months; we discovered close to one billion URLs and read 400 million of them
Importance of read pages seems correct (with limited human checking)
We could also give importance estimates for pages that were never read
The size of the window was first too small, then we set it to 3 months

34
Content

Brief Introduction
Main Idea
Problem Presentation
Static Graphs: OPIC
Adaptive OPIC
Implementation and Experiment
Conclusion

35
New directions: sites vs. pages

Limitations of page importance
Google page importance works well when links have strong semantics
More and more web pages are automatically generated, and most links have little semantics
More limitations
Refresh at the page level presents drawbacks
So we also use link topology between sites and not only between pages
