Intelligent Crawling - PowerPoint PPT Presentation

About This Presentation
Title:

Intelligent Crawling

Slides: 25
Provided by: Jungh1
Learn more at: http://oak.cs.ucla.edu

Transcript and Presenter's Notes

Title: Intelligent Crawling


1
Intelligent Crawling
  • Junghoo Cho
  • Hector Garcia-Molina
  • Stanford InfoLab

2
What is a crawler?
  • Program that automatically retrieves pages from
    the Web.
  • Widely used for search engines.

3
Challenges
  • There are many pages out on the Web.
  • (Major search engines indexed more than 100M
    pages)
  • The size of the Web is growing enormously.
  • Most of them are not very interesting.
  • Therefore, in most cases it is too costly or not
    worthwhile to visit the entire Web space.

4
Good crawling strategy
  • Make the crawler visit important pages first.
  • Save network bandwidth
  • Save storage space and management cost
  • Serve quality pages to the client application

5
Outline
  • Importance metrics
      • What are important pages?
  • Crawling models
      • How is a crawler evaluated?
  • Experiments
  • Conclusion and future work

6
Importance metric
  • The metric for determining if a page is HOT
  • Similarity to driving query
  • Location Metric
  • Backlink count
  • Page Rank

7
Similarity to a driving query
  • The pages related to a specific topic
    (examples: Sports, Bill Clinton)
  • Importance is measured by closeness of the page
    to the topic (e.g. the number of occurrences of
    the topic word in the page)
  • Personalized crawler
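
The word-count idea on this slide can be sketched as follows. This is only an illustration: the function name topic_similarity, the whole-word and case-insensitive matching, and the sample text are assumptions, not something the slides specify.

```python
import re


def topic_similarity(page_text: str, topic: str) -> int:
    """Count case-insensitive, whole-word occurrences of `topic` in `page_text`.

    Assumption: raw occurrence count stands in for "closeness to the topic".
    """
    pattern = r"\b" + re.escape(topic) + r"\b"
    return len(re.findall(pattern, page_text, flags=re.IGNORECASE))


# Hypothetical page text, made up for this example.
page = "Sports news: local sports teams, sports scores, and transport updates."
print(topic_similarity(page, "sports"))
print(topic_similarity(page, "Bill Clinton"))
```

A page with a higher count would be treated as more important for that driving query.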

8
Importance metric
  • The metric for determining if a page is HOT
  • Similarity to driving query
  • Location Metric
  • Backlink count
  • Page Rank

9
Backlink-based metric
  • Backlink count
      • number of pages pointing to the page
      • Citation metric
  • Page Rank
      • weighted backlink count
      • weight is iteratively defined

10
[Figure: example link graph over pages A, B, C, D, E, and F]
BackLinkCount(F) = 2
PageRank(F) = PageRank(E)/2 + PageRank(C)
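
A minimal sketch of both backlink-based metrics, under stated assumptions. The graph is hypothetical: only "E has two outgoing links, C has one, and both point to F" is taken from the example, and every other edge is invented so that each page has at least one outgoing link. The damping factor is also an addition, since the slide formula has none. It is used here so that the iteration settles on this small graph.

```python
def backlink_counts(links):
    """Number of pages pointing to each page."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=100):
    """Iteratively computed weighted backlink count.

    Assumes every page is a key of `links` and has at least one outgoing link.
    """
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in links}
        for source, targets in links.items():
            # Each page splits its current rank evenly among its outgoing links.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph; only the links into F mirror the slide's example.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["F"],
    "D": ["E"],
    "E": ["D", "F"],
    "F": ["A"],
}
ranks = pagerank(links)
print(backlink_counts(links)["F"])
print(round(ranks["F"], 4))
```

With damping d and N pages, the converged value satisfies PageRank(F) = (1 - d)/N + d * (PageRank(E)/2 + PageRank(C)), which reduces to the formula above when d = 1.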
11
Ordering metric
  • The metric for a crawler to estimate the
    importance of a page
  • The ordering metric can be different from the
    importance metric
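
One way to see why the two metrics can differ: the crawler only knows the links on pages it has already fetched. The sketch below uses backlink count for both roles, computed over the whole (hypothetical) graph for the importance metric and over the crawled part only for the ordering metric. The graph and the function name observed_backlinks are assumptions made for illustration.

```python
from collections import Counter


def observed_backlinks(pages):
    """Backlink counts using only the links found on the given pages."""
    counts = Counter()
    for targets in pages.values():
        for target in set(targets):
            counts[target] += 1
    return counts


# Hypothetical web graph: page -> pages it links to.
full_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["F"],
    "D": ["E"],
    "E": ["D", "F"],
    "F": ["A"],
}

importance = observed_backlinks(full_web)            # needs the whole graph
crawled_so_far = {url: full_web[url] for url in ("A", "B")}
ordering = observed_backlinks(crawled_so_far)        # what the crawler can see

print(importance["F"], ordering["F"])
print(importance["C"], ordering["C"])
```

After crawling only A and B, the estimate for C already matches its true count, while F still looks unimportant because neither of its backlinks has been seen yet.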

12
Crawling models
  • Crawl and Stop
      • Keep crawling until the local disk space is full.
  • Limited buffer crawl
      • Keep crawling until the whole web space is
        visited, throwing out seemingly unimportant pages.

13
Crawl and stop model
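
A rough sketch of the crawl and stop model on an in-memory graph. The capacity parameter stands in for "local disk space", and the function names, the tiny graph, and the alphabetical tie-breaking are all assumptions made for illustration. The ordering metric used here is the backlink count seen so far.

```python
def crawl_and_stop(web, seed, capacity, score):
    """Visit the best-scoring known URL until `capacity` pages are stored."""
    crawled = {}          # url -> outgoing links, in visit order
    frontier = {seed}     # discovered but not yet visited
    while frontier and len(crawled) < capacity:
        # sorted() makes ties deterministic: the alphabetically first URL wins.
        url = max(sorted(frontier), key=lambda u: score(u, crawled))
        frontier.remove(url)
        crawled[url] = web.get(url, [])
        frontier.update(t for t in crawled[url] if t not in crawled)
    return list(crawled)


def backlinks_seen(url, crawled):
    """Ordering metric: backlinks to `url` among pages crawled so far."""
    return sum(url in targets for targets in crawled.values())


# Hypothetical web graph: page -> pages it links to.
web = {"A": ["B", "C", "Z"], "B": ["Z"], "C": [], "Z": []}
print(crawl_and_stop(web, "A", 3, backlinks_seen))
```

Here Z is visited before C because, once B has been crawled, Z has two known backlinks and C has one. A breadth-first order over the same graph would store A, B, and C instead.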
14
Crawling models
  • Crawl and Stop
      • Keep crawling until the local disk space is full.
  • Limited buffer crawl
      • Keep crawling until the whole web space is
        visited, throwing out seemingly unimportant pages.

15
Limited buffer model
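
A comparable sketch of the limited buffer model. It assumes that a page's importance can be computed once the page has been fetched (here it is simply looked up in a hypothetical table of scores), that pages are visited breadth first, and that the least important buffered page is the one thrown out. None of these details come from the slides.

```python
from collections import deque


def limited_buffer_crawl(web, seed, buffer_size, importance):
    """Visit every reachable page, keeping only the `buffer_size` best ones."""
    buffer = {}                  # url -> importance of pages currently kept
    visited = set()
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        buffer[url] = importance(url)
        if len(buffer) > buffer_size:
            # Evict the least important page (alphabetical order breaks ties).
            victim = min(sorted(buffer), key=buffer.get)
            del buffer[victim]
        queue.extend(web.get(url, []))
    return buffer


# Hypothetical web graph and importance scores.
web = {"A": ["B", "C", "Z"], "B": ["Z"], "C": [], "Z": []}
scores = {"A": 1, "B": 5, "C": 0, "Z": 3}
kept = limited_buffer_crawl(web, "A", 2, scores.get)
print(sorted(kept))
```

Unlike crawl and stop, the crawl continues after the buffer fills, so a good page found late (Z in this example) can still displace a weaker page stored earlier (A).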
16
Architecture
[Architecture diagram. Components: WebBase Crawler, Virtual Crawler,
HTML parser, URL selector, URL pool, Repository, Stanford WWW.
Data-flow labels: crawled page, extracted URL, page info, selected URL.]
17
Experiments
  • Backlink-based importance metric
      • backlink count
      • PageRank
  • Similarity-based importance metric
      • similarity to a query word

18
Ordering metrics in experiments
  • Breadth first order
  • Backlink count
  • PageRank

19
(No Transcript)
20
Similarity-based crawling
  • The content of the page is not available before
    it is visited
  • Essentially, the crawler should guess the
    content of the page
  • More difficult than backlink-based crawling

21
Promising page
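[Figure: clues that an unvisited page (shown as "?") is promising for
the query "Sports": anchor text ("Sports"), a HOT parent page
("Sports!! Sports!!"), and the URL ("/sports.html")]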
22
Virtual crawler for similarity-based crawling
  • Promising page
      • Query word appears in its anchor text
      • Query word appears in its URL
      • The page pointing to it is an important page
  • Visit promising pages first
  • Visit non-promising pages in the ordering
    metric order
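
The promising-page rule can be sketched as below. Several details are assumptions rather than something the slides state: a page counts as promising if any one of the three conditions holds, matching is a case-insensitive substring test, and promising pages are also ordered among themselves by the ordering metric. The Candidate type and the sample URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    anchor_text: str
    parent_is_hot: bool      # the page pointing to this URL is important
    ordering_score: float    # value of the ordering metric for this URL


def is_promising(candidate, query):
    """True if any of the three clues suggests the page matches the query."""
    q = query.lower()
    return (
        q in candidate.anchor_text.lower()
        or q in candidate.url.lower()
        or candidate.parent_is_hot
    )


def visit_order(candidates, query):
    """Promising pages first; within each group, highest ordering score first."""
    return sorted(
        candidates,
        key=lambda c: (not is_promising(c, query), -c.ordering_score),
    )


candidates = [
    Candidate("/news.html", "Latest news", False, 9.0),
    Candidate("/sports.html", "Scores", False, 1.0),
    Candidate("/a.html", "Sports!!", False, 2.0),
    Candidate("/b.html", "More", True, 0.5),
    Candidate("/misc.html", "Misc", False, 3.0),
]
print([c.url for c in visit_order(candidates, "sports")])
```

Note that /news.html has the highest ordering score but is still visited after all three promising pages, which is the point of the two-tier rule.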

23
(No Transcript)
24
Conclusion
  • PageRank is generally good as an ordering metric.
  • By applying a good ordering metric, it is
    possible to gather important pages quickly.