Web Crawling - PowerPoint PPT Presentation

About This Presentation
Title:

Web Crawling

Slides: 12
Provided by: usfi

Transcript and Presenter's Notes


1
Web Crawling
2
  • Focused Crawling
  • Incremental Crawling

3
Crawling Lingo
4
Breadth-First Crawl
  • BFS: Breadth-First Search
  • The frontier is the set of web pages whose links
    have not yet been explored.
  • The crawl begins with the seed pages.
  • BFS treats the frontier as a FIFO queue: take
    the first page, extract its links, and append them
    to the end.
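
The FIFO behavior described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a real crawler: the `LINKS` dictionary is a hypothetical in-memory stand-in for fetching pages and extracting their links, and the `seen` set and `max_pages` limit are assumptions added so the loop terminates on graphs with cycles.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for the web:
# page name -> pages it links to.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}


def bfs_crawl(seeds, links, max_pages=100):
    """Return pages in the order a breadth-first crawl visits them."""
    frontier = deque(seeds)  # FIFO queue of pages whose links are unexplored
    seen = set(seeds)        # pages already queued, so nothing is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        page = frontier.popleft()           # grab the first page
        visited.append(page)
        for target in links.get(page, []):  # extract its links
            if target not in seen:
                seen.add(target)
                frontier.append(target)     # put them on the end
    return visited


order = bfs_crawl(["seed"], LINKS)
print(order)
```

Pages one link away from the seed are visited before pages two links away, which is the defining property of a breadth-first crawl.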

5
Best First
  • Sort the frontier by an estimation criterion
  • e.g., how often the terms of the topic appear in
    the document (cosine similarity)
  • Remove the lowest-scoring entries from the frontier
  • Others have scored links by their anchor text, so
    that the target page does not have to be opened.
  • Shark Search uses anchor text, text around
    anchor, and inherited score from ancestors.
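
A best-first frontier can be sketched with a priority queue. The example below rests on several assumptions that are not in the slides: pages live in a hypothetical `PAGES` dictionary, each link is scored by the cosine similarity between the topic and the page the link was found on, and "remove lowest" is read as pruning the lowest-scoring entries once the frontier exceeds `max_frontier`. It does not implement Shark Search.

```python
import heapq
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical pages: name -> (text, outgoing links).
PAGES = {
    "seed": ("web crawling overview", ["crawl", "recipes"]),
    "crawl": ("focused web crawling with a crawling frontier", ["hits"]),
    "recipes": ("pasta recipes and cooking tips", ["bread"]),
    "hits": ("hubs and authorities for web crawling", []),
    "bread": ("bread baking recipes", []),
}


def best_first_crawl(seeds, pages, topic, max_frontier=10, max_pages=100):
    """Visit pages in order of the estimated relevance of their links."""
    # heapq is a min-heap, so scores are negated to pop the best entry first.
    frontier = [(-1.0, s) for s in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        text, links = pages[page]
        score = cosine(topic, text)  # assumption: links inherit this score
        for target in links:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
        if len(frontier) > max_frontier:
            # Drop the lowest-scoring entries, keeping the best max_frontier.
            frontier = heapq.nsmallest(max_frontier, frontier)
            heapq.heapify(frontier)
    return visited


visited = best_first_crawl(["seed"], PAGES, topic="web crawling")
print(visited)
```

Because both links on the seed page inherit the same score, the crawl cannot tell them apart until their own text has been read. Scoring by anchor text, as the slide mentions, is one way to break such ties without opening the target page.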

6
Chakrabarti: Two Measures
  • Relevance of a document to a topic: the CLASSIFIER
  • How beneficial it is to crawl a document: the DISTILLER
  • A very relevant page without links is only a
    finishing point in the crawl. In contrast, hubs
    are good for crawling, and good hubs should be
    checked frequently for new resource links.

7
Master Category Tree
  • Root (all documents): $R_{\text{root}}(p) = 1$
  • $\sum_i R_{c_i}(p) = R_c(p)$, where the $c_i$ are the children of $c$
  • [Figure: category tree with nodes Root, A, B, A1, A2]
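
The constraint on this slide (the root has relevance 1, and the relevances of a node's children sum to the node's own relevance) can be checked with a small example. The sketch below assumes that A1 and A2 are children of A, which the flattened figure does not confirm, and that each child's score is its parent's score times a conditional probability; the numbers in `COND` are made up for illustration.

```python
# Hypothetical taxonomy using the slide's node names: parent -> children.
# That A1 and A2 hang under A is an assumption.
TREE = {"Root": ["A", "B"], "A": ["A1", "A2"]}

# Made-up classifier output for one page p: the probability that p belongs
# to a child category, given that it belongs to the parent. The values
# under any one parent sum to 1.
COND = {"A": 0.8, "B": 0.2, "A1": 0.25, "A2": 0.75}


def relevance(tree, cond, root="Root"):
    """Return R_c(p) for every category c, starting from R_root(p) = 1."""
    scores = {root: 1.0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            # Child relevance is the parent's relevance split by cond.
            scores[child] = scores[node] * cond[child]
            stack.append(child)
    return scores


R = relevance(TREE, COND)
for name in ("Root", "A", "B", "A1", "A2"):
    print(name, round(R[name], 3))
```

Since the conditional probabilities under each parent sum to 1, the children's scores add up to the parent's score at every level of the tree.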
8
Chakrabarti
  • Applied Hubs/Authorities
  • Authorities should be put into final set.
  • Authorities might not have any (good) links.
  • What you really want to crawl next are hubs:
    pages that point to many authorities.
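
For reference, the basic hubs-and-authorities iteration (Kleinberg's HITS) can be sketched as follows. This is the plain, unweighted algorithm on a hypothetical `GRAPH`, not the specific variant used in Chakrabarti's crawler; the page names and the fixed iteration count are assumptions.

```python
import math

# Hypothetical link graph: page -> pages it links to.
GRAPH = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2", "auth3"],
    "auth1": [],
    "auth2": [],
    "auth3": ["auth1"],
}


def hits(graph, iterations=20):
    """Return (hub, authority) score dictionaries after a fixed number of rounds."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


hub, auth = hits(GRAPH)
best_hub = max(hub, key=hub.get)
best_auth = max(auth, key=auth.get)
print(best_hub, best_auth)
```

In this toy graph the page with the most links to well-cited pages ends up as the top hub, which is the kind of page the slide suggests crawling next, while the most-cited page with no outgoing links ends up as the top authority.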

9
Chakrabarti, Heuristics
  • Pages cite pages on related subjects
  • A page that points to one page with a desired
    topic is more likely than a random page to point
    to other pages with desired topics.
  • e.g., in one test, a page that pointed to a given
    first-level Yahoo topic had a 45% chance of
    pointing to another page on that topic.

10
Intro to IR Slides
  • http://www.othmerinstitute.poly.edu/projects/Infor
    etrieval_suel.ppt
  • discusses cosine measures
  • discusses process of automatically generating
    documents (as in Chakrabarti paper)

11
Category Tree