1
Crawling Techniques
  • EECS767 - Information Retrieval
  • Instructor: Dr. Susan Gauch
  • by
  • KP Muthuvelan ( 703847 )
  • Shankar Palanivel ( 706506 )
  • Roopesh Rajamani ( 698645 )

2
How Search Engines Used to Work
  • Build a huge search index for a collection of documents.
  • Manually indexed search engines (e.g., Yahoo).
  • Drawbacks:
  • Largely static in nature.
  • Not suitable for dynamic web pages.

3
How Search Engines Work Now
  • The user submits a URL.
  • The search engine sends out spiders periodically.
  • Web pages retrieved by the spiders are indexed.
  • Advantages:
  • Better at handling dynamically changing pages.
  • Highly scalable because of automatic indexing.

4
Generic Search Engine
5
How Else Are Spiders Used?
  • Personal Web Maps (PWM)
  • A personal collection of interesting web pages whose construction the user can control.
  • The user can instruct the PWM to be expanded along a particular direction.
  • Searches within an intranet
  • The document collection is not huge.
  • Dynamic searching is an advantage.

6
Spiders / Crawlers / WebBots?
  • A program that traverses the known or visible World-Wide Web and downloads the documents found.
  • The spider is given a set of initial documents or a query, along with a starting seed URL.
  • The spider retrieves all web documents that are connected to the starting URL.

7
Basic Approach Of Spiders
  • The spider extracts the text of the starting URL
    and identifies the embedded links in the
    document.
  • The embedded links are put in a queue, to be
    retrieved later.
  • The links in the queue are retrieved and analyzed, and additional embedded links are added to the queue data structure; a minimal sketch of this loop follows.
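The loop above can be sketched in a few lines of Python. This is an illustration only, not code from the presentation: the function name basic_spider, the regex-based link extraction, and the page limit are all assumptions.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def basic_spider(seed_url, max_pages=50):
    """Fetch the seed URL, queue its embedded links, and repeat."""
    queue = deque([seed_url])   # links waiting to be retrieved later
    visited = set()             # avoid fetching the same URL twice
    documents = {}              # URL -> retrieved page text

    while queue and len(documents) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue            # skip unreachable or non-HTML resources
        documents[url] = html
        # Extract embedded links (crude href matching, good enough for a sketch)
        for href in re.findall(r'href=["\'](.*?)["\']', html):
            queue.append(urljoin(url, href))
    return documents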

8
Problems with the Basic Approach
  • The number of connected documents on the web is very large, and spiders operate under time and bandwidth constraints.
  • The visibility of the spider is limited to the graph of documents connected to the starting point.
  • A priority queue is needed instead of a plain queue, so that the most relevant documents are retrieved first.

9
Metrics for Priority Queue
  • Similarity to a driving query Q
  • Follows the same approach as indexing: tf-idf weighting.
  • Open question: how is the inverse document frequency calculated before the full collection exists?
  • Backlink count
  • The priority of a link depends on the number of pages that contain the link.
  • The backlink count is calculated from the web pages retrieved so far (see the sketch below).
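One plausible way to realize these metrics is a crawl frontier ordered by a heap, keyed by whichever metric is in use. The class and helper names below are assumptions for illustration; note that a proper tf-idf score would also weight terms by inverse document frequency, which, as noted above, is hard to compute before the collection is complete.

import heapq
from collections import Counter

class Frontier:
    """URL frontier ordered by priority (query similarity, backlink count, ...)."""
    def __init__(self):
        self._heap = []

    def push(self, url, priority):
        # heapq is a min-heap, so negate the priority to pop the best URL first
        heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        _, url = heapq.heappop(self._heap)
        return url

def query_similarity(query_terms, text):
    """Crude tf-style score: how often the driving query's terms occur in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query_terms)

# Backlink count metric: number of already-retrieved pages that link to a URL
backlink_count = Counter()

When a page is fetched, each outgoing link would be pushed with either query_similarity(...) of its surrounding text or backlink_count[link] as its priority.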

10
Metrics for Priority Queue
  • Page Rank
  • It recursively defines the importance of a page
    to be the weighted sum of backlinks to it.
  • PageRank(P) = (1 - d) + d * ( PageRank(T1)/c1 + ... + PageRank(Tn)/cn ),
    where T1 ... Tn are the pages linking to P, ci is the number of outgoing
    links on Ti, and d is a damping factor.
  • Location Metric
  • Importance of a page P is a function of its
    location, not its contents.
  • URLs with fewer slashes are more useful than URLs
    with more slashes.
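The PageRank definition above can be computed by simple iteration. The sketch below, including the damping factor value and the tiny example graph, is illustrative only; the location metric from the same slide could be approximated even more simply, for example by preferring URLs with fewer '/' characters.

def pagerank(links, d=0.85, iterations=20):
    """Iterate PR(P) = (1 - d) + d * sum(PR(T)/c(T)) over the pages T linking to P.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = d * rank[page] / len(targets)   # PR(T)/c(T) contribution
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Tiny example graph: A -> B, A -> C, B -> C, C -> A
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))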

11
Fish-Search Algorithm
  • Based on the assumption that relevant documents have relevant neighbors.
  • It takes as input a seed URL and a search query, and dynamically builds a priority list of the next URLs to be explored.
  • At each step, the first node is popped from the list and processed.

12
Fish-Search Algorithm
  • As the document's text becomes available, the text is used to determine whether the document is relevant or not.
  • The relevance of a document determines whether to pursue exploration in that direction.
  • The document is scanned for links. Each link (the document's children) is assigned some predetermined depth value.

13
Fish-Search Algorithm
  • The relevance score is assigned in a discrete manner: 1 for relevant, 0 or 0.5 for irrelevant.
  • Uses primitive string or regular-expression matching to check relevance.
  • If the document is relevant, its children are assigned the predetermined depth value; otherwise the depth value is decremented by one.
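Putting the last three slides together, the fish-search control loop might look roughly like the sketch below. fetch(url) and extract_links(url, text) are hypothetical helpers, and the width bounds and exact child ordering of the published algorithm are omitted.

import re
from collections import deque

def is_relevant(text, query):
    """Fish-search style binary relevance: primitive string/regex matching."""
    return re.search(query, text, re.IGNORECASE) is not None

def fish_search(seed_url, query, fetch, extract_links, depth=3, max_pages=100):
    frontier = deque([(seed_url, depth)])   # ordered list of (URL, remaining depth)
    seen, results = {seed_url}, []

    while frontier and len(results) < max_pages:
        url, d = frontier.popleft()          # pop the first node and process it
        text = fetch(url)
        if text is None:
            continue
        relevant = is_relevant(text, query)
        if relevant:
            results.append(url)
        # Children of relevant documents get the full depth again;
        # children of irrelevant ones get the parent's depth minus one.
        child_depth = depth if relevant else d - 1
        if child_depth > 0:
            for link in extract_links(url, text):
                if link not in seen:
                    seen.add(link)
                    if relevant:
                        frontier.appendleft((link, child_depth))  # promising area: explore first
                    else:
                        frontier.append((link, child_depth))
    return results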

14
Drawbacks of the Fish-Search Algorithm
  • Very low differentiation of the priority of pages
    in the list.
  • Arbitrary pruning occurs.
  • The number of children to be selected as relevant
    is arbitrary and fixed.

15
Shark-Search Algorithm
  • Unlike the binary relevance metric used by the fish-search algorithm, it uses a similarity engine that returns a fuzzy score between 0 and 1.
  • The vector-space model is used to construct the similarity engine.
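A minimal term-count version of such a vector-space similarity engine is sketched below; a production engine would use tf-idf weights rather than raw counts, and the function name is an assumption.

import math
from collections import Counter

def cosine_similarity(query, document):
    """Vector-space similarity over term counts: a fuzzy score between 0 and 1."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0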

16
Shark-Search Algorithm
  • The similarity engine is orthogonal to the crawler algorithm.
  • Instead of a depth value, the children get an inherited score from the parent.
  • The inherited score is obtained by multiplying the parent's similarity score by a fading factor.

17
Shark-Search Algorithm
  • The potential score of a child is also based on the meta-information included in the link (its anchor text).
  • This can fail in some cases, so the close textual context of the link can be used as well.
  • This helps to differentiate pages better.
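The scoring described on the last two slides can be sketched as follows. The weights fade and gamma and the exact combination are illustrative assumptions; the published shark-search formulas differ in detail.

def child_scores(parent_sim, parent_inherited, anchor_sim, context_sim,
                 fade=0.5, gamma=0.8):
    """Return (inherited, potential) scores for a child link.

    parent_sim       - fuzzy similarity (0..1) of the parent page to the query
    parent_inherited - inherited score the parent itself received
    anchor_sim       - similarity of the link's anchor text (meta-information)
    context_sim      - similarity of the close textual context of the link
    """
    # Children inherit a faded version of the parent's score instead of a depth value
    if parent_sim > 0:
        inherited = fade * parent_sim
    else:
        inherited = fade * parent_inherited
    # Anchor text drives the link's own evidence; fall back to the surrounding
    # context when the anchor text alone is uninformative (e.g. "click here")
    neighbourhood = anchor_sim if anchor_sim > 0 else context_sim
    # The potential score mixes the inherited score with the link's own evidence
    potential = gamma * inherited + (1 - gamma) * neighbourhood
    return inherited, potential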

18
Evaluation Measure
  • The two algorithms are evaluated by taking the sum of the similarity scores of the retrieved documents.
  • The shark-search algorithm performed better than the fish-search algorithm.
  • The shark-search algorithm is now implemented in a site-mapping tool called Mapuccino.

19
Incremental Crawler Algorithms
  • A periodic crawler builds a new collection of retrieved documents each time it is dispatched on the web.
  • An incremental crawler continuously updates/refreshes the local collection of retrieved documents.

20
Operational model of an Incremental Crawler
21
Architecture of Incremental Crawler
  • Major Data Structures
  • AllUrls
  • List of all URLs the crawler has discovered so far
  • CollUrls
  • URLs whose pages are kept in the collection
  • Collection
  • The stored documents for the URLs in CollUrls

22
Architecture of Incremental Crawler
  • Major Modules
  • Ranking Module
  • The Ranking Module constantly scans through
    AllUrls and CollUrls to make the refinement
    decision.
  • Update Module
  • Takes the URL at the top of CollUrls and decides whether its page has been updated or not.

23
Architecture of Incremental Crawler
  • Crawl Module
  • Adds newly discovered URLs to the AllUrls structure and refreshes the Collection data structure.
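A compact sketch of how these data structures and modules could fit together. fetch(url) and rank(url) are assumed helpers, and a real incremental crawler would bound the collection size and recompute priorities on a schedule; none of the code below comes from the cited paper.

import heapq

class IncrementalCrawler:
    def __init__(self, fetch, rank):
        self.fetch = fetch            # fetch(url) -> page text or None (assumed helper)
        self.rank = rank              # rank(url) -> importance score (assumed helper)
        self.all_urls = set()         # AllUrls: every URL discovered so far
        self.coll_urls = []           # CollUrls: heap of (-priority, url) in the collection
        self.collection = {}          # Collection: url -> stored document text

    def ranking_module(self):
        """Scan AllUrls and decide which URLs deserve a place in CollUrls."""
        self.coll_urls = [(-self.rank(u), u) for u in self.all_urls]
        heapq.heapify(self.coll_urls)

    def update_module(self):
        """Take the top URL of CollUrls and refresh its copy if the page changed."""
        if not self.coll_urls:
            return
        priority, url = heapq.heappop(self.coll_urls)
        text = self.fetch(url)
        if text is not None and text != self.collection.get(url):
            self.collection[url] = text              # the page was updated: refresh the copy
        heapq.heappush(self.coll_urls, (priority, url))  # re-queue for a later visit

    def crawl_module(self, new_urls):
        """Add newly discovered URLs to AllUrls."""
        self.all_urls.update(new_urls)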

24
Architecture of Incremental Crawler
25
Future Work and Conclusion
  • Most research work is concentrated on ranking the URLs obtained from the retrieved documents.
  • Spiders based on genetic algorithms (GA) and neural networks are being considered to make them more efficient and robust.

26
References
  • The Shark-Search Algorithm - An Application: Tailored Web Site Mapping. M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalheim, and S. Ur. 7th International World-Wide Web Conference (WWW7), 1998.
  • Efficient Crawling Through URL Ordering. Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. 7th International World-Wide Web Conference (WWW7), 1998.

27
References
  • An Intelligent Personal Spider (Agent) for Dynamic Internet/Intranet Searching. H. C. Chen, Y. M. Chung, M. Ramsey, and C. C. Yang. Decision Support Systems 23, 41-58, 1998.
  • The Evolution of the Web and Implications for an Incremental Crawler. Junghoo Cho and Hector Garcia-Molina. Submitted to VLDB 2000, Experience/Application track, 2000.