Title: Crawling Techniques
1 Crawling Techniques
- EECS 767 - Information Retrieval
- Instructor: Dr. Susan Gauch
- by
- KP Muthuvelan (703847)
- Shankar Palanivel (706506)
- Roopesh Rajamani (698645)
2 How Search Engines Used to Work
- Build a huge search index for a collection of documents.
- Manually indexed search engines (e.g., Yahoo).
- Drawbacks
- Largely static in nature.
- Not suitable for dynamic web pages.
3 How Search Engines Work Now
- The user submits a URL.
- The search engine sends out spiders periodically.
- Web pages retrieved by the spiders are indexed.
- Advantages
- Better at handling dynamically changing pages.
- Highly scalable because of automatic indexing.
4 Generic Search Engine
5 How Else Are Spiders Used?
- Personal Web Maps (PWM)
- A personal collection of interesting web pages whose construction the user controls.
- The user can instruct the PWM to expand along a particular direction.
- Searches within an intranet
- The document collection is not huge.
- Dynamic search advantage.
6 Spider / Crawler / WebBot?
- A program that traverses the known or visible World-Wide Web and downloads the documents it finds.
- The spider is given a set of initial documents or a query, along with a starting seed URL.
- The spider retrieves all the web documents that are connected to the starting URL.
7 Basic Approach of Spiders
- The spider extracts the text of the starting URL and identifies the embedded links in the document.
- The embedded links are put in a queue, to be retrieved later.
- The links in the queue are retrieved and analyzed, and additional embedded links are added to the queue data structure (see the sketch below).
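
A minimal sketch of this basic queue-driven spider in Python. The link-extraction regex, the timeout, and the page limit are illustrative assumptions, not details from the slides.

```python
# Basic spider: fetch a page, extract its links, append them to a FIFO
# queue, and repeat. All limits and patterns below are hypothetical.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def basic_crawl(seed_url, max_pages=50):
    queue = deque([seed_url])      # FIFO queue of URLs to retrieve later
    visited = set()
    pages = {}                     # url -> downloaded text
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue               # skip unreachable documents
        pages[url] = html
        # identify embedded links and add them to the queue data structure
        for link in LINK_RE.findall(html):
            queue.append(urljoin(url, link))
    return pages
```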
8 Problems with the Basic Approach
- The number of connected documents on the web is very large, and spiders operate under time and bandwidth constraints.
- The visibility of the spider is limited to the connected graph of documents reachable from the starting point.
- Need to construct a priority queue instead of a plain queue, in order to retrieve the relevant documents first (see the sketch below).
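
A minimal sketch of swapping the FIFO queue for a priority queue, assuming some scoring function supplies each URL's importance; the scores shown are made up, and the metrics on the next two slides are what would actually produce them.

```python
# Priority-queue frontier: pop the most promising URL first instead of
# the oldest one. Python's heapq is a min-heap, so scores are negated.
import heapq

class PriorityFrontier:
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

frontier = PriorityFrontier()
frontier.push("http://example.edu/a.html", score=0.9)   # hypothetical scores
frontier.push("http://example.edu/b.html", score=0.2)
print(frontier.pop())   # ('http://example.edu/a.html', 0.9)
```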
9 Metrics for the Priority Queue
- Similarity to a Driving Query Q
- Follow the same approach as for indexing: the tf-idf approach.
- How to calculate the inverse document frequency?
- Backlink Count
- The priority of a link depends on the number of pages that contain the link.
- The backlink count is calculated from the web pages retrieved so far.
10 Metrics for the Priority Queue
- PageRank
- It recursively defines the importance of a page to be the weighted sum of the backlinks to it.
- PageRank(P) = (1 - d) + d (PageRank(T1)/c1 + ... + PageRank(Tn)/cn), where T1..Tn are the pages linking to P, ci is the number of outgoing links on Ti, and d is the damping factor (see the sketch below).
- Location Metric
- The importance of a page P is a function of its location, not its contents.
- URLs with fewer slashes are more useful than URLs with more slashes.
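
A minimal sketch of the PageRank metric applied to the pages crawled so far, using the formula above. The tiny link graph and the damping factor d = 0.85 are illustrative assumptions.

```python
# Iterative PageRank: PageRank(P) = (1 - d) + d * sum(PageRank(Ti)/ci)
# over the pages Ti that link to P, where ci is Ti's out-degree.
def pagerank(links, d=0.85, iterations=30):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - d) for p in pages}
        for src, targets in links.items():
            if not targets:
                continue
            share = d * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical link graph
print(pagerank(graph))   # C, linked from both A and B, ends up ranked highest
```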
11 Fish-Search Algorithm
- Based on the assumption that relevant documents have relevant neighbors.
- It takes as input a seed URL and a search query, and dynamically builds a priority list of the next URLs to be explored.
- At each step the first node is popped from the list and processed.
12 Fish-Search Algorithm
- As the document's text becomes available, it is used to determine whether the document is relevant or not.
- The relevance of a document determines whether to pursue the exploration in that direction.
- The document is scanned for links. Each link (denoting a child) is assigned some predetermined depth value.
13 Fish-Search Algorithm
- The relevance score is assigned in a discrete manner: 1 for relevant, 0 or 0.5 for irrelevant.
- Uses primitive string or regular-expression matching to check relevance.
- If the document is relevant, its children are assigned the predetermined depth value; otherwise the depth value is decremented by one (see the sketch below).
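
A minimal sketch of the fish-search scheduling rule described on the last three slides. fetch() and extract_links() are assumed helpers (they would download a page and return its embedded links), the relevance test is the primitive string matching the slides mention, and the depth and page limits are illustrative.

```python
# Fish-search: binary relevance; children of relevant pages keep the full
# depth and go to the front of the list, children of irrelevant pages get
# a decremented depth and go to the back.
def is_relevant(text, query):
    # primitive string matching: 1 if every query term appears, else 0
    return 1 if all(term.lower() in text.lower() for term in query.split()) else 0

def fish_search(seed_url, query, fetch, extract_links,
                initial_depth=3, max_pages=100):
    frontier = [(seed_url, initial_depth)]    # priority list of (url, depth)
    visited, results = set(), []
    while frontier and len(results) < max_pages:
        url, depth = frontier.pop(0)          # pop the first node in the list
        if url in visited or depth <= 0:      # depth 0 means this branch is pruned
            continue
        visited.add(url)
        text = fetch(url)
        score = is_relevant(text, query)
        if score:
            results.append(url)
        # relevant parents pass the predetermined depth to their children,
        # irrelevant parents pass a decremented depth
        child_depth = initial_depth if score else depth - 1
        children = [(link, child_depth) for link in extract_links(text, url)]
        frontier = children + frontier if score else frontier + children
    return results
```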
14 Drawbacks of the Fish-Search Algorithm
- Very low differentiation of the priority of pages in the list.
- Arbitrary pruning occurs.
- The number of children selected as relevant is arbitrary and fixed.
15 Shark-Search Algorithm
- Unlike the binary relevance metric used by the fish-search algorithm, it uses a similarity engine which returns a fuzzy score between 0 and 1.
- The vector-space model is used to construct the similarity engine (see the sketch below).
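
A minimal sketch of a vector-space similarity engine of the kind the shark-search algorithm plugs in: the query and a document are turned into term vectors and compared with cosine similarity, giving a fuzzy score between 0 and 1. The whitespace tokenisation and raw term-frequency weights are simplifying assumptions.

```python
# Cosine similarity over raw term-frequency vectors.
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine_similarity(query, document):
    q, d = vectorize(query), vectorize(document)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("web crawler algorithms",
                        "a crawler is a program that traverses the web"))
```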
16 Shark-Search Algorithm
- The similarity engine is orthogonal to the crawler algorithm.
- Instead of the depth value, the children get an inherited score from the parent.
- The inherited score is obtained by multiplying the parent's similarity score by a fading factor.
17 Shark-Search Algorithm
- The potential score of a child is also based on the meta-information included in the link (its anchor text).
- This can fail in some cases, so the close textual context of the link can be used as well.
- Helps in better differentiation of the pages (see the sketch below).
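
A minimal sketch of how a child link's priority could be computed from the ideas on the last two slides: a fading factor applied to the parent's score, plus evidence from the anchor text or, failing that, its close textual context. The fading factor delta, the weight beta, and the combination rule are illustrative assumptions, not the exact formula from the shark-search paper.

```python
# Shark-search style child scoring: inherited score plus link-neighborhood score.
def child_priority(parent_score, anchor_text, anchor_context, query,
                   similarity, delta=0.5, beta=0.8):
    # inherited score: the parent's score multiplied by a fading factor
    inherited = delta * parent_score
    # neighborhood score: use the anchor text if it matches the query,
    # otherwise fall back to the close textual context of the link
    anchor_score = similarity(query, anchor_text)
    neighborhood = anchor_score if anchor_score > 0 else similarity(query, anchor_context)
    # potential score combines the inherited and neighborhood evidence
    return beta * inherited + (1 - beta) * neighborhood
```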
18 Evaluation Measure
- The two algorithms are evaluated by taking the sum of the similarity scores of the retrieved documents (see the sketch below).
- The shark-search algorithm performed better than the fish-search algorithm.
- The shark-search algorithm is now implemented in a site-mapping tool called Mapuccino.
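
A tiny illustration of the evaluation measure: sum each crawler's similarity scores and compare. The score lists are made-up values, not results from the papers.

```python
# The crawl whose retrieved documents have the larger total similarity wins.
fish_scores  = [0.2, 0.1, 0.4, 0.0]    # hypothetical per-document scores
shark_scores = [0.6, 0.3, 0.5, 0.2]
print(sum(shark_scores) > sum(fish_scores))   # True
```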
19 Incremental Crawler Algorithms
- A periodic crawler builds a new collection of retrieved documents each time it is dispatched into the web.
- An incremental crawler continuously updates and refreshes the local collection of retrieved documents.
20 Operational Model of an Incremental Crawler
21 Architecture of the Incremental Crawler
- Major Data Structures
- AllUrls
- List of all URLs the crawler has discovered so far
- CollUrls
- URLs selected for the collection
- Collection
- The documents of the CollUrls
22 Architecture of the Incremental Crawler
- Major Modules
- Ranking Module
- The Ranking Module constantly scans through AllUrls and CollUrls to make the refinement decision.
- Update Module
- Takes the link at the top of CollUrls and decides whether it has been updated or not.
23 Architecture of the Incremental Crawler
- Crawl Module
- Adds URLs to the AllUrls structure and refreshes the Collection data structure (see the sketch below).
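
A minimal sketch tying the three data structures (AllUrls, CollUrls, Collection) to the three modules (ranking, update, crawl). The rank() and fetch() helpers and the collection size are illustrative assumptions, not details given on the slides.

```python
# Skeleton of the incremental crawler architecture from slides 21-23.
class IncrementalCrawler:
    def __init__(self, rank, fetch, collection_size=1000):
        self.all_urls = set()      # AllUrls: every URL discovered so far
        self.coll_urls = []        # CollUrls: URLs chosen for the collection, best first
        self.collection = {}       # Collection: url -> stored document text
        self.rank = rank           # assumed importance estimate for a URL
        self.fetch = fetch         # assumed: downloads a page, returns (text, out_links)
        self.collection_size = collection_size

    def ranking_module(self):
        # refinement decision: keep the top-ranked URLs in CollUrls and
        # drop stored documents that fall out of the collection
        ranked = sorted(self.all_urls, key=self.rank, reverse=True)
        self.coll_urls = ranked[: self.collection_size]
        keep = set(self.coll_urls)
        for url in list(self.collection):
            if url not in keep:
                del self.collection[url]

    def update_module(self):
        # revisit the URL at the top of CollUrls and refresh its stored copy
        if self.coll_urls:
            self.crawl_module(self.coll_urls[0])

    def crawl_module(self, url):
        # fetch a page, record newly discovered URLs, refresh the Collection
        text, out_links = self.fetch(url)
        self.all_urls.add(url)
        self.all_urls.update(out_links)
        self.collection[url] = text
```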
24 Architecture of the Incremental Crawler
25 Future Work and Conclusion
- Most research work is concentrated on ranking the URLs obtained from the retrieved documents.
- Spiders based on genetic algorithms (GA) and neural networks are being considered to make them more efficient and robust.
26 References
- M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalheim, and S. Ur. The Shark-Search Algorithm - An Application: Tailored Web Site Mapping. 7th World-Wide Web Conference (WWW7), 1998.
- Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient Crawling Through URL Ordering. 7th International World-Wide Web Conference (WWW7), 1998.
27 References
- H. C. Chen, Y. M. Chung, M. Ramsey, and C. C. Yang. An Intelligent Personal Spider (Agent) for Dynamic Internet/Intranet Searching. Decision Support Systems 23, 41-58, 1998.
- Junghoo Cho and Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. Submitted to VLDB 2000, Experience/Application track, 2000.