Title: Crawling Techniques
1 Crawling Techniques
- EECS 767 - Information Retrieval
- Instructor: Dr. Susan Gauch
- by
- KP Muthuvelan (703847)
- Shankar Palanivel (706506)
- Roopesh Rajamani (698645)
2 How Search Engines Used to Work
- Build a huge search index for a collection of documents.
- Manually indexed search engines (e.g., Yahoo).
- Drawbacks
- Largely static in nature.
- Not suitable for dynamic web pages.
3 How Search Engines Work Now
- The user submits a URL.
- The search engine sends out spiders periodically.
- Web pages retrieved by the spiders are indexed.
- Advantages
- Better at handling dynamically changing pages.
- Highly scalable because of automatic indexing.
4 Generic Search Engine
5 How Else Are Spiders Used?
- Personal Web Maps (PWM)
- A personal collection of interesting web pages whose construction the user controls.
- The user can instruct the PWM to expand along a particular direction.
- Searches within an intranet
- The document collection is not huge.
- Dynamic search advantage.
6 Spider / Crawler / WebBot?
- A program that traverses the known or visible World-Wide Web and downloads the documents it finds.
- The spider is given a set of initial documents or a query, along with a starting seed URL.
- The spider retrieves all the web documents that are connected to the starting URL.
7 Basic Approach of Spiders
- The spider extracts the text of the starting URL and identifies the embedded links in the document.
- The embedded links are put in a queue, to be retrieved later.
- The links in the queue are retrieved and analyzed, and additional embedded links are added to the queue data structure (see the sketch below).
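
A minimal sketch of this basic queue-driven spider in Python. The link-extraction regex, the timeout, and the page limit are illustrative assumptions, not details from the slides.

```python
# Basic spider: fetch a page, extract its links, append them to a FIFO
# queue, and repeat. All limits and patterns below are hypothetical.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def basic_crawl(seed_url, max_pages=50):
    queue = deque([seed_url])      # FIFO queue of URLs to retrieve later
    visited = set()
    pages = {}                     # url -> downloaded text
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue               # skip unreachable documents
        pages[url] = html
        # identify embedded links and add them to the queue data structure
        for link in LINK_RE.findall(html):
            queue.append(urljoin(url, link))
    return pages
```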
8 Problems with the Basic Approach
- The number of connected documents on the web is very large, and spiders operate under time and bandwidth constraints.
- The visibility of the spider is limited to the connected graph of documents reachable from the starting point.
- Need to construct a priority queue instead of a plain queue, in order to retrieve the relevant documents first (see the sketch below).
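
A minimal sketch of swapping the FIFO queue for a priority queue, assuming some scoring function supplies each URL's importance; the scores shown are made up, and the metrics on the next two slides are what would actually produce them.

```python
# Priority-queue frontier: pop the most promising URL first instead of
# the oldest one. Python's heapq is a min-heap, so scores are negated.
import heapq

class PriorityFrontier:
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

frontier = PriorityFrontier()
frontier.push("http://example.edu/a.html", score=0.9)   # hypothetical scores
frontier.push("http://example.edu/b.html", score=0.2)
print(frontier.pop())   # ('http://example.edu/a.html', 0.9)
```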
9 Metrics for the Priority Queue
- Similarity to a Driving Query Q
- Follow the same approach as for indexing: the tf-idf approach.
- How to calculate the inverse document frequency?
- Backlink Count
- The priority of a link depends on the number of pages that contain the link.
- The backlink count is calculated from the web pages retrieved so far.
10 Metrics for the Priority Queue
- PageRank
- It recursively defines the importance of a page to be the weighted sum of the backlinks to it.
- PageRank(P) = (1 - d) + d (PageRank(T1)/c1 + ... + PageRank(Tn)/cn), where T1..Tn are the pages linking to P, ci is the number of outgoing links on Ti, and d is the damping factor (see the sketch below).
- Location Metric
- The importance of a page P is a function of its location, not its contents.
- URLs with fewer slashes are more useful than URLs with more slashes.
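
A minimal sketch of the PageRank metric applied to the pages crawled so far, using the formula above. The tiny link graph and the damping factor d = 0.85 are illustrative assumptions.

```python
# Iterative PageRank: PageRank(P) = (1 - d) + d * sum(PageRank(Ti)/ci)
# over the pages Ti that link to P, where ci is Ti's out-degree.
def pagerank(links, d=0.85, iterations=30):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - d) for p in pages}
        for src, targets in links.items():
            if not targets:
                continue
            share = d * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical link graph
print(pagerank(graph))   # C, linked from both A and B, ends up ranked highest
```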
11 Fish-Search Algorithm
- Based on the assumption that relevant documents have relevant neighbors.
- It takes as input a seed URL and a search query, and dynamically builds a priority list of the next URLs to be explored.
- At each step the first node is popped from the list and processed.
12 Fish-Search Algorithm
- As the document's text becomes available, it is used to determine whether the document is relevant or not.
- The relevance of a document determines whether to pursue the exploration in that direction.
- The document is scanned for links. Each link (denoting a child) is assigned some predetermined depth value.
13 Fish-Search Algorithm
- The relevance score is assigned in a discrete manner: 1 for relevant, 0 or 0.5 for irrelevant.
- Uses primitive string or regular-expression matching to check relevance.
- If the document is relevant, its children are assigned the predetermined depth value; otherwise the depth value is decremented by one (see the sketch below).
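
A minimal sketch of the fish-search scheduling rule described on the last three slides. fetch() and extract_links() are assumed helpers (they would download a page and return its embedded links), the relevance test is the primitive string matching the slides mention, and the depth and page limits are illustrative.

```python
# Fish-search: binary relevance; children of relevant pages keep the full
# depth and go to the front of the list, children of irrelevant pages get
# a decremented depth and go to the back.
def is_relevant(text, query):
    # primitive string matching: 1 if every query term appears, else 0
    return 1 if all(term.lower() in text.lower() for term in query.split()) else 0

def fish_search(seed_url, query, fetch, extract_links,
                initial_depth=3, max_pages=100):
    frontier = [(seed_url, initial_depth)]    # priority list of (url, depth)
    visited, results = set(), []
    while frontier and len(results) < max_pages:
        url, depth = frontier.pop(0)          # pop the first node in the list
        if url in visited or depth <= 0:      # depth 0 means this branch is pruned
            continue
        visited.add(url)
        text = fetch(url)
        score = is_relevant(text, query)
        if score:
            results.append(url)
        # relevant parents pass the predetermined depth to their children,
        # irrelevant parents pass a decremented depth
        child_depth = initial_depth if score else depth - 1
        children = [(link, child_depth) for link in extract_links(text, url)]
        frontier = children + frontier if score else frontier + children
    return results
```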
14 Drawbacks of the Fish-Search Algorithm
- Very low differentiation of the priority of pages in the list.
- Arbitrary pruning occurs.
- The number of children selected as relevant is arbitrary and fixed.
15 Shark-Search Algorithm
- Unlike the binary relevance metric used by the fish-search algorithm, it uses a similarity engine which returns a fuzzy score between 0 and 1.
- The vector-space model is used to construct the similarity engine (see the sketch below).
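
A minimal sketch of a vector-space similarity engine of the kind the shark-search algorithm plugs in: the query and a document are turned into term vectors and compared with cosine similarity, giving a fuzzy score between 0 and 1. The whitespace tokenisation and raw term-frequency weights are simplifying assumptions.

```python
# Cosine similarity over raw term-frequency vectors.
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine_similarity(query, document):
    q, d = vectorize(query), vectorize(document)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("web crawler algorithms",
                        "a crawler is a program that traverses the web"))
```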
16 Shark-Search Algorithm
- The similarity engine is orthogonal to the crawler algorithm.
- Instead of the depth value, the children get an inherited score from the parent.
- The inherited score is obtained by multiplying the parent's similarity score by a fading factor.
17 Shark-Search Algorithm
- The potential score of a child is also based on the meta-information included in the link (its anchor text).
- This can fail in some cases, so the close textual context of the link can be used as well.
- Helps in better differentiation of the pages (see the sketch below).
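
A minimal sketch of how a child link's priority could be computed from the ideas on the last two slides: a fading factor applied to the parent's score, plus evidence from the anchor text or, failing that, its close textual context. The fading factor delta, the weight beta, and the combination rule are illustrative assumptions, not the exact formula from the shark-search paper.

```python
# Shark-search style child scoring: inherited score plus link-neighborhood score.
def child_priority(parent_score, anchor_text, anchor_context, query,
                   similarity, delta=0.5, beta=0.8):
    # inherited score: the parent's score multiplied by a fading factor
    inherited = delta * parent_score
    # neighborhood score: use the anchor text if it matches the query,
    # otherwise fall back to the close textual context of the link
    anchor_score = similarity(query, anchor_text)
    neighborhood = anchor_score if anchor_score > 0 else similarity(query, anchor_context)
    # potential score combines the inherited and neighborhood evidence
    return beta * inherited + (1 - beta) * neighborhood
```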
18 Evaluation Measure
- The two algorithms are evaluated by taking the sum of the similarity scores of the retrieved documents (see the sketch below).
- The shark-search algorithm performed better than the fish-search algorithm.
- The shark-search algorithm is now implemented in a site-mapping tool called Mapuccino.
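
A tiny illustration of the evaluation measure: sum each crawler's similarity scores and compare. The score lists are made-up values, not results from the papers.

```python
# The crawl whose retrieved documents have the larger total similarity wins.
fish_scores  = [0.2, 0.1, 0.4, 0.0]    # hypothetical per-document scores
shark_scores = [0.6, 0.3, 0.5, 0.2]
print(sum(shark_scores) > sum(fish_scores))   # True
```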
19 Incremental Crawler Algorithms
- A periodic crawler builds a new collection of retrieved documents each time it is dispatched into the web.
- An incremental crawler continuously updates and refreshes the local collection of retrieved documents.
20 Operational Model of an Incremental Crawler
21 Architecture of the Incremental Crawler
- Major Data Structures
- AllUrls
- List of all URLs the crawler has discovered so far
- CollUrls
- URLs selected for the collection
- Collection
- The documents of the CollUrls
22 Architecture of the Incremental Crawler
- Major Modules
- Ranking Module
- The Ranking Module constantly scans through AllUrls and CollUrls to make the refinement decision.
- Update Module
- Takes the link at the top of CollUrls and decides whether it has been updated or not.
23 Architecture of the Incremental Crawler
- Crawl Module
- Adds URLs to the AllUrls structure and refreshes the Collection data structure (see the sketch below).
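
A minimal sketch tying the three data structures (AllUrls, CollUrls, Collection) to the three modules (ranking, update, crawl). The rank() and fetch() helpers and the collection size are illustrative assumptions, not details given on the slides.

```python
# Skeleton of the incremental crawler architecture from slides 21-23.
class IncrementalCrawler:
    def __init__(self, rank, fetch, collection_size=1000):
        self.all_urls = set()      # AllUrls: every URL discovered so far
        self.coll_urls = []        # CollUrls: URLs chosen for the collection, best first
        self.collection = {}       # Collection: url -> stored document text
        self.rank = rank           # assumed importance estimate for a URL
        self.fetch = fetch         # assumed: downloads a page, returns (text, out_links)
        self.collection_size = collection_size

    def ranking_module(self):
        # refinement decision: keep the top-ranked URLs in CollUrls and
        # drop stored documents that fall out of the collection
        ranked = sorted(self.all_urls, key=self.rank, reverse=True)
        self.coll_urls = ranked[: self.collection_size]
        keep = set(self.coll_urls)
        for url in list(self.collection):
            if url not in keep:
                del self.collection[url]

    def update_module(self):
        # revisit the URL at the top of CollUrls and refresh its stored copy
        if self.coll_urls:
            self.crawl_module(self.coll_urls[0])

    def crawl_module(self, url):
        # fetch a page, record newly discovered URLs, refresh the Collection
        text, out_links = self.fetch(url)
        self.all_urls.add(url)
        self.all_urls.update(out_links)
        self.collection[url] = text
```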
24 Architecture of the Incremental Crawler
25 Future Work and Conclusion
- Most research work is concentrated on ranking the URLs obtained from the retrieved documents.
- Spiders based on genetic algorithms (GA) and neural networks are being considered to make them more efficient and robust.
26 References
- M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalheim, and S. Ur. The Shark-Search Algorithm - An Application: Tailored Web Site Mapping. 7th World-Wide Web Conference (WWW7), 1998.
- Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient Crawling Through URL Ordering. 7th International World-Wide Web Conference (WWW7), 1998.
27 References
- H. C. Chen, Y. M. Chung, M. Ramsey, and C. C. Yang. An Intelligent Personal Spider (Agent) for Dynamic Internet/Intranet Searching. Decision Support Systems 23, 41-58, 1998.
- Junghoo Cho and Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. Submitted to VLDB 2000, Experience/Application track, 2000.