Title: Web crawler
1 Web crawler
2 Member Group
- 1. Thanida Limsirivallop 47541164
- 2. Lucksamon Sivapattarakumpon 47541404
- 3. Patrapee Suwannan 47542212
- 4. Rataruch Tongpradith 47542246
- Website: http://pirun.ku.ac.th/b4754116
3 Web Crawler Definition
- A crawler is an automatic program (sometimes called a "robot") that explores the World Wide Web, following links and searching for information or building a database. Such programs are often used to build automated indexes of the Web, allowing users to do keyword searches for Web documents.
- Web crawlers are programs that exploit the graph structure of the Web to move from page to page.
4 Crawling the Web
- Beginning -> a key motivation for designing Web crawlers has been to retrieve Web pages and add them, or their representations, to a local repository.
- Simplest form -> a crawler starts from a seed page and then uses the external links within it to attend to other pages. The process repeats with the new pages offering more external links to follow, until a sufficient number of pages are identified or some higher-level objective is reached. (A minimal sketch of this basic sequential crawler follows slide 5.)
- General-purpose search engines, serving as entry points to Web pages, strive for coverage that is as broad as possible. They use Web crawlers to maintain their index databases, amortizing the cost of crawling and indexing over the millions of queries they receive.
5 Crawling Infrastructure
- Basic sequential crawler (figure)
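As an illustration of the basic sequential crawler above, here is a minimal sketch using only the Python standard library (not code from any of the surveyed papers): a FIFO frontier is seeded, each page is fetched, its links are extracted and resolved, and unseen links are appended until a page budget is reached.

    # Minimal sketch of a basic sequential crawler (illustration only): start
    # from seed pages, fetch each page, extract links, and append unseen links
    # to the frontier until a page budget is reached.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=50):
        frontier = deque(seeds)          # FIFO frontier of URLs to visit
        seen = set(seeds)
        pages = {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                 # skip pages that fail to download
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                link = urljoin(url, href)            # resolve relative links
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages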
6 Web Crawler Requirements
- The goal of the proposed crawler is to re-create the look and feel of a website as it existed on the crawl date.
- The tool should be extensible to adapt to future changes in web standards.
- General requirements:
- Comprehensive downloading (Saving): the look and feel of the page is mirrored exactly, down to every image, link, and dynamic element. The scope of the crawl is controlled by link and depth limits.
- An intuitive and extensible interface (Interface): the crawler should offer an intuitive graphical user interface as well as a command-line interface, page caching, and rights handling.
- Security of the server and browser (Niceness): the robots exclusion protocol must be obeyed. This requires downloading the robots.txt file before crawling the rest of the website (see the sketch after this slide).
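A minimal sketch of the robots.txt check implied by the niceness requirement, using Python's standard urllib.robotparser (the user-agent string and helper names are assumptions for illustration):

    # Minimal robots.txt check (sketch): fetch the site's robots.txt first and
    # ask it whether a given URL may be crawled before downloading that URL.
    from urllib import robotparser
    from urllib.parse import urljoin

    USER_AGENT = "ExampleArchiveBot"        # hypothetical user-agent string

    def allowed(site_root, url):
        rp = robotparser.RobotFileParser()
        rp.set_url(urljoin(site_root, "/robots.txt"))
        rp.read()                            # downloads and parses robots.txt
        return rp.can_fetch(USER_AGENT, url)

    # Example: only crawl the page if the robots exclusion protocol permits it.
    # if allowed("https://example.org", "https://example.org/private/page.html"):
    #     download the page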
7 Web Crawler Requirements
- Presentation of dynamic elements (Dynamic page image): the crawler must download Shockwave Flash files and other content listed in EMBED tags. Care must be taken not to download the same file multiple times (a deduplication sketch follows this slide).
- Accurate look and feel (Presentation): the most important aspect of archiving web sites is that all links should be accurate. Almost always, re-crawling an archive is needed to rewrite all links to be internal.
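A minimal sketch of the deduplication mentioned above (an illustration, not the proposed tool): sources listed in EMBED and OBJECT tags are collected, and a set of already-downloaded URLs ensures the same file is never fetched twice.

    # Sketch: extract embedded-content URLs (EMBED/OBJECT tags) and skip any
    # file that has already been downloaded in this crawl.
    from html.parser import HTMLParser

    class EmbedExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.sources = []
        def handle_starttag(self, tag, attrs):
            if tag in ("embed", "object"):
                for name, value in dict(attrs).items():
                    if name in ("src", "data") and value:
                        self.sources.append(value)

    downloaded = set()          # URLs of files already saved in this crawl

    def embedded_files_to_fetch(html):
        parser = EmbedExtractor()
        parser.feed(html)
        fresh = [u for u in parser.sources if u not in downloaded]
        downloaded.update(fresh)
        return fresh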
8 Dominos: A New Web Crawler's Design
- This paper describes the design and implementation of a real-time distributed Web crawling system running on a cluster of machines, and introduces a high-availability crawling system called Dominos.
- Dominos is a dynamic system, which accounts for its highly flexible deployment, maintainability, and enhanced fault tolerance. Finally, the paper discusses the experimental results obtained, comparing them with other documented systems.
9 An Investigation of Web Crawler Behavior: Characterization and Metrics
- This paper presents a characterization study of search-engine crawlers. The study uses Web-server access logs from five academic sites in three different countries.
- The results and observations provide useful insights into crawler behavior and serve as a basis for the authors' ongoing work on the automatic detection of Web crawlers.
10 Crawler-Friendly Web Servers
- This paper studies how to make web servers more crawler-friendly, evaluating simple, easy-to-incorporate modifications to web servers that yield significant bandwidth savings.
- The paper proposes that web servers export meta-data describing their pages so that crawlers can efficiently create and maintain large, fresh repositories.
11 The Evolution of the Web and Implications for an Incremental Crawler
- This paper studies how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode.
- Based on the results, the paper discusses various design choices for a crawler and the possible trade-offs, and then proposes an architecture for an incremental crawler that combines the best strategies identified.
12 SharpSpider: Spidering the Web through Web Services
- This paper presents SharpSpider, a distributed C# spider designed to address the issues of scalability, decentralisation, and continuity of a Web crawl.
- Fundamental to the design of SharpSpider is the publication of an API for use by other services on the network. Such an API grants access to a constantly refreshed index built after successive crawls of the Web.
13 The Anatomy of a Large-Scale Hypertextual Web Search Engine
- This paper presents Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext, and provides an in-depth description of a large-scale web search engine.
- Google is designed to be a scalable search engine. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.
14 Incremental Web Search: Tracking Changes in the Web
- This paper presents algorithms and techniques for detecting changed web pages, extracting web pages, and evaluating web change.
- It presents an application of these techniques and algorithms named Web Daily News Assistant (WebDNA), currently deployed on the NYU web site.
- The change of web documents is modeled using survival analysis; modeling web changes is useful for web crawler scheduling and web caching.
15 An Investigation of Documents from the World Wide Web
- This paper reports on an examination of pages from the WWW, analyzing data collected by the Inktomi Web Crawler. It covers many analyses of HTML, such as evolution, improving web content, control of HTML, sociological insights, user studies, content analyses, and structure analysis, and describes the tools used to perform the data collection.
16 A Crawler-based Study of Spyware on the Web
- Crawling the Web, downloading content from a large number of sites, and then analyzing it to determine whether it is malicious. In this way, several important questions can be answered, for example:
- How much spyware is on the Internet?
- Where is that spyware located (e.g., game sites, children's sites, adult sites, etc.)?
- How likely is a user to encounter spyware through random browsing?
- What kinds of threats does that spyware pose?
- What fraction of executables on the Internet are infected with spyware?
17 Estimating Frequency of Change
- Estimating the change frequency of data to improve Web crawlers and Web caches and to help data mining, by developing several frequency estimators and identifying various scenarios.
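As a baseline for the idea, a naive estimator (an assumption for illustration; the paper develops improved estimators that also account for changes missed between visits) simply divides the number of detected changes by the length of the monitoring period:

    # Naive change-frequency estimator (illustration): if a page was checked at
    # regular intervals and found changed on `changes_detected` of those checks,
    # estimate its change rate as detected changes per unit time. The paper's
    # improved estimators correct this for changes that happen between checks.
    def naive_change_frequency(changes_detected, num_checks, interval_days):
        """Estimated changes per day over a monitoring period."""
        monitoring_days = num_checks * interval_days
        return changes_detected / monitoring_days if monitoring_days else 0.0

    # Example: checked daily for 30 days, found changed 6 times -> 0.2 changes/day.
    rate = naive_change_frequency(changes_detected=6, num_checks=30, interval_days=1)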
18 Collaborative Web Crawler over High-speed Research Network
- A distributed web crawler that utilizes existing research networks.
- Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. The idea is to spread the required computation and bandwidth across many computers and networks.
19 Crawling-based Classification
- The categorization of a database is determined by its distribution of documents across categories.
20 Mercator: A Scalable, Extensible Web Crawler
- The design features a crawler core for handling the main crawling tasks, with extensibility through protocol and processing modules.
- Users may supply new modules for performing customized crawling tasks.
- The authors have used Mercator for a variety of purposes, including performing random walks on the web, crawling their corporate intranet, and collecting statistics about the web at large.
21 Parallel Crawlers
- A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and avoiding repeated downloads of the same page.
- To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes (a common host-hash assignment is sketched below).
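One common assignment policy, sketched here for illustration (not necessarily the exact scheme evaluated in the paper), hashes the URL's host so that every URL of a given host is routed to the same crawling process:

    # Sketch of a host-hash URL assignment policy for a parallel crawler: each
    # discovered URL is routed to exactly one of N crawling processes, so the
    # same page is never downloaded twice by different processes.
    import hashlib
    from urllib.parse import urlparse

    def assign_process(url, num_processes):
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_processes   # stable process id for this host

    # Example: with 4 processes, every URL on the same host maps to one process.
    # assign_process("http://example.org/a.html", 4) == assign_process("http://example.org/b.html", 4)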
22 Efficient Crawling Through URL Ordering
- Defines several different kinds of importance metrics and builds three models to evaluate crawlers.
- Several combinations of importance and ordering metrics are then evaluated using the Stanford Web pages.
23 Efficient URL Caching for World Wide Web Crawling
- URL caching is very effective.
- Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, the paper recommends a cache size of between 100 and 500 entries per crawling thread (a sketch follows this slide).
- The size of the cache needed to achieve top performance depends on the number of threads.
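A minimal sketch of such a per-thread cache in the recommended 100-500 entry range (the LRU eviction policy here is an assumption for illustration): a URL is forwarded to the shared frontier only when the cache misses.

    # Sketch: small per-thread cache of recently seen URLs. On a hit the URL is
    # dropped (it was already forwarded); on a miss it is cached and passed on
    # to the shared frontier. Capacity follows the 100-500 entry guideline.
    from collections import OrderedDict

    class URLCache:
        def __init__(self, capacity=500):
            self.capacity = capacity
            self.entries = OrderedDict()          # URL -> None, in LRU order

        def seen(self, url):
            """Return True on a cache hit, otherwise cache the URL (evicting LRU)."""
            if url in self.entries:
                self.entries.move_to_end(url)     # refresh recency
                return True
            self.entries[url] = None
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            return False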
24 Multicasting a Web Repository
- Proposes an alternative to multiple crawlers: a single central crawler builds a database of Web pages and provides a multicast service for clients that need a subset of this Web image.
25 Distributed High-performance Web Crawlers
- The workload is distributed across multiple machines by dividing and/or duplicating the crawler's components within the cluster.
- The program runs simultaneously on two or more computers that communicate with each other over a network.
26 Parallel Crawling for Online Social Networks
- A centralized queue, implemented as a database table, is used to coordinate the operation of all the crawlers and prevent redundant crawling (see the sketch after this slide).
- This offers two tiers of parallelism, allowing multiple crawlers to be run on each of multiple agents, where the crawlers are not affected by any potential failure of the other crawlers.
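A minimal sketch of the centralized database-table queue (the table name, schema, and SQLite choice are assumptions for illustration): crawlers insert discovered URLs with INSERT OR IGNORE and atomically claim pending rows, so no URL is crawled twice.

    # Sketch: a centralized URL frontier kept in a SQLite table; each crawler
    # atomically claims one pending URL so no two crawlers fetch the same page.
    import sqlite3

    def open_queue(path="frontier.db"):
        db = sqlite3.connect(path, isolation_level=None)  # autocommit mode
        db.execute("""CREATE TABLE IF NOT EXISTS frontier (
                          url TEXT PRIMARY KEY,
                          state TEXT NOT NULL DEFAULT 'pending')""")
        return db

    def add_url(db, url):
        # INSERT OR IGNORE makes duplicate discoveries harmless.
        db.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))

    def claim_url(db):
        # Atomically mark one pending URL as claimed and return it.
        db.execute("BEGIN IMMEDIATE")
        row = db.execute(
            "SELECT url FROM frontier WHERE state = 'pending' LIMIT 1").fetchone()
        if row is None:
            db.execute("COMMIT")
            return None
        db.execute("UPDATE frontier SET state = 'claimed' WHERE url = ?", (row[0],))
        db.execute("COMMIT")
        return row[0]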
27 Finding Replicated Web Collections
- Improves web crawling by avoiding redundant crawling in the Google system, and proposes a new algorithm for efficiently identifying similar collections that form what the authors call a similar cluster.
28 Learnable Web Crawler
- In this section, we briefly explain a key characteristic of the web crawler: its learnable ability. We build some knowledge bases from previous crawls. These knowledge bases are:
- Seed URLs
- Topic keywords
- URL prediction
29 The Algorithm: No KB
- Crawling_with_no_KB (topic)
-   seed_urls = Search (topic, t)
-   keywords = topic
-   foreach url in seed_urls
-     url_topic = url.title + url.description
-     url_score = sim (keywords, url_topic)
-     enqueue (url_queue, url, url_score)
-   while (size (url_queue) > 0)
-     url = dequeue_url_with_max_score (url_queue)
-     page = fetch_new_document (url)
-     page_score = sim (keywords, page)
-     foreach link in extract_urls (page)
-       link_score = a * sim (keywords, link.anchortext) + (1 - a) * page_score
-       enqueue (url_queue, link, link_score)
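For illustration, a minimal Python sketch of the no-KB best-first loop above (not the authors' implementation): sim() is a toy keyword-overlap score, and page fetching and link extraction are supplied as placeholder callables.

    # Minimal sketch of the no-KB best-first loop: seed URLs are scored against
    # the topic keywords, and the highest-scoring URL is always fetched next.
    import heapq

    def sim(keywords, text):
        # Toy similarity: fraction of keywords that appear in the text.
        words = set(text.lower().split())
        kw = set(keywords.lower().split())
        return len(kw & words) / len(kw) if kw else 0.0

    def crawl_with_no_kb(topic, seed_urls, fetch, extract_links, a=0.5, limit=100):
        keywords = topic
        frontier = []  # max-score queue via negated scores in a min-heap
        for url, title_desc in seed_urls:          # (url, title + description)
            heapq.heappush(frontier, (-sim(keywords, title_desc), url))
        seen, collection = set(), []
        while frontier and len(collection) < limit:
            neg_score, url = heapq.heappop(frontier)
            if url in seen:
                continue
            seen.add(url)
            page = fetch(url)                      # placeholder: returns page text
            collection.append((url, page))
            page_score = sim(keywords, page)
            for link_url, anchor in extract_links(page):   # placeholder extractor
                link_score = a * sim(keywords, anchor) + (1 - a) * page_score
                heapq.heappush(frontier, (-link_score, link_url))
        return collection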
30 The Algorithm: With KB
- Crawling_with_KB (KB, topic)
-   seed_urls = get_seed_url (KB, topic, t)
-   keywords = get_topic_keyword (KB, topic)
-   foreach url in seed_urls
-     url_score = get_pred_score (KB, topic, url)
-     enqueue (url_queue, url, url_score)
-   while (size (url_queue) > 0)
-     url = dequeue_url_with_max_score (url_queue)
-     page = fetch_new_document (url)
-     page_score = sim (keywords, page)
-     foreach link in extract_urls (page)
-       pred_link_score = get_pred_score (KB, topic, link)
-       link_score = a * (β * sim (keywords, link.anchortext) + (1 - β) * pred_link_score) + (1 - a) * page_score
-       enqueue (url_queue, link, link_score)
31 Overall Process
- Learnable_Crawling (topic)
-   if (no KB)
-     Collection = Crawling_with_no_KB (topic)
-   else
-     Collection = Crawling_with_KB (KB, topic)
-   /* Learn from the previous crawling */
-   KB.seed_urls = learn_seed_URL (Collection)
-   KB.keywords = learn_topic_keyword (Collection)
-   KB.url_predict = learn_URL_prediction (Collection)
32 Learning Analysis
34 Thank You for Your Attention