Title: Wasim Rangoonwala
1 "Privacy is the claim of individuals, groups or
institutions to determine for themselves when,
how, and to what extent information about them is
communicated to others." - Alan Westin, Privacy
and Freedom, 1967
- Wasim Rangoonwala
- ID 00506259
- CS-460 Computer Security
3 What are WWW Robots?
A robot is a program that automatically traverses
the Web's hypertext structure by retrieving a
document and recursively retrieving all
documents that it references. Web robots are
sometimes referred to as Web Wanderers, Web
Crawlers, Spiders, or Bots.
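The retrieve-and-recurse behaviour described above can be sketched in a few lines of Python. This is only an illustration, not how any real search engine crawler is built: the PAGES dictionary is a hypothetical in-memory stand-in for the Web, and the names LinkExtractor and crawl are made up for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web": URL -> HTML document (an assumption for the demo).
PAGES = {
    "/index.html": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/news.html": '<a href="/about.html">About</a> <a href="/missing.html">Gone</a>',
}


def crawl(url, fetch, visited=None):
    """Retrieve a document, then recursively retrieve every document it references."""
    if visited is None:
        visited = []
    if url in visited:
        return visited          # already retrieved, avoid looping forever
    document = fetch(url)
    if document is None:
        return visited          # broken link, nothing to retrieve
    visited.append(url)
    extractor = LinkExtractor()
    extractor.feed(document)
    for link in extractor.links:
        crawl(link, fetch, visited)
    return visited


order = crawl("/index.html", PAGES.get)
print(order)
```

The visited list is what keeps the robot from retrieving the same page again when two pages link to each other.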
4 Web Spiders / Robots Collecting Data
5 Controlling how search engines access and index
your website
Google refers to its spiders as Googlebot and
Googlebot-Image. Google has a set of computers
that continually crawl the web. Together these
machines are known as the Googlebot. In general
you want Googlebot to access your site so your
web pages can be found by people searching on
Google.
6 Controlling how search engines access and index
your website
One key question is: how does Google know what
parts of a website the site owner wants to have
show up in search results? Can publishers specify
that some parts of the site should be private and
non-searchable? The good news is that those who
publish on the web have a lot of control over
which pages should appear in search results and
which pages can be kept private.
Answer: the robots.txt file
7 Controlling how search engines access and index
your website
- Robots.txt has been an industry standard for many
years that lets a site owner control how search
engines access their web site.
- The robots.txt file contains a list of the pages
that search engines shouldn't access.
- You can exclude pages from Google's crawler by
creating a text file called robots.txt and
placing it in the root directory of the site.
Making Use of the Robots.txt File
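Because the file sits in the root directory, its address can be derived from any page address on the same site. A minimal Python sketch of that derivation follows; the function name robots_txt_url and the example.com address are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the address of robots.txt for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/shop/cart.html?item=3"))
```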
8 Controlling how search engines access and index
your website
- Examples of pages you may want to keep private from
search engines:
- A directory that contains internal logs.
- News articles that require payment to access.
- Administration area of the website: database
configuration strings, stored passwords, credit
card details.
- Images that you want to keep private.
Making Use of the Robots.txt File (continued)
9 Achieving Privacy through the Robots.txt File
# robots.txt file
# Currently disallow all images to the Google Image bot
User-agent: Googlebot-Image
Disallow: /

# ALL search engine spiders/crawlers (put at end of file)
User-agent: Googlebot
Disallow: /admin/
Disallow: /account_password.html
Disallow: /address_book.html
Disallow: /checkout_payment.html
Disallow: /cookie_usage.html
Disallow: /login.html
Example of a Robots.txt File
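One way to see how a well-behaved crawler would read these rules is to feed them to Python's standard urllib.robotparser module. This is a sketch under assumptions: the rules are copied from the example on this slide, while the helper name allowed and the test paths such as /images/logo.png are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the example robots.txt on this slide.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: Googlebot
Disallow: /admin/
Disallow: /account_password.html
Disallow: /address_book.html
Disallow: /checkout_payment.html
Disallow: /cookie_usage.html
Disallow: /login.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if the named crawler may fetch the given path under these rules."""
    return parser.can_fetch(agent, path)


for agent, path in [
    ("Googlebot-Image", "/images/logo.png"),  # hypothetical path
    ("Googlebot", "/admin/index.html"),       # hypothetical path
    ("Googlebot", "/index.html"),             # hypothetical path
]:
    print(agent, path, allowed(agent, path))
```

As written, the file names only Googlebot-Image and Googlebot, so a crawler that identifies itself by any other name matches no group and is not restricted by these rules.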
10 Privacy through the Robots <META> tag
- You can use a special HTML <META> tag to tell
robots not to index the content of a page, and/or
not scan it for links to follow.
- Example:
- <html>
- <head>
- <title>...</title>
- <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
- </head>
- The "NAME" attribute must be "ROBOTS".
- Valid values for the "CONTENT" attribute are
"INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW".
Multiple comma-separated values are allowed, but
obviously only some combinations make sense. If
there is no robots <META> tag, the default is
"INDEX,FOLLOW", so there's no need to spell that
out.
Example of <META> Tag
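A robot that honours this tag has to read it out of each page it retrieves. The following Python sketch shows one possible way to do that with the standard html.parser module; the names RobotsMetaParser and robots_policy are made up for this example, and the fallback to "INDEX,FOLLOW" follows the default stated on this slide.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records the CONTENT values of a <META NAME="ROBOTS"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.directives = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so META arrives as "meta".
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").upper() == "ROBOTS":
            content = attributes.get("content") or ""
            self.directives = {
                value.strip().upper()
                for value in content.split(",")
                if value.strip()
            }


def robots_policy(html):
    """Return whether a page may be indexed and whether its links may be followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives or set()  # no tag means INDEX,FOLLOW
    return {
        "index": "NOINDEX" not in directives,
        "follow": "NOFOLLOW" not in directives,
    }


page = '<html><head><title>...</title><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head>'
print(robots_policy(page))
```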
11 Search Engine Web Spider Names
- Yahoo! Search: Yahoo! Slurp
- AltaVista: Scooter
- AskJeeves: Ask Jeeves/Teoma
- MSN Search: MSNbot
- Visit http://www.robotstxt.org/db.html
for more details on search engine
web spider names.
13 Google Anatomy
- Google Crawlers (GoogleBot)
- Multiple distributed crawlers
- Each crawler maintains its own DNS cache
- Each crawler keeps roughly 300 connections open at once
- Send fetched pages to Store Server
- Originally written in Python
14 PageRank Algorithm and Hypertext-Matching
Analysis
Google Technology
15 Google Webmaster Central
Webmaster Central offers services that let you:
- see which parts of a site Googlebot had problems crawling
- upload an XML Sitemap file
- analyze and generate robots.txt files
- remove URLs already crawled by Googlebot
- specify the preferred domain
- identify issues with title and description meta tags
- understand the top searches used to reach a site
- get a glimpse at how Googlebot sees pages
- remove unwanted sitelinks that Google may use in results
18
- http://www.google.com/support/webmasters/bin/answer.py?answer=80553
- http://www.google.com/bot.html
- http://www.googleguide.com
- http://www.searchengineposition.com
- http://www.google-watch.org
- http://www.robotstxt.org/db.html
- http://www.googleblog.blogspot.com
- For more details, visit http://techwasim.blogspot.com