Wasim Rangoonwala

About This Presentation

Slides: 19

Provided by: Was91

Transcript and Presenter's Notes

1
"Privacy is the claim of individuals, groups, or
institutions to determine for themselves when,
how, and to what extent information about them is
communicated to others." - Alan Westin, Privacy
and Freedom, 1967

Wasim Rangoonwala
ID 00506259
CS-460 Computer Security

3
What are WWW Robots?
A robot is a program that automatically traverses
the Web's hypertext structure by retrieving a
document and then recursively retrieving all
documents that it references. Web robots are
sometimes referred to as Web Wanderers, Web
Crawlers, Spiders, or Bots.
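
The traversal described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `PAGES` is a made-up in-memory stand-in for the Web so the example runs without network access, and `LinkExtractor` and `crawl` are hypothetical names, not part of any real crawler.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML body (stands in for real HTTP fetches).
PAGES = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/b.html">B</a> <a href="/index.html">Home</a>',
    "/b.html": '<a href="/c.html">C</a>',
    "/c.html": "No links here.",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(url, visited=None):
    """Retrieve a document, then recursively retrieve every document it references."""
    if visited is None:
        visited = []
    if url in visited or url not in PAGES:
        return visited  # already seen, or nothing to fetch
    visited.append(url)
    extractor = LinkExtractor()
    extractor.feed(PAGES[url])
    for link in extractor.links:
        crawl(link, visited)
    return visited


print(crawl("/index.html"))
# → ['/index.html', '/a.html', '/b.html', '/c.html']
```

The `visited` list is what keeps the recursion from looping forever when pages link back to each other, as `/a.html` does here.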
4
Web Spiders / Robots Collecting Data
5
Controlling how search engines access and index
your website
Google refers to its spiders as Googlebot and
Googlebot-Image. Google has a set of computers
that continually crawl the web. Together these
machines are known as the Googlebot. In general,
you want Googlebot to access your site so your
web pages can be found by people searching on
Google.
6
Controlling how search engines access and index
your website
One key question: how does Google know which
parts of a website the site owner wants to have
show up in search results? Can publishers specify
that some parts of the site should be private and
non-searchable? The good news is that those who
publish on the web have a lot of control over
which pages appear in search results and
which pages are kept private.
Answer: the robots.txt file
7
Controlling how search engines access and index
your website

Robots.txt has been an industry standard for many
years. It lets a site owner control how search
engines access their web site.
The robots.txt file contains a list of the paths
that search engines shouldn't access.
You can exclude pages from Google's crawler by
creating a text file called robots.txt and
placing it in the root directory of your site.
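
Because the file lives in the root directory, its address can be derived from any page on the site. The sketch below shows one way to do that with Python's standard `urllib.parse`; the function name `robots_txt_url` and the `www.example.com` address are assumptions made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the URL where crawlers look for robots.txt on the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the file always sits at the top-level path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/shop/checkout_payment.html?step=2"))
# → http://www.example.com/robots.txt
```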

Making Use of Robots.txt File
8
Controlling how search engines access and index
your website

Examples of pages you may want to keep private from
search engines:
A directory that contains internal logs.
News articles that require payment to access.
The administration area of the website.
Database configuration strings, stored passwords, and credit
card details.
Images that you want to keep private.

Making Use of Robots.txt File (Continued)
9
Achieving Privacy through Robots.txt File
# robots.txt file
# Currently disallow all images to the Google Image bot
User-agent: Googlebot-Image
Disallow: /

# ALL search engine spiders/crawlers (put at end of file)
User-agent: Googlebot
Disallow: /admin/
Disallow: /account_password.html
Disallow: /address_book.html
Disallow: /checkout_payment.html
Disallow: /cookie_usage.html
Disallow: /login.html
Example of Robots.txt File
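
To see how a crawler would interpret the rules on this slide, they can be fed to Python's standard `urllib.robotparser`. This is a sketch rather than a description of how Googlebot itself evaluates the file; the helper name `allowed`, the test paths, and the `SomeOtherBot` agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide, with the comment lines omitted.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: Googlebot
Disallow: /admin/
Disallow: /account_password.html
Disallow: /address_book.html
Disallow: /checkout_payment.html
Disallow: /cookie_usage.html
Disallow: /login.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if the parsed rules let the named crawler fetch the given path."""
    return parser.can_fetch(agent, path)


for agent, path in [
    ("Googlebot-Image", "/images/logo.png"),
    ("Googlebot", "/admin/users.html"),
    ("Googlebot", "/index.html"),
    ("SomeOtherBot", "/admin/users.html"),
]:
    print(agent, path, allowed(agent, path))
```

Under this parser, a crawler that matches neither `User-agent` line is not restricted at all. Note also that robots.txt is a publicly readable request that well-behaved crawlers honor, not an access control, so sensitive pages still need real protection such as authentication.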
10
Privacy through Robots <META> tag

You can use a special HTML <META> tag to tell
robots not to index
the content of a page, and/or not scan it for
links to follow.
Example
<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
The "NAME" attribute must be "ROBOTS".
Valid values for the "CONTENT" attribute are
"INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW".
Multiple comma-separated values are allowed, but
obviously only some combinations make sense. If
there is no robots <META> tag, the default is
"INDEX,FOLLOW", so there's no need to spell that
out.
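
A robot that respects this tag has to read it from the page before deciding what to do. The following is a minimal sketch of that step using Python's standard `html.parser`; the names `RobotsMetaParser` and `robots_policy` are hypothetical, and the default applied when the tag is absent is the "INDEX,FOLLOW" behavior stated above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records the directives of a robots META tag, if the page has one."""

    def __init__(self):
        super().__init__()
        self.directives = None  # None means no robots META tag was found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lower-cases tag and attribute names
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives = {d.strip().upper() for d in content.split(",") if d.strip()}


def robots_policy(html):
    """Return (may_index, may_follow) for a page, defaulting to INDEX,FOLLOW."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives or set()
    return ("NOINDEX" not in directives, "NOFOLLOW" not in directives)


PAGE = """<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
"""

print(robots_policy(PAGE))
# → (False, False)
```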

Example of <META> Tag
11
Search Engine Web Spiders Names

Yahoo! Search - Yahoo Slurp
AltaVista - Scooter
AskJeeves - Ask Jeeves/Teoma
MSN Search - MSNbot
Visit http://www.robotstxt.org/db.html
for more details on search engine
web spider names.

Bonus

13
Google Anatomy

Google Crawlers (GoogleBot)
Multiple distributed crawlers
Own DNS cache
300 connections open at once
Send fetched pages to Store Server
Originally written in Python
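
Two of the points above, the per-crawler DNS cache and the cap of 300 open connections, can be illustrated with a toy model. This is a sketch under heavy assumptions, not Google's code: the `Crawler` class, the fake addresses, the `host0.example` style URLs, and the dictionary standing in for the Store Server are all invented, and no real network or DNS traffic takes place.

```python
import asyncio
from urllib.parse import urlsplit

MAX_OPEN_CONNECTIONS = 300  # per-crawler figure quoted on the slide


class Crawler:
    def __init__(self, max_connections=MAX_OPEN_CONNECTIONS):
        self.max_connections = max_connections
        self.dns_cache = {}   # host -> address, private to this crawler
        self.dns_lookups = 0  # how many times the cache missed
        self.open_now = 0
        self.peak_open = 0
        self.store = {}       # stand-in for the Store Server

    def resolve(self, host):
        """Look up a host, consulting the crawler's own cache first."""
        if host not in self.dns_cache:
            self.dns_lookups += 1
            # Fake address; a real crawler would query DNS here.
            self.dns_cache[host] = "192.0.2.%d" % self.dns_lookups
        return self.dns_cache[host]

    async def fetch(self, url, limit):
        async with limit:  # waits while max_connections fetches are in flight
            self.open_now += 1
            self.peak_open = max(self.peak_open, self.open_now)
            address = self.resolve(urlsplit(url).hostname)
            await asyncio.sleep(0)  # stand-in for network I/O
            self.store[url] = "page from %s" % address  # hand off to the store
            self.open_now -= 1

    async def crawl(self, urls):
        limit = asyncio.Semaphore(self.max_connections)
        await asyncio.gather(*(self.fetch(u, limit) for u in urls))


urls = ["http://host%d.example/page%d.html" % (i % 5, i) for i in range(1000)]
crawler = Crawler()
asyncio.run(crawler.crawl(urls))
print(len(crawler.store), crawler.dns_lookups, crawler.peak_open)
```

With 1000 URLs spread over 5 hosts, the cache means only 5 lookups are made, and the semaphore keeps the number of simultaneous fetches at or below the configured limit.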

14
PageRank Algorithm
Hypertext-Matching Analysis
Google Technology
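
The slide only names PageRank, so here is a simplified sketch of the published idea: a page's score is built from the scores of the pages that link to it. It is a plain power iteration, not Google's production system. The damping factor of 0.85 is the commonly cited value, and the four-page `GRAPH` and the function name `pagerank` are made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute scores for a link graph; links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Each page splits its damped score evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site; every link target is also a key.
GRAPH = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph `home` ends up with the highest score because every other page links to it, while `orphan`, which nothing links to, keeps only the baseline share.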
15
Google Webmaster Central
Webmaster Central offers services to:
see which parts of a site Googlebot had problems crawling
upload an XML Sitemap file
analyze and generate robots.txt files
remove URLs already crawled by Googlebot
specify the preferred domain
identify issues with title and description meta tags
understand the top searches used to reach a site
get a glimpse at how Googlebot sees pages
remove unwanted site links that Google may use in results