Transcript and Presenter's Notes

Title: Wasim Rangoonwala


1
"Privacy is the claim of individuals, groups or institutions to determine for
themselves when, how, and to what extent information about them is
communicated to others." - Alan Westin, Privacy and Freedom, 1967
  • Wasim Rangoonwala
  • ID 00506259
  • CS-460 Computer Security

2
(No Transcript)
3
What are WWW Robots?
A robot is a program that automatically traverses the Web's hypertext
structure by retrieving a document and then recursively retrieving all
documents that are referenced. Web robots are sometimes referred to as Web
Wanderers, Web Crawlers, Spiders, or Bots.
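As a rough illustration of that definition, here is a minimal robot sketch using only Python's standard library; the start URL is a hypothetical example, and a real crawler would also add politeness delays, robots.txt checks, and large-scale deduplication.

# Minimal web robot sketch: fetch a page, extract its links, and
# recursively fetch the documents it references (bounded by max_depth).
# The start URL below is a hypothetical example.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url, max_depth=2, seen=None):
    seen = set() if seen is None else seen
    if max_depth < 0 or url in seen:
        return
    seen.add(url)
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        return
    print("fetched:", url)
    parser = LinkExtractor()
    parser.feed(html)
    for link in parser.links:
        crawl(urljoin(url, link), max_depth - 1, seen)

crawl("http://www.example.com/")  # hypothetical start URL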
4
Web Spiders / Robots Collecting Data
5
Controlling how search engines access and index your website
Google refers to its spiders as Googlebot and Googlebot-Image. Google has a
set of computers that continually crawl the web; together these machines are
known as the Googlebot. In general you want Googlebot to access your site so
your web pages can be found by people searching on Google.
6
Controlling how search engines access and index your website
One key question is: how does Google know which parts of a website the site
owner wants to show up in search results? Can publishers specify that some
parts of the site should be private and non-searchable? The good news is that
those who publish on the web have a lot of control over which pages should
appear in search results and which pages can be kept private.
Answer: the robots.txt file
7
Controlling how search engines access and index your website
  • Robots.txt is a long-standing industry standard that lets a site owner
    control how search engines access their web site.
  • The robots.txt file contains a list of the pages that search engines
    shouldn't access.
  • You can exclude pages from Google's crawler by creating a text file called
    robots.txt and placing it in the root directory of your site.

Making Use of Robots.txt File
8
Controlling how search engines access and index your website
  • Examples of pages you may want to keep private from search engines:
  • A directory that contains internal logs.
  • News articles that require payment to access.
  • The administration area of a website: database configuration strings,
    stored passwords, credit card details.
  • Images that you want to keep private.

Making Use of Robots.txt File (continued)
9
Achieving Privacy through Robots.txt File
robots.txt file:

# Currently disallow all images to the Google Image bot
User-agent: Googlebot-Image
Disallow: /

# ALL search engine spiders/crawlers (put at end of file)
User-agent: Googlebot
Disallow: /admin/
Disallow: /account_password.html
Disallow: /address_book.html
Disallow: /checkout_payment.html
Disallow: /cookie_usage.html
Disallow: /login.html

Example of Robots.txt File
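A well-behaved crawler fetches this file from the site root (for example http://www.example.com/robots.txt) before requesting any pages. The short Python sketch below, using the standard urllib.robotparser module, shows how a crawler could check the rules above; example.com and the queried page paths are assumptions for illustration.

# Sketch: how a polite crawler consults robots.txt rules before fetching.
# The rules string copies the example file above; example.com is hypothetical.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: Googlebot
Disallow: /admin/
Disallow: /account_password.html
Disallow: /address_book.html
Disallow: /checkout_payment.html
Disallow: /cookie_usage.html
Disallow: /login.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is blocked from /admin/ but may fetch pages not listed.
print(rp.can_fetch("Googlebot", "http://www.example.com/admin/secret.html"))  # False
print(rp.can_fetch("Googlebot", "http://www.example.com/products.html"))      # True
# Googlebot-Image is disallowed from everything.
print(rp.can_fetch("Googlebot-Image", "http://www.example.com/logo.png"))     # False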
10
Privacy through the Robots <META> tag
  • You can use a special HTML <META> tag to tell robots not to index the
    content of a page, and/or not to scan it for links to follow.
  • Example:

    <html>
    <head>
    <title>...</title>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    </head>

  • The "NAME" attribute must be "ROBOTS".
  • Valid values for the "CONTENT" attribute are "INDEX", "NOINDEX", "FOLLOW",
    "NOFOLLOW". Multiple comma-separated values are allowed, but obviously only
    some combinations make sense. If there is no robots <META> tag, the default
    is "INDEX,FOLLOW", so there's no need to spell that out.

Example of <META> Tag
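As a rough sketch of how a crawler could honour this tag, the following Python snippet (standard html.parser module) pulls the robots <META> directives out of a page and decides whether to index it or follow its links; the sample page string is an assumption for illustration.

# Sketch: extract robots <META> directives from a page; the default when
# no tag is present is "INDEX,FOLLOW".
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").upper() == "ROBOTS":
            content = a.get("content", "")
            self.directives = [d.strip().upper() for d in content.split(",")]

page = ('<html><head><title>...</title>'
        '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>')  # assumed sample
parser = RobotsMetaParser()
parser.feed(page)
may_index = "NOINDEX" not in parser.directives
may_follow = "NOFOLLOW" not in parser.directives
print(may_index, may_follow)  # False False for this page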
11
Search Engine Web Spiders Names
  • Yahoo! Search: Yahoo Slurp
  • AltaVista: Scooter
  • AskJeeves: Ask Jeeves/Teoma
  • MSN Search: MSNbot
  • Visit http://www.robotstxt.org/db.html for more details on search engine
    Web spider names.

12
  • Bonus

13
Google Anatomy
  • Google Crawlers (GoogleBot)
  • Multiple distributed crawlers
  • Own DNS cache
  • 300 connections open at once
  • Send fetched pages to Store Server
  • Originally written in Python
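
This is not Google's actual code, but a small Python sketch (the slide notes the original crawler was written in Python) of the same ideas: several fetch connections open at once, a private DNS cache, and fetched pages handed off to a stand-in for the Store Server. The URLs and the store() helper are hypothetical.

# Illustrative sketch (not Google's code): concurrent fetchers, a small
# DNS cache, and fetched pages passed to a stand-in "store server".
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit
from urllib.request import urlopen

dns_cache = {}  # hostname -> resolved IP address

def resolve(hostname):
    # Keep our own DNS cache instead of resolving the same host repeatedly.
    if hostname not in dns_cache:
        dns_cache[hostname] = socket.gethostbyname(hostname)
    return dns_cache[hostname]

def store(url, body):
    # Stand-in for the Store Server: here we just report the page size.
    print(f"stored {url}: {len(body)} bytes")

def fetch(url):
    resolve(urlsplit(url).hostname)  # warm the DNS cache
    try:
        body = urlopen(url, timeout=10).read()
    except OSError:
        return
    store(url, body)

urls = ["http://www.example.com/", "http://www.example.org/"]  # hypothetical
with ThreadPoolExecutor(max_workers=8) as pool:  # many connections open at once
    pool.map(fetch, urls)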

14
PageRank Algorithm
Hypertext-matching Analysis
Google Technology
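The slides do not spell out the algorithm, but as a rough illustration, here is a tiny power-iteration sketch of PageRank on an assumed three-page link graph with the usual damping factor of 0.85.

# Tiny PageRank sketch (power iteration) on an assumed 3-page link graph.
# links[p] = pages that p links to; d is the damping factor.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # assumed example graph
d, n = 0.85, len(links)
rank = {p: 1.0 / n for p in links}

for _ in range(50):  # iterate until ranks settle
    new = {p: (1 - d) / n for p in links}
    for p, outs in links.items():
        share = rank[p] / len(outs)
        for q in outs:
            new[q] += d * share
    rank = new

print(rank)  # C collects links from both A and B, so it ends up ranked highest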
15
Google Webmaster Central
Webmaster Central offers services to:
  • see which parts of a site Googlebot had problems crawling
  • upload an XML Sitemap file
  • analyze and generate robots.txt files
  • remove URLs already crawled by Googlebot
  • specify the preferred domain
  • identify issues with title and description meta tags
  • understand the top searches used to reach a site
  • get a glimpse at how Googlebot sees pages
  • remove unwanted site links that Google may use in results
16
(No Transcript)
17
(No Transcript)
18
  • http://www.google.com/support/webmasters/bin/answer.py?answer=80553
  • http://www.google.com/bot.html
  • http://www.googleguide.com
  • http://www.searchengineposition.com
  • http://www.google-watch.org
  • http://www.robotstxt.org/db.html
  • http://www.googleblog.blogspot.com
  • For more details visit http://techwasim.blogspot.com