Lecture Number Two - PowerPoint PPT Presentation

Provided by: GeraldH
Learn more at: https://www.cise.ufl.edu
Transcript and Presenter's Notes

Title: Lecture Number Two


1
Lecture Number Two
  • Web Searching and How it Evolved

2
  • The web is basically a jungle!

3
What do you mean jungle?
  • More than 2 billion pages of accessible info
  • No consistent id system (like _______ ) for books
  • No cataloguing principle (like __________ or
    _________________)

4
And furthermore
  • Many documents don't name the author, and you
    can't tell how current the information is
  • Most engines index every word, so too much comes
    back
  • And remember: you're NOT searching live, you're
    looking at a fixed database compiled before you
    searched

5
Two Ways to Find Things
  • Search Engines (all electronic)
  • Subject Directories (electronic search of
    human-maintained content)

6
Search Engines
  • electronic
  • index every page of a Web site

7
3 Types of Search Engines
  • --global search _______________
  • --subject-specific search only within
    a defined area
  • --meta-engines (aka _________)
  • - Combine results from several search
    engines, ranked by relevance
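
One way to picture how a meta-engine might combine several ranked result lists is sketched below. This is only an illustration: the scoring rule (each engine gives a page 1/rank credit), the function name, and the example URLs are all assumptions, not how any particular meta-engine works.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Combine ranked URL lists from several engines into one ranked list."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            # Assumed scoring rule: a page ranked higher by an engine
            # (smaller rank number) earns more credit from that engine.
            scores[url] += 1.0 / rank
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical result lists from three engines for the same query.
engine_a = ["a.example/peru", "b.example/mrta", "c.example/history"]
engine_b = ["b.example/mrta", "a.example/peru", "d.example/news"]
engine_c = ["b.example/mrta", "d.example/news"]

merged = merge_results([engine_a, engine_b, engine_c])
print(merged)
```

Pages that several engines agree on rise to the top, which is the basic appeal of a meta-engine.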

8
Examples of Search Engines
  • Google
  • Altavista
  • Alltheweb

9
A History of Search Engines
  • 1990 to the present

10
Precursors to Search Engines
  • Before the internet there was.
  • 1969 New York Times Project

11
1990
  • ARCHIE- the first search engine
  • _________ University, Alan Emtage
  • called Archie because _______________
  • search term had to match exactly

12
1993
  • - EXCITE
  • __________ undergraduates
  • different from Archie because____________

13
1993 (cont)
  • WORLD WIDE WEB WANDERER-
  • first bot (robot)
  • counted ________ and eventually _______
  • - bots evolved into Spiders
  • catalogued links to make searchable
    index
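
The link-following behaviour of a spider can be sketched on a made-up, in-memory "web". The page names and the `TOY_WEB` structure are hypothetical; a real spider would fetch pages over the network and extract links from their HTML.

```python
from collections import deque

# Hypothetical toy web: each page maps to the links found on it.
TOY_WEB = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["archive", "home"],
    "archive": [],
    "orphan": ["home"],  # no page links here, so the spider never finds it
}

def crawl(start, web):
    """Visit every page reachable from `start`, following links breadth-first."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen

visited = crawl("home", TOY_WEB)
print(visited)
```

Note that the page nobody links to is never catalogued: a spider can only index what it can reach by following links.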

14
Some Spiders
  • JumpStation
  • World Wide Web Worm
  • RBSE
  • ( __________________________)
  • first to rank results by
    relevance

15
But spiders caused trouble
  • Because they _______________________
  • JumpStation was famous for that

16
1994
  • - WebCrawler - first to index TEXT on
    webpages (rather than just URL/page title)
  • - Yahoo! - 2 Stanford grad students' favorite pages
    - the first __________
  • - Infoseek and Lycos
  • (Lycos reputedly best for technical
    searches)
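
The difference between indexing only titles and indexing the full text of a page can be sketched with a small inverted index. The pages, URLs, and function name below are made up for illustration; this is not the design of any engine named above.

```python
import re
from collections import defaultdict

# Hypothetical pages: (url, title, body text).
PAGES = [
    ("a.example", "Travel notes", "A week in Peru and the Andes"),
    ("b.example", "Peru", "Official tourism information"),
]

def build_index(pages, full_text):
    """Map each lowercase word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, title, body in pages:
        # Title-only indexing ignores the body; full-text indexing includes it.
        text = f"{title} {body}" if full_text else title
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

title_index = build_index(PAGES, full_text=False)
text_index = build_index(PAGES, full_text=True)

title_hits = sorted(title_index["peru"])
text_hits = sorted(text_index["peru"])
print(title_hits)
print(text_hits)
```

With the title-only index, a search for "peru" misses the page that mentions Peru only in its body; the full-text index finds both.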

17
1995
  • ____________ (DEC) December
  • - got big fast
  • -lots of firsts
  • natural language queries
  • boolean techniques
  • search tips

18
1996
  • HotBot
  • -indexes up to _________ pages/day
  • - until _________, the most powerful
    engine
  • - has boasted it can index the entire
    web
  • Metacrawler -- 1st meta-engine

19
19
1999
  • Google!
  • first engine to pass a billion pages
  • reports pages ranked by number of hits

20
Of these - Google
  • commercially dominant (about 75% of most
    websites' external referrals)
  • Microsoft wants to buy it, but can't because of
    antitrust laws
  • Yahoo owned, as of 2003: Overture, Alltheweb,
    AltaVista, Inktomi

21
Privacy and Google
  • every time you access a page, you get a cookie
    on your hard drive, recording your IP address,
    the date/time, your search terms, your browser
    configuration
  • the cookies are basically immortal
    (expire ________)

22
How is this info used?
  • Google customizes your search results using
    your IP number

23
The latest on google and privacy
  • Google changed its privacy policy in July 2004
  • now they
  • - pool the information they collect on you from
    all their various services.
  • may keep this information indefinitely
  • may give this information to whomever they wish.

24
  • if they "have a good faith belief that access,
    preservation or disclosure of such information is
    reasonably necessary to protect the rights,
    property or safety of Google, its users or the
    public." 

25
Focus
  • Before 9/11, privacy issues turned on consumer
    protection
  • - but now government is thinking about looking at
    your information in the name of national security

26
TIA (total information awareness)
  • Goal: to anticipate terrorist activities
  • What: credit card, travel, email, telephone
    records
  • 2002 - Google chief declined comment when asked by
    NY Times if Google had been subpoenaed to turn
    over information it gathered

27
Subject Directories
  • human-compiled and maintained
  • (review: search engines are ______)
  • index only home pages
  • (review: search engines
    index ______)

28
(Dis)advantages of Subject Directories
  • use hierarchies
  • Smaller Content
  • may be annotated
  • But quality control varies

29
Virtual Libraries (some SDs)
  • Created and maintained by info professionals
  • Internet Public Library
  • Resource Discovery Network ( from Britain)

30
Subject Directory Approaches
  • General - searching from one site
  • Clearinghouses - searching from multiple
    sites

31
Examples of Subject Directories
  • general
  • - www.yahoo.com
  • - www.looksmart.com
  • Clearinghouses
  • - Argus (www.clearinghouse.net)
  • - About.com
  • - Virtual Library (www.Vlib.org)

32
Search Tips
  • Get specific by using Boolean Logic
  • AND OR NOT
  • (often ___ and ____)

33
A Boolean Example
  • 1. Tupac Amaru AND Peru
  • 2. Tupac Amaru OR MRTA (Movimiento Revolucionario
    Tupac Amaru)
  • 3. Tupac Amaru NOT Shakur (the rap singer killed in
    1996)
  • To be exact, use quotes: "Tupac Amaru"
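
The three Boolean operators correspond to set operations on the documents that match each term. The sketch below uses a made-up index (the document ids and which terms they contain are assumptions) to show how AND narrows, OR widens, and NOT excludes.

```python
# Hypothetical index: search term -> set of matching document ids.
INDEX = {
    "tupac amaru": {1, 2, 3, 4},
    "peru": {1, 2, 5},
    "mrta": {2, 6},
    "shakur": {3, 4},
}

def docs(term):
    """Return the ids of documents containing the term (case-insensitive)."""
    return INDEX.get(term.lower(), set())

# AND keeps only documents that match both terms (set intersection).
and_hits = docs("Tupac Amaru") & docs("Peru")
# OR keeps documents that match either term (set union).
or_hits = docs("Tupac Amaru") | docs("MRTA")
# NOT removes documents that match the second term (set difference).
not_hits = docs("Tupac Amaru") - docs("Shakur")

print(sorted(and_hits), sorted(or_hits), sorted(not_hits))
```

In this toy data, AND and NOT each cut four matches down to two, while OR grows the result to five.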

34
More Search Tips
  • Use Wildcards
  • like ?
  • for roots like psychol
  • for variant spellings
  • like color/colour
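
Wildcard matching can be tried out with Python's standard `fnmatch` module. The sketch assumes the common convention that `*` stands for any run of characters and `?` for exactly one character; individual search engines use their own symbols and rules, and the word list is made up.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary to search.
WORDS = ["psychology", "psychological", "psychic", "color", "colour", "colder"]

def wildcard_match(pattern, words):
    """Return the words that match a shell-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]

# A root followed by * picks up every ending of that root.
root_hits = wildcard_match("psychol*", WORDS)
# A * inside a word covers variant spellings with or without the extra letter.
spelling_hits = wildcard_match("colo*r", WORDS)
# A ? requires exactly one character in that position.
single_hits = wildcard_match("colo?r", WORDS)

print(root_hits, spelling_hits, single_hits)
```

Under this convention `colo?r` finds only "colour", so the choice of wildcard symbol matters when searching for variant spellings.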

35
More Search Tips
  • Many urls are predictable-
  • so guess first
  • utampa.ed
  • Don't look at every returned page

36
Use your Tools
  • Pay attention to the relevance rankings
    some engines give you
  • Organize your bookmarks

37
The Invisible Web: What Most Search Engines Don't
Find
  • Specialized databases (7,000)

38
What's a Specialized Database?
  • Searchable indexes of subjects like email
    addresses, magazine archives, government data
    files, census info, medical info, etc.
  • 2 types: full text and bibliographic

39
How is that different from a subject directory?
  • Subject dir are collections of urls
  • Specialized dbs are collections of actual
    data/information

40
Why they aren't found
  • -search engines are databases themselves;
    programming one database to search another is
    difficult
  • -specialized databases often require search
    forms
  • -databases don't rely on fixed URLs
  • -text in databases is in a form not usable by search
    engines (like Adobe PDF)

41
What can you do?
  • pick your search engine carefully
  • Google, for instance, lets you use the keyword
    "database" plus the subject you want

42
Some helpful sites
  • Beaucoup
  • Librarian's Index to the Internet
  • Gary Price

43
Two kinds of web databases
  • full text -- FindLaw
  • (yahoo)
  • bibliographic -- MEDLINE
  • (Librarian's Index to the Internet)