Title: Lecture Number Two
1 Lecture Number Two
- Web Searching and How it Evolved
2 -
- The web is basically a jungle!
3 What do you mean, jungle?
- More than 2 billion pages of accessible information
- No consistent ID system (like _______ for books)
- No cataloguing principle (like __________ or
_________________)
4 And furthermore
- Many documents don't name the author, and you
can't tell how current the information is
- Most engines index every word, so too much comes
back
- And remember: you're NOT searching live; you're
looking at a fixed database compiled before you
searched
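To make the last two points concrete, here is a minimal sketch of a word index that is built ahead of time and then queried. The page addresses, page texts, and function names are hypothetical, and a real engine is far more elaborate; the sketch only shows why every word becomes searchable and why results reflect an earlier snapshot rather than the live web.

```python
from collections import defaultdict

# Hypothetical snapshot of pages, collected BEFORE anyone searches.
PAGES = {
    "example.org/a": "the web is a jungle",
    "example.org/b": "search engines index the web",
}

def build_index(pages):
    """Map every word to the set of pages containing it (an inverted index)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, word):
    """Look the word up in the prebuilt index; the live pages are never consulted."""
    return sorted(index.get(word.lower(), set()))

INDEX = build_index(PAGES)

# A page changes after indexing, but the index still reflects the old snapshot.
PAGES["example.org/a"] = "this page was rewritten"

print(search(INDEX, "jungle"))     # still finds the old version of page a
print(search(INDEX, "the"))        # a common word matches many pages
print(search(INDEX, "rewritten"))  # the new text is not in the index yet
```

Because even a word like "the" is indexed, a broad query returns far more than you want, which is the "too much comes back" problem.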
5 Two Ways to Find Things
- Search Engines (all electronic)
- Subject Directories (electronic search of
human-maintained content)
6 Search Engines
- electronic
-
- index every page of a Web site
7 3 Types of Search Engines
- --global search _______________
- --subject-specific search only within
a defined area
- --meta-engines (aka _________)
- - Combine results from several search
engines, ranked by relevance
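As a rough illustration of how a meta-engine might combine several ranked result lists, the sketch below gives each page 1/rank points from every engine that returns it and sorts by the total. This scoring rule is an assumption chosen for simplicity, not a description of any particular meta-engine, and the engine lists and site names are made up.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Combine ranked lists; each URL earns 1/rank from every engine listing it."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Highest combined score first; ties broken alphabetically for stability.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical results from three engines for the same query.
engine_a = ["site1.example", "site2.example", "site3.example"]
engine_b = ["site2.example", "site1.example", "site4.example"]
engine_c = ["site2.example", "site5.example"]

merged = merge_results([engine_a, engine_b, engine_c])
print(merged)
```

A page that several engines rank highly (here site2.example) rises to the top of the merged list, while pages found by only one engine fall toward the bottom.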
8 Examples of Search Engines
- Google
- AltaVista
- AlltheWeb
9 A History of Search Engines
10 Precursors to Search Engines
- Before the internet there was:
- 1969 New York Times Project
11 1990
- ARCHIE: the first search engine
- _________ University, Alan Emtage
- called Archie because _______________
- search term had to match exactly
12 1993
- EXCITE
- __________ undergraduates
- different from Archie because ____________
13 1993 (cont)
- WORLD WIDE WEB WANDERER
- first bot (robot)
- counted ________ and eventually _______
- bots evolved into Spiders
- catalogued links to make a searchable
index
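A spider's basic job, following links from page to page and cataloguing what it finds, can be sketched as below. The miniature "web" is a made-up dictionary rather than real pages, and the breadth-first order is just one reasonable assumption about how a spider might walk the links.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the links found on it.
WEB = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["about", "archive"],
    "archive": [],
    "orphan": ["home"],  # no page links here, so a spider never finds it
}

def crawl(start, web):
    """Follow links breadth-first from `start`, cataloguing each page once."""
    catalogue = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        catalogue.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return catalogue

print(crawl("home", WEB))
```

Note that the page called "orphan" is never catalogued: a spider can only reach pages that some other page links to.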
14 Some Spiders
- JumpStation
- World Wide Web Worm
- RSBE
- ( __________________________)
- first to rank results by
relevance
15 But spiders caused trouble
- Because they _______________________
- JumpStation was famous for that
16 1994
- WebCrawler: first to index TEXT on
webpages (rather than just URL/page title)
- Yahoo!: 2 Stanford graduate students' favorite pages
- the first __________
- Infoseek and Lycos
- (Lycos reputedly best for technical
searches)
17 1995
- ____________ (DEC), December
- got big fast
- lots of firsts
- natural language queries
- boolean techniques
- search tips
18 1996
- HotBot
- indexes up to _________ pages/day
- until _________, the most powerful
engine
- has boasted it can index the entire
web
- MetaCrawler: 1st meta-engine
19 1999
- Google!
- first engine to pass a billion pages
- reports pages ranked by number of hits
20 Of these
- Google is commercially dominant (about 75% of most
websites' external referrals)
- Microsoft wants to buy it, but can't because of antitrust
laws
- Yahoo owns, as of 2003: Overture, AlltheWeb,
AltaVista, Inktomi
21 Privacy and Google
- every time you access a page, you get a cookie on
your hard drive, recording your IP address, the date/time,
your search terms, your browser configuration
- the cookies are basically immortal
(expire ________)
22 How is this info used?
- Google customizes your search results using
your IP address
23 The latest on Google and privacy
- Google changed its privacy policy in July 2004
- now they
- - pool the information they collect on you from
all their various services
- - may keep this information indefinitely
- - may give this information to whomever they wish
24-
- if they "have a good faith belief that access,
preservation or disclosure of such information is
reasonably necessary to protect the rights,
property or safety of Google, its users or the
public."
25 Focus
- Before 9/11, privacy issues turned on consumer
protection
- but now government is thinking about looking at
your information in the name of national security
26 TIA (Total Information Awareness)
- Goal: to anticipate terrorist activities
- What: credit card, travel, email, telephone
records
- 2002: Google chief declined comment when asked by
the NY Times if Google had been subpoenaed to turn
over information it gathered
27 Subject Directories
- human-compiled and maintained
- (review: search engines are ______)
- index only home pages
- (review: search engines
index ______)
28 (Dis)advantages of Subject Directories
- use hierarchies
- smaller content
- may be annotated
- but quality control varies
29 Virtual Libraries (some SDs)
- Created and maintained by info professionals
- Internet Public Library
- Resource Discovery Network (from Britain)
30 Subject Directory Approaches
- General: searching from one site
- Clearinghouses: searching from multiple
sites
31 Examples of Subject Directories
- General
- - www.yahoo.com
- - www.looksmart.com
- Clearinghouses
- - Argus (www.clearinghouse.net)
- - About.com
- - Virtual Library (www.Vlib.org)
32 Search Tips
- Get specific by using Boolean Logic
- AND OR NOT
- (often ___ and ____)
33 A Boolean Example
- Tupac Amaru AND Peru
- Tupac Amaru OR MRTA (Movimiento Revolucionario
Tupac Amaru)
- Tupac Amaru NOT Shakur (the rap singer killed in
1996)
- To be exact, use quotes: "Tupac Amaru"
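The three Boolean operators map directly onto set operations, which the following sketch shows on a tiny made-up collection. The four documents and the helper name `contains` are hypothetical, and matching is done on the exact phrase, the way a quoted search would behave.

```python
# Hypothetical mini-collection of documents, keyed by ID.
DOCS = {
    1: "Tupac Amaru led a rebellion in Peru",
    2: "The MRTA was active in Peru",
    3: "Tupac Amaru Shakur was a rap singer",
    4: "Amaru is a surname; Tupac is a given name",
}

def contains(phrase):
    """IDs of documents containing the phrase exactly (case-insensitive)."""
    return {doc_id for doc_id, text in DOCS.items() if phrase.lower() in text.lower()}

tupac = contains("Tupac Amaru")  # exact phrase, so document 4 is excluded

print(sorted(tupac & contains("Peru")))    # AND: both terms must appear
print(sorted(tupac | contains("MRTA")))    # OR: either term may appear
print(sorted(tupac - contains("Shakur")))  # NOT: drop documents with the second term
```

AND narrows the results, OR widens them, and NOT removes an unwanted sense of the term.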
34 More Search Tips
- Use wildcards
- like * or ?
- for roots, like psychol*
- for variant spellings
- like color/colour
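Here is a small sketch of wildcard matching for word roots and variant spellings, using Python's shell-style `fnmatch` rules (`*` for any run of characters, `?` for exactly one). Search engines differ in which wildcard symbols they accept, so treat the patterns as an assumption; the word list is made up.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed words.
WORDS = ["psychology", "psychologist", "psychic", "color", "colour", "collar"]

def wildcard_search(pattern, words):
    """Return the words that match a pattern containing * and/or ? wildcards."""
    return [w for w in words if fnmatchcase(w, pattern)]

print(wildcard_search("psychol*", WORDS))  # every word built on the root
print(wildcard_search("colo*r", WORDS))    # both spellings, since * may match nothing
print(wildcard_search("colo?r", WORDS))    # only the spelling with one extra letter
```

Under these rules `*` is the safer choice for variant spellings, because `?` insists on exactly one character being present.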
35 More Search Tips
- Many URLs are predictable
- so guess first
- utampa.ed
- Don't look at every returned page
36 Use your Tools
- Pay attention to the relevance rankings
some engines give you
- Organize your bookmarks
37 The Invisible Web: What Most Search Engines Don't
Find
- Specialized databases (7,000)
38 What's a Specialized Database?
- Searchable indexes of subjects like email
addresses, magazine archives, government data
files, census info, medical info, etc.
- 2 types: full text and bibliographic
39 How is that different from a subject directory?
- Subject directories are collections of URLs
- Specialized databases are collections of actual
data/information
40 Why they aren't found
- search engines are databases themselves;
programming one database to search another is
difficult
- specialized databases often require search
forms
- databases don't rely on fixed URLs
- text in databases is in a form not usable by search
engines (like Adobe PDF)
41 What can you do?
- pick your search engine carefully
- Google, for instance, lets you use the keyword
"database" plus the subject you want
42 Some helpful sites
- Beaucoup
- Librarian's Index to the Internet
- Gary Price
43 Two kinds of web databases
- full text -- FindLaw
- (Yahoo)
- bibliographic -- MEDLINE
- (Librarian's Index to the Internet)