Title: Using the WWW
1 Using the WWW
- Miroslav Milinovic
- Croatian Academic and Research Network - CARNet
- Zagreb, Croatia
5th CEENet Workshop on Network Technology,
Budapest, Hungary, August 1999.
2 Using the WWW (Part 1): Introduction to the WWW information service
3 Content
- Computer network and information network
- WWW - important concepts (HTML, URL, HTTP)
- How does the WWW work?
- WWW client - browser
- HTML file = Web page
- Netscape browser
- Client - server communication
- Server status codes
- Active Web pages
- Internationalization
- Security
4 Computer network
5 Information network
6 WWW - World Wide Web
- Distributed, multimedia information service based on hypertext
- Distributed
- information located on hosts around the world
- Multimedia
- information includes text, graphics, sound, video
- Hypertext
- hypertext techniques used to enable access to the information
7 WWW - important concepts
- Uses Client/Server Architecture
- client programs known as browsers
- Netscape, MS IE, Amaya, HotJava, Tango, Lynx, ...
- Provides access to Internet resources
- provides access to WWW resources as well as FTP, NEWS, Gopher, ...
- brings together a whole range of services
8 WWW - important concepts
- WWW resources - documents are prepared using a simple standard markup language which defines the document
- content, presentation, links to other documents
- Documents have a unique identifier
- depends on their location on a particular host
- Clients can communicate with any server
- using the correct protocol
9 WWW - important concepts
- HTML - HyperText Markup Language
- language for preparing WWW documents
- URL - Uniform Resource Locator
- resource address = unique identifier
- HTTP - HyperText Transfer Protocol
- defines communication between WWW client and server
10 HTML
- HTML is the native language of the WWW
- HTML file = Web page
- standards
- HTML 1.0, 2.0, 3.0, 3.2, 4.0
- browser extensions (Netscape, MS IE, ...)
- other (VRML, DHTML, SMIL, MathML, CSS, XML, XSL, ...)
- XHTML 1.0 (in draft)
11 URL - locating Internet resources
- URL is a unique identifier for Internet resources
- indicates
- means of access
- location
- simple syntax
- protocol://host_name:port_num/path/file_name
- example
- http://www.ceenet.org/constitution.html
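The URL syntax on this slide can be illustrated with a minimal Python sketch that splits a URL into the parts named above. It uses the standard library's `urllib.parse.urlsplit`; the helper name `describe_url` and the explicit port 80 in the second call are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the parts named on the slide (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,     # means of access, e.g. "http"
        "host_name": parts.hostname,  # location of the server
        "port_num": parts.port,       # None when the URL gives no port
        "path": parts.path,           # path and file name on that host
    }


plain = describe_url("http://www.ceenet.org/constitution.html")
with_port = describe_url("http://www.ceenet.org:80/constitution.html")
print(plain)
print(with_port)
```

When the port is omitted, the client falls back to the default port of the protocol (80 for HTTP).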
12 Internet resources identification
- URI - Uniform Resource Identifier
- URL - Uniform Resource Locator
- PURL - Persistent URL
- URN - Uniform Resource Name
- URC - Uniform Resource Characteristics
- data about the networked resource
- metadata = data about data
13 HTTP
- application-level protocol
- stateless
- supports
- use of URLs
- Internet media types (MIME types, RFC 2045 - RFC 2049)
- allows access to different data formats
- standards
- HTTP 1.0 (RFC 1945), HTTP 1.1 (RFC 2068, January 1997)
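As a rough illustration of Internet media types, the following Python sketch maps a file name to the MIME type a server might announce for it. It relies on the standard `mimetypes` module; the helper name `media_type`, the sample file names, and the `application/octet-stream` fallback are assumptions, not part of the original slides.

```python
import mimetypes


def media_type(file_name):
    """Guess the Internet media type from the file extension (sketch)."""
    guessed, _encoding = mimetypes.guess_type(file_name)
    # Assumed fallback for unknown extensions: generic binary data.
    return guessed or "application/octet-stream"


for name in ("constitution.html", "logo.gif", "photo.jpeg"):
    print(name, "->", media_type(name))
```

The browser uses this type, not the file name, to decide how to display the data or which helper application to start.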
14 How does the WWW work?
15 How does the WWW work?
16 WWW client - browser
- retrieves (and displays if possible) various resources
- can be
- text-only (Lynx, ...)
- graphic (Netscape, ...)
- there are some differences in displaying HTML documents between different clients
- can display a variety of formats
- TEXT, GIF, JPEG, ...
17 WWW client - browser
- has multiprotocol support
- HTTP, FTP, GOPHER, NNTP, SMTP, POP, ...
- can automatically launch a helper application (viewer) to handle some data formats (sound, video, PostScript, MS applications, ...)
- plug-in extensions can be used to extend browser capabilities (3D animation, various graphics formats, ...)
18 HTML file = Web page
HTML source
Web page displayed by browser
19 Netscape browser
Title line
Menus
Icons
URL
HTML document
Hyperlink
Status line
20 Client - server communication
- Simple client request (entered manually)
- telnet www.srce.hr 80
- Trying 161.53.2.69...
- Connected to regoc.srce.hr.
- Escape character is '^]'.
- GET /index.html HTTP/1.0
- ACCEPT: */*
- USER-AGENT: manually entered HTTP
- (blank line)
21 Client - server communication
- Server reply
- HTTP/1.0 200 OK
- Date: Tue, 29 Jul 1997 12:56:15 GMT
- Server: Apache/1.1.3
- Content-type: text/html
- Content-length: 2320
- Last-modified: Fri, 22 Nov 1996 10:07:27 GMT
- (blank line)
- (content - document source)
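The request and reply shown on these two slides can be sketched in Python without touching the network. The helper names `build_request` and `parse_reply` are hypothetical, and the sample reply is a shortened copy of the one above with a made-up one-line body; a real client would send the request over a TCP connection to port 80.

```python
def build_request(path, headers):
    """Compose an HTTP/1.0 GET request as text (sketch)."""
    lines = [f"GET {path} HTTP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # A blank line ends the header section of the request.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_reply(raw):
    """Split a server reply into status code, reason, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


request = build_request(
    "/index.html",
    {"Accept": "*/*", "User-Agent": "manually entered HTTP"},
)
sample_reply = (
    "HTTP/1.0 200 OK\r\n"
    "Date: Tue, 29 Jul 1997 12:56:15 GMT\r\n"
    "Server: Apache/1.1.3\r\n"
    "Content-type: text/html\r\n"
    "\r\n"
    "<html></html>"  # assumed body, for illustration only
)
code, reason, headers, body = parse_reply(sample_reply)
print(request)
print(code, reason, headers["content-type"])
```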
22 Server status codes
- Status codes are three-digit numbers grouped as follows
- 1xx - informational
- 2xx - client request successful
- 200 - OK
- 3xx - request redirected
- 4xx - client errors (request incomplete)
- 403 - Forbidden
- 404 - Not found
- 5xx - server errors
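The grouping above depends only on the first digit of the code, which a short Python sketch makes explicit. The function name `status_class` and the exact wording of the group labels are assumptions for illustration.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "client request successful",
    3: "request redirected",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Return the group of a three-digit status code (sketch)."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid status code: {code}")
    return STATUS_CLASSES[code // 100]


for code in (200, 403, 404, 500):
    print(code, "-", status_class(code))
```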
23 Active Web pages
- enhanced Web
- two way interaction
- page animation
- browser intelligence
- desktop integration
- better multimedia
- access to other systems
- common examples
- forms (feedback processing)
- active maps (clickable maps)
- (database) gateways
24 Active Web pages
- techniques
- CGI - Common Gateway Interface
- WWW server communicates with other programs (CGI scripts)
- SSI - Server Side Includes (.shtml)
- API - Application Programming Interface
- Cookies (making a browser remember)
- scripting languages (embedded in HTML document)
- JavaScript, VBScript, ...
- DHTML
- Java (applets, servlets)
- ActiveX
25 Active Web pages
- Who is doing the job?
- browser downloads and automatically executes a program (Java applet)
- OR
- HTML document is generated on the server machine (by a CGI script)
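The second case above, where a CGI script generates the HTML document on the server, might look roughly like the Python sketch below. A CGI script writes a header, a blank line, and then the document to standard output; the function name `render_page` and the greeting page are hypothetical.

```python
import html


def render_page(visitor):
    """Build the output a CGI script would print: header, blank line, HTML."""
    # Escape user-supplied text so it cannot inject markup into the page.
    safe_name = html.escape(visitor)
    document = (
        "<html><head><title>Welcome</title></head>"
        f"<body><p>Hello, {safe_name}!</p></body></html>"
    )
    return "Content-type: text/html\n\n" + document


# The server would send this text back to the browser as the reply.
print(render_page("CEENet workshop"))
```

The browser receives ordinary HTML and cannot tell that the page was generated by a program rather than read from a file.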
26 Active Web pages
27 Internationalization
- originally
- plain ASCII (Latin 1), English language
- HTML internationalization (RFC 2070)
- UNICODE, language attribute in HTML
- HTTP 1.1
- enables charset and language negotiation
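Language negotiation in HTTP 1.1 can be sketched as follows: the client lists preferred languages with optional quality values in an `Accept-Language` header, and the server picks the best one it can offer. This is a simplified Python illustration; the function names and the sample header value are assumptions, and real servers handle more cases (wildcards, language ranges such as `en-GB`).

```python
def parse_accept_language(header):
    """Return (language, quality) pairs, best first (simplified sketch)."""
    preferences = []
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        quality = 1.0  # default when no q value is given
        params = params.strip()
        if params.startswith("q="):
            quality = float(params[2:])
        preferences.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal qualities keep the client's order.
    return sorted(preferences, key=lambda pref: pref[1], reverse=True)


def choose_language(header, available):
    """Pick the most preferred language the server can offer, or None."""
    for tag, quality in parse_accept_language(header):
        if quality > 0 and tag in available:
            return tag
    return None


print(choose_language("hr, en;q=0.8, hu;q=0.5", ["en", "hu"]))
```

Charset negotiation works the same way with the `Accept-Charset` header.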
28 Security
- plain WWW is not secure!
- security on
- content level
- PGP, data encryption
- channel level
- SSL (Secure Sockets Layer)
- message level
- SHTTP, PEP, ...
29 Questions?
30 Using the WWW (Part 2): Searching the Internet
31 Content
- Internet information space
- Searching with the WWW
- Searching the WWW
- Search Engines
- Metasearch Engines
- Subject Catalogs
- Other tools
- Conclusion on search tools
- Selecting a tool
- A Strategy?
32 Internet information space
- is NOT unified
- many subjects
- different formats
- different resources (information services)
- various tools and techniques for searching and information retrieval
- some information is not (yet)
- published electronically
- available on the Net
(Diagram labels: Internet, printed, WWW)
33 Searching with the WWW
- searching tools
- many different tools
- various concepts
- specialized for chosen resources
- WWW, gopher, Netnews, FTP, databases, ...
- global or local scope
- main problems: quality and currency
- there is NO perfect tool
- user needs a strategy
34 Searching the WWW
- Search Engines
- Metasearch Engines (Unified Search Interfaces)
- Subject Catalogs (Virtual Libraries)
- Other tools
- Multiple Search Interfaces
- Information Gateways
- Portals
35 Search Engines
- automated systems
- specially designed programs
- robots, crawlers, spiders
- fetch WWW documents
- index those documents to build a database
- provide an interface for the user to search the database
- query syntax
- searching features
- presentation of the results - hits (format, ranking)
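The "index documents to build a database, then search it" steps can be sketched with a tiny inverted index in Python. This is only an illustration of the idea, not how any particular search engine works; the documents, the example.org URLs, and the function names are made up, and the fetching step performed by the robot is left out.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing all query words (case-insensitive)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


# Hypothetical documents already fetched by the robot.
documents = {
    "http://example.org/shuttle": "NASA space shuttle program",
    "http://example.org/garden": "Green vegetable garden",
    "http://example.org/nasa": "NASA history",
}
index = build_index(documents)
print(search(index, "NASA shuttle"))
```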
36 Search Engines
37 Search Engines
- examples
- Alta Vista - http://altavista.digital.com/
- excite! NetSearch - http://www.excite.com/
- InfoSeek - http://www.infoseek.com/
- HotBot - http://www.hotbot.com/
- Lycos Search - http://www.lycos.com/
- WebCrawler - http://www.webcrawler.com/
- Northern Light Search - http://www.northernlight.com/
- Google - http://www.google.com/
- Ask Jeeves! - http://www.ask.com/
- local (regional) search engines
38 Search Engines
- query syntax and searching features
- upper and lower case letters
- John December
- island
- phrases (text in quotes - "...")
- "NASA space shuttle program"
- "John December"
- Boolean operators (AND, OR, NOT) and parentheses (...)
- vegetable AND green
- fruit NOT apple
- keyword control (+, -)
- film noir -pinot noir
- python -monty
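Keyword control with a leading minus (and plus) sign can be sketched in Python as a simple filter over the words of a document. The rules below are an assumption made for illustration: plain and `+` terms are treated as required, `-` terms as forbidden. Real search engines differ in how they treat plain terms.

```python
def matches(query, text):
    """Check a text against a query with +required and -excluded words."""
    words = set(text.lower().split())
    for term in query.lower().split():
        if term.startswith("-"):
            if term[1:] in words:
                return False  # an excluded word is present
        else:
            # Assumption: plain terms are required, like +terms.
            if term.lstrip("+") not in words:
                return False
    return True


print(matches("python -monty", "python snake habitat"))
print(matches("python -monty", "monty python sketch"))
```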
39 Search Engines
- query syntax and searching features
- proximity search (NEAR)
- Internet NEAR training (Alta Vista)
- keyword truncation (*)
- alumi*um
- comput*
- cascade search (Infoseek)
- resource control (AltaVista, HotBot, Infoseek)
- title:"Internet training"
- natural language searching (Ask Jeeves!)
- ...
40 Search Engines
- important characteristics
- database (quantity, quality)
- query language
- response time
- ranking (hits)
- output (format, available info)
- additional features (cascade search, refine, ...)
- ...
41 Search Engines
- advantages
- vast number of documents (over 100 million)
- highly efficient searching and retrieval
- automated production
- disadvantages
- no quality control
- no classification
- hits can be out of context
- dead or out-of-date links, junk
42 Metasearch Engines
- Unified Search Interfaces
- automated systems
- DO NOT build databases of their own
- query other search engines
- provide a unified interface for the user to search a number of databases (search engines) with one query
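How a metasearch engine might combine the hit lists returned by several search engines can be sketched in Python. The scoring rule used here (each engine contributes 1/rank for a hit, duplicates add up) is only one possible choice and is an assumption, as are the function name and the example.org URLs; the slides do not say how any real metasearch engine merges results.

```python
def merge_hits(result_lists):
    """Merge ranked hit lists from several engines into one ranked list."""
    scores = {}
    for hits in result_lists:
        for rank, url in enumerate(hits, start=1):
            # A hit near the top of any list earns a higher score;
            # hits reported by several engines accumulate score.
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


engine_a = ["http://example.org/a", "http://example.org/b"]
engine_b = ["http://example.org/b", "http://example.org/c"]
merged = merge_hits([engine_a, engine_b])
print(merged)
```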
43 Metasearch Engines
- examples
- All4one - http://all4one.com/
- Mamma - http://www.mamma.com/
- MetaCrawler - http://www.metacrawler.com/
- SavvySearch - http://www.savvysearch.com/
44 Metasearch Engines
- important characteristics
- number and selection of search engines covered
- query language
- response time
- ranking (hits)
- results (hits) merging
- output (format, available info)
- additional features
- ...
45 Metasearch Engines
- advantages
- same as search engines
- make use of search engines easier
- disadvantages
- same as search engines
- unified query for all search engines means loss of additional capabilities of a particular search engine
- searching is slower
46 Subject Catalogs
- Virtual Libraries, Subject Directories
- collections of Internet resource descriptions
- names, URLs, abstracts, ratings, ...
- organized within a hierarchical subject scheme
- heuristic (subject based)
- UDC, Dewey, ...
- manually maintained
- internal search
47 Subject Catalogs
- examples
- Yahoo - http://www.yahoo.com/
- EINet Galaxy - http://galaxy.einet.net/
- Magellan - http://magellan.excite.com/
- NetGuide - http://www.netguide.com/
- BUBL - http://bubl.ac.uk/link/
- WWlib - http://www.scit.wlv.ac.uk/wwlib/
- WWW.HR - http://www.hr/wwwhr/
48 Subject Catalogs
- important characteristics
- size
- classification method
- available info (about classified resources)
- ranking
- (internal) searching
- additional features
- ...
49 Subject Catalogs
- advantages
- classified into subject areas
- manually reviewed resources (no junk)
- internal search
- disadvantages
- manual maintenance
- out-of-date information
- some parts of the catalogue are not professionally maintained
50 Other tools
- Multiple Search Interfaces
- simple Web pages with interfaces to a number of search tools
- enable user to choose among listed tools
- DO NOT build databases of their own
- DO NOT act as Metasearch Engines
- examples
- All-in-One - http://www.allonesearch.com/
- Easy Searcher - http://www.easysearcher.com/
51 Other tools
- Information Gateways
- dedicated to one subject (e.g. Social Sciences)
- examples
- SOSIG - http://sosig.ac.uk/
- OMNI - http://www.omni.ac.uk/
- And ...
- electronic dictionaries, encyclopedias, guides, software collections, map collections, databases, tools for searching non-WWW resources, ...
52 Conclusion on search tools
- each tool has advantages and disadvantages
- new systems appear, old ones stagnate
- CAUTION: tools are text oriented
- non-WWW resources are also covered
- quality and currency
- precision vs. recall
- cooperation between tools is a necessity
- winner: portal
- hybrid tool (Search Engine with Catalog)
- brings together many (all) network services
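The "precision vs. recall" trade-off mentioned above can be made concrete with the standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The Python sketch below uses made-up document labels; the function name is an assumption.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query (standard definitions)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 documents actually relevant.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A search engine with a huge database tends toward high recall and lower precision, while a manually reviewed catalog tends toward the opposite.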
53 Selecting a tool
- Search Engines
- when you have good (precise) keywords (narrow topic)
- Subject Catalogs
- for look and feel
- when you don't have good keywords (broad topic)
- Information Gateways or other specialized tools
- for quality (if you can find one)
- Multiple Search Interfaces
- useful to see what is available
- Portals
- tools for the future
54 A Strategy?
- no searching system is perfect
- be flexible and try different tools
- compare results and gain experience
- learn vocabulary, read HELP and FAQ
- be focused (don't wander around)
- concentrate on the problem, not on the tool (query)
- use a stepwise approach
- refine the query (keywords)