Title: Search Engines for Intranets
1. Search Engines for Intranets
- Types of search engines
- How do search engines work?
- Features of search engines
- Choosing the right search engine
- Free and commercial search engines
- Demo of ht//Dig and MG
2. Types of Search Engines
- Internet search engines
- Crawl, index and search the entire Internet. E.g. AltaVista, Lycos, Infoseek.
- Intranet search engines
- Crawl and index internal web servers, and/or portions of these servers, to create a custom, searchable index of the documents and data housed on the servers. E.g. ht//Dig, SWISH.
- Website indexing (e.g. a library website)
- Indexing textual databases (e.g. bibliographic and full-text files)
3. Internet search engines
4. Intranet search engine
5. Types of Search Engines
- Intranet search engines differ from Internet search engines in the following ways:
- They often provide indexing for many document formats, such as PDF, word processing, spreadsheets, databases and graphics.
- The indexing process is usually deeper than its Internet counterpart.
6. Types of Search Engines
- Why search engines for information professionals?
- Knowledge of the indexing and searching process helps in the implementation and evaluation of intranet search engines.
- They need to familiarise themselves with the products available and the issues surrounding their selection, implementation and use.
7. Types of Search Engines
- Why search engines for information professionals?
- In-depth knowledge of searching techniques, including the use of controlled vocabulary, Boolean operators, proximity operators and relevancy ranking, is necessary for evaluation.
- An understanding of, and experience with, standard indexing practices and parameters can also ensure that the data contained in the various indexes built on a corporate intranet will facilitate accurate and efficient data retrieval.
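The Boolean and proximity operators mentioned above can be sketched against a toy inverted index. This is a minimal illustration, not any particular product's implementation; the documents, terms and positions are invented.

```python
# Toy inverted index: term -> {doc_id: [word positions]}.
# All documents and terms here are invented for illustration.
index = {
    "intranet": {1: [0], 2: [3]},
    "search":   {1: [1], 2: [4], 3: [0]},
    "engine":   {1: [2], 3: [1]},
}
all_docs = {1, 2, 3}

def docs(term):
    """Set of documents containing a term."""
    return set(index.get(term, {}))

# Boolean operators map directly onto set operations over posting lists.
and_result = docs("intranet") & docs("search")  # AND: both terms present
or_result = docs("intranet") | docs("engine")   # OR: either term present
not_result = all_docs - docs("engine")          # NOT: term absent

def near(t1, t2, k):
    """Proximity: documents where t1 and t2 occur within k positions."""
    hits = set()
    for d in docs(t1) & docs(t2):
        for p1 in index[t1][d]:
            if any(abs(p1 - p2) <= k for p2 in index[t2][d]):
                hits.add(d)
    return hits

print(and_result)
print(near("search", "engine", 1))
```

Relevancy ranking would add a scoring step over these matched sets rather than returning them unordered.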
8. How do search engines work?
- Intranet search engines operate in a manner similar to information retrieval systems (Fig 1).
- Components of a search engine: the gatherer, the indexer and the search engine (Fig 2).
- Gatherer
- The gatherer, or crawler, gathers content descriptors from the document collection. In the case of HTML files it follows links to other pages within the site; this is called the site being "spidered" or "crawled". In the case of remote indexing, the gatherer returns to the site on a regular basis.
9. Fig 1
10. Fig 2
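The link-following step of the gatherer can be sketched with Python's standard html.parser. The page content and intranet URLs below are invented; a real gatherer adds HTTP fetching, duplicate detection, politeness delays and the revisit scheduling described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkGatherer(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so the crawler can queue them.
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical page fetched from an intranet server.
page = '<html><body><a href="/docs/a.html">A</a> <a href="b.html">B</a></body></html>'
g = LinkGatherer("http://intranet.example/dir/")
g.feed(page)
print(g.links)
```

A crawl loop would push each discovered link onto a queue, skipping URLs already visited or outside the configured site.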
11. How do search engines work?
- Indexer
- Everything the gatherer finds goes into the second part of a search engine, the index. The index, also called the catalog, contains all the descriptors that the gatherer finds.
- Search engine
- This is the program that sifts through the millions of descriptors recorded in the index to find matches to a search. Search engines also support free-text indexing and relevance ranking. This process is shown in Fig 3a and Fig 3b.
12. Fig 3a
13. Fig 3b
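The indexer and search-engine roles described above can be sketched as an inverted index queried with a simple term-frequency ranking. The sample documents are invented, and production engines use more refined scoring (e.g. tf-idf), but the division of labour is the same: the indexer builds the catalog once, the search program only consults it.

```python
from collections import defaultdict

# Hypothetical document collection gathered from an intranet.
documents = {
    "intro.html":  "search engines index intranet documents",
    "howto.html":  "the indexer builds the index the search engine queries",
    "policy.html": "intranet usage policy",
}

# Indexer: build an inverted index mapping term -> {doc: frequency}.
index = defaultdict(dict)
for doc, text in documents.items():
    for term in text.lower().split():
        index[term][doc] = index[term].get(doc, 0) + 1

# Search engine: rank documents by summed query-term frequency,
# breaking ties alphabetically for a deterministic ordering.
def search(query):
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc, freq in index[term].items():
            scores[doc] += freq
    return sorted(scores, key=lambda d: (-scores[d], d))

print(search("search index"))
```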
14. Features of search engines
- Technical functionality
- Indexing features
- Search features
- Results display
- Costs, licensing and registration requirements
- Unique features (if any)
15. Features of search engines: Technical functionality
16. Features of search engines: Indexing features
17. Features of search engines: Indexing features
18. Features of search engines: Searching features
19. Features of search engines: Searching features
20. Features of search engines: Results display
21. Choosing the right search engine
- Checklist of factors to consider when selecting a search engine:
- Size of the website
- Technical expertise available (local and/or from the supplier/developer)
- System platforms available
- Information sources and services to be supported
- Document collection type and volume (now and in the future)
- Indexing, search and display requirements
22. Choosing the right search engine
- Checklist of factors to consider when selecting a search engine:
- User community to be served
- Differentiate between the need for indexing the website pages and the need for indexing databases/document collections (text, bibliographic, DBMS, etc.)
- Support for the concept of a "record" by the search engine
- Support for structured fields and metadata
- Cost
23. Choosing the right search engine
- Steps in the selection and procurement of search engines:
- Conduct a needs analysis.
- Talk to other libraries.
- Attend trade shows and talk to vendors.
- Read the literature that reviews search engines.
- Compile a list of possible products.
24. Choosing the right search engine
- Steps in the selection and procurement of search engines:
- Compare the functionality of each product to the criteria you developed through the needs analysis.
- Narrow your list down to three possible products.
- Spend additional time learning about each product.
- Invite the vendors in for demonstrations.
- Ask for references and follow up with each reference.
- Select a product and implement it.
- Follow up with end users.
- Continue an ongoing review with end users.
25. Choosing the right search engine
- Some suggestions
- The development or selection of a search system should be based primarily on local needs.
- Consider using freeware search engines if they meet your requirements.
- For large, highly developed intranet sites, you may want to consider commercial search engines.
- Consider whether the web server you are using supports indexing and search, and whether this is adequate for you.
26. Choosing the right search engine
- IT professionals should make an effort to keep abreast of current web technologies.
- The features available within a tool should be used properly to get maximum benefit.
- Carefully consider the interrelations between the three major components: document resources, users and the search engine.
27. Free and commercial search engines
- For bibliographic and textual databases (multi-record files):
- MG (Managing Gigabytes) (www.mds.rmit.edu.au/mg/)
- freeWAIS-sf (www.wsc.com/freeWAIS-sf/fwmain.html)
- Isearch (www.cnidr.org/ir/isearch.html)
- WWWISIS (www.bireme.br/wwwisis2.htm)
28. Free and commercial search engines
- For HTML and text files (website indexing and file/directory-level indexing):
- SWISH-E (sunsite.berkeley.edu/SWISH-E/)
- ht//Dig (htdig.sdsu.edu/)
- Excite for Web Servers (www.excite.com/navigate/)
- WebGlimpse (glimpse.cs.arizona.edu/webglimpse/)
- For structured/formatted data:
- MySQL (www.tcx.se/)
29. Free and commercial search engines
- Commercial search engines
- AltaVista (www.altavista.digital.com/)
- Fulcrum (www.fulcrum.com/ )
- Infoseek (software.infoseek.com)
- Open Text (www.opentext.com/)
- Oracle (www.oracle.com/)
- PLS (www.pls.com/)
- Verity (www.verity.com/)
30. Search engines: Related sources
- Boeri, Robert J. Intranet searching: A light at the end of the tunnel. EMedia Professional, June 1998, pp. 63-69.
- Esler, Sandra L. and Nelson, Michael L. NASA indexing benchmarks: evaluating text search engines. Journal of Network and Computer Applications, 20, 1997, pp. 339-353.
- Hibbard, Justin. Applications--Straight Line to Relevant Data--Customized Content Should Slash Intranet Search Time. Information Week, November 17, 1997.
- Nance, Barry. Internal Search Engines Get You Where You Want To Go. Network Computing, October 8, 1997.
31. Search engines: Related sources
- Railsback, Kevin. Serving Up Quality Searches--Six Server-based Packages for Adding Search Capability to a Website. Internet Computing, February 16, 1998.
- Sullivan, Danny. Search Engine Solutions for Your Site--Make Your Site Easy to Search with an Assortment of Features and Techniques. NetGuide, December 1, 1996.
- Zorn, Peggy et al. Surfing corporate intranets: Search tools that control the undertow. Online, May/June 1997, pp. 30-51.
32. Intranet search engine: ht//Dig
- Developed in 1995 at San Diego State University as a way to search the various web servers on the campus network.
- The current release is htdig-3.1.3.tar.gz and is available at htdig.sdsu.edu/files/htdig-3.1.3.tar.gz
- The ht//Dig system is a complete World Wide Web indexing and searching system for a small domain or intranet.
- It contains four program modules: htdig (retrieves HTML documents), htmerge (creates the document index and word database), htfuzzy (creates indexes for different "fuzzy" search algorithms) and htsearch (the search engine).
33. Intranet search engine: MG
- Developed in 1994 by Tim C. Bell (University of Canterbury), Alistair Moffat (University of Melbourne), Ian Witten (University of Waikato) and Justin Zobel (RMIT).
- The current version is 1.2.1.
- The MG software is a collection of programs that, through the use of compression, provide economical storage and indexing for large collections of documents, as well as fast index construction and query processing.
- It can be obtained via anonymous FTP from the Australian archive host munnari.oz.au (128.250.1.21), in the directory /pub/mg; the documentation is available at www.mds.rmit.edu.au/mg/
- It consists of three program modules: mgbuild (database creation), mgquery (database search) and mgmerge (database updating).
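The compression idea behind MG can be illustrated with a common inverted-file technique: storing each posting list as gaps between ascending document numbers, then coding the gaps with a variable-byte scheme. This is a generic sketch of the technique, not MG's actual code or file format.

```python
def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte; high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80              # flag the final (least significant) byte
        out.extend(reversed(chunk))
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for b in data:
        if b & 0x80:                  # terminating byte of a number
            numbers.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return numbers

# A posting list of ascending document numbers, stored as gaps:
# small gaps compress to a single byte each.
postings = [3, 7, 154, 160]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
encoded = vbyte_encode(gaps)
assert vbyte_decode(encoded) == gaps  # round-trips losslessly
print(len(encoded), "bytes for", len(postings), "postings")
```

Because common terms have dense (hence small-gap) posting lists, this kind of coding is what lets systems like MG keep both the text and the index far smaller than the raw collection.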