Title: Internet search engines: Fluctuations in document accessibility
1. Internet search engines: Fluctuations in document accessibility
- Wouter Mettrop
  - CWI, Amsterdam, The Netherlands
- Paul Nieuwenhuysen
  - Vrije Universiteit Brussel, and Universitaire Instelling Antwerpen, Belgium
- Hanneke Smulders
  - Infomare Consultancy, The Netherlands
- http://www.cwi.nl/cwi/projects/IRT
- Presented at Internet Librarian International 2000 in London, England, March 2000
2. Fluctuations in document accessibility: summary
- Search engines are often compared on the basis of their size, i.e. the number of documents indexed in their databases. However, searchers should be aware that documents cannot be retrieved reliably: unexpected and annoying fluctuations exist in the result sets of documents retrieved by most search engines.
- Ideally, fluctuations are caused only by alterations in the Web (documents come and go). However, in some cases they are caused by changes in indexing policy (indexing fluctuations), and in some cases the origin is more obscure: documents are expected but not retrieved.
- We have investigated these obscure fluctuations by searching repeatedly during a year for several identical test documents. The documents were placed on different sites and remained unchanged. The influences of changes in the indexing policy of the engines are excluded.
- We consider two kinds of obscure fluctuations:
  - 1. Document fluctuations appear when test documents disappear from the database of indexed documents (for whatever reason).
  - 2. Element fluctuations appear when test documents that still exist in the database do not show up in result sets even when they should.
- This presentation is the result of our tests from October 1998 until December 1999. We have evaluated 13 engines: AltaVista, EuroFerret, Excite, HotBot, InfoSeek, Lycos, MSN, NorthernLight, Snap, WebCrawler, and 3 national Dutch engines: Ilse, Search.nl and Vindex.
- The outcome of our investigation is particularly important for known-item searches.
3. WWW: growing number of WWW servers
4. Internet-based information sources: how many? how much?
- In 2000:
  - about 1 billion (1000 million) unique URLs in the total Internet
  - about 10 terabytes (10 000 gigabytes) of text data
5. Internet information retrieval systems in 2000
- Several types of systems exist to retrieve information:
  - Directories of selected sources, categorised by subject, made by humans, mainly for browsing.
  - Search systems, based on databases with machine-made indexes, for word-based searching!
  - Meta-search or multi-threaded search systems.
- We have studied and compared several well-known international (and a few national) word-based Internet search engines.
6. Internet information retrieval systems: evaluation criteria
- Many aspects/criteria can be considered in the evaluation of an Internet search engine, including:
  - coverage of documents present on the WWW (studies exist)
  - number of elements of a document that are indexed to make them usable for retrieval
  - fluctuations over time in the result sets offered by a search engine
- We started by studying the depth of indexing and were soon confronted with the fluctuations in performance that do exist.
7. Internet information retrieval systems: our research group
- The following persons have been involved in the research:
  - Louise Beijer (Hogeschool van Amsterdam, The Netherlands)
  - Hans de Bruin (Unilever Research Laboratorium, Vlaardingen, The Netherlands)
  - Hans de Man (JdM Documentaire Informatie, Vlaardingen, The Netherlands)
  - Rudy Dokter (PNO Consultants, Hengelo, The Netherlands)
  - Marten Hofstede (Rijksuniversiteit Leiden, The Netherlands)
  - Wouter Mettrop (CWI, Amsterdam, The Netherlands)
  - Paul Nieuwenhuysen (Vrije Universiteit Brussel, Belgium)
  - Eric Sieverts (Hogeschool van Amsterdam, and RUU, The Netherlands)
  - Hanneke Smulders (Infomare, Terneuzen, The Netherlands)
  - Hans van der Laan (Consultant, Leiderdorp, The Netherlands)
  - Ditmer Weertman (ADLIB, Utrecht, The Netherlands)
8. Internet search engines: research on indexing functionality
- assessing the indexing functionality
- test document
- test method
- conclusions concerning indexing functionality
9. Number of our test documents that were retrieved
10. Internet search engines: elements of test document studied
- title tag
- META tags: keywords, description and author
- comment tag
- ALT tag
- text/URL of a link to a document
- H3 tag
- table header
- text of an internal link, a reference anchor, a link to a sound file
- name of a sound file (au/wav/aiff/ra)
- text of a link to an image
- name of an image file (gif or jpg; inline or linked to)
- name of a Java applet (with or without the extension class)
- terms after the first 100 lines in a document (200//700)
- the URL of a document
11. Internet search engines: part of the test document source code
<HTML> <HEAD>
<TITLE>Test pagina</TITLE>
<META NAME="keywords" CONTENT="een, twee, drie">
<META NAME="description" CONTENT="This test page, containig a small part of the Secret Garden (by Frances Hodgson Burnett) is part of a larger site about the IRT project. vier, vijf, zes">
<META NAME="Subject" CONTENT="zeven">
<META NAME="Subject" CONTENT="acht">
<META NAME="Subject" CONTENT="negen">
<META NAME="Title" CONTENT="tien hoofdstukken uit The Secret Garden">
<META NAME="TitleSubtitle" content="elf">
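To give a rough idea of what "indexing an element" involves, the following Python sketch pulls the title and the META fields out of a fragment shaped like the test document. The class name `ElementExtractor` and the shortened `SAMPLE` fragment are assumptions made for illustration; this is not the parsing method of any of the studied engines.

```python
from html.parser import HTMLParser


class ElementExtractor(HTMLParser):
    """Collect the TITLE text and the META name/content pairs of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = []           # (name, content) pairs in document order
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            fields = dict(attrs)
            if "name" in fields and "content" in fields:
                self.meta.append((fields["name"].lower(), fields["content"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Shortened, hypothetical fragment modelled on the test document.
SAMPLE = """<HTML><HEAD>
<TITLE>Test pagina</TITLE>
<META NAME="keywords" CONTENT="een, twee, drie">
<META NAME="Subject" CONTENT="zeven">
</HEAD></HTML>"""

extractor = ElementExtractor()
extractor.feed(SAMPLE)
extractor.close()

print(extractor.title)   # Test pagina
print(extractor.meta)    # [('keywords', 'een, twee, drie'), ('subject', 'zeven')]
```

Each element carries its own unique search term (een, twee, ..., zeven), so a query for one term shows whether that particular element was indexed.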
12. Number of the studied document elements that were indexed
13. Internet search engines: reachability
- 14 528 queries sent to 13 search engines
- 721 times unreachable
- The percentage of unreachability varies from nearly 0% to nearly 15%.
- The studied search engines were reachable for 95% of the queries.
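The overall reachability figure follows directly from the two counts on this slide. A short Python check (the variable names are illustrative):

```python
queries_sent = 14528   # total number of queries, from the slide
unreachable = 721      # queries for which the engine could not be reached

unreachable_pct = 100 * unreachable / queries_sent
reachable_pct = 100 - unreachable_pct

print(f"unreachable: {unreachable_pct:.1f}%")   # unreachable: 5.0%
print(f"reachable:   {reachable_pct:.1f}%")     # reachable:   95.0%
```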
14. Search engine indexing functionality: conclusions
- Not all of the web is indexed.
- Not all of our test documents are indexed.
- Not all HTML elements of our test document are indexed.
- Some of the studied search engines showed changes in the indexing policy.
- No relation between the number of indexed test documents or HTML elements and the size of a search engine was found during our study.
15. Internet search engines: fluctuations - definition
- A fluctuation appears when the result set of an observation, i.e.
  - one query or
  - set of queries
- misses documents with respect to a frame of reference, i.e.
  - other observations and
  - knowledge about Web reality
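In set terms, this definition amounts to a set difference between what the frame of reference says should be retrieved and what one observation actually returns. A minimal Python sketch, with a hypothetical function name and made-up URLs:

```python
def missed_documents(result_set, frame_of_reference):
    """Documents the frame of reference expects but the result set lacks."""
    return set(frame_of_reference) - set(result_set)


# Hypothetical test-document URLs, for illustration only.
expected = {
    "http://site-a.example/test.html",
    "http://site-b.example/test.html",
    "http://site-c.example/test.html",
}
retrieved = {
    "http://site-a.example/test.html",
    "http://site-c.example/test.html",
}

missed = missed_documents(retrieved, expected)
has_fluctuation = bool(missed)

print(missed)            # {'http://site-b.example/test.html'}
print(has_fluctuation)   # True
```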
16. Internet search engines: detecting fluctuations
- Through time: comparing result sets of one observation, repeatedly performed
  - Observation: one query or set of queries
  - Frame of reference: other observations + web-knowledge
- One moment: consistency of result sets
  - Observation: one query in set of queries
  - Frame of reference: other observations
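Both detection modes can be sketched with one helper that compares each observation against all the others. The sketch below is written under stated assumptions: the function name, the dates, the query labels and the document identifiers are hypothetical, and every observation in a set is assumed to be expected to retrieve the same documents.

```python
def missed_vs_others(observations, known_to_exist=None):
    """For each observation, list documents that other observations retrieved
    but this one did not. If known_to_exist is given (web-knowledge), only
    documents in that set count as expected."""
    missed = {}
    for key, result in observations.items():
        others = set()
        for other_key, other_result in observations.items():
            if other_key != key:
                others |= other_result
        if known_to_exist is not None:
            others &= known_to_exist
        missed[key] = others - result
    return missed


# Through time: the same query repeated on different (hypothetical) dates.
weekly = {
    "1999-01-04": {"doc1", "doc2"},
    "1999-01-11": {"doc1"},
    "1999-01-18": {"doc1", "doc2"},
}
still_online = {"doc1", "doc2"}          # web-knowledge: both documents exist
through_time = missed_vs_others(weekly, still_online)

# One moment: different queries that should all retrieve the same documents.
same_moment = {
    "title term": {"doc1", "doc2"},
    "meta keyword term": {"doc1"},
}
one_moment = missed_vs_others(same_moment)

print(through_time["1999-01-11"])        # {'doc2'}
print(one_moment["meta keyword term"])   # {'doc2'}
```

Passing the web-knowledge set only in the through-time case mirrors the slide, where the one-moment frame of reference consists of the other observations alone.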
17. Internet search engines: types of fluctuations
- Through time: comparing result sets of one observation, repeatedly performed
  - Document fluctuations
  - Indexing fluctuations
- One moment: consistency of result sets
  - Element fluctuations
19. Document fluctuations: example 1
20. Document fluctuations: example 2
21. Document fluctuations: experimental results
23. Indexing fluctuations: experimental results
25. Element fluctuations: example
26. Element fluctuations: experimental results
27. Percentage of documents missed due to fluctuations
28. Internet search engines: fluctuations - quantitative conclusions
- Many element fluctuations -> many document and indexing fluctuations, and many document elements indexed
- Many document fluctuations -> not always many element fluctuations
- Few document elements indexed -> few element fluctuations
29. Fluctuations: remarks on correctness
- Fluctuations can be seen as "correct" if they are reflections of alterations in:
  - (web-) reality: then document, indexing and element fluctuations are incorrect
  - the indexed database of a search engine: then only element fluctuations are incorrect
- Users do not care about this distinction: either way, they miss documents
30. Fluctuations: remarks on size
- No relation between document / element fluctuations and size
- The percentage of missed documents determines (with other reducing effects, such as depth of indexing) the effective size of an engine
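The slides give no formula for effective size. As a rough sketch only, one could assume a simple proportional discount: the nominal size reduced by the percentage of missed documents, ignoring the other reducing effects. The function name and all numbers below are hypothetical.

```python
def effective_size(nominal_size, missed_pct):
    """Assumed model: nominal size discounted by the percentage of documents
    missed due to fluctuations. Other reducing effects are ignored."""
    if not 0 <= missed_pct <= 100:
        raise ValueError("missed_pct must be between 0 and 100")
    return nominal_size * (100 - missed_pct) / 100


# Hypothetical engines: A is nominally larger but misses more documents.
engine_a = effective_size(100_000_000, 20)
engine_b = effective_size(90_000_000, 5)

print(engine_a)              # 80000000.0
print(engine_b)              # 85500000.0
print(engine_b > engine_a)   # True
```

Under this assumed model, an engine with a smaller database can offer more retrievable documents than a nominally larger one.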
31. Internet search engines: conclusions of our research
- Search engines differ in depth of indexing.
- Search engines show fluctuations in their result sets:
  - They are subject to changes in indexing policy (indexing fluctuations).
  - They forget documents completely (document fluctuations).
  - They miss documents in their result sets (element fluctuations).
32. Internet search engines: recommendations related to fluctuations
- Fluctuations are normal: do not be surprised, do not worry.
- Do not try to find a simple explanation to fully understand what happens.
- Known-item searchers should repeat the search:
  - when using an engine with many element fluctuations: use other search terms
  - when using an engine with many document fluctuations: repeat later.
- Further research on effective size.
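The advice for known-item searchers can be sketched as a small retry loop: try alternative search terms first, then repeat the whole set later. Everything named here is hypothetical: `search` stands for any callable that takes a query string and returns URLs, and `fake_search` merely simulates an engine with both kinds of fluctuation.

```python
def known_item_search(search, target_url, term_variants, rounds=3, wait=None):
    """Return (round number, terms) of the first query that retrieves
    target_url, or None if it never shows up.

    search: hypothetical callable, query string -> iterable of URLs.
    wait:   optional callable run between rounds (e.g. sleep for a day).
    """
    for round_no in range(1, rounds + 1):
        for terms in term_variants:          # other search terms
            if target_url in set(search(terms)):
                return round_no, terms
        if wait is not None and round_no < rounds:
            wait()                           # repeat later
    return None


TARGET = "http://site-a.example/test.html"   # made-up URL
calls = {"count": 0}


def fake_search(terms):
    """Simulated engine: the document is missing during the first two calls
    (document fluctuation) and afterwards is found only via the term 'zeven'
    (element fluctuation)."""
    calls["count"] += 1
    if calls["count"] <= 2:
        return []
    return [TARGET] if terms == "zeven" else []


found = known_item_search(fake_search, TARGET, ["een", "zeven"])
print(found)   # (2, 'zeven')
```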
33. Element and indexing fluctuations: example