Title: Organisation of documents on the Web
1 Organisation of documents on the Web
2 Introduction
- The prevailing carrier of information on the
  web is the document itself, not its bibliographic
  substitute.
- The Internet, and especially the web (WWW), enables us to
  access and use documents or parts of documents of
  various data types.
- The greatest problem for information tools on the web
  is its size (at least tens of billions of
  documents) and its dynamic nature: documents are
  being created, deleted and changed all the time.
3 Introduction
- The term "document" has become very vague. It can be
  - text, or text with multimedia inclusions (article,
    monograph, homepage),
  - an independent multimedia file (video clip, sound),
  - a list of hypertext pointers (web directory,
    results of a search with a web search engine).
- Text with multimedia inclusions consists of
  at least two files. Each multimedia inclusion is
  referenced from a primary (usually textual) file.
- The files can be transported from very different
  locations; they are united into a document on the
  user's screen.
4 Introduction
- Organising access to documents on the web normally
  doesn't mean collecting them in one place in a
  database.
- Usually it means collecting
  - pointers to documents, and
  - descriptions of documents (metadata).
- Pointers to documents are collected in an
  information tool; the documents stay on the original
  servers, where their authors put them.
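
As a rough illustration of this idea, an information tool can be pictured as a list of records, each holding a pointer (URL) and some metadata, while the document itself stays on its original server. The `Record` class, its field names, the `find` helper and the example URLs are assumptions made for this sketch; they do not describe any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One catalogue entry: a pointer plus a description, not the document."""
    url: str                      # pointer to the document on its original server
    title: str                    # descriptive metadata
    keywords: list[str] = field(default_factory=list)


# Hypothetical entries; the URLs are placeholders.
catalogue = [
    Record("http://example.org/report.html", "Annual research report",
           ["research", "report"]),
    Record("http://example.com/clip.mpg", "Lecture video clip",
           ["video", "lecture"]),
]


def find(records: list[Record], keyword: str) -> list[str]:
    """Return the pointers of all records described by the given keyword."""
    return [r.url for r in records if keyword in r.keywords]


print(find(catalogue, "video"))
```

The search touches only the metadata; the user follows the returned pointer to fetch the document from wherever its author put it.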
5 Introduction
- The two most important information tools for
  organising documents on the web are
  - web directories, and
  - web search engines.
- Web directories are the oldest web information
  tools, born almost at the same time as the web.
- At the beginning there was a simple "What's
  new" list, and authors reported the existence of their
  documents to the editorial staff.
- Very soon this chronological principle of
  organisation became impossible to follow.
6 Web directories
- In today's big directories (e.g. Yahoo), all
  important phases of construction are done
  automatically:
  - collecting data about documents, and
  - classification of documents.
- Big directories collect data on documents from
  all domains; entertainment is the prevailing
  subject.
- Smaller directories are either
  - not limited by domain, but have a stricter
    collection policy, or
  - compiled by and for domain specialists.
7 Web directories
- Examples will be given during practical work.
8 Web databases and search engines
- The big search engines collect metadata
  and pointers to over a billion documents.
- The biggest, Google, claims that it has
  3,083,324,652 web pages (summer 2003). The real
  number is approx. ? smaller.
- The biggest and best are
  - Google (http://www.google.com),
  - AltaVista (http://www.altavista.com),
  - Teoma (http://www.teoma.com),
  - AllTheWeb (http://www.alltheweb.com).
9 Web databases and search engines
- Well done:
  - collecting of data with autonomous software
    agents (robots, spiders, crawlers, harvesters),
  - automatic indexing of documents,
  - computing of relevance.
10 Database construction with robots
[Diagram. Labels: robot, database, reading of document, collecting the data on subject; documents A, B, C and D; pointers to B, C and X.]
11 Database construction with robots
- A robot
  1. checks document x,
  2. saves all pointers to other documents on a
     temporary list,
  3. indexes document x if it is not indexed yet, or has
     changed since the last visit,
  4. downloads the next document from the temporary list and
     repeats steps 1-3.
- Many robots work for the same database.
- Because of the exponential growth of the web, it can
  never be entirely indexed.
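
The robot's loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular search engine works: the "web" is a small in-memory dictionary with invented documents, `crawl` and `WEB` are hypothetical names, and a `seen` set (not mentioned on the slide) keeps the robot from queueing the same pointer twice.

```python
from collections import deque

# A tiny in-memory "web": document name -> (text, pointers to other documents).
# The names and contents are invented for this sketch.
WEB = {
    "A": ("intro to web directories", ["B", "C"]),
    "B": ("search engines and robots", ["C"]),
    "C": ("digital libraries", ["B", "D"]),
    "D": ("theses and dissertations", ["X"]),  # X does not exist (dead pointer)
}


def crawl(start, web, index=None):
    """Visit documents reachable from `start` and index new or changed ones."""
    index = {} if index is None else index
    to_visit = deque([start])      # the temporary list of pointers
    seen = {start}                 # assumption: never queue a pointer twice
    while to_visit:
        name = to_visit.popleft()  # take the next document from the list
        if name not in web:        # dead pointer: nothing to download
            continue
        text, pointers = web[name]           # check the document
        for target in pointers:              # save its pointers
            if target not in seen:
                seen.add(target)
                to_visit.append(target)
        if index.get(name) != text:          # index if new or changed
            index[name] = text
    return index


index = crawl("A", WEB)
print(sorted(index))
```

Passing a previously built `index` back into `crawl` models a repeat visit: only documents whose text has changed are written again.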
12 Database construction with robots
- Besides the frequencies of stems, search engines use
  some additional information to compute the relevance
  score of documents. Higher weights are given to
  - stems from the title,
  - stems from hypertext anchors,
  - stems from the top of the page,
  - stems in bold or italic type.
- An especially effective additional factor in
  relevance computation is PageRank (Google).
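
A toy version of such weighting might look as follows. The weight values, the field names and the two example documents are assumptions made for illustration; the slides say only that stems in these positions count for more, not by how much, and real engines combine many more signals.

```python
# Hypothetical weights: only their ordering (special fields > body) reflects
# the text; the numbers themselves are invented.
WEIGHTS = {"title": 3.0, "anchor": 2.5, "top": 1.5, "emphasis": 1.5, "body": 1.0}


def relevance(query_stems, doc_fields):
    """Sum weighted frequencies of query stems over the fields of a document.

    `doc_fields` maps a field name (a key of WEIGHTS) to its list of stems.
    """
    score = 0.0
    for field_name, stems in doc_fields.items():
        weight = WEIGHTS.get(field_name, 1.0)
        for stem in query_stems:
            score += weight * stems.count(stem)
    return score


doc_1 = {"title": ["digit", "librari"], "body": ["librari", "collect", "document"]}
doc_2 = {"title": ["web", "directori"], "body": ["digit", "librari", "librari"]}

query = ["digit", "librari"]
print(relevance(query, doc_1), relevance(query, doc_2))
```

Both documents contain the query stems three times in total, but `doc_1` scores higher because two of its occurrences are in the title.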
13 Web databases and search engines
- PageRank
  - If an author puts a pointer to another document in
    his/her document, that usually means that he/she
    thinks it is of some value.
  - Documents with many pointers (citations) to them
    get a higher PageRank.
  - The PageRank of a document is even higher if the
    citing documents have a high PageRank themselves.
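
Both properties can be seen in a simplified iterative computation. The sketch below is an approximation for teaching purposes, not Google's implementation: the damping factor of 0.85, the fixed number of iterations, the handling of documents without outgoing pointers and the tiny example graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly passing rank along pointers.

    `links` maps each document to the list of documents it points to.
    """
    docs = set(links) | {t for targets in links.values() for t in targets}
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for doc in docs:
            targets = links.get(doc, [])
            if targets:
                # A document splits its rank evenly among the documents it cites.
                share = damping * rank[doc] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a document with no pointers spreads its rank evenly.
                for d in docs:
                    new_rank[d] += damping * rank[doc] / n
        rank = new_rank
    return rank


# Invented graph: "hub" is cited three times; "x" and "y" are each cited once,
# "x" by the well-cited hub and "y" by the uncited page p4.
links = {"p1": ["hub"], "p2": ["hub"], "p3": ["hub"], "hub": ["x"], "p4": ["y"]}
rank = pagerank(links)
print(rank["hub"] > rank["p1"], rank["x"] > rank["y"])
```

Here `hub` outranks the pages that cite it because it has many citations, and `x` outranks `y` although each has exactly one citation, because the document citing `x` has a high rank itself.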
14 Web databases and search engines
- Not so well done: search interfaces.
  - The search interface encourages the user to use short
    queries (1-3 words).
  - The search interface encourages the user to use Boolean
    operators.
  - Both are inappropriate for the non-Boolean search
    model, but need less processing power.
  - Remember: the non-Boolean search model behaves best
    with long queries composed of many words and
    their synonyms.
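
The difference between the two models can be shown on a toy collection. This is a deliberately crude sketch: the documents, the query and the scoring rule (count of matching query words) are invented for illustration, whereas real non-Boolean engines use weighted term statistics.

```python
def boolean_and(query, docs):
    """Boolean model: a document matches only if it contains every query word."""
    return [name for name, words in docs.items()
            if all(w in words for w in query)]


def best_match(query, docs):
    """Non-Boolean model: rank documents by how many query words they contain."""
    scores = {name: sum(1 for w in set(query) if w in words)
              for name, words in docs.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in ranked if score > 0]


DOCS = {
    "d1": {"heart", "attack", "treatment", "hospital"},
    "d2": {"myocardial", "infarction", "therapy"},
    "d3": {"football", "results"},
}

short_query = ["heart", "attack"]
long_query = ["heart", "attack", "myocardial", "infarction", "treatment", "therapy"]

print(best_match(short_query, DOCS))
print(boolean_and(long_query, DOCS))
print(best_match(long_query, DOCS))
```

The short query finds only `d1`. The long query with synonyms returns nothing under Boolean AND, because no document contains every word, while the best-match ranking retrieves both relevant documents and leaves out `d3`.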
15 Web databases and search engines
- Examples will be given during practical work.
16 Usefulness of directories and search engines
- Web directories
  - Advantages
    - Pointers are ordered by some criteria, e.g.
      subject categories, which makes searching easier.
    - They mostly contain non-trivial documents, with
      fewer duplicates.
  - Disadvantages
    - Relatively small number of documents.
    - Directory creators' and users' understanding of
      categories can differ, which makes browsing
      difficult.
17 Usefulness of directories and search engines
- General web directories
  - Useful for finding stable collective sources of
    information: e-journals, homepages of research
    groups or institutions. These sources should be
    followed later by other means. Make bookmarks!
  - Less useful for finding individual documents.
- Specialised web directories
  - Useful for an initial overview of a field.
  - Useful for finding reference literature,
    standards, protocols.
  - Sometimes useful for finding individual documents.
18 Usefulness of directories and search engines
- General search engines
  - Simple, well-defined queries: name of a person,
    name of a medical appliance or method, name of an
    e-journal.
  - Finding a particular article: compose the query
    with the most informative part of the title in
    quotation marks.
  - A good property of search engines: descriptions of
    documents enter the database quickly (a matter of days,
    weeks at most).
  - A high degree of caution is obligatory regarding the
    quality of documents!
19 Digital libraries
- A collection of e-documents and the institution that
  collects them.
- Documents are used across the network without limits
  of time or place.
- The collection is usually limited by domain or
  geography, e.g. the output of a particular academic
  institution, or the same types of documents across
  institutions in a region (e.g. research reports).
- Not normally limited regarding data types.
- The Internet or the web is not a digital library.
20 Digital libraries
- Often contain documents with less strict
  protection of authors' rights:
  - research reports funded by public money,
  - preprints,
  - master's theses and doctoral dissertations,
- or documents with limited access:
  - artefacts of cultural heritage,
  - objects in museums and galleries.
- Behind a d-library there is usually an institution with a
  good reputation, so we can trust the documents.
21 Historically important digital libraries
- NCSTRL (http://www.ncstrl.org/)
  - Networked Computer Science Technical Report
    Library, started in 1995.
  - D-library of technical and research reports.
  - Started at 40 US universities with strong
    computer science departments.
  - Today it has an international membership.
  - The developers of NCSTRL are still at the forefront of
    research on public access to knowledge.
22 Historically important digital libraries
- NCSTRL, moral of the story
  - Let the documents stay on the authors'
    institutional servers. They are more interested in the
    preservation of their documents than
    anybody else.
  - Build a common user interface that hides the
    differing ways in which documents are organised
    locally.
  - Each cooperating institution should do what it is
    capable of in the common, orchestrated effort.
  - Documents in the d-library should exist in
    various standard forms: HTML, PDF, text, screen
    image.
23 Historically important digital libraries
- NDLTD (http://www.ndltd.org)
  - Networked Digital Library of Theses and
    Dissertations, started in 1996.
  - Master's theses and doctoral dissertations (ETDs).
  - At the beginning some US universities; today a very
    international membership.
- Moral of the story
  - Develop web interfaces for uploading files
    into the database, and interfaces for entering metadata.
    Let authors do as much of the work as possible. They are
    more motivated to succeed than anybody else.