Organisation of documents on Web - PowerPoint PPT Presentation

Slides: 24
Provided by: jure8

1
Organisation of documents on Web
2
Introduction
  • The prevailing carrier of information on the web
    is the document itself, not its bibliographic
    substitute.
  • The Internet, and especially the web (WWW), enables
    us to access and use documents or parts of
    documents of various data types.
  • The greatest problem for information tools on the
    web is its size (at least tens of billions of
    documents) and its dynamic nature: documents are
    being created, deleted and changed all the time.

3
Introduction
  • The term "document" has become very vague. It can be:
      • text, or text with multimedia inclusions
        (article, monograph, homepage),
      • an independent multimedia file (video clip,
        sound),
      • a list of hypertext pointers (web directory,
        results of a search with a web search engine).
  • A text with multimedia inclusions consists of
    at least two files. Each multimedia inclusion is
    referenced from a primary (usually textual) file.
  • The files can come from very different
    locations; they are united into a document on the
    user's screen.

4
Introduction
  • Organising access to documents on the web normally
    does not mean collecting them in one place in a
    database.
  • Usually it means collecting:
      • pointers to documents, and
      • descriptions of documents (metadata).
  • Pointers to documents are collected in an
    information tool; the documents stay on the
    original servers, where their authors put them.

5
Introduction
  • The two most important information tools for
    organising documents on the web are:
      • web directories, and
      • web search engines.
  • Web directories are the oldest web information
    tools, born almost at the same time as the web.
  • At the beginning there was a simple "What's New"
    list, and authors reported the existence of their
    documents to the editorial staff.
  • Very soon this chronological principle of
    organisation became impossible to follow.

6
Web directories
  • In today's big directories (e.g. Yahoo), all
    important phases of construction are done
    automatically:
      • collecting data about documents, and
      • classifying documents.
  • Big directories collect data on documents from
    all domains; entertainment is the prevailing
    subject.
  • Smaller directories are either:
      • not limited by domain, but with a stricter
        collection policy, or
      • compiled by and for domain specialists.

7
Web directories
  • Examples will be given during practical work.

8
Web databases and search engines
  • The big search engines collect metadata
    and pointers to over a billion documents.
  • The biggest, Google, claims to hold
    3,083,324,652 web pages (summer 2003). The real
    number is approx. ? smaller.
  • The biggest and best are:
      • Google (http://www.google.com),
      • AltaVista (http://www.altavista.com),
      • Teoma (http://www.teoma.com),
      • AllTheWeb (http://www.alltheweb.com).

9
Web databases and search engines
  • Well done:
      • collecting data with autonomous software
        agents (robots, spiders, crawlers, harvesters),
      • automatic indexing of documents,
      • computing relevance.

10
Database construction with robots
[Diagram: a robot reads documents (document A, doc. B, doc. C, doc. D), follows the pointers found in them (pointer to B, pointer to C, pointer to X), and collects the data on their subject into the database.]
11
Database construction with robots
  • A robot:
      1. checks document x,
      2. saves all pointers to other documents on a
         temporary list,
      3. indexes document x if it is not indexed yet or
         has changed since the last visit,
      4. downloads the next document from the temporary
         list and repeats steps 1 to 3.
  • Many robots work for the same database.
  • Because of the exponential growth of the web, it
    can never be indexed entirely.
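
The robot's loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (TOY_WEB) rather than real servers, the function name crawl is invented here, and "indexing" just stores the document text so that a later visit can detect a change.

```python
from collections import deque

# Hypothetical in-memory "web": name -> (text, pointers to other documents).
TOY_WEB = {
    "A": ("about search engines", ["B", "C"]),
    "B": ("about web directories", ["C"]),
    "C": ("about digital libraries", ["B", "D"]),
    "D": ("about robots", ["X"]),  # X does not exist (dead pointer)
}

def crawl(start, web, index=None):
    """Follow pointers from `start`; return (index, visited, indexed).

    `index` maps a document name to the text stored at the last visit.
    Pass the index of an earlier run to re-index only changed documents.
    """
    index = {} if index is None else index
    temporary_list = deque([start])   # pointers waiting to be visited
    seen = {start}                    # pointers already put on the list
    visited, indexed = [], []
    while temporary_list:
        name = temporary_list.popleft()
        page = web.get(name)
        if page is None:              # dead pointer: nothing to read
            continue
        text, pointers = page         # step 1: check the document
        visited.append(name)
        for target in pointers:       # step 2: save its pointers
            if target not in seen:
                seen.add(target)
                temporary_list.append(target)
        if index.get(name) != text:   # step 3: index if new or changed
            index[name] = text
            indexed.append(name)
    return index, visited, indexed    # step 4 is the loop itself

index, visited, indexed = crawl("A", TOY_WEB)
print(visited)
```

Running crawl again with the same index re-indexes nothing, because no stored text differs from the current one; a real robot would compare dates or checksums instead of whole texts.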

12
Database construction with robots
  • Besides the frequencies of stems, search engines
    use some additional information to compute the
    relevance score of documents. Higher weights go to:
      • stems from the title,
      • stems from hypertext anchors,
      • stems from the top of the page,
      • stems in bold or italic type.
  • An especially effective additional factor in
    relevance computation is PageRank (Google).
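
A minimal sketch of such field weighting follows. The weight values in FIELD_WEIGHTS, the field names and the function names are assumptions made for illustration; the slide says only that stems from these places count more. For brevity the sketch counts whole lowercase words and does no stemming.

```python
import re
from collections import Counter

# Hypothetical weights; real engines do not publish theirs.
FIELD_WEIGHTS = {
    "title": 3.0,
    "anchor": 2.5,
    "top": 2.0,
    "emphasis": 1.5,   # bold or italic text
    "body": 1.0,
}

def tokens(text):
    """Split text into lowercase words (no stemming in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())

def score(doc, query):
    """Sum the weighted frequencies of the query words over all fields."""
    wanted = set(tokens(query))
    total = 0.0
    for field, text in doc.items():
        counts = Counter(tokens(text))
        weight = FIELD_WEIGHTS.get(field, 1.0)
        total += weight * sum(counts[word] for word in wanted)
    return total

doc_a = {"title": "Web directories", "body": "A short note on libraries."}
doc_b = {"title": "Library news",
         "body": "Web directories are old. Directories list pointers."}

print(score(doc_a, "web directories"), score(doc_b, "web directories"))
```

Here doc_a outranks doc_b although doc_b contains more occurrences of the query words, because doc_a has them in the title.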

13
Web databases and search engines
  • PageRank:
      • If an author puts a pointer to another document
        in his/her document, that usually means he/she
        thinks it is of some value.
      • Documents with many pointers (citations) to them
        get a higher PageRank.
      • The PageRank of a document is even higher if the
        citing documents have a high PageRank themselves.
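
The idea can be sketched as a simple iterative computation. This is not Google's production method; it is a textbook-style sketch in which the damping factor of 0.85, the number of iterations, the even redistribution of rank from documents without outgoing pointers, and the example graph LINKS are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every document in `links`.

    `links` maps a document to the list of documents it points to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A document passes its rank on to the documents it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # No outgoing pointers: spread the rank over all documents.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical graph: C is cited twice; D is cited once, by C;
# F is cited once, by the uncited document E.
LINKS = {"A": ["C"], "B": ["C"], "C": ["D"], "E": ["F"]}
rank = pagerank(LINKS)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this graph C ranks above F because it has more citations, and D ranks above F although both have exactly one citation, because D's citing document (C) has a high rank itself.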

14
Web databases and search engines
  • Not so well done: search interfaces.
      • The search interface encourages the user to use
        short queries (1 to 3 words).
      • The search interface encourages the user to use
        Boolean operators.
      • Both are inappropriate for the non-Boolean search
        model, but need less processing power.
  • Remember: the non-Boolean search model behaves best
    with long queries composed of many words and
    their synonyms.
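
The difference between the two models can be illustrated with a toy example. The three documents, the queries and the scoring rule (number of distinct query words found) are assumptions chosen for brevity; real engines use more elaborate weighting.

```python
def boolean_and(docs, query):
    """Boolean AND: a document matches only if it has every query word."""
    words = set(query.lower().split())
    return [name for name, text in docs.items()
            if words <= set(text.lower().split())]

def ranked(docs, query):
    """Non-Boolean search: order documents by the number of distinct
    query words they contain; documents with no match are left out."""
    words = set(query.lower().split())
    scores = {name: len(words & set(text.lower().split()))
              for name, text in docs.items()}
    hits = [name for name in scores if scores[name] > 0]
    return sorted(hits, key=lambda name: (-scores[name], name))

DOCS = {
    "d1": "heart attack treatment in adults",
    "d2": "myocardial infarction therapy guidelines",
    "d3": "heart rate monitors for runners",
}
SHORT_QUERY = "heart attack"
LONG_QUERY = "heart attack myocardial infarction treatment therapy"

print(ranked(DOCS, SHORT_QUERY))       # d2 is missed: it uses synonyms
print(ranked(DOCS, LONG_QUERY))        # d1 and d2 on top, d3 last
print(boolean_and(DOCS, LONG_QUERY))   # empty: no document has all words
```

The long query with synonyms finds both relevant documents under the non-Boolean model, while the same query combined with AND finds nothing.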

15
Web databases and search engines
  • Examples will be given during practical work.

16
Usefulness of directories and search engines
  • Web directories
  • Advantages:
      • Pointers are ordered by some criterion, e.g.
        subject categories, which makes searching easier.
      • They mostly contain non-trivial documents, with
        fewer duplicates.
  • Disadvantages:
      • A relatively small number of documents.
      • The directory creators' and the users'
        understanding of categories can differ, which
        makes browsing difficult.

17
Usefulness of directories and search engines
  • General web directories
      • Useful for finding stable collective sources of
        information: e-journals, homepages of research
        groups or institutions. These sources should be
        followed later by other means. Make bookmarks!
      • Less useful for finding individual documents.
  • Specialised web directories
      • Useful for an initial overview of a field.
      • Useful for finding reference literature,
        standards, protocols.
      • Sometimes useful for finding individual documents.

18
Usefulness of directories and search engines
  • General search engines
      • Simple, well-defined queries: name of a person,
        name of a medical appliance or method, name of
        an e-journal.
      • Finding a particular article: compose the query
        with the most informative part of the title in
        quotation marks.
      • A good property of search engines: descriptions
        of documents enter the database quickly (a matter
        of days, weeks at most).
      • Great caution is obligatory regarding the
        quality of documents!

19
Digital libraries
  • A collection of e-documents and the institution
    that collects them.
  • Documents are used across the network without
    limits of time or place.
  • The collection is usually limited by domain or
    geography, e.g. the output of a particular academic
    institution, or the same type of documents across
    institutions in a region (e.g. research reports).
  • Not normally limited regarding data types.
  • The Internet or the web is not a digital library.
20
Digital libraries
  • Often contain documents with less strict
    protection of authors' rights:
      • research reports funded by public money,
      • preprints,
      • master's theses and doctoral dissertations,
  • or documents with limited access:
      • artefacts of cultural heritage,
      • objects in museums and galleries.
  • Behind a d-library there is usually an institution
    with a good reputation, so we can trust the documents.

21
Historically important digital libraries
  • NCSTRL (http://www.ncstrl.org/)
      • Networked Computer Science Technical Reference
        Library, started in 1995.
      • D-library of technical and research reports.
      • Started at 40 US universities with strong
        computer science departments.
      • Today it has an international membership.
      • The developers of NCSTRL are still at the
        forefront of research on public access to
        knowledge.

22
Historically important digital libraries
  • NCSTRL, moral of the story:
      • Let the documents stay on the authors'
        institutional servers. They are more interested
        in the preservation of their documents than
        anybody else.
      • Build a common user interface that hides the
        differing ways in which documents are organised
        locally.
      • Each cooperating institution should do what it
        is capable of in the common orchestrated effort.
      • Documents in the d-library should exist in
        various standard forms: HTML, PDF, text, screen
        image.
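
The "common interface over differing local organisation" idea can be sketched as follows. This is not NCSTRL's actual software or protocol; the classes SiteA and SiteB, the Record type, the example.org addresses and the report titles are all hypothetical and serve only to show the design.

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    url: str         # pointer to the document on the institution's server
    formats: tuple   # available forms, e.g. ("HTML", "PDF")

class SiteA:
    """Hypothetical institution that keeps its reports as a list."""
    def __init__(self):
        self._reports = [
            {"name": "Caching in distributed systems",
             "href": "http://a.example.org/r1",
             "types": ["PDF", "text"]},
        ]

    def search(self, word):
        return [Record(r["name"], r["href"], tuple(r["types"]))
                for r in self._reports
                if word.lower() in r["name"].lower()]

class SiteB:
    """Hypothetical institution that keys its reports by report number."""
    def __init__(self):
        self._by_number = {
            "TR-7": ("Distributed file systems",
                     "http://b.example.org/TR-7.html"),
        }

    def search(self, word):
        return [Record(title, url, ("HTML",))
                for title, url in self._by_number.values()
                if word.lower() in title.lower()]

def search_all(sites, word):
    """The common interface: one query, one uniform list of results."""
    results = []
    for site in sites:
        results.extend(site.search(word))
    return sorted(results, key=lambda record: record.title)

for record in search_all([SiteA(), SiteB()], "distributed"):
    print(record.title, record.url, record.formats)
```

Each site answers the same search call from its own local storage, the documents themselves never leave the institutional servers, and the user sees only the uniform Record.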

23
Historically important digital libraries
  • NDLTD (http://www.ndltd.org)
      • Networked Digital Library of Theses and
        Dissertations, started in 1996.
      • Master's theses and doctoral dissertations (ETDs).
      • At the beginning some US universities; today a
        very international membership.
  • Moral of the story:
      • Develop web interfaces for uploading files
        into the database, and interfaces for entering
        metadata. Let authors do as much of the work as
        possible. They are more motivated to succeed
        than anybody.