An Introduction to Information Retrieval Systems

1
An Introduction to Information Retrieval Systems
  • Intelligent Systems
  • March 18, 2004
  • Ramashis Das

2
Definition
  • We discuss Automatic Information Retrieval.
  • Automatic as against manual.
  • Information as against data.
  • Definition: An information retrieval system does
    not inform (i.e. change the knowledge of) the
    user on the subject of his inquiry. It merely
    informs on the existence (or non-existence) and
    whereabouts of documents relating to his request.

3
IR Vs Data Retrieval
4
Classification
  • Monothetic classification is one with classes
    defined by objects possessing attributes both
    necessary and sufficient to belong to a class.
  • Polythetic classification is one where each
    individual in a class will possess only a
    proportion of all the attributes possessed by all
    the members of that class.
  • Hence no attribute is either necessary or
    sufficient for membership of a class.
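
The two kinds of classification can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the `min_share` threshold are hypothetical, since the slides do not say what proportion of attributes a polythetic class requires.

```python
def is_member_monothetic(obj_attrs, defining_attrs):
    """Monothetic: the object must possess every defining attribute."""
    return set(defining_attrs) <= set(obj_attrs)


def is_member_polythetic(obj_attrs, class_attrs, min_share=0.5):
    """Polythetic: the object needs only a proportion of the class attributes.

    min_share is an assumed threshold chosen for illustration.
    """
    class_attrs = set(class_attrs)
    shared = len(set(obj_attrs) & class_attrs)
    return shared / len(class_attrs) >= min_share


# An object with only half of the class attributes fails the monothetic
# test but can still pass the polythetic one.
print(is_member_monothetic({"a", "d"}, {"a", "b", "c", "d"}))
print(is_member_polythetic({"a", "d"}, {"a", "b", "c", "d"}))
```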

5
Experimental Vs Operational IR Systems
  • Many automatic information retrieval systems are
    experimental. Experimental IR is mainly carried
    on in a 'laboratory' situation.
  • The other kind are operational systems (or
    real-world IR systems): commercial systems
    which charge for the service they provide.

6
Why IR? A Simple E.g.
  • Suppose there is a store of documents and a
    person (user of the store) formulates a question
    (request or query) to which the answer is a set
    of documents satisfying the information need
    expressed by his question.
  • Solution: The user can read all the documents in
    the store, retain the relevant documents and
    discard all the others. This would be perfect
    retrieval, but it is NOT POSSIBLE in practice.
  • Alternative: Use a high-speed computer to read
    the entire document collection and extract the
    relevant documents.

7
Black Box Model
  • [Figure: block diagram with the labels Queries,
    Documents, INPUT, PROCESSOR, OUTPUT and FEEDBACK]

8
INPUT
  • The main problem here is to obtain a
    Representation of each Document and Query
    suitable for a computer to use.
  • Most computer-based retrieval systems store only
    a representation of the document (or query).
  • This implies that the actual text is lost; an
    artificial language is used instead.
  • The user needs to be taught to express his
    information need in that language.

9
Feedback and PROCESSOR
  • Feedback: changing the request on-line during a
    search session, in the light of a sample
    retrieval, in the hope of improving the
    subsequent retrieval run.
  • PROCESSOR: the retrieval process.
  • Structuring the information in an appropriate way.
  • Actual retrieval function: the search strategy
    executed in response to a query.

10
OUTPUT
  • Set of Citations or Document Numbers.
  • For Experimental Systems, proper Evaluation
    technique follows.

11
Historical Development
  • Three main areas of research:
  • Content analysis: describing the contents of
    documents in a form suitable for computer
    processing.
  • Information structures: exploiting
    relationships between documents to improve the
    efficiency and effectiveness of retrieval
    strategies.
  • Evaluation: the measurement of the effectiveness
    of retrieval.

12
Information Representation
  • Luhn's approach: frequency count of words in the
    document.
  • List of keywords or terms.
  • The frequency of occurrence of a keyword in the
    body of a document indicates its significance.
  • Statistical association between keywords:
    exploited by Maron and Kuhns, and by Stiles.
  • Sparck Jones: measures of association between
    keywords based on their frequency of
    co-occurrence.
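
As a rough illustration of association based on co-occurrence, the sketch below counts how often two keywords appear in the same document and turns that into a score. The Dice coefficient is an assumption chosen for illustration; the slides do not say which measure was used, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each keyword pair, the documents containing both."""
    pair_counts = Counter()
    for keywords in docs:
        for a, b in combinations(sorted(set(keywords)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def dice_association(docs, a, b):
    """Dice coefficient (an assumed measure): 2 * both / (docs with a + docs with b)."""
    doc_freq = Counter(k for keywords in docs for k in set(keywords))
    both = cooccurrence_counts(docs)[tuple(sorted((a, b)))]
    total = doc_freq[a] + doc_freq[b]
    return 2 * both / total if total else 0.0


docs = [["ir", "index"], ["ir", "index", "cluster"], ["cluster"]]
print(dice_association(docs, "ir", "index"))
print(dice_association(docs, "ir", "cluster"))
```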

13
Information Structure
  • Fairly recent, slow development: there is a
    reluctance to try out new organization
    techniques for faster and better retrieval.
  • Serial file organization
  • Inverted file (?)
  • Clustering: Good, Fairthorne, Doyle, Rocchio
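
To make the contrast between a serial file and an inverted file concrete, here is a minimal sketch. It assumes each document is stored as a set of keywords under a document number; the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def serial_search(docs, keyword):
    """Serial file: scan every document representative in turn."""
    return {doc_id for doc_id, keywords in docs.items() if keyword in keywords}


def build_inverted_file(docs):
    """Inverted file: map each keyword to the documents that contain it."""
    inverted = defaultdict(set)
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            inverted[keyword].add(doc_id)
    return inverted


docs = {1: {"retrieval", "index"}, 2: {"cluster"}, 3: {"index", "cluster"}}
inverted = build_inverted_file(docs)

# Both organizations return the same answer; the inverted file answers
# with one lookup instead of a scan over the whole collection.
print(serial_search(docs, "index"), inverted["index"])
```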

14
Evaluation of Retrieval Systems
  • Extremely Difficult
  • Dichotomous scale: relevant and non-relevant.
  • Precision - the ratio of the number of relevant
    documents retrieved to the total number of
    documents retrieved
  • Recall - ratio of the number of relevant
    documents retrieved to the total number of
    relevant documents (both retrieved and not
    retrieved).
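
The two ratios above translate directly into code. The sketch below assumes documents are identified by number and that relevance judgments are given as a set; returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / total retrieved
    recall    = relevant retrieved / total relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, two of them relevant, three relevant overall.
print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))
```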

15
Steps
  • Generation of Machine Representations for the
    Information.
  • Explanation of the Logical Structures that may be
    arrived at by Clustering.
  • Representing these Structures in the Computer, or
    in other words, choice of File Structures to
    Represent the Logical Structure.
  • Search Strategies.
  • Probabilistic Retrieval, i.e. to create a Formal
    Model for certain kinds of Search Strategies.
  • Ways of Evaluating the Effectiveness of Retrieval.

16
AUTOMATIC TEXT ANALYSIS
  • Storing Information
  • Original: in the form of documents.
  • A document representation is stored.
  • Emphasis is on the statistical rather than
    linguistic approaches.
  • We start with the original ideas of Luhn.

17
Luhn's Ideas
  • The frequency of word occurrence in an article
    furnishes a useful measurement of word
    significance.
  • The relative position within a sentence of words
    having given values of significance furnishes a
    useful measurement for determining the
    significance of sentences.

18
Demonstration
  • f: frequency of occurrence of words
  • r: rank order
  • Zipf's Law: the product of the frequency of use
    of a word and its rank order is approximately
    constant.
  • Luhn used the above law to define two cut-offs,
    an upper and a lower one.
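
A small sketch of the rank-frequency product and of frequency cut-offs follows. The function names and the cut-off values are assumptions made for illustration; the slides do not give concrete thresholds, and a toy word list this small only hints at the regularity Zipf observed on large texts.

```python
from collections import Counter


def rank_frequency(words):
    """Return (rank, word, f, rank * f) rows, most frequent word first."""
    ranked = Counter(words).most_common()
    return [(r, word, f, r * f) for r, (word, f) in enumerate(ranked, start=1)]


def significant_words(words, lower, upper):
    """Keep words whose frequency lies between the two cut-offs.

    lower and upper are illustrative thresholds, not values from Luhn.
    """
    counts = Counter(words)
    return {w for w, f in counts.items() if lower <= f <= upper}


words = "the the the the cat cat sat".split()
for row in rank_frequency(words):
    print(row)
print(significant_words(words, lower=2, upper=3))
```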

20
Generating Document Representatives - conflation
  • Text processing system
  • Input text: full text, abstract or title.
  • Output: a document representative adequate for
    use in an automatic retrieval system.
  • The document representative consists of a list of
    class names, each name representing a class of
    words occurring in the total input text.
  • A document will be indexed by a name if one of
    its significant words occurs as a member of that
    class.

21
Text Processing System
  • Such a system will consist of three parts:
  • Removal of high-frequency words
  • Suffix stripping
  • Detecting equivalent stems
  • Removal of high-frequency words:
  • One way of implementing Luhn's upper cut-off.
  • Maintain a stop list; compare each word against
    it and remove the matches.
  • Document size reduces by 30 to 50 per cent.
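
Stop-list removal can be sketched as a simple membership test. The stop list below is a tiny hypothetical one for illustration, and splitting on whitespace is an assumed, simplistic tokenization.

```python
# Hypothetical, deliberately tiny stop list; real lists hold hundreds of words.
STOP_LIST = {"the", "of", "and", "a", "to", "in", "is"}


def remove_stop_words(text):
    """Lower-case the text, split on whitespace, drop words on the stop list."""
    return [word for word in text.lower().split() if word not in STOP_LIST]


print(remove_stop_words("The evaluation of retrieval is difficult"))
```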

22
Text Processing System
  • Suffix stripping is more involved.
  • Keep a complete list of suffixes; match and
    remove the longest possible one.
  • Context-free removal leads to errors, e.g.
    removing UAL from both FACTUAL and EQUAL.
  • Solution: have some context rules.
  • Equivalent stems:
  • Most map to the same morphological form on
    removal of suffixes.
  • Other kinds do not match on mere removal of
    suffixes (ABSORB- and ABSORPT-).
  • For these, a list of equivalent stem endings is
    maintained (e.g. B and PT are equivalent stem
    endings).
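
The steps above can be sketched as follows. Everything concrete here is an assumption for illustration: the suffix list is a short hypothetical one, the minimum stem length stands in for the unspecified context rules, and only the single equivalence from the slide (PT and B) is encoded.

```python
# Hypothetical short suffix list; a real system would use a complete one.
SUFFIXES = ["ational", "ation", "ual", "ing", "ion", "ed", "s"]
MIN_STEM_LEN = 3                    # assumed context rule: keep at least 3 letters
EQUIVALENT_ENDINGS = {"pt": "b"}    # ABSORPT- is treated like ABSORB-


def strip_suffix(word):
    """Remove the longest listed suffix that leaves a long enough stem."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LEN:
            return stem
    return word


def canonical_stem(word):
    """Strip the suffix, then map equivalent stem endings to one form."""
    stem = strip_suffix(word)
    for ending, replacement in EQUIVALENT_ENDINGS.items():
        if stem.endswith(ending):
            return stem[: -len(ending)] + replacement
    return stem


print(strip_suffix("FACTUAL"), strip_suffix("EQUAL"))
print(canonical_stem("absorbing"), canonical_stem("absorption"))
```

The length rule is what stops UAL being removed from EQUAL. The ending map is applied blindly here, so a fuller version would have to restrict which stems it applies to.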

23
Text Processing System
  • The final output from a conflation algorithm is a
    set of classes, one for each stem detected.
  • A class name is assigned to a document if and
    only if one of its members occurs as a
    significant word in the text of the document.
  • A document representative then becomes a list of
    class names. These are often referred to as the
    document's index terms or keywords.
  • Queries: these are handled in the same way.
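
Assuming the conflation classes have already been built, assigning class names to a document (or a query) reduces to a set intersection. The classes, their names and the function name below are hypothetical examples.

```python
# Hypothetical conflation classes: class name -> member words.
CLASSES = {
    "absorb": {"absorb", "absorbing", "absorption"},
    "retriev": {"retrieval", "retrieve", "retrieved"},
    "index": {"index", "indexing", "indexed"},
}


def representative(significant_words):
    """Assign a class name iff one of its members occurs among the words."""
    words = {w.lower() for w in significant_words}
    return sorted(name for name, members in CLASSES.items() if members & words)


doc_terms = representative(["Indexing", "retrieval", "speed"])
query_terms = representative(["retrieve"])   # queries go through the same step

# The document and the query now share the class name "retriev".
print(doc_terms, query_terms, set(doc_terms) & set(query_terms))
```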

24
Indexing
  • The index language is the language used to
    describe documents and requests.
  • The elements of the index language are index
    terms, which may be derived from the text of the
    document to be described, or may be arrived at
    independently.

25
Some distinctions
  • Index languages can be described as:
  • Pre-coordinate: terms are coordinated at the
    time of indexing.
  • Post-coordinate: terms are coordinated at the
    time of searching.
  • Vocabulary of the index language:
  • Controlled: a list of approved index terms that
    an indexer may use. One may add other kinds of
    syntactic controls (e.g. certain terms used only
    as adjectives).
  • Uncontrolled.