An Introduction to Information Retrieval Systems

1
An Introduction to Information Retrieval Systems

Intelligent Systems
March 18, 2004
Ramashis Das

2
Definition

We discuss automatic information retrieval:
Automatic, as against manual.
Information, as against data.
Definition: An information retrieval system does not
inform (i.e. change the knowledge of) the user on
the subject of his inquiry. It merely informs on
the existence (or non-existence) and whereabouts
of documents relating to his request.

3
IR Vs Data Retrieval

4
Classification

Monothetic classification: classes are defined by
attributes that are both necessary and sufficient
for an object to belong to a class.
Polythetic classification: each individual in a
class possesses only a proportion of all the
attributes possessed by all the members of that
class.
Hence no single attribute is either necessary or
sufficient for membership of a class.
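
The difference between the two kinds of classification can be sketched in a few lines of Python. This is only an illustration: the attribute names, the function names and the 0.5 threshold (`min_share`) are assumptions made for the example and do not come from the slides.

```python
def is_monothetic_member(attributes, defining_attributes):
    """Monothetic: membership requires every defining attribute."""
    return set(defining_attributes) <= set(attributes)


def is_polythetic_member(attributes, class_attributes, min_share=0.5):
    """Polythetic: membership requires a large enough share of the class attributes.

    min_share is an assumed threshold; the slides do not give one.
    """
    class_attributes = set(class_attributes)
    if not class_attributes:
        return False
    shared = len(set(attributes) & class_attributes)
    return shared / len(class_attributes) >= min_share


# Hypothetical class described by four attributes.
CLASS_ATTRIBUTES = {"a", "b", "c", "d"}
first = {"a", "b"}
second = {"c", "d"}

print(is_monothetic_member(first, CLASS_ATTRIBUTES))
print(is_polythetic_member(first, CLASS_ATTRIBUTES))
print(is_polythetic_member(second, CLASS_ATTRIBUTES))
print(first & second)  # attributes the two objects have in common
```

In this sketch both objects pass the polythetic test although they have no attribute in common, which is the sense in which no single attribute is necessary for membership.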

5
Experimental Vs Operational IR Systems

Many automatic information retrieval systems are
experimental. Experimental IR is mainly carried
out in a 'laboratory' situation.
The other kind are operational systems (or
real-world IR systems): commercial systems
which charge for the service they provide.

6
Why IR? A Simple Example

Suppose there is a store of documents and a
person (user of the store) formulates a question
(request or query) to which the answer is a set
of documents satisfying the information need
expressed by his question.
Solution: The user can read all the documents in
the store, retain the relevant documents and
discard all the others. This is perfect
retrieval, but it is NOT POSSIBLE in practice!
Alternative: Use a high-speed computer to read the
entire document collection and extract the
relevant documents.

7
Black Box Model

[Figure: black box model. Labels: Queries and Documents (INPUT), PROCESSOR, OUTPUT, FEEDBACK.]

8
INPUT

The main problem here is to obtain a
representation of each document and query
suitable for a computer to use.
Most computer-based retrieval systems store only
a representation of the document (or query).
This implies that the actual text is lost and an
artificial language is used instead.
The user needs to be taught to express his
information need in that language.

9
Feedback and PROCESSOR

Feedback: the user changes the request on-line
during a search session, in the light of a sample
retrieval, hoping to improve the subsequent
retrieval run.
PROCESSOR: concerned with the retrieval process.
Structuring the information in an appropriate way.
Performing the actual retrieval function: executing
the search strategy in response to a query.

10
OUTPUT

A set of citations or document numbers.
For experimental systems, a proper evaluation
of the output follows.

11
Historical Development

Three main areas of research:
Content analysis: describing the contents of
documents in a form suitable for computer
processing.
Information structures: exploiting
relationships between documents to improve the
efficiency and effectiveness of retrieval
strategies.
Evaluation: the measurement of the effectiveness
of retrieval.

12
Information Representation

Luhn's approach: frequency counts of words in the
document.
This yields a list of keywords or terms.
The frequency of occurrence of a keyword in the
body of a document indicates its significance.
Statistical association between keywords:
exploited by Maron and Kuhns, and by Stiles.
Sparck Jones: measures of association between
keywords based on their frequency of
co-occurrence.
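
A minimal Python sketch of the two ideas above: counting word frequencies in a document, and measuring how strongly two keywords are associated through co-occurrence. The Dice coefficient used here is an assumption chosen for illustration; the slides do not say which association measure was used, and the function names and sample documents are hypothetical.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each word occurs in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def dice_association(term_a, term_b, documents):
    """Dice coefficient on co-occurrence: 2 * |A and B| / (|A| + |B|).

    A and B are the sets of documents containing each term.
    This particular measure is an assumption made for the example.
    """
    docs_a = {i for i, d in enumerate(documents) if term_a in term_frequencies(d)}
    docs_b = {i for i, d in enumerate(documents) if term_b in term_frequencies(d)}
    if not docs_a and not docs_b:
        return 0.0
    return 2 * len(docs_a & docs_b) / (len(docs_a) + len(docs_b))


# Hypothetical three-document collection.
DOCUMENTS = [
    "retrieval of documents by keyword",
    "keyword retrieval systems",
    "document clustering methods",
]

print(term_frequencies("the cat and the hat").most_common(1))
print(dice_association("retrieval", "keyword", DOCUMENTS))
print(dice_association("retrieval", "clustering", DOCUMENTS))
```

Terms that always appear in the same documents score 1.0, and terms that never appear together score 0.0.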

13
Information Structure

Fairly recent and slow to develop: there has been
reluctance to try out new organization techniques
for faster and better retrieval.
Serial file organization
Inverted file (?)
Clustering: Good, Fairthorne, Doyle, Rocchio
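
To make the contrast between a serial file and an inverted file concrete, here is a small Python sketch. It assumes each document is already reduced to a set of index terms; the document numbers, terms and function names are hypothetical.

```python
from collections import defaultdict


def serial_search(documents, term):
    """Serial file: scan every document representative in turn."""
    return [doc_id for doc_id, terms in documents.items() if term in terms]


def build_inverted_file(documents):
    """Inverted file: for each term, record the documents that contain it."""
    inverted = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            inverted[term].add(doc_id)
    return dict(inverted)


# Hypothetical collection: document number -> set of index terms.
DOCS = {
    1: {"retrieval", "index"},
    2: {"cluster", "index"},
    3: {"retrieval", "cluster"},
}
INVERTED = build_inverted_file(DOCS)

print(serial_search(DOCS, "index"))
print(sorted(INVERTED["index"]))
```

The serial search touches every document for every query, while the inverted file answers a single-term query with one lookup at the cost of building and storing the extra structure.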

14
Evaluation of Retrieval Systems

Extremely difficult.
Dichotomous scale: relevant and non-relevant.
Precision: the ratio of the number of relevant
documents retrieved to the total number of
documents retrieved.
Recall: the ratio of the number of relevant
documents retrieved to the total number of
relevant documents (both retrieved and not
retrieved).
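
The two ratios can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below is a minimal illustration; returning 0.0 when a denominator is empty is an assumed convention, and the document numbers are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    An empty denominator yields 0.0 (an assumed convention).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 5 relevant in the collection,
# 2 of the retrieved documents are relevant.
print(precision_recall([1, 2, 3, 4], [2, 4, 5, 6, 7]))
```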

15
Steps

Generation of Machine Representations for the
Information.
Explanation of the Logical Structures that may be
arrived at by Clustering.
Representing these Structures in the Computer, or
in other words, choice of File Structures to
Represent the Logical Structure.
Search Strategies.
Probabilistic Retrieval, i.e. to create a Formal
Model for certain kinds of Search Strategies.
Ways of Evaluating the Effectiveness of Retrieval.

16
AUTOMATIC TEXT ANALYSIS

Storing information:
The original is in the form of documents.
What is stored is a document representation.
The emphasis is on statistical rather than
linguistic approaches.
We start with the original ideas of Luhn.

17
Luhn's Ideas

The frequency of word occurrence in an article
furnishes a useful measurement of word
significance.
The relative position within a sentence of words
having given values of significance furnishes a
useful measurement for determining the
significance of sentences.

18
Demonstration

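f: frequency of occurrence of words
r: rank order
Zipf's Law: the product of the frequency of use
of a word and its rank order is approximately
constant (f × r ≈ constant).
Luhn used the above law to define two cut-offs,
an upper and a lower: words above the upper
cut-off are too common, and words below the lower
cut-off too rare, to be significant.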
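
A small Python sketch of the rank-frequency table and of the two cut-offs. The cut-offs are expressed here as a minimum and a maximum frequency, and their values are assumptions chosen for the toy text; the slides give no concrete thresholds. A real demonstration of Zipf's Law needs a much larger text than this sample.

```python
import re
from collections import Counter


def word_counts(text):
    """Frequency f of each word in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def rank_frequency_table(text):
    """Rows of (rank r, word, frequency f, r * f), most frequent word first."""
    ranked = word_counts(text).most_common()
    return [(r, word, f, r * f) for r, (word, f) in enumerate(ranked, start=1)]


def significant_words(text, min_freq, max_freq):
    """Keep words between the two cut-offs.

    Words more frequent than max_freq are treated as too common, and words
    less frequent than min_freq as too rare. Both thresholds are assumed.
    """
    return {w for w, f in word_counts(text).items() if min_freq <= f <= max_freq}


SAMPLE = "the cat sat on the mat the cat ran"

for row in rank_frequency_table(SAMPLE):
    print(row)
print(significant_words(SAMPLE, min_freq=2, max_freq=2))
```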

19
(No Transcript)

20
Generating Document Representatives - conflation

Text Processing System
Input text: full text, abstract or title.
Output: a document representative adequate for use
in an automatic retrieval system.
The document representative consists of a list of
class names, each name representing a class of
words occurring in the total input text.
A document will be indexed by a name if one of
its significant words occurs as a member of that
class.

21
Text Processing System

Such a system will consist of three parts:
Removal of high-frequency words
Suffix stripping
Detecting equivalent stems
Removal of high-frequency words:
One way of implementing Luhn's upper cut-off.
Maintain a stop list; compare each input word
against it and remove the matches.
Document size reduces by 30 to 50 per cent.
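
Stop-list removal can be sketched as follows. The stop list here is a tiny hypothetical sample; a real one would contain a few hundred high-frequency words.

```python
import re

# Tiny illustrative stop list (an assumption, not a standard list).
STOP_LIST = {"the", "of", "and", "a", "in", "to", "is"}


def remove_stop_words(text, stop_list=STOP_LIST):
    """Split the text into lower-case words and drop those on the stop list."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stop_list]


print(remove_stop_words("The retrieval of documents is a problem"))
```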

22
Text Processing System

Suffix stripping is more involved.
Keep a complete list of suffixes; match and remove
the longest possible one.
Context-free removal leads to errors: removing
UAL is acceptable for FACTUAL but not for EQUAL.
Solution: apply some rules that decide when a
suffix may be removed.
Equivalent stems:
Some words map to the same morphological form on
removal of suffixes.
Other kinds do not match on mere removal of
suffixes (e.g. ABSORB- and ABSORPT-).
For these, a list of equivalent stem-endings is
maintained (e.g. B and PT are equivalent
stem-endings).
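
The steps above can be sketched in Python. The suffix list, the minimum stem length of 3 and the table of equivalent endings are all assumptions made for this example; they are not the rules of any particular published stemmer.

```python
# Illustrative suffix list; a real system would use a much longer one.
SUFFIXES = ["ation", "ions", "ion", "ing", "ual", "ed", "s"]
# Assumed rule: a suffix is removed only if at least this many letters remain.
MIN_STEM_LENGTH = 3
# Equivalent stem-endings, mapped to one canonical ending (B and PT here).
# Applied naively to every stem; a real table would be more selective.
EQUIVALENT_ENDINGS = {"pt": "b"}


def strip_suffix(word, suffixes=SUFFIXES, min_stem=MIN_STEM_LENGTH):
    """Remove the longest matching suffix, subject to the stem-length rule."""
    word = word.lower()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def conflate(word):
    """Strip the suffix, then map equivalent stem-endings to one form."""
    stem = strip_suffix(word)
    for ending, canonical in EQUIVALENT_ENDINGS.items():
        if stem.endswith(ending):
            return stem[: -len(ending)] + canonical
    return stem


print(strip_suffix("FACTUAL"), strip_suffix("EQUAL"))
print(conflate("absorbing"), conflate("absorption"))
```

With the stem-length rule, UAL is removed from FACTUAL but EQUAL is left alone because only two letters would remain.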

23
Text Processing System

The final output from a conflation algorithm is a
set of classes, one for each stem detected.
A class name is assigned to a document if and
only if one of its members occurs as a
significant word in the text of the document.
A document representative then becomes a list of
class names. These are often referred to as the
document's index terms or keywords.
Queries: queries are handled in the same way.
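
A minimal sketch of how a document and a query could both be turned into representatives and then compared. The stop list and `toy_stem` (which only strips a trailing "s") are placeholders standing in for a real stop list and conflation algorithm; all names are hypothetical.

```python
import re

# Tiny illustrative stop list (an assumption).
STOP_LIST = {"the", "of", "a", "for", "and", "in"}


def toy_stem(word):
    """Placeholder conflation: strip a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def representative(text, stem=toy_stem, stop_list=STOP_LIST):
    """Return the set of class names (stems) for a document or a query."""
    words = re.findall(r"[a-z]+", text.lower())
    return {stem(w) for w in words if w not in stop_list}


doc_terms = representative("Indexing of documents for retrieval systems")
query_terms = representative("document retrieval")

print(sorted(doc_terms))
print(sorted(query_terms & doc_terms))  # index terms shared by query and document
```

Because the query goes through the same processing, "document" in the query matches "documents" in the text.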

24
Indexing

An index language is the language used to describe
documents and requests.
The elements of the index language are index terms,
which may be derived from the text of the
document to be described, or may be arrived at
independently.

25
Some distinctions

Index languages can be described as:
Pre-coordinate: terms are coordinated at the
time of indexing.
Post-coordinate: terms are coordinated at the
time of searching.
Vocabulary of the index language:
Controlled: a list of approved index terms that an
indexer may use. Other kinds of syntactic controls
may also be imposed (e.g. certain terms used only
as adjectives).
Uncontrolled: any term may be used.
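
These distinctions can be illustrated with a short Python sketch. The approved-term list, the two toy indexes and the function names are all hypothetical; the point is only where the combination of terms happens.

```python
# Hypothetical list of approved index terms for a controlled vocabulary.
APPROVED_TERMS = {"information", "retrieval", "indexing"}


def controlled_terms(candidates, approved=APPROVED_TERMS):
    """Controlled vocabulary: keep only terms on the approved list."""
    return {t for t in candidates if t in approved}


def post_coordinate_search(index, terms):
    """Post-coordination: single-term entries are combined (AND) at search time."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


# Post-coordinate index: one entry per single term.
POST_INDEX = {"information": {1, 2}, "retrieval": {2, 3}}
# Pre-coordinate index: the compound term was formed when document 2 was indexed.
PRE_INDEX = {"information retrieval": {2}}

print(post_coordinate_search(POST_INDEX, ["information", "retrieval"]))
print(PRE_INDEX["information retrieval"])
print(controlled_terms({"retrieval", "banana"}))
```

Both indexes lead to document 2 in this toy case; the difference is whether the terms were combined by the indexer or by the searcher.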