Introduction to Text Retrieval - PowerPoint PPT Presentation

(c) Maria Indrawan 2004.

1
Introduction to Text Retrieval

CSE3201/4500
Information Retrieval Systems

2
Database Types
A spectrum, from highly structured to ill-structured:
Relational DB
XML collections
Text Collections
Multimedia Collections
3
Ill-structured data

Attributes
Variable length records, fields
Repeated fields (non-normalised)
Mixed media
Often large
Often accessed by novice users
Need for both currency and completeness

4
Information Retrieval

Information retrieval is the term applied
to areas such as
text retrieval systems, library systems,
citation retrieval systems, records management
and archives, photo library applications, etc.
These systems are typical of variable-length
record systems.
Text retrieval is a subset of Information
Retrieval.
Research articles may use the term IR to mean text
retrieval, especially in the 1970s, 1980s and 1990s.

5
Text Retrieval - Overview

Information retrieval
branch of database theory
specialises in managing retrieval of unstructured
data
large amount of free format text.
Key problem
How to retrieve the appropriate pieces of
unstructured data (e.g. documents) in response to
a more or less structured query.
Response to a query
Does not answer the query directly
Identifies relevant information instead.

6
Text Retrieval Characteristics

large volume of document space
document space may or may not be structured.
query may not be structured.
exact matching, as used in relational databases,
will not work effectively.
objects to be retrieved are usually
represented by surrogate records.

7
Surrogate Records

Most text retrieval systems rely on surrogate
records rather than directly accessing the
objects themselves.
The quality of the surrogate records often
decides how well the system retrieves.
The structure of the surrogate records will
affect how well they can be indexed or otherwise
accessed.

8
Text Retrieval Processes

Representation
Storage
Organization
Retrieval
Presentation

9
Text Retrieval Processes Model
10
Retrieval Process
11
Indexing (Document Analysis)
12
Query Formulation

Controlled vocabulary
keywords of the query are drawn from the keywords used in the document collection

13
Indexing in Text Retrieval Systems
14
Indexed Files in Traditional Databases

An index is a lookup table which establishes a
correspondence between a particular attribute (or
attributes) and the address of the record in the
file.
One named (physical) file - two logical files
Data file - contains full data records
Index file - records consist of two fields,
key value and address
Index file is small - quick to search
Addresses obtained from the index enable direct
access to the data file
Logically sequential access is also possible via the index
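
As a rough illustration of the two logical files described above, the sketch below keeps the data records in an in-memory byte stream and the index as a mapping from key value to address (byte offset). The record layout, the keys and the function names `fetch` and `scan_in_key_order` are assumptions made for this example only.

```python
import io

# Hypothetical data records (key value, content), deliberately not in key order.
records = [("C003", "Carol"), ("A001", "Alice"), ("B002", "Bob")]

data_file = io.BytesIO()   # stands in for the data file
index = {}                 # stands in for the index file: key value -> address

for key, name in records:
    index[key] = data_file.tell()                 # address = byte offset of the record
    data_file.write(f"{key},{name}\n".encode("utf-8"))


def fetch(key):
    """Direct access: look up the address in the index, then seek to the record."""
    data_file.seek(index[key])
    return data_file.readline().decode("utf-8").rstrip("\n")


def scan_in_key_order():
    """Logically sequential access: walk the index in key order."""
    return [fetch(key) for key in sorted(index)]
```

The data file itself stays in arrival order; only the small index needs to be searched or sorted.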

15
Indexed Non-Sequential File
Data Records
16
Indexed Sequential File
Data Records
17
Indexing in Text Retrieval Systems
Doc-2 (data record)
Doc-1 (data record)
18
Purpose of Indexing

a sufficiently general description of a document
so that it can be retrieved with queries that
concern the same subject as the document
a sufficiently specific description so that the
document will not be returned for those queries
which are not related to the document.

19
Indexing

Manual indexing
Automatic indexing

20
Style of indexing

The style of indexing depends on the form of the queries, and vice versa.
We must decide whether the terms available for
indexing are predefined, a controlled vocabulary,
or chosen at the time of indexing, an
uncontrolled vocabulary.

21
Controlled Vocabulary

Controlled vocabulary is a method of
predetermining the terms which will be used in a
specific domain so that
indexers will select from a limited set of terms
searchers can use terms knowing that they have
been applied in an objective manner
index sets are reduced in size
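
A minimal sketch of how an indexing tool might enforce a controlled vocabulary: proposed terms are accepted only if they appear in the predetermined set. The vocabulary contents and the function name `assign_terms` are hypothetical.

```python
# Hypothetical controlled vocabulary for a small computing collection.
CONTROLLED_VOCABULARY = {"databases", "information retrieval", "indexing"}


def assign_terms(proposed):
    """Split proposed index terms into those in the vocabulary and those rejected."""
    accepted = [t for t in proposed if t.casefold() in CONTROLLED_VOCABULARY]
    rejected = [t for t in proposed if t.casefold() not in CONTROLLED_VOCABULARY]
    return accepted, rejected
```

Because every accepted term comes from the same limited set, searchers can choose query terms from that set as well.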

22
Manual Indexing Methods

1. Give the document a single code from a
predefined list, e.g.
the first letter of the first author's family
name
a Dewey Decimal number
2. Assign several codes from a predefined list
to a document, e.g.
assign the Computing Reviews classification to
articles.
3. Assign to each document a set of descriptors that
are not predefined. The descriptors may be words
from the text of the document and/or from a thesaurus.

23
Manual Indexing - Analysis

Single-term indexing is simple and has a low indexing cost,
but gives poor retrieval.
All other techniques require that a more complex
index be maintained.
When a controlled vocabulary is used, a taxonomy
of the document contents must be devised. Having
devised this it must be adhered to henceforth.

24
Manual Indexing - Analysis

Advantage: terms that never appear in the text but are
extremely descriptive may be assigned to the
document.
Disadvantages
inter-indexer inconsistency
inflexible view of documents
no control over the number of documents that satisfy a query.

25
Automatic Indexing - A Basic Method

Assume that a document consists of just text and
that we will derive our indexing terms from this
text.
Break the text up into words, casefold, and index
on every word. This technique is very simple and
performs reasonably well.
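
The basic method can be sketched as follows. This is only an illustration under simple assumptions: words are taken to be runs of ASCII letters and digits, the index is an in-memory mapping from term to document identifiers, and the names `tokenize`, `build_index` and the sample documents are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break the text up into words and casefold each one."""
    return [w.casefold() for w in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs):
    """Index on every word: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


# Hypothetical two-document collection.
docs = {
    "Doc-1": "Text retrieval systems index text.",
    "Doc-2": "Library systems store Records.",
}
index = build_index(docs)
```

Looking up `index["systems"]` then returns both document identifiers, while `index["records"]` matches Doc-2 regardless of the capital letter in the source text.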

26
Automatic Indexing - Refinement

Language dependent:
refinement for English will differ from that for
Chinese
Stop List
Stemming
Term Weighting

27
Indexing Refinement - Stop List

A list of common words.
Generally contains words that are not nouns,
verbs, adjectives or adverbs.
A stop list might consist of a, the, an,
is, be, ....
Common stop lists run from 10 to hundreds of
words.
The exact contents of the stop list matter little;
typically around 300 common words will do well.
The indexing process ignores the words listed in
the stop list.
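
A small sketch of stop list filtering during indexing. The five-word stop list is taken from the example words on this slide; a real list would be longer, and the function name `index_terms` is an assumption.

```python
# Tiny stop list using the example words from the slide.
STOP_LIST = {"a", "the", "an", "is", "be"}


def index_terms(text, stop_list=STOP_LIST):
    """Casefold the words of the text and drop those found in the stop list."""
    words = [w.casefold() for w in text.split()]
    return [w for w in words if w not in stop_list]
```

For the input "The index is a lookup table", only "index", "lookup" and "table" would be passed on to the index.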

28
Stop Lists

Fox indicates that the first 20 stop words
account for 31.19% of the English corpus.
Fox, C. (1992). Lexical Analysis and Stoplists. In
Frakes, W.B. and Baeza-Yates, R. (Eds.),
Information Retrieval: Data Structures and
Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
The first 20 stop words
the, of, and, to, a, in, that, is, was, he, for,
it, with, as, not, his, on, be, at, by.
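
To see what such a coverage figure measures, the sketch below computes the share of word occurrences in a text that belong to the 20 words listed above. The function name `stop_word_share` is invented here, and any text passed to it is just sample input, not Fox's corpus.

```python
# The first 20 stop words as listed on the slide.
FIRST_20 = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
    "for", "it", "with", "as", "not", "his", "on", "be", "at", "by",
}


def stop_word_share(text):
    """Fraction of word occurrences in the text that are among the first 20 stop words."""
    words = [w.casefold() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in FIRST_20 for w in words) / len(words)
```

Skipping these words therefore removes a large share of index postings at very little cost to retrieval.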

29
Refinement - Stemming

Stemming attempts to accommodate the many
variations of a word that express the same
concept.
This avoids an exceedingly long OR query
statement.
Example: inquiry OR inquired OR inquiries
The process is performed after the stop list
process.
Porter stemming algorithm
Porter, M.F. (1980). An algorithm for suffix
stripping. Program, 14(3), 130-137.
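
The idea can be illustrated with a deliberately naive suffix stripper. This is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list, the minimum stem length of three and the name `naive_stem` are assumptions chosen only to make the slide's example work.

```python
# Illustrative suffix list, checked longest first. NOT the Porter rule set.
SUFFIXES = ("ies", "ing", "ed", "es", "y", "e", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.casefold()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this sketch, inquiry, inquired and inquiries all reduce to the same stem, so a single query term can stand in for the whole group.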

30
Stemming - Suffix

Most English meaning shifts for grammatical
purposes are handled by suffixes.
Most retrieval systems allow for trailing
(suffix) truncation.
Example:
inquir will retrieve documents containing the
words inquire, inquired, inquires,
inquiring, inquiry, etc.
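
Trailing truncation can be sketched as a prefix match over the indexed terms. The tiny index and the function name `truncation_search` are hypothetical; a real system would use a sorted term dictionary rather than scanning every term.

```python
def truncation_search(stem, index):
    """Return documents containing any indexed term that starts with the given stem."""
    matches = set()
    for term, doc_ids in index.items():
        if term.startswith(stem):
            matches.update(doc_ids)
    return sorted(matches)


# Hypothetical index: term -> set of document identifiers.
sample_index = {
    "inquire": {"Doc-1"},
    "inquiry": {"Doc-2"},
    "index": {"Doc-3"},
}
```

Searching this index for the truncated term inquir matches Doc-1 and Doc-2 but not Doc-3.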

31
Stemming - Prefix