Title: Introduction to Text Retrieval
1Introduction to Text Retrieval
- CSE3201/4500
- Information Retrieval Systems
2Database Types
highly-structured
Relational DB
XML collections
Text Collections
Multimedia Collections
ill-structured
3Ill-structured data
- Attributes
- Variable length records, fields
- Repeated fields non-normalised
- Mixed media
- Often large
- Often accessed by novice users
- Need for both currency and completeness
4Information Retrieval
- Information retrieval has been the term applied
to such areas as
- text retrieval systems, library systems,
citation retrieval systems, records management
and archives, photo library applications etc.
- These systems are typical of variable-length
record systems
- Text retrieval is a subset of Information
Retrieval.
- Research articles may use the term IR to mean
text retrieval, especially in the 1970s, 80s and
90s.
5Text Retrieval - Overview
- Information retrieval
- branch of database theory
- specialises in managing retrieval of unstructured
data: large amounts of free-format text.
- Key problem
- How to retrieve the appropriate pieces of
unstructured data (e.g. documents) in response to
a more or less structured query.
- Response to a query
- Does not answer the query directly
- Identifies relevant information.
6Text Retrieval Characteristics
- large volume of document space
- document space may/may not be structured.
- query may not be structured.
- exact matching, as used in relational databases,
will not work effectively.
- objects which are to be retrieved are usually
represented by surrogate records.
7Surrogate Records
- Most text retrieval systems rely on surrogate
records rather than directly accessing the
objects themselves.
- The quality of the surrogate records often
decides how well the system retrieves.
- The structure of the surrogate records will
affect how well they can be indexed or otherwise
accessed.
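One possible shape for a surrogate record is sketched below. The slides do not define its fields, so the field names (doc_id, title, keywords, location) and the helper search_surrogates are assumptions made purely for illustration. The point it shows is that the search touches only the surrogates and returns a pointer to the full object.

```python
from dataclasses import dataclass, field


@dataclass
class SurrogateRecord:
    """A compact stand-in for a stored object (all fields are hypothetical)."""
    doc_id: int
    title: str
    keywords: set = field(default_factory=set)  # assumed to be lower-case
    location: str = ""  # where the full object lives, e.g. a file path


def search_surrogates(records, term):
    """Return locations of objects whose surrogate mentions the term."""
    term = term.casefold()
    return [
        r.location
        for r in records
        if term in r.keywords or term in r.title.casefold().split()
    ]


records = [
    SurrogateRecord(1, "Indexing methods", {"indexing", "stemming"}, "docs/1.txt"),
    SurrogateRecord(2, "Photo archives", {"images", "archives"}, "docs/2.txt"),
]
print(search_surrogates(records, "Stemming"))  # ['docs/1.txt']
```

A richer surrogate (more keywords, an abstract) would let more queries find the object, which is why surrogate quality tends to bound retrieval quality.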
8Text Retrieval Processes
- Representation
- Storage
- Organization
- Retrieval
- Presentation
9Text Retrieval Processes Model
10Retrieval Process
11Indexing (Document Analysis)
12Query Formulation
- Controlled vocabulary
- keywords of the query are matched against keywords in the document collection
13Indexing in Text Retrieval Systems
14Indexed Files in Traditional Databases
- An index is a lookup table which establishes a
correspondence between a particular attribute (or
attributes) and the address of the record in the
file.
- One named (physical) file - two logical files
- Data file - contains full data records
- Index file - records consist of two fields
- key value and address
- Index file small - quick to search
- Addresses obtained from the index enable direct
access to the data file
- Logically sequential access also via index
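The index file idea above can be sketched in a few lines. This is a minimal illustration, not a real file organisation: the record layout ("key|payload" on one line), the sample keys, and the use of an in-memory io.BytesIO buffer in place of a disk file are all assumptions.

```python
import io

# Hypothetical sample records, deliberately stored out of key order.
records = [
    ("S103", "Lee, Databases"),
    ("S101", "Adams, Retrieval"),
    ("S102", "Chen, Indexing"),
]

data_file = io.BytesIO()  # stands in for the data file on disk
index = {}                # index file: key value -> address (byte offset)
for key, payload in records:
    index[key] = data_file.tell()
    data_file.write(f"{key}|{payload}\n".encode("utf-8"))


def fetch(key):
    """Direct access: look up the address in the index, then seek to it."""
    data_file.seek(index[key])
    return data_file.readline().decode("utf-8").rstrip("\n")


def scan_in_key_order():
    """Logically sequential access: walk the index in key order."""
    return [fetch(key) for key in sorted(index)]


print(fetch("S102"))          # S102|Chen, Indexing
print(scan_in_key_order()[0])  # S101|Adams, Retrieval
```

The data file keeps its physical (arrival) order; only the small index needs to be searched or sorted.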
15Indexed Non-Sequential File
Data Records
16Indexed Sequential File
Data Records
17Indexing in Text Retrieval Systems
Doc-2 (data record)
Doc-1 (data record)
18Purpose of Indexing
- a sufficiently general description of a document
so that it can be retrieved with queries that
concern the same subject as the document
- a sufficiently specific description so that the
document will not be returned for those queries
which are not related to the document.
19Indexing
- Manual indexing
- Automatic indexing
20Style of indexing
- depends on the form of queries and vice-versa.
- We must decide whether the terms available for
indexing are predefined (a controlled vocabulary)
or chosen at the time of indexing (an
uncontrolled vocabulary).
21Controlled Vocabulary
- Controlled vocabulary is a method of
predetermining the terms which will be used in a
specific domain so that
- indexers will select from a limited set of terms
- searchers can use terms knowing that they have
been applied in an objective manner
- index sets are reduced in size
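A controlled vocabulary can be pictured as a fixed mapping from accepted variants to one preferred term. The sketch below is only an illustration: the vocabulary contents and the names CONTROLLED_VOCABULARY, PREFERRED and normalise are invented for the example.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted variants.
CONTROLLED_VOCABULARY = {
    "information retrieval": {"ir", "text retrieval", "document retrieval"},
    "indexing": {"index construction", "document analysis"},
}

# Invert it once so any variant can be looked up directly.
PREFERRED = {}
for preferred, variants in CONTROLLED_VOCABULARY.items():
    PREFERRED[preferred] = preferred
    for variant in variants:
        PREFERRED[variant] = preferred


def normalise(term):
    """Map a term to its preferred form, or None if outside the vocabulary."""
    return PREFERRED.get(term.casefold().strip())


print(normalise("IR"))       # information retrieval
print(normalise("cooking"))  # None
```

Both indexers and searchers would pass their terms through the same mapping, so the index only ever holds the preferred terms.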
22Manual Indexing Methods
- 1. Give the document a single code from a
predefined list, e.g.
- the first letter of the first author's family
name
- a Dewey Decimal number
- 2. Assign several codes from a predefined list
to a document, e.g.
- assign the Computing Reviews classification to
articles.
- 3. Assign to each document a set of descriptors that
are not predefined. The descriptors may be words
from the text of the document and/or a thesaurus.
23Manual Indexing - Analysis
- Single-term indexing is simple and has a low
index cost, but gives poor retrieval.
- All other techniques require that a more complex
index be maintained.
- When a controlled vocabulary is used, a taxonomy
of the document contents must be devised. Once
devised, it must be adhered to from then on.
24Manual Indexing - Analysis
- Advantage: terms that are never used in the text
but are extremely descriptive may be assigned to
the document.
- Disadvantages
- poor inter-indexer consistency
- inflexible view of documents
- no control on number of satisfying documents.
25Automatic Indexing - A Basic Method
- Assume that a document consists of just text and
that we will derive our indexing terms from this
text.
- Break the text up into words, casefold, and index
on every word. This technique is very simple and
performs reasonably well.
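The basic method above can be sketched as follows. The function name build_index, the regular expression used to split words, and the two sample documents are assumptions for illustration; a real system would need a more careful tokeniser.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Index every casefolded word: word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text):
            index[word.casefold()].add(doc_id)
    return index


documents = {
    1: "Text retrieval is a subset of Information Retrieval.",
    2: "An index is a look up table.",
}
index = build_index(documents)
print(sorted(index["retrieval"]))  # [1]
print(sorted(index["is"]))         # [1, 2]
```

Because of casefolding, "Retrieval" and "retrieval" land in the same index entry.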
26Automatic Indexing - Refinement
- Language dependent.
- refinement for English will be different from
Chinese
- Stop List
- Stemming
- Term Weighting
27Indexing Refinement Stop List
- A list of common words.
- Generally contains words that are not nouns,
verbs, adjectives or adverbs.
- A stop list might consist of: a, the, an,
is, be, ...
- Common stop lists run from 10 to hundreds of
words.
- The exact choice of stop list matters little;
typically around 300 common words will do well.
- The indexing process ignores the words listed in
the stop list.
28Stop Lists
- Fox indicates that the first 20 stop words
account for 31.19% of the English corpus.
- Fox, C. (1992). Lexical Analysis and Stoplists. In
Frakes, W.B. and Baeza-Yates, R. (Eds.),
Information Retrieval: Data Structures and
Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
- The first 20 stop words
- the, of, and, to, a, in, that, is, was, he, for,
it, with, as, not, his, on, be, at, by.
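As a small sketch of how a stop list is applied during indexing, the 20 words listed above can be held in a set and filtered out after casefolding. The function name index_terms and the tokenising regular expression are assumptions for illustration.

```python
import re

# The first 20 stop words from the list above.
STOP_WORDS = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
    "for", "it", "with", "as", "not", "his", "on", "be", "at", "by",
}


def index_terms(text):
    """Casefold, split into words, and drop anything on the stop list."""
    words = re.findall(r"\w+", text.casefold())
    return [w for w in words if w not in STOP_WORDS]


print(index_terms("The quality of the surrogate records is important"))
# ['quality', 'surrogate', 'records', 'important']
```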
29Refinement - Stemming
- Incorporates many variations of a word: an
attempt is made to accommodate the many variations
that make up a concept.
- This avoids exceedingly long OR query
statements.
- Example: inquiry OR inquired OR inquiries
- The process is performed after the stop list
process.
- Porter stemming algorithm
- Porter, M.F. (1980). An algorithm for suffix
stripping. Program, 14(3), 130-137.
30Stemming - Suffix
- Most English meaning shifts for grammatical
purposes are handled by suffixes
- Most retrieval systems allow for trailing
(suffix) truncation.
- Example
- inquir will retrieve documents containing the
words inquire, inquired, inquires,
inquiring, inquiry etc.
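Trailing truncation can be sketched as a prefix match over the indexed words. The tiny word-to-document index and the function name truncation_search below are made up for illustration; a real system would search a sorted term list rather than scan every entry.

```python
def truncation_search(stem, index):
    """Return doc ids for every indexed word that starts with the stem."""
    hits = set()
    for word, doc_ids in index.items():
        if word.startswith(stem):
            hits |= doc_ids
    return hits


# Hypothetical index: word -> set of document ids.
index = {
    "inquire": {1},
    "inquiries": {2},
    "inquiry": {2, 3},
    "require": {4},
}
print(sorted(truncation_search("inquir", index)))  # [1, 2, 3]
```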
31Stemming - Prefix
- Prefix stemming is usually not used in English
text retrieval systems.
- A prefix is a substantial modifier, sometimes
even a negation.
- Example
- flammable and inflammable.
- Prefix stemming may be useful in chemical
databases.
32Stemming Exception List
- Irregularities in the language need to be
implemented as a lookup list
- Example
- Irregular plurals
- woman -> women
- child -> children
- Irregular past tense
- choose -> chose
- find -> found
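One way an exception list might sit in front of a stemmer is sketched below. The lookup maps each irregular form from the examples above back to its base word. The suffix rules are a deliberately crude stand-in chosen for illustration and are not the Porter algorithm; the names EXCEPTIONS, SUFFIXES and stem are likewise assumptions.

```python
# Irregular forms from the examples above, mapped back to the base word.
EXCEPTIONS = {
    "women": "woman",
    "children": "child",
    "chose": "choose",
    "found": "find",
}

# Crude illustrative suffix rules; NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Check the exception list first, then strip one known suffix."""
    word = word.casefold()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(stem("Children"))   # child
print(stem("inquiring"))  # inquir
```

Checking the exception list before the rules matters: otherwise a rule-based stripper would never relate "found" to "find".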
33Summary
- Text Retrieval Systems
- motivation
- model
- Indexing Refinements
- Stop List
- Stemming
- Term Weight (week 8)