Introduction to Text Retrieval - PowerPoint PPT Presentation

(c) Maria Indrawan 2004.


1
Introduction to Text Retrieval
  • CSE3201/4500
  • Information Retrieval Systems

2
Database Types
  • A spectrum from highly structured to ill-structured
  • Relational DB (highly structured)
  • XML collections
  • Text Collections
  • Multimedia Collections (ill-structured)

3
Ill-structured data
  • Attributes
  • Variable length records, fields
  • Repeated fields (non-normalised)
  • Mixed media
  • Often large
  • Often accessed by novice users
  • Need for both currency and completeness

4
Information Retrieval
  • Information retrieval has been the term applied
    to such areas as
  • text retrieval systems, library systems,
    citation retrieval systems, records management
    and archives, photo library applications etc.
  • These systems are typical of variable-length
    record systems
  • Text retrieval is a subset of Information
    Retrieval.
  • Research articles may use the term IR to mean text
    retrieval, especially in the 70s, 80s and 90s.

5
Text Retrieval - Overview
  • Information retrieval
  • branch of database theory
  • specialises in managing retrieval of unstructured
    data
  • large amount of free format text.
  • Key problem
  • How to retrieve the appropriate pieces of
    unstructured data (e.g. documents) in response to
    a more or less structured query.
  • Response to a query
  • Does not answer the query directly
  • Identifies relevant information instead.

6
Text Retrieval Characteristics
  • large volume of document space
  • document space may/may not be structured.
  • query may not be structured.
  • exact matching, as used in relational databases,
    will not work effectively.
  • objects which are to be retrieved are usually
    represented by surrogate records.

7
Surrogate Records
  • Most text retrieval systems rely on surrogate
    records rather than directly accessing the
    objects themselves.
  • The quality of the surrogate records often
    decides how well the system retrieves.
  • The structure of the surrogate records will
    affect how well they can be indexed or otherwise
    accessed.

8
Text Retrieval Processes
  • Representation
  • Storage
  • Organization
  • Retrieval
  • Presentation

9
Text Retrieval Processes Model
10
Retrieval Process
11
Indexing (Document Analysis)
12
Query Formulation
  • Controlled vocabulary
  • keywords of the query are matched against keywords
    in the document collection

13
Indexing in Text Retrieval Systems
14
Indexed Files in Traditional Databases
  • An index is a lookup table which establishes a
    correspondence between a particular attribute (or
    attributes) and the address of the record in the
    file.
  • One named (physical) file - two logical files
  • Data file - contains full data records
  • Index file - records consist of two fields
  • key value and address
  • Index file small - quick to search
  • Addresses obtained from the index enable direct
    access to the data file
  • Logically sequential access also via index
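
The two logical files described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the list-position "addresses" and the helper names (build_index, lookup, scan_in_key_order) are hypothetical and do not come from the slides.

```python
# Hypothetical data file: full records, addressed by their position in the list.
data_file = [
    {"id": 17, "title": "Stop lists"},
    {"id": 4, "title": "Stemming"},
    {"id": 9, "title": "Term weighting"},
]


def build_index(records, key):
    """Build the index file: each entry pairs a key value with a record address."""
    return {record[key]: address for address, record in enumerate(records)}


def lookup(index, records, key_value):
    """Direct access: get the address from the index, then read the data file."""
    address = index.get(key_value)
    return None if address is None else records[address]


def scan_in_key_order(index, records):
    """Logically sequential access: visit the records in key order via the index."""
    return [records[index[key_value]] for key_value in sorted(index)]


index_file = build_index(data_file, "id")
print(lookup(index_file, data_file, 9))
print([record["id"] for record in scan_in_key_order(index_file, data_file)])
```

The data file itself stays in arrival order; only the small index has to be searched or sorted.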

15
Indexed Non-Sequential File
Data Records
16
Indexed Sequential File
Data Records
17
Indexing in Text Retrieval Systems
Doc-2 (data record)
Doc-1 (data record)
18
Purpose of Indexing
  • a sufficiently general description of a document
    so that it can be retrieved with queries that
    concern the same subject as the document
  • a sufficiently specific description so that the
    document will not be returned for those queries
    which are not related to the document.

19
Indexing
  • Manual indexing
  • Automatic indexing

20
Style of indexing
  • depends on the form of queries and vice-versa.
  • We must decide whether the terms available for
    indexing are predefined, a controlled vocabulary,
    or chosen at the time of indexing, an
    uncontrolled vocabulary.

21
Controlled Vocabulary
  • Controlled vocabulary is a method of
    predetermining the terms which will be used in a
    specific domain so that
  • indexers will select from a limited set of terms
  • searchers can use terms knowing that they have
    been applied in an objective manner
  • index sets are reduced in size
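
As a rough sketch of how an indexer's choices might be checked against a controlled vocabulary, assuming a small made-up vocabulary and a hypothetical helper split_terms (neither is from the slides):

```python
# Hypothetical controlled vocabulary for a small domain.
CONTROLLED_VOCABULARY = {"information retrieval", "indexing", "stemming", "stop list"}


def split_terms(candidate_terms, vocabulary=CONTROLLED_VOCABULARY):
    """Separate proposed index terms into those in the vocabulary and those not."""
    accepted, rejected = [], []
    for term in candidate_terms:
        normalized = term.strip().casefold()
        if normalized in vocabulary:
            accepted.append(normalized)
        else:
            rejected.append(normalized)
    return accepted, rejected


accepted, rejected = split_terms(["Indexing", "search engines", " Stemming "])
print(accepted)
print(rejected)
```

Only accepted terms would enter the index, which is one way the index set stays small.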

22
Manual Indexing Methods
  • 1. Give the document a single code from a
    predefined list. e.g.
  • the first letter of the first author's family
    name
  • a Dewey Decimal number
  • 2. Assign several codes from a predefined list
    to a document. e.g.
  • assign the Computing Reviews classification to
    articles.
  • 3. Assign to each document a set of descriptors that
    are not predefined. The descriptors may be words
    from the text of the document and/or a thesaurus.

23
Manual Indexing - Analysis
  • Single-term indexing is simple and has a low index
    cost, but gives poor retrieval.
  • All other techniques require that a more complex
    index be maintained.
  • When a controlled vocabulary is used, a taxonomy
    of the document contents must be devised. Having
    devised this it must be adhered to henceforth.

24
Manual Indexing - Analysis
  • Advantage: terms that never appear in the text but
    are extremely descriptive may be assigned to the
    document.
  • Disadvantages:
  • inter-indexer consistency is hard to achieve
  • inflexible view of documents
  • no control over the number of satisfying documents.

25
Automatic Indexing - A Basic Method
  • Assume that a document consists of just text and
    that we will derive our indexing terms from this
    text.
  • Break the text up into words, casefold, and index
    on every word. This technique is very simple and
    performs reasonably well.
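
A minimal sketch of this basic method, assuming that "words" are runs of ASCII letters and digits and that the index maps each term to the set of documents containing it. The sample documents and the function names tokenize and build_inverted_index are illustrative only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break the text up into words and casefold each one."""
    return [word.casefold() for word in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(documents):
    """Index on every word: map each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


documents = {
    "Doc-1": "Text retrieval is a subset of information retrieval.",
    "Doc-2": "An index is a lookup table.",
}
index = build_inverted_index(documents)
print(sorted(index["retrieval"]))
print(sorted(index["is"]))
```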

26
Automatic Indexing - Refinement
  • Language dependent.
  • refinement for English will be different from
    Chinese
  • Stop List
  • Stemming
  • Term Weighting

27
Indexing Refinement Stop List
  • A list of common words.
  • Generally contains words that are not nouns,
    verbs, adjectives or adverbs.
  • A stop list might consist of a, the, an,
    is, be, ...
  • Common stop lists run from 10 to hundreds of
    words.
  • The exact contents of the stop list matter little;
    typically around 300 common words will do well.
  • The indexing process will ignore the words listed
    in the stop list.
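
Ignoring stop words during indexing can be sketched as a simple filter. The five-word list below is just the example given on this slide, not a recommended stop list, and remove_stop_words is a hypothetical helper name.

```python
# The example stop list from the slide; real lists are usually much longer.
STOP_LIST = {"a", "the", "an", "is", "be"}


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in stop_list]


tokens = "the index is a lookup table".split()
print(remove_stop_words(tokens))  # ['index', 'lookup', 'table']
```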

28
Stop Lists
  • Fox indicates that the first 20 stop words
    account for 31.19% of the English corpus.
  • Fox, C. (1992). Lexical Analysis and Stoplists. In
    Frakes, W.B. and Baeza-Yates, R. (Eds.),
    Information Retrieval: Data Structures and
    Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
  • The first 20 stop words
  • the, of, and, to, a, in, that, is, was, he, for,
    it, with, as, not, his, on, be, at, by.
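
To get a feel for how much of a text these 20 words cover, the share can be measured directly. This is a sketch only: the sample sentence is arbitrary, stop_word_share is a hypothetical helper, and a single sentence will not reproduce the corpus-level figure quoted above.

```python
# The first 20 stop words as listed on the slide.
FIRST_20_STOP_WORDS = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
    "for", "it", "with", "as", "not", "his", "on", "be", "at", "by",
}


def stop_word_share(tokens, stop_list=FIRST_20_STOP_WORDS):
    """Return the fraction of tokens that appear in the stop list."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.casefold() in stop_list)
    return hits / len(tokens)


sample = ("The quality of the surrogate records often decides "
          "how well the system retrieves").split()
print(f"{stop_word_share(sample):.2%}")
```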

29
Refinement - Stemming
  • Conflates the many variations of a word, in an
    attempt to accommodate the variations that make
    up a single concept.
  • This avoids an exceedingly long OR query
    statement.
  • Example: inquiry OR inquired OR inquiries
  • The process is performed after the stop list
    process.
  • Porter stemming algorithm
  • Porter, M.F. (1980). An algorithm for suffix
    stripping. Program, 14(3), 130-137.
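
The sketch below shows how stemming lets one term stand in for the alternative forms in the example. It is deliberately not the Porter algorithm, only a crude longest-suffix stripper. The suffix list, the minimum stem length and the name crude_stem are assumptions for illustration.

```python
# Hypothetical suffix list, ordered longest first. NOT the Porter rule set.
SUFFIXES = ("ies", "ing", "ed", "es", "y", "e", "s")


def crude_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length letters."""
    word = word.casefold()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


variants = ["inquiry", "inquired", "inquiries"]
print({crude_stem(word) for word in variants})  # {'inquir'}
```

With stems in the index, a query for any one variant is reduced to the same stem and matches all of them.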

30
Stemming - Suffix
  • Most English meaning shifts for grammatical
    purposes are handled by suffixes
  • Most retrieval systems allow for trailing (suffix)
    truncation.
  • Example
  • the truncated term inquir will retrieve documents
    containing the words inquire, inquired, inquires,
    inquiring, inquiry etc.
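
Trailing truncation can be sketched as a prefix match over the indexed terms. The small index and the function name truncation_search are made up for illustration. A real system would more likely search a sorted term list than scan every term.

```python
def truncation_search(stem, index):
    """Return the documents containing any indexed term that starts with stem."""
    stem = stem.casefold()
    matches = set()
    for term, doc_ids in index.items():
        if term.startswith(stem):
            matches |= doc_ids
    return matches


# Hypothetical inverted index: term -> set of document ids.
index = {
    "inquire": {"Doc-1"},
    "inquiry": {"Doc-2"},
    "inquiring": {"Doc-2", "Doc-3"},
    "index": {"Doc-4"},
}
print(sorted(truncation_search("inquir", index)))  # ['Doc-1', 'Doc-2', 'Doc-3']
```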

31
Stemming - Prefix
  • Is usually not used in English text retrieval
    systems.
  • A prefix is a substantial modifier of meaning,
    even a negation.
  • Example
  • flammable and inflammable.
  • Prefix stemming may be useful in Chemical
    databases.

32
Stemming Exception List
  • Irregularities in the language need to be
    handled with a lookup list
  • Example
  • Irregular plurals
  • woman -> women
  • child -> children
  • past tense
  • choose -> chose
  • find -> found
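
One way to apply such a lookup list is to consult it before any suffix stripping. In this sketch the table holds only the four examples from the slide, and normalize and its stem parameter are hypothetical names. The default stem function leaves regular words unchanged.

```python
# Exception list: irregular form -> base form (the examples from the slide).
EXCEPTIONS = {
    "women": "woman",
    "children": "child",
    "chose": "choose",
    "found": "find",
}


def normalize(word, stem=lambda w: w):
    """Look the word up in the exception list first; otherwise apply the stemmer."""
    word = word.casefold()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return stem(word)


print([normalize(word) for word in ["Women", "children", "found", "index"]])
# ['woman', 'child', 'find', 'index']
```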

33
Summary
  • Text Retrieval Systems
  • motivation
  • model
  • Indexing Refinements
  • Stop List
  • Stemming
  • Term Weight (week 8)