Introduction to Information Retrieval Systems

Provided by: ccNct
1
Introduction to Information Retrieval Systems
2
Outline
  • Definition of IR Systems
  • Objectives of IR Systems
  • Functional Overview
  • Relationship to DBMS

3
Definition of IR Systems (I)
  • An IR System is a system capable of storage,
    retrieval, and maintenance of information.
  • Text, image, audio, video, and other multi-media
    objects
  • Focus on textual information here
  • Item
  • The smallest complete textual unit processed and
    manipulated by an IR system
  • Depends on how a specific source treats
    information
  • Book? Chapter? Paragraph?
  • Item and Document are used interchangeably
  • An IR system provides the searching and browsing
    capabilities in Digital Libraries

4
IR for Searching
5
Definition of IR Systems (II)
  • Purpose of an IR System
  • Find the information the user needs.
  • Success measure (Objectives of an IR System)
  • Minimize the overhead for finding information
  • Overhead: the time a user spends in all of the
    steps leading to reading an item containing the
    needed information
  • Query generation
  • Search composition
  • Search execution
  • Scanning results of query to select items to read
  • Reading non-relevant items

6
Objectives of IR Systems (I)
  • Relevant, Retrieved
  • Relevant, Not Retrieved
  • Non-Relevant, Retrieved
  • Non-Relevant, Not Retrieved
  • Relevant vs Needed
  • Recall is non-calculable in practice, because the
    total number of relevant items in the database
    is unknown

7
Objectives of IR Systems (II)
  • Precision
  • Measures retrieval overhead for a particular
    query
  • In the WWW-world, precision is more important
    than recall
  • Recall
  • How well a system is able to retrieve the
    relevant items for users
  • Ideal Precision and Recall
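
The two measures follow directly from the relevant/retrieved categories on the previous slide: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. A minimal sketch, assuming item ids are held in Python sets; the function name and the sample ids are illustrative and not from the slides:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of item ids the system returned
    relevant:  set of item ids that are actually relevant
    """
    hits = len(retrieved & relevant)  # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 10 items retrieved, 3 of them relevant,
# out of 6 relevant items in the whole database.
retrieved = set(range(1, 11))        # items 1..10
relevant = {2, 5, 9, 42, 43, 44}     # 42..44 were never retrieved
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In an operational system the `relevant` set is not known in full, which is why recall can only be estimated.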

8
Objectives of IR Systems (III)
  • 100 rev. items
  • Precision 0.3
  • Recall 0.5
  • [Figure: Ideal Precision/Recall Graph; axes are
    Precision and Recall, each from 0.0 to 1.0]
9
Objectives of IR Systems (IV)
  • Support of user search generation
  • How to specify the information a user needs
  • Language ambiguities
  • Vocabulary corpus of a user and item authors
  • Must assist users automatically and through
    interaction in developing a search specification
    that represents the need of users and the writing
    style of diverse authors
  • How to present the search results in a format
    that helps the user determine which items are
    relevant
  • Ranking in order of potential relevance
  • Item clustering and link analysis
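
Ranking by potential relevance can be illustrated with a very simple score. The sketch below counts how often the query terms occur in each item and sorts by that count. This scoring rule is an assumption chosen for brevity, not the method the slides prescribe, and `rank` is a hypothetical helper name.

```python
def rank(items, query_terms):
    """Order item ids by a toy relevance score (query-term occurrence count)."""
    scored = []
    for item_id, tokens in items.items():
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:                      # drop items with no matching term
            scored.append((score, item_id))
    # Highest score first; ties broken by item id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored]


# Hypothetical tokenized items.
ranking_items = {
    1: ["recall", "precision", "recall"],
    2: ["stemming"],
    3: ["precision"],
}
ranked = rank(ranking_items, ["recall", "precision"])
```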

10
Functional Overview
11
Functional Overview
  • Four major functional processes
  • Item Normalization
  • Selective Dissemination of Information
  • Document Database Search
  • Index Database Search, along with the Automatic
    File Build process (supports index files)

12
Total IR System
13
Item Normalization (I)
14
Item Normalization (II)
  • Normalize incoming items to a standard format
  • Language encoding
  • Different file formats
  • Logical restructuring
  • Create a searchable data structure (Indexing)
  • Identification of processing tokens
  • Characterization of the tokens: single words or
    phrases
  • Stemming of the tokens
  • Case-folding of the tokens

15
Item Normalization Process
16
Standardize Input
  • Standardizing the input takes the different
    external formats and translates them into a
    format acceptable to the system.
  • Example
  • Translate foreign-language text into Unicode
  • Advantage
  • Allows a single browser to display the languages
    and potentially a single search system to search
    them.
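
As an illustration of the Unicode example, the sketch below decodes items that arrive in different legacy encodings into Python's Unicode strings and stores them all as UTF-8. The sample texts, the choice of encodings, and the name `to_unicode` are assumptions made for the example; a real system would also have to detect or be told the source encoding.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode externally formatted bytes into a Unicode string."""
    return raw.decode(source_encoding)


# Hypothetical incoming items, each tagged with the encoding it arrived in.
incoming = [
    ("café".encode("latin-1"), "latin-1"),
    ("検索".encode("shift_jis"), "shift_jis"),
    ("поиск".encode("koi8_r"), "koi8_r"),
]

# After standardization every item is a Unicode string ...
standardized = [to_unicode(raw, enc) for raw, enc in incoming]
# ... and can be stored in one common format (here UTF-8).
stored = [text.encode("utf-8") for text in standardized]
```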

17
Logical Subsetting (Zoning)
  • Parse the item into logical sub-divisions that
    have meaning to the user
  • Chapter, Section, Subsection, Reference
  • Structured documents
  • Visible to the user and used to increase the
    precision of a search and optimize the display
  • Allows searches to be restricted to a specific
    zone
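
Restricting a search to a zone can be sketched as follows, assuming each item has already been parsed into a mapping from zone name to text. The zone names, sample items, and the substring match are simplifications for illustration only.

```python
def search(items, term, zone=None):
    """Return ids of items containing term, optionally within one zone only."""
    term = term.lower()
    result = []
    for item_id, zones in items.items():
        # Search a single zone when one is named, otherwise every zone.
        texts = [zones.get(zone, "")] if zone else zones.values()
        if any(term in text.lower() for text in texts):
            result.append(item_id)
    return result


# Hypothetical zoned items.
zoned_items = {
    1: {"Title": "Stemming and recall",
        "Body": "Stemming conflates word variants.",
        "Reference": "A survey of precision measures."},
    2: {"Title": "Measuring precision",
        "Body": "Precision falls as recall grows.",
        "Reference": "Notes on stemming."},
}

anywhere = search(zoned_items, "precision")                 # all zones
title_only = search(zoned_items, "precision", zone="Title")  # Title zone only
```

The unrestricted search also returns the item that mentions the term only in its Reference zone, which is the precision loss that zoning is meant to avoid.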

18
Identify Processing Tokens
  • Identify the information that is used in the
    search process: processing tokens (a better
    term than words)
  • Dividing input symbols into three classes
  • Valid word symbols: alphabetic characters,
    numbers
  • Inter-word symbols (non-searchable): blanks,
    periods
  • Special processing symbols
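
The three symbol classes can be sketched as a small tokenizer. Which characters count as special processing symbols, and how they are handled, is system specific; here the hyphen and apostrophe are assumed to be special and are kept only when they occur inside a token. All names are illustrative.

```python
SPECIAL = "-'"  # assumed special processing symbols


def symbol_class(ch):
    """Classify one input symbol into the three classes."""
    if ch.isalnum():
        return "valid"
    if ch in SPECIAL:
        return "special"
    return "inter-word"


def processing_tokens(text):
    """Split text into processing tokens using the symbol classes."""
    tokens, current = [], []
    for ch in text + " ":  # trailing blank flushes the last token
        kind = symbol_class(ch)
        if kind == "valid":
            current.append(ch)
        elif kind == "special" and current:
            current.append(ch)  # assumed policy: keep inside a token
        else:
            # Inter-word symbol (or a stray special symbol): end the token.
            word = "".join(current).strip(SPECIAL)
            if word:
                tokens.append(word)
            current = []
    return tokens


tokens = processing_tokens("The e-mail arrived. Cost: 42 dollars!")
```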

19
Stop Algorithm
  • Save system resources by eliminating from the set
    of searchable processing tokens those that have
    little value to the search
  • Tokens whose frequency and/or semantic use makes
    them of no use as searchable tokens
  • Any word found in almost every item
  • Any word only found once or twice in the database
  • Frequency × Rank = Constant (Zipf's law)
  • Stop algorithm vs. stop list
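
A frequency-based stop algorithm along the lines of the two rules above (words found in almost every item, and words found only once or twice in the database) might look like the sketch below. The thresholds `max_doc_fraction` and `min_occurrences` are assumed tuning parameters, not values from the slides.

```python
from collections import Counter


def stop_tokens(items, max_doc_fraction=0.9, min_occurrences=3):
    """Return the tokens to exclude from the searchable set.

    items: list of token lists, one list per item
    """
    doc_freq = Counter()    # number of items containing the token
    total_freq = Counter()  # number of occurrences in the whole database
    for tokens in items:
        total_freq.update(tokens)
        doc_freq.update(set(tokens))
    n = len(items)
    stops = set()
    for token, df in doc_freq.items():
        too_common = df / n >= max_doc_fraction
        too_rare = total_freq[token] < min_occurrences
        if too_common or too_rare:
            stops.add(token)
    return stops


# Hypothetical tokenized database of four items.
database = [
    ["the", "retrieval", "system", "stores", "items"],
    ["the", "user", "searches", "items"],
    ["the", "system", "ranks", "items", "for", "the", "user"],
    ["the", "user", "reads", "the", "zyzzyva"],
]
stops = stop_tokens(database)
```

Unlike a fixed stop list, the set is derived from the database itself, so it changes as the collection grows. On a database this small the rare-word rule removes most tokens.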

20
Characterize Tokens
  • Identify any specific word characteristics.
  • Context-Sensitive Semantics
  • Morphological: plane (level, flat); field (area)
  • Uppercase
  • Numbers and dates

21
Stemming Algorithm
  • Normalize the token to a standard semantic
    representation
  • Computer, Compute, Computers, Computing → Comput
  • Reduce the number of unique words the system has
    to contain.
  • e.g., computable, computation, computability
  • small database: saves 32 percent of storage
  • larger databases: 1.6 MB → 20 percent;
    50 MB → 13.5 percent
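
To make the conflation concrete, here is a deliberately naive suffix-stripping stemmer that also case-folds its input. It is a toy written for this example, not a standard algorithm such as Porter's, and the suffix list is an assumption chosen to cover the words on this slide.

```python
# Assumed suffix list, ordered longest first so the longest match wins.
SUFFIXES = ["ability", "ation", "able", "ing", "ers", "er", "es", "s", "e"]


def stem(word):
    """Case-fold a word and strip one known suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["Computer", "Compute", "Computers", "Computing",
         "computable", "computation", "computability"]
stems = {stem(w) for w in words}
```

Seven distinct words collapse into one stored stem, which is where the dictionary savings come from.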

22
Stemming Algorithm (II)
  • Improves the efficiency of the IR system and
    improves recall, but precision declines
  • Expand a search term to similar token
    representations at run time?

23
Create Searchable Data Structure
  • Processing tokens → Stemming Algorithm → update
    to the searchable data structure
  • Internal representation (not visible to user)
  • Signature file, Inverted list, PAT Tree
  • Contains the semantic concepts that represent
    the items in the database.
  • Limits what a user can find as a result of the
    search.
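
Of the structures named above, the inverted list is the simplest to sketch: each processing token maps to the ids of the items that contain it. The code below assumes items have already been tokenized and stemmed; the function names and sample data are illustrative.

```python
from collections import defaultdict


def build_inverted_list(items):
    """Map each processing token to the sorted ids of items containing it."""
    index = defaultdict(set)
    for item_id, tokens in items.items():
        for token in tokens:
            index[token].add(item_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search_and(index, terms):
    """Return ids of items containing every term (Boolean AND)."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical stemmed items.
stemmed_items = {
    1: ["comput", "system"],
    2: ["retriev", "system"],
    3: ["comput", "retriev", "system"],
}
index = build_inverted_list(stemmed_items)
both = search_and(index, ["comput", "retriev"])
```

A token that never entered the index cannot be found, which is the sense in which the structure limits what a search can return.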

24
Total IR System
25
Selective Dissemination of Information (SDI) (I)
  • Dynamically compares newly received items
    against users' statements of interest
  • Composed of
  • Search process
  • User statements of interest (Profiles)
  • User mail files

26
Selective Dissemination of Information (II)
  • Profile
  • As each item is received → it is processed
    against every user's profile
  • Names the user mail file that will receive the
    document if the search statement in the profile
    is satisfied
  • Includes all the areas a user is interested in
  • Mail file (Store SDI Results)
  • Mail file associated with the profile
  • Items are viewed in time order and deleted after
    a specified time period
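
The SDI flow can be sketched as a loop that compares each newly received item against every profile and appends the item to the mail file named by each satisfied profile. The profile layout and the rule that any shared term satisfies the search statement are assumptions made to keep the example short.

```python
from collections import defaultdict

# Hypothetical profiles: a search statement (here a term set) plus a mail file.
profiles = [
    {"user": "alice", "terms": {"stemming", "recall"}, "mail_file": "alice-ir"},
    {"user": "bob", "terms": {"unicode"}, "mail_file": "bob-encoding"},
]
mail_files = defaultdict(list)  # mail file name -> item ids in arrival order


def disseminate(item_id, tokens, profiles, mail_files):
    """Compare one newly received item against every user profile."""
    token_set = set(tokens)
    for profile in profiles:
        # Assumed rule: the profile is satisfied if any of its terms occurs.
        if profile["terms"] & token_set:
            mail_files[profile["mail_file"]].append(item_id)


disseminate(101, ["stemming", "improves", "recall"], profiles, mail_files)
disseminate(102, ["unicode", "encoding"], profiles, mail_files)
disseminate(103, ["database"], profiles, mail_files)
```

Item 103 matches no profile and is therefore not delivered to any mail file.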

27
Total IR System
28
Document Database Search
  • Provides the capability for a query to search
    against all items received by the system
  • Composed of the search process, user-entered
    queries, and the document database.
  • Document database
  • Contains all items that have been received,
    processed, and stored by the system.
  • Items in the Document DB do not change.

29
Total IR System
30
Index Database Search
  • When an item is of interest → save it for future
    reference
  • Public and Private index files
  • Automatic File Build
  • Selects incoming documents and automatically
    determines potential indexing for each item
  • Maybe in RDBMS format, maybe Not

31
Relationship to DBMS
  • IR System
  • Software that has the features and functions
    required to manipulate information items
  • Information is fuzzy text
  • A high probability of not finding all the items a
    user is looking for.
  • The user has to refine his search to locate
    additional items of interest. (iterative search)
  • DBMS
  • optimized to handle structured data
  • Structured data is well-defined data, typically
    represented by tables
  • A specific request returns the desired
    information
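
The contrast can be shown side by side. In the sketch below a structured query against an in-memory SQLite table returns exactly the requested row, while a naive text search misses an item that uses a different word and only finds it after the user refines the query. The table, the texts, and `text_search` are hypothetical.

```python
import sqlite3

# DBMS side: structured data in a table, one specific request.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)",
    [(1, "Car repair", 1998), (2, "Automobile design", 2004)],
)
exact = conn.execute("SELECT id FROM book WHERE year = ?", (2004,)).fetchall()
conn.close()

# IR side: fuzzy text, where authors and users may choose different words.
texts = {
    1: "how to repair a car engine",
    2: "automobile engine design basics",
}


def text_search(texts, terms):
    """Return ids of items whose text contains any of the terms as a word."""
    return sorted(
        item_id
        for item_id, text in texts.items()
        if any(term in text.split() for term in terms)
    )


first_try = text_search(texts, ["car"])              # misses item 2
refined = text_search(texts, ["car", "automobile"])  # iterative refinement
```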