Information Retrieval Systems Capabilities - PowerPoint PPT Presentation

Provided by: ccNct
Transcript and Presenter's Notes

Title: Information Retrieval Systems Capabilities


1
Information Retrieval Systems Capabilities
2
Outline
  • Search Capabilities
  • Browse Capabilities
  • Miscellaneous Capabilities

3
Search Capabilities
4
Overview
  • Mapping between a user's specified need and the
    items in the IR system that will answer that
    need
  • Natural language (ranked) or Boolean logic
  • Weighting of search terms
  • Find articles that discuss automobile emissions
    (.9) or sulfur dioxide (.3) on the farming
    industry
  • The search statement may apply to the complete
    item or contain additional parameters limiting it
    to a logical division of the item (i.e., to a
    zone)
  • Note: processing token, word, and term are used
    interchangeably

5
Overview (Cont.)
  • Search functions
  • Relationship between the terms in the search
    statement
  • Boolean, natural language, proximity, contiguous
    word phrases, fuzzy searches
  • Interpretation of a particular word
  • Term masking, numeric and date range, contiguous
    word phrases and concept/thesaurus expansion

6
Boolean Logic
  • Boolean operators
  • AND, OR, NOT, XOR
  • Nesting and default precedence ordering of
    operations
  • Weighting Boolean queries?
  • M of N logic
  • Find any item containing any two of the following
    terms A, B, C
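
The operators on this slide map directly onto set operations over the items that contain each term. The following is a minimal sketch, assuming a hypothetical in-memory index (`index`, mapping each term to a set of item ids) and a hypothetical helper `m_of_n`; neither comes from the presentation.

```python
# Hypothetical inverted index: term -> set of ids of items containing it.
index = {
    "A": {1, 2, 3},
    "B": {1, 3, 4},
    "C": {3, 5},
}

a, b = index["A"], index["B"]
print(a & b)  # A AND B: items containing both terms
print(a | b)  # A OR B: items containing either term
print(a - b)  # A NOT B: items with A but without B
print(a ^ b)  # A XOR B: items with exactly one of the two terms


def m_of_n(index, terms, m):
    """Return the items that contain at least m of the given terms."""
    counts = {}
    for term in terms:
        for item in index.get(term, set()):
            counts[item] = counts.get(item, 0) + 1
    return {item for item, n in counts.items() if n >= m}


# "Find any item containing any two of the following terms: A, B, C"
print(m_of_n(index, ["A", "B", "C"], 2))
```

M of N logic is shorthand for a longer Boolean expression: with three terms and M = 2 it is equivalent to (A AND B) OR (A AND C) OR (B AND C).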

7
Use of Boolean Operators
8
Proximity
  • Restrict the distance allowed within an item
    between two search terms
  • The closer two terms are found in a text, the
    more likely they are related in the description
    of a particular concept
  • Increase the precision of a search
  • Typical format
  • TERM1 within/before/after m units of TERM2
  • Units: characters, words, sentences, or paragraphs
  • Direction operator: before or after
  • Adjacent (ADJ)
  • Distance operator of 1
  • Forward only direction
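
The proximity format above can be sketched over a tokenized item. This is only an illustration, assuming the unit is words and that an item is already split into a list of tokens; the function names `within` and `adj` are made up for the example.

```python
def within(tokens, term1, term2, m, direction="either"):
    """True if term1 occurs within m words of term2.

    direction is "either", "before" (term1 precedes term2)
    or "after" (term1 follows term2).
    """
    pos1 = [i for i, t in enumerate(tokens) if t == term1]
    pos2 = [i for i, t in enumerate(tokens) if t == term2]
    for i in pos1:
        for j in pos2:
            gap = j - i  # positive when term1 comes before term2
            if direction == "before" and 0 < gap <= m:
                return True
            if direction == "after" and 0 < -gap <= m:
                return True
            if direction == "either" and 0 < abs(gap) <= m:
                return True
    return False


def adj(tokens, term1, term2):
    """ADJ: a distance of one word, in the forward direction only."""
    return within(tokens, term1, term2, 1, direction="before")


tokens = "sea birds nest near the open sea".split()
print(within(tokens, "birds", "sea", 1))  # "sea" is directly next to "birds"
print(adj(tokens, "sea", "birds"))        # "sea birds" appears in that order
print(adj(tokens, "birds", "sea"))        # "birds sea" never appears
```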

9
Use of Proximity
10
Contiguous Word Phrase
  • Two or more words treated as a single semantic
    unit
  • Exact phrase, or literal strings
  • United States of America
  • (((United ADJ States) ADJ of) ADJ America)
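
A contiguous word phrase is equivalent to chaining ADJ operators, which in turn means the words must appear consecutively and in order. A rough sketch, assuming whitespace tokenization and case-insensitive matching (both are choices made for this example, as is the name `phrase_match`):

```python
def phrase_match(tokens, phrase):
    """True if the words of phrase occur consecutively, in order, in tokens."""
    words = phrase.lower().split()
    tokens = [t.lower() for t in tokens]
    n = len(words)
    # Slide a window of n tokens across the item and compare it to the phrase.
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


item = "The United States of America is a large country".split()
print(phrase_match(item, "United States of America"))
print(phrase_match("States of the United America".split(),
                   "United States of America"))
```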

11
Fuzzy Search
  • Locate spellings of words similar to the entered
    search term
  • Compensate for errors in the spelling of words
  • Give more weight to similar word lengths or
    similar character positions
  • Computer -- compiter, conputer, computter,
    compute
  • An additional enhancement may look up the proposed
    alternative spelling and, if it is a valid word
    with a different meaning, include it in the
    search with a low ranking or not include it at
    all
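
One common way to approximate this behaviour is edit distance, sketched below. The slide does not name a similarity measure, so Levenshtein distance is an assumption here, as are the function names and the sample vocabulary.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def fuzzy_matches(term, vocabulary, max_distance=1):
    """Vocabulary words within max_distance edits of term, closest first."""
    scored = sorted((edit_distance(term, w), w) for w in vocabulary)
    return [w for d, w in scored if d <= max_distance]


vocabulary = ["compiter", "conputer", "computter", "compute",
              "commuter", "company"]
print(fuzzy_matches("computer", vocabulary))
```

Note that "commuter" is also one edit away from "computer". It is a valid word with a different meaning, which is exactly the case the last bullet suggests ranking low or excluding.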

12
Term Masking
  • The ability to expand a query term by masking a
    portion of the term and accepting as valid any
    processing token that maps to the unmasked
    portion of the term
  • Valuable for systems that do not perform stemming
    or only provide a very simple stemming algorithm
  • Examples
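  • Fixed-length don't care: a single masked position
  • A masked form of multinational matches
    multi-national
  • Variable-length don't care: any number of masked
    characters
  • Suffix search: mask before computer
  • Prefix search: mask after computer
  • Imbedded string search: mask on both sides of
    computer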
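
The masking cases above can be tried out with shell-style wildcards. The mask symbols used in the original slide did not survive, so `*` (variable length) and `?` (fixed length) are assumptions borrowed from Python's `fnmatch` module; the vocabulary and the helper `expand_mask` are likewise illustrative.

```python
from fnmatch import fnmatchcase


def expand_mask(pattern, vocabulary):
    """Return every vocabulary term that the masked pattern accepts."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]


vocabulary = ["compute", "computer", "computers", "minicomputer",
              "microcomputers", "multinational", "multi-national"]

print(expand_mask("*computer", vocabulary))       # suffix search
print(expand_mask("computer*", vocabulary))       # prefix search
print(expand_mask("*computer*", vocabulary))      # imbedded string search
print(expand_mask("multi?national", vocabulary))  # fixed-length mask
```

In this sketch `?` stands for exactly one character, so `multi?national` accepts `multi-national` but not `multinational`; a system could equally define its fixed-length mask to accept an absent character.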

13
Term Masking
14
Numeric and Date Ranges
  • Term masking is useful when applied to words, but
    does not work for finding ranges of numbers or
    numeric dates
  • To find numbers larger than 125, masking the
    term 125 does not work
  • A system that characterizes words as numbers or
    dates allows for specialized numeric or date range
    processing against those words
  • Examples
  • 124-425, 4/2/93-5/2/95, gt123, or
    lt4/2/99.
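
A small sketch of what characterizing words as numbers or dates could look like. It assumes integer numbers and dates written as month/day/two-digit year (as in the examples above); `parse_token` and `in_range` are hypothetical names.

```python
from datetime import date, datetime


def parse_token(token):
    """Classify a token as an int, a date (m/d/yy), or plain text."""
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return datetime.strptime(token, "%m/%d/%y").date()
    except ValueError:
        return token  # not a number or date: leave it as a word


def in_range(token, low, high):
    """True if the token has the same type as the bounds and lies between them."""
    value = parse_token(token)
    return type(value) is type(low) and low <= value <= high


print(in_range("300", 124, 425))
print(in_range("1250", 124, 425))   # starts with "125" but is out of range
print(in_range("6/15/94", date(1993, 4, 2), date(1995, 5, 2)))
print(in_range("computer", 124, 425))
```

The second call shows why masking fails for ranges: a string comparison would treat 1250 as close to 125, whereas a numeric comparison does not.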

15
Concept/Thesaurus Expansion
  • A thesaurus is typically a one-level or
    two-level expansion of a term to other terms that
    are similar in meaning
  • Concept class
  • Tree structure that expands each meaning of a
    word into potential concepts that are related to
    the initial term
  • Network structure that links word stems
  • Show associations that are not normally found in
    a language based thesaurus (e.g. negative
    advertising vs. election)
  • Generalization and specificity
  • Assist a user who has minimal knowledge of a
    concept domain by allowing the user to expand
    upon a particular concept showing related concepts

16
Thesaurus for Computer
Computer
CPU
DataProcessor
Multitasking Computer
PC
Minicomputer
Main Frame
17
Hierarchical Concept Class Structure for Computer
Computer
Hardware
Software
Processor
Peripheral
OS
Application
Network
18
Concept/Thesaurus Expansion (Cont.)
  • Semantic-based Thesauri
  • A listing of words and then other words that are
    semantically similar
  • In executing a query,
  • A term can be expanded to all related terms in
    the thesaurus or concept tree.
  • Optionally, the user may display the thesaurus or
    concept tree and indicate which related terms
    should be used in a query
  • Eliminate synonyms which introduce meanings that
    are not in the user's search statement (fields
    and pasture lands vs. magnetic fields)
  • Generic to a language and can introduce many
    search terms that are not found in the document
    database
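
Query-time expansion can be sketched as replacing each term with an OR-group of the term and its thesaurus entries. The entries below reuse the "Thesaurus for Computer" slide; the data layout, the function `expand_query`, and the `selected` parameter (modelling the user picking related terms) are assumptions for illustration.

```python
# One-level thesaurus, using the entries from the "Thesaurus for Computer" slide.
thesaurus = {
    "computer": ["cpu", "data processor", "multitasking computer",
                 "pc", "minicomputer", "main frame"],
}


def expand_query(terms, thesaurus, selected=None):
    """Turn each query term into an OR-group of itself plus related terms.

    If selected is given, only related terms the user picked are added.
    The groups are meant to be ANDed together by the search engine.
    """
    groups = []
    for term in terms:
        related = thesaurus.get(term, [])
        if selected is not None:
            related = [r for r in related if r in selected]
        groups.append([term] + related)
    return groups


print(expand_query(["computer", "price"], thesaurus))
print(expand_query(["computer", "price"], thesaurus, selected={"pc"}))
```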

19
Concept/Thesaurus Expansion (Cont.)
  • Statistics-based Thesauri
  • Statistically related to other words in the same
    document database by co-occurrence frequency
  • Very dependent upon the database being searched
    and may not be portable to other databases
  • Statistical thesauri are frequently used as
    automatic expansions of a user's search without the
    user directly interacting with the thesaurus
  • Thesauri and concept trees could be used to
    either expand a search statement with additional
    terms or make it more specific by substituting
    more specific terms
  • Expand by generalization: increases recall and
    decreases precision
  • Restrict by specificity: increases precision and
    decreases recall
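
For the statistics-based thesaurus described on this slide, a very simple co-occurrence count might look like the sketch below. The counting rule (number of documents shared by a pair of terms), the threshold `min_count`, and the toy documents are all assumptions; real systems use more refined statistics.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents):
    """Count, for each pair of terms, how many documents contain both."""
    pairs = Counter()
    for doc in documents:
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return pairs


def related_terms(term, pairs, min_count=2):
    """Terms that share at least min_count documents with term."""
    related = []
    for (a, b), n in pairs.items():
        if n >= min_count and term in (a, b):
            related.append(b if a == term else a)
    return sorted(related)


documents = [
    ["negative", "advertising", "election"],
    ["election", "advertising", "campaign"],
    ["advertising", "price"],
]
pairs = cooccurrence(documents)
print(related_terms("election", pairs))
```

Because the counts come from one document database, the resulting thesaurus reflects that database only, which is why it may not be portable to others.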

20
Thesaurus Example
21
Thesaurus Example (Cont.)
22
Natural Language Queries
  • Allow a user to enter a prose statement that
    describes the information that the user wants to
    find
  • The longer the prose, the more accurate the
    results
  • Most difficult logic case: negation
  • Transform from natural language queries to
    Boolean
  • Improve recall/Decrease precision (when negation
    is required)

23
Natural Language Queries (Cont.)
  • Find for me all the items that discuss oil
    reserves and current attempts to find new oil
    reserves. Include any items that discuss the
    international financial aspects of the oil
    production process. Do not include items about
    the oil industry in the United States
  • Oil reserves and attempts to find new oil
    reserves, international financial aspects of oil
    production, not United States oil industry (users
    tend to enter sentence fragments)
  • (locate AND new AND oil reserves) OR
    (international AND financ AND oil
    production) NOT (oil industry AND United
    States)

24
Search Interface
Simple things should be simple, complex things
should be possible
Simple
Advanced
25
Browse Capabilities
26
Overview
  • Once the search is complete, browse capabilities
    provide the user with the capability to determine
    which items are of interest and select those to
    be displayed
  • Two ways of displaying a summary of the searching
    results
  • Line item status
  • Data visualization
  • Since searches return many items that are not
    relevant to the user's information need, browse
    capabilities can assist the user in focusing on
    items that have the highest likelihood of meeting
    that need

27
Browsing an Alphabetical List of Titles
28
Browsing A Classification Hierarchy
29
Browsing A Classification Hierarchy (Cont.)
30
Ranking
  • In general, an unweighted Boolean system does
    not have the idea of ranking
  • With the introduction of ranking based upon
    predicted relevance values, the status summary
    displays the relevance score associated with the
    item along with a brief descriptor of the item
  • The relevance score is the search system's
    estimate of how closely the item satisfies the
    search statement (ranging from 1.0 down to 0.0)
  • Some systems may create relevance categories
    (High, Medium High) and indicate which category
    (by color) an item belongs to
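
A ranked status summary can be sketched as a list sorted by relevance score, with each score also mapped to a category. The thresholds, the category names beyond "High" and "Medium High", and the sample hits are illustrative assumptions, not values from the presentation.

```python
def relevance_category(score):
    """Map a score in [0.0, 1.0] to a coarse category (thresholds are illustrative)."""
    if score >= 0.75:
        return "High"
    if score >= 0.5:
        return "Medium High"
    if score >= 0.25:
        return "Medium"
    return "Low"


def status_summary(hits):
    """Format (score, descriptor) pairs as lines, highest score first."""
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    return [f"{score:.2f}  {relevance_category(score):<11}  {descriptor}"
            for score, descriptor in ranked]


hits = [(0.42, "Farm subsidies report"),
        (0.91, "Automobile emissions and crops"),
        (0.58, "Sulfur dioxide levels")]
for line in status_summary(hits):
    print(line)
```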

31
Ranking Example
32
Zoning
  • The user wants to see the minimum information
    needed to determine if the item is relevant.
  • Title, abstract
  • Once the determination is made, the user can
    display the complete item for detailed review.

33
Highlighting
  • Highlighting lets the user quickly focus on the
    potentially relevant parts of the text to scan
    for item relevance
  • Different strengths of highlighting indicate how
    strongly the highlighted word participated in the
    selection of the item
  • Most systems allow the display of an item to
    begin with the first highlighting within the item
    and allow subsequent jumping to the next
    highlight
  • Another capability is for the system to determine
    the passage in the document most relevant to the
    query and position the browse to start at that
    passage
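
Highlighting and "jump to next highlight" can be sketched with a regular expression over the item text. This assumes whole-word, case-insensitive matching of a non-empty list of search terms and uses `**` as a stand-in for visual emphasis; the function names are hypothetical.

```python
import re


def _term_pattern(terms):
    """Regex matching any of the terms as whole words, ignoring case."""
    alternatives = "|".join(re.escape(t) for t in terms)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def highlight(text, terms, marker="**"):
    """Wrap every occurrence of a search term in the marker."""
    return _term_pattern(terms).sub(
        lambda m: f"{marker}{m.group(0)}{marker}", text)


def highlight_positions(text, terms):
    """Character offsets of each highlight, for jumping to the next one."""
    return [m.start() for m in _term_pattern(terms).finditer(text)]


text = "Oil reserves and new oil fields"
print(highlight(text, ["oil"]))
print(highlight_positions(text, ["oil"]))
```

The first offset is where a display would start, and the remaining offsets support jumping from one highlight to the next.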

34
Highlighting (Cont.)
  • Highlighting has always been useful in Boolean
    systems to indicate the cause of the retrieval
  • Using natural language processing, automatic
    expansion of terms via thesauri, and similarity
    ranking, highlighting loses some of its value

35
Miscellaneous Capabilities
36
Overview
  • Facilitate the user's ability to
  • Input queries
  • Reduce the time it takes to generate the queries
  • Reduce a priori the probability of entering a
    poor query
  • Miscellaneous capabilities
  • Vocabulary browse
  • Iterative search and search history log
  • Canned queries

37
Vocabulary Browsing

Compromise      53
Comptroller     18
Compulsion       5
Compulsive      22
Compulsory       4
Comput
Computation    265
Compute       1245
Computen         1
Computer    10,800
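
A vocabulary browse can be sketched as a lookup in an alphabetically sorted term list that shows the entries around whatever the user typed. The counts reuse the list on this slide; the `browse` function and its window sizes are assumptions.

```python
from bisect import bisect_left

# (term, count) pairs in alphabetical order, taken from the slide's list.
vocabulary = [
    ("compromise", 53), ("comptroller", 18), ("compulsion", 5),
    ("compulsive", 22), ("compulsory", 4), ("computation", 265),
    ("compute", 1245), ("computen", 1), ("computer", 10800),
]


def browse(vocabulary, entered, before=2, after=3):
    """Return the vocabulary entries surrounding the entered string."""
    terms = [term for term, _ in vocabulary]
    i = bisect_left(terms, entered)  # where the entered string would sort
    return vocabulary[max(0, i - before):i + after]


for term, count in browse(vocabulary, "comput"):
    print(f"{term:<12} {count:>6}")
```

Seeing "computen" with a count of 1 next to "computer" with 10,800 lets a user spot a likely misspelling before running the search.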
38
Vocabulary Browsing Example
39
Iterative Search
  • Frequently a search returns a hit file containing
    many more items than the user wants to review
  • Rather than typing in a complete new query, the
    results of the previous search can be used as a
    constraining list to create a new query that is
    applied against it
  • This has the same effect as taking the original
    query and adding an additional search statement
    to it with an AND condition
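
The equivalence with an AND condition is easy to see when hit files are modelled as sets. A minimal sketch, assuming a hypothetical term-to-items index and a `search` function with an optional `within` argument for the constraining list:

```python
# Hypothetical inverted index: term -> set of ids of items containing it.
index = {
    "oil": {1, 2, 3, 4, 5, 6},
    "reserves": {2, 3, 5, 9},
    "international": {3, 5, 7},
}


def search(index, term, within=None):
    """Items containing term, optionally constrained to an earlier hit file."""
    hits = index.get(term, set())
    return hits if within is None else hits & within


first = search(index, "oil")
second = search(index, "reserves", within=first)         # refine first
third = search(index, "international", within=second)    # refine again
print(sorted(third))
```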

40
Search History Log
  • During a login session, a user could execute many
    queries to locate the needed information
  • To facilitate locating previous searches as
    starting points for new searches, search history
    logs are available
  • The search history log is the capability to
    display all the previous searches that were
    executed during the current session
  • The query along with the search completion status
    showing number of hits is displayed
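
A search history log can be as simple as a per-session list of queries and hit counts. The class and method names below are hypothetical, and single-term queries against a toy index keep the sketch short.

```python
class SearchSession:
    """Runs searches and records each query with its number of hits."""

    def __init__(self, index):
        self.index = index
        self.history = []  # (query, number of hits), in execution order

    def run(self, term):
        hits = self.index.get(term, set())
        self.history.append((term, len(hits)))
        return hits

    def show_history(self):
        """One display line per previous search in this session."""
        return [f"{n}: {query} ({hits} hits)"
                for n, (query, hits) in enumerate(self.history, start=1)]


session = SearchSession({"oil": {1, 2, 3}, "reserves": {2, 3}})
session.run("oil")
session.run("reserves")
session.run("uranium")
for line in session.show_history():
    print(line)
```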

41
Search History
42
Canned Query
  • The capability to name a query and store it to be
    retrieved and executed during a later user
    session is called canned or stored queries
  • Users tend to have areas of interest within which
    they execute their searches on a regular basis
  • A canned query allows a user to create and refine
    a search that focuses on the user's general area
    of interest one time and then retrieve it to add
    additional search criteria to retrieve data that
    is currently needed
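
Storing a named query so that a later session can retrieve and extend it could be sketched as follows. The JSON file format, the function names, and the query syntax are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def save_query(path, name, query):
    """Store the query text under name, keeping earlier stored queries."""
    path = Path(path)
    stored = json.loads(path.read_text()) if path.exists() else {}
    stored[name] = query
    path.write_text(json.dumps(stored, indent=2))


def load_query(path, name):
    """Retrieve a previously stored query by name."""
    return json.loads(Path(path).read_text())[name]


with tempfile.TemporaryDirectory() as folder:
    store = Path(folder) / "canned_queries.json"
    save_query(store, "oil-watch", "oil AND reserves")
    # A later session retrieves the canned query and adds current criteria.
    refined = load_query(store, "oil-watch") + " AND international"

print(refined)
```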