PRINCIPLES OF SEARCHING 17:610:530 02 Lecture 2 Sept'15, 2003

Description:

Norma Medina-Ortiz, Monica Sanchez-Zapata, Evelyn Majewski, Andrew D'Apice. 2004 ... David Lang, Tom Methans, Madeline Murray, Heather Shannon. 42 ...

1
PRINCIPLES OF SEARCHING 17:610:530 (02)
Lecture 2, Sept. 15, 2003
  • Ying Sun
  • SCILS, Rm. 214
  • ysun_at_scils.rutgers.edu

2
Groups
3
Today's Plan
  • Lecture on IR (one hour)
  • Group discussion (40 min)
  • Hands-on Dialog exercise (40 min)

4
IR: the Original Definition
  • Calvin Mooers first introduced the term
    "information retrieval" into the literature of
    documentation in 1950. (Swanson, 1988)
  • "Information retrieval embraces the intellectual
    aspects of the description of information and its
    specification for search, and also whatever
    systems, techniques, or machines are employed to
    carry out the operation."
  • -- Calvin Mooers, 1951

5
IR: Another Definition
  • "Information retrieval is often regarded as being
    synonymous with document retrieval and nowadays,
    with text retrieval, implying that the task of an
    IR system is to retrieve documents or texts with
    information content that is relevant to a user's
    information need."
  • -- Sparck Jones & Willett, 1997

6
IR: Objective and Problems
  • Provide users with effective access to, and
    interaction with, information resources.
  • Problems addressed:
  • 1. How to organize information intellectually?
  • 2. How to specify search and interaction
    intellectually?
  • 3. What systems and techniques to use for those
    processes?

7
IR Models
  • A model depicts (represents) what is involved: a
    choice of features, processes, and things for
    consideration
  • Several IR models have been used over time
  • traditional: oldest, most used, shows the basic
    elements involved
  • interactive: more realistic, favored now, also
    shows the interactions involved; several models
    proposed
  • Each has strengths and weaknesses
  • We start with the traditional model to illustrate
    many points, from general to specific examples

8
Traditional IR Model (1)
  • The classic information retrieval model (Bates,
    1989)

9
Traditional IR Model (2)
  • The standard IR model (Belkin, 1993)

[Diagram: Information need → Representation → Query; Texts → Representation → Surrogate; Query and Surrogate → Comparison → Retrieved Texts → Judgment → Modification]

10
Traditional IR Model (3)
  • The comprehensive IR model (Saracevic)

[Diagram with two branches. System: Acquisition (documents, objects) → Representation (indexing, ...) → File organization (indexed documents). User: Problem (information need) → Representation (question) → Query (search formulation). Both branches meet at Matching (searching) → Retrieved objects, with feedback.]

11
IR components - Acquisition
  • Content: what is in databases
  • In DIALOG: first part of the blue sheets (File
    Description, Subject Coverage)
  • Selection of documents and other objects from
    various sources
  • In blue sheets: Sources
  • Mostly text-based documents
  • Full texts, titles, abstracts ...
  • But also data, statistics, images (e.g. maps,
    trademarks) ...
  • Determines database contents: key to file
    selection

12
IR Components: Information Objects
Representation
  • Types of surrogates
  • Subject index
  • controlled vocabulary: thesaurus
  • free-text terms (even in full texts)
  • Abstract, annotation
  • Bibliographic description
  • author, title, source, date (metadata)

13
IR Components: Information Objects
Representation (2)
  • These surrogates are organized into fields and
    limits
  • Basic index: subject-related fields, including
    title, abstract, descriptor, and identifier.
  • Additional index: other searchable fields, such as
    author, journal name, company name, etc.
  • Manual and automatic techniques
  • advantages and disadvantages

14
IR Components: File Organization
  • Sequential
  • record (document) by record
  • Inverted
  • term by term: list of records under each term
  • Combination: indexes inverted, documents
    sequential
  • for efficient retrieval by computers
  • When only citations are retrieved, separate
    document files are needed
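The difference between the two file organizations can be sketched in a few lines of Python. This is only an illustration: the three records and the function names are made up, and a real system such as DIALOG stores far more per record.

```python
from collections import defaultdict

# Hypothetical three-record file; record numbers and texts are invented.
records = {
    1: "information retrieval systems",
    2: "boolean retrieval",
    3: "information need",
}

def sequential_search(term):
    """Sequential file: scan record by record and test each one."""
    return [rec_id for rec_id, text in sorted(records.items())
            if term in text.split()]

def build_inverted_file(recs):
    """Inverted file: term by term, a list of records under each term."""
    inverted = defaultdict(list)
    for rec_id, text in sorted(recs.items()):
        for term in set(text.split()):
            inverted[term].append(rec_id)
    return dict(inverted)

inverted_file = build_inverted_file(records)

# Both organizations give the same answer; the inverted file gets there
# with one lookup instead of a scan over every record.
print(sequential_search("retrieval"))
print(inverted_file["retrieval"])
```

The combination described above keeps both: the inverted index answers the search, and the sequential document file supplies the records to display.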

15
IR Components: User Problems
  • Related to the task and situation at hand
  • Vary in specificity, clarity
  • Produce the information need
  • Ultimate criterion for effectiveness of retrieval
  • The information need for the same problem may
    change, evolve, or shift during the IR process,
    requiring adjustment in searching
  • Often more than one search for the same problem
    over time

16
IR Components: Information Question
Representation
  • Non-mediated: end user alone
  • Mediated: intermediary and user
  • interviews: human-human interaction
  • Question analysis: selection, elaboration of
    terms
  • Focus toward search terms and logic; selection
    of databases
  • Subject to feedback and changes
  • Various tools: thesaurus ...

17
IR Components: Query (Search Statement)
  • Translation into the system's requirements and
    limits
  • start of human-computer interaction
  • Selection of databases
  • Search strategy: selection of
  • search terms and logic
  • possible fields, delimiters
  • controlled and uncontrolled vocabulary
  • Reiterations from feedback
  • relevance feedback
  • query expansion and modification

18
IR Components - Matching
  • Process of matching, comparing
  • search: what documents in the file match the
    query as stated?
  • Various search algorithms
  • exact match: Boolean
  • still most prevalent
  • best match: ranking by relevance, or other
    criteria
  • increasingly used, e.g. on the web
  • hybrids: incorporating both
  • e.g. Target, Rank in DIALOG

19
Exact match - Boolean search
  • You retrieve exactly what you ask for in the
    query
  • all documents that have the term(s) with the
    logical connection(s), and possible other
    restrictions (e.g. to be in titles), as stated in
    the query
  • exactly: nothing less, nothing more
  • Based on matching that follows the rules of
    Boolean algebra, or algebra of sets
  • new algebra
  • represented by circles in Venn diagrams

20
Best Match
  • The retrieved documents are ranked by how similar
    (close) they are to a query (as calculated by the
    system)
  • similarity is assumed to indicate relevance
  • thus, documents as answers are presented from
    those that are most likely relevant downwards to
    less and less likely relevant; the list can be cut
    at any desired number, e.g. the first 10
  • Algorithms (formulas) are used to determine
    similarity
  • using statistical and/or linguistic properties
  • Web outputs are mostly ranked
  • But DIALOG allows ranking as well, with special
    commands

21
Best Match Cont.
  • Best match process
  • compares a set of query terms with the sets of
    terms in documents
  • calculates a similarity between the query and
    each document based on common terms
  • sorts the documents in order of similarity
  • assumes that the higher ranked documents have a
    higher probability of being relevant
  • allows for cut-off at a chosen number
  • BIG issue: what representation and similarity
    measures are best?
  • considerable research and many tests
  • many proprietary algorithms
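The best match steps above can be sketched in Python. The similarity measure used here (Jaccard overlap of term sets) is an assumption chosen for brevity, not the measure used by DIALOG or any web engine; the documents and the function names are also made up.

```python
def jaccard(query_terms, doc_terms):
    """Similarity based on common terms: shared terms / all distinct terms."""
    union = query_terms | doc_terms
    return len(query_terms & doc_terms) / len(union) if union else 0.0

def best_match(query, documents, cutoff=10):
    """Rank documents by similarity to the query and cut off the list."""
    query_terms = set(query.lower().split())
    scored = [
        (jaccard(query_terms, set(text.lower().split())), doc_id)
        for doc_id, text in documents.items()
    ]
    # Highest similarity first; ties broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:cutoff]]

# Hypothetical mini-collection.
documents = {
    "d1": "information retrieval systems",
    "d2": "boolean retrieval",
    "d3": "venn diagrams",
}

for doc_id, score in best_match("information retrieval", documents, cutoff=2):
    print(doc_id, round(score, 2))
```

Swapping `jaccard` for another measure changes the ranking without touching the rest of the process, which is why the choice of representation and similarity measure is the central research question.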

22
Exact vs. Best Match
  • Boolean
  • allows for logic
  • provides all that has been matched
  • BUT
  • has no particular order of output
  • treats all retrievals equally - from the most to
    least relevant ones
  • often requires examination of large outputs
  • Best match
  • allows for free terminology
  • provides for a ranked output
  • provides for cut-off - any size output
  • BUT
  • does not include logic
  • ranking method (algorithm) not transparent
  • whose relevance?
  • where to cut off?

23
IR Components: Retrieved Documents
  • Various orders of output
  • Last In First Out (LIFO); sorted
  • ranked by relevance
  • ranked by other characteristics
  • Various forms of output
  • In DIALOG: Output options
  • When citations only: linkage to document
    delivery
  • Basis for relevance and utility evaluation by
    users
  • Relevance feedback

24
Strengths of Traditional IR Models
  • List major components in both the system and
    user branches
  • Suggest
  • What to explain to users about the system, if
    needed
  • What to ask of users for more effective searching
    (problem ...)
  • Selection of component(s) for concentration
  • mostly: ever better representation
  • Provide a framework for evaluation of (static)
    aspects

25
Weaknesses
  • Do not address or account for interaction and
    judgment of results by users
  • identify interaction with search only
  • interaction is a much richer process
  • Many types of variables in interaction are not
    reflected
  • Feedback has many types and functions, also not
    shown
  • Evaluation is thus one-sided

26
Break
27
Boolean Algebra
  • Boolean Logical Operators
  • OR Operator: Use OR to group synonymous terms
    when at least one must be present.
  • OR increases the number of records retrieved.
  • AND Operator: Use AND to connect terms when both
    or all must be present.
  • AND decreases the number of records retrieved.
  • NOT Operator: Use NOT to exclude records
    containing a specified term.
  • Use NOT carefully with subject terms; you may
    unintentionally eliminate useful records.
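A minimal sketch of the three operators as set operations, assuming each term maps to the set of record numbers that contain it. The postings and the helper `s` (named after the SELECT command) are invented for illustration.

```python
# Hypothetical postings: term -> set of record numbers containing it.
postings = {
    "apples": {1, 2, 5},
    "oranges": {2, 3},
    "florida": {2, 3, 4},
}

def s(term):
    """Return the set of records for a term (empty if the term is unknown)."""
    return postings.get(term, set())

either_fruit = s("apples") | s("oranges")         # OR: union, more records
fruit_and_florida = either_fruit & s("florida")   # AND: intersection, fewer
fruit_not_florida = either_fruit - s("florida")   # NOT: difference, excludes

print(sorted(either_fruit))
print(sorted(fruit_and_florida))
print(sorted(fruit_not_florida))
```

Note how NOT drops records 2 and 3 entirely, even though both are about the fruits; this is the risk of eliminating useful records mentioned above.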

28
Venn Diagrams Demonstration
Four basic operations
29
Venn Diagrams Cont.
  • Complex statements allowed e.g

(A OR B) AND C: shade what? e.g. (apples OR oranges) AND Florida
(A OR B) NOT C: shade what? e.g. (apples OR oranges) NOT Florida
[Venn diagram: three overlapping circles A, B, and C, with regions numbered 1-7]

30
Order of Operators
  • The rules for the order in which operations are
    done affect the answer, so follow them
  • Order of operation is
  • Operations in parentheses are done first, then
  • NOT
  • AND
  • OR
  • e.g. for A OR B AND C, B AND C is done first, and
    then A OR the result of B AND C; thus this is
    the same as A OR (B AND C)
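The order of operations can be made concrete with a small evaluator. This is a sketch under stated assumptions, not DIALOG's implementation: `evaluate` is a hypothetical helper, terms are single words, NOT is the binary "A NOT B" form used in these slides, and parentheses are assumed to be balanced.

```python
import re

def evaluate(expression, sets):
    """Evaluate a Boolean search statement over named sets of record numbers."""
    tokens = re.findall(r"\(|\)|\w+", expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def atom():
        token = take()
        if token == "(":          # parentheses are resolved first
            result = or_level()
            take()                # consume the closing ")"
            return result
        return set(sets[token])

    def not_level():              # NOT binds tightest
        result = atom()
        while peek() == "NOT":
            take()
            result = result - atom()
        return result

    def and_level():              # then AND
        result = not_level()
        while peek() == "AND":
            take()
            result = result & not_level()
        return result

    def or_level():               # OR binds loosest
        result = and_level()
        while peek() == "OR":
            take()
            result = result | and_level()
        return result

    return or_level()

# Hypothetical sets of record numbers.
sets = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}
print(sorted(evaluate("A OR B AND C", sets)))
print(sorted(evaluate("(A OR B) AND C", sets)))
```

The two statements differ only in parentheses, yet they return different sets, which is the point of the rule.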

31
Exercise
Statement and answer (region numbers to shade):
1. A AND B: 2,5
2. A OR B: 1,2,3,4,5,6
3. A OR B AND C: 1,2,4,5,6
4. (A OR B) AND C: 4,5,6
5. A AND B NOT C: 2
6. A AND (B NOT C): 2
7. (A AND B) NOT C: 2
8. (A NOT B) AND C: 4
9. A NOT (B AND C): 1,2,4
10. A NOT B AND C: 4
11. A AND (B OR C): 2,4,5
12. A AND B OR C: 2,4,5,6,7
13. A OR (B AND C): 1,2,4,5,6
14. A OR B OR C: 1,2,3,4,5,6,7
15. A AND B AND C: 5
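The answer key can be checked with Python sets. The region assignment below is an assumption inferred from the answers themselves (for example, A AND B = 2,5 and A AND B AND C = 5), since the diagram is not reproduced in this transcript. Python's `-`, `&`, and `|` happen to bind in the same order as NOT, AND, and OR, so the statements translate directly.

```python
# Assumed region layout, inferred from the answer key:
# 1 = A only, 2 = A and B, 3 = B only, 4 = A and C,
# 5 = A and B and C, 6 = B and C, 7 = C only.
A = {1, 2, 4, 5}
B = {2, 3, 5, 6}
C = {4, 5, 6, 7}

# (statement, computed regions, regions given in the answer key)
exercise = [
    ("A AND B",         A & B,        {2, 5}),
    ("A OR B",          A | B,        {1, 2, 3, 4, 5, 6}),
    ("A OR B AND C",    A | B & C,    {1, 2, 4, 5, 6}),
    ("(A OR B) AND C",  (A | B) & C,  {4, 5, 6}),
    ("A AND B NOT C",   A & B - C,    {2}),
    ("A AND (B NOT C)", A & (B - C),  {2}),
    ("(A AND B) NOT C", (A & B) - C,  {2}),
    ("(A NOT B) AND C", (A - B) & C,  {4}),
    ("A NOT (B AND C)", A - (B & C),  {1, 2, 4}),
    ("A NOT B AND C",   A - B & C,    {4}),
    ("A AND (B OR C)",  A & (B | C),  {2, 4, 5}),
    ("A AND B OR C",    A & B | C,    {2, 4, 5, 6, 7}),
    ("A OR (B AND C)",  A | (B & C),  {1, 2, 4, 5, 6}),
    ("A OR B OR C",     A | B | C,    {1, 2, 3, 4, 5, 6, 7}),
    ("A AND B AND C",   A & B & C,    {5}),
]

mismatches = [text for text, computed, key in exercise if computed != key]
for number, (text, computed, _) in enumerate(exercise, start=1):
    print(f"{number}. {text}: {sorted(computed)}")
print("mismatches:", mismatches)
```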
32
Basic Dialog Commands
  • BEGIN (b) Command: Accessing a Database
  • Command format: BEGIN n or b n
  • where n is the file number of the database
  • Databases can be searched as a group:
  • b 1,2,6,7,35,47,121,148,202,437,438

33
Basic Dialog Commands (2)
  • SELECT (s) Command: Entering Search Terms
  • Command format: SELECT term
  • The response to a SELECT command appears below:
  • s1   information retrieval   4413
  • s1: set number
  • information retrieval: term(s) as entered in the
    SELECT command
  • 4413: number of records containing the term(s)
    (hits)
  • Always enter a space after the SELECT (S) command
  • Any word or term can be SELECTed, except nine
    non-searchable stop words: an, the, for, and,
    from, to, by, of, with

34
Basic Dialog Commands (3)
  • Truncating Search Terms
  • Open Truncation
  • s path? retrieves all words that begin with
    path: paths, pathos, pathway, pathology
  • Controlled-Length Truncation
  • s path? ? retrieves the root and up to one
    additional character: paths
  • s path?? retrieves the root and up to two
    additional characters: paths, pathos
  • Embedded Character: embedded-character truncation
    can be used for variant spellings
  • select organi?ation retrieves organization,
    organisation
  • select fib??board retrieves fiberboard,
    fibreboard
  • This truncation feature is also useful for
    searching for unusual plural forms
  • select wom?n retrieves woman, women
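The truncation rules above can be mimicked with regular expressions. This is a sketch of the behavior as described on this slide only; `dialog_term_to_regex` and `matches` are hypothetical helpers, and the real Dialog system may handle cases not covered here.

```python
import re

def dialog_term_to_regex(term):
    """Translate a truncated search term into a regular expression."""
    if term.endswith("? ?"):        # root plus up to one extra character
        root, tail = term[:-3], r"\w?"
    elif term.endswith("??"):       # root plus up to two extra characters
        root, tail = term[:-2], r"\w{0,2}"
    elif term.endswith("?"):        # open truncation: any ending
        root, tail = term[:-1], r"\w*"
    else:
        root, tail = term, ""
    # Each embedded ? stands for exactly one character.
    body = r"\w".join(re.escape(part) for part in root.split("?"))
    return re.compile(body + tail, re.IGNORECASE)

def matches(term, word):
    """True if the whole word is retrieved by the truncated term."""
    return dialog_term_to_regex(term).fullmatch(word) is not None

words = ["path", "paths", "pathos", "pathway", "pathology"]
for term in ["path?", "path? ?", "path??"]:
    print(term, [word for word in words if matches(term, word)])
```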

35
Basic Dialog Commands (4)
  • TYPE (t) Command
  • Formats

36
Basic Dialog Commands (5)
  • Logoff command
  • Enter logoff in the command line or click the
    logoff button
  • the system displays estimated costs for the
    current search session, disconnects you from
    Dialog, and erases all of your sets.
  • Logoff Hold command
  • disconnects you temporarily so you can reconnect
    to your search in progress.