Title: PRINCIPLES OF SEARCHING 17:610:530 (02), Lecture 2, Sept. 15, 2003
1 PRINCIPLES OF SEARCHING 17:610:530 (02), Lecture 2, Sept. 15, 2003
- Ying Sun
- SCILS, Rm. 214
- ysun_at_scils.rutgers.edu
2 Groups
3 Today's Plan
- Lecture on IR (one hr)
- Group discussion (40 mins)
- Hands-on Dialog exercise (40 mins)
4 IR: the Original Definition
- Calvin Mooers first introduced the term "information retrieval" into the literature of documentation in 1950 (Swanson, 1988).
- "Information retrieval embraces the intellectual aspects of the description of information and its specification for search, and also whatever systems, techniques, or machines are employed to carry out the operation." (Calvin Mooers, 1951)
5 IR: Another Definition
- "Information retrieval is often regarded as being synonymous with document retrieval and nowadays, with text retrieval, implying that the task of an IR system is to retrieve documents or texts with information content that is relevant to a user's information need." (Sparck Jones & Willett, 1997)
6 IR: Objective and Problems
- Provide users with effective access to, and interaction with, information resources.
- Problems addressed:
  - 1. How to organize information intellectually?
  - 2. How to specify search and interaction intellectually?
  - 3. What systems and techniques to use for those processes?
7 IR Models
- Model: depicts, represents what is involved - a choice of features, processes, things for consideration
- Several IR models used over time
  - traditional: oldest, most used, shows basic elements involved
  - interactive: more realistic, favored now, shows also interactions involved; several models proposed
- Each has strengths, weaknesses
- We start with the traditional model to illustrate many points - from general to specific examples
8 Traditional IR Model (1)
- The classic information retrieval model (Bates, 1989)
9 Traditional IR Model (2)
- The standard IR model (Belkin, 1993)
[Figure labels: Information need, Representation, Query; Texts, Representation, Surrogate; Comparison; Retrieved texts; Judgment; Modification]
10 Traditional IR Model (3)
- The comprehensive IR model (Saracevic)
[Figure: a System branch and a User branch meet at Matching (searching), which yields Retrieved objects, with feedback]
- System: Acquisition (documents, objects); Representation (indexing, ...); File organization (indexed documents)
- User: Problem (information need); Representation (question); Query (search formulation)
11 IR Components: Acquisition
- Content: what is in databases
  - In DIALOG: first part of blue sheets - File Description, Subject Coverage
- Selection of documents and other objects from various sources
  - In blue sheets: Sources
- Mostly text-based documents
  - Full texts, titles, abstracts ...
  - But also: data, statistics, images (e.g. maps, trademarks) ...
- Determines database contents: key to file selection
12 IR Components: Information Objects Representation
- Types of surrogates
- Subject index
  - controlled vocabulary - thesaurus
  - free text terms (even in full texts)
- Abstract, annotation
- Bibliographic description
  - author, title, source, date ... metadata
13 IR Components: Information Objects Representation (2)
- These surrogates are organized into fields and limits
  - Basic index: subject-related fields, including title, abstract, descriptor, and identifier.
  - Additional index: other searchable fields, such as author, journal name, company name, etc.
- Manual and automatic techniques
  - advantages and disadvantages
14 IR Components: File Organization
- Sequential
  - record (document) by record
- Inverted
  - term by term: list of records under each term
- Combination: indexes inverted, documents sequential
  - for efficient retrieval by computers
- When only citations are retrieved: need for document files
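The difference between a sequential and an inverted file can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `records`, the whitespace tokenization, and the function name `build_inverted_file` are hypothetical and do not describe how DIALOG stores its files.

```python
from collections import defaultdict

# Sequential file: record (document) by record. Toy data, assumed for illustration.
records = {
    1: "information retrieval systems",
    2: "boolean search in online databases",
    3: "retrieval of chemical information",
}

def build_inverted_file(records):
    """Return a term -> set-of-record-ids mapping (term by term, records under each term)."""
    inverted = defaultdict(set)
    for record_id, text in records.items():
        for term in text.lower().split():
            inverted[term].add(record_id)
    return dict(inverted)

inverted_file = build_inverted_file(records)

# Looking up a term no longer requires scanning every record.
print(sorted(inverted_file["retrieval"]))
```

The combination described on the slide corresponds to keeping both structures: `inverted_file` answers the search, and `records` supplies the documents for display.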
15 IR Components: User Problems
- Related to task and situation at hand
- Vary in specificity, clarity
- Produces information need
  - Ultimate criterion for effectiveness of retrieval
- Information need for the same problem may change, evolve, shift during the IR process - adjustment in searching
  - Often more than one search for the same problem over time
16 IR Components: Information Question Representation
- Non-mediated: end user alone
- Mediated: intermediary and user
  - interviews: human-human interaction
- Question analysis: selection, elaboration of terms
- Focus toward search terms and logic, selection of databases
- Subject to feedback and changes
- Various tools: thesaurus ...
17 IR Components: Query (Search Statement)
- Translation into the system's requirements and limits
  - start of human-computer interaction
- Selection of databases
- Search strategy - selection of:
  - search terms and logic
  - possible fields, delimiters
  - controlled and uncontrolled vocabulary
- Reiterations from feedback
  - relevance feedback
  - query expansion and modification
18 IR Components: Matching
- Process of matching, comparing
  - search: what documents in the file match the query as stated?
- Various search algorithms
  - exact match - Boolean
    - still most prevalent
  - best match - ranking by relevance, or other criteria
    - increasingly used, e.g. on the web
  - hybrids incorporating both
    - e.g. Target, Rank in DIALOG
19 Exact Match - Boolean Search
- You retrieve exactly what you ask for in the query
  - all documents that have the term(s) with logical connection(s), and possible other restrictions (e.g. to be in titles), as stated in the query
  - exactly: nothing less, nothing more
- Based on matching following the rules of Boolean algebra, or algebra of sets - "new algebra"
  - presented by circles in Venn diagrams
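Exact match as an algebra of sets can be sketched with Python's built-in set operators. The records, field names, and the helper `postings` below are assumptions made for the example, not DIALOG data or commands; the point is only that OR, AND, and NOT correspond to union, intersection, and difference.

```python
# Toy records with two fields each (hypothetical data for illustration).
records = {
    1: {"title": "apples in florida", "abstract": "orchard yields"},
    2: {"title": "oranges", "abstract": "oranges grown in florida"},
    3: {"title": "apples in washington", "abstract": "apples and cold storage"},
    4: {"title": "florida tourism", "abstract": "beaches and parks"},
}

def postings(term, field=None):
    """Return ids of records containing the term, optionally restricted to one field."""
    hits = set()
    for record_id, fields in records.items():
        texts = [fields[field]] if field else list(fields.values())
        if any(term in text.lower().split() for text in texts):
            hits.add(record_id)
    return hits

apples, oranges, florida = postings("apples"), postings("oranges"), postings("florida")

both = (apples | oranges) & florida          # (apples OR oranges) AND florida
excluded = (apples | oranges) - florida      # (apples OR oranges) NOT florida
in_title = (apples | oranges) & postings("florida", field="title")  # restricted to titles

print(sorted(both), sorted(excluded), sorted(in_title))
```

Each result contains exactly the records that satisfy the statement, in no particular order, which is the "nothing less, nothing more" behavior of exact match.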
20 Best Match
- The retrieved documents are ranked by how similar (close) they are to a query (as calculated by the system)
  - similarity assumed as relevance
  - thus, documents as answers are presented from those that are most likely relevant downwards to less and less likely relevant - can be cut at any desired number, e.g. first 10
- Algorithms (formulas) used to determine similarity
  - using statistical and/or linguistic properties
- Web outputs are mostly ranked
  - But DIALOG allows ranking as well, with special commands
21 Best Match (Cont.)
- Best match process
  - compares a set of query terms with the sets of terms in documents
  - calculates a similarity between the query and each document based on common terms
  - sorts the documents in order of similarity
  - assumes that the higher ranked documents have a higher probability of being relevant
  - allows for cut-off at a chosen number
- BIG issue: what representation and similarity measures are best?
  - considerable research, many tests
  - many proprietary algorithms
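The best match steps listed above can be sketched as follows. The similarity measure here (shared terms divided by all distinct terms, i.e. the Jaccard coefficient) is an assumption chosen for simplicity; the slides do not name a formula, and real systems use other, often proprietary, measures. The documents and the function name `rank_best_match` are likewise hypothetical.

```python
def rank_best_match(query, documents, cutoff=10):
    """Rank documents by term overlap with the query and keep the top `cutoff`."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        doc_terms = set(text.lower().split())
        common = query_terms & doc_terms           # terms shared with the query
        union = query_terms | doc_terms
        score = len(common) / len(union) if union else 0.0
        scored.append((score, doc_id))
    # Highest similarity first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:cutoff]]

documents = {
    1: "information retrieval systems",
    2: "retrieval of chemical information from databases",
    3: "library management",
}

ranking = rank_best_match("information retrieval", documents, cutoff=2)
print(ranking)
```

Note that the query carries no logic: the two terms are simply compared against each document, and the cut-off decides how much of the ranked list is shown.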
22 Exact vs. Best Match
- Boolean
  - allows for logic
  - provides all that has been matched
- BUT
  - has no particular order of output
  - treats all retrievals equally - from the most to least relevant ones
  - often requires examination of large outputs
- Best match
  - allows for free terminology
  - provides for a ranked output
  - provides for cut-off - any size output
- BUT
  - does not include logic
  - ranking method (algorithm) not transparent
  - whose relevance?
  - where to cut off?
23 IR Components: Retrieved Documents
- Various orders of output
  - Last In First Out (LIFO); sorted
  - ranked by relevance
  - ranked by other characteristics
- Various forms of output
  - In DIALOG: output options
- When citations only: linkage to document delivery
- Base for relevance, utility evaluation by users
- Relevance feedback
24 Strengths of Traditional IR Models
- Lists major components in both system and user branches
- Suggests:
  - What to explain to users about the system, if needed
  - What to ask of users for more effective searching (problem ...)
- Selection of component(s) for concentration
  - mostly: ever better representation
- Provides a framework for evaluation of (static) aspects
25 Weaknesses
- Does not address nor account for interaction and judgment of results by users
  - identifies interaction with search only
  - interaction is a much richer process
- Many types of variables in interaction not reflected
- Feedback has many types and functions - also not shown
- Evaluation thus one-sided
26 Break
27 Boolean Algebra
- Boolean Logical Operators
- OR operator: use OR to group synonymous terms when at least one must be present.
  - OR increases the number of records retrieved.
- AND operator: use AND to connect terms when both or all must be present.
  - AND decreases the number of records retrieved.
- NOT operator: use NOT to exclude records containing a specified term.
  - Use NOT carefully with subject terms: you may unintentionally eliminate useful records.
28 Venn Diagrams: Demonstration
Four basic operations
29 Venn Diagrams (Cont.)
- Complex statements allowed, e.g.
  - (A OR B) AND C - shade what? e.g. (apples OR oranges) AND Florida
  - (A OR B) NOT C - shade what? e.g. (apples OR oranges) NOT Florida
[Figure: Venn diagram of three overlapping circles A, B, and C, with the regions numbered 1 to 7]
30 Order of Operators
- Rules for the order in which operations are done affect the answer - so follow them
- Order of operations is:
  - First, all operations in parentheses are done, then
  - NOT
  - AND
  - OR
- e.g. for A OR B AND C, B AND C is done first, and then A OR the result of B AND C; thus this is the same as A OR (B AND C)
31 Exercise
Statement: answer (numbered regions of the Venn diagram)
1. A AND B: 2,5
2. A OR B: 1,2,3,4,5,6
3. A OR B AND C: 1,2,4,5,6
4. (A OR B) AND C: 4,5,6
5. A AND B NOT C: 2
6. A AND (B NOT C): 2
7. (A AND B) NOT C: 2
8. (A NOT B) AND C: 4
9. A NOT (B AND C): 1,2,4
10. A NOT B AND C: 4
11. A AND (B OR C): 2,4,5
12. A AND B OR C: 2,4,5,6,7
13. A OR (B AND C): 1,2,4,5,6
14. A OR B OR C: 1,2,3,4,5,6,7
15. A AND B AND C: 5
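The exercise can be checked mechanically with a small evaluator that applies the order of operators from the previous slide (parentheses, then NOT, then AND, then OR). This is a sketch under assumptions: the region sets for A, B, and C are inferred from the answer key rather than read from the diagram, NOT is treated as a binary "but not" operator as in the exercise statements, and the names `REGIONS` and `evaluate` are hypothetical.

```python
import re

# Inferred from the answer key: the numbered regions lying inside each circle.
REGIONS = {"A": {1, 2, 4, 5}, "B": {2, 3, 5, 6}, "C": {4, 5, 6, 7}}

def evaluate(query, sets=REGIONS):
    """Evaluate a Boolean statement with precedence: parentheses, NOT, AND, OR."""
    tokens = re.findall(r"\(|\)|\w+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        tok = take()
        if tok == "(":
            result = or_level()
            take()  # consume the closing parenthesis
            return result
        return set(sets[tok])

    def not_level():          # binds tightest: X NOT Y is set difference
        result = atom()
        while peek() == "NOT":
            take()
            result = result - atom()
        return result

    def and_level():          # intersection
        result = not_level()
        while peek() == "AND":
            take()
            result = result & not_level()
        return result

    def or_level():           # binds loosest: union
        result = and_level()
        while peek() == "OR":
            take()
            result = result | and_level()
        return result

    return or_level()

EXERCISE = [
    ("A AND B", {2, 5}), ("A OR B", {1, 2, 3, 4, 5, 6}),
    ("A OR B AND C", {1, 2, 4, 5, 6}), ("(A OR B) AND C", {4, 5, 6}),
    ("A AND B NOT C", {2}), ("A AND (B NOT C)", {2}),
    ("(A AND B) NOT C", {2}), ("(A NOT B) AND C", {4}),
    ("A NOT (B AND C)", {1, 2, 4}), ("A NOT B AND C", {4}),
    ("A AND (B OR C)", {2, 4, 5}), ("A AND B OR C", {2, 4, 5, 6, 7}),
    ("A OR (B AND C)", {1, 2, 4, 5, 6}), ("A OR B OR C", {1, 2, 3, 4, 5, 6, 7}),
    ("A AND B AND C", {5}),
]

# Statements whose computed regions differ from the answer key.
mismatches = [q for q, expected in EXERCISE if evaluate(q) != expected]
print(mismatches)
```

An empty list means every statement agrees with the key under the inferred region numbering. Statements 3 and 13 come out identical, which is the precedence rule at work.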
32 Basic Dialog Commands
- BEGIN (b) command: accessing a database
  - Command format: BEGIN n or b n
  - where n is the file number of the database
- Databases can be searched as a group
  - b 1,2,6,7,35,47,121,148,202,437,438
33 Basic Dialog Commands (2)
- SELECT (s) command: entering search terms
  - Command format: SELECT term
- The response to a SELECT command appears below
  - s1 information retrieval 4413
  - s1: set number
  - information retrieval: term(s) as entered in the SELECT command
  - 4413: number of records containing the term(s) (hits)
- Always enter a space after the SELECT (s) command
- Any word or term can be SELECTed, except nine non-searchable stop words: an, the, for, and, from, to, by, of, with
34 Basic Dialog Commands (3)
- Truncating search terms
- Open truncation
  - s path? retrieves all words that begin with path: paths, pathos, pathway, pathology
- Controlled-length truncation
  - s path? ? retrieves the root and up to one additional character: paths
  - s path?? retrieves the root and up to two additional characters: paths, pathos
- Embedded character: embedded character truncation can be used for variant spellings
  - select organi?ation retrieves organization, organisation
  - select fib??board retrieves fiberboard, fibreboard
- This truncation feature is also useful for searching for unusual plural forms
  - select wom?n retrieves woman, women
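The truncation rules on this slide can be mimicked with regular expressions. The sketch below covers only the four cases described here (trailing `?`, trailing `? ?`, trailing `??`, and embedded `?` standing for one character); it is not Dialog's implementation, and the function names and word list are hypothetical.

```python
import re

def truncation_to_regex(term):
    """Translate a truncated search term into a regex, following the slide's rules only."""
    term = term.strip().lower()
    if term.endswith("? ?"):      # root plus at most one additional character
        root, tail = term[:-3], r"\w?"
    elif term.endswith("??"):     # root plus at most two additional characters
        root, tail = term[:-2], r"\w{0,2}"
    elif term.endswith("?"):      # open truncation: any number of characters
        root, tail = term[:-1], r"\w*"
    else:
        root, tail = term, ""
    # Any remaining ? is embedded and stands for exactly one character.
    body = "".join(r"\w" if ch == "?" else re.escape(ch) for ch in root)
    return re.compile(body + tail)

def matching_words(term, vocabulary):
    """Return the words in `vocabulary` that the truncated term would retrieve."""
    pattern = truncation_to_regex(term)
    return [word for word in vocabulary if pattern.fullmatch(word.lower())]

vocabulary = ["path", "paths", "pathos", "pathway", "pathology",
              "organization", "organisation", "fiberboard", "fibreboard",
              "woman", "women"]

for term in ["path?", "path? ?", "path??", "organi?ation", "fib??board", "wom?n"]:
    print(term, "->", matching_words(term, vocabulary))
```

The order of the `endswith` checks matters: the longer endings must be tested before the single `?`, otherwise `path??` would be read as open truncation of `path?`.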
35 Basic Dialog Commands (4)
36 Basic Dialog Commands (5)
- Logoff command
  - Enter logoff in the command line or click the logoff button
  - the system displays estimated costs for the current search session, disconnects you from Dialog, and erases all of your sets.
- Logoff Hold command
  - disconnects you temporarily so you can reconnect to your search in progress.