1. Lecture 17: Boolean IR and Text Processing
SIMS 202 Information Organization and Retrieval
- Prof. Ray Larson and Prof. Marc Davis
- UC Berkeley SIMS
- Tuesday and Thursday 10:30 am - 12:00 pm
- Fall 2003
- http://www.sims.berkeley.edu/academics/courses/is202/f03/
2. Announcements
- Wishter volunteers meeting tonight at 7:00
- Testers needed!!
- UI tests on Image Gallery / Annotation software
- Thursday between 2-4 and Friday 10-4
- The tests will be approximately 1 ½ hours (but most likely will run a bit shorter)
- Signup sheet will be available at the end of class
3. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
4. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
5. IR is an Iterative Process
6. Berry-Picking Model
A sketch of a searcher moving through many actions towards a general goal of satisfactory completion of research related to an information need. (after Bates 89)
[Diagram: a search path moving through a sequence of queries Q0 through Q5]
7. Restricted Form of the IR Problem
- The system has available only pre-existing, canned text passages
- Its response is limited to selecting from these passages and presenting them to the user
- It must select, say, 10 or 20 passages out of millions or billions!
8. Information Retrieval
- Revised Task Statement
- Build a system that retrieves documents that users are likely to find relevant to their queries
- This set of assumptions underlies the field of Information Retrieval
9. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
10. Structure of an IR System
[Diagram, adapted from Soergel, p. 19: an Information Storage and Retrieval System with two input lines. The search line takes interest profiles and queries, formulates them in terms of descriptors, and stores them (Store 1: profiles / search requests). The storage line takes documents and data, indexes them (descriptive and subject indexing), and stores the document representations (Store 2). Both lines are governed by the "rules of the game": rules for subject indexing and a thesaurus consisting of a lead-in vocabulary and an indexing language. Comparison / matching of the two stores yields potentially relevant documents.]
11. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
12. Central Concepts in IR
- Documents
- Queries
- Collections
- Evaluation
- Relevance
13. Documents
- What do we mean by a document?
- Full document?
- Document surrogates?
- Pages?
- Buckland (JASIS, Sept. 1997): "What is a Document?"
- Are IR systems better called Document Retrieval systems?
- A document is a representation of some aggregation of information, treated as a unit
14. Collection
- A collection is some physical or logical aggregation of documents
- A database
- A Library
- An index?
- Others?
15. Queries
- A query is some expression of a user's information needs
- Can take many forms
- Natural language description of need
- Formal query in a query language
- Queries may not be accurate expressions of the information need
- Differences between conversation with a person and formal query expression
16. Evaluation: Why Evaluate?
- Determine if the system is desirable
- Make comparative assessments
- Others?
17. What To Evaluate?
- How much of the information need was satisfied
- How much was learned about a topic
- Incidental learning
- How much was learned about the collection
- How much was learned about other topics
- How inviting the system is
18. What To Evaluate?
- What can be measured that reflects the user's ability to use the system? (Cleverdon 66)
- Coverage of information
- Form of presentation
- Effort required / ease of use
- Time and space efficiency
- Recall
- Proportion of relevant material actually retrieved
- Precision
- Proportion of retrieved material actually relevant
- Recall and precision together measure effectiveness (a small sketch follows)
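In code, recall and precision reduce to simple set arithmetic over the retrieved and relevant document sets. A minimal Python sketch with made-up document IDs (not from the lecture):

```python
# Minimal sketch: recall and precision for a single query,
# given hypothetical sets of document IDs.

def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query."""
    hits = len(retrieved & relevant)                      # relevant docs actually retrieved
    recall = hits / len(relevant) if relevant else 0.0    # fraction of relevant docs found
    precision = hits / len(retrieved) if retrieved else 0.0  # fraction of retrieved docs that are relevant
    return recall, precision

retrieved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}   # invented result set
relevant = {2, 5, 9, 42, 77}                  # invented relevance judgments
print(recall_precision(retrieved, relevant))   # (0.6, 0.3)
```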
19. Relevance (revisited)
- "Intuitively, we understand quite well what relevance means. It is a primitive 'y'know' concept, as is information for which we hardly need a definition. ... if and when any productive contact in communication is desired, consciously or not, we involve and use this intuitive notion of relevance."
- Saracevic, 1975, p. 324
20. Relevance
- How relevant is the document?
- For this user, for this information need
- Subjective, but
- Measurable to some extent
- How often do people agree a document is relevant to a query?
- How well does it answer the question?
- Complete answer? Partial?
- Background information?
- Hints for further exploration?
21. Relevance Research and Thought
- Review to 1975 by Saracevic
- Reconsideration of user-centered relevance by Schamber, Eisenberg and Nilan, 1990
- Special issue of JASIS on relevance (April 1994, 45(3))
22. Saracevic
- Relevance is considered as a measure of effectiveness of the contact between a source and a destination in a communications process
- Systems view
- Destination's view
- Subject Literature view
- Subject Knowledge view
- Pertinence
- Pragmatic view
23. Define Your Own Relevance
- As we saw last time, most definitions of relevance follow a formula
- Relevance is the (A) gauge of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor
From Saracevic, 1975 and Schamber 1990
24. Schamber, Eisenberg and Nilan
- "Relevance is the measure of retrieval performance in all information systems, including full-text, multimedia, question-answering, database management and knowledge-based systems."
- Systems-oriented relevance: topicality
25. Schamber, et al. Conclusions
- Relevance is a multidimensional concept whose meaning is largely dependent on users' perceptions of information and their own information need situations
- Relevance is a dynamic concept that depends on users' judgments of the quality of the relationship between information and information need at a certain point in time
- Relevance is a complex but systematic and measurable concept if approached conceptually and operationally from the user's perspective
26. Janes' View
27. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
28. Query Languages
- A way to express the question (information need)
- Types
- Boolean
- Natural Language
- Stylized Natural Language
- Form-Based (GUI)
29. Simple Query Language: Boolean
- Terms and Connectors (or operators)
- Terms
- Words
- Normalized (stemmed) words
- Phrases
- Thesaurus terms
- Connectors
- AND
- OR
- NOT
30. Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
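The queries above are just set expressions over term postings. A minimal Python sketch, with invented postings lists (not from the lecture), evaluates them directly with set operators:

```python
# Minimal sketch: Boolean retrieval as set operations over a toy inverted index.
# The postings below are invented for illustration.

postings = {
    "cat":    {1, 2, 5},
    "dog":    {2, 3},
    "collar": {2, 4, 5},
    "leash":  {3, 5},
}

cat, dog, collar, leash = (postings[t] for t in ("cat", "dog", "collar", "leash"))

print(cat | dog)                       # Cat OR Dog                      -> docs {1, 2, 3, 5}
print(cat & dog)                       # Cat AND Dog                     -> docs {2}
print((cat & dog) | collar)            # (Cat AND Dog) OR Collar         -> docs {2, 4, 5}
print((cat | dog) & (collar | leash))  # (Cat OR Dog) AND (Collar OR Leash) -> docs {2, 3, 5}
```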
31. Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- Each of the following combinations works
32. Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- None of the following combinations works
33. Boolean Logic
[Diagram: two overlapping sets A and B]
34. Boolean Queries
- Usually expressed as INFIX operators in IR
- ((a AND b) OR (c AND b))
- NOT is a UNARY PREFIX operator
- ((a AND b) OR (c AND (NOT b)))
- AND and OR can be n-ary operators
- (a AND b AND c AND d)
- Some rules (De Morgan revisited; checked in the sketch below):
- NOT(a) AND NOT(b) = NOT(a OR b)
- NOT(a) OR NOT(b) = NOT(a AND b)
- NOT(NOT(a)) = a
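As a quick sanity check, the De Morgan identities above can be verified directly on sets, with NOT taken as complement against the whole collection. The collection and postings here are invented:

```python
# Sketch: checking the De Morgan identities, with NOT as set complement
# over a hypothetical collection of document IDs.

universe = set(range(1, 11))     # hypothetical collection: docs 1..10
a = {1, 2, 5, 7}                 # docs containing term a (made up)
b = {2, 3, 7, 9}                 # docs containing term b (made up)

def NOT(s):
    return universe - s

assert NOT(a) & NOT(b) == NOT(a | b)     # NOT(a) AND NOT(b) = NOT(a OR b)
assert NOT(a) | NOT(b) == NOT(a & b)     # NOT(a) OR NOT(b) = NOT(a AND b)
assert NOT(NOT(a)) == a                  # NOT(NOT(a)) = a
print("De Morgan identities hold")
```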
35. Boolean Logic
[Diagram: the eight regions m1 through m8 formed by the possible membership combinations of documents in terms t1, t2, and t3]
36. Boolean Searching
37. Pseudo-Boolean Queries
- A new notation, from web search
- cat dog collar leash
- Does not mean the same thing!
- Need a way to group combinations
- Phrases
- "stray cat" AND "frayed collar"
- "stray cat" "frayed collar"
38. Another View of IR
[Diagram: Collections are pre-processed into an Index; an Information Need is entered as text input, parsed into a Query, matched against the Index, and the results are ranked]
39. Result Sets
- Run a query, get a result set
- Two choices:
- Reformulate query, run on entire collection
- Reformulate query, run on result set
- Example: Dialog query (sketched below)
- (Redford AND Newman)
- -> S1: 1450 documents
- (S1 AND Sundance)
- -> S2: 898 documents
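The Dialog example amounts to saving a result set and intersecting it with another postings list. A minimal sketch with invented postings (and far smaller counts than the real Dialog example):

```python
# Minimal sketch of refining a saved result set, Dialog-style:
# S1 = (Redford AND Newman); S2 = (S1 AND Sundance).
# The postings below are invented for illustration.

postings = {
    "redford":  {1, 2, 3, 4, 7},
    "newman":   {2, 3, 4, 8},
    "sundance": {3, 4, 9},
}

s1 = postings["redford"] & postings["newman"]   # query run on the whole collection
s2 = s1 & postings["sundance"]                  # query run only within result set S1
print(len(s1), len(s2))                         # 3 2
```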
40. Feedback Queries
41. Ordering of Retrieved Documents
- Pure Boolean has no ordering
- In practice:
- Order chronologically
- Order by total number of hits on query terms (a small sketch follows)
- What if one term has more hits than others?
- Is it better to have one of each term or many of one term?
- Fancier methods have been investigated
- p-norm is most famous
- Usually impractical to implement
- Usually hard for user to understand
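A sketch of the "total number of hits" ordering mentioned above, with invented term counts. Note how a document with many occurrences of a single term can outrank one that matches every term, which is exactly the question the slide raises:

```python
# Sketch: order a Boolean result set by the summed count of query-term
# occurrences. Term frequencies below are invented.

from collections import Counter

doc_term_counts = {          # doc_id -> term counts within that document
    1: Counter({"cat": 4, "collar": 1}),
    2: Counter({"cat": 1, "dog": 1, "collar": 1}),
    3: Counter({"dog": 6}),
}

def order_by_total_hits(result_docs, query_terms):
    score = lambda d: sum(doc_term_counts[d][t] for t in query_terms)
    return sorted(result_docs, key=score, reverse=True)

print(order_by_total_hits({1, 2, 3}, ["cat", "dog", "collar"]))
# [3, 1, 2] -- doc 3 wins on one very frequent term, doc 2 matches all terms but ranks last
```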
42. Boolean
- Advantages
- Simple queries are easy to understand
- Relatively easy to implement
- Disadvantages
- Difficult to specify what is wanted
- Too much returned, or too little
- Ordering not well determined
- Dominant language in commercial systems until the WWW
43. Faceted Boolean Query
- Strategy: break the query into facets (polysemous with the earlier meaning of facets)
- A conjunction of disjunctions:
- (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4)
- Each facet expresses a topic
- "rain forest" OR jungle OR amazon
- medicine OR remedy OR cure
- Smith OR Zhou
44. Faceted Boolean Query
- Query still fails if one facet is missing
- Alternative: coordination level ranking (a small sketch follows)
- Order results in terms of how many facets (disjuncts) are satisfied
- Also called Quorum ranking, Overlap ranking, and Best Match
- Problem: facets still undifferentiated
- Alternative: assign weights to facets
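A minimal sketch of coordination-level (quorum) ranking over hypothetical facets and documents: each document is scored by how many facets it satisfies rather than being rejected for missing one.

```python
# Sketch: coordination-level ranking. Each facet is a disjunction of terms;
# a document's score is the number of facets it satisfies.
# Facets and document term sets are made up for illustration.

facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

docs = {
    "d1": {"amazon", "cure", "zhou"},   # satisfies all three facets
    "d2": {"jungle", "remedy"},         # satisfies two facets
    "d3": {"smith"},                    # satisfies one facet
}

def coordination_level(doc_terms):
    return sum(1 for facet in facets if facet & doc_terms)

ranked = sorted(docs, key=lambda d: coordination_level(docs[d]), reverse=True)
print(ranked)   # ['d1', 'd2', 'd3'] -- partial matches are ranked, not discarded
```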
45. Proximity Searches
- Proximity: terms occur within K positions of one another (sketch below)
- pen w/5 paper
- A NEAR function can be more vague
- near(pen, paper)
- Sometimes order can be specified
- Also, Phrases and Collocations
- "United Nations"  "Bill Clinton"
- Phrase Variants
- "retrieval of information"  "information retrieval"
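A sketch of the basic proximity test (two terms within K token positions of one another); the tokenized sentence is invented:

```python
# Sketch: do term1 and term2 occur within k positions of each other
# in a tokenized document?

def within(tokens, term1, term2, k):
    pos1 = [i for i, t in enumerate(tokens) if t == term1]
    pos2 = [i for i, t in enumerate(tokens) if t == term2]
    return any(abs(i - j) <= k for i in pos1 for j in pos2)

tokens = "she put the pen down next to a sheet of paper".split()
print(within(tokens, "pen", "paper", 5))   # False (the terms are 7 positions apart)
print(within(tokens, "pen", "paper", 8))   # True
```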
46. Filters
- Filters: reduce the set of candidate docs (a small sketch follows)
- Often specified simultaneously with the query
- Usually restrictions on metadata
- Restrict by
- Date range
- Internet domain (.edu, .com, .berkeley.edu)
- Author
- Size
- Limit number of documents returned
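A sketch of metadata filtering applied to a candidate set; the field names and records are hypothetical, not from any particular system:

```python
# Sketch: restrict candidate documents by metadata (date, domain, size)
# and cap the number returned. Records are invented.

from datetime import date

docs = [
    {"id": 1, "date": date(2003, 9, 2),  "domain": "berkeley.edu",      "size": 4200},
    {"id": 2, "date": date(1999, 1, 15), "domain": "example.com",       "size": 900},
    {"id": 3, "date": date(2003, 10, 1), "domain": "sims.berkeley.edu", "size": 70000},
]

def apply_filters(docs, after=None, domain=None, max_size=None, limit=None):
    out = [d for d in docs
           if (after is None or d["date"] >= after)
           and (domain is None or d["domain"].endswith(domain))
           and (max_size is None or d["size"] <= max_size)]
    return out[:limit] if limit else out

print([d["id"] for d in
       apply_filters(docs, after=date(2003, 1, 1), domain="berkeley.edu", limit=10)])  # [1, 3]
```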
47. Boolean Systems
- Most of the commercial database search systems that pre-date the WWW are based on Boolean search
- Dialog, Lexis-Nexis, etc.
- Most online library catalogs are Boolean systems
- E.g., MELVYL
- Database systems use Boolean logic for searching
- Many of the search engines sold for intranet search of web sites are Boolean
48. Why Boolean?
- Easy to implement
- Efficient searching across very large databases
- Easy to explain results
- Has to have all of the words (AND)
- Has to have at least one of the words (OR)
49. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
50. Content Analysis
- Automated transformation of raw text into a form that represents some aspect(s) of its meaning
- Including, but not limited to:
- Automated Thesaurus Generation
- Phrase Detection
- Categorization
- Clustering
- Summarization
51. Techniques for Content Analysis
- Statistical
- Single Document
- Full Collection
- Linguistic
- Syntactic
- Semantic
- Pragmatic
- Knowledge-Based (Artificial Intelligence)
- Hybrid (Combinations)
52. Text Processing
- Standard steps (a small sketch follows)
- Recognize document structure
- Titles, sections, paragraphs, etc.
- Break into tokens
- Usually space and punctuation delineated
- Special issues with Asian languages
- Stemming/morphological analysis
- Store in inverted index (to be discussed later)
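A compressed sketch of these standard steps for a single document; the regular-expression tokenizer and the crude suffix rules are stand-ins for illustration, not the lecture's actual components:

```python
# Sketch: tokenize, normalize case, stem (crudely), and post terms
# into a toy inverted index.

import re
from collections import defaultdict

def tokenize(text):
    # split on anything that is not a letter; lowercase everything
    return re.findall(r"[A-Za-z]+", text.lower())

def crude_stem(token):
    # illustrative suffix stripping only; not a real stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

inverted_index = defaultdict(set)   # term -> set of doc IDs

def index_document(doc_id, text):
    for tok in tokenize(text):
        inverted_index[crude_stem(tok)].add(doc_id)

index_document(1, "Dogs chasing cats.")
index_document(2, "A cat chased the dog.")
print(sorted(inverted_index["cat"]))   # [1, 2]
```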
53. Content Analysis Areas
54. Document Processing Steps
From Modern IR Textbook
55. Stemming and Morphological Analysis
- Goal: normalize similar words
- Morphology (form of words)
- Inflectional Morphology
- E.g., inflect verb endings and noun number
- Never changes grammatical class
- dog, dogs
- tengo, tienes, tiene, tenemos, tienen
- Derivational Morphology
- Derive one word from another
- Often changes grammatical class
- build/building; health/healthy
56. Automated Methods
- Powerful multilingual tools exist for morphological analysis
- PC-KIMMO, Xerox lexical technology
- Require a grammar and dictionary
- Use two-level automata
- Stemmers
- Very dumb rules work well (for English)
- Porter Stemmer: iteratively remove suffixes (a toy sketch follows)
- Improvement: pass results through a lexicon
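A toy illustration of rule-based suffix stripping in the spirit of the Porter stemmer. These few rules are invented for the sketch and are not the real Porter algorithm, which applies several ordered rule steps with measure conditions (and, as the next slide shows, still makes errors):

```python
# Toy suffix-stripping stemmer: apply the first matching (suffix, replacement)
# rule, keeping at least a three-letter stem. Illustration only.

RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion"),
         ("ization", "ize"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

for w in ("generalization", "caresses", "relational", "connected"):
    print(w, "->", toy_stem(w))
```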
57. Errors Generated by Porter Stemmer
From Krovetz 93
58. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic
- Boolean IR Systems
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
59. Questions from Patrick Riley
- In Plato's Meno dialogue, Plato asks, "How does one investigate what one does not know?" Plato's question is similar to typical questions we encounter in this and other readings of INFOSYS 202: how do we overcome the synonymy and polysemy problems faced by lexical searching? Can the LSA (Latent Semantic Analysis) and SVD (singular value decomposition) statistical techniques demonstrated by Dumais et al. solve the lexicon deficiencies in information retrieval?
60. Paradox
- The "Fundamental Paradox of Information Retrieval," as stated by Roland Hjerppe:
- The need to describe that which you do not know in order to find it
61. Questions from Patrick Riley
- This paper is from 1988... do you know of any applications or advancements of this LSA approach from the information retrieval community? (Example: AI (LSA passed the TOEFL).)
- And what are some of the limitations of using this corpus-based text comparison mechanism? (Example: no use of word order, incompleteness?) How does the LSA approach differ from other statistical approaches you've encountered? (Example: Google's "Similar Pages" feature.)
62. Questions from Joe Hall
- I would really like to see a show of hands (in class, I can't see you now!) of how many people have heard of either of the terms "Singular-value Decomposition" or "Eigenvector Decomposition" before you sat down to read this article. (I ask because we use this a lot in numerical approximation of radiative transfer in astrophysics... SVD is definitely a litmus test as to whether or not a problem is difficult.)
63. Questions from Joe Hall
- I'm going to get picky here. In the Conclusion, Dumais et al. claim, "The latent structure LSI approach is useful for helping people find textual information in large collections." However, their results (and those of other researchers!) mostly contradict this claim. So which is it... does the SVD approach "offer no improvement over term matching methods" only for "relatively homogenous" groups of documents like "information science documents"? Does LSI work best on widely different documents? Take a look at this paper's abstract, which contradicts the Dumais findings: http://tinyurl.com/smfo
64. Questions from Joe Hall
- If you raised your hand for the first question, you may know that SVD is very computationally intensive... Dumais claims that "it need only be done once for each dataset." That's no fun... most datasets change over time... not only that, but most datasets grow with time... which means that SVD techniques can only be used on small, static, homogenous data sets (if you buy the link I showed above)... what fun is that? Where is SVD-enabled LSI useful? Is it merely a fascination of IR researchers and a way to write fancy grant proposals to make the next Maserati payment?
65. Questions from Tu Tran
- In what context was this paper written? What was the state of the IR field?
- Imagine you are an information specialist and had to explain LSI and SVD to your non-mathematically oriented/non-technical manager. How would you do it?
- The paper did not include any user studies. Can you imagine tasks where users would not find this system useful?
66. Next Time
- Statistical Properties of Texts and Vector Representation
- Readings/Discussion:
- Cooper, "Getting Beyond Boole" (Dan)
- Bates, "How to Use Controlled Vocabularies More Effectively in Online Searching" (Ann)
- Hearst, "Improving Full-Text Precision on Short Queries Using Simple Constraints" (Simon)
- Modern IR, Chapter 7 (Sean)