1. Lecture 17: Boolean IR and Text Processing
SIMS 202 Information Organization and Retrieval
- Prof. Ray Larson and Prof. Marc Davis
- UC Berkeley SIMS
- Tuesday and Thursday 10:30 am - 12:00 pm
- Fall 2003
- http://www.sims.berkeley.edu/academics/courses/is202/f03/
2. Announcements
- Wishter volunteers meeting tonight at 7:00
- Testers needed!!
- UI tests on Image Gallery / Annotation software
- Thursday between 2-4 and Friday 10-4
- The tests will be approximately 1 ½ hours (but most likely will run a bit shorter)
- Signup sheet will be available at the end of class
3. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
4. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
5. IR is an Iterative Process
6. Berry-Picking Model
A sketch of a searcher moving through many actions towards a general goal of satisfactory completion of research related to an information need. (after Bates 89)
[Diagram: a search path moving through a sequence of queries Q0 through Q5]
7. Restricted Form of the IR Problem
- The system has available only pre-existing, canned text passages
- Its response is limited to selecting from these passages and presenting them to the user
- It must select, say, 10 or 20 passages out of millions or billions!
8. Information Retrieval
- Revised Task Statement
- Build a system that retrieves documents that users are likely to find relevant to their queries
- This set of assumptions underlies the field of Information Retrieval
9. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
10. Structure of an IR System
[Diagram, adapted from Soergel, p. 19: an Information Storage and Retrieval System with two input lines. The search line takes interest profiles and queries, formulates them in terms of descriptors, and stores them (Store 1: profiles / search requests). The storage line takes documents and data, indexes them (descriptive and subject indexing), and stores the document representations (Store 2). Both lines are governed by the "rules of the game": rules for subject indexing and a thesaurus consisting of a lead-in vocabulary and an indexing language. Comparison / matching of the two stores yields potentially relevant documents.]
11. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
12. Central Concepts in IR
- Documents
- Queries
- Collections
- Evaluation
- Relevance
13. Documents
- What do we mean by a document?
- Full document?
- Document surrogates?
- Pages?
- Buckland (JASIS, Sept. 1997): "What is a Document?"
- Are IR systems better called Document Retrieval systems?
- A document is a representation of some aggregation of information, treated as a unit
14. Collection
- A collection is some physical or logical aggregation of documents
- A database
- A Library
- An index?
- Others?
15. Queries
- A query is some expression of a user's information needs
- Can take many forms
- Natural language description of need
- Formal query in a query language
- Queries may not be accurate expressions of the information need
- Differences between conversation with a person and formal query expression
16. Evaluation: Why Evaluate?
- Determine if the system is desirable
- Make comparative assessments
- Others?
17. What To Evaluate?
- How much of the information need was satisfied
- How much was learned about a topic
- Incidental learning
- How much was learned about the collection
- How much was learned about other topics
- How inviting the system is
18. What To Evaluate?
- What can be measured that reflects the user's ability to use the system? (Cleverdon 66)
- Coverage of information
- Form of presentation
- Effort required / ease of use
- Time and space efficiency
- Recall
- Proportion of relevant material actually retrieved
- Precision
- Proportion of retrieved material actually relevant
- Recall and precision together measure effectiveness (a small sketch follows)
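In code, recall and precision reduce to simple set arithmetic over the retrieved and relevant document sets. A minimal Python sketch with made-up document IDs (not from the lecture):

```python
# Minimal sketch: recall and precision for a single query,
# given hypothetical sets of document IDs.

def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query."""
    hits = len(retrieved & relevant)                      # relevant docs actually retrieved
    recall = hits / len(relevant) if relevant else 0.0    # fraction of relevant docs found
    precision = hits / len(retrieved) if retrieved else 0.0  # fraction of retrieved docs that are relevant
    return recall, precision

retrieved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}   # invented result set
relevant = {2, 5, 9, 42, 77}                  # invented relevance judgments
print(recall_precision(retrieved, relevant))   # (0.6, 0.3)
```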
19. Relevance (revisited)
- "Intuitively, we understand quite well what relevance means. It is a primitive 'y'know' concept, as is information for which we hardly need a definition. ... if and when any productive contact in communication is desired, consciously or not, we involve and use this intuitive notion of relevance."
- Saracevic, 1975, p. 324
20. Relevance
- How relevant is the document?
- For this user, for this information need
- Subjective, but
- Measurable to some extent
- How often do people agree a document is relevant to a query?
- How well does it answer the question?
- Complete answer? Partial?
- Background information?
- Hints for further exploration?
21. Relevance Research and Thought
- Review to 1975 by Saracevic
- Reconsideration of user-centered relevance by Schamber, Eisenberg and Nilan, 1990
- Special issue of JASIS on relevance (April 1994, 45(3))
22. Saracevic
- Relevance is considered as a measure of effectiveness of the contact between a source and a destination in a communications process
- Systems view
- Destination's view
- Subject Literature view
- Subject Knowledge view
- Pertinence
- Pragmatic view
23. Define Your Own Relevance
- As we saw last time, most definitions of relevance follow a formula
- Relevance is the (A) gauge of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor
From Saracevic, 1975 and Schamber 1990
24. Schamber, Eisenberg and Nilan
- "Relevance is the measure of retrieval performance in all information systems, including full-text, multimedia, question-answering, database management and knowledge-based systems."
- Systems-oriented relevance: topicality
25. Schamber, et al. Conclusions
- Relevance is a multidimensional concept whose meaning is largely dependent on users' perceptions of information and their own information need situations
- Relevance is a dynamic concept that depends on users' judgments of the quality of the relationship between information and information need at a certain point in time
- Relevance is a complex but systematic and measurable concept if approached conceptually and operationally from the user's perspective
26. Janes' View
27. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
28. Query Languages
- A way to express the question (information need)
- Types
- Boolean
- Natural Language
- Stylized Natural Language
- Form-Based (GUI)
29. Simple Query Language: Boolean
- Terms and Connectors (or operators)
- Terms
- Words
- Normalized (stemmed) words
- Phrases
- Thesaurus terms
- Connectors
- AND
- OR
- NOT
30. Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
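The queries above are just set expressions over term postings. A minimal Python sketch, with invented postings lists (not from the lecture), evaluates them directly with set operators:

```python
# Minimal sketch: Boolean retrieval as set operations over a toy inverted index.
# The postings below are invented for illustration.

postings = {
    "cat":    {1, 2, 5},
    "dog":    {2, 3},
    "collar": {2, 4, 5},
    "leash":  {3, 5},
}

cat, dog, collar, leash = (postings[t] for t in ("cat", "dog", "collar", "leash"))

print(cat | dog)                       # Cat OR Dog                      -> docs {1, 2, 3, 5}
print(cat & dog)                       # Cat AND Dog                     -> docs {2}
print((cat & dog) | collar)            # (Cat AND Dog) OR Collar         -> docs {2, 4, 5}
print((cat | dog) & (collar | leash))  # (Cat OR Dog) AND (Collar OR Leash) -> docs {2, 3, 5}
```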
31. Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- Each of the following combinations works
32. Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- None of the following combinations works
33. Boolean Logic
[Diagram: two overlapping sets A and B]
34. Boolean Queries
- Usually expressed as INFIX operators in IR
- ((a AND b) OR (c AND b))
- NOT is a UNARY PREFIX operator
- ((a AND b) OR (c AND (NOT b)))
- AND and OR can be n-ary operators
- (a AND b AND c AND d)
- Some rules (De Morgan revisited; checked in the sketch below):
- NOT(a) AND NOT(b) = NOT(a OR b)
- NOT(a) OR NOT(b) = NOT(a AND b)
- NOT(NOT(a)) = a
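As a quick sanity check, the De Morgan identities above can be verified directly on sets, with NOT taken as complement against the whole collection. The collection and postings here are invented:

```python
# Sketch: checking the De Morgan identities, with NOT as set complement
# over a hypothetical collection of document IDs.

universe = set(range(1, 11))     # hypothetical collection: docs 1..10
a = {1, 2, 5, 7}                 # docs containing term a (made up)
b = {2, 3, 7, 9}                 # docs containing term b (made up)

def NOT(s):
    return universe - s

assert NOT(a) & NOT(b) == NOT(a | b)     # NOT(a) AND NOT(b) = NOT(a OR b)
assert NOT(a) | NOT(b) == NOT(a & b)     # NOT(a) OR NOT(b) = NOT(a AND b)
assert NOT(NOT(a)) == a                  # NOT(NOT(a)) = a
print("De Morgan identities hold")
```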
35. Boolean Logic
[Diagram: the eight regions m1 through m8 formed by the possible membership combinations of documents in terms t1, t2, and t3]
36. Boolean Searching
37. Pseudo-Boolean Queries
- A new notation, from web search
- cat dog collar leash
- Does not mean the same thing!
- Need a way to group combinations
- Phrases
- "stray cat" AND "frayed collar"
- "stray cat" "frayed collar"
38. Another View of IR
[Diagram: Collections are pre-processed into an Index; an Information Need is entered as text input, parsed into a Query, matched against the Index, and the results are ranked]
39. Result Sets
- Run a query, get a result set
- Two choices:
- Reformulate query, run on entire collection
- Reformulate query, run on result set
- Example: Dialog query (sketched below)
- (Redford AND Newman)
- -> S1: 1450 documents
- (S1 AND Sundance)
- -> S2: 898 documents
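The Dialog example amounts to saving a result set and intersecting it with another postings list. A minimal sketch with invented postings (and far smaller counts than the real Dialog example):

```python
# Minimal sketch of refining a saved result set, Dialog-style:
# S1 = (Redford AND Newman); S2 = (S1 AND Sundance).
# The postings below are invented for illustration.

postings = {
    "redford":  {1, 2, 3, 4, 7},
    "newman":   {2, 3, 4, 8},
    "sundance": {3, 4, 9},
}

s1 = postings["redford"] & postings["newman"]   # query run on the whole collection
s2 = s1 & postings["sundance"]                  # query run only within result set S1
print(len(s1), len(s2))                         # 3 2
```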
40. Feedback Queries
41. Ordering of Retrieved Documents
- Pure Boolean has no ordering
- In practice:
- Order chronologically
- Order by total number of hits on query terms (a small sketch follows)
- What if one term has more hits than others?
- Is it better to have one of each term or many of one term?
- Fancier methods have been investigated
- p-norm is most famous
- Usually impractical to implement
- Usually hard for user to understand
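A sketch of the "total number of hits" ordering mentioned above, with invented term counts. Note how a document with many occurrences of a single term can outrank one that matches every term, which is exactly the question the slide raises:

```python
# Sketch: order a Boolean result set by the summed count of query-term
# occurrences. Term frequencies below are invented.

from collections import Counter

doc_term_counts = {          # doc_id -> term counts within that document
    1: Counter({"cat": 4, "collar": 1}),
    2: Counter({"cat": 1, "dog": 1, "collar": 1}),
    3: Counter({"dog": 6}),
}

def order_by_total_hits(result_docs, query_terms):
    score = lambda d: sum(doc_term_counts[d][t] for t in query_terms)
    return sorted(result_docs, key=score, reverse=True)

print(order_by_total_hits({1, 2, 3}, ["cat", "dog", "collar"]))
# [3, 1, 2] -- doc 3 wins on one very frequent term, doc 2 matches all terms but ranks last
```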
42. Boolean
- Advantages
- Simple queries are easy to understand
- Relatively easy to implement
- Disadvantages
- Difficult to specify what is wanted
- Too much returned, or too little
- Ordering not well determined
- Dominant language in commercial systems until the WWW
43. Faceted Boolean Query
- Strategy: break the query into facets (polysemous with the earlier meaning of facets)
- A conjunction of disjunctions:
- (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4)
- Each facet expresses a topic
- "rain forest" OR jungle OR amazon
- medicine OR remedy OR cure
- Smith OR Zhou
44. Faceted Boolean Query
- Query still fails if one facet is missing
- Alternative: coordination level ranking (a small sketch follows)
- Order results in terms of how many facets (disjuncts) are satisfied
- Also called Quorum ranking, Overlap ranking, and Best Match
- Problem: facets still undifferentiated
- Alternative: assign weights to facets
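A minimal sketch of coordination-level (quorum) ranking over hypothetical facets and documents: each document is scored by how many facets it satisfies rather than being rejected for missing one.

```python
# Sketch: coordination-level ranking. Each facet is a disjunction of terms;
# a document's score is the number of facets it satisfies.
# Facets and document term sets are made up for illustration.

facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

docs = {
    "d1": {"amazon", "cure", "zhou"},   # satisfies all three facets
    "d2": {"jungle", "remedy"},         # satisfies two facets
    "d3": {"smith"},                    # satisfies one facet
}

def coordination_level(doc_terms):
    return sum(1 for facet in facets if facet & doc_terms)

ranked = sorted(docs, key=lambda d: coordination_level(docs[d]), reverse=True)
print(ranked)   # ['d1', 'd2', 'd3'] -- partial matches are ranked, not discarded
```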
45. Proximity Searches
- Proximity: terms occur within K positions of one another (sketch below)
- pen w/5 paper
- A NEAR function can be more vague
- near(pen, paper)
- Sometimes order can be specified
- Also, Phrases and Collocations
- "United Nations"  "Bill Clinton"
- Phrase Variants
- "retrieval of information"  "information retrieval"
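A sketch of the basic proximity test (two terms within K token positions of one another); the tokenized sentence is invented:

```python
# Sketch: do term1 and term2 occur within k positions of each other
# in a tokenized document?

def within(tokens, term1, term2, k):
    pos1 = [i for i, t in enumerate(tokens) if t == term1]
    pos2 = [i for i, t in enumerate(tokens) if t == term2]
    return any(abs(i - j) <= k for i in pos1 for j in pos2)

tokens = "she put the pen down next to a sheet of paper".split()
print(within(tokens, "pen", "paper", 5))   # False (the terms are 7 positions apart)
print(within(tokens, "pen", "paper", 8))   # True
```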
46. Filters
- Filters: reduce the set of candidate docs (a small sketch follows)
- Often specified simultaneously with the query
- Usually restrictions on metadata
- Restrict by
- Date range
- Internet domain (.edu, .com, .berkeley.edu)
- Author
- Size
- Limit number of documents returned
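A sketch of metadata filtering applied to a candidate set; the field names and records are hypothetical, not from any particular system:

```python
# Sketch: restrict candidate documents by metadata (date, domain, size)
# and cap the number returned. Records are invented.

from datetime import date

docs = [
    {"id": 1, "date": date(2003, 9, 2),  "domain": "berkeley.edu",      "size": 4200},
    {"id": 2, "date": date(1999, 1, 15), "domain": "example.com",       "size": 900},
    {"id": 3, "date": date(2003, 10, 1), "domain": "sims.berkeley.edu", "size": 70000},
]

def apply_filters(docs, after=None, domain=None, max_size=None, limit=None):
    out = [d for d in docs
           if (after is None or d["date"] >= after)
           and (domain is None or d["domain"].endswith(domain))
           and (max_size is None or d["size"] <= max_size)]
    return out[:limit] if limit else out

print([d["id"] for d in
       apply_filters(docs, after=date(2003, 1, 1), domain="berkeley.edu", limit=10)])  # [1, 3]
```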
47. Boolean Systems
- Most of the commercial database search systems that pre-date the WWW are based on Boolean search
- Dialog, Lexis-Nexis, etc.
- Most online library catalogs are Boolean systems
- E.g., MELVYL
- Database systems use Boolean logic for searching
- Many of the search engines sold for intranet search of web sites are Boolean
48. Why Boolean?
- Easy to implement
- Efficient searching across very large databases
- Easy to explain results
- Has to have all of the words (AND)
- Has to have at least one of the words (OR)
49. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
50. Content Analysis
- Automated transformation of raw text into a form that represents some aspect(s) of its meaning
- Including, but not limited to:
- Automated Thesaurus Generation
- Phrase Detection
- Categorization
- Clustering
- Summarization
51. Techniques for Content Analysis
- Statistical
- Single Document
- Full Collection
- Linguistic
- Syntactic
- Semantic
- Pragmatic
- Knowledge-Based (Artificial Intelligence)
- Hybrid (Combinations)
52. Text Processing
- Standard steps (a small sketch follows)
- Recognize document structure
- Titles, sections, paragraphs, etc.
- Break into tokens
- Usually space and punctuation delineated
- Special issues with Asian languages
- Stemming/morphological analysis
- Store in inverted index (to be discussed later)
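A compressed sketch of these standard steps for a single document; the regular-expression tokenizer and the crude suffix rules are stand-ins for illustration, not the lecture's actual components:

```python
# Sketch: tokenize, normalize case, stem (crudely), and post terms
# into a toy inverted index.

import re
from collections import defaultdict

def tokenize(text):
    # split on anything that is not a letter; lowercase everything
    return re.findall(r"[A-Za-z]+", text.lower())

def crude_stem(token):
    # illustrative suffix stripping only; not a real stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

inverted_index = defaultdict(set)   # term -> set of doc IDs

def index_document(doc_id, text):
    for tok in tokenize(text):
        inverted_index[crude_stem(tok)].add(doc_id)

index_document(1, "Dogs chasing cats.")
index_document(2, "A cat chased the dog.")
print(sorted(inverted_index["cat"]))   # [1, 2]
```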
53. Content Analysis Areas
54. Document Processing Steps
From Modern IR Textbook
55. Stemming and Morphological Analysis
- Goal: normalize similar words
- Morphology (form of words)
- Inflectional Morphology
- E.g., inflect verb endings and noun number
- Never changes grammatical class
- dog, dogs
- tengo, tienes, tiene, tenemos, tienen
- Derivational Morphology
- Derive one word from another
- Often changes grammatical class
- build/building; health/healthy
56. Automated Methods
- Powerful multilingual tools exist for morphological analysis
- PC-KIMMO, Xerox lexical technology
- Require a grammar and dictionary
- Use two-level automata
- Stemmers
- Very dumb rules work well (for English)
- Porter Stemmer: iteratively remove suffixes (a toy sketch follows)
- Improvement: pass results through a lexicon
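A toy illustration of rule-based suffix stripping in the spirit of the Porter stemmer. These few rules are invented for the sketch and are not the real Porter algorithm, which applies several ordered rule steps with measure conditions (and, as the next slide shows, still makes errors):

```python
# Toy suffix-stripping stemmer: apply the first matching (suffix, replacement)
# rule, keeping at least a three-letter stem. Illustration only.

RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion"),
         ("ization", "ize"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

for w in ("generalization", "caresses", "relational", "connected"):
    print(w, "->", toy_stem(w))
```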
57. Errors Generated by Porter Stemmer
From Krovetz 93
58. Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic
- Boolean IR Systems
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
59. Questions from Patrick Riley
- In Plato's Meno dialogue, Plato asks, "How does one investigate what one does not know?" Plato's question is similar to typical questions we encounter in this and other readings of INFOSYS 202: how do we overcome the synonymy and polysemy problems faced by lexical searching? Can the LSA (Latent Semantic Analysis) and SVD (singular value decomposition) statistical techniques demonstrated by Dumais et al. solve the lexicon deficiencies in information retrieval?
60. Paradox
- The "Fundamental Paradox of Information Retrieval," as stated by Roland Hjerppe:
- The need to describe that which you do not know in order to find it
61. Questions from Patrick Riley
- This paper is from 1988... do you know of any applications or advancements of this LSA approach from the information retrieval community? (Example: AI (LSA passed the TOEFL).)
- And what are some of the limitations of using this corpus-based text comparison mechanism? (Example: no use of word order, incompleteness?) How does the LSA approach differ from other statistical approaches you've encountered? (Example: Google's "Similar Pages" feature.)
62. Questions from Joe Hall
- I would really like to see a show of hands (in class, I can't see you now!) of how many people have heard of either of the terms "Singular-value Decomposition" or "Eigenvector Decomposition" before you sat down to read this article. (I ask because we use this a lot in numerical approximation of radiative transfer in astrophysics... SVD is definitely a litmus test as to whether or not a problem is difficult.)
63. Questions from Joe Hall
- I'm going to get picky here. In the Conclusion, Dumais et al. claim, "The latent structure LSI approach is useful for helping people find textual information in large collections." However, their results (and those of other researchers!) mostly contradict this claim. So which is it... does the SVD approach "offer no improvement over term matching methods" only for "relatively homogenous" groups of documents like "information science documents"? Does LSI work best on widely different documents? Take a look at this paper's abstract, which contradicts the Dumais findings: http://tinyurl.com/smfo
64. Questions from Joe Hall
- If you raised your hand for the first question, you may know that SVD is very computationally intensive... Dumais claims that "it need only be done once for each dataset." That's no fun... most datasets change over time... not only that, but most datasets grow with time... which means that SVD techniques can only be used on small, static, homogenous data sets (if you buy the link I showed above)... what fun is that? Where is SVD-enabled LSI useful? Is it merely a fascination of IR researchers and a way to write fancy grant proposals to make the next Maserati payment?
65. Questions from Tu Tran
- In what context was this paper written? What was the state of the IR field?
- Imagine you are an information specialist and had to explain LSI and SVD to your non-mathematically oriented/non-technical manager. How would you do it?
- The paper did not include any user studies. Can you imagine tasks where users would not find this system useful?
66. Next Time
- Statistical Properties of Texts and Vector Representation
- Readings/Discussion:
- Cooper, "Getting Beyond Boole" (Dan)
- Bates, "How to Use Controlled Vocabularies More Effectively in Online Searching" (Ann)
- Hearst, "Improving Full-Text Precision on Short Queries Using Simple Constraints" (Simon)
- Modern IR, Chapter 7 (Sean)