Title: Prof. Ray Larson
1Lecture 4 Boolean IR and Text Processing
SIMS 202 Information Organization and Retrieval
- Prof. Ray Larson and Prof. Marc Davis
- UC Berkeley SIMS
- Tuesday and Thursday 10:30 am - 12:00 pm
- Fall 2004
- http://www.sims.berkeley.edu/academics/courses/is202/f04/
2Advertisement
- Not doing anything on Friday afternoon?
- Please come to the Friday Afternoon Seminar. Open to ALL.
- This week: Clifford Lynch, director of the Coalition for Networked Information and Adjunct Professor at SIMS, on "Research Questions in Digital Stewardship"
- See http://www.sims.berkeley.edu/academics/courses/is296a-1/f04/
3Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
Credit for some of the slides in this lecture
goes to Marti Hearst
4Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
5IR is an Iterative Process
6Berry-Picking Model
A sketch of a searcher moving through many
actions towards a general goal of satisfactory
completion of research related to an information
need. (after Bates 89)
[Figure: the searcher's winding path through successive queries Q0, Q1, Q2, Q3, Q4, Q5]
7Restricted Form of the IR Problem
- The system has available only pre-existing, canned text passages
- Its response is limited to selecting from these passages and presenting them to the user
- It must select, say, 10 or 20 passages out of millions or billions!
8Information Retrieval
- Revised Task Statement
- Build a system that retrieves documents that users are likely to find relevant to their queries
- This set of assumptions underlies the field of Information Retrieval
9Paradox
- The fundamental paradox of Information Retrieval, as stated by Roland Hjerppe
- The need to describe that which you do not know in order to find it
10Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
11Structure of an IR System
[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19]
- Search Line: Interest profiles and Queries; Formulating query in terms of descriptors; Storage of profiles; Store 1: Profiles / Search requests
- Storage Line: Documents and data; Indexing (Descriptive and Subject); Storage of Documents; Store 2: Document representations
- Rules of the game: Rules for subject indexing; Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
- Comparison / Matching of Store 1 against Store 2
- Output: Potentially Relevant Documents
12Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
13Central Concepts in IR
- Documents
- Queries
- Collections
- Evaluation
- Relevance
14Documents
- What do we mean by a document?
- Full document?
- Document surrogates?
- Pages?
- Buckland (JASIS, Sept. 1997), "What is a Document?"
- Are IR systems better called Document Retrieval systems?
- A document is a representation of some aggregation of information, treated as a unit
15Collection
- A collection is some physical or logical aggregation of documents
- A database
- A Library
- An index?
- Others?
16Queries
- A query is some expression of a user's information need
- Can take many forms
- Natural language description of need
- Formal query in a query language
- Queries may not be accurate expressions of the information need
- Differences between conversation with a person and formal query expression
17Evaluation: Why Evaluate?
- Determine if the system is desirable
- Make comparative assessments
- Others?
18What To Evaluate?
- How much of the information need was satisfied
- How much was learned about a topic
- Incidental learning
- How much was learned about the collection
- How much was learned about other topics
- How inviting the system is
19What To Evaluate?
- What can be measured that reflects the user's ability to use the system? (Cleverdon 66)
- Coverage of information
- Form of presentation
- Effort required/ease of use
- Time and space efficiency
- Effectiveness
- Recall: proportion of relevant material actually retrieved
- Precision: proportion of retrieved material actually relevant
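The two effectiveness measures above can be computed directly from a set of retrieved documents and a set of documents judged relevant. The following is a minimal sketch; the function name `precision_recall` and the document ids are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 5 relevant in the collection.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d5", "d6", "d7"}

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).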
20Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
21Query Languages
- A way to express the question (information need)
- Types
- Boolean
- Natural Language
- Stylized Natural Language
- Form-Based (GUI)
22Simple Query Language: Boolean
- Terms + Operators
- Terms
- Words
- Normalized (stemmed) words
- Phrases
- Thesaurus terms
- Boolean Operators
- AND
- OR
- NOT
23Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
24Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- Each of the following combinations works
Recall the card based systems? They mechanically
implement Boolean AND
25Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- None of the following combinations works
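Which term combinations satisfy the query on the two slides above can be enumerated mechanically. This is a small sketch that tries every presence/absence combination of the four terms; the helper name `matches` is an assumption made for the example.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")


def matches(present):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for a set of present terms."""
    return (("cat" in present or "dog" in present)
            and ("collar" in present or "leash" in present))


works, fails = [], []
for flags in product([False, True], repeat=len(TERMS)):
    present = {term for term, flag in zip(TERMS, flags) if flag}
    (works if matches(present) else fails).append(present)

print(len(works), "combinations work;", len(fails), "do not")
for combo in works:
    print("works:", sorted(combo))
```

A document works only if it contains at least one term from each parenthesized group, so {cat, leash} works while {cat, dog} does not.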
26Boolean Logic
[Figure: Venn diagram of two overlapping sets, A and B]
27Boolean Queries
- Usually expressed as INFIX operators in IR
- ((a AND b) OR (c AND b))
- NOT is UNARY PREFIX operator
- ((a AND b) OR (c AND (NOT b)))
- AND and OR can be n-ary operators
- (a AND b AND c AND d)
- Some rules (De Morgan revisited)
- NOT(a) AND NOT(b) = NOT(a OR b)
- NOT(a) OR NOT(b) = NOT(a AND b)
- NOT(NOT(a)) = a
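In a Boolean IR system these operators correspond to set operations over the documents that contain each term: AND is intersection, OR is union, and NOT is the complement with respect to the collection. The sketch below uses a hypothetical four-document collection to check the rules above and to evaluate the query ((a AND b) OR (c AND (NOT b))).

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "cat dog",
    2: "cat collar",
    3: "dog leash",
    4: "bird cage",
}
ALL = set(docs)

# Posting sets: term -> ids of the documents containing it.
postings = {}
for doc_id, text in docs.items():
    for word in text.split():
        postings.setdefault(word, set()).add(doc_id)


def term(t):
    """Documents containing term t (empty set if the term is unknown)."""
    return postings.get(t, set())


def NOT(s):
    """Complement of a result set with respect to the whole collection."""
    return ALL - s


a, b, c = term("cat"), term("dog"), term("collar")

# The rules from the slide, expressed on sets.
assert (NOT(a) & NOT(b)) == NOT(a | b)
assert (NOT(a) | NOT(b)) == NOT(a & b)
assert NOT(NOT(a)) == a

result = (a & b) | (c & NOT(b))
print(sorted(result))
```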
28Boolean Logic
[Figure: Venn diagram of three terms t1, t2, t3 partitioned into eight regions m1 to m8, one for each combination of t1, t2, and t3 being present or negated]
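The eight regions for three terms can be listed by choosing, for each term, whether it is present or negated. This sketch enumerates them; the numbering m1 to m8 produced here is an assumption and need not match the order used in the original figure.

```python
from itertools import product

terms = ("t1", "t2", "t3")

minterms = []
for signs in product([True, False], repeat=len(terms)):
    # Keep the term as-is when its sign is True, otherwise negate it.
    parts = [t if keep else f"NOT {t}" for t, keep in zip(terms, signs)]
    minterms.append(" AND ".join(parts))

for i, m in enumerate(minterms, start=1):
    print(f"m{i} = {m}")
```

Any Boolean query over t1, t2, t3 is equivalent to the OR of some subset of these eight conjunctions.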
29Boolean Searching
30Pseudo-Boolean Queries
- A new notation, from web search
- cat dog collar leash
- Does not mean the same thing!
- Need a way to group combinations
- Phrases
- "stray cat" AND "frayed collar"
- "stray cat" "frayed collar"
31Another View of IR
[Figure: IR pipeline. Query side: Information Need, Text Input, Parse, Query. Collection side: Collections, Pre-Process, Index. Both sides feed Rank.]
32Result Sets
- Run a query, get a result set
- Two choices
- Reformulate query, run on entire collection
- Reformulate query, run on result set
- Example: Dialog query
- (Redford AND Newman)
- -> S1: 1450 documents
- (S1 AND Sundance)
- -> S2: 898 documents
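The second choice, running the reformulated query against a saved result set, can be sketched with named sets in the style of the S1 and S2 labels above. The posting sets here are hypothetical and far smaller than the counts in the example; `run` is an illustrative helper, not a Dialog command.

```python
# Hypothetical posting sets (term -> document ids); not real Dialog data.
postings = {
    "redford": {1, 2, 3, 5, 8},
    "newman": {2, 3, 5, 7, 8},
    "sundance": {3, 4, 8, 9},
}

result_sets = {}


def run(name, ids):
    """Save a result set under a name so later queries can refer to it."""
    result_sets[name] = ids
    print(f"-> {name}: {len(ids)} documents")
    return ids


run("S1", postings["redford"] & postings["newman"])
# The refinement intersects with S1 only; the collection is not searched again
# for Redford and Newman.
run("S2", result_sets["S1"] & postings["sundance"])
```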
33Feedback Queries
34Ordering of Retrieved Documents
- Pure Boolean has no ordering
- In practice
- Order chronologically
- Order by total number of hits on query terms
- What if one term has more hits than others?
- Is it better to have one of each term or many of one term?
- Fancier methods have been investigated
- p-norm is most famous
- Usually impractical to implement
- Usually hard for user to understand
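The question of "one of each term or many of one term" shows up as soon as two simple orderings are compared. The sketch below, over hypothetical documents that all satisfy "cat OR dog", orders once by total hits and once by the number of distinct query terms matched (with total hits as a tie-breaker). Neither ordering is claimed to be what any particular system uses.

```python
from collections import Counter

# Hypothetical documents that all satisfy the query "cat OR dog".
docs = {
    "d1": "cat cat cat cat",
    "d2": "cat dog",
    "d3": "dog dog cat",
}
query_terms = ["cat", "dog"]


def total_hits(text):
    """Total number of occurrences of any query term."""
    counts = Counter(text.split())
    return sum(counts[t] for t in query_terms)


def distinct_terms(text):
    """Number of different query terms that occur at least once."""
    words = set(text.split())
    return sum(1 for t in query_terms if t in words)


by_hits = sorted(docs, key=lambda d: total_hits(docs[d]), reverse=True)
by_coverage = sorted(
    docs,
    key=lambda d: (distinct_terms(docs[d]), total_hits(docs[d])),
    reverse=True,
)

print("by total hits:    ", by_hits)
print("by distinct terms:", by_coverage)
```

Ordering by total hits puts d1 first even though it never mentions "dog"; ordering by distinct terms moves it last.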
35Boolean
- Advantages
- Simple queries are easy to understand
- Relatively easy to implement
- Disadvantages
- Difficult to specify what is wanted
- Too much returned, or too little
- Ordering not well determined
- Dominant language in commercial IR systems until
the WWW, and still the language of Database
Management Systems
36Faceted Boolean Query
- Strategy: Break query into facets (polysemous with earlier meaning of facets)
- Conjunction of disjunctions
- (a1 OR a2 OR a3) AND
- (b1 OR b2) AND
- (c1 OR c2 OR c3 OR c4)
- Each facet expresses a topic
- ("rain forest" OR jungle OR amazon) AND
- (medicine OR remedy OR cure) AND
- (Smith OR Zhou)
Also known as Conjunctive Normal Form or CNF
37Faceted Boolean Query
- Query still fails if one facet missing
- Alternative: Coordination level ranking
- Order results in terms of how many facets (disjuncts) are satisfied
- Also called Quorum ranking, Overlap ranking, and Best Match
- Problem: Facets still undifferentiated
- Alternative: Assign weights to facets
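The difference between the strict faceted query and coordination level ranking can be sketched as follows. The facets reuse the example terms from the previous slide; the documents and the helper `facets_satisfied` are hypothetical, and the matching is a naive whole-word substring test chosen to keep the example short.

```python
# Facets from the example query; documents are hypothetical.
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]
docs = {
    "d1": "smith studies jungle medicine",
    "d2": "a cure found in the amazon",
    "d3": "zhou on remedy patents",
    "d4": "city traffic report",
}


def facets_satisfied(text):
    """Count the facets that have at least one term occurring in the text."""
    padded = f" {text} "
    return sum(any(f" {term} " in padded for term in facet) for facet in facets)


# Strict Boolean (conjunction of disjunctions): every facet must be satisfied.
strict = [d for d in docs if facets_satisfied(docs[d]) == len(facets)]

# Coordination level ranking: order by number of facets satisfied, drop zeros.
ranked = sorted(docs, key=lambda d: facets_satisfied(docs[d]), reverse=True)
ranked = [d for d in ranked if facets_satisfied(docs[d]) > 0]

print("strict:", strict)
print("ranked:", ranked)
```

The strict query returns only d1, while the ranking still surfaces d2 and d3, each of which misses one facet. Weighting facets would replace the plain count with a weighted sum.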
38Proximity Searches
- Proximity: Terms occur within K positions of one another
- pen w/5 paper
- A Near function can be more vague
- near(pen, paper)
- Sometimes order can be specified
- Also, Phrases and Collocations
- "United Nations", "Bill Clinton"
- Phrase Variants
- "retrieval of information", "information retrieval"
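A proximity operator such as w/5 needs word positions, not just the fact that a term occurs. The sketch below is one possible reading, assuming "within K" means the two positions differ by at most K; real systems may count differently. The function names and the sample sentence are hypothetical.

```python
def positions(tokens, term):
    """Positions at which term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within(tokens, a, b, k, ordered=False):
    """True if some occurrences of a and b are at most k positions apart.

    With ordered=True, a must appear before b.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            gap = j - i
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= k:
                return True
    return False


tokens = "put the pen down next to the blank paper".split()

print(within(tokens, "pen", "paper", 5))                 # pen w/5 paper
print(within(tokens, "pen", "paper", 6))                 # pen w/6 paper
print(within(tokens, "paper", "pen", 6, ordered=True))   # paper must precede pen
```

In this sentence "pen" and "paper" are six positions apart, so the w/5 test fails, the w/6 test succeeds, and the ordered test fails because "paper" follows "pen".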
39Filters
- Filters: Reduce set of candidate docs
- Often specified simultaneously with query
- Usually restrictions on metadata
- Restrict by
- Date range
- Internet domain (.edu .com .berkeley.edu)
- Author
- Size
- Limit number of documents returned
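Filters of this kind operate on metadata rather than on document text. The sketch below applies the restrictions listed above to a list of hypothetical result records; the field names (`url`, `date`, `size`) and the function `apply_filters` are assumptions made for the example.

```python
from datetime import date

# Hypothetical result records with metadata; field names are assumptions.
results = [
    {"url": "http://www.sims.berkeley.edu/a", "date": date(2004, 9, 1), "size": 1200},
    {"url": "http://example.com/b", "date": date(2003, 5, 2), "size": 800},
    {"url": "http://www.mit.edu/c", "date": date(2004, 1, 15), "size": 50000},
    {"url": "http://www.berkeley.edu/d", "date": date(2001, 3, 9), "size": 300},
]


def host(url):
    """Host part of a URL of the form scheme://host/path."""
    return url.split("/")[2]


def apply_filters(records, domain=None, start=None, end=None,
                  max_size=None, limit=None):
    """Keep records that pass every filter that was supplied."""
    kept = []
    for r in records:
        if domain and not host(r["url"]).endswith(domain):
            continue
        if start and r["date"] < start:
            continue
        if end and r["date"] > end:
            continue
        if max_size is not None and r["size"] > max_size:
            continue
        kept.append(r)
    return kept[:limit]  # limit=None keeps everything


edu_2004 = apply_filters(results, domain=".edu", start=date(2004, 1, 1))
berkeley = apply_filters(results, domain=".berkeley.edu")
print([r["url"] for r in edu_2004])
print([r["url"] for r in berkeley])
```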
40Boolean Systems
- Most of the commercial database search systems that pre-date the WWW are based on Boolean search
- Dialog, Lexis-Nexis, etc.
- Most Online Library Catalogs are Boolean systems
- E.g., MELVYL
- Database systems use Boolean logic for searching
- Many of the search engines sold for intranet
search of web sites are Boolean
41Why Boolean?
- Easy to implement
- Efficient searching across very large databases
- Easy to explain results
- Has to have all of the words (AND)
- Has to have at least one of the words (OR)
42Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic and Boolean IR Systems
- Text Processing
- Discussion
43Content Analysis
- Automated transformation of raw text into a form that represents some aspect(s) of its meaning
- Including, but not limited to
- Automated Thesaurus Generation
- Phrase Detection
- Categorization
- Clustering
- Summarization
44Techniques for Content Analysis
- Statistical
- Single Document
- Full Collection
- Linguistic
- Syntactic
- Semantic
- Pragmatic
- Knowledge-Based (Artificial Intelligence)
- Hybrid (Combinations)
45Text Processing
- Standard Steps
- Recognize document structure
- Titles, sections, paragraphs, etc.
- Break into tokens
- Usually space and punctuation delineated
- Special issues with Asian languages
- Stemming/morphological analysis
- Store in inverted index (to be discussed later)
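The standard steps above can be strung together in a few lines. This is a deliberately simplified sketch: document structure is reduced to a title and a body, tokens are split on spaces and punctuation (which would not work for languages written without spaces), and `normalize` is a crude stand-in for real stemming. All names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def normalize(token):
    """Stand-in for a real stemmer: strip a trailing plural 's' only."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def build_index(docs):
    """Map each normalized term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        # Title and body stand in for the recognized document structure.
        for field in ("title", "body"):
            for tok in tokenize(doc[field]):
                index[normalize(tok)].add(doc_id)
    return index


docs = {
    1: {"title": "Cats and dogs", "body": "A cat chased two dogs."},
    2: {"title": "Dog collars", "body": "Collars, leashes, and more."},
}
index = build_index(docs)

print(sorted(index["dog"]))
print(sorted(index["collar"]))
```

Because "dogs" and "Dog" normalize to the same term, both documents are listed under "dog" in the index.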
46Content Analysis Areas
47Document Processing Steps
From Modern IR Textbook
48Stemming and Morphological Analysis
- Goal: normalize similar words
- Morphology (form of words)
- Inflectional Morphology
- E.g., inflect verb endings and noun number
- Never change grammatical class
- dog, dogs
- tengo, tienes, tiene, tenemos, tienen
- Derivational Morphology
- Derive one word from another
- Often change grammatical class
- build, building; health, healthy
49Automated Methods
- Powerful multilingual tools exist for morphological analysis
- PCKimmo, Xerox Lexical technology
- Require a grammar and dictionary
- Use two-level automata
- Stemmers
- Very dumb rules work well (for English)
- Porter Stemmer: Iteratively remove suffixes
- Improvement: Pass results through a lexicon
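The idea of iteratively removing suffixes and then checking the result against a lexicon can be illustrated with a toy stemmer. The rules below are NOT the Porter rule set, and the lexicon is a hypothetical handful of base forms; the sketch only shows the control flow.

```python
# Toy suffix rules, NOT the Porter rule set.
SUFFIXES = ["ing", "ed", "ly", "es", "s"]
# Hypothetical lexicon of known base forms.
LEXICON = {"build", "dog", "health", "walk", "quick"}


def strip_suffixes(word, min_stem=3):
    """Repeatedly remove the first matching suffix until none applies."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


def stem(word):
    """Accept the stripped form only if the lexicon knows it."""
    word = word.lower()
    candidate = strip_suffixes(word)
    return candidate if candidate in LEXICON else word


for w in ["buildings", "dogs", "walked", "quickly", "news"]:
    print(w, "->", strip_suffixes(w), "/", stem(w))
```

Blind stripping turns "news" into "new", which is the kind of error a dumb rule produces; the lexicon check rejects that stem and leaves the word unchanged.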
50Errors Generated by Porter Stemmer
From Krovetz 93
51Lecture Overview
- Review
- Introduction to Information Retrieval
- The Information Seeking Process
- History of IR Research
- IR System Structure (revisited)
- Central Concepts in IR
- Boolean Logic
- Boolean IR Systems
- Discussion
52Kavita Mittal on Bates
- Given that Yahoo Search categorizes its search results while Google does not, do you think they use different types of controlled vocabularies? What kind/s do you think they use?
- Can a Faceted Classification be used in a traditional library setting?
53Sarita Yardi on Hearst
- Marti's article was written in '96. I wanted to test how well her Simple Proximity Filter theory worked with Google. I want to know how popular Segways are in Europe and America. Which search query do you think will give better results and why?
- Query 1 - Segway popular Europe America
- Query 2 - "Segway" "popularity OR popular" "Europe OR America OR USA OR "United States"
- (Hint: the results were inconclusive and arbitrary; there is no wrong answer)
54Mini-Assignment
- Log on to your new LexisNexis account
- Go to http://www.nexis.com
- Your ID is the string of letters and numbers from the signup sheet
- Your password is your last name
- Learn how to perform Boolean operations on LexisNexis (use the online help pages)
- Do some searches on a topic interesting to you in different databases.
- (There will be a full assignment next week)
55Next Time
- Web Crawling
- Readings
- The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin and Page)
- Mercator: A Scalable, Extensible Web Crawler (Heydon and Najork)