Title: Introduction to Information Retrieval
1Introduction to Information Retrieval
- James Allan
- Center for Intelligent Information Retrieval
- Department of Computer Science
- University of Massachusetts, Amherst
2Goals of this talk
- Understand the IR problem
- Understand IR vs. databases
- Understand basic idea behind IR solutions
- How does it work?
- Why does it work?
- Why don't IR systems work perfectly?
- Understand that you shouldn't be surprised
- Understand how research systems are evaluated
3Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
4What is Information Retrieval?
- Process of finding documents (text, mostly) that help someone satisfy an information need (query)
- Includes related organizational tasks
- Classification - assign documents to known classes
- Routing - direct documents to proper person
- Filtering - select documents for a long-standing request
- Clustering - unsupervised grouping of related documents
5In case that's not obvious
(figure: Query → Ranked results)
6Sample Systems
- IR systems
- Verity, Fulcrum, Excalibur, Oracle
- InQuery, Smart, Okapi
- Web search and In-house systems
- West, LEXIS/NEXIS, Dialog
- Lycos, AltaVista, Excite, Yahoo, HotBot, Google
- Database systems
- Oracle, Access, Informix, mysql, mdbms
7History of Information Retrieval
- Foundations
- Library of Alexandria (3rd century BC, 500K volumes)
- First concordance of the Bible (13th century AD)
- Printing press (15th century)
- Johnson's dictionary (1755)
- Dewey Decimal classification (1876)
- Early automation
- Luhn's statistical retrieval/abstracting (1959), Salton (60s)
- MEDLINE (1964), Dialog (1967)
- Recent developments
- Relevance ranking available (late 80s)
- Large-scale probabilistic system (West, 1992)
- Multimedia, Internet, Digital Libraries (late 90s)
8Goals of IR
- Basic goal and original motivation
- Find documents that help answer query
- IR is not question answering
- Technology is broadly applicable to related areas
- Linking related documents
- Summarizing documents or sets of documents
- Entire collections
- Information filtering
- Multi- and cross-lingual
- Multimedia (images and speech)
9Issues of IR
- Text (and other media) representation
- What is a good representation and how to generate it?
- Queries
- Appropriate query language? How to formulate?
- How to translate user's need into query language?
- Comparison of representations
- What is a good model of retrieval?
- How is uncertainty recognized?
- Evaluation of methods
- What is a good measure and a good testbed?
10Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
11IR vs. Databases
- Databases
- Structured data (relations)
- Fields with reasonably clear semantics
- i.e., attributes
- (age, SSN, name)
- Strict query languages (relational algebra, SQL)
- Information Retrieval
- Unstructured data (generally text)
- No semantics on fields
- Free text (natural language) queries
- Structured queries (e.g., Boolean) possible
12IR vs. Database Systems (more)
- IR has emphasis on effective, efficient retrieval of unstructured data
- IR systems typically have very simple schemas
- IR query languages emphasize free text, although Boolean combinations of words also common
- Matching is more complex than with structured data (semantics less obvious)
- Easy to retrieve the wrong objects
- Need to measure accuracy of retrieval
- Less focus on concurrency control and recovery, although update is very important
13Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
14Basic Approach
- Most successful approaches are statistical
- Direct, or effort to capture probabilities
- Why not natural language understanding?
- State of the art is brittle in unrestricted domains
- Can be highly successful in predictable settings
- e.g., information extraction on terrorism or takeovers (MUC)
- Could use manually assigned headings
- Human agreement is not good
- Expensive
- Bag of words
15What is this about?
6 - parrot Santos
4 - Alvarez
3 - Escol investigate police suspect
2 - asked bird burglar buy case Fernando headquarters intruder planned scream steal
1 - accompanied admitted after alarm all approaches asleep birdseed broke called charges city daily decided drop during early exactly forgiveness friday green help house identified kept living manila master mistake mum Nanding national neighbors outburst outside paid painstaking panicked pasay peso pet philippine pikoy press quoted reward room rushed saying scaring sell speckled squawks star stranger surrendered taught thursday training tried turned unemployed upstairs weaverbird woke 22 1800 44
16The original text
Fernando Santos' painstaking training of his pet
parrot paid off when a burglar broke into his
living room. Doing exactly what it had been
taught to do -- scream when a stranger approaches
-- Pikoy the parrot screamed "Intruder!
Intruder!" The squawks woke up his master who was
asleep upstairs early Thursday, while scaring off
the suspect, investigator Nanding Escol said
Friday. The suspect, identified as Fernando
Alvarez, 22, panicked because of the parrot's
outburst and soon surrendered to Santos and his
neighbors who rushed to the house to help.
Alvarez, who is unemployed, admitted to police
that he tried to steal the bird and sell it,
Escol said. During investigation at Pasay City
Police Headquarters, just outside Manila, the
suspect pleaded that Santos drop the case because
he did not steal the parrot after all. Alvarez
turned to the speckled green bird and asked for
its forgiveness as well. But Alvarez called it a
weaverbird by mistake, and Santos asked
investigators to press charges. Santos was
quoted by a national daily Philippine Star as
saying that he had planned to buy a burglar alarm
but has now decided to buy birdseed for his
1,800-peso ($44) parrot as a reward. The parrot,
which accompanied Santos to police headquarters,
kept mum on the case, Escol said.
http://cnn.com/ASIANOW/southeast/9909/24/fringe/screaming.parrot.ap/index.html
17Components of Approach (1 of 4)
- Reduce every document to features
- Words, phrases, names, ...
- Links, structure, metadata, ...
- Example
- Pikoy the parrot screamed "Intruder! Intruder!" The squawks woke up his master... investigator Nanding Escol said Friday
- pikoy parrot scream intrude intrude squawk wake master investigate nand escol said friday
- pikoy the parrot, nanding escol
- DATE: 1999-09-24, SOURCE: CNN_Headline_News
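A minimal sketch of this reduction step in Python, assuming a tiny stopword list and a crude suffix-stripping stemmer as stand-ins for whatever stoplist and stemmer the real system used:

```python
import re

STOPWORDS = {"i", "am", "in", "about", "and", "the", "up", "his", "said"}

def crude_stem(word):
    # Very rough suffix stripping, standing in for a real stemmer (e.g., Porter)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def extract_features(text):
    # Lowercase, keep alphabetic tokens, drop stopwords, stem what remains
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(extract_features('Pikoy the parrot screamed "Intruder! Intruder!"'))
# -> ['pikoy', 'parrot', 'scream', 'intruder', 'intruder']
```

Note that the slide's stemmer reduces "intruder" to "intrude"; the crude stemmer above does not, which is exactly the kind of detail a real stemmer handles.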
18Components of Approach (2 of 4)
- Assign weights to selected features
- Most systems combine tf, idf, and document length
- Example
- Frequency within a document (tf)
- intrude occurs twice, so more important
- Document length
- two intrudes in a passage - important
- two intrudes over 10 pages - less important
- Frequency across documents (idf)
- if every document contains intrude, it has little value
- may be an important part of a document's meaning
- but does nothing to differentiate documents
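A minimal sketch of a weight that reflects all three considerations, assuming the common logarithmic form of idf and a simple length ratio for normalization (the deck does not commit to a specific formula):

```python
import math

def weight(tf, df, num_docs, doc_len, avg_doc_len):
    # tf scaled down for long documents; idf rewards terms rare in the collection
    norm_tf = tf / (doc_len / avg_doc_len)
    idf = math.log(num_docs / df)
    return norm_tf * idf

# Illustrative numbers only: "intrude" twice in a short story vs. twice in 10 pages
print(weight(tf=2, df=50, num_docs=1_000_000, doc_len=300, avg_doc_len=300))
print(weight(tf=2, df=50, num_docs=1_000_000, doc_len=5000, avg_doc_len=300))
# A term appearing in every document gets idf = log(1) = 0: no discriminating power
```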
19Components of Approach (3 of 4)
- Reduce query to set of weighted features
- Parallel of document reduction methods
- Example
- I am interested in stories about parrots and the police
- interested stories parrots police
- parrots police
- Optional: expand query to capture synonyms
- parrot → bird, jungle, squawk, ...
- Problems: parrot → mimic, repeat
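A minimal sketch of the query side, with a hypothetical stopword list and a toy synonym table (the expansion terms are just the slide's examples, not a real thesaurus):

```python
STOPWORDS = {"i", "am", "in", "about", "and", "the"}
EXPANSIONS = {"parrots": ["bird", "squawk"]}   # hypothetical expansion table

def query_features(query, expand=False):
    # Same kind of reduction applied to documents: lowercase, drop stopwords
    features = [w for w in query.lower().split() if w not in STOPWORDS]
    if expand:
        for f in list(features):
            features += EXPANSIONS.get(f, [])
    return features

print(query_features("I am interested in stories about parrots and the police"))
# -> ['interested', 'stories', 'parrots', 'police']
print(query_features("parrots police", expand=True))
# -> ['parrots', 'police', 'bird', 'squawk']
```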
20Components of Approach (4 of 4)
- Compare query to documents
- Fundamentally
- Looking for word (feature) overlap
- More features in common between query and doc → more likely doc is relevant to query
- However
- Highly weighted features more important
- Might impose some feature presence criteria
- e.g., at least two features must be present
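A minimal sketch of the comparison, assuming each document has already been reduced to a dict of feature weights and that the optional presence criterion is a minimum number of shared features (all names and weights here are illustrative):

```python
def score(query_features, doc_weights, min_overlap=1):
    # Sum the document's weights for the features it shares with the query
    shared = [f for f in query_features if f in doc_weights]
    if len(shared) < min_overlap:          # optional feature presence criterion
        return 0.0
    return sum(doc_weights[f] for f in shared)

docs = {
    "d1": {"parrot": 19.8, "burglar": 7.2, "police": 2.1},
    "d2": {"police": 2.1, "budget": 5.5},
}
query = ["parrot", "police"]
ranked = sorted(docs, key=lambda d: score(query, docs[d], min_overlap=2), reverse=True)
print(ranked)   # d1 shares both features, d2 only one -> ['d1', 'd2']
```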
21Vector Space Model
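The vector space model treats the query and each document as vectors of feature weights and ranks documents by the cosine of the angle between the two vectors; in the standard formulation (not specific to this deck),

```latex
\mathrm{sim}(q, d) \;=\; \cos(\vec{q}, \vec{d})
  \;=\; \frac{\sum_{i} q_i \, d_i}
             {\sqrt{\sum_{i} q_i^{2}} \, \sqrt{\sum_{i} d_i^{2}}}
```

where q_i and d_i are the weights (e.g., tf-idf) of feature i in the query and the document.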
22Inference Networks (InQuery)
- Query represents a combination of evidence user believes will capture relevance
- Assign each query feature a belief
- Similar in motivation to a probability
- Like your beliefs in an outcome given evidence
- In fact, VS and IN use same weighting/belief
- Lots of ways to combine query beliefs
- sum(a b), parsum200(a b)
- and(a b), band(a b), not(a), or(a b)
- wsum( 8 a 3 b )
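A minimal sketch of how such operators might combine per-term beliefs, using the combining functions usually described for inference-net retrieval (product for AND, noisy-OR for OR, plain and weighted averages for SUM/WSUM); the exact InQuery definitions may differ in detail:

```python
from math import prod

def op_and(beliefs):            # all evidence must hold: product of beliefs
    return prod(beliefs)

def op_or(beliefs):             # noisy-OR: 1 - P(no evidence holds)
    return 1 - prod(1 - b for b in beliefs)

def op_not(belief):
    return 1 - belief

def op_sum(beliefs):            # unweighted average of the beliefs
    return sum(beliefs) / len(beliefs)

def op_wsum(weighted):          # weighted average, e.g. wsum(8 a 3 b)
    total = sum(w for w, _ in weighted)
    return sum(w * b for w, b in weighted) / total

a, b = 0.7, 0.4                 # illustrative per-term beliefs
print(op_and([a, b]), op_or([a, b]), op_sum([a, b]), op_wsum([(8, a), (3, b)]))
```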
23InQuery Beliefs
- Belief that a document is about term i
- Combine tf, idf, and length normalization
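One commonly published form of this belief for InQuery-style systems (taken from the literature, so treat it as representative rather than as the slide's exact formula) combines a default belief with a length-normalized tf component and an idf component:

```latex
\mathrm{bel}(d, t_i) \;=\; 0.4 \;+\; 0.6 \cdot
  \frac{tf}{tf + 0.5 + 1.5 \,\frac{dl}{avg\_dl}} \cdot
  \frac{\log\frac{N + 0.5}{df}}{\log(N + 1)}
```

Here tf is the count of t_i in document d, dl and avg_dl are the document and average document lengths, df is the number of documents containing t_i, and N is the number of documents in the collection.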
24Efficient Implementations
- How to handle comparisons efficiently
- Inverted lists for access to large collections
- Several gigabytes of text now common
- Millions of documents
- TREC's VLC: 20 GB in 1997, 100 GB in 1998
- 20.1 GB is about 7.5 million documents
- The Web...
- Indexing must be fast also
- Hundreds of megabytes to a gigabyte per hour
25Indexes: Inverted Lists
- Inverted lists are most common indexing technique
- Source file: collection, organized by document
- Inverted file: collection, organized by term
- one record per term, listing locations where term occurs
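A minimal sketch of an inverted index, assuming documents have already been reduced to token lists; each term maps to the documents, and positions within them, where it occurs:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: [token, ...]} -> index: {term: {doc_id: [positions]}}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

docs = {
    "d1": ["pikoy", "parrot", "scream", "intrude", "intrude"],
    "d2": ["police", "investigate", "parrot"],
}
index = build_inverted_index(docs)
print(dict(index["parrot"]))    # {'d1': [1], 'd2': [2]}
print(dict(index["intrude"]))   # {'d1': [3, 4]}
```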
26Variations on a Theme
- Wide range of feature selection and weighting
- Different models of similarity
- Exact match
- Title, author, etc.
- Boolean
- Greater sense of control but less effective
- Probabilistic
- P(query | document) and P(document | query)
- More rigorous, better for research
- Discoveries here can be co-opted by other approaches!
- Topological
- Deform space of documents based on user feedback
27Summary of approach
- Reduce documents to features
- Weight features
- Reduce query to set of weighted features
- Compare query to all documents
- Select closest ones first
- Vector space model is most common
- Usually augmented with Boolean constructs
- Most other models are research-only systems
28Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
29Evaluation of IR
- Need to measure quality of IR systems
- What does 50% accuracy mean?
- Typically measured with test collections
- Set of known documents
- Set of known queries
- Relevance judgments for those queries
- Run systems A and B, or systems A and A′
- Measure difference in returned results
- If sufficiently large, can rank systems
- Usually requires many queries (at least 25 at a time) and many collections to believe it is predictive
- TREC (NIST) is best known IR evaluation workshop
30Precision and recall
- Precision - proportion of retrieved set that is relevant
- Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved)
- Recall - proportion of all relevant documents in the collection included in the retrieved set
- Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)
- Precision and recall are well-defined for sets
- For ranked retrieval
- Compute a P/R point for each relevant document, interpolate
- Compute at fixed recall points (e.g., precision at 20% recall)
- Compute at fixed rank cutoffs (e.g., precision at rank 20)
31Precision and recall example
(figure: 10 ranked documents, with the 5 relevant ones marked)

Ranking 1 (relevant documents at ranks 1, 3, 6, 9, 10)
Rank     1     2     3     4     5     6     7     8     9     10
Recall   0.2   0.2   0.4   0.4   0.4   0.6   0.6   0.6   0.8   1.0
Prec.    1.0   0.5   0.67  0.5   0.4   0.5   0.43  0.38  0.44  0.5
Avg Prec = (1.0 + 0.67 + 0.5 + 0.44 + 0.5) / 5 = 0.62

Ranking 2 (relevant documents at ranks 2, 5, 6, 7, 8)
Rank     1     2     3     4     5     6     7     8     9     10
Recall   0.0   0.2   0.2   0.2   0.4   0.6   0.8   1.0   1.0   1.0
Prec.    0.0   0.5   0.33  0.25  0.4   0.5   0.57  0.63  0.55  0.5
Avg Prec = (0.5 + 0.4 + 0.5 + 0.57 + 0.63) / 5 = 0.52
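A minimal sketch of these computations; the 0/1 lists encode which ranks hold relevant documents in Rankings 1 and 2, read off the recall sequences above:

```python
def average_precision(rels, num_relevant):
    # Mean of the precision values at the ranks of the relevant documents
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant

ranking1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
ranking2 = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]
print(round(average_precision(ranking1, 5), 2))   # 0.62
print(round(average_precision(ranking2, 5), 2))   # 0.52
```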
32Precision and recall, second example
(figure: same 5 relevant documents as before)

Ranking 1 (relevant documents at ranks 1, 3, 6, 9, 10)
Rank     1     2     3     4     5     6     7     8     9     10
Recall   0.2   0.2   0.4   0.4   0.4   0.6   0.6   0.6   0.8   1.0
Prec.    1.0   0.5   0.67  0.5   0.4   0.5   0.43  0.38  0.44  0.5

(figure: a different query's relevant documents - only 3 of them)

Ranking 3 (relevant documents at ranks 2, 5, 7)
Rank     1     2     3     4     5     6     7     8     9     10
Recall   0.0   0.33  0.33  0.33  0.67  0.67  1.0   1.0   1.0   1.0
Prec.    0.0   0.5   0.33  0.25  0.4   0.33  0.43  0.38  0.33  0.3
33Interpolation and averaging
- Hard to compare individual P/R graphs or tables
- Two main types of averaging
- microaverage - each relevant document is a point in the average
- macroaverage - each query is a point in the average
- Average precision at standard recall points
- For given query, compute P/R point for every relevant doc
- Interpolate precision at standard recall levels
- Average over all queries to get average precision at each recall level
- Average over all recall levels to get a single result
- overall average is not very useful by itself
- still commonly used; strong correlation with other measures
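The interpolation rule usually applied at the standard recall levels (the TREC convention; the slide does not spell it out) takes the best precision at or beyond each recall level:

```latex
P_{\mathrm{interp}}(r) \;=\; \max_{r' \ge r} P(r'),
\qquad r \in \{0.0, 0.1, \ldots, 1.0\}
```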
34Recall-precision table
35Recall-precision graph
36Contingency table
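The contingency table commonly used in IR evaluation relates retrieval decisions to relevance judgments, and precision and recall are ratios of its cells (a standard layout, assumed here rather than copied from the slide):

                 Relevant      Not relevant
Retrieved        A             B
Not retrieved    C             D

Precision = A / (A + B)        Recall = A / (A + C)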
37Improvements in IR over the years
(data thanks to Chris Buckley, Sabir)
Ad-hoc Task: average precision of each SMART system version (columns) on each TREC data set (rows),
with percent improvement over the TREC-1 version of the system in parentheses.

Data     TREC-1   TREC-2           TREC-3           TREC-4            TREC-5            TREC-6            TREC-7
TREC-1   0.2442   0.3056 (+25.1%)  0.3400 (+39.2%)  0.3628 (+48.6%)   0.3759 (+53.9%)   0.3709 (+51.9%)   0.3778 (+54.7%)
TREC-2   0.2615   0.3344 (+27.9%)  0.3512 (+34.3%)  0.3718 (+42.2%)   0.3832 (+46.6%)   0.3780 (+44.6%)   0.3839 (+46.8%)
TREC-3   0.2099   0.2828 (+34.8%)  0.3219 (+53.4%)  0.3812 (+81.6%)   0.3992 (+90.2%)   0.4011 (+91.1%)   0.4003 (+90.7%)
TREC-4   0.1533   0.1728 (+12.8%)  0.2131 (+39.0%)  0.2819 (+83.9%)   0.3107 (+102.7%)  0.3044 (+98.6%)   0.3142 (+105.0%)
TREC-5   0.1048   0.1111 (+6.0%)   0.1287 (+22.9%)  0.1842 (+75.8%)   0.2046 (+95.3%)   0.2028 (+93.6%)   0.2116 (+102.0%)
TREC-6   0.0997   0.1125 (+12.8%)  0.1242 (+24.6%)  0.1807 (+81.3%)   0.1844 (+85.0%)   0.1768 (+77.3%)   0.1804 (+80.9%)
TREC-7   0.1137   0.1258 (+10.6%)  0.1679 (+47.7%)  0.2262 (+99.0%)   0.2547 (+124.0%)  0.2510 (+120.8%)  0.2543 (+123.7%)
38Summary of talk
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
39Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
- Collaboration
- Other research
- Interactive
- Event detection
- Cross-language IR
- Timelines
- Hierarchies