Title: Introduction to Information Retrieval
1Introduction to Information Retrieval
- James Allan
- Center for Intelligent Information Retrieval
- Department of Computer Science
- University of Massachusetts, Amherst
2Goals of this talk
- Understand the IR problem
- Understand IR vs. databases
- Understand basic idea behind IR solutions
- How does it work?
- Why does it work?
- Why don't IR systems work perfectly?
- Understand that you shouldn't be surprised
- Understand how research systems are evaluated
3Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
4What is Information Retrieval?
- Process of finding documents (text, mostly) that help someone satisfy an information need (query)
- Includes related organizational tasks
- Classification - assign documents to known classes
- Routing - direct documents to proper person
- Filtering - select documents for a long-standing request
- Clustering - unsupervised grouping of related documents
5In case that's not obvious
(figure: Query → Ranked results)
6Sample Systems
- IR systems
- Verity, Fulcrum, Excalibur, Oracle
- InQuery, Smart, Okapi
- Web search and In-house systems
- West, LEXIS/NEXIS, Dialog
- Lycos, AltaVista, Excite, Yahoo, HotBot, Google
- Database systems
- Oracle, Access, Informix, mysql, mdbms
7History of Information Retrieval
- Foundations
- Library of Alexandria (3rd century BC, 500K volumes)
- First concordance of the Bible (13th century AD)
- Printing press (15th century)
- Johnson's dictionary (1755)
- Dewey Decimal classification (1876)
- Early automation
- Luhn's statistical retrieval/abstracting (1959), Salton (60s)
- MEDLINE (1964), Dialog (1967)
- Recent developments
- Relevance ranking available (late 80s)
- Large-scale probabilistic system (West, 1992)
- Multimedia, Internet, Digital Libraries (late 90s)
8Goals of IR
- Basic goal and original motivation
- Find documents that help answer query
- IR is not question answering
- Technology is broadly applicable to related areas
- Linking related documents
- Summarizing documents or sets of documents
- Entire collections
- Information filtering
- Multi- and cross-lingual
- Multimedia (images and speech)
9Issues of IR
- Text (and other media) representation
- What is a good representation and how to generate it?
- Queries
- Appropriate query language? How to formulate?
- How to translate user's need into query language?
- Comparison of representations
- What is a good model of retrieval?
- How is uncertainty recognized?
- Evaluation of methods
- What is a good measure and a good testbed?
10Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
11IR vs. Databases
- Databases
- Structured data (relations)
- Fields with reasonably clear semantics
- i.e., attributes
- (age, SSN, name)
- Strict query languages (relational algebra, SQL)
- Information Retrieval
- Unstructured data (generally text)
- No semantics on fields
- Free text (natural language) queries
- Structured queries (e.g., Boolean) possible
12IR vs. Database Systems (more)
- IR has emphasis on effective, efficient retrieval of unstructured data
- IR systems typically have very simple schemas
- IR query languages emphasize free text, although Boolean combinations of words also common
- Matching is more complex than with structured data (semantics less obvious)
- Easy to retrieve the wrong objects
- Need to measure accuracy of retrieval
- Less focus on concurrency control and recovery, although update is very important
13Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
14Basic Approach
- Most successful approaches are statistical
- Direct, or effort to capture probabilities
- Why not natural language understanding?
- State of the art is brittle in unrestricted domains
- Can be highly successful in predictable settings
- e.g., information extraction on terrorism or takeovers (MUC)
- Could use manually assigned headings
- Human agreement is not good
- Expensive
- Bag of words
15What is this about?
6 - parrot Santos
4 - Alvarez
3 - Escol investigate police suspect
2 - asked bird burglar buy case Fernando headquarters intruder planned scream steal
1 - accompanied admitted after alarm all approaches asleep birdseed broke called charges city daily decided drop during early exactly forgiveness friday green help house identified kept living manila master mistake mum Nanding national neighbors outburst outside paid painstaking panicked pasay peso pet philippine pikoy press quoted reward room rushed saying scaring sell speckled squawks star stranger surrendered taught thursday training tried turned unemployed upstairs weaverbird woke 22 1800 44
16The original text
Fernando Santos' painstaking training of his pet
parrot paid off when a burglar broke into his
living room. Doing exactly what it had been
taught to do -- scream when a stranger approaches
-- Pikoy the parrot screamed "Intruder!
Intruder!" The squawks woke up his master who was
asleep upstairs early Thursday, while scaring off
the suspect, investigator Nanding Escol said
Friday. The suspect, identified as Fernando
Alvarez, 22, panicked because of the parrot's
outburst and soon surrendered to Santos and his
neighbors who rushed to the house to help.
Alvarez, who is unemployed, admitted to police
that he tried to steal the bird and sell it,
Escol said. During investigation at Pasay City
Police Headquarters, just outside Manila, the
suspect pleaded that Santos drop the case because
he did not steal the parrot after all. Alvarez
turned to the speckled green bird and asked for
its forgiveness as well. But Alvarez called it a
weaverbird by mistake, and Santos asked
investigators to press charges. Santos was
quoted by a national daily Philippine Star as
saying that he had planned to buy a burglar alarm
but has now decided to buy birdseed for his
1,800-peso ($44) parrot as a reward. The parrot,
which accompanied Santos to police headquarters,
kept mum on the case, Escol said.
http://cnn.com/ASIANOW/southeast/9909/24/fringe/screaming.parrot.ap/index.html
17Components of Approach (1 of 4)
- Reduce every document to features
- Words, phrases, names, ...
- Links, structure, metadata, ...
- Example
- Pikoy the parrot screamed "Intruder! Intruder!" The squawks woke up his master... investigator Nanding Escol said Friday
- pikoy parrot scream intrude intrude squawk wake master investigate nand escol said friday
- pikoy the parrot, nanding escol
- DATE: 1999-09-24, SOURCE: CNN_Headline_News
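A minimal sketch of this reduction step in Python, assuming a tiny stopword list and a crude suffix-stripping stemmer as stand-ins for whatever stoplist and stemmer the real system used:

```python
import re

STOPWORDS = {"i", "am", "in", "about", "and", "the", "up", "his", "said"}

def crude_stem(word):
    # Very rough suffix stripping, standing in for a real stemmer (e.g., Porter)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def extract_features(text):
    # Lowercase, keep alphabetic tokens, drop stopwords, stem what remains
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(extract_features('Pikoy the parrot screamed "Intruder! Intruder!"'))
# -> ['pikoy', 'parrot', 'scream', 'intruder', 'intruder']
```

Note that the slide's stemmer reduces "intruder" to "intrude"; the crude stemmer above does not, which is exactly the kind of detail a real stemmer handles.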
18Components of Approach (2 of 4)
- Assign weights to selected features
- Most systems combine tf, idf, and document length
- Example
- Frequency within a document (tf)
- intrude occurs twice, so more important
- Document length
- two intrudes in a passage - important
- two intrudes over 10 pages - less important
- Frequency across documents (idf)
- if every document contains intrude, it has little value
- may be an important part of a document's meaning
- but does nothing to differentiate documents
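A minimal sketch of a weight that reflects all three considerations, assuming the common logarithmic form of idf and a simple length ratio for normalization (the deck does not commit to a specific formula):

```python
import math

def weight(tf, df, num_docs, doc_len, avg_doc_len):
    # tf scaled down for long documents; idf rewards terms rare in the collection
    norm_tf = tf / (doc_len / avg_doc_len)
    idf = math.log(num_docs / df)
    return norm_tf * idf

# Illustrative numbers only: "intrude" twice in a short story vs. twice in 10 pages
print(weight(tf=2, df=50, num_docs=1_000_000, doc_len=300, avg_doc_len=300))
print(weight(tf=2, df=50, num_docs=1_000_000, doc_len=5000, avg_doc_len=300))
# A term appearing in every document gets idf = log(1) = 0: no discriminating power
```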
19Components of Approach (3 of 4)
- Reduce query to set of weighted features
- Parallel of document reduction methods
- Example
- I am interested in stories about parrots and the police
- interested stories parrots police
- parrots police
- Optional: expand query to capture synonyms
- parrot → bird, jungle, squawk, ...
- Problems: parrot → mimic, repeat
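A minimal sketch of the query side, with a hypothetical stopword list and a toy synonym table (the expansion terms are just the slide's examples, not a real thesaurus):

```python
STOPWORDS = {"i", "am", "in", "about", "and", "the"}
EXPANSIONS = {"parrots": ["bird", "squawk"]}   # hypothetical expansion table

def query_features(query, expand=False):
    # Same kind of reduction applied to documents: lowercase, drop stopwords
    features = [w for w in query.lower().split() if w not in STOPWORDS]
    if expand:
        for f in list(features):
            features += EXPANSIONS.get(f, [])
    return features

print(query_features("I am interested in stories about parrots and the police"))
# -> ['interested', 'stories', 'parrots', 'police']
print(query_features("parrots police", expand=True))
# -> ['parrots', 'police', 'bird', 'squawk']
```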
20Components of Approach (4 of 4)
- Compare query to documents
- Fundamentally
- Looking for word (feature) overlap
- More features in common between query and doc → more likely doc is relevant to query
- However
- Highly weighted features more important
- Might impose some feature presence criteria
- e.g., at least two features must be present
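A minimal sketch of the comparison, assuming each document has already been reduced to a dict of feature weights and that the optional presence criterion is a minimum number of shared features (all names and weights here are illustrative):

```python
def score(query_features, doc_weights, min_overlap=1):
    # Sum the document's weights for the features it shares with the query
    shared = [f for f in query_features if f in doc_weights]
    if len(shared) < min_overlap:          # optional feature presence criterion
        return 0.0
    return sum(doc_weights[f] for f in shared)

docs = {
    "d1": {"parrot": 19.8, "burglar": 7.2, "police": 2.1},
    "d2": {"police": 2.1, "budget": 5.5},
}
query = ["parrot", "police"]
ranked = sorted(docs, key=lambda d: score(query, docs[d], min_overlap=2), reverse=True)
print(ranked)   # d1 shares both features, d2 only one -> ['d1', 'd2']
```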
21Vector Space Model
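The vector space model treats the query and each document as vectors of feature weights and ranks documents by the cosine of the angle between the two vectors; in the standard formulation (not specific to this deck),

```latex
\mathrm{sim}(q, d) \;=\; \cos(\vec{q}, \vec{d})
  \;=\; \frac{\sum_{i} q_i \, d_i}
             {\sqrt{\sum_{i} q_i^{2}} \, \sqrt{\sum_{i} d_i^{2}}}
```

where q_i and d_i are the weights (e.g., tf-idf) of feature i in the query and the document.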
22Inference Networks (InQuery)
- Query represents a combination of evidence user believes will capture relevance
- Assign each query feature a belief
- Similar in motivation to a probability
- Like your beliefs in an outcome given evidence
- In fact, VS and IN use same weighting/belief
- Lots of ways to combine query beliefs
- sum(a b), parsum200(a b)
- and(a b), band(a b), not(a), or(a b)
- wsum( 8 a 3 b )
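A minimal sketch of how such operators might combine per-term beliefs, using the combining functions usually described for inference-net retrieval (product for AND, noisy-OR for OR, plain and weighted averages for SUM/WSUM); the exact InQuery definitions may differ in detail:

```python
from math import prod

def op_and(beliefs):            # all evidence must hold: product of beliefs
    return prod(beliefs)

def op_or(beliefs):             # noisy-OR: 1 - P(no evidence holds)
    return 1 - prod(1 - b for b in beliefs)

def op_not(belief):
    return 1 - belief

def op_sum(beliefs):            # unweighted average of the beliefs
    return sum(beliefs) / len(beliefs)

def op_wsum(weighted):          # weighted average, e.g. wsum(8 a 3 b)
    total = sum(w for w, _ in weighted)
    return sum(w * b for w, b in weighted) / total

a, b = 0.7, 0.4                 # illustrative per-term beliefs
print(op_and([a, b]), op_or([a, b]), op_sum([a, b]), op_wsum([(8, a), (3, b)]))
```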
23InQuery Beliefs
- Belief that a document is about term i
- Combine tf, idf, and length normalization
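One commonly published form of this belief for InQuery-style systems (taken from the literature, so treat it as representative rather than as the slide's exact formula) combines a default belief with a length-normalized tf component and an idf component:

```latex
\mathrm{bel}(d, t_i) \;=\; 0.4 \;+\; 0.6 \cdot
  \frac{tf}{tf + 0.5 + 1.5 \,\frac{dl}{avg\_dl}} \cdot
  \frac{\log\frac{N + 0.5}{df}}{\log(N + 1)}
```

Here tf is the count of t_i in document d, dl and avg_dl are the document and average document lengths, df is the number of documents containing t_i, and N is the number of documents in the collection.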
24Efficient Implementations
- How to handle comparisons efficiently
- Inverted lists for access to large collections
- Several gigabytes of text now common
- Millions of documents
- TREC's VLC: 20 GB in 1997, 100 GB in 1998
- 20.1 GB is about 7.5 million documents
- The Web...
- Indexing must be fast also
- Hundreds of megabytes to a gigabyte per hour
25Indexes: Inverted Lists
- Inverted lists are most common indexing technique
- Source file: collection, organized by document
- Inverted file: collection, organized by term
- one record per term, listing locations where term occurs
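A minimal sketch of an inverted index, assuming documents have already been reduced to token lists; each term maps to the documents, and positions within them, where it occurs:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: [token, ...]} -> index: {term: {doc_id: [positions]}}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

docs = {
    "d1": ["pikoy", "parrot", "scream", "intrude", "intrude"],
    "d2": ["police", "investigate", "parrot"],
}
index = build_inverted_index(docs)
print(dict(index["parrot"]))    # {'d1': [1], 'd2': [2]}
print(dict(index["intrude"]))   # {'d1': [3, 4]}
```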
26Variations on a Theme
- Wide range of feature selection and weighting
- Different models of similarity
- Exact match
- Title, author, etc.
- Boolean
- Greater sense of control but less effective
- Probabilistic
- P(query | document) and P(document | query)
- More rigorous, better for research
- Discoveries here can be co-opted by other approaches!
- Topological
- Deform space of documents based on user feedback
27Summary of approach
- Reduce documents to features
- Weight features
- Reduce query to set of weighted features
- Compare query to all documents
- Select closest ones first
- Vector space model is most common
- Usually augmented with Boolean constructs
- Most other models are research-only systems
28Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
29Evaluation of IR
- Need to measure quality of IR systems
- What does 50% accuracy mean?
- Typically measured with test collections
- Set of known documents
- Set of known queries
- Relevance judgments for those queries
- Run systems A and B, or systems A and A′
- Measure difference in returned results
- If sufficiently large, can rank systems
- Usually requires many queries (at least 25 at a time) and many collections to believe it is predictive
- TREC (NIST) is best known IR evaluation workshop
30Precision and recall
- Precision - proportion of retrieved set that is relevant
- Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved)
- Recall - proportion of all relevant documents in the collection included in the retrieved set
- Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)
- Precision and recall are well-defined for sets
- For ranked retrieval
- Compute a P/R point for each relevant document, interpolate
- Compute at fixed recall points (e.g., precision at 20% recall)
- Compute at fixed rank cutoffs (e.g., precision at rank 20)
31Precision and recall example
(figure: 10 ranked documents, with the 5 relevant ones marked)

Ranking 1 (relevant documents at ranks 1, 3, 6, 9, 10)
Rank     1     2     3     4     5     6     7     8     9     10
Recall   0.2   0.2   0.4   0.4   0.4   0.6   0.6   0.6   0.8   1.0
Prec.    1.0   0.5   0.67  0.5   0.4   0.5   0.43  0.38  0.44  0.5
Avg Prec = (1.0 + 0.67 + 0.5 + 0.44 + 0.5) / 5 = 0.62

Ranking 2 (relevant documents at ranks 2, 5, 6, 7, 8)
Rank     1     2     3     4     5     6     7     8     9     10
Recall   0.0   0.2   0.2   0.2   0.4   0.6   0.8   1.0   1.0   1.0
Prec.    0.0   0.5   0.33  0.25  0.4   0.5   0.57  0.63  0.55  0.5
Avg Prec = (0.5 + 0.4 + 0.5 + 0.57 + 0.63) / 5 = 0.52
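A minimal sketch of these computations; the 0/1 lists encode which ranks hold relevant documents in Rankings 1 and 2, read off the recall sequences above:

```python
def average_precision(rels, num_relevant):
    # Mean of the precision values at the ranks of the relevant documents
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant

ranking1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
ranking2 = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]
print(round(average_precision(ranking1, 5), 2))   # 0.62
print(round(average_precision(ranking2, 5), 2))   # 0.52
```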
32Precision and recall, second example
(figure: same 5 relevant documents as before)

Ranking 1 (relevant documents at ranks 1, 3, 6, 9, 10)
Rank     1     2     3     4     5     6     7     8     9     10
Recall   0.2   0.2   0.4   0.4   0.4   0.6   0.6   0.6   0.8   1.0
Prec.    1.0   0.5   0.67  0.5   0.4   0.5   0.43  0.38  0.44  0.5

(figure: a different query's relevant documents - only 3 of them)

Ranking 3 (relevant documents at ranks 2, 5, 7)
Rank     1     2     3     4     5     6     7     8     9     10
Recall   0.0   0.33  0.33  0.33  0.67  0.67  1.0   1.0   1.0   1.0
Prec.    0.0   0.5   0.33  0.25  0.4   0.33  0.43  0.38  0.33  0.3
33Interpolation and averaging
- Hard to compare individual P/R graphs or tables
- Two main types of averaging
- microaverage - each relevant document is a point in the average
- macroaverage - each query is a point in the average
- Average precision at standard recall points
- For given query, compute P/R point for every relevant doc
- Interpolate precision at standard recall levels
- Average over all queries to get average precision at each recall level
- Average over all recall levels to get a single result
- overall average is not very useful by itself
- still commonly used; strong correlation with other measures
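The interpolation rule usually applied at the standard recall levels (the TREC convention; the slide does not spell it out) takes the best precision at or beyond each recall level:

```latex
P_{\mathrm{interp}}(r) \;=\; \max_{r' \ge r} P(r'),
\qquad r \in \{0.0, 0.1, \ldots, 1.0\}
```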
34Recall-precision table
35Recall-precision graph
36Contingency table
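The contingency table commonly used in IR evaluation relates retrieval decisions to relevance judgments, and precision and recall are ratios of its cells (a standard layout, assumed here rather than copied from the slide):

                 Relevant      Not relevant
Retrieved        A             B
Not retrieved    C             D

Precision = A / (A + B)        Recall = A / (A + C)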
37Improvements in IR over the years
(data thanks to Chris Buckley, Sabir)
Ad-hoc Task: average precision of each SMART system version (columns) on each TREC data set (rows),
with percent improvement over the TREC-1 version of the system in parentheses.

Data     TREC-1   TREC-2           TREC-3           TREC-4            TREC-5            TREC-6            TREC-7
TREC-1   0.2442   0.3056 (+25.1%)  0.3400 (+39.2%)  0.3628 (+48.6%)   0.3759 (+53.9%)   0.3709 (+51.9%)   0.3778 (+54.7%)
TREC-2   0.2615   0.3344 (+27.9%)  0.3512 (+34.3%)  0.3718 (+42.2%)   0.3832 (+46.6%)   0.3780 (+44.6%)   0.3839 (+46.8%)
TREC-3   0.2099   0.2828 (+34.8%)  0.3219 (+53.4%)  0.3812 (+81.6%)   0.3992 (+90.2%)   0.4011 (+91.1%)   0.4003 (+90.7%)
TREC-4   0.1533   0.1728 (+12.8%)  0.2131 (+39.0%)  0.2819 (+83.9%)   0.3107 (+102.7%)  0.3044 (+98.6%)   0.3142 (+105.0%)
TREC-5   0.1048   0.1111 (+6.0%)   0.1287 (+22.9%)  0.1842 (+75.8%)   0.2046 (+95.3%)   0.2028 (+93.6%)   0.2116 (+102.0%)
TREC-6   0.0997   0.1125 (+12.8%)  0.1242 (+24.6%)  0.1807 (+81.3%)   0.1844 (+85.0%)   0.1768 (+77.3%)   0.1804 (+80.9%)
TREC-7   0.1137   0.1258 (+10.6%)  0.1679 (+47.7%)  0.2262 (+99.0%)   0.2547 (+124.0%)  0.2510 (+120.8%)  0.2543 (+123.7%)
38Summary of talk
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
39Overview
- What is Information Retrieval
- Some history
- Why IR ≠ databases
- How IR works
- Evaluation
- Collaboration
- Other research
- Interactive
- Event detection
- Cross-language IR
- Timelines
- Hierarchies