1
Introduction to Information Retrieval
  • James Allan
  • Center for Intelligent Information
    RetrievalDepartment of Computer
    ScienceUniversity of Massachusetts, Amherst

2
Goals of this talk
  • Understand the IR problem
  • Understand IR vs. databases
  • Understand basic idea behind IR solutions
  • How does it work?
  • Why does it work?
  • Why don't IR systems work perfectly?
  • Understand that you shouldn't be surprised
  • Understand how research systems are evaluated

3
Overview
  • What is Information Retrieval
  • Some history
  • Why IR ≠ databases
  • How IR works
  • Evaluation

4
What is Information Retrieval?
  • Process of finding documents (text, mostly) that
    help someone satisfy an information need (query)
  • Includes related organizational tasks
  • Classification - assign documents to known
    classes
  • Routing - direct documents to proper person
  • Filtering - select documents for a long-
    standing request
  • Clustering - unsupervised grouping of related
    documents

5
In case that's not obvious
(Figure: a query goes into the IR system; a ranked list of results comes out)
6
Sample Systems
  • IR systems
  • Verity, Fulcrum, Excalibur, Oracle
  • InQuery, Smart, Okapi
  • Web search and In-house systems
  • West, LEXIS/NEXIS, Dialog
  • Lycos, AltaVista, Excite, Yahoo, HotBot, Google
  • Database systems
  • Oracle, Access, Informix, mysql, mdbms

7
History of Information Retrieval
  • Foundations
  • Library of Alexandria (3rd century BC, 500K
    volumes)
  • First concordance of the bible (13th century AD)
  • Printing press (15th century)
  • Johnson's dictionary (1755)
  • Dewey Decimal classification (1876)
  • Early automation
  • Luhn's statistical retrieval/abstracting (1959),
    Salton (60s)
  • MEDLINE (1964), Dialog (1967)
  • Recent developments
  • Relevance ranking available (late 80s)
  • Large-scale probabilistic system (West, 1992)
  • Multimedia, Internet, Digital Libraries (late
    90s)

8
Goals of IR
  • Basic goal and original motivation
  • Find documents that help answer query
  • IR is not question answering
  • Technology is broadly applicable to related areas
  • Linking related documents
  • Summarizing documents or sets of documents
  • Entire collections
  • Information filtering
  • Multi- and cross-lingual
  • Multimedia (images and speech)

9
Issues of IR
  • Text (and other media) representation
  • What is a good representation and how is it
    generated?
  • Queries
  • What is an appropriate query language? How are
    queries formulated?
  • How to translate the user's need into the query
    language?
  • Comparison of representations
  • What is a good model of retrieval?
  • How is uncertainty recognized?
  • Evaluation of methods
  • What is a good measure and a good testbed?

10
Overview
  • What is Information Retrieval
  • Some history
  • Why IR ≠ databases
  • How IR works
  • Evaluation

11
IR vs. Databases
  • Databases
  • Structured data (relations)
  • Fields with reasonably clear semantics
  • i.e., attributes
  • (age, SSN, name)
  • Strict query languages (relational algebra, SQL)
  • Information Retrieval
  • Unstructured data (generally text)
  • No semantics on fields
  • Free text (natural language) queries
  • Structured queries (e.g., Boolean) possible

12
IR vs. Database Systems (more)
  • IR has emphasis on effective, efficient retrieval
    of unstructured data
  • IR systems typically have very simple schemas
  • IR query languages emphasize free text, although
    Boolean combinations of words are also common
  • Matching is more complex than with structured
    data (semantics less obvious)
  • Easy to retrieve the wrong objects
  • Need to measure accuracy of retrieval
  • Less focus on concurrency control and recovery,
    although update is very important

13
Overview
  • What is Information Retrieval
  • Some history
  • Why IR ≠ databases
  • How IR works
  • Evaluation

14
Basic Approach
  • Most successful approaches are statistical
  • Either directly probabilistic, or an effort to
    capture probabilities
  • Why not natural language understanding?
  • State of the art is brittle in unrestricted
    domains
  • Can be highly successful in predictable settings
  • e.g., information extraction on terrorism or
    takeovers (MUC)
  • Could use manually assigned headings
  • Human agreement is not good
  • Expensive
  • Bag of words

15
What is this about?
6 occurrences: parrot, Santos
4 occurrences: Alvarez
3 occurrences: Escol, investigate, police, suspect
2 occurrences: asked, bird, burglar, buy, case, Fernando,
    headquarters, intruder, planned, scream, steal
1 occurrence: accompanied, admitted, after, alarm, all,
    approaches, asleep, birdseed, broke, called, charges,
    city, daily, decided, drop, during, early, exactly,
    forgiveness, friday, green, help, house, identified,
    kept, living, manila, master, mistake, mum, Nanding,
    national, neighbors, outburst, outside, paid,
    painstaking, panicked, pasay, peso, pet, philippine,
    pikoy, press, quoted, reward, room, rushed, saying,
    scaring, sell, speckled, squawks, star, stranger,
    surrendered, taught, thursday, training, tried,
    turned, unemployed, upstairs, weaverbird, woke,
    22, 1800, 44
16
The original text
Fernando Santos' painstaking training of his pet
parrot paid off when a burglar broke into his
living room. Doing exactly what it had been
taught to do -- scream when a stranger approaches
-- Pikoy the parrot screamed "Intruder!
Intruder!" The squawks woke up his master who was
asleep upstairs early Thursday, while scaring off
the suspect, investigator Nanding Escol said
Friday. The suspect, identified as Fernando
Alvarez, 22, panicked because of the parrot's
outburst and soon surrendered to Santos and his
neighbors who rushed to the house to help.
Alvarez, who is unemployed, admitted to police
that he tried to steal the bird and sell it,
Escol said. During investigation at Pasay City
Police Headquarters, just outside Manila, the
suspect pleaded that Santos drop the case because
he did not steal the parrot after all. Alvarez
turned to the speckled green bird and asked for
its forgiveness as well. But Alvarez called it a
weaverbird by mistake, and Santos asked
investigators to press charges. Santos was
quoted by a national daily Philippine Star as
saying that he had planned to buy a burglar alarm
but has now decided to buy birdseed for his
1,800-peso ($44) parrot as a reward. The parrot,
which accompanied Santos to police headquarters,
kept mum on the case, Escol said.
http://cnn.com/ASIANOW/southeast/9909/24/fringe/screaming.parrot.ap/index.html
17
Components of Approach (1 of 4)
  • Reduce every document to features
  • Words, phrases, names, ...
  • Links, structure, metadata, ...
  • Example (a reduction sketch follows below)
  • "Pikoy the parrot screamed Intruder! Intruder!
    The squawks woke up his master... investigator
    Nanding Escol said Friday"
  • pikoy parrot scream intrude intrude squawk wake
    master investigate nand escol said friday
  • pikoy the parrot, nanding escol
  • DATE: 1999-09-24, SOURCE: CNN_Headline_News
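
A rough sketch of this reduction step, assuming a toy stopword list and a
crude suffix-stripping stemmer (both are illustrative stand-ins, not what
InQuery or any particular system actually uses):

  # Reduce raw text to stemmed word features (toy version).
  import re

  STOPWORDS = {"the", "up", "his", "who"}   # illustrative only

  def crude_stem(word):
      # Naive suffix stripping; a real system would use a proper
      # stemmer such as Porter's.
      for suffix in ("ing", "ed", "er", "s"):
          if word.endswith(suffix) and len(word) > len(suffix) + 2:
              return word[:-len(suffix)]
      return word

  def extract_features(text):
      tokens = re.findall(r"[a-z0-9]+", text.lower())
      return [crude_stem(t) for t in tokens if t not in STOPWORDS]

  print(extract_features("Pikoy the parrot screamed Intruder! Intruder!"))
  # -> ['pikoy', 'parrot', 'scream', 'intrud', 'intrud']
  #    (a real stemmer would produce something closer to 'intrude')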

18
Components of Approach (2 of 4)
  • Assign weights to selected features
  • Most systems combine tf, idf, and document
    length (a toy weighting sketch follows below)
  • Example
  • Frequency within a document (tf)
  • "intrude" occurs twice, so it is more important
  • Document's length
  • two "intrude"s in a short passage: important
  • two "intrude"s over 10 pages: less important
  • Frequency across documents (idf)
  • if every document contains "intrude", it has
    little value
  • may be an important part of a document's meaning
  • but does nothing to differentiate documents
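
A toy version of such a weight, assuming raw term frequency, a logarithmic
idf, and simple document-length normalization; this is illustrative, not
the exact formula behind these slides:

  import math

  def tf_idf_weight(tf, doc_len, avg_doc_len, df, num_docs):
      # tf: occurrences of the term in this document
      # doc_len / avg_doc_len: this document's length vs. the average
      # df: documents containing the term; num_docs: collection size
      tf_part = tf / (tf + doc_len / avg_doc_len)   # discounts long documents
      idf_part = math.log(num_docs / df)            # rare terms score higher
      return tf_part * idf_part

  # "intrude" twice in a short story vs. twice in a very long report:
  print(tf_idf_weight(tf=2, doc_len=200, avg_doc_len=300, df=50, num_docs=10000))
  print(tf_idf_weight(tf=2, doc_len=5000, avg_doc_len=300, df=50, num_docs=10000))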

19
Components of Approach (3 of 4)
  • Reduce query to set of weighted features
  • Parallel of document reduction methods
  • Example
  • "I am interested in stories about parrots and
    the police"
  • interested stories parrots police
  • parrots police
  • Optionally, expand the query to capture synonyms
    (a sketch follows below)
  • parrot → bird, jungle, squawk, ...
  • Problems: parrot → mimic, repeat
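
A rough sketch of query reduction with optional expansion; the stopword
list and synonym table below are purely illustrative:

  # Build a query representation the same way documents are reduced,
  # then optionally expand it from a (hand-built, toy) synonym table.
  STOPWORDS = {"i", "am", "in", "about", "and", "the"}
  SYNONYMS = {"parrot": ["bird", "squawk"]}   # hypothetical expansion table

  def query_features(query, expand=False):
      terms = [t for t in query.lower().split() if t not in STOPWORDS]
      if expand:
          for t in list(terms):
              terms.extend(SYNONYMS.get(t, []))
      return terms

  print(query_features("I am interested in stories about parrots and the police"))
  # -> ['interested', 'stories', 'parrots', 'police']
  print(query_features("parrot police", expand=True))
  # -> ['parrot', 'police', 'bird', 'squawk']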

20
Components of Approach (4 of 4)
  • Compare query to documents
  • Fundamentally
  • Looking for word (feature) overlap
  • More features in common between query and doc ⇒
    more likely the doc is relevant to the query
  • However
  • Highly weighted features more important
  • Might impose some feature presence criteria
  • e.g., at least two features must be present (a
    scoring sketch follows below)
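
A minimal sketch of the comparison step, assuming per-term query weights
have already been computed and enforcing the "at least two features
present" criterion mentioned above:

  # Score a document by the summed weights of the query terms it
  # contains, requiring a minimum number of matching features.
  def score(query_weights, doc_terms, min_match=2):
      matched = [t for t in query_weights if t in doc_terms]
      if len(matched) < min_match:
          return 0.0
      return sum(query_weights[t] for t in matched)

  query = {"parrot": 2.5, "police": 1.8, "burglar": 2.0}   # made-up weights
  doc_a = {"parrot", "burglar", "santos", "manila"}
  doc_b = {"police"}
  print(score(query, doc_a))   # 4.5 (two features match)
  print(score(query, doc_b))   # 0.0 (fails the two-feature criterion)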

21
Vector Space Model
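The slide's figure is not reproduced here; in the vector space model,
documents and queries are weighted term vectors and similarity is the
cosine of the angle between them. A minimal sketch over sparse vectors
(the weights are made up):

  import math

  def cosine(q, d):
      # q, d: sparse term-weight vectors represented as dicts.
      dot = sum(w * d.get(t, 0.0) for t, w in q.items())
      norm_q = math.sqrt(sum(w * w for w in q.values()))
      norm_d = math.sqrt(sum(w * w for w in d.values()))
      if norm_q == 0.0 or norm_d == 0.0:
          return 0.0
      return dot / (norm_q * norm_d)

  query = {"parrot": 1.0, "police": 1.0}
  doc = {"parrot": 2.5, "burglar": 2.0, "police": 0.5}
  print(round(cosine(query, doc), 3))   # -> 0.655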
22
Inference Networks (InQuery)
  • Query represents a combination of evidence user
    believes will capture relevance
  • Assign each query feature a belief
  • Similar in motivation to a probability
  • Like your beliefs in an outcome given evidence
  • In fact, VS and IN use same weighting/belief
  • Lots of ways to combine query beliefs
  • sum(a b), parsum200(a b)
  • and(a b), band(a b), not(a), or(a b)
  • wsum( 8 a 3 b )
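
A sketch of how such beliefs are commonly combined in the inference-network
literature (Turtle and Croft); InQuery's exact operator definitions may
differ, and band/parsum200 are not covered here:

  # Closed forms commonly given for combining child beliefs p1..pn.
  def op_and(beliefs):                 # all evidence must hold
      result = 1.0
      for p in beliefs:
          result *= p
      return result

  def op_or(beliefs):                  # at least one piece of evidence holds
      none = 1.0
      for p in beliefs:
          none *= (1.0 - p)
      return 1.0 - none

  def op_not(p):
      return 1.0 - p

  def op_sum(beliefs):                 # unweighted average of the evidence
      return sum(beliefs) / len(beliefs)

  def op_wsum(weighted):               # weighted average, e.g. wsum(8 a 3 b)
      total = sum(w for w, _ in weighted)
      return sum(w * p for w, p in weighted) / total

  print(op_wsum([(8, 0.7), (3, 0.4)]))   # -> 0.618...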

23
InQuery Beliefs
  • Belief that the document is about term i
  • Combines tf, idf, and length normalization (one
    published form is sketched below)
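
As an illustration, one frequently cited form of the InQuery term belief
combines a length-normalized tf component with a scaled idf component;
treat the constants as approximate, since versions of the system differ:

  import math

  def inquery_belief(tf, df, doc_len, avg_doc_len, num_docs):
      # tf component, damped by document length relative to the average
      t_part = tf / (tf + 0.5 + 1.5 * doc_len / avg_doc_len)
      # idf component, scaled into [0, 1]
      i_part = math.log((num_docs + 0.5) / df) / math.log(num_docs + 1)
      # default belief 0.4 plus up to 0.6 of evidence
      return 0.4 + 0.6 * t_part * i_part

  print(inquery_belief(tf=2, df=50, doc_len=200, avg_doc_len=300, num_docs=10000))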

24
Efficient Implementations
  • How to handle comparisons efficiently
  • Inverted lists for access to large collections
  • Several gigabytes of text now common
  • Millions of documents
  • TREC's VLC: 20 GB in 1997, 100 GB in 1998
  • 20.1 GB is about 7.5 million documents
  • The Web...
  • Indexing must be fast also
  • Hundreds of megabytes to a gigabyte per hour

25
Indexes Inverted Lists
  • Inverted lists are most common indexing technique
  • Source file: the collection, organized by
    document
  • Inverted file: the collection, organized by term
  • one record per term, listing locations where the
    term occurs (a minimal construction is sketched
    below)
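
A minimal in-memory sketch of such an inverted file, with one postings
record per term holding (document id, positions); production systems
compress these lists and keep them on disk:

  from collections import defaultdict

  def build_inverted_index(docs):
      # docs: {doc_id: list of terms}.
      # Returns {term: [(doc_id, [positions])]}.
      index = defaultdict(list)
      for doc_id, terms in docs.items():
          positions = defaultdict(list)
          for pos, term in enumerate(terms):
              positions[term].append(pos)
          for term, pos_list in positions.items():
              index[term].append((doc_id, pos_list))
      return index

  docs = {
      "d1": ["pikoy", "parrot", "scream", "intrude", "intrude"],
      "d2": ["police", "investigate", "parrot"],
  }
  print(build_inverted_index(docs)["parrot"])   # -> [('d1', [1]), ('d2', [2])]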

26
Variations on a Theme
  • Wide range of feature selection and weighting
  • Different models of similarity
  • Exact match
  • Title, author, etc.
  • Boolean
  • Greater sense of control but less effective
  • Probabilistic
  • P(query|document) and P(document|query)
  • More rigorous, better for research
  • Discoveries here can be co-opted by other
    approaches!
  • Topological
  • Deform space of documents based on user feedback

27
Summary of approach
  • Reduce documents to features
  • Weight features
  • Reduce query to set of weighted features
  • Compare query to all documents
  • Select closest ones first
  • Vector space model is most common
  • Usually augmented with Boolean constructs
  • Most other models are research-only systems

28
Overview
  • What is Information Retrieval
  • Some history
  • Why IR ≠ databases
  • How IR works
  • Evaluation

29
Evaluation of IR
  • Need to measure quality of IR systems
  • What does 50% accuracy mean?
  • Typically measured with test collections
  • Set of known documents
  • Set of known queries
  • Relevance judgments for those queries
  • Run systems A and B, or systems A and A′
  • Measure difference in returned results
  • If sufficiently large, can rank systems
  • Usually requires many queries (at least 25 at a
    time) and many collections to believe it is
    predictive
  • TREC (NIST) is best known IR evaluation workshop

30
Precision and recall
  • Precision - proportion of the retrieved set that
    is relevant
  • Precision = |relevant ∩ retrieved| / |retrieved|
    = P(relevant | retrieved)
  • Recall - proportion of all relevant documents in
    the collection included in the retrieved set
  • Recall = |relevant ∩ retrieved| / |relevant|
    = P(retrieved | relevant)
  • Precision and recall are well-defined for sets
  • For ranked retrieval
  • Compute a P/R point for each relevant document,
    interpolate (a sketch follows below)
  • Compute at fixed recall points (e.g., precision
    at 20% recall)
  • Compute at fixed rank cutoffs (e.g., precision at
    rank 20)
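
A minimal sketch of these measures over a ranked list with binary
relevance judgments; this computes un-interpolated values (interpolation
is sketched on the "Interpolation and averaging" slide):

  def precision_recall_points(ranking, relevant):
      # ranking: doc ids in ranked order; relevant: set of relevant ids.
      # Returns a (recall, precision) pair after each rank.
      points, hits = [], 0
      for rank, doc in enumerate(ranking, start=1):
          if doc in relevant:
              hits += 1
          points.append((hits / len(relevant), hits / rank))
      return points

  def average_precision(ranking, relevant):
      # Mean of the precision values at the ranks of relevant documents.
      total, hits = 0.0, 0
      for rank, doc in enumerate(ranking, start=1):
          if doc in relevant:
              hits += 1
              total += hits / rank
      return total / len(relevant)

  print(average_precision(["d1", "x", "d2"], {"d1", "d2"}))   # -> 0.833...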

31
Precision and recall example
(the slide's figure marks which ranked documents are relevant)
Ranking 1
  Recall:    0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
  Precision: 1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5
  Avg Prec = (1.0 + 0.67 + 0.5 + 0.44 + 0.5) / 5 = 0.62
Ranking 2
  Recall:    0.0  0.2  0.2  0.2  0.4  0.6  0.8  1.0  1.0  1.0
  Precision: 0.0  0.5  0.33 0.25 0.4  0.5  0.57 0.63 0.55 0.5
  Avg Prec = (0.5 + 0.4 + 0.5 + 0.57 + 0.63) / 5 = 0.52
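
As a check, the average precision values above follow directly from the
ranks at which the relevant documents appear, read off the recall steps:
ranks 1, 3, 6, 9, 10 for Ranking 1 and ranks 2, 5, 6, 7, 8 for Ranking 2.
A tiny self-contained computation:

  def avg_prec(relevant_ranks):
      # i-th relevant document (1-based) found at rank r contributes i / r.
      return sum((i + 1) / r
                 for i, r in enumerate(relevant_ranks)) / len(relevant_ranks)

  print(round(avg_prec([1, 3, 6, 9, 10]), 2))   # -> 0.62  (Ranking 1)
  print(round(avg_prec([2, 5, 6, 7, 8]), 2))    # -> 0.52  (Ranking 2)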
32
Precision and recall, second example
(the same relevant documents as before)
Ranking 1
  Recall:    0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
  Precision: 1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5
(a different query's relevant documents)
Ranking 3
  Recall:    0.0  0.33 0.33 0.33 0.67 0.67 1.0  1.0  1.0  1.0
  Precision: 0.0  0.5  0.33 0.25 0.4  0.33 0.43 0.38 0.33 0.3
33
Interpolation and averaging
  • Hard to compare individual P/R graphs or tables
  • Two main types of averaging
  • microaverage - each relevant document is a point
    in the average
  • macroaverage - each query is a point in the
    average
  • Average precision at standard recall points
  • For given query, compute P/R point for every
    relevant doc
  • Interpolate precision at standard recall levels
  • Average over all queries to get average precision
    at each recall level
  • Average over all recall levels to get a single
    result
  • the overall average is not very useful by itself
  • still commonly used; correlates strongly with
    other measures (an interpolation sketch follows
    below)
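
A minimal sketch of the interpolation step, using the common convention
that interpolated precision at recall level r is the maximum precision
observed at any recall >= r (standard levels 0.0, 0.1, ..., 1.0 assumed):

  def interpolate(pr_points, levels=None):
      # pr_points: (recall, precision) pairs for one query.
      if levels is None:
          levels = [i / 10 for i in range(11)]
      return {r: max((p for rec, p in pr_points if rec >= r), default=0.0)
              for r in levels}

  # P/R points for Ranking 1 from the earlier example.
  table = interpolate([(0.2, 1.0), (0.4, 0.67), (0.6, 0.5),
                       (0.8, 0.44), (1.0, 0.5)])
  print(table[0.0], table[0.5], table[1.0])   # -> 1.0 0.5 0.5

Macroaveraging then averages these interpolated values at each recall
level across queries.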

34
Recall-precision table
35
Recall-precision graph
36
Contingency table
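The slide's figure is not reproduced here; it presumably shows the
standard retrieval contingency table, which in the terms of the earlier
definitions is:

                    Relevant          Not relevant
  Retrieved         A (true pos.)     B (false pos.)
  Not retrieved     C (false neg.)    D (true neg.)

  Precision = A / (A + B)        Recall = A / (A + C)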
37
Improvements in IR over the years
(data thanks to Chris Buckley, Sabir)
Average precision on the ad-hoc task, by SMART system version
(rows: TREC topic set; percentages: gain over that row's TREC-1 system)

Topics    TREC-1   TREC-2   TREC-3   TREC-4   TREC-5   TREC-6   TREC-7
TREC-1    0.2442   0.3056   0.3400   0.3628   0.3759   0.3709   0.3778
                   +25.1%   +39.2%   +48.6%   +53.9%   +51.9%   +54.7%
TREC-2    0.2615   0.3344   0.3512   0.3718   0.3832   0.3780   0.3839
                   +27.9%   +34.3%   +42.2%   +46.6%   +44.6%   +46.8%
TREC-3    0.2099   0.2828   0.3219   0.3812   0.3992   0.4011   0.4003
                   +34.8%   +53.4%   +81.6%   +90.2%   +91.1%   +90.7%
TREC-4    0.1533   0.1728   0.2131   0.2819   0.3107   0.3044   0.3142
                   +12.8%   +39.0%   +83.9%   +102.7%  +98.6%   +105.0%
TREC-5    0.1048   0.1111   0.1287   0.1842   0.2046   0.2028   0.2116
                   +6.0%    +22.9%   +75.8%   +95.3%   +93.6%   +102.0%
TREC-6    0.0997   0.1125   0.1242   0.1807   0.1844   0.1768   0.1804
                   +12.8%   +24.6%   +81.3%   +85.0%   +77.3%   +80.9%
TREC-7    0.1137   0.1258   0.1679   0.2262   0.2547   0.2510   0.2543
                   +10.6%   +47.7%   +99.0%   +124.0%  +120.8%  +123.7%
38
Summary of talk
  • What is Information Retrieval
  • Some history
  • Why IR ≠ databases
  • How IR works
  • Evaluation

39
Overview
  • What is Information Retrieval
  • Some history
  • Why IR ≠ databases
  • How IR works
  • Evaluation
  • Collaboration
  • Other research
  • Interactive
  • Event detection
  • Cross-language IR
  • Timelines
  • Hierarchies