Introduction: Recherche d'information (Information Retrieval)

1
Introduction: Recherche d'information (Information Retrieval)

Jian-Yun Nie
(Based on van Rijsbergen's introduction)

2
Plan

Definition
History
Experimental tradition
Methods
Query
Indexing
Matching
Results
Evaluation
Current situation
Research

3
Definition
4
Important concepts

Document: an entity that contains some description of information; may be in the form of text, image, graphic, video, speech, etc.
Document collection: a set of documents (may be static or dynamic)
User
Information need: the user's requirement of information
Query (request): a description of an information need, usually in natural language
Relevance (relevant document): a document that contains the required information (Pertinence)
Correspondence, degree of relevance, relevance score: the degree of belief (by the system) that a document is relevant
Judge: user/system
Indexing: a process that transforms a document into a form of internal representation
Retrieval: an operation that determines the documents to be retrieved
Response (answer): the documents returned by the system (usually a ranked list)

5
What is IR?

(Salton, 1968) Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
(Lancaster, 1968) An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.
(Needham, 1977) ...the complexity arises from the impossibility of describing the content of a document, or the intent of request, precisely, or unambiguously

6
Data retrieval vs. IR (CvR, 1979)
7
History
8
Important events in IR (1)

1952 Mooers coins the term IR
1958 International Conference on Scientific Information
1960 Cranfield I
1960 Maron and Kuhns paper
1961 (-1965) Smart built
1964 Washington conference on Association Methods
1966 Cranfield II
1968 Salton's first book
197- Cranfield conferences
1975 CvR's book
1975 Ideal test collection
1976 KSJ/SER JASIS paper

9
Important events in IR (2)

1978 1st SIGIR
1979 1st BCSIRSG
1980 1st joint ACM/BCS conference on IR
1981 KSJ book on IR Experiments
1982 Belkin et al ASK hypothesis
1983 - Okapi started
1985 RIAO-1
1986 CvR logic model
1990 Deerwester et al., LSI paper
1991 Inquery started
1992 TREC-1
1998 Ponte and Croft paper on language models

10
Best known researchers (Salton award)
11
Best known researchers (Salton award)
12
Best known researchers (Salton award)
13
Relevant journals and conferences

Journals
ACM Transactions on Information Systems (TOIS)
Information Processing and Management (IPM)
Journal of the American Society for Information Science and Technology (JASIST)
Information Retrieval
Conferences
ACM SIGIR
CIKM
TREC
ECIR
ACL
RIAO
CORIA

14
Experimental tradition
15
Tradition of Experiments

Strong experimental tradition (from Cranfield)
To prove that an IR technique or IR system is better, its effectiveness should be measured on test data
This tradition has had a strong influence on other areas (computational linguistics, AI, machine translation, etc.)
Pros
Develop practically effective approaches
Experimental evidence to prove a technique
Avoid nice, but useless theories
Cons
Neglect theoretical development
Difficult to develop new theories and new techniques to compete against established methods
Wide use of heuristics, intuitions, manual tuning, or tricks

16
Experimental Methodology

Cleverdon: Cranfield
Developed the Cranfield Experiments (funded by the National Science Foundation) and introduced the concepts of recall and precision to study the performance of information retrieval systems.
Lancaster: Medlars
Report on the evaluation of its operating efficiency. American Documentation, 20(2), 119-142, April 1969. Lancaster refined a technique of failure analysis for this evaluation, seeking to investigate reasons why relevant documents were not retrieved.
Saracevic: CWRU
Theories and experiments related to human information behavior; human-computer interaction from the human viewpoint, and modeling interaction processes in information retrieval. Notion of relevance in relation to information and information systems. Theoretical and pragmatic study of the value of information and library services. Nature of information science as a field.
Salton: Smart
"Salton's Magical Automatic Retriever of Text"
Vector space model, relevance feedback, tfidf, ...
Sparck Jones: Ideal Test Collection
Big document collection, large set of various queries, exhaustive relevance judgments
Blair and Maron: Stairs
Law documents, result analysis
Harman: TREC
Annual experimental contest
Large document collections, more realistic queries, partial relevance judgments
Not an ideal test collection, but more realistic

17
Some References on the Web

Cyril W. Cleverdon, The significance of the Cranfield tests on index languages, ACM-SIGIR, 1991, pp. 3-12 (http://portal.acm.org/citation.cfm?id=122861)
David C. Blair, M. E. Maron, An evaluation of retrieval effectiveness for a full-text document-retrieval system, Communications of the ACM, v.28 n.3, p.289-299, March 1985 (http://portal.acm.org/citation.cfm?id=3197&dl=GUIDE&coll=GUIDE&CFID=65359706&CFTOKEN=94782922)
G. Salton, M. E. Lesk, Computer Evaluation of Indexing and Text Processing, Journal of the ACM (JACM), v.15 n.1, p.8-36, Jan. 1968 (http://portal.acm.org/citation.cfm?id=321441&dl=GUIDE&coll=GUIDE&CFID=65359986&CFTOKEN=40286)
TREC: http://trec.nist.gov

18
Evaluation
[Diagram labels: Query, Document collection, Desired answers, Answers, evaluation]
19
Test collection

Document collection: a large set of documents
Query set: a set of queries (usually 50 or more)
Relevance judgments: for each query, determine manually the relevant documents in the document collection
In TREC, the judgments are not known prior to the experiments

20
Methods
21
Indexing-based IR

[Diagram: Document -> indexing -> Representation (keywords); Query -> indexing (query analysis) -> Representation (keywords); the two representations meet in query evaluation]

22
Query
23
Query Language

Artificial/Natural (web)
multilingual/cross-lingual
images
none at all!

24
Query Definition

Complete/Incomplete
Independence/Dependence
Weighted/Unweighted (tf idf)
Query expansion/one shot (feedback, web)
Sense disambiguation
Cross-lingual

25
Indexing
26
Maron's theory of indexing

...in the case where the query consists of single term, call it B, the probability that a given document will be judged relevant by a patron submitting B is simply the ratio of the number of patrons who submit B as their query and judge that document as relevant, to the number of patrons who submit B as their search query
P(D|B) = P(D,B) / P(B)
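
The ratio described on this slide can be sketched as a simple count over a log of patron judgments. This is only an illustration: the log format, the function name relevance_probability, and the sample data are assumptions made for the example and do not come from Maron's work.

```python
def relevance_probability(judgments, term, doc_id):
    """Estimate P(D|B) as a ratio of patron counts.

    judgments: iterable of (query_term, set_of_doc_ids_judged_relevant),
    one entry per patron (a hypothetical log format).
    """
    submitted = 0  # patrons who submitted `term` as their query
    relevant = 0   # of those, patrons who judged `doc_id` relevant
    for query_term, relevant_docs in judgments:
        if query_term == term:
            submitted += 1
            if doc_id in relevant_docs:
                relevant += 1
    if submitted == 0:
        return None  # no patron used this term, so there is no estimate
    return relevant / submitted


# Made-up log: three patrons searched "retrieval", two of them liked d1.
log = [
    ("retrieval", {"d1", "d2"}),
    ("retrieval", {"d1"}),
    ("retrieval", {"d3"}),
    ("indexing", {"d1"}),
]
p = relevance_probability(log, "retrieval", "d1")
print(p)
```

With this log the estimate for d1 given the query term "retrieval" is 2 out of 3 patrons.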

27
Representation of Information

Discrimination without Representation (specificity)
Representation with Discrimination (exhaustivity)
TFIDF
TF: importance of term for a document
IDF: importance of document for term (specificity)
...defining a concept of information, ...that once this notion is properly explicated a document can be represented by the information it contains (CvR, 1979)
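
A minimal sketch of TF-IDF weighting, assuming raw term counts for TF and log(N / df) for IDF. Many other variants exist, and the slide does not say which one is meant; the function name tf_idf and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()  # document frequency: in how many documents a term occurs
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


docs = [
    "information retrieval system".split(),
    "information theory".split(),
    "retrieval of information".split(),
]
w = tf_idf(docs)
print(w[0])
```

In this toy collection "information" occurs in every document, so its IDF (and weight) is zero, while the rarer "system" outweighs "retrieval" in the first document.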

28
Matching (query evaluation)
29
Matching (query evaluation)

exact/partial match, e.g. SQL/Dice
Boolean matching (Fairthorne, 50)
co-ordination level matching (Cleverdon, 60)
cosine correlation (Salton, 70) VS
probabilistic (ranking principle) (SER, 80) PRP
logical uncertainty principle (CvR, 90) LUP
Bayesian inference (Croft, 90) NET
Language modeling (Ponte & Croft, 98)
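
Four of the matching functions listed above can be sketched in a few lines: Boolean AND (exact match), co-ordination level, the Dice coefficient, and cosine correlation. The function names, the set-of-terms and dict-of-weights representations, and the sample data are assumptions made for illustration, not a reconstruction of any particular system.

```python
import math


def boolean_and(query, doc):
    """Exact match: every query term must occur in the document."""
    return set(query) <= set(doc)


def coordination_level(query, doc):
    """Partial match: number of query terms the document shares."""
    return len(set(query) & set(doc))


def dice(query, doc):
    """Dice coefficient between the two term sets."""
    q, d = set(query), set(doc)
    if not q and not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))


def cosine(qvec, dvec):
    """Cosine correlation between two {term: weight} vectors."""
    dot = sum(w * dvec.get(t, 0.0) for t, w in qvec.items())
    nq = math.sqrt(sum(w * w for w in qvec.values()))
    nd = math.sqrt(sum(w * w for w in dvec.values()))
    if nq == 0 or nd == 0:
        return 0.0
    return dot / (nq * nd)


query = ["information", "retrieval"]
doc = ["information", "retrieval", "evaluation", "methods"]
qvec = {t: 1.0 for t in query}
dvec = {t: 1.0 for t in doc}
print(boolean_and(query, doc), coordination_level(query, doc))
print(dice(query, doc), cosine(qvec, dvec))
```

The Boolean test only says yes or no, whereas the other three produce scores that can be used to rank documents.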

30
Inference

Deduction/Induction: A, A -> B infer B
Cluster Hypothesis
Association Hypothesis
P(term1 | term2)

31
Logic

It is a common fallacy, underwritten at this date
by the investment of several million dollars in a
variety of retrieval hardware, that the algebra
of Boole (1847) is the appropriate formalism for
retrieval design..The logic of Brouwer, as
invoked by Fairthorne, is one such weakening of
the postulate system, (Mooers, 1961)
Another one
Logical Uncertainty Principle (CvR, 1986)

32
Logic

If Mark were to lose his job, he would work less
If Mark were to work less, he would be less tense
If Mark were to lose his job, he would be less tense

33
Cluster Hypothesis

If document X is closely associated with Y, then over the population of potential queries the probability of relevance for X will be approximately the same as the probability of relevance for Y, or in symbols
P(relevance | X) ≈ P(relevance | Y)
Document clustering

34
Association Hypothesis

If one index term X is good at discriminating
relevant from non-relevant documents, then any
closely associated index term Y is also likely to
be good at this.
P(relevance | X) ≈ P(relevance | Y)
Query expansion
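
One way to put the association hypothesis to work for query expansion is to estimate term association from document-level co-occurrence and add strongly associated terms to the query. The sketch below assumes P(x | y) is estimated as (documents containing both x and y) / (documents containing y). The function names, the threshold, and the toy collection are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probs(docs):
    """Return {(x, y): P(x | y)} estimated from document-level co-occurrence."""
    df = Counter()    # documents containing each term
    pair = Counter()  # documents containing both terms of an ordered pair
    for tokens in docs:
        terms = set(tokens)
        df.update(terms)
        pair.update(permutations(terms, 2))
    return {(x, y): c / df[y] for (x, y), c in pair.items()}


def expand(query, probs, threshold):
    """Add every term x with P(x | y) >= threshold for some query term y."""
    expanded = list(query)
    for (x, y), p in sorted(probs.items()):
        if y in query and x not in expanded and p >= threshold:
            expanded.append(x)
    return expanded


docs = [
    ["car", "engine", "wheel"],
    ["car", "engine"],
    ["car", "road"],
    ["boat", "engine"],
]
probs = cooccurrence_probs(docs)
print(expand(["car"], probs, 0.5))
```

Here "engine" co-occurs with "car" in two of the three documents that contain "car", so it passes the 0.5 threshold, while "wheel" and "road" do not.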

35
Models

Boolean
Vector Space (metrics) - mixture of things
Probabilistic (3 models)
Logical (implication) - what kind of logic
Language models
Cognitive (users)

36
Retrieval Result
37
Items Wanted

Matching/Relevant or Correct/Useful
The function of a document retrieval system
cannot be to retrieve all and only the relevant
documents....but to guide the patron in his
search for information (Maron)
Topical/tasks
Meaning/content

38
Some difficulties with relevance

Goffman, 1969
..that the relevance of the information from one
document depends upon what is already known about
the subject, and in turn affects the relevance of
other documents subsequently examined.
Maron,
Just because a document is about the subject
sought by a patron, that fact does not imply that
he would judge it relevant.

39
Relevance (Borlund, 2000)

That is, the relevance or irrelevance of a given retrieved document may affect the user's current state of knowledge, resulting in a change of the user's information need, which may lead to a change of the user's perception/interpretation of the subsequent retrieved documents.

40
Evaluation
41
Error Response

Precision error: an irrelevant document is retrieved
P = (relevant docs retrieved) / (retrieved)
Recall error: a relevant document is not retrieved
R = (relevant docs retrieved) / (relevant)
Trade-off
F-measure: F = 2PR / (P + R)
How to cope with lack of recall
Cranfield -> Ideal test collection -> TREC
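
As a worked example of these measures, the sketch below computes precision, recall, and the balanced F-measure (the harmonic mean of P and R) for one query. The function name and the document identifiers are made up for illustration.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f


# Four documents retrieved, three relevant overall, two in common.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r, f)
```

With two hits, precision is 2/4, recall is 2/3, and F works out to 4/7. Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is the trade-off named on the slide.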

42
What is a relevant document?

Relevance is
The correspondence between a document and a
query, a measure of informativeness to the query
A degree of relation (overlap, relatedness, ...) between document and query
A measure of utility of the document to the user
Judged by user / system
User relevance / system relevance

43
How should relevance be judged?

Relevance is dependent on
Document contents
Information need (query)
Time constraint
Purpose of retrieval
Retrieval environment
Computer/connection speed
User interface
Domain of application (newspaper articles, law, patent, medicine, ...)
User's knowledge
About the domain of application
About the system

44
How is relevance judged? (TREC)

Candidate answers obtained by merging the answers from different participating systems
Several human assessors judge for the same query
Agreement/disagreement
Binary value (rel. / irrel.) / multi-valued
Workable strategy but potential problems
Some relevant documents may not be found by any system
Subjective judgments
Disagreement between assessors and participants (but participants usually respect the judgments of assessors)
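
The merging step and the assessor agreement check can be sketched as follows. The cut-off depth (taking only the top-ranked documents of each run) is an assumption added for the example, as are the function names and data; the slide itself only says that the systems' answers are merged.

```python
def build_pool(runs, depth):
    """runs: {system_name: ranked list of doc ids}. Returns the set of docs to judge."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])  # assumed cut-off: top `depth` per system
    return pool


def agreement(judgments_a, judgments_b):
    """Fraction of commonly judged documents on which two assessors agree."""
    common = judgments_a.keys() & judgments_b.keys()
    if not common:
        return None
    same = sum(judgments_a[d] == judgments_b[d] for d in common)
    return same / len(common)


runs = {"sysA": ["d1", "d2", "d3"], "sysB": ["d2", "d4", "d5"]}
pool = build_pool(runs, depth=2)

# Binary judgments (1 = relevant, 0 = irrelevant) from two hypothetical assessors.
assessor_1 = {"d1": 1, "d2": 0, "d4": 1}
assessor_2 = {"d1": 1, "d2": 1, "d4": 1}
print(sorted(pool), agreement(assessor_1, assessor_2))
```

Documents d3 and d5 never enter the pool here, which illustrates the first problem listed above: a relevant document that no system ranks highly is never judged.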

45
Practice vs. Experiments

Practice
Web
Electronic Publishing
Task-oriented IR
Data Mining
Knowledge Discovery
Distance learning
Video/film asset management
Experiments
TREC
HCI
Visualisation
Work in Context, Cognitive approaches
Cross-lingual
Cross-media
Corpus-based IR (inc. WordNet, etc.)
Digital Libraries
CBIR (Content-Based Image Retrieval)
TDT (Topic Detection and Tracking)

46
Research themes

Discrimination/Representation
Data fusion
Authority/importance models (e.g. PageRank)
Logic Uncertainty models
Filtering/Routing
Language models
Summarisation
IR DBMS (inc XML etc)
Clustering the web
Visualising the web
Living with single term queries
Living with no queries
Scale free networks
Trading media (text helps images!)
Temporal dimensions (topics,events)
Evaluation (Time to dump P and R?)
NLP in IR

47
Current situation
48
Where are we now in IR?

Landmarks
Hypotheses/Principles
Postulates of Impotence
Long-term challenges
Areas of research

49
Landmarks

Luhn's tf weighting
Architecture
Relevance Feedback
Stemming
Poisson Model -> BM25
Statistical weighting tfidf
Various models

50
Hypotheses/Principles

Items may be associated without apparent meaning
but exploiting their association may help
retrieval
P and R trade-off ABNO/OBNA
Exhaustivity/Specificity
Cluster Hypothesis
Association Hypothesis
Probability Ranking Principle
Logical Uncertainty Principle
ASK
Polyrepresentation

51
Postulates of Impotence (according to Swanson, 1988)

An information need cannot be expressed
independent of context
It is impossible to instruct a machine to
translate a request into adequate search terms
A document's relevance depends on other seen documents
It is never possible to verify whether all relevant documents have been found
Machines cannot recognize meaning -> can't beat human indexing etc.

52
...more postulates

Word-occurrence statistics can neither represent
meaning nor substitute for it
The ability of an IR system to support an
iterative process cannot be evaluated in terms of
single-iteration human relevance judgment
You can have either subtle relevance judgments or
highly effective mechanised procedures, but not
both
Thus, consistently effective fully automatic
indexing and retrieval is not possible

53
Long-term Challenges (workshop, UMass, 9/2002)

Global information access: Satisfy human information needs through natural, efficient interaction with an automated system that leverages world-wide structured and unstructured data in any language.
Contextual retrieval: Combine search technologies and knowledge about query and user context into a single framework in order to provide the most appropriate answer for a user's information need.

54
Areas of Research

How does the brain do it? (neuroscience)
How do we see to retrieve? (computer vision)
How do we reduce dimensionality in dynamic
fashion? (Statistics)
What is a good logic for IR? (mathematical logic)
What is a good theory of uncertainty?
(frequency/geometry)
How do we model context? (HCI)
How do we formally capture interaction?
How do we capture implicit/tacit information?
Is there a theory of information for IR?

55
Images not Text: how might that make a difference?

no visual keywords (yet)
tf/idf issue
aboutness revisable (eg Maron)
relevance revisable (eg Goffman)
feedback requires salience
aboutness -> relevance -> aboutness

56
Text v.s. image

Text             Image
Keywords         Visual features
Frequency        ?
Meaning          Object/form/color
Grammar          Geometry
Salience         Salience
Relevance        Path dependent
Query expansion  Example image

57
References

Van Rijsbergen's talks
http://ir.dcs.gla.ac.uk/oldseminars/Keith2.ppt
http://www-clips.imag.fr/mrim/essir03/PDF/4.Rijsbergen.pdf
S.E. Robertson, Computer Retrieval (http://www.soi.city.ac.uk/ser/papers/j_doc_history/npap.html)
