Title: Introduction to Information Retrieval (Recherche d'information)
1Introduction to Information Retrieval (Recherche d'information)
- Jian-Yun Nie
- (Based on van Rijsbergen's introduction)
2Plan
- Definition
- History
- Experimental tradition
- Methods
- Query
- Indexing
- Matching
- Results
- Evaluation
- Current situation
- Research
3Definition
4Important concepts
- Document: an entity that contains some description of information; may be in the form of text, image, graphic, video, speech, etc.
- Document collection: a set of documents (may be static or dynamic)
- User
- Information need: the user's requirement for information
- Query (request): a description of the information need, usually in natural language
- Relevance (relevant document): a document that contains the required information (pertinence)
- Correspondence, degree of relevance, relevance score, ...: the degree of belief (by the system) that a document is relevant
- Judge: user/system
- Indexing: a process that transforms a document into a form of internal representation
- Retrieval: an operation that determines the documents to be retrieved
- Response (answer): the documents returned by the system (usually a ranked list)
5What is IR?
- (Salton, 1968) Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
- (Lancaster, 1968) An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.
- (Needham, 1977) ...the complexity arises from the impossibility of describing the content of a document, or the intent of request, precisely, or unambiguously
6Data retrieval vs. IR (van Rijsbergen, 1979)
7History
8Important events in IR (1)
- 1952 Mooers coins IR
- 1958 International Conference on Scientific Information
- 1960 Cranfield I
- 1960 Maron and Kuhns paper
- 1961 (-1965) Smart built
- 1964 Washington conference on Association Methods
- 1966 Cranfield II
- 1968 Salton's first book
- 197- Cranfield conferences
- 1975 CvR's book
- 1975 Ideal test collection
- 1976 KSJ/SER JASIS paper
9Important events in IR (2)
- 1978 1st SIGIR
- 1979 1st BCSIRSG
- 1980 1st joint ACM/BCS conference on IR
- 1981 KSJ book on IR Experiments
- 1982 Belkin et al ASK hypothesis
- 1983 - Okapi started
- 1985 RIAO-1
- 1986 CvR logic model
- 1990 Deerwester et al., LSI paper
- 1991 INQUERY started
- 1992 TREC-1
- 1998 Ponte and Croft paper on language models
10Best known researchers (Salton award)
13Relevant journals and conferences
- Journals
- ACM Transactions on Information Systems (TOIS)
- Information Processing and Management (IPM)
- Journal of the American Society for Information Science and Technology (JASIST)
- Information Retrieval
- Conferences
- ACM SIGIR
- CIKM
- TREC
- ECIR
- ACL
- RIAO
- CORIA
14Experimental tradition
15Tradition of Experiments
- Strong experimental tradition (from Cranfield)
- To prove that an IR technique or IR system is better, its effectiveness should be measured on test data
- This tradition has a strong influence on other areas (computational linguistics, AI, machine translation, etc.)
- Pros
- Develop practically effective approaches
- Experimental evidence to prove a technique
- Avoid nice but useless theories
- Cons
- Neglect theoretical development
- Difficult to develop new theories and new techniques to compete against established methods
- Wide use of heuristics, intuitions, manual tuning, or tricks
16Experimental Methodology
- Cleverdon: Cranfield
- Developed the Cranfield experiments (funded by the National Science Foundation) and introduced the concepts of recall and precision to study the performance of information retrieval systems.
- Lancaster: MEDLARS
- Report on the evaluation of its operating efficiency. American Documentation, 20(2), 119-142, April 1969. Lancaster refined a technique of failure analysis for this evaluation, seeking to investigate the reasons why relevant documents were not retrieved.
- Saracevic: CWRU
- Theories and experiments related to human information behavior; human-computer interaction from the human viewpoint, and modeling interaction processes in information retrieval. Notion of relevance in relation to information and information systems. Theoretical and pragmatic study of the value of information and library services. Nature of information science as a field.
- Salton: SMART
- "Salton's Magical Automatic Retriever of Text"
- Vector space model, relevance feedback, tf-idf, ...
- Sparck Jones: Ideal Test Collection
- Big document collection, large set of various queries, exhaustive relevance judgments
- Blair and Maron: STAIRS
- Law documents, result analysis
- Harman: TREC
- Annual experimental contest
- Large document collections, more realistic queries, partial relevance judgments
- Not an ideal test collection, but more realistic
17Some References on the Web
- Cyril W. Cleverdon, The significance of the Cranfield tests on index languages, ACM-SIGIR, 1991, pp. 3-12 (http://portal.acm.org/citation.cfm?id=122861)
- David C. Blair, M. E. Maron, An evaluation of retrieval effectiveness for a full-text document-retrieval system, Communications of the ACM, v.28 n.3, p.289-299, March 1985 (http://portal.acm.org/citation.cfm?id=3197&dl=GUIDE&coll=GUIDE&CFID=65359706&CFTOKEN=94782922)
- G. Salton, M. E. Lesk, Computer Evaluation of Indexing and Text Processing, Journal of the ACM (JACM), v.15 n.1, p.8-36, Jan. 1968 (http://portal.acm.org/citation.cfm?id=321441&dl=GUIDE&coll=GUIDE&CFID=65359986&CFTOKEN=40286)
- TREC: http://trec.nist.gov
18Evaluation
- Diagram (not reproduced): Query, Document collection, Answers, Desired answers, evaluation
19Test collection
- Document collection: a large set of documents
- Query set: a set of queries (usually 50 or more)
- Relevance judgments: for each query, the relevant documents in the document collection are determined manually
- In TREC, the judgments are not known prior to the experiments
20Methods
21Indexing-based IR
- Document -> indexing -> Representation (keywords)
- Query -> indexing (query analysis) -> Representation (keywords)
- Query evaluation compares the two representations
22Query
23Query Language
- Artificial/Natural (web)
- multilingual/cross-lingual
- images
- none at all!
24Query Definition
- Complete/Incomplete
- Independence/Dependence
- Weighted/Unweighted (tf-idf)
- Query expansion/one shot (feedback, web)
- Sense disambiguation
- Cross-lingual
25Indexing
26Maron's theory of indexing
- ...in the case where the query consists of single term, call it B, the probability that a given document will be judged relevant by a patron submitting B is simply the ratio of the number of patrons who submit B as their query and judge that document as relevant, to the number of patrons, who submit B as their search query
- $P(D \mid B) = P(D, B) / P(B)$
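Maron's ratio can be illustrated with a small sketch. It assumes a hypothetical log of search sessions, each recorded as a pair (query term, set of documents the patron judged relevant); the log format and the function name `maron_estimate` are illustrative assumptions, not part of the original slides.

```python
def maron_estimate(sessions, term, doc_id):
    """Estimate P(D | B) as a ratio of patron counts.

    sessions: list of (query_term, set_of_doc_ids_judged_relevant),
              one entry per patron (hypothetical log format).
    Returns None when no patron submitted `term`.
    """
    # Relevance judgments of the patrons who submitted `term` as their query
    judgments = [relevant for (t, relevant) in sessions if t == term]
    if not judgments:
        return None
    # Patrons who submitted `term` AND judged `doc_id` relevant
    both = sum(1 for relevant in judgments if doc_id in relevant)
    return both / len(judgments)


sessions = [
    ("retrieval", {"d1", "d2"}),
    ("retrieval", {"d1"}),
    ("retrieval", set()),
    ("indexing", {"d3"}),
]
print(maron_estimate(sessions, "retrieval", "d1"))
```

Here two of the three patrons who submitted "retrieval" judged d1 relevant, so the estimate for d1 is 2/3.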
27Representation of Information
- Discrimination without Representation (specificity)
- Representation with Discrimination (exhaustivity)
- TF-IDF
- TF: importance of term for a document
- IDF: importance of document for term (specificity)
- ...defining a concept of information, ... that once this notion is properly explicated a document can be represented by the information it contains (CvR, 1979)
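As a rough illustration of TF-IDF weighting, the sketch below uses one common variant: raw term frequency multiplied by log(N / df). The slides do not specify a formula, so this choice, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict mapping doc_id -> list of tokens.

    Returns dict doc_id -> {term: weight}, with
    weight = tf(term, doc) * log(N / df(term)).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # raw term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


docs = {"d1": ["cat", "cat", "dog"], "d2": ["dog", "bird"]}
print(tf_idf(docs))
```

A term that occurs in every document ("dog" here) gets weight 0 under this variant, which reflects the specificity idea: it does not help discriminate between documents.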
28Matching (query evaluation)
29Matching (query evaluation)
- exact/partial match, e.g. SQL/Dice
- Boolean matching (Fairthorne, 50)
- co-ordination level matching (Cleverdon, 60)
- cosine correlation (Salton, 70): VS
- probabilistic (ranking principle) (SER, 80): PRP
- logical uncertainty principle (CvR, 90): LUP
- Bayesian inference (Croft, 90): NET
- Language modeling (Ponte and Croft, 98)
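Three of the partial-match functions listed above can be sketched briefly. This is a minimal illustration, assuming queries and documents are given either as term collections (co-ordination level, Dice) or as term-to-weight dictionaries (cosine); the function names are illustrative.

```python
import math


def coordination_level(query_terms, doc_terms):
    """Number of distinct terms shared by query and document."""
    return len(set(query_terms) & set(doc_terms))


def dice(query_terms, doc_terms):
    """Dice coefficient: 2|Q & D| / (|Q| + |D|) on term sets."""
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))


def cosine(q_vec, d_vec):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    norm_q = math.sqrt(sum(w * w for w in q_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in d_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


query = ["a", "b", "c"]
doc = ["b", "c", "d"]
print(coordination_level(query, doc), dice(query, doc))
print(cosine({"a": 1.0, "b": 1.0}, {"b": 2.0, "c": 1.0}))
```

Unlike exact (Boolean or SQL-style) matching, each of these returns a score, so documents can be ranked rather than simply accepted or rejected.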
30Inference
- Deduction/Induction: from A and A -> B, infer B
- Cluster Hypothesis
- Association Hypothesis
- P(term1 | term2)
31Logic
- It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the algebra of Boole (1847) is the appropriate formalism for retrieval design... The logic of Brouwer, as invoked by Fairthorne, is one such weakening of the postulate system... (Mooers, 1961)
- Another one
- Logical Uncertainty Principle (CvR, 1986)
32Logic
- If Mark were to lose his job, he would work less
- If Mark were to work less, he would be less tense
- If Mark were to lose his job, he would be less tense
33Cluster Hypothesis
- If document X is closely associated with Y, then over the population of potential queries the probability of relevance for X will be approximately the same as the probability of relevance for Y, or in symbols:
- $P(\text{relevance} \mid X) \approx P(\text{relevance} \mid Y)$
- Document clustering
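One informal way to probe the cluster hypothesis on a test collection is to check whether relevant documents are, on average, more similar to each other than to non-relevant ones. The sketch below does this with the Dice coefficient on term sets; the similarity measure, the function names, and the toy data are assumptions chosen for illustration.

```python
from itertools import combinations, product


def dice(a, b):
    """Dice coefficient between two term sets."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster_hypothesis_gap(docs, relevant):
    """Mean relevant-relevant similarity minus mean relevant-nonrelevant similarity.

    docs: dict doc_id -> set of terms; relevant: set of relevant doc_ids.
    A clearly positive gap is consistent with the cluster hypothesis.
    Returns None if either group of pairs is empty.
    """
    rel = [d for d in docs if d in relevant]
    non = [d for d in docs if d not in relevant]
    rr = [dice(docs[a], docs[b]) for a, b in combinations(rel, 2)]
    rn = [dice(docs[a], docs[b]) for a, b in product(rel, non)]
    if not rr or not rn:
        return None
    return sum(rr) / len(rr) - sum(rn) / len(rn)


docs = {
    "d1": {"ir", "index"},
    "d2": {"ir", "index", "query"},
    "d3": {"cooking", "pasta"},
}
print(cluster_hypothesis_gap(docs, relevant={"d1", "d2"}))
```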
34Association Hypothesis
- If one index term X is good at discriminating relevant from non-relevant documents, then any closely associated index term Y is also likely to be good at this.
- $P(\text{relevance} \mid X) \approx P(\text{relevance} \mid Y)$
- Query expansion
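The association hypothesis motivates query expansion: add terms that are closely associated with the query terms. The sketch below measures association by the Dice coefficient between the sets of documents in which two terms occur; that measure, the parameter `k`, and the function names are illustrative assumptions rather than a method prescribed by the slides.

```python
from collections import defaultdict


def term_postings(docs):
    """Map each term to the set of doc_ids that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in set(tokens):
            postings[term].add(doc_id)
    return postings


def expand_query(query_terms, docs, k=1):
    """Add, for each query term, its k most strongly co-occurring terms."""
    postings = term_postings(docs)
    expanded = list(query_terms)
    for q in query_terms:
        q_docs = postings.get(q, set())
        scored = []
        for term, t_docs in postings.items():
            if term in expanded:
                continue  # skip terms already in the query
            overlap = len(q_docs & t_docs)
            if overlap:
                score = 2 * overlap / (len(q_docs) + len(t_docs))
                scored.append((score, term))
        # Highest association first; ties broken alphabetically
        scored.sort(key=lambda s: (-s[0], s[1]))
        expanded.extend(term for _, term in scored[:k])
    return expanded


docs = {
    "d1": ["car", "automobile", "engine"],
    "d2": ["car", "automobile"],
    "d3": ["car", "road"],
    "d4": ["banana"],
}
print(expand_query(["car"], docs, k=1))
```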
35Models
- Boolean
- Vector Space (metrics) - mixture of things
- Probabilistic (3 models)
- Logical (implication) - what kind of logic
- Language models
- Cognitive (users)
36Retrieval Result
37Items Wanted
- Matching/Relevant or Correct/Useful
- The function of a document retrieval system cannot be to retrieve all and only the relevant documents... but to guide the patron in his search for information (Maron)
- Topical/tasks
- Meaning/content
38Some difficulties with relevance
- Goffman, 1969
- ...that the relevance of the information from one document depends upon what is already known about the subject, and in turn affects the relevance of other documents subsequently examined.
- Maron
- Just because a document is about the subject sought by a patron, that fact does not imply that he would judge it relevant.
39Relevance (Borlund, 2000)
- That is, the relevance or irrelevance of a given retrieved document may affect the user's current state of knowledge, resulting in a change of the user's information need, which may lead to a change of the user's perception/interpretation of the subsequent retrieved documents.
40Evaluation
41Error Response
- Precision error: an irrelevant document is retrieved
- $P = |\text{relevant} \cap \text{retrieved}| / |\text{retrieved}|$
- Recall error: a relevant document is not retrieved
- $R = |\text{relevant} \cap \text{retrieved}| / |\text{relevant}|$
- Trade-off
- F-measure: $F = 2PR / (P + R)$
- How to cope with lack of recall
- Cranfield -> Ideal test collection -> TREC
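The three measures can be computed directly from a retrieved list and a set of relevance judgments. This is a minimal sketch with binary relevance; the function name and the example document ids are illustrative.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F) for one query with binary judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


p, r, f = precision_recall_f(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant={"d1", "d3", "d5"},
)
print(p, r, f)
```

In this example two of the four retrieved documents are relevant (P = 0.5) and two of the three relevant documents are retrieved (R = 2/3), giving F = 4/7.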
42What is a relevant document?
- Relevance is
- The correspondence between a document and a query, a measure of informativeness to the query
- A degree of relation (overlap, relatedness, ...) between document and query
- A measure of utility of the document to the user
- Judged by user / system
- User relevance / system relevance
43How should relevance be judged?
- Relevance is dependent on
- Document contents
- Information need (query)
- Time constraint
- Purpose of retrieval
- Retrieval environment
- Computer/connection speed
- User interface
- Domain of application (newspaper articles, law, patent, medicine, ...)
- User's knowledge
- About the domain of application
- About the system
44How is relevance judged? (TREC)
- Candidate answers are obtained by merging (pooling) the answers from different participating systems
- Several human assessors judge for the same query
- Agreement/disagreement
- Binary value (rel. / irrel.) / multi-valued
- Workable strategy, but with potential problems
- Some relevant documents may not be found by any system
- Subjective judgments
- Disagreement between assessors and participants (but participants usually respect the judgments of assessors)
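The merging step can be sketched as follows: the top-ranked documents from each participating system are combined into one candidate set, which the assessors then judge. The pool depth, run names, and function name here are illustrative assumptions.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from every system's ranking.

    runs: dict mapping system name -> ranked list of doc_ids.
    Documents outside the pool are never shown to the assessors.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


runs = {
    "sysA": ["d1", "d2", "d3"],
    "sysB": ["d2", "d4", "d5"],
}
print(sorted(build_pool(runs, depth=2)))
```

With depth 2 the pool is {d1, d2, d4}; d3 and d5 go unjudged, which illustrates how a relevant document can be missed if no system ranks it highly enough.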
45Practice vs. Experiments
- Practice
- Web
- Electronic Publishing
- Task-oriented IR
- Data Mining
- Knowledge Discovery
- Distance learning
- Video/film asset management
- Experiments: TREC
- HCI
- Visualisation
- Work in Context, Cognitive approaches
- Cross-lingual
- Cross-media
- Corpus-based IR (inc. WordNet, etc.)
- Digital Libraries
- CBIR (Content-Based Image Retrieval)
- TDT (Topic Detection and Tracking)
46Research themes
- Discrimination/Representation
- Data fusion
- Authority/importance models (e.g. PageRank)
- Logic Uncertainty models
- Filtering/Routing
- Language models
- Summarisation
- IR and DBMS (inc. XML, etc.)
- Clustering the web
- Visualising the web
- Living with single term queries
- Living with no queries
- Scale free networks
- Trading media (text helps images!)
- Temporal dimensions (topics,events)
- Evaluation (Time to dump P and R?)
- NLP in IR
47Current situation
48Where are we now in IR?
- Landmarks
- Hypotheses/Principles
- Postulates of Impotence
- Long-term challenges
- Areas of research
49Landmarks
- Luhn's tf weighting
- Architecture
- Relevance Feedback
- Stemming
- Poisson model -> BM25
- Statistical weighting: tf-idf
- Various models
50Hypotheses/Principles
- Items may be associated without apparent meaning, but exploiting their association may help retrieval
- P/R trade-off: ABNO/OBNA
- Exhaustivity/Specificity
- Cluster Hypothesis
- Association Hypothesis
- Probability Ranking Principle
- Logical Uncertainty Principle
- ASK
- Polyrepresentation
51Postulates of Impotence (according to Swanson, 1988)
- An information need cannot be expressed independent of context
- It is impossible to instruct a machine to translate a request into adequate search terms
- A document's relevance depends on other seen documents
- It is never possible to verify whether all relevant documents have been found
- Machines cannot recognize meaning -> can't beat human indexing, etc.
52...more postulates
- Word-occurrence statistics can neither represent meaning nor substitute for it
- The ability of an IR system to support an iterative process cannot be evaluated in terms of single-iteration human relevance judgment
- You can have either subtle relevance judgments or highly effective mechanised procedures, but not both
- Thus, consistently effective fully automatic indexing and retrieval is not possible
53Long-term challenges (workshop, UMass, 9/2002)
- Global information access: Satisfy human information needs through natural, efficient interaction with an automated system that leverages world-wide structured and unstructured data in any language.
- Contextual retrieval: Combine search technologies and knowledge about query and user context into a single framework in order to provide the most appropriate answer for a user's information need.
54Areas of Research
- How does the brain do it? (neuroscience)
- How do we see to retrieve? (computer vision)
- How do we reduce dimensionality in dynamic fashion? (statistics)
- What is a good logic for IR? (mathematical logic)
- What is a good theory of uncertainty? (frequency/geometry)
- How do we model context? (HCI)
- How do we formally capture interaction?
- How do we capture implicit/tacit information?
- Is there a theory of information for IR?
55Images, not text: how might that make a difference?
- no visual keywords (yet)
- tf/idf issue
- aboutness revisable (e.g. Maron)
- relevance revisable (e.g. Goffman)
- feedback requires salience
- aboutness -> relevance -> aboutness
56Text vs. image
- Text | Image
- Keywords | Visual features
- Frequency | ?
- Meaning | Object/form/color
- Grammar | Geometry
- Salience | Salience
- Relevance | Path dependent
- Query expansion | Example image
57References
- Van Rijsbergen's talks
- http://ir.dcs.gla.ac.uk/oldseminars/Keith2.ppt
- http://www-clips.imag.fr/mrim/essir03/PDF/4.Rijsbergen.pdf
- S.E. Robertson, Computer Retrieval (http://www.soi.city.ac.uk/ser/papers/j_doc_history/npap.html)